
Working with geography in survey data

Brad Jones, May 30

Survey researchers frequently explore differences in public opinion by demographic group: how men’s views compare with those of women, for example, or how younger people compare with older people.

Often, it’s also possible to look at differences by survey respondents’ geographic location.

Yet, when geographic information is available in survey data, it’s not always at the level researchers are looking for.

At Pew Research Center, for instance, we typically ask our phone survey respondents for their ZIP code, primarily so we can accurately identify their state and Census region for the purposes of weighting our surveys.

On their own, ZIP codes are of little value to researchers interested in doing geographic analyses, in part because there are over 30,000 ZIP codes in the United States.

However, researchers can use ZIP codes to determine location at a more granular level and do geographic analyses that use spatial relationships.

Using ZIP codes, it’s possible to locate respondents in a specific place with some degree of accuracy — specifically, the latitude and longitude at the centroid (or geographical center) of the Census’s ZIP code tabulation area.

In this post, I’ll walk through an example from early 2018 that used ZIP codes to conduct geographic analysis.

The analysis in question was based on a survey question we asked about public support for offshore drilling.

It seemed reasonable to explore whether respondents’ proximity to a coastline was associated with their attitudes toward offshore drilling.

Geolocating respondents

The first step in this kind of analysis is geolocating survey respondents.

In this data, the best we can do is locate people within their ZIP codes.

There are a couple of things to keep in mind when working with ZIP codes.

First, we have to determine how to translate the ZIP codes into geographic coordinates.

In this analysis, we’ll assign respondents to the centroids of their ZIP code tabulation areas.
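Assigning centroids amounts to a lookup keyed on ZIP code, and that lookup only works if both keys are character strings of the same form. The sketch below uses made-up toy values (the data frames `survey` and `centroids`, their column names, and the coordinates are hypothetical, not the Pew or GeoCorr files) to show one way to restore leading zeros with `sprintf()` before merging.

```r
# Toy survey data: ZIP codes stored as numbers, so leading zeros are lost
# (hypothetical values for illustration only)
survey <- data.frame(id = 1:3, zip_raw = c(2138, 90210, 501))

# Toy centroid lookup keyed on five-character ZIP strings
# (approximate, made-up coordinates)
centroids <- data.frame(
  zcta5 = c("02138", "90210"),
  lat   = c(42.38, 34.10),
  lon   = c(-71.13, -118.41),
  stringsAsFactors = FALSE
)

# Restore leading zeros: format as a zero-padded, five-digit string
survey$zipcode <- sprintf("%05d", as.integer(survey$zip_raw))

# Left join: keep every respondent, even those with no matching centroid
merged <- merge(survey, centroids,
                by.x = "zipcode", by.y = "zcta5", all.x = TRUE)

# Respondents whose ZIP code did not match get NA coordinates
merged[, c("id", "zipcode", "lat", "lon")]
```

If the ZIP variable is stored as a factor, convert it with `as.character()` first, since `as.integer()` on a factor returns the internal level codes rather than the ZIP digits.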

(One great resource for georeferenced Census units is GeoCorr, and you can download the ZIP code data here.)

Next, we’ll read in the data (you can find the survey data here) and merge it together, as shown below.

(A side note on working with ZIP codes: Always check the format.

ZIP code is stored as a factor variable in the survey data but as a character in the GeoCorr data.

To avoid problems merging the two together, make sure they are in the same format.

Additionally, with ZIP codes in particular, always be mindful of leading zeros.)

    library(foreign)

    ### Read in the datasets
    dat <- read.spss("small_jan_data.sav", to.data.frame = TRUE)

    ## ZIP code data from GeoCorr for all ZIP code centroids
    zip <- read.csv("geocorr2018.csv", as.is = TRUE)
    head(zip)

    # remove the first row
    zip <- zip[-1, ]
    zip$lon <- as.numeric(zip$intptlon)
    zip$lat <- as.numeric(zip$intptlat)

    ## format ZIP code as a string variable
    dat$zipcode <- as.character(dat$finalzip)

    ### merge data
    dat <- merge(dat, zip[, c("zcta5", "lat", "lon")],
                 by.x = "zipcode", by.y = "zcta5", all.x = TRUE)

    ### Plot the ZIP code coordinates
    with(dat, plot(lon, lat, pch = 20))

Figure: Plot of ZIP code centroids

Calculating distance from coast

The next step is to bring in the map of coastlines and find the distance between respondents’ locations and the coastline.

    ## bring in coastline shapefile
    library(sp)
    library(rgeos)
    library(rgdal)

    ## available from here: http://www.soest.hawaii.edu/pwessel/gshhg/
    map <- readOGR("coastline map/GSHHS_l_L1.shp")

    #### extract the coordinates from the map object
    polys <- map@polygons

    ## container for coordinates
    all.coords <- NULL
    for (j in seq_along(polys)) { ## loop through polygons
      ## find coordinate slots
      if (is.element("coords", slotNames(polys[[j]]))) {
        coords <- polys[[j]]@coords
        all.coords <- rbind(all.coords, coords)
      }
      ## get polygon slots for more complicated geographies
      if (is.element("Polygons", slotNames(polys[[j]]))) {
        p <- polys[[j]]@Polygons
        ## extract coordinates from these more complex objects
        coords <- NULL
        for (k in seq_along(p)) {
          coords <- rbind(coords, p[[k]]@coords)
        }
        all.coords <- rbind(all.coords, coords)
      }
    }
    plot(all.coords, pch = '.')

Figure: Plot of the points that make up the coastline shapefile

    ### select N. America coastline (longitude between -180 and -45, latitude above 20)
    x <- which(all.coords[, 1] > -180 & all.coords[, 1] < -45)
    y <- which(all.coords[, 2] > 20)
    wh <- intersect(x, y)
    plot(all.coords[wh, ], pch = '.')

    ### cut out NE Canadian coastline (longitude above -120 and latitude above 55)
    x <- which(all.coords[, 1] > -120)
    y <- which(all.coords[, 2] > 55)
    wh2 <- intersect(x, y)
    wh <- setdiff(wh, wh2)
    us <- all.coords[wh, ]
    plot(us, pch = '.')
    points(zip$lon, zip$lat, pch = '.', col = 'red')

Calculating distance is a little more involved than simply using Euclidean distance.

Given that we are dealing with a coordinate system on a globe, we’ll use a great circle distance approximation:

    #### Functions for calculating great circle distances in miles
    #### slightly modified from this post: https://www.r-bloggers.com/great-circle-distance-calculations-in-r/

    #### convert degrees to radians for distance calculation
    deg2rad <- function(deg) return(deg * pi / 180)

    # Calculates the great circle distance between one point and a vector of points,
    # all given as longitude/latitude in decimal degrees, using the
    # Spherical Law of Cosines (slc)
    gcd.slc <- function(long1, lat1, long2, lat2) {
      ## convert coordinates to radians
      lat1 <- deg2rad(lat1)
      long1 <- deg2rad(long1)
      lat2 <- deg2rad(lat2)
      long2 <- deg2rad(long2)
      R <- 3959 # Earth mean radius [miles]
      ## container for distances
      d_vec <- rep(NA, length(long2))
      for (i in seq_along(long2)) {
        ## find distances from each point
        d_vec[i] <- acos(sin(lat1) * sin(lat2[i]) +
                         cos(lat1) * cos(lat2[i]) * cos(abs(long2[i] - long1))) * R
      }
      return(d_vec)
    }

    ## column for distance to coast
    dat$dist_to_coast <- NA
    for (j in 1:nrow(dat)) {
      dist <- gcd.slc(dat$lon[j], dat$lat[j], us[, 1], us[, 2])
      dat$dist_to_coast[j] <- min(dist)
    }

    ## create a 3-way distance variable (99 marks respondents with no distance)
    dat$dist3 <- 99
    w <- which(!is.na(dat$dist_to_coast))
    dat$dist3[w] <- 3
    w <- which(dat$dist_to_coast < 300)
    dat$dist3[w] <- 2
    w <- which(dat$dist_to_coast < 25)
    dat$dist3[w] <- 1
    dat$dist3 <- factor(dat$dist3,
                        labels = c("Less than 25 miles", "25–300 miles",
                                   "More than 300 miles", "Missing"))

Close reading of this code shows that we’re actually calculating distance to the nearest points that make up the coastline.

It would be better to calculate the distance to the nearest point on the line segments that make up the coastline, but this would be a much more involved calculation.

However, the large number of points that make up the coastline means the error is pretty minimal.
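To make the vertex-versus-segment distinction concrete, here is a small sketch in flat (planar) coordinates with toy numbers. It is not the spherical calculation used in the analysis, and the function name `dist_to_segment` is a hypothetical helper. It only illustrates why measuring to the nearest vertex can overstate the distance to the line itself.

```r
# Distance from point p to the segment a-b in a flat (planar) coordinate system
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  len2 <- sum(ab^2)
  # Position of the projection of p along the segment, clamped to [0, 1];
  # a zero-length segment falls back to its single endpoint
  t <- if (len2 == 0) 0 else max(0, min(1, sum((p - a) * ab) / len2))
  nearest <- a + t * ab
  sqrt(sum((p - nearest)^2))
}

# Toy "coastline" segment and a point sitting above its midpoint
a <- c(0, 0)
b <- c(10, 0)
p <- c(5, 3)

# Distance to the nearest endpoint versus distance to the segment itself
vertex_d  <- min(sqrt(sum((p - a)^2)), sqrt(sum((p - b)^2)))
segment_d <- dist_to_segment(p, a, b)
c(nearest_vertex = vertex_d, nearest_on_segment = segment_d)
```

In this toy case the nearest vertex is about 5.8 units away while the segment is exactly 3 units away. The gap shrinks as the vertices get closer together, which is why a densely sampled coastline keeps the error small.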

A larger source of error is the fact that respondents are located at the centroids of their ZIP codes rather than their actual addresses.
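The loop over coastline points in the distance function can also be written with R’s vectorized arithmetic. The sketch below is not the code used in the analysis: it swaps in the haversine formula, which gives the same great circle distance on a sphere as the spherical law of cosines but tends to behave better numerically for points that are very close together. The function name `gcd_hav` and the toy coordinates are assumptions for illustration.

```r
# Great circle distance in miles from one point to a vector of points,
# using the haversine formula. All inputs are in decimal degrees.
gcd_hav <- function(long1, lat1, long2, lat2, R = 3959) {
  to_rad <- pi / 180
  lat1  <- lat1 * to_rad
  long1 <- long1 * to_rad
  lat2  <- lat2 * to_rad
  long2 <- long2 * to_rad
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * R * asin(pmin(1, sqrt(a)))
}

# Toy "coastline": three points along the equator (hypothetical values)
coast <- cbind(lon = c(-80, -79, -70), lat = c(0, 0, 0))

# Distance from a point at (-78, 0) to each coastline point, then the minimum
d <- gcd_hav(-78, 0, coast[, "lon"], coast[, "lat"])
nearest <- min(d)
```

Applied to the data above, the call would look like `gcd_hav(dat$lon[j], dat$lat[j], us[, 1], us[, 2])` inside the same loop over respondents.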

Examining the results

Given the distance measure, we can now examine attitudes toward offshore drilling by respondents’ proximity to the coast:

    library(survey)
    design <- svydesign(id = ~1, weights = ~weight, data = dat)
    svyby(~q90, ~dist3, design = design,
          FUN = svymean, keep.names = FALSE, na.rm = TRUE)

                    dist3  q90Favor q90Oppose
    1  Less than 25 miles 0.3457668 0.5562020
    2        25–300 miles 0.4507683 0.4783657
    3 More than 300 miles 0.4603733 0.4967827
    4             Missing 0.3720786 0.4979092

Indeed, there appears to be a significant difference in attitudes toward offshore drilling between those who live nearer to a coastline and those who live farther away.

People who live within 25 miles of the coast were about 10 percentage points less likely to say they favor offshore drilling than those who live more than 25 miles from the coast.
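The size of that gap can be read directly off the weighted estimates. As a quick arithmetic check, the snippet below re-enters the reported “Favor” shares by hand (the vector name `favor` is just for illustration) and computes each group’s difference from the nearest-to-coast group in percentage points.

```r
# Weighted share favoring offshore drilling, copied from the svyby() output
favor <- c(
  "Less than 25 miles"  = 0.3457668,
  "25-300 miles"        = 0.4507683,
  "More than 300 miles" = 0.4603733
)

# Gap relative to respondents within 25 miles of the coast, in percentage points
gap_pts <- (favor[-1] - favor["Less than 25 miles"]) * 100
round(gap_pts, 1)
```

These are point estimates only. The snippet says nothing about sampling error, which the survey-weighted functions handle.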

However, Democrats and Democratic-leaning independents live nearer to the coast, on average, than Republicans and Republican leaners do:

    svyby(~dist_to_coast, ~partysum, design = design,
          FUN = svymean, keep.names = FALSE, na.rm = TRUE)

            partysum dist_to_coast       se
    1   Rep/lean Rep      256.6374 12.05583
    2   Dem/lean Dem      205.2576 10.72659
    3 DK/Ref-no lean      216.7199 28.74201

In a multivariate framework, there seems to be no relationship between attitudes toward offshore drilling and proximity to the coast after controlling for partisanship:

    summary(svyglm(q90 == "Favor" ~ dist3 + partysum,
                   design = design, family = 'quasibinomial'))

    Call:
    svyglm(formula = q90 == "Favor" ~ dist3 + partysum, design = design, family = "quasibinomial")

    Survey design:
    svydesign(id = ~1, weights = ~weight, data = dat)

    Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
    (Intercept)                0.6551     0.1519   4.314 1.71e-05 ***
    dist325-300 miles          0.2855     0.1719   1.661    0.097 .
    dist3More than 300 miles   0.2284     0.1823   1.253    0.210
    dist3Missing               0.1270     0.3351   0.379    0.705
    partysumDem/lean Dem      -2.0681     0.1494 -13.844  < 2e-16 ***
    partysumDK/Ref-no lean    -1.4222     0.2350  -6.051 1.81e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for quasibinomial family taken to be 0.999249)

    Number of Fisher Scoring iterations: 4

Notes

As mentioned above, there is some error with using the centroid of a ZIP code instead of an actual address.

If more granular data (e.g., exact addresses) are available, the analysis would instead involve translating those addresses into exact coordinates rather than relying on the ZIP code centroid.

There are many ways to translate street addresses into geographic coordinates. For example, the ggmap package in R interfaces with the Google Maps API to extract latitude and longitude coordinates for address data.

If you’re interested in trying your own geographic analysis of survey data, it’s possible to get ZIP code data for a subset of Pew Research Center general public surveys conducted by phone, once those datasets have already been publicly released.

Because respondent privacy is of utmost concern, we do request additional information from researchers and require them to subscribe to additional data usage agreements.

If you have a request along these lines, please send an email to info@pewresearch.org and include the specific survey name and dates (or link to the survey).

Please note that detailed geographic variables (like ZIP code) are not available for surveys of rare populations or surveys conducted through the Center’s American Trends Panel.

(Due to the longitudinal nature of the panel, there is a great deal of information available about panelists, so we need to take extra safeguards to protect their confidentiality.)

Bradley Jones is a research associate focusing on U.S. politics and policy at Pew Research Center.
