US Police Killings: What the data tells us

Exploratory Data Analysis on Police Killings from 2015-2016

Protest in St. Louis on police violence. Image taken from The Independent.

Introduction

In this article, we will analyze one of America’s hottest political topics, which encompasses issues ranging from institutional racism to the role of law enforcement personnel in society. But first, I have a favor to ask. For the next 10 minutes, let’s leave our preconceived notions of what’s true at the door. Prior domain knowledge is vital for making inferences from data. But if we build our statistical models based on preexisting beliefs, we are less likely to get to the right answers and more likely to ask the wrong questions. That was my spiel on the Philosophy of Statistics. Let’s get started.

Background and goals

The ever-growing argument, pushed by American liberals and libertarians and opposed heavily by a staunch conservative base, is that the US has a flawed law enforcement system that costs too many innocent civilians their lives. US cops kill around 1,000 people a year. If we contrast this number with other developed countries like Finland, where the cops fired only 6 shots in 2013, we get a grotesque picture. But if we look at other countries with similar levels of violent crimes and homicides, the picture gets fuzzier. In this project, we will limit our scope to analyzing police killings inside the US, and try to come up with useful insights based on data from police encounters that led to killings.
Without further ado, let’s dig into the data.

Data-sets used

We are using 5 data-sets for this project:
1) a data-set on police killings from 2016
2) an identically framed data-set from 2015
3) a data-set for January-June 2015 that has fewer data-points than 2) but more informational features on the incidents
4) a data-set on US state populations and incidences of violent crime in 2015
5) a data-set with statistics from 1000+ US cities

The first 2 were compiled via “The Counted”, a project by The Guardian, a British newspaper. The 3rd was compiled by Nate Silver’s FiveThirtyEight. I gathered all three from the website Kaggle. I customized the 4th data-set on state populations myself, trimming down data compiled by the FBI. The fifth data-set on city statistics was obtained from [source missing]. You can find all of these files on my GitHub repository.

Pre-processing: Data-cleaning & Feature Engineering

This project will be completed in R. Data visualization, one of R’s main strengths, comes in very handy in such analyses. My programming scripts and plots are publicly available on GitHub.

Merging data sets: The first 2 data-sets we use are from the same source, so merging them together requires just one line of code. The third data-set, while similar, is from a different source, so there are discrepancies we need to address. Even though it has all the features of the first 2, some of them are named slightly differently. It also has additional features relating to demographic details of the tracts/areas in which the killings occurred. Instead of trying to merge all three data-sets together, we are going to use the larger merged data-set from the first 2 data-sets for most of our visualizations, and work with the 3rd data-set exclusively when looking at demographics data.
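The merging step described above can be sketched in a few lines. The scripts for this project are in R and are not reproduced in the text, so the following is only a minimal Python illustration using the standard library. The helper names (`stack`, `rename_columns`), the mismatched column name `race_ethnicity`, and the sample rows are all hypothetical stand-ins, not values from the actual files.

```python
# Sketch only: helper names, the "race_ethnicity" column and all rows are hypothetical.

def stack(*tables):
    """Row-bind tables (lists of dicts) that share exactly the same columns."""
    columns = set(tables[0][0])
    merged = []
    for table in tables:
        for row in table:
            if set(row) != columns:
                raise ValueError("tables do not share the same columns")
            merged.append(dict(row))
    return merged


def rename_columns(table, mapping):
    """Return a copy of the table with columns renamed according to mapping."""
    return [{mapping.get(key, key): value for key, value in row.items()}
            for row in table]


# Two tables from the same source: stacking them is a single call.
killings_2015 = [{"age": "34", "state": "MO", "raceethnicity": "Black"}]
killings_2016 = [{"age": "51", "state": "CA", "raceethnicity": "White"}]
counted = stack(killings_2015, killings_2016)

# A third table from another source, with one column named slightly differently
# and one extra demographic column. It is harmonized but kept separate.
extended = [{"age": "27", "state": "TX", "race_ethnicity": "Hispanic/Latino",
             "h_income": "40000"}]
extended = rename_columns(extended, {"race_ethnicity": "raceethnicity"})

print(len(counted), sorted(extended[0]))
```

Keeping the extended table separate, rather than forcing it into the stacked one, mirrors the choice made above: the extra demographic columns would otherwise be empty for most rows.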
The 4th and 5th data-sets will also be used to see how different states and cities compare in relation to police killings.

Removing and transforming features: We remove identifying features like ID, Name and Street Address, as they are almost always unique to each case and don’t allow us to reach any generalizations. We also remove the date features: Year, Month and Day. I checked to make sure these variables are uniformly distributed, so we are not losing any valuable information. For the extended features data-set, we also remove any feature that serves as a name/ID for the county/area where the killing happened. These values are also mostly unique to each incident and rule out general trends. We do leave in location features such as City, State, Latitude and Longitude. Many of these features are imported into our data-frame in the wrong format when we load the data-sets. So we have to manually check the data-set summaries to ensure that the numeric features are indeed saved as numeric, the categorical features are saved as factors and so on, and convert them to the right format when they aren’t.

Missing values and data sub-sets: A small portion of the data-points have missing values for some of the features, such as Longitude and Latitude. Instead of removing these data-points from our data-frame entirely, it makes more sense to use subsets of the data-frame that leave out specific data-points only when we are visualizing the affected features.

Feature extraction: Next we add some new features. We add the feature Region to our data-set. Based on the state in which the killings occurred, we assign the data-points to one of the 4 regions in the US: West, South, Midwest, and Northeast.
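The pre-processing steps described so far (fixing column types, subsetting around missing coordinates, and deriving Region from State) can be sketched as follows. This is a Python illustration, not the R code used for the project. It assumes the four US Census Bureau regions, since the exact state-to-region mapping is not listed here, and the helper names and sample rows are made up.

```python
# Sketch only. Assumes the four US Census Bureau regions; the exact mapping used
# in the project is not given. Helper names and sample rows are hypothetical.

REGION_STATES = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE DC FL GA MD NC SC VA WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {state: region
                   for region, states in REGION_STATES.items()
                   for state in states.split()}


def to_float(text):
    """Convert a text field to float; blank or malformed values become None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean(row):
    """Fix column types and add the derived region feature."""
    cleaned = dict(row)
    for column in ("age", "latitude", "longitude"):
        cleaned[column] = to_float(row.get(column))
    cleaned["region"] = STATE_TO_REGION.get(row.get("state"))
    return cleaned


def with_coordinates(rows):
    """Subset used only for maps: keep rows that have both coordinates."""
    return [r for r in rows
            if r["latitude"] is not None and r["longitude"] is not None]


raw = [
    {"age": "34", "state": "MO", "latitude": "38.63", "longitude": "-90.20"},
    {"age": "51", "state": "CA", "latitude": "", "longitude": ""},
]
rows = [clean(r) for r in raw]      # full data-set, nothing dropped
mappable = with_coordinates(rows)   # smaller subset for map plots only
print([r["region"] for r in rows], len(mappable))
```

The full list `rows` stays intact for plots that do not need coordinates, which is the point of subsetting per visualization instead of deleting incomplete rows up front.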
We also add another feature, Agegroup, to separate the deceased into different age groups.

After trimming our features, our data-sets have the following features:

A) 2015 data-set with extended features: 467 killings
1) age: Age of deceased
2) gender: Gender of deceased
3) raceethnicity: Race/ethnicity of deceased
4) city: City where incident occurred
5) state: State where incident occurred
6) latitude: Latitude, geocoded from address
7) longitude: Longitude, geocoded from address
8) lawenforcementagency: Agency involved in incident
9) cause: Cause of death
10) armed: How/whether deceased was armed
11) region: Region of the US where the incident occurred
12) agegroup: Age-group that the deceased belongs to
13) share_white: Share of pop that is non-Hispanic white
14) share_black: Share of pop that is black (alone, not in combination)
15) share_hispanic: Share of pop that is Hispanic/Latino (any race)
16) p_income: Tract-level median personal income
17) h_income: Tract-level median household income
18) county_income: County-level median household income
19) pov: Tract-level poverty rate
20) urate: Tract-level unemployment rate
21) college: Share of 25+ pop with BA or higher

B) 2015–16 data-set: 2226 killings
Features 1-12 from data-set A

C) 2015 State data-set
1) State ID
2) Population
3) Violent Crime incidents in 2015

D) City data-set
1) City
2) State ID
3) State name
4) Population
5) Population proper
6) Population density

Analysis: Geographical visualizations & Maps

At first we are going to compare different states with each other, and then look at the cities with the most police killings.

The y-axis represents the number of violent crimes in each state in 2015 and the x-axis represents the number of police killings in that state in the same time period.

Given that incidences of violence are roughly proportional to the population of a state, if we were to create a 3D scatter plot for the Population, Incidences of Violence and Killings in each state, we would expect something similar to a straight line through 3D space (it’s in my R script but I haven’t included it here to keep things simple).

We make a similar plot for the cities where there were 10 or more police killings in 2015.

Next, we are going to put some maps on our plots to better visualize the locations.

Map methodology: For all the map visualizations, we will work with the large 2015–16 data-set and leave out killings in the states of Alaska and Hawaii, so we can zoom in geographically to see finer details.

If you are an aspiring data-scientist, I would strongly encourage learning to make good use of these R libraries.

For the first map, we are looking at all police killings during 2015–16, with the shade of the dot indicating the age of the deceased.

If you see the US population heat-map below, you should notice that these two plots look somewhat similar.

Now let’s take a look at a smaller subset of the data which consists only of individuals who were killed when they were unarmed.

That would be 389 people, or roughly 17.4% of the total people killed in those 2 years.

If the data is representative of what’s still happening now, then it seems that the killings of unarmed individuals are also a wide-spread problem throughout the US.

So we subset our data-set to only include those killed who were either “Black”, “White”, “Hispanic/Latino” or “Native American”.

It seems that Whites killed were spread throughout most of the country.

Note that this is only information from fewer than 500 people who were killed between January and June 2015.

First, let’s look at incomes.

The important thing to note here is that these incomes were not of the individuals who were killed, but rather of the people who reside at the location where the killings occurred.

But according to the data from the killings, Native Americans got killed in lower income areas than Blacks did.

Among many other things, this could suggest that either a lot of black people got killed in neighborhoods without predominantly black families, or that the national incomes of Native American households were boosted when they are categorized in the same group as Native Alaskans.

For our data-set, the killings occurred in areas where the median household income is $42,759.
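The income comparison above amounts to grouping the extended data-set by `raceethnicity` and taking the median of the tract-level `h_income` within each group. A minimal Python sketch of that grouping is shown here, assuming rows are plain dictionaries. The helper name `median_by_group` is hypothetical, and the group labels and incomes are placeholders, not figures from the data-set.

```python
from collections import defaultdict
from statistics import median

# Placeholder rows for illustration only; labels and incomes are NOT from the data-set.
rows = [
    {"raceethnicity": "group_a", "h_income": 30000},
    {"raceethnicity": "group_a", "h_income": 50000},
    {"raceethnicity": "group_b", "h_income": 45000},
    {"raceethnicity": "group_b", "h_income": None},  # missing tract income
]


def median_by_group(rows, group_key, value_key):
    """Median of value_key within each group, skipping missing values."""
    groups = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            groups[row[group_key]].append(row[value_key])
    return {group: median(values) for group, values in groups.items()}


medians = median_by_group(rows, "raceethnicity", "h_income")
print(medians)
```

Because `h_income` describes the tract where a killing occurred, each group median summarizes the neighborhoods involved, not the incomes of the people killed.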
