What can we tell about a city from its data? Here's a brief look at what drives crime and 311 calls in Chicago.
Let's start with the big picture. In 2014, there were 272,374 recorded instances of crime (wow), and these are the locations of all the crime reports. To put this into perspective, the estimated population of Chicago in 2013 was 2.72 million, so there was about one recorded crime for every 10 people. There are enough reports, spread widely enough over the city, that we can make out the geography (even Washington Park is visible near the lake, apparently not a hotspot for either criminals or maybe police). I'm not that familiar with Chicago neighborhoods, but there seem to be hotspots down on the South Side around the intersection of 90 and 94, out to the west around Oak Park, and downtown by the Near East Side.
How about the 311 calls? These are calls for city services, such as reporting graffiti or requesting maintenance. The different call types show strong geographic variation relative to one another. The most prominent is graffiti, as is clear from the density of points, so let's focus on that for a moment. Since graffiti is associated with crime, maybe it should be most reported in areas that are crime heavy, but it's clear that reported graffiti is not indicative of crime level. In fact, you can show that the wards where graffiti reporting is the highest are correlated with gentrification (I suppose the idea is that gentrifying areas might be more sensitive to disorderly conduct, but that's not the only viable hypothesis). In contrast, calls for emptying garbage, tree trimming, abandoned cars, or extinguished street lights are much more uniform. This is important because it tells us that the geographic variation is not just because people in certain neighborhoods don't have phones or just don't call; people do call, but they're not calling in about certain things. Also, are rodents really only a problem on the north side?
If we plot the number of reported events over the course of the year, we find that nearly everything has some relationship with temperature. It is well known that crime, for example, increases with temperature. But we might as well plot it for fun, and I show it below. We have to be careful because different wards have different mean levels of crime, so I am showing the fluctuations as a percentage of the mean to make a fair comparison. For every degree increase in the maximum daily temperature, crime reports increase about 0.8% above the mean. You might note that crime is typically higher on Saturday than on the average weekday (either these criminals have weekday jobs or there are more targets on Saturday but not Sunday).
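To make the "percentage of the mean" rescaling concrete, here is a minimal sketch. The temperatures, counts, and ward names are made up for illustration and are not the Chicago data, and the hand-written least-squares fit is just one simple way to get a slope, not necessarily the one used for the plots.

```python
# Hypothetical toy data: daily max temperature and daily crime counts for two
# wards with very different mean levels. None of these numbers are real.
temps = [30, 40, 50, 60, 70, 80, 90]
counts_by_ward = {
    "ward_A": [80, 88, 96, 104, 112, 120, 128],      # low-crime ward
    "ward_B": [400, 440, 480, 520, 560, 600, 640],   # high-crime ward
}


def percent_of_mean(counts):
    """Express each day's count as a percentage deviation from the ward mean."""
    mean = sum(counts) / len(counts)
    return [100.0 * (c - mean) / mean for c in counts]


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Slope in raw counts per degree: depends heavily on the size of the ward's mean.
raw_slopes = {w: ols_slope(temps, c) for w, c in counts_by_ward.items()}

# Slope in percent of the mean per degree: comparable across wards.
slopes = {w: ols_slope(temps, percent_of_mean(c)) for w, c in counts_by_ward.items()}

print(raw_slopes)
print(slopes)
```

In this toy example the raw slopes differ by a factor of five, while the rescaled slopes agree, which is the point of dividing by the mean before comparing wards.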
I show similar plots for the different types of 311 calls. Some of these have no real correlation with the temperature, such as reports about abandoned cars or graffiti. Others have strong temperature dependence, as would be expected, and these are garbage (2.6% increase per degree--yuck!), tree trimming, sanitation, etc. One that has a negative correlation is the report of street lights (-1.7%), presumably because the days get longer during the summer (and maybe cold has a negative impact on the longevity of bulbs?). For 311 calls, it's quite important to distinguish between weekdays and weekends, so the linear regressions shown exclude the weekends.
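Here is a rough sketch of why dropping the weekends matters for the 311 regressions. The dates, temperatures, and call counts are invented (weekend calls are simply set to half the weekday level), so this only illustrates the filtering step, not the real fits.

```python
from datetime import date, timedelta

# Hypothetical toy series: 28 consecutive days starting Monday 2014-06-02.
start = date(2014, 6, 2)
days = [start + timedelta(days=i) for i in range(28)]
temps = [60 + i for i in range(28)]  # made-up, steadily warming

# Made-up call counts: 100 + 2 * temp on weekdays, half of that on weekends.
calls = [
    (100 + 2 * t) if d.weekday() < 5 else (100 + 2 * t) // 2
    for d, t in zip(days, temps)
]


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Keep Monday (0) through Friday (4) only.
weekday_temps = [t for d, t in zip(days, temps) if d.weekday() < 5]
weekday_calls = [c for d, c in zip(days, calls) if d.weekday() < 5]

slope_all_days = ols_slope(temps, calls)
slope_weekdays = ok = ols_slope(weekday_temps, weekday_calls)

print(slope_all_days, slope_weekdays)
```

With the weekends left in, the dip in calls every Saturday and Sunday distorts the fitted slope; restricted to weekdays, the fit recovers the slope of 2 calls per degree that was built into the toy data.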
The huge impact of the temperature is astounding. In fact, many have shown temperature to be highly predictive of conflict and crime in many other places (see Hsiang et al. 2011). Clearly, the direction of causality must be from nature to human behavior, but it's not clear whether it's really temperature, the longer day, or something related that is causing these reports to rise and fall.
Clearly, fluctuations in crime vary systematically over the entire city, and there is at least one clear reason why these large fluctuations are correlated. So if I were to look at how fluctuations about the mean were correlated between different wards over the course of the year, the correlations would be dominated by these systemic trends. What happens if I explicitly remove the trend? Are the leftover fluctuations still correlated? I check this by subtracting off the mean across the wards, which still seems to give a rich correlation structure as visible below.
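The detrending step can be sketched in a few lines of NumPy. Everything here is synthetic: a made-up seasonal trend shared by all wards plus independent noise, with hypothetical ward means, so the numbers say nothing about Chicago. It only shows the mechanics of rescaling by each ward's mean and then subtracting the mean over wards for each day.

```python
import numpy as np

rng = np.random.default_rng(0)
n_wards, n_days = 6, 365

# Hypothetical data: one citywide seasonal trend plus independent ward noise.
day = np.arange(n_days)
trend = 0.2 * np.sin(2 * np.pi * (day - 100) / 365)       # shared by all wards
ward_means = rng.uniform(5, 30, size=(n_wards, 1))        # different mean levels
noise = 0.05 * rng.standard_normal((n_wards, n_days))
counts = ward_means * (1 + trend + noise)                 # rows: wards, cols: days

# Fractional fluctuation of each ward about its own mean.
rel = counts / counts.mean(axis=1, keepdims=True) - 1

# Remove the citywide signal: subtract the mean over wards, day by day.
resid = rel - rel.mean(axis=0, keepdims=True)


def mean_offdiag(c):
    """Average of the off-diagonal entries of a square correlation matrix."""
    mask = ~np.eye(c.shape[0], dtype=bool)
    return c[mask].mean()


corr_before = np.corrcoef(rel)    # ward-by-ward correlations, trend included
corr_after = np.corrcoef(resid)   # ward-by-ward correlations, trend removed

print(mean_offdiag(corr_before), mean_offdiag(corr_after))
```

In this synthetic case the wards only share the trend, so the strong correlations mostly vanish once it is removed. Any structure that survives the same step in real data is the interesting part.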
Taking the crime data renormalized by the mean of each ward, and then subtracting the mean over the wards, I use SVD to pull out the patterns characterizing the wards. It looks like only a few wards have strong deviations along the first and second components once we've accounted for the mean trend. Noticeably large deviations are visible for wards 11, 19, 22, 38, 40, 41, 43, 44, 46, and 47. To see how these outliers fit in with the geography of the city, I took the wards that were more than half a standard deviation away from 0 and colored them by whether they are positive or negative. What is really interesting is that wards in the same cluster are not always geographically close. For example, the cluster of wards 1, 32, 43 and 44 deviates from the mean trend in a similar way to 10, 12, 22 and 36 despite the separation.
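As a sketch of the SVD step, here is how one might pull out ward scores along the first component and flag the wards more than half a standard deviation from zero. The residual matrix is synthetic (two hypothetical groups of wards following opposite patterns, plus noise), and the variable names are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_wards, n_days = 50, 365

# Hypothetical residual matrix: wards 1-5 share one pattern over the days,
# wards 6-10 share the opposite pattern, and the rest are just noise.
day = np.arange(n_days)
pattern = np.sin(2 * np.pi * day / 7)
loading = np.zeros(n_wards)
loading[:5] = 1.0
loading[5:10] = -1.0
X = np.outer(loading, pattern) + 0.1 * rng.standard_normal((n_wards, n_days))

# Subtract the mean over wards for each day, as in the text.
X = X - X.mean(axis=0, keepdims=True)

# Rows of X are wards, so columns of U describe wards and rows of Vt describe days.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Each ward's coordinate along the first component.
scores = U[:, 0] * s[0]

# Flag wards more than half a standard deviation away from zero, split by sign.
threshold = 0.5 * scores.std()
positive = np.flatnonzero(scores > threshold) + 1   # 1-based ward labels
negative = np.flatnonzero(scores < -threshold) + 1

print(positive, negative)
```

One caveat: the overall sign of an SVD component is arbitrary, so which group comes out "positive" carries no meaning. Only the fact that the two groups land on opposite sides does.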
If you're curious, the other matrix returned by SVD tells us about the variation over the days. After the mean has been subtracted, the first component reflects the weekly variation. The second component is more complicated, and it's not immediately clear to me what it refers to. But it's below for you to inspect. Let me know if you have any ideas... What I see from inspecting this is a slight depression at the beginning of the year and during the summer. So, it looks like the summer is not just a trend up but maxes out?
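One way to check that a day-side component really is a weekly cycle is to look at its Fourier spectrum. This is a sketch on synthetic data with a 7-day cosine built in; the FFT check is my own suggestion for illustration, not something the plots above were made with.

```python
import numpy as np

rng = np.random.default_rng(2)
n_wards, n_days = 20, 364   # 52 full weeks, so a 7-day cycle sits on an exact FFT bin

# Hypothetical residuals: every ward follows a weekly cycle with its own amplitude.
day = np.arange(n_days)
weekly = np.cos(2 * np.pi * day / 7)
loading = rng.standard_normal(n_wards)
X = np.outer(loading, weekly) + 0.1 * rng.standard_normal((n_wards, n_days))
X = X - X.mean(axis=0, keepdims=True)   # subtract the mean over wards

U, s, Vt = np.linalg.svd(X, full_matrices=False)
day_pattern = Vt[0]                     # first component as a function of day

# Power spectrum of the day pattern; frequencies are in cycles per day.
power = np.abs(np.fft.rfft(day_pattern)) ** 2
freqs = np.fft.rfftfreq(n_days, d=1.0)
k = np.argmax(power[1:]) + 1            # skip the zero-frequency bin
dominant_period = 1.0 / freqs[k]

print(dominant_period)
```

The same spectrum applied to a murkier component might at least show whether it is dominated by slow seasonal variation or by something faster.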
What's the conclusion? I don't think any of these posts are ever thorough enough to lead to any real conclusion, but it's stunning to see how strong global effects like temperature and the weekly cycle are on the crime cycle in Chicago (or any city probably). Besides that, it's important to mention that the right way to compare the crime statistics between wards seems to be to rescale them by the means (or geometric means). A lot of statistical tools that researchers tend to take off the shelf make normality assumptions, and if that is the case, one should really take these variables into the log domain after rescaling, or perhaps use a statistic that isn't sensitive to outliers, like Spearman's rank correlation.
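To illustrate that last point, here is a small comparison of Pearson correlation on rescaled values, Pearson correlation in the log domain, and Spearman's rank correlation. The two series are invented, with one deliberate outlier, and the rank helper assumes there are no ties.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n. Assumes no ties (ties would need averaged ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical daily counts divided by the ward mean for two wards.
# The last day of ward_b is a deliberate outlier.
ward_a = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
ward_b = [0.7, 0.85, 0.95, 1.05, 1.15, 9.0]

r_raw = pearson(ward_a, ward_b)
r_log = pearson([math.log(v) for v in ward_a], [math.log(v) for v in ward_b])
rho = spearman(ward_a, ward_b)

print(r_raw, r_log, rho)
```

Both series rise together on every day, so the rank correlation is exactly 1, while the single outlier drags the plain Pearson coefficient well below that. Taking logs softens the outlier but does not remove its influence.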
This post was originally motivated by a datathon at the Computational Social Science Summit of 2015. We were investigating the incidence of crime in Chicago using crime reports collected during 2014 and 311 calls. Although the focus of the datathon was on answering some sociological questions and making policy recommendations, there were some very interesting patterns that might merit further attention. I thought it would be fun to show some of them here. I should mention that the Chicago Tribune has a very nice website set up for visualizing these data across many more years here.
The crime data consists of all the reported crimes, including details of the date, type of crime, a description, the location, and whether an arrest was made or not. The 311 data that we looked at included calls about abandoned vehicles, garbage carts, graffiti removal, rodent baiting, sanitation code complaints, broken street lights, tree debris, tree trimming, and vacant and abandoned buildings (say occupied by the homeless). Similarly, these files include time, location, and some other relevant information such as the type of complaint and whether it was addressed or not. That is a lot of different variables to play with, and one can easily stick all of these different columns into a machine and try to analyze what it spits out, but I thought I would show more details since I don't think black boxes are particularly helpful.