My outcome of interest is retail theft. There are many types of theft, such as pocket-picking, purse-snatching, and theft from buildings. For this predictive policing task, I am going to focus on retail theft, which tends to carry more selection bias. I once watched a video by a vlogger named Jerry Kowal, who is from New York. He went to Detroit and recorded a driving tour through its neighborhoods. The neighborhoods shown in the video are very bleak, with abandoned houses and poor sanitation, and what inspired me to focus on retail theft is that a KFC restaurant in one of them even had bullet-proof glass at the cashier counter. Jerry said that walking through the neighborhood made him feel very unsafe, and the bullet-proof glass made sense given the high crime rate. I have also read news stories about theft and robbery in fast food restaurants, and the neighborhoods around those restaurants are similarly bleak and unsafe.
Because retail thefts take place in many kinds of locations, and in order to be specific and match what I took from the video and the news, I selected only retail thefts that took place in a convenience store, grocery food store, small retail store, restaurant, tavern/liquor store, or drug store. These are relatively small-scale establishments that are easy to find and common in neighborhoods. This selection leaves 6,526 of the 10,460 total retail theft cases.
To elaborate on the selection bias of such retail thefts, here is a summary of the characteristics of these locations:
1. They tend to take place in small stores in neighborhoods (this is how I made the selection from the location description; large department stores likely have stronger policing and security).
2. They occur in places with weak policing.
3. They occur in places with poor environments and facilities (dilapidated housing, bad sanitation, poor facades, vacant lots).
4. There is no definite time of occurrence; both daytime and nighttime are possible.
These characteristics also echo the "broken windows" theory.
The following map shows the retail theft points across Chicago, and we can see that retail thefts are not everywhere. Surprisingly, many of them are concentrated in the Loop, which was contrary to my expectation.
crime <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")
theft<-
crime%>%
filter(Primary.Type == "THEFT"&
Description == "RETAIL THEFT"&
Location.Description %in% c ("CONVENIENCE STORE",
"GROCERY FOOD STORE",
"SMALL RETAIL STORE",
"RESTAURANT",
"TAVERN/LIQUOR STORE",
"DRUG STORE")) %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform(102271) %>%
distinct()
#Clip all theft points into the boundary
theft <-
theft[chicagoBoundary,]
#Plot
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = theft, colour="red", size=0.1, show.legend = "point") +
labs(title= "Retail theft, Chicago - 2017") +
mapTheme()
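As a quick check on the 6,526-of-10,460 figure quoted earlier, the counts can be tallied directly (a minimal sketch, reusing the crime and theft objects created above; exact numbers may shift slightly after the NA drop, de-duplication, and boundary clip):
#Total 2017 retail thefts, regardless of location type (about 10,460)
crime %>%
filter(Primary.Type == "THEFT" & Description == "RETAIL THEFT") %>%
nrow()
#Retail thefts kept after restricting to the six small-scale location types (about 6,526)
nrow(theft)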
The following map shows the count of retail thefts by grid cell (cell size of 500). The spatial structure of retail theft begins to take shape here, as the hotspots (closer to yellow) are immediately apparent. The Loop and some areas on the northeast side clearly have high counts. Again, retail thefts are not everywhere in Chicago, so many grid cells contain no counts at all.
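The fishnet itself is assumed to have been created earlier in the analysis; a minimal sketch of how such a 500-unit grid over the city boundary could be built (the exact code used may differ):
#Sketch: build the fishnet over the Chicago boundary (assumed done earlier in the workflow)
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500, square = TRUE) %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
fishnet <- fishnet[chicagoBoundary, ] #keep only grid cells that intersect the city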
#2.2 Data wrangling: Joining theft to the fishnet--[Crime net]
crime_net <-
theft %>%
dplyr::select() %>%
mutate(countTheft = 1) %>%
aggregate(., fishnet, sum) %>% #aggregate joins thefts to the fishnet, using the sum operation to total countTheft per grid cell
mutate(countTheft = ifelse(is.na(countTheft), 0, countTheft),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE)) #Roughly nrow(fishnet)/24 cvIDs are generated to allow random k-fold cross validation below
ggplot() +
geom_sf(data = crime_net, aes(fill = countTheft)) +
scale_fill_viridis() +
labs(title = "Count of retail theft for the fishnet",
subtitle = "Gray area means no data/no retail thefts") +
mapTheme()
Besides the six given risk factors (abandoned cars, street lights out, graffiti remediation, sanitation complaints, abandoned buildings, and retail stores that sell liquor to go), I added two additional risk factors from 311 requests: rodent baiting and garbage carts. Rodent baiting indicates rat damage, and garbage carts indicate carts that need to be replaced or maintained. Both new risk factors also focus on the physical environment.
From the retail theft points, we have noticed that many cases take place in the Loop, Chicago's downtown. I would argue that downtown is not always in good physical condition either; there may still be plenty of trash on the streets, abandoned buildings, and other problems. Whether the current risk factors actually help predict retail theft is something the regression results later in this assignment will show.
#2.3 Data wrangling risk factors
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
#Rat/Rodent baiting
rat <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs/data") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Rodent baiting") #50921 entries
#Garbage cart
GarbageCart <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Garbage cart") #52837 entries
The following facet map shows the points for all eight risk factors:
#Mapped factors out!
ggplot() +
geom_sf(data=chicagoBoundary) +
geom_sf(data = rbind(abandonCars,
streetLightsOut,
abandonBuildings,
liquorRetail,
graffiti,
sanitation,
rat,
GarbageCart),
size = .1) +
facet_wrap(~Legend, ncol = 4) +
labs(title = "Risk Factors") +
mapTheme()
Feature engineering for the eight risk factors involves three major parts: the count of features by grid cell, the average nearest neighbor distance from each grid cell to nearby risk factor points, and the distance to the centroid of the Loop.
The two new risk factors illustrate slightly different spatial processes: rodent baiting requests cluster mainly in the north and northeast of Chicago, while garbage cart requests appear in both the north and the south.
#3.1 Feature engineering - Count of features by grid cell
#The first simply sums the number of a given risk factor's points occurring in a given grid cell.
#This is done by binding together the individual risk factor layers;
#then spatial joining (st_join) the fishnet to each point.
#The outcome is a large point layer with a column for each fishnet uniqueID.
#The layer is then converted to a data frame with st_set_geometry.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail,graffiti,sanitation,rat,GarbageCart) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>% # converted to a data frame with st_set_geometry
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>% #full_join adds the fishnet geometries
spread(Legend, count, fill=0) %>% #spread converts to wide form
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
#let’s map each count feature as part of a small multiple map
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable) #vector of Variable names
mapList <- list() #empty list
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet: Count of features by grid cell"))
For nearest neighbor distance, one common pattern is that the southernmost part of Chicago has the largest distances to all eight risk factors (more yellow), while a majority of the city is very close to them (darker blue).
#3.2 Feature engineering - Nearest neighbor features
#NND function
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
#Calculate NND
#These features are created by converting the vars_net grid cells to centroid points, then measuring the average distance to the 3 nearest points of each risk factor
vars_net$Abandoned_Buildings.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
vars_net$Abandoned_Cars.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
vars_net$Graffiti.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
vars_net$Liquor_Retail.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)
vars_net$Street_Lights_Out.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)
vars_net$Sanitation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)
vars_net$Rat.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(rat), 3)
vars_net$GarbageCart.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(GarbageCart), 3)
#Mapped them out!
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>% #use of select and ends_with to map only the newly generated nearest neighor features
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet: Nearest Neighbor Distance"))
It also makes sense to measure distance to a single point, such as the centroid of the Loop, Chicago's central business district.
#3.3 Feature Engineering - Measure distance to one point
#It also may be reasonable to measure distance to a single point,
#like the centroid of the Loop neighborhood - Chicago's Central Business District.
#This is done with st_distance.
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis() +
labs(title="Euclidean distance to The Loop") +
mapTheme()
Finally, a final net is created by combining the engineered features with the crime net and joining on neighborhood names and police districts. The facet below shows these aggregate areas (police districts and neighborhoods) rendered on the fishnet.
#3.4 Feature Engineering - Create the final_net
#crime_net and vars_net layers are joined into a final dataset
final_net <-
left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID")
final_net$countTheft[is.na(final_net$countTheft)] <- 0
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>% #Name for neighborhoods
st_join(., dplyr::select(policeDistricts, District)) %>% #District for police district
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
dplyr::select(final_net, name, District) %>%
gather(Variable, Value, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Value)) +
facet_wrap(~Variable) +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Aggregate Areas",
subtitle="Not everywhere has retail theft!") +
mapTheme() + theme(legend.position = "none")
The local spatial structure of retail theft (by grid cell) is quantified and visualized with Local Moran's I.
Very high values of Local Moran's I represent strong, statistically significant evidence of local clustering, so this test provides insight into the scale and intensity of the retail theft spatial process. From "Local_Morans_I" in the facet, retail thefts are highly clustered in the Loop, which occupies a very small portion of Chicago; the rest of the city shows very low (near-zero) clustering.
The "Significant_Hotspots" panel tells a slightly different story: not only is the Loop a significant hotspot, but a large area north of the Loop, some areas south of it, and parts of the West Side are significant retail theft hotspots as well.
#4.1 Exploring the spatial structure of retail theft
#The code block below creates a neighbor list, poly2nb, and the spatial weights matrix,
#final_net.weights, using queen contiguity, meaning that every grid cell is related to its eight adjacent neighbors.
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
#Moran's I
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countTheft, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(Theft_Count = countTheft,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I statistics, Retail theft"))
To isolate highly significant hotspots, I use a threshold far stricter than 0.05: p-value <= 0.0000001. The distance from each grid cell to its nearest significant cluster is then measured and mapped below. This distance feature makes a strong assumption about the local spatial process: that retail theft dissipates linearly outward from the highly significant hotspots.
#Flag highly significant hotspots and compute distance to them
final_net <-
final_net %>%
mutate(theft.isSig = ifelse(localmoran(final_net$countTheft, #Dummy variable
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(theft.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, theft.isSig == 1))), 1 ))
#measures distance from each grid cell to its nearest significant cluster
ggplot() +
geom_sf(data = final_net, aes(fill = theft.isSig.dist)) +
scale_fill_viridis() +
labs(title = "Distance to highly significant local retail theft hotspots") +
mapTheme()
From the scatterplots, the correlation between retail theft count and garbage cart count is very small (r = 0.02), while the correlation with rodent baiting count is more conspicuous (r = 0.16): as the count of rodent baiting requests in a grid cell increases, so does the count of retail thefts, and as the distance to rodent baiting decreases, the count of retail thefts increases.
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
gather(Variable, Value, -countTheft)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countTheft, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countTheft)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Retail theft count as a function of risk factors")
The distribution of retail theft is highly skewed; most grid cells have no retail thefts at all, since crime is a relatively rare event. When the dependent variable is a skewed count like this, Poisson regression is appropriate.
ggplot(final_net, aes(countTheft)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of retail theft by grid cell")
Cross-validated Poisson regression is applied in two forms: random k-fold and LOGO (leave-one-group-out) cross-validation. For each cross-validation type, two groups of features are fit: "Just Risk Factors" and "Spatial Structure".
The facet below maps the predictions for each regression. The models with the spatial structure features clearly do a better job of picking up the very localized hotspots.
#5.2 Cross-validated Poisson regression
#Just risk factors
reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn","loopDistance",
"Rat.nn","GarbageCart.nn")
#risk factors plus the Spatial Structure
reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn",
"Street_Lights_Out.nn", "Sanitation.nn", "loopDistance",
"Rat.nn","GarbageCart.nn",
"theft.isSig", "theft.isSig.dist")
#Cross validate function
#Note: the dependent variable countTheft is hardwired into the glm call inside this function.
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countTheft~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
#Run the cross-validation for each regression
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countTheft",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countTheft, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countTheft",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countTheft, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countTheft",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countTheft, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countTheft",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countTheft, Prediction, geometry)
#5.2 Accuracy & Generalzability
reg.summary <-
rbind(
mutate(reg.cv, Error = countTheft - Prediction,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = countTheft - Prediction,
Regression = "Random k-fold CV: Spatial Structure"),
mutate(reg.spatialCV, Error = countTheft - Prediction,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = countTheft - Prediction,
Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
st_sf()
#Plot
grid.arrange(
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Predicted retail theft by Regression") +
mapTheme() + theme(legend.position="bottom"),
filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
ggplot() +
geom_sf(aes(fill = countTheft)) +
scale_fill_viridis() +
labs(title = "Observed retail theft\n") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
The maps below show raw prediction errors (not absolute errors) for both cross-validation types and both feature groups. Without the local spatial structure, many of the hotspots are under-predicted (high positive errors), leaving clustered errors at the hotspots.
Again, the models with the spatial structure features do a better job: judging from the colors, their errors tend to be smaller.
#Plot a small multiple map of model errors by random k-fold and spatial cross validation
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Error)) +
facet_wrap(~Regression,ncol=2) +
scale_fill_viridis() +
labs(title = "Errors of retail theft by Regression",
subtitle="Regression: Random k-fold and spatial cross validation") +
mapTheme() + theme(legend.position="bottom")
Interpreting the MAE and its standard deviation for each regression type, the errors average roughly 0.756 retail thefts per grid cell. For context, the mean retail theft count is only 0.695, so the average error exceeds the average count, and the overall accuracy of the predictions is not very satisfactory relative to the actual counts.
The table does, however, reconfirm that the regressions with the spatial structure features are more accurate and more generalizable; those features do a better job of predicting the hotspots.
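For reference, the baseline mean count cited above can be checked directly (a quick sketch):
#Mean observed retail theft per grid cell, for comparison against the MAEs below
mean(final_net$countTheft) #about 0.695 in this fishnet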
#MAE and the standard deviation of MAE across each fold for each regression type
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - countTheft), na.rm = T),3),
SD_MAE = round(sd(abs(Prediction - countTheft), na.rm = T),3)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.777 | 1.350 |
| Random k-fold CV: Spatial Structure | 0.733 | 1.230 |
| Spatial LOGO-CV: Just Risk Factors | 0.780 | 1.352 |
| Spatial LOGO-CV: Spatial Structure | 0.734 | 1.220 |
The facet below provides another way to see how the spatial structure variables improve predictions. Each facet represents a regression type, and the observed retail theft counts are split into ten deciles, from lowest to highest. For random k-fold CV, the spatial structure features provide marginal improvements in the 1st through 8th deciles; for LOGO-CV, the improvement is only visible in the 5th through 7th deciles. In the highest-theft grid cells, however, the spatial structure features reduce errors by more than half for both cross-validation types.
All models over-predict in the 1st through 7th deciles and under-predict in the 8th through 10th deciles; in other words, we over-predict in low retail theft areas and under-predict in high retail theft areas.
#Visualizing how the spatial structural variables provide better predictions.
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
mutate(theft_Decile = ntile(countTheft, 10)) %>%
group_by(Regression, theft_Decile) %>%
summarize(meanObserved = mean(countTheft, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -theft_Decile) %>%
ggplot(aes(theft_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = theft_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed retail theft by observed retail theft decile")
To check generalizability with respect to racial context, I followed the tutorial and used the tidycensus package to download ACS 5-year estimates of the white share of the population. Tracts are then divided into two categories: Majority White (more than half of the population is white) and Majority Non-White. Since the models with spatial structure performed better, I compare the random k-fold and spatial LOGO-CV regressions that include the spatial structure features.
The table below shows the mean error by regression and racial context. The models over-predict in Majority_Non_White neighborhoods and under-predict in Majority_White neighborhoods. However, both mean errors are small (about +/- 0.05), so I would argue that generalizability across racial contexts is fair.
library(tidyverse)
library(tidycensus)
library(viridis)
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2017, state=17, county=031, geometry=T) %>%
st_transform(102271) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
#TABLE: A table of raw errors by race context for random k-fold vs. spatial cross validation regressions.
final_reg <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure" |
Regression == "Random k-fold CV: Spatial Structure") %>%
mutate(uniqueID = rownames(.))
final_reg.tracts <-
st_join(st_centroid(final_reg), tracts17) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_reg, uniqueID)) %>%
st_sf() %>%
na.omit()
st_set_geometry(final_reg.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Random k-fold CV: Spatial Structure | 0.0466962 | -0.0504453 |
| Spatial LOGO-CV: Spatial Structure | 0.0468787 | -0.0501568 |
Next, the algorithm is compared with traditional kernel density hotspot mapping. First, I built a kernel density estimate with a 1,000 ft. search radius and overlaid 1,500 sampled retail theft points on top.
#Kernel density
library(raster)
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD.1000 <- spatstat::density.ppp(theft_ppp, 1000)
theft_KD.1500 <- spatstat::density.ppp(theft_ppp, 1500)
theft_KD.2000 <- spatstat::density.ppp(theft_ppp, 2000)
theft_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
theft_KD.df$Legend <- factor(theft_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
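The theft_KD.df data frame above holds all three search radii but is not mapped in this section; a small-multiple sketch in the style of the other maps (assuming the x, y, and layer columns produced by rasterToPoints):
#Sketch: facet map of kernel density at the three search radii
ggplot(data = theft_KD.df, aes(x = x, y = y)) +
geom_raster(aes(fill = layer)) +
facet_wrap(~Legend) +
coord_sf(crs = st_crs(final_net)) +
scale_fill_viridis(name = "Density") +
labs(title = "Kernel density with three different search radii") +
mapTheme()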
#creates a kernel density with a 1000 foot search radius
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD <- spatstat::density.ppp(theft_ppp, 1000)
as.data.frame(theft_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(theft, 1500), size = .5) + #1500 Points
scale_fill_viridis() +
mapTheme()
Below are maps of the risk categories for both model types, with test set points overlaid. A well-fit model should show the highest risk category uniquely targeting places with a high density of retail theft points. By that standard, the risk prediction algorithm does the better job of uniquely targeting high-density areas; it is more precise and specific than kernel density.
#whether the risk predictions capture more of the observed retail thefts relative to the kernel density
# Compute kernel density
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD <- spatstat::density.ppp(theft_ppp, 1000)
# Convert kernel density to grid cells taking the mean
theft_KDE_sf <- as.data.frame(theft_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fishnet.
bind_cols(
aggregate(
dplyr::select(theft) %>% mutate(theftCount = 1), ., length) %>%
mutate(theftCount = replace_na(theftCount, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, theftCount)
#Prediction from the LOGO-CV with the spatial features
#theft_risk_sf
theft_risk_sf <-
filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(theft) %>% mutate(theftCount = 1), ., length) %>%
mutate(theftCount = replace_na(theftCount, 0))) %>%
dplyr::select(label,Risk_Category, theftCount)
#plot
rbind(theft_KDE_sf, theft_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(theft, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Last but not least, the rate of retail theft points captured by risk category and model type is calculated and plotted. From the bar chart below, the risk prediction captures a much greater share of observed retail thefts in the highest risk category than kernel density does. On this metric, our algorithm wins.
#Bar chart
rbind(theft_KDE_sf, theft_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countTheft = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countTheft / sum(countTheft)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)
As I learned from the whole process of building the model, the geospatial risk prediction model borrows the retail theft experience from places where retail theft has been observed and tests whether that experience generalizes to places where the risk is "latent". A useful model is one that is 1) more accurate than the current decision-making process and 2) equally so across areas of the city. The algorithm does uniquely target places with a high density of retail theft points and picks up very localized hotspots; however, I still do not recommend putting it into production, for the following reasons:
First, it is not generalizable enough. The algorithm tends to over-predict in low retail theft areas and under-predict in high retail theft areas, which would confuse police allocation; the under-prediction in high-theft areas is especially severe. Also, in the comparison with kernel density, although the difference in bar heights is large in the high risk categories, it is much smaller in the low risk categories (1% to 29%, 30% to 49%), meaning the algorithm does not predict equally well across risk categories. Misallocating police wastes time and resources, so we need to keep the probability of incorrect allocation as low as possible.
Second, it is not accurate enough. The mean retail theft count is only 0.695 per grid cell, but the errors from all regressions are around 0.756. I think one reason is that a majority of places in Chicago have no retail thefts at all; in other words, there is not much data, which limits the predictive power of my risk factors. The sparse data may also reflect reporting bias: we have already seen a cluster of retail thefts in the Loop, the CBD, and it is possible that even more cases occur there but go unreported because shopkeepers have grown used to theft in their stores.
To conclude, I personally would not recommend putting this algorithm into production, at least for this crime type, retail theft.