My outcome of interest is retail theft. There are many types of theft, such as pocket-picking, purse-snatching, and theft from buildings. For this predictive policing task, I am going to focus on retail theft, which tends to carry more selection bias. I once watched a video by a vlogger named Jerry Kowal, who is from New York. He went to Detroit and recorded a driving tour through its neighborhoods. The neighborhoods shown in the video are very bleak, with abandoned houses and poor sanitation, and what inspired me to focus on retail theft is that a KFC restaurant in one of them even had bullet-proof glass at the cashier counter. Jerry said that walking through the neighborhood made him feel very unsafe, and the bullet-proof glass made sense given the high crime rate. I have also read news stories about theft and robbery in fast food restaurants, and the neighborhoods around those restaurants are similarly bleak and unsafe.
Because retail thefts take place in many kinds of locations, and in order to be specific and match what I took from the video and the news, I selected only retail thefts that took place in a convenience store, grocery food store, small retail store, restaurant, tavern/liquor store, or drug store. These are relatively small-scale establishments that are easy to find and common in neighborhoods. This selection leaves 6,526 of the 10,460 total retail theft cases.
To elaborate on the selection bias of such retail thefts, here is a summary of the characteristics of these locations:
1. They tend to take place in small stores in neighborhoods (this is how I made the selection from the location description; large department stores likely have stronger policing and security).
2. They occur in places with weak policing.
3. They occur in places with poor environments and facilities (dilapidated housing, bad sanitation, poor facades, vacant lots).
4. There is no definite time of occurrence; both daytime and nighttime are possible.
These characteristics also echo the "broken windows" theory.
The following map shows the retail theft points across Chicago, and we can see that retail thefts are not everywhere. Surprisingly, many of them are concentrated in the Loop, which was contrary to my expectation.
crime <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")
theft<-
crime%>%
filter(Primary.Type == "THEFT"&
Description == "RETAIL THEFT"&
Location.Description %in% c ("CONVENIENCE STORE",
"GROCERY FOOD STORE",
"SMALL RETAIL STORE",
"RESTAURANT",
"TAVERN/LIQUOR STORE",
"DRUG STORE")) %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform(102271) %>%
distinct()
#Clip all theft points into the boundary
theft <-
theft[chicagoBoundary,]
#Plot
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = theft, colour="red", size=0.1, show.legend = "point") +
labs(title= "Retail theft, Chicago - 2017") +
mapTheme()
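As a quick check on the 6,526-of-10,460 figure quoted earlier, the counts can be tallied directly (a minimal sketch, reusing the crime and theft objects created above; exact numbers may shift slightly after the NA drop, de-duplication, and boundary clip):
#Total 2017 retail thefts, regardless of location type (about 10,460)
crime %>%
filter(Primary.Type == "THEFT" & Description == "RETAIL THEFT") %>%
nrow()
#Retail thefts kept after restricting to the six small-scale location types (about 6,526)
nrow(theft)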
The following map shows the count of retail thefts by grid cell (cell size of 500). The spatial structure of retail theft begins to take shape here, as the hotspots (closer to yellow) are immediately apparent. The Loop and some areas on the northeast side clearly have high counts. Again, retail thefts are not everywhere in Chicago, so many grid cells contain no counts at all.
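The fishnet itself is assumed to have been created earlier in the analysis; a minimal sketch of how such a 500-unit grid over the city boundary could be built (the exact code used may differ):
#Sketch: build the fishnet over the Chicago boundary (assumed done earlier in the workflow)
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500, square = TRUE) %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
fishnet <- fishnet[chicagoBoundary, ] #keep only grid cells that intersect the city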
#2.2 Data wrangling: Joining theft to the fishnet--[Crime net]
crime_net <-
theft %>%
dplyr::select() %>%
mutate(countTheft = 1) %>%
aggregate(., fishnet, sum) %>% #aggregate joins thefts to the fishnet, using the sum operation to total countTheft per grid cell
mutate(countTheft = ifelse(is.na(countTheft), 0, countTheft),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE)) #Roughly nrow(fishnet)/24 cvIDs are generated to allow random k-fold cross validation below
ggplot() +
geom_sf(data = crime_net, aes(fill = countTheft)) +
scale_fill_viridis() +
labs(title = "Count of retail theft for the fishnet",
subtitle = "Gray area means no data/no retail thefts") +
mapTheme()
Besides the six given risk factors (abandoned cars, street lights out, graffiti remediation, sanitation complaints, abandoned buildings, and retail stores that sell liquor to go), I added two additional risk factors from 311 requests: rodent baiting and garbage carts. Rodent baiting indicates rat damage, and garbage carts indicate carts that need to be replaced or maintained. Both new risk factors also focus on the physical environment.
From the retail theft points, we have noticed that many cases take place in the Loop, Chicago's downtown. I would argue that downtown is not always in good physical condition either; there may still be plenty of trash on the streets, abandoned buildings, and other problems. Whether the current risk factors actually help predict retail theft is something the regression results later in this assignment will show.
#2.3 Data wrangling risk factors
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ == "Front" |
where_is_the_graffiti_located_ == "Rear" | where_is_the_graffiti_located_ == "Side") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
#Rat/Rodent baiting
rat <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs/data") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Rodent baiting") #50921 entries
#Garbage cart
GarbageCart <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Garbage cart") #52837 entries
The following facet map shows the points for all eight risk factors:
#Mapped factors out!
ggplot() +
geom_sf(data=chicagoBoundary) +
geom_sf(data = rbind(abandonCars,
streetLightsOut,
abandonBuildings,
liquorRetail,
graffiti,
sanitation,
rat,
GarbageCart),
size = .1) +
facet_wrap(~Legend, ncol = 4) +
labs(title = "Risk Factors") +
mapTheme()
Feature engineering for the eight risk factors involves three major parts: the count of features by grid cell, the average nearest neighbor distance from each grid cell to nearby risk factor points, and the distance to the centroid of the Loop.
The two new risk factors illustrate slightly different spatial processes: rodent baiting requests cluster mainly in the north and northeast of Chicago, while garbage cart requests appear in both the north and the south.
#3.1 Feature engineering - Count of features by grid cell
#The first simply sums the number of a given risk factor's points occurring in a given grid cell.
#This is done by binding together the individual risk factor layers;
#then spatial joining (st_join) the fishnet to each point.
#The outcome is a large point layer with a column for each fishnet uniqueID.
#The layer is then converted to a data frame with st_set_geometry.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail,graffiti,sanitation,rat,GarbageCart) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>% # converted to a data frame with st_set_geometry
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>% #full_join adds the fishnet geometries
spread(Legend, count, fill=0) %>% #spread converts to wide form
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
#let’s map each count feature as part of a small multiple map
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable) #vector of Variable names
mapList <- list() #empty list
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet: Count of features by grid cell"))
For nearest neighbor distance, one common pattern is that the southernmost part of Chicago has the largest distances to all eight risk factors (more yellow), while a majority of the city is very close to them (darker blue).
#3.2 Feature engineering - Nearest neighbor features
#NND function
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
#Calculate NND
#These features are created by converting the vars_net grid cells to centroid points, then measuring the average distance to the 3 nearest points of each risk factor
vars_net$Abandoned_Buildings.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonBuildings), 3)
vars_net$Abandoned_Cars.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(abandonCars), 3)
vars_net$Graffiti.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(graffiti), 3)
vars_net$Liquor_Retail.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)
vars_net$Street_Lights_Out.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)
vars_net$Sanitation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)
vars_net$Rat.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(rat), 3)
vars_net$GarbageCart.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(GarbageCart), 3)
#Mapped them out!
vars_net.long.nn <-
vars_net %>%
dplyr::select(ends_with(".nn")) %>% #use of select and ends_with to map only the newly generated nearest neighor features
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =4, top = "Risk Factors by Fishnet: Nearest Neighbor Distance"))
It also makes sense to measure distance to a single point, such as the centroid of the Loop, Chicago's central business district.
#3.3 Feature Engineering - Measure distance to one point
#It also may be reasonable to measure distance to a single point,
#like the centroid of the Loop neighborhood - Chicago's Central Business District.
#This is done with st_distance.
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis() +
labs(title="Euclidean distance to The Loop") +
mapTheme()
Finally, a final net is created by combining the engineered features with the crime net and joining on neighborhood names and police districts. The facet below shows these aggregate areas (police districts and neighborhoods) rendered on the fishnet.
#3.4 Feature Engineering - Create the final_net
#crime_net and vars_net layers are joined into a final dataset
final_net <-
left_join(crime_net, st_set_geometry(vars_net, NULL), by="uniqueID")
final_net$countTheft[is.na(final_net$countTheft)] <- 0
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>% #Name for neighborhoods
st_join(., dplyr::select(policeDistricts, District)) %>% #District for police district
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
dplyr::select(final_net, name, District) %>%
gather(Variable, Value, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Value)) +
facet_wrap(~Variable) +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Aggregate Areas",
subtitle="Not everywhere has retail theft!") +
mapTheme() + theme(legend.position = "none")
The local spatial structure of retail theft (by grid cell) is quantified and visualized with Local Moran's I.
Very high values of Local Moran's I represent strong, statistically significant evidence of local clustering, so this test provides insight into the scale and intensity of the retail theft spatial process. From "Local_Morans_I" in the facet, retail thefts are highly clustered in the Loop, which occupies a very small portion of Chicago; the rest of the city shows very low (near-zero) clustering.
The "Significant_Hotspots" panel tells a slightly different story: not only is the Loop a significant hotspot, but a large area north of the Loop, some areas south of it, and parts of the West Side are significant retail theft hotspots as well.
#4.1 Exploring the spatial structure of retail theft
#The code block below creates a neighbor list, poly2nb, and the spatial weights matrix,
#final_net.weights, using queen contiguity, meaning that every grid cell is related to its eight adjacent neighbors.
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
#Moran's I
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countTheft, final_net.weights)),
as.data.frame(final_net, NULL)) %>%
st_sf() %>%
dplyr::select(Theft_Count = countTheft,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I statistics, Retail theft"))
To isolate highly significant hotspots, I use a threshold far stricter than 0.05: p-value <= 0.0000001. The distance from each grid cell to its nearest significant cluster is then measured and mapped below. This distance feature makes a strong assumption about the local spatial process: that retail theft dissipates linearly outward from the highly significant hotspots.
#Flag highly significant hotspots and compute distance to them
final_net <-
final_net %>%
mutate(theft.isSig = ifelse(localmoran(final_net$countTheft, #Dummy variable
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(theft.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, theft.isSig == 1))), 1 ))
#measures distance from each grid cell to its nearest significant cluster
ggplot() +
geom_sf(data = final_net, aes(fill = theft.isSig.dist)) +
scale_fill_viridis() +
labs(title = "Distance to highly significant local retail theft hotspots") +
mapTheme()
From the scatterplots, the correlation between retail theft count and garbage cart count is very small (r = 0.02), while the correlation with rodent baiting count is more conspicuous (r = 0.16): as the count of rodent baiting requests in a grid cell increases, so does the count of retail thefts, and as the distance to rodent baiting decreases, the count of retail thefts increases.
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
gather(Variable, Value, -countTheft)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countTheft, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countTheft)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Retail theft count as a function of risk factors")
The distribution of retail theft is highly skewed; most grid cells have no retail thefts at all, since crime is a relatively rare event. When the dependent variable is a skewed count like this, Poisson regression is appropriate.
ggplot(final_net, aes(countTheft)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of retail theft by grid cell")
Cross-validated Poisson regression is applied in two forms: random k-fold and LOGO (leave-one-group-out) cross-validation. For each cross-validation type, two groups of features are fit: "Just Risk Factors" and "Spatial Structure".
The facet below maps the predictions for each regression. The models with the spatial structure features clearly do a better job of picking up the very localized hotspots.
#5.2 Cross-validated Poisson regression
#Just risk factors
reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn","loopDistance",
"Rat.nn","GarbageCart.nn")
#risk factors plus the Spatial Structure
reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", "Liquor_Retail.nn",
"Street_Lights_Out.nn", "Sanitation.nn", "loopDistance",
"Rat.nn","GarbageCart.nn",
"theft.isSig", "theft.isSig.dist")
#Cross validate function
#Note: the dependent variable countTheft is hardwired into the glm call inside this function.
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countTheft~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
#Run the cross-validation for each regression
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countTheft",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countTheft, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countTheft",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countTheft, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countTheft",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countTheft, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countTheft",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countTheft, Prediction, geometry)
#5.2 Accuracy & Generalzability
reg.summary <-
rbind(
mutate(reg.cv, Error = countTheft - Prediction,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = countTheft - Prediction,
Regression = "Random k-fold CV: Spatial Structure"),
mutate(reg.spatialCV, Error = countTheft - Prediction,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = countTheft - Prediction,
Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
st_sf()
#Plot
grid.arrange(
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Predicted retail theft by Regression") +
mapTheme() + theme(legend.position="bottom"),
filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
ggplot() +
geom_sf(aes(fill = countTheft)) +
scale_fill_viridis() +
labs(title = "Observed retail theft\n") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
The maps below show raw prediction errors (not absolute errors) for both cross-validation types and both feature groups. Without the local spatial structure, many of the hotspots are under-predicted (high positive errors), leaving clustered errors at the hotspots.
Again, the models with the spatial structure features do a better job: judging from the colors, their errors tend to be smaller.
#Plot a small multiple map of model errors by random k-fold and spatial cross validation
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Error)) +
facet_wrap(~Regression,ncol=2) +
scale_fill_viridis() +
labs(title = "Errors of retail theft by Regression",
subtitle="Regression: Random k-fold and spatial cross validation") +
mapTheme() + theme(legend.position="bottom")
Interpreting the MAE and its standard deviation for each regression type, the errors average roughly 0.756 retail thefts per grid cell. For context, the mean retail theft count is only 0.695, so the average error exceeds the average count, and the overall accuracy of the predictions is not very satisfactory relative to the actual counts.
The table does, however, reconfirm that the regressions with the spatial structure features are more accurate and more generalizable; those features do a better job of predicting the hotspots.
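For reference, the baseline mean count cited above can be checked directly (a quick sketch):
#Mean observed retail theft per grid cell, for comparison against the MAEs below
mean(final_net$countTheft) #about 0.695 in this fishnet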
#MAE and the standard deviation of MAE across each fold for each regression type
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - countTheft), na.rm = T),3),
SD_MAE = round(sd(abs(Prediction - countTheft), na.rm = T),3)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.777 | 1.350 |
| Random k-fold CV: Spatial Structure | 0.733 | 1.230 |
| Spatial LOGO-CV: Just Risk Factors | 0.780 | 1.352 |
| Spatial LOGO-CV: Spatial Structure | 0.734 | 1.220 |
The facet below provides another way to see how the spatial structure variables improve predictions. Each facet represents a regression type, and the observed retail theft counts are split into ten deciles, from lowest to highest. For random k-fold CV, the spatial structure features provide marginal improvements in the 1st through 8th deciles; for LOGO-CV, the improvement is only visible in the 5th through 7th deciles. In the highest-theft grid cells, however, the spatial structure features reduce errors by more than half for both cross-validation types.
All models over-predict in the 1st through 7th deciles and under-predict in the 8th through 10th deciles; in other words, we over-predict in low retail theft areas and under-predict in high retail theft areas.
#Visualizing how the spatial structural variables provide better predictions.
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
mutate(theft_Decile = ntile(countTheft, 10)) %>%
group_by(Regression, theft_Decile) %>%
summarize(meanObserved = mean(countTheft, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -theft_Decile) %>%
ggplot(aes(theft_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = theft_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed retail theft by observed retail theft decile")
To check generalizability with respect to racial context, I followed the tutorial and used the tidycensus package to download ACS 5-year estimates of the white share of the population. Tracts are then divided into two categories: Majority White (more than half of the population is white) and Majority Non-White. Since the models with spatial structure performed better, I compare the random k-fold and spatial LOGO-CV regressions that include the spatial structure features.
The table below shows the mean error by regression and racial context. The models over-predict in Majority_Non_White neighborhoods and under-predict in Majority_White neighborhoods. However, both mean errors are small (about +/- 0.05), so I would argue that generalizability across racial contexts is fair.
library(tidyverse)
library(tidycensus)
library(viridis)
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2017, state=17, county=031, geometry=T) %>%
st_transform(102271) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
#TABLE: A table of raw errors by race context for random k-fold vs. spatial cross validation regressions.
final_reg <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure" |
Regression == "Random k-fold CV: Spatial Structure") %>%
mutate(uniqueID = rownames(.))
final_reg.tracts <-
st_join(st_centroid(final_reg), tracts17) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_reg, uniqueID)) %>%
st_sf() %>%
na.omit()
st_set_geometry(final_reg.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Random k-fold CV: Spatial Structure | 0.0466962 | -0.0504453 |
| Spatial LOGO-CV: Spatial Structure | 0.0468787 | -0.0501568 |
Next, the algorithm is compared with traditional kernel density hotspot mapping. First, I built a kernel density estimate with a 1,000 ft. search radius and overlaid 1,500 sampled retail theft points on top.
#Kernel density
library(raster)
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD.1000 <- spatstat::density.ppp(theft_ppp, 1000)
theft_KD.1500 <- spatstat::density.ppp(theft_ppp, 1500)
theft_KD.2000 <- spatstat::density.ppp(theft_ppp, 2000)
theft_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(theft_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
theft_KD.df$Legend <- factor(theft_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
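The theft_KD.df data frame above holds all three search radii but is not mapped in this section; a small-multiple sketch in the style of the other maps (assuming the x, y, and layer columns produced by rasterToPoints):
#Sketch: facet map of kernel density at the three search radii
ggplot(data = theft_KD.df, aes(x = x, y = y)) +
geom_raster(aes(fill = layer)) +
facet_wrap(~Legend) +
coord_sf(crs = st_crs(final_net)) +
scale_fill_viridis(name = "Density") +
labs(title = "Kernel density with three different search radii") +
mapTheme()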
#creates a kernel density with a 1000 foot search radius
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD <- spatstat::density.ppp(theft_ppp, 1000)
as.data.frame(theft_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(theft, 1500), size = .5) + #1500 Points
scale_fill_viridis() +
mapTheme()
Below are maps of the risk categories for both model types, with test set points overlaid. A well-fit model should show the highest risk category uniquely targeting places with a high density of retail theft points. By that standard, the risk prediction algorithm does the better job of uniquely targeting high-density areas; it is more precise and specific than kernel density.
#whether the risk predictions capture more of the observed retail thefts relative to the kernel density
# Compute kernel density
theft_ppp <- as.ppp(st_coordinates(theft), W = st_bbox(final_net))
theft_KD <- spatstat::density.ppp(theft_ppp, 1000)
# Convert kernel density to grid cells taking the mean
theft_KDE_sf <- as.data.frame(theft_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fishnet.
bind_cols(
aggregate(
dplyr::select(theft) %>% mutate(theftCount = 1), ., length) %>%
mutate(theftCount = replace_na(theftCount, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, theftCount)
#Prediction from the LOGO-CV with the spatial features
#theft_risk_sf
theft_risk_sf <-
filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(theft) %>% mutate(theftCount = 1), ., length) %>%
mutate(theftCount = replace_na(theftCount, 0))) %>%
dplyr::select(label,Risk_Category, theftCount)
#plot
rbind(theft_KDE_sf, theft_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(theft, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Last but not least, the rate of retail theft points captured by risk category and model type is calculated and plotted. From the bar chart below, the risk prediction captures a much greater share of observed retail thefts in the highest risk category than kernel density does. On this metric, our algorithm wins.
#Bar chart
rbind(theft_KDE_sf, theft_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countTheft = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countTheft / sum(countTheft)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)
As I learned from the whole process of building the model, the geospatial risk prediction model borrows the retail theft experience from places where retail theft has been observed and tests whether that experience generalizes to places where the risk is "latent". A useful model is one that is 1) more accurate than the current decision-making process and 2) equally so across areas of the city. The algorithm does uniquely target places with a high density of retail theft points and picks up very localized hotspots; however, I still do not recommend putting it into production, for the following reasons:
First, it is not generalizable enough. The algorithm tends to over-predict in low retail theft areas and under-predict in high retail theft areas, which would confuse police allocation; the under-prediction in high-theft areas is especially severe. Also, in the comparison with kernel density, although the difference in bar heights is large in the high risk categories, it is much smaller in the low risk categories (1% to 29%, 30% to 49%), meaning the algorithm does not predict equally well across risk categories. Misallocating police wastes time and resources, so we need to keep the probability of incorrect allocation as low as possible.
Second, it is not accurate enough. The mean retail theft count is only 0.695 per grid cell, but the errors from all regressions are around 0.756. I think one reason is that a majority of places in Chicago have no retail thefts at all; in other words, there is not much data, which limits the predictive power of my risk factors. The sparse data may also reflect reporting bias: we have already seen a cluster of retail thefts in the Loop, the CBD, and it is possible that even more cases occur there but go unreported because shopkeepers have grown used to theft in their stores.
To conclude, I personally would not recommend putting this algorithm into production, at least for this crime type, retail theft.