Introduction

Virginia Beach is the most populous city in Virginia with approximately 450,980 residents. Virginia Beach Emergency Medical Services (VBEMS) operates the nation’s largest volunteer based rescue system. Ten volunteer rescue squads work together with 56 career staff to provide emergency healthcare services. Over 800 active volunteer members staff ambulances, specialty teams, and help behind the scenes.

With the increasing population and demand of public resources, improving EMS efficiency is also crucial. The ultimate goal of our project is providing the mechanism for and designing an APP for ambulance drivers to figure out WHERE they should hang out to reduce response time, and give a sense about WHEN an EMS call will likely be. Currently in Virginia Beach, there are 22 rescue stations in the City. Having such an APP to predict when and where EMS calls will be made would help to assist the allocation of EMS services and improve the overall EMS efficiency. Based on predictions, for future strategies, the VBEMS could also choose to increase the number of ambulances in each of the 22 rescue stations or establish new rescue stations based on hotspots of EMS calls as well.

We used Poisson Regression to predict WHERE EMS calls will be, and linear space-time model to predict WHEN EMS calls will be. Poisson Regression using fishnet can be more suitable to predict the location of potential EMS calls, which narrows down the scope into grid cells, while space-time model takes time variables into consideration so that it can tell us the hotspots of time of EMS calls. Unlike other demand such as parking or bike share, EMS demand seems to be random because EMS is contingent to a variety of reasons such as individual physical condition, weather, accidents, crimes, etc, in a word, we don’t know when and where something will happen. We strive to use existing EMS calls and a bunch of variables to investigate a pattern of EMS demand based on existing conditions, and we hope our model and APP can help VBEMS to improve their efficiency and better allocation of ambulances.

Below is the setup code:

library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
options(tigris_use_cache = TRUE) 
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gifski)
library(png)
library(ggsn)
library(gganimate)
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(car) #test colinarity
library(ggplot2)
library(dplyr)
library(gmodels)
library(MASS)
library(e1071)
library(Hmisc)
library(Boruta)
library(corrplot)
library(viridis)
library(stargazer)
library(tigris)
library(GoodmanKruskal)
library(spatstat)
library(riem)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette7<-c("#ffd700",
             "#ffb14e",
             "#fa8775",
             "#ea5f94",
             "#cd34b5",
             "#9d02d7",
             "#0000ff")
palette6 <- c("#ffd700",
              "#ffb14e",
              "#fa8775",
              "#ea5f94",
              "#cd34b5",
              "#9d02d7")
palette5 <- c("#ffd700",
              "#ffb14e",
              "#fa8775",
              "#ea5f94",
              "#cd34b5")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

Data

We used the data from City of Virginia Beach portal-EMS calls for services. The dataset includes EMS call records from 7/2/2010 to 2/27/2018, total 326,610 records. Due to such a large volume of data and in order to remain the completeness regarding time, for our project, we chose to use the EMS calls in the whole year of 2017, which includes 44,460 EMS calls. We assume that EMS may be related to seasons, for example, winter may be the high peak time for cardiovascular diseases for the eldly, so we would like to take a whole year record for our model.

The average EMS calls per day is 121 in 2017, and according to VBEMS 2018 Annual Report, the average chute time (United Notified - Unit EnRoute) is 00:01:12, and the average scene response time (Unit EnRoute - Unit Arrived on Scene) is 00:07:54 in 2017.

The components of the EMS calls data from the City are as follows:

require(readr)
require(readxl)
metadata <- read_excel("/Users/ranxin/Desktop/590 final/EMS_description.xlsx")
kable(metadata,
      caption = "EMS calls for services data description") %>%
  kable_styling("striped", full_width = F)
EMS calls for services data description
Column Type Label Description
EMS Call Number text EMS Call Number The computer-aided dispatch (CAD) system generates a series of numbers that uniquely identifies an incident.
Block Address text Block Address This is the 100- block number, street name, City, and zip code of the location of the incident.
City text City Virginia Beach
State text State Virginia
Call Priority numeric Call Priority The priority category that the call is classified under. There are three priority classifications.
NA NA NA Priority 1 – Based on criteria met during emergency medical dispatch screening, the patient’s condition has an increased likelihood of being life-threatening and/or emergent.
NA NA NA Priority 2 – Based on criteria met during emergency medical dispatch screening, the information provided indicates the patient’s condition is potentially stable and not as emergent as Priority 1.
NA NA NA Priority 3 – Based on criteria met during emergency medical dispatch screening, the information provided indicates a stable patient and that a delayed response (if necessary) will not adversely impact patient outcome.
Rescue Squad Number text Rescue Squad Number The EMS rescue squad number of the unit responding to the incident.
Call Date and Time timestamp Call Date and Time The date and time the call was received.
Entry Date and Time timestamp Entry Date and Time The time the call was entered into the computer-aided dispatch (CAD) system.
Dispatch Date and Time timestamp Dispatch Date and Time The time the call was initially dispatched.
En route Date and Time timestamp En route Date and Time Represents the first en route time captured. If multiple units are responding, this time represents the first unit classified as en route.
On Scene Date and Time timestamp On Scene Date and Time The time the first unit arrives on scene.
Close Date and Time timestamp Close Date and Time Time that the call is concluded.

The following codes import and select 2017 EMS call data:

#I. Import EMS calls data
#Read geocoded EMS shapefile
EMS_shp<- st_read("/Users/ranxin/Desktop/590 final/Street_Ranges/EMS_point.shp")%>%
  st_set_crs(4326) %>%
  st_transform(2284)

#Read EMS csv
EMS<-read.csv('EMS_new1.csv')

#Change the column name for match (left join)
names(EMS_shp)[names(EMS_shp) == 'EMSCallNum'] <- 'EMSCallNumber'

#left join based on call number
map_and_data <- dplyr::left_join(EMS, EMS_shp, by='EMSCallNumber')
#write.csv(map_and_data,'EMS_with_coordinates.csv')

#Remove 0 cooridnates, and turn to geodataframe
sf1<-
  map_and_data%>%
  filter(X!= 0 & Y!=0) %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform(2284)

#Read Virginia Beach boudnary
VB<- 
  st_read("https://opendata.arcgis.com/datasets/82ada480c5344220b2788154955ce5f0_0.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2284)

#Clip geodataframe into VB boundary
EMS_dat<- 
  sf1[VB,]

#Select useful columns and Standardize time data
EMS_dat2 <- EMS_dat %>%
  dplyr::select(EMSCallNumber,
                BlockAddress,
                City.x,
                State.x,
                CallPriority,
                RescueSquad.Number,
                CallDateandTime,
                DispatchDateandTime,
                EnrouteDateandTime,
                OnSceneDateandTime,
                CloseDateandTime,
                DisplayX,
                DisplayY)%>%
  mutate(CallDateandTime = mdy_hm(CallDateandTime),
         DispatchDateandTime = mdy_hm(DispatchDateandTime),
         EnrouteDateandTime = mdy_hm(EnrouteDateandTime),
         OnSceneDateandTime = mdy_hm(OnSceneDateandTime),
         CloseDateandTime = mdy_hm(CloseDateandTime))

#Select only 2017 data
Date1<-as.Date("2017-01-01 00:00:00")  
Date2<-as.Date("2017-12-31 23:59:59")
# Creating a data range with the start and end date:
dates <- seq(Date1, Date2, by="days")
#Subset to get only 2017**********
EMS_2017 <- subset(EMS_dat2, as.Date(CallDateandTime) %in% dates)

And below two graphs are the EMS call points and the density of points, we can see EMS calls are clustered in east and middle of the City:

#EMS calls points 
grid.arrange(
ggplot() +
  geom_sf(data=VB,fill = "grey55") +
  geom_sf(data = EMS_2017,colour = "blue", size = .30)+
  labs(title = "EMS calls for services 2017") +
  mapTheme(),

#Density of 2017 EMS calls points
ggplot() + 
  geom_sf(data = VB, fill = "grey55") +
  stat_density2d(data = data.frame(st_coordinates(EMS_2017)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of EMS calls 2017") +
  mapTheme(),
ncol = 2)

Below is all features we used to build our prediction models:

metadata2 <- read_excel("/Users/ranxin/Desktop/590 final/EMS_all data.xlsx")
kable(metadata2,
      caption = "List of all features in the models") %>%
  kable_styling("striped", full_width = F)
List of all features in the models
Category Data Type Source
Demographics Median household income Number ACS 2017
Demographics Median age Number ACS 2018
Demographics Percent of aging population Number ACS 2019
Demographics Percent of car ownership Number ACS 2020
Demographics Percent of population with health insurance Number ACS 2021
Weather Precipitation Number Norfolk International Airport (ORF)
Weather Temperature Number Norfolk International Airport (ORF)
Weather Wind speed Number Norfolk International Airport (ORF)
Facilities Distance to industrial places Number VB GIS data portal, NND calculation
Facilities Distance to town center Number VB GIS data portal, NND calculation
Facilities Distance to housing (using condos) Number VB GIS data portal, NND calculation
Spatial If in noise zone (1-Yes, 0-No) Dummy VB GIS data portal
Spatial Census tract GEOID ACS 2017
Spatial Council district Text VB GIS data portal
Time lag lagHour Time lag Computation in R
Time lag lag2Hours Time lag Computation in R
Time lag lag3Hours Time lag Computation in R
Time lag lag4Hours Time lag Computation in R
Time lag lag12Hours Time lag Computation in R
Time lag lag1day Time lag Computation in R
Time lag holidaylag Time lag Computation in R
Time Hour Number in EMS dataset
Time Day of week Number in EMS dataset

Weather

The weather we looked at is from Norfolk International Airport (ORF), we chose to look at precipitation, wind speed and temperature. Below is the weather trend in 2017:

#II. IMPORT VARIABLES

#1. Weather

VB_weather.Panel <- 
  riem_measures(station = "ORF", date_start = "2017-01-01", date_end = "2017-12-31") %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  dplyr::summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#Plot!!!!!
grid.arrange(
  ggplot(VB_weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
    labs(title="Percipitation", x="Hour", y="Percipitation") + plotTheme(),
  ggplot(VB_weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
  ggplot(VB_weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - Virginia Beach - 2017")

Socio-economic

We assume that the locations of EMS calls may be related with local demographic features, so we found several socio-economic features in census tract level: Median household income, median age, percent of aging population, percent of population with car, percent of population with health insurance. For example, it is possible that more EMS calls may take place in tracts with higher aging population share or with higher median age.

The following code imports socio-economic data in census tract level:

census_api_key("b247934399fbe16edeb8ffd74eda6189874d35a1", overwrite = TRUE,install=TRUE)
acs_variable_list.2017 <- load_variables(2017, #year
                                         "acs5", #five year ACS estimates
                                         cache = TRUE)
acs_vars <- c("B01001_001E", #Total pop
              "B01001A_001E",
              "B06011_001E", #Median household income
              "B25046_001E", #Car
              #Male pop by age
              "B01001_020E",
              "B01001_021E",
              "B01001_022E",
              "B01001_023E",
              "B01001_024E",
              "B01001_025E",
              #Female pop by age
              "B01001_044E",
              "B01001_045E",
              "B01001_046E",
              "B01001_047E",
              "B01001_048E",
              "B01001_049E",
              #Median age
              "B01002_001E",
              #Male with insurance by age
              "B27001_004E",
              "B27001_007E",
              "B27001_010E",
              "B27001_013E",
              "B27001_016E",
              "B27001_019E",
              "B27001_022E",
              "B27001_025E",
              "B27001_028E",
              #Female with insurance by age
              "B27001_032E",
              "B27001_035E",
              "B27001_038E",
              "B27001_041E",
              "B27001_044E",
              "B27001_047E",
              "B27001_050E",
              "B27001_053E",
              "B27001_056E")

acsTractsVB.2017 <- get_acs(geography = "tract", # What is the lowest level of geography are we interested in?
                            year = 2017, # What year do we want - this can also be used for 2000 census data
                            variables = acs_vars, # let's use our variables we specified above
                            geometry = TRUE, # Do we want this as a shapefile? Yes!
                            state = "VA", # What state?
                            county = "Virginia Beach", # What County?
                            output = "wide")

acsTractsVB.2017 <- acsTractsVB.2017 %>%
  dplyr::select(acs_vars, NAME, GEOID) %>%
  dplyr::rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E,
         Car_Ownership = B25046_001E,
         #Male pop by age
         Male_65_66 = B01001_020E,
         Male_67_69 = B01001_021E,
         Male_70_74 =B01001_022E,
         Male_75_79 =B01001_023E,
         Male_80_84 =B01001_024E,
         Male_85_over =B01001_025E,
         #Female pop by age
         Female_65_66=B01001_044E,
         Female_67_69=B01001_045E,
         Female_70_74=B01001_046E,
         Female_75_79=B01001_047E,
         Female_80_84=B01001_048E,
         Female_85_over=B01001_049E,
         #Median age
         median_age=B01002_001E) %>%
  mutate(CarOwn_Percent = Car_Ownership / TotalPop,
         percent_pop65over = (Male_65_66+ 
                              Male_67_69+
                              Male_70_74+
                              Male_75_79+
                              Male_80_84+
                              Male_85_over+
                              Female_65_66+
                              Female_67_69+
                              Female_70_74+
                              Female_75_79+
                              Female_80_84+
                              Female_85_over)/TotalPop,
         percent_popInsurance=(B27001_004E+
                              B27001_007E+
                              B27001_010E+
                              B27001_013E+
                              B27001_016E+
                              B27001_019E+
                              B27001_022E+
                              B27001_025E+
                              B27001_028E+
                              
                              B27001_032E+
                              B27001_035E+
                              B27001_038E+
                              B27001_041E+
                              B27001_044E+
                              B27001_047E+
                              B27001_050E+
                              B27001_053E+
                              B27001_056E)/TotalPop,
         percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

Below are each socio-economic feature visualized in census tract level, we can clearly see about two census tracts are missing the data:

#Plot: Median age
grid.arrange(
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(acsTractsVB.2017), aes(fill = median_age)) +
  labs(title = "Median age of population") +
  scale_fill_gradient(low = "#EE92ED", high = "#6800A4", name = "Age") +
  mapTheme(),

#Plot: Percent of population aged over 65
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(acsTractsVB.2017), aes(fill = percent_pop65over)) +
  labs(title = "Percent of population aged over 65") +
  scale_fill_gradient(low = "#EE92ED", high = "#6800A4", name = "Percent") +
  mapTheme(),

#Plot: Percent of population with health insurance
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(acsTractsVB.2017), aes(fill = percent_popInsurance)) +
  labs(title = "Percent of population with health insurance") +
  scale_fill_gradient(low = "#FFFFAF", high = "#A54900", name = "Percent") +
  mapTheme(),

#Plot: Median Household Income
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(acsTractsVB.2017), aes(fill = Median_Income)) +
  labs(title = "Median Household Income") +
  scale_fill_gradient(low = "#C1DED9", high = "#142623", name = "$") +
  mapTheme(),

#Plot: Car ownership per person
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(acsTractsVB.2017), aes(fill = Car_Ownership)) +
  labs(title = "Population with car(s)") +
  scale_fill_gradient(low = "#0B9ED9", high = "#080E33", name = "Percent") +
  mapTheme(),
ncol=3)

Facilities

We also assumed that more EMS calls may be from where are close to industrial areas (maybe environmental issues) and existing housing (more clusters of residents), and we also would like to investigate how far each EMS call is from town center.

Thus, we calculated the Nearest-Neighbor Distance (NND) from each EMS call point to existing housing (we used condos), industrial area and town center. Below is the code for importing the facilities data, turning facility polygons to centroid (points) and calculating NND:

#3. FACILITIES
#Read industrial (including non-industrial town center)
industrial<- 
  st_read("https://opendata.arcgis.com/datasets/89fde1d3a1a84ab68cedf851fe2d6cbd_1.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2284)

#Get centroid for all
industrial$centroid <- st_centroid(industrial$geometry)

#3.1 Town Center
#Get the point of town center! ****
towncenter<-
  industrial %>%
  filter(IND_PARK_N == "Town Center")

#3.2 Industrial
#Get the point of ACTUAL industrial places****
industrial_area<-
  industrial %>%
  filter(IND_PARK_N != "Town Center")

#Plot actual industrial place centroids
industrial_area_centroid <- industrial_area$centroid


#3.3 Condos/housing (points)
condos<- 
  st_read("https://opendata.arcgis.com/datasets/759ad66064974eab9a556d3a90efa1a1_18.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2284)

#===========================================
#Get NND to facilities from each EMS call

#FUNCTION
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

#NND to industrial area
EMS_2017$industrial.nn =
  nn_function(st_coordinates(EMS_2017), st_coordinates(st_centroid(industrial_area)), 1)

#NND to condo housing (need centroid!)
EMS_2017$housing.nn =
  nn_function(st_coordinates(EMS_2017), st_coordinates(st_centroid(condos)), 3)

#NND to town center
EMS_2017$towncenter.nn =
  nn_function(st_coordinates(EMS_2017), st_coordinates(st_centroid(towncenter)), 1)

Spatial characteristics

For spatial characteristics, we looked at the Council districts, census tracts and if the places is in Aircraft Noise Level Zone. The following code shows how to import spatial charcateristics and spatial join:

#4. SPATIAL
#4.1 Read NOISE (polygon/area)
noise<- 
  st_read("https://opendata.arcgis.com/datasets/3088488c6c284f3981d61b9efe9c1d04_3.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2284)

#Clip into VB boundary
noise <- 
  st_intersection(noise, VB)

#Noise zone: set dummy variable (0/1)
noise$if_noise<-1

#Spatial join
EMS_2017 <-
  EMS_2017 %>%
  st_join(dplyr::select(noise,if_noise)) #spatial join

EMS_2017$if_noise[is.na(EMS_2017$if_noise)]<-0 #turn NA to 0: 0-non-noise zone, 1-noise zone


#4.2 Read COUNCIL DISTRCITS
council<-
  st_read("https://opendata.arcgis.com/datasets/7cae00e39a684380b7aa912b3d6c9d55_2.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2284)

#Clip into VB boundary
council <- 
  st_intersection(council, VB)

#Spatial join
EMS_2017 <-
  EMS_2017 %>%
  st_join(dplyr::select(council,NAME))

#Change column name
names(EMS_2017)[names(EMS_2017) == 'NAME'] <- 'council_district'


#4.3 Census tracts:
#Create an sf object that consists only of GEOIDs and geometries
VBTracts <- 
  acsTractsVB.2017 %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf

The following graphs show the council districts, census tracts and highlighted Aircraft Noise Level Zone:

grid.arrange(
ggplot() +
  geom_sf(data = VBTracts, fill = "grey70") +
  labs(title = "Census tracts, City of Virginia Beach") +
  mapTheme(),

ggplot() +
  geom_sf(data = council, fill='grey70') +
  labs(title = "Council districts, City of Virginia Beach") +
  mapTheme(),

ggplot() +
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = noise, fill = "red") +
  labs(title = "Aircraft Noise Level Zone, City of Virginia Beach",
       subtitle="Noise zone highlighted in red") +
  mapTheme(),
ncol=3)

Time

For time, we looked at day of week and hour to explore if there is a regular pattern for the number of EMS calls. Below is the code for getting day of the week and hour for each EMS call:

#5. TIME
EMS_2017<- EMS_2017 %>%
  mutate(interval60 = floor_date(ymd_hms(CallDateandTime), unit = "hour"),
         interval15 = floor_date(ymd_hms(CallDateandTime), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

Space-time panel

We created a space-time panel to make sure each unique **origin tract* and hour/day combo exists in our dataset. This is done in order to create a “panel” (e.g. a time-series) data set where each time period in the study is represented by a row - whether an observation took place then or not. So if a tract didn’t have any EMS call originating from it at a given hour, we still need a zero in that spot in the panel.

Below is the code showing how our panel is created:

#ADD the spatial information to our data (census tract) first
VB_dat_census <- st_join(EMS_2017 %>% 
                               filter(is.na(DisplayX) == FALSE &
                                        is.na(DisplayY) == FALSE),
                            VBTracts %>%
                            st_transform(crs=2284),
                             join=st_intersects,
                             left = TRUE) %>%
  dplyr::rename(Origin.Tract = GEOID) %>% #Origin.Tract
  mutate(from_longitude = unlist(map(geometry, 1)), #get long column
         from_latitude = unlist(map(geometry, 2)))%>%  #get lat column
  as.data.frame() %>%
  dplyr::select(-geometry)

#V. TIME-SPACE PANEL
#Step 1: Determining the maximum number of combinations
length(unique(VB_dat_census$interval60)) * length(unique(VB_dat_census$Origin.Tract))

#Step 2: Compare that to the actual number of combinations
VB_study.panel <- 
  expand.grid(interval60=unique(VB_dat_census$interval60), 
              Origin.Tract= unique(VB_dat_census$Origin.Tract))%>%
  left_join(., VB_dat_census %>%
              dplyr::select(Origin.Tract, 
                            from_longitude, 
                            from_latitude)%>%
                            distinct())

#Step 3: Create the full panel by summarizing counts by rescue squads for each time interval, 
#keep census info and lat/lon information along for joining later to other data. 
#We remove data for Origin.Tract that are FALSE

VB_EMS.panel <- 
  VB_dat_census%>% 
  mutate(EMS_Counter = 1) %>%
  right_join(VB_study.panel) %>% 
  dplyr::group_by(
        interval60, 
        Origin.Tract) %>%
  dplyr::summarize(EMS_Count= sum(EMS_Counter, na.rm=TRUE)) %>%
  left_join(VB_weather.Panel) %>%
  ungroup() %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE))%>%
  filter(is.na(Origin.Tract) == FALSE)

VB_EMS.panel<- 
  left_join(VB_EMS.panel, acsTractsVB.2017 %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

#Left join again to regain facilities, spatial and internal variables
VB_EMS.panel2 <- 
  left_join(VB_EMS.panel, VB_dat_census, by=c(
                             "Origin.Tract", 
                             "interval60", 
                             "week",
                             'dotw'))

#Drop unnecessary columns
unnecessary=c("Male_65_66",
            "Male_67_69", 
            "Male_70_74",        
            "Male_75_79",
            "Male_80_84",
            "Male_85_over",     
            "Female_65_66",
            "Female_67_69", 
            "Female_70_74",      
            "Female_75_79",        
            "Female_80_84",
            "Female_85_over",
            "B27001_004E" ,
            "B27001_007E"  ,       
            "B27001_010E",
            "B27001_013E",
            "B27001_016E"  ,       
            "B27001_019E" ,
            "B27001_022E",
            "B27001_025E"  ,       
            "B27001_028E",
            "B27001_032E",
            "B27001_035E"   ,      
            "B27001_038E",
            "B27001_041E",
            "B27001_044E" ,        
            "B27001_047E",
            "B27001_050E",
            "B27001_053E",        
            "B27001_056E",
            "Car_Ownership")

VB_EMS.panel2 <-  dplyr::select(VB_EMS.panel2, -(unnecessary))

Time lags

We also take several time lags into consideration: 1 hour, 2 hours, 3 hours, 4 hours, 12 hours, 1 day and holiday lag. The following code shows how to create time lags and holiday lags:

VB_EMS.panel2<- 
  VB_EMS.panel2 %>% 
  arrange(Origin.Tract,interval60) %>%  #Reorder rows
  mutate(lagHour = dplyr::lag(EMS_Count,1),
         lag2Hours = dplyr::lag(EMS_Count,2),
         lag3Hours = dplyr::lag(EMS_Count,3),
         lag4Hours = dplyr::lag(EMS_Count,4),
         lag12Hours = dplyr::lag(EMS_Count,12),
         lag1day = dplyr::lag(EMS_Count,24),
         holiday = ifelse(yday(interval60) %in% c(2,
                                                  16,
                                                  51,
                                                  149,
                                                  185,
                                                  247,
                                                  282,
                                                  314,
                                                  327,
                                                  359),1,0)) %>%
  mutate(day = yday(interval60)) %>%
  mutate(holidayLag = case_when(dplyr::lag(holiday, 24) == 1 ~ "PlusOneDay",
                                dplyr::lag(holiday, 48) == 1 ~ "PlustTwoDays",
                                dplyr::lag(holiday, 72) == 1 ~ "PlustThreeDays",
                                dplyr::lead(holiday, 24) == 1 ~ "MinusOneDay",
                                dplyr::lead(holiday, 48) == 1 ~ "MinusTwoDays",
                                dplyr::lead(holiday, 72) == 1 ~ "MinusThreeDays"),
         holidayLag = tidyr::replace_na(holidayLag, 0))

Exploratory analysis: EMS call counts

We can also look at how EMS call counts varies by week, time in the day and origin census tracts. We begin by examining the time and frequency components of our data.

Temporal pattern

The plot below breaks out the study time series in small multiple form by week. This approach provides an even better illustration of the similarity from week to week. The overall pattern of each week is similar with each other.

Holidays are highlighted. We can see the discrepancy of EMS calls count between holidays and non-holidays.

#Relationship between EMS call counts and features
#1. Time lag
as.data.frame(VB_EMS.panel2) %>%
  group_by(week, interval60) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  ggplot(aes(interval60,EMS_Count)) + 
  geom_line() +
  plotTheme() +
  facet_wrap(~week, scales="free", ncol=9) +
  ylim(0,25) +
  labs(title="EMS calls count by week. Virginia Beach 2017",
       subtitle="Holiday marked in blue", x="Day", y="Trip Counts") +
geom_vline(xintercept = as.POSIXct("2017-01-02 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-01-16 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-02-20 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-05-29 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-07-04 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-09-04 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-10-09 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-11-10 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-11-23 01:00:00 UTC"),color = "blue",size=1.1)+
geom_vline(xintercept = as.POSIXct("2017-12-25 01:00:00 UTC"),color = "blue",size=1.1)

We can also look at the overall time pattern by week:

#ADD the spatial information to our data (census tract) first
VB_dat_census <- st_join(EMS_2017 %>% 
                               filter(is.na(DisplayX) == FALSE &
                                        is.na(DisplayY) == FALSE),
                            VBTracts %>%
                            st_transform(crs=2284),
                             join=st_intersects,
                             left = TRUE) %>%
  dplyr::rename(Origin.Tract = GEOID) %>% #Origin.Tract
  mutate(from_longitude = unlist(map(geometry, 1)), #get long column
         from_latitude = unlist(map(geometry, 2)))%>%  #get lat column
  as.data.frame() %>%
  dplyr::select(-geometry)

ggplot(VB_dat_census %>%
         group_by(week)%>%
         tally())+
  geom_line(aes(x = week, y = n),colour = "grey50")+
  labs(title="EMS calls per week. Virginia Beach, 2017",
       x="Date", 
       y="Number of trips")+
  plotTheme()

Let’s examine the distribution of EMS calls volume by rescue squad for different times of the day. We clearly have a few high volume periods but mostly low volume. There are more EMS calls in Mid-day and PM Rush:

#2. Distribution of EMS calls for different times of the day, grouped by rescue squad
VB_dat_census %>%
  dplyr::mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, RescueSquad.Number, time_of_day) %>%
  tally()%>%
  group_by(RescueSquad.Number, time_of_day)%>%
  dplyr::summarize(mean_EMScalls = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_EMScalls), binwidth = 0.5,fill='#6C4E8C')+
  labs(title="Mean number of EMS calls per rescue squad. Virginia Beach 2017",
       x="Number of EMS calls", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

Visually, we can map the EMS calls per hour by origin tracts to have a straightforward observation. We can also divide it into weekday and weekend. It is clear to see on weekdays, EMS calls occur more during Mid-Day and PM Rush hours in Mid and East of the City. While on weekends, EMS calls occur much fewer in count and less concentrated in physical space:

#6: EMS calls per hour by origin tracts- [MAP]
ggplot()+
  geom_sf(data = VBTracts %>%
            st_transform(crs=2284))+
  geom_point(data = VB_dat_census%>% 
               mutate(hour = hour(CallDateandTime),
                      weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(Origin.Tract, from_latitude, from_longitude, weekend, time_of_day) %>%
               tally(),
             aes(x=from_longitude, y = from_latitude, color = n), 
             fill = "transparent", alpha = 1, size = 0.3)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(VB_dat_census$from_latitude), max(VB_dat_census$from_latitude))+
  xlim(min(VB_dat_census$from_longitude), max(VB_dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="EMS calls count per hour by origin tracts. Virginia Beach 2017")+
  mapTheme()

Zooming in to see the time pattern, we can see EMS calls take place more on Wednesday and Thursday during 10:00- 15:00 which includes the Mid-Day hours:

#4. EMS calls by day of the week (Sun-Sat)
ggplot(VB_dat_census %>% mutate(hour = hour(CallDateandTime)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="EMS calls count by day of the week. Virginia Beach 2017",
       x="Hour", 
       y="EMS call counts")+
  plotTheme()

To synthesize, weekdays have more EMS calls than weekend, and calls take place frequently during Mid-Day hours.

#5. EMS calls count : WEEKEND VS WEEKDAY
ggplot(VB_dat_census %>% 
         mutate(hour = hour(CallDateandTime),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="EMS calls count - weekend vs weekday. Virginia Beach 2017",
       x="Hour", 
       y="EMS call counts")+
  plotTheme()

Spatial pattern

We can also visualize the spatial distribution of EMS call count to understand if there are systematic patterns across space that can be harnessed when forecasting. Besides Week 53 which is the last week of our study period, the spatial pattern for the rest of weeks are similar:

#1. SUM of EMS counts by TRACTS and WEEKS
VB_EMS.panel2 %>% 
  group_by(week, Origin.Tract) %>%
  dplyr::summarize(Sum_EMS_Count = sum(EMS_Count)) %>%
  left_join(VBTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_EMS_Count))) +
  facet_wrap(~week, ncol = 9) +
  scale_fill_manual(values = palette5,
                    labels = c("0","4", "7", "12", "63"),
                    name = "Trip_Count") +
  labs(title="Sum of EMS calls count by tract and week",
       subtitle='Virginia Beach 2017') +
  mapTheme()+ theme(legend.position = "right")

Next, we visualize the spatial distribution of EMS call counts by day of the week, there are more EMS calls in several census tracts in south, middle and east of the City:

#2. Sum of EMS counts by TRACTS and day of the week
VB_EMS.panel2 %>% 
  group_by(dotw, Origin.Tract) %>%
  dplyr::summarize(Sum_EMS_Count = sum(EMS_Count)) %>%
  left_join(VBTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_EMS_Count))) +
  facet_wrap(~dotw, ncol = 7) +
  scale_fill_manual(values = palette5,
                    labels = c("3","35", "56", "88", "262"),
                    name = "EMS_Count") +
  labs(title="Sum of EMS calls count by tract and day of the week",
       subtitle='Virginia Beach 2017') +
  mapTheme()+ theme(legend.position="right")

For the spatial pattern in each hour, we can clearly see the transition/change of EMS call counts from night to morning and morning to night:

#3. Sum of EMS counts by tract and hour of the day
VB_EMS.panel2 %>% 
  group_by(hour = hour(interval60), Origin.Tract) %>%
  dplyr::summarize(Sum_EMS_Count = sum(EMS_Count)) %>%
  left_join(VBTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_EMS_Count))) +
  facet_wrap(~hour, ncol = 9) +
  scale_fill_manual(values = palette5,
                    labels = c("0", "8", "15", "26", "121"),
                    name = "Trip_Count") +
  labs(title="Sum of EMS calls count by tract and hour of the day", 
       subtitle="Virginia Beach 2017") +
  mapTheme() + theme(legend.position = "right")

Overall, we can visualize the total EMS call counts by census tracts and council districts. We can identify which census tracts and council districts in Virginia Beach have the highest EMS calls:

#4. Sum of TOTAL EMS counts by council districts
VB_EMS.panel2 %>% 
  na.omit() %>% 
  group_by(council_district) %>%
  dplyr::summarize(Sum_EMS_Count = sum(EMS_Count)) %>%
  left_join(council, by=c("council_district" = "NAME")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = Sum_EMS_Count)) +
  labs(title="Sum of total EMS calls count by council districts", 
       subtitle="Virginia Beach 2017") +
  scale_fill_gradient(low = "#FFC5F7", high = "#93003C", name = "Count") +
  mapTheme() + theme(legend.position = "right")

#5. Sum of TOTAL EMS counts by census tracts
VB_EMS.panel2 %>% 
  na.omit() %>% 
  group_by(Origin.Tract) %>%
  dplyr::summarize(Sum_EMS_Count = sum(EMS_Count)) %>%
  left_join(VBTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = Sum_EMS_Count)) +
  labs(title="Sum of total EMS calls count by census tracts", 
       subtitle="Virginia Beach 2017") +
  scale_fill_gradient(low = "#FFC5F7", high = "#93003C", name = "Count") +
  mapTheme() + theme(legend.position = "right")

We can also create an animation for EMS count for each tract on one day:

week26<-
  VB_dat_census %>%
  filter(week == 26 & dotw == "Fri")

week26.panel <-
  expand.grid(
    interval15=unique(week26$interval15),
    Origin.Tract = unique(week26$Origin.Tract))

EMS.animation.data <-
  week26 %>%
  mutate(EMS_Counter = 1) %>%
  right_join(week26.panel) %>% 
  group_by(interval15, Origin.Tract) %>%
  dplyr::summarize(EMS_Count = sum(EMS_Counter, na.rm=T)) %>% 
  left_join(VBTracts,by=c("Origin.Tract" = "GEOID")) %>%
  st_sf() %>%
  mutate(EMS = case_when(EMS_Count == 0 ~ "0 trips",
                           EMS_Count > 0 & EMS_Count <= 1 ~ "1 trip",
                           EMS_Count > 1 & EMS_Count <= 3 ~ "2-3 trips",
                           EMS_Count > 4 ~ "4+ trips")) %>%
  mutate(EMS  = factor(EMS, levels=c("0 trips","1 trip","2-3 trips","4+ trips")))

EMS_animation <-
  ggplot() +
  geom_sf(data=EMS.animation.data, aes(fill=EMS)) +
  scale_fill_manual(values = palette4) +
  labs(title = "EMS call counts for one day in June 2017, Virginia Beach",
       subtitle = "15 minute intervals: {current_frame}") +
  transition_manual(interval15)+
  mapTheme()

library(gganimate)
animate(EMS_animation, duration=20)

Exploratory analysis: EMS calls count and features

We can then explore the correlation between EMS call count and different features.

Time lag

EMS call counts count as a function of the various time lags are in the following table. These effects are very strong for lag1hour, we can clearly see the correlation decreases with each additional hour, however, the predictive power will bounce back a little with the 24 hour lag:

#Relationship: trip count and time lags (table version)
plotData.lag<-as.data.frame(VB_EMS.panel2) %>%
  group_by(interval60) %>% 
  summarise_at(c(vars(starts_with("lag"), "EMS_Count")), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -EMS_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                              "lag12Hours","lag1day")))
correlation.lag<-
  plotData.lag %>%
  group_by(Variable) %>%  
  dplyr::summarize(correlation = round(cor(Value, EMS_Count),2))

correlation.lag %>%
  kable(caption = "Correlation between EMS calls count and time lags") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#CFC0D1")%>%
  row_spec(3, color = "black", background = "#CFC0D1")%>%
  row_spec(5, color = "black", background = "#CFC0D1")

Weather

Let’s look at precipitation first as a function of bike share trips. Looking at week by group means in the resulting barplot does not necessarily support the hypothesis that rain and snow increase the propensity to EMS calls. From the barplots, it seems like EMS call counts do not necessarily relate to rain/snow or not.

#2. With WEATHER
#Relationship: trip count and weather - Precipitation
as.data.frame(VB_EMS.panel2) %>%
  group_by(interval60) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  left_join(VB_weather.Panel) %>%
  mutate(isPercip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  na.omit()%>%
  group_by(week = week(interval60), isPercip) %>%
  dplyr::summarize(Mean_EMS_Count = mean(EMS_Count)) %>%
  ggplot(aes(isPercip, Mean_EMS_Count)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~week,ncol=9) +
  labs(title="Does EMS call counts vary when it's raining or snowing?(NA removed)",
       subtitle="Mean EMS calls count by week; Virginia Beach 2017",
       x="Percipitation", y="Mean EMS calls count") +
  plotTheme()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

Are there more EMS calls when it’s colder or hotter? Let’s look at a small multiple plot of EMS call counts count as a function of temperature by week. It seems like in most weeks, there are more EMS calls when it’s getting hotter, but the relationship is not strong: [#Relationship: trip count and weather - temperature]

#Relationship: trip count and weather - Temperature
as.data.frame(VB_EMS.panel2) %>%
  group_by(interval60) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  left_join(VB_weather.Panel) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, EMS_Count)) + 
  geom_point() + facet_wrap(~week,ncol=9) + geom_smooth(method = "lm", se= FALSE) +
  xlim(30,85)+
  plotTheme()+
  labs(title="EMS call counts as a function of temperature by week",
       subtitle="EMS call counts by week; Virginia Beach 2017",
       x="Temperature", y="Mean EMS calls count")

Let’s look at the wind speed.The plot shows that besides Week 53 which lacks of weather, the relationship between wind speed and EMS calls count is also weak:

#Relationship: trip count and weather - Wind speed
as.data.frame(VB_EMS.panel2) %>%
  group_by(interval60) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  left_join(VB_weather.Panel) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, EMS_Count)) + 
  geom_point() + facet_wrap(~week,ncol=9) + geom_smooth(method = "lm", se= FALSE) +
  xlim(0,30)+
  plotTheme() +
  labs(title="EMS call counts as a function of wind speed by week",
       subtitle="EMS call counts by week; Virginia Beach 2017",
       x="Wind Speed", y="Mean EMS calls count")

Socio-economic

Because of the space/time panel nature of the data, census data cannot be used in the forecast. Nevertheless, for the purpose of exploratory analysis, we can still briefly analyzes correlation between EMS call count by tract a function of 5 census variables. The table of correlation is shown as below. We can see median age, aging population share and population with health insurance share have weak positive correlation with EMS call count, car ownership and median income have weak negative correlation with EMS call count. It makes sense that with more aging population, there might be more EMS call likely.

#3.With socio-economic data
plotData.census <- 
  as.data.frame(VB_EMS.panel2) %>%
  group_by(Origin.Tract) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count))%>%
  left_join(acsTractsVB.2017, by=c("Origin.Tract" = "GEOID")) %>%
  dplyr::select(CarOwn_Percent,
                percent_pop65over,
                percent_popInsurance,
                Median_Income,
                median_age, 
                EMS_Count) %>%
  gather(Variable, Value, -EMS_Count) %>%
  na.omit() %>%
  filter(Variable =="CarOwn_Percent"| Variable == "percent_pop65over"|Variable == "percent_popInsurance"|
           Variable == "Median_Income"|Variable == "median_age") 

#Correlation with table
correlation.census <-
  plotData.census %>%
  group_by(Variable) %>%  
  dplyr::summarize(correlation = round(cor(Value, log(EMS_Count)),2))

correlation.census%>% 
  kable(caption = "EMS calls count (log) as a function of socio-economic factors") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#CFC0D1")%>%
  row_spec(3, color = "black", background = "#CFC0D1")%>%
  row_spec(5,color = "black", background = "#CFC0D1")

Spatial characteristics

We summarized the EMS call counts by different spatial characteristics. For example, the number of total EMS calls in noise zone or non-noise zone is summarized as below, we can see that more EMS demand is from non-noise zone.

#4. SPATIAL
#Noise zone
as.data.frame(VB_EMS.panel2) %>%
  group_by(if_noise) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  ggplot(aes(if_noise, EMS_Count),binwidth = 0.5) + 
  geom_bar(stat = "identity") +
  labs(title="EMS call counts by if in noise zone",
       subtitle="0=Non-noise zone, 1-noise zone; Virginia Beach 2017",
       x="if in noise zone", y="EMS calls count") +
  plotTheme()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

Also, the following is the number of EMS calls by each council district, we can see overall, Beach has the most EMS calls:

#Council district
as.data.frame(VB_EMS.panel2) %>%
  group_by(council_district) %>% 
  na.omit()%>%
  dplyr::summarize(EMS_Count = sum(EMS_Count)) %>%
  ggplot(aes(council_district, EMS_Count)) + 
  geom_bar(stat = "identity") +
  labs(title="EMS call counts by council district",
       subtitle="Virginia Beach 2017",
       x="if in noise zone", y="EMS calls count") +
  plotTheme()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

Facilities

The relationship between EMS call counts and nearest-neighbor distance to housing, town center and industrial areas as below, all three features have weak relationship with EMS call counts as well:

#6. Facilities (NND)
plotData.nn <- 
  as.data.frame(VB_EMS.panel2) %>%
  group_by(Origin.Tract) %>% 
  dplyr::summarize(EMS_Count = sum(EMS_Count))%>%
  left_join(VB_dat_census, by="Origin.Tract") %>%
  dplyr::select(industrial.nn, 
                housing.nn,
                towncenter.nn,
                EMS_Count) %>%
  gather(Variable, Value, -EMS_Count) %>%
  na.omit() %>%
  filter(Variable =="industrial.nn"| Variable == "housing.nn"|Variable == "towncenter.nn") 

#Correlation with table
correlation.nn <-
  plotData.nn %>%
  group_by(Variable) %>%  
  dplyr::summarize(correlation = round(cor(Value, EMS_Count),2))

correlation.nn%>% 
  kable(caption = "EMS calls count as a function of Nearest-Neighbor Distance to facilities") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#CFC0D1")%>%
  row_spec(3, color = "black", background = "#CFC0D1")

Model

We used two models to predict WHEN and WHERE EMS calls will be. We used time-space modeling to predict WHEN by using time-space panel, and Poisson Regression to predict WHERE by implementing fishnet. Due to the distinct characteristics for each model, each model are not taking all features from the full variable list, for example, Poisson Regression model didn’t take the time, and time-space model didn’t take socio-economic features.

EMS Calls Time Prediction: Space-Time prediction model

Although Space-Time prediction can also predict space, we think in our case, it is more suitable to focus on predicting time.

Model choice

We split the data into a training set and a test test (Week 36 as the “boundary”, there are 18 test set weeks). We created 4 linear models using the “lm” function in the training set, but made predictions and get results in the test set:

Model 1: Includes time, space fixed effects and time lag features.

Model 2: Adds holiday and holidayLag effects.

Model 3: Adds spatial characteristics effects (if in noise zone and council districts)

Model 4: Adds distances to facilities effects (NND)

The following is the code for building the 4 models:

#VII. MODELING
EMS.Train <- filter(VB_EMS.panel2, week >= 36)
EMS.Test <- filter(VB_EMS.panel2, week <= 53)

reg4 <- 
  lm(EMS_Count ~  Origin.Tract +  hour(interval60) + dotw + Temperature + Precipitation+Wind_Speed +
       lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, data=EMS.Train)

reg5 <- 
  lm(EMS_Count ~  Origin.Tract  + hour(interval60) + dotw + Temperature + Precipitation+Wind_Speed +
       lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, data=EMS.Train)

reg6 <- 
  lm(EMS_Count ~  Origin.Tract  + hour(interval60) + dotw + Temperature + Precipitation+Wind_Speed +
       lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + if_noise + council_district,
     data=EMS.Train)

reg7 <- 
  lm(EMS_Count ~  Origin.Tract  + hour(interval60) + dotw + Temperature + Precipitation +Wind_Speed+
       lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + if_noise + council_district+
       industrial.nn+housing.nn+towncenter.nn,data=EMS.Train)

Accuracy

The MAE is plotted by model by week as below. We see that the distances to facilities effects do not bring much additional predictive power to the mode. Distances to facilities effects make the prediction less accurate. Therefore, we chose to stick on Model 3 which excludes distances to facilities effects. The following shows the MAE results for each of 4 models by weeks:

#Predict for test data
#create a nested data frame of test data by week.
EMS.Test.weekNest <- 
  EMS.Test %>%
  nest(-week)

model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

EMS_week_predictions <-  #For test set!
 EMS.Test.weekNest %>% 
  mutate(ATime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
         BTime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
         CTime_Space_FE_timeLags_holidayLags_SF = map(.x = data, fit = reg6, .f = model_pred),
         DTime_Space_FE_timeLags_holidayLags_SF_FACILI = map(.x = data, fit = reg7, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, EMS_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

#Accuracy
#MAE by model specification and week
EMS_week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette4) +
  labs(title = "Mean Absolute Errors by model specification and week",
       subtitle = "EMS calls count predicton; A test set of 18 weeks; Virginia Beach 2017") +
  plotTheme()

Result

The detailed result for Model 3 is summarized as below, the R-square is 0.424 which means that 42.4% of the variation of EMS call count can be explained by the model. From the table, we can clearly see variables including time lag, holiday lag and day of the week are the most statistically significant ones. While for our additive variables such as distances to facilities and spatial characteristics, they have little predictive power.

htmltools::includeHTML("/Users/ranxin/Desktop/590 final/graphs/file253389e9207.htm")
  EMS_Count
Predictors Estimates CI p
(Intercept) 1.00 0.90 – 1.10 <0.001
Origin.Tract
[51810040200]
0.03 -0.08 – 0.13 0.629
Origin.Tract
[51810040402]
0.02 -0.08 – 0.12 0.658
Origin.Tract
[51810040403]
0.02 -0.08 – 0.13 0.652
Origin.Tract
[51810040404]
0.01 -0.09 – 0.11 0.874
Origin.Tract
[51810040600]
0.01 -0.09 – 0.12 0.789
Origin.Tract
[51810040801]
0.01 -0.09 – 0.11 0.853
Origin.Tract
[51810040802]
-0.00 -0.11 – 0.10 0.931
Origin.Tract
[51810041002]
0.03 -0.07 – 0.13 0.587
Origin.Tract
[51810041003]
-0.01 -0.11 – 0.10 0.908
Origin.Tract
[51810041004]
0.04 -0.06 – 0.15 0.427
Origin.Tract
[51810041200]
0.04 -0.06 – 0.15 0.403
Origin.Tract
[51810041400]
0.00 -0.10 – 0.11 0.962
Origin.Tract
[51810041600]
0.02 -0.09 – 0.12 0.727
Origin.Tract
[51810041801]
0.03 -0.07 – 0.13 0.524
Origin.Tract
[51810041802]
0.00 -0.10 – 0.11 0.946
Origin.Tract
[51810042000]
-0.03 -0.15 – 0.10 0.696
Origin.Tract
[51810042201]
-0.01 -0.13 – 0.11 0.861
Origin.Tract
[51810042202]
-0.01 -0.12 – 0.10 0.896
Origin.Tract
[51810042400]
0.02 -0.09 – 0.13 0.704
Origin.Tract
[51810042600]
0.03 -0.08 – 0.14 0.569
Origin.Tract
[51810042801]
-0.00 -0.11 – 0.11 0.986
Origin.Tract
[51810042802]
-0.01 -0.12 – 0.11 0.927
Origin.Tract
[51810043002]
-0.01 -0.12 – 0.10 0.862
Origin.Tract
[51810043003]
-0.00 -0.11 – 0.11 0.961
Origin.Tract
[51810043004]
-0.02 -0.13 – 0.10 0.786
Origin.Tract
[51810043200]
0.02 -0.09 – 0.14 0.682
Origin.Tract
[51810043400]
-0.01 -0.13 – 0.11 0.889
Origin.Tract
[51810043600]
-0.01 -0.13 – 0.11 0.889
Origin.Tract
[51810043800]
0.00 -0.11 – 0.11 0.979
Origin.Tract
[51810044001]
0.03 -0.09 – 0.14 0.652
Origin.Tract
[51810044003]
0.01 -0.10 – 0.12 0.908
Origin.Tract
[51810044004]
0.02 -0.09 – 0.13 0.759
Origin.Tract
[51810044200]
0.02 -0.09 – 0.13 0.689
Origin.Tract
[51810044401]
-0.01 -0.13 – 0.11 0.852
Origin.Tract
[51810044402]
0.03 -0.08 – 0.14 0.552
Origin.Tract
[51810044600]
0.04 -0.07 – 0.15 0.488
Origin.Tract
[51810044805]
0.01 -0.11 – 0.12 0.924
Origin.Tract
[51810044806]
0.04 -0.07 – 0.15 0.484
Origin.Tract
[51810044807]
0.02 -0.09 – 0.13 0.681
Origin.Tract
[51810044808]
0.01 -0.11 – 0.12 0.926
Origin.Tract
[51810045000]
0.02 -0.10 – 0.15 0.729
Origin.Tract
[51810045200]
-0.01 -0.12 – 0.10 0.884
Origin.Tract
[51810045405]
0.02 -0.09 – 0.13 0.659
Origin.Tract
[51810045406]
0.01 -0.10 – 0.12 0.852
Origin.Tract
[51810045407]
0.05 -0.06 – 0.15 0.415
Origin.Tract
[51810045408]
0.02 -0.09 – 0.14 0.664
Origin.Tract
[51810045412]
0.01 -0.11 – 0.12 0.886
Origin.Tract
[51810045414]
0.02 -0.09 – 0.13 0.731
Origin.Tract
[51810045415]
-0.00 -0.12 – 0.11 0.950
Origin.Tract
[51810045417]
0.02 -0.10 – 0.13 0.791
Origin.Tract
[51810045420]
-0.01 -0.12 – 0.11 0.894
Origin.Tract
[51810045421]
-0.01 -0.14 – 0.12 0.898
Origin.Tract
[51810045422]
-0.01 -0.13 – 0.10 0.835
Origin.Tract
[51810045423]
0.02 -0.09 – 0.14 0.659
Origin.Tract
[51810045424]
0.00 -0.11 – 0.12 0.972
Origin.Tract
[51810045425]
-0.01 -0.13 – 0.12 0.923
Origin.Tract
[51810045426]
-0.00 -0.12 – 0.11 0.952
Origin.Tract
[51810045427]
0.02 -0.10 – 0.14 0.735
Origin.Tract
[51810045428]
0.02 -0.09 – 0.13 0.729
Origin.Tract
[51810045601]
-0.02 -0.13 – 0.10 0.768
Origin.Tract
[51810045603]
0.00 -0.10 – 0.11 0.975
Origin.Tract
[51810045604]
0.02 -0.09 – 0.13 0.753
Origin.Tract
[51810045801]
0.04 -0.07 – 0.14 0.517
Origin.Tract
[51810045803]
-0.02 -0.12 – 0.09 0.783
Origin.Tract
[51810045805]
-0.00 -0.12 – 0.11 0.938
Origin.Tract
[51810045806]
0.02 -0.09 – 0.13 0.684
Origin.Tract
[51810045807]
-0.00 -0.13 – 0.12 0.950
Origin.Tract
[51810045808]
-0.01 -0.12 – 0.11 0.876
Origin.Tract
[51810045809]
0.05 -0.06 – 0.16 0.408
Origin.Tract
[51810045810]
0.01 -0.11 – 0.13 0.862
Origin.Tract
[51810046002]
-0.00 -0.11 – 0.10 0.981
Origin.Tract
[51810046005]
0.04 -0.06 – 0.14 0.472
Origin.Tract
[51810046006]
0.01 -0.09 – 0.12 0.807
Origin.Tract
[51810046009]
0.00 -0.10 – 0.11 0.985
Origin.Tract
[51810046010]
-0.01 -0.11 – 0.10 0.909
Origin.Tract
[51810046011]
0.03 -0.07 – 0.14 0.553
Origin.Tract
[51810046012]
0.01 -0.10 – 0.12 0.875
Origin.Tract
[51810046013]
0.03 -0.08 – 0.14 0.607
Origin.Tract
[51810046014]
-0.01 -0.12 – 0.10 0.918
Origin.Tract
[51810046015]
0.00 -0.11 – 0.11 0.956
Origin.Tract
[51810046016]
-0.01 -0.13 – 0.11 0.905
Origin.Tract
[51810046204]
-0.00 -0.11 – 0.10 0.932
Origin.Tract
[51810046206]
0.03 -0.08 – 0.13 0.627
Origin.Tract
[51810046207]
-0.01 -0.12 – 0.09 0.819
Origin.Tract
[51810046211]
-0.00 -0.11 – 0.11 0.991
Origin.Tract
[51810046212]
0.00 -0.11 – 0.11 0.960
Origin.Tract
[51810046213]
-0.01 -0.12 – 0.10 0.921
Origin.Tract
[51810046214]
0.02 -0.10 – 0.13 0.780
Origin.Tract
[51810046216]
0.01 -0.10 – 0.13 0.800
Origin.Tract
[51810046217]
-0.02 -0.13 – 0.10 0.800
Origin.Tract
[51810046219]
-0.02 -0.14 – 0.10 0.696
Origin.Tract
[51810046220]
0.03 -0.07 – 0.14 0.551
Origin.Tract
[51810046221]
-0.01 -0.12 – 0.10 0.910
Origin.Tract
[51810046222]
0.01 -0.10 – 0.12 0.915
Origin.Tract
[51810046223]
0.03 -0.09 – 0.14 0.659
Origin.Tract
[51810046224]
-0.03 -0.14 – 0.09 0.653
Origin.Tract
[51810046225]
0.01 -0.10 – 0.12 0.866
Origin.Tract
[51810046400]
-0.01 -0.12 – 0.11 0.910
hour(interval60) 0.00 -0.00 – 0.00 0.149
dotw.L -0.00 -0.01 – 0.01 0.842
dotw.Q -0.01 -0.01 – 0.00 0.120
dotw.C 0.01 -0.00 – 0.01 0.181
dotw^4 -0.01 -0.02 – -0.00 0.022
dotw^5 -0.00 -0.01 – 0.01 0.541
dotw^6 -0.00 -0.01 – 0.00 0.313
Temperature -0.00 -0.00 – 0.00 0.908
Precipitation -0.01 -0.02 – 0.00 0.170
Wind_Speed 0.00 -0.00 – 0.00 0.759
lagHour 0.42 0.41 – 0.43 <0.001
lag2Hours -0.01 -0.02 – -0.00 0.039
lag3Hours -0.01 -0.02 – 0.00 0.255
lag12Hours -0.01 -0.03 – -0.00 0.029
lag1day 0.00 -0.01 – 0.01 0.458
holidayLag [MinusOneDay] 0.00 -0.01 – 0.02 0.844
holidayLag
[MinusThreeDays]
-0.01 -0.02 – 0.01 0.496
holidayLag [MinusTwoDays] -0.00 -0.02 – 0.02 0.884
holidayLag [PlusOneDay] -0.02 -0.03 – -0.00 0.016
holidayLag
[PlustThreeDays]
0.02 0.00 – 0.04 0.013
holidayLag [PlustTwoDays] -0.01 -0.02 – 0.01 0.411
holiday 0.00 -0.01 – 0.02 0.584
if_noise -0.00 -0.02 – 0.02 0.982
council_district [Beach] 0.02 -0.02 – 0.06 0.433
council_district
[Centerville]
0.01 -0.02 – 0.05 0.467
council_district
[Kempsville]
0.01 -0.01 – 0.03 0.191
council_district
[Lynnhaven]
0.01 -0.03 – 0.05 0.653
council_district
[Princess Anne]
0.01 -0.04 – 0.05 0.799
council_district [Rose
Hall]
0.00 -0.04 – 0.04 0.863
Observations 13716
R2 / R2 adjusted 0.424 / 0.419

Generalizability

If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge. We are certainly underpredicting in general:

#Generalizability
#By time1
EMS_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         Origin.Tract = map(data, pull, Origin.Tract), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw)) %>%
  dplyr::select(interval60, Origin.Tract, from_longitude, 
                from_latitude, Observed, Prediction, Regression,
                dotw) %>%
  unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags_SF")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted EMS call counts by weekday/weekend from Model 3\nA test set of 18 weeks;VB 2017",
       x="Observed trips", 
       y="Predicted trips",
       subtitle='Black line:45 degree; Red line: observed vs predicted')+
  plotTheme()

Thus, in detail, let’s look at our errors on a map by weekend/weekday and time of day. Weekends and weekdays have different error patterns in time. Between each time of the day, during weekdays, middle and east of the City tends to have more errors in AM rush and PM rush hours.

#By time 2
EMS_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         Origin.Tract = map(data, pull, Origin.Tract), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw) ) %>%
  dplyr::select(interval60, Origin.Tract, from_longitude, 
                from_latitude, Observed, Prediction, Regression,
                dotw) %>%
  unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags_SF")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(Origin.Tract, weekend, time_of_day, from_longitude, from_latitude) %>%
  dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = VBTracts %>%
            st_transform(crs=2284), color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(VB_dat_census$from_latitude), max(VB_dat_census$from_latitude))+
  xlim(min(VB_dat_census$from_longitude), max(VB_dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors of Test Set \nby weekend/weekday and time of day from Model 3",
       subtitle = "EMS call counts predicton; A test set of 18 weeks; VB 2017")+
  mapTheme()

Last but not least, the model cannot predict well equally for different socio-economic contexts:

Percent of population with car ownership: the model predicts pretty well where there is a higher percent of population having car(s).

Percent of population over 65: the model predicts pretty well where there is lower percent of population over 65.

Median age: the model predicts pretty well where people with higher median age.

Percent of population with health insurance: the model predicts well where there is a higher percent of population with health insurance.

Median household income: the model predicts well where there is higher median household income (more than 40,000).

#Error as a function of socio-economic factors
EMS_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         Origin.Tract= map(data, pull, Origin.Tract), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw),
         CarOwn_Percent = map(data, pull, CarOwn_Percent),
         percent_pop65over = map(data, pull, percent_pop65over),
         percent_popInsurance = map(data, pull, percent_popInsurance),
         Median_Income=map(data,pull,Median_Income),
         median_age = map(data,pull, median_age)) %>%
  dplyr::select(interval60, Origin.Tract, from_longitude, 
                from_latitude, Observed, Prediction, Regression,
                dotw, CarOwn_Percent,
                percent_pop65over,
                percent_popInsurance,
                Median_Income,
                median_age) %>%
  unnest() %>%
  filter(Regression == "CTime_Space_FE_timeLags_holidayLags_SF")%>%
  group_by(Origin.Tract,
    CarOwn_Percent,
           percent_pop65over,
           percent_popInsurance,
           Median_Income,
           median_age) %>%
  dplyr::summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-Origin.Tract, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       subtitle = "A test set of 18 weeks",
       y="Mean Absolute Error (EMS call counts)")+
  plotTheme()

Short Conclusion

  1. According to our predicted results, the EMS calls are most likely to happen in weekdays during Mid-Day hours (10:00-15:00) and Rush PM hours (15:00-18:00). So, the app could use this prediction result to provide the time to be prepared for ambulance drivers.

  2. The accuracy and generalizability of our model could be improved by adding more related variables as well.

EMS Calls Location Prediction: Poisson Regression

Since weather and time factors are not location-related, we opted to not use those variables for location prediction of EMS calls. So, for location-prediction model, we prepared socio-economic data (median household income, median age, percent of aging population (65 or older), car ownership rate, and health insurance ownership rate), distance to facilities data including industrial areas, town center and condos, and spatial features data (if in aircraft noise zone or not, and council district division) for the explanatory variables of our prediction model.

Feature enginnering

1. EMS calls by grid cells

In addition, under our consideration, the happenings of emergency circumstances would not vary across administrative units, but supposed to be clustered, which means the emergency situation in one location is likely correlated with the other emergencies nearby. So, to represent the geographical distribution of EMS calls, we decided to set a lattice of grid cells, a “fishnet”, to aggregate these point phenomenon in each grid. As the code below shown, after created the fishnet and clipped it in city boundary, we joined the EMS calls data in the fishnet.

The map below shows the general geographical distribution of the EMS calls with the count of EMS calls in each grid cell. As it shown, similar to the map which shows the distribution pattern of each EMS call point, most EMS call cases seem to happen at the north part not including the northern end area of the city. Furthermore, among the grid cells with a large number of EMS calls, those close to the east shoreline seem to have the largest number of calls.

# 2.1 Create a fishnet
fishnet <- 
  st_make_grid(VB, cellsize = 2000)%>%
  st_sf()

#Clip the fishnet into VB boundary
fishnet <- 
  fishnet[VB,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)

# 2.2 Joining EMS to the fishnet--[EMS net]
EMS_net <- 
  EMS_2017 %>% 
  dplyr::select() %>% 
  mutate(countEMS = 1) %>% 
  aggregate(., fishnet, sum) %>% #aggregate joins burglaries to the fishnet using the sum operation to sum countBurglaries
  mutate(countEMS = ifelse(is.na(countEMS), 0, countEMS),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet)/24), size=nrow(fishnet), replace = TRUE)) #Roughly 100 cvIDs are generated (round(nrow(burglaries) / 100)) to allow 100-fold cross validation below
#Plot...
ggplot() +
  geom_sf(data = EMS_net, aes(fill = countEMS)) +
  scale_fill_viridis() +
  labs(title = "Count of EMS calls for the fishnet") +
  mapTheme()

2. Socio-economic features by grid cells

After wrangling with the EMS calls data, we started to join the data of all influence factors in the fishnet to get the data in grid cell level. At first, we spatially joined all 5 types of socio-economic data in the fishnet and visualized their spatial processes, which are similar to the visualization in census tract-level.

VBTracts <- 
  acsTractsVB.2017 %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf

VBTracts <- VBTracts %>%
  st_transform(crs=2284)

fishnet1 <-
  fishnet %>%
  st_join(dplyr::select(VBTracts, GEOID))#Spatial join the data into Tracts 

fishnet1 <- fishnet1[!duplicated(fishnet1$uniqueID),]

Demo2017 <- 
  acsTractsVB.2017 %>%
  dplyr::select("GEOID","percent_pop65over","percent_popInsurance","Median_Income","Car_Ownership","median_age")

Demo2017 <- Demo2017 %>%
  st_transform(crs=2284)

Demo_net <- 
  left_join(fishnet1, st_set_geometry(Demo2017, NULL), by="GEOID")
grid.arrange(
ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(Demo_net), aes(fill = percent_pop65over)) +
  labs(title = "Percent of population aged over 65") +
  scale_fill_gradient(low = "#EE92ED", high = "#6800A4", name = "Percent") +
  mapTheme(),

ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(Demo_net), aes(fill = percent_popInsurance)) +
  labs(title = "Percent of population with health insurance") +
  scale_fill_gradient(low = "#FFFFAF", high = "#A54900", name = "Percent") +
  mapTheme(),

ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(Demo_net), aes(fill = Median_Income)) +
  labs(title = "Median Household Income") +
  scale_fill_gradient(low = "#C1DED9", high = "#142623", name = "$e") +
  mapTheme(),


ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(Demo_net), aes(fill = Car_Ownership)) +
  labs(title = "Population with car(s)") +
  scale_fill_gradient(low = "#0B9ED9", high = "#080E33", name = "Percent") +
  mapTheme(),

ggplot() + 
  geom_sf(data = VB, fill = "grey70") +
  geom_sf(data = na.omit(Demo_net), aes(fill = median_age)) +
  labs(title = "Median age") +
  scale_fill_gradient(low = "#0B9ED9", high = "#080E33", name = "Age") +
  mapTheme(),
ncol=3)

#### 3. If in Noise Zone for each grid cell Then, we included the dummy variable about if the EMS call case is in Noise Zone or not to the grid cells and joined the socio-economic and noise data to our fishnet with the EMS calls information.

Noise_net <-
  fishnet1 %>%
  st_join(dplyr::select(noise,if_noise))
Noise_net$if_noise[is.na(Noise_net$if_noise)]<-0
Noise_net <- Noise_net[!duplicated(Noise_net$uniqueID),]

EMS_net <- 
  left_join(EMS_net, st_set_geometry(Demo_net, NULL), by="uniqueID")

EMS_net <- 
  left_join(EMS_net, st_set_geometry(Noise_net, NULL), by="uniqueID")

4. Facilities: Nearest neighbor features

Afterwards, because counting the number of industrial areas, town center, and condo units by grid cell would only test our exposure hypothesis to these facility features in a rigid scale and opens up to MAUP bias, which would not be very useful for our prediction, we used average nearest neighbor distance function to measure the average distance between the centroid of each grid cell to the centroids of closest several facility features. The code is shown below:

information.

EMS_net$industrial.nn =
  nn_function(st_coordinates(st_centroid(EMS_net)),st_coordinates(st_centroid(industrial_area)), 1)
EMS_net$housing.nn =
  nn_function(st_coordinates(st_centroid(EMS_net)), st_coordinates(st_centroid(condos)), 3)
EMS_net$towncenter.nn =
  nn_function(st_coordinates(st_centroid(EMS_net)),st_coordinates(st_centroid(towncenter)), 1)

# Nearest Neighbor influence Factors by Fishnet

EMS_net.long.nn <- 
  EMS_net %>%
  dplyr::select(ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

EMS_net.long <-
  EMS_net %>%
  gather(Variable, value, -geometry)

vars <- unique(EMS_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(EMS_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(mapList, ncol =2, top = "Nearest Neighbor influence Factors by Fishnet"))

5. Join the final fishnet with Council layer

In our last step for feature engineering, after getting all possible influence factors by grid cells in our final version of fishnet, we spatially joined the fishnet to the council layer for understanding the location of future prediction better.

# Feature Engineering - Join council
EMS_net <-
  st_centroid(EMS_net) %>%
  st_join(., dplyr::select(council, NAME)) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(EMS_net, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()

dplyr::select(EMS_net, NAME) %>%
  gather(Variable, Value, -geometry) %>%
  ggplot() +
  geom_sf(aes(fill = Value)) +
  scale_fill_viridis(discrete = TRUE) +
  labs(title = "Aggregate Areas of Councils") +
  mapTheme() + theme(legend.position = "none")

In the next two sections, we explored Local Moran’s I in EMS calls to examine spatial autocorrelation of EMS calls on a local scale and then conducted a correlation test to understand which type of features should be included in the model.

Spatial structure: Local Moran’s I

In fact, our null hypothesis for Local Moran’s I is that EMS call at a given location is randomly distributed relative to its immediate neighbors. Because a spatial weights matrix should be used to relate a unit to its neighbors, we created a neighbors list together with the matrix. In the fishnet, each grid cell is related to its 8 closest neighbors!

Afterwards, we tried to describe the spatial process of interest. To do this, we firstly calculated the Local Moran’s I. And then, we tested its results and outputted several useful test statistics including the P values, and the hotspot cells whose p values are equal or less than 0.05. At last, we created a small multiple map including the maps of the count of EMS calls, Local Moran’s I and the other test statistics. [Code for Local Moran’s I statistics and visualizations]

# Exploring the spatial structure of EMS calls
EMS_net.nb <- poly2nb(EMS_net, queen=TRUE)
EMS_net.weights <- nb2listw(EMS_net.nb, style="W", zero.policy=TRUE)

# local Moran's I
EMS_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(EMS_net$countEMS, EMS_net.weights)),
    as.data.frame(EMS_net, NULL)) %>% 
  st_sf() %>%
  dplyr::select(EMS_Count = countEMS, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z > 0)`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars <- unique(EMS_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = filter(EMS_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
    scale_fill_viridis(name="") +
    labs(title=i) +
    mapTheme()}

do.call(grid.arrange,c(varList, ncol = 2, top = "Local Morans I statistics, EMS calls"))

In the maps above, the higher the value (color in yellow), the stronger and more statistically significant evidence of local clustering the EMS call is in City of Virginia Beach. As the maps showed, the spatial autocorrelation in the mid-northern, especially the area close to the east coast of the city is strong and the EMS call spatial process has a clearly clustering pattern.

Afterwards, because our hypothesis is that the spatial structure features will lead to more accurate and more generalizable predictions, we tried to engineer the new features to best capture the local spatial structure. To do this, we created a dummy variable to indicate the significant cluster with P-value <= 0.0000001. Then, measured the distance from each cell to its nearest cluster. After adding these two variables in the model, we have finished collecting and wrangling all the data for dependent and explanatory variables.

# Distance to highly significant local EMS call hotspots
EMS_net <-
  EMS_net %>% 
  mutate(EMS.isSig = ifelse(localmoran(EMS_net$countEMS, 
                                           EMS_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(EMS.isSig.dist = nn_function(st_coordinates(st_centroid(EMS_net)),
                                          st_coordinates(st_centroid(
                                            filter(EMS_net, EMS.isSig == 1))), 1 ))

ggplot() + 
  geom_sf(data = EMS_net, aes(fill = EMS.isSig.dist)) +
  scale_fill_viridis() +
  labs(title = "Distance to highly significant local EMS calls hotspots") +
  mapTheme()

Model choice

Whereafter, before building the model for prediction, we firstly looked at the distribution of the count of EMS calls which is the dependent variable in the model by plotting a histogram.

# 5.1 Poisson Regression-->model a count outcome

# Distribution of count
ggplot(EMS_net, aes(countEMS)) + 
  geom_histogram(binwidth = 20, fill = "grey30") +
  labs(title = "Distribution of EMS call by grid cell")

Since the emergency circumstances with EMS calling are not easily to happen, the distribution of the EMS calls by grid cell doesn’t normally distribute, but becomes far more skewed. So, an OLS regression is not good to be used and a Poisson Regression which is used to modeling count outcomes would be more appropriate.

Cross-validated Poisson model

So, to gauge the generalizability of our model, we performed spatial cross-validation through conducting both random k-fold and Leave-one-group-out (LOGO) cross validation. What’s more, goodness of fit metrics are generated for both cross validation approaches for two different specifications (with or without spatial structure features: Council district).

# 5.2 Cross-validated poisson Regression
# Leave-one-group-out cross-validation (LOGO-CV)

# Goodness of fit metrics 
reg.vars <- c("percent_pop65over", "percent_popInsurance", "Median_Income", 
              "Car_Ownership", "median_age", "if_noise", "industrial.nn", "housing.nn", "towncenter.nn")

reg.ss.vars <- c("percent_pop65over", "percent_popInsurance", "Median_Income", 
                 "Car_Ownership", "median_age", "if_noise", "industrial.nn", "housing.nn", "towncenter.nn", 
                 "EMS.isSig", "EMS.isSig.dist")

# Cross Validation--function (countEMS)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countEMS ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

# CV (four different regressions)
reg.cv <- crossValidate(
  dataset = EMS_net,
  id = "cvID",
  dependentVariable = "countEMS",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = cvID, countEMS, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = EMS_net,
  id = "cvID",
  dependentVariable = "countEMS",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countEMS, Prediction, geometry)

reg.spatialCV <- crossValidate(
  dataset = EMS_net,
  id = "NAME",
  dependentVariable = "countEMS",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = NAME, countEMS, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = EMS_net,
  id = "NAME",
  dependentVariable = "countEMS",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = NAME, countEMS, Prediction, geometry)

Results

Next, after setting up these regression models, I mapped the location predictions of EMS calls for each of them.

# 5.2 Accuracy & Generalzability
reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = countEMS - Prediction,
           Regression = "Random k-fold CV: Just Influence Factors"),
    
    mutate(reg.ss.cv,        Error = countEMS - Prediction,
           Regression = "Random k-fold CV: Spatial Structure"),
    
    mutate(reg.spatialCV,    Error = countEMS - Prediction,
           Regression = "Spatial LOGO-CV: Just Influence Factors"),
    
    mutate(reg.ss.spatialCV, Error = countEMS - Prediction,
           Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
  st_sf() 

# predictions for each Regression
grid.arrange(
  reg.summary %>%
    ggplot() +
    geom_sf(aes(fill = Prediction)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Predicted EMS calls by Regression") +
    mapTheme() + theme(legend.position="bottom"),
  
  filter(reg.summary, Regression == "Random k-fold CV: Just Influence Factors") %>%
    ggplot() +
    geom_sf(aes(fill = countEMS)) +
    scale_fill_viridis() +
    labs(title = "Observed EMS calls\n") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

According to the maps, the predicted EMS calls would still concentrate at the mid-northern part of the city. However, not identical as the existing condition, it seems that the area in the west part would be more possible to have larger number of EMS calls than east side. Furthermore, it’s clear that the models with spatial structure features are better in picking up on very localized hotspots.

Accuracy

Then, to check the accuracy of the models, I firstly mapped the prediction errors (raw errors) for two different specifications of both models below.

According to the maps of both approaches, the model without spatial structures seem to underpredict many hotspots and leave clustered errors at the hotspots. Besides, the distribution of the errors across different cross-validation types seems similar.

# prediction errors maps
filter(reg.summary,  Regression == "Random k-fold CV: Just Influence Factors" |
         Regression == "Random k-fold CV: Spatial Structure" |
         Regression == "Spatial LOGO-CV: Just Influence Factors" | 
         Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  ggplot() +
  geom_sf(aes(fill = Error)) +
  facet_wrap(~Regression) +
  scale_fill_viridis() +
  labs(title = "EMS calls errors by Regression") +
  mapTheme()

After that, we examined the MAE and standard deviation of MAE by each regression type and we found that the models with spatial structure factor err by about 14 and without spatial structure factor err by about 18 EMS calls on average. Since the mean value of the count of EMS calls is about 21, the error seems not very small. Thus, in this perspective, our model has the potential to be improved for better prediction.

# MAE & SD of MAE
st_set_geometry(reg.summary, NULL) %>%
  group_by(Regression) %>% 
  dplyr::summarize(MAE = round(mean(abs(Prediction - countEMS), na.rm = T),2),
            SD_MAE = round(sd(abs(Prediction - countEMS), na.rm = T),2)) %>% 
  kable(caption = "MAE by regression") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "white", background = "#004E87") %>%
  row_spec(4, color = "white", background = "#004E87") 
MAE by regression
Regression MAE SD_MAE
Random k-fold CV: Just Influence Factors 17.74 39.74
Random k-fold CV: Spatial Structure 14.20 30.02
Spatial LOGO-CV: Just Influence Factors 19.04 41.82
Spatial LOGO-CV: Spatial Structure 14.28 30.33

Besides, since overpredicting would draw undue attention to some areas in the city, it might be particularly helpful to look at the raw errors rather than absolute errors. In addition, according to the table, it’s obvious that the regressions with the spatial structure features are more accurate and generalizable.

Generalizability

Then, since their would have reporting or selection bias in different census tracts, to furtherly test the generalizability, we used ACS socio-economic data (Races & Median Household Incomes) to set up two urban contexts and then showed the mean error of the prediction by census tract race and income contexts.

final_reg <- 
  filter(reg.summary,  Regression == "Random k-fold CV: Just Influence Factors" |
           Regression == "Random k-fold CV: Spatial Structure" |
           Regression == "Spatial LOGO-CV: Just Influence Factors" | 
           Regression == "Spatial LOGO-CV: Spatial Structure") %>%
  mutate(uniqueID = rownames(.))
final_reg <-final_reg %>%
  st_transform(2284)
acsTractsVB.2017 <- acsTractsVB.2017 %>%
  st_transform(2284)

final_reg.tracts <- 
  st_join(st_centroid(final_reg), acsTractsVB.2017) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(final_reg, uniqueID)) %>%
  st_sf() %>%
  na.omit()

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, raceContext) %>%
  dplyr::summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable(caption = "Mean (Raw) Error by tract racial context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "white", background = "#004E87") %>%
  row_spec(4, color = "white", background = "#004E87")
Mean (Raw) Error by tract racial context
Regression Majority Non-White Majority White
Random k-fold CV: Just Influence Factors 9.5573604 -0.3243506
Random k-fold CV: Spatial Structure -0.2719520 0.0614812
Spatial LOGO-CV: Just Influence Factors 11.3123084 -0.5091783
Spatial LOGO-CV: Spatial Structure 0.4858889 0.2946270

According to the table, to some extents, all 4 types of models overpredicted or underpredicted the EMS calls in majority non-white census tracts and majority white census tracts. However, the models with the spatial features reports lower errors and a much lower difference in errors across racial context. Furthermore, the mean predicted errors in randomly cross-validated model which has the spatial structure factor are very small, which means that the model works well in this specification.

# Mean error by tract income context

st_set_geometry(final_reg.tracts, NULL) %>%
  group_by(Regression, incomeContext) %>%
  dplyr::summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(incomeContext, mean.Error) %>%
  kable(caption = "Mean (Raw) Error by tract income context") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "white", background = "#004E87") %>%
  row_spec(4, color = "white", background = "#004E87")
Mean (Raw) Error by tract income context
Regression High Income Low Income
Random k-fold CV: Just Influence Factors -1.6387369 7.605730
Random k-fold CV: Spatial Structure -0.2808231 1.491058
Spatial LOGO-CV: Just Influence Factors -2.0538278 8.855768
Spatial LOGO-CV: Spatial Structure -0.0782893 1.973303

According to the table, the model overpredicted EMS calls in low-income census tracts and underpredicted in high-income census tracts. What’s more, the models with the spatial features reports lower errors and a much lower difference in errors across income context.

Short conclusion

  1. According to our predicted results, the EMS calls are most likely to happen in the mid-north part of the city, especially the areas near the west coast. So, the app could use this prediction result to provide the trip destinations for ambulance drivers.

  2. The accuracy and generalizability of our model could be improved by adding more related variables.

App Wireframe

We can then integrate our modeling mechanism into an APP for ambulance drivers, the detailed logic for our APP is as follows:

And below are the detailed APP wireframe. Basically, we would like to keep the wireframe as precise and easy for users to handle with and provide the fastest route to direct to the right place:

Conclusion and improvement

We would recommend our models for EMS calls prediction for Virginia Beach. Space-Time model narrows down the time range for likely EMS calls, and Poisson Regression indicates the specific location by grid cells of EMS demand. However, echoing from the beginning, in daily life, EMS calls are relatively random and people don’t know when and where something emergent will happen. Thus, the ultimate goal of our models is to use “what we’ve known” from existing EMS calls to predict as accurately as possible. And successfully, the models tell us a general idea about when and where are more likely to have EMS demand. VBEMS can allocate more ambulances in the stations in these potential areas, and it can also choose to have more rescue stations if there is enough financial budget.

The limitation of our models primarily touches on inadequacy of data. During our model-building process, we’ve already noticed that some places lack of socio-economic data. Furthermore, it would be more helpful to have more regional data including health conditions, accidents (crime, traffic) and environmental factors.

The prediction results may also be helpful for local government and VBEMS to investigate the areas with high EMS demand to figure out why: are there any social/environmental/health issues? Because high EMS demand is not a good social phenomenon for a specific region. So the government could do something to improve the region by providing assistance.