Bike share is a network of bicycles located at self-service rental stations around the city. Bike share will be available all year long and 24 hours a day. With bike share, you can ride from station to station whenever you need a convenient and affordable way to travel. The Indego bike share system in Philadelphia features hundreds of bikes available 24/7, 365 days a year. Station locations were selected based on proximity to community resources, employment centers, bike infrastructure, and transit in consultation with partner agencies, institutions, community groups, and the public. (reference)
The algorithm model aims to predict Indego bike share demand by taking time, physical and socio-economic factors into account, so that the model can assist Indego to better allocate bike share resources based on demand varying across time and space, and hopefully plan and design new bike share stations in future. The model will perform meaningfully for decision-making process.
Based on existing available bike share trip data and variables, the result from the current model shows that the model needs to be improved for predicting the demand where and when bike share trips are more frequent to take place, because current model predicts well only in the areas and time slots where and when bike share is less demanded. In addition, current model should be improved and retrofitted to generalize different socio-economic contexts equally, so that the demand can be accurately predicted and resources can be equally distributed among different groups of people.
The census data is from ACS 2017. We will mainly focus on percent of White population, mean commute time, percent of population taking public transit, median household income and median age for the subsequent analysis.
#2.3 Import census info
#Census info:
phillyCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
#Census tracts:
#Create an sf object that consists only of GEOIDs and geometries
phillyTracts <-
phillyCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf
#Add the spatial information to our rideshare data as origin and destination data,
#first joining the origin station,
#then the destination station to our census data.
philly_dat_census <- st_join(philly_dat2 %>%
filter(is.na(start_lat) == FALSE &
is.na(start_lon) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lon) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>% #Origin.Tract
mutate(from_longitude = unlist(map(geometry, 1)), #get long column
from_latitude = unlist(map(geometry, 2)))%>% #get lat column
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>% #Destination.Tract
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)
The weather data is imported from Philadelphia International Airport (code PHL). There are data of temperature, wind speed, precipitation on an hourly basis.
The following plots the temperature, precipitation and wind speed trends over our study period:
#2.4. Import Weather Data
#Import weather data from O’Hare airport (code ORD) using riem_measures
#We can mutate the data to get temperature, wind speed, precipitation
#on an hourly basis and plot the temperature and precipitation trends over our study period.
philly_weather.Panel <-
riem_measures(station = "PHL", date_start = "2019-07-01", date_end = "2019-09-30") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
#Plot!!!!!
grid.arrange(
ggplot(philly_weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(philly_weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(philly_weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Philadelphia PHL - Jul~Sept 2019")
We can also look at how bike share trip counts varies by week, time in the day and station. We begin by examining the time and frequency components of our data.
The plot below breaks out the study time series in small multiple form by week. This approach provides an even better illustration of the similarity from week to week. Besides July 1 and July 2, the overall pattern of each week is similar with each other.
Monday and two holidays are highlighted. We clearly see the discrepancy of trip count between weekend and Monday, and between holidays and non-holidays:
mondayMidnight <-
philly_ride.panel %>%
mutate(mondayMidnight = ifelse(wday(interval60) == 2 & hour(interval60) == 1,
as.POSIXct(interval60),0)) %>%
filter(mondayMidnight != 0)
#1. Bike share trips by week
as.data.frame(philly_ride.panel) %>%
group_by(week, interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
ggplot(aes(interval60,Trip_Count)) +
geom_line() +
plotTheme +
facet_wrap(~week, scales="free", ncol=3) +
ylim(0,500) +
labs(title="Trip count by week. Philadelphia, Jul~Sept 2019",
subtitle="Monday marked in black; Independence Day and Labor Day in blue", x="Day", y="Trip Counts") +
geom_vline(xintercept = as.POSIXct("2019-07-04 01:00:00 UTC"),color = "blue",size=1.1) +
geom_vline(xintercept = as.POSIXct("2019-09-02 01:00:00 UTC"),color = "blue",size=1.1) +
geom_vline(data = mondayMidnight, aes(xintercept = mondayMidnight), colour="black")
We can also look at the overall time pattern - there is clearly a daily periodicity and there are lull periods on weekends:
#A. Overall time pattern for bike share-
#there is clearly a daily periodicity and there are lull periods on weekends
ggplot(philly_dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n),colour="#6C4E8C")+
labs(title="Bike share trips per hour. Philadelphia, Jul~Sept 2019",
x="Date",
y="Number of trips")+
plotTheme
Let’s examine the distribution of trip volume by station for different times of the day. We clearly have a few high volume periods but mostly low volume. Our data must consist of a lot of low demand station/hours and a few high demand station hours:
#B. Distribution of trip volume by station for different times of the day
philly_dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1,fill='#6C4E8C')+
labs(title="Mean number of hourly trips per station. Philadelphia, Jul~Sept 2019",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
#C. Bike share trips per HOUR by STATION
ggplot(philly_dat_census %>%
group_by(interval60, start_station) %>%
tally())+ #Summarize
geom_histogram(aes(n), binwidth = 5,fill='#6C4E8C')+
labs(title="Bike share trips per hour by station. Philadelphia, Jul~Sept 2019",
x="Trip Counts",
y="Number of Stations")+
plotTheme
Visually, we can map the bike share trips per hour by stations to have a straightforward observation. We can also divide it into weekday and weekend. It is clear to see on weekdays, bike share trips occur more during PM rush hours in Center City. While on weekends, bike share trips occur much fewer in count and less concentrated in physical space:
#F: Bike share trips per hour by station -- [MAP]
ggplot()+
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_point(data = philly_dat_census%>%
mutate(hour = hour(starttime),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
tally(),
aes(x=from_longitude, y = from_latitude, color = n),
fill = "transparent", alpha = 1, size = 0.5)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hour by station. Philadelphia, Jul~Sept 2019")+
mapTheme
Zooming in to see the time pattern, we can see bike share trips take place more on Monday and Tuesday during 15:00- 20:00 which includes the PM rush hours:
ggplot(philly_dat_census %>% mutate(hour = hour(starttime)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips by day of the week. Philadelphia, Jul~Sept 2019",
x="Hour",
y="Trip Counts")+
plotTheme
To synthesize, weekdays have more bike share trips than weekend. But in weekends, more trips take place from 10:00 to 15:00, we can assume that it is due to more recreation activities:
#E. Bike share tips: WEEKEND VS WEEKDAY
ggplot(philly_dat_census %>%
mutate(hour = hour(starttime),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips - weekend vs weekday. Philadelphia, Jul~Sept 2019",
x="Hour",
y="Trip Counts")+
plotTheme
We cam also visualize the spatial distribution of bike share trip count to understand if there are systematic rideshare patterns across space that can be harnessed when forecasting. Besides Week 26 which only covers two days of our study period, the spatial pattern for the rest of weeks are similar:
#Trip_Count variation across space
#We look at the spatial pattern of trips across Philly
philly_ride.panel %>%
group_by(week, Origin.Tract) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf()%>%
ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
facet_wrap(~week, ncol = 5) +
scale_fill_manual(values = palette5,
labels = c("0","81", "167", "322", "1699"),
name = "Trip_Count") +
labs(title="Sum of bike share trips by tract and week",
subtitle='Philadelphia, Jul~Sept 2019') +
mapTheme+ theme(legend.position = "right")
Next, we visualize the spatial distribution of trip count by day of the week. There are some clear differences between weekends and weekdays. During weekdays, Monday, Tuesday and Friday have more trip count than Wednesday and Thursday. During all seven days in a week, trips seems to eminate outward from Center City:
#Sum of trips by tract and day of the week
philly_ride.panel %>%
group_by(dotw, Origin.Tract) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf()%>%
ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
facet_wrap(~dotw, ncol = 7) +
scale_fill_manual(values = palette5,
labels = c("32","175", "348", "628", "3117"),
name = "Trip_Count") +
labs(title="Sum of bike share trips by tract and day of the week",
subtitle='Philadelphia, Jul~Sept 2019') +
mapTheme+ theme(legend.position="right")
For the spatial pattern in each hour, we can clear see the transition/change of trip count from night to morning and morning to night, from Center City to outward:
#Sum of bike share trips by tract and hour of the day
philly_ride.panel %>%
group_by(hour = hour(interval60), Origin.Tract) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf()%>%
ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
facet_wrap(~hour, ncol = 8) +
scale_fill_manual(values = palette5,
labels = c("0", "26", "76", "183", "3308"),
name = "Trip_Count") +
labs(title="Sum of bike share trips by tract and hour of the day",
subtitle="Philadelphia, Jul~Sept 2019") +
mapTheme + theme(legend.position = "right")
Lastly, we can build an animated gif map of rideshare trips over both space and time with 15 minute interval:
week33<-
philly_dat_census %>%
filter(week == 33 & dotw == "Mon")
week33.panel <-
expand.grid(
interval15=unique(week33$interval15),
Origin.Tract = unique(week33$Origin.Tract))
ride.animation.data <-
week33 %>%
mutate(Trip_Counter = 1) %>%
right_join(week33.panel) %>%
group_by(interval15, Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(phillyTracts,by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
Trip_Count > 20 ~ "21+ trips")) %>%
mutate(Trips = factor(Trips, levels=c("0 trips","1-5 trips","6-10 trips","11-20 trips","21+ trips")))
rideshare_animation <-
ggplot() +
geom_sf(data=ride.animation.data, aes(fill=Trips)) +
scale_fill_manual(values = palette5) +
labs(title = "Bike share trips for one day in August 2019, Philadelphia",
subtitle = "15 minute intervals: {current_frame}", caption = "Ran Xin") +
transition_manual(interval15)
animate(rideshare_animation, duration=20)
We can then explore the correlation between bike share trip count and different variables.
Bike share trip count as a function of the various time lags are both in table and visualized below. These effects are very strong, we can clearly see the correlation decreases with each additional hour, however, the predictive power returns with the 24 hour lag:
#2. Evaluate correlations in these lags
plotData.lag<-as.data.frame(philly_ride.panel) %>%
group_by(interval60) %>%
summarise_at(c(vars(starts_with("lag"), "Trip_Count")), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag <-
plotData.lag %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
correlation.lag %>%
kable(caption = "Correlation between trip count and time lags") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#CFC0D1")%>%
row_spec(3, color = "black", background = "#CFC0D1")%>%
row_spec(5, color = "black", background = "#CFC0D1")
Variable | correlation |
---|---|
lagHour | 0.81 |
lag2Hours | 0.56 |
lag3Hours | 0.35 |
lag4Hours | 0.17 |
lag12Hours | -0.45 |
lag1day | 0.80 |
#2.1. Relationship: trip count and time lags (plot)
plotData.lag %>%
ggplot(aes(Value, Trip_Count)) +
geom_point() + geom_smooth(method = "lm", se = F) +
facet_wrap(~Variable) +
geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "blue",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
labs(title = "Trip count as a function of time lags",
subtitle= "Philadelphia, Jul~Sept 2019", x = "Lag Trip Count") +
plotTheme
Let’s look at precipitation first as a function of bike share trips.
Looking at week by group means in the resulting barplot does not neccessarily support the hypothesis that rain and snow increase the propensity to take rideshare. In most weeks of the study period, bike share trips decrease in rainy/snowy day:
#3. Relationship: trip count and weather - Precipitation
as.data.frame(philly_ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(philly_weather.Panel) %>%
mutate(isPercip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
na.omit()%>%
group_by(week = week(interval60), isPercip) %>%
summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
ggplot(aes(isPercip, Mean_Trip_Count)) +
geom_bar(stat = "identity") +
facet_wrap(~week,ncol=7) +
labs(title="Does ridership vary when it's raining or snowing?(NA removed)",
subtitle="Mean trip count by week; Philadelphia, Jul~Sept 2019",
x="Percipitation", y="Mean Trip Count") +
plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Do more people take ride share when it’s colder? Let’s look at a small multiple plot of bike share trip count as a function of temperature by week. But they suggest the opposite - that ride share actually increases as temperatures get warmer:
#4. Relationship: trip count and weather - Temperature
as.data.frame(philly_ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(philly_weather.Panel) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point() + facet_wrap(~week,ncol=3) + geom_smooth(method = "lm", se= FALSE) +
xlim(40,95)+
plotTheme +
labs(title="Trip count as a function of temperature by week",
subtitle="Trip count by week; Philadelphia, Jul~Sept 2019",
x="Temperature", y="Mean Trip Count")
Let’s look at the wind speed.The plot shows that besides Week 26, the bike share demand increases a little bit when it is more windy:
as.data.frame(philly_ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(philly_weather.Panel) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Wind_Speed, Trip_Count)) +
geom_point() + facet_wrap(~week,ncol=3) + geom_smooth(method = "lm", se= FALSE) +
xlim(0,30)+
plotTheme +
labs(title="Trip Count as a function of wind speed by week",
subtitle="Trip count by week; Philadelphia, Jul~Sept 2019",
x="Temperature", y="Mean Trip Count")
Because of the space/time panel nature of the data, census data cannot be used in the forecast. Nevertheless, for the purpose of exploratory analysis, we can still briefly analyzes correlation between trip count by tract for one week as a function of 5 census variables. The table and plot of correlation are shown as below.
We can see where there is higher percentage of White population, higher mean commute time and median household income, there are more bike share trips. While the trip count decreases a little bit as median age of population in this tract increase. And trip count also has an inverse relationship with percent of the population taking the public transit. One guess is that when more people choose public transit such as SEPTA trolley, metro or bus, the demand of bike share decreases, because those public transits are more speedy for longer distance travel:
#6. Relationship: trip count and socio-economic
plotData.census <-
as.data.frame(philly_ride.panel) %>%
group_by(Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(phillyCensus, by=c("Origin.Tract" = "GEOID")) %>%
dplyr::select(Med_Inc,
Mean_Commute_Time,
Percent_Taking_Public_Trans,
Percent_White,
Med_Age,
Trip_Count) %>%
gather(Variable, Value, -Trip_Count) %>%
na.omit() %>%
filter(Variable =="Med_Inc"| Variable == "Total_Pop"|Variable == "Percent_White"|
Variable == "Mean_Commute_Time"|Variable == "Percent_Taking_Public_Trans"|Variable == "Med_Age")
#Correlation with table
correlation.census <-
plotData.census %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, log(Trip_Count)),2))
correlation.census%>%
kable(caption = "Trip count (log) as a function of socio-economic factors") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#CFC0D1")%>%
row_spec(3, color = "black", background = "#CFC0D1")%>%
row_spec(5, color = "black", background = "#CFC0D1")
Variable | correlation |
---|---|
Mean_Commute_Time | 0.39 |
Med_Age | -0.03 |
Med_Inc | 0.56 |
Percent_Taking_Public_Trans | -0.63 |
Percent_White | 0.54 |
#Correlation with plot
ggplot(plotData.census, aes(Value,log(Trip_Count))) + geom_point() + geom_smooth(method="lm", se = F) +
facet_wrap(~Variable, scales="free", ncol=2) +
geom_text(data=correlation.census, aes(label=paste("R =", correlation)),
colour = "blue", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
plotTheme+
labs(title="Trip count (log) as a function of selected census variables",
subtitle="Trip count by census tract; Philadelphia, Jul~Sept 2019")
I split the data into a training set and a test test (Week 33 as the “boundary”, there are 7 test set weeks). I created 5 linear models using the “lm” function in the training set, but made predictions and get results in the test set.
Model 1: “reg1”, focuses on just time effects. The first model includes hour fixed effects, day of the week fixed effects and temperature. Lag effects are not included for now.
Model 2: “reg2”, replaces the time fixed effects with a vector of origin tracts fixed effects.
Model 3: “reg3”, includes both time and space fixed effects.
Model 4: “reg4”, adds the time lag features.
Model 5: “reg5”, adds holiday and holidayLag effects.
Then I applied the five models to come out with predicted value and error values correspondingly.
#Create five linear models using the lm funtion
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=philly_ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + dotw + Temperature, data=philly_ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation,
data=philly_ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, data=philly_ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, data=philly_ride.Train)
#4.2. Predict for test data
#create a nested data frame of test data by week.
philly_ride.Test.weekNest <-
philly_ride.Test %>%
nest(-week)
#Create a function called model_pred which we can then map onto each data frame in our nested structure
#The argument fit takes the **name of a regression** you have created that you want to use to make predictions
#map(.x = data, fit = name_of_your_regression, .f = model_pred)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
philly_week_predictions <- #For test set!
philly_ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
The MAE is plotted by model by week as below. We see that the holiday effects do not bring much additional predictive power to the model, except Week 27. Conversely, in the rest of the six weeks, holiday effects make the prediction less accurate. Note that both the training and test has one holiday. However, the fact shows that holiday effects do not work well.
#A. MAE by model specification and week
philly_week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week",
subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019") +
plotTheme
For each model, the predicted and observed trip count is plotted in time series form below. Probably the simple, most all-encompassing interpretation of the results below is that with increasing sophistication, it becomes increasingly possible to predict well for the trip count spikes.
It is clear to see time lag effects add a tremendous amount of predictive power to this model, very nearly predicting the highest spikes. Not surprisingly, bike share patterns from the very recent past do a great job predicting in the near future. But simply observing from the plot, holiday effects also don’t bring consipicuous predictive power.
#B. Predicted/Observed bike share time series
philly_week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share trips by time series",
subtitle = "Philadelphia, Jul~Sept 2019; A test set of 7 weeks",
x = "Hour", y= "Station Trips") +
plotTheme
Therefore, by examining Mean Absolute Error (MAE) and the plot above, I would stick to Model 4 (with time lags, no holiday effects) which seems to have the best goodness of fit generally.
There are different dimensions to test the generalizability of the model (Model 4): space, time and socio-economic factors. The core question to be answered is: does the model predict well in different context of those dimensions?
We can look at our mean absolute errors by station - there seems to be a spatial pattern to our error, for instance, several stations in the center have high errors (more than 1 trip) for predicted trip count. But we need to go a bit further to get at the spatial and temporal element of the error.
philly_week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
dplyr::select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags") %>%
group_by(start_station, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
labs(title="Mean Absolute Errors of Test Set by Station from Model 4",
subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019")+
mapTheme
If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge. We are certainly underpredicting in general:
philly_week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw)) %>%
dplyr::select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted bike share trips by weekday/weekend from Model 4\nA test set of 7 weeks; Philadelphia, Jul~Sept 2019",
x="Observed trips",
y="Predicted trips",
subtitle='Black line:45 degree; Red line: observed vs predicted')+
plotTheme
Thus, in detail, let’s look at our errors on a map by weekend/weekday and time of day. Weekends and weekdays have different error patterns in time. Between each time of the day, during weekdays, Central City tends to have more errors in AM rush and PM rush hours; while during weekends, Central City has more errors in Mid-Day and PM Rush, which means the predictive ability of the model does not perform well during these periods. Comparing weekdays and weekends in general, weekdays tend to have more errors:
#Let’s look at our errors on a map by weekend/weekday and time of day.
philly_week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
dplyr::select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size = 0.5, alpha = 1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors of Test Set \nby weekend/weekday and time of day from Model 4",
subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019")+
mapTheme
Let’s look at the generalizability of the model by census tracts. We can see the high errors are mostly concentrated in Central City where has higher density of population, facilities and office places, and more trips are likely to take place. It seems like the errors radiate from the Central City, we cannot tell the model is better at predicting in time or in space.
filter(philly_week_predictions, Regression == "DTime_Space_FE_timeLags") %>%
unnest() %>%
left_join(., phillyTracts, by = c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
dplyr::select(Origin.Tract, Absolute_Error, geometry) %>%
gather(Variable, Value, -Origin.Tract, -geometry) %>%
group_by(Variable, Origin.Tract) %>%
na.omit() %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = q5(MAE))) +
scale_fill_manual(values = palette5,
labels = c("2", "3", "4", "5", "8"),
name = "MAE") +
labs(title="Chloropleth map of errors (MAE) by neighborhood from Model 4",
subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019") +
mapTheme
Obviously from the plot of errors below, the model cannot predict well equally for different socio-economic contexts:
Mean commute time: the model predicts pretty well where people spend less commute time.
Percent of population taking public transit: the model predicts pretty well where there is higher percent of population taking public transit.
Median age: the model predicts pretty well where people with median age around 20 and median age above 50.
Percent of White population: the model predicts well where there is fewer percent of White population.
Median household income: the model predicts well where there is lower median household income (less than 30,000).
#Error as a function of socio-economic factors
#Add median age
philly_week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station= map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White),
Med_Age=map(data,pull,Med_Age),
Mean_Commute_Time = map(data,pull, Mean_Commute_Time)) %>%
dplyr::select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Med_Age,Mean_Commute_Time) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Med_Age,Mean_Commute_Time) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
subtitle = "Philadelphia, Jul~Sept 2019; A test set of 7 weeks",
y="Mean Absolute Error (Trips)")+
plotTheme
To sum up, the model does not predict well for where and when more trips are likely to occur, as there are higher errors taking place in those context, for example, the rush hours in Central City. And the key takeaway is that the model underpredicts the bike share demand during rush hours in Central City in weekdays. Therefore, one piece of advice is, Indego should supply more bike share during those time in Central City, and people should be prepared to schedule well if they need to use bike share service at those time in Central City.
The model is good at predicting when and where trips are less frequent, such as non-rush hour during weekdays and in the areas aways from Central City.
Furthermore, the model cannot generalize equally for different socio-economic context.