Bike share prediction - Indego in Philadelphia

Executive summary

Bike share is a network of bicycles located at self-service rental stations around the city. Bike share will be available all year long and 24 hours a day. With bike share, you can ride from station to station whenever you need a convenient and affordable way to travel. The Indego bike share system in Philadelphia features hundreds of bikes available 24/7, 365 days a year. Station locations were selected based on proximity to community resources, employment centers, bike infrastructure, and transit in consultation with partner agencies, institutions, community groups, and the public. (reference)

The algorithm model aims to predict Indego bike share demand by taking time, physical and socio-economic factors into account, so that the model can assist Indego to better allocate bike share resources based on demand varying across time and space, and hopefully plan and design new bike share stations in future. The model will perform meaningfully for decision-making process.

Based on existing available bike share trip data and variables, the result from the current model shows that the model needs to be improved for predicting the demand where and when bike share trips are more frequent to take place, because current model predicts well only in the areas and time slots where and when bike share is less demanded. In addition, current model should be improved and retrofitted to generalize different socio-economic contexts equally, so that the demand can be accurately predicted and resources can be equally distributed among different groups of people.

Data

Bike share data

The data used is the Indego bike share data from 2019 Q3 (July 1-September 30). During this time period, there are 14 weeks with 2 national holidays: Independence Day (July 4) and Labor Day (Sept 2).

The components of the selected bike share data is as follows:

philly_dat<-read.csv('/Users/ranxin/Desktop/CPLN590/HW7/indego-trips-2019-q3.csv')

#Let’s use some date parsing to bin" the data by 15 and 60 minute intervals by rounding.
#Notice we use the time format ymd_hms to denote year, month, day and hour, minute and seccond. 
#We extract the week of the observation (ranging from 1-52 throughout the year) and the dotw for day of the week.

#Convert time to standard format first!
#The raw time data is not standardized
philly_dat2 <- philly_dat %>%
  mutate(starttime = mdy_hm(start_time),
         endtime=mdy_hm(end_time))

philly_dat2 <- philly_dat2 %>%
  mutate(interval60 = floor_date(ymd_hms(starttime), unit = "hour"),
         interval15 = floor_date(ymd_hms(starttime), unit = "15 mins"),
         week = week(interval60), 
         dotw = wday(interval60, label=TRUE)) #day of the week!!

require(readr)
require(readxl)
metadata <- read_excel("/Users/ranxin/Desktop/CPLN590/HW7/indego_metadata.xlsx")
kable(metadata,
      caption = "Indego bike share metadata") %>%
  kable_styling("striped", full_width = F)

Indego bike share metadata
Column	Description
trip_id	Locally unique integer that identifies the trip
duration	Length of trip in minutes
start_time	The date/time when the trip began, presented in ISO 8601 format in local time
end_time	The date/time when the trip ended, presented in ISO 8601 format in local time
start_station	The station ID where the trip originated (for station name and more information on each station see the Station Table)
start_lat	The latitude of the station where the trip originated
start_lon	The longitude of the station where the trip originated
end_station	The station ID where the trip terminated (for station name and more information on each station see the Station Table)
end_lat	The latitude of the station where the trip terminated
end_lon	The longitude of the station where the trip terminated
bike_id	Locally unique integer that identifies the bike
plan_duration	The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)
trip_route_category	“Round Trip” for trips starting and ending at the same station or “One Way” for all other trips
passholder_type	The name of the passholder’s plan
bike_type	The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes

Census data

The census data is from ACS 2017. We will mainly focus on percent of White population, mean commute time, percent of population taking public transit, median household income and median age for the subsequent analysis.

#2.3 Import census info
#Census info:
phillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

#Census tracts:
#Create an sf object that consists only of GEOIDs and geometries
phillyTracts <- 
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf

#Add the spatial information to our rideshare data as origin and destination data, 
#first joining the origin station, 
#then the destination station to our census data. 
philly_dat_census <- st_join(philly_dat2 %>% 
                        filter(is.na(start_lat) == FALSE &
                                 is.na(start_lon) == FALSE &
                                 is.na(end_lat) == FALSE &
                                 is.na(end_lon) == FALSE) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      phillyTracts %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>% #Origin.Tract
  mutate(from_longitude = unlist(map(geometry, 1)), #get long column
         from_latitude = unlist(map(geometry, 2)))%>%  #get lat column
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>% #Destination.Tract
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)

Weather data

The weather data is imported from Philadelphia International Airport (code PHL). There are data of temperature, wind speed, precipitation on an hourly basis.

The following plots the temperature, precipitation and wind speed trends over our study period:

#2.4. Import Weather Data
#Import weather data from O’Hare airport (code ORD) using riem_measures
#We can mutate the data to get temperature, wind speed, precipitation 
#on an hourly basis and plot the temperature and precipitation trends over our study period.

philly_weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2019-07-01", date_end = "2019-09-30") %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#Plot!!!!!
grid.arrange(
  ggplot(philly_weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
    labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(philly_weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(philly_weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Philadelphia PHL - Jul~Sept 2019")

Exploratory analysis: trip count

We can also look at how bike share trip counts varies by week, time in the day and station. We begin by examining the time and frequency components of our data.

Temporal pattern

The plot below breaks out the study time series in small multiple form by week. This approach provides an even better illustration of the similarity from week to week. Besides July 1 and July 2, the overall pattern of each week is similar with each other.

Monday and two holidays are highlighted. We clearly see the discrepancy of trip count between weekend and Monday, and between holidays and non-holidays:

mondayMidnight <- 
  philly_ride.panel %>%
  mutate(mondayMidnight = ifelse(wday(interval60) == 2 & hour(interval60) == 1,
                                 as.POSIXct(interval60),0)) %>%
  filter(mondayMidnight != 0) 

#1. Bike share trips by week
as.data.frame(philly_ride.panel) %>%
  group_by(week, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ggplot(aes(interval60,Trip_Count)) + 
  geom_line() +
  plotTheme +
  facet_wrap(~week, scales="free", ncol=3) +
  ylim(0,500) +
  labs(title="Trip count by week. Philadelphia, Jul~Sept 2019",
       subtitle="Monday marked in black; Independence Day and Labor Day in blue", x="Day", y="Trip Counts") +
  geom_vline(xintercept = as.POSIXct("2019-07-04 01:00:00 UTC"),color = "blue",size=1.1) +
  geom_vline(xintercept = as.POSIXct("2019-09-02 01:00:00 UTC"),color = "blue",size=1.1) +
  geom_vline(data = mondayMidnight, aes(xintercept = mondayMidnight), colour="black")

We can also look at the overall time pattern - there is clearly a daily periodicity and there are lull periods on weekends:

#A. Overall time pattern for bike share- 
#there is clearly a daily periodicity and there are lull periods on weekends
ggplot(philly_dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n),colour="#6C4E8C")+
  labs(title="Bike share trips per hour. Philadelphia, Jul~Sept 2019",
       x="Date", 
       y="Number of trips")+
  plotTheme

Let’s examine the distribution of trip volume by station for different times of the day. We clearly have a few high volume periods but mostly low volume. Our data must consist of a lot of low demand station/hours and a few high demand station hours:

#B. Distribution of trip volume by station for different times of the day
philly_dat_census %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, start_station, time_of_day) %>%
  tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1,fill='#6C4E8C')+
  labs(title="Mean number of hourly trips per station. Philadelphia, Jul~Sept 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

#C. Bike share trips per HOUR by STATION
ggplot(philly_dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+ #Summarize
  geom_histogram(aes(n), binwidth = 5,fill='#6C4E8C')+
  labs(title="Bike share trips per hour by station. Philadelphia, Jul~Sept 2019",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme

Visually, we can map the bike share trips per hour by stations to have a straightforward observation. We can also divide it into weekday and weekend. It is clear to see on weekdays, bike share trips occur more during PM rush hours in Center City. While on weekends, bike share trips occur much fewer in count and less concentrated in physical space:

#F: Bike share trips per hour by station -- [MAP]
ggplot()+
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_point(data = philly_dat_census%>% 
               mutate(hour = hour(starttime),
                      weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
               tally(),
             aes(x=from_longitude, y = from_latitude, color = n), 
             fill = "transparent", alpha = 1, size = 0.5)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
  xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hour by station. Philadelphia, Jul~Sept 2019")+
  mapTheme

Zooming in to see the time pattern, we can see bike share trips take place more on Monday and Tuesday during 15:00- 20:00 which includes the PM rush hours:

ggplot(philly_dat_census %>% mutate(hour = hour(starttime)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips by day of the week. Philadelphia, Jul~Sept 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme

To synthesize, weekdays have more bike share trips than weekend. But in weekends, more trips take place from 10:00 to 15:00, we can assume that it is due to more recreation activities:

#E. Bike share tips: WEEKEND VS WEEKDAY
ggplot(philly_dat_census %>% 
         mutate(hour = hour(starttime),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips - weekend vs weekday. Philadelphia, Jul~Sept 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme

Spatial pattern

We cam also visualize the spatial distribution of bike share trip count to understand if there are systematic rideshare patterns across space that can be harnessed when forecasting. Besides Week 26 which only covers two days of our study period, the spatial pattern for the rest of weeks are similar:

#Trip_Count variation across space
#We look at the spatial pattern of trips across Philly
philly_ride.panel %>% 
  group_by(week, Origin.Tract) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
  facet_wrap(~week, ncol = 5) +
  scale_fill_manual(values = palette5,
                    labels = c("0","81", "167", "322", "1699"),
                    name = "Trip_Count") +
  labs(title="Sum of bike share trips by tract and week",
       subtitle='Philadelphia, Jul~Sept 2019') +
  mapTheme+ theme(legend.position = "right")

Next, we visualize the spatial distribution of trip count by day of the week. There are some clear differences between weekends and weekdays. During weekdays, Monday, Tuesday and Friday have more trip count than Wednesday and Thursday. During all seven days in a week, trips seems to eminate outward from Center City：

#Sum of trips by tract and day of the week
philly_ride.panel %>% 
  group_by(dotw, Origin.Tract) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
  facet_wrap(~dotw, ncol = 7) +
  scale_fill_manual(values = palette5,
                    labels = c("32","175", "348", "628", "3117"),
                    name = "Trip_Count") +
  labs(title="Sum of bike share trips by tract and day of the week",
       subtitle='Philadelphia, Jul~Sept 2019') +
  mapTheme+ theme(legend.position="right")

For the spatial pattern in each hour, we can clear see the transition/change of trip count from night to morning and morning to night, from Center City to outward:

#Sum of bike share trips by tract and hour of the day
philly_ride.panel %>% 
  group_by(hour = hour(interval60), Origin.Tract) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  left_join(phillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
  st_sf()%>%
  ggplot() + geom_sf(aes(fill = q5(Sum_Trip_Count))) +
  facet_wrap(~hour, ncol = 8) +
  scale_fill_manual(values = palette5,
                    labels = c("0", "26", "76", "183", "3308"),
                    name = "Trip_Count") +
  labs(title="Sum of bike share trips by tract and hour of the day", 
       subtitle="Philadelphia, Jul~Sept 2019") +
  mapTheme + theme(legend.position = "right")

Space/time animation

Lastly, we can build an animated gif map of rideshare trips over both space and time with 15 minute interval:

week33<-
  philly_dat_census %>%
  filter(week == 33 & dotw == "Mon")

week33.panel <-
  expand.grid(
    interval15=unique(week33$interval15),
    Origin.Tract = unique(week33$Origin.Tract))

ride.animation.data <-
  week33 %>%
  mutate(Trip_Counter = 1) %>%
  right_join(week33.panel) %>% 
  group_by(interval15, Origin.Tract) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  left_join(phillyTracts,by=c("Origin.Tract" = "GEOID")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
                           Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
                           Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
                           Trip_Count > 20 ~ "21+ trips")) %>%
  mutate(Trips  = factor(Trips, levels=c("0 trips","1-5 trips","6-10 trips","11-20 trips","21+ trips")))

rideshare_animation <-
  ggplot() +
  geom_sf(data=ride.animation.data, aes(fill=Trips)) +
  scale_fill_manual(values = palette5) +
  labs(title = "Bike share trips for one day in August 2019, Philadelphia",
       subtitle = "15 minute intervals: {current_frame}", caption = "Ran Xin") +
  transition_manual(interval15)

animate(rideshare_animation, duration=20)

Exploratory analysis: trip count and variables

We can then explore the correlation between bike share trip count and different variables.

Time lags

Bike share trip count as a function of the various time lags are both in table and visualized below. These effects are very strong, we can clearly see the correlation decreases with each additional hour, however, the predictive power returns with the 24 hour lag:

#2. Evaluate correlations in these lags
plotData.lag<-as.data.frame(philly_ride.panel) %>%
  group_by(interval60) %>% 
  summarise_at(c(vars(starts_with("lag"), "Trip_Count")), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                              "lag12Hours","lag1day")))
correlation.lag <-
  plotData.lag %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, Trip_Count),2))

correlation.lag %>%
  kable(caption = "Correlation between trip count and time lags") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#CFC0D1")%>%
  row_spec(3, color = "black", background = "#CFC0D1")%>%
  row_spec(5, color = "black", background = "#CFC0D1")

Correlation between trip count and time lags
Variable	correlation
lagHour	0.81
lag2Hours	0.56
lag3Hours	0.35
lag4Hours	0.17
lag12Hours	-0.45
lag1day	0.80

#2.1. Relationship: trip count and time lags (plot)
plotData.lag %>%
  ggplot(aes(Value, Trip_Count)) + 
  geom_point() + geom_smooth(method = "lm", se = F) + 
  facet_wrap(~Variable) +
  geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "blue", 
            x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  labs(title = "Trip count as a function of time lags", 
       subtitle= "Philadelphia, Jul~Sept 2019", x = "Lag Trip Count") +
  plotTheme

Weather

Let’s look at precipitation first as a function of bike share trips.

Looking at week by group means in the resulting barplot does not neccessarily support the hypothesis that rain and snow increase the propensity to take rideshare. In most weeks of the study period, bike share trips decrease in rainy/snowy day:

#3. Relationship: trip count and weather - Precipitation
as.data.frame(philly_ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(philly_weather.Panel) %>%
  mutate(isPercip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  na.omit()%>%
  group_by(week = week(interval60), isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPercip, Mean_Trip_Count)) + 
  geom_bar(stat = "identity") +
  facet_wrap(~week,ncol=7) +
  labs(title="Does ridership vary when it's raining or snowing?(NA removed)",
       subtitle="Mean trip count by week; Philadelphia, Jul~Sept 2019",
       x="Percipitation", y="Mean Trip Count") +
  plotTheme + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Do more people take ride share when it’s colder? Let’s look at a small multiple plot of bike share trip count as a function of temperature by week. But they suggest the opposite - that ride share actually increases as temperatures get warmer:

#4. Relationship: trip count and weather - Temperature
as.data.frame(philly_ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(philly_weather.Panel) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
  geom_point() + facet_wrap(~week,ncol=3) + geom_smooth(method = "lm", se= FALSE) +
  xlim(40,95)+
  plotTheme +
  labs(title="Trip count as a function of temperature by week",
       subtitle="Trip count by week; Philadelphia, Jul~Sept 2019",
       x="Temperature", y="Mean Trip Count")

Let’s look at the wind speed.The plot shows that besides Week 26, the bike share demand increases a little bit when it is more windy:

as.data.frame(philly_ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  left_join(philly_weather.Panel) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, Trip_Count)) + 
  geom_point() + facet_wrap(~week,ncol=3) + geom_smooth(method = "lm", se= FALSE) +
  xlim(0,30)+
  plotTheme +
  labs(title="Trip Count as a function of wind speed by week",
       subtitle="Trip count by week; Philadelphia, Jul~Sept 2019",
       x="Temperature", y="Mean Trip Count")

Socio-economic factors

Because of the space/time panel nature of the data, census data cannot be used in the forecast. Nevertheless, for the purpose of exploratory analysis, we can still briefly analyzes correlation between trip count by tract for one week as a function of 5 census variables. The table and plot of correlation are shown as below.

We can see where there is higher percentage of White population, higher mean commute time and median household income, there are more bike share trips. While the trip count decreases a little bit as median age of population in this tract increase. And trip count also has an inverse relationship with percent of the population taking the public transit. One guess is that when more people choose public transit such as SEPTA trolley, metro or bus, the demand of bike share decreases, because those public transits are more speedy for longer distance travel:

#6. Relationship: trip count and socio-economic
plotData.census <- 
  as.data.frame(philly_ride.panel) %>%
  group_by(Origin.Tract) %>% 
  summarize(Trip_Count = sum(Trip_Count))  %>%
  left_join(phillyCensus, by=c("Origin.Tract" = "GEOID")) %>%
  dplyr::select(Med_Inc,
                Mean_Commute_Time,
                Percent_Taking_Public_Trans,
                Percent_White,
                Med_Age, 
                Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  na.omit() %>%
  filter(Variable =="Med_Inc"| Variable == "Total_Pop"|Variable == "Percent_White"|
           Variable == "Mean_Commute_Time"|Variable == "Percent_Taking_Public_Trans"|Variable == "Med_Age") 

#Correlation with table
correlation.census <-
  plotData.census %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, log(Trip_Count)),2))

correlation.census%>% 
  kable(caption = "Trip count (log) as a function of socio-economic factors") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#CFC0D1")%>%
  row_spec(3, color = "black", background = "#CFC0D1")%>%
  row_spec(5, color = "black", background = "#CFC0D1")

Trip count (log) as a function of socio-economic factors
Variable	correlation
Mean_Commute_Time	0.39
Med_Age	-0.03
Med_Inc	0.56
Percent_Taking_Public_Trans	-0.63
Percent_White	0.54

#Correlation with plot
ggplot(plotData.census, aes(Value,log(Trip_Count))) + geom_point() + geom_smooth(method="lm", se = F) +
  facet_wrap(~Variable, scales="free", ncol=2) +
  geom_text(data=correlation.census, aes(label=paste("R =", correlation)),
            colour = "blue", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
  plotTheme+
  labs(title="Trip count (log) as a function of selected census variables",
       subtitle="Trip count by census tract; Philadelphia, Jul~Sept 2019")

Modeling

Model building and estimation

I split the data into a training set and a test test (Week 33 as the “boundary”, there are 7 test set weeks). I created 5 linear models using the “lm” function in the training set, but made predictions and get results in the test set.

Model 1: “reg1”, focuses on just time effects. The first model includes hour fixed effects, day of the week fixed effects and temperature. Lag effects are not included for now.

Model 2: “reg2”, replaces the time fixed effects with a vector of origin tracts fixed effects.

Model 3: “reg3”, includes both time and space fixed effects.

Model 4: “reg4”, adds the time lag features.

Model 5: “reg5”, adds holiday and holidayLag effects.

Then I applied the five models to come out with predicted value and error values correspondingly.

#Create five linear models using the lm funtion
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=philly_ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=philly_ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=philly_ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, data=philly_ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, data=philly_ride.Train)

#4.2. Predict for test data
#create a nested data frame of test data by week.
philly_ride.Test.weekNest <- 
  philly_ride.Test %>%
  nest(-week)

#Create a function called model_pred which we can then map onto each data frame in our nested structure
#The argument fit takes the **name of a regression** you have created that you want to use to make predictions
#map(.x = data, fit = name_of_your_regression, .f = model_pred)
model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

philly_week_predictions <-  #For test set!
  philly_ride.Test.weekNest %>% 
  mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
         BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
         CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
         DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
         ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

Accuracy

The MAE is plotted by model by week as below. We see that the holiday effects do not bring much additional predictive power to the model, except Week 27. Conversely, in the rest of the six weeks, holiday effects make the prediction less accurate. Note that both the training and test has one holiday. However, the fact shows that holiday effects do not work well.

#A. MAE by model specification and week
philly_week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  labs(title = "Mean Absolute Errors by model specification and week",
       subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019") +
  plotTheme

For each model, the predicted and observed trip count is plotted in time series form below. Probably the simple, most all-encompassing interpretation of the results below is that with increasing sophistication, it becomes increasingly possible to predict well for the trip count spikes.

It is clear to see time lag effects add a tremendous amount of predictive power to this model, very nearly predicting the highest spikes. Not surprisingly, bike share patterns from the very recent past do a great job predicting in the near future. But simply observing from the plot, holiday effects also don’t bring consipicuous predictive power.

#B. Predicted/Observed bike share time series
philly_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station)) %>%
  dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start_station) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed bike share trips by time series", 
       subtitle = "Philadelphia, Jul~Sept 2019; A test set of 7 weeks",  
       x = "Hour", y= "Station Trips") +
  plotTheme

Therefore, by examining Mean Absolute Error (MAE) and the plot above, I would stick to Model 4 (with time lags, no holiday effects) which seems to have the best goodness of fit generally.

Generalizability

There are different dimensions to test the generalizability of the model (Model 4): space, time and socio-economic factors. The core question to be answered is: does the model predict well in different context of those dimensions?

Stations

We can look at our mean absolute errors by station - there seems to be a spatial pattern to our error, for instance, several stations in the center have high errors (more than 1 trip) for predicted trip count. But we need to go a bit further to get at the spatial and temporal element of the error.

philly_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude)) %>%
  dplyr::select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(start_station, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 0.8)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
  xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
  labs(title="Mean Absolute Errors of Test Set by Station from Model 4", 
       subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019")+
  mapTheme

Time

If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge. We are certainly underpredicting in general:

philly_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw)) %>%
  dplyr::select(interval60, start_station, from_longitude, 
         from_latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted bike share trips by weekday/weekend from Model 4\nA test set of 7 weeks; Philadelphia, Jul~Sept 2019",
       x="Observed trips", 
       y="Predicted trips",
       subtitle='Black line:45 degree; Red line: observed vs predicted')+
  plotTheme

Thus, in detail, let’s look at our errors on a map by weekend/weekday and time of day. Weekends and weekdays have different error patterns in time. Between each time of the day, during weekdays, Central City tends to have more errors in AM rush and PM rush hours; while during weekends, Central City has more errors in Mid-Day and PM Rush, which means the predictive ability of the model does not perform well during these periods. Comparing weekdays and weekends in general, weekdays tend to have more errors:

#Let’s look at our errors on a map by weekend/weekday and time of day.
philly_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw) ) %>%
  dplyr::select(interval60, start_station, from_longitude, 
         from_latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(philly_dat_census$from_latitude), max(philly_dat_census$from_latitude))+
  xlim(min(philly_dat_census$from_longitude), max(philly_dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors of Test Set \nby weekend/weekday and time of day from Model 4",
       subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019")+
  mapTheme

Space

Let’s look at the generalizability of the model by census tracts. We can see the high errors are mostly concentrated in Central City where has higher density of population, facilities and office places, and more trips are likely to take place. It seems like the errors radiate from the Central City, we cannot tell the model is better at predicting in time or in space.

filter(philly_week_predictions, Regression == "DTime_Space_FE_timeLags") %>% 
  unnest() %>% 
  left_join(., phillyTracts, by = c("Origin.Tract" = "GEOID")) %>% 
  st_sf() %>% 
  dplyr::select(Origin.Tract, Absolute_Error, geometry) %>%
  gather(Variable, Value, -Origin.Tract, -geometry) %>%
  group_by(Variable, Origin.Tract) %>%
  na.omit() %>%
  summarize(MAE = mean(Value)) %>%
  ggplot() + 
  geom_sf(aes(fill = q5(MAE))) +
  scale_fill_manual(values = palette5,
                    labels = c("2", "3", "4", "5", "8"),
                    name = "MAE") +
  labs(title="Chloropleth map of errors (MAE) by neighborhood from Model 4",
       subtitle = "Bike share trip predicton; A test set of 7 weeks; Philadelphia, Jul~Sept 2019") +
  mapTheme

Socio-economic factors

Obviously from the plot of errors below, the model cannot predict well equally for different socio-economic contexts:

Mean commute time: the model predicts pretty well where people spend less commute time.

Percent of population taking public transit: the model predicts pretty well where there is higher percent of population taking public transit.

Median age: the model predicts pretty well where people with median age around 20 and median age above 50.

Percent of White population: the model predicts well where there is fewer percent of White population.

Median household income: the model predicts well where there is lower median household income (less than 30,000).

#Error as a function of socio-economic factors
#Add median age
philly_week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station= map(data, pull, start_station), 
         from_latitude = map(data, pull, from_latitude), 
         from_longitude = map(data, pull, from_longitude),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White),
         Med_Age=map(data,pull,Med_Age),
         Mean_Commute_Time = map(data,pull, Mean_Commute_Time)) %>%
  dplyr::select(interval60, start_station, from_longitude, 
         from_latitude, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Med_Age,Mean_Commute_Time) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White, Med_Age,Mean_Commute_Time) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       subtitle = "Philadelphia, Jul~Sept 2019; A test set of 7 weeks",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Conclusion and advice

To sum up, the model does not predict well for where and when more trips are likely to occur, as there are higher errors taking place in those context, for example, the rush hours in Central City. And the key takeaway is that the model underpredicts the bike share demand during rush hours in Central City in weekdays. Therefore, one piece of advice is, Indego should supply more bike share during those time in Central City, and people should be prepared to schedule well if they need to use bike share service at those time in Central City.

The model is good at predicting when and where trips are less frequent, such as non-rush hour during weekdays and in the areas aways from Central City.

Furthermore, the model cannot generalize equally for different socio-economic context.

Bike share prediction - Indego in Philadelphia

Ran Xin

11/14/2019

Executive summary

Data

Bike share data

Census data

Weather data

Exploratory analysis: trip count

Temporal pattern

Spatial pattern

Space/time animation

Exploratory analysis: trip count and variables

Time lags

Weather

Socio-economic factors

Modeling

Model building and estimation

Accuracy

Generalizability

Stations

Time

Space

Socio-economic factors

Conclusion and advice