predicting occupancy at both the stop and route levels. This
ensures that our method can react to both short- and long-term
changes in the public transit system. We do this by analyzing
and combining different spatio-temporal data such as weather,
traffic, and APC data to develop a model for bus occupancy.
First, we investigate how data can be augmented and merged
to provide features that would expose the relationship with
bus occupancy. Second, we build different models for bus-stop
and transit-route levels. Finally, we demonstrate and compare
our approach using actual APC data from the public transit
agency of Nashville, TN. The main contribution of this paper
is a data cleaning and augmentation method for raw APC
data. Raw APC data is often noisy and suffers from several
quality issues. Cleaning and augmenting it ensures that the
data used to train the models is valid.
We generate passenger occupancy from alighting and boarding
information.
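As a minimal sketch of how occupancy can be derived from boarding and alighting counts, the running load on a trip is the cumulative sum of boardings minus alightings in stop order. The function name, the input layout, the empty-vehicle start, and the clamping of negative loads to zero are all illustrative assumptions, not the cleaning rules used in this paper.

```python
from typing import List, Tuple


def occupancy_along_trip(stop_counts: List[Tuple[int, int]]) -> List[int]:
    """Return the passenger load on departure from each stop of one trip.

    stop_counts holds (boardings, alightings) per stop, in stop-sequence order.
    """
    occupancy = []
    load = 0  # assumption: the vehicle starts the trip empty
    for boardings, alightings in stop_counts:
        load += boardings - alightings
        # Assumption: noisy counts can push the running total below zero,
        # so it is clamped here; the actual cleaning rules may differ.
        load = max(load, 0)
        occupancy.append(load)
    return occupancy


# Hypothetical counts for a four-stop trip; the last stop over-counts alightings.
trip = [(5, 0), (3, 2), (0, 4), (1, 6)]
print(occupancy_along_trip(trip))
```

The last stop in the hypothetical trip shows why raw counts need cleaning: without the clamp, the over-counted alightings would yield a negative load.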
Organization: The rest of this paper is organized as fol-
lows. In Section II, we give an overview of the state-of-the-
art in occupancy prediction. In Section III, we present and
formulate the problem. In Section IV, we discuss the APC
data in depth, along with the issues accompanying the dataset. In
Section VI, we validate our proposed models using real-world
data from Nashville, TN. Finally, in Section VII, we give our
conclusions.
II. RELATED WORK
In this section we discuss the current state-of-the-art meth-
ods used in public transit occupancy prediction.
A. Occupancy Prediction
Given the importance of public transit and the increasing
ubiquity of available vehicle data, research in the field of
occupancy prediction, also known as passenger flow or transit
demand prediction, has been flourishing. A considerable
amount of work has been done on understanding and mapping the
occupancy level in public transport.
Short-term passenger demand forecasting methods fall into one of
two categories: parametric and non-parametric approaches.
Traditionally, parametric approaches such as historical av-
eraging [4] and autoregressive integrated moving average
(ARIMA) [5] have been used to predict not only demand
but traffic flow, travel times and vehicle speed. Ever since it
was established, ARIMA has been known to perform well in
modeling linear and stationary time series. However, ARIMA’s
shortcomings in taking into account seasonality and capturing
non-linear relationships in data are also well known.
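To make the simplest parametric baseline concrete, the following sketch implements historical averaging: the prediction for a stop at a given day of week and hour is the mean of past observations in that same slot. The grouping key, function names, and sample values are hypothetical choices for illustration and are not taken from [4].

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[str, int, int]  # (stop_id, day_of_week, hour), an assumed grouping


def fit_historical_average(
    records: List[Tuple[str, int, int, float]]
) -> Dict[Key, float]:
    """Average past occupancy for every (stop, day-of-week, hour) slot."""
    groups = defaultdict(list)
    for stop_id, day_of_week, hour, occupancy in records:
        groups[(stop_id, day_of_week, hour)].append(occupancy)
    return {key: mean(values) for key, values in groups.items()}


def predict(model: Dict[Key, float], stop_id: str, day_of_week: int,
            hour: int, default: float = 0.0) -> float:
    """Look up the slot average; fall back to a default for unseen slots."""
    return model.get((stop_id, day_of_week, hour), default)


# Hypothetical history: two Monday 8 am observations and one at 9 am.
history = [("A", 0, 8, 10.0), ("A", 0, 8, 14.0), ("A", 0, 9, 4.0)]
model = fit_historical_average(history)
print(predict(model, "A", 0, 8))
```

Such a baseline captures recurring weekly patterns but, like the other parametric methods discussed here, cannot react to exogenous factors such as weather.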
In contrast, non-parametric approaches build a non-linear
relationship between the input and output variables without
any prior knowledge. These methods gained popularity as a
consequence of the rapidly increasing availability of data
from systems such as Advanced Public Transportation Systems
(APTS) and Advanced Traveler Information Systems [3].
These techniques have been proven effective at forecasting
demand based on data gathered through smart cards [6],
[7]. Toque et al. [8] used Random Forest (RF) and LSTM
neural networks trained on smart card data to predict travel
demand. Tsai et al. [9] built multiple temporal units neural
networks (MTUNN) and parallel ensemble neural networks (PENN)
and showed that these can outperform predictions based
on statistical analysis of historical data.
The incorporation of other spatio-temporal datasets, such as
weather and special events, has also been explored. Karnberger et
al. [10] considered the effect of exogenous events on public
transportation ridership. Meanwhile, Zhou et al. [11] combined
smart data and weather information and found that while riders
are more resilient to changes in weather, it still has an effect
on the overall demand. Finally, Wood et al. [12] modeled
passenger occupancy and demand at the next-stop and any-stop
levels using APC and weather data, and showed
that even simpler models such as RF and LSTM provide
reliable estimates of future data when trained with historical
information, provided that demand patterns are fairly stable.
There has been plenty of work done in the field of public
transportation with a special focus on improving reliability
through understanding and forecasting passenger demand.
However, our work is distinct in three ways. First, our work
aims to provide occupancy prediction at both the stop and trip
levels separately by forecasting short- and long-term demand.
Second, we work on APC data which is fundamentally differ-
ent from smart card data, which is the data commonly used by
prior work. Smart cards are embedded with integrated circuits
that enable them to process information and, in this case, allow
contactless ticketing on mass transit. Smart card data is
much more accurate and complete [13], [14], in part because
passengers must swipe after getting on and before getting off
the vehicle. In contrast, APC data is much noisier and
introduces far more uncertainty in data
collection and processing. Third, we focus on implementing
this for the entire public transport system and not on a few
select routes.
III. PROBLEM STATEMENT
Our conversations with the transit agency indicate that they
want to identify particular trips and stops that
experience overcrowding. Overcrowding increases the chance
that passengers cannot board the bus and decreases
their overall satisfaction and willingness to take public transit
again in the future. Knowing the maximum occupancy at
the trip and stop levels will allow the agency to react and prepare
accordingly by increasing bus dispatch frequency, thereby
decreasing headway.
The primary objective of this work is to provide accurate
occupancy prediction for public transit vehicles. The goal is to
reliably and efficiently forecast maximum ridership
demand at both the stop and trip levels. The problem, then, is:
given a fleet of heterogeneous vehicles², each equipped with
automated passenger count systems, how can we model
and accurately predict the maximum occupancy at any trip or
stop in the future?
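The quantity to be predicted can be illustrated with a short sketch that computes the maximum observed occupancy per trip or per stop from historical records and flags overcrowded trips. This only shows how the target is obtained from data, not the prediction model itself; the record layout, function name, and the capacity value of 25 are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, int]  # (trip_id, stop_id, occupancy), an assumed layout


def max_occupancy_by(records: Iterable[Record], level: str) -> Dict[str, int]:
    """Return the maximum observed occupancy per trip or per stop."""
    if level not in ("trip", "stop"):
        raise ValueError("level must be 'trip' or 'stop'")
    index = 0 if level == "trip" else 1
    result: Dict[str, int] = {}
    for record in records:
        key = record[index]
        result[key] = max(result.get(key, 0), record[2])
    return result


# Hypothetical observations for two trips over two stops.
observations = [
    ("t1", "s1", 12), ("t1", "s2", 30),
    ("t2", "s1", 18), ("t2", "s2", 9),
]
per_trip = max_occupancy_by(observations, "trip")
per_stop = max_occupancy_by(observations, "stop")

CAPACITY = 25  # hypothetical vehicle capacity, for illustration only
overcrowded = sorted(trip for trip, load in per_trip.items() if load > CAPACITY)
print(per_trip, per_stop, overcrowded)
```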
² In this work, we use the terms vehicle and bus interchangeably
to refer to public transit vehicles.