- Sun 06 December 2020
- data
- #cycling-in-marburg

This article is the last in my series on Nextbikes in Marburg. So far, the series has covered the following topics:

- Article 1 of 4: Project introduction, data source and first analyses
- Article 2 of 4: Analyses from the perspective of Nextbike users
- Article 3 of 4: Analyses from the perspective of the Marburg city council

In this article, I turn towards a slightly more technical perspective: I apply machine learning to predict the number of parked bikes in two ways. In the first way, I fit a Markov chain to the transition matrix and derive the Markov steady state bike distribution. In the second way, I train a random forest to predict the number of parked bikes in the Nextbike ecosystem of Marburg based on the hour of the day, day of the week and temperature.


## Markov chain and steady state

Here, I will fit a Markov chain to the dynamics of Nextbikes in Marburg and find the steady state of the corresponding dynamical system.

For that, the corresponding stochastic matrix, \(\mathbf{P}\), is obtained from the transition matrix \(\mathbf{T}\). The transition matrix stores the number of transitions between every pair of stations: the number of transitions from station \(f\) (for “from”) to station \(t\) (for “to”) is stored as \(T_{f,t}\). The transition matrix was introduced and shown in the article intended for the city council.

With the stochastic matrix \(\mathbf{P}\), the state of Nextbikes in Marburg, \(\vec{x}\), can be advanced in time with

\[\vec{x}^{(i+1)} = \mathbf{P} \cdot \vec{x}^{(i)}\]

or in component notation

\[x^{(i+1)}_{t} = \sum_{f=1}^{36} P_{t, f} x^{(i)}_{f}.\]

The vector \(\vec{x}^{(i)}\) stores the number of bikes at each of the 36 stations at time step \(i\). Since each bike moves to exactly one of the stations with total probability 1, the columns of the stochastic matrix have to fulfil

\[\sum_{t=1}^{36} P_{t,f} = 1 \quad \forall f.\]

Finally, we can compute \(\mathbf{P}\) using the following relation

\[P_{t,f} = \frac{T_{f, t}}{\sum_{j=1}^{36} T_{f,j}}.\]
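In code, this normalisation can be sketched as follows. Note that the transition matrix used here is a small hypothetical 3-station example for illustration, not the real 36-station Marburg data:

```python
import numpy as np

# Hypothetical 3-station transition matrix: T[f, t] counts trips
# from station f to station t (the real Marburg matrix is 36x36).
T = np.array([[0, 5, 3],
              [2, 0, 4],
              [6, 1, 0]], dtype=float)

# P[t, f] = T[f, t] / sum_j T[f, j]: normalise each row of T by
# its row sum, then transpose so that the columns of P sum to 1.
P = (T / T.sum(axis=1, keepdims=True)).T

# Each column of P now sums to 1, as required of a stochastic matrix.
print(P.sum(axis=0))  # -> [1. 1. 1.]
```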

This framework can be used to find the temporal steady state \(\vec{x}^{(S)}\), which satisfies

\[\vec{x}^{(S)} = \mathbf{P} \cdot \vec{x}^{(S)}.\]

Hence, \(\vec{x}^{(S)}\) is a state that does not change upon time stepping forwards. This condition translates into an eigenvalue problem with eigenvalue 1,

\[\lambda \cdot \vec{x}^{(S)} = \mathbf{P} \cdot \vec{x}^{(S)}\]

with \(\lambda=1\). After solving the eigenvalue problem and selecting the eigenvector with eigenvalue 1, the steady state is obtained. The steady state suggests where bikes will accumulate and, potentially, need to be redistributed manually.
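A minimal sketch of this computation with NumPy, assuming a small hypothetical 3-station stochastic matrix in place of the real 36-station one:

```python
import numpy as np

# Hypothetical 3-station stochastic matrix (columns sum to 1);
# the real Marburg matrix is 36x36.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

# Solve the eigenvalue problem and pick the eigenvector whose
# eigenvalue is closest to 1.
eigenvalues, eigenvectors = np.linalg.eig(P)
idx = np.argmin(np.abs(eigenvalues - 1.0))
steady = np.real(eigenvectors[:, idx])

# The eigenvector is only determined up to a constant; normalise it
# so that its entries sum to 1 (a distribution of bikes over stations).
steady = steady / steady.sum()

print(steady)                           # relative share of bikes per station
print(np.allclose(P @ steady, steady))  # -> True: it is a fixed point
```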

Thus, the steady-state computation also confirms that the *Hauptbahnhof* and *Elisabeth-Blochmann-Platz* are the largest stations. Note that the steady state is only determined up to a constant factor due to the linearity of the underlying eigenvalue problem.

## Prediction of the number of parked bikes using a random forest

Now I come to a classical machine learning regression task: predicting the total number of parked bikes in Marburg given the hour of the day, the day of the week and the temperature.

Since I was not able to find temperature data for Marburg itself, I used temperature data from Giessen, a nearby city. The data is obtained from the German weather service (Deutscher Wetterdienst, DWD). Let's see how the temperature varied during the data collection period.

The different resamples of the raw data show how the temperature rises towards summer and declines towards winter.

With the weather and the number of parked bikes in the whole Marburg Nextbike ecosystem (introduced earlier), the training data can be constructed. The training data,

\[(\text{hour}, \text{weekday}, \text{temperature}) \rightarrow \text{number of parked bikes},\]

maps the triplet of the hour, the weekday and the temperature to the number of parked bikes in the Marburg Nextbike ecosystem. Since the training data here is three-dimensional, I follow the best practice of visualising the training data prior to training. The target variable, i.e. the number of parked bikes, is colour-coded in the following figure.

The two perspectives on the same underlying data show that the number of parked bikes varies little with the day, a little more with the hour, and most with the temperature. That's reasonable: people prefer to ride bikes when it is warm and when it is neither too late nor too early in the day.

This visual understanding is confirmed by the Pearson correlation coefficients between the target variable and each of the three features:

- Correlation between **parked bikes** and **temperature**: -0.74.
- Correlation between **parked bikes** and **hour**: -0.14.
- Correlation between **parked bikes** and **day**: 0.02.

Hence, the temperature is the variable most strongly (anti)correlated with the target variable.

As the machine learning model, I use a random forest regressor. The hyperparameters are optimised using a cross-validated grid search with 5 folds. This ensures that the limited training data of around \(\mathcal{O}(10^4)\) samples is used efficiently to find the best possible hyperparameters.
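A minimal sketch of this setup with scikit-learn. The synthetic training data and the hyperparameter grid below are illustrative placeholders, not the actual data or grid used in this project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative synthetic training data: (hour, weekday, temperature)
# triplets mapped to a number of parked bikes.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.integers(0, 24, n),    # hour of the day
                     rng.integers(0, 7, n),     # day of the week
                     rng.uniform(-5, 35, n)])   # temperature in °C
y = 140 - 1.5 * X[:, 2] + rng.normal(0, 5, n)   # colder -> more parked bikes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validated grid search over an example hyperparameter grid.
grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # negative MSE on the held-out data
```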

The regressor reaches a mean squared error (MSE) of around 137 on the test data set. To judge whether this is a good result, the square root of the MSE is compared to the average of the target variable: the root MSE is around 11.7, while the average number of parked bikes is 118 (with the minimum being 62 and the maximum 163). I consider this a decent result because the typical error is a factor of 10 smaller than typical values of the predicted variable. There is a multitude of ways to improve this result even further - see the outlook of this article for further information.

Lastly, let’s compare the measured data that served as training data and the predicted data side-by-side using a dense covering of the phase space.

The predicted data shows the same characteristics as the underlying training data: The warmer it gets, the fewer bikes are parked in the Nextbike ecosystem in Marburg. Hence, I obtained a decent regressor.

## Outlook

This article demonstrates how machine learning can be applied to obtain additional information from the Nextbike data.

There’s a myriad of possible technical extensions to improve the quality of the regressor:

- Use humidity for predicting the number of parked bikes.
- Perform a sine transformation of the day of the week and the hour of the day to obtain periodic features.
- Use one-hot encodings of the day of the week and the hour of the day.
- Experiment with different machine learning models.
- Collect and use more data.
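The periodic-feature idea from the list above can be sketched as follows (a minimal example using the hour of the day; the helper function `encode_periodic` is hypothetical, introduced here only for illustration):

```python
import numpy as np

def encode_periodic(values, period):
    """Map a periodic feature onto the unit circle so that, e.g.,
    hour 23 and hour 0 end up close to each other."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.sin(angle), np.cos(angle)

hours = [0, 6, 12, 23]
sin_h, cos_h = encode_periodic(hours, period=24)

# Hour 23 is now close to hour 0 in (sin, cos) space,
# unlike in the raw hour representation (where 23 - 0 = 23).
print(np.round(sin_h, 2))
print(np.round(cos_h, 2))
```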

Beyond the number of parked bikes, there is also a large spectrum of additional tasks that could be tackled, e.g.:

- Predict the number of parked bikes for each station separately.
- Predictive maintenance: When do bikes break? Are there precursors?
- Predict the number of transitions of bikes to find instances when Nextbikes might hinder smooth car traffic in Marburg.
- Spatial interpolation of bike demand, for example with Gaussian processes - similar to what I did previously for the parking demand in Marburg.

These tasks might be useful for either the cities that host bike sharing systems or even the companies themselves.

This article belongs to a series. Here is the list of all articles:

- Cycling in Marburg (1/4): Project introduction and data source
- Cycling in Marburg (2/4): For Nextbike users
- Cycling in Marburg (3/4): For the city council
- Cycling in Marburg (4/4): Machine learning predictions