Updated: Aug 24, 2020
Time series are a very common data format: as soon as a measurement is repeated over time, it generates a time series. They appear in virtually every kind of business: server activity logs in IT departments, short-term statistics in banks, stock prices in finance, customers' data consumption for telecom companies... Time series are (almost) everywhere, and one very common task for them is, of course, prediction. This blog post addresses prediction for univariate time series: we predict one time series at a time, projecting only its own past, regardless of other time series. As an example, we will try to predict the daily number of downloads on CRAN of the package autoTS, which is designed for automated time series prediction.
Because the sequence of observations is crucial in time series, traditional machine learning algorithms cannot be applied directly. As a matter of fact, autocorrelation (i.e. the fact that the observation at time T depends on the observations at times T-1, T-2...) has to be taken into account. This is why dedicated algorithms have been designed specifically for such data. These algorithms aim to identify several patterns in the time series: trend, cycle, seasonality, autocorrelation...
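To see what autocorrelation and seasonality look like in practice, here is a small illustration in base R (this snippet is not part of the autoTS workflow; the simulated series and its parameters are purely illustrative):

```r
# Simulate a daily series with a weekly pattern plus autocorrelated noise
set.seed(42)
n <- 200
weekly <- rep(c(10, 12, 15, 14, 13, 5, 4), length.out = n)  # weekly cycle
x <- weekly + arima.sim(model = list(ar = 0.6), n = n)      # AR(1) noise

# The autocorrelation function shows spikes at lags 7, 14, 21...
# (the weekly seasonality) and a decay at small lags
# (the autoregressive component)
acf(x, lag.max = 30)
```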
Of course, like in traditional machine learning, each algorithm comes with its own way of estimating the different patterns, and its own set of parameters to tune to achieve the best prediction. And like in traditional machine learning, there is no free lunch: none of the algorithms systematically outperforms the others. In other words, a data scientist should at least try some of them, if not all.
The R package autoTS aims to automate this benchmark of the most popular algorithms in order to predict the time series as accurately as possible. In this example, we fetch the number of downloads of autoTS with the package `cranlogs` and prepare the data with `dplyr` and `lubridate`. We'll plot the results with `ggplot2`.
```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(autoTS)
library(cranlogs)

# Get and clean data
data <- cran_downloads("autoTS", from = "2020-06-01") %>%
  mutate(date = as_date(date)) %>%
  filter(date <= today() - 2*days())

# Plot the time series
data %>%
  ggplot(aes(date, count)) +
  geom_line(color = "orange") +
  theme_minimal() +
  geom_smooth() +
  labs(x = "Date", y = "Number of downloads")
```
Now we can start training our algorithms to see which one best fits this time series. For that, we'll use the getBestModel() function with 14 data points to evaluate the models. This means that the last 14 days are left out of the training phase: we compute the error made by each algorithm on these 14 days and pick the one with the smallest error! We can plot the result of the training phase using the graph.train element of the resulting object. The getBestModel() function only needs a vector containing the dates, another containing the values, plus a character string indicating the frequency of the time series.
```r
bm <- getBestModel(data$date, data$count, freq = "day", n_test = 14)
bm$graph.train
```
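The principle behind this holdout evaluation can be sketched in plain R. This is a generic illustration, not autoTS's actual internals, and the error metric autoTS uses may differ:

```r
# Sketch of an n_test-day holdout evaluation (illustration only,
# not autoTS internals; autoTS may use a different error metric)
holdout_error <- function(values, n_test, forecast_fun) {
  n <- length(values)
  train <- values[1:(n - n_test)]       # everything but the last n_test points
  test  <- values[(n - n_test + 1):n]   # the held-out points
  pred  <- forecast_fun(train, n_test)  # the model's forecast for the holdout
  mean(abs(pred - test))                # mean absolute error on the holdout
}

# Example with a naive "repeat the last observed week" forecaster
naive_weekly <- function(train, h) rep(tail(train, 7), length.out = h)
holdout_error(1:100, 14, naive_weekly)
```

The algorithm with the smallest holdout error is then the one retained for the final prediction.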
In this case, the prophet algorithm, which generally performs very well, is not suited at all to this time series; digging into the bm object, the best model is sarima. We can now fit it on the full time series to get the final prediction. For that, we only have to pass the bm object to the my.predictions() function, along with the number of data points to be predicted.
```r
res <- bm %>%
  my.predictions(n_pred = 14) %>%
  mutate_at(4, function(xx) ifelse(xx < 0, 0, xx))  # clip negative predictions to 0

res %>%
  filter(type %in% c(NA, "mean")) %>%
  ggplot() +
  geom_line(aes(dates, actual.value), color = "orange") +
  geom_line(aes(dates, sarima), linetype = 2, color = "orange") +
  theme_minimal() +
  labs(x = "Date", y = "Number of downloads")
```
The result looks pretty good: the algorithm seems to have learned the weekly seasonality and other patterns (autoregressive and/or moving average components). You can of course design a pipeline that directly provides the prediction without saving the intermediate `bm` object!
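Such a pipeline can be built from the calls shown above, since my.predictions() accepts the output of getBestModel() directly (the final clipping step is optional):

```r
# End-to-end pipeline: benchmark the algorithms, then predict the
# next 14 days with the winner, without keeping `bm` around
res <- getBestModel(data$date, data$count, freq = "day", n_test = 14) %>%
  my.predictions(n_pred = 14) %>%
  mutate_at(4, function(xx) ifelse(xx < 0, 0, xx))  # clip negative predictions to 0
```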