Tutorial | Forecasting time series data with R and Dataiku#
R has several great packages built specifically to handle time series data. Using these packages within Dataiku, you can perform time series visualization, modeling, and forecasting.
Get started#
Objectives#
In this tutorial, you will:
Use R in Dataiku for time series analysis, exploration, and modeling.
Deploy a time series model from the Lab as an R recipe.
Prerequisites#
Dataiku 12.0 or later.
An Advanced Analytics Designer or Full Designer user profile.
The R integration installed.
An R code environment including the forecast and zoo packages.
Create the project#
From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Forecasting Time Series With R.
Click Install.
From the project homepage, click Go to Flow (or g + f).
From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by ML Practitioner.
Select Forecasting Time Series With R.
From the project homepage, click Go to Flow (or g + f).
Note
You can also download the starter project from this website and import it as a zip file.
Use case summary#
We will use the passenger dataset from the U.S. International Air Passenger and Freight Statistics Report. This dataset contains data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline.
Notice that the Flow already performs the following preliminary steps:
A Download recipe imports the data from the URL https://data.transportation.gov/api/views/xgub-n9bw/rows.csv?accessType=DOWNLOAD and creates the passengers dataset.
A Prepare recipe modifies the dataset so that we are left with only those columns relevant to the analysis: Date, carriergroup, and Total.
A Group recipe creates a new dataset group0_passengers that contains the total number of travelers per month for carrier group “0”.
Now we can proceed to perform analysis and forecasting on the group0_passengers dataset. This dataset also includes a chart.
We see two really interesting patterns:
A general upward trend in the number of passengers.
A yearly cycle with the lowest number of passengers occurring around the new year and the highest number of passengers during the late summer.
Let’s see if we can use these trends to forecast the number of passengers after March 2019.
Create an R notebook#
We’ll start with an R code notebook.
Select the group0_passengers dataset.
Navigate to the Lab tab of the right side panel.
Under Code Notebooks, click New.
From the available types of notebooks, select R.
If not set at the project level, change the code environment to one including the forecast and zoo packages.
Click Create.
Perform interactive analysis with an R notebook#
Let’s first confirm the trends we’ve seen in the native chart.
Replace the default library import statement with the one below including the extra packages.
library(dataiku)
library(forecast)
library(dplyr)
library(zoo)
Tip
The dataiku package lets us read and write datasets to Dataiku.
The forecast package has the functions we need for training models to predict time series.
The dplyr package has functions for manipulating dataframes.
The zoo package has functions for working with regular and irregular time series.
Run the cell to read in the data to an R dataframe using the R API.
Add a line like head(df) to explore the data.
Now that we’ve loaded our data, let’s create a time series object using the base ts() function and plot it.
The ts() function takes a numeric vector, the start time, and the frequency of measurement. For our dataset, these values are:
Total_passengers (the numeric vector)
1990 (the year for which the measurements begin)
a frequency of 12 (months in a year)
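To see how these three arguments fit together before applying them to the passenger data, here is a minimal sketch on made-up numbers. The vector toy_values and the names toy_ts and summer_1991 are hypothetical and exist only for illustration; only base R functions are used.

```r
# Hypothetical monthly counts: 24 made-up values starting in January 1990.
toy_values <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
                115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140)

# start = 1990 places the first observation in period 1 of 1990;
# frequency = 12 means 12 observations per year.
toy_ts <- ts(toy_values, start = 1990, frequency = 12)

start(toy_ts)        # year and period of the first observation
end(toy_ts)          # year and period of the last observation
head(time(toy_ts))   # time index expressed in fractional years
head(cycle(toy_ts))  # position of each observation within its year

# window() subsets by time rather than by row position.
summer_1991 <- window(toy_ts, start = c(1991, 6), end = c(1991, 8))
print(summer_1991)
```

Because the time index is stored in fractional years, each month advances the index by 1/12. This is the same convention the forecast dates rely on later in the tutorial.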
Copy-paste and run the following lines in the next cell of the notebook:
ts_passengers <- ts(df$Total_passengers, start = 1990, frequency = 12)
plot(ts_passengers)
Tip
Not surprisingly, we see the same trends found in the plot using the native chart builder in Dataiku. Now let’s start modeling!
Choose a forecasting model#
We are going to try three different forecasting methods and deploy the best one as a recipe in the Flow. In general, it is good practice to test several different modeling methods, and choose the method that provides the best performance.
Model 1: Exponential smoothing state space model#
The ets() function in the forecast package fits exponential smoothing state space (ETS) models. This function automatically optimizes the choice of model parameters.
Copy-paste the code below to use the ets() function to make a forecast for the next 24 months.
m_ets <- ets(ts_passengers)
f_ets <- forecast(m_ets, h = 24) # forecast 24 months into the future
plot(f_ets)
Tip
The forecast is shown in blue, with the shaded gray areas representing the 80% and 95% prediction intervals. Just by looking, we see that the forecast roughly matches the historical pattern of the data.
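The plot is convenient, but the same information can also be read directly from the forecast object. The sketch below assumes the forecast package is installed and uses AirPassengers, a monthly series bundled with R, as a stand-in for ts_passengers so that it runs outside Dataiku. The names fit, fc, and first_months are illustrative.

```r
library(forecast)

# AirPassengers (bundled with R) stands in for ts_passengers here.
fit <- ets(AirPassengers)
fc  <- forecast(fit, h = 24)   # default interval levels are 80% and 95%

fit$method           # error/trend/seasonal form selected by ets()
colnames(fc$lower)   # one column per interval level

# Point forecast and 95% bounds for the first three forecast months.
first_months <- data.frame(
  point   = as.numeric(fc$mean)[1:3],
  lower95 = as.numeric(fc$lower[, 2])[1:3],
  upper95 = as.numeric(fc$upper[, 2])[1:3]
)
print(first_months)
```

The second column of fc$lower and fc$upper holds the 95% bounds, which is why the later step that stores the forecast indexes these matrices with [,2].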
Model 2: Autoregressive Integrated Moving Average (ARIMA) model#
The auto.arima() function from the forecast package searches over candidate ARIMA models and returns the one with the best information criterion (AICc by default). Using the auto.arima() function is almost always better than calling the arima() function directly.
Note
For more information on the auto.arima() function, see an explanation in the Forecasting textbook.
Once again, copy-paste the code below to make a forecast for the next 24 months.
m_aa <- auto.arima(ts_passengers)
f_aa <- forecast(m_aa, h = 24)
plot(f_aa)
Tip
Observe that these prediction intervals are a bit narrower than those for the ETS model. This could be the result of a better fit to the data. Let’s train a third model, and then do a model comparison!
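Interval widths can be compared numerically rather than by eye. The following is a sketch under the assumption that the forecast package is installed; AirPassengers again stands in for ts_passengers, and the helper width95 and the other names are hypothetical.

```r
library(forecast)

# AirPassengers (bundled with R) stands in for ts_passengers.
f_ets_demo   <- forecast(ets(AirPassengers), h = 24)
f_arima_demo <- forecast(auto.arima(AirPassengers), h = 24)

# Width of the 95% interval (second column) at each forecast horizon.
width95 <- function(fc) as.numeric(fc$upper[, 2] - fc$lower[, 2])

widths <- data.frame(
  horizon = 1:24,
  ets     = width95(f_ets_demo),
  arima   = width95(f_arima_demo)
)

# Average 95% interval width per model.
print(colMeans(widths[, c("ets", "arima")]))
```

On a different dataset the ordering may differ, and narrower intervals alone do not show that a model fits better, which is why a separate model comparison is still needed.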
Model 3: TBATS model#
The last model we are going to train is a TBATS model. This model is designed for use when there are multiple cyclic patterns (e.g. daily, weekly and yearly patterns) in a single time series. We’ll see if this model can detect complicated patterns in our time series.
Copy-paste the code below to make a third forecast for the next 24 months.
m_tbats <- tbats(ts_passengers)
f_tbats <- forecast(m_tbats, h = 24)
plot(f_tbats)
Tip
Now we have three models that all seem to give reasonable predictions. Let’s compare them to see which one performs best.
Compare models#
We’ll use the Akaike Information Criterion (AIC) to compare the different models. AIC is a common measure of how well a model fits the data that also penalizes more complex models. The model with the smallest AIC value is preferred.
Add the following code to the next cell, and run it.
barplot(c(ETS = m_ets$aic, ARIMA = m_aa$aic, TBATS = m_tbats$AIC),
col = "light blue",
ylab = "AIC")
We see that the ARIMA model performs the best according to this criterion. Let’s now proceed to convert our interactive notebook into an R code recipe that can be integrated into our Dataiku workflow.
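AIC values are computed from each model family's own likelihood, so a complementary check that does not depend on AIC can be useful. One common option, shown here as a sketch and not as part of the tutorial's Flow, is to hold out the most recent observations and measure forecast error on them. The example assumes the forecast package is installed, uses AirPassengers in place of ts_passengers, and the 24-month holdout is an arbitrary choice.

```r
library(forecast)

# AirPassengers (bundled with R) stands in for ts_passengers.
# Hold out the last 24 months as a test set.
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))
h     <- length(test)

fits <- list(
  ETS   = ets(train),
  ARIMA = auto.arima(train),
  TBATS = tbats(train, use.parallel = FALSE)
)

# accuracy() returns one row for the training set and one for the test set.
holdout_rmse <- sapply(fits, function(fit) {
  accuracy(forecast(fit, h = h), test)["Test set", "RMSE"]
})
print(round(holdout_rmse, 1))
```

A lower test-set RMSE indicates forecasts that were closer to the held-out values. If this ranking and the AIC ranking disagree, that is a signal to look more closely before settling on one model.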
Store the forecasted values#
To do this, we first have to store the output of the forecast() function from the chosen model into a dataframe, so that we can pass it to Dataiku. The following code can be broken down into three steps:
Find the last date for which we have a measurement.
Create a dataframe with the prediction for each month. We’ll also include the lower and upper bounds of the predictions, and the date. Since we’re representing dates by the year, each month is 1/12 of a year.
Split the date column into separate columns for year and month.
Add the following code to the next cell, and run it.
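# Step 1: time index (in fractional years) of the last observation
last_date <- index(ts_passengers)[length(ts_passengers)]
# Step 2: point forecasts, their 95% bounds (second column), and the dates
data.frame(passengers_predicted = f_aa$mean,
           passengers_lower = f_aa$lower[,2],
           passengers_upper = f_aa$upper[,2],
           date = last_date + seq(1/12, 2, by = 1/12)) %>%
  # Step 3: split the fractional-year date into year and month
  mutate(year = floor(date)) %>%
  mutate(month = round(((date %% 1) * 12) + 1)) ->
  forecast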
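The date arithmetic can be checked on a toy series. The sketch below uses only base R and made-up values. It derives the month with cycle(), an alternative to the modulo arithmetic that uses integer positions within the year, and adds a small offset before floor() as a guard in case a year boundary is stored slightly below the whole number. The names toy_ts, future, and calendar are hypothetical.

```r
# Toy series of 27 months ending in March 2019 (values are made up).
toy_ts <- ts(seq_len(27), start = c(2017, 1), frequency = 12)

# Empty series covering the 24 months after the last observation.
future <- ts(rep(NA_real_, 24),
             start = tsp(toy_ts)[2] + 1 / 12,
             frequency = 12)

# cycle() returns the month position within the year; the 1e-6 offset
# protects floor() against floating-point error at year boundaries.
calendar <- data.frame(
  date  = as.numeric(time(future)),
  year  = floor(as.numeric(time(future)) + 1e-6),
  month = as.integer(cycle(future))
)
head(calendar, 3)   # April, May, and June 2019
```

Whichever approach is used, it is worth confirming that the first forecast row corresponds to the month immediately after the last observation.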
Now that we have the code to create the forecast for the next 24 months, and the code to convert the result into a dataframe, we are all set to deploy the model as a recipe.
Note
Here we speak of deploying a model from a notebook (part of the Lab) to a recipe (part of the Flow). Note however that this is not the same as deploying a Dataiku-managed visual or custom model to the Flow. Our result will be an R recipe in the Flow and not a saved model (represented by a green diamond).
Deploy the model to the Flow#
To deploy our model, we must create a new R recipe. In the notebook:
Click + Create Recipe > R recipe > OK.
Ensure that the group0_passengers dataset is the input dataset.
Under Outputs, click + Add. Name the output dataset forecast.
Click Create Dataset.
Click Create Recipe.
If not already selected, set the code environment on the recipe’s Advanced tab to the one used in the notebook.
Click Run in the recipe editor to execute the recipe.
Open the forecast dataset to look at the new predictions.
Tip
In a real situation, we’d optimize the code in the recipe to run only the portions that will output the forecast dataset.
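As a rough illustration of what such a trimmed recipe might contain, the sketch below keeps only the steps needed to produce the forecast: build the series, fit the chosen ARIMA model, and assemble the output dataframe. The helper build_forecast and its arguments are hypothetical. The Dataiku read and write calls appear as comments because they only run inside Dataiku, and the dataset and column names are assumed to match this tutorial.

```r
library(forecast)

# Hypothetical helper: fit the chosen model and return the forecast as a
# plain data frame. The name and arguments are illustrative only.
build_forecast <- function(values, start_year = 1990, h = 24) {
  series <- ts(values, start = start_year, frequency = 12)
  fc     <- forecast(auto.arima(series), h = h)
  data.frame(
    passengers_predicted = as.numeric(fc$mean),
    passengers_lower     = as.numeric(fc$lower[, 2]),   # 95% lower bound
    passengers_upper     = as.numeric(fc$upper[, 2]),   # 95% upper bound
    year  = floor(as.numeric(time(fc$mean)) + 1e-6),
    month = as.integer(cycle(fc$mean))
  )
}

# Inside the Dataiku recipe, the input and output steps would look like
# this (assuming the dataset names used in this tutorial):
#   library(dataiku)
#   df <- dkuReadDataset("group0_passengers")
#   dkuWriteDataset(build_forecast(df$Total_passengers), "forecast")

# Outside Dataiku, the helper can be tried on a series bundled with R.
demo <- build_forecast(as.numeric(AirPassengers), start_year = 1949)
head(demo)
```

Dropping the exploratory plots and the two models that were not selected keeps the recipe's run time down each time the Flow is rebuilt.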
What’s next?#
Congratulations! Now that you have spent some time forecasting a time series dataset with R in Dataiku, you may also want to see Tutorial | Time series forecasting (Visual ML) to perform time series forecasting without code.