Tutorial | Forecasting time series (Visual ML interface)

Many machine learning problems involve a time component. This additional temporal dimension introduces complexities that require careful analysis.

Dataiku offers several ways to implement time series modeling and forecasting. We’ll focus on Dataiku’s time series analysis functionality in the visual machine learning interface. You’ll see how this interface offers the same capabilities for time series forecasting as for any other type of machine learning model.

Let’s get started

In this tutorial, you’ll use the visual time series functionality of the Lab to design and train a model on time series in a dataset. You’ll then deploy the model to the Flow, evaluate the model, and, finally, score it. You’ll perform these tasks by using visual interfaces within the visual machine learning framework.

Prerequisites

  • A Dataiku 11 instance

  • A basic level of knowledge about Dataiku is helpful. If you’ve never used Dataiku before, try the Core Designer learning path or a Quick Start tutorial.

  • A code environment that includes the required packages for time series forecasting models. You can create a new code environment that includes these packages or add them to your existing code environment. See the Requirements for the code environment in the reference documentation.

Tip

To implement time series analysis, you’ll need to specify this code environment in the visual ML interface.

Create the project

From the Dataiku homepage, click +New Project > DSS Tutorials > Time Series > Forecasting Time Series With Visual ML (Tutorial).

Note

You can also download the starter project from this website and import it as a zip file.

Build the Flow

The initial starter Flow contains empty datasets. To work with these datasets, we’ll need to build the Flow.

  • Click Flow Actions from the bottom-right corner of your window.

  • Select Build all and keep the default selection for handling dependencies.

  • Click Build.

Wait for the build to finish, and then refresh the page to see the built Flow.

Starting Flow for this project.

Description of the starting Flow

The project begins with an input dataset, airline_stocks, that contains weekly stock prices from 2008 to January 2022 for three major US airlines: American (AAL), Delta (DAL), and United (UAL). The dataset contains the following columns:

  • Ticker: Stock symbol that uniquely identifies each time series in the dataset

  • Date: Weekly timestamps

  • Open, High, Low, Close: The opening, highest, lowest, and closing prices for the week

  • Adj Close: The adjusted closing price for the week

  • Volume: The number of shares traded

A Prepare recipe applied to the input dataset (airline_stocks) parses the Date column and renames the Adj Close column to Adj_close. Adj_close is the target variable that you’ll forecast. This is all the preparation the input dataset needs, as the data is generally clean with no missing values.

The project also includes a training dataset, train, filtered from the prepared airline_stocks_prepared dataset. The train dataset contains stock data for the Monday of every week from 2008 through the end of December 2021.

Note

The airline_stocks_prepared dataset contains all the training and validation data and will be useful during model evaluation and scoring. This is because Dataiku requires that the input datasets to the Evaluate and Scoring recipes include the historical (training) data used by the time series model.

Training dataset for the project.

First, view the Charts tab of the train dataset to see some pre-computed plots of the Adj_close price in train. The Statistics tab of the dataset also includes some pre-computed analyses on the Adj_close price of the UAL time series in train.

Plotting your data and performing statistical analyses are good initial steps to see if you can observe any patterns in the time series, such as trends and seasonalities.

Tip

To explore the dataset using charts and statistical analyses, see the How-To: Perform Statistical Analysis on Time Series article.

Build and train a time series model

To build the time series model, you’ll use the train dataset. This dataset contains all the time series data before January 1, 2022.

Tip

You’ll use the remaining time series values (occurring after January 1, 2022) as a validation set to evaluate the performance of your trained model. This validation set contains 18 weekly entries (or time steps) for each stock price time series.
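If you’re curious how this date-based split could be reproduced outside the visual interface, here is a minimal pandas sketch. It assumes a hypothetical CSV export of the prepared dataset (the file name and column names come from the descriptions above); it illustrates the split and is not the project’s actual recipe.

```python
import pandas as pd

# Hypothetical CSV export of the prepared dataset (the file name is an assumption)
df = pd.read_csv("airline_stocks_prepared.csv", parse_dates=["Date"])

# Rows strictly before January 1, 2022 form the training set;
# the remaining 18 weekly rows per ticker act as the validation set
train = df[df["Date"] < "2022-01-01"]
validation = df[df["Date"] >= "2022-01-01"]

print(train["Date"].max())
print(validation.groupby("Ticker").size())  # expect 18 rows per ticker
```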

Create the time series forecasting analysis

To create the time series forecasting analysis:

  • Click the train dataset in the Flow and open the Actions tab in the right panel.

  • Click Lab then select Time Series Forecasting.

  • Specify Adj_close as the numerical feature and Date as the date feature.

  • Select Ticker as the identifier column for multiple time series (remember, we have time series for three different airlines).

  • Select Quick prototypes and create the analysis.

Create the time series forecasting analysis.

Now you can examine the Design tab of the analysis.

Inspect the general settings panel

In the General settings panel of the Design tab, Dataiku already specified parameter values based on (1) the input selections you made when you created the forecasting analysis and (2) the default settings. Let’s inspect these values.

Time step parameters

The first parameters we’ll inspect set the time step. Dataiku has suggested a time step of 1 week; the “Day of week” parameter controls which weekday those weekly time steps fall on.

  • Since the weekly values in the dataset occur on a Monday, change the Day of week to Monday.

Note

Although the input data here is equispaced, it is important that you specify the correct value for the Day of week parameter. Suppose you select a weekday different from the day the timestamps occur. In that case, Dataiku will create new timestamps for the specified day of the week and determine their corresponding values (for other columns in the dataset) by interpolating between the original timestamp values.

The Scoring and Evaluate recipes will also forecast values for dates that fall on the weekday you specify.
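To picture what this interpolation means, here is a rough pandas illustration of the idea. The toy dates and values are made up, and this is only a conceptual analogy, not Dataiku’s internal resampling implementation.

```python
import pandas as pd

# Toy series observed on Wednesdays (made-up values, for illustration only)
wednesdays = pd.date_range("2022-01-05", periods=4, freq="7D")
s = pd.Series([10.0, 12.0, 11.0, 13.0], index=wednesdays)

# Target timestamps: Mondays within the same period
mondays = pd.date_range("2022-01-10", "2022-01-24", freq="W-MON")

# Linearly interpolate in time between the original Wednesday values to obtain
# values at the requested Mondays, roughly analogous to what happens when the
# selected "Day of week" does not match the day the timestamps actually fall on
on_mondays = (
    s.reindex(s.index.union(mondays))
     .interpolate(method="time")
     .reindex(mondays)
)
print(on_mondays)
```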

Forecasting parameters

We need to predict the weekly stock price for the next 6 weeks, but since we plan to refresh the models only every 18 weeks, we want an evaluation period of 18 weeks, equivalent to 3 forecast horizons.

  • Set Forecast horizon to 6. This parameter determines the length of the model forecast; therefore, you should specify a value that divides the length of the validation set evenly.

  • Set Horizons in evaluation to 3. This parameter determines the number of forecast horizons in the evaluation period (3 horizons × 6 weeks = 18 weeks, matching the validation set).

  • Leave Skipped time steps in each forecast horizon at the default setting of 0. This parameter tells Dataiku the number of time steps within each horizon to skip during the evaluation.

Note

Recall the validation set contains the time series values occurring after January 1, 2022. This data is in the airline_stocks_prepared dataset and contains 18 weekly entries (or time steps) for each stock price time series.

When you change the forecasting horizon, you have the opportunity to re-detect settings.

Re-detect settings after changing the forecasting horizon.

  • Click Re-Detect Settings. Dataiku automatically changes the Nb. time steps for evaluation to 6 after re-detecting settings.

Tip

Notice the graphic to the right of the “Forecasting parameters” settings. Visual aids like this appear throughout the visual time series interface to help you understand how Dataiku uses the parameter values you specify.

Quantiles

Also, notice the forecasting quantiles. Keep these default values.
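A quantile forecast is simply a percentile of the model’s predictive distribution for a future time step. The sketch below illustrates the idea on a made-up sample of forecasts; the 0.05/0.5/0.95 levels are chosen for illustration and are not necessarily Dataiku’s default quantiles.

```python
import numpy as np

# 1,000 made-up sample forecasts for a single future time step
samples = np.random.default_rng(0).normal(loc=50.0, scale=2.0, size=1000)

# A forecast quantile is a percentile of this predictive distribution;
# e.g., the 0.05 and 0.95 quantiles bound a 90% prediction interval
low, median, high = np.quantile(samples, [0.05, 0.5, 0.95])
print(f"90% interval: [{low:.2f}, {high:.2f}], median forecast: {median:.2f}")
```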

Note

Dataiku will use:

  • the same Forecast horizon value when you apply a Scoring recipe to the model.

  • the Nb. time steps for evaluation value as the forecast horizon when you apply an Evaluate recipe to the model.

  • the same Forecast quantiles you specify during training when you score and evaluate the model.

Inspect the Train/Test Set panel

Now, let’s examine the Train/Test Set panel. In this panel, you can specify settings for resampling and splitting the time series.

Note

If the train data is not equispaced, Dataiku will automatically resample it. You’ll also be able to specify the imputation method for numerical and categorical data in the missing time steps.

  • Keep the default settings in the Time series resampling section.

  • Under Splitting parameters, select the K-fold cross-test splitting strategy.

Dataiku lets you know how many forecast horizons are present in the test set and how many time steps, if any, are skipped.

  • Keep the default 5 folds to use for the k-fold cross-test.

Note

Using the cross-test provides a more accurate estimation of model performance and is useful when you don’t have much training data. The reference documentation on Settings: Train/Test set explains how the process works.

Settings in the train/test set panel of the design tab.

Inspect the Metrics panel

Now, go to the Metrics panel. Together, the settings in the Metrics, Algorithms, and Hyperparameters panels determine how Dataiku searches for the best model hyperparameters.

In the Metrics panel, you can choose the metric that Dataiku will use to evaluate models on the train/test set and to decide which model is best during hyperparameter optimization.

  • Keep the default metric, Mean Absolute Percentage Error (MAPE).

Settings in the metrics panel of the design tab.
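For reference, MAPE measures the average absolute error as a percentage of the actual values. Here is the standard definition computed on made-up numbers; it is not Dataiku’s internal implementation.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, expressed as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Actual vs. forecast adjusted closing prices (made-up values)
print(mape([50.0, 52.0, 48.0], [51.0, 50.0, 49.0]))  # roughly 2.6
```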

Inspect the External Features panel

Let’s go to the External Features panel.

External features are exogenous, time-dependent features. By default, Dataiku disables external features for several reasons. One reason is that some training algorithms do not support them.

Another reason for this default behavior is that, for any model trained with external features, Dataiku will require you to provide future values of those features at forecasting time (when you score the trained model). Therefore, if there is no way to know the values of the external features ahead of time (as is the case with this stock price data), the model should not use them during training. Using these features would lead to what is called “data leakage”.

  • Leave the external features disabled.

External features tab showing that Dataiku disables external features by default.

Inspect the Algorithms panel

In the Algorithms panel, Dataiku provides several algorithms that you can use to train your model. By default, Dataiku selects the NPTS and Simple Feed Forward algorithms. During the hyperparameter optimization phase, each set of hyperparameters (defined in the Algorithms page) is evaluated with the metric defined in the Metrics page.

  • Enable the Deep AR and Seasonal Trend algorithms in addition to the default selections.

Selected algorithms panel for the model.

Inspect the Hyperparameters panel

The Hyperparameters panel defines how Dataiku searches over the hyperparameter values specified in the Algorithms page, evaluating each candidate with the metric defined in the Metrics page. Before reviewing these settings, let’s add one more hyperparameter value to the search.

  • Go to the Algorithms page and select NPTS.

  • In Exponential kernel weight, add a value of 2.

  • Return to the Hyperparameters page.

  • Keep the default settings.

When we train the model, Dataiku performs the optimization using a cross-validation scheme that respects the time ordering of the data. From the hyperparameter search visualization, we can see that the validation sets (in green) shift forward in time for each fold (in our case, 3 folds) so that they do not overlap.

Hyperparameter search panel for the model
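The sketch below shows the same idea with scikit-learn’s TimeSeriesSplit: successive folds train on an ever-longer prefix of the series and validate on a later window that never overlaps an earlier one. It only illustrates the concept of a time-ordered search; it is not the exact scheme Dataiku uses.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 30 weekly observations of a single made-up series
y = np.arange(30)

# Three validation windows of 6 weeks that shift forward in time
tscv = TimeSeriesSplit(n_splits=3, test_size=6)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(y), start=1):
    print(
        f"fold {fold}: train weeks 0-{train_idx[-1]}, "
        f"validate weeks {valid_idx[0]}-{valid_idx[-1]}"
    )
```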

One last step before training is to check the runtime environment.

Inspect the Runtime Environment panel

From the Runtime environment panel, you can select the code environment that contains the required packages for building the time series forecast models.

  • Go to the Runtime environment panel.

  • Select a code environment that includes the required packages for time series forecasting models.

    Note

    You can create a new code environment that includes the required packages or add them to an existing code environment. See the Requirements for the code environment in the reference documentation.

    Select an appropriate code environment for time series models.

  • Save your changes.

  • Click Train, and then Train once more to begin training the models.

Inspect the training results and deploy the model to the Flow

You can inspect the results for the models you trained, just as you would for prediction or clustering models.

View the model results page.

Notice the bar charts at the top that allow you to compare different metrics across the trained models. The Simple Feed Forward algorithm performed best for all three metrics (MAPE, Symmetric MAPE, and RMSE).

You can click the metrics dropdown in the session summary to change the displayed metrics and see more performance information across the algorithms.

Dropdown in the session summary shows other available metrics.

  • On the model results page, click the best-performing model, Simple Feed Forward, to see its training results.

  • Click the Forecast charts panel to see the predictions for each time series (AAL, DAL, and UAL) and inspect the plots.

Figure showing a plot of the forecast for the UAL time series.

For each of the “k = 5” folds, a black line shows the actual time series values; another line shows the values that the model forecast for that fold; and a shaded area represents the confidence intervals for the forecast values.

Notice that Dataiku also plots the forecast values and confidence intervals for the next horizon of the training dataset (beyond the last timestamp in the training data).

The other panels of the model’s Report page show additional training details. For example, the Metrics tab in the Performance section displays the aggregated metrics for the overall dataset and its individual time series. The tabs in the Model Information section provide more details on resampling, features used to train the model, the algorithm details, etc.

Once you finish inspecting the model and are satisfied with its performance, you can deploy the model to the Flow, as you would for any prediction model in Dataiku.

  • Click Deploy from the top right corner of the page.

  • Click Create to deploy the Predict Adj_close model to the Flow.

Note

Like any other Dataiku visual ML model, you can deploy time series models in the Flow for batch scoring or as an API for live scoring.

Figure showing the time series model in the Flow.

Evaluate the time series model

Next, evaluate the model’s performance on data not used during training. For this, you’ll use the Evaluate recipe.

Create the Evaluate recipe

To apply the Evaluate recipe to the time series model, you’ll use a validation set from the airline_stocks_prepared dataset as input. The validation set includes the time series values after January 1, 2022. Recall that this validation set contains 18 weekly entries (or time steps) for each stock price time series.

To create the Evaluate recipe:

  • Click the model to select it from the Flow.

  • Click the Evaluate recipe from the right panel.

  • Specify airline_stocks_prepared as the Input dataset to the Evaluate recipe.

  • Click Set to add an Output dataset.

  • Name the dataset eval.

  • Click Create the Dataset.

  • Click Set to add a “Metrics” dataset.

  • Name the dataset metrics.

  • Click Create the Dataset.

    Figure showing the creation of the Evaluate recipe.

  • Click Create Recipe.

Dataiku opens up the recipe’s Settings page.

Figure showing the settings page of the Evaluate recipe.

Notice the following on the Settings page:

  • Dataiku alerts you: “The input dataset must contain the historical data.”

    Note

    If you used external features while training the model, Dataiku would require that the input data to the Evaluate recipe contain values for the external features. This requirement is also true when you use the Scoring recipe.

  • By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

Specify the remaining settings for the recipe as follows:

  • Specify the “Nb. evaluation windows” value as 3 to use three evaluation windows of 6 weeks each.

    Tip

    Notice that we chose these settings strategically to ensure that the Evaluate recipe forecasts all 18 time steps per time series that were not used during training.

  • Select the option to compute metrics “per time series” for Dataiku to show one row of metrics per time series (note that you can also compute aggregated metrics).

  • Keep the default selection of the columns to include in the output dataset eval and the metrics dataset metrics.

    Figure showing custom settings for the Evaluate recipe.

  • Run the recipe.

Explore the evaluation results

In the eval dataset, you can view the actual values of the Adj_close prices and the forecasts alongside the quantiles.

Figure showing the output dataset (eval) of the Evaluate recipe.

You can also create line charts on the dataset to compare the forecasts to the actual time series values.

Figure showing the plot of forecast and actual values in the time series.

In addition to the forecast values, you should also examine the associated metrics.

  • Open the metrics dataset to see one row of metrics per time series.

Figure showing the output dataset (metrics) of the Evaluate recipe.
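If you export the eval dataset, you can recompute this kind of per-series metric yourself. The sketch below assumes hypothetical column names (“Adj_close” for the actual values and “forecast” for the model’s prediction); check the actual column names in your eval dataset before running it.

```python
import pandas as pd

# Hypothetical export of the eval dataset; the "forecast" column name is an assumption
eval_df = pd.read_csv("eval.csv", parse_dates=["Date"])

def mape(group: pd.DataFrame) -> float:
    """Mean Absolute Percentage Error for one time series, as a percentage."""
    errors = (group["Adj_close"] - group["forecast"]).abs() / group["Adj_close"]
    return 100.0 * errors.mean()

# One MAPE value per airline, mirroring the "per time series" metrics option
print(eval_df.groupby("Ticker").apply(mape))
```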

Score the time series model

Finally, you can apply a Scoring recipe to the model to predict future values of the time series.

To apply the Scoring recipe to the time series model, you’ll use the airline_stocks_prepared dataset as input. The Scoring recipe will use this input with the trained model to forecast future values of the time series (for dates after May 2, 2022).

To create the Scoring recipe:

  • Return to the Flow.

  • Click the model to select it from the Flow.

  • Click the Score recipe from the right panel.

  • Specify airline_stocks_prepared as the “Input dataset” to the recipe.

  • Name the output dataset scored.

  • Click Create Recipe.

Dataiku opens the recipe’s Settings page.

Figure showing the settings page of the Scoring recipe.

On the recipe’s Settings page, notice the following:

  • Dataiku alerts you that the “input dataset must contain the historical data.”

  • By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

Specify the remaining settings as follows:

  • Specify the Past data to include as 52 weeks.

  • Keep Input columns to include selected to avoid copying the whole input dataset to the output.

  • Keep the default selection of columns (Ticker, Date, and Adj_close) to include in the output.

  • Click Run to run the recipe. Wait for the run to finish.

  • Open the scored dataset.

  • Sort the scored dataset in descending order along the Date column to see the forecast values at the top of the dataset.

Figure showing the scored dataset.

The recipe forecasts values of Adj_close, alongside the quantiles, for the next six weeks for each of the airline stocks.

Note

The reference documentation provides more information on using the Scoring recipe with a time series model.

What’s next?

Congratulations! You’ve taken your first steps toward modeling and forecasting time series data using Dataiku’s visual time series interface.

You can learn more about time series forecasting using the visual interface by checking out the reference documentation on Time Series Forecasting.