Tutorial | Time series forecasting (Visual ML)#

Get started#

Many machine learning problems involve a time component. This temporal constraint introduces complexities that require careful analysis.

Dataiku offers various ways to implement time series modeling and forecasting. We’ll focus on Dataiku’s time series analysis functionality within the visual machine learning interface.

Objectives#

In this tutorial, you will:

Design a time series forecasting model using the visual ML interface.
Train and deploy a forecasting model to the Flow.
Use the Evaluate and Score recipes with a forecasting model.

Prerequisites#

Dataiku 14.0 or later.
An Advanced Analytics Designer or Full Designer user profile.
Basic knowledge of visual ML in Dataiku (ML Practitioner level or equivalent).
If not using Dataiku Cloud (where it’s available by default), you’ll need a specific code environment including the required packages. See Runtime and GPU support in the reference documentation.

Note

If you want to perform some exploratory data analysis (EDA) before starting this forecasting project, visit Tutorial | Time series analysis.

Create the project#

From the Dataiku Design homepage, click + New Project.
Select Learning projects.
Search for and select Visual Time Series Forecasting.
If needed, change the folder into which the project will be installed, and click Install.
From the project homepage, click Go to Flow (or type g + f).

From the Dataiku Design homepage, click + New Project.
Select DSS tutorials.
Filter by ML Practitioner.
Select Visual Time Series Forecasting.
From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

Click Flow Actions at the bottom right of the Flow.
Click Build all.
Keep the default settings and click Build.

Review the train dataset#

The train dataset is a multivariate time series dataset that includes three important columns:

Column	Description
Ticker	Stores the stock symbol identifying three independent time series for the three airlines: American (AAL), Delta (DAL), and United (UAL).
Date	Stores weekly timestamps from 2008 to January 2022.
Adj_close	Stores the stock’s daily closing price.
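
As a rough illustration of this layout, the sketch below builds a few rows in the same long format (one row per ticker and date) and groups them by the identifier column. The prices and the four-week date range are invented for the example; only the column names and ticker symbols come from the dataset described above.

```python
from collections import defaultdict
from datetime import date, timedelta

# Synthetic rows in the same long format as the train dataset:
# one row per (Ticker, Date) pair. Prices are made up for illustration.
start = date(2021, 12, 6)  # a Monday
rows = [
    {"Ticker": ticker, "Date": start + timedelta(weeks=w), "Adj_close": base + w}
    for ticker, base in [("AAL", 17.0), ("DAL", 38.0), ("UAL", 42.0)]
    for w in range(4)
]

# Group by the identifier column to recover the three independent series.
series = defaultdict(list)
for row in rows:
    series[row["Ticker"]].append((row["Date"], row["Adj_close"]))

for ticker, points in sorted(series.items()):
    print(ticker, len(points), points[0][0], "->", points[-1][0])
```

Each ticker ends up with its own ordered list of (date, price) points, which is the sense in which one dataset holds several time series.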

Design a forecasting model#

To build a time series model, we’ll use the train dataset. This dataset contains all time series data before January 1, 2022.

Tip

You’ll use the remaining time series values (occurring after January 1, 2022) as a validation set to evaluate the performance of your trained model. This validation set contains 18 weekly entries (or time steps) for each stock price time series.

Create the time series forecasting task#

The process is similar to that of other visual models.

From the Flow, select the train dataset, and navigate to the Lab tab of the right panel.
From the Visual ML section, select Time Series Forecasting.
Select Adj_close as the numerical feature to predict.
Select Date as the date feature.
Under Define identifier columns, select Ticker as the identifier column for the different time series.
Leaving the default for quick prototypes, click Create.

Configure the model’s general settings#

In the General settings panel of the Design tab, Dataiku has already specified parameter values based on:

The input selections from the forecasting task’s creation
The default settings

Let’s tweak these settings. In this exercise, we want to predict the weekly airline stock prices for the next six weeks and refresh the models every 18 weeks. Therefore, we’ll want to use an evaluation period of 18 weeks, equivalent to three forecast horizons.
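
The relationship between these numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and aren’t Dataiku parameters.

```python
forecast_horizon = 6        # time steps (weeks) covered by one forecast
horizons_in_evaluation = 3  # number of forecast horizons in the evaluation period

evaluation_steps = forecast_horizon * horizons_in_evaluation

# Each window covers a contiguous block of step indices in the evaluation period.
windows = [
    range(i * forecast_horizon, (i + 1) * forecast_horizon)
    for i in range(horizons_in_evaluation)
]
for i, window in enumerate(windows, start=1):
    print(f"horizon {i}: steps {window.start + 1}-{window.stop}")
```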

Within the General settings panel of the modeling task’s Design tab, set the Day of week to Monday.

Note

Although the input data here is equispaced, it’s important that you specify the correct value for the Day of week parameter. If you select a weekday different from the one on which the timestamps fall, Dataiku creates new timestamps for the specified day of the week and determines their corresponding values (for other columns in the dataset) by interpolating between the original timestamp values. The Scoring and Evaluate recipes will also forecast values for dates that fall on the weekday you specify.

Under Forecasting parameters, set Forecast horizon (in time steps) to 6.

Note

This parameter determines the length of the model forecast, so you should specify a value that’s a factor of the length of the validation set. Here, 6 divides the 18 time steps of the validation set evenly.

In the Changing forecast horizon window, select Re-detect settings.
Set Horizons in evaluation to 3. This parameter determines the number of forecasting horizons.
Leave Skipped time steps in each forecast horizon at the default setting of 0. This parameter tells Dataiku the number of time steps within each horizon that you want to skip during the evaluation.
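
To see why the Day of week setting matters, the sketch below moves a weekly series observed on Wednesdays onto a Monday grid. It assumes plain linear interpolation and a hypothetical helper named align_to_weekday; Dataiku’s own resampling may use a different interpolation method, so treat this as an illustration of the idea rather than the product’s implementation.

```python
from datetime import date, timedelta

def align_to_weekday(points, weekday):
    """Linearly interpolate (date, value) points onto a weekly grid on `weekday`.

    `weekday` follows datetime.date.weekday(): Monday is 0, Sunday is 6.
    `points` must be sorted by date and contain at least two entries.
    """
    first, last = points[0][0], points[-1][0]
    # First grid date on or after the first observation.
    current = first + timedelta(days=(weekday - first.weekday()) % 7)
    aligned = []
    i = 0
    while current <= last:
        # Advance to the segment [points[i], points[i + 1]] that contains `current`.
        while i + 1 < len(points) - 1 and points[i + 1][0] < current:
            i += 1
        (d0, v0), (d1, v1) = points[i], points[i + 1]
        fraction = (current - d0).days / (d1 - d0).days
        aligned.append((current, v0 + fraction * (v1 - v0)))
        current += timedelta(weeks=1)
    return aligned

# Made-up observations that fall on Wednesdays.
wednesdays = [
    (date(2021, 12, 1), 10.0),
    (date(2021, 12, 8), 17.0),
    (date(2021, 12, 15), 24.0),
]
mondays = align_to_weekday(wednesdays, weekday=0)
for day, value in mondays:
    print(day, round(value, 2))
```

The Monday values aren’t observed prices; they’re estimates between neighboring observations, which is why choosing the weekday the data actually uses avoids introducing interpolated values.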

Tip

Recall the validation set contains the time series values occurring after January 1, 2022. This data is in the airline_stocks_prepared dataset and contains 18 weekly entries (or time steps) for each stock price time series.

Important

When configuring the model’s design, note that Dataiku will use:

The same Forecast horizon value when you apply a Scoring recipe to the model.
The Nb. time steps for evaluation value as the forecast horizon when you apply an Evaluate recipe to the model.
The same Forecast quantiles you specify during training when you score and evaluate the model.

Configure the Train / Test Set panel#

Because time series have an order, sampling for the train and test sets is quite different from a traditional machine learning problem.

In the Design tab, navigate to the Train / Test Set panel on the left.
Under Splitting parameters, check the box for a K-fold cross-test splitting strategy, and keep the default of 5 folds.

Tip

If the training data isn’t equispaced, Dataiku will automatically resample it. You’ll also be able to specify the imputation method for numerical and categorical data in the missing time steps.

Note

Learn more about the K-fold cross-test by visiting Cross-validation.
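
As a rough mental model of cross-testing on ordered data, the sketch below builds five folds in which each test window comes strictly after its training window. This is a generic expanding-window scheme with a hypothetical function name and a made-up series length of 120 steps; Dataiku’s exact fold construction may differ.

```python
def backtest_folds(n_steps, n_folds, test_size):
    """Return (train_indices, test_indices) pairs, oldest fold first.

    Each test window is the `test_size` steps right after its training
    window, so a model is never trained on data later than its test data.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        test_end = n_steps - (k - 1) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough time steps for this many folds")
        folds.append((range(0, test_start), range(test_start, test_end)))
    return folds

# Assumed values: 120 weekly steps, 5 folds, 18-step evaluation period.
folds = backtest_folds(n_steps=120, n_folds=5, test_size=18)
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: train 0-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Unlike shuffled K-fold splitting for tabular data, no fold here ever tests on a time step that precedes its training data.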

Configure the Metrics panel#

The settings in the Metrics, Algorithms, and Hyperparameters panels determine how Dataiku performs the search for the best model hyperparameters.

In the Metrics panel, you can choose the metric that Dataiku will use for model evaluation on the train and test sets.

On the left, navigate to the Metrics panel.
Switch to Mean Absolute Percentage Error (MAPE) as the metric for which the model’s hyperparameters should be optimized.
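
For reference, the sketch below computes MAPE alongside two other common forecasting metrics on a handful of made-up values. The formulas are the standard textbook definitions (symmetric MAPE in particular has several variants); Dataiku’s exact scaling and aggregation across time series may differ.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric MAPE, using the 2|f - a| / (|a| + |f|) variant."""
    return sum(
        2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)
    ) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Made-up prices and forecasts, for illustration only.
actual = [20.0, 25.0, 40.0]
forecast = [18.0, 25.0, 44.0]

print("MAPE: ", round(mape(actual, forecast), 4))
print("sMAPE:", round(smape(actual, forecast), 4))
print("RMSE: ", round(rmse(actual, forecast), 4))
```

Because MAPE is relative to the actual value, it lets you compare errors across stocks trading at different price levels, whereas RMSE stays in price units.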

Configure the External Features panel#

External features are exogenous time-dependent features. By default, Dataiku disables the external features for several reasons. One reason is that some training algorithms don’t support the use of external features.

Another reason for this default behavior is that for any model you train with external features, Dataiku will require you to provide future values of those external features during forecasting (when trying to score the trained model).

Therefore, if there’s no way to know the values of the external features ahead of time (as is the case here with stock price information), the model shouldn’t use them during training. Using these features will lead to what’s called data leakage.

On the left, navigate to the External features panel.
Leave the external features deactivated.

Configure the Algorithms panel#

The visual ML interface offers three categories of forecasting algorithms as indicated by the icon to the right of their names: baseline, statistical, and deep learning.

During the hyperparameter optimization phase, each set of hyperparameters (defined in the Algorithms panel) will be evaluated with the metric defined in the Metrics panel.

On the left, navigate to the Algorithms panel.
In addition to the default selections based on the quick prototype designation, add the DeepAR algorithm.

Inspect the Runtime environment#

One last step before training is to confirm that you have a compatible code environment in place.

On the left, navigate to the Runtime environment panel.
If not already present, select a code environment that includes the required packages for time series forecasting models.

Train and deploy a forecasting model#

Once we’re satisfied with the model’s design, we can train a session of models and then choose one to deploy to the Flow. You’ll notice that this process is the same as for any other visual prediction or clustering model.

Train time series forecasting models#

Let’s kick off the training session!

Near the top right of the modeling task, click Train.
In the dialog, click Train once more to start the training session.

Inspect the training results#

The bar charts at the top allow you to compare different metrics across the trained models. In this case, the Simple Feed Forward algorithm performed best for all three metrics (MAPE, Symmetric MAPE, and RMSE).

On the left of the Result tab, click the best-performing model, Simple Feed Forward, to see the model training results.

Tip

You can click the Performance metrics dropdown in the session summary to change the displayed metrics.

In the Report tab, click Metrics for an overview of all the metrics computed for your model.

As for other visual models, the model report provides a number of visualizations and metrics related to the model’s performance. Users of Dataiku 13.4 or later can dive even deeper with finer granularity:

Still in the Report tab, navigate to the Granular time series metrics.
Under Per-fold metrics, toggle the option On.
Notice the different performance metrics of your model for each of the folds set up for cross-validation.

This way, you can see whether any fold underperforms or overperforms relative to the others, which helps you judge how consistent the model’s performance is across the training period.

View forecast charts#

Let’s look closer at this model’s forecast.

Still in the Report tab, navigate to the Forecast charts panel.
Inspect the predictions for each time series plot.

For each of the k = 5 folds, a black line shows the actual time series values. Another line shows the values that the model forecast for that fold. A shaded area represents the confidence intervals for the forecast values.

Notice that Dataiku also plots the forecast values and confidence intervals for the next horizon of the training dataset (beyond the last timestamp in the training data).

Note

The other panels of the model’s Report page show additional training details. For example, the Metrics tab in the Performance section displays the aggregated metrics for the overall dataset and its individual time series. The tabs in the Model Information section provide more details on resampling, features used to train the model, the algorithm details, etc.

Deploy the model to the Flow#

Once you finish inspecting the model and are satisfied with its performance, you can deploy the model to the Flow.

Click Deploy from the top right corner of the model page.
Click Create to deploy the Predict Adj_close (forecast) model to the Flow.

Note

Like any other Dataiku visual ML model, you can deploy time series models in the Flow for batch scoring or as an API for real-time scoring.

Evaluate a forecasting model#

Next, let’s evaluate the model’s performance on data not used during training. For this, we’ll use the Evaluate recipe.

Create an Evaluate recipe#

To apply the Evaluate recipe to the time series model, we’ll use a validation set from the airline_stocks_prepared dataset as input. The validation set will include the time series values after January 1, 2022. Recall that this validation set contains 18 weekly entries (or time steps) for each stock price time series.

From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.
In the Actions panel on the right, select the Evaluate recipe.
Click Set to add an output dataset. Name it eval, and click Create Dataset.
Click Set to add metrics. Name it metrics, and click Create Dataset.
Click Set to add an evaluation store. Name it eval_store, and click Create Evaluation Store.
Once you have inputs, outputs, and the evaluation store created, click Create Recipe.

Configure the Evaluate recipe#

On the recipe’s Settings page, Dataiku alerts you as to how much past data is required according to the model’s specifications.

Note

If you used external features while training the model, Dataiku would require that the input data to the Evaluate recipe contain values for the external features. This requirement is also true when you use the Scoring recipe.

By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

Specify the remaining settings for the recipe as follows:

On the Settings tab of the recipe, set Nb. evaluation windows to 3 to use three evaluation windows of six weeks each.
Instead of aggregated metrics, select the option to compute metrics Per time series for Dataiku to show one row of metrics per time series.
Click Run at the bottom left to execute the recipe.

Important

Notice that we chose these settings strategically to ensure that the Evaluate recipe forecasts all 18 time steps not used during training (three windows of six weeks each).

Explore the evaluation results#

In the eval dataset, we can view the actual values of the Adj_close prices and the forecasts alongside the quantiles. We can also create line charts on the dataset to compare the forecasts to the actual time series values.

Open the eval dataset, and explore the added forecast and quantile_0 columns.
Navigate to the Charts tab.
From the chart picker, select a Lines plot.
Drag Adj_close and forecast to the Y-axis field.
Drag Date to the X-axis field.
Drag Ticker to the Subcharts option to the left of the chart. Adjust the chart height as needed.

In addition to the forecast values, we can also examine the associated metrics.

Return to the Flow.
Open the metrics dataset to see one row of metrics per time series, per run of the Evaluate recipe.
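
To make "one row of metrics per time series" concrete, the sketch below groups a few made-up evaluation rows by Ticker and computes one MAPE value per series. The column names Ticker, Adj_close, and forecast match the eval dataset described above; the numbers are invented, and the real metrics dataset contains more metrics than this single one.

```python
from collections import defaultdict

# Made-up rows shaped like the eval dataset (actual value plus forecast).
rows = [
    {"Ticker": "AAL", "Adj_close": 16.0, "forecast": 17.6},
    {"Ticker": "AAL", "Adj_close": 20.0, "forecast": 18.0},
    {"Ticker": "DAL", "Adj_close": 40.0, "forecast": 40.0},
    {"Ticker": "DAL", "Adj_close": 50.0, "forecast": 45.0},
]

# Collect the absolute percentage error of every row under its series.
errors = defaultdict(list)
for row in rows:
    error = abs(row["Adj_close"] - row["forecast"]) / abs(row["Adj_close"])
    errors[row["Ticker"]].append(error)

# One metric value per time series, instead of one aggregate for all series.
mape_per_series = {ticker: sum(e) / len(e) for ticker, e in errors.items()}
for ticker, value in sorted(mape_per_series.items()):
    print(ticker, round(value, 4))
```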

For more in-depth evaluation monitoring, you can check the evaluation store created before.

Return to the Flow.
Open the eval_store evaluation store.
Under Model Evaluation, click Open on the latest evaluation.

Note

Notice that on this page you have an overview of all the evaluations of this model with a list of relevant statistics.

Navigate to the Forecast charts tab to visually explore the different forecasts the model has made for each ticker.

Score a forecasting model#

Finally, let’s apply a Score recipe to the model to predict future values of the time series.

Create a Score recipe#

To apply the Score recipe to the time series model, we’ll again use the airline_stocks_prepared dataset as input. The Score recipe will use this input with the trained model to forecast future values of the time series (for dates after May 2, 2022).

From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.
In the Actions panel on the right, select the Score recipe.
Click Create Recipe.

Configure the Score recipe#

As with the Evaluate recipe, Dataiku alerts us as to how much past data the input dataset must contain.

Leave the forecast length at the default of 6.

Note

Since Dataiku version 13.5, you can change the forecast length to a custom value.

Set Past data to include to 52 weeks.
Near the bottom left, click Run to execute the recipe.

Inspect the scored data#

The recipe forecasts values of Adj_close, alongside the quantiles, for the next six weeks for each of the airline stocks.

When the Score recipe finishes, open the airline_stocks_prepared_scored dataset.
Click on the Date column header, select Sort, and then click the icon to sort in descending order to see the forecasted values at the top of the dataset.
Click on the Ticker column header, select Filter, and then select a series such as AAL to view an individual series.
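
The same sort and filter can be expressed in code. The sketch below works on a few made-up rows; the column layout (Ticker, Date, Adj_close, forecast) is an assumption about the scored dataset, and inside Dataiku you would load the real dataset through the dataiku Python package rather than define rows by hand.

```python
from datetime import date

# Made-up rows: past observations have Adj_close, future rows have a forecast.
scored = [
    {"Ticker": "AAL", "Date": date(2022, 5, 2), "Adj_close": 18.1, "forecast": None},
    {"Ticker": "AAL", "Date": date(2022, 5, 9), "Adj_close": None, "forecast": 18.3},
    {"Ticker": "AAL", "Date": date(2022, 5, 16), "Adj_close": None, "forecast": 18.2},
    {"Ticker": "DAL", "Date": date(2022, 5, 9), "Adj_close": None, "forecast": 41.0},
]

# Keep a single series, then sort by Date in descending order
# so the forecast rows appear first.
aal = sorted(
    (row for row in scored if row["Ticker"] == "AAL"),
    key=lambda row: row["Date"],
    reverse=True,
)
for row in aal:
    print(row["Date"], row["Adj_close"], row["forecast"])
```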

Note

The reference documentation provides more information on using the Score recipe with a time series model.

Next steps#

Congratulations! You’ve taken your first steps toward forecasting time series data using Dataiku’s visual ML interface.