Tutorial | Time series forecasting (visual ML)#

Get started#

Many machine learning problems involve a time component. This temporal constraint introduces complexities that require careful analysis.

Dataiku offers various ways to implement time series modeling and forecasting. We’ll focus on Dataiku’s time series analysis functionality within the visual machine learning interface.

Objectives#

In this tutorial, you will:

  • Design a time series forecasting model using the visual ML interface.

  • Train and deploy a forecasting model to the Flow.

  • Use the Evaluate and Score recipes with a forecasting model.

Prerequisites#

  • Dataiku 14.2 or later.

  • An Advanced Analytics Designer or Full Designer user profile.

  • Basic knowledge of visual ML in Dataiku (ML Practitioner level or equivalent).

  • If not using Dataiku Cloud (where it’s available by default), you’ll need a specific code environment including the required packages. See Runtime and GPU support in the reference documentation.

Note

If you want to perform some exploratory data analysis (EDA) before starting this forecasting project, visit Tutorial | Time series analysis.

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Visual Time Series Forecasting.

  4. If needed, change the folder into which the project will be installed, and click Create.

  5. From the project homepage, click Go to Flow (or type g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Review the train dataset#

The train dataset is a multivariate time series dataset that includes the following columns:

| Column | Description |
| --- | --- |
| product | The category of the sold product: laptop, toy, or tshirt. |
| date | The date of each daily sales record. |
| sales | The sales amount. |
| is_holiday_flag | A boolean flag indicating whether the day is part of a holiday period. |
| website_traffic | The number of users visiting the market website. |
| promotion_level | The indicator of a discount. |

Design a forecasting model#

To build a time series model, we’ll use the train dataset. This dataset contains all time series data before 2024.

Tip

You’ll use the remaining time series values (occurring in 2024) as a validation set to evaluate the performance of your trained model.

Create the time series forecasting task#

The process is similar to that of other visual models.

  1. From the Flow, select the train dataset, and navigate to the Lab tab of the right panel.

  2. From the Visual ML section, select Time Series Forecasting.

  3. Select sales as the numerical feature to forecast.

  4. Select date as the date feature.

  5. Under Define identifier columns, select product as the identifier column for the different time series.

  6. Leave the defaults for a quick prototype, and click Create.

Dataiku screenshot of the dialog to create a time series forecasting model.

Configure the model’s general settings#

In the General settings panel of the Design tab, Dataiku has already specified parameter values based on:

  • The input selections from the forecasting task’s creation

  • The default settings

Let’s tweak these settings. In this exercise, we want to predict the daily sales amount for the next 7 days and refresh the model every 28 days. Therefore, we’ll want to use an evaluation period of 28 days, equivalent to four forecast horizons.

  1. Within the General settings panel of the modeling task’s Design tab, set the Timestep to 1 Day.

  2. Under Forecasting parameters, set Forecast horizon (in time steps) to 7.

    Note

    This parameter determines the length of the model forecast, so you should specify a value that’s a factor of the length of the validation set.

  3. In the Changing forecast horizon window, select Re-detect settings.

  4. Leave Skipped time steps in each forecast horizon at the default setting of 0. This parameter tells Dataiku the number of time steps within each horizon that you want to skip during the evaluation.

Dataiku screenshot of the general settings of a time series model design.

Tip

Recall the validation set contains the time series values occurring in 2024.

Configure the Train / Test Set panel#

Because time series have an order, splitting into train and test sets is quite different from a traditional machine learning problem.

  1. In the Design tab, navigate to the Train / Test Set panel on the left.

  2. In the Train/Test splitting section, leave the Auto setting by default.

    Note

    If you want more control over the splitting, you can click Custom and specify the date range you wish to have in your train and test sets per fold.

  3. Enter 4 in the Horizons in test set field.

  4. Check the box for a K-fold cross-test splitting strategy, and keep the default of 5 folds.

Dataiku screenshot of the train and test set settings of a time series model design.

Note

Learn more about the K-fold cross-test by visiting Cross-validation.
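To make the splitting concrete, here is a minimal sketch in plain Python of how time-ordered folds can be carved out of a series: each fold tests on the next block of timesteps and trains on all earlier history. This is an illustration only; Dataiku's internal splitting logic may differ.

```python
def time_series_folds(n_points, test_len, n_folds):
    """Split a time-ordered series of length n_points into n_folds folds.

    Each fold tests on a block of test_len consecutive points and trains on
    all points before that block; the last fold ends at the series' end.
    Illustrative sketch only -- Dataiku's internal splitting may differ.
    """
    folds = []
    for k in range(n_folds):
        test_end = n_points - (n_folds - 1 - k) * test_len
        test_start = test_end - test_len
        train_idx = list(range(test_start))           # all history before the test block
        test_idx = list(range(test_start, test_end))  # the next test_len points
        folds.append((train_idx, test_idx))
    return folds

# 4 horizons of 7 days each -> a 28-day test block per fold, 5 folds,
# over one hypothetical year of daily data.
folds = time_series_folds(n_points=365, test_len=28, n_folds=5)
```

Unlike a random K-fold split, every fold here respects the time order: no test point ever precedes a training point.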

Configure the Metrics panel#

The settings defined in the Metrics, Algorithms, and Hyperparameters panels define how Dataiku performs the search for the best model hyperparameters.

In the Metrics panel, you can choose the metric that Dataiku will use for model evaluation on the train and test sets.

  1. On the left, navigate to the Metrics panel.

  2. Switch to Mean Absolute Percentage Error (MAPE) as the metric for which the model’s hyperparameters should be optimized.

Dataiku screenshot of the metrics settings of a time series model design.

You can also click + Add Custom Metric to leverage any business-related metrics you may have.
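As a reference point, MAPE measures the average absolute error relative to the actual values. The sketch below shows the textbook formula; Dataiku's exact handling of edge cases (such as zero actuals) may differ.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (textbook formula)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Each forecast is off by 10% of its actual value -> MAPE of 10.0
print(mape([100, 200], [110, 180]))
```

Because MAPE is scale-free, it is convenient when comparing forecast quality across series with very different sales volumes, such as laptops versus tshirts.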

Configure the External Features panel#

External features are exogenous time-dependent features. By default, Dataiku disables the external features for several reasons. One reason is that some training algorithms don’t support the use of external features.

In the present use case, external features can be relevant. Having past data to compute the impact of an event, or both past and future data to account for known events such as holidays or discounts, can improve predictions. External features are also recommended if you want to use classical machine learning techniques such as XGBoost or random forest, which can yield strong results.

  1. Navigate to the External features panel.

  2. Check the boxes for is_holiday_flag and promotion_level to select both features.

  3. Select Input (past and future).

  4. Select the past only parameter for website_traffic.

Dataiku screenshot of the external feature settings of a time series model design.

Holidays and promotions are features that can be known in advance and impact future sales, whereas website traffic is unknown in advance. These settings are relevant for this use case.
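The distinction can be pictured with a small, hypothetical scoring input (the column names match the tutorial dataset, but the frame itself is made up). Features marked Input (past and future) must carry values across the forecast horizon, while past-only features simply stop at the forecast origin:

```python
import pandas as pd

# 7 days of history followed by a 7-day forecast horizon (made-up data).
dates = pd.date_range("2024-12-25", periods=14, freq="D")
origin = pd.Timestamp("2025-01-01")  # first day to forecast
df = pd.DataFrame({"date": dates})

# Known in advance -> supplied for past AND future rows.
df["is_holiday_flag"] = (df["date"].dt.month == 1) & (df["date"].dt.day == 1)
df["promotion_level"] = 0

# Not known in advance -> values exist only for past rows.
df["website_traffic"] = [float(i * 100) for i in range(14)]
df.loc[df["date"] >= origin, "website_traffic"] = float("nan")
```

At scoring time, the model can therefore condition on future holidays and promotions, but only on the history of website traffic.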

Generate new features#

The variables you selected as external features are now available for use. You can configure how you want new features to be generated from them.

For the time-shifted features, you have two choices:

| Option | Description |
| --- | --- |
| From forecast origin | The shift starts from the beginning of the horizon, regardless of which timestep is predicted. |
| From the forecasted point | The shift starts from the predicted timestep. You’ll have to set a shift that is at least the number of timesteps within the horizon. |

The second type of feature that you can generate is a window. With windows, you can aggregate past values of a feature to create a fixed historical value. For example, you can use the average web traffic over a week as a single feature that provides useful context.

  1. Navigate to the Feature Generation panel.

  2. Delete the From forecast origin entry for website_traffic, so that no time-shifted feature is created from it.

  3. Leave the default rolling window of -7, as it represents the weekly window you need.

  4. For website_traffic and sales, leave Avg and Std. dev. selected and unselect the others.

Dataiku screenshot of the feature generation settings of a time series model design.
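Under typical assumptions, the generated features resemble per-series lags and trailing-window aggregates. The pandas sketch below is an illustration of the idea, not Dataiku's actual implementation: it mimics a 7-step shift plus the 7-day Avg and Std. dev. window.

```python
import pandas as pd

def add_generated_features(df, col, lag=7, window=7):
    """Add a per-series lag plus trailing-window mean/std for one column.

    df must be sorted by date within each product. Illustrative only --
    Dataiku's feature generation may differ in its details.
    """
    g = df.groupby("product")[col]
    df[f"{col}_lag_{lag}"] = g.shift(lag)  # value `lag` timesteps earlier
    # Trailing window over strictly past values (shift(1) excludes "today").
    df[f"{col}_avg_{window}"] = g.transform(lambda s: s.shift(1).rolling(window).mean())
    df[f"{col}_std_{window}"] = g.transform(lambda s: s.shift(1).rolling(window).std())
    return df

# Tiny single-series example with made-up sales figures.
sales = pd.DataFrame({"product": ["toy"] * 10, "sales": range(1, 11)})
sales = add_generated_features(sales, "sales")
```

Grouping by product ensures that lags and windows never leak values across different time series.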

Configure the Algorithms panel#

The visual ML interface offers four categories of forecasting algorithms as indicated by the icon to the right of their names: statistical, deep learning, classical ML, and baseline.

During the hyperparameters optimization phase, each set of hyperparameters (defined in the Algorithms page) will be evaluated with the metric defined in the Metrics page.

  1. On the left, navigate to the Algorithms panel.

  2. In addition to the default selections based on the quick prototype designation, add the DeepAR - Torch and XGBoost algorithms, under Deep learning and Classical ML, respectively.

Dataiku screenshot of the algorithm settings of a time series model design.

Inspect the Runtime environment#

One last step before training is to confirm that you have a compatible code environment in place.

  1. On the left, navigate to the Runtime environment panel.

  2. If not already present, select a code environment that includes the required packages for time series forecasting models.

Dataiku screenshot of the runtime environment panel of a time series model.

Train and deploy a forecasting model#

Once we’re satisfied with the model’s design, we can go ahead and train a session of models, and then choose one to deploy to the Flow. You’ll notice that this process is the same as for any other visual prediction or clustering model.

Train time series forecasting models#

Let’s kick off the training session!

  1. Near the top right of the modeling task, click Train.

  2. In the dialog, click Train once more to start the training session.

Dataiku screenshot of the dialog to train a time series forecasting model.

Inspect the training results#

The bar charts at the top allow you to compare different metrics across the trained models. In this case, the XGBoost algorithm performed best for all three metrics (MAPE, Symmetric MAPE, and RMSE).

  1. On the left of the Result tab, click the best-performing model, XGBoost, to see the model training results.

    Dataiku screenshot of a model training session result.

    Tip

    Click the Metrics dropdown on top of the session summary to change the displayed metrics.

  2. In the Report tab, click Metrics to see an overview of all the metrics on which your model was evaluated.

As for other visual models, the model report provides a number of visualizations and metrics related to the model’s performance.

  1. Still in the Report tab, navigate to the Granular time series metrics.

  2. Toggle on the Per-fold metrics option.

  3. Notice the different performance metrics of your model for each of the folds set up for cross-validation.

Dataiku screenshot of the metrics per fold.

This way, you can see whether any of the folds are under- or overperforming relative to the others. This allows you to better monitor the performance of your model training and predictions.

View forecast charts#

Let’s look closer at this model’s forecast.

  1. Still in the Report tab, navigate to the Forecast charts panel.

  2. Inspect the predictions for each time series plot.

Dataiku screenshot of a forecast chart.

For each of the k = 5 folds, a black line shows the actual time series values. Another line shows the values that the model forecast for that fold. A shaded area represents the confidence intervals for the forecast values.
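These quantile bands can also be checked numerically. Assuming the output carries lower and upper quantile columns (the column names below are hypothetical), the empirical coverage of a band is the fraction of actual values that fall inside it:

```python
import pandas as pd

# Hypothetical evaluation output: actuals plus 5%/95% forecast quantiles.
df = pd.DataFrame({
    "sales":               [100, 120,  90, 110],
    "forecast_lower_0.05": [ 80, 100,  95,  90],
    "forecast_upper_0.95": [120, 140, 130, 115],
})

inside = df["sales"].between(df["forecast_lower_0.05"], df["forecast_upper_0.95"])
coverage = inside.mean()  # fraction of actuals inside the shaded band
```

A well-calibrated 5%-95% band should contain roughly 90% of the actual values over many forecasts.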

Note

The other panels of the model’s Report page show additional training details. For example, the Metrics tab in the Performance section displays the aggregated metrics for the overall dataset and its individual time series. The tabs in the Model Information section provide more details on resampling, features used to train the model, the algorithm details, etc.

Deploy the model to the Flow#

Once you finish inspecting the model and are satisfied with its performance, you can deploy the model to the Flow.

  1. Click Deploy from the top right corner of the model page.

  2. Click Create to deploy the Predict sales (forecast) model to the Flow.

Dataiku screenshot of the dialog to deploy a prediction model.

Note

Like any other Dataiku visual ML model, you can deploy time series models in the Flow for batch scoring or as an API for real-time scoring.

Evaluate a forecasting model#

Next, let’s evaluate the model’s performance on data not used during training. For this, we’ll use the Evaluate recipe.

Create an Evaluate recipe#

To apply the Evaluate recipe to the time series model, we’ll use a validation set from the eval dataset as input. The validation set will include the time series values after January 1, 2024.

  1. From the Flow, select both the eval dataset and Predict sales (forecast) model.

  2. In the Actions panel on the right, select the Evaluate recipe.

  3. Click Set to add an output dataset. Name it eval_output, and click Create Dataset.

  4. Click Set to add metrics. Name it eval_metrics, and click Create Dataset.

  5. Click Set to add an evaluation store. Name it eval_store, and click Create Evaluation Store.

  6. Once you have inputs, outputs, and the evaluation store created, click Create Recipe.

Dataiku screenshot of the dialog to create an Evaluate recipe.

Configure the Evaluate recipe#

On the recipe’s Settings page, Dataiku alerts you as to how much past data is required according to the model’s specifications.

Note

If you used external features while training the model, Dataiku requires that the input data to the Evaluate recipe contain values for those external features. This requirement also applies when you use the Score recipe.

By default, the recipe uses a forecast horizon of 7 days (seven steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

Specify the remaining settings for the recipe as follows:

  1. On the Settings tab of the recipe, leave the Nb. evaluation timesteps values as 0 to use the maximum number of evaluation windows.

  2. Instead of aggregated metrics, select the option to compute metrics Per time series for Dataiku to show one row of metrics per time series.

  3. Click Run to execute the recipe.

Dataiku screenshot of the settings tab of an Evaluate recipe.

Explore the evaluation results#

In the eval_output dataset, we can view the actual sales values and the forecasts alongside the quantiles. We can also create line charts on the dataset to compare the forecasts to the actual time series values.

  1. Open the eval_output dataset, and explore the added forecast column.

  2. Navigate to the Charts tab.

  3. From the chart picker, select a Lines plot.

  4. Drag sales and forecast to the Y-axis field.

  5. Drag date to the X-axis field.

  6. Click date and select Week as Date range to have a weekly view of the chart.

  7. Drag product to the Subcharts option to the left of the chart. Adjust the chart height as needed.

Dataiku screenshot of a line chart of time series forecasts.

In addition to the forecast values, we can also examine the associated metrics.

  1. Return to the Flow.

  2. Open the eval_metrics dataset to see one row of metrics per time series, per run of the Evaluate recipe.

Dataiku screenshot of the metrics output of an Evaluate recipe.

For more in-depth evaluation monitoring, you can check the evaluation store created before.

  1. Return to the Flow.

  2. Open the eval_store evaluation store.

  3. Under Model Evaluation, click Open on the latest evaluation.

    Note

    Notice that on this page you have an overview of all the evaluations of this model with a list of relevant statistics.

  4. Navigate to the Forecast charts tab to visually explore the different forecasts the model has made for each product.

Dataiku screenshot of a monitoring of time series forecasts from the evaluation store.

Score a forecasting model#

Finally, let’s apply a Score recipe to the model to predict future values of the time series.

Create a Score recipe#

To apply the Score recipe to the time series model, we’ll use the scoring editable dataset as input. The Score recipe will use this input with the trained model to forecast future values of the time series (for dates after January 1, 2025).

  1. From the Flow, select both the scoring dataset and Predict sales (forecast) model.

  2. In the Actions panel on the right, select the Score recipe.

  3. Click Create Recipe.

Dataiku screenshot of the dialog for a Score recipe.

Configure the Score recipe#

Similar to the Evaluate recipe, Dataiku alerts us as to how much past data must be found in the input dataset.

By default, the recipe uses a forecast horizon of 7 days since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

  1. Leave the forecast length at its default value of 7.

  2. Set the Past data to include at 28 days.

  3. Click Run to execute the recipe.

Dataiku screenshot of the settings tab of a Score recipe.

Inspect the scored data#

The recipe forecasts the sales values, alongside the quantiles, for the next seven days for each product.

  1. When the Score recipe finishes, open the scoring_scored dataset.

  2. Click on the date column header, select Sort, and then click the icon to sort in descending order to see the forecasted values at the top of the dataset.

  3. Click on the product column header, select Filter, and then select a series such as laptop to view an individual series.

Dataiku screenshot of the output of a Score recipe.

Note

The reference documentation provides more information on using the Score recipe with a time series model.

Next steps#

Congratulations! You’ve taken your first steps toward forecasting time series data using Dataiku’s visual ML interface.

See also

Learn more about time series forecasting using the visual interface in the reference documentation.

You can also explore an example project demonstrating various visual time series forecasting techniques in the Dataiku Gallery.