Tutorial | Time series forecasting (Visual ML)#

Get started#

Many machine learning problems involve a time component. This temporal constraint introduces complexities that require careful analysis.

Dataiku offers various ways to implement time series modeling and forecasting. We’ll focus on Dataiku’s time series analysis functionality within the visual machine learning interface.

Objectives#

In this tutorial, you will:

  • Design a time series forecasting model using the visual ML interface.

  • Train and deploy a forecasting model to the Flow.

  • Use the Evaluate and Score recipes with a forecasting model.

Prerequisites#

  • A Dataiku instance (version 11 or later).

  • Basic knowledge of visual ML in Dataiku (ML Practitioner level or equivalent).

  • If not using Dataiku Cloud (where it is available by default), you’ll need a specific code environment including the required packages.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > Visual Time Series Forecasting.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Explore the Flow#

For any kind of data, before ever building models, it is important to explore the data by plotting charts and performing statistical analyses. Time series data is no exception.

By exploring your time series data, you’ll understand its characteristics better. For example, you can get insights into the underlying trends, patterns, and correlations. These insights will guide your feature engineering and inform which kinds of algorithms would be best suited for modeling the data.

Review the starting Flow#

After a few brief preparation steps, the train dataset in the starter project includes three columns of importance:

Column | Description
Ticker | Stores the stock symbol identifying three independent time series for the three airlines: American (AAL), Delta (DAL), and United (UAL).
Date | Stores weekly timestamps from 2008 to January 2022.
Adj_close | Stores the stock’s daily closing price we hope to predict or forecast.

Training dataset for the project.

Note

If building forecasting models later, note that the airline_stocks_prepared dataset contains all the training and validation data. This is helpful because Dataiku requires that the input datasets to the Evaluate and Scoring recipes include the historical (training) data used by the time series model.

Review the existing charts#

Plotting the data is a good initial step to see if you can observe any patterns in the time series, such as trends and seasonalities. This has already been done for you.

  1. Navigate to the Charts tab of the train dataset.

  2. Interactively explore the existing line plots.

Dataiku screenshot of the Charts tab of a time series dataset.

The plots show a dip in airline stock prices in early 2020, likely due to the COVID pandemic. Also, there appears to be a general upward trend from 2009 to 2020 for the UAL and DAL time series, perhaps less so for AAL.

See also

See the tutorial on time series preparation for an exercise on visualizing time series data. More generally, you can find resources on charts in the Knowledge Base.

Design a forecasting model#

To build the time series model, we’ll use the train dataset. This dataset contains all time series data before January 1, 2022.

Tip

You’ll use the remaining time series values (occurring after January 1, 2022) as a validation set to evaluate the performance of your trained model. This validation set contains 18 weekly entries (or time steps) for each stock price time series.
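The date-based split described above can be sketched in plain Python. This is only an illustration: it assumes weekly Monday timestamps ending on May 2, 2022, and the start date of November 1, 2021 is an arbitrary choice to keep the example short.

```python
from datetime import date, timedelta

# Assumptions for illustration: weekly Monday timestamps, last one on May 2, 2022.
CUTOFF = date(2022, 1, 1)
LAST = date(2022, 5, 2)


def weekly_dates(start, end):
    """Return weekly dates from start to end, inclusive."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(weeks=1)
    return dates


all_dates = weekly_dates(date(2021, 11, 1), LAST)

# Everything before the cutoff is training data; the rest is validation data.
train_dates = [d for d in all_dates if d < CUTOFF]
validation_dates = [d for d in all_dates if d >= CUTOFF]

print(len(validation_dates))  # 18
```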

Create the time series forecasting task#

The process is very similar to that of other visual models.

  1. From the Flow, select the train dataset, and navigate to the Lab.

  2. From the Visual ML section, select Time Series Forecasting.

  3. Select Adj_close as the numerical feature to predict.

  4. Select Date as the date feature.

  5. Select Ticker as the identifier column for multiple time series (we have time series for three different airlines).

  6. Leaving the default for quick prototypes, click Create.

Dataiku screenshot of the dialog to create a time series forecasting model.

Configure the model’s general settings#

In the General settings panel of the Design tab, Dataiku already specified parameter values based on (1) the input selections from the forecasting task’s creation and (2) the default settings.

We plan to predict the weekly stock price for the next six weeks, but since we plan to only refresh the models every 18 weeks, we want to use an evaluation period of 18 weeks, equivalent to three forecast horizons.

Note

Although the input data here is equispaced, it is important that you specify the correct value for the Day of week parameter. Suppose you select a weekday different from the day the timestamps occur. In that case, Dataiku will create new timestamps for the specified day of the week and determine their corresponding values (for other columns in the dataset) by interpolating between the original timestamp values. The Scoring and Evaluate recipes will also forecast values for dates that fall on the weekday you specify.
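To make the effect of a mismatched weekday concrete, here is a minimal sketch that moves a series observed on Fridays onto Mondays. It assumes linear interpolation between neighboring observations; the function name and sample values are hypothetical, and Dataiku's own resampling may use a different method depending on your settings.

```python
from datetime import date, timedelta


def interpolate_to_weekday(points, weekday):
    """Linearly interpolate (date, value) pairs onto a weekday (0 = Monday)."""
    points = sorted(points)
    first_date, last_date = points[0][0], points[-1][0]
    # First target date on or after the first observation.
    target = first_date + timedelta(days=(weekday - first_date.weekday()) % 7)
    resampled = []
    i = 0
    while target <= last_date:
        # Move to the last observation at or before the target date.
        while i + 1 < len(points) and points[i + 1][0] <= target:
            i += 1
        d0, v0 = points[i]
        if d0 == target:
            value = v0
        else:
            d1, v1 = points[i + 1]
            weight = (target - d0).days / (d1 - d0).days
            value = v0 + weight * (v1 - v0)
        resampled.append((target, value))
        target += timedelta(weeks=1)
    return resampled


# Hypothetical values observed on three consecutive Fridays.
fridays = [
    (date(2022, 1, 7), 10.0),
    (date(2022, 1, 14), 17.0),
    (date(2022, 1, 21), 24.0),
]
mondays = interpolate_to_weekday(fridays, weekday=0)
for day, value in mondays:
    print(day, round(value, 2))
```

The output contains new Monday timestamps whose values lie between the surrounding Friday observations, which is why choosing the weekday that matches your data avoids altering the original values.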

  1. Within the General settings panel of the modeling task’s Design tab, change the Day of week to Monday.

  2. Under Forecasting parameters, set Forecast horizon (in time steps) to 6. This parameter determines the length of the model forecast; therefore, you should specify a value that is a factor of the length of the validation set.

  3. When you change the forecasting horizon, you have the opportunity to re-detect settings. Click Re-Detect Settings.

  4. Set Horizons in evaluation to 3. This parameter determines how many forecast horizons the evaluation period spans; with a horizon of 6 time steps, 3 horizons cover 18 time steps.

  5. Leave Skipped time steps in each forecast horizon at the default setting of 0. This parameter tells Dataiku the number of time steps within each horizon that you want to skip during the evaluation.
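The arithmetic behind these three settings can be checked with a short sketch. The function name is hypothetical, and the treatment of skipped time steps reflects one reading of the setting's description rather than Dataiku's internal logic.

```python
def evaluation_plan(forecast_horizon, horizons_in_evaluation, skipped_steps=0):
    """Return (held_out_steps, scored_steps) for a horizon-based evaluation."""
    if not 0 <= skipped_steps < forecast_horizon:
        raise ValueError("skipped_steps must be in [0, forecast_horizon)")
    # Steps set aside for evaluation: one full horizon per evaluation horizon.
    held_out = forecast_horizon * horizons_in_evaluation
    # Steps that contribute to the metrics after skipping some in each horizon.
    scored = (forecast_horizon - skipped_steps) * horizons_in_evaluation
    return held_out, scored


VALIDATION_STEPS = 18  # weekly entries per series in the validation set

# The forecast horizon should be a factor of the validation set length.
assert VALIDATION_STEPS % 6 == 0

held_out, scored = evaluation_plan(6, 3, 0)
print(held_out, scored)  # 18 18
```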

Dataiku screenshot of the general settings of a time series model design.

Tip

Recall the validation set contains the time series values occurring after January 1, 2022. This data is in the airline_stocks_prepared dataset and contains 18 weekly entries (or time steps) for each stock price time series.

Important

When configuring the model’s design, note that Dataiku will use:

  • The same Forecast horizon value when you apply a Scoring recipe to the model.

  • The Nb. time steps for evaluation value as the forecast horizon when you apply an Evaluate recipe to the model.

  • The same Forecast quantiles you specify during training when you score and evaluate the model.

Configure the Train / Test Set panel#

Because time series have an order, sampling for the train and test sets is quite different from a traditional machine learning problem.

  1. In the Design tab, navigate to the Train / Test Set panel on the left.

  2. Under Splitting parameters, check the box for a K-fold cross-test splitting strategy, and keep the default of 5 folds.

Dataiku screenshot of the train and test set settings of a time series model design.

Tip

If the training data is not equispaced, Dataiku will automatically resample it. You’ll also be able to specify the imputation method for numerical and categorical data in the missing time steps.

Note

Using the cross-test provides a more accurate estimation of model performance and is useful when you don’t have much training data. The reference documentation explains how the process works.
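As a rough illustration of why ordered data needs a different splitting scheme, the sketch below builds folds in which each test window always comes after its training data. The fold layout, function name, and series length of 100 steps are assumptions made for this example; the reference documentation describes the layout Dataiku actually uses.

```python
def backtest_folds(n_steps, test_size, n_folds):
    """Build (train_range, test_range) index pairs, oldest fold first.

    Each fold tests on a window of test_size steps and trains only on
    the steps that precede that window, so no future data leaks in.
    """
    folds = []
    for k in range(n_folds):
        test_end = n_steps - k * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough data for the requested folds")
        folds.append((range(0, test_start), range(test_start, test_end)))
    return list(reversed(folds))


# Hypothetical series of 100 time steps, 18-step test windows, 5 folds.
folds = backtest_folds(n_steps=100, test_size=18, n_folds=5)
for train, test in folds:
    print(f"train steps 0-{train.stop - 1}, test steps {test.start}-{test.stop - 1}")
```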

Configure the Metrics panel#

The settings defined in the Metrics, Algorithms, and Hyperparameters panels define how Dataiku performs the search for the best model hyperparameters.

In the Metrics panel, you can choose the metric that Dataiku will use for model evaluation on the train and test sets.

  1. On the left, navigate to the Metrics panel.

  2. Switch to Mean Absolute Percentage Error (MAPE) as the metric for which the model’s hyperparameters should be optimized.
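For reference, MAPE averages the absolute error of each forecast relative to the actual value. The sketch below is a plain Python version of the standard formula with made-up numbers; it returns a fraction rather than a percentage, and it isn't meant to reproduce how Dataiku computes or displays the metric.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction (0.1 = 10%)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Note: an actual value of zero raises ZeroDivisionError, a known
    # limitation of MAPE.
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Illustrative values only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
score = mape(actual, forecast)
print(round(score, 4))  # 0.0667
```

Because each error is divided by the actual value, MAPE is comparable across series with different price levels, which is convenient when forecasting several stocks at once.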

Dataiku screenshot of the metrics settings of a time series model design.

Configure the External Features panel#

External features are exogenous time-dependent features. By default, Dataiku disables the external features for several reasons. One reason is that some training algorithms do not support the use of external features.

Another reason for this default behavior is that for any model you train with external features, Dataiku will require you to provide future values of those external features during forecasting (when trying to score the trained model).

Therefore, if there’s no way to know the values of the external features ahead of time (as is the case with stock price information), the model should not use them during training. Using such features leads to what is called data leakage.

  1. On the left, navigate to the External features panel.

  2. Leave the external features disabled.

Dataiku screenshot of the external feature settings of a time series model design.

Configure the Algorithms panel#

The visual ML interface offers three categories of forecasting algorithms as indicated by the icon to the right of their names: baseline, statistical, and deep learning.

During the hyperparameter optimization phase, each set of hyperparameters (defined in the Algorithms panel) will be evaluated with the metric defined in the Metrics panel.
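To give a sense of what a baseline forecaster does, here is a minimal sketch of a seasonal naive forecast, which repeats the most recent season of observed values. It is a generic textbook baseline written for illustration, with a hypothetical function name, and not the implementation behind any algorithm in the panel.

```python
def seasonal_naive_forecast(history, horizon, season_length=1):
    """Forecast by repeating the last observed season.

    With season_length=1 this reduces to repeating the last value.
    """
    if season_length < 1 or len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


history = [1, 2, 3, 4]
print(seasonal_naive_forecast(history, horizon=6))                   # [4, 4, 4, 4, 4, 4]
print(seasonal_naive_forecast(history, horizon=6, season_length=2))  # [3, 4, 3, 4, 3, 4]
```

A baseline like this gives a floor against which the statistical and deep learning algorithms can be compared.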

  1. On the left, navigate to the Algorithms panel.

  2. In addition to the default selections based on the quick prototype designation, add the DeepAR algorithm.

Dataiku screenshot of the algorithm settings of a time series model design.

Inspect the Runtime environment#

One last step before training is to confirm that you have a compatible code environment in place.

  1. On the left, navigate to the Runtime environment panel.

  2. If not already present, select a code environment that includes the required packages for time series forecasting models.

Dataiku screenshot of the runtime environment panel of a time series model.

Train and deploy a forecasting model#

Once we are satisfied with the model’s design, we can train a session of models and then choose one to deploy to the Flow. You’ll notice that this process is the same as for any other visual prediction or clustering model.

Train time series forecasting models#

Let’s kick off the training session!

  1. Near the top right of the modeling task, click Train.

  2. In the dialog, click Train once more to start the training session.

Dataiku screenshot of the dialog to train a time series forecasting model.

Inspect the training results#

The bar charts at the top allow you to compare different metrics across the trained models. In this case, the Simple Feed Forward algorithm performed best for all three metrics (MAPE, Symmetric MAPE, and RMSE).

  1. On the left of the Result tab, click the best-performing model, Simple Feed Forward, to see the model training results.

Dataiku screenshot of a model training session result.

Tip

You can click the Performance metrics dropdown in the session summary to change the displayed metrics.

View forecast charts#

Let’s look closer at this model’s forecast.

  1. In the Report tab, navigate to the Forecast charts panel on the left.

  2. Inspect the predictions for each time series plot.

Dataiku screenshot of a forecast chart.

For each of the k = 5 folds, a black line shows the actual time series values; another line shows the values that the model forecast for that fold; and a shaded area represents the confidence intervals for the forecast values.

Notice that Dataiku also plots the forecast values and confidence intervals for the next horizon of the training dataset (beyond the last timestamp in the training data).
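One way to read the shaded areas is to ask how often the actual values fall inside them. The sketch below computes that share for a set of lower and upper bounds; the function name and all numbers are made up for illustration and don't come from the tutorial's model.

```python
def interval_coverage(actual, lower, upper):
    """Share of actual values that fall inside [lower, upper] at each step."""
    if not (len(actual) == len(lower) == len(upper)) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Illustrative values only.
actual = [10, 12, 15, 11]
lower = [9, 11, 11, 12]
upper = [11, 13, 14, 14]
coverage = interval_coverage(actual, lower, upper)
print(coverage)  # 0.5
```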

Note

The other panels of the model’s Report page show additional training details. For example, the Metrics tab in the Performance section displays the aggregated metrics for the overall dataset and its individual time series. The tabs in the Model Information section provide more details on resampling, features used to train the model, the algorithm details, etc.

Deploy the model to the Flow#

Once you finish inspecting the model and are satisfied with its performance, you can deploy the model to the Flow.

  1. Click Deploy from the top right corner of the model page.

  2. Click Create to deploy the Predict Adj_close (forecast) model to the Flow.

Dataiku screenshot of the dialog to deploy a prediction model.

Note

Like any other Dataiku visual ML model, you can deploy time series models in the Flow for batch scoring or as an API for real-time scoring.

Evaluate a forecasting model#

Next, let’s evaluate the model’s performance on data not used during training. For this, we’ll use the Evaluate recipe.

Create an Evaluate recipe#

To apply the Evaluate recipe to the time series model, we’ll use a validation set from the airline_stocks_prepared dataset as input. The validation set will include the time series values after January 1, 2022. Recall that this validation set contains 18 weekly entries (or time steps) for each stock price time series.

  1. From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.

  2. In the Actions panel on the right, select the Evaluate recipe.

  3. Click Set to add an output dataset. Name it eval, and click Create Dataset.

  4. Click Set to add a metrics dataset. Name it metrics, and click Create Dataset.

  5. Once you have both inputs and outputs, click Create Recipe.

Dataiku screenshot of the dialog to create an Evaluate recipe.

Configure the Evaluate recipe#

On the recipe’s Settings page, Dataiku alerts you as to how much past data is required according to the model’s specifications.

Note

If you used external features while training the model, Dataiku would require that the input data to the Evaluate recipe contain values for the external features. This requirement is also true when you use the Scoring recipe.

By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

Specify the remaining settings for the recipe as follows:

  1. On the Settings tab of the recipe, specify the Nb. evaluation windows values as 3 to use three evaluation windows of six weeks.

  2. Instead of aggregated metrics, select the option to compute metrics Per time series for Dataiku to show one row of metrics per time series.

  3. Click Run at the bottom left to execute the recipe.

Dataiku screenshot of the settings tab of an Evaluate recipe.

Important

Notice that we chose these settings strategically to ensure that the Evaluate recipe forecasts values for all 18 time steps not used during training.
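The sketch below shows how three evaluation windows of six weekly steps tile the 18 validation steps. It assumes the first validation timestamp is Monday, January 3, 2022, and lays the windows out consecutively; the function name is hypothetical and Dataiku's exact window alignment may be computed differently.

```python
from datetime import date, timedelta


def evaluation_windows(first_date, horizon, n_windows):
    """Return (start, end) dates of consecutive windows of `horizon` weekly steps."""
    windows = []
    for w in range(n_windows):
        start = first_date + timedelta(weeks=w * horizon)
        end = start + timedelta(weeks=horizon - 1)
        windows.append((start, end))
    return windows


# Assumption: the first validation timestamp is Monday, January 3, 2022.
windows = evaluation_windows(date(2022, 1, 3), horizon=6, n_windows=3)
for start, end in windows:
    print(start, "to", end)
```

Under these assumptions, the last window ends on May 2, 2022, so the three windows together cover every validation time step exactly once.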

Explore the evaluation results#

In the eval dataset, we can view the actual values of the Adj_close prices and the forecasts alongside the quantiles. We can also create line charts on the dataset to compare the forecasts to the actual time series values.

  1. Open the eval dataset, and explore the added forecast and quantile_0 columns.

  2. Navigate to the Charts tab.

  3. From the chart picker, select a Lines plot.

  4. Drag Adj_close and forecast to the Y-axis field.

  5. Drag Date to the X-axis field.

  6. Drag Ticker to the Subcharts option to the left of the chart. Adjust the chart height as needed.

Dataiku screenshot of a line chart of time series forecasts.

In addition to the forecast values, we can also examine the associated metrics.

  1. Return to the Flow.

  2. Open the metrics dataset to see one row of metrics per time series, per run of the Evaluate recipe.
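To illustrate what "one row of metrics per time series" means, the sketch below groups rows shaped like the eval dataset by Ticker and computes MAPE for each group. The column names Ticker, Adj_close, and forecast come from this tutorial; the values are made up, and the metrics dataset produced by Dataiku contains more metrics than this.

```python
from collections import defaultdict


def mape_per_series(rows):
    """Compute MAPE (as a fraction) separately for each Ticker."""
    errors = defaultdict(list)
    for row in rows:
        error = abs((row["Adj_close"] - row["forecast"]) / row["Adj_close"])
        errors[row["Ticker"]].append(error)
    return {ticker: sum(errs) / len(errs) for ticker, errs in errors.items()}


# Illustrative rows only; real values come from the eval dataset.
rows = [
    {"Ticker": "AAL", "Adj_close": 20.0, "forecast": 18.0},
    {"Ticker": "AAL", "Adj_close": 10.0, "forecast": 11.0},
    {"Ticker": "DAL", "Adj_close": 40.0, "forecast": 40.0},
    {"Ticker": "UAL", "Adj_close": 50.0, "forecast": 45.0},
]
metrics = mape_per_series(rows)
for ticker, value in sorted(metrics.items()):
    print(ticker, round(value, 3))
```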

Dataiku screenshot of a line chart of time series forecasts.

Score a forecasting model#

Finally, let’s apply a Scoring recipe to the model to predict future values of the time series.

Create a Score recipe#

To apply the Scoring recipe to the time series model, we’ll again use the airline_stocks_prepared dataset as input. The Scoring recipe will use this input with the trained model to forecast future values of the time series (for dates after May 2, 2022).

  1. From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.

  2. In the Actions panel on the right, select the Score recipe.

  3. Click Create Recipe.

Dataiku screenshot of the dialog for a Score recipe.

Configure the Score recipe#

Similar to the Evaluate recipe, Dataiku alerts us as to how much past data must be found in the input dataset.

By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).

  1. Specify the Past data to include as 52 weeks.

  2. Near the bottom left, click Run to execute the recipe.

Dataiku screenshot of the settings tab of a Score recipe.

Inspect the scored data#

The recipe forecasts values of Adj_close, alongside the quantiles, for the next six weeks for each of the airline stocks.

  1. When the Score recipe finishes, open the airline_stocks_prepared_scored dataset.

  2. Click on the Date column header, select Sort, and then click the icon to sort in descending order to see the forecasted values at the top of the dataset.

  3. Click on the Ticker column header, select Filter, and then select a series such as AAL to view an individual series.
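The same sort and filter can be expressed in code if you export the scored data. The sketch below works on a small list of rows with made-up values; the column names follow this tutorial, and the function name is hypothetical.

```python
from datetime import date


def latest_rows(rows, ticker):
    """Keep one ticker's rows and sort them by Date, newest first."""
    selected = [row for row in rows if row["Ticker"] == ticker]
    return sorted(selected, key=lambda row: row["Date"], reverse=True)


# Illustrative rows only; real values come from the scored dataset.
scored = [
    {"Ticker": "AAL", "Date": date(2022, 5, 9), "forecast": 17.5},
    {"Ticker": "DAL", "Date": date(2022, 5, 9), "forecast": 41.0},
    {"Ticker": "AAL", "Date": date(2022, 5, 16), "forecast": 17.8},
    {"Ticker": "AAL", "Date": date(2022, 5, 2), "forecast": 17.2},
]
aal = latest_rows(scored, "AAL")
for row in aal:
    print(row["Date"], row["forecast"])
```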

Dataiku screenshot of the output of a Score recipe.

Note

The reference documentation provides more information on using the Scoring recipe with a time series model.

What’s next?#

Congratulations! You’ve taken your first steps toward forecasting time series data using Dataiku’s visual ML interface.

See also

Learn more about time series forecasting using the visual interface in the reference documentation.

You can also explore an example project demonstrating various visual time series forecasting techniques in the Dataiku Gallery.