Tutorial | Time series forecasting (Visual ML)#
Get started#
Many machine learning problems involve a time component. This temporal constraint introduces complexities that require careful analysis.
Dataiku offers various ways to implement time series modeling and forecasting. We’ll focus on Dataiku’s time series analysis functionality within the visual machine learning interface.
Objectives#
In this tutorial, you will:
Design a time series forecasting model using the visual ML interface.
Train and deploy a forecasting model to the Flow.
Use the Evaluate and Score recipes with a forecasting model.
Prerequisites#
Dataiku 12.0 or later.
An Advanced Analytics Designer or Full Designer user profile.
Basic knowledge of visual ML in Dataiku (ML Practitioner level or equivalent).
A code environment that includes the packages required for time series forecasting. On Dataiku Cloud, one is available by default; otherwise, you’ll need to create it.
Note
If you want to perform some exploratory data analysis (EDA) before starting this forecasting project, visit Tutorial | Time series analysis.
Create the project#
From the Dataiku Design homepage, click + New Project > DSS tutorials > ML Practitioner > Visual Time Series Forecasting.
From the project homepage, click Go to Flow (or g + f).
Note
You can also download the starter project from this website and import it as a zip file.
You’ll next want to build the Flow.
Click Flow Actions at the bottom right of the Flow.
Click Build all.
Keep the default settings and click Build.
Review the train dataset#
The train dataset is a multivariate time series dataset that includes three important columns:
| Column | Description |
|---|---|
| Ticker | Stores the stock symbol identifying three independent time series for the three airlines: American (AAL), Delta (DAL), and United (UAL). |
| Date | Stores weekly timestamps from 2008 to January 2022. |
| Adj_close | Stores the stock’s daily closing price. |
Design a forecasting model#
To build a time series model, we’ll use the train dataset. This dataset contains all time series data before January 1, 2022.
Tip
You’ll use the remaining time series values (occurring after January 1, 2022) as a validation set to evaluate the performance of your trained model. This validation set contains 18 weekly entries (or time steps) for each stock price time series.
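To make the split concrete, here is a minimal sketch of a date-based cutoff in plain Python. The rows and prices are made up, and the timestamps are assumed to fall on Mondays; only the column names and the January 1, 2022 cutoff come from this tutorial.

```python
from datetime import date, timedelta

# Cutoff used in this tutorial: training data ends before this date.
CUTOFF = date(2022, 1, 1)

# Hypothetical weekly (Monday) rows for one ticker; prices are invented.
start = date(2021, 11, 1)
rows = [
    {"Ticker": "AAL", "Date": start + timedelta(weeks=i), "Adj_close": 20.0 + 0.1 * i}
    for i in range(27)
]

# Everything before the cutoff is used for training, the rest for validation.
train = [r for r in rows if r["Date"] < CUTOFF]
validation = [r for r in rows if r["Date"] >= CUTOFF]

print(len(train), "training rows,", len(validation), "validation rows")
print("validation spans", validation[0]["Date"], "to", validation[-1]["Date"])
```

With weekly Monday timestamps, the 18 validation steps run from January 3, 2022 to May 2, 2022.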
Create the time series forecasting task#
The process is very similar to that of other visual models.
From the Flow, select the train dataset, and navigate to the Lab.
From the Visual ML section, select Time Series Forecasting.
Select Adj_close as the numerical feature to predict.
Select Date as the date feature.
Under Define identifier columns, select Ticker as the identifier column for our different time series.
Leaving the default for quick prototypes, click Create.
Configure the model’s general settings#
In the General settings panel of the Design tab, Dataiku has already specified parameter values based on:
The input selections from the forecasting task’s creation
The default settings
Let’s tweak these settings. In this exercise, we want to predict the weekly airline stock prices for the next six weeks and refresh the models every 18 weeks. Therefore, we’ll want to use an evaluation period of 18 weeks, equivalent to three forecast horizons.
Within the General settings panel of the modeling task’s Design tab, change the Day of week to Monday.
Note
Although the input data here is equispaced, it is important that you specify the correct value for the Day of week parameter. Suppose you select a weekday different from the day the timestamps occur. In that case, Dataiku will create new timestamps for the specified day of the week and determine their corresponding values (for other columns in the dataset) by interpolating between the original timestamp values. The Scoring and Evaluate recipes will also forecast values for dates that fall on the weekday you specify.
Under Forecasting parameters, set Forecast horizon (in time steps) to 6.
Note
This parameter determines the length of the model forecast, so you should specify a value that is a factor of the length of the validation set.
In the Changing forecast horizon window, select Re-detect settings.
Set Horizons in evaluation to 3. This parameter determines the number of forecasting horizons.
Leave Skipped time steps in each forecast horizon at the default setting of 0. This parameter tells Dataiku the number of time steps within each horizon that you want to skip during the evaluation.
Tip
Recall the validation set contains the time series values occurring after January 1, 2022. This data is in the airline_stocks_prepared dataset and contains 18 weekly entries (or time steps) for each stock price time series.
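The arithmetic behind these settings can be sketched as follows. The helper `evaluation_windows` is a hypothetical name used for illustration only; it is not part of Dataiku. It shows how 18 time steps divide into three consecutive horizons of six steps.

```python
def evaluation_windows(n_steps, horizon, n_horizons):
    """Return (start, end) index pairs covering the last
    `n_horizons` consecutive windows of `horizon` steps each."""
    needed = horizon * n_horizons
    if needed > n_steps:
        raise ValueError("not enough time steps for the requested windows")
    first = n_steps - needed
    return [
        (first + k * horizon, first + (k + 1) * horizon)
        for k in range(n_horizons)
    ]

# 18 weekly steps, forecast horizon of 6, 3 horizons in evaluation.
windows = evaluation_windows(n_steps=18, horizon=6, n_horizons=3)
print(windows)
```

Because 6 is a factor of 18, the windows cover every time step exactly once, with nothing left over.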
Important
When configuring the model’s design, note that Dataiku will use:
The same Forecast horizon value when you apply a Scoring recipe to the model.
The Nb. time steps for evaluation value as the forecast horizon when you apply an Evaluate recipe to the model.
The same Forecast quantiles you specify during training when you score and evaluate the model.
Configure the Train / Test Set panel#
Because time series have an order, sampling for the train and test sets is quite different from a traditional machine learning problem.
In the Design tab, navigate to the Train / Test Set panel on the left.
Under Splitting parameters, check the box for a K-fold cross-test splitting strategy, and keep the default of 5 folds.
Tip
If the training data is not equispaced, Dataiku will automatically resample it. You’ll also be able to specify the imputation method for numerical and categorical data in the missing time steps.
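As a rough illustration of resampling onto a weekly grid, the sketch below linearly interpolates observations onto Mondays. Linear interpolation is an assumption chosen for simplicity, the function names are hypothetical, and the values are made up; Dataiku's own resampling may behave differently.

```python
from datetime import date, timedelta

def next_weekday(d, weekday):
    """First date on or after `d` that falls on `weekday` (Monday=0)."""
    return d + timedelta(days=(weekday - d.weekday()) % 7)

def resample_weekly(points, weekday=0):
    """Linearly interpolate sorted (date, value) pairs onto a weekly grid.

    Requires at least two points; only grid dates inside the observed
    range are produced (no extrapolation)."""
    out = []
    current = next_weekday(points[0][0], weekday)
    i = 0
    while current <= points[-1][0]:
        # Advance to the pair of observations surrounding `current`.
        while points[i + 1][0] < current:
            i += 1
        (d0, v0), (d1, v1) = points[i], points[i + 1]
        frac = (current - d0).days / (d1 - d0).days
        out.append((current, v0 + frac * (v1 - v0)))
        current += timedelta(weeks=1)
    return out

# Hypothetical observations recorded on Wednesdays.
wednesdays = [
    (date(2022, 1, 5), 10.0),
    (date(2022, 1, 12), 17.0),
    (date(2022, 1, 19), 24.0),
]
mondays = resample_weekly(wednesdays, weekday=0)
for d, v in mondays:
    print(d, round(v, 2))
```

Each Monday value lies between the two Wednesday observations that surround it, weighted by how close the Monday is to each.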
Note
Learn more about the K-fold cross-test by visiting Cross-validation.
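To build intuition for why folds must respect time order, here is a minimal sketch of one common scheme: each fold tests on a later block of time steps and trains only on the steps before it. The function name and the series length of 200 are hypothetical, and Dataiku's exact fold layout may differ from this sketch.

```python
def time_ordered_folds(n_steps, n_folds, test_size):
    """Build (train_indices, test_indices) pairs where every test block
    comes strictly after the data used to train for that fold."""
    folds = []
    for k in range(n_folds):
        test_end = n_steps - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough history for this many folds")
        folds.append((range(0, test_start), range(test_start, test_end)))
    return folds

# Hypothetical series of 200 weekly steps, 5 folds, 18-step test blocks.
folds = time_ordered_folds(n_steps=200, n_folds=5, test_size=18)
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Unlike a shuffled K-fold split, no fold ever trains on observations that occur after its test block.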
Configure the Metrics panel#
The settings defined in the Metrics, Algorithms, and Hyperparameters panels define how Dataiku performs the search for the best model hyperparameters.
In the Metrics panel, you can choose the metric that Dataiku will use for model evaluation on the train and test sets.
On the left, navigate to the Metrics panel.
Switch to Mean Absolute Percentage Error (MAPE) as the metric for which the model’s hyperparameters should be optimized.
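For reference, the sketch below implements textbook definitions of MAPE and two other common forecasting metrics, symmetric MAPE and RMSE. These are standard formulas expressed as fractions rather than percentages; Dataiku's exact definitions and scaling are not shown in this tutorial and may differ. The sample values are made up.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric MAPE: error relative to the mean of |actual| and |forecast|."""
    return sum(
        2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)
    ) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    )

# Made-up prices and forecasts.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print("MAPE :", mape(actual, forecast))
print("sMAPE:", smape(actual, forecast))
print("RMSE :", rmse(actual, forecast))
```

Note that MAPE is undefined when an actual value is zero, which is not a concern for stock prices but matters for other kinds of series.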
Configure the External Features panel#
External features are exogenous time-dependent features. By default, Dataiku disables the external features for several reasons. One reason is that some training algorithms do not support the use of external features.
Another reason for this default behavior is that for any model you train with external features, Dataiku will require you to provide future values of those external features during forecasting (when trying to score the trained model).
Therefore, if there’s no way to know the values of the external features ahead of time (as is the case for stock price information), the model should not use them during training. Training on features whose future values won’t be available at forecast time leads to what is called data leakage.
On the left, navigate to the External features panel.
Leave the external features disabled.
Configure the Algorithms panel#
The visual ML interface offers three categories of forecasting algorithms as indicated by the icon to the right of their names: baseline, statistical, and deep learning.
During the hyperparameter optimization phase, each set of hyperparameters (defined in the Algorithms panel) will be evaluated with the metric defined in the Metrics panel.
On the left, navigate to the Algorithms panel.
In addition to the default selections based on the quick prototype designation, add the DeepAR algorithm.
Inspect the Runtime environment#
One last step before training is to confirm that you have a compatible code environment in place.
On the left, navigate to the Runtime environment panel.
If not already present, select a code environment that includes the required packages for time series forecasting models.
Train and deploy a forecasting model#
Once we are satisfied with the model’s design, we can train a session of models and then choose one to deploy to the Flow. You’ll notice that this process is the same as for any other visual prediction or clustering model.
Train time series forecasting models#
Let’s kick off the training session!
Near the top right of the modeling task, click Train.
In the dialog, click Train once more to start the training session.
Inspect the training results#
The bar charts at the top allow you to compare different metrics across the trained models. In this case, the Simple Feed Forward algorithm performed best for all three metrics (MAPE, Symmetric MAPE, and RMSE).
On the left of the Result tab, click the best-performing model, Simple Feed Forward, to see the model training results.
Tip
You can click the Performance metrics dropdown in the session summary to change the displayed metrics.
View forecast charts#
Let’s look closer at this model’s forecast.
In the Report tab, navigate to the Forecast charts panel on the left.
Inspect the predictions for each time series plot.
For each of the k = 5 folds, a black line shows the actual time series values; another line shows the values that the model forecast for that fold; and a shaded area represents the confidence intervals for the forecast values.
Notice that Dataiku also plots the forecast values and confidence intervals for the next horizon of the training dataset (beyond the last timestamp in the training data).
Note
The other panels of the model’s Report page show additional training details. For example, the Metrics tab in the Performance section displays the aggregated metrics for the overall dataset and its individual time series. The tabs in the Model Information section provide more details on resampling, features used to train the model, the algorithm details, etc.
Deploy the model to the Flow#
Once you finish inspecting the model and are satisfied with its performance, you can deploy the model to the Flow.
Click Deploy from the top right corner of the model page.
Click Create to deploy the Predict Adj_close (forecast) model to the Flow.
Note
Like any other Dataiku visual ML model, you can deploy time series models in the Flow for batch scoring or as an API for real-time scoring.
Evaluate a forecasting model#
Next, let’s evaluate the model’s performance on data not used during training. For this, we’ll use the Evaluate recipe.
Create an Evaluate recipe#
To apply the Evaluate recipe to the time series model, we’ll use a validation set from the airline_stocks_prepared dataset as input. The validation set will include the time series values after January 1, 2022. Recall that this validation set contains 18 weekly entries (or time steps) for each stock price time series.
From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.
In the Actions panel on the right, select the Evaluate recipe.
Click Set to add an output dataset. Name it eval, and click Create Dataset.
Click Set to add metrics. Name it metrics, and click Create Dataset.
Once you have both inputs and outputs, click Create Recipe.
Configure the Evaluate recipe#
On the recipe’s Settings page, Dataiku alerts you as to how much past data is required according to the model’s specifications.
Note
If you used external features while training the model, Dataiku would require that the input data to the Evaluate recipe contain values for the external features. This requirement is also true when you use the Scoring recipe.
By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).
Specify the remaining settings for the recipe as follows:
On the Settings tab of the recipe, set the Nb. evaluation windows value to 3 to use three evaluation windows of six weeks.
Instead of aggregated metrics, select the option to compute metrics Per time series for Dataiku to show one row of metrics per time series.
Click Run at the bottom left to execute the recipe.
Important
Notice that we chose these settings strategically: three evaluation windows of six time steps each ensure that the Evaluate recipe forecasts all 18 time steps not used during training.
Explore the evaluation results#
In the eval dataset, we can view the actual values of the Adj_close prices and the forecasts alongside the quantiles. We can also create line charts on the dataset to compare the forecasts to the actual time series values.
Open the eval dataset, and explore the added forecast and quantile_0 columns.
Navigate to the Charts tab.
From the chart picker, select a Lines plot.
Drag Adj_close and forecast to the Y-axis field.
Drag Date to the X-axis field.
Drag Ticker to the Subcharts option to the left of the chart. Adjust the chart height as needed.
In addition to the forecast values, we can also examine the associated metrics.
Return to the Flow.
Open the metrics dataset to see one row of metrics per time series, per run of the Evaluate recipe.
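If you want to sanity-check a per-series metric yourself, the idea can be sketched in plain Python. The rows below are made up, and the layout (one row per ticker and time step, with Adj_close and forecast columns) is an assumption about how you might export the eval dataset; the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical evaluation rows; values are invented for illustration.
eval_rows = [
    {"Ticker": "AAL", "Adj_close": 20.0, "forecast": 18.0},
    {"Ticker": "AAL", "Adj_close": 10.0, "forecast": 11.0},
    {"Ticker": "DAL", "Adj_close": 40.0, "forecast": 40.0},
    {"Ticker": "DAL", "Adj_close": 50.0, "forecast": 45.0},
]

def mape_per_series(rows):
    """Group rows by Ticker and compute MAPE (as a fraction) for each series."""
    errors = defaultdict(list)
    for r in rows:
        errors[r["Ticker"]].append(
            abs(r["Adj_close"] - r["forecast"]) / abs(r["Adj_close"])
        )
    return {ticker: sum(e) / len(e) for ticker, e in errors.items()}

per_series = mape_per_series(eval_rows)
for ticker, value in sorted(per_series.items()):
    print(ticker, round(value, 4))
```

Computing the metric per series rather than in aggregate makes it easier to spot a single ticker that the model forecasts poorly.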
Score a forecasting model#
Finally, let’s apply a Scoring recipe to the model to predict future values of the time series.
Create a Score recipe#
To apply the Scoring recipe to the time series model, we’ll again use the airline_stocks_prepared dataset as input. The Scoring recipe will use this input with the trained model to forecast future values of the time series (for dates after May 2, 2022).
From the Flow, select both the airline_stocks_prepared dataset and Predict Adj_close (forecast) model.
In the Actions panel on the right, select the Score recipe.
Click Create Recipe.
Configure the Score recipe#
Similar to the Evaluate recipe, Dataiku alerts us as to how much past data must be found in the input dataset.
By default, the recipe uses a forecast horizon of six weeks (six steps in advance) since this was the model’s setting during training. Dataiku also outputs forecast values at different quantiles (the same ones used during training).
Specify the Past data to include as 52 weeks.
Near the bottom left, click Run to execute the recipe.
Inspect the scored data#
The recipe forecasts values of Adj_close, alongside the quantiles, for the next six weeks for each of the airline stocks.
When the Score recipe finishes, open the airline_stocks_prepared_scored dataset.
Click on the Date column header, select Sort, and then click the icon to sort in descending order to see the forecasted values at the top of the dataset.
Click on the Ticker column header, select Filter, and then select a series such as AAL to view an individual series.
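The sort and filter steps above can be mirrored in a few lines of Python if you export the scored data. This is a minimal sketch with made-up forecast values; it assumes weekly timestamps whose last observed date is May 2, 2022, so the six forecast dates are the six following Mondays.

```python
from datetime import date, timedelta

LAST_OBSERVED = date(2022, 5, 2)  # last timestamp in the input data
HORIZON = 6                       # forecast horizon in weekly time steps

# The six forecast dates that follow the last observed timestamp.
forecast_dates = [LAST_OBSERVED + timedelta(weeks=k) for k in range(1, HORIZON + 1)]

# Hypothetical scored rows; forecast values are invented for illustration.
scored = [
    {"Ticker": ticker, "Date": d, "forecast": base + 0.5 * i}
    for ticker, base in [("AAL", 18.0), ("DAL", 40.0), ("UAL", 45.0)]
    for i, d in enumerate(forecast_dates)
]

# Sort by Date in descending order, then keep a single series.
newest_first = sorted(scored, key=lambda r: r["Date"], reverse=True)
aal = [r for r in newest_first if r["Ticker"] == "AAL"]

for r in aal:
    print(r["Date"], r["Ticker"], r["forecast"])
```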
Note
The reference documentation provides more information on using the Scoring recipe with a time series model.
What’s next?#
Congratulations! You’ve taken your first steps toward forecasting time series data using Dataiku’s visual ML interface.
See also
Learn more about time series forecasting using the visual interface in the reference documentation.
You can also explore an example project demonstrating various visual time series forecasting techniques in the Dataiku Gallery.