Concept | Time series preparation#
Before we can unlock the power of modeling and forecasting techniques on time series data, we often need to ensure the data has certain properties.
In this article, we’ll discuss some common objectives of time series preparation and show how to achieve them by using the Time Series Preparation plugin in Dataiku.
Time series data review#
The Time Series Analysis & Forecasting course presents the types and formats of time series data. To recap, time series data can record a single time series, or it can contain multiple independent time series.
It might record just one variable of interest or it might document the relationship between multiple variables over time.
Furthermore, time series data can be stored in wide or long formats.
Time series preparation tasks#
Common tasks of time series preparation aim to:
Transform data occurring in irregular intervals into equispaced data.
Identify periods when data values are within a given range under certain conditions.
Smooth the data in order to reduce the noise and volatility or enrich it to uncover hidden patterns.
Focus on specific sections of data that are of particular interest.
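Outside Dataiku, three of these objectives (smoothing, identifying in-range periods, and focusing on a section of interest) can be illustrated with a short pandas sketch. The data, the 3-day window, and the 5 to 9 range below are made up for illustration, and the code is only a rough analogue of the ideas, not the plugin's implementation.

```python
import pandas as pd

# Hypothetical daily series; values are invented for illustration.
idx = pd.date_range("2024-01-01", periods=8, freq="D")
values = pd.Series([1.0, 5.0, 6.0, 2.0, 7.0, 8.0, 9.0, 3.0], index=idx)

# Smoothing: a centered 3-day rolling mean reduces point-to-point noise.
smoothed = values.rolling(window=3, center=True).mean()

# Identifying periods: find consecutive runs where the value stays in [5, 9].
in_range = values.between(5.0, 9.0)
# A new run starts whenever the in-range flag changes from one row to the next.
run_id = in_range.astype(int).diff().ne(0).cumsum()
intervals = [
    (group.index[0], group.index[-1])
    for _, group in values[in_range].groupby(run_id[in_range])
]

# Focusing on a section: keep one day on either side of the global maximum.
peak = values.idxmax()
around_peak = values.loc[peak - pd.Timedelta(days=1): peak + pd.Timedelta(days=1)]
```

Here `intervals` holds the start and end timestamps of each in-range run, and `around_peak` is the three-day slice centered on the largest value.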
Time Series Preparation plugin#
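Each of these tasks can be handled by a recipe in the Time Series Preparation plugin.
These recipes are, in the same order as the tasks above: Resampling, Interval Extraction, Windowing, and Extrema Extraction.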
Time series data in Dataiku#
Let’s take a look at a time series dataset in Dataiku.
We have a column of order dates and a column recording the amount spent on those orders. We also have a categorical identifier column.
So, is this data ready for use in a task such as forecasting?
To answer this question, let’s first look at the order_date column. In order for the Time Series Preparation plugin to treat the contents of this column as a timestamp, it must be parsed.
Using a Prepare recipe, we can parse the order_date column so that it is recognized as a date and its values are true timestamps.
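For readers who want to see what parsing means in code terms, here is a minimal pandas sketch of the same idea. The column names follow the article, but the rows, the year, and the `%Y-%m-%d` format are assumptions made for illustration. In Dataiku, the Prepare recipe does this without code.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration.
orders = pd.DataFrame({
    "order_date": ["2013-03-29", "2013-04-10", "2013-04-10"],
    "tshirt_category": ["Black T-Shirt M", "White T-Shirt F", "Black T-Shirt M"],
    "amount_spent": [20.0, 35.5, 18.0],
})

# Before parsing, the column holds plain strings, not timestamps.
dtype_before = orders["order_date"].dtype

# Parse the strings into timestamps; an explicit format avoids ambiguous guesses.
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")
dtype_after = orders["order_date"].dtype
```

After parsing, the column supports date arithmetic and sorting by time, which string values do not.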
Long or wide format#
Before using the Time Series Preparation plugin, we also need to know whether our data is in long or wide format.
We know this data is stored in long format because all of our measurements over time are stored in the same column, amount_spent.
We also have an identifier column, tshirt_category, that tells us the individual time series to which each measurement belongs.
If we wanted to, we could use the Pivot recipe or the Pivot processor in the Prepare recipe to reshape our data from long to wide format.
Once the data is in wide format, it is easier to see that the long format actually held six different time series, one for each t-shirt category.
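The same long-to-wide reshaping can be sketched in pandas. The two categories `A` and `B` and all values below are hypothetical stand-ins for the t-shirt categories. This is an illustration of the concept rather than what the Pivot recipe runs internally.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, category) measurement.
long_df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2013-03-29", "2013-04-10", "2013-04-10", "2013-04-12"]
    ),
    "tshirt_category": ["A", "B", "A", "B"],
    "amount_spent": [20.0, 35.5, 18.0, 12.0],
})

# Wide format: one row per date, one column per category.
wide_df = long_df.pivot(
    index="order_date", columns="tshirt_category", values="amount_spent"
)

# Cells with no order for that category on that date become NaN.
missing_cells = int(wide_df.isna().sum().sum())
```

Note that `DataFrame.pivot` raises a `ValueError` when the same date and category pair appears more than once, so a successful pivot is also a quick sign that no series has duplicate timestamps.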
Valid time series#
We should also verify that, within each series, every timestamp has only one measurement, a requirement for any time series to be valid. If we try to use a recipe from the Time Series Preparation plugin on data that violates this requirement, Dataiku will raise an error about duplicate timestamps.
Using the Analyze tool on the whole dataset is one way to check this. Here, each product category series has no more than one measurement per date.
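As a code-level illustration of the same check, the sketch below flags duplicate timestamps within each series using pandas. The helper name `find_duplicate_timestamps` and the sample rows are hypothetical, and summing the duplicates is just one possible way to resolve them.

```python
import pandas as pd

def find_duplicate_timestamps(df, time_col, id_col):
    """Return the rows whose (series identifier, timestamp) pair occurs more than once."""
    mask = df.duplicated(subset=[id_col, time_col], keep=False)
    return df.loc[mask].sort_values([id_col, time_col])

# Hypothetical data: category A has two measurements on 2013-04-10.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2013-03-29", "2013-04-10", "2013-04-10", "2013-04-10"]
    ),
    "tshirt_category": ["A", "A", "A", "B"],
    "amount_spent": [20.0, 18.0, 5.0, 35.5],
})

dupes = find_duplicate_timestamps(orders, "order_date", "tshirt_category")

# One possible fix (an assumption, not the only option): total the amounts per day.
deduped = orders.groupby(
    ["tshirt_category", "order_date"], as_index=False
)["amount_spent"].sum()
```

Whether to sum, average, or keep a single row depends on what each measurement represents.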
The wide format also makes it easy to spot missing values.
For each date, we may have an order for one category, but the other categories have missing data.
At the same time, we have some large gaps in the time series. The first date is March 29, and the next is not until April 10. Many other days are implicitly missing.
This means that there is unequal spacing between successive timestamps.
Depending on our objectives, this may be an issue we need to address with the Resampling recipe in the Time Series Preparation plugin.
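To make the idea of unequal spacing concrete, the following pandas sketch measures the gaps between successive timestamps and then resamples onto a daily grid. The dates echo the March 29 to April 10 gap described above, but the year, the values, the daily frequency, and the linear interpolation are assumptions. It is a rough analogue of resampling, not the Resampling recipe's implementation.

```python
import pandas as pd

# Hypothetical single series with irregular spacing.
series = pd.Series(
    [20.0, 18.0, 25.0],
    index=pd.to_datetime(["2013-03-29", "2013-04-10", "2013-04-11"]),
    name="amount_spent",
)

# Gaps between successive timestamps; more than one distinct gap means
# the series is not equispaced.
gaps = series.index.to_series().diff().dropna()
is_equispaced = gaps.nunique() == 1

# Resample onto a daily grid, then fill the new empty days by linear interpolation.
daily = series.resample("D").mean().interpolate(method="linear")
```

The resampled series has one row for every day between the first and last timestamps, so the spacing is uniform. Whether interpolated values are appropriate depends on the data: for order amounts, filling missing days with zero may be a more faithful choice.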