Concept Summary: Time Series Preparation
Before we can unlock the power of modeling and forecasting techniques on time series data, we often need to ensure the data has certain properties.
This task is so important that we have dedicated this entire course to the preparation of time series data.
In this lesson, we’ll discuss some common objectives of time series preparation and show how to achieve them with the Time Series Preparation plugin in Dataiku DSS.
Time Series Basics Review
In the Time Series Basics course, we learned about the types and formats of time series data. To recap, time series data can record a single time series, or it can contain multiple independent time series.
It might record just one variable of interest or it might document the relationship between multiple variables over time.
Furthermore, time series data can be stored in wide or long formats.
Time Series Preparation Tasks
Common tasks of time series preparation aim to:
Transform data occurring in irregular intervals into equispaced data
Identify periods when data values are within a given range under certain conditions
Smooth the data in order to reduce the noise and volatility or enrich it to uncover hidden patterns
Focus on specific sections of data that are of particular interest
Time Series Preparation Plugin
Each of these tasks can be handled by a recipe in the Time Series Preparation plugin.
These recipes are:
Resampling
Interval Extraction
Windowing
Extrema Extraction
Time Series Data in Dataiku DSS
Let’s take a look at a time series dataset in Dataiku DSS.
We have a column of order dates and a column recording the amount spent on those orders. We also have a categorical identifier column.
So, is this data ready for use in a task such as forecasting?
To answer this question, let’s first look at the order_date column.
For the Time Series Preparation plugin to treat the contents of this column as timestamps, the column must first be parsed.
Using a Prepare recipe, we can parse the order_date column so that it is recognized as a date and its values are true timestamps.
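Outside of DSS, the same parsing step can be sketched in pandas. This is only an illustration, not what the Prepare recipe runs internally: the sample rows, the year, and the category labels are hypothetical, and only the column names come from the dataset above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the lesson's dataset.
orders = pd.DataFrame({
    "order_date": ["2013-03-29", "2013-04-10", "2013-04-12"],
    "tshirt_category": ["A", "B", "A"],
    "amount_spent": [20.0, 35.5, 18.0],
})

# Until it is parsed, order_date holds plain strings, not timestamps.
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")

# order_date is now a datetime column.
print(orders.dtypes)
```

Passing an explicit format makes the parse fail loudly on values that do not match, rather than guessing silently.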
Long or Wide Format
Before using the Time Series Preparation plugin, we also need to know whether our data is in long or wide format.
We know this data is stored in long format because all of our measurements over time are stored in the same column, amount_spent.
And we have an identifier column, tshirt_category, that tells us the individual time series to which each measurement belongs.
If we wanted to, we could use the Pivot recipe or the Pivot processor in the Prepare recipe to reshape our data from long to wide format.
Once the data is in wide format, it is easier to see that the long format actually stacked six different time series in one column, one for each t-shirt category.
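As a rough pandas analogue of this reshaping (an assumption for illustration, not the Pivot recipe itself), the long-to-wide step might look like the sketch below. The rows and the two category labels are hypothetical.

```python
import pandas as pd

# Hypothetical long-format rows: one measurement column plus an identifier column.
long_df = pd.DataFrame({
    "order_date": pd.to_datetime(["2013-03-29", "2013-03-29", "2013-04-10"]),
    "tshirt_category": ["A", "B", "A"],
    "amount_spent": [20.0, 35.5, 18.0],
})

# One row per timestamp, one column per series.
# Combinations with no order become missing values (NaN).
wide_df = long_df.pivot(
    index="order_date",
    columns="tshirt_category",
    values="amount_spent",
)
print(wide_df)
```

Note that `pivot` raises an error if the same category and date pair appears twice, which is one reason to check for duplicate timestamps before reshaping.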
Valid Time Series
We should also verify that, within each series, every timestamp has only one measurement, a requirement for any time series to be valid.
Using the Analyze tool on the whole dataset is one way to check this. Here, every date has no more than one measurement for each series of a product category.
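The same check can be sketched in pandas by looking for repeated category and date pairs. This is a minimal illustration with hypothetical rows, where the last two rows deliberately collide.

```python
import pandas as pd

# Hypothetical long-format rows; the last two share a category and a date.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2013-03-29", "2013-04-10", "2013-04-10"]),
    "tshirt_category": ["A", "B", "B"],
    "amount_spent": [20.0, 35.5, 12.0],
})

key = ["tshirt_category", "order_date"]

# keep=False flags every row in a duplicated group, not just the repeats.
duplicated_rows = orders[orders.duplicated(subset=key, keep=False)]

# The data is valid only if no series has two measurements at one timestamp.
is_valid = duplicated_rows.empty
print(duplicated_rows)
print("valid:", is_valid)
```

If duplicates do turn up, they need to be resolved (for example by aggregating them) before the data can be treated as a set of valid time series.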
The wide format also makes it easy to spot missing values.
For each date, we may have an order for one category, but the other categories have missing data.
At the same time, we have some large gaps in the time series. The first date is March 29, and the next is not until April 10. Many other days are implicitly missing.
This means that there is unequal spacing between successive timestamps.
Depending on our objectives, this may be an issue we need to address with the Resampling recipe in the Time Series Preparation plugin.
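To make the spacing problem concrete, here is a small pandas sketch that measures the gaps between successive timestamps and then places the series on a daily grid. It only illustrates the idea: it is not the Resampling recipe, and the values, the year, and the choice of linear interpolation are assumptions.

```python
import pandas as pd

# Hypothetical single series with irregular spacing (12 days, then 2 days).
series = pd.Series(
    [20.0, 18.0, 25.0],
    index=pd.to_datetime(["2013-03-29", "2013-04-10", "2013-04-12"]),
    name="amount_spent",
)

# Differences between successive timestamps; more than one distinct
# value means the series is not equispaced.
gaps = series.index.to_series().diff().dropna()
is_equispaced = gaps.nunique() == 1

# Put the series on a regular daily grid; days without data become NaN.
daily = series.resample("D").mean()

# One possible way to fill the new rows: linear interpolation.
daily_filled = daily.interpolate(method="linear")

print(gaps)
print(daily_filled)
```

Whether to interpolate, forward-fill, or leave the gaps empty depends on what the values mean; for order amounts, a gap may simply mean that no order was placed.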
In the following sections, we’ll dive into each one of these recipes.