Concept Summary: Time Series Preparation

Before we can unlock the power of modeling and forecasting techniques on time series data, we often need to ensure the data has certain properties.

This task is so important that we have dedicated this entire course to the preparation of time series data.

../../../_images/before-modeling.png

In this lesson, we’ll discuss some common objectives of time series preparation and show how to achieve them with the Time Series Preparation plugin in Dataiku DSS.

Time Series Basics Review

In the Time Series Basics course, we learned about the types and formats of time series data. To recap, time series data can record a single time series, or it can contain multiple independent time series.

../../../_images/single-multiple.png

It might record just one variable of interest or it might document the relationship between multiple variables over time.

../../../_images/uni-multivariate.png

Furthermore, time series data can be stored in wide or long formats.

../../../_images/wide-long.png

Time Series Preparation Tasks

Common tasks of time series preparation aim to:

  • Transform data occurring in irregular intervals into equispaced data

../../../_images/resampling.png
  • Identify periods when data values are within a given range under certain conditions

../../../_images/interval-extraction.png
  • Smooth the data in order to reduce the noise and volatility or enrich it to uncover hidden patterns

../../../_images/windowing.png
  • Focus on specific sections of the data that are of particular interest

../../../_images/extrema-extraction.png
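To make these four objectives more concrete, here is a rough sketch of analogous operations in pandas. It is not how the plugin is implemented; the sample values, the daily frequency, the range bounds, and the window sizes are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical readings recorded at irregular intervals (January 3 is missing).
readings = pd.Series(
    [1.0, 3.0, 9.0, 4.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05"]),
)

# 1. Equispaced data: place the readings on a daily grid and fill the
#    gap by linear interpolation.
daily = readings.resample("D").asfreq().interpolate(method="linear")

# 2. Periods within a range: flag the days whose value lies in [2, 6].
in_range = daily.between(2.0, 6.0)

# 3. Smoothing: a centered 3-day rolling mean reduces noise.
smoothed = daily.rolling(window=3, center=True).mean()

# 4. Section of interest: keep one day on each side of the global maximum.
peak = daily.idxmax()
one_day = pd.Timedelta(days=1)
around_peak = daily.loc[peak - one_day : peak + one_day]
```

Each step loosely corresponds to one objective in the list above, in the same order.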

Time Series Preparation Plugin

Each of these tasks can be handled by a recipe in the Time Series Preparation plugin.

These recipes are:

  • Resampling

  • Interval Extraction

  • Windowing

  • Extrema Extraction

../../../_images/ts-prep-plugin.png

Time Series Data in Dataiku DSS

Let’s take a look at a time series dataset in Dataiku DSS.

We have a column of order dates and a column recording the amount spent on those orders. We also have a categorical identifier column.

../../../_images/orders-raw.png

So, is this data ready for use in a task such as forecasting?

Parsed Date

To answer this question, let’s first look at the order_date column.

For the Time Series Preparation plugin to treat the contents of this column as timestamps, the column must first be parsed.

Using a Prepare recipe, we can parse the order_date column so that Dataiku DSS recognizes it as a date and stores its values as true timestamps rather than text.

../../../_images/parse-date1.png
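Outside of the Prepare recipe, the same idea can be sketched in pandas. This is only an analogy to what the recipe does. The sample rows, the year, and the date format below are assumptions, since the lesson does not show the raw string format of order_date.

```python
import pandas as pd

# Hypothetical sample rows; the year and the ISO-like string format are assumed.
orders = pd.DataFrame({
    "order_date": ["2013-03-29", "2013-04-10", "2013-04-12"],
    "amount_spent": [20.0, 35.5, 12.0],
})

# Before parsing, the column holds plain strings.
was_text = pd.api.types.is_string_dtype(orders["order_date"])

# Parse the strings into true timestamps, stating the expected format
# explicitly so that malformed values raise an error instead of passing silently.
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")

is_timestamp = pd.api.types.is_datetime64_any_dtype(orders["order_date"])
```

Once the column holds timestamps, date arithmetic and ordering behave as expected, which is what time series operations rely on.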

Long or Wide Format

Before using the Time Series Preparation plugin, we also need to know whether our data is in long or wide format.

We know this data is stored in long format because all of our measurements over time are stored in the same column, amount_spent.

And we have an identifier column, tshirt_category, that tells us the individual time series to which each measurement belongs.

If we wanted to, we could use the Pivot recipe or the Pivot processor in the Prepare recipe to reshape our data from long to wide format.

../../../_images/orders-wide.png

Once the data is in wide format, it is easier to see that the long format actually contained six different time series, one for each t-shirt category.
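As a rough pandas analogue of this reshaping (not the Pivot recipe itself), the sketch below turns a small long-format table into wide format. The column names come from the lesson, but the category labels, dates, and amounts are hypothetical, and only two categories are shown instead of six.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, category) measurement.
long_df = pd.DataFrame({
    "order_date": pd.to_datetime(["2013-03-29", "2013-03-29", "2013-04-10"]),
    "tshirt_category": ["Black T-Shirt M", "White T-Shirt F", "Black T-Shirt M"],
    "amount_spent": [20.0, 15.0, 35.5],
})

# Wide format: one row per timestamp, one column per individual time series.
wide_df = long_df.pivot(
    index="order_date",
    columns="tshirt_category",
    values="amount_spent",
)

# Categories with no order on a given date show up as missing values (NaN).
missing_per_series = wide_df.isna().sum()
```

Note that `DataFrame.pivot` raises an error when the same date and category pair appears more than once.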

Valid Time Series

We should also verify that each timestamp has only one measurement within each time series, a requirement for any time series to be valid. If we try to use a recipe from the Time Series Preparation plugin on data that violates this requirement, Dataiku DSS will throw an error complaining about duplicate timestamps.

Using the Analyze tool on the whole dataset is one way to check this. Here, each product category has no more than one measurement per date.

../../../_images/valid-ts.png
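The same validity check can be sketched in pandas. The helper name `find_duplicate_timestamps` and the sample rows are hypothetical. Summing duplicates is just one possible remedy and is an assumption here, not something the plugin does for us.

```python
import pandas as pd

def find_duplicate_timestamps(df, time_col, id_col):
    """Return rows whose (series identifier, timestamp) pair occurs more than once."""
    return df[df.duplicated(subset=[id_col, time_col], keep=False)]

# Hypothetical long-format data with one duplicated (category, date) pair.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2013-03-29", "2013-03-29", "2013-03-29", "2013-04-10"]
    ),
    "tshirt_category": [
        "Black T-Shirt M", "Black T-Shirt M", "White T-Shirt F", "Black T-Shirt M",
    ],
    "amount_spent": [20.0, 5.0, 15.0, 35.5],
})

duplicates = find_duplicate_timestamps(orders, "order_date", "tshirt_category")

# One possible remedy (an assumption): aggregate so that each series has a
# single measurement per timestamp.
valid = orders.groupby(
    ["tshirt_category", "order_date"], as_index=False
)["amount_spent"].sum()

remaining = find_duplicate_timestamps(valid, "order_date", "tshirt_category")
```

An empty result from the helper means every series has at most one measurement per timestamp.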

Missing Data

The wide format also makes it easy to spot missing values.

For each date, we may have an order for one category, but the other categories have missing data.

../../../_images/orders-wide-missing.png

At the same time, we have some large gaps in the time series. The first date is March 29, and the next is not until April 10. Many other days are implicitly missing.

This means that there is unequal spacing between successive timestamps.

Depending on our objectives, this may be an issue we need to address with the Resampling recipe in the Time Series Preparation plugin.
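To illustrate what unequal spacing means and what resampling aims for, here is a minimal pandas sketch. It does not reproduce the Resampling recipe. The year, the amounts, the daily frequency, and the choice of linear interpolation are all assumptions.

```python
import pandas as pd

# Hypothetical single series with a 12-day gap followed by a 1-day gap.
series = pd.Series(
    [10.0, 22.0, 24.0],
    index=pd.to_datetime(["2013-03-29", "2013-04-10", "2013-04-11"]),
    name="amount_spent",
)

# Spacing between successive timestamps; more than one distinct gap
# means the series is not equispaced.
gaps = series.index.to_series().diff().dropna()
is_equispaced = gaps.nunique() == 1

# Put the series on a daily grid, then fill the implicitly missing days
# by linear interpolation between the known values.
daily = series.resample("D").asfreq().interpolate(method="linear")
daily_gaps = daily.index.to_series().diff().dropna()
```

After resampling, every pair of successive timestamps is exactly one day apart.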

What’s next?

In the following sections, we’ll dive into each of the four recipes in the Time Series Preparation plugin.