Concept Summary: Resampling
In this lesson, we’ll discuss resampling and interpolation, what motivates these steps, and how we can apply them in Dataiku DSS using the Resampling recipe of the Time Series Preparation plugin.
Equispaced Timestamps
Let’s return to our conceptual example of revenue from the Haiku T-Shirt shop.
At the beginning of this series, our data is equispaced. We have exactly one value for every consecutive day.
Then, at some point, our timestamps become irregularly spaced: some days are missing from the series. Many analysis and forecasting methods assume regular intervals, so if we want to use this data further, these irregularly spaced timestamps can cause problems.
Resampling the data gives us a way to create equispaced timestamps.
If we choose to keep our data at a daily interval, we can fill in the missing dates with new rows in the series.
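Outside of DSS, the same idea can be sketched in a few lines of pandas. This is only an illustration of the concept, not what the plugin does internally; the dates and revenue figures are made up for the example.

```python
import pandas as pd

# Hypothetical daily revenue with two missing days (Jan 4 and Jan 5).
revenue = pd.Series(
    [120.0, 95.0, 130.0, 110.0],
    index=pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-06"]
    ),
    name="revenue",
)

# Reindex onto a regular daily grid. The new rows exist but have no value yet (NaN).
daily = revenue.asfreq("D")
print(daily)
```

The resulting series has one row per day; the rows for the two missing dates are present but empty, which is exactly the gap that interpolation fills.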
Interpolation
Now, our timestamps are equispaced, but what values do we fill in for the new rows?
This is the interpolation step. We need to infer a value for the new rows based on our understanding or assumptions about the time series.
Connecting the points with a straight line (linear interpolation) is one option, but there are many others.
For example, we could carry forward the previous value.
Or we could fill backward from the next value.
Or perhaps, instead of a linear relationship, a quadratic curve is a closer fit.
These are just a few of the possible options. The answer will depend on your own understanding and assumptions about the data.
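To make these options concrete, here is a minimal pandas sketch that applies three of them to the same gap. The series values are invented for illustration, and pandas stands in for the plugin here.

```python
import numpy as np
import pandas as pd

# Hypothetical equispaced series with a two-day gap.
idx = pd.date_range("2021-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

# Straight line between the neighboring known points.
linear = s.interpolate(method="linear")

# Carry the previous known value forward.
previous = s.ffill()

# Fill backward from the next known value.
following = s.bfill()

print(pd.DataFrame({"linear": linear, "previous": previous, "next": following}))
```

pandas also accepts `method="quadratic"` in `interpolate`, but that option relies on SciPy being installed, so it is left out of this sketch.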
Resampling in Dataiku DSS
Let’s see how this process actually works in Dataiku DSS using the same data from the previous lesson.
From the wide format, we can easily see that the timestamps are not equispaced. Many dates are missing.
We can use the Resampling recipe in the Time Series Preparation plugin to take care of this.
order_date is the timestamp column.
In this case, we are resampling at a daily level, but we could also choose a shorter or longer interval depending on the data at hand and our objectives.
For now, let’s not interpolate or extrapolate any values.
Here we are using data in a wide format, but if we had long format data, we could just as easily select the long format option and provide the name of the identifier column.
After running the Resampling recipe, we have one row for every date from the beginning to the end of the range. However, the values in the new rows are all missing because we have not yet chosen an interpolation method.
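For readers who want to see what resampling long format data involves, the following pandas sketch resamples each series separately, keyed by an identifier column. The column name `tshirt_category`, the dates, and the second category's amounts are assumptions made for the example. Unlike the recipe output described above, this simplified version only fills dates within each series' own range.

```python
import pandas as pd

# Hypothetical long format data: one row per (date, identifier) pair.
long_df = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2021-01-01", "2021-01-04", "2021-01-02", "2021-01-03"]
        ),
        "tshirt_category": [
            "black_male", "black_male", "white_female", "white_female"
        ],
        "amount": [19.0, 57.0, 23.0, 30.0],
    }
)

# Resample each identifier's series onto a daily grid, without interpolation.
parts = []
for key, group in long_df.groupby("tshirt_category"):
    series = group.set_index("order_date")["amount"].asfreq("D")
    parts.append(series.reset_index().assign(tshirt_category=key))

resampled = pd.concat(parts, ignore_index=True)
print(resampled)
```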
Constant Value Interpolation
Let’s first demonstrate interpolating a constant value. If these are distinct sales, and we are not missing any data, interpolating a constant value of 0 may be a good strategy.
In the output, we can see that values of 0 have been added in between the original data points.
For example, the first sale of male, black T-Shirts was for 19 dollars. The next purchase was about one month later for 57 dollars. All of the dates in between those dates had a 0 filled in.
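The behavior described here, filling a constant between known points while leaving the edges untouched, can be approximated in pandas as follows. The helper `fill_between`, the column names, and the dates are hypothetical; only the 19 and 57 dollar amounts come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical wide format data, resampled to daily but not yet interpolated.
idx = pd.date_range("2021-01-01", periods=5, freq="D")
daily = pd.DataFrame(
    {
        "black_male": [19.0, np.nan, np.nan, np.nan, 57.0],
        "white_female": [np.nan, np.nan, 23.0, np.nan, np.nan],
    },
    index=idx,
)

def fill_between(series, value=0.0):
    """Fill gaps between the first and last observed values only."""
    first = series.first_valid_index()
    last = series.last_valid_index()
    if first is None:  # no observations at all: nothing to interpolate
        return series
    inside = (series.index >= first) & (series.index <= last)
    # Fill everything, then blank out the rows outside the observed range.
    return series.fillna(value).where(inside)

interpolated = daily.apply(fill_between)
print(interpolated)
```

The `black_male` column gets zeros between its two sales, while `white_female` keeps its missing values before and after its single sale because nothing is extrapolated.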
Extrapolation
Because we did not perform any extrapolation, Dataiku DSS did not insert any values before an individual series begins.
That’s why there are still missing values before the first recorded value in the later time series.
Let’s fill in the missing zeroes so we have a common start and end date for all series in the dataset by using the same interpolation method for extrapolation.
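As a rough pandas equivalent, extrapolating with the same constant amounts to filling every remaining gap, including those before the first and after the last observation. The data below is again invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical interpolated data: the second series starts later and ends earlier.
idx = pd.date_range("2021-01-01", periods=5, freq="D")
interpolated = pd.DataFrame(
    {
        "black_male": [19.0, 0.0, 0.0, 0.0, 57.0],
        "white_female": [np.nan, np.nan, 23.0, np.nan, np.nan],
    },
    index=idx,
)

# Extrapolate with the same constant so every series spans the full date range.
extrapolated = interpolated.fillna(0.0)
print(extrapolated)
```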
Looking at the output, we could take a dataset like this much further, for example, by calculating total sales per month or a rolling 7-day average.
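Both follow-up calculations become straightforward once the timestamps are equispaced. The sketch below uses pandas with made-up sale dates; on a daily grid, a window of 7 rows corresponds to 7 days.

```python
import pandas as pd

# Hypothetical equispaced daily sales, mostly zero, covering two months.
idx = pd.date_range("2021-01-01", "2021-02-28", freq="D")
sales = pd.Series(0.0, index=idx)
sales.loc[pd.Timestamp("2021-01-05")] = 19.0
sales.loc[pd.Timestamp("2021-02-03")] = 57.0

# Total sales per calendar month.
monthly_total = sales.groupby(sales.index.to_period("M")).sum()

# Rolling 7-day average; the first six days lack a full window and stay NaN.
rolling_avg = sales.rolling(window=7).mean()

print(monthly_total)
print(rolling_avg.head(10))
```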
Edit Series
Before moving on to an analysis step, we may want to edit the series. For the first month, sales are very sparse. Perhaps we don’t want to include this data. The plugin makes it easy to clip the series from the beginning or the end in the same time unit as the resampling parameter.
The number of timestamps we specify will then be removed from the beginning of the series.
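Conceptually, clipping drops a fixed number of time steps from either end of the resampled series. A minimal pandas sketch, in which `clip_start` and `clip_end` are hypothetical names rather than the plugin's parameters:

```python
import pandas as pd

# Hypothetical resampled daily series with ten timestamps.
idx = pd.date_range("2021-01-01", periods=10, freq="D")
daily = pd.DataFrame({"black_male": range(10)}, index=idx)

# Number of time steps (days, matching the resampling unit) to drop at each end.
clip_start, clip_end = 3, 0

# Slice by position; computing the end explicitly handles clip_end == 0.
end = len(daily) - clip_end
clipped = daily.iloc[clip_start:end]
print(clipped)
```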
What’s next?
And that’s the basics of equispacing and interpolating time series data with the Resampling recipe in the Time Series Preparation plugin!
Up next we’ll see how to identify periods when data values are within a given range under certain conditions using the Interval Extraction recipe.