Concept | Sample/Filter recipe
There are many ways to sample and filter data in Dataiku. You can do this within a number of visual recipes, or you can use the dedicated Sample/Filter recipe. A dedicated recipe makes this step explicit in your data pipeline, so it is easy to experiment with different methods or update them as needed.
For instance, if you have a very large dataset, you might want to keep only the meaningful information. By getting rid of redundant data, you can apply data transformations more quickly and clearly.
The first option you can activate in this recipe is a filter. There are three ways that you can achieve the correct subset of rows:
Adding conditions. You can add one or multiple conditions that will define the output of the filter.
Writing a formula. Dataiku’s formula language can be applied here.
Using SQL expressions. SQL expressions can be used if you are using a SQL engine to run the recipe.
Let’s say that you want to analyze sales of white T-shirts between July 1, 2019 and August 1, 2019.
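You could use these conditions: order_date after July 1, 2019, order_date before August 1, 2019, and tshirt_category equal to W_tshirt.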
You could use this formula:
order_date > asDate('2019-07-01 00:00', 'yyyy-MM-dd HH:mm') &&
order_date < asDate('2019-08-01 00:00', 'yyyy-MM-dd HH:mm') &&
strval('tshirt_category') == 'W_tshirt'
Or you could write some SQL:
("order_date" BETWEEN '2019-07-01 00:00:00' AND '2019-08-01 00:00:00')
AND ("tshirt_category" = 'W_tshirt')
Dataiku provides this flexibility so that you can choose what works best for your use case.
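For readers who think in code, the same filter logic can be sketched in plain Python. This is only an illustration of the conditions above, not how Dataiku evaluates them. The `orders` rows and the `keep` helper are hypothetical, and the date bounds are strict, as in the formula version.

```python
from datetime import datetime

# Hypothetical rows standing in for the T-shirt orders dataset.
orders = [
    {"order_date": datetime(2019, 6, 30, 9, 0), "tshirt_category": "W_tshirt"},
    {"order_date": datetime(2019, 7, 15, 14, 30), "tshirt_category": "W_tshirt"},
    {"order_date": datetime(2019, 7, 20, 8, 0), "tshirt_category": "B_tshirt"},
    {"order_date": datetime(2019, 8, 2, 10, 0), "tshirt_category": "W_tshirt"},
]

START = datetime(2019, 7, 1)
END = datetime(2019, 8, 1)


def keep(row):
    """Return True when the row satisfies every filter condition."""
    return (
        START < row["order_date"] < END
        and row["tshirt_category"] == "W_tshirt"
    )


# Only rows that pass all conditions reach the output.
filtered = [row for row in orders if keep(row)]
```

Here only the July 15 order of a white T-shirt passes all three conditions.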
In addition to filtering, this recipe contains another useful method to narrow your data: sampling. Let’s become familiar with some different kinds of sampling methods.
Different sampling methods have different computation costs. Some require one or two “full passes” – meaning Dataiku has to read the whole dataset one or two times.
The first records option retrieves a fixed number of rows from the start of the dataset. This method is simple and quick, but you might end up with a biased view of the data. In other words, the first N rows might not be representative of the whole dataset.
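The bias risk is easy to see in a small Python sketch. The `first_records` helper and the date-sorted `rows` are hypothetical. The point is only that reading stops after N rows, so anything that appears later in the dataset never reaches the sample.

```python
from itertools import islice


def first_records(rows, n):
    """Return the first n rows, reading no further than needed."""
    return list(islice(rows, n))


# Hypothetical dataset sorted by date: the first half is all January.
rows = [{"month": "Jan"}] * 50 + [{"month": "Feb"}] * 50

sample = first_records(iter(rows), 10)

# Every sampled row comes from January; February is missing entirely.
months_in_sample = {row["month"] for row in sample}
```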
For random sampling, Dataiku will output a random subset of rows. You can set the size of this subset as either an approximate ratio or an approximate number of records. Using a ratio requires one full pass, while finding an approximate number of records requires two full passes of the dataset.
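One way to see why the two variants differ in cost is the sketch below. It assumes a simple strategy (keep each row independently with a fixed probability) and is not a description of Dataiku's internal implementation. The function names are hypothetical.

```python
import random


def sample_by_ratio(rows, ratio, seed=None):
    """One pass: keep each row independently with probability `ratio`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


def sample_by_count(rows, target, seed=None):
    """Two passes: count the rows, then sample with ratio target / total.

    `rows` must be re-iterable (for example a list), since it is read twice.
    """
    total = sum(1 for _ in rows)  # first full pass: count the rows
    if total == 0:
        return []
    ratio = min(1.0, target / total)
    return sample_by_ratio(rows, ratio, seed)  # second full pass: sample
```

With a ratio, the keep probability is known up front, so a single read is enough. With a target count, the probability depends on the total number of rows, which is only known after a first read. In both cases the output size is approximate, not exact.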
Column values subset
This method requires two full passes and works by:
Targeting a column.
Taking a random sample of values from that column.
Collecting and outputting all rows that include those column values.
To understand this option, imagine that you are working with a dataset of customer orders. You want a random sample of customer orders that includes all orders belonging to each randomly selected customer. Therefore, you choose to sample by values in the customer_id column.
This method will randomly select customers from the customer_id column, and find all other orders (rows) from those customers. The output will be all orders from a random sample of customers.
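The steps above can be sketched in Python as follows. This is a minimal illustration under assumptions: the `sample_by_column_values` helper, its `ratio` parameter, and the sample data are hypothetical, and Dataiku's actual value-selection mechanism may differ.

```python
import random


def sample_by_column_values(rows, column, ratio, seed=None):
    """Keep every row whose value in `column` belongs to a random subset."""
    rng = random.Random(seed)

    # First pass: collect the distinct values of the target column.
    # Sorting makes the draw reproducible for a given seed.
    values = sorted({row[column] for row in rows})
    if not values:
        return []

    # Draw a random subset of those values (at least one, at most all).
    k = min(len(values), max(1, round(len(values) * ratio)))
    chosen = set(rng.sample(values, k))

    # Second pass: output all rows that carry one of the chosen values.
    return [row for row in rows if row[column] in chosen]


# Hypothetical orders: 10 customers with 10 orders each.
orders = [{"customer_id": f"c{i % 10}", "order_id": i} for i in range(100)]
subset = sample_by_column_values(orders, "customer_id", 0.3, seed=0)
```

In this example, `subset` holds the complete order history of three randomly chosen customers and nothing from the other seven.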
For the class rebalance sampling method, Dataiku does its best to equalize (or balance) the frequency of classes in the output sample. Most often, class rebalancing is used for machine learning, as you don’t usually want to train models on data with skewed class proportions. Class rebalance always requires two full passes of the dataset.
When using class rebalance in Dataiku, you can either select a percentage of records or an approximate number of records to sample. You’ll also have to choose the column that contains the classes that you would like to balance.
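A rough sketch of rebalancing by undersampling is shown below. It assumes a simple scheme (count classes, then keep rows of each class with a probability that aims for an equal share of the target) and is not Dataiku's actual algorithm. The `rebalance` helper and its parameters are hypothetical.

```python
import random
from collections import Counter


def rebalance(rows, column, target, seed=None):
    """Undersample so each class contributes a similar number of rows."""
    rng = random.Random(seed)

    # First pass: count how many rows each class has.
    counts = Counter(row[column] for row in rows)
    if not counts:
        return []

    # Aim for an equal share of the target per class. The probability is
    # capped at 1, so rare classes are kept whole but never duplicated.
    per_class = target / len(counts)
    keep_prob = {cls: min(1.0, per_class / n) for cls, n in counts.items()}

    # Second pass: keep each row with its class's probability.
    return [row for row in rows if rng.random() < keep_prob[row[column]]]
```

Because the keep probability is capped at 1, a class with fewer rows than its share stays under-represented in the output. The result is approximately balanced, not exactly balanced.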
There is also an option to use random seeds, which allows you to reproduce the random sample. There are many reasons to use a fixed random seed: for instance, if you want to share your project with someone who needs to replicate your results.
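The effect of a fixed seed can be illustrated with Python's standard `random` module. The `random_sample` helper is hypothetical and only stands in for any seeded sampling step.

```python
import random


def random_sample(rows, ratio, seed=None):
    """Keep each row with probability `ratio`, using a seeded generator."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


rows = list(range(1000))

# The same seed on the same data yields exactly the same sample.
first_run = random_sample(rows, 0.1, seed=2019)
second_run = random_sample(rows, 0.1, seed=2019)
```

A colleague who reruns the sampling with the same seed on the same data gets the same rows, which is what makes downstream results comparable.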
Let’s review an example. Using the same T-shirt orders use case, you might want to predict what type of T-shirt a customer will order. However, looking at the data, the different T-shirt categories are quite unbalanced.
After applying the class rebalance method with a 20% sampling ratio, the classes are much more balanced.
This method only undersamples and never oversamples, so Dataiku will never generate extra records.