Hands-On Tutorial: Visual Logic for Data Preparation¶
A common data preparation task is defining complex rules based on several conditions or input columns. Think of cases where you needed to recode categorical values or define logic in the style of “if-then” statements.
Dataiku offers many tools to solve these kinds of problems. Here, we’ll focus on four tools in particular, each with its strengths and weaknesses.
Let’s Get Started¶
In this tutorial, you will practice defining rules and complex visual filters using four different methods:
Create the Project¶
We’ll reuse the Flow from the AI Consumer quick start.
From the Dataiku homepage, click +New Project > DSS Tutorials > Quick Start > AI Consumer Quick Start (Tutorial).
You can also download the starter project from this website and import it as a zip file.
Use Case Summary¶
This project is a simplified credit card fraud use case. Using data about transactions, merchants, and cardholders, we have a model that predicts which transactions should be authorized and which are potentially fraudulent.
A score of 1 for the target variable, authorized_flag, represents an authorized transaction. A score of 0, on the other hand, is a transaction that failed authorization.
Switch Formula Function¶
If you’re familiar with the
SWITCH() function in Microsoft Excel, you’ll be glad to know the same functionality can be found in Dataiku’s Formula language.
Let’s use the switch function in Dataiku’s Formula language to remap values in a single categorical column.
Open the Prepare recipe that computes transactions_joined_prepared.
Add a new Formula step to the script.
Name the output column
Open the editor panel, and paste the following expression below to recode the first three values.
Observe the sample output, and then click Apply.
switch(item_category, "A", "Domestic", "B", "International", "C", "Unknown")
Note that any values not specified, such as category “D” in this case, remain empty in the output column.
Switch Case Processor¶
In some cases, the switch function will be the perfect tool—particularly when creating a formula to be copied and pasted from a spreadsheet outside of Dataiku.
On the other hand, it may be difficult to read and maintain this kind of remapping for a column like merchant_subsector_description, which has 38 distinct values.
The Switch case processor step provides a more robust definition of rules for remapping a single variable.
In the same Prepare recipe, add a Switch case step to the script.
Choose merchant_subsector_description as the column.
Name the output
Click +Add Rule repeatedly to create each of the following sets of key-value pairs:
unknownas the value for unmatched rows, and observe the step preview.
Here we used the default case-sensitive “Exact” mode, but the processor also has the mode “Lowercase” (match “GAS”, “Gas”, and “gas” for example) and the accent-insensitive “Normalize” mode.
Compared to the Formula above, this solution makes it much easier to review the table of rules.
This processor has similar functionality to the Find and Replace processor, but has a few differences, such as the ability to set a value for unmatched rows.
Create if, then, else statements¶
Like with the Switch formula function, the Switch case processor might be the perfect tool for your task. For more complicated logic, you can try the Create if, then, else statements processor. This processor allows you to implement if, else if, and else statements in a visual manner.
Let’s use this processor to create rules that define categories of risk.
In the same Prepare recipe, add a Create if, then, else statements step to the script.
Start With an If Statement¶
Let’s start by defining parameters based on the minimum and maximum possible values for FICO scores, a common measure of creditworthiness in the United States.
For the first condition, if the column card_fico_score is
300, then a new column (type
risk) equals the value
Click Preview. All rows in the preview of the risk column should be empty since none satisfy this condition.
This is only the minimum bound. Let’s also add the maximum.
In the editor panel, add a second condition to the if statement.
In the if block, click Add a Condition:
if card_fico_score is
850, then the risk column should also be set to
When multiple conditions in an if or else if statement are present, we need to be mindful of the conjunction connecting them.
Set the conjunction connecting these conditions to OR so that if either one of the two conditions is satisfied, then the row will be marked as
Click Preview again. All rows remain empty.
Add an “Else If” Block¶
Once we have the first if-then statement verifying valid FICO scores, let’s define more rules to help categorize the level of a transaction’s risk.
In the editor panel, click + Add Else If Group.
Provide the condition days_active (the number of days a credit card is active) is
If this condition is satisfied, then set the new column
riskto the value
Click Preview, and observe that some rows in the risk column now have the value “high”.
Add an Else Statement¶
Let’s use a final else statement to define a rule for any value that does not satisfy any of the conditions above it.
In the editor panel, click + Add Else.
If none of the above conditions are met, set the new column
riskto the value
Click Preview to see all of the previously empty rows now filled with this new value.
Add More “Else If” Blocks¶
We can imagine needing many more possible outputs than just “invalid”, “high”, and “low”. We can define them with additional else if blocks.
Beneath the first “else if” block (above the else block), click + Add Else If Group.
Provide the condition days_active is
If this condition is satisfied, then set the new column
riskto the value
Click Preview to see that some values in the risk column now are labeled as “medium”.
The order of if and “else if” blocks matters! Consider if the order of the two “else if” blocks in this example were reversed. A transaction on a card active for 15 days would satisfy the first “else if” block and be labeled as a medium risk transaction–even though it also satisfies the more stringent (but subsequent) condition for “high” risk transactions.
Turn Conditions into Groups for Nested Logic¶
The current structure of the rules in place could be summarized by:
if (A or B) then "invalid" else if C then "high" else if D then "medium" else "low"
We often need much more complex, nested logic. For example, the actual rules for a “medium” value might be
if D and (E or F).
For this kind of logic, you can organize conditions into groups of conditions. Let’s try this now.
In the editor panel, return to the second “else if” block that defines the “medium” risk values.
Click the dropdown arrow next to “+ Add a Condition” in order to Add a group.
In addition to the first condition, “medium” risk values must also satisfy at least one of the following two conditions:
Click Preview to observe the new input, and then Apply to save the logic in this step.
Finally, save the recipe and update the schema.
Although we can define a quite complex set of rules with this processor, Dataiku provides a readable summary of this step in the script on the left.
This simple example only shows the beginning of what’s possible with this quite powerful processor. Two additional features are noteworthy:
This processor offers many types of operators when creating conditions that vary depending on the storage type of the column.
This example used only numeric operators (greater than, less than, etc.), but strings and dates, for example, have many others, such as “Contains” or “Is before date”.
All of the conditions in this example are compared against a numeric value, but you can also evaluate conditions against a column. You can also use a column (or a formula) for the output of a then statement.
This is particularly useful when you have windowed or lagged features. For example, if we had the customer’s average purchase amount we could compare the purchase_amount of the transaction to the column purchase_amount_avg instead of a fixed value.
Use Filtering Conditions Everywhere¶
Instead of checking for invalid FICO scores in the Prepare recipe, we might want to perform this kind of check earlier in the data pipeline, for example in the Pre-filter step of the Join recipe.
We can use the same style of conditions available in the “Create if, then, else statement” processor in many of Dataiku’s filtering screens, such as the Pre- and Post-filter steps of most visual recipes.
Navigate to the Pre-filter step of the Join recipe.
Turn On the Filter for the cardholder_info dataset.
Before performing the Join, keep only rows from this dataset that satisfy all of the following conditions:
<= 100(an arbitrary upper bound).
Click the dropdown next to “+ Add a Condition” to add a group of conditions.
In addition to the upper bound on age, set lower and upper bounds on FICO score:
fico_score must be
300AND fico_score must be
Close the dialog, and review the summary of the filter. Then save and Run the recipe.