Tutorial | Visual logic processors for data preparation#

Watch the featurette

A common data preparation task is defining complex rules based on several conditions or input columns. Think of cases where you needed to recode categorical values or define logic in the style of if-then statements.

Dataiku offers many tools to solve these kinds of problems. Here, we’ll focus on four tools in particular, each with its strengths and weaknesses.

Get started#

Objectives#

In this tutorial, you will:

Prerequisites#

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Visual Logic.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

This project includes data from a simplified credit card fraud use case. Using data about transactions, merchants, and cardholders, we could build a model that predicts which transactions should be authorized and which are potentially fraudulent.

A score of 1 for the target variable, authorized_flag, represents an authorized transaction. A score of 0, on the other hand, is a transaction that failed authorization.

Switch formula function#

If you’re familiar with the SWITCH() function in Microsoft Excel, you’ll be glad to know the same functionality can be found in Dataiku’s Formula language.

Let’s use the switch function in Dataiku’s Formula language to remap values in a single categorical column.

  1. Open the Prepare recipe that computes transactions_joined_prepared.

  2. Click + Add a New Step on the left.

  3. Search for and add a new Formula step to the script.

  4. Name the output column item_category_recoded.

  5. Click Open Editor Panel.

  6. Paste the following expression below to recode the first three values.

    switch(item_category, "A", "Domestic", "B", "International", "C", "Unknown")
    
  7. Observe the sample output, and then click Apply.

Dataiku screenshot of a switch function in a Formula step of a Prepare recipe.

Important

Note that any values not specified, such as category D in this case, remain empty in the output column.

Switch case processor#

In some cases, the switch function will be the perfect tool — particularly when creating a formula to be copied and pasted from a spreadsheet outside of Dataiku.

On the other hand, it may be difficult to read and maintain this kind of remapping for a column like merchant_subsector_description, which has 38 distinct values.

The Switch case processor step provides a more robust definition of rules for remapping a single variable.

  1. In the same Prepare recipe, click + Add a New Step.

  2. Search for and add a Switch case step to the script.

  3. Select merchant_subsector_description as the column.

  4. Name the output column spending_type.

  5. Click + Add Rule repeatedly to create each of the following sets of key-value pairs:

    • gas -> essential

    • internet -> essential

    • luxury goods -> discretionary

    • consumer electronics -> discretionary

  6. Add unknown as the value for unmatched rows, and observe the step preview.

Dataiku screenshot of a switch case processor in a Prepare recipe.

Important

Here we used the default case-sensitive Exact mode, but the processor also has the mode Lowercase (match “GAS”, “Gas”, and “gas” for example) and the accent-insensitive Normalize mode.

If you already have your mappings in two columns of a spreadsheet, click the Raw text edit option, and directly copy-paste them into the Rules field. Then click List edit to see the formatted rules.

Compared to the Formula step above, this solution makes it much easier to review the table of rules.

Note

This processor has similar functionality to the Find and Replace processor, but has a few differences, such as the ability to set a value for unmatched rows.

Create if, then, else statements#

For more complicated logic, you can try the Create if, then, else statements processor. This processor allows you to implement if, else if, and else statements in a visual manner.

Let’s use this processor to create rules that define categories of risk.

  1. In the same Prepare recipe, click + Add a New Step.

  2. Search for and add a Create if, then, else statements step to the script.

Start with an if statement#

Let’s start by defining parameters based on the minimum and maximum possible values for FICO scores, a common measure of creditworthiness in the United States.

  1. For the first condition, if the column card_fico_score is < the value 300, then set a new column (type risk) equal to the value invalid.

    Dataiku screenshot of an if statement a Prepare recipe.
  2. Click Preview. All rows in the preview of the risk column should be empty since none satisfy this condition.

This is only the minimum bound. Let’s also add the maximum bound for an valid FICO score.

  1. In the if block, click + Add a Condition.

  2. If card_fico_score is > the value 850, then the risk column should equal invalid.

When multiple conditions in an if or else if statement are present, we need to be mindful of the conjunction connecting them.

  1. Set the conjunction connecting these conditions to OR so that if either one of the two conditions is satisfied, then the row will be marked as invalid in the risk column.

    Dataiku screenshot of two if statements in a Prepare recipe.
  2. Click Preview again. All rows remain empty.

Add an else if block#

Once we have the first if-then statement verifying valid FICO scores, let’s define more rules to help categorize the level of a transaction’s risk.

  1. In the editor panel, click + Add Else If Group.

  2. Provide the condition days_active (the number of days a credit card is active) is < the value 30.

  3. If this condition is satisfied, then set the new column risk equal to the value high.

    Dataiku screenshot of an else if statement in a Prepare recipe.
  4. Click Preview, and observe that some rows in the risk column now have the value high.

Add an else statement#

Let’s use a final else statement to define a rule for any value that does not satisfy any of the conditions above it.

  1. At the bottom of the the editor panel, click + Add Else.

  2. If none of the above conditions are met, set the new column risk equal to the value low.

    Dataiku screenshot of an else statement in a Prepare recipe.
  3. Click Preview to see all of the previously empty rows now filled with this new value.

Add more else if blocks#

We can imagine needing many more possible outputs than just invalid, high, and low. We can define them with additional else if blocks.

Tip

Click and drag from the bottom right corner of the editor panel to reduce the size of the dataset output preview.

  1. Beneath the first else if block (above the else block), click + Add Else If Group.

  2. Provide the condition days_active is < the value 60.

  3. If this condition is satisfied, then set the new column risk equal to the value medium.

    Dataiku screenshot of a second else if block in a Prepare recipe.
  4. Click Preview to see that some values in the risk column now are labeled as medium.

Important

The order of if and else if blocks matters! Consider the case where the order of the two else if blocks in this example were reversed. A transaction on a card active for 15 days would satisfy the first else if block and be labeled as a medium risk transaction — even though it also satisfies the more stringent (but subsequent) condition for high risk transactions.

Turn conditions into groups for nested logic#

The current structure of the rules in place could be summarized by:

if (A or B) then "invalid"
else if C then "high"
else if D then "medium"
else "low"

We often need much more complex, nested logic. For example, the actual rules for a medium risk value might be if D and (E or F).

For this kind of logic, you can organize conditions into groups of conditions. Let’s try this now.

  1. In the editor panel, return to the second else if block that defines the medium risk values.

  2. Click the dropdown arrow next to + Add a Condition in order to Add a group.

  3. In addition to the first condition, force medium risk values to also satisfy at least one of the following two conditions:

    • signature_provided == the value 0.

    • OR

    • purchase_amount > the value 200.

    Dataiku screenshot of a nested conditional group.
  4. Click Preview to observe the new input, and then Apply to save the logic in this step.

  5. Finally, Save the recipe.

Tip

Although we can define a quite complex set of rules with this processor, Dataiku provides a readable summary of this step in the script on the left.

Dataiku screenshot of a summary of an if, then, else step in a Prepare recipe.

Going further#

This simple example only shows the beginning of what’s possible with this quite powerful processor. Two additional features are noteworthy:

More operators#

This processor offers many types of operators when creating conditions that vary depending on the storage type of the column.

This example used only numeric operators (greater than, less than, etc.), but strings and dates, for example, have many others, such as “Contains” or “Is before date”.

Comparing columns#

All of the conditions in this example are compared against a numeric value, but you can also evaluate conditions against a column. You can also use a column (or a formula) for the output of a then statement.

This is particularly useful when you have windowed or lagged features. For example, if we had the customer’s average purchase amount, we could compare the purchase_amount of the transaction to the column purchase_amount_avg instead of a fixed value.

Dataiku screenshot of an if statement comparing a column.

Use filtering conditions everywhere#

Instead of checking for invalid FICO scores in the Prepare recipe, we might want to perform this kind of check earlier in the data pipeline, for example in the Pre-filter step of the Join recipe.

We can use the same style of conditions available in the Create if, then, else statement processor in many of Dataiku’s filtering screens, such as the Pre- and Post-filter steps of most visual recipes.

  1. Navigate to the Pre-filters step of the Join recipe.

  2. Toggle On the Filter for the cardholder_info dataset.

  3. Before performing the Join operation, keep only rows from this dataset that satisfy all of the following conditions:

    • age <= 100 (an arbitrary upper bound).

  4. Click the dropdown menu next to + Add a Condition, and click Add a group. In addition to the upper bound on age, set lower and upper bounds on FICO score:

    • fico_score >= to 300

    • AND

    • fico_score <= to 850

    Dataiku screenshot of a pre-filter step in a Join recipe using multiple filter conditions.
  5. Click Apply, and review the summary of the filter.

  6. Click Run at the bottom right of the recipe, and Run again to build the Flow downstream.

Note

In addition to visual recipes, you also have the power of this same filtering logic when configuring samples in the Explore and Charts tabs, defining cell value metrics in the Status tab, or creating an assertion in the Design tab of a prediction model.

What’s next?#

Congratulations! Your data preparation toolkit for defining rules and complex logic without code in Dataiku has now expanded.

You can learn more about data preparation in Dataiku in the reference documentation and the Knowledge Base.