Tutorial | Data preparation and visualization in the Lab#

Following the tutorial on joining data, we have built a simple and clean Flow.

Real data projects are rarely so clean. Having all of your work in the Flow can lead to overcrowding. The Lab is the place for experimentation and preliminary work. You can then decide to deploy work you do in the Lab to the Flow.

Objectives#

In this tutorial, you will:

  • Prepare data in the Lab’s visual analysis object (rather than the Flow).

  • Create charts in the same object.

  • Deploy a data preparation script to the Flow.

  • Deploy charts to an output dataset in the Flow.

Starting here?

If you skipped the previous sections, you must complete the Tutorial | Join recipe to get the customers_orders_joined dataset.

Create a visual analysis#

Returning to the project where we joined datasets, let’s create a visual analysis for the customers_orders_joined dataset.

  1. Select or open the customers_orders_joined dataset.

  2. From the Actions tab in the right panel, click the blue Lab button.

  3. Under Visual analyses, click New Analysis.

    A dataset highlighted in the Flow showing the actions available including Visual Analysis.
  4. In the New analysis dialog that prompts you to name your analysis, leave the default name, and click Create Analysis.

    A prompt for a new analysis name in the New analysis window.

Interactively prepare your data#

Dataiku displays the Script tab for our visual analysis. The script works the same as the script in a Prepare recipe.

Tip

Screencasts at the end of sections mirror the actions described.

Parse the birthdate column#

First, let’s parse the birthdate column.

  1. Open the birthdate column dropdown and select Parse date. Based on the sample, Dataiku suggests possible date formats. It’s up to you to validate which one is correct.

  2. Choose the yyyy/MM/dd format, and click Use Date Format.

  3. Leave the Output column field in the script step empty so the parsed date replaces the original birthdate column.
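If you’re curious how this step translates outside Dataiku, here is a minimal pandas sketch of the same parsing (the sample values are made up for illustration; the visual step itself requires no code):

```python
import pandas as pd

# Illustrative sample: birthdates stored as strings in yyyy/MM/dd format.
df = pd.DataFrame({"birthdate": ["1987/03/21", "1990/11/02"]})

# Parse in place, mirroring the step above (no separate output column).
df["birthdate"] = pd.to_datetime(df["birthdate"], format="%Y/%m/%d")
```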

Compute a new age column#

Using the customer’s birth date and the date they made their first order, we can compute the customer’s age when they made their first order.

  1. From the birthdate column dropdown, choose Compute time since.

  2. Set the until option to Another date column.

  3. In the Other column field, enter first_order_date.

  4. Change Output time unit to Years.

  5. In the Output column field, enter age_first_order.

Computing customer's age when they made their first order.

Examine the age distribution#

Now that we have ages, let’s examine the distribution.

  1. From the new column age_first_order header dropdown, select Analyze. As it turns out, there are a number of outliers with ages well over 120. Let’s assume these outliers represent bad data and remove them.

  2. Within the Analyze dialog, under the Outliers section, click the Actions button.

  3. Select Clear values outside of 1.5 IQR in order to set those values to missing.

    Using the Analyze tool to clear outliers.

Now the distribution looks more reasonable, but there are still a few suspicious values over 100. We can remove these by setting an upper bound limit.

  1. Close the Analyze dialog.

  2. In the Script tab, click the new step, Clear values outside [x,y] in age_first_order, to expand it.

  3. Set the upper bound to 100.

Using the Clear Outliers tool in a script.
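The logic behind these two steps fits in a few lines of pandas. A sketch, where clear_outside_iqr is a hypothetical helper written for this illustration (not a Dataiku API):

```python
import numpy as np
import pandas as pd

def clear_outside_iqr(s: pd.Series, k: float = 1.5, upper_cap: float = 100.0) -> pd.Series:
    """Set values outside [Q1 - k*IQR, min(Q3 + k*IQR, upper_cap)] to missing."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, min(q3 + k * iqr, upper_cap)
    return s.where(s.between(lo, hi), np.nan)

# Example: the implausible ages 130 and 250 are cleared to NaN.
ages = pd.Series([25, 34, 47, 61, 130, 250])
print(clear_outside_iqr(ages))
```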

Remove some columns#

Lastly, now that we’ve computed age_first_order, we won’t need birthdate or first_order_date anymore, so let’s add a step to remove those columns.

  1. Open the birthdate column dropdown, and select Delete.

  2. Repeat step 1 for the first_order_date column.

Dataiku adds a Remove step to the script.

See a screencast covering these steps

Leverage the user agent#

The user_agent column contains information about the browser and operating system. Let’s pull this information out into separate columns so that we can use it in further analyses.

Dataiku has inferred the meaning of the user_agent column to be User-Agent. Accordingly, its column dropdown is able to suggest specific actions.

  1. From the user_agent column dropdown, choose Classify User-Agent. This adds another step to the script and seven new columns to the dataset.

    Dataiku screenshot of a visual analysis script including a step to classify user agent.

    For this tutorial, we are only interested in the browser and the operating system, so we will remove the columns we don’t need.

  2. To do this quickly, change from the Table view to the Columns view using the icon near the top right of the screen.

  3. Select the five columns beginning with user_agent_, except user_agent_brand and user_agent_os.

  4. At the top left of the column list, click the Actions button, and select Delete.

  5. Switch back to Table view.

Toggling between table view and column view in a script for quick actions on columns.
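If you ever need comparable parsing outside Dataiku, the third-party user-agents package (an assumption for this sketch, not something Dataiku uses or requires) extracts similar fields:

```python
# pip install user-agents
from user_agents import parse

ua_string = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
ua = parse(ua_string)

print(ua.browser.family)  # "Chrome"  -- roughly user_agent_brand
print(ua.os.family)       # "Windows" -- roughly user_agent_os
```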

See a screencast covering these steps

Leverage the IP address#

Dataiku has inferred the meaning of the ip_address column to be an IP address. Just like with user_agent, we’ll have meaning-specific actions in the column dropdown.

  1. Open the column header dropdown for the ip_address column.

  2. Select Resolve GeoIP. This adds a new step to the script and seven new columns to the dataset that tell us about the geographic location of each IP address.

  3. In the script step, deselect Extract country code, Extract region, and Extract city because, in this tutorial, we are only interested in the country and GeoPoint (approximate longitude and latitude of the IP address).

  4. Finally, delete the ip_address column because we won’t need it anymore.
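Behind the scenes, GeoIP resolution is a lookup against a geolocation database. For a sense of the mechanics, here is a hedged sketch with the third-party geoip2 package, assuming you have separately downloaded a MaxMind GeoLite2-City.mmdb file (the Dataiku processor handles all of this for you):

```python
# pip install geoip2 -- the .mmdb path below is a placeholder.
import geoip2.database

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    response = reader.city("128.101.101.101")
    print(response.country.name)       # like ip_address_country
    print(response.location.latitude,  # like ip_address_geopoint
          response.location.longitude)
```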

See a screencast covering these steps

Use formulas#

Let’s create a new column to act as a label for customers who generate a lot of revenue. We’ll consider customers whose total order value exceeds 300 to be “high revenue” customers.

  1. Click +Add a New Step in the script, and select Formula.

  2. Enter high_revenue as the name of the Output column.

  3. Click Open Editor Panel to open the expression editor.

  4. Type if(total_sum > 300, "True", "False") as the expression.

  5. Click Apply.

Using the formula editor in a step in a script.
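To sanity-check the formula’s logic, here is the same labeling in pandas (made-up sample values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_sum": [120.5, 451.0, 87.25, 302.0]})

# Same test as the Formula step: if(total_sum > 300, "True", "False").
df["high_revenue"] = np.where(df["total_sum"] > 300, "True", "False")
print(df)
```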

See also

See the reference documentation to get the syntax for the Formula processor.

Visualize your data with charts#

Now let’s move from the Script tab to the Charts tab.

Sales by age and campaign#

Let’s see if there is a relationship between a customer’s age, whether that customer is part of a Haiku T-Shirt campaign, and how much they spend.

  1. Click +Chart at the bottom center of the screen.

  2. From the chart type picker, choose the Scatter plot.

  3. Drag age_first_order to the X axis and total_sum to the Y axis.

  4. Drag campaign to the color droplet.

  5. Drag count to the field setting the size of bubbles.

  6. From the size dropdown to the left of the count field, change the Base radius from 5 to 1 to reduce overlapping bubbles.

Building a scatter plot chart.

The scatter plot shows that older customers, and those who are part of the campaign, tend to spend the most. The bubble sizes show that some of the moderately valued customers are those who have made a lot of small purchases, while others have made a few larger purchases.
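For comparison, a rough matplotlib equivalent of this chart, using the same column-to-role mapping (x, y, color, bubble size); the data frame here is a made-up stand-in:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age_first_order": [23, 35, 47, 61, 58],
    "total_sum": [120, 310, 95, 480, 455],
    "campaign": [False, True, False, True, True],
    "count": [3, 12, 5, 2, 7],
})

colors = df["campaign"].map({True: "tab:orange", False: "tab:blue"})
plt.scatter(df["age_first_order"], df["total_sum"],
            s=df["count"] * 20, c=colors, alpha=0.6)
plt.xlabel("age_first_order")
plt.ylabel("total_sum")
plt.title("Total sales by age at first order, colored by campaign")
plt.show()
```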

Sales by geography#

Since we extracted locations from ip_address, it’s also natural to want to know where our customers come from. We can visualize this with a map.

  1. Click +Chart to create a second chart.

  2. Choose the Scatter map.

  3. Drag ip_address_geopoint to the Geo field.

  4. Drag campaign to the color droplet.

  5. Drag total_sum to the field that sets bubble size.

  6. From the size dropdown, change the Base radius from 5 to 2 to reduce overlapping bubbles.

This looks much better, and you can quickly get a feel for customer locations. If we then want to focus on the largest sales:

  1. Drag total_sum to the Filters box.

  2. Click the number for the lower bound to edit it, and type 300 to filter out customers who have spent less than 300.

Building a scatter map.

See a screencast covering these steps

Deploy work in the Lab to the Flow#

When working on charts in a visual analysis, you are building charts with a sample of your data. You can change the sample in the Sampling & Engine tab of the left panel, but since Dataiku has to re-apply the latest preparation each time, this will not be efficient for very large datasets.

In addition, if you want to share these charts with your team on a dashboard, you will first need to deploy your script to the Flow. Let’s deploy our script now.

Deploy the script#

First, let’s deploy the script to the Flow.

  1. From any tab in the visual analysis, go to the top right corner of the screen, and click on Deploy Script.

    A dialog appears to deploy the script as a Prepare recipe. By default, charts created in the Lab will be carried over to the new dataset so that you can view them on the whole output data, rather than a sample.

  2. Rename the output dataset customers_labelled.

  3. Click Deploy to create the recipe.

  4. Save the recipe, and go to the Flow.

Deploying a script from the Lab to the Flow.

Build the dataset#

The white square with a dashed line means the instructions for building a dataset are now available in the Flow, but the dataset is not yet built. Let’s build it.

  1. Open the customers_labelled dataset, and see that it is empty. This is because we have not yet run the recipe to build the full output dataset.

  2. Click Build. This opens a dialog that asks whether you want to build just this dataset (not recursive) or recursively reconstruct datasets either upstream or downstream. Since the input dataset is up-to-date, a non-recursive build is sufficient.

  3. Click Build Dataset (leaving non-recursive selected).

    Building a dataset non recursively.
  4. When the job completes, refresh the screen to view the output.
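If you prefer to script this step, the public dataikuapi Python client can trigger the same build. A sketch under stated assumptions: the URL, API key, and project key are placeholders, and the build() convenience method reflects our reading of the public API, so check the documentation for your DSS version:

```python
import dataikuapi

# Placeholders: your instance URL, an API key, and your project key.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("DKU_TUTORIAL")

# Equivalent of clicking Build with the non-recursive option selected.
dataset = project.get_dataset("customers_labelled")
dataset.build(job_type="NON_RECURSIVE_FORCED_BUILD")
```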

Configure the charts#

Let’s configure the charts to use the entire dataset.

  1. Go to the Charts tab (G+V) of the customers_labelled dataset.

  2. Click Sampling & Engine from the left panel.

  3. Deselect Use same sample as explore.

  4. Select No sampling (whole data) as the sampling method.

  5. Click Save and Refresh Sample.

Charts reconfigured to use the whole dataset rather than just a sample.

See a screencast covering these steps

What’s next?#

Congratulations! You successfully deployed a visual analysis script from the Lab to the Flow. Be sure to review the concept materials for a deeper discussion of the differences between the Lab and the Flow.

Now that the orders and customers datasets are joined, cleaned, and prepared, let’s take one more step: using reporting tools to share results with stakeholders.