Tutorial | Data preparation and visualization in the Lab (Core Designer part 9)¶
As of the tutorial on joining data, we have built a simple and clean Flow.
Real data projects are rarely so clean. Having all of your work in the Flow can lead to overcrowding. The Lab is the place for experimentation and preliminary work. You can then decide to deploy work you do in the Lab to the Flow.
In this tutorial, you will:
Prepare data in the Lab’s visual analysis object (rather than the Flow).
Create charts in the same object.
Deploy a data preparation script to the Flow.
Deploy charts to an output dataset in the Flow.
If you skipped the previous sections, you must complete the Tutorial | Enrich the dataset (Core Designer part 8) to get the customers_orders_joined dataset.
Returning to project where we joined datasets, let’s create a visual analysis for the customers_orders_joined dataset.
Dataiku displays the Script tab for our visual analysis. The script works the same as the script in a Prepare recipe.
Screencasts at the end of sections mirror the actions described.
Parse the birthdate column¶
First, let’s parse the birthdate column.
Open the birthdate column dropdown and select Parse date. Based on the sample, Dataiku suggests possible date formats. It’s up to you to validate which one is correct.
Choose the yyyy/MM/dd format, and click Use Date Format.
Leave the Output column field in the script step empty so the parsed date replaces the original birthdate column.
Compute a new age column¶
Using the customer’s birth date and the date they made their first order, we can compute the customer’s age when they made their first order.
From the birthdate column dropdown, choose Compute time since.
Set the until option to Another date column.
In the Other column field, enter
Change Output time unit to Years.
In the Output column field, enter
Examine the age distribution¶
Now that we have ages, let’s examine the distribution.
From the new column age_first_order header dropdown, select Analyze. As it turns out, there are a number of outliers with ages well over 120. Let’s assume these outliers represent bad data and remove them.
Within the Analyze dialog, under the Outliers section, click the Actions button.
Select Clear rows outside of 1.5 IQR in order to set those values to missing.
Now the distribution looks more reasonable, but there are still a few suspicious values over 100. We can remove these by setting an upper bound limit.
Close the Analyze dialog.
In the Script tab, click the new step, Clear values outside [x,y] in age_first_order, to expand it.
Set the upper bound to
Remove some columns¶
Lastly, now that we’ve computed age_first_order, we won’t need birthdate or first_order_date anymore, so let’s remove them from the script.
Open the birthdate column dropdown, and select Delete.
Repeat step 1 for the first_order_date column.
Dataiku adds a Remove step to the script.
Leverage the user agent¶
The user_agent column contains information about the browser and operating system, and we want to pull this information out into separate columns so that it’s possible to use it in further analyses.
Dataiku has inferred the meaning of the user_agent column to be User-Agent. Accordingly, its column dropdown is able to suggest specific actions.
From the user_agent column dropdown, choose Classify User-Agent. This adds another step to the script and seven new columns to the dataset.
To do this quickly, change from the Table view to the Columns view using the icon near the top right of the screen.
Select six columns beginning with
user_agent_, except user_agent_brand and user_agent_os.
At the top left of the column list, click the Actions button, and select Delete.
Switch back to Table view.
Leverage the IP address¶
Dataiku has inferred the meaning of the ip_address column to be an IP address. Just like with user_agent, we’ll have meaning-specific actions in the column dropdown.
Open the column header dropdown for the ip_address column.
Select Resolve GeoIP. This adds a new step to the script and seven new columns to the dataset that tell us about the geographic location of each IP address.
In the script step, deselect Extract country code, Extract region, and Extract city because, in this tutorial, we are only interested in the country and GeoPoint (approximate longitude and latitude of the IP address).
Finally, delete the ip_address column because we won’t need it anymore.
Let’s create a new column to act as a label on the customers generating a lot of revenue. We’ll consider customers with a value of total orders in excess of 300 as “high revenue” customers.
Click +Add a New Step in the script, and select Formula.
high_revenueas the name of the Output column.
Click Open Editor Panel to open the expression editor.
if(total_sum > 300, "True", "False")as the expression.
The syntax for the Formula processor can be found in the reference documentation.
Now let’s move from the Script tab to the Charts tab.
Popular user agents¶
Since we extracted the browsers used by customers from user_agent, it’s natural to want to know which browsers are most popular. A common way to visualize parts of a whole is with a pie or donut chart.
Navigate to the Charts tab.
Click the chart type dropdown, and choose Donut.
Drag Count of records to the Show box and user_agent_brand to the By box.
This shows that nearly 3/4 of customers who have placed orders use the Chrome browser. The donut displays the relative share of each browser to the total, but we’d like to include the OS in the visualization.
Click the chart type tool again, and select Vertical stacked bars.
Drag user_agent_os to the And box.
If necessary, click on user_agent_brand, and adjust the sorting to Count of records, descending.
Adding user_agent_os gives us further insight to the data. As expected, IE and Edge are only available on Windows, and Safari is only on MacOS. What is enlightening is that there are approximately double the number of customers using Chrome on MacOS as Safari and Firefox combined. There is a similar relationship between use of Chrome versus Firefox on Linux.
Sales by age and campaign¶
Let’s see if there is a relationship between a customer’s age, whether that customer is part of a Haiku T-Shirt campaign, and how much they spend.
Click +Chart at the bottom center of the screen.
From the chart type picker, choose the Scatter plot.
Drag age_first_order to the X axis and total_sum to the Y axis.
Drag campaign to the color droplet.
Drag count to the field setting the size of bubbles.
From the size dropdown to the left of the count field, change the Base radius from 5 to 1 to reduce overlapping bubbles.
The scatter plot shows that older customers, and those who are part of the campaign, tend to spend the most. The bubble sizes show that some of the moderately valued customers are those who have made a lot of small purchases, while others have made a few larger purchases.
Sales by geography¶
Since we extracted locations from ip_address, it’s also natural to want to know where our customers come from. We can visualize this with a map.
Click +Chart to create a third chart.
Choose the Scatter map.
Drag ip_address_geopoint to the Geo field.
Drag campaign to the color droplet.
Drag total_sum to the field that sets bubble size.
From the size dropdown, change the base radius from 5 to 2 to reduce overlapping bubbles.
This looks much better, and you can quickly get a feel for customer locations. If we then want to focus on the largest sales:
Drag total_sum to the Filters box.
Click the number for the lower bound to edit it, and type
300as the lower bound to filter out customers who have spent less than 300.
When working on charts in a visual analysis, you are building charts with a sample of your data. You can change the sample in the Sampling and Engine tab in the left panel, but since Dataiku has to re-apply the latest preparation each time, it will not be efficient for very large datasets.
In addition, if you want to share these charts with your team on a dashboard, you will first need to deploy your script to the Flow. Let’s deploy our script now.
Deploy the script¶
First, let’s deploy the script to the Flow.
From any tab in the visual analysis, go to the top right corner of the screen, and click on Deploy Script. A dialog appears to deploy the script as a Prepare recipe. By default, charts created in the Lab will be carried over to the new dataset so that you can view them on the whole output data, rather than a sample.
Rename the output dataset
Click Deploy to create the recipe.
Save the recipe, and go to the Flow.
Build the dataset¶
The white square with a dashed line means the instructions for building a dataset are now available in the Flow, but the dataset is not yet built. Let’s build it.
Open the customers_labelled dataset, and see that it is empty. This is because we have not yet run the recipe to build the full output dataset.
Click Build. This opens a dialog that asks whether you want to build just this dataset (not recursive) or recursively reconstruct datasets either upstream or downstream. Since the input dataset is up-to-date, a non-recursive build is sufficient.
Click Build Dataset (leaving non-recursive selected).
When the job completes, refresh the screen to view the output.
Configure the chart¶
Let’s configure the stacked bar chart to use the entire dataset.
Go to the Charts tab (
G+V) of the customers_labelled dataset.
Click Sampling & Engine from the left panel.
Deselect Use same sample as explore.
Select No sampling (whole data) as the sampling method.
Click Save and Refresh Sample.
Congratulations! You successfully deployed a visual analysis script from the Lab to the Flow. Be sure to review the concept materials for greater discussion on the differences between these two.
Now that the orders and customers datasets are joined, cleaned, and prepared, let’s take one more step: using reporting tools to share results with stakeholders.