Hands-On: Plugin Store¶
As you know, plugins allow users to extend the native features of Dataiku DSS. A plugin can contain one or more components, such as recipes, datasets, web apps, processors, and more.
Let’s Get Started!¶
In this hands-on lesson, you will use the Census USA plugin to enrich a dataset with socio-demographic variables from the US Census Bureau.
This lesson assumes that you have basic knowledge of working with Dataiku DSS datasets and recipes. If you are not already on the Advanced Designer learning path, we recommend completing the Core Designer Certificate first.
Also, you must have the following three plugins installed on your Dataiku DSS instance:
These plugins are available through the Dataiku Plugin Store, and you can find the instructions for installing plugins in the reference documentation. To check whether the plugins are already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.
We also recommend that you complete the following lessons beforehand:
The following figure shows the final Flow in Dataiku DSS.
Create Your Project¶
Create your project by selecting one of these three options:
Continue from the Visual Recipes 102 Course¶
You can begin this lesson by continuing with the same project you built in the Visual Recipes 102 course.
Import a New File-based Project¶
You can import the file-based version of this project from the Dataiku homepage. This version does not require that you have a PostgreSQL connection in your Dataiku DSS instance.
From the Dataiku homepage, click +New Project and select DSS Tutorials from the list. Choose Advanced Designer from the left, and select Plugin Store (File-based Tutorial). In the Flow, you’ll see that the project already includes the steps completed in the Visual Recipes 102 course.
Import a New SQL-based Project¶
You can import the SQL-based version of this project from the Dataiku homepage. Click +New Project and select DSS Tutorials from the list. Choose Advanced Designer from the left, and select Plugin Store (SQL-based Tutorial).
The imported project uses datasets that are connected to a PostgreSQL connection named postgresql. Dataiku DSS will return some connection-related errors if you do not have an identically named PostgreSQL connection on your DSS instance. To resolve these errors, remap connection names as described in the article Remapping Connections in a DSS Instance.
Once you’ve remapped the connection names, you may encounter some warnings about “missing plugin” or “plugin version mismatch”. Provided the “Census USA” plugin listed in the “Prerequisites” section of this lesson is already on your instance, you can ignore these warnings by clicking OK.
Build Your Project¶
If you’ve imported one of the new projects, when you click Go to Flow, you’ll see that the project already includes the steps completed in the Visual Recipes 102 course. Notice that this project only has the skeleton of the Flow. The datasets have not yet been built.
Let’s build the parts we need (highlighted in the previous figure) for this lesson.
From the Flow, select the two output datasets relevant for this tutorial: income_per_tract_usa_copy and merchants_by_state.
With both datasets selected, choose Build from the Actions sidebar.
By default, the option “Build required dependencies” should be chosen. Click Preview to view the suggested job.
Now in the Jobs tab, we can see all 11 activities Dataiku DSS will perform.
Click Run and observe how Dataiku DSS progresses through the list of activities.
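To get a feel for what “Build required dependencies” implies, here is a minimal sketch of dependency resolution on a small graph. It is only an illustration of the idea, not how Dataiku DSS schedules its activities. The two target names come from this lesson; every upstream dataset name in `parents` is hypothetical.

```python
def build_order(targets, parents):
    """Return datasets in an order where every input precedes its output."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        # Visit all inputs first so they are listed before the dataset itself.
        for parent in parents.get(name, []):
            visit(parent)
        order.append(name)

    for target in targets:
        visit(target)
    return order


# Hypothetical upstream names; only the two targets come from the lesson.
parents = {
    "income_per_tract_usa_copy": ["income_per_tract_usa"],
    "merchants_by_state": ["merchants_prepared"],
    "merchants_prepared": ["merchants_raw"],
}

order = build_order(["income_per_tract_usa_copy", "merchants_by_state"], parents)
for name in order:
    print(name)
```

Each target appears only after everything it depends on, which is why building two output datasets can expand into a longer list of activities.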
Inspect the Data¶
The merchants_by_state dataset contains a list of unique merchant IDs from the state of Delaware, the geographical coordinates (latitude and longitude) of each merchant location, and the merchant subsector description.
Our first task is to determine the US census tract ID for each merchant location using this dataset. To do this, we will use the Get US census block group from lat lon recipe in the Census USA plugin.
Access the US Census Plugin¶
How you access the Census USA plugin will depend on which of its components you choose to use.
The plugin consists of six components: three dataset connectors and three visual recipes.
The dataset connectors from this plugin enable us to build and use the US Census data directly within Dataiku DSS. See the plugin page for more information.
Normally, to query the Census Bureau we would have to write code that uses their API to request data. A plugin recipe provides a graphical user interface wrapper around this code.
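To give an idea of the code a plugin recipe saves you from writing, here is a minimal sketch that builds a request URL for a single coordinate lookup. It assumes the public Census Geocoder “coordinates” endpoint and its `x`, `y`, `benchmark`, `vintage`, and `format` query parameters; the plugin’s actual implementation may differ. The function name and the example coordinates are hypothetical, and the sketch deliberately stops short of sending the request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Census Geocoder for coordinate lookups.
GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/geographies/coordinates"


def build_geocoder_url(latitude, longitude,
                       benchmark="Public_AR_Current",
                       vintage="Current_Current"):
    """Return the request URL for one latitude/longitude lookup."""
    params = {
        "x": longitude,  # x is the longitude
        "y": latitude,   # y is the latitude
        "benchmark": benchmark,
        "vintage": vintage,
        "format": "json",
    }
    return GEOCODER_URL + "?" + urlencode(params)


# Hypothetical coordinates for a location in Delaware.
url = build_geocoder_url(39.7391, -75.5398)
print(url)
```

A real script would still need to send the request, handle errors, and parse the JSON response for every row, which is the work the recipe hides behind its settings form.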
We will use the Get US census block group from lat lon recipe to enrich our dataset. To access the recipe, click the +Recipe button from the Flow and select Census USA from the list. Alternatively, to access the recipe,
Open or select the merchants_by_state dataset in the Flow
Open the Actions sidebar
Click Census USA from the “Plugin recipes” section to bring up a window containing the three recipes in the plugin
Select the Get US census block group from lat lon recipe.
Configure the Plugin Recipe¶
You can now configure the input and output of the Get US census block group from lat lon recipe by specifying the input dataset as merchants_by_state and creating a new dataset merchant_census_tracts as the output. Doing this opens up the Settings page of the recipe. To configure the settings,
Specify the value of “Column LATITUDE” as merchant_latitude and “Column LONGITUDE” as merchant_longitude.
Keep the value for “Benchmark” as Public_AR_Current to use the most recent snapshot of the US Census database, and “Vintage” as Current_Current to use the current address ranges as of the selected benchmark.
Specify the “API call throttle” as 0 to define the pause in seconds between each API call. A zero value is fine because the dataset is small, but you should adapt the value accordingly for larger datasets.
Select Use an id column as the value for “param_strategy.”
Finally, specify the “Input Column ID” to correspond to the unique IDs in the merchant_id column of the dataset.
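The settings above map naturally onto a simple loop: one lookup per row, a pause between calls, and the ID column carried through to the output. The following is a minimal sketch under those assumptions, not the plugin’s code. The helper `geocode_rows` and the stand-in `fake_lookup` are hypothetical, and the stand-in replaces a real API call so that the example runs offline.

```python
import time


def geocode_rows(rows, lookup, id_column="merchant_id",
                 lat_column="merchant_latitude",
                 lon_column="merchant_longitude",
                 throttle_seconds=0):
    """Call lookup(lat, lon) once per row, pausing between calls."""
    results = []
    for i, row in enumerate(rows):
        # Pause before every call except the first one.
        if i > 0 and throttle_seconds > 0:
            time.sleep(throttle_seconds)
        geo = lookup(row[lat_column], row[lon_column])
        # Keep the ID so each result can be joined back to its merchant.
        results.append({id_column: row[id_column], **geo})
    return results


def fake_lookup(lat, lon):
    """Hypothetical stand-in for a real geocoder call."""
    return {"block_group": "placeholder"}


rows = [
    {"merchant_id": "M1", "merchant_latitude": 39.74, "merchant_longitude": -75.54},
    {"merchant_id": "M2", "merchant_latitude": 39.16, "merchant_longitude": -75.52},
]
out = geocode_rows(rows, fake_lookup, throttle_seconds=0)
print(out)
```

With a larger dataset, raising `throttle_seconds` spaces out the calls at the cost of a longer run.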
Run the recipe and explore the output dataset merchant_census_tracts. You can see that the dataset contains geographical information about the census tract ID and the state, county, and block codes. For more details on the returned codes, see the Census Geocoder Documentation.
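The codes nest inside one another. As a sketch, assume a standard 15-digit census block GEOID, which concatenates the state (2 digits), county (3), tract (6), and block (4) codes, with the block group given by the first digit of the block code. The function name and the sample GEOID below are illustrative only, and the exact columns returned by the recipe may be laid out differently.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its component codes."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block GEOID")
    state, county, tract, block = geoid[:2], geoid[2:5], geoid[5:11], geoid[11:]
    return {
        "state": state,
        "county": county,
        "tract": tract,
        "block": block,
        "block_group": block[0],          # first digit of the block code
        "tract_geoid": state + county + tract,  # 11-digit tract identifier
    }


# Illustrative GEOID: state 10 is Delaware, county 003 is New Castle County.
parts = split_block_geoid("100030002001000")
print(parts)
```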
Now that we’ve determined the US census tract ID for each merchant’s location, our next task is to find the average household income of the census tracts for each merchant subsector. For this, we will perform a join of three datasets: merchant_census_tracts, merchants_by_state, and income_per_tract_usa_copy.
Join the Datasets¶
The income_per_tract_usa_copy dataset contains the average household income for all US census tracts. We will use a Join recipe to combine this dataset with the merchant_census_tracts and merchants_by_state datasets.
Select the merchant_census_tracts dataset from the Flow, and click the Join recipe from the Actions sidebar.
Select the additional input dataset merchants_by_state.
Name the output dataset merchants_with_tract_income.
In the Settings page of the Join recipe, go to the Join step. Here,
Click +Add Input and select income_per_tract_usa_copy as the “New input dataset” to be joined with the “Existing input dataset” merchant_census_tracts.
Click Add Dataset.
In the Selected columns step of the Join recipe,
Select only the “tract_id” column from the merchant_census_tracts dataset, and the “average_tract_income” column from the income_per_tract_usa_copy dataset. From the merchants_by_state dataset, select all the columns.
Finally, run the recipe and explore the output dataset merchants_with_tract_income.
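If it helps to see the logic outside the visual recipe, here is a minimal sketch of the same three-way join in plain Python. It assumes left joins keyed on merchant_id and tract_id, which this lesson does not state explicitly. The sample rows and the column name `merchant_subsector_description` are hypothetical.

```python
def left_join(left, right, key):
    """Left-join two lists of dicts on a shared key column."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        # Rows without a match keep only their own columns.
        match = index.get(row[key], {})
        joined.append({**match, **row})
    return joined


# Hypothetical sample rows for the three input datasets.
merchant_census_tracts = [
    {"merchant_id": "M1", "tract_id": "10003000200"},
    {"merchant_id": "M2", "tract_id": "10001040100"},
]
merchants_by_state = [
    {"merchant_id": "M1", "merchant_subsector_description": "Gas stations"},
    {"merchant_id": "M2", "merchant_subsector_description": "Restaurants"},
]
income_per_tract_usa_copy = [
    {"tract_id": "10003000200", "average_tract_income": 52000},
    {"tract_id": "10001040100", "average_tract_income": 61000},
]

step1 = left_join(merchant_census_tracts, merchants_by_state, "merchant_id")
merchants_with_tract_income = left_join(step1, income_per_tract_usa_copy, "tract_id")
for row in merchants_with_tract_income:
    print(row)
```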
You can also create a bar chart to display the census tracts’ average income for each merchant subsector.
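The aggregation behind such a chart is a group-by average. The sketch below shows the computation on a few hypothetical rows. The column name `merchant_subsector_description` is an assumption, and the numbers are made up for illustration.

```python
from collections import defaultdict


def mean_income_by_subsector(rows):
    """Average the tract income of the merchants in each subsector."""
    totals = defaultdict(lambda: [0.0, 0])  # subsector -> [sum, count]
    for row in rows:
        income = row.get("average_tract_income")
        if income is None:
            continue  # skip merchants without a matched tract income
        bucket = totals[row["merchant_subsector_description"]]
        bucket[0] += income
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


# Hypothetical rows shaped like the joined output dataset.
rows = [
    {"merchant_subsector_description": "Gas stations", "average_tract_income": 50000},
    {"merchant_subsector_description": "Gas stations", "average_tract_income": 70000},
    {"merchant_subsector_description": "Restaurants", "average_tract_income": 40000},
    {"merchant_subsector_description": "Restaurants", "average_tract_income": None},
]
averages = mean_income_by_subsector(rows)
print(averages)
```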
Great job! Now you have some hands-on experience working with a plugin recipe. This is just a first step in working with plugins. You can try using other components in the Census USA plugin, such as the dataset connectors. You can also install plugins that include other kinds of components and try using them in your workflow.