Hands-On Tutorial: Geo Join

Getting Started

In this tutorial, you’ll get hands-on practice with geospatial features by preparing and joining geospatial data to visualize credit card fraud activity on a map.

Prerequisites

You’ll need access to Dataiku version 11.1 or above (the free edition is enough). To get the free edition, visit the Get Started page, and click Install Now on the bottom left.

Geospatial Features Used in This Tutorial

When you’ve completed this tutorial, you’ll have a better understanding of the following features:

  • Create area around geopoint processor

  • Geo Join recipe

  • Density map

  • geoMakeValid formula

What We’re Building

We’ll start by building out an existing project by creating new geographic features. Then, we’ll further build out the pipeline by joining on these features. Then we’ll create a density map that can be published to a dashboard.

Graphic image describing the steps performed in the tutorial.

How We’ll Build The Project

Our goal is to build a density map of merchants at risk of credit card fraud. To do this, we’ll create geographic features using the Prepare recipe and then join datasets using the Geo Join recipe.

We’ll limit our analysis to a specific merchant subsector, “gas”; and only those transactions that are flagged as fraudulent.

By creating geographic features, we’ll be able to focus on visualizing where the fraudulent transactions are occurring. This will allow us to identify “merchants at risk”. In other words, we’ll be able to visualize the density of merchants at risk of being targeted by the user of the stolen credit card.

Create and Explore the Project

In this section, we’ll create and explore the project. To create our project, we’ll open a tutorial that already has the input datasets we want to transform.

To open the tutorial:

  • Sign in to your instance of Dataiku.

  • From the Dataiku homepage, click on +New Project > DSS Tutorials > General Topics > Geospatial Analysis: CC Fraud Data.

Note

You can also download the starter project from this website and import it as a zip file.

Dataiku opens the Summary tab of the project, also known as the project homepage.

Project homepage.

  • Click Go To Flow.

  • In the lower right corner, click Flow Actions, then select Build all.

  • Click Build.

  • Wait while Dataiku builds the Flow, then refresh your browser window.

Explore the Flow

You’ll notice that the project contains more than one flow zone. For this lesson, we’ll focus on the following datasets in the Geo Join flow zone:

  • gas_sector_unauth_transactions: This dataset has been prepared from transactions_joined and it contains the geopoints for all unauthorized transactions belonging to the gas merchant subsector. We’ll further prepare this dataset to create areas around the geopoints.

  • gas_sector_merchants: This dataset has been prepared from merchants_info and it contains the geopoints for all merchants belonging to the gas merchant subsector.

Geo Join flow zone containing geospatial datasets.

Create Geographic Areas Around the Unauthorized Transactions

To visualize the number of merchants at risk around each unauthorized transaction, we’ll first need to create a geographic area around each unauthorized transaction’s geopoint. For this, we’ll use the Create area around geopoint processor in the Prepare recipe to operate on the existing geopoints in the dataset.

Let’s define a 20-mile geographic area.

  • Go to the Flow.

  • In the Geo Join flow zone, click the gas_sector_unauth_transactions dataset once to select it.

  • Open the right-side panel to see the available Actions, then choose the Prepare recipe from Visual recipes.

Geo Join flow zone with visual recipe actions.

  • Keep the default output dataset name and click Create Recipe.

Now, we’ll add a step to the script.

  • In the Script, click + Add a New Step.

  • In the processors library, search for geo and select Create area around geopoint.

Visual prepare recipe with processors library.

  • Select transaction_geopoint as the Input column.

  • Name the Output column 20_miles.

  • Set the Shape to “Circle”.

  • Define a Radius of 20, and a Distance unit of “Miles”.

Step in a script in a visual prepare recipe.

  • Save your changes.

Set Column Storage Type

Dataiku has inferred the storage type of our new geographic area column as “string”. Let’s set the storage type of the column to “Geometry” to match the values in the column.

  • At the top of the 20_miles column, click the storage type, string, and select Geometry / Geography from the drop-down list.

  • Likewise, set the storage type of transactions_geopoint to Geo Point.

  • Save and Run the recipe.

  • Explore the output dataset.

Optional: Ensure Valid Geometry

We can apply a formula, “geoMakeValid”, to ensure the values in our geometry column are valid according to the Open Geospatial Consortium (OGC). To do this:

  • In the Script, click + Add a New Step.

  • In the processors library, search for and select Formula.

  • Name the Output column 20_miles.

  • Open the editor panel and type geoMakeValid(20_miles).

  • Apply the formula and Run the recipe.

Formula editor showing the geoMakeValid expression.

Now that we have defined geographic areas around the geopoints of unauthorized transactions, we can use the Geo Join recipe to map merchants to the areas.

Join Two Datasets Using Geospatial Data

In this section, we’ll create a dataset that includes all of the merchants located within each 20-mile geographic area. Later, we’ll create a chart to visualize the density of the merchants within these areas.

To create our dataset, we’ll use the Geo Join recipe.

  • Return to the Flow.

  • In the Geo Join flow zone, click the newly created gas_sector_unauth_transactions_prepared dataset to select it.

  • Hold down the Shift key then click the gas_sector_merchants dataset to select it, too.

  • Open the right-side panel and choose the Geo join recipe from Visual recipes.

Dataiku displays the New geojoin recipe window.

  • Name the output dataset merchants_contained_in_20_miles instead of the default name.

  • Click Create Recipe.

New geojoin recipe dialogue with input and output datasets.

Note

If Dataiku displays “No available join columns”, return to your input datasets and ensure that the storage types of the geospatial columns (transaction_geopoint, merchant_gas_geopoint, and geometry) are set to type Geo Point or Geometry / Geography.

Dataiku displays the Join step of the recipe. By default, Dataiku selects available geospatial columns to use for the join and specifies a join condition. Here, Dataiku has selected the transaction_geopoint column from our left dataset as the join key. However, we’ll perform this join using the polygon we created.

Let’s configure the Join conditions.

  • Click transaction_geopoint to open the join conditions.

  • Keep the default Match when all conditions are satisfied.

  • Click transaction_geopoint to open the configuration options.

  • Set “Column from” to 20_miles and the join condition to Contains.

Join conditions configuration window.

  • Click OK to close the Join conditions window.

Let’s configure the columns in the output dataset, removing any that we don’t need.

  • In the left panel, go to the Selected columns step.

  • In gas_sector_unauth_transactions, remove the transaction_geopoint column.

  • In gas_sector_merchants, remove the merchant_subsector_description column.

Selected columns step of a visual join recipe .

  • Save and Run the recipe.

Let’s explore the output dataset.

  • Open the output dataset to explore it.

  • In the 20_miles column, right-click one of the values and choose Preview.

This feature lets you preview a polygon or point on a map.

Output dataset computed from a join recipe.

Create a Density Map

In this section, we’ll build a type of chart known as a density map. The map takes as an input a column containing geo points and represents them in the form of a heat map. We can customize our map by selecting a color palette, configuring the intensity, and setting the radius of the circles for each point.

While a density map and scatter map may seem similar, we’ll soon discover how valuable a density map can be when plotting a large set of points on a map. Since the density map allows us to show where geopoints are concentrated on a map, we can quickly understand where, on a map, there are areas with a lot of points, or where there are areas with very few points.

Let’s create our map!

  • Return to the Flow.

  • Open the merchants_contained_in_20_miles dataset.

  • Click on the Charts tab.

  • Click the Chart symbol in the upper left to view the chart options and types.

  • Choose the Density map.

Charts tab of a dataset with Density map selected.

  • From the Columns panel on the left, drag and drop merchant_gas_geopoint to Geo.

  • Drag and drop purchase_amount to Details to set the heatmap’s intensity.

Let’s configure the radius of our circles, the color palette, and add a tooltip.

  • Open the Color panel, then set the Radius to approximately “0.25”.

  • Select OrRd from the Palette.

  • Open the Tooltip panel, then drag and drop the signature_provided column to set the tooltip.

You can zoom in on the map to see more details. You can also configure the sampling settings by clicking on Sample.

  • Edit the title of the chart to Density of Merchants at Risk.

Here’s what the chart looks like when zoomed in.

Visualization of a geospatial analysis.

Using our Density Map, we can now visualize the geographic relationship between merchant location and transactions flagged as fraudulent. More specifically, we can visualize the concentration of merchants at risk in each geographic area.