Garage dataset¶
The last dataset is found in garage_locations.csv.zip. In order for Dataiku to recognize the data, go into the Format/Preview tab and select One record per line from the Type dropdown. The resulting dataset, garage_locations, should have a single column.
We now need to parse the html code in each cell in order to extract the relevant information (the name and latitude/longitude of each garage). For that, we will use a combination of string transformations. The trick is to find unique string combinations to split the column as efficiently as possible and avoid mistakes.
Note
For longer recipes, as shown in the one here, organizing steps into Groups can be a useful tool, particularly if you need to duplicate a series of steps elsewhere.
From the garage_locations dataset, create a new Prepare recipe, keeping the default output name, and adding the following steps to the script:
Use the Filter rows/cells on value processor to remove rows where the column line has the value
line
.The values in the dataset are too long to view in their entirety in the table. Right-click on the first cell and select Show complete value. Looking at the complete value of the first cell, the garage name (“Carrosserie M. SERVICES AUTOMOBILES”) is wrapped within
h1
tags. This seems to be the case for each observation, so let’s try splitting the first column on theh1
tag.
Use the Split column processor on the line column, taking
h1
as the delimiter. This gets us part of the way there, but there is still a common prefix before the garage name.In the same Split step, try adding the whole prefix
h1 class=""""left"""">' + '
as the delimiter. Select Truncate in the Split processor, making sure that “Starting from” is set to End to to keep only the last column.
In a new step, use the Split columns processor on the line_0 column, with
' + '
as the delimiter. Select Truncate to keep only the first column from the beginning.Delete the column line_0.
Rename the column line_0_0 to
name
.
Select the last four steps. After right clicking, select Group and name it Parse garage name
.
Now parse the latitude and longitude coordinates using a similar process:
Split the column line on the delimiter
LatLng(
. Select Truncate, keeping1
column starting from the End.Split the new column line_0 on the delimiter
),icon
. Here Truncate, keeping1
column starting from the Beginning.Split the new column line_0_0 on the delimiter
,
.Rename the column line_0_0_0 to
latitude
and line_0_0_1 tolongitude
.Delete the columns line, line_0 and line_0_0, leaving only three remaining columns.
Organize the last five steps into a group called Parse lat/lon coordinates
.
Use the Create GeoPoint from lat/lon processor to create a new column named geopoint from the existing latitude and longitude columns.
Use the Reverse-geocode location processor on the geopoint column to extract the same geographic information as for rental_agencies_prepared: city (level 8), department, region and country
Delete all the four new columns with the “_enName” suffix.
Run the recipe, updating the schema to nine columns.
We now have three geo-coded datasets: accidents, rental agencies, and garages.