Tutorial | Working with shapefiles and US census data

Shapefiles are one of the most common formats for geographic data. Let’s get started using them in Dataiku!

Get started

Objectives

In this tutorial, you will:

  • Import shapefiles and US census data into Dataiku.

  • Join spatial and demographic data.

  • Visualize the results on a map.

Prerequisites

Create the project

Let’s get started!

  1. From the Dataiku Design homepage, click + New Project > Blank project.

  2. Name it Working with shapefiles.

  3. Click Create.

  4. From the project homepage, navigate to the Flow from the top navigation bar (or use the keyboard shortcut g + f).

The shapefile format

When working with spatial or geographic data, you will encounter many different file formats, such as .geojson, .gpkg, .csv, and .tiff. One of the most common, though, is the shapefile, a format originally developed by Esri.

Although often referred to as a single file, a shapefile is actually a collection of files, typically four (.shp, .shx, .dbf, and .prj), and sometimes more.

Together, these files can spatially describe vector features such as points, lines, polygons, and multipolygons.
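To make the "collection of files" idea concrete, here is a minimal sketch that inspects a zip archive and reports which of the usual component files are present for each base name. The archive built here is an in-memory stand-in with made-up member names (demo.*), and the helper names are hypothetical; only the Python standard library is used.

```python
import io
import zipfile
from pathlib import PurePosixPath

# The four files that usually make up a shapefile.
# The projection sidecar conventionally uses the .prj extension.
EXPECTED_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def shapefile_components(zip_file):
    """Group archive members by base name and collect the known extensions found."""
    components = {}
    with zipfile.ZipFile(zip_file) as archive:
        for member in archive.namelist():
            path = PurePosixPath(member)
            suffix = path.suffix.lower()
            if suffix in EXPECTED_EXTENSIONS:
                components.setdefault(path.stem, set()).add(suffix)
    return components


def missing_components(found):
    """Return the expected extensions that are absent for one base name."""
    return sorted(set(EXPECTED_EXTENSIONS) - set(found))


# Build a tiny in-memory archive (hypothetical member names) to show the idea.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("demo.shp", "demo.shx", "demo.dbf", "demo.cpg"):
        archive.writestr(name, b"")
buffer.seek(0)

found = shapefile_components(buffer)
print(found["demo"])
print(missing_components(found["demo"]))
```

The same function accepts a path to a zip file on disk, so it can serve as a quick sanity check on a downloaded archive before uploading it.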

Download shapefiles

The example data for this exercise comes from the US Census Bureau’s TIGER/Line Shapefiles. The file contains the official 2019 US county boundaries, along with some descriptive attributes.

  1. Download the zip file at this URL through your browser: https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip
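If you prefer to script the download rather than use a browser, a sketch along the following lines may work. The helper names (filename_from_url, download) are hypothetical, and the download itself needs network access, so it is defined here but not called.

```python
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# URL taken from the step above.
COUNTY_ZIP_URL = (
    "https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip"
)


def filename_from_url(url):
    """Return the last path segment of a URL, used as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download(url, target_dir="."):
    """Download url into target_dir and return the local path (needs network access)."""
    destination = Path(target_dir) / filename_from_url(url)
    urllib.request.urlretrieve(url, str(destination))
    return destination


print(filename_from_url(COUNTY_ZIP_URL))
# To fetch the archive, call: download(COUNTY_ZIP_URL)
```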

Upload shapefiles

Dataiku provides built-in support for the shapefile format. Let’s upload the shapefiles we’ve just downloaded.

  1. From the Flow, click + Dataset.

  2. Select Upload your files.

  3. Click Select Files.

  4. Select the zip file you downloaded above. Dataiku can extract the files in the zip archive for you.

  5. Click Create a Single Dataset.

Dataiku screenshot of a shapefile upload screen.

Now we have to adjust the file type before creating the dataset.

  1. On the Format subtab, select Shapefile as the file type.

  2. Name it us_counties.

  3. Click Create.

Dataiku screenshot of a shapefile upload screen changing the file type.

Explore shapefiles

After importing the dataset, the Explore tab shows a preview of the data in a tabular format.

  • The first column, the_geom, holds the dataset’s geometry. It is stored as a string, but Dataiku interprets its meaning as Geometry. Each row represents a county, stored as a multipolygon. Right-clicking on a cell value opens a menu that includes Preview, which displays the geometry on a map.

  • The second column, shp_srs, specifies the dataset’s Spatial Reference System (SRS), also known as a Coordinate Reference System (CRS). A spatial reference system defines how the spatial elements of the data relate to the Earth’s surface. In this case, the dataset uses one of the most common geographic SRSs, EPSG:4269 (NAD83).

Dataiku screenshot of a preview of a geometry column.
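To get a feel for what a geometry string contains, the sketch below reads the geometry type and a bounding box from a well-known text (WKT) value. It assumes the the_geom cells are WKT strings, and SAMPLE_WKT is a made-up, heavily simplified value rather than a real county. The regular expression handles plain decimal coordinates only (no exponent notation).

```python
import re

# Hypothetical, heavily simplified cell value; real counties have far more vertices.
SAMPLE_WKT = (
    "MULTIPOLYGON (((-75.0 39.0, -74.0 39.0, -74.0 40.0, -75.0 39.0)), "
    "((-73.5 40.5, -73.0 40.5, -73.0 41.0, -73.5 40.5)))"
)

# Matches one "x y" coordinate pair written as plain decimals.
COORD_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def geometry_type(wkt):
    """Return the leading WKT keyword, for example MULTIPOLYGON."""
    return wkt.strip().split("(", 1)[0].strip().upper()


def bounding_box(wkt):
    """Return (min_x, min_y, max_x, max_y) over every vertex in the string."""
    pairs = [(float(x), float(y)) for x, y in COORD_PAIR.findall(wkt)]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), min(ys), max(xs), max(ys))


print(geometry_type(SAMPLE_WKT))
print(bounding_box(SAMPLE_WKT))
```

For real work, a dedicated geometry library is a better choice than a regular expression; this is only meant to show that the column is readable text.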

Filter shapefiles in a visual recipe

Shapefiles can be manipulated in Dataiku like any other dataset. Let’s use them in a visual recipe.

  1. From the us_counties dataset, open the Actions tab, and select Sample/Filter recipe.

  2. Name the output dataset nj_counties.

  3. Click Create Recipe.

  4. Turn the Filter tile On.

  5. Filter the dataset to keep only rows where STATEFP equals 34 (the FIPS code for the state of New Jersey).

  6. Click Run.

Dataiku screenshot of the Filter recipe settings.

After running the recipe, note that the output dataset now has 21 rows, one for each county in New Jersey.
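For readers who think in code, the filter above is equivalent to keeping the rows whose STATEFP value matches a given code. The sketch below uses three sample rows as plain dictionaries, and filter_by_state is a hypothetical helper, not a Dataiku function. STATEFP is compared as a zero-padded string, on the assumption that FIPS codes are stored as text.

```python
# Sample rows for illustration; the real dataset has one row per US county.
rows = [
    {"STATEFP": "34", "GEOID": "34001", "NAME": "Atlantic"},
    {"STATEFP": "36", "GEOID": "36061", "NAME": "New York"},
    {"STATEFP": "34", "GEOID": "34003", "NAME": "Bergen"},
]


def filter_by_state(rows, state_fips):
    """Keep rows whose STATEFP equals the two-digit state FIPS code."""
    return [row for row in rows if str(row["STATEFP"]).zfill(2) == state_fips]


nj_rows = filter_by_state(rows, "34")
print([row["NAME"] for row in nj_rows])
```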

Note

An alternative way to achieve this result would be to use the Filter rows/cells on value processor in a Prepare recipe, especially if there was more data preparation to be done.

Download US Census data

We now have a dataset where each row holds the shape of a county in New Jersey. So far, though, no demographic data is attached to the counties.

The Census USA plugin has a number of features relating to census data, including an easy way to download data from the US Census Bureau.

  1. From the Flow, select + New Dataset > Census USA > US Census dataset.

  2. For the State, provide nj. Ensure State format is state_2letters.

  3. Select ACS5Y2017 as Census content.

  4. Select COUNTY as Census level.

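  5. The Census field is a comma-separated string of variables (without spaces). Add B00001_001E,B19013_001E to retrieve a population count and median household income, respectively. (Strictly speaking, B00001_001E is the ACS unweighted sample count of the population; the total population estimate is B01003_001E.)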

  6. Click Test & Get Schema.

  7. Name the output dataset nj_demo.

  8. Click Create.

Dataiku screenshot of the dialog for downloading a US census dataset.
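Because the Census field must not contain spaces, it can help to build the string programmatically when you request many variables. The sketch below is one way to do that. The validation pattern is an assumption about what ACS variable codes look like (a letter, five digits, an optional letter suffix, an underscore, three digits, and a trailing letter code); it is not taken from the plugin.

```python
import re

# Assumed shape of an ACS variable code, for example B19013_001E.
VARIABLE_PATTERN = re.compile(r"^[A-Z]\d{5}[A-Z]*_\d{3}[A-Z]+$")


def build_census_field(codes):
    """Join variable codes into one comma-separated string without spaces."""
    cleaned = [code.strip().upper() for code in codes]
    invalid = [code for code in cleaned if not VARIABLE_PATTERN.match(code)]
    if invalid:
        raise ValueError(f"Unexpected variable codes: {invalid}")
    return ",".join(cleaned)


census_field = build_census_field(["B00001_001E", " b19013_001e "])
print(census_field)
```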

Note

There are a number of ways to find out the code for a particular census variable. One way is by building the US Census metadata dataset in the Census USA plugin.

Enrich shapefiles with census data

We now have the population and an estimate of median household income for each county in the state. Before we can join our spatial and demographic data, we’ll do a few brief preparation steps on the demographic data.

  1. From the nj_demo dataset, select a Prepare recipe from the Actions sidebar.

  2. Click Create Recipe.

  3. Open the column header dropdown for B00001_001E and B19013_001E, and rename them population and median_household_income, respectively.

  4. Select the storage type dropdown for the column GEOID_DKU, and change it to a string so we can join it with the string GEOID column of nj_counties.

  5. Click Run.

Dataiku screenshot of a Prepare recipe.
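The two preparation steps can be pictured as a small row transformation: rename two keys and convert the identifier to text. The sketch below uses a single sample row with made-up numbers, and prepare_row is a hypothetical helper rather than anything Dataiku generates.

```python
# Column renames from the Prepare recipe steps above.
RENAMES = {
    "B00001_001E": "population",
    "B19013_001E": "median_household_income",
}


def prepare_row(row):
    """Rename the census columns and store GEOID_DKU as a string."""
    prepared = {RENAMES.get(key, key): value for key, value in row.items()}
    prepared["GEOID_DKU"] = str(prepared["GEOID_DKU"])
    return prepared


# Made-up values, for illustration only.
sample_row = {"GEOID_DKU": 34001, "B00001_001E": 100, "B19013_001E": 50000}
print(prepare_row(sample_row))
```

Storing the identifier as text matters in general, not only for the join: county identifiers in states whose FIPS code starts with zero lose that leading zero when treated as numbers.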

We can now join the datasets of shapefiles and demographic information.

  1. From the Flow, select the datasets nj_counties and nj_demo_prepared.

  2. In the Actions sidebar, select Join from the menu of visual recipes.

  3. Click Create Recipe.

  4. Click Add a Condition.

  5. For the join condition, select GEOID as the column from nj_counties and GEOID_DKU as the column from nj_demo_prepared.

  6. Click OK.

  7. Click Run, and open the output dataset when the recipe finishes running.

Dataiku screenshot of the join condition of a Join recipe.
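Conceptually, the join looks up each county’s GEOID in the demographic dataset and attaches the matching columns. The sketch below shows a left join over plain dictionaries and also illustrates why the key types must agree. The data values are made up, left_join is a hypothetical helper, and the choice of a left join is an assumption about the recipe’s join type.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Attach the columns of the matching right row to every left row."""
    lookup = {row[right_key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[left_key], {})
        extra = {key: value for key, value in match.items() if key != right_key}
        joined.append({**row, **extra})
    return joined


counties = [
    {"GEOID": "34001", "NAME": "Atlantic"},
    {"GEOID": "34003", "NAME": "Bergen"},
]
# Made-up numbers; the second key is left as an integer on purpose.
demographics = [
    {"GEOID_DKU": "34001", "population": 100},
    {"GEOID_DKU": 34003, "population": 200},
]

joined = left_join(counties, demographics, "GEOID", "GEOID_DKU")
print(joined)
```

The second county finds no match because the string "34003" is not equal to the integer 34003, which is the situation that a string storage type on both key columns avoids.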

Map shapefiles

Now we can visualize the distribution of our demographic variables on a map.

  1. Navigate to the Charts tab of the nj_counties_joined dataset.

  2. From the chart picker, select Administrative map (filled).

  3. Drag the the_geom column to the Geo field. Click the dropdown to set the level of detail to Department/County.

  4. Drag the population or median_household_income column to the color droplet field.

  5. Click the color droplet to adjust the color palette to your preference.

Dataiku screenshot of a filled administrative map.
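A filled map assigns each county a color based on its value. As a rough illustration of the idea, and not a description of how Dataiku computes its palettes, the sketch below maps values linearly onto a two-color gradient. The function names and the default colors are arbitrary choices.

```python
def lerp_color(start, end, t):
    """Interpolate between two RGB tuples; t=0 gives start, t=1 gives end."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def color_scale(values, start=(255, 255, 255), end=(0, 0, 255)):
    """Map each value to a hex color on a linear scale from min to max."""
    low, high = min(values), max(values)
    span = high - low
    colors = []
    for value in values:
        t = (value - low) / span if span else 0.0
        colors.append("#{:02x}{:02x}{:02x}".format(*lerp_color(start, end, t)))
    return colors


# Made-up values standing in for a demographic column.
colors = color_scale([10, 12, 20])
print(colors)
```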

Tip

For more resources on mapping, see Tutorial | No-code maps.

What’s next?

Congratulations! You’ve seen how to import, manipulate, and visualize shapefiles and US Census data in Dataiku.

Note

For more on shapefiles, see the reference documentation.

For more practice working with geographic data, see Tutorial | Geographic processors!

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Geospatial Analytics. When ready, challenge yourself to earn a certification!