Prepare the Demographics Dataset¶
With the bike station data in place, let’s prepare data on the number of people living in DC at a granular level.
Create a new Uploaded Files dataset from the Demographics file. Accept the default name block_group_demog
.
This dataset contains the number of people in the US Census ACS5Y 2013 at the block group level. For reference, note that the US Census generally follows a hierarchy of States > Counties > Census Tracts > Block Groups > Census Blocks.
The fully qualified block group identification is contained within the geoid column. We can split this column to obtain the block group.
Note
We could alternatively build up the block group from the state, county, tract, and blkgrp columns. As a stretch goal, see if you can figure out how to do that. Then consider which method you prefer. There are many means to an end in data science, and you will need to assess what works best in each situation.
Create a new Prepare recipe from this dataset with the default output block_group_demog_prepared
and the following steps in its Script:
Use the Split column processor on the geoid column with
US
as the delimiter. Choose to Truncate, keeping only one of the output columns, starting from the End.The new geoid_0 column should keep everything after the “US”.
Use the Rename columns processor on a few columns:
Old name
New name
geoid_0
block_group
name
block_name
BOOOO1_001E
nbPeople
Ensure the block_group column storage type is set to “string” and NOT “bigint” by editing the schema from the menu directly beneath the column header.
Use the Round numbers processor on the nbPeople column to round values to integers (0 significant digits, 0 decimal places).
Remove five more columns that will no longer be required: state, county, tract, blkgrp, and geoid
Run the recipe, updating the schema to four columns.
Now that we have the number of people in every census block group in DC, the demographic data is ready!