Hands-On: Managed Folders

Sometimes datasets aren’t enough! Although Dataiku DSS supports a large number of data formats, these do not cover every data science project. Managed folders, often in conjunction with code recipes, are a handy tool for working with unsupported file types, such as images, PDFs, and much more.

In fact, folders can be a useful tool whenever you want to manipulate files (even supported file types) programmatically.

In this case, we have data tables printed in a UN report on urban population growth that we want to extract into Dataiku DSS to conduct our own analysis.

Let’s Get Started

In this hands-on lesson, you will learn how to:

  • download and/or upload files that Dataiku DSS cannot natively read into a managed folder;

  • create a DSS dataset from a folder of files;

  • work with a folder as both input and output to a Python recipe; and

  • publish the contents of a managed folder on a dashboard.

Prerequisites

To complete this tutorial, you will need:

  • access to a Dataiku DSS instance;

  • a Python 3.6+ code environment with the packages tabula-py and matplotlib; and

  • a locally downloaded copy of this 2016 UN report on world cities.

Workflow Overview

When you have completed the tutorial, you will have built the Flow pictured below:

../../_images/folders-final-flow.png

Create a Project

We’ll be building this project from start to finish, so you just need a new, empty project.

  • From the Dataiku DSS homepage, click on +New Project > Blank project.

  • Give it a name, such as Managed Folders.

  • Click Create.

Upload the PDF

The data for this project is found in the UN’s 2016 report on World Cities. Because the report is a PDF, we cannot import it directly as a DSS dataset. Instead, let’s import the report into a managed folder.

  • From the Flow, choose +Dataset > Folder.

  • Name the output folder un_pdf_download.

  • After choosing a storage location, click Create Folder.

  • Click Add a File or drag and drop the PDF into the folder.

  • When it has finished uploading, click on the file name to preview it.

  • Scroll through the PDF to find the published data tables starting from page 12.

../../_images/report-in-folder.png

Note

We also could have used a Download recipe to directly download the data into the folder. For use cases where the data sources regularly update, this would be a more sensible approach.

Parse the PDF with a Code Recipe

Dataiku DSS doesn’t have its own way of extracting data tables from a PDF, but the tabula Python library does. A small amount of Python code can get these tables into a DSS dataset.

The code environment

Let’s first designate this project’s code environment, one that has the packages tabula-py and matplotlib (which will be needed later).

  • From the More options menu in the top navigation bar, select Settings > Code env selection.

  • Change the default Python code env by changing the Mode to “Select an environment” and the Environment to the chosen code env.

  • Click Save.

The Python recipe

Now we can return to the Flow.

  • With the un_pdf_download folder selected, add a Python recipe.

  • Under Outputs, click +Add, but instead of adding a new dataset, add a New Folder.

  • Name the folder un_csv.

  • Create the folder and then create the recipe.

  • Delete the sample code.

Normally we’d prototype a recipe in a notebook, but in this case, we already have working code ready. The full code is below, but take a moment to understand these key steps.

  • We use the Dataiku API to stream the PDF from the folder.

  • The tabula-py library does the actual PDF parsing.

  • The last step is to write the output to another folder.

After pasting the code below into the recipe editor, run the recipe, and view the output folder.

Warning

If you gave the downloaded file or the managed folders names different from those described here, be sure to update those names in the code.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
from tabula.io import read_pdf

# Read recipe inputs
un_pdf_download = dataiku.Folder("un_pdf_download")

# read in the pdf and use tabula-py to extract tabular data
with un_pdf_download.get_download_stream("the_worlds_cities_in_2016_data_booklet.pdf") as stream:
        tables = read_pdf(stream, pages = "12-26", multiple_tables = True)

# parse the pdf tables
for table in tables:
        table.columns = table.iloc[0].values
        table.dropna(inplace=True)
        table.drop(table.index[0], inplace=True)

# remove corrupted data: drop the third extracted table (index 2)
tables.pop(2)

# Write recipe outputs
csvs = dataiku.Folder("un_csv")

# write dataframes to csvs
path = '/dataset_{}'

for index, table in enumerate(tables):
        csvs.upload_stream(path.format(index), table.to_csv().encode("utf-8"))

Note

Whenever possible, it is advisable to use the get_download_stream() method to read a file from a folder, rather than get_path(). While get_path() will only work for a local folder, get_download_stream() works regardless of where the contents are stored. This is addressed in the product documentation.
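
As a minimal sketch of this streaming pattern, the helper below reads a whole file from any folder-like object that exposes get_download_stream(). The InMemoryFolder class and the read_folder_file() helper are hypothetical names introduced only so that the snippet runs outside DSS, and the file contents are made up. Inside a recipe, you would pass dataiku.Folder("un_csv") instead of the stand-in.

```python
import io


class InMemoryFolder:
    """Hypothetical stand-in for dataiku.Folder, used only to run this sketch outside DSS."""

    def __init__(self):
        self._files = {}

    def upload_stream(self, path, data):
        # Accept either raw bytes or a file-like object open for reading.
        self._files[path] = data if isinstance(data, bytes) else data.read()

    def get_download_stream(self, path):
        # io.BytesIO can be used in a "with" block, like the stream in the recipe above.
        return io.BytesIO(self._files[path])


def read_folder_file(folder, path):
    """Return the full contents of one file in the folder as bytes."""
    with folder.get_download_stream(path) as stream:
        return stream.read()


# Made-up file contents, not values from the UN report.
folder = InMemoryFolder()
folder.upload_stream("/dataset_0", b"City,value\nExampleville,1\n")
content = read_folder_file(folder, "/dataset_0")
print(content.decode("utf-8").splitlines()[0])
```

Because the helper never asks for a filesystem path, the same function should work whether the folder is stored locally or on remote storage.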

Create a “Files in Folder” Dataset

The un_csv folder now holds 14 files, one for each page of the PDF containing a table. From this folder of CSV files, let’s create one DSS dataset.

  • From the un_csv folder, click Create Dataset in the Actions sidebar.

  • Click List Files to confirm the files in the folder (dataset_0 to dataset_13).

  • Click Test and then Preview to observe the format of the dataset being created.

  • On the Format/Preview tab, click “Parse next line as column headers”.

  • Let’s fix some of the column names here:

    • Add the prefix pop_ to the three population columns 2000, 2016, and 2030.

    • Add the prefix avg_roc_ to the columns 2000-2016 and 2016-2030 for the average annual rate of change in population.

  • Name the output un_world_cities and click Create.

../../_images/files-in-folder-dataset.png

We now have a DSS dataset that we can manipulate further with visual and/or code recipes.
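
Conceptually, the dataset we just created stacks the rows of every file in the folder under one schema, with the renamed columns. The sketch below imitates that behavior with Python’s standard csv module. It is only an illustration, not how DSS implements Files in Folder datasets; stack_csv_files() and RENAMES are hypothetical names, and the sample rows are made up.

```python
import csv
import io

# Renames described in the steps above: prefix the year columns.
RENAMES = {
    "2000": "pop_2000",
    "2016": "pop_2016",
    "2030": "pop_2030",
    "2000-2016": "avg_roc_2000-2016",
    "2016-2030": "avg_roc_2016-2030",
}


def stack_csv_files(files):
    """Stack rows from several CSV texts that share a header line.

    `files` maps a file name to its CSV text. The first line of each file
    is parsed as column headers, and RENAMES is applied to those headers.
    """
    rows = []
    for name in sorted(files):
        reader = csv.DictReader(io.StringIO(files[name]))
        for row in reader:
            rows.append({RENAMES.get(key, key): value for key, value in row.items()})
    return rows


# Made-up miniature files, not values from the UN report.
sample = {
    "dataset_0": "City,2000,2016\nA,1,2\n",
    "dataset_1": "City,2000,2016\nB,3,4\n",
}
stacked = stack_csv_files(sample)
print(len(stacked), sorted(stacked[0]))
```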

Visual Data Preparation

Although we could have handled this in the previous Python recipe, let’s take advantage of the Prepare recipe for some quick data cleaning.

  • From the un_world_cities dataset, create a Prepare recipe with the default output name.

  • Remove the unnecessary col_0.

  • Add another step to convert number formats from French to Raw in the three population columns.

  • As flagged by the red portion of the data quality bar, there’s one city with non-standard representations of missing data.

    • For columns pop_2000 and avg_roc_2000-2016, clear invalid cells for the meanings integer and decimal, respectively.

  • Run the recipe, updating the schema in the process.

../../_images/folders-prepare-recipe.png
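
For comparison, here is a rough sketch of how the same cleaning might look in plain Python, had we chosen to do it in code. It assumes that the French format uses spaces (including no-break spaces) as thousands separators and a comma as the decimal mark. The helper names are hypothetical, the sample row is made up, and unlike the Prepare recipe, clear_invalid() also converts valid cells to numbers.

```python
def french_to_raw(text):
    """Convert a French-formatted number string such as '1 234,5' to '1234.5'."""
    # Remove regular, no-break, and narrow no-break spaces used as thousands separators.
    for space in (" ", "\u00a0", "\u202f"):
        text = text.replace(space, "")
    return text.replace(",", ".")


def clear_invalid(text, cast):
    """Return cast(text), or None when the cell does not hold a valid value."""
    try:
        return cast(text)
    except (TypeError, ValueError):
        return None


def clean_row(row):
    """Apply the three cleaning steps to one row (a dict of strings)."""
    # Step 1: remove the unnecessary col_0.
    cleaned = {key: value for key, value in row.items() if key != "col_0"}
    # Step 2: convert the three population columns from French to raw format.
    for column in ("pop_2000", "pop_2016", "pop_2030"):
        cleaned[column] = french_to_raw(cleaned[column])
    # Step 3: clear cells that are not valid integers or decimals.
    cleaned["pop_2000"] = clear_invalid(cleaned["pop_2000"], int)
    cleaned["avg_roc_2000-2016"] = clear_invalid(cleaned["avg_roc_2000-2016"], float)
    return cleaned


# Made-up row for illustration; '..' stands in for a non-standard missing value.
row = {"col_0": "0", "City": "Example", "pop_2000": "..", "pop_2016": "1 234",
       "pop_2030": "2 345", "avg_roc_2000-2016": "..", "avg_roc_2016-2030": "1.5"}
print(clean_row(row))
```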

Output Custom Plots to a Folder

There may be times when the native chart builder cannot create the visualization we need. Instead, we might want to create a more customized plot in a library like matplotlib or ggplot2. We can use a code notebook to create a custom plot, which can then be saved as a static insight and published to a dashboard.

However, for situations where we want to generate a large number of files as output (such as one chart per country), it may be preferable to use a managed folder as output.

For the ten largest countries, let’s compare the average growth rate from 2000 to 2016 with the projected growth rate from 2016 to 2030 in a pyramid plot created with matplotlib.

  • From the un_world_cities_prepared dataset, create a Python recipe.

  • Add a New Folder as an output named population_growth_comparisons.

  • Create the recipe and delete the sample code.

The full code recipe is below, but take note of the following key points.

  • We use the Dataiku API to interact with the input dataset as a Pandas dataframe.

  • We use matplotlib to create the charts.

  • For each plot, we use the upload_stream() method to write the image to the output folder because it works for both local and non-local folders.

Once you have pasted this code into your recipe editor, run the recipe, and view the output folder.

import dataiku
import pandas as pd
import matplotlib.pyplot as plt
import os
import io

# Read recipe inputs
un_data_prepared = dataiku.Dataset("un_world_cities_prepared")
df = un_data_prepared.get_dataframe()

# top 10 most populous countries
TOP_10 = ['India', 'China', 'Brazil', 'Japan', 'Pakistan', 'Mexico', 'Nigeria',
                'United States of America', 'Indonesia', 'Turkey']

# generate plot for each country and save to folder
for country in TOP_10:

        df_filtered = df[df['Country or area'] == country]

        y = range(0, len(df_filtered))
        x_1 = df_filtered["avg_roc_2000-2016"]
        x_2 = df_filtered["avg_roc_2016-2030"]

        fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(12, 9))

        fig.patch.set_facecolor('xkcd:light grey')
        # include the country name so that each chart has a distinct title
        plt.figtext(.5, .9, country + " Pop. ROC Comparison", fontsize=15, ha='center')

        axes[0].barh(y, x_1, align='center', color='royalblue')
        axes[0].set(title='2000-2016')
        axes[1].barh(y, x_2, align='center', color='red')
        axes[1].set(title='2016-2030')

        axes[1].grid()
        axes[0].set(yticks=y, yticklabels=df_filtered['City'])
        axes[0].invert_xaxis()
        axes[0].grid()

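        # Write recipe outputs: render the figure to an in-memory PNG and upload it
        pyramid_plot = dataiku.Folder("population_growth_comparisons")

        bs = io.BytesIO()
        plt.savefig(bs, format="png")
        pyramid_plot.upload_stream(country + "_fig.png", bs.getvalue())

        # Close the figure so that open figures do not accumulate across iterations
        plt.close(fig)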

Publish a Managed Folder Insight

In the population_growth_comparisons folder, we can browse the files to view the chart for each country.

In some cases, we might automate the export of the contents of a managed folder to some other location. For this use case though, let’s publish the whole folder as an insight on a dashboard.
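
As an aside, such an automated export could be sketched as a loop over the folder’s paths that streams each file to a target directory. The sketch below assumes the folder object offers list_paths_in_partition() and get_download_stream(), as dataiku.Folder does. InMemoryFolder and export_folder() are hypothetical names used only so that the snippet runs outside DSS, and the file contents are placeholders.

```python
import io
import os
import shutil
import tempfile


class InMemoryFolder:
    """Hypothetical stand-in for dataiku.Folder so the sketch runs outside DSS."""

    def __init__(self, files):
        self._files = dict(files)

    def list_paths_in_partition(self):
        return sorted(self._files)

    def get_download_stream(self, path):
        return io.BytesIO(self._files[path])


def export_folder(folder, target_dir):
    """Copy every file in the folder to target_dir and return the written paths."""
    written = []
    for path in folder.list_paths_in_partition():
        # Folder paths start with '/', so strip it before joining.
        local_path = os.path.join(target_dir, path.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(path) as stream, open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
        written.append(local_path)
    return written


# Placeholder bytes stand in for a real PNG chart.
charts = InMemoryFolder({"/India_fig.png": b"fake-png-bytes"})
target = tempfile.mkdtemp()
exported = export_folder(charts, target)
print(exported)
```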

  • From the population_growth_comparisons folder, click Publish in the Actions sidebar.

  • Then click Create to add it to a dashboard as a “whole folder” insight. It may take a minute or two.

  • Adjust the size of the folder preview on the Edit tab as needed. Click Save.

  • Navigate to the View tab to interact with the charts in the folder.

../../_images/folder-insight.png

Learn More

That’s it! You have used managed folders as both the input and the output of code recipes to handle types of data that Dataiku DSS cannot natively read or write.

For more information on managed folders, please visit the product documentation.