Tutorial | Managed folders#

Although Dataiku supports a large number of data formats, datasets alone are not always sufficient for every data science project. Managed folders, often in conjunction with code recipes, can be a handy tool for programmatically working with non-supported file types (like images, PDFs, and more) and supported file types, too.

In this tutorial, we will work with data tables printed in a UN report on urban population growth that we want to extract into Dataiku.

Get started#

Objectives#

In this tutorial, you will learn how to:

  • Parse and extract data from a PDF in a folder using Python.

  • Create a Dataiku dataset from a folder of files.

  • Work with a folder as both input and output to a Python recipe.

  • Publish the contents of a managed folder on a dashboard.

Prerequisites#

To complete this tutorial, you will need:

  • Dataiku version 12 or above.

  • A Python 3.8 code environment with the following packages:

    matplotlib
    tabula-py
    

Create the project#

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Developer > Managed Folders.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Explore the folder#

The data for this project is a PDF of the UN's 2016 report on the world's cities. You can explore it in the managed folder.

  1. From the Flow, open the folder un_pdf_download.

  2. Open the PDF file in the folder.

  3. Scroll through the PDF to find the published data tables starting from page 12.

Dataiku screenshot of the uploaded PDF in a managed folder.

Parse the PDF with a code recipe#

Dataiku doesn’t have its own way of extracting data tables from a PDF, but the tabula Python library does. A small amount of Python code can get these tables into a Dataiku dataset.

Change the code environment#

Let’s first designate this project’s code environment, one that has the packages tabula-py and matplotlib (which will be needed later).

  1. From the More options menu in the top navigation bar, select Settings > Code env selection.

  2. Change the Mode to Select an environment.

  3. Change the Environment to a compatible code env. See instructions about creating a compatible code environment for tips.

  4. Click Save.

Create a Python recipe#

Now we can return to the Flow.

  1. Select the un_pdf_download folder.

  2. On the Action menu, add a Python recipe.

  3. Under Outputs, click +Add, but instead of adding a new dataset, add a New Folder.

  4. Name the folder un_csv.

  5. Click on Create Folder.

  6. Click on Create Recipe.

Dataiku screenshot of adding a new folder.

Normally we would prototype a recipe in a notebook, but in this case, we have working code ready. The full code is below, but take a moment to understand these key points.

  • We use the Dataiku API to stream the PDF from the folder.

  • For cloud users, whose folder contents are not on the local filesystem, tempfile lets us use the file normally with libraries that expect a local path.

  • The tabula-py library does the actual PDF parsing.

  • The last step is to write the output to another folder.

  1. Delete the code sample.

  2. Copy and paste the code below into the recipe editor.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd
    from tabula.io import read_pdf
    import tempfile
    
    # Read recipe inputs
    un_pdf_download = dataiku.Folder("un_pdf_download")
    
    # Read in the PDF and use tabula-py to extract tabular data
    with un_pdf_download.get_download_stream("the_worlds_cities_in_2016_data_booklet.pdf") as f:
        with tempfile.NamedTemporaryFile(suffix=".pdf") as temp_pdf:
            temp_pdf.write(f.read())
            temp_pdf.seek(0)
            tables = read_pdf(temp_pdf.name, pages="12-26", multiple_tables=True)
    
    # Parse the PDF tables: promote the first row to column headers, then drop it
    for table in tables:
        table.columns = table.iloc[0].values
        table.dropna(inplace=True)
        table.drop(table.index[0], inplace=True)
    
    # Remove corrupted data
    tables.pop(2)
    
    # Write recipe outputs
    csvs = dataiku.Folder("un_csv")
    
    # Write dataframes to CSVs
    path = '/dataset_{}'
    
    for index, table in enumerate(tables):
        csvs.upload_stream(path.format(index), table.to_csv().encode("utf-8"))
    

    Tip

    If you get an error extracting the tables into the un_csv folder, you can also download the un_world_cities dataset, and continue from the Output custom plots to a folder section below.

  3. Run the recipe.

  4. View the output folder.

Warning

If you gave the managed folders different names than those described here, be sure to update them.

Note

Whenever possible, it is advisable to use the get_download_stream() method to read a file from a folder, rather than get_path(). While get_path() will only work for a local folder, get_download_stream() works regardless of where the contents are stored. See the reference documentation to learn more.
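The stream-to-temporary-file pattern used in the recipe above can be factored into a small helper. Below is a sketch using only the standard library; the helper name materialize_stream is hypothetical, not part of the Dataiku API:

```python
import tempfile

def materialize_stream(stream, suffix=""):
    """Copy a read-only stream to a named temporary file and return its path,
    so libraries that require a filesystem path (like tabula-py) can open
    content that lives in remote storage."""
    # delete=False keeps the file on disk after close; remove it when done
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(stream.read())
    finally:
        tmp.close()
    return tmp.name
```

With a helper like this, the recipe body reduces to opening the download stream, materializing it, and calling read_pdf() on the returned path. Since delete=False leaves the file on disk, remember to remove it afterwards.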

Create a Files in Folder dataset#

The un_csv folder now holds 14 files, one for each page of the PDF containing a table. From this folder of CSV files, let’s create one Dataiku dataset.
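Conceptually, a Files in Folder dataset stacks files that share a schema into a single dataset. A plain-Python illustration of that idea (not how Dataiku implements it internally), stacking CSV texts that share a common header:

```python
import csv
import io

def stack_csvs(csv_texts):
    """Concatenate CSV texts that share the same header row into
    (header, rows), keeping only one copy of the header."""
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)  # each file repeats the header
        if header is None:
            header = file_header
        rows.extend(reader)
    return header, rows
```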

  1. With the un_csv folder selected, click Create dataset in the right panel.

  2. Click List Files to confirm the files in the folder (dataset_0 to dataset_13).

  3. Click Test & Get Schema, and then click Preview.

  4. On the Format/Preview tab, check the box Parse next line as column headers.

  5. Fix some of the column names:

    • Add the prefix pop_ to the three population columns 2000, 2016, and 2030.

    • Add the prefix avg_roc_ to the columns 2000-2016 and 2016-2030 for the average annual rate of change in population.

  6. Name the output un_world_cities, and click Create.

Dataiku screenshot of creating a files in folder dataset.

We now have a Dataiku dataset that we can manipulate further with visual and/or code recipes.
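The header renaming done above in the UI could equally be scripted. Here is a sketch of the same prefixing logic in plain Python, using the column names from the steps above:

```python
def prefix_columns(columns):
    """Apply the pop_ and avg_roc_ prefixes used in the tutorial."""
    pop_years = {"2000", "2016", "2030"}       # population columns
    roc_ranges = {"2000-2016", "2016-2030"}    # avg. annual rate of change
    renamed = []
    for col in columns:
        if col in pop_years:
            renamed.append("pop_" + col)
        elif col in roc_ranges:
            renamed.append("avg_roc_" + col)
        else:
            renamed.append(col)
    return renamed
```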

Visual data preparation#

Although we could have handled this in the previous Python recipe, let’s take advantage of the Prepare recipe for some quick data cleaning.

  1. From the un_world_cities dataset, create a Prepare recipe with the default output name.

  2. Remove the unnecessary col_0.

  3. Add another step to convert number formats in the three population columns.

    • The name of the step is Convert number formats. The input format should be French, and the output format should be Raw.

  4. As flagged by the red portion of the data quality bar, there is one city with non-standard representations of missing data.

    • For columns pop_2000 and avg_roc_2000-2016, clear invalid cells for the meanings integer and decimal, respectively.

  5. Run the recipe, updating the schema if necessary.

Dataiku screenshot of the Prepare recipe needed to clean the data extracted from the PDF.
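The Convert number formats step handles locale-specific formatting such as French-style numbers, where spaces group thousands and a comma marks the decimal. An illustrative stdlib equivalent (not the processor's actual implementation):

```python
def parse_french_number(text):
    """Convert a French-formatted number string (e.g. '1 234,5') to a float."""
    # Strip regular and non-breaking thousands separators, then swap
    # the decimal comma for a point
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return float(cleaned)
```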

Output custom plots to a folder#

There may be times when the native chart builder cannot create the visualization we need. Instead, we might want to create a more customized plot in a library like matplotlib or ggplot2. We can use a code notebook to create a custom plot, which can then be saved as a static insight and published to a dashboard.

However, for situations where we want to generate a large number of files as output (such as one chart per country), it may be preferable to use a managed folder as output.

For the ten largest countries, let's create pyramid plots in matplotlib to compare the average growth rate from 2000 to 2016 with the projected growth rate from 2016 to 2030.

  1. From the un_world_cities_prepared dataset, create a Python recipe.

  2. Add a New Folder (not a dataset) as an output named population_growth_comparisons.

  3. Create the recipe.

The full code recipe is below, but take note of the following key points.

  • We use the Dataiku API to interact with the input dataset as a Pandas dataframe.

  • We use matplotlib to create the charts.

  • For each plot, we use the upload_stream() method to write the image to the output folder because it works for both local and non-local folders.

  1. Delete the code sample.

  2. Copy and paste this code into your recipe editor.

    import dataiku
    import pandas as pd
    import matplotlib.pyplot as plt
    import io
    
    # Read recipe inputs
    un_data_prepared = dataiku.Dataset("un_world_cities_prepared")
    df = un_data_prepared.get_dataframe()
    
    # Top 10 most populous countries
    TOP_10 = ['India', 'China', 'Brazil', 'Japan', 'Pakistan', 'Mexico', 'Nigeria',
              'United States of America', 'Indonesia', 'Turkey']
    
    # Write recipe outputs
    pyramid_plot = dataiku.Folder("population_growth_comparisons")
    
    # Generate a plot for each country and save it to the folder
    for country in TOP_10:
    
        df_filtered = df[df['Country or area'] == country]
    
        y = range(0, len(df_filtered))
        x_1 = df_filtered["avg_roc_2000-2016"]
        x_2 = df_filtered["avg_roc_2016-2030"]
    
        fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(12, 9))
    
        fig.patch.set_facecolor('xkcd:light grey')
        plt.figtext(.5, .9, "Pop. ROC Comparison", fontsize=15, ha='center')
    
        axes[0].barh(y, x_1, align='center', color='royalblue')
        axes[0].set(title='2000-2016')
        axes[1].barh(y, x_2, align='center', color='red')
        axes[1].set(title='2016-2030')
    
        axes[1].grid()
        axes[0].set(yticks=y, yticklabels=df_filtered['City'])
        axes[0].invert_xaxis()
        axes[0].grid()
    
        # Stream the rendered figure to the output folder
        bs = io.BytesIO()
        fig.savefig(bs, format="png")
        pyramid_plot.upload_stream(country + "_fig.png", bs.getvalue())
    
        # Close the figure to free memory before the next iteration
        plt.close(fig)
    
  3. Run the recipe.

  4. View the output folder.

Publish a managed folder insight#

In the population_growth_comparisons folder, we can browse the files to view the chart for each country.

In some cases, we might automate the export of the contents of a managed folder to some other location. For this use case though, let’s publish the whole folder as an insight on a dashboard.

  1. From the population_growth_comparisons folder, click Publish in the Actions sidebar.

  2. Click Create to add it to a dashboard. It may take a minute or two.

  3. Adjust the size of the folder preview on the Edit tab as needed. Click Save.

  4. Navigate to the View tab to interact with the images in the folder.

Dataiku screenshot of the final images inside the output managed folder.

What’s next?#

That’s it! You have successfully demonstrated how to use managed folders as the input and output to code recipes to assist you in handling types of data that Dataiku cannot natively read and/or write.

See also

For more information on managed folders, please visit the managed folders reference documentation.