Concept: The dataiku Package

In this lesson, let’s explore how to use the dataiku package in Dataiku DSS for low-level actions, such as reading and writing datasets or reading and writing files in a managed folder.

The API in a Code Recipe

You’ve already used the dataiku package if you’ve created a code recipe in Dataiku DSS. In this lesson, we’ll deconstruct the starter code below found in a Python recipe, but the logic is the same for an R recipe as well.

  • The code starts by importing the dataiku and other standard packages.

  • After the import statements, a call to dataiku.Dataset creates a Dataset object that refers to the input dataset by name.

  • The next line calls get_dataframe() on that object to load the dataset’s contents into memory as a pandas dataframe.

    • These two lines let you get right to work regardless of whether “mydataset” comes from a simple CSV file in your local filesystem, an SQL table, HDFS, or an Amazon S3 bucket in the cloud, to name a few examples.

  • At the end of the recipe, the dataiku package helps out once again: write_with_schema() writes the output dataframe, along with its schema, back to a DSS dataset.

  • The meat of the recipe is left to you: Dataiku DSS assists where it can and otherwise stays out of your way.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
mydataset = dataiku.Dataset("mydataset")
mydataset_df = mydataset.get_dataframe()

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

mydataset_processed_df = mydataset_df # For this sample code, simply copy input to output

# Write recipe outputs
mydataset_processed = dataiku.Dataset("mydataset_processed")
mydataset_processed.write_with_schema(mydataset_processed_df)
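
The NB comment in the starter code mentions that DSS supports other APIs for reading and writing data. As a hedged sketch of one of them, the example below streams rows instead of loading the whole dataset into a dataframe. It assumes that read_schema() returns a list of column dictionaries with a "name" key, and it relies on the Dataset methods iter_tuples(), write_schema(), and get_writer(). The helper names stream_rows, strip_strings, and run_recipe are hypothetical and not part of the dataiku package.

```python
def strip_strings(record):
    """Example transformation: trim whitespace from every string value."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()}


def stream_rows(source, target, transform):
    """Copy rows from source to target one at a time, applying transform.

    source and target are expected to behave like dataiku.Dataset objects.
    Returns the number of rows written.
    """
    schema = source.read_schema()
    # Assumption: each schema entry is a dict with a "name" key.
    columns = [column["name"] for column in schema]
    # Reusing the input schema is only valid if transform keeps the same columns.
    target.write_schema(schema)

    written = 0
    with target.get_writer() as writer:
        for values in source.iter_tuples():
            record = dict(zip(columns, values))
            writer.write_row_dict(transform(record))
            written += 1
    return written


def run_recipe():
    """Entry point for a DSS Python recipe (the dataiku package exists only inside DSS)."""
    import dataiku
    source = dataiku.Dataset("mydataset")
    target = dataiku.Dataset("mydataset_processed")
    return stream_rows(source, target, strip_strings)
```

Inside a recipe, you would call run_recipe(). Keeping the dataiku import inside that function is a design choice for this sketch only: it lets you exercise stream_rows outside DSS with stand-in objects.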

Interaction with Other Objects

You can use the dataiku package for low-level interaction with other objects in DSS: for instance, managed folders, saved models, metrics & checks, and more!

A slide introducing how the dataiku package is for low-level interaction with Dataiku objects.
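
To make the managed folder case concrete, here is a minimal sketch that reads every text file in a folder. It relies on the Folder methods list_paths_in_partition() and get_download_stream(), and it assumes the files are UTF-8 encoded. The folder name "my_folder" and the helper names read_text_files and run_in_dss are hypothetical.

```python
def read_text_files(folder, suffix=".txt"):
    """Return {path: text} for each file in the folder whose path ends with suffix.

    folder is expected to behave like a dataiku.Folder object.
    """
    contents = {}
    for path in folder.list_paths_in_partition():
        if path.endswith(suffix):
            # The download stream yields bytes; decode assuming UTF-8.
            with folder.get_download_stream(path) as stream:
                contents[path] = stream.read().decode("utf-8")
    return contents


def run_in_dss():
    """Call this from a recipe or notebook (the dataiku package exists only inside DSS)."""
    import dataiku
    folder = dataiku.Folder("my_folder")  # hypothetical folder name
    return read_text_files(folder)
```

Because the stream API is used rather than local filesystem paths, the same sketch should apply whether the folder is stored locally or on remote storage.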

You can access the dataiku package from anywhere you can write code in the platform.

For example, if you create a webapp, you’ll use the same call to dataiku.Dataset to interact with a dataset that you would find in a code recipe or notebook.

A slide introducing how the APIs can be used throughout Dataiku, such as in webapps.

If you write a custom Python metric or check, the input dataset to your process functions will be a dataiku.Dataset object.

A slide introducing how the APIs can be used in custom metrics and checks.
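
As an illustration, a custom Python metric might look like the sketch below. It assumes the process(dataset, partition_id) signature and the dictionary return value used by the custom probe template in DSS, so check the template in your own instance before relying on it. The column name "customer_id" and the metric names are hypothetical.

```python
COLUMN = "customer_id"  # hypothetical column name


def process(dataset, partition_id):
    """Count the rows of the dataset and the missing values in COLUMN.

    dataset is the dataiku.Dataset object that DSS passes in.
    """
    total = 0
    missing = 0
    # iter_rows() yields dict-like rows, so the full dataset never sits in memory.
    for row in dataset.iter_rows():
        total += 1
        if row[COLUMN] in (None, ""):
            missing += 1
    # Each key becomes a metric name; each value is the computed metric.
    return {"row_count": total, "missing_" + COLUMN: missing}
```

Since the input is an ordinary dataiku.Dataset, every method you use in a recipe, such as get_dataframe(), is available here too.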

One of the easiest ways to get started with the API is to open a code notebook, type an object’s name followed by a dot, and press “Tab” to see the methods available on that object. Or try pressing “Shift + Tab” inside a function call to display its documentation.
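
If you are working somewhere without notebook completion, a rough equivalent can be sketched with Python’s standard inspect module. The helper names public_methods and first_doc_line are hypothetical; they only approximate what “Tab” and “Shift + Tab” display.

```python
import inspect


def public_methods(obj):
    """Return the sorted names of the public callables on obj (roughly what Tab lists)."""
    return sorted(
        name
        for name, member in inspect.getmembers(obj, callable)
        if not name.startswith("_")
    )


def first_doc_line(func):
    """Return the first line of a function's docstring, or an empty string if it has none."""
    doc = inspect.getdoc(func)
    return doc.splitlines()[0] if doc else ""
```

In a DSS notebook, you could call public_methods(dataiku.Dataset("mydataset")) to list the available methods, then pass one of them to first_doc_line or to the built-in help() function.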

You can also browse the product documentation of the APIs online.

Now that we’ve taken a brief look at using the dataiku package for low-level interactions, you’ll also want to know how to use the public API for administrative or automation tasks. We’ll explore this in another lesson.