Concept: The dataiku Package
In this lesson, we explore how to use the dataiku package in Dataiku DSS for low-level actions such as reading and writing datasets, or reading and writing files in a managed folder.
The API in a Code Recipe
You’ve already used the dataiku package if you’ve created a code recipe in Dataiku DSS. In this lesson, we’ll deconstruct the starter code of a Python recipe, shown below. The logic is the same for an R recipe.
The code starts by importing the dataiku package along with pandas and NumPy.
After the import statements, the dataiku.Dataset constructor creates a handle to the DSS dataset named “mydataset”. The next line calls get_dataframe() on that handle to load the dataset into memory as a pandas dataframe.
These two lines let you get right to work regardless of whether “mydataset” comes from a simple CSV file in your local filesystem, an SQL table, HDFS, or an Amazon S3 bucket in the cloud, to name a few examples.
At the end of the recipe, the write_with_schema method of the output dataiku.Dataset writes the dataframe back to a DSS dataset and sets that dataset’s schema from the dataframe’s columns.
The meat of the recipe is left to you because Dataiku DSS assists where it can, but otherwise, it stays out of your way.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
mydataset = dataiku.Dataset("mydataset")
mydataset_df = mydataset.get_dataframe()
# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
mydataset_processed_df = mydataset_df # For this sample code, simply copy input to output
# Write recipe outputs
mydataset_processed = dataiku.Dataset("mydataset_processed")
mydataset_processed.write_with_schema(mydataset_processed_df)
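To make the "compute" step concrete, here is a minimal sketch of the same read, transform, write pattern with the transformation pulled out into its own function. The names clean and run_recipe are hypothetical, and dropping duplicate rows is only a placeholder for your own logic. The dataset class is passed in as an argument so that the function can be tried outside DSS with a stand-in object.

```python
def clean(df):
    """Hypothetical transformation: drop exact duplicate rows."""
    return df.drop_duplicates()


def run_recipe(dataset_cls, input_name="mydataset", output_name="mydataset_processed"):
    """Read the input dataset, transform it, and write the result with its schema.

    dataset_cls is expected to behave like dataiku.Dataset.
    """
    # Read recipe inputs
    input_df = dataset_cls(input_name).get_dataframe()

    # Compute recipe outputs from inputs
    output_df = clean(input_df)

    # Write recipe outputs
    dataset_cls(output_name).write_with_schema(output_df)
    return output_df


# Inside a DSS Python recipe, the call would look like:
#   import dataiku
#   run_recipe(dataiku.Dataset)
```

Keeping the transformation separate from the two dataiku calls means the I/O lines stay identical to the starter code, while the part you write can be tested on any pandas dataframe.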
Interaction with Other Objects
You can use the dataiku package for low-level interaction with other objects in DSS: for instance, managed folders, saved models, metrics & checks, and more!
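As an illustration for managed folders, the sketch below reads every file in a folder and writes a small size report back into it. It assumes the object passed in behaves like a dataiku.Folder handle, using its list_paths_in_partition, get_download_stream, and upload_data methods. The function name summarize_folder and the report file name are hypothetical.

```python
def summarize_folder(folder, report_path="/file_sizes.csv"):
    """Record the size in bytes of every file in a managed folder.

    folder is expected to behave like a dataiku.Folder handle.
    report_path is a hypothetical file name chosen for this sketch.
    """
    sizes = {}
    for path in folder.list_paths_in_partition():
        # The download stream is a file-like object usable as a context manager
        with folder.get_download_stream(path) as stream:
            sizes[path] = len(stream.read())

    # Build a small CSV report and write it back into the same folder
    lines = ["path,size_bytes"] + [f"{p},{n}" for p, n in sorted(sizes.items())]
    folder.upload_data(report_path, "\n".join(lines).encode("utf-8"))
    return sizes


# Inside DSS, the call would look like (folder name is a placeholder):
#   import dataiku
#   summarize_folder(dataiku.Folder("my_folder"))
```

Working through streams rather than local file paths keeps the code independent of where the folder is actually stored.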
You can access the dataiku package from anywhere you can write code in the platform.
For example, in a webapp you interact with a dataset through the same dataiku.Dataset call that you would use in a code recipe or notebook.
If you write a custom Python metric or check, the input dataset passed to your process function will be a dataiku.Dataset object.
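A custom Python metric could look roughly like the sketch below. It assumes the process signature matches the starter template that DSS generates for a custom Python probe (last_values, dataset, partition_id) and that the function returns a dictionary of metric names and values. The metric names row_count and column_count are made up for this example.

```python
def process(last_values, dataset, partition_id):
    """Return metric values for the dataset being probed.

    dataset is expected to be a dataiku.Dataset object, so the same
    get_dataframe() call used in a recipe is available here.
    last_values and partition_id are unused in this sketch.
    """
    df = dataset.get_dataframe()
    return {
        "row_count": len(df),             # number of rows loaded
        "column_count": len(df.columns),  # number of columns in the dataframe
    }
```

Loading the whole dataset into memory is acceptable for small data, but for large datasets you would likely want a lighter way of computing the metric.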
One of the easiest ways to get started with the API is to open a code notebook and press “Tab” after an object’s name and a dot to see the methods available on that object. Or press “Shift + Tab” inside a function call to display its documentation.
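If you prefer to explore from code rather than with keyboard shortcuts, plain Python introspection gives similar information. The helpers below are hypothetical conveniences built only on the standard library; they are not part of the dataiku package.

```python
import inspect


def public_methods(obj):
    """List the public callable attributes of an object."""
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name, None))
    )


def first_doc_line(func):
    """Return the first line of a function's docstring, or an empty string."""
    doc = inspect.getdoc(func) or ""
    return doc.splitlines()[0] if doc else ""


# In a DSS notebook, the calls would look like:
#   import dataiku
#   ds = dataiku.Dataset("mydataset")
#   public_methods(ds)
#   first_doc_line(ds.get_dataframe)
```

The built-in help() function prints the full docstring when the first line is not enough.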
You can also browse the product documentation of the APIs online.
Now that we’ve taken a brief look at using the dataiku package for low-level interactions, you’ll also want to know how to use the public API for administrative or automation tasks. We’ll explore this in another lesson.