Python and Dataiku

Learn how to integrate Python code into Dataiku.

Reference | Reading or writing a dataset with custom Python code

When you use a Python recipe to transform a dataset in Dataiku, you generally use the Dataiku Python API to read and write to the dataset.

The Dataiku API provides an easy way to read or write datasets, regardless of their size or data store. This way, you don’t need to install a specific package for each data store, or learn each store’s API.

There are some cases, however, where the Dataiku API does not provide enough flexibility, and you want to use the specific API or package for your data store.

Some use cases could include:

  • You want to read data stored in a MongoDB collection with a specific filter that cannot be expressed in the input dataset’s filter (see the read sketch below).

  • You want to “upsert” data into the output dataset (i.e., insert or update records based on a primary key; see the sketch further below).

Using the Dataiku API is by no means mandatory. You can read and write data however you want. If you don’t call the get_dataframe or iter_tuples methods, Dataiku will not read any data or load anything into memory from the data store.
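For example, here is a minimal sketch of reading from MongoDB with a custom filter through the pymongo package instead of the Dataiku reader. The connection URI, database, collection, and filter below are hypothetical; adapt them to your own setup.

import pandas as pd
from pymongo import MongoClient

# Hypothetical connection details: adapt to your MongoDB setup
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["events"]

# Apply a filter that the input dataset's filter cannot express
cursor = collection.find({"status": "active", "score": {"$gt": 0.5}})
df = pd.DataFrame(list(cursor))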

Similarly, you don’t have to use the write_dataframe or get_writer API to write data to the output. Even if you use a writer that Dataiku does not know about (for example, the pymongo package for MongoDB), the recipe will work properly, and Dataiku will know that the dataset has changed.
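As a sketch of the “upsert” use case above, a Python recipe could read its input through the Dataiku API and write through pymongo. The dataset name, connection details, and the "customer_id" primary key are assumptions for illustration.

import dataiku
from pymongo import MongoClient

# Read the input through the Dataiku API as usual
df = dataiku.Dataset("customers").get_dataframe()

# Hypothetical connection details: adapt to your MongoDB setup
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["customers"]

# Upsert each record on its primary key: replace the existing
# document if "customer_id" already exists, insert it otherwise
for record in df.to_dict(orient="records"):
    collection.replace_one(
        {"customer_id": record["customer_id"]},
        record,
        upsert=True,
    )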

Code Sample | Access info about datasets

You generally want to avoid hard-coding connection information, table names, etc. in your recipe code. Dataiku can give you some connection / location information about the datasets that you are trying to read or write.

For all datasets, you can use the dataset.get_location_info() method. It returns a structure containing an info dict. The keys in the info dict depend on the specific kind of dataset; print the dict to see its contents (for example, from a Jupyter notebook). Here are a few examples:

import dataiku

# myfs is a Filesystem dataset
dataset = dataiku.Dataset("myfs")
locinfo = dataset.get_location_info()
print(locinfo["info"])

{
  "path" : "/data/input/myfs"
}

# sql is a PostgreSQL dataset
dataset = dataiku.Dataset("sql")
locinfo = dataset.get_location_info()
print(locinfo["info"])

{
  "databaseType" : "PostgreSQL",
  "schema" : "public",
  "table" : "mytablename"
}
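You can feed this location information to your own database client, or to Dataiku’s SQLExecutor2, to query the underlying table directly. Below is a minimal sketch assuming the PostgreSQL dataset named "sql" from the example above; the COUNT query is just for illustration.

import dataiku
from dataiku import SQLExecutor2

dataset = dataiku.Dataset("sql")
info = dataset.get_location_info()["info"]

# Build a query against the actual table behind the dataset.
# The double-quoted identifiers assume PostgreSQL.
query = 'SELECT COUNT(*) AS n FROM "%s"."%s"' % (info["schema"], info["table"])

executor = SQLExecutor2(dataset=dataset)
print(executor.query_to_df(query))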

In addition, for “Filesystem-like” datasets (Filesystem, HDFS, S3, etc.), you can use the get_files_info() method to get details about all files in a dataset (or partition).

dataset = dataiku.Dataset("non_partitioned_fs")
fi = dataset.get_files_info()

for filepath in fi["globalPaths"]:
  # Returns a path relative to the root path of the dataset.
  # The root path of the dataset is returned by get_location_info
  print filepath["path"]
  # Size in bytes of that file
  print filepath["size"]
dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()

for (partition_id, partition_filepaths) in fi["pathsByPartition"].items():
  print partition_id

  for filepath in partition_filepaths:
    # Returns a path relative to the root path of the dataset.
    # The root path of the dataset is returned by get_location_info
    print filepath["path"]
    # Size in bytes of that file
    print filepath["size"]
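Combining the two methods, you can build absolute paths and read the files yourself, bypassing the Dataiku reader entirely. A sketch, assuming a Filesystem dataset whose files are accessible from the machine running the recipe:

import os
import dataiku

dataset = dataiku.Dataset("non_partitioned_fs")
root = dataset.get_location_info()["info"]["path"]
fi = dataset.get_files_info()

# Join the dataset root with each relative file path and
# count the lines of every file directly from disk
for filepath in fi["globalPaths"]:
    full_path = os.path.join(root, filepath["path"].lstrip("/"))
    with open(full_path) as f:
        print(full_path, sum(1 for _ in f))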

How-to | Enable auto-completion in a Jupyter notebook

Many of you have shown interest in enabling auto-completion in Jupyter notebooks, so, in the interest of knowledge sharing, we wanted to demonstrate just how simple it is.

Access the Jupyter Menu

Auto-completion works in Dataiku’s Jupyter notebooks just as it does in any other Jupyter environment. Press the “Tab” key while writing code to open a menu of suggestions, then press “Enter” to accept one.

(Screenshot: the auto-completion suggestion menu in a Jupyter notebook)