Reference | Reading or writing a dataset with custom Python code

When you use a Python recipe to transform a dataset in Dataiku, you generally use the Dataiku Python API to read and write to the dataset.

The Dataiku API provides an easy way to read or write datasets, regardless of their size or data store. This way, you don't need to install a specific package for each data store, or learn each store's own API.
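For example, a typical Python recipe reads the input into a pandas DataFrame, transforms it, and writes the result back. The sketch below assumes hypothetical dataset names ("my_input", "my_output") and a hypothetical "value" column:

```python
import dataiku

# Read the input dataset as a pandas DataFrame
input_ds = dataiku.Dataset("my_input")
df = input_ds.get_dataframe()

# Hypothetical transformation on a hypothetical "value" column
df["doubled"] = df["value"] * 2

# Write the result and set the output dataset's schema
output_ds = dataiku.Dataset("my_output")
output_ds.write_with_schema(df)
```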

There are some cases, however, where the Dataiku API does not provide enough flexibility, and you want to use the native API or package for your data store.

Some use cases could include:

  • You want to read data stored in a MongoDB collection using a specific filter that cannot be expressed in the input dataset's filter (see the first sketch after this list).

  • You want to “upsert” data into the output dataset (i.e., insert, update, or remove records based on a primary key); see the sketch at the end of this section.
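As an illustration of the first use case, the following sketch reads a MongoDB collection directly with pymongo, applying a query that the input dataset's filter could not express, and then writes the result with the regular Dataiku API. The connection URI, database, collection, query, and output dataset name are all hypothetical placeholders:

```python
import dataiku
import pandas as pd
from pymongo import MongoClient

# Hypothetical connection details
client = MongoClient("mongodb://localhost:27017/")
collection = client["my_database"]["my_collection"]

# Apply an arbitrary MongoDB query instead of the dataset's own filter
cursor = collection.find({"status": "active", "score": {"$gte": 0.8}})
df = pd.DataFrame(list(cursor)).drop(columns=["_id"], errors="ignore")

# Write the result to the output dataset with the regular Dataiku API
output_ds = dataiku.Dataset("my_output")
output_ds.write_with_schema(df)
```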

Using the Dataiku API is by no means mandatory. You can read and write data however you want. If you don't call the get_dataframe or iter_tuples methods, Dataiku will not read any data, nor load anything into memory from the data store.

Similarly, you don't have to use the write_dataframe or get_writer API to write data to the output dataset. Even if you use a writer that Dataiku does not know about (for example, the pymongo package for MongoDB), the recipe will work properly, and Dataiku will know that the dataset has been changed.
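As an illustration of the second use case, the sketch below upserts records into a MongoDB collection backing the output dataset, using pymongo's bulk_write instead of the Dataiku writer. The connection details, collection name, and the "customer_id" primary key are hypothetical placeholders:

```python
import dataiku
from pymongo import MongoClient, UpdateOne

# Read the input with the regular Dataiku API
df = dataiku.Dataset("my_input").get_dataframe()

# Hypothetical connection to the collection backing the output dataset
client = MongoClient("mongodb://localhost:27017/")
collection = client["my_database"]["my_output_collection"]

# Insert new records and update existing ones, matched on "customer_id"
operations = [
    UpdateOne(
        {"customer_id": row["customer_id"]},
        {"$set": row.to_dict()},
        upsert=True,
    )
    for _, row in df.iterrows()
]
if operations:
    collection.bulk_write(operations)
```

Because records are matched on the primary key, rerunning the recipe updates existing documents rather than duplicating them.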