Tutorial | Integration with MongoDB#

Introduction#

MongoDB has become very popular in recent years and is used across many industries and projects. This article shows how to read data from MongoDB and write data back to it from Dataiku.

Technical requirements#

You need access to a MongoDB instance with an existing database and collections. Here, we have a local MongoDB instance, with a “dataiku” database and a “github” collection. The collection uses data from the GitHub Archive, a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. The raw data is provided as JSON, which makes it well suited to MongoDB.

Create the MongoDB connection#

If you use a MongoDB data source, the first thing to do in Dataiku is to create the corresponding connection.

  1. In the Administration menu of Dataiku, under the Connections tab, click on New connection and select MongoDB.

    Creating a new connection to MongoDB in the Dataiku Administration dialogs.
  2. Fill in the connection credentials as required.

    MongoDB connection credentials.

Note

If needed, you can define the connection using an advanced URI syntax.
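As an illustration of what such a URI can look like, here is a minimal sketch that assembles a standard `mongodb://` connection string. The helper name `build_mongo_uri`, the user, the password, and the `authSource` option are hypothetical values chosen for the example; only the local host and the “dataiku” database come from this tutorial.

```python
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(host, port=27017, database="", user=None, password=None, options=None):
    """Assemble a standard mongodb:// connection string (illustrative helper)."""
    credentials = ""
    if user is not None:
        # Percent-encode credentials so characters such as '@' or ':' stay unambiguous.
        credentials = quote_plus(user)
        if password is not None:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    uri = f"mongodb://{credentials}{host}:{port}/{database}"
    if options:
        # Options are appended as a regular query string.
        uri += "?" + urlencode(options)
    return uri


# Hypothetical credentials and option, for illustration only.
uri = build_mongo_uri(
    "localhost",
    database="dataiku",
    user="dss_user",
    password="p@ss:word",
    options={"authSource": "admin"},
)
print(uri)
```

Encoding the user name and password matters because an unescaped `@` or `:` in a password would otherwise be read as a URI delimiter.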

Create a MongoDB dataset#

Once your connection is created, you can easily access the data stored there by creating a new MongoDB dataset in Dataiku (from the Dataset > New menu).

  1. Browse the collections.

  2. Select the one you want to use as a dataset.

Creating a Dataiku dataset from MongoDB.

Note that you can also limit the results returned by using the Input filtering query field:

Limiting results with an input filter.
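To give an idea of what an input filter can express, here is a minimal sketch, assuming the field accepts a standard MongoDB query document written as JSON. The filter value `PushEvent` and the sample events are hypothetical, and `matches` is a small stand-in that only mimics equality matching on top-level fields, not the full MongoDB query language.

```python
import json

# Hypothetical filter: keep only events whose "type" field equals "PushEvent".
query_filter = {"type": "PushEvent"}


def matches(document, query):
    """Stand-in for MongoDB equality matching on top-level fields only."""
    return all(document.get(field) == value for field, value in query.items())


# Made-up sample events, shaped loosely like GitHub Archive records.
sample_events = [
    {"type": "PushEvent", "actor": "alice"},
    {"type": "WatchEvent", "actor": "bob"},
    {"type": "PushEvent", "actor": "carol"},
]

kept = [event for event in sample_events if matches(event, query_filter)]

# JSON text of the filter, as it could be typed into a query field.
filter_text = json.dumps(query_filter)
print(filter_text)
print([event["actor"] for event in kept])
```

Filtering at the source keeps the dataset small, since documents that do not match are never returned.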

Analyze your data#

You are now ready to analyze your MongoDB data. For instance, you could use a visual data preparation script from Analyze to flatten or unfold the documents stored in your MongoDB collection:

"Visual preparation recipe on a MongoDB collection"
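For readers who want to see what flattening means for a nested document, here is a minimal plain-Python sketch. It is not Dataiku's implementation: the `flatten` helper, the `_` separator, and the sample event are assumptions made for illustration.

```python
def flatten(document, parent_key="", sep="_"):
    """Turn nested dictionaries into a single level of column-like keys."""
    flat = {}
    for key, value in document.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents, prefixing keys with the parent name.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Made-up event, shaped loosely like a GitHub Archive record.
event = {
    "type": "PushEvent",
    "actor": {"id": 42, "login": "octocat"},
    "repo": {"name": "octocat/hello-world"},
}

flat_event = flatten(event)
print(flat_event)
```

Each nested field becomes its own column, which is the tabular shape that later visual steps typically expect.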

You may also write data to MongoDB, for instance to expose it in a web-based application. Suppose the following dataset has been built:

"A Dataiku dataset to be written to MongoDB"

You could then use a visual preparation script to create a document from each record, using a custom Python formula, for instance:

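import json

def process(row):
    # Build one nested document per input record.
    document = {
      'actor_id': row['actor_id'],
      'stats': [
        # Cast the values explicitly so their types are unambiguous.
        {'creation_count': int(row['creation_count'])},
        {'created_at_max': str(row['created_at_max'])}
      ]
    }
    # Serialize to a JSON string so the resulting cell holds valid JSON.
    return json.dumps(document)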
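To see the shape of the document this formula builds, the same logic can be run outside Dataiku on a single record. This is a minimal sketch: `build_document` is a hypothetical stand-alone copy of the logic, and the sample row values are made up.

```python
import json


def build_document(row):
    """Stand-alone copy of the document-building logic, for local inspection."""
    return {
        "actor_id": row["actor_id"],
        "stats": [
            {"creation_count": int(row["creation_count"])},
            {"created_at_max": str(row["created_at_max"])},
        ],
    }


# Made-up record; the values are given as strings, as they could arrive from a text-based row.
sample_row = {
    "actor_id": "1234",
    "creation_count": "17",
    "created_at_max": "2015-01-01T15:00:00Z",
}

document = build_document(sample_row)
print(json.dumps(document, indent=2))
```

Note that `creation_count` ends up as an integer in the document while `actor_id` stays a string, because only the former is cast.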

Finally, you could write this data to MongoDB by deploying the script and specifying your MongoDB connection as the target:

"Deploying a Prepare recipe with a MongoDB dataset as the output"

Wait for your script to run. Your data is written into a new MongoDB collection. You may want to check the output directly from a Mongo shell to be sure it worked:

"Checking that the data was written into a new MongoDB collection"

That’s it for a first taste of integrating Dataiku and MongoDB.