Tutorial | Integration with Amazon Redshift#

Introduction#

Amazon Redshift is one of the analytical databases Dataiku can easily work with. This article shows how to integrate these two technologies.

Loading data into Redshift with Dataiku is as simple as:

  • Create your connections to Redshift and S3.

  • Load your dataset into S3, in the proper format and in the same AWS region as your Redshift cluster.

  • Load your data from S3 to Redshift using a Sync recipe.

That’s it. Dataiku will take care of the plumbing for you and let you focus on analyzing your data.

A Dataiku flow with Redshift data.

The rest of this article is a hands-on walkthrough of these steps.

Technical requirements#

We assume here that you have access to a Redshift cluster (otherwise, see the docs on how to create one from your AWS console), and to an S3 bucket with the proper “write” privileges.

Define the Redshift connection#

  1. Go to the Administration interface in Dataiku, and create a new Amazon Redshift connection.

    Creating a new connection to Redshift.
  2. Fill in your credentials as prompted.

    Redshift credentials to define the connection.

You’ll need to know a few pieces of information, including the database’s password.
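
If you want to double-check these details outside of Dataiku, a quick connectivity test from Python can help. This is a minimal sketch, assuming the psycopg2 driver is installed; the host, database, user, and password are placeholders for your own values (Redshift listens on port 5439 by default):

```python
# Minimal Redshift connectivity check (sketch; all connection values are placeholders).
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,                                                  # Redshift default port
    dbname="dev",
    user="dataiku",
    password="********",
)
with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())
conn.close()
```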

Define the S3 connection#

  1. From the Administration interface again, create a new Amazon AWS (S3) connection.

    Creating a new connection to Amazon AWS (S3).
  2. Fill in your credentials as prompted.

    Amazon AWS (S3) credentials to define the connection.

You’ll need to know a few pieces of information, including your Amazon credentials and a bucket where you can read and write data.
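
To make sure the bucket really allows both reads and writes before wiring it into a Flow, a small check with boto3 can save time. This is a minimal sketch, assuming boto3 is installed; the credentials and bucket name are placeholders:

```python
# Sanity check that the bucket behind the S3 connection is readable and writable
# (sketch; credentials and bucket name are placeholders).
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",             # same credentials as the Dataiku connection
    aws_secret_access_key="********",
)
bucket = "my-dataiku-bucket"
key = "dataiku-check/ping.txt"

s3.put_object(Bucket=bucket, Key=key, Body=b"ping")                      # write test
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"ping"   # read test
s3.delete_object(Bucket=bucket, Key=key)                                 # clean up
```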

Get some (big) data#

In this example, we are going to use GitHub data. Specifically, we’ll use data from the GitHub Archive, a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. This data is interesting because it provides detailed, granular events, letting us look for insights into open source software projects.

We have access here to 6 months of GitHub activity history (from January to June 2015), stored on our server as JSON files split by hour (one hour == one JSON file).
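
If you want to reproduce this setup, you can pull individual hours of the public archive yourself. A minimal sketch, assuming the GH Archive download URL pattern (one gzipped JSON file per hour) is still the publicly documented one:

```python
# Download and peek at one hour of the GitHub Archive
# (sketch; the URL pattern may change over time).
import gzip
import json
import urllib.request

url = "https://data.gharchive.org/2015-01-15-12.json.gz"   # 2015-01-15, 12:00 UTC
with urllib.request.urlopen(url) as resp:
    raw = gzip.decompress(resp.read())

for line in raw.splitlines()[:3]:                          # one JSON event per line
    event = json.loads(line)
    print(event["type"], event["repo"]["name"], event["created_at"])
```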

  1. Create a Dataiku dataset on top of these files.

    Dataset preview of Github activity history.

    Note

    The selected format is One record per line, since we do not yet want to flatten the very rich JSON structure.

  2. Define a partitioning scheme, allowing us to efficiently work on daily subsets of data.

    Defining the partitioning scheme for the dataset.
  3. Save the input dataset (named github here).

Saving the input dataset.

Note that you see here the raw JSON content, with one event per line, stored in a fairly large 30 GB dataset (compressed) containing 90 million records for this 6-month period. Also, the column has automatically been named line.

Push your data to S3#

Loading data efficiently into Redshift requires the files to be available on S3 and to respect a few other constraints, well documented here.

For the sake of simplicity, we will not use all the information available in the initial dataset. To create a new dataset with a suitable (tabular) format for Redshift, create a new Analyze script on the initial dataset.

As we are only interested, for now, in the global activity and popularity of GitHub repos, the visual data preparation script we build does the following (a rough code equivalent is sketched below):

  • Flatten the JSON structure (not going beyond 1 level depth).

  • Flatten the actor and repo sections of the JSON.

  • Create a new date column that will be used for Redshift partitions.

  • Remove the unnecessary columns.

Prepare recipe script to transform the data.
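
For reference, here is roughly what this preparation amounts to, expressed in Python on a single event. This is only an illustrative sketch: the real work is done by the visual script above, and the exact output column names (actor_login, repo_name, date) are assumptions about how the flattening is configured:

```python
# Rough Python equivalent of the visual preparation script (illustration only;
# output column names are assumptions).
import json

def prepare(json_line):
    event = json.loads(json_line)
    return {
        # keep a few top-level fields (one level of flattening)
        "type": event.get("type"),
        "created_at": event.get("created_at"),
        # flatten the actor and repo sections
        "actor_login": event.get("actor", {}).get("login"),
        "repo_name": event.get("repo", {}).get("name"),
        # new date column, used later as the Redshift partitioning key
        "date": (event.get("created_at") or "")[:10],      # e.g. "2015-01-15"
    }
    # every other column (payload, org, ...) is simply dropped
```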

Deploy the script#

This script can now be deployed, and the resulting dataset written to Amazon S3. To do so:

  1. Click on the Deploy Script button.

  2. Store the newly created github_s3 dataset in your S3 connection.

Deploying the Prepare recipe.

A new recipe has been created, with the local github dataset as input, and the github_s3 dataset as output. Note that the daily partitioning scheme of the input dataset has automatically been copied to the output.

Comply with the Redshift constraints#

Before actually building your dataset (i.e., writing your data to S3), comply with the Redshift constraints:

  1. Click on the link to the new github_s3 dataset.

  2. Go to the Settings tab.

  3. Change the quoting style to Escaping only.

  4. Save your changes.

Setting the quoting style to Escaping only.
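
“Escaping only” means the CSV files written to S3 do not wrap fields in quotes; instead, a backslash escapes any delimiter or newline inside a value, which is the style Redshift’s COPY command handles with its ESCAPE option. As a rough illustration of the format (not what Dataiku runs internally):

```python
# Illustration of "escaping only" CSV output: no quoting, backslash escapes.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\")
writer.writerow(["PushEvent", "octocat/Hello,World", "2015-01-15"])
print(buf.getvalue())   # -> PushEvent,octocat/Hello\,World,2015-01-15
```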

Build the dataset#

Now it’s time to build the dataset.

  1. Go back to the Flow screen (click on the button in the nav bar or hit g + f on your keyboard).

  2. Click on the github_s3 dataset icon.

  3. In the right panel, click on Build, and start by loading a limited date range; the month of January 2015, for instance.

    Building the dataset in the Flow.
  4. Once everything is set, hit the Build button. A new job is launched that will:

    • Start from the input files.

    • Apply the visual data preparation script.

    • Load it into S3 with the proper format on the fly.

With our servers, it takes approximately two minutes to load a month’s worth of data:

Job details for the S3 dataset build.

That’s it: your data now sits in S3 and is ready to be loaded into Redshift:

The data loaded into S3.

Load your data to Redshift#

Loading your dataset into Redshift will now be very easy.

  1. Go back to the Flow screen again.

  2. Create a new Sync recipe from the github_s3 dataset:

    Creating a Sync recipe from the S3 dataset.

    This will create a new github_t dataset, stored in your Redshift connection.

    The summary screen that appears lets you review the input/output of your Sync recipe (note that the partitioning scheme is again copied), and build your dataset (i.e., copy your S3 data to Redshift) directly from there.

    The input/output screen of the Sync recipe.
  3. Hit the Run button and wait for the job to complete.

Job details for the Redshift dataset build.

A little less than two minutes later, your data is in Redshift! Note that the resulting table is partitioned, and we didn’t even have to worry about its schema.
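
Under the hood, this kind of S3-to-Redshift load boils down to Redshift’s COPY command. For the curious, here is a rough hand-written equivalent; this is only a sketch, with placeholder table name, S3 path, and credentials, and Dataiku actually takes care of schema creation and per-partition loads for you:

```python
# Rough manual equivalent of the Sync step: a Redshift COPY from S3
# (sketch; table name, S3 path, and credentials are placeholders).
import psycopg2

copy_sql = """
    COPY github_t
    FROM 's3://my-dataiku-bucket/github_s3/2015-01-15/'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=********'
    DELIMITER ','
    ESCAPE
    GZIP
    TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="dataiku", password="********")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # ESCAPE matches the "Escaping only" quoting style set earlier
conn.close()
```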

What’s next?#

For instance, you can easily create Charts on the Redshift dataset. Because Redshift is fast enough to be used interactively, Charts can be switched to “Live processing” (i.e., Redshift does most of the aggregation work):

Charts engine selection.

Because live processing is enabled, Charts will be based on the entire dataset:

Bar chart with creation date on the X axis, colored by line type.

You may also want to interactively analyze the dataset with your own SQL queries, using a SQL notebook:

SQL notebook connected to Redshift.
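
For example, a query ranking repositories by daily activity could look like the one below. It is shown here wrapped in Dataiku’s Python API (SQLExecutor2), as you might run it from a Python notebook inside Dataiku, so the same aggregation could also feed a downstream Python recipe; the table and column names follow this tutorial and are otherwise assumptions:

```python
# Example aggregation on the Redshift dataset, run from Python through
# Dataiku's SQLExecutor2 (sketch; table and column names are assumptions).
import dataiku
from dataiku import SQLExecutor2

query = """
    SELECT repo_name,
           date,
           COUNT(*) AS nb_events
    FROM github_t
    GROUP BY repo_name, date
    ORDER BY nb_events DESC
    LIMIT 100
"""

executor = SQLExecutor2(dataset=dataiku.Dataset("github_t"))  # reuses the dataset's Redshift connection
top_repos = executor.query_to_df(query)
print(top_repos.head())
```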

And finally, you could write a SQL recipe as part of a larger workflow, for instance using Redshift to aggregate and reduce the size of the base data before passing it to R or Python for deeper analysis, or creating a model from it:

Flow showing data loaded into S3, Redshift, aggregated with an SQL recipe, analyzed with a Python recipe, and scored with a cluster model.

We’re done. A few steps take your data into Redshift, and you’re ready for large-scale analysis in a matter of minutes.