Integration with Amazon Redshift

Amazon Redshift is one of the analytical databases DSS can easily work with. This article shows how to integrate the two technologies.

Loading data into Redshift with DSS is as simple as:

  • create your connection to Redshift and S3

  • load your dataset into S3, in the proper format and in the same region as your Redshift instance

  • load your data from S3 to Redshift using a “Sync” recipe

That’s it. DSS will take care of the plumbing for you and let you focus on analyzing your data.

"A Dataiku flow with Redshift data"

The rest of this article provides a hands-on exercise.

Technical Requirements

We assume here that you have access to a Redshift instance (otherwise, see the docs on how to create one from your Amazon AWS console), and to an S3 bucket with the proper “write” privileges.

Define the Redshift connection

Go to the Administration interface in DSS, and create a new “Redshift” connection:

"Creating a new connection to Redshift"

Fill in your credentials as prompted:

"Redshift credentials to define the connection"

You’ll need to know a few pieces of information, including the database’s password.
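
If you’d like to double-check those credentials before typing them into DSS, here is a minimal sketch using Python and psycopg2; the cluster endpoint, database, user, and password below are placeholders, not values from this article.

```python
# Quick sanity check of the Redshift credentials outside DSS.
# All connection values below are placeholders -- replace them with your own.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # cluster endpoint
    port=5439,            # default Redshift port
    dbname="dev",         # database name
    user="dss_user",      # database user
    password="********",  # database password
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), current_user;")
    print(cur.fetchone())

conn.close()
```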

Define the S3 connection

From the Administration interface again, create a new “Amazon AWS (S3)” connection:

"Creating a new connection to Amazon AWS (S3)"

Fill in your credentials as prompted:

"Amazon AWS (S3) credentials to define the connection"

Please note that you’ll need a few pieces of information, including your Amazon credentials and the name of a bucket where you can read and write data.
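
To confirm that the bucket really is readable and writable with those credentials, here is a quick sketch using boto3; the bucket name, keys, and region are placeholders.

```python
# Check read/write access to the S3 bucket with the same credentials
# you will enter in DSS. Bucket name, keys, and region are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",       # your access key
    aws_secret_access_key="********",  # your secret key
    region_name="us-east-1",           # same region as your Redshift cluster
)

# Write a small test object, read it back, then clean up.
s3.put_object(Bucket="my-dss-bucket", Key="dss-test/ping.txt", Body=b"ok")
obj = s3.get_object(Bucket="my-dss-bucket", Key="dss-test/ping.txt")
print(obj["Body"].read())  # should print b'ok'
s3.delete_object(Bucket="my-dss-bucket", Key="dss-test/ping.txt")
```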

Get some (big) data

In this example, we are going to use Github data. Specifically, we’ll use data from the GitHub Archive, a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. This data is interesting because it provides very detailed, granular events, letting us look for insights into open source software projects.

We have access here to 6 months of Github activity history (from January to June 2015), stored on our server as JSON files split by hour (one hour == one JSON file).
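
If you do not have such files at hand, the GitHub Archive publishes one gzipped JSON file per hour; a minimal download sketch is shown below (the URL pattern follows the current gharchive.org hosting, which may differ from the mirrors available in 2015).

```python
# Download one day of GitHub Archive data: one gzipped JSON file per hour.
# The URL pattern follows the public gharchive.org hosting; adjust if needed.
import requests

for hour in range(24):
    url = "https://data.gharchive.org/2015-01-01-%d.json.gz" % hour
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    with open("2015-01-01-%d.json.gz" % hour, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
    print("downloaded", url)
```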

As usual, the first step is to create a DSS Dataset on top of these files:

"Dataset preview of Github activity history"

Please note that the selected format is “One record per line”, since we do not want to flatten the JSON yet, given its very rich structure.

Also, we define a partitioning scheme, allowing us to efficiently work on daily subsets of data:

"Defining the partitioning scheme for the dataset"

You can now save the input dataset (named “github” here):

"Saving the input dataset"

Note that you see here the raw JSON content, with one event per line, stored in a fairly large 30 GB dataset (compressed) containing 90 million records for this 6-month period. Also, the column has automatically been named “line”.

Push your data to S3

Loading data effectively into Redshift requires the files to be available on S3 and to satisfy a few other constraints, well documented here.

For the sake of simplicity, we will not use all the information available in the initial dataset. To create a new dataset with a suitable (tabular) format for Redshift, create a new Analyze script on the initial dataset.

As we are only interested, for now, in the global activity and popularity of the Github repos, the visual data preparation script we build does the following (a rough Python equivalent is sketched after the screenshot below):

  • flatten the JSON structure (not going beyond 1 level depth)

  • flatten the “actor” and “repo” sections of the JSON

  • create a new “date” column that will be used for Redshift partitions

  • remove the unnecessary columns

"Prepare recipe script to transform the data"

This script can now be deployed, and the resulting dataset written on Amazon S3. Click on the “Deploy Script” button, and store the newly created “github_s3” dataset in your S3 connection.

"Deploying the Prepare recipe"

A new recipe has been created, with the local “github” dataset as input and the “github_s3” dataset as output. Note that the daily partitioning scheme of the input dataset has automatically been copied to the output.

Before actually building your dataset (i.e. putting your data on S3), and to comply with the Redshift constraints, click on the link to the new “github_s3” dataset and go to the “Settings” tab. Change the quoting style to “Escaping only”, and save your changes.

"Setting the quoting style to Escaping only"

Go back to the Flow screen (click on the button in the nav bar or hit “g + f” on your keyboard), and click on the “github_s3” dataset icon. In the right panel, click on “Build”, and start with a subset of partitions, the month of January 2015 for instance:

"Building the dataset in the flow"

Once everything is set, hit the “Build” button. A new job is launched that will:

  • start from the input files

  • apply the visual data preparation script

  • load the result into S3, in the proper format, on the fly.

With our servers, it takes approximately 2 minutes to load a month’s worth of data:

"Job details for the S3 dataset build"

That’s it: your data now sits in S3 and is ready to be loaded into Redshift:

"The data loaded into S3"

Load your data to Redshift

Loading your dataset into Redshift is now very easy. Go back to the Flow screen and create a new “Sync” recipe from the “github_s3” dataset:

"Creating a Prepare recipe from the S3 dataset"

This will create a new “github_t” dataset, stored in your Redshift connection.

The summary screen that appears lets you review the input / output of your Sync recipe (note that, again, the partitioning scheme is copied), and build your dataset (i.e. copy your S3 data to Redshift) directly from there:

"The input/output screen of the Prepare recipe"

Hit the “Run” button and wait for the job to complete:

"Job details for the Redshift dataset build"

A little less than 2 minutes later, your data is in Redshift! Note that the resulting table is partitioned, and we didn’t even have to worry about its schema.
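
Under the hood, this kind of S3-to-Redshift load relies on Redshift’s COPY command, which is also why the “Escaping only” quoting style was needed earlier. Here is a hedged sketch of an equivalent manual load; the table name, S3 path, credentials, and options are placeholders, and the exact statement DSS generates may differ.

```python
# Roughly what a manual S3-to-Redshift bulk load looks like with COPY.
# Table name, S3 path, credentials, and options are placeholders; the
# statement generated by DSS will differ in its details.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="dss_user", password="********",
)

copy_sql = """
    COPY github_t
    FROM 's3://my-dss-bucket/github_s3/2015-01-01/'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=********'
    DELIMITER ',' ESCAPE GZIP;  -- ESCAPE matches the "Escaping only" quoting style
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)   # assumes the github_t table already exists
conn.commit()
conn.close()
```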

What’s Next

You can, for instance, easily create Charts on the Redshift dataset. Because Redshift is fast enough to be used interactively, Charts can be switched to “Live processing” (i.e. Redshift will do most of the aggregation work):

"Charts engine selection"

Because live processing is enabled, Charts will be based on the entire dataset:

"Bar chart with creation date on the X axis, colored by line type"

You may also want to analyze the dataset interactively with your own SQL queries, using a SQL notebook:

"SQL notebook connected to Redshift"

And finally, you could write a SQL recipe in a larger workflow, for instance to use Redshift to aggregate and reduce the size of the base data before passing it to R or Python for deeper analysis, or to create a model from it:

"Flow showing data loaded into S3, Redshift, aggregated with an SQL recipe, analyzed with a Python recipe, and scored with a cluster model"

We’re done. A few steps take your data into Redshift, and you’ll be ready for large-scale analysis in a matter of minutes.