Tutorial | Sentiment analysis plugin#

Binary sentiment analysis is the task of automatically analyzing text to decide whether it is positive or negative. This is useful when you have more text data than you could reasonably label by hand. Dataiku provides a plugin that computes binary sentiment scores for English text data.

Objectives#

We will show you how to:

  • Install the sentiment analysis plugin.

  • Compute sentiment scores for text data.

Prerequisites#

We will be working with IMDB movie reviews. The original data is from the Large Movie Review Dataset, which is distributed as a compressed folder containing many text files, each corresponding to one review. To simplify this how-to, we have provided a single CSV file for download.
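
If you would rather start from the original dataset, the following is a minimal sketch of how the individual review files could be combined into one CSV file. It assumes that each split folder contains `pos` and `neg` subfolders and that file names look like `<id>_<rating>.txt`; the function name `reviews_to_csv` and the column names `text` and `rating` are hypothetical and may not match the provided file.

```python
import csv
from pathlib import Path


def reviews_to_csv(split_dir, csv_path):
    """Collect the reviews under split_dir/pos and split_dir/neg into one CSV.

    Returns the number of reviews written.
    """
    rows = []
    for label, polarity in (("pos", 1), ("neg", 0)):
        for path in sorted(Path(split_dir, label).glob("*.txt")):
            # Assumption: file names look like "<id>_<rating>.txt".
            rating = int(path.stem.split("_")[1])
            text = path.read_text(encoding="utf-8")
            rows.append((text, rating, polarity))

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical column names; "polarity" matches the tutorial's dataset.
        writer.writerow(["text", "rating", "polarity"])
        writer.writerows(rows)
    return len(rows)
```

Calling `reviews_to_csv("aclImdb/test", "imdb_test.csv")` (both paths are placeholders) would then write one row per review.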

Install the plugin#

First you need to install the Sentiment Analysis plugin. This requires Administrator privileges on the Dataiku instance.

Create your project and prepare the data#

  1. Create a new project and give it a name like IMDB Sentiment Analysis.

  2. In the Flow, create a files-based dataset and upload the CSV file you downloaded earlier.

The dataset has three columns:

  • One containing the text of the review.

  • One containing the rating given by the customer on a 1-10 scale.

  • One containing a mapping of that rating to sentiment polarity.

When a text is positive we say that it has a polarity of 1, otherwise we say it has a polarity of 0.
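
As an illustration of how a 1-10 rating could be mapped to a polarity, here is a small sketch. The thresholds (7 and above is positive, 4 and below is negative) are an assumption based on the usual convention for the Large Movie Review Dataset, not something stated in this tutorial, and `rating_to_polarity` is a hypothetical helper.

```python
def rating_to_polarity(rating):
    """Map a 1-10 rating to a binary polarity, or None for neutral ratings."""
    if rating >= 7:   # assumed threshold for a positive review
        return 1
    if rating <= 4:   # assumed threshold for a negative review
        return 0
    return None       # ratings of 5 or 6 have no polarity under this assumption


ratings = [10, 8, 3, 1, 7, 4]
polarities = [rating_to_polarity(r) for r in ratings]
print(polarities)  # [1, 1, 0, 0, 1, 0]
```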

IMDB test sample.

Let’s predict the sentiment of these reviews and then compare the predicted sentiment polarities with the actual values to get a sense of how the plugin works and how well it does.

Compute sentiment scores#

  1. In the project Flow, click on +RECIPE then select the Sentiment Analysis plugin.

    Sentiment analysis plugin dialog.
  2. Select the recipe Compute sentiment scores.

  3. Specify an input dataset where the reviews can be found, and specify an output dataset.

  4. After creating the recipe, select the column containing the texts (here, our movie reviews). Also select the Output confidence scores checkbox, which outputs the model’s confidence in each prediction as a new column.

  5. Click Run.

Compute sentiment scores dialog.

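After a few seconds, the plugin outputs a copy of the original dataset with two additional columns: the predicted sentiment and, because we selected Output confidence scores, the model’s confidence in each prediction.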

Dataset with computed sentiment scores.

You can see that the plugin is sometimes right in its predictions and sometimes wrong. To get a better idea of how well the plugin did on our task, let’s compute an accuracy score (the number of correct predictions divided by the number of reviews). To do that, we go to Status > Edit and create a new Python Probe, using the following code as the metric:

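def process(dataset, partition_id):
    # Load the recipe output as a pandas DataFrame.
    df = dataset.get_dataframe()
    # True where the plugin predicted "positive", False otherwise.
    prediction = df["predicted_sentiment"].values == "positive"
    # Ground-truth polarity: 1 for positive, 0 for negative.
    original = df["polarity"].values

    # Fraction of reviews where the prediction matches the ground truth.
    return {"accuracy": (prediction == original).mean()}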
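
To check the logic of this metric outside Dataiku, the same comparison can be sketched in plain Python on a handful of hand-made values. The sample data below is made up for illustration, the helper name `accuracy` is hypothetical, and the `"negative"` label is an assumption (the probe itself only tests for `"positive"`).

```python
def accuracy(predicted_labels, polarities):
    """Share of rows where the predicted label agrees with the 0/1 polarity."""
    if len(predicted_labels) != len(polarities):
        raise ValueError("inputs must have the same length")
    if not predicted_labels:
        raise ValueError("inputs must not be empty")
    correct = sum(
        # Like the probe, treat anything other than "positive" as negative.
        (label == "positive") == (polarity == 1)
        for label, polarity in zip(predicted_labels, polarities)
    )
    return correct / len(predicted_labels)


sample_predictions = ["positive", "negative", "positive", "negative"]
sample_polarities = [1, 0, 0, 0]
print(accuracy(sample_predictions, sample_polarities))  # 0.75
```

Three of the four sample predictions agree with the polarity, so the result is 0.75.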

Then we run this probe, save it, and get the accuracy in Status > Metrics.

Using the Sentiment Analysis plugin, we get approximately 89.5% accuracy over 25,000 movie reviews.
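
To get a rough sense of how precise this figure is, one option is a normal-approximation confidence interval for a proportion. The sketch below is not part of the plugin; it assumes the reviews can be treated as independent samples, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n samples.

    z=1.96 corresponds to a confidence level of about 95%.
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


low, high = accuracy_interval(0.895, 25000)
print(f"{low:.3f} to {high:.3f}")  # 0.891 to 0.899
```

With 25,000 reviews the interval is narrow, a little under half a percentage point on either side of the measured accuracy.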

What’s next?#

There is a page dedicated to the Sentiment Analysis plugin.

You can also build your own deep learning models for sentiment analysis using Keras and TensorFlow in Dataiku.