Tutorial | Active learning for tabular data classification problems using Dataiku apps

Prerequisites

  • You should be familiar with the basics of machine learning in Dataiku.

Technical requirements

  • Access to a Dataiku instance running a version higher than 8.0, with the ML-assisted Labeling plugin installed.

  • A Python 3.6 code environment named ml-assisted-labeling-visual-ml-python-36, with the following packages installed:

scikit-learn>=0.20,<0.21
scipy>=1.1,<1.2
xgboost==0.81
statsmodels>=0.9,<0.10
jinja2>=2.10,<2.11
flask>=1.0,<1.1
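
The version ranges above are easy to misread (for example, 0.10 is newer than 0.9, even though it sorts lower as text). As a rough illustration only, the sketch below checks a version string against these ranges using plain numeric comparison. The helper names (parse_version, compare, satisfies) are hypothetical and not part of Dataiku or pip, and only simple numeric versions such as 0.20.4 are handled.

```python
import re

# Requirement specifiers copied from the package list above.
REQUIREMENTS = {
    "scikit-learn": ">=0.20,<0.21",
    "scipy": ">=1.1,<1.2",
    "xgboost": "==0.81",
    "statsmodels": ">=0.9,<0.10",
    "jinja2": ">=2.10,<2.11",
    "flask": ">=1.0,<1.1",
}


def parse_version(text):
    """Turn '0.20.4' into (0, 20, 4); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def compare(left, right):
    """Return -1, 0 or 1 after padding the shorter version tuple with zeros."""
    width = max(len(left), len(right))
    left = left + (0,) * (width - len(left))
    right = right + (0,) * (width - len(right))
    return (left > right) - (left < right)


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    checks = {
        ">=": lambda c: c >= 0,
        "<=": lambda c: c <= 0,
        "==": lambda c: c == 0,
        ">": lambda c: c > 0,
        "<": lambda c: c < 0,
    }
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError("Unsupported clause: " + repr(clause))
        operator, bound = match.groups()
        result = compare(parse_version(version), parse_version(bound))
        if not checks[operator](result):
            return False
    return True
```

For instance, satisfies("0.9.0", REQUIREMENTS["statsmodels"]) is true, while satisfies("0.10.0", REQUIREMENTS["statsmodels"]) is false. In practice, the code environment's own package resolution remains the authority on which versions get installed.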

Setting up

Suppose you need to classify article titles depending on whether they look like clickbait or not. We will download a table of unlabeled article titles and label them manually, with the help of active learning, in a dedicated webapp.

Supporting data

We will use a dataset of article titles with a single column. It contains both clickbait and legit titles.
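
Before uploading, it can help to confirm that the file really has one field per row, since titles often contain commas and must be quoted in CSV. The following is a minimal sketch using only the Python standard library. The function name, the header name title, and the two sample rows are made up for illustration and do not come from the tutorial's dataset.

```python
import csv
import io


def summarize_single_column_csv(handle, has_header=True):
    """Return (column_name, titles); fail if any row has more than one field."""
    reader = csv.reader(handle)
    rows = [row for row in reader if row]  # blank lines are skipped
    if not rows:
        raise ValueError("The file is empty")
    # Row numbers below count non-blank rows only.
    for number, row in enumerate(rows, start=1):
        if len(row) != 1:
            raise ValueError(
                "Row %d has %d fields, expected 1" % (number, len(row))
            )
    if has_header:
        return rows[0][0], [row[0] for row in rows[1:]]
    return None, [row[0] for row in rows]


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "title\n"
    "\"10 Things You Won't Believe, Ranked\"\n"
    "Central bank holds interest rates\n"
)
column_name, titles = summarize_single_column_csv(sample)
print(column_name, len(titles))
```

To check a real file, pass an open file handle instead of the io.StringIO sample, for example open(path, newline="", encoding="utf-8") with your own path.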

Create the project

In this tutorial, we will use a Dataiku app to speed up the creation of the Flow.

  1. Go to the application menu and select the ML Assisted Labeling application for tabular data classification.

  2. Click on Start using the application.

    Creation menu of a Tabular Classification App.
  3. Give a name to your project, for example Clickbait.

Labeling setup

You are now presented with the interface of the tabular data classification application. Two steps are required to get started:

  1. Tabular input. Drag and drop your unlabeled CSV file into this area to add the data.

  2. Labeling categories. Enter two categories, clickbait and legit, into the key-value table.

Application settings for a Tabular App in Dataiku

Label the data

  1. Start the labeling webapp by clicking Run now next to Start / Restart the labeling webapp, then wait for the app to start.

    You now have 17260 unlabeled rows. Before you can train a classifier to distinguish clickbait from legit titles, you first have to label some of them.

  2. Click on the Labeling app link next to Label tabular data.

  3. To start labeling, click on one of the category buttons on the right.

    Note

    • If you’re not sure, it’s possible to skip a sample.

    • You may also leave a comment related to a sample.

    • To change the category of an already labeled sample, navigate back using the arrow buttons.

    • To go even faster, use the hotkeys assigned to the labels.

  4. Label a few samples. Make sure that you have several labels per category (see the gray progress bar under each category button).

Once you have enough labeled samples, you can start training your model.
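
The "several labels per category" check can be pictured as a simple count. The sketch below is illustrative only: the threshold of 5 is an assumption (the tutorial does not give a number), the function names are hypothetical, and skipped samples are represented here as None.

```python
from collections import Counter

# The two categories entered in the labeling setup.
CATEGORIES = ("clickbait", "legit")


def label_counts(labels, categories=CATEGORIES):
    """Count labels per category, ignoring skipped or unknown values."""
    counts = Counter(label for label in labels if label in categories)
    return {category: counts.get(category, 0) for category in categories}


def ready_to_train(labels, min_per_category=5, categories=CATEGORIES):
    """True once every category has at least `min_per_category` labels.

    The default of 5 is an arbitrary placeholder, not a Dataiku setting.
    """
    counts = label_counts(labels, categories)
    return all(count >= min_per_category for count in counts.values())


# Made-up labeling session: three clickbait, two legit, one skipped sample.
session = ["clickbait", "legit", "clickbait", None, "legit", "clickbait"]
print(label_counts(session))
print(ready_to_train(session, min_per_category=2))
```

A classifier needs examples of every category to learn from, which is why the check looks at the smallest count rather than the total.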

Labeling webapp after labeling

Generate queries

Now that you have some labeled samples, you can train the first model to enable active learning.

  1. Navigate back to the homepage of the Clickbait Dataiku App.

  2. Click on Run now next to Re-generate queries.

After the queries are generated, the labeling app restarts with active learning enabled.
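
To give an intuition for what generating queries means, here is a minimal sketch of one common active learning strategy, smallest-margin uncertainty sampling: the samples whose top two predicted class probabilities are closest are shown to the labeler first. This tutorial does not say which strategy the application uses, so treat this as a generic illustration. The function names and the probabilities are made up.

```python
def margin(probabilities):
    """Gap between the two highest class probabilities; small means uncertain."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def rank_queries(predictions):
    """Order sample ids from most to least uncertain.

    `predictions` maps a sample id to its list of class probabilities,
    as a model trained on the labeled samples might produce.
    """
    return sorted(
        predictions, key=lambda sample_id: margin(predictions[sample_id])
    )


# Made-up probabilities for (clickbait, legit) on three unlabeled titles.
predictions = {
    "a": [0.95, 0.05],  # model is confident
    "b": [0.55, 0.45],  # model is unsure, so this is labeled first
    "c": [0.70, 0.30],
}
print(rank_queries(predictions))  # ['b', 'c', 'a']
```

Labeling the most uncertain samples first tends to improve the model faster than labeling rows in arbitrary order, which is the point of re-generating queries after each labeling round.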

Active learning enabled

What’s next?

For more on active learning, see the posts on this topic on Data From the Trenches.
