Active Learning for tabular data classification problems using Dataiku Apps
You need access to a DSS instance of a version higher than 8.0 on which the ML-assisted Labeling plugin is installed.
You also need a Python 3.6 code environment called ml-assisted-labeling-visual-ml-python-36, with these packages installed:
scikit-learn>=0.20,<0.21
scipy>=1.1,<1.2
xgboost==0.81
statsmodels>=0.9,<0.10
jinja2>=2.10,<2.11
flask>=1.0,<1.1
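The version ranges above are easy to misread: 0.9 is lower than 0.10 when compared as versions, although not as decimals. As a rough sketch, the following checks a set of installed versions against the listed requirements. The helper names (`parse_version`, `split_requirement`, `satisfies`, `find_mismatches`) are hypothetical and written for this illustration; they are not part of DSS or the plugin, and they only handle plain numeric versions.

```python
import operator
import re

# Requirement strings copied from the package list in this tutorial.
REQUIREMENTS = [
    "scikit-learn>=0.20,<0.21",
    "scipy>=1.1,<1.2",
    "xgboost==0.81",
    "statsmodels>=0.9,<0.10",
    "jinja2>=2.10,<2.11",
    "flask>=1.0,<1.1",
]

OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '0.20.3' into (0, 20, 3); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def split_requirement(requirement):
    """Split 'scipy>=1.1,<1.2' into ('scipy', [('>=', '1.1'), ('<', '1.2')])."""
    match = re.match(r"[A-Za-z0-9_.-]+", requirement)
    name = match.group()
    rest = requirement[match.end():]
    clauses = re.findall(r"(>=|<=|==|<|>)\s*([0-9][0-9A-Za-z.]*)", rest)
    return name, clauses


def satisfies(installed, requirement):
    """Return True if the installed version meets every clause."""
    _, clauses = split_requirement(requirement)
    have = parse_version(installed)
    for op, wanted in clauses:
        want = parse_version(wanted)
        # Pad with zeros so that '0.81' and '0.81.0' compare as equal.
        width = max(len(have), len(want))
        padded_have = have + (0,) * (width - len(have))
        padded_want = want + (0,) * (width - len(want))
        if not OPERATORS[op](padded_have, padded_want):
            return False
    return True


def find_mismatches(installed_versions):
    """Return {package: requirement} for missing or out-of-range packages."""
    problems = {}
    for requirement in REQUIREMENTS:
        name, _ = split_requirement(requirement)
        installed = installed_versions.get(name)
        if installed is None or not satisfies(installed, requirement):
            problems[name] = requirement
    return problems
```

`find_mismatches` takes a dictionary mapping package names to version strings, for example one built by hand from the package list shown for the code environment, and returns the requirements that are not met.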
Suppose you need to classify article titles according to whether they look like clickbait. In this tutorial, you will download a table of unlabeled article titles and label them manually, with the help of active learning, in a dedicated webapp.
The dataset consists of a single column of article titles and contains both clickbait and legit titles.
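Before uploading the file, it can help to confirm that it really has a single column and to count its non-empty rows. The sketch below does this with the standard `csv` module. The function name `read_titles`, the column name `title`, and the two sample titles are made up for illustration and are not taken from the actual dataset.

```python
import csv
import io


def read_titles(handle):
    """Read a one-column CSV and return (column_name, list_of_titles)."""
    reader = csv.reader(handle)
    header = next(reader)
    if len(header) != 1:
        raise ValueError("expected exactly 1 column, found %d" % len(header))
    # Skip blank lines and rows whose only cell is empty.
    titles = [row[0].strip() for row in reader if row and row[0].strip()]
    return header[0], titles


# Illustrative in-memory file; replace with open("your_file.csv", newline="").
sample = io.StringIO(
    "title\n"
    "You won't believe what happened next\n"
    "Central bank holds rates steady\n"
    "\n"
)
column, titles = read_titles(sample)
print(column, len(titles))
```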
Create the Project
In this tutorial, we will use a Dataiku app to speed up the creation of the flow. Go to the application menu, select Tabular data classification - ML Assisted Labeling and click on Start using the application.
Give a name to your project, for example Clickbait.
You are now presented with the interface of the tabular data classification application. Two steps are required to set up the application:
Tabular input: drag and drop your unlabeled CSV file onto this area to add the data.
Labeling categories: enter two categories, clickbait and legit, into the key-value table.
Label the data
Next, start the labeling webapp by clicking on Run now next to Start / Restart the labeling webapp, then wait while the app starts.
You now have 17260 unlabeled rows. Before you can train a classifier to distinguish clickbait from legit titles, you first have to label them.
Click on the Labeling app link next to Label tabular data.
To start labeling, click on one of the category buttons on the right. If you’re not sure, you can skip a sample. You can also leave a comment on a sample. To change the category of an already labeled sample, navigate back using the arrow buttons. The hotkeys assigned to the labels let you go even faster.
Label a few samples, making sure that you have several labels per category (shown by the grey progress bar under each category button). Once you have enough labeled samples, you can start training your model.
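The "several labels per category" check is what the progress bars show in the webapp. As a minimal sketch of the same idea outside the UI, the function below counts labels per category and reports how many more each category needs. The function name and the threshold of 5 are assumptions chosen for illustration; the tutorial does not state an exact minimum.

```python
from collections import Counter


def categories_needing_labels(labels, categories, minimum=5):
    """Return {category: missing_count} for categories below `minimum`.

    `minimum` is an arbitrary illustrative threshold, not a plugin setting.
    """
    counts = Counter(labels)
    return {
        category: minimum - counts[category]
        for category in categories
        if counts[category] < minimum
    }


# Hypothetical labels collected so far.
collected = ["clickbait"] * 5 + ["legit"] * 2
print(categories_needing_labels(collected, ["clickbait", "legit"]))
```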
Now that you have some labeled samples, you can train the first model to enable active learning. Navigate back to the home page of the Clickbait Dataiku App and click on Run now next to Re-generate queries.
After the queries are generated, the labeling app restarts with active learning enabled.
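To give an intuition for what generating queries means, here is a minimal sketch of uncertainty sampling, one common active learning strategy: the samples for which the current model is least sure are proposed for labeling first. This illustrates the general idea only and is not necessarily how the plugin ranks its queries. The function name and the probabilities are hypothetical.

```python
def rank_by_uncertainty(clickbait_probabilities):
    """Return sample indices ordered from most to least uncertain.

    Each input value is the model's predicted probability that a title is
    clickbait. For a two-class problem, a prediction near 0.5 means the
    model cannot decide, so those samples are the most informative to label.
    """
    return sorted(
        range(len(clickbait_probabilities)),
        key=lambda index: abs(clickbait_probabilities[index] - 0.5),
    )


# Hypothetical predictions for four unlabeled titles.
predictions = [0.95, 0.52, 0.10, 0.45]
queue = rank_by_uncertainty(predictions)
print(queue)  # [1, 3, 2, 0]
```

In this example, the title at index 1 (probability 0.52) would be shown first and the title at index 0 (probability 0.95) last, because the model is already fairly confident about it.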
For more on active learning, see the following posts on Data From the Trenches: