Hands-On: Create Your Project and Prepare the Data

The main goal of Image Classification - The Visual Way is to classify a set of images using a pre-trained model. Additional objectives are to install the required plugins, apply an object detection task, and analyze your model to understand how it behaves.

Specifically, you will install the Deep learning on images plugin and work with the lions and tigers image files to classify images using a pre-trained model. You will then perform image feature extraction and analyze your model using the TensorBoard web app template. Finally, you will install the Object detection in images plugin and detect objects in images.


To get an idea of the steps to be completed in the hands-on lessons, you can always visit the concept video at the beginning of each section in the course.

In addition, you can explore a similar project in the Dataiku gallery.


You should complete the Machine Learning Basics course prior to beginning this one. You should also have access to an instance of Dataiku DSS that allows you to install plugins.

Create Your Project

From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Image Classification - The Visual Way (Tutorial). Click on Go to Flow.

In the Flow, you can see two folders, Images to classify and Images for retraining.

Image classification project, showing two folders


To create your own project, you can download the lions_and_tigers.zip (189MB) file from the Dataiku website and extract the contents. Then, in your project, create two folders (from the + Dataset dropdown, select Folder) named Images to classify and Images for retraining. Populate each folder with the contents of the same-named folder from the zip file by dragging and dropping the images from your computer into the folder.
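If you prefer to script the extraction step, the following minimal sketch uses Python's standard zipfile module. The archive path and destination directory are placeholders, and the helper name extract_and_summarize is made up for illustration. It only counts files per top-level directory so you can confirm what the archive contains before uploading the images.

```python
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def extract_and_summarize(zip_path, dest_dir):
    """Extract zip_path into dest_dir and count files per top-level directory."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Zip entries always use "/" as the separator, whatever the OS
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    counts = Counter()
    for name in names:
        parts = PurePosixPath(name).parts
        # Files stored at the archive root are grouped under "."
        top = parts[0] if len(parts) > 1 else "."
        counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    # Placeholder paths: point these at your own download and target directory
    archive_path = Path("lions_and_tigers.zip")
    if archive_path.exists():
        summary = extract_and_summarize(archive_path, "lions_and_tigers")
        for folder, count in sorted(summary.items()):
            print(f"{folder}: {count} files")
```

The images still need to be uploaded into the Dataiku folders as described above; this sketch only prepares and inspects the files on your computer.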

The zip file also contains a Python script file. If you choose to use the Python script file, be sure to change the reference to the Dataiku folder from 63RF8lzq to the reference in your project (it’s visible in the left panel under the Input Datasets).

Prepare the Data

To use the images for retraining, we need a dataset that labels each image as a lion or a tiger. Fortunately, the name of each image file contains the text “lion” or “tiger”, correctly identifying which big cat is in the image. We can process the filenames with Python code to create the Labels dataset.
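To see the filename rule in isolation, here is a small sketch that runs outside DSS with no Dataiku dependency. It is not the recipe code: the function name label_for and the sample paths are made up for illustration, and it returns the first matching label rather than reproducing the recipe line by line.

```python
LABELS = ("lion", "tiger")


def label_for(path):
    """Return the first label found in the filename, or None if none matches."""
    filename = path.rsplit("/", 1)[-1]
    for label in LABELS:
        if label in filename:
            return label
    return None


# Hypothetical paths with a leading "/", as in a folder listing
sample_paths = ["/lion_001.jpg", "/tiger_042.jpg", "/Lion_007.jpg", "/notes.txt"]
labeled = {path: label_for(path) for path in sample_paths}

for path, label in labeled.items():
    print(path, "->", label)
```

Note that the substring match is case-sensitive, so a file named with a capitalized "Lion" would get no label under this rule, and neither would any file whose name contains neither word.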

Select the Images for retraining folder and choose the Python code recipe from the actions menu. In the recipe creation dialog, create a new dataset called Labels. Then click Create Recipe.

Replace the code with the following:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Read recipe inputs: the folder that holds the training images
images_for_retraining = dataiku.Folder("Images for retraining")

# List the path of every file in the folder; each path starts with "/"
paths = images_for_retraining.list_paths_in_partition()

LABEL_0 = "lion"
LABEL_1 = "tiger"

# Build one row per image: the path without its leading "/" and the label
pandas_dataframe = pd.DataFrame(columns=['path', 'label'])
for idx, path in enumerate(paths):
    if LABEL_0 in path:
        pandas_dataframe.loc[idx] = [path[1:], LABEL_0]
    if LABEL_1 in path:
        pandas_dataframe.loc[idx] = [path[1:], LABEL_1]

# Write recipe outputs
labels = dataiku.Dataset("Labels")
labels.write_with_schema(pandas_dataframe)

In the completed Python script, the dataiku.Folder() call should reference either the unique ID of your folder or the name of the folder (“Images for retraining”).


Click Run, then explore the output dataset to ensure it looks right.
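One way to make "looks right" concrete is to check a few simple properties of the rows. The sketch below uses only the standard library and hypothetical sample rows standing in for the Labels dataset. The function name check_labels and the specific checks are assumptions about what is worth verifying, not part of the tutorial.

```python
from collections import Counter

EXPECTED_LABELS = {"lion", "tiger"}


def check_labels(rows):
    """Return (counts per label, list of problems) for (path, label) rows."""
    counts = Counter()
    problems = []
    for path, label in rows:
        counts[label] += 1
        if label not in EXPECTED_LABELS:
            problems.append(f"unexpected label {label!r} for {path}")
        elif label not in path:
            problems.append(f"label {label!r} does not appear in {path}")
        if path.startswith("/"):
            problems.append(f"path still has a leading slash: {path}")
    return dict(counts), problems


# Hypothetical rows standing in for the Labels dataset
sample_rows = [
    ("lion_001.jpg", "lion"),
    ("tiger_042.jpg", "tiger"),
    ("/tiger_043.jpg", "tiger"),
]
counts, problems = check_labels(sample_rows)
print(counts)
print(problems)
```

Inside DSS, the rows could instead come from the Labels dataset, for example by reading it in a notebook with dataiku.Dataset("Labels").get_dataframe() and iterating over the path and label columns.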


Now the data is ready for use!