Hands-On: Create Your Project and Prepare the Data

In the hands-on lessons, you will install the Deep learning on images plugin and use the lions and tigers image files to classify images with a pre-trained model. You will then perform image feature extraction and analyze your model using the TensorBoard web app template. Finally, you will install the Object detection in images plugin and detect objects in images.

Note

To get an idea of the steps to be completed in the hands-on lessons, you can always visit the concept video at the beginning of each section in the course.

In addition, you can explore a similar project in the Dataiku gallery.

Prerequisites

You should complete the Machine Learning Basics course prior to beginning this one. You should also have access to an instance of Dataiku DSS that allows you to install plugins.

Create Your Project

From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Image Classification - The Visual Way (Tutorial). Click on Go to Flow.

In the Flow, you can see two folders, Images to classify and Images for retraining.

Image classification project, showing two folders

Note

To create your own project, you can download the lions_and_tigers.zip (189 MB) file from the Dataiku website and extract the contents. Then, in your project, create two folders (from the + Dataset dropdown, select Folder) named Images to classify and Images for retraining. Populate each folder with the contents of the same-named folder from the zip file by dragging and dropping the images from your computer into the folder.

The zip file also contains a Python script file. If you choose to use it, be sure to change the Dataiku folder reference from 63RF8lzq to the ID of the folder in your project (visible in the left panel, under Input Datasets).
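If you would rather script the extraction step described in this note, a minimal sketch using Python's standard zipfile module might look like the following. The helper name extract_archive and the local paths in the usage comment are hypothetical; only the archive name lions_and_tigers.zip comes from the text above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract a zip archive and return the extracted file paths, sorted."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Keep files only; directories are skipped
    return sorted(p for p in target.rglob("*") if p.is_file())


# Example usage (hypothetical local paths, adjust to your download location):
#   for path in extract_archive("lions_and_tigers.zip", "lions_and_tigers"):
#       print(path)
```

Listing the extracted files is a quick way to confirm that both image folders are present before you drag their contents into the Dataiku folders.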

Prepare the Data

To use the images for retraining, we need a dataset that labels each image as a lion or a tiger. Fortunately, each image filename contains the text “lion” or “tiger”, which identifies the big cat in the image. We can process the filenames with Python code to create the Labels dataset.
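To see the idea in isolation, here is a minimal sketch that runs outside DSS on a hard-coded list of paths. The sample filenames and the helper names label_from_path and build_labels are hypothetical; in the recipe, the paths come from the managed folder instead.

```python
import pandas as pd

LABELS = ("lion", "tiger")


def label_from_path(path):
    """Return the first label found in the path, or None if there is none."""
    for label in LABELS:
        if label in path:
            return label
    return None


def build_labels(paths):
    """Build a DataFrame with one row per path that contains a known label."""
    rows = [
        {"path": p.lstrip("/"), "label": label_from_path(p)}
        for p in paths
        if label_from_path(p) is not None
    ]
    return pd.DataFrame(rows, columns=["path", "label"])


# Hypothetical filenames, written with a leading "/" that is stripped
# before the path is stored
sample_paths = ["/lion_001.jpg", "/tiger_042.jpg", "/readme.txt"]
labels_df = build_labels(sample_paths)
print(labels_df)
```

Files whose names contain neither label, such as the readme.txt entry in the sample, are simply left out of the result.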

Select the Images for retraining folder and choose the Python code recipe from the actions menu. In the recipe creation dialog, create a new dataset called Labels. Then click Create Recipe.

Replace the code with the following:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Read recipe inputs. Replace "T6uuiPzy" with the ID of the
# Images for retraining folder in your own project if it differs.
images_for_retraining = dataiku.Folder("T6uuiPzy")

# Paths of all files in the folder
paths = images_for_retraining.list_paths_in_partition()

LABEL_0 = "lion"
LABEL_1 = "tiger"

# Build one row per image: the path without its first character
# (the leading slash) and the label found in the filename
rows = []
for path in paths:
    if LABEL_0 in path:
        rows.append({"path": path[1:], "label": LABEL_0})
    elif LABEL_1 in path:
        rows.append({"path": path[1:], "label": LABEL_1})

pandas_dataframe = pd.DataFrame(rows, columns=["path", "label"])

# Write recipe outputs
labels = dataiku.Dataset("Labels")
labels.write_with_schema(pandas_dataframe)

The completed Python script displays as follows:

Python recipe editor, showing the completed script

Click Run, then explore the output dataset to ensure it looks right.

Labels dataset, showing the path and label columns
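Beyond a visual check, a few programmatic checks can confirm that the output is sane. The sketch below is one possible set of checks and is not part of the original tutorial. check_labels is a hypothetical helper, and the sample frame stands in for the real data, which you could load in a DSS notebook with dataiku.Dataset("Labels").get_dataframe().

```python
import pandas as pd


def check_labels(df, expected_labels=("lion", "tiger")):
    """Return a list of problems found in a labels DataFrame (empty if none)."""
    problems = []
    if list(df.columns) != ["path", "label"]:
        # Without the expected columns, the remaining checks cannot run
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df.empty:
        problems.append("dataset is empty")
    if df["path"].duplicated().any():
        problems.append("duplicate paths")
    unknown = set(df["label"].dropna()) - set(expected_labels)
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    if df["label"].isna().any():
        problems.append("missing labels")
    return problems


# Hypothetical sample standing in for the Labels dataset
sample = pd.DataFrame(
    {"path": ["lion_001.jpg", "tiger_042.jpg"], "label": ["lion", "tiger"]}
)
print(check_labels(sample))            # list of problems found, if any
print(sample["label"].value_counts())  # class balance at a glance
```

The value counts are worth a glance because a heavily unbalanced split between lions and tigers would affect retraining.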

Now the data is ready for use!