Hands-On: Create Your Project and Prepare the Data
Specifically, you will install the Deep learning on images plugin and use the lions and tigers image files to classify images with a pre-trained model. You will then perform image feature extraction and analyze your model using the TensorBoard web app template. Finally, you will install the Object detection in images plugin and detect objects in images.
To get an idea of the steps to be completed in the hands-on lessons, you can always visit the concept video at the beginning of each section in the course.
In addition, you can visit the Dataiku gallery project that shows a similar project.
You should complete the Machine Learning Basics course prior to beginning this one. You should also have access to an instance of Dataiku DSS that allows you to install plugins.
Create Your Project
From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Image Classification - The Visual Way (Tutorial). Click on Go to Flow.
In the Flow, you can see two folders, Images to classify and Images for retraining.
Alternatively, to build the project yourself, download the lions_and_tigers.zip (189MB) file from the Dataiku website and extract the contents. Then, in your project, create two folders (from the + Dataset dropdown, select Folder) named Images to classify and Images for retraining. Populate each folder with the contents of the same-named folder from the zip file by dragging and dropping the images from your computer into the folder.
The zip file also contains a Python script file. If you choose to use the Python script file, be sure to change the reference to the Dataiku folder from 63RF8lzq to the reference in your project (it’s visible in the left panel under the Input Datasets).
Prepare the Data
To use the images for retraining, we need a dataset that labels each image as a lion or a tiger. Fortunately, the name of each image file contains the text “lion” or “tiger”, correctly identifying which big cat is in the image. We can process the filenames with Python code to create the Labels dataset.
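The labeling idea can be tried outside DSS before writing the recipe. The following is a minimal sketch in plain Python, assuming paths shaped like the ones a managed folder returns (a leading slash followed by the file name). The helper `label_from_path` and the sample paths are hypothetical and are not part of the tutorial's script.

```python
# Labels to look for in each file name (same two labels as the tutorial).
LABELS = ("lion", "tiger")


def label_from_path(path):
    """Return the first label whose text appears in the path, or None."""
    for label in LABELS:
        if label in path:
            return label
    return None


# Hypothetical sample paths, used only for illustration.
sample_paths = ["/lion_001.jpg", "/tiger_042.jpg", "/notes.txt"]

# Build (path, label) rows, skipping files that match neither label.
rows = []
for path in sample_paths:
    label = label_from_path(path)
    if label is not None:
        rows.append((path.lstrip("/"), label))

for row in rows:
    print(row)
```

Like the tutorial's recipe, this match is case-sensitive, so a file named `Lion_01.jpg` would be skipped unless the path is lowercased first.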
Select the Images for retraining folder and choose the Python code recipe from the actions menu. In the recipe creation dialog, create a new dataset called Labels. Then click Create Recipe.
Replace the code with the following:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
# Replace "T6uuiPzy" with the ID of the Images for retraining folder in your project
images_for_retraining = dataiku.Folder("T6uuiPzy")
images_for_retraining_info = images_for_retraining.get_info()

paths = images_for_retraining.list_paths_in_partition()

LABEL_0 = "lion"
LABEL_1 = "tiger"

pandas_dataframe = pd.DataFrame(columns=['path', 'label'])

# j[1:] drops the first character of each path (the leading "/")
for i,j in enumerate(paths):
    if LABEL_0 in j:
        pandas_dataframe.loc[i] = [j[1:], LABEL_0]
    if LABEL_1 in j:
        pandas_dataframe.loc[i] = [j[1:], LABEL_1]

# Write recipe outputs
labels = dataiku.Dataset("Labels")
labels.write_with_schema(pandas_dataframe)
The completed Python script displays as follows:
Click Run, then explore the output dataset to ensure it looks right.
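Beyond eyeballing the output, a few programmatic checks can confirm that the Labels dataset "looks right". The sketch below is one possible set of checks, assuming the dataset has the `path` and `label` columns produced by the recipe. The `check_labels` helper and the sample rows are hypothetical; in a DSS notebook the dataframe could instead be loaded with `dataiku.Dataset("Labels").get_dataframe()`.

```python
import pandas as pd


def check_labels(df, expected_labels=("lion", "tiger")):
    """Return a list of human-readable problems found in a labels table."""
    problems = []
    if df.empty:
        problems.append("dataset is empty")
    if df["path"].duplicated().any():
        problems.append("duplicate paths")
    unexpected = set(df["label"].dropna()) - set(expected_labels)
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    if df["label"].isna().any():
        problems.append("missing labels")
    return problems


# Hypothetical stand-in for the Labels dataset, used only for illustration.
labels_df = pd.DataFrame(
    {"path": ["lion_001.jpg", "tiger_042.jpg"], "label": ["lion", "tiger"]}
)

# Class balance matters for retraining, so print the count per label too.
print(labels_df["label"].value_counts())
print(check_labels(labels_df))
```

An empty list from `check_labels` means none of these particular checks failed; it does not guarantee that every image was labeled correctly.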
Now the data is ready for use!