Tutorial | Image classification without code#

Introduction#

Deep learning models are powerful tools for image classification, but they are difficult and expensive to create from scratch.

Dataiku provides several pre-trained deep learning models that you can use to classify images. You can also re-train a model to specialize it on a particular set of images.

In this tutorial, you will:

  • Classify images of healthy and diseased bean plant leaves using a pre-trained model.

  • Learn how to evaluate and fine-tune the model.

  • Use the model to classify new images.

  • Evaluate the model’s performance.

When finished, you’ll have built the Flow below.

Screenshot showing the final Flow with an image classification model and testing.

Prerequisites#

  • A Dataiku instance (version 11.3 or above). The free edition is enough; Dataiku Cloud is not compatible.

  • A computer vision code environment, which you can set up in the Applications menu of Dataiku under Administration > Settings > Misc.

Create the project#

From the Dataiku homepage, select +New Project > DSS tutorials > ML Practitioner > Image Classification without Code.

Explore the data#

In the Flow, you’ll see two folders called bean_images_train and bean_images_test that contain the images for our classification model. The images are stored in managed folders, which is a requirement for running image classification models in Dataiku.

Images for the model are stored in two managed folders in the Flow.

This dataset is from a project by the Artificial Intelligence Lab at Makerere University that uses image classification to help identify disease among bean crops in Uganda. The plants are classified as healthy or infected with bean rust or with angular leaf spot diseases.

Take a moment to browse the files in the bean_images_train folder to get a sense of the images we’ll be classifying. The nearly 500 images are divided into three subfolders by class: healthy, bean_rust, and angular_leaf_spot.

Images are organized into three folders depending on class.

The Flow also contains a folder called bean_images_test, with images we’ll use to test our model after training. That folder has a similar structure with three subfolders representing each image class.

Prepare the data for image classification#

Before we can create an image classifier, we need to build a tabular dataset that tells our model the file path where it can find each image. The dataset also will include the class for each image, which will be the target for our model to predict. We’ll do this using the List Contents recipe, which lists all files in a folder, their file paths, and other information.

  1. With the bean_images_train folder highlighted, go to the Actions menu on the right panel and select the List Contents recipe from Visual recipes.

  2. In the New List Folder Contents recipe dialog, keep the default settings, name the output dataset bean_images_train_files, and create the recipe.

  3. In the recipe Settings tab, select +Add Level Mapping and set the Folder level to 1, and the Column name to target.

    This creates a column called target that contains the name of the folder where each image is located, which is the image’s category, or class.

  4. Run the recipe.

The List Contents recipe builds a tabular dataset with information about files in managed folders.

The resulting table bean_images_train_files contains 492 rows with the name and file path of each image, along with a column called target where each image is labeled as healthy, bean_rust, or angular_leaf_spot.

Screenshot showing the bean_images_train_files dataset created with the recipe.
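To picture what the recipe produces, here is a minimal Python sketch that walks a local directory and derives a target column from the first folder level. It is not Dataiku's implementation: the function name and the path and basename column names are assumptions made for illustration.

```python
from pathlib import Path


def list_folder_contents(root, level=1, column="target"):
    """Return one row per file under root.

    Each row holds the file's path relative to root, its file name, and the
    name of the folder found at the given depth (1 = first level below root).
    """
    root = Path(root)
    rows = []
    for file_path in sorted(p for p in root.rglob("*") if p.is_file()):
        parts = file_path.relative_to(root).parts
        # parts[:-1] are folder names, parts[-1] is the file name.
        folder = parts[level - 1] if len(parts) > level else None
        rows.append({
            "path": "/" + "/".join(parts),   # hypothetical column name
            "basename": file_path.name,      # hypothetical column name
            column: folder,
        })
    return rows
```

Calling list_folder_contents on a folder laid out like bean_images_train would yield rows whose target value is healthy, bean_rust, or angular_leaf_spot, depending on the subfolder each image sits in.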

Fine-tune a pre-trained model#

After preparing the data, we are ready to use one of the pre-trained image classification models in Dataiku.

Note

Before building an image classification model, you must set up a specific code environment. Go to Administration > Settings > Misc. In the section DSS internal code environment, create an Image classification code environment by selecting your Python interpreter and clicking on Create the environment. Dataiku will install all the required packages for image classification and add the new code environment to the Code Envs tab.

Create the model#

In this section, we’ll create a model in the Lab, and later we’ll deploy it in the Flow.

  1. In the Flow, highlight the bean_images_train_files dataset and go to the Lab, then choose Image Classification.

  2. The next window asks you to define the model’s target, or what categories it will predict. Choose the target column.

  3. Select the bean_images_train folder in the Image folder dropdown to tell the model where to find the images for training.

  4. Name your model or leave the default name and select Create.

The Create image classification info window.

Dataiku creates the image classification model and adds it to the Lab, then navigates to the model Design tab where you can preview images and settings. Before training the model, let’s review the input and settings.

Check the model settings#

The Basic > Target panel shows the Target column and Image location we input when creating the model. It also recognizes the Path column so the model can find each image. Double-check that these settings are correct.

Under Target classes, the model automatically recognizes that we have created a classification task with three classes of healthy or diseased bean plants. You can preview the images and filter them by selecting each class in the bar chart.

Screenshot showing the Target panel for our model and previewing the images.

Dataiku will also automatically split the images into training and validation sets so the model can continually test its performance during the training phase. You can see these settings under Basic > Train/Test Set. This panel also contains sampling settings, which you might want to change when using larger datasets.

Screenshot showing the test/train set for evaluation while training the model.
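The idea of holding out part of the images for validation can be sketched in a few lines of Python. This is only an illustration: the 80/20 ratio, the fixed seed, and the function name are assumptions, not the values Dataiku necessarily uses.

```python
import random


def split_train_validation(rows, validation_ratio=0.2, seed=1337):
    """Shuffle rows reproducibly and hold out a fraction for validation."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_ratio))
    # The first n_validation rows are held out; the rest are used for training.
    return shuffled[n_validation:], shuffled[:n_validation]
```

With the 492 training images from this tutorial and the assumed ratio, such a split would leave 394 images for training and 98 for validation.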

Choose a pre-trained model#

Dataiku provides three pre-trained neural networks that are widely used and considered industry standards. You can see all the models and choose one to use in the Design tab under Basic > Training > Model.

  • EfficientNet B0: Efficiency-oriented

  • EfficientNet B4: Balanced between efficiency and performance

  • EfficientNet B7: Performance-oriented

Balanced (EfficientNet B4) is the default model, and we will use it in this tutorial.

Screenshot showing the three pre-trained model options in Dataiku.

Also in the Model section, you can specify how many layers of the model to retrain on your images. Convolutional neural networks (CNN) have many different layers that are pre-trained on millions of images. Retraining the final layer or final few layers helps the model learn on your specific images.

Dataiku’s models always retrain the final layer, also called the classifier layer, to adapt it to the use case at hand. The default setting of 0 under Number of fine-tuned layers means that only this final layer will be retrained. Inputting 1 here means that two layers will be fine-tuned, and so on. Adding fine-tuned layers can increase performance but also increases processing time. We will use the default of 0.
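The off-by-one meaning of this setting can be illustrated with a small Python sketch. The layer names below are hypothetical placeholders, and the function only mirrors the rule described above; it does not reflect how Dataiku configures the network internally.

```python
def trainable_layers(layer_names, n_fine_tuned=0):
    """Return the layers that would be retrained.

    The last (classifier) layer is always retrained; n_fine_tuned adds that
    many layers immediately before it.
    """
    if n_fine_tuned < 0:
        raise ValueError("n_fine_tuned must be zero or positive")
    n_trainable = min(len(layer_names), n_fine_tuned + 1)
    return layer_names[len(layer_names) - n_trainable:]


# Hypothetical layer names, ordered from input to output.
layers = ["block_1", "block_2", "block_3", "classifier"]
print(trainable_layers(layers, 0))  # only the classifier layer
print(trainable_layers(layers, 1))  # the classifier plus one more layer
```

All layers not returned by the function stay frozen, meaning they keep the weights learned during pre-training.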

In the Optimization and Fine-tuning sections, values are set to industry standards, and in most cases you will not change these.

For purposes of this tutorial, if you want the model to finish training more quickly, make sure Early stopping is selected and change the Early stopping patience to 3.

This means the model will stop cycling through images if it does not detect any performance improvement after three cycles, or epochs. We’ll discuss epochs and early stopping patience further in the Concept | Optimization of image classification models lesson.
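Early stopping with a patience value can be sketched as the loop below. It is a simplified illustration, assuming the validation score of each epoch is already known; the function name and the example scores are made up for the sketch.

```python
def train_with_early_stopping(scores_per_epoch, patience=3):
    """Return (best_epoch, last_epoch_run), both 1-based.

    Training stops once `patience` consecutive epochs pass without the
    validation score improving on the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores_per_epoch, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores_per_epoch)


# Made-up validation scores for seven epochs.
scores = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.90]
print(train_with_early_stopping(scores, patience=3))
```

In this made-up run, a patience of 3 halts training after epoch 6 and never reaches the better score at epoch 7, which shows the trade-off: a lower patience saves time but can stop before a later improvement.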

In the Training panel, you can control the model optimization and fine-tuning parameters.

Train the model#

  1. When you are finished viewing the settings, select Save.

  2. Click Train at the top right to begin training the model.

  3. In the window, give your model a name or use the default and select Train.

Note

If your Dataiku instance is running on a server with a GPU, you can activate the GPU for training so the model can process much more quickly. Otherwise, the model will run on CPU and the training will take longer.

Model training can take time depending on your computer’s memory capacity. During training, you can view the chart in the Results tab to track the performance of your model at the end of each epoch. The default metric to evaluate and maximize the model’s performance is ROC AUC (area under the ROC curve), but other metrics such as Precision or Accuracy are available in the Metric dropdown above the chart.

The chart in the Results tab tracks performance of your model at the end of each epoch.

Tip

You can review the concepts behind ROC AUC and various other metrics in the Model Evaluation concept from the Dataiku Knowledge Base.

After training completes, you can assess the performance of your model before deploying it to the Flow.

Model metrics and explainability#

In the previous sections, we created and trained a model to classify images of bean plants as healthy, angular leaf spot, or bean rust. In this section, we’ll view a number of different metrics and reports to understand the model and its performance.

Diagnostics#

In training sessions where Dataiku detects some performance issues, such as in our example, a Diagnostics button appears when training is completed.

  1. Click on the Diagnostics button to view the information.

Note

If your model training session doesn’t trigger a Diagnostics button, you can go directly to the results summary by clicking on your model name under Session 1 to the left of the graph.

Screenshot showing the model Results page with Diagnostics button.

The Diagnostics panel gives some Dataset sanity checks with some potential ways to improve model performance, along with Leakage detection or Abnormal predictions detection, if applicable. In our model, the sanity checks suggest the training and testing sets are too small for robust performance. This was by design so the model would run faster for this tutorial, but in most situations, you would want to provide more images.

Screenshot of dataset sanity checks.

Results summary#

To view a more in-depth summary of model performance, navigate to Summary in the left panel. The summary shows our ROC AUC of 0.876 (your results may vary), which means the model had fairly good performance. The AUC is always between 0 and 1. The closer to 1, the higher the performance of the model.
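For intuition about what the AUC measures, the sketch below computes it for a single class treated as positive versus everything else. It uses the pairwise definition (the share of positive/negative pairs that the model ranks correctly) and is not Dataiku's implementation; how the per-class values are combined for three classes is not covered here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: truthy for images of the class of interest, falsy otherwise.
    scores: predicted probability of that class for each image.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to ranking the pairs no better than chance, which is why a score such as 0.876 indicates fairly good, but not perfect, separation.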

Screenshot showing the model summary.

Take a moment to review the information given in the model summary.

What if?#

You can upload new images directly into the model to classify them, view the probabilities of each class, and see how the algorithm is making a decision. This information can give you insight into how the algorithm works and potential ways to improve it.

To see how this works, we’ll input a new image our model has not seen before.

  1. Download the file bean_disease_whatif, which is an image of plants with angular leaf spot.

  2. Navigate to What if? on the left panel.

  3. Either drag and drop the new image onto the screen or click Browse for images and find the image to upload.

    The model classifies this image as having angular leaf spot disease, which is the correct classification. On the right, we can see the probabilities of each class were fairly close, suggesting that the model may be having a difficult time distinguishing between classes. One way to fix this would be to add more images of angular leaf spot to help the model recognize that disease.

  4. Hover over each of the classes with your mouse, and a heatmap will appear to show which parts of the image the algorithm focused on when computing the probability that the image falls into that class.

A heatmap shows which parts of the image the model focused on to classify an image.

If the prediction in What if? is wrong, you might see that the algorithm is focusing on the wrong parts of an image. You can fine-tune an additional layer of your model or change the data augmentation to make the algorithm more robust, as we’ll explore in the lesson Concept | Optimization of image classification models.

Confusion matrix#

To further check the model’s performance, you can view a confusion matrix showing how many images were correctly and incorrectly classified during training. In the Explainability section of the left panel, choose Confusion matrix.

In the model pictured below, you can see that 27 images with a ground truth of healthy were predicted as healthy, making these correct predictions. However, five images were predicted to have bean rust but were really healthy. Again, your results may vary.

Confusion matrix showing the breakdown of true vs predicted classes for the images.

Click on the portion of the matrix with a ground truth of angular_leaf_spot and predicted class of bean_rust (in this example, the 13 in the bottom middle, though your numbers likely will be different). The images will appear on the right.

Screenshot showing image browser in the confusion matrix.

You can click on any image to view its information and use the arrow to browse through all 13 images. Doing so, you can quickly see that images of bean rust and angular leaf spot can be very similar, causing the model to make some incorrect predictions. Adding more images of angular leaf spot might help the model differentiate.

Click on other sections of the confusion matrix to explore images that were correctly or incorrectly classified.

Screenshot showing image details from the confusion matrix image browser.
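The confusion matrix and its image browser boil down to grouping images by their (ground truth, predicted class) pair. Here is a minimal Python sketch of that grouping; the file names in the example are hypothetical and the function is an illustration rather than Dataiku's own logic.

```python
from collections import defaultdict


def confusion_cells(rows):
    """Group image paths by (ground truth, predicted class).

    rows: iterable of (path, truth, predicted) tuples.
    The length of each list is the count shown in that matrix cell.
    """
    cells = defaultdict(list)
    for path, truth, predicted in rows:
        cells[(truth, predicted)].append(path)
    return dict(cells)


# Hypothetical example rows.
rows = [
    ("a.jpg", "healthy", "healthy"),
    ("b.jpg", "healthy", "bean_rust"),
    ("c.jpg", "angular_leaf_spot", "bean_rust"),
    ("d.jpg", "angular_leaf_spot", "bean_rust"),
]
cells = confusion_cells(rows)
# Images with a ground truth of angular_leaf_spot predicted as bean_rust.
print(cells[("angular_leaf_spot", "bean_rust")])
```

Cells where both values match hold the correct predictions; every other cell lists images worth inspecting for patterns in the model's mistakes.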

Density chart#

Another useful measure of performance is the density chart, which illustrates how the model succeeds in recognizing the classes.

  • Under the Performance section, select Density chart.

The two curves on the chart show the probability density of images that belong to the observed class vs. images that don’t. In the chart below, the blue density function on the left shows the predicted probability of healthy for images that were not healthy, and the orange density function on the right shows it for images that were healthy. Ideally the two density functions are fully separated, so the farther apart they are, the better the model performed.

View density functions for each class by choosing each class from the dropdown menu in the top left of the chart.

The density chart illustrating the probability density of images showing healthy vs not healthy leaves.
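A rough way to reproduce the idea behind this chart is to bin the predicted probability of one class separately for images that belong to it and images that do not. The sketch below uses simple histogram counts rather than the smoothed curves of the chart, and its function name and inputs are assumptions for illustration.

```python
def probability_histograms(truths, class_probabilities, observed_class, n_bins=10):
    """Count predicted probabilities per bin for in-class and other images.

    truths: ground-truth class of each image.
    class_probabilities: predicted probability of observed_class per image.
    Returns (in_class_counts, out_of_class_counts), each of length n_bins.
    """
    in_class = [0] * n_bins
    out_of_class = [0] * n_bins
    for truth, p in zip(truths, class_probabilities):
        # A probability of exactly 1.0 falls into the last bin.
        bin_index = min(int(p * n_bins), n_bins - 1)
        if truth == observed_class:
            in_class[bin_index] += 1
        else:
            out_of_class[bin_index] += 1
    return in_class, out_of_class
```

When the in-class counts pile up in the high bins and the other counts in the low bins, the two distributions are well separated, which corresponds to the ideal case described above.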

Deploy the model to the Flow#

Now that we have reviewed performance of the model, we can deploy it to the Flow so we can use it to make predictions on new images.

From the model report where we viewed the performance metrics, click on Deploy in the top right corner.

Note

If you close the model before deploying it, it will not appear in the Flow. To find the model, you can click on the training files dataset and go to the Lab, or select the Visual Analyses menu from the top navigation bar.

The deploy button is found in the top right of the model report page.

The model training recipe and model now appear in the project Flow.

The Flow so far shows the model training recipe and deployed model.

Classify new images with the model#

After training the model, viewing various metrics, and optionally fine-tuning the model and retraining, we can now use the model to classify a batch of images it has not seen before.

For purposes of this tutorial, we will use images where we know the classification. This can help us further test the performance of our model. However, we also could input images for which we don’t know the classification in order to make useful predictions.

Prepare the data for classification#

Before we can make predictions, we need to create a tabular dataset with the file path and target information, similar to the one we built for the training set.

  1. With the bean_images_test folder highlighted, go to the Actions menu and select the List Contents recipe from Visual recipes.

  2. In the info window, keep the default settings, name the output dataset bean_images_test_files, and create the recipe.

  3. Select +Add level mapping and set the Folder level to 1, and the Column name to target.

  4. Run the recipe.

This recipe creates a new dataset in the Flow called bean_images_test_files with information on each of the 82 test images.

Screenshot showing the bean_images_test_files tabular dataset created by the List Contents recipe.

Run the model on the new images#

Now we can apply the model to the new images in bean_images_test.

  1. Select the model image classification on bean_images_train_files in the Flow.

  2. In the right panel, under Apply model on data to predict, select Score.

  3. In the info window, set the Input dataset to bean_images_test_files and the Managed folder to bean_images_test.

  4. Name the output dataset or keep the default and select Create recipe.

    Screenshot showing the Score a dataset info window.

    Dataiku navigates to the Settings tab for the scoring recipe. Here you can change the batch size, which is how many images are scored at a time, or activate GPU to run scoring more quickly if that is available to you.

  5. Leave the default batch size of 2 and Run the recipe.

    Screenshot showing the scoring recipe settings.

    Dataiku creates a new dataset in the Flow called bean_disease_test_scored.

  6. Open the dataset bean_disease_test_scored to explore.

  7. Scroll to the right to view the final five columns that include the target, prediction, and probability of each class.

    Screenshot showing the scored dataset in tabular format.

A quick scan of the probabilities tells us again that the model determined very similar probabilities for each of the three classes on many images.
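One way to quantify how close the probabilities are is to look at the gap between the two most likely classes for each image. The sketch below assumes the three class probabilities of one row are available as a dictionary; the function name and the example values are hypothetical.

```python
def top_two_margin(probabilities):
    """Return (predicted_class, margin) for one scored image.

    probabilities: dict mapping class name to predicted probability.
    The predicted class is the most probable one; the margin is the gap
    between it and the runner-up, so a small margin means a hesitant model.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_class, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_class, best_p - second_p


# Hypothetical probabilities for a single image.
row = {"healthy": 0.30, "bean_rust": 0.34, "angular_leaf_spot": 0.36}
predicted, margin = top_two_margin(row)
print(predicted, round(margin, 2))
```

Sorting the scored images by this margin would surface the least certain predictions first, which are good candidates for manual review.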

We can view the entire dataset as images to help understand the predictions.

  1. In the top right, click on the Image view button.

    The image view button can be found in the top right of the dataset explore view.

    The annotations that appear in this view are the predicted class of each image.

  2. Click on any image to view all information about that image. Use the arrow in the info window to scroll through each image in the dataset.

    Screenshot showing the details you can view for each image.

Create a confusion matrix#

In this case, we scored a dataset that included the ground truth for each image. Because of this, we can create a confusion matrix to get an overall picture of how the model performed on these new images. Note that you cannot do this without ground truth labels. To create the matrix:

  1. Navigate to the Charts tab.

  2. From the chart type dropdown, choose the Pivot table.

  3. Move the target column to Rows, the prediction column to Columns, and the Count of records measure to Value.

  4. Double-click on the chart title to give it a more descriptive name.

Your results may vary, but the chart below shows the model did a fairly good job of predicting the healthy and diseased plants.

Screenshot showing the confusion matrix we created using the pivot table chart.

What’s next?#

Congratulations on building your first image classification model in Dataiku! You’re ready to create a new model on your own images.

You also might want to learn how to build object detection models in Dataiku with this tutorial.