Notebooks offer a convenient way to analyze visual datasets. Code and visualizations can live in the same place, which is exactly what CV/ML often requires. With that in mind, being able to find problems in visual datasets is the first step towards improving them. This notebook walks us through each "step" (i.e., a notebook cell) of digging for problems in an image dataset. First, we'll need to install the fiftyone package with pip.
If you're working in Google Colab, be sure to enable a GPU runtime before running any cells.
!pip install fiftyone
Next, we can download and load our dataset. We will be using the COCO-2017 validation split. Let's also take a moment to visualize the ground truth detection labels using the FiftyOne App. The following two cells will do all of this for us.
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("coco-2017", split="validation")
Split 'validation' already downloaded Loading 'coco-2017' split 'validation' 100% |██████████████████████████| 5000/5000 [27.8s elapsed, 0s remaining, 177.9 samples/s] Dataset 'coco-2017-validation' created
session = fo.launch_app(dataset)
We have our COCO-2017 validation dataset loaded; now let's download a model and apply it to the validation split. We will be using the faster-rcnn-resnet50-fpn-coco-torch pre-trained model from the FiftyOne Model Zoo. Let's store the predictions in a new label field called predictions, keeping only detections with confidence >= 0.6:
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
# This will take some time. If not using a GPU, I recommend reducing
# the dataset size with the below line. Results will differ.
#
# dataset = dataset.take(100)
dataset.apply_model(model, label_field="predictions", confidence_thresh=0.6)
Downloading model from 'https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth'... 100% |██████| 1.2Gb/1.2Gb [1.6s elapsed, 0s remaining, 727.7Mb/s] 100% |█████| 5000/5000 [31.0m elapsed, 0s remaining, 2.7 samples/s]
Let's focus on issues related to vehicle detections: we'll consider all buses, cars, and trucks to be vehicles and ignore any other detections, in both the ground truth labels and our predictions.
The following filters our dataset to a view containing only our vehicle detections, and renders the view in the App. Because we are in a notebook, you will notice that each time a new App cell is opened, the previously active App cell will be replaced with a screenshot of itself. Neato!
from fiftyone import ViewField as F
vehicle_labels = ["bus", "car", "truck"]
only_vehicles = F("label").is_in(vehicle_labels)
vehicles = (
    dataset
    .filter_labels("predictions", only_vehicles)
    .filter_labels("ground_truth", only_vehicles)
)
session.view = vehicles
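Conceptually, filter_labels keeps only the detections in each sample whose attributes match the given expression. As a rough plain-Python analogue of the is_in filter above (using a toy list of labels rather than real FiftyOne samples):

```python
# Toy stand-in for one sample's detection labels (not real FiftyOne objects)
sample_labels = ["car", "person", "truck", "dog", "bus"]

vehicle_labels = {"bus", "car", "truck"}

# filter_labels("...", F("label").is_in(vehicle_labels)) keeps only
# detections whose label appears in the allowed set
kept = [label for label in sample_labels if label in vehicle_labels]

print(kept)  # ['car', 'truck', 'bus']
```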
Now that we have our predictions, we can evaluate the model. We'll use the evaluate_detections() method, which is available on all FiftyOne datasets/views and uses the COCO evaluation methodology by default:
results = vehicles.evaluate_detections(
"predictions",
gt_field="ground_truth",
eval_key="eval",
iou=0.75,
)
Evaluating detections... 100% |███████| 640/640 [10.1s elapsed, 0s remaining, 67.8 samples/s]
evaluate_detections() has populated various pieces of data about the evaluation into our dataset. Of note is information about which predictions were not matched with a ground truth box. The following view into the dataset lets us look at only those unmatched predictions, sorted by maximum per-sample confidence in descending order.
session.view = (
    vehicles
    .filter_labels("predictions", F("eval_id") == "")
    .sort_by(F("predictions.detections").map(F("confidence")).max(), reverse=True)
)
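The sort_by stage above orders samples by the highest-confidence prediction each one contains. A plain-Python sketch of that ordering, using made-up filenames and confidences purely for illustration:

```python
# Each toy "sample" maps to the confidences of its predicted detections
samples = {
    "img1.jpg": [0.61, 0.95],
    "img2.jpg": [0.72],
    "img3.jpg": [0.88, 0.66, 0.90],
}

# Sort sample ids by max per-sample confidence, descending, mirroring
# .sort_by(F("predictions.detections").map(F("confidence")).max(), reverse=True)
ordered = sorted(samples, key=lambda s: max(samples[s]), reverse=True)

print(ordered)  # ['img1.jpg', 'img3.jpg', 'img2.jpg']
```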
Double-clicking on a few images, we can see that the most common reason for an unmatched prediction is a label mismatch. This is not surprising, as all three of these classes are in the vehicle supercategory. Trucks and cars are often confused in both human annotation and model prediction.
Looking beyond class confusion, though, let's take a look at the first two samples in our unmatched predictions view.
The very first sample, pictured above, contains an annotation mistake. The truncated car at the right of the image has too small a ground truth bounding box (pink). The unmatched prediction (yellow) is far more accurate, but did not meet the IoU threshold.
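To see why an undersized ground truth box fails to match, recall that IoU is intersection area over union area. A minimal sketch with axis-aligned boxes in (x1, y1, x2, y2) form, where the coordinates are made up for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

prediction = (0, 0, 100, 60)   # accurate box around the full car
ground_truth = (0, 0, 70, 60)  # annotation cut short, missing 30% of the car

# 0.7: just below the 0.75 threshold, so the pair goes unmatched
print(round(iou(prediction, ground_truth), 3))  # 0.7
```

A ground truth box that covers only 70% of an otherwise well-localized object is already enough to drop below an IoU threshold of 0.75.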
The second sample in our unmatched predictions view contains a different, more egregious, kind of annotation error. The correctly predicted bounding box (yellow) has no corresponding ground truth: the car in the shade of the trees was simply never annotated.
Manually fixing these mistakes is out of the scope of this example, as it requires a large feedback loop. FiftyOne is dedicated to making that feedback loop possible (and efficient), but for now let's focus on how we can answer questions about model performance, and confirm the hypothesis that our model does in fact confuse buses, cars, and trucks quite often.
We'll do this by re-evaluating our predictions with buses, cars, and trucks all merged into a single vehicle label. The following creates such a view, clones the view into a separate dataset so we'll have separate evaluation results, and evaluates the merged labels.
vehicle_labels = {label: "vehicle" for label in ["bus", "car", "truck"]}
merged_vehicles_dataset = (
    vehicles
    .map_labels("ground_truth", vehicle_labels)
    .map_labels("predictions", vehicle_labels)
    .select_fields(["ground_truth", "predictions"])
    .clone("merged_vehicles_dataset")
)
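For reference, map_labels amounts to a dictionary lookup applied to each label, with labels absent from the mapping left unchanged. A toy plain-Python equivalent (the label list is made up for illustration):

```python
vehicle_labels = {label: "vehicle" for label in ["bus", "car", "truck"]}

original = ["car", "truck", "person", "bus"]

# Labels found in the mapping are replaced; everything else passes through
merged = [vehicle_labels.get(label, label) for label in original]

print(merged)  # ['vehicle', 'vehicle', 'person', 'vehicle']
```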
merged_vehicles_dataset.evaluate_detections(
"predictions",
gt_field="ground_truth",
eval_key="eval",
iou=0.75,
)
session.dataset = merged_vehicles_dataset
Evaluating detections... 100% |███████| 640/640 [6.6s elapsed, 0s remaining, 102.2 samples/s]
Now we have evaluation results for both the original bus, car, and truck detections and the merged vehicle detections. We can simply compare the number of true positives in the original evaluation to the number of true positives in the merged evaluation.
original_tp_count = vehicles.sum("eval_tp")
merged_tp_count = merged_vehicles_dataset.sum("eval_tp")
print("Original Vehicles True Positives: %d" % original_tp_count)
print("Merged Vehicles True Positives: %d" % merged_tp_count)
Original Vehicles True Positives: 1431 Merged Vehicles True Positives: 1515
We can see that before merging the bus, car, and truck labels there were 1,431 true positives. Merging the three labels together resulted in 1,515 true positives.
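The difference works out to 84 recovered true positives, roughly a 6% relative increase:

```python
original_tp_count = 1431
merged_tp_count = 1515

# Predictions that became true positives once class confusion was removed
recovered = merged_tp_count - original_tp_count
relative_increase = 100 * recovered / original_tp_count

print(recovered)                    # 84
print(round(relative_increase, 1))  # 5.9
```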
We were able to confirm our hypothesis, albeit a fairly obvious one. But we now have a data-backed understanding of a common failure mode of this model, and the entire experiment can be shared with others. The following will screenshot the last active App window, so all outputs can be statically viewed by others.
session.freeze() # Screenshot the active App window for sharing
Thanks for following along! The FiftyOne project can be found on GitHub. If you agree that the CV/ML community needs an open tool to solve its data problems, give us a star!