In this tutorial we will see how to easily create an image dataset through Google Images. Note: You will have to repeat these steps for any new category you want to Google (e.g. once for dogs and once for cats).
Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.
Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.
It is a good idea to put things you want to exclude into the search query, for instance if you are searching for the Eurasian wolf, "canis lupus lupus", it might be a good idea to exclude other variants:
"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis
You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
You will need to get the urls of each of the images. You can do this by running the following JavaScript commands in your browser's console (the code references document and window, so it must run in the browser, not in your notebook):
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
%reload_ext autoreload
%autoreload 2

from fastai import *
from fastai.vision import *
Choose an appropriate name for your labeled images. You can run these steps multiple times to grab different labels.
folder = 'black'
file = 'urls_black.txt'

folder = 'teddys'
file = 'urls_teddys.txt'

folder = 'grizzly'
file = 'urls_grizzly.txt'
You will need to run the following lines once for each category.
path = Path('data/bears')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)
Finally, upload your urls file. You just need to press 'Upload' in your working directory and select your file, then click 'Upload' for each of the displayed files.
Now you will need to download your images from their respective urls.

fast.ai has a function that allows you to do just that. You just have to specify the urls filename and the destination folder, and this function will download and save all images that can be opened. Any image that can't be opened won't be saved.
Let's download our images! Notice you can choose a maximum number of images to be downloaded. In this case we will not download all the urls.
You will need to run the download below once for every category, with folder and file set accordingly.
classes = ['teddys','grizzly','black']
download_images(path/file, dest, max_pics=200)
# If you have problems downloading, try with `max_workers=0` to see exceptions:
# download_images(path/file, dest, max_pics=20, max_workers=0)
Then we can remove any images that can't be opened:
for c in classes:
    print(c)
    verify_images(path/c, delete=True)
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
Good! Let's take a look at some of our pictures then.
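A minimal way to peek at a batch, assuming fastai v1's standard show_batch method on the data bunch (the rows and figsize values are arbitrary choices):

# Display a small grid of training images with their labels
data.show_batch(rows=3, figsize=(7,8))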
data.classes
['black', 'grizzly', 'teddys']
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
(['black', 'grizzly', 'teddys'], 3, 199, 49)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
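The training results below come from fitting the head of the network. A sketch of the call that would produce a four-epoch run like the one shown, assuming fastai v1's one-cycle API:

# Train the new head for 4 epochs with the one-cycle policy
learn.fit_one_cycle(4)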
Total time: 00:46
epoch  train_loss  valid_loss  error_rate
1      0.648200    0.147811    0.031746   (00:12)
2      0.360954    0.124326    0.023810   (00:11)
3      0.259792    0.123840    0.007936   (00:11)
4      0.200647    0.124179    0.007936   (00:11)
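The second, shorter run below suggests the model was then unfrozen and fine-tuned for two more epochs. A sketch of that step, assuming fastai v1 conventions; the max_lr slice is an illustrative assumption, normally chosen after inspecting learn.lr_find():

# Unfreeze all layers and fine-tune with discriminative learning rates
learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))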
Total time: 00:22
epoch  train_loss  valid_loss  error_rate
1      0.076893    0.124552    0.007936   (00:11)
2      0.058538    0.122732    0.007936   (00:11)
interp = ClassificationInterpretation.from_learner(learn)
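To see what the interpretation object gives you, a minimal sketch using two standard fastai v1 methods (the number of images and figsize are arbitrary):

# Show the images the model was most wrong about
interp.plot_top_losses(9, figsize=(15,11))
# Show which classes get confused with which
interp.plot_confusion_matrix()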
Some of our top losses aren't due to bad performance by our model: there are images in our dataset that simply shouldn't be there. Using the ImageCleaner widget from fastai.widgets, we can prune our top losses, removing photos that don't belong.
from fastai.widgets import *
First we need to get the file paths from our top losses. We can do this with DatasetFormatter().from_toplosses. We then feed the top losses indexes and the corresponding dataset to ImageCleaner:
ds, idxs = DatasetFormatter().from_toplosses(learn, ds_type=DatasetType.Valid)
fd = ImageCleaner(ds, idxs)
Flag photos for deletion by clicking 'Delete'. Then click 'Next Batch' to delete flagged photos and keep the rest in that row.
ImageCleaner will show you a new row of images until there are no more to show. In this case, the widget will show you images until there are none left from top_losses.
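Note that in fastai v1 ImageCleaner doesn't delete files from disk; it records your decisions in a cleaned.csv file under path. A sketch of rebuilding the data from it, assuming the from_csv constructor (the other parameters mirror the from_folder call above):

# Rebuild the DataBunch from the labels file written by ImageCleaner
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", csv_labels='cleaned.csv', valid_pct=0.2, ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)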
You can also find duplicates in your dataset and delete them! To do this, you need to run .from_similars to get the potential duplicates' ids and then run ImageCleaner with duplicates=True. The API works similarly to the misclassified-images case: just choose the ones you want to delete and click 'Next Batch' until there are no more images left.
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid)
ImageCleaner(ds, idxs, duplicates=True)
You probably want to use CPU for inference, except at massive scale (and you almost certainly don't need to train in real time). If you don't have a GPU, that happens automatically. You can test your model on CPU like so:
import fastai
fastai.defaults.device = torch.device('cpu')
img = open_image(path/'black'/'00000021.jpg')
img
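With the image loaded, a minimal inference sketch using fastai v1's Learner.predict, which returns the predicted category, its index, and the raw output tensor:

# Predict the class of a single image on CPU
pred_class, pred_idx, outputs = learn.predict(img)
pred_class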