by: Francisco Ingham and Jeremy Howard. Inspired by Adrian Rosebrock
In this tutorial we will see how to easily create an image dataset through Google Images. Note: you will have to repeat these steps for any new category you want to Google (e.g. once for dogs and once for cats).
!curl https://course.fast.ai/setup/colab | bash
Updating fastai... Done.
from fastai.vision import *
Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.
Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.
It is a good idea to put things you want to exclude into the search query, for instance if you are searching for the Eurasian wolf, "canis lupus lupus", it might be a good idea to exclude other variants:
"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis
You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
Now you must run some JavaScript code in your browser which will save the URLs of all the images you want for your dataset.
In Google Chrome, press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on macOS, and a small window, the JavaScript 'Console', will appear. In Firefox, press Ctrl+Shift+K on Windows/Linux or Cmd+Opt+K on macOS. That is where you will paste the JavaScript commands.
You will need to get the URLs of each of the images. Before running the following commands, you may want to disable ad-blocking extensions (uBlock, AdBlock Plus, etc.) in Chrome; otherwise the window.open() command doesn't work. Then you can run the following commands:
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
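The file that `window.open` saves is just plain text with one image URL per line. Back in Python it can be read with nothing but the standard library; a small sketch using made-up URLs in place of the real downloaded file:

```python
from io import StringIO

# Stand-in for the downloaded urls_sato.csv: one image URL per line.
fake_csv = StringIO(
    "https://example.com/a.jpg\n"
    "https://example.com/b.jpg\n"
)

# Keep only non-empty lines; the saved file often ends with a blank one.
urls = [line.strip() for line in fake_csv if line.strip()]
print(len(urls))  # 2
```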
Choose an appropriate name for your labeled images. You can run these steps multiple times to create different labels.
folder = 'sato'
file = 'urls_sato.csv'
folder = 'shio'
file = 'urls_shio.csv'
folder = 'shouko'
file = 'urls_shouko.csv'
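Rather than re-editing the two variables by hand for every category, the folder/file pairs above can be kept together in one list and looped over (a sketch; the names simply mirror the cells above):

```python
# Each category's folder name and its matching URL file.
datasets = [
    ('sato',   'urls_sato.csv'),
    ('shio',   'urls_shio.csv'),
    ('shouko', 'urls_shouko.csv'),
]

for folder, file in datasets:
    print(folder, '->', file)
```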
You will need to run this cell once for each category.
path = Path('data/happy_sugar_life')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly
Enter your authorization code: ··········
Mounted at /content/drive
!cp "/content/drive/My Drive/fastai-v3/data/Happy Sugar Life Dataset/sato.zip" "/content/data/happy_sugar_life/sato.zip"
!cp "/content/drive/My Drive/fastai-v3/data/Happy Sugar Life Dataset/shoko.zip" "/content/data/happy_sugar_life/shoko.zip"
!cp "/content/drive/My Drive/fastai-v3/data/Happy Sugar Life Dataset/shio.zip" "/content/data/happy_sugar_life/shio.zip"
!cp "/content/drive/My Drive/fastai-v3/data/Happy Sugar Life Dataset/asahi.zip" "/content/data/happy_sugar_life/asahi.zip"
# Unzip them
import zipfile

for name in ['sato', 'shio', 'shouko', 'asahi']:
    with zipfile.ZipFile(f"/content/data/happy_sugar_life/{name}.zip", 'r') as zip_ref:
        zip_ref.extractall("/content/data/happy_sugar_life/")
path.ls()
[PosixPath('data/happy_sugar_life/sato'), PosixPath('data/happy_sugar_life/.ipynb_checkpoints'), PosixPath('data/happy_sugar_life/sato.zip'), PosixPath('data/happy_sugar_life/asahi'), PosixPath('data/happy_sugar_life/asahi.zip'), PosixPath('data/happy_sugar_life/shio.zip'), PosixPath('data/happy_sugar_life/shouko'), PosixPath('data/happy_sugar_life/shouko.zip'), PosixPath('data/happy_sugar_life/shio'), PosixPath('data/happy_sugar_life/shoko')]
Now, let's load the images and assign some classes.
classes = ['sato','shio','shouko','asahi']
Then we can remove any images that can't be opened:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
sato
shio
shouko
asahi
# np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
# note: for a model trained from scratch, use your own dataset's stats rather than imagenet_stats
imagenet_stats
([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
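Those two lists are the per-channel means and standard deviations of the ImageNet training set; `normalize` subtracts the mean and divides by the std for each of the R, G, B channels. A minimal sketch of the same per-channel arithmetic on a single pixel:

```python
mean = [0.485, 0.456, 0.406]  # ImageNet per-channel means
std  = [0.229, 0.224, 0.225]  # ImageNet per-channel stds

pixel = [0.5, 0.5, 0.5]  # an RGB value already scaled to [0, 1]

# (value - mean) / std, applied channel by channel
normalized = [(p - m) / s for p, m, s in zip(pixel, mean, std)]
print([round(v, 3) for v in normalized])
```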
data.train_ds??
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
# ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
Good! Let's take a look at some of our pictures then.
data.classes
['asahi', 'sato', 'shio', 'shouko']
data.show_batch(rows=3, figsize=(7,8))
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
(['asahi', 'sato', 'shio', 'shouko'], 4, 490, 122)
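Those sizes are consistent with `valid_pct=0.2`: fastai holds out 20% of the usable images for validation. A quick check of the arithmetic:

```python
total = 490 + 122          # train + valid items reported above
valid = int(total * 0.2)   # 20% held out, truncated to whole images

print(total, valid)        # 612 122
```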
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(4)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.876986 | 0.652400 | 0.254098 | 00:20 |
1 | 1.168839 | 0.238772 | 0.081967 | 00:19 |
2 | 0.880157 | 0.137055 | 0.032787 | 00:19 |
3 | 0.694925 | 0.133711 | 0.040984 | 00:18 |
learn.save('stage-1')
After some quick training, let's do some fine tuning.
learn.unfreeze()
learn.lr_find()
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.255795 | #na# | #na# | 00:16 |
1 | 0.285223 | #na# | #na# | 00:15 |
2 | 0.273368 | #na# | #na# | 00:16 |
3 | 0.269292 | #na# | #na# | 00:15 |
4 | 0.256310 | #na# | #na# | 00:15 |
5 | 0.234673 | #na# | #na# | 00:15 |
6 | 0.208064 | #na# | #na# | 00:15 |
7 | 0.212439 | #na# | #na# | 00:16 |
8 | 0.477814 | #na# | #na# | 00:15 |
9 | 0.667494 | #na# | #na# | 00:15 |
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
# If the plot is not showing try to give a start and end learning rate
# learn.lr_find(start_lr=1e-5, end_lr=1e-1)
learn.recorder.plot()
learn.recorder.plot_lr()
# learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
learn.fit_one_cycle(2, max_lr=slice(1e-4/2,1e-3/2))
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.235391 | 0.365020 | 0.122951 | 00:19 |
1 | 0.203716 | 0.151499 | 0.049180 | 00:19 |
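Passing a `slice` as `max_lr` applies discriminative learning rates: the earliest layer group gets the low end, the final group gets the high end, and fastai v1 spreads the values in between geometrically (its `even_mults` helper). A sketch of that spread, assuming the three layer groups of a `cnn_learner` and the endpoints from the cell above:

```python
def even_mults(start, stop, n):
    # n geometrically spaced values from start to stop,
    # mirroring what fastai v1's even_mults does.
    mult = (stop / start) ** (1 / (n - 1))
    return [start * mult ** i for i in range(n)]

lrs = even_mults(1e-4 / 2, 1e-3 / 2, 3)
print(['%.1e' % lr for lr in lrs])
```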
learn.lr_find(start_lr=1e-7, end_lr=2e-3)
learn.recorder.plot()
learn.recorder.plot_lr()
learn.save('stage-2', return_path=True)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.158354 | #na# | #na# | 00:15 |
1 | 0.174361 | #na# | #na# | 00:15 |
2 | 0.176616 | #na# | #na# | 00:15 |
3 | 0.165817 | #na# | #na# | 00:15 |
4 | 0.161536 | #na# | #na# | 00:15 |
5 | 0.151717 | #na# | #na# | 00:16 |
6 | 0.158819 | #na# | #na# | 00:15 |
7 | 0.156142 | #na# | #na# | 00:15 |
8 | 0.143844 | #na# | #na# | 00:15 |
9 | 0.133874 | #na# | #na# | 00:16 |
10 | 0.117162 | #na# | #na# | 00:15 |
11 | 0.108767 | #na# | #na# | 00:15 |
12 | 0.102191 | #na# | #na# | 00:16 |
13 | 0.111332 | #na# | #na# | 00:15 |
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
PosixPath('data/happy_sugar_life/models/stage-2.pth')
learn.load('stage-2');
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
Some of our top losses aren't due to bad performance by our model; there are images in our dataset that shouldn't be there. Using the ImageCleaner widget from fastai.widgets, we can prune our top losses, removing photos that don't belong.
from fastai.widgets import *
First we need to get the file paths from our top_losses. We can do this with .from_toplosses. We then feed the top losses indexes and the corresponding dataset to ImageCleaner.
Notice that the widget will not delete images directly from disk; instead it creates a new CSV file, cleaned.csv, from which you can create a new ImageDataBunch with the corrected labels to continue training your model.
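cleaned.csv is an ordinary two-column CSV mapping each kept image path to its (possibly relabelled) class. A sketch of what parsing it looks like with the standard library; the rows here are made up, and the name/label column layout is assumed from fastai v1:

```python
import csv
from io import StringIO

# Stand-in for cleaned.csv (invented rows, assumed columns).
fake_cleaned = StringIO(
    "name,label\n"
    "sato/00000012.jpg,sato\n"
    "shio/00000034.jpg,shio\n"
)

rows = list(csv.DictReader(fake_cleaned))
print(len(rows), rows[0]['label'])  # 2 sato
```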
In order to clean the entire set of images, we need to create a new dataset without the split. The video lecture demonstrated the use of the ds_type param, which no longer has any effect. See the thread for more details.
db = (ImageList.from_folder(path)
.split_none()
.label_from_folder()
.transform(get_transforms(), size=256) #224
.databunch()
)
# If you already cleaned your data using indexes from `from_toplosses`,
# run this cell instead of the one before to proceed with removing duplicates.
# Otherwise all the results of the previous step would be overwritten by
# the new run of `ImageCleaner`.
# db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
# .split_none()
# .label_from_df()
# .transform(get_transforms(), size=224)
# .databunch()
# )
Then we create a new learner to use our new databunch with all the images.
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)
learn_cln.load('stage-2');
# from_toplosses returns two things: the dataset and the indexes of its items sorted by loss (highest first)
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)
idxs
showing = 0
end_at = 10
print("Images with the highest losses (from the highest).")
print("These are the images the model is most confused about.")
while showing <= end_at:
    show_image(ds[idxs[showing]][0])
    showing += 1
Images with the highest losses (from the highest). These are the images the model is most confused about.
Make sure you're running this notebook in Jupyter Notebook, not Jupyter Lab. That is accessible via /tree, not /lab. Running the ImageCleaner widget in Jupyter Lab is not currently supported.
# Don't run this in google colab or any other instances running jupyter lab.
# If you do run this on Jupyter Lab, you need to restart your runtime and
# runtime state including all local variables will be lost.
# ImageCleaner(ds, idxs, path)
If the code above does not show a GUI (images and buttons) rendered by widgets but only text output, that may be caused by a configuration problem with ipywidgets. Try the solution in this link to solve it.
Flag photos for deletion by clicking 'Delete'. Then click 'Next Batch' to delete the flagged photos and keep the rest in that row. ImageCleaner will show you a new row of images until there are no more to show. In this case, the widget will show you images until there are none left from top_losses.

ImageCleaner(ds, idxs)
You can also find duplicates in your dataset and delete them! To do this, you need to run .from_similars to get the potential duplicates' ids and then run ImageCleaner with duplicates=True. The API works in a similar way as with misclassified images: just choose the ones you want to delete and click 'Next Batch' until there are no more images left.
Make sure to recreate the databunch and learn_cln from the cleaned.csv file. Otherwise the file would be overwritten from scratch, losing all the results from cleaning the data from toplosses.
ds, idxs = DatasetFormatter().from_similars(learn_cln)
#this will find the duplicated files
idxs
showing = 0
end_at = 10
while showing <= end_at:
    show_image(ds[idxs[showing]][0])
    showing += 1
Getting activations...
Computing similarities...
#ImageCleaner(ds, idxs, path, duplicates=True)
Remember to recreate your ImageDataBunch from your cleaned.csv to include the changes you made in your data!
First things first, let's export the content of our Learner object for production:
learn.export()
This will create a file named 'export.pkl' in the directory where we were working. It contains everything we need to deploy our model: the model, the weights, and also some metadata such as the classes and the transforms/normalization used.
You probably want to use CPU for inference, except at massive scale (and you almost certainly don't need to train in real time). If you don't have a GPU, that happens automatically. You can test your model on CPU like so:
defaults.device = torch.device('cpu')
img = open_image(path/'shouko/vlcsnap-2020-02-05-17h08m13s825.png')
img.resize(torch.Size([img.shape[0], 126, 224]))
We create our Learner in the production environment like this; just make sure that path contains the file 'export.pkl' from before.
learn = load_learner(path)  # load_learner reads the export.pkl file from path
pred_class,pred_idx,outputs = learn.predict(img)
pred_class
Category shouko
So you might create a route something like this (thanks to Simon Willison for the structure of this code):
@app.route("/classify-url", methods=["GET"])
async def classify_url(request):
bytes = await get_bytes(request.query_params["url"])
img = open_image(BytesIO(bytes))
_,_,losses = learner.predict(img)
return JSONResponse({
"predictions": sorted(
zip(cat_learner.data.classes, map(float, losses)),
key=lambda p: p[1],
reverse=True
)
})
(This example is for the Starlette web app toolkit.)
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(1, max_lr=0.5)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 2.002434 | 9085017.000000 | 0.782609 | 00:02 |
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
Previously we had this result:
Total time: 00:57

epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
1 | 1.030236 | 0.179226 | 0.028369 | 00:14 |
2 | 0.561508 | 0.055464 | 0.014184 | 00:13 |
3 | 0.396103 | 0.053801 | 0.014184 | 00:13 |
4 | 0.316883 | 0.050197 | 0.021277 | 00:15 |
learn.fit_one_cycle(5, max_lr=1e-9)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.820173 | 1.681390 | 0.391304 | 00:02 |
1 | 1.729933 | 1.503223 | 0.347826 | 00:02 |
2 | 1.721467 | 1.627320 | 0.260870 | 00:02 |
3 | 1.596017 | 1.766101 | 0.369565 | 00:02 |
4 | 1.606956 | 2.113033 | 0.347826 | 00:02 |
learn.recorder.plot_losses()
# because the learning rate is so small, training barely makes progress
As well as taking a really long time, the model gets too many looks at each image, so it may overfit.
learn = cnn_learner(data, models.resnet34, metrics=error_rate, pretrained=False)
learn.fit_one_cycle(1)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.762493 | 26.919785 | 0.434783 | 00:02 |
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.9, bs=32,
ds_tfms=get_transforms(do_flip=False, max_rotate=0, max_zoom=1, max_lighting=0, max_warp=0
),size=224, num_workers=4).normalize(imagenet_stats)
/usr/local/lib/python3.6/dist-packages/fastai/basic_data.py:248: UserWarning: Your training dataloader is empty, you have only 23 items in your training set. Your batch size is 32, you should lower it.
You can deactivate this warning by passing `no_check=True`.
learn = cnn_learner(data, models.resnet50, metrics=error_rate, ps=0, wd=0)
learn.unfreeze()
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/checkpoints/resnet50-19c8e357.pth
learn.fit_one_cycle(2, slice(1e-6,1e-4))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-56-a6defc03cbca> in <module>()
----> 1 learn.fit_one_cycle(2, slice(1e-6,1e-4))

/usr/local/lib/python3.6/dist-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22         final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr, moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     86     "Fit the `model` on `data` and learn using `loss_func` and `opt`."
     87     assert len(learn.data.train_dl) != 0, f"""Your training dataloader is empty, can't train a model.
---> 88         Use a smaller batch size (batch size={learn.data.train_dl.batch_size} for {len(learn.data.train_dl.dataset)} elements)."""
     89     cb_handler = CallbackHandler(callbacks, metrics)
     90     pbar = master_bar(range(epochs))

AssertionError: Your training dataloader is empty, can't train a model.
Use a smaller batch size (batch size=32 for 23 elements).
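The assertion is pure arithmetic: `valid_pct=0.9` leaves only 23 training images, and fastai's training dataloader drops the last incomplete batch, so a batch size of 32 yields zero batches. A quick check:

```python
n_items, batch_size = 23, 32

# With the last incomplete batch dropped, the number of batches
# is the floor division of items by batch size.
n_batches = n_items // batch_size
print(n_batches)  # 0
```

Lowering the batch size below 23 (or raising the training fraction) is what makes the dataloader non-empty again.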