%matplotlib inline
%reload_ext autoreload
%autoreload 2
from fastai.conv_learner import *
from fastai.dataset import *
from pathlib import Path
import json
from PIL import ImageDraw, ImageFont
from matplotlib import patches, patheffects
# torch.cuda.set_device(0)
We will be looking at the Pascal VOC dataset. It's quite slow, so you may prefer to download from this mirror. There are two different competition/research datasets, from 2007 and 2012. We'll be using the 2007 version. You can use the larger 2012 for better results, or even combine them (but be careful to avoid data leakage between the validation sets if you do this).
Unlike previous lessons, we are using the Python 3 standard library pathlib for our paths and file access. Note that it returns an OS-specific class (on Linux, PosixPath) so your output may look a little different. Most libraries that take paths as input can take a pathlib object - although some (like cv2) can't, in which case you can use str() to convert it to a string.
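For instance, here's a quick sketch of the pathlib operations we'll rely on below (the path is just for illustration):

from pathlib import Path

p = Path('data/pascal') / 'pascal_train2007.json'  # '/' joins path components
print(p.suffix, p.stem)  # .json pascal_train2007
print(str(p))            # a plain string, for libraries like cv2 that can't take Path objects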
!pwd
/home/ubuntu/fastai/courses/dl2
!ln -s ~/data data
!ls -la data/
total 837144
drwxrwxr-x  4 ubuntu ubuntu      4096 May 20 05:28 .
drwxr-xr-x 24 ubuntu ubuntu      4096 May 20 15:01 ..
drwxrwxr-x  8 ubuntu ubuntu      4096 May 13 13:09 dogscats
-rw-rw-r--  1 ubuntu ubuntu 857214334 Apr  1  2017 dogscats.zip
drwxrwxr-x  2 ubuntu ubuntu      4096 May 20 06:02 spellbee
%cd data
/home/ubuntu/data
%mkdir pascal
%cd pascal/
/home/ubuntu/data/pascal
!aria2c --file-allocation=none -c -x 5 -s 5 http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
[#8be4bd 431MiB/438MiB(98%) CN:1 DL:34MiB]
06/15 21:53:04 [NOTICE] Download complete: /home/ubuntu/data/VOCtrainval_06-Nov-2007.tar

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
8be4bd|OK  |    33MiB/s|/home/ubuntu/data/VOCtrainval_06-Nov-2007.tar

Status Legend:
(OK):download completed.
!aria2c --file-allocation=none -c -x 5 -s 5 https://storage.googleapis.com/coco-dataset/external/PASCAL_VOC.zip
06/15 21:54:27 [NOTICE] Download complete: /home/ubuntu/data/PASCAL_VOC.zip

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1350c8|OK  |   8.4MiB/s|/home/ubuntu/data/PASCAL_VOC.zip

Status Legend:
(OK):download completed.
!tar -xf VOCtrainval_06-Nov-2007.tar
!unzip PASCAL_VOC.zip
Archive:  PASCAL_VOC.zip
   creating: PASCAL_VOC/
  inflating: PASCAL_VOC/pascal_test2007.json
  inflating: PASCAL_VOC/pascal_train2007.json
  inflating: PASCAL_VOC/pascal_train2012.json
  inflating: PASCAL_VOC/pascal_val2007.json
  inflating: PASCAL_VOC/pascal_val2012.json
%mv PASCAL_VOC/*.json .
%rmdir PASCAL_VOC
%ls -la
total 462072
drwxrwxr-x 3 ubuntu ubuntu      4096 Jun 15 22:01 ./
drwxrwxr-x 5 ubuntu ubuntu      4096 Jun 15 21:56 ../
-rw-r--r-- 1 ubuntu ubuntu   2584743 Jul  7  2015 pascal_test2007.json
-rw-r--r-- 1 ubuntu ubuntu   1346236 Aug 19  2015 pascal_train2007.json
-rw-r--r-- 1 ubuntu ubuntu   2912167 Aug 19  2015 pascal_train2012.json
-rw-r--r-- 1 ubuntu ubuntu   1342257 Jul  7  2015 pascal_val2007.json
-rw-r--r-- 1 ubuntu ubuntu   2922699 Aug 19  2015 pascal_val2012.json
-rw-rw-r-- 1 ubuntu ubuntu   1998182 Jun 15 21:54 PASCAL_VOC.zip
drwxrwxr-x 3 ubuntu ubuntu      4096 Nov  6  2007 VOCdevkit/
-rw-rw-r-- 1 ubuntu ubuntu 460032000 Jun 15 21:53 VOCtrainval_06-Nov-2007.tar
%cd ~/fastai/courses/dl2
/home/ubuntu/fastai/courses/dl2
PATH = Path('data/pascal')
list(PATH.iterdir())
[PosixPath('data/pascal/pascal_train2012.json'), PosixPath('data/pascal/VOCtrainval_06-Nov-2007.tar'), PosixPath('data/pascal/pascal_train2007.json'), PosixPath('data/pascal/models'), PosixPath('data/pascal/VOCdevkit'), PosixPath('data/pascal/pascal_val2007.json'), PosixPath('data/pascal/pascal_test2007.json'), PosixPath('data/pascal/pascal_val2012.json'), PosixPath('data/pascal/PASCAL_VOC.zip'), PosixPath('data/pascal/tmp')]
As well as the images, there are also annotations - bounding boxes showing where each object is. These were hand labeled. The original annotations were in XML, which is a little hard to work with nowadays, so we use the more recent JSON version which you can download from this link.
You can see here how pathlib includes the ability to open files (amongst many other capabilities).
trn_j = json.load( (PATH / 'pascal_train2007.json').open() )
trn_j.keys()
dict_keys(['images', 'type', 'annotations', 'categories'])
IMAGES, ANNOTATIONS, CATEGORIES = ['images', 'annotations', 'categories']
trn_j[IMAGES][:5]
[{'file_name': '000012.jpg', 'height': 333, 'width': 500, 'id': 12}, {'file_name': '000017.jpg', 'height': 364, 'width': 480, 'id': 17}, {'file_name': '000023.jpg', 'height': 500, 'width': 334, 'id': 23}, {'file_name': '000026.jpg', 'height': 333, 'width': 500, 'id': 26}, {'file_name': '000032.jpg', 'height': 281, 'width': 500, 'id': 32}]
trn_j[ANNOTATIONS][:2]
[{'segmentation': [[155, 96, 155, 270, 351, 270, 351, 96]], 'area': 34104, 'iscrowd': 0, 'image_id': 12, 'bbox': [155, 96, 196, 174], 'category_id': 7, 'id': 1, 'ignore': 0}, {'segmentation': [[184, 61, 184, 199, 279, 199, 279, 61]], 'area': 13110, 'iscrowd': 0, 'image_id': 17, 'bbox': [184, 61, 95, 138], 'category_id': 15, 'id': 2, 'ignore': 0}]
trn_j[CATEGORIES][:8]
[{'supercategory': 'none', 'id': 1, 'name': 'aeroplane'}, {'supercategory': 'none', 'id': 2, 'name': 'bicycle'}, {'supercategory': 'none', 'id': 3, 'name': 'bird'}, {'supercategory': 'none', 'id': 4, 'name': 'boat'}, {'supercategory': 'none', 'id': 5, 'name': 'bottle'}, {'supercategory': 'none', 'id': 6, 'name': 'bus'}, {'supercategory': 'none', 'id': 7, 'name': 'car'}, {'supercategory': 'none', 'id': 8, 'name': 'cat'}]
It's helpful to use constants instead of strings, since we get tab-completion and don't mistype.
FILE_NAME, ID, IMG_ID, CAT_ID, BBOX = 'file_name', 'id', 'image_id', 'category_id', 'bbox'
cats = { o[ID]:o["name"] for o in trn_j[CATEGORIES] }
trn_fns = { o[ID]:o[FILE_NAME] for o in trn_j[IMAGES] }
trn_ids = { o[ID] for o in trn_j[IMAGES] }
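As a tiny illustration of the payoff of those constants (toy dict, hypothetical typo):

ann = {'category_id': 7, 'image_id': 12}
print(ann[CAT_ID])          # editors tab-complete CAT_ID, and a typo in it raises NameError immediately
# print(ann['catgory_id'])  # a typo inside a raw string only fails at runtime, with a KeyError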
list( (PATH / 'VOCdevkit/VOC2007').iterdir() )
[PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages'), PosixPath('data/pascal/VOCdevkit/VOC2007/SegmentationClass'), PosixPath('data/pascal/VOCdevkit/VOC2007/Annotations'), PosixPath('data/pascal/VOCdevkit/VOC2007/SegmentationObject'), PosixPath('data/pascal/VOCdevkit/VOC2007/ImageSets')]
JPEGS = 'VOCdevkit/VOC2007/JPEGImages'
IMG_PATH = PATH / JPEGS
list( IMG_PATH.iterdir() )[:5]
[PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/001688.jpg'), PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/007189.jpg'), PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/003408.jpg'), PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/001604.jpg'), PosixPath('data/pascal/VOCdevkit/VOC2007/JPEGImages/000729.jpg')]
Each image has a unique ID.
im0_d = trn_j[IMAGES][0]
im0_d
{'file_name': '000012.jpg', 'height': 333, 'width': 500, 'id': 12}
im0_d[FILE_NAME], im0_d[ID]
('000012.jpg', 12)
A defaultdict is useful any time you want to have a default dictionary entry for new keys. Here we create a dict from image IDs to a list of annotations (tuple of bounding box and class id).
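A minimal sketch of the behaviour we rely on here (keys spring into existence with an empty list):

import collections
import numpy as np

anno = collections.defaultdict(list)  # same effect as defaultdict(lambda: [])
anno[12].append((np.array([96, 155, 269, 350]), 7))  # key 12 is created automatically
print(anno[12])
print(anno[99])  # note: merely reading a missing key also inserts (and returns) an empty list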
We convert VOC's height/width into top-left/bottom-right, and switch x/y coords to be consistent with numpy.
def hw_bb(bb):
    # VOC bbox format: [x (left column), y (top row), width, height]
    #              ix:   0               1            2      3
    # Example, bb = [155, 96, 196, 174]
    return np.array([ bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1 ])
bb = [155, 96, 196, 174]
bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1
(96, 155, 269, 350)
trn_anno = collections.defaultdict(lambda:[])
for o in trn_j[ANNOTATIONS]:
if not o['ignore']:
bb = o[BBOX] # one bbox. looks like '[155, 96, 196, 174]'.
bb = hw_bb(bb)
trn_anno[o[IMG_ID]].append( (bb, o[CAT_ID]) )
len(trn_anno)
2501
# Test getting the first element from dict_values
list(trn_anno.values())[0]
[(array([ 96, 155, 269, 350]), 7)]
print(im0_d[ID])
im_a = trn_anno[im0_d[ID]]
im_a
12
[(array([ 96, 155, 269, 350]), 7)]
im0_a = im_a[0] # get first item (first bbox) from list. note: possible to have more than one bbox per image.
im0_a
(array([ 96, 155, 269, 350]), 7)
cats[7]
'car'
trn_anno[17]
[(array([ 61, 184, 198, 278]), 15), (array([ 77, 89, 335, 402]), 13)]
cats[15], cats[13]
('person', 'horse')
Some libs take VOC format bounding boxes, so this lets us convert back when required:
bb_voc = [155, 96, 196, 174]
bb_fastai = hw_bb(bb_voc)
bb_fastai
array([ 96, 155, 269, 350])
def bb_hw(a):
return np.array( [ a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1 ] )
f'expected: {bb_voc}, actual: {bb_hw(bb_fastai)}'
'expected: [155, 96, 196, 174], actual: [155 96 196 174]'
You can use Visual Studio Code (vscode - an open source editor that comes with recent versions of Anaconda, or can be installed separately), or most editors and IDEs, to find out all about the open_image function.
im = open_image(IMG_PATH / im0_d[FILE_NAME])
Matplotlib's plt.subplots
is a really useful wrapper for creating plots, regardless of whether you have more than one subplot. Note that Matplotlib has an optional object-oriented API which I think is much easier to understand and use (although few examples online use it!)
def show_img(im, figsize=None, ax=None):
if not ax:
fig, ax = plt.subplots(figsize=figsize)
ax.imshow(im)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
return ax
A simple but rarely used trick to make text visible regardless of background is to use white text with a black outline, or vice versa. Here's how to do it in matplotlib.
def draw_outline(o, lw):
o.set_path_effects( [patheffects.Stroke(linewidth=lw, foreground='black'),
patheffects.Normal()] )
Note that * in argument lists is the splat operator. In this case it's a little shortcut compared to writing out b[-2], b[-1].
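A tiny illustration with hypothetical values, before we use it in draw_rect below:

b = [96, 155, 174, 195]  # hypothetical [y, x, height, width] box

def rect(xy, w, h):
    return (xy, w, h)

print(rect(b[:2], *b[-2:]))  # *b[-2:] expands into the two positional args w and h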
def draw_rect(ax, b):
patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor='white', lw=2))
draw_outline(patch, 4)
def draw_text(ax, xy, txt, sz=14):
text = ax.text(*xy, txt, verticalalignment='top', color='white', fontsize=sz, weight='bold')
draw_outline(text, 1)
ax = show_img(im)
b = bb_hw(im0_a[0]) # convert bbox back to VOC format
draw_rect(ax, b)
draw_text(ax, b[:2], cats[im0_a[1]])
def draw_im(im, ann):
# im is image, ann is annotations
ax = show_img(im, figsize=(16, 8))
for b, c in ann:
# b is bbox, c is class id
b = bb_hw(b)
draw_rect(ax, b)
draw_text(ax, b[:2], cats[c], sz=16)
def draw_idx(i):
# i is image id
im_a = trn_anno[i] # training annotations
im = open_image(IMG_PATH / trn_fns[i]) # trn_fns is training image file names
print(im.shape)
draw_im(im, im_a) # im_a is an element of annotation
draw_idx(17) # image id is 17
(364, 480, 3)
A lambda function is simply a way to define an anonymous function inline. Here we use it to describe how to sort the annotation for each image - by bounding box size (descending).
def get_lrg(b):
    if not b:
        raise Exception()
    # x is a tuple, e.g.: (array([96, 155, 269, 350]), 16)
    # x[0] is the bbox as a numpy array, e.g.: [96 155 269 350]
    # x[0][:2] is the top-left corner, e.g.: [96 155]
    # x[0][-2:] is the bottom-right corner, e.g.: [269 350]
    # np.product(x[0][-2:] - x[0][:2]) is the bbox area (a scalar), e.g.: 33735
    b = sorted(b, key=lambda x: np.product(x[0][-2:] - x[0][:2]), reverse=True)
    return b[0]  # the first element of the sorted list is the largest bbox for the image
# Debugging code
np_prod = np.product(np.array([269, 350]) - np.array([96, 155]))
minus_mul = (269 - 96) * (350 - 155)  # bbox area: height x width
print(np_prod)
assert np_prod == minus_mul
33735
# for k, v in trn_anno.items():
# print(f"k: {k}, v: {v}")
# a is an image id (int), b is a list of (bbox numpy array, class id) tuples
trn_lrg_anno = { a: get_lrg(b) for a, b in trn_anno.items() if (a != 0 and a != 1) }
trn_lrg_anno[23]
(array([ 1, 2, 461, 242]), 15)
Now we have a dictionary from image id to a single bounding box - the largest for that image.
def draw_largest_bbox(img_id):
    b, c = trn_lrg_anno[img_id]  # trn_lrg_anno[img_id] is a (bbox, class id) tuple; destructure it
    print(f'### DEBUG ### bbox: {b.tolist()}, class id: {c}')  # print the numpy array via its tolist method
    b = bb_hw(b)  # convert the fastai bbox back to VOC format
    ax = show_img(open_image(IMG_PATH / trn_fns[img_id]), figsize=(5, 10))
    draw_rect(ax, b)
    draw_text(ax, b[:2], cats[c], sz=16)
img_id = 695
draw_largest_bbox(img_id)
### DEBUG ### bbox: [125, 108, 365, 414], class id: 13
(PATH / 'tmp').mkdir(exist_ok=True)
CSV = PATH / 'tmp/lrg.csv'
Often it's easiest to simply create a CSV of the data you want to model, rather than trying to create a custom dataset. Here we use Pandas to help us create a CSV of the image filename and class.
df = pd.DataFrame({ 'fn': [trn_fns[o] for o in trn_ids],
'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids] }, columns=['fn', 'cat'])
df.to_csv(CSV, index=False)
f_model = resnet34
sz = 224
bs = 64
From here it's just like Dogs vs Cats!
tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on, crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms, bs=bs)
x, y = next(iter(md.val_dl))
show_img(md.val_ds.denorm(to_np(x))[0])
<matplotlib.axes._subplots.AxesSubplot at 0x7f64b9d0ba58>
learn = ConvLearner.pretrained(f_model, md, metrics=[accuracy])
learn.opt_fn = optim.Adam
lrf = learn.lr_find(1e-5, 100)
78%|███████▊ | 25/32 [00:11<00:03, 2.09it/s, loss=13.1]
When your LR finder graph looks like this, you can ask for more points on each end:
learn.sched.plot()
learn.sched.plot(n_skip=5, n_skip_end=1)
lr = 2e-2
learn.fit(lr, 1, cycle_len=1)
epoch  trn_loss  val_loss  accuracy
0      1.259122  0.782595  0.78
[array([0.78259]), 0.779999997138977]
lrs = np.array([lr/1000, lr/100, lr])
learn.freeze_to(-2)
lrf = learn.lr_find(lrs/1000)
learn.sched.plot(1)
84%|████████▍ | 27/32 [00:17<00:03, 1.52it/s, loss=5.12]
learn.fit(lrs/5, 1, cycle_len=1)
epoch  trn_loss  val_loss  accuracy
0      0.789873  0.674313  0.788
[array([0.67431]), 0.7879999985694885]
learn.unfreeze()
Accuracy isn't improving much - since many images have multiple different objects, it's going to be impossible to be that accurate.
learn.fit(lrs/5, 1, cycle_len=2)
HBox(children=(IntProgress(value=0, description='Epoch', max=2), HTML(value='')))
epoch  trn_loss  val_loss  accuracy
0      0.600366  0.672303  0.794
1      0.444746  0.691367  0.786
[array([0.69137]), 0.786]
learn.save('clas_one')
learn.load('clas_one')
x, y = next(iter(md.val_dl))
probs = F.softmax(predict_batch(learn.model, x), -1)
x, preds = to_np(x), to_np(probs)
preds = np.argmax(preds, -1)
You can use the Python debugger pdb to step through code.

  * pdb.set_trace() to set a breakpoint
  * %debug magic to trace an error

Commands you need to know:

  * s / n / c - step into / step over (next) / continue
  * u / d - move up / down the call stack
  * p - print a variable
  * l - list the source around the current line
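A sketch of where set_trace typically goes (hypothetical function):

import pdb

def buggy_sum(xs):
    total = 0
    for x in xs:
        pdb.set_trace()  # execution pauses here; try p x, p total, then n or c
        total += x
    return total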
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
ima = md.val_ds.denorm(x)[i]
b = md.classes[preds[i]]
ax = show_img(ima, ax=ax)
draw_text(ax, (0, 0), b)
plt.tight_layout()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
It's doing a pretty good job of classifying the largest object!
Now we'll try to find the bounding box of the largest object. This is simply a regression with 4 outputs. So we can use a CSV with multiple 'labels'.
BB_CSV = PATH / 'tmp/bb.csv'
bb = np.array([ trn_lrg_anno[o][0] for o in trn_ids ])
bbs = [' '.join( str(p) for p in o ) for o in bb]
df = pd.DataFrame({
'fn': [ trn_fns[o] for o in trn_ids ],
'bbox': bbs
}, columns=['fn', 'bbox'])
df.to_csv(BB_CSV, index=False)
BB_CSV.open().readlines()[:5] # read up to 5 lines
['fn,bbox\n', '008197.jpg,186 450 226 496\n', '008199.jpg,84 363 374 498\n', '008202.jpg,110 190 371 457\n', '008203.jpg,187 37 359 303\n']
======================================== START OF ASIDE ========================================

The following Pandas data processing pipeline was based on this code snippet, thanks to Phani Srikanth (@binga).

With Pandas, we can do this much more simply than with Python's collections.defaultdict, and quickly get the bounding boxes into fastai CSV format, ready for bounding box regression. The more you get to know Pandas, the more often you realize it's a good way to solve lots of different problems.
with open(PATH / 'pascal_train2007.json') as i:
d = json.load(i)
print(d.keys())
categories = pd.DataFrame(d[CATEGORIES])
annotations = pd.DataFrame(d[ANNOTATIONS])
images = pd.DataFrame(d[IMAGES])
dict_keys(['images', 'type', 'annotations', 'categories'])
images.head()
file_name | height | id | width | |
---|---|---|---|---|
0 | 000012.jpg | 333 | 12 | 500 |
1 | 000017.jpg | 364 | 17 | 480 |
2 | 000023.jpg | 500 | 23 | 334 |
3 | 000026.jpg | 333 | 26 | 500 |
4 | 000032.jpg | 281 | 32 | 500 |
categories.head()
id | name | supercategory | |
---|---|---|---|
0 | 1 | aeroplane | none |
1 | 2 | bicycle | none |
2 | 3 | bird | none |
3 | 4 | boat | none |
4 | 5 | bottle | none |
annotations.head()
area | bbox | category_id | id | ignore | image_id | iscrowd | segmentation | |
---|---|---|---|---|---|---|---|---|
0 | 34104 | [155, 96, 196, 174] | 7 | 1 | 0 | 12 | 0 | [[155, 96, 155, 270, 351, 270, 351, 96]] |
1 | 13110 | [184, 61, 95, 138] | 15 | 2 | 0 | 17 | 0 | [[184, 61, 184, 199, 279, 199, 279, 61]] |
2 | 81326 | [89, 77, 314, 259] | 13 | 3 | 0 | 17 | 0 | [[89, 77, 89, 336, 403, 336, 403, 77]] |
3 | 64227 | [8, 229, 237, 271] | 2 | 4 | 0 | 23 | 0 | [[8, 229, 8, 500, 245, 500, 245, 229]] |
4 | 29505 | [229, 219, 105, 281] | 2 | 5 | 0 | 23 | 0 | [[229, 219, 229, 500, 334, 500, 334, 219]] |
data = (
annotations
.merge(categories, how='left', left_on=CAT_ID, right_on=ID)
.merge(images, how='left', left_on=IMG_ID, right_on=ID)
)
data.head()
area | bbox | category_id | id_x | ignore | image_id | iscrowd | segmentation | id_y | name | supercategory | file_name | height | id | width | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34104 | [155, 96, 196, 174] | 7 | 1 | 0 | 12 | 0 | [[155, 96, 155, 270, 351, 270, 351, 96]] | 7 | car | none | 000012.jpg | 333 | 12 | 500 |
1 | 13110 | [184, 61, 95, 138] | 15 | 2 | 0 | 17 | 0 | [[184, 61, 184, 199, 279, 199, 279, 61]] | 15 | person | none | 000017.jpg | 364 | 17 | 480 |
2 | 81326 | [89, 77, 314, 259] | 13 | 3 | 0 | 17 | 0 | [[89, 77, 89, 336, 403, 336, 403, 77]] | 13 | horse | none | 000017.jpg | 364 | 17 | 480 |
3 | 64227 | [8, 229, 237, 271] | 2 | 4 | 0 | 23 | 0 | [[8, 229, 8, 500, 245, 500, 245, 229]] | 2 | bicycle | none | 000023.jpg | 500 | 23 | 334 |
4 | 29505 | [229, 219, 105, 281] | 2 | 5 | 0 | 23 | 0 | [[229, 219, 229, 500, 334, 500, 334, 219]] | 2 | bicycle | none | 000023.jpg | 500 | 23 | 334 |
largest_bbox = data.pivot_table(index=FILE_NAME, values='area', aggfunc=max).reset_index()
largest_bbox = largest_bbox.merge(data[['area', BBOX, IMG_ID, FILE_NAME, 'name']], how='left')
largest_bbox.head()
file_name | area | bbox | image_id | name | |
---|---|---|---|---|---|
0 | 000012.jpg | 34104 | [155, 96, 196, 174] | 12 | car |
1 | 000017.jpg | 81326 | [89, 77, 314, 259] | 17 | horse |
2 | 000023.jpg | 111101 | [2, 1, 241, 461] | 23 | person |
3 | 000026.jpg | 21824 | [89, 124, 248, 88] | 26 | car |
4 | 000032.jpg | 28832 | [103, 77, 272, 106] | 32 | aeroplane |
# Pandas version of hw_bb (converts a VOC format bbox to fastai format)
def bb_hw_pandas(x):
    # Example, x = [155, 96, 196, 174]
    return [x[1], x[0], x[3] + x[1] - 1, x[2] + x[0] - 1]
# format bbox list as string. convert values to string.
largest_bbox['bbox_new'] = largest_bbox[BBOX].apply(lambda x: bb_hw_pandas(x))
largest_bbox['bbox_str'] = largest_bbox['bbox_new'].apply(lambda x: ' '.join(str(y) for y in x))
largest_bbox.head()
file_name | area | bbox | image_id | name | bbox_new | bbox_str | |
---|---|---|---|---|---|---|---|
0 | 000012.jpg | 34104 | [155, 96, 196, 174] | 12 | car | [96, 155, 269, 350] | 96 155 269 350 |
1 | 000017.jpg | 81326 | [89, 77, 314, 259] | 17 | horse | [77, 89, 335, 402] | 77 89 335 402 |
2 | 000023.jpg | 111101 | [2, 1, 241, 461] | 23 | person | [1, 2, 461, 242] | 1 2 461 242 |
3 | 000026.jpg | 21824 | [89, 124, 248, 88] | 26 | car | [124, 89, 211, 336] | 124 89 211 336 |
4 | 000032.jpg | 28832 | [103, 77, 272, 106] | 32 | aeroplane | [77, 103, 182, 374] | 77 103 182 374 |
largest_bbox[[FILE_NAME, 'bbox_str']].to_csv(BB_CSV, index=False)
!head -n 10 {BB_CSV}
file_name,bbox_str
000012.jpg,96 155 269 350
000017.jpg,77 89 335 402
000023.jpg,1 2 461 242
000026.jpg,124 89 211 336
000032.jpg,77 103 182 374
000033.jpg,106 8 262 498
000034.jpg,166 115 399 359
000035.jpg,97 217 317 464
000036.jpg,78 26 343 318
BB_CSV.open().readlines()[:5] # read up to 5 lines
['file_name,bbox_str\n', '000012.jpg,96 155 269 350\n', '000017.jpg,77 89 335 402\n', '000023.jpg,1 2 461 242\n', '000026.jpg,124 89 211 336\n']
======================================== END OF ASIDE ========================================
f_model = resnet34
sz = 224
bs = 64
Set continuous=True to tell fastai this is a regression problem, which means it won't one-hot encode the labels, and will use MSE as the default crit.

Note that we have to tell the transforms constructor that our labels are coordinates, so that it can handle the transforms correctly.

Also, we use CropType.NO because we want to 'squish' the rectangular images into squares, rather than center cropping, so that we don't accidentally crop out some of the objects. (This is less of an issue in something like ImageNet, where there is a single object to classify, and it's generally large and centrally located.)
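As a refresher, here is the difference between MSE (the default crit here) and the L1 loss we'll switch to shortly, on the same toy errors:

import numpy as np

pred   = np.array([ 96., 155., 269., 350.])
target = np.array([100., 150., 260., 360.])
err = pred - target
print('MSE:', np.mean(err ** 2))     # squaring punishes large coordinate errors much more
print('L1 :', np.mean(np.abs(err)))  # absolute error treats all errors linearly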
augs = [RandomFlip(),
RandomRotate(30),
RandomLighting(0.1,0.1)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4)
idx = 3
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for i, ax in enumerate(axes.flat):
x, y = next(iter(md.aug_dl))
ima = md.val_ds.denorm(to_np(x))[idx]
b = bb_hw(to_np(y[idx]))
print('b:', b)
show_img(ima, ax=ax)
draw_rect(ax, b)
b: [ 1. 89. 499. 192.]  (printed 9 times, once per augmented image - the bbox never changes)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
augs = [RandomFlip(tfm_y=TfmType.COORD),
RandomRotate(30, tfm_y=TfmType.COORD),
RandomLighting(0.1,0.1, tfm_y=TfmType.COORD)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs, tfm_y=TfmType.COORD)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4)
idx = 3
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for i, ax in enumerate(axes.flat):
x, y = next(iter(md.aug_dl))
ima = md.val_ds.denorm(to_np(x))[idx]
b = bb_hw(to_np(y[idx]))
print(b)
show_img(ima, ax=ax)
draw_rect(ax, b)
[  1.  60. 221. 125.]
[  0.  12. 224. 211.]
[  0.   9. 224. 214.]
[  0.  21. 224. 202.]
[  0.   0. 224. 223.]
[  0.  55. 224. 135.]
[  0.  15. 224. 208.]
[  0.  31. 224. 182.]
[  0.  53. 224. 139.]
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Note
You may notice that sometimes the box looks odd, like the middle one in the bottom row. This is a constraint of the information we have: if the object occupied the corners of the original bounding box, the new bounding box needs to be bigger after the image rotates. So be careful not to use too high a rotation with bounding boxes, because there is not enough information for them to stay accurate. If we were doing polygons or segmentations, we would not have this problem.
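To see why the box has to grow, here's a pure-numpy sketch (illustrative only): rotate the four corners of a w x h box and take their axis-aligned extent.

import numpy as np

def rotated_extent(w, h, deg):
    # Axis-aligned extent of a w x h rectangle rotated by deg degrees.
    t = np.deg2rad(deg)
    corners = np.array([[0, 0], [w, 0], [0, h], [w, h]], dtype=float)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    pts = corners @ rot.T
    return pts.max(axis=0) - pts.min(axis=0)

print(rotated_extent(200, 100, 30))  # ~[223.2, 186.6]: wider and taller than the original box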
tfm_y = TfmType.COORD
augs = [RandomFlip(tfm_y=tfm_y),
RandomRotate(3, p=0.5, tfm_y=tfm_y),
RandomLighting(0.05,0.05, tfm_y=tfm_y)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=tfm_y, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, bs=bs, continuous=True)
fastai lets you use a custom_head to add your own module on top of a convnet, instead of the adaptive pooling and fully connected net which is added by default. In this case, we don't want to do any pooling, since we need to know the activations of each grid cell.

The final layer has 4 activations, one per bounding box coordinate. Our target is continuous, not categorical, so the MSE loss function used does not apply any sigmoid or softmax to the module outputs.
512*7*7
25088
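Flatten here is fastai's helper module; a minimal equivalent sketch (my assumption about its behaviour, consistent with the 25088 input size of the linear layer):

import torch.nn as nn

class Flatten(nn.Module):
    # Collapse everything after the batch dim: (N, 512, 7, 7) -> (N, 25088)
    def forward(self, x):
        return x.view(x.size(0), -1)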
head_reg4 = nn.Sequential(Flatten(), nn.Linear(512*7*7, 4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()
learn.summary()
OrderedDict([('Conv2d-1', OrderedDict([('input_shape', [-1, 3, 224, 224]), ('output_shape', [-1, 64, 112, 112]), ('trainable', False), ('nb_params', 9408)])), ('BatchNorm2d-2', OrderedDict([('input_shape', [-1, 64, 112, 112]), ('output_shape', [-1, 64, 112, 112]), ('trainable', False), ('nb_params', 128)])), ('ReLU-3', OrderedDict([('input_shape', [-1, 64, 112, 112]), ('output_shape', [-1, 64, 112, 112]), ('nb_params', 0)])), ('MaxPool2d-4', OrderedDict([('input_shape', [-1, 64, 112, 112]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-5', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-6', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-7', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-8', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-9', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-10', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('BasicBlock-11', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-12', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-13', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-14', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-15', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-16', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-17', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('BasicBlock-18', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-19', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-20', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-21', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-22', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 36864)])), ('BatchNorm2d-23', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('trainable', False), ('nb_params', 128)])), ('ReLU-24', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('BasicBlock-25', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 64, 56, 56]), ('nb_params', 0)])), ('Conv2d-26', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 73728)])), ('BatchNorm2d-27', 
OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-28', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-29', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-30', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('Conv2d-31', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 8192)])), ('BatchNorm2d-32', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-33', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('BasicBlock-34', OrderedDict([('input_shape', [-1, 64, 56, 56]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-35', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-36', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-37', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-38', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-39', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-40', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('BasicBlock-41', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-42', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-43', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-44', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-45', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-46', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-47', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('BasicBlock-48', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-49', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 147456)])), ('BatchNorm2d-50', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-51', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-52', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), 
('nb_params', 147456)])), ('BatchNorm2d-53', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('trainable', False), ('nb_params', 256)])), ('ReLU-54', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('BasicBlock-55', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 128, 28, 28]), ('nb_params', 0)])), ('Conv2d-56', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 294912)])), ('BatchNorm2d-57', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-58', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-59', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-60', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('Conv2d-61', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 32768)])), ('BatchNorm2d-62', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-63', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-64', OrderedDict([('input_shape', [-1, 128, 28, 28]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-65', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-66', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-67', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-68', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-69', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-70', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-71', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-72', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-73', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-74', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-75', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-76', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-77', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-78', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', 
[-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-79', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-80', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-81', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-82', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-83', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-84', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-85', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-86', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-87', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-88', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-89', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-90', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-91', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-92', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-93', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-94', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-95', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-96', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 589824)])), ('BatchNorm2d-97', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('trainable', False), ('nb_params', 512)])), ('ReLU-98', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('BasicBlock-99', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 256, 14, 14]), ('nb_params', 0)])), ('Conv2d-100', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1179648)])), ('BatchNorm2d-101', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-102', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Conv2d-103', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-104', OrderedDict([('input_shape', [-1, 512, 7, 7]), 
('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('Conv2d-105', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 131072)])), ('BatchNorm2d-106', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-107', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('BasicBlock-108', OrderedDict([('input_shape', [-1, 256, 14, 14]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Conv2d-109', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-110', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-111', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Conv2d-112', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-113', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-114', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('BasicBlock-115', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Conv2d-116', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-117', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-118', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Conv2d-119', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 2359296)])), ('BatchNorm2d-120', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('trainable', False), ('nb_params', 1024)])), ('ReLU-121', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('BasicBlock-122', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 512, 7, 7]), ('nb_params', 0)])), ('Flatten-123', OrderedDict([('input_shape', [-1, 512, 7, 7]), ('output_shape', [-1, 25088]), ('nb_params', 0)])), ('Linear-124', OrderedDict([('input_shape', [-1, 25088]), ('output_shape', [-1, 4]), ('trainable', True), ('nb_params', 100356)]))])
learn.lr_find(1e-5, 100)
learn.sched.plot(5)
78%|███████▊ | 25/32 [00:11<00:03, 2.14it/s, loss=529]
lr = 2e-3
learn.fit(lr, 2, cycle_len=1, cycle_mult=2)
epoch  trn_loss   val_loss
0      48.960351  35.755788
1      37.135304  29.60765
2      31.466736  29.009163
[array([29.00916])]
lrs = np.array([lr/100, lr/10, lr])
learn.freeze_to(-2)
lrf = learn.lr_find(lrs/1000)
learn.sched.plot(1)
epoch  trn_loss  val_loss
0      82.31227  1.4744848065204166e+17
learn.fit(lrs, 2, cycle_len=1, cycle_mult=2)
epoch  trn_loss   val_loss
0      25.858838  25.091344
1      22.565964  22.855172
2      19.391733  21.236308
[array([21.23631])]
learn.freeze_to(-3)
learn.fit(lrs, 1, cycle_len=2)
epoch  trn_loss   val_loss
0      18.009395  21.977178
1      16.113632  20.927288
[array([20.92729])]
learn.save('reg4')
learn.load('reg4')
x, y = next(iter(md.val_dl))
learn.model.eval()
preds = to_np(learn.model(VV(x)))
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
ima = md.val_ds.denorm(to_np(x))[i]
b = bb_hw(preds[i])
ax = show_img(ima, ax=ax)
draw_rect(ax, b)
plt.tight_layout()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
f_model=resnet34
sz=224
bs=64
val_idxs = get_cv_idxs(len(trn_fns))
======================================== Start Debugging - CSV data ========================================
CSV_FILES = PATH / 'tmp'
!ls {CSV_FILES}
bb.csv lrg.csv
CSV of the bounding box of the largest object. This is simply a regression with 4 outputs (predicted values). So we can use a CSV with multiple 'labels'.
!head -n 10 {CSV_FILES}/bb.csv
fn,bbox
008197.jpg,186 450 226 496
008199.jpg,84 363 374 498
008202.jpg,110 190 371 457
008203.jpg,187 37 359 303
000012.jpg,96 155 269 350
008204.jpg,144 142 335 265
000017.jpg,77 89 335 402
008211.jpg,181 77 499 281
008213.jpg,125 291 166 330
CSV of the image filename and the class of the largest object (from annotations JSON).
!head -n 10 {CSV_FILES}/lrg.csv
fn,cat
008197.jpg,car
008199.jpg,person
008202.jpg,cow
008203.jpg,sofa
000012.jpg,car
008204.jpg,person
000017.jpg,horse
008211.jpg,person
008213.jpg,chair
======================================== End Debugging - CSV data ========================================
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs)
# Model data for bounding box of the largest object.
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
bs=bs, continuous=True, val_idxs=val_idxs)
# Model data for classification of the largest object.
md2 = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms_from_model(f_model, sz))
A dataset can be anything with __len__ and __getitem__. Here's a dataset that adds a 2nd label to an existing dataset:
class ConcatLblDataset(Dataset):
"""
A dataset that adds a second label to an existing dataset.
"""
def __init__(self, ds, y2):
self.ds, self.y2 = ds, y2
def __len__(self):
return len(self.ds)
def __getitem__(self, i):
x, y = self.ds[i]
return (x, (y, self.y2[i]))
We'll use it to add the classes to the bounding box labels.
trn_ds2 = ConcatLblDataset(md.trn_ds, md2.trn_y)
val_ds2 = ConcatLblDataset(md.val_ds, md2.val_y)
# Grab the two labels (bounding box & class) from a record in the validation dataset.
val_ds2[0][1]  # record at index 0; the labels are at index 1 (the input image x is at index 0, which we skip)
(array([ 0., 1., 223., 178.], dtype=float32), 14)
We can replace the dataloaders' datasets with these new ones.
md.trn_dl.dataset = trn_ds2
md.val_dl.dataset = val_ds2
We have to denormalize the images from the dataloader before they can be plotted.
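Denormalization just inverts the channel-wise normalization applied at load time. A sketch of the idea, assuming the ImageNet statistics that the resnet34 transforms use:

import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std  = np.array([0.229, 0.224, 0.225])

def denorm_sketch(batch):
    # batch: (N, H, W, 3) normalized images -> roughly [0, 1] pixel values
    return batch * imagenet_std + imagenet_mean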
idx = 9
x, y = next(iter(md.val_dl)) # x is image array, y is labels
# Debug y variable
print(f'type of y: {type(y)}, y length: {len(y)}')
print(y[0].size()) # bounding box top-left coord & bottom-right coord values
print(y[1].size()) # object category (class)
type of y: <class 'list'>, y length: 2
torch.Size([64, 4])
torch.Size([64])
# y[0] returns 64 sets of bounding box labels.
# Here we only grab the first 2 images' bounding boxes. The returned data type is a PyTorch FloatTensor on the GPU.
print(y[0][:2])
# Grab the first 2 images' object classes. The returned data type is a PyTorch LongTensor on the GPU.
print(y[1][:2])
  0    1  223  178
  7  123  186  194
[torch.cuda.FloatTensor of size 2x4 (GPU 0)]

 14
  3
[torch.cuda.LongTensor of size 2 (GPU 0)]
# Debug x data from GPU
x.size() # batch of 64 images, each image with 3 channels and size of 224x224
torch.Size([64, 3, 224, 224])
# Debug x data from CPU
to_np(x).shape
(64, 3, 224, 224)
ima = md.val_ds.ds.denorm(to_np(x))[idx] # reverse the normalization done to a batch of images.
b = bb_hw(to_np(y[0][idx]))
b
array([134., 148., 36., 48.])
ax = show_img(ima)
draw_rect(ax, b)
draw_text(ax, b[:2], md2.classes[y[1][idx]])
We need one output activation for each class (for its probability) plus one for each bounding box coordinate. We'll use an extra linear layer this time, plus some dropout, to help us train a more flexible model.
head_reg4 = nn.Sequential(
Flatten(),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(25088, 256),
nn.ReLU(),
nn.BatchNorm1d(256),
nn.Dropout(0.5),
nn.Linear(256, 4 + len(cats))
)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam
# DEBUG: what's inside cats
print(type(cats))
print(len(cats))
print('%s, %s' % (cats[1], cats[2]))
<class 'dict'>
20
aeroplane, bicycle
Code comments:

  * input: activations.
  * target: ground truth.
  * bb_t, c_t = target: our custom dataset returns a tuple containing the bounding box coordinates and the classes. This assignment destructures them.
  * bb_i, c_i = input[:, :4], input[:, 4:]: the first : is the batch dimension, e.g. 64 (for 64 images).
  * bb_i = F.sigmoid(bb_i) * 224: we know our image is 224 by 224. Sigmoid forces the value to be between 0 and 1, and multiplying by 224 puts it in the range our bounding box coordinates have to be.

def detn_loss(input, target):
"""
Loss function for the position and class of the largest object in the image.
"""
bb_t, c_t = target
# bb_i: the 4 values for the bbox
# c_i: the 20 classes `len(cats)`
bb_i, c_i = input[:, :4], input[:, 4:]
bb_i = F.sigmoid(bb_i) * 224 # scale bbox values to stay between 0 and 224 (224 is the max img width or height)
bb_l = F.l1_loss(bb_i, bb_t) # bbox loss
clas_l = F.cross_entropy(c_i, c_t) # object class loss
# I looked at these quantities separately first then picked a multiplier
# to make them approximately equal
return bb_l + clas_l * 20
def detn_l1(input, target):
    """
    L1 loss for the first 4 activations (the bounding box coordinates).
    L1 loss is like mean squared error, but instead of the sum of squared errors it uses the sum of absolute values.
    """
bb_t, _ = target
bb_i = input[:, :4]
bb_i = F.sigmoid(bb_i) * 224
return F.l1_loss(V(bb_i), V(bb_t)).data
def detn_acc(input, target):
"""
Accuracy
"""
_, c_t = target
c_i = input[:, 4:]
return accuracy(c_i, c_t)
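A quick shape sanity-check for detn_loss with made-up tensors (batch of 64, the 20 Pascal classes; with fastai 0.7-era PyTorch you'd wrap these in V() first):

import torch

inp  = torch.randn(64, 4 + 20)      # 4 bbox activations + 20 class scores per image
bb_t = torch.rand(64, 4) * 224      # fake bbox targets in pixel space
c_t  = torch.randint(0, 20, (64,))  # fake class targets
print(detn_loss(inp, (bb_t, c_t)))  # a single scalar: bb_l + 20 * clas_l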
learn.crit = detn_loss
learn.metrics = [detn_acc, detn_l1]
# With the metrics defined, we find the learning rate
learn.lr_find()
learn.sched.plot()
97%|█████████▋| 31/32 [00:13<00:00, 2.32it/s, loss=478]
lr = 1e-2
learn.fit(lr, 1, cycle_len=3, use_clr=(32, 5))
epoch  trn_loss   val_loss   detn_acc  detn_l1
0      71.055205  48.157942  0.754     33.202651
1      51.411235  39.722549  0.776     26.363626
2      42.721873  38.36225   0.786     25.658993
[array([38.36225]), 0.7860000019073486, 25.65899333190918]
learn.save('reg1_0')
learn.freeze_to(-2)
lrs = np.array([lr/100, lr/10, lr])
learn.lr_find(lrs/1000)
learn.sched.plot(0)
91%|█████████ | 29/32 [00:19<00:02, 1.47it/s, loss=331]
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(32, 10))
epoch  trn_loss   val_loss   detn_acc  detn_l1
0      36.650519  37.198765  0.768     23.865814
1      30.822986  36.280846  0.776     22.743629
2      26.792856  35.199342  0.756     21.564384
3      23.786961  33.644777  0.794     20.626075
4      21.58091   33.194585  0.788     20.520627
[array([33.19459]), 0.788, 20.52062666320801]
learn.save('reg1_1')
learn.load('reg1_1')
learn.unfreeze()
learn.fit(lrs/10, 1, cycle_len=10, use_clr=(32, 10))
epoch  trn_loss   val_loss   detn_acc  detn_l1
0      19.133272  33.833656  0.804     20.774298
1      18.754909  35.271939  0.77      20.572007
2      17.824877  35.099138  0.776     20.494296
3      16.8321    33.782667  0.792     20.139132
4      15.968     33.525141  0.788     19.848904
5      15.356815  33.827995  0.782     19.483242
6      14.589975  33.49683   0.778     19.531291
7      13.811117  33.022376  0.794     19.462907
8      13.238251  33.300647  0.794     19.423868
9      12.613972  33.260653  0.788     19.346758
[array([33.26065]), 0.7880000019073486, 19.34675830078125]
learn.save('reg1')
learn.load('reg1')
y = learn.predict()
x, _ = next(iter(md.val_dl))
from scipy.special import expit
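expit is the logistic sigmoid 1 / (1 + exp(-x)), i.e. the numpy counterpart of the F.sigmoid we applied inside the loss; multiplying by 224 maps the raw activations back into pixel space:

import numpy as np
from scipy.special import expit

print(expit(0.0))                          # 0.5
print(expit(np.array([-4.0, 4.0])) * 224)  # ~[4.0, 220.0]: raw activations -> pixel coords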
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
ima = md.val_ds.ds.denorm(to_np(x))[i]
bb = expit(y[i][:4]) * 224
b = bb_hw(bb)
c = np.argmax(y[i][4:])
ax = show_img(ima, ax=ax)
draw_rect(ax, b)
draw_text(ax, b[:2], md2.classes[c])
plt.tight_layout()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).