Revisiting Unreasonable Effectiveness of Data in Deep Learning Era: https://arxiv.org/abs/1707.02968
Building Datasets
Creating Datasets / Crowdsourcing
Mindcontrol: A web application for brain segmentation quality control: https://www.sciencedirect.com/science/article/pii/S1053811917302707
Combining citizen science and deep learning to amplify expertise in neuroimaging: https://www.biorxiv.org/content/10.1101/363382v1.abstract
Augmentation
Most of you taking this class are rightfully excited to learn about new tools and algorithms for analyzing your data. This lecture is a bit of an anomaly, and perhaps a disappointment, because it doesn't cover any algorithms or tools; it covers data.
Data probably isn't "the new oil", but it is an essential component of building modern tools.
Testing good algorithms requires good data
If you don't know what to expect how do you know your algorithm worked?
If you have dozens of edge cases how can you make sure it works on each one?
If a new algorithm is developed every few hours, how can you be confident each version actually works better? (Facebook's site gets a new version multiple times per day and its app every other day.)
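One way to build that confidence is a fixed, labeled evaluation set that every new version is scored against. A minimal sketch of the idea (the `count_cells_*` functions and the synthetic "images" are hypothetical stand-ins, not from the lecture):

```python
import numpy as np

# A fixed, labeled evaluation set: inputs paired with known ground truth.
# Synthetic stand-ins for real annotated images.
rng = np.random.RandomState(42)
eval_images = [rng.rand(64, 64) for _ in range(20)]
true_counts = [int(im.sum() // 100) for im in eval_images]

def count_cells_v1(im):
    # hypothetical old version of the algorithm
    return int((im > 0.5).sum() // 80)

def count_cells_v2(im):
    # hypothetical new version of the algorithm
    return int(im.sum() // 100)

def mean_abs_error(algo):
    # score any version against the same fixed ground truth
    preds = [algo(im) for im in eval_images]
    return float(np.mean([abs(p - t) for p, t in zip(preds, true_counts)]))

err_v1 = mean_abs_error(count_cells_v1)
err_v2 = mean_abs_error(count_cells_v2)
# only ship the new version if it scores better on the same labeled data
```

Because every version is scored on the same labeled examples, "works better" becomes a number rather than an impression.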
For machine learning, even building requires good data
If you are counting cells, maybe you can write your own algorithm,
but if you are trying to detect subtle changes in cell structure that indicate cancer, you probably can't write out a list of simple mathematical rules yourself.
Well-organized and structured data is very easy to reuse: another project can combine your data with theirs to get even better results.
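When two datasets share the same structure (same image shape, same label scheme), pooling them is a one-liner. A sketch with numpy arrays (the two datasets here are hypothetical placeholders):

```python
import numpy as np

# two hypothetical labeled datasets with identical structure
imgs_a = np.zeros((50, 28, 28), dtype='uint8')
labels_a = np.zeros(50, dtype='uint8')
imgs_b = np.ones((30, 28, 28), dtype='uint8')
labels_b = np.ones(30, dtype='uint8')

# consistent structure makes combining them trivial
imgs = np.concatenate([imgs_a, imgs_b])
labels = np.concatenate([labels_a, labels_b])
# imgs.shape is now (80, 28, 28)
```

The hard part in practice is agreeing on that shared structure up front, not the concatenation itself.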
The value of good datasets is best shown through the most famous ones ever collected. Here are two of the most famous general datasets and one of the most famous medical datasets.
Each of these datasets is very different, from images with fewer than 1000 pixels to images with more than 100 megapixels, but what they have in common is how their analysis has changed.
All of these datasets used to be analyzed by domain experts with hand-crafted features.
Starting in the early 2010s, deep learning approaches began to improve and become more computationally efficient. With these techniques, groups with absolutely no domain knowledge could begin building algorithms and winning contests based on these datasets.
No, that isn't the point of this lecture. Even if you aren't using deep learning, the point of these stories is that having well-labeled, structured, and organized datasets makes your problem much more accessible to other groups and enables a variety of different approaches to be tried. Ultimately it enables better solutions to be built, and lets you be confident that those solutions are in fact better.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from skimage.util import montage as montage2d
%matplotlib inline
(img, label), _ = mnist.load_data()
fig, m_axs = plt.subplots(5, 5, figsize=(9, 9))
for c_ax, c_img, c_label in zip(m_axs.flatten(), img, label):
    c_ax.imshow(c_img, cmap='gray')
    c_ax.set_title(c_label)
    c_ax.axis('off')
A number of issues can come up with datasets, for example biased or unrepresentative sampling, and these lead to problems with the algorithms built on top of them. A famous example:
Google Photos, y'all *** up. My friend's not a gorilla. pic.twitter.com/SMkMCsNVX4
— I post from https://v2.jacky.wtf. 🆓 != safe. (@jackyalcine) June 29, 2015
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from skimage.util import montage as montage2d
%matplotlib inline
(img, label), _ = mnist.load_data()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
d_subset = np.where(np.in1d(label, [1, 2, 3]))[0]
ax1.imshow(montage2d(img[d_subset[:64]]), cmap='gray')
ax1.set_title('Images')
ax1.axis('off')
ax2.hist(label[d_subset[:64]], np.arange(11))
ax2.set_title('Digit Distribution')
Most groups have too little well-labeled data, and labeling new examples can be very expensive. Additionally, there might not be many examples of specific classes. In medicine this is particularly problematic, because some diseases might occur only a few times in a given hospital, yet you still want to recognize the disease and not that particular patient.
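One common mitigation for rare classes is to weight them more heavily during training. A sketch of inverse-frequency class weights (this weighting scheme is one common convention, not something prescribed by the lecture):

```python
import numpy as np

# hypothetical labels with a rare "disease" class: 95 healthy, 5 sick
labels = np.array([0] * 95 + [1] * 5)

classes, counts = np.unique(labels, return_counts=True)
# inverse-frequency weights: rare classes get proportionally larger weight
weights = {int(c): len(labels) / (len(classes) * n)
           for c, n in zip(classes, counts)}
# each rare example now contributes 19x as much to the loss as a common one
```

Dictionaries in this form can be passed to many training APIs (e.g. as `class_weight`) so that the model cannot win by simply ignoring the rare class.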
ImageDataGenerator(
    featurewise_center=False,
    samplewise_center=False,
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False,
    zca_epsilon=1e-06,
    rotation_range=0.0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    shear_range=0.0,
    zoom_range=0.0,
    channel_shift_range=0.0,
    fill_mode='nearest',
    cval=0.0,
    horizontal_flip=False,
    vertical_flip=False,
    rescale=None,
    preprocessing_function=None,
    data_format=None,
)
Docstring:
Generate minibatches of image data with real-time data augmentation.
# Arguments
featurewise_center: set input mean to 0 over the dataset.
samplewise_center: set each sample mean to 0.
featurewise_std_normalization: divide inputs by std of the dataset.
samplewise_std_normalization: divide each input by its std.
zca_whitening: apply ZCA whitening.
zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
rotation_range: degrees (0 to 180).
width_shift_range: fraction of total width, if < 1, or pixels if >= 1.
height_shift_range: fraction of total height, if < 1, or pixels if >= 1.
shear_range: shear intensity (shear angle in degrees).
zoom_range: amount of zoom. if scalar z, zoom will be randomly picked
in the range [1-z, 1+z]. A sequence of two can be passed instead
to select this range.
channel_shift_range: shift range for each channel.
fill_mode: points outside the boundaries are filled according to the
given mode ('constant', 'nearest', 'reflect' or 'wrap'). Default
is 'nearest'.
Points outside the boundaries of the input are filled according to the given mode:
'constant': kkkkkkkk|abcd|kkkkkkkk (cval=k)
'nearest': aaaaaaaa|abcd|dddddddd
'reflect': abcddcba|abcd|dcbaabcd
'wrap': abcdabcd|abcd|abcdabcd
cval: value used for points outside the boundaries when fill_mode is
'constant'. Default is 0.
horizontal_flip: whether to randomly flip images horizontally.
vertical_flip: whether to randomly flip images vertically.
rescale: rescaling factor. If None or 0, no rescaling is applied,
otherwise we multiply the data by the value provided. This is
applied after the `preprocessing_function` (if any provided)
but before any other transformation.
preprocessing_function: function that will be applied on each input.
The function will run before any other modification on it.
The function should take one argument:
one image (Numpy tensor with rank 3), and should output a
Numpy tensor with the same shape.
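The fill modes described above map directly onto `numpy.pad` modes, which makes them easy to see on a tiny array. Note the naming differences: Keras 'nearest' corresponds to numpy's 'edge', and Keras 'reflect' (edge-including) to numpy's 'symmetric'.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# Keras fill_mode -> numpy.pad mode, padding 2 values on each side
nearest = np.pad(x, 2, mode='edge')       # [1 1 | 1 2 3 4 | 4 4]
reflect = np.pad(x, 2, mode='symmetric')  # [2 1 | 1 2 3 4 | 4 3]
wrap = np.pad(x, 2, mode='wrap')          # [3 4 | 1 2 3 4 | 1 2]
constant = np.pad(x, 2, mode='constant', constant_values=0)  # [0 0 | 1 2 3 4 | 0 0]
```

Trying this on a one-dimensional array is the quickest way to decide which mode makes sense for your images, since the wrong choice can introduce visible seams or ghost structures at the borders.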
from keras.preprocessing.image import ImageDataGenerator
img_aug = ImageDataGenerator(
featurewise_center=False,
samplewise_center=False,
zca_whitening=False,
zca_epsilon=1e-06,
rotation_range=30.0,
width_shift_range=0.25,
height_shift_range=0.25,
shear_range=0.25,
zoom_range=0.5,
fill_mode='nearest',
horizontal_flip=False,
vertical_flip=False
)
Even something as simple as labeling digits can be very time-consuming (maybe 1-2 digits per second).
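At that rate the cost adds up quickly; a back-of-the-envelope estimate for labeling all of MNIST (the 1.5 digits/second rate is just the midpoint of the guess above):

```python
n_images = 60000        # MNIST training set size
rate_per_second = 1.5   # optimistic labeling speed
hours = n_images / rate_per_second / 3600
# roughly 11 hours of non-stop labeling, for the simplest imaginable task
```

For harder tasks like outlining tumors, each label can take minutes rather than a fraction of a second, which is why augmentation (stretching existing labels further) is so attractive.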
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
%matplotlib inline
(img, label), _ = mnist.load_data()
img = np.expand_dims(img, -1)
fig, m_axs = plt.subplots(4, 10, figsize=(16, 10))
# setup augmentation
img_aug.fit(img)
real_aug = img_aug.flow(img[:10], label[:10], shuffle=False)
for c_axs, do_augmentation in zip(m_axs, [False, True, True, True]):
    if do_augmentation:
        img_batch, label_batch = next(real_aug)
    else:
        img_batch, label_batch = img, label
    for c_ax, c_img, c_label in zip(c_axs, img_batch, label_batch):
        c_ax.imshow(c_img[:, :, 0], cmap='gray', vmin=0, vmax=255)
        c_ax.set_title('{}\n{}'.format(
            c_label, 'aug' if do_augmentation else ''))
        c_ax.axis('off')
We can use a more exciting dataset to try some of the other features of augmentation.
from keras.datasets import cifar10
(img, label), _ = cifar10.load_data()
img_aug = ImageDataGenerator(
featurewise_center=True,
samplewise_center=False,
zca_whitening=False,
zca_epsilon=1e-06,
rotation_range=30.0,
width_shift_range=0.25,
height_shift_range=0.25,
channel_shift_range=0.25,
shear_range=0.25,
zoom_range=1,
fill_mode='reflect',
horizontal_flip=True,
vertical_flip=True
)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fig, m_axs = plt.subplots(4, 10, figsize=(16, 12))
# setup augmentation
img_aug.fit(img)
real_aug = img_aug.flow(img[:10], label[:10], shuffle=False)
for c_axs, do_augmentation in zip(m_axs, [False, True, True, True]):
    if do_augmentation:
        img_batch, label_batch = next(real_aug)
        img_batch -= img_batch.min()
        img_batch = np.clip(img_batch / img_batch.max() * 255,
                            0, 255).astype('uint8')
    else:
        img_batch, label_batch = img, label
    for c_ax, c_img, c_label in zip(c_axs, img_batch, label_batch):
        c_ax.imshow(c_img)
        c_ax.set_title('{}\n{}'.format(
            c_label[0], 'aug' if do_augmentation else ''))
        c_ax.axis('off')
A good sanity check for any dataset is a dummy baseline that ignores the input entirely and always predicts the most frequent class.
from sklearn.dummy import DummyClassifier
dc = DummyClassifier(strategy='most_frequent')
dc.fit([[0], [1], [2], [3]],
       ['Healthy', 'Healthy', 'Healthy', 'Cancer'])
dc.predict([[0]]), dc.predict([[1]]), dc.predict([[3]]), dc.predict([[100]])
(array(['Healthy'], dtype='<U7'), array(['Healthy'], dtype='<U7'), array(['Healthy'], dtype='<U7'), array(['Healthy'], dtype='<U7'))
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from skimage.util import montage as montage2d
%matplotlib inline
(img, label), _ = mnist.load_data()
fig, m_axs = plt.subplots(5, 5, figsize=(12, 12))
m_axs[0, 0].hist(label[:24], np.arange(11))
m_axs[0, 0].set_title('Digit Distribution')
for i, c_ax in enumerate(m_axs.flatten()[1:]):
    c_ax.imshow(img[i], cmap='gray')
    c_ax.set_title(label[i])
    c_ax.axis('off')
dc = DummyClassifier(strategy='most_frequent')
dc.fit(img[:24].reshape((24, -1)), label[:24])
dc.predict(img[0:10].reshape((10, -1)))
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint8)
fig, m_axs = plt.subplots(4, 6, figsize=(12, 12))
for i, c_ax in enumerate(m_axs.flatten()):
    c_ax.imshow(img[i], cmap='gray')
    c_ax.set_title('{}\nPredicted: {}'.format(
        label[i], dc.predict(img[i].reshape((1, -1)))[0]))
    c_ax.axis('off')
This isn't a machine learning class, so we won't dive deeply into other methods, but nearest neighbor is often a very good baseline (and also very easy to understand): you simply take the label of the training example that is closest to the image you show.
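The idea fits in a few lines of numpy: compute the distance from a query to every stored example and return the label of the closest one. A from-scratch sketch (not the sklearn implementation used below):

```python
import numpy as np

def one_nn_predict(train_x, train_y, query):
    # flatten images to vectors and compute Euclidean distance to each
    diffs = train_x.reshape(len(train_x), -1) - query.reshape(1, -1)
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # return the label of the single closest training example
    return train_y[np.argmin(dists)]

# tiny sanity check with 2x2 "images"
train_x = np.array([[[0, 0], [0, 0]], [[9, 9], [9, 9]]], dtype=float)
train_y = np.array(['dark', 'bright'])
print(one_nn_predict(train_x, train_y, np.array([[1, 1], [1, 0]])))  # dark
```

Everything interesting in a real nearest-neighbor classifier (k > 1, distance metrics, search structures) is a refinement of these few lines.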
from sklearn.neighbors import KNeighborsClassifier
neigh_class = KNeighborsClassifier(n_neighbors=1)
neigh_class.fit(img[:24].reshape((24, -1)), label[:24])
# predict on a few images
neigh_class.predict(img[0:10].reshape((10, -1)))
array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=uint8)
fig, m_axs = plt.subplots(4, 6, figsize=(12, 12))
for i, c_ax in enumerate(m_axs.flatten()):
    c_ax.imshow(img[i], cmap='gray')
    c_ax.set_title('{}\nPredicted: {}'.format(
        label[i], neigh_class.predict(img[i].reshape((1, -1)))[0]))
    c_ax.axis('off')
Wow, the model works really well: it got every example right. But what we did here (a common mistake) was evaluate on the same data we 'trained' on, which means the model simply recalled each example. If we try it on new images, we see the performance drop, though the results are still reasonable.
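The standard fix is to hold out data the model never sees during fitting. A minimal sketch with scikit-learn's `train_test_split` on synthetic data (not the lecture's dataset; with random labels, the held-out accuracy should hover near chance):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hypothetical small dataset: 100 flattened 8x8 "images", 2 random classes
rng = np.random.RandomState(0)
X = rng.rand(100, 64)
y = rng.randint(0, 2, 100)

# hold out 25% of the data purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)  # 1.0: 1-NN memorizes its training set
test_acc = clf.score(X_test, y_test)     # honest estimate on unseen data
```

The gap between `train_acc` and `test_acc` is exactly the memorization effect seen above with the first 24 MNIST digits.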
fig, m_axs = plt.subplots(4, 6, figsize=(12, 12))
for i, c_ax in enumerate(m_axs.flatten(), 25):
    c_ax.imshow(img[i], cmap='gray')
    c_ax.set_title('{}\nPredicted: {}'.format(
        label[i], neigh_class.predict(img[i].reshape((1, -1)))[0]))
    c_ax.axis('off')
from sklearn.metrics import accuracy_score, confusion_matrix
pred_values = neigh_class.predict(img[24:].reshape((-1, 28 * 28)))
# print_confusion_matrix is a plotting helper defined elsewhere in the course materials
ax1 = print_confusion_matrix(confusion_matrix(label[24:], pred_values),
                             class_names=range(10))
ax1.set_title('Accuracy: {:2.2%}'.format(accuracy_score(label[24:], pred_values)))