Description of data sources

Here we give a brief description of where to find the main data sets used in this tutorial. Detailed descriptions of how to work with this data once it has been downloaded are given within the main tutorial content (links given below).


MNIST handwritten digits

This is arguably the best-known benchmark data set for pattern recognition tasks. The data is available at

http://yann.lecun.com/exdb/mnist/

for anyone with an internet connection. No registration is required.

We assume that the raw data is stored in the data/mnist directory, relative to your working directory. It can be downloaded and decompressed as follows.

$ mkdir -p data/mnist
$ cd data/mnist
$ wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
$ wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
$ wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
$ wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
$ gunzip *.gz
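
As a quick sanity check on the decompressed files, one can inspect the IDX header at the start of each file. The sketch below assumes the IDX layout described on the MNIST page (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension). The function name read_idx_header is our own, introduced for illustration only.

```python
import struct


def read_idx_header(raw):
    """Return (type_code, dims) parsed from the leading bytes of an IDX file.

    Assumes the layout documented on the MNIST page: bytes 0-1 are zero,
    byte 2 is the data type code (0x08 means unsigned byte), byte 3 is the
    number of dimensions, followed by one big-endian uint32 per dimension.
    """
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return type_code, dims
```

For example, passing the first 16 bytes of train-images-idx3-ubyte should give (8, (60000, 28, 28)), and the first 8 bytes of train-labels-idx1-ubyte should give (8, (60000,)).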

From here, we can get to work.

CIFAR-10 tiny images

Let's prepare a sub-directory called data/cifar10 to store the data. There are three versions of the data available: a Python version, a MATLAB version, and a binary version. While the Python version is perfectly acceptable, here we use the binary version.

$ mkdir -p data/cifar10
$ cd data/cifar10
$ wget http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
$ tar -xzf cifar-10-binary.tar.gz

A directory cifar-10-batches-bin is created, with content as follows:

$ ls cifar-10-batches-bin
batches.meta.txt  data_batch_2.bin  data_batch_4.bin  readme.html
data_batch_1.bin  data_batch_3.bin  data_batch_5.bin  test_batch.bin
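
To confirm that the batch files unpacked correctly, one can split a file into its fixed-length records. This is only a sketch: it assumes the binary layout described on the CIFAR-10 page, where each record is one label byte followed by 3072 pixel bytes (a 32x32 image stored as the red plane, then green, then blue). The name split_cifar_records is hypothetical.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # one label byte, then 3072 pixel bytes


def split_cifar_records(raw):
    """Split the raw bytes of one binary batch into (label, pixels) pairs."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("size is not a multiple of %d bytes" % RECORD_BYTES)
    records = []
    for start in range(0, len(raw), RECORD_BYTES):
        label = raw[start]  # integer class index, 0 through 9
        pixels = raw[start + 1:start + RECORD_BYTES]  # R, G, B planes in order
        records.append((label, pixels))
    return records
```

Reading any one of the .bin files in binary mode and passing its contents to this function should yield 10000 records, since each file is 10000 x 3073 = 30730000 bytes.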

From here, we can get to work.

Fisher's Iris data set

The data is hosted at the UCI Machine Learning Repository, and can be downloaded as follows.

$ mkdir -p data/iris
$ cd data/iris
$ wget [ URL ]/bezdekIris.data
$ wget [ URL ]/iris.data
$ wget [ URL ]/iris.names

where [ URL ] is as follows:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris
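
A short check that the download is intact is to parse iris.data and count the rows. The sketch below assumes the comma-separated layout described in iris.names (four numeric measurements followed by the class name) and skips the blank lines at the end of the file. The function name parse_iris is our own.

```python
import csv
import io


def parse_iris(text):
    """Parse the text of iris.data into a list of (features, label) pairs."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # csv.reader yields an empty list for blank lines
            continue
        # Sepal length, sepal width, petal length, petal width (in cm).
        features = tuple(float(x) for x in fields[:4])
        rows.append((features, fields[4]))
    return rows
```

Applied to the full contents of iris.data, this should return 150 rows, 50 for each of the three classes.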

From here, we can get to work.

Visual image data

The vim-2 data set, also known as the "Gallant Lab Natural Movie 4T fMRI Data set", is available from the website of Collaborative Research in Computational Neuroscience (CRCNS), at the following URL:

https://crcns.org/data-sets/vc/vim-2

Downloading the data requires a free CRCNS.org account, which can be requested using their "Request Account" page:

https://crcns.org/request-account

The application is screened, and so it may take a day or two before it is (hopefully) accepted.

If you are just downloading it locally, then logging in and downloading via your browser is perfectly acceptable, but if you are using a remote server for computation, be it your own or some cloud-based solution, it is best to make use of the download scripts that are provided:

https://crcns.org/download

Under "Batch download method", there is a link (https://portal.nersc.gov/project/crcns/download/tools) to a page which asks for your username and password. After logging in, we get access to the contents of tools. Inside tools/download there are a handful of files, including crcns-download-tools-instuctions, which explains how to set up the configuration file and how to use the download and verification scripts. Setup takes only a few minutes; just follow the instructions and take a break while the files are downloaded.

Once the raw data has been acquired, we assume that it is stored in the data/vim-2 directory, relative to your working directory.

From here, we can get to work.