# Preparation of raw (binary) data

Raw data comes in many forms, often binary. Working with it is straightforward, but some care is needed to confirm that the data we read actually contains the information we expect.

## Overview of the data set

As a straightforward exercise in manipulating binary files using standard Python functions, we shall use the well-known database of handwritten digits called MNIST, a modified subset of a larger dataset from the National Institute of Standards and Technology (NIST).

A typical source for this data set is the website of Y. LeCun (http://yann.lecun.com/exdb/mnist/), which provides the following description,

"The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image."

and the following files containing the data of interest.

- train-images-idx3-ubyte.gz: training set images (9912422 bytes)
- train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
- t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
- t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
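Since the byte counts of the compressed files are listed, a quick sanity check on the downloads is possible before going any further. The following is a minimal sketch, not part of the original procedure: the helper name `check_sizes` is hypothetical, and it assumes the four `.gz` files were saved to `data/MNIST` and have not yet been decompressed.

```python
from pathlib import Path

# Compressed sizes in bytes, as listed on the MNIST page quoted above.
EXPECTED_SIZES = {
    "train-images-idx3-ubyte.gz": 9912422,
    "train-labels-idx1-ubyte.gz": 28881,
    "t10k-images-idx3-ubyte.gz": 1648877,
    "t10k-labels-idx1-ubyte.gz": 4542,
}


def check_sizes(data_dir, expected=EXPECTED_SIZES):
    """Map each expected file name to 'ok', 'missing', or 'size mismatch'."""
    report = {}
    for name, size in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            report[name] = "missing"
        elif path.stat().st_size != size:
            report[name] = "size mismatch"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    # Assumption: the downloads live in data/MNIST, still compressed.
    for name, status in check_sizes("data/MNIST").items():
        print(f"{name}: {status}")
```

A matching size does not prove the contents are intact, but a mismatch or a missing file is a cheap early warning of an incomplete download.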

The files are stored in a binary format called IDX, typically used for storing vector data. First we decompress the files via

    $ cd data/MNIST
    $ gunzip train-images-idx3-ubyte.gz
    $ gunzip train-labels-idx1-ubyte.gz
    $ gunzip t10k-images-idx3-ubyte.gz