Date: July 30, 2019

### Representation of numbers on a computer

Since any computer has limited storage capacity, only finitely many numbers can be represented on it. Inevitably, we have to approximate the real number system using finite computer representations. The most widely used standard for a consistent representation of floating point numbers across different computer architectures is IEEE-754, published by the Institute of Electrical and Electronics Engineers (IEEE).

First note that all numbers are stored in binary (base $2$) format. Any normal number on the machine is represented as

$$x = \pm 1.d_1d_2\ldots d_s \times 2^{E}$$

where $1.d_1d_2\ldots d_s$ is the significand and $E$ is the exponent (both represented using $0$'s and $1$'s). For example, consider the number $77$ in decimal. We have

$$77_{10} = 2^6 + 2^3 + 2^2 + 2^0 = 1001101_2 = 1.001101_2 \times 2^{6} = 1.001101_2 \times 2^{110_{2}}$$

As indicated in the Figure above, the first bit is the sign bit, the next $e$ bits are for the exponent and the last $s$ bits are for the significand. We will now list the general conventions followed in accordance with the IEEE-754 standard.

• Sign bit: $0$ indicates $+$ and $1$ indicates $-$.
• Exponent bits: Since there are $e$ bits for the exponent, the exponent field can take a total of $2^e$ values. To represent negative exponents as well, a bias of $2^{e-1}-1$ is added to the true exponent before it is stored, i.e., an exponent of $0$ is stored as $0\overbrace{111\ldots1}^{e-1}_2$.
• Significand bits: Store the $s$ bits of the significand that follow the leading $1$.

#### Normal floating point number

These are represented as $$\pm 1.d_1d_2\ldots d_s \times 2^{E}$$ where $2-2^{e-1} \leq E \leq 2^{e-1}-1$ (the true exponent, after the bias has been removed). Note that since normal floating point numbers begin with $1$, it suffices to store the $s$ bits after the leading $1$.

#### Sub-normal floating point number

These are represented as $$\pm 0. d_1d_2\ldots d_s \times 2^{2-2^{e-1}}$$ The exponent bits of sub-normal floating point numbers are all zero. The significand stores the $s$ bits after the leading $0$.
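The conventions above can be checked on concrete bit patterns. The following is a minimal sketch, assuming single precision ($e = 8$, $s = 23$) and using Python's standard `struct` module to obtain the raw bits. The helper names `float32_fields` and `decode_float32` are hypothetical, introduced only for illustration, and the all-ones exponent pattern is deliberately left unhandled.

```python
import struct

E_BITS, S_BITS = 8, 23  # single precision: exponent and significand widths


def float32_fields(x):
    """Return (sign, exponent bits, significand bits) of x stored in single precision."""
    # Pack x as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:1 + E_BITS], bits[1 + E_BITS:]


def decode_float32(x):
    """Rebuild the value of x from its three bit fields."""
    sign, exp_bits, sig_bits = float32_fields(x)
    bias = 2 ** (E_BITS - 1) - 1
    stored = int(exp_bits, 2)
    frac = int(sig_bits, 2) / 2 ** S_BITS  # the value 0.d_1 d_2 ... d_s
    if stored == 2 ** E_BITS - 1:
        raise ValueError("all-ones exponent is not handled in this sketch")
    if stored == 0:
        # Sub-normal: leading 0 and fixed exponent 2 - 2^(e-1)
        value = frac * 2.0 ** (2 - 2 ** (E_BITS - 1))
    else:
        # Normal: implicit leading 1, true exponent = stored exponent - bias
        value = (1 + frac) * 2.0 ** (stored - bias)
    return -value if sign == "1" else value


print(float32_fields(77.0))        # exponent field stores 6 + 127 = 133
print(decode_float32(77.0))
print(float32_fields(2.0 ** -149)) # smallest positive sub-normal in single precision
```

For $77$, the stored exponent field is $6 + 127 = 133 = 10000101_2$ and the significand field begins with $001101$, matching the expansion $1.001101_2 \times 2^{6}$.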

#### Machine precision

This is defined as the difference between the smallest number exceeding $1$ that can be represented on the machine and $1$. (The definition varies slightly with the rounding mode. In this article, rounding means truncating, i.e., chopping off the extra digits.) Note that the smallest number exceeding $1$ that can be represented on the machine is $1.000\ldots01_2 = 1+2^{-s}$.

Hence, machine precision is $\epsilon_m = 2^{-s}$.
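This value can be observed numerically. The sketch below assumes that Python's `float` is an IEEE-754 double ($s = 52$), which holds on common platforms but is not guaranteed by the language. The hardware rounds to nearest rather than chopping, yet the gap between $1$ and the next representable number is still $2^{-s}$. The function name `gap_above_one` is hypothetical.

```python
import sys


def gap_above_one():
    """Find the spacing between 1.0 and the next larger float by repeated halving."""
    gap = 1.0
    # Keep halving while 1 + gap/2 is still distinguishable from 1.
    while 1.0 + gap / 2 != 1.0:
        gap /= 2
    return gap


eps_m = gap_above_one()
print(eps_m)
print(eps_m == 2.0 ** -52)              # compare with 2^{-s} for s = 52
print(eps_m == sys.float_info.epsilon)  # compare with the value Python reports
```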

Note that if $x$ is any real number and $\text{fl}(x)$ is the floating point representation of $x$ (i.e., after appropriate chopping, $x$ will be represented as $\text{fl}(x)$ on the machine), we then have $$\left \lvert{\dfrac{x-\text{fl}(x)}{\text{fl}(x)}}\right \rvert \leq \epsilon_m$$
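The bound can be checked with exact rational arithmetic. The sketch below is an illustration under stated assumptions, not an IEEE-754 implementation: the hypothetical helper `chop` truncates a positive rational number to $s$ significand bits and ignores the exponent range, so overflow and sub-normal numbers are not modelled.

```python
from fractions import Fraction


def chop(x, s):
    """Truncate a positive rational x to the form 1.d_1...d_s * 2^E (no exponent limits)."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    # Find E such that 2^E <= x < 2^(E+1).
    E = 0
    while Fraction(2) ** (E + 1) <= x:
        E += 1
    while Fraction(2) ** E > x:
        E -= 1
    # Spacing of representable numbers in [2^E, 2^(E+1)) is 2^(E-s).
    spacing = Fraction(2) ** (E - s)
    return (x // spacing) * spacing  # drop everything below the last kept bit


s = 23                       # single-precision significand width
x = Fraction(1, 10)          # 0.1 has no finite binary expansion
fl_x = chop(x, s)
rel_err = abs(x - fl_x) / abs(fl_x)
print(float(rel_err), float(Fraction(2) ** -s))
print(rel_err <= Fraction(2) ** -s)
```

Using `Fraction` keeps every intermediate value exact, so the comparison against $2^{-s}$ is not itself affected by rounding.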

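From the above, note that the floating point representation bounds the relative error, not the absolute error: the absolute error $\lvert x-\text{fl}(x) \rvert$ grows in proportion to the magnitude of the number.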

Single precision: Of the total of $32$ bits, the first one is allotted for sign, the next $8$ for exponent and the remaining $23$ for significand.

Double precision: Of the total of $64$ bits, the first one is allotted for sign, the next $11$ for exponent and the remaining $52$ for significand.
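The formulas from the earlier sections can be evaluated for these two formats. The following sketch uses a hypothetical helper `ieee_params(e, s)` and assumes that Python's `float` is an IEEE-754 double, so that the double-precision values can be compared with `sys.float_info`.

```python
import sys


def ieee_params(e, s):
    """Derive the main quantities of a format with e exponent bits and s significand bits."""
    e_min = 2 - 2 ** (e - 1)       # smallest exponent of a normal number
    e_max = 2 ** (e - 1) - 1       # largest exponent of a normal number
    return {
        "bias": 2 ** (e - 1) - 1,
        "e_min": e_min,
        "e_max": e_max,
        "eps": 2.0 ** -s,                               # machine precision
        "max_normal": (2.0 - 2.0 ** -s) * 2.0 ** e_max, # 1.11...1 * 2^e_max
        "min_normal": 2.0 ** e_min,                     # 1.00...0 * 2^e_min
        "min_subnormal": 2.0 ** (e_min - s),            # 0.00...1 * 2^e_min
    }


single = ieee_params(8, 23)
double = ieee_params(11, 52)

for name, fmt in (("single", single), ("double", double)):
    print(name, fmt)

# Cross-check the double-precision values against what Python reports.
print(double["eps"] == sys.float_info.epsilon)
print(double["max_normal"] == sys.float_info.max)
print(double["min_normal"] == sys.float_info.min)
```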