Answer the questions with Julia code, Julia calculations, and words in comment strings (starting with #). You can also use Markdown cells for text, if you wish. Problem 0 is worked out for you as an example.

**Problem 0.** Use Julia's `nextfloat` function to determine the separation between 1.0 and the next floating-point number (using the default `Float64` type).

In [16]:

```
nextfloat(1.0)-1.0
```

Out[16]:

Can you explain the precise value of this number?

In [17]:

```
# Yes, 2.220446049250313e-16 is 2^-52, the spacing between consecutive floats near 1 set by the 52-bit mantissa
2.0^-52
```

Out[17]:

Predict the next floating-point number after 8.0 based on your understanding of floating-point numbers, and verify with `nextfloat`.

In [18]:

```
# 8 is represented as (1 + 0)*2^3. The next 64-bit float after 8.0 should be (1 + 2^-52)*2^3 == 8 + 2^-49.
8 + 2.0^-49
```

Out[18]:

In [19]:

```
nextfloat(8.0)
```

Out[19]:

In [21]:

```
# those look the same, but the representation of floats as decimal strings uses rounding
# let's check that the two values are exactly equal by subtracting
(8 + 2.0^-49) - nextfloat(8.0)
```

Out[21]:
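
As a quick cross-check (my own addition, not part of the assignment), Julia's built-in `eps` function reports exactly these spacings:

```julia
# eps(x) returns the gap between x and the next representable Float64
eps(1.0)   # 2^-52
eps(8.0)   # 2^-49: the gap near 8.0 is 2^3 times larger than near 1.0
```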

**Problem 1.** Based on 9/7/2016 lecture material, what range of integers should a 16-bit integer type represent?

In [4]:

```
# 16-bit ints should represent 2^16 integers in the range -2^15 to 2^15-1
-2^15, 2^15-1
```

Out[4]:

Check your answer by running typemin(Int16) and typemax(Int16).

In [2]:

```
typemin(Int16), typemax(Int16)
```

Out[2]:
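
As a side note (my own addition), Julia's integer arithmetic wraps around on overflow, which gives another way to see the edges of the range:

```julia
# adding 1 to the largest Int16 wraps to the smallest (two's-complement overflow)
typemax(Int16) + Int16(1) == typemin(Int16)
```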

**Problem 2.** Same as problem 1 for 32-bit integers.

In [5]:

```
# 32-bit ints should represent 2^32 integers in the range -2^31 to 2^31-1
-2^31, 2^31-1
```

Out[5]:

In [6]:

```
typemin(Int32), typemax(Int32)
```

Out[6]:

**Problem 3.** The standard 32-bit floating-point type uses 1 bit for sign, 8 bits for exponents, and 23 bits for the mantissa.

What is machine epsilon for a 32-bit float? Express your answer both as a power of two and a power of ten.

In [8]:

```
# 23 bits for mantissa means that the next floating point number after 1 is 1+2^-23
# so machine epsilon is 2^-23 ≈ 10^-7.
# Note that you have to calculate this quantity as 1/2^23 or 2.0^-23
2^-23 # fails with a DomainError because Julia doesn't allow negative integer powers of integers
```

In [10]:

```
2.0^-23 # but this works
```

Out[10]:

How many digits of accuracy does the mantissa have?

In [11]:

```
# The mantissa of a 32-bit float has seven digits of accuracy, since 2^-23 is about 10^-7.
# Let's verify by counting the digits of a 32-bit approximation to π
Float32(π)
```

Out[11]:

In [14]:

```
# Hmmm. Julia shows eight digits for a 32-bit pi. Let's compare to a higher-precision approximation.
Float64(π)
```

Out[14]:

In [ ]:

```
# Ok, the error is in the 7th or 8th digit, depending on whether you count the rounded-up 7
# as an accurate digit or not.
```
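
For confirmation (my own addition), Julia's built-in `eps` agrees that 32-bit machine epsilon is 2^-23:

```julia
# eps(Float32) is the gap between 1.0f0 and the next representable Float32
eps(Float32) == 2.0f0^-23
```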

What are the minimum and maximum exponents, in base 10?

In [16]:

```
# Oooh, tough question. The binary representation has sign times mantissa times 2^e.
# With 8 bits devoted to e, that gives roughly the range -2^7 ≤ e ≤ 2^7-1, i.e. -128 ≤ e ≤ 127.
# That means the range of 2^e is 2^-128 < 2^e < 2^127. To convert to exponents in base 10,
# we just need to approximate 2^128 with a power of 10.
2.0^128
```

Out[16]:

In [17]:

```
# Looks like the minimum and maximum base-10 exponents of a 32-bit float are -38 and 38.
# Let's check. 1e38 should be ok, but 1e39 should overflow to Inf
Float32(1e38)
```

Out[17]:

In [18]:

```
Float32(1e39)
```

Out[18]:

In [20]:

```
# Yep! Let's try the same with 1e-38 and 1e-39
Float32(1e-38), Float32(1e-39)
```

Out[20]:

In [22]:

```
# Whoa, what's up with that? It turns out that the IEEE floating-point standard manages
# to pack in a few extra negative exponents, leading to what are called subnormal numbers.
# For 32-bit numbers, the negative exponents go all the way down to about 1e-45
Float32(1e-45), Float32(1e-46)
```

Out[22]:
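
A couple of built-ins (my own addition) make the subnormal range explicit:

```julia
# floatmin gives the smallest *normalized* positive Float32;
# nextfloat(0.0f0) gives the smallest positive *subnormal* Float32
floatmin(Float32)             # 2^-126, about 1.2e-38
nextfloat(0.0f0)              # 2^-149, about 1.4e-45
issubnormal(Float32(1e-40))   # true: below floatmin but still nonzero
```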

**Problem 4.** The standard 16-bit floating-point type uses 1 bit for sign, 5 bits for exponents, and 10 bits for the mantissa. What size error do you expect in a 16-bit computation of 9.4 - 9 - 0.4?

In [23]:

```
# Let's see, 10 bits for mantissa means a machine epsilon of 2^-10, or about 10^-3.
# So I expect only about three digits of accuracy in a 16-bit computation of 9.4 - 9 - 0.4.
2.0^-10
```

Out[23]:

You can enter a 16-bit float in Julia with the syntax `9.4f0`. Compute 9.4 - 9 - 0.4 using 16-bit floats and verify your expectation.

In [26]:

```
# Gee, Professor Gibson, thanks for misleading us all. 9.4f0 gives you a 32-bit float.
# The correct syntax for 16-bit floats is Float16(9.4)
Float16(9.4) - Float16(9) - Float16(0.4)
```

Out[26]:
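
To quantify the error (my own addition), compare the 16-bit result against the same computation in 64 bits; the 64-bit value is imperfect too, but accurate enough to serve as a reference:

```julia
# the difference should be on the order of 10^-3, i.e. error in the third digit
err = Float64(Float16(9.4) - Float16(9) - Float16(0.4)) - (9.4 - 9 - 0.4)
abs(err)
```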

**Problem 5.** Find the roots of $x^2 - 4x + 6^{-28} = 0$ to several significant digits.

I'm going to answer this partly in Markdown, since the formulae look better when typeset.

The roots are $x_{1,2} = \frac{4 \pm \sqrt{16 - 4 \times 6^{-28}}}{2} = 2 \pm \sqrt{4 - 6^{-28}}$. If you simply plug in numbers and calculate in 64 bits, you get

In [32]:

```
x_1 = 2 + sqrt(4 - 6.0^-28)
```

Out[32]:

In [34]:

```
x_2 = 2 - sqrt(4 - 6.0^-28)
```

Out[34]:

The value for the first root $x_1$ is accurate to 16 digits. You can see that by recomputing in higher precision.

In [33]:

```
big(2) + sqrt(big(4.0)-big(6.0)^-28)
```

Out[33]:

The value of zero for the second root $x_2$ is not accurate. In fact it has *no digits of accuracy*.
Using higher-precision arithmetic shows $x_2 \approx 4.07 \times 10^{-23}$.

In [36]:

```
big(2) - sqrt(big(4.0)-big(6.0)^-28)
```

Out[36]:

The problem is that $6^{-28} \approx 10^{-22}$ is very small compared to 4.

Machine epsilon for 64-bit floats is $2^{-52}$. That means the next number past 4 is $4(1+2^{-52}) \approx 4 + 10^{-15}$, and the number prior to it is about $4 - 10^{-15}$. Thus the $6^{-28} \approx10^{-22}$ you're trying to add to 4 is way below the resolution of the floating-point numbers in the neighborhood of 4. So when you compute $4 - 6^{-28}$, it gets rounded to the nearest floating point number, which is 4. And then you get $2 - \sqrt{4} = 0$. That's no good.

We need to revise the mathematical expression for $x_2 = 2 - \sqrt{4 - 6^{-28}}$ to remove the close cancellation. Multiplying by $\left(2 + \sqrt{4 - 6^{-28}}\right)/\left(2 + \sqrt{4 - 6^{-28}}\right)$ and simplifying gives

$x_2 = \frac{6^{-28}}{2 + \sqrt{4 - 6^{-28}}}$

which evaluates in 64-bit arithmetic to

In [37]:

```
x_2 = 6.0^-28 / (2 + sqrt(4 - 6.0^-28))
```

Out[37]:

Comparing this value to the one computed with BigFloats, we see that all sixteen digits are correct.

Now that you know the answers, do you see a way you could have found them easily by a couple of simple approximations?

Yes! Go back to the original problem of finding the solutions of $x^2 - 4x + 6^{-28} = 0$. For the root near 4, the $6^{-28}$ is negligible. Drop it and solve $x^2 - 4x = 0$. Dividing by $x$ gives $x = 4$.

For the root near zero, $x$ is small, but $x^2$ is even smaller. Drop it from the equation and solve $-4x + 6^{-28} = 0$. That gives $x = 6^{-28}/4$.

In [38]:

```
6.0^-28 / 4
```

Out[38]:

Wow, this simple approximation is accurate to sixteen digits!
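
This trick generalizes. Here's a sketch (my own addition; `stable_quadratic_roots` is a hypothetical helper, not from the lecture) of a cancellation-free solver for $x^2 + bx + c = 0$:

```julia
# Solve x^2 + b*x + c = 0 without catastrophic cancellation:
# compute the larger-magnitude root first (the two terms in the numerator
# have the same sign), then recover the small root from Vieta's x1*x2 == c.
function stable_quadratic_roots(b, c)
    x1 = -(b + copysign(sqrt(b^2 - 4c), b)) / 2
    x2 = c / x1
    return x1, x2
end

stable_quadratic_roots(-4.0, 6.0^-28)   # recovers both 4.0 and ~4.07e-23
```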

**Problem 6.** Given x = 4778 + 3.77777e-10 and y = 4778 + 3.11111e-10, how many digits of accuracy do you expect in 64-bit calculations of the following?

x + y

x - y

x * y

x / y

In [39]:

```
# Let's figure out how accurately we can represent x and y as 64-bit floating-point numbers
# 4 778.000 000 000 000 000 000
# + 0.000 000 000 377 777 000
# ------------------------------
# 4 778.000 000 000 377 777
# but we only have 16 digits to store that number. So we have to truncate the last three 7's.
# So in 64 bits, x and y will get rounded to 16-digit accuracy
# x = 4 778.000 000 000 377 and
# y = 4 778.000 000 000 311
# x+y will add those numbers and produce a value accurate to 16 digits.
# x = 4 778.000 000 000 377
# + y = 4 778.000 000 000 311
# x+y = 9 556.000 000 000 688
# x-y, though, is subject to cancellation! The true answer is
# x-y = (4778 + 3.77777e-10) - (4778 + 3.11111e-10)
# = 0.66666e-10
# = 6.6666e-11
# but in 64 bits, we get
# x = 4 778.000 000 000 377
# - y = 4 778.000 000 000 311
# x-y = 0 000.000 000 000 066 = 6.6e-11
# That's only accurate to two digits!
# Like x+y, x*y and x/y will be accurate to 16 digits, since there's no cancellation.
```

Can you think of a way to test your expectations in Julia? Note that evaluating the expressions in 64-bit arithmetic just gives you the 64-bit approximation, without telling you anything about its accuracy.

In [45]:

```
x = 4778 + 3.77777e-10
```

Out[45]:

In [46]:

```
y = 4778 + 3.11111e-10
```

Out[46]:

In [47]:

```
# We can represent those numbers more precisely with creative use of Julia's BigFloat type
xbig = 4778 + 377777*BigFloat(10)^-15
```

Out[47]:

In [48]:

```
ybig = 4778 + 311111*BigFloat(10)^-15
```

Out[48]:

Now do a bunch of comparisons

In [55]:

```
x+y
```

Out[55]:

In [50]:

```
xbig+ybig
```

Out[50]:

Hooray! x+y is accurate to 16 digits, as expected!

In [51]:

```
x-y
```

Out[51]:

In [52]:

```
xbig-ybig
```

Out[52]:

Hooray! x-y is accurate to only two digits, as expected!

In [53]:

```
x*y
```

Out[53]:

In [54]:

```
xbig*ybig
```

Out[54]:

Hooray! x*y is accurate to 16 digits, as expected!

In [56]:

```
x/y
```

Out[56]:

In [57]:

```
xbig/ybig
```

Out[57]:

Hooray! x/y is accurate to 16 digits, as expected!
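
As a final aside (my own addition; `digits_of_accuracy` is a hypothetical helper, not from the lecture), the digit counts above can be estimated programmatically from the relative error:

```julia
# Estimate shared leading digits from the relative error (assumes exact != 0).
digits_of_accuracy(approx, exact) = -log10(abs(approx - exact) / abs(exact))

x = 4778 + 3.77777e-10
y = 4778 + 3.11111e-10
xbig = 4778 + 377777 * BigFloat(10)^-15
ybig = 4778 + 311111 * BigFloat(10)^-15

digits_of_accuracy(x + y, xbig + ybig), digits_of_accuracy(x - y, xbig - ybig)
```

The first value comes out around 16 and the second around 2, matching the hand analysis above.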