# Math 753/853 HW1: floating-point numbers, due Wed 9/14¶

Answer the questions with Julia code, Julia calculations, and words in comment strings (starting with #). You can also use Markdown cells for text, if you wish. Problem 0 is worked out for you as an example.

Problem 0. Use Julia's nextfloat function to determine the separation between 1.0 and the next floating point number (using the default Float64 type).

In [16]:
nextfloat(1.0)-1.0

Out[16]:
2.220446049250313e-16

Can you explain the precise value of this number?

In [17]:
# Yes, 2.220446049250313e-16 is 2^-52, the next allowable value for a 52-bit mantissa
2.0^-52

Out[17]:
2.220446049250313e-16

Predict the next floating point number after 8.0 based on your understanding of floating-point numbers, and verify with nextfloat.

In [18]:
# 8 is represented as (1 + 0)*2^3. The next 64-bit float after 8.0 should be (1 + 2^-52)*2^3 == 8 + 2^-49.
8 + 2.0^-49

Out[18]:
8.000000000000002
In [19]:
nextfloat(8.0)

Out[19]:
8.000000000000002
In [21]:
# those look the same, but the representation of floats as decimal strings uses rounding
# let's check that the two values are exactly equal by subtracting
(8 + 2.0^-49) - nextfloat(8.0)

Out[21]:
0.0

Problem 1. Based on 9/7/2016 lecture material, what range of integers should a 16-bit integer type represent?

In [4]:
# 16-bit ints should represent 2^32 integers in the range -2^-15 to 2^15-1
-2^15, 2^15-1

Out[4]:
(-32768,32767)

In [2]:
typemin(Int16), typemax(Int16)

Out[2]:
(-32768,32767)

Problem 2. Same as problem 1 for 32-bit integers.

In [5]:
# 32-bit ints should represent 2^32 integers in the range -2^-31 to 2^32
-2^31, 2^31-1

Out[5]:
(-2147483648,2147483647)
In [6]:
typemin(Int32), typemax(Int32)

Out[6]:
(-2147483648,2147483647)

Problem 3. The standard 32-bit floating-point type uses 1 bit for sign, 8 bits for exponents, and 23 bits for the mantissa.

What is machine epsilon for a 32-bit float? Express your answer both as a power of two and a power of ten.

In [8]:
# 23 bits for mantissa means that the next floating point number after 1 is 1+2^-23
# so machine epsilon is 2^-23 ≈ 10^-7.

# Note that you have to calculate this quantity as 1/2^23 or 2.0^-23

2^-23 # fails because Julia doesn't do negative powers of integers

LoadError: DomainError:
Cannot raise an integer x to a negative power -n.
Make x a float by adding a zero decimal (e.g. 2.0^-n instead of 2^-n), or write 1/x^n, float(x)^-n, or (x//1)^-n.

in power_by_squaring(::Int64, ::Int64) at ./intfuncs.jl:118
in ^(::Int64, ::Int64) at ./intfuncs.jl:142
In [10]:
2.0^-23 # but this works

Out[10]:
1.1920928955078125e-7

How many digits of accuracy does the mantissa have?

In [11]:
# The mantissa of a 32-bit float has seven digits of accuracy, since 2^-23 is about 10^-7.
# Let's verify by counting the digits of a 32-bit approximation to π
Float32(π)

Out[11]:
3.1415927f0
In [14]:
# Hmmm. Julia shows eight digits for a 32-bit pi. Let's compare to a higher-precision approximation.
Float64(π)

Out[14]:
3.141592653589793
In [ ]:
# Ok, the error is in the 7th or 8th digit, depending on whether you count the rounded-up 7
# as an accurate digit or not.


What are the minimum and maximum exponents, in base 10?

In [16]:
# Oooh, tough question. The binary representation has sign times mantissa times 2^e.
# With 8 bits devoted to e, that gives a range -2^-7 < e < 2^7-1 or -128 < e < 127.
# That means the range of 2^e is 2^-128 < 2^e < 2^127. To convert to exponents in base 10,
# we just need to approximate 2^128 with a power of 10.

2.0^128

Out[16]:
3.402823669209385e38
In [17]:
# Looks like the minimum and maximum base-10 exponents of a 32-bit float are -38 and 38.
# Let's check. 1e38 should be ok, but 1e39 should overflow to Inf
Float32(1e38)

Out[17]:
1.0f38
In [18]:
Float32(1e39)

Out[18]:
Inf32
In [20]:
#Yep! Let's try to same with 1e-28 and 1e-39
Float32(1e-38), Float32(1e-39)

Out[20]:
(1.0f-38,1.0f-39,1.0f-40)
In [22]:
# Whoa, what's up with that? It turns out that the IEEE floating-point standard manages
# to pack in a few extra negative exponents, leading to what are called subnormal numbers.
# For 32 bit numbers, the negative exponents go all the way to 1e-45
Float32(1e-45), Float32(1e-46)

Out[22]:
(1.0f-45,0.0f0)

Problem 4. The standard 16-bit floating-point type uses 1 bit for sign, 5 bits for exponents, and 10 bits for the mantissa. What size error do you expect in a 16-bit computation of 9.4 - 9 - 0.4?

In [23]:
# Let's see, 10 bits for mantissa means a machine epsilon of 2^-9 or about 10^-3.
# So I expect only three digits of accuracy in a 16-bit computation of 9.4 - 9 - 0.4.
2.0^-9

Out[23]:
0.001953125

You can enter a 16-bit float in Julia with the syntax 9.4f0. Compute 9.4 - 9 - 0.4 using 16-bit floats and verify your expectation.

In [26]:
# Gee, Professor Gibson, thanks for misleading us all. 9.4f0 gives you a 32-bit float.
# The correct syntax for 16-bit floats is Float16(9.4)
Float16(9.4) - Float16(9) - Float16(0.4)

Out[26]:
Float16(-0.001465)

Problem 5. Find the roots of $x^2 - 4x + 6^{-28} = 0$ to several significant digits.

I'm going to answer this partly in Markdown, since the formulae look better when typeset.

The roots are $x_{1,2} = \frac{4 \pm \sqrt{16 - 4 \times 6^{-28}}}{2} = 2 \pm \sqrt{4 - 6^{-28}}$. If you simply plug in numbers and calculate in 64 bits, you get

In [32]:
x_1 = 2 + sqrt(4 - 6.0^-28)

Out[32]:
4.0
In [34]:
x_2 = 2 - sqrt(4 - 6.0^-28)

Out[34]:
0.0

The value for the first root $x_1$ is accurate to 16 digits. You can see that by recomputing in higher precision.

In [33]:
big(2) + sqrt(big(4.0)-big(6.0)^-28)

Out[33]:
3.999999999999999999999959289634836307028083916167847873596658889610299673573882

The value of zero for the second root $x_2$ is not accurate. In fact it has no digits of accuracy. Using higher-precision arithmetic shows $x_2 \approx 4.07 \times 10^{-23}$.

In [36]:
big(2) - sqrt(big(4.0)-big(6.0)^-28)

Out[36]:
4.071036516369297191608383215212640334111038970032642610106519358651805052773103e-23

The problem is that $6^{-28} \approx 10^{-22}$ is very small compared to 4.

Machine epsilon for 64-bit floats is $2^{-52}$. That means the next number past 4 is $4(1+2^{-52}) \approx 4 + 10^{-15}$, and the number prior to it is about $4 - 10^{-15}$. Thus the $6^{-28} \approx10^{-22}$ you're trying to add to 4 is way below the resolution of the floating-point numbers in the neighborhood of 4. So when you compute $4 - 6^{-28}$, it gets rounded to the nearest floating point number, which is 4. And then you get $2 - \sqrt{4} = 0$. That's no good.

We need to revise the mathematical expression for $x_2 = 2 - \sqrt{4 - 6^{-28}}$ to remove the close cancellation. Multiplying by $\left(2 + \sqrt{4 - 6^{-28}}\right)/\left(2 + \sqrt{4 - 6^{-28}}\right)$ and simplifying gives

$x_2 = \frac{6^{-28}}{2 + \sqrt{4 - 6^{-28}}}$

which evaluates in 64-bit arithmetic to

In [37]:
x_2 = 6.0^-28 /(2 + sqrt(4+6.0^-28))

Out[37]:
4.071036516369297e-23

Comparing this value to the one computed with BigFloats, we see that all sixteen digits are correct.

Now that you know the answers, do you see a way you could have found them easily by a couple simple approximations?

Yes! Go back to the original problem of finding the solutions of $x^2 - 4x + 6^{-28} = 0$. For the root near 4, the $6^{28}$ is negligible. Drop it and solve $x^2 - 4x = 0$. Dividing by $x$ gives $x = 4$.

For the root near zero, $x$ is small, but $x^2$ is even smaller. Drop it from the equation and solve $-4x + 6^{-28} = 0$. That gives $x = 6^{-28}/4$.

In [38]:
6.0^-28 / 4

Out[38]:
4.071036516369297e-23

Wow, this simple approximation is accurate to sixteen digits!

Problem 6. Given x = 4778 + 3.77777e-10 and y = 4778 + 3.11111e-10, how many digits of accuracy do you expect in 64-bit calculations of the following?

x + y

x - y

x * y

x / y

In [39]:
# Let's figure out how accurately we can represent x and y as 64-bit floating-point numbers

#  4 778.000 000 000 000 000 000
# +    0.000 000 000 377 777 000
# ------------------------------
#  4 778.000 000 000 377 777

# but we only have 16 digits to store that number. So we have to truncate the last three 7's.
# So in 64 bits, x and y will get rounded to 16-digit accuracy

# x = 4 778.000 000 000 377 and
# y = 4 778.000 000 000 311

# x+y will add those numbers and produce a value accurate to 16 digits.

#   x  =  4 778.000 000 000 377
# + y  =  4 778.000 000 000 311
# x+y  =  9 556.000 000 000 688

# x-y, though, is subject to cancellation! The true answer is
# x-y = (4778 + 3.77777e-10) - (4778 + 3.11111e-10)
#     = 0.66666e-10
#     = 6.6666e-11

# but in 64 bits, we get
#   x  =  4 778.000 000 000 377
# - y  =  4 778.000 000 000 311
# x-y  =  0 000.000 000 000 066 =  6.6e-11

# That's only accurate to two digits!

# Like x+y, x*y and x/y will be accurate to 16 digits, since there's no cancellation.


Can you think of way to test your expectations in Julia? Note that evaluating the expressions in 64-bit arithmetic just gives you the 64-bit approximation, without telling you anythign about its accuracy.

In [45]:
x = 4778 + 3.77777e-10

Out[45]:
4778.000000000377
In [46]:
y = 4778 + 3.11111e-10

Out[46]:
4778.000000000311
In [47]:
# We can get represent those numbers more precisely with creative use of Julia's BigFloat type
xbig = 4778 + 377777*BigFloat(10)^-15

Out[47]:
4.778000000000377776999999999999999999999999999999999999999999999999999999999983e+03
In [48]:
ybig = 4778 + 311111*BigFloat(10)^-15

Out[48]:
4.778000000000311110999999999999999999999999999999999999999999999999999999999991e+03

Now do a bunch of comparisons

In [55]:
x+y

Out[55]:
9556.000000000688
In [50]:
xbig+ybig

Out[50]:
9.556000000000688887999999999999999999999999999999999999999999999999999999999974e+03

Hooray! x+y is accurate to 16 digits, as expected!

In [51]:
x-y

Out[51]:
6.639311322942376e-11
In [52]:
xbig-ybig

Out[52]:
6.666599999999999999999999999999999999999999999999999999999999999129874322838217e-11

Hooray! x-y is accurate to only two digits, as expected!

In [53]:
x*y

Out[53]:
2.282928400000329e7
In [54]:
xbig*ybig

Out[54]:
2.282928400000329150686400011753058024699999999999999999999999999999999999999978e+07

Hooray! x*y is accurate to 16 digits, as expected!

In [56]:
x/y

Out[56]:
1.000000000000014
In [57]:
xbig/ybig

Out[57]:
1.000000000000013952699874423536869315481000213698281998857789056883435193467709

Hooray! x/y is accurate to 16 digits, as expected!