# 3976 Assignment 1: Desigining a Float128 type¶

Please complete the following assignment in Jupyter, and email the completed .ipynb file to [email protected] by midnight on 14 April 2016.

You are strongly encouraged to work out the problems on your own. If you use existing solutions, existing code, help online or collaborate with other students, you will still receive full credit provided that you make this clear and cite your sources. If you collaborate with other students, please submit individual solutions and be specific in giving credit for what each student contributed.

There are two ways to complete the assignment: using bits and string manipulation, or using <<, >> on UInt128s. Either approach will be accepted, though the latter is faster and hence recommended.

In this assignment you will construct a Float128 bits-type, represented by 128 bits:

In :
import Base: show, bits, exponent, significand, rand, +, -, *, /, <, >, sign, BigFloat, inv, Float64

bitstype 128 Float128 <: AbstractFloat


For this types, we will interpret the 128-bits in data as: 1 sign bit, 63 exponent bits and 64 signficiand bits. That is, numbers will be represented by

$$x = \pm 2^{q-S}*(1.b_1b_2b_3b_4\ldots b_{64})_2$$

where $S = 2^62-1$, $q$ is an unsigned 63-bit integer in the 2nd through 64th bits and $b_1b_2\cdots b_{64}$ are in the last 64 bits.

We will not implement Inf, NaN or subnormal numbers. We will however implement both $±0$.

We will use a globally defined constant S to represent $S$:

In :
const S=Int64(2)^Int64(62)-Int64(1)

Out:
4611686018427387903

We will use the bits to get access to the bits. The following function defines bits for a Float8:

In :
bits(x::Float128)=bits(reinterpret(UInt128,x))

Out:
bits (generic function with 6 methods)

Exercise 1

(a) Complete the following exponent function, that returns $q-S$ (as an Int64).

(b) Check that

y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
x=reinterpret(Float128,y)
exponent(x)


returns 2.

In :
function exponent(x::Float128)
y=reinterpret(UInt128,x)
q=(y>>(128-64)) % UInt64
q=(q<<1)>>1

reinterpret(Int64,q-S)
end

Out:
exponent (generic function with 6 methods)
In :
y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
q=(y>>(128-64)) % UInt64
q=(q<<1)>>1

bits(q)

Out:
"0100000000000000000000000000000000000000000000000000000000000001"
In :
y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
@show exponent(reinterpret(Float128,y)) # 2
y=parse(UInt128,"11000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
@show exponent(reinterpret(Float128,y)) # 2
y=parse(UInt128,"01000000000000000000000000000000100000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
@show exponent(reinterpret(Float128,y)) # 2147483650
y=parse(UInt128,"00100000000000000000000000000000000000000000000000000000000000000000011001100110011001100110011001100110011001100110100000000000",2)
@show exponent(reinterpret(Float128,y)) # 2147483650

exponent(reinterpret(Float128,y)) = 2
exponent(reinterpret(Float128,y)) = 2
exponent(reinterpret(Float128,y)) = 2147483650
exponent(reinterpret(Float128,y)) = -2305843009213693951

Out:
-2305843009213693951

Exercise 2

Explain the following significand_uint64 function, that returns the significand bits of a Float128 as a UInt64.

In :
function significand_uint64(x::Float128)
y=reinterpret(UInt128,x)
y % UInt64
end

Out:
significand_uint64 (generic function with 1 method)

Exercise 3

Complete the following function sign to return the sign of a Float128:

In :
function sign(x::Float128)
y=reinterpret(UInt128,x)
s=y>>127
if s==1
-1
else
1
end
end

Out:
sign (generic function with 8 methods)

Exercise 4

An important skill to learn is how to read code that uses functionality that you are not yet familiar with.

(a) What is a BigFloat? (Hint: use ?BigFloat)

(b) Add comments to the function BigFloat(x::Float128) and show(io::IO,x::Float128) explaining how it works. Why do I display precisely 20 digits in the definition of show?

(c) Check that

y=parse(UInt128,"00111111111111111111111111111111111111111111111111111111111111110100110011001100110011001100110011001100110011001101000000000000",2)
x=reinterpret(Float128,y)


returns 1.300000000000000044

In :
function BigFloat(x::Float128)
if reinterpret(UInt128,x)==0
0.0
elseif reinterpret(UInt128,x)==(UInt128(1) << 127)
-0.0
else
ret=BigFloat(1)
hf=BigFloat(0.5)
sig = bits(significand_uint64(x))
for k=1:length(sig)
if sig[k]=='1'
ret+=hf
end
hf/=2
end
sign(x)*2^BigFloat(exponent(x))*ret
end
end

function show(io::IO,x::Float128)
str=string(BigFloat(x))
print(io,str[1:20])
k=search(str,'e')
if k > 0
print(io,str[k:end])
end
end

Out:
show (generic function with 106 methods)
In :
y=parse(UInt128,"00111111111111111111111111111111111111111111111111111111111111110100110011001100110011001100110011001100110011001101000000000000",2)
x=reinterpret(Float128,y)

Out:
1.300000000000000044

Exercise 5

(a) Complete the function Float128(x::Float64) that converts a Float64 to a Float128

(b) Check that

Float128(2.5)


returns 2.500000000000000000 and

Float128(1.3)


returns 1.300000000000000044

(c) Explain why 2.5 was converted exactly but 1.3 was not.

In :
function Float128(x::Float64)
if x === 0.0
reinterpret(Float128,UInt128(0))
elseif x === -0.0
reinterpret(Float128,UInt128(1) << 127)
else
if x < 0
y=UInt128(1) << 127
else
y=UInt128(0)
end

y += UInt128(exponent(x)+S) << 64
y += UInt128(reinterpret(UInt64,x) << 12)

reinterpret(Float128,y)
end
end

Out:
Float128
In :
f128=[Float128(2.5),Float128(1.3),Float128(150.6756),Float128(-150.6756),
Float128(0.125),Float128(0.013)]

# 6-element Array{Float128,1}:
#   2.500000000000000000
#   1.300000000000000044
#   1.506756000000000028e+02
#  -1.50675600000000002e+02
#   1.250000000000000000e-01
#   1.299999999999999940e-02

Out:
6-element Array{Float128,1}:
2.500000000000000000
1.300000000000000044
1.506756000000000028e+02
-1.50675600000000002e+02
1.250000000000000000e-01
1.299999999999999940e-02

Exercise 6

(a) Complete the following definition Float64(x::Float128) that converts a Float128 to a Float64 by chopping

(b) Check that

Float64(Float128(1.3))


returns 1.3

In :
function Float64(x::Float128)
y=reinterpret(UInt128,x)

if y == 0
0.0
elseif y == (UInt128(1) << 127)
-0.0
else
r=UInt64(y >> 127) << 63
r+=UInt64(exponent(x)+Int64(1023)) << (64-12)
r+=UInt64((y<<64)>>64) >> 12
reinterpret(Float64,r)
end
end

Out:
Float64
In :
map(Float64,f128)-
[2.5,1.3,150.6756,-150.6756,0.125,0.013]|>norm  #0.0

Out:
0.0

Exercise 7

In the remaining exercises, we can assume all Float128s are positive normal numbers.

(a) Complete the following addition function for two Float128s, by reducing the problem to addition with UInt128s. Round the computation towards zero.

(b) Check that

Float128(1.25)+Float128(2.5)


returns 3.750000000000000000

In :
function +(x::Float128,y::Float128)
qx=exponent(x)
qy=exponent(y)

if qx ≥ qy
qx=exponent(x)
qy=exponent(y)

newexp = UInt64(qx+S)

two64 = UInt128(1) << 64

sigx = UInt128(significand_uint64(x)) + two64
sigy = UInt128(significand_uint64(y)) + two64

# TODO: add sigx and sigy and return the resulting Float128.
# Remember: sigy needs to be shifted to account for different exponents.

sigy = sigy >> (qx-qy)

newsig = sigx + (sigy  )
if newsig >= 2two64
newsig >> 1
newexp += 1
end

newsig = UInt128(newsig % UInt64)

reinterpret(Float128,newsig + (UInt128(newexp) << 64))
else
y+x
end
end

Out:
+ (generic function with 172 methods)
In :
Float128(1.5)+Float128(10.5),Float128(1.5)+Float128(0.015)
# (1.200000000000000000e+01,1.514999999999999999)

Out:
(1.200000000000000000e+01,1.514999999999999999)

Exercise 8

(a) Complete the following multiplication of two Float128s. Again, assume that x and y are positive normal numbers. Hint: (1+sigx)*(1+sigy) = 1 + sigx + sigy + sigx*sigy

(b) Check that Float128(1.25)*Float128(2.5) returns 3.125000000000000000.

In :
function *(x::Float128,y::Float128)
qx=exponent(x)
qy=exponent(y)

newexp=UInt128(qx+qy+S)

two64=UInt128(1) << 64
sigx = UInt128(significand_uint64(x))
sigy = UInt128(significand_uint64(y))

sigxy  = (sigx*sigy >> 64)
newsig = (two64 + sigx+sigy+sigxy)
if newsig >= 2*two64
newexp += 1
newsig = newsig >> 1
end
newsig = (newsig << 64) >> 64
newexp = newexp << 64
reinterpret(Float128,newexp + newsig)
end

Out:
* (generic function with 139 methods)
In :
Float128(1.5)*Float128(3.5),Float128(0.15)*Float128(34.5) # (5.250000000000000000,5.174999999999999808)

Out:
(5.250000000000000000,5.174999999999999808)

Exercise 9

The following function computes division for Float128.

(b) Confirm that

Float128(1.3)/Float128(0.4)


returns 3.249999999999999930.

In :
function inv(x::Float128)
qx=exponent(x)

sig = UInt128(significand_uint64(x))
if sig == 0
reinterpret(Float128,UInt128(S-qx)<<64)
else
newexp=UInt128(S-qx-1)<<64

two64=UInt128(1) << 64
sig = sig + two64
two127=UInt128(1) << 127
reinterpret(Float128,newexp+UInt128((div(two127,sig) << 2) % UInt64))
end
end

function /(x::Float128,y::Float128)
x*inv(y)
end

Out:
/ (generic function with 49 methods)
In :
Float128(1.3)/Float128(0.4),Float128(10.3)/Float128(0.4),
Float128(10.3)/Float128(10.4)
# (3.249999999999999930,2.575000000000000034e+01,9.903846153846154189e-01)

Out:
(3.249999999999999930,2.575000000000000034e+01,9.903846153846154189e-01)

Exercise 10

(a) Write a routine that calculates

$$\sum_{k=0}^{28} {1 \over k!}$$

using Float128 arithmetic, by using the / and + functions defined above.

(b) Compare the number of accurate digits to the higher accuracy approximation of e:

println(BigFloat(e))
In :
ret=Float128(1.0)
h=Float128(1.0)

for k=1:27
ret += h
h /= Float128(1.0(k+1))
end
println(BigFloat(e))
println(ret)

2.718281828459045235360287471352662497757247093699959574966967627724076630353555
2.718281828459045234

In [ ]: