Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
In this second part of the chapter, we look at the float type in detail. It is probably the most commonly used numeric type in all of data science, even across programming languages.
The float Type

As we have seen before, some assumptions need to be made as to how the 0s and 1s in a computer's memory are translated into numbers. This process becomes a lot more involved when we go beyond integers and model real numbers (i.e., the set ℝ) with possibly infinitely many digits to the right of the decimal point, like 1.23.
The Institute of Electrical and Electronics Engineers (IEEE, pronounced "eye-triple-E") is one of the important professional associations when it comes to standardizing all kinds of aspects regarding the implementation of soft- and hardware.
The IEEE 754 standard defines the so-called floating-point arithmetic that is commonly used today by all major programming languages. The standard not only defines how the 0s and 1s are organized in memory but also, for example, how values are to be rounded, what happens in exceptional cases like divisions by zero, or what is a zero value in the first place.
In Python, the simplest way to create a float object is to use a literal notation with a dot . in it.
b = 42.0
id(b)
139923238853936
type(b)
float
b
42.0
As with int literals, we may use underscores _ to make longer float literals easier to read.
0.123_456_789
0.123456789
42.
42.0
float(42)
42.0
float("42")
42.0
Leading and trailing whitespace is ignored ...
float(" 42.87 ")
42.87
... but not whitespace in between.
float("42. 87")
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[10], line 1 ----> 1 float("42. 87") ValueError: could not convert string to float: '42. 87'
float objects are implicitly created as the result of dividing one int object by another with the division operator /.
1 / 3
0.3333333333333333
In general, if we combine float and int objects in arithmetic operations, we always end up with a float object: Python uses the "broader" representation.
40.0 + 2
42.0
21 * 2.0
42.0
float objects may also be created with the scientific literal notation: we use the symbol e to indicate powers of 10, so 1.23 * 10^0 translates into 1.23e0.
1.23e0
1.23
Syntactically, e needs a float or int literal on its left and an int literal on its right, both without a space in between. Otherwise, we get a SyntaxError.
1.23 e0
Cell In[15], line 1 1.23 e0 ^ SyntaxError: invalid syntax
1.23e 0
Cell In[16], line 1 1.23e 0 ^ SyntaxError: invalid decimal literal
1.23e0.0
Cell In[17], line 1 1.23e0.0 ^ SyntaxError: invalid syntax
If we leave out the number to the left, Python raises a NameError as it unsuccessfully tries to look up a variable named e0.
e0
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[18], line 1 ----> 1 e0 NameError: name 'e0' is not defined
So, to write 10^0 in Python, we need to think of it as 1 * 10^0 and write 1e0.
1e0
1.0
To express thousands of something (i.e., 10^3), we write 1e3.
1e3 # = thousands
1000.0
Similarly, to express, for example, milliseconds (i.e., 10^-3 s), we write 1e-3.
1e-3 # = milli
0.001
There are also three special values representing "not a number," called nan, and positive or negative infinity, called inf or -inf, that are created by passing the corresponding abbreviation as a str object to the float() built-in. These values may be used, for example, as the result of a mathematically undefined operation like division by zero or to model the value of a mathematical function as it goes to infinity.
float("nan") # also float("NaN")
nan
float("+inf") # also float("+infinity") or float("infinity")
inf
float("inf") # also float("+inf")
inf
float("-inf")
-inf
nan objects never compare equal to anything, not even to themselves. This happens in accordance with the IEEE 754 standard.
float("nan") == float("nan")
False
Another caveat is that any arithmetic involving a nan object results in nan. In other words, the addition below fails silently as no error is raised. As this also happens in accordance with the IEEE 754 standard, we need to be aware of it and check any data we work with for nan occurrences before doing any calculations.
42 + float("nan")
nan
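To guard against such silent nan propagation, we can screen our data with the isnan() function from the math module in the standard library. A minimal sketch, assuming the data comes as a list of float objects:

```python
import math

data = [40.0, float("nan"), 2.0]

# math.isnan() is the reliable test: because nan == nan is always
# False, an equality check cannot detect nan values.
clean = [x for x in data if not math.isnan(x)]

sum(clean)  # 42.0, whereas sum(data) evaluates to nan
```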
In contrast, as two values go to infinity, there is no such concept as a difference, and they compare equal.
float("inf") == float("inf")
True
Adding 42 to inf makes no difference.
float("inf") + 42
inf
float("inf") + 42 == float("inf")
True
We observe the same for multiplication ...
42 * float("inf")
inf
42 * float("inf") == float("inf")
True
... and even exponentiation!
float("inf") ** 42
inf
float("inf") ** 42 == float("inf")
True
Although absolute differences become meaningless as we approach infinity, signs are still respected.
-42 * float("-inf")
inf
-42 * float("-inf") == float("inf")
True
As a caveat, adding infinities of different signs is an undefined operation in math and results in a nan object. So, if we (accidentally or unknowingly) do this on a real dataset, we do not see any error messages, and our program may continue to run with meaningless results! This is another example of a piece of code failing silently.
float("inf") + float("-inf")
nan
float("inf") - float("inf")
nan
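As with nan, the math module lets us detect infinite values explicitly instead of letting them propagate; a small sketch using only isinf() and isnan() from the standard library:

```python
import math

# The mathematically undefined sum silently became nan.
result = float("inf") + float("-inf")

math.isnan(result)             # True
math.isinf(float("inf") + 42)  # True
```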
float objects are inherently imprecise, and there is nothing we can do about it! In particular, arithmetic operations with two float objects may result in "weird" rounding "errors" that are strictly deterministic and occur in accordance with the IEEE 754 standard.
For example, let's add 1 to 1e15 and 1e16, respectively. In the latter case, the 1 somehow gets "lost."
1e15 + 1
1000000000000001.0
1e16 + 1
1e+16
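The reason is the spacing between adjacent float values, which grows with their magnitude. The ulp() function in the math module (available since Python 3.9) returns the gap between a float and the next representable one:

```python
import math

# Around 1e15, adjacent floats are only 0.125 apart,
# so adding 1 still lands on a representable value.
math.ulp(1e15)  # 0.125

# Around 1e16, the gap is already 2.0,
# so adding 1 gets rounded back to 1e16.
math.ulp(1e16)  # 2.0
```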
Interactions between sufficiently large and small float objects are not the only source of imprecision.
from math import sqrt
sqrt(2) ** 2
2.0000000000000004
0.1 + 0.2
0.30000000000000004
This may become a problem if we rely on equality checks in our programs.
sqrt(2) ** 2 == 2
False
0.1 + 0.2 == 0.3
False
A popular workaround is to compare the absolute difference between the two numbers to be checked for equality against a pre-defined threshold sufficiently close to 0, for example, 1e-15.
threshold = 1e-15
abs((sqrt(2) ** 2) - 2) < threshold
True
abs((0.1 + 0.2) - 0.3) < threshold
True
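The math module ships this workaround as the isclose() function, which by default uses a relative tolerance of 1e-09 (an absolute tolerance may be added via the abs_tol parameter) instead of a fixed threshold:

```python
from math import isclose, sqrt

isclose(sqrt(2) ** 2, 2)  # True
isclose(0.1 + 0.2, 0.3)   # True

# A relative tolerance also scales to large magnitudes,
# where a fixed threshold like 1e-15 would be too strict.
isclose(1e16 + 2, 1e16)   # True
```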
The built-in format() function allows us to show the significant digits of a float number as they exist in memory to arbitrary precision. To exemplify it, let's view a couple of float objects with 50 digits. This analysis reveals that almost no float number is precise! After 14 or 15 digits, "weird" things happen. As we see further below, the "random" digits ending the float numbers do not "physically" exist in memory! Rather, they are "calculated" by the format() function, which is forced to show 50 digits.
The format() function is different from the format() method on str objects introduced in the next chapter (cf., Chapter 6). Yet, both work with the so-called format specification mini-language: ".50f" is the instruction to show 50 digits of a float number.
format(0.1, ".50f")
'0.10000000000000000555111512312578270211815834045410'
format(0.2, ".50f")
'0.20000000000000001110223024625156540423631668090820'
format(0.3, ".50f")
'0.29999999999999998889776975374843459576368331909180'
format(1 / 3, ".50f")
'0.33333333333333331482961625624739099293947219848633'
The format() function does not round a float object in the mathematical sense! It just allows us to show an arbitrary number of the digits as stored in memory, and it does not change them either.

In contrast, the built-in round() function creates a new numeric object that is a rounded version of the one passed in as the argument. It follows the familiar rules of rounding, except that exact ties are resolved towards the nearest even digit ("round half to even"), as the IEEE 754 standard prescribes.
For example, let's round 1 / 3 to five decimals. The obtained value for roughly_a_third is also imprecise but different from the "exact" representation of 1 / 3 above.
roughly_a_third = round(1 / 3, 5)
roughly_a_third
0.33333
format(roughly_a_third, ".50f")
'0.33333000000000001517008740847813896834850311279297'
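The tie-breaking rule is worth seeing in action: round() rounds halves to the nearest even digit ("banker's rounding") rather than always rounding .5 up as taught in school, and binary imprecision may pre-empt the tie altogether:

```python
# Exact ties go to the nearest even integer.
round(0.5)   # 0
round(1.5)   # 2
round(2.5)   # 2

# 2.675 is stored as slightly less than 2.675,
# so there is no tie and it rounds down.
round(2.675, 2)  # 2.67
```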
Surprisingly, 0.125 and 0.25 appear to be precise, and equality comparison works without the threshold workaround: both are powers of 2 in disguise.
format(0.125, ".50f")
'0.12500000000000000000000000000000000000000000000000'
format(0.25, ".50f")
'0.25000000000000000000000000000000000000000000000000'
0.125 + 0.125 == 0.25
True
To understand these subtleties, we need to look at the binary representation of floats and review the basics of the IEEE 754 standard. On modern machines, floats are modeled in so-called double precision with 64 bits that are grouped as in the figure below: the first bit determines the sign (0 for plus, 1 for minus), the next 11 bits represent an exponent term, and the last 52 bits represent the actual significant digits, the so-called fraction part. The three groups are put together like so:

(-1)^sign * 1.fraction * 2^(exponent - 1023)
A 1. is implicitly prepended as the first digit, and both the fraction and the exponent are stored in base-2 representation (i.e., they are both interpreted like the integers above). As the exponent is consequently non-negative, between 0 and 2047 in decimal to be precise, the -1023 term, called the exponent bias, centers the entire 2^(exponent - 1023) factor around 1 and allows the period within the 1.fraction part to be shifted into either direction by the same amount. Floating-point numbers received their name as the period, formally called the radix point, "floats" along the significant digits. As an aside, an exponent of all 0s or all 1s is used to model the special values nan or inf.
As the standard defines the exponent part to come as a power of 2, we now see why 0.125 is a precise float: it can be represented as a power of 2, i.e., 0.125 = (-1)^0 * 1.0 * 2^(1020 - 1023) = 2^-3 = 1/8. In other words, the floating-point representation of the decimal 0.125 is 0, 1111111100 (i.e., 1020 in decimal), and 0 for the three binary groups, respectively.
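We can make the three bit groups visible with the struct module from the standard library: pack the float into its 8 raw bytes (big-endian, format ">d") and re-interpret them as a 64-bit unsigned integer. A small sketch; the helper name float_to_bits is our own:

```python
import struct

def float_to_bits(x):
    """Return the 64 bits of a float as a string of 0s and 1s."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{as_int:064b}"

bits = float_to_bits(0.125)

bits[0]             # '0'  -> sign: plus
int(bits[1:12], 2)  # 1020 -> exponent: 1020 - 1023 = -3
int(bits[12:], 2)   # 0    -> fraction: the implicit 1. carries everything
```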
The crucial fact for the data science practitioner to understand is that mapping the infinite set of the real numbers ℝ to a finite set of bits leads to the imprecisions shown above!
So, floats are usually good approximations of real numbers only in their first 14 or 15 digits. If more precision is required, we need to resort to other data types, such as Decimal or Fraction, as shown in the next two sections.
This blog post gives another neat and visual way to think of floats. It also explains why floats become worse approximations of the reals as their absolute values increase.
The Python documentation provides another good discussion of floats and the goodness of their approximations.
If we are interested in the exact bits behind a float object, we use the .hex() method, which returns a str object beginning with "0x1." followed by the fraction in hexadecimal notation and, separated by a "p", the exponent as an integer after the 1023 bias has been subtracted.
one_eighth = 1 / 8
one_eighth.hex()
'0x1.0000000000000p-3'
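The class method float.fromhex() inverts .hex(); because the hexadecimal notation is exact, the round trip is lossless:

```python
# The hexadecimal representation round-trips without any loss,
# even for inherently imprecise values like 0.1.
float.fromhex("0x1.0000000000000p-3")  # 0.125
float.fromhex((0.1).hex()) == 0.1      # True
```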
Also, the .as_integer_ratio() method returns the two smallest integers whose ratio is exactly the value stored for the float object.
one_eighth.as_integer_ratio()
(1, 8)
roughly_a_third.hex()
'0x1.555475a31a4bep-2'
roughly_a_third.as_integer_ratio()
(3002369727582815, 9007199254740992)
0.0 is also a precise float number; it is encoded with the special bit pattern of all 0s in all three groups.
zero = 0.0
zero.hex()
'0x0.0p+0'
zero.as_integer_ratio()
(0, 1)
As seen in Chapter 1, the .is_integer() method tells us if a float can be cast as an int object without any loss in precision.
roughly_a_third.is_integer()
False
one = roughly_a_third / roughly_a_third
one.is_integer()
True
As the exact implementation of floats may vary and depend on a particular Python installation, we can look up the .float_info attribute in the sys module in the standard library to check the details. Usually, this is not necessary.
import sys
sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)