Elements of Data Science

by Allen Downey

The goal of these notebooks is to give you the tools you need to execute a data science project from beginning to end. We'll start with some basic programming concepts and work our way toward data science tools.

If you already have some programming experience, the first few notebooks might be a little slow. But there might be some material, specific to data science, that you have not seen before.

If you have not programmed before, you should expect to face some challenges. I have done my best to explain everything as we go along; I try not to assume anything. And there are exercises throughout the notebooks that should help you learn, and remember what you learned.

Programming is a superpower. As you learn to program, I hope you feel empowered to take on bigger challenges. But programming can also be frustrating. It might take some persistence to get past some rough spots.

The topics in this notebook include:

- Using Jupyter to write and run Python code.
- Basic programming features in Python: variables and values.
- Translating formulas from math notation to Python.

Along the way, we'll review a couple of math topics I assume you have seen before: logarithms and algebra.

This is a Jupyter notebook. Jupyter is a software development environment, which means you can use it to write and run programs in Python and other programming languages.

A Jupyter notebook is made up of cells, where each cell contains either text or code you can run.

If you are running this notebook on Colab, you should see buttons in the top left that say "+ Code" and "+ Text". The first one adds a code cell and the second adds a text cell.

If you want to try them out, select this cell by clicking on it, then press the "+ Text" button. A new cell should appear below this one.

Type something in the cell. You can use the buttons to format it, or you can mark up the text using Markdown. When you are done, hold down Shift and press Enter, which will format the text you just typed and then move to the next cell.

If you select a Code cell, you should see a button on the left with a triangle inside a circle, which is the icon for "Play". If you press this button, Jupyter runs the code in the cell and displays the results.

When you run code in a notebook for the first time, you might get a message warning you about the things a notebook can do. If you are running a notebook from a source you trust, which I hope includes me, you can press "Run Anyway".

Instead of clicking the "Play" button, you can also run the code in a cell by holding down Shift and pressing Enter.

This notebook introduces the most fundamental tools for working with data: representing numbers and other values, and performing arithmetic operations.

Python provides tools for working with numbers, words, dates, times, and locations (latitude and longitude).

Let's start with numbers. Python can handle several types of numbers, but the two most common are:

- `int`, which represents integer values like `3`, and
- `float`, which represents numbers that have a fraction part, like `3.14159`.

Most often, we use `int` to represent counts and `float` to represent measurements.

Here's an example of an `int` and a `float`:

In [3]:

```
3
```

In [4]:

```
3.14159
```

`float` is short for "floating-point", which is the name for the way these numbers are stored.
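As an optional aside, Python's built-in `type` function reports which kind of number a value is. `type` is not otherwise used in this notebook, so treat this as a small sketch; if you try it, run each line in its own cell to see each result.

```python
# type is a built-in function that reports the type of a value
type(3)        # int
type(3.14159)  # float
type(3.0)      # float: the decimal point makes it a float, even with no fraction part
```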

**Exercise:** Create a code cell below this one and type in the following number: `1.2345e3`

Then run the cell. The output should be `1234.5`.

The `e` in `1.2345e3` stands for "exponent". This way of writing numbers is a version of scientific notation that means $1.2345 \times 10^{3}$. If you are not familiar with scientific notation, you might want to read this.
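Here are a few more numbers written the same way, as a sketch of how the notation handles large and small values. The particular numbers are arbitrary examples; run each line in its own cell to see each result.

```python
6.022e23   # 6.022 times 10 to the 23rd power, a very large number
1e-3       # a negative exponent gives a small number: 0.001
1e3        # 1000.0: written this way, the value is a float even though it is a whole number
```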

Python provides operators that perform arithmetic. The operators that perform addition and subtraction are `+` and `-`:

In [5]:

```
2 + 1
```

In [6]:

```
2 - 1
```

The operators that perform multiplication and division are `*` and `/`:

In [7]:

```
2 * 3
```

In [8]:

```
2 / 3
```

And the operator for exponentiation is `**`:

In [9]:

```
2**3
```
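The operators above can produce either an `int` or a `float`, depending on the operator and the operands. The following sketch shows a few cases worth knowing; the specific numbers are only illustrations, and each line is meant to be run in its own cell.

```python
6 / 3     # 2.0: division with / always produces a float
2 * 3     # 6: multiplying two ints produces an int
2 * 3.0   # 6.0: if either operand is a float, the result is a float
2**0.5    # about 1.414: a fractional exponent computes a root
```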

Unlike math notation, Python does not allow "implicit multiplication". For example, in math notation, if you write $3 (2 + 1)$, that's understood to be the same as $3 \times (2+ 1)$.

Python does not allow that notation:

In [10]:

```
3 (2 + 1)
```

In this example, the error message is not very helpful, which is why I am warning you now. If you want to multiply, you have to use the `*` operator:

In [11]:

```
3 * (2 + 1)
```

The arithmetic operators follow the rules of precedence you might have learned as "PEMDAS":

- Parentheses before
- Exponentiation before
- Multiplication and division before
- Addition and subtraction

So in this expression:

In [12]:

```
1 + 2 * 3
```

The multiplication happens first. If that's not what you want, you can use parentheses to make the order of operations explicit:

In [13]:

```
(1 + 2) * 3
```
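The same rules apply to exponentiation. The sketch below also includes one case that PEMDAS does not mention: in Python, a leading minus sign is applied after exponentiation. Run each line in its own cell to see each result.

```python
2 * 3**2      # 18: exponentiation happens before multiplication
(2 * 3)**2    # 36: parentheses force the multiplication to happen first
-3**2         # -9: exponentiation also happens before the minus sign
(-3)**2       # 9: parentheses make the base negative
```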

**Exercise:** Write a Python expression that raises `1+2` to the power `3*4`. The answer should be `531441`.

Note: in the cell below, it should say

`# Solution goes here`

Lines like this that begin with `#` are "comments"; they provide information, but they have no effect when the program runs.

When you do this exercise, you should delete the comment and replace it with your solution.

In [14]:

```
# Solution goes here
```

Python provides functions that compute all the usual mathematical functions, like `sin` and `cos`, `exp` and `log`.

However, they are not part of Python itself; they are in a "library", which is a collection of functions that supplement the Python language.

Actually, there are several libraries that provide math functions; the one we'll use is called NumPy, which stands for "Numerical Python", and is pronounced "num' pie".

Before you can use a library, you have to "import" it. Here's how we import NumPy:

In [15]:

```
import numpy as np
```

It is conventional to import `numpy` as `np`, which means we can refer to it by the short name `np` rather than the longer name `numpy`.

Note that pretty much everything is case-sensitive, which means that `numpy` is not the same as `NumPy`. So even though the name of the library is NumPy, when we import it we have to call it `numpy`. If you run the following cell, you should get an error:

In [16]:

```
import NumPy as np
```

But if we import `numpy` correctly, we can use `np` to read the value `pi`, which is an approximation of the mathematical constant $\pi$.

In [17]:

```
np.pi
```

The result is a `float` with 16 digits. As you probably know, we can't represent $\pi$ with a finite number of digits, so this result is only approximate.
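As a small sketch of how a constant from a library can be used, the following compares `np.pi` with the value from `math`, another library that provides math functions and comes with Python, and then uses `np.pi` in a calculation. The radius of 2 is an arbitrary example.

```python
import math          # math is part of Python's standard library
import numpy as np

math.pi == np.pi     # True: both libraries store the same float
np.pi * 2**2         # area of a circle with radius 2, about 12.566
```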

`numpy` provides `log`, which computes the natural logarithm, and `exp`, which raises the constant `e` to a power.

In [18]:

```
np.log(100)
```

In [19]:

```
np.exp(1)
```
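Because `np.log` is the natural logarithm, `np.log(100)` is not 2. If you want a logarithm in base 10 or base 2, NumPy provides `log10` and `log2`. This sketch shows all three; run each line in its own cell to see each result.

```python
import numpy as np

np.log(np.e)     # 1.0: np.e is NumPy's approximation of the constant e
np.log10(100)    # 2.0: log10 computes the base-10 logarithm
np.log2(8)       # 3.0: log2 computes the base-2 logarithm
```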

**Exercise:** Use these functions to confirm the mathematical identity $\log(e^x) = x$, which should be true for any value of $x$.

With floating-point values, this identity should work for values of $x$ between -700 and 700. What happens when you try it with larger and smaller values?

In [20]:

```
# Solution goes here
```

As this example shows, floating-point numbers are finite approximations, which means they don't always behave like math.

As another example, see what happens when you add up `0.1` three times:

In [21]:

```
0.1 + 0.1 + 0.1
```

The result is close to `0.3`, but not exact.

We'll see other examples of floating-point approximation later, and learn some ways to deal with it.
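As a preview, here is one possible way to deal with it, sketched with the `==` operator, which checks whether two values are exactly equal, and NumPy's `isclose` function, which checks whether they are equal within a small tolerance. Neither has been introduced yet, so treat this as an illustration rather than the approach the later notebooks take.

```python
import numpy as np

0.1 + 0.1 + 0.1 == 0.3              # False: the two floats differ slightly
np.isclose(0.1 + 0.1 + 0.1, 0.3)    # True: they are equal within a small tolerance
```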

A variable is a name that refers to a value.

The following statement assigns the `int` value 5 to a variable named `x`:

In [22]:

```
x = 5
```

The variable we just created has the name `x` and the value 5.

If a variable name appears at the end of a cell, Jupyter displays its value.

In [23]:

```
x
```

If we use `x` as part of an arithmetic operation, it represents the value 5:

In [24]:

```
x + 1
```

In [25]:

```
x**2
```

We can also use `x` with `numpy` functions:

In [26]:

```
np.exp(x)
```

Notice that the result from `exp` is a `float`, even though the value of `x` is an `int`.
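One detail that is easy to miss: using a variable in an expression does not change its value. Only an assignment does. The following sketch shows the difference; run it line by line in separate cells if you want to see each intermediate value.

```python
x = 5
x + 1      # 6, but this does not change x
x          # still 5

x = x + 1  # assignment replaces the old value with the new one
x          # now 6
```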

**Exercise:** If you have not programmed before, one of the things you have to get used to is that programming languages are picky about details. Natural languages, like English, and semi-formal languages, like math notation, are more forgiving.

4 exampel in Ingli\$h you kin get prackicly evRiThing rong-rong-rong and sti11 be undr3stud.

As another example, in math notation, parentheses and square brackets mean the same thing. You can write

$\sin (\omega t)$

or

$\sin [\omega t]$

Either one is fine. And you can leave out the parentheses altogether, as long as the meaning is clear:

$\sin \omega t$

In Python, every character counts. For example, the following are all different:

```
np.exp(x)
np.Exp(x)
np.exp[x]
np.exp x
```

While you are learning, I encourage you to make mistakes on purpose to see what goes wrong. Read the error messages carefully. Sometimes they are helpful and tell you exactly what's wrong. Other times they can be misleading. But if you have seen the message before, you might remember some likely causes.

In the next cell, try out the different versions of `np.exp(x)` above, and see what error messages you get.

In [29]:

```
np.exp(x)
```

**Exercise:** Search the NumPy documentation to find the function that computes square roots, and use it to compute a floating-point approximation of the golden ratio:

$\phi = \frac{1 + \sqrt{5}}{2}$

Hint: The result should be close to `1.618`.

In [33]:

```
# Solution goes here
```

If you are running on Colab and you want to save your work, now is a good time to press the "Copy to Drive" button (near the upper left), which saves a copy of this notebook in your Google Drive.

If you want to change the name of the file, you can click on the name in the upper left.

If you don't use Google Drive, look under the File menu to see other options.

Once you make a copy, any additional changes you make will be saved automatically, so now you can continue without worrying about losing your work.

Now let's use variables to solve a problem involving mathematical calculation.

Suppose we have the following formula for computing compound interest from Wikipedia:

"The total accumulated value, including the principal sum $P$ plus compounded interest $I$, is given by the formula:

$V=P\left(1+{\frac {r}{n}}\right)^{nt}$

where:

- $P$ is the original principal sum
- $V$ is the total accumulated value
- $r$ is the nominal annual interest rate
- $n$ is the compounding frequency
- $t$ is the overall length of time the interest is applied (expressed using the same time units as $r$, usually years).

"Suppose a principal amount of \$1,500 is deposited in a bank paying an annual interest rate of 4.3\%, compounded quarterly. Then the balance after 6 years is found by using the formula above, with

In [30]:

```
P = 1500
r = 0.043
n = 4
t = 6
```

We can compute the total accumulated value by translating the mathematical formula into Python syntax:

In [31]:

```
P * (1 + r/n)**(n*t)
```
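Instead of only displaying the result, we can store it in a variable and use it in further calculations. The sketch below computes the interest earned, $I = V - P$, and uses the built-in `round` function, which this notebook does not otherwise cover, to round to the nearest cent. The total should come out to about 1938.84.

```python
P = 1500    # principal, in dollars
r = 0.043   # nominal annual interest rate
n = 4       # compounding periods per year (quarterly)
t = 6       # length of time, in years

V = P * (1 + r/n)**(n*t)   # total accumulated value
I = V - P                  # interest earned, the I in the quoted passage

round(V, 2), round(I, 2)   # both values, rounded to two decimal places
```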

**Exercise:** Continuing the example from Wikipedia:

"Suppose the same amount of \$1,500 is compounded biennially", so `n = 1/2`.

What would the total value be after 6 years? Hint: we expect the answer to be a bit less than the previous answer.

In [32]:

```
# Solution goes here
```

**Exercise:** If interest is compounded continuously, the value after time $t$ is given by the formula:

$V=P~e^{rt}$

Translate this formula into Python and use it to compute the value of the investment in the previous example with continuous compounding. Hint: we expect the answer to be a bit more than the previous answers.

In [33]:

```
# Solution goes here
```

**Exercise:** Applying your algebra skills, solve the previous equation for $r$. Now use the formula you just derived to answer this question.

"Harvard's tuition in 1970 was \$4,070 (not including room, board, and fees).

"In 2019 it is \$46,340. What was the annual rate of increase over that period, treating it as if it had compounded continuously?"

In [34]:

```
# Solution goes here
```

The point of this exercise is to practice using variables. But it is also a reminder about logarithms, which we will use extensively.

Here are a few tips on using Jupyter to compute and display values.

Generally, if there is a single expression in a cell, Jupyter computes the value of the expression and displays the result.

For example, we've already seen how to display the value of `np.pi`:

In [27]:

```
np.pi
```

Here's a more complex example with functions, operators, and numbers:

In [30]:

```
1 / np.sqrt(2 * np.pi) * np.exp(-3**2 / 2)
```
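A long expression like this can be easier to read, and to check, if you break it into named parts. The following sketch computes the same value step by step. The variable names are my own choices, and the name `density` reflects my reading that the expression has the form of the standard normal density evaluated at 3, which the text does not state.

```python
import numpy as np

z = 3
coefficient = 1 / np.sqrt(2 * np.pi)
exponent = -z**2 / 2                      # -4.5: the power is applied before the minus sign
density = coefficient * np.exp(exponent)  # same value as the one-line expression
density                                   # about 0.0044
```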

If you put more than one expression in a cell, Jupyter computes them all, but it only displays the result from the last:

In [32]:

```
1
2 + 3
np.exp(1)
(1 + np.sqrt(5)) / 2
```

If you want to display more than one value, you can separate them with commas:

In [36]:

```
1, 2 + 3, np.exp(1), (1 + np.sqrt(5)) / 2
```

That result is actually a tuple, which you will learn about in the next notebook.

Here's one last Jupyter tip: when you assign a value to a variable, Jupyter does not display the value:

In [37]:

```
phi = (1 + np.sqrt(5)) / 2
```

So it is idiomatic to assign a value to a variable and immediately display the result:

In [38]:

```
phi = (1 + np.sqrt(5)) / 2
phi
```

**Exercise:** Display the value of $\phi$ and its inverse, $1/\phi$, on a single line.

In [39]:

```
# Solution goes here
```

Congratulations on completing the first notebook!

Now that you have worked with Colab, you might find it helpful to watch this video, where I explain a little more about how it works:

In [12]:

```
from IPython.display import YouTubeVideo
YouTubeVideo("eIY-PsYBrPs")
```
