Python for Data Scientists

You can check an object's type using the `type()` built-in function.

Let's take a look at the `type()` of the integer from above:

In [401]:

```
# Print type() of 2
print(type(2))
```

**Truncations**

The `int()` function creates integers. It always rounds down, because what it's really doing is **truncating** the decimals from the number.

In [402]:

```
print( int(5.3) )
print( int(5.9999) )
```
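Nb: if you want conventional rounding rather than truncation, Python's built-in `round()` function does that (with one quirk: exact ties round to the nearest even number):

```python
# round() rounds to the nearest integer, unlike int() which truncates
print( round(5.3) )    # 5
print( round(5.9999) ) # 6
print( round(4.5) )    # 4 - ties round to the nearest even number
```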

Floats are numeric objects with decimals:

In [403]:

```
print(type(5))
print(type(5.))
print(type(5.3))
```

You can also convert an integer into a float using the `float()` function. Python will tag on a `.0` to indicate that it's a float.

In [404]:

```
print( 1337 ) # int
print( float(1337) ) # convert int to float
```

Strings are text objects enclosed by single or double quotes.

In [405]:

```
print( 'This is a string.' )
print( type('This is a string.') )
```

The following are the basic arithmetic operations; the Python documentation has a full list of operators.

In [406]:

```
print( 10 + 2 ) # Add
print( 10 - 2 ) # Subtract
print( 10 / 2 ) # Divide
print( 10 * 2 ) # Multiply
print( 10**2 ) # Exponentiation (here, 10 squared)
print( 10 % 2 ) # Modulo (i.e. what is the remainder when 10 is divided by 2?)
```

Comparison operators return booleans: objects that are either `True` or `False`.

In [407]:

```
print( 5 > 3 )  # Greater than
print( 5 >= 3 ) # Greater than or equal to
print( 5 < 3 )  # Less than
print( 5 <= 3 ) # Less than or equal to
print( 5 == 3 ) # Equal to
print( 5 != 3 ) # Not equal to
```

**Example string operations**

In [408]:

```
'5' * 5 # Repetition
```

Out[408]:

In [409]:

```
print( 992 + 345 ) # Performs calculation
print( '992 + 345' ) # Does not perform calculation
```

In [410]:

```
'992' + '345' # Concatenation
```

Out[410]:

This is the start of a long series of connected analyses that run throughout the notebook. Each analysis builds on previous analyses and concepts.

**Here is our hypothetical setting**:

- You are 007 James Bond. You are on your first training mission.
- You will travel to different locations in London.
- First, you will prepare for and plan the trip.
- After you arrive at each location, you will meet with local business owners and offer your help.

**First, calculate the total volume of water you can carry, in cm$^3$.**

- The formula for the volume of a cylinder is $V = \pi r^2 h$
- The container cylinder's radius is 4cm and its height is 16cm
- You have room in your backpack for 3 containers!

In [411]:

```
# Total volume of water you can carry
import math # for the pi constant
print( 3 * 3.14 * 4**2 * 16 ) # rough estimate using 3.14
3 * math.pi * 4**2 * 16       # more precise, using math.pi
```

Out[411]:

Q estimates that you'll need at least 2000 cm$^3$ for the mission. Is our container large enough?

In [412]:

```
(3 * math.pi * 4**2 * 16) > 2000
```

Out[412]:

In [413]:

```
# Insert dynamic value into string
print( '1 + 2 = {}'.format(1 + 2) )
```

In [414]:

```
# Insert 3 dynamic values into string
print( '{} + {} = {}'.format(1, 2, 1 + 2) )
```

Variables are named objects and used to store data or other information:

In [415]:

```
# Set variables
a = 2
b = 4
c = "Hello from Canada"
print( a )
print( b )
print( c )
print( a * b )
```

Best practice: use lower_case_with_underscores. Keep variable names as short as possible while still being descriptive.

For example, let's say we wanted to calculate total return on investment based on the formula for compound interest:

In [416]:

```
# Set variables
principal = 1000.0
yearly_interest_rate = 0.03
n_years = 5
```

In [417]:

```
# Calculate total investment return
total_return = principal * (1 + yearly_interest_rate)**n_years - principal
# Print total investment return
print( total_return )
```

**Interlude: Getting help on python commands**

In [418]:

```
# Get help on the print() function
help(print)
```

[Back to Table of Contents](#TOC)

Lists are mutable sequences of objects, enclosed by square brackets: `[]`

In [419]:

```
# Create list of integers
integer_list = [0, 1, 2, 3, 4]
print( integer_list )
print( type(integer_list) )
```

In [420]:

```
# Create list of mixed types: strings, ints, and floats
my_list = ['hello', 1, 'Canada', 2, 3.0]
# Print entire list
print( my_list )
```

In Python, lists are zero-indexed.

In [421]:

```
print( my_list[0] ) # Print the first element
print( my_list[2] ) # Print the third element
```

**Slicing**

In [422]:

```
# Selects everything from the 2nd element up to (but not including) the 4th element
print( my_list[1:3] )
```

In [423]:

```
# Selects all BEFORE the 4th element
print( my_list[:3] )
# Selects all starting from the 2nd element
print( my_list[1:] )
```

**Negative indices**

In [424]:

```
# Select the last element
print( my_list[-1] )
# Select all BEFORE the last element
print( my_list[:-1] )
# Selects all starting from the 2nd element, but before the last element
print( my_list[1:-1] )
```

Lists are mutable, meaning you can change individual elements of a list. Let's update `my_list`:

In [425]:

```
my_list[0] = 'bonjour' # Sets new value for the first element
print( my_list[0] ) # Print the first element
print( my_list[2] ) # Print the third element
```

**Appending** to and **removing** elements from lists are both easy to do. These are a special class of 'list' functions called methods. You can see all list methods by calling the help for lists!

In [426]:

```
help(list)
```

In [427]:

```
# Add to end of the list
my_list.append(22)
print(my_list)
# Remove an element from the list
my_list.remove(3.0)
print(my_list)
```

Let's create a couple of lists and do some operations.

In [428]:

```
a = [1, 2, 3]
b = [4, 5, 6]
```

In [429]:

```
print( a + b ) # Concatenation
```

In [430]:

```
print( a * 3 ) # Repetition
```

In [431]:

```
print( 3 in a ) # Membership
```

In [432]:

```
print( min(b), max(b) ) # Min, Max
```

In [433]:

```
print( len(a) ) # Length
```

In the previous analysis, you learned about 007's training mission. You also made sure you had enough water for the trip by performing calculations. Now, it's time to start planning which locations in London to visit!

*Boroughs of London.*

Let's analyze some London boroughs and their locations. First, let's create lists of the necessary borough variables that contain the locations. Python has a variety of **input/output** methods; you can learn more about them, including the one below, in the documentation.

In [434]:

```
# Read lists of locations
with open('project_files/brent.txt', 'r') as f:
    brent = f.read().splitlines()
with open('project_files/camden.txt', 'r') as f:
    camden = f.read().splitlines()
with open('project_files/redbridge.txt', 'r') as f:
    redbridge = f.read().splitlines()
with open('project_files/southwark.txt', 'r') as f:
    southwark = f.read().splitlines()
```

Note that when the file contents are split with the `splitlines()` string method, the resulting object is a list. So the four objects you just created from the files (`brent`, `camden`, `redbridge`, and `southwark`) are all lists.
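As a quick illustration: `splitlines()` is a string method that splits text on line boundaries and returns a list. Here, a toy string (with hypothetical location names) stands in for a file's contents:

```python
# A toy string standing in for a file's contents
text = 'Wembley\nKilburn\nHarlesden'
lines = text.splitlines()
print(lines)        # ['Wembley', 'Kilburn', 'Harlesden']
print(type(lines))  # <class 'list'>
```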

In [435]:

```
print( type(brent) )
print( type(camden) )
print( type(redbridge) )
print( type(southwark) )
```

Print the number of locations in each listed borough:

In [436]:

```
# Print length of each list
print(len(brent))
print(len(camden))
print(len(redbridge))
print(len(southwark))
```

In [437]:

```
# Print the first 5 locations of each Borough
print(brent[:5])
print(camden[:5])
print(redbridge[:5])
print(southwark[:5])
```

In [438]:

```
# Is 'Newbury Park' in Redbridge?
print( 'Newbury Park' in redbridge ) # Membership
# Is 'Peckham' in Brent?
print( 'Peckham' in brent ) # Membership
```

In [439]:

```
# Print minimum value in southwark
print( min(southwark)) # Min in this case is alphabet-sorted
# Print maximum value in southwark
print( max(southwark))
```

Sets are unordered collections of **unique** objects, enclosed by curly braces: `{}`. For example:

In [440]:

```
integer_set = {0, 1, 2, 3, 4}
print( integer_set )
print( type(integer_set) )
```

Because each element in a set must be unique, sets are a great tool for removing duplicates. For example:

In [441]:

```
fibonacci_list = [ 1, 1, 2, 3, 5, 8, 13 ] # Both 1's will remain
fibonacci_set = { 1, 1, 2, 3, 5, 8, 13 } # Only one 1 will remain
print( fibonacci_list )
print( fibonacci_set )
```

In [442]:

```
# Create a list
fibonacci_list = [ 1, 1, 2, 3, 5, 8, 13 ]
# Convert it to a set
fibonacci_set = set(fibonacci_list)
print( fibonacci_set )
```

In [443]:

```
powers_of_two = { 1, 2, 4, 8, 16 }
fibonacci_set = { 1, 1, 2, 3, 5, 8, 13 }
```

In [444]:

```
# Union: Elements in either set (two ways, both do same thing)
print( powers_of_two.union( fibonacci_set ) )
print( powers_of_two | fibonacci_set )
```

In [445]:

```
# Intersection: Elements in both sets (two ways, both do same thing)
print( powers_of_two.intersection( fibonacci_set ) )
print( powers_of_two & fibonacci_set )
```

In [446]:

```
# Difference
print( powers_of_two )
print( fibonacci_set )
print( powers_of_two - fibonacci_set )
print( fibonacci_set - powers_of_two)
```

Let's continue 007's training. First, let's see if we have duplicates in our lists and, if so, remove them:

In [447]:

```
# Does Brent have duplicates?
print(len(brent) != len(set(brent)))
# Does Camden have duplicates?
print(len(camden) != len(set(camden)))
# Does Redbridge have duplicates?
print(len(redbridge) != len(set(redbridge)))
# Does Southwark have duplicates?
print(len(southwark) != len(set(southwark)))
```

For the lists with duplicates, remove duplicates by converting them into sets. Then, convert them back into lists.

In [448]:

```
# Convert lists to sets to remove duplicates, then convert them back to lists
brent = list(set(brent))
camden = list(set(camden))
redbridge = list(set(redbridge))
southwark = list(set(southwark))
print(len(brent))
print(len(camden))
print(len(redbridge))
print(len(southwark))
```

Dictionaries are collections of **key-value pairs**, enclosed by curly braces: `{}`. (Since Python 3.7, dictionaries preserve insertion order.)

In [449]:

```
integer_dict = {
    'zero' : 0,
    'one' : 1,
    'two' : 2,
    'three' : 3,
    'four' : 4
}
print( integer_dict )
print( type(integer_dict) )
```

A dictionary, also called a "dict", is like a miniature database for storing and organizing data.

Each element in a dict is actually a key-value pair, and it's called an **item**.

- **Key**: A key is like a name for the value. You should make keys descriptive.
- **Value**: A value is some other Python object to store. Values can be floats, lists, functions, and so on.

Here's an example with descriptive keys:

In [450]:

```
my_dict = {
    'title' : "The Foundation Trillogy",
    'author' : 'Issac Assimov',
    100 : ['A number.', 'The answer to 60 + 40 = ?'] # Keys can be integers too!
}
```

Access values using their keys:

In [451]:

```
# Print the value for the 'title' key
print( my_dict['title'] )
# Print the value for the 'author' key
print( my_dict['author'] )
```

**Updating values**

In [452]:

```
# Updating existing key-value pair
my_dict['author'] = 'Isaac Asimov'
# Print the value for the 'author' key
print( my_dict['author'] )
```

In [453]:

```
# Append element to list
my_dict[100].append('Answer to the Ultimate Question of Life, the Universe, and Everything')
# Print the value for the key 100
print( my_dict[100] )
```

**Creating new items**

In [454]:

```
# Creating a new key-value pair
my_dict['year'] = 1951
```

In [455]:

```
# Print summary of the book.
print('{} was written by {} in {}.'.format(my_dict['title'], my_dict['author'], my_dict['year']) )
```

**Convenience functions**

In [456]:

```
# Keys - get all keys
print( my_dict.keys() )
# Values - get all values
print( my_dict.values() )
```

In [457]:

```
# All items (a view of key-value tuples)
print( my_dict.items() )
```

Let's create a locations dictionary for the London boroughs:

In [458]:

```
# Create location_dict
location_dict = {
    'Brent' : brent,
    'Camden' : camden,
    'Redbridge' : redbridge,
    'Southwark' : southwark
}
```

Let's double-check that the dictionary has the correct keys using a for loop (more on loop syntax in a later section):

In [459]:

```
for borough in ['Brent', 'Camden', 'Redbridge', 'Southwark']:
    print( borough in location_dict )
```

Since the dictionary is correct, we do not need the original lists, so we can get rid of them:

In [460]:

```
brent, camden, redbridge, southwark = None, None, None, None
```

In [461]:

```
type(brent)
```

Out[461]:

Nb: `None` is an object that denotes emptiness.
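The idiomatic way to test for `None` is the identity operator `is`, rather than `==`:

```python
brent = None  # emptied, as in the cell above
print(brent is None)      # True
print(brent is not None)  # False
```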

Now, we want to split our visit to London into two trips: one for Inner London and one for Outer London.

**Let's add two new items to our dictionary:**

- **Key:** `'Inner London'`. **Value:** all locations in `'Camden'` and `'Southwark'`.
- **Key:** `'Outer London'`. **Value:** all locations in `'Brent'` and `'Redbridge'`.

In [462]:

```
# Create a new key-value pair for 'Inner London'
location_dict['Inner London'] = location_dict['Camden'] + location_dict['Southwark']
# Create a new key-value pair for 'Outer London'
location_dict['Outer London'] = location_dict['Brent'] + location_dict['Redbridge']
```

In [463]:

```
# Quick QA - sum the location counts of the four boroughs
len(location_dict['Camden']) + len(location_dict['Southwark']) + len(location_dict['Brent']) + len(location_dict['Redbridge'])
```

Out[463]:

In [464]:

```
# Quick QA - sum the inner and outer location counts
len(location_dict['Inner London']) + len(location_dict['Outer London'])
```

Out[464]:

**Save work using Pickle**

- Use the Python built-in module `pickle` to save the `location_dict` object for later use.
- Pickle saves an entire object in a file on your computer.
- Pickling converts a Python object (list, dict, etc.) into a character stream that contains all the information necessary to reconstruct the object in another Python script.
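To see the "character stream" idea without touching the disk, you can round-trip an object through `pickle.dumps()` and `pickle.loads()` (a minimal sketch using a hypothetical toy dict):

```python
import pickle

# A hypothetical toy dict standing in for location_dict
toy_dict = {'Brent': ['Wembley'], 'Camden': ['Chalk Farm']}
byte_stream = pickle.dumps(toy_dict)  # serialize the object to bytes
restored = pickle.loads(byte_stream)  # reconstruct it from the bytes
print(restored == toy_dict)           # True
```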

In [465]:

```
# Import pickle module
import pickle
# Save object to disk
with open('project_files/location_dict.pkl', 'wb') as f: # Save location_dict to the project_files folder
    pickle.dump(location_dict, f)
```

[Back to Table of Contents](#TOC)

If statements check if a condition is met before running a block of code.

Begin them with the `if` keyword. Then the statement has two parts:

- The **condition**, which must evaluate to a boolean. (Technically, it's fine as long as its "truthiness" can be evaluated.)
- The **code block** to run if the condition is met (indented with 4 spaces).

For example:

In [466]:

```
current_fuel = 85
# Condition
if current_fuel >= 80:
    # Code block to run if condition is met
    print( 'We have enough fuel to last the zombie apocalypse.' )
```

In [467]:

```
current_fuel = 50
# Condition
if current_fuel >= 80:
    # Do this when condition is met
    print( 'We have enough fuel to last the zombie apocalypse.' )
else:
    # Do this when condition is not met
    print( 'Restock! We need at least {} gallons.'.format(80 - current_fuel) )
```

In [468]:

```
current_fuel = 70
# First condition
if current_fuel >= 80:
    print( 'We have enough fuel to last the zombie apocalypse.' )
# If first condition is not met, check this condition
elif current_fuel < 60:
    print( 'ALERT: WE ARE WAY TOO LOW ON FUEL!' )
# If no conditions were met, perform this
else:
    print( 'Restock! We need at least {} gallons.'.format(80 - current_fuel) )
```
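On the "truthiness" point above: empty containers, zero, and `None` all evaluate as `False` in a condition, while non-empty and non-zero values evaluate as `True`. A minimal sketch with a hypothetical empty list:

```python
locations = []  # a hypothetical empty list
if locations:
    print('We have locations to visit.')
else:
    print('The list is empty.')  # empty lists are "falsy", so this branch runs
```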

We have created our location dictionary containing the boroughs we are interested in visiting. Now, we want to determine which location to visit first.

First, let's import `location_dict` again using `pickle`.

In [469]:

```
# Import location_dict using pickle
import pickle
# Read object from disk
with open('project_files/location_dict.pkl', 'rb') as f:
    location_dict = pickle.load(f)
```

In [470]:

```
# Print the keys in location_dict
print(location_dict.keys())
```

**Pseudocode for the code below**:

- **If** our Inner London list has more locations than our Outer London list, print: `Inner London is huge!`
- **Else if** our Outer London list has more locations than our Inner London list, print: `Outer London is huge!`
- **Else** (i.e. they have the same number of locations), print: `Inner and outer London are huge!`

In [471]:

```
if len(location_dict['Inner London']) > len(location_dict['Outer London']):
    print('Inner London is huge!')
elif len(location_dict['Outer London']) > len(location_dict['Inner London']):
    print('Outer London is huge!')
else:
    print('Inner and outer London are huge!')
```

In [472]:

```
# Print list
for number in [0, 1, 2, 3, 4]:
    print( number )
```

In [473]:

```
# Print range
for number in range(5):
    print( number )
```

In [474]:

```
# Print reversed range
for number in reversed(range(5)):
    print( number )
```

In [475]:

```
# Check if number is divisible by another number
range_list = range(10)
for number in reversed(range_list):
    if number == 0:
        print( 'Liftoff!' )
    elif number % 3 == 0:
        print( 'Buzz' )
    else:
        print( number )
```

In [476]:

```
# for loops within other for loops
list_a = [4, 3, 2]
list_b = [6, 3]
for a in list_a:
    for b in list_b:
        print( a, 'x', b, '=', a * b )
```

- Set a variable to an empty list, which is just `[]`.
- Then, as you loop through elements, `.append()` the ones you want to your list variable.

In [477]:

```
# Separate range_list into evens_list and odds_list
range_list = range(10)
# Initialize empty lists
evens_list = []
odds_list = []
# Iterate through each number in range_list
for number in range_list:
    # Check for divisibility by 2
    if number % 2 == 0:
        # If divisible by 2
        evens_list.append(number)
    else:
        # If not divisible by 2
        odds_list.append(number)
# Confirm our lists are correct
print( evens_list )
print( odds_list )
```

Let's count the number of locations in each of our boroughs.

**Tip:** You can iterate through the keys and values of a dictionary at the same time using `.items()`, like so: `for key, value in location_dict.items(): # code block`

**Tip:** Remember, to insert multiple dynamic values into a string, you can just add more placeholders to `.format()`, like so: `'{} has {} locations.'.format(first_value, second_value)`

In [478]:

```
# For each key in location_dict, print the number of locations in its list
for key, value in location_dict.items():
    print('{} has {} locations.'.format(key, len(value)))
```

Now, let's give each location in Outer London a first impression based on its name.

**Pseudocode**

Combine `if` and `for` statements. For each location in Outer London...

- **If** its name has `'Farm'`, `'Park'`, `'Hill'`, or `'Green'` in it, print: `{name} sounds pleasant.`
- **Else if** its name has `'Royal'`, `'Queen'`, or `'King'` in it, print: `{name} sounds grand.`
- If its name doesn't sound pleasant or grand, just ignore it.

**Tip:** If you want to check whether any word from a list is found in a string, you can use `any()`, like so: `any(word in name for word in list_of_words)`

In [479]:

```
pleasant_sounding = ['Farm', 'Park', 'Hill', 'Green']
royal_sounding = ['Royal', 'Queen', 'King']
# Print first impression of each location in Outer London based on its name
for name in location_dict['Outer London']:
    if any(word in name for word in pleasant_sounding): # 'word' is defined inside the any() expression
        print( name, 'sounds pleasant.' )
    elif any(word in name for word in royal_sounding):
        print( name, 'sounds grand.' )
```

List comprehensions aren't mandatory, but they help keep your code clean and concise.

List comprehensions construct new lists out of existing ones after applying transformations or conditions to them.

Example

In [480]:

```
# Construct list of the squares in range(10) using list comprehension
squares_list = [number**2 for number in range(10)]
print( squares_list )
```

You can even include conditional logic in list comprehensions!

In [481]:

```
# Conditional inclusion
evens_list = [number for number in range(10) if number % 2 == 0]
print( evens_list )
```

In [482]:

```
# Conditional outputs
even_odd_labels = ['Even' if number % 2 == 0 else 'Odd' for number in range(10)]
print( even_odd_labels )
```

Finally, list comprehensions are not limited to lists!

- You can use them for other data structures too.
- The syntax is the same, except you enclose it with curly braces for sets. (Note that parentheses create a *generator expression*, not a tuple; wrap it in `tuple()` if you want a tuple.)

For example, we can create a set like so:

In [483]:

```
# Construct set of doubles using set comprehension
doubles_set = { number * 2 for number in range(10) }
print( doubles_set )
```
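Dict comprehensions work the same way, with a `key: value` pair before the `for`. And as noted above, parentheses produce a generator expression rather than a tuple, so use `tuple()` when a tuple is what you want:

```python
# Dict comprehension: map each number to its square
squares_dict = {number: number**2 for number in range(5)}
print(squares_dict)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

# Parentheses create a generator expression, not a tuple
doubles_tuple = tuple(number * 2 for number in range(5))
print(doubles_tuple)  # (0, 2, 4, 6, 8)
```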

**Let's use a list comprehension to create a new list called pleasant_locations.**

- It should contain the locations in Outer London that are `pleasant_sounding`.

In [484]:

```
# Create pleasant_locations list using a list comprehension
pleasant_locations = [name for name in location_dict['Outer London']
                      if any(word in name for word in pleasant_sounding)]
# Print the pleasant-sounding locations
print(pleasant_locations)
```

In [485]:

```
# Print number of pleasant-sounding locations
len(pleasant_locations)
```

Out[485]:

In [486]:

```
def make_message_exciting(message='hello, world'):
    text = message + '!'
    return text

print( type(make_message_exciting) )
```

Let's break that down:

- Functions begin with the `def` keyword, followed by the function name.
- They can take optional arguments, e.g. `message='hello, world'`.
- They are then followed by an indented code block.
- Finally, they return a value; the return statement is also indented.

To call a function, simply type its name followed by parentheses.

In [487]:

```
# Call make_message_exciting() function
make_message_exciting()
```

Out[487]:

In practice, functions are ideal for isolating functionality.

In [488]:

```
def square(x):
    output = x*x
    return output

def cube(x):
    output = x*x*x
    return output

print( square(3) )
print( cube(2) )
print( square(3) + cube(2) )
```

Nb: the code block before the return statement is actually optional; a function can consist of a single return statement.

In [489]:

```
# Example of a function without a code block
def hello_world():
    return 'hello world'

# Call hello_world() function
hello_world()
```

Out[489]:

And also... the return statement is optional, as long as you have a code block.

- If no return statement is given, the function will return `None` by default.
- Code blocks in the function will still run.

In [490]:

```
# Example of a function without a return statement
def print_hello_world():
    print( 'hello world' )

# Call print_hello_world() function
print_hello_world()
```

Arguments can also have default values, set using the `=` operator.

In [491]:

```
def print_message( message='Hello, world', punctuation='.' ):
    output = message + punctuation
    print( output )

# Print default message
print_message()
```

To pass a new value for the argument, simply set it again when calling the function.

In [492]:

```
# Print new message, but default punctuation
print_message( message='Nice to meet you', punctuation='...' )
```

When passing a value to an argument, you don't have to write the argument's name if the values are in order.

In [493]:

```
# Print new message without explicitly naming the arguments
print_message( 'Where is everybody', '?' )
```

**Let's write a function called filter_locations that takes two arguments:**

- `location_list`
- `words_list`

The function should return the list of names in `location_list` that contain any word in `words_list`.

In [494]:

```
def filter_locations(location_list, words_list):
    return [name for name in location_list
            if any(word in name for word in words_list)]
```

Next, let's test that function.

**Let's create a new pleasant_locations list using the function we just wrote.**

- Pass in the list of Outer London locations and the list of pleasant-sounding words.

In [495]:

```
# Create pleasant_locations using filter_locations()
pleasant_locations = filter_locations(location_dict['Outer London'], pleasant_sounding)
# Print list of pleasant-sounding locations
print( pleasant_locations )
```

**Next, let's use this handy function to create a grand_locations list for locations that sound grand.**

- Pass in the list of Outer London locations and the list of grand-sounding words.

In [496]:

```
# Create grand_locations using filter_locations()
grand_locations = filter_locations(location_dict['Outer London'], royal_sounding)
# Print list of grand-sounding locations
print(grand_locations)
```

Awesome, are there any locations in both lists?

**Display the locations in both the pleasant_locations and grand_locations lists.**

- You could compare these lists manually because they are short, but try using sets and the `.intersection()` function (or the `&` operator)!

In [497]:

```
# Display locations that sound pleasant and grand (two ways)
# Using .intersection() function
print(set(pleasant_locations).intersection( set(grand_locations) ))
# Using & operator
print(set(pleasant_locations) & set(grand_locations)) # I prefer this!
```

Great, we'll start with these for our visit.

[Back to Table of Contents](#TOC)

Let's import NumPy.

In [498]:

```
import numpy as np # Set np as an alias for NumPy
```

NumPy arrays are tables of elements that all share the same data type (usually numeric). Their official type is `numpy.ndarray`.

In [499]:

```
# Array of ints
array_a = np.array([0, 1, 2, 3])
print( array_a )
print( type(array_a) )
```

In [500]:

```
# Print data type of contained elements
print( array_a.dtype )
```

- The NumPy array `array_a` itself has type **numpy.ndarray**.
- The elements contained inside the array have type **int64**.

NumPy arrays are homogeneous, which means all of their elements must have **the same data type**.

Because NumPy doesn't support mixed types, it coerces all of the elements in a mixed array to a common type; here, they all become strings.

See the following:

In [501]:

```
# Mixed array with 1 string and 2 integers
array_b = np.array(['four', 5, 6])
# Print elements in array_b
print( array_b )
```
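Checking the `dtype` confirms the coercion: every element becomes a fixed-width Unicode string (the exact width, e.g. `<U21`, can vary):

```python
import numpy as np

array_b = np.array(['four', 5, 6])
print(array_b.dtype)       # a fixed-width Unicode dtype, e.g. <U21
print(array_b.dtype.kind)  # 'U' is NumPy's code for Unicode strings
```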

The two arrays we created above, `array_a` and `array_b`, both have only 1 axis.

- You can think of an axis as a direction in the coordinate plane.
- For example, lines have 1 axis, squares have 2 axes, cubes have 3 axes, etc.

We can use the `.shape` attribute to see the axes of a NumPy array.

In [502]:

```
print( array_a.shape )
print( array_b.shape )
```

As you can see, `.shape` returns a tuple.

- The number of elements in the tuple is the number of axes.
- And each element's value is the length of that axis.

Together, these two pieces of information make up the shape, or dimensions, of the array.

- So array_a is a 4x1 array. It has 1 axis of length 4.
- And array_b is a 3x1 array. It has 1 axis of length 3.
- They are both considered "1-dimensional" arrays.

In [503]:

```
print(array_a)
# First element of array_a
print( array_a[0] )
# Last element of array_a
print( array_a[-1] )
```

In [504]:

```
# Or SLICE it: elements at indices 2 and 3 (the third and fourth elements)
print( array_a[2:4] )
```

- NumPy has a special `np.nan` object for denoting missing values.
- This object is called `NaN`, which stands for Not a Number.

For example, let's create an array with a missing value:

In [505]:

```
# Array with missing values
array_with_missing_value = np.array([1.2, 8.8, 4.0, np.nan, 6.1])
# Print array
print( array_with_missing_value )
```

In [506]:

```
# Print array's dtype
print( array_with_missing_value.dtype )
```
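NumPy also provides NaN-aware helpers so missing values don't poison calculations: `np.isnan()` locates them, and functions like `np.nanmean()` ignore them:

```python
import numpy as np

values = np.array([1.2, 8.8, 4.0, np.nan, 6.1])
print(np.isnan(values))    # Boolean mask marking the missing entry
print(np.mean(values))     # nan: a plain mean is poisoned by the NaN
print(np.nanmean(values))  # ~5.025: the mean of the four non-missing values
```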

*In the previous analysis...*

In the previous analysis, we decided on Park Royal as our first stop in our training mission to London.

Now we've arrived at the tube station!

*Park Royal Tube Station*

Park Royal is home to London's largest business park.

- The business park supports around 1,700 businesses.
- Q hands you a manual with their names and locations. *They are listed in order from smallest to largest* (this will be important later).

**First, let's create a NumPy array called business_ids.**

- The first business has ID `1`, the second one has ID `2`, and so on.

In [507]:

```
# Create array of business_ids with values ranging from 1 to 1700
business_ids = np.array(range(1, 1701))
```

**Next, print the shape of business_ids to confirm it has shape (1700, ).**

In [508]:

```
# Print shape of business_ids
print( business_ids.shape )
```

**Finally, print the last 10 business ID's to confirm the array is set up properly.**

In [509]:

```
# Print last 10 business ID's
print( business_ids[-10:] )
# Wrong way to do it
print( business_ids[-10:-1] ) # Remember, the end number not included!
```

Two-dimensional arrays are for creating matrices!

For example, let's create a matrix with 2 rows and 3 columns:

In [510]:

```
array_c = np.array([[1, 2, 3], [4, 5, 6]])
print( array_c )
```

In [511]:

```
# The NumPy array above is 2x3.
print( array_c.shape )
```

For example, we can reshape it from its original shape of 2x3 to the new shape of 3x2:

In [512]:

```
# Reshape to 3x2
print( array_c.reshape(3, 2) )
```

Nb: this is not the same as transposing array_c, because we are keeping the same order of the elements.

In [513]:

```
# Reshape to 1x6
print( array_c.reshape(1,6) )
```

In [514]:

```
# Print number of axes
print(array_c)
print( len(array_c.reshape(1,6).shape) )
```

If you want a flat array of 6 elements with a reduced number of axes, you actually need to reshape it to (6,), which has a single axis and no second axis.

- There's a shortcut for reducing an array to 1 axis: the `.flatten()` function.

In [515]:

```
# Reshape and reduce axes
print( array_c.reshape(6,) )
# Flatten (reduce to 1 axis) # Shortcut
print( array_c.flatten() )
# Print number of axes
print( len(array_c.reshape(6,).shape) )
print( len(array_c.flatten().shape) )
```

In [516]:

```
# Nb: reshape() returns a new array; the original array is unchanged!
print(array_c)
print(array_c.shape)
```

If you do want to transpose the array, you can simply `.transpose()` it!

- Transposing will flip the rows and columns of an array (not just the shape).
- Reshaping will simply change the shape, but keep the same order of elements.

In [517]:

```
# Transpose
print( array_c.transpose() )
# REMEMBER the original data is still the same
print(array_c)
```

Let's look at an example with a 3x3 array:

In [518]:

```
# Create a 3x3 array
array_d = np.array(range(1,10)).reshape(3,3)
print(array_d)
```

In [519]:

```
# Select all elements in the first row
# Tip: A colon (:) means select all elements along that axis.
print( array_d[0, :] )
```

In [520]:

```
# Select second row and all columns
print( array_d[1, :] )
```

In [521]:

```
# Select third column and all rows
print( array_d[:, 2])
```

In [522]:

```
# Select second column and all rows
print( array_d[:, 1])
```

In [523]:

```
# Select everything from the second row and second column onward
print( array_d[1:, 1:] )
```

There are **1700** businesses in Park Royal's business park. However, you only have **10** hours to stay in Park Royal before you need to move on to the next location. Assuming you could only visit **5** businesses per hour, which businesses should you visit?

Well, 2 options immediately come to mind:

- You could just visit the first 50 businesses.
- You could randomly sample 50 businesses.

While these appear fine at first glance, there are potential flaws with both of these approaches. Remember, the businesses were listed in order from smallest to largest! Therefore...

- Just visiting the first 50 would give us a biased sample of only the smallest businesses.
- Visiting a random sample of 50 is better, but there's still a chance that our sample ends up biased from pure chance.

Instead, let's take a **stratified random sample** based on the size of the business.

This is a very important concept for practical machine learning.

Stratified random sampling is first grouping your observations based on a key variable, and then randomly sampling from those groups.

This will ensure your sample is **representative** of the broader dataset along that key variable.

*Stratified Random Sampling*

For this analysis, it just means that...

- We'll start by splitting our businesses into 10 groups of 170 businesses each.
- The first group will have ID's from 1 to 170, the second will have ID's from 171 to 340, etc.
- Then we'll randomly select 5 businesses from each group of 170.
- Since the businesses are already ordered by size, this will ensure that small, medium, and big businesses are all represented in our sample.
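The four steps above can be sketched end-to-end in a few lines of NumPy. This is just a preview under one assumption: `business_ids` is the 1-D array of ID's from 1 to 1700 (recreated here so the sketch is self-contained); the notebook builds the same pipeline up step by step.

```python
import numpy as np

# Stand-in for the notebook's business_ids array (ID's 1 to 1700)
business_ids = np.array(range(1, 1701))

np.random.seed(0)  # any seed; just for reproducibility

# Reshape so each ROW holds 170 consecutive ID's, then transpose so each
# COLUMN is one size-based group (reshape fills row by row, so reshaping
# straight to (170, 10) would NOT give consecutive groups per column)
groups = business_ids.reshape(10, 170).T

# Randomly sample 5 ID's (without replacement) from each of the 10 groups
sample = np.array([np.random.choice(groups[:, g], 5, replace=False)
                   for g in range(groups.shape[1])])
print(sample.shape)  # one row of 5 sampled ID's per group
```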

**First, reshape the 1-dimensional business_ids array into a new 2-dimensional id_matrix array.**

In [524]:

```
# Create id_matrix by reshaping business_ids to have 10 columns
id_matrix = business_ids.reshape(170,10)
# Print shape
print(id_matrix.shape)
```

In [525]:

```
print(id_matrix)
```

Great, now we have a matrix with 10 columns representing 10 groups of businesses. But remember, our goal is to stratify our sample by size of the business, and because our businesses are ordered by size, the first group should be 1 to 170, the second group should be 171 to 340, and so on.

**Print the first column (group) of id_matrix.**

- Does it contain businesses 1 to 170?

In [526]:

```
# Print first column of id_matrix
print(id_matrix[:,0])
```

Crap, that's not what we wanted.

Remember, when you `.reshape` an array, the new array keeps the elements in their original order, filling each **row** before moving on to the next (row-major order).
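A minimal sketch of this reshape behavior, using a tiny array:

```python
import numpy as np

# reshape fills row by row (row-major order)
a = np.array(range(1, 7))
print(a.reshape(2, 3))
# [[1 2 3]
#  [4 5 6]]
# So the first COLUMN is [1 4], not the first chunk of consecutive values
print(a.reshape(2, 3)[:, 0])
```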

**Interlude: Toy Problem**

Let's walk through a miniature example of what we just did in the previous analysis, because it will be easier to understand by peeking under the hood.

- Instead of 1700 businesses, let's say we only had 170 businesses to visit, but we still want to group them into 10 groups by ID.
- Therefore, we want the first group to have businesses 1 to 17, the second group to have 18 to 34, the third group to have 35 to 51, and so on.

In [527]:

```
# Create array of 170 ID's
mini_business_ids = np.array( range(1, 171) )
# Reshape to have 10 columns
mini_id_matrix = mini_business_ids.reshape(17, 10)
# Display mini_id_matrix
print( mini_id_matrix )
```

Aha, now we're on to something... so instead of reshaping to 17 rows and 10 columns, what if we reshape to 10 rows and 17 columns?

In [528]:

```
# Reshape to have 10 rows
mini_id_matrix = mini_business_ids.reshape(10, 17)
# Display mini id matrix
print( mini_id_matrix )
```

Now we have 10 rows instead of 10 columns.

- The first row has the businesses 1 to 17, which is what we want!
- And if we want the first column to have businesses 1 to 17, we can simply transpose this matrix.

Nb: **Toy problems** are miniature versions of your problem that are easier to break apart and understand conceptually.
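As a minimal sketch of what transposing does (rows become columns):

```python
import numpy as np

m = np.array(range(1, 7)).reshape(2, 3)
print(m)          # rows are [1 2 3] and [4 5 6]
print(m.T)        # .T (same as .transpose()) flips rows and columns
print(m.T[:, 0])  # the first column is now [1 2 3]
```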

Let's try building that `id_matrix` again.

**This time, reshape it to have 10 rows, each representing 1 group of businesses.**

- The first row should contain businesses 1 to 170.

In [529]:

```
# Reshape business_ids to have 10 rows, with 170 businesses each
id_matrix = business_ids.reshape(10,170)
```

In [530]:

```
print(id_matrix)
```

In [531]:

```
# Print the first row of id_matrix
print(id_matrix[0,:])
```

The first row now contains businesses with ID's from **1 to 170**.

Now, what if we still want our groups to be grouped by column, instead of by row? Let's *flip our rows and columns* now that the rows have the correct groups.

**Overwrite id_matrix with its transposed version.**

In [532]:

```
# Overwrite id_matrix with flipped version
id_matrix = id_matrix.transpose()
# Print the transposed id_matrix
print(id_matrix)
# Print shape of new id_matrix
print(id_matrix.shape)
```

NumPy math functions and arithmetic operations are applied **elementwise**:

- Functions are applied to each element in the array.
- Operations are applied between corresponding elements in two arrays.

Example:

In [533]:

```
# 2x2 matrix of floats
x = np.array([[1.0, 2.0], [3.0, 4.0]])
print( x )
# 2x2 matrix of floats
y = np.array([[2.0, 5.0], [10.0, 3.0]])
print( y )
```

In [534]:

```
# Addition
print( x + y ) # This is matrix addition, from linear algebra!
```

In [535]:

```
# Subtraction
print( x - y )
```

In [536]:

```
# Multiplication (elementwise, NOT matrix multiplication)
print( x * y )
```

In [537]:

```
# Division
print( x / y )
```

In [538]:

```
# Modulo
print( x % y )
```

In [539]:

```
print(x)
```
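The next few cells mix an array with a single number. This works because NumPy **broadcasts** the scalar, applying it to every element as if it had been expanded to the array's shape. A minimal sketch:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
# The scalar 10 is broadcast across all four elements
print(a * 10)
# [[10. 20.]
#  [30. 40.]]
```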

In [540]:

```
# Addition
print( x + 2 )
```

In [541]:

```
# Subtraction
print( x - 2 )
```

In [542]:

```
# Multiplication
print( x * 2 )
```

In [543]:

```
# Modulo
print( x % 2 )
```

A few of the most useful elementwise math functions include:

- `np.sqrt()` for square root
- `np.abs()` for absolute value
- `np.power()` for raising elements to a power
- `np.exp()` for calculating exponentials
- `np.log()` for calculating the natural log

Example:

In [544]:

```
# Cubed
print(x)
np.power(x, 3)
```

Out[544]:

A few of the most useful aggregation functions include:

- `np.sum()` for calculating the sum
- `np.min()` for finding the smallest element
- `np.max()` for finding the largest element
- `np.median()` for finding the median
- `np.mean()` for calculating the mean
- `np.std()` for calculating the standard deviation

Example:

In [545]:

```
print( x )
# Sum of all elements
print( np.sum(x) )
```

Now, what if we want to aggregate across columns or rows, instead of across the entire array?

We can pass in an axis argument:

- `axis=0` to aggregate each column (collapsing the rows)
- `axis=1` to aggregate each row (collapsing the columns)

For example:

In [546]:

```
print( x )
# Sum of each column
print( np.sum(x, axis=0) )
```

In [547]:

```
# Sum of each row
print( np.sum(x, axis=1) )
```

Let's take another look at our `id_matrix`.

In [548]:

```
id_matrix
```

Out[548]:

- In the previous analysis, we displayed the first group (column) and confirmed that its ID's ran from **1 to 170**.
- Now let's confirm the rest of the array is correct by finding the `np.min()` and `np.max()` of each group.

**First, create and print an object called group_min with the minimum ID of each group (column) of id_matrix.**

In [549]:

```
# Create group_min of each column
group_min = np.min(id_matrix, axis=0)
# Print group_min
print(group_min)
```

**Next, let's create and print an object called group_max with the maximum ID of each group (column) of id_matrix.**

In [550]:

```
# Create group_max of each column
group_max = np.max(id_matrix, axis=0)
# Print group_max
print(group_max)
```

**Finally, subtract group_min from group_max to confirm that each of the 10 groups spans 170 ID's.**

- Remember to add 1 to the difference, because both endpoints are inclusive.
- e.g. for the first group, 170 - 1 = 169, so 169 + 1 = 170.

In [551]:

```
# Print range of each group
print(group_max - group_min + 1)
```

NumPy comes with a submodule called `np.random`.

- This submodule has many useful tools for random sampling from various distributions and generating randomized data.

**Interlude: Tab trick**

Pressing tab after typing `np.random.` will show all the available methods! This is called tab completion.

Let's try generating an 8x8 matrix of random digits from 0 to 9.

We can use the `np.random.randint()` function, which draws from a uniform distribution (i.e. each integer is equally likely to be drawn).

It takes 3 arguments:

- low: Lower boundary for sampling (inclusive)
- high: Upper boundary for sampling (exclusive)
- size: Number of samples to draw

First, let's generate 64 random digits:

In [552]:

```
# Randomly draw 64 samples from a uniform distribution over [0, 10)
sample = np.random.randint(low=0, high=10, size=64)
print( sample )
```

Next, we can **reshape** that array into the **8x8** matrix we wanted.

In [553]:

```
print(sample.reshape(8,8))
```

However, there's an easier way to create an 8x8 matrix right from the start.

The size argument of `np.random.randint()` can actually accept a tuple with one value per axis, like so:

In [554]:

```
# Randomly draw 8x8 samples from a uniform distribution over [0, 10)
generated_matrix = np.random.randint(0, 10, (8,8))
print( generated_matrix )
```

Nb: every time you run the code above, it will generate new sample values!

In machine learning, you'll often want to be able to reproduce your random samples. For example, if you need to close your file and come back to it later, you'll want to draw the same random samples so you can get consistent results.

That's where `np.random.seed()`

comes in.

In [555]:

```
# Set seed for reproducible results
np.random.seed(1337)
# Randomly draw 8x8 samples from a uniform distribution from [0, 10)
generated_matrix = np.random.randint(0, 10, (8,8))
print( generated_matrix )
```

Next, what if we want to randomly select 5 elements from the first column of generated_matrix?

In [556]:

```
# Select first column of generated_matrix
print( generated_matrix[:,0] )
```

Well, if we want to randomly select 5 elements from that column, we can use `np.random.choice()`:

In [557]:

```
# Set seed for reproducible results
np.random.seed(55)
# Randomly select 5 elements from first column of generated_matrix
print( np.random.choice(generated_matrix[:,0], 5) )
```

By default, `np.random.choice()` samples **with replacement**.

- This just means that after each element is selected, it is put back before the next selection, so the same element can be drawn more than once.
- The third argument (`replace`) of `np.random.choice()` tells it whether to sample with replacement.

Example (to ensure no elements are repeated):

In [558]:

```
# Set seed for reproducible results
np.random.seed(55)
# Randomly select 5 elements from first column of generated_matrix
print( np.random.choice(generated_matrix[:,0], 5, replace=False) )
```

We are now ready to select 5 businesses from each group in `id_matrix`.

**Let's write a loop that chooses 5 businesses from each column (group) of id_matrix.**

- Set the random seed to 9001.
- Sample without replacement.

In [559]:

```
# Seed random seed
np.random.seed(9001)
# Print selected businesses from the first group
print( np.random.choice(id_matrix[:,0], 5, replace=False) )
```

In [560]:

```
# Seed random seed
np.random.seed(9001)
# Print selected businesses from each group
for group in range(id_matrix.shape[1]):
print('Group {}: {}'.format( group + 1, np.random.choice( id_matrix[:,group], 5, replace=False ) ) )
```

Awesome, now we're ready to start visiting the businesses... Let's start with Group 1 and the business with ID 7. Let's call this **Casey's Flower Shop**.

[Back to Table of Contents](#TOC)

Let's first import Pandas:

In [561]:

```
import pandas as pd
```

DataFrames are like highly optimized spreadsheets.

One of the most common ways to create a DataFrame is to pass a dictionary into the `pd.DataFrame()`

function, like so:

In [562]:

```
example_dataframe = pd.DataFrame({
'column_1' : [5, 4, 3],
'column_2' : ['a', 'b', 'c']
})
example_dataframe
```

Out[562]:

- One of the most useful ways for us is importing CSV (comma-separated values) files.
- Pandas has a handy `pd.read_csv()` function just for this purpose.
- Pandas can also read from Excel, JSON, SQL, and many other formats.

Let's see an example:

In [563]:

```
# Read the iris dataset from a CSV file in our project_files folder
df = pd.read_csv('project_files/iris.csv')
# Print data type for df
print( type(df) )
```

In [564]:

```
# Display the first 5 rows of the dataframe
df.head()
```

Out[564]:

- Pandas DataFrames always have 2 axes (i.e. like a spreadsheet).
- The first value in the shape tuple is the number of rows, and the second is the number of columns.

In [565]:

```
# Shape of dataframe
print( df.shape )
```

In [566]:

```
# Number of rows (both lines below are equivalent)
print( len(df) )
print( df.shape[0] )
```

In [567]:

```
# Print min value of each column in data frame df
print( df.min() )
```

In [568]:

```
# Print max value of each column in data frame df
print( df.max() )
```

Finally, one very quick way to summarize your data is with the `.describe()` function.

It will display many statistics at once, including:

- mean
- standard deviation
- minimum
- 25th percentile
- median
- 75th percentile
- maximum

of each column:

In [569]:

```
# Display summary statistics for each variable
df.describe()
```

Out[569]:

*In the previous analysis...*

- We discovered that Casey's Flower Shop is the first local business we need to visit in Park Royal.

When we arrive, we find out that the owner, Casey, urgently needs our help!

- They just received a new shipment of Iris flowers, but they've never stocked these flowers before.
- Casey asks us to share what we know about these new flowers.
- Luckily, we have the Iris dataset to learn more about them ourselves!

While 150 observations is not exactly "big data," it's still too large to fit on our screen at once. So let's use another toy problem to practice the concepts.

First, let's create a new DataFrame called `toy_df`. It will contain the first 5 rows plus the last 5 rows from our original Iris dataset.

In [570]:

```
toy_df = pd.concat([df.head(), df.tail()]) # pd.concat concatenates the first 5 and last 5 rows
```

**Next, display toy_df**

In [571]:

```
# Display toy_df
toy_df
```

Out[571]:

**Next, display a summary table for toy_df.**

- It will show the mean, standard deviation, and quartiles for each of the columns

In [572]:

```
# Describe toy_df
toy_df.describe()
```

Out[572]:

- Series are single columns of data from DataFrames. `pandas.core.series.Series` is the official type.
- You can also create a Series directly from a list. For example:

In [573]:

```
integer_series = pd.Series([0, 1, 2, 3, 4])
print( integer_series )
print( type(integer_series) )
```

To select `petal_length` (two ways):

In [574]:

```
# Way 1
print( type(df.petal_length) )
# Way 2
print( type(df['petal_length']) )
# Check that both ways are identical
print( all(df.petal_length == df['petal_length']) ) # all() returns True only if every element is True
```

In [575]:

```
# First 5 values of petal length
print( df.petal_length.head() )
# Minimum petal length
print( 'The minimum petal length is', df.petal_length.min() )
# Maximum petal length
print( 'The maximum petal length is', df.petal_length.max() )
```

For categorical (i.e. non-numeric) variables, it's useful to know which unique values they have.

- In the Iris dataset, the only categorical variable is 'species'.
- Each unique value is also called a **class**.
- By the way, building a model to predict a categorical variable is called **classification**.

To find the unique classes, you can use the `.unique()`

function:

In [576]:

```
# Print unique species
print( df.species.unique() )
```

Series behave very similarly to 1-dimensional arrays from NumPy.

For example, you can directly apply many NumPy math functions to Pandas Series.

In [577]:

```
# Pandas
print( df.petal_length.mean() )
# NumPy
import numpy as np
print( np.mean( df.petal_length ) )
```

In [578]:

```
# Create new petal area feature
df['petal_area'] = df.petal_width * df.petal_length
```

Elementwise operations are very useful in machine learning, especially for feature engineering.

Feature engineering is the process of creating new features (model input variables) from existing ones.

Let's first use our `toy_df` to illustrate the concept.

In the Iris dataset, we have petal width and length, but what if we wanted to know petal area? Well, we can create a new `petal_area` feature.

**First, display the two columns of petal_width and petal_length in toy_df.**

**Tip:** You can also index a DataFrame with a list of column names, like so:

`df[['column_1', 'column_2']]`

In [579]:

```
# Display petal_width and petal_length
toy_df[['petal_width', 'petal_length']]
```

Out[579]:

**Next, create a new petal_area feature in toy_df.**

- We multiply the `petal_width` column by the `petal_length` column.

In [580]:

```
# Create a new petal_area column
toy_df['petal_area'] = toy_df.petal_width * toy_df.petal_length
# Display toy_df
toy_df
```

Out[580]:

By creating a `petal_area` feature, it's now much easier to see that virginica flowers have significantly larger petals than setosa flowers do!

Often, by creating new features, you can learn more about the data (and improve your machine learning models).
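As another illustration of the idea (not part of the Iris workflow above), elementwise division can capture *shape* rather than *size*. The `petal_aspect_ratio` name and the sample values below are made up for the sketch:

```python
import pandas as pd

# Hypothetical toy data standing in for the Iris petal columns
flowers = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width':  [0.2, 1.4, 2.5],
})

# Elementwise division creates a new "how elongated" feature in one line
flowers['petal_aspect_ratio'] = flowers.petal_length / flowers.petal_width
print(flowers)
```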

Boolean masks are list-like sequences of True/False (boolean) values. These allow you to filter DataFrames in enlightening ways.

"List-like" sequences include:

- lists
- NumPy arrays
- Pandas Series

For example, both of these are valid boolean masks:

In [581]:

```
list_mask = [True, True, False, True]
series_mask = pd.Series([True, True, False, True])
```

In [582]:

```
print( series_mask )
```

Often, the more useful way to create a boolean mask is by directly applying a conditional statement to another Pandas Series.

For example, let's say we had the following series:

In [583]:

```
example_series = pd.Series([10, 5, -3, 2])
print( example_series )
```

If we wanted to create a boolean mask for all of the positive values in example_series, we can do so like this:

In [584]:

```
# Create boolean mask from a condition
series_mask = example_series > 0
print( series_mask )
```

Now, we can actually use that boolean mask to filter our Series and keep only the positive observations.

For example:

In [585]:

```
# Keep only True values from the boolean mask
example_series[series_mask]
```

Out[585]:

Or, by using the tilde `~` operator (called the invert operator), we can filter our Series and keep only the non-positive observations.

This is equivalent to keeping only the False values from our boolean mask.

For example:

In [586]:

```
# Keep only False values from the boolean mask
example_series[~series_mask]
```

Out[586]:

In [587]:

```
# Display [i.e., index] observations where petal_area > 14
df[df.petal_area > 14]
```

Out[587]:

Indicator variables are variables that can take on one of two values:

- 1 if a condition is met.
- 0 if a condition is not met.

In machine learning, we want the values to be 1/0 instead of True/False because our algorithms will require numeric inputs.

Fortunately, you can convert a boolean mask into 1/0's using `.astype(int)`:

In [588]:

```
# Example boolean Series
example_mask = pd.Series([True, False, False, True, False])
# Convert boolean Series into 1/0
print( example_mask.astype(int) )
```

Thus, if we want to create an indicator variable for petal_area > 14, we can do so in one line of code:

In [589]:

```
# Create indicator variable for petal_area > 14
df['giant'] = (df.petal_area > 14).astype(int)
df.head()
```

Out[589]:

DataFrames can be indexed by any number of masks, combined with the `&` operator.

Let's say we wanted to see only versicolor and virginica flowers with sepal_width > 3.2.

- We can use the `.isin()` function to check whether species is either versicolor or virginica.
- To keep our code clean, if we have multiple masks, we might assign them to separate variables.
- We'll then combine the masks using the `&` operator.

In [590]:

```
# Versicolor or virginica
species_mask = df.species.isin(['versicolor', 'virginica'])
# Sepal width > 3.2
sepal_width_mask = df.sepal_width > 3.2
# Index with both masks
df[species_mask & sepal_width_mask]
```

Out[590]:

Again, we'll use the `toy_df` to really drive home these concepts.

Let's say we wanted to display observations where `petal_area > 10` and `sepal_width > 3`.

**First, display toy_df again.**

In [591]:

```
# Display toy_df
toy_df
```

Out[591]:

**Create a boolean mask for petal_area > 10.**

In [592]:

```
# Mask for petal_area > 10
petal_area_mask = toy_df.petal_area > 10
# Display petal_area_mask
petal_area_mask
```

Out[592]:

**Next, create a boolean mask for sepal_width > 3.**

In [593]:

```
# Mask for sepal_width > 3
sepal_width_mask = toy_df.sepal_width > 3
# Display sepal_width_mask
sepal_width_mask
```

Out[593]:

**Next, display the two masks combined using the & operator.**

In [594]:

```
# Display both masks, combined
petal_area_mask & sepal_width_mask
```

Out[594]:

**Finally, select the observations from toy_df where both conditions are met.**

In [595]:

```
# Index with both masks
toy_df[petal_area_mask & sepal_width_mask]
```

Out[595]:

The `.groupby()` function allows you to segment and summarize data across different classes.

For example, let's say we wanted to find the average measurements for each of the 3 species of Iris flowers.

- First, take our DataFrame and group by species: `df.groupby('species')`
- Then, simply take the mean of each column: `df.groupby('species').mean()`

Here's how the output looks:

In [596]:

```
# Display average measurements for each species
df.groupby('species').mean()
```

Out[596]:

Finally, what if we wanted to display multiple pieces of information (or **aggregations**)?

Well, we can use the `.agg()` function and pass in a list of aggregations, like so:

In [597]:

```
# Display min, median, max measurements for each species
df.groupby('species').agg(['min', 'median', 'max'])
```

Out[597]:

Now, armed with the power of **groupby**, we're almost ready to return to Casey and share what we've learned about Iris flowers! But before we do, let's bring back our `toy_df` for one last hoorah, just to make sure we know what's going on under the hood.

Let's calculate the median `petal_area` for each species.

- Since `toy_df` is small, we can also do this manually and check that the values are correct.

**First, let's manually calculate the median petal_area for the virginica flowers in our toy_df.**

In [598]:

```
# Display all 'virginica' species; values sorted by petal_area (both work)
# Way 1
toy_df[toy_df.species.isin(['virginica'])].sort_values('petal_area')
# Way 2
toy_df[toy_df.species == 'virginica'].sort_values(by='petal_area')
```

Out[598]:

Based on the output above, what's the median `petal_area` for the virginica species?

**Next, let's manually calculate the median petal_area for the setosa flowers in our toy_df.**

In [599]:

```
# Display all 'setosa' species (both ways work the same)
# Way 1
toy_df[toy_df.species.isin(['setosa'])].sort_values('petal_area')
# Way 2
toy_df[toy_df.species == 'setosa'].sort_values(by='petal_area')
```

Out[599]:

Based on the output above, what's the median `petal_area` for the setosa species?

**Finally, let's calculate the median values using a .groupby().**

- We should get the same result!

In [600]:

```
# Median of each feature in toy_df, grouped by species
toy_df.groupby('species').median()
```

Out[600]:
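As a final sketch (with made-up stand-in data, since the real `toy_df` comes from the CSV above), here's how the manual median check and the `.groupby()` result can be compared programmatically:

```python
import pandas as pd

# Hypothetical miniature stand-in for toy_df
toy = pd.DataFrame({
    'species':    ['setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
    'petal_area': [0.28, 0.30, 11.50, 12.42, 13.80],
})

# Manual median for one group...
manual = toy[toy.species == 'virginica'].petal_area.median()
# ...matches the groupby result for that group
grouped = toy.groupby('species').petal_area.median()
print(manual, grouped['virginica'])
```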