In [8]:

```
import pandas as pd
```

The two primary components of pandas are the `Series` and the `DataFrame`.

A `Series` is essentially a column, and a `DataFrame` is a multi-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
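A minimal sketch of that overlap, using made-up numbers (`np.nan` marks a missing value):

```python
import numpy as np
import pandas as pd

# A standalone Series and a DataFrame, each with a missing value
ages = pd.Series([22.0, np.nan, 35.0])
table = pd.DataFrame({'ages': [22.0, np.nan, 35.0],
                      'scores': [88.0, 92.0, 75.0]})

# The same methods work on both: fill nulls, compute the mean
filled_series = ages.fillna(0)   # [22.0, 0.0, 35.0]
filled_table = table.fillna(0)   # null in 'ages' replaced with 0
series_mean = ages.mean()        # 28.5 (NaN is ignored)
table_means = table.mean()       # one mean per column
```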

There are *many* ways to create a DataFrame from scratch, but a great option is to just use a simple `dict`.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [9]:

```
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
```

And then pass it to the pandas DataFrame constructor:

In [10]:

```
purchases = pd.DataFrame(data)
purchases
```

Out[10]:

The **Index** of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [11]:

```
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
```

Out[11]:

So now we could **loc**ate a customer's order by using their name:

In [12]:

```
purchases.loc['June']
```

Out[12]:

We can also access columns:

In [13]:

```
purchases['oranges']
```

Out[13]:

With CSV files all you need is a single line to load in the data:

In [14]:

```
df = pd.read_csv('purchases.csv')
df
```

Out[14]:

CSVs don't have indexes like our DataFrames, so all we need to do is just designate the `index_col` when reading:

In [15]:

```
df = pd.read_csv('purchases.csv', index_col=0)
df
```

Out[15]:

Let's load in the IMDB movies dataset to begin:

In [16]:

```
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
```

We're loading this dataset from a CSV and designating the movie titles to be our index.

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with `.head()`:

In [17]:

```
movies_df.head()
```

Out[17]:

`.head()` outputs the **first** five rows of your DataFrame by default, but we could also pass a number: `movies_df.head(10)` would output the top ten rows, for example.

To see the **last** five rows use `.tail()`. `.tail()` also accepts a number, and in this case we print the bottom two rows:

In [18]:

```
movies_df.tail(2)
```

Out[18]:

`.info()` should be one of the very first commands you run after loading your data:

In [19]:

```
movies_df.info()
```

`.info()` provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Notice in our movies dataset we have some obvious missing values in the `Revenue` and `Metascore` columns. We'll look at how to handle those in a bit.

In [20]:

```
movies_df.shape
```

Out[20]:

Note that `.shape` has no parentheses and is a simple tuple of format (rows, columns). So we have **1000 rows** and **11 columns** in our movies DataFrame.

You'll be going to `.shape` a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
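As a sketch of that before-and-after check (on a small made-up DataFrame, not the movies data):

```python
import pandas as pd

df = pd.DataFrame({'rating': [8.1, 6.5, 9.0, 5.2],
                   'votes': [100, 50, 300, 20]})

before = df.shape                   # (4, 2): four rows, two columns
filtered = df[df['rating'] >= 8.0]  # keep only highly rated rows
after = filtered.shape              # (2, 2): two rows survived

print(f"Removed {before[0] - after[0]} rows")  # Removed 2 rows
```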

Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

Here's how to print the column names of our dataset:

In [21]:

```
movies_df.columns
```

Out[21]:

We can use the `.rename()` method to rename certain or all columns via a `dict`. We don't want parentheses, so let's rename those:

In [22]:

```
movies_df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
```

Out[22]:

Excellent. But what if we want to lowercase all names? Instead of using `.rename()` we could also set a list of names to the columns like so:

In [23]:

```
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year',
                     'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
```

Out[23]:

But that's too much work. Instead of just renaming each column manually we can do a list comprehension:

In [24]:

```
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
```

Out[24]:

Using `describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:

In [25]:

```
movies_df.describe()
```

Out[25]:

Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually.

`.describe()` can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and frequency of the top category:

In [26]:

```
movies_df['genre'].describe()
```

Out[26]:

This tells us that the genre column has 207 unique values; the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).

`.value_counts()` can tell us the frequency of all values in a column:

In [27]:

```
movies_df['genre'].value_counts().head(10)
```

Out[27]:

By using the correlation method `.corr()` we can compute the pairwise relationship between each continuous variable:

In [28]:

```
movies_df.corr(numeric_only=True)  # numeric_only skips text columns (required in pandas 2.0+)
```

Out[28]:

Correlation tables are a numerical representation of the bivariate relationships in the dataset.

Positive numbers indicate a positive correlation — one goes up the other goes up — and negative numbers represent an inverse correlation — one goes up the other goes down. 1.0 indicates a perfect correlation.

So looking at the first row, first column we see `rank` has a perfect correlation with itself, which is obvious. On the other hand, the correlation between `votes` and `revenue_millions` is 0.6. A little more interesting.

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots (seen below in the plotting section).

For a deeper look into data summarizations check out Essential Statistics for Data Science.

Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.

You already saw how to extract a column using square brackets like this:

In [37]:

```
genre_col = movies_df['genre']
type(genre_col)
```

Out[37]:

This will return a *Series*. To extract a column as a *DataFrame*, you need to pass a list of column names. In our case that's just a single column:

In [38]:

```
genre_col = movies_df[['genre']]
type(genre_col)
```

Out[38]:

Since it's just a list, adding another column name is easy:

In [39]:

```
subset = movies_df[['genre', 'rating']]
subset.head()
```

Out[39]:

Now we'll look at getting data by rows.

For rows, we have two options:

- `.loc` - **loc**ates by name
- `.iloc` - **loc**ates by numerical **i**ndex

Remember that we are still indexed by movie Title, so to use `.loc` we give it the Title of a movie:

In [40]:

```
prom = movies_df.loc["Prometheus"]
prom
```

Out[40]:

On the other hand, with `.iloc` we give it the numerical index of Prometheus:

In [41]:

```
prom = movies_df.iloc[1]
```

`loc` and `iloc` can be thought of as similar to Python `list` slicing. To show this even further, let's select multiple rows.

How would you do it with a list? In Python, just slice with brackets like `example_list[1:4]`. It works the same way in pandas:

In [42]:

```
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset
```

Out[42]:

One important distinction between using `.loc` and `.iloc` to select multiple rows is that `.loc` includes the movie *Sing* in the result, but when using `.iloc` we're getting rows 1:4 and the movie at index 4 (*Suicide Squad*) is not included.

Slicing with `.iloc` follows the same rules as slicing with lists: the object at the end index is not included.
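A quick sketch of that parallel, using a toy list and an equivalent DataFrame:

```python
import pandas as pd

example_list = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({'letter': example_list})

# Both slices stop *before* index 4
list_slice = example_list[1:4]                 # ['b', 'c', 'd']
frame_slice = df.iloc[1:4]['letter'].tolist()  # ['b', 'c', 'd']
```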

We’ve gone over how to select columns and rows, but what if we want to make a conditional selection?

For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:

In [43]:

```
condition = (movies_df['director'] == "Ridley Scott")
condition.head()
```

Out[43]:

Similar to `isnull()`, this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.
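For comparison, here's a minimal sketch of `isnull()` itself, on a made-up column containing a missing value:

```python
import numpy as np
import pandas as pd

revenue = pd.Series([333.13, np.nan, 126.46], name='revenue_millions')

# isnull() returns a Boolean Series: True where a value is missing
missing = revenue.isnull()  # [False, True, False]
```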

We want to filter out all movies not directed by Ridley Scott; in other words, we don't want the False films. To return the rows where that condition is True we have to pass this operation into the DataFrame:

In [44]:

```
movies_df[movies_df['director'] == "Ridley Scott"].head()
```

Out[44]:

You can get used to looking at these conditionals by reading them like:

Select movies_df where movies_df director equals Ridley Scott

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:

In [45]:

```
movies_df[movies_df['rating'] >= 8.6].head(3)
```

Out[45]:

We can make some richer conditionals by using logical operators `|` for "or" and `&` for "and".

Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

In [46]:

```
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
```

Out[46]:

We need to make sure to group evaluations with parentheses so Python knows how to evaluate the conditional.

Using the `isin()` method we could make this more concise though:

In [47]:

```
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
```

Out[47]:

Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but made below the 25th percentile in revenue.

Here's how we could do all of that:

In [48]:

```
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]
```

Out[48]:

If you recall from when we used `.describe()`, the 25th percentile for revenue was about 17.4, and we can access this value directly by using the `quantile()` method with a float of 0.25.
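A minimal sketch of how `quantile()` behaves, on made-up numbers rather than the real revenue column:

```python
import pandas as pd

revenue = pd.Series([10.0, 20.0, 30.0, 40.0])

# quantile() interpolates linearly between values by default
q25 = revenue.quantile(0.25)    # 17.5
median = revenue.quantile(0.5)  # 25.0, same as revenue.median()
```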

It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on large datasets — is very slow.

An efficient alternative is to `apply()` a function to the dataset. For example, we could use a function to convert movies with a rating of 8.0 or greater to a string value of "good" and the rest to "bad", and use these transformed values to create a new column.

First we would create a function that, when given a rating, determines if it's good or bad:

In [50]:

```
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"
```

Now we want to send the entire rating column through this function, which is what `apply()` does:

In [52]:

```
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(3)
```

Out[52]:

The `.apply()` method passes every value in the `rating` column through the `rating_function` and then returns a new Series. This Series is then assigned to a new column called `rating_category`.

You can also use anonymous functions. This lambda function achieves the same result as `rating_function`:

In [54]:

```
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')
movies_df.head(3)
```

Out[54]:

Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (`pip install matplotlib`):

In [57]:

```
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger
```

Now we can begin. There won't be a lot of coverage on plotting, but it should be enough to explore your data easily.

**Side note:**
For categorical variables utilize Bar Charts and Boxplots. For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots.

Let's plot the relationship between ratings and revenue. All we need to do is call `.plot()` on `movies_df` with some info about how to construct the plot:

In [58]:

```
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');
```

What's with the semicolon? It's not a syntax error, just a way to hide the `<matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18>` output when plotting in Jupyter notebooks.

If we want to plot a simple Histogram based on a single column, we can call plot on a column:

In [59]:

```
movies_df['rating'].plot(kind='hist', title='Rating');
```

Do you remember the `.describe()` example at the beginning of this tutorial? Well, there's a graphical representation of the interquartile range, called the Boxplot. Let's recall what `describe()` gives us on the ratings column:

In [60]:

```
movies_df['rating'].describe()
```

Out[60]:

Using a Boxplot we can visualize this data:

In [61]:

```
movies_df['rating'].plot(kind="box");
```

By combining categorical and continuous data, we can create a Boxplot of revenue that is grouped by the Rating Category we created above:

In [62]:

```
movies_df.boxplot(column='revenue_millions', by='rating_category');
```

That's the general idea of plotting with pandas. There are too many plots to mention, so definitely take a look at the `plot()` docs for more information on what it can do.