#!/usr/bin/env python # coding: utf-8 # # Python pandas Q&A video series by [Data School](http://www.dataschool.io/) # # ### [YouTube playlist](https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y) and [GitHub repository](https://github.com/justmarkham/pandas-videos) # # ## Table of contents # # 1. What is pandas? # 2. How do I read a tabular data file into pandas? # 3. How do I select a pandas Series from a DataFrame? # 4. Why do some pandas commands end with parentheses (and others don't)? # 5. How do I rename columns in a pandas DataFrame? # 6. How do I remove columns from a pandas DataFrame? # 7. How do I sort a pandas DataFrame or a Series? # 8. How do I filter rows of a pandas DataFrame by column value? # 9. How do I apply multiple filter criteria to a pandas DataFrame? # 10. Your pandas questions answered! # 11. How do I use the "axis" parameter in pandas? # 12. How do I use string methods in pandas? # 13. How do I change the data type of a pandas Series? # 14. When should I use a "groupby" in pandas? # 15. How do I explore a pandas Series? # 16. How do I handle missing values in pandas? # 17. What do I need to know about the pandas index? (Part 1) # 18. What do I need to know about the pandas index? (Part 2) # 19. How do I select multiple rows and columns from a pandas DataFrame? # 20. When should I use the "inplace" parameter in pandas? # 21. How do I make my pandas DataFrame smaller and faster? # 22. How do I use pandas with scikit-learn to create Kaggle submissions? # 23. More of your pandas questions answered! # 24. How do I create dummy variables in pandas? # 25. How do I work with dates and times in pandas? # 26. How do I find and remove duplicate rows in pandas? # 27. How do I avoid a SettingWithCopyWarning in pandas? # 28. How do I change display options in pandas? # 29. How do I create a pandas DataFrame from another object? # 30. How do I apply a function to a pandas Series or DataFrame? # In[1]: # conventional way to import pandas import pandas as pd # ## 1. What is pandas? ([video](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=1)) # # - [pandas main page](http://pandas.pydata.org/) # - [pandas installation instructions](http://pandas.pydata.org/pandas-docs/stable/install.html) # - [Anaconda distribution of Python](https://www.continuum.io/downloads) (includes pandas) # - [How to use the IPython/Jupyter notebook](https://youtu.be/IsXXlYVBt1M?t=5m17s) (video) # ## 2. How do I read a tabular data file into pandas? ([video](https://www.youtube.com/watch?v=5_QXMwezPJE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=2)) # In[2]: # read a dataset of Chipotle orders directly from a URL and store the results in a DataFrame orders = pd.read_table('http://bit.ly/chiporders') # In[3]: # examine the first 5 rows orders.head() # Documentation for [**`read_table`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html) # In[4]: # read a dataset of movie reviewers (modifying the default parameter values for read_table) user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code'] users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols) # In[5]: # examine the first 5 rows users.head() # [Back to top] # ## 3. How do I select a pandas Series from a DataFrame? ([video](https://www.youtube.com/watch?v=zxqjeyKP2Tk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=3)) # In[6]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_table('http://bit.ly/uforeports', sep=',') # In[7]: # read_csv is equivalent to read_table, except it assumes a comma separator ufo = pd.read_csv('http://bit.ly/uforeports') # In[8]: # examine the first 5 rows ufo.head() # In[9]: # select the 'City' Series using bracket notation ufo['City'] # or equivalently, use dot notation ufo.City # **Bracket notation** will always work, whereas **dot notation** has limitations: # # - Dot notation doesn't work if there are **spaces** in the Series name # - Dot notation doesn't work if the Series has the same name as a **DataFrame method or attribute** (like 'head' or 'shape') # - Dot notation can't be used to define the name of a **new Series** (see below) # In[10]: # create a new 'Location' Series (must use bracket notation to define the Series name) ufo['Location'] = ufo.City + ', ' + ufo.State ufo.head() # [Back to top] # ## 4. Why do some pandas commands end with parentheses (and others don't)? ([video](https://www.youtube.com/watch?v=hSrDViyKWVk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=4)) # In[11]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') # **Methods** end with parentheses, while **attributes** don't: # In[12]: # example method: show the first 5 rows movies.head() # In[13]: # example method: calculate summary statistics movies.describe() # In[14]: # example attribute: number of rows and columns movies.shape # In[15]: # example attribute: data type of each column movies.dtypes # In[16]: # use an optional parameter to the describe method to summarize only 'object' columns movies.describe(include=['object']) # Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) # # [Back to top] # ## 5. How do I rename columns in a pandas DataFrame? ([video](https://www.youtube.com/watch?v=0uBirYFhizE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=5)) # In[17]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') # In[18]: # examine the column names ufo.columns # In[19]: # rename two of the columns by using the 'rename' method ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True) ufo.columns # Documentation for [**`rename`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) # In[20]: # replace all of the column names by overwriting the 'columns' attribute ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time'] ufo.columns = ufo_cols ufo.columns # In[21]: # replace the column names during the file reading process by using the 'names' parameter ufo = pd.read_csv('http://bit.ly/uforeports', header=0, names=ufo_cols) ufo.columns # Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) # In[22]: # replace all spaces with underscores in the column names by using the 'str.replace' method ufo.columns = ufo.columns.str.replace(' ', '_') ufo.columns # Documentation for [**`str.replace`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) # # [Back to top] # ## 6. How do I remove columns from a pandas DataFrame? ([video](https://www.youtube.com/watch?v=gnUKkS964WQ&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=6)) # In[23]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.head() # In[24]: # remove a single column (axis=1 refers to columns) ufo.drop('Colors Reported', axis=1, inplace=True) ufo.head() # Documentation for [**`drop`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) # In[25]: # remove multiple columns at once ufo.drop(['City', 'State'], axis=1, inplace=True) ufo.head() # In[26]: # remove multiple rows at once (axis=0 refers to rows) ufo.drop([0, 1], axis=0, inplace=True) ufo.head() # [Back to top] # ## 7. How do I sort a pandas DataFrame or a Series? ([video](https://www.youtube.com/watch?v=zY4doF6xSxY&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=7)) # In[27]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') movies.head() # **Note:** None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary). # In[28]: # sort the 'title' Series in ascending order (returns a Series) movies.title.sort_values().head() # In[29]: # sort in descending order instead movies.title.sort_values(ascending=False).head() # Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html) for a **Series**. (Prior to version 0.17, use [**`order`**](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.order.html) instead.) # In[30]: # sort the entire DataFrame by the 'title' Series (returns a DataFrame) movies.sort_values('title').head() # In[31]: # sort in descending order instead movies.sort_values('title', ascending=False).head() # Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) for a **DataFrame**. (Prior to version 0.17, use [**`sort`**](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sort.html) instead.) # In[32]: # sort the DataFrame first by 'content_rating', then by 'duration' movies.sort_values(['content_rating', 'duration']).head() # [Summary of changes to the sorting API](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#changes-to-sorting-api) in pandas 0.17 # # [Back to top] # ## 8. How do I filter rows of a pandas DataFrame by column value? ([video](https://www.youtube.com/watch?v=2AFGPdNn4FM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=8)) # In[33]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') movies.head() # In[34]: # examine the number of rows and columns movies.shape # **Goal:** Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes. # In[35]: # create a list in which each element refers to a DataFrame row: True if the row satisfies the condition, False otherwise booleans = [] for length in movies.duration: if length >= 200: booleans.append(True) else: booleans.append(False) # In[36]: # confirm that the list has the same length as the DataFrame len(booleans) # In[37]: # examine the first five list elements booleans[0:5] # In[38]: # convert the list to a Series is_long = pd.Series(booleans) is_long.head() # In[39]: # use bracket notation with the boolean Series to tell the DataFrame which rows to display movies[is_long] # In[40]: # simplify the steps above: no need to write a for loop to create 'is_long' since pandas will broadcast the comparison is_long = movies.duration >= 200 movies[is_long] # or equivalently, write it in one line (no need to create the 'is_long' object) movies[movies.duration >= 200] # In[41]: # select the 'genre' Series from the filtered DataFrame movies[movies.duration >= 200].genre # or equivalently, use the 'loc' method movies.loc[movies.duration >= 200, 'genre'] # Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) # # [Back to top] # ## 9. How do I apply multiple filter criteria to a pandas DataFrame? ([video](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9)) # In[42]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') movies.head() # In[43]: # filter the DataFrame to only show movies with a 'duration' of at least 200 minutes movies[movies.duration >= 200] # Understanding **logical operators:** # # - **`and`**: True only if **both sides** of the operator are True # - **`or`**: True if **either side** of the operator is True # In[44]: # demonstration of the 'and' operator print(True and True) print(True and False) print(False and False) # In[45]: # demonstration of the 'or' operator print(True or True) print(True or False) print(False or False) # Rules for specifying **multiple filter criteria** in pandas: # # - use **`&`** instead of **`and`** # - use **`|`** instead of **`or`** # - add **parentheses** around each condition to specify evaluation order # **Goal:** Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama' # In[46]: # CORRECT: use the '&' operator to specify that both conditions are required movies[(movies.duration >=200) & (movies.genre == 'Drama')] # In[47]: # INCORRECT: using the '|' operator would have shown movies that are either long or dramas (or both) movies[(movies.duration >=200) | (movies.genre == 'Drama')].head() # **Goal:** Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action' # In[48]: # use the '|' operator to specify that a row can match any of the three criteria movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10) # or equivalently, use the 'isin' method movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10) # Documentation for [**`isin`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) # # [Back to top] # ## 10. Your pandas questions answered! ([video](https://www.youtube.com/watch?v=B-r9VuK80dk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=10)) # **Question:** When reading from a file, how do I read in only a subset of the columns? # In[49]: # read a dataset of UFO reports into a DataFrame, and check the columns ufo = pd.read_csv('http://bit.ly/uforeports') ufo.columns # In[50]: # specify which columns to include by name ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State']) # or equivalently, specify columns by position ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 4]) ufo.columns # **Question:** When reading from a file, how do I read in only a subset of the rows? # In[51]: # specify how many rows to read ufo = pd.read_csv('http://bit.ly/uforeports', nrows=3) ufo # Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) # **Question:** How do I iterate through a Series? # In[52]: # Series are directly iterable (like a list) for c in ufo.City: print(c) # **Question:** How do I iterate through a DataFrame? # In[53]: # various methods are available to iterate through a DataFrame for index, row in ufo.iterrows(): print(index, row.City, row.State) # Documentation for [**`iterrows`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html) # **Question:** How do I drop all non-numeric columns from a DataFrame? # In[54]: # read a dataset of alcohol consumption into a DataFrame, and check the data types drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.dtypes # In[55]: # only include numeric columns in the DataFrame import numpy as np drinks.select_dtypes(include=[np.number]).dtypes # Documentation for [**`select_dtypes`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) # **Question:** How do I know whether I should pass an argument as a string or a list? # In[56]: # describe all of the numeric columns drinks.describe() # In[57]: # pass the string 'all' to describe all columns drinks.describe(include='all') # In[58]: # pass a list of data types to only describe certain types drinks.describe(include=['object', 'float64']) # In[59]: # pass a list even if you only want to describe a single data type drinks.describe(include=['object']) # Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) # # [Back to top] # ## 11. How do I use the "axis" parameter in pandas? ([video](https://www.youtube.com/watch?v=PtO3t6ynH-8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=11)) # In[60]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[61]: # drop a column (temporarily) drinks.drop('continent', axis=1).head() # Documentation for [**`drop`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) # In[62]: # drop a row (temporarily) drinks.drop(2, axis=0).head() # When **referring to rows or columns** with the axis parameter: # # - **axis 0** refers to rows # - **axis 1** refers to columns # In[63]: # calculate the mean of each numeric column drinks.mean() # or equivalently, specify the axis explicitly drinks.mean(axis=0) # Documentation for [**`mean`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) # In[64]: # calculate the mean of each row drinks.mean(axis=1).head() # When performing a **mathematical operation** with the axis parameter: # # - **axis 0** means the operation should "move down" the row axis # - **axis 1** means the operation should "move across" the column axis # In[65]: # 'index' is an alias for axis 0 drinks.mean(axis='index') # In[66]: # 'columns' is an alias for axis 1 drinks.mean(axis='columns').head() # [Back to top] # ## 12. How do I use string methods in pandas? ([video](https://www.youtube.com/watch?v=bofaC0IckHo&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=12)) # In[67]: # read a dataset of Chipotle orders into a DataFrame orders = pd.read_table('http://bit.ly/chiporders') orders.head() # In[68]: # normal way to access string methods in Python 'hello'.upper() # In[69]: # string methods for pandas Series are accessed via 'str' orders.item_name.str.upper().head() # In[70]: # string method 'contains' checks for a substring and returns a boolean Series orders.item_name.str.contains('Chicken').head() # In[71]: # use the boolean Series to filter the DataFrame orders[orders.item_name.str.contains('Chicken')].head() # In[72]: # string methods can be chained together orders.choice_description.str.replace('[', '').str.replace(']', '').head() # In[73]: # many pandas string methods support regular expressions (regex) orders.choice_description.str.replace('[\[\]]', '').head() # [String handling section](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) of the pandas API reference # # [Back to top] # ## 13. How do I change the data type of a pandas Series? ([video](https://www.youtube.com/watch?v=V0AWyzVMf54&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=13)) # In[74]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[75]: # examine the data type of each Series drinks.dtypes # In[76]: # change the data type of an existing Series drinks['beer_servings'] = drinks.beer_servings.astype(float) drinks.dtypes # Documentation for [**`astype`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html) # In[77]: # alternatively, change the data type of a Series while reading in a file drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'beer_servings':float}) drinks.dtypes # In[78]: # read a dataset of Chipotle orders into a DataFrame orders = pd.read_table('http://bit.ly/chiporders') orders.head() # In[79]: # examine the data type of each Series orders.dtypes # In[80]: # convert a string to a number in order to do math orders.item_price.str.replace('$', '').astype(float).mean() # In[81]: # string method 'contains' checks for a substring and returns a boolean Series orders.item_name.str.contains('Chicken').head() # In[82]: # convert a boolean Series to an integer (False = 0, True = 1) orders.item_name.str.contains('Chicken').astype(int).head() # [Back to top] # ## 14. When should I use a "groupby" in pandas? ([video](https://www.youtube.com/watch?v=qy0fDqoMJx8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=14)) # In[83]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[84]: # calculate the mean beer servings across the entire dataset drinks.beer_servings.mean() # In[85]: # calculate the mean beer servings just for countries in Africa drinks[drinks.continent=='Africa'].beer_servings.mean() # In[86]: # calculate the mean beer servings for each continent drinks.groupby('continent').beer_servings.mean() # Documentation for [**`groupby`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) # In[87]: # other aggregation functions (such as 'max') can also be used with groupby drinks.groupby('continent').beer_servings.max() # In[88]: # multiple aggregation functions can be applied simultaneously drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max']) # Documentation for [**`agg`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) # In[89]: # specifying a column to which the aggregation function should be applied is not required drinks.groupby('continent').mean() # In[90]: # allow plots to appear in the notebook get_ipython().run_line_magic('matplotlib', 'inline') # In[91]: # side-by-side bar plot of the DataFrame directly above drinks.groupby('continent').mean().plot(kind='bar') # Documentation for [**`plot`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) # # [Back to top] # ## 15. How do I explore a pandas Series? ([video](https://www.youtube.com/watch?v=QTVTq8SPzxM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=15)) # In[92]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') movies.head() # In[93]: # examine the data type of each Series movies.dtypes # **Exploring a non-numeric Series:** # In[94]: # count the non-null values, unique values, and frequency of the most common value movies.genre.describe() # Documentation for [**`describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html) # In[95]: # count how many times each value in the Series occurs movies.genre.value_counts() # Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) # In[96]: # display percentages instead of raw counts movies.genre.value_counts(normalize=True) # In[97]: # 'value_counts' (like many pandas methods) outputs a Series type(movies.genre.value_counts()) # In[98]: # thus, you can add another Series method on the end movies.genre.value_counts().head() # In[99]: # display the unique values in the Series movies.genre.unique() # In[100]: # count the number of unique values in the Series movies.genre.nunique() # Documentation for [**`unique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) and [**`nunique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html) # In[101]: # compute a cross-tabulation of two Series pd.crosstab(movies.genre, movies.content_rating) # Documentation for [**`crosstab`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) # **Exploring a numeric Series:** # In[102]: # calculate various summary statistics movies.duration.describe() # In[103]: # many statistics are implemented as Series methods movies.duration.mean() # Documentation for [**`mean`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) # In[104]: # 'value_counts' is primarily useful for categorical data, not numerical data movies.duration.value_counts().head() # In[105]: # allow plots to appear in the notebook get_ipython().run_line_magic('matplotlib', 'inline') # In[106]: # histogram of the 'duration' Series (shows the distribution of a numerical variable) movies.duration.plot(kind='hist') # In[107]: # bar plot of the 'value_counts' for the 'genre' Series movies.genre.value_counts().plot(kind='bar') # Documentation for [**`plot`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html) # # [Back to top] # ## 16. How do I handle missing values in pandas? ([video](https://www.youtube.com/watch?v=fCMrO_VzeL8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=16)) # In[108]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.tail() # **What does "NaN" mean?** # # - "NaN" is not a string, rather it's a special value: **`numpy.nan`**. # - It stands for "Not a Number" and indicates a **missing value**. # - **`read_csv`** detects missing values (by default) when reading the file, and replaces them with this special value. # # Documentation for [**`read_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) # In[109]: # 'isnull' returns a DataFrame of booleans (True if missing, False if not missing) ufo.isnull().tail() # In[110]: # 'nonnull' returns the opposite of 'isnull' (True if not missing, False if missing) ufo.notnull().tail() # Documentation for [**`isnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) and [**`notnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.notnull.html) # In[111]: # count the number of missing values in each Series ufo.isnull().sum() # This calculation works because: # # 1. The **`sum`** method for a DataFrame operates on **`axis=0`** by default (and thus produces column sums). # 2. In order to add boolean values, pandas converts **`True`** to **1** and **`False`** to **0**. # In[112]: # use the 'isnull' Series method to filter the DataFrame rows ufo[ufo.City.isnull()].head() # **How to handle missing values** depends on the dataset as well as the nature of your analysis. Here are some options: # In[113]: # examine the number of rows and columns ufo.shape # In[114]: # if 'any' values are missing in a row, then drop that row ufo.dropna(how='any').shape # Documentation for [**`dropna`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) # In[115]: # 'inplace' parameter for 'dropna' is False by default, thus rows were only dropped temporarily ufo.shape # In[116]: # if 'all' values are missing in a row, then drop that row (none are dropped in this case) ufo.dropna(how='all').shape # In[117]: # if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape # In[118]: # if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape # In[119]: # 'value_counts' does not include missing values by default ufo['Shape Reported'].value_counts().head() # In[120]: # explicitly include missing values ufo['Shape Reported'].value_counts(dropna=False).head() # Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) # In[121]: # fill in missing values with a specified value ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True) # Documentation for [**`fillna`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) # In[122]: # confirm that the missing values were filled in ufo['Shape Reported'].value_counts().head() # [Working with missing data in pandas](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) # # [Back to top] # ## 17. What do I need to know about the pandas index? (Part 1) ([video](https://www.youtube.com/watch?v=OYZNk7Z9s6I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=17)) # In[123]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[124]: # every DataFrame has an index (sometimes called the "row labels") drinks.index # In[125]: # column names are also stored in a special "index" object drinks.columns # In[126]: # neither the index nor the columns are included in the shape drinks.shape # In[127]: # index and columns both default to integers if you don't define them pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head() # **What is the index used for?** # # 1. identification # 2. selection # 3. alignment (covered in the next video) # In[128]: # identification: index remains with each row when filtering the DataFrame drinks[drinks.continent=='South America'] # In[129]: # selection: select a portion of the DataFrame using the index drinks.loc[23, 'beer_servings'] # Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) # In[130]: # set an existing column as the index drinks.set_index('country', inplace=True) drinks.head() # Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) # In[131]: # 'country' is now the index drinks.index # In[132]: # 'country' is no longer a column drinks.columns # In[133]: # 'country' data is no longer part of the DataFrame contents drinks.shape # In[134]: # country name can now be used for selection drinks.loc['Brazil', 'beer_servings'] # In[135]: # index name is optional drinks.index.name = None drinks.head() # In[136]: # restore the index name, and move the index back to a column drinks.index.name = 'country' drinks.reset_index(inplace=True) drinks.head() # Documentation for [**`reset_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) # In[137]: # many DataFrame methods output a DataFrame drinks.describe() # In[138]: # you can interact with any DataFrame using its index and columns drinks.describe().loc['25%', 'beer_servings'] # [Indexing and selecting data](http://pandas.pydata.org/pandas-docs/stable/indexing.html) # # [Back to top] # ## 18. What do I need to know about the pandas index? (Part 2) ([video](https://www.youtube.com/watch?v=15q-is8P_H4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=18)) # In[139]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[140]: # every DataFrame has an index drinks.index # In[141]: # every Series also has an index (which carries over from the DataFrame) drinks.continent.head() # In[142]: # set 'country' as the index drinks.set_index('country', inplace=True) # Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) # In[143]: # Series index is on the left, values are on the right drinks.continent.head() # In[144]: # another example of a Series (output from the 'value_counts' method) drinks.continent.value_counts() # Documentation for [**`value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) # In[145]: # access the Series index drinks.continent.value_counts().index # In[146]: # access the Series values drinks.continent.value_counts().values # In[147]: # elements in a Series can be selected by index (using bracket notation) drinks.continent.value_counts()['Africa'] # In[148]: # any Series can be sorted by its values drinks.continent.value_counts().sort_values() # In[149]: # any Series can also be sorted by its index drinks.continent.value_counts().sort_index() # Documentation for [**`sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html) and [**`sort_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_index.html) # **What is the index used for?** # # 1. identification (covered in the previous video) # 2. selection (covered in the previous video) # 3. alignment # In[150]: # 'beer_servings' Series contains the average annual beer servings per person drinks.beer_servings.head() # In[151]: # create a Series containing the population of two countries people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population') people # Documentation for [**`Series`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) # In[152]: # calculate the total annual beer servings for each country (drinks.beer_servings * people).head() # - The two Series were **aligned** by their indexes. # - If a value is missing in either Series, the result is marked as **NaN**. # - Alignment enables us to easily work with **incomplete data**. # In[153]: # concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index) pd.concat([drinks, people], axis=1).head() # Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) # # [Indexing and selecting data](http://pandas.pydata.org/pandas-docs/stable/indexing.html) # # [Back to top] # ## 19. How do I select multiple rows and columns from a pandas DataFrame? ([video](https://www.youtube.com/watch?v=xvpNA7bC8cs&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=19)) # In[154]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.head(3) # The [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) method is used to select rows and columns by **label**. You can pass it: # # - A single label # - A list of labels # - A slice of labels # - A boolean Series # - A colon (which indicates "all labels") # In[155]: # row 0, all columns ufo.loc[0, :] # In[156]: # rows 0 and 1 and 2, all columns ufo.loc[[0, 1, 2], :] # In[157]: # rows 0 through 2 (inclusive), all columns ufo.loc[0:2, :] # In[158]: # this implies "all columns", but explicitly stating "all columns" is better ufo.loc[0:2] # In[159]: # rows 0 through 2 (inclusive), column 'City' ufo.loc[0:2, 'City'] # In[160]: # rows 0 through 2 (inclusive), columns 'City' and 'State' ufo.loc[0:2, ['City', 'State']] # In[161]: # accomplish the same thing using double brackets - but using 'loc' is preferred since it's more explicit ufo[['City', 'State']].head(3) # In[162]: # rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive) ufo.loc[0:2, 'City':'State'] # In[163]: # accomplish the same thing using 'head' and 'drop' ufo.head(3).drop('Time', axis=1) # In[164]: # rows in which the 'City' is 'Oakland', column 'State' ufo.loc[ufo.City=='Oakland', 'State'] # In[165]: # accomplish the same thing using "chained indexing" - but using 'loc' is preferred since chained indexing can cause problems ufo[ufo.City=='Oakland'].State # The [**`iloc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) method is used to select rows and columns by **integer position**. You can pass it: # # - A single integer position # - A list of integer positions # - A slice of integer positions # - A colon (which indicates "all integer positions") # In[166]: # rows in positions 0 and 1, columns in positions 0 and 3 ufo.iloc[[0, 1], [0, 3]] # In[167]: # rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive) ufo.iloc[0:2, 0:4] # In[168]: # rows in positions 0 through 2 (exclusive), all columns ufo.iloc[0:2, :] # In[169]: # accomplish the same thing - but using 'iloc' is preferred since it's more explicit ufo[0:2] # The [**`ix`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ix.html) method is used to select rows and columns by **label or integer position**, and should only be used when you need to mix label-based and integer-based selection in the same call. # In[170]: # read a dataset of alcohol consumption into a DataFrame and set 'country' as the index drinks = pd.read_csv('http://bit.ly/drinksbycountry', index_col='country') drinks.head() # In[171]: # row with label 'Albania', column in position 0 drinks.ix['Albania', 0] # In[172]: # row in position 1, column with label 'beer_servings' drinks.ix[1, 'beer_servings'] # **Rules for using numbers with `ix`:** # # - If the index is **strings**, numbers are treated as **integer positions**, and thus slices are **exclusive** on the right. # - If the index is **integers**, numbers are treated as **labels**, and thus slices are **inclusive**. # In[173]: # rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive) drinks.ix['Albania':'Andorra', 0:2] # In[174]: # rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive) ufo.ix[0:2, 0:2] # [Summary of the pandas API for selection](https://github.com/pydata/pandas/issues/9595) # # [Back to top] # ## 20. When should I use the "inplace" parameter in pandas? ([video](https://www.youtube.com/watch?v=XaCSdr7pPmY&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=20)) # In[175]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.head() # In[176]: ufo.shape # In[177]: # remove the 'City' column (doesn't affect the DataFrame since inplace=False) ufo.drop('City', axis=1).head() # In[178]: # confirm that the 'City' column was not actually removed ufo.head() # In[179]: # remove the 'City' column (does affect the DataFrame since inplace=True) ufo.drop('City', axis=1, inplace=True) # In[180]: # confirm that the 'City' column was actually removed ufo.head() # In[181]: # drop a row if any value is missing from that row (doesn't affect the DataFrame since inplace=False) ufo.dropna(how='any').shape # In[182]: # confirm that no rows were actually removed ufo.shape # In[183]: # use an assignment statement instead of the 'inplace' parameter ufo = ufo.set_index('Time') ufo.tail() # In[184]: # fill missing values using "backward fill" strategy (doesn't affect the DataFrame since inplace=False) ufo.fillna(method='bfill').tail() # In[185]: # compare with "forward fill" strategy (doesn't affect the DataFrame since inplace=False) ufo.fillna(method='ffill').tail() # [Back to top] # ## 21. How do I make my pandas DataFrame smaller and faster? ([video](https://www.youtube.com/watch?v=wDYDYGyN_cw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=21)) # In[186]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[187]: # exact memory usage is unknown because object columns are references elsewhere drinks.info() # In[188]: # force pandas to calculate the true memory usage drinks.info(memory_usage='deep') # In[189]: # calculate the memory usage for each Series (in bytes) drinks.memory_usage(deep=True) # Documentation for [**`info`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) and [**`memory_usage`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html) # In[190]: # use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers drinks['continent'] = drinks.continent.astype('category') drinks.dtypes # In[191]: # 'continent' Series appears to be unchanged drinks.continent.head() # In[192]: # strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.) drinks.continent.cat.codes.head() # In[193]: # memory usage has been drastically reduced drinks.memory_usage(deep=True) # In[194]: # repeat this process for the 'country' Series drinks['country'] = drinks.country.astype('category') drinks.memory_usage(deep=True) # In[195]: # memory usage increased because we created 193 categories drinks.country.cat.categories # The **category** data type should only be used with a string Series that has a **small number of possible values**. # In[196]: # create a small DataFrame from a dictionary df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']}) df # In[197]: # sort the DataFrame by the 'quality' Series (alphabetical order) df.sort_values('quality') # In[198]: # define a logical ordering for the categories df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True) df.quality # In[199]: # sort the DataFrame by the 'quality' Series (logical order) df.sort_values('quality') # In[200]: # comparison operators work with ordered categories df.loc[df.quality > 'good', :] # [Overview of categorical data in pandas](http://pandas.pydata.org/pandas-docs/stable/categorical.html) # # [API reference for categorical methods](http://pandas.pydata.org/pandas-docs/stable/api.html#categorical) # # [Back to top] # ## 22. How do I use pandas with scikit-learn to create Kaggle submissions? ([video](https://www.youtube.com/watch?v=ylRlGCtAtiE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22)) # In[201]: # read the training dataset from Kaggle's Titanic competition into a DataFrame train = pd.read_csv('http://bit.ly/kaggletrain') train.head() # **Goal:** Predict passenger survival aboard the Titanic based on [passenger attributes](https://www.kaggle.com/c/titanic/data) # # **Video:** [What is machine learning, and how does it work?](https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1) # In[202]: # create a feature matrix 'X' by selecting two DataFrame columns feature_cols = ['Pclass', 'Parch'] X = train.loc[:, feature_cols] X.shape # In[203]: # create a response vector 'y' by selecting a Series y = train.Survived y.shape # **Note:** There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and the proper shapes. # In[204]: # fit a classification model to the training data from sklearn.linear_model import LogisticRegression logreg = LogisticRegression() logreg.fit(X, y) # **Video series:** [Introduction to machine learning with scikit-learn](https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A) # In[205]: # read the testing dataset from Kaggle's Titanic competition into a DataFrame test = pd.read_csv('http://bit.ly/kaggletest') test.head() # In[206]: # create a feature matrix from the testing data that matches the training data X_new = test.loc[:, feature_cols] X_new.shape # In[207]: # use the fitted model to make predictions for the testing set observations new_pred_class = logreg.predict(X_new) # In[208]: # create a DataFrame of passenger IDs and testing set predictions pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head() # Documentation for the [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) constructor # In[209]: # ensure that PassengerID is the first column by setting it as the index pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head() # In[210]: # write the DataFrame to a CSV file that can be submitted to Kaggle pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv') # Documentation for [**`to_csv`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) # In[211]: # save a DataFrame to disk ("pickle it") train.to_pickle('train.pkl') # In[212]: # read a pickled object from disk ("unpickle it") pd.read_pickle('train.pkl').head() # Documentation for [**`to_pickle`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html) and [**`read_pickle`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html) # # [Back to top] # ## 23. More of your pandas questions answered! ([video](https://www.youtube.com/watch?v=oH3wYKvwpJ8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=23)) # **Question:** Could you explain how to read the pandas documentation? # # [pandas API reference](http://pandas.pydata.org/pandas-docs/stable/api.html) # **Question:** What is the difference between **`ufo.isnull()`** and **`pd.isnull(ufo)`**? # In[213]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.head() # In[214]: # use 'isnull' as a top-level function pd.isnull(ufo).head() # In[215]: # equivalent: use 'isnull' as a DataFrame method ufo.isnull().head() # Documentation for [**`isnull`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) # **Question:** Why are DataFrame slices inclusive when using **`.loc`**, but exclusive when using **`.iloc`**? # In[216]: # label-based slicing is inclusive of the start and stop ufo.loc[0:4, :] # In[217]: # position-based slicing is inclusive of the start and exclusive of the stop ufo.iloc[0:4, :] # Documentation for [**`loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) and [**`iloc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) # In[218]: # 'iloc' is simply following NumPy's slicing convention... ufo.values[0:4, :] # In[219]: # ...and NumPy is simply following Python's slicing convention 'python'[0:4] # In[220]: # 'loc' is inclusive of the stopping label because you don't necessarily know what label will come after it ufo.loc[0:4, 'City':'State'] # **Question:** How do I randomly sample rows from a DataFrame? # In[221]: # sample 3 rows from the DataFrame without replacement (new in pandas 0.16.1) ufo.sample(n=3) # Documentation for [**`sample`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) # In[222]: # use the 'random_state' parameter for reproducibility ufo.sample(n=3, random_state=42) # In[223]: # sample 75% of the DataFrame's rows without replacement train = ufo.sample(frac=0.75, random_state=99) # In[224]: # store the remaining 25% of the rows in another DataFrame test = ufo.loc[~ufo.index.isin(train.index), :] # Documentation for [**`isin`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.isin.html) # # [Back to top] # ## 24. How do I create dummy variables in pandas? ([video](https://www.youtube.com/watch?v=0s_1IsROgDc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=24)) # In[225]: # read the training dataset from Kaggle's Titanic competition train = pd.read_csv('http://bit.ly/kaggletrain') train.head() # In[226]: # create the 'Sex_male' dummy variable using the 'map' method train['Sex_male'] = train.Sex.map({'female':0, 'male':1}) train.head() # Documentation for [**`map`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) # In[227]: # alternative: use 'get_dummies' to create one column for every possible value pd.get_dummies(train.Sex).head() # Generally speaking: # # - If you have **"K" possible values** for a categorical feature, you only need **"K-1" dummy variables** to capture all of the information about that feature. # - One convention is to **drop the first dummy variable**, which defines that level as the "baseline". # In[228]: # drop the first dummy variable ('female') using the 'iloc' method pd.get_dummies(train.Sex).iloc[:, 1:].head() # In[229]: # add a prefix to identify the source of the dummy variables pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head() # In[230]: # use 'get_dummies' with a feature that has 3 possible values pd.get_dummies(train.Embarked, prefix='Embarked').head(10) # In[231]: # drop the first dummy variable ('C') pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10) # How to translate these values back to the original 'Embarked' value: # # - **0, 0** means **C** # - **1, 0** means **Q** # - **0, 1** means **S** # In[232]: # save the DataFrame of dummy variables and concatenate them to the original DataFrame embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:] train = pd.concat([train, embarked_dummies], axis=1) train.head() # Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) # In[233]: # reset the DataFrame train = pd.read_csv('http://bit.ly/kaggletrain') train.head() # In[234]: # pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns) pd.get_dummies(train, columns=['Sex', 'Embarked']).head() # In[235]: # use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head() # Documentation for [**`get_dummies`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) # # [Back to top] # ## 25. How do I work with dates and times in pandas? ([video](https://www.youtube.com/watch?v=yCgJGsg0Xa4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=25)) # In[236]: # read a dataset of UFO reports into a DataFrame ufo = pd.read_csv('http://bit.ly/uforeports') ufo.head() # In[237]: # 'Time' is currently stored as a string ufo.dtypes # In[238]: # hour could be accessed using string slicing, but this approach breaks too easily ufo.Time.str.slice(-5, -3).astype(int).head() # In[239]: # convert 'Time' to datetime format ufo['Time'] = pd.to_datetime(ufo.Time) ufo.head() # In[240]: ufo.dtypes # Documentation for [**`to_datetime`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) # In[241]: # convenient Series attributes are now available ufo.Time.dt.hour.head() # In[242]: ufo.Time.dt.weekday_name.head() # In[243]: ufo.Time.dt.dayofyear.head() # API reference for [datetime properties and methods](http://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties) # In[244]: # convert a single string to datetime format (outputs a timestamp object) ts = pd.to_datetime('1/1/1999') ts # In[245]: # compare a datetime Series with a timestamp ufo.loc[ufo.Time >= ts, :].head() # In[246]: # perform mathematical operations with timestamps (outputs a timedelta object) ufo.Time.max() - ufo.Time.min() # In[247]: # timedelta objects also have attributes you can access (ufo.Time.max() - ufo.Time.min()).days # In[248]: # allow plots to appear in the notebook get_ipython().run_line_magic('matplotlib', 'inline') # In[249]: # count the number of UFO reports per year ufo['Year'] = ufo.Time.dt.year ufo.Year.value_counts().sort_index().head() # In[250]: # plot the number of UFO reports per year (line plot is the default) ufo.Year.value_counts().sort_index().plot() # [Back to top] # ## 26. How do I find and remove duplicate rows in pandas? ([video](https://www.youtube.com/watch?v=ht5buXUMqkQ&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=26)) # In[251]: # read a dataset of movie reviewers into a DataFrame user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code'] users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id') users.head() # In[252]: users.shape # In[253]: # detect duplicate zip codes: True if an item is identical to a previous item users.zip_code.duplicated().tail() # In[254]: # count the duplicate items (True becomes 1, False becomes 0) users.zip_code.duplicated().sum() # In[255]: # detect duplicate DataFrame rows: True if an entire row is identical to a previous row users.duplicated().tail() # In[256]: # count the duplicate rows users.duplicated().sum() # Logic for [**`duplicated`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html): # # - **`keep='first'`** (default): Mark duplicates as True except for the first occurrence. # - **`keep='last'`**: Mark duplicates as True except for the last occurrence. # - **`keep=False`**: Mark all duplicates as True. # In[257]: # examine the duplicate rows (ignoring the first occurrence) users.loc[users.duplicated(keep='first'), :] # In[258]: # examine the duplicate rows (ignoring the last occurrence) users.loc[users.duplicated(keep='last'), :] # In[259]: # examine the duplicate rows (including all duplicates) users.loc[users.duplicated(keep=False), :] # In[260]: # drop the duplicate rows (inplace=False by default) users.drop_duplicates(keep='first').shape # In[261]: users.drop_duplicates(keep='last').shape # In[262]: users.drop_duplicates(keep=False).shape # Documentation for [**`drop_duplicates`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) # In[263]: # only consider a subset of columns when identifying duplicates users.duplicated(subset=['age', 'zip_code']).sum() # In[264]: users.drop_duplicates(subset=['age', 'zip_code']).shape # [Back to top] # ## 27. How do I avoid a SettingWithCopyWarning in pandas? ([video](https://www.youtube.com/watch?v=4R4WsDJ-KVc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=27)) # In[265]: # read a dataset of top-rated IMDb movies into a DataFrame movies = pd.read_csv('http://bit.ly/imdbratings') movies.head() # In[266]: # count the missing values in the 'content_rating' Series movies.content_rating.isnull().sum() # In[267]: # examine the DataFrame rows that contain those missing values movies[movies.content_rating.isnull()] # In[268]: # examine the unique values in the 'content_rating' Series movies.content_rating.value_counts() # **Goal:** Mark the 'NOT RATED' values as missing values, represented by 'NaN'. # In[269]: # first, locate the relevant rows movies[movies.content_rating=='NOT RATED'].head() # In[270]: # then, select the 'content_rating' Series from those rows movies[movies.content_rating=='NOT RATED'].content_rating.head() # In[271]: # finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy) import numpy as np movies[movies.content_rating=='NOT RATED'].content_rating = np.nan # **Problem:** That statement involves two operations, a **`__getitem__`** and a **`__setitem__`**. pandas can't guarantee whether the **`__getitem__`** operation returns a view or a copy of the data. # # - If **`__getitem__`** returns a view of the data, **`__setitem__`** will affect the 'movies' DataFrame. # - But if **`__getitem__`** returns a copy of the data, **`__setitem__`** will not affect the 'movies' DataFrame. # In[272]: # the 'content_rating' Series has not changed movies.content_rating.isnull().sum() # **Solution:** Use the **`loc`** method, which replaces the 'NOT RATED' values in a single **`__setitem__`** operation. # In[273]: # replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning) movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan # In[274]: # this time, the 'content_rating' Series has changed movies.content_rating.isnull().sum() # **Summary:** Use the **`loc`** method any time you are selecting rows and columns in the same statement. # # **More information:** [Modern Pandas (Part 1)](http://tomaugspurger.github.io/modern-1.html) # In[275]: # create a DataFrame only containing movies with a high 'star_rating' top_movies = movies.loc[movies.star_rating >= 9, :] top_movies # **Goal:** Fix the 'duration' for 'The Shawshank Redemption'. # In[276]: # overwrite the relevant cell with the correct duration top_movies.loc[0, 'duration'] = 150 # **Problem:** pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'. # In[277]: # 'top_movies' DataFrame has been updated top_movies # In[278]: # 'movies' DataFrame has not been updated movies.head(1) # **Solution:** Any time you are attempting to create a DataFrame copy, use the [**`copy`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html) method. # In[279]: # explicitly create a copy of 'movies' top_movies = movies.loc[movies.star_rating >= 9, :].copy() # In[280]: # pandas now knows that you are updating a copy instead of a view (does not cause a SettingWithCopyWarning) top_movies.loc[0, 'duration'] = 150 # In[281]: # 'top_movies' DataFrame has been updated top_movies # Documentation on indexing and selection: [Returning a view versus a copy](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) # # Stack Overflow: [What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?](http://stackoverflow.com/questions/34884536/what-is-the-point-of-views-in-pandas-if-it-is-undefined-whether-an-indexing-oper) # # [Back to top] # ## 28. How do I change display options in pandas? ([video](https://www.youtube.com/watch?v=yiO43TQ4xvc&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=28)) # In[282]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') # In[283]: # only 60 rows will be displayed when printing drinks # In[284]: # check the current setting for the 'max_rows' option pd.get_option('display.max_rows') # Documentation for [**`get_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_option.html) # In[285]: # overwrite the current setting so that all rows will be displayed pd.set_option('display.max_rows', None) drinks # In[286]: # reset the 'max_rows' option to its default pd.reset_option('display.max_rows') # Documentation for [**`set_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html) and [**`reset_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.reset_option.html) # In[287]: # the 'max_columns' option is similar to 'max_rows' pd.get_option('display.max_columns') # In[288]: # read the training dataset from Kaggle's Titanic competition into a DataFrame train = pd.read_csv('http://bit.ly/kaggletrain') train.head() # In[289]: # an ellipsis is displayed in the 'Name' cell of row 1 because of the 'max_colwidth' option pd.get_option('display.max_colwidth') # In[290]: # overwrite the current setting so that more characters will be displayed pd.set_option('display.max_colwidth', 1000) train.head() # In[291]: # overwrite the 'precision' setting to display 2 digits after the decimal point of 'Fare' pd.set_option('display.precision', 2) train.head() # In[292]: # add two meaningless columns to the drinks DataFrame drinks['x'] = drinks.wine_servings * 1000 drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000 drinks.head() # In[293]: # use a Python format string to specify a comma as the thousands separator pd.set_option('display.float_format', '{:,}'.format) drinks.head() # In[294]: # 'y' was affected (but not 'x') because the 'float_format' option only affects floats (not ints) drinks.dtypes # In[295]: # view the option descriptions (including the default and current values) pd.describe_option() # In[296]: # search for specific options by name pd.describe_option('rows') # Documentation for [**`describe_option`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.describe_option.html) # In[297]: # reset all of the options to their default values pd.reset_option('all') # [Options and Settings](http://pandas.pydata.org/pandas-docs/stable/options.html) # # [Back to top] # ## 29. How do I create a pandas DataFrame from another object? ([video](https://www.youtube.com/watch?v=-Ov1N1_FbP8&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=29)) # In[298]: # create a DataFrame from a dictionary (keys become column names, values become data) pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}) # In[299]: # optionally specify the order of columns and define the index df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c']) df # Documentation for [**`DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) # In[300]: # create a DataFrame from a list of lists (each inner list becomes a row) pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color']) # In[301]: # create a NumPy array (with shape 4 by 2) and fill it with random numbers between 0 and 1 import numpy as np arr = np.random.rand(4, 2) arr # In[302]: # create a DataFrame from the NumPy array pd.DataFrame(arr, columns=['one', 'two']) # In[303]: # create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100) pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}) # Documentation for [**`np.arange`**](http://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) and [**`np.random`**](http://docs.scipy.org/doc/numpy/reference/routines.random.html) # In[304]: # 'set_index' can be chained with the DataFrame constructor to select an index pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student') # Documentation for [**`set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) # In[305]: # create a new Series using the Series constructor s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape') s # Documentation for [**`Series`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) # In[306]: # concatenate the DataFrame and the Series (use axis=1 to concatenate columns) pd.concat([df, s], axis=1) # **Notes:** # # - The Series name became the column name in the DataFrame. # - The Series data was aligned to the DataFrame by its index. # - The 'shape' for row 'a' was marked as a missing value (NaN) because that index was not present in the Series. # # Documentation for [**`concat`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) # # [Back to top] # ## 30. How do I apply a function to a pandas Series or DataFrame? ([video](https://www.youtube.com/watch?v=P_q0tkYqvSk&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=30)) # In[307]: # read the training dataset from Kaggle's Titanic competition into a DataFrame train = pd.read_csv('http://bit.ly/kaggletrain') train.head() # **Goal:** Map the existing values of a Series to a different set of values # # **Method:** [**`map`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) (Series method) # In[308]: # map 'female' to 0 and 'male' to 1 train['Sex_num'] = train.Sex.map({'female':0, 'male':1}) train.loc[0:4, ['Sex', 'Sex_num']] # **Goal:** Apply a function to each element in a Series # # **Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) (Series method) # # **Note:** **`map`** can be substituted for **`apply`** in many cases, but **`apply`** is more flexible and thus is recommended # In[309]: # calculate the length of each string in the 'Name' Series train['Name_length'] = train.Name.apply(len) train.loc[0:4, ['Name', 'Name_length']] # In[310]: # round up each element in the 'Fare' Series to the next integer import numpy as np train['Fare_ceil'] = train.Fare.apply(np.ceil) train.loc[0:4, ['Fare', 'Fare_ceil']] # In[311]: # we want to extract the last name of each person train.Name.head() # In[312]: # use a string method to split the 'Name' Series at commas (returns a Series of lists) train.Name.str.split(',').head() # In[313]: # define a function that returns an element from a list based on position def get_element(my_list, position): return my_list[position] # In[314]: # apply the 'get_element' function and pass 'position' as a keyword argument train.Name.str.split(',').apply(get_element, position=0).head() # In[315]: # alternatively, use a lambda function train.Name.str.split(',').apply(lambda x: x[0]).head() # **Goal:** Apply a function along either axis of a DataFrame # # **Method:** [**`apply`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) (DataFrame method) # In[316]: # read a dataset of alcohol consumption into a DataFrame drinks = pd.read_csv('http://bit.ly/drinksbycountry') drinks.head() # In[317]: # select a subset of the DataFrame to work with drinks.loc[:, 'beer_servings':'wine_servings'].head() # In[318]: # apply the 'max' function along axis 0 to calculate the maximum value in each column drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0) # In[319]: # apply the 'max' function along axis 1 to calculate the maximum value in each row drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head() # In[320]: # use 'np.argmax' to calculate which column has the maximum value for each row drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head() # **Goal:** Apply a function to every element in a DataFrame # # **Method:** [**`applymap`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html) (DataFrame method) # In[321]: # convert every DataFrame element into a float drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head() # In[322]: # overwrite the existing DataFrame columns drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float) drinks.head() # [Back to top]