Data Science

Lesson 3

Datasets & DataFrames

Reminders Import Required Libraries Read the User Data Read Movie Ratings Read the Movie Information Get Information About Movie Data Selecting Data Filtering Data Split Apply Combine Passing a Function Pandas Wrap-Up

OVERVIEW

For this boot camp, we will be scraping data from data-sets into data-frames using the **pandas** library. Data scraping is about obtaining data from webpages or other data-sets. There is low level scraping where you parse data out of the HTML or XML code of a webpage. You can also scrape data from the APIs of certain websites as well as pre-compiled datasets.

REMINDERS

>- **Make use of keyboard short-cuts like tab completion** >- **Make use of Google** > - Web resources can help you turn snippets into fully functional code >- **Make variable names meaningful!!!** > - (a = b * c) can be harder to interpret when contextual errors arise > - Using contextually relevant variable names can make it easier for others to jump in and contribute to, or analyze your code when you get stuck >- **Comment your code for clear logic!**

Duplicate the Notebook

1. Launch ‘Jupyter Notebook’ using the shortcut you created in Lesson 1 2. Navigate to your Lesson 3 folder 3. Select & Duplicate the ‘Lesson3-Datasets-and-Dataframes.ipynb’ file - *You will be tinkering – things might break – it’s a backup* 4. Open the new duplicate file & Rename it: **‘scraping-intro’**

NOTE

>- **Answer any questions in bold and blue in the code block(cell) below each section.**

Any statements in italic and green are for consideration and should help guide you to understand the code involved.

IMPORT THE REQUIRED LIBRARIES

- Run the import snippet to ensure there are no errors - Make note of the libraries being imported, and the namespaces used

Pandas Reference Documentation: http://pandas.pydata.org/pandas-docs/stable/

In [1]:

from IPython.display import HTML
import re # regular expressions

## IMPORT NUMPY
import numpy as np

## IMPORT PANDAS
import pandas as pd

READ THE USER DATA

Pandas has a function to read CSV files and turn them into tables
- These tables are called DataFrames
*Alter the 'users.head()' code below to return more than 5 rows of user data at a time.*
What happens if you change the column names in 'u_cols'* ?*
What happens if you change the separator type to sep='\t'* ?*

In [2]:

# Pass in column names for each CSV column
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# Convert data from CSV into a DataFrame called 'users'
users = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.user', 
    sep='|', names=u_cols)

# Change the amount of data returned by the .head() function
# By default it returns the first 5 rows of data in the specified dataframe
users.head()

Out[2]:

	user_id	age	sex	occupation	zip_code
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

READ THE MOVIE RATINGS

Use the code above as a reference for completing this section.

How can you find information regarding a dataset?
- This situation is unique as we have been provided a pre-compiled data-set with a resource for how the data is assembled
- Click on the link below and open it beside your notebook: http://files.grouplens.org/datasets/movielens/ml-100k
Open the 'README' file and observe the 'u.data' information
Open the 'u.data' file and observe the structure of the data
- When understanding how data should be read by 'pandas', it is important to view the data itself as opposed to relying solely on the documentation that may or may not be provided as it can contain visual clues to how it should be interpreted.
How many columns of data are there?
What are the column names?
What are the data-types?

Load the CSV data for movie ratings into a DataFrame called 'ratings'.

The dataset url is provided in the comments of the code block below.

In [1]:

#   Pass in column names for each CSV column
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

#   Convert data from 'http://files.grouplens.org/datasets/movielens/ml-100ku.data' 
#   into a DataFrame called 'ratings'


#   Return the first 5 rows of data from 'ratings'

READ THE MOVIE INFORMATION

Specifying columns can be beneficial when trying to limit the amount of data that has to be processed, or the memory it requires.

This example uses usecols in conjunction with the range() function to select the first 5 columns of data.
- Notice how there are 5 column names in 'm_cols' and a range of 5 columns.
Complete the code below in order to experiment with the output.
Why is their encoding for this data, but not the others?
What happens when you remove, or change it?
What happens when the column range does not match the number of column names provided?

In [4]:

# The movies file contains columns indicating the movie's genres
# let's only load the first 5 columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 
            'video_release_date', 'imdb_url']

# Convert data from CSV into a DataFrame called 'movies'
movies = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.item', 
    sep='|', names=m_cols, usecols=range(5), encoding = "ISO-8859-1")

# Return the first 5 rows of data from 'movies'

GET INFORMATION ABOUT THE MOVIE DATA

The functions being used are from the ‘pandas’ library.

Use the pandas reference provided above to determine how to return the datatypes of each column in the 'movies' DataFrame.
Why are only two columns returned for the ‘.describe( )’ function below?

In [5]:

#   How do you get the datatype of each column in a DataFrame


#   According to the pandas reference; How does this function determine what to return?
#   What are the benefits of these limitations?
print(movies.describe())

          movie_id  video_release_date
count  1682.000000                 0.0
mean    841.500000                 NaN
std     485.695893                 NaN
min       1.000000                 NaN
25%     421.250000                 NaN
50%     841.500000                 NaN
75%    1261.750000                 NaN
max    1682.000000                 NaN

SELECTING DATA

Read over the code to ensure you understand the logic involved
- A DataFrame is a group of Series’ with a shared index
- A single DataFrame column holds a single Series

DataFrame => group of Series with shared index

single DataFrame column => Series

Therefore, each column of data in a DataFrame is considered a Series

Print the data for the first five rows with the columns you want
Use the ‘.iloc[ ]’ function to return and print the user data (for UserID 5) from the 'users' DataFrame you created above.
Why is the completed output no longer formatted like section 4 and prior?
Does the print function have a different output formatting than pandas?

In [33]:

#   These functions are similar but return different data
#   Run both of them individually by commenting out the other and vice versa
users.head()
#users['occupation'].head()

#   This functionality is not limited to one column, and can even be passed by reference
#   Specify the column names that you want to return from the users dataset
#columns_you_want = ['column_name1', 'column_name2']

#   Print the data for the first five rows with the columns you want from the users dataset
#print(users[columns_you_want].head())

#   Print the data of a specific row in the coumns_you_want using the iloc[] function

Out[33]:

	user_id	age	sex	occupation	zip_code
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

FILTERING DATA

Selecting users older than 25 from the user data
- Creating a new sub-DataFrame or sub-frame with that data
Notice how the column name can be referenced via the data-set name
- This is only possible after it has been processed into a DataFrame via pandas
Create a new sub-frame called 'youthStudents' and select only students under the age of 18

In [39]:

#   Select users older than 25
oldUsers = users[users.age > 25]
oldUsers.head()

#   Select the first 10 users under the age of 18 who are also students

Out[39]:

	user_id	age	sex	occupation	zip_code
1	2	53	F	other	94043
4	5	33	F	other	15213
5	6	42	M	executive	98101
6	7	57	M	administrator	91344
7	8	36	M	administrator	05201

OVERVIEW QUIZ 1

Using the 'users' DataFrame you created earlier in this lesson

Show the first 3 users aged 40 and Male
Show the mean age of all Female Programmers

Part 1¶

In [36]:

#   Show the first 3 users who are age 40 AND male

Part 2¶

In [37]:

#   Show users who are female and programmers

#   Show statistical summary which includes the mean

#   Display as the mean age of all female programmers

SPLIT-APPLY-COMBINE

Splitting the data into groups based on some specified criteria
Applying a function to each group of criteria independently of each other
Combining the results into a data structure

Find Diligent Users¶

Split data per user ID
- Split data into groups using the groupby() function to select a series or column from the dataset
Apply the Count function on ratings
- Use the .count() function on the newly grouped data and combine it into a new sub-frame called 'ratings_per_user'
Combine the result
- Split data into groups using the groupby() function to select a series or column from the dataset

In [57]:

#   Split data into groups
#Incomplete Example: grouped_data = movies.groupby()

#   Use the .count() function

#   Display the first 5 rows of data

OVERVIEW QUIZ 2

Get the Average Rating for the first 5 Movie ID's
Get the Movie Titles with the highest average rating
Get the number of ratings per Movie

Part 1¶

In [58]:

#   Split data into groups

#   Average and Combine

#  Display the new sub-frame
print("Average Ratings:")

Average ratings:
movie_id
1    3.878319
2    3.206107
3    3.033333
4    3.550239
5    3.302326
Name: rating, dtype: float64

Part 2¶

In [64]:

#   Get the maximum rating in the average ratings
# maximum_rating = ?

#   Index the movies in the average ratings that are equal to the maximum rating determined above
#   Store those movie id's in a new dataframe


#   Display the good movie IDs
print("Good Movies - IDs:")


#   Display the titles of the good movie IDs if they are in the movies DataFrame
#   Use the .isin() function to determine what titles should be printed
print("Good Movies - Titles")

Good Movies - IDs:
Int64Index([814, 1122, 1189, 1201, 1293, 1467, 1500, 1536, 1599, 1653], dtype='int64', name='movie_id')
Good Movies - Titles
813                         Great Day in Harlem, A (1994)
1121                       They Made Me a Criminal (1939)
1188                                   Prefontaine (1997)
1200           Marlene Dietrich: Shadow and Light (1996) 
1292                                      Star Kid (1997)
1466                 Saint of Fort Washington, The (1993)
1499                            Santa with Muscles (1996)
1535                                 Aiqing wansui (1994)
1598                        Someone Else's America (1995)
1652    Entertaining Angels: The Dorothy Day Story (1996)
Name: title, dtype: object

Part 3¶

In [65]:

#   Count the number of ratings per movie ID
#   Store them in a new sub-frame called 'how_many_ratings'
#   Reference Part 1 of the Quiz for the sub-frame you need to count


#   Display the number of ratings per movie that are equal to the maximum rating
print("Number of Ratings per Movie:")

Number of ratings per movie
movie_id
814     1
1122    1
1189    3
1201    1
1293    3
1467    2
1500    2
1536    1
1599    1
1653    1
Name: rating, dtype: int64

PASSING A FUNCTION

Running the lambda function means for every iteration of an individual user ID (the column identifier in the dataset) there is a calculated mean/average added to the new DataFrame.

Complete and experiment with the code below.

Notice how lambda is calling 'f:' and not ':f', what does this mean? Why is it important?

In [88]:

#   Get the average rating of each user ID
# average_ratings = grouped_data.apply(lambda f: f.mean())

#   Display the first 5 rows of average ratings by user ID

Out[88]:

user_id
1    3.610294
2    3.709677
3    2.796296
4    4.333333
5    2.874286
Name: rating, dtype: float64

OVERVIEW QUIZ 3

Get the average rating per user
Get a list of all occupations and if they are male or female dominant
Get the total number of male and female users

Part 1¶

In [89]:

#   Get the average rating per user

#   Get the movie ratings and group them by user ID
#   Create a new sub-frame called 'grouped_data' to store it in

#   Get the mean of the grouped_data
#   Create a new sub-frame called 'average_ratings' to store it in

#   Get the first 10 rows of average ratings sub-frame

#   Get and Display the number of ratings per movie

Part 2¶

In [4]:

#   List all occupations and if they are male or female dominant

#   Get the gender and group by occupation
#   Create a new sub-frame called 'grouped_data' to store it in

#   Get a boolean value to represent male dominated occupations
#   Running the lambda function in this situation means
#   for every occupation with more males than females
#   a list with each occupation and a boolean value stating true or false is added to the new DataFrame


#   Display the sub-frame

Part 3¶

In [92]:

#   Get the total number of both male and female users

#   Use the sum() function to count the number of Male users
#   in the users dataset based on their gender
print('Number of Male Users: ')


#   Use the sum() function to count the number of Female users
#   in the users dataset based on their gender
print('Number of Female Users: ')

Number of Male Users: 
Number of Female Users:

PANDAS WRAP-UP

- Test your understanding of the content by putting it to practice! - ***Do your best to implement your own versions of the following using the different data-sets and examples provided above:***

In [25]:

# Create a data frame

# Get a sub-frame

# Filter data

# Use group-by

# Apply a user defined function

In [26]:

# Below is a list of 'Line/Cell Magic' keywords that can be used in
# conjunction with your Python code.

# It is NOT necessary to use line magic for this bootcamp, however it is
# a good tool to familiarize yourself with for future programming, and
# interpreting examples you may encounter researching your new found knowledge.

# Enables AutoMagic in your Jupyter Notebook
%lsmagic

# Line/Cell Magic Keywords - Built in magical functions

Out[26]:

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

In [ ]: