- Any statements in italic and green are for consideration and should help guide you to understand the code involved.
Pandas Reference Documentation: http://pandas.pydata.org/pandas-docs/stable/
from IPython.display import HTML
import re # regular expressions
## IMPORT NUMPY
import numpy as np
## IMPORT PANDAS
import pandas as pd
Pandas has a function to read CSV files and turn them into tables
*Alter the 'users.head()' code below to return more than 5 rows of user data at a time.*
What happens if you change the column names in 'u_cols'* ?*
What happens if you change the separator type to sep='\t'* ?*
# Pass in column names for each CSV column
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
# Convert data from CSV into a DataFrame called 'users'
users = pd.read_csv(
'http://files.grouplens.org/datasets/movielens/ml-100k/u.user',
sep='|', names=u_cols)
# Change the amount of data returned by the .head() function
# By default it returns the first 5 rows of data in the specified dataframe
users.head()
user_id | age | sex | occupation | zip_code | |
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
Use the code above as a reference for completing this section.
How can you find information regarding a dataset?
Open the 'README' file and observe the 'u.data' information
Open the 'u.data' file and observe the structure of the data
How many columns of data are there?
What are the column names?
What are the data-types?
Load the CSV data for movie ratings into a DataFrame called 'ratings'.
# Pass in column names for each CSV column
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
# Convert data from 'http://files.grouplens.org/datasets/movielens/ml-100ku.data'
# into a DataFrame called 'ratings'
# Return the first 5 rows of data from 'ratings'
Specifying columns can be beneficial when trying to limit the amount of data that has to be processed, or the memory it requires.
This example uses usecols in conjunction with the range() function to select the first 5 columns of data.
Complete the code below in order to experiment with the output.
Why is their encoding for this data, but not the others?
What happens when you remove, or change it?
What happens when the column range does not match the number of column names provided?
# The movies file contains columns indicating the movie's genres
# let's only load the first 5 columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date',
'video_release_date', 'imdb_url']
# Convert data from CSV into a DataFrame called 'movies'
movies = pd.read_csv(
'http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
sep='|', names=m_cols, usecols=range(5), encoding = "ISO-8859-1")
# Return the first 5 rows of data from 'movies'
The functions being used are from the ‘pandas’ library.
# How do you get the datatype of each column in a DataFrame
# According to the pandas reference; How does this function determine what to return?
# What are the benefits of these limitations?
print(movies.describe())
movie_id video_release_date count 1682.000000 0.0 mean 841.500000 NaN std 485.695893 NaN min 1.000000 NaN 25% 421.250000 NaN 50% 841.500000 NaN 75% 1261.750000 NaN max 1682.000000 NaN
DataFrame => group of Series with shared index
single DataFrame column => Series
Therefore, each column of data in a DataFrame is considered a Series
Print the data for the first five rows with the columns you want
Use the ‘.iloc[ ]’ function to return and print the user data (for UserID 5) from the 'users' DataFrame you created above.
Why is the completed output no longer formatted like section 4 and prior?
Does the print function have a different output formatting than pandas?
# These functions are similar but return different data
# Run both of them individually by commenting out the other and vice versa
users.head()
#users['occupation'].head()
# This functionality is not limited to one column, and can even be passed by reference
# Specify the column names that you want to return from the users dataset
#columns_you_want = ['column_name1', 'column_name2']
# Print the data for the first five rows with the columns you want from the users dataset
#print(users[columns_you_want].head())
# Print the data of a specific row in the coumns_you_want using the iloc[] function
user_id | age | sex | occupation | zip_code | |
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
Selecting users older than 25 from the user data
Notice how the column name can be referenced via the data-set name
Create a new sub-frame called 'youthStudents' and select only students under the age of 18
# Select users older than 25
oldUsers = users[users.age > 25]
oldUsers.head()
# Select the first 10 users under the age of 18 who are also students
user_id | age | sex | occupation | zip_code | |
---|---|---|---|---|---|
1 | 2 | 53 | F | other | 94043 |
4 | 5 | 33 | F | other | 15213 |
5 | 6 | 42 | M | executive | 98101 |
6 | 7 | 57 | M | administrator | 91344 |
7 | 8 | 36 | M | administrator | 05201 |
Using the 'users' DataFrame you created earlier in this lesson
# Show the first 3 users who are age 40 AND male
# Show users who are female and programmers
# Show statistical summary which includes the mean
# Display as the mean age of all female programmers
# Split data into groups
#Incomplete Example: grouped_data = movies.groupby()
# Use the .count() function
# Display the first 5 rows of data
# Split data into groups
# Average and Combine
# Display the new sub-frame
print("Average Ratings:")
Average ratings: movie_id 1 3.878319 2 3.206107 3 3.033333 4 3.550239 5 3.302326 Name: rating, dtype: float64
# Get the maximum rating in the average ratings
# maximum_rating = ?
# Index the movies in the average ratings that are equal to the maximum rating determined above
# Store those movie id's in a new dataframe
# Display the good movie IDs
print("Good Movies - IDs:")
# Display the titles of the good movie IDs if they are in the movies DataFrame
# Use the .isin() function to determine what titles should be printed
print("Good Movies - Titles")
Good Movies - IDs: Int64Index([814, 1122, 1189, 1201, 1293, 1467, 1500, 1536, 1599, 1653], dtype='int64', name='movie_id') Good Movies - Titles 813 Great Day in Harlem, A (1994) 1121 They Made Me a Criminal (1939) 1188 Prefontaine (1997) 1200 Marlene Dietrich: Shadow and Light (1996) 1292 Star Kid (1997) 1466 Saint of Fort Washington, The (1993) 1499 Santa with Muscles (1996) 1535 Aiqing wansui (1994) 1598 Someone Else's America (1995) 1652 Entertaining Angels: The Dorothy Day Story (1996) Name: title, dtype: object
# Count the number of ratings per movie ID
# Store them in a new sub-frame called 'how_many_ratings'
# Reference Part 1 of the Quiz for the sub-frame you need to count
# Display the number of ratings per movie that are equal to the maximum rating
print("Number of Ratings per Movie:")
Number of ratings per movie movie_id 814 1 1122 1 1189 3 1201 1 1293 3 1467 2 1500 2 1536 1 1599 1 1653 1 Name: rating, dtype: int64
Running the lambda function means for every iteration of an individual user ID (the column identifier in the dataset) there is a calculated mean/average added to the new DataFrame.
Complete and experiment with the code below.
Notice how lambda is calling 'f:' and not ':f', what does this mean? Why is it important?
# Get the average rating of each user ID
# average_ratings = grouped_data.apply(lambda f: f.mean())
# Display the first 5 rows of average ratings by user ID
user_id 1 3.610294 2 3.709677 3 2.796296 4 4.333333 5 2.874286 Name: rating, dtype: float64
# Get the average rating per user
# Get the movie ratings and group them by user ID
# Create a new sub-frame called 'grouped_data' to store it in
# Get the mean of the grouped_data
# Create a new sub-frame called 'average_ratings' to store it in
# Get the first 10 rows of average ratings sub-frame
# Get and Display the number of ratings per movie
# List all occupations and if they are male or female dominant
# Get the gender and group by occupation
# Create a new sub-frame called 'grouped_data' to store it in
# Get a boolean value to represent male dominated occupations
# Running the lambda function in this situation means
# for every occupation with more males than females
# a list with each occupation and a boolean value stating true or false is added to the new DataFrame
# Display the sub-frame
# Get the total number of both male and female users
# Use the sum() function to count the number of Male users
# in the users dataset based on their gender
print('Number of Male Users: ')
# Use the sum() function to count the number of Female users
# in the users dataset based on their gender
print('Number of Female Users: ')
Number of Male Users: Number of Female Users:
# Create a data frame
# Get a sub-frame
# Filter data
# Use group-by
# Apply a user defined function
# Below is a list of 'Line/Cell Magic' keywords that can be used in
# conjunction with your Python code.
# It is NOT necessary to use line magic for this bootcamp, however it is
# a good tool to familiarize yourself with for future programming, and
# interpreting examples you may encounter researching your new found knowledge.
# Enables AutoMagic in your Jupyter Notebook
%lsmagic
# Line/Cell Magic Keywords - Built in magical functions
Available line magics: %alias %alias_magic %autocall %automagic %autosave %bookmark %cd %clear %cls %colors %config %connect_info %copy %ddir %debug %dhist %dirs %doctest_mode %echo %ed %edit %env %gui %hist %history %killbgscripts %ldir %less %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls %lsmagic %macro %magic %matplotlib %mkdir %more %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %popd %pprint %precision %profile %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %ren %rep %rerun %reset %reset_selective %rmdir %run %save %sc %set_env %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode Available cell magics: %%! %%HTML %%SVG %%bash %%capture %%cmd %%debug %%file %%html %%javascript %%js %%latex %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %%script %%sh %%svg %%sx %%system %%time %%timeit %%writefile Automagic is ON, % prefix IS NOT needed for line magics.