In this tutorial, we read the csv file and the json file we saved in the previous tutorial.
Then we use pandas [1] to wrangle and manipulate the resulting DataFrame.
Finally, we do some basic data analysis and plotting using matplotlib [2].
%matplotlib inline
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
To load the csv file, we proceed as follows,
#shot_df.to_csv(path_or_buf='test.csv',mode='w')
shot_df = pd.read_csv(filepath_or_buffer='test.csv')
#shot_df = pd.read_csv('test.csv')
To print the first 4 rows of the DataFrame, you can proceed as follows (when there are too many columns to fit, pandas abbreviates the middle ones as ...),
shot_df.head(4)
| | Unnamed: 0 | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | ... | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | ... | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | 1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | ... | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 | 2 | Shot Chart Detail | 21400018 | 53 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 4 | ... | Fadeaway Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 8-16 ft. | 12 | -105 | 63 | 1 | 0 |
3 | 3 | Shot Chart Detail | 21400018 | 77 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 2 | ... | Jump Shot | 3PT Field Goal | Right Corner 3 | Right Side(R) | 24+ ft. | 22 | 227 | -16 | 1 | 0 |
4 rows × 22 columns
Keep in mind that this is a large DataFrame; if you do not use .head(4), pandas prints many more rows (up to the display.max_rows limit).
To force pandas to display all the columns, you can proceed as follows (we only display the first 4 rows),
pd.set_option('display.max_columns', None)
shot_df.head(4)
#shot_df
| | Unnamed: 0 | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | 1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 | 2 | Shot Chart Detail | 21400018 | 53 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 4 | 45 | Missed Shot | Fadeaway Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 8-16 ft. | 12 | -105 | 63 | 1 | 0 |
3 | 3 | Shot Chart Detail | 21400018 | 77 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 2 | 31 | Missed Shot | Jump Shot | 3PT Field Goal | Right Corner 3 | Right Side(R) | 24+ ft. | 22 | 227 | -16 | 1 | 0 |
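If changing the display option for the whole session is not desirable, pandas also offers pd.option_context, which applies a setting only inside a with block. The following is a minimal sketch on a small made-up DataFrame (wide_df is a hypothetical name, not part of the tutorial data).

```python
import pandas as pd

# Hypothetical wide DataFrame: 1 row, 30 columns named c0 ... c29
wide_df = pd.DataFrame([list(range(30))], columns=[f"c{i}" for i in range(30)])

default_max = pd.get_option('display.max_columns')

# Inside the block no column is abbreviated; the previous setting returns afterwards
with pd.option_context('display.max_columns', None):
    inside_max = pd.get_option('display.max_columns')
    full_text = repr(wide_df)

print(inside_max)                                            # None
print(pd.get_option('display.max_columns') == default_max)   # True
```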
To remove a column from a DataFrame we can proceed as follows.
Notice that we want to remove the column named Unnamed: 0; the argument axis=1 tells drop to look for that label among the columns instead of the rows.
shot_df1 = shot_df.drop('Unnamed: 0', axis=1)
To display the information in shot_df1 (we only display the first 2 rows),
shot_df1.head(2)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
As you can see, the column Unnamed: 0 has been removed.
Alternatively, we can use column 0 of the csv file as the row labels of the DataFrame when loading it,
shot_df2 = pd.read_csv(filepath_or_buffer='test.csv',index_col=0)
To display the information in shot_df2 (we only display the first 2 rows),
shot_df2.head(2)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
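The Unnamed: 0 column appears because to_csv writes the row labels by default, under an empty header. The sketch below reproduces this on a small made-up DataFrame (demo_df is a hypothetical stand-in for the shot chart data) and shows two ways to avoid the extra column.

```python
import io
import pandas as pd

# Small stand-in for the shot chart DataFrame (values are made up)
demo_df = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0]})

# By default to_csv also writes the row labels, under an empty header
csv_with_index = demo_df.to_csv()
reread = pd.read_csv(io.StringIO(csv_with_index))
print(list(reread.columns))        # ['Unnamed: 0', 'LOC_X', 'LOC_Y']

# Option 1: tell read_csv that the first column holds the row labels
reread_idx = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the row labels in the first place
reread_noidx = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))
print(list(reread_noidx.columns))  # ['LOC_X', 'LOC_Y']
```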
We can also index and slice a DataFrame.
To display all the columns of rows 0 and 1 of the DataFrame shot_df2 (the end of an iloc slice is excluded),
shot_df2.iloc[0:2,:]
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 21400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
or we can display columns 10 to 12 of the first five rows,
shot_df2.iloc[0:5,10:13]
| | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE |
---|---|---|---|
0 | Missed Shot | Jump Shot | 2PT Field Goal |
1 | Made Shot | Layup Shot | 2PT Field Goal |
2 | Missed Shot | Fadeaway Jump Shot | 2PT Field Goal |
3 | Missed Shot | Jump Shot | 3PT Field Goal |
4 | Missed Shot | Jump Shot | 3PT Field Goal |
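The difference between position-based and label-based slicing is easy to overlook: iloc excludes the end of the slice, while loc includes it. A minimal sketch on a made-up DataFrame (demo_df and its column names are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'a': [10, 11, 12, 13], 'b': [20, 21, 22, 23], 'c': [30, 31, 32, 33]}
)

# iloc is position based and excludes the end: rows 0 and 1, columns 0 and 1
by_position = demo_df.iloc[0:2, 0:2]
print(by_position.shape)   # (2, 2)

# loc is label based and includes both ends: rows 0, 1 and 2, columns 'a' and 'b'
by_label = demo_df.loc[0:2, 'a':'b']
print(by_label.shape)      # (3, 2)
```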
To load a json file using the module json (the hard way),
#import json
#from pprint import pprint
#with open('data_json.json') as data_file:
#    data_json = json.load(data_file)
#pprint(data_json)
#type(data_json)
#for key in data_json:
#    print(key)
#for result_set in data_json['resultSets']:
#    print(result_set)
#data_json['resultSets'][0]['headers']
#data_json['resultSets'][0]['rowSet']
#type(data_json['resultSets'][0])
To load the json file using pandas, we proceed as follows,
#pd_json=pd.read_json(path_or_buf='data_json.json')
pd_json=pd.read_json(path_or_buf='data_json.json',typ='series')
Remember, you can inspect the json file using JSONView, as in the previous tutorial.
To know the type of the json file we just loaded,
type(pd_json)
pandas.core.series.Series
Notice that it is a Series and not a DataFrame. Later on we are going to see how to convert a Series to a DataFrame; for the moment this does not cause any problem.
The next line prints the content of the json file. This is the same information you see when you use JSONView, but now we read it the hard way.
pd_json
parameters    {u'PlayerID': 2544, u'StartPeriod': None, u'St...
resource      shotchartdetail
resultSets    [{u'headers': [u'GRID_TYPE', u'GAME_ID', u'GAM...
dtype: object
These lines print the information inside each block of the json file; they are commented out because they print a lot of information.
#pd_json.parameters
#pd_json.resource
#pd_json.resultSets
At this point, we can create the DataFrame using the json file we imported with pandas (pretty much as in the previous tutorial).
First we need to grab the headers and the shot chart data.
# Grab the headers to be used as column headers for our DataFrame
json_headers = pd_json['resultSets'][0]['headers']
# Grab the shot chart data
json_shots = pd_json['resultSets'][0]['rowSet']
To display the content of json_headers
json_headers
[u'GRID_TYPE', u'GAME_ID', u'GAME_EVENT_ID', u'PLAYER_ID', u'PLAYER_NAME', u'TEAM_ID', u'TEAM_NAME', u'PERIOD', u'MINUTES_REMAINING', u'SECONDS_REMAINING', u'EVENT_TYPE', u'ACTION_TYPE', u'SHOT_TYPE', u'SHOT_ZONE_BASIC', u'SHOT_ZONE_AREA', u'SHOT_ZONE_RANGE', u'SHOT_DISTANCE', u'LOC_X', u'LOC_Y', u'SHOT_ATTEMPTED_FLAG', u'SHOT_MADE_FLAG']
The following line displays the content of json_shots. It is commented out because it prints a lot of information.
#json_shots
To create the DataFrame using json_headers and json_shots,
json_to_pd = pd.DataFrame(json_shots, columns=json_headers)
and to display the first 5 rows of json_to_pd,
json_to_pd.head()
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Shot Chart Detail | 0021400018 | 4 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 11 | 20 | Missed Shot | Jump Shot | 2PT Field Goal | Mid-Range | Right Side Center(RC) | 16-24 ft. | 18 | 114 | 148 | 1 | 0 |
1 | Shot Chart Detail | 0021400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
2 | Shot Chart Detail | 0021400018 | 53 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 4 | 45 | Missed Shot | Fadeaway Jump Shot | 2PT Field Goal | Mid-Range | Left Side(L) | 8-16 ft. | 12 | -105 | 63 | 1 | 0 |
3 | Shot Chart Detail | 0021400018 | 77 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 2 | 31 | Missed Shot | Jump Shot | 3PT Field Goal | Right Corner 3 | Right Side(R) | 24+ ft. | 22 | 227 | -16 | 1 | 0 |
4 | Shot Chart Detail | 0021400018 | 82 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 1 | 51 | Missed Shot | Jump Shot | 3PT Field Goal | Above the Break 3 | Right Side Center(RC) | 24+ ft. | 26 | 91 | 246 | 1 | 0 |
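The same construction also works when the file is parsed with the standard json module instead of pd.read_json. The sketch below uses a small made-up payload that only mimics the resultSets / headers / rowSet layout used above; the values and the reduced set of columns are assumptions for illustration.

```python
import json
import pandas as pd

# Made-up payload that mimics the layout of the json file used above
payload = '''
{"resource": "shotchartdetail",
 "parameters": {"PlayerID": 2544},
 "resultSets": [{"headers": ["GAME_ID", "LOC_X", "LOC_Y", "SHOT_MADE_FLAG"],
                 "rowSet": [["0021400018", 114, 148, 0],
                            ["0021400018", -7, 0, 1]]}]}
'''

data = json.loads(payload)          # plain Python dict
result = data['resultSets'][0]      # first (and only) result set
demo_df = pd.DataFrame(result['rowSet'], columns=result['headers'])

print(demo_df.shape)                # (2, 4)
print(demo_df['GAME_ID'][0])        # 0021400018
```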
To check whether json_to_pd is equal to shot_df1 (we only compare the first 5 rows),
json_to_pd.head() == shot_df1.head()
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | False | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
1 | True | False | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
2 | True | False | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
3 | True | False | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
4 | True | False | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
Notice that the only difference is the column GAME_ID. When loading the csv, pandas parsed GAME_ID as an integer and therefore dropped the leading zeroes, whereas the json file stores it as a string. The rest of the columns are equal.
This can be fixed, but I will leave it to you as an exercise.
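As a hint for the exercise, one possible approach (not necessarily the only one) is to ask read_csv to keep the column as text through its dtype argument. The sketch below uses a tiny in-memory csv with made-up rows rather than test.csv.

```python
import io
import pandas as pd

csv_text = "GAME_ID,LOC_X\n0021400018,114\n0021400018,-7\n"

# Default parsing: GAME_ID becomes an integer and loses its leading zeroes
as_number = pd.read_csv(io.StringIO(csv_text))

# dtype={'GAME_ID': str} keeps the column exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'GAME_ID': str})

print(as_number['GAME_ID'][0])   # 21400018
print(as_text['GAME_ID'][0])     # 0021400018
```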
The DataFrame shot_df1 contains the shot chart data of all the field goal attempts LeBron James took during the 2014-15 regular season.
We are specifically interested in the data saved in the columns LOC_X, LOC_Y and SHOT_MADE_FLAG. The columns LOC_X and LOC_Y contain the coordinates of each shot, measured from the basket rim. The column SHOT_MADE_FLAG contains the outcome of the shot: 0 for missed and 1 for converted.
To extract the converted shots from the DataFrame shot_df1, that is, the rows where shot_df1.SHOT_MADE_FLAG == 1, we proceed as follows (we only display the first 2 rows),
shot_df1[shot_df1.SHOT_MADE_FLAG == 1].head(2)
| | GRID_TYPE | GAME_ID | GAME_EVENT_ID | PLAYER_ID | PLAYER_NAME | TEAM_ID | TEAM_NAME | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | EVENT_TYPE | ACTION_TYPE | SHOT_TYPE | SHOT_ZONE_BASIC | SHOT_ZONE_AREA | SHOT_ZONE_RANGE | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Shot Chart Detail | 21400018 | 33 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 1 | 6 | 30 | Made Shot | Layup Shot | 2PT Field Goal | Restricted Area | Center(C) | Less Than 8 ft. | 0 | -7 | 0 | 1 | 1 |
9 | Shot Chart Detail | 21400018 | 299 | 2544 | LeBron James | 1610612739 | Cleveland Cavaliers | 3 | 6 | 54 | Made Shot | Jump Shot | 3PT Field Goal | Above the Break 3 | Center(C) | 24+ ft. | 25 | 26 | 249 | 1 | 1 |
To save the previous output in the DataFrame converted,
converted = shot_df1[shot_df1.SHOT_MADE_FLAG == 1]
To get the dimensions (rows, columns) of the DataFrame converted,
converted.shape
(624, 21)
To save the missed shots in the DataFrame missed,
missed = shot_df1[shot_df1.SHOT_MADE_FLAG == 0]
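Both selections rely on a boolean mask: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. A minimal sketch, using the LOC_X and SHOT_MADE_FLAG values of the first five rows shown above (demo_df, made_mask and the other names are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({'LOC_X': [114, -7, -105, 227, 91],
                        'SHOT_MADE_FLAG': [0, 1, 0, 0, 0]})

made_mask = demo_df['SHOT_MADE_FLAG'] == 1   # boolean Series, one entry per row
converted_demo = demo_df[made_mask]          # rows where the mask is True
missed_demo = demo_df[~made_mask]            # ~ negates the mask

print(made_mask.tolist())                      # [False, True, False, False, False]
print(len(converted_demo), len(missed_demo))   # 1 4
```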
Let's create a new, empty DataFrame as follows,
new_df = pd.DataFrame()
and let's extract some information from the DataFrame shot_df1 and put it in the DataFrame new_df.
Notice that we can access a column of a DataFrame either as shot_df1.LOC_X or as shot_df1['LOC_X'].
new_df['LOC_X'] = shot_df1.LOC_X
new_df['LOC_Y'] = shot_df1['LOC_Y']
new_df['SHOT_MADE_FLAG'] = shot_df1['SHOT_MADE_FLAG']
new_df['SHOT_TYPE'] = shot_df1['SHOT_TYPE']
To display the information in the DataFrame new_df (we only display the first 5 rows),
#pprint(new_df.head(4))
new_df.head()
| | LOC_X | LOC_Y | SHOT_MADE_FLAG | SHOT_TYPE |
---|---|---|---|---|
0 | 114 | 148 | 0 | 2PT Field Goal |
1 | -7 | 0 | 1 | 2PT Field Goal |
2 | -105 | 63 | 0 | 2PT Field Goal |
3 | 227 | -16 | 0 | 3PT Field Goal |
4 | 91 | 246 | 0 | 3PT Field Goal |
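Copying the columns one by one works, but a list of column names selects several columns in a single step. The following sketch shows this alternative on a small made-up DataFrame (demo_df and subset are hypothetical names).

```python
import pandas as pd

demo_df = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0],
                        'PERIOD': [1, 1], 'SHOT_MADE_FLAG': [0, 1]})

# A list of names selects several columns at once;
# .copy() makes the result independent of the original DataFrame
subset = demo_df[['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']].copy()
subset['LOC_X'] = subset['LOC_X'] * 2   # does not touch demo_df

print(list(subset.columns))        # ['LOC_X', 'LOC_Y', 'SHOT_MADE_FLAG']
print(demo_df['LOC_X'].tolist())   # [114, -7]
```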
We can also filter a DataFrame by comparing a column with a string. From the DataFrame new_df, let's split the shots by SHOT_TYPE as follows,
three_pointer = new_df[new_df.SHOT_TYPE == '3PT Field Goal']
two_pointer = new_df[new_df.SHOT_TYPE == '2PT Field Goal']
To display the information contained in the DataFrames three_pointer and two_pointer (we only display the first 2 rows),
three_pointer.head(2)
| | LOC_X | LOC_Y | SHOT_MADE_FLAG | SHOT_TYPE |
---|---|---|---|---|
3 | 227 | -16 | 0 | 3PT Field Goal |
4 | 91 | 246 | 0 | 3PT Field Goal |
two_pointer.head(2)
| | LOC_X | LOC_Y | SHOT_MADE_FLAG | SHOT_TYPE |
---|---|---|---|---|
0 | 114 | 148 | 0 | 2PT Field Goal |
1 | -7 | 0 | 1 | 2PT Field Goal |
To know the type of two_pointer,
type(two_pointer)
pandas.core.frame.DataFrame
Notice that if we extract the column two_pointer['LOC_X'] and save it in the object tmp_object, as follows,
tmp_object = two_pointer['LOC_X']
The type of the object tmp_object is a pandas Series instead of a DataFrame.
type(tmp_object)
pandas.core.series.Series
To convert tmp_object to a DataFrame, we proceed as follows.
Notice that the option name in to_frame(name='LOC_X') corresponds to the name we want to assign to that column in the DataFrame tmp_object1.
tmp_object1 = two_pointer['LOC_X'].to_frame(name='LOC_X')
which has the type
type(tmp_object1)
pandas.core.frame.DataFrame
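Whether a selection returns a Series or a DataFrame depends on how the column is requested. The sketch below, on a small made-up DataFrame, compares single brackets, a list of names, and to_frame (all variable names are hypothetical).

```python
import pandas as pd

demo_df = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0]})

as_series = demo_df['LOC_X']      # single brackets: Series
as_frame = demo_df[['LOC_X']]     # list of names: DataFrame with one column
via_to_frame = as_series.to_frame(name='LOC_X')

print(type(as_series).__name__)        # Series
print(type(as_frame).__name__)         # DataFrame
print(as_frame.equals(via_to_frame))   # True
```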
To convert a pandas Series or DataFrame to a numpy array, we proceed as follows (to_numpy() replaces the older as_matrix(), which was removed in pandas 1.0),
np_array0 = shot_df1.to_numpy()
np_array1 = two_pointer['LOC_X'].to_numpy()
and both results have the same type,
type(np_array0)
numpy.ndarray
type(np_array1)
numpy.ndarray
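The shape of the resulting array depends on whether the source is a DataFrame or a Series. A minimal sketch on made-up data, using to_numpy() (available in pandas 0.24 and later); demo_df is a hypothetical stand-in:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({'LOC_X': [114, -7], 'LOC_Y': [148, 0]})

arr_2d = demo_df.to_numpy()            # DataFrame -> 2-D array (rows, columns)
arr_1d = demo_df['LOC_X'].to_numpy()   # Series -> 1-D array (rows,)

print(arr_2d.shape, arr_1d.shape)      # (2, 2) (2,)
print(isinstance(arr_2d, np.ndarray))  # True
```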
We can also do mathematical operations between columns (and rows) of a DataFrame, for example (we only display the first 5 rows),
(three_pointer['LOC_X']*100).head()
#(three_pointer*0).head()
3     22700
4      9100
6     12200
9      2600
12   -12700
Name: LOC_X, dtype: int64
((three_pointer['LOC_X']-three_pointer.LOC_Y)/three_pointer.LOC_Y**2).head()
3     0.949219
4    -0.002561
6    -0.002046
9    -0.003597
12   -0.006749
dtype: float64
(three_pointer['LOC_X']/three_pointer['LOC_X'].max()).head()
3     0.941909
4     0.377593
6     0.506224
9     0.107884
12   -0.526971
Name: LOC_X, dtype: float64
To create a new DataFrame with the normalized values of LOC_X and LOC_Y,
df_norm = pd.DataFrame()
c1=three_pointer['LOC_X']/three_pointer['LOC_X'].max()
c2=three_pointer['LOC_Y']/three_pointer['LOC_Y'].max()
df_norm['c1'] = c1
df_norm['c2'] = c2
df_norm.head()
| | c1 | c2 |
---|---|---|
3 | 0.941909 | -0.038278 |
4 | 0.377593 | 0.588517 |
6 | 0.506224 | 0.562201 |
9 | 0.107884 | 0.595694 |
12 | -0.526971 | 0.550239 |
Notice that we only display the first 5 rows.
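The same normalization can also be written for all columns at once, because dividing a DataFrame by a Series aligns on the column names. The sketch below is an alternative formulation on made-up values, not the code used above.

```python
import pandas as pd

demo_df = pd.DataFrame({'LOC_X': [227, 91, -127], 'LOC_Y': [-16, 246, 230]})

# demo_df.max() is a Series indexed by column name, so each column
# is divided by its own maximum
normalized = demo_df / demo_df.max()

print(normalized['LOC_X'].max(), normalized['LOC_Y'].max())   # 1.0 1.0
```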
To generate summary statistics of a DataFrame,
shot_df1.describe()
| | GAME_ID | GAME_EVENT_ID | PLAYER_ID | TEAM_ID | PERIOD | MINUTES_REMAINING | SECONDS_REMAINING | SHOT_DISTANCE | LOC_X | LOC_Y | SHOT_ATTEMPTED_FLAG | SHOT_MADE_FLAG |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1279.000000 | 1279.000000 | 1279 | 1279 | 1279.000000 | 1279.000000 | 1279.000000 | 1279.000000 | 1279.000000 | 1279.000000 | 1279 | 1279.000000 |
mean | 21400602.060985 | 250.136044 | 2544 | 1610612739 | 2.448006 | 5.295543 | 28.170446 | 12.136826 | -13.997654 | 83.936669 | 1 | 0.487881 |
std | 346.471023 | 156.436191 | 0 | 0 | 1.137046 | 3.566744 | 17.278734 | 10.302508 | 104.266921 | 91.996852 | 0 | 0.500049 |
min | 21400018.000000 | 2.000000 | 2544 | 1610612739 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -245.000000 | -30.000000 | 1 | 0.000000 |
25% | 21400281.000000 | 116.000000 | 2544 | 1610612739 | 1.000000 | 2.000000 | 13.000000 | 1.000000 | -94.000000 | 4.000000 | 1 | 0.000000 |
50% | 21400643.000000 | 254.000000 | 2544 | 1610612739 | 2.000000 | 5.000000 | 29.000000 | 12.000000 | -2.000000 | 42.000000 | 1 | 0.000000 |
75% | 21400891.000000 | 379.000000 | 2544 | 1610612739 | 3.000000 | 8.000000 | 42.000000 | 23.000000 | 22.000000 | 164.000000 | 1 | 1.000000 |
max | 21401203.000000 | 628.000000 | 2544 | 1610612739 | 5.000000 | 11.000000 | 59.000000 | 46.000000 | 241.000000 | 418.000000 | 1 | 1.000000 |
Notice that it computes the statistics only of the columns with numerical values.
To count the total number of 3 point shots attempted,
three_pointer.count()
LOC_X             339
LOC_Y             339
SHOT_MADE_FLAG    339
SHOT_TYPE         339
dtype: int64
To count the number of 3 point shots converted, using only the column SHOT_MADE_FLAG,
three_pointer.SHOT_MADE_FLAG[three_pointer.SHOT_MADE_FLAG == 1].count()
120
and to count the number of 3 point shots missed,
three_pointer.SHOT_MADE_FLAG[three_pointer.SHOT_MADE_FLAG == 0].count()
219
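With the counts above, 120 of 339 three point attempts were converted, which is about 35.4%. Because SHOT_MADE_FLAG only takes the values 0 and 1, the same kind of figure can be obtained directly with value_counts and mean. The sketch below uses a small made-up DataFrame, so its numbers are illustrative only.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'SHOT_TYPE': ['2PT Field Goal', '2PT Field Goal', '3PT Field Goal',
                  '3PT Field Goal', '3PT Field Goal'],
    'SHOT_MADE_FLAG': [0, 1, 0, 1, 0],
})

# value_counts tallies missed (0) and converted (1) shots in one call
is_three = demo_df['SHOT_TYPE'] == '3PT Field Goal'
three_counts = demo_df.loc[is_three, 'SHOT_MADE_FLAG'].value_counts()

# The mean of a 0/1 flag is the fraction of converted shots
pct_by_type = demo_df.groupby('SHOT_TYPE')['SHOT_MADE_FLAG'].mean()

print(three_counts.loc[0], three_counts.loc[1])   # 2 1
print(pct_by_type['2PT Field Goal'])              # 0.5
```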
At this point, let's do some basic plotting using matplotlib [2].
The first plot shows the converted and missed shots using the DataFrames we just created, that is, we are going to plot the shot chart.
sns.set_style("white")
sns.set_color_codes()
#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]
plt.figure(figsize=(12,11))
plt.scatter(converted.LOC_X, converted.LOC_Y, color='green',label='converted',s=20,marker='o',alpha=0.5)
plt.scatter(missed.LOC_X, missed.LOC_Y, color='red',label='missed',s=20,marker='o',alpha=0.5)
#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)
plt.legend()
plt.grid()
plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()
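FYI, the values in LOC_X and LOC_Y appear to be in tenths of a foot rather than inches: the first shot in the table, at (114, 148), lies about 187 units from the origin, which matches its SHOT_DISTANCE of 18 ft.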
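A quick way to check the units of LOC_X and LOC_Y is to compare the straight-line distance they imply with SHOT_DISTANCE, assuming the latter is in feet (as the SHOT_ZONE_RANGE labels such as "16-24 ft." suggest). The sketch below uses the first five rows shown in the tables above; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# First five rows of the tables shown above
demo_df = pd.DataFrame({'LOC_X': [114, -7, -105, 227, 91],
                        'LOC_Y': [148, 0, 63, -16, 246],
                        'SHOT_DISTANCE': [18, 0, 12, 22, 26]})

# Straight-line distance from the origin, in the units of LOC_X / LOC_Y
radial = np.hypot(demo_df['LOC_X'], demo_df['LOC_Y'])

# Coordinate units per foot implied by each row (skip the shot at distance 0)
far = demo_df['SHOT_DISTANCE'] > 0
ratio = radial[far] / demo_df.loc[far, 'SHOT_DISTANCE']
print(ratio.round(1).tolist())
```

Every ratio comes out slightly above 10, which is what one would expect if the coordinates are tenths of a foot and SHOT_DISTANCE is rounded down to whole feet.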
Let's do a plot of all the 3 point shots.
sns.set_style("white")
sns.set_color_codes()
#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]
plt.figure(figsize=(12,11))
plt.scatter(three_pointer.LOC_X, three_pointer.LOC_Y, color='green',label='3 point shots',s=20,marker='o',alpha=0.5)
#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)
plt.legend()
plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()
We can also plot all the 3 point shots by converted and missed, as follows,
sns.set_style("white")
sns.set_color_codes()
#shottrue = shot_df1[shot_df.SHOT_MADE_FLAG == 1]
#shotfalse = shot_df1[shot_df.SHOT_MADE_FLAG == 0]
plt.figure(figsize=(12,11))
plt.scatter(three_pointer.LOC_X[three_pointer.SHOT_MADE_FLAG == 1], three_pointer.LOC_Y[three_pointer.SHOT_MADE_FLAG == 1], color='green',label='3 point shots converted',s=20,marker='o',alpha=0.5)
plt.scatter(three_pointer.LOC_X[three_pointer.SHOT_MADE_FLAG == 0], three_pointer.LOC_Y[three_pointer.SHOT_MADE_FLAG == 0], color='red',label='3 point shots missed',s=20,marker='o',alpha=0.5)
#plt.scatter(shottrue.LOC_X, shottrue.LOC_Y)
#plt.scatter(shotfalse.LOC_X, shotfalse.LOC_Y)
plt.legend()
plt.xlim(-300,300)
plt.ylim(-100,500)
#plt.grid()
plt.show()
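The three plots above repeat almost the same code, so the common part can be moved into a small helper. The sketch below is one possible way to do it; plot_shots is a hypothetical function name and demo_df contains made-up shots.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_shots(ax, shots, title):
    """Scatter converted shots in green and missed shots in red on the given axes."""
    styles = {1: ('green', 'converted'), 0: ('red', 'missed')}
    for flag, (color, label) in styles.items():
        subset = shots[shots['SHOT_MADE_FLAG'] == flag]
        ax.scatter(subset['LOC_X'], subset['LOC_Y'], color=color, label=label,
                   s=20, marker='o', alpha=0.5)
    ax.set_xlim(-300, 300)
    ax.set_ylim(-100, 500)
    ax.set_title(title)
    ax.legend()
    return ax

# Made-up shots, only to exercise the helper
demo_df = pd.DataFrame({'LOC_X': [114, -7, -105, 227],
                        'LOC_Y': [148, 0, 63, -16],
                        'SHOT_MADE_FLAG': [0, 1, 0, 1]})

fig, ax = plt.subplots(figsize=(12, 11))
plot_shots(ax, demo_df, 'Demo shot chart')
```

In the notebook, the helper could be called with shot_df1 for the full chart or with three_pointer for the 3 point shots only, followed by plt.show().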
Finally, we can use seaborn [3] to do more advanced plots,
# create our jointplot
# (recent seaborn releases take x and y by keyword, call the size argument
# height, and no longer accept stat_func)
joint_shot_chart = sns.jointplot(x=shot_df1.LOC_X, y=shot_df1.LOC_Y,
                                 kind='scatter', space=0.2, alpha=1,
                                 height=12, edgecolor='w', color='g').set_axis_labels("x location", "y location")
plt.show()
#import sys
#print('Python version:', sys.version_info)
#import IPython
#print('IPython version:', IPython.__version__)
#print('Requests version', requests.__version__)
#print('Pandas version:', pd.__version__)
#print('json version:', json.__version__)
#import matplotlib
#print('matplotlib version:', matplotlib.__version__)
#print('seaborn version:', sns.__version__)