In [1]:

```
import numpy as np
import pandas as pd
%matplotlib inline
from ggplot import *
```

In [245]:

```
# Inspect the data: pitches tracked over a two-month stretch of the 2013
# MLB season.
baseball = pd.read_csv('./data/baseball-pitches-clean.csv')
print baseball.shape[0], "pitches were tracked."
baseball.head()
```

Out[245]:

In [4]:

```
baseball.columns
```

Out[4]:

In [7]:

```
# How many pitch types are there?
baseball.pitch_type.unique()
```

Out[7]:

In [8]:

```
baseball.pitch_name.unique()
```

Out[8]:

In [16]:

```
# How many pitchers are in the dataset?
len(baseball.pitcher_name.unique())
```

Out[16]:

In [23]:

```
baseball.describe()[['start_speed', 'end_speed']]
```

Out[23]:

A start speed of 49.4 mph seems very low, so let's investigate further.

In [31]:

```
slowest_pitch = baseball[baseball['start_speed'] == baseball['start_speed'].min(0)]
slowest_pitch.pitcher_name
```

Out[31]:

In [49]:

```
# Compare the slowest pitch against the rest of Zack Wheeler's pitches
zack_wheeler = baseball[baseball['pitcher_name'] == 'Zack Wheeler']
less_than_70 = zack_wheeler[zack_wheeler['start_speed'] < 70]
print 'Number of pitches under 70 mph =', len(less_than_70)
print 'Mean of Zack Wheeler\'s pitch speeds:', round(zack_wheeler['start_speed'].mean(), 2), 'MPH.'
```

OK, so from what we see above, that 49 MPH pitch is definitely an error. There's no way a pitcher who averages 90 MPH is going to throw a 49 MPH pitch.
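
As a rough cross-check on that reasoning, here is a minimal sketch that scores each pitch against the pitcher's own typical speed. It uses a modified z-score based on the median absolute deviation, which is a different technique from the eyeball comparison above, and the speeds are made-up toy values rather than rows from the dataset. The 3.5 cutoff is a commonly used rule of thumb, not something tuned for pitch data.

```python
import pandas as pd

# Hypothetical toy data standing in for one pitcher's start speeds (mph).
speeds = pd.Series([94.1, 95.3, 93.8, 88.2, 86.5, 96.0, 84.7, 90.2, 49.4])

# Median and median absolute deviation (MAD) are barely moved by the outlier itself.
median = speeds.median()
mad = (speeds - median).abs().median()

# Modified z-score; 0.6745 rescales the MAD so it is comparable to a standard deviation.
robust_z = 0.6745 * (speeds - median) / mad

# Flag pitches far below the pitcher's typical speed (3.5 is a rule-of-thumb cutoff).
suspicious = speeds[robust_z < -3.5]
print(suspicious)
```

On this toy series only the 49.4 mph value is flagged, while the slower off-speed pitches in the mid 80s are left alone.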

In [69]:

```
print len(baseball[baseball['start_speed'] < 60]), 'pitches are under 60 mph'
# R.A. Dickey is a knuckleballer, one of the only ones in the entire league
dickey = baseball[baseball['pitcher_name'] == 'R.A. Dickey']
print 'R.A. Dickey has', len(dickey[dickey['start_speed'] < 60]), 'pitches under 60 mph'
```

If Dickey, who's a knuckleballer, isn't throwing anything under 60 MPH, then it's pretty safe to say these pitches under 60 MPH are outliers.

In [75]:

```
over_60 = baseball['start_speed'] >= 60
baseball = baseball[over_60]
```

Now that we've cleaned up the dataset a little, let's start visualizing it.

Before we plot, let's simplify the dataset a bit more.

In [142]:

```
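# Keep only the columns used below; .copy() avoids a SettingWithCopyWarning
# when new columns are added to this frame later on.
baseball = baseball[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type',
                     'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']].copy()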
baseball.head()
```

Out[142]:

In [102]:

```
p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=baseball) + geom_jitter()
p
```

Out[102]:

That's a bit hard to read, so let's do a facet wrap.

In [87]:

```
p = ggplot(aes(x='px', y='pz'), data=baseball) + geom_point(color='blue') + facet_wrap('pitch_name')
p
```

Out[87]:

**Some Observations**

- Knuckleballs look to have the most variance. This isn't that surprising, since a knuckleball is thrown with almost no spin and its movement is notoriously erratic.
- Changeups appear to be located mostly in the bottom half of the zone. This makes intuitive sense: a changeup is meant to look exactly like a fastball out of the hand, but it travels slower, which confuses the hitter.
- Because the changeup is on the same trajectory as a fastball but slower, gravity has more time to act on it, so the pitch ends up lower in the strike zone.
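
These observations come from eyeballing the facets, so here is a minimal sketch of how they could be checked numerically: the standard deviation of `px` and `pz` as a measure of spread, and the mean of `pz` as a measure of height. The `toy` frame and its values are hypothetical stand-ins for the `baseball` DataFrame, and the named-aggregation syntax assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use the `baseball` DataFrame.
toy = pd.DataFrame({
    'pitch_name': ['Fastball', 'Fastball', 'Fastball',
                   'Changeup', 'Changeup', 'Changeup'],
    'px': [-0.5, 0.1, 0.6, -0.2, 0.3, 0.4],
    'pz': [2.9, 3.1, 2.6, 1.8, 2.0, 1.6],
})

# Standard deviation of location measures spread; mean pz measures height.
summary = toy.groupby('pitch_name').agg(
    px_std=('px', 'std'),
    pz_std=('pz', 'std'),
    pz_mean=('pz', 'mean'),
)

# Lowest pitch types first.
print(summary.sort_values('pz_mean'))
```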

OK, so I watch baseball and I've never heard of the Eephus pitch. From the graph it looks really unpredictable, but there's also not much data on it. Let's take a look at the actual counts.

In [89]:

```
baseball['pitch_name'].value_counts()
```

Out[89]:

In [98]:

```
# Show in percentages
baseball['pitch_name'].value_counts() / len(baseball) * 100
```

Out[98]:

There are only 59 Eephus pitches in our entire dataset! Compare that with the 447 knuckleballs, which are a rarity in themselves. So what is an Eephus pitch, then?

In [97]:

```
from IPython.display import YouTubeVideo
YouTubeVideo('uW0V6OsxDBo', 600, 338)
```

Out[97]:

Let's check out the distribution of start speeds for each pitch type.

In [103]:

```
p = ggplot(aes(x='start_speed'), data=baseball) + geom_histogram() + facet_wrap('pitch_name')
p
```

Out[103]:

This rules out my suspicion that the Eephus pitch is similar to the Knuckleball. It's surprising that the Knuckleball distribution is centered in the high 70s; traditionally Knuckleballs are high-60s pitches. This might be due to R.A. Dickey being the dominant Knuckleball user in today's game, since his are known to be faster than most.

In [120]:

```
# Let's see how many of these Dickey throws
knuckles = baseball[baseball['pitch_name'] == 'Knuckleball']
dickey = knuckles[knuckles['pitcher_name'] == 'R.A. Dickey']
# Multiply by 100.0 first so Python 2 integer division does not truncate the ratio
print 'Percentage of Knuckleballs belonging to Dickey:', 100.0 * len(dickey) / len(knuckles)
```

It turns out all the Knuckleballs in our dataset are thrown by R.A. Dickey! That confirms the suspicion about the Knuckleball speeds.
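
The same question can be asked for every pitch type at once. A minimal sketch, assuming hypothetical pitcher names and toy rows rather than the real data, uses `pd.crosstab` with `normalize='index'` so that each row shows how a pitch type is split across pitchers.

```python
import pandas as pd

# Hypothetical toy rows; 'Pitcher A' and 'Pitcher B' are made-up names.
toy = pd.DataFrame({
    'pitch_name': ['Knuckleball', 'Knuckleball', 'Fastball', 'Knuckleball', 'Fastball'],
    'pitcher_name': ['Pitcher A', 'Pitcher A', 'Pitcher A', 'Pitcher B', 'Pitcher B'],
})

# Each row sums to 1: the fraction of that pitch type thrown by each pitcher.
share = pd.crosstab(toy['pitch_name'], toy['pitcher_name'], normalize='index')
print(share)
```

A pitch type owned entirely by one pitcher would show up as a row with a single 1.0 and zeros elsewhere.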

We saw previously that it was pretty difficult to gain much insight into pitch types aside from general differences. This might be more meaningful if we analyzed a specific pitcher. Let's do Yu Darvish.

Darvish is known for having a wide array of pitches at his disposal and is one of the best current pitchers in baseball, so he's a solid choice.

In [129]:

```
# Let's get Darvish's data
darvish = baseball[baseball['pitcher_name'] == 'Yu Darvish']
darvish['pitch_name'].value_counts() / len(darvish) * 100
```

Out[129]:

Darvish's pitch percentages are drastically different from the dataset average; his approach is far more balanced. Over 50% of the pitches in the dataset are fastballs.
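
To put the two pitch mixes side by side instead of comparing two separate outputs, something like the following sketch could work. The frame, the pitcher names and the column labels `pitcher_pct`, `league_pct` and `difference` are all hypothetical; `value_counts(normalize=True)` does the percentage arithmetic.

```python
import pandas as pd

# Hypothetical toy rows; 'Pitcher A' plays the role of the pitcher being studied.
toy = pd.DataFrame({
    'pitcher_name': ['Pitcher A'] * 4 + ['Pitcher B'] * 6,
    'pitch_name': ['Fastball', 'Slider', 'Cut Fastball', 'Curveball',
                   'Fastball', 'Fastball', 'Fastball', 'Fastball', 'Slider', 'Changeup'],
})

# Pitch mix of the whole dataset and of one pitcher, as percentages.
league_mix = toy['pitch_name'].value_counts(normalize=True) * 100
pitcher_mix = (toy.loc[toy['pitcher_name'] == 'Pitcher A', 'pitch_name']
               .value_counts(normalize=True) * 100)

# Align on pitch name; pitches the pitcher never throws become 0.
comparison = pd.concat([pitcher_mix, league_mix], axis=1,
                       keys=['pitcher_pct', 'league_pct']).fillna(0)
comparison['difference'] = comparison['pitcher_pct'] - comparison['league_pct']
print(comparison)
```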

In [145]:

```
p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=darvish) + geom_jitter(alpha=0.3)
p = p + ggtitle('Darvish Pitch Spread') + stat_smooth(method='lm')
p
```

Out[145]:

It looks like Darvish's pitches all land in similar locations. Looking further at the smoothing lines, we can see that the lines for his top 3 pitches (~94% of his pitches) are very similar.

Looking at the data, it's easy to see why Darvish is such a lethal pitcher. In summary, he has a wide array of pitches, and to the hitter they all look pretty much identical.
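
"Very similar" smoothing lines can be made concrete by fitting a straight line of `pz` against `px` for each pitch type and comparing slopes and intercepts. This is a minimal sketch using `np.polyfit` on hypothetical toy points; it approximates what a linear smoother summarises and is not necessarily the exact computation ggplot performs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy points lying exactly on a line for each pitch type.
toy = pd.DataFrame({
    'pitch_name': ['Fastball'] * 4 + ['Slider'] * 4,
    'px': [-1.0, 0.0, 1.0, 2.0, -1.0, 0.0, 1.0, 2.0],
    'pz': [3.0, 2.5, 2.0, 1.5, 2.8, 2.4, 2.0, 1.6],
})

# Least-squares line pz = slope * px + intercept, fitted per pitch type.
fits = {}
for name, group in toy.groupby('pitch_name'):
    slope, intercept = np.polyfit(group['px'], group['pz'], deg=1)
    fits[name] = (slope, intercept)

for name, (slope, intercept) in sorted(fits.items()):
    print('{}: slope={:.2f}, intercept={:.2f}'.format(name, slope, intercept))
```

Pitch types whose slope and intercept are close to each other trace nearly the same line across the zone.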

In [152]:

```
p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=darvish)
p = p + stat_smooth(method='lm', size=5)
p
```

Out[152]:

In [153]:

```
p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=darvish)
p = p + geom_jitter(alpha=0.3)
p
```

Out[153]:

Apart from his slider, there's no drastic change in pitch speeds. Further, if we take a look at his top 3 pitches (fastball, cut fastball and slider), we see that the distribution of pitch speeds stays consistent throughout the entire game.

If a hitter's hope was that Darvish would become weaker over the course of a game, it looks like they're out of luck.
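
A table of mean speed per pitch type and inning is one way to back up what the plots suggest. The sketch below uses hypothetical toy values and only two innings; the `change` series is an illustrative name for the difference between a late and an early inning.

```python
import pandas as pd

# Hypothetical toy rows covering an early and a later inning.
toy = pd.DataFrame({
    'inning': [1, 1, 1, 1, 5, 5, 5, 5],
    'pitch_name': ['Fastball', 'Fastball', 'Slider', 'Slider'] * 2,
    'start_speed': [93.0, 95.0, 82.0, 84.0, 93.5, 94.5, 80.0, 82.0],
})

# Rows are pitch types, columns are innings, cells are mean start speed.
mean_speed = toy.pivot_table(index='pitch_name', columns='inning',
                             values='start_speed', aggfunc='mean')

# Change in mean speed between inning 1 and inning 5 for each pitch type.
change = mean_speed[5] - mean_speed[1]
print(mean_speed)
print(change)
```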

In [159]:

```
baseball['pitcher_name'].value_counts()
```

Out[159]:

In [160]:

```
verlander = baseball[baseball['pitcher_name'] == 'Justin Verlander']
verlander.head()
```

Out[160]:

In [161]:

```
verlander['pitch_name'].value_counts() / len(verlander) * 100
```

Out[161]:

Already we can see Verlander is a drastically different pitcher than Darvish. Fastballs make up 55% of his routine, whereas they made up 36% of Darvish's. Verlander throws his 3 other pitches in roughly equal amounts.

It's interesting to note that 94% of Darvish's routine was made up of fastball, cut fastball and slider. Verlander's 2nd and 3rd pitches are Darvish's 4th and 5th, thrown ~32% of the time versus ~6%.

In [163]:

```
p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=verlander) + geom_jitter(alpha=0.3)
p = p + ggtitle('Verlander Pitch Spread') + stat_smooth(method='lm')
p
```

Out[163]:

Verlander's distribution is more predictable than Darvish's. We see that fastballs end up in the upper portion of the strike zone while the other 3 pitches end up in the lower portion.

The changeup and curveball are similar in terms of their distribution; it would be difficult for a hitter to tell them apart.

All 3 secondary pitches follow the trend that the farther right in the strike zone you go, the lower the pitch is likely to be.

In [164]:

```
p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=verlander)
p = p + stat_smooth(method='lm', size=5)
p
```

Out[164]:

In [165]:

```
p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=verlander)
p = p + geom_jitter(alpha=0.3)
p
```

Out[165]:

Verlander's fastball becomes faster over the course of the game and his changeup slower. We can also see that Verlander isn't as consistent with his pitch speeds as Darvish. He's more consistent during the middle innings.

This makes sense intuitively since in the first couple of innings the pitcher is finding their "groove" and in the latter innings fatigue starts to set in.

I found it weird that Verlander's fastball gets faster over the course of the game, so I decided to compare it to the norm.

In [168]:

```
p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch Speed vs Innings')
p
```

Out[168]:

In [169]:

```
p = ggplot(aes(x='inning', y='start_speed'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch speed vs Innings')
p
```

Out[169]:

Overall this is super weird, at least to me. Shouldn't the speeds get slower as the game progresses?

A problem with the current approach is that it doesn't account for pitching changes: the later innings are often thrown by fresh relievers rather than a tiring starter. Pitch count would probably be a much better way to measure this.

In [186]:

```
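# The first 10 characters of pitch_time are used as the game date.
baseball['date'] = baseball['pitch_time'].str.slice(0,10)
# Running pitch count per pitcher per date: a cumulative sum over a column of ones.
# This assumes the rows are already in chronological order for each pitcher.
baseball['pitch_count'] = 1
baseball['pitch_count'] = baseball.groupby(['pitcher_name', 'date'])['pitch_count'].cumsum()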
```
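
The running count above depends on the rows already being in time order. Here is a minimal sketch of a variant that sorts first and uses `cumcount`, on hypothetical toy rows. The timestamp format is an assumption (the slice of the first 10 characters only suggests that the date comes first), and the pitcher names are made up.

```python
import pandas as pd

# Hypothetical toy rows, deliberately out of chronological order.
toy = pd.DataFrame({
    'pitcher_name': ['Pitcher A', 'Pitcher B', 'Pitcher A', 'Pitcher A', 'Pitcher B'],
    'pitch_time': ['2013-06-01 19:05:40', '2013-06-01 19:20:10',
                   '2013-06-01 19:05:10', '2013-06-02 13:01:00',
                   '2013-06-01 19:20:45'],
})

# Assumed format: the first 10 characters of pitch_time are the date.
toy['date'] = toy['pitch_time'].str.slice(0, 10)

# Sort chronologically so the running count follows the order pitches were thrown.
toy = toy.sort_values('pitch_time').reset_index(drop=True)

# cumcount numbers rows within each group starting at 0, so add 1.
toy['pitch_count'] = toy.groupby(['pitcher_name', 'date']).cumcount() + 1
print(toy[['pitcher_name', 'pitch_time', 'pitch_count']])
```

Grouping by pitcher and date treats one calendar day as one game, which would merge a pitcher's two appearances on the same day.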

Let's try it again with the pitch counts.

In [187]:

```
p = ggplot(aes(x='pitch_count', y='start_speed', color='pitch_name'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch Speed vs Pitch Count')
p
```
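
A straight-line fit can hide a drop-off that only shows up late in an outing. As a complement to the plot, here is a minimal sketch that buckets pitch counts with `pd.cut` and averages the speed in each bucket. The toy values, the bucket edges and the `count_bucket` column name are all hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: pitch count within an outing and the speed of that pitch.
toy = pd.DataFrame({
    'pitch_count': [5, 20, 30, 45, 60, 70, 95, 110],
    'start_speed': [93.0, 94.0, 93.5, 94.5, 93.0, 93.0, 92.0, 91.0],
})

# Buckets of 25 pitches; each interval is open on the left and closed on the right.
bins = [0, 25, 50, 75, 100, 125]
toy['count_bucket'] = pd.cut(toy['pitch_count'], bins=bins)

# Mean start speed within each pitch-count bucket.
bucket_means = toy.groupby('count_bucket', observed=True)['start_speed'].mean()
print(bucket_means)
```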