import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
The raw_data
used here is downloaded from kaggle(site:https://www.kaggle.com/rounakbanik/pokemon), which contains information on 801 Pokémons from seven generations. (The latest edition is now in eighth generation.) However, the full list of the first seven generations should contain 809 Pokémons, so the missing parts was sorted out from https://wiki.52poke.com/wiki/%E4%B8%BB%E9%A1%B5. The dataset provides 41 indicators including name, base stats, type, performance against other types and some other characteristics, but only some key indicators will be analysed here.
raw_data = pd.read_csv('Pokemon.csv')
raw_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 809 entries, 0 to 808 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 abilities 809 non-null object 1 against_bug 809 non-null float64 2 against_dark 809 non-null float64 3 against_dragon 809 non-null float64 4 against_electric 809 non-null float64 5 against_fairy 809 non-null float64 6 against_fighting 809 non-null float64 7 against_fire 809 non-null float64 8 against_flying 809 non-null float64 9 against_ghost 809 non-null float64 10 against_grass 809 non-null float64 11 against_ground 809 non-null float64 12 against_ice 809 non-null float64 13 against_normal 809 non-null float64 14 against_poison 809 non-null float64 15 against_psychic 809 non-null float64 16 against_rock 809 non-null float64 17 against_steel 809 non-null float64 18 against_water 809 non-null float64 19 attack 809 non-null int64 20 base_egg_steps 809 non-null int64 21 base_happiness 809 non-null int64 22 base_total 809 non-null int64 23 capture_rate 809 non-null object 24 classfication 809 non-null object 25 defense 809 non-null int64 26 experience_growth 809 non-null int64 27 height_m 794 non-null float64 28 hp 809 non-null int64 29 japanese_name 809 non-null object 30 name 809 non-null object 31 percentage_male 703 non-null float64 32 pokedex_number 809 non-null int64 33 sp_attack 809 non-null int64 34 sp_defense 809 non-null int64 35 speed 809 non-null int64 36 type1 809 non-null object 37 type2 414 non-null object 38 weight_kg 794 non-null float64 39 generation 809 non-null int64 40 legendary 809 non-null int64 dtypes: float64(21), int64(13), object(7) memory usage: 259.3+ KB
The following indicators will be selected for further analysis:
'generation'
: indicate which generation the Pokémon was first introduced'ledendary'
: indicates whether a Pokémon is a legendary one, 1 means legendary, 0 means ordinary'name'
: English name of the Pokémon'type1'
,'type2'
: the type of the Pokémons, some of them have dual type'hp'
,'attack'
,'defense'
,'sp_attack'
,'sp_defense'
,'speed'
,'base_total'
: base stats of six factors and their sum'against_*'
: performance against other types ('*'
means each type) (back to 3.1)The 'against_*'
seems kind of complex, so these columns name will be changed into '*'
# use `.replace` to cut out the "against_"
raw_data.columns = raw_data.columns.map(lambda x: x.replace("against_",""))
Use the following list main
to extract the data
from the raw_data
. It's worth noting that the dataset has already sorted the Pokémon in order. So there is no need to reset new index. Moreover, the type_list
is a list represented the performance against other types. But it can also be used as a list of type name in section 2.2. And we can tell from the raw_data.info()
that only 'type2'
has missing values in data
, which means that 395 Pokémons only have one type.(back to 2.3)
# `main` is combined by three list
basic_list = ['generation', 'legendary', 'name', 'type1', 'type2']
stats_list = ['base_total', 'hp', 'attack', 'defense',
'sp_attack', 'sp_defense', 'speed']
type_list = list(raw_data.columns[1:19])
main = basic_list + stats_list + type_list
data = raw_data.loc[:,main]
data.head()
generation | legendary | name | type1 | type2 | base_total | hp | attack | defense | sp_attack | ... | ghost | grass | ground | ice | normal | poison | psychic | rock | steel | water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Bulbasaur | grass | poison | 318 | 45 | 49 | 49 | 65 | ... | 1.0 | 0.25 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.5 |
1 | 1 | 0 | Ivysaur | grass | poison | 405 | 60 | 62 | 63 | 80 | ... | 1.0 | 0.25 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.5 |
2 | 1 | 0 | Venusaur | grass | poison | 625 | 80 | 100 | 123 | 122 | ... | 1.0 | 0.25 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.5 |
3 | 1 | 0 | Charmander | fire | NaN | 309 | 39 | 52 | 43 | 60 | ... | 1.0 | 0.50 | 2.0 | 0.5 | 1.0 | 1.0 | 1.0 | 2.0 | 0.5 | 2.0 |
4 | 1 | 0 | Charmeleon | fire | NaN | 405 | 58 | 64 | 58 | 80 | ... | 1.0 | 0.50 | 2.0 | 0.5 | 1.0 | 1.0 | 1.0 | 2.0 | 0.5 | 2.0 |
5 rows × 30 columns
We will count the number of Pokémons in different generations (whether it's legendary as a subdivision standard) Fisrtly, we have to create ordinary
and legendary
to store the number of different kinds of Pokémons.
%%time
# create two empty array to store data
ordinary = np.zeros(7)
legendary = np.zeros(7)
# run a loop for each generation
for g in range(7):
for l in [0,1]:
# give the codition to extract corresponding data
judge = (data['generation'] == g+1) & (data['legendary'] == l)
if l == 0:
# get the ordinary Pokémon in each generation
ordinary[g] = data.loc[judge]['name'].count()
else:
# get the legendary Pokémon in each generation
legendary[g] = data.loc[judge]['name'].count()
# the elements in two list are all float
# quantity should be int
ordinary = ordinary.astype(int)
legendary = legendary.astype(int)
# create a dataframe
total = ordinary + legendary
index = np.array(('ordinary','legendary', 'total'))
columns = np.array([f'G{i+1}' for i in range(7)])
pd.DataFrame(data=(ordinary, legendary, total),
index=index, columns=columns)
Wall time: 17.1 ms
G1 | G2 | G3 | G4 | G5 | G6 | G7 | |
---|---|---|---|---|---|---|---|
ordinary | 146 | 94 | 125 | 94 | 143 | 66 | 63 |
legendary | 5 | 6 | 10 | 13 | 13 | 6 | 25 |
total | 151 | 100 | 135 | 107 | 156 | 72 | 88 |
%%time
# shorter one but the same as above
ordinary_and_legendary = np.zeros((7,3))
# use `.groupby` to divide into different categories
gb_gen = data.groupby("generation")
for i in range(7):
gg_gen = gb_gen.get_group(i+1)
# the total must be counted after the first `.groupby`
ordinary_and_legendary[i,2] = gg_gen['name'].count()
# second `.groupby`
gb_gen_leg = gg_gen.groupby('legendary')
for j in range(2):
gg_gen_led = gb_gen_leg.get_group(j)
ordinary_and_legendary[i,j] = gg_gen_led['name'].count()
pd.DataFrame(data=ordinary_and_legendary.T.astype(int),
index=index, columns=columns)
Wall time: 15.3 ms
G1 | G2 | G3 | G4 | G5 | G6 | G7 | |
---|---|---|---|---|---|---|---|
ordinary | 146 | 94 | 125 | 94 | 143 | 66 | 63 |
legendary | 5 | 6 | 10 | 13 | 13 | 6 | 25 |
total | 151 | 100 | 135 | 107 | 156 | 72 | 88 |
Then we will use a stacked bar chart to present the data.
# give the width and colors of the bars
width = 0.5
colors = ['#008080', '#48D1CC']
# create a figure and set the size
fig, ax = plt.subplots(figsize=[10,8])
# plot the stacked bar chart
p1 = ax.bar(range(7), ordinary, width, color=colors[0])
p2 = ax.bar(range(7), legendary, width, bottom=ordinary, color=colors[1])
# other setting of the chart
ax.set_title('Quantity of Pokémon in each generation', fontsize='xx-large')
ax.set_xlabel('Generation', fontsize='large')
ax.set_ylabel('Quantity', fontsize='large')
ax.set_xticks(range(7))
ax.set_xticklabels(columns)
ax.legend((p1, p2), ('Ordinary', 'Legendary'), fontsize='large')
ax.grid(axis='y', ls=':')
# mark the numbers in the bar chart
for i in range(7):
a = ordinary[i]
b = total[i]
ax.text(i, a, f'{a}', fontsize='large', horizontalalignment='center')
ax.text(i, b, f'{b}', fontsize='large', horizontalalignment='center')
Although we can use the method in section 2.1 to process the type data and get the number of Pokémon with different types, there are two reasons against so. First, there are 18 types of Pokémon, which makes it hard to recognize the number of each type when the data is stacked in just one bar. Second, some Pokémons might have two types. The information in 'type2'
may be ignored when using the method above.Here we extract columns 'type1'
and 'type2'
from data
to create type_df
and try to find the number of Pokémon that correspond to all types of combinations.(back to 3.1)
type_df = data.loc[:,['type1','type2']]
# For those single-type Pokémon, 'type2' is missing at first.
# So make them 'type2' == 'type1' to represent their types.
fill = type_df.loc[:,'type1']
type_df.loc[:,'type2'].fillna(fill, inplace = True)
type_df.T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type1 | grass | grass | grass | fire | fire | fire | water | water | water | bug | ... | psychic | steel | fighting | poison | poison | rock | fire | electric | steel | steel |
type2 | poison | poison | poison | fire | fire | flying | water | water | water | bug | ... | psychic | fairy | ghost | poison | dragon | steel | ghost | electric | steel | steel |
2 rows × 809 columns
Then we will check each Pokémon's type and calculate the number of each type or each dual type. Here we use the type_list
we defined before just to name the origin
's index and colunms.
(back to 1.2)
# create empty dataframe
data0 = np.zeros((18,18))
origin = pd.DataFrame(data=data0, index=type_list,
columns=type_list, dtype='int32')
# get each Pokémon's type information
type1 = type_df.loc[:,'type1'].values
type2 = type_df.loc[:,'type2'].values
# fill in the information to the dataframe
for i,j in zip(type1,type2):
origin.loc[i,j] += 1
origin
bug | dark | dragon | electric | fairy | fighting | fire | flying | ghost | grass | ground | ice | normal | poison | psychic | rock | steel | water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bug | 18 | 0 | 0 | 4 | 2 | 3 | 2 | 13 | 1 | 6 | 1 | 0 | 0 | 11 | 0 | 3 | 5 | 3 |
dark | 0 | 9 | 4 | 0 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 2 | 0 |
dragon | 0 | 0 | 12 | 1 | 0 | 2 | 1 | 4 | 0 | 0 | 4 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
electric | 0 | 0 | 0 | 28 | 2 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 4 | 0 |
fairy | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fighting | 0 | 1 | 0 | 0 | 0 | 22 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 |
fire | 0 | 1 | 1 | 0 | 0 | 6 | 28 | 6 | 1 | 0 | 2 | 2 | 2 | 0 | 1 | 1 | 1 | 1 |
flying | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ghost | 0 | 1 | 1 | 0 | 1 | 0 | 3 | 2 | 9 | 5 | 2 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
grass | 0 | 3 | 0 | 0 | 5 | 3 | 0 | 6 | 1 | 38 | 1 | 2 | 0 | 14 | 2 | 0 | 3 | 0 |
ground | 0 | 3 | 2 | 1 | 0 | 0 | 1 | 3 | 2 | 0 | 12 | 2 | 0 | 0 | 2 | 3 | 1 | 0 |
ice | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 3 | 12 | 0 | 0 | 2 | 0 | 0 | 3 |
normal | 0 | 4 | 1 | 0 | 4 | 2 | 0 | 26 | 0 | 2 | 1 | 0 | 61 | 0 | 3 | 0 | 0 | 1 |
poison | 1 | 3 | 2 | 0 | 0 | 2 | 2 | 3 | 0 | 0 | 2 | 0 | 0 | 16 | 0 | 0 | 0 | 3 |
psychic | 0 | 0 | 0 | 0 | 6 | 1 | 1 | 6 | 2 | 1 | 0 | 0 | 0 | 0 | 35 | 0 | 1 | 0 |
rock | 2 | 1 | 2 | 0 | 2 | 1 | 0 | 4 | 0 | 2 | 6 | 2 | 0 | 1 | 2 | 11 | 4 | 6 |
steel | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 2 | 3 | 0 | 1 | 0 | 0 | 0 | 6 | 3 | 6 | 0 |
water | 2 | 4 | 2 | 2 | 4 | 2 | 0 | 7 | 2 | 3 | 9 | 3 | 0 | 3 | 5 | 4 | 1 | 61 |
The data is not intuitive from the table, so we use heatmap to present the data. Using imshow
is a way of drawing a heat map but kind of complicated. So seaborn
is used to streamline the code cause it was developed based on the matplotlib
and create visualizations with less code.
# ignore the FutureWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# single type and dual type
single_type = (type_df.loc[:,'type1'] == type_df.loc[:,'type2']).sum()
total = origin.sum().sum()
print(f'The number of single-type Pokémon is {single_type}.')
print('The proportion of single-type Pokémon is {0:.2%}'.format(single_type/total))
# heatmap
fig,ax=plt.subplots(figsize=(18,18))
import seaborn as sns
sns.heatmap(origin,annot=True,cmap='GnBu',square=True)
ax.set_title('Type of the Pokémon', fontsize='xx-large');
The number of single-type Pokémon is 395. The proportion of single-type Pokémon is 48.83%
There are six base stats for each Pokémon that indicate its ability. Here we will give some general statistics of these value. The stats_list
in section 1.2 will be used to extract the base stats from the data
.(back to 2.4)
stats_df = data.loc[:,['name']+stats_list].set_index('name')
stats_df.head()
base_total | hp | attack | defense | sp_attack | sp_defense | speed | |
---|---|---|---|---|---|---|---|
name | |||||||
Bulbasaur | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
Ivysaur | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
Venusaur | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
Charmander | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
Charmeleon | 405 | 58 | 64 | 58 | 80 | 65 | 80 |
# statistic of base stats
stats_des = stats_df.describe()
# give a 'formater' to take two decimal places
formater = '{:.2f}'.format
stats_des.applymap(formater)
base_total | hp | attack | defense | sp_attack | sp_defense | speed | |
---|---|---|---|---|---|---|---|
count | 809.00 | 809.00 | 809.00 | 809.00 | 809.00 | 809.00 | 809.00 |
mean | 429.33 | 69.03 | 78.14 | 73.23 | 71.50 | 70.94 | 66.48 |
std | 119.43 | 26.58 | 32.25 | 31.11 | 32.41 | 27.87 | 29.17 |
min | 180.00 | 1.00 | 5.00 | 5.00 | 10.00 | 20.00 | 5.00 |
25% | 323.00 | 50.00 | 55.00 | 50.00 | 46.00 | 50.00 | 45.00 |
50% | 440.00 | 65.00 | 75.00 | 70.00 | 65.00 | 66.00 | 65.00 |
75% | 509.00 | 80.00 | 100.00 | 90.00 | 92.00 | 90.00 | 87.00 |
max | 780.00 | 255.00 | 185.00 | 230.00 | 194.00 | 230.00 | 180.00 |
stats_df
is a Dataframe of base stats.
stats_des
is some statistics onstats_df
.
Then we will use a boxplot to present the data to get a general idea of the distribution of the base stats.
# some setting for boxplot
boxprops = {'facecolor':'#008080'}
flierprops = {'markerfacecolor':'#48D1CC'}
medianprops = {'markerfacecolor':'orange'}
meanprops = {'marker':'D','markerfacecolor':'white'}
# ax0 for base total
ax0 = plt.axes([0.1, 0.1, 0.3, 1.2])
ax0.set_title('base total', fontsize='x-large')
ax0.set_xticklabels(['total'], fontsize='large')
ax0.grid(axis='y', ls=':')
ax0.boxplot(stats_df.loc[:, 'base_total'], patch_artist=True,
showmeans=True, flierprops=flierprops,
medianprops=medianprops, meanprops=meanprops)
# ax1 for six base stats
ax1 = plt.axes([0.5, 0.1, 1.8, 1.2])
ax1.set_title('base stats', fontsize='x-large')
ax1data = [stats_df.loc[:, i] for i in stats_list[1:]]
ax1.set_xticklabels(stats_list[1:], fontsize='large')
ax1.grid(axis='y', ls=':')
# patch_artist: fill the box with color ; showmeans: show the mean
# boxprops: box set; flierprops: extremum set;
# medianprops: median set; meanprops: mean set
ax1.boxplot(ax1data, patch_artist=True, showmeans=True,
boxprops=boxprops, flierprops=flierprops,
medianprops=medianprops, meanprops=meanprops);
base_total
varies widely between Pokémon ranging from 180 to 780.'attack'
but low 'defense'
. That's why single stats have so many extreme values.# kernel density of 'base_total'
ax0 = plt.axes([0.1, 0.1, 1.1, 1.2])
stats_df['base_total'].plot.density(xlim=(0, 850), label='base_total')
ax0.legend(fontsize='x-large')
ax0.grid()
# kernel density of six stats
ax1 = plt.axes([1.4, 0.1, 1.6, 1.2])
for i in stats_df.columns[1:]:
stats_df[i].plot.density(xlim=(0, 260), label=i)
ax1.legend(fontsize='x-large')
ax1.grid();
Pokémon can be roughly divided into some categories due to their base stats (not an official rule). We only choose some Pokémon ('base_total'
>= 75% quantile) for analysis because it will be intricate when you put more than 800 observations in a chart. In this section, we must use the stats_df
and stats_des
in section 2.3.
aggressive_index
= 0.6 * max{'attack'
,'sp_attack'
} + 0.4 * 'speed'
tough_index
= 0.6 * max{'defense'
,'sp_defense'
} + 0.4 * 'hp'
'aggressive_index'
, we will call it 'aggressive'.'tough_index'
, we will call it 'tough'.'aggressive_index'
or 'tough_index'
, we will call it 'special'.def aggressive_index(s):
if s['attack'] > s['sp_attack']:
aggressive_index = 0.6*s['attack'] + 0.4*s['speed']
else:
aggressive_index = 0.6*s['sp_attack'] + 0.4*s['speed']
return aggressive_index
def tough_index(s):
if s['defense'] > s['sp_defense']:
tough_index = 0.6*s['defense'] + 0.4*s['hp']
else:
tough_index = 0.6*s['sp_defense'] + 0.4*s['hp']
return tough_index
Use aggregate to apply on the functions defined above to combines multiple values into a single value and we will get a dataframe contained all Pokémons' 'aggressive_index'
and 'tough_index'
. Because the .describe()
doesn't report the 'μ+3σ', so here we give the a_special
and t_special
as the critical value of 'special'.
# compute the two index
# stats_df is from section 2.3
index = stats_df.agg((aggressive_index, tough_index), axis=1)
# μ+3σ 'special' critical value
a_special = index['aggressive_index'].mean() + 3*index['aggressive_index'].std()
t_special = index['tough_index'].mean() + 3*index['tough_index'].std()
index.describe()
aggressive_index | tough_index | |
---|---|---|
count | 809.000000 | 809.000000 |
mean | 79.812608 | 76.090729 |
std | 26.664726 | 24.544133 |
min | 8.000000 | 27.400000 |
25% | 59.800000 | 57.200000 |
50% | 78.000000 | 74.800000 |
75% | 97.600000 | 92.000000 |
max | 172.400000 | 183.000000 |
# choose the Pokémon which 'base_total' >= 75% quantile and get their index
# stats_df and stats_des is from section 2.3
stats_high = stats_df.loc[stats_df['base_total'] >= stats_des.loc['75%','base_total']]
index_high = stats_high.agg((aggressive_index, tough_index), axis=1)
stats_high
is a sample ofstats_df
.
index_high
is a sample ofindex
.
fig, ax = plt.subplots(figsize=(14,8))
# scatter plot
ax.scatter(index_high['aggressive_index'],
index_high['tough_index'], color='#6A5ACD')
# 'aggressive_index' == 'tough_index'
ax.plot(range(45,180),range(45,180),
color="orange", linestyle=':', label='equilibrium')
# indicate the mean and special critical value
ax.vlines(index['aggressive_index'].mean(), 45, 180,
color="green", linestyle=':')
ax.hlines(index['tough_index'].mean(), 45, 180,
color="green", linestyle=':', label='mean')
ax.vlines(a_special, 45, 180, color="red", linestyle=':')
ax.hlines(t_special, 45, 180, color="red", linestyle=':', label='μ+3σ')
# other settings
ax.set_xlabel('aggressive index', fontsize='x-large')
ax.set_ylabel('tough index', fontsize='x-large')
ax.set_title('Aggressive and tough index', fontsize='xx-large')
ax.legend(fontsize='large')
# mark the special Pokémons
for i in index_high.index:
a = index_high.loc[i,'aggressive_index']
t = index_high.loc[i,'tough_index']
# This is separate for fear of overlapping the text of the annotations.
if (i == 'Regirock')|(i == 'Regice'):
ax.annotate('Regirock/Regice',(a,t), xytext=(-150, 10),
textcoords="offset points", fontsize='large',
arrowprops=dict(arrowstyle="->"))
elif (a > a_special)|(t > t_special):
ax.text(a, t, i, fontsize='large')
'base_total'
(>= 75% quantile), it will more chance to have higher index in two aspects (>= mean).This section will select the nine 'special' Pokémons' information as a sample special_data
. Create another two dataframes all_data
(all Pokémons) and high_data
(25% Pokémons) in the same way.
# get those special Pokémon's base stats and index
a = index_high.loc[:,'aggressive_index']
t = index_high.loc[:,'tough_index']
stats_special = stats_high.loc[(a > a_special)|(t > t_special)]
index_special = index_high.loc[(a > a_special)|(t > t_special)]
# create three dataframe
special_data = pd.merge(stats_special, index_special,
left_on='name', right_index=True)
all_data = pd.merge(stats_df, index,
left_on='name', right_index=True)
high_data = pd.merge(stats_high, index_high,
left_on='name', right_index=True)
special_data
base_total | hp | attack | defense | sp_attack | sp_defense | speed | aggressive_index | tough_index | |
---|---|---|---|---|---|---|---|---|---|
name | |||||||||
Alakazam | 600 | 55 | 50 | 65 | 175 | 105 | 150 | 165.0 | 85.0 |
Mewtwo | 780 | 106 | 150 | 70 | 194 | 120 | 140 | 172.4 | 114.4 |
Steelix | 610 | 75 | 125 | 230 | 55 | 95 | 30 | 87.0 | 168.0 |
Blissey | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 67.0 | 183.0 |
Aggron | 630 | 70 | 140 | 230 | 60 | 80 | 50 | 104.0 | 166.0 |
Regirock | 580 | 80 | 100 | 200 | 50 | 100 | 50 | 80.0 | 152.0 |
Regice | 580 | 80 | 50 | 100 | 100 | 200 | 50 | 80.0 | 152.0 |
Zygarde | 708 | 216 | 100 | 121 | 91 | 95 | 85 | 94.0 | 159.0 |
Stakataka | 570 | 61 | 131 | 211 | 53 | 101 | 13 | 83.8 | 151.0 |
# 'aggresive'
# basic settings
ax0 = plt.axes([0.1, 0.1, 0.8, 0.9])
ax0.set_title('Alakazam aggressive performance', fontsize = 'x-large')
# get the specific value of the Pokémon
a_ = ['attack', 'sp_attack', 'speed', 'aggressive_index']
aggressive = special_data.loc['Alakazam'][a_]
# find the criteria for comparison
aggressive_all_mean = all_data.mean()[a_]
aggressive_high_mean = high_data.mean()[a_]
# scatter plot
ax0.scatter(np.array(a_), aggressive, label='Alakazam')
ax0.scatter(np.array(a_), aggressive_all_mean, label='all_mean')
ax0.scatter(np.array(a_), aggressive_high_mean, label='high_mean')
# give some tag lines and legend
ax0.vlines(np.array(a_), ymin=0, ymax=aggressive, color='orange',
alpha=0.7, linewidth=2, linestyle=':')
ax0.legend(fontsize = 'medium')
# 'tough'
# nearly the same as above
ax1 = plt.axes([1.1, 0.1, 0.8, 0.9])
ax1.set_title('Blissey tough performance', fontsize = 'x-large')
t_ = ['defense', 'sp_defense', 'hp', 'tough_index']
tough = special_data.loc['Blissey'][t_]
tough_mean_all = all_data.mean()[t_]
tough_mean_high = high_data.mean()[t_]
ax1.scatter(np.array(t_), tough, label='Blissey')
ax1.scatter(np.array(t_), tough_mean_all, label='all_mean')
ax1.scatter(np.array(t_), tough_mean_high, label='high_mean')
ax1.vlines(np.array(t_), ymin=0, ymax=tough, color='orange',
alpha=0.7, linewidth=2, linestyle=':')
ax1.legend(fontsize = 'medium', loc='upper left');
'sp_attack'
and 'speed'
, but they are relatively mediocre in 'attack'
, such as Alakazam. Some Pokémons are 'tough' because of their superior 'hp'
but are less able to resist attacks, such as Blissey.Therefore, the simple categorization of 'aggressive' and 'tough' is a bit too arbitrary, and a more detailed categorization is needed obviously. Here we will plot the six stats on a radar map to visually see where a Pokémon stands out. To prevent label overlapping, here we change the 'sp_attack'
and 'sp_defense'
into 'sp_a'
and 'sp_d'
.
# remove the 'base_total'
df0 = stats_special[stats_special.columns[1:]]
df0 = df0.rename(columns={'sp_attack':'sp_a', 'sp_defense':'sp_d'})
df0 # what we will input into the following function
hp | attack | defense | sp_a | sp_d | speed | |
---|---|---|---|---|---|---|
name | ||||||
Alakazam | 55 | 50 | 65 | 175 | 105 | 150 |
Mewtwo | 106 | 150 | 70 | 194 | 120 | 140 |
Steelix | 75 | 125 | 230 | 55 | 95 | 30 |
Blissey | 255 | 10 | 10 | 75 | 135 | 55 |
Aggron | 70 | 140 | 230 | 60 | 80 | 50 |
Regirock | 80 | 100 | 200 | 50 | 100 | 50 |
Regice | 80 | 50 | 100 | 100 | 200 | 50 |
Zygarde | 216 | 100 | 121 | 91 | 95 | 85 |
Stakataka | 61 | 131 | 211 | 53 | 101 | 13 |
# get the boundary value of the map
def boundary(df):
minb = maxb = 0
a,b = df.values.shape
for i in range(a):
for j in range(b):
value = df.iloc[i,j]
if value < minb:
minb = value
if value > maxb:
maxb = value
return minb, maxb
import matplotlib.colors as mcolors
# create radar map
def base_stats_map(df, rows, columns):
"""
create several radar map on base stats
inputs:
df: a dataframe contain six stats of Pokémons
the index must be Pokémons' names
the columns must be stats names
rows: the number of rows on the figure
columns: the number of columns on the figure
outputs:
(rows*columns) radar maps in one figure
"""
# get all available colors
colors = list(mcolors.TABLEAU_COLORS)
minb, maxb = boundary(df)
number = len(df.columns)
# divide the circle into sections based on the number of stats
# 0 to 2pi, divide for several times
angles = np.linspace(0, 2*np.pi, number, endpoint=False)
# go around the circle and come back to the origin
angles = np.append(angles, angles[0])
# create polar axises of rows * columns
fig, ax = plt.subplots(rows, columns, figsize=(18,18),
subplot_kw=dict(polar=True))
# The original ax.shape is a two-dimensional array.
# For the convenience of the next iteration loop,
# we break it down into a one-dimensional array.
ax = ax.ravel()
name = df.index
stats = df.values
for i, (name, stats) in enumerate(zip(name, stats)):
# go around the circle and come back to the origin
stats = np.append(np.array(stats), stats[0])
# plot on each axis and fill color
ax[i].plot(angles, stats, color=colors[i])
ax[i].fill(angles, stats, alpha=0.7, color=colors[i])
# set the labels of xtick
ax[i].set_xticks(angles)
ax[i].set_xticklabels(df.columns)
# set the title (or the 'name')
ax[i].set_title(name, size=10, color='black', position=(0.5, 0.4))
# set the minimun and maximun r
# +0.1 to prevent the outermost circle from appearing incompletely
ax[i].set_rmin(minb)
ax[i].set_rmax(maxb + 0.1)
plt.show()
# plot the radar map
base_stats_map(df0,3,3)
'sp_attack'
,Aggron has the highest 'defense'
...'base_total"
of Pokémon, with the larger the area, the stronger the overall strength.The data_battle
we will use here is the copy of data
with some same treatment as section 2.2 (fill in the NaN
in 'type2'
). In this section, we will create a specific battle mechanism to simulate the battle between the Pokémons.
fill = data.loc[:,'type1']
data_battle = data.copy()
data_battle.loc[:,'type2'].fillna(fill, inplace=True)
Here are the details:
'speed'
. The faster one is 'first'
, the slower one is 'second'
.'dm'
: Damage multiplier. As we see in the section 1.2, each Pokémon has 18 performances against other types, which means that when others attack this Pokémon, the damage is determined by the number of the corresponding type. For example, the 'dm'
of a single-type (grass) Pokémon to a dual-type (green-bug) is 0.25; the 'dm'
of a dual-type (green-bug) Pokémon to a single-type (grass) is 0.5 * 2 = 1. (See the code below)'da'
: Damage. Assume that when a Pokémon has higher 'attack'
than 'sp_attack'
, it will use physical attacks and its opponent will resist the 'attack'
with 'defense'
. If one's 'attack'
is lower than the other's 'defense'
, the attack will give a 'da'
= 0.5 * 'dm'
.If not, the 'da'
= (one's 'attack'
- the other's 'defense'
) * 'dm'
. Same thing for those who have a higher 'sp_attack'
.'hp'
(Δhp = -'da'
). When one of two runs out of its 'hp'
, the battle is over and the round of the battle will be recorded.def choose_who(df1):
a, b = np.random.randint(0,809,(2,))
# not the same
if a == b:
if a == 0:
a = a + 1
else:
a = a - 1
p1 = df1.iloc[a]
p2 = df1.iloc[b]
# return all the information of the two Pokémons
return p1, p2
def who_first(p1, p2):
# based on their 'speed'
if p1['speed'] >= p2['speed']:
return p1, p2
else:
return p2, p1
def damage_multiplier(p1, p2):
p1type1 = p1['type1']
p1type2 = p1['type2']
p2type1 = p2['type1']
p2type2 = p2['type2']
if p1type1 == p1type2:
dm1 = p2[p1type1]
else:
dm1 = p2[p1type1] * p2[p1type2]
if p2type1 == p2type2:
dm2 = p1[p2type1]
else:
dm2 = p1[p2type1] * p1[p2type2]
# the battle must end
# so dm1 = dm2 = 0 is not allowed
if (dm1 == 0) & (dm2 == 0):
dm1, dm2 = (0.05, 0.05)
return dm1, dm2
def damage(p1, p2, dm1, dm2):
if p1['attack'] >= p1['sp_attack']:
if p1['attack'] <= p2['defense']:
# negative damage is not allowed
da1 = 0.5 * dm1
else:
da1 = (p1['attack']-p2['defense']) * dm1
else:
if p1['sp_attack'] <= p2['sp_defense']:
da1 = 0.5 * dm1
else:
da1 = (p1['sp_attack']-p2['sp_defense']) * dm1
# same as above
if p2['attack'] >= p2['sp_attack']:
if p2['attack'] <= p1['defense']:
da2 = 0.5 * dm2
else:
da2 = (p2['attack']-p1['defense']) * dm2
else:
if p2['sp_attack'] <= p1['sp_defense']:
da2 = 0.5 * dm2
else:
da2 = (p2['sp_attack']-p1['sp_defense']) * dm2
return da1, da2
def bout(p1, p2, da1, da2):
bround = 0
hp1 = p1['hp']
hp2 = p2['hp']
while (hp1 > 0) & (hp2 > 0):
# the while loop adds 1, the round adds 1
bround += 1
if da1 >= hp2:
hp2 = 0
winner = p1['name']
loser = p2['name']
break # when one's 'hp'=0, battle over
hp2 = hp2 - da1
if da2 >= hp1:
hp1 = 0
winner = p2['name']
loser = p1['name']
break
hp1 = hp1 - da2
return (winner, round(hp1,2),
round(hp2,2), bround)
def battle(df, N):
"""
get a df of all information of some Pokémons
and simulate battles for N times to get the results
return another dataframe
"""
# make a list to store the results
list0 = [0]
# simulate for N times
for i in range(N):
# firstly
p1, p2 = choose_who(df)
first, second = who_first(p1, p2)
# secondly
dm1, dm2 = damage_multiplier(first, second)
da1, da2 = damage(first, second, dm1, dm2)
# lastly
winner, hp1, hp2, bround = bout(first, second, da1, da2)
# add one result after one battle ends
list0.append([first['name'], second['name'], winner,
bround, first['hp'], second['hp'],
hp1, hp2, dm1, dm2, da1, da2])
# change the results into a dataframe
columns=['first', 'second', 'winner', 'round',
'first_hp0', 'second_hp0', 'first_hp', 'second_hp',
'first_dm', 'second_dm', 'first_da', 'second_da']
df0 = pd.DataFrame(data=list0[1:], index=range(N),
columns=columns)
return df0
%%timeit
simulation_sample = battle(data_battle, 50)
60.8 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Simulating for 50 times needs 60-80ms, which is not very quickly. However, less simulations can not produce reliable results. So saving the CSV document with large amounts of data (simulating for 100,000 times) is not demonstrated here. (The code will showed as markdown)
simulation_of_battle = battle(data_battle, 100000)
simulation_of_battle.to_csv('simulation_of_battle.csv')
simulation_of_battle = pd.read_csv('simulation_of_battle.csv')
simulation_of_battle
Unnamed: 0 | first | second | winner | round | first_hp0 | second_hp0 | first_hp | second_hp | first_dm | second_dm | first_da | second_da | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Skiploom | Starly | Skiploom | 6 | 55 | 40 | 5.00 | 0.00 | 0.50 | 2.00 | 7.500 | 10.00 |
1 | 1 | Hoopa | Timburr | Hoopa | 1 | 80 | 75 | 80.00 | 0.00 | 2.00 | 0.00 | 270.000 | 0.00 |
2 | 2 | Lopunny | Banette | Banette | 19 | 65 | 64 | 0.00 | 6.05 | 0.05 | 0.05 | 3.050 | 3.55 |
3 | 3 | Vivillon | Guzzlord | Guzzlord | 2 | 80 | 223 | 0.00 | 75.00 | 2.00 | 1.00 | 74.000 | 51.00 |
4 | 4 | Eevee | Aromatisse | Aromatisse | 2 | 55 | 101 | 0.00 | 100.00 | 1.00 | 1.00 | 0.500 | 34.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | 99995 | Pidgeotto | Gourgeist | Gourgeist | 28 | 63 | 85 | 0.00 | 84.30 | 0.05 | 0.05 | 0.025 | 2.25 |
99996 | 99996 | Luvdisc | Hippopotas | Hippopotas | 3 | 43 | 68 | 0.00 | 65.00 | 2.00 | 1.00 | 1.000 | 17.00 |
99997 | 99997 | Shaymin | Wimpod | Shaymin | 1 | 100 | 25 | 100.00 | 0.00 | 1.00 | 1.00 | 90.000 | 0.50 |
99998 | 99998 | Doduo | Slurpuff | Slurpuff | 1 | 35 | 82 | 0.00 | 81.50 | 1.00 | 1.00 | 0.500 | 50.00 |
99999 | 99999 | Milotic | Slowking | Milotic | 380 | 95 | 95 | 0.25 | 0.00 | 0.50 | 0.50 | 0.250 | 0.25 |
100000 rows × 13 columns
# Calculate how many times each Pokémon appears in the battle simulations
first_one = simulation_of_battle['first'].value_counts()
second_one = simulation_of_battle['second'].value_counts()
occur = first_one + second_one
occur.isnull().sum()
9
# compute the win rate of each Pokémon
winner = simulation_of_battle['winner'].value_counts()
rate = winner/occur
# regardless of the NaN in win rate
# here is the top 10 of win rate
rate.sort_values().tail(21)[:10][::-1]
Kyogre 0.995261 Tyranitar 0.991189 Regigigas 0.981900 Gyarados 0.972868 Slaking 0.961977 Darmanitan 0.959184 Scizor 0.958159 Wishiwashi 0.947368 Aggron 0.945946 Blastoise 0.940945 dtype: float64
Although the dataset used here is not academic in some ways. But the operation of data visualization can be applied to many other datasets. Although I chose such a dataset with a playful attitude at the beginning, after a period of practical operation, I found that I could also learn a lot of data processing knowledge from it, which benefited me a lot.