PUBG Finish Placement Prediction

1. Feature and data explanation

First, a few words about the game. PlayerUnknown's Battlegrounds (PUBG) is an online multiplayer battle royale game. Up to 100 players are dropped onto an island empty-handed and must explore, scavenge, loot and eliminate other players until only one is left standing, all while the play zone continues to shrink.
Battle royale-style video games have taken the world by storm, and PUBG has become very popular. With over 50 million copies sold, it is the fifth best-selling game of all time and has millions of monthly active players.

The task: using a player's statistics from a match, predict that player's final placement, where 0 is last place and 1 is the winner ("winner winner, chicken dinner").

The dataset contains over 65,000 games' worth of anonymized player data, which can be downloaded from the Kaggle website. Each row holds one player's stats at the end of a match.
The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.
The statistics include, for example, the player's kills, the match, group and personal IDs, the distance walked, and so on.
winPlacePerc is the target feature: a percentile winning placement on a scale from 1 (first place) to 0 (last place).
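
Since neither the match size nor the group size is guaranteed, it can help to count them directly. The snippet below is a minimal sketch on a made-up toy table (the rows and the threshold of 4 are illustrative assumptions, only the column names matchId, groupId and matchType come from the dataset); the same groupby calls should work on the real train table.

```python
import pandas as pd

# Toy stand-in for the real table: one row per player (values are made up).
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g1", "g1", "g1", "g2", "g3", "g4"],
    "matchType": ["squad"] * 6 + ["solo"] * 2,
})

# Number of player rows per match and per group within a match.
players_per_match = toy.groupby("matchId").size()
group_sizes = toy.groupby(["matchId", "groupId"]).size()

# Groups that exceed the nominal squad size of 4.
oversized = group_sizes[group_sizes > 4]
print(players_per_match.to_dict())
print(oversized)
```

On the real data, a non-empty `oversized` result would confirm that some groups have more than 4 players.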

A solution to this task can be valuable for PUBG players: it helps to understand which parameters matter and which tactics to choose. Also, using the PUBG Developer API we can collect our own data with more features. This makes it realistic to build many different apps that help players, for example a personal assistant app that gives tips on which skills to train.

Let's look at the data.

2-3 Primary data analysis and visual data analysis

In [46]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import scipy.stats as sc
import gc
import warnings
In [47]:
plt.rcParams['figure.figsize'] = 15,8
sns.set(rc={'figure.figsize':(15,8)})
pd.options.display.float_format = '{:.2f}'.format
warnings.filterwarnings('ignore')
gc.enable()
In [48]:
train = pd.read_csv('../input/train_V2.csv')
test = pd.read_csv('../input/test_V2.csv')
train.head()
Out[48]:
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks longestKill matchDuration matchType maxPlace numGroups rankPoints revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
0 7f96b2f878858a 4d4b580de459be a10357fd1a4a91 0 0 0.00 0 0 0 60 1241 0 0 0.00 1306 squad-fpp 28 26 -1 0 0.00 0 0.00 0 0 244.80 1 1466 0.44
1 eef90569b9d03c 684d5656442f9e aeb375fc57110c 0 0 91.47 0 0 0 57 0 0 0 0.00 1777 squad-fpp 26 25 1484 0 0.00 0 11.04 0 0 1434.00 5 0 0.64
2 1eaf90ac73de72 6a4a42c3245a74 110163d8bb94ae 1 0 68.00 0 0 0 47 0 0 0 0.00 1318 duo 50 47 1491 0 0.00 0 0.00 0 0 161.80 2 0 0.78
3 4616d365dd2853 a930a9c79cd721 f1f1f4ef412d7e 0 0 32.90 0 0 0 75 0 0 0 0.00 1436 squad-fpp 31 30 1408 0 0.00 0 0.00 0 0 202.70 3 0 0.17
4 315c96c26c9aac de04010b3458dd 6dc8ff871e21e6 0 0 100.00 0 0 0 45 0 1 1 58.53 1424 solo-fpp 97 95 1560 0 0.00 0 0.00 0 0 49.75 2 0 0.19

Data fields

  • DBNOs - Number of enemy players knocked.
  • assists - Number of enemy players this player damaged that were killed by teammates.
  • boosts - Number of boost items used.
  • damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
  • headshotKills - Number of enemy players killed with headshots.
  • heals - Number of healing items used.
  • Id - Player’s Id
  • killPlace - Ranking in match of number of enemy players killed.
  • killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
  • killStreaks - Max number of enemy players killed in a short amount of time.
  • kills - Number of enemy players killed.
  • longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
  • matchDuration - Duration of match in seconds.
  • matchId - ID to identify match. There are no matches that are in both the training and testing set.
  • matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
  • rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
  • revives - Number of times this player revived teammates.
  • rideDistance - Total distance traveled in vehicles measured in meters.
  • roadKills - Number of kills while in a vehicle.
  • swimDistance - Total distance traveled by swimming measured in meters.
  • teamKills - Number of times this player killed a teammate.
  • vehicleDestroys - Number of vehicles destroyed.
  • walkDistance - Total distance traveled on foot measured in meters.
  • weaponsAcquired - Number of weapons picked up.
  • winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
  • groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
  • numGroups - Number of groups we have data for in the match.
  • maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
  • winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
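
The field list says that winPlacePerc is calculated off of maxPlace but does not give the formula. The sketch below assumes the usual linear scaling, (maxPlace - place) / (maxPlace - 1); this formula and the function name are assumptions, not something stated in the dataset description. It is at least consistent with the first row shown above, where maxPlace is 28 and winPlacePerc is 0.44, which would correspond to place 16.

```python
def win_place_perc(place: int, max_place: int) -> float:
    """Assumed scaling: 1.0 for 1st place, 0.0 for place == max_place."""
    if max_place <= 1:
        # With a single placement the ratio is 0/0, so there is no percentile.
        return float("nan")
    return (max_place - place) / (max_place - 1)


print(round(win_place_perc(16, 28), 2))  # 0.44
print(win_place_perc(1, 28))             # 1.0
print(win_place_perc(28, 28))            # 0.0
```

Under this assumption a match with maxPlace equal to 1 has no defined target value.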
In [49]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
Id                 object
groupId            object
matchId            object
assists            int64
boosts             int64
damageDealt        float64
DBNOs              int64
headshotKills      int64
heals              int64
killPlace          int64
killPoints         int64
kills              int64
killStreaks        int64
longestKill        float64
matchDuration      int64
matchType          object
maxPlace           int64
numGroups          int64
rankPoints         int64
revives            int64
rideDistance       float64
roadKills          int64
swimDistance       float64
teamKills          int64
vehicleDestroys    int64
walkDistance       float64
weaponsAcquired    int64
winPoints          int64
winPlacePerc       float64
dtypes: float64(6), int64(19), object(4)
memory usage: 983.9+ MB

We have about 4.45 million player stat records!

Now check the dataset for missing values.

In [50]:
# Show rows that contain at least one missing value (axis=1 checks across columns)
display(train[train.isnull().any(axis=1)])
display(test[test.isnull().any(axis=1)])
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks longestKill matchDuration matchType maxPlace numGroups rankPoints revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
2744604 f70c74418bb064 12dfbede33f92b 224a123c53e008 0 0 0.00 0 0 0 1 0 0 0 0.00 9 solo-fpp 1 1 1574 0 0.00 0 0.00 0 0 0.00 0 0 nan
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks longestKill matchDuration matchType maxPlace numGroups rankPoints revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints

There is only one row with a NaN value, so let's drop it.

In [51]:
train.drop(2744604, inplace=True)
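
The cell above drops the row by its hard-coded index label. As a hedged alternative, the sketch below removes rows with a missing target by condition instead, so it does not depend on a specific index; the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value (values are made up).
toy = pd.DataFrame({"kills": [0, 2, 1], "winPlacePerc": [0.5, np.nan, 1.0]})

# Keep only the rows where the target is present.
cleaned = toy.dropna(subset=["winPlacePerc"])
print(len(toy), "->", len(cleaned))
```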

General info about each column

In [52]:
train.describe()
Out[52]:
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks longestKill matchDuration maxPlace numGroups rankPoints revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
count 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00 4446965.00
mean 0.23 1.11 130.72 0.66 0.23 1.37 47.60 505.01 0.92 0.54 23.00 1579.51 44.50 43.01 892.01 0.16 606.12 0.00 4.51 0.02 0.01 1154.22 3.66 606.46 0.47
std 0.59 1.72 170.78 1.15 0.60 2.68 27.46 627.50 1.56 0.71 50.97 258.74 23.83 23.29 736.65 0.47 1498.34 0.07 30.50 0.17 0.09 1183.50 2.46 739.70 0.31
min 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 133.00 2.00 1.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 0.00 0.00 0.00 0.00 0.00 0.00 24.00 0.00 0.00 0.00 0.00 1367.00 28.00 27.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 155.10 2.00 0.00 0.20
50% 0.00 0.00 84.24 0.00 0.00 0.00 47.00 0.00 0.00 0.00 0.00 1438.00 30.00 30.00 1443.00 0.00 0.00 0.00 0.00 0.00 0.00 685.60 3.00 0.00 0.46
75% 0.00 2.00 186.00 1.00 0.00 2.00 71.00 1172.00 1.00 1.00 21.32 1851.00 49.00 47.00 1500.00 0.00 0.19 0.00 0.00 0.00 0.00 1976.00 5.00 1495.00 0.74
max 22.00 33.00 6616.00 53.00 64.00 80.00 101.00 2170.00 72.00 20.00 1094.00 2237.00 100.00 100.00 5910.00 39.00 40710.00 18.00 3823.00 12.00 5.00 25780.00 236.00 2013.00 1.00

We can already guess that the target feature is roughly uniformly distributed: its quartiles (0.20, 0.46, 0.74) are close to 0.25, 0.50 and 0.75. That is because winPlacePerc is already a scaled feature, and in every match each player ends up with exactly one place.

In [53]:
train['winPlacePerc'].hist(bins=25);

We can notice that the values 0 and 1 occur more often than the others. That is because first and last place exist in every match.
winPlacePerc is clearly close to uniform, but let's still check the target feature for normality and skewness of its distribution (as the task requires).
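
To illustrate why 0 and 1 stand out, here is a small simulation. It assumes the linear scaling (maxPlace - place) / (maxPlace - 1) and one group per place, with a random number of placements per match; these are simplifying assumptions, not properties taken from the dataset. Every simulated match contributes exactly one 0 and one 1, while any other value appears only for some match sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches = 2000

values = []
for _ in range(n_matches):
    max_place = rng.integers(20, 101)      # hypothetical number of placements
    places = np.arange(1, max_place + 1)   # one group per place, for simplicity
    values.append((max_place - places) / (max_place - 1))
values = np.concatenate(values)

# Same number of bins as the histogram above.
counts, edges = np.histogram(values, bins=25, range=(0.0, 1.0))
print(counts)
print("exact zeros:", (values == 0.0).sum(), "exact ones:", (values == 1.0).sum())
```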

In [54]:
print(sc.normaltest(train['winPlacePerc']))
print('Skew: ', sc.skew(train['winPlacePerc']))
NormaltestResult(statistic=6456876.444084833, pvalue=0.0)
Skew:  0.09882959417670069

The p-value is reported as 0.0, so we reject the hypothesis that this distribution is normal.
The skew is close to zero, so the distribution is almost symmetric.
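
A symmetric distribution is not necessarily normal, which is why both checks are useful. The sketch below runs the same scipy.stats functions on a synthetic uniform sample (the sample and its size are illustrative assumptions): the skew is near zero, yet the normality test still rejects, because a uniform distribution has much lighter tails than a normal one (excess kurtosis of -1.2).

```python
import numpy as np
import scipy.stats as sc

rng = np.random.default_rng(0)
uniform_sample = rng.uniform(0.0, 1.0, size=100_000)

stat, pvalue = sc.normaltest(uniform_sample)
skew = sc.skew(uniform_sample)
excess_kurtosis = sc.kurtosis(uniform_sample)  # Fisher definition: 0 for a normal law

print("p-value:", pvalue)
print("skew:", skew)
print("excess kurtosis:", excess_kurtosis)
```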

Now look at the distribution of features with an upper limit (to get rid of outliers) and without zero values (because there are a lot of them).
Also make boxplots to see how the target feature depends on the feature values.

In [55]:
def featStat(featureName, constrain,plotType):
    feat = train[featureName][train[featureName]>0]
    data = train[[featureName,'winPlacePerc']].copy()
    q99 = int(data[featureName].quantile(0.99))
    plt.rcParams['figure.figsize'] = 15,5;   
    
    if constrain is not None:
        feat = feat[feat<constrain]
    if plotType == 'hist':
        plt.subplot(1,2,1)
        feat.hist(bins=50);
        plt.title(featureName);
        
        n = 20
        cut_range = np.linspace(0,q99,n)
        cut_range = np.append(cut_range, data[featureName].max())
        data[featureName] = pd.cut(data[featureName],
                                         cut_range,
                                         labels=["{:.0f}-{:.0f}".format(a_, b_) for a_, b_ in zip(cut_range[:n], cut_range[1:])],
                                         include_lowest=True
                                        )
        ax = plt.subplot(1,2,2)
        sns.boxplot(x="winPlacePerc", y=featureName, data=data, ax=ax, color="#2196F3")
        ax.set_xlabel('winPlacePerc', size=14, color="#263238")
        ax.set_ylabel(featureName, size=14, color="#263238")
        plt.gca().xaxis.grid(True)
        plt.tight_layout()
           
    if plotType == 'count':        
        plt.subplot(1,2,1)
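        # Pass the series as x explicitly; recent seaborn versions treat the first positional argument as data
        sns.countplot(x=feat, color="#2196F3");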
        
        plt.subplot(1,2,2)
        data.loc[data[featureName] > q99, featureName] = q99+1
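        x_order = data.groupby(featureName).mean().reset_index()[featureName].astype(object)
        x_order.iloc[-1] = str(q99+1)+"+"
        # Cast to object and use .loc (no chained assignment) so the "N+" label can replace the capped values
        data[featureName] = data[featureName].astype(object)
        data.loc[data[featureName] == q99+1, featureName] = str(q99+1)+"+"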
        
        ax = sns.boxplot(x=featureName, y='winPlacePerc', data=data, color="#2196F3", order = x_order);
        ax.set_xlabel(featureName, size=14, color="#263238")
        ax.set_ylabel('WinPlacePerc', size=14, color="#263238")
    plt.tight_layout()
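
The 'hist' branch of featStat bins a continuous feature into equal-width intervals up to the 99th percentile and puts everything above it into one last bin. The sketch below repeats that binning step on a few made-up values; the series, n = 4 and the fixed q99 = 100 are illustrative assumptions chosen to keep the result easy to read.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 5.0, 12.0, 40.0, 55.0, 90.0, 100.0, 2500.0])  # made-up distances

n = 4
q99 = 100  # stand-in for int(values.quantile(0.99)) on real data
# n equal-width edges up to q99, plus the maximum as the final edge.
edges = np.append(np.linspace(0, q99, n), values.max())
labels = ["{:.0f}-{:.0f}".format(a, b) for a, b in zip(edges[:-1], edges[1:])]

binned = pd.cut(values, edges, labels=labels, include_lowest=True)
print(binned.tolist())
```

The outlier 2500 lands in the single wide bin "100-2500" instead of stretching the axis for all other bins.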

Kills and damage

In [56]:
featStat('kills',15,'count');
plt.show();
featStat('longestKill',400,'hist');
plt.show();
featStat('damageDealt',1000,'hist');

Heals and boosts

In [57]:
featStat('heals',20,'count')
plt.show()
featStat('boosts',12,'count')

Distance

In [58]:
featStat('walkDistance',5000,'hist')
plt.show()
featStat('swimDistance',500,'hist')
plt.show()
featStat('rideDistance',12000,'hist')

Some other features

In [59]:
featStat('weaponsAcquired',15,'count')
plt.show()
featStat('vehicleDestroys',None,'count')
In [60]:
features = ['kills', 'longestKill', 'damageDealt', 'heals', 'boosts', 'walkDistance', 'swimDistance', 'rideDistance', 'weaponsAcquired', 'vehicleDestroys']
zeroPerc = ((train[features] == 0).sum(0) / len(train)*100).sort_values(ascending = False)
sns.barplot(x=zeroPerc.index , y=zeroPerc, color="#2196F3");
plt.title("Percentage of zero values")
plt.tight_layout()

As we can see, as the values of these features increase, the probability of winning also increases. So the features described above correlate well with the target feature.
Now plot the remaining features.

In [61]:
df = train.drop(columns=['Id','matchId','groupId','matchType']+features)
df[(df>0) & (df<=df.quantile(0.99))].hist(bins=25,layout=(5,5),figsize=(15,15));
plt.tight_layout()

Feature correlations

In [62]:
f,ax = plt.subplots(figsize=(15, 13))
sns.heatmap(df.corr(), annot=True, fmt= '.1f',ax=ax,cbar=False)
plt.show()

Take the features that correlate most strongly with the target feature

In [63]:
f,ax = plt.subplots(figsize=(11, 11))
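# numeric_only=True skips the string columns (Id, groupId, matchId, matchType); needs pandas >= 1.5
cols = train.corr(numeric_only=True).abs().nlargest(6, 'winPlacePerc')['winPlacePerc'].index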
hm = sns.heatmap(np.corrcoef(train[cols].values.T), annot=True, square=True, fmt='.2f',  yticklabels=cols.values, xticklabels=cols.values)
print(", ".join(cols[1:]), " most correlate with target feature")
plt.show()
walkDistance, killPlace, boosts, weaponsAcquired, damageDealt  most correlate with target feature
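
Ranking by the absolute value of the correlation matters here: a feature like killPlace can be strongly related to the target with a negative sign (a smaller kill rank is better). The sketch below shows the same selection idea on synthetic data; the column names and noise levels are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
target = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "target": target,
    "pos_feature": target + rng.normal(0, 0.1, n),   # strong positive link
    "neg_feature": -target + rng.normal(0, 0.2, n),  # strong negative link
    "noise": rng.normal(0, 1, n),                    # unrelated to the target
})

# Rank columns by absolute correlation with the target, strongest first.
corr_with_target = toy.corr()["target"].abs().sort_values(ascending=False)
top = corr_with_target.index[1:3]  # skip the target itself
print(list(top))
```

Without the absolute value, neg_feature would be ranked last even though it carries almost as much information as pos_feature.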

Let's make pairplots. We can clearly see the correlation with winPlacePerc (except perhaps for weaponsAcquired, where it is harder to see).

In [64]:
sns.set(font_scale=2)
sns.pairplot(train, y_vars=["winPlacePerc"], x_vars=cols[1:],height=8);
sns.set(font_scale=1)