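Drawing a Violin Plot¶

Exploring my PC app usage trends with charts¶

Nothing beats plotting the data when you want to find the seasonal patterns and anomalous signals hidden in it. Here I use the violin plot, a chart I have been relying on a lot lately, to draw my PC usage patterns for 2014 and 2015.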

In [1]:

%matplotlib inline

In [2]:

import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
import math

## plotting theme
sns.set(style="whitegrid", palette="colorblind", color_codes=True, font_scale=1.4,
        rc = {'font.size': 12, 'font.family':'NanumGothic'})

First, some quick plots with the Python Seaborn package¶

https://web.stanford.edu/~mwaskom/software/seaborn/

Generate 100 random values¶

In [3]:

import random
data = np.array( [random.randrange(0,100) for x in range(100)])
print(data)

[99 80 46 76 42 95 78 72 27 85 74 96  2 62 85 22 85 64 91 34 62 40 23 80 86
 65 42 14 92 99 20 88 47 54 70 28 29 22 50 10 65 14  4 46 52  1 31 89 50 29
 41 52 14 82 69 92 34 92  5 19 71 31 13 79 74 30 98  4 49 95  1 69  9 67 80
 81 58  9 15 21 28 27 84 71 72 62 71 89 77 83  2 79 21 62 99  3 91 88 62 47]

Draw a box plot, which shows the minimum, 25th percentile, median, 75th percentile, and maximum¶

In [4]:

import seaborn as sns 
sns.boxplot(data)

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a00efd0>
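The five values a box plot summarizes can also be computed directly, which is a handy cross-check against the figure. The following is a minimal sketch using only the standard library. The helper name `five_number_summary` is made up for this illustration, and the quartiles use the "inclusive" method, which may differ slightly from the interpolation used when the plot is drawn. Note also that with seaborn's default settings the whiskers extend at most 1.5 times the interquartile range, so they reach the minimum and maximum only when there are no outliers.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points between the four quarters
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```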

Draw a KDE (Kernel Density Estimation) plot, which shows the distribution of the data¶

In [5]:

sns.kdeplot(np.array(data))

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a0b8630>
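To make the idea behind a KDE curve concrete, here is a hand-rolled sketch: one Gaussian bump is placed on every sample and the bumps are averaged. This is an illustration under simplifying assumptions, not seaborn's implementation. The function name `gaussian_kde`, the sample values, and the fixed `bandwidth` are all chosen for this example, whereas seaborn picks a bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples`."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

density = gaussian_kde([10, 12, 15, 40, 42], bandwidth=5.0)
for x in (0, 12, 25, 41):
    print(x, round(density(x), 4))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one smooths them together.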

Then draw a violin plot, which combines the box plot and the KDE plot in one figure¶

In [6]:

sns.violinplot(data)

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a289518>

Building the dataset¶

Usage time per program on my PC in 2014 and 2015

In [7]:

!head ./resource/pc_usetime.csv

idx,uid,computername,shortsessionid,longsessionid,filename,title,catecode,usetime,jobclass,timestamp,inserttime,idate
1088,CHOIKYUMIN,CHOIKYUMIN,,,EXCEL.EXE,Microsoft Excel - 추천시스템 기능 기술 세분화.xlsx,6,270000,문서,1391404937,,
1089,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,AfreecaTV Analytics :: Collaboration Search 검색어 - Chrome,2,5000,인터넷,1391404707,,
1090,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,AfreecaTV Analytics :: Hot Boost Search Word - Chrome,2,5000,인터넷,1391404702,,
1091,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,AfreecaTV Analytics :: Hot Broadcast by StarBalloon Cnt - Chrome,2,5000,인터넷,1391404712,,
1092,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,AfreecaTV Analytics :: Starballoon Statistics - Chrome,2,15000,인터넷,1391404827,,
1093,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,AfreecaTV Analytics :: pilot project~ - Chrome,2,40000,인터넷,1391404732,,
1094,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,마이테이블 :: Pearson 상관계수(sample correlation coefficient) - Chrome,2,130000,인터넷,1391405107,,
1095,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,"상대거리 계산 - 피타고라스 정리, 유클리드 거리 공식 : 네이버 블로그 - Chrome",2,30000,인터넷,1391405197,,
1096,CHOIKYUMIN,CHOIKYUMIN,,,chrome.exe,새 탭 - Chrome,2,35000,인터넷,1391405162,,

In [8]:

## pd.DataFrame.from_csv has been removed from pandas; read_csv is the replacement
total_ds = pd.read_csv("./resource/pc_usetime.csv")
## type casting float to int 
total_ds[['usetime','timestamp']] = total_ds[['usetime','timestamp']].fillna(0).astype(int)

In [9]:

total_ds[13000:].head(2).T 

Out[9]:

	13000	13001
idx	21244	21245
uid	CHOIKYUMIN	CHOIKYUMIN
computername	CHOIKYUMIN	CHOIKYUMIN
shortsessionid	1394520453.0	1394520453.0
longsessionid	1.394511e+09	1.394511e+09
filename	OUTLOOK.EXE	chrome.exe
title	받은 편지함 - goodvc@afreecatv.com - Microsoft Outlook	AfreecaTV Analytics :: Related Word Test Admin...
catecode	7	2
usetime	20000	160000
jobclass	기타업무	인터넷
timestamp	1394520603	1394520823
inserttime	1.394521e+09	1.394521e+09
idate	2014-03-11 15:53:43	2014-03-11 15:53:43

Reshape the dataset so that a probability density can be drawn¶

In [10]:

## expand the dataset: one row per 5-second slice of usage
data_arrays = []
for (idx, row) in total_ds.iterrows():
    for ut in range(0,int(row.usetime),5000):
        ts  = row.timestamp + round(ut/1000)
        now = datetime.fromtimestamp(ts)
        days = (now - datetime(now.year, 1, 1)).days+1
        data_arrays.append(['total', ts, now, now.year ,days, row.filename, row.jobclass])

## build the pandas DataFrame
fully_expended_ds = pd.DataFrame(data_arrays, columns=['total', 'ts', 'time', 'year', 'days', 'filename', 'jobclass'])
## derive the main fields
## quarter label
fully_expended_ds['YYYYQt'] = fully_expended_ds.time.apply(lambda x : "%d' %dQ" % (x.year-2000, x.quarter))
## day trend values (minute of the day)
fully_expended_ds['day-minute'] = fully_expended_ds.time.apply(lambda x : x.hour*60+x.minute)
## hour label (end of the 3-hour bucket)
fully_expended_ds['hour'] = fully_expended_ds.time.apply(lambda x :  "%dh" % (math.ceil((x.hour+1)/3)*3 ) )
## month label
fully_expended_ds['month'] = fully_expended_ds.time.apply(lambda x : x.month)
## weekday label (Monday to Sunday, in Korean)
weekday_str = '월 화 수 목 금 토 일'.split()
fully_expended_ds['weekday'] = fully_expended_ds.time.apply(lambda x : weekday_str[x.weekday()])
## app rank by total usage time (DataFrame.sort was removed from pandas; use sort_values)
apps_stat = total_ds.groupby('filename')['usetime'].sum().sort_values(ascending=False).reset_index()
apps_stat['rank'] = range(1,apps_stat.shape[0]+1)
fully_expended_ds = pd.merge(fully_expended_ds, apps_stat, on='filename')
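The key step in this cell is the expansion: a single record with a long `usetime` becomes many rows, one per 5-second slice, so that the density of rows along the time axis reflects how long the PC was used. The sketch below isolates that step with plain Python. It assumes, as the division by 1000 implies, that `usetime` is in milliseconds; the function name `expand_record` is made up for this illustration.

```python
def expand_record(timestamp, usetime_ms, step_ms=5000):
    """Split one usage record into one Unix timestamp per `step_ms` slice."""
    return [timestamp + round(offset / 1000)
            for offset in range(0, int(usetime_ms), step_ms)]

# A 15-second record starting at 1391404937 becomes three 5-second slices.
print(expand_record(1391404937, 15000))

# The first record in the CSV (270000 ms of Excel) becomes this many rows.
print(len(expand_record(1391404937, 270000)))
```

Because the range starts at zero, a record shorter than 5 seconds still produces one row, and a record with a `usetime` of 0 produces none.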

In [11]:

fully_expended_ds.head()

Out[11]:

	total	ts	time	year	days	filename	jobclass	YYYYQt	day-minute	hour	month	weekday	usetime	rank
0	total	1391404937	2014-02-03 14:22:17	2014	34	EXCEL.EXE	문서	14' 1Q	862	15h	2	월	434725000	6
1	total	1391404942	2014-02-03 14:22:22	2014	34	EXCEL.EXE	문서	14' 1Q	862	15h	2	월	434725000	6
2	total	1391404947	2014-02-03 14:22:27	2014	34	EXCEL.EXE	문서	14' 1Q	862	15h	2	월	434725000	6
3	total	1391404952	2014-02-03 14:22:32	2014	34	EXCEL.EXE	문서	14' 1Q	862	15h	2	월	434725000	6
4	total	1391404957	2014-02-03 14:22:37	2014	34	EXCEL.EXE	문서	14' 1Q	862	15h	2	월	434725000	6
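The derived columns in the first row can be checked by hand. The sketch below recomputes `days`, `day-minute`, and `hour` for the timestamp 2014-02-03 14:22:17 using the same formulas as the cell above; the helper names `hour_label` and `day_of_year` are introduced only for this example.

```python
import math
from datetime import datetime

def hour_label(hour):
    """Map an hour of day (0-23) to the end of its 3-hour bucket, e.g. 14 -> '15h'."""
    return "%dh" % (math.ceil((hour + 1) / 3) * 3)

def day_of_year(moment):
    """Return 1 for January 1st, matching the `days` column."""
    return (moment - datetime(moment.year, 1, 1)).days + 1

first_row = datetime(2014, 2, 3, 14, 22, 17)
print(day_of_year(first_row))                    # `days` column
print(first_row.hour * 60 + first_row.minute)    # `day-minute` column
print(hour_label(first_row.hour))                # `hour` column
```

The hour labels therefore take the values 3h, 6h, and so on up to 24h, each naming the end of a 3-hour window.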

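Drawing violin charts¶

A violin chart shows, for each category, a box plot of the input data together with its kernel density estimation (KDE) plot.
Instead of the box plot in the center, sticks or points can be drawn by changing the `inner` parameter.

My PC usage in 2014 and 2015 as a violin plot over one year (days 1 to 365)¶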

In [12]:

import seaborn as sns
# plot style 
sns.set(style="whitegrid", palette="colorblind", font_scale=1.4, rc={'font.family':'NanumGothic'} )

Setting hue to plot two groups in a single violin¶

In [13]:

sns.violinplot(data=fully_expended_ds, x='days',y='total', hue='year', split=True )

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x1317f52e8>

Tidying it up¶

In [14]:

plt.figure(figsize=(12,8))
sns.set(style="whitegrid", palette="colorblind", color_codes=True, font_scale=1.4,
        rc = {'font.size': 12, 'font.family':'NanumGothic'})
## plotting
g = sns.violinplot(data=fully_expended_ds, x='days', y='total', hue='year'
                   , scale="width", orient='h', split=True, cut=2 )
## draw x-ticks
ticks = fully_expended_ds.groupby('month').max()[['days']]
plt.xticks( ticks.days.tolist(), [ "%d월" % m for m in ticks.index.tolist()] )
plt.xlabel('')
## set x-axis range
plt.xlim(-50, 400)
plt.show()

In [15]:

tmp_ds = fully_expended_ds[fully_expended_ds['year']==2015].groupby(['YYYYQt','days']).count()[['usetime']].reset_index()
tmp_ds = tmp_ds[tmp_ds['usetime']>12*120]
(tmp_ds.groupby('YYYYQt').mean()[['usetime']]*5/60/60).T

Out[15]:

YYYYQt	15' 1Q	15' 2Q	15' 3Q	15' 4Q
usetime	6.987449	6.924741	6.291538	5.447402
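The arithmetic in this cell is easier to follow once the units are spelled out. Each row of the expanded dataset stands for 5 seconds, so 12 rows make a minute and the threshold `12*120` keeps only days with more than two hours of recorded usage; multiplying the mean row count by `5/60/60` then converts it to hours. The table is therefore the average daily usage in hours per quarter of 2015, over those days. A small sketch with made-up daily counts (the function names are hypothetical):

```python
SLICE_SECONDS = 5  # each row of the expanded dataset stands for 5 seconds

def rows_to_hours(row_count):
    """Convert a number of 5-second rows into hours."""
    return row_count * SLICE_SECONDS / 60 / 60

def mean_active_hours(daily_row_counts, min_rows=12 * 120):
    """Average hours per day over days with more than `min_rows` rows."""
    active = [n for n in daily_row_counts if n > min_rows]
    return rows_to_hours(sum(active) / len(active)) if active else 0.0

print(rows_to_hours(12 * 120))                # the filter threshold, in hours
print(mean_active_hours([100, 2880, 5760]))   # made-up daily row counts
```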

Plotting several groups along the Y axis¶

In [16]:

sns.violinplot(data=fully_expended_ds[fully_expended_ds['rank']<11], x='days', y='filename' )

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b068710>

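Plotting a tidied-up version¶

Key parameters¶

y : the categorical variable; one violin is drawn per category
hue : a second categorical variable; with split=True its two levels share one violin, one on each half
scale : how the width of each category's violin is scaled ("area", "count", or "width")
bw : the bandwidth of the kernel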
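The effect of `scale` is easiest to see by looking at the widest point of each violin. With "width" every violin gets the same maximum width, with "count" the width is proportional to the number of observations in the category, and with "area" the density curves are drawn on a common scale so every violin has the same area. The following is a rough illustration of that behaviour, not seaborn's actual code; the function `max_half_widths`, its arguments, and the sample numbers are assumptions made for this sketch.

```python
def max_half_widths(group_sizes, peak_densities, scale="area", width=0.8):
    """Half-width of each violin at its widest point under a given `scale`.

    group_sizes:    category -> number of observations
    peak_densities: category -> highest value of that category's KDE curve
    """
    half = width / 2
    if scale == "width":
        # every violin is stretched to the same maximum width
        return {k: half for k in group_sizes}
    if scale == "count":
        # width is proportional to the number of observations
        biggest = max(group_sizes.values())
        return {k: half * n / biggest for k, n in group_sizes.items()}
    if scale == "area":
        # densities share one scale, so the tallest curve sets the width
        tallest = max(peak_densities.values())
        return {k: half * d / tallest for k, d in peak_densities.items()}
    raise ValueError("scale must be 'area', 'count' or 'width'")

sizes = {"chrome.exe": 100, "EXCEL.EXE": 50}      # made-up counts
peaks = {"chrome.exe": 0.02, "EXCEL.EXE": 0.04}   # made-up KDE peaks
for option in ("width", "count", "area"):
    print(option, max_half_widths(sizes, peaks, scale=option))
```

This is why the weekday comparison at the end of the notebook uses "count": days with less usage get visibly thinner violins.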

In [17]:

import seaborn as sns

## draw a tidied-up violin plot
def drawViolin(ds, x, y, hue, label=None, figsize=(14,50), order=None, scale='width') :
    sns.set(style="whitegrid", palette="colorblind", font_scale=1.4, rc={'font.family':'NanumGothic'} )
    plt.figure(figsize=figsize)
    ## default order: categories with the most rows first (DataFrame.sort was removed; use sort_values)
    order_list = order or ds.groupby(y).count().sort_values(x, ascending=False).index.tolist()
    
    ## plotting
    g = sns.violinplot(data=ds, x=x, y=y, hue=hue, scale=scale, orient='h'
                       , cut=2, split=True, inner='box'
                       , order = order_list
                      )
    ## also show tick labels along the top edge
    plt.tick_params(labeltop=True)

    if label is not None:
        ## x ticks: put each label at the largest x value in its group
        label_ds = ds.groupby(label).max()
        x_index = label_ds[x].values.tolist()
        x_label = label_ds.index.tolist()
        plt.xticks(x_index, x_label, rotation='vertical')

    plt.xlabel('')
    plt.ylabel('')
    

How usage time changed for each PC app¶

In [18]:

drawViolin(fully_expended_ds[fully_expended_ds['rank']<11], x='ts', y='filename', hue=None, label='YYYYQt', figsize=(14,20))

Grouping the PC apps into job classes and looking at the trend¶

In [19]:

drawViolin(fully_expended_ds, x='ts', y='jobclass', hue=None, label='YYYYQt', figsize=(14,20))

How PC usage changes by time of day¶

In [20]:

drawViolin(fully_expended_ds, x='day-minute', y='total', hue='year', label='hour', figsize=(14,6))

In [21]:

weekday_str = '월 화 수 목 금'.split()
drawViolin(fully_expended_ds[fully_expended_ds['weekday'].isin(weekday_str)], x='day-minute', y='weekday'
             , hue='year', label='hour', figsize=(14,20), order=weekday_str, scale='count')
