2015 Python User Survey¶

To find out who Korean-speaking Python users are and how they develop software, we ran a survey for three days, from August 27 to 29, 2015. This notebook publishes the analysis that was presented at PyCon.KR on the 30th.

Please send questions or requests for corrections to 유재명 (euphoris@gmail.com).

In [1]:

%matplotlib inline

In [2]:

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn

In [3]:

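data = pandas.read_csv('clean.csv')
# Columns from index 19 onward are, presumably, 0/1 flags: one per library or tool
libs = data.columns.values[19:]
# Share of respondents (%) who reported using each one
m = data[libs].mean() * 100
# Column names: '라이브러리' = library, '비율' = share (%)
df = pandas.DataFrame({'라이브러리': libs, '비율': m})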

In [4]:

seaborn.set_context("notebook")
seaborn.set(font='NanumBarunGothic', font_scale=1.5)
plt.figure(figsize=(12, 8))

Out[4]:

<matplotlib.figure.Figure at 0x7fc1e89a3940>

In [5]:

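def plot_ratio(libs, ylab='Library'):
    # Horizontal bar chart of the usage share (%) of the given libraries
    ax = seaborn.barplot(x='비율', y='라이브러리', data=df.loc[libs])
    # Label the axes on the returned Axes object
    ax.set(xlabel='Share (%)', ylabel=ylab)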

Python usage¶

Python implementations used in the past month¶

In [6]:

plot_ratio(['v2.5', 'v2.6', 'v2.7', 'v3.0', 'v3.1', 'v3.2', 'v3.3', 'v3.4', 'pypy'], 'Implementation')

In [7]:

def plot_fc(column):
    # Share (%) of respondents per answer, most common first (missing answers excluded)
    counts = data[column].value_counts()
    m = pandas.DataFrame({'라이브러리': counts.index,
                          column: counts.values / data[column].count() * 100})
    ax = seaborn.barplot(y='라이브러리', x=column, data=m)
    ax.set(xlabel='Share (%)', ylabel=column)

Primary Python editor/IDE¶

In [8]:

plot_fc('editor')

Operating system mainly used in personal development environments¶

In [9]:

plot_fc('os')

Libraries used in the past year¶

Databases¶

In [10]:

plot_ratio(['SQLAlchemy', 'MySQL-python','redis','DjangoORM','psycopg2','Storm'])

Web development¶

In [11]:

plot_ratio(['flask', 'django', 'tornado', 'bottle', 'pyramid', 'falcon', 'wheezy',])

Scientific computing and data analysis¶

In [12]:

plot_ratio(['numpy', 'scipy', 'pandas', 'scikit-learn', 'statmodels', 'sympy'])

Testing¶

In [13]:

plot_ratio(['requests', 'BeautifulSoup', 'lxml', 'selenium', 'html5lib', 'Scrapy', 'aiohttp', ])

Development environment¶

In [14]:

plot_ratio(['pip', 'virtualenv', 'setuptools', 'anaconda', ])

Salary¶

In [15]:

def compare_user_group(lib, y='salary'):
    # Mean of y (salary by default) for each value of the lib usage column
    m = data.groupby(lib).agg({y: 'mean'})
    m['_usage'] = m.index
    ax = seaborn.barplot(y=y, x='_usage', data=m)
    ax.set(xlabel=lib, ylabel=y)

Libraries and salary¶

For some libraries, such as SQLAlchemy and virtualenv, users turned out to have lower salaries than non-users. The reason is...

In [16]:

compare_user_group('SQLAlchemy')

In [17]:

compare_user_group('virtualenv')

In [18]:

data['total'] = data.career + data.usage

In [19]:

data_salary = data.loc[data.salary > 1000,:]

Career length and salary¶

Unsurprisingly, the longer the career, the higher the salary.

In [20]:

seaborn.regplot(x="career", x_jitter=.2, y="salary", data=data_salary)
plt.xlabel('Career (years)')
plt.ylabel('Salary')

The longer you use Python...¶

Interestingly, people who have used Python longer also have higher salaries.

In [21]:

seaborn.regplot(x="usage", x_jitter=.2, y="salary", data=data_salary)
plt.xlabel('Years of Python use')
plt.ylabel('Salary')

Company size¶

More interestingly still, no real relationship shows up between company size and salary.

In [22]:

seaborn.regplot(x="company", x_jitter=.2, y="salary", data=data_salary)
plt.xlabel('Company size (number of employees)')
plt.ylabel('Salary')

The same holds when very large companies are excluded and only companies with fewer than 1,000 employees are considered.

In [23]:

seaborn.regplot(x="company", x_jitter=.2, y="salary", data=data_salary.loc[data_salary.company < 1000])
plt.xlabel('Company size (number of employees)')
plt.ylabel('Salary')

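Regression analysis¶

Using a statistical technique called linear regression, let's see how career length, years of Python use, and company size contribute to predicting salary. In the second table below, the coef column shows how much salary changes on average for each one-unit increase in that variable. One more year of career raises salary by about 164 (1.64 million KRW, since salary is recorded in units of 10,000 KRW), and one more year of Python use raises it by about 165 (1.65 million KRW).

In other words, a developer with 10 years of career who has never used Python and a newly hired developer who has used Python for 10 years would earn about the same on average.

Company size, again, does little to predict salary (it is not statistically significant at the 5% significance level).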

In [24]:

import statsmodels.formula.api as smf

smf.ols('salary ~ career + usage + company', data=data_salary).fit().summary()

Out[24]:

OLS Regression Results
Dep. Variable:	salary	R-squared:	0.270
Model:	OLS	Adj. R-squared:	0.238
Method:	Least Squares	F-statistic:	8.619
Date:	Mon, 31 Aug 2015	Prob (F-statistic):	6.05e-05
Time:	23:36:54	Log-Likelihood:	-673.86
No. Observations:	74	AIC:	1356.
Df Residuals:	70	BIC:	1365.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	3069.0658	497.339	6.171	0.000	2077.155 4060.977
career	164.0749	55.642	2.949	0.004	53.101 275.049
usage	165.5141	67.498	2.452	0.017	30.893 300.135
company	0.0943	0.068	1.383	0.171	-0.042 0.230

Omnibus:	25.106	Durbin-Watson:	2.039
Prob(Omnibus):	0.000	Jarque-Bera (JB):	41.297
Skew:	1.298	Prob(JB):	1.08e-09
Kurtosis:	5.579	Cond. No.	7.64e+03
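As a rough illustration of how to read the coef column, the sketch below plugs the coefficients from the table above into the linear formula and evaluates the two cases mentioned in the text. The `predict_salary` helper is hypothetical (it is not part of the original notebook), and the company term is set to 0 in both cases so that only career and Python years differ.

```python
# Coefficients copied from the OLS summary above (salary ~ career + usage + company).
# Salary is in units of 10,000 KRW.
COEF = {
    "Intercept": 3069.0658,
    "career": 164.0749,
    "usage": 165.5141,
    "company": 0.0943,
}


def predict_salary(career, usage, company=0):
    """Hypothetical helper: evaluate the fitted linear model at one point."""
    return (COEF["Intercept"]
            + COEF["career"] * career
            + COEF["usage"] * usage
            + COEF["company"] * company)


# 10-year veteran who never used Python vs. newcomer with 10 years of Python
veteran_no_python = predict_salary(career=10, usage=0)
newcomer_python = predict_salary(career=0, usage=10)
print(round(veteran_no_python, 1), round(newcomer_python, 1))
```

Because the career and usage coefficients are nearly equal, the prediction depends almost entirely on the sum of the two variables. These are point predictions only; the standard errors in the table show that each coefficient is estimated with considerable uncertainty.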

To make this more intuitive, I plotted salary against the sum of years of Python use and career length on the horizontal axis. The pattern looks a bit clearer, doesn't it?

In [25]:

seaborn.regplot(x="total", x_jitter=.2, y="salary", data=data_salary)
plt.xlabel('Years of Python use + career')
plt.ylabel('Salary')

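I also added SQLAlchemy usage to the regression. In this model, the effect of SQLAlchemy usage on salary is not statistically significant at the 5% significance level. This suggests that the apparent drop in salary among SQLAlchemy users probably arises because SQLAlchemy is used mainly by people with shorter careers or fewer years of Python use.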

In [26]:

smf.ols('salary ~ SQLAlchemy + career + usage + company', data=data).fit().summary()

Out[26]:

OLS Regression Results
Dep. Variable:	salary	R-squared:	0.265
Model:	OLS	Adj. R-squared:	0.228
Method:	Least Squares	F-statistic:	7.289
Date:	Mon, 31 Aug 2015	Prob (F-statistic):	4.58e-05
Time:	23:36:54	Log-Likelihood:	-796.65
No. Observations:	86	AIC:	1603.
Df Residuals:	81	BIC:	1616.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	2804.2040	666.509	4.207	0.000	1478.060 4130.348
SQLAlchemy	-793.3610	584.673	-1.357	0.179	-1956.677 369.955
career	100.4852	58.961	1.704	0.092	-16.828 217.798
usage	266.9966	73.883	3.614	0.001	119.992 414.001
company	0.1018	0.080	1.264	0.210	-0.058 0.262

Omnibus:	4.996	Durbin-Watson:	1.874
Prob(Omnibus):	0.082	Jarque-Bera (JB):	4.657
Skew:	0.371	Prob(JB):	0.0974
Kurtosis:	3.865	Cond. No.	1.05e+04

The same goes for virtualenv. Rest assured: SQLAlchemy and virtualenv are not eating into your salary.

In [27]:

smf.ols('salary ~ virtualenv + career + usage + company', data=data).fit().summary()

Out[27]:

OLS Regression Results
Dep. Variable:	salary	R-squared:	0.259
Model:	OLS	Adj. R-squared:	0.223
Method:	Least Squares	F-statistic:	7.086
Date:	Mon, 31 Aug 2015	Prob (F-statistic):	6.07e-05
Time:	23:36:55	Log-Likelihood:	-796.97
No. Observations:	86	AIC:	1604.
Df Residuals:	81	BIC:	1616.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	2875.2314	769.643	3.736	0.000	1343.883 4406.580
virtualenv	-723.3209	651.652	-1.110	0.270	-2019.904 573.262
career	95.9677	60.345	1.590	0.116	-24.100 216.035
usage	276.8826	74.184	3.732	0.000	129.280 424.485
company	0.0973	0.082	1.183	0.240	-0.066 0.261

Omnibus:	4.521	Durbin-Watson:	1.879
Prob(Omnibus):	0.104	Jarque-Bera (JB):	3.892
Skew:	0.387	Prob(JB):	0.143
Kurtosis:	3.699	Cond. No.	1.23e+04
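The reasoning above, that a library can appear to lower salary simply because less experienced people use it more, can be illustrated with a small simulation. The sketch below uses purely synthetic data (not the survey), an assumed salary formula, and a hypothetical `uses_lib` flag that has no effect on salary by construction. It uses plain NumPy least squares rather than statsmodels to stay self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data, NOT the survey: salary depends only on experience.
experience = rng.uniform(0, 15, n)
salary = 3000 + 160 * experience + rng.normal(0, 300, n)

# Less experienced people are more likely to use the (hypothetical) library,
# but using it has no effect on salary by construction.
uses_lib = (experience + rng.normal(0, 3, n) < 5).astype(float)

# Naive comparison: mean salary of users minus mean salary of non-users.
naive_gap = salary[uses_lib == 1].mean() - salary[uses_lib == 0].mean()

# Adjusted comparison: least squares with experience included as a control.
X = np.column_stack([np.ones(n), uses_lib, experience])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
adjusted_gap = coef[1]

print(f"naive gap: {naive_gap:.0f}, adjusted gap: {adjusted_gap:.0f}")
```

In this setup the naive gap is strongly negative, while the gap after controlling for experience is close to zero. That is the same pattern the bar charts and regressions above suggest, although the survey sample is far smaller and the real relationships are certainly messier than this toy model.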

I also plotted career length against years of Python use. The two are fairly closely related, but there are also many exceptions.

In [28]:

seaborn.regplot(x="career", x_jitter=.2, y="usage", data=data)
plt.xlabel('Career (years)')
plt.ylabel('Years of Python use')

In [29]:

numpy.corrcoef(data.career, data.usage)[0,1]

Out[29]:

0.3604925162478223
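For reference, the value above is the Pearson correlation coefficient, which `numpy.corrcoef` returns in the off-diagonal entry of a 2x2 matrix. The sketch below computes the same quantity by hand on made-up toy values (not survey data) and compares it with NumPy's result. The `pearson_r` helper is hypothetical and exists only for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


# Made-up toy values for illustration only
career = [1, 3, 5, 7, 10]
usage = [1, 1, 4, 2, 6]

r = pearson_r(career, usage)
r_numpy = np.corrcoef(career, usage)[0, 1]
print(r, r_numpy)
```

A coefficient of about 0.36 indicates a positive but far from perfect linear association, which fits the many exceptions visible in the scatter plot.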