In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import iplot, init_notebook_mode
%matplotlib inline
In [2]:
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (20, 6)

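Questions

  • What share of developers use Python 2 versus Python 3?
  • How many of the people doing data analysis and machine learning use Python 3?
  • What is the Python 2 / Python 3 split among users of common frameworks?
  • Which frameworks do people doing data analysis and machine learning commonly use?
  • How does company size relate to Python 3 usage?
  • How does developer age relate to Python 3 usage?
  • How are Python 3 and Python 2 developers distributed by country?
  • How do developers use IDEs?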

Loading the dataset

In [3]:
survey_df = pd.read_csv('pythondevsurvey2017_raw_data.csv')
survey_df.columns = [c.lower() for c in survey_df.columns]
survey_df.head()
Out[3]:
is python the main language you use for your current projects? none:what other language(s) do you use? java:what other language(s) do you use? javascript:what other language(s) do you use? c/c++:what other language(s) do you use? php:what other language(s) do you use? c#:what other language(s) do you use? ruby:what other language(s) do you use? bash / shell:what other language(s) do you use? objective-c:what other language(s) do you use? ... technical support:which of the following best describes your job role(s)? data analyst:which of the following best describes your job role(s)? business analyst:which of the following best describes your job role(s)? team lead:which of the following best describes your job role(s)? product manager:which of the following best describes your job role(s)? cio / ceo / cto:which of the following best describes your job role(s)? systems analyst:which of the following best describes your job role(s)? other - write in::which of the following best describes your job role(s)? could you tell us your age range? what country do you live in?
0 Yes NaN NaN JavaScript NaN PHP NaN NaN Bash / Shell NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 60 or older Italy
1 Yes NaN NaN JavaScript NaN NaN NaN NaN NaN NaN ... NaN NaN NaN Team lead NaN NaN NaN NaN 40-49 United Kingdom
2 Yes NaN NaN JavaScript NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 40-49 France
3 No, I don’t use Python for my current projects NaN NaN NaN NaN NaN C# NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 17 or younger Spain
4 Yes NaN Java NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 18-20 Israel

5 rows × 162 columns

In [4]:
survey_df.shape
Out[4]:
(9506, 162)
In [5]:
def find_cols(df, kws):
    '''Return the columns of df whose names contain every keyword in kws.'''
    return [item for item in df.columns if all(w in item for w in kws)]
In [6]:
find_cols(df=survey_df, kws=['python', 'version'])
Out[6]:
['which version of python do you use the most?',
 'installer from python.org:what do you typically use to upgrade your python version?',
 'build from source:what do you typically use to upgrade your python version?',
 'automatic upgrade via cloud provider:what do you typically use to upgrade your python version?',
 'enthought:what do you typically use to upgrade your python version?',
 'anaconda:what do you typically use to upgrade your python version?',
 'activepython:what do you typically use to upgrade your python version?',
 'intel distribution for python:what do you typically use to upgrade your python version?',
 'os-provided python (via apt-get, yum, homebrew, etc.):what do you typically use to upgrade your python version?',
 'pyenv:what do you typically use to upgrade your python version?',
 'pythonz:what do you typically use to upgrade your python version?',
 'other - write in::what do you typically use to upgrade your python version?']

What share of developers use Python 2 versus Python 3?

In [7]:
python_version = survey_df['which version of python do you use the most?']
python_version.describe()
Out[7]:
count         8112
unique           2
top       Python 3
freq          6046
Name: which version of python do you use the most?, dtype: object
In [8]:
python_version.value_counts(normalize=True, dropna=False)
Out[8]:
Python 3    0.636019
Python 2    0.217336
NaN         0.146644
Name: which version of python do you use the most?, dtype: float64
In [9]:
python_version.value_counts(normalize=False, dropna=False)
Out[9]:
Python 3    6046
Python 2    2066
NaN         1394
Name: which version of python do you use the most?, dtype: int64
In [10]:
python_version.value_counts(normalize=True, dropna=True)
Out[10]:
Python 3    0.745316
Python 2    0.254684
Name: which version of python do you use the most?, dtype: float64
In [11]:
python_version.value_counts(normalize=True, dropna=True).plot(kind='pie', 
                                                              figsize=(5, 5), 
                                                              startangle=90, 
                                                              autopct='%.0f%%', 
                                                              fontsize=14,
                                                              colors=sns.color_palette('rainbow')[:2])
plt.title('Python 2 VS Python 3', fontsize=18)
plt.ylabel('');
plt.tight_layout()
plt.savefig('python-version.png')

Among the respondents who answered this question, roughly 75% already use Python 3 most of the time.
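The gap between the 64% and 75% figures comes only from whether non-respondents are counted in the denominator. A minimal sketch in plain Python that redoes the arithmetic from the counts printed above (the variable names are illustrative):

```python
# Counts copied from the value_counts output above.
counts = {"Python 3": 6046, "Python 2": 2066}
no_answer = 1394  # respondents who left the question blank (NaN)

answered = sum(counts.values())  # respondents who picked a version
total = answered + no_answer     # every row in the survey

# Share among all respondents (dropna=False) vs. among those who answered (dropna=True).
share_all = {k: round(v / total, 6) for k, v in counts.items()}
share_answered = {k: round(v / answered, 6) for k, v in counts.items()}

print(share_all)
print(share_answered)
```

The pie chart uses the second set of shares, so it describes developers who reported a version, not the whole sample.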

How many of the people doing data analysis and machine learning use Python 3?

In [12]:
python_da_ml = survey_df[['machine learning:\xa0what do you use python for?', 'data analysis:\xa0what do you use python for?', 'which version of python do you use the most?']]
In [13]:
python_da_ml.dtypes
Out[13]:
machine learning: what do you use python for?    object
data analysis: what do you use python for?       object
which version of python do you use the most?     object
dtype: object
In [14]:
python_da = pd.crosstab(python_da_ml['which version of python do you use the most?'], python_da_ml['data analysis:\xa0what do you use python for?'], normalize=True)
In [15]:
python_da
Out[15]:
data analysis: what do you use python for?    Data analysis
which version of python do you use the most?
Python 2                                           0.233177
Python 3                                           0.766823
In [16]:
python_ml = pd.crosstab(python_da_ml['which version of python do you use the most?'], python_da_ml['machine learning:\xa0what do you use python for?'], normalize=True)
In [17]:
pd.concat([python_da, python_ml], axis=1)
Out[17]:
                                              Data analysis  Machine learning
which version of python do you use the most?
Python 2                                           0.233177          0.193548
Python 3                                           0.766823          0.806452
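Each multiple-choice option in this survey is stored in its own column that holds either the option label or NaN. The crosstab therefore has a single column, and normalize=True divides by the number of rows where both answers are present. A minimal sketch on made-up data (the toy frame and its column names are illustrative, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy data (made up): a ticked option is stored as its label, an unticked one as NaN.
toy = pd.DataFrame({
    "version": ["Python 3", "Python 3", "Python 2", "Python 3", np.nan],
    "data analysis": ["Data analysis", np.nan, "Data analysis",
                      "Data analysis", "Data analysis"],
})

# Rows with NaN in either column are left out, so 3 of the 5 rows are counted.
shares = pd.crosstab(toy["version"], toy["data analysis"], normalize=True)
print(shares)
```

Because only one column survives, the shares in that column sum to 1, which is why the tables above can be read directly as the Python 2 / Python 3 split within each group.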
In [18]:
pd.concat([python_da, python_ml], axis=1).T.plot(kind='bar', figsize=(10, 5), color=sns.color_palette('rainbow'))
plt.xticks(rotation=0, fontsize=14)
plt.title('Data Analysis and Machine Learning VS Python version', fontsize=18)
plt.legend(title=None)
plt.tight_layout()
plt.savefig('data-analysis-machine-learning-vs-python-version.png')

In [19]:
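# Note: kws is passed as a single string here, so find_cols tests each character
# separately; that is why a few non-framework columns also appear in the result.
cols = find_cols(survey_df, 'what framework(s) do you use in addition to python?')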
cols
Out[19]:
['none:what framework(s) do you use in addition to python?',
 'django:what framework(s) do you use in addition to python?',
 'flask:what framework(s) do you use in addition to python?',
 'tornado:what framework(s) do you use in addition to python?',
 'bottle:what framework(s) do you use in addition to python?',
 'web2py:what framework(s) do you use in addition to python?',
 'numpy / pandas / matplotlib / scipy and similar:what framework(s) do you use in addition to python?',
 'keras / theano / tensorflow / scikit-learn and similar:what framework(s) do you use in addition to python?',
 'pillow:what framework(s) do you use in addition to python?',
 'pyqt / pygtk / wxpython:what framework(s) do you use in addition to python?',
 'tkinter:what framework(s) do you use in addition to python?',
 'pygame:what framework(s) do you use in addition to python?',
 'cherrypy:what framework(s) do you use in addition to python?',
 'twisted:what framework(s) do you use in addition to python?',
 'pyramid:what framework(s) do you use in addition to python?',
 'requests:what framework(s) do you use in addition to python?',
 'asyncio:what framework(s) do you use in addition to python?',
 'kivy:what framework(s) do you use in addition to python?',
 'six:what framework(s) do you use in addition to python?',
 'aiohttp:what framework(s) do you use in addition to python?',
 'other - write in::what framework(s) do you use in addition to python?',
 'cloud platforms (google app engine, aws, rackspace, heroku and similar):what additional technology(s) do you use in addition to python?',
 'jupyter notebook:what editor(s)/ide(s) have you considered for use in your python development?',
 'komodo editor:what editor(s)/ide(s) have you considered for use in your python development?',
 'komodo ide:what editor(s)/ide(s) have you considered for use in your python development?']
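The last four entries in this output are not framework columns. They match because a bare string passed as kws is iterated character by character, and those column names happen to contain every character of the question text. A hedged sketch of a variant that treats a string as one keyword (find_cols_safe is a hypothetical helper, and it takes a list of column names rather than a DataFrame to stay self-contained):

```python
def find_cols_safe(columns, kws):
    """Return the column names that contain every keyword in kws.

    Hypothetical variant of find_cols: a bare string is treated as one
    keyword instead of being iterated character by character.
    """
    if isinstance(kws, str):
        kws = [kws]
    return [c for c in columns if all(kw in c for kw in kws)]


question = 'what framework(s) do you use in addition to python?'
columns = [
    'django:what framework(s) do you use in addition to python?',
    'flask:what framework(s) do you use in addition to python?',
    'jupyter notebook:what editor(s)/ide(s) have you considered for use in your python development?',
    'komodo ide:what editor(s)/ide(s) have you considered for use in your python development?',
]

# Character-by-character matching (what a bare string triggers) keeps all four names.
per_char = [c for c in columns if all(ch in c for ch in question)]
# Whole-phrase matching keeps only the two framework columns.
per_phrase = find_cols_safe(columns, question)

print(len(per_char), len(per_phrase))
```

Applying this to the survey would change the column count used in the cells that follow, so the recorded outputs below still reflect the original 24-column selection.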
In [20]:
# Skip the first match (the 'none' option) and keep every other matched column.
frameworks = survey_df[cols[1:]]
frameworks.head()
Out[20]:
django:what framework(s) do you use in addition to python? flask:what framework(s) do you use in addition to python? tornado:what framework(s) do you use in addition to python? bottle:what framework(s) do you use in addition to python? web2py:what framework(s) do you use in addition to python? numpy / pandas / matplotlib / scipy and similar:what framework(s) do you use in addition to python? keras / theano / tensorflow / scikit-learn and similar:what framework(s) do you use in addition to python? pillow:what framework(s) do you use in addition to python? pyqt / pygtk / wxpython:what framework(s) do you use in addition to python? tkinter:what framework(s) do you use in addition to python? ... requests:what framework(s) do you use in addition to python? asyncio:what framework(s) do you use in addition to python? kivy:what framework(s) do you use in addition to python? six:what framework(s) do you use in addition to python? aiohttp:what framework(s) do you use in addition to python? other - write in::what framework(s) do you use in addition to python? cloud platforms (google app engine, aws, rackspace, heroku and similar):what additional technology(s) do you use in addition to python? jupyter notebook:what editor(s)/ide(s) have you considered for use in your python development? komodo editor:what editor(s)/ide(s) have you considered for use in your python development? komodo ide:what editor(s)/ide(s) have you considered for use in your python development?
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Django Flask Tornado NaN NaN NumPy / pandas / Matplotlib / scipy and similar NaN Pillow NaN NaN ... Requests NaN NaN six NaN Other - Write In: NaN NaN NaN NaN
2 Django NaN NaN NaN NaN NaN NaN NaN NaN NaN ... Requests NaN NaN six NaN NaN NaN NaN NaN Komodo IDE
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NumPy / pandas / Matplotlib / scipy and similar Keras / Theano / TensorFlow / scikit-learn and... Pillow NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 24 columns

In [21]:
count_df = frameworks.count().sort_values(ascending=False)
count_df.index = [item.split(':')[0] for item in count_df.index]

count_df.plot(kind='bar', color=sns.color_palette('rainbow', frameworks.shape[1]))
plt.xticks(fontsize=14)
Out[21]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]), <a list of 24 Text xticklabel objects>)
In [22]:
values = frameworks.count().sort_values(ascending=False).values
labels = [item.split(':')[0] for item in frameworks.count().sort_values(ascending=False).index]

plt.figure(figsize=(20, 17))
sns.barplot(x=values, y=labels, orient='h', palette=sns.color_palette("rainbow", 24))
plt.xticks(fontsize=14)
plt.yticks(fontsize=18)
plt.tight_layout()
plt.savefig('frameworks.png')

Python 2 versus Python 3 among users of common frameworks

In [23]:
python_ver = survey_df['which version of python do you use the most?']
In [24]:
def process_col(col):
    return pd.crosstab(index=python_ver, columns=col).iloc[:, 0]
In [25]:
# process_col(frameworks['django:what framework(s) do you use in addition to python?'])
In [26]:
frameworks_pyver = frameworks.apply(lambda col: pd.crosstab(index=python_ver, columns=col).iloc[:, 0])
frameworks_pyver.columns = [item.split(':')[0] for item in frameworks.columns]
frameworks_pyver
Out[26]:
django flask tornado bottle web2py numpy / pandas / matplotlib / scipy and similar keras / theano / tensorflow / scikit-learn and similar pillow pyqt / pygtk / wxpython tkinter ... requests asyncio kivy six aiohttp other - write in cloud platforms (google app engine, aws, rackspace, heroku and similar) jupyter notebook komodo editor komodo ide
which version of python do you use the most?
Python 2 841 678 144 83 97 727 264 333 299 175 ... 763 95 70 237 44 223 551 346 43 59
Python 3 2522 1929 366 199 235 2436 1096 924 830 763 ... 2006 664 319 389 395 426 1409 1394 121 126

2 rows × 24 columns

In [27]:
frameworks_pyver_ratio = frameworks_pyver / frameworks_pyver.sum(axis=0)
In [28]:
frameworks_pyver_ratio.T.plot(kind='bar', color=sns.color_palette('rainbow'))
plt.xticks(rotation=90, fontsize=14)
Out[28]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]), <a list of 24 Text xticklabel objects>)
In [29]:
df = frameworks_pyver_ratio.stack().reset_index()
df.columns=['pyver', 'framework', 'value']
df.head()
Out[29]:
      pyver framework     value
0  Python 2    django  0.250074
1  Python 2     flask  0.260069
2  Python 2   tornado  0.282353
3  Python 2    bottle  0.294326
4  Python 2    web2py  0.292169
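The ratios in this table can be checked by hand: each framework's counts are divided by that framework's column total, so every column sums to 1. A minimal sketch using two columns of counts copied from the frameworks_pyver output (the names counts, ratio, and tidy are illustrative):

```python
import pandas as pd

# Counts for two frameworks, copied from the frameworks_pyver table above.
counts = pd.DataFrame(
    {"django": [841, 2522], "flask": [678, 1929]},
    index=["Python 2", "Python 3"],
)

# Dividing by the column totals aligns on column labels; each column then sums to 1.
ratio = counts / counts.sum(axis=0)

# Long form: one row per (version, framework) pair, as expected by seaborn's hue API.
tidy = ratio.stack().reset_index()
tidy.columns = ["pyver", "framework", "value"]
print(tidy.round(6))
```

The long form is what lets barplot and stripplot take pyver as the hue variable.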
In [30]:
plt.figure(figsize=(20, 17))
sns.barplot(x='value', y='framework', hue='pyver', data=df, orient='h', palette=sns.color_palette('rainbow'))
plt.yticks(fontsize=18)
Out[30]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]), <a list of 24 Text yticklabel objects>)
In [31]:
sns.distplot(frameworks_pyver_ratio.iloc[0, :], bins=5, color='b')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x232cf0333c8>
In [32]:
sns.stripplot(x='framework', y='value', hue='pyver', data=df, size=5, palette=sns.color_palette('rainbow'))
plt.xticks(rotation=90, fontsize=14);
plt.legend(title=None)
Out[32]:
<matplotlib.legend.Legend at 0x232cf090278>
In [33]:
plt.figure(figsize=(7, 13))
sns.stripplot(x='value', y='framework', hue='pyver', data=df, orient='h', size=5, palette=sns.color_palette('rainbow'))
plt.yticks(fontsize=14)
plt.legend(title=None)
Out[33]:
<matplotlib.legend.Legend at 0x232cf17c550>
In [34]:
plt.figure(figsize=(13, 13))
sns.stripplot(x='value', y='framework', hue='pyver', 
              data=df, 
              order=frameworks_pyver_ratio.T['Python 3'].sort_values(ascending=False).index, 
              orient='h', 
              size=7, 
              palette=sns.color_palette('rainbow'))
plt.yticks(fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.legend(title=None, loc='upper center')
plt.tight_layout()
plt.savefig('frameworks-python-version.png')

Which frameworks do people doing data analysis and machine learning commonly use?

In [35]:
cols = find_cols(survey_df, ['use', 'python', 'most'])
cols
Out[35]:
['what do you use python for the most?',
 'which version of python do you use the most?']
In [36]:
uses = survey_df['what do you use python for the most?']
uses.head()
Out[36]:
0    DevOps / System administration / Writing autom...
1                                      Web development
2                                      Web development
3                                                  NaN
4                                  Desktop development
Name: what do you use python for the most?, dtype: object
In [37]:
frameworks_uses = frameworks.apply(lambda col: pd.crosstab(index=uses, columns=col).iloc[:, 0])
frameworks_uses.columns = [item.split(':')[0] for item in frameworks_uses.columns]
frameworks_uses.head()
Out[37]:
django flask tornado bottle web2py numpy / pandas / matplotlib / scipy and similar keras / theano / tensorflow / scikit-learn and similar pillow pyqt / pygtk / wxpython tkinter ... requests asyncio kivy six aiohttp other - write in cloud platforms (google app engine, aws, rackspace, heroku and similar) jupyter notebook komodo editor komodo ide
Computer graphics 20 15 7 3 3 47 11.0 23 34 16 ... 16 3.0 9 3.0 1 8 16 10 5 3
Data analysis 424 395 73 38 49 926 397.0 159 200 154 ... 376 85.0 44 94.0 46 81 279 594 27 25
Desktop development 151 114 17 13 21 156 28.0 79 193 139 ... 119 20.0 51 26.0 8 41 61 71 13 15
DevOps / System administration / Writing automation scripts 271 289 41 33 28 230 58.0 83 106 80 ... 343 97.0 20 64.0 53 68 227 113 24 23
Educational purposes 160 91 13 16 21 186 53.0 55 68 115 ... 68 17.0 35 6.0 9 22 80 96 16 20

5 rows × 24 columns

In [38]:
da_ml_frameworks_uses = frameworks_uses.loc[['Data analysis', 'Machine learning']]
da_ml_frameworks_uses.head()
Out[38]:
django flask tornado bottle web2py numpy / pandas / matplotlib / scipy and similar keras / theano / tensorflow / scikit-learn and similar pillow pyqt / pygtk / wxpython tkinter ... requests asyncio kivy six aiohttp other - write in cloud platforms (google app engine, aws, rackspace, heroku and similar) jupyter notebook komodo editor komodo ide
Data analysis 424 395 73 38 49 926 397.0 159 200 154 ... 376 85.0 44 94.0 46 81 279 594 27 25
Machine learning 239 186 48 19 31 462 416.0 90 88 76 ... 163 40.0 25 37.0 22 33 139 297 10 7

2 rows × 24 columns

In [39]:
da_ml_frameworks_uses.T.sort_values(by='Data analysis').plot.area(stacked=False, alpha=0.5, figsize=(20, 15), 
                                                                  color=sns.color_palette('rainbow')[:2])
plt.xticks(range(da_ml_frameworks_uses.shape[1]), 
           da_ml_frameworks_uses.T.sort_values(by='Data analysis').index, 
           rotation=90, fontsize=18);
plt.yticks(fontsize=14)
plt.legend(fontsize=16)
plt.tight_layout()
plt.savefig('frameworks-data-analysis-machine-learning.png')
In [40]:
plt.figure(figsize=(10, 15))
df = da_ml_frameworks_uses.stack().reset_index()
df.columns = ['use', 'framework', 'value']
sns.barplot(x='value', y='framework', hue='use', 
            data=df, 
            orient='h', 
            order=da_ml_frameworks_uses.T.sort_values(by='Data analysis', ascending=False).index,
            palette=sns.color_palette('rainbow'))
plt.yticks(fontsize=16)
Out[40]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]), <a list of 24 Text yticklabel objects>)

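Data analysis and machine learning practitioners use broadly the same frameworks. The main gap is in machine learning libraries such as Keras, Theano, TensorFlow, and scikit-learn, which is to be expected.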

What I did not expect is that the web frameworks Django and Flask would rank so high.

How does company size relate to Python 3 usage?

In [41]:
cols = find_cols(survey_df, ['how', 'many', 'people', 'project'])
cols
Out[41]:
['how many people are in your project team?']
In [42]:
team_scale = survey_df[cols[0]]
team_scale.head()
Out[42]:
0    2-7 people
1    2-7 people
2           NaN
3           NaN
4    2-7 people
Name: how many people are in your project team?, dtype: object
In [43]:
team_scale.describe()
Out[43]:
count           3635
unique             5
top       2-7 people
freq            2620
Name: how many people are in your project team?, dtype: object
In [44]:
team_scale.isnull().sum()
Out[44]:
5871
In [45]:
team_pyver = pd.crosstab(team_scale, python_ver)
team_pyver = team_pyver.reindex(['2-7 people', '8-12 people', '13-20 people', '21-40 people', 'More than 40 people'])
team_pyver
Out[45]:
which version of python do you use the most?  Python 2  Python 3
how many people are in your project team?
2-7 people                                         769      1553
8-12 people                                        183       327
13-20 people                                        65       105
21-40 people                                        25        42
More than 40 people                                 28        40
In [46]:
team_pyver_sorted = team_pyver.div(team_pyver.sum(axis=1), axis=0).sort_values(by='Python 3', ascending=False)
team_pyver_sorted
Out[46]:
which version of python do you use the most?  Python 2  Python 3
how many people are in your project team?
2-7 people                                    0.331180  0.668820
8-12 people                                   0.358824  0.641176
21-40 people                                  0.373134  0.626866
13-20 people                                  0.382353  0.617647
More than 40 people                           0.411765  0.588235
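These shares are row-wise: each team-size row is divided by its own total, so the two columns of a row sum to 1. A minimal sketch that rebuilds them from the counts in the team_pyver table (the names team_counts, row_totals, and team_share are illustrative):

```python
import pandas as pd

# Counts copied from the team_pyver table above.
team_counts = pd.DataFrame(
    {"Python 2": [769, 183, 65, 25, 28], "Python 3": [1553, 327, 105, 42, 40]},
    index=["2-7 people", "8-12 people", "13-20 people",
           "21-40 people", "More than 40 people"],
)

# Number of respondents behind each row.
row_totals = team_counts.sum(axis=1)

# axis=0 aligns the row totals with the row index, so every row sums to 1.
team_share = team_counts.div(row_totals, axis=0)

print(team_share.round(6))
print(row_totals)
```

The row totals are worth keeping in view: the two largest team-size groups rest on 67 and 68 respondents, so their shares are much noisier than the 2-7 people row.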
In [47]:
team_pyver_sorted['Python 3'].plot(label='Python 3', marker='o', markersize=10, color='b', linewidth=3)
plt.xticks(range(5), team_pyver_sorted.index, fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('team scale', fontsize=16)
plt.ylabel('use ratio of python 3', fontsize=16)
plt.legend(fontsize=14)
plt.title('Team scale VS Use ratio of Python 3', fontsize=18)
plt.tight_layout()
plt.savefig('team-scale-python-3.png')