Summary Analysis of the 2017 GitHub Open Source Survey

By R. Stuart Geiger (@staeiou), Berkeley Institute for Data Science

Overview

This notebook analyzes the 2017 Open Source Survey, conducted by staff at GitHub, Inc. and other collaborators (see https://opensourcesurvey.org/2017 and https://github.com/github/open-source-survey). The survey was run in 2017, asking over 50 questions on a variety of topics. The survey's designers explain the motivation, design, and distribution of the survey:

"In collaboration with researchers from academia, industry, and the community, GitHub designed a survey to gather high quality and novel data on open source software development practices and communities. We collected responses from 5,500 randomly sampled respondents sourced from over 3,800 open source repositories on GitHub.com, and over 500 responses from a non-random sample of communities that work on other platforms. The results are an open data set about the attitudes, experiences, and backgrounds of those who use, build, and maintain open source software."

Purpose and goal

The GitHub survey team presented analyses of some questions when releasing the survey, but many more of the questions asked are relevant to researchers and community members. This report is an exploratory analysis of all questions asked in the survey. For each question it presents summary statistics (mostly frequency counts and proportions), followed by a bar graph of the counts or proportions. Most questions are presented individually, with panel questions grouped together as appropriate. There are no correlations, regressions, or descriptive breakouts between subgroups. Likert-style questions (e.g. strongly agree <-> strongly disagree) have not been recoded to numerical, scalar values. Results are not discussed or interpreted; this is left for future work.

This notebook has two purposes: to facilitate future research on this dataset by giving an overview of the kinds of questions asked in the survey, and to serve as the basis for a PDF report published on SocArXiv and OSF at https://osf.io/preprints/socarxiv/qps53/. The notebook is public on GitHub at https://github.com/staeiou/github-survey-analysis, and others are encouraged to extend it as they see fit.

In [1]:
!pip install pandas seaborn
Requirement already satisfied: pandas in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: seaborn in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: python-dateutil>=2 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /home/staeiou/conda/lib/python3.5/site-packages (from python-dateutil>=2->pandas)
In [2]:
import pandas as pd
import matplotlib, matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline
pd.options.display.float_format = '{:.2f}%'.format # add % to all floats, all floats here are percentages
In [3]:
## For making pretty tables when nbconverting to latex

pd.set_option('display.notebook_repr_html', True)

def _repr_latex_(self):
    return r"\centering{%s}" % self.to_latex()  # raw string so "\c" is not read as an escape sequence

pd.DataFrame._repr_latex_ = _repr_latex_  # monkey patch pandas DataFrame

Download and unzip data

In [4]:
!unzip -o data_for_public_release.zip
Archive:  data_for_public_release.zip
   creating: data_for_public_release/
  inflating: data_for_public_release/negative_incidents.csv  
  inflating: __MACOSX/data_for_public_release/._negative_incidents.csv  
  inflating: data_for_public_release/notes.txt  
  inflating: data_for_public_release/questionnaire.txt  
  inflating: __MACOSX/data_for_public_release/._questionnaire.txt  
  inflating: data_for_public_release/README.txt  
  inflating: __MACOSX/data_for_public_release/._README.txt  
  inflating: data_for_public_release/survey_data.csv  
  inflating: __MACOSX/data_for_public_release/._survey_data.csv  
In [5]:
!ls data_for_public_release/
negative_incidents.csv	questionnaire.txt  survey_data.csv
notes.txt		README.txt

Data processing

Main dataset

Load main dataset into pandas

In [6]:
pd.options.display.max_rows = 500
In [7]:
survey_df = pd.read_csv("data_for_public_release/survey_data.csv")
In [8]:
print("survey_data.csv length:", len(survey_df))
survey_data.csv length: 6029
In [9]:
survey_complete_df = survey_df.query("STATUS == 'Complete'")
print("survey_data.csv completed responses:", len(survey_complete_df))
survey_data.csv completed responses: 3746
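
As a rough illustration of the filter above, the sketch below applies the same query call to a toy table and then repeats the arithmetic with the two row counts printed by the notebook. The toy_df frame and the "Partial" status label are assumptions made for the example; only "Complete" is confirmed by the code above.

```python
import pandas as pd

# Toy stand-in for survey_df; "Partial" is a hypothetical non-complete status.
toy_df = pd.DataFrame(
    {"STATUS": ["Complete", "Partial", "Complete", "Partial", "Complete"]}
)

# Same filter as the notebook cell, applied to the toy table.
complete_df = toy_df.query("STATUS == 'Complete'")
share_complete = len(complete_df) / len(toy_df) * 100

# The same arithmetic with the row counts printed above (3746 of 6029).
reported_share = 3746 / 6029 * 100

print(round(share_complete, 2), round(reported_share, 2))
```

With the printed counts, completed responses make up roughly 62% of all rows in survey_data.csv.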

Explore the main dataset with some sample responses

In [10]:
survey_complete_df[0:3].transpose()
Out[10]:
3 4 6
RESPONSE.ID 48 49 51
DATE.SUBMITTED 3/21/17 15:42 3/21/17 15:38 3/21/17 15:41
STATUS Complete Complete Complete
PARTICIPATION.TYPE.FOLLOW 1 1 1
PARTICIPATION.TYPE.USE.APPLICATIONS 1 1 1
PARTICIPATION.TYPE.USE.DEPENDENCIES 1 1 1
PARTICIPATION.TYPE.CONTRIBUTE 1 1 0
PARTICIPATION.TYPE.OTHER 0 0 0
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE Frequently Occasionally NaN
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS Rarely Rarely NaN
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE Frequently Rarely NaN
CONTRIBUTOR.TYPE.FILE.BUGS Frequently Frequently NaN
CONTRIBUTOR.TYPE.FEATURE.REQUESTS Frequently Frequently NaN
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN Never Occasionally NaN
EMPLOYMENT.STATUS Employed full time Full time student Employed full time
PROFESSIONAL.SOFTWARE Frequently NaN Frequently
FUTURE.CONTRIBUTION.INTEREST Very interested Very interested Very interested
FUTURE.CONTRIBUTION.LIKELIHOOD Very likely Very likely Somewhat unlikely
OSS.USER.PRIORITIES.LICENSE Very important to have Very important to have Very important to have
OSS.USER.PRIORITIES.CODE.OF.CONDUCT Somewhat important not to have Somewhat important to have Not important either way
OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE Somewhat important to have Very important to have Somewhat important to have
OSS.USER.PRIORITIES.CLA Not important either way Very important to have Don't know what this is
OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT Somewhat important to have Very important to have Very important to have
OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS Somewhat important to have Very important to have Very important to have
OSS.USER.PRIORITIES.WELCOMING.COMMUNITY Very important to have Very important to have Somewhat important to have
OSS.USER.PRIORITIES.WIDESPREAD.USE Somewhat important to have Not important either way Somewhat important to have
OSS.CONTRIBUTOR.PRIORITIES.LICENSE Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT Somewhat important not to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CLA Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE Somewhat important to have NaN NaN
SEEK.OPEN.SOURCE Sometimes Always Always
OSS.UX Generally easier to use About the same Generally easier to use
OSS.SECURITY Generally more secure Generally more secure About the same
OSS.STABILITY About the same Generally less stable About the same
INTERNAL.EFFICACY Strongly agree Strongly agree Strongly agree
EXTERNAL.EFFICACY Strongly agree Strongly agree Neither agree nor disagree
OSS.IDENTIFICATION Neither agree nor disagree Strongly agree Neither agree nor disagree
USER.VALUES.STABILITY Moderately important Extremely important Extremely important
USER.VALUES.INNOVATION Not at all important Very important Moderately important
USER.VALUES.REPLICABILITY Very important Very important Moderately important
USER.VALUES.COMPATIBILITY Very important Very important Extremely important
USER.VALUES.SECURITY Very important Very important Extremely important
USER.VALUES.COST Very important Not at all important Very important
USER.VALUES.TRANSPARENCY Very important Extremely important Extremely important
USER.VALUES.USER.EXPERIENCE Extremely important Moderately important Very important
USER.VALUES.CUSTOMIZABILITY Extremely important Very important Extremely important
USER.VALUES.SUPPORT Slightly important Moderately important Not at all important
USER.VALUES.TRUSTED.PRODUCER Very important Slightly important Moderately important
TRANSPARENCY.PRIVACY.BELIEFS People should be able to contribute code witho... People should be able to contribute code witho... People should be able to contribute code witho...
INFO.AVAILABILITY A lot of information about me A lot of information about me A little information about me
INFO.JOB Yes No No
TRANSPARENCY.PRIVACY.PRACTICES.GENERAL I include my real name. I include my real name. I don't publish this kind of content online.
TRANSPARENCY.PRIVACY.PRACTICES.OSS I include my real name. I include my real name. NaN
RECEIVED.HELP Yes Yes Yes
FIND.HELPER Other - Please describe I asked for help in a public forum (e.g. in a ... I asked a specific person for help.
HELPER.PRIOR.RELATIONSHIP We knew each other well. Total strangers, I didn't know of them previou... We knew each other well.
RECEIVED.HELP.TYPE Writing code or otherwise implementing ideas. Installing or using an application. Installing or using an application.
PROVIDED.HELP Yes Yes Yes
FIND.HELPEE I reached out to them to offer unsolicited help. They asked for help in a public forum (e.g. in... They asked me directly for help.
HELPEE.PRIOR.RELATIONSHIP Total strangers, I didn't know of them previou... Total strangers, I didn't know of them previou... We knew each other well.
PROVIDED.HELP.TYPE Writing code or otherwise implementing ideas. Installing or using an application. Installing or using an application.
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE Yes Yes Yes
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION Yes No No
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE Yes Yes Yes
DISCOURAGING.BEHAVIOR.BAD.DOCS Yes Yes Yes
DISCOURAGING.BEHAVIOR.CONFLICT Yes Yes No
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE No No No
OSS.AS.JOB Yes, directly- some or all of my work duties ... NaN NaN
OSS.AT.WORK Frequently NaN Frequently
OSS.IP.POLICY I am free to contribute without asking for per... NaN I'm not sure.
EMPLOYER.POLICY.APPLICATIONS Use of open source applications is acceptable ... NaN Use of open source applications is encouraged.
EMPLOYER.POLICY.DEPENDENCIES Use of open source dependencies is acceptable ... NaN Use of open source dependencies is encouraged.
OSS.HIRING Very important NaN NaN
IMMIGRATION No, I live in the country where I was born. No, I live in the country where I was born. Yes, and I intend to stay permanently.
MINORITY.HOMECOUNTRY NaN NaN No
MINORITY.CURRENT.COUNTRY No No No
GENDER Man Man Man
TRANSGENDER.IDENTITY No No No
SEXUAL.ORIENTATION No Yes No
WRITTEN.ENGLISH Very well Very well Very well
AGE 35 to 44 years 17 or younger 35 to 44 years
FORMAL.EDUCATION Bachelor's degree Secondary (high) school graduate or equivalent Vocational/trade program or apprenticeship
PARENTS.FORMAL.EDUCATION Bachelor's degree Master's degree Bachelor's degree
AGE.AT.FIRST.COMPUTER.INTERNET 13 - 17 years old Younger than 13 years old 13 - 17 years old
LOCATION.OF.FIRST.COMPUTER.INTERNET At home (belonging to me or a family member) At home (belonging to me or a family member) At home (belonging to me or a family member)
PARTICIPATION.TYPE.ANY.REPONSE 1 1 1
POPULATION github github github
OFF.SITE.ID NaN NaN NaN
TRANSLATED 0 0 0

Create lists of variables for bulk analysis

In [11]:
participation_type_vars = ['PARTICIPATION.TYPE.FOLLOW',
       'PARTICIPATION.TYPE.USE.APPLICATIONS',
       'PARTICIPATION.TYPE.USE.DEPENDENCIES', 'PARTICIPATION.TYPE.CONTRIBUTE',
       'PARTICIPATION.TYPE.OTHER']

contrib_type_vars = ['CONTRIBUTOR.TYPE.CONTRIBUTE.CODE',
       'CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS',
       'CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE', 'CONTRIBUTOR.TYPE.FILE.BUGS',
       'CONTRIBUTOR.TYPE.FEATURE.REQUESTS', 'CONTRIBUTOR.TYPE.COMMUNITY.ADMIN']

contrib_other_vars = ['EMPLOYMENT.STATUS', 'PROFESSIONAL.SOFTWARE',
       'FUTURE.CONTRIBUTION.INTEREST', 'FUTURE.CONTRIBUTION.LIKELIHOOD']

contrib_ident_vars = participation_type_vars + contrib_type_vars + contrib_other_vars
In [12]:
user_pri_vars = ['OSS.USER.PRIORITIES.LICENSE', 'OSS.USER.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE', 'OSS.USER.PRIORITIES.CLA',
       'OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.USER.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.USER.PRIORITIES.WIDESPREAD.USE']

contrib_pri_vars = ['OSS.CONTRIBUTOR.PRIORITIES.LICENSE',
       'OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE',
       'OSS.CONTRIBUTOR.PRIORITIES.CLA',
       'OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE']

oss_values_vars = [ 'SEEK.OPEN.SOURCE',
       'OSS.UX', 'OSS.SECURITY', 'OSS.STABILITY', 'INTERNAL.EFFICACY',
       'EXTERNAL.EFFICACY', 'OSS.IDENTIFICATION']

user_values_vars = ['USER.VALUES.STABILITY',
       'USER.VALUES.INNOVATION', 'USER.VALUES.REPLICABILITY',
       'USER.VALUES.COMPATIBILITY', 'USER.VALUES.SECURITY', 'USER.VALUES.COST',
       'USER.VALUES.TRANSPARENCY', 'USER.VALUES.USER.EXPERIENCE',
       'USER.VALUES.CUSTOMIZABILITY', 'USER.VALUES.SUPPORT',
       'USER.VALUES.TRUSTED.PRODUCER']

values_pri_vars = user_pri_vars + contrib_pri_vars + user_values_vars + oss_values_vars 
In [13]:
privacy_transp_vars = ['TRANSPARENCY.PRIVACY.BELIEFS',
       'INFO.AVAILABILITY', 'INFO.JOB',
       'TRANSPARENCY.PRIVACY.PRACTICES.GENERAL',
       'TRANSPARENCY.PRIVACY.PRACTICES.OSS']
In [14]:
help_vars = ['RECEIVED.HELP', 'FIND.HELPER',
       'HELPER.PRIOR.RELATIONSHIP', 'RECEIVED.HELP.TYPE', 'PROVIDED.HELP',
       'FIND.HELPEE', 'HELPEE.PRIOR.RELATIONSHIP', 'PROVIDED.HELP.TYPE']
In [15]:
paid_work_vars = ['OSS.AS.JOB',
       'OSS.AT.WORK', 'OSS.IP.POLICY', 'EMPLOYER.POLICY.APPLICATIONS',
       'EMPLOYER.POLICY.DEPENDENCIES', 'OSS.HIRING']
In [16]:
discouraging_vars = ['DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE',
       'DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION',
       'DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE',
       'DISCOURAGING.BEHAVIOR.BAD.DOCS', 'DISCOURAGING.BEHAVIOR.CONFLICT',
       'DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE']
In [17]:
demographic_vars = ['IMMIGRATION',
       'MINORITY.HOMECOUNTRY', 'MINORITY.CURRENT.COUNTRY', 'GENDER',
       'TRANSGENDER.IDENTITY', 'SEXUAL.ORIENTATION', 'WRITTEN.ENGLISH', 'AGE',
       'FORMAL.EDUCATION', 'PARENTS.FORMAL.EDUCATION',
       'AGE.AT.FIRST.COMPUTER.INTERNET', 'LOCATION.OF.FIRST.COMPUTER.INTERNET',
       'PARTICIPATION.TYPE.ANY.REPONSE', 'POPULATION', 'OFF.SITE.ID',
       'TRANSLATED']
In [18]:
survey_vars = [contrib_ident_vars, values_pri_vars, privacy_transp_vars, \
               help_vars, paid_work_vars, discouraging_vars, demographic_vars]
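
One way to check lists like these against the data is a pair of set differences. The sketch below uses a toy frame and two toy groups (both hypothetical) to show names that are listed but absent from the data, and columns that no list mentions.

```python
import pandas as pd

# Toy frame with no rows; only the column names matter for this check.
toy_df = pd.DataFrame(columns=["RESPONSE.ID", "GENDER", "AGE", "POPULATION"])

# Hypothetical variable groups, in the same list-of-lists shape as survey_vars.
groups = [["GENDER", "AGE"], ["POPULATION", "TRANSLATED"]]

# Flatten the groups into one list of names.
listed = [name for group in groups for name in group]

# Names that appear in a list but not in the data (typos, renamed columns).
missing_from_data = sorted(set(listed) - set(toy_df.columns))

# Columns in the data that no list covers.
not_in_any_group = sorted(set(toy_df.columns) - set(listed))

print(missing_from_data, not_in_any_group)
```

Applied to the real lists, the second difference would be expected to include bookkeeping columns such as RESPONSE.ID, DATE.SUBMITTED, and STATUS, which appear in the sample output above but in none of the lists.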

Negative incidents

Load into pandas

In [19]:
neg_df = pd.read_csv("data_for_public_release/negative_incidents.csv")
In [20]:
# count rows in neg_df (the negative incidents table), not survey_df
print("negative_incidents.csv length:", len(neg_df))
negative_incidents.csv length: 6029

Explore the negative dataset with some sample responses

In [21]:
neg_df[0:3].transpose()
Out[21]:
0 1 2
NEGATIVE.WITNESS.RUDENESS 1 1 0
NEGATIVE.WITNESS.NAME.CALLING 1 0 0
NEGATIVE.WITNESS.THREATS 0 0 0
NEGATIVE.WITNESS.IMPERSONATION 0 0 1
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT 0 0 0
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT 0 0 0
NEGATIVE.WITNESS.STALKING 0 0 0
NEGATIVE.WITNESS.SEXUAL.ADVANCES 0 0 0
NEGATIVE.WITNESS.STEREOTYPING 0 0 0
NEGATIVE.WITNESS.DOXXING 0 0 1
NEGATIVE.WITNESS.OTHER 0 0 0
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE 0 0 0
NEGATIVE.EXPERIENCE.RUDENESS 0 1 0
NEGATIVE.EXPERIENCE.NAME.CALLING 0 0 0
NEGATIVE.EXPERIENCE.THREATS 0 0 0
NEGATIVE.EXPERIENCE.IMPERSONATION 0 0 0
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT 0 0 0
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT 0 0 0
NEGATIVE.EXPERIENCE.STALKING 0 0 0
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES 0 0 0
NEGATIVE.EXPERIENCE.STEREOTYPING 0 0 0
NEGATIVE.EXPERIENCE.DOXXING 0 0 0
NEGATIVE.EXPERIENCE.OTHER 0 0 0
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE 1 0 1
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP 0 0 0
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT 0 0 0
NEGATIVE.RESPONSE.BLOCKED.USER 0 0 0
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS 0 0 0
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP 0 0 0
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL 0 0 0
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT 0 0 0
NEGATIVE.RESPONSE.OTHER 0 0 0
NEGATIVE.RESPONSE.IGNORED 0 1 0
RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP NaN NaN NaN
RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT NaN NaN NaN
RESPONSE.EFFECTIVENESS.BLOCKED.USER NaN NaN NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS NaN NaN NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP NaN NaN NaN
RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL NaN NaN NaN
RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT NaN NaN NaN
RESPONSE.EFFECTIVENESS.OTHER NaN NaN NaN
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING 0 0 1
NEGATIVE.CONSEQUENCES.PSEUDONYM 0 0 0
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE 0 0 0
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME 0 0 0
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE 0 0 0
NEGATIVE.CONSEQUENCES.SUGGEST.COC 0 0 0
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION 0 0 0
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION 0 1 0
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES 0 0 0
NEGATIVE.CONSEQUENCES.OTHER 0 0 0
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE 1 0 0
NEGATIVE.WITNESS.ANY.RESPONSE 1 1 1
NEGATIVE.EXPERIENCE.ANY.RESPONSE 1 1 1
NEGATIVE.RESPONSE.ANY.RESPONSE 0 1 0
NEGATIVE.CONSEQUENCES.ANY.RESPONSE 1 1 1
POPULATION github github github

Create lists of variables for bulk analysis

In [22]:
neg_witness_vars = ['NEGATIVE.WITNESS.RUDENESS', 'NEGATIVE.WITNESS.NAME.CALLING',
       'NEGATIVE.WITNESS.THREATS', 'NEGATIVE.WITNESS.IMPERSONATION',
       'NEGATIVE.WITNESS.SUSTAINED.HARASSMENT',
       'NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.WITNESS.STALKING', 'NEGATIVE.WITNESS.SEXUAL.ADVANCES',
       'NEGATIVE.WITNESS.STEREOTYPING', 'NEGATIVE.WITNESS.DOXXING',
       'NEGATIVE.WITNESS.OTHER', 'NEGATIVE.WITNESS.NONE.OF.THE.ABOVE', 'NEGATIVE.WITNESS.ANY.RESPONSE']
In [23]:
neg_exp_vars = ['NEGATIVE.EXPERIENCE.RUDENESS', 'NEGATIVE.EXPERIENCE.NAME.CALLING',
       'NEGATIVE.EXPERIENCE.THREATS', 'NEGATIVE.EXPERIENCE.IMPERSONATION',
       'NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT',
       'NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.EXPERIENCE.STALKING', 'NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES',
       'NEGATIVE.EXPERIENCE.STEREOTYPING', 'NEGATIVE.EXPERIENCE.DOXXING',
       'NEGATIVE.EXPERIENCE.OTHER', 'NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE']
In [24]:
neg_resp_vars = ['NEGATIVE.RESPONSE.ASKED.USER.TO.STOP',
       'NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT',
       'NEGATIVE.RESPONSE.BLOCKED.USER',
       'NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS',
       'NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP',
       'NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL',
       'NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT',
       'NEGATIVE.RESPONSE.OTHER', 'NEGATIVE.RESPONSE.IGNORED', 'NEGATIVE.RESPONSE.ANY.RESPONSE']
In [25]:
neg_effect_vars = ['RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP',
       'RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT',
       'RESPONSE.EFFECTIVENESS.BLOCKED.USER',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP',
       'RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL',
       'RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT',
       'RESPONSE.EFFECTIVENESS.OTHER']
In [26]:
neg_conseq_vars = ['NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING',
       'NEGATIVE.CONSEQUENCES.PSEUDONYM',
       'NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE',
       'NEGATIVE.CONSEQUENCES.CHANGE.USERNAME',
       'NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE',
       'NEGATIVE.CONSEQUENCES.SUGGEST.COC',
       'NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES', 'NEGATIVE.CONSEQUENCES.OTHER',
       'NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']
In [27]:
neg_anyresp_vars = ['NEGATIVE.WITNESS.ANY.RESPONSE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE',
       'NEGATIVE.RESPONSE.ANY.RESPONSE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']

Analysis

In [28]:
sns.set(font_scale=1.5)

Contributor identity

People participate in open source in different ways. Which of the following activities do you engage in?

Choose all that apply.

In [29]:
participation_type_resp= survey_df[participation_type_vars].apply(pd.Series.value_counts).transpose()
participation_type_resp.columns = ["No", "Yes"]
participation_type_resp
Out[29]:
No Yes
PARTICIPATION.TYPE.FOLLOW 1287 4742
PARTICIPATION.TYPE.USE.APPLICATIONS 454 5575
PARTICIPATION.TYPE.USE.DEPENDENCIES 946 5083
PARTICIPATION.TYPE.CONTRIBUTE 1722 4307
PARTICIPATION.TYPE.OTHER 5742 287
In [30]:
participation_type_prop = survey_df[participation_type_vars].mean() * 100
participation_type_prop = participation_type_prop.sort_values()
pd.DataFrame(participation_type_prop, columns=["percent"])
Out[30]:
percent
PARTICIPATION.TYPE.OTHER 4.76%
PARTICIPATION.TYPE.CONTRIBUTE 71.44%
PARTICIPATION.TYPE.FOLLOW 78.65%
PARTICIPATION.TYPE.USE.DEPENDENCIES 84.31%
PARTICIPATION.TYPE.USE.APPLICATIONS 92.47%
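
The percentages above rely on each PARTICIPATION.TYPE.* column being coded 0/1, so the column mean is the share of respondents who ticked the box. A minimal sketch with made-up toy columns, followed by the same check against the FOLLOW counts from Out[29]:

```python
import pandas as pd

# Toy 0/1 indicator columns (illustrative values, not survey data).
toy = pd.DataFrame({"FOLLOW": [1, 1, 0, 1], "OTHER": [0, 0, 0, 1]})

counts_yes = toy.sum()          # number of 1s per column
percent_yes = toy.mean() * 100  # mean of a 0/1 column is the share of 1s

# Same arithmetic with the FOLLOW counts reported in Out[29].
follow_percent = 4742 / (4742 + 1287) * 100

print(percent_yes.round(2).to_dict(), round(follow_percent, 2))
```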
In [31]:
ax = participation_type_prop.plot(kind='barh')

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[19:].replace(".", " ") # cut off "PARTICIPATION.TYPE."
        
    labels.append(title_text)
    
plt.xlim(0,100)
ax.set_yticklabels(labels)

ax.set_xlabel("Percent of respondents")
t = plt.title("% of people who participate in the following activities:")
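
The tick labels above are shortened by slicing at a fixed character position, which has to match the length of the prefix exactly. A possible alternative, sketched here with a hypothetical helper named strip_prefix, derives the cut point from the prefix itself.

```python
def strip_prefix(name, prefix):
    """Drop `prefix` from `name` if present, then turn dots into spaces."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    return name.replace(".", " ")


# "PARTICIPATION.TYPE." is 19 characters long, matching the [19:] slice above.
print(len("PARTICIPATION.TYPE."))
print(strip_prefix("PARTICIPATION.TYPE.USE.APPLICATIONS", "PARTICIPATION.TYPE."))
```

The helper avoids str.removeprefix, which needs Python 3.9 or later, while the notebook output above shows a Python 3.5 environment.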

Contribution type: How often do you engage in each of the following activities?

In [32]:
contrib_type_responses = survey_df[contrib_type_vars].apply(pd.Series.value_counts).transpose()

# keep the four answer options, ordered from most to least frequent
contrib_type_responses = contrib_type_responses[["Frequently", "Occasionally", "Rarely", "Never"]]
contrib_type_responses = contrib_type_responses.sort_values(by='Frequently')
contrib_type_responses
Out[32]:
Frequently Occasionally Rarely Never
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN 287 417 867 2412
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS 460 1214 1665 661
CONTRIBUTOR.TYPE.FEATURE.REQUESTS 573 1625 1346 451
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE 996 944 974 1090
CONTRIBUTOR.TYPE.FILE.BUGS 1067 2073 768 106
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE 1160 1383 1301 189
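
Selecting columns by a list of labels, as in the cell above, raises a KeyError when an answer option never occurs in the data. A possible alternative, sketched here on a toy panel with hypothetical column names Q.A and Q.B, is to reindex the columns so that absent options appear as zero counts.

```python
import pandas as pd

order = ["Frequently", "Occasionally", "Rarely", "Never"]

# Toy panel: nobody answered "Never", and Q.A has one missing answer.
toy = pd.DataFrame({
    "Q.A": ["Frequently", "Rarely", "Rarely", None],
    "Q.B": ["Occasionally", "Occasionally", "Frequently", "Rarely"],
})

# One row per question, one column per answer option that occurs.
counts = toy.apply(pd.Series.value_counts).transpose()

# reindex adds absent options as NaN columns; fillna turns every gap into a zero count.
counts = counts.reindex(columns=order).fillna(0).astype(int)

print(counts)
```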
In [33]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
contrib_type_responses.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[17:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("How often do you engage in each of the following activities?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Employment status

EMPLOYMENT.STATUS

In [34]:
prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts()))
prop_df.columns=["count"]
prop_df
Out[34]:
count
Employed full time 3615
Full time student 1048
Employed part time 349
Temporarily not working 314
Other - please describe 184
Retired or permanently not working (e.g. due to disability) 90
In [35]:
prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[35]:
percent
Employed full time 64.55%
Full time student 18.71%
Employed part time 6.23%
Temporarily not working 5.61%
Other - please describe 3.29%
Retired or permanently not working (e.g. due to disability) 1.61%
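
Note that value_counts(normalize=True) leaves out missing answers by default, so the percentages above are shares of the 5,600 respondents who answered this question (the sum of the counts in Out[34]), not of all 6,029 rows. The toy series below, with made-up values, contrasts the default with dropna=False.

```python
import pandas as pd

# Toy column with two unanswered rows (None); the values are illustrative only.
answers = pd.Series(["Employed full time", "Full time student", None,
                     "Employed full time", None])

# Default: missing answers are excluded from the denominator (2 of 3 answered).
pct_answered = answers.value_counts(normalize=True) * 100

# dropna=False keeps missing answers in the denominator (2 of 5 rows).
pct_all_rows = answers.value_counts(normalize=True, dropna=False) * 100

# Same arithmetic with the counts from Out[34]: 3615 of 5600 answers.
reported = 3615 / 5600 * 100

print(round(pct_answered["Employed full time"], 2))
print(round(pct_all_rows["Employed full time"], 2))
print(round(reported, 2))
```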
In [36]:
ax = pd.DataFrame(survey_df['EMPLOYMENT.STATUS'].value_counts()).plot(kind='barh')
plt.suptitle("Employment status")
t = ax.set_xlabel("Count of responses")

In your main job, how often do you write or otherwise directly contribute to producing software?

PROFESSIONAL.SOFTWARE

In [37]:
prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts()))
prop_df.columns=["count"]
prop_df
Out[37]:
count
Frequently 2747
Occasionally 542
Rarely 339
Never 279
In [38]:
prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[38]:
percent
Frequently 70.31%
Occasionally 13.87%
Rarely 8.68%
Never 7.14%
In [39]:
ax = pd.DataFrame(survey_df['PROFESSIONAL.SOFTWARE'].value_counts()).plot(kind='barh')
plt.title("In your main job, how often do you write or\notherwise directly contribute to producing software?")
t = ax.set_xlabel("Count of responses")

How interested are you in contributing to open source projects in the future?

FUTURE.CONTRIBUTION.INTEREST

In [40]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()))
prop_df.columns=["count"]
prop_df
Out[40]:
count
Very interested 3929
Somewhat interested 1430
Not too interested 125
Not at all interested 24
In [41]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[41]:
percent
Very interested 71.33%
Somewhat interested 25.96%
Not too interested 2.27%
Not at all interested 0.44%
In [42]:
ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()).plot(kind='barh')
plt.title("How interested are you in contributing\nto open source projects in the future?")
t = ax.set_xlabel("Count of responses")

How likely are you to contribute to open source projects in the future?

In [43]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()))
prop_df.columns=["count"]
prop_df
Out[43]:
count
Very likely 3271
Somewhat likely 1719
Somewhat unlikely 440
Very unlikely 81
In [44]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[44]:
percent
Very likely 59.35%
Somewhat likely 31.19%
Somewhat unlikely 7.98%
Very unlikely 1.47%
In [45]:
ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()).plot(kind='barh')
plt.title("How likely are you to contribute to\nopen source projects in the future?")
t = ax.set_xlabel("Count of responses")

Priorities and values

When thinking about whether to use open source software, how important are the following things?

OSS.USER.PRIORITIES.*

In [46]:
user_pri_responses = survey_df[user_pri_vars].apply(pd.Series.value_counts).transpose()

# order columns from "very important to have" through "very important not to have", with "don't know" last
user_pri_responses = user_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses = user_pri_responses.sort_values(by="Very important to have")
In [47]:
idx = []
for i in user_pri_responses.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses.set_index(idx)
Out[47]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
CLA 490 1024 2282 336 157 488
CODE.OF.CONDUCT 848 1461 1993 166 120 209
WIDESPREAD.USE 984 2067 1576 114 47 28
CONTRIBUTING.GUIDE 1212 1866 1516 95 62 62
WELCOMING.COMMUNITY 2062 1822 812 67 33 18
RESPONSIVE.MAINTAINERS 2575 1850 302 31 35 20
ACTIVE.DEVELOPMENT 2768 1722 267 30 31 16
LICENSE 3125 1160 435 31 33 47
In [48]:
user_pri_responses_prop = survey_df[user_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

# order columns from "very important to have" through "very important not to have", with "don't know" last
user_pri_responses_prop = user_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses_prop = user_pri_responses_prop.sort_values(by="Very important to have")
user_pri_responses_prop = user_pri_responses_prop * 100
In [49]:
idx = []
for i in user_pri_responses_prop.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses_prop.set_index(idx)
Out[49]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
CLA 10.26% 21.44% 47.77% 7.03% 3.29% 10.22%
CODE.OF.CONDUCT 17.68% 30.46% 41.55% 3.46% 2.50% 4.36%
WIDESPREAD.USE 20.43% 42.92% 32.72% 2.37% 0.98% 0.58%
CONTRIBUTING.GUIDE 25.18% 38.77% 31.50% 1.97% 1.29% 1.29%
WELCOMING.COMMUNITY 42.83% 37.85% 16.87% 1.39% 0.69% 0.37%
RESPONSIVE.MAINTAINERS 53.50% 38.44% 6.27% 0.64% 0.73% 0.42%
ACTIVE.DEVELOPMENT 57.26% 35.62% 5.52% 0.62% 0.64% 0.33%
LICENSE 64.69% 24.01% 9.00% 0.64% 0.68% 0.97%
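
The percent table can be cross-checked against the count table by dividing each row by its own total. The sketch below does this for the CLA and LICENSE rows, with counts copied from Out[47]; the short column labels are abbreviations introduced here for the example.

```python
import pandas as pd

# Short, hypothetical labels standing in for the six answer options.
options = ["very_have", "some_have", "neutral", "some_not", "very_not", "dont_know"]

# Counts copied from the CLA and LICENSE rows of Out[47].
counts = pd.DataFrame(
    [[490, 1024, 2282, 336, 157, 488],
     [3125, 1160, 435, 31, 33, 47]],
    index=["CLA", "LICENSE"], columns=options)

# Divide every row by its row total, then express as a percentage.
row_totals = counts.sum(axis=1)
percent = counts.div(row_totals, axis=0).round(4) * 100

print(row_totals.to_dict())
print(percent.round(2))
```

Each row is normalized by the number of respondents who answered that particular item, so the denominators differ from row to row.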
In [50]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
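# one explicit color per answer option, dark blue (very important to have) to crimson (very important not to have); green marks "don't know"
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]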
user_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[20:].replace(".", " ") # cut off "OSS.USER.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *use* open source software,\n how important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

When thinking about whether to contribute to an open source project, how important are the following things?

OSS.CONTRIBUTOR.PRIORITIES.*

In [51]:
contrib_pri_responses = survey_df[contrib_pri_vars].apply(pd.Series.value_counts).transpose()

# order columns from "very important to have" through "very important not to have", with "don't know" last
contrib_pri_responses = contrib_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]

contrib_pri_responses = contrib_pri_responses.sort_values(by="Very important to have")
In [52]:
idx = []
for i in contrib_pri_responses.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses.set_index(idx)
Out[52]:
 | Very important to have | Somewhat important to have | Not important either way | Somewhat important not to have | Very important not to have | Don't know what this is
WIDESPREAD.USE | 387 | 1016 | 1666 | 70 | 30 | 12
CLA | 419 | 712 | 1266 | 327 | 166 | 280
CODE.OF.CONDUCT | 655 | 1145 | 1085 | 119 | 84 | 96
CONTRIBUTING.GUIDE | 1198 | 1396 | 500 | 41 | 18 | 24
ACTIVE.DEVELOPMENT | 1368 | 1333 | 448 | 21 | 18 | 5
WELCOMING.COMMUNITY | 1533 | 1199 | 411 | 21 | 15 | 7
RESPONSIVE.MAINTAINERS | 1994 | 1022 | 138 | 7 | 16 | 7
LICENSE | 2199 | 610 | 337 | 16 | 15 | 18
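
The index-shortening cell above slices each variable name at a fixed character position (27, the length of "OSS.CONTRIBUTOR.PRIORITIES."). As a hedged alternative, the sketch below strips the prefix by value, so the same code would still work if the prefix changed. The `PREFIX` constant and the `toy_counts` frame are assumptions made up for illustration; they stand in for `contrib_pri_responses` and are not survey data.

```python
import pandas as pd

# Assumed prefix shared by every variable in this panel.
PREFIX = "OSS.CONTRIBUTOR.PRIORITIES."

# Toy counts standing in for contrib_pri_responses; the numbers are invented.
toy_counts = pd.DataFrame(
    {"Very important to have": [3, 5], "Not important either way": [4, 1]},
    index=[PREFIX + "CLA", PREFIX + "LICENSE"],
)

def strip_prefix(name, prefix=PREFIX):
    """Drop the prefix when present; leave other names unchanged."""
    return name[len(prefix):] if name.startswith(prefix) else name

# rename() returns a new frame, so the original index stays available
# for later cells that still expect the full variable names.
short = toy_counts.rename(index=strip_prefix)
print(short)
```

Like `set_index` in the cell above, `rename` does not modify the frame in place, so the shortened labels only affect the displayed copy.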
In [53]:
contrib_pri_responses_prop = survey_df[contrib_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

# Put the response columns in scale order, from most to least important, with "Don't know" last
contrib_pri_responses_prop = contrib_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
contrib_pri_responses_prop = contrib_pri_responses_prop.sort_values(by="Very important to have")
contrib_pri_responses_prop = contrib_pri_responses_prop * 100
In [54]:
idx = []
for i in contrib_pri_responses_prop.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses_prop.set_index(idx)
Out[54]:
 | Very important to have | Somewhat important to have | Not important either way | Somewhat important not to have | Very important not to have | Don't know what this is
WIDESPREAD.USE | 12.17% | 31.94% | 52.37% | 2.20% | 0.94% | 0.38%
CLA | 13.22% | 22.46% | 39.94% | 10.32% | 5.24% | 8.83%
CODE.OF.CONDUCT | 20.57% | 35.96% | 34.08% | 3.74% | 2.64% | 3.02%
CONTRIBUTING.GUIDE | 37.71% | 43.94% | 15.74% | 1.29% | 0.57% | 0.76%
ACTIVE.DEVELOPMENT | 42.84% | 41.75% | 14.03% | 0.66% | 0.56% | 0.16%
WELCOMING.COMMUNITY | 48.12% | 37.63% | 12.90% | 0.66% | 0.47% | 0.22%
RESPONSIVE.MAINTAINERS | 62.63% | 32.10% | 4.33% | 0.22% | 0.50% | 0.22%
LICENSE | 68.83% | 19.09% | 10.55% | 0.50% | 0.47% | 0.56%
In [55]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]
contrib_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[27:].replace(".", " ") # cut off "OSS.CONTRIBUTOR.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *contribute* to an open source project,\nhow important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

How often do you try to find open source options over other kinds of software?

SEEK.OPEN.SOURCE

In [56]:
count_df = pd.DataFrame(data=survey_df['SEEK.OPEN.SOURCE'].value_counts())
count_df.columns = ["count"]
count_df
Out[56]:
count
Always 3407
Sometimes 1111
Rarely 100
Never 25
In [57]:
prop_df = pd.DataFrame((survey_df['SEEK.OPEN.SOURCE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[57]:
percent
Always 73.38%
Sometimes 23.93%
Rarely 2.15%
Never 0.54%
In [58]:
ax = pd.DataFrame(survey_df['SEEK.OPEN.SOURCE'].value_counts()).plot(kind='barh')
plt.title("How often do you try to find open\nsource options over other kinds of software?")
t = ax.set_xlabel("Count of responses")
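
The single-question cells in this notebook compute counts and percentages in separate steps. The sketch below shows one way the two could be combined into a single table. `summarize_question` is a hypothetical helper, not part of the original notebook, and the `toy` series is invented data standing in for `survey_df['SEEK.OPEN.SOURCE']`.

```python
import pandas as pd

def summarize_question(series):
    """Hypothetical helper: counts and percentages for one survey question.

    Missing answers are excluded, as value_counts() does by default.
    """
    counts = series.value_counts()
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / counts.sum() * 100).round(2),
    })

# Invented responses for illustration only; None marks a skipped question.
toy = pd.Series(
    ["Always", "Always", "Sometimes", "Always", None, "Never"],
    name="SEEK.OPEN.SOURCE",
)
summary = summarize_question(toy)
print(summary)
```

Because the percentages are derived from the same `counts` object, the two columns cannot drift apart if the input is filtered or recoded later.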

Open source software usability

OSS.UX: Do you believe that open source software is generally easier to use than closed source (proprietary) software, harder to use, or about the same?

In [59]:
count_df = pd.DataFrame(data=survey_df['OSS.UX'].value_counts())
count_df.columns = ["count"]
count_df
Out[59]:
count
About the same 2027
Generally easier to use 1597
Generally harder to use 897
In [60]:
prop_df = pd.DataFrame((survey_df['OSS.UX'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[60]:
percent
About the same 44.84%
Generally easier to use 35.32%
Generally harder to use 19.84%
In [61]:
ax = pd.DataFrame(survey_df['OSS.UX'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is generally\neasier to use than closed source (proprietary)\nsoftware, harder to use, or about the same?")
t = ax.set_xlabel("Count of responses")

Open source software security

OSS.SECURITY: Do you believe that open source software is generally more secure than closed source (proprietary) software, less secure, or about the same?

In [62]:
count_df = pd.DataFrame(data=survey_df['OSS.SECURITY'].value_counts())
count_df.columns = ["count"]
count_df
Out[62]:
count
Generally more secure 2688
About the same 1537
Generally less secure 295
In [63]:
prop_df = pd.DataFrame((survey_df['OSS.SECURITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[63]:
percent
Generally more secure 59.47%
About the same 34.00%
Generally less secure 6.53%
In [64]:
ax = pd.DataFrame(survey_df['OSS.SECURITY'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is\ngenerally more secure than closed source (proprietary)\nsoftware, less secure, or about the same?")
t = ax.set_xlabel("Count of responses")

Open source software stability

OSS.STABILITY: Do you believe that open source software is generally more stable than closed source (proprietary) software, less stable, or about the same?

In [65]:
count_df = pd.DataFrame(data=survey_df['OSS.STABILITY'].value_counts())
count_df.columns = ["count"]
count_df
Out[65]:
count
About the same 2240
Generally more stable 1399
Generally less stable 877
In [66]:
prop_df = pd.DataFrame((survey_df['OSS.STABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[66]:
percent
About the same 49.60%
Generally more stable 30.98%
Generally less stable 19.42%
In [67]:
ax = pd.DataFrame(survey_df['OSS.STABILITY'].value_counts()).plot(kind='barh')  # keep the Axes so the x-label below is applied to this plot
plt.title("Do you believe that open source software is\ngenerally more stable than closed source\n(proprietary) software, less stable, or about the same?")
t = ax.set_xlabel("Count of responses")

Identification with open source

How much do you agree or disagree with the following statements:

  • EXTERNAL.EFFICACY: The open source community values contributions from people like me.
  • INTERNAL.EFFICACY: I have the skills and understanding necessary to make meaningful contributions to open source projects.
  • OSS.IDENTIFICATION: I consider myself to be a member of the open source (and/or the Free/Libre software) community.
In [68]:
oss_id_vars = ["INTERNAL.EFFICACY", "EXTERNAL.EFFICACY", "OSS.IDENTIFICATION"]
In [69]:
oss_id_responses = survey_df[oss_id_vars].apply(pd.Series.value_counts).transpose()

# Put the response columns in scale order, from "Strongly agree" to "Strongly disagree"
oss_id_responses = oss_id_responses[["Strongly agree",
                                     "Somewhat agree",
                                     "Neither agree nor disagree",
                                     "Somewhat disagree",
                                     "Strongly disagree"]]
oss_id_responses = oss_id_responses.sort_values(by="Strongly agree")
oss_id_responses
Out[69]:
 | Strongly agree | Somewhat agree | Neither agree nor disagree | Somewhat disagree | Strongly disagree
EXTERNAL.EFFICACY | 1518 | 1610 | 1116 | 150 | 58
OSS.IDENTIFICATION | 1579 | 1513 | 863 | 351 | 150
INTERNAL.EFFICACY | 2052 | 1685 | 418 | 240 | 62
In [70]:
oss_id_responses_prop = survey_df[oss_id_vars].apply(pd.Series.value_counts, normalize=True).round(4) * 100 
oss_id_responses_prop.transpose()
Out[70]:
 | Neither agree nor disagree | Somewhat agree | Somewhat disagree | Strongly agree | Strongly disagree
INTERNAL.EFFICACY | 9.38% | 37.81% | 5.38% | 46.04% | 1.39%
EXTERNAL.EFFICACY | 25.07% | 36.16% | 3.37% | 34.10% | 1.30%
OSS.IDENTIFICATION | 19.37% | 33.95% | 7.88% | 35.44% | 3.37%
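
The percentage table above lists its columns alphabetically, while the counts table before it follows the agreement scale. A minimal sketch of bringing the percentages into the same scale order with `reindex` follows. `AGREEMENT_LEVELS` and the `toy` frame are assumptions introduced for illustration and do not reproduce survey responses.

```python
import pandas as pd

# Agreement scale in display order, as used for the counts table.
AGREEMENT_LEVELS = [
    "Strongly agree",
    "Somewhat agree",
    "Neither agree nor disagree",
    "Somewhat disagree",
    "Strongly disagree",
]

# Invented responses standing in for survey_df[oss_id_vars].
toy = pd.DataFrame({
    "INTERNAL.EFFICACY": ["Strongly agree", "Somewhat agree",
                          "Strongly agree", "Somewhat disagree"],
    "EXTERNAL.EFFICACY": ["Somewhat agree", "Neither agree nor disagree",
                          "Strongly agree", "Strongly disagree"],
})

# One row per statement, one column per response level, as proportions.
prop = toy.apply(lambda s: s.value_counts(normalize=True)).transpose()

# reindex fixes the column order; levels nobody chose become 0 instead of NaN.
prop = prop.reindex(columns=AGREEMENT_LEVELS).fillna(0)

# Convert to percentages and sort rows the same way as the counts table.
prop = (prop * 100).round(2).sort_values(by="Strongly agree")
print(prop)
```

Using `reindex` rather than plain column selection also avoids a `KeyError` when a response level is absent from every question in the panel.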
In [71]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson"]
oss_id_responses.plot.barh(stacked=True, ax=ax, figsize=[12,5], cmap=matplotlib.cm.coolwarm, edgecolor='black', linewidth=1)

#print(str(ax.get_yticklabels()))

ax.set_yticklabels(["The open source community values\ncontributions from people like me.",
                    "I consider myself to be a member\nof the open source (and/or the\nFree/Libre software) community.",
                    "I have the skills and understanding\nnecessary to make meaningful\ncontributions to open source projects."])


plt.title("How much do you agree or disagree with the following statements:")

plt.xlabel("Number of responses")


legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.25), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')

Transparency vs privacy

Attribution

TRANSPARENCY.PRIVACY.BELIEFS: Which of the following statements is closest to your beliefs about attribution in software development?

  • Records of authorship should be required so that end users know who created the source code they are working with.
  • People should be able to contribute code without attribution, if they wish to remain anonymous.
In [72]:
counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts())
counts_df.columns=["count"]
counts_df
Out[72]:
count
People should be able to contribute code without attribution, if they wish to remain anonymous. 2454
Records of authorship should be required so that end users know who created the source code they are working with. 1594
In [73]:
prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[73]:
percent
People should be able to contribute code without attribution, if they wish to remain anonymous. 60.62%
Records of authorship should be required so that end users know who created the source code they are working with. 39.38%
In [74]:
ax = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts()).plot(kind='barh', figsize=[10,6])
plt.title("Which of the following statements is closest to your\nbeliefs about attribution in software development?")
ax.set_yticklabels(["People should be able to contribute\ncode without attribution, if\nthey wish to remain anonymous.",
                    "Records of authorship should be\nrequired so that end users know\nwho created the source code they are working with."])
t = ax.set_xlabel("Count of responses")

In general, how much information about you is publicly available online?

INFO.AVAILABILITY

In [75]:
count_df = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts())
count_df.columns=["count"]
count_df
Out[75]:
count
Some information about me 1776
A little information about me 1133
A lot of information about me 1011
No information at all about me 140
In [76]:
prop_df = pd.DataFrame((survey_df['INFO.AVAILABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df
Out[76]:
percent
Some information about me 43.74%
A little information about me 27.91%
A lot of information about me 24.90%
No information at all about me 3.45%
In [77]:
ax = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts()).plot(kind='barh')
plt.title("In general, how much information about\nyou is publicly available online?")
t = ax.set_xlabel("Count of responses")