mlcourse.ai – Open Machine Learning Course

Author: Kseniia Terekhova, ODS Slack Kseniia

Individual data analysis project

Predicting developer career satisfaction using StackOverflow survey

Part 1. Feature and data explanation

1.1 Dataset and task explanation

This project uses the dataset with results of Stack Overflow 2018 Developer Survey. The data is publicly available through Kaggle Datasets.

The Dataset description on Kaggle states:

Each year, we at Stack Overflow ask the developer community about everything from their favorite technologies to their job preferences. This year marks the eighth year we’ve published our Annual Developer Survey results—with the largest number of respondents yet. Over 100,000 developers took the 30-minute survey in January 2018.

This year, we covered a few new topics ranging from artificial intelligence to ethics in coding. We also found that underrepresented groups in tech responded to our survey at even lower rates than we would expect from their participation in the workforce. Want to dive into the results yourself and see what you can learn about salaries or machine learning or diversity in tech? We look forward to seeing what you find!

Indeed, there are numerous aspects of developers' lifes that can be learned from such kind of data. For this concrete project the task of predicting developer career satisfaction has been selected. So, the target value for this research is contained in CareerSatisfaction column. There is also JobSatisfaction feature, that could be possibly more useful for an HR or hiring manager, but for a technical specialist the question of overall career satisfaction of his/her peers seems to be more interesting.

The dataset consists of two files:

  • survey_results_public.csv with the main survey results, one respondent per row and one column per question;
  • survey_results_schema.csv with each column name from the main results along with the question text corresponding to that column;

The survey results file has columns for each one of the 128 questions, some of those are in the form "AssesBenefits4" or "AIDangerous". Detailed question content should be looked in the schema file, so the data observation is a little bit harder process than just listing columns with short comments.

1.2 Survey content

That can be difficult to undestand the nature of this or that feature without seeng a corresponding question along with available answers. Thus, it looks like a good idea to go through the columns of survey results and for each of them extract the question text from the survey schema and answers used in the survey. This list will be long and tedious, so if you don't want even to scroll through it, you can jump directly to Features conversion or Exploratory data analysis.

Loading the files, first of all.

In [1]:
import pandas as pd
import numpy as np
In [2]:
survey_schema = pd.read_csv('data/survey_results_schema.csv')
print(survey_schema.shape)
survey_schema.head()
(129, 2)
Out[2]:
Column QuestionText
0 Respondent Randomized respondent ID number (not in order ...
1 Hobby Do you code as a hobby?
2 OpenSource Do you contribute to open source projects?
3 Country In which country do you currently reside?
4 Student Are you currently enrolled in a formal, degree...

Some preprocessing is needed to read survey_results_public.csv without warnings

In [3]:
survey_results_file = 'data/survey_results_public.csv'
with open(survey_results_file) as f:
    header = f.readline().strip()
    
col_dtypes = {col:np.object_ for col in header.split(',')}    

assesjob_dtypes = {"AssessJob" + str(i): np.float64 for i in range(1, 11)}
col_dtypes.update(assesjob_dtypes)

assesbenefits_dtypes = {"AssessBenefits" + str(i): np.float64 for i in range(1, 12)}
col_dtypes.update(assesbenefits_dtypes)

jopcontacts_dtypes = {"JobContactPriorities" + str(i): np.float64 for i in range(1, 6)}
col_dtypes.update(jopcontacts_dtypes)

jobemail_dtypes = {"JobEmailPriorities" + str(i): np.float64 for i in range(1, 8)}
col_dtypes.update(jobemail_dtypes)

ads_dtypes = {"AdsPriorities" + str(i): np.float64 for i in range(1, 8)}
col_dtypes.update(ads_dtypes)

col_dtypes['ConvertedSalary'] = np.float64

survey_results = pd.read_csv('data/survey_results_public.csv', index_col='Respondent', dtype=col_dtypes)
In [32]:
print(survey_results.shape)
survey_results.head()
(98855, 128)
Out[32]:
Hobby OpenSource Country Student Employment FormalEducation UndergradMajor CompanySize DevType YearsCoding ... Exercise Gender SexualOrientation EducationParents RaceEthnicity Age Dependents MilitaryUS SurveyTooLong SurveyEasy
Respondent
1 Yes No Kenya No Employed part-time Bachelor’s degree (BA, BS, B.Eng., etc.) Mathematics or statistics 20 to 99 employees Full-stack developer 3-5 years ... 3 - 4 times per week Male Straight or heterosexual Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent 25 - 34 years old Yes NaN The survey was an appropriate length Very easy
3 Yes Yes United Kingdom No Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) A natural science (ex. biology, chemistry, phy... 10,000 or more employees Database administrator;DevOps specialist;Full-... 30 or more years ... Daily or almost every day Male Straight or heterosexual Bachelor’s degree (BA, BS, B.Eng., etc.) White or of European descent 35 - 44 years old Yes NaN The survey was an appropriate length Somewhat easy
4 Yes Yes United States No Employed full-time Associate degree Computer science, computer engineering, or sof... 20 to 99 employees Engineering manager;Full-stack developer 24-26 years ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 No No United States No Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) Computer science, computer engineering, or sof... 100 to 499 employees Full-stack developer 18-20 years ... I don't typically exercise Male Straight or heterosexual Some college/university study without earning ... White or of European descent 35 - 44 years old No No The survey was an appropriate length Somewhat easy
7 Yes No South Africa Yes, part-time Employed full-time Some college/university study without earning ... Computer science, computer engineering, or sof... 10,000 or more employees Data or business analyst;Desktop or enterprise... 6-8 years ... 3 - 4 times per week Male Straight or heterosexual Some college/university study without earning ... White or of European descent 18 - 24 years old Yes NaN The survey was an appropriate length Somewhat easy

5 rows × 128 columns

Functions to extract and format the data

In [4]:
from IPython.core.display import HTML

def format_questions(questions, answers, hide_count=20):
    questions_formatted = "<b>[" + questions['Column'] + "]</b> " + \
        questions['QuestionText']
    answers_formatted = answers.map(lambda answs: 
        "{0} options".format(len(answs)) if len(answs) > hide_count else "</li><li>".join(answs.astype('str')))
    answers_formatted = answers_formatted.map(lambda a:
        "<ul><li>" + a + "</li></ul>" if a else "<br/><br/>")
    questions_answers = pd.concat([questions_formatted, answers_formatted], axis=1)
    questions_answers_formatted = questions_answers[0] + questions_answers[1]
    formatted = "".join(questions_answers_formatted.values)
    return formatted
In [5]:
def get_options(questions, survey):
    return questions.apply(lambda q: survey[q['Column']].dropna().unique(), axis=1)
In [6]:
def get_rank_questions(prefix, count, schema):
    questions_names = [prefix+str(i) for i in range(1, count+1)]
    questions_mask = schema['Column'].apply(lambda s: s in questions_names)
    questions = schema[questions_mask]['QuestionText']  
    
    first_question = questions.iat[0]
    last_dot = first_question.rfind('.')
    question_text = first_question[:last_dot+1]
    question_options = questions.str[last_dot+1:]
    return question_text, question_options.values
In [7]:
def get_multiselect_options(questions, survey, separator=";"):
    combinations = questions['Column'].map(lambda column: survey[column].dropna().unique())
    options = combinations.map(lambda comb: np.unique(np.concatenate([c.split(separator) for c in comb])))
    options = options.rename(1)
    return options
In [8]:
def format_rank_questions(prefix, count, question, options):
    question_formatted = "<b>[" + prefix + "1-" + str(count) + "]</b> " + question
    options_formatted = "</li><li>".join(options)
    if options_formatted:
        options_formatted = "<ol><li>" + options_formatted + "</li></ol>" 
    else:
        options_formatted += "</br>"

    question_formatted += options_formatted
    return question_formatted

So, the formatted survey content (column names are in square brackets)

In [9]:
coding_questions = survey_schema.loc[1:2, :]
coding_options = get_options(coding_questions, survey_results)
HTML(format_questions(coding_questions, coding_options))
Out[9]:
[Hobby] Do you code as a hobby?
  • Yes
  • No
[OpenSource] Do you contribute to open source projects?
  • No
  • Yes
In [10]:
current_questions = survey_schema.loc[3:7, :]
current_options = get_options(current_questions, survey_results)
HTML(format_questions(current_questions, current_options))
Out[10]:
[Country] In which country do you currently reside?
  • 183 options
[Student] Are you currently enrolled in a formal, degree-granting college or university program?
  • No
  • Yes, part-time
  • Yes, full-time
[Employment] Which of the following best describes your current employment status?
  • Employed part-time
  • Employed full-time
  • Independent contractor, freelancer, or self-employed
  • Not employed, and not looking for work
  • Not employed, but looking for work
  • Retired
[FormalEducation] Which of the following best describes the highest level of formal education that you’ve completed?
  • Bachelor’s degree (BA, BS, B.Eng., etc.)
  • Associate degree
  • Some college/university study without earning a degree
  • Master’s degree (MA, MS, M.Eng., MBA, etc.)
  • Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)
  • Primary/elementary school
  • Professional degree (JD, MD, etc.)
  • I never completed any formal education
  • Other doctoral degree (Ph.D, Ed.D., etc.)
[UndergradMajor] You previously indicated that you went to a college or university. Which of the following best describes your main field of study (aka 'major')
  • Mathematics or statistics
  • A natural science (ex. biology, chemistry, physics)
  • Computer science, computer engineering, or software engineering
  • Fine arts or performing arts (ex. graphic design, music, studio art)
  • Information systems, information technology, or system administration
  • Another engineering discipline (ex. civil, electrical, mechanical)
  • A business discipline (ex. accounting, finance, marketing)
  • A social science (ex. anthropology, psychology, political science)
  • Web development or web design
  • A humanities discipline (ex. literature, history, philosophy)
  • A health science (ex. nursing, pharmacy, radiology)
  • I never declared a major
In [11]:
company_question = survey_schema.loc[8:8, :]
company_options = get_options(company_question, survey_results)

# DevType question is multiselect and needs different handling
dev_question = survey_schema.loc[9:9, :]
dev_options = get_multiselect_options(dev_question, survey_results)

experience_questions = survey_schema.loc[10:11, :]
experience_options = get_options(experience_questions, survey_results)

job_questions = pd.concat([company_question, dev_question, experience_questions])
job_options = pd.concat([company_options, dev_options, experience_options])

HTML(format_questions(job_questions, job_options))
Out[11]:
[CompanySize] Approximately how many people are employed by the company or organization you work for?
  • 20 to 99 employees
  • 10,000 or more employees
  • 100 to 499 employees
  • 10 to 19 employees
  • 500 to 999 employees
  • 1,000 to 4,999 employees
  • 5,000 to 9,999 employees
  • Fewer than 10 employees
[DevType] Which of the following describe you? Please select all that apply.
  • Back-end developer
  • C-suite executive (CEO, CTO, etc.)
  • Data or business analyst
  • Data scientist or machine learning specialist
  • Database administrator
  • Designer
  • Desktop or enterprise applications developer
  • DevOps specialist
  • Educator or academic researcher
  • Embedded applications or devices developer
  • Engineering manager
  • Front-end developer
  • Full-stack developer
  • Game or graphics developer
  • Marketing or sales professional
  • Mobile developer
  • Product manager
  • QA or test developer
  • Student
  • System administrator
[YearsCoding] Including any education, for how many years have you been coding?
  • 3-5 years
  • 30 or more years
  • 24-26 years
  • 18-20 years
  • 6-8 years
  • 9-11 years
  • 0-2 years
  • 15-17 years
  • 12-14 years
  • 21-23 years
  • 27-29 years
[YearsCodingProf] For how many years have you coded professionally (as a part of your work)?
  • 3-5 years
  • 18-20 years
  • 6-8 years
  • 12-14 years
  • 0-2 years
  • 21-23 years
  • 24-26 years
  • 9-11 years
  • 15-17 years
  • 27-29 years
  • 30 or more years
In [12]:
job_questions = survey_schema.loc[12:16, :]
job_options = get_options(job_questions, survey_results)
HTML(format_questions(job_questions, job_options))
Out[12]:
[JobSatisfaction] How satisfied are you with your current job? If you work more than one job, please answer regarding the one you spend the most hours on.
  • Extremely satisfied
  • Moderately dissatisfied
  • Moderately satisfied
  • Neither satisfied nor dissatisfied
  • Slightly satisfied
  • Slightly dissatisfied
  • Extremely dissatisfied
[CareerSatisfaction] Overall, how satisfied are you with your career thus far?
  • Extremely satisfied
  • Neither satisfied nor dissatisfied
  • Moderately satisfied
  • Slightly dissatisfied
  • Slightly satisfied
  • Moderately dissatisfied
  • Extremely dissatisfied
[HopeFiveYears] Which of the following best describes what you hope to be doing in five years?
  • Working as a founder or co-founder of my own company
  • Working in a different or more specialized technical role than the one I'm in now
  • Doing the same work
  • Working as an engineering manager or other functional manager
  • Working in a career completely unrelated to software development
  • Working as a product manager or project manager
  • Retirement
[JobSearchStatus] Which of the following best describes your current job-seeking status?
  • I’m not actively looking, but I am open to new opportunities
  • I am actively looking for a job
  • I am not interested in new job opportunities
[LastNewJob] When was the last time that you took a job with a new employer?
  • Less than a year ago
  • More than 4 years ago
  • Between 1 and 2 years ago
  • Between 2 and 4 years ago
  • I've never had a job
In [13]:
question_text, options = get_rank_questions('AssessJob', 10, survey_schema)
HTML(format_rank_questions('AssessJob', 10, question_text, options))
Out[13]:
[AssessJob1-10] Imagine that you are assessing a potential job opportunity. Please rank the following aspects of the job opportunity in order of importance (by dragging the choices up and down), where 1 is the most important and 10 is the least important.
  1. The industry that I'd be working in
  2. The financial performance or funding status of the company or organization
  3. The specific department or team I'd be working on
  4. The languages, frameworks, and other technologies I'd be working with
  5. The compensation and benefits offered
  6. The office environment or company culture
  7. The opportunity to work from home/remotely
  8. Opportunities for professional development
  9. The diversity of the company or organization
  10. How widely used or impactful the product or service I'd be working on is
In [14]:
question_text, options = get_rank_questions('AssessBenefits', 11, survey_schema)
HTML(format_rank_questions('AssessBenefits', 11, question_text, options))
Out[14]:
[AssessBenefits1-11] Now, imagine you are assessing a job's benefits package. Please rank the following aspects of a job's benefits package from most to least important to you (by dragging the choices up and down), where 1 is most important and 11 is least important.
  1. Salary and/or bonuses
  2. Stock options or shares
  3. Health insurance
  4. Parental leave
  5. Fitness or wellness benefit (ex. gym membership, nutritionist)
  6. Retirement or pension savings matching
  7. Company-provided meals or snacks
  8. Computer/office equipment allowance
  9. Childcare benefit
  10. Transportation benefit (ex. company-provided transportation, public transit allowance)
  11. Conference or education budget
In [15]:
question_text, options = get_rank_questions('JobContactPriorities', 5, survey_schema)
HTML(format_rank_questions('JobContactPriorities', 5, question_text, options))
Out[15]:
[JobContactPriorities1-5] Imagine that a company wanted to contact you about a job that is a good fit for you. Please rank your preference in how you are contacted (by dragging the choices up and down), where 1 is the most preferred and 5 is the least preferred.
  1. Telephone call
  2. Email to my private address
  3. Email to my work address
  4. Message on a job site
  5. Message on a social media site
In [16]:
question_text, options = get_rank_questions('JobEmailPriorities', 7, survey_schema)
HTML(format_rank_questions('JobEmailPriorities', 7, question_text, options))
Out[16]:
[JobEmailPriorities1-7] Imagine that same company decided to contact you through email. Please rank the following items by how important it is to include them in the message (by dragging the choices up and down), where 1 is the most important and 7 is the least important.
  1. Details on the company I'd be working for
  2. Details on the specific department I'd be working for or product I'd be working on
  3. Specifics of why they think I'd be a good fit for the role (ex. my prior work history, projects on GitHub)
  4. Details of which technologies I'd be working with
  5. An estimate of the compensation range
  6. Information on the company's hiring process
  7. Details on the company's product development process
In [17]:
salary_questions = survey_schema.loc[50:55,:]
salary_questions_options = get_options(salary_questions, survey_results)

# Salary amount questions had arbitrary input
salary_questions_options.at[52] = np.empty(0)
salary_questions_options.at[54] = np.empty(0)

communication_tool_question = survey_schema.loc[56:56,:]
communication_tool_options = get_multiselect_options(communication_tool_question, survey_results)

productivity_question = survey_schema.loc[57:57, :]
productivity_options = get_options(productivity_question, survey_results)

earning_questions = pd.concat([salary_questions, communication_tool_question, productivity_question])
earning_options = pd.concat([salary_questions_options, communication_tool_options, productivity_options])

HTML(format_questions(earning_questions, earning_options))
Out[17]:
[UpdateCV] Think back to the last time you updated your resumé, CV, or an online profile on a job site. What is the main reason that you did so?
  • My job status or other personal status changed
  • I saw an employer’s advertisement
  • A recruiter contacted me
  • I did not receive an expected change in compensation
  • A friend told me about a job opportunity
  • I had a negative experience or interaction at work
  • I received bad news about the future of my company or department
  • I received negative feedback on my job performance
[Currency] Which currency do you use day-to-day? If your answer is complicated, please pick the one you're most comfortable estimating in.
  • British pounds sterling (£)
  • U.S. dollars ($)
  • South African rands (R)
  • Euros (€)
  • Swedish kroner (SEK)
  • Australian dollars (A$)
  • Indian rupees (₹)
  • Polish złoty (zł)
  • Russian rubles (₽)
  • Danish krone (kr)
  • Chinese yuan renminbi (¥)
  • Japanese yen (¥)
  • Brazilian reais (R$)
  • Canadian dollars (C$)
  • Mexican pesos (MXN$)
  • Norwegian krone (kr)
  • Swiss francs
  • Singapore dollars (S$)
  • Bitcoin (btc)
[Salary] What is your current gross salary (before taxes and deductions), in ${q://QID50/ChoiceGroup/SelectedChoicesTextEntry}? Please enter a whole number in the box below, without any punctuation. If you are paid hourly, please estimate an equivalent weekly, monthly, or yearly salary. If you prefer not to answer, please leave the box empty.

[SalaryType] Is that salary weekly, monthly, or yearly?
  • Monthly
  • Yearly
  • Weekly
[ConvertedSalary] Salary converted to annual USD salaries using the exchange rate on 2018-01-18, assuming 12 working months and 50 working weeks.

[CurrencySymbol] Three digit currency abbreviation.
  • 112 options
[CommunicationTools] Which of the following tools do you use to communicate, coordinate, or share knowledge with your coworkers? Please select all that apply.
  • Confluence
  • Facebook
  • Google Hangouts/Chat
  • HipChat
  • Jira
  • Office / productivity suite (Microsoft Office, Google Suite, etc.)
  • Other chat system (IRC, proprietary software, etc.)
  • Other wiki tool (Github, Google Sites, proprietary software, etc.)
  • Slack
  • Stack Overflow Enterprise
  • Trello
[TimeFullyProductive] Suppose a new developer with four years of experience, including direct experience working with your company's main technical stack, joined your team tomorrow. All other things being equal, how long would you expect it to take before they were fully productive and contributing at a typical level to your main code base?
  • One to three months
  • Three to six months
  • Less than a month
  • Six to nine months
  • More than a year
  • Nine months to a year
In [18]:
edu_types_question = survey_schema.loc[58:58, :]
edu_types_options = get_multiselect_options(edu_types_question, survey_results)

self_taught_question = survey_schema.loc[59:59, :]
self_taught_options = get_multiselect_options(self_taught_question, survey_results)

bootcamp_questions = survey_schema.loc[60:60, :]
bootcamp_questions_options = get_options(bootcamp_questions, survey_results)

hackathon_questions = survey_schema.loc[61:61, :]
hackathon_questions_options = get_multiselect_options(hackathon_questions, survey_results)

training_questions = pd.concat([
    edu_types_question,
    self_taught_question,
    bootcamp_questions,
    hackathon_questions
])
training_options = pd.concat([
    edu_types_options,
    self_taught_options,
    bootcamp_questions_options,
    hackathon_questions_options
])

HTML(format_questions(training_questions, training_options))
Out[18]:
[EducationTypes] Which of the following types of non-degree education have you used or participated in? Please select all that apply.
  • Completed an industry certification program (e.g. MCPD)
  • Contributed to open source software
  • Participated in a full-time developer training program or bootcamp
  • Participated in a hackathon
  • Participated in online coding competitions (e.g. HackerRank, CodeChef, TopCoder)
  • Received on-the-job training in software development
  • Taken a part-time in-person course in programming or software development
  • Taken an online course in programming or software development (e.g. a MOOC)
  • Taught yourself a new language, framework, or tool without taking a formal course
[SelfTaughtTypes] You indicated that you had taught yourself a programming technology without taking a course. What resources did you use to do that? If you’ve done it more than once, please think about the most recent time you’ve done so. Please select all that apply.
  • A book or e-book from O’Reilly, Apress, or a similar publisher
  • A college/university computer science or software engineering book
  • Internal Wikis, chat rooms, or documentation set up by my company for employees
  • Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.)
  • Pre-scheduled tutoring or mentoring sessions with a friend or colleague
  • Questions & answers on Stack Overflow
  • Tapping your network of friends, family, and peers versed in the technology
  • The official documentation and/or standards for the technology
  • The technology’s online help system
[TimeAfterBootcamp] You indicated previously that you went through a developer training program or bootcamp. How long did it take you to get a full-time job as a developer after graduating?
  • Immediately after graduating
  • I already had a full-time job as a developer when I began the program
  • Four to six months
  • One to three months
  • Six months to a year
  • I haven’t gotten a developer job
  • Less than a month
  • Longer than a year
[HackathonReasons] You indicated previously that you had participated in an online coding competition or hackathon. Which of the following best describe your reasons for doing so?
  • Because I find it enjoyable
  • To build my professional network
  • To help me find new job opportunities
  • To improve my ability to work on a team with other programmers
  • To improve my general technical skills or programming ability
  • To improve my knowledge of a specific programming language, framework, or other technology
  • To win prizes or cash awards
In [19]:
agree_questions = survey_schema.loc[62:64, :]
agree_options = get_options(agree_questions, survey_results)
HTML(format_questions(agree_questions, agree_options))
Out[19]:
[AgreeDisagree1] To what extent do you agree or disagree with each of the following statements? I feel a sense of kinship or connection to other developers
  • Strongly agree
  • Agree
  • Disagree
  • Neither Agree nor Disagree
  • Strongly disagree
[AgreeDisagree2] To what extent do you agree or disagree with each of the following statements? I think of myself as competing with my peers
  • Strongly agree
  • Agree
  • Disagree
  • Neither Agree nor Disagree
  • Strongly disagree
[AgreeDisagree3] To what extent do you agree or disagree with each of the following statements? I'm not as good at programming as most of my peers
  • Neither Agree nor Disagree
  • Strongly disagree
  • Strongly agree
  • Disagree
  • Agree
In [20]:
technologies_questions = survey_schema.loc[65:72, :]
technologies_options = get_multiselect_options(technologies_questions, survey_results)
HTML(format_questions(technologies_questions, technologies_options))
Out[20]:
[LanguageWorkedWith] Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)
  • 38 options
[LanguageDesireNextYear] Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)
  • 38 options
[DatabaseWorkedWith] Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.)
  • 21 options
[DatabaseDesireNextYear] Which of the following database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the database and want to continue to do so, please check both boxes in that row.)
  • 21 options
[PlatformWorkedWith] Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.)
  • 26 options
[PlatformDesireNextYear] Which of the following platforms have you done extensive development work for over the past year? (If you both developed for the platform and want to continue to do so, please check both boxes in that row.)
  • 26 options
[FrameworkWorkedWith] Which of the following libraries, frameworks, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year?
  • .NET Core
  • Angular
  • Cordova
  • Django
  • Hadoop
  • Node.js
  • React
  • Spark
  • Spring
  • TensorFlow
  • Torch/PyTorch
  • Xamarin
[FrameworkDesireNextYear] Which of the following libraries, frameworks, and tools have you done extensive development work in over the past year, and which do you want to work in over the next year?
  • .NET Core
  • Angular
  • Cordova
  • Django
  • Hadoop
  • Node.js
  • React
  • Spark
  • Spring
  • TensorFlow
  • Torch/PyTorch
  • Xamarin
In [21]:
env_questions = survey_schema.loc[73:78, :]
env_options = get_multiselect_options(env_questions, survey_results)
HTML(format_questions(env_questions, env_options))
Out[21]:
[IDE] Which development environment(s) do you use regularly? Please check all that apply.
  • 22 options
[OperatingSystem] What is the primary operating system in which you work?
  • BSD/Unix
  • Linux-based
  • MacOS
  • Windows
[NumberMonitors] How many monitors are set up at your workstation?
  • 1
  • 2
  • 3
  • 4
  • More than 4
[Methodology] Which of the following methodologies do you have experience working in?
  • Agile
  • Evidence-based software engineering
  • Extreme programming (XP)
  • Formal standard such as ISO 9001 or IEEE 12207 (aka “waterfall” methodologies)
  • Kanban
  • Lean
  • Mob programming
  • PRINCE2
  • Pair programming
  • Scrum
[VersionControl] What version control systems do you use regularly? Please select all that apply.
  • Copying and pasting files to network shares
  • Git
  • I don't use version control
  • Mercurial
  • Subversion
  • Team Foundation Version Control
  • Zip file back-ups
[CheckInCode] Over the last year, how often have you checked-in or committed code?
  • A few times per week
  • Less than once per month
  • Multiple times per day
  • Never
  • Once a day
  • Weekly or a few times per month
In [22]:
adbloker_questions = survey_schema.loc[79:81, :]
adbloker_options = get_multiselect_options(adbloker_questions, survey_results)

ads_agree_questions = survey_schema.loc[82:84, :]
ads_agree_options = get_options(ads_agree_questions, survey_results)

ads_actions_question = survey_schema.loc[85:85, :]
ads_actions_options = get_multiselect_options(ads_actions_question, survey_results)

ads_questions = pd.concat([adbloker_questions, ads_agree_questions, ads_actions_question])
ads_options = pd.concat([adbloker_options, ads_agree_options, ads_actions_options])

HTML(format_questions(ads_questions, ads_options))
Out[22]:
[AdBlocker] Do you have ad-blocking software installed on any computers you use regularly?
  • I'm not sure/I don't know
  • No
  • Yes
[AdBlockerDisable] In the past month, have you disabled your ad blocker for any reason, even temporarily or for a specific website?
  • I'm not sure/I can't remember
  • No
  • Yes
[AdBlockerReasons] What are the reasons that you have disabled your ad blocker in the past month? Please select all that apply.
  • I wanted to support the website I was visiting by viewing their ads
  • I wanted to view a specific advertisement
  • The ad-blocking software was causing display issues on a website
  • The website I was visiting asked me to disable it
  • The website I was visiting forced me to disable it to access their content
  • The website I was visiting has interesting ads
[AdsAgreeDisagree1] To what extent do you agree or disagree with the following statements: Online advertising can be valuable when it is relevant to me
  • Strongly agree
  • Somewhat agree
  • Neither agree nor disagree
  • Somewhat disagree
  • Strongly disagree
[AdsAgreeDisagree2] To what extent do you agree or disagree with the following statements: I enjoy seeing online updates from companies that I like
  • Strongly agree
  • Neither agree nor disagree
  • Somewhat agree
  • Strongly disagree
  • Somewhat disagree
[AdsAgreeDisagree3] To what extent do you agree or disagree with the following statements: I fundamentally dislike the concept of advertising
  • Strongly agree
  • Neither agree nor disagree
  • Somewhat agree
  • Somewhat disagree
  • Strongly disagree
[AdsActions] Which of the following actions have you taken in the past month? Please select all that apply.
  • Clicked on an online advertisement
  • Paid to access a website advertisement-free
  • Saw an online advertisement and then researched it (without clicking on the ad)
  • Stopped going to a website because of their advertising
In [23]:
question_text, options = get_rank_questions('AdsPriorities', 7, survey_schema)
HTML(format_rank_questions('AdsPriorities', 7, question_text, options))
Out[23]:
[AdsPriorities1-7] Please rank the following advertising qualities in order of their importance to you (by dragging the choices up and down), where 1 is the most important, and 7 is the least important.
  1. The advertisement is relevant to me
  2. The advertisement is honest about its goals
  3. The advertisement provides useful information
  4. The advertisement seems trustworthy
  5. The advertisement is from a company that I like
  6. The advertisement offers something of value, like a free trial
  7. The advertisement avoids fluffy or vague language
In [24]:
ai_ethic_questions = survey_schema.loc[93:100, :]
ai_ethic_options = get_options(ai_ethic_questions, survey_results)
HTML(format_questions(ai_ethic_questions, ai_ethic_options))
Out[24]:
[AIDangerous] What do you think is the most dangerous aspect of increasingly advanced AI technology?
  • Artificial intelligence surpassing human intelligence ("the singularity")
  • Increasing automation of jobs
  • Algorithms making important decisions
  • Evolving definitions of "fairness" in algorithmic versus human decisions
[AIInteresting] What do you think is the most exciting aspect of increasingly advanced AI technology?
  • Algorithms making important decisions
  • Increasing automation of jobs
  • Artificial intelligence surpassing human intelligence ("the singularity")
  • Evolving definitions of "fairness" in algorithmic versus human decisions
[AIResponsible] Whose responsibility is it, primarily, to consider the ramifications of increasingly advanced AI technology?
  • The developers or the people creating the AI
  • A governmental or other regulatory body
  • Prominent industry leaders
  • Nobody
[AIFuture] Overall, what's your take on the future of artificial intelligence?
  • I'm excited about the possibilities more than worried about the dangers.
  • I don't care about it, or I haven't thought about it.
  • I'm worried about the dangers more than I'm excited about the possibilities.
[EthicsChoice] Imagine that you were asked to write code for a purpose or product that you consider extremely unethical. Do you write the code anyway?
  • No
  • Depends on what it is
  • Yes
[EthicsReport] Do you report or otherwise call out the unethical code in question?
  • Yes, and publicly
  • Depends on what it is
  • Yes, but only within the company
  • No
[EthicsResponsible] Who do you believe is ultimately most responsible for code that accomplishes something unethical?
  • Upper management at the company/organization
  • The developer who wrote it
  • The person who came up with the idea
[EthicalImplications] Do you believe that you have an obligation to consider the ethical implications of the code that you write?
  • Yes
  • Unsure / I don't know
  • No
In [25]:
so_questions = survey_schema.loc[102:108, :]
so_options = get_options(so_questions, survey_results)
HTML(format_questions(so_questions, so_options))
Out[25]:
[StackOverflowVisit] How frequently would you say you visit Stack Overflow?
  • Multiple times per day
  • A few times per month or weekly
  • A few times per week
  • Daily or almost daily
  • I have never visited Stack Overflow (before today)
  • Less than once per month or monthly
[StackOverflowHasAccount] Do you have a Stack Overflow account?
  • Yes
  • I'm not sure / I can't remember
  • No
[StackOverflowParticipate] How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.
  • I have never participated in Q&A on Stack Overflow
  • A few times per month or weekly
  • Less than once per month or monthly
  • Daily or almost daily
  • A few times per week
  • Multiple times per day
[StackOverflowJobs] Have you ever used or visited Stack Overflow Jobs?
  • No, I knew that Stack Overflow had a jobs board but have never used or visited it
  • Yes
  • No, I didn't know that Stack Overflow had a jobs board
[StackOverflowDevStory] Do you have an up-to-date Developer Story on Stack Overflow?
  • Yes
  • No, I have one but it's out of date
  • No, I know what it is but I don't have one
  • No, and I don't know what that is
[StackOverflowJobsRecommend] How likely is it that you would recommend Stack Overflow Jobs to a friend or colleague? Where 0 is not likely at all and 10 is very likely.
  • 7
  • 8
  • 10 (Very Likely)
  • 6
  • 5
  • 9
  • 1
  • 2
  • 0 (Not Likely)
  • 3
  • 4
[StackOverflowConsiderMember] Do you consider yourself a member of the Stack Overflow community?
  • Yes
  • No
  • I'm not sure
In [26]:
question_text, options = get_rank_questions('HypotheticalTools', 5, survey_schema)
HTML(format_rank_questions('HypotheticalTools', 5, question_text, options))
Out[26]:
[HypotheticalTools1-5] Please rate your interest in participating in each of the following hypothetical tools on Stack Overflow, where 1 is not at all interested and 5 is extremely interested.
  1. A peer mentoring system
  2. A private area for people new to programming
  3. A programming-oriented blog platform
  4. An employer or job review system
  5. An area for Q&A related to career growth
In [27]:
health_questions = survey_schema.loc[114:117, :]
health_options = get_options(health_questions, survey_results)

multiselect_questions = survey_schema.loc[118:121, :]
multiselect_options = get_multiselect_options(multiselect_questions, survey_results)

personal_questions = pd.concat([health_questions, multiselect_questions])
personal_options = pd.concat([health_options, multiselect_options])

HTML(format_questions(personal_questions, personal_options))
Out[27]:
[WakeTime] On days when you work, what time do you typically wake up?
  • Between 5:00 - 6:00 AM
  • Between 6:01 - 7:00 AM
  • Before 5:00 AM
  • Between 7:01 - 8:00 AM
  • Between 9:01 - 10:00 AM
  • I do not have a set schedule
  • Between 8:01 - 9:00 AM
  • Between 10:01 - 11:00 AM
  • Between 11:01 AM - 12:00 PM
  • After 12:01 PM
  • I work night shifts
[HoursComputer] On a typical day, how much time do you spend on a desktop or laptop computer?
  • 9 - 12 hours
  • 5 - 8 hours
  • Over 12 hours
  • 1 - 4 hours
  • Less than 1 hour
[HoursOutside] On a typical day, how much time do you spend outside?
  • 1 - 2 hours
  • 30 - 59 minutes
  • Less than 30 minutes
  • 3 - 4 hours
  • Over 4 hours
[SkipMeals] In a typical week, how many times do you skip a meal in order to be more productive?
  • Never
  • 3 - 4 times per week
  • 1 - 2 times per week
  • Daily or almost every day
[ErgonomicDevices] What ergonomic furniture or devices do you use on a regular basis? Please select all that apply.
  • Ergonomic keyboard or mouse
  • Fatigue-relieving floor mat
  • Standing desk
  • Wrist/hand supports or braces
[Exercise] In a typical week, how many times do you exercise?
  • 1 - 2 times per week
  • 3 - 4 times per week
  • Daily or almost every day
  • I don't typically exercise
[Gender] Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank.
  • Female
  • Male
  • Non-binary, genderqueer, or gender non-conforming
  • Transgender
[SexualOrientation] Which of the following do you currently identify as? Please select all that apply. If you prefer not to answer, you may leave this question blank.
  • Asexual
  • Bisexual or Queer
  • Gay or Lesbian
  • Straight or heterosexual
In [28]:
family_questions = survey_schema.loc[122:126, :]
family_options = get_multiselect_options(family_questions, survey_results)
HTML(format_questions(family_questions, family_options))
Out[28]:
[EducationParents] What is the highest level of education received by either of your parents? If you prefer not to answer, you may leave this question blank.
  • Associate degree
  • Bachelor’s degree (BA, BS, B.Eng., etc.)
  • Master’s degree (MA, MS, M.Eng., MBA, etc.)
  • Other doctoral degree (Ph.D, Ed.D., etc.)
  • Primary/elementary school
  • Professional degree (JD, MD, etc.)
  • Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)
  • Some college/university study without earning a degree
  • They never completed any formal education
[RaceEthnicity] Which of the following do you identify as? Please check all that apply. If you prefer not to answer, you may leave this question blank.
  • Black or of African descent
  • East Asian
  • Hispanic or Latino/Latina
  • Middle Eastern
  • Native American, Pacific Islander, or Indigenous Australian
  • South Asian
  • White or of European descent
[Age] What is your age? If you prefer not to answer, you may leave this question blank.
  • 18 - 24 years old
  • 25 - 34 years old
  • 35 - 44 years old
  • 45 - 54 years old
  • 55 - 64 years old
  • 65 years or older
  • Under 18 years old
[Dependents] Do you have any children or other dependents that you care for? If you prefer not to answer, you may leave this question blank.
  • No
  • Yes
[MilitaryUS] Are you currently serving or have you ever served in the U.S. Military?
  • No
  • Yes
In [29]:
survey_questions = survey_schema.loc[127:128, :]
survey_options = get_multiselect_options(survey_questions, survey_results)
HTML(format_questions(survey_questions, survey_options))
Out[29]:
[SurveyTooLong] How do you feel about the length of the survey that you just completed?
  • The survey was an appropriate length
  • The survey was too long
  • The survey was too short
[SurveyEasy] How easy or difficult was this survey to complete?
  • Neither easy nor difficult
  • Somewhat difficult
  • Somewhat easy
  • Very difficult
  • Very easy

1.2 Features conversion

As it can be seen from the long list above, most data in the columns of survey results are just strings with repeating values. Since this is not a NLP task, they have to be converted into something more suitable even for performing simple feature analysis.

The target variable of this task is stored in the CareerSatisfaction column, and, sure, it should never be null during prediction. Beside this, it will easier to handle this target in the form of numeric grade.

In [33]:
from sklearn.preprocessing import LabelEncoder
import math
In [34]:
survey_results = survey_results[survey_results['CareerSatisfaction'].notna()]
full_df = pd.DataFrame(index=survey_results.index)
In [45]:
satisfaction_map = {
    'Extremely satisfied' : 7.0,
    'Moderately satisfied' : 6.0,
    'Slightly satisfied' : 5.0,
    'Neither satisfied nor dissatisfied' : 4.0,
    'Slightly dissatisfied' : 3.0,
    'Moderately dissatisfied' : 2.0,
    'Extremely dissatisfied' : 1.0,
}
In [35]:
full_df['CareerSatisfaction'] = survey_results['CareerSatisfaction'].map(satisfaction_map)

The mentioned earlier feature JobSatisfaction seems to be too similar to the target value. It is more interesting to try predicting CareerSatisfaction using more indirect factors, so JobSatisfaction is not included into the features list at this stage. Maybe we'll have to return to it later.

There are a few Yes/No question in the survey, it is reasonable to convert them to binary 1/0 int variables.

In [36]:
yes_no_map = {'Yes': 1, 'No': 0}
full_df['Hobby'] = survey_results['Hobby'].map(yes_no_map)
full_df['OpenSource'] = survey_results['OpenSource'].map(yes_no_map)
full_df['Dependents'] = survey_results['Dependents'].map(yes_no_map)
full_df['MilitaryUS'] = survey_results['MilitaryUS'].map(yes_no_map)

Answers to some other questions cannot be directly converted into numeric format, though there are finit set of options for each of them. Obvious decision is to to interpret them as categorical features. LabelEncoder from sklearn.preprocessing allows to assign them numeric labels.

In [4]:
categorize_cols = [
    'Country',
    'Student',
    'Employment',
    'FormalEducation',
    'UndergradMajor',
    'HopeFiveYears',
    'JobSearchStatus',
    'UpdateCV',
    'SalaryType',
    'Currency',
    'OperatingSystem',
    'CheckInCode',
    'AdBlocker',
    'AdBlockerDisable',
    'AIDangerous',
    'AIInteresting',
    'AIResponsible',
    'AIFuture',
    'EthicsChoice',
    'EthicsReport',
    'EthicsResponsible',
    'EthicalImplications',
    'StackOverflowVisit',
    'StackOverflowHasAccount',
    'StackOverflowParticipate',
    'StackOverflowJobs',
    'StackOverflowDevStory',
    'StackOverflowConsiderMember',
    'EducationParents',
    'SurveyTooLong',
    'SurveyEasy'
]
In [38]:
category_encoders = {}

for column in categorize_cols:
    category_encoders[column] = LabelEncoder()
    to_categorize = survey_results[column].fillna('Unknown')
    full_df[column] = category_encoders[column].fit_transform(to_categorize)
    full_df[column] = full_df[column].astype('category')

Most of the columns that were loaded with implicit dtype=np.float64 designation in the begining of the notebook are "rank" questions colums. Corresponding questions look like "Please rank the ..., where 1 is the most important and ... is the least important."
They are numeric already, so just copy them to the features dataframe.

In [39]:
rank_cols = list(assesjob_dtypes.keys()) + \
    list(assesbenefits_dtypes.keys()) + \
    list(jopcontacts_dtypes.keys()) + \
    list(jobemail_dtypes.keys()) + \
    list(ads_dtypes.keys())

full_df[rank_cols] = survey_results[rank_cols]

The ConvertedSalary feature was initially float64 too.

In [40]:
full_df['ConvertedSalary'] = survey_results['ConvertedSalary']

The Salary column contains values with thousand separator. It has to be removed to convert values into float64.

In [41]:
full_df['Salary'] = survey_results['Salary'].str.replace(',', '').astype('float64')

There were questions in the survey that allowed multiselect options. Results of such questions are stored in the corresponded columns concatenated with ';' separator. There are too many combinations in such columns and there's no sense to categorize them in this form. It is much better to split them and convert kind of one-hot-encoding columns, for example. However, the initial options' texts should also be saved somewhere in the readable format for the case of further detailed analysis.

In [42]:
multiselect_columns_descriptions = {}

def create_binary_columns(multiselect_column, survey, result_df, separator=';'):
    combinations = survey[multiselect_column].dropna().unique()
    options = np.unique(np.concatenate([c.split(separator) for c in combinations]))    
    for i, option in enumerate(options):
        binary_column = multiselect_column + "_" + str(i)
        multiselect_columns_descriptions[binary_column] = option
        result_df[binary_column] = survey[multiselect_column].fillna("").map(lambda comb: int(option in comb))
In [43]:
multiselect_columns = [
    'DevType',
    'CommunicationTools',
    'EducationTypes',
    'SelfTaughtTypes',
    'HackathonReasons',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'DatabaseWorkedWith',
    'DatabaseDesireNextYear',
    'PlatformWorkedWith',
    'PlatformDesireNextYear',
    'FrameworkWorkedWith',
    'FrameworkDesireNextYear',
    'IDE',
    'Methodology',
    'VersionControl',
    'AdBlockerReasons',
    'AdsActions',
    'ErgonomicDevices',
    'Gender',
    'SexualOrientation',
    'RaceEthnicity'
]

for column in multiselect_columns:
    create_binary_columns(column, survey_results, full_df)

Columns like CompanySize with options "N to M employees" or YearsCoding with options 'N to M years' are, again, categorical. But they differ from categorical columns handled above because:

  1. Their options can be ordered, that can give useful computational information;
  2. Their options ratio can cary useful patterns as well;
  3. </ol> It seems reasonable to encode them by the labels that have numerical correspondence with the category. For example, the middle value of the specified interval. Some special categories ("More than..." e.g.) have to be encoded with special value, unfortunately.

In [44]:
def get_middle(low, top):
    return low + (top-low)/2
In [45]:
company_size_map = {
    'Fewer than 10 employees' : get_middle(0, 10),
    '10 to 19 employees' : get_middle(10, 20),
    '20 to 99 employees' : get_middle(20, 100),
    '100 to 499 employees' : get_middle(100, 500),
    '500 to 999 employees' : get_middle(500, 1000),
    '1,000 to 4,999 employees': get_middle(1000, 5000),
    '5,000 to 9,999 employees' : get_middle(5000, 10000),
    '10,000 or more employees' : get_middle(10000, 50000)
}

full_df['CompanySize'] = survey_results['CompanySize'].map(company_size_map)

In [46]:
coding_years_map = {
    '0-2 years' : get_middle(0, 2),
    '3-5 years' : get_middle(3, 5),
    '6-8 years' : get_middle(6, 8),
    '9-11 years' : get_middle(9, 11),
    '12-14 years' : get_middle(12, 14),
    '15-17 years' : get_middle(15, 17),
    '18-20 years' : get_middle(18, 20),
    '21-23 years' : get_middle(21, 23),
    '24-26 years' : get_middle(24, 26),
    '27-29 years' : get_middle(27, 29),
    '30 or more years' : 35.0
}

full_df['YearsCoding'] = survey_results['YearsCoding'].map(coding_years_map)
full_df['YearsCodingProf'] = survey_results['YearsCodingProf'].map(coding_years_map)
In [47]:
last_new_job_map = {
    "I've never had a job" : 0.0,
    'Less than a year ago' : 0.5,
    'Between 1 and 2 years ago' : get_middle(1, 2),
    'Between 2 and 4 years ago' : get_middle(2, 4),
    'More than 4 years ago' : 8.0,
}

full_df['LastNewJob'] = survey_results['LastNewJob'].map(last_new_job_map)
In [48]:
time_productive_map = {
    'Less than a month' : 0.5,
    'One to three months' : get_middle(1, 3),
    'Three to six months' : get_middle(3, 6),
    'Six to nine months' : get_middle(6, 9),
    'Nine months to a year': get_middle(9, 12),
    'More than a year' : 18.0
}

full_df['TimeFullyProductive'] = survey_results['TimeFullyProductive'].map(time_productive_map)
In [49]:
bootcamp_time = {
    'I already had a full-time job as a developer when I began the program' : -1.0,
    'Immediately after graduating' : 0.0,
    'Less than a month' : get_middle(0, 1),
    'One to three months' : get_middle(1, 3),
    'Four to six months' : get_middle(4, 6),    
    'Six months to a year' : get_middle(6, 12),
    'Longer than a year' : get_middle(12, 24),
    'I haven’t gotten a developer job' : 12000
}

full_df['TimeAfterBootcamp'] = survey_results['TimeAfterBootcamp'].map(bootcamp_time)
In [50]:
wake_time_map = {
    "Before 5:00 AM" : 5.0,
    "Between 5:00 - 6:00 AM" : 6.0,
    "Between 6:01 - 7:00 AM" : 7.0,
    "Between 7:01 - 8:00 AM" : 8.0,
    "Between 8:01 - 9:00 AM" : 9.0,
    "Between 9:01 - 10:00 AM" : 10.0,
    "Between 10:01 - 11:00 AM" : 11.0,
    "Between 11:01 AM - 12:00 PM" : 12.0,
    "After 12:01 PM" : 13.0,
    "I work night shifts" : 0.0,
    "I do not have a set schedule" : -1
}

full_df['WakeTime'] = survey_results['WakeTime'].map(wake_time_map)
In [51]:
hours_computing_map = {
    "Less than 1 hour" : 0.5,
    "1 - 4 hours" : get_middle(2, 4),
    "5 - 8 hours" : get_middle(5, 8),
    "9 - 12 hours" : get_middle(9, 12),
    "Over 12 hours" : 14.0
}

full_df['HoursComputer'] = survey_results['HoursComputer'].map(hours_computing_map)
In [52]:
hours_outside_map = {
    "Less than 30 minutes" : 0.25,
    "30 - 59 minutes" : get_middle(0.5, 1.0),
    "1 - 2 hours" : get_middle(1, 2),
    "3 - 4 hours" : get_middle(3, 4),
    "Over 4 hours" : 6.0
}

full_df['HoursOutside'] = survey_results['HoursOutside'].map(hours_outside_map)
In [53]:
skip_meals_map = {
    "Never" : 0.0,
    "1 - 2 times per week" : get_middle(1, 2),
    "3 - 4 times per week" : get_middle(3, 4),
    "Daily or almost every day" : 6.0
}

full_df['SkipMeals'] = survey_results['SkipMeals'].map(skip_meals_map)
In [54]:
exercise_map = {
    "I don't typically exercise" : 0.0,
    "1 - 2 times per week" : get_middle(1, 2),
    "3 - 4 times per week" : get_middle(3, 4),
    "Daily or almost every day" : 6.0,
}

full_df['Exercise'] = survey_results['Exercise'].map(exercise_map)
In [55]:
age_map = {
    "Under 18 years old" : 16.0,
    "18 - 24 years old" : get_middle(18, 24),
    "25 - 34 years old" : get_middle(25, 34),
    "35 - 44 years old" : get_middle(35, 44),
    "45 - 54 years old" : get_middle(45, 54),
    "55 - 64 years old" : get_middle(55, 64),
    "65 years or older" : get_middle(65, 75)
}

full_df['Age'] = survey_results['Age'].map(age_map)

Some columns can be mapped to 'rank' numeric format as well, but only manually.

In [56]:
agree_degree_map = {
    'strongly agree' : 5.0,
    'agree' : 4.0,
    'somewhat agree' : 4.0,
    'disagree' : 3.0,
    'somewhat disagree' : 3.0,
    'neither agree nor disagree' : 2.0,
    'strongly disagree': 1.0,
}

agree_cols = [
    'AgreeDisagree1',
    'AgreeDisagree2',
    'AgreeDisagree3',
    'AdsAgreeDisagree1',
    'AdsAgreeDisagree2',
    'AdsAgreeDisagree3'
]

for column in agree_cols:
    full_df[column] = survey_results[column].str.lower().map(agree_degree_map)
In [57]:
hypothetical_tools_map = {
    'Not at all interested' : 1.0,
    'A little bit interested' : 2.0,
    'Somewhat interested' : 3.0,
    'Very interested' : 4.0,
    'Extremely interested' : 5.0
}

for i in range(1,6):
    column = "HypotheticalTools" + str(i)
    full_df[column] = survey_results[column].map(hypothetical_tools_map)

NumberMonitors and StackOverflowJobsRecommend columns can be easily converted to numeric format if apply simple string processing.

In [58]:
full_df['NumberMonitors'] = survey_results['NumberMonitors'].map(
    lambda mon: 6.0 if (not pd.isna(mon) and ("More" in mon)) else float(mon))
In [59]:
full_df['StackOverflowJobsRecommend'] = \
    survey_results['StackOverflowJobsRecommend'].map(lambda rec:
        rec if pd.isna(rec) else float(rec.strip(" (Very Likely)").strip(" (Not Likely)")) )

Save converted features to disk not to run that all each time

In [60]:
print(full_df.shape)
full_df.head()
(76504, 421)
Out[60]:
CareerSatisfaction Hobby OpenSource Dependents MilitaryUS Country Student Employment FormalEducation UndergradMajor ... AdsAgreeDisagree1 AdsAgreeDisagree2 AdsAgreeDisagree3 HypotheticalTools1 HypotheticalTools2 HypotheticalTools3 HypotheticalTools4 HypotheticalTools5 NumberMonitors StackOverflowJobsRecommend
Respondent
1 7.0 1 0 1.0 NaN 82 0 1 1 10 ... 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 1.0 NaN
3 4.0 1 1 1.0 NaN 167 0 0 1 3 ... 4.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 7.0
4 6.0 1 1 NaN NaN 169 0 0 0 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 3.0 0 0 0.0 0.0 169 0 0 1 6 ... 2.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 2.0 8.0
7 6.0 1 0 1.0 NaN 145 3 0 8 6 ... 4.0 4.0 3.0 5.0 5.0 5.0 5.0 5.0 2.0 NaN

5 rows × 421 columns

In [61]:
full_df.to_csv('converted_features.csv')
In [6]:
import pickle
In [63]:
with open('category_encoders.pkl', 'wb') as encoders_file:
    pickle.dump(category_encoders, encoders_file, protocol=2)
In [64]:
with open('multiselect_columns_descriptions.pkl', 'wb') as descriptions_file:
    pickle.dump(multiselect_columns_descriptions, descriptions_file, protocol=2)

Part 2. Primary data analysis

In [3]:
from scipy import stats

2.1 Load previously converted and saved data

In [4]:
import pickle
In [5]:
full_df = pd.read_csv('converted_features.csv', index_col='Respondent')

with open('category_encoders.pkl', 'rb') as encoders_file:
    category_encoders = pickle.load(encoders_file)
    
categorize_cols = list(category_encoders.keys())
    
with open('multiselect_columns_descriptions.pkl', 'rb') as descriptions_file:
    multiselect_columns_descriptions = pickle.load(descriptions_file)
In [6]:
print(full_df.shape)
full_df.head()
(76504, 421)
Out[6]:
CareerSatisfaction Hobby OpenSource Dependents MilitaryUS Country Student Employment FormalEducation UndergradMajor ... AdsAgreeDisagree1 AdsAgreeDisagree2 AdsAgreeDisagree3 HypotheticalTools1 HypotheticalTools2 HypotheticalTools3 HypotheticalTools4 HypotheticalTools5 NumberMonitors StackOverflowJobsRecommend
Respondent
1 7.0 1 0 1.0 NaN 82 0 1 1 10 ... 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 1.0 NaN
3 4.0 1 1 1.0 NaN 167 0 0 1 3 ... 4.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 7.0
4 6.0 1 1 NaN NaN 169 0 0 0 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 3.0 0 0 0.0 0.0 169 0 0 1 6 ... 2.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 2.0 8.0
7 6.0 1 0 1.0 NaN 145 3 0 8 6 ... 4.0 4.0 3.0 5.0 5.0 5.0 5.0 5.0 2.0 NaN

5 rows × 421 columns

2.2. Missing data

This dataset contains survey results, where many questions were optional, ant it's no wonder that there are columns containing NAN values. Checking how many exactly.

In [14]:
cols_with_nans = full_df.isna().mean().sort_values(ascending=False)
cols_with_nans = cols_with_nans[cols_with_nans > 0]
cols_with_nans
Out[14]:
TimeAfterBootcamp             0.916540
MilitaryUS                    0.806128
StackOverflowJobsRecommend    0.536194
JobEmailPriorities2           0.419834
JobEmailPriorities7           0.419834
JobEmailPriorities6           0.419834
JobEmailPriorities1           0.419834
JobEmailPriorities5           0.419834
JobEmailPriorities4           0.419834
JobEmailPriorities3           0.419834
JobContactPriorities1         0.389078
JobContactPriorities2         0.389078
JobContactPriorities3         0.389078
JobContactPriorities4         0.389078
JobContactPriorities5         0.389078
ConvertedSalary               0.386111
Salary                        0.351433
TimeFullyProductive           0.326231
AdsPriorities6                0.282717
AdsPriorities7                0.282717
AdsPriorities1                0.282717
AdsPriorities2                0.282717
AdsPriorities3                0.282717
AdsPriorities5                0.282717
AdsPriorities4                0.282717
Dependents                    0.231792
Age                           0.208107
CompanySize                   0.205976
AssessBenefits1               0.183755
AssessBenefits11              0.183755
                                ...   
AssessBenefits4               0.183755
AgreeDisagree1                0.165678
AgreeDisagree2                0.164083
AgreeDisagree3                0.163835
HypotheticalTools4            0.161233
HypotheticalTools3            0.161142
HypotheticalTools5            0.160711
HypotheticalTools2            0.160697
HypotheticalTools1            0.160528
AssessJob3                    0.158815
AssessJob2                    0.158815
AssessJob1                    0.158815
AssessJob6                    0.158815
AssessJob7                    0.158815
AssessJob4                    0.158815
AssessJob5                    0.158815
AssessJob8                    0.158815
AssessJob9                    0.158815
AssessJob10                   0.158815
AdsAgreeDisagree2             0.138424
SkipMeals                     0.138333
AdsAgreeDisagree3             0.138189
AdsAgreeDisagree1             0.137784
HoursOutside                  0.137588
Exercise                      0.136477
HoursComputer                 0.136241
WakeTime                      0.135875
NumberMonitors                0.126621
LastNewJob                    0.012287
YearsCoding                   0.001163
Length: 68, dtype: float64

2.3 Statistical characteristics of the target feature

First of all let's see proportion of respondents giving this or that grade to their career satisfaction level.

In [147]:
satisfaction_counts = full_df['CareerSatisfaction'].value_counts().rename('Count')
satisfaction_proportions = (full_df['CareerSatisfaction'].value_counts() / len(full_df)).rename('Part')
satisfaction_df = pd.concat([satisfaction_counts, satisfaction_proportions], axis=1)
satisfaction_df
Out[147]:
Count Part
6.0 27926 0.365027
7.0 14316 0.187127
5.0 13484 0.176252
3.0 6587 0.086100
4.0 6316 0.082558
2.0 5262 0.068781
1.0 2613 0.034155

The most respondents gave "6" raiting, that is "Moderately satisfied". The "Extremely satisfied" is on the second place with little bit more proportion then "Slightly satisfied". In general, there are much more respondents who have choosen "Satisfied" grades than those who haven't. It allows to conclude that developer is a cool profession :)

Some statistics characteristics of the distribution:

In [148]:
print("CareerSatisfaction median: ", full_df['CareerSatisfaction'].median())
print("CareerSatisfaction mean: ", full_df['CareerSatisfaction'].mean())
print("CareerSatisfaction std: ", full_df['CareerSatisfaction'].std())
CareerSatisfaction median:  6.0
CareerSatisfaction mean:  5.141561225556834
CareerSatisfaction std:  1.6389014092370502

Yes, half of respondents ranked their carrier with two highest grades, and 1-5 ranks are distributed among the other half. The median is different from the mean, so it doesn't look like normal distribution.

In [149]:
satisfaction_values = full_df['CareerSatisfaction'].values.astype(np.float64)

_, p_normtest = stats.normaltest(satisfaction_values)
print("p-value in stats.normaltest: {0:.6f}".format(p_normtest))
p-value in stats.normaltest: 0.000000

Stats test confirms that is not. What if applying log transforms?

In [150]:
log1_satisfaction_values = np.log(1 + satisfaction_values)
log_inverted_satisfaction_values = np.log(8 - satisfaction_values)

_, p_log1_normtest = stats.normaltest(log1_satisfaction_values)
_, p_log_inverted_normtest = stats.normaltest(log_inverted_satisfaction_values)
print("p-value in log1 stats.normaltest: {0:.6f}".format(p_log1_normtest))
print("p-value in s inverted log stats.normaltest: {0:.6f}".format(p_log_inverted_normtest))
p-value in log1 stats.normaltest: 0.000000
p-value in s inverted log stats.normaltest: 0.000000

It look like the target variable isn't distributed neither normally, nor lognormally. May be that's due to discrete values of ranks, as normal distribution is continious one. Graphic of the distribution in the visual analysis section is needed to make further conclusions.

2.4 CareerSatisfaction <-> JobSatisfaction correlation

Though it was decided not to use JobSatisfaction feature yet, it is better to verify the assumption about its high correlation with the target variable.

In [151]:
satisfaction_df = survey_results[['CareerSatisfaction', 'JobSatisfaction']]
satisfaction_df = satisfaction_df[
    satisfaction_df['CareerSatisfaction'].notna() & satisfaction_df['JobSatisfaction'].notna()]
satisfaction_df['CareerSatisfaction'] = satisfaction_df['CareerSatisfaction'].map(satisfaction_map)
satisfaction_df['JobSatisfaction'] = satisfaction_df['JobSatisfaction'].map(satisfaction_map)
satisfaction_df.corr()
Out[151]:
CareerSatisfaction JobSatisfaction
CareerSatisfaction 1.000000 0.593935
JobSatisfaction 0.593935 1.000000

Indeed, it looks like high correlation result. Let's put this feature aside for the sake of more comprehensive analysis of other aspects. However, it can be returned if model will show too poor results without it.

2.5 ConvertedSalary analysis

The only continious numeric varibles in the research were the Salary and ConvertedSalary features. The second is much more convenient for the analysis as it was converted to common denominator of "annual USD salaries using the exchange rate on 2018-01-18, assuming 12 working months and 50 working weeks".

In [152]:
converted_salary = full_df['ConvertedSalary']
converted_salary.describe()
Out[152]:
count    4.696500e+04
mean     9.675330e+04
std      2.031932e+05
min      0.000000e+00
25%      2.469600e+04
50%      5.598100e+04
75%      9.400000e+04
max      2.000000e+06
Name: ConvertedSalary, dtype: float64

Let's see average salaries by career satisfactions ranks

In [153]:
full_df.pivot_table(['ConvertedSalary'], ['CareerSatisfaction'], aggfunc='mean')
Out[153]:
ConvertedSalary
CareerSatisfaction
1.0 97574.700441
2.0 81337.832422
3.0 86218.418929
4.0 76909.696446
5.0 83697.253697
6.0 104074.131315
7.0 110145.861186

Those who are paid the biggest average salaries are satisfied by the career the most, it is expectable result. But surprisingly, the lowest payment is received by those in the middle, with "Neither satisfied nor dissatisfied" rank. And the most dissatisfied respondents do not receive the least wage.

Let's see the correlation

In [154]:
salary_correlation = full_df[['CareerSatisfaction', 'ConvertedSalary']].corr()
salary_correlation
Out[154]:
CareerSatisfaction ConvertedSalary
CareerSatisfaction 1.000000 0.043178
ConvertedSalary 0.043178 1.000000

That's not so much. May be this dependency may get higher evaluation? DataFrame.corr() can use three correlation methods. As I know, the default 'pearson' is good for linear correlation, and we saw above that in this case it is not linear.

In [155]:
salary_correlation = full_df[['CareerSatisfaction', 'ConvertedSalary']].corr(method='spearman')
salary_correlation
Out[155]:
CareerSatisfaction ConvertedSalary
CareerSatisfaction 1.000000 0.182913
ConvertedSalary 0.182913 1.000000

It's obviously better. So, spearman looks like more suitable correlation method to be used in further analysis. If the correlation between salary and satisfaction is not linear, the others shouldn't be this way all the more.

2.6 Coding Experience, Professional Coding experience, Last Job Years and Age

There are YearsCoding, YearsCodingProf, LastNewJob and Age columns in the dataset. Despite them being categorical in the survey, the way they were processed during features conversion allows to treat them as numerical values.

In [177]:
years_columns = ['CareerSatisfaction', 'YearsCoding', 'YearsCodingProf', 'LastNewJob', 'Age']
years_coding_corelation = full_df[years_columns].corr(method='spearman')
years_coding_corelation = years_coding_corelation['CareerSatisfaction'].sort_values(ascending=False)
years_coding_corelation
Out[177]:
CareerSatisfaction    1.000000
YearsCodingProf       0.063052
YearsCoding           0.054862
Age                   0.018039
LastNewJob            0.015718
Name: CareerSatisfaction, dtype: float64

Not so much on the one hand, but not zero on the other.

2.7 Some other numerical categorical features

In [178]:
numerical_features = ['TimeAfterBootcamp',
        'ConvertedSalary', 'CompanySize', 'TimeFullyProductive', 'NumberMonitors',
        'WakeTime', 'HoursComputer', 'HoursOutside', 'SkipMeals', 'Exercise'
]

numerical_correlations = full_df[['CareerSatisfaction'] + numerical_features].astype('float64').corr(method='spearman')
numerical_correlations = numerical_correlations['CareerSatisfaction'].sort_values(ascending=False)
numerical_correlations
Out[178]:
CareerSatisfaction     1.000000
ConvertedSalary        0.182913
NumberMonitors         0.090212
Exercise               0.051191
CompanySize            0.022919
HoursComputer         -0.005023
WakeTime              -0.010328
TimeFullyProductive   -0.022407
HoursOutside          -0.027250
SkipMeals             -0.048306
TimeAfterBootcamp     -0.098974
Name: CareerSatisfaction, dtype: float64

2.8 Technical role and experience

As it was the developer survey there are a number a questions devoted to technical role and technical experience; they seem to affect the technical career satisfaction. However, there are too many of them with too many answers options, so let select only few as representatives for the analysis.

Dev roles question was multiselect and alowed to choose several options at once. Now all this options are splitted to separate columns with "1" for the respondents checked it, and "0" for those who didn't. That doesn't prevent performing correlation check with the target variable, anyway. But some mapping with readable descriptions is needed.

In [180]:
dev_types = [dt for dt in multiselect_columns_descriptions.keys() if dt.startswith('DevType')]
dev_types_df = full_df[['CareerSatisfaction'] + dev_types]

dev_types = [multiselect_columns_descriptions[dv] for dv in dev_types]
dev_types_df = dev_types_df.rename(columns=multiselect_columns_descriptions)
In [181]:
dev_types_correlation = dev_types_df.corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False)
dev_types_correlation
Out[181]:
CareerSatisfaction                               1.000000
Engineering manager                              0.049598
C-suite executive (CEO, CTO, etc.)               0.047887
DevOps specialist                                0.040737
Full-stack developer                             0.036597
Product manager                                  0.025695
Data scientist or machine learning specialist    0.018307
Back-end developer                               0.007011
Educator or academic researcher                  0.005794
Front-end developer                              0.003102
Mobile developer                                 0.002944
Marketing or sales professional                  0.001562
Embedded applications or devices developer      -0.001165
Database administrator                          -0.001362
QA or test developer                            -0.002059
System administrator                            -0.004868
Desktop or enterprise applications developer    -0.005695
Designer                                        -0.006873
Data or business analyst                        -0.008894
Game or graphics developer                      -0.026840
Student                                         -0.047713
Name: CareerSatisfaction, dtype: float64

That could be also interesting what percent of respondents with this or that level of satisfaction perform this or that role. As all these columns contain "1" or "0" only, the mean() function (i.e. sum/count) will give us that proportion.

In [161]:
dev_types_percent = dev_types_df.pivot_table(dev_types, ['CareerSatisfaction'], aggfunc='mean')
dev_types_percent = dev_types_percent.transpose()
dev_types_percent
Out[161]:
CareerSatisfaction 1.0 2.0 3.0 4.0 5.0 6.0 7.0
Back-end developer 0.595101 0.597681 0.599059 0.583439 0.613097 0.608036 0.604638
C-suite executive (CEO, CTO, etc.) 0.046307 0.032687 0.022013 0.025332 0.027366 0.037743 0.057628
Data or business analyst 0.099120 0.087989 0.096554 0.086922 0.085064 0.086228 0.084451
Data scientist or machine learning specialist 0.085343 0.081718 0.071504 0.079956 0.072234 0.085941 0.090668
Database administrator 0.160735 0.150703 0.157128 0.162286 0.149733 0.146208 0.159542
Designer 0.158056 0.129989 0.136329 0.161178 0.124444 0.120927 0.143057
Desktop or enterprise applications developer 0.177191 0.186241 0.192349 0.194427 0.187556 0.182124 0.183990
DevOps specialist 0.098354 0.099202 0.104904 0.082014 0.105088 0.119673 0.132788
Educator or academic researcher 0.042097 0.040289 0.036284 0.041482 0.030110 0.037957 0.041632
Embedded applications or devices developer 0.060850 0.056442 0.062092 0.059690 0.051320 0.056077 0.058187
Engineering manager 0.057405 0.048651 0.038561 0.037524 0.050208 0.062630 0.077605
Front-end developer 0.404899 0.391106 0.394717 0.414345 0.398621 0.393218 0.409053
Full-stack developer 0.480291 0.494299 0.497647 0.466118 0.513720 0.527322 0.537231
Game or graphics developer 0.068504 0.052261 0.050554 0.099113 0.050578 0.044081 0.051271
Marketing or sales professional 0.018370 0.012163 0.010020 0.012666 0.010012 0.009131 0.014599
Mobile developer 0.245695 0.202585 0.206315 0.230842 0.212697 0.199384 0.231419
Product manager 0.048986 0.044660 0.041597 0.039265 0.041234 0.050240 0.058396
QA or test developer 0.073861 0.070315 0.081221 0.071406 0.072827 0.067464 0.076418
Student 0.174512 0.124287 0.129346 0.270583 0.123999 0.107534 0.130763
System administrator 0.129736 0.128658 0.120237 0.142020 0.113616 0.115734 0.126781

The similar approaches for LanguageWorkedWith columns.

In [162]:
langs_worked = [l for l in multiselect_columns_descriptions.keys() if l.startswith('LanguageWorkedWith')]
lang_worked_df = full_df[['CareerSatisfaction'] + langs_worked]

langs_worked = [multiselect_columns_descriptions[l] for l in langs_worked]
langs_worked_df = lang_worked_df.rename(columns=multiselect_columns_descriptions)
In [163]:
langs_worked_correlation = langs_worked_df.corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False)
langs_worked_correlation
Out[163]:
CareerSatisfaction      1.000000
Bash/Shell              0.051539
TypeScript              0.048590
JavaScript              0.040756
R                       0.038875
Ruby                    0.038824
Go                      0.032885
Java                    0.030800
CSS                     0.030782
HTML                    0.029567
C                       0.025828
Python                  0.025468
Swift                   0.024316
SQL                     0.023480
Kotlin                  0.023434
CoffeeScript            0.023151
Scala                   0.022293
Groovy                  0.021196
Objective-C             0.020837
C#                      0.017973
F#                      0.012238
Rust                    0.011805
Erlang                  0.009154
Clojure                 0.008811
Perl                    0.002779
Lua                     0.001887
Julia                   0.000279
Ocaml                  -0.000155
Haskell                -0.002360
Matlab                 -0.002900
VB.NET                 -0.003354
Assembly               -0.003957
Hack                   -0.005076
Cobol                  -0.005911
VBA                    -0.007535
C++                    -0.010874
Visual Basic 6         -0.010924
PHP                    -0.014080
Delphi/Object Pascal   -0.017225
Name: CareerSatisfaction, dtype: float64
In [164]:
langs_worked_percent = langs_worked_df.pivot_table(langs_worked, ['CareerSatisfaction'], aggfunc='mean')
langs_worked_percent = langs_worked_percent.transpose()
langs_worked_percent
Out[164]:
CareerSatisfaction 1.0 2.0 3.0 4.0 5.0 6.0 7.0
Assembly 0.076158 0.059293 0.060422 0.072039 0.055473 0.054931 0.065172
Bash/Shell 0.303483 0.340935 0.338090 0.313331 0.347152 0.384445 0.389494
C 0.721776 0.728810 0.755579 0.733217 0.752966 0.760653 0.768092
C# 0.270953 0.286773 0.304236 0.290215 0.320009 0.317482 0.313495
C++ 0.215078 0.213037 0.217246 0.257283 0.209508 0.204075 0.216192
CSS 0.551091 0.548651 0.568392 0.563648 0.579279 0.586371 0.603660
Clojure 0.016073 0.010832 0.012145 0.009816 0.009938 0.011710 0.014878
Cobol 0.011481 0.006271 0.006528 0.008075 0.006452 0.005371 0.006915
CoffeeScript 0.026024 0.026606 0.023531 0.023116 0.027366 0.034377 0.034227
Delphi/Object Pascal 0.026406 0.028126 0.024138 0.026124 0.024622 0.020017 0.020117
Erlang 0.011481 0.009692 0.007591 0.007441 0.009789 0.010743 0.011316
F# 0.013012 0.011212 0.011690 0.010133 0.011347 0.014216 0.014739
Go 0.052813 0.055872 0.056779 0.054307 0.060368 0.068073 0.078793
Groovy 0.031764 0.034778 0.035828 0.028974 0.040122 0.043150 0.044635
HTML 0.577497 0.574686 0.595719 0.594364 0.608870 0.614589 0.628667
Hack 0.007654 0.003231 0.002733 0.002217 0.001928 0.001934 0.003143
Haskell 0.021814 0.022425 0.020191 0.026757 0.019430 0.020268 0.021794
Java 0.689246 0.713987 0.726886 0.706301 0.740581 0.747332 0.749162
JavaScript 0.575584 0.600532 0.610445 0.585339 0.630896 0.642627 0.650531
Julia 0.008037 0.005701 0.003036 0.005858 0.003337 0.003939 0.005588
Kotlin 0.040566 0.034398 0.033703 0.035307 0.038564 0.043221 0.048687
Lua 0.033678 0.031167 0.027782 0.035940 0.029368 0.030545 0.032761
Matlab 0.050517 0.046370 0.046000 0.054465 0.048947 0.046802 0.047779
Objective-C 0.063911 0.055872 0.059359 0.052248 0.061999 0.065387 0.072716
Ocaml 0.008037 0.005511 0.003947 0.004908 0.003856 0.004727 0.005169
PHP 0.295063 0.277461 0.275391 0.289424 0.280406 0.256177 0.275217
Perl 0.037887 0.037438 0.037194 0.038632 0.036414 0.038459 0.038768
Python 0.306544 0.314899 0.329437 0.341672 0.323050 0.341832 0.356804
R 0.132415 0.136260 0.136329 0.137745 0.136236 0.164184 0.169810
Ruby 0.077306 0.079818 0.077881 0.072673 0.084100 0.099835 0.106524
Rust 0.017987 0.020334 0.018977 0.025807 0.016761 0.021808 0.025077
SQL 0.466514 0.489738 0.518142 0.480051 0.519356 0.527322 0.522492
Scala 0.037887 0.040289 0.027934 0.031666 0.036339 0.044403 0.044496
Swift 0.076158 0.060623 0.062851 0.064123 0.069935 0.073981 0.083752
TypeScript 0.136625 0.135880 0.135570 0.131412 0.156037 0.174246 0.182942
VB.NET 0.068121 0.059293 0.052983 0.061748 0.061480 0.058834 0.057488
VBA 0.047072 0.046940 0.045089 0.040374 0.041086 0.042398 0.039676
Visual Basic 6 0.050134 0.036298 0.030970 0.037840 0.033595 0.032300 0.031992

And LanguageDesireNextYear columns too.

In [165]:
lang_desired = [l for l in multiselect_columns_descriptions.keys() if l.startswith('LanguageDesireNextYear')]
lang_desired_df = full_df[['CareerSatisfaction'] + lang_desired]
In [166]:
lang_desired = [multiselect_columns_descriptions[l] for l in lang_desired]
lang_desired_df = lang_desired_df.rename(columns=multiselect_columns_descriptions)
In [167]:
langs_desired_correlation = lang_desired_df.corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False)
langs_desired_correlation
Out[167]:
CareerSatisfaction      1.000000
Bash/Shell              0.040365
TypeScript              0.034861
HTML                    0.029015
CSS                     0.026265
JavaScript              0.024870
Go                      0.018186
SQL                     0.018184
Swift                   0.017544
C                       0.016269
Java                    0.016015
R                       0.009478
Rust                    0.009450
C#                      0.009422
Ruby                    0.009039
F#                      0.008360
Kotlin                  0.004180
Erlang                  0.003016
Groovy                  0.002753
Ocaml                   0.001271
Clojure                -0.000212
Haskell                -0.001878
CoffeeScript           -0.004414
Objective-C            -0.004642
Scala                  -0.005060
Python                 -0.006263
VB.NET                 -0.006592
Julia                  -0.006639
Delphi/Object Pascal   -0.007751
Perl                   -0.008277
VBA                    -0.008305
Lua                    -0.010136
Cobol                  -0.011806
Hack                   -0.012254
Visual Basic 6         -0.012284
Matlab                 -0.013729
Assembly               -0.015967
C++                    -0.019113
PHP                    -0.019904
Name: CareerSatisfaction, dtype: float64
In [168]:
lang_desired_percent = lang_desired_df.pivot_table(lang_desired, ['CareerSatisfaction'], aggfunc='mean')
lang_desired_percent = lang_desired_percent.rename(columns=multiselect_columns_descriptions).transpose()
lang_desired_percent
Out[168]:
CareerSatisfaction 1.0 2.0 3.0 4.0 5.0 6.0 7.0
Assembly 0.055492 0.048271 0.050858 0.058265 0.038935 0.037062 0.045823
Bash/Shell 0.205511 0.227100 0.225748 0.214060 0.226787 0.250985 0.266485
C 0.538844 0.557203 0.590405 0.576789 0.579798 0.585440 0.591436
C# 0.205128 0.208286 0.237286 0.231792 0.241175 0.239991 0.231629
C++ 0.160352 0.170848 0.183240 0.200285 0.159893 0.153549 0.163384
CSS 0.329506 0.323071 0.335813 0.329164 0.340997 0.350605 0.365884
Clojure 0.028703 0.032117 0.037043 0.028182 0.031000 0.030867 0.032481
Cobol 0.010333 0.004561 0.006073 0.007125 0.003856 0.003080 0.004959
CoffeeScript 0.026024 0.018814 0.022924 0.025174 0.019060 0.018943 0.021584
Delphi/Object Pascal 0.015691 0.013683 0.012904 0.012666 0.010531 0.009418 0.012154
Erlang 0.031764 0.030027 0.033399 0.029132 0.033150 0.032873 0.032691
F# 0.032530 0.041239 0.047518 0.038632 0.042717 0.048163 0.043588
Go 0.144661 0.171608 0.184302 0.158486 0.184070 0.185383 0.188181
Groovy 0.022579 0.021855 0.025201 0.019791 0.024622 0.023813 0.023959
HTML 0.340987 0.336564 0.352361 0.344522 0.358351 0.370372 0.383417
Hack 0.017987 0.007792 0.008653 0.009500 0.006452 0.005156 0.008243
Haskell 0.052813 0.056632 0.063155 0.060798 0.057772 0.057187 0.058047
Java 0.510142 0.515583 0.538940 0.515674 0.546129 0.544367 0.545054
JavaScript 0.407960 0.421133 0.432822 0.408011 0.442599 0.448901 0.454107
Julia 0.017604 0.015393 0.016396 0.018683 0.012682 0.013643 0.014529
Kotlin 0.129353 0.136450 0.136785 0.134104 0.143429 0.139583 0.139983
Lua 0.032912 0.029837 0.027478 0.029766 0.025734 0.024529 0.025566
Matlab 0.027172 0.022995 0.022924 0.028499 0.019356 0.018943 0.019628
Objective-C 0.051282 0.036108 0.040686 0.043540 0.040344 0.035630 0.041771
Ocaml 0.014543 0.012923 0.013056 0.012191 0.011495 0.012426 0.013481
PHP 0.168772 0.138350 0.138455 0.153578 0.137422 0.125689 0.132090
Perl 0.029468 0.022615 0.023835 0.024224 0.019282 0.018370 0.022283
Python 0.351320 0.353098 0.382875 0.368904 0.363690 0.363854 0.354918
R 0.191351 0.215317 0.232731 0.219284 0.219519 0.231540 0.224015
Ruby 0.081898 0.083048 0.091392 0.083914 0.085880 0.088054 0.093602
Rust 0.073096 0.090270 0.095187 0.090722 0.086473 0.097364 0.093043
SQL 0.289705 0.296655 0.325641 0.305415 0.322679 0.332593 0.326558
Scala 0.067356 0.073166 0.078792 0.061273 0.075794 0.074089 0.066150
Swift 0.116341 0.098062 0.104145 0.102913 0.109760 0.109754 0.122590
TypeScript 0.163031 0.184151 0.189616 0.171628 0.196307 0.215140 0.214096
VB.NET 0.022962 0.015013 0.016092 0.021374 0.015574 0.014717 0.016345
VBA 0.015691 0.012353 0.013056 0.011241 0.010012 0.009633 0.010827
Visual Basic 6 0.010716 0.006842 0.008805 0.009500 0.005562 0.004440 0.006776

And also OperatingSystem.

In [169]:
os_df = full_df[['CareerSatisfaction', 'OperatingSystem']]

os_df['OperatingSystem'] = os_df['OperatingSystem'].map(
    lambda os: category_encoders['OperatingSystem'].inverse_transform(os)
)

os_dummies_df = pd.get_dummies(os_df, columns=['OperatingSystem'], prefix="OS")
In [170]:
os_correlation = os_dummies_df.corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False)
os_correlation
Out[170]:
CareerSatisfaction    1.000000
OS_MacOS              0.077067
OS_BSD/Unix          -0.009608
OS_Linux-based       -0.010232
OS_Unknown           -0.031194
OS_Windows           -0.036228
Name: CareerSatisfaction, dtype: float64

Interesting: MacOS shows much higher correlation with career satisfaction than others OS. When looked on the OS percentage by satisfaction grades, it can be seen that 6- and 7- respondents use this OS more often than others. Is MacOS the operating system of success?

In [171]:
satisfaction_os_percent = os_dummies_df.groupby(by='CareerSatisfaction').mean()
satisfaction_os_percent
Out[171]:
OS_BSD/Unix OS_Linux-based OS_MacOS OS_Unknown OS_Windows
CareerSatisfaction
1.0 0.002679 0.204746 0.212017 0.166475 0.414083
2.0 0.002090 0.211707 0.210946 0.145572 0.429685
3.0 0.002733 0.201306 0.200698 0.129194 0.466070
4.0 0.002217 0.213110 0.166086 0.161970 0.456618
5.0 0.001335 0.202981 0.216998 0.131267 0.447419
6.0 0.001611 0.197737 0.260331 0.118134 0.422187
7.0 0.001118 0.197262 0.285904 0.122381 0.393336

The main conclusion that can be made from the lists and tables with numbers above, in my opinion, is that it's too hard to search dependencies in such data with the method of staring gaze. It's better to move to visual data analysis.

Part 3. Primary visual data analysis

In [7]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import poisson
import warnings
warnings.filterwarnings('ignore')

3.1 Target value distribution

Target value distribution in the graphical form

In [179]:
sns.distplot(satisfaction_values, bins=7, kde=False)
Out[179]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f431c4c79e8>

That doesn't look like normal distribution, really. May be it resembles lognormal distribution being turned around, but lognormal distribution is continious and our one is discrete. Wouldn't it be better to search for discrete distribution with similar form? The Poisson distribution, for instance.
The histogram of Poisson distribution with the same mean and number of samples, when inverted, looks this way:

In [198]:
mu = satisfaction_values.mean()
data_poisson = poisson.rvs(mu=mu, size=len(satisfaction_values), random_state=42)
max_poisson = max(data_poisson)
ax = sns.distplot(max_poisson - data_poisson, bins=7, kde=False)

Unfortunately, I haven't found easy and relaible analytic way to check if the distribution is Poisson. Even graphical statsmodels.api qqplot doesn't easily work with Poisson. The scipy.stats.probplot function shows such plot:

In [199]:
ax2 = plt.subplot()
res = stats.probplot(satisfaction_values, dist='poisson', sparams=(mu), plot=plt)

As I know, if the distributions are similar then the blue points should be close to 45-degree red line. For "ordered values" under 7 they are, for some extent. Apparently, the satisfaction levels has some characteristics of the Poisson distribution, but... No idea that it gives. There was a task to perform statistical analysis of the target feature, I've done it :)

3.2 Boxplot for ConvertedSalary

As ConvertedSalary is continuous variable and its distribution can be depicted with the boxplot.

In [200]:
sns.boxplot(x='CareerSatisfaction', y='ConvertedSalary', data=full_df)
Out[200]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42db4a0eb8>

Outliers make this graph looking awful. This column required some mathematical actions to calculate converted US annual salary, may be some respondents have done it wrong. Let's see if cutting values >= 0.95 quantile would help.

In [201]:
quantile_95 = full_df['ConvertedSalary'].quantile(0.95)
cut_salary_outliers_df = full_df[full_df['ConvertedSalary'] < quantile_95]
sns.boxplot(x='CareerSatisfaction', y='ConvertedSalary', data=cut_salary_outliers_df)
Out[201]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f42db21f0b8>

Much more useful picture. And there is indeed dependency between salary distribution and career satisfaction.

3.3 Numerical category features corellations

In [183]:
sns.heatmap(numerical_correlations.iloc[1:].to_frame(), annot=True)
Out[183]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb077262358>

The heatmap shows only that these features are much less correlated with the target than the ConvertedSalary. Maybe plots divided by satisfaction level will show something...

In [203]:
grid = sns.FacetGrid(full_df, col="CareerSatisfaction", col_wrap=4, sharey=False)
grid.map(sns.countplot, "NumberMonitors")
Out[203]:
<seaborn.axisgrid.FacetGrid at 0x7f42db305358>
In [204]:
grid = sns.FacetGrid(full_df, col="CareerSatisfaction", col_wrap=4, sharey=False)
grid.map(sns.countplot, "CompanySize")
Out[204]:
<seaborn.axisgrid.FacetGrid at 0x7f431dee7d30>
In [205]:
grid = sns.FacetGrid(full_df, col="CareerSatisfaction", col_wrap=4, sharey=False)
grid.map(sns.countplot, "SkipMeals")
Out[205]:
<seaborn.axisgrid.FacetGrid at 0x7f43101b6d30>
In [187]:
grid = sns.FacetGrid(full_df, col="CareerSatisfaction", col_wrap=4, sharey=False)
grid.map(sns.countplot, "TimeAfterBootcamp")
Out[187]:
<seaborn.axisgrid.FacetGrid at 0x7fb077262320>

Some differences can be noticed if looking attentively, but they are not so big. And, again, the patterns are not changing in a linear manner from lower ranks to higher.

3.3 Technical experience features

In [182]:
_, axes = plt.subplots(1, 2, figsize=(12,8), sharey=False) 
sns.heatmap(dev_types_correlation.iloc[1:].to_frame(), ax=axes[0], annot=True)
sns.heatmap(years_coding_corelation.iloc[1:].to_frame(), ax=axes[1], annot=True)
Out[182]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb091797940>

Some non-zero but not very correlated role and experience features can be seen on the heatmap.

Let's draw the percentages of dev roles by career satisfaction levels. It's not the correlation heatmap, so different color scheme is used.

In [207]:
sns.heatmap(dev_types_percent, cmap='RdGy')
Out[207]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4324383390>

Not so many differences, but some vertical stripes can be seen, especially for the 1, 4, 6 ranks.

Languages experience correlations

In [208]:
_, axes = plt.subplots(1, 2, figsize=(16,6), sharey=False) 
sns.heatmap(langs_worked_correlation.iloc[1:].to_frame(), ax=axes[0])
sns.heatmap(langs_desired_correlation.iloc[1:].to_frame(), ax=axes[1])
Out[208]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4324594eb8>

Languages experience differentiating by satisfaction levels

In [209]:
_, axes = plt.subplots(1, 2, figsize=(16,6), sharey=False, ) 
sns.heatmap(langs_worked_percent, ax=axes[0], cmap='RdGy')
sns.heatmap(lang_desired_percent, ax=axes[1], cmap='RdGy')
Out[209]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f43244e2828>

OS corellations and differences by satisfaction levels.

In [210]:
_, axes = plt.subplots(1, 2, figsize=(16,6), sharey=False) 
sns.heatmap(os_correlation.iloc[1:].to_frame(), ax=axes[0], annot=True)
sns.heatmap(satisfaction_os_percent.transpose(), ax=axes[1], cmap='RdGy', annot=True)
Out[210]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f43103f4b38>

3.4 Some other factors correlation heatmaps.

In [211]:
corporative_cols = [
    'CareerSatisfaction',
    'Student',
    'Employment',
    'JobSearchStatus',
    'HopeFiveYears',
    'UpdateCV',
    'SalaryType',
    'Currency'
]

corporative_corellations = \
    full_df[corporative_cols].corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False)
In [212]:
education_cols = [
    'CareerSatisfaction',
    'FormalEducation',
    'UndergradMajor',
    'EducationParents'
]

self_taugh_cols = [st for st in multiselect_columns_descriptions.keys() if st.startswith('SelfTaughtTypes')]
edu_type_cols = [et for et in multiselect_columns_descriptions.keys() if et.startswith('EducationTypes')]

all_edu_cols = education_cols + self_taugh_cols + edu_type_cols
edu_correlation = \
    full_df[all_edu_cols].corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False).rename(
        'Education')
edu_correlation_descriptions = edu_correlation.rename(index=multiselect_columns_descriptions)
In [213]:
edu_correlation_descriptions
Out[213]:
CareerSatisfaction                                                                                    1.000000
Contributed to open source software                                                                   0.089353
Received on-the-job training in software development                                                  0.074706
Questions & answers on Stack Overflow                                                                 0.068967
Participated in a hackathon                                                                           0.066886
Taught yourself a new language, framework, or tool without taking a formal course                     0.066712
The official documentation and/or standards for the technology                                        0.066419
Tapping your network of friends, family, and peers versed in the technology                           0.048511
The technology’s online help system                                                                   0.034922
Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.)    0.028211
Completed an industry certification program (e.g. MCPD)                                               0.024266
A book or e-book from O’Reilly, Apress, or a similar publisher                                        0.019459
Participated in a full-time developer training program or bootcamp                                    0.012084
Internal Wikis, chat rooms, or documentation set up by my company for employees                       0.007128
Participated in online coding competitions (e.g. HackerRank, CodeChef, TopCoder)                      0.000820
Taken a part-time in-person course in programming or software development                            -0.000396
Pre-scheduled tutoring or mentoring sessions with a friend or colleague                              -0.001073
A college/university computer science or software engineering book                                   -0.002334
Taken an online course in programming or software development (e.g. a MOOC)                          -0.003225
FormalEducation                                                                                      -0.013118
UndergradMajor                                                                                       -0.032062
EducationParents                                                                                     -0.043779
Name: Education, dtype: float64
In [214]:
social_cols = [
    'CareerSatisfaction',
    'Country',
    'Dependents',
    'MilitaryUS'
]

social_corelation = \
    full_df[social_cols].corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False).rename(
        "Social Factors")
In [215]:
philosophic_cols = [
    'CareerSatisfaction',
    'AIDangerous',
    'AIInteresting',
    'AIResponsible',
    'AIFuture',
    'EthicsChoice',
    'EthicsReport',
    'EthicsResponsible',
    'EthicalImplications'
]

philosophic_corelation = \
    full_df[philosophic_cols].corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False).rename(
        "Philosophic questions")
In [216]:
stackoverflow_cols = [
    'CareerSatisfaction',
    'StackOverflowVisit',
    'StackOverflowHasAccount',
    'StackOverflowParticipate',
    'StackOverflowJobs',
    'StackOverflowDevStory',
    'StackOverflowConsiderMember'
]

stackoverflow_corelation = \
    full_df[stackoverflow_cols].corr(method='spearman')['CareerSatisfaction'].sort_values(ascending=False).rename(
        'StackOverflow questions')
In [217]:
_, axes = plt.subplots(3, 2, figsize=(18,16), sharey=False) 

sns.heatmap(corporative_corellations.to_frame().iloc[1:], ax=axes[0,0], annot=True)
sns.heatmap(edu_correlation.to_frame().iloc[1:], ax=axes[0,1])
sns.heatmap(social_corelation.to_frame().iloc[1:], ax=axes[1,0], annot=True)
sns.heatmap(philosophic_corelation.to_frame().iloc[1:], ax=axes[1,1], annot=True)
sns.heatmap(stackoverflow_corelation.to_frame().iloc[1:], ax=axes[2,0], annot=True)
axes[2,1].set_visible(False)

Not so much in general, but there are features with 0.06 - 0.08 correlations.

3.5 Conclusions

This way, we see a dataset wit a discrete distribution of target variable that shows slight characteristics of Poisson function.
The most correlated feature is numerical continious ConvertedSalary feature with the 0.182913 correlation value. It has to be cleared out of outliers.
Among numerical category variables the highest level is shown by the NumbersOfMonitors - 0.090212. Countplots with this feature shows some differences among satisfaction levels as well.
The highest levels of correlation in technical characteristics are about 0.04-0.05, some of them are not correlated with career satisfaction at all. There are some tech-related difference among satisfaction levels, one of the most noticeable is percentage of MacOS users among highly-satisfied respondents.
Some other aspects from educational strategies, social factors, country of living, philosophical outlook, StackOverflow membership etc. can correlate with career satisfaction as well, up to 0.06-0.08 values.

Part 4. Insights and found dependencies

Combining the observation from the previous paragraphs, the following is to be denoted:

  • The dataset under analysis contain many omissions. That's no wonder: the data came from a voluntary survey, where many questions can be optional. This omissions have to be handled when working with the specific selected model.
  • The distribution of the target feature, CareerSatisfaction is discrete distribution and it isn't normal or lognormal. That should be taken into account in model selection, as not all models handle non-normal distributions well.
  • The JobSatisfaction and CareerSatisfaction columns seems to be interconnected features, but the model predicting one of them based on anoter will not be so interesting. It's better to try drawing a conclusion using wide range of other factors first.
  • The feature with the next level of correlation with the target variable is ConvertedSalary. It is continious float feature but its correlation with levels of satisfaction is non-linear. Unfortunately, it has 0.386 part of NAN values and there are outliers. Hopping that it's partly due to difficulties of manual convertion and Salary column along with Currency and SalaryType will help soften those problems.
  • By now, other features are categorical with numeric labels, in some cases supporting ordering and ratio relationships, in some cases not. Some of them show non-zero correlation with career satisfaction and differences across satisfaction levels, some do not. Due to lage number of features it is too difficult to select most useful ones before starting the close work with concrete model. It is better to use all of them initially and then exclude/convert some according to practical results/observations/strategies.
  • </ul>

Part 5. Metrics selection

First of all, it should be specified that it is the task of multiclass classification. For some extent, predicting a satisfaction level can be seen as a regression task, as satisfaction is monothonicaly increasing function and it's better to erroneously predict the closest grade then the arbitrary one. But the analysis above shows that there is no linear correlation between features and satisfaction level, rather there are patterns inherent to respondents giving this or that rating.

One of the most frequently used and simple metric for classification is Accuracy. Unfortunately, it can work badly in the case of inbalanced classes, and the survey dataset is indeed unbalanced (see the distribution histogram).

Another popular classification metric is ROC-AUC. It evaluates correct positive results among all positive samples (True Positive Rate) against incorrect positive results among all negatives (False Positive Rate). However, this task is not the credit scoring one, false positives will not cause missed profit, and we are interested to increased the true positives over the whole set much more than decreasing false triggering. In addition ROC-AUC is initially a binary classification metric, and to use it in multiclassification task additional steps are needed (micro/macro-averaging, binary encoding etc.). This applies to many other binary classification metrics, extended to multiclass case.

One more metric is multiclass logarithmic loss. It can be explained by the formula: $$\mathcal logloss = -\dfrac{1}{N}\sum_{i=1}^N\sum_{j=1}^M y_{ij}\log(p_{ij})$$ ($N$ is the number of observations in the test set, $M$ is the number of class labels, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$) It involves probability estimates and looks interesting in such case. Career satisfaction is something subjective, non-predetermined, evaluated differently by different people in the same circumstances (and even by the same person in different mood). So some uncertainty is present and it usage in the evaluation metric can be helpfull.

Thus, metric selected for evaluating this task is multiclass logarithmic loss.

Part 6. Model selection

Human satisfaction seems to be very complicated thing, depending on different combinations of many factors. So this task doesn't look like linearly separable and non-linear model has to be selected, the one that allows to avoid tricks like polynomyal features and kernel change. The most popular non-linear models nowadays are ensembles of decision trees.

For the sake of easy reproducibility and without need to handle gigabytes of data or fighting for thousandth fractions of metric in the competition, this task is going to be solved by the means of scikit-learn library, without Vopal Wabbit, XGBoost etc. From tree ensembles presented in scikit-learn the RandomForestClassifier looks like the most obvious selection. Beside handling nonlinearity Random Forest offers some other advantages:

  • After a random split, only subset of features are selected, that is useful when there are so many of them;
  • Conditions in decisions handle both continuous and discrete variables;
  • Doesn't need feature rescaling;
  • Is said to work well with non-normal target distribution;
  • Robust to outliers and large amount of missing data;
  • There are developed methods to estimate feature importance;

Part 7. Data preprocessing

7.1 Data handling

In fact, the most part of data preprocessing was done on the features convertion step above primary data analysis (otherwise the analysis wouldn't be possible at all). There can be found replacing some strings with numbers using LabelEncoder, mapping other strings to numeric categories with order and ratio relationships, converting columns with multiple options to the sets of 1/0 columns (similar to One-Hot Encoding but not that exactly), mapping Yes/No columns to 1/0 columns, simple string processing to parse "almost numeric" string columns to numeric columns, etc.

There are a lot of columns with NAN value. Features with large amount of missed data do not bring much usefull information, but increase the computational complexity. So, cut off those with amount of NAN > 0.4

In [88]:
cols_too_many_nans = full_df.isna().mean()
cols_too_many_nans = cols_too_many_nans[cols_too_many_nans > 0.4].index.values
In [89]:
preprocessed_df = full_df.drop(columns=cols_too_many_nans)

Fill other NANs with zeros as the simple strategy. (To be honest I've tried other ones, but they didn't show significantly better results)

In [90]:
preprocessed_df = preprocessed_df.fillna(0)

During the visual analysis it was found that it's better to clear ConvertedSalary column from outliers. Let's incorporate it into the dataset.

In [91]:
quantile_95 = preprocessed_df['ConvertedSalary'].quantile(0.95)
preprocessed_df['ConvertedSalary'] = preprocessed_df['ConvertedSalary'].map(lambda cs: min(cs, quantile_95))

The same for 'Salary' column.

In [92]:
quantile_95 = preprocessed_df['Salary'].quantile(0.95)
preprocessed_df['Salary'] = preprocessed_df['Salary'].map(lambda cs: min(cs, quantile_95))

RandomForestClassifier in sklearn library doesn't handle unordered numerical category features well, so they have to be converted to One-Hot Encoding using pd.get_dummies:

In [93]:
preprocessed_df = pd.get_dummies(preprocessed_df, columns = categorize_cols, drop_first=True,
                            prefix=categorize_cols, sparse=False)
In [94]:
preprocessed_df.head()
Out[94]:
CareerSatisfaction Hobby OpenSource Dependents AssessJob1 AssessJob2 AssessJob3 AssessJob4 AssessJob5 AssessJob6 ... EducationParents_8 EducationParents_9 SurveyTooLong_1 SurveyTooLong_2 SurveyTooLong_3 SurveyEasy_1 SurveyEasy_2 SurveyEasy_3 SurveyEasy_4 SurveyEasy_5
Respondent
1 7.0 1 0 1.0 10.0 7.0 8.0 1.0 2.0 5.0 ... 0 0 0 0 0 0 0 0 0 1
3 4.0 1 1 1.0 1.0 7.0 10.0 8.0 2.0 5.0 ... 0 0 0 0 0 0 1 0 0 0
4 6.0 1 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 0 1 0 0 1 0 0
5 3.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 1 0 0 0
7 6.0 1 0 1.0 8.0 5.0 7.0 1.0 2.0 6.0 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 712 columns

7.2 Dividing the data into training and hold-out sets

In [95]:
print("Preprocessed dataset shape:", preprocessed_df.shape)
Preprocessed dataset shape: (76504, 712)

There are 76504 rows in the preprocessed dataset. Let's separate 0.3 part of them into hold-out set. Respondents anwers are totaly independent from each other and from time characteristics, but the way records are listed is unknown, thus it's good to randomly shuffle them first.

In [97]:
# calling sample() with frac=1 will just shuffle all rows
preprocessed_df = preprocessed_df.sample(frac=1, random_state=17)

train_part_size = int(0.7 * preprocessed_df.shape[0])

train_df = preprocessed_df[:train_part_size]
test_df = preprocessed_df[train_part_size:]

print("Train shape: ", train_df.shape)
print("Test shape: ", test_df.shape)

train_df.to_csv('train.csv')
test_df.to_csv('test.csv')
Train shape:  (53552, 712)
Test shape:  (22952, 712)
In [ ]:
del test_df

Part 8. Cross-validation and adjustment of model hyperparameters

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import log_loss
In [99]:
train_df = pd.read_csv('train.csv', index_col='Respondent')

8.1 Initial evaluation of the model

First of all, evaluating the model "as is"

In [100]:
train_part = train_df.sample(frac=0.8, random_state=17)
validation_part = train_df.drop(train_part.index)
In [101]:
y_train = train_part['CareerSatisfaction']
X_train = train_part.drop(columns=['CareerSatisfaction'])

y_valid = validation_part['CareerSatisfaction']
X_valid = validation_part.drop(columns=['CareerSatisfaction'])

print("Train: ", X_train.shape, y_train.shape)
print("Test: ", X_valid.shape, y_valid.shape)
Train:  (42842, 711) (42842,)
Test:  (10710, 711) (10710,)
In [102]:
random_forest = RandomForestClassifier(random_state=17)
In [103]:
%%time 
random_forest.fit(X_train, y_train)
CPU times: user 2.73 s, sys: 156 ms, total: 2.89 s
Wall time: 3.11 s
Out[103]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=17, verbose=0, warm_start=False)
In [104]:
y_pred = random_forest.predict_proba(X_valid)
In [105]:
log_loss(y_valid, y_pred)
Out[105]:
6.8741212005329855

Doesn't look like too good result. Create some simple baselines to compare with

In [45]:
y_six_baseline = np.zeros(y_pred.shape)
y_six_baseline[:, 5] = 1.0
In [46]:
y_random_baseline = np.random.random_sample(y_pred.shape)
In [47]:
y_equal_likely_baseline = np.full(y_pred.shape, fill_value=1/y_pred.shape[1])
In [48]:
y_no_one_baseline = np.zeros(y_pred.shape)
In [50]:
print("All 6.0 baseline: ", log_loss(y_valid, y_six_baseline))
print("Random baseline: ", log_loss(y_valid, y_random_baseline))
print("Equaly likely baseline: ", log_loss(y_valid, y_equal_likely_baseline))
print("No one baseline: ", log_loss(y_valid, y_no_one_baseline))
All 6.0 baseline:  22.235748201952312
Random baseline:  2.2173541717971172
Equaly likely baseline:  1.9459101490553126
No one baseline:  1.945910149055314

It looks like logloss strongly penalizes constant-dominant-class predictor. But in general, results of the initial RandomForestClassifier are not good at all.
But the classifier was trained too quickly, in ~3s. It has n_estimators=10, may be it's too few for such a large amount of features?

In [24]:
random_forest = RandomForestClassifier(random_state=17, n_estimators=200)
In [94]:
%%time 
random_forest.fit(X_train, y_train)
CPU times: user 49.6 s, sys: 344 ms, total: 49.9 s
Wall time: 49.9 s
Out[94]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=17, verbose=0, warm_start=False)
In [96]:
y_pred = random_forest.predict_proba(X_valid)
initial_score = log_loss(y_valid, y_pred)
initial_score
Out[96]:
1.6535629549525597

That gives hope. Running GridSearchCV to tune hyperparameters.

8.2 Hyperparameters tuning

In [52]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])

Using 3 splits as one of the most frequently used amount, shuffle samples in random order.

In [ ]:
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
tree_params = {
    'n_estimators' : [250, 300, 350],
    'max_depth' : [None, 10, 20],
    'max_features' : ['sqrt', 'log2', 50]
}
gcv = GridSearchCV(random_forest, tree_params, scoring='neg_log_loss', cv=skf, verbose=1)
gcv.fit(X_crossvalid, y_crossvalid)
Fitting 3 folds for each of 27 candidates, totalling 81 fits
In [23]:
gcv.best_estimator_, gcv.best_score_
Out[23]:
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=20, max_features=50, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=350, n_jobs=1,
             oob_score=False, random_state=17, verbose=0, warm_start=False),
 -1.6281354697593482)

It looks like more complex model performs better. However, increasing of its complicity leads to the exhaustion of computing power and rising of MemoryErros. Staying with these parameters.

Split with 3 folds was used to reduce parameters search time. Cross-validation can be run with large amount of splits.

In [54]:
skf_5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_scores = cross_val_score(gcv.best_estimator_, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')
In [55]:
print(cv_scores)
print(np.mean(cv_scores))
[-1.62650046 -1.63022187 -1.62912835 -1.61962302 -1.61858969]
-1.6248126771219151

Cross-validation shows quite stable predictions values.

8.3 Confusion matrix

One more way to evaluate prediction results is a confusion matrix, that allows to see what classes are mixed out with each other the most often.

In [77]:
from sklearn.metrics import confusion_matrix
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
In [55]:
def evaluate_predictions(estimator, X, y, X_validation, y_validation):
    estimator.fit(X, y)
    y_pred = estimator.predict(X_validation)
    y_pred_proba = estimator.predict_proba(X_validation)
    
    cm = confusion_matrix(y_validation, y_pred)
    log_loss_value = log_loss(y_validation, y_pred_proba)
    
    # Convert matrix values to fractions due to different amount of samples accros classes.
    counts = cm.sum(axis=1)
    counts = counts.reshape(-1,1)
    sns.heatmap(cm / counts, annot=True)
    print("LogLoss value: ", log_loss_value)
In [36]:
evaluate_predictions(gcv.best_estimator_, X_train, y_train, X_valid, y_valid)
LogLoss value:  1.631079060634029

Not a beatiful picture. The most predictions were just attributed to the prevalent class. Maybe some samples balancing will help?

In [60]:
def balance_dataset(df):

    levels_counts = df['CareerSatisfaction'].value_counts().to_dict()
    mean_level_count = int(np.mean(list(levels_counts.values())))

    balanced_df = pd.DataFrame()

    for level in levels_counts.keys():
        level_rows = df[df['CareerSatisfaction']==level]
    
        level_df = level_rows.sample(n=mean_level_count, random_state=17, replace=True)

        index_start = 1 if balanced_df.empty else balanced_df.index[-1] + 1
        level_df = pd.DataFrame(
            index=range(index_start, index_start + mean_level_count),
            data=level_df.values,
            columns = df.columns
        )
        
        balanced_df = balanced_df.append(level_df)
    
    # randomly shuffle
    balanced_df = balanced_df.sample(frac=1.0, random_state=17)
        
    return balanced_df
In [38]:
train_balanced = balance_dataset(train_part)
print(train_balanced['CareerSatisfaction'].value_counts())
print(train_balanced.shape, validation_part.shape)
7.0    6120
3.0    6120
5.0    6120
1.0    6120
2.0    6120
4.0    6120
6.0    6120
Name: CareerSatisfaction, dtype: int64
(42840, 712) (10710, 712)
In [39]:
y_train_balanced = train_balanced['CareerSatisfaction']
X_train_balanced = train_balanced.drop(columns=['CareerSatisfaction'])
In [40]:
evaluate_predictions(gcv.best_estimator_, X_train_balanced, y_train_balanced, X_valid, y_valid)
LogLoss value:  1.7597617889670631

Predictions of non-prevalent classes became better due to prevalent class accuracy decrease, but the whole metric became worse as well.
It looks like current set of features just doesn't contain enough information to distinguish levels of satisfaction confidently. Trying to do something with that.

Part 9. Creation of new features and description of this process

In [7]:
train_df = pd.read_csv('train.csv', index_col='Respondent')
In [8]:
train_df.shape
Out[8]:
(53552, 712)

9.1 Amount of skipped questions

If there's a problem of lack of information, maybe the fact of missed information can bring a benefit?
Let's see if the average amount of skipped questions is different from one satisfaction level to another.

In [9]:
def get_skipped_questions_amount(df):
    isna_flags = df.drop(columns=['CareerSatisfaction']).isna()
    skipped = isna_flags.sum(axis=1).rename('SkippedQuestions')
    return skipped
In [10]:
full_train_df = full_df.loc[train_df.index]
skipped_questions = get_skipped_questions_amount(full_train_df)
In [11]:
skipped_mean = skipped_questions.mean()
skipped_mean
Out[11]:
17.04487227367792
In [12]:
skipped_questions_df = skipped_questions.to_frame()
skipped_questions_df['CareerSatisfaction'] = full_train_df['CareerSatisfaction']
skipped_questions_by_level = skipped_questions_df.groupby('CareerSatisfaction')
skipped_questions_by_level.describe()
Out[12]:
SkippedQuestions
count mean std min 25% 50% 75% max
CareerSatisfaction
1.0 1848.0 20.348485 20.379023 0.0 4.0 13.0 34.25 67.0
2.0 3714.0 17.715940 19.763913 0.0 3.0 9.0 27.00 67.0
3.0 4613.0 15.719272 19.040307 0.0 3.0 6.0 21.00 67.0
4.0 4426.0 19.266606 20.164159 0.0 4.0 10.0 32.00 68.0
5.0 9395.0 16.241192 18.878519 0.0 3.0 7.0 21.00 67.0
6.0 19546.0 16.228794 18.435652 0.0 3.0 10.0 19.00 68.0
7.0 10010.0 18.162338 18.289217 0.0 4.0 14.0 22.00 68.0
In [13]:
less_than_mean_skipped = skipped_questions_df[skipped_questions_df['SkippedQuestions'] < skipped_mean]
less_than_mean_skipped.groupby('CareerSatisfaction').size() / skipped_questions_by_level.size()
Out[13]:
CareerSatisfaction
1.0    0.629870
2.0    0.684976
3.0    0.728160
4.0    0.644600
5.0    0.719745
6.0    0.734268
7.0    0.698302
dtype: float64

The amount of respondents skipped more than average amount of questions differs from level to level. Adding this features, evaluating.

In [14]:
train_df['LessThanMeanAnswered'] = \
    skipped_questions_df.loc[train_df.index]['SkippedQuestions'].map(lambda skipped:
        int(skipped < skipped_mean))
train_df['MoreThanMeanAnswered'] = \
    skipped_questions_df.loc[train_df.index]['SkippedQuestions'].map(lambda skipped:
        int(skipped > skipped_mean))
In [18]:
random_forest = RandomForestClassifier(random_state=17, max_depth=20, max_features=50, n_estimators=350)
In [19]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])
In [20]:
skf_5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_scores = cross_val_score(random_forest, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')
In [21]:
print(cv_scores)
print(np.mean(cv_scores))
[-1.63418229 -1.62929985 -1.62544378 -1.6291633  -1.62368132]
-1.628354106506782

Doesn't seem to give improvement, removing this feature.

In [22]:
train_df = train_df.drop(columns=['LessThanMeanAnswered', 'MoreThanMeanAnswered'])

9.2 Trying to convert Salary programmatically

The ConvertedSalary column looks like manualy infered by each respondent from Salary, SalaryType and Currency. Maybe there were mistakes; trying to automate the process.

In [111]:
#!pip install currencyconverter
In [23]:
from currency_converter import CurrencyConverter
from datetime import date

converter = CurrencyConverter()
course_date=date(2018, 1, 18)

salary_type_encoder = category_encoders['SalaryType']
currency_encoder = category_encoders['Currency']
In [24]:
salary_type_encoder.classes_
Out[24]:
array(['Monthly', 'Unknown', 'Weekly', 'Yearly'], dtype=object)
In [25]:
currency_encoder.classes_
Out[25]:
array(['Australian dollars (A$)', 'Bitcoin (btc)', 'Brazilian reais (R$)',
       'British pounds sterling (£)', 'Canadian dollars (C$)',
       'Chinese yuan renminbi (¥)', 'Danish krone (kr)', 'Euros (€)',
       'Indian rupees (₹)', 'Japanese yen (¥)', 'Mexican pesos (MXN$)',
       'Norwegian krone (kr)', 'Polish złoty (zł)', 'Russian rubles (₽)',
       'Singapore dollars (S$)', 'South African rands (R)',
       'Swedish kroner (SEK)', 'Swiss francs', 'U.S. dollars ($)',
       'Unknown'], dtype=object)
In [26]:
currency_map = {
    'Australian dollars (A$)': 'AUD',
    'Bitcoin (btc)': 'XBT',
    'Brazilian reais (R$)': 'BRL',
    'British pounds sterling (£)': 'GBP',
    'Canadian dollars (C$)': 'CAD',
    'Chinese yuan renminbi (¥)': 'CNY',
    'Danish krone (kr)': 'DKK',
    'Euros (€)': 'EUR',
    'Indian rupees (₹)': 'INR',
    'Japanese yen (¥)': 'JPY',
    'Mexican pesos (MXN$)': 'MXN',
    'Norwegian krone (kr)': 'NOK',
    'Polish złoty (zł)': 'PLN',
    'Russian rubles (₽)': 'RUB',
    'Singapore dollars (S$)': 'SGD',
    'South African rands (R)': 'ZAR',
    'Swedish kroner (SEK)': 'SEK',
    'Swiss francs': 'CHF',
    'U.S. dollars ($)': 'USD',
}
In [27]:
def try_convert_salary(df):
    if pd.isna(df['SalaryType']):
        return 0.0    
    salary_type = salary_type_encoder.classes_[int(df['SalaryType'])]
    
    if (pd.isna(df['Salary']) or pd.isna(df['Currency']) or salary_type == 'Unknown'):
        return 0.0
    
    currency = currency_encoder.classes_[int(df['Currency'])]
    if (currency == 'Unknown' or currency == 'Bitcoin (btc)'):
        return 0.0
    
    currency = currency_map[currency]
    
    if currency == 'USD':
        in_usd = df['Salary']
    else:
        in_usd = converter.convert(df['Salary'], currency, 'USD', date=course_date)
    
    if salary_type == 'Yearly':
        return in_usd    
    elif salary_type == 'Monthly':
        return 12 * in_usd
    else:
        return 50 * in_usd    
In [32]:
train_df['AutoconvertedSalary'] = full_train_df.apply(try_convert_salary, axis=1)
In [33]:
quantile_95 = train_df['AutoconvertedSalary'].quantile(0.95)
train_df['AutoconvertedSalary'] = train_df['AutoconvertedSalary'].map(lambda cs: min(cs, quantile_95))
In [34]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])
In [35]:
skf_5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
In [36]:
cv_scores = cross_val_score(random_forest, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')
In [37]:
print(cv_scores)
print(np.mean(cv_scores))
[-1.63337175 -1.62900048 -1.62722743 -1.62910953 -1.62575021]
-1.6288918812550075

The worsening is noticeable. It's better not to add the feature.

In [38]:
train_df = train_df.drop(columns=['AutoconvertedSalary'])

9.3 Secondary features using experience columns

The YearsCoding, YearsCodingProf, LastNewJob and Age didn't show to hight correlation with the target feature, but maybe their ratios contains any usefull patterns?

In [39]:
train_df['CodingProportion'] = train_df.apply(lambda df:
    0.0 if pd.isna(df['YearsCoding']) or pd.isna(df['Age']) or (df['Age'] == 0)
        else df['YearsCoding']/ df['Age'],
    axis=1)

train_df['ProfProportion'] = train_df.apply(lambda df:
    0.0 if pd.isna(df['YearsCodingProf']) or pd.isna(df['YearsCoding']) or (df['YearsCoding'] == 0)
        else df['YearsCodingProf'] / df['YearsCoding'],
    axis=1)

train_df['LastJobProportion'] = train_df.apply(lambda df:
    0.0 if pd.isna(df['LastNewJob']) or pd.isna(df['Age']) or (df['Age'] == 0)
        else df['LastNewJob'] / df['Age'],
    axis=1)
In [42]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])

cv_scores = cross_val_score(random_forest, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')

print(cv_scores)
print(np.mean(cv_scores))
[-1.63293954 -1.6291314  -1.62552157 -1.6318263  -1.62324109]
-1.6285319781532475

Doesn't seem to give increase.

In [43]:
train_df = train_df.drop(columns=['CodingProportion', 'ProfProportion', 'LastJobProportion'])

9.4 Adding JobSatisfaction feature

As all other attempts didn't help, adding the JobSatisfaction column as the last resort, though initial intent was not to use this obvious feature.

In [46]:
job_satisfaction = survey_results['JobSatisfaction'].map(satisfaction_map).fillna(0.0)
train_df['JobSatisfaction'] = job_satisfaction.loc[train_df.index]
In [47]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])
In [105]:
cv_scores = cross_val_score(random_forest, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')

print(cv_scores)
print(np.mean(cv_scores))
[-1.49881005 -1.49754313 -1.49616572 -1.48898074 -1.48607919]
-1.4935157648402513

Relatively high model improvement, as expected.

In [48]:
test_df = pd.read_csv('test.csv', index_col='Respondent')
In [49]:
test_df['JobSatisfaction'] = job_satisfaction.loc[test_df.index]
test_df.head()
Out[49]:
CareerSatisfaction Hobby OpenSource Dependents AssessJob1 AssessJob2 AssessJob3 AssessJob4 AssessJob5 AssessJob6 ... EducationParents_9 SurveyTooLong_1 SurveyTooLong_2 SurveyTooLong_3 SurveyEasy_1 SurveyEasy_2 SurveyEasy_3 SurveyEasy_4 SurveyEasy_5 JobSatisfaction
Respondent
17971 6.0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1 0 0 1 0 0 1 0 0 6.0
35830 2.0 1 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 0 0 1 0 0 0 2.0
78890 5.0 0 0 0.0 8.0 7.0 2.0 4.0 1.0 3.0 ... 0 1 0 0 1 0 0 0 0 6.0
75148 4.0 1 1 1.0 7.0 6.0 9.0 8.0 2.0 3.0 ... 0 0 0 0 0 1 0 0 0 3.0
100409 3.0 0 0 0.0 9.0 6.0 2.0 5.0 4.0 1.0 ... 0 1 0 0 0 1 0 0 0 3.0

5 rows × 713 columns

In [50]:
test_df.to_csv('test.csv')
In [ ]:
del test_df

9.5 Decreasing amount of target classes

As feature engineering doesn't bring much outcome, and low-ranking classes are too small, decreasing granularity of target variable looks like a admissible approach.
So, uniting and remaping target classes as:

  • 1-3, namely "dissatisfied", with the new value "1"
  • 4-5, "middling satisfied", with value "2"
  • Leaving 6 and 7 left as is separate classes ("Moderately satisfied" and "Extremely satisfied") with new values "3" and "4"
  • </ul>

In [51]:
lower_granularity_map = {
    1.0 : 1.0,
    2.0 : 1.0,
    3.0 : 1.0,
    4.0 : 2.0,
    5.0 : 2.0,
    6.0 : 3.0,
    7.0 : 4.0
}
In [52]:
train_df['CareerSatisfaction'] = train_df['CareerSatisfaction'].map(lower_granularity_map)
In [53]:
y_crossvalid = train_df['CareerSatisfaction']
X_crossvalid = train_df.drop(columns=['CareerSatisfaction'])

cv_scores = cross_val_score(random_forest, X_crossvalid, y_crossvalid, cv=skf_5, scoring='neg_log_loss')

print(cv_scores)
print(np.mean(cv_scores))
[-1.16274709 -1.16683842 -1.16263035 -1.16241163 -1.15532128]
-1.1619897550013043

That looks better. Evaluating confusion matrix:

In [58]:
train_part = train_df.sample(frac=0.8, random_state=17)
validation_part = train_df.drop(train_part.index)

y_train = train_part['CareerSatisfaction']
X_train = train_part.drop(columns=['CareerSatisfaction'])

y_valid = validation_part['CareerSatisfaction']
X_valid = validation_part.drop(columns=['CareerSatisfaction'])

evaluate_predictions(random_forest, X_train, y_train, X_valid, y_valid)
LogLoss value:  1.1580000908041657