import pandas as pd
Markus Harrer, Software Development Analyst
@feststelltaste
Visual Software Analytics Summer School, 18 September 2019
*and househusband
"Software Analytics is analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions."
=> a great variety!
Individual systems == individual problems => individual analyses => individual insights!
Thomas Zimmermann in "One size does not fit all":
But: "... the methods typically are applicable on different datasets." => we see what's possible!
"Statistics on a Mac."
Data Science Venn Diagram (Drew Conway)
"Without data you're just another person with an opinion."
=> Delivering credible insights based on facts.
"The aim of science is to seek the simplest explanations of complex facts."
=> Working out insights in a comprehensible way.
Data from Stack Overflow Developer Survey 2019
"Who's Actively Looking for a Job?" (Top 5)
"R is for statisticians who want to program, Python is for developers who want to do statistics."
"100" == max. popularity!
"A data scientist is someone who
is better at statistics
than any software engineer
and better at software engineering
than any statistician."
Not as far away as you may have thought!
Roger Peng's "Stages of Data Analysis"
I. Stating Question
II. Exploratory Data Analysis
III. Formal Modeling
IV. Interpretation
V. Communication
=> from a question, via data, to insights!
...of inductive software engineering" (Tim Menzies)
(Intent + Code + Data + Results)
* Logical Step
+ Automation
= Literate Statistical Programming
Approach: Computational notebooks
Interactive Notebook
=> Working out results in a comprehensible way!
Best programming language for Data Science!
=> Data Analysis becomes repeatable
Pragmatic data analysis framework
=> Good integration point for your data sources!
Programmable visualization library
=> Direct visualization of results in Jupyter Notebooks
=> Provides the flexibility that is needed in specific situations
Jupyter Notebook also works with other technology platforms, e.g.
=> If you want to use special technology, you can!
Data Science Python Distribution
=> Download, install, ready, go!
https://www.feststelltaste.de/category/top5/
Courses, videos, blogs, books and more...
*some pages are still under development
Meta goal: Get to know the basic mechanics of the stack.
We load a Git log dataset extracted from a Git repository.
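A minimal sketch of the loading step. The real export is not shown here, so a tiny in-memory CSV stands in for it; the six column names and all values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the exported Git log; column names are assumptions.
csv_text = """additions,deletions,filename,sha,timestamp,author
10,2,src/main/java/Foo.java,a1b2c3,2019-03-01 10:15:00,Alice
3,1,README.md,d4e5f6,2018-07-12 08:00:00,Bob
7,0,src/main/java/Bar.java,0a9b8c,2019-05-20 16:45:00,Alice
"""

# In a real analysis, a file path or URL would replace the StringIO object.
log = pd.read_csv(io.StringIO(csv_text))
print(log.head())
```

`pd.read_csv` returns a DataFrame, so every later step works on the same object.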
We explore some basic key elements of the dataset.
1 DataFrame (~ programmable Excel worksheet), 6 Series (= columns), 1128819 rows (= entries)
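A first look at a DataFrame could go roughly like this. The toy data and its column names are assumptions; only the DataFrame/Series/rows vocabulary comes from the slide.

```python
import pandas as pd

# Toy stand-in for the Git log; the six column names are assumptions.
log = pd.DataFrame({
    "additions": [10, 3],
    "deletions": [2, 1],
    "filename": ["src/Foo.java", "README.md"],
    "sha": ["a1b2c3", "d4e5f6"],
    "timestamp": ["2019-03-01 10:15:00", "2018-07-12 08:00:00"],
    "author": ["Alice", "Bob"],
})

log.info()                    # column names, dtypes, non-null counts
n_rows, n_cols = log.shape    # (entries, columns)
filenames = log["filename"]   # selecting one column yields a Series
print(n_rows, n_cols, type(filenames).__name__)
```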
We convert the timestamp text into a real timestamp object.
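One way to do this conversion is `pd.to_datetime`. The column name `timestamp` and the sample values are assumptions.

```python
import pandas as pd

# Toy data; the "timestamp" column name and the values are assumptions.
log = pd.DataFrame({
    "filename": ["A.java", "B.java", "C.java"],
    "timestamp": ["2019-03-01 10:15:00", "2018-07-12 08:00:00", "2019-05-20 16:45:00"],
})

# Parse the text column into datetime64 values.
log["timestamp"] = pd.to_datetime(log["timestamp"])
print(log.dtypes)
```

With a real timestamp type, comparisons and the `.dt` accessor (year, month, weekday, ...) become available.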
We filter out older changes.
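A sketch of such a filter, assuming "older" means more than one year before the most recent change. The slide does not state the cutoff, so it is a hypothetical choice, as are the column names and values.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
log = pd.DataFrame({
    "filename": ["A.java", "B.java", "C.java"],
    "timestamp": pd.to_datetime(["2019-05-20", "2018-07-12", "2017-01-03"]),
})

# Hypothetical cutoff: one year before the latest change in the log.
cutoff = log["timestamp"].max() - pd.DateOffset(years=1)

# Boolean indexing keeps only the rows newer than the cutoff.
recent = log[log["timestamp"] > cutoff]
print(recent)
```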
We keep just code written in Java.
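Selecting Java files can be sketched with a string test on the file path. Filtering by the `.java` suffix is an assumption about how Java code is identified, and the column name is hypothetical.

```python
import pandas as pd

# Toy data; the "filename" column name is an assumption.
log = pd.DataFrame({
    "filename": ["src/Foo.java", "README.md", "src/Bar.java", "build.gradle"],
    "sha": ["a1", "b2", "c3", "d4"],
})

# The .str accessor applies the string method to every entry of the Series.
is_java = log["filename"].str.endswith(".java")
java = log[is_java]
print(java)
```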
We aggregate the rows by counting the number of changes per file.
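Counting changes per file can be sketched with `groupby`. The column names `filename` and `sha` and the result column `changes` are assumptions.

```python
import pandas as pd

# Toy data: one row per changed file per commit; names are assumptions.
log = pd.DataFrame({
    "filename": ["A.java", "B.java", "A.java", "A.java"],
    "sha": ["c1", "c1", "c2", "c3"],
})

# Group by file and count the rows in each group.
changes = (
    log.groupby("filename")[["sha"]]
    .count()
    .rename(columns={"sha": "changes"})
)
print(changes)
```

The result is indexed by file name, which makes it convenient to join with other per-file data.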
We add additional information about the number of lines of all currently existing files...
...and join this data with the existing dataset.
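A sketch of the join, assuming both tables are indexed by file name. The `lines` table and all values are hypothetical. An inner join is one possible choice: it drops files that appear in the history but no longer exist.

```python
import pandas as pd

# Changes per file, taken from the history (values are assumptions).
changes = pd.DataFrame(
    {"changes": [3, 1, 2]},
    index=pd.Index(["A.java", "B.java", "Old.java"], name="filename"),
)

# Lines of code of the files that currently exist (hypothetical second source).
loc = pd.DataFrame(
    {"lines": [120, 45]},
    index=pd.Index(["A.java", "B.java"], name="filename"),
)

# Join on the shared index; how="inner" keeps only files present in both tables.
hotspots = changes.join(loc, how="inner")
print(hotspots)
```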
We show only the TOP 10 hotspots in the code.
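Picking the top entries can be sketched with `nlargest`. Ranking by the number of changes is an assumption about what makes a hotspot here, and the data is made up.

```python
import pandas as pd

# Twelve hypothetical files with 1..12 changes each.
hotspots = pd.DataFrame(
    {"changes": list(range(1, 13))},
    index=pd.Index([f"F{i}.java" for i in range(1, 13)], name="filename"),
)

# Keep the ten rows with the highest change counts, largest first.
top10 = hotspots.nlargest(10, "changes")
print(top10)
```

`sort_values("changes", ascending=False).head(10)` would be an equivalent alternative.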
We plot the TOP 10 list as an XY diagram.
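An XY diagram of this kind can be sketched with the pandas plotting API, which uses matplotlib underneath. The column names, the axis assignment, and the values are assumptions. The `Agg` backend is chosen only so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in Jupyter the plot is shown inline instead
import pandas as pd

# Hypothetical hotspot list: changes and lines of code per file.
top10 = pd.DataFrame(
    {"changes": [12, 9, 7], "lines": [840, 310, 1200]},
    index=pd.Index(["A.java", "B.java", "C.java"], name="filename"),
)

# Scatter plot: one point per file.
ax = top10.plot.scatter(x="changes", y="lines")
ax.set_xlabel("number of changes")
ax.set_ylabel("lines of code")
```

Files in the upper right corner are both large and frequently changed.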
vmstat
jdeps
and visualization with D3
1. Software Analytics with Data Science is possible!
2. If you need to go into deeper analysis: you can!
3. There are many data sources in software development. What are you waiting for?
=> from a question, via data, to insights!