In my previous blog post, we got to know the idea of "indentation-based complexity". We took a static view on the Linux kernel to spot the most complex areas.
This time, we want to track the evolution of the indentation-based complexity of a software system over time. We are especially interested in its correlation with the lines of code: if the lines of code of our system develop more or less steadily, but the amount of indentation per source code file keeps increasing, we surely have a complexity problem.
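Before diving into the evolution, a quick reminder what "indentation" means here: the leading whitespace of a source code line. A minimal sketch (this helper function is mine for illustration, not part of the later analysis):

```python
def indentation(line):
    # Normalize tabs to a single space each (the later Pandas analysis
    # does the same), then count the leading spaces.
    line = line.replace("\t", " ")
    return len(line) - len(line.lstrip(" "))

print(indentation("        return null;"))  # -> 8
print(indentation("int x = 0;"))  # -> 0
```

The more such leading spaces pile up per line, the deeper the nesting and thus the more complex the code.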
Again, this analysis is highly inspired by Adam Tornhill's book "Software Design X-Ray", which I currently always recommend if you want to take a deep dive into software data analysis.
For the calculation of the evolution of our software system, we can use data from the version control system. In our case, we can get all changes to Java source code files with Git. We just need to say the right magic words:
git log -p -- *.java
For the analysis below, I've redirected this output into the file git_diff.log.
This gives us data like the following:
commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date: Fri Aug 19 18:54:56 2016 +0200
Convert Controler's integration test to unit test
diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
index ee83b8a..a83255b 100644
--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
@@ -1,8 +1,5 @@
package org.springframework.samples.petclinic.web;
-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
We have

* the commit metadata with hash, author, date and commit message

commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date: Fri Aug 19 18:54:56 2016 +0200

Convert Controler's integration test to unit test

* the diff header with the old and new path of the changed file

diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java

* the extended index header

index ee83b8a..a83255b 100644

* the file markers for the old (`---`) and new (`+++`) version of the file

--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java

* and the full file diff where we can see additions or modifications (`+`) and deletions (`-`)

package org.springframework.samples.petclinic.web;
-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
import org.junit.Before;
We "just" have to get this data into our favorite data analysis framework, which is, of course, Pandas :-). We can actually do that! Let's see how!
Reading in such semi-structured data is a little challenge, but we can manage it with some tricks. First, we read in the whole Git diff history by standard means, using `read_csv` and the separator `\n` to get one row per text line. We make sure to give the column a nice name as well.
import pandas as pd
diff_raw = pd.read_csv(
"../../buschmais-spring-petclinic_fork/git_diff.log",
sep="\n",
names=["raw"])
diff_raw.head(16)
The output is the commit data that I've described above, where each line in the text file becomes one row in the DataFrame (blank lines are skipped).
Next, we skip all the data we don't need for sure. Especially the "extended index header" is dangerous: its two lines begin with `+++` and `---` and could get mixed up with the real diff data, which also begins with a `+` or a `-`. Fortunately, we can identify these rows easily: they are exactly the two rows that follow a row starting with `index`. Using the `shift` operation on that boolean mask, we can get rid of all those lines.
index_row = diff_raw.raw.str.startswith("index ")
ignored_diff_rows = (index_row.shift(1) | index_row.shift(2))
diff_raw = diff_raw[~(index_row | ignored_diff_rows)]
diff_raw.head(10)
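The shift trick is easier to see on a toy Series (hypothetical data): shifting the boolean mask down by one and two rows flags exactly the two lines that follow an `index` row.

```python
import pandas as pd

lines = pd.Series([
    "index ee83b8a..a83255b 100644",
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "+new code line"])

index_row = lines.str.startswith("index ")
# fill_value=False keeps the shifted masks clean boolean Series
ignored = index_row.shift(1, fill_value=False) | index_row.shift(2, fill_value=False)
print(lines[~(index_row | ignored)].tolist())  # -> ['+new code line']
```

Only the real diff line survives; the ambiguous `---`/`+++` file markers are gone.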
Next, we extract the metadata of each commit. We can identify the different entries with regular expressions that look for a specific keyword at the beginning of a line. We extract each piece of information into a new Series/column, because we'll need it later for every changed line in the software's history.
# the commit hash from the "commit <hash>" lines
diff_raw['commit'] = diff_raw.raw.str.split("^commit ").str[1]
# the commit timestamp from the "Date: ..." lines
diff_raw['timestamp'] = pd.to_datetime(diff_raw.raw.str.split("^Date: ").str[1])
# the changed file's path from the "diff --git ..." lines
diff_raw['path'] = diff_raw.raw.str.extract("^diff --git.* b/(.*)", expand=True)[0]
diff_raw.head()
To assign each commit's metadata to the remaining rows, we forward fill the empty cells using the `fillna` method.
diff_raw = diff_raw.fillna(method='ffill')
diff_raw.head(8)
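On a toy DataFrame (hypothetical data), the extract-then-forward-fill pattern looks like this (I use the newer `ffill()` spelling here):

```python
import pandas as pd

df = pd.DataFrame({"raw": [
    "commit abc123",
    "+some added line",
    "+another added line",
    "commit def456",
    "-a deleted line"]})

# the metadata exists only on its own row ...
df["commit"] = df.raw.str.split("^commit ").str[1]
# ... and gets propagated down to all following rows
df = df.ffill()
print(df.commit.tolist())
# -> ['abc123', 'abc123', 'abc123', 'def456', 'def456']
```

Every changed line now knows which commit it belongs to.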
We can now focus on the changed source code lines themselves. For the later indentation-based complexity calculation, we have to make sure that each line is indented with spaces only. So we replace all tabs with a space:
diff_raw['line'] = diff_raw.raw.str.replace("\t", " ")
diff_raw.head()
We can then identify added and deleted lines by their diff marker (`+` or `-`) and measure each line's indentation as the number of spaces that directly follow this marker:
diff_raw['added'] = diff_raw.line.str.extract("^\+( *).*$", expand=True)[0].str.len()
diff_raw['deleted'] = diff_raw.line.str.extract("^-( *).*$", expand=True)[0].str.len()
diff_raw.head()
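The indentation extraction can be checked on a few hypothetical diff lines: the regular expression captures the run of spaces right after the diff marker, and lines without the marker yield `NaN`.

```python
import pandas as pd

lines = pd.Series([
    "+    if (x) {",
    "+        return;",
    "-old line",
    " unchanged context"])

# capture the leading spaces of added lines and measure their length
added = lines.str.extract(r"^\+( *).*$", expand=True)[0].str.len()
print(added.tolist())  # -> [4.0, 8.0, nan, nan]
```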
We keep only the rows that represent added or deleted lines:
diff = \
diff_raw[
(~diff_raw['added'].isnull()) |
(~diff_raw['deleted'].isnull())].copy()
diff.head()
For the indentation metric, we only want to count real source code lines. So we mark comment-only lines as well as empty lines and treat everything else as source code:
diff['is_comment'] = diff.line.str[1:].str.match(r' *(//|/\*).*')
diff['is_empty'] = diff.line.str[1:].str.replace(" ","").str.len() == 0
diff['is_source'] = ~(diff['is_empty'] | diff['is_comment'])
diff.head()
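The classification can again be verified on toy data (note that the first character of each line is the diff marker and gets skipped via `str[1:]`):

```python
import pandas as pd

changed = pd.Series([
    "+    // a line comment",
    "+    /* a block comment */",
    "+",
    "+    int x = 0;"])

is_comment = changed.str[1:].str.match(r' *(//|/\*).*')
is_empty = changed.str[1:].str.replace(" ", "").str.len() == 0
is_source = ~(is_empty | is_comment)
print(is_source.tolist())  # -> [False, False, False, True]
```

Only the last line counts as real source code.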
Let's do a quick plausibility check by counting the diff markers at the beginning of all changed lines:
diff.raw.str[0].value_counts()
We also count each changed row as one added or deleted line to be able to track the pure lines of code as well:
diff['lines_added'] = (~diff.added.isnull()).astype('int')
diff['lines_deleted'] = (~diff.deleted.isnull()).astype('int')
diff.head()
The remaining missing values are filled with 0:
diff = diff.fillna(0)
diff.head()
Now we can resample the data on a daily basis to get the sums of all additions and deletions per day:
commits_per_day = diff.set_index('timestamp').resample("D").sum()
commits_per_day.head()
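A toy example (hypothetical timestamps) shows how `resample("D")` aggregates the per-line records into daily sums, with missing days filled with zeros:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-08-19 10:00", "2016-08-19 11:00", "2016-08-21 09:00"]),
    "lines_added": [1, 1, 1],
    "lines_deleted": [0, 1, 0]})

# one row per day, summed; the empty day in between becomes 0
per_day = df.set_index("timestamp").resample("D").sum()
print(per_day.lines_added.tolist())   # -> [2, 0, 1]
print(per_day.lines_deleted.tolist())  # -> [1, 0, 0]
```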
Finally, we plot the cumulative sums over time: the overall activity, the net indentation (added - deleted, our complexity signal) and the net lines of code (lines_added - lines_deleted):
%matplotlib inline
commits_per_day.cumsum().plot()
(commits_per_day.added - commits_per_day.deleted).cumsum().plot()
(commits_per_day.lines_added - commits_per_day.lines_deleted).cumsum().plot()
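The idea behind the cumulative plots in a nutshell: the running total of the daily deltas reconstructs the absolute size (or complexity) of the code base at each point in time. A tiny numeric sketch with made-up daily values:

```python
import pandas as pd

daily_added = pd.Series([10, 0, 5])
daily_deleted = pd.Series([2, 0, 1])
# the cumulative net delta is the code base size over time
growth = (daily_added - daily_deleted).cumsum()
print(growth.tolist())  # -> [8, 8, 12]
```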
As a cross check, the sum of all added minus all deleted lines gives us the net number of source code lines of the current code base:
diff_sum = diff.sum()
diff_sum.lines_added - diff_sum.lines_deleted
3913