In this blog post, I want to show you a nice complexity metric that works for most major programming languages that we use for our software systems – the indentation-based complexity metric. Adam Tornhill describes this metric in his fabulous book Software Design X-Rays on page 25 as follows:
With indentation-based complexity we count the leading tabs and whitespaces
to convert them into logical indentations. ... This works because indentations in code carry meaning. Indentations are used to increase readability by separating code blocks from each other.
He further says that the trend of the indentation-based complexity in combination with the lines of code metric a good indicator for complexity-growth. If the lines of code don't change but the indentation does, you got a complexity problem in your code.
In this blog post, we are just interested in a static view of the indentation of a code (the evolution of the indentation-based complexity will be discussed in another blog post). I want to show you how you can spot complex areas in your application using the Pandas data analysis framework. As application under analysis, I choose the Linux kernel project because we want to apply our analysis at big scale.
This analysis works as follows:
glob
read_csv()
methodSo let's start!
First, let's get all the files with a recursive file search by using glob
. Because the Linux kernel code is mostly written in the C programming language, we search for the corresponding file endings .c
(C program code files) and .h
(header files).
import glob
file_list = glob.glob("../../linux/**/*.[c|h]", recursive=True)
file_list[:5]
['../../linux\\arch\\alpha\\boot\\bootp.c', '../../linux\\arch\\alpha\\boot\\bootpz.c', '../../linux\\arch\\alpha\\boot\\main.c', '../../linux\\arch\\alpha\\boot\\misc.c', '../../linux\\arch\\alpha\\boot\\stdio.c']
With the file_list
at hand, we can now read in the content of the source code files. We could have used a standard Python with <path> as f
and read in the content with a f.read()
, but we want to get the data as early into a Pandas' DataFrame (abbreviated "df") to leverage the power of the data analysis framework.
For each path in our file list, we create a single DataFrame content_df
. We use a line break \n
as column separator, ensuring that we create a single row for each source code line. We also specify the encoding
parameter with the file encoding latin-1
to make sure that we can read in weird file contents, too (this is totally normal when working in an international team). We also set skip_blank_lines
to False
to keep blank lines. Keeping the blank lines is necessary for debugging purposes (you can inspect certain source code lines with a text editor more easily) as well as for later analysis.
The process of reading in all the source code could take a while for the thousands of files of the Linux kernel. After looping through and reading in all the data, we concatenate all DataFrames with the concat
method. This gives us a single DataFrame with all source code content for all source code files.
import pandas as pd
content_dfs = []
for path in file_list:
content_df = pd.read_csv(
path,
encoding='latin-1',
sep='\n',
skip_blank_lines=False,
names=['line']
)
content_df.insert(0, 'filepath', path)
content_dfs.append(content_df)
content = pd.concat(content_dfs)
content.head()
filepath | line | |
---|---|---|
0 | ../../linux\arch\alpha\boot\bootp.c | /* |
1 | ../../linux\arch\alpha\boot\bootp.c | * arch/alpha/boot/bootp.c |
2 | ../../linux\arch\alpha\boot\bootp.c | * |
3 | ../../linux\arch\alpha\boot\bootp.c | * Copyright (C) 1997 Jay Estabrook |
4 | ../../linux\arch\alpha\boot\bootp.c | * |
Let's see what we've got here in our DataFrame named content
.
content.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 22278494 entries, 0 to 266 Data columns (total 2 columns): filepath object line object dtypes: object(2) memory usage: 509.9+ MB
We have over 22 million lines of code (including comments and blank lines) in our DataFrame. Let's clean up that data a little bit.
We convert filepath
into a categorical data type to optimize performance and memory consumption. We then get the operating system specific directory separator right and get rid of the superfluous parts of the path.
content['filepath'] = pd.Categorical(content['filepath'])
content['filepath'] = content['filepath']\
.str.replace("\\", "/")\
.str.replace("../../linux/", "")
content.head(1)
filepath | line | |
---|---|---|
0 | arch/alpha/boot/bootp.c | /* |
We further replace all empty lines with blanks and tabular spaces with a suitable number of whitespaces (we assume here that a tab corresponds to four whitespaces).
content['line'] = content['line'].fillna("")
FOUR_SPACES = " " * 4
content['line'] = content['line'].str.replace("\t", FOUR_SPACES)
content.head(1)
filepath | line | |
---|---|---|
0 | arch/alpha/boot/bootp.c | /* |
Let's get some measures that can help us to spot complex code. We add some additional information to make further analysis and debugging easier: We keep track of the line number of each source code file and create a single continuous index for all source code lines.
content['line_number'] = content.index + 1
content = content.reset_index(drop=True)
content.head(1)
filepath | line | line_number | |
---|---|---|---|
0 | arch/alpha/boot/bootp.c | /* | 1 |
Next, we mark comment lines and blank lines. For the identification of the comments, we use a simple heuristic that should be sufficient in most cases.
content['is_comment'] = content['line'].str.match(r'^ *(//|/\*|\*).*')
content['is_empty'] = content['line'].str.replace(" ","").str.len() == 0
content.head(1)
filepath | line | line_number | is_comment | is_empty | |
---|---|---|---|---|---|
0 | arch/alpha/boot/bootp.c | /* | 1 | True | False |
Now we come to the key part: We calculate the indentation for each source code line. An indent is the number of whitespaces that come before the first non-blank character in a line. We just subtract the length from a line without the preceding whitespaces from the length of the whole line.
content['indent'] = content['line'].str.len() - content['line'].str.lstrip().str.len()
content.head(1)
filepath | line | line_number | is_comment | is_empty | indent | |
---|---|---|---|---|---|---|
0 | arch/alpha/boot/bootp.c | /* | 1 | True | False | 0 |
We make sure to only inspect the real source code lines. Because we have the information about the blank lines and comments, we can filter these out very easily. We immediately aggregate the indentations with a count
to get the number of source code lines as well as the summary of all indents for each filepath
aka source code file. We also rename the columns of our new DataFrame accordingly.
source_code_content = content[content['is_comment'] | content['is_empty']]
source_code = source_code_content.groupby('filepath')['indent'].agg(['count', 'sum'])
source_code.columns = ['lines', 'indents']
source_code.head()
lines | indents | |
---|---|---|
filepath | ||
Documentation/scheduler/sched-pelt.c | 39 | 58 |
Documentation/usb/usbdevfs-drop-permissions.c | 26 | 38 |
arch/alpha/boot/bootp.c | 95 | 201 |
arch/alpha/boot/bootpz.c | 184 | 334 |
arch/alpha/boot/main.c | 60 | 67 |
To get a quick overview of the data we have now, we plot the results in a scatter diagram.
%matplotlib inline
source_code.plot.scatter('lines', 'indents', alpha=0.3);
Alright, we have a very few outliners in both dimensions and a slight correlation between the lines of code and the indentation (who guessed it?). But for now, we leave the outliers where the are.
Let's build the ratio between the indentations and the lines of code to kind of normalize the data. This is the complexity measure that we are using further on.
source_code['complexity'] = source_code['indents'] / source_code['lines']
source_code.head(1)
lines | indents | complexity | |
---|---|---|---|
filepath | |||
Documentation/scheduler/sched-pelt.c | 39 | 58 | 1.487179 |
So how complex is our code in general? Let's get also an overview of the distribution of the complexity. We use a histogram to visualize this.
source_code['complexity'].hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x14288aabd68>
Almost 30,000 files are probably okayish regarding our complexity measure.
Next, we execute an analysis per component to find out where the most complex areas are in the application. For this, we first sum up the metrics for each component. We can identify a component in the Linux kernel roughly by using the first parts of the file path.
source_code['component'] = source_code.index\
.str.split("/", n=2)\
.str[:2].str.join(":")
source_code.head(1)
lines | indents | complexity | component | |
---|---|---|---|---|
filepath | ||||
Documentation/scheduler/sched-pelt.c | 39 | 58 | 1.487179 | Documentation:scheduler |
We then sum up all our measures per component.
measures_per_component = source_code.groupby('component').sum()
measures_per_component.head()
lines | indents | complexity | |
---|---|---|---|
component | |||
Documentation:scheduler | 39 | 58 | 1.487179 |
Documentation:usb | 26 | 38 | 1.461538 |
arch:alpha | 16690 | 20468 | 189.999151 |
arch:arc | 7836 | 10610 | 114.869605 |
arch:arm | 125156 | 135452 | 1563.318216 |
Last, we print out the TOP 10 most complex parts of the Linux kernel measured by our indents per lines ratio complexity
.
measures_per_component['complexity'].nlargest(10)
component drivers:net 3526.370717 drivers:staging 2612.145956 drivers:gpu 2504.533202 drivers:media 1946.165103 arch:arm 1563.318216 include:linux 1549.853606 arch:mips 1370.794157 arch:powerpc 1168.349146 arch:x86 1047.736624 drivers:scsi 967.622213 Name: complexity, dtype: float64
Finally, we plot our the complexity per component with a treemap. We use the Python visualization library pygal
(http://www.pygal.org) which is very easy to use and just fits our use case perfectly. As size for the treemap's rectangles, we use the lines of code of the components. As color, we choose red and visualize with the help of the red color's alpha level (aka normalized complexity comp_norm
between 0 and 1), how red a rectangle of the treemap should be.
This gives us a treemap with the following properties:
We render the treemap as PNG image (to save space) and display it directly in the notebook.
import pygal
from IPython.display import Image
config = pygal.Config(show_legend=False)
treemap = pygal.Treemap(config)
max_comp = measures_per_component['complexity'].max()
for row in measures_per_component.iterrows():
filename = row[0]
entry = row[1]
comp_norm = entry['complexity'] / max_comp
data = {}
data['value'] = entry['lines']
data['color'] = 'rgba(255,0,0,' + str(comp_norm) + ')'
treemap.add(filename, [data])
Image(treemap.render_to_png())
There is also an interactive treemap where you can interact with the graphic. But beware: It's almost a megabyte big!
In this blog post, I showed you how we can obtain an almost programming language agnostic measure for complexity by measuring the indentation of source code. We also spotted the TOP 10 most complex components and visualized the completed results as a treemap.
All this was relatively easy and straightforward to do by using standard Data Science tooling. I hope you liked it and I would be happy, if you could provide some feedback in the comment section.