Remaking Figures from
Bertin's Semiology of Graphics

by Nicolas Kruchten, December 2020

One of my favourite books about data visualization is Semiology of Graphics: Diagrams, Networks, Maps by Jacques Bertin, from 1967. It's a dense, 450-page tome which is now considered a classic in the development of the field. Michael Friendly, in his Brief History of Data Visualization, identifies it as one of three key factors in the late-20th century renaissance in data visualization (following a golden age in the 19th century and a dark age in the early 20th).

One of my favourite parts of the book is a 40-page section called "The Graphic Problem", which enumerates by example 100 different visualizations of the same compact dataset, to outline what we would now call the design space of visualizations, and to forcefully make the point that the choice of visualization for a given dataset is far from obvious: the visualization must match the questions it is meant to answer. The dataset, which is printed in a small table at the beginning of the section, is quite simple on its face: the number of people working in agriculture, industry and services in each of 90 administrative divisions of France in 1954 (plus four derived columns: the sum of the first three, and each of the first three divided by that sum).

I don't think I can reasonably reproduce the entire section at high resolution here, but I'm including an overview of 32 of those pages, which should give a sense of the breadth of visual forms covered. There are bar charts (faceted, stacked, reordered, variable-width), histograms, concentration curves, scatterplots and scatterplot matrices, a parallel-coordinates plot, some ternary plots, and that's before we even get to maps. There are cartograms and graduated-symbol maps and contour maps and dot maps and choropleths and maps with pies and bars on them, and some stippled and striped maps that I don't believe have common names, as I've only ever seen them in this book.

The dataset is quite similar to one I spent a lot of time exploring through visualizations: the vote split between the top 3 candidates in a local Montreal mayoral election, which I visualized as a dot map, ternary charts, mosaic charts, choropleth maps and pies-on-maps. When I read this book, I was pleased to see that every idea I'd had was represented, and fascinated to see so many new ones, and I've been wanting to remake these graphics with modern tools ever since. I recently read Richard Brath's excellent new book Visualizing With Text, which includes a number of text-based remakes of these same figures, and which motivated me to actually carry out this project. I'm pleased that the visualization library I've been working on for the past couple of years, Plotly Express, is now mature enough to let me do a decent job at many of these figures with just a few lines of code.

An Intellectual Ancestor to Plotly Express

Before getting into the figures and code, I want to talk about one neat little feature of this book. Each of these graphics is accompanied by a little glyph, like the ones highlighted below in pink, which the book explains how to interpret or generate.


[Image: three figures from the book, each with its glyph highlighted in pink]

Each glyph is basically a very compact graphical specification/explanation of the chart, a kind of meta-legend. The L-shaped portion indicates which data variables are mapped to the horizontal and vertical dimensions of the surface. Sometimes order is indicated with an O character, as in the first figure on the left, which is ordered by Qt for total quantity. Any additional vertical or horizontal lines outside of that indicate faceting/stacking variables (lines without a crossbar, as in the lower half of the first image) or cumulative stacking variables (lines with a little crossbar, as in the upper half). Note that the horizontal and vertical dimensions are sometimes mapped to X and Y in a 2-d cartesian plot, and sometimes to "Geo" in a map. Diagonal lines indicate data variables that are embedded into the plane (Geo in the second image) or mapped to variables that "jump out" of the page, like color, size or value; the diagonal line appears to be missing in the top figure of the first image, and Qt is mapped to size in the second image. There are similar little glyphs for pie charts, maps, ternary charts etc. The third image shows the complex glyph for 90 scaled pie charts overlaid on a map: the X/Y dimensions are geography, and the pie charts are stacked by S for sector, cumulatively scaled by Q for quantity, sized by Qt for total quantity, and shaded by S for sector.

What I find fascinating about these little glyphs is that they are in a way the intellectual ancestors of the code blocks you'll find below, which are used to generate the interactive figures. I don't know if Bertin would sketch these glyphs first, then make the charts, or if he drew them on afterwards, but with modern tools like Plotly Express, we can write just a few lines of code which express the same ideas as these glyphs (in rather less ambiguous form) and the figures just appear! For those who know how to read it, the code also provides a clear specification of the figure. This is possible because the design of these libraries was informed by a line of thinking which originated with this book: the formalized semiological notion that visualization involves a sign-system wherein visual variables (signifiers) are mapped onto data variables (signifieds). The little glyphs I mentioned above were part of the explanation of this mapping. This line of thinking and its relationship to visualization software was further elaborated in Leland Wilkinson's book The Grammar of Graphics in the 90s, and has since been embedded into multiple generations of visualization software, including Plotly Express.
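
To make this concrete, here is a minimal sketch (not a figure from the book) of how the mapping expressed by the first glyph might be written as code, using the long_df DataFrame constructed in the next section:

import plotly.express as px

# Each keyword argument maps a data variable (signified) onto a visual
# variable (signifier), much like the strokes of Bertin's glyphs:
# y <- department, x <- quantity, stacking color <- sector
px.bar(long_df, y="department", x="quantity", color="sector")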

Data and Setup

The first step to remaking these figures with Plotly Express was to get the data into a Python-friendly format. The dataset nominally contains only a few columns of numbers, but in order to make maps we actually need some geographic data as well. This is implicit in the book, but when working with code, everything must be made explicit. It was a bit of a challenge since French administrative divisions have evolved since 1967, when the book was published (and the data is from 1954). I found a set of polygons for the modern boundaries of French departments and modified them as follows, to match the data in the book:

  1. I undid the 1964 reorganization of the Paris-region departments by merging departments 91 and 95 into department 78, and merging 92, 93 and 94 into 75.
  2. I then subtracted the present-day department 75 (the city of Paris) from the resulting department 75 and labelled it "P", as in the dataset in the book. I believe Paris was carved out of its surrounding department to keep department 75 from totally dominating all figures population-wise, although the book doesn't call this out specifically.
  3. I dropped Corsica as no data was provided for this island department, which explains the missing department number 20.
  4. I simplified the geometry of the polygons to reduce the file size and even out some of the inaccuracies I introduced when I merged the Paris-region departments. The simplified polygons have approximately the same level of detail as the maps in the book, which are only rendered a few centimeters across anyway.
  5. I added two new columns which I use to generate certain figures below: region, which is the modern-day administrative division that regroups multiple departments, and type, which groups the departments into 6 distinct "types" based on the relative rank of the three economic sectors: type A>S>I means there are more people working in agriculture than in services, and more in services than in industry, etc. (steps 1 and 5 are sketched in code just after this list).
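
Steps 1 and 5 lend themselves to short code sketches. Here is a hedged illustration of both, assuming a hypothetical modern_df GeoDataFrame of present-day departments with a code column, and the wide_df loaded below:

# Step 1 (sketch): relabel the post-1964 Paris-region codes, then dissolve
# the polygons back into their pre-1964 departments (modern_df is hypothetical)
modern_df["code_1954"] = modern_df["code"].replace(
    {"91": "78", "95": "78", "92": "75", "93": "75", "94": "75"}
)
merged = modern_df.dissolve(by="code_1954")

# Step 5 (sketch): derive the "type" label by ranking the three sectors
def sector_type(row):
    initials = {"agriculture": "A", "industry": "I", "services": "S"}
    ranked = sorted(initials, key=lambda sector: row[sector], reverse=True)
    return ">".join(initials[s] for s in ranked)

wide_df["type"] = wide_df.apply(sector_type, axis=1)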

Here is what the resulting dataset looks like when loaded from a 55 kB GeoJSON file using geopandas. (Note: for anyone wanting to play with this data, there's also a CSV that doesn't include the polygons.) This dataset is in "wide form", i.e. one row per department with multiple data columns, so I've loaded it as wide_df.

Note: The page you're reading is actually the HTML export of an interactive Jupyter notebook, meaning that from this point on you'll see code and its output interspersed with the prose. You can run and modify this code yourself, in your browser, for free, by launching the notebook on the Binder service.

In [1]:
import geopandas as gpd
wide_df = gpd.read_file("data/semiology_of_graphics.geojson").set_index("code")
display(wide_df.head(5))
     department  agriculture  industry  services    total  agriculture_pct  industry_pct  services_pct           region   type  geometry
code
P         PARIS         2000    575000    940000  1517000                0            38            62    Ile-de-France  S>I>A  POLYGON ((2.33190 48.81701, 2.22422 48.85352, ...
75        SEINE         8000    574000    550000  1132000                1            51            49    Ile-de-France  I>S>A  POLYGON ((2.55306 49.00982, 2.60700 48.77440, ...
59         NORD        81000    483000    296000   860000                9            56            34  Hauts-de-France  I>S>A  POLYGON ((2.06771 51.00651, 2.54633 51.08840, ...
78   SEINE & O.        46000    328000    356000   730000                6            45            49    Ile-de-France  S>I>A  POLYGON ((2.57166 48.69201, 2.50635 48.43161, ...
62       P.D.C.        94000    242000    137000   473000               20            51            29  Hauts-de-France  I>S>A  POLYGON ((2.06771 51.00651, 2.42899 50.65702, ...

It's actually more convenient to make certain kinds of figures, especially faceted ones, if the data is in "long form", i.e. one row per department per sector, so we'll also un-pivot, or melt(), the wide-form dataset and store the result in long_df for use below.

In [2]:
long_df = wide_df.reset_index().melt(
    id_vars=["code", "department", "total", "region", "type", "geometry"],
    value_vars=["agriculture", "industry", "services"],
    var_name="sector", value_name="quantity"
).set_index("code")
long_df["percentage"] = 100* long_df.quantity / long_df.total
display(long_df.head(5))
     department    total           region   type                                            geometry       sector  quantity  percentage
code
P         PARIS  1517000    Ile-de-France  S>I>A  POLYGON ((2.33190 48.81701, 2.22422 48.85352, ...  agriculture      2000    0.131839
75        SEINE  1132000    Ile-de-France  I>S>A  POLYGON ((2.55306 49.00982, 2.60700 48.77440, ...  agriculture      8000    0.706714
59         NORD   860000  Hauts-de-France  I>S>A  POLYGON ((2.06771 51.00651, 2.54633 51.08840, ...  agriculture     81000    9.418605
78   SEINE & O.   730000    Ile-de-France  S>I>A  POLYGON ((2.57166 48.69201, 2.50635 48.43161, ...  agriculture     46000    6.301370
62       P.D.C.   473000  Hauts-de-France  I>S>A  POLYGON ((2.06771 51.00651, 2.42899 50.65702, ...  agriculture     94000   19.873150

This is as good a place as any to load Plotly Express into the notebook and preconfigure some default values for reuse throughout the various figures below. Notably, we'll set some default colors and rendering orders for the sector and type variables (see inline explanations for the color scheme).

In [3]:
import plotly.express as px
px.defaults.height = 500
blue, red, green = px.colors.qualitative.Plotly[:3]
px.defaults.color_discrete_map = {
    # sectors: green for agriculture, red for industry, blue for services
    'agriculture': green, 'industry': red, 'services': blue,
    # types: each hex mixes red (industry), green (agriculture) and blue
    # (services), with the leading sector's channel strongest (BB), the
    # middle 88 and the last 55, e.g. S>I>A gets #8855BB:
    # strong blue, medium red, weak green
    'S>A>I': '#5588BB','S>I>A': '#8855BB','I>S>A': '#BB5588',
    'I>A>S': '#BB8855','A>I>S': '#88BB55','A>S>I': '#55BB88'
}
px.defaults.category_orders = dict(
    sector=["agriculture", "industry", "services"],
    type=['S>A>I','S>I>A','I>S>A','I>A>S','A>I>S','A>S>I']
)

Exploring the Design Space

The book includes a simplified decision-tree-like representation of the choices one must make when visualizing this dataset. Here's a slightly more in-depth version which drove some of my thinking in making the figures below:

  • Data: absolute quantities or percentages?
  • Granularity: one visual mark per department or one per sector per department? (see the sketch after this list)
  • Mark types: what type of visual marks will the figures include?
    • Charts:
      • Bars: fixed or variable width?
      • Points in abstract space: 2-d cartesian or ternary coordinates?
    • Maps:
      • Department polygons: scaled by geography or data?
      • Points on maps: one per department or on a regular grid?
  • Color: continuous or discrete?
  • Arrangement: one panel or multiple?
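
The "granularity" choice above maps directly onto which DataFrame a figure is built from. A hedged sketch, not figures from the book:

# one mark per department: wide form, one row supplies all three percentages
px.scatter_ternary(wide_df, a="agriculture_pct", b="industry_pct", c="services_pct")

# one mark per sector per department: long form, sector becomes a mappable column
px.strip(long_df, x="percentage", color="sector")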

I've arranged the figures I made in roughly the same order as they appear in the book, broadly organized by mark type.

Bars

The book begins by showing the various ways you can visualize the dataset using bar charts, starting with a simple horizontal bar chart of the absolute counts, faceted by sector, like the one below. The figure is interactive: you can hover over any bar to see the details of the data it encodes, which goes some way towards mitigating the legibility issues of the tiny label font (a problem in the book as well).

A note on color: the book uses color sparingly and only on certain pages, presumably for cost reasons. I'll consistently use a green/red/blue color scheme, because it's visually a bit more interesting and because color is free on computer screens. I also use color in places where the book uses value or crosshatching. (Note: the colors I've used here are not great from an accessibility/colorblind-friendliness perspective.)

This first figure is not much more helpful for understanding patterns in the data than the original table, other than giving a sense of the great disparity in magnitude between the biggest and smallest numbers: roughly two orders of magnitude.

In [4]:
fig = px.bar(long_df, x="quantity", y="department", color="sector", facet_col="sector", height=600)
fig.update_layout(bargap=0, showlegend=False)  # bars touch; facet titles make a legend redundant
fig.update_yaxes(tickfont_size=4, autorange="reversed")  # largest totals at the top, as in the book
fig.show()