In this notebook, we simply reproduce the examples presented in the paper "Retrieval and analysis of Eurostat open data with the eurostat package" by L. Lahti, P. Biecek, M. Kainu and J. Huovari (R Journal 9(1):385-392, 2017) and in the online version of the vignettes. See also this online tutorial on eurostat package.

Some of the examples which were producing errors were modfied.

In [1]:

library("eurostat")
library("ggrepel")

Loading required package: ggplot2
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

Finding data¶

In [2]:

toc <-get_eurostat_toc()
head(toc)

A tibble: 6 × 8
title	code	type	last update of data	last table structure change	data start	data end	values
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
Database by themes	data	folder	NA	NA	NA	NA	NA
General and regional statistics	general	folder	NA	NA	NA	NA	NA
European and national indicators for short-term analysis	euroind	folder	NA	NA	NA	NA	NA
Business and consumer surveys (source: DG ECFIN)	ei_bcs	folder	NA	NA	NA	NA	NA
Consumer surveys (source: DG ECFIN)	ei_bcs_cs	folder	NA	NA	NA	NA	NA
Consumers - monthly data	ei_bsco_m	dataset	28.05.2019	28.05.2019	1980M01	2019M05	NA

In [3]:

head(search_eurostat("passenger transport"))

A tibble: 6 × 8
title	code	type	last update of data	last table structure change	data start	data end	values
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
Volume of passenger transport relative to GDP	tran_hv_pstra	dataset	26.03.2019	26.03.2019	2000	2016	NA
Modal split of passenger transport	tran_hv_psmod	dataset	21.03.2019	21.03.2019	1990	2016	NA
Air passenger transport by reporting country	avia_paoc	dataset	23.05.2019	22.05.2019	1993	2018Q4	NA
Air passenger transport by main airports in each reporting country	avia_paoa	dataset	23.05.2019	05.04.2019	1993	2018Q4	NA
Air passenger transport between reporting countries	avia_paocc	dataset	10.04.2019	27.03.2019	1993	2018Q4	NA
Air passenger transport between main airports in each reporting country and partner reporting countries	avia_paoac	dataset	09.04.2019	27.03.2019	1993	2018Q4	NA

Downloading data¶

In [4]:

id <-search_eurostat("Modal split of passenger transport", type ="table")$code[1]
print(id)

[1] "t2020_rk310"

In [5]:

dat <-get_eurostat(id, time_format = "num")
str(dat)
head(dat)

Table t2020_rk310 cached at /tmp/RtmpNiXU9L/eurostat/t2020_rk310_num_code_TF.rds

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	2485 obs. of  5 variables:
 $ unit   : Factor w/ 1 level "PC": 1 1 1 1 1 1 1 1 1 1 ...
 $ vehicle: Factor w/ 3 levels "BUS_TOT","CAR",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ geo    : Factor w/ 34 levels "AT","BE","CH",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ time   : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ values : num  11 10.6 3.7 9.1 11.3 32.4 14.9 13.5 6 24.8 ...

A tibble: 6 × 5
unit	vehicle	geo	time	values
<fct>	<fct>	<fct>	<dbl>	<dbl>
PC	BUS_TOT	AT	1990	11.0
PC	BUS_TOT	BE	1990	10.6
PC	BUS_TOT	CH	1990	3.7
PC	BUS_TOT	DE	1990	9.1
PC	BUS_TOT	DK	1990	11.3
PC	BUS_TOT	EL	1990	32.4

In [6]:

dat2 <- get_eurostat(id, filters = list(geo = c("EU28", "FI"), lastTimePeriod=1), time_format = "num")
head(dat2)

A tibble: 6 × 5
unit	vehicle	geo	time	values
<fct>	<fct>	<fct>	<dbl>	<dbl>
PC	BUS_TOT	EU28	2016	9.4
PC	BUS_TOT	FI	2016	11.9
PC	CAR	EU28	2016	82.9
PC	CAR	FI	2016	82.5
PC	TRN	EU28	2016	7.7
PC	TRN	FI	2016	5.6

Replacing codes with labels

In [7]:

datl2 <- get_eurostat(id, filters = list(geo = c("EU28", "FI"),lastTimePeriod = 1), type = "label", time_format = "num")
head(datl2)

A tibble: 6 × 5
unit	vehicle	geo	time	values
<fct>	<fct>	<fct>	<dbl>	<dbl>
Percentage	Motor coaches, buses and trolley buses	European Union - 28 countries	2016	9.4
Percentage	Motor coaches, buses and trolley buses	Finland	2016	11.9
Percentage	Passenger cars	European Union - 28 countries	2016	82.9
Percentage	Passenger cars	Finland	2016	82.5
Percentage	Trains	European Union - 28 countries	2016	7.7
Percentage	Trains	Finland	2016	5.6

In [8]:

datl <- label_eurostat(dat)
head(datl)

A tibble: 6 × 5
unit	vehicle	geo	time	values
<fct>	<fct>	<fct>	<dbl>	<dbl>
Percentage	Motor coaches, buses and trolley buses	Austria	1990	11.0
Percentage	Motor coaches, buses and trolley buses	Belgium	1990	10.6
Percentage	Motor coaches, buses and trolley buses	Switzerland	1990	3.7
Percentage	Motor coaches, buses and trolley buses	Germany (until 1990 former territory of the FRG)	1990	9.1
Percentage	Motor coaches, buses and trolley buses	Denmark	1990	11.3
Percentage	Motor coaches, buses and trolley buses	Greece	1990	32.4

In [9]:

label_eurostat_vars(names(datl))
levels(datl$vehicle)

'Unit of measure'
'Vehicles'
'Geopolitical entity (reporting)'
'Period of time (a=annual, q=quarterly, m=monthly, d=daily, c=cumulated from January)'

'Motor coaches, buses and trolley buses'
'Passenger cars'
'Trains'

Selecting and modifying data¶

EFTA, Eurozone, EU and EU candidate countries

In [10]:

data(efta_countries)
efta_countries

A df[,3]: 4 × 3
code	name	label
<chr>	<chr>	<chr>
IS	Iceland	Iceland
LI	Liechtenstein	Liechtenstein
NO	Norway	Norway
CH	Switzerland	Switzerland

EU data from 2012 in all vehicles

In [11]:

dat_eu12 <-subset(datl, geo=="European Union (current composition)" & time==2012)
dat_eu12

A tibble: 0 × 5
unit	vehicle	geo	time	values
<fct>	<fct>	<fct>	<dbl>	<dbl>

EU data from 2000 - 2012 with vehicle types as variables

In [12]:

library("tidyr")
dat_eu_0012 <- subset(dat, geo == "EU28" & time %in% 2000:2012)
dat_eu_0012_wide <- spread(dat_eu_0012, vehicle, values)
subset(dat_eu_0012_wide, select = -geo)

A tibble: 13 × 5
unit	time	BUS_TOT	CAR	TRN
<fct>	<dbl>	<dbl>	<dbl>	<dbl>
PC	2000	10.4	82.5	7.1
PC	2001	10.2	82.8	7.0
PC	2002	9.8	83.4	6.8
PC	2003	9.8	83.6	6.6
PC	2004	9.7	83.5	6.7
PC	2005	9.8	83.4	6.9
PC	2006	9.6	83.4	7.0
PC	2007	9.8	83.1	7.1
PC	2008	9.8	82.9	7.4
PC	2009	9.2	83.7	7.1
PC	2010	9.4	83.5	7.2
PC	2011	9.4	83.2	7.4
PC	2012	9.5	82.9	7.7

Train passengers for selected EU countries in 2000 - 2012

In [13]:

dat_trains <- subset(datl, geo %in% c("Austria", "Belgium", "Finland", "Sweden") & time %in% 2000:2012 & vehicle == "Trains")
dat_trains_wide <- spread(dat_trains, geo, values)
subset (dat_trains_wide, select = - vehicle)

A tibble: 13 × 6
unit	time	Austria	Belgium	Finland	Sweden
<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
Percentage	2000	9.7	6.3	5.1	6.8
Percentage	2001	9.7	6.4	4.8	7.1
Percentage	2002	9.7	6.5	4.8	7.1
Percentage	2003	9.5	6.5	4.7	7.0
Percentage	2004	9.4	7.1	4.7	6.8
Percentage	2005	9.8	6.6	4.8	7.1
Percentage	2006	10.0	6.9	4.8	7.6
Percentage	2007	10.0	7.1	5.0	7.9
Percentage	2008	11.1	7.5	5.4	8.6
Percentage	2009	11.1	7.5	5.1	8.7
Percentage	2010	11.0	7.7	5.2	8.7
Percentage	2011	11.3	7.7	5.0	8.7
Percentage	2012	11.8	7.8	5.3	9.1

Visualization¶

In [14]:

library(ggplot2)
p <- ggplot (dat_trains, aes( x = time, y = values, colour = geo))
p <- p + geom_line()
print(p)

Triangle plot

In [15]:

library(tidyr)
library(plotrix)
library(eurostat)
library(dplyr)
library(tidyr)

# Some cleaning of the data is required
 energy3 <- get_eurostat("nrg_114a") %>%
 label_eurostat(dat) %>%
 filter(time == "2013-01-01") %>%
 mutate(geo = gsub(geo, pattern=" \\(.*", replacement="")) %>%
 select(product, geo, values) %>%
 group_by(product, geo) %>%
 summarise(svalue = sum(values)) %>%
 group_by(geo) %>%
 mutate(tvalue = sum(svalue),
 svalue = svalue/sum(svalue)) %>%
 filter(tvalue > 1000) %>% # only large countries
 spread(product, svalue)
 
# Triangle plot
 par(cex=0.75, mar=c(0,0,0,0))
 positions <- plotrix::triax.plot(as.matrix(energy3[, c(3,5,4)]),
                     show.grid = TRUE,
                     label.points= FALSE, point.labels = energy3$geo,
                     col.axis="gray50", col.grid="gray90",
                     pch = 19, cex.axis=0.8, cex.ticks=0.7, col="grey50")

 # Larger labels
 ind <- which(energy3$geo %in%  c("Norway", "Iceland","Denmark","Estonia", "Turkey", "Italy", "Finland"))
 df <- data.frame(positions$xypos, geo = energy3$geo)
 points(df$x[ind], df$y[ind], cex=2, col="red", pch=19)
 text(df$x[ind], df$y[ind], df$geo[ind], adj = c(0.5,-1), cex=1.5)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Table nrg_114a cached at /tmp/RtmpNiXU9L/eurostat/nrg_114a_date_code_TF.rds

Maps

Disposable income of private households by NUTS 2 regions at 1:60mln resolution using tmap

In [16]:

library(dplyr)
library(eurostat)
library(sf)
library(tmap)

# Download attribute data from Eurostat
sp_data <- eurostat::get_eurostat("tgs00026", time_format = "raw", stringsAsFactors = FALSE) %>% 
  # subset to have only a single row per geo
  dplyr::filter(time == 2010, nchar(geo) == 4) %>% 
  # categorise
  dplyr::mutate(income = cut_to_classes(values, n = 5))

# Download geospatial data from GISCO
geodata <- get_eurostat_geospatial(output_class = "sf", resolution = "60", nuts_level = 2, year = 2010)

# merge with attribute data with geodata
map_data <- inner_join(geodata, sp_data)

# Construct the map
map1 <- tmap::tm_shape(geodata) +
  tmap::tm_fill("lightgrey") +
  tmap::tm_shape(map_data) +
  tmap::tm_grid() +
  tmap::tm_polygons("income", title = "Disposable household\nincomes in 2010",  
                    palette = "Oranges")
tmap_mode("plot")
print(map1)

Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
Table tgs00026 cached at /tmp/RtmpNiXU9L/eurostat/tgs00026_raw_code_FF.rds

COPYRIGHT NOTICE

When data downloaded from this page 
<http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units>
is used in any printed or electronic publication, 
in addition to any other provisions 
applicable to the whole Eurostat website, 
data source will have to be acknowledged 
in the legend of the map and 
in the introductory page of the publication 
with the following copyright notice:

- EN: (C) EuroGeographics for the administrative boundaries
- FR: (C) EuroGeographics pour les limites administratives
- DE: (C) EuroGeographics bezuglich der Verwaltungsgrenzen

For publications in languages other than 
English, French or German, 
the translation of the copyright notice 
in the language of the publication shall be used.

If you intend to use the data commercially, 
please contact EuroGeographics for 
information regarding their licence agreements.
          
No encoding supplied: defaulting to UTF-8.

# --------------------------
HEADS UP!!

Function now returns the data in 'sf'-class (simple features) 
by default which is different 
from previous behaviour's 'SpatialPolygonDataFrame'. 

If you prefer either 'SpatialPolygonDataFrame' or 
fortified 'data_frame' (for ggplot2::geom_polygon), 
please specify it explicitly to 'output_class'-argument!

# --------------------------          
          
Joining, by = "geo"
tmap mode set to plotting

Disposable income of private households by NUTS 2 regions in Poland with labels at 1:1mln resolution using tmap

In [17]:

library(eurostat)
library(dplyr)
library(sf)
library(RColorBrewer)

# Downloading and manipulating the tabular data
print("Let us focus on year 2014 and NUTS-3 level")
euro_sf2 <- get_eurostat("tgs00026", time_format = "raw",
                         stringsAsFactors = FALSE,
			 filter = list(time = "2014")) %>% 
 
  # Subset to NUTS-3 level
  dplyr::filter(grepl("PL",geo)) %>% 
  # label the single geo column
  mutate(label = paste0(label_eurostat(.)[["geo"]], "\n", values, "€"),
         income = cut_to_classes(values))

print("Download geospatial data from GISCO")
geodata <- get_eurostat_geospatial(output_class = "sf", resolution = "60", nuts_level = 2, year = 2013)

# Merge with attribute data with geodata
map_data <- inner_join(geodata, euro_sf2)

# plot map
map2 <- tm_shape(geodata) +
  tm_fill("lightgrey") +
  tm_shape(map_data, is.master = TRUE) +
  tm_polygons("income", title = "Disposable household incomes in 2014",
              palette = "Oranges", border.col = "white") + 
  tm_text("NUTS_NAME", just = "center") + 
  tm_scale_bar()
map2

[1] "Let us focus on year 2014 and NUTS-3 level"
[1] "Download geospatial data from GISCO"

COPYRIGHT NOTICE

When data downloaded from this page 
<http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units>
is used in any printed or electronic publication, 
in addition to any other provisions 
applicable to the whole Eurostat website, 
data source will have to be acknowledged 
in the legend of the map and 
in the introductory page of the publication 
with the following copyright notice:

- EN: (C) EuroGeographics for the administrative boundaries
- FR: (C) EuroGeographics pour les limites administratives
- DE: (C) EuroGeographics bezuglich der Verwaltungsgrenzen

For publications in languages other than 
English, French or German, 
the translation of the copyright notice 
in the language of the publication shall be used.

If you intend to use the data commercially, 
please contact EuroGeographics for 
information regarding their licence agreements.
          
No encoding supplied: defaulting to UTF-8.

# --------------------------
HEADS UP!!

Function now returns the data in 'sf'-class (simple features) 
by default which is different 
from previous behaviour's 'SpatialPolygonDataFrame'. 

If you prefer either 'SpatialPolygonDataFrame' or 
fortified 'data_frame' (for ggplot2::geom_polygon), 
please specify it explicitly to 'output_class'-argument!

# --------------------------          
          
Joining, by = "geo"

Disposable income of private households by NUTS 2 regions at 1:10mln resolution using spplot

In [18]:

library(sp)
library(eurostat)
library(dplyr)
library(RColorBrewer)
dat <- get_eurostat("tgs00026", time_format = "raw", stringsAsFactors = FALSE) %>% 
  # subsetting to year 2014 and NUTS-2 level
  dplyr::filter(time == 2014, nchar(geo) == 4) %>% 
  # classifying the values the variable
  dplyr::mutate(cat = cut_to_classes(values))

# Download geospatial data from GISCO
geodata <- get_eurostat_geospatial(output_class = "spdf", resolution = "10", nuts_level = 2, year = 2013)

# merge with attribute data with geodata
geodata@data <- left_join(geodata@data, dat)

# plot map
sp::spplot(obj = geodata, "cat", main = "Disposable household income",
	   xlim = c(-22,34), ylim = c(35,70), 
           col.regions = c("dim grey", brewer.pal(n = 5, name = "Oranges")),
	   col = "white", usePolypath = FALSE)

Reading cache file /tmp/RtmpNiXU9L/eurostat/tgs00026_raw_code_FF.rds
Table  tgs00026  read from cache file:  /tmp/RtmpNiXU9L/eurostat/tgs00026_raw_code_FF.rds

COPYRIGHT NOTICE

When data downloaded from this page 
<http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units>
is used in any printed or electronic publication, 
in addition to any other provisions 
applicable to the whole Eurostat website, 
data source will have to be acknowledged 
in the legend of the map and 
in the introductory page of the publication 
with the following copyright notice:

- EN: (C) EuroGeographics for the administrative boundaries
- FR: (C) EuroGeographics pour les limites administratives
- DE: (C) EuroGeographics bezuglich der Verwaltungsgrenzen

For publications in languages other than 
English, French or German, 
the translation of the copyright notice 
in the language of the publication shall be used.

If you intend to use the data commercially, 
please contact EuroGeographics for 
information regarding their licence agreements.
          
No encoding supplied: defaulting to UTF-8.
SpatialPolygonDataFrame at resolution 1: 10  cached at:  /tmp/RtmpNiXU9L/eurostat/spdf1022013.RData

# --------------------------
HEADS UP!!

Function now returns the data in 'sf'-class (simple features) 
by default which is different 
from previous behaviour's 'SpatialPolygonDataFrame'. 

If you prefer either 'SpatialPolygonDataFrame' or 
fortified 'data_frame' (for ggplot2::geom_polygon), 
please specify it explicitly to 'output_class'-argument!

# --------------------------          
          
Joining, by = "geo"

Disposable income of private households by NUTS 2 regions at 1:60mln resolution using ggplot2

In [19]:

library(eurostat)
library(dplyr)
library(ggplot2)
dat <- get_eurostat("tgs00026", time_format = "raw", stringsAsFactors = FALSE) %>% 
  # subsetting to year 2014 and NUTS-2 level
  dplyr::filter(time == 2014, nchar(geo) == 4) %>% 
  # classifying the values the variable
  dplyr::mutate(cat = cut_to_classes(values))

# Download geospatial data from GISCO
geodata <- get_eurostat_geospatial(resolution = "60", nuts_level = "2", year = 2013)

# merge with attribute data with geodata
map_data <- inner_join(geodata, dat)

ggplot(data=map_data) + geom_sf(aes(fill=cat),color="dim grey", size=.1) + 
    scale_fill_brewer(palette = "Oranges") +
  guides(fill = guide_legend(reverse=T, title = "euro")) +
  labs(title="Disposable household income in 2014",
       caption="(C) EuroGeographics for the administrative boundaries 
                Map produced in R with a help from Eurostat-package <github.com/ropengov/eurostat/>") +
  theme_light() + theme(legend.position=c(.8,.8)) +
  coord_sf(xlim=c(-12,44), ylim=c(35,70))

Reading cache file /tmp/RtmpNiXU9L/eurostat/tgs00026_raw_code_FF.rds
Table  tgs00026  read from cache file:  /tmp/RtmpNiXU9L/eurostat/tgs00026_raw_code_FF.rds

COPYRIGHT NOTICE

When data downloaded from this page 
<http://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units>
is used in any printed or electronic publication, 
in addition to any other provisions 
applicable to the whole Eurostat website, 
data source will have to be acknowledged 
in the legend of the map and 
in the introductory page of the publication 
with the following copyright notice:

- EN: (C) EuroGeographics for the administrative boundaries
- FR: (C) EuroGeographics pour les limites administratives
- DE: (C) EuroGeographics bezuglich der Verwaltungsgrenzen

For publications in languages other than 
English, French or German, 
the translation of the copyright notice 
in the language of the publication shall be used.

If you intend to use the data commercially, 
please contact EuroGeographics for 
information regarding their licence agreements.
          
No encoding supplied: defaulting to UTF-8.

# --------------------------
HEADS UP!!

Function now returns the data in 'sf'-class (simple features) 
by default which is different 
from previous behaviour's 'SpatialPolygonDataFrame'. 

If you prefer either 'SpatialPolygonDataFrame' or 
fortified 'data_frame' (for ggplot2::geom_polygon), 
please specify it explicitly to 'output_class'-argument!

# --------------------------          
          
Joining, by = "geo"

SDMX¶

In [20]:

library(rsdmx)

# Data set URL
url <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/rail_if_line_tr/..TOTAL.1.BE/?startperiod=2005&endPeriod=2011"

# Read the data from eurostat
d <- readSDMX(url)

# Convert to data frame and show the first entries
df <- as.data.frame(d)

head(df)

A df[,8]: 6 × 8
TRA_INFR	UNIT	N_TRACKS	GEO	FREQ	obsTime	obsValue	OBS_STATUS
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<chr>
TOTAL	KM	1	BE	A	2011	NA	na
TOTAL	KM	1	BE	A	2010	NA	na
TOTAL	KM	1	BE	A	2009	796	NA
TOTAL	KM	1	BE	A	2008	744	NA
TOTAL	KM	1	BE	A	2007	828	NA
TOTAL	KM	1	BE	A	2006	825	NA

Further examples¶

In [21]:

query <- search_eurostat("road accidents", type = "table")

In [22]:

query$code[[1]]
query$title[[1]]

'sdg_11_40'

'People killed in road accidents (source: DG MOVE)'

Note that the table tsdtr420 mentioned in the original paper has been discontinued (see for instance this archive), it has been replaced with a dedicated EU SDG indicator, namely sdg_11_40 (this table).

In [23]:

dat <- get_eurostat(id = "sdg_11_40", time_format = "num")
head(dat)

Table sdg_11_40 cached at /tmp/RtmpNiXU9L/eurostat/sdg_11_40_num_code_TF.rds

A tibble: 6 × 4
unit	geo	time	values
<fct>	<fct>	<dbl>	<dbl>
NR	AT	2000	976
NR	BE	2000	1470
NR	BG	2000	1012
NR	CH	2000	592
NR	CY	2000	111
NR	CZ	2000	1486

We can still run similar operations:

In [24]:

countries <- c("UK", "SK", "FR", "PL", "ES", "PT")
t1 <- get_eurostat("sdg_11_40", filters = list(geo = countries))

In [25]:

library("ggplot2")
library("ggrepel")
ggplot(t1, aes(x = time, y = values, color = geo, group = geo, shape = geo)) +
    geom_point(size = 4) + geom_line() + theme_bw() +
    ggtitle("Road accidents") + xlab("Year") + ylab("Victims (n)") +
    theme(legend.position = "none") +
    ggrepel::geom_label_repel(data = t1 %>% group_by(geo) %>% na.omit() %>%
    filter(time %in% c(min(time), max(time))), aes(fill = geo, label = geo), color = "white")

Warning message:
“Removed 2 rows containing missing values (geom_point).”Warning message:
“Removed 2 rows containing missing values (geom_path).”