Country Converter¶

The country converter (coco) is a Python package to convert country names into different classifications and between different naming versions. Internally it uses regular expressions to match country names.

Installation¶

The package is available on PyPI. To install it, use

pip install country_converter --upgrade

from the command line, or use your preferred Python package installer. The source code is available on GitHub: https://github.com/IndEcol/country_converter

Conversion¶

The country converter provides one main class which is used for the conversion:

In [1]:

import country_converter as coco

In [2]:

converter = coco.CountryConverter()

Given a list of countries in a certain classification:

In [3]:

iso3_codes = ["USA", "VUT", "TKL", "AUT", "AFG", "ALB"]

This list can be converted to any other available classification with:

In [4]:

converter.convert(names=iso3_codes, src="ISO3", to="name_official")

Out[4]:

['United States of America',
 'Republic of Vanuatu',
 'Tokelau',
 'Republic of Austria',
 'Islamic Republic of Afghanistan',
 'Republic of Albania']

or

In [5]:

converter.convert(names=iso3_codes, src="ISO3", to="continent")

Out[5]:

['America', 'Oceania', 'Oceania', 'Europe', 'Asia', 'Europe']

The parameter "src" specifies the input format and "to" the output format. Possible values for both parameters can be found with:

In [6]:

converter.valid_class

Out[6]:

['APEC',
 'BASIC',
 'BRIC',
 'CIS',
 'Cecilia2050',
 'DACcode',
 'EEA',
 'EU',
 'EU12',
 'EU15',
 'EU25',
 'EU27',
 'EU27_2007',
 'EU28',
 'EURO',
 'EXIO1',
 'EXIO1_3L',
 'EXIO2',
 'EXIO2_3L',
 'EXIO3',
 'EXIO3_3L',
 'Eora',
 'FAOcode',
 'G20',
 'G7',
 'GBDcode',
 'GWcode',
 'IEA',
 'IMAGE',
 'ISO2',
 'ISO3',
 'ISOnumeric',
 'MESSAGE',
 'OECD',
 'REMIND',
 'Schengen',
 'UN',
 'UNcode',
 'UNmember',
 'UNregion',
 'WIOD',
 'ccTLD',
 'continent',
 'name_official',
 'name_short',
 'obsolete',
 'regex']

Internally, these names are the column headers of the underlying pandas dataframe (see below).

The convert function can also be called without instantiating the CountryConverter, which is useful for one-off conversions. For repeated conversions, creating a CountryConverter instance once avoids re-reading the file with the matching data for each conversion.

In [7]:

coco.convert(names=iso3_codes, src="ISO3", to="ISO2")

Out[7]:

['US', 'VU', 'TK', 'AT', 'AF', 'AL']
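The trade-off between the module-level function and a reusable instance can be sketched as follows. This is a minimal illustration, not part of the package: the helper name convert_batches and its optional converter argument are assumptions made for the example. Only coco.CountryConverter() and its convert method come from the text above.

```python
def convert_batches(batches, src, to, converter=None):
    """Convert several lists of names with one shared converter instance.

    `convert_batches` is a hypothetical helper, not a coco function.
    """
    if converter is None:
        # Imported lazily so the helper can also be tried with a stand-in object.
        import country_converter as coco

        # The data file is read once, when the instance is created.
        converter = coco.CountryConverter()
    # Every batch reuses the same instance, so the data is not read again.
    return [converter.convert(names=batch, src=src, to=to) for batch in batches]
```

A call such as convert_batches([["USA", "AUT"], ["AFG", "ALB"]], src="ISO3", to="ISO2") would then create the converter once and use it for both lists.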

Some of the classifications can also be accessed through shortcuts. For example:

In [8]:

converter.EU27

Out[8]:

	name_short	EU27
14	Austria	EU27
21	Belgium	EU27
35	Bulgaria	EU27
55	Croatia	EU27
58	Cyprus	EU27
59	Czech Republic	EU27
60	Denmark	EU27
70	Estonia	EU27
76	Finland	EU27
77	France	EU27
84	Germany	EU27
87	Greece	EU27
101	Hungary	EU27
107	Ireland	EU27
110	Italy	EU27
122	Latvia	EU27
128	Lithuania	EU27
129	Luxembourg	EU27
137	Malta	EU27
156	Netherlands	EU27
177	Poland	EU27
178	Portugal	EU27
182	Romania	EU27
196	Slovakia	EU27
197	Slovenia	EU27
204	Spain	EU27
215	Sweden	EU27

In [9]:

converter.OECDas("ISO2")

Out[9]:

	ISO2	OECD
13	AU	1971.0
14	AT	1961.0
21	BE	1961.0
41	CA	1961.0
45	CL	2010.0
49	CO	2020.0
53	CR	2021.0
59	CZ	1995.0
60	DK	1961.0
70	EE	2010.0
76	FI	1969.0
77	FR	1961.0
84	DE	1961.0
87	GR	1961.0
101	HU	1996.0
102	IS	1961.0
107	IE	1961.0
109	IL	2010.0
110	IT	1962.0
112	JP	1964.0
122	LV	2016.0
128	LT	2018.0
129	LU	1961.0
143	MX	1994.0
156	NL	1961.0
158	NZ	1973.0
166	NO	1961.0
177	PL	1996.0
178	PT	1961.0
196	SK	2000.0
197	SI	2010.0
202	KR	1996.0
204	ES	1961.0
215	SE	1961.0
216	CH	1961.0
228	TR	1961.0
235	GB	1961.0
236	US	1961.0

Handling missing data¶

The return value for entries that are not found is set to 'not found' by default:

In [10]:

iso3_codes_missing = ["ABC", "AUT", "XXX"]
converter.convert(iso3_codes_missing, src="ISO3")

ABC not found in ISO3
XXX not found in ISO3

Out[10]:

['not found', 'AUT', 'not found']

but can also be set to something else:

In [11]:

converter.convert(iso3_codes_missing, src="ISO3", not_found="missing")

ABC not found in ISO3
XXX not found in ISO3

Out[11]:

['missing', 'AUT', 'missing']

Alternatively, entries that are not found can be passed through unchanged by passing None to not_found:

In [12]:

converter.convert(iso3_codes_missing, src="ISO3", not_found=None)

ABC not found in ISO3
XXX not found in ISO3

Out[12]:

['ABC', 'AUT', 'XXX']
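The three behaviours of not_found can be summarised with a small stand-alone sketch. It uses a plain dictionary instead of the coco data, and the function name lookup and the toy table are assumptions for illustration only. It mirrors the results shown above but is not the package's implementation.

```python
def lookup(codes, table, not_found="not found"):
    """Map each code through `table`, handling misses like the examples above."""
    result = []
    for code in codes:
        if code in table:
            result.append(table[code])
        elif not_found is None:
            result.append(code)  # pass the unmatched input through unchanged
        else:
            result.append(not_found)  # substitute the placeholder value
    return result


toy_table = {"AUT": "AUT"}  # toy stand-in for the ISO3 column
codes = ["ABC", "AUT", "XXX"]

print(lookup(codes, toy_table))                       # ['not found', 'AUT', 'not found']
print(lookup(codes, toy_table, not_found="missing"))  # ['missing', 'AUT', 'missing']
print(lookup(codes, toy_table, not_found=None))       # ['ABC', 'AUT', 'XXX']
```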

To extend the underlying dataset, an additional dataframe (or file) can be passed. Note that all entries shown below (name_short, name_official, regex, ISO2 and ISO3) must be specified.

In [13]:

import pandas as pd

add_data = pd.DataFrame.from_dict(
    {
        "name_short": ["xxx country", "abc country"],
        "name_official": ["The XXX country", "The ABC country"],
        "regex": ["xxx country", "abc country"],
        "ISO2": ["xx", "ab"],
        "ISO3": ["xxx", "abc"],
    }
)

In [14]:

add_data

Out[14]:

	name_short	name_official	regex	ISO2	ISO3
0	xxx country	The XXX country	xxx country	xx	xxx
1	abc country	The ABC country	abc country	ab	abc

In [15]:

extended_converter = coco.CountryConverter(additional_data=add_data)
extended_converter.convert(iso3_codes_missing, src="ISO3", to="name_short")

Out[15]:

['abc country', 'Austria', 'xxx country']

As an alternative to an ad hoc dataframe, additional data files can be passed. These must have the same format as the basic data set. An example can be found here: https://github.com/IndEcol/country_converter/tree/master/tests/custom_data_example.txt

The custom data example contains the ISO3 code mapping for Romania before 2002 and switches the regex matching for Congo between DR Congo and Congo Republic.

To use it, pass the path to the additional country file:

In [16]:

# extended_converter = coco.CountryConverter(additional_data="path/to/datafile")

The passed data (file or dataframe) must contain at least the headers 'name_official', 'name_short' and 'regex'. If the additional data is to be used for a conversion to or from any other field, those columns must be included as well.
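A minimal sketch of how such an additional data file could be written with the standard library is given below. It assumes a tab-separated, UTF-8 encoded layout and adds an ISO3 column to the three required headers. The helper name write_additional_data, the file name and the example row are illustrative assumptions, so the result should be compared against the linked custom data example before use.

```python
import csv
import os
import tempfile

# Required headers plus one extra classification column (ISO3).
FIELDS = ["name_short", "name_official", "regex", "ISO3"]

rows = [
    {
        "name_short": "xxx country",
        "name_official": "The XXX country",
        "regex": "xxx country",
        "ISO3": "xxx",
    },
]


def write_additional_data(path, rows, fields=FIELDS):
    """Write `rows` as a tab-separated, UTF-8 encoded file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fields, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "additional_countries.txt")
write_additional_data(path, rows)
```

The resulting path could then be handed to CountryConverter through the additional_data parameter, as in the commented cell above.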

Additionally passed data always overwrites the existing data. This can be used to adjust coco for datasets with wrong country names or codes. For example, assuming a dataset erroneously switched the ISO2 codes for India (IN) and Indonesia (ID) (that is, it uses 'ID' for India and 'IN' for Indonesia), one can accommodate that with:

In [17]:

switched_converter = coco.CountryConverter(
    additional_data=pd.DataFrame.from_dict(
        {
            "name_short": ["India", "Indonesia"],
            "name_official": ["India", "Indonesia"],
            "regex": ["india", "indonesia"],
            "ISO2": ["ID", "IN"],
            "ISO3": ["IDN", "IND"],
        }
    )
)

Duplicated values in column name_short of merged data - keep last one
Duplicated values in column regex of merged data - keep last one

In [18]:

converter.convert("IN", src="ISO2", to="name_short")

Out[18]:

'India'

In [19]:

switched_converter.convert("ID", src="ISO2", to="name_short")

Out[19]:

'India'
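The "keep last one" behaviour reported in the messages above can be pictured with a small stand-alone sketch. It merges two lists of rows and lets later rows replace earlier ones with the same name_short. The function merge_keep_last and the toy rows are assumptions for illustration. The package itself works on pandas dataframes and also checks the regex column.

```python
def merge_keep_last(base_rows, extra_rows, key="name_short"):
    """Merge two row lists; a later row replaces an earlier one with the same key."""
    merged = {}
    for row in list(base_rows) + list(extra_rows):
        merged[row[key]] = row  # a duplicate key overwrites the earlier row
    return list(merged.values())


base = [
    {"name_short": "India", "ISO2": "IN"},
    {"name_short": "Indonesia", "ISO2": "ID"},
    {"name_short": "Austria", "ISO2": "AT"},
]
switched = [
    {"name_short": "India", "ISO2": "ID"},
    {"name_short": "Indonesia", "ISO2": "IN"},
]

merged = merge_keep_last(base, switched)
by_iso2 = {row["ISO2"]: row["name_short"] for row in merged}
print(by_iso2["ID"])  # India
```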

Regular expression matching¶

The input parameter "src" can be set to "regex" to use regular expression matching for a given country list. For example:

In [20]:

some_names = [
    "United Rep. of Tanzania",
    "Cape Verde",
    "Burma",
    "Iran (Islamic Republic of)",
    "Korea, Republic of",
    "Dem. People's Rep. of Korea",
]

In [21]:

coco.convert(names=some_names, src="regex", to="name_short")

Out[21]:

['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']
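The idea behind the regex matching can be sketched with the standard re module. The patterns below are simplified, hypothetical stand-ins written for this example and are not the expressions shipped with coco. The function to_short_name is likewise an assumption.

```python
import re

# Hypothetical, simplified patterns: one regular expression per short name.
PATTERNS = {
    "Tanzania": r"tanzania",
    "Cabo Verde": r"cabo verde|cape verde",
    "Myanmar": r"myanmar|burma",
    "Iran": r"\biran\b",
}


def to_short_name(name, patterns=PATTERNS, not_found="not found"):
    """Return the short name whose pattern matches `name` (case-insensitive)."""
    hits = [
        short
        for short, pattern in patterns.items()
        if re.search(pattern, name, flags=re.IGNORECASE)
    ]
    if not hits:
        return not_found
    # A single hit is returned as a string, several hits as a list.
    return hits[0] if len(hits) == 1 else hits


names = ["United Rep. of Tanzania", "Cape Verde", "Burma", "Iran (Islamic Republic of)"]
print([to_short_name(n) for n in names])
# ['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran']
```

Because each pattern only has to appear somewhere in the input, spelling variants and surrounding words do not prevent a match, which is what makes this approach robust for names taken from different databases.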

The regular expressions can also be used to match any list of countries to any other. For example:

In [22]:

match_these = ["norway", "united_states", "china", "taiwan"]
master_list = [
    "USA",
    "The Swedish Kingdom",
    "Norway is a Kingdom too",
    "Peoples Republic of China",
    "Republic of China",
]

coco.match(match_these, master_list)

Out[22]:

{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China'}

If the regular expression matches several times, all results are given as a list and a warning is generated:

In [23]:

match_these = ["norway", "united_states", "china", "taiwan"]
master_list = [
    "USA",
    "The Swedish Kingdom",
    "Norway is a Kingdom too",
    "Peoples Republic of China",
    "Taiwan, province of china",
    "Republic of China",
]

coco.match(match_these, master_list)

Multiple matches for name taiwan in list_b

Out[23]:

{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': ['Taiwan, province of china', 'Republic of China']}

The parameter "enforce_sublist" can be set to True to ensure a consistent output in which every value is a list:

In [24]:

coco.match(match_these, master_list, enforce_sublist=True)

Multiple matches for name taiwan in list_b

Out[24]:

{'norway': ['Norway is a Kingdom too'],
 'united_states': ['USA'],
 'china': ['Peoples Republic of China'],
 'taiwan': ['Taiwan, province of china', 'Republic of China']}
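How the output shape depends on the number of matches and on enforce_sublist can be illustrated with the following sketch. It is not the coco implementation: the two patterns are hypothetical, and match_lists only imitates the parameter names used above.

```python
import re

# Hypothetical patterns; the real package ships one regex per country.
PATTERNS = {
    "norway": r"norway",
    "usa": r"united.?states|\busa\b",
}


def _matches(pattern, text):
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def match_lists(list_a, list_b, patterns=PATTERNS, not_found="not_found",
                enforce_sublist=False):
    """Pair each name in list_a with the entries of list_b for the same country."""
    result = {}
    for name in list_a:
        # Patterns that recognise the name from list_a ...
        active = [p for p in patterns.values() if _matches(p, name)]
        # ... are then applied to every candidate in list_b.
        hits = [cand for cand in list_b if any(_matches(p, cand) for p in active)]
        if not hits:
            hits = [name if not_found is None else not_found]
        # Several hits always yield a list; a single hit only if enforced.
        result[name] = hits if enforce_sublist or len(hits) > 1 else hits[0]
    return result


list_a = ["norway", "united_states"]
list_b = ["USA", "Norway is a Kingdom too", "Kingdom of Norway"]
print(match_lists(list_a, list_b))
# {'norway': ['Norway is a Kingdom too', 'Kingdom of Norway'], 'united_states': 'USA'}
print(match_lists(list_a, list_b, enforce_sublist=True)["united_states"])
# ['USA']
```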

You get a warning if one of the names couldn't be found:

In [25]:

match_these = ["norway", "united_states", "china", "taiwan", "some other country"]
master_list = [
    "USA",
    "The Swedish Kingdom",
    "Norway is a Kingdom too",
    "Peoples Republic of China",
    "Republic of China",
]
coco.match(match_these, master_list)

Could not identify some other country in list_a

Out[25]:

{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'not_found'}

The value returned for countries that are not found can also be specified:

In [26]:

coco.match(match_these, master_list, not_found="its not there")

Could not identify some other country in list_a

Out[26]:

{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'its not there'}

Setting it to None passes the name that was not found through unchanged:

In [27]:

coco.match(match_these, master_list, not_found=None)

Could not identify some other country in list_a

Out[27]:

{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'some other country'}

Internals¶

Within a CountryConverter instance, the raw data for the conversion is stored in a pandas dataframe. This dataframe can be accessed directly with:

In [28]:

converter.data.head()

Out[28]:

	APEC	BASIC	BRIC	CIS	Cecilia2050	DACcode	EEA	EU	EU12	EU15	...	UNcode	UNmember	UNregion	WIOD	ccTLD	continent	name_official	name_short	obsolete	regex
0	NaN	NaN	NaN	NaN	RoW	625.0	NaN	NaN	NaN	NaN	...	4.0	1946.0	Southern Asia	RoW	af	Asia	Islamic Republic of Afghanistan	Afghanistan	NaN	afghan
1	NaN	NaN	NaN	NaN	RoW	NaN	NaN	NaN	NaN	NaN	...	248.0	NaN	Northern Europe	RoW	ax	Europe	Åland Islands	Aland Islands	NaN	\b(a|å)land
2	NaN	NaN	NaN	NaN	RoW	71.0	NaN	NaN	NaN	NaN	...	8.0	1955.0	Southern Europe	RoW	al	Europe	Republic of Albania	Albania	NaN	albania
3	NaN	NaN	NaN	NaN	RoW	130.0	NaN	NaN	NaN	NaN	...	12.0	1962.0	Northern Africa	RoW	dz	Africa	People's Democratic Republic of Algeria	Algeria	NaN	algeria
4	NaN	NaN	NaN	NaN	RoW	880.0	NaN	NaN	NaN	NaN	...	16.0	NaN	Polynesia	RoW	as	Oceania	American Samoa	American Samoa	NaN	^(?=.*americ).*samoa

5 rows × 47 columns

This dataframe can be extended in both directions. The only requirement is to provide unique values for name_short, name_official and regex.
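Before extending the data, the uniqueness requirement can be checked with a few lines of plain Python. The sketch below works on a list of dictionaries rather than on the dataframe itself. The function find_duplicates and the sample rows are assumptions for illustration.

```python
from collections import Counter

# Columns that must hold unique values according to the text above.
UNIQUE_COLUMNS = ("name_short", "name_official", "regex")


def find_duplicates(rows, columns=UNIQUE_COLUMNS):
    """Return {column: [values occurring more than once]} for the given rows."""
    duplicates = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[column] = repeated
    return duplicates


rows = [
    {"name_short": "xxx country", "name_official": "The XXX country", "regex": "xxx country"},
    {"name_short": "abc country", "name_official": "The ABC country", "regex": "xxx country"},
]
print(find_duplicates(rows))  # {'regex': ['xxx country']}
```

An empty dictionary means that all three columns are unique.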

Internally, the data is saved in country_data.txt as tab-separated values (utf-8 encoded).

Of course, all pandas indexing and matching methods can be used. For example, to get the countries in a given list that joined the OECD in 1995 or later:

In [29]:

some_countries = [
    "Australia",
    "Belgium",
    "Brazil",
    "Bulgaria",
    "Cyprus",
    "Czech Republic",
    "Denmark",
    "Estonia",
    "Finland",
    "France",
    "Germany",
    "Greece",
    "Hungary",
    "India",
    "Indonesia",
    "Ireland",
    "Italy",
    "Japan",
    "Latvia",
    "Lithuania",
    "Luxembourg",
    "Malta",
    "Romania",
    "Russia",
    "Turkey",
    "United Kingdom",
    "United States",
]
converter.data[
    (converter.data.OECD >= 1995) & converter.data.name_short.isin(some_countries)
].name_short

Out[29]:

59     Czech Republic
70            Estonia
101           Hungary
122            Latvia
128         Lithuania
Name: name_short, dtype: object

Further information can be found here: http://pandas.pydata.org/pandas-docs/stable/

Testing¶

All regular expressions of the country converter are tested for a unique match to name_short and name_official. Test sets for alternative names found in various databases are also available.

The test sets are stored in the tests/ subdirectory. The tests require pytest. I recommend rerunning the tests whenever a regular expression is changed.

To specify a new test set, add a tab-separated file with the headers "name_short" and "name_test". Each row provides the name (corresponding to the short name in the main classification file) and the alternative name which should be tested (one pair per row in the file). If the file name starts with "test_regex_", it will be automatically recognised by the test functions.
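What such a test set expresses can be sketched as follows. The helper failing_pairs is hypothetical and not part of the test suite. It takes the file content and any callable that converts a name to its short name, for instance a small wrapper around coco.convert with src="regex" and to="name_short".

```python
import csv
import io


def failing_pairs(tsv_text, convert):
    """Return (name_test, expected, got) for every row that is not matched.

    `tsv_text` is the content of a tab-separated test set with the headers
    "name_short" and "name_test"; `convert` maps a name to a short name.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    failures = []
    for row in reader:
        got = convert(row["name_test"])
        if got != row["name_short"]:
            failures.append((row["name_test"], row["name_short"], got))
    return failures


# Illustrative test set content: one pair per row.
example = "name_short\tname_test\nMyanmar\tBurma\nTanzania\tUnited Rep. of Tanzania\n"
```

An empty result list means that every alternative name in the test set resolves to the expected short name.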

Please see the file CONTRIBUTING.rst for further information.

Konstantin Stadler