Country Converter

The country converter (coco) is a Python package to convert country names into different classifications and between different naming versions. Internally it uses regular expressions to match country names.

Installation

The package is available on PyPI. To install it, use

pip install country_converter --upgrade

from the command line or use your preferred Python package installer. The source code is available on GitHub: https://github.com/konstantinstadler/country_converter

Conversion

The country converter provides one main class which is used for the conversion:

In [1]:
import country_converter as coco
In [2]:
converter = coco.CountryConverter()

Given a list of countries in a certain classification:

In [3]:
iso3_codes = ['USA', 'VUT', 'TKL', 'AUT', 'AFG', 'ALB']

This list can be converted to any of the provided classifications with:

In [4]:
converter.convert(names = iso3_codes, src = 'ISO3', to = 'name_official')
Out[4]:
['United States of America',
 'Republic of Vanuatu',
 'Tokelau',
 'Republic of Austria',
 'Islamic Republic of Afghanistan',
 'Republic of Albania']

or

In [5]:
converter.convert(names = iso3_codes, src = 'ISO3', to = 'continent')
Out[5]:
['America', 'Oceania', 'Oceania', 'Europe', 'Asia', 'Europe']

The parameter "src" specifies the input format, "to" the output format. Possible values for both parameters can be found with:

In [6]:
converter.valid_class
Out[6]:
['APEC',
 'BASIC',
 'BRIC',
 'CIS',
 'Cecilia2050',
 'EU',
 'EURO',
 'EXIO1',
 'EXIO2',
 'EXIO3',
 'Eora',
 'G20',
 'G7',
 'ISO2',
 'ISO3',
 'ISOnumeric',
 'MESSAGE',
 'OECD',
 'UNcode',
 'UNmember',
 'UNregion',
 'WIOD',
 'continent',
 'name_official',
 'name_short',
 'obsolete',
 'regex']

Internally, these names are the column headers of the underlying pandas dataframe (see below).

The convert function can also be accessed without instantiating the CountryConverter. This can be useful for one-time usage. For multiple conversions, instantiating the CountryConverter once avoids reading the file with the matching data for each conversion.

In [7]:
coco.convert(names = iso3_codes, src = 'ISO3', to = 'ISO2')
Out[7]:
['US', 'VU', 'TK', 'AT', 'AF', 'AL']

Some of the classifications can be accessed through shortcuts. For example:

In [8]:
converter.EU27
Out[8]:
name_short
14 Austria
21 Belgium
35 Bulgaria
58 Cyprus
59 Czech Republic
60 Denmark
70 Estonia
76 Finland
77 France
84 Germany
87 Greece
101 Hungary
107 Ireland
110 Italy
122 Latvia
128 Lithuania
129 Luxembourg
137 Malta
156 Netherlands
177 Poland
178 Portugal
182 Romania
196 Slovakia
197 Slovenia
204 Spain
215 Sweden
235 United Kingdom
In [9]:
converter.OECDas('ISO2')
Out[9]:
ISO2
13 AU
14 AT
21 BE
41 CA
45 CL
59 CZ
60 DK
70 EE
76 FI
77 FR
84 DE
87 GR
101 HU
102 IS
107 IE
109 IL
110 IT
112 JP
122 LV
129 LU
143 MX
156 NL
158 NZ
166 NO
177 PL
178 PT
196 SK
197 SI
202 KR
204 ES
215 SE
216 CH
228 TR
235 GB
236 US

Handling missing data

The return value for entries that are not found is by default set to 'not found':

In [10]:
iso3_codes_missing = ['ABC', 'AUT', 'XXX']
converter.convert(iso3_codes_missing, src='ISO3')
WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3
Out[10]:
['not found', 'AUT', 'not found']

but can also be set to something else:

In [11]:
converter.convert(iso3_codes_missing, src='ISO3', not_found='missing')
WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3
Out[11]:
['missing', 'AUT', 'missing']

Alternatively, the entries that are not found can be passed through unchanged by passing None to not_found:

In [12]:
converter.convert(iso3_codes_missing, src='ISO3', not_found=None)
WARNING:root:ABC not found in ISO3
WARNING:root:XXX not found in ISO3
Out[12]:
['ABC', 'AUT', 'XXX']
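
The three not_found behaviours shown above can be summarised in a small stand-alone sketch. It does not use coco: ISO3_TO_SHORT and lookup() are hypothetical names introduced only for illustration, and the logic is a simplified assumption about the behaviour, not the library's implementation.

```python
# Simplified illustration of the three not_found behaviours.
# NOT coco's implementation; ISO3_TO_SHORT and lookup() are hypothetical.
ISO3_TO_SHORT = {"AUT": "Austria", "USA": "United States"}


def lookup(codes, not_found="not found"):
    """Map codes to names; unknown codes become not_found, or pass through if None."""
    result = []
    for code in codes:
        if code in ISO3_TO_SHORT:
            result.append(ISO3_TO_SHORT[code])
        elif not_found is None:
            result.append(code)  # pass the input through unchanged
        else:
            result.append(not_found)  # replace with the chosen marker
    return result


codes = ["ABC", "AUT", "XXX"]
default_result = lookup(codes)
custom_result = lookup(codes, not_found="missing")
passthrough_result = lookup(codes, not_found=None)
print(default_result)
print(custom_result)
print(passthrough_result)
```

Passing None is convenient when the converted list should stay aligned with the input and unknown entries are handled in a later step.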

To extend the underlying dataset, an additional dataframe (or file) can be passed.

In [13]:
import pandas as pd
add_data = pd.DataFrame.from_dict({
       'name_short' : ['xxx country', 'abc country'],
       'name_official' : ['The XXX country', 'The ABC country'],
       'regex' : ['xxx country', 'abc country'], 
       'ISO3': ['xxx', 'abc']}
)
In [14]:
add_data
Out[14]:
name_short name_official regex ISO3
0 xxx country The XXX country xxx country xxx
1 abc country The ABC country abc country abc
In [15]:
extended_converter = coco.CountryConverter(additional_data=add_data)
extended_converter.convert(iso3_codes_missing, src='ISO3', to='name_short')
Out[15]:
['abc country', 'Austria', 'xxx country']

As an alternative to an ad hoc dataframe, additional data files can be passed. These must have the same format as the basic data set. An example can be found here: https://github.com/konstantinstadler/country_converter/tree/master/tests/custom_data_example.txt

The custom data example contains the ISO3 code mapping for Romania before 2002 and switches the regex matching for Congo between DR Congo and Congo Republic.

To use it, pass the path to the additional country file:

In [16]:
# extended_converter = coco.CountryConverter(additional_data=path/to/datafile)

The passed data (file or dataframe) must contain at least the headers 'name_official', 'name_short' and 'regex'. If the additional data is to be used for a conversion to any other field, those fields must also be included.
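
Before passing additional data, the header requirement can be checked up front. The following is a minimal sketch using only the standard library; missing_headers() and REQUIRED_HEADERS are hypothetical helpers for illustration and are not part of coco.

```python
# Hypothetical pre-check for additional data; not part of coco.
REQUIRED_HEADERS = {"name_official", "name_short", "regex"}


def missing_headers(columns, extra_required=()):
    """Return the sorted list of required headers that are absent from columns."""
    required = REQUIRED_HEADERS | set(extra_required)
    return sorted(required - set(columns))


# Columns of the add_data example above, which is also used for ISO3 lookups.
ok = missing_headers(["name_short", "name_official", "regex", "ISO3"],
                     extra_required=["ISO3"])
# A table that lacks two of the mandatory headers.
bad = missing_headers(["name_short", "ISO2"], extra_required=["ISO2"])
print(ok)
print(bad)
```

With a pandas dataframe, the column names can be passed as list(add_data.columns).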

Additionally passed data always overwrites the existing data. This can be used to adjust coco for datasets with wrong country names. For example, assume a dataset erroneously switched the ISO2 codes for India (IN) and Indonesia (ID), using 'ID' for India and 'IN' for Indonesia. One can accommodate that with:

In [17]:
switched_converter = coco.CountryConverter(additional_data=pd.DataFrame.from_dict({
       'name_short' : ['India', 'Indonesia'],
       'name_official' : ['India', 'Indonesia'],
       'regex' : ['india', 'indonesia'], 
       'ISO2': ['ID', 'IN']}))
WARNING:root:Duplicated values in column name_short of merged data - keep last one
WARNING:root:Duplicated values in column regex of merged data - keep last one
In [18]:
converter.convert('IN', src='ISO2', to='name_short')
Out[18]:
'India'
In [19]:
switched_converter.convert('ID', src='ISO2', to='name_short')
Out[19]:
'India'
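
The "keep last one" warnings above suggest that rows from the additional data replace earlier rows with the same name. The sketch below illustrates that idea with plain dictionaries. It is an assumption-based simplification, not coco's merge code; merge_keep_last() and the reduced rows are hypothetical.

```python
# Hypothetical, simplified rows; the real data set has many more columns.
base_rows = [
    {"name_short": "India", "ISO2": "IN"},
    {"name_short": "Indonesia", "ISO2": "ID"},
    {"name_short": "Austria", "ISO2": "AT"},
]
additional_rows = [
    {"name_short": "India", "ISO2": "ID"},
    {"name_short": "Indonesia", "ISO2": "IN"},
]


def merge_keep_last(base, additional, key="name_short"):
    """Concatenate both row lists and keep the last row seen for each key."""
    merged = {}
    for row in base + additional:
        merged.pop(row[key], None)  # drop the earlier row so the later one wins
        merged[row[key]] = row
    return list(merged.values())


merged_rows = merge_keep_last(base_rows, additional_rows)
iso2_to_name = {row["ISO2"]: row["name_short"] for row in merged_rows}
print(iso2_to_name)
```

After the merge, 'ID' resolves to India, which mirrors the result of switched_converter above, while untouched rows such as Austria are kept.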

Regular expression matching

The input parameter "src" can be set to "regex" to use regular expression matching for a given country list. For example:

In [20]:
some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma', 'Iran (Islamic Republic of)', 'Korea, Republic of', "Dem. People's Rep. of Korea"]
In [21]:
coco.convert(names = some_names, src = "regex", to = "name_short")
Out[21]:
['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']
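
To give an idea of how such matching can work, here is a minimal sketch with Python's re module. The patterns are copied from the first rows of the data table shown in the Internals section. The case-insensitive search and the match_name() helper are assumptions made for illustration, not coco's actual code.

```python
import re

# (regex, name_short) pairs taken from the first rows of the data table;
# the real table has one row per country.
PATTERNS = [
    (r"afghan", "Afghanistan"),
    (r"\b(a|å)land", "Aland Islands"),
    (r"albania", "Albania"),
    (r"algeria", "Algeria"),
    (r"^(?=.*americ).*samoa", "American Samoa"),
]


def match_name(name, not_found="not found"):
    """Return the short name whose regex matches name (case-insensitive search)."""
    hits = [short for pattern, short in PATTERNS
            if re.search(pattern, name, flags=re.IGNORECASE)]
    if len(hits) == 1:
        return hits[0]
    # No hit: return the marker; several hits: return all of them as a list.
    return not_found if not hits else hits


print(match_name("Islamic Republic of Afghanistan"))
print(match_name("Samoa, American"))
print(match_name("New Zealand"))
```

The lookahead in the American Samoa pattern makes the match independent of word order, and the word boundary in the Aland pattern prevents a false hit on the "aland" inside "New Zealand".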

The regular expressions can also be used to match any list of countries to any other. For example:

In [22]:
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China' ]

coco.match(match_these, master_list)
Out[22]:
{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China'}

If the regular expression matches several entries, all results are given as a list and a warning is generated:

In [23]:
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Taiwan, province of china', 'Republic of China' ]

coco.match(match_these, master_list)
WARNING:root:Multiple matches for name taiwan in list_b
Out[23]:
{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': ['Taiwan, province of china', 'Republic of China']}

The parameter "enforce_sublist" can be set to ensure consistent output:

In [24]:
coco.match(match_these, master_list, enforce_sublist = True)
WARNING:root:Multiple matches for name taiwan in list_b
Out[24]:
{'norway': ['Norway is a Kingdom too'],
 'united_states': ['USA'],
 'china': ['Peoples Republic of China'],
 'taiwan': ['Taiwan, province of china', 'Republic of China']}
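
If a result dictionary was already produced without enforce_sublist, the same uniform shape can be obtained afterwards. The helper below is a hypothetical post-processing sketch, not part of coco. It wraps every single string value in a list.

```python
def as_sublists(matches):
    """Wrap every non-list value in a one-element list."""
    return {key: value if isinstance(value, list) else [value]
            for key, value in matches.items()}


# Mixed output as returned for a list with one ambiguous name.
mixed = {
    "norway": "Norway is a Kingdom too",
    "united_states": "USA",
    "china": "Peoples Republic of China",
    "taiwan": ["Taiwan, province of china", "Republic of China"],
}
uniform = as_sublists(mixed)
print(uniform["norway"])
print(uniform["taiwan"])
```

A uniform shape means downstream code can always iterate over the values without checking their type first.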

A warning also occurs if one of the names could not be found:

In [25]:
match_these = ['norway', 'united_states', 'china', 'taiwan', 'some other country']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China',  'Republic of China' ]
coco.match(match_these, master_list)
WARNING:root:Could not identify some other country in list_a
Out[25]:
{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'not_found'}

The value returned for countries that are not found can be specified:

In [26]:
coco.match(match_these, master_list, not_found = 'its not there')
WARNING:root:Could not identify some other country in list_a
Out[26]:
{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'its not there'}

Setting not_found to None passes the unmatched name through unchanged:

In [27]:
coco.match(match_these, master_list, not_found = None)
WARNING:root:Could not identify some other country in list_a
Out[27]:
{'norway': 'Norway is a Kingdom too',
 'united_states': 'USA',
 'china': 'Peoples Republic of China',
 'taiwan': 'Republic of China',
 'some other country': 'some other country'}

Internals

Within the new instance, the raw data for the conversion is saved in a pandas dataframe. This dataframe can be accessed directly with:

In [28]:
converter.data.head()
Out[28]:
APEC BASIC BRIC CIS Cecilia2050 EU EURO EXIO1 EXIO2 EXIO3 ... OECD UNcode UNmember UNregion WIOD continent name_official name_short obsolete regex
0 NaN NaN NaN NaN RoW NaN NaN WW WA WA ... NaN 4.0 1946.0 Southern Asia RoW Asia Islamic Republic of Afghanistan Afghanistan NaN afghan
1 NaN NaN NaN NaN RoW NaN NaN WW WE WE ... NaN 248.0 NaN Northern Europe RoW Europe Åland Islands Aland Islands NaN \b(a|å)land
2 NaN NaN NaN NaN RoW NaN NaN WW WE WE ... NaN 8.0 1955.0 Southern Europe RoW Europe Republic of Albania Albania NaN albania
3 NaN NaN NaN NaN RoW NaN NaN WW WF WF ... NaN 12.0 1962.0 Northern Africa RoW Africa People's Democratic Republic of Algeria Algeria NaN algeria
4 NaN NaN NaN NaN RoW NaN NaN WW WA WA ... NaN 16.0 NaN Polynesia RoW Oceania American Samoa American Samoa NaN ^(?=.*americ).*samoa

5 rows × 27 columns

This dataframe can be extended in both directions. The only requirement is to provide unique values for name_short, name_official and regex.

Internally, the data is saved in country_data.txt as tab-separated values (utf-8 encoded).
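
A tab-separated, utf-8 encoded file of this kind can be read with the standard library alone. The sketch below parses a small in-memory excerpt. The four columns and two rows are a hypothetical cut-down of the real file, which has 27 columns.

```python
import csv
import io

# Hypothetical excerpt in the same tab-separated layout as the data file.
sample = (
    "name_short\tname_official\tregex\tcontinent\n"
    "Afghanistan\tIslamic Republic of Afghanistan\tafghan\tAsia\n"
    "Albania\tRepublic of Albania\talbania\tEurope\n"
)

# For a file on disk, use open(path, encoding="utf-8", newline="")
# in place of io.StringIO(sample).
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
short_to_regex = {row["name_short"]: row["regex"] for row in rows}
print(short_to_regex)
```

Each row becomes a dictionary keyed by the header line, so columns can be addressed by name just like in the dataframe.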

Of course, all pandas indexing and matching methods can be used. For example, to get new OECD members since 1995 present in a list:

In [29]:
some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Romania', 'Russia',  'Turkey', 'United Kingdom', 'United States']
converter.data[(converter.data.OECD >= 1995) & converter.data.name_short.isin(some_countries)].name_short
Out[29]:
59     Czech Republic
70            Estonia
101           Hungary
122            Latvia
Name: name_short, dtype: object

Further information can be found here: http://pandas.pydata.org/pandas-docs/stable/

Testing

All regular expressions of the country converter are tested for a unique match to name_short and name_official. Test sets for alternative names found in various databases are also available.

The test sets are stored in the /tests subfolder. The tests require py.test. I recommend rerunning the tests whenever a regular expression is changed.

To specify a new test set, add a tab-separated file with the headers "name_short" and "name_test". Each row provides the name (corresponding to the short name in the main classification file) and the alternative name to be tested, one pair per row. If the file name starts with "test_regex_", it is automatically recognised by the test functions.
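
A test set in this layout can be written with the csv module. In the sketch below, the name pairs are taken from the regular expression example earlier in this document. The file name test_regex_example.txt, the .txt extension and the temporary target folder are assumptions for illustration only.

```python
import csv
import tempfile
from pathlib import Path

# (short name, alternative name to test) pairs from the regex example above.
pairs = [
    ("Tanzania", "United Rep. of Tanzania"),
    ("Myanmar", "Burma"),
    ("South Korea", "Korea, Republic of"),
]

# Temporary folder as a stand-in for the folder that holds the test sets.
target_dir = Path(tempfile.mkdtemp())
test_file = target_dir / "test_regex_example.txt"  # hypothetical file name

with test_file.open("w", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name_short", "name_test"])  # required headers
    writer.writerows(pairs)  # one pair per row

written = test_file.read_text(encoding="utf-8").splitlines()
print(written)
```

The file can then be copied into the folder with the other test sets, where the prefix of its name lets the test functions pick it up.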

Please see the file CONTRIBUTING.rst for further information.

Konstantin Stadler
