I believe that testing is one of the most crucial parts of dealing with an epidemic virus. Testing helps us identify and isolate positive cases. The more tests we perform, the faster we can isolate positive cases, preventing them from coming into contact with others and slowing the rate of transmission.
This analysis will be performed by "merging" the information of two data sources: the testing figures from Our World in Data and the case counts from the Johns Hopkins dataset.
Date this notebook was written: 23/3/2020
Disclaimer: I do not in any way want to point a finger at the governments or people of any country. It is not possible to know the reasons and circumstances that led to a lack of testing. My only goal is to see whether the data suggest that testing plays a major role in this specific epidemic.
import requests
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from bs4 import BeautifulSoup
Link to the created dataset: https://github.com/MikeXydas/Weekend-EDAs/blob/master/datasets/covid_testing_importance.csv
The first thing we want to find is how many tests have been performed in each country. The data will be extracted from this page: https://ourworldindata.org/coronavirus-testing-source-data . You may observe that the date of the last report is not the same for each country; this will be taken into account.
response = requests.get('https://ourworldindata.org/coronavirus-testing-source-data')
soup = BeautifulSoup(response.content, 'html.parser')  # specify the parser explicitly
table_soup = soup.find("div", {"class": "tableContainer"}).findAll("tr")[1:] # skip the headers
# RegEx to extract the country name, number of tests, and date
reg = re.compile("<tr><td>(.+)<\/td><td>([\d,]+)<\/td><td>([\w\s]+)<\/td><td>.*<\/td><td>.*<\/td><\/tr>")
testing_data = []
for row in table_soup:
    reg_res = reg.match(str(row))
    if reg_res is not None:
        testing_data.append(reg_res.groups())
    else:
        # Print the rows that failed to be parsed
        print(row)
# Create the dataframe
testing_df = pd.DataFrame(testing_data, columns=["Country", "Tests", "Date"])
testing_df['Tests'] = testing_df['Tests'].str.replace(',','').astype(int) # Transform the x,xxx string to integers
testing_df['Date'] = pd.to_datetime(testing_df['Date']) # Cast Date from string to date type
# Set country name as index of dataframe
testing_df = testing_df.set_index('Country', drop=False)
# For Canada and Australia we have info about specific regions.
# Since in this notebook we examine country-level responses, we merge them:
# for Australia, dropping all the province info and keeping the aggregated "Australia"
testing_df = testing_df[~testing_df.Country.str.contains("Australia –")]
# Canada
testing_df = testing_df[~testing_df.Country.str.contains("Canada –")]
# The USA has two trackers; I will keep the most recent one
testing_df = testing_df.drop('United States – CDC samples tested')
testing_df = testing_df.drop("Hong Kong")
# We rename some countries so that the names match those in the Johns Hopkins dataset
testing_df = testing_df.rename({
    "China – Guangdong": "China",
    "United States": "US",
    "Czech Republic": "Czechia",
    "South Korea": "Korea, South",
    "Taiwan": "Taiwan*",
    "Faeroe Islands": "Faroe Islands"
})
# Drop Palestine since it is not included in the Johns Hopkins dataset
testing_df = testing_df.drop("Palestine")
# Drop Faroe Islands since they are merged with Denmark
testing_df = testing_df.drop("Faroe Islands")
testing_df = testing_df.drop('Country', axis=1)
display(testing_df)
<tr><td>Canada</td><td></td><td></td><td></td><td></td><td>An aggregate figure for Canada is not provided given that the extent to which double-counting between the provincial labs and the national lab (NML) is unclear. See province level data and that for NML above. No figures have yet been found for Nunavut, Manitoba, Yukon and Newfoundland and Labrador provinces (collectively ~5% of population).</td></tr> <tr><td>Kuwait</td><td></td><td>17 Mar 2020</td><td><a href="https://drive.google.com/file/d/1pVBq-c4HLeUis_BS58xT_jJTJZni2JfR/view?usp=sharing" rel="noreferrer noopener" target="_blank">Communication from International Press Office at the Ministry of Information in the State of Kuwait, 17 March 2020.</a></td><td>17 Mar 2020</td><td>In an earlier version of this dataset we reported an estimate of 120,000 tests based on an official letter sent to us by the Ministry of Information. After this, they sent a second email correcting their estimate to 27,000, and then a further correction to 20,000. Since no figure is substantiated in a public statement, we have decided to not publish the numbers. We will revise this once the numbers are made public.</td></tr>
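To make the regular expression above concrete, here is a minimal, self-contained sketch that applies the same pattern to one hypothetical table row (the country name, test count, and date are made-up values):

```python
import re

# Same pattern as above: country name, test count (with thousands commas), date
reg = re.compile(r"<tr><td>(.+)<\/td><td>([\d,]+)<\/td><td>([\w\s]+)<\/td><td>.*<\/td><td>.*<\/td><\/tr>")

# A made-up row in the same shape as the scraped ones
sample_row = "<tr><td>Exampleland</td><td>1,234</td><td>19 Mar 2020</td><td>source</td><td>notes</td></tr>"

match = reg.match(sample_row)
print(match.groups())  # ('Exampleland', '1,234', '19 Mar 2020')
```

Rows whose test count cell is empty (like the Canada and Kuwait rows printed above) fail the `([\d,]+)` group, which is exactly why they end up in the `else` branch.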
Country | Tests | Date
---|---|---
Armenia | 813 | 2020-03-18 |
Australia | 113615 | 2020-03-20 |
Austria | 15613 | 2020-03-20 |
Bahrain | 18645 | 2020-03-20 |
Belarus | 16000 | 2020-03-16 |
Belgium | 18360 | 2020-03-18 |
Brazil | 2927 | 2020-03-13 |
China | 320000 | 2020-02-24 |
Colombia | 4103 | 2020-03-20 |
Costa Rica | 1039 | 2020-03-19 |
Croatia | 1264 | 2020-03-19 |
Czechia | 11619 | 2020-03-20 |
Denmark | 10730 | 2020-03-20 |
Estonia | 2504 | 2020-03-20 |
Finland | 3000 | 2020-03-19 |
France | 36747 | 2020-03-15 |
Germany | 167000 | 2020-03-15 |
Hungary | 3007 | 2020-03-20 |
Iceland | 9189 | 2020-03-20 |
India | 14514 | 2020-03-20 |
Indonesia | 2028 | 2020-03-20 |
Iran | 80000 | 2020-03-14 |
Ireland | 6600 | 2020-03-17 |
Israel | 10864 | 2020-03-18 |
Italy | 206886 | 2020-03-20 |
Japan | 14901 | 2020-03-19 |
Kyrgyzstan | 1545 | 2020-03-13 |
Latvia | 3205 | 2020-03-20 |
Lithuania | 1154 | 2020-03-20 |
Malaysia | 13876 | 2020-03-20 |
Malta | 889 | 2020-03-13 |
Mexico | 278 | 2020-03-10 |
Netherlands | 6000 | 2020-03-07 |
New Zealand | 584 | 2020-03-17 |
Norway | 43735 | 2020-03-20 |
Pakistan | 1979 | 2020-03-19 |
Panama | 1455 | 2020-03-18 |
Philippines | 1269 | 2020-03-20 |
Poland | 13072 | 2020-03-20 |
Qatar | 8400 | 2020-03-17 |
Romania | 8284 | 2020-03-20 |
Russia | 143519 | 2020-03-19 |
Slovakia | 2707 | 2020-03-20 |
Slovenia | 9860 | 2020-03-19 |
South Africa | 6438 | 2020-03-20 |
Korea, South | 316664 | 2020-03-20 |
Spain | 30000 | 2020-03-18 |
Sweden | 14300 | 2020-03-17 |
Switzerland | 4000 | 2020-03-07 |
Taiwan* | 21376 | 2020-03-20 |
Thailand | 7084 | 2020-03-17 |
Turkey | 2900 | 2020-03-10 |
Ukraine | 316 | 2020-03-20 |
United Arab Emirates | 125000 | 2020-03-16 |
United Kingdom | 64621 | 2020-03-19 |
US | 103945 | 2020-03-19 |
Vietnam | 15637 | 2020-03-20 |
Next, we must find the number of confirmed cases, recovered cases, and deaths for each country above.
The dataset used will be the Johns Hopkins one.
def transform_date_to_str(date_str):
    """Transforms the date string to the m/d/yy format used by the Johns Hopkins time series."""
    if date_str[8] != '0':
        return date_str[6:7] + "/" + date_str[8:10] + "/20"
    else:
        return date_str[6:7] + "/" + date_str[9:10] + "/20"
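As a quick sanity check, here is a self-contained sketch of the same conversion applied to a couple of stringified pandas Timestamps (the dates are arbitrary examples). Note that the `[6:7]` month slice only handles single-digit months, which covers every date in this dataset (February and March):

```python
def transform_date_to_str(date_str):
    """Convert a 'YYYY-MM-DD ...' string to the Johns Hopkins style m/d/yy."""
    if date_str[8] != '0':
        return date_str[6:7] + "/" + date_str[8:10] + "/20"
    else:
        # Strip the leading zero of the day
        return date_str[6:7] + "/" + date_str[9:10] + "/20"

print(transform_date_to_str("2020-03-18 00:00:00"))  # 3/18/20
print(transform_date_to_str("2020-03-07 00:00:00"))  # 3/7/20
```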
def read_john_hopkins_dataset(url, column_name):
    john_hopkins_df = pd.read_csv(url, index_col=['Country/Region'])
    john_hopkins_df = john_hopkins_df.drop('Lat', axis=1)
    john_hopkins_df = john_hopkins_df.drop('Long', axis=1)
    # We must sum the cases of countries that are reported by region, e.g. US and China
    john_hopkins_df_grouped = john_hopkins_df.groupby(['Country/Region']).sum()
    testing_df[column_name] = np.nan
    for index, row in testing_df.iterrows():
        try:
            testing_df.at[index, column_name] = john_hopkins_df_grouped.loc[index][transform_date_to_str(str(row['Date']))]
        except KeyError:
            # Print the countries with missing keys and keep going
            print(index)
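The groupby step above collapses rows that are reported per region into a single row per country before the date lookup. A minimal illustration with a made-up regional split (the two Australia numbers are invented, but they sum to the 791 confirmed cases shown in the table below):

```python
import pandas as pd

# Hypothetical regional rows in the Johns Hopkins layout
regional = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Iceland'],
    '3/20/20': [500, 291, 409],
})

# Summing per country merges the regional rows into national totals
grouped = regional.groupby(['Country/Region']).sum()
print(grouped.loc['Australia', '3/20/20'])  # 791
```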
# Add confirmed cases column
read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv',
"Confirmed")
# Add recovered cases column
read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv',
"Recovered")
# Add deaths column
read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv',
"Deaths")
display(testing_df)
Country | Tests | Date | Confirmed | Recovered | Deaths
---|---|---|---|---|---
Armenia | 813 | 2020-03-18 | 84.0 | 1.0 | 0.0 |
Australia | 113615 | 2020-03-20 | 791.0 | 26.0 | 7.0 |
Austria | 15613 | 2020-03-20 | 2388.0 | 9.0 | 6.0 |
Bahrain | 18645 | 2020-03-20 | 285.0 | 100.0 | 1.0 |
Belarus | 16000 | 2020-03-16 | 36.0 | 3.0 | 0.0 |
Belgium | 18360 | 2020-03-18 | 1486.0 | 31.0 | 14.0 |
Brazil | 2927 | 2020-03-13 | 151.0 | 0.0 | 0.0 |
China | 320000 | 2020-02-24 | 77241.0 | 25015.0 | 2595.0 |
Colombia | 4103 | 2020-03-20 | 128.0 | 1.0 | 0.0 |
Costa Rica | 1039 | 2020-03-19 | 69.0 | 0.0 | 1.0 |
Croatia | 1264 | 2020-03-19 | 105.0 | 5.0 | 1.0 |
Czechia | 11619 | 2020-03-20 | 833.0 | 4.0 | 0.0 |
Denmark | 10730 | 2020-03-20 | 1337.0 | 1.0 | 9.0 |
Estonia | 2504 | 2020-03-20 | 283.0 | 1.0 | 0.0 |
Finland | 3000 | 2020-03-19 | 400.0 | 10.0 | 0.0 |
France | 36747 | 2020-03-15 | 4523.0 | 12.0 | 91.0 |
Germany | 167000 | 2020-03-15 | 5795.0 | 46.0 | 11.0 |
Hungary | 3007 | 2020-03-20 | 85.0 | 2.0 | 3.0 |
Iceland | 9189 | 2020-03-20 | 409.0 | 5.0 | 0.0 |
India | 14514 | 2020-03-20 | 244.0 | 20.0 | 5.0 |
Indonesia | 2028 | 2020-03-20 | 369.0 | 15.0 | 32.0 |
Iran | 80000 | 2020-03-14 | 12729.0 | 2959.0 | 611.0 |
Ireland | 6600 | 2020-03-17 | 223.0 | 5.0 | 2.0 |
Israel | 10864 | 2020-03-18 | 433.0 | 11.0 | 0.0 |
Italy | 206886 | 2020-03-20 | 47021.0 | 4440.0 | 4032.0 |
Japan | 14901 | 2020-03-19 | 924.0 | 150.0 | 29.0 |
Kyrgyzstan | 1545 | 2020-03-13 | 0.0 | 0.0 | 0.0 |
Latvia | 3205 | 2020-03-20 | 111.0 | 1.0 | 0.0 |
Lithuania | 1154 | 2020-03-20 | 49.0 | 1.0 | 0.0 |
Malaysia | 13876 | 2020-03-20 | 1030.0 | 87.0 | 3.0 |
Malta | 889 | 2020-03-13 | 12.0 | 1.0 | 0.0 |
Mexico | 278 | 2020-03-10 | 7.0 | 4.0 | 0.0 |
Netherlands | 6000 | 2020-03-07 | 188.0 | 0.0 | 1.0 |
New Zealand | 584 | 2020-03-17 | 12.0 | 0.0 | 0.0 |
Norway | 43735 | 2020-03-20 | 1914.0 | 1.0 | 7.0 |
Pakistan | 1979 | 2020-03-19 | 454.0 | 13.0 | 2.0 |
Panama | 1455 | 2020-03-18 | 86.0 | 0.0 | 1.0 |
Philippines | 1269 | 2020-03-20 | 230.0 | 8.0 | 18.0 |
Poland | 13072 | 2020-03-20 | 425.0 | 1.0 | 5.0 |
Qatar | 8400 | 2020-03-17 | 439.0 | 4.0 | 0.0 |
Romania | 8284 | 2020-03-20 | 308.0 | 25.0 | 0.0 |
Russia | 143519 | 2020-03-19 | 199.0 | 9.0 | 1.0 |
Slovakia | 2707 | 2020-03-20 | 137.0 | 0.0 | 1.0 |
Slovenia | 9860 | 2020-03-19 | 286.0 | 0.0 | 1.0 |
South Africa | 6438 | 2020-03-20 | 202.0 | 0.0 | 0.0 |
Korea, South | 316664 | 2020-03-20 | 8652.0 | 1540.0 | 94.0 |
Spain | 30000 | 2020-03-18 | 13910.0 | 1081.0 | 623.0 |
Sweden | 14300 | 2020-03-17 | 1190.0 | 1.0 | 7.0 |
Switzerland | 4000 | 2020-03-07 | 268.0 | 3.0 | 1.0 |
Taiwan* | 21376 | 2020-03-20 | 135.0 | 26.0 | 2.0 |
Thailand | 7084 | 2020-03-17 | 177.0 | 41.0 | 1.0 |
Turkey | 2900 | 2020-03-10 | 0.0 | 0.0 | 0.0 |
Ukraine | 316 | 2020-03-20 | 29.0 | 0.0 | 3.0 |
United Arab Emirates | 125000 | 2020-03-16 | 98.0 | 23.0 | 0.0 |
United Kingdom | 64621 | 2020-03-19 | 2716.0 | 67.0 | 138.0 |
US | 103945 | 2020-03-19 | 13677.0 | 0.0 | 200.0 |
Vietnam | 15637 | 2020-03-20 | 91.0 | 16.0 | 0.0 |
Having created the dataset, we save it so that we can later import it directly instead of rebuilding it from scratch, which would mean handling the special cases of missing values and any future incompatibilities between the two combined datasets all over again.
Important fields could still be added to the dataset to support more useful conclusions.
testing_df.to_csv(path_or_buf='datasets/covid_testing_importance.csv')
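When re-importing the saved CSV later, the country index and the date types need to be restored explicitly. A minimal round-trip sketch (using an in-memory buffer instead of the path above, with two rows copied from the table):

```python
import io
import pandas as pd

# A small frame saved the same way the notebook saves testing_df
df = pd.DataFrame(
    {'Tests': [813, 113615], 'Date': pd.to_datetime(['2020-03-18', '2020-03-20'])},
    index=pd.Index(['Armenia', 'Australia'], name='Country'),
)
buffer = io.StringIO(df.to_csv())

# Restore the country index and parse the Date column back to datetimes
reloaded = pd.read_csv(buffer, index_col='Country', parse_dates=['Date'])
print(reloaded.loc['Armenia', 'Tests'])  # 813
```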
With our dataset created, we will try to infer insights by seeing how the number of tests correlates with the number of cases and with each country's ability to successfully deal with the epidemic.
First, I will calculate the correlation between the number of tests and the number of confirmed cases.
# Plot the cases-axis, tests axis
fig, ax = plt.subplots()
fig.set_size_inches(5, 5)
ax.set_yscale('log')
ax.set_xscale('log')
ax.scatter(testing_df.Tests, testing_df.Confirmed)
# We add 0.1 to avoid zero values on a log scale; this offset is not used when calculating the correlation value
ax.set_xlim([min(testing_df.Tests) + 0.1, max(testing_df.Tests) + 0.1])
ax.set_ylim([min(testing_df.Confirmed) + 0.1, max(testing_df.Confirmed) + 0.1])
ax.grid(True)
for i, row in enumerate(testing_df.iterrows()):
    # Annotate with the country name only if the country has many tests,
    # to avoid text cluttering
    if testing_df.Tests[i] > 2e4:
        ax.annotate(row[0], (testing_df.Tests[i], testing_df.Confirmed[i]))
fig.suptitle('Tests vs Confirmed Cases', fontsize=20)
plt.xlabel('Tests (Log)', fontsize=18)
plt.ylabel('Cases (Log)', fontsize=16)
plt.show()
Although the points do not fall on a perfectly straight line, the relationship looks close to linear on the log-log plot. A linear trend on a log-log plot suggests a power law of the form Cases = k * Tests^n.
When the slope on a log-log plot is between 0 and 1, it signifies that the effect on the dependent variable lessens as the independent variable increases (ref). In our case, this means that as the number of tests increases, the rate of newly discovered cases slows down.
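The exponent n can be estimated by fitting a straight line in log-log space, since log(Cases) = log(k) + n * log(Tests). A self-contained sketch on synthetic power-law data (the real fit would use testing_df.Tests and testing_df.Confirmed, with the k = 0.5 and n = 0.8 below being arbitrary choices):

```python
import numpy as np

# Synthetic data following Cases = k * Tests^n with k = 0.5, n = 0.8
tests = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 3e5])
cases = 0.5 * tests ** 0.8

# A degree-1 fit in log-log space recovers the exponent as the slope
slope, intercept = np.polyfit(np.log(tests), np.log(cases), 1)
print(round(slope, 3))  # 0.8
```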
# Calculate the correlation value
corr_cases, _ = spearmanr(testing_df.Tests, testing_df.Confirmed)
print('Spearmans correlation: %.3f' % corr_cases)
Spearmans correlation: 0.718
As expected, the Spearman's correlation of 0.718 is large enough to suggest a strong positive relationship.
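Spearman's correlation is a good fit here because it compares ranks, so it captures any monotonic relationship (such as the suspected power law) without assuming linearity. A small sketch contrasting it with Pearson on perfectly monotonic but nonlinear toy data:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

# A perfectly monotonic but strongly nonlinear relationship
x = np.arange(1, 20)
y = x ** 5.0

spearman_corr, _ = spearmanr(x, y)
pearson_corr, _ = pearsonr(x, y)
print(round(spearman_corr, 3))  # 1.0 -- the ranks agree perfectly
print(pearson_corr < 1.0)       # True -- linear correlation understates it
```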
We define a new variable deathRatio = deaths / cases. This variable can be interpreted as the effectiveness of each country's health care system: the smaller the value, the more effective.
testing_df['DeathRatio'] = testing_df.Deaths / testing_df.Confirmed
testing_df_death_ratio = testing_df[['Tests', 'DeathRatio']].dropna() # Remove null values
# Remove countries with no deaths so as to not clutter the plot
testing_df_death_ratio = testing_df_death_ratio[testing_df_death_ratio['DeathRatio'] != 0]
# Plot the tests axis vs the death-ratio axis
fig, ax = plt.subplots()
fig.set_size_inches(10, 10)
ax.scatter(testing_df_death_ratio.Tests, testing_df_death_ratio.DeathRatio)
ax.grid(True)
for i, row in enumerate(testing_df_death_ratio.iterrows()):
    ax.annotate(row[0], (testing_df_death_ratio.Tests[i], testing_df_death_ratio.DeathRatio[i]))
fig.suptitle('Tests vs Death Ratio', fontsize=20)
plt.xlabel('Tests', fontsize=18)
plt.ylabel('Death Ratio', fontsize=16)
plt.show()
In the plot above we can see that countries performing a lot of testing, like Germany, the US, and South Korea, manage to face this epidemic with relatively few deaths, while countries performing fewer tests, like Ukraine, Indonesia, and the Philippines, have a greater death ratio.
Correlation or Causality?
This is a hard question to answer. Countries like Germany and Australia have stronger economies than Ukraine and Indonesia, which plays a major role in the above results. However, one could argue that a better economy means more testing, which in turn means a smaller death ratio. In conclusion, it seems that performing as much testing as possible helps contain the number of deaths.
Since this notebook was written and run at the end of March 2020, it would be really interesting to revisit the above numbers once we have managed to deal with this epidemic.