ECON 323 Project

The primary goal of this analysis is to find the factors that affect the transmission rate of COVID-19. To be precise, the question we are trying to answer is:

What role do temperature and relative humidity play in the transmission of COVID-19?

Background

Coronavirus disease (COVID-19) is an infectious disease caused by a new virus. The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and in more severe cases, difficulty in breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell.

HOW IT SPREADS

Coronavirus disease spreads primarily through contact with an infected person when they cough or sneeze. It also spreads when a person touches a surface or object that has the virus on it, then touches their eyes, nose, or mouth.

Dataset

Our aim is to test the claim that temperature and relative humidity affect the transmission of COVID-19, using data from 22 January 2020 to 22 April 2020. We selected this specific timeline for two reasons:

Firstly, our study focuses on the 8 countries that have been affected the most by the virus (we define "most affected" as having the highest number of confirmed cases). Based on the Johns Hopkins dataset, which was last updated on April 22, the countries with the most confirmed cases are: USA, Spain, Italy, France, Germany, United Kingdom, Turkey, and China. We selected January 22, 2020 as the first date since that is the date when the first case of the virus was confirmed in the USA.

Secondly, due to temperature and humidity data constraints, 22 April 2020 was the last date for which we could collect data on all the countries.

NOTE: We use "confirmed cases" instead of "number of deaths" to identify the most affected countries because countries have used different strategies to tackle the spread of the virus. While some countries, such as India and Iran, have issued a national lockdown, Japan and South Korea have only issued national "recommendations" regarding social distancing. Countries also differ in population and literacy rate, which might play a key role in the amount of interaction people are exposed to. The number of deaths may therefore reflect the severity of the lockdown, the population, or some other factor, which makes it a poorer indicator of the spread of the virus due to temperature-related conditions.

Methodology

1) General visualizations to show the spread of the virus across the world (including confirmed cases, deaths and recoveries by country)

2) Visualization of the trend in the number of cases for each country over the selected time period (for the 8 countries with the highest number of confirmed cases), and formulation of a hypothesis based on these visualizations

3) Presentation of estimation results from 2 different regression specifications: Ordinary Least Squares Regression and Random Forests Regression

4) Discussion of the limitations of our study and areas for further research.
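
As a rough illustration of the first specification in step 3, the sketch below fits an OLS regression with NumPy's least-squares solver. Everything in it is an assumption made for illustration: the dependent variable (log of confirmed cases), the variable names, the coefficient values, and the synthetic data. None of it comes from this project's dataset or results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic regressors (placeholders, not real weather data)
temperature = rng.uniform(0, 30, size=n)   # degrees Celsius
humidity = rng.uniform(20, 90, size=n)     # percent relative humidity

# Synthetic outcome built from arbitrary coefficients plus noise;
# the values are chosen only to check that the solver recovers them
true_beta = np.array([5.0, -0.03, -0.01])  # intercept, temperature, humidity
X = np.column_stack([np.ones(n), temperature, humidity])
log_cases = X @ true_beta + rng.normal(0, 0.1, size=n)

# OLS estimate: the coefficient vector that minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, log_cases, rcond=None)
print(dict(zip(["intercept", "temperature", "humidity"], beta_hat.round(3))))
```

With real data, `X` would hold the observed temperature and humidity for each country and date, and the estimated coefficients would need standard errors before any interpretation.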

https://www.bbc.com/future/article/20200323-coronavirus-will-hot-weather-kill-covid-19

https://www.bbc.com/news/world-52103747


In [114]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px  # bundled with plotly 4+, replaces the standalone plotly_express package

Datasets

Display Latest Counts for All Countries

In [115]:
# Load all datasets
confirmed_orig = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recoveries = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
confirmed_orig.head(10)
Out[115]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/17/20 4/18/20 4/19/20 4/20/20 4/21/20 4/22/20 4/23/20 4/24/20 4/25/20 4/26/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 906 933 996 1026 1092 1176 1279 1351 1463 1531
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 539 548 562 584 609 634 663 678 712 726
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 2418 2534 2629 2718 2811 2910 3007 3127 3256 3382
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 696 704 713 717 717 723 723 731 738 738
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 19 24 24 24 24 25 25 25 25 26
5 NaN Antigua and Barbuda 17.0608 -61.7964 0 0 0 0 0 0 ... 23 23 23 23 23 24 24 24 24 24
6 NaN Argentina -38.4161 -63.6167 0 0 0 0 0 0 ... 2669 2758 2839 2941 3031 3144 3435 3607 3780 3892
7 NaN Armenia 40.0691 45.0382 0 0 0 0 0 0 ... 1201 1248 1291 1339 1401 1473 1523 1596 1677 1746
8 Australian Capital Territory Australia -35.4735 149.0124 0 0 0 0 0 0 ... 103 103 103 104 104 104 104 105 106 106
9 New South Wales Australia -33.8688 151.2093 0 0 0 0 3 4 ... 2926 2936 2957 2963 2969 2971 2976 2982 2994 3002

10 rows × 100 columns

In [116]:
# melt datasets to get dates into 1 column
ids = ["Province/State","Country/Region", "Lat","Long"]
confirmed=confirmed_orig.melt(id_vars=ids, var_name="Date", value_name="cases")
deaths=deaths.melt(id_vars=ids,var_name="Date", value_name="deaths")
recoveries=recoveries.melt(id_vars=ids,var_name="Date", value_name="recoveries")
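
To make the reshaping step concrete, here is a minimal sketch of what `melt` does on a tiny wide table. The two-country frame and its counts are made-up placeholders, not values from the Johns Hopkins files.

```python
import pandas as pd

# Toy wide-format frame: one column per date (placeholder values, not real data)
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# melt keeps the id column and stacks every date column into (Date, cases) pairs
long = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="cases")

# The Date column holds strings such as "1/22/20"; convert it for time-series work
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
print(long)
```

Because the melted dates start out as strings, comparisons such as `max()` are not chronological until the column is converted.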
In [117]:
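# Get the latest date. The melted frame keeps the date columns in order, so the last row holds it
# (the dates are still strings here, so max() would not be chronological)
max_date = recoveries["Date"].iloc[-1]

latest_data_recoveries = recoveries.loc[recoveries['Date'] == max_date]
latest_data_confirmed = confirmed.loc[confirmed['Date'] == max_date]
latest_data_deaths = deaths.loc[deaths['Date'] == max_date]

# Spot check for one country (not displayed: only the last expression of a cell is shown)
latest_data_confirmed.loc[latest_data_confirmed['Country/Region'] == "Canada"]

# Sum provinces into one row per country; drop the text columns first so only numeric columns are summed.
# Lat and Long are summed as well, so they are not meaningful for countries with several provinces.
text_cols = ['Province/State', 'Date']
latest_data_confirmed_countries = latest_data_confirmed.drop(columns=text_cols).groupby('Country/Region').agg('sum')
latest_data_deaths_countries = latest_data_deaths.drop(columns=text_cols).groupby('Country/Region').agg('sum')
latest_data_recoveries = latest_data_recoveries.drop(columns=text_cols).groupby('Country/Region').agg('sum')

latest_data_recoveries.head(10)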
Out[117]:
Lat Long recoveries
Country/Region
Afghanistan 33.0000 65.0000 207
Albania 41.1533 20.1683 410
Algeria 28.0339 1.6596 1508
Andorra 42.5063 1.5218 344
Angola -11.2027 17.8739 6
Antigua and Barbuda 17.0608 -61.7964 11
Argentina -38.4161 -63.6167 1107
Armenia 40.0691 45.0382 833
Australia -255.9695 1129.8623 5541
Austria 47.5162 14.5501 12282
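
Note that summing every column also adds up `Lat` and `Long` across provinces, which is why Australia shows coordinates of -255.9695 and 1129.8623 above. The sketch below shows one possible way to avoid this, assuming it is acceptable to represent a country by the mean of its provinces' coordinates. The toy rows are placeholders, not real data.

```python
import pandas as pd

# Toy frame: country X has two provinces, country Y has none (placeholder values)
toy = pd.DataFrame({
    "Province/State": ["P1", "P2", None],
    "Country/Region": ["X", "X", "Y"],
    "Lat": [-35.0, -33.0, 47.0],
    "Long": [149.0, 151.0, 14.0],
    "recoveries": [100, 200, 50],
})

# Named aggregation: average the coordinates, sum only the count column
by_country = toy.groupby("Country/Region").agg(
    Lat=("Lat", "mean"),
    Long=("Long", "mean"),
    recoveries=("recoveries", "sum"),
)
print(by_country)
```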

Country-wise total confirmed cases

In [118]:
import warnings
warnings.filterwarnings('ignore')

# Combine the three per-country tables and give the count columns display names
covid = latest_data_confirmed_countries.join(latest_data_deaths_countries["deaths"]).join(latest_data_recoveries["recoveries"])
covid = covid.rename(columns={"cases": "Confirmed", "deaths": "Deaths", "recoveries": "Recovered"})
# Copy the selection so that adding a column does not write into a view of covid
latest_data_complete = covid[["Confirmed", "Deaths", "Recovered"]].copy()
# Active cases = confirmed - deaths - recovered
latest_data_complete['Active'] = latest_data_complete['Confirmed'] - latest_data_complete['Deaths'] - latest_data_complete['Recovered']
latest_data_complete.sort_values(by="Confirmed", ascending=False).head(10)
Out[118]:
Confirmed Deaths Recovered Active
Country/Region
US 965785 54881 106988 803916
Spain 226629 23190 117727 85712
Italy 197675 26644 64928 106103
France 162220 22890 45681 93649
Germany 157770 5976 112000 39794
United Kingdom 154037 20794 778 132465
Turkey 110130 2805 29140 78185
Iran 90481 5710 69657 15114
China 83912 4637 78277 998
Russia 80949 747 6767 73435
In [119]:
# latest_data_complete[:15].plot.bar()
colors = [
    (0.902, 0.902, 0.997), (0.695, 0.695, 0.993), (0.488, 0.488, 0.989),
    (0.282, 0.282, 0.985), (0.078, 0.078, 0.980)
]
fig, ax = plt.subplots(figsize=(15,15))

latest_data_complete.sort_values(by="Confirmed", ascending=True).tail(10).plot(kind="barh", ax=ax, color=colors)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_title("Top 10 countries with the Highest number of confirmed cases")
Out[119]:
Text(0.5, 1.0, 'Top 10 countries with the Highest number of confirmed cases')

Data Visualizations

Load Dataset created with above modifications

In [122]:
complete_time_series = pd.read_csv('data/covid_19_clean_complete.csv', parse_dates=['Date'])
complete_time_series.tail(5)
Out[122]:
Province/State Country/Region Lat Long Date Confirmed Deaths Recovered
21479 Saint Pierre and Miquelon France 46.885200 -56.315900 2020-04-12 1 0 0
21480 NaN South Sudan 6.877000 31.307000 2020-04-12 4 0 0
21481 NaN Western Sahara 24.215500 -12.885800 2020-04-12 6 0 0
21482 NaN Sao Tome and Principe 0.186360 6.613081 2020-04-12 4 0 0
21483 NaN Yemen 15.552727 48.516388 2020-04-12 1 0 0
In [123]:
# Preparing Data for visualizations

# Defining COVID-19 cases as per classifications 
cases = ['Confirmed', 'Deaths', 'Recovered', 'Active']

# Defining Active Case: Active Case = confirmed - deaths - recovered
complete_time_series['Active'] = complete_time_series['Confirmed'] - complete_time_series['Deaths'] - complete_time_series['Recovered']

# Renaming Mainland china as China in the data table
complete_time_series['Country/Region'] = complete_time_series['Country/Region'].replace('Mainland China', 'China')

# filling missing values 
complete_time_series[['Province/State']] = complete_time_series[['Province/State']].fillna('')
complete_time_series[cases] = complete_time_series[cases].fillna(0)
In [124]:
complete_time_series[complete_time_series["Country/Region"] == "US"]
Out[124]:
Province/State Country/Region Lat Long Date Confirmed Deaths Recovered Active
225 US 37.0902 -95.7129 2020-01-22 1 0 0 1
487 US 37.0902 -95.7129 2020-01-23 1 0 0 1
749 US 37.0902 -95.7129 2020-01-24 2 0 0 2
1011 US 37.0902 -95.7129 2020-01-25 2 0 0 2
1273 US 37.0902 -95.7129 2020-01-26 5 0 0 5
... ... ... ... ... ... ... ... ... ...
20399 US 37.0902 -95.7129 2020-04-08 429052 14695 23559 390798
20661 US 37.0902 -95.7129 2020-04-09 461437 16478 25410 419549
20923 US 37.0902 -95.7129 2020-04-10 496535 18586 28790 449159
21185 US 37.0902 -95.7129 2020-04-11 526396 20463 31270 474663
21447 US 37.0902 -95.7129 2020-04-12 555313 22020 32988 500305

82 rows × 9 columns

In [126]:
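# log1p computes ln(1 + x), so days with zero confirmed cases stay finite
complete_time_series["ln(confirmed)"] = np.log1p(complete_time_series["Confirmed"])
# Sorting by Confirmed puts the countries with the most cases first in the legend
px.line(complete_time_series.sort_values(by="Confirmed", ascending=False), x='Date', y='Confirmed', color='Country/Region', title='COVID19 Total Confirmed Cases growth for top 10 worst affected countries');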
In [127]:
px.line(complete_time_series, x='Date', y='Deaths', color='Country/Region', title='COVID19 Total Deaths growth for top 10 worst affected countries')
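
The two plots above draw every row in `complete_time_series`, although their titles refer to the ten worst-affected countries. The sketch below shows one way the frame could be restricted first: sum provinces per country and date, rank countries on the latest date, and keep the top `n`. The helper `top_n_countries` is hypothetical and the toy data are placeholders.

```python
import pandas as pd

def top_n_countries(df, n=10, value_col="Confirmed"):
    """Return per-country daily totals for the n countries with the largest latest value."""
    # Sum provinces so each country has one row per date
    totals = df.groupby(["Country/Region", "Date"], as_index=False)[value_col].sum()
    # Rank countries by their value on the most recent date
    latest = totals[totals["Date"] == totals["Date"].max()]
    keep = latest.nlargest(n, value_col)["Country/Region"]
    return totals[totals["Country/Region"].isin(keep)].sort_values("Date")

# Toy data: country B is split into two provinces (placeholder values)
toy = pd.DataFrame({
    "Province/State": ["", "", "P1", "P1", "P2", "P2", "", ""],
    "Country/Region": ["A", "A", "B", "B", "B", "B", "C", "C"],
    "Date": pd.to_datetime(["2020-04-11", "2020-04-12"] * 4),
    "Confirmed": [5, 9, 1, 2, 3, 4, 20, 30],
})

top2 = top_n_countries(toy, n=2)
print(top2)
```

The filtered frame could then be passed to `px.line` with `x='Date'`, `y='Confirmed'` and `color='Country/Region'`, so that the plot shows only the countries named in its title.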

A Time-series graph of the confirmed and recovered cases of COVID-19

In [128]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# Sum over all countries and provinces so each date has a single worldwide total
daily_totals = complete_time_series.groupby('Date', as_index=False)[['Confirmed', 'Recovered']].sum()

fig = go.Figure()
fig.add_trace(go.Scatter(
                x=daily_totals['Date'],
                y=daily_totals['Confirmed'],
                name="Confirmed",
                line_color='deepskyblue',
                opacity=0.8))

fig.add_trace(go.Scatter(
                x=daily_totals['Date'],
                y=daily_totals['Recovered'],
                name="Recovered",
                line_color='green',
                opacity=0.6))
fig.update_layout(title_text='Progression of cases since the beginning',
                  xaxis_rangeslider_visible=True)
iplot(fig)