The aim of this notebook is to identify the optimal model from a given simple
parametric family that best characterizes the growth curve of COVID-19 cases without
overfitting the data.
The first part of the dataset (up to 20 March) will be used for training, along with the Bayesian Information Criterion (BIC) for selecting the degree of our polynomial.
The chosen model will be tested on the rest of the dataset.
Intuitively, is a time series forecasting model able to accurately predict the future number of cases?
I believe it is not, for most countries. There are many factors that a time series model cannot forecast: a country may decide to take measures such as lockdowns, increase the number of tests, promote self-isolation, and so on.
This is most obvious in weather forecasting, where we have excellent models based on the physics of the atmosphere. No time series model will perform as well as a good atmospheric model for short-term weather forecasting, which is why meteorologists do not use time series models (ref).
However, it is an interesting exercise in forecasting, in a domain that is highly interpretable (at least at the depths I am reaching).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
import logging
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, normalize
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.arima_model import ARIMA
As in the previous notebook, the number of confirmed cases is taken from the Johns Hopkins dataset, which is updated daily.
We will group by country name and sum over provinces/states to get the cases per country.
# In case the link fails I am including a stored dataset which is not up to date
try:
    confirmed_cases_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
                                     index_col=['Country/Region'])
    # Drop the unused Lat and Long columns
    confirmed_cases_df = confirmed_cases_df.drop(['Lat', 'Long'], axis=1)
    # Group by country and sum the cases over its provinces/states
    confirmed_cases_df = confirmed_cases_df.groupby(['Country/Region']).sum()
    # Filter out the countries that had no cases up until 20/3/2020 to avoid divisions by zero
    confirmed_cases_df = confirmed_cases_df[confirmed_cases_df['3/20/20'] != 0]
except Exception:
    logging.warning("GitHub link not working. Using stored csv file...")
    confirmed_cases_df = pd.read_csv(filepath_or_buffer="datasets/covid_19_country_cases.csv")

display(confirmed_cases_df)

# Uncomment below if you want the DataFrame created to be saved on disk
# confirmed_cases_df.to_csv(path_or_buf="datasets/covid_19_country_cases.csv")
| Country/Region | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | 1/30/20 | 1/31/20 | ... | 3/22/20 | 3/23/20 | 3/24/20 | 3/25/20 | 3/26/20 | 3/27/20 | 3/28/20 | 3/29/20 | 3/30/20 | 3/31/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40 | 40 | 74 | 84 | 94 | 110 | 110 | 120 | 170 | 174 |
| Albania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 89 | 104 | 123 | 146 | 174 | 186 | 197 | 212 | 223 | 243 |
| Algeria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 201 | 230 | 264 | 302 | 367 | 409 | 454 | 511 | 584 | 716 |
| Andorra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 113 | 133 | 164 | 188 | 224 | 267 | 308 | 334 | 370 | 376 |
| Angola | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 3 | 3 | 3 | 4 | 4 | 5 | 7 | 7 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 70 | 77 | 84 | 91 | 107 | 107 | 119 | 119 | 135 | 135 |
| Vietnam | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 113 | 123 | 134 | 141 | 153 | 163 | 174 | 188 | 203 | 212 |
| West Bank and Gaza | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 52 | 59 | 59 | 59 | 84 | 91 | 98 | 109 | 116 | 119 |
| Zambia | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 3 | 12 | 16 | 22 | 28 | 29 | 35 | 35 |
| Zimbabwe | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 3 | 3 | 3 | 3 | 5 | 7 | 7 | 7 | 8 |

161 rows × 70 columns
First, we will sum over all countries and try to fit a polynomial that forecasts the global number of cases as days pass. I avoid fitting a model to each country from the start, since that would be harder to debug.

Steps:
1. Fit a global polynomial model
2. Examine the results, find and correct any bugs
3. Fit a different model for each country
test_days = 7 # How many days will be used for testing
bic_criterion = 2 # Tolerance of BIC (0 - 2 strict, 2 - 6 intermediate, >6 loose)
global_cases_df = confirmed_cases_df.sum() # Sum over all countries to get global number of cases for each day
# Plot the number of cases for each day globally
# Parse the "m/d/yy" column labels into date objects (expanding "20" into "2020")
x_values = [datetime.datetime.strptime(d[:-2] + '20' + d[-2:], "%m/%d/%Y").date() for d in global_cases_df.index]
y_values = global_cases_df.values
fig, ax = plt.subplots(figsize=(20, 10))
formatter = mdates.DateFormatter("%Y-%m-%d")
ax.xaxis.set_major_formatter(formatter)
locator = mdates.DayLocator(interval=5) # Add an interval on the x-axis so as to not be cluttered
ax.xaxis.set_major_locator(locator)
# We will use the last test_days (7) days as a test set
plt.scatter(x_values[:-test_days], y_values[:-test_days], color="blue", label="Train")
plt.scatter(x_values[-test_days:], y_values[-test_days:], color="red", label="Test")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
plt.title("Number of Cases Globally")
plt.legend()
plt.grid()
plt.gcf().autofmt_xdate()
plt.show()
First, we will create a model able to forecast the future global number of cases. This will help us better understand the pipeline we will later build to train a model for each country separately.
On the above plot, the blue part will be used to find the degree of our polynomial and train our linear regression model, and the red part for testing.
x_series = np.arange(len(x_values)) # Transform dates to an integer mapping for easier parsing
# Split the data in training and testing and
# add a new axis so as the PolynomialFeatures input has the expected 2D shape
train_X = x_series[:-test_days, np.newaxis]
train_Y = y_values[:-test_days, np.newaxis]
test_X = x_series[-test_days:, np.newaxis]
test_Y = y_values[-test_days:, np.newaxis]
# Calculate the models up to 9th degree and cache them in a dictionary
models_dict = {}
for degree in range(1,10):
# Create the new features
# eg degree=2, [a] => [a^0, a^1, a^2]
polynomial_features = PolynomialFeatures(degree=degree)
x_poly = polynomial_features.fit_transform(train_X)
# Train a linear regression model with our new polynomial feature set
model = LinearRegression()
model.fit(x_poly, train_Y)
# Cache the model and the new set of features
models_dict[degree] = [model, x_poly]
fig = plt.figure(figsize=(15, 15))
plot_index = 1
# We will create 9 plots one for each degree
for degree, model_features in models_dict.items():
ax = fig.add_subplot(3, 3, plot_index)
# Plot our true data
plt.scatter(train_X, train_Y)
# Use our new model to predict on our train set
pred_train_Y = model_features[0].predict(model_features[1])
plt.plot(train_X, pred_train_Y, linewidth=3, label=f'Degree: {str(degree)}', color="orange")
plot_index += 1
plt.title(f"Degree: {degree}")
plt.show()
Now we need a robust and unbiased way of deciding how well each degree fits our data. A model's quality depends on both its accuracy and its simplicity (lower degree).
BIC: the Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models. It deals with overfitting by penalizing models that need many parameters. So in our case, as the degree of the polynomial increases, the penalty increases too.
$$\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L})$$

In order to calculate $\hat{L}$ we first make some assumptions.
Linear regression has the form $Y = X\beta + \varepsilon$, where $\varepsilon$ is the residual between the prediction and the true value.
Assuming the residuals $\varepsilon$ are independent Gaussian noise, we can show that maximizing the likelihood is equivalent to minimizing the mean squared error (ref).
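To make that step explicit, here is a sketch of the derivation under the Gaussian-noise assumption. The log-likelihood of $n$ i.i.d. Gaussian residuals with variance $\sigma^2$ is

$$\ln(\hat{L}) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\varepsilon_i^2$$

Plugging in the maximum-likelihood estimate $\hat{\sigma}^2 = \frac{1}{n}\sum_{i}\varepsilon_i^2 = \mathrm{mse}$ gives $-2\ln(\hat{L}) = n\ln(\mathrm{mse}) + n\ln(2\pi) + n$. The last two terms are the same for every degree, so they can be dropped when comparing models.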
Under these assumptions, BIC becomes (ref):

$$\mathrm{BIC} = n\ln(\mathrm{mse}) + k\ln(n)$$
# Calculate BIC for each model degree
bics = []
for degree, model_features in models_dict.items():
    # Use the cached model to predict on the train set
    pred_train_Y = model_features[0].predict(model_features[1])
    mse = mean_squared_error(train_Y, pred_train_Y)
    n = len(train_Y)
    k = degree + 1  # +1 for the intercept term
    bics.append(n * np.log(mse) + k * np.log(n))
plt.plot(np.arange(1, len(bics) + 1), bics, marker=".")
plt.xlabel("Degree")
plt.ylabel("BIC")
plt.title("Bayesian Information Criterion vs Degree")
plt.show()
On the above plot we can see that the major drop in BIC happens from the 3rd to the 4th degree. However, we need a rule in order to decide among the models of degree 4 to 9.
A usual criterion is (ref):
Let $B^*$ be the lowest BIC value and $B$ the BIC of a different model. If $B - B^* < 2$, the evidence against the model with BIC $B$ is negligible (2 to 6: positive, over 6: strong).
So we will deem as the best model the one that has the smallest degree and whose BIC is not more than bic_criterion
above the lowest BIC.
def find_simplest_min_bic(bics, criterion=2):
"""Finds the smallest bic following the above rule"""
true_min = np.argmin(bics)
for which_bic in range(true_min):
if bics[which_bic] - bics[true_min] < criterion:
return which_bic # Return the bic index that was the simplest model and B - B* < criterion(2)
return true_min # No B != B* with B - B* < criterion(2) found, return the index of the smallest bic
bic_degree = find_simplest_min_bic(bics, bic_criterion) + 1
print(f"Best BIC Degree: {bic_degree}")
Best BIC Degree: 9
We can see that the model that satisfies our criterion is the 9th-degree one. Following Occam's razor, we could relax the criterion to a higher tolerance and pick a more parsimonious model.
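As a quick sanity check (a minimal sketch, not part of the original pipeline), we can reuse the `bics` list and `find_simplest_min_bic` from above to see how the chosen degree shrinks as the tolerance grows. The tolerance values below are illustrative choices, not prescribed ones.

# Sweep a few tolerance values and report the degree each one selects
for criterion in [2, 6, 10, 20]:
    chosen_degree = find_simplest_min_bic(bics, criterion) + 1
    print(f"Tolerance {criterion}: degree {chosen_degree}")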
Having found the degree of the polynomial that best fits our data without overfitting, we must now forecast dates that were not used in training (the last 7 days in our case) and evaluate our model's performance.
In the cell below I compute the test set predictions along with the train set predictions, making it easier to plot them together.
# Transform our train and test data to match the expected input degree of our chosen model
polyn_test_features = PolynomialFeatures(degree=bic_degree)
x_poly_train_test = polyn_test_features.fit_transform(np.vstack((train_X, test_X)))
# Predict on the train set and on the test set
chosen_model = models_dict[bic_degree]
pred_train_test_Y = chosen_model[0].predict(x_poly_train_test)
# Plot the predictions on the train and test set
plt.scatter(train_X, train_Y, color="blue", label="Train")
plt.scatter(test_X, test_Y, color="red", label="Test")
plt.plot(np.vstack((train_X, test_X)), pred_train_test_Y, color="orange", label="Prediction")
plt.xlabel("Days")
plt.ylabel("Cases")
plt.title("Final Model Predictions for Global Cases")
plt.legend()
plt.show()
We can see that the predicted line increases faster than the actual cases. This is to be expected, since many countries took measures (lockdowns and curfews) in the preceding days and the results show up on our test dates. Consequently, our model makes pessimistic predictions.
In the first part of this notebook we predicted the global number of cases. This gave us a simple problem whose results we could easily supervise and whose mistakes we could easily debug. Using the above methods, we will continue by creating a separate model for each country in the Johns Hopkins dataset.
This pipeline can then be used to forecast the future (yet unknown) number of cases for each country.
Algorithm:

for country in countries:
    for degree in range(1, 10):
        create_polynomial_features(degree)
        train_linear_regression_model()
    select_best_bic_model()
    save_best_model()
country_models = {}  # Dictionary where key: country_name, value: [LinearRegressionModel, degree]
x_data = np.arange(len(x_values))
# Iterate through the countries
for country, row in confirmed_cases_df.iterrows():
    train_X = x_data[:-test_days, np.newaxis]
    train_Y = row.values[:-test_days, np.newaxis]
    test_X = x_data[-test_days:, np.newaxis]
    test_Y = row.values[-test_days:, np.newaxis]
    models_list = []
    for degree in range(1, 10):
        # Create the polynomial features
        polynomial_features = PolynomialFeatures(degree=degree)
        x_poly = polynomial_features.fit_transform(train_X)
        # Train a linear regression model with our new polynomial feature set
        model = LinearRegression()
        model.fit(x_poly, train_Y)
        # Predict on the train set
        pred_train_Y = model.predict(x_poly)
        # Calculate the BIC value
        mse = mean_squared_error(train_Y, pred_train_Y)
        n = len(train_Y)
        k = degree + 1  # +1 for the intercept term
        bic = n * np.log(mse + 1e-2) + k * np.log(n)  # mse + 1e-2 avoids log(0)
        # Save the model and its BIC
        models_list.append([model, bic])
    # Find the best BIC model
    best_model = find_simplest_min_bic([m[1] for m in models_list], criterion=bic_criterion)
    # Save the best model as the country's model along with its best BIC degree
    country_models[country] = [models_list[best_model][0], best_model + 1]
Having created a forecasting model for each country, we will plot some example countries and see how well our fitted curves match the data.
# Which countries will be plotted
example_countries = ["Italy", "China", "Greece", "US", "United Kingdom", "Spain"]
cols = 2
rows = int(np.ceil(len(example_countries) / cols))  # add_subplot expects integers
fig = plt.figure(figsize=(15, 15))
plot_index = 1
for country in example_countries:
# Get the dataset
full_x = x_data[:, np.newaxis]
full_y = confirmed_cases_df.loc[country].values[:, np.newaxis]
# Create the features and predict using the best model of each country
model, degree = country_models[country]
polynomial_features = PolynomialFeatures(degree=degree)
x_poly = polynomial_features.fit_transform(full_x)
pred_Y = model.predict(x_poly)
# Plot the true train, test points and the predicted line
ax = fig.add_subplot(rows, cols, plot_index)
plt.plot(full_x, pred_Y, color="orange")
plt.scatter(full_x[:-test_days], full_y[:-test_days], color="blue", label="Train")
plt.scatter(full_x[-test_days:], full_y[-test_days:], color="red", label="Test")
plt.xlabel("Days")
plt.ylabel("Cases")
plt.title(f"{country} | Degree: {degree}")
plt.legend()
plot_index += 1
plt.show()
As we can see, for countries that are experiencing a rapid increase in cases our model does a good job, with acceptable forecasts on the test set.
For some others, like China, our model predicts an increase which does not match reality. Increasing the tolerance of bic_criterion
will decrease the degree and produce simpler models.
Below I calculate and visualize the distribution of degrees over all the per-country best models.
Increasing the bic_criterion
value gives a bigger tolerance for error and chooses simpler models, and vice versa.
models_degrees = np.array(list(country_models.values()))[:, 1]
values, counts = np.unique(models_degrees, return_counts=True)
plt.vlines(values, 0, counts, color='C0', lw=8)
plt.xlabel("Degree")
plt.ylabel("Number of Models")
plt.title(f"Distribution of Degrees/Complexity | BIC Tolerance = {bic_criterion}")
plt.grid()
plt.show()
Autoregressive integrated moving average (ARIMA) models are used to better understand time series data or to predict future values.
We will go through the following stages:

1. Stationarize the series (and verify stationarity with the Augmented Dickey-Fuller test).
2. Determine the p, d, q parameters from the ACF and PACF plots.
3. Fit the ARIMA model on the training days.
4. Forecast the test days and compare against the true values.

I will perform the above steps on the global statistics (as I did with the polynomial fitting). Again, the model is not perfect. In fact, there is no simple perfect model, since a whole field, epidemiology, is dedicated to predicting these numbers. However, I found following the above steps an interesting educational exercise.
In order to decide our parameters p, q, d we must first obtain a stationarized time series.
# First we plot the line of the global cases
plt.plot(np.arange(len(global_cases_df)), global_cases_df)
plt.xlabel("Day")
plt.ylabel("Cases")
plt.show()
As expected, our series shows no seasonality but has a strong upward trend.
# We will use a part of the data for training
testing_arima_days = 10
global_cases_df.index = pd.to_datetime(global_cases_df.index) # Cast our index from type Object to Datetime
global_cases_df_train = global_cases_df[:-testing_arima_days]
# Seasonal decompose to extract trend, seasonality and residues
result = seasonal_decompose(global_cases_df_train, model='multiplicative')
result.plot()
plt.gcf().autofmt_xdate()
plt.show()
Since we consider a multiplicative model (the number of cases increases exponentially with time), the decomposition has the form

$$y(t) = \text{Level} \times \text{Trend} \times \text{Seasonality} \times \text{Noise}$$
Most forecasting models need the series to be stationarized (approximately stationary), since a stationary series has the convenient property that its future statistical behavior resembles its past. Note also that taking the logarithm turns the multiplicative decomposition above into an additive one, which is why we work with log(cases) below.
The test we will use for determining if our data is stationarized is the Augmented Dickey–Fuller test.
Null Hypothesis: The series has a unit root
Alternate Hypothesis: The series has no unit root.
If we fail to reject the null hypothesis, we treat the series as non-stationary (ref) and must transform it to stationarize it.
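As a small sketch of this decision rule (`is_stationary` is my own illustrative helper, not part of statsmodels), the test outcome can be interpreted programmatically:

# adfuller returns (test statistic, p-value, lags used, n observations, critical values, icbest)
def is_stationary(adf_result, alpha=0.05):
    """Rejecting the unit-root null (p < alpha) suggests the series is stationary."""
    p_value = adf_result[1]
    return p_value < alpha

For the raw global series below, `is_stationary(adfuller(global_cases_df, autolag='BIC'))` should return False.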
# Perform the Augmented Dickey-Fuller test
print("Results of Dickey-Fuller test")
adft = adfuller(global_cases_df, autolag='BIC')
output = pd.Series(adft[0:4], index=['Test Statistic', 'p-value', 'No. of lags used', 'Number of observations used'])
for key, value in adft[4].items():
    output['critical value (%s)' % key] = value
print(output)
Results of Dickey-Fuller test
Test Statistic                  3.031854
p-value                         1.000000
No. of lags used                1.000000
Number of observations used    68.000000
critical value (1%)            -3.530399
critical value (5%)            -2.905087
critical value (10%)           -2.590001
dtype: float64
As expected, we are not at all able to reject the null hypothesis, with a p-value of essentially 1.0.
To get a stationary series we must eliminate the trend.
Stationarization steps (implemented in the cell below):

1. Take the logarithm of the values to stabilize the rate of increase.
2. Compute a rolling mean as the trend estimate.
3. Subtract the rolling mean from the log values.
rolling_mean_window = 6
# Take the log of the values to stabilize their rate of increase
global_cases_df_log = np.log(global_cases_df_train)
# Calculate the rolling average (the window is defined by rolling_mean_window)
moving_avg = global_cases_df_log.rolling(rolling_mean_window).mean()
# Subtract the rolling average from the log number of cases
global_cases_log_moving_avg_diff = global_cases_df_log - moving_avg
global_cases_log_moving_avg_diff.dropna(inplace=True)
plt.plot(global_cases_df_log, label="Log(cases)")
plt.plot(moving_avg, label=f"Rolling Mean ({rolling_mean_window} days)")
plt.plot(global_cases_log_moving_avg_diff, label="Log(cases) - rolling_mean")
plt.gcf().autofmt_xdate()
plt.legend()
plt.show()
# We run the ADF test again on the stationarized series
print("Results of Dickey-Fuller test")
adft = adfuller(global_cases_log_moving_avg_diff, autolag='BIC')
output = pd.Series(adft[0:4], index=['Test Statistic', 'p-value', 'No. of lags used', 'Number of observations used'])
for key, value in adft[4].items():
    output['critical value (%s)' % key] = value
print(output)
Results of Dickey-Fuller test
Test Statistic                 -2.753799
p-value                         0.065182
No. of lags used                0.000000
Number of observations used    54.000000
critical value (1%)            -3.557709
critical value (5%)            -2.916770
critical value (10%)           -2.596222
dtype: float64
We get a p-value of 0.065 and, given some anomalies in the series (the big jump around 2/15), I will consider this close enough to stationary and continue with finding the p and q values for the ARIMA model.
ARIMA has 3 parameters that must be carefully defined (a reference equation follows the list):

- p: the order of the autoregressive (AR) part
- d: the degree of differencing
- q: the order of the moving average (MA) part
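For reference, the ARIMA(p, d, q) model can be written with the lag operator $L$ (where $L y_t = y_{t-1}$) as

$$\left(1 - \sum_{i=1}^{p}\phi_i L^i\right)(1 - L)^d\, y_t = c + \left(1 + \sum_{j=1}^{q}\theta_j L^j\right)\varepsilon_t$$

where the $\phi_i$ are the AR coefficients, the $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is white noise.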
In order to define p and q we will plot the Auto-Correlation Function (ACF) and the Partial Auto-Correlation Function (PACF).
# Compute the ACF and PACF on the stationarized series
# (renamed to avoid shadowing the imported acf/pacf functions)
acf_vals = acf(global_cases_log_moving_avg_diff, nlags=15)
pacf_vals = pacf(global_cases_log_moving_avg_diff, nlags=15, method='ols')
# ACF
plt.subplot(121)
plt.plot(acf_vals)
plt.axhline(y=0, linestyle='-', color='blue')
plt.axhline(y=-1.96/np.sqrt(len(global_cases_log_moving_avg_diff)), linestyle='--', color='black')  # 95% significance band
plt.axhline(y=1.96/np.sqrt(len(global_cases_log_moving_avg_diff)), linestyle='--', color='black')
plt.title('Auto-correlation function')
plt.tight_layout()
# PACF
plt.subplot(122)
plt.plot(pacf_vals)
plt.axhline(y=0, linestyle='-', color='blue')
plt.axhline(y=-1.96/np.sqrt(len(global_cases_log_moving_avg_diff)), linestyle='--', color='black')
plt.axhline(y=1.96/np.sqrt(len(global_cases_log_moving_avg_diff)), linestyle='--', color='black')
plt.title('Partial auto-correlation function')
plt.tight_layout()
We observe a gradual decay of the ACF, while the PACF remains significant up to lag 3. This indicates an AR process with p = 3 and q = 0. q = 0 can also be explained by the fact that we have no residual errors from previous forecasts to model.
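The cut-off rule of thumb applied above can also be sketched in code (`last_significant_lag` is a hypothetical helper of mine, not a statsmodels function; it assumes `acf_vals`/`pacf_vals` from the cell above):

def last_significant_lag(correlations, n_obs):
    """Largest lag whose correlation lies outside the 95% significance band."""
    band = 1.96 / np.sqrt(n_obs)
    outside = np.where(np.abs(correlations[1:]) > band)[0] + 1  # skip lag 0, which is always 1
    return int(outside.max()) if outside.size else 0

# The PACF cut-off suggests the AR order p; the gradually decaying ACF
# (rather than a sharp cut-off) is what points to q = 0 here.
p = last_significant_lag(pacf_vals, len(global_cases_log_moving_avg_diff))
print(f"Suggested AR order: p = {p}")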
Having determined the nature of our process (AR(3)) and its parameters, we proceed to fit it on our training data.
model = ARIMA(global_cases_df_log, order=(3,1,0))
fit_result = model.fit(disp = 0)
plt.plot(global_cases_log_moving_avg_diff)
plt.plot(fit_result.fittedvalues, color='red')
plt.title("sum of squares of residuals")
plt.gcf().autofmt_xdate()
plt.show()
print('RSS : %f' %sum((fit_result.fittedvalues-global_cases_log_moving_avg_diff).dropna()**2))
RSS : 2.488376
# We then forecast our last testing days
fc, se, conf = fit_result.forecast(testing_arima_days, alpha=0.05)
# Match the index of the correct dates so as to be on the correct day on the below plots
fc_series = pd.Series(fc, index=global_cases_df.index[-testing_arima_days:])
lower_series = pd.Series(conf[:, 0], index=global_cases_df.index[-testing_arima_days:])
upper_series = pd.Series(conf[:, 1], index=global_cases_df.index[-testing_arima_days:])
fig = plt.figure(figsize=(10,5), dpi=100)
# Log Cases Plot
ax = fig.add_subplot(1, 2, 1)
plt.plot(global_cases_df_log, label='Training')
plt.plot(fc_series, label='Forecast')
plt.fill_between(lower_series.index, lower_series, upper_series,
color='k', alpha=.15)
plt.plot(np.log(global_cases_df[-testing_arima_days:]), label='True')
plt.title('Forecast vs True (Log Cases)')
plt.ylabel("Log(Cases)")
plt.legend()
plt.grid()
plt.gcf().autofmt_xdate()
# Actual Scale Cases
ax = fig.add_subplot(1, 2, 2)
plt.plot(np.exp(global_cases_df_log), label='Training')
plt.plot(np.exp(fc_series), label='Forecast')
plt.plot(global_cases_df[-testing_arima_days:], label='True')
plt.title('Forecast vs True (Actual Cases)')
plt.ylabel("Log(Cases)")
plt.legend()
plt.grid()
plt.gcf().autofmt_xdate()
plt.show()
We can see that we get close to the true values, with the grey area showing the 95% confidence band. Admittedly, the global-cases problem is an easy one for both the polynomial and the ARIMA fit. However, as days pass our forecast predicts a higher number of cases than the truth. This is explained by the effect of the measures taken showing up during the testing days, an event these models cannot predict (at least with the data I have given them).