--by Lu Tang
The purpose of this project is to compare temperature trends, both differences and similarities, between the global average and the city of Xi’an, China, which is my hometown. The project looks for answers to the following questions:
- Is Xi’an hotter or cooler on average than the global average?
- Has the difference been consistent over time?
- What are the overall trends for Xi’an and the world? Are they getting hotter or cooler over time?
- Are the overall trends consistent?
The tools used in the project:
SQL and Python, Pandas, Matplotlib and Seaborn
The data is stored in a database. To understand it, I extracted all the data from three tables using SQL, then used Pandas to analyze how the three tables are related. I discovered the following:
The SQL query to extract the data:
Saved as 'results.csv'
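The query itself is not reproduced in this write-up, so the following is only a minimal sketch of how a per-year join could produce the columns found in 'results.csv'. The table names `city_data` and `global_data`, their column layouts, and the in-memory SQLite database are all assumptions made for illustration; the two sample rows are copied from the data preview.

```python
import sqlite3

# Hypothetical schema: the real table and column names are not shown in this write-up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city_data (year INTEGER, city TEXT, country TEXT, avg_temp REAL);
    CREATE TABLE global_data (year INTEGER, avg_temp REAL);
""")

# Two sample rows copied from the results.csv preview.
conn.executemany(
    "INSERT INTO city_data VALUES (?, ?, ?, ?)",
    [(1820, "Xian", "China", 9.55), (1821, "Xian", "China", 11.12)],
)
conn.executemany(
    "INSERT INTO global_data VALUES (?, ?)",
    [(1820, 7.62), (1821, 8.09)],
)

# Join the city rows to the global rows on the year column.
query = """
    SELECT c.year, c.city, c.country, c.avg_temp, g.avg_temp AS avg_temp_global
    FROM city_data AS c
    JOIN global_data AS g ON g.year = c.year
    WHERE c.city = 'Xian' AND c.country = 'China'
    ORDER BY c.year
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Each tuple in `rows` then carries the same five fields as one line of the CSV: year, city, country, city temperature and global temperature.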
Data Description:
# load the data with pandas and display the first 5 rows
import pandas as pd
df=pd.read_csv('results.csv')
df.head()
| | year | city | country | avg_temp | avg_temp_global |
|---|---|---|---|---|---|
| 0 | 1820 | Xian | China | 9.55 | 7.62 |
| 1 | 1821 | Xian | China | 11.12 | 8.09 |
| 2 | 1822 | Xian | China | 11.16 | 8.19 |
| 3 | 1823 | Xian | China | 11.76 | 7.72 |
| 4 | 1824 | Xian | China | NaN | 8.55 |
# display last 5 rows
df.tail()
| | year | city | country | avg_temp | avg_temp_global |
|---|---|---|---|---|---|
| 189 | 2009 | Xian | China | 12.53 | 9.51 |
| 190 | 2010 | Xian | China | 12.59 | 9.70 |
| 191 | 2011 | Xian | China | 12.08 | 9.52 |
| 192 | 2012 | Xian | China | 11.90 | 9.51 |
| 193 | 2013 | Xian | China | 14.46 | 9.61 |
# rename 'avg_temp'
df.rename({'avg_temp':'avg_temp_xian'},axis=1,inplace=True)
# set column 'year' as index
df.index=df['year']
# delete unnecessary columns
df.drop(['year','city','country'],axis=1, inplace=True)
#check the result
df.head(5)
| year | avg_temp_xian | avg_temp_global |
|---|---|---|
| 1820 | 9.55 | 7.62 |
| 1821 | 11.12 | 8.09 |
| 1822 | 11.16 | 8.19 |
| 1823 | 11.76 | 7.72 |
| 1824 | NaN | 8.55 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 194 entries, 1820 to 2013
Data columns (total 2 columns):
avg_temp_xian      179 non-null float64
avg_temp_global    194 non-null float64
dtypes: float64(2)
memory usage: 4.5 KB
# Check sum of NaN data
df.isnull().sum()
avg_temp_xian      15
avg_temp_global     0
dtype: int64
# Drop the rows with null values, since they make up less than 10% of the data
df.dropna(inplace=True)
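The "less than 10%" figure can be checked directly rather than estimated by eye. The sketch below uses a hypothetical helper, `missing_share`, and applies it to the first five rows of the data (copied from the preview) only to show the call; it is not part of the original notebook.

```python
import pandas as pd


def missing_share(frame):
    """Return the fraction of missing values in each column."""
    return frame.isnull().mean()


# First five rows of the data, copied from the preview earlier in this post.
preview = pd.DataFrame(
    {
        "avg_temp_xian": [9.55, 11.12, 11.16, 11.76, None],
        "avg_temp_global": [7.62, 8.09, 8.19, 7.72, 8.55],
    },
    index=pd.Index([1820, 1821, 1822, 1823, 1824], name="year"),
)

share = missing_share(preview)
```

On the full data, 15 missing values out of 194 rows is about 7.7%, which is consistent with the threshold stated in the comment.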
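# Calculate a 7-row moving average for the Xian and global data and store each as a new column
# (the rows with missing Xian data were dropped, so early windows span more than 7 calendar years)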
df['moving_avg_xian']=df['avg_temp_xian'].rolling(window=7).mean()
df['moving_avg_global']=df['avg_temp_global'].rolling(window=7).mean()
df.head(10)
| year | avg_temp_xian | avg_temp_global | moving_avg_xian | moving_avg_global |
|---|---|---|---|---|
| 1820 | 9.55 | 7.62 | NaN | NaN |
| 1821 | 11.12 | 8.09 | NaN | NaN |
| 1822 | 11.16 | 8.19 | NaN | NaN |
| 1823 | 11.76 | 7.72 | NaN | NaN |
| 1837 | 21.19 | 7.38 | NaN | NaN |
| 1840 | 10.81 | 7.80 | NaN | NaN |
| 1841 | 10.26 | 7.69 | 12.264286 | 7.784286 |
| 1842 | 11.05 | 8.02 | 12.478571 | 7.841429 |
| 1843 | 11.12 | 8.17 | 12.478571 | 7.852857 |
| 1844 | 11.01 | 7.65 | 12.457143 | 7.775714 |
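The table shows a side effect of dropping rows before smoothing: the index jumps from 1823 to 1837 and then to 1840, so the first moving-average value (1841) averages seven rows spread over 22 calendar years, including the unusually high 1837 reading. The sketch below contrasts that row-based window with a calendar-based one. It is an alternative shown for comparison, not what the notebook does, and `min_periods=4` is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Xi'an values copied from the ten rows shown in the table
# (1824-1836, 1838 and 1839 are missing in the data).
xian = pd.Series(
    [9.55, 11.12, 11.16, 11.76, 21.19, 10.81, 10.26, 11.05, 11.12, 11.01],
    index=[1820, 1821, 1822, 1823, 1837, 1840, 1841, 1842, 1843, 1844],
)

# Row-based window: the last 7 available rows, whatever years they cover.
row_based = xian.rolling(window=7).mean()

# Calendar-based window: restore the missing years as NaN, then average the
# values that exist in each 7-year span, requiring at least 4 of them.
full_years = xian.reindex(range(1820, 1845))
calendar_based = full_years.rolling(window=7, min_periods=4).mean()

# Side-by-side view, aligned on year.
comparison = pd.DataFrame(
    {"row_based": row_based, "calendar_based": calendar_based}
)
```

With the calendar-based window, the 1841 value stays undefined because only three readings fall within 1835 to 1841, and the 1844 value is no longer pulled up by the 1837 reading.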
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Dark2')
fig,ax = plt.subplots(figsize=(16,8))
ax.plot(df['moving_avg_xian'], label='Xian Weather Trends')
ax.plot(df['moving_avg_global'], label='Global Weather Trends')
ax.legend(loc='best')
ax.set_xlabel('Year')
ax.set_ylabel('Temperature')
ax.set_title('Weather Trends')
Text(0.5, 1.0, 'Weather Trends')
In this project, I analyzed weather trends using temperature data for Xi’an and the world from 1820 to 2013. We can conclude that Xi’an is hotter than the global average and that its temperature fluctuates more. The temperatures of both Xi’an and the world have been rising over the years, and the rise has been faster in recent years. Based on these historical trends, we can expect temperatures to keep rising, possibly at an increasing rate. Our world is facing climate change, and protecting the environment is very important.
Further Notes: This project is mainly focused on EDA (Exploratory Data Analysis). Precisely predicting future trends requires a robust predictive model, which is beyond the scope of this project. My other projects cover machine learning, data modeling and predictions.
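The conclusions above are read off the chart. One way to put numbers on them is to compute the gap between the two series and a least-squares slope for each. The following is a rough sketch: `linear_trend` is a hypothetical helper, not part of the original notebook, and the five rows used here (copied from the `df.tail()` output) only demonstrate the calls and are far too few to support any conclusion.

```python
import numpy as np
import pandas as pd


def linear_trend(series):
    """Least-squares slope of a year-indexed series, in degrees per year."""
    years = series.index.to_numpy(dtype=float)
    centered = years - years.mean()  # centering keeps the fit well conditioned
    slope, _intercept = np.polyfit(centered, series.to_numpy(dtype=float), 1)
    return slope


# Last five rows of the data, copied from the df.tail() output.
recent = pd.DataFrame(
    {
        "avg_temp_xian": [12.53, 12.59, 12.08, 11.90, 14.46],
        "avg_temp_global": [9.51, 9.70, 9.52, 9.51, 9.61],
    },
    index=pd.Index([2009, 2010, 2011, 2012, 2013], name="year"),
)

# How much hotter Xi'an is than the global average, and how stable that gap is.
gap = recent["avg_temp_xian"] - recent["avg_temp_global"]
mean_gap = gap.mean()
gap_spread = gap.std()

# Direction and size of each trend over the rows supplied.
trend_xian = linear_trend(recent["avg_temp_xian"])
trend_global = linear_trend(recent["avg_temp_global"])
```

Applied to the full data frame, the same helper could be run on two slices, for example `df.loc[:1950]` and `df.loc[1951:]` (the split year is an arbitrary choice), to check whether the slope is steeper in the later period.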