In case of the total drought in California, how many desalination plant projects we need to supply residential use water for population who live in urban areas in California?
Suppose that a $1 billion desalination plant project can produce 50 million gallons water per day and the average residential water use is about 90 gallons per capita per day. After researching online, we got that urban population in Califonia is (1) 28,102,643 and the average water each person use at home per day is about (2) 80-100 gallons. We estimate the total number of desalination plant projects to supply residential-use water for urban population in California by the formula:
the total residential-use water / one desalination plant project can produce = (28102643 * 90) / 5e7 = 51 desalination plants
Our hypothesis is that we need at least 51 desalination plant projects to supply urban residential-use water in Califonia.
References:
California is going to suffer from a crippling megadrought in the next century. If our global temperature raises by 2 degrees celsius, the chance that this occurs is merely 30-60 percent. However, if we continue on the path that we are currently, it is almost certain that we will face a drought significantly worse than any previously recorded drought.(1) Our water infrastructure is not prepared for this, as evidenced by the most recent drought. A drought lasting many years will exhaust all of our freshwater, leaving us with two options: desalination and importing water. Importing water isn't preferable because of transportation costs and the need to rely on other states for a huge amount of our basic needs. Desalination is innefficient, requiring huge investments to obtain even normal prices for the water produced by desalinating ocean water. However, if we want to retain our independence from other states and avoid building massive amounts of infrastructure to import water, we need to build these expensive plants. Not only is this an expensive problem, but it is a problem that requires planning and foresight: the billion dollar desalination plant in Oceanside, for example, took 14 years to build. (2) For these reasons, it's critical to start thinking of how many desalination plants we need to support our population.
We are interested in this question since we noticed that California has been under the situation of drought for five years. Even though recently, the news mentioned that "2017 could be the end of a dry spell"(3), we still think that it is necessary to prepare for the next drought. After doing some research online, we think that building some desalination plant projects may be a good strategy. Someone may argue that it is too expensive and it is not realistic. However, based on the article,"Israel Proves the Desalination Era Is Here", the desalination project mentioned in the article can produce the low-cost water. " Desalination used to be an expensive energy hog, but the kind of advanced technologies being employed at Sorek have been a game changer. Water produced by desalination costs just a third of what it did in the 1990s. Sorek can produce a thousand liters of drinking water for 58 cents. Israeli households pay about US$30 a month for their water — similar to households in most U.S. cities, and far less than Las Vegas (US$47) or Los Angeles (US$58)." (4) In the near future, such desalination project may be a win-win method to solve water shortage problems.
References:
In order to solve our question, we have to figure out total urban population and total residential-use water in California and how many water a desalination plant project can produce per day. Based on the link below, we can know these important information. We mentioned them in the hypothesis part as well.
References:
Database Name | Description | # of Observations | Brief Useage |
---|---|---|---|
Urban_Water_Supplier_Monitoring_Reports.csv | Drinking Water Information Clearinghouse (DRINC) includes water activities such as the amount of water being served to different regions and counties. There are 31 columns in this dataset and for for information, please read the pdf file: 'Urban_Water_Supplier_Monitoring_Data_Definitions.pdf' | 9356 | This dataset allows us to find information we need to solve our research question. We will forcus on these columns, R-GPCD Calculated,Supplier_Name,Hydrologic_Region,County,Reporting_Month Population_Served,R-GPCD_Calculated |
1. How 'clean' is the data? To the extent that it requires cleaning, how will you do so?
2. How will you analyze the data? Be as specific as you can. Briefly mention any pre-processing steps that are required for your methods (for example, checking data distributions and performing any transformations that may be required). Include a brief outline of how you will apply your chosen method(s).
3. What do you plan to report? Briefly mention any key visualizations you plan to create, and/or the kind of result you will be able to report that addresses your question (this could be, for example, the outcome of some statistical test(s), prediction error on a model, a model fit parameter, etc.).
** 4.Include a list of packages you expect to use for you project. If you plan to use packages we have not used in class, add a very brief description about them (a few words is sufficient). **
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('Urban_Water_Supplier_Monitoring_Reports.csv')
df.shape
(9356, 31)
#Drop unnecessary columns
df=df.drop(['PWSID'],axis=1)
df=df.drop(['Days_Allowed/Week'],axis=1)
df=df.drop(['Actions_Implemented'],axis=1)
df=df.drop(['Complaints'],axis=1)
df=df.drop(['Follow-ups'],axis=1)
df=df.drop(['Penalties_Rate'],axis=1)
df=df.drop(['Penalties_Other'],axis=1)
df=df.drop(['Warnings'],axis=1)
df=df.drop(['ID'],axis=1)
df=df.drop(['Ag_Cert'],axis=1)
df=df.drop(['Current_Ag'],axis=1)
df=df.drop(['Current_Production_Calculated'],axis=1)
df=df.drop(['Mandatory Restrictions'],axis=1)
df=df.drop(['2013_Production'],axis=1)
df=df.drop(['2013_Ag'],axis=1)
df=df.drop(['Current_Recycled'],axis=1)
df=df.drop(['Current_Non-revenue'],axis=1)
df=df.drop(['Percent_Reduction'],axis=1)
df=df.drop(['2013_Production_Calculated'],axis=1)
df=df.drop(['CII'],axis=1)
df=df.drop(['Days-in-Month'],axis=1)
df=df.drop(['Percent_Residential_Use'],axis=1)
df=df.drop(['Current_Production'],axis=1)
df=df.drop(['Units'],axis=1)
df=df.drop(['R_GPCD_Reported '],axis=1)
# Figuring out urban areas which we are interested in.
df=df.drop(df[df.Population_Served<=50000].index)
df.shape
(4066, 6)
df.head(n=10)
Supplier_Name | Hydrologic_Region | County | Reporting_Month | Population_Served | R-GPCD_Calculated | |
---|---|---|---|---|---|---|
0 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 4/15/2016 0:00 | 1400000 | 59 |
1 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 3/15/2016 0:00 | 1400000 | 53 |
2 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 2/15/2016 0:00 | 1400000 | 54 |
3 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 1/15/2016 0:00 | 1400000 | 53 |
4 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 12/15/2015 0:00 | 1390000 | 53 |
5 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 11/15/2015 0:00 | 1390000 | 57 |
6 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 10/15/2015 0:00 | 1390000 | 68 |
7 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 9/15/2015 0:00 | 1390000 | 74 |
8 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 8/15/2015 0:00 | 1390000 | 74 |
9 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 7/15/2015 0:00 | 1390000 | 72 |
df2016=df[df.Reporting_Month.str.contains("2016") == True]
df2015=df[df.Reporting_Month.str.contains("2015") == True]
df2014=df[df.Reporting_Month.str.contains("2014") == True]
#Remove all rows from above dataframes that have missing data (that have NaN values)
df2016= df2016.dropna()
df2015= df2015.dropna()
df2014= df2014.dropna()
#Add one column "Total_Res_use_perDay" which is based on the population served
#and average residential water use per-capita-per-day.
df2016=df2016.assign(Total_Res_use_PerDay = df2016['R-GPCD_Calculated']*df2016['Population_Served'])
df2015=df2015.assign(Total_Res_use_PerDay = df2015['R-GPCD_Calculated']*df2015['Population_Served'])
df2014=df2014.assign(Total_Res_use_PerDay = df2014['R-GPCD_Calculated']*df2014['Population_Served'])
#Show the form of 2016 dataframe
df2016.head(n=5)
Supplier_Name | Hydrologic_Region | County | Reporting_Month | Population_Served | R-GPCD_Calculated | Total_Res_use_PerDay | |
---|---|---|---|---|---|---|---|
0 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 4/15/2016 0:00 | 1400000 | 59 | 82600000 |
1 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 3/15/2016 0:00 | 1400000 | 53 | 74200000 |
2 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 2/15/2016 0:00 | 1400000 | 54 | 75600000 |
3 | East Bay Municipal Utilities District | San Francisco Bay | ALAMEDA | 1/15/2016 0:00 | 1400000 | 53 | 74200000 |
47 | Yorba Linda Water District | South Coast | ORANGE | 4/15/2016 0:00 | 78539 | 117 | 9189063 |
#1. Plot the histogram of number of residential gallons-per-capita-day in 2016
fig = plt.figure(figsize=(15, 8))
cm=plt.cm.get_cmap('RdYlBu_r')
data=df2016['R-GPCD_Calculated']
n, bins, patches = plt.hist(data)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
plt.xlabel('Gallons of Water ')
plt.ylabel('# Counts')
plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2016\ (gallons-per-capita-day)}$')
col = bin_centers - min(bin_centers)
col /= max(col)
for c, p in zip(col, patches):
plt.setp(p, 'facecolor', cm(c))
plt.grid(True)
#number of residential gallons-per-capita-day
# 2. Plot the histogram of number of residential gallons-per-capita-day in 2015
fig = plt.figure(figsize=(15, 8))
cm=plt.cm.get_cmap('RdYlGn_r')
data=df2015['R-GPCD_Calculated']
n, bins, patches = plt.hist(data)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
plt.xlabel('Gallons of Water ')
plt.ylabel('# Counts')
plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2015\ (gallons-per-capita-day)}$')
col = bin_centers - min(bin_centers)
col /= max(col)
for c, p in zip(col, patches):
plt.setp(p, 'facecolor', cm(c))
plt.grid(True)
# 3. Plot the histogram of number of residential gallons-per-capita-day in 2014
fig = plt.figure(figsize=(15, 8))
cm=plt.cm.get_cmap('PuOr')
data=df2014['R-GPCD_Calculated']
n, bins, patches = plt.hist(data)
bin_centers = 0.5 * (bins[:-1] + bins[1:])
plt.xlabel('Gallons of Water ')
plt.ylabel('# Counts')
plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2014\ (gallons-per-capita-day)}$')
col = bin_centers - min(bin_centers)
col /= max(col)
for c, p in zip(col, patches):
plt.setp(p, 'facecolor', cm(c))
plt.grid(True)
#A bar chart showing R-GPCD_Calculated in each county based on df206
countyList = df2016['County'].unique()
countyList.sort()
countyGPCDdict = {}
gpcdList16 = []
for i in range(len(countyList)):
currRGPCD = df2016[df2016['County'] == countyList[i]]['R-GPCD_Calculated'].mean()
countyGPCDdict[countyList[i]] = currRGPCD
countyGPCDdict = pd.Series(countyGPCDdict)
countyGPCDdict.sort_values(inplace = True)
countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2016')
# randList = list(range(0,len(countyList)))
# plt.figure(figsize=(20,15))
# plt.bar(randList, gpcdList16, align='center', alpha=0.5)
# plt.xlabel('Counties')
# plt.ylabel('Average R-GPCD in 2016')
# plt.ylim(ymax = 220)
# plt.xticks(randList, countyList, fontsize = 8, rotation = 'vertical')
# plt.title(r'$\mathrm{Bar\ of\ Residential\ Gallons-per-capita-day(gallons-per-capita-day)\ 2016}$')
# plt.show()
<matplotlib.axes._subplots.AxesSubplot at 0x7f363111de48>
#A bar chart showing R-GPCD_Calculated in each county based on df2015
countyList = df2015['County'].unique()
countyList.sort()
countyGPCDdict = {}
gpcdList16 = []
for i in range(len(countyList)):
currRGPCD = df2015[df2015['County'] == countyList[i]]['R-GPCD_Calculated'].mean()
countyGPCDdict[countyList[i]] = currRGPCD
countyGPCDdict = pd.Series(countyGPCDdict)
countyGPCDdict.sort_values(inplace = True)
countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2015')
<matplotlib.axes._subplots.AxesSubplot at 0x7f362af7d828>
#A bar chart showing R-GPCD_Calculated in each county based on df2014
countyList = df2014['County'].unique()
countyList.sort()
countyGPCDdict = {}
gpcdList16 = []
for i in range(len(countyList)):
currRGPCD = df2014[df2014['County'] == countyList[i]]['R-GPCD_Calculated'].mean()
countyGPCDdict[countyList[i]] = currRGPCD
countyGPCDdict = pd.Series(countyGPCDdict)
countyGPCDdict.sort_values(inplace = True)
countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2014')
<matplotlib.axes._subplots.AxesSubplot at 0x7f362aff7588>
# Plot a line chart showing the tendency of the average water use (units in gallons)
mean_14 = df2014['R-GPCD_Calculated'].mean()
mean_15 = df2015['R-GPCD_Calculated'].mean()
mean_16 = df2016['R-GPCD_Calculated'].mean()
years = [2014,2015,2016]
waterUsage = [mean_14, mean_15, mean_16]
f = plt.figure(figsize=(6, 8))
ax = plt.axes()
plt.xlabel('Years')
plt.ylabel('Average Water Use per-capita-per-day (in gallons)')
plt.title(r'$\mathrm{Trend\ of\ Average\ Water\ Use\ per-capita-day\ from\ 2014\ to\ 2016 }$')
ax.set_xticks(years)
plt.ylim((50,150))
plt.plot(years, waterUsage)
plt.show()
#This is the function which helps us find the population based on the dataframe
def estimate_popuation(df):
supplier_list =list(df['Supplier_Name'].unique())
ttpop=0
for i in range(len(supplier_list)):
df1 = df[df['Supplier_Name'] == supplier_list[i]]
ttpop=ttpop+df1.groupby('Supplier_Name')['Population_Served'].mean()[0]
return(round(ttpop))
#2. Plot line chart to show the change of urban population in California
#2014 urban population:
point1=estimate_popuation(df2014)
#2015 urban population:
point2=estimate_popuation(df2015)
#2016 urban population:
point3=estimate_popuation(df2016)
years = [2014,2015,2016]
population = [point1, point2, point3]
f = plt.figure(figsize=(8, 6))
ax = plt.axes()
plt.xlabel('Years')
plt.ylabel('Population (Units *10^7)')
plt.title(r'$\mathrm{Trend\ of\ Urban\ Population\ from\ 2014\ to\ 2016 }$')
ax.set_xticks(years)
plt.plot(years, population)
plt.ylim((2.9e7,3.18e7))
plt.show()
hydro_list =list(df['Hydrologic_Region'].unique())
supplier_list =list(df2016['Supplier_Name'].unique())
county_list = list(df['County'].unique())
# Function based on the name of County
def estimate_county(county,df):
gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day
df1=df[df.County.str.contains(county) == True]
m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallons of water each person need per day
# Find the population based on new data frame df1.
# We use groupby and take the mean since we noticed that even under the same supplier district,
# the population served each month is slightly changing so we decided to take the mean of the the population served in these months in the data frame.
ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean()
ttpop=round(ttpop.sum())
ttwater=ttpop*m_ppw
the_number_of_projectsneed= round(ttwater /gallons_water_per_project)
if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation
the_number_of_projectsneed=1
return (the_number_of_projectsneed)
# Function based on the name of Supplier_Name
def estimate_supplier(supplier,df):
gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day
df1=df[df.Supplier_Name.str.contains(supplier) == True] #Get on new data frame which contains the information which related to our interests.
m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallsons of water each person need per day
# Find the population based on the new data frame
ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean()
ttpop=round(ttpop.sum())
ttwater=ttpop*m_ppw
the_number_of_projectsneed= round(ttwater /gallons_water_per_project)
if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation
the_number_of_projectsneed=1
return (the_number_of_projectsneed)
# Function based on the name of Hydrologic_Region
def estimate_Hydrologic_Region(Hydrologic_Region,df):
gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day
df1=df[df.Hydrologic_Region.str.contains(Hydrologic_Region) == True]
m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallsons of water each person need per day
# Find the population based on new data frame
ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean()
ttpop=round(ttpop.sum())
ttwater=round(ttpop*m_ppw )
the_number_of_projectsneed= round(ttwater /gallons_water_per_project)
if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation
the_number_of_projectsneed=1
return (the_number_of_projectsneed)
Overview:
We have built three models to predict how many desalination plant projects we need in the previous part. In the result part,we will show how to use them.
We are going to make one dataframe to show our results based on Hydrologic_Region
(There are two columns in the dataframe. One is for the Hydrologic_Region name and another one is for number of projects we need)
#Prediction based on the county's names: (we will call the function we wrote before)
#Example:
county_example = estimate_county('ALAMEDA',df2014)
# Prediction based on the supplier names: (we will call the function we wrote before)
#Example:
supplier_example = estimate_supplier('Milpitas City of',df2014)
# Prediction based on the Hydrologic_Region: (we will call the function we wrote before)
# Example:
hydrologic_example = estimate_Hydrologic_Region('San Joaquin River',df2014)
print("By building three models to predict deslaination plants, we can deterimine the number of plants, supplier, and hydrologic religions affected. \n")
print("Based on the county's name:")
print("For example, in Alameda in 2014, there can be", county_example, "desalination plant project(s)\n")
print("Based on the supplier name:")
print("For example: In Milpitas City in 2014, it needs", supplier_example, "desalination plant project(s)\n")
print("Based on the Hydrolic Region:")
print("For example: In the San Joaquin River region in 2014, it needs", hydrologic_example, "desalination plant project(s)\n")
By building three models to predict deslaination plants, we can deterimine the number of plants, supplier, and hydrologic religions affected. Based on the county's name: For example, in Alameda in 2014, there can be 5 desalination plant project(s) Based on the supplier name: For example: In Milpitas City in 2014, it needs 1 desalination plant project(s) Based on the Hydrolic Region: For example: In the San Joaquin River region in 2014, it needs 3 desalination plant project(s)
Based on the Hydrolic Region, we can predict a certain amount of plant projects
hydro_list =list(df['Hydrologic_Region'].unique())
h_list=[]
for x in hydro_list:
h_list.append(round((estimate_Hydrologic_Region(x,df2014)+estimate_Hydrologic_Region(x,df2015)+
estimate_Hydrologic_Region(x,df2016))/3))
hydro_se = {'Hydrologic_Region' : hydro_list,
'num_Projects_needed': h_list }
hydro_df = pd.DataFrame(hydro_se)
hydro_df
Hydrologic_Region | num_Projects_needed | |
---|---|---|
0 | San Francisco Bay | 8 |
1 | South Coast | 33 |
2 | San Joaquin River | 2 |
3 | Sacramento River | 5 |
4 | North Coast | 1 |
5 | Tulare Lake | 3 |
6 | Colorado River | 2 |
7 | Central Coast | 1 |
8 | South Lahontan | 1 |
t=hydro_df['num_Projects_needed'].sum()
print('Total Number of Desalination Plant Projects Need is',t, "to supply the urban residential-use water in California")
Total Number of Desalination Plant Projects Need is 56 to supply the urban residential-use water in California
t=hydro_df['num_Projects_needed'].sum()
print('Total Number of Desalination Plant Projects Need is',t, "to supply the urban residential-use water in California")
Total Number of Desalination Plant Projects Need is 56 to supply the urban residential-use water in California
If each plant is a billion dollar plant supplying 50 million gallons, our hypothesis was very close to the concluded result (56 vs 51).
1. When we are trying to loop through the unique list of the supplier name or counties' names, we realized that dataframes of different years don't contain the exact same counties' names or supplier's names. Hence, we decided to conclude our result based on the Hydrologic_Region's name. At the same time, we build a model to predict the amount of desalination plant projects needed based on suppliers' names or the counties' names.
2. Why did we switch our project?
We switched our project idea for a few reasons.
We did not have any domain knowledge for the problem that we were trying to solve. Our best guesses at water consumption were just that - guesses. We thought that it would be more practical and more useful to look at how many desalination plants would be needed, rather than where they should be.
The second reason had to do with the algorithm itself. Our original plan, to constrain the centroids to be along the coast, was something that no one had attempted to do before (as far as we'd found). So we simplified our problem a bit, to "Constrain the size of the clusters to be a specific size" as there could only be a certain amount of people serviced by a desalination plant. We also chose to limit our observations to only be along the coast, to make it so that the desalination plant would only service coastal populations. However, this lowered the amount of observations to 743, which would have costed us points in the end. Both because of the complexity of limiting the cluster size and because of the small number of observations, we chose to go with a problem that was better in nearly every way.
3. What are the pitfalls and potential confounds of your data and/or methods?
Our project is very hypothetical since we are unsure of many variables. For example, if we are placing the desalination plant in a more population dense area, the water delivery will be more efficient, but the desalination plant itself will take up a huge space. Therefore placing the plant in the most population dense area might not be the best idea, and we need to come up with a constraint of population density for where we can place the plants. Furthremore, Since the implementation of desalination plants takes relatively long time, and our project is using population as a independent variable, we might also need to consider the growth and decay of population, or the population shifts in California. The water pipe distance and cost to maintain to transport water to plants would also cause the number of plants to change.
Confounds in our design:
The average water use and total population are two key factors to solve our research question. We noticed that the trend of average water use per day is decreasing from 2014 to 2016. As for urban population, we know that the population in California is still growing so in that case, it will be hard to predict the exact number of desalination projects we need in case of total drought in future. Also, if we have more time, we could build another two models to predict the population and the average residential water usage per-capita-day. After that, these two models will help us get more accurate result. This is also data that is collected up to 2016. If we had current data, it would make this estimate even more accurate.