#!/usr/bin/env python # coding: utf-8 # # Exploring Desalination Plant Numbers # ## Christina Cook # ## Dominic Suares # ## Youxi Li # ## Linzhi Xie # ## Erik Mei # # Introduction and Background # ## Research Question: # >In case of the total drought in California, how many desalination plant projects we need to supply residential use water for population who live in urban areas in California? # # ## Hypothesis: # Suppose that a $1 billion desalination plant project can produce 50 million gallons water per day and the average residential water use is about 90 gallons per capita per day. After researching online, we got that urban population in Califonia is [(1)](http://ruralhealth.stanford.edu/health-pros/factsheets/) 28,102,643 and the average water each person use at home per day is about [(2)](https://water.usgs.gov/edu/qa-home-percapita.html) 80-100 gallons. We estimate the total number of desalination plant projects to supply residential-use water for urban population in California by the formula: # > **the total residential-use water** / **one desalination plant project can produce** = (28102643 * 90) / 5e7 = 51 desalination plants # # Our hypothesis is that we need at least 51 desalination plant projects to supply urban residential-use water in Califonia. # # References: # - (1) http://ruralhealth.stanford.edu/health-pros/factsheets/ (For urban population in Califonia: 28,102,643) # - (2) https://water.usgs.gov/edu/qa-home-percapita.html (For How much water does the average person use at home per day? ) # # ## Background: # # ### 1. Why is this question of interest? # California is going to suffer from a crippling megadrought in the next century. If our global temperature raises by 2 degrees celsius, the chance that this occurs is merely 30-60 percent. However, if we continue on the path that we are currently, it is almost certain that we will face a drought significantly worse than any previously recorded drought.(1) Our water infrastructure is not prepared for this, as evidenced by the most recent drought. A drought lasting many years will exhaust all of our freshwater, leaving us with two options: desalination and importing water. Importing water isn't preferable because of transportation costs and the need to rely on other states for a huge amount of our basic needs. Desalination is innefficient, requiring huge investments to obtain even normal prices for the water produced by desalinating ocean water. However, if we want to retain our independence from other states and avoid building massive amounts of infrastructure to import water, we need to build these expensive plants. Not only is this an expensive problem, but it is a problem that requires planning and foresight: the billion dollar desalination plant in Oceanside, for example, took 14 years to build. (2) For these reasons, it's critical to start thinking of how many desalination plants we need to support our population. # # We are interested in this question since we noticed that California has been under the situation of drought for five years. Even though recently, the news mentioned that "2017 could be the end of a dry spell"(3), we still think that it is necessary to prepare for the next drought. After doing some research online, we think that building some desalination plant projects may be a good strategy. Someone may argue that it is too expensive and it is not realistic. However, based on the article,"Israel Proves the Desalination Era Is Here", the desalination project mentioned in the article can produce the low-cost water. " Desalination used to be an expensive energy hog, but the kind of advanced technologies being employed at Sorek have been a game changer. Water produced by desalination costs just a third of what it did in the 1990s. Sorek can produce a thousand liters of drinking water for 58 cents. Israeli households pay about US$30 a month for their water — similar to households in most U.S. cities, and far less than Las Vegas (US\$47) or Los Angeles (US\$58)." (4) In the near future, such desalination project may be a win-win method to solve water shortage problems. # # References: # - (1) https://www.ecowatch.com/megadroughts-2031955357.html # - (2) http://www.latimes.com/local/california/la-me-carlsbad-desalination-20150604-story.html#page=1 # - (3) http://www.latimes.com/local/california/la-me-storms-drought-20170111-story.html (News about 2017 could be the end of a dry spell) # - (4) https://www.scientificamerican.com/article/israel-proves-the-desalination-era-is-here/ (The article about other countries' successful desalination projects and why the desalination project is feasible) # # # ### 2. What background information led you to your hypothesis, and why is this important? # In order to solve our question, we have to figure out total urban population and total residential-use water in California and how many water a desalination plant project can produce per day. Based on the link below, we can know these important information. We mentioned them in the hypothesis part as well. # # References: # - (1) http://ruralhealth.stanford.edu/health-pros/factsheets/ (For urban population in Califonia: 28,102,643) # - (2) https://water.usgs.gov/edu/qa-home-percapita.html (For How much water does the average person use at home per day? ) # - (3) http://www.mercurynews.com/2014/05/29/nations-largest-ocean-desalination-plant-goes-up-near-san-diego-future-of-the-california-coast/ (One $1-billion desalination plant project can produce 50 million gallons water per day.) # # ### 3. What areas do we consider as major urban areas? # # - Based on this website https://en.wikipedia.org/wiki/List_of_California_urban_areas, we will set the cut-off point at the population larger than 50000. # # ### 4. How many gallons water in these urban areas each people are using per day? # # - Based on the dataset, the column 'R-GPCD_Calculated' will be the number of gallons water each people use per day. # - More details about description of our dataset used: Please read the "Urban_Water_Supplier_Monitoring_Data_Definitions.pdf" # # ### 5. How many water can each desalination plant project produce per day? # - Based on the article: One $1-billion desalination plant project can produce 50 million gallons water per day.(http://www.mercurynews.com/2014/05/29/nations-largest-ocean-desalination-plant-goes-up-near-san-diego-future-of-the-california-coast/ ) # # ### 6. How many desalination plant projects we need for a given urban areas? # # - The number of desalination plant projects= the total residence-use water per day / 50 million gallons # # # # Data Description # # ### Data Description # # # | Database Name | Description | # of Observations | Brief Useage | # |:------------ |:-----------|:------------------:|:--------------| # | **[Urban_Water_Supplier_Monitoring_Reports.csv](https://data.ca.gov/dataset/drinking-water-%E2%80%93-urban-water-supplier-monitoring-%E2%80%93-drinc)**| Drinking Water Information Clearinghouse (DRINC) includes water activities such as the amount of water being served to different regions and counties. There are 31 columns in this dataset and for for information, please read the pdf file: 'Urban_Water_Supplier_Monitoring_Data_Definitions.pdf' | 9356 | This dataset allows us to find information we need to solve our research question. We will forcus on these columns, R-GPCD Calculated,Supplier_Name,Hydrologic_Region,County,Reporting_Month Population_Served,R-GPCD_Calculated | # # # # ## Proposed Methods # # **1. How 'clean' is the data? To the extent that it requires cleaning, how will you do so?** # # - Our data set has more information than we need so we plan on dropping the unecessary columns first. # # **2. How will you analyze the data? Be as specific as you can. Briefly mention any pre-processing steps that are required for your methods (for example, checking data distributions and performing any transformations that may be required). Include a brief outline of how you will apply your chosen method(s).** # # - We will seperate our data set based on the year since we noticed that the raw dataset contains different years data. It is convenient for us to work on the data set which contains the same year data and extract information we need to answer our question. # # **3. What do you plan to report? Briefly mention any key visualizations you plan to create, and/or the kind of result you will be able to report that addresses your question (this could be, for example, the outcome of some statistical test(s), prediction error on a model, a model fit parameter, etc.).** # # - We will report our result as a list and at the same time, we will build three models to predict how many desalination plant projects we need based on the supplier name, county name or hydrologic name. For example, if we want to find how many desalination projects we need to substitude one water supplier, we will call the function estimate_supplier to figure it out. For more details,we will show them in the result part. # # ** 4.Include a list of packages you expect to use for you project. If you plan to use packages we have not used in class, add a very brief description about them (a few words is sufficient). ** # # - We plan to use panda, matplotlib.pyplot, and numpy # # # Data Cleaning and Pre-processing # # ### Import Packages # In[78]: get_ipython().run_line_magic('matplotlib', 'inline') import pandas as pd import numpy as np import matplotlib.pyplot as plt # In[79]: df = pd.read_csv('Urban_Water_Supplier_Monitoring_Reports.csv') df.shape # In[80]: #Drop unnecessary columns df=df.drop(['PWSID'],axis=1) df=df.drop(['Days_Allowed/Week'],axis=1) df=df.drop(['Actions_Implemented'],axis=1) df=df.drop(['Complaints'],axis=1) df=df.drop(['Follow-ups'],axis=1) df=df.drop(['Penalties_Rate'],axis=1) df=df.drop(['Penalties_Other'],axis=1) df=df.drop(['Warnings'],axis=1) df=df.drop(['ID'],axis=1) df=df.drop(['Ag_Cert'],axis=1) df=df.drop(['Current_Ag'],axis=1) df=df.drop(['Current_Production_Calculated'],axis=1) df=df.drop(['Mandatory Restrictions'],axis=1) df=df.drop(['2013_Production'],axis=1) df=df.drop(['2013_Ag'],axis=1) df=df.drop(['Current_Recycled'],axis=1) df=df.drop(['Current_Non-revenue'],axis=1) df=df.drop(['Percent_Reduction'],axis=1) df=df.drop(['2013_Production_Calculated'],axis=1) df=df.drop(['CII'],axis=1) df=df.drop(['Days-in-Month'],axis=1) df=df.drop(['Percent_Residential_Use'],axis=1) df=df.drop(['Current_Production'],axis=1) df=df.drop(['Units'],axis=1) df=df.drop(['R_GPCD_Reported '],axis=1) # In[81]: # Figuring out urban areas which we are interested in. df=df.drop(df[df.Population_Served<=50000].index) df.shape # In[82]: df.head(n=10) # ##### Seperate to three dataframe based on reporting_year ... # In[83]: df2016=df[df.Reporting_Month.str.contains("2016") == True] df2015=df[df.Reporting_Month.str.contains("2015") == True] df2014=df[df.Reporting_Month.str.contains("2014") == True] #Remove all rows from above dataframes that have missing data (that have NaN values) df2016= df2016.dropna() df2015= df2015.dropna() df2014= df2014.dropna() # In[84]: #Add one column "Total_Res_use_perDay" which is based on the population served #and average residential water use per-capita-per-day. df2016=df2016.assign(Total_Res_use_PerDay = df2016['R-GPCD_Calculated']*df2016['Population_Served']) df2015=df2015.assign(Total_Res_use_PerDay = df2015['R-GPCD_Calculated']*df2015['Population_Served']) df2014=df2014.assign(Total_Res_use_PerDay = df2014['R-GPCD_Calculated']*df2014['Population_Served']) # In[85]: #Show the form of 2016 dataframe df2016.head(n=5) # # Data Visualization # # ## Histogram Part # #### Plot the histogram to show the potential distribution of residential water usage gallons-per-capita-day in different years # # In[86]: #1. Plot the histogram of number of residential gallons-per-capita-day in 2016 fig = plt.figure(figsize=(15, 8)) cm=plt.cm.get_cmap('RdYlBu_r') data=df2016['R-GPCD_Calculated'] n, bins, patches = plt.hist(data) bin_centers = 0.5 * (bins[:-1] + bins[1:]) plt.xlabel('Gallons of Water ') plt.ylabel('# Counts') plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2016\ (gallons-per-capita-day)}$') col = bin_centers - min(bin_centers) col /= max(col) for c, p in zip(col, patches): plt.setp(p, 'facecolor', cm(c)) plt.grid(True) #number of residential gallons-per-capita-day # 2. Plot the histogram of number of residential gallons-per-capita-day in 2015 fig = plt.figure(figsize=(15, 8)) cm=plt.cm.get_cmap('RdYlGn_r') data=df2015['R-GPCD_Calculated'] n, bins, patches = plt.hist(data) bin_centers = 0.5 * (bins[:-1] + bins[1:]) plt.xlabel('Gallons of Water ') plt.ylabel('# Counts') plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2015\ (gallons-per-capita-day)}$') col = bin_centers - min(bin_centers) col /= max(col) for c, p in zip(col, patches): plt.setp(p, 'facecolor', cm(c)) plt.grid(True) # 3. Plot the histogram of number of residential gallons-per-capita-day in 2014 fig = plt.figure(figsize=(15, 8)) cm=plt.cm.get_cmap('PuOr') data=df2014['R-GPCD_Calculated'] n, bins, patches = plt.hist(data) bin_centers = 0.5 * (bins[:-1] + bins[1:]) plt.xlabel('Gallons of Water ') plt.ylabel('# Counts') plt.title(r'$\mathrm{Histogram\ of\ Residential\ Water\ Usage\ in\ 2014\ (gallons-per-capita-day)}$') col = bin_centers - min(bin_centers) col /= max(col) for c, p in zip(col, patches): plt.setp(p, 'facecolor', cm(c)) plt.grid(True) # ## Bar Chart Part # ###### For R-GPCD_Calculated in each County # R-GPCD: Residential Gallons-per-capita-day # In[87]: #A bar chart showing R-GPCD_Calculated in each county based on df206 countyList = df2016['County'].unique() countyList.sort() countyGPCDdict = {} gpcdList16 = [] for i in range(len(countyList)): currRGPCD = df2016[df2016['County'] == countyList[i]]['R-GPCD_Calculated'].mean() countyGPCDdict[countyList[i]] = currRGPCD countyGPCDdict = pd.Series(countyGPCDdict) countyGPCDdict.sort_values(inplace = True) countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2016') # randList = list(range(0,len(countyList))) # plt.figure(figsize=(20,15)) # plt.bar(randList, gpcdList16, align='center', alpha=0.5) # plt.xlabel('Counties') # plt.ylabel('Average R-GPCD in 2016') # plt.ylim(ymax = 220) # plt.xticks(randList, countyList, fontsize = 8, rotation = 'vertical') # plt.title(r'$\mathrm{Bar\ of\ Residential\ Gallons-per-capita-day(gallons-per-capita-day)\ 2016}$') # plt.show() # In[88]: #A bar chart showing R-GPCD_Calculated in each county based on df2015 countyList = df2015['County'].unique() countyList.sort() countyGPCDdict = {} gpcdList16 = [] for i in range(len(countyList)): currRGPCD = df2015[df2015['County'] == countyList[i]]['R-GPCD_Calculated'].mean() countyGPCDdict[countyList[i]] = currRGPCD countyGPCDdict = pd.Series(countyGPCDdict) countyGPCDdict.sort_values(inplace = True) countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2015') # In[89]: #A bar chart showing R-GPCD_Calculated in each county based on df2014 countyList = df2014['County'].unique() countyList.sort() countyGPCDdict = {} gpcdList16 = [] for i in range(len(countyList)): currRGPCD = df2014[df2014['County'] == countyList[i]]['R-GPCD_Calculated'].mean() countyGPCDdict[countyList[i]] = currRGPCD countyGPCDdict = pd.Series(countyGPCDdict) countyGPCDdict.sort_values(inplace = True) countyGPCDdict.plot(kind = 'bar', grid = True, figsize = (20,8), title = 'Average Gallons Per Capita Day for each county in 2014') # ### Line Chart # In[90]: # Plot a line chart showing the tendency of the average water use (units in gallons) mean_14 = df2014['R-GPCD_Calculated'].mean() mean_15 = df2015['R-GPCD_Calculated'].mean() mean_16 = df2016['R-GPCD_Calculated'].mean() years = [2014,2015,2016] waterUsage = [mean_14, mean_15, mean_16] f = plt.figure(figsize=(6, 8)) ax = plt.axes() plt.xlabel('Years') plt.ylabel('Average Water Use per-capita-per-day (in gallons)') plt.title(r'$\mathrm{Trend\ of\ Average\ Water\ Use\ per-capita-day\ from\ 2014\ to\ 2016 }$') ax.set_xticks(years) plt.ylim((50,150)) plt.plot(years, waterUsage) plt.show() # In[91]: #This is the function which helps us find the population based on the dataframe def estimate_popuation(df): supplier_list =list(df['Supplier_Name'].unique()) ttpop=0 for i in range(len(supplier_list)): df1 = df[df['Supplier_Name'] == supplier_list[i]] ttpop=ttpop+df1.groupby('Supplier_Name')['Population_Served'].mean()[0] return(round(ttpop)) # In[92]: #2. Plot line chart to show the change of urban population in California #2014 urban population: point1=estimate_popuation(df2014) #2015 urban population: point2=estimate_popuation(df2015) #2016 urban population: point3=estimate_popuation(df2016) years = [2014,2015,2016] population = [point1, point2, point3] f = plt.figure(figsize=(8, 6)) ax = plt.axes() plt.xlabel('Years') plt.ylabel('Population (Units *10^7)') plt.title(r'$\mathrm{Trend\ of\ Urban\ Population\ from\ 2014\ to\ 2016 }$') ax.set_xticks(years) plt.plot(years, population) plt.ylim((2.9e7,3.18e7)) plt.show() # # Data Analysis and Results # # # # #### Estimating how many desalination plant projects we need in case of total drought based on a Counties' name or Water Supplier Names or Hydrologic_Region's name (From the list below) # In[93]: hydro_list =list(df['Hydrologic_Region'].unique()) supplier_list =list(df2016['Supplier_Name'].unique()) county_list = list(df['County'].unique()) # In[94]: # Function based on the name of County def estimate_county(county,df): gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day df1=df[df.County.str.contains(county) == True] m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallons of water each person need per day # Find the population based on new data frame df1. # We use groupby and take the mean since we noticed that even under the same supplier district, # the population served each month is slightly changing so we decided to take the mean of the the population served in these months in the data frame. ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean() ttpop=round(ttpop.sum()) ttwater=ttpop*m_ppw the_number_of_projectsneed= round(ttwater /gallons_water_per_project) if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation the_number_of_projectsneed=1 return (the_number_of_projectsneed) # In[95]: # Function based on the name of Supplier_Name def estimate_supplier(supplier,df): gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day df1=df[df.Supplier_Name.str.contains(supplier) == True] #Get on new data frame which contains the information which related to our interests. m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallsons of water each person need per day # Find the population based on the new data frame ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean() ttpop=round(ttpop.sum()) ttwater=ttpop*m_ppw the_number_of_projectsneed= round(ttwater /gallons_water_per_project) if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation the_number_of_projectsneed=1 return (the_number_of_projectsneed) # In[96]: # Function based on the name of Hydrologic_Region def estimate_Hydrologic_Region(Hydrologic_Region,df): gallons_water_per_project=5e7 # Number of gallons water per desalination plants project can produced per day df1=df[df.Hydrologic_Region.str.contains(Hydrologic_Region) == True] m_ppw=df1['R-GPCD_Calculated'].mean() #Mean of gallsons of water each person need per day # Find the population based on new data frame ttpop=df1.groupby('Supplier_Name')['Population_Served'].mean() ttpop=round(ttpop.sum()) ttwater=round(ttpop*m_ppw ) the_number_of_projectsneed= round(ttwater /gallons_water_per_project) if the_number_of_projectsneed==0: #In case, it is zero. We should prepare for at least one for this situation the_number_of_projectsneed=1 return (the_number_of_projectsneed) # ### Results # Overview: # 1. We have built three models to predict how many desalination plant projects we need in the previous part. In the result part,we will show how to use them. # # 2. We are going to make one dataframe to show our results based on Hydrologic_Region # > (There are two columns in the dataframe. One is for the Hydrologic_Region name and another one is for number of projects we need) # ### 1. Predicting the number of desalination plants # In[97]: #Prediction based on the county's names: (we will call the function we wrote before) #Example: county_example = estimate_county('ALAMEDA',df2014) # Prediction based on the supplier names: (we will call the function we wrote before) #Example: supplier_example = estimate_supplier('Milpitas City of',df2014) # Prediction based on the Hydrologic_Region: (we will call the function we wrote before) # Example: hydrologic_example = estimate_Hydrologic_Region('San Joaquin River',df2014) # In[98]: print("By building three models to predict deslaination plants, we can deterimine the number of plants, supplier, and hydrologic religions affected. \n") print("Based on the county's name:") print("For example, in Alameda in 2014, there can be", county_example, "desalination plant project(s)\n") print("Based on the supplier name:") print("For example: In Milpitas City in 2014, it needs", supplier_example, "desalination plant project(s)\n") print("Based on the Hydrolic Region:") print("For example: In the San Joaquin River region in 2014, it needs", hydrologic_example, "desalination plant project(s)\n") # ### 2. Predicting desalination plant projects based on hydrologic region # Based on the Hydrolic Region, we can predict a certain amount of plant projects # In[99]: hydro_list =list(df['Hydrologic_Region'].unique()) h_list=[] for x in hydro_list: h_list.append(round((estimate_Hydrologic_Region(x,df2014)+estimate_Hydrologic_Region(x,df2015)+ estimate_Hydrologic_Region(x,df2016))/3)) hydro_se = {'Hydrologic_Region' : hydro_list, 'num_Projects_needed': h_list } hydro_df = pd.DataFrame(hydro_se) hydro_df # In[100]: t=hydro_df['num_Projects_needed'].sum() print('Total Number of Desalination Plant Projects Need is',t, "to supply the urban residential-use water in California") # # Conclusion / Dicussion # # ### Conclusion: # In[101]: t=hydro_df['num_Projects_needed'].sum() print('Total Number of Desalination Plant Projects Need is',t, "to supply the urban residential-use water in California") # If each plant is a billion dollar plant supplying 50 million gallons, our hypothesis was very close to the concluded result (56 vs 51). # ### Discussion: # **1.** When we are trying to loop through the unique list of the supplier name or counties' names, we realized that dataframes of different years don't contain the exact same counties' names or supplier's names. Hence, we decided to conclude our result based on the Hydrologic_Region's name. At the same time, we build a model to predict the amount of desalination plant projects needed based on suppliers' names or the counties' names. # # **2. Why did we switch our project?** # # We switched our project idea for a few reasons. # # We did not have any domain knowledge for the problem that we were trying to solve. Our best guesses at water consumption were just that - guesses. We thought that it would be more practical and more useful to look at how many desalination plants would be needed, rather than where they should be. # # The second reason had to do with the algorithm itself. Our original plan, to constrain the centroids to be along the coast, was something that no one had attempted to do before (as far as we'd found). So we simplified our problem a bit, to "Constrain the size of the clusters to be a specific size" as there could only be a certain amount of people serviced by a desalination plant. We also chose to limit our observations to only be along the coast, to make it so that the desalination plant would only service coastal populations. However, this lowered the amount of observations to 743, which would have costed us points in the end. Both because of the complexity of limiting the cluster size and because of the small number of observations, we chose to go with a problem that was better in nearly every way. # # **3. What are the pitfalls and potential confounds of your data and/or methods?** # # Our project is very hypothetical since we are unsure of many variables. For example, if we are placing the desalination plant in a more population dense area, the water delivery will be more efficient, but the desalination plant itself will take up a huge space. Therefore placing the plant in the most population dense area might not be the best idea, and we need to come up with a constraint of population density for where we can place the plants. # Furthremore, Since the implementation of desalination plants takes relatively long time, and our project is using population as a independent variable, we might also need to consider the growth and decay of population, or the population shifts in California. # The water pipe distance and cost to maintain to transport water to plants would also cause the number of plants to change. # # **Confounds in our design:** # # The average water use and total population are two key factors to solve our research question. We noticed that the trend of average water use per day is decreasing from 2014 to 2016. As for urban population, we know that the population in California is still growing so in that case, it will be hard to predict the exact number of desalination projects we need in case of total drought in future. Also, if we have more time, we could build another two models to predict the population and the average residential water usage per-capita-day. After that, these two models will help us get more accurate result. This is also data that is collected up to 2016. If we had current data, it would make this estimate even more accurate. # #