Retrieve State Usage Data

This notebook downloads water use data for a user-specified state and year from the National Water Information System (NWIS) server and translates these data into a physical supply and use table (PSUT), using a preformatted PSUT template and a value mapping table.

Original data are from the USGS state water programs. Below is a sample URL for Louisiana: https://waterdata.usgs.gov/la/nwis/water_use?format=rdb&rdb_compression=value&wu_area=County&wu_year=2000%2C2005%2C2010&wu_county=ALL&wu_category=ALL&wu_county_nms=--ALL%2BCounties--&wu_category_nms=--ALL%2BCategories--

Workflow:
  • Get the state and year from the user
  • Get the filenames for the PSUT template and value mapping table
  • Construct the URL and retrieve the data into a pandas data frame
  • Melt/gather the usage columns into row values under the column name 'Group'
  • Remove rows with no usage data (identified by not having "Mgal" in the 'Group' name)
  • Join the value mapping table, which assigns each usage value to one or more PSUT cell addresses
  • Write the (summed) values into a copy of the PSUT template and save it as the output workbook

Required 3rd party modules

The following 3rd party modules may require installation before running this notebook. Each can be installed using pip:

  • Pandas: pip install pandas
  • OpenPyXL: pip install openpyxl

Created September 2017
[email protected]


Begin by specifying the 2-character state code and the year to process

In [1]:
#Specify the state and year to process
state = 'la' #Louisiana
year = 2010

Next, specify the paths to the PSUT template and the value mapping file

In [2]:
#Get the input filenames
templateFilename = '../Data/Templates/StatePSUT_Template.xlsx'
remapFilename = '../Data/RemapTables/StatePSUTLookup.csv'

The rest is automated...

First, we import the required packages, installing 3rd party packages if necessary

In [3]:
#Import built-in modules
import sys, os
from shutil import copyfile
from urllib.parse import urlencode  #Python 3; on Python 2 use: from urllib import urlencode
In [4]:
#Import pandas, install if necessary...
try:
    import pandas as pd
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas'])
    import pandas as pd
In [5]:
#Import openpyxl, install if necessary...
try:
    from openpyxl import load_workbook
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'openpyxl'])
    from openpyxl import load_workbook
Output data will be named with the state code and year (e.g. la_2010.xlsx) and will be saved in the StateData subfolder of the Data directory.

This folder will be created if it does not already exist, and the PSUT template will then be copied to this location using the output filename.

In [6]:
#Create the StateData folder, if not already present
outFolder = '..\\Data\\StateData'
if not os.path.exists(outFolder):
    os.mkdir(outFolder)
In [7]:
#Set the output filename, putting it in the output data folder
outFilename = outFolder + os.sep + '{0}_{1}.xlsx'.format(state,year)
print("Output will be written to " + outFilename)
Output will be written to ..\Data\StateData\la_2010.xlsx
In [8]:
#Copy the template to the output filename
copyfile(src=templateFilename,dst=outFilename)
Load in the value mapping table as a data frame
In [9]:
#Get the remap table and load as a dataframe
dfRemap = pd.read_csv(remapFilename,dtype='str',index_col="Index")
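The exact contents of StatePSUTLookup.csv are not shown in this notebook, but based on the merge output below (Out[17]), each record pairs a usage 'Group' (the original NWIS column name) with up to three PSUT cell addresses (Column1/Row1 through Column3/Row3), keyed by an Index column. A hypothetical record might look like the following (the Index value and the blank second/third coordinates are illustrative only):

Index,Group,Column1,Row1,Column2,Row2,Column3,Row3
1,"Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d",N,31,,,,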
Construct the URL for data retrieval and then pull the online data into an in-memory data frame named dfRaw.
In [10]:
#Set the data URL path and parameters and construct the url
path = 'https://waterdata.usgs.gov/{}/nwis/water_use?'.format(state)
values = {'format':'rdb',
         'rdb_compression':'value',
         'wu_area':'County',
         'wu_year': year,
         'wu_county':'ALL',
         'wu_county_nms':'--ALL+Counties--',
         'wu_category_nms':'--ALL+Categories--'
        }
url = path + urlencode(values)
In [11]:
#Pull data in using the URL and remove the 2nd row of headers
dfRaw = pd.read_table(url,comment='#',header=[0,1],na_values='-')
dfRaw.columns = dfRaw.columns.droplevel(level=1)
In [12]:
#CHECK: Display a sample of the retrieved data
dfRaw.head()
Out[12]:
state_cd state_name county_cd county_nm year Total Population total population of area, in thousands Public Supply population served by groundwater, in thousands Public Supply population served by surface water, in thousands Public Supply total population served, in thousands Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d ... Hydroelectric Power total offstream surface-water withdrawals in Mgal/d Hydroelectric Power power generated by instream use, in gigawatt-hours Hydroelectric Power power generated by offstream use, in gigawatt-hours Hydroelectric Power total power generated, in gigawatt-hours Hydroelectric Power number of instream facilities Hydroelectric Power number of offstream facilities Hydroelectric Power total number of facilities Wastewater Treatment returns by public wastewater facilities, in Mgal/d Wastewater Treatment number of public wastewater facilities Wastewater Treatment reclaimed wastewater released by wastewater facilities, in Mgal/d
0 22 Louisiana 1 Acadia Parish 2010 61.773 44.666 0.000 44.666 5.82 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 22 Louisiana 3 Allen Parish 2010 25.764 22.573 0.000 22.573 4.27 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 22 Louisiana 5 Ascension Parish 2010 107.215 29.510 27.094 56.604 3.02 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 22 Louisiana 7 Assumption Parish 2010 23.421 0.000 23.041 23.041 0.00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 22 Louisiana 9 Avoyelles Parish 2010 42.073 39.756 0.000 39.756 3.85 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 281 columns

With the data now held locally, we reformat the table so that each of the usage (and other) columns is "melted" into rows: the original column name is stored in a new 'Group' field and its value in an 'MGal' field. This facilitates subsequent analysis, which includes:
  • Removing rows (formerly columns) that don't report volume, e.g. population-served values, and
  • Ensuring that volume data are stored as floating point numbers.
In [13]:
#Tidy the data: transform so data in each usage column become row values with a new column listing the usage type
rowHeadings = ['county_cd', 'county_nm', 'state_cd', 'state_name', 'year']
dfTidy = pd.melt(dfRaw,id_vars=rowHeadings,value_name='MGal',var_name='Group')
print("Data transformed from {0} rows/columns to {1} rows/columns".format(dfRaw.shape, dfTidy.shape))
Data transformed from (64, 281) rows/columns to (17664, 7) rows/columns
In [14]:
#Remove rows that don't have volume data (i.e. keep only rows whose 'Group' value contains 'Mgal')
dfTidy = dfTidy[dfTidy['Group'].str.contains('Mgal')].copy(deep=True)
dfTidy.shape
Out[14]:
(15040, 7)
In [15]:
#Change the type of the MGal column to float 
dfTidy['MGal'] = dfTidy.MGal.astype("float")
In [16]:
#CHECK: Show the structure of the 'tidied' data frame
dfTidy.head()
Out[16]:
county_cd county_nm state_cd state_name year Group MGal
256 1 Acadia Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 5.82
257 3 Allen Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 4.27
258 5 Ascension Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 3.02
259 7 Assumption Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 0.00
260 9 Avoyelles Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 3.85
Next, we join the value mapping table to the tidied data frame, linking each value (in Mgal/d) to the address of the PSUT cell (or cells) into which it should be ported.
In [17]:
#Join the remap table
dfAll = pd.merge(dfTidy,dfRemap,how='inner',left_on="Group",right_on="Group")
dfAll.head()
Out[17]:
county_cd county_nm state_cd state_name year Group MGal Column1 Row1 Column2 Row2 Column3 Row3
0 1 Acadia Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 5.82 N 31 NaN NaN NaN NaN
1 3 Allen Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 4.27 N 31 NaN NaN NaN NaN
2 5 Ascension Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 3.02 N 31 NaN NaN NaN NaN
3 7 Assumption Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 0.00 N 31 NaN NaN NaN NaN
4 9 Avoyelles Parish 22 Louisiana 2010 Public Supply self-supplied groundwater withdr... 3.85 N 31 NaN NaN NaN NaN
With the tables joined, we can open the output workbook and start porting over values.
In [18]:
os.path.exists(outFilename)
Out[18]:
True
In [19]:
#Open the spreadsheet template
wb = load_workbook(filename=outFilename)
ws = wb['template']
ws.title = str(year)
Notes on mapping values in the dataframe to PSUT cells
  • As some water usage categories are inserted into more than one PSUT cell (some in as many as three), the value mapping file contains three sets of cell coordinates (i.e. column/row pairs). Thus, we repeat the process of reading the dataframe and mapping values to the PSUT three times.

  • In cases where two or more usage categories map to the same PSUT cell, the values are summed. This is done in the dfAll.groupby statement.

  • Where no data exist for a particular usage category, the value n/a is inserted.

In [20]:
#Loop through the first set of column/row pairs and insert values into the Excel spreadsheet
dfRound1 = dfAll.groupby(['Column1','Row1'])['MGal'].sum()
dfRound1.fillna(value="n/a",inplace=True)
for (column,row), value in dfRound1.items():
    #Set the value in the workbook cell (e.g. 'N31')
    cellRef = str(column) + str(row)
    ws[cellRef] = value
#Save the workbook
wb.save(outFilename)
In [21]:
#Loop through the second set of column/row pairs and insert values into the Excel spreadsheet
dfRound2 = dfAll.groupby(['Column2','Row2'])['MGal'].sum()
dfRound2.fillna(value="n/a",inplace=True)
for (column,row), value in dfRound2.items():
    #Set the value in the workbook cell
    cellRef = str(column) + str(row)
    ws[cellRef] = value
#Save the workbook
wb.save(outFilename)
In [22]:
#Loop through the third set of column/row pairs and insert values into the Excel spreadsheet
dfRound3 = dfAll.groupby(['Column3','Row3'])['MGal'].sum()
dfRound3.fillna(value="n/a",inplace=True)
for (column,row), value in dfRound3.items():
    #Set the value in the workbook cell
    cellRef = str(column) + str(row)
    ws[cellRef] = value
#Save the workbook
wb.save(outFilename)
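As an optional sanity check (not part of the original workflow), the saved workbook can be reopened and a single mapped cell read back; cell 'N31' is used here only because Out[17] above shows it as a mapped target for the Public Supply groundwater withdrawals category:

#Optional check: reload the saved workbook and read back one mapped cell
wbCheck = load_workbook(filename=outFilename)
print(wbCheck[str(year)]['N31'].value)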
Finally, open the output spreadsheet
In [23]:
os.startfile(outFilename)
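Note that os.startfile is available on Windows only. On other platforms, a minimal sketch along these lines could be used instead (assuming the standard 'open'/'xdg-open' launchers are available):

#Open the output file with the platform's default application
import os, subprocess, sys
if sys.platform.startswith('win'):
    os.startfile(outFilename)                    #Windows
elif sys.platform == 'darwin':
    subprocess.call(['open', outFilename])       #macOS
else:
    subprocess.call(['xdg-open', outFilename])   #most Linux desktops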