Owner: Joanne Bogart @jrbogart
Last Verified to Run: 2020-08-03
This notebook is an introduction to the use of the PostgreSQL database at NERSC. Currently the fully supported datasets are the object catalogs for Run1.2i, Run1.2p v4, and Run2.1i dr1b v1. Since object catalogs, as well as other kinds of catalogs, are also available via GCR, one might question the need for another form of access. The justification (for applications using only object catalogs) is performance: typical queries, such as the one labeled q5 below, run significantly faster. Ingest also tends to be faster; the Run1.2p v4 data were available via PostgreSQL within a day of the completion of stack processing and transfer to NERSC.
Learning objectives:
After going through this notebook, you should be able to connect to the DESC DC2 PostgreSQL database at NERSC, discover what tables and views it contains and what their columns are, and compose and execute queries, retrieving the results for further analysis in Python.
Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To set up your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC
The first part consists of a typical try-as-you-go interactive tutorial, followed by strictly expository sections starting with the SQL Primer. Those new to SQL should read them in order (more technical sections may be skipped), either before or after the interactive sections, depending on preferred learning style; others may find them useful for reference.
To connect, you need a .pgpass file in your home directory at NERSC containing the line

nerscdb03.nersc.gov:54432:desc_dc2_drp:desc_dc2_drp_user:password

where password is the actual password for the account. This line allows you to use the desc_dc2_drp_user account, which has SELECT privileges on the database, without entering a password in plain text. There is a separate account for adding to or modifying the database. .pgpass must be protected so that only the owner may read and write it.
You can obtain the file by running the script /global/common/software/lsst/dbaccess/postgres_reader.sh. It will copy a suitable file to your home directory and set permissions. If you already have a .pgpass file in your home directory, the script will stop without doing anything to avoid clobbering your file. In that case, see the file reader.pgpass in the same directory; you can merge it into your .pgpass file by hand.
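If you want to double-check the setup, the following is a minimal sketch (assuming a POSIX home directory; the variable names are just for illustration) verifying that ~/.pgpass exists and is readable and writable only by the owner, as PostgreSQL clients require:

import os, stat

# Check that ~/.pgpass exists and has owner-only permissions (mode 600 or stricter)
pgpass = os.path.expanduser('~/.pgpass')
mode = stat.S_IMODE(os.stat(pgpass).st_mode)
if mode & (stat.S_IRWXG | stat.S_IRWXO):
    print('~/.pgpass is accessible to others; fix with: chmod 600 ~/.pgpass')
else:
    print('~/.pgpass permissions look OK')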
This notebook uses psycopg2 directly for queries. It is also possible to use sqlalchemy, but you will still need a PostgreSQL driver; of these, psycopg2 is the most popular.
import psycopg2
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
Make the db connection
dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)
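As a quick sanity check (not part of the original flow), you can ask the server for its version string; if the connection is unusable, this query will raise an exception:

# Optional check that the connection is usable
with dbconn.cursor() as cursor:
    cursor.execute('SELECT version()')
    print(cursor.fetchone()[0])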
schema = 'run12i'
# schema = 'run12p_v4'
# schema = 'run21i_dr1b_v1'
Tables for the Run1.2i data, as well as a view to make DPDD quantities more easily accessible, are in the schema (which acts like a namespace) run12i. The value for schema above will change for other datasets.
There is a special system schema, information_schema, which contains tables describing the structure of user tables. Of these information_schema.columns is most likely to be useful. The following lists all tables and views belonging to schema run12i. (I will use the convention of writing SQL keywords in all caps in queries. It's not necessary; the SQL interpreter ignores case.)
q1 = "SELECT DISTINCT table_name FROM information_schema.columns WHERE table_schema='{schema}' ORDER BY table_name".format(**locals())
with dbconn.cursor() as cursor:
    # Could have several queries interspersed with other code in this block
    cursor.execute(q1)
    for record in cursor:
        print(record[0])
dpdd
dpdd_forced
dpdd_ref
forced2
forced3
forced4
forced5
misc_ref
position
_temp:forced_patch
_temp:forced_patch is an artifact of the ingest process which is of no interest here.
All the remaining entries, with the exception of dpdd, are tables. Each table has rows of data, one per object in the catalog. The rows consist of "native quantities" as produced by running the DM stack on the simulated data. dpdd is a view* making the quantities identified in GCRCatalogs/SCHEMA.md available. Information is broken across several tables because there are too many columns for a single table. All tables have a field object_id. In the dpdd view it's called objectId because that's the official name for it. The following code will list all quantities in the dpdd view. Note the database ignores case; all quantities appear in lower case only.
*A view definition consists of references to quantities stored in the tables or computable from them; the view has no data of its own. The view name is used in queries just like a table name.
tbl = 'dpdd'
q2 = "SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}' order by column_name ".format(**locals())
print(q2)
with dbconn.cursor() as cursor:
    cursor.execute(q2)
    records = cursor.fetchall()
    print("There are {} columns in table {}. They are:\n".format(len(records), tbl))
    print("Name                                                    Data Type")
    for record in records:
        print("{0!s:55} {1!s:20}".format(record[0], record[1]))
SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='run12i' AND table_name='dpdd' order by column_name

There are 118 columns in table dpdd. They are:

Name                 Data Type
blendedness          real
clean                boolean
coord                USER-DEFINED
dec                  double precision
extendedness         real
good                 boolean
i_flag               boolean
i_flag_g             boolean
i_flag_i             boolean
i_flag_r             boolean
i_flag_u             boolean
i_flag_y             boolean
i_flag_z             boolean
ixx                  real
ixx_g                real
ixx_i                real
ixxpsf               real
ixxpsf_g             real
ixxpsf_i             real
ixxpsf_r             real
ixxpsf_u             real
ixxpsf_y             real
ixxpsf_z             real
ixx_r                real
ixx_u                real
ixx_y                real
ixx_z                real
ixy                  real
ixy_g                real
ixy_i                real
ixypsf               real
ixypsf_g             real
ixypsf_i             real
ixypsf_r             real
ixypsf_u             real
ixypsf_y             real
ixypsf_z             real
ixy_r                real
ixy_u                real
ixy_y                real
ixy_z                real
iyy                  real
iyy_g                real
iyy_i                real
iyypsf               real
iyypsf_g             real
iyypsf_i             real
iyypsf_r             real
iyypsf_u             real
iyypsf_y             real
iyypsf_z             real
iyy_r                real
iyy_u                real
iyy_y                real
iyy_z                real
magerr_g             real
magerr_g_cmodel      real
magerr_i             real
magerr_i_cmodel      real
magerr_r             real
magerr_r_cmodel      real
magerr_u             real
magerr_u_cmodel      real
magerr_y             real
magerr_y_cmodel      real
magerr_z             real
magerr_z_cmodel      real
mag_g                real
mag_g_cmodel         real
mag_i                real
mag_i_cmodel         real
mag_r                real
mag_r_cmodel         real
mag_u                real
mag_u_cmodel         real
mag_y                real
mag_y_cmodel         real
mag_z                real
mag_z_cmodel         real
objectid             bigint
parentobjectid       bigint
psf_fwhm_g           double precision
psf_fwhm_i           double precision
psf_fwhm_r           double precision
psf_fwhm_u           double precision
psf_fwhm_y           double precision
psf_fwhm_z           double precision
psfluxerr_g          real
psfluxerr_i          real
psfluxerr_r          real
psfluxerr_u          real
psfluxerr_y          real
psfluxerr_z          real
psflux_flag_g        boolean
psflux_flag_i        boolean
psflux_flag_r        boolean
psflux_flag_u        boolean
psflux_flag_y        boolean
psflux_flag_z        boolean
psflux_g             real
psflux_i             real
psflux_r             real
psflux_u             real
psflux_y             real
psflux_z             real
psndata              real
ra                   double precision
snr_g_cmodel         real
snr_i_cmodel         real
snr_r_cmodel         real
snr_u_cmodel         real
snr_y_cmodel         real
snr_z_cmodel         real
x                    double precision
xerr                 real
xy_flag              boolean
y                    double precision
yerr                 real
Here is a similar query for the position table.
tbl = 'position'
q2_pos = "SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}'".format(**locals())
with dbconn.cursor() as cursor:
    cursor.execute(q2_pos)
    records = cursor.fetchall()
    print("There are {} columns in table {}. They are:\n".format(len(records), tbl))
    print("Name                                                    Data Type")
    for record in records:
        print("{0!s:55} {1!s:20}".format(record[0], record[1]))
There are 28 columns in table position. They are:

Name                 Data Type
coord                USER-DEFINED
parent               bigint
deblend_nchild       integer
detect_isprimary     boolean
extinction_bv        real
detect_ispatchinner  boolean
detect_istractinner  boolean
merge_footprint_i    boolean
merge_footprint_r    boolean
merge_footprint_z    boolean
merge_footprint_y    boolean
merge_footprint_g    boolean
merge_footprint_u    boolean
merge_footprint_sky  boolean
merge_peak_i         boolean
merge_peak_r         boolean
merge_peak_z         boolean
merge_peak_y         boolean
merge_peak_g         boolean
merge_peak_u         boolean
merge_peak_sky       boolean
merge_measurement_i  boolean
merge_measurement_r  boolean
merge_measurement_z  boolean
merge_measurement_y  boolean
merge_measurement_g  boolean
merge_measurement_u  boolean
object_id            bigint
Here is a query which counts up objects per tract and stores the results (queries return a list of tuples) in a pandas DataFrame. It makes use of a user-defined function (UDF*) tract_from_object_id, which is by far the fastest way to determine the tract.

*The UDF tract_from_object_id is one of several which minimize query time by making optimal use of the structure of the database. See the second tutorial in this series for a discussion of some of the others.
q3 = "SELECT tract_from_object_id(object_id), COUNT(object_id) FROM {schema}.position WHERE detect_isprimary GROUP BY tract_from_object_id(object_id)".format(**locals())
with dbconn.cursor() as cursor:
    %time cursor.execute(q3)
    df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])
    print(df)
CPU times: user 2.7 ms, sys: 0 ns, total: 2.7 ms
Wall time: 1.02 s
    tract   count
0    4429  108716
1    4637  203074
2    4850  202172
3    5063  233982
4    4639  204060
5    4430  168003
6    4433    4215
7    5065  154926
8    4431  177438
9    4851  197206
10   4636   15652
11   4849  250167
12   4640  117612
13   4852   14887
14   5062   25020
15   4432  151792
16   5066   81879
17   4848  137348
18   4638  203034
19   5064  189304
Here is the same query, but made on the dpdd view rather than the position table. There is no need to specify "is primary" because the dpdd view already has this constraint.
q4 = "SELECT tract_from_object_id(objectid), COUNT(objectid) FROM {schema}.dpdd GROUP BY tract_from_object_id(objectid)".format(**locals())
with dbconn.cursor() as cursor:
    cursor.execute(q4)
    df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])
    print(df)
    tract   count
0    4429  108716
1    4637  203074
2    4850  202172
3    5063  233982
4    4639  204060
5    4430  168003
6    4433    4215
7    5065  154926
8    4431  177438
9    4851  197206
10   4636   15652
11   4849  250167
12   4640  117612
13   4852   14887
14   5062   25020
15   4432  151792
16   5066   81879
17   4848  137348
18   4638  203034
19   5064  189304
The following can be compared with a similar query in the GCR Intro notebook.
q5 = "SELECT ra,dec FROM {schema}.dpdd".format(**locals())
with dbconn.cursor() as cursor:
    %time cursor.execute(q5)
    %time records = cursor.fetchall()
CPU times: user 606 ms, sys: 386 ms, total: 992 ms
Wall time: 26.2 s
CPU times: user 1.66 s, sys: 213 ms, total: 1.87 s
Wall time: 1.87 s
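The fetched records are a list of (ra, dec) tuples. As a quick look at what came back (a sketch; df_radec and the plot settings are illustrative, not part of the original notebook), you can bin the positions into a density map:

# Bin the fetched (ra, dec) pairs into a 2D density map
df_radec = pd.DataFrame(records, columns=['ra', 'dec'])
plt.figure(figsize=(6, 6))
plt.hexbin(df_radec['ra'], df_radec['dec'], gridsize=200, cmap='viridis')
plt.xlabel('ra (degrees)')
plt.ylabel('dec (degrees)')
plt.colorbar(label='objects per bin')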
Techniques are adapted from the notebook object_pandas_stellar_locus. Put some typical cuts in a WHERE clause: select clean objects (no flagged pixels; not skipped by the deblender) for which the signal to noise in the bands of interest is acceptable.
global_cuts = 'clean '
t = None
min_SNR = 25
max_err = 1/min_SNR
band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)
where = ' WHERE ' + global_cuts + ' AND ' + band_cuts
q6 = "SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd ".format(**locals()) + where
print(q6)
records = []
with dbconn.cursor() as cursor:
    %time cursor.execute(q6)
    records = cursor.fetchall()
nObj = len(records)
df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])
SELECT mag_g,mag_r,mag_i FROM run12i.dpdd WHERE clean AND (magerr_g < 0.04) AND (magerr_i < 0.04) AND (magerr_r < 0.04)
CPU times: user 97.8 ms, sys: 42 ms, total: 140 ms
Wall time: 27 s
def get_stellar_locus_davenport(color1='gmr', color2='rmi',
                                datafile='../tutorials/assets/Davenport_2014_MNRAS_440_3430_table1.txt'):
    # Alternative if running from the tutorials directory:
    # datafile='assets/Davenport_2014_MNRAS_440_3430_table1.txt'
    data = pd.read_csv(datafile, sep=r'\s+', header=1)
    return data[color1], data[color2]
def plot_stellar_locus(color1='gmr', color2='rmi',
                       color='red', linestyle='--', linewidth=2.5):
    model_gmr, model_rmi = get_stellar_locus_davenport(color1, color2)
    plot_kwargs = {'linestyle': linestyle, 'linewidth': linewidth, 'color': color,
                   'scalex': False, 'scaley': False}
    plt.plot(model_gmr, model_rmi, **plot_kwargs)
def plot_color_color(z, color1, color2, range1=(-1, +2), range2=(-1, +2), bins=31, title=None):
    """Plot a color-color diagram. Overlay stellar locus."""
    band1, band2 = color1[0], color1[-1]
    band3, band4 = color2[0], color2[-1]
    H, xedges, yedges = np.histogram2d(
        z['mag_%s' % band1] - z['mag_%s' % band2],
        z['mag_%s' % band3] - z['mag_%s' % band4],
        range=(range1, range2), bins=bins)

    zi = H.T
    xi = (xedges[1:] + xedges[:-1])/2
    yi = (yedges[1:] + yedges[:-1])/2

    cmap = 'viridis_r'
    plt.figure(figsize=(8, 8))
    plt.pcolormesh(xi, yi, zi, cmap=cmap)
    plt.contour(xi, yi, zi)
    plt.xlabel('%s-%s' % (band1, band2))
    plt.ylabel('%s-%s' % (band3, band4))
    if title is not None:
        plt.suptitle(title, size='xx-large', y=0.92)

    plot_stellar_locus(color1, color2)
plot_color_color(df, 'gmr', 'rmi')
print('Using schema {}, cut on max err={}, found {} objects'.format(schema, max_err, nObj))
Using schema run12i, cut on max err=0.04, found 342338 objects
Now make the same plot, but for Run 1.2p data. The query takes noticeably longer because the Run 1.2p catalog is about 5 times bigger than the one for Run 1.2i. For Run 2.1i dr1b v1 it takes over 30 minutes.
schema = 'run12p_v4'
global_cuts = 'clean '
t = None
min_SNR = 25
max_err = 1/min_SNR
band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)
where = ' WHERE ' + global_cuts + ' AND ' + band_cuts
q7 = "SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd ".format(**locals()) + where
print(q7)
records = []
with dbconn.cursor() as cursor:
    %time cursor.execute(q7)
    records = cursor.fetchall()
nObj = len(records)
df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])
SELECT mag_g,mag_r,mag_i FROM run12p_v4.dpdd WHERE clean AND (magerr_g < 0.04) AND (magerr_i < 0.04) AND (magerr_r < 0.04)
CPU times: user 224 ms, sys: 107 ms, total: 331 ms
Wall time: 4min 37s
plot_color_color(df, 'gmr', 'rmi', title=t)
print(f'Using schema {schema}, max err={max_err}, found {nObj} objects')
Using schema run12p_v4, max err=0.04, found 791346 objects
This optional section is intended for those with little or no prior experience with an SQL database, in particular with a server-mediated database like PostgreSQL. Most of the material discussed here applies to SQLite as well, but not all; differences are pointed out. Since the PostgreSQL database is read-only for users, write operations like INSERT and UPDATE are not discussed.
NOTE: The most up-to-date version of the Primer and following sections may be found in Confluence. For definitive answers to any questions concerning the PostgreSQL implementation of SQL, consult the manual.
In order to use a database server (e.g. PostgreSQL, MySQL, Oracle) you have to supply connection information: host, database name, user name and password. That is precisely what the first block of code at the start of this tutorial above does. Everything is there but the password. That comes from a file accessible only to the user as described in the Prerequisites section.
None of this is necessary for SQLite; all you need is read access to the file. Opening the file with the Python sqlite3 module is the analog of the line dbconn = psycopg2.connect(**dbconfig) above.
A table definition contains a set of column definitions: a name, a datatype, and optional constraints. For example, all values of the column can be required to be NOT NULL (i.e., every row must have a value for the column, which is not the default), or they can be required to be UNIQUE. Or both. A table definition may also include index definitions. An INDEX is an extra data structure internal to the database which makes it faster to look up the column or columns which are indexed.
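You cannot create indexes on this read-only database, but you can see whether a query benefits from one by prefixing it with EXPLAIN, which returns the server's query plan instead of data. A minimal sketch (the query shown is illustrative; dbconn is the connection made in the first part of this notebook):

q_explain = "EXPLAIN SELECT object_id FROM run12i.position WHERE object_id = 1234"
with dbconn.cursor() as cursor:
    cursor.execute(q_explain)
    # Each returned row is one line of the plan; an 'Index Scan' line means
    # an index is being used, a 'Seq Scan' line means a full table scan
    for row in cursor.fetchall():
        print(row[0])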
The data for each column must conform to the datatype of the column. SQLite is pretty lax in this respect; PostgreSQL requires more. If a column is of type TEXT, all literal values appearing in queries must be quoted, even if the value is a string of digits. A query like this

SELECT col1 FROM mytable WHERE col2=123;

will fail for PostgreSQL if col2 is of type TEXT. SQLite allows it.
A view has the same apparent structure as a table (a collection of rows, each with the same columns) and is used the same way in SELECT statements, but the column definitions and data actually come from one or more tables. In the definition of the view, columns from the tables can be renamed, and the collection of rows included may be subject to conditions, just as in a SELECT statement. Views can simplify the interface for users by eliminating the need for JOINs in their queries and by regularizing column names for commonly-used collections of columns which may come from multiple tables, as is the case for run12p_v4.dpdd.
No matter what system you use to access a database, to fetch data someone (you or some service on your behalf) has to send a text string to the server that has this form:

SELECT things-to-fetch FROM table(s)-specification optional-selection-conditions other-optional-clauses

and it returns a table whose columns correspond to the things-to-fetch list. Unlike parquet files or pandas dataframes, the returned data are physically ordered first by row, then column.
As in the rest of this tutorial, I follow the convention of writing keywords in ALL CAPS only to make them stand out; it's not necessary.
Examples come primarily from tables in the schema star_truth, in particular truth_summary and stellar_variability_truth. See https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/postgres_truth.ipynb to learn more about these tables.
things-to-fetch is a comma-separated list of expressions. The most common expression is simply a column name, but a wide variety of expressions are allowed, including functions applied to columns, for example
SELECT id, ra, dec ...
SELECT DISTINCT id ...
SELECT MAX(ABS(delta_flux)), AVG(delta_flux) ...
SELECT CONCAT(raft, '_', ccd), one_col + another_col, ...
where, in the last example, the values of columns raft and ccd are text strings. MAX and AVG are aggregate functions, so-called because they apply to collections of column values rather than the value for a single row. SELECT DISTINCT eliminates duplicate rows from the output. The complete list of available functions and operators for PostgreSQL 9.6 (the version running at NERSC) can be found in their comprehensive documentation.
A table expression may identify a single table (or view) or multiple tables. For our PostgreSQL database, identifying a table requires a schema name and a table name separated by a period, e.g. star_truth.truth_summary:

SELECT MAX(ra), MIN(ra) FROM star_truth.truth_summary
A query may involve more than one table, especially when the tables are related by a column which appears in both or by some other expression involving columns. The table expression says how to JOIN (i.e., match) the tables to make a new temporary table from which values are selected.

SELECT obshistid, delta_flux, S.id FROM star_truth.truth_summary AS S
JOIN star_truth.stellar_variability_truth AS V ON V.id = S.id
WHERE ABS(ra - 59.9) < 0.2 LIMIT 5

or equivalently

SELECT obshistid, delta_flux, id FROM star_truth.truth_summary
JOIN star_truth.stellar_variability_truth USING (id)
WHERE ABS(ra - 59.9) < 0.2 LIMIT 5
In this query ra appears only in truth_summary, whereas delta_flux, being a (per-object, per-visit) quantity, can only be in stellar_variability_truth. The quantity to be fetched is in one table, but the quantity needed to select rows of interest is in a different table (though in a real query you might want to fetch ra as well). The second form is a shorthand available when, as is often the case, the join condition is just the equality of columns from each table with the same name. The column name or list of column names in a USING expression must be enclosed in parentheses.
The first example also illustrates the use of the keyword AS to define an alias, to avoid having to retype long table specifications to qualify a column name. The extra qualification is only necessary if the column name occurs in more than one of the joined tables and is not in a USING list (hence it is unnecessary in the second example).

There are different kinds of joins. The simplest (T1 CROSS JOIN T2) has no condition; it's equivalent to T1 INNER JOIN T2 ON TRUE. The result is the Cartesian product of rows from T1 concatenated with rows from T2, not usually what you want. The syntax for the other kinds of joins all involve some kind of constraint or condition. They will return different results if one table or the other has rows which don't meet the condition; e.g., if a column to be matched has a NULL value for some row. For the example query it wouldn't matter which was used, because every row in each of the two tables has a match with at least one row in the other table. INNER is the default, so JOIN in the example is equivalent to INNER JOIN.
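To see the distinction concretely, here is a sketch using the same two tables (as noted above, every row here has a match, so LEFT JOIN and the default INNER JOIN happen to return the same rows; in general a LEFT JOIN also keeps left-table rows with no match, filling the right-hand columns with NULL, which psycopg2 returns as None):

# LEFT JOIN keeps all rows of the left table, matched or not
q_left = """SELECT S.id, V.obshistid
            FROM star_truth.truth_summary AS S
            LEFT JOIN star_truth.stellar_variability_truth AS V ON V.id = S.id
            LIMIT 5"""
with dbconn.cursor() as cursor:
    cursor.execute(q_left)
    print(cursor.fetchall())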
This is the (optional, but usually present) WHERE clause of a query. It's a boolean expression, normally involving column names, which is applied to each row of the table expression in FROM. Rows are included in the query result if and only if they evaluate to TRUE. See the example in the preceding section. There are all kinds of complicated forms of WHERE clauses - for example, they may include subqueries - but typical ones are just simple conditions involving columns and functions of columns, connected by AND or OR.
The IN operator can be useful:

SELECT delta_flux, id, obshistid FROM star_truth.stellar_variability_truth
WHERE obshistid IN (2205, 7371, 10853);

It's just shorthand for (obshistid = 2205) OR (obshistid = 7371) OR (obshistid = 10853).
NOTE: In the SQL standard the test for equality is A = B, not A == B. SQLite will let you get away with either; PostgreSQL will not.
GROUP BY groups returned rows together according to specified criteria. When there is a GROUP BY clause, each of the fetched items must either be something in the GROUP BY list or must involve an aggregate function (MAX and COUNT in the example below).

SELECT id, bandpass, MAX(ABS(delta_flux)), COUNT(obshistid)
FROM star_truth.stellar_variability_truth
WHERE id IN ('835183','31303590103','31102013522','31303588649','30317268917',
             '30825472052','835279','31102039372','30825477672','31102046245',
             '30321363109','31102051190','31102061342')
GROUP BY id, bandpass
This query returns up to six rows for each id in the list, one for each bandpass with data.
ORDER BY returns results in sorted order, as specified. The results from the previous query are easier to deal with if sorted.
SELECT id, bandpass, MAX(ABS(delta_flux)), COUNT(obshistid)
FROM star_truth.stellar_variability_truth
WHERE id IN ('835183','31303590103','31102013522','31303588649','30317268917',
             '30825472052','835279','31102039372','30825477672','31102046245',
             '30321363109','31102051190','31102061342')
GROUP BY id, bandpass
ORDER BY id, bandpass
LIMIT says: don't return more rows than the limit specified. It's useful when you just want to see if a new query is working but don't want to wait for potentially large output, e.g. the LIMIT 5 in the JOIN examples above.
There are more ways to concoct the table expression, and there are more optional clauses. For the complete picture, see the section on SELECT in the Postgres documentation.
There are plenty of examples of use in the first part of this tutorial. This section extracts the basic patterns. It is specific to psycopg2, but other Python interfaces to PostgreSQL or other databases (e.g. mysql.connector for MySQL; sqlalchemy if you stick to the core interface) are similarly structured. There are three things you need to do: make a connection, execute a query, and fetch the results.
Making the connection is already described above but, for ease of reference, it is repeated here. For our database at NERSC it looks like
import psycopg2
dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)
WARNING: If all your subsequent queries succeed you will only need to make the connection once per script or notebook execution, but certain kinds of failures will make the connection unusable. In that case you need to re-execute the last line above.
As appropriate for your application, you will have created a string, call it q, which starts out SELECT .... To execute it, you get a cursor and call its execute method:
cursor = dbconn.cursor()
cursor.execute(q)
You can reuse the same cursor for another query as long as you obtain results from the first query before issuing the second.
NOTE 1: Unlike Python, SQL uses only single quotes ' for string delimiters. A construction like this

q = 'SELECT col1 FROM t1 WHERE col2 = "red" '

won't work because SQL doesn't recognize the ", and

q = 'SELECT col1 FROM t1 WHERE col2='red' '

will fail because the internal single quotes aren't escaped. Fortunately Python is rich in quoting methods, so you can just use another one for the outer string:

q = "SELECT col1 FROM t1 WHERE col2='red'"

q = '''SELECT col1 FROM t1
       WHERE col2='red'
       AND col3='blue'
    '''
NOTE 2: In the first part of this tutorial I've used standard Python string formatting to assemble the query string. Alternatively, one can use an SQL-specific way to substitute parameters in a template. That's the preferred method if the values for the variables come from an untrusted source, like a web page, and definitely should be used if such values may be used in writes to the database, but neither of those holds in our case. For a full explanation of how to use this method, see the section Passing parameters to SQL queries of the psycopg2 documentation.
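Here is a minimal sketch of that style (the query and values are illustrative): each %s placeholder is filled, with proper quoting, from the sequence passed as the second argument of execute(); psycopg2 also adapts a Python tuple bound to an IN clause into a parenthesized list.

q_template = ("SELECT delta_flux, id, obshistid "
              "FROM star_truth.stellar_variability_truth "
              "WHERE obshistid IN %s AND bandpass = %s")
with dbconn.cursor() as cursor:
    # Note the nested tuple: the first parameter is itself a tuple, for IN
    cursor.execute(q_template, ((2205, 7371, 10853), 'r'))
    rows = cursor.fetchall()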
Results are retrieved using the cursor methods fetchone(), fetchall() and fetchmany(size), where the things fetched are rows. (In essence a row is a tuple whose components may be, and usually are, of different types.) The most appropriate method will depend on your application. Use fetchmany() when you would ideally process all the data at once, but there is so much that it could strain resources like memory. The following shows a typical pattern of use:
import psycopg2
import pandas as pd

dbconn = psycopg2.connect(**db_dict)   # db_dict as in the dbconfig example above

# ... assemble your query string q; define process_some() for your analysis ...
q = ...
max_rows = 10000

# Cursor will automatically be closed at the end of the block
with dbconn.cursor() as cur:
    cur.execute(q)
    while True:
        rows = cur.fetchmany(max_rows)
        if len(rows) == 0:
            break
        df = pd.DataFrame(rows)
        process_some(df)
Putting the data in a DataFrame requires a copy, and there is a limit to available memory. Choosing an appropriate value for max_rows depends on knowing how much memory is available and how wide the rows are.
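An alternative to the explicit fetchmany() loop is pandas' own chunked reader (a sketch; pandas officially supports SQLAlchemy connectables, and passing a raw DBAPI connection such as dbconn works for simple SELECTs but may emit a warning in newer pandas versions):

# Iterate over the result in DataFrame chunks of at most max_rows rows
for chunk in pd.read_sql(q, dbconn, chunksize=max_rows):
    process_some(chunk)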