I've found the pandas + SQLAlchemy combination to be very useful; extending this setup just a bit further lets you query Redshift directly from Python.
There are many situations where you might like to get data from a database into a pandas dataframe. The simplest way is to initialize a dataframe via the pandas read_sql_query function. The ORM of choice for pandas is SQLAlchemy. By installing a couple more packages, you can query Redshift data and read it into a dataframe with just a few lines of Python code.
This post assumes you have a number of Python packages already installed, such as pandas, numpy, sqlalchemy, and IPython or Jupyter Lab. I've used the Python distribution Anaconda, which can be downloaded here. It comes packed with many of the libraries you might need for data analysis.
The Dialects documentation for SQLAlchemy mentions that Redshift is supported through another Python package, which itself depends on a PostgreSQL driver. So, the additional packages needed for connecting to Redshift are redshift-sqlalchemy and psycopg2. If you already have Anaconda, you can install psycopg2 quickly using conda. For getting redshift-sqlalchemy installed, here are some docs on using conda & pip together. That being said, the process of downloading the packages looked like this in my terminal:
conda install psycopg2
cd Downloads/redshift-sqlalchemy-0.4.1
python setup.py install
Of course, I did this in my default environment. Everything seems to work, but the Anaconda docs are very useful if you have multiple Python environments on your machine.
Start IPython in your terminal (or a notebook if you prefer). Note that the connection string follows the pattern 'flavor+DBAPI://username:password@host:port/database'. Make sure to change the connection string below to your own!
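Before wiring it into create_engine, it can help to see the pattern's pieces laid out explicitly. Here's a minimal sketch using plain string formatting; every value is a placeholder (5439 is Redshift's default port):

```python
# Assemble a connection URL following the flavor+DBAPI pattern.
# Every value below is a placeholder -- swap in your own settings.
flavor_dbapi = "redshift+psycopg2"  # SQLAlchemy dialect + DBAPI driver
username = "username"
password = "password"
host = "your.redshift.host.example.us-east-1.redshift.amazonaws.com"
port = 5439                         # Redshift's default port
database = "database"

conn_str = f"{flavor_dbapi}://{username}:{password}@{host}:{port}/{database}"
print(conn_str)
```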
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# what is the database equivalent of 'hello world' anyways?
test_query = """
SELECT COUNT(*)
FROM datawarehouse.users u
WHERE u.date BETWEEN '2015-01-01' AND '2015-05-31'
"""
red_engine = create_engine('redshift+psycopg2://username:password@your.redshift.host.123456abcdef.us-east-1.redshift.amazonaws.com:port/database')
test = pd.read_sql_query(test_query, red_engine)
If we look at the test object, we see:
In [5]: test
Out[5]:
count
0 178545
Of course, you'll need to modify the query string to match the settings for your own warehouse to test this out.
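The nice thing about this setup is that read_sql_query doesn't care which engine it's handed. As a self-contained sketch you can run without any warehouse credentials, here's the same pattern against an in-memory SQLite database standing in for Redshift (the table and values are made up for illustration):

```python
import pandas as pd
from sqlalchemy import create_engine

# An in-memory SQLite engine stands in for the Redshift engine above,
# just to show the mechanics without needing a live cluster.
engine = create_engine("sqlite://")
pd.DataFrame({"user_id": [1, 2, 3]}).to_sql("users", engine, index=False)

# Same pattern as the Redshift query: SQL string in, dataframe out.
df = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM users WHERE user_id BETWEEN 1 AND 2",
    engine,
)
print(df)  # one row, one column: n = 2
```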
It's been a while since I had to use this type of connection, so things may have changed since this post was originally written in June of 2015. Specifically, once my company required SSL connections, the setup above went south and I got errors like:
OperationalError: (OperationalError) sslmode value "require" invalid when SSL support is not compiled in
None None
Samantha Zeitlin responded to my request for help with a super-thorough post to help debug the issue, which had to do with fixing symlinks. Unfortunately the Google Group where that response was posted is now defunct :weary:
Miraculously I happen to have the last post to that thread:
For posterity, a workaround that worked for me without having to rely on symlinks:
As others have suggested, uninstall the conda build and reinstall with pip:
conda uninstall psycopg2
pip install psycopg2
Python will throw an error when trying to import psycopg2, so run this:
export DYLD_FALLBACK_LIBRARY_PATH=$HOME/anaconda/lib/:$DYLD_FALLBACK_LIBRARY_PATH
Credit: http://stackoverflow.com/questions/27264574/import-psycopg2-library-not-loaded-libssl-1-0-0-dylib
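For what it's worth, on more recent SQLAlchemy versions (1.4+) you can also make the SSL requirement explicit by putting sslmode into the URL's query string via URL.create. Building the URL doesn't open a connection, so this sketch (placeholder values throughout) runs without a live cluster:

```python
from sqlalchemy.engine import URL

# Placeholder values throughout; query={'sslmode': 'require'} is passed
# along to psycopg2 once a connection is actually opened.
url = URL.create(
    "redshift+psycopg2",
    username="username",
    password="password",
    host="your.redshift.host.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="database",
    query={"sslmode": "require"},
)
print(url)
```

Equivalently, create_engine accepts a connect_args dict (e.g. connect_args={"sslmode": "require"}) whose contents are handed straight to the DBAPI's connect call.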
If you have any issues or fixes, I'd love to hear about them; you can reach out to me on Twitter.