In [1]:

# don't forget to open the ipython notebook AFTER you've already started the virtual environment
import bq  
client = bq.Client.Get()

/usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/local/lib/python2.7/site-packages/google/apputils/__init__.py:3: UserWarning: Module oauth2client was already imported from /usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/local/lib/python2.7/site-packages/oauth2client/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
  import pkg_resources
/usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/local/lib/python2.7/site-packages/google/apputils/__init__.py:3: UserWarning: Module uritemplate was already imported from /usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/local/lib/python2.7/site-packages/uritemplate/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
  import pkg_resources
/usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/local/lib/python2.7/site-packages/google/apputils/__init__.py:3: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  import pkg_resources

In [2]:

import numpy
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
from scipy.io import netcdf
import os
import time

BigQuery/IPython Notebook Intro

This part of the notebook will demonstrate how to access and query data that you've already uploaded to BigQuery.

If you already know how to do this or you just want to skip straight to the plots and figures, scroll on :)

We'll use argo float data in this notebook.

First, you'll need a data set in BigQuery (or you'll need to have access to the argo data set).

Getting started with BigQuery in an IPython Notebook:

Directions copied from an earlier notebook by Kurt Schwehr:

Do this in a terminal before starting your ipython notebook

virtualenv --system-site-packages bq_ve
source bq_ve/bin/activate
type python
pip install bigquery

pip install yolk
yolk -U
# httplib2 0.7.2 (0.7.7)

bq init
ipython notebook --pylab=inline

How to run BQ queries in the terminal (with bang):

Next, we'll run some queries on our data. From the terminal (or using bangs in IPython), we can use BigQuery syntax as though we were running queries directly in BigQuery.

This all relies on being in the virtual environment.

In [3]:

!bq query 'SELECT COUNT(float_id) FROM argo.all_data_20140806;'

/usr/local/google/home/cobbc/argo/bigquery_python/bq_ve/bin/bq:5: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import load_entry_point
Waiting on bqjob_r182af530c8b76eec_00000147d5ed5e11_1 ... (0s) Current status: DONE   
+-----------+
|    f0_    |
+-----------+
| 197790159 |
+-----------+

How to run queries directly in Python:

This all relies on being in the virtual environment and importing bq and the client stuff in the first cell.

Running a bq query in Python is pretty simple, but there are some sort of strange ways to access the data once it's finished running.

In [4]:

q = 'SELECT COUNT(float_id) FROM argo.all_data_20140806;'
total_measurements = client.ReadTableRows(client.Query(q)['configuration']['query']['destinationTable'])
print total_measurements
print int(total_measurements[0][0])

Waiting on bqjob_r4e659d5257c5e102_00000147d5ed6863_1 ... (0s) Current status: DONE   
[[u'197790159']]
197790159

To make this slightly simpler, here's a quick function that runs the query and (when the query finishes) returns the answer (so you don't have to do the lookups like above).

We've also added a timer so that we know how long the query takes to run (bq just tells you how long the time is once it gets to bq, not including how long it takes to send the message and get the data back).

In [5]:

def Query(q):
    t0 = time.time()
    answer = client.ReadTableRows(client.Query(q)['configuration']['query']['destinationTable'])
    print 'Query time: ' + str(time.time() - t0) + ' seconds.'
    return answer

We'll still have to deal with understanding what the structure of the answer will be and casting to python generic types, but it's significantly easier.

From here on out, we'll also define queries in multi-line strings and leave off the semicolon at the end of the query.

For example:

In [6]:

q = '''
    SELECT x, y, z 
    FROM argo.all_data_20140806 
    LIMIT 10
    '''

answer = Query(q)

answer

Waiting on bqjob_r58dab6e87b0e68e7_00000147d5ed78e9_2 ... (0s) Current status: DONE   
Query time: 3.84358215332 seconds.

Out[6]:

[[u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657'],
 [u'-48.10088', u'17.06217', u'-80193.7705657']]

Note above that the first column is x, the second is y, and the third is z. All of these should be able to be cast as floats, but you'll want to be aware when you write loops to read this data, since "None" cannot be cast to a float and will raise an exception (in some cases, I just add "WHERE [my data] IS NOT NULL" to the query).

Analyzing argo data!

Several Approaches to Histograms ...

Initially, we'll explore histograms while trying to answer the simple question: Where (how deep) do these argo floats generally get most of their measurements?

That is, we want a histogram of depth measurements.

Option 0: Run one query per bucket. This gets tedious, time consuming, expensive, etc. pretty quickly.

Option 1: Manually specify all of the buckets. This gets tedious pretty quickly.

In [7]:

q = '''
    SELECT bucket, COUNT(z) 
    FROM (
    SELECT z, CASE WHEN z >=  0 THEN 1
                   WHEN z >= -100 AND z < 0 THEN 2
                   WHEN z >= -200 AND z < -100 THEN 3
                   WHEN z >= -300 AND z < -200 THEN 4
                   WHEN z >= -400 AND z < -300 THEN 5
                   WHEN z >= -500 AND z < -400 THEN 6
                   WHEN z >= -600 AND z < -500 THEN 7
                   WHEN z >= -700 AND z < -600 THEN 8
                   WHEN z >= -800 AND z < -700 THEN 9
                   WHEN z >= -900 AND z < -800 THEN 10
                   WHEN z >= -1000 AND z < -900 THEN 11
                   WHEN z >= -1100 AND z < -1000 THEN 12
                   WHEN z >= -1200 AND z < -1100 THEN 13
                   WHEN z >= -1300 AND z < -1200 THEN 14
                   WHEN z >= -1400 AND z < -1300 THEN 15
                   WHEN z >= -1500 AND z < -1400 THEN 16
                   WHEN z >= -1600 AND z < -1500 THEN 17
                   WHEN z >= -1700 AND z < -1600 THEN 18
                   WHEN z >= -1800 AND z < -1700 THEN 19
                   WHEN z >= -1900 AND z < -1800 THEN 20  
                   WHEN z >= -2000 AND z < -1900 THEN 21
                   ELSE 22 END as bucket
    FROM argo.all_data_20140806)
    GROUP BY bucket;
    '''

depth_slices = Query(q)

Waiting on bqjob_r2f3c17a7f74b534b_00000147d5ed87f7_3 ... (0s) Current status: DONE   
Query time: 3.36321187019 seconds.

In [8]:

depth_slices

Out[8]:

[[u'13', u'3990698'],
 [u'14', u'3820852'],
 [u'15', u'3759695'],
 [u'16', u'3615709'],
 [u'17', u'3384447'],
 [u'18', u'3320779'],
 [u'19', u'3265692'],
 [u'20', u'3116200'],
 [u'21', u'2008688'],
 [u'2', u'18197264'],
 [u'3', u'15275136'],
 [u'4', u'11520273'],
 [u'5', u'10226423'],
 [u'6', u'8145358'],
 [u'7', u'6896657'],
 [u'8', u'6620817'],
 [u'9', u'6179856'],
 [u'10', u'6197353'],
 [u'11', u'5743494'],
 [u'12', u'4089372'],
 [u'22', u'68365201'],
 [u'1', u'20201']]

In general, we'll be getting data back in this pattern fairly often -- where the first column contains buckets and the second column contains number that belong in that bucket.

To make plotting simpler, here's a function for plotting a histogram:

In [9]:

def PlotHistogram(data, B = 1, C = 0):
    buckets = [float(entry[B]) for entry in data]
    counts = [int(entry[C]) for entry in data]
    
    plt.bar(counts, buckets)
    plt.show()

In [10]:

PlotHistogram(depth_slices)

Option 2: Bin (or bucket) by dividing and casting as integers.

It's worth spending a bit of time to explain exactly what's going on here. If you have data in the range from 1 to 100, dividing by 5 and casting that as an integer will give you the discrete values 0 through 20. For example: Integer(1/5) = 0; Integer(23/5) = 4; Integer(100/5) = 20, etc.

This means that [0, 5) will map to 0; [5, 10) will map to 1, etc.

In the query below, we group the entries based on what discrete value this division results in and then return a count of the items in each group.

For this method, you specify how large you want your bin to be, but the number of bins will depend on the range of the data. For example: You get 21 bins by dividing values ranging from 0 to 100 by 5 but 41 bins by dividing values ranging from 0 to 200 by 5.

If you want to specify how many bins you have, you'll need to artificially limit the range of the data (as we do below - values outside of this range shouldn't be valid anyway) or first write a query to determine the range of your data.

One caveat to this is that negative numbers can throw this off and make the bin around 0 twice as large. For example: Integer(-1/5) = 0; Integer(-6/5) = -1

So if you've got negative and positive values your bins will be: ... (-15, -10], (-10, -5], (-5, 5), [5, 10), [10, 15) ...

In our examples, I'll (usually) artificially limit values to not cross from negative to positive.

Another drawback is that this doesn't result in a nicely/intuitively labelled plot if you use the bins as labels, but we can deal with that later.

In [11]:

# Depth is derived from pressure, so we check that the corresponding pressure measurement is good data. 

q = '''
    SELECT bucket, COUNT(z) 
    FROM(
        SELECT z, INTEGER(z / 1000) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            z > -2500 and z < 0 and 
            pres_a_qc = 1
        )
    GROUP BY bucket
    '''

depth_slices2 = Query(q)

Waiting on bqjob_r4720d5a261efa6b5_00000147d5ed9724_4 ... (0s) Current status: DONE   
Query time: 2.60399889946 seconds.

In [12]:

depth_slices2

Out[12]:

[[u'0', u'89474772'], [u'-1', u'33437226'], [u'-2', u'14479']]

In [13]:

PlotHistogram(depth_slices2)

In [14]:

# More buckets! 

q = '''
    SELECT bucket, COUNT(z) 
    FROM (
        SELECT z, INTEGER(z / 100) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            z > -2500 and z < 0 and 
            pres_a_qc = 1
          )
    GROUP BY bucket
    '''

depth_slices3 = Query(q)

PlotHistogram(depth_slices3)

Waiting on bqjob_r55911f1ab9965a5_00000147d5eda2bb_5 ... (0s) Current status: DONE   
Query time: 4.1874871254 seconds.

More histogram-based analyses:

Great! Now that we've gotten the hang of creating histograms, we'll aim to answer a few more questions.

How many measurements are there over time?

Note that the following gives an approximate histogram by dates, but the structure of dates is a bit strange, and I am still trying to learn whether there are easier ways to bin dates (here we rely on the fact that there's an integer "julian date" that maps to better-formated datetimes).

We're grouping into bins of 365 days, so each of these bins corresponds to about 1 year (though not to a calendar year from Jan 1 - Dec 31).

More info here for trying to be better with timestamps.

In [15]:

q = '''
    SELECT bucket, COUNT(datetime) 
    FROM (
        SELECT datetime, INTEGER(julian_date / 365) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            julian_date > 15000 and julian_date < 35000
          )
    GROUP BY bucket
    '''

time_slices = Query(q)

PlotHistogram(time_slices)

Waiting on bqjob_r10248b45abcf28b5_00000147d5edb3d7_6 ... (0s) Current status: DONE   
Query time: 3.75078201294 seconds.

As I mentioned before, mapping the julian date to datetimes is not exactly as straightforward as you might hope, but I'm pretty sure that the shape of this histogram can be explained as a ramp-up since when the Argo data project started, with the fall-off in the last bin explainable since we're part of the way through the current period.

Next, we'll try to use a histogram to answer:

What is the temperature breakdown overall and at different depths?

In [16]:

# Overall temperatures

q = '''
    SELECT bucket, COUNT(temp_a) 
    FROM (
        SELECT temp_a, INTEGER(temp_a / 1) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            temp_a > -10 and temp_a < 60 and 
            temp_a_qc = 1 and
            pres_a_qc = 1 
          )
    GROUP BY bucket
    '''

temp_slices = Query(q)

PlotHistogram(temp_slices)

Waiting on bqjob_r4aa9424b04744c2c_00000147d5edc32e_7 ... (0s) Current status: DONE   
Query time: 3.38352108002 seconds.

In [17]:

# Shallow temperatures

q = '''
    SELECT bucket, COUNT(temp_a) 
    FROM (
        SELECT temp_a, INTEGER(temp_a / 1) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            temp_a > -10 and temp_a < 60 and 
            temp_a_qc = 1 and 
            z < 0 and z > -20 and
            pres_a_qc = 1 
          )
    GROUP BY bucket
    '''

temp_slices_shallow = Query(q)

PlotHistogram(temp_slices_shallow)

Waiting on bqjob_r15be37988f1abafb_00000147d5edd13a_8 ... (0s) Current status: DONE   
Query time: 3.43206906319 seconds.

In [18]:

# Deep temperatures 

q = '''
    SELECT bucket, COUNT(temp_a) 
    FROM (
        SELECT temp_a, INTEGER(temp_a) as bucket 
        FROM argo.all_data_20140806
        WHERE 
            temp_a > -10 and temp_a < 60 and 
            temp_a_qc = 1 and 
            z < -1500 and z > -1600 and 
            pres_a_qc = 1 
          )
    GROUP BY bucket
    '''

temp_slices_deep = Query(q)

PlotHistogram(temp_slices_deep)

Waiting on bqjob_r6c73ab7b4705adff_00000147d5eddf6f_9 ... (0s) Current status: DONE   
Query time: 2.89166593552 seconds.

We'll be doing a lot of mapping, so here's a function for plotting a map.

In [19]:

def PlotMap(data, map_rotation = 180, marker_size = 2, X = 0, Y = 1, VALUES = 2):
    lon = [float(row[X]) for row in data]
    lat = [float(row[Y]) for row in data]
    values = [float(row[VALUES]) for row in data]
    
    # if the range will be too small with ints, then make the range bigger
    if (max(values) - min(values)) < 100:
        values = [(value*100.) for value in values]
    values_int = [int(value) for value in values]
    
    m1 = Basemap(projection='hammer',lon_0=map_rotation)
    x, y = m1(lon,lat)
    m1.drawmapboundary(fill_color='#ffffff')
    m1.fillcontinents(color='#cc9966',lake_color='#ffffff')
    m1.scatter(x,y,marker_size,marker='o',c=values_int, cmap = cm.jet, lw=0)
    plt.show()

Integrating maps:

See the notebook here for more of a walkthrough on mapping with matplotlib's basemap.

A follow-up to the previous histogram: Plot global temperatures for deep measurements. And where on the globe are the deep temperatures particularly warm?

Our initial assumptions were that the measurement at 13 degrees is probably bad data, but the warm temperatures in the 4-8 degree range might be interesting.

In [20]:

q = '''
    SELECT x, y, temp_a, rand() as random
    FROM argo.all_data_20140806
    WHERE 
        temp_a >= 0 and temp_a < 60 and 
        temp_a_qc = 1 and 
        z < -1500 and z > -1600 and 
        pres_a_qc = 1 and 
        x is not NULL and y is not NULL
    ORDER BY random
    LIMIT 100000
    '''

deep_temps = Query(q)
PlotMap(deep_temps, map_rotation = 70)

Waiting on bqjob_r571498257813ed87_00000147d5edec24_10 ... (2s) Current status: DONE   
Query time: 13.8585169315 seconds.

And where on the globe are the deep temperatures particularly warm?

In [21]:

q = '''
    SELECT x, y, temp_a 
    FROM argo.all_data_20140806
    WHERE 
        temp_a >= 4 and temp_a < 60 and 
        temp_a_qc = 1 and 
        z < -1500 and z > -1600 and 
        pres_a_qc = 1 and 
        x is not NULL and y is not NULL
    '''

high_deep_temps = Query(q)
PlotMap(high_deep_temps, map_rotation = 70)

Waiting on bqjob_r64ee3cfa68415e01_00000147d5ee2eae_11 ... (0s) Current status: DONE   
Query time: 12.8334288597 seconds.

Next, we'll spend some time looking at only one float at a time.

Which float has the most valid temperature measurements, and can we look specifically at the temperature profiles of that float?

In [22]:

q = '''
    SELECT float_id, COUNT(*)  
    FROM argo.all_data_20140806
    WHERE 
        temp_a_qc = 1 and temp_a IS NOT NULL and 
        x IS NOT NULL and y IS NOT NULL 
    GROUP BY float_id
    '''
valid_temps = Query(q)

Waiting on bqjob_r5fb34af5af81bf7b_00000147d5ee7a5c_12 ... (0s) Current status: DONE   
Query time: 3.58828997612 seconds.

In [23]:

max_valid = 0
fid_with_max = None
for entry in valid_temps:
    if entry[1] > max_valid:
        max_valid = int(entry[1])
        fid_with_max = str(entry[0])

print max_valid
print fid_with_max

6462
4900519

In [24]:

q = '''
    SELECT bucket, COUNT(z)
    FROM ( 
        SELECT *, INTEGER(z / 100) as bucket 
        FROM argo.all_data_20140806 
        WHERE 
            float_id = 5900432 and 
            temp_a_qc = 1 and temp_a IS NOT NULL and
            x IS NOT NULL and y IS NOT NULL and 
            pres_a_qc = 1 and z IS NOT NULL 
        ) 
    GROUP BY bucket
    '''

single_float_depth_slices = Query(q)
PlotHistogram(single_float_depth_slices)

Waiting on bqjob_r405282e103c78c19_00000147d5ee897a_13 ... (0s) Current status: DONE   
Query time: 4.04566502571 seconds.

In [25]:

q = '''
    SELECT bucket, COUNT(temp_a)
    FROM ( 
        SELECT *, INTEGER(temp_a) as bucket 
        FROM argo.all_data_20140806 
        WHERE 
            float_id = 5900432 and 
            temp_a_qc = 1 and temp_a IS NOT NULL and 
            x IS NOT NULL and y IS NOT NULL and 
            pres_a_qc = 1
        ) 
    GROUP BY bucket
    '''

single_float_temp_slices = Query(q)
PlotHistogram(single_float_temp_slices)

Waiting on bqjob_r3668123f0b3f2816_00000147d5ee9a1d_14 ... (0s) Current status: DONE   
Query time: 3.58629202843 seconds.

In [26]:

q = '''
    SELECT bucket, COUNT(temp_a)
    FROM ( 
        SELECT *, INTEGER((temp_a * 2) / 1) as bucket 
        FROM argo.all_data_20140806 
        WHERE 
            float_id = 5900432 and 
            temp_a_qc = 1 and temp_a IS NOT NULL and 
            x IS NOT NULL and y IS NOT NULL and
            z < 0  and z > -20 and 
            pres_a_qc = 1
        ) 
    GROUP BY bucket
    '''

single_float_shallow_temp_slices = Query(q)
PlotHistogram(single_float_shallow_temp_slices)

Waiting on bqjob_r60fe0301e27618b4_00000147d5eea90c_15 ... (0s) Current status: DONE   
Query time: 15.4249329567 seconds.

In [27]:

q = '''
    SELECT bucket, COUNT(temp_a)
    FROM ( 
        SELECT *, INTEGER((temp_a * 2) / 1) as bucket 
        FROM argo.all_data_20140806 
        WHERE float_id = 5900432 and temp_a_qc = 1 and x IS NOT NULL and y IS NOT NULL and z < -600  and z > -2000 and pres_a_qc = 1
        ) 
    GROUP BY bucket
    '''

single_float_deep_temp_slices = Query(q)
PlotHistogram(single_float_deep_temp_slices)

Waiting on bqjob_r1e2231983c503861_00000147d5eee627_16 ... (0s) Current status: DONE   
Query time: 3.05194687843 seconds.

In [28]:

q = '''
    SELECT x, y, temp_a 
    FROM ( 
        SELECT * 
        FROM argo.all_data_20140806 
        WHERE 
            float_id = 5900432 and 
            temp_a_qc = 1 and temp_a IS NOT NULL and 
            x IS NOT NULL and y IS NOT NULL
        )
    '''
single_float_temps = Query(q)
PlotMap(single_float_temps, marker_size = 20)

Waiting on bqjob_r2fe5e0418da85d0c_00000147d5eef2e2_17 ... (0s) Current status: DONE   
Query time: 1.77006983757 seconds.

How do patterns change by season, especially for mid-range latitudes?

In [29]:

q = '''
    SELECT QUARTER(datetime) as quar, COUNT(*)
    FROM argo.all_data_20140806
    GROUP BY quar
    '''

measurements_by_quarter = Query(q)

PlotHistogram(measurements_by_quarter)

Waiting on bqjob_r752cdf5c10cb40da_00000147d5eefb65_18 ... (0s) Current status: DONE   
Query time: 6.7907781601 seconds.

How to take a random sampling in BigQuery

The queries for answering this question turn out to take a while to run, and then when they do return, it's way too many points. So I wanted to figure out how to downsample the data.

The strategy I took was to use rand() to generate a random number for each point, then order by this randomly generated number and limit to however many points I can deal with.

See the example here.

In [30]:

by_quarter = {'1': None, '2': None, '3': None, '4': None}

for quarter in by_quarter:
    q = '''
        SELECT x, y, temp_a, rand() as random
        FROM argo.all_data_20140806
        WHERE 
            temp_a_qc = 1 and temp_a IS NOT NULL and
            x IS NOT NULL and y IS NOT NULL and 
            z < 0 and z > -20 and 
            pres_a_qc = 1 and 
            QUARTER(datetime) = ''' + quarter + '''
        ORDER BY random
        LIMIT 10000
        ''' 
    by_quarter[quarter] = Query(q)

Waiting on bqjob_rdb6f3445da1d903_00000147d5ef16c4_19 ... (0s) Current status: DONE   
Query time: 5.60277199745 seconds.
Waiting on bqjob_r6b856c78d871dbd5_00000147d5ef2ca8_20 ... (1s) Current status: DONE   
Query time: 11.9062550068 seconds.
Waiting on bqjob_r2dc9025600cfce52_00000147d5ef5b2b_21 ... (1s) Current status: DONE   
Query time: 10.8653149605 seconds.
Waiting on bqjob_r54efc9b334fc90b6_00000147d5ef859d_22 ... (0s) Current status: DONE   
Query time: 5.2163169384 seconds.

In [31]:

for quarter in sorted(by_quarter.keys()):
    print quarter
    PlotMap(by_quarter[quarter], marker_size = 10)

What do the temperatures at different layers of depth look like?

For this one, the depths didn't sort themselves appropriately off the bat, so I added a comment in there to control the sorting. There's also some weirdness going on with some of the queries, so we just don't plot them if the query doesn't return as expected.

In [32]:

by_depth = {'/*01*/ z < -1750 z >= -2000': None, '/*02*/ z < -1500 z >= -1750': None, '/*03*/ z < -1250 and z >= -1500': None, 
            '/*04*/ z < -1000 and z >= -1250': None, '/*05*/ z < -750 and z >= -1000': None, 
            '/*06*/ z < -500 and z >= -750': None, '/*07*/ z < -250 and z >= -500': None, 
            '/*08*/ z < -50 and z >= -250': None, '/*09*/ z < -50 and z >= 0': None}

for depth in by_depth:
    q = '''
        SELECT x, y, temp_a, rand() as random
        FROM argo.all_data_20140806
        WHERE 
            temp_a_qc = 1 and temp_a IS NOT NULL and
            x IS NOT NULL and y IS NOT NULL and 
            pres_a_qc = 1 and ''' + depth + '''
        ORDER BY random
        LIMIT 10000
        ''' 
    try:
        by_depth[depth] = Query(q)
    except: 
        print depth + ' not enough points?'
        by_depth[depth]  = [[0, 0, 0]]

Waiting on bqjob_r28ca84215cd43400_00000147d5efa393_23 ... (0s) Current status: DONE   
Query time: 6.65865898132 seconds.
Waiting on bqjob_r63f662b2ec0aa000_00000147d5efbd96_24 ... (1s) Current status: DONE   
Query time: 6.99153184891 seconds.
/*02*/ z < -1500 z >= -1750 not enough points?
Waiting on bqjob_r4a2622016bc59dc3_00000147d5efdc4d_26 ... (1s) Current status: DONE   
Query time: 5.8561630249 seconds.
Waiting on bqjob_r686a3917fb18ba60_00000147d5eff32e_27 ... (0s) Current status: DONE   
/*09*/ z < -50 and z >= 0 not enough points?
/*01*/ z < -1750 z >= -2000 not enough points?
Waiting on bqjob_r2420c5a51317d9b6_00000147d5effc46_29 ... (0s) Current status: DONE   
Query time: 7.43559098244 seconds.
Waiting on bqjob_r180a22e0a6b88bbf_00000147d5f01952_30 ... (0s) Current status: DONE   
Query time: 9.26333403587 seconds.
Waiting on bqjob_r643117ede42183f4_00000147d5f03d82_31 ... (0s) Current status: DONE   
Query time: 5.89804983139 seconds.

In [33]:

for depth in sorted(by_depth.keys()):
    if len(by_depth[depth]) > 1:
        print depth
        PlotMap(by_depth[depth], marker_size = 10, map_rotation = 70)

/*03*/ z < -1250 and z >= -1500

/*04*/ z < -1000 and z >= -1250

/*05*/ z < -750 and z >= -1000

/*06*/ z < -500 and z >= -750

/*07*/ z < -250 and z >= -500

/*08*/ z < -50 and z >= -250

There's still plenty more to explore, but here's a fun start!

Good vs. bad data?

Below is a quick look at whether/how many measurements' qc values don't match what is expected.

That is, temp_a_qc = 1 and temp_a = a fill value. There aren't many, but there are lots of similar "pitfalls" with this data (ex. qc = 1 and data is outside of valid range, even if it's not a fill value).

In [34]:

q = '''
    SELECT COUNT(*) 
    FROM argo.all_data_20140806
    WHERE
        pres_a_qc = 1 and 
        (pres_a >= 99999) 
    '''
print (Query(q))

Waiting on bqjob_r66f195cc9639e350_00000147d5f063b0_32 ... (0s) Current status: DONE   
Query time: 2.93041205406 seconds.
[[u'2383']]

In [35]:

q = '''
    SELECT COUNT(*) 
    FROM argo.all_data_20140806
    WHERE
        temp_a_qc = 1 and 
        (temp_a >= 99999) 
    '''
print (Query(q))

Waiting on bqjob_r20a96c2e61018bb8_00000147d5f06f79_33 ... (0s) Current status: DONE   
Query time: 2.09261989594 seconds.
[[u'7']]

In [36]:

q = '''
    SELECT COUNT(*) 
    FROM argo.all_data_20140806
    WHERE
        psal_a_qc = 1 and 
        (psal_a >= 99999) 
    '''
print (Query(q))

Waiting on bqjob_r572d331203b9abe_00000147d5f077eb_34 ... (0s) Current status: DONE   
Query time: 2.98701786995 seconds.
[[u'1533']]

In [37]:

q = '''
    SELECT COUNT(*) 
    FROM argo.all_data_20140806
    WHERE
        doxy_a_qc = 1 and 
        (doxy_a >= 99999) 
    '''
print (Query(q))

Waiting on bqjob_r227cc6189abfd176_00000147d5f083e2_35 ... (0s) Current status: DONE   
Query time: 2.34619998932 seconds.
[[u'0']]