In [1]:

```
# don't forget to open the ipython notebook AFTER you've already started the virtual environment
import bq
client = bq.Client.Get()
```

In [2]:

```
import numpy
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
from scipy.io import netcdf
import os
import time
```

This part of the notebook will demonstrate how to access and query data that you've already uploaded to BigQuery.

If you already know how to do this or you just want to skip straight to the plots and figures, scroll on :)

We'll use argo float data in this notebook.

First, you'll need a data set in BigQuery (or you'll need to have access to the argo data set).

Directions copied from an earlier notebook by Kurt Schwehr:

Do this in a terminal before starting your IPython notebook:

```
virtualenv --system-site-packages bq_ve
source bq_ve/bin/activate
type python
pip install bigquery
pip install yolk
yolk -U
# httplib2 0.7.2 (0.7.7)
bq init
ipython notebook --pylab=inline
```

See Also:

- https://pypi.python.org/pypi/bigquery
- https://code.google.com/p/google-bigquery-tools/
- https://developers.google.com/bigquery/bigquery-api-quickstart

Next, we'll run some queries on our data. From the terminal (or using bangs in IPython), we can use BigQuery syntax as though we were running queries directly in BigQuery.

This all relies on being in the virtual environment.

In [3]:

```
!bq query 'SELECT COUNT(float_id) FROM argo.all_data_20140806;'
```

This also relies on being in the virtual environment, and on having imported bq and created the client in the first cell.

Running a bq query in Python is pretty simple, but the ways to access the data once the query finishes are somewhat awkward.

In [4]:

```
q = 'SELECT COUNT(float_id) FROM argo.all_data_20140806;'
total_measurements = client.ReadTableRows(client.Query(q)['configuration']['query']['destinationTable'])
print total_measurements
print int(total_measurements[0][0])
```

To make this slightly simpler, here's a quick function that runs the query and (when the query finishes) returns the answer (so you don't have to do the lookups like above).

We've also added a timer so that we know how long the query takes end to end (bq only reports the time spent running the query on the server, not the round trip of sending the request and getting the data back).

In [5]:

```
def Query(q):
    t0 = time.time()
    answer = client.ReadTableRows(client.Query(q)['configuration']['query']['destinationTable'])
    print 'Query time: ' + str(time.time() - t0) + ' seconds.'
    return answer
```

We'll still have to deal with understanding what the structure of the answer will be and casting to python generic types, but it's significantly easier.

From here on out, we'll also define queries in multi-line strings and leave off the semicolon at the end of the query.

For example:

In [6]:

```
q = '''
SELECT x, y, z
FROM argo.all_data_20140806
LIMIT 10
'''
answer = Query(q)
answer
```

Out[6]:

Note above that the first column is x, the second is y, and the third is z. All of these should be castable as floats, but be careful when you write loops to read this data: "None" cannot be cast to a float and will raise an exception. (In some cases, I just add "`WHERE [my data] IS NOT NULL`" to the query.)

Initially, we'll explore histograms while trying to answer the simple question:
*Where (how deep) do these argo floats generally get most of their measurements?*

That is, we want a histogram of depth measurements.

**Option 0:** Run one query per bucket. This gets tedious, time consuming, expensive, etc. pretty quickly.

**Option 1:** Manually specify all of the buckets. This gets tedious pretty quickly.
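
If you do take Option 1, you don't have to type the CASE expression by hand: a small helper (hypothetical, not part of bq) can generate the same buckets used in the next query.

```python
def case_buckets(column, top, step, n):
    # Bucket 1 is column >= top; buckets 2..n+1 are half-open bins of width
    # `step` descending from `top`; everything else falls into bucket n+2.
    clauses = ['WHEN %s >= %d THEN 1' % (column, top)]
    for i in range(n):
        hi = top - i * step
        lo = hi - step
        clauses.append('WHEN %s >= %d AND %s < %d THEN %d'
                       % (column, lo, column, hi, i + 2))
    return 'CASE %s ELSE %d END as bucket' % (' '.join(clauses), n + 2)
```

Here, `case_buckets('z', 0, 100, 20)` reproduces the 22-bucket expression written out manually below.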

In [7]:

```
q = '''
SELECT bucket, COUNT(z)
FROM (
SELECT z, CASE WHEN z >= 0 THEN 1
WHEN z >= -100 AND z < 0 THEN 2
WHEN z >= -200 AND z < -100 THEN 3
WHEN z >= -300 AND z < -200 THEN 4
WHEN z >= -400 AND z < -300 THEN 5
WHEN z >= -500 AND z < -400 THEN 6
WHEN z >= -600 AND z < -500 THEN 7
WHEN z >= -700 AND z < -600 THEN 8
WHEN z >= -800 AND z < -700 THEN 9
WHEN z >= -900 AND z < -800 THEN 10
WHEN z >= -1000 AND z < -900 THEN 11
WHEN z >= -1100 AND z < -1000 THEN 12
WHEN z >= -1200 AND z < -1100 THEN 13
WHEN z >= -1300 AND z < -1200 THEN 14
WHEN z >= -1400 AND z < -1300 THEN 15
WHEN z >= -1500 AND z < -1400 THEN 16
WHEN z >= -1600 AND z < -1500 THEN 17
WHEN z >= -1700 AND z < -1600 THEN 18
WHEN z >= -1800 AND z < -1700 THEN 19
WHEN z >= -1900 AND z < -1800 THEN 20
WHEN z >= -2000 AND z < -1900 THEN 21
ELSE 22 END as bucket
FROM argo.all_data_20140806)
GROUP BY bucket;
'''
depth_slices = Query(q)
```

In [8]:

```
depth_slices
```

Out[8]:

In general, we'll be getting data back in this pattern fairly often -- where the first column contains buckets and the second column contains the count of rows that fall in that bucket.

To make plotting simpler, here's a function for plotting a histogram:

In [9]:

```
def PlotHistogram(data, B = 0, C = 1):
    # B is the column holding the bucket value, C the column holding the count
    buckets = [float(entry[B]) for entry in data]
    counts = [int(entry[C]) for entry in data]
    plt.bar(buckets, counts)
    plt.show()
```

In [10]:

```
PlotHistogram(depth_slices)
```

**Option 2:** Bin (or bucket) by dividing and casting as integers.

It's worth spending a bit of time to explain exactly what's going on here. If you have data in the range from 1 to 100, dividing by 5 and casting that as an integer will give you the discrete values 0 through 20. For example: Integer(1/5) = 0; Integer(23/5) = 4; Integer(100/5) = 20, etc.

This means that [0, 5) will map to 0; [5, 10) will map to 1, etc.

In the query below, we group the entries based on what discrete value this division results in and then return a count of the items in each group.

For this method, you specify how large you want your bin to be, but the number of bins will depend on the range of the data. For example: You get 21 bins by dividing values ranging from 0 to 100 by 5 but 41 bins by dividing values ranging from 0 to 200 by 5.

If you want to specify how many bins you have, you'll need to artificially limit the range of the data (as we do below; values outside of this range shouldn't be valid anyway) or first write a query to determine the range of your data.

One caveat to this is that negative numbers can throw this off and make the bin around 0 twice as large. For example: Integer(-1/5) = 0; Integer(-6/5) = -1

So if you've got negative and positive values your bins will be: ... (-15, -10], (-10, -5], (-5, 5), [5, 10), [10, 15) ...

In our examples, I'll (usually) artificially limit values to not cross from negative to positive.

Another drawback is that this doesn't result in a nicely/intuitively labelled plot if you use the bins as labels, but we can deal with that later.
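
To double-check the arithmetic, here's the same bucketing reproduced in Python. BigQuery's INTEGER() truncates toward zero, which math.trunc mirrors; Python's plain integer division floors instead, so it would disagree on negative values.

```python
import math

def bucket(value, width):
    # Mimic BigQuery's INTEGER(value / width): truncation toward zero.
    return math.trunc(float(value) / width)

# The examples from the text:
#   bucket(1, 5) == 0, bucket(23, 5) == 4, bucket(100, 5) == 20
# and on the negative side:
#   bucket(-1, 5) == 0 while bucket(-6, 5) == -1,
# which is why the bin around zero ends up twice as wide.
```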

In [11]:

```
# Depth is derived from pressure, so we check that the corresponding pressure measurement is good data.
q = '''
SELECT bucket, COUNT(z)
FROM(
SELECT z, INTEGER(z / 1000) as bucket
FROM argo.all_data_20140806
WHERE
z > -2500 and z < 0 and
pres_a_qc = 1
)
GROUP BY bucket
'''
depth_slices2 = Query(q)
```

In [12]:

```
depth_slices2
```

Out[12]:

In [13]:

```
PlotHistogram(depth_slices2)
```

In [14]:

```
# More buckets!
q = '''
SELECT bucket, COUNT(z)
FROM (
SELECT z, INTEGER(z / 100) as bucket
FROM argo.all_data_20140806
WHERE
z > -2500 and z < 0 and
pres_a_qc = 1
)
GROUP BY bucket
'''
depth_slices3 = Query(q)
PlotHistogram(depth_slices3)
```

Great! Now that we've gotten the hang of creating histograms, we'll aim to answer a few more questions.

*How many measurements are there over time?*

Note that the following gives an approximate histogram by date, but the structure of dates is a bit strange, and I'm still learning whether there are easier ways to bin dates (here we rely on the fact that there's an integer "julian date" column that maps to better-formatted datetimes).

We're grouping into bins of 365 days, so each of these bins corresponds to about 1 year (though not to a calendar year from Jan 1 - Dec 31).

More info here for trying to be better with timestamps.

In [15]:

```
q = '''
SELECT bucket, COUNT(datetime)
FROM (
SELECT datetime, INTEGER(julian_date / 365) as bucket
FROM argo.all_data_20140806
WHERE
julian_date > 15000 and julian_date < 35000
)
GROUP BY bucket
'''
time_slices = Query(q)
PlotHistogram(time_slices)
```

As I mentioned before, mapping the julian date to datetimes is not as straightforward as you might hope, but I'm fairly confident the shape of this histogram reflects a ramp-up in measurements since the Argo project started, with the fall-off in the last bin explained by our being only part of the way through the current period.

Next, we'll try to use a histogram to answer:

*What is the temperature breakdown overall and at different depths?*

In [16]:

```
# Overall temperatures
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT temp_a, INTEGER(temp_a / 1) as bucket
FROM argo.all_data_20140806
WHERE
temp_a > -10 and temp_a < 60 and
temp_a_qc = 1 and
pres_a_qc = 1
)
GROUP BY bucket
'''
temp_slices = Query(q)
PlotHistogram(temp_slices)
```

In [17]:

```
# Shallow temperatures
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT temp_a, INTEGER(temp_a / 1) as bucket
FROM argo.all_data_20140806
WHERE
temp_a > -10 and temp_a < 60 and
temp_a_qc = 1 and
z < 0 and z > -20 and
pres_a_qc = 1
)
GROUP BY bucket
'''
temp_slices_shallow = Query(q)
PlotHistogram(temp_slices_shallow)
```

In [18]:

```
# Deep temperatures
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT temp_a, INTEGER(temp_a) as bucket
FROM argo.all_data_20140806
WHERE
temp_a > -10 and temp_a < 60 and
temp_a_qc = 1 and
z < -1500 and z > -1600 and
pres_a_qc = 1
)
GROUP BY bucket
'''
temp_slices_deep = Query(q)
PlotHistogram(temp_slices_deep)
```

We'll be doing a lot of mapping, so here's a function for plotting a map.

In [19]:

```
def PlotMap(data, map_rotation = 180, marker_size = 2, X = 0, Y = 1, VALUES = 2):
    lon = [float(row[X]) for row in data]
    lat = [float(row[Y]) for row in data]
    values = [float(row[VALUES]) for row in data]
    # if the range would be too small as ints, scale it up before casting
    if (max(values) - min(values)) < 100:
        values = [(value * 100.) for value in values]
    values_int = [int(value) for value in values]
    m1 = Basemap(projection='hammer', lon_0=map_rotation)
    x, y = m1(lon, lat)
    m1.drawmapboundary(fill_color='#ffffff')
    m1.fillcontinents(color='#cc9966', lake_color='#ffffff')
    m1.scatter(x, y, marker_size, marker='o', c=values_int, cmap=cm.jet, lw=0)
    plt.show()
```

See the notebook here for more of a walkthrough on mapping with matplotlib's basemap.

*A follow-up to the previous histogram: Plot global temperatures for deep measurements. And where on the globe are the deep temperatures particularly warm?*

Our initial assumption was that the measurement at 13 degrees is probably bad data, but the warm temperatures in the 4-8 degree range might be interesting.

In [20]:

```
q = '''
SELECT x, y, temp_a, rand() as random
FROM argo.all_data_20140806
WHERE
temp_a >= 0 and temp_a < 60 and
temp_a_qc = 1 and
z < -1500 and z > -1600 and
pres_a_qc = 1 and
x is not NULL and y is not NULL
ORDER BY random
LIMIT 100000
'''
deep_temps = Query(q)
PlotMap(deep_temps, map_rotation = 70)
```

*And where on the globe are the deep temperatures particularly warm?*

In [21]:

```
q = '''
SELECT x, y, temp_a
FROM argo.all_data_20140806
WHERE
temp_a >= 4 and temp_a < 60 and
temp_a_qc = 1 and
z < -1500 and z > -1600 and
pres_a_qc = 1 and
x is not NULL and y is not NULL
'''
high_deep_temps = Query(q)
PlotMap(high_deep_temps, map_rotation = 70)
```

Next, we'll spend some time looking at only one float at a time.

*Which float has the most valid temperature measurements, and can we look specifically at the temperature profiles of that float?*

In [22]:

```
q = '''
SELECT float_id, COUNT(*)
FROM argo.all_data_20140806
WHERE
temp_a_qc = 1 and temp_a IS NOT NULL and
x IS NOT NULL and y IS NOT NULL
GROUP BY float_id
'''
valid_temps = Query(q)
```

In [23]:

```
max_valid = 0
fid_with_max = None
for entry in valid_temps:
    # cast before comparing -- the counts come back as strings
    if int(entry[1]) > max_valid:
        max_valid = int(entry[1])
        fid_with_max = str(entry[0])
print max_valid
print fid_with_max
```
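
As an aside, that loop can be collapsed into a single max() call (shown here on a couple of made-up rows standing in for the query result):

```python
# Hypothetical (float_id, count) rows as returned by the query above.
valid_temps_example = [('12345', '100'), ('5900432', '2000')]

# Cast the counts to int before comparing -- they come back as strings.
fid_with_max, max_valid = max(
    ((str(row[0]), int(row[1])) for row in valid_temps_example),
    key=lambda pair: pair[1])
```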

In [24]:

```
q = '''
SELECT bucket, COUNT(z)
FROM (
SELECT *, INTEGER(z / 100) as bucket
FROM argo.all_data_20140806
WHERE
float_id = 5900432 and
temp_a_qc = 1 and temp_a IS NOT NULL and
x IS NOT NULL and y IS NOT NULL and
pres_a_qc = 1 and z IS NOT NULL
)
GROUP BY bucket
'''
single_float_depth_slices = Query(q)
PlotHistogram(single_float_depth_slices)
```

In [25]:

```
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT *, INTEGER(temp_a) as bucket
FROM argo.all_data_20140806
WHERE
float_id = 5900432 and
temp_a_qc = 1 and temp_a IS NOT NULL and
x IS NOT NULL and y IS NOT NULL and
pres_a_qc = 1
)
GROUP BY bucket
'''
single_float_temp_slices = Query(q)
PlotHistogram(single_float_temp_slices)
```

In [26]:

```
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT *, INTEGER((temp_a * 2) / 1) as bucket
FROM argo.all_data_20140806
WHERE
float_id = 5900432 and
temp_a_qc = 1 and temp_a IS NOT NULL and
x IS NOT NULL and y IS NOT NULL and
z < 0 and z > -20 and
pres_a_qc = 1
)
GROUP BY bucket
'''
single_float_shallow_temp_slices = Query(q)
PlotHistogram(single_float_shallow_temp_slices)
```

In [27]:

```
q = '''
SELECT bucket, COUNT(temp_a)
FROM (
SELECT *, INTEGER((temp_a * 2) / 1) as bucket
FROM argo.all_data_20140806
WHERE float_id = 5900432 and temp_a_qc = 1 and x IS NOT NULL and y IS NOT NULL and z < -600 and z > -2000 and pres_a_qc = 1
)
GROUP BY bucket
'''
single_float_deep_temp_slices = Query(q)
PlotHistogram(single_float_deep_temp_slices)
```

In [28]:

```
q = '''
SELECT x, y, temp_a
FROM (
SELECT *
FROM argo.all_data_20140806
WHERE
float_id = 5900432 and
temp_a_qc = 1 and temp_a IS NOT NULL and
x IS NOT NULL and y IS NOT NULL
)
'''
single_float_temps = Query(q)
PlotMap(single_float_temps, marker_size = 20)
```

*How do patterns change by season, especially for mid-range latitudes?*

In [29]:

```
q = '''
SELECT QUARTER(datetime) as quar, COUNT(*)
FROM argo.all_data_20140806
GROUP BY quar
'''
measurements_by_quarter = Query(q)
PlotHistogram(measurements_by_quarter)
```

The queries for answering this question turn out to take a while to run, and when they do return, they include far too many points to plot. So I wanted to figure out how to downsample the data.

The strategy I took was to use rand() to generate a random number for each point, then order by this randomly generated number and limit to however many points I can deal with.

See the example here.
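
The same idea in plain Python, in case you ever need to thin out rows client-side instead (a sketch, not part of the bq API):

```python
import random

def downsample(rows, k, seed=None):
    # Client-side analogue of "ORDER BY rand() LIMIT k": return k rows
    # chosen uniformly at random, or everything if there are fewer than k.
    rows = list(rows)
    if len(rows) <= k:
        return rows
    return random.Random(seed).sample(rows, k)
```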

In [30]:

```
by_quarter = {'1': None, '2': None, '3': None, '4': None}
for quarter in by_quarter:
    q = '''
    SELECT x, y, temp_a, rand() as random
    FROM argo.all_data_20140806
    WHERE
    temp_a_qc = 1 and temp_a IS NOT NULL and
    x IS NOT NULL and y IS NOT NULL and
    z < 0 and z > -20 and
    pres_a_qc = 1 and
    QUARTER(datetime) = ''' + quarter + '''
    ORDER BY random
    LIMIT 10000
    '''
    by_quarter[quarter] = Query(q)
```

In [31]:

```
for quarter in sorted(by_quarter.keys()):
    print quarter
    PlotMap(by_quarter[quarter], marker_size = 10)
```

**What do the temperatures at different layers of depth look like?**

For this one, the depths didn't sort themselves appropriately off the bat, so I added a comment in there to control the sorting. There's also some weirdness going on with some of the queries, so we just don't plot them if the query doesn't return as expected.
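
The trick is that BigQuery treats the `/*NN*/` prefixes as SQL comments and ignores them, while Python's sorted() uses them to order the layers (deepest first). For example:

```python
# The /*NN*/ prefixes are invisible to BigQuery but control sorted()'s order.
layers = ['/*03*/ z < -1250 and z >= -1500',
          '/*01*/ z < -1750 and z >= -2000',
          '/*02*/ z < -1500 and z >= -1750']
ordered = sorted(layers)
```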

In [32]:

```
by_depth = {'/*01*/ z < -1750 and z >= -2000': None, '/*02*/ z < -1500 and z >= -1750': None,
            '/*03*/ z < -1250 and z >= -1500': None, '/*04*/ z < -1000 and z >= -1250': None,
            '/*05*/ z < -750 and z >= -1000': None, '/*06*/ z < -500 and z >= -750': None,
            '/*07*/ z < -250 and z >= -500': None, '/*08*/ z < -50 and z >= -250': None,
            '/*09*/ z < 0 and z >= -50': None}
for depth in by_depth:
    q = '''
    SELECT x, y, temp_a, rand() as random
    FROM argo.all_data_20140806
    WHERE
    temp_a_qc = 1 and temp_a IS NOT NULL and
    x IS NOT NULL and y IS NOT NULL and
    pres_a_qc = 1 and ''' + depth + '''
    ORDER BY random
    LIMIT 10000
    '''
    try:
        by_depth[depth] = Query(q)
    except Exception:
        print depth + ' not enough points?'
        by_depth[depth] = [[0, 0, 0]]
```

In [33]:

```
for depth in sorted(by_depth.keys()):
    if len(by_depth[depth]) > 1:
        print depth
        PlotMap(by_depth[depth], marker_size = 10, map_rotation = 70)
```