--- title: "Basic queries" teaching: 3000 exercises: 0 questions: - "How can we select and download the data we want from the Gaia server?" objectives: - "Compose a basic query in ADQL/SQL." - "Use queries to explore a database and its tables." - "Use queries to download data." - "Develop, test, and debug a query incrementally." keypoints: - "If you can't download an entire dataset (or it's not practical) use queries to select the data you need." - "Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean." - "Develop queries incrementally: start with something simple, test it, and add a little bit at a time." - "Use ADQL features like `TOP` and `COUNT` to test before you run a query that might return a lot of data." - "If you know your query will return fewer than 3000 rows, you can run it synchronously, which might complete faster (but it doesn't seem to make much difference). If it might return more than 3000 rows, you should run it asynchronously." - "ADQL and SQL are not case-sensitive, so you don't have to capitalize the keywords, but you should." - "ADQL and SQL don't require you to break a query into multiple lines, but you should." --- {% include links.md %}

Queries

Outline

This lesson demonstrates the steps for selecting and downloading data from the Gaia Database:

  1. First we'll make a connection to the Gaia server,

  2. We will explore information about the database and the tables it contains,

  3. We will write a query and send it to the server, and finally

  4. We will download the response from the server.

Query Language

In order to select data from a database, you have to compose a query, which is a program written in a "query language". The query language we'll use is ADQL, which stands for "Astronomical Data Query Language".

ADQL is a dialect of SQL (Structured Query Language), which is by far the most commonly used query language. Almost everything you will learn about ADQL also works in SQL.

The reference manual for ADQL is here. But you might find it easier to learn from this ADQL Cookbook.

Using Jupyter

If you have not worked with Jupyter notebooks before, you might start with the tutorial on from Jupyter.org called "Try Classic Notebook", or this tutorial from DataQuest.

There are two environments you can use to write and run notebooks:

  • "Jupyter Notebook" is the original, and

  • "Jupyter Lab" is a newer environment with more features.

For these lessons, you can use either one.

If you are too impatient for the tutorials, here's are the most important things to know:

  1. Notebooks are made up of code cells and text cells (and a few other less common kinds). Code cells contain code; text cells, like this one, contain explanatory text written in Markdown.

  2. To run a code cell, click the cell to select it and press Shift-Enter. The output of the code should appear below the cell.

  3. In general, notebooks only run correctly if you run every code cell in order from top to bottom. If you run cells out of order, you are likely to get errors.

  4. You can modify existing cells, but then you have to run them again to see the effect.

  5. You can add new cells, but again, you might have to be careful about the order you run them in.

  6. If you have added or modified cells and the behavior of the notebook seems strange, you can restart the "kernel", which clears all of the variables and functions you have defined, and run the cells again from the beginning.

  • If you are using Jupyter notebook, open the Kernel menu and select "Restart and Run All".

  • In Jupyter Lab...

  • In Colab, open the Runtime menu and select "Restart and run all"

Before you go on, you might want to explore the other menus and the toolbar to see what else you can do.

Installing libraries

If you are running this notebook on Colab, you can run the following cell to install the libraries we'll need.

If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface.

In [1]:
# If we're running on Colab, install libraries

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install astroquery

Connecting to Gaia

The library we'll use to get Gaia data is Astroquery. Astroquery provides Gaia, which is an object that represents a connection to the Gaia database.

We can connect to the Gaia database like this:

In [2]:
from astroquery.gaia import Gaia
Created TAP+ (v1.2.1) - Connection:
	Host: gea.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443
Created TAP+ (v1.2.1) - Connection:
	Host: geadata.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443

Running this import statement has the effect of creating a TAP+ connection; TAP stands for "Table Access Protocol", which is a network protocol for sending queries to the database and getting back the results.

Databases and Tables

What is a database, anyway? Most generally, it can be any collection of data, but when we are talking about ADQL or SQL:

  • A database is a collection of one or more named tables.

  • Each table is a 2-D array with one or more named columns of data.

We can use Gaia.load_tables to get the names of the tables in the Gaia database. With the option only_names=True, it loads information about the tables, called "metadata", not the data itself.

In [3]:
tables = Gaia.load_tables(only_names=True)
INFO: Retrieving tables... [astroquery.utils.tap.core]
INFO: Parsing tables... [astroquery.utils.tap.core]
INFO: Done. [astroquery.utils.tap.core]
In [4]:
for table in tables:
    print(table.name)
external.apassdr9
external.gaiadr2_geometric_distance
external.galex_ais
external.ravedr5_com
external.ravedr5_dr5
external.ravedr5_gra
external.ravedr5_on
external.sdssdr13_photoprimary
external.skymapperdr1_master
external.skymapperdr2_master
external.tmass_xsc
public.hipparcos
public.hipparcos_newreduction
public.hubble_sc
public.igsl_source
public.igsl_source_catalog_ids
public.tycho2
public.dual
tap_config.coord_sys
tap_config.properties
tap_schema.columns
tap_schema.key_columns
tap_schema.keys
tap_schema.schemas
tap_schema.tables
gaiaedr3.gaia_source
gaiaedr3.agn_cross_id
gaiaedr3.commanded_scan_law
gaiaedr3.dr2_neighbourhood
gaiaedr3.frame_rotator_source
gaiaedr3.hipparcos2_best_neighbour
gaiaedr3.hipparcos2_neighbourhood
gaiaedr3.panstarrs1_best_neighbour
gaiaedr3.panstarrs1_join
gaiaedr3.panstarrs1_neighbourhood
gaiaedr3.sdssdr13_best_neighbour
gaiaedr3.sdssdr13_join
gaiaedr3.sdssdr13_neighbourhood
gaiaedr3.skymapperdr2_best_neighbour
gaiaedr3.skymapperdr2_join
gaiaedr3.skymapperdr2_neighbourhood
gaiaedr3.tycho2tdsc_merge_best_neighbour
gaiaedr3.tycho2tdsc_merge_neighbourhood
gaiaedr3.urat1_best_neighbour
gaiaedr3.urat1_neighbourhood
gaiaedr3.gaia_source_simulation
gaiaedr3.gaia_universe_model
gaiaedr3.tycho2tdsc_merge
gaiadr1.aux_qso_icrf2_match
gaiadr1.ext_phot_zero_point
gaiadr1.allwise_best_neighbour
gaiadr1.allwise_neighbourhood
gaiadr1.gsc23_best_neighbour
gaiadr1.gsc23_neighbourhood
gaiadr1.ppmxl_best_neighbour
gaiadr1.ppmxl_neighbourhood
gaiadr1.sdss_dr9_best_neighbour
gaiadr1.sdss_dr9_neighbourhood
gaiadr1.tmass_best_neighbour
gaiadr1.tmass_neighbourhood
gaiadr1.ucac4_best_neighbour
gaiadr1.ucac4_neighbourhood
gaiadr1.urat1_best_neighbour
gaiadr1.urat1_neighbourhood
gaiadr1.cepheid
gaiadr1.phot_variable_time_series_gfov
gaiadr1.phot_variable_time_series_gfov_statistical_parameters
gaiadr1.rrlyrae
gaiadr1.variable_summary
gaiadr1.allwise_original_valid
gaiadr1.gsc23_original_valid
gaiadr1.ppmxl_original_valid
gaiadr1.sdssdr9_original_valid
gaiadr1.tmass_original_valid
gaiadr1.ucac4_original_valid
gaiadr1.urat1_original_valid
gaiadr1.gaia_source
gaiadr1.tgas_source
gaiadr2.aux_allwise_agn_gdr2_cross_id
gaiadr2.aux_iers_gdr2_cross_id
gaiadr2.aux_sso_orbit_residuals
gaiadr2.aux_sso_orbits
gaiadr2.dr1_neighbourhood
gaiadr2.allwise_best_neighbour
gaiadr2.allwise_neighbourhood
gaiadr2.apassdr9_best_neighbour
gaiadr2.apassdr9_neighbourhood
gaiadr2.gsc23_best_neighbour
gaiadr2.gsc23_neighbourhood
gaiadr2.hipparcos2_best_neighbour
gaiadr2.hipparcos2_neighbourhood
gaiadr2.panstarrs1_best_neighbour
gaiadr2.panstarrs1_neighbourhood
gaiadr2.ppmxl_best_neighbour
gaiadr2.ppmxl_neighbourhood
gaiadr2.ravedr5_best_neighbour
gaiadr2.ravedr5_neighbourhood
gaiadr2.sdssdr9_best_neighbour
gaiadr2.sdssdr9_neighbourhood
gaiadr2.tmass_best_neighbour
gaiadr2.tmass_neighbourhood
gaiadr2.tycho2_best_neighbour
gaiadr2.tycho2_neighbourhood
gaiadr2.urat1_best_neighbour
gaiadr2.urat1_neighbourhood
gaiadr2.sso_observation
gaiadr2.sso_source
gaiadr2.vari_cepheid
gaiadr2.vari_classifier_class_definition
gaiadr2.vari_classifier_definition
gaiadr2.vari_classifier_result
gaiadr2.vari_long_period_variable
gaiadr2.vari_rotation_modulation
gaiadr2.vari_rrlyrae
gaiadr2.vari_short_timescale
gaiadr2.vari_time_series_statistics
gaiadr2.panstarrs1_original_valid
gaiadr2.gaia_source
gaiadr2.ruwe

So that's a lot of tables. The ones we'll use are:

  • gaiadr2.gaia_source, which contains Gaia data from data release 2,

  • gaiadr2.panstarrs1_original_valid, which contains the photometry data we'll use from PanSTARRS, and

  • gaiadr2.panstarrs1_best_neighbour, which we'll use to cross-match each star observed by Gaia with the same star observed by PanSTARRS.

We can use load_table (not load_tables) to get the metadata for a single table. The name of this function is misleading, because it only downloads metadata, not the contents of the table.

In [5]:
meta = Gaia.load_table('gaiadr2.gaia_source')
meta
Retrieving table 'gaiadr2.gaia_source'
Parsing table 'gaiadr2.gaia_source'...
Done.
Out[5]:
<astroquery.utils.tap.model.taptable.TapTableMeta at 0x7f2e23f089d0>

Jupyter shows that the result is an object of type TapTableMeta, but it does not display the contents.

To see the metadata, we have to print the object.

In [6]:
print(meta)
TAP Table name: gaiadr2.gaiadr2.gaia_source
Description: This table has an entry for every Gaia observed source as listed in the
Main Database accumulating catalogue version from which the catalogue
release has been generated. It contains the basic source parameters,
that is only final data (no epoch data) and no spectra (neither final
nor epoch).
Num. columns: 96

Columns

The following loop prints the names of the columns in the table.

In [7]:
for column in meta.columns:
    print(column.name)
solution_id
designation
source_id
random_index
ref_epoch
ra
ra_error
dec
dec_error
parallax
parallax_error
parallax_over_error
pmra
pmra_error
pmdec
pmdec_error
ra_dec_corr
ra_parallax_corr
ra_pmra_corr
ra_pmdec_corr
dec_parallax_corr
dec_pmra_corr
dec_pmdec_corr
parallax_pmra_corr
parallax_pmdec_corr
pmra_pmdec_corr
astrometric_n_obs_al
astrometric_n_obs_ac
astrometric_n_good_obs_al
astrometric_n_bad_obs_al
astrometric_gof_al
astrometric_chi2_al
astrometric_excess_noise
astrometric_excess_noise_sig
astrometric_params_solved
astrometric_primary_flag
astrometric_weight_al
astrometric_pseudo_colour
astrometric_pseudo_colour_error
mean_varpi_factor_al
astrometric_matched_observations
visibility_periods_used
astrometric_sigma5d_max
frame_rotator_object_type
matched_observations
duplicated_source
phot_g_n_obs
phot_g_mean_flux
phot_g_mean_flux_error
phot_g_mean_flux_over_error
phot_g_mean_mag
phot_bp_n_obs
phot_bp_mean_flux
phot_bp_mean_flux_error
phot_bp_mean_flux_over_error
phot_bp_mean_mag
phot_rp_n_obs
phot_rp_mean_flux
phot_rp_mean_flux_error
phot_rp_mean_flux_over_error
phot_rp_mean_mag
phot_bp_rp_excess_factor
phot_proc_mode
bp_rp
bp_g
g_rp
radial_velocity
radial_velocity_error
rv_nb_transits
rv_template_teff
rv_template_logg
rv_template_fe_h
phot_variable_flag
l
b
ecl_lon
ecl_lat
priam_flags
teff_val
teff_percentile_lower
teff_percentile_upper
a_g_val
a_g_percentile_lower
a_g_percentile_upper
e_bp_min_rp_val
e_bp_min_rp_percentile_lower
e_bp_min_rp_percentile_upper
flame_flags
radius_val
radius_percentile_lower
radius_percentile_upper
lum_val
lum_percentile_lower
lum_percentile_upper
datalink_url
epoch_photometry_url

You can probably guess what many of these columns are by looking at the names, but you should resist the temptation to guess. To find out what the columns mean, read the documentation.

If you want to know what can go wrong when you don't read the documentation, you might like this article.

Exercise

One of the other tables we'll use is gaiadr2.panstarrs1_original_valid. Use load_table to get the metadata for this table. How many columns are there and what are their names?

In [8]:
# Solution

meta2 = Gaia.load_table('gaiadr2.panstarrs1_original_valid')
print(meta2)

for column in meta2.columns:
    print(column.name)
Retrieving table 'gaiadr2.panstarrs1_original_valid'
Parsing table 'gaiadr2.panstarrs1_original_valid'...
Done.
TAP Table name: gaiadr2.gaiadr2.panstarrs1_original_valid
Description: The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is
a system for wide-field astronomical imaging developed and operated by
the Institute for Astronomy at the University of Hawaii. Pan-STARRS1
(PS1) is the first part of Pan-STARRS to be completed and is the basis
for Data Release 1 (DR1). The PS1 survey used a 1.8 meter telescope and
its 1.4 Gigapixel camera to image the sky in five broadband filters (g,
r, i, z, y).

The current table contains a filtered subsample of the 10 723 304 629
entries listed in the original ObjectThin table.
We used only ObjectThin and MeanObject tables to extract
panstarrs1OriginalValid table, this means that objects detected only in
stack images are not included here. The main reason for us to avoid the
use of objects detected in stack images is that their astrometry is not
as good as the mean objects astrometry: “The stack positions (raStack,
decStack) have considerably larger systematic astrometric errors than
the mean epoch positions (raMean, decMean).” The astrometry for the
MeanObject positions uses Gaia DR1 as a reference catalog, while the
stack positions use 2MASS as a reference catalog.

In details, we filtered out all objects where:

-   nDetections = 1

-   no good quality data in Pan-STARRS, objInfoFlag 33554432 not set

-   mean astrometry could not be measured, objInfoFlag 524288 set

-   stack position used for mean astrometry, objInfoFlag 1048576 set

-   error on all magnitudes equal to 0 or to -999;

-   all magnitudes set to -999;

-   error on RA or DEC greater than 1 arcsec.

The number of objects in panstarrs1OriginalValid is 2 264 263 282.

The panstarrs1OriginalValid table contains only a subset of the columns
available in the combined ObjectThin and MeanObject tables. A
description of the original ObjectThin and MeanObjects tables can be
found at:
https://outerspace.stsci.edu/display/PANSTARRS/PS1+Database+object+and+detection+tables

Download:
http://mastweb.stsci.edu/ps1casjobs/home.aspx
Documentation:
https://outerspace.stsci.edu/display/PANSTARRS
http://pswww.ifa.hawaii.edu/pswww/
References:
The Pan-STARRS1 Surveys, Chambers, K.C., et al. 2016, arXiv:1612.05560
Pan-STARRS Data Processing System, Magnier, E. A., et al. 2016,
arXiv:1612.05240
Pan-STARRS Pixel Processing: Detrending, Warping, Stacking, Waters, C.
Z., et al. 2016, arXiv:1612.05245
Pan-STARRS Pixel Analysis: Source Detection and Characterization,
Magnier, E. A., et al. 2016, arXiv:1612.05244
Pan-STARRS Photometric and Astrometric Calibration, Magnier, E. A., et
al. 2016, arXiv:1612.05242
The Pan-STARRS1 Database and Data Products, Flewelling, H. A., et al.
2016, arXiv:1612.05243

Catalogue curator:
SSDC - ASI Space Science Data Center
https://www.ssdc.asi.it/
Num. columns: 26
obj_name
obj_id
ra
dec
ra_error
dec_error
epoch_mean
g_mean_psf_mag
g_mean_psf_mag_error
g_flags
r_mean_psf_mag
r_mean_psf_mag_error
r_flags
i_mean_psf_mag
i_mean_psf_mag_error
i_flags
z_mean_psf_mag
z_mean_psf_mag_error
z_flags
y_mean_psf_mag
y_mean_psf_mag_error
y_flags
n_detections
zone_id
obj_info_flag
quality_flag

Writing queries

By now you might be wondering how we download the actual data. With tables this big, you generally don't. Instead, you use queries to select only the data you want.

A query is a string written in a query language like SQL; for the Gaia database, the query language is a dialect of SQL called ADQL.

Here's an example of an ADQL query.

In [9]:
query1 = """SELECT 
TOP 10
source_id, ref_epoch, ra, dec, parallax 
FROM gaiadr2.gaia_source"""

Python note: We use a triple-quoted string here so we can include line breaks in the query, which makes it easier to read.

The words in uppercase are ADQL keywords:

  • SELECT indicates that we are selecting data (as opposed to adding or modifying data).

  • TOP indicates that we only want the first 10 rows of the table, which is useful for testing a query before asking for all of the data.

  • FROM specifies which table we want data from.

The third line is a list of column names, indicating which columns we want.

In this example, the keywords are capitalized and the column names are lowercase. This is a common style, but it is not required. ADQL and SQL are not case-sensitive.

To run this query, we use the Gaia object, which represents our connection to the Gaia database, and invoke launch_job:

In [10]:
job1 = Gaia.launch_job(query1)
job1
Out[10]:
<astroquery.utils.tap.model.job.Job at 0x7f2e23f2afa0>

The result is an object that represents the job running on a Gaia server.

If you print it, it displays metadata for the forthcoming table.

In [11]:
print(job1)
<Table length=10>
   name    dtype  unit                            description                             n_bad
--------- ------- ---- ------------------------------------------------------------------ -----
source_id   int64      Unique source identifier (unique within a particular Data Release)     0
ref_epoch float64   yr                                                    Reference epoch     0
       ra float64  deg                                                    Right ascension     0
      dec float64  deg                                                        Declination     0
 parallax float64  mas                                                           Parallax     2
Jobid: None
Phase: COMPLETED
Owner: None
Output file: sync_20201229114647.xml.gz
Results: None

Don't worry about Results: None. That does not actually mean there are no results.

However, Phase: COMPLETED indicates that the job is complete, so we can get the results like this:

In [12]:
results1 = job1.get_results()
type(results1)
Out[12]:
astropy.table.table.Table

Optional detail: Why is table repeated three times? The first is the name of the module, the second is the name of the submodule, and the third is the name of the class. Most of the time we only care about the last one. It's like the Linnean name for gorilla, which is Gorilla gorilla gorilla.

The result is an Astropy Table, which is similar to a table in an SQL database except:

  • SQL databases are stored on disk drives, so they are persistent; that is, they "survive" even if you turn off the computer. An Astropy Table is stored in memory; it disappears when you turn off the computer (or shut down this Jupyter notebook).

  • SQL databases are designed to process queries. An Astropy Table can perform some query-like operations, like selecting columns and rows. But these operations use Python syntax, not SQL.

Jupyter knows how to display the contents of a Table.

In [13]:
results1
Out[13]:
Table length=10
source_idref_epochradecparallax
yrdegdegmas
int64float64float64float64float64
67585097575941414402015.5290.4899010727352-30.343172184207830.48023816159705535
67585086924379769602015.5290.64176055082714-30.289720131422252.2625971293368154
67585270362508492802015.5290.35703708993327-30.193657826181596-0.2763960334229464
67585644582982429442015.5290.77376294142846-29.7653684392252380.5907906528352993
67585586128421557762015.5290.67550630386245-29.921133960781690.2858563565989917
67585563794599219202015.5290.5454006212404-29.900709054816964-1.0012355835832834
67585375202603851522015.5290.7341683169994-30.158181298418626--
67585063559800106242015.5290.586950869878-30.371376421675690.3769870991981157
67585616064339681282015.5290.82697136098506-29.75247697212053--
67585641834144084482015.5290.7133096669958-29.7817436736796070.9376387942856869

Each column has a name, units, and a data type.

For example, the units of ra and dec are degrees, and their data type is float64, which is a 64-bit floating-point number, used to store measurements with a fraction part.

This information comes from the Gaia database, and has been stored in the Astropy Table by Astroquery.

Exercise

Read the documentation of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?

In [14]:
# Solution

Asynchronous queries

launch_job asks the server to run the job "synchronously", which normally means it runs immediately. But synchronous jobs are limited to 2000 rows. For queries that return more rows, you should run "asynchronously", which mean they might take longer to get started.

If you are not sure how many rows a query will return, you can use the SQL command COUNT to find out how many rows are in the result without actually returning them. We'll see an example of this later.

The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results.

For anonymous users, files are kept for three days.

As an example, let's try a query that's similar to query1, with two changes:

  • It selects the first 3000 rows, so it is bigger than we should run synchronously.

  • It uses a new keyword, WHERE.

In [15]:
query2 = """SELECT TOP 3000
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
"""

A WHERE clause indicates which rows we want; in this case, the query selects only rows "where" parallax is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away. We'll use this clause to exclude nearby stars that are unlikely to be part of GD-1.

WHERE is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to select only the rows we need from the database.

We use launch_job_async to submit an asynchronous query.

In [16]:
job2 = Gaia.launch_job_async(query2)
print(job2)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=3000>
   name    dtype  unit                            description                            
--------- ------- ---- ------------------------------------------------------------------
source_id   int64      Unique source identifier (unique within a particular Data Release)
ref_epoch float64   yr                                                    Reference epoch
       ra float64  deg                                                    Right ascension
      dec float64  deg                                                        Declination
 parallax float64  mas                                                           Parallax
Jobid: 1609260407863O
Phase: COMPLETED
Owner: None
Output file: async_20201229114648.vot
Results: None

And here are the results.

In [17]:
results2 = job2.get_results()
results2
Out[17]:
Table length=3000
source_idref_epochradecparallax
yrdegdegmas
int64float64float64float64float64
67585097575941414402015.5290.4899010727352-30.343172184207830.48023816159705535
67585270362508492802015.5290.35703708993327-30.193657826181596-0.2763960334229464
67585644582982429442015.5290.77376294142846-29.7653684392252380.5907906528352993
67585586128421557762015.5290.67550630386245-29.921133960781690.2858563565989917
67585563794599219202015.5290.5454006212404-29.900709054816964-1.0012355835832834
67585063559800106242015.5290.586950869878-30.371376421675690.3769870991981157
67585641834144084482015.5290.7133096669958-29.7817436736796070.9376387942856869
67585297893226531842015.5290.35449465190385-30.10231481465004-1.3305133038319952
67585074855553360642015.5290.6745961953205-30.359651480610280.057302711920868686
...............
46604648940382954242015.579.74509945051724-66.496560127377590.1631257821260955
46604635196504801282015.580.06150680482719-66.42891601595986-0.5822669736733328
46604876745540710402015.580.03536662790846-66.33136792885134-1.4978065466579966
46604947183008867842015.579.6181997343967-66.174723353182120.1488652133861382
46604764732741593602015.579.12508291180023-66.356448447572750.10671540523101529
46604710100939402242015.579.1908378165463-66.489168402940960.8152150079365807
46604983948071677442015.580.23431532382547-66.08688681815092-0.14060596426163252
46604987383917228802015.580.20205898946918-66.043923132951260.1894642263668475
46604740680910969602015.578.93686107060341-66.457955767458960.4665513634059366
46604373031626977282015.580.63872120225027-66.46144560233674-1.1319163320535979

You might notice that some values of parallax are negative. As this FAQ explains, "Negative parallaxes are caused by errors in the observations." Negative parallaxes have "no physical meaning," but they can be a "useful diagnostic on the quality of the astrometric solution."

Exercise

The clauses in a query have to be in the right order. Go back and change the order of the clauses in query2 and run it again.

The query should fail, but notice that you don't get much useful debugging information.

For this reason, developing and debugging ADQL queries can be really hard. A few suggestions that might help:

  • Whenever possible, start with a working query, either an example you find online or a query you have used in the past.

  • Make small changes and test each change before you continue.

  • While you are debugging, use TOP to limit the number of rows in the result. That will make each test run faster, which reduces your development time.

  • Launching test queries synchronously might make them start faster, too.

In [18]:
# Solution

Operators

In a WHERE clause, you can use any of the SQL comparison operators; here are the most common ones:

Symbol Operation
> greater than
< less than
>= greater than or equal
<= less than or equal
= equal
!= or <> not equal

Most of these are the same as Python, but some are not. In particular, notice that the equality operator is =, not ==. Be careful to keep your Python out of your ADQL!

You can combine comparisons using the logical operators:

  • AND: true if both comparisons are true
  • OR: true if either or both comparisons are true

Finally, you can use NOT to invert the result of a comparison.

Exercise

Read about SQL operators here and then modify the previous query to select rows where bp_rp is between -0.75 and 2.

You can read about this variable here.

In [19]:
# Solution

# Here's a solution using > and < operators

query = """SELECT TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp > -0.75 AND bp_rp < 2
"""

# And here's a solution using the BETWEEN operator

query = """SELECT TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp BETWEEN -0.75 AND 2
"""

This Hertzsprung-Russell diagram shows the BP-RP color and luminosity of stars in the Gaia catalog (Copyright: ESA/Gaia/DPAC, CC BY-SA 3.0 IGO).

Selecting stars with bp-rp less than 2 excludes many class M dwarf stars, which are low temperature, low luminosity. A star like that at GD-1's distance would be hard to detect, so if it is detected, it it more likely to be in the foreground.

Cleaning up

Asynchronous jobs have a jobid.

In [20]:
job1.jobid, job2.jobid
Out[20]:
(None, '1609260407863O')

Which you can use to remove the job from the server.

In [21]:
Gaia.remove_jobs([job2.jobid])
Removed jobs: '['1609260407863O']'.

If you don't remove it job from the server, it will be removed eventually, so don't feel too bad if you don't clean up after yourself.

Formatting queries

So far the queries have been string "literals", meaning that the entire string is part of the program. But writing queries yourself can be slow, repetitive, and error-prone.

It is often a good idea to write Python code that assembles a query for you. One useful tool for that is the string format method.

As an example, we'll divide the previous query into two parts; a list of column names and a "base" for the query that contains everything except the column names.

Here's the list of columns we'll select.

In [22]:
columns = 'source_id, ra, dec, pmra, pmdec, parallax, radial_velocity'

And here's the base; it's a string that contains at least one format specifier in curly brackets (braces).

In [23]:
query3_base = """SELECT TOP 10 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
"""

This base query contains one format specifier, {columns}, which is a placeholder for the list of column names we will provide.

To assemble the query, we invoke format on the base string and provide a keyword argument that assigns a value to columns.

In [24]:
query3 = query3_base.format(columns=columns)

The result is a string with line breaks. If you display it, the line breaks appear as \n.

In [25]:
query3
Out[25]:
'SELECT TOP 10 \nsource_id, ra, dec, pmra, pmdec, parallax, radial_velocity\nFROM gaiadr2.gaia_source\nWHERE parallax < 1\n  AND bp_rp BETWEEN -0.75 AND 2\n'

But if you print it, the line breaks appear as... line breaks.

In [26]:
print(query3)
SELECT TOP 10 
source_id, ra, dec, pmra, pmdec, parallax, radial_velocity
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2

Notice that the format specifier has been replaced with the value of columns.

Let's run it and see if it works:

In [27]:
job3 = Gaia.launch_job(query3)
print(job3)
<Table length=10>
      name       dtype    unit                              description                             n_bad
--------------- ------- -------- ------------------------------------------------------------------ -----
      source_id   int64          Unique source identifier (unique within a particular Data Release)     0
             ra float64      deg                                                    Right ascension     0
            dec float64      deg                                                        Declination     0
           pmra float64 mas / yr                         Proper motion in right ascension direction     0
          pmdec float64 mas / yr                             Proper motion in declination direction     0
       parallax float64      mas                                                           Parallax     0
radial_velocity float64   km / s                                                    Radial velocity    10
Jobid: None
Phase: COMPLETED
Owner: None
Output file: sync_20201229114653.xml.gz
Results: None
In [28]:
results3 = job3.get_results()
results3
Out[28]:
Table length=10
source_idradecpmrapmdecparallaxradial_velocity
degdegmas / yrmas / yrmaskm / s
int64float64float64float64float64float64float64
466046637150777484879.49100199261952-66.414117776726682.6073927760139983-1.1968907801968522-0.13534455558687877--
466049870403198400080.19346436358076-66.049076752328561.644062563874402-1.954975321006136-0.1774586376397--
466045894982481702479.68666778641992-66.522209596625571.47776972758805810.3328961930362448-0.030149510331454386--
466045197477392652879.89972073493868-66.750378580711912.4635048563021065-0.7474574729840501-0.005219591141134416--
466047403373135705678.93214036467484-66.464836445916671.2369118132455594-1.71871430346386770.6018032937392243--
466045410507787468879.91045649235578-66.681068461359711.47478554075336770.022464361420165786-0.19949109843083218--
466049485575374054479.63101881952036-66.150633158135361.62278632632221130.204908932619051520.0013030724292998944--
466043953657130816080.24817961417894-66.463032706640921.8652105429518493-0.5618584442373111-0.10578442360531497--
466047018544046656079.25942306061864-66.511207775496053.3735611720039890.25326243152876704-0.6932631338638376--
466044146069036556880.38862155241748-66.344044941350461.6588478675146066-0.09646951910977163-0.2100859303087445--

Good so far.

Exercise

This query always selects sources with parallax less than 1. But suppose you want to take that upper bound as an input.

Modify query3_base to replace 1 with a format specifier like {max_parallax}. Now, when you call format, add a keyword argument that assigns a value to max_parallax, and confirm that the format specifier gets replaced with the value you provide.

In [29]:
# Solution

query4_base = """SELECT TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < {max_parallax} AND 
bp_rp BETWEEN -0.75 AND 2
"""

query4 = query4_base.format(columns=columns,
                            max_parallax=0.5)
print(query)
SELECT TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp BETWEEN -0.75 AND 2

Summary

This notebook demonstrates the following steps:

  1. Making a connection to the Gaia server,

  2. Exploring information about the database and the tables it contains,

  3. Writing a query and sending it to the server, and finally

  4. Downloading the response from the server as an Astropy Table.

In the next lesson we will extend these queries to select a particular region of the sky.

Best practices

  • If you can't download an entire dataset (or it's not practical) use queries to select the data you need.

  • Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Use ADQL features like TOP and COUNT to test before you run a query that might return a lot of data.

  • If you know your query will return fewer than 3000 rows, you can run it synchronously, which might complete faster (but it doesn't seem to make much difference). If it might return more than 3000 rows, you should run it asynchronously.

  • ADQL and SQL are not case-sensitive, so you don't have to capitalize the keywords, but you should.

  • ADQL and SQL don't require you to break a query into multiple lines, but you should.

Jupyter notebooks can be good for developing and testing code, but they have some drawbacks. In particular, if you run the cells out of order, you might find that variables don't have the values you expect.

There are a few things you can do to mitigate these problems:

  • Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.

  • Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase.

One of the other tables we'll use is gaiadr2.panstarrs1_original_valid. Use load_table to get the metadata for this table. How many columns are there and what are their names?

In [ ]: