Pandas Logo

Week 08 - Introduction to Pandas

Today's Agenda

  • Pandas: Introduction
    • Series
    • DataFrames
    • Indexing, Selecting, Filtering
    • Drop columns
    • Handling missing Data
In [1]:
# Importing modules
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context("notebook")

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. One can use any NumPy data type to assign to the Series

Creating a Series:

In [2]:
series_1 = pd.Series(np.random.random(10))
series_1
Out[2]:
0    0.746534
1    0.921092
2    0.706502
3    0.237317
4    0.167151
5    0.837645
6    0.550416
7    0.117424
8    0.655950
9    0.408263
dtype: float64

One can get a NumPy array from the Series, by typing:

In [3]:
series_1.values
Out[3]:
array([ 0.74653444,  0.92109235,  0.70650184,  0.23731732,  0.16715077,
        0.83764482,  0.55041596,  0.11742395,  0.65595028,  0.40826346])

Reindexing

One can also get the indices of each element, by typing:

In [4]:
series_1.index.values
Out[4]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One can also have a custom set of indices:

In [5]:
import string
alphabet = string.lowercase
alphabet = np.array([x for x in alphabet])[0:10]
alphabet
Out[5]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], 
      dtype='|S1')
In [6]:
series_2 = pd.Series(np.random.random(len(alphabet)), index=alphabet)
series_2
Out[6]:
a    0.845875
b    0.142329
c    0.337683
d    0.063682
e    0.575187
f    0.973920
g    0.502718
h    0.496463
i    0.734602
j    0.019638
dtype: float64

One can select only a subsample of the Series

In [7]:
series_1[[1,3,4]]
Out[7]:
1    0.921092
3    0.237317
4    0.167151
dtype: float64
In [8]:
series_2[['a','d','h']]
Out[8]:
a    0.845875
d    0.063682
h    0.496463
dtype: float64

Arithmetic and function Mapping

You can also perform numerical expressions

In [9]:
series_1[1]**2
Out[9]:
0.84841111721463347

Or find values greater than some value 'x'

In [10]:
x = 0.5
series_1[series_1 >= x]
Out[10]:
0    0.746534
1    0.921092
2    0.706502
5    0.837645
6    0.550416
8    0.655950
dtype: float64

You can apply functions to a column, and save it as a new Series

In [11]:
import sys
def exponentials(arr, basis=10.):
    """
    Uses the array `arr` as the exponents for `basis`
    
    Parameters
    ----------
    arr: numpy array, list, pandas Series; shape (N,)
        array to be used as exponents of `basis`
    
    power: int or float, optional (default = 10)
        number used as the basis
    
    Returns
    -------
    exp_arr: numpy array or list, shape (N,)
        array of values for `basis`**`arr`
    """
    if isinstance(arr, list):
        exp_arr = [basis**x for x in arr]
        return exp_arr        
    elif isinstance(arr, np.ndarray) or isinstance(arr, pd.core.series.Series):
        exp_arr = basis**arr
        return exp_arr
    else:
        cmd = ">>>> `arr` is not a list nor a numpy array"
        cmd +="\n>>>> Please give the correct type of object"
        print(cmd)
        sys.exit(1)
In [12]:
exponentials(series_1[series_1 >= x])
Out[12]:
0    5.578718
1    8.338585
2    5.087470
5    6.880893
6    3.551534
8    4.528457
dtype: float64

You can also create a Series using a dictionary (we talked about these on Week 4)

In [13]:
labels_arr = ['foo', 'bar', 'baz']
data_arr   = [100, 200, 300]
dict_1     = dict(zip(labels_arr, data_arr))
dict_1
Out[13]:
{'bar': 200, 'baz': 300, 'foo': 100}
In [14]:
series_3 = pd.Series(dict_1)
series_3
Out[14]:
bar    200
baz    300
foo    100
dtype: int64

Handling Missing Data

One of the most useful features of pandas is that it can handle missing data quite easily:

In [15]:
index = ['foo', 'bar', 'baz', 'qux']
series_4 = pd.Series(dict_1, index=index)
series_4
Out[15]:
foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64
In [16]:
pd.isnull(series_4)
Out[16]:
foo    False
bar    False
baz    False
qux     True
dtype: bool
In [17]:
series_3 + series_4
Out[17]:
bar    400.0
baz    600.0
foo    200.0
qux      NaN
dtype: float64

So using a Series is powerful, but DataFrames are probably what gets used the most since it represents a tabular data structure containing an ordered collection of columns and rows.

DataFrames

A DataFrame is a "tabular data structure" containing an ordered collection of columns. Each column can a have a different data type.

Row and column operations are treated roughly symmetrically. One can obtain a DataFrame from a normal dictionary, or by reading a file with columns and rows.

Creating a DataFrame

In [18]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'popu' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
Out[18]:
popu state year
0 5.0 VA 2012
1 5.1 VA 2013
2 5.2 VA 2014
3 4.0 MD 2014
4 4.1 MD 2015

This DataFrame has 4 rows and 3 columns by the name "pop", "state", and "year".

The way to access a DataFrame is quite similar to that of accessing a Series.
To access a column, one writes the name of the column, as in the following example:

In [19]:
df_1['popu']
Out[19]:
0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64
In [20]:
df_1.popu
Out[20]:
0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64

One can also handle missing data with DataFrames. Like Series, columns that are not present in the data are NaNs:

In [21]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'popu', 'unempl'])
df_2
Out[21]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 NaN
2 2014 VA 5.2 NaN
3 2014 MD 4.0 NaN
4 2015 MD 4.1 NaN
In [22]:
df_2['state']
Out[22]:
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

One can retrieve a row by:

In [23]:
df_2.ix[1]
Out[23]:
year      2013
state       VA
popu       5.1
unempl     NaN
Name: 1, dtype: object

Editing a DataFrame is quite easy to do. One can assign a Series to a column of the DataFrame. If the Series is a list or an array, the length must match the DataFrame.

In [24]:
unempl = pd.Series([1.0, 2.0, 10.], index=[1,3,5])
unempl
Out[24]:
1     1.0
3     2.0
5    10.0
dtype: float64
In [25]:
df_2['unempl'] = unempl
df_2
Out[25]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN
In [26]:
df_2.unempl.isnull()
Out[26]:
0     True
1    False
2     True
3    False
4     True
Name: unempl, dtype: bool

You can also transpose a DataFrame, i.e. switch rows by columns, and columns by rows

In [27]:
df_2.T
Out[27]:
0 1 2 3 4
year 2012 2013 2014 2014 2015
state VA VA VA MD MD
popu 5 5.1 5.2 4 4.1
unempl NaN 1 NaN 2 NaN

Now, let's say you want to show only the 'year' and 'popu' columns. You can do it by:

In [28]:
df_2[['year', 'popu']]
Out[28]:
year popu
0 2012 5.0
1 2013 5.1
2 2014 5.2
3 2014 4.0
4 2015 4.1

Dropping Entries

Let's say you only need a subsample of the table that you have, and you need to drop a column from the DataFrame. You can do that by using the 'drop' option:

In [29]:
df_2
Out[29]:
year state popu unempl
0 2012 VA 5.0 NaN
1 2013 VA 5.1 1.0
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0
4 2015 MD 4.1 NaN
In [30]:
df_3 = df_2.drop('unempl', axis=1)
df_3
Out[30]:
year state popu
0 2012 VA 5.0
1 2013 VA 5.1
2 2014 VA 5.2
3 2014 MD 4.0
4 2015 MD 4.1

You can also drop certain rows:

In [31]:
df_4 = df_2.drop([1,4])
df_4
Out[31]:
year state popu unempl
0 2012 VA 5.0 NaN
2 2014 VA 5.2 NaN
3 2014 MD 4.0 2.0

Look at this carefully! The DataFrame kept preserved the same indices as for df_2.

If you can to reset the indices, you can do that by:

In [32]:
df_4.reset_index(inplace=True)
df_4
Out[32]:
index year state popu unempl
0 0 2012 VA 5.0 NaN
1 2 2014 VA 5.2 NaN
2 3 2014 MD 4.0 2.0

Gaia Dataset

Gaia

Pandas is great at reading Data tables and CSV files, and other kinds of documents. For the remainder of this notebook, we will be using the Gaia's DR1 catalogue.

In [33]:
gaia_df = pd.read_csv('http://1016243957.rsc.cdn77.org/Gaia/gaia_source/csv/GaiaSource_000-000-000.csv.gz', compression='gzip')
In [34]:
gaia_df
Out[34]:
solution_id source_id random_index ref_epoch ra ra_error dec dec_error parallax parallax_error ... scan_direction_mean_k4 phot_g_n_obs phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_mag phot_variable_flag l b ecl_lon ecl_lat
0 1635378410781933568 65408 973786105 2015.0 44.996152 14.379929 0.005616 6.517028 NaN NaN ... 25.226347 30 1567.255110 5.856073 17.536921 NOT_AVAILABLE 176.951072 -48.901522 42.533723 -16.329572
1 1635378410781933568 34359896320 1010840134 2015.0 45.004970 0.156231 0.019875 0.528851 NaN NaN ... 28.650621 68 40086.238958 27.159845 14.017282 NOT_AVAILABLE 176.944759 -48.885280 42.546862 -16.318524
2 1635378410781933568 34361129088 388985243 2015.0 45.004312 0.347814 0.021042 1.100095 NaN NaN ... 28.207079 52 1761.880082 19.831304 17.409829 NOT_AVAILABLE 176.942786 -48.884943 42.546563 -16.317215
3 1635378410781933568 309238066432 590730619 2015.0 44.995037 2.168152 0.038152 1.215309 NaN NaN ... 25.310766 61 275.451290 2.733062 19.424658 NOT_AVAILABLE 176.914265 -48.879747 42.542548 -16.298139
4 1635378410781933568 343597448960 329156827 2015.0 44.963892 11.881098 0.043596 5.411748 NaN NaN ... 27.213010 53 1025.723310 10.073989 17.997195 NOT_AVAILABLE 176.875418 -48.898381 42.513189 -16.283803
5 1635378410781933568 515396233856 260581828 2015.0 44.998326 1.953613 0.066335 1.112199 NaN NaN ... 24.463037 52 292.297111 2.363495 19.360209 NOT_AVAILABLE 176.886900 -48.857814 42.554428 -16.272157
6 1635378410781933568 549755818112 278964178 2015.0 45.048282 0.143346 0.048259 0.150784 NaN NaN ... 27.246462 77 18064.661353 18.664497 14.882696 NOT_AVAILABLE 176.959360 -48.834394 42.598674 -16.304073
7 1635378410781933568 824633879296 201645987 2015.0 45.023616 7.686524 0.068421 14.300411 NaN NaN ... 27.421402 40 4228.838911 8.266022 16.459217 NOT_AVAILABLE 176.911300 -48.838164 42.580255 -16.277571
8 1635378410781933568 927713095040 1018609930 2015.0 45.026727 18.851057 0.081702 46.240572 NaN NaN ... 27.308582 35 327.815930 4.918327 19.235695 NOT_AVAILABLE 176.900059 -48.826705 42.587406 -16.265784
9 1635378410781933568 1099511693312 442633663 2015.0 44.966545 1.672164 0.046310 0.932317 NaN NaN ... 25.379011 59 356.809033 2.622480 19.143680 NOT_AVAILABLE 176.875251 -48.894588 42.516660 -16.281987
10 1635378410781933568 1271310478464 1115856713 2015.0 44.993271 0.216263 0.076336 0.144935 NaN NaN ... 25.836508 64 5714.666962 8.945431 16.132293 NOT_AVAILABLE 176.870627 -48.854506 42.552446 -16.261113
11 1635378410781933568 1340029955712 201575893 2015.0 44.969076 0.682169 0.084426 0.373422 NaN NaN ... 25.172558 60 1411.076794 13.594468 17.650893 NOT_AVAILABLE 176.836234 -48.866289 42.530821 -16.246288
12 1635378410781933568 1340029956224 5951362 2015.0 44.978459 53.350179 0.092581 24.708885 NaN NaN ... 23.500828 51 620.840141 5.899377 18.542321 NOT_AVAILABLE 176.837229 -48.853875 42.542654 -16.241241
13 1635378410781933568 1374389600384 920426374 2015.0 44.932801 1.040805 0.064812 0.583973 NaN NaN ... 25.417435 69 804.421636 2.771768 18.261061 NOT_AVAILABLE 176.819345 -48.906004 42.488706 -16.254403
14 1635378410781933568 1511828647680 84985118 2015.0 44.952650 1.606169 0.084952 0.875326 NaN NaN ... 25.286890 68 362.618752 4.127016 19.126144 NOT_AVAILABLE 176.818303 -48.877735 42.514624 -16.240969
15 1635378410781933568 1614907863552 1072871075 2015.0 44.951155 0.496565 0.105309 0.335391 NaN NaN ... 26.131811 85 1311.494046 5.143534 17.730354 NOT_AVAILABLE 176.794474 -48.864659 42.519351 -16.221068
16 1635378410781933568 1649267601280 351948544 2015.0 44.996058 0.155590 0.084922 0.117742 NaN NaN ... 26.377840 78 6016.654908 8.316853 16.076382 NOT_AVAILABLE 176.864182 -48.846537 42.557843 -16.253721
17 1635378410781933568 1683627775360 872201236 2015.0 45.013791 2.772176 0.087743 1.475106 NaN NaN ... 25.030005 59 204.312422 2.355248 19.749033 NOT_AVAILABLE 176.879811 -48.831820 42.576364 -16.256219
18 1635378410781933568 1717987078400 813806903 2015.0 44.983097 1.104445 0.096410 0.605834 NaN NaN ... 25.261309 60 658.182045 2.995584 18.478905 NOT_AVAILABLE 176.837942 -48.847878 42.548442 -16.238939
19 1635378410781933568 1752346816896 1021139689 2015.0 45.005039 0.644877 0.101940 0.463658 NaN NaN ... 25.554262 60 433.868742 2.950488 18.931374 NOT_AVAILABLE 176.855060 -48.828255 42.571979 -16.240080
20 1635378410781933568 2199023259904 790860275 2015.0 45.095578 0.838257 0.085309 0.566988 NaN NaN ... 24.390670 67 891.659393 3.272273 18.149273 NOT_AVAILABLE 176.968651 -48.774626 42.657089 -16.282487
21 1635378410781933568 2199023292160 269964057 2015.0 45.081356 11.015226 0.081706 6.681021 NaN NaN ... 23.379484 50 127.643943 2.999310 20.259770 NOT_AVAILABLE 176.957618 -48.787370 42.641823 -16.281771
22 1635378410781933568 2336462213888 1125415564 2015.0 45.104663 0.649207 0.098518 0.610497 NaN NaN ... 25.553093 79 408.336380 2.360694 18.997225 NOT_AVAILABLE 176.963776 -48.758920 42.670164 -16.272513
23 1635378410781933568 2405181690752 422277103 2015.0 45.096167 2.205147 0.105424 1.446492 NaN NaN ... 24.717205 71 236.532837 2.578600 19.590041 NOT_AVAILABLE 176.947291 -48.760249 42.663805 -16.263424
24 1635378410781933568 2542620800000 879777187 2015.0 45.057029 0.616399 0.114982 0.416486 NaN NaN ... 26.451584 60 1276.918666 3.345120 17.759362 NOT_AVAILABLE 176.895634 -48.781791 42.627735 -16.242833
25 1635378410781933568 2748779106176 434751122 2015.0 45.140486 43.841507 0.120844 20.328849 NaN NaN ... 23.307291 61 257.753422 2.749363 19.496759 NOT_AVAILABLE 176.977064 -48.717629 42.712650 -16.261636
26 1635378410781933568 2753074326400 527624082 2015.0 45.121968 86.218208 0.121684 39.899984 NaN NaN ... 22.415926 52 178.186162 2.386511 19.897585 NOT_AVAILABLE 176.956677 -48.730386 42.694460 -16.255420
27 1635378410781933568 2851858288640 591268808 2015.0 45.132143 0.206403 0.137848 0.158738 NaN NaN ... 24.536875 71 198381.427649 95.346283 12.281018 NOT_AVAILABLE 176.949730 -48.711844 42.709517 -16.242936
28 1635378410781933568 2920577765120 895292880 2015.0 45.164653 0.138736 0.154504 0.118131 NaN NaN ... 24.592000 70 13193.978370 14.564631 15.223831 NOT_AVAILABLE 176.965723 -48.676874 42.746969 -16.236507
29 1635378410781933568 2989297243520 357757792 2015.0 45.151781 42.562805 0.154113 19.750830 NaN NaN ... 22.130322 52 191.990513 2.543990 19.816571 NOT_AVAILABLE 176.952624 -48.686417 42.734028 -16.233120
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
218423 1635378410781933568 14919754314204032 260281904 2015.0 45.650800 0.383226 9.081488 0.261477 NaN NaN ... 24.002770 70 3363.547982 5.101128 16.707776 NOT_AVAILABLE 168.790929 -41.821862 45.833405 -7.820349
218424 1635378410781933568 14919823033679488 1027452090 2015.0 45.704786 2.042026 9.071023 1.450689 NaN NaN ... 23.002306 60 258.053755 2.764092 19.495495 NOT_AVAILABLE 168.854540 -41.795342 45.882092 -7.845349
218425 1635378410781933568 14919823033679744 236659816 2015.0 45.699352 0.929139 9.076520 0.639298 NaN NaN ... 22.957360 59 680.917341 3.275510 18.442034 NOT_AVAILABLE 168.844280 -41.794626 45.878449 -7.838568
218426 1635378410781933568 14919891753156352 435272755 2015.0 45.692881 0.190421 9.075114 0.163815 NaN NaN ... 27.661507 87 7012.414539 16.006395 15.910101 NOT_AVAILABLE 168.838968 -41.799835 45.871859 -7.838125
218427 1635378410781933568 14919960472635648 959701713 2015.0 45.722289 9.825346 9.093279 6.492147 NaN NaN ... 23.688305 60 1596.069575 4.220688 17.517141 NOT_AVAILABLE 168.852879 -41.767191 45.905138 -7.828832
218428 1635378410781933568 14920132271328128 942427254 2015.0 45.689493 0.268733 9.097753 0.171978 NaN NaN ... 26.496485 80 6805.308872 8.148101 15.942650 NOT_AVAILABLE 168.815896 -41.784742 45.875028 -7.815457
218429 1635378410781933568 14920166631067008 632724484 2015.0 45.664481 18.762107 9.100180 9.810925 NaN NaN ... 19.576363 31 103.585201 3.509294 20.486526 NOT_AVAILABLE 168.788531 -41.798869 45.851787 -7.806199
218430 1635378410781933568 14920166631067392 582109148 2015.0 45.674276 1.685143 9.101238 1.361935 NaN NaN ... 24.006506 69 230.669617 2.052386 19.617294 NOT_AVAILABLE 168.797505 -41.791806 45.861457 -7.807897
218431 1635378410781933568 14920269710286208 779603400 2015.0 45.720753 0.608597 9.122757 0.427416 NaN NaN ... 24.420046 67 1544.315651 5.862847 17.552930 NOT_AVAILABLE 168.825747 -41.745705 45.912009 -7.800110
218432 1635378410781933568 14920269710287104 501846505 2015.0 45.708972 0.541434 9.125572 0.350618 NaN NaN ... 25.532549 60 1864.310293 4.470867 17.348475 NOT_AVAILABLE 168.811421 -41.751086 45.901536 -7.794147
218433 1635378410781933568 14920338429763456 32369006 2015.0 45.699755 0.728796 9.124834 0.487984 NaN NaN ... 23.891710 60 1586.669976 3.602134 17.523554 NOT_AVAILABLE 168.802759 -41.757538 45.892510 -7.792304
218434 1635378410781933568 14920338429767040 983009529 2015.0 45.704411 1.069254 9.136586 0.634729 NaN NaN ... 23.919147 59 432.644682 2.342617 18.934442 NOT_AVAILABLE 168.797267 -41.745602 45.900289 -7.782312
218435 1635378410781933568 14920407149232128 456578257 2015.0 45.803526 0.822002 9.076716 0.547173 NaN NaN ... 26.017690 61 791.473344 3.277256 18.278679 NOT_AVAILABLE 168.949154 -41.727825 45.978188 -7.867202
218436 1635378410781933568 14920441508971904 459156748 2015.0 45.777613 0.592911 9.084311 0.364729 NaN NaN ... 26.010004 66 1411.994227 3.829917 17.650188 NOT_AVAILABLE 168.916450 -41.738631 45.955538 -7.852746
218437 1635378410781933568 14920475868710912 349670923 2015.0 45.801526 12.097061 9.092938 5.199756 NaN NaN ... 27.578463 60 252.307472 2.821898 19.519945 NOT_AVAILABLE 168.933047 -41.716756 45.980859 -7.851076
218438 1635378410781933568 14920514523093632 36613424 2015.0 45.805955 27.153874 9.095252 10.859021 NaN NaN ... 24.191610 22 72.347293 3.372488 20.876214 NOT_AVAILABLE 168.935497 -41.712158 45.985751 -7.850078
218439 1635378410781933568 14920578947928704 979137908 2015.0 45.802628 2.994153 9.110668 2.094225 NaN NaN ... 26.079325 61 149.356346 2.980845 20.089211 NOT_AVAILABLE 168.918760 -41.702551 45.986924 -7.834360
218440 1635378410781933568 14920682027143680 234382041 2015.0 45.776151 11.184474 9.107903 13.340924 NaN NaN ... 26.989640 52 255.031598 3.810901 19.508285 NOT_AVAILABLE 168.894490 -41.721600 45.960808 -7.829694
218441 1635378410781933568 14920682027143936 534175679 2015.0 45.779258 2.427386 9.110364 1.614261 NaN NaN ... 26.010275 67 233.591796 2.710746 19.603626 NOT_AVAILABLE 168.895483 -41.717738 45.964476 -7.828191
218442 1635378410781933568 14920755041238784 566149119 2015.0 45.765103 27.140964 9.108320 10.771321 NaN NaN ... 22.895716 25 114.828565 4.934718 20.374645 NOT_AVAILABLE 168.882995 -41.728350 45.950355 -7.826239
218443 1635378410781933568 14920789400983680 163695001 2015.0 45.791348 18.048335 9.115565 7.274865 NaN NaN ... 25.600533 40 125.169392 3.182193 20.281025 NOT_AVAILABLE 168.903149 -41.706041 45.977514 -7.826541
218444 1635378410781933568 14920853825840384 374198172 2015.0 45.782804 2.011283 9.128334 1.271993 NaN NaN ... 26.053460 67 340.367415 3.085575 19.194900 NOT_AVAILABLE 168.883459 -41.701782 45.972948 -7.811920
218445 1635378410781933568 14920961199665920 1026325869 2015.0 45.845940 0.154804 9.117229 0.133720 NaN NaN ... 30.232103 105 34335.292371 118.342051 14.185418 NOT_AVAILABLE 168.956655 -41.669825 46.030220 -7.840027
218446 1635378410781933568 14920991264791424 130617855 2015.0 45.818292 2.132414 9.119688 1.562758 NaN NaN ... 25.077946 59 265.553116 5.101247 19.464392 NOT_AVAILABLE 168.926699 -41.685657 46.004459 -7.830029
218447 1635378410781933568 14921025624532096 485380553 2015.0 45.831571 1.013164 9.127799 0.636519 NaN NaN ... 26.132261 58 823.749366 5.018586 18.235282 NOT_AVAILABLE 168.933022 -41.670980 46.019456 -7.825910
218448 1635378410781933568 14921025624532736 360272742 2015.0 45.840498 0.597650 9.130125 0.389516 NaN NaN ... 25.829742 63 3481.243336 9.832782 16.670434 NOT_AVAILABLE 168.939985 -41.663494 46.028655 -7.826143
218449 1635378410781933568 14921025624533888 616104949 2015.0 45.836316 1.050996 9.133721 0.700946 NaN NaN ... 25.784980 57 609.708253 4.083088 18.561965 NOT_AVAILABLE 168.932655 -41.663435 46.025668 -7.821535
218450 1635378410781933568 14921059984272128 1062009676 2015.0 45.857035 0.122367 9.132760 0.099148 NaN NaN ... 29.490643 79 21878.692612 21.994991 14.674717 NOT_AVAILABLE 168.954330 -41.650899 46.045222 -7.828179
218451 1635378410781933568 14921128703751040 989948028 2015.0 45.845190 2.005531 9.143289 1.342038 NaN NaN ... 25.893215 61 277.107730 3.519941 19.418148 NOT_AVAILABLE 168.933278 -41.650470 46.036860 -7.814800
218452 1635378410781933568 14921128703751808 813083591 2015.0 45.845149 1.588282 9.144765 1.058416 NaN NaN ... 25.928852 61 408.096779 3.346556 18.997862 NOT_AVAILABLE 168.931956 -41.649373 46.037237 -7.813372

218453 rows × 57 columns

Shape, Columns and Rows

You can get the shape of the "gaia_df" DataFrame by typing:

In [35]:
gaia_df.shape
Out[35]:
(218453, 57)

That means there are 218453 rows and 57 columns.

To get an array of the columns available, one could write:

In [36]:
gaia_df.columns.values.sort()
gaia_df.columns.values
Out[36]:
array(['astrometric_delta_q', 'astrometric_excess_noise',
       'astrometric_excess_noise_sig', 'astrometric_n_bad_obs_ac',
       'astrometric_n_bad_obs_al', 'astrometric_n_good_obs_ac',
       'astrometric_n_good_obs_al', 'astrometric_n_obs_ac',
       'astrometric_n_obs_al', 'astrometric_primary_flag',
       'astrometric_priors_used', 'astrometric_relegation_factor',
       'astrometric_weight_ac', 'astrometric_weight_al', 'b', 'dec',
       'dec_error', 'dec_parallax_corr', 'dec_pmdec_corr', 'dec_pmra_corr',
       'duplicated_source', 'ecl_lat', 'ecl_lon', 'l',
       'matched_observations', 'parallax', 'parallax_error',
       'parallax_pmdec_corr', 'parallax_pmra_corr', 'phot_g_mean_flux',
       'phot_g_mean_flux_error', 'phot_g_mean_mag', 'phot_g_n_obs',
       'phot_variable_flag', 'pmdec', 'pmdec_error', 'pmra', 'pmra_error',
       'pmra_pmdec_corr', 'ra', 'ra_dec_corr', 'ra_error',
       'ra_parallax_corr', 'ra_pmdec_corr', 'ra_pmra_corr', 'random_index',
       'ref_epoch', 'scan_direction_mean_k1', 'scan_direction_mean_k2',
       'scan_direction_mean_k3', 'scan_direction_mean_k4',
       'scan_direction_strength_k1', 'scan_direction_strength_k2',
       'scan_direction_strength_k3', 'scan_direction_strength_k4',
       'solution_id', 'source_id'], dtype=object)

Let's say you only want a DataFrame with the the colums:

  • ra (right ascension)
  • dec (declination)
  • l (galactic longitude)
  • b (galactic latitude)

You do this by using the loc option for the DataFrame:

In [37]:
gaia_df_2 = gaia_df.loc[:,['ra','dec','l','b']]
gaia_df_2
Out[37]:
ra dec l b
0 44.996152 0.005616 176.951072 -48.901522
1 45.004970 0.019875 176.944759 -48.885280
2 45.004312 0.021042 176.942786 -48.884943
3 44.995037 0.038152 176.914265 -48.879747
4 44.963892 0.043596 176.875418 -48.898381
5 44.998326 0.066335 176.886900 -48.857814
6 45.048282 0.048259 176.959360 -48.834394
7 45.023616 0.068421 176.911300 -48.838164
8 45.026727 0.081702 176.900059 -48.826705
9 44.966545 0.046310 176.875251 -48.894588
10 44.993271 0.076336 176.870627 -48.854506
11 44.969076 0.084426 176.836234 -48.866289
12 44.978459 0.092581 176.837229 -48.853875
13 44.932801 0.064812 176.819345 -48.906004
14 44.952650 0.084952 176.818303 -48.877735
15 44.951155 0.105309 176.794474 -48.864659
16 44.996058 0.084922 176.864182 -48.846537
17 45.013791 0.087743 176.879811 -48.831820
18 44.983097 0.096410 176.837942 -48.847878
19 45.005039 0.101940 176.855060 -48.828255
20 45.095578 0.085309 176.968651 -48.774626
21 45.081356 0.081706 176.957618 -48.787370
22 45.104663 0.098518 176.963776 -48.758920
23 45.096167 0.105424 176.947291 -48.760249
24 45.057029 0.114982 176.895634 -48.781791
25 45.140486 0.120844 176.977064 -48.717629
26 45.121968 0.121684 176.956677 -48.730386
27 45.132143 0.137848 176.949730 -48.711844
28 45.164653 0.154504 176.965723 -48.676874
29 45.151781 0.154113 176.952624 -48.686417
... ... ... ... ...
218423 45.650800 9.081488 168.790929 -41.821862
218424 45.704786 9.071023 168.854540 -41.795342
218425 45.699352 9.076520 168.844280 -41.794626
218426 45.692881 9.075114 168.838968 -41.799835
218427 45.722289 9.093279 168.852879 -41.767191
218428 45.689493 9.097753 168.815896 -41.784742
218429 45.664481 9.100180 168.788531 -41.798869
218430 45.674276 9.101238 168.797505 -41.791806
218431 45.720753 9.122757 168.825747 -41.745705
218432 45.708972 9.125572 168.811421 -41.751086
218433 45.699755 9.124834 168.802759 -41.757538
218434 45.704411 9.136586 168.797267 -41.745602
218435 45.803526 9.076716 168.949154 -41.727825
218436 45.777613 9.084311 168.916450 -41.738631
218437 45.801526 9.092938 168.933047 -41.716756
218438 45.805955 9.095252 168.935497 -41.712158
218439 45.802628 9.110668 168.918760 -41.702551
218440 45.776151 9.107903 168.894490 -41.721600
218441 45.779258 9.110364 168.895483 -41.717738
218442 45.765103 9.108320 168.882995 -41.728350
218443 45.791348 9.115565 168.903149 -41.706041
218444 45.782804 9.128334 168.883459 -41.701782
218445 45.845940 9.117229 168.956655 -41.669825
218446 45.818292 9.119688 168.926699 -41.685657
218447 45.831571 9.127799 168.933022 -41.670980
218448 45.840498 9.130125 168.939985 -41.663494
218449 45.836316 9.133721 168.932655 -41.663435
218450 45.857035 9.132760 168.954330 -41.650899
218451 45.845190 9.143289 168.933278 -41.650470
218452 45.845149 9.144765 168.931956 -41.649373

218453 rows × 4 columns

This selects all of the rows, and only the selected columns in the list.

You can also select only a subsample of the rows as well, as in the following example. Let's say I just want a random subsample of 10% of the galaxies in the Gaia DR1 catalogue. I can do that by:

In [38]:
# Defining indices to select from the DataFrame
import random
nrows = len(gaia_df_2)
random_idx = random.sample(np.arange(nrows), int(0.01*nrows))
random_idx = np.sort(random_idx)
In [39]:
gaia_df_3 = gaia_df_2.loc[random_idx,:]
gaia_df_3
Out[39]:
ra dec l b
14 44.952650 0.084952 176.818303 -48.877735
228 44.795829 0.184017 176.544033 -48.921306
232 44.865366 0.191660 176.609452 -48.866135
442 44.855181 0.433960 176.335425 -48.704228
665 45.446312 0.631135 176.743966 -48.143173
690 45.482151 0.686472 176.721784 -48.078930
789 45.067868 0.614153 176.365601 -48.426195
1149 44.715455 0.599169 176.008540 -48.688023
1352 44.737282 0.753932 175.865316 -48.563660
1493 44.859724 0.726785 176.024473 -48.495707
1551 45.178663 0.785090 176.298463 -48.227447
1581 45.140997 0.843021 176.196810 -48.213758
1733 44.936399 0.849941 175.973620 -48.354617
1778 44.713011 0.910525 175.671805 -48.470494
1812 44.806291 0.986470 175.689898 -48.350819
2032 45.556398 0.762355 176.717555 -47.972808
2093 45.857417 0.816131 176.971489 -47.718950
2138 45.821782 0.855618 176.892478 -47.717209
2322 45.828167 1.038208 176.704208 -47.585698
2540 46.206883 1.111540 177.014918 -47.262336
2595 46.177627 1.180242 176.912119 -47.235764
2632 46.280532 1.250090 176.943134 -47.113211
2655 46.246838 1.325347 176.829207 -47.085214
2759 46.082620 1.225213 176.767246 -47.272850
2799 46.132264 1.292491 176.746878 -47.190398
2833 46.005289 1.301752 176.606937 -47.275077
3264 45.223419 1.264838 175.834433 -47.858780
3320 45.401993 1.309549 175.974049 -47.700446
3481 45.715832 1.418924 176.184854 -47.400183
3622 45.847707 1.607167 176.123211 -47.174205
... ... ... ... ...
216100 46.174012 9.278993 169.144967 -41.336419
216147 46.192177 9.404230 169.054645 -41.229654
216352 46.985863 9.170142 170.042727 -40.893077
216364 47.073474 9.184270 170.116200 -40.825312
216436 47.384730 9.300200 170.318366 -40.534940
216652 46.920830 9.388837 169.788374 -40.770972
216660 46.936358 9.433456 169.764838 -40.727296
216764 47.116399 9.455294 169.921985 -40.593850
216815 47.449891 9.406562 170.288801 -40.412708
216823 47.560540 9.403056 170.398984 -40.342827
216832 47.592997 9.417901 170.417410 -40.310449
216878 47.349501 9.462677 170.142536 -40.436337
217112 47.337417 9.564464 170.042356 -40.367906
217281 47.387897 9.736963 169.941869 -40.205558
217321 47.505362 9.816321 169.986914 -40.069499
217529 46.938933 9.592249 169.629726 -40.605975
217576 46.841054 9.616003 169.513240 -40.651387
217607 46.891189 9.635480 169.545573 -40.604265
217633 46.592042 9.516725 169.353984 -40.887191
217722 46.586715 9.603640 169.273644 -40.824826
217990 47.072413 9.762536 169.613070 -40.391149
218013 47.128453 9.773364 169.658358 -40.346689
218021 47.099246 9.810655 169.597742 -40.337499
218062 47.269429 9.837590 169.740069 -40.206915
218259 46.947980 9.954020 169.326830 -40.326949
218276 46.916050 10.010193 169.247453 -40.305015
218349 46.993641 10.008716 169.324492 -40.256162
218356 47.076047 10.062687 169.358552 -40.162243
218368 47.016633 10.071601 169.293029 -40.193793
218437 45.801526 9.092938 168.933047 -41.716756

2184 rows × 4 columns

I'm re-normalizing the indices of this DataFrame

In [40]:
gaia_df_3.reset_index(inplace=True, drop=True)
gaia_df_3
Out[40]:
ra dec l b
0 44.952650 0.084952 176.818303 -48.877735
1 44.795829 0.184017 176.544033 -48.921306
2 44.865366 0.191660 176.609452 -48.866135
3 44.855181 0.433960 176.335425 -48.704228
4 45.446312 0.631135 176.743966 -48.143173
5 45.482151 0.686472 176.721784 -48.078930
6 45.067868 0.614153 176.365601 -48.426195
7 44.715455 0.599169 176.008540 -48.688023
8 44.737282 0.753932 175.865316 -48.563660
9 44.859724 0.726785 176.024473 -48.495707
10 45.178663 0.785090 176.298463 -48.227447
11 45.140997 0.843021 176.196810 -48.213758
12 44.936399 0.849941 175.973620 -48.354617
13 44.713011 0.910525 175.671805 -48.470494
14 44.806291 0.986470 175.689898 -48.350819
15 45.556398 0.762355 176.717555 -47.972808
16 45.857417 0.816131 176.971489 -47.718950
17 45.821782 0.855618 176.892478 -47.717209
18 45.828167 1.038208 176.704208 -47.585698
19 46.206883 1.111540 177.014918 -47.262336
20 46.177627 1.180242 176.912119 -47.235764
21 46.280532 1.250090 176.943134 -47.113211
22 46.246838 1.325347 176.829207 -47.085214
23 46.082620 1.225213 176.767246 -47.272850
24 46.132264 1.292491 176.746878 -47.190398
25 46.005289 1.301752 176.606937 -47.275077
26 45.223419 1.264838 175.834433 -47.858780
27 45.401993 1.309549 175.974049 -47.700446
28 45.715832 1.418924 176.184854 -47.400183
29 45.847707 1.607167 176.123211 -47.174205
... ... ... ... ...
2154 46.174012 9.278993 169.144967 -41.336419
2155 46.192177 9.404230 169.054645 -41.229654
2156 46.985863 9.170142 170.042727 -40.893077
2157 47.073474 9.184270 170.116200 -40.825312
2158 47.384730 9.300200 170.318366 -40.534940
2159 46.920830 9.388837 169.788374 -40.770972
2160 46.936358 9.433456 169.764838 -40.727296
2161 47.116399 9.455294 169.921985 -40.593850
2162 47.449891 9.406562 170.288801 -40.412708
2163 47.560540 9.403056 170.398984 -40.342827
2164 47.592997 9.417901 170.417410 -40.310449
2165 47.349501 9.462677 170.142536 -40.436337
2166 47.337417 9.564464 170.042356 -40.367906
2167 47.387897 9.736963 169.941869 -40.205558
2168 47.505362 9.816321 169.986914 -40.069499
2169 46.938933 9.592249 169.629726 -40.605975
2170 46.841054 9.616003 169.513240 -40.651387
2171 46.891189 9.635480 169.545573 -40.604265
2172 46.592042 9.516725 169.353984 -40.887191
2173 46.586715 9.603640 169.273644 -40.824826
2174 47.072413 9.762536 169.613070 -40.391149
2175 47.128453 9.773364 169.658358 -40.346689
2176 47.099246 9.810655 169.597742 -40.337499
2177 47.269429 9.837590 169.740069 -40.206915
2178 46.947980 9.954020 169.326830 -40.326949
2179 46.916050 10.010193 169.247453 -40.305015
2180 46.993641 10.008716 169.324492 -40.256162
2181 47.076047 10.062687 169.358552 -40.162243
2182 47.016633 10.071601 169.293029 -40.193793
2183 45.801526 9.092938 168.933047 -41.716756

2184 rows × 4 columns

You can produce plots directly from the DataFrame

In [41]:
gaia_df_3.plot('ra','dec',kind='scatter', label='Gaia', title='Right Ascension and Declination for Gaia')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x10faf9350>

Or even Scatterplot Matrices:

In [42]:
sns.pairplot(gaia_df_3)
Out[42]:
<seaborn.axisgrid.PairGrid at 0x1060555d0>
In [43]:
sns.jointplot(gaia_df_3.l, gaia_df_3.b, color='green')
Out[43]:
<seaborn.axisgrid.JointGrid at 0x117a14690>

Indexing, Selecting, Filtering Data

Now I want to filter the data based on ra and dec:

I want to select all the stars within:

  • 45 < RA < 50
  • 5 < Dec < 10

Normally, you would could do in numpy using the np.where function, like in the following example:

In [44]:
ra_arr = gaia_df.ra.values
dec_arr = gaia_df.dec.values
In [45]:
# Just showing the first 25 elements
np.column_stack((ra_arr, dec_arr))[0:25]
Out[45]:
array([[  4.49961522e+01,   5.61634775e-03],
       [  4.50049701e+01,   1.98746513e-02],
       [  4.50043123e+01,   2.10424715e-02],
       [  4.49950374e+01,   3.81519670e-02],
       [  4.49638922e+01,   4.35964570e-02],
       [  4.49983264e+01,   6.63346354e-02],
       [  4.50482820e+01,   4.82585916e-02],
       [  4.50236165e+01,   6.84211252e-02],
       [  4.50267273e+01,   8.17017400e-02],
       [  4.49665454e+01,   4.63102291e-02],
       [  4.49932711e+01,   7.63359005e-02],
       [  4.49690763e+01,   8.44258917e-02],
       [  4.49784589e+01,   9.25808117e-02],
       [  4.49328009e+01,   6.48117207e-02],
       [  4.49526503e+01,   8.49519208e-02],
       [  4.49511545e+01,   1.05309011e-01],
       [  4.49960582e+01,   8.49218778e-02],
       [  4.50137910e+01,   8.77425966e-02],
       [  4.49830971e+01,   9.64104969e-02],
       [  4.50050386e+01,   1.01940363e-01],
       [  4.50955780e+01,   8.53094013e-02],
       [  4.50813558e+01,   8.17060309e-02],
       [  4.51046630e+01,   9.85184628e-02],
       [  4.50961672e+01,   1.05423752e-01],
       [  4.50570288e+01,   1.14981782e-01]])
In [46]:
## Numpy way of finding the stars that meet the criteria
# RA critera
ra_idx = np.where((ra_arr >= 45) & (ra_arr <= 50))[0]
# Dec criteria
dec_idx = np.where((dec_arr >= 5) & (dec_arr <= 10))[0]
# Finding `intersecting' indices that meet both criteria
radec_idx = np.intersect1d(ra_idx, dec_idx)
# Selecting the values from only those indices
ra_new = ra_arr[radec_idx]
dec_new = dec_arr[radec_idx]
# Printing out ra and dec for corresponding indices
print(np.column_stack((ra_new, dec_new)))
[[ 49.64774675   5.00280731]
 [ 49.63224857   5.0081667 ]
 [ 49.58879967   5.00074408]
 ..., 
 [ 45.85703513   9.13276023]
 [ 45.84518961   9.14328854]
 [ 45.84514856   9.14476464]]

This is rather convoluted and long, and one can easily make a mistake if s/he doesn't keep track of which arrays s/he is using!

In Pandas, this is much easier!!

In [47]:
gaia_df_4 = gaia_df.loc[(
                (gaia_df.ra >= 45) & (gaia_df.ra <= 50) &
                (gaia_df.dec >= 5) & (gaia_df.dec <= 10))]
gaia_df_4[['ra','dec']]
Out[47]:
astrometric_n_bad_obs_al astrometric_n_good_obs_al
43777 49.647747 5.002807
43778 49.632249 5.008167
43779 49.588800 5.000744
43780 49.583936 5.015898
43781 49.606410 5.014561
43782 49.626470 5.031774
43784 49.548174 5.000879
43785 49.556817 5.005441
43786 49.545010 5.014518
43787 49.553024 5.020243
43788 49.571743 5.074447
43888 49.927577 5.000644
43889 49.942065 5.004038
43890 49.967965 5.019667
43891 49.967711 5.021598
43892 49.959912 5.028652
43897 49.881441 5.005472
43898 49.896872 5.016831
43899 49.890492 5.013979
43900 49.901816 5.020011
43901 49.855065 5.013263
43902 49.920615 5.021050
43903 49.937499 5.023916
43904 49.934832 5.027709
43905 49.914081 5.034258
43906 49.944793 5.028222
43907 49.943595 5.033689
43908 49.946732 5.039008
43909 49.948072 5.027802
43910 49.953600 5.034959
... ... ...
218423 45.650800 9.081488
218424 45.704786 9.071023
218425 45.699352 9.076520
218426 45.692881 9.075114
218427 45.722289 9.093279
218428 45.689493 9.097753
218429 45.664481 9.100180
218430 45.674276 9.101238
218431 45.720753 9.122757
218432 45.708972 9.125572
218433 45.699755 9.124834
218434 45.704411 9.136586
218435 45.803526 9.076716
218436 45.777613 9.084311
218437 45.801526 9.092938
218438 45.805955 9.095252
218439 45.802628 9.110668
218440 45.776151 9.107903
218441 45.779258 9.110364
218442 45.765103 9.108320
218443 45.791348 9.115565
218444 45.782804 9.128334
218445 45.845940 9.117229
218446 45.818292 9.119688
218447 45.831571 9.127799
218448 45.840498 9.130125
218449 45.836316 9.133721
218450 45.857035 9.132760
218451 45.845190 9.143289
218452 45.845149 9.144765

56515 rows × 2 columns

Future of Pandas

Pandas is a great for handling data, especially comma-delimited or space-separated data. Pandas is also compatible with many other packages, like seaborn, astropy, NumPy, etc.

We will have another lecture on Pandas that will cover much more advanced aspects of Pandas. Make sure you keep checking the schedule!