Week 08 - Introduction to Pandas ¶

Today's Agenda ¶

Pandas: Introduction
- Series
- DataFrames
- Indexing, Selecting, Filtering
- Drop columns
- Handling missing Data

In [1]:

# Importing modules
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context("notebook")
import matplotlib
matplotlib.rc("text", usetex=False)

Series¶

A Series is a one-dimensional, array-like object that holds an array of data together with an associated array of labels, called its index. The data can be of any NumPy data type.

Creating a Series:

In [2]:

np.random.seed(1)

np.random.random(10)

Out[2]:

array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
       1.46755891e-01, 9.23385948e-02, 1.86260211e-01, 3.45560727e-01,
       3.96767474e-01, 5.38816734e-01])

In [3]:

series_1 = pd.Series(np.random.random(10))
series_1

Out[3]:

0    0.419195
1    0.685220
2    0.204452
3    0.878117
4    0.027388
5    0.670468
6    0.417305
7    0.558690
8    0.140387
9    0.198101
dtype: float64

To get the data of a Series as a NumPy array, type:

In [4]:

series_1.values

Out[4]:

array([0.41919451, 0.6852195 , 0.20445225, 0.87811744, 0.02738759,
       0.67046751, 0.4173048 , 0.55868983, 0.14038694, 0.19810149])

Reindexing¶

One can also get the index label of each element by typing:

In [5]:

series_1.index.values

Out[5]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
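As a minimal sketch of the two pieces that make up a Series, the example below pulls out the data and the index labels separately. The Series demo is made up for illustration and is not part of the lecture data; to_numpy is the method that recent pandas versions offer alongside the values attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with the default integer index
demo = pd.Series([0.25, 0.50, 0.75])

values = demo.to_numpy()        # the data, as a NumPy array
labels = demo.index.to_numpy()  # the index labels, as a NumPy array

print(values)
print(labels)
```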

One can also have a custom set of indices:

In [6]:

# Equivalent, using the standard library:
# import string
# alphabet = list(string.ascii_lowercase)[0:10]

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
alphabet

Out[6]:

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [7]:

series_2 = pd.Series(np.random.random(len(alphabet)), index=alphabet)
series_2

Out[7]:

a    0.800745
b    0.968262
c    0.313424
d    0.692323
e    0.876389
f    0.894607
g    0.085044
h    0.039055
i    0.169830
j    0.878143
dtype: float64

One can select only a subsample of the Series by passing a list of index labels:

In [8]:

series_1[[0, 1, 2]]

Out[8]:

0    0.419195
1    0.685220
2    0.204452
dtype: float64

In [9]:

series_1[[1,3,4]]

Out[9]:

1    0.685220
3    0.878117
4    0.027388
dtype: float64

In [10]:

series_2[['a','d','j']]

Out[10]:

a    0.800745
d    0.692323
j    0.878143
dtype: float64

Arithmetic and function Mapping¶

You can also perform arithmetic on a Series; the operation is applied element by element:

In [11]:

series_1**2

Out[11]:

0    0.175724
1    0.469526
2    0.041801
3    0.771090
4    0.000750
5    0.449527
6    0.174143
7    0.312134
8    0.019708
9    0.039244
dtype: float64

In [12]:

series_1[1]**2

Out[12]:

0.4695257637239847

Or select the values that fall within a range, here from 'x' (inclusive) up to 0.8 (exclusive):

In [13]:

x = 0.5
series_1[(series_1 >= x) & (series_1 < 0.8)]

Out[13]:

1    0.685220
5    0.670468
7    0.558690
dtype: float64
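The filter above works in two steps, which the following sketch separates. The values in s are made up for illustration. Each comparison produces a boolean Series, the two are combined element by element with &, and the result is used to pick the matching entries. The parentheses are required because & binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical values, not the random numbers used in the lecture
s = pd.Series([0.42, 0.69, 0.20, 0.88, 0.03])

# Step 1: build the boolean mask
mask = (s >= 0.5) & (s < 0.8)

# Step 2: keep only the entries where the mask is True
selected = s[mask]

print(mask.tolist())
print(selected)
```

Note that the selected entries keep their original index labels.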

You can apply functions to a Series, and save the result as a new Series:

In [14]:

def exponentials(arr, basis=10.):
    """
    Uses the array `arr` as the exponents for `basis`
    
    Parameters
    ----------
    arr: numpy array, list, pandas Series; shape (N,)
        array to be used as exponents of `basis`
    
    basis: int or float, optional (default = 10)
        number used as the basis
    
    Returns
    -------
    exp_arr: list, numpy array, or pandas Series; shape (N,)
        values of `basis`**`arr`, in the same container type as `arr`
    """
    if isinstance(arr, list):
        # Lists do not support element-wise powers, so loop explicitly
        exp_arr = [basis**x for x in arr]
        return exp_arr
    elif isinstance(arr, (np.ndarray, pd.Series)):
        # Arrays and Series are raised element-wise in a single operation
        exp_arr = basis**arr
        return exp_arr
    else:
        # Reject unsupported types with an exception instead of exiting
        msg = "`arr` must be a list, a numpy array, or a pandas Series"
        raise TypeError(msg)

In [15]:

exponentials(series_1[(series_1 >= x) & (series_1 > 0.6)]).values

Out[15]:

array([4.84417139, 7.55296438, 4.68238921])
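A short sketch of two ways to apply a function to every element of a Series, assuming a made-up Series called exponents. The first uses a vectorized expression on the whole Series; the second uses Series.map, which calls a Python function once per element. Both keep the index labels of the input.

```python
import numpy as np
import pandas as pd

# Hypothetical exponents with string labels
exponents = pd.Series([0.0, 1.0, 2.0], index=['x', 'y', 'z'])

# Vectorized: the power is computed for the whole Series at once
vectorized = 10.0 ** exponents

# Element-wise: the lambda is called once for each value
mapped = exponents.map(lambda e: 10.0 ** e)

print(vectorized)
print(np.allclose(vectorized, mapped))
```

For simple arithmetic the vectorized form is usually the faster choice; map is useful when the function cannot be written as an array expression.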

You can also create a Series from a dictionary (we talked about dictionaries in Week 4):

In [16]:

labels_arr = ['foo', 'bar', 'baz']
data_arr   = [100, 200, 300]
dict_1     = dict(zip(labels_arr, data_arr))
dict_1

Out[16]:

{'foo': 100, 'bar': 200, 'baz': 300}

In [17]:

series_3 = pd.Series(dict_1)
series_3

Out[17]:

foo    100
bar    200
baz    300
dtype: int64

Handling Missing Data¶

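One of the most useful features of pandas is that it can handle missing data quite easily. Below, the index asks for a label ('qux') that is not a key of dict_1, so pandas fills that entry with NaN. Because NaN is a floating-point value, the integers are converted to float64: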

In [18]:

index = ['foo', 'bar', 'baz', 'qux']
series_4 = pd.Series(dict_1, index=index)
series_4

Out[18]:

foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64

In [19]:

pd.isnull(series_4)

Out[19]:

foo    False
bar    False
baz    False
qux     True
dtype: bool

In [20]:

series_3

Out[20]:

foo    100
bar    200
baz    300
dtype: int64

In [21]:

series_3 + series_4

Out[21]:

bar    400.0
baz    600.0
foo    200.0
qux      NaN
dtype: float64
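The sum above is aligned on index labels, and any label missing from one side produces NaN. The sketch below, using two made-up Series a and b, shows that behaviour along with one possible way to treat a missing entry as zero instead, through the fill_value parameter of Series.add.

```python
import pandas as pd

# Hypothetical Series that share only the label 'foo'
a = pd.Series({'foo': 100, 'bar': 200, 'baz': 300})
b = pd.Series({'foo': 1.0, 'qux': 4.0})

# Plain addition: labels present on only one side become NaN
total = a + b

# Same addition, but a value missing on one side is treated as 0
filled = a.add(b, fill_value=0)

print(total)
print(filled)
print(total.isnull().sum())  # number of NaN entries in the plain sum
```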

A Series is powerful, but DataFrames are probably used the most, since a DataFrame represents a tabular data structure containing an ordered collection of columns and rows.

DataFrames¶

A DataFrame is a "tabular data structure" containing an ordered collection of columns. Each column can have a different data type.

Row and column operations are treated roughly symmetrically. One can obtain a DataFrame from a normal dictionary, or by reading a file with columns and rows.

Creating a DataFrame

In [22]:

data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'popu' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1

Out[22]:

	state	year	popu
0	VA	2012	5.0
1	VA	2013	5.1
2	VA	2014	5.2
3	MD	2014	4.0
4	MD	2015	4.1

This DataFrame has 5 rows and 3 columns, named "state", "year", and "popu".
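The number of rows and columns, and the data type of each column, can be checked directly. The sketch below uses a smaller made-up table, df_demo, built the same way from a dictionary.

```python
import pandas as pd

# Hypothetical two-row table with the same column names as above
df_demo = pd.DataFrame({'state': ['VA', 'MD'],
                        'year': [2012, 2014],
                        'popu': [5.0, 4.0]})

n_rows, n_cols = df_demo.shape   # (number of rows, number of columns)

print(n_rows, n_cols)
print(df_demo.dtypes)            # one data type per column
```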

The way to access a DataFrame is quite similar to that of accessing a Series.
To access a column, one writes the name of the column, as in the following example:

In [23]:

df_1['popu']

Out[23]:

0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64

In [24]:

df_1.popu

Out[24]:

0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64
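The two access styles shown above are not always interchangeable. As a small sketch with a made-up table, df_demo, attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method, while bracket access always works.

```python
import pandas as pd

# Hypothetical table; the column 'count' clashes with the DataFrame.count method
df_demo = pd.DataFrame({'popu': [5.0, 4.0], 'count': [1, 2]})

same_column = df_demo['popu'].equals(df_demo.popu)  # both styles return the column
count_is_method = callable(df_demo.count)           # attribute access finds the method
count_column = df_demo['count']                     # brackets still reach the column

print(same_column, count_is_method)
print(count_column.tolist())
```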

One can also handle missing data with DataFrames. As with a Series, columns that are requested but not present in the data are filled with NaNs:

In [25]:

df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'popu', 'unempl'])
df_2

Out[25]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	NaN
2	2014	VA	5.2	NaN
3	2014	MD	4.0	NaN
4	2015	MD	4.1	NaN

In [26]:

df_2['state']

Out[26]:

0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

One can retrieve rows by position with iloc. The slice below returns rows 1 through 3, since the end position is excluded:

In [27]:

df_2.iloc[1:4]

Out[27]:

	year	state	popu	unempl
1	2013	VA	5.1	NaN
2	2014	VA	5.2	NaN
3	2014	MD	4.0	NaN
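A brief sketch of the difference between selecting rows by position and by label, using a made-up table whose index consists of letters so the two cannot be confused. A slice with iloc excludes its end position, while a slice with loc includes its end label.

```python
import pandas as pd

# Hypothetical table with string index labels
df_demo = pd.DataFrame({'year': [2012, 2013, 2014]}, index=['a', 'b', 'c'])

first_two = df_demo.iloc[0:2]   # positions 0 and 1; position 2 is excluded
a_to_b = df_demo.loc['a':'b']   # labels 'a' through 'b'; 'b' is included
single = df_demo.iloc[1]        # a single row comes back as a Series

print(first_two)
print(a_to_b)
print(single)
```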

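Editing a DataFrame is quite easy to do. One can assign a Series to a column of the DataFrame; its values are matched to the rows by index label, and rows without a matching label are set to NaN. If the value is a list or an array instead, its length must match the number of rows in the DataFrame.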

In [28]:

unempl = pd.Series([1.0, 2.0, 10.], index=[1,3,5])
unempl

Out[28]:

1     1.0
3     2.0
5    10.0
dtype: float64

In [29]:

df_2['unempl'] = unempl
df_2

Out[29]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	1.0
2	2014	VA	5.2	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN

In [30]:

df_2.unempl.isnull()

Out[30]:

0     True
1    False
2     True
3    False
4     True
Name: unempl, dtype: bool
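The two assignment rules can be seen side by side in the following sketch, which uses a made-up three-row table. A Series is aligned on its index labels, so extra labels are ignored and unmatched rows become NaN. A plain list has no labels, so a length mismatch raises a ValueError.

```python
import pandas as pd

# Hypothetical table with the default index 0, 1, 2
df_demo = pd.DataFrame({'year': [2012, 2013, 2014]})

# Series: label 1 matches a row, label 5 matches nothing and is ignored
df_demo['unempl'] = pd.Series([1.0, 10.0], index=[1, 5])

# List: two values cannot fill three rows
try:
    df_demo['bad'] = [1.0, 2.0]
    length_error = False
except ValueError:
    length_error = True

print(df_demo)
print(length_error)
```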

You can also transpose a DataFrame, i.e. swap its rows and columns:

In [31]:

df_2.T

Out[31]:

	0	1	2	3	4
year	2012	2013	2014	2014	2015
state	VA	VA	VA	MD	MD
popu	5	5.1	5.2	4	4.1
unempl	NaN	1	NaN	2	NaN

Now, let's say you want to show only the 'year' and 'unempl' columns. You can do it by passing a list of column names:

In [32]:

df_2

Out[32]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	1.0
2	2014	VA	5.2	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN

In [33]:

df_2[['year', 'unempl']]

Out[33]:

	year	unempl
0	2012	NaN
1	2013	1.0
2	2014	NaN
3	2014	2.0
4	2015	NaN

Dropping Entries¶

Let's say you only need a subset of the table that you have, and you need to drop a column from the DataFrame. You can do that with the 'drop' method, which returns a new DataFrame and leaves the original unchanged:

In [34]:

df_2

Out[34]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	1.0
2	2014	VA	5.2	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN

In [35]:

df_3 = df_2.drop('unempl', axis=1)
df_3

Out[35]:

	year	state	popu
0	2012	VA	5.0
1	2013	VA	5.1
2	2014	VA	5.2
3	2014	MD	4.0
4	2015	MD	4.1

In [36]:

df_2

Out[36]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	1.0
2	2014	VA	5.2	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN
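As the output above shows, df_2 still has its 'unempl' column after the call to drop. A minimal sketch of that behaviour, using a made-up table: drop returns a new DataFrame, so the result has to be assigned to a name if it is needed later. The columns keyword used here is an alternative spelling of axis=1.

```python
import pandas as pd

# Hypothetical two-column table
df_demo = pd.DataFrame({'year': [2012, 2013], 'popu': [5.0, 5.1]})

# Equivalent to df_demo.drop('popu', axis=1)
trimmed = df_demo.drop(columns='popu')

print(list(trimmed.columns))   # the new table has lost the column
print(list(df_demo.columns))   # the original table is unchanged
```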

You can also drop certain rows:

In [37]:

df_2

Out[37]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
1	2013	VA	5.1	1.0
2	2014	VA	5.2	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN

In [38]:

df_4 = df_2.drop([1,2])
df_4

Out[38]:

	year	state	popu	unempl
0	2012	VA	5.0	NaN
3	2014	MD	4.0	2.0
4	2015	MD	4.1	NaN

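Look at this carefully! The rows of df_4 kept the index labels they had in df_2 (0, 3, and 4).

If you want to reset the indices, you can do that with reset_index. By default, the old labels are stored in a new column called 'index':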

In [39]:

df_4.reset_index(inplace=True)
df_4

Out[39]:

	index	year	state	popu	unempl
0	0	2012	VA	5.0	NaN
1	3	2014	MD	4.0	2.0
2	4	2015	MD	4.1	NaN
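If the old labels are not needed, reset_index accepts drop=True, which discards them instead of adding an 'index' column. A small sketch with a made-up table that has gaps in its index:

```python
import pandas as pd

# Hypothetical table whose index has gaps, as after dropping rows
df_demo = pd.DataFrame({'year': [2012, 2014, 2015]}, index=[0, 3, 4])

kept = df_demo.reset_index()              # old labels saved in an 'index' column
dropped = df_demo.reset_index(drop=True)  # old labels discarded

print(kept)
print(dropped)
```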

Gaia Dataset¶

Pandas is great at reading data tables, CSV files, and other kinds of documents. For the remainder of this notebook, we will be using a file from the Gaia DR2 catalogue.

In [40]:

# Path to online file
url_path = 'http://cdn.gea.esac.esa.int/Gaia/gdr2/gaia_source/csv/GaiaSource_1000172165251650944_1000424567594791808.csv.gz'

# Converting data to DataFrame
gaia_df = pd.read_csv(url_path, compression='gzip')

In [41]:

gaia_df.head()

Out[41]:

	solution_id	designation	source_id	random_index	ref_epoch	ra	ra_error	dec	dec_error	parallax	...	e_bp_min_rp_val	e_bp_min_rp_percentile_lower	e_bp_min_rp_percentile_upper	flame_flags	radius_val	radius_percentile_lower	radius_percentile_upper	lum_val	lum_percentile_lower	lum_percentile_upper
0	1635721458409799680	Gaia DR2 1000225938242805248	1000225938242805248	1197051105	2015.5	103.447529	0.041099	56.022025	0.045175	0.582790	...	0.0595	0.0080	0.1351	200111.0	1.024730	1.017359	1.038814	1.075774	0.801798	1.349751
1	1635721458409799680	Gaia DR2 1000383512003001728	1000383512003001728	598525552	2015.5	105.187856	0.016978	56.267982	0.016904	1.385686	...	0.2430	0.0830	0.4030	200111.0	1.388711	1.311143	1.453106	1.937890	1.852440	2.023341
2	1635721458409799680	Gaia DR2 1000274106300491264	1000274106300491264	299262776	2015.5	103.424758	0.464608	56.450903	0.582490	0.314035	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1635721458409799680	Gaia DR2 1000396156385741312	1000396156385741312	1148557518	2015.5	105.049751	0.838232	56.508777	0.744511	1.939951	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1635721458409799680	Gaia DR2 1000250024419296000	1000250024419296000	574278759	2015.5	103.352525	0.023159	56.395144	0.022836	0.747108	...	0.2870	0.1196	0.4051	200111.0	1.507958	1.435618	1.540208	2.427377	2.152597	2.702158

5 rows × 94 columns
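With 94 columns, it can be convenient to read only part of a file. The sketch below assumes a tiny made-up CSV held in memory (csv_text), so that it runs without a network connection; usecols and nrows are parameters of read_csv that limit the columns and the number of rows that are read.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large catalogue file
csv_text = (
    "ra,dec,parallax\n"
    "103.4,56.0,0.58\n"
    "105.2,56.3,1.39\n"
    "103.4,56.5,0.31\n"
)

# Read only two of the columns and only the first two data rows
small_df = pd.read_csv(io.StringIO(csv_text), usecols=['ra', 'dec'], nrows=2)

print(small_df)
```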

Shape, Columns and Rows¶

You can get the shape of the "gaia_df" DataFrame by typing:

In [42]:

gaia_df.shape

Out[42]:

(14209, 94)

That means there are 14209 rows and 94 columns.

To get an array of the columns available, one could write:

In [43]:

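# `np.sort` returns a sorted copy; sorting `gaia_df.columns.values` in place
# would scramble the column labels of the DataFrame itself
np.sort(gaia_df.columns.values)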

Out[43]:

array(['a_g_percentile_lower', 'a_g_percentile_upper', 'a_g_val',
       'astrometric_chi2_al', 'astrometric_excess_noise',
       'astrometric_excess_noise_sig', 'astrometric_gof_al',
       'astrometric_matched_observations', 'astrometric_n_bad_obs_al',
       'astrometric_n_good_obs_al', 'astrometric_n_obs_ac',
       'astrometric_n_obs_al', 'astrometric_params_solved',
       'astrometric_primary_flag', 'astrometric_pseudo_colour',
       'astrometric_pseudo_colour_error', 'astrometric_sigma5d_max',
       'astrometric_weight_al', 'b', 'bp_g', 'bp_rp', 'dec', 'dec_error',
       'dec_parallax_corr', 'dec_pmdec_corr', 'dec_pmra_corr',
       'designation', 'duplicated_source', 'e_bp_min_rp_percentile_lower',
       'e_bp_min_rp_percentile_upper', 'e_bp_min_rp_val', 'ecl_lat',
       'ecl_lon', 'flame_flags', 'frame_rotator_object_type', 'g_rp', 'l',
       'lum_percentile_lower', 'lum_percentile_upper', 'lum_val',
       'matched_observations', 'mean_varpi_factor_al', 'parallax',
       'parallax_error', 'parallax_over_error', 'parallax_pmdec_corr',
       'parallax_pmra_corr', 'phot_bp_mean_flux',
       'phot_bp_mean_flux_error', 'phot_bp_mean_flux_over_error',
       'phot_bp_mean_mag', 'phot_bp_n_obs', 'phot_bp_rp_excess_factor',
       'phot_g_mean_flux', 'phot_g_mean_flux_error',
       'phot_g_mean_flux_over_error', 'phot_g_mean_mag', 'phot_g_n_obs',
       'phot_proc_mode', 'phot_rp_mean_flux', 'phot_rp_mean_flux_error',
       'phot_rp_mean_flux_over_error', 'phot_rp_mean_mag',
       'phot_rp_n_obs', 'phot_variable_flag', 'pmdec', 'pmdec_error',
       'pmra', 'pmra_error', 'pmra_pmdec_corr', 'priam_flags', 'ra',
       'ra_dec_corr', 'ra_error', 'ra_parallax_corr', 'ra_pmdec_corr',
       'ra_pmra_corr', 'radial_velocity', 'radial_velocity_error',
       'radius_percentile_lower', 'radius_percentile_upper', 'radius_val',
       'random_index', 'ref_epoch', 'rv_nb_transits', 'rv_template_fe_h',
       'rv_template_logg', 'rv_template_teff', 'solution_id', 'source_id',
       'teff_percentile_lower', 'teff_percentile_upper', 'teff_val',
       'visibility_periods_used'], dtype=object)

Let's say you only want a DataFrame with the columns:

ra (right ascension)
dec (declination)
l (galactic longitude)
b (galactic latitude)

You do this by using the loc indexer of the DataFrame:

In [44]:

gaia_df_2 = gaia_df.loc[:,['ra','dec','l','b']]

# Displaying the first 15 lines
gaia_df_2.head(15)

Out[44]:

	ra	dec	l	b
0	103.447529	56.022025	160.163475	22.533932
1	105.187856	56.267982	160.174346	23.534087
2	103.424758	56.450903	159.712110	22.635989
3	105.049751	56.508777	159.899324	23.518554
4	103.352525	56.395144	159.758838	22.582657
5	101.929791	55.973333	159.959619	21.705035
6	101.853926	56.129320	159.785705	21.709303
7	105.128850	56.285081	160.147553	23.506483
8	103.396330	56.714410	159.432230	22.690559
9	101.780437	55.945333	159.962507	21.616907
10	103.500366	56.844629	159.312220	22.779748
11	105.649481	56.632527	159.854367	23.868954
12	103.189617	56.815154	159.294467	22.607972
13	103.530144	55.988175	160.212164	22.569395
14	105.938862	56.588687	159.941603	24.013615

This selects all of the rows, and only the selected columns in the list.

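You can also select a random subsample of the rows. The sample method draws the requested number of rows at random, without replacement. In the example below that number equals the total number of rows, so the result is the whole table in a random order. To keep only 10% of the stars in this Gaia DR2 file, one would pass frac=0.1 instead.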

In [45]:

import random
random.sample

Out[45]:

<bound method Random.sample of <random.Random object at 0x7fb45d83e618>>

In [46]:

# Number of rows
nrows = len(gaia_df_2)

# Randomly drawing `nrows` rows from `gaia_df_2` without replacement,
# which returns every row in a random order
gaia_df_3 = gaia_df_2.sample(nrows)

gaia_df_3.shape

Out[46]:

(14209, 4)
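A minimal sketch of drawing a true 10% subsample, using a made-up table of 100 rows. The frac parameter sets the fraction of rows to draw, and random_state (an arbitrary seed chosen here) makes the draw reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical table with 100 rows
df_demo = pd.DataFrame({'ra': np.arange(100.0), 'dec': np.arange(100.0)})

subsample = df_demo.sample(frac=0.1, random_state=1)  # 10% of the rows
shuffled = df_demo.sample(frac=1.0, random_state=1)   # all rows, random order

print(subsample.shape)
print(shuffled.shape)
```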

I'm resetting the indices of this DataFrame, and discarding the old ones with drop=True:

In [47]:

gaia_df_3.reset_index(inplace=True, drop=True)
gaia_df_3

Out[47]:

	ra	dec	l	b
0	102.023218	56.433992	159.500110	21.886331
1	105.272250	56.502381	159.938521	23.636142
2	105.087569	56.445058	159.972272	23.523299
3	103.359287	56.193791	159.970039	22.532502
4	103.240760	56.630222	159.495644	22.585832
5	103.222515	56.549474	159.577022	22.554616
6	103.361986	56.583872	159.563251	22.637842
7	103.775851	56.168097	160.063471	22.749245
8	101.836603	56.119531	159.792805	21.697240
9	102.678893	56.936834	159.086470	22.371832
10	105.667572	56.619706	159.870549	23.875620
11	102.236439	55.945081	160.041579	21.861470
12	102.433467	56.518693	159.480950	22.127565
13	105.105611	56.486566	159.931001	23.543060
14	105.362118	56.534916	159.917022	23.692076
15	102.368405	56.455920	159.535245	22.075601
16	103.676978	56.307551	159.901923	22.732851
17	102.504898	57.212852	158.771444	22.355833
18	101.804305	55.852613	160.062130	21.602850
19	102.119267	56.169582	159.789621	21.862448
20	102.071402	56.730196	159.201815	21.995346
21	102.774558	56.830031	159.212977	22.393163
22	102.137264	56.350726	159.605436	21.923322
23	103.921963	56.655100	159.575937	22.953848
24	102.932901	56.335646	159.753100	22.342795
25	105.385988	56.643913	159.804840	23.730851
26	105.542361	56.377615	160.109832	23.751272
27	102.647940	56.268813	159.775683	22.172426
28	103.711527	56.349395	159.863591	22.762315
29	103.597854	56.158071	160.045628	22.650994
...	...	...	...	...
14179	102.489214	56.081397	159.943479	22.035450
14180	102.147983	56.711074	159.234323	22.030321
14181	103.716786	56.159878	160.062696	22.715352
14182	105.252391	56.393553	160.050906	23.599250
14183	102.596930	56.310072	159.724415	22.156631
14184	105.268657	56.207321	160.250440	23.562900
14185	102.044837	56.476426	159.459891	21.909801
14186	102.284147	56.917347	159.042894	22.159330
14187	102.602635	57.186376	158.814544	22.399603
14188	102.825593	56.441695	159.625292	22.314552
14189	101.982163	56.314866	159.616222	21.830699
14190	102.244534	56.641037	159.322869	22.061669
14191	102.608435	57.217545	158.783009	22.411067
14192	102.743681	57.158952	158.865450	22.465819
14193	102.777462	56.444267	159.614761	22.289662
14194	105.891445	56.646945	159.872951	24.001706
14195	102.788402	56.867407	159.176279	22.410572
14196	102.019136	56.186270	159.755308	21.813762
14197	103.381823	56.792650	159.348148	22.703505
14198	102.970067	56.429717	159.661250	22.388156
14199	101.940300	55.874246	160.063572	21.682161
14200	102.951998	56.286436	159.807412	22.339588
14201	103.637537	56.235559	159.970962	22.692768
14202	102.089903	55.881491	160.082031	21.764630
14203	103.215704	56.428451	159.702186	22.518617
14204	102.642189	56.958484	159.058057	22.358449
14205	102.233121	57.066272	158.880057	22.173822
14206	104.919145	56.263134	160.139589	23.388225
14207	103.295677	56.738090	159.391665	22.643558
14208	101.923237	56.098917	159.828998	21.737590

14209 rows × 4 columns

You can produce plots directly from the DataFrame

In [48]:

title_txt = 'Right Ascension and Declination for Gaia'

gaia_df_3.plot('ra','dec',       # Columns to plot
               kind='scatter',   # Kind of plot. In this case, it's `scatter`
               label='Gaia',     # Label of the points
               title=title_txt,  # Title of the figure
               color='#4c72b0',  # Color of the points
               figsize=(12,8))  # Size of the figure

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a16cd80f0>

Or even Scatterplot Matrices:

In [49]:

sns.pairplot(gaia_df_3, plot_kws={'color': '#4c72b0'}, diag_kws={'color': '#4c72b0'})

Out[49]:

<seaborn.axisgrid.PairGrid at 0x1a1760b438>

In [50]:

sns.jointplot(x='l', y='b', data=gaia_df_3, color='#3c8f40')

Out[50]:

<seaborn.axisgrid.JointGrid at 0x1a193999e8>

Indexing, Selecting, Filtering Data¶

Now I want to filter the data based on ra and dec.

I want to select all the stars within:

102 <= RA <= 104
56.4 <= Dec <= 56.7

Normally, you could do this in NumPy using the np.where function, like in the following example:

In [51]:

ra_arr = gaia_df.ra.values
dec_arr = gaia_df.dec.values

In [52]:

# Just showing the first 25 elements
np.column_stack((ra_arr, dec_arr))[0:25]

Out[52]:

array([[103.44752895,  56.02202543],
       [105.18785594,  56.2679821 ],
       [103.42475813,  56.45090293],
       [105.04975071,  56.50877738],
       [103.35252488,  56.39514381],
       [101.92979073,  55.97333308],
       [101.85392576,  56.12931976],
       [105.12884963,  56.28508092],
       [103.39632957,  56.7144103 ],
       [101.78043734,  55.94533326],
       [103.50036565,  56.84462941],
       [105.64948082,  56.63252739],
       [103.18961712,  56.81515376],
       [103.53014423,  55.98817459],
       [105.93886175,  56.58868695],
       [102.24453393,  56.64103702],
       [102.02432422,  56.0158414 ],
       [103.0848673 ,  56.25264172],
       [102.30172541,  56.6658301 ],
       [103.28806439,  56.2536194 ],
       [103.05853189,  56.72134178],
       [102.0578448 ,  56.39003547],
       [103.71357369,  56.62167297],
       [103.37050966,  56.11391562],
       [103.44554963,  56.29543043]])

In [53]:

## Numpy way of finding the stars that meet the criteria

ra_min, ra_max = (102, 104)
dec_min, dec_max = (56.4, 56.7)

# RA criteria
ra_idx = np.where((ra_arr >= ra_min) & (ra_arr <= ra_max))[0]

# Dec criteria
dec_idx = np.where((dec_arr >= dec_min) & (dec_arr <= dec_max))[0]

# Finding the `intersecting` indices that meet both criteria
radec_idx = np.intersect1d(ra_idx, dec_idx)

# Selecting the values from only those indices
ra_new = ra_arr[radec_idx]
dec_new = dec_arr[radec_idx]

# Printing out ra and dec for corresponding indices
print(np.column_stack((ra_new, dec_new)))

[[103.42475813  56.45090293]
 [102.24453393  56.64103702]
 [102.30172541  56.6658301 ]
 ...
 [103.81978156  56.62006303]
 [103.31712396  56.61945699]
 [103.57468884  56.4318757 ]]

This is rather convoluted and long, and it is easy to make a mistake if you don't keep track of which arrays you are using!

In Pandas, this is much easier:

In [54]:

gaia_df_4 = gaia_df.loc[(
                (gaia_df.ra >= ra_min) & (gaia_df.ra <= ra_max) &
                (gaia_df.dec >= dec_min) & (gaia_df.dec <= dec_max))]
gaia_df_4[['ra','dec']]

Out[54]:

	ra	dec
2	103.424758	56.450903
15	102.244534	56.641037
18	102.301725	56.665830
22	103.713574	56.621673
32	103.462653	56.532017
35	103.360486	56.413851
36	103.676326	56.623116
39	103.085718	56.447884
40	103.783947	56.521326
46	102.206913	56.640966
54	103.726192	56.673523
57	102.069819	56.405446
63	102.920433	56.430698
64	102.975403	56.593006
69	103.870145	56.684652
73	103.201988	56.538079
74	102.256910	56.592725
78	103.819070	56.577742
79	103.797238	56.403185
81	102.342651	56.599794
84	103.661779	56.492773
89	103.637186	56.622849
92	102.241080	56.684968
98	102.843411	56.658409
103	102.898776	56.470195
105	103.480558	56.617093
110	103.236574	56.509793
112	103.633321	56.650588
113	103.776568	56.622350
114	102.260166	56.659839
...	...	...
14066	102.337580	56.565822
14068	103.495447	56.455223
14069	103.408320	56.628237
14071	103.436448	56.564892
14075	102.350965	56.622774
14078	103.267697	56.473831
14080	102.227288	56.665774
14081	102.962817	56.594834
14088	103.021430	56.620846
14090	103.863479	56.469286
14104	102.954900	56.594641
14114	103.442994	56.647008
14119	102.902805	56.627175
14130	102.974802	56.451924
14135	103.838059	56.442466
14138	102.619619	56.504853
14139	103.687508	56.489265
14141	103.495952	56.632908
14142	103.412026	56.620438
14143	102.714416	56.661466
14161	102.844313	56.595309
14167	103.724074	56.430243
14168	103.211444	56.645173
14173	102.584448	56.539904
14180	103.682558	56.585142
14195	102.307805	56.625449
14200	103.733646	56.568726
14201	103.819782	56.620063
14204	103.317124	56.619457
14207	103.574689	56.431876

3156 rows × 2 columns
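The same kind of range filter can be written a little more compactly with Series.between, which includes both end points by default. The sketch below uses a made-up table of four stars and the same limits as above, and also recovers the matching row positions with np.where for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates, not taken from the Gaia file
df_demo = pd.DataFrame({'ra': [103.4, 105.2, 102.2, 103.7],
                        'dec': [56.45, 56.27, 56.64, 56.10]})

ra_min, ra_max = (102, 104)
dec_min, dec_max = (56.4, 56.7)

# One boolean mask that combines both criteria
mask = (df_demo.ra.between(ra_min, ra_max) &
        df_demo.dec.between(dec_min, dec_max))

selected = df_demo.loc[mask]              # pandas selection
positions = np.where(mask.to_numpy())[0]  # positions of the matching rows

print(selected)
print(positions)
```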

Future of Pandas ¶

Pandas is great for handling data, especially comma-delimited or space-separated data. Pandas is also compatible with many other packages, like seaborn, astropy, NumPy, etc.

We will have another lecture on Pandas that will cover much more advanced aspects of Pandas. Make sure you keep checking the schedule!