The name PyData is commonly used to refer to the Python libraries used for scientific computing. Strictly speaking, though, that is not its definition: the collection of libraries is the "stack", while PyData is a program backed by a non-profit organization that seeks to support the use and development of open source software, and in particular the use and implementation of the Python stack for scientific computing.
That organization is NumFOCUS; through its PyData program it gives its name to the events dedicated to teaching and promoting the use of Python technologies.
In our case, following common usage, we will say PyData to refer to the set of libraries that make up the Python ecosystem for scientific computing.
The evolution and current state of the stack can be seen in the following presentation by Travis Oliphant:
To get an idea of the relationship between projects, it is illustrative to explore the list of related projects across four libraries:
What is the relationship between the libraries?
A quick way to explore this is through the documentation of the projects, Pandas and NumPy, straight from each library itself. This can be done with a single command in the console.
#Load Pandas and NumPy
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
Pandas version: 0.22.0
NumPy version: 1.14.6
#Print the Pandas documentation
print(pd.__doc__)
pandas - a powerful data analysis and manipulation library for Python ===================================================================== **pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on its way toward this goal. Main Features ------------- Here are just a few of the things that pandas does well: - Easy handling of missing data in floating point as well as non-floating point data - Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects - Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let `Series`, `DataFrame`, etc. automatically align the data for you in computations - Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data - Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets - Intuitive merging and joining data sets - Flexible reshaping and pivoting of data sets - Hierarchical labeling of axes (possible to have multiple labels per tick) - Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format - Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
#Print the NumPy documentation
print(np.__doc__)
NumPy ===== Provides 1. An array object of arbitrary homogeneous items 2. Fast mathematical operations over arrays 3. Linear Algebra, Fourier Transforms, Random Number Generation How to use the documentation ---------------------------- Documentation is available in two forms: docstrings provided with the code, and a loose standing reference guide, available from `the NumPy homepage <http://www.scipy.org>`_. We recommend exploring the docstrings using `IPython <http://ipython.scipy.org>`_, an advanced Python shell with TAB-completion and introspection capabilities. See below for further instructions. The docstring examples assume that `numpy` has been imported as `np`:: >>> import numpy as np Code snippets are indicated by three greater-than signs:: >>> x = 42 >>> x = x + 1 Use the built-in ``help`` function to view a function's docstring:: >>> help(np.sort) ... # doctest: +SKIP For some objects, ``np.info(obj)`` may provide additional help. This is particularly true if you see the line "Help on ufunc object:" at the top of the help() page. Ufuncs are implemented in C, not Python, for speed. The native Python help() does not know how to view their help, but our np.info() function does. To search for documents containing a keyword, do:: >>> np.lookfor('keyword') ... # doctest: +SKIP General-purpose documents like a glossary and help on the basic concepts of numpy are available under the ``doc`` sub-module:: >>> from numpy import doc >>> help(doc) ... # doctest: +SKIP Available subpackages --------------------- doc Topical documentation on broadcasting, indexing, etc. lib Basic functions used by several sub-packages. random Core Random Tools linalg Core Linear Algebra Tools fft Core FFT routines polynomial Polynomial tools testing NumPy testing tools f2py Fortran to Python Interface Generator. distutils Enhancements to distutils with support for Fortran compilers support and more. 
Utilities --------- test Run numpy unittests show_config Show numpy build configuration dual Overwrite certain functions with high-performance Scipy tools matlib Make everything matrices. __version__ NumPy version string Viewing documentation using IPython ----------------------------------- Start IPython with the NumPy profile (``ipython -p numpy``), which will import `numpy` under the alias `np`. Then, use the ``cpaste`` command to paste examples into the shell. To see which functions are available in `numpy`, type ``np.<TAB>`` (where ``<TAB>`` refers to the TAB key), or use ``np.*cos*?<ENTER>`` (where ``<ENTER>`` refers to the ENTER key) to narrow down the list. To view the docstring for a function, use ``np.cos?<ENTER>`` (to view the docstring) and ``np.cos??<ENTER>`` (to view the source code). Copies vs. in-place operation ----------------------------- Most of the functions in `numpy` return a copy of the array argument (e.g., `np.sort`). In-place versions of these functions are often available as array methods, i.e. ``x = np.array([1,2,3]); x.sort()``. Exceptions to this rule are documented.
As mentioned in Lesson 1, the environment helps speed up the coding process and, above all, the development of a data analysis. By environment I mean Jupyter and its derivatives.
Jupyter, as the heir of IPython, has several interesting features:
The following commands are examples of what magic commands can do.
#To find out what is currently in the working session,
# the following command can be used.
%whos
Variable Type Data/Info ------------------------------ np module <module 'numpy' from '/us<...>kages/numpy/__init__.py'> pd module <module 'pandas' from '/u<...>ages/pandas/__init__.py'>
# To see the complete list of available magic commands
%lsmagic
Available line magics: %alias %alias_magic %autocall %automagic %autosave %bookmark %cat %cd %clear %colors %config %connect_info %cp %debug %dhist %dirs %doctest_mode %ed %edit %env %gui %hist %history %killbgscripts %ldir %less %lf %lk %ll %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls %lsmagic %lx %macro %magic %man %matplotlib %mkdir %more %mv %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %popd %pprint %precision %profile %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %rm %rmdir %run %save %sc %set_env %shell %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode Available cell magics: %%! %%HTML %%SVG %%bash %%bigquery %%capture %%debug %%file %%html %%javascript %%js %%latex %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %%script %%sh %%shell %%svg %%sx %%system %%time %%timeit %%writefile Automagic is ON, % prefix IS NOT needed for line magics.
# If you have questions about how any object works
?%lsmagic
?pd
#System shell commands can also be used
!ls -l -h
total 4.0K drwxr-xr-x 2 root root 4.0K Nov 29 18:21 sample_data
#Show the current working directory
!pwd
/content
# Same as above, but using a magic command
%pwd
'/content'
#To see the running processes
!top
=top - 00:22:58 up 1:11, 0 users, load average: 0.00, 0.01, 0.00 Tasks: 11 total, 1 running, 10 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.5 us, 0.3 sy, 0.0 ni, 98.9 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 13335212 total, 11335820 free, 678956 used, 1320436 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12470684 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 20 0 39196 6312 4828 S 0.0 0.0 0:00.05 run.sh 7 root 20 0 679236 44192 24428 S 0.0 0.3 0:01.15 node 28 root 20 0 683112 55048 24980 S 0.0 0.4 0:03.21 node 53 root 20 0 187712 58848 12520 S 0.0 0.4 0:05.01 jupyter-+ 60 root 20 0 783272 206804 40060 S 0.0 1.6 0:06.78 python3 82 root 20 0 54376 14556 7516 S 0.0 0.1 0:00.07 python3 108 root 20 0 711676 135524 39976 S 0.0 1.0 0:02.46 python3 124 root 20 0 54376 14572 7536 S 0.0 0.1 0:00.06 python3 188 root 20 0 720332 136040 40020 S 0.0 1.0 0:02.35 python3 204 root 20 0 54376 14532 7496 S 0.0 0.1 0:00.07 python3 256 root 20 0 61088 6632 4948 R 0.0 0.0 0:00.01 top top - 00:23:01 up 1:11, 0 users, load average: 0.00, 0.01, 0.00 Tasks: 11 total, 1 running, 10 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.8 us, 0.5 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 13335212 total, 11335636 free, 679064 used, 1320512 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12470556 avail Mem 188 root 20 0 720332 136040 40020 S 1.0 1.0 0:02.38 python3 1 root 20 0 39196 6312 4828 S 0.0 0.0 0:00.05 run.sh 7 root 20 0 679236 44192 24428 S 0.0 0.3 0:01.15 node 28 root 20 0 683112 55048 24980 S 0.0 0.4 0:03.21 node 53 root 20 0 187712 58848 12520 S 0.0 0.4 0:05.01 jupyter-+ 60 root 20 0 783272 206804 40060 S 0.0 1.6 0:06.78 python3 82 root 20 0 54376 14556 7516 S 0.0 0.1 0:00.07 python3 108 root 20 0 711676 135524 39976 S 0.0 1.0 0:02.46 python3 124 root 20 0 54376 14572 7536 S 0.0 0.1 0:00.06 python3 >
Simple as they may seem, these commands help you concentrate on your code and make your work flow more smoothly.
You can also develop your own magic commands. Examples beyond the standard ones include the following projects:
The following example illustrates a common problem: hitting an error and trying to make sense of the chain of error messages. For this, there is a magic command that tells us where the origin of the problem lies.
#Try to run the pandas test suite
pd.test()
running: pytest --skip-slow --skip-network /usr/local/lib/python3.6/dist-packages/pandas ============================= test session starts ============================== platform linux -- Python 3.6.7, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 rootdir: /usr/local/lib/python3.6/dist-packages/pandas, inifile: collected 0 items / 1 errors ==================================== ERRORS ==================================== ______________________________ ERROR collecting _______________________________ /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:430: in _importconftest return self._conftestpath2mod[conftestpath] E KeyError: local('/usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py') During handling of the above exception, another exception occurred: /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:436: in _importconftest mod = conftestpath.pyimport() /usr/local/lib/python3.6/dist-packages/py/_path/local.py:668: in pyimport __import__(modname) /usr/local/lib/python3.6/dist-packages/_pytest/assertion/rewrite.py:294: in load_module six.exec_(co, mod.__dict__) /usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py:3: in <module> import moto E ModuleNotFoundError: No module named 'moto' During handling of the above exception, another exception occurred: /usr/local/lib/python3.6/dist-packages/py/_path/common.py:377: in visit for x in Visitor(fil, rec, ignore, bf, sort).gen(self): /usr/local/lib/python3.6/dist-packages/py/_path/common.py:429: in gen for p in self.gen(subdir): /usr/local/lib/python3.6/dist-packages/py/_path/common.py:418: in gen dirs = self.optsort([p for p in entries /usr/local/lib/python3.6/dist-packages/py/_path/common.py:419: in <listcomp> if p.check(dir=1) and (rec is None or rec(p))]) /usr/local/lib/python3.6/dist-packages/_pytest/main.py:601: in _recurse ihook = self.gethookproxy(dirpath) /usr/local/lib/python3.6/dist-packages/_pytest/main.py:418: in gethookproxy my_conftestmodules = 
pm._getconftestmodules(fspath) /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:414: in _getconftestmodules mod = self._importconftest(conftestpath) /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:453: in _importconftest raise ConftestImportFailure(conftestpath, sys.exc_info()) E _pytest.config.ConftestImportFailure: (local('/usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py'), (<class 'ModuleNotFoundError'>, ModuleNotFoundError("No module named 'moto'",), <traceback object at 0x7f16c248b4c8>)) !!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!! =========================== 1 error in 0.64 seconds ============================
An exception has occurred, use %tb to see the full traceback. SystemExit: 2
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D. warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
# Ask to see the traceback, i.e. the reason for the problem.
%tb
--------------------------------------------------------------------------- SystemExit Traceback (most recent call last) <ipython-input-11-f37ca8985d3b> in <module>() ----> 1 pd.test() /usr/local/lib/python3.6/dist-packages/pandas/util/_tester.py in test(extra_args) 20 cmd += [PKG] 21 print("running: pytest {}".format(' '.join(cmd))) ---> 22 sys.exit(pytest.main(cmd)) 23 24 SystemExit: 2
To learn more about the features of Jupyter and IPython, see the following link:
Pandas is a high-performance library for manipulating and processing structured data.
In general, it consists of the following elements:
The two fundamental data structures in Pandas are:
One way to understand the relationship between Pandas objects is to think of them as containers of lower-dimensional data structures: DataFrames can be thought of as containers of Series, and Series as containers of numbers or strings.
Advanced note: all Pandas objects are mutable (the values they contain can be altered), but all methods and functions produce new objects, leaving the original objects unchanged.
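A minimal sketch of this behavior (the names here are illustrative): a sorting method returns a new object and leaves the original untouched, while direct assignment mutates in place.

```python
import pandas as pd

# A small Series: sort_values() returns a NEW, sorted object...
s = pd.Series([3, 1, 2])
s_sorted = s.sort_values()

# ...while the original Series is left unchanged.
print(list(s))         # [3, 1, 2]
print(list(s_sorted))  # [1, 2, 3]

# The objects are nevertheless mutable: values can be altered in place.
s[0] = 99
print(list(s))         # [99, 1, 2]
```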
The following Pandas "tour" is a version of the one found on the official page:
# Adjust the environment to make
# the review faster
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import matplotlib.pyplot as plt
We define a Series and a DataFrame.
#Define the Series s
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
#Define a DataFrame
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
#Display them
s
df
0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -2.662312 | -0.930931 | -1.069079 | 1.103445 |
2013-01-02 | 2.770103 | -0.032381 | -0.114339 | 0.289050 |
2013-01-03 | -0.742892 | -1.425946 | 0.994186 | -0.266048 |
2013-01-04 | -0.891284 | -0.695291 | -0.924549 | 1.942704 |
2013-01-05 | 0.426666 | 0.380999 | 0.486457 | -1.035013 |
2013-01-06 | -0.211796 | 0.351017 | 1.557697 | -1.496206 |
Another DataFrame is created; as an example, it is built with columns of different data types.
#Create the DF from Series of different types
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
print("Display the DF")
df2
print("="*50)
print("What data types does it have?")
df2.dtypes
#For a complete summary of the DF
print()
print("="*50)
print("Summary")
df2.info()
Display the DF
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
================================================== What data types does it have?
A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
================================================== Summary <class 'pandas.core.frame.DataFrame'> Int64Index: 4 entries, 0 to 3 Data columns (total 6 columns): A 4 non-null float64 B 4 non-null datetime64[ns] C 4 non-null float32 D 4 non-null int32 E 4 non-null category F 4 non-null object dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1) memory usage: 260.0+ bytes
For a long DataFrame, it is advisable to display only a few rows.
#Show the first 3 rows
df.head(3)
#Show the last 3 rows
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -2.662312 | -0.930931 | -1.069079 | 1.103445 |
2013-01-02 | 2.770103 | -0.032381 | -0.114339 | 0.289050 |
2013-01-03 | -0.742892 | -1.425946 | 0.994186 | -0.266048 |
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | -0.891284 | -0.695291 | -0.924549 | 1.942704 |
2013-01-05 | 0.426666 | 0.380999 | 0.486457 | -1.035013 |
2013-01-06 | -0.211796 | 0.351017 | 1.557697 | -1.496206 |
In general, every DataFrame has 3 elements: index, columns, and values.
#Display the index
df.index
print("\n")
#Inspect the columns
print("Columns\n")
df.columns
#Display the values
print("\n")
df.values
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
Columns
Index(['A', 'B', 'C', 'D'], dtype='object')
array([[-2.66231197, -0.93093083, -1.06907874, 1.10344508], [ 2.77010318, -0.03238136, -0.11433878, 0.28905019], [-0.74289222, -1.42594597, 0.99418636, -0.26604766], [-0.89128408, -0.69529057, -0.92454875, 1.94270435], [ 0.4266665 , 0.38099875, 0.48645678, -1.03501286], [-0.2117956 , 0.35101741, 1.55769736, -1.49620624]])
The commands above show the relationship that exists between NumPy and Pandas.
type(df.index)
print()
type(df.columns)
print()
type(df.values)
pandas.core.indexes.datetimes.DatetimeIndex
pandas.core.indexes.base.Index
numpy.ndarray
Note that the type of the values of a DataFrame is a NumPy array. This implies that a DataFrame inherits every function that a NumPy array has.
df.values.argmax()
print()
df.values.diagonal()
7
array([-0.05463496, 0.40983999, -1.13152355, -0.44288347])
The same happens with Series.
type(s)
print()
type(s.values)
pandas.core.series.Series
numpy.ndarray
Some useful functions for exploring the data in a DataFrame are the following:
#Ask for the basic statistics
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.469452 | 0.142528 | -0.765056 | 0.125801 |
std | 0.793282 | 0.879375 | 0.636024 | 1.171310 |
min | -0.797758 | -1.191412 | -1.604023 | -1.657919 |
25% | 0.044665 | -0.090017 | -1.137977 | -0.418734 |
50% | 0.666154 | 0.127492 | -0.872436 | 0.128347 |
75% | 1.099234 | 0.349461 | -0.194710 | 0.920442 |
max | 1.201068 | 1.530666 | -0.046149 | 1.572652 |
#Perhaps the data needs to be
# sorted in some way.
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | -1.657919 | -1.140128 | 0.168322 | -0.054635 |
2013-01-02 | 1.572652 | -0.613349 | 0.409840 | 0.342567 |
2013-01-03 | 0.602979 | -1.131524 | -0.148911 | -0.797758 |
2013-01-04 | -0.442883 | -1.604023 | 1.530666 | 1.135732 |
2013-01-05 | -0.346286 | -0.046149 | -1.191412 | 1.201068 |
2013-01-06 | 1.026263 | -0.055164 | 0.086662 | 0.989741 |
A | B | C | D | |
---|---|---|---|---|
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
Since DataFrames are two-dimensional arrays whose values are a NumPy array, matrix operations can be applied to them.
#Compute the transpose
df.T
print()
df.values.diagonal()
print()
df.values.max()
2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 | |
---|---|---|---|---|---|---|
A | -0.054635 | 0.342567 | -0.797758 | 1.135732 | 1.201068 | 0.989741 |
B | 0.168322 | 0.409840 | -0.148911 | 1.530666 | -1.191412 | 0.086662 |
C | -1.140128 | -0.613349 | -1.131524 | -1.604023 | -0.046149 | -0.055164 |
D | -1.657919 | 1.572652 | 0.602979 | -0.442883 | -0.346286 | 1.026263 |
array([-0.05463496, 0.40983999, -1.13152355, -0.44288347])
1.5726523256536802
The commands above ask for the transpose of the DataFrame, which is a different operation from the following:
#What is the difference from df.T?
df.values.T
array([[-0.05463496, 0.34256677, -0.79775845, 1.13573175, 1.20106755, 0.98974133], [ 0.16832219, 0.40983999, -0.14891052, 1.53066581, -1.19141211, 0.08666247], [-1.14012769, -0.61334922, -1.13152355, -1.60402263, -0.04614857, -0.05516379], [-1.65791871, 1.57265233, 0.60297925, -0.44288347, -0.34628572, 1.02626349]])
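Beyond the transpose, other matrix operations from NumPy can also be applied to the underlying array. A minimal sketch, using a small DataFrame created here just for illustration:

```python
import numpy as np
import pandas as pd

m = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'))

# Matrix product of the values with their transpose: (3x2) @ (2x3) -> (3x3)
prod = m.values @ m.values.T
print(prod.shape)  # (3, 3)

# The trace of M @ M.T equals the sum of the squared entries of M.
print(np.trace(prod))         # 55
print((m.values ** 2).sum())  # 55
```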
As mentioned, Pandas objects are mutable in the sense that they can be modified, so one can select only the columns or the part of the data of interest.
Selecting columns and rows can be done in 3 ways:
Why so many ways of doing the same thing?
#By slicing
df['A']
df[0:3]
df['20130102':'20130104']
2013-01-01 -0.054635 2013-01-02 0.342567 2013-01-03 -0.797758 2013-01-04 1.135732 2013-01-05 1.201068 2013-01-06 0.989741 Freq: D, Name: A, dtype: float64
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
#By label
df.loc[dates[0]]
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
df.loc['20130102',['A','B']]
#What is the difference?
df.loc[dates[0],'A']
print()
df.at[dates[0],'A']
1.3604942373447584
1.3604942373447584
#By position
df.iloc[3]
df.iloc[3:5,0:2]
df.iloc[[1,2,4],[0,2]]
df.iloc[1:3,:]
df.iloc[:,1:3]
A 1.135732 B 1.530666 C -1.604023 D -0.442883 Name: 2013-01-04 00:00:00, dtype: float64
A | B | |
---|---|---|
2013-01-04 | 1.135732 | 1.530666 |
2013-01-05 | 1.201068 | -1.191412 |
A | C | |
---|---|---|
2013-01-02 | 0.342567 | -0.613349 |
2013-01-03 | -0.797758 | -1.131524 |
2013-01-05 | 1.201068 | -0.046149 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
B | C | |
---|---|---|
2013-01-01 | 0.168322 | -1.140128 |
2013-01-02 | 0.409840 | -0.613349 |
2013-01-03 | -0.148911 | -1.131524 |
2013-01-04 | 1.530666 | -1.604023 |
2013-01-05 | -1.191412 | -0.046149 |
2013-01-06 | 0.086662 | -0.055164 |
#What is the difference?
df.iloc[1,1]
print()
df.iat[1,1]
0.4098399942252009
0.4098399942252009
When manipulating data it is natural to ask logical or comparative questions, such as how many elements satisfy a condition. For example:
#How many elements have a value
# greater than zero
df.C>0
To help with this kind of case, Pandas provides selection by boolean indexing.
#Display the data
df.head()
print()
#The condition to check
df.D>0
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-01 False 2013-01-02 True 2013-01-03 True 2013-01-04 False 2013-01-05 False 2013-01-06 True Freq: D, Name: D, dtype: bool
#Select the rows that satisfy this condition
df[df.D>0]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
As with logical conditions in programming, one may want more than one condition to hold at the same time. For this, the logical connectors are used.
#Logical connector "and"
df[(df.C<-1.0)& (df.D>0)]
# Logical connector "or"
df[(df.B<-1.0) | (df.D>1)]
A | B | C | D | |
---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
It may also happen that one only wants to see those elements that belong to a given subset of values.
#Create another DataFrame and add a new column
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2.head()
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 | one |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | one |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | three |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
# The affirmative form
df2[df2['E'].isin(['two','four'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
# The negation of the previous condition
df2[~df2.E.isin(['one','three'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
The examples above show that we have everything needed to build logical expressions over DataFrames (and, or, not).
When modifying a DataFrame, in general one can delete a column, add a new column, or modify the current values.
#Delete column D
del df2['D']
df2.head()
A | B | C | E | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | one |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | one |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | two |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | three |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | four |
df2['D']=df.D
df2
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | 1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | 1.026263 | -5.0 | NaN |
#What happens here?
df2.drop(labels=['D'],axis=1)
df2
A | B | C | F | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -5.0 | NaN |
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | 1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | 1.026263 | -5.0 | NaN |
df2.drop(labels=['D'],axis=1,inplace=True)
df2
A | B | C | F | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -5.0 | NaN |
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64
df['F']=s1
df.head()
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 | NaN |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
df.at[dates[0],'A'] = 0
df.iat[2,1] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 | 5.0 |
#Why is copy() used?
df2 = df.copy()
df2[df2 > 0] = -df2
df2
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 |
The name missing values refers to those values that either carry no information or have the NA data type.
The "NA" data type comes from NumPy. It is standard to have a way of indicating that a value is not available.
But there are also "infinite" values.
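A minimal sketch of the difference: by default Pandas treats np.nan as missing but not np.inf; infinities can be converted to NaN with replace and then handled with the usual missing-value tools.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.inf, -np.inf])

# NaN counts as missing; infinities do not (by default).
print(s.isna().tolist())     # [False, True, False, False]
print(np.isinf(s).tolist())  # [False, False, True, True]

# Infinities can be turned into NaN and then treated as missing values.
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean.isna().tolist())  # [False, True, True, True]
```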
df2['E']=np.nan
df2
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 | NaN |
df2.dropna()
print()
df.dropna()
print()
df2.fillna(-99)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
(the result is empty: every row of `df2` has NaN in column E, so `dropna()` removes them all)
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 | 5.0 |
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | -99.0 | -99.0 |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 | -99.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 | -99.0 |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | -99.0 |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | -99.0 |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 | -99.0 |
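Besides a constant like `-99`, `fillna` and related methods support other strategies. A sketch on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.fillna(s.mean()))  # fill with the mean of the valid values (3.0)
print(s.ffill())           # forward fill: propagate the last valid value
print(s.interpolate())     # linear interpolation between neighbors
```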
pd.isna(df2)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | True |
2013-01-02 | False | False | False | False | False | True |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
2013-01-05 | False | False | False | False | False | True |
2013-01-06 | False | False | False | False | False | True |
df2.isna()
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | True |
2013-01-02 | False | False | False | False | False | True |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
2013-01-05 | False | False | False | False | False | True |
2013-01-06 | False | False | False | False | False | True |
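Since `isna()` returns a boolean DataFrame, summing it counts the missing values per column; a small sketch with a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 1.0]})

print(df.isna().sum())        # missing values per column
print(df.isna().sum().sum())  # total missing in the whole DataFrame
print(df.isna().mean())       # fraction of missing values per column
```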
The following basic topics remain:
They will be covered in other lessons; for now we move on to the visual side.
Pandas provides methods for generating the plots most commonly used or required when doing data analysis; under the hood, these plots are produced with matplotlib functions.
#Plot column A
df['A'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f939103d4e0>
plt.plot(df['A'])
[<matplotlib.lines.Line2D at 0x7f9390f44208>]
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f93910287b8>
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.head()
A | B | C | D | |
---|---|---|---|---|
2000-01-01 | -1.644883 | -1.614222 | 0.237673 | -0.212443 |
2000-01-02 | 0.375091 | -0.187983 | 0.274747 | 0.736999 |
2000-01-03 | -1.698535 | 0.730529 | 0.537786 | 0.791468 |
2000-01-04 | -2.512115 | 0.022670 | 0.265403 | 0.978397 |
2000-01-05 | -2.173313 | -0.297431 | 1.172374 | 1.206713 |
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f938e606668>
df.plot.hist(alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938e52bac8>
from pandas.plotting import scatter_matrix
scatter_matrix(df,diagonal='kde',alpha=0.2)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f938df80470>, ...]], dtype=object)  (4×4 grid of axes; output truncated)
#Seaborn example for generating a scatter matrix
import seaborn as sns
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f938c2ad8d0>
A look at the Pandas API.
When learning a library, the goal is never to memorize every method, but it is advisable to know where to look things up. With practice you get to know the methods' functionality better and the API becomes more familiar.
Information about the Pandas API can be found here.
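Besides the online documentation, the API can also be explored interactively from a notebook or console, for example with `dir()` and `help()`:

```python
import pandas as pd

# List the public DataFrame methods whose name contains "na"
methods = [m for m in dir(pd.DataFrame) if 'na' in m and not m.startswith('_')]
print(methods)  # includes dropna, fillna, isna, ...

# help() prints the docstring of any method
help(pd.DataFrame.dropna)
```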
First the environment is cleared, and then the required libraries are loaded.
%clear
%reset -f
%whos
Interactive namespace is empty.
#Load the libraries that will be used
# and configure the environment for working
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#Load the data
data=pd.read_csv("sample_data/california_housing_train.csv")
data.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
The first questions one can ask about the data are, for example:
#How many nulls does the DataFrame have?
data.isnull().sum()
#What type of variables do we have?
data.dtypes
#What is the shape?
data.shape
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object
(17000, 9)
#How much memory does it use?
data.memory_usage(index=False,deep=True)
print()
data.memory_usage(index=False,deep=True).sum()
longitude             136000
latitude              136000
housing_median_age    136000
total_rooms           136000
total_bedrooms        136000
population            136000
households            136000
median_income         136000
median_house_value    136000
dtype: int64
1224000
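Each of the nine float64 columns takes 17,000 × 8 = 136,000 bytes, which gives the 1,224,000-byte total above. A sketch (with a hypothetical single-column DataFrame) of how downcasting to float32 halves that footprint, at the cost of precision:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.zeros(17000, dtype=np.float64)})
print(df.memory_usage(index=False, deep=True)['x'])  # 136000 bytes

# Downcast to float32: half the bytes per value
df['x'] = df['x'].astype(np.float32)
print(df.memory_usage(index=False, deep=True)['x'])  # 68000 bytes
```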
#General summary of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
longitude             17000 non-null float64
latitude              17000 non-null float64
housing_median_age    17000 non-null float64
total_rooms           17000 non-null float64
total_bedrooms        17000 non-null float64
population            17000 non-null float64
households            17000 non-null float64
median_income         17000 non-null float64
median_house_value    17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
At a minimum, one can look at a few records of the table (DataFrame) and at the overall statistics of the data.
data.head()
data.describe()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
It looks like the latitudes and longitudes can be plotted directly.
plt.scatter(data.longitude,data.latitude)
<matplotlib.collections.PathCollection at 0x7f16c20c9c88>
data.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'], dtype='object')
data.plot.scatter(x='longitude', y='latitude',c='median_income',
colormap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x7f16c2342a58>
data.plot.scatter(x='longitude', y='latitude',c='median_house_value',
colormap='viridis')
data.plot.scatter(x='longitude', y='latitude',c='population',
colormap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x7fe26e621a20>
<matplotlib.axes._subplots.AxesSubplot at 0x7fe26e520ef0>
If some variable looks relevant, it can be explored on its own.
data.housing_median_age.plot.hist(bins=40)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938c011eb8>
#What are the 10 most frequent ages?
data.housing_median_age.value_counts().head(10)
52.0    1052
36.0     715
35.0     692
16.0     635
17.0     576
34.0     567
33.0     513
26.0     503
18.0     478
25.0     461
Name: housing_median_age, dtype: int64
data['housing_median_age'].value_counts().head(10)
52.0    1052
36.0     715
35.0     692
16.0     635
17.0     576
34.0     567
33.0     513
26.0     503
18.0     478
25.0     461
Name: housing_median_age, dtype: int64
data[['total_rooms','total_bedrooms']].plot.hist(bins=40,alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938b05fd68>
data.skew()
longitude            -0.304003
latitude              0.471801
housing_median_age    0.064894
total_rooms           4.002730
total_bedrooms        3.322637
population            5.187212
households            3.342668
median_income         1.626693
median_house_value    0.973037
dtype: float64
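Columns like `total_rooms` and `population` show strong positive skew. A common exploratory step is a log transform, which pulls in the long right tail; a sketch on synthetic skewed (lognormal) data, not on the housing dataset itself:

```python
import numpy as np
import pandas as pd

# Synthetic positively skewed data, mimicking a column like total_rooms
rng = np.random.RandomState(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10000))

print(s.skew())          # strongly positive, like total_rooms above
print(np.log(s).skew())  # close to 0 after the log transform
```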
data2=data[['total_rooms','total_bedrooms']].sum(axis=1)
plt.scatter(data2,data['median_income'])
<matplotlib.collections.PathCollection at 0x7fe26df01160>
The examples above try to illustrate how Pandas and Matplotlib plots can be used together to explore and get to know our data. This is not an exhaustive or conclusive analysis; it is only illustrative.
The Jupyter working environment together with Pandas makes data manipulation quick and easy. Some of the auxiliary Jupyter (IPython) commands are extremely useful for working with data and keeping track of our environment.
Pandas has two main objects, Series and DataFrames. Both provide standard functionality such as selection, modification, transformation, etc. It is always worth remembering that a Pandas object is made up of a set of indices and a set of values, both of which are arrays.
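The index-plus-values structure mentioned above can be seen directly:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index)         # the labels: Index(['a', 'b', 'c'], dtype='object')
print(s.values)        # the underlying NumPy array: [10 20 30]
print(type(s.values))  # <class 'numpy.ndarray'>

# A DataFrame adds a column index on top of the same idea
df = pd.DataFrame({'x': s})
print(df.index.equals(s.index))  # True: the row index is shared
```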
Books:
Websites: