The name PyData is commonly used to refer to the Python libraries used for scientific computing. Strictly speaking, though, that is not its definition: the collection of libraries is the "stack", while PyData is a program backed by a non-profit organization that seeks to support the use and development of open source software, and in particular the use and implementation of the Python stack for scientific computing.
That organization is NumFOCUS; through its PyData program it gives its name to the events dedicated to teaching and promoting the use of Python technologies.
In our case, following common usage, we will say PyData to refer to the set of libraries that make up the Python ecosystem for scientific computing.
The evolution and current state of the stack can be seen in the following presentation by Travis Oliphant:
To get an idea of the relationship between projects, it is illustrative to explore the list of related projects across four libraries:
What is the relationship between the libraries?
A quick way to explore this is through the documentation of the projects, Pandas and NumPy, straight from each library itself. This can be done with a single command in the console.
#Load Pandas and NumPy
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
Pandas version: 0.22.0
NumPy version: 1.14.6
#Print the Pandas documentation
print(pd.__doc__)
pandas - a powerful data analysis and manipulation library for Python ===================================================================== **pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. It is already well on its way toward this goal. Main Features ------------- Here are just a few of the things that pandas does well: - Easy handling of missing data in floating point as well as non-floating point data - Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects - Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let `Series`, `DataFrame`, etc. automatically align the data for you in computations - Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data - Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets - Intuitive merging and joining data sets - Flexible reshaping and pivoting of data sets - Hierarchical labeling of axes (possible to have multiple labels per tick) - Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format - Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
#Print the NumPy documentation
print(np.__doc__)
NumPy ===== Provides 1. An array object of arbitrary homogeneous items 2. Fast mathematical operations over arrays 3. Linear Algebra, Fourier Transforms, Random Number Generation How to use the documentation ---------------------------- Documentation is available in two forms: docstrings provided with the code, and a loose standing reference guide, available from `the NumPy homepage <http://www.scipy.org>`_. We recommend exploring the docstrings using `IPython <http://ipython.scipy.org>`_, an advanced Python shell with TAB-completion and introspection capabilities. See below for further instructions. The docstring examples assume that `numpy` has been imported as `np`:: >>> import numpy as np Code snippets are indicated by three greater-than signs:: >>> x = 42 >>> x = x + 1 Use the built-in ``help`` function to view a function's docstring:: >>> help(np.sort) ... # doctest: +SKIP For some objects, ``np.info(obj)`` may provide additional help. This is particularly true if you see the line "Help on ufunc object:" at the top of the help() page. Ufuncs are implemented in C, not Python, for speed. The native Python help() does not know how to view their help, but our np.info() function does. To search for documents containing a keyword, do:: >>> np.lookfor('keyword') ... # doctest: +SKIP General-purpose documents like a glossary and help on the basic concepts of numpy are available under the ``doc`` sub-module:: >>> from numpy import doc >>> help(doc) ... # doctest: +SKIP Available subpackages --------------------- doc Topical documentation on broadcasting, indexing, etc. lib Basic functions used by several sub-packages. random Core Random Tools linalg Core Linear Algebra Tools fft Core FFT routines polynomial Polynomial tools testing NumPy testing tools f2py Fortran to Python Interface Generator. distutils Enhancements to distutils with support for Fortran compilers support and more. 
Utilities --------- test Run numpy unittests show_config Show numpy build configuration dual Overwrite certain functions with high-performance Scipy tools matlib Make everything matrices. __version__ NumPy version string Viewing documentation using IPython ----------------------------------- Start IPython with the NumPy profile (``ipython -p numpy``), which will import `numpy` under the alias `np`. Then, use the ``cpaste`` command to paste examples into the shell. To see which functions are available in `numpy`, type ``np.<TAB>`` (where ``<TAB>`` refers to the TAB key), or use ``np.*cos*?<ENTER>`` (where ``<ENTER>`` refers to the ENTER key) to narrow down the list. To view the docstring for a function, use ``np.cos?<ENTER>`` (to view the docstring) and ``np.cos??<ENTER>`` (to view the source code). Copies vs. in-place operation ----------------------------- Most of the functions in `numpy` return a copy of the array argument (e.g., `np.sort`). In-place versions of these functions are often available as array methods, i.e. ``x = np.array([1,2,3]); x.sort()``. Exceptions to this rule are documented.
As mentioned in Lesson 1, the environment helps speed up the coding process and, above all, the development of a data analysis. By environment I mean Jupyter and its derivatives.
Jupyter, as the heir of IPython, has several interesting features:
The following commands are examples of what magic commands can do.
#To find out what is currently in the working session,
# the following command can be used.
%whos
Variable Type Data/Info ------------------------------ np module <module 'numpy' from '/us<...>kages/numpy/__init__.py'> pd module <module 'pandas' from '/u<...>ages/pandas/__init__.py'>
# To see the complete list of available magic commands
%lsmagic
Available line magics: %alias %alias_magic %autocall %automagic %autosave %bookmark %cat %cd %clear %colors %config %connect_info %cp %debug %dhist %dirs %doctest_mode %ed %edit %env %gui %hist %history %killbgscripts %ldir %less %lf %lk %ll %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls %lsmagic %lx %macro %magic %man %matplotlib %mkdir %more %mv %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %popd %pprint %precision %profile %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %rm %rmdir %run %save %sc %set_env %shell %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode Available cell magics: %%! %%HTML %%SVG %%bash %%bigquery %%capture %%debug %%file %%html %%javascript %%js %%latex %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %%script %%sh %%shell %%svg %%sx %%system %%time %%timeit %%writefile Automagic is ON, % prefix IS NOT needed for line magics.
# If you have questions about how any object works
?%lsmagic
?pd
#System shell commands can also be used
!ls -l -h
total 4.0K drwxr-xr-x 2 root root 4.0K Nov 29 18:21 sample_data
#Show the current working directory
!pwd
/content
# Same as above, but using a magic command
%pwd
'/content'
#To see the running processes
!top
=top - 00:22:58 up 1:11, 0 users, load average: 0.00, 0.01, 0.00 Tasks: 11 total, 1 running, 10 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.5 us, 0.3 sy, 0.0 ni, 98.9 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 13335212 total, 11335820 free, 678956 used, 1320436 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12470684 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 20 0 39196 6312 4828 S 0.0 0.0 0:00.05 run.sh 7 root 20 0 679236 44192 24428 S 0.0 0.3 0:01.15 node 28 root 20 0 683112 55048 24980 S 0.0 0.4 0:03.21 node 53 root 20 0 187712 58848 12520 S 0.0 0.4 0:05.01 jupyter-+ 60 root 20 0 783272 206804 40060 S 0.0 1.6 0:06.78 python3 82 root 20 0 54376 14556 7516 S 0.0 0.1 0:00.07 python3 108 root 20 0 711676 135524 39976 S 0.0 1.0 0:02.46 python3 124 root 20 0 54376 14572 7536 S 0.0 0.1 0:00.06 python3 188 root 20 0 720332 136040 40020 S 0.0 1.0 0:02.35 python3 204 root 20 0 54376 14532 7496 S 0.0 0.1 0:00.07 python3 256 root 20 0 61088 6632 4948 R 0.0 0.0 0:00.01 top top - 00:23:01 up 1:11, 0 users, load average: 0.00, 0.01, 0.00 Tasks: 11 total, 1 running, 10 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.8 us, 0.5 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 13335212 total, 11335636 free, 679064 used, 1320512 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12470556 avail Mem 188 root 20 0 720332 136040 40020 S 1.0 1.0 0:02.38 python3 1 root 20 0 39196 6312 4828 S 0.0 0.0 0:00.05 run.sh 7 root 20 0 679236 44192 24428 S 0.0 0.3 0:01.15 node 28 root 20 0 683112 55048 24980 S 0.0 0.4 0:03.21 node 53 root 20 0 187712 58848 12520 S 0.0 0.4 0:05.01 jupyter-+ 60 root 20 0 783272 206804 40060 S 0.0 1.6 0:06.78 python3 82 root 20 0 54376 14556 7516 S 0.0 0.1 0:00.07 python3 108 root 20 0 711676 135524 39976 S 0.0 1.0 0:02.46 python3 124 root 20 0 54376 14572 7536 S 0.0 0.1 0:00.06 python3 >
Simple as they may seem, these commands help you concentrate on your code and make your work flow more smoothly.
You can also develop your own magic commands. Examples beyond the standard ones include the following projects:
The following example illustrates a common problem: hitting an error and trying to make sense of the chain of error messages. For this, there is a magic command that tells us where the origin of the problem lies.
#Try to run the pandas test suite
pd.test()
running: pytest --skip-slow --skip-network /usr/local/lib/python3.6/dist-packages/pandas ============================= test session starts ============================== platform linux -- Python 3.6.7, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 rootdir: /usr/local/lib/python3.6/dist-packages/pandas, inifile: collected 0 items / 1 errors ==================================== ERRORS ==================================== ______________________________ ERROR collecting _______________________________ /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:430: in _importconftest return self._conftestpath2mod[conftestpath] E KeyError: local('/usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py') During handling of the above exception, another exception occurred: /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:436: in _importconftest mod = conftestpath.pyimport() /usr/local/lib/python3.6/dist-packages/py/_path/local.py:668: in pyimport __import__(modname) /usr/local/lib/python3.6/dist-packages/_pytest/assertion/rewrite.py:294: in load_module six.exec_(co, mod.__dict__) /usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py:3: in <module> import moto E ModuleNotFoundError: No module named 'moto' During handling of the above exception, another exception occurred: /usr/local/lib/python3.6/dist-packages/py/_path/common.py:377: in visit for x in Visitor(fil, rec, ignore, bf, sort).gen(self): /usr/local/lib/python3.6/dist-packages/py/_path/common.py:429: in gen for p in self.gen(subdir): /usr/local/lib/python3.6/dist-packages/py/_path/common.py:418: in gen dirs = self.optsort([p for p in entries /usr/local/lib/python3.6/dist-packages/py/_path/common.py:419: in <listcomp> if p.check(dir=1) and (rec is None or rec(p))]) /usr/local/lib/python3.6/dist-packages/_pytest/main.py:601: in _recurse ihook = self.gethookproxy(dirpath) /usr/local/lib/python3.6/dist-packages/_pytest/main.py:418: in gethookproxy my_conftestmodules = 
pm._getconftestmodules(fspath) /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:414: in _getconftestmodules mod = self._importconftest(conftestpath) /usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py:453: in _importconftest raise ConftestImportFailure(conftestpath, sys.exc_info()) E _pytest.config.ConftestImportFailure: (local('/usr/local/lib/python3.6/dist-packages/pandas/tests/io/conftest.py'), (<class 'ModuleNotFoundError'>, ModuleNotFoundError("No module named 'moto'",), <traceback object at 0x7f16c248b4c8>)) !!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!! =========================== 1 error in 0.64 seconds ============================
An exception has occurred, use %tb to see the full traceback. SystemExit: 2
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D. warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
# Ask to see the traceback, i.e. the reason for the problem.
%tb
--------------------------------------------------------------------------- SystemExit Traceback (most recent call last) <ipython-input-11-f37ca8985d3b> in <module>() ----> 1 pd.test() /usr/local/lib/python3.6/dist-packages/pandas/util/_tester.py in test(extra_args) 20 cmd += [PKG] 21 print("running: pytest {}".format(' '.join(cmd))) ---> 22 sys.exit(pytest.main(cmd)) 23 24 SystemExit: 2
To learn more about the features of Jupyter and IPython, see the following link:
Pandas is a high-performance library for manipulating and processing structured data.
In general, it consists of the following elements:
The two fundamental data structures in Pandas are:
One way to understand the relationship between Pandas objects is to think of them as containers of lower-dimensional data structures: DataFrames can be thought of as containers of Series, and Series as containers of numbers or strings.
Advanced note: all Pandas objects are mutable (the values they contain can be altered), but all methods and functions produce new objects, leaving the original objects unchanged.
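A minimal sketch of this behavior (the names here are illustrative): a sorting method returns a new object and leaves the original untouched, while direct assignment mutates in place.

```python
import pandas as pd

# A small Series: sort_values() returns a NEW, sorted object...
s = pd.Series([3, 1, 2])
s_sorted = s.sort_values()

# ...while the original Series is left unchanged.
print(list(s))         # [3, 1, 2]
print(list(s_sorted))  # [1, 2, 3]

# The objects are nevertheless mutable: values can be altered in place.
s[0] = 99
print(list(s))         # [99, 1, 2]
```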
The following Pandas "tour" is a version of the one found on the official page:
# Adjust the environment to make
# the review faster
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import matplotlib.pyplot as plt
We define a Series and a DataFrame.
#Define the Series s
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
#Define a DataFrame
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
#Display them
s
df
0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -2.662312 | -0.930931 | -1.069079 | 1.103445 |
2013-01-02 | 2.770103 | -0.032381 | -0.114339 | 0.289050 |
2013-01-03 | -0.742892 | -1.425946 | 0.994186 | -0.266048 |
2013-01-04 | -0.891284 | -0.695291 | -0.924549 | 1.942704 |
2013-01-05 | 0.426666 | 0.380999 | 0.486457 | -1.035013 |
2013-01-06 | -0.211796 | 0.351017 | 1.557697 | -1.496206 |
Another DataFrame is created; as an example, it is built with columns of different data types.
#Create the DF from Series of different types
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
print("Display the DF")
df2
print("="*50)
print("What data types does it have?")
df2.dtypes
#For a complete summary of the DF
print()
print("="*50)
print("Summary")
df2.info()
Display the DF
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
================================================== What data types does it have?
A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
================================================== Summary <class 'pandas.core.frame.DataFrame'> Int64Index: 4 entries, 0 to 3 Data columns (total 6 columns): A 4 non-null float64 B 4 non-null datetime64[ns] C 4 non-null float32 D 4 non-null int32 E 4 non-null category F 4 non-null object dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1) memory usage: 260.0+ bytes
For a long DataFrame, it is advisable to display only a few rows.
#Show the first 3 rows
df.head(3)
#Show the last 3 rows
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -2.662312 | -0.930931 | -1.069079 | 1.103445 |
2013-01-02 | 2.770103 | -0.032381 | -0.114339 | 0.289050 |
2013-01-03 | -0.742892 | -1.425946 | 0.994186 | -0.266048 |
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | -0.891284 | -0.695291 | -0.924549 | 1.942704 |
2013-01-05 | 0.426666 | 0.380999 | 0.486457 | -1.035013 |
2013-01-06 | -0.211796 | 0.351017 | 1.557697 | -1.496206 |
In general, every DataFrame has 3 elements: index, columns, and values.
#Display the index
df.index
print("\n")
#Inspect the columns
print("Columns\n")
df.columns
#Display the values
print("\n")
df.values
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
Columns
Index(['A', 'B', 'C', 'D'], dtype='object')
array([[-2.66231197, -0.93093083, -1.06907874, 1.10344508], [ 2.77010318, -0.03238136, -0.11433878, 0.28905019], [-0.74289222, -1.42594597, 0.99418636, -0.26604766], [-0.89128408, -0.69529057, -0.92454875, 1.94270435], [ 0.4266665 , 0.38099875, 0.48645678, -1.03501286], [-0.2117956 , 0.35101741, 1.55769736, -1.49620624]])
The commands above show the relationship that exists between NumPy and Pandas.
type(df.index)
print()
type(df.columns)
print()
type(df.values)
pandas.core.indexes.datetimes.DatetimeIndex
pandas.core.indexes.base.Index
numpy.ndarray
Note that the type of the values of a DataFrame is a NumPy array. This implies that a DataFrame inherits every function that a NumPy array has.
df.values.argmax()
print()
df.values.diagonal()
7
array([-0.05463496, 0.40983999, -1.13152355, -0.44288347])
The same happens with Series.
type(s)
print()
type(s.values)
pandas.core.series.Series
numpy.ndarray
Some useful functions for exploring the data in a DataFrame are the following:
#Ask for the basic statistics
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.469452 | 0.142528 | -0.765056 | 0.125801 |
std | 0.793282 | 0.879375 | 0.636024 | 1.171310 |
min | -0.797758 | -1.191412 | -1.604023 | -1.657919 |
25% | 0.044665 | -0.090017 | -1.137977 | -0.418734 |
50% | 0.666154 | 0.127492 | -0.872436 | 0.128347 |
75% | 1.099234 | 0.349461 | -0.194710 | 0.920442 |
max | 1.201068 | 1.530666 | -0.046149 | 1.572652 |
#Perhaps the data needs to be
# sorted in some way.
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | -1.657919 | -1.140128 | 0.168322 | -0.054635 |
2013-01-02 | 1.572652 | -0.613349 | 0.409840 | 0.342567 |
2013-01-03 | 0.602979 | -1.131524 | -0.148911 | -0.797758 |
2013-01-04 | -0.442883 | -1.604023 | 1.530666 | 1.135732 |
2013-01-05 | -0.346286 | -0.046149 | -1.191412 | 1.201068 |
2013-01-06 | 1.026263 | -0.055164 | 0.086662 | 0.989741 |
A | B | C | D | |
---|---|---|---|---|
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
Since DataFrames are two-dimensional arrays whose values are a NumPy array, matrix operations can be applied to them.
#Compute the transpose
df.T
print()
df.values.diagonal()
print()
df.values.max()
2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 | |
---|---|---|---|---|---|---|
A | -0.054635 | 0.342567 | -0.797758 | 1.135732 | 1.201068 | 0.989741 |
B | 0.168322 | 0.409840 | -0.148911 | 1.530666 | -1.191412 | 0.086662 |
C | -1.140128 | -0.613349 | -1.131524 | -1.604023 | -0.046149 | -0.055164 |
D | -1.657919 | 1.572652 | 0.602979 | -0.442883 | -0.346286 | 1.026263 |
array([-0.05463496, 0.40983999, -1.13152355, -0.44288347])
1.5726523256536802
The commands above ask for the transpose of the DataFrame, which is a different operation from the following:
#What is the difference from df.T?
df.values.T
array([[-0.05463496, 0.34256677, -0.79775845, 1.13573175, 1.20106755, 0.98974133], [ 0.16832219, 0.40983999, -0.14891052, 1.53066581, -1.19141211, 0.08666247], [-1.14012769, -0.61334922, -1.13152355, -1.60402263, -0.04614857, -0.05516379], [-1.65791871, 1.57265233, 0.60297925, -0.44288347, -0.34628572, 1.02626349]])
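Beyond the transpose, other matrix operations from NumPy can also be applied to the underlying array. A minimal sketch, using a small DataFrame created here just for illustration:

```python
import numpy as np
import pandas as pd

m = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list('AB'))

# Matrix product of the values with their transpose: (3x2) @ (2x3) -> (3x3)
prod = m.values @ m.values.T
print(prod.shape)  # (3, 3)

# The trace of M @ M.T equals the sum of the squared entries of M.
print(np.trace(prod))         # 55
print((m.values ** 2).sum())  # 55
```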
As mentioned, Pandas objects are mutable in the sense that they can be modified, so one can select only the columns or the part of the data of interest.
Selecting columns and rows can be done in 3 ways:
Why so many ways of doing the same thing?
#By slicing
df['A']
df[0:3]
df['20130102':'20130104']
2013-01-01 -0.054635 2013-01-02 0.342567 2013-01-03 -0.797758 2013-01-04 1.135732 2013-01-05 1.201068 2013-01-06 0.989741 Freq: D, Name: A, dtype: float64
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
#By label
df.loc[dates[0]]
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
df.loc['20130102',['A','B']]
#What is the difference?
df.loc[dates[0],'A']
print()
df.at[dates[0],'A']
1.3604942373447584
1.3604942373447584
#By position
df.iloc[3]
df.iloc[3:5,0:2]
df.iloc[[1,2,4],[0,2]]
df.iloc[1:3,:]
df.iloc[:,1:3]
A 1.135732 B 1.530666 C -1.604023 D -0.442883 Name: 2013-01-04 00:00:00, dtype: float64
A | B | |
---|---|---|
2013-01-04 | 1.135732 | 1.530666 |
2013-01-05 | 1.201068 | -1.191412 |
A | C | |
---|---|---|
2013-01-02 | 0.342567 | -0.613349 |
2013-01-03 | -0.797758 | -1.131524 |
2013-01-05 | 1.201068 | -0.046149 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
B | C | |
---|---|---|
2013-01-01 | 0.168322 | -1.140128 |
2013-01-02 | 0.409840 | -0.613349 |
2013-01-03 | -0.148911 | -1.131524 |
2013-01-04 | 1.530666 | -1.604023 |
2013-01-05 | -1.191412 | -0.046149 |
2013-01-06 | 0.086662 | -0.055164 |
#What is the difference?
df.iloc[1,1]
print()
df.iat[1,1]
0.4098399942252009
0.4098399942252009
When manipulating data it is natural to ask logical or comparative questions, such as how many elements satisfy a condition. For example:
#How many elements have a value
# greater than zero
df.C>0
To help with this kind of case, Pandas provides selection by boolean indexing.
#Display the data
df.head()
print()
#The condition to check
df.D>0
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-01 False 2013-01-02 True 2013-01-03 True 2013-01-04 False 2013-01-05 False 2013-01-06 True Freq: D, Name: D, dtype: bool
#Select the rows that satisfy this condition
df[df.D>0]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
As with logical conditions in programming, one may want more than one condition to hold at the same time. For this, the logical connectors are used.
#Logical connector "and"
df[(df.C<-1.0)& (df.D>0)]
# Logical connector "or"
df[(df.B<-1.0) | (df.D>1)]
A | B | C | D | |
---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 |
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 |
It may also happen that one only wants to see those elements that belong to a given subset of values.
#Create another DataFrame and add a new column
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2.head()
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 | one |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | one |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | three |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
# The affirmative form
df2[df2['E'].isin(['two','four'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
# The negation of the previous condition
df2[~df2.E.isin(['one','three'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | two |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | four |
The examples above show that we have everything needed to build logical expressions over DataFrames (and, or, not).
When modifying a DataFrame, in general one can delete a column, add a new column, or modify the current values.
#Delete column D
del df2['D']
df2.head()
A | B | C | E | |
---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | one |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | one |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | two |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | three |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | four |
df2['D']=df.D
df2
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | 1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | 1.026263 | -5.0 | NaN |
#What happens here?
df2.drop(labels=['D'],axis=1)
df2
A | B | C | F | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -5.0 | NaN |
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | 1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | 1.026263 | -5.0 | NaN |
df2.drop(labels=['D'],axis=1,inplace=True)
df2
A | B | C | F | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -5.0 | NaN |
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64
df['F']=s1
df.head()
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | -0.054635 | 0.168322 | -1.140128 | -1.657919 | NaN |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
df.at[dates[0],'A'] = 0
df.iat[2,1] = 0
df
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN |
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 | 5.0 |
#Why is copy() used?
df2 = df.copy()
df2[df2 > 0] = -df2
df2
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 |
The name missing values refers to those values that either carry no information or have the NA data type.
The "NA" data type comes from NumPy. It is standard to have a way of indicating that a value is not available.
But there are also "infinite" values.
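A minimal sketch of the difference: by default Pandas treats np.nan as missing but not np.inf; infinities can be converted to NaN with replace and then handled with the usual missing-value tools.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.inf, -np.inf])

# NaN counts as missing; infinities do not (by default).
print(s.isna().tolist())     # [False, True, False, False]
print(np.isinf(s).tolist())  # [False, False, True, True]

# Infinities can be turned into NaN and then treated as missing values.
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean.isna().tolist())  # [False, True, True, True]
```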
df2['E']=np.nan
df2
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | NaN | NaN |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 | NaN |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 | NaN |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | NaN |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | NaN |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 | NaN |
df2.dropna()
print()
df.dropna()
print()
df2.fillna(-99)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
(the result is empty: every row of `df2` has NaN in column E, so `dropna()` removes them all)
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-02 | 0.342567 | 0.409840 | -0.613349 | 1.572652 | 1.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | 0.602979 | 2.0 |
2013-01-04 | 1.135732 | 1.530666 | -1.604023 | -0.442883 | 3.0 |
2013-01-05 | 1.201068 | -1.191412 | -0.046149 | -0.346286 | 4.0 |
2013-01-06 | 0.989741 | 0.086662 | -0.055164 | 1.026263 | 5.0 |
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -1.140128 | -1.657919 | -99.0 | -99.0 |
2013-01-02 | -0.342567 | -0.409840 | -0.613349 | -1.572652 | -1.0 | -99.0 |
2013-01-03 | -0.797758 | -0.148911 | -1.131524 | -0.602979 | -2.0 | -99.0 |
2013-01-04 | -1.135732 | -1.530666 | -1.604023 | -0.442883 | -3.0 | -99.0 |
2013-01-05 | -1.201068 | -1.191412 | -0.046149 | -0.346286 | -4.0 | -99.0 |
2013-01-06 | -0.989741 | -0.086662 | -0.055164 | -1.026263 | -5.0 | -99.0 |
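Besides a constant like `-99`, `fillna` and related methods support other strategies. A sketch on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.fillna(s.mean()))  # fill with the mean of the valid values (3.0)
print(s.ffill())           # forward fill: propagate the last valid value
print(s.interpolate())     # linear interpolation between neighbors
```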
pd.isna(df2)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | True |
2013-01-02 | False | False | False | False | False | True |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
2013-01-05 | False | False | False | False | False | True |
2013-01-06 | False | False | False | False | False | True |
df2.isna()
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2013-01-01 | False | False | False | False | True | True |
2013-01-02 | False | False | False | False | False | True |
2013-01-03 | False | False | False | False | False | True |
2013-01-04 | False | False | False | False | False | True |
2013-01-05 | False | False | False | False | False | True |
2013-01-06 | False | False | False | False | False | True |
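Since `isna()` returns a boolean DataFrame, summing it counts the missing values per column; a small sketch with a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 1.0]})

print(df.isna().sum())        # missing values per column
print(df.isna().sum().sum())  # total missing in the whole DataFrame
print(df.isna().mean())       # fraction of missing values per column
```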
The following basic topics remain:
They will be covered in other lessons; for now we move on to the visual side.
Pandas provides methods for generating the plots most commonly used or required when doing data analysis; under the hood, these plots are produced with matplotlib functions.
#Plot column A
df['A'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f939103d4e0>
plt.plot(df['A'])
[<matplotlib.lines.Line2D at 0x7f9390f44208>]
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f93910287b8>
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.head()
A | B | C | D | |
---|---|---|---|---|
2000-01-01 | -1.644883 | -1.614222 | 0.237673 | -0.212443 |
2000-01-02 | 0.375091 | -0.187983 | 0.274747 | 0.736999 |
2000-01-03 | -1.698535 | 0.730529 | 0.537786 | 0.791468 |
2000-01-04 | -2.512115 | 0.022670 | 0.265403 | 0.978397 |
2000-01-05 | -2.173313 | -0.297431 | 1.172374 | 1.206713 |
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f938e606668>
df.plot.hist(alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938e52bac8>
from pandas.plotting import scatter_matrix
scatter_matrix(df,diagonal='kde',alpha=0.2)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f938df80470>, ...]], dtype=object)  (4×4 grid of axes; output truncated)
#Seaborn example for generating a scatter matrix
import seaborn as sns
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f938c2ad8d0>
A look at the Pandas API.
When learning a library, the goal is never to memorize every method, but it is advisable to know where to look things up. With practice you get to know the methods' functionality better and the API becomes more familiar.
Information about the Pandas API can be found here.
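Besides the online documentation, the API can also be explored interactively from a notebook or console, for example with `dir()` and `help()`:

```python
import pandas as pd

# List the public DataFrame methods whose name contains "na"
methods = [m for m in dir(pd.DataFrame) if 'na' in m and not m.startswith('_')]
print(methods)  # includes dropna, fillna, isna, ...

# help() prints the docstring of any method
help(pd.DataFrame.dropna)
```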
First the environment is cleared, and then the required libraries are loaded.
%clear
%reset -f
%whos
Interactive namespace is empty.
#Load the libraries that will be used
# and configure the environment for working
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#Load the data
data=pd.read_csv("sample_data/california_housing_train.csv")
data.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
The first questions one can ask about the data are, for example:
#How many nulls does the DataFrame have?
data.isnull().sum()
#What type of variables do we have?
data.dtypes
#What is the shape?
data.shape
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object
(17000, 9)
#How much memory does it use?
data.memory_usage(index=False,deep=True)
print()
data.memory_usage(index=False,deep=True).sum()
longitude             136000
latitude              136000
housing_median_age    136000
total_rooms           136000
total_bedrooms        136000
population            136000
households            136000
median_income         136000
median_house_value    136000
dtype: int64
1224000
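Each of the nine float64 columns takes 17,000 × 8 = 136,000 bytes, which gives the 1,224,000-byte total above. A sketch (with a hypothetical single-column DataFrame) of how downcasting to float32 halves that footprint, at the cost of precision:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.zeros(17000, dtype=np.float64)})
print(df.memory_usage(index=False, deep=True)['x'])  # 136000 bytes

# Downcast to float32: half the bytes per value
df['x'] = df['x'].astype(np.float32)
print(df.memory_usage(index=False, deep=True)['x'])  # 68000 bytes
```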
#General summary of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
longitude             17000 non-null float64
latitude              17000 non-null float64
housing_median_age    17000 non-null float64
total_rooms           17000 non-null float64
total_bedrooms        17000 non-null float64
population            17000 non-null float64
households            17000 non-null float64
median_income         17000 non-null float64
median_house_value    17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
At a minimum, one can look at a few records of the table (DataFrame) and at the overall statistics of the data.
data.head()
data.describe()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
It looks like the latitudes and longitudes can be plotted directly.
plt.scatter(data.longitude,data.latitude)
<matplotlib.collections.PathCollection at 0x7f16c20c9c88>
data.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'], dtype='object')
data.plot.scatter(x='longitude', y='latitude',c='median_income',
colormap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x7f16c2342a58>
data.plot.scatter(x='longitude', y='latitude',c='median_house_value',
colormap='viridis')
data.plot.scatter(x='longitude', y='latitude',c='population',
colormap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x7fe26e621a20>
<matplotlib.axes._subplots.AxesSubplot at 0x7fe26e520ef0>
If some variable looks relevant, it can be explored on its own.
data.housing_median_age.plot.hist(bins=40)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938c011eb8>
#What are the 10 most frequent ages?
data.housing_median_age.value_counts().head(10)
52.0    1052
36.0     715
35.0     692
16.0     635
17.0     576
34.0     567
33.0     513
26.0     503
18.0     478
25.0     461
Name: housing_median_age, dtype: int64
data['housing_median_age'].value_counts().head(10)
52.0    1052
36.0     715
35.0     692
16.0     635
17.0     576
34.0     567
33.0     513
26.0     503
18.0     478
25.0     461
Name: housing_median_age, dtype: int64
data[['total_rooms','total_bedrooms']].plot.hist(bins=40,alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x7f938b05fd68>
data.skew()
longitude            -0.304003
latitude              0.471801
housing_median_age    0.064894
total_rooms           4.002730
total_bedrooms        3.322637
population            5.187212
households            3.342668
median_income         1.626693
median_house_value    0.973037
dtype: float64
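Columns like `total_rooms` and `population` show strong positive skew. A common exploratory step is a log transform, which pulls in the long right tail; a sketch on synthetic skewed (lognormal) data, not on the housing dataset itself:

```python
import numpy as np
import pandas as pd

# Synthetic positively skewed data, mimicking a column like total_rooms
rng = np.random.RandomState(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10000))

print(s.skew())          # strongly positive, like total_rooms above
print(np.log(s).skew())  # close to 0 after the log transform
```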
data2=data[['total_rooms','total_bedrooms']].sum(axis=1)
plt.scatter(data2,data['median_income'])
<matplotlib.collections.PathCollection at 0x7fe26df01160>
The examples above try to illustrate how Pandas and Matplotlib plots can be used together to explore and get to know our data. This is not an exhaustive or conclusive analysis; it is only illustrative.
The Jupyter working environment together with Pandas makes data manipulation quick and easy. Some of the auxiliary Jupyter (IPython) commands are extremely useful for working with data and keeping track of our environment.
Pandas has two main objects, Series and DataFrames. Both provide standard functionality such as selection, modification, transformation, etc. It is always worth remembering that a Pandas object is made up of a set of indices and a set of values, both of which are arrays.
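The index-plus-values structure mentioned above can be seen directly:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index)         # the labels: Index(['a', 'b', 'c'], dtype='object')
print(s.values)        # the underlying NumPy array: [10 20 30]
print(type(s.values))  # <class 'numpy.ndarray'>

# A DataFrame adds a column index on top of the same idea
df = pd.DataFrame({'x': s})
print(df.index.equals(s.index))  # True: the row index is shared
```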
Books:
Websites: