Why Charles Xavier Should Trade Cerebro for Python¶

In [1]:

speakers = [{'name':'Mai Giménez', 
           'twitter': '@adahopper',
           'weapons': ['Python', 'Bash', 'C++'],
           'pyladies': True}, 
            {'name':'Angela Rivera', 
           'twitter': '@ghilbrae ',
           'weapons': ['Python', 'Django', 'C++'],
           'pyladies': True}]

for speaker in speakers:
    for k,v in speaker.items():
        print("- {}: {}".format(k,v))
    print()
#print('\n'.join(["- {}: {}".format(k, v) for speaker in speakers for k,v in speaker.items()]))

- name: Mai Giménez
- pyladies: True
- weapons: ['Python', 'Bash', 'C++']
- twitter: @adahopper

- name: Angela Rivera
- pyladies: True
- weapons: ['Python', 'Django', 'C++']
- twitter: @ghilbrae

In [2]:

from IPython.display import Image
Image(filename='pyladies.png')

Out[2]:

What will we cover in this talk?¶

An example-driven introduction to pandas and numpy.
How to make simple plots with matplotlib.
A brief explanation of machine learning.
The k-NN algorithm.

In [3]:

Image(filename='notebook.png')

Out[3]:

In [4]:

Image(filename='marvel_logo.jpg')

Out[4]:

Marvel is an American comic book publisher founded by Martin Goodman in 1939 as Timely Publications, whose early flagship title was Marvel Mystery Comics. Marvel as we know it today (Marvel Worldwide Inc.), however, dates from 1961, with the publication of The Fantastic Four and other superhero stories created by authors such as Stan Lee, Jack Kirby and Steve Ditko, among others.

Marvel is home to world-famous characters and teams such as:

Spider-Man
X-Men
Captain America
Black Widow
Fantastic Four
...

And all this data is ours!

1. Exploring the Marvel API¶

There is a Python library for accessing the Marvel API directly: pymarvel, developed by Garrett Pennington for Python 2 and ported to Python 3 as pymarvel3.

In [5]:

from marvel.marvel import Marvel
from marveldev import Developer

To access the API you need to request developer credentials at http://developer.marvel.com/

Watch out for the request limits! We can fetch at most 100 results per request.

In [6]:

developer = Developer()
marvel = Marvel(*developer.get_marvel_credentials())

In [7]:

character_data_wrapper = marvel.get_characters(orderBy="-modified", limit="100")
print(character_data_wrapper.status)

Ok
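Since each request returns at most 100 results, collecting every character means paging through the API. Below is a minimal sketch of the bookkeeping. It assumes the endpoint accepts an offset parameter alongside limit; page_windows is a hypothetical helper, not part of pymarvel, and the total of 250 is just an illustrative number.

```python
def page_windows(total, page_size=100):
    """Yield (offset, limit) pairs that cover `total` items page by page."""
    for offset in range(0, total, page_size):
        # The last page may be shorter than page_size
        yield offset, min(page_size, total - offset)


windows = list(page_windows(250))
print(windows)

# Hypothetical usage, assuming the wrapper forwards `offset` to the API:
# for offset, limit in windows:
#     marvel.get_characters(orderBy="-modified", limit=str(limit), offset=str(offset))
```

Computing the windows up front keeps the request loop trivial and makes it easy to check that no page exceeds the limit.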

In [8]:

for character in character_data_wrapper.data.results[:10]:
    print("* {character.name}: {character.modified_raw}".format(character=character))

* Thor (Goddess): 2014-11-05T15:16:57-0500
* Spider-Man (Miles Morales): 2014-10-23T12:07:33-0400
* Hawkeye (Kate Bishop): 2014-10-23T12:05:03-0400
* Black Widow: 2014-09-09T16:09:03-0400
* New Mutants: 2014-08-12T12:59:29-0400
* Cosmo (dog): 2014-07-24T15:14:21-0400
* Rocket Raccoon: 2014-07-17T17:32:43-0400
* Ronan: 2014-07-17T16:45:26-0400
* Star-Lord (Peter Quill): 2014-07-14T20:45:53-0400
* Captain Marvel (Carol Danvers): 2014-07-08T18:17:18-0400

What information is available for each character?

In [9]:

', '.join([attr for attr in dir(character) if not attr.startswith('_')])

Out[9]:

'comics, description, detail, dict, events, get_comics, get_events, get_related_resource, get_series, get_stories, id, list_to_instance_list, marvel, modified, modified_raw, name, resourceURI, resource_url, series, stories, thumbnail, to_dict, urls, wiki'

Hmm, not bad, but is that what we are after? Let's look at the wiki:

In [25]:

from IPython.core.display import HTML
HTML("<iframe src={} width=1000 height=800></iframe>".format(character_data_wrapper.data.results[2].wiki))

Out[25]:

Extracting data from the web (scraping)¶

Here we find much more data about the character: physical traits, occupation, education... It looks promising! Let's scrape the wiki!

The catch is that Marvel only lets us fetch up to 100 results per request.

The first thing to do would be to collect the information from the web and store it.

But someone else has already had that idea, and we are not going to reinvent the wheel. @asamiller has developed a node.js app that crawls the Marvel API and stores the data using Orchestrate. The code is available on GitHub.

# TODO

It would actually be cooler to scrape the wiki live rather than rely on a static copy, which may also be somewhat out of date, but that belongs in the pyMarvel library. If you feel like helping, find us after the talk and let's chat.

Loading the JSON files¶

In [26]:

import json

In [33]:

from os.path import join
from os import listdir
import socket

MARVELOUSDB_PATH_A = "../marvelousdb-master/data/characters/"
MARVELOUSDB_PATH_M = "../marvelousdb/data/characters/"
MARVELOUSDB_PATH = MARVELOUSDB_PATH_M if 'alan' in socket.gethostname() else MARVELOUSDB_PATH_A

In [34]:

json_db = [join(MARVELOUSDB_PATH, json_file) for json_file in listdir(MARVELOUSDB_PATH)]
print("MarvelousDB holds a backup of {} characters".format(len(json_db)))

MarvelousDB holds a backup of 1402 characters

Let's organize the information we have gathered (pandas time!)¶

pandas is an open-source, BSD-licensed library for efficient data analysis in Python.

Things pandas is good at:

Efficient data structures (DataFrames) for working with indexed data.
Tools for reading and writing data efficiently. It can work with many formats:
- CSV
- Text files
- Microsoft Excel
- SQL databases
- HDF5
- ...
Flexible reshaping and pivoting of data sets.
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
Columns can be inserted and deleted: data sets are mutable.
Easy grouping and merging of data sets.
Time series functionality: efficient handling of date ranges.
...

In [35]:

import pandas as pd

DataFrames¶

A DataFrame is a two-dimensional structure with data labeled by column. The columns of a DataFrame can hold different types. Think of a DataFrame as a spreadsheet or a SQL table.

They can be created from:

A dict of 1D ndarrays, lists, dicts, or pandas Series.
A 2D ndarray.
Another DataFrame.

When creating a DataFrame you can also specify the index (row labels) and the columns. If these labels are not passed as arguments, pandas builds the DataFrame using sensible defaults.
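To make the constructor options concrete, here is a small sketch with made-up values (the labels and numbers are purely illustrative): one DataFrame built from a dict of lists with an explicit index, and one built from a 2D ndarray with explicit column names.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column names, `index` labels the rows
from_dict = pd.DataFrame(
    {"comics": [10, 20], "events": [1, 2]},
    index=["hero_a", "hero_b"],
)

# From a 2D ndarray: we name the columns, pandas numbers the rows 0, 1, 2
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

print(from_dict)
print(from_array)
```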

In our case, we will read all the JSON files and build a DataFrame. Since the JSON files contain hierarchical information we need to normalize the data, but pandas has functions that do this for us.

In [36]:

json_to_dataframe = []
for json_file in json_db:
    with open(json_file, 'r') as jf:
        # Parse the file, then flatten the nested JSON into a one-row DataFrame
        json_character = json.load(jf)
        json_plain = pd.io.json.json_normalize(json_character)
        json_to_dataframe.append(json_plain)

# Stack the one-row frames; every row keeps its original index label, 0
marvel_df = pd.concat(json_to_dataframe)
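To see what the normalization does, here is a minimal sketch on an invented record shaped like the API's JSON. It assumes a recent pandas (1.0 or later), where the function is exposed as pd.json_normalize; the cell above uses the older pd.io.json.json_normalize spelling.

```python
import pandas as pd

# Invented record: nested dicts plus a list, like the character files
record = {
    "id": 1,
    "name": "Example Hero",
    "comics": {"available": 3, "items": [{"name": "Issue #1"}]},
    "wiki": {"hair": "Black"},
}

# Nested dict keys are joined with "." into flat column names;
# lists are left untouched inside a single cell
flat = pd.json_normalize(record)
print(sorted(flat.columns))
```

This is where column names such as comics.available and wiki.hair come from.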

We can do all of this in a single mega-statement. We lose readability but gain coolness. Not recommended at all!

In [37]:

df = pd.concat([pd.io.json.json_normalize(json.loads(''.join(open(json_file,'r').readlines()))) for json_file in json_db])

We can apply logical operations to every element of a DataFrame at once; they are vectorized operations, which speeds up the computation.

In [38]:

all(df == marvel_df)

Out[38]:

True
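A word of caution about this check, sketched on toy data: Python's built-in all iterates over a DataFrame's column labels rather than its cells, so all(df == marvel_df) is True whenever the labels are truthy, whatever the cells contain. DataFrame.equals compares the actual values and treats NaNs in matching positions as equal.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
right = pd.DataFrame({"a": [1.0, 999.0], "b": ["x", "DIFFERENT"]})

# Iterating a DataFrame yields "a" and "b" (non-empty strings, hence truthy),
# so this reports True although the two frames clearly differ
naive_check = all(left == right)

# equals() looks at the cell values instead
strict_check = left.equals(right)
same_check = left.equals(left.copy())

print(naive_check, strict_check, same_check)
```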

And what does a DataFrame look like?

In [39]:

marvel_df.head()

Out[39]:

comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight
36	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Marvel Adventures Super Heroes (201...	36	AIM is a terrorist organization bent on destro...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	1009144	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	NaN	NaN	NaN
43	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Incredible Hulks (2009) #619', 'res...	43	Formerly known as Emil Blonsky, a spy of Sovie...	2	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Chaos War', 'resourceURI': 'http://...	2	1009146	...	NaN	NaN	NaN	NaN	NaN	NaN	Marvel Universe	None	NaN	(Abomination) 980 lbs.; (Blonsky) 180 lbs.
43	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Avengers Academy (2010) #21', 'reso...	43		4	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Fear Itself', 'resourceURI': 'http:...	4	1009148	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	He uses a prison ball-and-chain as a weapon, a...	NaN	365 lbs. (variable)
8	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Uncanny X-Men (1963) #402', 'resour...	8		1	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Age of Apocalypse', 'resourceURI': ...	1	1009149	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Unrevealed	NaN	Unrevealed
20	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Weapon X: Days of Future Now (Trade...	20		0	http://gateway.marvel.com/v1/public/characters...	[]	0	1009150	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Agent Zero carries a wide array of weapons inc...	NaN	230 lbs.

5 rows × 89 columns

pandas DataFrames are built on top of numpy, so finding out the dimensions of a DataFrame works exactly as it does in numpy.

In [40]:

marvel_df.shape

Out[40]:

(1402, 89)

We have 89 columns, that is, 89 fields to explore about Marvel characters. Great!

In [41]:

', '.join(marvel_df.columns.values)

Out[41]:

'comics.available, comics.collectionURI, comics.items, comics.returned, description, events.available, events.collectionURI, events.items, events.returned, id, modified, name, resourceURI, series.available, series.collectionURI, series.items, series.returned, stories.available, stories.collectionURI, stories.items, stories.returned, thumbnail.extension, thumbnail.path, urls, wiki.Date_of_birth, wiki.Place_of_birth, wiki.abilities, wiki.aliases, wiki.appearance, wiki.base_of_operations, wiki.bio, wiki.bio_text, wiki.blurb, wiki.builder, wiki.categories, wiki.categorytext, wiki.citizenship, wiki.creator, wiki.creators, wiki.current_members, wiki.debut, wiki.distinguishing_features, wiki.dstinguishing_features, wiki.education, wiki.event_text, wiki.eyes, wiki.features, wiki.former_members, wiki.govenment, wiki.government, wiki.groups, wiki.hair, wiki.height, wiki.home_world, wiki.identity, wiki.key_characters, wiki.key_issues, wiki.leader, wiki.location, wiki.main_image, wiki.members, wiki.object_text, wiki.occupation, wiki.origin, wiki.other_members, wiki.owner, wiki.paraphernalia, wiki.place_of_birth, wiki.place_of_creation, wiki.place_text, wiki.points_of_interest, wiki.power, wiki.powers, wiki.real_name, wiki.relatives, wiki.significant_citizens, wiki.significant_issues, wiki.skin, wiki.special_limitations, wiki.specieshistory, wiki.team_name, wiki.teamicon, wiki.technology, wiki.tie-ins, wiki.title_graphic, wiki.universe, wiki.weapons, wiki.weaponss, wiki.weight'

Actually, let's not celebrate too soon because (spoiler) many of the fields are empty.

In [42]:

marvel_df.dropna()

Out[42]:

	comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight

0 rows × 89 columns
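dropna with no arguments keeps only the rows that have no missing value in any column, which is why nothing survives here. A small sketch on invented data shows a gentler first step: measuring how much of each column is missing before deciding what to drop.

```python
import numpy as np
import pandas as pd

# Invented rows: one column is almost empty, another is mostly filled
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "wiki.skin": [np.nan, np.nan, np.nan, "Green"],
    "wiki.hair": ["Black", np.nan, "Blond", "None"],
})

# isnull() gives a boolean frame; the column mean is the fraction missing
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Only rows without any NaN survive a plain dropna()
all_complete = toy.dropna()
print(len(all_complete))
```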

Series¶

A Series is a labeled one-dimensional array, like a table with a single column. It can hold any data type:

Integers
Strings
Floating-point numbers
Python objects
...

Series are labeled by their index; for example, if the index we pass is made of dates, the result is a time series (a Series with a DatetimeIndex).

Selecting a single column from a DataFrame yields a Series.
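Both points can be checked on a toy example (all values below are invented): a Series built with a date index, and a Series obtained by selecting one column of a DataFrame.

```python
import pandas as pd

# A Series indexed by dates gets a DatetimeIndex
dates = pd.to_datetime(["2014-07-08", "2014-07-14", "2014-07-17"])
updates = pd.Series([3, 1, 2], index=dates, name="updates")
print(type(updates.index).__name__)

# Date strings can be used to slice by label
recent = updates.loc["2014-07-14":]
print(recent)

# Selecting a single column of a DataFrame returns a Series
toy = pd.DataFrame({"name": ["Ronan", "Cosmo (dog)"], "comics": [5, 2]})
column = toy["name"]
print(type(column).__name__)
```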

We will use the comic creators to play a bit with Series.

In [43]:

# Extract the creators present in our data
creators_serie = marvel_df['wiki.creators'].dropna()
creators_serie.describe()

Out[43]:

count                               119
unique                               37
top       this has not been updated yet
freq                                 44
dtype: object

In [44]:

# Rename the Series and its index
creators_serie.name = 'Creadores de personajes'
creators_serie.index.name = 'creators'

# We can use head() or, since this is a Series, slice it like a list
# creators_serie.head()
creators_serie[:20]

Out[44]:

creators
0            this has not been updated yet
0                                         
0                                         
0                                         
0                                         
0                                         
0                                         
0                  Peter David & Sam Keith
0               Bill Mantlo and Ed Hanigan
0                                         
0                 Stan Lee and Steve Ditko
0             Grant Morrison & Igor Kordey
0                          Chris Claremont
0                                         
0           Chris Claremont & Dave Cockrum
0            this has not been updated yet
0            this has not been updated yet
0                                         
0                     Stan Lee, Jack Kirby
0                           Grant Morrison
Name: Creadores de personajes, dtype: object

Using masks to extract information¶

Let's drop every row in which the creator is missing, either because it holds the placeholder string or because the field is empty.

In [45]:

default_string = creators_serie != "this has not been updated yet"
default_string.head()

Out[45]:

creators
0           False
0            True
0            True
0            True
0            True
Name: Creadores de personajes, dtype: bool

In [46]:

empty_string = creators_serie != ""
empty_string[:10]

Out[46]:

creators
0            True
0           False
0           False
0           False
0           False
0           False
0           False
0            True
0            True
0           False
Name: Creadores de personajes, dtype: bool

Now we simply combine the two masks.

In [47]:

default_string and empty_string

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-544bf713079b> in <module>()
----> 1 default_string and empty_string

/Users/ada/Dev/.virtualenvs/marvel/lib/python3.3/site-packages/pandas/core/generic.py in __nonzero__(self)
    690         raise ValueError("The truth value of a {0} is ambiguous. "
    691                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 692                          .format(self.__class__.__name__))
    693 
    694     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Python has the keyword and, so we might expect it to combine two Series, but it does not: and is not applied element by element. It asks for the truth value of each Series as a whole, which is ambiguous.

pandas knows we need this, though, and provides element-wise operators: & (and), | (or), ~ (not).

In [49]:

creators_mask = default_string & empty_string
creators_mask[:10]

Out[49]:

creators
0           False
0           False
0           False
0           False
0           False
0           False
0           False
0            True
0            True
0           False
Name: Creadores de personajes, dtype: bool

In [50]:

creators_serie[creators_mask].head()

Out[50]:

creators
0                Peter David & Sam Keith
0             Bill Mantlo and Ed Hanigan
0               Stan Lee and Steve Ditko
0           Grant Morrison & Igor Kordey
0                        Chris Claremont
Name: Creadores de personajes, dtype: object
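The two masks can also be built and combined in one expression. A minimal sketch on an invented Series; note that the parentheses are required, because & binds more tightly than != in Python. The isin variant is an alternative spelling of the same filter.

```python
import pandas as pd

toy_creators = pd.Series([
    "this has not been updated yet",
    "",
    "Stan Lee, Jack Kirby",
    "Chris Claremont",
])

# Parenthesize each comparison before combining them with &
mask = (toy_creators != "this has not been updated yet") & (toy_creators != "")

# Same result: negate membership in the list of unwanted values
same_mask = ~toy_creators.isin(["this has not been updated yet", ""])

print(toy_creators[mask])
```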

We now have a good part of the information we want, but let's split up authors who work together so we can count how many characters each one has created.

In [51]:

import re
creators = [re.split('&|and|,', line) for line in creators_serie[creators_mask]]
clean_creators = pd.Series([c.strip() for creator in creators for c in creator])
clean_creators.head()

Out[51]:

0    Peter David
1      Sam Keith
2    Bill Mantlo
3     Ed Hanigan
4       Stan Lee
dtype: object

In [52]:

clean_creators.value_counts()

Out[52]:

Chris Claremont         17
Stan Lee                 9
John Byrne               7
Jack Kirby               4
Brian K. Vaughan         4
Adrian Alphona           4
Steve Ditko              3
Grant Morrison           2
Christina Weir           2
John Buscema             2
Scott Lobdell            2
Nunzio DeFilippis        2
Paul Smith               1
Jim Lee                  1
Keron Grant              1
Marc Silvestri           1
John Romita Jr.          1
Brian Michael Bendis     1
Joe Bennett              1
Sam Keith                1
John Romita Sr.          1
Roger Cruz               1
Mark Millar              1
Andy Kubert (artist)     1
Javier Saltares          1
John Cassaday            1
Frank Miller             1
Bill Everett             1
Ed Hanigan               1
Len Wein                 1
Peter David              1
Bill Mantlo              1
Marv Wolfman             1
Christopher Priest       1
Alan Moore               1
Keront Grant             1
Chris Bachalo            1
Howard Mackie            1
Mark Millar (writer)     1
Salvador Larroca         1
Art Adams                1
Alan Davis               1
Joss Whedon              1
Dave Cockrum             1
Mark Bagley              1
Igor Kordey              1
                         1
dtype: int64
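The counts above show some noise from the split: an empty name, and what look like the same people counted under two spellings ("Mark Millar" and "Mark Millar (writer)"). The pattern '&|and|,' would also match the letters "and" inside a name. A possible refinement, offered as a sketch rather than as this notebook's method: split on "and" only as a whole word and discard empty pieces. split_creators is a hypothetical helper, and "Alexander Hand" is an invented name used to show the difference.

```python
import re

# "&", "," or the whole word "and", plus any surrounding whitespace
SEPARATORS = re.compile(r"\s*(?:&|,|\band\b)\s*")


def split_creators(line):
    """Split a creators string into clean, non-empty names."""
    pieces = (piece.strip() for piece in SEPARATORS.split(line))
    return [piece for piece in pieces if piece]


print(split_creators("Stan Lee and Steve Ditko"))
print(split_creators("Peter David & Sam Keith"))
print(split_creators("Alexander Hand"))

# For comparison, the simpler pattern cuts the invented name into fragments
print(re.split('&|and|,', "Alexander Hand"))
```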

We expected Stan Lee to win, and we also have the feeling that he has created more than 9 characters, no offense to Chris Claremont.

According to our sources (Wikipedia & ComicVine):

In [53]:

from IPython.display import Image
Image(filename='stanvschris.png')

Out[53]:

This is obviously a missing-data problem. That is why we must be very careful about how much we trust our results: a corpus with errors leads to wrong conclusions, and we need to be aware of that.

Cleaning the data¶

Removing groups¶

The Marvel API does not distinguish between characters and teams. That is, The Avengers have the same character status as Rachel Grey, but there is a wiki field that lets us tell groups from characters: current members. We will use it to filter and keep only the characters.

Normally we would want to drop the rows that contain nulls, and pandas has a function for that: dropna. Here, though, we want to keep the rows whose current_members column is null, because we have checked that an entry with no members is a character.

In [54]:

marvel_df.dropna(subset=['wiki.current_members'])['name']

Out[54]:

0                         A.I.M.
0                       Avengers
0    Brotherhood of Evil Mutants
0                         Exiles
0                 Fantastic Four
0                    Force Works
0                  Hellfire Club
0                          Hydra
0                 Imperial Guard
0                      Marauders
0                        Reavers
0                   S.H.I.E.L.D.
0                Serpent Society
0                        X-Force
0                          X-Men
...
0                         Sinister Six
0                          ClanDestine
0                            New X-Men
0                      Masters of Evil
0                         Generation X
0              Guardians of the Galaxy
0                               U-Foes
0                            Sentinels
0                          New Mutants
0             Lightning Lords of Nepal
0           Nine-Fold Daughters of Xao
0          Confederates of the Curious
0                             X-Babies
0                        Lethal Legion
0    Brotherhood of Mutants (Ultimate)
Name: name, Length: 70, dtype: object

In [55]:

%timeit (~marvel_df['wiki.current_members'].isnull())

import numpy as np
%timeit (np.invert(marvel_df['wiki.current_members'].isnull()))

1000 loops, best of 3: 206 µs per loop
1000 loops, best of 3: 226 µs per loop

In [56]:

not_groups_mask = marvel_df['wiki.current_members'].isnull()
not_groups_mask.head()

Out[56]:

0    False
0     True
0     True
0     True
0     True
Name: wiki.current_members, dtype: bool

In [57]:

marvel_df_characters = marvel_df[not_groups_mask]
marvel_df_characters.head()

Out[57]:

comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	...	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight
43	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Incredible Hulks (2009) #619', 'res...	43	Formerly known as Emil Blonsky, a spy of Sovie...	2	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Chaos War', 'resourceURI': 'http://...	2	1009146	...	NaN	NaN	NaN	NaN	NaN	NaN	Marvel Universe	None	NaN	(Abomination) 980 lbs.; (Blonsky) 180 lbs.
43	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Avengers Academy (2010) #21', 'reso...	43		4	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Fear Itself', 'resourceURI': 'http:...	4	1009148	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	He uses a prison ball-and-chain as a weapon, a...	NaN	365 lbs. (variable)
8	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Uncanny X-Men (1963) #402', 'resour...	8		1	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Age of Apocalypse', 'resourceURI': ...	1	1009149	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Unrevealed	NaN	Unrevealed
20	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Weapon X: Days of Future Now (Trade...	20		0	http://gateway.marvel.com/v1/public/characters...	[]	0	1009150	...	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	Agent Zero carries a wide array of weapons inc...	NaN	230 lbs.
11	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Uncanny X-Men (1963) #181', 'resour...	11		1	http://gateway.marvel.com/v1/public/characters...	[{'name': 'Secret Wars', 'resourceURI': 'http:...	1	1009151	...	NaN	NaN	NaN	NaN	NaN	NaN	Marvel Universe		NaN	100 lbs

5 rows × 89 columns

The Watchers have slipped through. This is one of the problems of machine learning: the input data may contain errors, and the system we train must generalize well enough to cope with them.

In [58]:

marvel_df_characters.shape

Out[58]:

(1332, 89)

We started with 1402 characters; after removing the teams we are left with 1332. In other words, we have lost 4.9929 % of the data. Nothing serious so far.
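The percentage comes straight from the two row counts. As a quick sketch (percent_lost is a hypothetical helper introduced only for this check):

```python
def percent_lost(before, after):
    """Percentage of rows removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# 1402 entries before filtering, 1332 after removing the 70 teams
lost = percent_lost(1402, 1332)
print("{:.4f} %".format(lost))
```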

Racial, cultural and gender representation in Marvel comics¶

A case study...

In [59]:

from IPython.display import Image
Image(filename='oracle.jpg')

Out[59]:

For example, what do we find about racial representation?

In [60]:

marvel_df_characters['wiki.skin'].dropna()

Out[60]:

0    White (as GAmbit), Black (as Death)
Name: wiki.skin, dtype: object

Another example: it would be very interesting to know who leads the superhero groups, but...

In [61]:

marvel_groups =  marvel_df.dropna(subset=['wiki.current_members'])
marvel_groups['wiki.leader'].dropna()

Out[61]:

0    Steve Rogers
0                
Name: wiki.leader, dtype: object

Removing characters we have no information about¶

Let's drop every character for which we have no information. Without data there is nothing to analyze. We scientists would love to have lots of data available, because that would let us run many experiments and draw conclusions that are probably valid. Unfortunately, most of the time we will not get to do machine learning on big data.

We want to keep the following information:

Id (id)
Name (name)
Description
Education
Weight
Height
Bio
Hair color
Eye color
Citizenship
Place of birth

In [62]:

# Group the fields so it is clear what we want to work with
# Nobody has 'ocupation', so we leave it out
physical_data = {'wiki.hair':'hair', 'wiki.weight':'weight', 'wiki.height':'height', 'wiki.eyes':'eyes'}
cultural_data = {'wiki.education':'education', 'wiki.citizenship':'citizenship', 
                 'wiki.place_of_birth':'place_of_birth', 'wiki.occupation':'occupation'}
personal_data = {'wiki.bio':'bio', 'wiki.bio_text':'bio', 'wiki.categories':'categories'}
marvelesque_data = {'wiki.abilities':'abilities', 'wiki.weapons':'weapons', 'wiki.powers': 'powers'}

data_keys = (list(physical_data.keys()) + list(cultural_data.keys()) + 
             list(personal_data.keys()) + ['name','comics.available'])
#+ marvelesque_data

In [63]:

print(data_keys)

['wiki.height', 'wiki.hair', 'wiki.eyes', 'wiki.weight', 'wiki.place_of_birth', 'wiki.citizenship', 'wiki.occupation', 'wiki.education', 'wiki.bio_text', 'wiki.bio', 'wiki.categories', 'name', 'comics.available']

In [64]:

clean_df = marvel_df_characters.dropna(subset = data_keys)
clean_df = clean_df[data_keys].set_index('name')
clean_df.shape

Out[64]:

(762, 12)
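The values of the dictionaries defined above look like shorter column names. As an optional extra that the rest of the notebook does not rely on, here is a sketch, on invented rows, of how dropna(subset=...) and rename(columns=...) could be chained to filter the rows and shorten the names in one go.

```python
import numpy as np
import pandas as pd

# Invented rows; "wiki.skin" is deliberately almost useless
toy = pd.DataFrame({
    "name": ["Abyss", "Agent Zero", "Mystery"],
    "wiki.hair": ["Unrevealed", "Black", np.nan],
    "wiki.eyes": ["Unrevealed", "Blue", "Green"],
    "wiki.skin": [np.nan, np.nan, np.nan],
})

physical = {"wiki.hair": "hair", "wiki.eyes": "eyes"}
keys = list(physical) + ["name"]

# Drop rows with a NaN in any of the chosen columns, keep only those
# columns, shorten their names and index the result by character name
tidy = (toy.dropna(subset=keys)[keys]
           .rename(columns=physical)
           .set_index("name"))
print(tidy)
```

Restricting dropna to a subset is what keeps a sparse column such as wiki.skin from wiping out every row.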

We can explore slices of a DataFrame¶

Physical data¶

In [65]:

clean_df[list(physical_data.keys())].head()

Out[65]:

	wiki.height	wiki.hair	wiki.eyes	wiki.weight
name
Abomination (Emil Blonsky)	(Abomination) 6'8"; (Blonsky) 5'10"	(Abomination) None; (Blonsky) Blond	(Abomination) Green; (Blonsky) Blue	(Abomination) 980 lbs.; (Blonsky) 180 lbs.
Absorbing Man	6'4" (variable)	Bald	Blue	365 lbs. (variable)
Abyss	Unrevealed	Unrevealed	Unrevealed	Unrevealed
Agent Zero	6'3"	(Originally) Brown; (currently) Black	Blue	230 lbs.
Annihilus	5'11"	None	Green	200 lbs.

In [66]:

clean_df[list(physical_data.keys())].describe()

Out[66]:

	wiki.height	wiki.hair	wiki.eyes	wiki.weight
count	762	762	762	762
unique	213	223	165	307
top	Unrevealed	Black	Blue	Unrevealed
freq	44	165	236	48

Cultural traits¶

In [67]:

clean_df[list(cultural_data.keys())].head()

Out[67]:

	wiki.place_of_birth	wiki.citizenship	wiki.occupation	wiki.education
name
Abomination (Emil Blonsky)	Zagreb, Yugoslavia	Citizen of Croatia; former citizen of Yugoslavia	Professional Criminal, Former Spy	Unrevealed
Absorbing Man	New York City, New York	U.S.A. with a criminal record	Professional criminal; former boxer	High school dropout
Abyss	Unrevealed	Unrevealed	Cosmic sorcerer	Unrevealed
Agent Zero	Unrevealed location in former East Germany	German	Mercenary, former government operative, freedo...	Unrevealed
Annihilus	Planet of [[Arthros]], Sector 17A, [[Negative ...	Arthros	Conqueror, scavenger	Unrevealed

What would you say the typical Marvel character looks like physically? (pandas knows)

In [68]:

clean_df[list(cultural_data.keys())].describe()

Out[68]:

	wiki.place_of_birth	wiki.citizenship	wiki.occupation	wiki.education
count	762	762	762	762
unique	412	262	636	357
top	Unrevealed	U.S.A.	Adventurer	Unrevealed
freq	156	230	31	236

So the archetypal Marvel character has black hair and blue eyes, is from the U.S.A. and works as an adventurer.
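The top and freq rows that describe reports for text columns are simply the most frequent value and its count. A small sketch on an invented column shows the connection with value_counts:

```python
import pandas as pd

# Invented eye colors
eyes = pd.Series(["Blue", "Blue", "Green", "Brown", "Blue", "Green"])

summary = eyes.describe()      # count, unique, top, freq
counts = eyes.value_counts()   # every value, most frequent first

print(summary)
print(counts)
```

Keep in mind that on the real data placeholder strings such as "Unrevealed" count as values too, so the most frequent entry is not always informative.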

Data is expensive! We had 1402 entries, but only 762 characters have enough data to work with. We have lost 45.6491 % of the data.

Let's explore the DataFrame we are left with:

In [69]:

clean_df.dtypes

Out[69]:

wiki.height            object
wiki.hair              object
wiki.eyes              object
wiki.weight            object
wiki.place_of_birth    object
wiki.citizenship       object
wiki.occupation        object
wiki.education         object
wiki.bio_text          object
wiki.bio               object
wiki.categories        object
comics.available        int64
dtype: object

In [70]:

clean_df.describe()

Out[70]:

	comics.available
count	762.000000
mean	53.292651
std	179.820372
min	0.000000
25%	2.000000
50%	10.000000
75%	33.750000
max	2575.000000

In [71]:

clean_df[clean_df['comics.available'] == 2575.000000]

Out[71]:

	wiki.height	wiki.hair	wiki.eyes	wiki.weight	wiki.place_of_birth	wiki.citizenship	wiki.occupation	wiki.education	wiki.bio_text	wiki.bio	wiki.categories	comics.available
name
Spider-Man	5'10"	Brown	Hazel	167 lbs.	Forest Hills, New York	U.S.A.	Scientist and inventor; former freelance photo...	College graduate (biophysics major), doctorate...	The bite of an irradiated spider granted high-...	The bite of an irradiated spider granted high-...	[Avengers, Civil War, Heroes, Marvel Knights, ...	2575

Spider-Man is the king of comics!
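Comparing against the literal maximum works, but the value has to be copied by hand from the describe output. A sketch of an alternative on a toy DataFrame (the names and the two smaller counts are illustrative): idxmax returns the index label of the largest value, so the row can be looked up directly.

```python
import pandas as pd

toy = pd.DataFrame(
    {"comics.available": [2575, 43, 20]},
    index=pd.Index(["Spider-Man", "Absorbing Man", "Agent Zero"], name="name"),
)

# Label of the row holding the maximum, then the row itself
top_name = toy["comics.available"].idxmax()
top_row = toy.loc[top_name]
print(top_name)
print(top_row)
```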

Before we play with the data (some more), there is one column we can get a lot out of: "wiki.categories".

In [72]:

clean_df['wiki.categories']

Out[72]:

name
Abomination (Emil Blonsky)    [Avengers, Deceased, Hulk, International, Vill...
Absorbing Man                                   [Avengers, Civil War, Villains]
Abyss                                                 [Cosmic, Magic, Villains]
Agent Zero                    [Heroes, X-Men, Villains, International, Mutants]
Annihilus                      [Annihilation, Cosmic, Fantastic Four, Villains]
Apocalypse                            [Mutants, Villains, International, X-Men]
Spider-Girl (Anya Corazon)    [Women, Heroes, Spider-Man, Civil War, Initiat...
Arcade                                            [Spider-Man, Villains, X-Men]
Archangel                           [X-Men, Heroes, Reformed Villains, Mutants]
Arclight                      [X-Men, Women, Villains, Mutants, People who u...
Aurora                        [Heroes, Women, X-Men, International, Canadian...
Avalanche                             [X-Men, International, Villains, Mutants]
Banshee                       [X-Men, People who used to be dead but aren't ...
Baron Strucker                [Villains, International, Thunderbolts, People...
Baron Zemo (Heinrich Zemo)        [Villains, Avengers, Deceased, International]
...
Contessa (Vera Vidal)                                             [Heroes, Women]
Chores MacGillicudy                                            [Deceased, Heroes]
Iron Fist (Wu Ao-Shi)                                   [Heroes, Women, Deceased]
Loa                                               [X-Men, Women, Heroes, Mutants]
Grey Gargoyle                                 [Avengers, International, Villains]
Nekra                                        [Avengers, Mutants, Villains, Women]
Miss America                                                    [Women, Deceased]
Whizzer (Stanley Stewart)                    [Heroes, Avengers, Squadron Supreme]
Scarlet Spider (Kaine)          [Villains, Spider-Man, people who used to be d...
Hope Summers                                              [Mutants, Women, X-Men]
Enchantress (Sylvie Lushton)                             [Avengers, Magic, Women]
Hank Pym                                [Heroes, Avengers, Civil War, Initiative]
Azazel (Mutant)                                 [Magic, Mutants, Villains, X-Men]
Spider-Man (House of M)                                              [House of M]
Gargoyle (Yuri Topolov)                                          [Hulk, Villains]
Name: wiki.categories, Length: 762, dtype: object

A priori we have no information about which characters are men, women or aliens. But Marvel must have guessed that we would be interested in the role of women in comics and included a category, "Women", that makes our life a lot easier. We are going to create two new columns in the DataFrame:

Women: simply True or False, depending on whether the character is female.
Villain: likewise True or False, depending on whether the character is a villain.

In [73]:

women = clean_df['wiki.categories'].map(lambda x: 'Women' in x)
clean_df['Women'] = women 
women[:5]

Out[73]:

name
Abomination (Emil Blonsky)    False
Absorbing Man                 False
Abyss                         False
Agent Zero                    False
Annihilus                     False
Name: wiki.categories, dtype: bool

In [74]:

# ~ is element-wise negation
print("Women: #{}, men #{}".format(clean_df[women].shape[0],clean_df[~women].shape[0]))

Women: #199, men #563

That is, we have 199 female characters and 563 that we treat as male (everyone not tagged "Women"). In other words, only 26% of the characters are female.
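The share can also be read straight off the boolean mask, because the mean of a boolean Series is the fraction of True values. A minimal sketch on a toy frame (the `toy` DataFrame and its four rows are invented for illustration, they are not the Marvel data):

```python
import pandas as pd

# Toy stand-in for clean_df: each row holds a list of category tags (invented data).
toy = pd.DataFrame(
    {"wiki.categories": [["Heroes", "Women"], ["Villains"], ["X-Men", "Women"], ["Heroes"]]},
    index=["A", "B", "C", "D"],
)

# Same membership test as in the notebook, applied to the toy data.
is_woman = toy["wiki.categories"].map(lambda tags: "Women" in tags)

# The mean of a boolean Series is the fraction of True values.
share = is_woman.mean()
print("Women: {:.0%}".format(share))
```

With the real counts, 199 out of 762 gives roughly 26%.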

In [75]:

villain = clean_df['wiki.categories'].map(lambda x: 'Villains' in x)
clean_df['Villain'] = villain 

In [76]:

men = ~women
gender_data = {'Women':{'Heroes':0,'Villains':0},'Men':{'Heroes':0,'Villains':0}}
# Women and villains
gender_data['Women']['Villains'] = clean_df[villain & women].shape[0]
# Women and heroes
gender_data['Women']['Heroes'] = clean_df[~villain & women].shape[0]

# Men and villains
gender_data['Men']['Villains'] = clean_df[villain & men].shape[0]
# Men and heroes
gender_data['Men']['Heroes'] = clean_df[~villain & men].shape[0]
gender_data

Out[76]:

{'Women': {'Villains': 30, 'Heroes': 169},
 'Men': {'Villains': 201, 'Heroes': 362}}
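The four counts above can probably be obtained in a single call with `pd.crosstab`. This is only a sketch on invented data: the `toy` frame and the `gender` and `role` names are assumptions made for the example, and the notebook itself builds the dictionary by hand.

```python
import pandas as pd

# Toy boolean columns standing in for clean_df['Women'] and clean_df['Villain'] (invented data).
toy = pd.DataFrame({
    "Women":   [True, True, False, False, False, True],
    "Villain": [False, True, True, False, True, False],
})

# Turn the flags into readable labels before counting.
gender = toy["Women"].map({True: "Women", False: "Men"})
role = toy["Villain"].map({True: "Villains", False: "Heroes"})

# One call counts every combination of the two labels.
counts = pd.crosstab(gender, role)
print(counts)

# Same nested-dictionary shape as gender_data.
print(counts.to_dict(orient="index"))
```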

In [77]:

%matplotlib inline
import matplotlib.pyplot as plt

In [78]:

n_groups = 2

men_data = (gender_data['Men']['Villains'], gender_data['Men']['Heroes'])
women_data = (gender_data['Women']['Villains'], gender_data['Women']['Heroes'])

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.4

opacity = 0.5
rects1 = plt.bar(index, men_data, bar_width,
                 alpha=opacity,
                 color='b',
                 label='Men')

rects2 = plt.bar(index + bar_width, women_data, bar_width,
                 alpha=opacity,
                 color='r',
                 label='Women')

plt.xlabel('Role')
plt.ylabel('Number of characters')
plt.title('Distribution by gender and role')
plt.xticks(index + bar_width, ('Villains', 'Heroes'))
plt.legend(loc=0, borderaxespad=1.)

plt.show()

Besides comparing the number of villains, we can also settle a suspicion of ours: how many redheads there are in the comics!

Red hair occurs in 1-2% of the population globally and in 2-6% of populations with northern or western European ancestry. Ireland and Scotland stand out with 10% and 13%, respectively.

Let's see what happens at Marvel.

In [79]:

red_heads = clean_df['wiki.hair'].map(lambda x: 'Red' in x)
clean_df['red_heads'] = red_heads
red_heads[:5]

Out[79]:

name
Abomination (Emil Blonsky)    False
Absorbing Man                 False
Abyss                         False
Agent Zero                    False
Annihilus                     False
Name: wiki.hair, dtype: bool

In [80]:

print("Red heads: #{}, Non-red heads #{}".format(clean_df[red_heads].shape[0],clean_df[~red_heads].shape[0]))

Red heads: #76, Non-red heads #686

In [81]:

non_red = ~red_heads
hair_data = {'Women':{'Red heads':0,'Non-red heads':0},'Men':{'Red heads':0,'Non-red heads':0}}
# Red haired women
hair_data['Women']['Red heads'] = clean_df[red_heads & women].shape[0]
# Non-red haired women
hair_data['Women']['Non-red heads'] = clean_df[~red_heads & women].shape[0]

# Red haired men
hair_data['Men']['Red heads'] = clean_df[red_heads & men].shape[0]
# Non-red haired men
hair_data['Men']['Non-red heads'] = clean_df[~red_heads & men].shape[0]
hair_data

Out[81]:

{'Women': {'Non-red heads': 169, 'Red heads': 30},
 'Men': {'Non-red heads': 517, 'Red heads': 46}}

In [1]:

# What is this? The percentage of redheads among women and among men
redwomen = 30 / 199.
redmen = 46 / 563.
print('Women: {0:5.2f}%, Men: {1:5.2f}%'.format(redwomen * 100, redmen * 100))

Women: 15.08%, Men:  8.17%

Machine Learning time¶

What is machine learning?¶

It is "simply" a set of algorithms that let a machine learn from data. So the basic ingredients of our recipe are data + algorithms.

To mix this cocktail we need skills in computer engineering and statistics, plus knowledge of the problem domain.

With machine learning we can:

Predict future events.
Classify data.
Discover patterns in the data.

Weapons¶

In [83]:

import sys
import matplotlib
%matplotlib inline
import sklearn

print("Python version:       ", sys.version)
print("Pandas version:       ", pd.__version__)
print("Numpy version:        ", np.__version__)
print("Matplotlib version:   ", matplotlib.__version__)
print("scikit-learn version: ", sklearn.__version__)

Python version:        3.3.4 (default, Jul 25 2014, 00:04:27) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Pandas version:        0.14.1
Numpy version:         1.8.2
Matplotlib version:    1.4.0
scikit-learn version:  0.15.2

Scikit-learn rocks.¶

Simple and efficient tools for data mining and data analysis.
Accessible and reusable in various contexts.
Built on NumPy, SciPy, and matplotlib.
Open source, commercially usable - BSD license.

scikit-learn algorithms

Typical workflow¶

1. Collect data.
   Is there enough data? No? Go back to 1.
2. Preprocess the data.
3. Split the corpus into train, test and development.
4. Select and train the algorithm.
5. Tune the parameters.
6. Verify the results.
7. Celebrate.

The K-Nearest Neighbors algorithm (Knn)¶

A simple algorithm for classifying samples.
It needs labelled samples.
It computes the distance from the sample to be labelled to its neighbors. Each neighbor (up to N, or K) casts a vote.
Adding new data is not free.
It is computationally expensive.
The majority is not always right.
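The points above can be sketched in a few lines of plain Python. This is a toy illustration of the voting idea, not what scikit-learn does internally: `knn_predict` is a hypothetical helper, and the (weight, height) samples and the "W"/"M" labels are invented.

```python
from collections import Counter
from math import dist  # available from Python 3.8

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Distance from the query to every labelled sample (this is the costly part).
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # The k closest neighbors each cast one vote.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Invented (weight, height) samples, purely for illustration.
points = [(120, 165), (130, 170), (125, 168), (220, 185), (240, 190), (210, 183)]
labels = ["W", "W", "W", "M", "M", "M"]

print(knn_predict(points, labels, (128, 167), k=3))
print(knn_predict(points, labels, (230, 188), k=3))
```

Every prediction scans all the stored samples, which is why adding data is not free.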

Classification using weight and height¶

Hypothesis: physical characteristics distinguish female characters from male ones.

2. Data preprocessing (data munging)¶

In [84]:

clean_df['wiki.weight'].describe()

Out[84]:

count            762
unique           307
top       Unrevealed
freq              48
dtype: object

In [85]:

physical = clean_df[clean_df['wiki.weight'] != "Unrevealed"]

In [86]:

any(physical['wiki.height'] == "Unrevealed")

Out[86]:

False

Great! At least the characters without a weight are the same ones without a height.

In [87]:

physical_knn = physical[['wiki.weight', 'wiki.height', 'Women', 'Villain']]

In [88]:

physical_knn.dtypes

Out[88]:

wiki.weight    object
wiki.height    object
Women            bool
Villain          bool
dtype: object

We want them to be numeric.

In [89]:

physical_knn

Out[89]:

	wiki.weight	wiki.height	Women	Villain
name
Abomination (Emil Blonsky)	(Abomination) 980 lbs.; (Blonsky) 180 lbs.	(Abomination) 6'8"; (Blonsky) 5'10"	False	True
Absorbing Man	365 lbs. (variable)	6'4" (variable)	False	True
Agent Zero	230 lbs.	6'3"	False	True
Annihilus	200 lbs.	5'11"	False	True
Apocalypse	300 lbs. (variable)	Variable (usually around 7')	False	True
Spider-Girl (Anya Corazon)	115 lbs.	5'3"	True	False
Arcade	140 lbs.	5'6"	False	True
Archangel	150 lbs.	6'	False	False
Arclight	126 lbs.	5'8"	True	True
Aurora	140 lbs.	5'11"	True	False
Avalanche	195 lbs.	5'7"	False	True
Banshee	170 lbs.	6'	False	False
Baron Strucker	225 lbs.	6'2"	False	True
Baron Zemo (Heinrich Zemo)	180 lbs	5'9"	False	True
Bastion	375 lbs.	6'3"	False	True
Batroc the Leaper	225 lbs.	6’	False	True
Battering Ram	380 lbs.	7'4"	False	False
Beak	140	5'9"	False	False
Beast	402 lbs.	5'11"	False	False
Beef	250 lbs.	6'6"	False	True
Beta-Ray Bill	(As Bill) 480 lbs.; (as Walters) 132 lbs.	(As Bill) 6'7"; (as Walters) 5'9"	False	False
Big Wheel	140 lbs.	5'5"	False	False
Bishop	275 lbs.	6'6"	False	False
Black Bolt	210 lbs	6' 2"	False	False
Black Cat	120 lbs.	5'10"	True	False
Black Knight	180 lbs.	5’ 11”	False	False
Black Panther	200 lbs.	6'	False	False
Black Tom	(originally) 200 lbs.; (currently) Variable	(originally) 6'0"; (currently) Variable	False	True
Black Widow	131 lbs.	5'7"	True	False
Blackheart	679 lbs (Variable)	6'10" (Variable)	False	True
...	...	...	...	...
Starhawk (Stakar Ogord)	450 lbs.	6'4"	False	False
Vance Astro	(as an adult, with protective gear) 250 lbs.	(as an adult) 6'1"	False	False
Jamie Braddock	151 lbs.	6'1"	False	True
Jazinda	135 lbs. (variable)	5'6" (variable)	True	False
Tinkerer	120 lbs.	5'4"	False	True
Cosmo	70 lbs.	23" (at withers)	False	False
Red Hulk	245 lbs. (Ross); 1200 lbs. (Red Hulk)	6'1" (Ross); 7' (Red Hulk)	False	False
American Eagle (Jason Strongbow)	200 lbs	6'	False	False
Cottonmouth	200 lbs.	6'	False	True
Vanisher (Telford Porter)	175 lbs.	5'5"	False	True
Sphinx (Anath-Na Mut)	450 lbs	7'2"	False	True
Molten Man	550 lbs.	6’5”	False	True
Henry Peter Gyrich	205 lbs.	6’ 1”	False	False
Cypher	150 lbs.	5'9"	False	False
Karma	119 lbs.	5'4"	True	False
She-Hulk (Lyra)	220 lbs.	6'6"	True	False
She-Hulk (Ultimate)	110 lbs. (as Betty); Unrevealed (as She-Hulk)	5'6" (as Betty); Unrevealed (as She-Hulk)	False	False
Talon (Fraternity of Raptors)	180 lbs.	6'1"	False	False
Angel (Golden Age)			False	False
Meggan	Usually 120 lbs., 130 lbs. in true form	Usually 5'7", 5'10" in true form	True	False
Loa	139 lbs.	5'8"	True	False
Grey Gargoyle	(normal) 175 lbs.; (stone) 750 lbs.	5’11”	False	True
Nekra	145 lbs.	5' 11"	True	True
Miss America	130 lbs	5'8"	True	False
Whizzer (Stanley Stewart)	180 lbs.	5'11"	False	False
Scarlet Spider (Kaine)	250 lbs.	6'4"	False	True
Hank Pym	varies, normally 185 lbs.	varies, normally 6'	False	False
Azazel (Mutant)	149 lbs.	6'	False	True
Spider-Man (House of M)	165 lbs.	5'10"	False	False
Gargoyle (Yuri Topolov)	215 lbs.	4'6"	False	True

714 rows × 4 columns

In [90]:

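# Keep rows whose weight mentions pounds. str.contains treats the pattern as a
# regex, so the "." in "lbs." matches any character, not only a literal dot.
physical_knn = physical_knn[physical_knn['wiki.weight'].str.contains("lbs.")]
# Keep rows whose height contains a straight or curly apostrophe (the feet marker).
physical_knn = physical_knn[physical_knn['wiki.height'].str.contains('’|\'')]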

In [91]:

def get_weight(pandas_weight):
    """ Return first int parameter in a string """
    for p in pandas_weight.split():
        try:
            return int(p)
        except ValueError:
            pass

In [92]:

physical_knn['wiki.weight'] = physical_knn['wiki.weight'].map(get_weight)

In [93]:

FOOT = 30.48
INCH = 2.54
def get_height(pandas_height):
    """ Return the height found in a feet/inches string, in centimetres (None if unparseable) """
    
    height = None 
    for p in pandas_height.split():
        colon_split = p.split('\'')
        strange_colon_split = p.split('’')
        if len(colon_split) == 2 :
            height = colon_split
        elif len(colon_split) == 4 :
            height = colon_split[:2]
            height[1] += "\'" 
        elif len(strange_colon_split) == 2 :
            height = strange_colon_split
        elif len((pandas_height.split()[-1]).split('\'')) == 2:
            height = pandas_height.split()[-1].split('\'')
        elif len((pandas_height.split()[-1]).split('’')) == 2:
            height = pandas_height.split()[-1].split('’')
                
        else:
            universe_split = ((pandas_height.split(';')[0]).split()[-1]).split('\'')
            if len(universe_split) == 2:
                height = universe_split
            else:
                space_split = (pandas_height.split(';')[0].split()[-2:])
                if space_split[0][-1] == '\'' or space_split[0][-1] == '’':
                    height = [space_split[0][:-1], space_split[1]]
                else:
                    return None
        if height:
            try:
                foot_part = int(height[0])
                inch_part = int(height[1][:-1]) if height[1][:-1].strip() else 0
                return (foot_part*FOOT + inch_part*INCH)
            except ValueError:
                pass

In [94]:

physical_knn['wiki.height'] = physical_knn['wiki.height'].map(get_height)
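For comparison, here is a shorter regex-based parser. It is a simplified alternative, not a drop-in replacement for `get_height`: it always takes the first feet/inches pair it finds, so it can disagree with the function above on multi-valued entries. `HEIGHT_RE` and `parse_height_cm` are hypothetical names introduced for this sketch.

```python
import re

FOOT_CM = 30.48
INCH_CM = 2.54

# Feet, then a straight or curly apostrophe, then optional inches.
HEIGHT_RE = re.compile(r"(\d+)\s*['’]\s*(\d+)?")

def parse_height_cm(text):
    """Return the first feet/inches height in `text` as centimetres, or None."""
    match = HEIGHT_RE.search(text)
    if match is None:
        return None
    feet = int(match.group(1))
    inches = int(match.group(2) or 0)  # a missing inch part counts as 0
    return feet * FOOT_CM + inches * INCH_CM

for sample in ["6'3\"", "5’ 11”", "6'", "Variable (usually around 7')", "23\" (at withers)"]:
    print(repr(sample), "->", parse_height_cm(sample))
```

Strings without an apostrophe, such as the last sample, return None and would be removed by the later `dropna()`.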

In [95]:

physical_knn = physical_knn.dropna()

In [96]:

physical_knn.dtypes

Out[96]:

wiki.weight    float64
wiki.height    float64
Women             bool
Villain           bool
dtype: object

In [97]:

physical_knn.shape

Out[97]:

(578, 4)

Now we can get to work!

3. Split the corpus into train and test¶

In [98]:

from math import floor

In [99]:

TRAIN_PERCENTAGE = 0.8
train_section = floor(physical_knn.shape[0]*TRAIN_PERCENTAGE)
test_section = physical_knn.shape[0]-train_section
print("We will use {} characters to train the classifier and"\
      " {} to test the trained classifier.".format(train_section, test_section))

We will use 462 characters to train the classifier and 116 to test the trained classifier.

In [100]:

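# NOTE: np.random.choice samples with replacement by default, so train_rows can
# contain repeated characters and test_rows (everything never drawn) ends up
# larger than test_section. Pass replace=False for a strict 80/20 split.
train_rows = np.random.choice(physical_knn.index.values, train_section)
test_rows = np.setdiff1d(physical_knn.index.values,train_rows)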
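`np.random.choice` draws with replacement unless told otherwise, which is why a split written as above can yield a test set larger than the planned 20%. A minimal sketch of a split without replacement, on invented index labels (the `names` array and the fixed seed are assumptions for the example):

```python
import numpy as np

names = np.array(["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])  # invented index labels
train_size = int(len(names) * 0.8)

rng = np.random.RandomState(0)  # fixed seed so the split is repeatable
# replace=False draws each label at most once, so train and test cannot overlap
# and the test set has exactly the remaining rows.
train_rows = rng.choice(names, train_size, replace=False)
test_rows = np.setdiff1d(names, train_rows)

print(len(train_rows), len(test_rows))
```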

In [101]:

physical_knn.loc[train_rows[0]]

Out[101]:

wiki.weight      190
wiki.height    190.5
Women           True
Villain        False
Name: Cerise, dtype: object

We separate the data from the labels

In [102]:

X_train = physical_knn.loc[train_rows][['wiki.weight','wiki.height']]
y_train = physical_knn.loc[train_rows]['Women']

X_test = physical_knn.loc[test_rows][['wiki.weight','wiki.height']]
y_test = physical_knn.loc[test_rows]['Women']

3.1. Visualize the data¶

Let's take a look at the data to get a feel for how hard the task is:

In [103]:

for i, group in physical_knn.groupby(women):
    if not i:
        ax = group.plot(kind='scatter', x='wiki.height', y='wiki.weight', 
                        color='DarkBlue', label='Men');
    else:
        print(i)
        group.plot(kind='scatter', x='wiki.height', y='wiki.weight', 
                          color='DarkGreen', label='Women', ax=ax)

print(physical_knn.groupby(women).aggregate(np.mean))

True
                 wiki.weight  wiki.height  Women   Villain
wiki.categories                                           
False             243.191943   183.439763      0  0.355450
True              149.628205   170.701026      1  0.134615

4. Train the classifier¶

We create an instance of the classifier

In [104]:

from sklearn import neighbors
classifier = neighbors.KNeighborsClassifier()

In [105]:

classifier.fit(X_train, y_train)

Out[105]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

In [106]:

predict = classifier.predict(X_test)
predict

Out[106]:

array([False, False, False, False, False, False,  True,  True, False,
       False, False, False, False, False,  True, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False,  True,  True,  True, False,  True, False, False, False,
       False, False, False,  True, False, False, False,  True,  True,
       False, False, False, False, False, False, False, False, False,
        True,  True, False, False, False, False, False, False,  True,
       False,  True,  True, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True,  True, False, False,  True,  True, False, False,
       False, False,  True, False, False, False, False, False, False,
       False,  True, False, False, False, False,  True,  True,  True,
        True, False,  True,  True, False,  True, False, False,  True,
        True, False, False, False, False,  True, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False,  True, False, False, False, False, False, False,  True,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False,  True,  True, False,  True, False,
       False,  True, False,  True, False,  True,  True, False, False,
        True,  True, False, False, False, False, False,  True, False,
       False,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
       False,  True, False, False, False, False,  True,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False], dtype=bool)

6. Check the results.¶

In [107]:

from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predict)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(y_test, predict)
print("* Accuracy: {:.2f}%".format(accuracy*100))
# One value per class, in the order [False, True].
print("* Precision: {}\n* Recall: {}\n* F1-Score: {}".format(precision, recall, f1))

* Accuracy: 82.13%
* Precision: [ 0.89162562  0.58333333]
* Recall: [ 0.87864078  0.61403509]
* F1-Score: [ 0.88508557  0.5982906 ]

The science behind the beast¶

True positive: the element is a member of the class and we say that it is.
True negative: the element is not a member of the class and we say that it is not.
False positive: false alarm, the element is not a member of the class but we say that it is.
False negative: the element is a member of the class but we say that it is not.
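The four outcomes can be counted by hand on a tiny example. The labels below are invented, `confusion_counts` is a hypothetical helper, and the formulas are the standard definitions of accuracy, precision, recall and F1:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for boolean labels (True is the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)          # member, predicted member
    tn = sum(not t and not p for t, p in pairs)  # non-member, predicted non-member
    fp = sum(not t and p for t, p in pairs)      # non-member, predicted member
    fn = sum(t and not p for t, p in pairs)      # member, predicted non-member
    return tp, tn, fp, fn

# Invented ground truth and predictions.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```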

In [123]:

%%latex
\begin{align}
accuracy = \frac{\text{True Positives}+\text{True Negatives}}
{\text{True Positives}+\text{False Positives} + \text{False Negatives} + \text{True Negatives}}
\end{align}

\begin{align} accuracy = \frac{\text{True Positives}+\text{True Negatives}} {\text{True Positives}+\text{False Positives} + \text{False Negatives} + \text{True Negatives}} \end{align}

In [124]:

%%latex
\begin{align}
precision = \frac{\text{True Positives}} {\text{True Positives}+\text{False Positives}}
\end{align}

\begin{align} precision = \frac{\text{True Positives}} {\text{True Positives}+\text{False Positives}} \end{align}
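The results cell also reports recall and the F1 score. For completeness, these are their standard definitions, written in the same style as the formulas above:

```latex
\begin{align}
recall &= \frac{\text{True Positives}} {\text{True Positives}+\text{False Negatives}} \\
F_1 &= 2 \cdot \frac{precision \cdot recall} {precision + recall}
\end{align}
```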

7. Celebrate!¶

In [125]:

from matplotlib.colors import ListedColormap

cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA'])
cmap_bold = ListedColormap(['#0000FF', '#00FF00'])

step = 2

x_min, x_max = X_test['wiki.height'].min() - 1, X_test['wiki.height'].max() + 1
y_min, y_max = X_test['wiki.weight'].min() - 1, X_test['wiki.weight'].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))    
prediction = classifier.predict(X_test)
# The classifier was fitted on (weight, height), so the grid must use that column order.
Z = classifier.predict(np.c_[yy.ravel(), xx.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter( X_test['wiki.height'], X_test['wiki.weight'], c=y_test, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), 400)

Out[125]:

(19.0, 400)

What if...?¶

In [126]:

import time

for n in range(1,20, 2):
    classifier = neighbors.KNeighborsClassifier(n_neighbors=n)
    classifier.fit(X_train, y_train)
    predict = classifier.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, predict)
    print("({}) Accuracy: {:.2f}%".format(n, accuracy*100))

(1) Accuracy: 74.52%
(3) Accuracy: 79.09%
(5) Accuracy: 82.13%
(7) Accuracy: 84.03%
(9) Accuracy: 85.93%
(11) Accuracy: 86.69%
(13) Accuracy: 86.31%
(15) Accuracy: 85.93%
(17) Accuracy: 86.31%
(19) Accuracy: 85.93%

Higher education and citizenship¶

Hypothesis: social characteristics (nationality and education) distinguish female characters from male ones.

2. Data preprocessing (data munging)¶

In [127]:

cultural_knn = clean_df[['wiki.education', 'wiki.citizenship', 'Women', 'Villain']]
cultural_knn.dtypes

Out[127]:

wiki.education      object
wiki.citizenship    object
Women                 bool
Villain               bool
dtype: object

In [128]:

usa = cultural_knn['wiki.citizenship'].map(lambda x: 'U.S.A.' in x)
cultural_knn['USA'] = usa
cultural_knn = cultural_knn.drop('wiki.citizenship', axis=1)

In [129]:

cultural_knn

Out[129]:

	wiki.education	Women	Villain	USA
name
Abomination (Emil Blonsky)	Unrevealed	False	True	False
Absorbing Man	High school dropout	False	True	True
Abyss	Unrevealed	False	True	False
Agent Zero	Unrevealed	False	True	False
Annihilus	Unrevealed	False	True	False
Apocalypse	Centuries of study and experience	False	True	False
Spider-Girl (Anya Corazon)	High school student	True	False	True
Arcade	Unrevealed	False	True	True
Archangel	College degree from Xavier's School for Gifted...	False	False	True
Arclight	Unrevealed; some military training	True	True	True
Aurora	Madame DuPont's School for Girls	True	False	False
Avalanche	Unrevealed	False	True	False
Banshee	Bachelor of Science degree from Trinity Colleg...	False	False	False
Baron Strucker	University graduate	False	True	False
Baron Zemo (Heinrich Zemo)	Doctorate Degree	False	True	False
Bastion	Inapplicable	False	True	False
Batroc the Leaper	Military training	False	True	False
Battering Ram	Unrevealed	False	False	True
Beak	Some college-level courses	False	False	True
Beast	Ph.D. Biophysics	False	False	True
Beef		False	True	True
Beta-Ray Bill	Unrevealed	False	False	False
Big Wheel	College educated	False	False	True
Bishop	Unrevealed	False	False	True
Black Bolt	Unrevealed	False	False	False
Black Cat	College graduate (arts major)	True	False	True
Black Knight	Unrevealed	False	False	False
Black Panther	Ph.D in physics	False	False	False
Black Tom	Oxford University	False	True	False
Black Widow	Unrevealed; intensive espionage training throu...	True	False	False
...	...	...	...	...
Vanisher (Telford Porter)		False	True	False
Sphinx (Anath-Na Mut)	studied under caretakers of Arcturus, absorbed...	False	True	False
Molten Man	College graduate	False	True	True
Henry Peter Gyrich	University graduate	False	False	False
Reptil	Unrevealed	False	False	False
Cypher	High school, university level courses in langu...	False	False	True
Karma	unrevealed	True	False	False
She-Hulk (Lyra)	Tutored by the [[Gynosure]]	True	False	False
She-Hulk (Ultimate)	Communications degree from Berkeley, studies a...	False	False	True
Talon (Fraternity of Raptors)	Unrevealed, but the Datasong of Talon's armor ...	False	False	False
Angel (Golden Age)	Unrevealed	False	False	False
Romulus	Unrevealed, possible knowledge of genetics.	False	True	False
Meggan	No formal schooling; self-taught from watching...	True	False	False
Lucky Pierre	Unrevealed	False	False	False
Shadu the Shady	Unrevealed	False	False	False
Contessa (Vera Vidal)	Unrevealed	True	False	False
Chores MacGillicudy	Unrevealed	False	False	False
Iron Fist (Wu Ao-Shi)	Unrevealed	True	False	False
Loa	Currently in high school level courses	True	False	False
Grey Gargoyle	Unrevealed	False	True	False
Nekra	Elementary school	True	True	False
Miss America	Unrevealed	True	False	False
Whizzer (Stanley Stewart)	High school Graduate	False	False	False
Scarlet Spider (Kaine)	Possesses memories of Peter Parker's college e...	False	True	False
Hope Summers	Unrevealed	True	False	False
Enchantress (Sylvie Lushton)	Unrevealed	True	False	False
Hank Pym	extensive knowledge in various fields of scien...	False	False	False
Azazel (Mutant)	Unrevealed	False	True	False
Spider-Man (House of M)	Ph.D in biochemistry	False	False	True
Gargoyle (Yuri Topolov)		False	True	False

762 rows × 4 columns

In [130]:

def delete_without_education(cultural_knn, not_education):
    for word in not_education:
        cultural_knn = cultural_knn[~cultural_knn['wiki.education'].str.contains(word)]
    return cultural_knn

In [131]:

# Remove every character whose education is unrevealed, unknown or not applicable
cultural_knn = delete_without_education(cultural_knn, ["Unrevealed", "unrevealed", 'None', 'none', 'Not applicable',
                                                       'Unknown', 'unknown', 'Inapplicable', 'Limited'])
cultural_knn = cultural_knn[cultural_knn['wiki.education'] != '']

# Build the education-level groups
education = cultural_knn['wiki.education']

unfinished = education.map(lambda x: 'unfinished' in x or 'dropout' in x or 'incomplete' in x
                           or 'drop-out' in x or 'No official schooling' in x 
                           or 'No formal education' in x or 'Unfinished' in x
                           or 'Incomplete' in x)
education[unfinished].tolist()
education = education[~unfinished]

phd = education.map(lambda x: 'Ph.D' in x or 'master' in x or 'Masters' in x or 'PhD' in x  
                    or 'doctorate' in x or 'Doctorate' in x or 'Ph.d.' in x or 'Doctoral' in x 
                    or 'NASA' in x or 'Journalism graduate' in x or 'scientist' in x
                    or 'Geneticist' in x or 'residency' in x)
education[phd].tolist()
education = education[~phd]

college = education.map(lambda x: 'College' in x or 'college' in x or 'University' in x 
                        or 'post-graduate' in x or 'B.A' in x or 'B.S.' in x or 'university' in x  
                        or 'Master' in x or 'Collage' in x or 'Degree' in x or 'degree' in x
                        or 'Engineering' in x or 'engineer' in x or 'programming' in x
                        or 'Programming' in x or 'Doctor' in x or 'Medical school' in x
                        or 'higher education' in x) 
education[college].tolist()
education = education[~college]

militar = education.map(lambda x: 'Military' in x or 'Xandarian Nova Corps' in x  or 'FBI' in x
                        or 'S.H.I.E.L.D.' in x or 'military' in x or 'Nicholas Fury' in x 
                        or 'Warrior' in x or 'combat' in x or 'Combat' in x or 'Soldier' in x
                        or 'spy academy' in x or 'Police' in x or 'warfare' in x or 'Public Eye' in x)
education[militar].tolist()
education = education[~militar]

hs = education.map(lambda x: 'High school' in x or 'high school' in x or 'High-school' in x 
                   or 'High School' in x or 'high School' in x)
education[hs].tolist()
education = education[~hs]

tutored = education.map(lambda x: 'Tutored' in x or 'tutors' in x or 'tutored' in x 
                        or 'Mentored' in x or 'Home schooled' in x  or 'Private education' in x)
education[tutored].tolist()
education = education[~tutored]

autodidacta = education.map(lambda x: 'Self-taught' in x or 'self-taught' in x 
                            or 'Little or no formal schooling' in x or 'Little formal schooling' in x
                            or 'Some acting school' in x or 'through observation' in x)
education[autodidacta].tolist()
education = education[~autodidacta]

special = education.map(lambda x: 'Sorcery' in x or 'cosmic experience' in x or 'magic' in x 
                        or 'Priests of Pama' in x or 'Xavier Institute' in x or 'Carlos Javier’s' in x
                        or 'Self educated' in x or 'Shao-Lom' in x or 'Centuries of study and experience' in x
                        or 'Askani' in x or 'Madame DuPont' in x or 'Titanian' in x or 'arcane arts' in x
                        or 'Muir-MacTaggert' in x or 'Uploaded data' in x or 'Programmed' in x
                        or 'Accelerated' in x or 'Inhumans' in x or 'Able to access knowledge' in x
                        or 'lifetime' in x or 'Watchers\' homeworld' in x or 'Uranian Eternals' in x
                        or 'Arcturus' in x or 'Oatridge School for Boys' in x)
education[special].tolist()
education = education[~special] 

basic = education.map(lambda x: 'Self-taught' in x or 'Homed schooled' in x or 'graduate school' in x
                      or 'Elementary school' in x or 'Secondary school' in x or 'school graduate' in x
                      or 'Boarding school' in x or 'Massachusetts Academy' in x)
education[basic].tolist()
education = education[~basic] 
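The chain of filters above amounts to "the first group with a matching keyword wins". A compact way to express that idea is an ordered keyword table. This is a sketch, not an equivalent rewrite: it matches case-insensitively, copies only a few keywords per group, and `EDUCATION_GROUPS` and `education_group` are hypothetical names.

```python
# Ordered (group, keywords) pairs: the first group with a matching keyword wins,
# mirroring how each step above removes its matches before the next one runs.
# Only a few keywords per group are copied here; the lists are deliberately incomplete.
EDUCATION_GROUPS = [
    ("unfinished", ["dropout", "drop-out", "unfinished", "incomplete"]),
    ("superior", ["ph.d", "phd", "doctorate", "masters"]),
    ("college", ["college", "university", "degree"]),
    ("militar", ["military", "combat", "s.h.i.e.l.d."]),
    ("high school", ["high school", "high-school"]),
]

def education_group(text):
    """Return the first group whose keywords appear in `text`, or None."""
    lowered = text.lower()
    for group, keywords in EDUCATION_GROUPS:
        if any(keyword in lowered for keyword in keywords):
            return group
    return None

for sample in ["High school dropout", "Ph.D. Biophysics", "Military training", "Elementary school"]:
    print(sample, "->", education_group(sample))
```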

In [132]:

educational_dict = {'autodidacta': autodidacta, 'unfinished': unfinished, 'superior': phd, 'college':college, 
                    'militar': militar, 'high school':hs, 'tutored': tutored, 'special':special, 'basic': basic}

numeric = {'autodidacta': 1, 'unfinished': 2, 'superior': 3, 'college':4, 
           'militar': 5, 'high school':6, 'tutored': 7, 'special':8, 'basic': 9}

In [133]:

def clean_education_levels(educational_dict, cultural_knn):
    """ Replace each wiki.education text with the numeric code of its group (modifies cultural_knn in place) """
    for k, education in educational_dict.items():
        index = education[education.loc[:]].index
        for character in index:
            cultural_knn.loc[character, 'wiki.education'] = numeric[k]

In [134]:

clean_education_levels(educational_dict, cultural_knn)
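Knn works on distances, so the integer codes implicitly claim that, for instance, 'autodidacta' (1) is closer to 'unfinished' (2) than to 'basic' (9). If that ordering is not intended, one common alternative, which this notebook does not use, is one-hot encoding. A small sketch in plain Python (`GROUPS`, `one_hot` and `squared_distance` are hypothetical helpers):

```python
GROUPS = ['autodidacta', 'unfinished', 'superior', 'college',
          'militar', 'high school', 'tutored', 'special', 'basic']

def one_hot(group):
    """Return a 0/1 vector with a single 1 at the position of `group`."""
    return [1 if name == group else 0 for name in GROUPS]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# With one-hot vectors every pair of distinct groups is equally far apart.
print(squared_distance(one_hot('autodidacta'), one_hot('unfinished')))
print(squared_distance(one_hot('autodidacta'), one_hot('basic')))
```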

3. Split the corpus into train and test¶

In [135]:

TRAIN_PERCENTAGE = 0.8
train_section = floor(cultural_knn.shape[0]*TRAIN_PERCENTAGE)
test_section = cultural_knn.shape[0]-train_section
print("We will use {} characters to train the classifier and"\
      " {} to test the trained classifier.\n".format(train_section, test_section))

# NOTE: np.random.choice samples with replacement by default; pass replace=False
# to avoid repeated rows in train_rows.
train_rows = np.random.choice(cultural_knn.index.values, train_section)
test_rows = np.setdiff1d(cultural_knn.index.values,train_rows)

print(cultural_knn.loc[train_rows[0]])

We will use 344 characters to train the classifier and 87 to test the trained classifier.

wiki.education        4
Women             False
Villain            True
USA                True
Name: Toxin (Eddie Brock), dtype: object

3.1. Visualize the data¶

In [150]:

for i, group in cultural_knn.groupby(women):
    print(group)

                           wiki.education  Women Villain    USA
name                                                           
Absorbing Man                           2  False    True   True
Apocalypse                              8  False    True  False
Archangel                               4  False   False   True
Banshee                                 4  False   False  False
Baron Strucker                          4  False    True  False
Baron Zemo (Heinrich Zemo)              3  False    True  False
Batroc the Leaper                       5  False    True  False
Beak                                    4  False   False   True
Beast                                   3  False   False   True
Big Wheel                               4  False   False   True
Black Panther                           3  False   False  False
Black Tom                               4  False    True  False
Blackheart                              7  False    True  False
Blade                                   6  False   False   True
Blizzard                                6  False   False   True
Cable                                   8  False   False   True
Luke Cage                               2  False   False   True
Cannonball                              6  False   False   True
Captain America                         5  False   False   True
Captain Britain                         3  False   False  False
Captain Marvel (Mar-Vell)               5  False   False  False
Captain Stacy                           4  False   False   True
Carnage                                 6  False    True   True
Chamber                                 8  False   False  False
Cloak                                   2  False   False   True
Malcolm Colcord                         5  False    True  False
Colossus                                4  False   False   True
Constrictor                             4  False   False   True
Count Nefaria                           4  False    True   True
Crimson Dynamo                          3  False    True  False
...                                   ...    ...     ...    ...
Mindworm                                6  False   False   True
Calamity                                4  False   False  False
Skaar                                   2  False   False  False
Supernaut                               5  False   False  False
Hypno-Hustler                           2  False    True  False
Vin Gonzales                            5  False   False  False
Blue Shield                             6  False   False  False
Crimson Crusader                        9  False   False  False
Cobalt Man                              3  False    True  False
Jackal                                  3  False    True   True
High Evolutionary                       3  False    True  False
Anole                                   6  False   False   True
Justin Hammer                           4  False    True  False
Junta                                   5  False   False   True
Omega Sentinel                          5  False   False  False
3-D Man                                 5  False   False   True
Nightcrawler (Ultimate)                 6  False   False  False
Angel (Ultimate)                        4  False   False  False
Vance Astro                             4  False   False  False
Jamie Braddock                          4  False    True  False
Tinkerer                                4  False    True  False
Sphinx (Anath-Na Mut)                   8  False    True  False
Molten Man                              4  False    True   True
Henry Peter Gyrich                      4  False   False  False
Cypher                                  4  False   False   True
She-Hulk (Ultimate)                     4  False   False   True
Whizzer (Stanley Stewart)               6  False   False  False
Scarlet Spider (Kaine)                  4  False    True  False
Hank Pym                                3  False   False  False
Spider-Man (House of M)                 3  False   False   True

[305 rows x 4 columns]
                           wiki.education Women Villain    USA
name                                                          
Spider-Girl (Anya Corazon)              6  True   False   True
Aurora                                  8  True   False  False
Black Cat                               4  True   False   True
Catseye                                 9  True    True   True
Clea                                    4  True   False  False
Crystal                                 7  True   False  False
Dagger                                  2  True   False   True
Darkstar                                5  True   False  False
Dazzler                                 4  True   False   True
Dust                                    6  True   False  False
Elektra                                 4  True   False  False
Expediter                               7  True   False  False
Firestar                                4  True   False   True
Emma Frost                              4  True   False   True
Husk                                    8  True   False  False
Invisible Woman                         2  True   False   True
Jocasta                                 2  True   False  False
Jessica Jones                           6  True   False  False
Jubilee                                 6  True   False  False
Lady Deathstrike                        7  True    True  False
Magik (Illyana Rasputin)                6  True   False  False
Magma (Amara Aquilla)                   4  True   False  False
Marrow                                  5  True   False   True
Rachel Grey                             4  True   False   True
Alicia Masters                          4  True   False  False
Medusa                                  7  True   False  False
Meltdown                                2  True   False   True
Moondragon                              8  True   False   True
Namorita                                2  True   False  False
Nocturne                                4  True   False   True
...                                   ...   ...     ...    ...
Hobgoblin (Robin Borne)                 3  True    True   True
Nova (Frankie Raye)                     4  True   False  False
Puck (Zuzha Yu)                         2  True   False  False
Wind Dancer                             6  True   False  False
Sway                                    8  True   False   True
Mantis                                  8  True   False  False
Joystick                                2  True   False  False
Satana                                  7  True   False   True
Turbo                                   4  True   False   True
M (Monet St. Croix)                     9  True   False  False
Bloodaxe                                3  True    True   True
Layla Miller                            9  True   False   True
Beyonder                                7  True   False  False
Cammi                                   2  True   False   True
Tana Nile                               9  True   False  False
Praxagora                               8  True   False  False
Skreet                                  8  True   False  False
Thena                                   8  True   False  False
Mockingbird                             3  True   False  False
Menace                                  4  True    True   True
Geiger                                  4  True   False  False
Carlie Cooper                           4  True   False   True
Imp                                     9  True   False  False
Armor (Hisako Ichiki)                   8  True   False   True
Thundra                                 5  True   False  False
Vapor                                   4  True    True  False
She-Hulk (Lyra)                         7  True   False  False
Meggan                                  1  True   False  False
Loa                                     6  True   False  False
Nekra                                   9  True    True  False

[126 rows x 4 columns]

In [168]:

for i, group in cultural_knn.groupby(women):
    if not i:
        area = (np.pi * (group.shape[0])**2)*.002
        ax = group.plot(kind='scatter', x='wiki.education', y='USA', s=area, 
                        color='Cornflowerblue', label='Men', alpha=0.5);
    else:
        area = (np.pi * (group.shape[0])**2)*.002
        group.plot(kind='scatter', x='wiki.education', y='USA', 
                   color='LightGreen', label='Women', ax=ax, s=area, alpha=0.5)
        
        

ax.set_xticks(range(1,10))
ax.set_xticklabels(list(numeric.keys()), rotation='vertical')
ax.set_yticks(range(0,2))
ax.set_yticklabels(['non USA', 'USA'], rotation='horizontal')

ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.1), ncol=2, fancybox=True, shadow=True)

Out[168]:

<matplotlib.legend.Legend at 0x117198a90>

4 and 6. Train the classifier and check the results¶

In [108]:

X_train = cultural_knn.loc[train_rows][['wiki.education','USA']]
y_train = cultural_knn.loc[train_rows]['Women']

X_test = cultural_knn.loc[test_rows][['wiki.education','USA']]
y_test = cultural_knn.loc[test_rows]['Women']

classifier = neighbors.KNeighborsClassifier()

classifier.fit(X_train, y_train)

predict = classifier.predict(X_test)

accuracy = metrics.accuracy_score(y_test, predict)
precision, recall, f1, _ = metrics.precision_recall_fscore_support(y_test, predict)
print("* Accuracy: {:.2f}%".format(accuracy*100))
print("* Precision: {}\n* Recall: {}.\n* F1-Score: {}".format(precision, recall, f1))

* Accuracy: 68.04%
* Precision: [ 0.72159091  0.27777778]
* Recall: [ 0.90714286  0.09259259].
* F1-Score: [ 0.80379747  0.13888889]
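
To make the per-class numbers easier to read, here is a minimal sketch of how precision, recall and F1 can be computed by hand for a single label. The helper `per_class_scores` and the toy labels are hypothetical and not part of the original notebook; `metrics.precision_recall_fscore_support` returns the same kind of values, one array entry per class (`False` first, then `True`).

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label missed by the classifier
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels (True = woman), only for illustration
y_true = [False, False, False, True, True, False]
y_pred = [False, False, True, True, False, False]

for label in (False, True):
    print(label, per_class_scores(y_true, y_pred, label))
```

A low recall for the `True` class means that most women in the test set were predicted as men, which a high overall accuracy can hide when the classes are unbalanced.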

7. Celebrate!¶

In [127]:

from matplotlib.colors import ListedColormap

# Light colours for the decision regions, bold colours for the test points
cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA'])
cmap_bold = ListedColormap(['#0000FF', '#00FF00'])

step = 1

# Grid over every (education, USA) combination: education 1..9, USA in {0, 1}
xx, yy = np.meshgrid(np.arange(1, 10, step),
                     np.arange(0, 2, step))
Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(X_test['wiki.education'], X_test['USA'], c=y_test, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
# USA is boolean, so 0 means non USA and 1 means USA
plt.yticks(range(0,2), ['non USA', 'USA'], rotation='horizontal')
plt.ylim(-0.5, 1.5)
plt.xticks(range(1,10), list(numeric.keys()), rotation='vertical')
plt.xlabel("Education")

Out[127]:

<matplotlib.text.Text at 0x7f8220813e50>
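
The coloured regions in the plot come from asking the classifier to label every point of the grid. As a rough sketch of what that means for k-nearest neighbours, the function below labels a query point by majority vote among its `k` closest training points. `knn_predict` and the toy data are hypothetical; scikit-learn's `KNeighborsClassifier` is far more efficient, and its default is `n_neighbors=5` with uniform weights.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is enough for ranking the neighbours
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), index)
        for index, point in enumerate(train_points)
    )
    votes = Counter(train_labels[index] for _, index in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (education, USA) pairs and Women labels, only for illustration
train_points = [(2, 1), (3, 1), (4, 1), (8, 0), (9, 0), (7, 0)]
train_labels = [False, False, False, True, True, True]

print(knn_predict(train_points, train_labels, (8, 0), k=3))
print(knn_predict(train_points, train_labels, (3, 1), k=3))
```

With only two features that take few distinct values, many characters share exactly the same coordinates, so the vote is often decided by whichever class is more frequent at that point.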

Thank you!¶

If you have a question, now is the time. *If we haven't run out of time, of course :)*¶

We had more...¶

When were characters modified?¶

In [128]:

%pylab inline --no-import-all
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.width', 4000)
pd.set_option('display.max_columns', 100)
from matplotlib.pyplot import *

Populating the interactive namespace from numpy and matplotlib

In [132]:

marvel_df['modified'] = pd.to_datetime(marvel_df['modified'])

In [134]:

plot(marvel_df['modified'])

Out[134]:

[<matplotlib.lines.Line2D at 0x7f8221388350>]

In [135]:

start = marvel_df.modified.min()
end =  marvel_df.modified.max()

yearly_range = pd.date_range(start, end, freq='365D6H')

In [137]:

marvel_df[['modified']].head()

Out[137]:

	modified
0	1970-01-01 00:00:00
0	1970-01-01 00:00:00
0	1970-01-01 00:00:00
0	2011-05-17 21:26:18
0	1970-01-01 00:00:00

In [139]:

characters_per_year = marvel_df.groupby(marvel_df['modified'].map(lambda x: x.year)).size()
characters_per_year

Out[139]:

modified
1970        802
2004          8
2010         54
2011        119
2012         53
2013        292
2014         74
dtype: int64
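
Most rows fall in 1970, and the `head()` output above shows them as exactly `1970-01-01 00:00:00`, the Unix epoch. That looks like a placeholder for a missing modification date rather than a real edit, although the notebook does not confirm this. Under that assumption, a small sketch of how those rows could be left out before counting per year (the `toy` frame is hypothetical and only reuses a few timestamps shown in the outputs):

```python
import pandas as pd

# Hypothetical stand-in for marvel_df with only the 'modified' column
toy = pd.DataFrame({'modified': pd.to_datetime([
    '1970-01-01 00:00:00',
    '2011-05-17 21:26:18',
    '1970-01-01 00:00:00',
    '2014-03-05 18:58:52',
    '2014-03-05 18:58:42',
])})

# Assumption: the epoch timestamp marks "no modification date recorded"
epoch = pd.Timestamp('1970-01-01')
real_edits = toy[toy['modified'] != epoch]

# Same grouping as in the notebook, using the .dt accessor for the year
per_year = real_edits.groupby(real_edits['modified'].dt.year).size()
print(per_year)
```

Plotting the filtered counts keeps the 1970 spike from flattening the rest of the curve.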

In [140]:

characters_per_year.plot()

Out[140]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f821f8ca990>

In [141]:

marvel_df.sort_values('modified', ascending=False).head()

Out[141]:

comics.available	comics.collectionURI	comics.items	comics.returned	description	events.available	events.collectionURI	events.items	events.returned	id	modified	name	resourceURI	series.available	series.collectionURI	series.items	series.returned	stories.available	stories.collectionURI	stories.items	stories.returned	thumbnail.extension	thumbnail.path	urls	wiki.Date_of_birth	wiki.Place_of_birth	wiki.abilities	wiki.aliases	wiki.appearance	wiki.base_of_operations	wiki.bio	wiki.bio_text	wiki.blurb	wiki.builder	wiki.categories	wiki.categorytext	wiki.citizenship	wiki.creator	wiki.creators	wiki.current_members	wiki.debut	wiki.distinguishing_features	wiki.dstinguishing_features	wiki.education	wiki.event_text	wiki.eyes	wiki.features	wiki.former_members	wiki.govenment	wiki.government	wiki.groups	wiki.hair	wiki.height	wiki.home_world	wiki.identity	wiki.key_characters	wiki.key_issues	wiki.leader	wiki.location	wiki.main_image	wiki.members	wiki.object_text	wiki.occupation	wiki.origin	wiki.other_members	wiki.owner	wiki.paraphernalia	wiki.place_of_birth	wiki.place_of_creation	wiki.place_text	wiki.points_of_interest	wiki.power	wiki.powers	wiki.real_name	wiki.relatives	wiki.significant_citizens	wiki.significant_issues	wiki.skin	wiki.special_limitations	wiki.specieshistory	wiki.team_name	wiki.teamicon	wiki.technology	wiki.tie-ins	wiki.title_graphic	wiki.universe	wiki.weapons	wiki.weaponss	wiki.weight
26	http://gateway.marvel.com/v1/public/characters...	[{'id': 36834, 'resourceURI': 'http://gateway....	26	Decades after participating in military airdro...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	1011006	2014-03-05 18:58:52	Wolverine (Ultimate)	http://gateway.marvel.com/v1/public/characters...	17	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	17	21	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	20	jpg	http://i.annihil.us/u/prod/marvel/i/mg/9/03/53...	[{'url': 'http://marvel.com/comics/characters/...	NaN	NaN	NaN	Logan, Weapon X, Lucky Jim	NaN	NaN	Howlett's past is mostly unknown, but during W...	Howlett's past is mostly unknown, but during W...	NaN	NaN	[Ultimate Marvel, Deceased]	NaN	Presumably Canada	NaN	NaN	NaN	Ultimate X-Men #1 (2001)	NaN	NaN	Unrevealed	NaN	Blue	NaN	NaN	NaN	NaN	[[X-Men (Ultimate)\|X-Men]]; formerly [[Brother...	Black	5'9"	NaN	("Logan") publicly known; (James Howlett) Know...	NaN	NaN	NaN	NaN	Ultwolv.jpg	NaN	NaN	Student, adventurer; formerly mercenary, gover...		NaN	NaN	NaN	Unrevealed, probably somewhere in Canada	NaN	NaN	NaN	NaN	Wolverine's mutant healing factor enables him ...	James Howlett	Wife (name and status unrevealed); son (allege...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[[Ultimate]]	NaN	NaN	292 lbs. (including adamantium)
6	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	6	While Eddie Brock’s academic career seemed to ...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	1011128	2014-03-05 18:58:42	Venom (Ultimate)	http://gateway.marvel.com/v1/public/characters...	5	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	5	3	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	3	jpg	http://i.annihil.us/u/prod/marvel/i/mg/e/10/53...	[{'url': 'http://marvel.com/comics/characters/...	NaN	NaN	Eddie has a natural aptitude for bioengineerin...	The Suit	NaN	NaN	Eddie Brock was the son of a brilliant scienti...	Eddie Brock was the son of a brilliant scienti...	NaN	NaN	[Ultimate_Marvel]	NaN	United States	NaN	NaN	NaN	Ultimate Spider-Man #33 (2003)	NaN	NaN	College student, extensive Bioengineering studies	NaN	(Eddie) Blue; (Venom) White	NaN	NaN	NaN	NaN	none	(Eddie) Blond; (Venom) None	5'11"	NaN	Secret	NaN	NaN	NaN	NaN	Ultimatevenom.jpg	NaN	NaN	Student	Ultimate Spider-Man #36-37 (2003)	NaN	NaN	NaN	New York, New York	NaN	NaN	NaN	NaN	The symbiotic suit bonded to Brock was designe...	Edward "Eddie" Brock Jr.	Edward Brock Sr. (father, deceased), unidentif...	NaN	Reunited with Peter, shared their fathers’ wor...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[[Ultimate]]	NaN	NaN	175 lbs
22	http://gateway.marvel.com/v1/public/characters...	[{'id': 35528, 'resourceURI': 'http://gateway....	22	One of Spider-Man's oldest enemies, Mac Gargan...	5	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	5	1010788	2014-03-05 18:58:37	Venom (Mac Gargan)	http://gateway.marvel.com/v1/public/characters...	10	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	10	21	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	20	jpg	http://i.annihil.us/u/prod/marvel/i/mg/5/50/53...	[{'url': 'http://marvel.com/comics/characters/...	NaN	NaN	Mac Gargan has the intellectual skills of an a...	Spider-Man; formerly Scorpion	NaN	NaN	One of Spider-Man's oldest enemies, MacGargan ...		NaN	NaN	[Avengers, Civil War, Spider-Man, Spider-Man V...	NaN	U.S.A. with a criminal record	NaN	NaN	NaN	(As Gargan) Amazing Spider-Man #19 (1964); (as...	NaN	NaN	High school graduate	NaN	Brown	NaN	NaN	NaN	NaN	Formerly [[Avengers (Osborn's team)]], [[Thund...	Brown (shaves head)	6'3"	NaN	Publicly known	NaN	NaN	NaN	NaN	Venom(MacGargan)_Head.jpg	NaN	NaN	U.S. government agent; former professional cri...	(As Scorpion) Amazing Spider-Man #20 (1965); (...	NaN	NaN	As Scorpion: Mac Gargan wore a costume that wa...	Yonkers, New York	NaN	NaN	NaN	NaN	As Scorpion: enhanced strength, enabling him t...	MacDonald "Mac" Gargan	None	NaN	Confronted by Venom symbiote (Marvel Knights: ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[[Marvel Universe]]	As Scorpion: The tail on his Scorpion costume ...	NaN	220 lbs. / 245 lbs. (with symbiote)
0	http://gateway.marvel.com/v1/public/characters...	[]	0		0	http://gateway.marvel.com/v1/public/characters...	[]	0	1011239	2014-03-05 18:58:33	Valkyrie (Ultimate)	http://gateway.marvel.com/v1/public/characters...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	0	http://gateway.marvel.com/v1/public/characters...	[]	0	jpg	http://i.annihil.us/u/prod/marvel/i/mg/4/20/53...	[{'url': 'http://marvel.com/comics/characters/...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
30	http://gateway.marvel.com/v1/public/characters...	[{'id': 41845, 'resourceURI': 'http://gateway....	30	He claims he is the legendary Norse thunder de...	0	http://gateway.marvel.com/v1/public/characters...	[]	0	1011025	2014-03-05 18:58:19	Thor (Ultimate)	http://gateway.marvel.com/v1/public/characters...	19	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	19	24	http://gateway.marvel.com/v1/public/characters...	[{'resourceURI': 'http://gateway.marvel.com/v1...	20	jpg	http://i.annihil.us/u/prod/marvel/i/mg/3/80/53...	[{'url': 'http://marvel.com/comics/characters/...	NaN	NaN	NaN	None	NaN	NaN	He claims he is the legendary Norse thunder de...	He claims he is the legendary Norse thunder de...	NaN	NaN	[Ultimate Marvel]	NaN	(Thor) Asgard; (Golmen) Norway	NaN	this has not been updated yet	NaN	The Ultimates #4 (2002)	NaN	NaN	Unrevealed	NaN	Blue	NaN	NaN	NaN	NaN	Formerly Ultimates	Blond	6'5"	NaN	(Thor) Publicly known; (Golmen) known to autho...	NaN	NaN	NaN	NaN	Thorult head.jpg	NaN	NaN	Guardian deity; formerly psychiatric nurse	NaN	NaN	NaN	NaN	(Thor) Asgard; (Golmen) Norway	NaN	NaN	NaN	NaN	Thor possesses immense superhuman strength, en...	Thor or Thorlief Golmen	(Thor) Odin (father), Loki (half-brother); (Go...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[[Ultimate]]	Enchanted hammer named Mjolnir.	NaN	285 lbs.