My friend Johannes Giorgis and I are developing a series of Data Science Challenges to help others become better data scientists. Why did we do this?
Because that's what heroes do!
--Johannes Giorgis
I now present my response to the first challenge, Exploring the Meetup API in the city of my choice.
San Francisco, CA, I choose you!
Use the Meetup API to explore meetups in your city of choice.
Guide Questions:
Below are some guideline questions to get you started:
What is the biggest Tech Group in San Francisco that meets regularly and has a growing and enthusiastic membership?
Add a Meetup API Key to your environment.
import datetime
import json
import math
import meetup.api
import os
import pprint
import requests
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
from tqdm import tnrange, tqdm_notebook
%matplotlib inline
Nothing works without MEETUP_API_KEY.
assert 'MEETUP_API_KEY' in os.environ, (
    "You need a MEETUP_API_KEY in your environment; please see the "
    "README for instructions.")
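If the assertion fails, one convenient setup (an assumption on my part; the README describes this project's actual instructions) is to keep the key in a local .env file and load it with python-dotenv:
# A sketch, assuming python-dotenv is installed and a .env file next to
# the notebook contains a line like MEETUP_API_KEY=<your key>.
from dotenv import load_dotenv

load_dotenv()  # copies entries from .env into os.environ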
client = meetup.api.Client()
There are multiple categories of groups on Meetup; let's use Python's meetup.api client to call GetCategories.
%%capture --no-display
categories = client.GetCategories()
First, let's create a helper function to parse out the two most useful pieces of each response:
def parse_response(response):
    """Returns two dataframes, meta and results:

    meta: a vertically aligned dataframe, where each row is an element
        of the response.meta dictionary
    results: a horizontally aligned dataframe, where each column is
        an element of the response.results dictionary
    """
    meta = pd.DataFrame.from_dict(response.meta, orient='index')
    results = pd.DataFrame.from_dict(response.results)
    return meta, results
We received a response object when we called client.GetCategories().
By looking at the categories meta dataframe, we can see that there are 33 different categories.
cats_meta_df, cats_df = parse_response(categories)
cats_meta_df
| 0 |
---|---|
next | |
method | Categories |
total_count | 33 |
link | https://api.meetup.com/2/categories |
count | 33 |
description | Returns a list of Meetup group categories |
lon | None |
title | Categories |
url | https://api.meetup.com/2/categories?offset=0&f... |
id | |
updated | 1450292956000 |
lat | None |
We can see from the meta that there are 33 categories available to us. I wonder what they are.
Notice that the value of next (above) is an empty string. Meetup API v2 response payloads come in pages, one at a time, but each page provides the URI of the next API call in the sequence in response.meta["next"]. We can use this to programmatically fetch each successive page until the complete result is returned.
As we can see, the response.meta["next"] for this page is an empty string, so all of the categories fit into our first API call.
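As an aside, if we were using raw HTTP with the requests library (imported at the top) instead of the meetup.api client, the paging loop might look like this sketch; the URL and key parameter follow the v2 pattern visible in the meta above:
def fetch_all_pages(url, params=None):
    """Accumulates results across pages by following meta['next']."""
    results = []
    while url:
        payload = requests.get(url, params=params).json()
        results.extend(payload["results"])
        url = payload["meta"]["next"]  # empty string on the last page
        params = None  # the "next" URL already embeds the query parameters
    return results

all_categories = fetch_all_pages(
    "https://api.meetup.com/2/categories",
    params={"key": os.environ["MEETUP_API_KEY"]})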
cats_df
| id | name | shortname | sort_name |
---|---|---|---|---|
0 | 1 | Arts & Culture | Arts | Arts & Culture |
1 | 18 | Book Clubs | Book Clubs | Book Clubs |
2 | 2 | Career & Business | Business | Career & Business |
3 | 3 | Cars & Motorcycles | Auto | Cars & Motorcycles |
4 | 4 | Community & Environment | Community | Community & Environment |
5 | 5 | Dancing | Dancing | Dancing |
6 | 6 | Education & Learning | Education | Education & Learning |
7 | 8 | Fashion & Beauty | Fashion | Fashion & Beauty |
8 | 9 | Fitness | Fitness | Fitness |
9 | 10 | Food & Drink | Food & Drink | Food & Drink |
10 | 11 | Games | Games | Games |
11 | 13 | Movements & Politics | Movements | Movements & Politics |
12 | 14 | Health & Wellbeing | Well-being | Health & Wellbeing |
13 | 15 | Hobbies & Crafts | Crafts | Hobbies & Crafts |
14 | 16 | Language & Ethnic Identity | Languages | Language & Ethnic Identity |
15 | 12 | LGBT | LGBT | LGBT |
16 | 17 | Lifestyle | Lifestyle | Lifestyle |
17 | 20 | Movies & Film | Films | Movies & Film |
18 | 21 | Music | Music | Music |
19 | 22 | New Age & Spirituality | Spirituality | New Age & Spirituality |
20 | 23 | Outdoors & Adventure | Outdoors | Outdoors & Adventure |
21 | 24 | Paranormal | Paranormal | Paranormal |
22 | 25 | Parents & Family | Moms & Dads | Parents & Family |
23 | 26 | Pets & Animals | Pets | Pets & Animals |
24 | 27 | Photography | Photography | Photography |
25 | 28 | Religion & Beliefs | Beliefs | Religion & Beliefs |
26 | 29 | Sci-Fi & Fantasy | Sci fi | Sci-Fi & Fantasy |
27 | 30 | Singles | Singles | Singles |
28 | 31 | Socializing | Social | Socializing |
29 | 32 | Sports & Recreation | Sports | Sports & Recreation |
30 | 33 | Support | Support | Support |
31 | 34 | Tech | Tech | Tech |
32 | 36 | Writing | Writing | Writing |
In this case, I want Tech. Let's query the dataframe for categories named Tech.
tech_df = cats_df.loc[cats_df['name'] == 'Tech']
tech_df
| id | name | shortname | sort_name |
---|---|---|---|---|
31 | 34 | Tech | Tech | Tech |
tech_category_id = tech_df['id'].values[0]
tech_category_id
34
%%capture --no-display
cities_resp = client.GetCities(country='us', query='San Francisco')
Here we used the GetCities method of the Python Meetup API client to query for cities in the United States named San Francisco.
Now let's take a look at the meta for our results.
cities_meta_df, cities_df = parse_response(cities_resp)
cities_meta_df
| 0 |
---|---|
next | |
method | Cities |
total_count | 4 |
link | https://api.meetup.com/2/cities |
count | 4 |
description | Returns Meetup cities. This method supports se... |
lon | None |
title | Cities |
url | https://api.meetup.com/2/cities?country=us&off... |
id | |
updated | 1263132740000 |
lat | None |
I only know of the one San Francisco; why are there 4 cities?
cities_df
| city | country | id | lat | localized_country_name | lon | member_count | name_string | ranking | state | zip |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | San Francisco | us | 94101 | 37.779999 | USA | -122.419998 | 60351 | San Francisco, California, USA | 0 | CA | 94101 |
1 | Bosque | us | 87006 | 34.560001 | USA | -106.779999 | 5 | San Francisco, New Mexico, USA | 1 | NM | 87006 |
2 | San Luis | us | 81152 | 37.080002 | USA | -105.620003 | 4 | San Francisco, Colorado, USA | 2 | CO | 81152 |
3 | Reserve | us | 87830 | 33.650002 | USA | -108.769997 | 1 | San Francisco Plaza, New Mexico, USA | 3 | NM | 87830 |
Oh, there are lots of San Franciscos!
single_city_df = cities_df.loc[cities_df['state'] == 'CA']
single_city_df
| city | country | id | lat | localized_country_name | lon | member_count | name_string | ranking | state | zip |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | San Francisco | us | 94101 | 37.779999 | USA | -122.419998 | 60351 | San Francisco, California, USA | 0 | CA | 94101 |
One San Francisco, perfect!
latitude = single_city_df['lat'].iloc[0]   # .iloc: positional, independent of index labels
longitude = single_city_df['lon'].iloc[0]
latitude, longitude
(37.779998779296875, -122.41999816894531)
Note: This function will use the tech_category_id, latitude, and longitude values that we found earlier.
def get_a_group(page_number, category_id=tech_category_id, lat=latitude,
                lon=longitude):
    """Fetches one page of groups, retrying up to three times on failure."""
    retry_counter, retry_max = 0, 3
    print(f"Getting page {page_number}")
    while retry_counter < retry_max:
        try:
            return client.GetGroups(
                category_id=category_id, lat=lat, lon=lon, offset=page_number)
        except Exception:
            print(f"Fetch failure {retry_counter + 1}")
            retry_counter += 1
    raise Exception(f"Unable to fetch page after {retry_counter} attempts")
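One possible refinement, not in the original notebook: sleeping with exponential backoff between retries is gentler on Meetup's rate limiter than retrying immediately. A sketch (the function name and sleep schedule are mine):
import time

def get_a_group_with_backoff(page_number, retry_max=3):
    """Like get_a_group, but waits 1s, 2s, 4s, ... between retries."""
    for attempt in range(retry_max):
        try:
            return client.GetGroups(
                category_id=tech_category_id, lat=latitude, lon=longitude,
                offset=page_number)
        except Exception:
            print(f"Fetch failure {attempt + 1}; backing off")
            time.sleep(2 ** attempt)
    raise Exception(f"Unable to fetch page after {retry_max} attempts")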
Let's review the meta to see what we are getting into:
%%capture --no-display
group_resp = get_a_group(0)
group_meta, _ = parse_response(group_resp)
group_meta
| 0 |
---|---|
next | https://api.meetup.com/2/groups?offset=1&forma... |
method | Groups |
total_count | 2197 |
link | https://api.meetup.com/2/groups |
count | 200 |
description | None |
lon | -122.42 |
title | Meetup Groups v2 |
url | https://api.meetup.com/2/groups?offset=0&forma... |
id | |
updated | 1550965553000 |
lat | 37.78 |
Remember earlier when I spoke about response.meta["next"]?
It seems as though our result will span multiple API calls, each returning a page of up to 200 new groups.
Let's make a new helper that will grab each page in a series of API calls until we obtain the entire data set:
We will use the pandas.concat function to collate all pages into a single useful dataframe.
def get_all_groups_as_a_df():
    """Returns a single dataframe composed from data from multiple
    successive calls to get_a_group.

    We will loop through get_a_group pages while page.meta['next'] is
    not the empty string.
    """
    page_df_list = []
    next_page = None
    page_number = 0
    while next_page != '':
        page = get_a_group(page_number)
        next_page = page.meta["next"]
        _, frame = parse_response(page)
        page_number += 1
        page_df_list.append(frame)
    return pd.concat(page_df_list, ignore_index=True)
# Collect all groups into a single dataframe
all_groups_df = get_all_groups_as_a_df()
# Show the first row in the dataframe
all_groups_df.head(1)
Getting page 0 28/30 (5 seconds remaining)
Getting page 1 27/30 (3 seconds remaining)
Getting page 2 26/30 (3 seconds remaining)
Getting page 3 25/30 (2 seconds remaining)
Getting page 4 24/30 (1 seconds remaining)
Getting page 5 29/30 (10 seconds remaining)
Getting page 6 28/30 (9 seconds remaining)
Getting page 7 27/30 (8 seconds remaining)
Getting page 8 26/30 (7 seconds remaining)
Getting page 9 25/30 (6 seconds remaining)
Getting page 10 24/30 (5 seconds remaining)
| category | city | country | created | description | group_photo | id | join_mode | lat | link | ... | name | organizer | rating | state | timezone | topics | urlname | utc_offset | visibility | who |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'name': 'tech', 'id': 34, 'shortname': 'tech'} | San Francisco | US | 1034097740000 | <p>The SF PHP Community Meetup is an open foru... | {'highres_link': 'https://secure.meetupstatic.... | 120903 | open | 37.77 | https://www.meetup.com/sf-php/ | ... | SF PHP Community | {'member_id': 126468982, 'name': 'Andre Marigo... | 4.38 | CA | US/Pacific | [{'urlkey': 'php', 'name': 'PHP', 'id': 455}, ... | sf-php | -28800000 | public | PHP Developers |
1 rows × 22 columns
I really only care about a small list of columns, so let's exclude the unneeded ones.
column_list = ['id', 'name', 'members', 'rating', 'join_mode', 'urlname']
all_groups_df = all_groups_df[column_list]
all_groups_df.head(1)
| id | name | members | rating | join_mode | urlname |
---|---|---|---|---|---|---|
0 | 120903 | SF PHP Community | 2702 | 4.38 | open | sf-php |
all_groups_df.shape
(2197, 6)
# Using seaborn's distplot function
plt.rcParams['figure.figsize'] = [11, 6]
sb.distplot(all_groups_df['members'], kde=False, color="g");
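A side note: distplot is deprecated in seaborn 0.11 and later. Assuming a newer seaborn, the equivalent histogram (a substitution, not what this notebook actually ran) would be:
sb.histplot(all_groups_df['members'], color="g");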
Let's take a closer look at some basic stats for our data in tabular form:
pd.options.display.float_format = '{:20,.2f}'.format
all_groups_df[["members"]].describe()
| members |
---|---|
count | 2,197.00 |
mean | 811.79 |
std | 1,741.14 |
min | 1.00 |
25% | 86.00 |
50% | 256.00 |
75% | 780.00 |
max | 36,058.00 |
From the table I can see some hard numbers: the mean group has about 812 members, while the biggest has 36,058.
What an outlier! But are there other mega-groups like this?
plt.rcParams['figure.figsize'] = [6, 20]
all_groups_df['members'].plot.box();
Wow, there are quite a few mega-groups, as indicated by the circles above our top whisker!
Why are the groups so big?
In fact, let's list the ten biggest groups:
biggest_ten_df = all_groups_df.sort_values('members',
                                           ascending=False).head(10)
biggest_ten_df[["name", "members"]]
| name | members |
---|---|---|
19 | Silicon Valley Entrepreneurs & Startups | 36058 |
107 | SFHTML5 | 17718 |
106 | Designers + Geeks | 15467 |
426 | SF Data Science | 14874 |
28 | The SF JavaScript Meetup | 13359 |
250 | Tech in Motion Events: San Francisco | 13090 |
540 | Docker Online Meetup | 12475 |
191 | SF Data Mining | 12378 |
201 | Women Who Code SF | 12334 |
706 | SF Big Analytics | 11889 |
To ask the events API about all ten groups at once, we mostly need to join their ids into a single comma-separated string:
id_list = biggest_ten_df['id'].tolist()
id_list
ids = ','.join(str(x) for x in id_list)
ids
'844726,1619955,1615633,9226282,1060260,3483762,13402242,2065031,2252591,18354966'
The Meetup API filters events by time expressed in epoch milliseconds, so let's build a small converter:
def to_millis(dt):
    """Converts a datetime-like value to epoch milliseconds."""
    return int(pd.to_datetime(dt).value / 1000000)
right_now = to_millis(datetime.datetime.now())
six_months_ago = int(right_now - 180 * 24 * 60 * 60 * 1000)
print(f"Now: {right_now}, six months ago: {six_months_ago}")
Now: 1550937159472, six months ago: 1535385159472
%%capture --no-display
events_resp = client.GetEvents(group_id=ids, status='past',
                               time=f"{six_months_ago},{right_now}")
events_meta, events_df = parse_response(events_resp)
events_df.head(1)
| created | description | duration | event_url | group | headcount | how_to_find_us | id | maybe_rsvp_count | name | ... | rating | rsvp_limit | status | time | updated | utc_offset | venue | visibility | waitlist_count | yes_rsvp_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1534368847000 | <p>Demo Session is free to meetup attendees. U... | 7200000 | https://www.meetup.com/sventrepreneurs/events/... | {'join_mode': 'open', 'created': 1196203591000... | 0 | NaN | 253824506 | 0 | Demo Session @ Mars Blockchain Summit by Mars ... | ... | {'count': 0, 'average': 0} | nan | past | 1535495400000 | 1535527171000 | -25200000 | {'country': 'us', 'localized_country_name': 'U... | public | 0 | 66 |
1 rows × 21 columns
column_list = ['group', 'time', 'duration', 'yes_rsvp_count']
events_df = events_df[column_list]
events_df.head(1)
| group | time | duration | yes_rsvp_count |
---|---|---|---|---|
0 | {'join_mode': 'open', 'created': 1196203591000... | 1535495400000 | 7200000 | 66 |
The group column actually holds a dictionary full of metadata about the group.
I really only need the group["id"] for now, so let's focus on that.
def get_id(my_dict):
    """Extracts the id member of a Python dictionary."""
    return my_dict["id"]
events_df["id"] = events_df["group"].apply(get_id)
# Keep only the columns we need
columns = ['id', 'time', 'duration', 'yes_rsvp_count']
events_df = events_df[columns]
events_df.head(1)
| id | time | duration | yes_rsvp_count |
---|---|---|---|---|
0 | 844726 | 1535495400000 | 7200000 | 66 |
The time is stored in Epoch milliseconds format.
This is great if you want to see time as the number of milliseconds since Jan 1, 1970.
This is not-so-great if you just want to see a human-readable date and time equivalent.
Let's make a new human-readable column called time_dt:
events_df["time_dt"] = pd.to_datetime(
events_df["time"], unit='ms').dt.strftime('%m/%d/%y %H:%M')
columns = ['id', 'time','time_dt', 'duration', 'yes_rsvp_count']
events_df = events_df[columns]
events_df.head(1)
| id | time | time_dt | duration | yes_rsvp_count |
---|---|---|---|---|---|
0 | 844726 | 1535495400000 | 08/28/18 22:30 | 7200000 | 66 |
Let's convert the duration column to a string that shows hours and minutes.
def millis_2_hours_and_minutes(ms):
    """Converts milliseconds to hours and minutes."""
    seconds = ms / 1000
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{int(hours)} hours, {int(minutes)} minutes"
events_df["duration"] = events_df["duration"].apply(
millis_2_hours_and_minutes)
events_df.head(1)
| id | time | time_dt | duration | yes_rsvp_count |
---|---|---|---|---|---|
0 | 844726 | 1535495400000 | 08/28/18 22:30 | 2 hours, 0 minutes | 66 |
If you are familiar with SQL, this is similar to a left join from events_df to biggest_ten_df on id.
Then we sort the output by name ascending and then time descending.
merged_df = pd.merge(
    events_df,
    biggest_ten_df[['id', 'name']],
    on='id',
    how='left')
columns = ['name', 'time', 'time_dt', 'duration', 'yes_rsvp_count', 'id']
final_df = merged_df[columns]
# Sort the output by name and time
final_df = final_df.sort_values(by=['name', 'time'], ascending=[True, False])
final_df
| name | time | time_dt | duration | yes_rsvp_count | id |
---|---|---|---|---|---|---|
101 | Designers + Geeks | 1550804400000 | 02/22/19 03:00 | 2 hours, 0 minutes | 178 | 1615633 |
55 | Designers + Geeks | 1542337200000 | 11/16/18 03:00 | 2 hours, 0 minutes | 71 | 1615633 |
46 | Designers + Geeks | 1541124000000 | 11/02/18 02:00 | 2 hours, 0 minutes | 71 | 1615633 |
32 | Designers + Geeks | 1539914400000 | 10/19/18 02:00 | 2 hours, 0 minutes | 41 | 1615633 |
24 | Designers + Geeks | 1538704800000 | 10/05/18 02:00 | 2 hours, 0 minutes | 29 | 1615633 |
14 | Designers + Geeks | 1537495200000 | 09/21/18 02:00 | 2 hours, 0 minutes | 37 | 1615633 |
4 | Designers + Geeks | 1536285600000 | 09/07/18 02:00 | 2 hours, 0 minutes | 87 | 1615633 |
40 | Docker Online Meetup | 1540915200000 | 10/30/18 16:00 | 1 hours, 0 minutes | 1 | 13402242 |
97 | SF Big Analytics | 1549591200000 | 02/08/19 02:00 | 2 hours, 30 minutes | 418 | 18354966 |
80 | SF Big Analytics | 1547690400000 | 01/17/19 02:00 | 3 hours, 0 minutes | 450 | 18354966 |
78 | SF Big Analytics | 1547604000000 | 01/16/19 02:00 | 3 hours, 0 minutes | 5 | 18354966 |
64 | SF Big Analytics | 1544061600000 | 12/06/18 02:00 | 3 hours, 0 minutes | 7 | 18354966 |
53 | SF Big Analytics | 1542247200000 | 11/15/18 02:00 | 3 hours, 0 minutes | 16 | 18354966 |
51 | SF Big Analytics | 1542160800000 | 11/14/18 02:00 | 2 hours, 30 minutes | 478 | 18354966 |
50 | SF Big Analytics | 1541782800000 | 11/09/18 17:00 | 57 hours, 0 minutes | 16 | 18354966 |
44 | SF Big Analytics | 1541120400000 | 11/02/18 01:00 | 3 hours, 0 minutes | 152 | 18354966 |
29 | SF Big Analytics | 1539909000000 | 10/19/18 00:30 | 2 hours, 30 minutes | 390 | 18354966 |
16 | SF Big Analytics | 1538010000000 | 09/27/18 01:00 | 2 hours, 45 minutes | 245 | 18354966 |
8 | SF Big Analytics | 1536800400000 | 09/13/18 01:00 | 2 hours, 30 minutes | 218 | 18354966 |
10 | SF Data Mining | 1537200000000 | 09/17/18 16:00 | 104 hours, 0 minutes | 6 | 2065031 |
93 | SF Data Science | 1548988200000 | 02/01/19 02:30 | 2 hours, 0 minutes | 94 | 9226282 |
89 | SF Data Science | 1548383400000 | 01/25/19 02:30 | 2 hours, 0 minutes | 48 | 9226282 |
82 | SF Data Science | 1547778600000 | 01/18/19 02:30 | 2 hours, 0 minutes | 26 | 9226282 |
81 | SF Data Science | 1547692200000 | 01/17/19 02:30 | 2 hours, 0 minutes | 59 | 9226282 |
60 | SF Data Science | 1543539600000 | 11/30/18 01:00 | 3 hours, 30 minutes | 18 | 9226282 |
52 | SF Data Science | 1542211200000 | 11/14/18 16:00 | 10 hours, 0 minutes | 1 | 9226282 |
38 | SF Data Science | 1540774800000 | 10/29/18 01:00 | 4 hours, 0 minutes | 21 | 9226282 |
30 | SF Data Science | 1539910800000 | 10/19/18 01:00 | 2 hours, 0 minutes | 95 | 9226282 |
26 | SF Data Science | 1539306000000 | 10/12/18 01:00 | 3 hours, 0 minutes | 15 | 9226282 |
102 | SFHTML5 | 1550883600000 | 02/23/19 01:00 | 4 hours, 0 minutes | 336 | 1619955 |
... | ... | ... | ... | ... | ... | ... |
95 | Women Who Code SF | 1549420200000 | 02/06/19 02:30 | 2 hours, 0 minutes | 47 | 2252591 |
92 | Women Who Code SF | 1548815400000 | 01/30/19 02:30 | 2 hours, 0 minutes | 37 | 2252591 |
86 | Women Who Code SF | 1548210600000 | 01/23/19 02:30 | 2 hours, 0 minutes | 29 | 2252591 |
79 | Women Who Code SF | 1547605800000 | 01/16/19 02:30 | 2 hours, 0 minutes | 32 | 2252591 |
76 | Women Who Code SF | 1547001000000 | 01/09/19 02:30 | 2 hours, 0 minutes | 57 | 2252591 |
71 | Women Who Code SF | 1545271200000 | 12/20/18 02:00 | 2 hours, 0 minutes | 199 | 2252591 |
69 | Women Who Code SF | 1545186600000 | 12/19/18 02:30 | 1 hours, 30 minutes | 50 | 2252591 |
70 | Women Who Code SF | 1545186600000 | 12/19/18 02:30 | 2 hours, 0 minutes | 23 | 2252591 |
67 | Women Who Code SF | 1544581800000 | 12/12/18 02:30 | 2 hours, 0 minutes | 22 | 2252591 |
65 | Women Who Code SF | 1544149800000 | 12/07/18 02:30 | 2 hours, 0 minutes | 17 | 2252591 |
63 | Women Who Code SF | 1543977000000 | 12/05/18 02:30 | 2 hours, 0 minutes | 29 | 2252591 |
62 | Women Who Code SF | 1543975200000 | 12/05/18 02:00 | 3 hours, 0 minutes | 1 | 2252591 |
58 | Women Who Code SF | 1542767400000 | 11/21/18 02:30 | 2 hours, 0 minutes | 31 | 2252591 |
54 | Women Who Code SF | 1542335400000 | 11/16/18 02:30 | 2 hours, 0 minutes | 50 | 2252591 |
49 | Women Who Code SF | 1541557800000 | 11/07/18 02:30 | 2 hours, 0 minutes | 27 | 2252591 |
45 | Women Who Code SF | 1541122200000 | 11/02/18 01:30 | 2 hours, 0 minutes | 50 | 2252591 |
43 | Women Who Code SF | 1540949400000 | 10/31/18 01:30 | 2 hours, 0 minutes | 29 | 2252591 |
39 | Women Who Code SF | 1540859400000 | 10/30/18 00:30 | 2 hours, 30 minutes | 64 | 2252591 |
34 | Women Who Code SF | 1540400400000 | 10/24/18 17:00 | 1 hours, 0 minutes | 2 | 2252591 |
33 | Women Who Code SF | 1540344600000 | 10/24/18 01:30 | 2 hours, 0 minutes | 36 | 2252591 |
31 | Women Who Code SF | 1539912600000 | 10/19/18 01:30 | 2 hours, 0 minutes | 50 | 2252591 |
25 | Women Who Code SF | 1539135000000 | 10/10/18 01:30 | 2 hours, 0 minutes | 25 | 2252591 |
23 | Women Who Code SF | 1538703000000 | 10/05/18 01:30 | 2 hours, 0 minutes | 50 | 2252591 |
21 | Women Who Code SF | 1538530200000 | 10/03/18 01:30 | 2 hours, 0 minutes | 10 | 2252591 |
22 | Women Who Code SF | 1538530200000 | 10/03/18 01:30 | 2 hours, 0 minutes | 27 | 2252591 |
13 | Women Who Code SF | 1537493400000 | 09/21/18 01:30 | 2 hours, 0 minutes | 33 | 2252591 |
11 | Women Who Code SF | 1537320600000 | 09/19/18 01:30 | 2 hours, 0 minutes | 14 | 2252591 |
9 | Women Who Code SF | 1537030800000 | 09/15/18 17:00 | 3 hours, 0 minutes | 10 | 2252591 |
7 | Women Who Code SF | 1536715800000 | 09/12/18 01:30 | 2 hours, 0 minutes | 25 | 2252591 |
1 | Women Who Code SF | 1535506200000 | 08/29/18 01:30 | 2 hours, 0 minutes | 35 | 2252591 |
103 rows × 6 columns
Those long group ids are hard to read; let's convert them to something easier.
# Get an enumeration of unique ids and use it to map
# each long group id to a small sequential id
unique_ids = final_df['id'].unique()
ids_2_new_ids = {k: v for v, k in enumerate(unique_ids)}
ids_2_new_ids
def get_new_id(old_id):
    """Looks up the short sequential id for a long group id."""
    return ids_2_new_ids[old_id]
# re-write those ids as something simpler
final_df['id'] = final_df['id'].apply(get_new_id)
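Incidentally, pandas can do this whole mapping in one step with pd.factorize; this one-liner (an equivalent alternative, not what I used above) produces the same sequential ids:
final_df['id'] = pd.factorize(final_df['id'])[0]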
final_df.head(30)
| name | time | time_dt | duration | yes_rsvp_count | id |
---|---|---|---|---|---|---|
101 | Designers + Geeks | 1550804400000 | 02/22/19 03:00 | 2 hours, 0 minutes | 178 | 0 |
55 | Designers + Geeks | 1542337200000 | 11/16/18 03:00 | 2 hours, 0 minutes | 71 | 0 |
46 | Designers + Geeks | 1541124000000 | 11/02/18 02:00 | 2 hours, 0 minutes | 71 | 0 |
32 | Designers + Geeks | 1539914400000 | 10/19/18 02:00 | 2 hours, 0 minutes | 41 | 0 |
24 | Designers + Geeks | 1538704800000 | 10/05/18 02:00 | 2 hours, 0 minutes | 29 | 0 |
14 | Designers + Geeks | 1537495200000 | 09/21/18 02:00 | 2 hours, 0 minutes | 37 | 0 |
4 | Designers + Geeks | 1536285600000 | 09/07/18 02:00 | 2 hours, 0 minutes | 87 | 0 |
40 | Docker Online Meetup | 1540915200000 | 10/30/18 16:00 | 1 hours, 0 minutes | 1 | 1 |
97 | SF Big Analytics | 1549591200000 | 02/08/19 02:00 | 2 hours, 30 minutes | 418 | 2 |
80 | SF Big Analytics | 1547690400000 | 01/17/19 02:00 | 3 hours, 0 minutes | 450 | 2 |
78 | SF Big Analytics | 1547604000000 | 01/16/19 02:00 | 3 hours, 0 minutes | 5 | 2 |
64 | SF Big Analytics | 1544061600000 | 12/06/18 02:00 | 3 hours, 0 minutes | 7 | 2 |
53 | SF Big Analytics | 1542247200000 | 11/15/18 02:00 | 3 hours, 0 minutes | 16 | 2 |
51 | SF Big Analytics | 1542160800000 | 11/14/18 02:00 | 2 hours, 30 minutes | 478 | 2 |
50 | SF Big Analytics | 1541782800000 | 11/09/18 17:00 | 57 hours, 0 minutes | 16 | 2 |
44 | SF Big Analytics | 1541120400000 | 11/02/18 01:00 | 3 hours, 0 minutes | 152 | 2 |
29 | SF Big Analytics | 1539909000000 | 10/19/18 00:30 | 2 hours, 30 minutes | 390 | 2 |
16 | SF Big Analytics | 1538010000000 | 09/27/18 01:00 | 2 hours, 45 minutes | 245 | 2 |
8 | SF Big Analytics | 1536800400000 | 09/13/18 01:00 | 2 hours, 30 minutes | 218 | 2 |
10 | SF Data Mining | 1537200000000 | 09/17/18 16:00 | 104 hours, 0 minutes | 6 | 3 |
93 | SF Data Science | 1548988200000 | 02/01/19 02:30 | 2 hours, 0 minutes | 94 | 4 |
89 | SF Data Science | 1548383400000 | 01/25/19 02:30 | 2 hours, 0 minutes | 48 | 4 |
82 | SF Data Science | 1547778600000 | 01/18/19 02:30 | 2 hours, 0 minutes | 26 | 4 |
81 | SF Data Science | 1547692200000 | 01/17/19 02:30 | 2 hours, 0 minutes | 59 | 4 |
60 | SF Data Science | 1543539600000 | 11/30/18 01:00 | 3 hours, 30 minutes | 18 | 4 |
52 | SF Data Science | 1542211200000 | 11/14/18 16:00 | 10 hours, 0 minutes | 1 | 4 |
38 | SF Data Science | 1540774800000 | 10/29/18 01:00 | 4 hours, 0 minutes | 21 | 4 |
30 | SF Data Science | 1539910800000 | 10/19/18 01:00 | 2 hours, 0 minutes | 95 | 4 |
26 | SF Data Science | 1539306000000 | 10/12/18 01:00 | 3 hours, 0 minutes | 15 | 4 |
102 | SFHTML5 | 1550883600000 | 02/23/19 01:00 | 4 hours, 0 minutes | 336 | 5 |
This is nice, but which groups have regular meetings, and which ones have growing interest (increasing yes-RSVPs)? How do any of these groups compare to each other?
Let's just use linear regression to draw trend lines for each mega-group.
We'll use seaborn's lmplot to visualize all ten mega-groups.
We need to use the epoch-milliseconds time column of our dataframe, since it is numeric and can be used to generate trend lines.
ax = sb.lmplot("time", "yes_rsvp_count", data=final_df, hue="name",
               height=9, aspect=0.75, order=1, ci=None)
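To put a number on "growing interest", one could also fit each group's slope explicitly. Here is a sketch using numpy.polyfit; the MS_PER_DAY constant and rsvp_slope helper are mine, not part of the original analysis:
MS_PER_DAY = 24 * 60 * 60 * 1000

def rsvp_slope(group):
    """Least-squares slope of yes_rsvp_count over time, in RSVPs per day."""
    days = group['time'] / MS_PER_DAY
    slope, _intercept = np.polyfit(days, group['yes_rsvp_count'], 1)
    return slope

# Groups with a single event will trigger a RankWarning; their "slope"
# is meaningless anyway.
final_df.groupby('name').apply(rsvp_slope).sort_values(ascending=False)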
From the top 10 mega-groups I can see that several groups have major problems. For instance, the Docker Online Meetup group has almost no yes_rsvp_counts.
Let's take a closer look.
interesting_group_df = final_df[final_df['name'] == "Docker Online Meetup"]
interesting_group_df
| name | time | time_dt | duration | yes_rsvp_count | id |
---|---|---|---|---|---|---|
40 | Docker Online Meetup | 1540915200000 | 10/30/18 16:00 | 1 hours, 0 minutes | 1 | 1 |
rsvp_df = final_df.groupby(['name'], as_index=False).mean()
rsvp_df = rsvp_df.sort_values(by=['yes_rsvp_count', 'name'],
                              ascending=[True, True])
rsvp_df
| name | time | yes_rsvp_count | id |
---|---|---|---|---|
1 | Docker Online Meetup | 1,540,915,200,000.00 | 1.00 | 1.00 |
3 | SF Data Mining | 1,537,200,000,000.00 | 6.00 | 3.00 |
9 | Women Who Code SF | 1,542,342,600,000.00 | 37.03 | 9.00 |
4 | SF Data Science | 1,544,287,200,000.00 | 41.89 | 4.00 |
0 | Designers + Geeks | 1,540,952,228,571.43 | 73.43 | 0.00 |
8 | The SF JavaScript Meetup | 1,544,289,840,000.00 | 83.80 | 8.00 |
7 | Tech in Motion Events: San Francisco | 1,544,405,400,000.00 | 113.50 | 7.00 |
6 | Silicon Valley Entrepreneurs & Startups | 1,542,303,840,000.00 | 130.30 | 6.00 |
2 | SF Big Analytics | 1,542,816,163,636.36 | 217.73 | 2.00 |
5 | SFHTML5 | 1,546,360,500,000.00 | 218.33 | 5.00 |
Docker Online Meetup and SF Data Mining barely registered any activity in this window; let's exclude those now.
final_df = final_df[~final_df['id'].isin([1, 3])]
Next, let's count how many events each remaining group held:
event_count_df = final_df.groupby(['id'], as_index=False).count()
event_count_df = event_count_df.sort_values(by=['name', 'id'],
                                            ascending=[True, True])
event_count_df[["id", "time"]]
| id | time |
---|---|---|
5 | 7 | 2 |
6 | 8 | 5 |
3 | 5 | 6 |
0 | 0 | 7 |
2 | 4 | 9 |
1 | 2 | 11 |
4 | 6 | 30 |
7 | 9 | 31 |
Groups 7, 8, 5, and 0 met fewer than eight times over the whole window, which hardly counts as regular. Let's remove those as well.
final_df = final_df[~final_df['id'].isin([7, 8, 5, 0])]
This time we're going to do two more things:
%%capture --no-display
sb.lmplot("time", "yes_rsvp_count", data=final_df, hue="name",
          height=9, aspect=0.75, order=1,
          ci=68,        # 1. Confidence interval of one standard deviation
          robust=True)  # 2. Estimate a robust regression, to de-weight outliers
<seaborn.axisgrid.FacetGrid at 0x7fac2adc0f60>
At last! We found the four best Tech groups in San Francisco, CA that meet regularly and keep drawing a large, enthusiastic crowd: SF Big Analytics, SF Data Science, Silicon Valley Entrepreneurs & Startups, and Women Who Code SF.
We achieved our objective and demonstrated several useful techniques along the way: paging through a REST API, reshaping responses into dataframes, cleaning and merging tables, and visualizing trends with seaborn.