We will analyze startup investments from Crunchbase.com.
Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.
In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.
The data set of investments we'll be exploring is current as of October 2013. You can download it from GitHub.
Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv
consumes 10.3 megabytes of disk space, we know from earlier missions that pandas often requires 4 to 6 times amount of space in memory as the file does on disk (especially when there's many string columns).
import sqlite3
import pandas as pd
import pprint as pp
import matplotlib.pyplot as plt
pd.options.display.max_columns = 99
!ls
crunchbase.db crunchbase-investments.csv project03_analyzing_startup_fundraising_deals_from_crunchbase.ipynb
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1')
memory_footprints = []
i = 0
for chunk in chunks_it:
i += 1
chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576
memory_footprints.append(chunk_memory_mb)
print("Chunk {0} memory: {1:0.2f} mb".format(i, chunk_memory_mb))
plt.hist(memory_footprints)
plt.show()
Chunk 1 memory: 5.58 mb Chunk 2 memory: 5.53 mb Chunk 3 memory: 5.54 mb Chunk 4 memory: 5.53 mb Chunk 5 memory: 5.52 mb Chunk 6 memory: 5.55 mb Chunk 7 memory: 5.53 mb Chunk 8 memory: 5.51 mb Chunk 9 memory: 5.40 mb Chunk 10 memory: 4.64 mb Chunk 11 memory: 2.66 mb
Observations:
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1')
missing_vc_list = []
i = 0
for chunk in chunks_it:
i += 1
#print("Chunk {0}:".format(i))
result = chunk.isnull().sum()
#print(type(result))
missing_vc_list.append(result)
combined_missing_vc = pd.concat(missing_vc_list)
unique_combined_missing_vc = combined_missing_vc.groupby(combined_missing_vc.index).sum()
unique_combined_missing_vc.sort_values()
company_country_code 1 company_name 1 company_permalink 1 company_region 1 investor_region 2 investor_permalink 2 investor_name 2 funded_quarter 3 funded_at 3 funded_month 3 funded_year 3 funding_round_type 3 company_state_code 492 company_city 533 company_category_code 643 raised_amount_usd 3599 investor_country_code 12001 investor_city 12480 investor_state_code 16809 investor_category_code 50427 dtype: int64
Observations:
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1')
series_memory_fp = pd.Series()
counter = 0
for chunk in chunks_it:
if counter == 0:
series_memory_fp = chunk.memory_usage(deep=True)
else:
series_memory_fp += chunk.memory_usage(deep=True)
counter += 1
series_memory_fp = series_memory_fp.drop('Index')
series_memory_fp
company_permalink 4057788 company_name 3591326 company_category_code 3421104 company_country_code 3172176 company_state_code 3106051 company_region 3411585 company_city 3505926 investor_permalink 4980548 investor_name 3915666 investor_category_code 622424 investor_country_code 2647292 investor_state_code 2476607 investor_region 3396281 investor_city 2885083 funding_round_type 3410707 funded_at 3542185 funded_month 3383584 funded_quarter 3383584 funded_year 422960 raised_amount_usd 422960 dtype: int64
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1')
memory_footprints = []
i = 0
for chunk in chunks_it:
i += 1
chunk_memory_mb = chunk.memory_usage(deep=True).sum() / 1048576
memory_footprints.append(chunk_memory_mb)
print("Chunk {0} memory: {1:0.2f} mb".format(i, chunk_memory_mb))
print("Total memory: {0:0.2f} mb".format(sum(memory_footprints)))
Chunk 1 memory: 5.58 mb Chunk 2 memory: 5.53 mb Chunk 3 memory: 5.54 mb Chunk 4 memory: 5.53 mb Chunk 5 memory: 5.52 mb Chunk 6 memory: 5.55 mb Chunk 7 memory: 5.53 mb Chunk 8 memory: 5.51 mb Chunk 9 memory: 5.40 mb Chunk 10 memory: 4.64 mb Chunk 11 memory: 2.66 mb Total memory: 56.99 mb
series_memory_fp.sum() / (1048576)
56.9876070022583
Let's drop columns that aren't very helpful (URLs) or columns that have too many missing columns (> 90%
)
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1')
total_num_rows = 0
for chunk in chunks_it:
total_num_rows += chunk.shape[0]
print("Total Rows: {0}".format(total_num_rows))
Total Rows: 52870
# get percentage of missing values per column
unique_combined_missing_vc.sort_values() / total_num_rows * 100
company_country_code 0.001891 company_name 0.001891 company_permalink 0.001891 company_region 0.001891 investor_region 0.003783 investor_permalink 0.003783 investor_name 0.003783 funded_quarter 0.005674 funded_at 0.005674 funded_month 0.005674 funded_year 0.005674 funding_round_type 0.005674 company_state_code 0.930584 company_city 1.008133 company_category_code 1.216191 raised_amount_usd 6.807263 investor_country_code 22.699073 investor_city 23.605069 investor_state_code 31.793077 investor_category_code 95.379232 dtype: float64
# Drop columns representing URLs or containing too many missing values ( > 90%)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
print('Columns to Keep:')
print(keep_cols.tolist())
Columns to Keep: ['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']
Let's get familiar with the column types before adding the data into SQLite.
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
col_types = {}
i = 0
for chunk in chunks_it:
i += 1
#print("Chunk {0}:".format(i))
#print(chunk.dtypes)
for col in chunk.columns:
if col not in col_types:
# use a set to keep only unique values!
col_types[col] = set([str(chunk.dtypes[col])])
else:
col_types[col].add(str(chunk.dtypes[col]))
#print('\t', col, '|', col_types[col])
#break
pp.pprint(col_types)
{'company_category_code': {'object'}, 'company_city': {'object'}, 'company_country_code': {'object'}, 'company_name': {'object'}, 'company_region': {'object'}, 'company_state_code': {'object'}, 'funded_at': {'object'}, 'funded_month': {'object'}, 'funded_quarter': {'object'}, 'funded_year': {'int64', 'float64'}, 'funding_round_type': {'object'}, 'investor_city': {'object', 'float64'}, 'investor_country_code': {'object', 'float64'}, 'investor_name': {'object'}, 'investor_region': {'object'}, 'investor_state_code': {'object', 'float64'}, 'raised_amount_usd': {'float64'}}
Observations:
funded_year
, investor_city
, investor_country_code
, investor_state_code
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
count = 0
for chunk in chunks_it:
count += 1
print(f'\nChunk {count:<3}:', end=' ')
float_cols = chunk.select_dtypes(include=['float', 'int']).columns
print(float_cols)
chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'],
downcast='float')
Chunk 1 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 2 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 3 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 4 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 5 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 6 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 7 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 8 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 9 : Index(['funded_year', 'raised_amount_usd'], dtype='object') Chunk 10 : Index(['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd'], dtype='object') Chunk 11 : Index(['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd'], dtype='object')
Observations:
raised_amount_usd
is the only consistent float column across all the chunks. We could use a more space efficient type for it.Let's dig a little deeper and look at the unique values within the numeric columns.
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
count = 0
str_cols_vc = {}
for chunk in chunks_it:
count += 1
print(f'\nCHUNK {count}:')
num_cols = chunk.select_dtypes(include=['float', 'int'])
for col in num_cols:
print(f'\n{col}')
current_col_vc = num_cols[col].value_counts()
#continue
if col in str_cols_vc:
str_cols_vc[col].append(current_col_vc)
else:
str_cols_vc[col] = [current_col_vc]
unique_values = chunk[col].unique()
print(f'type: {chunk[col].dtype} | {len(unique_values)} values: {unique_values[:5]}')
#break
## Combine the value count results
combined_vcs = {}
for col, vc_list in str_cols_vc.items():
combined_vc = pd.concat(vc_list)
final_vc = combined_vc.groupby(combined_vc.index).sum()
combined_vcs[col] = final_vc
#pp.pprint(combined_vcs)
CHUNK 1: funded_year type: int64 | 18 values: [2012 2011 2010 2009 2007] raised_amount_usd type: float64 | 625 values: [2000000. 20000. 70000. 75000. 100000.] CHUNK 2: funded_year type: int64 | 17 values: [2008 2009 2007 2010 2006] raised_amount_usd type: float64 | 613 values: [25000000. 20000000. 26000000. 34500000. 68000000.] CHUNK 3: funded_year type: int64 | 17 values: [2009 2007 2011 2012 2010] raised_amount_usd type: float64 | 634 values: [37400000. 6000000. 22300000. 8387128. 15700000.] CHUNK 4: funded_year type: int64 | 15 values: [2011 2012 2013 2010 2005] raised_amount_usd type: float64 | 588 values: [ 3000000. 5000000. 85000000. 15000000. 5300000.] CHUNK 5: funded_year type: int64 | 19 values: [2010 2007 2008 2011 2012] raised_amount_usd type: float64 | 627 values: [15000000. 14900000. 935000. 365000. 1300000.] CHUNK 6: funded_year type: int64 | 15 values: [2009 2010 2006 2005 2011] raised_amount_usd type: float64 | 612 values: [29000000. 2000000. 3200000. 3000000. 12000000.] CHUNK 7: funded_year type: float64 | 18 values: [2003. 2012. 2013. 2007. 2009.] raised_amount_usd type: float64 | 606 values: [ 500000. 3000000. 20000000. 1400000. 10000000.] CHUNK 8: funded_year type: int64 | 18 values: [2006 2010 2005 2011 2012] raised_amount_usd type: float64 | 615 values: [13000000. 12000000. 19300000. 12500000. 23500000.] CHUNK 9: funded_year type: int64 | 16 values: [2007 2010 2011 2012 2013] raised_amount_usd type: float64 | 617 values: [ 3500000. 9000000. 1400000. 10000000. 800000.] CHUNK 10: investor_country_code type: float64 | 1 values: [nan] investor_state_code type: float64 | 1 values: [nan] investor_city type: float64 | 1 values: [nan] funded_year type: int64 | 16 values: [2011 2013 2005 2006 2010] raised_amount_usd type: float64 | 405 values: [1000000. 1600000. 300000. 1100000. 250000.] CHUNK 11: investor_country_code type: float64 | 1 values: [nan] investor_state_code type: float64 | 1 values: [nan] investor_city type: float64 | 1 values: [nan] funded_year type: int64 | 15 values: [2012 2007 2008 2010 2006] raised_amount_usd type: float64 | 349 values: [ 3060000. 12000000. 500000. 750000. nan]
Observations:
funded_year
column can be converted from float to integer as it shows the funded yearfunded_year
and raised_amount_usd
. The remaining columns are showing up because they have Nan values in the later chunks.raised_amount_usd
column can be downcast as a float to a more space efficient type.funded_at
column has the necessary year, month, day info - so the funded_year
is no longer required!chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
count = 0
#total_num_rows = 0
str_cols_vc = {}
for chunk in chunks_it:
count += 1
#total_num_rows += chunk.shape[0]
print(f'\nCHUNK {count}:')
#print(chunk.dtypes.value_counts())
obj_cols = chunk.select_dtypes(include=['object'])
#print(obj_cols.value_counts())
#print(type(obj_cols))
for col in obj_cols:
current_col_vc = obj_cols[col].value_counts()
if col in str_cols_vc:
str_cols_vc[col].append(current_col_vc)
else:
str_cols_vc[col] = [current_col_vc]
unique_values = chunk[col].unique()
print(f'col: {col} | type: {chunk[col].dtype} |'\
f' {len(unique_values)} values: {unique_values[:5]}')
#print(f'Total Num Rows: {total_num_rows}')
## Combine the value count results
combined_vcs = {}
for col, vc_list in str_cols_vc.items():
combined_vc = pd.concat(vc_list)
final_vc = combined_vc.groupby(combined_vc.index).sum()
combined_vcs[col] = final_vc
#pp.pprint(combined_vcs)
CHUNK 1: col: company_name | type: object | 3373 values: ['AdverCar' 'LaunchGram' 'uTaP' 'ZoopShop' 'eFuneral'] col: company_category_code | type: object | 43 values: ['advertising' 'news' 'messaging' 'software' 'web'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 48 values: ['CA' nan 'OH' 'NY' 'FL'] col: company_region | type: object | 215 values: ['SF Bay' 'United States - Other' 'Columbus' 'Cleveland' 'New York'] col: company_city | type: object | 543 values: ['San Francisco' 'Mountain View' nan 'columbus' 'Cleveland'] col: investor_name | type: object | 1540 values: ['1-800-FLOWERS.COM' '10Xelerator' '2010 NYU Stern Business Plan Competition' '22Hundred Group' '3taps'] col: investor_country_code | type: object | 46 values: ['USA' nan 'BMU' 'CHE' 'CAN'] col: investor_state_code | type: object | 44 values: ['NY' 'OH' nan 'CA' 'MA'] col: investor_region | type: object | 236 values: ['New York' 'Columbus' 'unknown' 'SF Bay' 'Boston'] col: investor_city | type: object | 387 values: ['New York' 'Columbus' nan 'San Francisco' 'Palo Alto'] col: funding_round_type | type: object | 9 values: ['series-a' 'other' 'angel' 'venture' 'series-b'] col: funded_at | type: object | 1677 values: ['2012-10-30' '2012-01-23' '2012-01-01' '2012-02-15' '2011-09-08'] col: funded_month | type: object | 159 values: ['2012-10' '2012-01' '2012-02' '2011-09' '2012-06'] col: funded_quarter | type: object | 63 values: ['2012-Q4' '2012-Q1' '2011-Q3' '2012-Q2' '2012-Q3'] CHUNK 2: col: company_name | type: object | 2946 values: ['Visible World' 'obopay' 'Jumptap' 'Avid Radiopharmaceuticals' 'Pacific Biosciences'] col: company_category_code | type: object | 41 values: ['advertising' 'mobile' 'biotech' 'semiconductor' 'enterprise'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 49 values: ['NY' 'CA' 'MA' 'PA' 'WA'] col: company_region | type: object | 229 values: ['New York' 'SF Bay' 'Boston' 'Philadelphia' 'Seattle'] col: company_city | type: object | 559 values: ['New York' 'Redwood City' 'Boston' 'Philadelphia' 'Menlo Park'] col: investor_name | type: object | 623 values: ['AllianceBernstein' 'Allianz Private Equity Partners' 'Allianz ROSNO' 'Allied Capital' 'Allied Minds'] col: investor_country_code | type: object | 40 values: ['USA' nan 'SWE' 'FRA' 'CHE'] col: investor_state_code | type: object | 39 values: ['NY' nan 'MI' 'MA' 'IN'] col: investor_region | type: object | 146 values: ['New York' 'unknown' 'Kalamazoo' 'Boston' 'Indianapolis'] col: investor_city | type: object | 236 values: ['New York' nan 'Kalamazoo' 'Boston' 'Carmel'] col: funding_round_type | type: object | 8 values: ['series-c+' 'venture' 'series-b' 'series-a' 'angel'] col: funded_at | type: object | 1835 values: ['2008-04-14' '2008-04-21' '2008-08-26' '2009-05-21' '2007-05-09'] col: funded_month | type: object | 163 values: ['2008-04' '2008-08' '2009-05' '2007-05' '2009-10'] col: funded_quarter | type: object | 61 values: ['2008-Q2' '2008-Q3' '2009-Q2' '2007-Q2' '2009-Q4'] CHUNK 3: col: company_name | type: object | 2922 values: ['Zag' 'FRX Polymers' 'Targeted Growth' 'Mendel Biotechnology' 'Yammer'] col: company_category_code | type: object | 41 values: ['search' 'cleantech' 'biotech' 'enterprise' 'finance'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 43 values: ['CA' 'MA' 'WA' 'NJ' 'OR'] col: company_region | type: object | 208 values: ['Los Angeles' 'Boston' 'Seattle' 'SF Bay' 'unknown'] col: company_city | type: object | 517 values: ['Santa Monica' 'Chelmsford' 'Seattle' 'Hayward' 'San Francisco'] col: investor_name | type: object | 653 values: ['Capricorn Management' 'Capricorn Venture Partners' 'Capsalus' 'Capstone Partners' 'Capstone Ventures'] col: investor_country_code | type: object | 32 values: ['USA' 'BEL' nan 'GBR' 'ISR'] col: investor_state_code | type: object | 36 values: ['CT' nan 'GA' 'MA' 'OR'] col: investor_region | type: object | 137 values: ['New York' 'Leuven' 'Atlanta' 'Boston' 'unknown'] col: investor_city | type: object | 227 values: ['Greenwich' 'Leuven' 'Atlanta' 'Boston' nan] col: funding_round_type | type: object | 9 values: ['series-c+' 'series-a' 'venture' 'private-equity' 'series-b'] col: funded_at | type: object | 1852 values: ['2009-02-05' '2009-07-27' '2007-02-07' '2007-06-13' '2011-06-06'] col: funded_month | type: object | 159 values: ['2009-02' '2009-07' '2007-02' '2007-06' '2011-06'] col: funded_quarter | type: object | 61 values: ['2009-Q1' '2009-Q3' '2007-Q1' '2007-Q2' '2011-Q2'] CHUNK 4: col: company_name | type: object | 2847 values: ['Zettaset' 'BrightTag' 'Joyent' 'Exagen Diagnostics' 'Aspen Avionics'] col: company_category_code | type: object | 42 values: ['enterprise' 'biotech' 'transportation' 'web' 'analytics'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 47 values: ['CA' 'IL' 'NM' 'OR' 'UT'] col: company_region | type: object | 182 values: ['SF Bay' 'Chicago' 'Albuquerque' 'Portland' 'Santa Barbara'] col: company_city | type: object | 458 values: ['Mountain View' 'Chicago' 'San Francisco' 'Albuquerque' 'Portland'] col: investor_name | type: object | 501 values: ['EPIC Ventures' 'Epiphany Ventures' 'Equis Capital Partners' 'Equitas Capital Advisors' 'Equitek Capital'] col: investor_country_code | type: object | 30 values: ['USA' 'IND' 'GBR' nan 'FIN'] col: investor_state_code | type: object | 37 values: ['UT' nan 'NJ' 'LA' 'OH'] col: investor_region | type: object | 116 values: ['Salt Lake City' 'Mumbai' 'Hoboken' 'New Orleans' 'Cleveland'] col: investor_city | type: object | 175 values: ['Salt Lake City' 'Mumbai' 'Hoboken' 'New Orleans' 'Beachwood'] col: funding_round_type | type: object | 9 values: ['venture' 'series-c+' 'series-b' 'angel' 'series-a'] col: funded_at | type: object | 1786 values: ['2011-07-12' '2011-08-15' '2012-01-23' '2012-04-25' '2012-09-10'] col: funded_month | type: object | 146 values: ['2011-07' '2011-08' '2012-01' '2012-04' '2012-09'] col: funded_quarter | type: object | 59 values: ['2011-Q3' '2012-Q1' '2012-Q2' '2012-Q3' '2012-Q4'] CHUNK 5: col: company_name | type: object | 2941 values: ['ChargePoint, Inc.' 'Oncothyreon' 'Grouply' 'Blue Buzz Network' 'Glue Networks'] col: company_category_code | type: object | 42 values: ['cleantech' 'biotech' 'network_hosting' 'news' 'enterprise'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 46 values: ['CA' 'WA' 'NC' 'OH' 'NY'] col: company_region | type: object | 204 values: ['SF Bay' 'Seattle' 'Sacramento' 'Los Angeles' 'Raleigh-Durham'] col: company_city | type: object | 511 values: ['Campbell' 'Seattle' 'Redwood City' 'San Francisco' 'Sacramento'] col: investor_name | type: object | 592 values: ['Hartford Ventures' 'Hartz Capital' 'Harvard Angel Group' 'Hasso Plattner Ventures' 'Hatch Partners, LLC'] col: investor_country_code | type: object | 41 values: ['USA' nan 'DEU' 'ZAF' 'CHE'] col: investor_state_code | type: object | 36 values: ['CT' nan 'WA' 'NC' 'CA'] col: investor_region | type: object | 142 values: ['Hartford' 'unknown' 'TBD' 'Potsdam' 'Seattle'] col: investor_city | type: object | 207 values: ['Hartford' nan 'Potsdam' 'Redmond' 'Raleigh'] col: funding_round_type | type: object | 8 values: ['series-c+' 'venture' 'angel' 'series-a' 'series-b'] col: funded_at | type: object | 1874 values: ['2010-09-08' '2010-09-24' '2007-06-01' '2008-01-01' '2008-01-14'] col: funded_month | type: object | 165 values: ['2010-09' '2007-06' '2008-01' '2010-07' '2011-03'] col: funded_quarter | type: object | 64 values: ['2010-Q3' '2007-Q2' '2008-Q1' '2011-Q1' '2012-Q3'] CHUNK 6: col: company_name | type: object | 2909 values: ['Smith & Tinker' 'Sway' 'Shoutlet' 'Firefly Mobile' 'Fastmobile'] col: company_category_code | type: object | 42 values: ['games_video' 'advertising' 'enterprise' 'mobile' 'software'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 47 values: ['WA' 'WI' 'FL' 'IL' 'MA'] col: company_region | type: object | 202 values: ['Seattle' 'Middleton' 'Madison' 'Miami' 'Chicago'] col: company_city | type: object | 502 values: ['Bellevue' 'Middleton' 'Madison' 'Miami Beach' 'Rolling Meadows'] col: investor_name | type: object | 566 values: ['Leo Capital Holdings' 'LeoGroup' 'Leopard Capital' 'Lerer Ventures' 'lessin X2, llc'] col: investor_country_code | type: object | 34 values: ['USA' nan 'KHM' 'RUS' 'KOR'] col: investor_state_code | type: object | 38 values: ['IL' nan 'NY' 'CA' 'MN'] col: investor_region | type: object | 140 values: ['Chicago' 'unknown' 'Phnom Penh' 'New York' 'Moscow'] col: investor_city | type: object | 207 values: ['Northbrook' nan 'Phnom Penh' 'New York' 'Moscow'] col: funding_round_type | type: object | 8 values: ['venture' 'series-a' 'series-b' 'series-c+' 'angel'] col: funded_at | type: object | 1822 values: ['2009-08-25' '2010-01-19' '2006-10-19' '2005-05-16' '2005-11-10'] col: funded_month | type: object | 149 values: ['2009-08' '2010-01' '2006-10' '2005-05' '2005-11'] col: funded_quarter | type: object | 60 values: ['2009-Q3' '2010-Q1' '2006-Q4' '2005-Q2' '2005-Q4'] CHUNK 7: col: company_name | type: object | 2987 values: ['Tricycle' 'Adapx' 'Action Engine' 'Vigilos' 'SprayCool'] col: company_category_code | type: object | 42 values: [nan 'software' 'mobile' 'hardware' 'public_relations'] col: company_country_code | type: object | 3 values: ['USA' nan '2008-02'] col: company_state_code | type: object | 49 values: ['TN' 'WA' 'OR' 'NV' 'CA'] col: company_region | type: object | 208 values: ['Chattanooga' 'Seattle' 'Liberty Lake' 'Portland' 'Spokane Valley'] col: company_city | type: object | 530 values: ['Chattanooga' 'Seattle' 'Bellevue' 'Liberty Lake' 'Beaverton'] col: investor_name | type: object | 688 values: ['Northwest Georgia Bank' 'Northwest Technology Ventures' 'Northwest Venture Associates' 'Northwestern Capital' 'Northzone'] col: investor_country_code | type: object | 36 values: ['USA' nan 'SWE' 'GBR' 'CAN'] col: investor_state_code | type: object | 37 values: ['GA' 'OR' 'WA' nan 'MN'] col: investor_region | type: object | 156 values: ['Ringgold' 'Portland' 'Spokane' 'unknown' 'Stockholm'] col: investor_city | type: object | 228 values: ['Ringgold' 'Beaverton' 'Spokane' nan 'Stockholm'] col: funding_round_type | type: object | 9 values: ['other' 'venture' 'series-c+' 'series-b' 'angel'] col: funded_at | type: object | 1882 values: ['2003-01-01' '2012-07-30' '2013-04-11' '2007-07-01' '2009-04-28'] col: funded_month | type: object | 159 values: ['2003-01' '2012-07' '2013-04' '2007-07' '2009-04'] col: funded_quarter | type: object | 62 values: ['2003-Q1' '2012-Q3' '2013-Q2' '2007-Q3' '2009-Q2'] CHUNK 8: col: company_name | type: object | 3077 values: ['Ping Identity Corporation' 'MuleSoft' 'AmberPoint' 'Questra' 'Reva Systems'] col: company_category_code | type: object | 41 values: ['security' 'enterprise' 'web' 'software' 'hardware'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 48 values: ['CO' 'CA' 'MA' 'WA' 'TX'] col: company_region | type: object | 208 values: ['Denver' 'SF Bay' 'Boston' 'Seattle' 'Austin'] col: company_city | type: object | 497 values: ['Denver' 'San Francisco' 'Oakland' 'Redwood City' 'Westford'] col: investor_name | type: object | 607 values: ['SAP Ventures' 'Saratoga Ventures' 'Sark Fund' 'Sarofim Fayez and Co.' 'SAS Investors'] col: investor_country_code | type: object | 32 values: ['USA' nan 'SAU' 'IND' 'CYM'] col: investor_state_code | type: object | 34 values: ['CA' nan 'TX' 'MA' 'NH'] col: investor_region | type: object | 151 values: ['SF Bay' 'unknown' 'Houston' 'TBD' 'Dallas'] col: investor_city | type: object | 212 values: ['Palo Alto' 'Santa Clara' nan 'Houston' 'Dallas'] col: funding_round_type | type: object | 8 values: ['series-c+' 'series-b' 'venture' 'other' 'series-a'] col: funded_at | type: object | 1852 values: ['2006-10-11' '2010-03-30' '2006-03-28' '2006-08-15' '2006-10-02'] col: funded_month | type: object | 162 values: ['2006-10' '2010-03' '2006-03' '2006-08' '2010-05'] col: funded_quarter | type: object | 62 values: ['2006-Q4' '2010-Q1' '2006-Q1' '2006-Q3' '2010-Q2'] CHUNK 9: col: company_name | type: object | 2999 values: ['GuardianEdge Technologies' 'NetBase Solutions' 'GigaLogix' 'Axcient' 'Apsalar'] col: company_category_code | type: object | 43 values: ['software' 'social' 'public_relations' 'enterprise' 'advertising'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 46 values: ['CA' 'GA' 'MA' 'MN' 'WA'] col: company_region | type: object | 177 values: ['SF Bay' 'Atlanta' 'Boston' 'Minneapolis' 'Los Angeles'] col: company_city | type: object | 478 values: ['San Mateo' 'Mountain View' 'Oakland' 'SAN FRANCISCO' 'Pleasanton'] col: investor_name | type: object | 1018 values: ['Thomvest Ventures' 'Thorner Ventures' 'Three Arch Partners' 'Thrive Capital' 'TI Ventures'] col: investor_country_code | type: object | 28 values: ['USA' nan 'PAK' 'IND' 'JPN'] col: investor_state_code | type: object | 35 values: ['CA' nan 'NY' 'TX' 'CT'] col: investor_region | type: object | 112 values: ['SF Bay' 'TBD' 'New York' 'Dallas' 'unknown'] col: investor_city | type: object | 168 values: ['Redwood City' nan 'Portola Valley' 'New York' 'Dallas'] col: funding_round_type | type: object | 7 values: ['other' 'series-c+' 'series-a' 'series-b' 'angel'] col: funded_at | type: object | 1783 values: ['2007-07-23' '2010-03-02' '2007-02-01' '2010-07-14' '2010-11-01'] col: funded_month | type: object | 154 values: ['2007-07' '2010-03' '2007-02' '2010-07' '2010-11'] col: funded_quarter | type: object | 59 values: ['2007-Q3' '2010-Q1' '2007-Q1' '2010-Q3' '2010-Q4'] CHUNK 10: col: company_name | type: object | 2045 values: ['Bread' 'Nuiku' 'Savvy Cellar Wines' 'Enigma Technologies' 'Yext'] col: company_category_code | type: object | 43 values: ['advertising' 'software' 'other' 'analytics' 'web'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 42 values: ['CA' 'WA' 'NY' 'TX' 'CO'] col: company_region | type: object | 128 values: ['SF Bay' 'Seattle' 'New York' 'Los Angeles' 'Austin'] col: company_city | type: object | 326 values: ['San Francisco' 'Redmond' 'Redwood City' 'New York' 'West Hollywood'] col: investor_name | type: object | 2361 values: ['Brendan Wallace' 'Brent Frei' 'Brent Harrison' 'Brent Hurley' 'Brent Metz'] col: investor_region | type: object | 1 values: ['unknown'] col: funding_round_type | type: object | 7 values: ['angel' 'series-a' 'series-b' 'venture' 'private-equity'] col: funded_at | type: object | 1056 values: ['2011-02-01' '2013-09-17' '2005-01-01' '2013-02-27' '2006-11-01'] col: funded_month | type: object | 135 values: ['2011-02' '2013-09' '2005-01' '2013-02' '2006-11'] col: funded_quarter | type: object | 55 values: ['2011-Q1' '2013-Q3' '2005-Q1' '2013-Q1' '2006-Q4'] CHUNK 11: col: company_name | type: object | 1541 values: ['NuORDER' 'ChaCha' 'Binfire' 'Unified Color' 'HItviews'] col: company_category_code | type: object | 42 values: ['fashion' 'advertising' 'software' 'social' 'ecommerce'] col: company_country_code | type: object | 1 values: ['USA'] col: company_state_code | type: object | 40 values: ['CA' 'IN' 'FL' 'NY' 'MO'] col: company_region | type: object | 106 values: ['Los Angeles' 'Indianapolis' 'Bocat Raton' 'SF Bay' 'New York'] col: company_city | type: object | 271 values: ['West Hollywood' 'Carmel' 'Bocat Raton' 'South San Frnacisco' 'New York City'] col: investor_name | type: object | 1337 values: ['Mortimer Singer' 'Morton Meyerson' 'Moshe Ariel' 'Mr. Andrew Oung' 'multiple parties'] col: investor_region | type: object | 1 values: ['unknown'] col: funding_round_type | type: object | 7 values: ['series-a' 'series-b' 'angel' 'other' 'venture'] col: funded_at | type: object | 886 values: ['2012-10-01' '2007-10-01' '2008-04-18' '2010-01-01' '2007-11-29'] col: funded_month | type: object | 126 values: ['2012-10' '2007-10' '2008-04' '2010-01' '2007-11'] col: funded_quarter | type: object | 51 values: ['2012-Q4' '2007-Q4' '2008-Q2' '2010-Q1' '2012-Q2']
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
count = 0
#total_num_rows = 0
str_cols_vc = {}
for chunk in chunks_it:
count += 1
#total_num_rows += chunk.shape[0]
#print(f'\nChunk {count}:')
#print(chunk.dtypes.value_counts())
#obj_cols = chunk.select_dtypes(include=['object'])
#print(obj_cols.value_counts())
#print(type(obj_cols))
#columns = ['investor_city', 'investor_country_code',
# 'investor_state_code']
for col in chunk:
current_col_vc = chunk[col].value_counts()
if col in str_cols_vc:
str_cols_vc[col].append(current_col_vc)
else:
str_cols_vc[col] = [current_col_vc]
#print(f'Total Num Rows: {total_num_rows}')
## Combine the value count results
combined_vcs = {}
for col, vc_list in str_cols_vc.items():
combined_vc = pd.concat(vc_list)
final_vc = combined_vc.groupby(combined_vc.index).sum()
combined_vcs[col] = final_vc
pp.pprint(combined_vcs.keys())
print('\nAll Value Counts:')
pp.pprint(combined_vcs)
dict_keys(['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']) All Value Counts: {'company_category_code': 2/7/08 1 advertising 3200 analytics 1863 automotive 164 biotech 4951 cleantech 1948 consulting 233 design 55 ecommerce 2168 education 783 enterprise 4489 fashion 368 finance 931 games_video 1893 government 10 hardware 1537 health 670 hospitality 331 legal 87 local 22 manufacturing 310 medical 1315 messaging 452 mobile 4067 music 287 nanotech 216 network_hosting 1075 news 363 nonprofit 149 other 481 pets 43 photo_video 230 public_relations 659 real_estate 190 search 632 security 996 semiconductor 1292 social 920 software 7243 sports 121 transportation 130 travel 337 web 5015 Name: company_category_code, dtype: int64, 'company_city': (Oct. 01, 2011 - Sep. 30, 2012) 2 10000000 1 ALLSTON 2 ATLANTA 2 AUSTIN 2 Acton 90 Addison 29 Agoura Hills 13 Akron 12 Alachua 19 Alameda 44 Albuquerque 83 Alexandria 40 Aliso Viejo 55 Aliso Viejo, 6 Allen 14 Allen park 6 Allentown 20 Alliance 1 Allston 4 Alpharetta 83 Alpharetta, 1 Altadena 17 Altamonte Springs 1 Alviso 1 American Fork 40 Ames 6 Amesbury 5 Amherst 9 Anaheim 3 .. Worcester 10 Wyckoff 2 Wynnewood 1 Yaphank 2 Yardley 1 Yonkers 7 Yorba Linda 2 Youngstown 2 Ypsilanti 1 Zagreb 2 Zeeland 2 ann arbor 1 atlanta 1 bluffton 1 columbus 1 foster city 2 hermosa beach 3 irvine 3 los angeles 11 miami 4 new york 5 omaha 1 pahoa 1 san francisco 1 santa monica 19 sausalito 1 scottsdale 2 seattle 1 w. babylon 2 westminster 1 Name: company_city, Length: 1229, dtype: int64, 'company_country_code': 2008-02 1 USA 52868 Name: company_country_code, dtype: int64, 'company_name': #waywire 5 0xdata 1 1-800-DENTIST 2 1000memories 10 100Plus 4 1010data 1 11i Solutions 1 121nexus 5 12Society 5 1366 Technologies 12 140 Proof 6 15Five 8 170 Systems 1 1Cast 1 1World Online 1 1stdibs 5 2080 Media 8 20JEANS 5 20x200 17 23andMe 16 24/7 Card 2 247 Techies 2 24M Technologies 3 24PageBooks 1 27 Perry 1 27 bards 1 2U 18 2nd Story Software, Inc. 1 2sms 1 3-V Biosciences 7 .. uromovie 1 userfox 3 ustyme 1 vArmour Networks 1 vSocial 3 vcopious Software 3 via680 1 viaCycle 2 vidIQ 7 videoNEXT 2 vline 2 walkby 2 webmynd 9 wedgies 4 weeSpring 1 weendy 5 whereIstand.com 1 wmbly 1 woodpellets.com 3 writewith 1 wunderloop 4 xAd 4 xG Technology 1 xkoto 5 y prime 1 yaM Labs 1 ybuy 4 zozi 38 zulily 6 zuuka! 3 Name: company_name, Length: 11573, dtype: int64, 'company_region': 2008 1 Akron 11 Alachua 19 Albuquerque 83 Allentown 20 Alliance 1 Ames 1 Amherst 9 Angier 4 Appleton 3 Asheville 3 Ashford 1 Ashland 4 Atlanta 558 Atlantic Highlands 8 Auburn Hlls 2 Augusta 1 Aurora 3 Austin 947 Avon 1 B 5 Bakersfield 1 Bala Cynwyd 6 Baltimore 95 Bangalore 3 Bar Harbor 1 Barre 2 Baton Rouge 14 Battleground 3 Bedford 8 ... West Trenton 2 Westfield 14 Westport 17 Whippany 6 White River 3 Whiting 1 Wilbraham 3 Williamsburg 2 Williamstown 9 Williamsville 3 Williston 2 Willmington 25 Wilmington 12 Wilton 4 Windber 2 Windsor 9 Winooski 3 Winston Salem 6 Woodbury 3 Woodstock 5 Woonsocket 1 Wooster 3 Yaphank 2 Youngstown 2 Zagreb 2 Zeeland 2 piscataway 9 unknown 102 w. babylon 2 wilkes barre 1 Name: company_region, Length: 546, dtype: int64, 'company_state_code': 2008-Q1 1 AL 17 AR 33 AZ 261 CA 27054 CO 977 CT 333 DC 219 DE 28 FL 550 GA 568 HI 31 IA 33 ID 32 IL 911 IN 143 KS 65 KY 45 LA 43 MA 4889 MD 600 ME 94 MI 265 MN 205 MO 182 MS 33 MT 19 NC 596 NE 39 NH 159 NJ 691 NM 90 NV 105 NY 5890 OH 439 OK 29 OR 466 PA 918 RI 191 SC 97 SD 3 TN 176 TX 1619 UT 419 VA 704 VT 55 WA 1888 WI 163 WV 9 WY 1 Name: company_state_code, dtype: int64, 'funded_at': 1987-01-01 1 1987-06-16 1 1995-04-01 1 1995-07-01 1 1995-11-30 3 1996-01-01 2 1997-03-01 1 1997-06-01 1 1997-11-10 3 1998-01-01 1 1998-02-09 3 1998-04-01 1 1998-05-01 4 1998-07-01 1 1998-08-01 1 1999-01-01 16 1999-03-01 10 1999-04-01 13 1999-04-15 1 1999-05-01 3 1999-06-01 3 1999-06-07 2 1999-07-01 2 1999-07-07 1 1999-07-22 1 1999-08-02 2 1999-08-30 2 1999-09-01 8 1999-09-02 1 1999-10-01 14 .. 2013-08-27 36 2013-08-28 35 2013-08-29 55 2013-08-30 11 2013-09-01 27 2013-09-03 39 2013-09-04 68 2013-09-05 45 2013-09-06 13 2013-09-07 5 2013-09-09 23 2013-09-10 40 2013-09-11 50 2013-09-12 38 2013-09-13 22 2013-09-15 4 2013-09-16 38 2013-09-17 60 2013-09-18 41 2013-09-19 38 2013-09-20 20 2013-09-22 17 2013-09-23 34 2013-09-24 45 2013-09-25 36 2013-09-26 56 2013-09-27 51 2013-09-29 2 2013-09-30 34 2013-10-01 5 Name: funded_at, Length: 2808, dtype: int64, 'funded_month': 1987-01 1 1987-06 1 1995-04 1 1995-07 1 1995-11 3 1996-01 2 1997-03 1 1997-06 1 1997-11 3 1998-01 1 1998-02 3 1998-04 1 1998-05 4 1998-07 1 1998-08 1 1999-01 16 1999-03 10 1999-04 14 1999-05 3 1999-06 5 1999-07 4 1999-08 4 1999-09 9 1999-10 16 1999-11 3 1999-12 11 2000-01 21 2000-02 8 2000-03 13 2000-04 35 ... 2011-05 691 2011-06 760 2011-07 641 2011-08 752 2011-09 837 2011-10 624 2011-11 666 2011-12 667 2012-01 1005 2012-02 567 2012-03 651 2012-04 696 2012-05 805 2012-06 926 2012-07 779 2012-08 608 2012-09 666 2012-10 818 2012-11 669 2012-12 583 2013-01 836 2013-02 702 2013-03 707 2013-04 654 2013-05 784 2013-06 913 2013-07 927 2013-08 741 2013-09 846 2013-10 5 Name: funded_month, Length: 192, dtype: int64, 'funded_quarter': 1987-Q1 1 1987-Q2 1 1995-Q2 1 1995-Q3 1 1995-Q4 3 1996-Q1 2 1997-Q1 1 1997-Q2 1 1997-Q4 3 1998-Q1 4 1998-Q2 5 1998-Q3 2 1999-Q1 26 1999-Q2 22 1999-Q3 17 1999-Q4 30 2000-Q1 42 2000-Q2 66 2000-Q3 28 2000-Q4 40 2001-Q1 38 2001-Q2 20 2001-Q3 26 2001-Q4 32 2002-Q1 40 2002-Q2 35 2002-Q3 37 2002-Q4 33 2003-Q1 54 2003-Q2 50 ... 2006-Q3 942 2006-Q4 915 2007-Q1 1269 2007-Q2 1189 2007-Q3 1280 2007-Q4 1043 2008-Q1 1476 2008-Q2 1414 2008-Q3 1061 2008-Q4 986 2009-Q1 1074 2009-Q2 1077 2009-Q3 1020 2009-Q4 1539 2010-Q1 1578 2010-Q2 1575 2010-Q3 1651 2010-Q4 1491 2011-Q1 2026 2011-Q2 2046 2011-Q3 2230 2011-Q4 1957 2012-Q1 2223 2012-Q2 2427 2012-Q3 2053 2012-Q4 2070 2013-Q1 2245 2013-Q2 2351 2013-Q3 2514 2013-Q4 5 Name: funded_quarter, Length: 72, dtype: int64, 'funded_year': 1987.0 2 1995.0 5 1996.0 2 1997.0 5 1998.0 11 1999.0 95 2000.0 176 2001.0 116 2002.0 145 2003.0 172 2004.0 411 2005.0 2990 2006.0 3867 2007.0 4781 2008.0 4937 2009.0 4710 2010.0 6295 2011.0 8259 2012.0 8773 2013.0 7115 Name: funded_year, dtype: int64, 'funding_round_type': angel 8989 crowdfunding 5 other 964 post-ipo 33 private-equity 357 series-a 13938 series-b 8794 series-c+ 10870 venture 8917 Name: funding_round_type, dtype: int64, 'investor_city': (Oct. 01, 2011 - Sep. 30, 2012) 2 ABERDEEN 2 ADELAIDE 2 ALBURY 1 AUSTIN 1 Aalborg 1 Abbotsford 1 Abu Dhabi 2 Accra 1 Addison 4 Adelaide 2 Agoura Hills 4 Aichi 1 Akron 1 Alamo 12 Albany 23 Alberta 1 Albuquerque 13 Alexandria 98 Aliso Viejo 5 Allentown 1 Alpharetta 6 Alphen aan den Rijn 8 Alpine 1 Altadena 15 American Falls 1 Ames 2 Amesbury 1 Amherst 18 Amman 5 .. Wilsonville 1 Wilton 3 Windermer 1 Windermere 1 Winnetka 2 Winnipeg 2 Winston-Salem 10 Winter Park 18 Wood Dale 1 Woodcliff Lake 2 Woodside 71 Wooster 1 Wrexham 1 Wuppertal 3 Wynne 1 Yardley 2 Yvette cedex 1 Zagreb 1 Zeist 1 Zuerich 2 Zug 1 Zurich 41 Zrich 1 ZÌ_rich 2 _stanbul 1 cape town 1 herndon 2 kansas city 1 paris 1 portland metro 1 Name: investor_city, Length: 990, dtype: int64, 'investor_country_code': ARE 7 ARG 14 AUS 163 BEL 44 BGR 4 BHR 10 BHS 1 BMU 6 BRA 10 CAN 432 CHE 184 CHL 8 CHN 143 CYM 15 CYP 2 CZE 2 DEU 349 DNK 70 EGY 1 ESP 45 EST 8 FIN 36 FRA 229 GBR 825 GHA 1 GIB 3 GRC 1 HKG 53 HRV 1 HUN 9 ... LIE 1 LTU 1 LUX 32 MAR 1 MEX 5 MUS 23 MYS 15 NGA 6 NLD 72 NOR 37 NZL 14 PAK 1 PHL 6 POL 1 PRI 2 PRT 13 QAT 2 ROM 2 RUS 63 SAU 2 SGP 105 SVN 3 SWE 120 TUR 4 TWN 47 UKR 9 USA 36574 VNM 5 WSM 4 ZAF 5 Name: investor_country_code, Length: 72, dtype: int64, 'investor_name': .406 Ventures 36 1-800-FLOWERS.COM 1 10 Others 1 10X Fund LP 11 10X Venture Partners 2 10Xelerator 7 1in10 Ventures 1 1to1 Venture Partners 1 2010 NYU Stern Business Plan Competition 1 2020 Ventures 5 212 Capital Partners 1 21Ventures 12 22Hundred Group 1 2M Companies 2 2x Consumer Products Growth Partners 1 3 Others 1 3 Wise Ventures 1 313 Ventures 1 37 Ventures 2 3Com Ventures 1 3M New Ventures 6 3TS Capital Partners 2 3TS Cisco Growth Fund 1 3V SourceOne Ventures 1 3i Deutschland Gesellschaft fr Industriebeteiligungen 1 3i Group 46 3taps 1 3x5 Special Opportunity Partners 3 42 Ventures 1 4th Level Ventures 2 .. impact investment fund 1 incTANK Ventures 1 international private investors 1 jVen Capital 2 jamell Givens 1 john bialk 1 kbs+ Ventures 12 lessin X2, llc 1 lilian samartino 1 m8 Capital 7 merrickhanna ventures 1 michael gartenberg 1 mike schell 1 multiple parties 2 netprice.com 3 reInvention Capital 2 redbutler 1 reinmkr capital 9 ru-Net Holdings 3 sheila gulati 1 thetime 3 undisclosed investor 1 vSpring Capital 37 venBio 1 venture capitals 6 ventureblue Capital 2 venturecapital.de 1 yet2Ventures 1 zohar israel 1 zulily 1 Name: investor_name, Length: 10465, dtype: int64, 'investor_region': Aalborg 1 Abbotsford 1 Aberdeen 2 Abu Dhabi 2 Accra 1 Adelaide 4 Aichi 1 Akron 1 Albany 23 Alberta 1 Albuquerque 13 Allentown 1 Alphen aan den Rijn 8 American Falls 1 Ames 1 Amherst 18 Amman 5 Amsterdam 25 Antwerpen 14 Asheville 2 Aspen 1 Athens 19 Atlanta 266 Atsugi-City 1 Auburn 2 Auckland 10 Auckland Central 1 Austin 368 Australia - Other 6 Avon 7 ... West Lafayette 2 West Point 2 Weston 1 Westport 38 Whitehouse Station 1 Wichita 3 Williamstown 66 Willmington 20 Wilmington 8 Wilton 3 Windermer 1 Winnipeg 2 Winston Salem 10 Woodcliff Lake 2 Woodstock 1 Wooster 1 Wrexham 1 Wuppertal 3 Wynne 1 Zagreb 1 Zuerich 2 Zug 1 Zurich 48 chicago 7 freehold 3 mkalles 1 portland metro 1 regina 1 unknown 12285 wilkes barre 1 Name: investor_region, Length: 585, dtype: int64, 'investor_state_code': AL 67 AR 14 AZ 84 CA 18405 CO 729 CT 577 DC 323 DE 20 FL 242 GA 274 HI 13 IA 9 ID 40 IL 992 IN 88 KS 13 KY 54 LA 15 MA 3619 MD 486 ME 41 MI 315 MN 101 MO 148 MS 6 MT 1 NC 339 ND 5 NE 35 NH 51 NJ 456 NM 41 NV 38 NY 4404 OH 309 OK 21 OR 85 PA 762 RI 92 SC 34 SD 9 TN 147 TX 816 UT 200 VA 579 VT 26 WA 847 WI 82 WV 4 WY 3 Name: investor_state_code, dtype: int64, 'raised_amount_usd': 1.000000e+03 3 2.000000e+03 2 2.100000e+03 1 3.000000e+03 3 5.000000e+03 8 5.500000e+03 1 5.600000e+03 1 6.000000e+03 1 6.500000e+03 2 8.000000e+03 2 8.300000e+03 1 1.000000e+04 30 1.156700e+04 2 1.160000e+04 1 1.180000e+04 1 1.200000e+04 18 1.300000e+04 2 1.320000e+04 1 1.400000e+04 6 1.500000e+04 79 1.528200e+04 1 1.600000e+04 2 1.650000e+04 1 1.700000e+04 5 1.800000e+04 47 1.833300e+04 1 1.890000e+04 1 1.900000e+04 1 2.000000e+04 133 2.200000e+04 2 ... 2.250000e+08 9 2.280000e+08 1 2.380000e+08 2 2.400000e+08 1 2.490000e+08 1 2.500000e+08 16 2.580000e+08 2 2.650000e+08 2 2.750000e+08 5 2.830000e+08 1 3.000000e+08 17 3.160000e+08 1 3.190000e+08 5 3.500000e+08 4 3.601000e+08 4 4.000000e+08 6 4.500000e+08 7 4.650000e+08 1 4.900000e+08 4 5.290000e+08 1 5.350000e+08 1 5.650000e+08 1 7.500000e+08 1 9.200000e+08 5 9.500000e+08 10 1.000000e+09 1 1.050000e+09 2 1.500000e+09 8 2.600000e+09 1 3.200000e+09 5 Name: raised_amount_usd, Length: 1458, dtype: int64}
Observations:
funded_month
, funded_quarter
, funded_year
columns seem unnecessary as we already have a funded_at
column which has the exact funding date. The other 2 columns can be easily derived from the latter column.funded_at
column can be converted into a datetime type columnchunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
total_num_rows = 0
for chunk in chunks_it:
total_num_rows += chunk.shape[0]
print("Total Rows: {0}".format(total_num_rows))
Total Rows: 52870
# Drop columns representing URLs or containing too many missing values ( > 90%)
# or the 2 unnecessary fund date related columns we identified
drop_cols = ['funded_month', 'funded_quarter', 'funded_year']
keep_cols = chunk.columns.drop(drop_cols)
print(keep_cols.tolist())
['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'raised_amount_usd']
# Memory Comparison - Before vs After Type Conversion
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
counter = 0
total_initial_memory_mb = 0
total_final_memory_mb = 0
for chunk in chunks_it:
counter += 1
print("\nCHUNK {0}:".format(counter))
total_initial_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)
#print('Initial Memory Usage:')
memory_comparison = []
initial_memory_mb = 0
for col in chunk:
chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576
initial_memory_mb += chunk_memory_mb
initial_status = f'{col:<30}: {chunk_memory_mb:0.2f} ({chunk[col].dtype})'
memory_comparison.append(initial_status)
#for col in float_cols.columns:
chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'],
downcast='float')
# convert object columns to datetime type
chunk['funded_at'] = pd.to_datetime(chunk['funded_at'])
total_memory_mb = 0
for i, col in enumerate(chunk):
chunk_memory_mb = chunk[col].memory_usage(deep=True) / 1048576
total_memory_mb += chunk_memory_mb
previous_info = memory_comparison[i]
print(f'{previous_info:<10} vs. {chunk_memory_mb:0.2f} ({chunk[col].dtype})')
print(f'Memory: {initial_memory_mb:0.2f} mb vs {total_memory_mb:0.2f} mb')
total_final_memory_mb += (chunk.memory_usage(deep=True).sum() / 1048576)
print(f'\nMemory:{total_initial_memory_mb:0.2f} mb vs {total_final_memory_mb:0.2f} mb')
CHUNK 1: company_name : 0.32 (object) vs. 0.32 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.34 (object) vs. 0.34 (object) investor_country_code : 0.27 (object) vs. 0.27 (object) investor_state_code : 0.25 (object) vs. 0.25 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.29 (object) vs. 0.29 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 3.93 mb vs 3.63 mb CHUNK 2: company_name : 0.33 (object) vs. 0.33 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.36 (object) vs. 0.36 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.30 (object) vs. 0.30 (object) investor_city : 0.31 (object) vs. 0.31 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 4.00 mb vs 3.70 mb CHUNK 3: company_name : 0.33 (object) vs. 0.33 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.36 (object) vs. 0.36 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.31 (object) vs. 0.31 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 4.01 mb vs 3.71 mb CHUNK 4: company_name : 0.32 (object) vs. 0.32 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.36 (object) vs. 0.36 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.31 (object) vs. 0.31 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 4.01 mb vs 3.71 mb CHUNK 5: company_name : 0.33 (object) vs. 0.33 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.36 (object) vs. 0.36 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.25 (object) vs. 0.25 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.30 (object) vs. 0.30 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 3.99 mb vs 3.69 mb CHUNK 6: company_name : 0.33 (object) vs. 0.33 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.37 (object) vs. 0.37 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.31 (object) vs. 0.31 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 4.02 mb vs 3.72 mb CHUNK 7: company_name : 0.33 (object) vs. 0.33 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.36 (object) vs. 0.36 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.30 (object) vs. 0.30 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 4.00 mb vs 3.70 mb CHUNK 8: company_name : 0.32 (object) vs. 0.32 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.35 (object) vs. 0.35 (object) investor_country_code : 0.28 (object) vs. 0.28 (object) investor_state_code : 0.26 (object) vs. 0.26 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.31 (object) vs. 0.31 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 3.99 mb vs 3.69 mb CHUNK 9: company_name : 0.32 (object) vs. 0.32 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.35 (object) vs. 0.35 (object) investor_country_code : 0.25 (object) vs. 0.25 (object) investor_state_code : 0.24 (object) vs. 0.24 (object) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.27 (object) vs. 0.27 (object) funding_round_type : 0.31 (object) vs. 0.31 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 3.90 mb vs 3.60 mb CHUNK 10: company_name : 0.32 (object) vs. 0.32 (object) company_category_code : 0.31 (object) vs. 0.31 (object) company_country_code : 0.29 (object) vs. 0.29 (object) company_state_code : 0.28 (object) vs. 0.28 (object) company_region : 0.31 (object) vs. 0.31 (object) company_city : 0.32 (object) vs. 0.32 (object) investor_name : 0.33 (object) vs. 0.33 (object) investor_country_code : 0.04 (float64) vs. 0.04 (float64) investor_state_code : 0.04 (float64) vs. 0.04 (float64) investor_region : 0.31 (object) vs. 0.31 (object) investor_city : 0.04 (float64) vs. 0.04 (float64) funding_round_type : 0.30 (object) vs. 0.30 (object) funded_at : 0.32 (object) vs. 0.04 (datetime64[ns]) raised_amount_usd : 0.04 (float64) vs. 0.02 (float32) Memory: 3.23 mb vs 2.93 mb CHUNK 11: company_name : 0.18 (object) vs. 0.18 (object) company_category_code : 0.18 (object) vs. 0.18 (object) company_country_code : 0.16 (object) vs. 0.16 (object) company_state_code : 0.16 (object) vs. 0.16 (object) company_region : 0.18 (object) vs. 0.18 (object) company_city : 0.18 (object) vs. 0.18 (object) investor_name : 0.19 (object) vs. 0.19 (object) investor_country_code : 0.02 (float64) vs. 0.02 (float64) investor_state_code : 0.02 (float64) vs. 0.02 (float64) investor_region : 0.18 (object) vs. 0.18 (object) investor_city : 0.02 (float64) vs. 0.02 (float64) funding_round_type : 0.17 (object) vs. 0.17 (object) funded_at : 0.18 (object) vs. 0.02 (datetime64[ns]) raised_amount_usd : 0.02 (float64) vs. 0.01 (float32) Memory: 1.85 mb vs 1.68 mb Memory:40.92 mb vs 37.74 mb
Observations:
By converting the 2 columns (raised_amount_usd
to a more space efficient float type and funded_at
to a datetime type), we saved ~3 MB from the overall data. Not a significant amount. There are opportunities to convert some of the other columns into categories to further save space in the dataframe, however I am not sure how this would translate to more space saved on the SQLite database.
The next step is to load each chunk into a table in a SQLite database so we can query the full data set.
!wc IPython
command to return the file size of the database.chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
total_size_list = []
for chunk in chunks_it:
total_size_list.append(chunk.shape)
print(f"Total Size List: '{total_size_list}'")
Total Size List: '[(5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (5000, 14), (2870, 14)]'
We verify that our rows add up to 52,870:
total_rows = $(5000 * 10) + 2,870$
We have 14 columns.
conn = sqlite3.connect('crunchbase.db')
chunks_it = pd.read_csv('crunchbase-investments.csv', chunksize=5000,
encoding='ISO-8859-1', usecols=keep_cols)
counter = 0
for chunk in chunks_it:
#print(chunk.columns.tolist())
counter += 1
print(f"Loading chunk {counter}...")
chunk.to_sql("investments", conn, if_exists='append', index=False)
#break
conn.close()
print("Completed loading all chunks to crunchbase.db!")
Loading chunk 1... Loading chunk 2... Loading chunk 3... Loading chunk 4... Loading chunk 5... Loading chunk 6... Loading chunk 7... Loading chunk 8... Loading chunk 9... Loading chunk 10... Loading chunk 11... Completed loading all chunks to crunchbase.db!
!ls
crunchbase.db crunchbase-investments.csv project03_analyzing_startup_fundraising_deals_from_crunchbase.ipynb
Successfully created database, crunchbase.db
.
# SQLite Helper Functions
def run_query(query):
with sqlite3.connect('crunchbase.db') as conn:
return pd.read_sql(query, conn)
def show_tables():
query = '''
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table", "view");
'''
return run_query(query)
show_tables()
name | type | |
---|---|---|
0 | investments | table |
df = run_query('SELECT * from investments')
df.head()
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AdverCar | advertising | USA | CA | SF Bay | San Francisco | 1-800-FLOWERS.COM | USA | NY | New York | New York | series-a | 2012-10-30 | 2000000.0 |
1 | LaunchGram | news | USA | CA | SF Bay | Mountain View | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-23 | 20000.0 |
2 | uTaP | messaging | USA | None | United States - Other | None | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-01 | 20000.0 |
3 | ZoopShop | software | USA | OH | Columbus | columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 20000.0 |
4 | eFuneral | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2011-09-08 | 20000.0 |
!wc crunchbase.db
4773 229954 6340608 crunchbase.db
Note:
Earlier, I calculated 52870 rows across all the chunks. The database ends up having only 4773 rows.
Update: After a quick double check, I realized the number of rows returned by wc
command doesn't correspond to the number of lines in our SQL table. Below we verified our row and column numbers match up with our expectations.
df = run_query('SELECT count(*) as num_rows from investments')
df
num_rows | |
---|---|
0 | 52870 |
# Table Columns info
df = run_query('PRAGMA table_info(investments)')
df
cid | name | type | notnull | dflt_value | pk | |
---|---|---|---|---|---|---|
0 | 0 | company_name | TEXT | 0 | None | 0 |
1 | 1 | company_category_code | TEXT | 0 | None | 0 |
2 | 2 | company_country_code | TEXT | 0 | None | 0 |
3 | 3 | company_state_code | TEXT | 0 | None | 0 |
4 | 4 | company_region | TEXT | 0 | None | 0 |
5 | 5 | company_city | TEXT | 0 | None | 0 |
6 | 6 | investor_name | TEXT | 0 | None | 0 |
7 | 7 | investor_country_code | TEXT | 0 | None | 0 |
8 | 8 | investor_state_code | TEXT | 0 | None | 0 |
9 | 9 | investor_region | TEXT | 0 | None | 0 |
10 | 10 | investor_city | TEXT | 0 | None | 0 |
11 | 11 | funding_round_type | TEXT | 0 | None | 0 |
12 | 12 | funded_at | TEXT | 0 | None | 0 |
13 | 13 | raised_amount_usd | REAL | 0 | None | 0 |
We can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.
def run_query(query):
with sqlite3.connect('crunchbase.db') as conn:
return pd.read_sql(query, conn)
def run_command(command):
with sqlite3.connect('crunchbase.db') as conn:
conn.isolation_level = None
conn.execute(command)
def show_tables():
query = '''
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table", "view");
'''
return run_query(query)
show_tables()
name | type | |
---|---|---|
0 | investments | table |
query = '''
SELECT *
FROM investments
LIMIT 10'''
run_query(query)
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AdverCar | advertising | USA | CA | SF Bay | San Francisco | 1-800-FLOWERS.COM | USA | NY | New York | New York | series-a | 2012-10-30 | 2000000.0 |
1 | LaunchGram | news | USA | CA | SF Bay | Mountain View | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-23 | 20000.0 |
2 | uTaP | messaging | USA | None | United States - Other | None | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-01 | 20000.0 |
3 | ZoopShop | software | USA | OH | Columbus | columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 20000.0 |
4 | eFuneral | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2011-09-08 | 20000.0 |
5 | Tackk | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-02-01 | 20000.0 |
6 | Acclaimd | analytics | USA | OH | Columbus | Columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-06-01 | 20000.0 |
7 | Acclaimd | analytics | USA | OH | Columbus | Columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-08-07 | 70000.0 |
8 | ToVieFor | ecommerce | USA | NY | New York | New York | 2010 NYU Stern Business Plan Competition | None | None | unknown | None | angel | 2010-04-01 | 75000.0 |
9 | OHK Labs | sports | USA | FL | Palm Beach | Boca Raton | 22Hundred Group | None | None | unknown | None | angel | 2011-09-01 | 100000.0 |
query = '''SELECT * FROM investments'''
investments = run_query(query)
investments.shape
(52870, 14)
investments.head()
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AdverCar | advertising | USA | CA | SF Bay | San Francisco | 1-800-FLOWERS.COM | USA | NY | New York | New York | series-a | 2012-10-30 | 2000000.0 |
1 | LaunchGram | news | USA | CA | SF Bay | Mountain View | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-23 | 20000.0 |
2 | uTaP | messaging | USA | None | United States - Other | None | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-01 | 20000.0 |
3 | ZoopShop | software | USA | OH | Columbus | columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 20000.0 |
4 | eFuneral | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2011-09-08 | 20000.0 |
# verify unique companies
unique_companies = list(investments['company_name'].unique())
len(unique_companies)
11574
# Found the additional row - this messed up the
# unique companies list above with the unique startups list
# below
investments[investments['company_name'].isnull()]
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34225 | None | None | None | None | None | None | None | None | None | None | None | None | None | NaN |
funds_raised_per_startup = investments.groupby('company_name')['raised_amount_usd'].sum()
funds_raised_per_startup.head()
company_name #waywire 8750000.0 0xdata 1700000.0 1-800-DENTIST 0.0 1000memories 20190000.0 100Plus 2750000.0 Name: raised_amount_usd, dtype: float64
funds_raised_per_startup.shape
(11573,)
The reason for the difference between this number and the earlier number (11574) is because there is a row with all values set as None. We can safely ignore this discrepancy.
Note: Obviously pandas groupby
function would not keep a None value.
#suppress displaying long numbers in scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
#reset option
pd.reset_option('display.float_format')
funds_raised_per_startup.sort_values(ascending=False)
company_name Clearwire 2.968000e+10 Groupon 1.018540e+10 Nanosolar 4.505000e+09 Facebook 4.154100e+09 SurveyMonkey 3.250000e+09 Zynga 2.886013e+09 Fisker Automotive 2.788000e+09 Dropbox 2.764400e+09 LivingSocial 2.685000e+09 sigmacare 2.600000e+09 Bloom Energy 2.530000e+09 Pacific Biosciences 2.374160e+09 Fab.com 2.349100e+09 BrightSource Energy 2.219000e+09 TRUECar 2.106500e+09 Wave Broadband 2.100000e+09 Twitter 1.950000e+09 PayPal 1.857000e+09 Solyndra 1.833000e+09 Demand Media 1.819992e+09 KAYAK 1.813500e+09 MiaSol 1.750000e+09 Tesla Motors 1.736500e+09 Pure Storage 1.650000e+09 SunEdison 1.635300e+09 Kosmos Biotherapeutics 1.595000e+09 Gilt Groupe 1.556000e+09 NextG Networks 1.540000e+09 Legendary Entertainment 1.375000e+09 Pinterest 1.356000e+09 ... Tacit Software 0.000000e+00 Tacere Therapeutics 0.000000e+00 Taaz 0.000000e+00 Sparkcloud 0.000000e+00 Spartan Race 0.000000e+00 Speaktoit 0.000000e+00 Spill Inc 0.000000e+00 mo9 (moKredit) 0.000000e+00 Splitcast Technology 0.000000e+00 SponsorHub 0.000000e+00 Timbuktu Labs 0.000000e+00 Tideland Signal Corporation 0.000000e+00 8020 Media 0.000000e+00 TicketBiscuit 0.000000e+00 g2One 0.000000e+00 for; to (do) Centers 0.000000e+00 Triad Digital Media 0.000000e+00 Tomorrowish 0.000000e+00 iCracked 0.000000e+00 Tippr 0.000000e+00 Affinaquest 0.000000e+00 Thinkspeed 0.000000e+00 Thounds 0.000000e+00 Threat Stack 0.000000e+00 Thumb Friendly 0.000000e+00 AfterCollege 0.000000e+00 Tickengo 0.000000e+00 Ticket Cake 0.000000e+00 The Talk Market 0.000000e+00 zuuka! 0.000000e+00 Name: raised_amount_usd, Length: 11573, dtype: float64
investments.groupby('company_name')['raised_amount_usd'].sum().sort_values()
company_name zuuka! 0.00 Marinexplore 0.00 MarketGid 0.00 Marketbright 0.00 Marketsync 0.00 Marquee 0.00 Massage Envy 0.00 Massively Fun 0.00 Masterson Industries 0.00 Matchmove Games 0.00 Mattermark 0.00 Measureful 0.00 Best Apps Market 0.00 MedSolutions 0.00 Bespoke Global 0.00 MedTest DX 0.00 MediKeeper 0.00 MediProPharma 0.00 MediaSpike 0.00 Medical Solutions 0.00 Manymoon 0.00 Medico.com 0.00 Mangia 0.00 Mandalay Sports Media (MSM) 0.00 Location Based Technologies 0.00 Lockitron 0.00 Look.io 0.00 LoopIt 0.00 Lucent Sky 0.00 LumaSense Technologies 0.00 ... Pinterest 1356000000.00 Legendary Entertainment 1375000000.00 NextG Networks 1540000000.00 Gilt Groupe 1556000000.00 Kosmos Biotherapeutics 1595000000.00 SunEdison 1635300000.00 Pure Storage 1650000000.00 Tesla Motors 1736500000.00 MiaSol 1750000000.00 KAYAK 1813500000.00 Demand Media 1819992000.00 Solyndra 1833000000.00 PayPal 1857000000.00 Twitter 1950000000.00 Wave Broadband 2100000000.00 TRUECar 2106500000.00 BrightSource Energy 2219000000.00 Fab.com 2349100000.00 Pacific Biosciences 2374160000.00 Bloom Energy 2530000000.00 sigmacare 2600000000.00 LivingSocial 2685000000.00 Dropbox 2764400000.00 Fisker Automotive 2788000000.00 Zynga 2886013000.00 SurveyMonkey 3250000000.00 Facebook 4154100000.00 Nanosolar 4505000000.00 Groupon 10185400000.00 Clearwire 29680000000.00 Name: raised_amount_usd, Length: 11573, dtype: float64
I'm getting 29,680,000,000 raised for Clearwire which seems to be off by a power of 10. As of 2009, Clearwire had raised $2.8 billion.
NOT $29 billion
clearwire_investments = investments[investments['company_name'] == 'Clearwire']
clearwire_investments
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
521 | Clearwire | mobile | USA | WA | Seattle | Kirkland | BrightHouse | USA | CA | Los Angeles | Santa Monica | post-ipo | 2009-11-10 | 1.500000e+09 |
522 | Clearwire | mobile | USA | WA | Seattle | Kirkland | BrightHouse | USA | CA | Los Angeles | Santa Monica | post-ipo | 2008-05-06 | 3.200000e+09 |
714 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Comcast | USA | PA | Philadelphia | Philadelphia | post-ipo | 2009-11-24 | 9.200000e+08 |
715 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Comcast | USA | PA | Philadelphia | Philadelphia | post-ipo | 2009-11-10 | 1.500000e+09 |
716 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Comcast | USA | PA | Philadelphia | Philadelphia | post-ipo | 2008-05-06 | 3.200000e+09 |
1172 | Clearwire | mobile | USA | WA | Seattle | Kirkland | USA | CA | SF Bay | Mountain View | post-ipo | 2008-05-06 | 3.200000e+09 | |
1348 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Intel | USA | CA | SF Bay | Santa Clara | post-ipo | 2009-11-24 | 9.200000e+08 |
1349 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Intel | USA | CA | SF Bay | Santa Clara | post-ipo | 2009-11-10 | 1.500000e+09 |
1350 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Intel | USA | CA | SF Bay | Santa Clara | post-ipo | 2008-05-06 | 3.200000e+09 |
2367 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Sprint Nextel | None | None | unknown | None | post-ipo | 2009-11-24 | 9.200000e+08 |
2368 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Sprint Nextel | None | None | unknown | None | post-ipo | 2009-11-10 | 1.500000e+09 |
2369 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Sprint Nextel | None | None | unknown | None | private-equity | 2013-02-27 | 8.000000e+07 |
2851 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Time Warner | USA | NY | New York | New York | post-ipo | 2009-11-24 | 9.200000e+08 |
2852 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Time Warner | USA | NY | New York | New York | post-ipo | 2009-11-10 | 1.500000e+09 |
2853 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Time Warner | USA | NY | New York | New York | post-ipo | 2008-05-06 | 3.200000e+09 |
14252 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Eagle River Holdings | USA | WA | Seattle | Kirkland | post-ipo | 2009-11-24 | 9.200000e+08 |
14253 | Clearwire | mobile | USA | WA | Seattle | Kirkland | Eagle River Holdings | USA | WA | Seattle | Kirkland | post-ipo | 2009-11-10 | 1.500000e+09 |
list(clearwire_investments['raised_amount_usd'])
[1500000000.0, 3200000000.0, 920000000.0, 1500000000.0, 3200000000.0, 3200000000.0, 920000000.0, 1500000000.0, 3200000000.0, 920000000.0, 1500000000.0, 80000000.0, 920000000.0, 1500000000.0, 3200000000.0, 920000000.0, 1500000000.0]