In this project our goal is to identify profitable mobile application profiles that we can use to inform our development team's future efforts. Our company only builds apps that are free to download and install, and our main source of revenue is accrued from advertiser payments via in-app ads. This means that the number of active users of our apps determines our revenue — the more users who see and engage with the ads, the better. There may be other considerations, such as the likelihood for new apps to successfully penetrate a market category, that we can use to balance our recommendations.
We will use two data sources for our analysis: one containing Apple App Store data and the other Google Play Store data. These can be downloaded at the links below.
Categories and genres in the middle to mid-high range for both average installs and number of apps are the most attractive. A few categories/genres meet these criteria across marketplaces:
# read in datasets (utf8 is stated explicitly because app names contain non-ASCII characters)
from csv import reader
handle_apple = open('AppleStore.csv', encoding='utf8')
read_apple = reader(handle_apple)
apple_app_data = list(read_apple)
handle_apple.close()
apple_headers = apple_app_data[0]
handle_google = open('googleplaystore.csv', encoding='utf8')
read_google = reader(handle_google)
google_app_data = list(read_google)
handle_google.close()
google_headers = google_app_data[0]
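The two loading steps above repeat the same pattern. As a minimal sketch, assuming a hypothetical helper named `load_csv` (it is not part of the original notebook) and UTF-8 encoded files, the pattern could be wrapped in a function that also closes the file automatically:

```python
from csv import reader

def load_csv(path):
    """Hypothetical helper: return (header, rows) for the CSV file at `path`."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, encoding='utf8', newline='') as handle:
        data = list(reader(handle))
    # the first row is the header, the remaining rows are data
    return data[0], data[1:]
```

A call such as `apple_headers, apple_rows = load_csv('AppleStore.csv')` would return the header separately from the data. The rest of this notebook keeps the header inside the list, so this is an alternative rather than a drop-in replacement.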
To make requerying our data simpler, let's create a function that allows us to select our dataset, query rows, and optionally print the number of rows/columns in the data set.
# function that prints the selected rows of a list of lists and optionally tells us the number of rows and columns
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
Let's test the function by pulling a sample from each dataset.
# print the column names and a sample of the data from our two datasets
print('Column Names')
print('Apple:')
explore_data(apple_app_data, 0, 1, False)
print('Google:')
explore_data(google_app_data, 0, 1, False)
print('Data')
print('Apple Store')
explore_data(apple_app_data, 1, 5, True)
print('\n')
print('Google Play')
explore_data(google_app_data,1 ,5, True)
print('\n')
Column Names
Apple:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Google:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Data
Apple Store
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
Number of rows: 7198
Number of columns: 16
Google Play
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 10842
Number of columns: 13
The Apple App Store dataset consists of 16 columns and 7198 rows, including the header row, so it describes 7197 apps. There is a mixture of numeric and descriptive columns.
The Google Play Store dataset consists of 13 columns and 10842 rows, including the header row, so it contains 10841 app entries. There is a mixture of numeric and descriptive columns.
There is an overlap in the columns of data that will allow us to compare the rows across the datasets. Those that will likely be the most important for this analysis are those that tell us the name of the app, (sub)categories, price, number of users and rating/reviews.
We will want to ensure that the datasets are cleaned before we conduct an analysis.
Let's first evaluate and clean the Google Play dataset. We found online discussion on the Google dataset about missing data. Let's recreate this and decide on a course of action.
explore_data(google_app_data,10473, 10474, False) #Missing data at entry 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
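This row has only 12 values instead of 13. The Category value is missing, so the remaining values are shifted one column to the left, and the genre is blank. Let's opt to delete the entire row.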
# delete row
del google_app_data[10473]
# verify the app row was removed
explore_data(google_app_data,10472,10475,False)
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']
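The malformed row was located here from an online discussion, but rows like it can also be found programmatically by comparing each row's length to the header's. The following is a minimal sketch on toy data; `find_malformed_rows` and `sample` are hypothetical names introduced for illustration only.

```python
def find_malformed_rows(dataset):
    """Hypothetical helper: return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# toy data standing in for the real file: the last row is missing a value
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Bad App', '1.9'],
]
print(find_malformed_rows(sample))
```

Reporting the index alongside the row makes it easy to inspect each candidate before deciding whether to delete it.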
There may be rows that are duplicates of each other. Let's first identify which rows those are and then decide how to handle them.
# loop through the names of the apps and add them to a list if they are a duplicate
unique_apps = []
duplicate_apps = []
for app in google_app_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Total Duplicate Apps: ', len(duplicate_apps))
print('\n')
print('Sample Duplicate Apps: ', duplicate_apps[:15])
Total Duplicate Apps: 1181
Sample Duplicate Apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
There are 1181 duplicate entries in the Google dataset. Let's test the hypothesis that duplicates differ in their number of reviews, a column that is likely to change over time.
for app in google_app_data:
    name = app[0]
    if name == 'Instagram':
        print(app)
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Our test showed that duplicate entries differ in their review counts (the fourth column, index 3), which suggests that each record is a snapshot taken at a different time. Using this information, we will keep the entry with the greatest number of reviews for each app, on the assumption that it is the most recent, and remove all other duplicates.
To clean our data of duplicates, we will create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app. We expect to see 9659 key-value pairs in the dictionary, which we found by subtracting the 1181 duplicates from the 10840 app rows that remain after excluding the header and the deleted row.
reviews_max = {}
for row in google_app_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))
9659
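The expected figure follows from simple arithmetic: 10840 app rows minus 1181 duplicates gives 9659 unique names. As a cross-check, the same two numbers can be derived with a set. This is a sketch on toy data; `count_unique_and_duplicates` and `sample_rows` are hypothetical names, not part of the original notebook.

```python
def count_unique_and_duplicates(rows, name_index=0):
    """Hypothetical helper: return (number of unique names, number of duplicate rows)."""
    names = [row[name_index] for row in rows]
    unique = len(set(names))  # a set keeps one copy of each name
    return unique, len(names) - unique

# toy rows: 'Box' appears three times, so two of its rows are duplicates
sample_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
unique, duplicates = count_unique_and_duplicates(sample_rows)
print(unique, duplicates)
```

On the real data, the call would need the header row excluded, for example by passing `google_app_data[1:]`.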
We now have a dictionary that maps each unique app name to the highest review count among its entries. We can use this dictionary to remove the duplicates from our dataset.
We do this by creating two empty lists and looping through the dataset. A row is appended to the 'clean' list when its review count equals the maximum stored in the dictionary and its name has not been added yet. The second list records the names already added, so that when several duplicates share the same maximum review count only the first one is kept.
googleplay_clean = []
already_added = []
for row in google_app_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # keep the first row whose review count equals the maximum for this app
    if name not in already_added and reviews_max[name] == n_reviews:
        already_added.append(name)
        googleplay_clean.append(row)
Let's check that our dataset has been changed. Explore the new dataset googleplay_clean, which is a version of the original dataset with incomplete and duplicate entries removed. From steps we completed earlier, we expect to see a count of 9659 rows in the list.
explore_data(googleplay_clean, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 9659
Number of columns: 13
The number of rows matches our expectation.
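The two-step approach above (find the maximum per app, then filter) can also be written as a single pass that remembers the best row seen so far for each name. This is only a sketch of an alternative, not the method used in this notebook; `dedupe_by_max_reviews` and the toy `rows` are hypothetical.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Hypothetical helper: keep, for each app name, the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        # strict '>' means the first row wins when review counts tie
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# toy rows: name, category, rating, reviews
rows = [
    ['A', 'x', '4', '10'],
    ['B', 'x', '4', '5'],
    ['A', 'x', '4', '12'],
    ['A', 'x', '4', '12'],
]
print(dedupe_by_max_reviews(rows))
```

One difference from the loop above: the result is ordered by each name's first appearance rather than by the position of the kept row, which does not matter for the frequency and average tables built later.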
Let's check if the Apple App Store dataset contains duplicate rows.
unique_apps = []
duplicate_apps = []
for row in apple_app_data[1:]:
    app_id = row[0]
    if app_id not in unique_apps:
        unique_apps.append(app_id)
    else:
        duplicate_apps.append(app_id)
print(len(unique_apps))
print(len(duplicate_apps))
7197
0
These results tell us that there are no duplicate apps in the Apple dataset.
Both the Apple App Store and the Google Play Store have applications created for English-speaking and non-English-speaking audiences. Our business, however, only develops applications for an English-speaking audience, so it makes sense to remove the others where reasonably possible. Let's explore apps that are not made for English-speaking audiences and decide how to remove them.
# print app names from Google and Apple
print(apple_app_data[814][1])
print(apple_app_data[6732][1])
print('\n')
print(googleplay_clean[4412][0])
print(googleplay_clean[7940][0])
爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き&ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ
Using the ord() function, we can identify which apps contain characters not typically used in English. The characters commonly used in English text fall within the ASCII range (0-127).
# show examples of unicode characters in and out of ord range
print(ord('a'))
print(ord('爱'))
97
29233
We can evaluate each character in our app names. By passing each name into a function, we can tell whether or not its characters fall into the ASCII range (0-127). Because English names may still include a few symbols such as ™ or emoji, we will use discretion and say that an app with more than three characters outside this range is not likely intended for the English market.
# function that takes in a string and returns `False` if it contains more than three non-ASCII characters
def english(string):
    count = 0
    for letter in string:
        if ord(letter) > 127:
            count += 1
    if count > 3:
        return False
    return True
# examples of the function working
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))
True
False
True
True
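The three-character threshold is a heuristic, and it helps to see where it gives surprising answers. The sketch below restates the same rule under a hypothetical name, `is_probably_english`, with an assumed `max_non_ascii` parameter, and tries it on made-up names that are not taken from either dataset.

```python
def is_probably_english(name, max_non_ascii=3):
    """Same rule as english(): allow at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

# made-up names chosen to probe the rule
accented = 'Pok\u00e9mon GO'                    # one accented letter
german = 'Deutsch lernen'                       # not English, but pure ASCII
emoji_heavy = 'Emoji Fun ' + '\U0001F600' * 4   # English words followed by four emoji

print(is_probably_english(accented))
print(is_probably_english(german))
print(is_probably_english(emoji_heavy))
```

The rule accepts the first two names and rejects the third. In other words, it filters by script rather than by language: non-English names written in Latin letters pass, and English names with many symbols can be dropped.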
We can now loop through both of our entire datasets and, using the function above, systematically remove every app that is not for the English market.
# loop through the Google Play dataset and categorize apps into english and non-english lists
googleplay_eng = []
googleplay_non_eng = []
for app in googleplay_clean:
    name = app[0]
    if english(name):
        googleplay_eng.append(app)
    else:
        googleplay_non_eng.append(app)
print('Count English Google Play Length: ', len(googleplay_eng))
print('Count Non-English Google Play Length: ', len(googleplay_non_eng))
# loop through the Apple dataset and categorize apps into english and non-english lists
apple_eng = []
apple_non_eng = []
# skip the header with a slice instead of reassigning apple_app_data, so re-running this cell is safe
for app in apple_app_data[1:]:
    name = app[1]
    if english(name):
        apple_eng.append(app)
    else:
        apple_non_eng.append(app)
print('Count English Apple Length: ', len(apple_eng))
print('Count Non-English Apple Length: ', len(apple_non_eng))
Count English Google Play Length: 9614
Count Non-English Google Play Length: 45
Count English Apple Length: 6183
Count Non-English Apple Length: 1014
Before moving on, let's do a quick gut-check that the apps in the non-English lists really are non-English.
print('Google Play')
print(googleplay_non_eng[0:5])
print('\n')
print('Apple')
print(apple_non_eng[0:5])
Google Play [['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up'], ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'], ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'], ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']] Apple [['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1'], ['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1'], ['336141475', '优酷视频', '204959744', 'USD', '0.0', '4885', '0', '3.5', '0.0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1'], ['425349261', '网易新闻 - 精选好内容,算出你的兴趣', '133134336', 'USD', '0.0', '4263', '6', '4.5', '1.0', '23.2', '17+', 'News', '37', '4', '2', '1'], ['387682726', '淘宝 - 随时随地,想淘就淘', '309673984', 'USD', '0.0', '3801', '6', '4.0', '4.0', '6.7.2', '4+', 'Shopping', '37', '1', '1', '1']]
Our business develops apps that are free to download and use, but generate ad revenue from in-app ads. The characteristics for paid and free apps can be quite different. Let's now filter both of our datasets to only include free apps.
# loop through Google Play dataset and filter free and paid apps into separate lists
googleplay_free = []
googleplay_paid = []
for app in googleplay_eng:
    cost = app[6]
    if cost == 'Free':
        googleplay_free.append(app)
    else:
        googleplay_paid.append(app)
# loop through apple dataset and filter free and paid apps into separate lists
apple_free = []
apple_paid = []
for app in apple_eng:
    cost = app[4]
    if cost == '0.0':
        apple_free.append(app)
    else:
        apple_paid.append(app)
# check that both datasets only contain free apps
print(apple_free[0:3])
print(len(apple_free))
print('\n')
print(googleplay_free[0:3])
print(len(googleplay_free))
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']] 3222 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']] 8863
Our end goal is to develop an app that is successful on both the Google Play and Apple App Store marketplaces. Our stakeholders' most important success metric is revenue, and we know that the KPI that most directly impacts revenue is the number of installs our apps have on the marketplace.
To minimize risks and overhead, the business's validation strategy for an app idea has three steps:
Our analysis will follow a similar structure: first validating Google Play app category performance, and then finding where there is opportunity within the Apple App Store.
Our Google Play dataset consists of 13 columns and our Apple App Store dataset consists of 16 columns of data, not all of which may be useful in this analysis.
Let's print out our columns and identify key columns.
print('Google Play: ')
print(google_headers)
print('\n')
print('Apple: ')
print(apple_headers)
Google Play: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] Apple: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
The columns that will be useful in further analysis are:
Google Play: Category, Genres, Installs
Apple App Store: prime_genre, rating_count_tot
The Google Play dataset contains our primary KPI, the count of app installs, but our Apple App Store data does not. As a proxy for installs, we can use the rating_count_tot column, which tells us how many user ratings an app has received. It is reasonable to assume that app installs and rating counts are correlated.
The number of apps currently in a category is one indicator that the category is lucrative for other app developers and may be open to new entrants.
Let's create frequency tables of our Google Play dataset.
# function that takes in a dataset and creates a normalized frequency table of the given index
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    # generates a frequency table
    for row in dataset:
        target = row[index]
        total += 1
        if target in frequency_table:
            frequency_table[target] += 1
        else:
            frequency_table[target] = 1
    # normalizes the frequency table into percentages
    for value in frequency_table:
        percentage = round((frequency_table[value] / total) * 100, 2)
        frequency_table[value] = percentage
    return frequency_table
To make our frequency table more readable, we will sort it and print it in descending order. Let's create a function that will do that next.
def display_table(dataset, index):  # dataset is a list of lists and index is an integer
    table = freq_table(dataset, index)  # use freq_table function
    table_display = []
    # transforms the frequency table into a list of (value, key) tuples so it sorts by value
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    # sorts the complete list in reverse (descending) order
    table_sorted = sorted(table_display, reverse=True)
    # prints the sorted entries back in key : value form
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
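For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative on toy data; `freq_table_counter` and `sample` are hypothetical names.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Hypothetical alternative to freq_table(): percentages per distinct value of a column."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

# toy rows: three TOOLS apps and one GAME app
sample = [['a', 'TOOLS'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'TOOLS']]
table = freq_table_counter(sample, 1)
# sort by percentage, largest first, then print in key : value form
for key, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(key, ':', pct)
```

Sorting with a `key` function avoids building the intermediate list of reversed tuples, at the cost of being a little less explicit about each step.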
Let's now use our frequency table on the Genres column of the Google Play dataset.
display_table(googleplay_free, 9) #Genre
Tools : 8.45 Entertainment : 6.07 Education : 5.35 Business : 4.59 Productivity : 3.89 Lifestyle : 3.89 Finance : 3.7 Medical : 3.53 Sports : 3.46 Personalization : 3.32 Communication : 3.24 Action : 3.1 Health & Fitness : 3.08 Photography : 2.94 News & Magazines : 2.8 Social : 2.66 Travel & Local : 2.32 Shopping : 2.25 Books & Reference : 2.14 Simulation : 2.04 Dating : 1.86 Arcade : 1.85 Video Players & Editors : 1.77 Casual : 1.76 Maps & Navigation : 1.4 Food & Drink : 1.24 Puzzle : 1.13 Racing : 0.99 Role Playing : 0.94 Libraries & Demo : 0.94 Auto & Vehicles : 0.93 Strategy : 0.9 House & Home : 0.82 Weather : 0.8 Events : 0.71 Adventure : 0.68 Comics : 0.61 Beauty : 0.6 Art & Design : 0.6 Parenting : 0.5 Card : 0.45 Casino : 0.43 Trivia : 0.42 Educational;Education : 0.39 Board : 0.38 Educational : 0.37 Education;Education : 0.34 Word : 0.26 Casual;Pretend Play : 0.24 Music : 0.2 Racing;Action & Adventure : 0.17 Puzzle;Brain Games : 0.17 Entertainment;Music & Video : 0.17 Casual;Brain Games : 0.14 Casual;Action & Adventure : 0.14 Arcade;Action & Adventure : 0.12 Action;Action & Adventure : 0.1 Educational;Pretend Play : 0.09 Simulation;Action & Adventure : 0.08 Parenting;Education : 0.08 Entertainment;Brain Games : 0.08 Board;Brain Games : 0.08 Parenting;Music & Video : 0.07 Educational;Brain Games : 0.07 Casual;Creativity : 0.07 Art & Design;Creativity : 0.07 Education;Pretend Play : 0.06 Role Playing;Pretend Play : 0.05 Education;Creativity : 0.05 Role Playing;Action & Adventure : 0.03 Puzzle;Action & Adventure : 0.03 Entertainment;Creativity : 0.03 Entertainment;Action & Adventure : 0.03 Educational;Creativity : 0.03 Educational;Action & Adventure : 0.03 Education;Music & Video : 0.03 Education;Brain Games : 0.03 Education;Action & Adventure : 0.03 Adventure;Action & Adventure : 0.03 Video Players & Editors;Music & Video : 0.02 Sports;Action & Adventure : 0.02 Simulation;Pretend Play : 0.02 Puzzle;Creativity : 0.02 Music;Music & Video : 0.02 
Entertainment;Pretend Play : 0.02 Casual;Education : 0.02 Board;Action & Adventure : 0.02 Video Players & Editors;Creativity : 0.01 Trivia;Education : 0.01 Travel & Local;Action & Adventure : 0.01 Tools;Education : 0.01 Strategy;Education : 0.01 Strategy;Creativity : 0.01 Strategy;Action & Adventure : 0.01 Simulation;Education : 0.01 Role Playing;Brain Games : 0.01 Racing;Pretend Play : 0.01 Puzzle;Education : 0.01 Parenting;Brain Games : 0.01 Music & Audio;Music & Video : 0.01 Lifestyle;Pretend Play : 0.01 Lifestyle;Education : 0.01 Health & Fitness;Education : 0.01 Health & Fitness;Action & Adventure : 0.01 Entertainment;Education : 0.01 Communication;Creativity : 0.01 Comics;Creativity : 0.01 Casual;Music & Video : 0.01 Card;Action & Adventure : 0.01 Books & Reference;Education : 0.01 Art & Design;Pretend Play : 0.01 Art & Design;Action & Adventure : 0.01 Arcade;Pretend Play : 0.01 Adventure;Education : 0.01
display_table(googleplay_free, 1) #Category
FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6
By counting the apps within Google Play Categories and Genres we see:
Whichever genre we choose, it must also have a chance of success in the Apple App Store market. Let's compare our findings here with the distribution of apps across the prime_genre column in the Apple App Store data.
display_table(apple_free, 11) #prime_genre
Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12
From this frequency table we see:
Usage behavior seems to vary by platform. Generally, apps that provide utility are more common on the Google Play marketplace, while most Apple apps provide some sort of entertainment. This may be a result of the customizability that each operating system allows its developers to exercise.
Considering these tables together:
To get an idea of which genres of apps have the most users, we can take the Installs column (index 5) from our Google Play data and the rating_count_tot column (index 5), which is our closest proxy, from the Apple dataset and average each across our genres.
To accomplish this we will sum installs or total ratings for each genre and divide by the number of apps in that genre.
For Google Play, we have install data for our apps, but we don't have the exact numbers. The apps fall into buckets with fairly large ranges. For example, an app in the 1,000,000+ bucket could have 1,000,000 installs or 3,000,000 installs. Therefore, we will need to make an assumption and take each app's install bucket at face value, so a 1,000,000+ app will be assumed to have exactly 1,000,000 installs.
# use the display_table function selecting installs
display_table(googleplay_free, 5)
1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
The percentages aren't important here, but the bins are strings, which makes them hard to sort and to use in calculations. Let's convert them to numbers so we can better sort, analyze and interpret our installation bins.
for app in googleplay_free:
    # str() lets this cell be re-run after the values have already been converted
    installs = str(app[5])
    installs = installs.replace('+', '')
    installs = installs.replace(',', '')
    app[5] = float(installs)
# use the display_table function selecting installs
display_table(googleplay_free, 5)
1000000.0 : 15.73
100000.0 : 11.55
10000000.0 : 10.55
10000.0 : 10.2
1000.0 : 8.39
100.0 : 6.92
5000000.0 : 6.83
500000.0 : 5.56
50000.0 : 4.77
5000.0 : 4.51
10.0 : 3.54
500.0 : 3.25
50000000.0 : 2.3
100000000.0 : 2.13
50.0 : 1.92
5.0 : 0.79
1.0 : 0.51
500000000.0 : 0.27
1000000000.0 : 0.23
0.0 : 0.05
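The lower-bound assumption can also be expressed as a small pure function that leaves the rows untouched. This is a minimal sketch; `parse_installs` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def parse_installs(bucket):
    """Hypothetical helper: convert an install bucket such as '1,000,000+' to its lower bound."""
    # strip the trailing plus sign and the thousands separators, then convert
    return float(bucket.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))
print(parse_installs('0+'))
```

Because the function returns a new value instead of overwriting the column, it could be called inside an aggregation without changing the dataset. Keep in mind that every figure derived this way is a floor, so the averages below understate true installs.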
We can now manipulate install values for our app categories. Let's create a few functions that will allow us to sum the installs for each category and look at averages.
# function that takes in a dataset and totals the values in one column, grouped by the values in another column
def sum_table(dataset, index, sum_col_index):
    totals = {}
    # generates the table of totals
    for row in dataset:
        target = row[index]
        value = float(row[sum_col_index])
        if target in totals:
            totals[target] += value
        else:
            totals[target] = value
    return totals
# function that prints the totals (or averages) from sum_table in descending order
def sorted_sum_table(dataset, index, sum_col_index, average=False):
    table = sum_table(dataset, index, sum_col_index)
    if average:
        # counts the rows in each group
        counts = {}
        for row in dataset:
            target = row[index]
            counts[target] = counts.get(target, 0) + 1
        # turns each total into an average
        for key in table:
            table[key] = table[key] / counts[key]
    # transforms the table into a list of (value, key) tuples so it sorts by value
    table_display = []
    for key in table:
        table_display.append((table[key], key))
    # sorts in descending order and prints in key : value form
    for entry in sorted(table_display, reverse=True):
        print(entry[1], ':', entry[0])
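To make the averaging step concrete, here is a small sketch on toy data that computes per-group averages in one pass with `collections.defaultdict`. The names `group_averages` and `sample` are hypothetical and the numbers are invented.

```python
from collections import defaultdict

def group_averages(dataset, index, value_index):
    """Hypothetical helper: average of column `value_index` for each value of column `index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[index]] += float(row[value_index])
        counts[row[index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# toy rows: name, category, installs
sample = [
    ['a', 'TOOLS', 1000.0],
    ['b', 'TOOLS', 3000.0],
    ['c', 'GAME', 500.0],
]
print(group_averages(sample, 1, 2))
```

Here TOOLS averages (1000 + 3000) / 2 = 2000 and GAME averages 500. A single very large app can dominate a category's average, which is worth remembering when reading the tables that follow.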
Let's look at the average installs for the category and genres of the Google Play dataset.
# category average installs
sorted_sum_table(googleplay_free, 1, 5, average=True)
COMMUNICATION : 38456119.167247385 VIDEO_PLAYERS : 24727872.452830188 SOCIAL : 23253652.127118643 PHOTOGRAPHY : 17840110.40229885 PRODUCTIVITY : 16787331.344927534 GAME : 15588015.603248259 TRAVEL_AND_LOCAL : 13984077.710144928 ENTERTAINMENT : 11640705.88235294 TOOLS : 10801391.298666667 NEWS_AND_MAGAZINES : 9549178.467741935 BOOKS_AND_REFERENCE : 8767811.894736841 SHOPPING : 7036877.311557789 PERSONALIZATION : 5201482.6122448975 WEATHER : 5074486.197183099 HEALTH_AND_FITNESS : 4188821.9853479853 MAPS_AND_NAVIGATION : 4056941.7741935486 FAMILY : 3697848.1731343283 SPORTS : 3638640.1428571427 ART_AND_DESIGN : 1986335.0877192982 FOOD_AND_DRINK : 1924897.7363636363 EDUCATION : 1833495.145631068 BUSINESS : 1712290.1474201474 LIFESTYLE : 1437816.2687861272 FINANCE : 1387692.475609756 HOUSE_AND_HOME : 1331540.5616438356 DATING : 854028.8303030303 COMICS : 817657.2727272727 AUTO_AND_VEHICLES : 647317.8170731707 LIBRARIES_AND_DEMO : 638503.734939759 PARENTING : 542603.6206896552 BEAUTY : 513151.88679245283 EVENTS : 253542.22222222222 MEDICAL : 120550.61980830671
# genre average installs Google Play Store
sorted_sum_table(googleplay_free, 9, 5, average=True)
Communication : 38456119.167247385
Adventure;Action & Adventure : 35333333.333333336
Video Players & Editors : 24947335.796178345
Social : 23253652.127118643
Arcade : 22888365.48780488
Casual : 19569221.602564104
Puzzle;Action & Adventure : 18366666.666666668
Photography : 17840110.40229885
Educational;Action & Adventure : 17016666.666666668
Productivity : 16787331.344927534
Racing : 15910645.681818182
Travel & Local : 14051476.145631067
Casual;Action & Adventure : 12916666.666666666
Action : 12603588.872727273
Strategy : 11339901.3125
Tools : 10802461.246995995
Tools;Education : 10000000.0
Role Playing;Brain Games : 10000000.0
Lifestyle;Pretend Play : 10000000.0
Casual;Music & Video : 10000000.0
Card;Action & Adventure : 10000000.0
Adventure;Education : 10000000.0
News & Magazines : 9549178.467741935
Music : 9445583.333333334
Educational;Pretend Play : 9375000.0
Puzzle;Brain Games : 9280666.666666666
Word : 9094458.695652174
Racing;Action & Adventure : 8816666.666666666
Books & Reference : 8767811.894736841
Puzzle : 8302861.91
Video Players & Editors;Music & Video : 7500000.0
Shopping : 7036877.311557789
Role Playing;Action & Adventure : 7000000.0
Casual;Pretend Play : 6957142.857142857
Entertainment;Music & Video : 6413333.333333333
Action;Action & Adventure : 5888888.888888889
Entertainment : 5602792.775092937
Education;Brain Games : 5333333.333333333
Casual;Creativity : 5333333.333333333
Role Playing;Pretend Play : 5275000.0
Personalization : 5201482.6122448975
Weather : 5074486.197183099
Sports;Action & Adventure : 5050000.0
Music;Music & Video : 5050000.0
Video Players & Editors;Creativity : 5000000.0
Adventure : 4922785.333333333
Simulation;Action & Adventure : 4857142.857142857
Education;Education : 4759517.0
Board : 4759209.117647059
Sports : 4596842.615635179
Educational;Brain Games : 4433333.333333333
Health & Fitness : 4188821.9853479853
Maps & Navigation : 4056941.7741935486
Entertainment;Creativity : 4000000.0
Role Playing : 3965645.421686747
Card : 3815462.5
Trivia : 3475712.7027027025
Simulation : 3475484.08839779
Casino : 3427910.5263157897
Entertainment;Brain Games : 3314285.714285714
Arcade;Action & Adventure : 3190909.1818181816
Entertainment;Pretend Play : 3000000.0
Board;Action & Adventure : 3000000.0
Education;Creativity : 2875000.0
Entertainment;Action & Adventure : 2333333.3333333335
Educational;Creativity : 2333333.3333333335
Art & Design : 2122850.9433962265
Education;Music & Video : 2033333.3333333333
Food & Drink : 1924897.7363636363
Education;Pretend Play : 1800000.0
Educational;Education : 1737143.142857143
Business : 1712290.1474201474
Casual;Brain Games : 1425916.6666666667
Lifestyle : 1412998.3449275363
Finance : 1387692.475609756
House & Home : 1331540.5616438356
Parenting;Music & Video : 1118333.3333333333
Strategy;Creativity : 1000000.0
Strategy;Action & Adventure : 1000000.0
Racing;Pretend Play : 1000000.0
Parenting;Brain Games : 1000000.0
Health & Fitness;Action & Adventure : 1000000.0
Entertainment;Education : 1000000.0
Education;Action & Adventure : 1000000.0
Casual;Education : 1000000.0
Arcade;Pretend Play : 1000000.0
Dating : 854028.8303030303
Comics : 831873.1481481482
Puzzle;Creativity : 750000.0
Auto & Vehicles : 647317.8170731707
Libraries & Demo : 638503.734939759
Education : 550185.4430379746
Simulation;Pretend Play : 550000.0
Beauty : 513151.88679245283
Strategy;Education : 500000.0
Music & Audio;Music & Video : 500000.0
Communication;Creativity : 500000.0
Art & Design;Pretend Play : 500000.0
Parenting : 467977.5
Parenting;Education : 452857.14285714284
Educational : 411184.8484848485
Board;Brain Games : 407142.85714285716
Art & Design;Creativity : 285000.0
Events : 253542.22222222222
Medical : 120550.61980830671
Travel & Local;Action & Adventure : 100000.0
Puzzle;Education : 100000.0
Lifestyle;Education : 100000.0
Health & Fitness;Education : 100000.0
Art & Design;Action & Adventure : 100000.0
Comics;Creativity : 50000.0
Books & Reference;Education : 1000.0
Simulation;Education : 500.0
Trivia;Education : 100.0
# genre average installs Apple App Store
sorted_sum_table(apple_free, 11, 5, average=True)
Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0
Social, Video Players & Editors, Photography, Entertainment, and Travel/Navigation/Weather apps received high average installs (or proxy installs) in both the Google Play Store and the Apple App Store.
Considering the average installs and the frequency distributions of apps together gives us a more complete picture. We want to develop an app that can attract a decent number of installs, in a category with neither too much nor too little competition. Gaming is an example of too much competition leading to market saturation: games are among the most frequently occurring apps in both marketplaces, yet their average installs are not very high, which suggests that many games attract very few players. At the other extreme, a category/genre can be dominated by a few very large apps, making market penetration difficult. Social apps are an example: they have a high average number of users per app but relatively few apps overall, so competing in this category will be similarly difficult.
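One rough way to check whether a category's average is inflated by a handful of very large apps is to compare the mean with the median. The sketch below is illustrative only: the helper name `mean_median_ratio` and the install counts are made up and do not come from either dataset.

```python
from statistics import mean, median

def mean_median_ratio(installs):
    """Return mean / median for a list of install counts (hypothetical helper).

    A ratio far above 1 suggests a few very large apps pull the average up.
    """
    med = median(installs)
    if med == 0:
        return float('inf')
    return mean(installs) / med

# Made-up install counts, for illustration only
dominated = [1_000, 5_000, 10_000, 50_000, 1_000_000_000]
balanced = [100_000, 500_000, 1_000_000, 1_000_000, 5_000_000]

print('dominated:', mean_median_ratio(dominated))
print('balanced:', mean_median_ratio(balanced))
```

A category whose ratio is in the thousands behaves like the "dominated" case described above, while a ratio close to 1 indicates installs are spread more evenly across apps.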
Categories/genres that sit in the middle to mid-high range for both average installs and number of apps are the most attractive. We see a few categories/genres that meet these criteria across marketplaces:
We leave these recommendations to the development team to determine which category to move forward with.
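The screening logic described above (middle to mid-high on both average installs and number of apps) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the method used in this analysis: the function names, the `low`/`high` cutoffs, and the toy numbers are all hypothetical.

```python
def percentile_ranks(values):
    """Map each key to its rank as a fraction in [0, 1] (0 = smallest value).

    Ties are broken by sort order, which is fine for a rough screen.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {key: i / (n - 1) for i, key in enumerate(ordered)}

def mid_range_genres(freq, avg_installs, low=0.4, high=0.9):
    """Return genres whose app count and average installs both rank in [low, high].

    The cutoffs are assumptions chosen for illustration.
    """
    freq_rank = percentile_ranks(freq)
    avg_rank = percentile_ranks(avg_installs)
    return sorted(
        genre for genre in freq
        if genre in avg_rank
        and low <= freq_rank[genre] <= high
        and low <= avg_rank[genre] <= high
    )

# Made-up numbers, for illustration only (not taken from either dataset)
toy_freq = {'A': 1000, 'B': 50, 'C': 300, 'D': 200, 'E': 10, 'F': 400}
toy_avg = {'A': 2_000_000, 'B': 40_000_000, 'C': 15_000_000,
           'D': 9_000_000, 'E': 100_000, 'F': 12_000_000}

print(mid_range_genres(toy_freq, toy_avg))
```

In the toy data, genre A (many apps, low average) resembles the saturated case and genre B (few apps, very high average) resembles the dominated case, so both are screened out.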