In this project we will analyze data to help our developers understand what type of apps are likely to attract more users. We only build apps that are free to download, and our main source of revenue consists of in-app ads. This study will help the company develop successful apps for App Store and Google Play.
Open the two data sets, and save both as lists of lists.
def dataset_to_list(dataset):
open_dataset = open(dataset)
from csv import reader
read_dataset = reader(open_dataset)
convert_to_list = list(read_dataset)
return convert_to_list
app_store = dataset_to_list('AppleStore.csv')
google_play = dataset_to_list('googleplaystore.csv')
Explore both data sets using the explore_data() function
def explore_data(dataset, start, end, rows_and_columns=False):
dataset_slice = dataset[start:end]
for row in dataset_slice:
print(row)
print('\n') # adds a new (empty) line after each row
if rows_and_columns:
print('Number of rows:', len(dataset)-1)
print('Number of columns:', len(dataset[0]))
print(explore_data(app_store, 0, 5, rows_and_columns=True))
print(explore_data(google_play, 0, 5, rows_and_columns=True))
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] Number of rows: 7197 Number of columns: 16 None ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] Number of rows: 10841 Number of columns: 13 None
Print the column names and try to identify the columns that could help us with our analysis.
app_store_headers = app_store[0]
google_play_headers = google_play[0]
print(app_store_headers)
print(google_play_headers)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Here we are going to delete a row that has wrong data.
print(google_play[0])
print(google_play[10472])
print(google_play[10473])
del google_play[10473]
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'] ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Check if dataset has duplicate entries
unique_rows = []
non_unique_rows = []
for row in google_play[1:]:
name = row[0]
if name in unique_rows:
non_unique_rows.append(name)
else:
unique_rows.append(name)
print('Number of duplicate rows: ', len(non_unique_rows))
Number of duplicate rows: 1181
The above cell shows that there are 1181 duplicate entries. We are going to remove all duplicates and keep only the row with the highest number of ratings. The reason being that the higher number of rating implies the entry is more recent. After we remove the duplicates, we will be left with the below number of entries:
print(len(google_play[1:]) - len(non_unique_rows))
9659
Here we are going to create an empty dictionary, where each key is a unique app name and the corresponding value is the highest number of reviews.
reviews_max = {}
for row in google_play[1:]:
name = row[0]
n_reviews = float(row[3])
if name in reviews_max and reviews_max[name] < n_reviews:
reviews_max[name] = n_reviews
if name not in reviews_max:
reviews_max[name] = n_reviews
print(len(reviews_max))
9659
Here we will use the reviews_max dictionary created above to remove duplicate rows. We will create two empty lists: android_clean and already_added. We will use a for loop to add each row from the google_play dataset into the android_clean list, provided it meets the below two requirements:
If both conditions are met, we also append the name to the already_added list.
android_clean = []
already_added = []
for row in google_play[1:]:
entry = row
name = row[0]
n_reviews = float(row[3])
if name not in already_added and n_reviews == reviews_max[name]:
android_clean.append(entry)
already_added.append(name)
print(len(android_clean))
9659
Define a function that returns True if all characters in a string belong to the set of common English characters (ASCII 0-127) and False otherwise.
def english_character(a_string):
for character in a_string:
if ord(character) > 127:
return False
return True
english_character('Instachat 😜')
False
This function labels as non-English some apps that are indeed English but contain some special characters (like emojis) which have a ASCII code greater than 127. We are going to rewrite the function to allow up to 3 characters that fall outside the ASCII range.
def english_character_v2(a_string):
non_english_count = 0
for character in a_string:
if ord(character) > 127:
non_english_count += 1
if non_english_count > 3:
return False
else:
return True
english_character_v2('》Docs To Go™ Free Office S电视剧热播uite')
False
Now we are going to use the english_character_v2 function to filter out non-English apps from both data sets.
def english_only_dataset(index, a_dataset):
a_string = []
for row in a_dataset:
entry = row
name = row[index]
if english_character_v2(name) == True:
a_string.append(entry)
return a_string
app_store_english_only = english_only_dataset(0, app_store[1:])
print(len(app_store_english_only))
print(app_store_english_only[:5])
google_play_english_only = english_only_dataset(0, android_clean)
print(len(google_play_english_only))
print(google_play_english_only[:5])
7197 [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']] 9614 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
So far in the data cleaning process, we have:
Our next step in the data cleaning process is to isolate only the free apps for our analysis. This will be our last step in the data cleaning process.
app_store_clean = []
for row in app_store_english_only:
price = float(row[4])
if price == 0.0:
app_store_clean.append(row)
print(len(app_store_clean))
print(app_store_clean[:5])
print('\n')
google_play_clean =[]
for row in google_play_english_only:
price = row[7]
if price == '0':
google_play_clean.append(row)
print(len(google_play_clean))
print(google_play_clean[:5])
### Here I am not creating a function to be used for both datasets as their column with prices has different data types ###
4056 [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']] 8864 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
Now that the data cleaning process is finished, we want to find out what app profiles are succcessful on both App Store and Google Play. To do that we will create some frequency tables (for example for the genre).
def freq_table(index, a_list):
a_dictionary = {}
for row in a_list:
genre = row[index]
if genre in a_dictionary:
a_dictionary[genre] += 1
else:
a_dictionary[genre] = 1
return a_dictionary
app_store_genre_frequency = freq_table(11, app_store_clean)
print('App Store genre frequency')
print(app_store_genre_frequency)
print('\n')
google_play_category_frequency = freq_table(1, google_play_clean)
print('Google Play category frequency')
print(google_play_category_frequency)
print('\n')
google_play_genre_frequency = freq_table(9, google_play_clean)
print('Google Play genre frequency')
print(google_play_genre_frequency)
App Store genre frequency {'Travel': 56, 'Reference': 20, 'Music': 67, 'Entertainment': 334, 'Education': 132, 'Productivity': 62, 'Catalogs': 9, 'Book': 66, 'Utilities': 109, 'Lifestyle': 94, 'Food & Drink': 43, 'Photo & Video': 167, 'Weather': 31, 'News': 58, 'Sports': 79, 'Social Networking': 143, 'Health & Fitness': 76, 'Medical': 8, 'Navigation': 20, 'Shopping': 121, 'Finance': 84, 'Business': 20, 'Games': 2257} Google Play category frequency {'EVENTS': 63, 'COMICS': 55, 'BUSINESS': 407, 'TRAVEL_AND_LOCAL': 207, 'PARENTING': 58, 'BEAUTY': 53, 'FINANCE': 328, 'FAMILY': 1676, 'NEWS_AND_MAGAZINES': 248, 'WEATHER': 71, 'EDUCATION': 103, 'FOOD_AND_DRINK': 110, 'TOOLS': 750, 'ENTERTAINMENT': 85, 'MAPS_AND_NAVIGATION': 124, 'SOCIAL': 236, 'VIDEO_PLAYERS': 159, 'AUTO_AND_VEHICLES': 82, 'LIFESTYLE': 346, 'BOOKS_AND_REFERENCE': 190, 'HEALTH_AND_FITNESS': 273, 'DATING': 165, 'PRODUCTIVITY': 345, 'SHOPPING': 199, 'SPORTS': 301, 'ART_AND_DESIGN': 57, 'PERSONALIZATION': 294, 'GAME': 862, 'PHOTOGRAPHY': 261, 'LIBRARIES_AND_DEMO': 83, 'COMMUNICATION': 287, 'HOUSE_AND_HOME': 73, 'MEDICAL': 313} Google Play genre frequency {'Educational;Brain Games': 6, 'Puzzle': 100, 'Education': 474, 'Casual': 156, 'News & Magazines': 248, 'Communication;Creativity': 1, 'Casual;Action & Adventure': 12, 'Puzzle;Brain Games': 15, 'Strategy': 81, 'Health & Fitness;Action & Adventure': 1, 'Beauty': 53, 'Role Playing': 83, 'Libraries & Demo': 83, 'Board;Action & Adventure': 2, 'Entertainment;Action & Adventure': 3, 'Education;Action & Adventure': 3, 'Word': 23, 'Art & Design;Action & Adventure': 1, 'Simulation;Education': 1, 'Puzzle;Creativity': 2, 'Health & Fitness': 273, 'Education;Pretend Play': 5, 'Card;Action & Adventure': 1, 'Business': 407, 'Music': 18, 'Art & Design': 53, 'Travel & Local;Action & Adventure': 1, 'Tools': 749, 'Racing;Pretend Play': 1, 'Parenting;Education': 7, 'Auto & Vehicles': 82, 'Productivity': 345, 'Travel & Local': 206, 'Casual;Pretend Play': 21, 'Strategy;Creativity': 1, 'Social': 236, 'Education;Brain Games': 3, 'Board;Brain Games': 7, 'Role Playing;Pretend Play': 4, 'Parenting;Brain Games': 1, 'Communication': 287, 'Weather': 71, 'Casual;Creativity': 6, 'Entertainment;Music & Video': 15, 'Maps & Navigation': 124, 'Simulation;Action & Adventure': 7, 'Role Playing;Action & Adventure': 3, 'Arcade;Pretend Play': 1, 'Card': 40, 'Adventure': 60, 'Personalization': 294, 'Simulation;Pretend Play': 2, 'Entertainment;Education': 1, 'Finance': 328, 'Casual;Brain Games': 12, 'Educational': 33, 'Arcade': 164, 'Entertainment;Creativity': 3, 'Parenting;Music & Video': 6, 'Lifestyle': 345, 'Strategy;Education': 1, 'Video Players & Editors;Music & Video': 2, 'Educational;Pretend Play': 8, 'Racing;Action & Adventure': 15, 'Comics;Creativity': 1, 'Entertainment;Brain Games': 7, 'Casual;Education': 2, 'Trivia': 37, 'Education;Creativity': 4, 'Food & Drink': 110, 'Music & Audio;Music & Video': 1, 'Art & Design;Creativity': 6, 'Trivia;Education': 1, 'Education;Music & Video': 3, 'Role Playing;Brain Games': 1, 'Video Players & Editors;Creativity': 1, 'Books & Reference': 190, 'Tools;Education': 1, 'Educational;Education': 35, 'Medical': 313, 'Action;Action & Adventure': 9, 'House & Home': 73, 'Entertainment;Pretend Play': 2, 'Puzzle;Education': 1, 'Events': 63, 'Lifestyle;Education': 1, 'Casino': 38, 'Parenting': 44, 'Board': 34, 'Casual;Music & Video': 1, 'Entertainment': 538, 'Health & Fitness;Education': 1, 'Educational;Creativity': 3, 'Puzzle;Action & Adventure': 3, 'Racing': 88, 'Comics': 54, 'Educational;Action & Adventure': 3, 'Dating': 165, 'Action': 275, 'Adventure;Action & Adventure': 3, 'Arcade;Action & Adventure': 11, 'Art & Design;Pretend Play': 1, 'Simulation': 181, 'Sports;Action & Adventure': 2, 'Lifestyle;Pretend Play': 1, 'Books & Reference;Education': 1, 'Sports': 307, 'Strategy;Action & Adventure': 1, 'Adventure;Education': 1, 'Photography': 261, 'Music;Music & Video': 2, 'Shopping': 199, 'Education;Education': 30, 'Video Players & Editors': 157}
Now we will build two functions to analyze the above frequency tables:
### Here we create a function to generate frequency tables that show percentages ###
def freq_table_percentages(dataset, index):
sum_freq_table_values = len(dataset[1:])
freq_table = {}
for row in dataset[1:]:
key = row[index]
if key in freq_table:
freq_table[key] += 1
else:
freq_table[key] = 1
freq_table_percent = {}
for key in freq_table:
freq_table_percent[key] = (freq_table[key] / sum_freq_table_values *100)
return freq_table_percent
app_store_genre_frequency_percent = freq_table_percentages(app_store_clean, 11)
print('% of apps per genre in App Store')
print(app_store_genre_frequency_percent)
print('\n')
google_play_category_frequency_percent = freq_table_percentages(google_play_clean, 1)
print('% of apps per category in Google Play')
print(google_play_category_frequency_percent)
print('\n')
google_play_genre_frequency_percent = freq_table_percentages(google_play_clean, 9)
print('% of apps per genre in Google Play')
print(google_play_genre_frequency_percent)
% of apps per genre in App Store {'Travel': 1.381011097410604, 'Reference': 0.4932182490752158, 'Entertainment': 8.236744759556105, 'Education': 3.255240443896424, 'Productivity': 1.528976572133169, 'Catalogs': 0.22194821208384713, 'Navigation': 0.4932182490752158, 'Utilities': 2.688039457459926, 'Lifestyle': 2.318125770653514, 'Music': 1.6522811344019728, 'Photo & Video': 4.1183723797780525, 'Weather': 0.7644882860665845, 'News': 1.4303329223181258, 'Games': 55.659679408138096, 'Sports': 1.9482120838471024, 'Social Networking': 3.501849568434032, 'Health & Fitness': 1.8742293464858202, 'Medical': 0.19728729963008632, 'Book': 1.627620221948212, 'Shopping': 2.9839704069050557, 'Finance': 2.0715166461159065, 'Business': 0.4932182490752158, 'Food & Drink': 1.060419235511714} % of apps per category in Google Play {'EVENTS': 0.7108202640189552, 'COMICS': 0.6205573733498815, 'BUSINESS': 4.592124562789123, 'PARENTING': 0.6544059573507841, 'BEAUTY': 0.5979916506826132, 'FINANCE': 3.7007785174320205, 'FAMILY': 18.910075595170937, 'NEWS_AND_MAGAZINES': 2.798149610741284, 'WEATHER': 0.8010831546880289, 'BOOKS_AND_REFERENCE': 2.1437436533904997, 'FOOD_AND_DRINK': 1.241114746699763, 'COMMUNICATION': 3.2381812027530184, 'ENTERTAINMENT': 0.9590432133589079, 'MAPS_AND_NAVIGATION': 1.399074805370642, 'VIDEO_PLAYERS': 1.7939749520478394, 'AUTO_AND_VEHICLES': 0.9251946293580051, 'LIFESTYLE': 3.9038700214374367, 'EDUCATION': 1.1621347173643235, 'HEALTH_AND_FITNESS': 3.0802211440821394, 'DATING': 1.8616721200496444, 'PRODUCTIVITY': 3.8925871601038025, 'HOUSE_AND_HOME': 0.8236488773552973, 'MEDICAL': 3.5315355974275078, 'PHOTOGRAPHY': 2.944826808078529, 'SPORTS': 3.396141261423897, 'ART_AND_DESIGN': 0.6318402346835158, 'PERSONALIZATION': 3.317161232088458, 'GAME': 9.725826469592688, 'TOOLS': 8.462146000225657, 'LIBRARIES_AND_DEMO': 0.9364774906916393, 'SOCIAL': 2.6627552747376737, 'SHOPPING': 2.245289405393208, 'TRAVEL_AND_LOCAL': 2.335552296062281} % of apps per genre in Google Play {'Educational;Brain Games': 0.06769716800180525, 'Puzzle': 1.128286133363421, 'Education': 5.348076272142616, 'Comics': 0.6092745120162473, 'Racing;Pretend Play': 0.011282861333634209, 'Communication;Creativity': 0.011282861333634209, 'Casual;Action & Adventure': 0.1353943360036105, 'Strategy;Education': 0.011282861333634209, 'Strategy': 0.9139117680243709, 'Beauty': 0.5979916506826132, 'Role Playing': 0.9364774906916393, 'Libraries & Demo': 0.9364774906916393, 'Board;Action & Adventure': 0.022565722667268417, 'Entertainment;Action & Adventure': 0.033848584000902626, 'Education;Action & Adventure': 0.033848584000902626, 'Word': 0.25950581067358686, 'Art & Design;Action & Adventure': 0.011282861333634209, 'Puzzle;Creativity': 0.022565722667268417, 'Health & Fitness': 3.0802211440821394, 'Tools;Education': 0.011282861333634209, 'Educational;Pretend Play': 0.09026289066907367, 'Parenting;Brain Games': 0.011282861333634209, 'Art & Design': 0.5867087893489789, 'Entertainment;Pretend Play': 0.022565722667268417, 'Tools': 8.450863138892023, 'Weather': 0.8010831546880289, 'Parenting;Education': 0.07898002933543948, 'Auto & Vehicles': 0.9251946293580051, 'Productivity': 3.8925871601038025, 'Travel & Local': 2.324269434728647, 'Casual;Pretend Play': 0.2369400880063184, 'Lifestyle;Pretend Play': 0.011282861333634209, 'Social': 2.6627552747376737, 'Education;Brain Games': 0.033848584000902626, 'Board;Brain Games': 0.07898002933543948, 'Role Playing;Pretend Play': 0.045131445334536835, 'Business': 4.592124562789123, 'Communication': 3.2381812027530184, 'Education;Pretend Play': 0.05641430666817105, 'Casual;Creativity': 0.06769716800180525, 'Entertainment;Music & Video': 0.16924292000451313, 'Books & Reference;Education': 0.011282861333634209, 'Simulation;Action & Adventure': 0.07898002933543948, 'Medical': 3.5315355974275078, 'Arcade;Pretend Play': 0.011282861333634209, 'Card': 0.4513144533453684, 'Adventure': 0.6769716800180525, 'Simulation;Pretend Play': 0.022565722667268417, 'Entertainment;Education': 0.011282861333634209, 'Finance': 3.7007785174320205, 'Sports': 3.463838429425702, 'Educational': 0.37233442400992894, 'Travel & Local;Action & Adventure': 0.011282861333634209, 'Arcade': 1.8503892587160102, 'Art & Design;Pretend Play': 0.011282861333634209, 'Entertainment;Creativity': 0.033848584000902626, 'Puzzle;Brain Games': 0.16924292000451313, 'Video Players & Editors;Music & Video': 0.022565722667268417, 'Card;Action & Adventure': 0.011282861333634209, 'Racing;Action & Adventure': 0.16924292000451313, 'Casual;Music & Video': 0.011282861333634209, 'Simulation;Education': 0.011282861333634209, 'Casual;Education': 0.022565722667268417, 'Trivia': 0.4174658693444658, 'Education;Creativity': 0.045131445334536835, 'Lifestyle': 3.8925871601038025, 'Food & Drink': 1.241114746699763, 'Music & Audio;Music & Video': 0.011282861333634209, 'Art & Design;Creativity': 0.06769716800180525, 'Trivia;Education': 0.011282861333634209, 'Education;Music & Video': 0.033848584000902626, 'Role Playing;Brain Games': 0.011282861333634209, 'Video Players & Editors;Creativity': 0.011282861333634209, 'Books & Reference': 2.1437436533904997, 'Health & Fitness;Action & Adventure': 0.011282861333634209, 'Educational;Education': 0.3949001466771973, 'Personalization': 3.317161232088458, 'Action;Action & Adventure': 0.10154575200270789, 'House & Home': 0.8236488773552973, 'Parenting;Music & Video': 0.06769716800180525, 'Puzzle;Education': 0.011282861333634209, 'Events': 0.7108202640189552, 'Lifestyle;Education': 0.011282861333634209, 'Casino': 0.42874873067809993, 'Parenting': 0.4964458986799052, 'Board': 0.38361728534356315, 'Comics;Creativity': 0.011282861333634209, 'Maps & Navigation': 1.399074805370642, 'Health & Fitness;Education': 0.011282861333634209, 'Educational;Creativity': 0.033848584000902626, 'News & Magazines': 2.798149610741284, 'Puzzle;Action & Adventure': 0.033848584000902626, 'Racing': 0.9928917973598104, 'Casual': 1.7601263680469368, 'Educational;Action & Adventure': 0.033848584000902626, 'Dating': 1.8616721200496444, 'Action': 3.102786866749408, 'Adventure;Action & Adventure': 0.033848584000902626, 'Arcade;Action & Adventure': 0.1241114746699763, 'Music': 0.20309150400541578, 'Simulation': 2.042197901387792, 'Sports;Action & Adventure': 0.022565722667268417, 'Strategy;Creativity': 0.011282861333634209, 'Role Playing;Action & Adventure': 0.033848584000902626, 'Entertainment': 6.070179397495204, 'Casual;Brain Games': 0.1353943360036105, 'Strategy;Action & Adventure': 0.011282861333634209, 'Adventure;Education': 0.011282861333634209, 'Photography': 2.944826808078529, 'Music;Music & Video': 0.022565722667268417, 'Shopping': 2.245289405393208, 'Education;Education': 0.33848584000902626, 'Entertainment;Brain Games': 0.07898002933543948, 'Video Players & Editors': 1.771409229380571}
### Here we build a function to display the most common app genres in descending order ###
def display_table(dataset, index):
table = freq_table_percentages(dataset, index)
table_display = []
for key in table:
key_val_as_tuple = (table[key], key)
table_display.append(key_val_as_tuple)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
print(entry[1], ':', entry[0])
print('% of apps per genre in App Store, sorted')
print(display_table(app_store_clean, 11))
print('\n')
print('% of apps per category in Google Play, sorted')
print(display_table(google_play_clean, 1))
print('\n')
print('% of apps per genre in Google Play, sorted')
print(display_table(google_play_clean, 9))
% of apps per genre in App Store, sorted Games : 55.659679408138096 Entertainment : 8.236744759556105 Photo & Video : 4.1183723797780525 Social Networking : 3.501849568434032 Education : 3.255240443896424 Shopping : 2.9839704069050557 Utilities : 2.688039457459926 Lifestyle : 2.318125770653514 Finance : 2.0715166461159065 Sports : 1.9482120838471024 Health & Fitness : 1.8742293464858202 Music : 1.6522811344019728 Book : 1.627620221948212 Productivity : 1.528976572133169 News : 1.4303329223181258 Travel : 1.381011097410604 Food & Drink : 1.060419235511714 Weather : 0.7644882860665845 Reference : 0.4932182490752158 Navigation : 0.4932182490752158 Business : 0.4932182490752158 Catalogs : 0.22194821208384713 Medical : 0.19728729963008632 None % of apps per category in Google Play, sorted FAMILY : 18.910075595170937 GAME : 9.725826469592688 TOOLS : 8.462146000225657 BUSINESS : 4.592124562789123 LIFESTYLE : 3.9038700214374367 PRODUCTIVITY : 3.8925871601038025 FINANCE : 3.7007785174320205 MEDICAL : 3.5315355974275078 SPORTS : 3.396141261423897 PERSONALIZATION : 3.317161232088458 COMMUNICATION : 3.2381812027530184 HEALTH_AND_FITNESS : 3.0802211440821394 PHOTOGRAPHY : 2.944826808078529 NEWS_AND_MAGAZINES : 2.798149610741284 SOCIAL : 2.6627552747376737 TRAVEL_AND_LOCAL : 2.335552296062281 SHOPPING : 2.245289405393208 BOOKS_AND_REFERENCE : 2.1437436533904997 DATING : 1.8616721200496444 VIDEO_PLAYERS : 1.7939749520478394 MAPS_AND_NAVIGATION : 1.399074805370642 FOOD_AND_DRINK : 1.241114746699763 EDUCATION : 1.1621347173643235 ENTERTAINMENT : 0.9590432133589079 LIBRARIES_AND_DEMO : 0.9364774906916393 AUTO_AND_VEHICLES : 0.9251946293580051 HOUSE_AND_HOME : 0.8236488773552973 WEATHER : 0.8010831546880289 EVENTS : 0.7108202640189552 PARENTING : 0.6544059573507841 ART_AND_DESIGN : 0.6318402346835158 COMICS : 0.6205573733498815 BEAUTY : 0.5979916506826132 None % of apps per genre in Google Play, sorted Tools : 8.450863138892023 Entertainment : 6.070179397495204 Education : 5.348076272142616 Business : 4.592124562789123 Productivity : 3.8925871601038025 Lifestyle : 3.8925871601038025 Finance : 3.7007785174320205 Medical : 3.5315355974275078 Sports : 3.463838429425702 Personalization : 3.317161232088458 Communication : 3.2381812027530184 Action : 3.102786866749408 Health & Fitness : 3.0802211440821394 Photography : 2.944826808078529 News & Magazines : 2.798149610741284 Social : 2.6627552747376737 Travel & Local : 2.324269434728647 Shopping : 2.245289405393208 Books & Reference : 2.1437436533904997 Simulation : 2.042197901387792 Dating : 1.8616721200496444 Arcade : 1.8503892587160102 Video Players & Editors : 1.771409229380571 Casual : 1.7601263680469368 Maps & Navigation : 1.399074805370642 Food & Drink : 1.241114746699763 Puzzle : 1.128286133363421 Racing : 0.9928917973598104 Role Playing : 0.9364774906916393 Libraries & Demo : 0.9364774906916393 Auto & Vehicles : 0.9251946293580051 Strategy : 0.9139117680243709 House & Home : 0.8236488773552973 Weather : 0.8010831546880289 Events : 0.7108202640189552 Adventure : 0.6769716800180525 Comics : 0.6092745120162473 Beauty : 0.5979916506826132 Art & Design : 0.5867087893489789 Parenting : 0.4964458986799052 Card : 0.4513144533453684 Casino : 0.42874873067809993 Trivia : 0.4174658693444658 Educational;Education : 0.3949001466771973 Board : 0.38361728534356315 Educational : 0.37233442400992894 Education;Education : 0.33848584000902626 Word : 0.25950581067358686 Casual;Pretend Play : 0.2369400880063184 Music : 0.20309150400541578 Racing;Action & Adventure : 0.16924292000451313 Puzzle;Brain Games : 0.16924292000451313 Entertainment;Music & Video : 0.16924292000451313 Casual;Brain Games : 0.1353943360036105 Casual;Action & Adventure : 0.1353943360036105 Arcade;Action & Adventure : 0.1241114746699763 Action;Action & Adventure : 0.10154575200270789 Educational;Pretend Play : 0.09026289066907367 Simulation;Action & Adventure : 0.07898002933543948 Parenting;Education : 0.07898002933543948 Entertainment;Brain Games : 0.07898002933543948 Board;Brain Games : 0.07898002933543948 Parenting;Music & Video : 0.06769716800180525 Educational;Brain Games : 0.06769716800180525 Casual;Creativity : 0.06769716800180525 Art & Design;Creativity : 0.06769716800180525 Education;Pretend Play : 0.05641430666817105 Role Playing;Pretend Play : 0.045131445334536835 Education;Creativity : 0.045131445334536835 Role Playing;Action & Adventure : 0.033848584000902626 Puzzle;Action & Adventure : 0.033848584000902626 Entertainment;Creativity : 0.033848584000902626 Entertainment;Action & Adventure : 0.033848584000902626 Educational;Creativity : 0.033848584000902626 Educational;Action & Adventure : 0.033848584000902626 Education;Music & Video : 0.033848584000902626 Education;Brain Games : 0.033848584000902626 Education;Action & Adventure : 0.033848584000902626 Adventure;Action & Adventure : 0.033848584000902626 Video Players & Editors;Music & Video : 0.022565722667268417 Sports;Action & Adventure : 0.022565722667268417 Simulation;Pretend Play : 0.022565722667268417 Puzzle;Creativity : 0.022565722667268417 Music;Music & Video : 0.022565722667268417 Entertainment;Pretend Play : 0.022565722667268417 Casual;Education : 0.022565722667268417 Board;Action & Adventure : 0.022565722667268417 Video Players & Editors;Creativity : 0.011282861333634209 Trivia;Education : 0.011282861333634209 Travel & Local;Action & Adventure : 0.011282861333634209 Tools;Education : 0.011282861333634209 Strategy;Education : 0.011282861333634209 Strategy;Creativity : 0.011282861333634209 Strategy;Action & Adventure : 0.011282861333634209 Simulation;Education : 0.011282861333634209 Role Playing;Brain Games : 0.011282861333634209 Racing;Pretend Play : 0.011282861333634209 Puzzle;Education : 0.011282861333634209 Parenting;Brain Games : 0.011282861333634209 Music & Audio;Music & Video : 0.011282861333634209 Lifestyle;Pretend Play : 0.011282861333634209 Lifestyle;Education : 0.011282861333634209 Health & Fitness;Education : 0.011282861333634209 Health & Fitness;Action & Adventure : 0.011282861333634209 Entertainment;Education : 0.011282861333634209 Communication;Creativity : 0.011282861333634209 Comics;Creativity : 0.011282861333634209 Casual;Music & Video : 0.011282861333634209 Card;Action & Adventure : 0.011282861333634209 Books & Reference;Education : 0.011282861333634209 Art & Design;Pretend Play : 0.011282861333634209 Art & Design;Action & Adventure : 0.011282861333634209 Arcade;Pretend Play : 0.011282861333634209 Adventure;Education : 0.011282861333634209 None
The most common app genre in the App Store is "Games" (55.7%), followed by "Entertainment" (8.2%). There is a total of 23 app genres and the top five account for 75% of total apps. The top four genres are all leisure related: Games, Entertainment, Photo & Video, Social Networking. From the above frequency tables we can't say which apps have the most users but we can only see what genres have the most number of apps.
The most common category in Google Play is "Family" (18.9%), followed by "Game" (9.7%).
The most common genre in Google Play is "Tools" (8.5%), followed by "Entertainment" (6.1%). Four out of five of the top genres in Google Play are geared towards practical uses rather than leisure: Tools, Education, Business, Productivity. This is in contrast to what we see in the App store where top genres are all leisure related.
Now we want to know what the average number of users per app by genre is. To do that:
### Average # of user ratings per app by genre - App Store ###
print('Average # of user ratings per app by genre - App Store')
for genre in app_store_genre_frequency:
total = 0
len_genre = 0
for row in app_store_clean:
genre_app = row[11]
if genre_app == genre:
app_ratings = float(row[5])
total += app_ratings
len_genre += 1
avg_user_ratings = total / len_genre
print(genre,": ", avg_user_ratings)
print('\n')
### Average # of user reviews per app by category - Google Play ###
print('Average # of user reviews per app by category - Google Play')
for category in google_play_category_frequency:
total = 0
len_category = 0
for row in google_play_clean:
category_app = row[1]
if category_app == category:
app_reviews = float(row[3])
total += app_reviews
len_category += 1
avg_user_reviews = total / (len_category + 1)
print(category, ": ", avg_user_reviews)
print('\n')
### Average # of user reviews per app by genre - Google Play ###
print('Average # of user reviews per app by genre - Google Play')
for genre in google_play_genre_frequency:
total = 0
len_genre = 0
for row in google_play_clean:
genre_app = row[9]
if genre_app == genre:
app_reviews = float(row[3])
total += app_reviews
len_genre += 1
avg_user_reviews = total / (len_genre + 1)
print(genre, ": ", avg_user_reviews)
Average # of user ratings per app by genre - App Store Travel : 20216.01785714286 Reference : 67447.9 Music : 56482.02985074627 Entertainment : 10822.961077844311 Education : 6266.333333333333 Productivity : 19053.887096774193 Catalogs : 1779.5555555555557 Book : 8498.333333333334 Utilities : 14010.100917431193 Lifestyle : 8978.308510638299 Food & Drink : 20179.093023255813 Photo & Video : 27249.892215568863 Weather : 47220.93548387097 News : 15892.724137931034 Sports : 20128.974683544304 Social Networking : 53078.195804195806 Health & Fitness : 19952.315789473683 Medical : 459.75 Navigation : 25972.05 Shopping : 18746.677685950413 Finance : 13522.261904761905 Business : 6367.8 Games : 18924.68896765618 Average # of user reviews per app by category - Google Play EVENTS : 2515.90625 COMICS : 41825.16071428572 BUSINESS : 24180.316176470587 TRAVEL_AND_LOCAL : 128861.90384615384 PARENTING : 16101.101694915254 BEAUTY : 7337.777777777777 FINANCE : 38418.76899696048 FAMILY : 113075.53070960048 NEWS_AND_MAGAZINES : 92714.18473895582 WEATHER : 168872.29166666666 EDUCATION : 55751.817307692305 FOOD_AND_DRINK : 56960.963963963964 TOOLS : 305325.79627163784 ENTERTAINMENT : 298243.5 MAPS_AND_NAVIGATION : 141717.168 SOCIAL : 961755.7510548523 VIDEO_PLAYERS : 422691.64375 AUTO_AND_VEHICLES : 13969.915662650603 LIFESTYLE : 33824.06628242075 BOOKS_AND_REFERENCE : 87534.36125654451 HEALTH_AND_FITNESS : 77809.95255474452 DATING : 21821.024096385543 PRODUCTIVITY : 160170.2803468208 SHOPPING : 222767.91 SPORTS : 116551.40066225166 ART_AND_DESIGN : 24273.568965517243 PERSONALIZATION : 180508.34237288137 GAME : 682731.8122827347 PHOTOGRAPHY : 402539.0801526718 LIBRARIES_AND_DEMO : 10795.738095238095 COMMUNICATION : 992151.4895833334 HOUSE_AND_HOME : 26078.22972972973 MEDICAL : 3718.2738853503183 Average # of user reviews per app by genre - Google Play Educational;Brain Games : 17901.714285714286 Puzzle : 213527.28712871287 Education : 16177.254736842106 Casual : 832370.2993630574 News & Magazines : 92714.18473895582 Communication;Creativity : 1739.0 Casual;Action & Adventure : 870209.0 Puzzle;Brain Games : 147605.625 Strategy : 1236575.4512195121 Health & Fitness;Action & Adventure : 15530.5 Beauty : 7337.777777777777 Role Playing : 246289.4880952381 Libraries & Demo : 10795.738095238095 Board;Action & Adventure : 24855.333333333332 Entertainment;Action & Adventure : 34312.5 Education;Action & Adventure : 3882.25 Word : 218760.70833333334 Art & Design;Action & Adventure : 32.5 Simulation;Education : 8.0 Puzzle;Creativity : 25746.333333333332 Health & Fitness : 77809.95255474452 Education;Pretend Play : 20959.5 Card;Action & Adventure : 460285.5 Business : 24180.316176470587 Music : 205064.0 Art & Design : 25635.425925925927 Travel & Local;Action & Adventure : 445.0 Tools : 305276.44933333335 Racing;Pretend Play : 1100.0 Parenting;Education : 1921.125 Auto & Vehicles : 13969.915662650603 Productivity : 160170.2803468208 Travel & Local : 129480.12560386474 Casual;Pretend Play : 100632.77272727272 Strategy;Creativity : 64771.0 Social : 961755.7510548523 Education;Brain Games : 144443.25 Board;Brain Games : 1968.0 Role Playing;Pretend Play : 44638.4 Parenting;Brain Games : 1807.0 Communication : 992151.4895833334 Weather : 168872.29166666666 Casual;Creativity : 75864.71428571429 Entertainment;Music & Video : 74699.5625 Maps & Navigation : 141717.168 Simulation;Action & Adventure : 153957.875 Role Playing;Action & Adventure : 241731.25 Arcade;Pretend Play : 11835.5 Card : 162278.0243902439 Adventure : 297260.59016393445 Personalization : 180508.34237288137 Simulation;Pretend Play : 25109.666666666668 Entertainment;Education : 3660.0 Finance : 38418.76899696048 Casual;Brain Games : 9671.076923076924 Educational : 6865.588235294118 Arcade : 704533.896969697 Entertainment;Creativity : 107669.5 Parenting;Music & Video : 3746.8571428571427 Lifestyle : 33514.32369942196 Strategy;Education : 1031.0 Video Players & Editors;Music & Video : 52993.666666666664 Educational;Pretend Play : 190862.33333333334 Racing;Action & Adventure : 190495.1875 Comics;Creativity : 258.0 Entertainment;Brain Games : 69216.25 Casual;Education : 9211.666666666666 Trivia : 188835.8947368421 Education;Creativity : 5107.4 Food & Drink : 56960.963963963964 Music & Audio;Music & Video : 684.5 Art & Design;Creativity : 4866.714285714285 Trivia;Education : 4.0 Education;Music & Video : 11209.0 Role Playing;Brain Games : 75687.0 Video Players & Editors;Creativity : 79811.0 Books & Reference : 87534.36125654451 Tools;Education : 171168.0 Educational;Education : 11676.611111111111 Medical : 3718.2738853503183 Action;Action & Adventure : 116087.5 House & Home : 26078.22972972973 Entertainment;Pretend Play : 35770.0 Puzzle;Education : 417.0 Events : 2515.90625 Lifestyle;Education : 1573.0 Casino : 130870.61538461539 Parenting : 20105.644444444446 Board : 118080.0 Casual;Music & Video : 19010.5 Entertainment : 103197.43042671614 Health & Fitness;Education : 4928.0 Educational;Creativity : 11121.75 Puzzle;Action & Adventure : 400421.5 Racing : 591278.0898876404 Comics : 42576.236363636366 Educational;Action & Adventure : 169351.75 Dating : 21821.024096385543 Action : 543001.0724637681 Adventure;Action & Adventure : 1134951.75 Arcade;Action & Adventure : 92402.33333333333 Art & Design;Pretend Play : 487.0 Simulation : 142065.32967032967 Sports;Action & Adventure : 486676.0 Lifestyle;Pretend Play : 70497.5 Books & Reference;Education : 21.0 Sports : 212745.4025974026 Strategy;Action & Adventure : 9585.0 Adventure;Education : 144303.0 Photography : 402539.0801526718 Music;Music & Video : 16982.666666666668 Shopping : 222767.91 Education;Education : 226998.25806451612 Video Players & Editors : 426277.46202531643
The app genre with the highest average number of user ratings per app is "Weather". Weather apps only account for 0.76% of total in the App Store, or 31 apps. Such a high avg number of ratings combined with the relatively low number of apps in this genre may tell us that there is one or a few successful Weather app with a large number of ratings, skewing the genre average up. Creating a weather app is not recommended as it is probably a winner takes most genre.
Our recommendation is to create an app in a genre that's popular having a large number of apps and a high average number of user ratings.
The most popular genres (in terms of # of apps) are: Games, Entertainment, Photo & Video, Social Networking. Based on the average number of ratings we will make our recommendation.
Genre | % of total apps | Avg user ratings |
---|---|---|
Games | 55.7% | 18925 |
Entertainment | 8.2% | 10823 |
Photo & Video | 4.1% | 27250 |
Social Networking | 3.5% | 53078 |
While Social Networking and Photo & Video tend to be genres with a handful of very successful apps, we believe that in the Games genres there is more room for new entrants to be successful. This is confirmed by the fact that the Games genre has a relatively high average number of ratings despite having by far the highest number of apps.
Given the above reasoning, and aware that a more in depth analysis would help to obtain a more detailed recommendation, we suggest the Games genre.
The Google Play dataset contains a column with number of installs. Unfortunately, the data is stored as a string, so before we can calculate the average we need to convert the data into float. We can remove the commas and + sign from the "Installs" column by using the str.replace(old, new) method.
### Here we are looking at the average number of app installs by category in Google Play ###
for category in google_play_category_frequency:
total = 0
len_category = 0
for row in google_play_clean:
category_app = row[1]
installs = row[5]
if category_app == category:
installs = installs.replace('+', '')
installs = installs.replace(',', '')
total += float(installs)
len_category += 1
avg_installs_category = total / len_category
print(category,': ', avg_installs_category)
EVENTS : 253542.22222222222 COMICS : 817657.2727272727 BUSINESS : 1712290.1474201474 TRAVEL_AND_LOCAL : 13984077.710144928 PARENTING : 542603.6206896552 BEAUTY : 513151.88679245283 FINANCE : 1387692.475609756 FAMILY : 3695641.8198090694 NEWS_AND_MAGAZINES : 9549178.467741935 WEATHER : 5074486.197183099 EDUCATION : 1833495.145631068 FOOD_AND_DRINK : 1924897.7363636363 TOOLS : 10801391.298666667 ENTERTAINMENT : 11640705.88235294 MAPS_AND_NAVIGATION : 4056941.7741935486 SOCIAL : 23253652.127118643 VIDEO_PLAYERS : 24727872.452830188 AUTO_AND_VEHICLES : 647317.8170731707 LIFESTYLE : 1437816.2687861272 BOOKS_AND_REFERENCE : 8767811.894736841 HEALTH_AND_FITNESS : 4188821.9853479853 DATING : 854028.8303030303 PRODUCTIVITY : 16787331.344927534 SHOPPING : 7036877.311557789 SPORTS : 3638640.1428571427 ART_AND_DESIGN : 1986335.0877192982 PERSONALIZATION : 5201482.6122448975 GAME : 15588015.603248259 PHOTOGRAPHY : 17840110.40229885 LIBRARIES_AND_DEMO : 638503.734939759 COMMUNICATION : 38456119.167247385 HOUSE_AND_HOME : 1331540.5616438356 MEDICAL : 120550.61980830671
The app category with the highest average number of installs is Communication with 38.5M downloads, followed by Video_players (24.7M) and Social (23.2M). These are also the categories with the highest average number of user reviews. In terms of % of the total apps in Google Play, these categories are not the ones with the highest number of apps: Communication has 3.2%, Social has 2.7% and Video_players has 1.8%. Combining this information with the fact that these categories have the highest number of average downloads, make us deduct that there may bey a few very successful apps in these categories. The Family category has a pretty high average number of installs (3.7M) and it's the largest category in terms of number of apps. The Games category has a pretty high average number of installs (15.6M) and it makes up 9.7% of total apps (second largest category). For this reason we think that the Games category would be the best to develope a new app.