At this time, it is recommended that the company's app developers focus resources on creating app profiles that are unique, no-pay-to-play, extremely user-friendly vlogger/tuber simulator games and/or tools similar to 'PewDiePie's Tuber Simulator' or 'Vlogger Go Viral - Tuber Game'. Both app profiles can be viewed here:
Other notable app profiles for consideration include: Head Soccer, Eternium, My Talking Tom, Candy Crush Saga, Sniper 3D Assassin, and Geometry Dash.
Additionally, it is recommended that the company conduct a feasibility study to assess the practicality of, and the resources required for, developing mobile antivirus, cleaning, and boosting apps.
There are millions of apps available to users from the Apple App Store and Google Play. Our company develops apps for the Apple and Google markets that are typically directed towards an English-speaking audience and are free for users to download. The apps generate revenue for the company by means of in-app ads; therefore, the more users that install and use our apps, the more likely it is that users will see and/or engage with the ads that generate the company's revenue. As we continue to develop apps for these markets, it is beneficial for the company to be aware of the free apps currently on the market that attract the most users, in order to aid our developers when creating new app profiles.
This project will include the collection, cleaning, and analysis of two datasets regarding mobile apps available on the Apple App Store and Google Play. In an effort to focus time and resources, only sample datasets available through a third party will be analyzed. The samples consist of data for approximately 10,000 Android and 7,000 iOS apps collected in August 2018 and July 2017 respectively. The data points in the samples include information related to the app's name, price, genre, user rating, rating count total, size, install count, etc.
The goal of this project is to develop short-lists of apps with high user engagement from the Apple and Google sample datasets. The app short-lists will assist the company's developers in creating similarly engaging apps for both the Apple and Google markets that will encourage user/ad interaction with the intent to increase company revenue.
The data to be analyzed in this project consist of two third-party sample datasets. The first dataset contains data collected from approximately 10,000 Android apps and can be accessed here:
The second dataset contains data for approximately 7,000 iOS apps and can be accessed here:
The column header names in both the Google and Apple dataset csv files were modified prior to importing so that the column names were consistent and in a similar order. The Google dataset has two additional column headers, installs and sub_genre, as shown below.
# google dataset open
from csv import reader
# raw string so the backslashes in the Windows path are not treated as escape sequences
opened_file = open(r'C:\Shaun\Businesses\Mr Data\DataQuest\Guided Projects\Profitable App Profiles\Datasets\googleplaystore mod.1.csv', encoding = 'utf-8')
read_file = reader(opened_file)
google = list(read_file)
google_header = google[0]
google_data = google[1:]
print(google_header)
['app_name', 'price', 'genre', 'rating_count_total', 'user_rating', 'content_rating', 'size_bytes', 'installs', 'sub_genre']
# apple dataset open
from csv import reader
# raw string so the backslashes in the Windows path are not treated as escape sequences
opened_file = open(r'C:\Shaun\Businesses\Mr Data\DataQuest\Guided Projects\Profitable App Profiles\Datasets\AppleStore mod.1.csv', encoding = 'utf-8')
read_file = reader(opened_file)
apple = list(read_file)
apple_header = apple[0]
apple_data = apple[1:]
print(apple_header)
['app_name', 'price', 'genre', 'rating_count_total', 'user_rating', 'content_rating', 'size_bytes']
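The two loading cells above repeat the same steps. As a minimal sketch only, the steps could be wrapped in one helper; the name load_csv is hypothetical (it is not part of the original notebook), and an in-memory sample stands in for the real dataset files.

```python
import csv
import io


def load_csv(file_obj):
    """Return (header, rows) from an open CSV file object.

    load_csv is a hypothetical helper name used for illustration only.
    """
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory stand-in for a dataset file (illustrative values only).
sample = io.StringIO(
    "app_name,price,genre\n"
    "Facebook,0,Social Networking\n"
)
header, data = load_csv(sample)
print(header)
print(len(data))
```

With a real file, opening it in a with block (for example, with open(path, encoding='utf-8', newline='') as f) would also close the file automatically once the rows are read.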
A helper function, explore_data, was added to assist the analysis by printing slices of the dataset rows in a more readable way. Example output from the explore_data function is printed for both datasets to demonstrate its use.
def explore_data(dataset, start, end, rows_and_columns = True):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print('An example of the explore_data output for google_data.')
print(' ')
explore_data(google_data, 0, 1)
print('\n')
print('An example of the explore_data output for apple_data.')
print(' ')
explore_data(apple_data, 0, 1)
An example of the explore_data output for google_data.

['Photo Editor & Candy Camera & Grid & ScrapBook', '0.00', 'ART_AND_DESIGN', '159', '4.1', 'Everyone', '19M', '10,000+', 'Adventure;Action & Adventure']
Number of rows: 10841
Number of columns: 9

An example of the explore_data output for apple_data.

['Facebook', '0', 'Social Networking', '2974676', '3.5', '4+', '389879808']
Number of rows: 7197
Number of columns: 7
Since the company typically produces free apps for English-speaking users, the data cleaning scope will include:
Extraction of free apps, where price equals 0.00,
Removal of apps with no user rating data, keeping only apps where rating_count_total is greater than zero,
Removal of duplicate apps based on the app_name column,
Removal of apps with non-English characters,
Removal of any incorrect or inaccurate data if detected.
This section of the report describes the data cleaning process for the google_data list only and is more descriptive of the cleaning process. The apple_data list cleaning process includes the same steps and is shown in section 4.4.
While first attempting to extract free apps with user rating data into a new list called google_data_mod_1, a ValueError: could not convert string to float: 'Everyone' was raised by the statement price = float(row[1]). The following code was run to locate the row causing the error:
for row in google_data:
    price = row[1]
    if price == 'Everyone':
        print(row)
['Life Made WI-Fi Touchscreen Photo Frame', 'Everyone', '1.9', '3.0M', '19', '', '1,000+', 'Free', '']
It was found that the app named 'Life Made WI-Fi Touchscreen Photo Frame' had errors in every column except the app_name column. In order to remove the row containing 'Life Made WI-Fi Touchscreen Photo Frame' from the dataset, the row's index needed to be known. A list comprehension with the built-in enumerate function was used to identify the correct index number for removal.
error_index = [index for (index, item) in enumerate(google_data) if item == ['Life Made WI-Fi Touchscreen Photo Frame', 'Everyone', '1.9', '3.0M', '19', '', '1,000+', 'Free', '']]
print(error_index)
[10472]
The index number of the app 'Life Made WI-Fi Touchscreen Photo Frame' is 10472. The row was removed from the dataset in the following cell, after which the ValueError raised while extracting free apps no longer occurred. The explore_data function was run before and after the removal to demonstrate that the correct app was removed and that the Number of rows count decreased by one, from 10841 to 10840.
explore_data(google_data, 10472, 10473 )
del google_data[10472]
print('\n')
explore_data(google_data, 10472, 10473 )
['Life Made WI-Fi Touchscreen Photo Frame', 'Everyone', '1.9', '3.0M', '19', '', '1,000+', 'Free', '']
Number of rows: 10841
Number of columns: 9

['osmino Wi-Fi: free WiFi', '0.00', 'TOOLS', '134203', '4.2', 'Everyone', '4.1M', '10,000,000+', '']
Number of rows: 10840
Number of columns: 9
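The search above targeted one known bad value. As a hedged sketch of a more general check, the rows whose price cannot be converted to a float could be collected in one pass; find_bad_price_rows is a hypothetical helper, and the sample rows (apart from the malformed app name taken from the report) are made up for illustration.

```python
def find_bad_price_rows(dataset, price_index=1):
    """Return (index, row) pairs whose price column cannot be parsed as a float.

    find_bad_price_rows is a hypothetical helper, not from the original notebook.
    """
    bad = []
    for index, row in enumerate(dataset):
        try:
            float(row[price_index])
        except ValueError:
            bad.append((index, row))
    return bad


# Illustrative sample; only the second row mirrors the malformed record in the report.
sample_rows = [
    ['Example App A', '0.00', 'TOOLS'],
    ['Life Made WI-Fi Touchscreen Photo Frame', 'Everyone', '1.9'],
    ['Example App B', '4.99', 'GAME'],
]
bad_rows = find_bad_price_rows(sample_rows)
print(bad_rows)
```

If several rows were flagged, deleting them from the highest index to the lowest would keep the remaining indices valid.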
As part of the data cleaning scope, a new list, google_data_mod_1, was created to extract all free apps with user rating data, where price equals 0.00 and rating_count_total is greater than zero. The explore_data function was called to show the updated row count.
google_data_mod_1 = []
for row in google_data:
    app_name = row[0]
    price = float(row[1])
    genre = row[2]
    rating_count_total = int(row[3])
    user_rating = row[4]
    content_rating = row[5]
    size_bytes = row[6]
    installs = row[7]
    sub_genre = row[8]
    if price == 0 and rating_count_total != 0:
        google_data_mod_1.append([app_name, price, genre, rating_count_total, user_rating, content_rating, size_bytes, installs, sub_genre])
explore_data(google_data_mod_1, 0, 1)
['Photo Editor & Candy Camera & Grid & ScrapBook', 0.0, 'ART_AND_DESIGN', 159, '4.1', 'Everyone', '19M', '10,000+', 'Adventure;Action & Adventure']
Number of rows: 9520
Number of columns: 9
Continuing with the data cleaning scope, all duplicate apps in the new list google_data_mod_1 needed to be removed based on the app_name column. A new list, duplicate_apps_google, was created to determine how many duplicate apps are in the google_data_mod_1 dataset and to provide examples of duplicate apps in order to establish a criterion by which to remove specific duplicates.
# duplicate and unique app list creation.
duplicate_apps_google = []
unique_apps_google = []
for row in google_data_mod_1:
    app_name = row[0]
    if app_name in unique_apps_google:
        duplicate_apps_google.append(app_name)
    else:
        unique_apps_google.append(app_name)
print(len(duplicate_apps_google))
1132
According to the duplicate_apps_google list, there are 1132 duplicate apps based on the app_name column.
# duplicate app examples.
print(duplicate_apps_google[:15])
print('\n')
def duplicates(dataset, name):
    for row in dataset:
        app_name = row[0]
        if app_name == name:
            print(row)
duplicates(google_data_mod_1, 'Slack')
print('\n')
duplicates(google_data_mod_1, 'Google Ads')
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

['Slack', 0.0, 'BUSINESS', 51507, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Casual']
['Slack', 0.0, 'BUSINESS', 51507, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Action']
['Slack', 0.0, 'BUSINESS', 51510, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Action']

['Google Ads', 0.0, 'BUSINESS', 29313, '4.3', 'Everyone', '20M', '5,000,000+', 'Action']
['Google Ads', 0.0, 'BUSINESS', 29313, '4.3', 'Everyone', '20M', '5,000,000+', 'Action']
['Google Ads', 0.0, 'BUSINESS', 29331, '4.3', 'Everyone', '20M', '5,000,000+', '']
It appears that the differences between duplicate apps include the value of rating_count_total and the designated sub_genre. Using the 'Slack' rating_count_total as an example, the first entry has a rating_count_total of 51507 and the third entry has a rating_count_total of 51510.
The criterion moving forward with the data cleaning scope is as follows: for each duplicated app, the record with the highest rating_count_total will remain on the final cleaned list, as it is most likely the record with the most recent data; all other duplicates will be removed; sub_genre is not taken into consideration at this time.
# Expected number of remaining apps after duplication removal.
print(len(google_data_mod_1) - len(duplicate_apps_google))
8388
After removing all the duplicate apps there should be 8388 unique apps remaining. The following cell creates a dictionary called google_data_mod_2 where each dictionary key is a unique app name and the corresponding dictionary value is the highest rating_count_total, as per the duplicate removal criterion. The purpose of the dictionary is to store only unique apps with the highest rating_count_total, separate from the duplicates.
# Unique app name dictionary creation.
google_data_mod_2 = {}
for row in google_data_mod_1:
    app_name = row[0]
    rating_count_total = row[3]
    if app_name in google_data_mod_2 and google_data_mod_2[app_name] < rating_count_total:
        google_data_mod_2[app_name] = rating_count_total
    elif app_name not in google_data_mod_2:
        google_data_mod_2[app_name] = rating_count_total
#print(google_data_mod_2)
print('Expected length:', (len(google_data_mod_1)) - (len(duplicate_apps_google)))
print(' ')
print('Actual length:', len(google_data_mod_2))
Expected length: 8388

Actual length: 8388
As expected, after removing the 1132 duplicate apps, 8388 unique apps remain. Since google_data_mod_2 is only a dictionary containing key-value pairs of app_name and rating_count_total, a new list is required which includes only apps with unique names and their accompanying data points (price, genre, user_rating, content_rating, size_bytes, installs). Below, a new list called google_data_no_duplicates is created utilizing the google_data_mod_2 dictionary and the google_data_mod_1 list; the list already_added keeps track of the apps already added to google_data_no_duplicates so that no duplicate apps migrate into the new list.
# Complete list of apps with all duplicates removed.
google_data_no_duplicates = []
already_added = []
for row in google_data_mod_1:
    app_name = row[0]
    rating_count_total = row[3]
    if google_data_mod_2[app_name] == rating_count_total and app_name not in already_added:
        google_data_no_duplicates.append(row)
        already_added.append(app_name)
explore_data(google_data_no_duplicates, 0, 1)
['Photo Editor & Candy Camera & Grid & ScrapBook', 0.0, 'ART_AND_DESIGN', 159, '4.1', 'Everyone', '19M', '10,000+', 'Adventure;Action & Adventure']
Number of rows: 8388
Number of columns: 9
The first row of the google_data_no_duplicates list was printed using the explore_data function to demonstrate that the google_data_mod_1 list and the google_data_mod_2 dictionary were integrated properly, with the expected number of rows (8388) remaining.
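The two-step approach above (a dictionary of highest counts, then a second pass with a tracking list) could also be collapsed into a single pass. The following is a sketch under stated assumptions, not the notebook's method: keep_highest_rating_count is a hypothetical name, and it relies on dictionaries preserving insertion order (Python 3.7+). The sample rows are the 'Slack' and 'Google Ads' records printed earlier.

```python
def keep_highest_rating_count(dataset, name_index=0, count_index=3):
    """Keep, for each app name, the first row with the highest rating count.

    Hypothetical single-pass alternative to the dictionary + already_added approach.
    """
    best = {}
    for row in dataset:
        name = row[name_index]
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or row[count_index] > best[name][count_index]:
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Slack', 0.0, 'BUSINESS', 51507, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Casual'],
    ['Slack', 0.0, 'BUSINESS', 51507, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Action'],
    ['Slack', 0.0, 'BUSINESS', 51510, '4.4', 'Everyone', 'Varies with device', '5,000,000+', 'Action'],
    ['Google Ads', 0.0, 'BUSINESS', 29313, '4.3', 'Everyone', '20M', '5,000,000+', 'Action'],
    ['Google Ads', 0.0, 'BUSINESS', 29313, '4.3', 'Everyone', '20M', '5,000,000+', 'Action'],
    ['Google Ads', 0.0, 'BUSINESS', 29331, '4.3', 'Everyone', '20M', '5,000,000+', ''],
]
deduplicated = keep_highest_rating_count(sample_rows)
for row in deduplicated:
    print(row[0], row[3])
```

Because membership is checked against a dictionary rather than a list, this variant also avoids rescanning a growing list for every row.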
The dataset up to this point appears to include some apps that are intended for non-English-speaking users. As part of the analysis, only apps designed for English-speaking users will be considered; as such, non-English apps needed to be removed. According to the American Standard Code for Information Interchange (ASCII), the numbers associated with common English characters are in the range 0 to 127. Each character in a string has a corresponding number associated with it; for example, the string 'Pie' would be associated with the numbers 80 (P), 105 (i), and 101 (e).
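As a quick illustration of the character-to-number mapping described above, the built-in ord function returns the number for a single character:

```python
# Code points for each character in 'Pie'.
code_points = [ord(character) for character in 'Pie']
print(code_points)  # [80, 105, 101]

# A character outside the 0-127 range, taken from an app name in the dataset.
print(ord('漫') > 127)  # True
```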
A function called english was created with the built-in ord function to detect app names that contain characters outside the common range of English characters (0-127). If a non-English character is detected in a string, the english function returns False, as shown in the cell below.
def english(a_string):
    for index in a_string:
        if ord(index) > 127:
            return False
    # only reached when every character is within the 0-127 range
    return True
print(english('Slack'))
print(english('漫咖 Comics - Manga,Novel and Stories'))
True
False
However, some English app names contain special characters and emojis that fall outside of the ASCII range of 0-127 for common English characters. The english function needed to be modified so that it would not incorrectly label English apps as non-English apps and result in the loss of useful data. To minimize data loss, an app was only removed by the modified english function if the app name had more than three characters outside of the 0-127 ASCII range, as shown with the printed examples below.
def english(a_string):
    non_ascii = 0
    for index in a_string:
        if ord(index) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
print(english('DM for IG 😘 - Image & Video Saver for Instagram'))
print(english('Combat Strike CS 🔫 Counter Terrorist Attack FPS💣'))
print(english('뽕티비 - 개인방송, 인터넷방송, BJ방송'))
True
True
False
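The cutoff of three non-ASCII characters is a judgment call. As a sketch only, the same check could expose the cutoff as a parameter so that other thresholds can be tried; english_with_threshold and max_non_ascii are hypothetical names, not part of the original notebook.

```python
def english_with_threshold(a_string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii characters above code point 127.

    Hypothetical variant of the english function with a configurable cutoff.
    """
    non_ascii = sum(1 for character in a_string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(english_with_threshold('DM for IG 😘 - Image & Video Saver for Instagram'))
print(english_with_threshold('뽕티비 - 개인방송, 인터넷방송, BJ방송'))
# A cutoff of zero reproduces the stricter first version of the check.
print(english_with_threshold('DM for IG 😘 - Image & Video Saver for Instagram', max_non_ascii=0))
```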
The final step in the cleaning process is to remove the 'non-English' apps from the google_data_no_duplicates list and create a final list called google_data_final to use for analysis. Below, the modified english function is applied to filter out non-English apps while minimizing data loss.
google_data_final = []
for row in google_data_no_duplicates:
    app_name = row[0]
    if english(app_name):
        google_data_final.append(row)
explore_data(google_data_final, 0, 1)
['Photo Editor & Candy Camera & Grid & ScrapBook', 0.0, 'ART_AND_DESIGN', 159, '4.1', 'Everyone', '19M', '10,000+', 'Adventure;Action & Adventure']
Number of rows: 8348
Number of columns: 9
At the end of the cleaning process, the initial google_data dataset of 10841 apps has been cleaned and trimmed down to the google_data_final list of 8348 apps.
This section of the report describes the data cleaning process for the apple_data dataset only and is less descriptive of the cleaning process, as the steps are the same as in the google_data cleaning process.
There are no incorrect or inaccurate data to remove from the apple_data list.
apple_data_mod_1 = []
for row in apple_data:
    app_name = row[0]
    price = float(row[1])
    genre = row[2]
    rating_count_total = int(row[3])
    user_rating = row[4]
    content_rating = row[5]
    size_bytes = row[6]
    if price == 0 and rating_count_total != 0:
        apple_data_mod_1.append([app_name, price, genre, rating_count_total, user_rating, content_rating, size_bytes])
explore_data(apple_data_mod_1, 0, 1)
['Facebook', 0.0, 'Social Networking', 2974676, '3.5', '4+', '389879808']
Number of rows: 3383
Number of columns: 7
duplicate_apps_apple = []
unique_apps_apple = []
for row in apple_data_mod_1:
    app_name = row[0]
    if app_name in unique_apps_apple:
        duplicate_apps_apple.append(app_name)
    else:
        unique_apps_apple.append(app_name)
print(len(duplicate_apps_apple))
2
There are only two duplicate apps in the apple_data_mod_1 list.
print(duplicate_apps_apple[:15])
print('\n')
def duplicates(dataset, name):
    for row in dataset:
        app_name = row[0]
        if app_name == name:
            print(row)
duplicates(apple_data_mod_1, 'Mannequin Challenge')
print('\n')
duplicates(apple_data_mod_1, 'VR Roller Coaster')
['Mannequin Challenge', 'VR Roller Coaster']

['Mannequin Challenge', 0.0, 'Games', 668, '3', '9+', '109705216']
['Mannequin Challenge', 0.0, 'Games', 105, '4', '4+', '59572224']

['VR Roller Coaster', 0.0, 'Games', 107, '3.5', '4+', '169523200']
['VR Roller Coaster', 0.0, 'Games', 67, '3.5', '4+', '240964608']
The apple_data_mod_1 list indicates that the rating_count_total column is not the only difference between duplicate apps, as first stated for the google_data_mod_1 list in section 4.2.3. As shown, 'Mannequin Challenge' has discrepancies in the rating_count_total, user_rating, content_rating, and size_bytes columns. This is not considered an issue moving forward, as the duplicate app with the highest rating_count_total is most likely the app record with the most recent data.
print(len(apple_data_mod_1) - len(duplicate_apps_apple))
3381
apple_data_mod_2 = {}
for row in apple_data_mod_1:
    app_name = row[0]
    rating_count_total = row[3]
    if app_name in apple_data_mod_2 and apple_data_mod_2[app_name] < rating_count_total:
        apple_data_mod_2[app_name] = rating_count_total
    elif app_name not in apple_data_mod_2:
        apple_data_mod_2[app_name] = rating_count_total
print('Expected length:', (len(apple_data_mod_1)) - (len(duplicate_apps_apple)))
print(' ')
print('Actual length:', len(apple_data_mod_2))
Expected length: 3381

Actual length: 3381
apple_data_no_duplicates = []
already_added = []
for row in apple_data_mod_1:
    app_name = row[0]
    rating_count_total = row[3]
    if apple_data_mod_2[app_name] == rating_count_total and app_name not in already_added:
        apple_data_no_duplicates.append(row)
        already_added.append(app_name)
explore_data(apple_data_no_duplicates, 0, 1)
['Facebook', 0.0, 'Social Networking', 2974676, '3.5', '4+', '389879808']
Number of rows: 3381
Number of columns: 7
apple_data_final = []
for row in apple_data_no_duplicates:
    app_name = row[0]
    if english(app_name):
        apple_data_final.append(row)
explore_data(apple_data_final, 0, 1)
['Facebook', 0.0, 'Social Networking', 2974676, '3.5', '4+', '389879808']
Number of rows: 3069
Number of columns: 7
At the end of the cleaning process, the initial apple_data dataset of 7197 apps has been cleaned and trimmed down to the apple_data_final dataset of 3069 apps.
The intent of the data analysis is to provide company app developers with short-lists of currently available free apps that have high user engagement in both the Google and Apple markets. When building new app profiles, the developers can use the short-lists to model their projects in a similar fashion and increase the likelihood of user/ad interaction. The analysis starts by determining which app genres are most commonly found in the cleaned google_data_final and apple_data_final datasets.
The most common genre was determined by creating and reviewing frequency tables for both the google_data_final and apple_data_final datasets. The percent distribution of each genre was calculated, sorted from highest percent to lowest, and then presented in a bar chart.
# most common genre for google apps.
def freq_table(data_set, index):
    table = {}
    total = 0
    for row in data_set:
        total += 1
        key = row[index]
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    percent = {}
    for key in table:
        value = table[key]
        percentage = (value / total) * 100
        percentage = round(percentage, 1)
        percent[key] = percentage
    return percent
genre_most_common_google = freq_table(google_data_final, 2)
percent_genre_google = []
for key, value in genre_most_common_google.items():
    percent_genre_google.append([key, value])
print(percent_genre_google)
[['ART_AND_DESIGN', 0.7], ['AUTO_AND_VEHICLES', 0.9], ['BEAUTY', 0.6], ['BOOKS_AND_REFERENCE', 2.1], ['BUSINESS', 4.0], ['COMICS', 0.7], ['COMMUNICATION', 3.1], ['DATING', 1.8], ['EDUCATION', 1.2], ['ENTERTAINMENT', 1.0], ['EVENTS', 0.7], ['FINANCE', 3.8], ['FOOD_AND_DRINK', 1.2], ['HEALTH_AND_FITNESS', 3.0], ['HOUSE_AND_HOME', 0.8], ['LIBRARIES_AND_DEMO', 1.0], ['LIFESTYLE', 3.8], ['GAME', 10.2], ['FAMILY', 19.4], ['MEDICAL', 3.1], ['SOCIAL', 2.7], ['SHOPPING', 2.3], ['PHOTOGRAPHY', 3.1], ['SPORTS', 3.3], ['TRAVEL_AND_LOCAL', 2.3], ['TOOLS', 8.6], ['PERSONALIZATION', 3.3], ['PRODUCTIVITY', 3.7], ['PARENTING', 0.7], ['WEATHER', 0.8], ['VIDEO_PLAYERS', 1.9], ['NEWS_AND_MAGAZINES', 2.8], ['MAPS_AND_NAVIGATION', 1.4]]
# most common genre for apple apps.
genre_most_common_apple = freq_table(apple_data_final, 2)
percent_genre_apple = []
for key, value in genre_most_common_apple.items():
    percent_genre_apple.append([key, value])
print(percent_genre_apple)
[['Social Networking', 3.3], ['Photo & Video', 4.9], ['Games', 58.5], ['Music', 2.1], ['Reference', 0.5], ['Health & Fitness', 1.9], ['Weather', 0.8], ['Utilities', 2.4], ['Travel', 1.2], ['Shopping', 2.6], ['News', 1.3], ['Navigation', 0.2], ['Lifestyle', 1.6], ['Entertainment', 8.0], ['Food & Drink', 0.8], ['Sports', 2.2], ['Book', 0.3], ['Finance', 1.1], ['Education', 3.6], ['Productivity', 1.7], ['Business', 0.6], ['Catalogs', 0.1], ['Medical', 0.1]]
percent_genre_google_sorted = percent_genre_google
percent_genre_google_sorted.sort(key=lambda x: x[1], reverse = True)
print('Google apps genre percent distribution sorted highest to lowest:')
print(' ')
print(percent_genre_google_sorted)
print('\n')
percent_genre_apple_sorted = percent_genre_apple
percent_genre_apple_sorted.sort(key=lambda x: x[1], reverse = True)
print('Apple apps genre percent distribution sorted highest to lowest:')
print(' ')
print(percent_genre_apple_sorted)
Google apps genre percent distribution sorted highest to lowest:

[['FAMILY', 19.4], ['GAME', 10.2], ['TOOLS', 8.6], ['BUSINESS', 4.0], ['FINANCE', 3.8], ['LIFESTYLE', 3.8], ['PRODUCTIVITY', 3.7], ['SPORTS', 3.3], ['PERSONALIZATION', 3.3], ['COMMUNICATION', 3.1], ['MEDICAL', 3.1], ['PHOTOGRAPHY', 3.1], ['HEALTH_AND_FITNESS', 3.0], ['NEWS_AND_MAGAZINES', 2.8], ['SOCIAL', 2.7], ['SHOPPING', 2.3], ['TRAVEL_AND_LOCAL', 2.3], ['BOOKS_AND_REFERENCE', 2.1], ['VIDEO_PLAYERS', 1.9], ['DATING', 1.8], ['MAPS_AND_NAVIGATION', 1.4], ['EDUCATION', 1.2], ['FOOD_AND_DRINK', 1.2], ['ENTERTAINMENT', 1.0], ['LIBRARIES_AND_DEMO', 1.0], ['AUTO_AND_VEHICLES', 0.9], ['HOUSE_AND_HOME', 0.8], ['WEATHER', 0.8], ['ART_AND_DESIGN', 0.7], ['COMICS', 0.7], ['EVENTS', 0.7], ['PARENTING', 0.7], ['BEAUTY', 0.6]]

Apple apps genre percent distribution sorted highest to lowest:

[['Games', 58.5], ['Entertainment', 8.0], ['Photo & Video', 4.9], ['Education', 3.6], ['Social Networking', 3.3], ['Shopping', 2.6], ['Utilities', 2.4], ['Sports', 2.2], ['Music', 2.1], ['Health & Fitness', 1.9], ['Productivity', 1.7], ['Lifestyle', 1.6], ['News', 1.3], ['Travel', 1.2], ['Finance', 1.1], ['Weather', 0.8], ['Food & Drink', 0.8], ['Business', 0.6], ['Reference', 0.5], ['Book', 0.3], ['Navigation', 0.2], ['Catalogs', 0.1], ['Medical', 0.1]]
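The frequency-table and sorting steps above could also be sketched with collections.Counter from the standard library, which counts and orders by frequency in one call. This is an alternative sketch, not the notebook's implementation; freq_table_sorted is a hypothetical name and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percent) pairs for one column, sorted from most to least common.

    Hypothetical Counter-based alternative to freq_table plus the separate sort step.
    """
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(key, round(count / total * 100, 1)) for key, count in counts.most_common()]


# Illustrative rows only: [app_name, price, genre].
sample_rows = [
    ['App A', 0.0, 'GAME'],
    ['App B', 0.0, 'GAME'],
    ['App C', 0.0, 'TOOLS'],
    ['App D', 0.0, 'FAMILY'],
]
genre_table = freq_table_sorted(sample_rows, 2)
print(genre_table)
```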
from matplotlib import pyplot as plt
genre = []
for row in percent_genre_google_sorted:
    g = row[0]
    genre.append(g)
# print(genre)
percent = []
for row in percent_genre_google_sorted:
    p = row[1]
    percent.append(p)
# print(percent)
Color = 'white'
New_Colors = ['sandybrown', 'sandybrown', 'sandybrown', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen', 'mediumseagreen']
plt.figure(figsize = (14, 6))
ax = plt.axes()
ax.set_facecolor('0.85') # Background Color
plt.grid(color = Color, alpha = 0.3, linestyle = '--', linewidth = 1)
plt.bar(genre, percent, color = New_Colors, edgecolor = 'black', width = .85)
plt.title('Percent Distribution of Genres - Google Play Apps', fontsize = 18)
plt.ylabel('Distribution (%)', fontsize = 18)
plt.xlabel('Genre', fontsize = 18,)
plt.xticks(genre, horizontalalignment = 'right', rotation = 45) # rotation given as a number of degrees
plt.yticks([0, 5, 8.6, 10.2, 15, 19.4, 25])
plt.show()
There are 33 different genres in the google_data_final dataset and the most common is FAMILY at 19.4% of the total distribution of genres. The next most common genres are GAME and TOOLS at 10.2% and 8.6% respectively.
genre = []
for row in percent_genre_apple_sorted:
    g = row[0]
    genre.append(g)
# print(genre)
percent = []
for row in percent_genre_apple_sorted:
    p = row[1]
    percent.append(p)
# print(percent)
Color = 'white'
New_Colors = ['cornflowerblue', 'cornflowerblue', 'cornflowerblue', 'tomato', 'tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato','tomato']
plt.figure(figsize = (14, 6))
ax = plt.axes()
ax.set_facecolor('0.85') # Background Color
plt.grid(color = Color, alpha = 0.3, linestyle = '--', linewidth = 1)
plt.bar(genre, percent, color = New_Colors, edgecolor = 'black', width = .85)
plt.title('Percent Distribution of Genres - Apple App Store', fontsize = 18)
plt.ylabel('Distribution (%)', fontsize = 18)
plt.xlabel('Genre', fontsize = 18,)
plt.xticks(genre, horizontalalignment = 'right', rotation = 45) # rotation given as a number of degrees
plt.yticks([0, 4.9, 8, 20, 30, 40, 50, 58.5, 70])
plt.show()
There are 23 different genres in the apple_data_final dataset; the most common is Games with a commanding 58.5% of the total distribution of genres. The next most common genres are Entertainment and Photo & Video at 8.0% and 4.9% respectively. The analysis suggests that the Apple App Store sample is dominated by apps in the Games and Entertainment genres, as they make up approximately 67% of the apps in the apple_data_final dataset.
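The combined share quoted above follows directly from the sorted frequency table; a short arithmetic check using the two printed percentages:

```python
# Percentages copied from the sorted Apple genre table.
apple_share = {'Games': 58.5, 'Entertainment': 8.0}
combined = round(sum(apple_share.values()), 1)
print(combined)  # 66.5, i.e. approximately 67%
```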
The most common genre combination in the apple_data_final dataset is clearly Games and Entertainment. The google_data_final dataset is not as one-sided: its genre distribution is broader, with 33 primary genres, and each app can also be assigned one of 84 sub_genres or no sub-genre at all. It was decided to explore how the sub_genres were distributed within the most common primary genre, FAMILY, to see if any additional patterns could be identified. A frequency table of FAMILY sub_genres was created and sorted for review.
# most common sub_genre in the FAMILY genre, not sorted.
most_common_sub_genre_family_google = {}
for row in google_data_final:
    genre = row[2]
    sub_genre = row[8]
    if genre == 'FAMILY':
        # count each sub_genre; the extra 'genre and' test was redundant here
        if sub_genre in most_common_sub_genre_family_google:
            most_common_sub_genre_family_google[sub_genre] += 1
        else:
            most_common_sub_genre_family_google[sub_genre] = 1
print(most_common_sub_genre_family_google)
{'Role Playing': 8, 'Education': 29, 'Action': 20, 'Trivia': 4, 'Simulation': 7, 'Racing': 4, 'Music': 3, 'Casual': 10, 'Puzzle': 3, 'Entertainment': 45, 'Strategy': 3, 'Word': 2, 'Board': 4, 'Adventure': 1, 'Arcade': 2, 'Education;Education': 1, 'Casino': 1, '': 1473}
# most common sub-genre within the 'FAMILY' genre for the google_data_final dataset?
family_sub_genre_google = []
for key in most_common_sub_genre_family_google:
    sub_genre = key
    freq = most_common_sub_genre_family_google[key]
    family_sub_genre_google.append([sub_genre, freq])
family_sub_genre_google_sorted = family_sub_genre_google
family_sub_genre_google_sorted.sort(key=lambda x: x[1], reverse = True)
print('"FAMILY" genre sub_genre frequency table sorted:')
print(' ')
print(family_sub_genre_google_sorted)
print('\n')
# 'FAMILY' genre app count check
freq_count = 0
for row in family_sub_genre_google_sorted:
    freq = row[1]
    freq_count = freq_count + freq
print(freq_count, '- Check count matches frequency table - Yes')
"FAMILY" genre sub_genre frequency table sorted: [['', 1473], ['Entertainment', 45], ['Education', 29], ['Action', 20], ['Casual', 10], ['Role Playing', 8], ['Simulation', 7], ['Trivia', 4], ['Racing', 4], ['Board', 4], ['Music', 3], ['Puzzle', 3], ['Strategy', 3], ['Word', 2], ['Arcade', 2], ['Adventure', 1], ['Education;Education', 1], ['Casino', 1]] 1620 - Check count matches frequency table - Yes
Since a very high share (1473 of 1620) of the apps in the FAMILY genre have a blank sub_genre, the share of apps with an undefined sub_genre across the entire google_data_final dataset was calculated for comparison.
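For reference, the 1473-of-1620 figure quoted above can be expressed as a percentage with the same rounding used in the frequency tables; the counts are copied from the printed FAMILY table.

```python
# Counts copied from the printed FAMILY sub_genre frequency table.
blank_family = 1473
total_family = 1620
blank_share = round(blank_family / total_family * 100, 1)
print(blank_share)  # 90.9
```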
# most common sub-genre in the google_data_final dataset.
sub_genre_google = freq_table(google_data_final, 8)
percent_sub_genre_google = []
for key, value in sub_genre_google.items():
    percent_sub_genre_google.append([key, value])
percent_sub_genre_google_sorted = percent_sub_genre_google
percent_sub_genre_google_sorted.sort(key=lambda x: x[1], reverse = True)
print('Google apps sub-genre percent distribution sorted highest to lowest (top 3):')
print(' ')
print(percent_sub_genre_google_sorted[:3])
Google apps sub-genre percent distribution sorted highest to lowest (top 3):

[['', 74.9], ['Entertainment', 4.2], ['Education', 3.2]]
It appears that the most common sub_genre, both within the google_data_final dataset as a whole and within the FAMILY primary genre, is blank or undefined. There are 1473 out of 1620 apps in the FAMILY genre that have no defined sub_genre; furthermore, nearly 75% of the entire google_data_final dataset has no defined sub_genre. The most common genre in the google_data_final list remains FAMILY, followed by GAME. However, it was noted during review of the FAMILY genre app list that many of the apps could be considered games or entertainment. The following is a qualitative assessment of the google_data_final list, carried out to develop assumptions that will allow further analysis of the list.
# qualitative assessment of 'FAMILY' genre apps with undefined sub_genres.
def look_up(data_set, start, end, genre_name):
    extract = []
    for row in data_set:
        app_name = row[0]
        genre = row[2]
        sub_genre = row[8]
        if genre == genre_name:
            extract.append([app_name, genre, sub_genre])
    data_slice = extract[start:end]
    for row in data_slice:
        print(row)
look_up(google_data_final, 0, 10, 'FAMILY')
print('\n')
look_up(google_data_final, 0, 5, 'BEAUTY')
print('\n')
look_up(google_data_final, 0, 5, 'BOOKS_AND_REFERENCE')
['Jewels Crush- Match 3 Puzzle', 'FAMILY', 'Role Playing']
['Coloring & Learn', 'FAMILY', 'Education']
['Mahjong', 'FAMILY', 'Role Playing']
['Super ABC! Learning games for kids! Preschool apps', 'FAMILY', 'Action']
['Toy Pop Cubes', 'FAMILY', 'Trivia']
['Educational Games 4 Kids', 'FAMILY', 'Action']
['Candy Pop Story', 'FAMILY', 'Simulation']
['Princess Coloring Book', 'FAMILY', 'Education']
['Hello Kitty Nail Salon', 'FAMILY', 'Action']
['Candy Smash', 'FAMILY', 'Education']

['Hush - Beauty for Everyone', 'BEAUTY', 'Casual']
['ipsy: Makeup, Beauty, and Tips', 'BEAUTY', 'Puzzle']
['Natural recipes for your beauty', 'BEAUTY', 'Puzzle']
['BestCam Selfie-selfie, beauty camera, photo editor', 'BEAUTY', 'Arcade']
['Mirror - Zoom & Exposure -', 'BEAUTY', 'Sports']

['E-Book Read - Read Book for free', 'BOOKS_AND_REFERENCE', 'Sports']
['Download free book with green book', 'BOOKS_AND_REFERENCE', 'Action']
['Wikipedia', 'BOOKS_AND_REFERENCE', 'Arcade']
['Cool Reader', 'BOOKS_AND_REFERENCE', 'Word']
['Free Panda Radio Music', 'BOOKS_AND_REFERENCE', 'Action']
A qualitative assessment of the FAMILY genre, based on a review of app descriptions and content, found that many of its apps could be classified as GAME or ENTERTAINMENT. Additional qualitative assessment of all genres in the google_data_final dataset affirms that there are apps in all genres that could be defined as GAME or ENTERTAINMENT, or that have mislabeled sub_genres, as shown in the BEAUTY and BOOKS_AND_REFERENCE examples: 'ipsy: Makeup, Beauty, and Tips' (BEAUTY) is not a 'Puzzle' type app, and 'Wikipedia' (BOOKS_AND_REFERENCE) is not an 'Arcade' type app. It was therefore determined that the sub_genre data was not reliable and that no further analysis of sub_genre was required.
The qualitative assessment of the google_data_final list appears to agree with the apple_data_final dataset, in which approximately 67% of the apps (as shown in Chart 2) are defined as Games or Entertainment. Even though Google Play offers a greater variety of apps across a more extensive range of genres and sub_genres than the Apple App Store, the most common genres in the google_data_final dataset are FAMILY and GAME, with a combined total of 30% of the distribution. The assumptions moving forward with the analysis are:
1: The majority of apps in the FAMILY genre are considered GAME and ENTERTAINMENT.
2: Some apps in all genres other than FAMILY and GAME are considered GAME and ENTERTAINMENT.
3: The most common apps in the google_data_final dataset are GAME and ENTERTAINMENT.
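The genre shares quoted above (for example, the combined 30% for FAMILY and GAME) can be reproduced with a simple frequency table. The following is a minimal sketch only: the helper name genre_share and the sample rows are hypothetical, and it assumes the genre sits in column 2 of each row, as it does in the look_up function.

```python
from collections import Counter

def genre_share(data_set, genre_index=2):
    """Return {genre: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[genre_index] for row in data_set)
    total = sum(counts.values())
    return {genre: round(100 * n / total, 1) for genre, n in counts.most_common()}

# Hypothetical sample rows laid out like google_data_final (genre in column 2).
sample_rows = [
    ['App A', 0, 'FAMILY'],
    ['App B', 0, 'GAME'],
    ['App C', 0, 'TOOLS'],
    ['App D', 0, 'FAMILY'],
]

shares = genre_share(sample_rows)
for genre, percentage in shares.items():
    print(genre, percentage)
```

Calling genre_share(google_data_final) would give the full distribution for the Google dataset, from which the FAMILY and GAME shares can be added together.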
The following section of this report develops short-lists of apps that are deemed highly engaged by users. The lists include the top ten apps based on user data such as user_rating, rating_count_total, and installs. The first list is of the top 10 apps in the FAMILY and GAME genres of the google_data_final dataset, selected for a user_rating of 4.8 or above and at least 40,000 ratings, then ranked by the highest rating_count_total.
def look_up_user_rating_and_rating_count_total(data_set, n_rating_count_total, n_user_rating):
    # Keep apps that meet both the minimum rating count and the minimum user rating.
    extract = []
    for row in data_set:
        app_name = row[0]
        genre = row[2]
        rating_count_total = row[3]
        user_rating = row[4]
        installs = row[7]
        # user_rating is stored as a string (e.g. '4.8'), so this comparison is made on text.
        if user_rating >= n_user_rating and rating_count_total >= n_rating_count_total:
            extract.append([app_name, genre, rating_count_total, user_rating, installs])
    return extract

top_google_apps_genre = look_up_user_rating_and_rating_count_total(google_data_final, 40000, '4.8')
# Rank the matches by rating_count_total, highest first.
top_google_apps_genre_sorted = top_google_apps_genre
top_google_apps_genre_sorted.sort(key=lambda x: x[2], reverse = True)
def look_up_FAMILY_GAMES(data_set):
    # data_set rows are [app_name, genre, rating_count_total, user_rating, installs].
    extract = []
    for row in data_set:
        app_name = row[0]
        genre = row[1]
        rating_count_total = row[2]
        user_rating = row[3]
        installs = row[4]
        if genre == 'FAMILY' or genre == 'GAME':
            extract.append([app_name, genre, rating_count_total, user_rating, installs])
    return extract

top10_goggle_FAMILY_GAME_apps = look_up_FAMILY_GAMES(top_google_apps_genre_sorted)
print(top10_goggle_FAMILY_GAME_apps)
[['Eternium', 'FAMILY', 1506783, '4.8', '10,000,000+'], ["PewDiePie's Tuber Simulator", 'FAMILY', 1499466, '4.8', '10,000,000+'], ['Vlogger Go Viral - Tuber Game', 'FAMILY', 1304467, '4.8', '10,000,000+'], ['Cash, Inc. Money Clicker Game & Business Adventure', 'GAME', 549720, '4.8', '10,000,000+'], ['Dan the Man: Action Platformer', 'GAME', 528550, '4.8', '10,000,000+'], ['Fernanfloo', 'GAME', 526595, '4.8', '10,000,000+'], ['No. Color - Color by Number, Number Coloring', 'FAMILY', 269194, '4.8', '10,000,000+'], ['Wordscapes', 'GAME', 230849, '4.8', '10,000,000+'], ["Drag'n'Boom", 'GAME', 133180, '4.8', '1,000,000+'], ['PixPanda - Color by Number Pixel Art Coloring Book', 'FAMILY', 55723, '4.9', '1,000,000+'], ['Hungry Hearts Diner: A Tale of Star-Crossed Souls', 'FAMILY', 46253, '4.9', '500,000+']]
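The printed list above is ordered by rating_count_total alone. If ties should also be broken by user_rating, a tuple sort key is one option. The following is a minimal sketch under that assumption: the helper name rank_apps is hypothetical, the three rows are copied from the output above, and the rating string is converted with float() only for the comparison.

```python
def rank_apps(rows):
    """Sort by rating_count_total, then by user_rating, both descending.

    Rows are [app_name, genre, rating_count_total, user_rating, installs].
    """
    return sorted(rows, key=lambda row: (row[2], float(row[3])), reverse=True)

# Three rows copied from the printed output, deliberately out of order.
rows = [
    ['PixPanda - Color by Number Pixel Art Coloring Book', 'FAMILY', 55723, '4.9', '1,000,000+'],
    ['Eternium', 'FAMILY', 1506783, '4.8', '10,000,000+'],
    ['Wordscapes', 'GAME', 230849, '4.8', '10,000,000+'],
]

ranked = rank_apps(rows)
for position, row in enumerate(ranked, start=1):
    print(position, row[0], row[2], row[3])
```

Unlike list.sort(), sorted() returns a new list, so the original rows keep their order.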
data = top10_goggle_FAMILY_GAME_apps[:10]
columns = ['App Name', 'Genre', 'Rating Count Total', 'User Rating', 'Installs']
rows = ['{:0}'.format((1 * i) + 1) for i in range(10)]
r_colors = ['gold', 'silver', 'peru', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen',]
fig, ax = plt.subplots()
ax.set_axis_off() # removes the plot x and y axes
table = ax.table(
    cellText = data, cellLoc = 'center',
    rowLabels = rows, rowLoc = 'center', rowColours = r_colors,
    colLabels = columns, colColours = ['sandybrown'] * 5,
    bbox=[0, -.55, 2.5, 1.5], # table (left/right, up/down, Column Width and padding, Row Height and padding)
    loc = 'upper left')
table.auto_set_column_width(col = [-1, 0, 1, 2, 3, 4]) # auto fit column size
table.auto_set_font_size(False)
table.set_fontsize(12)
ax.set_title('Top 10 Google Apps: FAMILY & GAME Genre - Highest Rating Count Total',
             fontweight = 'bold', fontsize = 16, loc = 'left')
plt.show()
The top app in the FAMILY and GAME genres is 'Eternium', a classic RPG with impressive graphics that touts its "effortless 'tap to move' and innovative 'swipe to cast'" control features. The developer says the game is "player-friendly" with a "no paywalls, never pay-to-play" philosophy.
The 2nd (PewDiePie's Tuber Simulator) and 3rd (Vlogger Go Viral - Tuber Game) position apps are both 'tuber' simulator games in which users play to become the world's "#1" YouTube influencer or internet sensation.
Other notables include the 7th and 10th position apps, which are both color-by-numbers games.
The second Google app list includes the top 10 apps across all primary genres, selected from apps with the highest number of installs (500,000,000+) and a user_rating of at least 4.4, then ranked by the highest user_rating.
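installs = []
for row in google_data_final:
    count = row[7]
    installs.append(count)
# installs are strings such as '500,000,000+', so min() and max() compare them as text, not as numbers.
min_installs = min(installs)
max_installs = max(installs)
print(min_installs, 'to', max_installs)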
1+ to 500,000,000+
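Because the installs values are strings, min() and max() above order them character by character rather than by size. The following is a minimal sketch of a numeric comparison, assuming every value has the form shown in the output (digits, optional commas, and a trailing '+'). The helper name parse_installs and the sample labels are hypothetical and are not taken from the dataset; they only illustrate that a label such as '1,000,000,000+', if present, would sort below '500,000,000+' as text.

```python
def parse_installs(value):
    """Convert an installs label such as '500,000,000+' to its integer lower bound."""
    return int(value.replace(',', '').rstrip('+'))

# Hypothetical sample labels in the same format as the installs column.
sample = ['1+', '500,000,000+', '10,000+', '1,000,000,000+']

text_max = max(sample)                         # compared character by character
numeric_max = max(sample, key=parse_installs)  # compared by parsed value

print('text comparison:', text_max)
print('numeric comparison:', numeric_max)
```

Passing key=parse_installs to min(), max(), or sorted() keeps the original labels in the result while ordering them by their numeric value.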
def look_up_installs_and_user_rating(data_set, n_installs, n_user_rating):
    # Keep apps that meet both the minimum installs label and the minimum user rating.
    extract = []
    for row in data_set:
        app_name = row[0]
        genre = row[2]
        rating_count_total = row[3]
        user_rating = row[4]
        installs = row[7]
        sub_genre = row[8]
        # Both fields are strings, so these comparisons are made on text.
        if user_rating >= n_user_rating and installs >= n_installs:
            extract.append([app_name, genre, installs, user_rating, rating_count_total])
    return extract

top10_google_apps_installs_and_user_rating = look_up_installs_and_user_rating(google_data_final, '500,000,000+', '4.4')
# Rank the matches by user_rating, highest first.
top10_google_apps_installs_and_user_rating_sorted = top10_google_apps_installs_and_user_rating
top10_google_apps_installs_and_user_rating_sorted.sort(key=lambda x: x[3], reverse = True)
print(top10_google_apps_installs_and_user_rating_sorted)
[['Clean Master- Space Cleaner & Antivirus', 'TOOLS', '500,000,000+', '4.7', 42916526], ['Security Master - Antivirus, VPN, AppLock, Booster', 'TOOLS', '500,000,000+', '4.7', 24900999], ['Google Duo - High Quality Video Calls', 'COMMUNICATION', '500,000,000+', '4.6', 2083237], ['SHAREit - Transfer & Share', 'TOOLS', '500,000,000+', '4.6', 7790693], ['UC Browser - Fast Download Private & Secure', 'COMMUNICATION', '500,000,000+', '4.5', 17714850], ['My Talking Tom', 'GAME', '500,000,000+', '4.5', 14892469], ['Microsoft Word', 'PRODUCTIVITY', '500,000,000+', '4.5', 2084126], ['MX Player', 'VIDEO_PLAYERS', '500,000,000+', '4.5', 6474672], ['Candy Crush Saga', 'GAME', '500,000,000+', '4.4', 22430188], ['Google Translate', 'TOOLS', '500,000,000+', '4.4', 5745093], ['Dropbox', 'PRODUCTIVITY', '500,000,000+', '4.4', 1861310], ['Flipboard: News For Our Time', 'NEWS_AND_MAGAZINES', '500,000,000+', '4.4', 1284018]]
data = top10_google_apps_installs_and_user_rating_sorted[0:10]
columns = ['App Name', 'Genre', 'Installs', 'User Rating', 'Rating Count Total']
rows = ['{:0}'.format((1 * i) + 1) for i in range(10)]
r_colors = ['gold', 'silver', 'peru', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen',]
fig, ax = plt.subplots()
ax.set_axis_off() # removes the plot x and y axes
table = ax.table(
    cellText = data, cellLoc = 'center',
    rowLabels = rows, rowLoc = 'center', rowColours = r_colors,
    colLabels = columns, colColours = ['cornflowerblue'] * 5,
    bbox=[0, -.55, 2.5, 1.5], # table (left/right, up/down, Column Width and padding, Row Height and padding)
    loc = 'upper left')
table.auto_set_column_width(col = [0, 1, 2, 3, 4]) # auto fit column size
table.auto_set_font_size(False)
table.set_fontsize(12)
ax.set_title('Top 10 Google Apps: All Genres - Most Installs & Highest User Rating',
             fontweight = 'bold', fontsize = 16, loc = 'left')
plt.show()
The majority of the apps in the list above (6 of 10) fall under the TOOLS and COMMUNICATION genres. The #1 and #2 positions are 'Clean Master' and 'Security Master' respectively; both are mobile antivirus, cleaning, and performance-boosting apps. Antivirus apps may be an area to consider for development if the company has the experience and resources available.
Two GAME genre apps made this top 10 list: 'My Talking Tom' and 'Candy Crush Saga'. These two games did not make the first top 10 list because of their lower user ratings. It should be noted that 'My Talking Tom' and 'Candy Crush Saga' have at least 50x more installs and roughly 10-15x the rating count total of the #1 FAMILY and GAME app, 'Eternium'.
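The multiples quoted above can be checked directly from the two printed lists. The following is a minimal sketch: the values are copied from those outputs, the helper name parse_installs is hypothetical, and because each installs label is a lower bound (note the trailing '+'), the installs ratio compares lower bounds only.

```python
def parse_installs(value):
    """Convert an installs label such as '10,000,000+' to its integer lower bound."""
    return int(value.replace(',', '').rstrip('+'))

# Values copied from the two printed Google lists.
eternium = {'installs': '10,000,000+', 'rating_count_total': 1506783}
games = {
    'My Talking Tom': {'installs': '500,000,000+', 'rating_count_total': 14892469},
    'Candy Crush Saga': {'installs': '500,000,000+', 'rating_count_total': 22430188},
}

ratios = {}
for name, app in games.items():
    install_ratio = parse_installs(app['installs']) / parse_installs(eternium['installs'])
    rating_ratio = app['rating_count_total'] / eternium['rating_count_total']
    ratios[name] = (install_ratio, round(rating_ratio, 1))
    print(name, ratios[name])
```

On these figures both games have 50x the installs lower bound of 'Eternium', with about 9.9x ('My Talking Tom') and 14.9x ('Candy Crush Saga') the rating count total.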
The final top 10 list includes apps from only the Games genre in the apple_data_final dataset. The list was generated by selecting apps with the highest user_rating (5) and at least 60,000 ratings, then ranking them by the highest rating_count_total.
rating_count_total = []
for row in apple_data_final:
    count = row[3]
    rating_count_total.append(count)
# Report the range of rating counts in the Apple dataset.
min_count = min(rating_count_total)
max_count = max(rating_count_total)
print(min_count, 'to', max_count)
1 to 2974676
def look_up_user_rating_and_genre(data_set, n_genre, n_rating_count_total, n_user_rating):
    # Keep apps in the requested genre with the exact user rating and a minimum rating count.
    extract = []
    for row in data_set:
        app_name = row[0]
        genre = row[2]
        rating_count_total = row[3]
        user_rating = row[4]
        if user_rating == n_user_rating and genre == n_genre and rating_count_total >= n_rating_count_total:
            extract.append([app_name, genre, rating_count_total, user_rating])
    return extract

top10_apple_apps_games_genre = look_up_user_rating_and_genre(apple_data_final, 'Games', 60000, '5')
# Rank the matches by rating_count_total, highest first.
top10_apple_apps_games_genre_sorted = top10_apple_apps_games_genre
top10_apple_apps_games_genre_sorted.sort(key=lambda x: x[2], reverse = True)
print(top10_apple_apps_games_genre_sorted)
[['Head Soccer', 'Games', 481564, '5'], ['Sniper 3D Assassin: Shoot to Kill Gun Game', 'Games', 386521, '5'], ['Geometry Dash Lite', 'Games', 370370, '5'], ['CSR Racing 2', 'Games', 257100, '5'], ["Pictoword: Fun 2 Pics Guess What's the Word Trivia", 'Games', 186089, '5'], ['Iron Force', 'Games', 141634, '5'], ['Sniper Shooter: Gun Shooting Games', 'Games', 134080, '5'], ["PewDiePie's Tuber Simulator", 'Games', 90851, '5'], ['Blackbox - think outside the box', 'Games', 80058, '5'], ['Egg, Inc.', 'Games', 79074, '5'], ['Flight Pilot Simulator 3D: Flying Game For Free', 'Games', 60360, '5']]
data = top10_apple_apps_games_genre_sorted[:10]
columns = ['App Name', 'Genre', 'Rating Count Total', 'User Rating']
rows = ['{:0}'.format((1 * i) + 1) for i in range(10)]
r_colors = ['gold', 'silver', 'peru', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen', 'linen',]
fig, ax = plt.subplots()
ax.set_axis_off() # removes the plot x and y axes
table = ax.table(
    cellText = data, cellLoc = 'center',
    rowLabels = rows, rowLoc = 'center', rowColours = r_colors,
    colLabels = columns, colColours = ['violet'] * 4,
    bbox=[0, -.55, 2.5, 1.5], # table (left/right, up/down, Column Width and padding, Row Height and padding)
    loc = 'upper left')
table.auto_set_column_width(col = [0, 1, 2, 3]) # auto fit column size
table.auto_set_font_size(False)
table.set_fontsize(12)
ax.set_title('Top 10 Apple Apps: Games Genre - Highest Rating Count Total & User Rating',
             fontweight = 'bold', fontsize = 16, loc = 'left')
plt.show()
The #1 game from the apple_data_final dataset is 'Head Soccer'. The app's developers describe it as "a soccer game with easy controls that everyone can learn in 1 second". It was also noted from the Apple App Store that this app has 100,000,000+ downloads. Soccer is the world's most popular sport with an estimated 4 billion fans, followed by cricket with an estimated 2.5 billion fans. The company could develop a two-in-one app where users have the option to play either soccer or cricket to capture fans from both sports markets.
The only overlap between the Google and Apple top 10 lists is 'PewDiePie's Tuber Simulator', which held the 2nd position in the Google FAMILY and GAME genre list and the 8th position in the Apple Games genre list. This type of app profile should be considered by the company's developers, especially as "vlogging" and "tubing" continue to become more popular and user friendly for all.
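The overlap between the lists can be confirmed with a set intersection. The following is a minimal sketch in which the app names are copied from the first ten rows of each printed list; the variable names are hypothetical.

```python
# App names copied from the first ten rows of each printed list.
google_family_game = [
    'Eternium',
    "PewDiePie's Tuber Simulator",
    'Vlogger Go Viral - Tuber Game',
    'Cash, Inc. Money Clicker Game & Business Adventure',
    'Dan the Man: Action Platformer',
    'Fernanfloo',
    'No. Color - Color by Number, Number Coloring',
    'Wordscapes',
    "Drag'n'Boom",
    'PixPanda - Color by Number Pixel Art Coloring Book',
]
google_all_genres = [
    'Clean Master- Space Cleaner & Antivirus',
    'Security Master - Antivirus, VPN, AppLock, Booster',
    'Google Duo - High Quality Video Calls',
    'SHAREit - Transfer & Share',
    'UC Browser - Fast Download Private & Secure',
    'My Talking Tom',
    'Microsoft Word',
    'MX Player',
    'Candy Crush Saga',
    'Google Translate',
]
apple_games = [
    'Head Soccer',
    'Sniper 3D Assassin: Shoot to Kill Gun Game',
    'Geometry Dash Lite',
    'CSR Racing 2',
    "Pictoword: Fun 2 Pics Guess What's the Word Trivia",
    'Iron Force',
    'Sniper Shooter: Gun Shooting Games',
    "PewDiePie's Tuber Simulator",
    'Blackbox - think outside the box',
    'Egg, Inc.',
]

# Names that appear in either Google list and also in the Apple list.
google_names = set(google_family_game) | set(google_all_genres)
overlap = sorted(google_names & set(apple_games))

for name in overlap:
    print(name, google_family_game.index(name) + 1, apple_games.index(name) + 1)
```

Exact string matching only finds apps whose names are identical in both stores; an app listed under slightly different names would need a looser comparison.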
As part of this project, third party datasets containing user data for apps currently available on the Apple App Store and Google Play were collected, cleaned, and analyzed. The project's intent was to provide the company's app developers with lists of highly engaged apps to consider when creating new app profiles in order to increase the likelihood of user/ad interaction to generate revenue.
The concluding primary suggestion is to focus the company's app development resources on creating app profiles that are unique, no-pay-to-play, extremely user friendly vlogger/tuber simulator games and/or tools similar to 'PewDiePie's Tuber Simulator' or 'Vlogger Go Viral - Tuber Game'.
Other suggestions include developing a two-in-one soccer/cricket game, apps similar to 'Candy Crush Saga' and 'My Talking Tom', and/or an RPG style game with innovative gameplay controls.
Finally, it is recommended that the company look into what resources would be required to develop mobile antivirus, cleaning, and boosting apps.