Hi! This is my first Jupyter project, in which I try to identify profitable app profiles in the App Store and Google Play markets by analyzing sample data sets available online.
Data sets used :
I am going to analyze these data sets with the aim of building apps that are free to download and install, assuming that the main source of revenue is in-app ads. My goal for this project is to understand what types of apps are likely to attract more users.
Since the data sets contain apps from various countries and languages, I will analyze only the ones in English.
Please go through the steps below to follow the logic used to narrow down the kind of apps we can develop under the requirements mentioned above.
As of October 2020, there were approximately 1.85 million iOS apps available on the App Store, and 2.56 million Android apps on Google Play.
Collecting data for over four million apps requires a significant amount of time and money, so I'll try to analyze these samples of data instead.
Let me start by opening the two data sets mentioned above and exploring them.
from csv import reader

# Opening both data sets (an explicit utf8 encoding avoids decode errors on some systems)
opened_file1 = open('AppleStore.csv', encoding='utf8')
opened_file2 = open('googleplaystore.csv', encoding='utf8')
read_file1 = reader(opened_file1)
read_file2 = reader(opened_file2)

# Saving files as lists of lists
ios = list(read_file1)
android = list(read_file2)

# Separating the header row from each list
ios_head = ios[0]
android_head = android[0]

# Header row removal
ios = ios[1:]
android = android[1:]

# Defining an explore function to help understand the data set
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')
Using the explore_data() function, I look at the first few rows of each data set. I have also separated the column names from the data to identify the columns that will help in the analysis.
#App Store data
print(ios_head) #Column headers
explore_data(ios, 0, 3, True) #Appstore sample data exploration
#Google Play data
print(android_head)
explore_data(android, 0, 3, True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 7197
Number of columns: 16

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
Number of rows: 10841
Number of columns: 13
While scanning through the column headers, most headings turn out to be self-explanatory; for the rest, I refer to the data set sources mentioned at the start for both Google Play and the App Store.
The explore_data() output also shows a sample of rows together with the number of rows and columns of each data set.
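The loading step above leaves both files open for the rest of the session. A minimal sketch of an alternative follows, assuming a hypothetical helper named load_csv (it is not part of the original notebook). It reads the file inside a with block so that it is closed automatically, and returns the header and the data rows separately.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: return (header, rows) for a CSV file."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Possible usage with the file names from this notebook:
# ios_head, ios = load_csv('AppleStore.csv')
# android_head, android = load_csv('googleplaystore.csv')
```

The returned pair mirrors the ios_head / ios and android_head / android variables used in the rest of the analysis.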
The Google Play data set has a dedicated discussion section where an error in one row is described. Row 10472 is highlighted because it is missing the Category value, which has shifted its remaining columns to the left.
android[10472]
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
We saw previously that the Android data set has 13 columns, whereas this particular row has only 12 entries. To eliminate this entry, I use the del statement.
del android[10472]
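Running del android[10472] a second time would silently remove another, valid row. As a hedged alternative, the sketch below detects rows whose length differs from the header instead of relying on a hard-coded index. The helper names find_malformed_rows and drop_rows and the sample rows are assumptions made for illustration, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

def drop_rows(dataset, indexes):
    """Return a new list without the rows at the given indexes."""
    to_drop = set(indexes)
    return [row for i, row in enumerate(dataset) if i not in to_drop]

# Small made-up sample, not taken from the data set
sample_header = ['App', 'Category', 'Rating']
sample = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one value missing, so the columns shift
    ['App C', 'GAME', '4.5'],
]
bad = find_malformed_rows(sample, sample_header)
print(bad)                          # [1]
print(drop_rows(sample, bad))
```

Because drop_rows builds a new list, re-running the cell gives the same result each time.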
In the Google Play data set, the same app may appear in more than one entry. To check this, I count the unique and duplicate apps in the data set.
duplicate = []
unique = []
for row in android:
    if row[0] in unique:
        duplicate.append(row[0])
    else:
        unique.append(row[0])
print('Number of unique apps: ', len(unique))
print('Number of duplicate apps: ', len(duplicate))
print('\n')
print('Examples of duplicate apps:', duplicate[:15])
Number of unique apps: 9659
Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
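The membership test on the unique list above scans the whole list on every iteration, which gets slow as the data grows. A minimal sketch of the same count using collections.Counter from the standard library is shown below. The function name count_duplicates and the sample rows are made up for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (n_unique, n_duplicate, duplicated_names) for one column."""
    counts = Counter(row[name_index] for row in dataset)
    n_unique = len(counts)
    # Every occurrence beyond the first counts as a duplicate entry
    n_duplicate = sum(counts.values()) - n_unique
    duplicated = [name for name, n in counts.items() if n > 1]
    return n_unique, n_duplicate, duplicated

# Made-up rows for illustration
sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zoom']]
print(count_duplicates(sample))     # (3, 2, ['Slack'])
```

As in the loop above, an app that appears three times contributes two duplicate entries.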
Roughly 11% of the entries (1181 of 10840) are duplicates, so I will remove them based on the 'Reviews' value. The entry with the highest number of reviews is presumably the most recent one, so it is retained and the other, older entries for the same app are deleted.
Below is an example of duplicates within the data set.
for row in android:
    if row[0] == 'Slack':
        print(row)
        print('\n')
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
Here we see that Slack has '51507' and '51510' as review counts. We can retain the entry with 51510 reviews, since it is presumably the latest, and remove the rest.
Using this logic, we can create a dictionary that holds the maximum review count for each unique app.
reviews_max = {}
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    # Checking the review value for every entry
    # and updating the dictionary if it is the highest
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    # Adding new entries to the dictionary
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Number of entries in reviews_max : ', len(reviews_max))
Number of entries in reviews_max : 9659
We have created a dictionary with the unique app names as keys and the maximum review count of each app as values, which will help in deleting the duplicates from the main data set.
android_clean = []  # Cleaned data set without duplicates
already_added = []  # List of apps cleaned already
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    # Row is appended to android_clean if its review count matches the dictionary created in the last step
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(row[0])
print(len(android_clean))
9659
Using the max review dictionary, we compare its values with the 'Reviews' values in the actual data set and append the unique matches to the android_clean list of lists. To avoid adding any app twice, we keep track of the cleaned app names through the already_added list.
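The two passes above (build reviews_max, then filter) can also be sketched as a single pass that stores the best row per app directly. This is an illustrative variant rather than the method used in this notebook. The function name dedupe_by_max_reviews and the sample rows are assumptions.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when several share the maximum
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: name, category, rating, reviews
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
clean = dedupe_by_max_reviews(sample)
print(len(clean))                   # 2
```

One difference to keep in mind: the rows come back in the order in which each app name first appears, which may differ from the row order of android_clean.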
Similarly, we can check for duplicates in the App Store data set.
duplicate = []
unique = []
for row in ios:
    if row[0] in unique:
        duplicate.append(row[0])
    else:
        unique.append(row[0])
print('Number of unique apps: ', len(unique))
print('Number of duplicate apps: ', len(duplicate))
print('\n')
print('Examples of duplicate apps:', duplicate[:15])
Number of unique apps: 7197
Number of duplicate apps: 0
Examples of duplicate apps: []
We find that there are no duplicates in the App Store data set.
Since I want to analyze only the apps directed toward an English-speaking audience, apps whose names contain non-English characters can be eliminated.
For this we use the built-in ord() function to find the Unicode code point of each character. Any character with a code point greater than 127 falls outside ASCII, whose range 0 to 127 contains the characters commonly used in English text.
To verify this, we can check a few app names below.
def english(string):
    for element in string:
        # Comparing the Unicode code point of each character with the range 0-127
        if ord(element) > 127:
            return False
    return True

# Checking the function output for various inputs
name1 = english('Instagram')
name2 = english('爱奇艺PPS -《欢乐颂2》电视剧热播')
name3 = english('Docs To Go™ Free Office Suite')
name4 = english('Instachat 😜')
print(name1, name2, name3, name4)
True False False False
'Docs To Go™ Free Office Suite' and 'Instachat 😜' are English apps, but the ™ and 😜 characters fall outside the 0 to 127 range, so these names would be eliminated by our condition.
To minimize this data loss, we'll remove an app only if its name has more than three characters that fall outside the ASCII range.
def check_english(a_string):
    t = 0
    for character in a_string:
        if ord(character) > 127:
            t += 1
    if t > 3:
        return False
    return True

name5 = check_english('Instagram')
name6 = check_english('爱奇艺PPS -《欢乐颂2》电视剧热播')
name7 = check_english('Docs To Go™ Free Office Suite')
name8 = check_english('Instachat 😜')
print(name5, name6, name7, name8)
True False True True
The results show that check_english() is less strict than english(): it keeps English names that contain a few symbols or emoji. We can therefore use it for the removal of non-English apps.
def remove_NonEnglish(dataset, name_index):
    dataset_english = []  # List of English apps
    for row in dataset:
        name = row[name_index]
        if check_english(name):
            dataset_english.append(row)
    print('Number of rows before removal:', len(dataset))
    print('Number of rows after removal:', len(dataset_english))
    print('Number of rows removed:', len(dataset) - len(dataset_english))
    print('\n')
    return dataset_english

# Calling the function for both the Google Play and App Store data sets
ios_english = remove_NonEnglish(ios, 1)
android_english = remove_NonEnglish(android_clean, 0)
Number of rows before removal: 7197
Number of rows after removal: 6183
Number of rows removed: 1014

Number of rows before removal: 9659
Number of rows after removal: 9614
Number of rows removed: 45
1014 and 45 non-English apps have been removed from the App Store and Google Play data sets respectively.
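The cut-off of three non-ASCII characters is a heuristic, so it can help to make it adjustable and see how the result changes. The sketch below is a hypothetical variant of check_english(). The name is_probably_english and the max_non_ascii parameter are introduced here for illustration only.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
         'Docs To Go™ Free Office Suite', 'Instachat 😜']
print([is_probably_english(n) for n in names])
# [True, False, True, True]
print([is_probably_english(n, max_non_ascii=0) for n in names])
# [True, False, False, False]
```

With max_non_ascii=0 the function behaves like the stricter english() check, and with the default of 3 it matches check_english().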
Since we are interested in free apps, we can isolate those from the data sets. This can be achieved by checking the price column of each data set (Price for Google Play and price for the App Store).
android_final = []
ios_final = []
for app in android_english:
    price = app[7]
    if price == '0' or price == '$0':  # checking for free apps
        android_final.append(app)
for app in ios_english:
    price = app[4]
    if price == '0.0' or price == '$0':
        ios_final.append(app)
print('Free Apps in Google Play:', len(android_final))
print('Free Apps in Apple Store:', len(ios_final))
Free Apps in Google Play: 8864
Free Apps in Apple Store: 3222
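The loops above compare the price against a few specific strings. A minimal sketch of a numeric comparison follows, assuming that prices are plain numbers optionally prefixed with '$' (as in the values seen in this notebook). The helpers parse_price and is_free are hypothetical, and any other text in the price column would raise a ValueError.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(price.strip().lstrip('$'))

def is_free(price):
    """Return True when the parsed price is exactly zero."""
    return parse_price(price) == 0.0

for raw in ['0', '0.0', '$0', '$4.99', '2.99']:
    print(raw, '->', is_free(raw))
```

With such a helper, the same condition could be applied to both data sets, for example is_free(app[7]) for Google Play and is_free(app[4]) for the App Store.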
Since the revenue of the app we are looking to build depends heavily on the number of people using it, we should know which genres are most common in Google Play and the App Store.
To achieve this, we build frequency tables for the prime_genre column of the App Store data set and for the Genres and Category columns of the Google Play data set.
def freq_table(dataset, index):
    table = {}
    total = 0
    # Adding values to table to create the required frequency table
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    # Converting frequencies to a percentage table
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)  # Calling for the percentage table
    table_display = []
    # Interchanging the positions of values and keys in a list of tuples for easy sorting
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    # Displaying the table in descending order using sorted()
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        # Displaying values as % up to 2 decimals using round()
        print(entry[1], ':', round(entry[0],2), '%')

display_table(ios_final, -5)  # prime_genre frequency table in the App Store
Games : 58.16 % Entertainment : 7.88 % Photo & Video : 4.97 % Education : 3.66 % Social Networking : 3.29 % Shopping : 2.61 % Utilities : 2.51 % Sports : 2.14 % Music : 2.05 % Health & Fitness : 2.02 % Productivity : 1.74 % Lifestyle : 1.58 % News : 1.33 % Travel : 1.24 % Finance : 1.12 % Weather : 0.87 % Food & Drink : 0.81 % Reference : 0.56 % Business : 0.53 % Book : 0.43 % Navigation : 0.19 % Medical : 0.19 % Catalogs : 0.12 %
The prime_genre column of the App Store data shows that Games is by far the most common genre among the free English apps (58.16%). It is followed by Entertainment, Photo & Video, Education, Social Networking, Shopping and Utilities. Note that this table counts apps, not users: roughly 58% of the free English apps on offer are games, which does not by itself tell us how many users those apps attract.
display_table(android_final, 9)# Genre frequency table in Google Play
Tools : 8.45 % Entertainment : 6.07 % Education : 5.35 % Business : 4.59 % Productivity : 3.89 % Lifestyle : 3.89 % Finance : 3.7 % Medical : 3.53 % Sports : 3.46 % Personalization : 3.32 % Communication : 3.24 % Action : 3.1 % Health & Fitness : 3.08 % Photography : 2.94 % News & Magazines : 2.8 % Social : 2.66 % Travel & Local : 2.32 % Shopping : 2.25 % Books & Reference : 2.14 % Simulation : 2.04 % Dating : 1.86 % Arcade : 1.85 % Video Players & Editors : 1.77 % Casual : 1.76 % Maps & Navigation : 1.4 % Food & Drink : 1.24 % Puzzle : 1.13 % Racing : 0.99 % Role Playing : 0.94 % Libraries & Demo : 0.94 % Auto & Vehicles : 0.93 % Strategy : 0.91 % House & Home : 0.82 % Weather : 0.8 % Events : 0.71 % Adventure : 0.68 % Comics : 0.61 % Beauty : 0.6 % Art & Design : 0.6 % Parenting : 0.5 % Card : 0.45 % Casino : 0.43 % Trivia : 0.42 % Educational;Education : 0.39 % Board : 0.38 % Educational : 0.37 % Education;Education : 0.34 % Word : 0.26 % Casual;Pretend Play : 0.24 % Music : 0.2 % Racing;Action & Adventure : 0.17 % Puzzle;Brain Games : 0.17 % Entertainment;Music & Video : 0.17 % Casual;Brain Games : 0.14 % Casual;Action & Adventure : 0.14 % Arcade;Action & Adventure : 0.12 % Action;Action & Adventure : 0.1 % Educational;Pretend Play : 0.09 % Simulation;Action & Adventure : 0.08 % Parenting;Education : 0.08 % Entertainment;Brain Games : 0.08 % Board;Brain Games : 0.08 % Parenting;Music & Video : 0.07 % Educational;Brain Games : 0.07 % Casual;Creativity : 0.07 % Art & Design;Creativity : 0.07 % Education;Pretend Play : 0.06 % Role Playing;Pretend Play : 0.05 % Education;Creativity : 0.05 % Role Playing;Action & Adventure : 0.03 % Puzzle;Action & Adventure : 0.03 % Entertainment;Creativity : 0.03 % Entertainment;Action & Adventure : 0.03 % Educational;Creativity : 0.03 % Educational;Action & Adventure : 0.03 % Education;Music & Video : 0.03 % Education;Brain Games : 0.03 % Education;Action & Adventure : 0.03 % Adventure;Action & Adventure : 0.03 % Video Players & 
Editors;Music & Video : 0.02 % Sports;Action & Adventure : 0.02 % Simulation;Pretend Play : 0.02 % Puzzle;Creativity : 0.02 % Music;Music & Video : 0.02 % Entertainment;Pretend Play : 0.02 % Casual;Education : 0.02 % Board;Action & Adventure : 0.02 % Video Players & Editors;Creativity : 0.01 % Trivia;Education : 0.01 % Travel & Local;Action & Adventure : 0.01 % Tools;Education : 0.01 % Strategy;Education : 0.01 % Strategy;Creativity : 0.01 % Strategy;Action & Adventure : 0.01 % Simulation;Education : 0.01 % Role Playing;Brain Games : 0.01 % Racing;Pretend Play : 0.01 % Puzzle;Education : 0.01 % Parenting;Brain Games : 0.01 % Music & Audio;Music & Video : 0.01 % Lifestyle;Pretend Play : 0.01 % Lifestyle;Education : 0.01 % Health & Fitness;Education : 0.01 % Health & Fitness;Action & Adventure : 0.01 % Entertainment;Education : 0.01 % Communication;Creativity : 0.01 % Comics;Creativity : 0.01 % Casual;Music & Video : 0.01 % Card;Action & Adventure : 0.01 % Books & Reference;Education : 0.01 % Art & Design;Pretend Play : 0.01 % Art & Design;Action & Adventure : 0.01 % Arcade;Pretend Play : 0.01 % Adventure;Education : 0.01 %
display_table(android_final, 1) # Category frequency table in Google Play
FAMILY : 18.91 % GAME : 9.72 % TOOLS : 8.46 % BUSINESS : 4.59 % LIFESTYLE : 3.9 % PRODUCTIVITY : 3.89 % FINANCE : 3.7 % MEDICAL : 3.53 % SPORTS : 3.4 % PERSONALIZATION : 3.32 % COMMUNICATION : 3.24 % HEALTH_AND_FITNESS : 3.08 % PHOTOGRAPHY : 2.94 % NEWS_AND_MAGAZINES : 2.8 % SOCIAL : 2.66 % TRAVEL_AND_LOCAL : 2.34 % SHOPPING : 2.25 % BOOKS_AND_REFERENCE : 2.14 % DATING : 1.86 % VIDEO_PLAYERS : 1.79 % MAPS_AND_NAVIGATION : 1.4 % FOOD_AND_DRINK : 1.24 % EDUCATION : 1.16 % ENTERTAINMENT : 0.96 % LIBRARIES_AND_DEMO : 0.94 % AUTO_AND_VEHICLES : 0.93 % HOUSE_AND_HOME : 0.82 % WEATHER : 0.8 % EVENTS : 0.71 % PARENTING : 0.65 % ART_AND_DESIGN : 0.64 % COMICS : 0.62 % BEAUTY : 0.6 %
From the frequency tables for the Category and Genres columns of the Google Play data set, we find that Family, Game, Tools, Business and Lifestyle are the top five categories, and Tools, Entertainment, Education, Business and Productivity are the top five genres for free English apps.
Comparing the two stores, the App Store offering is dominated by apps designed for fun (games, entertainment, photo and video), whereas Google Play shows a more balanced mix of practical and entertainment apps. These tables describe how many apps exist in each genre, not how many users they have.
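The frequency tables discussed above can also be built with collections.Counter from the standard library. The following is a minimal sketch under that assumption. The function name freq_table_counter and the sample rows are made up, and ties are ordered by first appearance rather than by name as in display_table().

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() already yields (value, count) pairs in descending order
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up rows for illustration: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
for genre, pct in freq_table_counter(sample, 1).items():
    print(genre, ':', round(pct, 2), '%')
# Games : 75.0 %
# Tools : 25.0 %
```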
Using the rating_count_tot column as a proxy in the App Store data, we can try to find the most popular apps (the ones with the most users).
To analyze this, we start by calculating the average number of user ratings per app genre, and then express each genre's average as a share of the sum of all genre averages.
genres_ios = freq_table(ios_final, -5)
genres_table = []
count = 0
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    key_val_tuple1 = (avg_n_ratings, genre)
    genres_table.append(key_val_tuple1)
    count += avg_n_ratings
# Sorting in descending order
sorted_table1 = sorted(genres_table, reverse = True)
for element in sorted_table1:
    print(element[1], ':', round(element[0]/count*100,2), '%')
Navigation : 12.12 % Reference : 10.55 % Social Networking : 10.08 % Music : 8.07 % Weather : 7.36 % Book : 5.6 % Food & Drink : 4.69 % Finance : 4.43 % Photo & Video : 4.01 % Travel : 3.98 % Shopping : 3.79 % Health & Fitness : 3.28 % Sports : 3.24 % Games : 3.21 % News : 2.99 % Productivity : 2.96 % Utilities : 2.63 % Lifestyle : 2.32 % Entertainment : 1.98 % Business : 1.06 % Education : 0.99 % Catalogs : 0.56 % Medical : 0.09 %
Based on this popularity measure, Navigation, Reference, Social Networking, Music, Weather and Book are the top six genres in the App Store.
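The nested loops above walk through the whole data set once per genre. A minimal sketch of the same per-genre average in a single pass is shown below. The function name average_by_group and the sample rows are assumptions made for illustration.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows for illustration: name, genre, number of ratings
sample = [
    ['a', 'Games', '100'],
    ['b', 'Games', '300'],
    ['c', 'Weather', '50'],
]
averages = average_by_group(sample, 1, 2)
print(averages)                     # {'Games': 200.0, 'Weather': 50.0}
```

For the App Store data this would correspond to a call such as average_by_group(ios_final, -5, 5), using the same column indexes as the loop above.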
Using the Installs column in the Google Play data, we can likewise try to find the most popular apps (the ones with the most users).
categories_android = freq_table(android_final, 1)
cat_table = []
count2 = 0
for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    key_val_tuple2 = (avg_n_installs, category)
    # Collecting the (average, category) pairs in cat_table
    cat_table.append(key_val_tuple2)
    count2 += avg_n_installs
# Sorting in descending order
sorted_table2 = sorted(cat_table, reverse = True)
for element in sorted_table2:
    print(element[1], ':', round(element[0]/count2*100,2), '%')
COMMUNICATION : 16.0 % VIDEO_PLAYERS : 10.29 % SOCIAL : 9.68 % PHOTOGRAPHY : 7.42 % PRODUCTIVITY : 6.99 % GAME : 6.49 % TRAVEL_AND_LOCAL : 5.82 % ENTERTAINMENT : 4.84 % TOOLS : 4.5 % NEWS_AND_MAGAZINES : 3.97 % BOOKS_AND_REFERENCE : 3.65 % SHOPPING : 2.93 % PERSONALIZATION : 2.16 % WEATHER : 2.11 % HEALTH_AND_FITNESS : 1.74 % MAPS_AND_NAVIGATION : 1.69 % FAMILY : 1.54 % SPORTS : 1.51 % ART_AND_DESIGN : 0.83 % FOOD_AND_DRINK : 0.8 % EDUCATION : 0.76 % BUSINESS : 0.71 % LIFESTYLE : 0.6 % FINANCE : 0.58 % HOUSE_AND_HOME : 0.55 % DATING : 0.36 % COMICS : 0.34 % AUTO_AND_VEHICLES : 0.27 % LIBRARIES_AND_DEMO : 0.27 % PARENTING : 0.23 % BEAUTY : 0.21 % EVENTS : 0.11 % MEDICAL : 0.05 %
Based on the install counts, Communication, Video Players, Social, Photography and Productivity are the top five categories in the Google Play store. Since the Installs values are lower bounds such as '1,000+', these averages are approximate.
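The clean-up of the Installs strings used above can be isolated in a small helper, which makes it easy to check on a few values. This is a minimal sketch, and the function name parse_installs is hypothetical. Note that the result is only a lower bound, because '10,000+' means at least 10,000 installs.

```python
def parse_installs(installs):
    """Convert an Installs string such as '5,000,000+' to an integer lower bound."""
    return int(installs.replace(',', '').replace('+', ''))

for raw in ['10,000+', '5,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
# 10,000+ -> 10000
# 5,000,000+ -> 5000000
# 0 -> 0
```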
Since we want to build an app whose revenue depends on the number of ads shown, we should give priority to the popularity of the genre or category. We also have to keep in mind that the app needs to gain traction in both the App Store and Google Play.
Looking at the popularity data of both stores, the Social genre accounts for about 9-10% of the popularity measure (average ratings per genre in the App Store, average installs per category in Google Play), while only about 2-4% of the free English apps fall under Social. This suggests that the Social category is comparatively less crowded, which makes it a promising genre for our app to generate revenue in both the App Store and Google Play.
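The comparison behind this conclusion can be made explicit by dividing a genre's popularity share by its share of apps. The ratio below is only an illustrative measure introduced here, not something computed in the original analysis, and the function name popularity_to_supply is hypothetical. The percentages are copied from the tables above.

```python
def popularity_to_supply(popularity_pct, supply_pct):
    """Ratio above 1: the genre draws more attention than its share of apps."""
    return popularity_pct / supply_pct

# (popularity share, share of apps), both in percent, copied from the tables above
figures = {
    'App Store / Social Networking': (10.08, 3.29),
    'App Store / Games': (3.21, 58.16),
    'Google Play / SOCIAL': (9.68, 2.66),
    'Google Play / GAME': (6.49, 9.72),
}
for label, (popularity, supply) in figures.items():
    print(label, ':', round(popularity_to_supply(popularity, supply), 2))
```

A ratio well above 1 for Social in both stores, against a ratio below 1 for Games, is consistent with the reasoning above. It remains a rough indicator, because both inputs are averages over a sample of apps.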