In this project I analyze data about app profiles from the Apple App Store and Google Play markets to understand which apps are most profitable.
The goal of the project is to help developers understand what type of apps are likely to attract more users.
Note: This project forms part of Dataquest.io's course 'Data Science in Python - Fundamentals'.
Let's start out by gathering the data.
Instead of collecting data on over 4 million apps (Source: Statista), we can use two freely available data sets that seem suitable for our goal:
googleplaystore.csv
AppleStore.csv
First we open these two data sets and save both as lists of lists. Then we separate the header of each list, so the remainder of each list has a homogeneous data structure.
from csv import reader
# open both files
opened_file_google = open("googleplaystore.csv")
opened_file_apple = open("AppleStore.csv")
# read each file with csv.reader
read_file_google = reader(opened_file_google)
read_file_apple = reader(opened_file_apple)
# save each data set as a list of lists, then split off the header row
list_google = list(read_file_google)
header_google = list_google[0]
list_google = list_google[1:]
list_apple = list(read_file_apple)
header_apple = list_apple[0]
list_apple = list_apple[1:]
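The loading steps above can also be wrapped in a small helper. This is only a sketch: the function name `load_csv` and the `utf8` encoding are assumptions, not something the course or the data sets prescribe.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`.

    `load_csv` is a hypothetical helper; the utf8 default is an assumption.
    """
    # the `with` block closes the file, even if reading fails halfway
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    # first row is the header, the remainder is the data
    return rows[0], rows[1:]
```

With this helper, `header_google, list_google = load_csv("googleplaystore.csv")` would replace the separate open, read and slice steps, and the file is closed afterwards.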
Then we explore the two data sets, making use of this function that was provided by the course:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
Exploring the Google Play markets data set first, then the Apple Store data set:
print ("GOOGLE")
print('\n')
explore_data (list_google, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (list_apple, 0, 3, True)
GOOGLE
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
Number of rows: 10841
Number of columns: 13
APPLE
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 7197
Number of columns: 16
Now that we have taken a look at the first three rows and the total number of rows and columns, let's print out the column names for both data sets as well:
print ("GOOGLE")
print('\n')
print (header_google)
print('\n')
print ("APPLE")
print (header_apple)
GOOGLE
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
APPLE
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Time to investigate which columns are relevant to our analysis. The documentation for both data sets describes the information each column contains (see Google, Apple). We want to find out which apps attract more users, so the following columns seem interesting:
GOOGLE
App (application name)
Reviews (number of user reviews)
Category (category the app belongs to)
Installs (number of user downloads/installs)
Type (paid or free)
Price (price of the app)
Genres (genre the app belongs to; an app can belong to several at once)
APPLE
track_name (application name)
price (price of the app)
rating_count_tot (user rating counts for all versions)
user_rating (ratings for the app)
prime_genre (genre the app belongs to)
The next step is making sure the data is accurate, by checking for inaccurate data and for duplicate entries.
Let's start with checking for inaccurate data.
The discussions section on the Apple data set reports no incorrect data.
The discussions section on the Google data set mentions row 10472 has no rating. Let's check this by printing the row.
print (list_google[10472])
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
This row indeed has only 12 columns instead of 13 (like the header row). But it is not the Rating column that is missing: it seems to be the second column, Category, which caused the rest of the columns to shift over to the left.
Instead of imputing the missing value, we decide to remove the row entirely. The del statement below should only be run once: it removes whatever row sits at index 10472, so running it a second time would delete a valid row. Afterwards we print the row at that index again to check that the faulty row is gone.
del list_google[10472]
print (list_google [10472])
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
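Because an index-based deletion like the one above is not safe to re-run, a length check is one possible alternative. The sketch below uses a hypothetical helper, `drop_malformed_rows`, and a few toy rows (not taken from the data set) to show the idea: keep only rows that have as many fields as the header.

```python
def drop_malformed_rows(rows, n_columns):
    """Return only the rows that have exactly `n_columns` fields.

    Hypothetical helper, not part of the course material.
    """
    return [row for row in rows if len(row) == n_columns]

# toy data for illustration only
header = ['App', 'Category', 'Rating']
rows = [
    ['Slack', 'BUSINESS', '4.4'],
    ['Broken row', '1.9'],        # one field short, like the row removed above
    ['Box', 'BUSINESS', '4.2'],
]

cleaned = drop_malformed_rows(rows, len(header))
# running the filter a second time changes nothing
cleaned = drop_malformed_rows(cleaned, len(header))
print(len(cleaned))  # 2
```

Unlike `del list_google[10472]`, this can be run any number of times, at the cost of scanning the whole list.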
With the check for inaccurate data done and solved, let's move on to the next part of cleaning data: dealing with duplicates.
The discussions section on the Google data set also talks of duplicate entries. Let's check this and print any duplicate entries we find.
# create two empty lists
duplicate_apps = []
unique_apps = []
# loop through the data set, retrieve the name from the first column, and check if the name is in the unique_apps list.
# if it is not, append the name to this list, and else append the name to the duplicates list.
for app in list_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
# Print the number of duplicate apps and the names of the first 10 we found.
print("Number of duplicate apps: " + str(len(duplicate_apps)))
print("\n")
print("First 10 names of duplicate apps: " + str(duplicate_apps[:10]))
Number of duplicate apps: 1181
First 10 names of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
This confirms there are 1181 duplicates in the data set.
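The same count can be obtained with `collections.Counter` from the standard library, which avoids searching a growing list on every iteration. This is a sketch on toy rows; the helper name `count_duplicates` is made up for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return {name: number_of_extra_copies} for names that occur more than once.

    Hypothetical helper; `name_index` is the column holding the app name.
    """
    counts = Counter(row[name_index] for row in rows)
    return {name: n - 1 for name, n in counts.items() if n > 1}

# toy rows for illustration only (just the name column)
toy_rows = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zenefits']]

extras = count_duplicates(toy_rows)
print(extras)                # {'Slack': 2}
print(sum(extras.values()))  # 2
```

The sum of the extra copies corresponds to the length of the `duplicate_apps` list built by the loop above, since that loop appends a name once for every occurrence after the first.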
Instead of randomly removing duplicates, it's better to make an informed decision on which row to keep. To decide on a criterion for which entry to keep, let's inspect one of the duplicate entries, e.g. Slack.
# Loop through the data set and print the row if the name is 'Slack'.
for app in list_google:
    name = app[0]
    if name == 'Slack':
        print (app)
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
The only difference between the duplicate rows seems to be in the number of reviews. A quick inspection of some of the other rows confirms this.
Going back to the purpose of the analysis, we are interested in understanding what type of apps are likely to attract more users. This makes a case for keeping the entries with the highest number of reviews. Also, the more reviews, the more recent the data will be. Let's keep the entries with the highest number of reviews and remove the others.
To do this we first create a dictionary with the entry to keep per duplicate entry (key = app name, and value = highest number of reviews).
# creating an empty dictionary and looping through the data set.
# if the name does not exist yet in the dictionary or
# if the number of reviews is higher than the existing key-value entry,
# then the name and number of reviews is added to the dictionary.
reviews_max = {}
for app in list_google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
# printing the first 3 entries of the dictionary to see if things went well
print(list(reviews_max.items())[:3])
print("\n")
# printing the length of the dictionary
# and the expected length (data set minus duplicates) to confirm
print ("Length of the dictionary is: ", len(reviews_max))
print ("Expected length is: ", (len(list_google)-1181))
[('I am Rich Plus', 856.0), ('Draw A Stickman', 29265.0), ('Q Remote Control', 4264.0)]
Length of the dictionary is: 9659
Expected length is: 9659
Now that we have a dictionary with the entry to keep per duplicate entry (key = app name, and value = highest number of reviews), let's use this to remove the unwanted duplicate entries.
# creating two empty lists
google_clean = []
already_added = []
# looping through the Google data set, for each iteration;
# if the number of reviews for that app is the same as in the dictionary and
# the name is not yet in the list `already_added`
# we append the entire row to the list 'google_clean' and
# we append the name of the app to the list `already_added`(to account for cases where the number of reviews is equal to what was recorded previously).
for app in list_google:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
Time to check if things went well, re-using the explore_data() function that we used at the start of the project. Remember, the expected length is 9659 rows!
explore_data (google_clean, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 9659
Number of columns: 13
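The two passes used above (first the dictionary of maximum reviews, then the filtered list) could also be folded into one pass that stores the whole row per app name. The following is a sketch on toy rows, with a hypothetical helper `keep_most_reviewed`; note that it orders the result by first appearance of each name, which may differ slightly from the order produced by the approach above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    Hypothetical helper; on ties the first row seen is kept.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # store the row if the name is new or this row has more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # dictionaries keep insertion order (Python 3.7+)
    return list(best.values())

# toy rows for illustration only: name, category, rating, reviews
toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

deduplicated = keep_most_reviewed(toy_rows)
print(len(deduplicated))  # 2
```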
The discussions section on the Apple data set does not mention duplicate entries. To keep the names of our two data sets in sync and indicate that the Apple data set is clean as well, let's give the Apple data set a matching name, so we are left with two consistently named data sets: google_clean and apple_clean.
apple_clean = list_apple
# check our name change
explore_data (apple_clean, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 7197
Number of columns: 16
Now that the data is clean we need to make sure that the data we use for our analysis fits our purpose. The company we work for only makes apps that are in English and free to download. This means we are not interested in non-English or non-free apps, and these should not be included in the data set we use in our analysis.
So two more steps remain before we start analyzing: removing non-English apps and removing non-free apps. Let's finish up the preparation of our data sets.
Without a country column there is no good way to filter on English as a language. But we can remove each app whose name contains characters that are not commonly used in English, by detecting characters that fall outside the range 0 to 127 in the ASCII system. Note that this might leave in apps in other languages that solely use these characters, but we accept that risk.
Let's start with a function to detect if an app name contains a character that is greater than 127, by making use of the built-in function ord():
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True
# checking if the function works
print ("Instagram: " + str(is_english("Instagram")))
print ("爱奇艺PPS -《欢乐颂2》电视剧热播: " + str(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print ("Docs To Go™ Free Office Suite: " + str(is_english("Docs To Go™ Free Office Suite")))
print ("Instachat 😜: " + str(is_english("Instachat 😜")))
Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: False
Docs To Go™ Free Office Suite: False
Instachat 😜: False
The function does not entirely work as expected, since some app names use one or more special characters like an emoji or the trademark symbol. Let's just label a name as non-English if it has more than 3 characters outside of our defined range.
def is_english(string):
    number = 0
    for char in string:
        if ord(char) > 127:
            number +=1
    if number > 3:
        return False
    else:
        return True
# checking our new function
print ("Instagram: " + str(is_english("Instagram")))
print ("爱奇艺PPS -《欢乐颂2》电视剧热播: " + str(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print ("Docs To Go™ Free Office Suite: " + str(is_english("Docs To Go™ Free Office Suite")))
print ("Instachat 😜: " + str(is_english("Instachat 😜")))
Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: True is not the result here; see below
That's better.
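For comparison, the same rule can be written more compactly by counting the non-ASCII characters with `sum()`. This is only a sketch; the name `looks_english` and the `max_non_ascii` parameter are introduced here for illustration, with the threshold of 3 taken from the function above.

```python
def looks_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above 127.

    Hypothetical variant of is_english(); the default mirrors the rule above.
    """
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(looks_english("Instagram"))                      # True
print(looks_english("Docs To Go™ Free Office Suite"))  # True
print(looks_english("Instachat 😜"))                    # True
```

Making the threshold a parameter keeps it easy to experiment with stricter or looser limits.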
Now let's use this function on both our data sets to filter out non-English apps.
# Creating a new empty list for each data set
# Looping through both data sets
# using the `is_english()` function to identify English apps
# appending these apps to a new list
google_clean_english = []
apple_clean_english = []
for app in google_clean:
    name = app[0]
    if is_english(name):
        google_clean_english.append(app)
for app in apple_clean:
    # name is in second column in this data set!
    name = app[1]
    if is_english(name):
        apple_clean_english.append(app)
# exploring the new data sets
print ("GOOGLE")
print('\n')
explore_data (google_clean_english, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (apple_clean_english, 0, 3, True)
GOOGLE
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 9614
Number of columns: 13
APPLE
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 6183
Number of columns: 16
Looks good! We now have 9614 rows in the Google data set, and 6183 rows remaining in the Apple data set.
Time to move on to the other step in our data preparation: taking care of non-free apps.
From the exploration of the data sets above we know that the price is stored in the following columns:
Google: Price (index 7)
Apple: price (index 4)
To isolate the free apps we will loop through each data set and save the entries where the price is 0 to a separate new list.
Let's get started.
# creating new empty lists for both data sets
google_clean_english_free = []
apple_clean_english_free = []
# looping through each data set and saving free apps to the new list
for app in google_clean_english:
    price = app[7]
    # from inspection of the data set we know price is stored as a string
    # and when the app is free the price is '0'
    if price == '0':
        google_clean_english_free.append(app)
for app in apple_clean_english:
    price = app[4]
    # from inspection of the data set we know price is stored as a string
    # and when the app is free the price is '0.0'
    if price == '0.0':
        apple_clean_english_free.append(app)
# exploring the new data sets to see if things went well
print ("GOOGLE")
print('\n')
explore_data (google_clean_english_free, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (apple_clean_english_free, 0, 3, True)
GOOGLE
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 8864
Number of columns: 13
APPLE
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 3222
Number of columns: 16
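Comparing against the exact strings '0' and '0.0' works for these two files, but it needs a separate check per data set. A numeric comparison is one way to use a single check for both. The sketch below assumes that paid prices may carry a leading dollar sign (for example '$4.99'); the helper name `is_free` is hypothetical.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free.

    Hypothetical helper; the leading '$' handling is an assumption about
    how paid prices are written.
    """
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With such a helper, both loops above could use `if is_free(price):` instead of a data-set-specific string comparison.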
That's it! We are now left with clean, error-free data sets with only free apps in English.
We can finally start with the analysis, and dive into what apps attract more users.
Our business intends to release new apps on both Google Play and the App Store, so we are interested in finding app profiles that are successful in both markets.
A useful feature for this is the genre an app belongs to, since it lets us see which genres are the most common. The earlier data exploration revealed that the following columns store this information:
Google: Category and Genres
Apple: prime_genre
Let's build a function to show the most common genres in both data sets.
To detect the most common genre we will create a frequency table in a dictionary, taking a data set and the index of the desired column as input.
def freq_table (dataset, index):
    # Create an empty dictionary
    # Loop through the data set list
    # and check for every iteration whether the value in the chosen column exists as a key in the dictionary.
    # If it exists, then increment the dictionary value at that key by 1.
    # If it doesn't exist, create a new key-value pair in the dictionary,
    # where the dictionary key is the column value and the dictionary value is 1.
    freq_dict = {}
    for app in dataset:
        element = app[index]
        if element in freq_dict:
            freq_dict[element] += 1
        else:
            freq_dict[element] = 1
    return freq_dict
To show the contents we need a second function that displays the entries of the frequency table in descending order. The following function was provided by the course:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
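As an aside, the standard library's `collections.Counter` can build and sort such a table in one go, and adding percentages makes two data sets of different sizes easier to compare. This is a sketch on toy rows; `freq_table_percent` is a hypothetical name, not a function from the course.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Return [(value, count, percentage), ...] sorted by descending count.

    Hypothetical helper built on collections.Counter.
    """
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [
        (value, count, round(100 * count / total, 2))
        for value, count in counts.most_common()
    ]

# toy rows for illustration only: name, genre
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_percent(toy_rows, 1)
print(table)  # [('Games', 3, 75.0), ('Music', 1, 25.0)]
```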
Time to use these two functions!
Let's display the frequency tables of the columns Category (index 1) and Genres (index 9) of the Google data set. And then let's display the frequency table of the column prime_genre (index 11) of the Apple data set.
Google / Category
display_table (google_clean_english_free, 1)
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53
Google / Genre
display_table (google_clean_english_free, 9)
Tools : 749 Entertainment : 538 Education : 474 Business : 407 Productivity : 345 Lifestyle : 345 Finance : 328 Medical : 313 Sports : 307 Personalization : 294 Communication : 287 Action : 275 Health & Fitness : 273 Photography : 261 News & Magazines : 248 Social : 236 Travel & Local : 206 Shopping : 199 Books & Reference : 190 Simulation : 181 Dating : 165 Arcade : 164 Video Players & Editors : 157 Casual : 156 Maps & Navigation : 124 Food & Drink : 110 Puzzle : 100 Racing : 88 Role Playing : 83 Libraries & Demo : 83 Auto & Vehicles : 82 Strategy : 81 House & Home : 73 Weather : 71 Events : 63 Adventure : 60 Comics : 54 Beauty : 53 Art & Design : 53 Parenting : 44 Card : 40 Casino : 38 Trivia : 37 Educational;Education : 35 Board : 34 Educational : 33 Education;Education : 30 Word : 23 Casual;Pretend Play : 21 Music : 18 Racing;Action & Adventure : 15 Puzzle;Brain Games : 15 Entertainment;Music & Video : 15 Casual;Brain Games : 12 Casual;Action & Adventure : 12 Arcade;Action & Adventure : 11 Action;Action & Adventure : 9 Educational;Pretend Play : 8 Simulation;Action & Adventure : 7 Parenting;Education : 7 Entertainment;Brain Games : 7 Board;Brain Games : 7 Parenting;Music & Video : 6 Educational;Brain Games : 6 Casual;Creativity : 6 Art & Design;Creativity : 6 Education;Pretend Play : 5 Role Playing;Pretend Play : 4 Education;Creativity : 4 Role Playing;Action & Adventure : 3 Puzzle;Action & Adventure : 3 Entertainment;Creativity : 3 Entertainment;Action & Adventure : 3 Educational;Creativity : 3 Educational;Action & Adventure : 3 Education;Music & Video : 3 Education;Brain Games : 3 Education;Action & Adventure : 3 Adventure;Action & Adventure : 3 Video Players & Editors;Music & Video : 2 Sports;Action & Adventure : 2 Simulation;Pretend Play : 2 Puzzle;Creativity : 2 Music;Music & Video : 2 Entertainment;Pretend Play : 2 Casual;Education : 2 Board;Action & Adventure : 2 Video Players & Editors;Creativity : 1 Trivia;Education : 1 Travel & Local;Action & 
Adventure : 1 Tools;Education : 1 Strategy;Education : 1 Strategy;Creativity : 1 Strategy;Action & Adventure : 1 Simulation;Education : 1 Role Playing;Brain Games : 1 Racing;Pretend Play : 1 Puzzle;Education : 1 Parenting;Brain Games : 1 Music & Audio;Music & Video : 1 Lifestyle;Pretend Play : 1 Lifestyle;Education : 1 Health & Fitness;Education : 1 Health & Fitness;Action & Adventure : 1 Entertainment;Education : 1 Communication;Creativity : 1 Comics;Creativity : 1 Casual;Music & Video : 1 Card;Action & Adventure : 1 Books & Reference;Education : 1 Art & Design;Pretend Play : 1 Art & Design;Action & Adventure : 1 Arcade;Pretend Play : 1 Adventure;Education : 1
Apple/prime_genre
display_table (apple_clean_english_free, 11)
Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4
So the most common genres are:
Google/Category: FAMILY, GAME and TOOLS
Google/Genre: Tools, Entertainment and Education
Apple/prime_genre: Games, Entertainment and Photo & Video
It seems many of the common genres for free apps in English on Google Play are of a more practical nature (Tools, Business, Productivity), while by far the most common genre for these types of apps on the App Store is Games.
While this is good to know, a genre being the most frequent does not necessarily mean it also has the most users.
Let's find out which genres are the most popular. For the Google data set we will calculate this using the Installs column. For the Apple data set this column is missing, so we will make do with the total number of user ratings instead, which can be found in the rating_count_tot column.
Starting with the App Store data set, we will isolate the apps of each genre, sum the user ratings of those apps, and divide that sum by the number of apps in the genre, so we get the average number of user ratings for each genre.
# creating a frequency table for the `prime_genre` column
# using the previously created freq_table() function
freq_prime_genre = freq_table(apple_clean_english_free, 11)
# looping over the genres
# per genre, save the number of ratings and count the genre itself
# calculate the average per genre and print the result
for genre in freq_prime_genre:
    total = 0
    len_genre = 0
    for app in apple_clean_english_free:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    average = total / len_genre
    print (genre + ": " + str(average))
Sports: 23008.898550724636 Education: 7003.983050847458 Book: 39758.5 Weather: 52279.892857142855 Travel: 28243.8 Health & Fitness: 23298.015384615384 Catalogs: 4004.0 Utilities: 18684.456790123455 Food & Drink: 33333.92307692308 Reference: 74942.11111111111 Medical: 612.0 News: 21248.023255813954 Social Networking: 71548.34905660378 Shopping: 26919.690476190477 Business: 7491.117647058823 Entertainment: 14029.830708661417 Photo & Video: 28441.54375 Navigation: 86090.33333333333 Music: 57326.530303030304 Lifestyle: 16485.764705882353 Games: 22788.6696905016 Productivity: 21028.410714285714 Finance: 31467.944444444445
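The nested loop above walks the whole data set once per genre. The same averages can be computed in a single pass by keeping a running total and count per genre. This is a sketch with a hypothetical helper, `average_by_group`, demonstrated on toy rows.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the values in `value_index`} in one pass.

    Hypothetical helper; values are converted with float().
    """
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# toy rows for illustration only: genre, number of ratings
toy_rows = [['Games', '10'], ['Games', '30'], ['Music', '5']]

averages = average_by_group(toy_rows, 0, 1)
print(averages)  # {'Games': 20.0, 'Music': 5.0}
```

For the Apple data set the call would be `average_by_group(apple_clean_english_free, 11, 5)`, using the same column indexes as the loop above.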
Popular genres with a high average number of user ratings seem to be Social Networking, Navigation and Music, but we know these numbers are skewed by a few very popular apps like Facebook, Google Maps and Spotify.
Another popular genre that jumps out is Reference. Let's look at the apps in this genre to get a better understanding.
for app in apple_clean_english_free:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])
Bible : 985920 Dictionary.com Dictionary & Thesaurus : 200047 Dictionary.com Dictionary & Thesaurus for iPad : 54175 Google Translate : 26786 Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588 Merriam-Webster Dictionary : 16849 Night Sky : 12122 City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535 LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693 GUNS MODS for Minecraft PC Edition - Mods Tools : 1497 Guides for Pokémon GO - Pokemon GO News and Cheats : 826 WWDC : 762 Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718 VPN Express : 14 Real Bike Traffic Rider Virtual Reality Glasses : 8 教えて!goo : 0 Jishokun-Japanese English Dictionary & Translator : 0
Many of the apps in the genre seem to have taken a popular book and made it available on a mobile device, possibly with some extra features.
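The list above also shows how much a single app can pull up an average: Bible alone accounts for most of the ratings in Reference. One way to check for this kind of skew is to compare the mean with the median, using the `statistics` module from the standard library. The sketch below uses the 18 rating counts printed above; the variable names are made up for illustration.

```python
from statistics import mean, median

# rating counts of the Reference apps, copied from the output above
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

reference_mean = mean(reference_ratings)
reference_median = median(reference_ratings)

print(round(reference_mean, 2))  # 74942.11
print(reference_median)          # 6614.0
```

The mean matches the average printed for Reference earlier, while the median is more than ten times smaller, which suggests the typical Reference app has far fewer ratings than the average implies.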
Let's move on to the Google data, to see which app genres attract the most users in Google Play.
While the data in the Installs column is not very precise (100,000+, 1,000,000+ etc.), it serves our purpose. We will have to remove the commas and plus characters, and convert the data from string to float so we can perform calculations on it.
# creating a frequency table for the `Category` column (index 1)
# using the previously created freq_table() function
freq_category = freq_table(google_clean_english_free, 1)
# looping over the categories
# per category, sum the number of installs and count the apps
# calculate the average installs per category and print the result
for category in freq_category:
    total = 0
    len_category = 0
    for app in google_clean_english_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            # using the `str.replace(old, new)` method to remove unwanted characters
            # removing unwanted characters by replacing them with an empty string
            # convert the result to a float
            clean_installs = installs.replace('+', '')
            clean_installs = clean_installs.replace (',', '')
            clean_installs = float(clean_installs)
            total += clean_installs
            len_category += 1
    average = total / len_category
    print (category + ": " + str(average))
PERSONALIZATION: 5201482.6122448975 MAPS_AND_NAVIGATION: 4056941.7741935486 PARENTING: 542603.6206896552 VIDEO_PLAYERS: 24727872.452830188 HEALTH_AND_FITNESS: 4188821.9853479853 BUSINESS: 1712290.1474201474 FOOD_AND_DRINK: 1924897.7363636363 COMMUNICATION: 38456119.167247385 TRAVEL_AND_LOCAL: 13984077.710144928 SOCIAL: 23253652.127118643 FINANCE: 1387692.475609756 ART_AND_DESIGN: 1986335.0877192982 LIBRARIES_AND_DEMO: 638503.734939759 SHOPPING: 7036877.311557789 WEATHER: 5074486.197183099 ENTERTAINMENT: 11640705.88235294 MEDICAL: 120550.61980830671 LIFESTYLE: 1437816.2687861272 GAME: 15588015.603248259 FAMILY: 3695641.8198090694 PHOTOGRAPHY: 17840110.40229885 TOOLS: 10801391.298666667 EDUCATION: 1833495.145631068 SPORTS: 3638640.1428571427 BOOKS_AND_REFERENCE: 8767811.894736841 NEWS_AND_MAGAZINES: 9549178.467741935 EVENTS: 253542.22222222222 HOUSE_AND_HOME: 1331540.5616438356 DATING: 854028.8303030303 COMICS: 817657.2727272727 BEAUTY: 513151.88679245283 PRODUCTIVITY: 16787331.344927534 AUTO_AND_VEHICLES: 647317.8170731707
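The cleaning of the install counts used above can be isolated in a small function, which makes it easy to test on a few values before running it on the whole data set. A minimal sketch, with `parse_installs` as a hypothetical name:

```python
def parse_installs(installs):
    """Convert an install string such as '1,000,000+' to a float.

    Hypothetical helper that mirrors the replace() calls used above.
    """
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('5+'))          # 5.0
```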
Popular genres with a high average number of installs seem to be Communication and Video Players, but again, we know these numbers are skewed by a few very popular apps like WhatsApp and YouTube. Another popular genre that jumps out is Books and Reference, which coincides with our finding from the Apple data set. Let's look at the apps in this genre to get a better understanding.
for app in google_clean_english_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
E-Book Read - Read Book for free : 50,000+ Download free book with green book : 100,000+ Wikipedia : 10,000,000+ Cool Reader : 10,000,000+ Free Panda Radio Music : 100,000+ Book store : 1,000,000+ FBReader: Favorite Book Reader : 10,000,000+ English Grammar Complete Handbook : 500,000+ Free Books - Spirit Fanfiction and Stories : 1,000,000+ Google Play Books : 1,000,000,000+ AlReader -any text book reader : 5,000,000+ Offline English Dictionary : 100,000+ Offline: English to Tagalog Dictionary : 500,000+ FamilySearch Tree : 1,000,000+ Cloud of Books : 1,000,000+ Recipes of Prophetic Medicine for free : 500,000+ ReadEra – free ebook reader : 1,000,000+ Anonymous caller detection : 10,000+ Ebook Reader : 5,000,000+ Litnet - E-books : 100,000+ Read books online : 5,000,000+ English to Urdu Dictionary : 500,000+ eBoox: book reader fb2 epub zip : 1,000,000+ English Persian Dictionary : 500,000+ Flybook : 500,000+ All Maths Formulas : 1,000,000+ Ancestry : 5,000,000+ HTC Help : 10,000,000+ English translation from Bengali : 100,000+ Pdf Book Download - Read Pdf Book : 100,000+ Free Book Reader : 100,000+ eBoox new: Reader for fb2 epub zip books : 50,000+ Only 30 days in English, the guideline is guaranteed : 500,000+ Moon+ Reader : 10,000,000+ SH-02J Owner's Manual (Android 8.0) : 50,000+ English-Myanmar Dictionary : 1,000,000+ Golden Dictionary (EN-AR) : 1,000,000+ All Language Translator Free : 1,000,000+ Azpen eReader : 500,000+ URBANO V 02 instruction manual : 100,000+ Bible : 100,000,000+ C Programs and Reference : 50,000+ C Offline Tutorial : 1,000+ C Programs Handbook : 50,000+ Amazon Kindle : 100,000,000+ Aab e Hayat Full Novel : 100,000+ Aldiko Book Reader : 10,000,000+ Google I/O 2018 : 500,000+ R Language Reference Guide : 10,000+ Learn R Programming Full : 5,000+ R Programing Offline Tutorial : 1,000+ Guide for R Programming : 5+ Learn R Programming : 10+ R Quick Reference Big Data : 1,000+ V Made : 100,000+ Wattpad 📖 Free Books : 100,000,000+ Dictionary - 
WordWeb : 5,000,000+ Guide (for X-MEN) : 100,000+ AC Air condition Troubleshoot,Repair,Maintenance : 5,000+ AE Bulletins : 1,000+ Ae Allah na Dai (Rasa) : 10,000+ 50000 Free eBooks & Free AudioBooks : 5,000,000+ Ag PhD Field Guide : 10,000+ Ag PhD Deficiencies : 10,000+ Ag PhD Planting Population Calculator : 1,000+ Ag PhD Soybean Diseases : 1,000+ Fertilizer Removal By Crop : 50,000+ A-J Media Vault : 50+ Al-Quran (Free) : 10,000,000+ Al Quran (Tafsir & by Word) : 500,000+ Al Quran Indonesia : 10,000,000+ Al'Quran Bahasa Indonesia : 10,000,000+ Al Quran Al karim : 1,000,000+ Al-Muhaffiz : 50,000+ Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+ Al-Quran 30 Juz free copies : 500,000+ Koran Read &MP3 30 Juz Offline : 1,000,000+ Hafizi Quran 15 lines per page : 1,000,000+ Quran for Android : 10,000,000+ Surah Al-Waqiah : 100,000+ Hisnul Al Muslim - Hisn Invocations & Adhkaar : 100,000+ Satellite AR : 1,000,000+ Audiobooks from Audible : 100,000,000+ Kinot & Eichah for Tisha B'Av : 10,000+ AW Tozer Devotionals - Daily : 5,000+ Tozer Devotional -Series 1 : 1,000+ The Pursuit of God : 1,000+ AY Sing : 5,000+ Ay Hasnain k Nana Milad Naat : 10,000+ Ay Mohabbat Teri Khatir Novel : 10,000+ Arizona Statutes, ARS (AZ Law) : 1,000+ Oxford A-Z of English Usage : 1,000,000+ BD Fishpedia : 1,000+ BD All Sim Offer : 10,000+ Youboox - Livres, BD et magazines : 500,000+ B&H Kids AR : 10,000+ B y H Niños ES : 5,000+ Dictionary.com: Find Definitions for English Words : 10,000,000+ English Dictionary - Offline : 10,000,000+ Bible KJV : 5,000,000+ Borneo Bible, BM Bible : 10,000+ MOD Black for BM : 100+ BM Box : 1,000+ Anime Mod for BM : 100+ NOOK: Read eBooks & Magazines : 10,000,000+ NOOK Audiobooks : 500,000+ NOOK App for NOOK Devices : 500,000+ Browsery by Barnes & Noble : 5,000+ bp e-store : 1,000+ Brilliant Quotes: Life, Love, Family & Motivation : 1,000,000+ BR Ambedkar Biography & Quotes : 10,000+ BU Alsace : 100+ Catholic La Bu Zo Kam : 500+ Khrifa Hla Bu (Solfa) : 
10+ Kristian Hla Bu : 10,000+ SA HLA BU : 1,000+ Learn SAP BW : 500+ Learn SAP BW on HANA : 500+ CA Laws 2018 (California Laws and Codes) : 5,000+ Bootable Methods(USB-CD-DVD) : 10,000+ cloudLibrary : 100,000+ SDA Collegiate Quarterly : 500+ Sabbath School : 100,000+ Cypress College Library : 100+ Stats Royale for Clash Royale : 1,000,000+ GATE 21 years CS Papers(2011-2018 Solved) : 50+ Learn CT Scan Of Head : 5,000+ Easy Cv maker 2018 : 10,000+ How to Write CV : 100,000+ CW Nuclear : 1,000+ CY Spray nozzle : 10+ BibleRead En Cy Zh Yue : 5+ CZ-Help : 5+ Modlitební knížka CZ : 500+ Guide for DB Xenoverse : 10,000+ Guide for DB Xenoverse 2 : 10,000+ Guide for IMS DB : 10+ DC HSEMA : 5,000+ DC Public Library : 1,000+ Painting Lulu DC Super Friends : 1,000+ Dictionary : 10,000,000+ Fix Error Google Playstore : 1,000+ D. H. Lawrence Poems FREE : 1,000+ Bilingual Dictionary Audio App : 5,000+ DM Screen : 10,000+ wikiHow: how to do anything : 1,000,000+ Dr. Doug's Tips : 1,000+ Bible du Semeur-BDS (French) : 50,000+ La citadelle du musulman : 50,000+ DV 2019 Entry Guide : 10,000+ DV 2019 - EDV Photo & Form : 50,000+ DV 2018 Winners Guide : 1,000+ EB Annual Meetings : 1,000+ EC - AP & Telangana : 5,000+ TN Patta Citta & EC : 10,000+ AP Stamps and Registration : 10,000+ CompactiMa EC pH Calibration : 100+ EGW Writings 2 : 100,000+ EGW Writings : 1,000,000+ Bible with EGW Comments : 100,000+ My Little Pony AR Guide : 1,000,000+ SDA Sabbath School Quarterly : 500,000+ Duaa Ek Ibaadat : 5,000+ Spanish English Translator : 10,000,000+ Dictionary - Merriam-Webster : 10,000,000+ JW Library : 10,000,000+ Oxford Dictionary of English : Free : 10,000,000+ English Hindi Dictionary : 10,000,000+ English to Hindi Dictionary : 5,000,000+ EP Research Service : 1,000+ Hymnes et Louanges : 100,000+ EU Charter : 1,000+ EU Data Protection : 1,000+ EU IP Codes : 100+ EW PDF : 5+ BakaReader EX : 100,000+ EZ Quran : 50,000+ FA Part 1 & 2 Past Papers Solved Free – Offline : 5,000+ La Fe de Jesus 
: 1,000+ La Fe de Jesús : 500+ Le Fe de Jesus : 500+ Florida - Pocket Brainbook : 1,000+ Florida Statutes (FL Code) : 1,000+ English To Shona Dictionary : 10,000+ Greek Bible FP (Audio) : 1,000+ Golden Dictionary (FR-AR) : 500,000+ Fanfic-FR : 5,000+ Bulgarian French Dictionary Fr : 10,000+ Chemin (fr) : 1,000+ The SCP Foundation DB fr nn5n : 1,000+
Although a few very popular apps (Google Play Books, Bible, Amazon Kindle) skew the outcome, the picture is much the same as with the Apple data set: many of the apps in this category seem to have taken a popular book and made it available on a mobile device, possibly with some extra features.
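To get a feel for how strongly those few giants pull the average upward, here is a minimal sketch that compares the mean number of installs with and without the most popular apps. It uses only a handful of rows copied from the output above, not the full category. The helper names `parse_installs` and `average_installs` and the cut-off of 100 million installs are assumptions made for illustration.

```python
# A few (name, installs) pairs copied from the output above.
# This is a small illustrative sample, not the whole category.
sample = [
    ("Google Play Books", "1,000,000,000+"),
    ("Bible", "100,000,000+"),
    ("Amazon Kindle", "100,000,000+"),
    ("Wikipedia", "10,000,000+"),
    ("Cool Reader", "10,000,000+"),
    ("Book store", "1,000,000+"),
    ("Litnet - E-books", "100,000+"),
    ("C Offline Tutorial", "1,000+"),
]


def parse_installs(text):
    """Convert an installs string such as '10,000+' to a float."""
    return float(text.replace(",", "").replace("+", ""))


def average_installs(rows, cap=None):
    """Return the mean installs, optionally ignoring apps at or above `cap`."""
    values = [parse_installs(installs) for _, installs in rows]
    if cap is not None:
        # Hypothetical cut-off: drop the apps that dominate the average.
        values = [value for value in values if value < cap]
    return sum(values) / len(values)


avg_all = average_installs(sample)
avg_capped = average_installs(sample, cap=100_000_000)

print("Average installs, all sampled apps:", avg_all)
print("Average installs, apps below 100M:", avg_capped)
```

On this small sample, dropping the three biggest apps lowers the average by more than an order of magnitude, which is why the raw average is best read alongside the individual install counts.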
Time to wrap up our analysis.
In this project we analyzed app profile data from both the App Store and Google Play to understand which apps are most profitable. The goal of the project was to help developers understand which types of apps are likely to attract more users.
Our analysis shows that, within the segment of free English-language apps, the Books & Reference genre shows potential in both markets. Our suggestion is to take a popular book and turn it into an app, adding functionality for interaction, learning, quick reference or personal annotations.