The goal of this project is to identify mobile app profiles that are profitable for the App Store and Google Play markets. We are working as data analysts for a company that builds Android and iOS mobile apps, and we assist our team of developers in making data-driven decisions about the kinds of apps they build. To do so, we analyze user attraction to various app genres available on Google Play and the App Store in order to give the developers of free apps a better understanding of the types of apps that attract the most users.
As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.
Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we first look for relevant existing data that is available at no cost. Two data sets seem suitable for our purpose: one for the App Store (AppleStore.csv) and one for Google Play (googleplaystore.csv).
We begin by opening and exploring the two data sets.
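# App Store data set #
from csv import reader
# Read in the data (UTF-8 is specified because some app names contain non-ASCII characters)
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
# Transform read_file into a list of lists
ios = list(read_file)
opened_file.close()
# Separate the header row from the data rows
ios_header = ios[0]
ios = ios[1:]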
# Google Play data set #
from csv import reader
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
# Separate the header row from the data rows
android_header = android[0]
android = android[1:]
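The two loading cells repeat the same steps, so they could be folded into one helper. The sketch below is only an illustration: `load_rows` is a hypothetical name that is not part of the original notebook, and it reads from any file-like object so it can be tried without the CSV files. The sample values are made up.

```python
import io
from csv import reader

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file-like object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for a CSV file (made-up sample values)
sample = io.StringIO("App,Category\nDemo App,TOOLS\nOther App,GAME\n")
header, rows = load_rows(sample)
print(header)
print(len(rows))
```

With the real files, the same helper could be called inside a `with open('AppleStore.csv', encoding='utf8') as f:` block, which also closes the file automatically.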
To explore the data sets, we write a function, explore_data(), and use it to print rows in a readable way. The function calls print('\n') after each row to add blank lines between the output, and it has an option to show the number of rows and columns for any data set.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True) # iOS data set: rows at index 0 through 2 (end index 3 is excluded); True prints the number of rows and columns
The App Store data set has 7197 rows and 16 columns. The columns titled id, track_name, price, rating_count_tot, rating_count_ver, user_rating and prime_genre are useful for this analysis. Details about each column can be found in the data set documentation.
print(android_header)
print('\n')
explore_data(android, 10470, 10478, True)
The Google Play data set has 10841 rows and 13 columns. The columns titled App, Category, Rating, Reviews, Installs, Price and Genres are useful for the purpose of this analysis.
The Google Play data set has a discussion section. In one of the discussions an error in row 10472 is outlined. We will print the row and compare it to the header and a correct row in the data set.
print(android[10472]) # incorrect row
print('\n')
print(android_header) # header
print('\n')
print(android[0]) # a correct row
Row 10472 corresponds to Life Made Wi-Fi Photo Frame. The app has an incorrect rating of 19, which is impossible because the highest available rating for an app on Google Play is 5.0.
print(len(android))
del android[10472] # Deleting the row with the error
print(len(android))
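The faulty row above was found through the data set's discussion section. A programmatic check can complement that, assuming the error shows up as a row with the wrong number of fields (which will not catch every kind of error). The sketch below uses a hypothetical helper, `find_malformed_rows`, and made-up sample rows.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

header = ['App', 'Category', 'Rating']
rows = [
    ['Demo App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],          # one value is missing, so the columns shift
    ['Other App', 'GAME', '4.5'],
]
bad = find_malformed_rows(header, rows)
print(bad)
```

If several indexes are returned, deleting them in reverse order keeps the remaining indexes valid.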
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])
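As a cross-check on the duplicate count, the same tally can be sketched with `collections.Counter` from the standard library. The app names below are made up for illustration; with the real data the list would come from the first column of each row.

```python
from collections import Counter

# Made-up sample of app names (first column of each row)
names = ['Instagram', 'Demo App', 'Instagram', 'Instagram', 'Other App', 'Demo App']

counts = Counter(names)
# Every occurrence beyond the first counts as one duplicate
n_duplicates = sum(count - 1 for count in counts.values())
n_unique = len(counts)
print('Number of duplicate apps:', n_duplicates)
print('Number of unique apps:', n_unique)
```

This counts duplicates the same way as the loop above: an app that appears three times contributes two duplicates.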
When analyzing this data, we need to avoid counting certain apps more than once, so we remove duplicates and retain one entry per app. Instead of removing duplicates randomly, we use the number of reviews to distinguish older entries from more recent ones. Looking above at the Instagram duplicates, the fourth column differs between entries: a higher number of reviews indicates that the entry was collected more recently and is therefore more reliable.
For each app, the row with the highest number of reviews is retained, which eliminates the duplicates. We do this in two steps: first we build a dictionary that maps each app name to its highest number of reviews, then we use that dictionary to build a new data set with one entry per app.
Building the dictionary
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
We previously determined the number of duplicates to be 1181. The expected number of unique apps is therefore the difference between the original number of apps and the number of duplicates, and it should match the actual length of the dictionary.
print('Expected length:', len(android)-1181)
print('Actual length:', len(reviews_max))
With the reviews_max dictionary in place, we use it to remove the duplicates, retaining for each app the entry with the highest number of reviews. The code cell below does this.
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # already_added guards against duplicates that share the same highest review count
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
We can use the explore_data() function to confirm that the number of rows is 9659.
explore_data(android_clean, 0, 3, True)
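The two-step approach above can also be sketched as a single pass that stores the best row per app directly. This is an alternative illustration, not the notebook's method: `dedupe_by_max_reviews` is a hypothetical helper, the sample rows are made up, and the result is ordered by each app's first appearance rather than by the position of its retained row.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when a strictly higher count is found
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: name, category, rating, number of reviews
rows = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Demo App', 'TOOLS', '4.0', '7'],
    ['Instagram', 'SOCIAL', '4.5', '250'],
    ['Instagram', 'SOCIAL', '4.5', '250'],
]
clean = dedupe_by_max_reviews(rows)
print(clean)
```

Because lookups in a dictionary do not scan a list, this version avoids the repeated `name not in already_added` search, which may matter on larger data sets.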
print(ios[813][1])
print(ios[6731][1])
print(android_clean[4412][0])
print(android_clean[7940][0])
The output from the above cell shows that there are some apps whose names are not in English. Such apps need to be removed. Names that contain symbols not commonly used in English text can be detected with the ASCII standard, in which each character commonly used in English text has a corresponding number in the range 0 to 127. Using the built-in function ord(), we can build a function that checks whether every character in an app's name falls within the ASCII range.
def yes_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
print(yes_english('Instagram'))
print(yes_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(yes_english('Docs To Go™ Free Office Suite'))
print(yes_english('Instachat 😜'))
The function works; however some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.
print(ord('™'))
print(ord('😜'))
To prevent data from being lost because of emojis and other symbols, we will only remove an app if its name has more than three characters that are outside the ASCII range.
def yes_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
print(yes_english('Instagram'))
print(yes_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(yes_english('Docs To Go™ Free Office Suite'))
print(yes_english('Instachat 😜'))
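The same rule can be written more compactly, with the threshold exposed as a parameter. This is a sketch only: `is_probably_english` and `max_non_ascii` are hypothetical names, and the default of 3 simply mirrors the threshold chosen above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
```

The filter remains a heuristic: a short non-English name with three or fewer non-ASCII characters still passes, and a non-English name written entirely in ASCII is not detected at all.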
We use yes_english() to filter out non-English apps from both data sets.
android_english = []
ios_english = []
for app in android_clean:
    name = app[0]
    if yes_english(name):
        android_english.append(app)
for app in ios:
    name = app[1]
    if yes_english(name):
        ios_english.append(app)
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
There are now 9614 Android apps and 6183 iOS apps after removing non-English apps.
Next, we isolate the free apps in both data sets to get the final data sets.
free_android = []
free_ios = []
for app in android_english:
    price = app[7]
    if price == '0':
        free_android.append(app)
for app in ios_english:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)
final_android = free_android
final_ios = free_ios
print(len(final_android))
print(len(final_ios))
We now have 8864 Android apps and 3222 iOS apps that qualify for our analysis.
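The free-app filter above compares price strings exactly ('0' for Google Play, '0.0' for the App Store). A numeric comparison is one way to make the check independent of how zero is written. The sketch below is illustrative: `is_free` is a hypothetical helper, and the assumption that paid prices may carry a leading dollar sign (such as '$4.99') is not confirmed by the text above.

```python
def is_free(price):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (the dollar sign is an assumption).
    """
    cleaned = price.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```

With such a helper, the same condition could be used for both data sets instead of two different string literals.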
The goal of this project is to analyze user attraction to various app genres available on Google Play and the App Store in order to give the developers of free apps a better understanding of the types of apps that attract the most users.
To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
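We want to add apps to both the Google Play and App Store platforms. Therefore, we need to ascertain which app genres are successful on both platforms. We start by analyzing the data to find the most common genres on each platform.
To get a sense of the most common genres, we build frequency tables for the prime_genre column of the App Store data set and for the Category and Genres columns of the Google Play data set.
We will analyze the frequency tables by building two functions: freq_table(), which generates a frequency table of percentages for a given column, and display_table(), which prints those percentages in descending order.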
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
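For comparison, the two functions above can be approximated with `collections.Counter`, whose `most_common()` method returns values sorted by count. This is a sketch with a hypothetical helper name, `freq_percentages`, and made-up rows; ties are ordered by first appearance rather than by key, so the ordering of equal percentages may differ from display_table().

```python
from collections import Counter

def freq_percentages(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows: app name, genre
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
table = freq_percentages(rows, 1)
for value, percentage in table:
    print(value, ':', percentage)
```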
Using these functions, we will analyze the frequency tables, starting with the prime_genre column of the App Store data set.
display_table(final_ios, -5)
In the App Store frequency table for prime_genre, the five most common genres among free English apps are Games (58.16%), Entertainment (7.88%), Photo & Video (4.96%), Education (3.66%) and Social Networking (3.28%). These are primarily apps designed for fun. Apps with more practical purposes are far less common in comparison. The table shows only that more fun apps than practical apps are available; it is not an indicator of app usage, so further analysis is required.
display_table(final_android, 1)
The Google Play frequency table for Category indicates that the five most common categories are Family (18.90%), Game (9.72%), Tools (8.46%), Business (4.59%) and Lifestyle (3.90%). This suggests that practical apps are better represented on Google Play than on the App Store, where fun apps dominate.
display_table(final_android, 9)
The Google Play Genres frequency table is more granular than the Category frequency table and shows a more even availability of fun and practical apps. For the purpose of this analysis, the Category column is more suitable because of its broader scope.
To briefly summarize the three tables above, the free English apps available on the App Store are primarily fun apps, while the free English apps available on Google Play are more evenly distributed between fun and practical apps. Recall that the tables do not indicate usage, only availability. To get a better understanding of usage, we will analyze app popularity.
To determine the most popular genres on the App Store, we use the total number of user ratings (rating_count_tot) to compute the average number of user ratings per app in each genre.
ios_genres = freq_table(final_ios, -5)
for genre in ios_genres:
    total = 0
    len_genre = 0
    for app in final_ios:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
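The nested loop above scans the whole data set once per genre. As an illustration of an alternative, the averages can be accumulated in a single pass with `collections.defaultdict`. The helper name `average_by_group` and the sample rows are made up for this sketch.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: app name, genre, number of ratings
rows = [
    ['Maps A', 'Navigation', '300'],
    ['Maps B', 'Navigation', '100'],
    ['Dict', 'Reference', '50'],
]
averages = average_by_group(rows, 1, 2)
for genre, avg in averages.items():
    print(genre, ':', avg)
```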
Navigation, Reference and Social Networking apps have the highest average number of user ratings in the App Store. We will take a closer look at each.
for app in final_ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])
Google Maps and Waze received the highest number of user ratings in the Navigation genre on the App Store. There are only six apps in the Navigation genre, and Waze and Google Maps dominate it. This would preclude us from recommending that the developers consider creating an app in the Navigation genre.
for app in final_ios:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])
The Bible and Dictionary apps have the highest number of user ratings in the Reference genre on the App Store. The list is more diverse than the Navigation list. Most of the apps in the App Store are for fun, so a practical app could stand out. If, for instance, a popular book were turned into an app that contained daily reminders, challenges, quizzes and invites for friends, the app could become a standout on the App Store platform.
for app in final_ios:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])
Facebook, Pinterest, Skype, Messenger and Tumblr received the highest number of user ratings in the Social Networking genre on the App Store. The list is large, but the average number of user ratings in the genre is heavily influenced by those five apps. This, along with the number of apps already in the genre, would preclude us from recommending that the developers consider building a Social Networking app.
For the Google Play data set, we will use the information in the Installs column to determine the average number of installs for each app category. The install numbers are presented as open-ended strings such as 1,000,000+, so the values are not precise: each string implies a lower bound rather than an exact count. For this analysis we only need a rough sense of which apps have the most users, so we can still use the data. The strings will be converted to floats after the comma and plus characters are removed, since leaving them in would cause an error during conversion.
display_table(final_android, 5)
We use a loop to convert each install number to a float by removing the comma and plus characters, and then compute the average number of installs for each category.
cat_android = freq_table(final_android, 1)
for category in cat_android:
    total = 0
    len_category = 0
    for app in final_android:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
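The string cleanup inside the loop above is repeated later in the analysis, so it could be isolated in a small helper. The sketch below uses a hypothetical name, `parse_installs`, and treats each value as the lower bound the string implies.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))
print(parse_installs('500+'))
print(parse_installs('0'))
```

Because '1,000,000+' covers anything from one million up to the next bucket, averages built from these values should be read as rough lower estimates.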
The Communication (38456119) and Video_Players (24727872) categories have among the highest average numbers of installs. We will take a closer look at both.
for app in final_android:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
There are several communication apps that have over 100 million installs, with some surpassing 500 million and even 1 billion. Many of them have such a strong standing among users that it would be difficult for a new app to compete. If those apps are removed, the average number of installs for the Communication category drops roughly tenfold.
under_100_m = []
for app in final_android:
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
sum(under_100_m) / len(under_100_m)
After removing the apps with 100 million or more installs from the Communication category on the Google Play platform, we see that the average number of installs is reduced roughly tenfold. Apps like WhatsApp Messenger, Messenger – Text and Video Chat for Free, and Google Duo - High Quality Video Calls, to name a few, dominate this category.
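The effect described above, where a handful of giant apps pull the average far above what a typical app sees, can be illustrated by comparing the mean with the median. The install counts below are made up for the example and do not come from the data set; `mean` and `median` are from Python's standard `statistics` module.

```python
from statistics import mean, median

# Made-up install counts: one dominant app and four ordinary ones
installs = [1_000_000_000, 500_000, 100_000, 50_000, 10_000]

print('Mean:', mean(installs))
print('Median:', median(installs))

# Dropping the dominant app changes the mean dramatically
without_giant = [n for n in installs if n < 100_000_000]
print('Mean without the giant:', mean(without_giant))
```

The median is much less sensitive to outliers, so reporting it alongside the average is one way to check whether a category is dominated by a few apps.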
for app in final_android:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
The Video_Players category yields a much smaller list. Yet the list is dominated by YouTube and Google Play Movies & TV. This would, again, make it difficult for a new app to compete in this category.
We found that the Reference genre in the App Store is a promising genre for developing a new app. On Google Play, we saw that the Books and Reference category has an average of over 8 million installs. We will analyze this category further.
for app in final_android:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
The book and reference genre has a variety of apps. There are dictionaries, programming tutorials, poetry, library collections, language translations, etc. There are some that have more than 100 million installs.
for app in final_android:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
These five apps are the most popular, but they represent only a small portion of the apps available in the books and reference genre. We will explore apps of midrange popularity (between 1,000,000 and 100,000,000 downloads) to get some ideas of the types of apps that could be considered for development.
for app in final_android:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])
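The filter above lists every install bucket explicitly. A numeric range check is one alternative that does not depend on knowing all bucket labels in advance. The sketch below is illustrative only: `installs_in_range` is a hypothetical helper and the sample rows are made up.

```python
def installs_in_range(rows, low, high, installs_index):
    """Return rows whose install count n satisfies low <= n < high."""
    selected = []
    for row in rows:
        n = float(row[installs_index].replace('+', '').replace(',', ''))
        if low <= n < high:
            selected.append(row)
    return selected

# Made-up rows: app name, installs
rows = [
    ['Tiny Reader', '10,000+'],
    ['Mid Dictionary', '5,000,000+'],
    ['Mid Library', '50,000,000+'],
    ['Giant Books', '1,000,000,000+'],
]
midrange = installs_in_range(rows, 1_000_000, 100_000_000, installs_index=1)
for app in midrange:
    print(app[0], ':', app[1])
```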
This list of midrange apps in the Books and Reference genre primarily contains apps for library collections, dictionaries and software for processing and reading ebooks. Recommending an app that fits in this niche is not a good idea because it already has significant competition.
The Quran has several apps. This indicates that building an app around a popular book can be profitable. An app that turns a book into an app could work on both the Google Play and the App Store platforms.
With the market already full of libraries, special features will need to be added to an app built around any book. These features could include an audio version, a discussion forum, daily quotes and reminders, challenges, quizzes, invites for friends, etc.
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.
We found that building an app around a popular book could be profitable for both the Google Play and the App Store markets. Both markets have established libraries; thus, the app would need to include special features to complement the book. These features could include an audio version, a discussion forum, daily quotes and reminders, challenges, quizzes, invites for friends, etc.