#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Profitable App Profiles for App Store and Google Play
#
# As data analysts for a company that builds Android and iOS mobile apps, we aim to find mobile app profiles that are potentially profitable for both the App Store and Google Play markets. This project enables our team of developers to make data-driven decisions about the kinds of apps they build.
#
# As our company only builds apps that are free to download and install, our main source of revenue is in-app advertising. Since advertisers are heavily influenced by the number of users of any given app, the focus of our project is to analyze the data and identify the kinds of apps that are likely to attract more users.

# ## Opening and Exploring the Data
#
# As of September 2018, approximately 2 million iOS apps were available on the App Store and approximately 2.1 million Android apps on Google Play.
#
# Before committing a significant amount of time and money to collecting data for over four million apps, we will try to:
# 1) analyze a sample of data for a start; and
# 2) look for any relevant existing data at no cost.
#
# For our purpose, the following sets of data are deemed suitable:
#
# - A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data on approximately ten thousand Android apps from Google Play, which can be downloaded directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
# - A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data on approximately seven thousand iOS apps from the App Store, also downloadable directly from this other [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
#
# We shall begin by opening the two data sets, before moving on to explore the data.

# In[1]:


from csv import reader


def open_dataset(file_name):
    # Read the whole CSV file into a list of rows (each row is a list of strings).
    # The explicit encoding is an assumption that both files are UTF-8 encoded.
    with open(file_name, encoding='utf8') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)
    return data


# The open_dataset() function is the first of many that we'll be writing while exploring the data sets. These functions will allow us to look at the data from different perspectives repeatedly and with less effort.
#
# After opening the data sets, we separate the headers from the bulk of the data and give each part an identifiable variable name. This gives us the number of columns (based on the headers) and the number of rows (of the data bodies). The headers also tell us what type of information the data bodies hold, and it is the latter that we are keen on exploring.

# In[2]:


android_dataset = open_dataset('googleplaystore.csv')
android_header = android_dataset[0]
android_data = android_dataset[1:]

ios_dataset = open_dataset('AppleStore.csv')
ios_header = ios_dataset[0]
ios_data = ios_dataset[1:]

print(android_header, '\n')
print("Number of rows in Google's dataset: " + str(len(android_data)) + " (excluding header)")
print("Number of columns: " + str(len(android_header)), '\n')

print(ios_header, '\n')
print("Number of rows in Apple's dataset: " + str(len(ios_data)) + " (excluding header)")
print("Number of columns: " + str(len(ios_header)))


# The Google Play data set has 10841 rows and 13 columns of apps data. At a quick glance, columns that might be useful for analyses are 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating' and 'Genres'.
#
# The Apple Store data set has 7197 rows and 16 columns of apps data.
# Useful columns might include 'track_name', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'cont_rating' and 'prime_genre'.
#
# Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

# ## Deleting Row(s) with Wrong Data
#
# The Google Play data set has a dedicated [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and within [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) an error in row 10472 has been outlined.
#
# To scan both sets of data, we use another function that helps us locate rows with missing column data.

# In[3]:


def row_check(file_header, file_data):
    # Collect the index of every row whose length differs from the header's.
    # enumerate() gives the true position even if an identical row occurs earlier.
    length_header = len(file_header)
    error_rows = []
    for index, row in enumerate(file_data):
        if len(row) != length_header:
            error_rows.append(index)
    return error_rows


# In[4]:


print("Google's row-error found at: ", row_check(android_header, android_data))
print("Apple's row-error found at: ", row_check(ios_header, ios_data))


# It appears that every row in the Apple Store data set has as many columns as its header. Let's print the flagged row from the Google Play data set and compare it against both its header and another row that is correct.

# In[5]:


print(android_header, '\n')      # header
print(android_data[0], '\n')     # correct row
print(android_data[10472])       # incorrect row


# Row 10472 corresponds to the app "Life Made WI-Fi Touchscreen Photo Frame". Its rating is 19, which is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) section). As highlighted there, the problem is caused by a missing value in the 'Category' column. As such, we'll delete this row.
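# Deleting by a hard-coded index only works once, because every later row shifts up after the deletion. As a minimal sketch of an alternative (not part of the original project), the helper below filters out rows whose length does not match the header, so re-running it is harmless. The name drop_malformed_rows and the toy data are hypothetical and only serve as an illustration.

```python
def drop_malformed_rows(header, rows):
    """Return a new list holding only the rows with as many fields as the header."""
    expected = len(header)
    return [row for row in rows if len(row) == expected]


# Toy data standing in for the real CSV contents (hypothetical values).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '1.9'],             # one field missing, like the row discussed above
    ['Gamma', 'GAME', '4.5'],
]

cleaned = drop_malformed_rows(toy_header, toy_rows)
cleaned_again = drop_malformed_rows(toy_header, cleaned)  # running it twice changes nothing

print(len(toy_rows), len(cleaned), len(cleaned_again))
```

# Because the function returns a new list instead of modifying its input, the original rows stay available for comparison.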

# In[6]:


print(len(android_data))
del android_data[10472]  # don't run this more than once
print(len(android_data))


# ## Removing Rows with Duplicate Entries
#
# ### Part One
#
# Besides rows with missing columns, the data could also contain duplicate entries of the same app, which calls for further cleaning prior to analysis.
#
# For a start, we extract a list of apps from each data set, whereby each app is only listed once.

# In[7]:


def extract(indexed_column, data_set):
    # Return the values of one column as a list
    column_list = []
    for row in data_set:
        app = row[indexed_column]
        column_list.append(app)
    return column_list


android_n_apps = set(extract(0, android_data))
ios_n_apps = set(extract(1, ios_data))
print(len(android_n_apps), len(ios_n_apps))


# There are altogether 9659 unique apps in Google Play and 7195 unique apps in Apple Store.
#
# Next, we use another function to separate each data set into two lists, one with the unique app names and one with the duplicate rows:

# In[8]:


def multiple_entries(indexed_column, data_set):
    duplicate_apps = []
    unique_apps = []
    for row in data_set:
        app_name = row[indexed_column]
        if app_name in unique_apps:
            duplicate_apps.append(row)
        else:
            unique_apps.append(app_name)
    return unique_apps, duplicate_apps


# In[9]:


android_unique, android_duplicates = multiple_entries(0, android_data)
print('Number of duplicate google apps:', len(android_duplicates))
print('Number of unique google apps:', len(android_unique), '\n')

ios_unique, ios_duplicates = multiple_entries(1, ios_data)
print('Number of duplicate apple apps:', len(ios_duplicates))
print('Number of unique apple apps:', len(ios_unique))


# For Google, there are 1,181 cases where an app occurs more than once, e.g.
# the application Instagram has four entries:

# In[10]:


for item in android_data:
    app = item[0]
    if app == 'Instagram':
        print(item, '\n')


# In[11]:


print('Examples of other duplicate apps: ', extract(0, android_duplicates[:15]))


# However, contrary to the observations made in the discussions, where no duplicates had been found for Apple, the multiple_entries() function discovered 2 duplicates in the Apple data set.

# In[12]:


for item in ios_duplicates:
    print(item[1])

for row in ios_data:
    name = row[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print('\n', row)


# To avoid counting certain apps more than once during our analyses, we need to remove the duplicate entries and keep only one entry per app. However, instead of removing the duplicate rows randomly, we can choose a criterion to decide which entries to keep.
#
# Looking at 'Instagram', 'Mannequin Challenge' and 'VR Roller Coaster', the main difference among the duplicates lies in the number of reviews, i.e. the fourth column for Google and the sixth column for Apple.
#
# The different numbers show that the review data was collected at different times. The higher the number of reviews, the more reliable the ratings. As such, we can use reviews as a criterion, keeping the row with the highest number of reviews for each app.

# ### Part Two
#
# To do that, we will:
# - Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
# - Use the dictionary to create a new data set, which will have only one entry per app (based on the app entry with the highest number of reviews)

# In[13]:


def max_reviewers(file_name, index_1, index_2):  # duplicate-free data set based on max no.
    # of reviewers
    reviews_max = {}
    data_cleaned = []
    already_added = []
    for app in file_name:
        name, n_reviews = app[index_1], float(app[index_2])
        if name in reviews_max and (reviews_max[name] < n_reviews):
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
        # Caution: reviews_max is still being filled during this same pass,
        # so the comparison below only reflects the rows seen so far.
        if (reviews_max[name] == n_reviews) and (name not in already_added):
            data_cleaned.append(app)
            already_added.append(name)
    return data_cleaned


# The max_reviewers() function above builds the dictionary reviews_max, which tracks the highest number of reviews per app, and uses it to remove duplicates. In the code cell above:
# - After the dictionary, we initialize two empty lists, data_cleaned and already_added.
# - We loop through the data set, and for every iteration:
#     * We isolate the name of the app and the number of reviews.
#     * We add the current row (app) to the data_cleaned list, and the app name (name) to the already_added list if:
#         * The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
#         * The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

# In[14]:


android_cleaned = max_reviewers(android_data, 0, 3)
ios_cleaned = max_reviewers(ios_data, 1, 5)
print(len(android_cleaned), len(ios_cleaned))


# After removing the duplicates, the number of rows in each data set corresponds to its respective number of unique apps.
#
# However, another disparity with the 'Solutions' was found when comparing the first three rows of the new Android data set.
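# One possible explanation for such a disparity, offered as an assumption rather than a verified diagnosis, is the order of operations in max_reviewers(): the dictionary is filled in the same loop that selects the rows, so a row can be accepted before a later duplicate with more reviews has been seen. The sketch below separates the two steps into two passes. The name keep_most_reviewed and the toy rows are hypothetical and not part of the original project.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app: the first row that carries the app's highest review count."""
    # First pass: find the highest review count for every app name.
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: the dictionary is now complete, so the comparison is reliable.
    cleaned = []
    already_added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned


# Toy rows (hypothetical): 'App A' appears twice, and the later row has more reviews.
toy_rows = [
    ['App A', 'ART', '100'],
    ['App B', 'TOOLS', '50'],
    ['App A', 'FAMILY', '120'],
    ['App B', 'TOOLS', '50'],
]

result = keep_most_reviewed(toy_rows, 0, 2)
for row in result:
    print(row)
```

# In this toy example the single-pass version would keep the 'ART' row of 'App A' because it is encountered first, whereas the two-pass version keeps the 'FAMILY' row with 120 reviews. Whether this accounts for the differences from the 'Solutions' on the real data has not been checked here.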

# In[15]:


for item in android_cleaned[0:3]:
    print(item, '\n')


# The first three apps shown in the 'Solutions' are 'Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps' and 'Sketch - Draw & Paint'.
#
# But the first three apps returned by max_reviewers() include 'Coloring book moana', which is absent from the 'Solutions' output produced with the function explore_data(), i.e. 'explore_data(android_clean, 0, 3, True)'.
#
# It's also worth noting that this app happens to fall under two different categories. Keeping the one with the higher number of reviews, the app will be removed from the 'Family' category.

# In[16]:


for item in android_data:
    app = item[0]
    if app == 'Coloring book moana':
        print(item, '\n')


# ## Removing Non-English Apps
#
# ### Part One
#
# When exploring the data sets, the names of some apps suggest that they are not directed toward an English-speaking audience. A couple of examples from both data sets are shown below:

# In[17]:


print(ios_cleaned[813][1])
print(ios_cleaned[6731][1], '\n')

print(android_cleaned[5326][0])
print(android_cleaned[8938][0])


# As our focus is on English apps, we need to remove the non-English ones.
#
# One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).
#
# All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.
#
# Using the built-in ord() function, which returns the encoding number of a character, we built the function below to check each character of an app name.

# In[18]:


def is_english(string):
    # Any character with a code above 127 falls outside the ASCII range
    for character in string:
        if ord(character) > 127:
            return False
    return True


print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# While the function seems to work fine, some English app names that use emojis or other symbols (™, — (em dash), – (en dash), etc.) will fall outside of the ASCII range. As such, we'll remove useful apps if we use the function in its current form.

# In[19]:


print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))


# ### Part Two
#
# To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters.
#
# To also catch names that have three or fewer non-ASCII characters but are still entirely non-English, we included another condition that compares the number of non-ASCII characters against the total number of characters in the app's name. If the two match, e.g. where two Chinese characters form the whole name of the app, we remove the app too.
#
# The function is still not perfect, and very few non-English apps might get past our filter, but this seems good enough at this point in our analysis.
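# To make the two conditions described above more concrete, here is a small sketch that only counts the non-ASCII characters in a few sample names and prints the count next to the name's length. The helper name count_non_ascii is hypothetical; the sample names are the ones already used in this section.

```python
def count_non_ascii(name):
    """Count the characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for character in name if ord(character) > 127)


samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '豆瓣',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in samples:
    # A name is a removal candidate if the count exceeds three,
    # or if the count equals the full length of the name.
    print(name, '->', count_non_ascii(name), 'non-ASCII of', len(name), 'characters')
```

# Reading the output against the rule: a name with a single ™ or emoji stays, a two-character name made up entirely of non-ASCII characters is removed by the second condition, and a name with many non-ASCII characters is removed by the first.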

# In[20]:


def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    # Reject names with more than three non-ASCII characters, and names
    # made up entirely of non-ASCII characters (e.g. a two-character Chinese name)
    if non_ascii > 3 or (non_ascii <= 3 and non_ascii == len(string)):
        return False
    else:
        return True


# In[21]:


print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('豆瓣'))
print(is_english('教えて!goo'))


# In[22]:


android_english = []
ios_english = []

for app in android_cleaned:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios_cleaned:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

print(len(android_english), len(ios_english))


# We can see that we're left with 9614 Android apps and 6163 iOS apps; the latter is 20 apps fewer than the number shown in the 'Solutions', due to the additional condition.

# ## Isolating the Free Apps
#
# As mentioned in the introduction, our company only builds apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis. Below, we do this for both data sets.

# In[23]:


android_final = []
for app in android_english:
    price = app[7]
    if price == '0' or price == '0.0':
        android_final.append(app)

ios_final = []
for app in ios_english:
    price = app[4]
    if price == '0' or price == '0.0':
        ios_final.append(app)

print(len(android_final), len(ios_final))


# We're left with 8862 Android apps and 3204 iOS apps, which should be adequate for our analysis. Once again, these numbers differ from those in the 'Solutions' (i.e. 8864 for Android and 3222 for iOS).

# ## Most Common Apps by Genre
#
# ### Part One
#
# As mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.
#
# Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.
#
# Let's begin the analysis by getting a sense of the most common genres for each market. For a start, we'll take a look at the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

# In[24]:


android_genres = set(extract(9, android_final))
ios_genres = set(extract(11, ios_final))
android_categories = set(extract(1, android_final))

print('Number of android app genres: ', len(android_genres))
print('Number of android app categories: ', len(android_categories))
print('Number of iOS app genres: ', len(ios_genres))


# For Android apps, the difference between the Genres and the Category columns is not crystal clear. One thing is for sure: the Genres column is much more granular (it has more categories). Since we're only looking for the bigger picture at the moment, we'll only work with the Category column moving forward.

# ### Part Two
#
# We'll build frequency tables for the prime_genre column of the App Store data set and the Category column of the Google Play data set.
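# As an aside, a frequency table of this kind can also be sketched with collections.Counter from the standard library, which counts the values and can return them sorted by count. The function name percentage_table and the toy rows are hypothetical; this is an alternative illustration, not the approach used in the rest of this project.

```python
from collections import Counter


def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, most common value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields the values ordered by descending count
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy rows (hypothetical): app name and genre
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

for genre, percentage in percentage_table(toy_rows, 1):
    print(genre, ':', round(percentage, 3))
```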
#
# Thereafter, to analyze the frequency tables, we'll build:
#
# - One function to generate frequency tables that show percentages
# - Another function that we can use to display the percentages in descending order

# In[25]:


def freq_table(dataset, index):
    # Count how often each value of the column occurs
    table = {}
    total = 0
    for row in dataset:
        total += 1
        item = row[index]
        if item in table:
            table[item] += 1
        else:
            table[item] = 1

    # Convert the counts into percentages of all rows
    percentage_table = {}
    for item in table:
        percentage = (table[item] / total) * 100
        percentage_table[item] = percentage
    return percentage_table


# In[26]:


import operator

ios_gen_ft = freq_table(ios_final, 11)
ios_gen_ft_s = sorted(ios_gen_ft.items(), key=operator.itemgetter(1), reverse=True)
for entry in ios_gen_ft_s:
    print(entry[0], ":", round(entry[1], 3))


# In[27]:


import operator

android_cat_ft = freq_table(android_final, 1)  # Focusing on 'categories' instead of 'genres'.
android_cat_ft_s = sorted(android_cat_ft.items(), key=operator.itemgetter(1), reverse=True)
for entry in android_cat_ft_s:
    print(entry[0], ":", round(entry[1], 3))


# The App Store is seemingly dominated by apps meant for entertainment, while Google Play displays a relatively more balanced landscape of both practical and for-fun apps.
#
# Although Google Play also has many apps designed for fun, it appears to have a good number of apps designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, upon further investigation, we discover that the family category (which accounts for almost 19% of the apps) contains mostly games for kids.
#
# Besides the weighting of the genres by number of apps, we would like to identify the kinds of apps that bring in the most users.

# ## Most Popular Apps by Genre
#
# One way to find out what genres are the most popular (i.e. have the most users) is to calculate the average number of installs for each app genre.
# For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
#
# Below, we calculate the average number of user ratings per app genre on the App Store:

# In[28]:


def average_n_ratings(dataset_1, dataset_2, index_1, index_2):
    # dataset_1: the genres to average over; dataset_2: the app rows
    import operator

    table = {}
    for genre in dataset_1:
        total = 0
        len_genre = 0
        for app in dataset_2:
            app_genre = app[index_1]
            if app_genre == genre:
                n_ratings = float(app[index_2])
                total += n_ratings
                len_genre += 1
        avg_n_rtg = total / len_genre
        table[genre] = avg_n_rtg

    sorted_table = sorted(table.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_table


# In[29]:


ios_anr = average_n_ratings(ios_genres, ios_final, 11, 5)
for entry in ios_anr:
    print(entry[0], ':', f'{entry[1]:,.2f}')


# On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together, skewing the average number:

# In[30]:


for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])  # print name and number of ratings


# The same pattern applies to:
# 1) reference apps, where Bible and Dictionary heavily skewed the average number upwards;
# 2) social networking apps, where the average number is influenced by a few giants like Facebook, Pinterest, and Skype;
# 3) music apps, where a few big players like Pandora, Spotify, and Shazam impacted the average number.
#
# As such, navigation, reference, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold.
#
# We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.

# In[31]:


for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])  # print name and number of ratings


# In[32]:


for app in ios_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])


# In[33]:


for app in ios_final:
    if app[-5] == 'Music':
        print(app[1], ':', app[5])


# The fact that the App Store is dominated by for-fun apps might suggest that the market is a bit saturated with them. In other words, a practical app might have more of a chance to stand out among the huge number of apps on the App Store.
#
# Now let's analyze the Google Play market a bit.

# ## Most Popular Apps by Genre on Google Play
#
# For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

# In[34]:


for k, v in freq_table(android_final, 5).items():
    print(k, ":", v)


# For instance, we wouldn't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. Nevertheless, we don't need perfect precision for our purposes, just an idea of which app genres attract the most users.
#
# We're going to leave the numbers as they are, i.e.
# we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
#
# To perform computations, however, we'll need to convert each install number to float. This means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

# In[35]:


def average_n_installs(dataset_1, dataset_2, index_1, index_2):
    # dataset_1: the categories to average over; dataset_2: the app rows
    import operator

    table = {}
    for category in dataset_1:
        total = 0
        len_category = 0
        for app in dataset_2:
            app_category = app[index_1]
            if app_category == category:
                n_installs = app[index_2]
                # Strip the thousands separators and the trailing plus sign
                n_installs = n_installs.replace(',', '')
                n_installs = n_installs.replace('+', '')
                total += float(n_installs)
                len_category += 1
        avg_n_instl = total / len_category
        table[category] = avg_n_instl

    sorted_table = sorted(table.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_table


# In[36]:


android_ani = average_n_installs(android_categories, android_final, 1, 5)
for entry in android_ani:
    print(entry[0], ':', f'{entry[1]:,.2f}')


# On average, communication apps have the most installs: 38,456,119.
# Once again, this number is heavily skewed upwards by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

# In[37]:


comm_apps_tl = 0
for app in android_final:
    if app[1] == 'COMMUNICATION':
        comm_apps_tl += 1
        print(app[0], ':', app[5])  # print name and number of installs


# In[38]:


print("Total number of apps under the 'Communication' category: ", comm_apps_tl)


# If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:

# In[39]:


under_100m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100m.append(float(n_installs))

print(f'{(sum(under_100m) / len(under_100m)):,.2f}', '\n')
print("Number of 'Communication' apps with under 100 million installs: ", len(under_100m))


# We see the same pattern for:
# 1) the video players category, dominated by apps like YouTube, Google Play Movies & TV, or MX Player;
# 2) social apps (where we have giants like Facebook, Instagram, Google+, etc.);
# 3) photography apps (Google Photos and other popular photo editors);
# 4) productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

# In[40]:


for app in android_final:
    if app[1] == 'VIDEO_PLAYERS':
        print(app[0], ':', app[5])


# The main concern is that these app genres might seem more popular than they really are.
#
# But concerns aside, and notwithstanding the fact that they're dominated by a few giants who are hard to compete against, there might still be ways to break into these niches by thinking outside the box.
#
# Take YouTube for example. Challenging its reign is a new contender, Odysee, a video site launched in December 2020 and created to provide an alternative as "the internet has become “very corporate” with a small number of companies controlling the flow of information". The site was created by the team behind the Lbry (pronounced “library”) blockchain protocol. [Link](https://techcrunch.com/2020/12/07/odysee-launch/)
#
# The game genre is quite popular, albeit a bit saturated, so we'd like to come up with a different app recommendation if possible.
#
# By the same logic, if a genre is not overcrowded (i.e. has a lower weighting) but enjoys a relatively high number of reviews or installs, it could potentially be fertile ground for designing an app.
#
# Let's compare the bottom half of the genres/categories in the sorted frequency table (by share of apps) with the top half of the genres in the table of average reviews/installs.

# In[41]:


for entry in ios_gen_ft_s[-12:]:
    print(entry[0], ":", round(entry[1], 3))
print()
for entry in ios_anr[:12]:
    print(entry[0], ":", f'{entry[1]:,.2f}')


# For the Apple Store, 'Travel', 'Finance', 'Weather', 'Food & Drink', 'Reference', 'Book' and 'Navigation' are the overlapping genres.
#
# Other than 'Reference' (dominated by Bible and dictionaries), 'Navigation' (dominated by Waze and Google Maps) and 'Weather', the other genres could be the "blue oceans" with potentially interesting and creative content that draws more users.
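# The overlap above was read off the two printed lists by eye. As a minimal sketch, the same comparison can be automated by intersecting the tail of one sorted table with the head of the other. The function name overlapping_genres and the toy tables are hypothetical; with the real data, the inputs would be lists of (genre, value) pairs sorted in descending order, like ios_gen_ft_s and ios_anr.

```python
def overlapping_genres(share_sorted, popularity_sorted, n):
    """Return genres that are among the n least common and the n most popular.

    Both inputs are lists of (genre, value) pairs sorted in descending order.
    The result keeps the order of the popularity table.
    """
    least_common = {genre for genre, _ in share_sorted[-n:]}
    return [genre for genre, _ in popularity_sorted[:n] if genre in least_common]


# Toy tables (hypothetical values): share of apps in percent, and average ratings
toy_share = [('Games', 58.0), ('Education', 20.0), ('Weather', 12.0), ('Book', 10.0)]
toy_popularity = [('Weather', 52000.0), ('Games', 23000.0), ('Book', 9000.0), ('Education', 7000.0)]

print(overlapping_genres(toy_share, toy_popularity, 2))
```

# The cut-off n stays a judgment call, just like the choice of 12 in the cell above.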

# In[42]:


for entry in android_cat_ft_s[-16:]:
    print(entry[0], ":", round(entry[1], 3))
print()
for entry in android_ani[0:16]:
    print(entry[0], ":", f'{entry[1]:,.2f}')


# On the other hand, the overlapping categories for Google Play are 'BOOKS_AND_REFERENCE', 'VIDEO_PLAYERS', 'ENTERTAINMENT', 'MAPS_AND_NAVIGATION' and 'WEATHER'.
#
# Putting the last two aside for the time being, 'BOOKS_AND_REFERENCE' and 'ENTERTAINMENT' seem to coincide with 'Travel', 'Food & Drink' (should these two be considered a form of entertainment) and 'Book' of the Apple Store.
#
# It would be interesting to explore these genres in more depth, as they have the potential to be profitable on both the Apple Store and Google Play.

# Quantitatively, based on the data analyses above, the genre to explore first would most likely be 'books'. Qualitatively, we may have to dig deeper, perhaps into the reviews themselves, to discover other hidden gems of insight.

# ## Conclusions
#
# In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.
#
# We concluded that the 'book' genre is definitely worth exploring further. Taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.