#!/usr/bin/env python
# coding: utf-8

# ## Title:
#
# Profitable Android and iOS Mobile Apps for the App Store and Google Play
#
#
# ## Project Aim:
#
# Our aim in this project is to build a profitable free Android and iOS mobile app that will be available on Google Play and the App Store for users to download and install.

# ## Project Goal:
#
# As a data analyst working for a company that builds Android and iOS mobile apps, my goal for this project is to analyze data to help developers understand what kinds of apps attract more users.

# ## Opening and Exploring the Data Sets
#
#
# * A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
#
#
#
# * A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
#
#
# We will start by opening both data sets with the reusable `open_dataset()` function before exploring further.
#

# In[1]:


## Reusable function to read a CSV file into a list of lists ##
from csv import reader

def open_dataset(file_name, header=True):
    # The encoding is set explicitly because app names contain non-ASCII characters
    with open(file_name, encoding='utf8') as opened_file:
        data = list(reader(opened_file))

    if header:
        return data[0]    # header row only
    elif header is False:
        return data[1:]   # data rows only
    else:
        return data       # header plus data rows (e.g. header=None)


# In[2]:


## Google Play data set ##
Andriod_dataset = open_dataset('googleplaystore.csv', header=None)  # header plus all rows
Andriod_header = open_dataset('googleplaystore.csv', header=True)
Andriod = open_dataset('googleplaystore.csv', header=False)


# In[3]:


## App Store data set ##
IOS_dataset = open_dataset('AppleStore.csv', header=None)  # header plus all rows
IOS_header = open_dataset('AppleStore.csv', header=True)
IOS = open_dataset('AppleStore.csv', header=False)


# **We explore the data sets with the function `explore_data()`, which displays rows and reports the number of rows and columns in a readable way.**

# In[4]:


def explore_data(dataset, start, end, rows_and_columns=True):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')  # adds an empty line between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))  # total rows, independent of start
        print('Number of columns:', len(dataset[0]))


# In[5]:


print(Andriod_header)
print('\n')
explore_data(Andriod, 0, 3, True)


# From the result above, we see that the Android data set has 10841 rows and 13 columns. The columns that could help with our analysis include 'App', 'Rating', 'Reviews', 'Installs', 'Price', and 'Content Rating'.
#
#
# The full list of columns can be inspected by following [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
#
#
#
# Now let's take a look at the App Store data set.

# In[6]:


print(IOS_header)
print('\n')
explore_data(IOS, 0, 3, True)


# We have 7197 iOS apps and 16 columns in this data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.
#
# Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

# ## Deleting Wrong Data
#
# The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.
#
# To find the incorrect row in the data set, we compare the length of each row with the length of the header.
# A row whose length differs from the header's length is incorrect.

# In[7]:


header_length = len(Andriod_header)
for row in Andriod:
    row_length = len(row)
    if row_length != header_length:
        print(row)
        print('\n')
        print(Andriod.index(row))  # index of the incorrect row


# The error in the data set is found at row index 10472. We will print it out to compare it with the header and a correct row before removing it.

# In[8]:


print(Andriod[10472])  # incorrect row
print('\n')
print(Andriod_header)  # header only
print('\n')
print(Andriod[0])      # correct row


# Row 10472 corresponds to the **app** `Life Made WI-Fi Touchscreen Photo Frame`. Its **Category** is `1.9`, which is wrong, and the **Genres** value is missing.
# As a consequence, we'll delete this row.
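# The length check above can also be written with `enumerate`, which yields each row's index directly instead of looking it up afterwards with `list.index()`. This is only a minimal sketch on made-up sample rows; `find_malformed_rows` and the sample data are hypothetical names introduced for illustration and are not part of the original notebook.

```python
def find_malformed_rows(rows, expected_length):
    """Return (index, row) pairs whose length differs from expected_length."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_length]


# Tiny made-up sample standing in for the Google Play rows (hypothetical data)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Row', '1.9'],  # one value short of the header
    ['Coloring Book', 'ART_AND_DESIGN', '3.9'],
]

bad = find_malformed_rows(sample_rows, len(sample_header))
print(bad)

# Delete from the end so that earlier indexes stay valid
for i, _ in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

# Collecting the indexes first and deleting afterwards avoids modifying a list while looping over it, and makes the deletion safe to reason about when more than one row is malformed.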

# In[9]:


print(len(Andriod))
print('\n')
del Andriod[10472]  # Don't run this more than once
print(len(Andriod))


# ## Removing Duplicate Entries
#
# **Stage I**
#
# After looking through the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we observed that there are duplicate entries in the data set.
#
# To confirm this, the following code shows one such app, **Instagram**, which has four entries:

# In[10]:


for app in Andriod:
    name = app[0]
    if name == 'Instagram':
        print(app)


# **To find all duplicate entries, the code below counts them and prints a few examples:**
#
# * We create two empty lists: one for duplicate names (`duplicate_app`) and the other for unique names (`unique_app`)
#
# * We loop through the data set and for every iteration:
#
#     * We retrieve the name of the app and assign it to a variable called `name`
#     * We check whether the name already exists in the unique app list
#     * If it does, we append the name to the duplicate list
#     * If it doesn't, we append the name to the unique list

# In[11]:


duplicate_app = []
unique_app = []

for app in Andriod:
    name = app[0]
    if name in unique_app:
        duplicate_app.append(name)
    else:
        unique_app.append(name)

print('Number of duplicate apps:', len(duplicate_app))
print('\n')
print('Examples of duplicate apps:', duplicate_app[:15])


# In total, there are 1,181 duplicate entries.
#
#
# __Rather than removing duplicates at random, we decided to use a different criterion.__
#
# __If you examine the rows we printed for the Instagram app, the row with the highest number of reviews should be the most recent data.
# Hence we use this criterion to remove the duplicates.__
#
# **Stage II**
#
# To do so:
#
# * We create an empty dictionary named `reviews_max`, where each key is a unique app name and the corresponding value is the highest number of reviews for that app.
#
# * We loop through our data set and for each iteration:
#     * We extract the app name and assign it to the variable `name`
#     * We extract the number of reviews, convert it from string to float, and assign it to the variable `n_reviews`
#
# * If `name` already exists as a key in `reviews_max` **and** `reviews_max[name] < n_reviews`, we update the number of reviews in the `reviews_max` dictionary
#
# * If `name` is not in the `reviews_max` dictionary, we create a new entry where the key is the app name and the value is the number of reviews.
#
#

# In[12]:


reviews_max = {}

for app in Andriod:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


# * We have 10840 rows, of which 1181 are duplicate entries, so the expected number of rows after cleaning is 9659.

# In[13]:


print('Expected length:', len(Andriod) - 1181)
print('Actual length:', len(reviews_max))


# * We will use the `reviews_max` dictionary to remove the duplicate entries from the data set with the following steps:
#
# * We create two empty lists: `andriod_clean`, which stores the cleaned data set as a list of lists, and `already_added`, which stores the name of each app already added to `andriod_clean`. The second list ensures that when several rows for the same app share the same highest number of reviews, only the first one is kept.
#
# * We loop through the data set (without the header) and for each iteration:
#     * We extract the name of the app by index and assign it to the variable `name`
#     * We extract the number of reviews by index and assign it to the variable `n_reviews`
#     * If `n_reviews` equals the highest number of reviews stored for that app, i.e. `reviews_max[name]`, and `name` is not in the `already_added` list:
#         * We append the whole row to the `andriod_clean` list
#         * We append the name of the app to the `already_added` list
#
# * Explore the `andriod_clean` data set

# In[14]:


andriod_clean = []
already_added = []

for app in Andriod:
    name = app[0]
    n_reviews = float(app[3])

    if (n_reviews == reviews_max[name]) and (name not in already_added):
        andriod_clean.append(app)
        already_added.append(name)


# Now let's explore the new data set and confirm that its length is 9659.

# In[15]:


explore_data(andriod_clean, 0, 5, True)


# ## Deleting Non-English Apps
#
# **Stage I**
#
# After exploring both data sets long enough, we found that they both contain apps that are not aimed at an English-speaking audience.
#
# The code below prints a few examples to confirm this:
#

# In[16]:


print(IOS[813][1])
print(IOS[6731][1])
print('\n')
print(andriod_clean[4412][0])
print(andriod_clean[7940][0])


# * We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).
#
# * Each of these characters has a corresponding number in the ASCII range, 0 to 127, which we can get with the built-in `ord()` function. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not.
# If the number is equal to or less than 127, then the character belongs to the set of common English characters.

# In[17]:


def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


# The function seems to work fine, but some English app names use emojis or other symbols (™, em dash, en dash, etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

# In[18]:


print(ord('™'))
print(ord('😜'))


# **Stage II**
#
# **To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.**
#
# **Let's edit the function we created in the previous cell, and then use it to filter out the non-English apps.**

# In[19]:


def is_english(string):
    non_ascii = 0

    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    # Tolerate up to three non-ASCII characters (emoji, ™, dashes)
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# Below, we use the `is_english()` function to filter out the non-English apps for both data sets:

# In[20]:


andriod_english = []
ios_english = []

for app in andriod_clean:
    name = app[0]
    if is_english(name):
        andriod_english.append(app)

for app in IOS:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

explore_data(andriod_english, 0, 3, rows_and_columns=True)
print('\n')
explore_data(ios_english, 0, 3, rows_and_columns=True)


# We can see that we're left with 9614 Android apps and 6183 iOS apps.
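# The cutoff of three non-ASCII characters used above is a judgment call, so it can help to make it a parameter and try a few values. The following is only a minimal sketch under that assumption; `is_mostly_ascii` and `max_non_ascii` are hypothetical names introduced here, not part of the original notebook.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII (0-127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

# Keep only the names that pass the filter with the default cutoff
kept = [name for name in names if is_mostly_ascii(name)]
print(kept)

# With a cutoff of zero the check becomes a strict ASCII test
strict = [name for name in names if is_mostly_ascii(name, max_non_ascii=0)]
print(strict)
```

# With the default cutoff the sketch keeps the three English names and drops the Chinese one; with `max_non_ascii=0` only `'Instagram'` survives, which mirrors the first version of `is_english()`.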

# ## Extracting Free Apps
#
# We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis.

# In[21]:


andriod_final = []
for free_app in andriod_english:
    price = free_app[7]
    if price == '0':
        andriod_final.append(free_app)

print(len(andriod_final))
print('\n')

ios_final = []
for free_app in ios_english:
    price = free_app[4]
    if price == '0.0':
        ios_final.append(free_app)

print(len(ios_final))


# ## Most Attractive Apps by Genre
#
# **Part One**
#
# As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.
#
# To minimize risks and overhead, our validation strategy for an app idea consists of three steps:
#
# * Build a minimal Android version of the app, and add it to Google Play.
# * If the app has a good response from users, we develop it further.
# * If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
#
# Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.
#
# Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll build frequency tables for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

# **Part Two**
#
# We'll build two functions we can use to analyze the frequency tables:
#
# * One function to generate frequency tables that show percentages
# * Another function that we can use to display the percentages in descending order

# In[ ]:


def freq_table(dataset, index):
    frequency_table = {}
    total = 0

    for each_row in dataset:
        total += 1
        app_genre = each_row[index]
        if app_genre in frequency_table:
            frequency_table[app_genre] += 1
        else:
            frequency_table[app_genre] = 1

    freq_percentages = {}
    for key in frequency_table:
        percentages = (frequency_table[key] / total) * 100
        freq_percentages[key] = percentages

    return freq_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    freq_table_tuple = []
    for each_key in table:
        # Put the percentage first so that sorting orders by percentage, not by name
        table_as_tuple = (table[each_key], each_key)
        freq_table_tuple.append(table_as_tuple)

    table_sorted = sorted(freq_table_tuple, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


# **Part Three**
#
# We start by examining the frequency table for the `prime_genre` column of the App Store data set.

# In[27]:


display_table(ios_final, -5)  # prime_genre - App Store


# ## Frequency Table Analysis for the App Store Genres
#
# * In the frequency table generated for the `prime_genre` column of the App Store data set, the most common genre is `Games` with about 58%, followed by `Entertainment` with approximately 8%.
# * Another entertainment-oriented genre, `Photo & Video`, follows with about 5%, while `Education` and `Social Networking` each have about 3%.
# * Overall, about 74% of the apps are designed for entertainment purposes.
#
# Since `Games` is by far the most common genre among free English apps, a game profile looks like a candidate for the App Store. Note, however, that the number of apps in a genre is not the same as the number of users, so this needs to be checked against a popularity measure.
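# The frequency-table logic above can also be sketched with `collections.Counter` from the standard library, whose `most_common()` method already returns entries sorted by count in descending order. This is only an illustrative alternative on made-up rows; `freq_percentages` and the sample data are hypothetical names, not part of the original notebook.

```python
from collections import Counter


def freq_percentages(rows, index):
    """Return (value, percentage) pairs for column `index`, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up rows: app name and genre (hypothetical data)
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

result = freq_percentages(sample, 1)
for genre, percentage in result:
    print(f'{genre}: {percentage:.1f}%')
```

# Because the pairs are ordered by count rather than by genre name, the table reads from the most to the least common genre without a separate sorting step.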

# In[28]:


display_table(andriod_final, 1)  # Category - Google Play


# ## Frequency Table Analysis for the Google Play Category Column
#
# For the `Category` column:
#
# * `FAMILY` is the most common category with about 19%, followed by `GAME` with about 9%
#
# * Categories meant for practical purposes follow, with about 3% each
#
# * Games make up a much larger share of apps on the App Store than on Google Play
#

# In[29]:


display_table(andriod_final, -4)  # Genres - Google Play


# ## Frequency Table Analysis for the Google Play Genres Column
#
# For the `Genres` column:
#
# * There is not much difference between the `Category` and `Genres` columns.
# * One thing we can notice is that the `Genres` column has more categories.
#
# So we'll only work with the `Category` column, since it is less granular and easier to interpret.
#
#
# ## Most Popular Apps by Genre on the App Store
#
# * One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
#
# * Below, we calculate the average number of user ratings per app genre on the App Store:

# In[30]:


unique_genre = freq_table(ios_final, -5)

for genre in unique_genre:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_rating = float(app[5])
            total += n_rating
            len_genre += 1
    avg_rating = total / len_genre  # average number of ratings for this genre
    print(genre, ':', avg_rating)


# On average, navigation apps have the highest number of user ratings. This high figure seems to be driven by just a few apps.
#
# Let's take a look:

# In[31]:


for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])  # app name and number of ratings


# We can see above that the high Navigation numbers are driven by Waze and Google Maps.
# The same applies to other genres such as Social Networking and Music.
# The Reference genre, the runner-up, has 74,942 user ratings on average, and this figure is pushed up by the Bible and dictionary apps.

# In[32]:


for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])  # app name and number of ratings


# From the result above, since we are looking for the most popular genre by number of ratings, the Reference genre, being a practical one, may be worth considering. Where possible, additional features could be built around a Bible or dictionary app to make it more attractive to users.

# Now let us analyze the Google Play data.

# ## Most Popular Apps by Genre on Google Play
#
# For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity.
#
# However, the install numbers don't seem precise enough: we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

# In[33]:


display_table(andriod_final, 5)  # Installs column


# For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes. We only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.
#
# We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
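# The cleaning step described above, and the per-category average that follows from it, can be sketched as two small helpers. This is only a minimal sketch on made-up rows: `parse_installs` and `average_by_key` are hypothetical names introduced for illustration, and the single-pass accumulation with `defaultdict` is an alternative to looping over the data set once per category.

```python
from collections import defaultdict


def parse_installs(text):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(text.replace(',', '').replace('+', ''))


def average_by_key(rows, key_index, value_index):
    """Average the parsed install counts per key in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += parse_installs(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: app name, category, installs (hypothetical data)
sample = [
    ['App A', 'TOOLS', '1,000+'],
    ['App B', 'TOOLS', '5,000+'],
    ['App C', 'GAME', '100,000+'],
]

averages = average_by_key(sample, 1, 2)
print(averages)
```

# Keeping the string cleaning in its own function means the same conversion is applied everywhere the install numbers are used, and a value that cannot be converted still raises a `ValueError` instead of being silently skipped.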

# In[34]:


category_andriod = freq_table(andriod_final, 1)

for category in category_andriod:
    total = 0
    len_category = 0
    for app in andriod_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs.replace(',', ''))
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)


# On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

# In[35]:


for app in andriod_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])  # app name and number of installs


# The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811.
#
# Let's take a look at some of the apps from this genre and their number of installs, as this genre looks potentially profitable:

# In[36]:


for app in andriod_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])  # app name and number of installs


# The books and reference genre includes a lot of software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc.
#
# Let's list the most popular apps, which push up the genre average.

# In[37]:


for app in andriod_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])  # app name and number of installs


# It seems there are only a few very popular apps with `100,000,000+` installs or more, so this genre still shows potential.
#
# Let's look at the apps of intermediate popularity (between 1,000,000 and 100,000,000 downloads):

# In[38]:


for app in andriod_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])  # app name and number of installs


# The result above shows that there are a lot of dictionaries, ebook readers, and similar apps, so building another ebook app would face strong competition.
#
# We also notice that quite a few apps are built around a dictionary, which suggests that a dictionary app that also incorporates translators for different languages could be profitable for both the Google Play and the App Store markets.

# ## Conclusion
#
# After analyzing data for both the App Store and Google Play with the aim of recommending an app profile that can be profitable in both markets, we conclude that combining a popular dictionary with another feature that seems popular, such as a language translator, could be profitable for both the App Store and Google Play markets.
#