#!/usr/bin/env python
# coding: utf-8

# # Profitable App Profiles for the App Store and Google Play Markets

# ## Introduction

# We are working as data analysts for a company that builds *Android* and *iOS mobile apps*. The company builds only *free apps* (free to download and install), and its main revenue comes from *in-app ads*. Revenue therefore depends on the number of users: the more users see and engage with the ads, the higher the revenue. Our aim is to help our developers understand what types of apps attract more users. We end with a list of app genres that look profitable on both the *App Store* and *Google Play*.

# * Dataset containing ~ 10,000 Android apps from Google Play Store __[link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)__
#
# * Dataset containing ~ 7,000 iOS apps from App Store __[link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)__

# ## Exploring the Data for Better Comprehension

# In[1]:


# Open both CSV files, read the data, and convert each into a list of lists.
# The encoding is given explicitly because the app names contain non-ASCII characters.
import csv

with open('AppleStore.csv', encoding='utf8') as open_Apple:
    read_Apple = csv.reader(open_Apple)
    data_Apple = list(read_Apple)

with open('googleplaystore.csv', encoding='utf8') as open_google:
    read_google = csv.reader(open_google)
    data_google = list(read_google)


# In[2]:


# Define a function 'explore_data()' which prints the rows in dataset[start:end]
# and, optionally, the number of rows and columns of the dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    # Loop through the selected rows
    for row in dataset_slice:
        print(row)
        print('\n')  # blank lines between rows

    if rows_and_columns:
        print('Number of rows', len(dataset))
        print('Number of columns', len(dataset[0]))


# In[3]:


explore_data(data_Apple, 1, 4)  # explore the first three rows of the App Store data


# In[4]:


explore_data(data_google, 1, 4)  # explore the first three rows of the Google Play data


# In[5]:


# Print the number of rows and columns
# for the App Store
print('Number of rows', len(data_Apple[1:]))
print('Number of columns', len(data_Apple[0]))


# In[6]:


# Print the number of rows and columns for Google Play
print('Number of rows', len(data_google[1:]))
print('Number of columns', len(data_google[0]))


# In[7]:


# Print the header row for the App Store
print('AppleStore column names:', data_Apple[0])
print('Required Columns: rating_count_tot, user_rating, cont_rating')

# Link to the original data


# `For more clarity check:` [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

# In[8]:


# Print the header row for Google Play
print('googleplaystore column names:', data_google[0])
print('Required Columns: Rating, Reviews, Installs, Content Rating')

# Link to the original data


# `For more clarity check:` [link](https://www.kaggle.com/lava18/google-play-store-apps)

# **Above we explored both datasets by:**
# * Printing the *number of rows and columns*.
# * Looking at the *header row*.
# * Looking at the *body* of the data.

# ## Cleaning the Data for Ease of Analysis

# We are going to check whether any row in the *Google Play* data has missing fields. To do so, we check whether the *length of any row differs from the length of the header row*. We will delete such rows.
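
# A minimal sketch of the row-length check described above, run on a small made-up dataset rather than the real CSV data. The helper name `find_malformed_rows` and the sample rows are assumptions for illustration only. It uses `enumerate()` so that each mismatched row is reported together with its own position in the dataset.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_len = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_len]


# Hypothetical toy data: the row at index 2 is missing one field
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

# The returned indices count the header row as index 0, so they can be passed directly to `del dataset[index]`.
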
# In[9]:


# Select the header row of the Google Play data
header_google = data_google[0]
header_len = len(header_google)

# Loop over the data rows, keeping track of each row's index in data_google
for index, row in enumerate(data_google[1:], start=1):
    row_len = len(row)
    # If the row length differs from the header length,
    # print the row and its index
    if row_len != header_len:
        print(row)
        print(index)


# In[10]:


# Delete the row found above. Run this cell only once:
# running it again would delete a valid row.
del data_google[10473]


# In[11]:


print(data_google[10473])  # check that the malformed row has been deleted


# We will perform the same check on the *App Store* data as well.

# In[77]:


header_Apple = data_Apple[0]
header_len = len(header_Apple)
no_row_len = 0  # number of rows whose length differs from the header

for row in data_Apple[1:]:
    row_len = len(row)
    if row_len != header_len:
        no_row_len += 1

if no_row_len == 0:
    print("There are no rows with missing data")
else:
    print(no_row_len)


# We found one row with missing data in the *Google Play* data and deleted it. There are no rows with missing data in the *App Store* data.

# ## Removing Duplicate Entries for Android Store: Part 1

# Here we are going to find out whether there are any duplicate apps in the *Google Play* data.

# In[13]:


# Loop through the Google Play data
for app in data_google:
    name = app[0]
    if name == 'Facebook':  # we use 'Facebook' to check for duplicate entries
        print(app)


# Below we calculate the number of duplicate apps in the *Google Play* data.

# In[14]:


duplicate_apps = []  # names seen more than once
unique_apps = []     # names seen at least once

# Loop through the data and append each app name to the relevant list
for app in data_google[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

# Calculate the number of duplicate apps
print('Number of duplicate apps:', len(duplicate_apps))


# Removing the duplicate rows manually (or at random) is a cumbersome and laborious process, so we should come up with a programmatic way to carry it out.
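
# The duplicate count in the cell above tests membership in a list, which gets slower as the list grows. A minimal sketch of the same count using a `set` for the membership test, run on made-up app names (the helper name `count_duplicates` and the sample names are assumptions, not part of the original analysis):

```python
def count_duplicates(names):
    """Return (number of repeated occurrences, number of distinct names)."""
    seen = set()
    duplicates = 0
    for name in names:
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates, len(seen)


# Hypothetical app names: 'Facebook' appears three times, 'Slack' twice
names = ['Facebook', 'Slack', 'Facebook', 'Maps', 'Slack', 'Facebook']
n_duplicates, n_unique = count_duplicates(names)
print('Number of duplicate apps:', n_duplicates)
print('Number of unique apps:', n_unique)
```
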
#
# **Here are a few methods we can implement:**

# `Option1:` Keep the entry with the highest number of reviews (column 4), since the highest review count should belong to the most recent record, and remove the other duplicates.

# `Option2:` Keep the entry with the highest number of installs (column 6), since it should be the most recent record, and remove the other duplicates.

# `Option3:` Keep the entry with the latest *last updated* date (column 11), which is the most recently updated record, and remove the other duplicates.

# `Option4:` Keep the entry with the latest current version (column 12), since it should be more recent than the others.

# **Here we are going to use the first method**
#
# * We will create a dictionary called *reviews_max*, where the *key* is the *app name* and the *value* is *max_reviews* (i.e. the maximum number of reviews recorded for that app).
#
# * We will find the length of the dictionary in order to cross-check the answer:
# `10840(total apps) - 1181(duplicate apps) = 9659`
#
# * We will create a list called *android_clean*, to which we add the complete row of each app with its maximum number of reviews.
#
# * We will create a list called *already_added*, to which we add the names of the apps already included in the android_clean list. (This extra bookkeeping handles the case where more than one duplicate entry of an app has the same maximum number of reviews.)

# ## Removing Duplicate Entries from the Android Store: Part 2

# Below we apply the method described above.

# In[15]:


reviews_max = {}  # maps each app name to its highest review count

# Loop through the rows and add the keys and values to the dictionary
for row in data_google[1:]:
    name = row[0]
    n_reviews = float(row[3])  # convert the review count from string to float

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


# In[16]:


# Compare the length of the dictionary with the expected length
print('Expected length:', 9659)
print('Actual length:', len(reviews_max))


# The length of the dictionary *reviews_max* exactly matches the expected length.
# Below we use the *reviews_max* dictionary to remove the duplicate rows.

# In[17]:


android_clean = []  # cleaned rows, one per app
already_added = []  # names of the apps already added to 'android_clean'

# Loop through the rows
for row in data_google[1:]:
    name = row[0]
    n_reviews = float(row[3])

    # Append the row that has the maximum number of reviews to 'android_clean'
    # and record its name in 'already_added' so that it is added only once
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)


# In[18]:


# Check the method by looking at the cleaned data
# and at the number of rows
print(android_clean[:3])
print('Number of rows:', len(android_clean))


# As expected, we get *9659* rows.

# ## Removing Non-English Apps: Part 1

# The company we are working for builds apps for an English-speaking audience, so we need only English apps for our analysis. We will remove all other apps.
#
# First we define a function which uses the *ord()* function to check whether an app name contains only characters commonly used in English text (the ASCII range, 0 to 127).
#
# Below we define a function *english_app()*, with *string* as a parameter. It checks whether the given app name is English or not.

# In[19]:


def english_app(string):
    # Loop through the string: a single non-ASCII character is enough
    # to reject the name
    for character in string:
        if ord(character) > 127:
            return False
    # Every character was inside the ASCII range
    return True


# In[20]:


print(english_app('Instagram'))


# In[21]:


print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# In[22]:


print(english_app('Docs To Go™ Free Office Suite'))


# In[23]:


print(english_app('Instachat 😜'))


# To check whether the function gives the expected outcomes, we ran it on *English* and *non-English* app names. It returns True only for names made up entirely of ASCII characters, so the last two names, which are English but contain '™' and an emoji, are rejected as well.

# ## Removing Non-English Apps: Part 2

# If we use the above function, we may lose some English apps along with the non-English ones. So we are going to define one more function, very similar to the earlier one.
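
# To see why a strict ASCII check rejects some English names, it helps to look at the code points that `ord()` returns for the offending characters. A small sketch using names tested above (the helper name `non_ascii_characters` is an assumption for illustration):

```python
def non_ascii_characters(string):
    """Return (character, code point) pairs for characters outside the ASCII range (0-127)."""
    return [(character, ord(character)) for character in string if ord(character) > 127]


print(non_ascii_characters('Instagram'))                      # plain ASCII name
print(non_ascii_characters('Docs To Go™ Free Office Suite'))  # contains a trademark sign
print(non_ascii_characters('Instachat 😜'))                   # contains an emoji
```

# Both English names contain only one character above 127, which is why counting such characters, instead of rejecting a name on the first one, keeps them.
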
# Here we remove an app only if its name has more than three characters whose code points fall outside the ASCII range. This means that an app name with up to three emoji or other special characters is still labelled as an English app.

# In[24]:


# Define a function E_A() with string as a parameter
def E_A(string):
    ord_list = 0  # number of non-ASCII characters found so far

    # Loop through the string; if the character's code point is greater
    # than 127, increment 'ord_list'
    for character in string:
        if ord(character) > 127:
            ord_list += 1

    # If the value of 'ord_list' is greater than 3 the function
    # returns False, otherwise True
    if ord_list > 3:
        return False
    else:
        return True


# In[25]:


print(E_A('Docs To Go™ Free Office Suite'))


# In[26]:


print(E_A('Instachat 😜'))


# In[27]:


print(E_A('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# Above we checked the function on a few app names and it works properly.

# Now we are going to apply the above function to both the *Google Play* and the *App Store* data.

# In[28]:


android_English_App = []  # create an empty list

# Loop through the cleaned data and append each row
# with an English app name to the list above
for row in android_clean:
    app = row[0]
    if E_A(app):
        android_English_App.append(row)


# Apply the function *explore_data()* to the list *android_English_App* and observe the results.

# In[29]:


explore_data(android_English_App, 0, 4, rows_and_columns=True)


# Repeat the same process on the *App Store* data.

# In[30]:


apple_English_App = []  # create an empty list

# Loop through the data and append each row
# with an English app name to the list above
for row in data_Apple[1:]:
    app = row[1]
    if E_A(app):
        apple_English_App.append(row)


# Apply the *explore_data()* function to the list *apple_English_App* and observe the data.

# In[31]:


explore_data(apple_English_App, 0, 4, rows_and_columns=True)


# We have successfully removed the non-English apps from both the *Google Play* and the *App Store* data.

# ## Isolating the Free Apps from the Paid Apps

# The last step of the data cleaning is to isolate the free apps from the paid
# apps. As mentioned at the beginning, the company is only interested in free apps, and its main revenue comes from in-app ads.
#
# Below we separate the free apps from the paid apps for both the *Google Play* and the *App Store* data.

# In[32]:


# Create empty lists
android_free_app = []
ios_free_app = []

# Loop through the data and append the rows with zero price to the appropriate list
for row in android_English_App:
    price = row[7]
    if price == '0':
        android_free_app.append(row)

for row in apple_English_App:
    price = row[4]
    # Exploration of the data shows that for iOS
    # a zero price is listed as '0.0'
    if price == '0.0':
        ios_free_app.append(row)


# In[33]:


explore_data(android_free_app, 0, 3, rows_and_columns=True)


# In[34]:


explore_data(ios_free_app, 0, 3, rows_and_columns=True)


# We applied the *explore_data()* function to both lists and observed the number of rows and columns.

# ## Most Common Apps by Genre: Part 1

# Our goal is to find an app profile that attracts users on both the *App Store* and *Google Play*. Once we identify such a profile, we would like to validate our recommendation by first building a new app fitting this profile on one of the platforms (say Android), observing its usage and, if it is successful, porting the app to the other platform.
#
# **Our validation strategy for an app idea has 3 steps:**
#
# 1) Build a minimal Android version of the app and add it to *Google Play*.
#
# 2) If the app has a good response from users, develop it further.
#
# 3) If the app is profitable after six months on *Google Play*, build an iOS version of the app and add it to the *App Store*.
#
#
# **For generating frequency tables to find out the most common genres we can use:**
#
# For Google Play: `Column 2 (Category)` and `Column 10 (Genres)`
#
# For the App Store: `Column 12 (prime_genre)`
#
#

# ## Most Common Apps by Genre: Part 2

# First we are going to define a function which creates a frequency table.
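
# As a compact point of comparison, a percentage frequency table can also be sketched with `collections.Counter` from the standard library. This is an alternative technique, not the function defined in this project; the name `freq_table_counter` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of the rows, as a percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows: [app name, genre]
rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Weather'],
    ['App D', 'Games'],
]

table = freq_table_counter(rows, 1)
# Print the genres from the most to the least common
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```
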
# The frequency table built by the function below shows the percentage of each genre.

# In[35]:


# Create a function called freq_table() which takes
# dataset and index as parameters
def freq_table(dataset, index):
    dict_freq = {}
    total = 0

    # Loop through the rows, counting the total number of rows and
    # building a table with the genre as key and
    # the number of apps in that genre as value
    for row in dataset:
        total += 1
        genre = row[index]
        if genre in dict_freq:
            dict_freq[genre] += 1
        else:
            dict_freq[genre] = 1

    # Build a second table with the same keys and, as value,
    # each count expressed as a percentage of the total number of rows
    dict_freq_percentage = {}
    for value in dict_freq:
        percentage = (dict_freq[value] / total) * 100
        dict_freq_percentage[value] = percentage

    return dict_freq_percentage


# We will create one more function called *display_table()*, which displays the genre percentages in descending order.

# In[36]:


# Create a function display_table() with dataset and index as parameters
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []  # create an empty list

    # Loop through the keys of the table and append a
    # (percentage, key) tuple to the list above
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    # Sort the list in descending order of percentage
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


# Below we apply the *display_table()* function to *'android Category'*, *'android Genre'* and *'ios prime_genre'* and observe the output.

# In[37]:


# display_table() prints the table itself and returns None,
# so it is called directly instead of being wrapped in print()
print('android Category:')
display_table(android_free_app, 1)
print('\n')
print('android Genre:')
display_table(android_free_app, 9)
print('\n')
print('ios prime_genre:')
display_table(ios_free_app, 11)


# ## Most Common Apps by Genre: Part 3

# **Our analysis of the frequency table generated for prime_genre of the App Store data set:**
#
# * The frequency table shows that the most
#   common genre is *Games* (more than 50%), and the next most common genre is *Entertainment*.
#
# * The majority of the apps (more than 70%) are designed for *entertainment* rather than for *practical purposes*.
#
# * It is premature to recommend an app profile based on the above frequency table, as this table is built from the *app genre* and not from any kind of user information.
#
#
# **Our analysis of the frequency tables generated for the Category and Genres columns of the Google Play data set:**
#
# * For *Google Play* the most common genres are *Entertainment* and *Tools*. Among the *categories*, *Family* with *~ 18%* is at the top of the table. Among the *genres*, *Tools* with *~ 8%* is at the top of the table.
#
# * Here the shares of *practical purpose apps* and *entertainment purpose apps* are almost equal, unlike in the *App Store*.
#
# * As in the case of the *App Store*, it is not possible to recommend an app profile for *Google Play* from these tables either, as they are based on the *app genre/category* and not on any user information.

# ## Most Popular Apps by Genre on Apple Store

# Below we list each *genre* and its *average number of user ratings* for the *App Store*.

# In[38]:


# Get the unique genres from the frequency table of the 'prime_genre' column
unique_genre = freq_table(ios_free_app, 11)

# Loop through the unique genres
for genre in unique_genre:
    total = 0      # sum of the user rating counts for this genre
    len_genre = 0  # number of apps in this genre

    # Loop through ios_free_app and add up the rating counts
    # of the apps that belong to this genre
    for row in ios_free_app:
        genre_app = row[11]
        if genre_app == genre:
            rating_counts = float(row[5])
            total += rating_counts
            len_genre += 1

    # Calculate the average rating count
    avg_rating_counts = total / len_genre
    print(genre, ':', avg_rating_counts)


# Here we recommend an app profile for the *App Store* based on the number of user ratings.
#
# The averages above show that *five* genres have an average of *>50000* user ratings.
# We list them in descending order:
#
# * Navigation
# * Reference
# * Social Networking
# * Music
# * Weather
#
# A few other genres with an average rating count of *>30000* are listed below in descending order:
#
# * Book
# * Food & Drink
# * Finance
#
# Some genres with an average of *<30000* are listed below:
#
# * Travel
# * Photo & Video
# * Shopping

# Below we list the *Navigation* apps with their respective rating counts for the *App Store*, as this genre has the highest average rating count of *~86000*.

# In[39]:


for row in ios_free_app:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])


# ## Most Popular Apps by Genre: Google Play

# Below we list each *category* and its *average number of installs* for *Google Play*.

# In[40]:


# Get the unique categories from the frequency table of the 'Category' column
unique_app_genre_android = freq_table(android_free_app, 1)

# Loop through the unique categories
for category in unique_app_genre_android:
    total = 0         # sum of the installs for this category
    len_category = 0  # number of apps in this category

    # Loop through the free Android apps and select the apps of this category
    for row in android_free_app:
        category_app = row[1]
        if category_app == category:
            # Installs are stored as strings such as '1,000,000+':
            # strip the '+' and ',' characters before converting to float
            n_installs = row[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs  # add the installs to the total
            len_category += 1    # count the apps in this category

    # Calculate the average number of installs and print it
    avg_no_n_installs = total / len_category
    print(category, ':', avg_no_n_installs)


# Here we recommend an app profile for *Google Play* based on the number of user installs.
#
# The averages above show that *nine* categories have an average of *> 10000000* installs. We list them in descending order.
#
# * Communication
# * Video-Players
# * Social
# * Photography
# * Productivity
# * Game
# * Travel & Local
# * Entertainment
# * Tools

# Below we list a few *Communication* apps (a category which has a *~ 38456119* average number
# of installs) on *Google Play*.

# In[41]:


# Only the first 290 rows are scanned, to keep the list short
for row in android_free_app[:290]:
    if row[1] == 'COMMUNICATION':
        print(row[0], ':', row[5])


# Based on our analysis and observation, we have come up with a list of app genres common to both the *App Store* and *Google Play*.
#
# * Social/Social Networking
# * Photography/Photo & Video
# * Travel & Local
# * Entertainment/Music
# * Book & Reference
# * Finance
# * Communication
# * Games
# * Food & Drink

# ## Conclusion
#
# In this project we analyzed app data from the *App Store* and *Google Play*. Our goal was to recommend free app profiles that can be profitable on both markets. We have come up with a list of app genres that can be profitable on both stores. They are listed below:
#
# * Social/Social Networking
# * Photography/Photo & Video
# * Travel & Local
# * Entertainment/Music
# * Book & Reference
# * Finance
# * Communication
# * Games
# * Food & Drink
#