#!/usr/bin/env python
# coding: utf-8

# **Project**\
# Analyzing Mobile App Data from Google Play and the App Store
#
# **What:** Build new free apps on Google Play and the App Store.\
# **Goal:** Analyze data to help our developers understand what type of apps are likely to attract more users.

# In[1]:


from csv import reader

# Open the App Store data set
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_header = apple_data[0]    # header assigned to the apple_header variable
apple_data = apple_data[1:]     # excludes the first row, which is the header

# Open the Google Play Store data set
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android_data = list(read_file)
android_header = android_data[0]    # header assigned to the android_header variable
android_data = android_data[1:]     # excludes the first row, which is the header


# In the code above we opened the `csv` files. Now, we'll use the `explore_data()` function to explore the data sets. It's a reusable function that prints rows in a readable way.

# In[2]:


# Explore both data sets using the explore_data() function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print(android_header)  # prints only the header row
print('\n')            # adds a new (empty) line after the header
explore_data(android_data, 0, 3, True)


# [Documentation](https://www.kaggle.com/lava18/google-play-store-apps) for the Google Play Store data set. This data set contains approximately 10,000 Android apps collected in August 2018. [Lavanya Gupta](https://www.kaggle.com/lava18), a Machine Learning Engineer, provided this data set.

# In[3]:


print(apple_header)  # prints only the header row
print('\n')          # adds a new (empty) line after the header
explore_data(apple_data, 0, 3, True)


# [Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for the App Store data set. This data set contains approximately 7,000 iOS apps collected in July 2017. [Ramanathan](https://www.kaggle.com/ramamet4), a Research Engineer, provided this data set.

# Data cleaning is important because we need to make sure the data we're analyzing is accurate; otherwise, our analysis will be inaccurate. Data cleaning is the process of preparing our data for analysis, and it's completed before the analysis.
#
# We need to:
# * Detect inaccurate data, and correct or remove it
# * Detect duplicate data, and remove the duplicates
#
# We're only building apps that are free to download and install, and directed toward an English-speaking audience. This means we need to:
# * Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播
# * Remove apps that aren't free
#
# The [documentation](https://www.kaggle.com/lava18/google-play-store-apps) for the Google Play data set has a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and other users have identified an error in a certain row.

# In[4]:


print(android_data[10472])
print('\n')
print(android_header)


# After reading through the discussions, we need to confirm the index of the row and check whether the row is indeed inaccurate by printing the row at that index. Note that the index may vary depending on whether the user reporting the error kept or removed the header row.
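# As an extra sanity check, we can also scan the whole data set for rows whose column count differs from the header; a row with a missing field will show up this way no matter how its index was counted. This is just a rough sketch that reuses the variables we defined above:

# In[ ]:


# Print the index of any row whose column count doesn't match the header
for index, row in enumerate(android_data):
    if len(row) != len(android_header):
        print('Row', index, 'has', len(row), 'columns; expected', len(android_header))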
# We've identified the index `android_data[10472]` and printed the header row `android_header`. The `Category` column has a value of `1.9`, which is inaccurate. This row is missing its `Category` data, so we need to delete it using `del dataset[index]`.

# In[5]:


print(len(android_data))
del android_data[10472]
print(len(android_data))


# We've printed the length of the data set (header excluded), deleted the inaccurate row, and printed the length again to confirm the row count is one less.

# **Removing Duplicate Apps: Part 1**
#
# After reviewing the Google Play Store Apps [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/136133) board, we found that users have reported duplicate entries of Android apps. For example, the code below demonstrates that Instagram has four entries in the data set.

# In[6]:


print(android_header)
print('\n')

for app in android_data:
    app_name = app[0]
    if app_name == 'Instagram':
        print(app)


# There are 1,181 instances where an app occurs more than once. The code below identifies the duplicate and unique apps and places them in their respective lists. Then, we'll count the duplicate apps and print the first 15 examples.

# In[7]:


duplicate_android_apps = []
unique_android_apps = []

for app in android_data:
    app_name = app[0]
    if app_name in unique_android_apps:
        duplicate_android_apps.append(app_name)
    else:
        unique_android_apps.append(app_name)

print('Number of duplicate android apps:', len(duplicate_android_apps))
print('Number of unique android apps:', len(unique_android_apps))
print('\n')
print('Examples of duplicate android apps:', duplicate_android_apps[:15])


# We don't want to count certain apps more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app.
#
# The main difference between the four Instagram entries was the number of reviews, so we can deduce that the data was collected at different times.
#
# We'll use this as the criterion for removing duplicate apps: the entry with the highest number of reviews holds the most recent data. Rather than removing duplicates randomly, we'll keep only the row with the highest number of reviews and remove the other entries for any given app.

# **Removing Duplicate Apps: Part 2**
#
# Above, we looped through the Google Play data set and found 1,181 duplicate apps. After removing the duplicates we should end up with 9,659 rows:

# In[8]:


print('Expected number of rows:', len(android_data) - 1181)


# To remove duplicates, we will:
# 1. Create a dictionary where each key is a unique app name and the value is the highest number of reviews of that app.
# 2. Use the dictionary we created to remove the duplicate rows.

# In[9]:


reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])

    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews


# In[10]:


print('Expected length of data set:', len(android_data) - 1181)   # subtracting the duplicates we found earlier
print('Actual length of dictionary:', len(reviews_max))


# We'll use the `reviews_max` dictionary created above to remove duplicate rows.
#
# * Start by creating two empty lists: `android_clean` will store our cleaned data, and `already_added` will store only app names.
# * Loop through the `android_data` data set, and for each iteration:
#     * Assign the name of the app to `name` and the number of reviews to `n_reviews`
#     * If `reviews_max[name] == n_reviews` AND `name not in already_added`, append `app` (the entire row) to `android_clean` and the app's `name` to `already_added`
#     * Why the `already_added` list? We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only check `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.
#     * `reviews_max[name] == n_reviews` confirms that the number of reviews of the current app matches the number of reviews of that app in the `reviews_max` dictionary.

# In[11]:


android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])

    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)    # appending the entire row from the android data set
        already_added.append(name)   # appending only the app name


# Using the `explore_data` function we wrote earlier, we can confirm the number of rows equals 9,659 and the number of columns equals 13.
#
# Success!

# In[12]:


explore_data(android_clean, 0, 3, True)


# **Removing Non-English Apps: Part 1**

# In[13]:


def english_apps(string):
    for character in string:
        if ord(character) > 127:
            return False

    return True


print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))


# **Removing Non-English Apps: Part 2**

# English app names such as `'Docs To Go™ Free Office Suite'` and `'Instachat 😜'` returned a `False` output in the code above. Characters like `'😜'` and `'™'` have corresponding numbers over 127. The function above would remove useful data, so we need to mitigate the impact. It won't be perfect, but it will be fairly close.
#
# If an app name has up to three emoji or other special characters, we'll label it as an English app. Otherwise, it will be labeled as a non-English app.
#
# We'll edit the code above to reflect the new criteria.

# In[14]:


def english_apps(string):
    non_ascii = 0

    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 3:
        return False
    else:
        return True


print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))


# Within our function `english_apps`, we assign `0` to the variable `non_ascii`. As the for loop iterates over a string, `non_ascii` is incremented by 1 for every character whose corresponding number is greater than 127. If `non_ascii` is greater than 3, the string contains more than three characters outside the ASCII (0-127) range, so the output is `False`. Otherwise, if `non_ascii` is 3 or fewer, the output is `True`.
#
# `print(english_apps('Docs To Go™ Free Office Suite'))` is `True`
#
# `print(english_apps('Instachat 😜'))` is `True`
#
# Now, we'll use this new function to filter out non-English apps from both data sets. If an app name is identified as English, we'll append the whole row to a separate list.
# In[15]:


android_English_apps = []
iOS_English_apps = []

for app in android_clean:
    name = app[0]
    if english_apps(name):
        android_English_apps.append(app)

for app in apple_data:
    name = app[1]
    if english_apps(name):
        iOS_English_apps.append(app)


# Using the `explore_data` function, we'll explore the new list we just created, `android_English_apps`, to confirm the number of rows and columns.
#
# We now have 9,614 rows.

# In[16]:


explore_data(android_English_apps, 0, 3, True)


# We'll do the same for the iOS apps and confirm we now have 6,183 rows.

# In[17]:


explore_data(iOS_English_apps, 0, 3, True)


# **Isolating Free Apps**
#
# We'll loop through each data set to append the free apps into separate lists. Keep in mind that:
# * The price column sits at a different index in each data set
# * The prices in each data set are stored as strings

# In[18]:


android_free = []
ios_free = []

for app in android_English_apps:
    price = app[7]
    if price == '0.0' or price == '0':
        android_free.append(app)

for app in iOS_English_apps:
    price = app[4]
    if price == '0.0' or price == '0':
        ios_free.append(app)


# Now, explore each data set to confirm the number of rows.

# In[19]:


explore_data(android_free, 0, 3, True)


# In[20]:


explore_data(ios_free, 0, 3, True)


# **Most Common Apps by Genre: Part 1**
#
# The goal is to build a free app that attracts the most users, because our revenue is highly influenced by the number of users. We'll analyze both marketplaces to determine successful app profiles.
#
# To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
# 1. Build a minimal Android version of the app, and add it to Google Play.
# 2. If the app has a good response from users, we develop it further.
# 3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
#
# To generate frequency tables, we'll use the `prime_genre` column in the iOS data set. For the Google Play data set, we'll use the `Category` and `Genres` columns.

# **Most Common Apps by Genre: Part 2**
#
# We'll build two functions we can use to analyze the frequency tables:
# * One function to generate frequency tables that show percentages
# * Another function to display the percentages in descending order

# In[21]:


def freq_table(dataset, index):
    freq_dict = {}
    total = 0

    for row in dataset:
        total += 1
        num = row[index]
        if num in freq_dict:
            freq_dict[num] += 1
        else:
            freq_dict[num] = 1

    freq_dict_percentages = {}
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        freq_dict_percentages[key] = percentage

    return freq_dict_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


# **Most Common Apps by Genre: Part 3**
#
# We'll now focus on analyzing these frequency tables. First, we'll analyze data from `ios_free`.

# In[22]:


display_table(ios_free, -5)


# Among free English apps, Games (58.16%) are the most common applications in the App Store, and Entertainment (7.88%) applications are a distant second. Applications built for entertainment purposes (games, photo & video, social networking, sports, music) dominate in terms of frequency, while applications with practical purposes (education, shopping, utilities, productivity) lag behind in comparison.
#
# Now, we'll analyze `android_free` with the `Category` column.
# In[23]:


display_table(android_free, 1)   # Category column


# The Google Play store is a different story, because there is more of a balance between practical and entertainment applications. FAMILY (18.9%) leads in terms of the number of applications in a category, GAME (9.72%) is in second place, and TOOLS (8.46%) is right behind it.

# In[24]:


display_table(android_free, -4)   # Genres column


# The `Genres` column is more granular than `Category`, but the data still suggests a healthy balance between practical and entertainment applications. Moving forward, we'll use the `Category` column.
#
# Note that so far we've only reviewed the frequency of applications by category. Next, we'll review the popularity of these applications.

# **Most Popular Apps by Genre on the App Store**
#
# One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing from the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
#
# Let's start by calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:
# * Isolate the apps of each genre.
# * Sum up the user ratings for the apps of that genre.
# * Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

# In[25]:


genres = freq_table(ios_free, -5)

for genre in genres:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    average_user_ratings = total / len_genre
    print(genre, ':', average_user_ratings)


# `Games` apps have an average of roughly 22,788 user ratings, and based on the output of the code below, that average is not skewed by just a few applications.

# In[26]:


for app in ios_free:
    if app[-5] == 'Games':
        print(app[1], ':', app[5])


# **Most Popular Apps by Genre on Google Play**
#
# For the Google Play market, we have actual data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough; most values are open-ended (100+, 1,000+, 5,000+, etc.).
#
# For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes; we only want to find out which app genres attract the most users, so we don't need perfect precision with respect to the number of users.
#
# We're going to leave the numbers as they are, which means we'll consider that an app with 100,000+ installs has 100,000 installs, an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters; otherwise, the conversion will fail and raise an error.
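# As a quick illustration of the cleaning step described above (just a throwaway example, separate from the analysis itself), this is what the conversion looks like for a single open-ended install figure:

# In[ ]:


# Convert an open-ended install string such as '1,000,000+' into a float
n_installs = '1,000,000+'
n_installs = n_installs.replace('+', '')   # remove the plus character
n_installs = n_installs.replace(',', '')   # remove the thousands separators
print(float(n_installs))   # prints 1000000.0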
# In[27]:


category_android = freq_table(android_free, 1)

for category in category_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    average_installs = total / len_category   # average number of installs per category
    print(category, ':', average_installs)


# `GAME` apps have an average of roughly 15,588,015 installs. The code below displays the `GAME` apps with `500,000+`, `1,000,000+`, `5,000,000+`, or `10,000,000+` installs.

# In[28]:


for app in android_free:
    if app[1] == 'GAME' and (app[5] == '500,000+' or app[5] == '1,000,000+'
                             or app[5] == '5,000,000+' or app[5] == '10,000,000+'):
        print(app[0], ':', app[5])


# **Conclusion**
#
# Overall, creating a free game is an ideal app profile to attract users on both Google Play and the App Store.