#!/usr/bin/env python
# coding: utf-8

# # Analyzing iOS & Android apps market

# As a company that focuses on building **free** mobile apps for both the Apple App Store & the Google Play Store, our revenue comes from users watching ads. Hence, the more users we have, the greater the revenue.
#
# What follows is an analysis of apps from both platforms that concludes which type (category) of app attracts the most users; based on that conclusion, the company will decide which category its next app should belong to.
# You can view and download the Google apps data set from [here](https://www.kaggle.com/lava18/google-play-store-apps)
# You can view and download the iOS apps data set from [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
#
# Please note that the data was collected in 2017/2018.

# In[1]:

from csv import reader

# Open the iOS apps file
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apps_ios = list(read_file)
ios_header = apps_ios[0]
apps_ios = apps_ios[1:]

# Open the Google apps file
Gopened_file = open('googleplaystore.csv')
Gread_file = reader(Gopened_file)
apps_google = list(Gread_file)
google_header = apps_google[0]
apps_google = apps_google[1:]

# To make it easier to explore our data, we will build a function that prints the desired row(s) based on the input to the function:

# In[2]:

def explore_data(dataset, start, end, printrnc=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if printrnc:
        print('number of rows =', len(dataset))
        print('number of columns =', len(dataset[0]))

# The `explore_data()` function:
#
# 1. Takes in four parameters:
#    * dataset, which is expected to be a list of lists.
#    * start and end, which are both expected to be integers and represent the starting and ending indices of a slice from the data set.
#    * printrnc, which prints the number of rows and columns in our set; it is expected to be a Boolean and has False as a default argument.
# 2. Slices the data set using dataset[start:end].
# 3. Loops through the slice and, for each iteration, prints a row and adds a new line after that row using print('\n').
#
# *Please note that dataset shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).*

# In[3]:

print(ios_header, '\n')
explore_data(apps_ios, 5, 10, True)

# We can see from the headers that the fields we are interested in for our analysis are: track_name, price, rating_count_tot, user_rating and prime_genre.

# In[4]:

print(google_header, '\n')
explore_data(apps_google, 5, 10, True)

# We can see from the headers that the fields we are interested in are: App, Category, Rating, Reviews, Installs and Price.
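# *As an optional convenience (a sketch, not used in the rest of this notebook), the header rows let us look column positions up by name instead of hard-coding indices like `row[5]`, assuming the column names printed above (e.g. 'Installs', 'prime_genre'):*

idx_google = {name: i for i, name in enumerate(google_header)}
idx_ios = {name: i for i, name in enumerate(ios_header)}
print('Installs column index:', idx_google['Installs'])
print('prime_genre column index:', idx_ios['prime_genre'])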
# # Data Cleaning

# The first and, in my opinion, the most important step when analyzing data is to clean the data set at hand.
# In the data cleaning procedure we will go through 4 steps:
# 1. Deleting wrong data (such as nulls)
# 2. Removing duplicate rows
# 3. Removing non-English apps (since our target audience is English-speaking)
# 4. Removing non-free apps (since our company only develops free apps)
#
# ## 1. Deleting wrong data

# The Google Play Store data set has a discussion panel, and you can find [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101) that the entry at row 10472 has a null *'Category'* field, which caused a shift in the rest of the columns. This would lead to wrong calculations if used, therefore we have to delete it.
#
# It's important to note that detecting nulls is an important procedure during the data cleaning routine. The next piece of code detects the presence of null fields in the data sets and confirms that row 10472 in the Google apps data set does have a null field, as mentioned in the discussion panel.

# In[5]:

def find_nulls(dataset):
    if dataset is apps_google:
        header = google_header
    elif dataset is apps_ios:
        header = ios_header
    else:
        print('Please enter the function parameter only as "apps_google" OR "apps_ios"')
        return
    has_nulls = []
    count = 0
    for index, row in enumerate(dataset):  # enumerate gives the correct index even if duplicate rows exist
        if len(row) < len(header):
            count += 1
            has_nulls.append(index)
            print('row ', index, ' has a null field')
    print(count, ' row(s) with null field(s) found')

print('Checking nulls in Google data set')
find_nulls(apps_google)
print('\n')
print('Checking nulls in iOS data set')
find_nulls(apps_ios)

# From the code above we can confirm that row 10472 in the Google data set does have a null field. Let's print it and compare it to the header and a correct row so we can see for ourselves:

# In[6]:

print(google_header, '\n')
explore_data(apps_google, 10472, 10474)  # 10472 incorrect, 10473 correct

# You can notice that the *'Life Made WI-Fi Touchscreen Photo Frame'* app has its Category field missing (null) and the rest of its fields are shifted to the left.

# As a step in our data cleaning process, let's delete this wrong entry.
#
# **Please note that the following code should only be executed once to avoid deleting healthy rows**

# In[7]:

del apps_google[10472]  # run only once
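# *A hedged, idempotent alternative (a sketch, not what was executed above): delete the row only while it is still shorter than the header, so re-running the cell cannot remove a healthy row.*

if len(apps_google) > 10472 and len(apps_google[10472]) < len(google_header):
    del apps_google[10472]  # only deletes the malformed row; a no-op on re-runs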
# ## 2. Removing duplicate rows

# The second step in our data cleaning process is removing duplicate rows. The next piece of code will run through the data sets and check if there are any instances of duplication.

# In[8]:

def duplicate_rows(dataset):
    duplicate_apps = []
    unique_apps = []
    for row in dataset:
        if row[0] in unique_apps:
            duplicate_apps.append(row[0])
        else:
            unique_apps.append(row[0])
    print('Number of duplicated rows = ', len(duplicate_apps))
    print('Number of unique rows = ', len(unique_apps))
    return duplicate_apps

print('Checking Google data set')
duplicate_rows(apps_google)
print('\n')
print('Checking iOS data set')
duplicate_rows(apps_ios)

# From the code above we find out that the Google apps data set has 1181 duplicated rows which need to be removed. Let's show an example of a duplicated app.

# In[9]:

duplicate_apps = duplicate_rows(apps_google)
for i in duplicate_apps:
    if i == duplicate_apps[777]:  # a random sample
        print(i)

# From the code above we notice that the *'Google News'* app has 4 entries in our data set (1 unique and 3 duplicated). Let's examine those 4 entries:

# In[10]:

print(google_header, '\n')
for row in apps_google:
    if row[0] == 'Google News':
        print(row, '\n')

# By examining the results we notice that all the fields are similar except for the *'Reviews'* column. We want to make sure that when we delete duplicate rows we only keep the entry with the highest number of reviews, because it corresponds to the latest (last updated) entry of the app.
#
# The next piece of code will run through the whole data set and, for duplicated apps, it will only store the highest number of reviews for each app.

# In[11]:

reviews_dict = {}
for row in apps_google:
    review_num = float(row[3])  # convert to a number so the comparison isn't done on strings
    if row[0] in reviews_dict:
        max_review = reviews_dict[row[0]]
        if review_num > max_review:
            max_review = review_num
        reviews_dict[row[0]] = max_review
    else:
        reviews_dict[row[0]] = review_num

print(reviews_dict['Google News'])
print(len(reviews_dict))

# We ended up with a dictionary with the app name as key and the highest number of reviews as value. You can notice we tested it on the Google News app; if you compare the result with the 4 entries of the app shown above, you can confirm that it stores only the highest reviews value.
#
# Also note that the dictionary's length is the same as the number of unique apps deduced earlier.
#
# Next we are going to remove the duplicates. There are two points to keep in mind while doing this:
# 1. We need to remove the duplicates with fewer reviews than the highest number of reviews available in the dictionary.
# 2. There might be two or more occurrences of an app with the same highest number of reviews; we want only one.

# In[12]:

print('Google data set length before duplicates removal', len(apps_google))

apps_google_clean = []
apps_names = []
for k, v in reviews_dict.items():  # looping over keys and values
    for row in apps_google:
        name = row[0]
        reviews = float(row[3])
        if k == name:
            if v == reviews:
                if name not in apps_names:
                    apps_google_clean.append(row)
                    apps_names.append(name)

print('\nGoogle data set length after duplicates removal', len(apps_google_clean))

# Note that we ended up with a Google data set length of 9659 as expected, which is the same number of unique apps we deduced before in an earlier code cell.
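# *The nested loop above works but is roughly O(unique apps x rows). A hedged, equivalent single-pass alternative (a sketch, assuming `reviews_dict` from the earlier cell holds numeric values) should produce the same 9659-row result:*

apps_google_clean_alt = []
already_added = []
for row in apps_google:
    name = row[0]
    n_reviews = float(row[3])
    # keep the row only if it carries the app's highest review count and hasn't been kept yet
    if n_reviews == reviews_dict[name] and name not in already_added:
        apps_google_clean_alt.append(row)
        already_added.append(name)

print(len(apps_google_clean_alt))  # expected to match len(apps_google_clean)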
# ## 3. Removing non-English apps

# As a company that is interested in developing free apps for an English-speaking audience, we need to remove apps that won't be useful to our target audience.

# In[13]:

def english_check(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
    return True

# The previous function takes in a string value and checks each character's code point using ord(). If every character belongs to the standard ASCII range (0-127) we treat the name as English; if any character falls outside that range (greater than 127) we treat the name as non-English.
#
# Below are some examples from our data sets.

# In[14]:

ex1 = apps_google[786][0]
print(ex1, '\n', english_check(ex1), '\n')
ex2 = apps_google[123][0]
print(ex2, '\n', english_check(ex2), '\n')
ex3 = apps_ios[813][1]
print(ex3, '\n', english_check(ex3), '\n')
ex4 = apps_ios[123][1]
print(ex4, '\n', english_check(ex4), '\n')

# However, notice in the following example that although it's an English application, the presence of a single non-English character in its name resulted in our function flagging it as a non-English app (which is not true).

# In[15]:

ex5 = apps_ios[876][1]
print(ex5, '\n', english_check(ex5))

# Thus, using the `english_check()` function in its current form will cause a lot of data loss (over-filtering) in our data sets.
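# *Before relaxing the rule, we can get a rough, hedged sense of how many app names the strict check would discard (a sketch using the strict `english_check()` defined above):*

strict_removed_google = sum(1 for row in apps_google_clean if not english_check(row[0]))
strict_removed_ios = sum(1 for row in apps_ios if not english_check(row[1]))
print('Google apps the strict check would remove:', strict_removed_google)
print('iOS apps the strict check would remove:', strict_removed_ios)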
# Next, we will modify our function so that it only filters out an app if its name contains at least 4 non-English characters, which will minimize the data loss.

# In[16]:

def english_check(a_string):
    count = 0
    for char in a_string:
        if ord(char) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

ex5 = apps_ios[876][1]  # running the same example that returned False in the previous code
print(ex5, '\n', english_check(ex5))

# Now let's use our modified function to update our data sets by deleting non-English apps.

# In[17]:

print('Google data set length was ', len(apps_google_clean))
print('Apple data set length was ', len(apps_ios), '\n')

apps_google_english = []
apps_ios_english = []

for row in apps_google_clean:
    name = row[0]
    if english_check(name):
        apps_google_english.append(row)

for row in apps_ios:
    name = row[1]
    if english_check(name):
        apps_ios_english.append(row)

print('Google data set length is now ', len(apps_google_english))
print('Apple data set length is now ', len(apps_ios_english))

# ## 4. Removing non-free apps

# Now for the last step in our data cleaning process before we start our analysis: we need to remove all the non-free apps, since our goal is to find out what type of *free* apps attracts the most users.

# In[18]:

def check_free(price):
    if price == 0.0:
        return True
    return False

# Now let's test our new function against a free and a non-free app and see the result.

# In[19]:

print(ios_header, '\n')
print(apps_ios[10], ' >>> ', check_free(float(apps_ios[10][4])), '\n')
print(apps_ios[11], ' >>> ', check_free(float(apps_ios[11][4])))

# In[20]:

print('Google data set length was ', len(apps_google_english))
print('Apple data set length was ', len(apps_ios_english), '\n')

apps_google_final = []
apps_ios_final = []

for row in apps_google_english:
    try:
        price = float(row[7])
    except ValueError:
        price = float(row[7][1:])  # strip the leading '$' sign
    if check_free(price):
        apps_google_final.append(row)

for row in apps_ios_english:
    price = float(row[4])
    if check_free(price):
        apps_ios_final.append(row)

print('Google data set length is now ', len(apps_google_final))
print('Apple data set length is now ', len(apps_ios_final))

# The data cleaning process is finally over.
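# *A quick, hedged sanity check (a sketch): confirm that only free apps remain in the final data sets.*

all_free_google = all(check_free(float(row[7].lstrip('$'))) for row in apps_google_final)
all_free_ios = all(check_free(float(row[4])) for row in apps_ios_final)
print('All remaining Google apps are free:', all_free_google)
print('All remaining iOS apps are free:', all_free_ios)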
# From **10841** entries for the Google dataset we ended up with **8758**.
# From **7197** entries for the Apple dataset we ended up with **3169**.
#
# Having a clean dataset that is filtered according to your needs ensures that your analysis will give correct results.

# # Data Analysis

# Now it's time to start analyzing our data, but first it's worth noting that our company has a validation strategy for its apps that it needs to follow in order to minimize risk. This strategy has 3 steps:
#
# 1. Develop a minimal Android version of the app.
# 2. If the users give positive feedback, the company develops the app further.
# 3. If, after 6 months, the app is successful on the Google Play Store, the company develops an iOS version for the App Store.
#
# Now the question is: what app genre should the company start building a minimal Android version of?
# To answer this question properly there are 2 steps to consider:
# 1. First, we need to find the **most common genres** on both stores, since we need the app to be successful on both. To do this we are going to create a frequency table for the **prime_genre** column in the AppleStore data set and for the **Category** column in the Google PlayStore data set.
#
#
# 2. Second, we will find the **most used genres** on both stores. To do this we will create a dictionary with the genres as keys and the values will be:
#    - The average **Installs** for the Google PlayStore.
#    - The average **rating_count_tot** for the iOS AppStore.

# ## 1. Most common genres

# In[21]:

def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        key = row[index]
        total += 1
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    return table, total

genres_google_tot = freq_table(apps_google_final, 1)[0]
genres_ios_tot = freq_table(apps_ios_final, 11)[0]

def percent_table(dictionary, total):
    for key in dictionary:
        dictionary[key] = (dictionary[key] / total) * 100
    return dictionary

genres_google = percent_table(freq_table(apps_google_final, 1)[0], freq_table(apps_google_final, 1)[1])
genres_ios = percent_table(freq_table(apps_ios_final, 11)[0], freq_table(apps_ios_final, 11)[1])

print('''Google PlayStore genres frequency Table:
----------------------------------------''')
print(genres_google, '\n')
print('''Apple AppStore genres frequency Table:
--------------------------------------''')
print(genres_ios)

# Now that we have our genres frequency tables generated, we need to sort them so we can find which genres are the most frequent.

# In[22]:

def display_table(dictionary):
    values = []
    for k, v in dictionary.items():
        values.append((v, k))
    for v, k in sorted(values, reverse=True):
        print(k, ':', round(v, 2))

print('''Google PlayStore genres frequency Table:
----------------------------------------''')
display_table(genres_google)
print('''\n\nApple AppStore genres frequency Table:
--------------------------------------''')
display_table(genres_ios)

# By studying the frequency tables above we can conclude the following:
# * The most frequent Google PlayStore categories are: Family (19%), Game (9.6%) and Tools (8.5%).
# * The most frequent Apple AppStore categories are: Games (58%), Entertainment (7.8%) and Photo & Video (5%).
# * We can notice that the AppStore is dominated by apps that are made for fun.
# * While the PlayStore is more balanced between apps made for fun and apps made for more practical use, it is still observed that the 2 highest categories are also game-related.
#
# Note that the *'Family'* category in the PlayStore mostly represents games that are made for kids, as you can see in the following image.
# Also keep in mind that these conclusions are only true for free English apps and not for the whole market.

# ![Image](PlayStoreFamily.png)

# However, it's worth noting that being the *"most frequent"* category doesn't necessarily mean that it's the *"most used"* category, since supply and demand may not be the same.

# ## 2. Most used genres

# # Google PlayStore

# As mentioned before, we will now create a dictionary with the genres as keys and the values as:
#    - The average genre **Installs** for the Google PlayStore data set.
#    - The average genre **rating_count_tot** for the iOS AppStore data set.
#
# It's worth noting that the number of installs we have in our data set for Google apps is not precise, as you can see in the next code result. For example, 100,000+ installs might be any number between 100,000 and 500,000, which is a very large range. However, for the purpose of our analysis the exact number isn't needed; we can work with the numbers at hand and still get acceptable results.

# In[23]:

google_installs = freq_table(apps_google_final, 5)[0]
display_table(google_installs)

# In[24]:

used_google = {}
for row in apps_google_final:
    genre = row[1]
    if len(row[5]) > 1:
        installs = int(row[5][:-1].replace(',', ''))  # drop the trailing '+' sign and the commas
    else:
        installs = int(row[5])
    if genre in used_google:
        used_google[genre] += installs
    else:
        used_google[genre] = installs

genres_google_tot = freq_table(apps_google_final, 1)[0]  # this is the genre frequency table generated before
for k in used_google:
    used_google[k] /= genres_google_tot[k]  # dividing by the total number of apps that belong to a specific genre

print('''Google PlayStore average Installs Table:
----------------------------------------''')
display_table(used_google)

# From the previous output we can notice that the top 3 genres installed by users are:
# 1. Communication (38,550,548)
# 2. Video Players (24,878,048)
# 3. Social (23,628,689)
#
# Let's dive deeper into these categories.
#
# First, let's explore the apps with the highest number of installs that belong to the Communication genre.

# ## Communication genre

# In[25]:

print('''Apps between 100M & 1B
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'COMMUNICATION' and 100000000 <= installs <= 1000000000:  # between 100 million and 1 billion
        print(row[0], ':', row[5])

# From the previous output we conclude that the *'Communication'* genre is dominated by big-name companies like WhatsApp, Facebook Messenger and Skype, which skews the average number of installs and makes it the highest genre in terms of average installs. However, this doesn't mean that all the apps in this genre have a high number of installs. For example, have a look at the following apps with fewer than 50 million installs (relatively low).

# In[26]:

for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'COMMUNICATION' and installs < 50000000:  # less than 50 million
        print(row[0], ':', row[5])

# We can see that most of the apps have fewer than 10M installs, which is well below the average (~38M). This means that in order to build an app in this category we would have to compete with giants like Facebook, WhatsApp and Skype, and therefore we won't be building an app in this category.
#
# Let's explore the rest of the categories in the same manner.

# ## Video Players genre

# In[27]:

print('''Apps between 100M & 1B
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'VIDEO_PLAYERS' and 100000000 <= installs <= 1000000000:  # between 100 million and 1 billion
        print(row[0], ':', row[5])

# ## Social genre

# In[28]:

print('''Apps between 100M & 1B
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'SOCIAL' and 100000000 <= installs <= 1000000000:  # between 100 million and 1 billion
        print(row[0], ':', row[5])

# It's concluded that the top 3 categories are skewed by a few giant, big-name companies and that the average number of installs is misleading; we won't be competing in any of these categories.
#
# Next we will explore the Game category, which we found out before is the most common on both markets.

# ## Game genre

# In[29]:

print('''Apps between 100M & 1B
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'GAME' and 100000000 <= installs <= 1000000000:  # between 100 million and 1 billion
        print(row[0], ':', row[5])

# There are a lot of popular games available, and it's observable that the games market is pretty saturated; developing an app in this genre wouldn't be the smartest thing to do since we would be competing with a very large number of other apps.

# Now, remembering our first analysis of the most common genres in the PlayStore, the **Entertainment** category had a pretty low percentage of the market (0.96%); however, it has a large average number of installs (11,767,380). Having low supply and high demand makes this category a potential candidate for our next app.
#
# Let's explore it even further.
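# *The category scans above repeat the same filter-and-print loop, and the Entertainment scan below repeats it again. A hedged helper (a sketch; the name `print_installs_range` is hypothetical and not used elsewhere in this notebook) could factor that pattern out:*

def print_installs_range(dataset, category, low, high):
    # print every app of the given Category whose install count falls in [low, high]
    for row in dataset:
        installs = int(row[5].replace(',', '').replace('+', ''))
        if row[1] == category and low <= installs <= high:
            print(row[0], ':', row[5])

# Example: the same scan as the Communication cell above.
print_installs_range(apps_google_final, 'COMMUNICATION', 100000000, 1000000000)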
# ## Entertainment genre

# In[30]:

print('''Apps between 100M & 1B
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'ENTERTAINMENT' and 100000000 <= installs <= 1000000000:  # between 100 million and 1 billion
        print(row[0], ':', row[5])

print('''\n\n\nApps between 1M & 50M
---------------------''')
for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    if row[1] == 'ENTERTAINMENT' and 1000000 <= installs <= 50000000:  # between 1M and 50M
        print(row[0], ':', row[5])

# Only a small number of apps dominate this category, which leaves space to compete. Also, by studying the apps that have between 1M & 50M installs we notice that:
# * Coloring apps have a large number of installs
# * The coloring apps market isn't saturated with apps
# * None of the big names are specialised in coloring
#
# Let's have a closer look at coloring apps only.

# In[31]:

import re

for row in apps_google_final:
    color = re.findall('.*Color.*', row[0])
    if color == []:
        color = re.findall('.*paint.*', row[0])
    if row[1] == 'ENTERTAINMENT' and color != []:
        print(color, ':', row[5])

# Coloring apps can also be found in the **Family** category, let's scan them too.

# In[32]:

import re

for row in apps_google_final:
    installs = int(row[5].replace(',', '').replace('+', ''))
    color = re.findall('.*Color.*', row[0])
    if color == []:
        color = re.findall('.*paint.*', row[0])
    if row[1] == 'FAMILY' and color != [] and installs >= 1000000:
        print(color, ':', row[5])

# By studying the previous results it's seen that there is a trend of coloring apps with a *'Color by number'* option for kids. From this observation, we will make sure that the application we are going to develop also has this feature. However, we have to add other features so we can stand out. For example: we can have a section for kids and another for adults, add tutorials, add weekly competitions between users, add a place where users can share and discuss their paintings, add daily quotes from famous artists, etc.

# ---

# # Apple AppStore

# Now let's study the Apple AppStore based on the average number of ratings, since our data doesn't have an installs field like the Google PlayStore data does. We also want to further explore the potential of coloring apps in the AppStore, because we want our app to be successful on both platforms.

# In[33]:

used_ios = {}
for row in apps_ios_final:
    genre = row[11]
    rating_count_tot = int(row[5])
    if genre in used_ios:
        used_ios[genre] += rating_count_tot
    else:
        used_ios[genre] = rating_count_tot

for k in used_ios:
    used_ios[k] /= genres_ios_tot[k]

print('''Apple AppStore average rating_count_tot Table:
----------------------------------------------''')
display_table(used_ios)

# In[34]:

for row in apps_ios_final:
    color = re.findall('.*Color.*', row[1])
    if color == []:
        color = re.findall('.*paint.*', row[1])
    if row[11] == 'Entertainment' and color != []:
        print(color, ':', row[5])

# The output implies that most of the coloring apps on the AppStore are made for adults, which means that our app is going to stand out with its kids section and the Color by number feature for kids. The idea of building a coloring app fits well with our goal:
# 1. Its supply is low and its demand is high, which will result in a large number of users.
# 2. The nature of a coloring app leads to users spending a good amount of time in the app, and thus watching more ads.
# 3. Its market isn't saturated with apps and doesn't have big names dominating it.

# ---

# # Conclusion

# In this project, we analyzed data about App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable in both markets.
#
# We concluded that building a coloring app has good potential for succeeding in both markets. Our app is going to have a section for kids with a color by number feature, which was trending on the Google PlayStore, and will also have a section for adults, which was trending on the Apple AppStore. Other features might include tutorials, weekly competitions between users, a place where users can share and discuss their paintings, daily quotes from famous artists, etc.