For a company that builds apps for English-speaking markets and earns revenue only from in-app ads, it is important to analyze market trends to discover which app categories attract the most users.
In this project we analyze data on apps in both the iOS and Android markets in order to advise the development team on which types of apps have the best chance of achieving our company's revenue goals.
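from csv import reader

# Load the App Store data set (the files are assumed to be UTF-8 encoded)
open_file1 = open('AppleStore.csv', encoding='utf8')
read_file1 = reader(open_file1)
ios = list(read_file1)
open_file1.close()

# Load the Google Play data set
open_file2 = open('googleplaystore.csv', encoding='utf8')
read_file2 = reader(open_file2)
google = list(read_file2)
open_file2.close()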
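The loading step above can also be wrapped in a small helper. This is only a minimal sketch: `load_csv` is a hypothetical name that does not appear in the original notebook, and UTF-8 encoding is an assumption. To stay self-contained, it is demonstrated on a throwaway file rather than on the real data sets.

```python
import csv
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demonstration on a tiny temporary file (made-up content, not the real data).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Price'], ['Demo', '0']])
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

Returning the header separately would avoid the repeated `[1:]` slices later on, at the cost of changing the row indexes used in the rest of this analysis.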
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print('IOS Data set preview:')
explore_data(ios,1,3)
print('\n')
print('Google play Data set preview:')
explore_data(google,1,3)
IOS Data set preview:
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

Google play Data set preview:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
print('IOS Header')
print(ios[0])
print('\n')
print('Google Header')
print(google[0])
IOS Header
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Google Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
For the iOS apps, the columns that could help with our analysis are price, rating_count_tot, user_rating, cont_rating and prime_genre. For the Google Play apps, the relevant columns are Category, Rating, Installs, Type and Content Rating.
More information about the columns can be found in the documentation for the iOS apps data set and for the Google Play data set.
# From reading the discussion section, we found the index of a row with an error.
print(google[10473])
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
The row google[10473] of the Google Play data set has an error: its category value is missing, which shifts every later column one position to the left. The row is therefore deleted.
The suspected duplicate apps in the iOS data set turned out not to be duplicates, so at this time there is no discernible error in that data set.
del google[10473]
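A column shift like the one above changes the length of the row, so such rows can also be found programmatically instead of relying on the discussion section. The sketch below assumes that a well-formed row has exactly as many values as the header; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header row."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the last row lacks its category, so it is one column short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Demo A', 'TOOLS', '4.1'],
    ['Demo B', '1.9'],
]
bad_indexes = find_malformed_rows(sample)
print(bad_indexes)
```

If several rows are reported, deleting them from the highest index down keeps the remaining indexes valid.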
On further examination, the Google Play data set contains multiple duplicate entries, which can distort our analysis. First we identify the duplicate entries, then we delete them so that a single entry per app remains. Our criterion is to keep the entry with the highest number of reviews and delete the rest, on the assumption that the highest review count corresponds to the most recent data.
google_unique = []
google_duplicate = []
for app in google[1:]:
    name = app[0]
    if name in google_unique:
        google_duplicate.append(name)
    else:
        google_unique.append(name)

print('Number of duplicate apps:', len(google_duplicate))
print('\n')
# Example apps in google_duplicate, to show that there are in fact duplicate entries
print('Examples of duplicate apps:', google_duplicate[:10])
Number of duplicate apps: 1181

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
We now begin removing the duplicates by building a dictionary in which each app name is a unique key and the value is the highest number of reviews recorded for that app.
reviews_max = {}
for app in google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# Check if expected dictionary is created (expected len = 9659)
print(len(reviews_max))
9659
# Using the dictionary created, we remove the duplicates
android_clean = []
already_added = []
for app in google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean,0,2,True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

Number of rows: 9659
Number of columns: 13
We used the reviews_max dictionary to clean the Google Play data set, copying each row that holds the highest number of reviews for its app into a new list, android_clean. Exploring android_clean confirms this: it has the expected 9659 rows (the length of the data set minus the number of duplicate entries).
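The two-step clean-up above (first a dictionary of maximum review counts, then a filtering loop) could also be done in a single pass by storing the whole row in the dictionary. The following is a rough sketch rather than the method used in this notebook: `keep_most_reviewed` is a hypothetical name, and the sample rows and review counts are made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows (name, category, rating, reviews).
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159904'],
]
deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

A dictionary lookup is faster than the `name not in already_added` list search, which matters little at this data size. Note that the output order follows the first appearance of each name, so it can differ from the order in android_clean.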
Since our project is geared towards English-speaking audiences, it is wise to remove non-English apps from our data sets. We can do this with a function that checks the code point of each character in an app name: ord() returns a number for every character, and the characters commonly used in English text fall in the ASCII range 0-127. The function returns False if any character in the app name has a code point greater than 127 (i.e. it is treated as non-English); otherwise it returns True (the app is treated as English).
# Function to check if the app name is english or not
def lang_check(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(lang_check('Instagram'))
print(lang_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(lang_check('Docs To Go™ Free Office Suite'))
print(lang_check('Instachat 😜'))
True
False
False
False
Testing lang_check() shows that it misclassifies some English app names that contain emojis or special characters such as ™, because these characters have code points above 127. This would cause some English apps to be lost. We therefore modify the function to return False only if more than 3 characters in the app name fall outside the ASCII range (0-127).
def lang_check(string):
    ascii_n = 0
    for character in string:
        if ord(character) > 127:
            ascii_n += 1
    if ascii_n > 3:
        return False
    else:
        return True

print(lang_check('Instagram'))
print(lang_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(lang_check('Docs To Go™ Free Office Suite'))
print(lang_check('Instachat 😜'))
True
False
True
True
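The limit of 3 non-ASCII characters is a judgment call, so it may help to make it a parameter and see how the result changes. This is a sketch under that assumption; `is_mostly_ascii` and `max_non_ascii` are hypothetical names, not part of the notebook.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters lie outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    # Compare the strict check (0 allowed) with the tolerant one (3 allowed).
    print(name, is_mostly_ascii(name, max_non_ascii=0), is_mostly_ascii(name))
```

Either way this remains a heuristic: a name in another language that happens to use mostly unaccented Latin letters still passes the check.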
# Using the modified language check function to filter our data sets
android_eng = []
for app in android_clean:
    name = app[0] # For the Google play data set
    if lang_check(name):
        android_eng.append(app)

explore_data(android_eng,0,2,True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

Number of rows: 9614
Number of columns: 13
# For the IOS Apps data set
ios_eng = []
for app in ios[1:]:
    name = app[1]
    if lang_check(name):
        ios_eng.append(app)

explore_data(ios_eng,0,3,True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

Number of rows: 6183
Number of columns: 16
We have now cleaned our data sets of duplicate and non-English entries. Since our company builds only free apps, we also need to remove the non-free apps from both data sets. We do this by looping through each data set and copying the apps whose price is zero (free) to a separate list.
For the Google Play data set, we create an empty list android_free, loop through the android_eng list, check each app's price, append the free apps to android_free, and then check the length of android_free.
For the iOS data set, we create an empty list ios_free, loop through the ios_eng list, check each app's price, append the free apps to ios_free, and then check the length of ios_free.
# For the Google Play data set
android_free = []
for app in android_eng:
    price = app[7]
    if price == '0':
        android_free.append(app)

len(android_free)
8864
# For the IOS data set
ios_free = []
for app in ios_eng:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)

len(ios_free)
3222
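The two filters above compare the price with a different string for each store ('0' for Google Play, '0.0' for the App Store). A single check that parses the price as a number would cover both. The sketch below uses a hypothetical helper, `is_free`, and assumes that paid Google Play prices carry a leading '$'.

```python
def is_free(price_text):
    """Return True if the price string represents zero (hypothetical helper)."""
    # Assumption: paid Google Play prices look like '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free.
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

With such a helper, both loops could share the same condition, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.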
Given the aim of our project, most of our revenue depends on how many people install and use our apps. It is cheaper and more efficient to build one app that does well in both the Android and iOS markets. To achieve this, we need to take a closer look at both data sets and find out which kinds of apps thrive in both markets. One way to do this is to identify the most common genres in each market and focus development on the overlap (genres common to both markets). We do this by building frequency tables for the 'Category' column (index 1) and the 'Genres' column (index 9) of the Google Play data set and for the 'prime_genre' column (index 11) of the iOS data set.
def freq_table(dataset, index):
    table = {}
    for app in dataset:
        column = app[index]
        if column in table:
            table[column] += 1
        else:
            table[column] = 1

    table_percentage = {}
    for key in table:
        percentage = (table[key] / len(dataset)) * 100
        table_percentage[key] = percentage
    return table_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
To help us analyze the common app genres in our data sets, we built two functions: freq_table(), which builds a frequency table for any column and expresses the counts as percentages, and display_table(), which prints those percentages in descending order.
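For comparison, the same percentage table can be built with `collections.Counter` from the standard library, which counts the values and can return them sorted by frequency. This is a sketch of an alternative, not the approach used in this notebook; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: count / total * 100 for key, count in counts.most_common()}

# Made-up sample rows: (name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
sample_table = freq_table_counter(sample, 1)
for key, percentage in sample_table.items():
    print(key, ':', percentage)
```

Because dictionaries preserve insertion order, the result is already sorted and no separate display function is needed for this variant.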
display_table(ios_free, 11)
print('\n')
display_table(android_free, 9)
print('\n')
display_table(android_free, 1)
print('\n')
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075812
Strategy : 0.9138086642599278
House & Home : 0.8235559566787004
Weather : 0.8009927797833934
Events : 0.7107400722021661
Adventure : 0.6768953068592057
Comics : 0.6092057761732852
Beauty : 0.5979241877256317
Art & Design : 0.5979241877256317
Parenting : 0.4963898916967509
Card : 0.45126353790613716
Casino : 0.42870036101083037
Trivia : 0.41741877256317694
Educational;Education : 0.39485559566787
Board : 0.3835740072202166
Educational : 0.3722924187725632
Education;Education : 0.33844765342960287
Word : 0.2594765342960289
Casual;Pretend Play : 0.236913357400722
Music : 0.2030685920577617
Racing;Action & Adventure : 0.16922382671480143
Puzzle;Brain Games : 0.16922382671480143
Entertainment;Music & Video : 0.16922382671480143
Casual;Brain Games : 0.13537906137184114
Casual;Action & Adventure : 0.13537906137184114
Arcade;Action & Adventure : 0.12409747292418773
Action;Action & Adventure : 0.10153429602888085
Educational;Pretend Play : 0.09025270758122744
Simulation;Action & Adventure : 0.078971119133574
Parenting;Education : 0.078971119133574
Entertainment;Brain Games : 0.078971119133574
Board;Brain Games : 0.078971119133574
Parenting;Music & Video : 0.06768953068592057
Educational;Brain Games : 0.06768953068592057
Casual;Creativity : 0.06768953068592057
Art & Design;Creativity : 0.06768953068592057
Education;Pretend Play : 0.056407942238267145
Role Playing;Pretend Play : 0.04512635379061372
Education;Creativity : 0.04512635379061372
Role Playing;Action & Adventure : 0.033844765342960284
Puzzle;Action & Adventure : 0.033844765342960284
Entertainment;Creativity : 0.033844765342960284
Entertainment;Action & Adventure : 0.033844765342960284
Educational;Creativity : 0.033844765342960284
Educational;Action & Adventure : 0.033844765342960284
Education;Music & Video : 0.033844765342960284
Education;Brain Games : 0.033844765342960284
Education;Action & Adventure : 0.033844765342960284
Adventure;Action & Adventure : 0.033844765342960284
Video Players & Editors;Music & Video : 0.02256317689530686
Sports;Action & Adventure : 0.02256317689530686
Simulation;Pretend Play : 0.02256317689530686
Puzzle;Creativity : 0.02256317689530686
Music;Music & Video : 0.02256317689530686
Entertainment;Pretend Play : 0.02256317689530686
Casual;Education : 0.02256317689530686
Board;Action & Adventure : 0.02256317689530686
Video Players & Editors;Creativity : 0.01128158844765343
Trivia;Education : 0.01128158844765343
Travel & Local;Action & Adventure : 0.01128158844765343
Tools;Education : 0.01128158844765343
Strategy;Education : 0.01128158844765343
Strategy;Creativity : 0.01128158844765343
Strategy;Action & Adventure : 0.01128158844765343
Simulation;Education : 0.01128158844765343
Role Playing;Brain Games : 0.01128158844765343
Racing;Pretend Play : 0.01128158844765343
Puzzle;Education : 0.01128158844765343
Parenting;Brain Games : 0.01128158844765343
Music & Audio;Music & Video : 0.01128158844765343
Lifestyle;Pretend Play : 0.01128158844765343
Lifestyle;Education : 0.01128158844765343
Health & Fitness;Education : 0.01128158844765343
Health & Fitness;Action & Adventure : 0.01128158844765343
Entertainment;Education : 0.01128158844765343
Communication;Creativity : 0.01128158844765343
Comics;Creativity : 0.01128158844765343
Casual;Music & Video : 0.01128158844765343
Card;Action & Adventure : 0.01128158844765343
Books & Reference;Education : 0.01128158844765343
Art & Design;Pretend Play : 0.01128158844765343
Art & Design;Action & Adventure : 0.01128158844765343
Arcade;Pretend Play : 0.01128158844765343
Adventure;Education : 0.01128158844765343

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0.6430505415162455
COMICS : 0.6204873646209386
BEAUTY : 0.5979241877256317
Looking at the frequency table for the ios_free data set, the genre with the highest share by a huge margin is Games, with about 58%. Entertainment follows at a distance with about 8%, then Photo & Video with about 5%. This shows that most free English apps in the App Store are designed for entertainment rather than practical purposes. Whether that translates into a large number of users cannot be determined from these figures. However, if we assume that other companies also build the apps they expect to attract the most users, then Games is likely the genre with the most users, which would support recommending a free English gaming app to our company.
For the android_free data set, the most common value in the Genres column is Tools (about 8%), followed by Entertainment (about 6%). In the Category column, FAMILY has the highest share (almost 19%) and GAME comes second with about 10%. So while games and entertainment do not dominate as they do in the iOS data set, they are not far behind. A good portion of the apps developed for the Android market still fall under games and entertainment, which overlaps with the dominant genres in the Apple App Store.
Based only on the information above, we could recommend that our company develop gaming apps for both the iOS and Android markets. However, whether these apps would actually be successful (i.e. attract a large number of users) is still undetermined, because frequency tables do not reveal which genres have the most users. Further analysis of the data sets, for example of the user rating columns, would help us reach a better conclusion.
To find out which genres have the most users, we can calculate the average number of installs per genre. For the Google Play data set this information is in the 'Installs' column, but it is missing from the iOS data set. To work around this gap, we use the total number of user ratings as a proxy; it is stored in the 'rating_count_tot' column. The average number of ratings for a genre is the total number of ratings in that genre divided by the number of apps in that genre (not by the total number of apps). To compute this we use a nested loop, which is a loop within another loop.
# For the IOS App Store
ios_genre = freq_table(ios_free, 11)
for genre in ios_genre:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
Productivity : 21028.410714285714
Music : 57326.530303030304
Food & Drink : 33333.92307692308
Social Networking : 71548.34905660378
News : 21248.023255813954
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Book : 39758.5
Navigation : 86090.33333333333
Weather : 52279.892857142855
Finance : 31467.944444444445
Education : 7003.983050847458
Shopping : 26919.690476190477
Travel : 28243.8
Utilities : 18684.456790123455
Games : 22788.6696905016
Lifestyle : 16485.764705882353
Medical : 612.0
Sports : 23008.898550724636
Business : 7491.117647058823
Catalogs : 4004.0
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
From the results above, Navigation has the highest average number of user ratings, followed by Reference and then Social Networking. Combining this with our earlier finding that Games and Entertainment are the most common genres in the App Store, I recommend developing a social networking app: it is fairly close to the games and entertainment genres, has a large number of users, and could provide the traffic needed to earn revenue from in-app advertising.
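The nested loop used for the per-genre averages scans the whole data set once per genre. The same averages can be computed in a single pass by keeping a running total and a count for each genre. The sketch below is an alternative under that idea, not the notebook's method; `average_by_group` is a hypothetical name and the sample rows are made up.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    # Divide by the number of rows in each group, not by the total row count.
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: (genre, number of ratings)
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
sample_averages = average_by_group(sample, 0, 1)
print(sample_averages)
```

For the iOS data this would be called as `average_by_group(ios_free, 11, 5)`, assuming the column indexes used earlier in this analysis.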
android_genre = freq_table(android_free, 1)
for category in android_genre:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
AUTO_AND_VEHICLES : 647317.8170731707
GAME : 15588015.603248259
VIDEO_PLAYERS : 24727872.452830188
COMICS : 817657.2727272727
BOOKS_AND_REFERENCE : 8767811.894736841
MEDICAL : 120550.61980830671
BUSINESS : 1712290.1474201474
MAPS_AND_NAVIGATION : 4056941.7741935486
SOCIAL : 23253652.127118643
HEALTH_AND_FITNESS : 4188821.9853479853
BEAUTY : 513151.88679245283
SHOPPING : 7036877.311557789
ENTERTAINMENT : 11640705.88235294
FINANCE : 1387692.475609756
COMMUNICATION : 38456119.167247385
LIFESTYLE : 1437816.2687861272
WEATHER : 5074486.197183099
EVENTS : 253542.22222222222
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
TOOLS : 10801391.298666667
LIBRARIES_AND_DEMO : 638503.734939759
SPORTS : 3638640.1428571427
EDUCATION : 1833495.145631068
FOOD_AND_DRINK : 1924897.7363636363
FAMILY : 3695641.8198090694
PHOTOGRAPHY : 17840110.40229885
ART_AND_DESIGN : 1986335.0877192982
HOUSE_AND_HOME : 1331540.5616438356
PARENTING : 542603.6206896552
DATING : 854028.8303030303
TRAVEL_AND_LOCAL : 13984077.710144928
NEWS_AND_MAGAZINES : 9549178.467741935
To find the Google Play category with the most users, we used the Installs column. Its values contain '+' and ',' characters, which raise a ValueError when passed to float(). To get around this, we used the str.replace() method to replace the unwanted characters with empty strings.
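As a small illustration of that clean-up step, the conversion can be isolated in a helper so it is easy to test on its own. `parse_installs` is a hypothetical name; the sample strings follow the format seen in the Installs column earlier in this analysis.

```python
def parse_installs(text):
    """Convert an Installs value such as '10,000+' to an integer."""
    # Remove the thousands separators and the trailing plus sign.
    return int(text.replace(',', '').replace('+', ''))

for raw in ['10,000+', '5,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Keep in mind that a value like '10,000+' means "at least 10,000", so any average built on these numbers is a lower-bound estimate rather than an exact count.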
On average, communication apps have the most installs (38,456,119), followed by the video players category. However, these averages are heavily inflated by the huge install counts of a few individual apps in these categories, so it is difficult to recommend them to our company on this basis.
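One way to reduce the influence of a few very large apps is to compare medians instead of means. This was not done in the analysis above; the sketch below only illustrates the idea with a hypothetical helper, `median_by_group`, and made-up numbers, and it assumes the install values have already been stripped of '+' and ','.

```python
from statistics import median

def median_by_group(dataset, group_index, value_index):
    """Return {group: median of the value column}."""
    groups = {}
    for row in dataset:
        groups.setdefault(row[group_index], []).append(float(row[value_index]))
    return {group: median(values) for group, values in groups.items()}

# Made-up sample: one giant app dominates the COMMUNICATION mean.
sample = [
    ['COMMUNICATION', '1000'],
    ['COMMUNICATION', '5000'],
    ['COMMUNICATION', '1000000000'],
    ['WEATHER', '10000'],
    ['WEATHER', '50000'],
]
sample_medians = median_by_group(sample, 0, 1)
print(sample_medians)
```

In this sample the mean for COMMUNICATION is above 300 million while its median is 5,000, which shows how strongly a single outlier can distort an average.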
Our initial goal was to recommend an app category for development in both the iOS and Android markets. Based on this analysis, I recommend developing a social networking app: it has a significant number of users in both markets and is similar enough to the most common app categories in both. It also has very high potential to attract large numbers of users, which would help us meet our goal of generating revenue from in-app advertising.