This is the first guided project in Dataquest's Data Scientist in Python path. The idea is to imagine working for a company that develops and publishes free apps in the Apple Store and the Google Play Store. Since the apps are free, revenue is generated through ads. As data analysts, we want to find out the profiles of free apps that have the potential to generate the most ad revenue.
My goal for this project is to familiarize myself with some basic data analysis tasks using Python and with working in the Jupyter interface.
We begin by defining some functions to automate some repetitive tasks.
# Creating a function that will open csv's as list of lists
from csv import reader

def open_as_list(dataset, separate_header=True):
    # Open the file in a with block so it is closed once we are done reading
    with open(dataset, encoding='utf8') as opened_file:
        list_file = list(reader(opened_file))
    if separate_header:
        return list_file[0], list_file[1:]
    else:
        return list_file
# This is a function provided by Dataquest that allows us to print rows in a readable way
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
Loading the data sets
We then load the data sets as lists of lists, using the functions (open_as_list() and explore_data()) that we defined earlier.
# Opening the csv files (with separate header)
apple_header, apple = open_as_list('AppleStore.csv')
google_header, google = open_as_list('googleplaystore.csv')
# Exploring the first few rows of the data sets using the explore_data() function
explore_data(apple, 0, 3, rows_and_columns=True)
explore_data(google, 0, 3, rows_and_columns=True)
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] Number of rows: 7197 Number of columns: 17 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] Number of rows: 10841 Number of columns: 13
# Exploring the column headers
print(apple_header)
print('\n')
print(google_header)
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Documentation for the Apple Store data set is available in this link while the documentation for the Google Play Store data set is available here.
Dataquest informs us that in the discussion thread for the Google Play Store data set, there appears to be a row with missing data (column length is less than expected). We look for the index number of the problematic row in the following cell.
# Looking for missing data in google play store data
rows_with_miss = [] # initialize the list that will contain the index numbers of rows with missing data
header_length = len(google_header) # expected number of columns, computed once
for index, row in enumerate(google): # enumerate gives each row's position directly; google.index(row) would only find the first matching row
    if len(row) != header_length:
        print(row)
        print('\n')
        print('Row with index number ' + str(index) + ' may have missing values')
        rows_with_miss.append(index)
number_rows_miss = len(rows_with_miss) # number of rows with missing data
print('\n')
print(rows_with_miss)
print(number_rows_miss)
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] Row with index number 10472 may have missing values [10472] 1
We check the details of the problematic row and compare them with the header row to better understand what went wrong.
If there were more rows with missing data, we could automate this process with a for loop. We skip that part for now and proceed, knowing there is only one row with missing data.
# Checking the details of the problematic row
print(google_header)
print('\n')
print(google[10472])
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Category appears to be missing: the row jumps straight to 1.9 instead of a string describing the category, so every later value is shifted one column to the left. We now delete the problematic row. Again, we could use a for loop here if there were more rows with missing data.
# Deleting the problematic row, make sure this is run only once
del google[10472]
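The for loop mentioned above could look something like the following. This is only a sketch on toy data: drop_rows() is a hypothetical helper that is not part of the Dataquest material, and the rows and header here are made up. Building a new list instead of calling del also means the cell can be rerun safely.

```python
def drop_rows(dataset, bad_indices):
    """Return a new list of rows without the rows at the given positions."""
    bad = set(bad_indices)
    return [row for i, row in enumerate(dataset) if i not in bad]

# Toy data standing in for the google list of lists (hypothetical values)
header = ['App', 'Category', 'Rating']
rows = [['a', 'X', '1'], ['b', '2'], ['c', 'Y', '3'], ['d', '4']]

# Rows whose length differs from the header are treated as having missing data
bad_indices = [i for i, row in enumerate(rows) if len(row) != len(header)]
cleaned = drop_rows(rows, bad_indices)

print(bad_indices)
print(cleaned)
```

Because the original list is left untouched, running this twice gives the same result, unlike del with a fixed index number.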
Apparently, the Google Play Store data (the google data set) contains duplicate entries for some apps. Our next step is to identify the apps with duplicate entries.
# Creating lists containing the duplicates and the unique app names
google_unique = []
google_duplicate = []
for row in google:
    app = row[0]
    if app in google_unique:
        google_duplicate.append(app)
    else:
        google_unique.append(app)
# Checking how many apps are unique and how many are duplicated
print('There are ' + str(len(google_unique)) + ' unique apps.')
print('There are ' + str(len(google_duplicate)) + ' duplicated apps.')
print('Here are some examples of duplicated apps: ' + str(google_duplicate[0:3]))
There are 9659 unique apps. There are 1181 duplicated apps. Here are some examples of duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
We will not delete the duplicated apps right away. Instead, we first decide how to deal with them. We can either (1) delete duplicates using some criterion or (2) retain one row for each unique app and aggregate the numerical columns. We'll go with option 1 for this project.
Our criterion is to keep the row with the highest number of reviews, on the assumption that the row with more reviews was recorded later and therefore holds the most recent information.
We then create a dictionary with app names as keys and the highest number of reviews for that app as the values.
google_reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = int(app[3]) # Dataquest recommends converting to float but we convert to int since n_reviews are whole numbers
    if (name in google_reviews_max) and (google_reviews_max[name] < n_reviews):
        google_reviews_max[name] = n_reviews
    if name not in google_reviews_max:
        google_reviews_max[name] = n_reviews
print(len(google_reviews_max)) # checking if reviews_max dictionary has the correct length (9659)
print(list(google_reviews_max.items())[:3]) # checking the first few elements of the reviews_max dictionary
9659 [('Photo Editor & Candy Camera & Grid & ScrapBook', 159), ('Coloring book moana', 974), ('U Launcher Lite – FREE Live Cool Themes, Hide Apps', 87510)]
Next, we use the dictionary we just created to retain only the rows we're interested in keeping (based on our criterion). For this step, we create a new list of lists, which we name google_clean. This new data set should have no more duplicates.
google_clean = []
google_already_added = []
for app in google:
    name = app[0]
    n_reviews = int(app[3])
    if (n_reviews == google_reviews_max[name]) and (name not in google_already_added):
        google_clean.append(app)
        google_already_added.append(name)
print(len(google_clean)) # checking if our new data set has the correct number of rows (9659)
print(google_clean[:3]) # checking the first few entries of our 'cleaned' data set
# We could also use the explore_data function to perform these checks
# Upon reviewing some of the prior cells, it may be useful to define a function that will remove duplicate rows for us, we skip that for now.
9659 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
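The function for removing duplicate rows that we skipped could be sketched as follows. remove_duplicates() is a hypothetical name, and the sample rows are toy values rather than rows from the real data set. It follows the same two-pass idea as the cells above: first find the highest review count per app name, then keep the first row that matches it.

```python
def remove_duplicates(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: highest number of reviews seen for each app name
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if n_reviews > reviews_max.get(name, -1):
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that carries that maximum
    clean = []
    already_added = set()  # a set makes the membership check fast
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

# Toy rows (hypothetical values) in the Google layout: name at 0, reviews at 3
sample = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '120'],
    ['Maps', 'TRAVEL', '4.3', '10'],
    ['Box', 'BUSINESS', '4.2', '120'],
]
deduped = remove_duplicates(sample, 0, 3)
print(deduped)
```

With the indices passed as arguments, the same function would serve both data sets, for example remove_duplicates(google, 0, 3) and remove_duplicates(apple, 2, 6).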
Ideally, we do the same process for the Apple Store data set. The Dataquest prompts do not ask us to do so but we'll check anyway. This is also for consistency, so that we get to call the Apple Store data set apple_clean instead of just apple, the same way we should now be using the google_clean data set.
For the Apple Store data set, we will use the same criterion (the highest number of reviews). We could try retaining the row with the latest version or the most recent update instead (although the update date is not in our Apple data set), but we are not yet familiar enough with comparing version strings (e.g. 1.0.1 vs. 1.0.2) to implement such conditions.
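For reference, one common way to compare version strings is to split them on the dots and compare tuples of integers. The sketch below is not part of the Dataquest project, and version_key() is a hypothetical helper. It assumes every part of the version is a plain number, so a value such as 'Varies with device' would raise a ValueError and would need separate handling.

```python
def version_key(version):
    """Turn a dotted version string such as '1.0.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))

# Tuples compare element by element, so this behaves like a version comparison
print(version_key('1.0.2') > version_key('1.0.1'))

# Plain string comparison goes character by character and gets this one wrong
print('1.10.0' > '1.9.0')
print(version_key('1.10.0') > version_key('1.9.0'))
```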
# Creating lists containing the duplicates and the unique app names
apple_unique = []
apple_duplicate = []
for row in apple:
    app = row[2]
    if app in apple_unique:
        apple_duplicate.append(app)
    else:
        apple_unique.append(app)
# Checking how many apps are unique and how many are duplicated
print('There are ' + str(len(apple_unique)) + ' unique apps.')
print('There are ' + str(len(apple_duplicate)) + ' duplicated apps.')
print('Here are some examples of duplicated apps: ' + str(apple_duplicate[0:3]))
print('\n')
# Creating dictionary for max number of reviews
apple_reviews_max = {}
for app in apple:
    name = app[2]
    n_reviews = int(app[6]) # Dataquest recommends converting to float but we convert to int since n_reviews are whole numbers
    if (name in apple_reviews_max) and (apple_reviews_max[name] < n_reviews):
        apple_reviews_max[name] = n_reviews
    if name not in apple_reviews_max:
        apple_reviews_max[name] = n_reviews
print(len(apple_reviews_max)) # checking if reviews_max dictionary has the correct length (7195)
print(list(apple_reviews_max.items())[:3]) # checking the first few elements of the reviews_max dictionary
print('\n')
# Creating the new apple_clean data set with duplicates removed
apple_clean = []
apple_already_added = []
for app in apple:
    name = app[2]
    n_reviews = int(app[6])
    if (n_reviews == apple_reviews_max[name]) and (name not in apple_already_added):
        apple_clean.append(app)
        apple_already_added.append(name)
print(len(apple_clean)) # checking if our new data set has the correct number of rows (7195)
print('\n')
print(apple_clean[:2]) # checking the first few entries of our 'cleaned' data set
There are 7195 unique apps. There are 2 duplicated apps. Here are some examples of duplicated apps: ['VR Roller Coaster', 'Mannequin Challenge'] 7195 [('PAC-MAN Premium', 21292), ('Evernote - stay organized', 161065), ('WeatherBug - Local Weather, Radar, Maps, Alerts', 188583)] 7195 [['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']]
We're now done with removing duplicates for both data sets. It's a good thing that we did the extra work for the Apple data set, despite Dataquest not prompting us to do so, since there are apparently two duplicate entries.
Since in this project we are pretending to work for a team that only develops English apps, we must now remove apps that are not targeted at English-speaking customers. This is important because we don't want to draw inferences from non-English apps, whose users might have different preference profiles.
We start by defining a function that will help us systematically identify whether an app name has non-English characters. This means we want to tag apps whose names contain characters with Unicode code point values greater than 127 (Dataquest explains that the common English characters have code points in the range 0-127).
def english_app(text):
    for character in text:
        if ord(character) > 127:
            return False # function stops and returns False if it finds a single instance of a non-English character
    return True
We check whether our english_app() function works.
print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suit'))
print(english_app('Instachat 😜'))
True False False False
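To see why the last two names come back as False, it can help to list the characters that fall outside the 0-127 range together with their code points. The helper below, non_ascii_chars(), is a hypothetical name used only for this illustration.

```python
def non_ascii_chars(text):
    """Return (character, code point) pairs for characters above 127."""
    return [(ch, ord(ch)) for ch in text if ord(ch) > 127]

print(non_ascii_chars('Instagram'))
print(non_ascii_chars('Docs To Go™ Free Office Suit'))
print(non_ascii_chars('Instachat 😜'))
```

A single trademark symbol or emoji is enough to push an otherwise English name over the limit.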
While the english_app() function appears to work fine, it tags apps whose names use special characters (e.g. trademark symbols, emojis, etc.) as non-English. We need to relax it so that an app is tagged as non-English only if its name has more than three 'non-English' characters. We define a new function called english_loose().
def english_loose(text):
    non_english_chars = 0
    for character in text:
        if ord(character) > 127:
            non_english_chars += 1
    if non_english_chars > 3:
        return False
    else:
        return True
We check again whether our new function, english_loose(), is working as intended. Note that while the filter is still not perfect, it should work better than the stricter criterion in the english_app() function.
print(english_loose('Instagram'))
print(english_loose('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_loose('Docs To Go™ Free Office Suit'))
print(english_loose('Instachat 😜'))
True False True True
We can now use the english_loose() function to filter out non-English apps from both data sets (apple_clean and google_clean). We will call our new data sets apple_english and google_english. There is no need to do this filtering for the headers (apple_header and google_header) since the information contained in those lists will remain valid for all the apps in their respective data sets.
# Initialize empty lists
apple_english = []
google_english = []
for app in apple_clean:
    name = app[2] # name of app in apple data set is in index 2
    english = english_loose(name)
    if english:
        apple_english.append(app)
print('There are ' + str(len(apple_english)) + ' English apps in our Apple Store data set')
print(apple_english[:3])
print('\n')
for app in google_clean:
    name = app[0] # name of app in google data set is in index 0
    english = english_loose(name)
    if english:
        google_english.append(app)
print('There are ' + str(len(google_english)) + ' English apps in our Google Store data set')
print(google_english[:3])
There are 6181 English apps in our Apple Store data set [['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']] There are 9614 English apps in our Google Store data set [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
Since we're only interested in free apps (our company gets its revenues from ads), we will need to remove apps that are not free. First, let's check the headers of our data sets to refresh our memory.
print('Apple Store')
print(apple_header)
print('\n')
print('Google Store')
print(google_header)
Apple Store ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] Google Store ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
We will now be creating our 'final' data sets.
# Initializing empty lists
apple_final = []
google_final = []
# Isolating free apps in Apple Store
for app in apple_english:
    price = app[5] # should ideally convert to float but returns error because of some special characters we still don't know how to deal with
    if price == '0' or price == '0.0': # we compare against the different string forms of zero instead
        apple_final.append(app)
# Isolating free apps in Google Store
for app in google_english:
    price = app[7] # should ideally convert to float but returns error because of some special characters we still don't know how to deal with
    if price == '0' or price == '0.0': # we compare against the different string forms of zero instead
        google_final.append(app)
print('There are a total of ' + str(len(apple_final)) + ' free English apps in our Apple Store data set.')
print('There are a total of ' + str(len(google_final)) + ' free English apps in our Google Store data set.')
There are a total of 3220 free English apps in our Apple Store data set. There are a total of 8864 free English apps in our Google Store data set.
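The float conversion mentioned in the comments could be made to work by stripping the offending characters first. The sketch below assumes the special character is a leading dollar sign (as in '$4.99'); that is an assumption we have not verified against the data here, and parse_price() and is_free() are hypothetical helper names.

```python
def parse_price(price):
    """Convert a price string to a float, or return None if it is not a number.

    Assumes the only special character is a dollar sign; this is an
    unverified assumption about the data.
    """
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def is_free(price):
    """True only when the price parses cleanly to zero."""
    return parse_price(price) == 0.0

print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
print(parse_price('Everyone'))
```

Returning None for values that cannot be parsed keeps a stray non-numeric entry from stopping the whole loop.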
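To review, here's what we've done so far with our two data sets: we removed the row with missing data from the Google data, removed duplicate app entries, removed non-English apps, and isolated the free apps.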
We can now begin some crude analysis for the purpose of this project.
Recall that our hypothetical aim here is to identify the profile of apps that would be most profitable to develop. Dataquest provides us a list of the validation strategy to use in app development decision-making:
Since the plan is to release apps for both the Apple and Google stores, we will now look at the profiles of apps and try to figure out what would be most profitable (that is, what generates the most revenue through ads).
# Checking the columns to see which ones we can use to generate frequency tables
print('Apple')
display(apple_header)
print('\n')
print('Google')
display(google_header)
Apple
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
For the Apple Store, it may be interesting to look at 'size_bytes', 'rating_count_tot', 'user_rating', 'cont_rating', and 'prime_genre'.
For the Google Store, we may want to look at 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Content Rating', and 'Genres'.
For now, let's check what the most common genres are for the two stores. That information is contained in the column 'prime_genre' for Apple and in 'Genres' (along with the broader 'Category') for Google.
# Getting the index numbers for genre and/or category
apple_genre_index = apple_header.index('prime_genre')
google_genre_index = google_header.index('Genres')
google_category_index = google_header.index('Category')
print('Apple genre index is ' + str(apple_genre_index))
print('Google category index is ' + str(google_category_index))
print('Google genre index is ' + str(google_genre_index))
Apple genre index is 12 Google category index is 1 Google genre index is 9
Building Frequency Tables
We will now create functions that will help us generate frequency tables for our columns of choice.
# Function for generating frequency tables
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        column = row[index]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
    return frequency_table
# no need to assign a variable in the main memory here because we're just going to use this function within the display_table function
# This function was provided by Dataquest
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
We now use the functions we just created, display_table() and freq_table(), to see what categories or genres are most common.
print('Apple Genres')
display_table(apple_final, 12)
print('\n')
print('Google Categories')
display_table(google_final, 1)
print('\n')
print('Google Genres')
display_table(google_final, 9)
Apple Genres Games : 1872 Entertainment : 254 Photo & Video : 160 Education : 118 Social Networking : 106 Shopping : 84 Utilities : 81 Sports : 69 Music : 66 Health & Fitness : 65 Productivity : 56 Lifestyle : 51 News : 43 Travel : 40 Finance : 36 Weather : 28 Food & Drink : 26 Reference : 18 Business : 17 Book : 14 Navigation : 6 Medical : 6 Catalogs : 4 Google Categories FAMILY : 1676 GAME : 862 TOOLS : 750 BUSINESS : 407 LIFESTYLE : 346 PRODUCTIVITY : 345 FINANCE : 328 MEDICAL : 313 SPORTS : 301 PERSONALIZATION : 294 COMMUNICATION : 287 HEALTH_AND_FITNESS : 273 PHOTOGRAPHY : 261 NEWS_AND_MAGAZINES : 248 SOCIAL : 236 TRAVEL_AND_LOCAL : 207 SHOPPING : 199 BOOKS_AND_REFERENCE : 190 DATING : 165 VIDEO_PLAYERS : 159 MAPS_AND_NAVIGATION : 124 FOOD_AND_DRINK : 110 EDUCATION : 103 ENTERTAINMENT : 85 LIBRARIES_AND_DEMO : 83 AUTO_AND_VEHICLES : 82 HOUSE_AND_HOME : 73 WEATHER : 71 EVENTS : 63 PARENTING : 58 ART_AND_DESIGN : 57 COMICS : 55 BEAUTY : 53 Google Genres Tools : 749 Entertainment : 538 Education : 474 Business : 407 Productivity : 345 Lifestyle : 345 Finance : 328 Medical : 313 Sports : 307 Personalization : 294 Communication : 287 Action : 275 Health & Fitness : 273 Photography : 261 News & Magazines : 248 Social : 236 Travel & Local : 206 Shopping : 199 Books & Reference : 190 Simulation : 181 Dating : 165 Arcade : 164 Video Players & Editors : 157 Casual : 156 Maps & Navigation : 124 Food & Drink : 110 Puzzle : 100 Racing : 88 Role Playing : 83 Libraries & Demo : 83 Auto & Vehicles : 82 Strategy : 81 House & Home : 73 Weather : 71 Events : 63 Adventure : 60 Comics : 54 Beauty : 53 Art & Design : 53 Parenting : 44 Card : 40 Casino : 38 Trivia : 37 Educational;Education : 35 Board : 34 Educational : 33 Education;Education : 30 Word : 23 Casual;Pretend Play : 21 Music : 18 Racing;Action & Adventure : 15 Puzzle;Brain Games : 15 Entertainment;Music & Video : 15 Casual;Brain Games : 12 Casual;Action & Adventure : 12 Arcade;Action & Adventure : 11 Action;Action & 
Adventure : 9 Educational;Pretend Play : 8 Simulation;Action & Adventure : 7 Parenting;Education : 7 Entertainment;Brain Games : 7 Board;Brain Games : 7 Parenting;Music & Video : 6 Educational;Brain Games : 6 Casual;Creativity : 6 Art & Design;Creativity : 6 Education;Pretend Play : 5 Role Playing;Pretend Play : 4 Education;Creativity : 4 Role Playing;Action & Adventure : 3 Puzzle;Action & Adventure : 3 Entertainment;Creativity : 3 Entertainment;Action & Adventure : 3 Educational;Creativity : 3 Educational;Action & Adventure : 3 Education;Music & Video : 3 Education;Brain Games : 3 Education;Action & Adventure : 3 Adventure;Action & Adventure : 3 Video Players & Editors;Music & Video : 2 Sports;Action & Adventure : 2 Simulation;Pretend Play : 2 Puzzle;Creativity : 2 Music;Music & Video : 2 Entertainment;Pretend Play : 2 Casual;Education : 2 Board;Action & Adventure : 2 Video Players & Editors;Creativity : 1 Trivia;Education : 1 Travel & Local;Action & Adventure : 1 Tools;Education : 1 Strategy;Education : 1 Strategy;Creativity : 1 Strategy;Action & Adventure : 1 Simulation;Education : 1 Role Playing;Brain Games : 1 Racing;Pretend Play : 1 Puzzle;Education : 1 Parenting;Brain Games : 1 Music & Audio;Music & Video : 1 Lifestyle;Pretend Play : 1 Lifestyle;Education : 1 Health & Fitness;Education : 1 Health & Fitness;Action & Adventure : 1 Entertainment;Education : 1 Communication;Creativity : 1 Comics;Creativity : 1 Casual;Music & Video : 1 Card;Action & Adventure : 1 Books & Reference;Education : 1 Art & Design;Pretend Play : 1 Art & Design;Action & Adventure : 1 Arcade;Pretend Play : 1 Adventure;Education : 1
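Raw counts are hard to compare across two data sets of different sizes, so a variant of the frequency table that reports percentages may be useful. The sketch below uses collections.Counter from the standard library; freq_table_pct() is a hypothetical name and the rows are toy values.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Frequency table for one column, expressed as percentages of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Toy rows (hypothetical values): app name and genre
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
table = freq_table_pct(toy, 1)

# Print from most to least common
for genre, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', round(pct, 1))
```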
Looking at the Apple Store frequency table, we see that the most common genre is Games (1,872), followed far behind by Entertainment (254), Photo & Video (160), and Education (118). It appears that the most common apps are those for entertainment rather than for productivity.
While it is not a foregone conclusion, the large number of free games and entertainment apps could mean that developers are flocking to those genres because they attract more users. However, we may want to develop apps that have a reasonably large number of users but fewer competing apps. For example, 'gamified' education apps are a possibility.
For the Google Store, we look at the genres and categories. We see that the Top 5 genres/categories for the stores are as follows:
Category - Family (1676), Game (862), Tools (750), Business (407) , Lifestyle (346)
Genre - Tools (749), Entertainment (538), Education (474), Business (407), Productivity (345)
The distribution of app categories/genres in the Google Store is more balanced between entertainment/fun and practical apps.
We now check which apps are most popular based on the number of users. For the Apple Store, this data is not available but we can use the number of ratings, rating_count_tot, as a proxy (the more ratings, the more users). For the Google Store, the Installs column will give us an idea of how many users there are per category.
Apple
print(apple_header)
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
# Apple Store
apple_unique_genres = freq_table(apple_final, 12)
for genre in apple_unique_genres:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = int(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre + ' : ' + str(avg_n_ratings))
# In the same vein, we can create a dictionary instead
apple_genres_avg_n_ratings = {} # initializing empty dictionary
for genre in apple_unique_genres:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = int(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    apple_genres_avg_n_ratings[genre] = avg_n_ratings
# We try to display the genres from lowest to highest average number of ratings
print('\n')
print(sorted(apple_genres_avg_n_ratings.items(), key = lambda kv:(kv[1], kv[0]))) # each item is a (genre, average) pair; the key sorts by the average (kv[1]) first and uses the genre name (kv[0]) to break ties
Productivity : 21028.410714285714 Weather : 52279.892857142855 Shopping : 26919.690476190477 Reference : 74942.11111111111 Finance : 31467.944444444445 Music : 57326.530303030304 Utilities : 18684.456790123455 Travel : 28243.8 Social Networking : 71548.34905660378 Sports : 23008.898550724636 Health & Fitness : 23298.015384615384 Games : 22812.92467948718 Food & Drink : 33333.92307692308 News : 21248.023255813954 Book : 39758.5 Photo & Video : 28441.54375 Entertainment : 14029.830708661417 Business : 7491.117647058823 Lifestyle : 16485.764705882353 Education : 7003.983050847458 Navigation : 86090.33333333333 Medical : 612.0 Catalogs : 4004.0 [('Medical', 612.0), ('Catalogs', 4004.0), ('Education', 7003.983050847458), ('Business', 7491.117647058823), ('Entertainment', 14029.830708661417), ('Lifestyle', 16485.764705882353), ('Utilities', 18684.456790123455), ('Productivity', 21028.410714285714), ('News', 21248.023255813954), ('Games', 22812.92467948718), ('Sports', 23008.898550724636), ('Health & Fitness', 23298.015384615384), ('Shopping', 26919.690476190477), ('Travel', 28243.8), ('Photo & Video', 28441.54375), ('Finance', 31467.944444444445), ('Food & Drink', 33333.92307692308), ('Book', 39758.5), ('Weather', 52279.892857142855), ('Music', 57326.530303030304), ('Social Networking', 71548.34905660378), ('Reference', 74942.11111111111), ('Navigation', 86090.33333333333)]
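Since the sorting line above was copied without a full understanding of how it works, here is a small breakdown. sorted() calls the key function once per item and orders the items by what it returns; for a dictionary's items(), each item is a (key, value) tuple. The dictionary below reuses three of the averages above (Navigation rounded), and value_then_name() is a hypothetical named version of the lambda.

```python
averages = {'Medical': 612.0, 'Navigation': 86090.33, 'Catalogs': 4004.0}

def value_then_name(kv):
    """Sort key: order by the value first, then by the name to break ties."""
    key, value = kv
    return (value, key)

# Named function and lambda are equivalent ways to write the same key
ascending = sorted(averages.items(), key=value_then_name)
same = sorted(averages.items(), key=lambda kv: (kv[1], kv[0]))

# reverse=True flips the order, giving the highest average first
descending = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

print(ascending)
print(descending)
```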
We see that Navigation, Reference, and Social Networking have the highest average number of ratings. We suspect that these averages are being pulled up by a few very popular apps (which would make it more difficult for our company to attract users to these types of apps). For example, Google Maps and Waze for Navigation; Facebook for Social Networking; and Wikipedia for Reference.
Let's try to verify that in the next cell.
print('Navigation')
for app in apple_final:
    if app[12] == 'Navigation' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])
print('\n')
print('Reference')
for app in apple_final:
    if app[12] == 'Reference' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])
print('\n')
print('Social Networking')
for app in apple_final:
    if app[12] == 'Social Networking' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])
Navigation Waze - GPS Navigation, Maps & Real-time Traffic : 345046 Google Maps - Navigation & Transit : 154911 Reference Bible : 985920 Dictionary.com Dictionary & Thesaurus : 200047 Social Networking Facebook : 2974676 Skype for iPhone : 373519 Tumblr : 334293 WhatsApp Messenger : 287589 TextNow - Unlimited Text + Calls : 164963 Kik : 260965 Viber Messenger – Text & Call : 164249 ooVoo – Free Video Call, Text and Voice : 177501 Pinterest : 1061624 Messenger : 351466 Followers - Social Analytics For Instagram : 112778
We confirm our suspicion that the average number of ratings (which is our proxy for the average number of users) for the top 3 genres is being pulled up by a small number of extremely popular apps. The same is most likely true for Music (Spotify, Pandora), Weather (Weather, Accuweather), and Book (Kindle, Audible).
Unless we want to develop the less popular app categories (with fewer reviews), we'll just have to try to compete with bigger players. There appears to be some promise for Reference, Book, Photo & Video, and Travel. These types of apps don't really require physical presence or services and can be developed by a small development team.
For the Google Play Store, we have information on the number of installs, but the values are stored as strings and indicate ranges. For now, let's convert them into numbers (floats or integers) using a crude process described by Dataquest, where we just assume that the open-ended strings represent the actual number of installs. This means that apps with installs of '10,000+' will be assumed to have 10,000 installs, '100,000+' to have 100,000, and so on. Note that we could instead use a value in the middle of each range, but let's follow the recommended process in the Dataquest prompt.
print(google_header)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
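The conversion described above, together with the averaging, can also be written as two small helpers. This is only a sketch: parse_installs() and average_by_group() are hypothetical names and the rows are toy values. Unlike a loop over the whole data set for every category, it goes through the rows once while keeping running totals and counts.

```python
def parse_installs(installs):
    """Turn a string such as '10,000+' into the integer 10000."""
    return int(installs.replace('+', '').replace(',', ''))

def average_by_group(dataset, group_index, value_index, convert=int):
    """Average of one column per group, computed in a single pass."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0) + convert(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (hypothetical values): name, category, installs
toy = [
    ['App A', 'TOOLS', '10,000+'],
    ['App B', 'TOOLS', '1,000+'],
    ['App C', 'GAME', '500,000+'],
]
averages = average_by_group(toy, 1, 2, convert=parse_installs)
print(averages)
```

On the real data this would be called as average_by_group(google_final, 1, 5, convert=parse_installs), and as average_by_group(apple_final, 12, 6) for the Apple ratings.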
google_unique_categories = freq_table(google_final, 1)
for category in google_unique_categories:
    total = 0
    len_category = 0
    for app in google_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = int(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
# Again, we can create a dictionary
google_cat_avg_n_installs = {} # initializing empty dictionary
for category in google_unique_categories:
    total = 0
    len_category = 0
    for app in google_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = int(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    google_cat_avg_n_installs[category] = avg_n_installs
# We try to display the categories from lowest to highest average number of installs
print('\n')
# sorted() orders the (category, average) pairs by average, using the category name to break ties
print(sorted(google_cat_avg_n_installs.items(), key = lambda kv:(kv[1], kv[0])))
ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAGAZINES : 9549178.467741935
MAPS_AND_NAVIGATION : 4056941.7741935486

[('MEDICAL', 120550.61980830671), ('EVENTS', 253542.22222222222), ('BEAUTY', 513151.88679245283), ('PARENTING', 542603.6206896552), ('LIBRARIES_AND_DEMO', 638503.734939759), ('AUTO_AND_VEHICLES', 647317.8170731707), ('COMICS', 817657.2727272727), ('DATING', 854028.8303030303), ('HOUSE_AND_HOME', 1331540.5616438356), ('FINANCE', 1387692.475609756), ('LIFESTYLE', 1437816.2687861272), ('BUSINESS', 1712290.1474201474), ('EDUCATION', 1833495.145631068), ('FOOD_AND_DRINK', 1924897.7363636363), ('ART_AND_DESIGN', 1986335.0877192982), ('SPORTS', 3638640.1428571427), ('FAMILY', 3695641.8198090694), ('MAPS_AND_NAVIGATION', 4056941.7741935486), ('HEALTH_AND_FITNESS', 4188821.9853479853), ('WEATHER', 5074486.197183099), ('PERSONALIZATION', 5201482.6122448975), ('SHOPPING', 7036877.311557789), ('BOOKS_AND_REFERENCE', 8767811.894736841), ('NEWS_AND_MAGAZINES', 9549178.467741935), ('TOOLS', 10801391.298666667), ('ENTERTAINMENT', 11640705.88235294), ('TRAVEL_AND_LOCAL', 13984077.710144928), ('GAME', 15588015.603248259), ('PRODUCTIVITY', 16787331.344927534), ('PHOTOGRAPHY', 17840110.40229885), ('SOCIAL', 23253652.127118643), ('VIDEO_PLAYERS', 24727872.452830188), ('COMMUNICATION', 38456119.167247385)]
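The `sorted()` call with a `lambda` key used above can be unpacked with a small example. This is a sketch on made-up toy values (not the real averages): `.items()` yields `(key, value)` pairs, and the key function turns each pair into a `(value, key)` tuple. Python compares tuples element by element, so the pairs are ordered by value first, and the category name only matters when two values are equal.

```python
# Toy dictionary with made-up numbers, chosen so that two values tie
toy_averages = {'SOCIAL': 20.0, 'GAME': 15.0, 'BEAUTY': 15.0}

# kv is one (key, value) pair; (kv[1], kv[0]) means "sort by value, then by key"
pairs_sorted = sorted(toy_averages.items(), key=lambda kv: (kv[1], kv[0]))
print(pairs_sorted)
# [('BEAUTY', 15.0), ('GAME', 15.0), ('SOCIAL', 20.0)]

# Passing reverse=True flips the order, so the highest value comes first
pairs_desc = sorted(toy_averages.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
print(pairs_desc)
# [('SOCIAL', 20.0), ('GAME', 15.0), ('BEAUTY', 15.0)]
```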
The categories in the Google Store with the highest average number of installs are Communication, Video Players, Social, Photography, and Productivity. As with the Apple Store, these numbers are most likely being pulled up by a few very popular apps. Let's check.
print('VIDEO_PLAYERS')
for app in google_final:
    if (app[1] == 'VIDEO_PLAYERS') and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])
print('\n')
print('COMMUNICATION')
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])
VIDEO_PLAYERS
YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+

COMMUNICATION
WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
The same pattern of a few extremely popular apps inflating the average number of users per category can be seen in the Google Store. We have YouTube for Video Players and Google Chrome, WhatsApp, and Messenger for Communication.
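One way to see how much a single giant app can distort an average is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses made-up install counts for an imaginary category (they are not taken from the data set) and the standard library's `statistics` module.

```python
from statistics import mean, median

# Made-up install counts: four small apps and one giant one
toy_installs = [1_000, 5_000, 10_000, 50_000, 1_000_000_000]

toy_mean = mean(toy_installs)      # dominated by the single giant app
toy_median = median(toy_installs)  # the middle value, unaffected by the giant

print('mean  :', toy_mean)    # roughly 200 million
print('median:', toy_median)  # 10000
```

A per-category median computed this way could complement the averages above when judging how crowded a category really is.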
Since we were considering some sort of gamified or interactive reference or book app for the Apple Store, let's check what the BOOKS_AND_REFERENCE and GAME categories in the Google Store look like.
print('\n')
print('BOOKS_AND_REFERENCE')
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'
                                            or app[5] == '10,000,000+'):
        print(app[0], ':', app[5])
print('\n')
print('GAME')
for app in google_final:
    if app[1] == 'GAME' and (app[5] == '1,000,000,000+'
                             or app[5] == '500,000,000+'
                             or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
BOOKS_AND_REFERENCE
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Google Play Books : 1,000,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Quran for Android : 10,000,000+
Audiobooks from Audible : 100,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Dictionary : 10,000,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+

GAME
Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Tiles 2™ : 100,000,000+
Pokémon GO : 100,000,000+
Extreme Car Driving Simulator : 100,000,000+
Trivia Crack : 100,000,000+
Angry Birds 2 : 100,000,000+
Candy Crush Saga : 500,000,000+
8 Ball Pool : 100,000,000+
Subway Surfers : 1,000,000,000+
Candy Crush Soda Saga : 100,000,000+
Clash Royale : 100,000,000+
Clash of Clans : 100,000,000+
Plants vs. Zombies FREE : 100,000,000+
Pou : 500,000,000+
Flow Free : 100,000,000+
My Talking Angela : 100,000,000+
slither.io : 100,000,000+
Cooking Fever : 100,000,000+
Yes day : 100,000,000+
Score! Hero : 100,000,000+
Dream League Soccer 2018 : 100,000,000+
My Talking Tom : 500,000,000+
Sniper 3D Gun Shooter: Free Shooting Games - FPS : 100,000,000+
Zombie Tsunami : 100,000,000+
Helix Jump : 100,000,000+
Crossy Road : 100,000,000+
Temple Run 2 : 500,000,000+
Talking Tom Gold Run : 100,000,000+
Agar.io : 100,000,000+
Bus Rush: Subway Edition : 100,000,000+
Traffic Racer : 100,000,000+
Hill Climb Racing : 100,000,000+
Angry Birds Rio : 100,000,000+
Cut the Rope FULL FREE : 100,000,000+
Hungry Shark Evolution : 100,000,000+
Angry Birds Classic : 100,000,000+
Hill Climb Racing 2 : 100,000,000+
Jetpack Joyride : 100,000,000+
Super Mario Run : 100,000,000+
Glow Hockey : 100,000,000+
Asphalt 8: Airborne : 100,000,000+
Lep's World 2 🍀🍀 : 100,000,000+
Fruit Ninja® : 100,000,000+
Vector : 100,000,000+
Dr. Driving : 100,000,000+
Bike Race Free - Top Motorcycle Racing Games : 100,000,000+
Smash Hit : 100,000,000+
Temple Run : 100,000,000+
Geometry Dash Lite : 100,000,000+
Ant Smasher by Best Cool & Fun Games : 100,000,000+
Angry Birds Star Wars : 100,000,000+
Mobile Legends: Bang Bang : 100,000,000+
Banana Kong : 100,000,000+
Skater Boy : 100,000,000+
Shadow Fight 2 : 100,000,000+
Modern Combat 5: eSports FPS : 100,000,000+
Garena Free Fire : 100,000,000+
Dictionaries, e-book readers, and religious text references appear to be the most popular. On the other hand, casual games appear to dominate the Game category.
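The category checks above compare the install string against a hand-written list of buckets. An alternative, sketched below, is to convert the string to a number and filter on a threshold. The function name `apps_at_least` and its parameters are hypothetical, and the sample rows are cut down to the columns that matter (name at index 0, category at index 1, installs at index 5), with empty strings standing in for the other fields.

```python
def apps_at_least(dataset, category, min_installs):
    """Return (name, installs) pairs for apps in `category` whose install
    bucket has a lower bound of at least `min_installs`.

    Hypothetical helper: assumes name, category, and installs sit at
    indexes 0, 1, and 5, as in the Google Play data set used here.
    """
    matches = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = int(app[5].replace('+', '').replace(',', ''))
        if n_installs >= min_installs:
            matches.append((app[0], app[5]))
    return matches


# Cut-down sample rows; the empty strings are placeholders for unused columns
sample = [
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Bible', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Subway Surfers', 'GAME', '', '', '', '1,000,000,000+'],
]

popular_books = apps_at_least(sample, 'BOOKS_AND_REFERENCE', 100_000_000)
for name, installs in popular_books:
    print(name, ':', installs)
# Google Play Books : 1,000,000,000+
# Bible : 100,000,000+
```

A numeric threshold also picks up any intermediate bucket (for example '50,000,000+', if the data set contains one) that an explicit list of strings would skip.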
The goal of the analysis in this project was to identify the type of app we would recommend developing if the business model is to attract users and earn revenue through ads. There appears to be no clear answer, but reference or educational apps that are interactive or 'gamified' seem to hold some promise. We could develop primary-school level reference materials (nursery rhymes, short stories) that are interactive, for example by adding simple mini-games (word matching, fill in the blanks, image matching) that grant the user experience points of some sort.
For a more mature demographic, we could create an app that recommends books or movies based on the titles users say they like. The app could also contain some basic information on those movies and books.