Dataquest Guided Project: Profitable App Profiles for the App Store and Google Play Markets

This is the first guided project in Dataquest's Data Scientist in Python path. The idea is to imagine working for an app development company that develops and publishes free apps in the Apple Store and the Google Play Store. Since the apps are free, revenue comes from ads. As data analysts, we want to find out which profiles of free apps have the potential to generate the most ad revenue.

My goal for this project is to familiarize myself with some basic data analysis tasks using Python and with working in the Jupyter interface.

Opening and Exploring Data

We begin by defining some functions to automate some repetitive tasks.

In [1]:

# Creating a function that opens a csv file as a list of lists
from csv import reader

def open_as_list(dataset, separate_header=True):
    # the with-block closes the file once it has been read
    with open(dataset, encoding='utf8', newline='') as opened_file:
        list_file = list(reader(opened_file))
    if separate_header:
        return list_file[0], list_file[1:]
    return list_file

In [2]:

# This is a function provided by Dataquest that allows us to print rows in a readable way
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Loading the data sets

We then begin to load the data sets as lists of lists, using the functions (open_as_list() and explore_data()) that we defined earlier.

In [3]:

# Opening the csv files (with separate header)
apple_header, apple = open_as_list('AppleStore.csv')
google_header, google = open_as_list('googleplaystore.csv')

In [4]:

# Exploring the first few rows of the data sets using the explore_data() function
explore_data(apple, 0, 3, rows_and_columns=True)
explore_data(google, 0, 3, rows_and_columns=True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13

In [5]:

# Exploring the column headers
print(apple_header)
print('\n')
print(google_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Documentation for the Apple Store data set is available in this link while the documentation for the Google Play Store data set is available here.

Deleting Wrong Data (Row with Missing Data)

Dataquest informs us that in the discussion thread for the Google Play Store data set, there appears to be a row with missing data (column length is less than expected). We look for the index number of the problematic row in the following cell.

In [6]:

# Looking for rows with missing data in the Google Play Store data
rows_with_miss = [] # index numbers of rows with missing data
header_length = len(google_header)
for index, row in enumerate(google): # enumerate gives the true index, even if a row appears more than once
    if len(row) != header_length:
        print(row)
        print('\n')
        print('Row with index number ' + str(index) + ' may have missing values')
        rows_with_miss.append(index)
number_rows_miss = len(rows_with_miss) # number of rows with missing data
print('\n')
print(rows_with_miss)
print(number_rows_miss)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row with index number 10472 may have missing values


[10472]
1

We check the details of the problematic row(s) and compare them with the header row to better understand what went wrong.

If there are more rows with missing data, we can also automate this process with a for loop. We skip that part for now and proceed, knowing there is only one row with missing data.

In [7]:

# Checking the details of the problematic row
print(google_header)
print('\n')
print(google[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

The Category value appears to be missing: the row jumps straight to 1.9 instead of a string describing the category. We now delete the problematic row. Again, we could use a for loop here if there were more rows with missing data.

In [8]:

# Deleting the problematic row; the length check makes the cell safe to re-run
if len(google[10472]) != len(google_header):
    del google[10472]
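
If more than one row had the wrong number of columns, the check and the deletion could be combined into a single pass. The following is a minimal sketch, not part of the Dataquest instructions; the helper name drop_malformed_rows and the sample rows are hypothetical.

```python
def drop_malformed_rows(dataset, header):
    """Split dataset into rows that match the header length and rows that do not."""
    expected = len(header)
    good_rows = []
    bad_indices = []
    for index, row in enumerate(dataset):
        if len(row) == expected:
            good_rows.append(row)
        else:
            bad_indices.append(index)
    return good_rows, bad_indices


# Tiny made-up sample: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

clean_rows, bad = drop_malformed_rows(sample_rows, sample_header)
print(bad)              # indices of the rows that were dropped
print(len(clean_rows))  # number of rows kept
```

Because the helper builds a new list instead of deleting in place, running it twice gives the same result.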

Removing Duplicate Entries (Google)

Apparently, the Google Play Store data (the google data set) contains duplicate entries for some apps. Our next step is to identify the apps with duplicate entries.

In [9]:

# Creating lists containing the duplicates and the unique app names
google_unique = []
google_duplicate = []

for row in google:
    app = row[0]
    if app in google_unique:
        google_duplicate.append(app)
    else:
        google_unique.append(app)

# Checking how many apps are unique and how many are duplicated
print('There are ' + str(len(google_unique)) + ' unique apps.')
print('There are ' + str(len(google_duplicate)) + ' duplicated apps.')
print('Here are some examples of duplicated apps: ' + str(google_duplicate[0:3]))

There are 9659 unique apps.
There are 1181 duplicated apps.
Here are some examples of duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
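
Checking whether a name is already in a list scans that list every time, which gets slow as the list grows. As a hedged alternative (not something the Dataquest prompt asks for), a set can track the names already seen. The function find_duplicates and the sample rows below are made up for illustration.

```python
def find_duplicates(dataset, name_index):
    """Return (unique_names, duplicate_names) in order of appearance."""
    seen = set()  # set membership checks do not slow down as the data grows
    unique_names = []
    duplicate_names = []
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names


sample_rows = [['Box', '159'], ['Slack', '20'], ['Box', '160'], ['Box', '161']]
unique_names, duplicate_names = find_duplicates(sample_rows, 0)
print(unique_names)
print(duplicate_names)
```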

We will not be deleting the duplicated apps right away. Instead, we first decide how to deal with them. We can either (1) delete duplicates using some criterion or (2) retain one row for each unique app and aggregate the numerical columns. We'll go with option 1 for this project.

Our criterion is to keep the row with the highest number of reviews, on the assumption that the row with the most reviews was collected most recently and so holds the latest information.

We then create a dictionary with app names as keys and the highest number of reviews for that app as the values.

In [10]:

google_reviews_max = {}

for app in google:
    name = app[0]
    n_reviews = int(app[3]) # Dataquest recommends converting to float but we convert to int since n_reviews are whole numbers
    if (name in google_reviews_max) and (google_reviews_max[name] < n_reviews):
        google_reviews_max[name] = n_reviews
    if name not in google_reviews_max:
        google_reviews_max[name] = n_reviews

print(len(google_reviews_max)) # checking if reviews_max dictionary has the correct length (9659)
print(list(google_reviews_max.items())[:3]) # checking the first few elements of the reviews_max dictionary

9659
[('Photo Editor & Candy Camera & Grid & ScrapBook', 159), ('Coloring book moana', 974), ('U Launcher Lite – FREE Live Cool Themes, Hide Apps', 87510)]

Next, we use the dictionary we just created to retain only the rows we're interested in keeping (based on our criteria). For this step, we will be creating a new list of lists which we will name google_clean. This new data set should have no more duplicates.

In [11]:

google_clean = []
google_already_added = []

for app in google:
    name = app[0]
    n_reviews = int(app[3])
    if (n_reviews == google_reviews_max[name]) and (name not in google_already_added):
        google_clean.append(app)
        google_already_added.append(name)
print(len(google_clean)) # checking if our new data set has the correct number of rows (9659)
print(google_clean[:3]) # checking the first few entries of our 'cleaned' data set

# We could also use the explore_data function to perform these checks
# Upon reviewing some of the prior cells, it may be useful to define a function that will remove duplicate rows for us, we skip that for now.

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
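
The comment in the cell above mentions that a reusable function for removing duplicates might be useful. One possible shape for it is sketched below, assuming the same criterion (keep the first row that has the highest review count). The name remove_duplicates and the sample data are hypothetical, not from Dataquest.

```python
def remove_duplicates(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: record the highest review count seen for each name
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches that maximum
    cleaned = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned


sample_rows = [
    ['Box', '159'],
    ['Slack', '20'],
    ['Box', '974'],
    ['Box', '974'],
]
deduped = remove_duplicates(sample_rows, name_index=0, reviews_index=1)
print(deduped)
```

With such a helper, the Google data could be cleaned with remove_duplicates(google, 0, 3) and the Apple data with remove_duplicates(apple, 2, 6), using the column positions shown in the headers.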

Ideally, we do the same process for the Apple Store data set. The Dataquest prompts do not ask us to do so but we'll check anyway. This is also for consistency so we get to call the Apple Store data set apple_clean instead of just apple the same way we should now be using the google_clean data set.

Removing Duplicate Entries (Apple)

For the Apple Store data set, we will use the same criterion (highest number of reviews). We could instead try retaining the row with the latest version or the most recent update (although the update date is not in our Apple data set), but we are not yet familiar enough with comparing version strings (e.g. 1.0.1 vs. 1.0.2) to implement such conditions.
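
For reference, version strings such as 1.0.1 and 1.0.2 can be compared by turning them into tuples of integers, since plain string comparison would rank '1.10.0' below '1.9.0'. The sketch below assumes purely numeric, dot-separated versions; values such as 'Varies with device' would need separate handling. The helper parse_version is hypothetical and is not used in the rest of this notebook.

```python
def parse_version(version):
    """Convert a dotted version string such as '1.0.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


print(parse_version('1.0.2') > parse_version('1.0.1'))    # True
print(parse_version('1.10.0') > parse_version('1.9.0'))   # True (tuple comparison)
print('1.10.0' > '1.9.0')                                 # False (string comparison)
```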

In [12]:

# Creating lists containing the duplicates and the unique app names
apple_unique = []
apple_duplicate = []

for row in apple:
    app = row[2]
    if app in apple_unique:
        apple_duplicate.append(app)
    else:
        apple_unique.append(app)

# Checking how many apps are unique and how many are duplicated
print('There are ' + str(len(apple_unique)) + ' unique apps.')
print('There are ' + str(len(apple_duplicate)) + ' duplicated apps.')
print('Here are some examples of duplicated apps: ' + str(apple_duplicate[0:3]))
print('\n')
# Creating dictionary for max number of reviews
apple_reviews_max = {}

for app in apple:
    name = app[2]
    n_reviews = int(app[6]) # Dataquest recommends converting to float but we convert to int since n_reviews are whole numbers
    if (name in apple_reviews_max) and (apple_reviews_max[name] < n_reviews):
        apple_reviews_max[name] = n_reviews
    if name not in apple_reviews_max:
        apple_reviews_max[name] = n_reviews

print(len(apple_reviews_max)) # checking if reviews_max dictionary has the correct length (7195)
print(list(apple_reviews_max.items())[:3]) # checking the first few elements of the reviews_max dictionary
print('\n')

# Creating the new apple_clean data set with duplicates removed
apple_clean = []
apple_already_added = []

for app in apple:
    name = app[2]
    n_reviews = int(app[6])
    if (n_reviews == apple_reviews_max[name]) and (name not in apple_already_added):
        apple_clean.append(app)
        apple_already_added.append(name)

print(len(apple_clean)) # checking if our new data set has the correct number of rows (7195)
print('\n')
print(apple_clean[:2]) # checking the first few entries of our 'cleaned' data set

There are 7195 unique apps.
There are 2 duplicated apps.
Here are some examples of duplicated apps: ['VR Roller Coaster', 'Mannequin Challenge']


7195
[('PAC-MAN Premium', 21292), ('Evernote - stay organized', 161065), ('WeatherBug - Local Weather, Radar, Maps, Alerts', 188583)]


7195


[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']]

We're now done with removing duplicates for both data sets. It's a good thing that we did the extra work for the Apple data set, despite Dataquest not prompting us to do so, since there are apparently two duplicate entries.

Removing Non-English Apps

In this project, we are pretending to work for a team that only develops English-language apps, so we must now remove apps not aimed at English-speaking users. Non-English apps could skew our inferences, since the people downloading them might have different preferences.

We start by defining a function that will help us systematically identify whether an app name has non-English characters. This means we want to tag apps with names that contain characters whose Unicode code point is greater than 127 (Dataquest explains that the characters commonly used in English text fall in the range 0-127, the ASCII range).
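
To make the 0-127 rule concrete, the built-in ord() function returns the code point of a single character. The short sketch below only illustrates that rule; the sample characters are arbitrary.

```python
# Code points for a few sample characters; anything above 127 is outside ASCII
for character in ['A', 'z', '5', '™', '爱']:
    code_point = ord(character)
    print(character, code_point, code_point > 127)
```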

In [13]:

def english_app(text):
    for character in text:
        if ord(character) > 127:
            return False # function stops and returns False as soon as it finds a single non-English character
    return True

We check whether our english_app() function works.

In [14]:

print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suit'))
print(english_app('Instachat 😜'))

True
False
False
False

While english_app() appears to work, it also tags apps whose names use special characters (e.g. trademark symbols, emojis) as non-English. We need to relax the rule so that an app is tagged as non-English only if its name has more than three 'non-English' characters. We define a new function called english_loose().

In [15]:

def english_loose(text):
    non_english_chars = 0
    for character in text:
        if ord(character) > 127:
            non_english_chars += 1
    if non_english_chars > 3:
        return False
    else:
        return True

We check again whether our new function, english_loose(), is working as intended. Note that while the filter is still not perfect, it should work better than the stricter criterion in the english_app() function.

In [16]:

print(english_loose('Instagram'))
print(english_loose('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_loose('Docs To Go™ Free Office Suit'))
print(english_loose('Instachat 😜'))

True
False
True
True

We can now use the english_loose() function to filter out non-English apps from both data sets (apple_clean and google_clean). We will call our new data sets apple_english and google_english. There is no need to do this filtering for the headers (apple_header and google_header) since the information contained in those lists will remain valid for all the apps in their respective data sets.

In [17]:

# Initialize empty lists
apple_english = []
google_english = []

for app in apple_clean:
    name = app[2] # name of app in apple data set is in index 2
    english = english_loose(name)
    if english:
        apple_english.append(app)
print('There are ' + str(len(apple_english)) + ' English apps in our Apple Store data set')
print(apple_english[:3])
print('\n')

for app in google_clean:
    name = app[0] # name of app in google data set is in index 0
    english = english_loose(name)
    if english:
        google_english.append(app)

print('There are ' + str(len(google_english)) + ' English apps in our Google Store data set')
print(google_english[:3])

There are 6181 English apps in our Apple Store data set
[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']]


There are 9614 English apps in our Google Store data set
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]

Isolating Free Apps

Since we're only interested in free apps (our company gets its revenues from ads), we will need to remove apps that are not free. First, let's check the headers of our data sets to refresh our memory.

In [18]:

print('Apple Store')
print(apple_header)
print('\n')
print('Google Store')
print(google_header)

Apple Store
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google Store
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

We will now be creating our 'final' data sets.

In [19]:

# Initializing empty lists
apple_final = []
google_final = []

# Isolating free apps in Apple Store
for app in apple_english:
    price = app[5] # should ideally convert to float but returns error because of some special characters we still don't know how to deal with
    if price == '0' or price == '0.0': # we just use different iterations of the strings zero instead
        apple_final.append(app)

# Isolating free apps in Google Store        
for app in google_english:
    price = app[7] # should ideally convert to float but returns error because of some special characters we still don't know how to deal with
    if price == '0' or price == '0.0': # we just use different iterations of the strings zero instead
        google_final.append(app)

print('There are a total of ' + str(len(apple_final)) + ' free English apps in our Apple Store data set.')
print('There are a total of ' + str(len(google_final)) + ' free English apps in our Google Store data set.')

There are a total of 3220 free English apps in our Apple Store data set.
There are a total of 8864 free English apps in our Google Store data set.
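
The string comparison above works, but the prices could also be converted to numbers. A minimal sketch follows, assuming the special character that breaks float() is a leading currency symbol such as '$' (as in '$4.99'). That assumption should be checked against the actual data. The helper parse_price is hypothetical.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' into a float."""
    # Assumption: the only non-numeric character is a leading '$'
    return float(price.strip().lstrip('$'))


for raw in ['0', '0.0', '3.99', '$4.99']:
    print(raw, '->', parse_price(raw), '| free:', parse_price(raw) == 0)
```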

To review, here's what we've done so far with our two data sets:

Removed rows with missing data
Removed duplicate entries
Removed non-English apps
Isolated the free apps

We can now begin some crude analysis for the purpose of this project.

Most Common Apps by Genre

Recall that our hypothetical aim here is to identify the profile of apps that would be most profitable to develop. Dataquest provides the following validation strategy for app development decision-making:

Build a minimal Android version of the app, and add it to Google Play.
If the app has a good response from users, we develop it further.
If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Since the plan is to release apps for both the Apple and Google stores, we will now look at the profiles of apps and try to figure out what would be most profitable (that is, what generates the most revenue through ads).

In [20]:

# Checking the columns to see which ones we can use to generate frequency tables
print('Apple')
display(apple_header)
print('\n')
print('Google')
display(google_header)

Apple

['',
 'id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']


Google

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

For the Apple Store, it may be interesting to look at 'size_bytes', 'rating_count_tot', 'user_rating', 'cont_rating', and 'prime_genre'.

For the Google Store, we may want to look at 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Content Rating', and 'Genres'.

For now, let's check what the most common genres are for the two stores. That information is contained in the 'prime_genre' column for Apple and the 'Category' and 'Genres' columns for Google.

In [21]:

# Getting the index numbers for genre and/or category
apple_genre_index = apple_header.index('prime_genre')
google_genre_index = google_header.index('Genres')
google_category_index = google_header.index('Category')
print('Apple genre index is ' + str(apple_genre_index))
print('Google category index is ' + str(google_category_index))
print('Google genre index is ' + str(google_genre_index))

Apple genre index is 12
Google category index is 1
Google genre index is 9

Building Frequency Tables

We will now create functions that will help us generate frequency tables for our columns of choice.

In [22]:

# Function for generating frequency tables
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        column = row[index]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
    return frequency_table
# freq_table() only returns the dictionary; display_table() below calls it and prints the sorted result

# This function was provided by Dataquest 
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
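
The same counts can be produced with collections.Counter from the standard library, and turning them into percentages makes the two stores easier to compare. This is an optional sketch rather than the Dataquest approach; freq_table_percent and the sample rows are made up.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return a dict mapping each value in the column to its share in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
percentages = freq_table_percent(sample_rows, 1)

# Sort by share, largest first
for genre, share in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(share, 1))
```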

We now use the functions we just created, display_table() and freq_table(), to see which categories or genres are most common.

In [23]:

print('Apple Genres')
display_table(apple_final, 12)
print('\n')
print('Google Categories')
display_table(google_final, 1)
print('\n')
print('Google Genres')
display_table(google_final, 9)

Apple Genres
Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


Google Categories
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


Google Genres
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Adventure : 11
Action;Action & Adventure : 9
Educational;Pretend Play : 8
Simulation;Action & Adventure : 7
Parenting;Education : 7
Entertainment;Brain Games : 7
Board;Brain Games : 7
Parenting;Music & Video : 6
Educational;Brain Games : 6
Casual;Creativity : 6
Art & Design;Creativity : 6
Education;Pretend Play : 5
Role Playing;Pretend Play : 4
Education;Creativity : 4
Role Playing;Action & Adventure : 3
Puzzle;Action & Adventure : 3
Entertainment;Creativity : 3
Entertainment;Action & Adventure : 3
Educational;Creativity : 3
Educational;Action & Adventure : 3
Education;Music & Video : 3
Education;Brain Games : 3
Education;Action & Adventure : 3
Adventure;Action & Adventure : 3
Video Players & Editors;Music & Video : 2
Sports;Action & Adventure : 2
Simulation;Pretend Play : 2
Puzzle;Creativity : 2
Music;Music & Video : 2
Entertainment;Pretend Play : 2
Casual;Education : 2
Board;Action & Adventure : 2
Video Players & Editors;Creativity : 1
Trivia;Education : 1
Travel & Local;Action & Adventure : 1
Tools;Education : 1
Strategy;Education : 1
Strategy;Creativity : 1
Strategy;Action & Adventure : 1
Simulation;Education : 1
Role Playing;Brain Games : 1
Racing;Pretend Play : 1
Puzzle;Education : 1
Parenting;Brain Games : 1
Music & Audio;Music & Video : 1
Lifestyle;Pretend Play : 1
Lifestyle;Education : 1
Health & Fitness;Education : 1
Health & Fitness;Action & Adventure : 1
Entertainment;Education : 1
Communication;Creativity : 1
Comics;Creativity : 1
Casual;Music & Video : 1
Card;Action & Adventure : 1
Books & Reference;Education : 1
Art & Design;Pretend Play : 1
Art & Design;Action & Adventure : 1
Arcade;Pretend Play : 1
Adventure;Education : 1

Looking at the App Store frequency table, we see that the most common genre is Games (1,872), followed far behind by Entertainment (254), Photo & Video (160), and Education (118). It appears that the most common apps are those for entertainment rather than for productivity.

While it is not a foregone conclusion, the large number of free games and entertainment apps could mean that developers are flocking to those genres because of the larger number of users for those types of apps. However, we may want to develop apps that have a reasonably large number of users but fewer competing apps. For example, 'gamified' education apps are a possibility.

For the Google Store, we look at both the genres and the categories. The top 5 of each are as follows:

Category - Family (1676), Game (862), Tools (750), Business (407), Lifestyle (346)

Genre - Tools (749), Entertainment (538), Education (474), Business (407), Productivity (345)

The distribution of app categories/genres in the Google Store is more balanced between entertainment/fun and practical apps.

Most Popular Apps by Genre

We now check which apps are most popular based on the number of users. For the Apple Store, this data is not available but we can use the number of ratings, rating_count_tot, as a proxy (the more ratings, the more users). For the Google Store, the Installs column will give us an idea of how many users there are per genre.

Apple

In [24]:

print(apple_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

In [25]:

# Apple Store
apple_unique_genres = freq_table(apple_final, 12)

for genre in apple_unique_genres:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = int(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre + ' : ' + str(avg_n_ratings))

# In the same vein, we can create a dictionary instead
apple_genres_avg_n_ratings = {} # initializing empty dictionary
for genre in apple_unique_genres:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[12]
        if  genre_app == genre:
            n_ratings = int(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    apple_genres_avg_n_ratings[genre] = avg_n_ratings

# We display the genres from lowest to highest average number of ratings
print('\n')
# sorted() orders the (genre, average) pairs by the key: the average first (kv[1]), then the genre name (kv[0]) to break ties
print(sorted(apple_genres_avg_n_ratings.items(), key = lambda kv:(kv[1], kv[0])))

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22812.92467948718
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


[('Medical', 612.0), ('Catalogs', 4004.0), ('Education', 7003.983050847458), ('Business', 7491.117647058823), ('Entertainment', 14029.830708661417), ('Lifestyle', 16485.764705882353), ('Utilities', 18684.456790123455), ('Productivity', 21028.410714285714), ('News', 21248.023255813954), ('Games', 22812.92467948718), ('Sports', 23008.898550724636), ('Health & Fitness', 23298.015384615384), ('Shopping', 26919.690476190477), ('Travel', 28243.8), ('Photo & Video', 28441.54375), ('Finance', 31467.944444444445), ('Food & Drink', 33333.92307692308), ('Book', 39758.5), ('Weather', 52279.892857142855), ('Music', 57326.530303030304), ('Social Networking', 71548.34905660378), ('Reference', 74942.11111111111), ('Navigation', 86090.33333333333)]
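
The sorted() call works because .items() yields (genre, average) pairs and the key function tells sorted() which part of each pair to compare. The sketch below shows this on made-up numbers and also computes the averages in a single pass over the data, instead of looping over every app once per genre. The helper average_by_group is hypothetical.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the value column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0) + int(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


sample_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]
averages = average_by_group(sample_rows, group_index=1, value_index=2)

# key=lambda kv: kv[1] sorts the (genre, average) pairs by the average;
# reverse=True puts the largest average first
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```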

We see that Navigation, Reference, and Social Networking have the highest average number of ratings. We suspect that these averages are being pulled up by a few very popular apps (which would make it more difficult for our company to attract users to these types of apps). For example, Google Maps and Waze for Navigation; Facebook for Social Networking; and Wikipedia for Reference.

Let's try to verify that in the next cell.

In [26]:

print('Navigation')
for app in apple_final:
    if app[12] == 'Navigation' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])

print('\n')
print('Reference')
for app in apple_final:
    if app[12] == 'Reference' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])

print('\n')        
print('Social Networking')        
for app in apple_final:
    if app[12] == 'Social Networking' and (int(app[6]) > 100000):
        print(app[2], ':', app[6])

Navigation
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911


Reference
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047


Social Networking
Facebook : 2974676
Skype for iPhone : 373519
Tumblr : 334293
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Kik : 260965
Viber Messenger – Text & Call : 164249
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
Messenger : 351466
Followers - Social Analytics For Instagram : 112778

We confirm our suspicion that the average number of reviews (our proxy for the average number of users) in the top 3 genres is being pulled up by a small number of extremely popular apps. The same is most likely true for Music (Spotify, Pandora), Weather (Weather, AccuWeather), and Book (Kindle, Audible).
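
A quick way to see how much a few giants distort an average is to compare the mean with the median. The sketch below uses made-up review counts (they are not taken from the App Store data set) and the standard library's `statistics` module:

```python
from statistics import mean, median

# Hypothetical review counts for one genre: six small apps plus two giants.
# These numbers are invented for illustration only.
review_counts = [120, 450, 800, 1500, 2300, 4100, 350000, 2900000]

genre_mean = mean(review_counts)      # pulled up by the two giants
genre_median = median(review_counts)  # reflects the typical app

print('Mean   :', genre_mean)
print('Median :', genre_median)
```

When the mean is far above the median, as in this toy example, the average mostly describes the largest apps rather than a typical app in the genre.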

Unless we want to develop the less popular app categories (with fewer reviews), we'll just have to try to compete with bigger players. There appears to be some promise for Reference, Book, Photo & Video, and Travel. These types of apps don't really require physical presence or services and can be developed by a small development team.

Google

For the Google Play Store, we have information on the number of installs, but the values are stored as strings and indicate open-ended ranges. For now, let's convert them into numbers (floats or integers) using the crude process described by Dataquest, where we simply assume that each open-ended string represents the actual number of installs. This means that apps with installs of '10,000+' will be assumed to have 10,000 installs, '100,000+' to have 100,000, and so on. Note that we could instead use the median value within the range, but let's follow the recommended process in the Dataquest prompt.
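
The conversion can be sketched as a small helper. `parse_installs()` mirrors the lower-bound approach described above. `midpoint_installs()` is only one possible reading of "the median value within the range" (the midpoint between a bucket and the next larger bucket); both function names and the list of example buckets are assumptions made for this illustration.

```python
def parse_installs(raw):
    """Convert an install bucket such as '10,000+' to its lower bound (10000)."""
    return int(raw.replace('+', '').replace(',', ''))


def midpoint_installs(raw, buckets):
    """Estimate installs as the midpoint between a bucket and the next larger one.

    `buckets` is a sorted list of the lower bounds that occur in the data.
    The largest bucket has no upper bound, so its lower bound is returned.
    """
    lower = parse_installs(raw)
    larger = [b for b in buckets if b > lower]
    if not larger:
        return lower
    return (lower + larger[0]) / 2


# Hypothetical subset of buckets, used only for this illustration
example_buckets = sorted(parse_installs(s) for s in ['10,000+', '50,000+', '100,000+'])

print(parse_installs('10,000+'))
print(midpoint_installs('10,000+', example_buckets))
print(midpoint_installs('100,000+', example_buckets))
```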

In [27]:

print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

In [28]:

google_unique_categories = freq_table(google_final, 1)

for category in google_unique_categories:
    total = 0
    len_category = 0
    for app in google_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = int(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

# Again, we can create a dictionary
google_cat_avg_n_installs = {} # initializing empty dictionary
for category in google_unique_categories:
    total = 0
    len_category = 0
    for app in google_final:
        category_app = app[1]
        if  category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = int(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = total / len_category
    google_cat_avg_n_installs[category] = avg_n_installs

# Display the categories from lowest to highest average number of installs
print('\n')
# The key maps each (category, average) pair to (average, category), so pairs are sorted by average, then by name
print(sorted(google_cat_avg_n_installs.items(), key = lambda kv:(kv[1], kv[0])))

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAGAZINES : 9549178.467741935
MAPS_AND_NAVIGATION : 4056941.7741935486


[('MEDICAL', 120550.61980830671), ('EVENTS', 253542.22222222222), ('BEAUTY', 513151.88679245283), ('PARENTING', 542603.6206896552), ('LIBRARIES_AND_DEMO', 638503.734939759), ('AUTO_AND_VEHICLES', 647317.8170731707), ('COMICS', 817657.2727272727), ('DATING', 854028.8303030303), ('HOUSE_AND_HOME', 1331540.5616438356), ('FINANCE', 1387692.475609756), ('LIFESTYLE', 1437816.2687861272), ('BUSINESS', 1712290.1474201474), ('EDUCATION', 1833495.145631068), ('FOOD_AND_DRINK', 1924897.7363636363), ('ART_AND_DESIGN', 1986335.0877192982), ('SPORTS', 3638640.1428571427), ('FAMILY', 3695641.8198090694), ('MAPS_AND_NAVIGATION', 4056941.7741935486), ('HEALTH_AND_FITNESS', 4188821.9853479853), ('WEATHER', 5074486.197183099), ('PERSONALIZATION', 5201482.6122448975), ('SHOPPING', 7036877.311557789), ('BOOKS_AND_REFERENCE', 8767811.894736841), ('NEWS_AND_MAGAZINES', 9549178.467741935), ('TOOLS', 10801391.298666667), ('ENTERTAINMENT', 11640705.88235294), ('TRAVEL_AND_LOCAL', 13984077.710144928), ('GAME', 15588015.603248259), ('PRODUCTIVITY', 16787331.344927534), ('PHOTOGRAPHY', 17840110.40229885), ('SOCIAL', 23253652.127118643), ('VIDEO_PLAYERS', 24727872.452830188), ('COMMUNICATION', 38456119.167247385)]
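
The last line of the cell above sorts `(category, average)` pairs with a key function. Here is a minimal sketch of how that key behaves, using a small hypothetical dictionary whose values are made up so that two categories tie:

```python
# Hypothetical averages, chosen so that WEATHER and COMICS tie
averages = {'WEATHER': 5.0, 'TOOLS': 10.0, 'BEAUTY': 0.5, 'COMICS': 5.0}

# Each item is a (category, average) tuple. The key turns it into
# (average, category), so items are compared by average first and
# by category name only to break ties.
ascending = sorted(averages.items(), key=lambda kv: (kv[1], kv[0]))
print(ascending)

# reverse=True lists the largest averages first
descending = sorted(averages.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
print(descending)
```

Sorting in descending order can be handier here, since the categories with the most installs then appear at the start of the list.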

The categories in the Google Play Store with the highest average number of installs are COMMUNICATION, VIDEO_PLAYERS, SOCIAL, PHOTOGRAPHY, and PRODUCTIVITY. As with the Apple Store, these numbers are most likely being pulled up by a few very popular apps. Let's check.

In [29]:

print('VIDEO_PLAYERS')
for app in google_final:
    if (app[1] == 'VIDEO_PLAYERS') and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

print('\n')
print('COMMUNICATION')
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

VIDEO_PLAYERS
YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


COMMUNICATION
WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+

The same pattern of a few extremely popular apps inflating the average number of installs per category can be seen in the Google Play Store. We have YouTube for Video Players, and Google Chrome, WhatsApp, and Messenger for Communication.
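
One way to gauge how much the giants inflate a category is to recompute the average after leaving them out. The sketch below is an illustration only: the rows are hypothetical, and the 100,000,000 cutoff is an arbitrary assumption rather than a value taken from the analysis.

```python
def parse_installs(raw):
    """Convert an install bucket such as '10,000+' to its lower bound."""
    return int(raw.replace('+', '').replace(',', ''))


def average_installs(rows, cap=None):
    """Mean installs, optionally ignoring apps with at least `cap` installs."""
    counts = [parse_installs(installs) for _, installs in rows]
    if cap is not None:
        counts = [c for c in counts if c < cap]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical (app name, installs) rows; not taken from the data set
rows = [
    ('Giant Messenger', '1,000,000,000+'),
    ('Small Chat', '100,000+'),
    ('Tiny Dialer', '10,000+'),
    ('Mid Mail', '1,000,000+'),
]

avg_all = average_installs(rows)
avg_capped = average_installs(rows, cap=100_000_000)

print('All apps       :', avg_all)
print('Without giants :', avg_capped)
```

In this toy example a single giant accounts for almost the entire average, which is the effect suspected for COMMUNICATION and VIDEO_PLAYERS.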

Since we were considering some sort of gamified or interactive reference or book app for the Apple Store, let's check what the BOOKS_AND_REFERENCE and GAME categories in the Google Play Store look like.

In [30]:

print('\n')
print('BOOKS_AND_REFERENCE')
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'
                                            or app[5] == '10,000,000+'):
        print(app[0], ':', app[5])
        
print('\n')
print('GAME')
for app in google_final:
    if app[1] == 'GAME' and (app[5] == '1,000,000,000+'
                             or app[5] == '500,000,000+'
                             or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])


BOOKS_AND_REFERENCE
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Google Play Books : 1,000,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Quran for Android : 10,000,000+
Audiobooks from Audible : 100,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Dictionary : 10,000,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+


GAME
Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Tiles 2™ : 100,000,000+
Pokémon GO : 100,000,000+
Extreme Car Driving Simulator : 100,000,000+
Trivia Crack : 100,000,000+
Angry Birds 2 : 100,000,000+
Candy Crush Saga : 500,000,000+
8 Ball Pool : 100,000,000+
Subway Surfers : 1,000,000,000+
Candy Crush Soda Saga : 100,000,000+
Clash Royale : 100,000,000+
Clash of Clans : 100,000,000+
Plants vs. Zombies FREE : 100,000,000+
Pou : 500,000,000+
Flow Free : 100,000,000+
My Talking Angela : 100,000,000+
slither.io : 100,000,000+
Cooking Fever : 100,000,000+
Yes day : 100,000,000+
Score! Hero : 100,000,000+
Dream League Soccer 2018 : 100,000,000+
My Talking Tom : 500,000,000+
Sniper 3D Gun Shooter: Free Shooting Games - FPS : 100,000,000+
Zombie Tsunami : 100,000,000+
Helix Jump : 100,000,000+
Crossy Road : 100,000,000+
Temple Run 2 : 500,000,000+
Talking Tom Gold Run : 100,000,000+
Agar.io : 100,000,000+
Bus Rush: Subway Edition : 100,000,000+
Traffic Racer : 100,000,000+
Hill Climb Racing : 100,000,000+
Angry Birds Rio : 100,000,000+
Cut the Rope FULL FREE : 100,000,000+
Hungry Shark Evolution : 100,000,000+
Angry Birds Classic : 100,000,000+
Hill Climb Racing 2 : 100,000,000+
Jetpack Joyride : 100,000,000+
Super Mario Run : 100,000,000+
Glow Hockey : 100,000,000+
Asphalt 8: Airborne : 100,000,000+
Lep's World 2 🍀🍀 : 100,000,000+
Fruit Ninja® : 100,000,000+
Vector : 100,000,000+
Dr. Driving : 100,000,000+
Bike Race Free - Top Motorcycle Racing Games : 100,000,000+
Smash Hit : 100,000,000+
Temple Run : 100,000,000+
Geometry Dash Lite : 100,000,000+
Ant Smasher by Best Cool & Fun Games : 100,000,000+
Angry Birds Star Wars : 100,000,000+
Mobile Legends: Bang Bang : 100,000,000+
Banana Kong : 100,000,000+
Skater Boy : 100,000,000+
Shadow Fight 2 : 100,000,000+
Modern Combat 5: eSports FPS : 100,000,000+
Garena Free Fire : 100,000,000+

Dictionaries, e-book readers, and religious text references appear to be the most popular. On the other hand, casual games appear to dominate the Game category.
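
The cells above select popular apps by listing install strings one by one. A possible alternative is to compare against a numeric threshold. The helper below is a sketch, not part of the original analysis: the name `apps_with_min_installs()` and the sample rows are hypothetical, while the default column positions follow `google_header` (App at index 0, Category at index 1, Installs at index 5).

```python
def apps_with_min_installs(dataset, category, min_installs,
                           name_index=0, category_index=1, installs_index=5):
    """Return (name, installs) pairs for apps in `category` with at least `min_installs`."""
    matches = []
    for app in dataset:
        if app[category_index] != category:
            continue
        # Same crude conversion as before: '10,000,000+' becomes 10000000
        installs = int(app[installs_index].replace('+', '').replace(',', ''))
        if installs >= min_installs:
            matches.append((app[name_index], app[installs_index]))
    return matches


# Hypothetical rows that only fill in the columns used by the helper
sample = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '50,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Game C', 'GAME', '', '', '', '100,000,000+'],
]

popular_books = apps_with_min_installs(sample, 'BOOKS_AND_REFERENCE', 10_000_000)
for name, installs in popular_books:
    print(name, ':', installs)
```

With a threshold, a bucket that is not spelled out in a list of strings (such as the hypothetical '50,000,000+' row here) is still matched.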

Conclusion¶

The goal of this project was to identify the type of app we would recommend developing if the business model is to attract users and earn revenue through ads. There appears to be no clear answer, but reference or educational apps that are interactive or 'gamified' seem to hold some promise. We could develop interactive primary-school-level reference materials (nursery rhymes, short stories) and add simple mini-games (word matching, fill in the blanks, image matching) that reward the user with experience points of some sort.

For a more mature demographic, we could create an app that recommends books or movies based on the titles that users say they like. The app could also contain some basic information on those movies and books.