ImIntoData

Profitable App Profiles for the App Store and Google Play markets - Data analysis¶

Introduction¶

In this project I analyze data about app profiles from the Apple App Store and Google Play markets to understand which apps are most profitable.

The goal of the project is to help developers understand what type of apps are likely to attract more users.

Note: This project forms part of Dataquest.io's course 'Data Science in Python - Fundamentals'.

Data Exploration¶

Let's start out by gathering the data.

Instead of collecting data on over 4 million apps ourselves (Source: Statista), we use two freely available data sets that seem suitable for our goal:

a data set on approx. 10,000 Android apps (link), stored in googleplaystore.csv.
a data set on approx. 7,000 apps from the App Store (link), stored in AppleStore.csv.

First we open these two data sets, and save both as lists of lists. Then we separate the header of each list, so the remainder of each list has a homogeneous data structure.

In [2]:

from csv import reader

# open each file with an explicit encoding (the app names contain non-ASCII
# characters) and let the `with` block close the file when we are done
with open("googleplaystore.csv", encoding="utf8") as opened_file_google:
    list_google = list(reader(opened_file_google))
header_google = list_google[0]
list_google = list_google[1:]

with open("AppleStore.csv", encoding="utf8") as opened_file_apple:
    list_apple = list(reader(opened_file_apple))
header_apple = list_apple[0]
list_apple = list_apple[1:]

Then we explore the two data sets, making use of this function that was provided by the course:

In [3]:

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Exploring the Google Play markets data set first, then the Apple Store data set:

In [4]:

print ("GOOGLE")
print('\n')
explore_data (list_google, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (list_apple, 0, 3, True)

GOOGLE


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


APPLE


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

Now that we have taken a look at the first three rows and the total number of rows and columns, let's print out the column names for both data sets as well:

In [5]:

print ("GOOGLE")
print('\n')
print (header_google)
print('\n')
print ("APPLE")
print (header_apple)

GOOGLE


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


APPLE
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Time to do some investigation into what columns are relevant to our analysis. The documentation for both provides help on the information each column contains (See Google, Apple). We want to find out what apps attract more users, so the following columns seem interesting:

GOOGLE

App (application name)
Reviews (number of user reviews)
Category (category the app belongs to)
Installs (number of user downloads/installs)
Type (paid or free)
Price (price of the app)
Genres (genre the app belongs to; an app can belong to multiple genres at once)

APPLE

track_name (application name)
price (price of the app)
rating_count_tot (User rating counts for all versions)
user_rating (ratings for the app)
prime_genre (genre app belongs to)

Data Cleaning¶

The next step is making sure the data is accurate, by checking for:

inaccurate data (correct or remove errors)
duplicate data (remove duplicates)

Let's start with checking for inaccurate data.

Inaccurate data¶

The discussions section on the Apple data set does not report any errors in it.

The discussions section on the Google data set mentions row 10472 has no rating. Let's check this by printing the row.

In [6]:

print (list_google[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

This row indeed has only 12 columns instead of 13 (like the header row). But it is not the Rating column that is missing. The missing value seems to be in the second column, Category, which caused the rest of the columns to shift one position to the left.

Instead of imputing the missing value, we decide to remove the row entirely. This statement can only be run once, since we are using the index number to remove the row! And then we print the row again to check if it was removed.

In [7]:

del list_google[10472]
print (list_google [10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
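Because the deletion above relies on a fixed index, running it twice would remove a valid row. As a minimal sketch of an alternative, assuming a malformed row can be recognized by its length alone, the hypothetical helper `drop_malformed_rows()` below filters on the number of columns instead. The rows are toy data, not the real data set.

```python
# Toy rows standing in for list_google; the short row mimics the entry
# that is missing its 'Category' value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # one column short
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

def drop_malformed_rows(dataset, n_columns):
    """Return only the rows that have exactly n_columns fields."""
    return [row for row in dataset if len(row) == n_columns]

clean_rows = drop_malformed_rows(rows, len(header))

# Running the filter a second time changes nothing, unlike `del` by index.
rerun_rows = drop_malformed_rows(clean_rows, len(header))
print(len(rows), '->', len(clean_rows), '->', len(rerun_rows))
```

A length check only catches rows with a missing or extra field; it would not notice a value that sits in the wrong column without changing the row length.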

With the check for inaccurate data done and the error removed, let's move on to the next part of cleaning data: dealing with duplicates.

Duplicate data¶

The discussions section on the Google data set also talks of duplicate entries. Let's check this and print any duplicate entries we find.

In [8]:

# create two empty lists
duplicate_apps = []
unique_apps = []

# loop through the data set, retrieve the name from the first column, and check if the name is in the unique_apps list.
# if it is not, append the name to this list, and else append the name to the duplicates list.
for app in list_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
# Print the number of duplicate apps and the names of the first 10 we found.
print("Number of duplicate apps: " + str(len(duplicate_apps)))
print("\n")
print("First 10 names of duplicate apps: " + str(duplicate_apps[:10]))

Number of duplicate apps: 1181


First 10 names of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']

This confirms there are 1181 duplicates in the data set.
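The same count can also be cross-checked with `collections.Counter` from the standard library. This is only a sketch on a handful of toy names, not the real data set: every name that occurs k times contributes k - 1 duplicate rows.

```python
from collections import Counter

# Toy app names standing in for the first column of list_google.
names = ['Slack', 'Box', 'Slack', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(names)

# A name that occurs k times accounts for k - 1 duplicate rows.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicate rows:', n_duplicates)
print('Names with duplicates:', duplicated_names)
```

A `Counter` looks names up by hash, so it avoids the repeated list scans of `name in unique_apps`, which matters little for 10,000 rows but more for larger data sets.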

Instead of randomly removing duplicates, it's better to make an informed decision on which row to keep. To decide on a criterion for which entry to keep, let's inspect one of the duplicate entries, e.g. Slack.

In [9]:

# Loop through the data set and print the row if the name is 'Slack'.
for app in list_google:
    name = app[0]
    if name == 'Slack':
        print (app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']

The only difference between the duplicate rows seems to be in the number of reviews. A quick inspection of some of the other rows confirms this.

Going back to the purpose of the analysis, we are interested in understanding what type of apps are likely to attract more users. This makes a case for keeping the entries with the highest number of reviews. Also, the more reviews, the more recent the data will be. Let's keep the entries with the higher number of reviews and remove the others.

To do this we first create a dictionary that records, for each app name, the highest number of reviews found (key = app name, value = highest number of reviews).

In [10]:

# creating an empty dictionary and looping through the data set.
# if the name does not exist yet in the dictionary or 
# if the number of reviews is higher than the existing key-value entry,
# then the name and number of reviews is added to the dictionary.
reviews_max = {}
for app in list_google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# printing the first 3 entries of the dictionary to see if things went well
print(list(reviews_max.items())[:3])
print("\n")
# printing the length of the dictionary
# and the expected length (data set minus duplicates) to confirm
print ("Length of the dictionary is: ", len(reviews_max))
print ("Expected length is: ", (len(list_google)-1181))

[('I am Rich Plus', 856.0), ('Draw A Stickman', 29265.0), ('Q Remote Control', 4264.0)]


Length of the dictionary is:  9659
Expected length is:  9659

Now that we have a dictionary with the highest number of reviews per app name, let's use it to remove the unwanted duplicate entries.

In [11]:

# creating two empty lists
google_clean = []
already_added = []

# looping through the Google data set, for each iteration;
# if the number of reviews for that app is the same as in the dictionary and
# the name is not yet in the list `already_added`
# we append the entire row to the list 'google_clean' and
# we append the name of the app to the list `already_added`
# (to account for duplicates that share the same highest number of reviews).

for app in list_google:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and name not in already_added:
        google_clean.append(app)
        already_added.append(name)

Time to check if things went well, re-using the explore_data() function that we used at the start of the project. Remember, the expected length is 9659 rows!

In [12]:

explore_data (google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
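For comparison, here is a sketch of a single-pass variant of the same idea, which stores the best row per name directly instead of first recording the maximum and then filtering. The Slack rows are shortened copies of the ones printed earlier; 'Example App' is a made-up entry.

```python
# Toy rows: name, category, rating, reviews.
rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '10'],  # hypothetical, no duplicates
]

best_row = {}  # app name -> row with the highest review count seen so far
for row in rows:
    name = row[0]
    n_reviews = float(row[3])
    # The strict '>' keeps the first row when review counts tie.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
for row in deduplicated:
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each name, whereas the two-pass version keeps each surviving row at its own position in the data set.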

The discussions section on the Apple data does not talk of duplicate entries. To keep the names of our two data sets in sync and indicate that the Apple data set is clean as well, let's change the name of the Apple data set, so we are left with these two data set names:

google_clean
apple_clean

In [13]:

apple_clean = list_apple

# check our name change
explore_data (apple_clean, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

Data Preparation¶

Now that the data is clean we need to make sure that the data we use for our analysis fits our purpose. The company we work for only makes apps that are in English and free to download. This means we are not interested in non-English or non-free apps, and these should not be included in the data set we use in our analysis.

So two more steps before we start analyzing:

removing non-English apps
removing non-free apps

Let's finish up the preparation of our data sets.

Detecting and removing non-English apps¶

Without a language column there is no direct way to filter for English apps. But we can remove each app whose name contains characters that are not commonly used in English, by detecting characters that fall outside of the range 0 to 127 of the ASCII system. Note that this might leave in apps in other languages that use only these characters, but we accept that risk.

Let's start with a function to detect if an app name contains a character with a code point greater than 127, by making use of the built-in function ord().

In [14]:

def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

# checking if the function works
print ("Instagram: " + str(is_english("Instagram")))
print ("爱奇艺PPS -《欢乐颂2》电视剧热播: " + str(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print ("Docs To Go™ Free Office Suite: " + str(is_english("Docs To Go™ Free Office Suite")))
print ("Instachat 😜: " + str(is_english("Instachat 😜")))

Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: False
Docs To Go™ Free Office Suite: False
Instachat 😜: False

The function does not entirely work as expected, since some English app names use one or more special characters like an emoji or the trademark symbol. Let's just label a name as non-English if it has more than 3 characters outside of our defined range.

In [15]:

def is_english(string):
    number = 0
    for char in string:
        if ord(char) > 127:
            number +=1
    if number > 3:
        return False
    else:
        return True

# checking our new function
print ("Instagram: " + str(is_english("Instagram")))
print ("爱奇艺PPS -《欢乐颂2》电视剧热播: " + str(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播")))
print ("Docs To Go™ Free Office Suite: " + str(is_english("Docs To Go™ Free Office Suite")))
print ("Instachat 😜: " + str(is_english("Instachat 😜")))

Instagram: True
爱奇艺PPS -《欢乐颂2》电视剧热播: False
Docs To Go™ Free Office Suite: True
Instachat 😜: True

That's better.
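To make the cut-off easier to inspect, the check can be split into a counting step and a comparison. The sketch below uses hypothetical names (`count_non_ascii()`, `is_probably_english()`, `max_non_ascii`) that are not part of the course material; the default of 3 mirrors the rule used above.

```python
def count_non_ascii(string):
    """Count the characters whose code point lies outside the range 0-127."""
    return sum(1 for char in string if ord(char) > 127)

def is_probably_english(string, max_non_ascii=3):
    """Accept a name with at most max_non_ascii characters outside ASCII."""
    return count_non_ascii(string) <= max_non_ascii

samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(name, ':', count_non_ascii(name), is_probably_english(name))
```

Printing the count next to the verdict shows how close each name is to the threshold, which helps when deciding whether 3 is a sensible limit.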

Now let's use this function on both our data sets to filter out non-English apps.

In [16]:

# Creating a new empty list for each data set
# Looping through both data sets
# using the `is_english()` function to identify English apps
# appending these apps to a new list
google_clean_english = []
apple_clean_english = []

for app in google_clean:
    name = app[0]
    if is_english(name):
        google_clean_english.append(app)

for app in apple_clean:
    # name is in second column in this data set!
    name = app[1]
    if is_english(name):
        apple_clean_english.append(app)

# exploring the new data sets
print ("GOOGLE")
print('\n')
explore_data (google_clean_english, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (apple_clean_english, 0, 3, True)
        

GOOGLE


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


APPLE


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16

Looks good! We now have 9614 rows in the Google data set, and 6183 rows remaining in the Apple data set.

Time to move on to the other step in our data preparation: taking care of non-free apps.

Detecting and removing non-free apps¶

From the exploration of the data set above we know that Price is stored in the following columns in our data sets:

Google 8th column (index 7)
Apple 5th column (index 4)
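Rather than counting columns by hand, the positions can be looked up from the header rows with `list.index()`. The sketch below copies the two header rows from the output printed earlier, so it runs on its own.

```python
# Header rows copied from the output printed earlier in the notebook.
header_google = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
header_apple = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# list.index() returns the position of the first matching element.
google_price_index = header_google.index('Price')
apple_price_index = header_apple.index('price')

print('Google price index:', google_price_index)
print('Apple price index:', apple_price_index)
```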

To isolate the free apps in separate lists we will loop through each data set, and save the entries where the price is 0 to a new list.

Let's get started.

In [17]:

# creating new empty lists for both data sets
google_clean_english_free = []
apple_clean_english_free = []

# looping through each data set and saving free apps to the new list
for app in google_clean_english:
    price = app[7]
    # from inspection of the data set we know price is stored as a string
    # and when the app is free the price is '0'
    if price == '0':
        google_clean_english_free.append(app)

for app in apple_clean_english:
    price = app[4]
    # from inspection of the data set we know price is stored as a string
    # and when the app is free the price is '0.0'
    if price == '0.0':
        apple_clean_english_free.append(app)
        
# exploring the new data sets to see if things went well
print ("GOOGLE")
print('\n')
explore_data (google_clean_english_free, 0, 3, True)
print('\n')
print ("APPLE")
print('\n')
explore_data (apple_clean_english_free, 0, 3, True)
        

GOOGLE


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


APPLE


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16
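The string comparisons above depend on free apps being written exactly as '0' and '0.0'. A slightly more defensive sketch converts the price to a number first. The helper names are hypothetical, and the '$4.99' and '0.00' formats are assumptions for illustration; they do not appear in the output shown in this notebook.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    The leading '$' is an assumed format for paid apps.
    """
    return float(price.replace('$', ''))

def is_free(price):
    """Return True when the parsed price equals zero."""
    return parse_price(price) == 0.0

for sample in ['0', '0.0', '0.00', '$4.99']:
    print(repr(sample), is_free(sample))
```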

That's it! We are now left with clean, error-free data sets with only free apps in English.

We can finally start with the analysis, and dive into what apps attract more users.

Data Analysis¶

Our business intends to release new apps on both Google Play and the App Store, so we are interested in finding app profiles that are successful on both markets.

A useful feature is the genre an app belongs to, so let's see which genres are the most common. The earlier data exploration revealed that the following columns store this information:

Google: Category and Genres
Apple: prime_genre

Let's build a function to show the most common genres in both data sets.

Most common genres¶

To detect the most common genre we will create a frequency table in a dictionary, taking a data set and the index of the desired column as input.

In [18]:

def freq_table(dataset, index):
    # Create an empty dictionary, then loop through the data set.
    # For each row, take the value at `index` and check whether it
    # already exists as a key in the dictionary.
    # If it exists, increment the dictionary value at that key by 1.
    # If it doesn't exist, create a new key-value pair in the dictionary,
    # where the key is the value found in the row and the dictionary value is 1.
    freq_dict = {}
    for app in dataset:
        element = app[index]
        if element in freq_dict:
            freq_dict[element] += 1
        else:
            freq_dict[element] = 1
    return freq_dict

To show the contents we need a second function that displays the entries of the frequency table in descending order. The following function was provided by the course:

In [19]:

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
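The two functions above can also be approximated with `collections.Counter`, whose `most_common()` method returns the entries sorted by count. The sketch below is not the course's function; `display_table_counter()` is a hypothetical name, and it additionally prints each value's share of the total.

```python
from collections import Counter

def display_table_counter(dataset, index):
    """Print each value in column `index` with its count and share."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs, highest count first.
    for value, count in counts.most_common():
        print(f'{value} : {count} ({count / total:.1%})')
    return counts

# Toy rows: app name and category.
toy_rows = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'],
            ['d', 'FAMILY'], ['e', 'GAME']]
counts = display_table_counter(toy_rows, 1)
```

When two values have the same count, `most_common()` lists them in the order they were first seen, while sorting `(count, key)` tuples in reverse orders ties by key, so the two displays can differ for tied entries.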

Time to use these two functions!

Let's display the frequency tables of the columns Category (index 1) and Genres (index 9) of the Google data set. And then let's display the frequency table of the column prime_genre (index 11) of the Apple data set.

Google / Category

In [20]:

display_table (google_clean_english_free, 1)

FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53

Google / Genre

In [21]:

display_table (google_clean_english_free, 9)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Adventure : 11
Action;Action & Adventure : 9
Educational;Pretend Play : 8
Simulation;Action & Adventure : 7
Parenting;Education : 7
Entertainment;Brain Games : 7
Board;Brain Games : 7
Parenting;Music & Video : 6
Educational;Brain Games : 6
Casual;Creativity : 6
Art & Design;Creativity : 6
Education;Pretend Play : 5
Role Playing;Pretend Play : 4
Education;Creativity : 4
Role Playing;Action & Adventure : 3
Puzzle;Action & Adventure : 3
Entertainment;Creativity : 3
Entertainment;Action & Adventure : 3
Educational;Creativity : 3
Educational;Action & Adventure : 3
Education;Music & Video : 3
Education;Brain Games : 3
Education;Action & Adventure : 3
Adventure;Action & Adventure : 3
Video Players & Editors;Music & Video : 2
Sports;Action & Adventure : 2
Simulation;Pretend Play : 2
Puzzle;Creativity : 2
Music;Music & Video : 2
Entertainment;Pretend Play : 2
Casual;Education : 2
Board;Action & Adventure : 2
Video Players & Editors;Creativity : 1
Trivia;Education : 1
Travel & Local;Action & Adventure : 1
Tools;Education : 1
Strategy;Education : 1
Strategy;Creativity : 1
Strategy;Action & Adventure : 1
Simulation;Education : 1
Role Playing;Brain Games : 1
Racing;Pretend Play : 1
Puzzle;Education : 1
Parenting;Brain Games : 1
Music & Audio;Music & Video : 1
Lifestyle;Pretend Play : 1
Lifestyle;Education : 1
Health & Fitness;Education : 1
Health & Fitness;Action & Adventure : 1
Entertainment;Education : 1
Communication;Creativity : 1
Comics;Creativity : 1
Casual;Music & Video : 1
Card;Action & Adventure : 1
Books & Reference;Education : 1
Art & Design;Pretend Play : 1
Art & Design;Action & Adventure : 1
Arcade;Pretend Play : 1
Adventure;Education : 1

Apple/prime_genre

In [22]:

display_table (apple_clean_english_free, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4

So the most common genres are:

Google/Category:

FAMILY : 1676
GAME : 862
TOOLS : 750

Google/Genre:

Tools : 749
Entertainment : 538
Education : 474

Apple/prime_genre:

Games : 1874
Entertainment : 254
Photo & Video : 160

It seems the most common genres for free apps in English on Google Play are of a more practical nature, while the most common genre for these types of apps on the App Store is Games.

While this is good to know, a genre that contains many apps does not necessarily have the most users.

Let's find out what genres are the most popular. For the Google data set we will calculate this using the Installs column. For the Apple data set this column is missing, so we will make do with the total number of user ratings instead, which can be found in the rating_count_tot column.

Starting with the App Store data set, we will

count the number of apps per genre
sum the number of user ratings of the apps in each genre
and divide the sum by the number of apps

so we get the average number of user ratings for each genre.

In [23]:

# creating a frequency table for the `prime_genre` column
# using the previously created freq_table() function
freq_prime_genre = freq_table(apple_clean_english_free, 11)

# looping over the genres
# per genre, sum the number of ratings and count the apps in that genre
# calculate the average per genre and print the result

for genre in freq_prime_genre:
    total = 0
    len_genre = 0
    for app in apple_clean_english_free:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    average = total / len_genre
    print (genre + ": " + str(average))

Sports: 23008.898550724636
Education: 7003.983050847458
Book: 39758.5
Weather: 52279.892857142855
Travel: 28243.8
Health & Fitness: 23298.015384615384
Catalogs: 4004.0
Utilities: 18684.456790123455
Food & Drink: 33333.92307692308
Reference: 74942.11111111111
Medical: 612.0
News: 21248.023255813954
Social Networking: 71548.34905660378
Shopping: 26919.690476190477
Business: 7491.117647058823
Entertainment: 14029.830708661417
Photo & Video: 28441.54375
Navigation: 86090.33333333333
Music: 57326.530303030304
Lifestyle: 16485.764705882353
Games: 22788.6696905016
Productivity: 21028.410714285714
Finance: 31467.944444444445
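The loop above scans the whole data set once per genre. As a sketch of an alternative, the totals and counts can be accumulated in two dictionaries during a single pass, and the result sorted before printing. `average_by_group()` is a hypothetical helper, and the rows are toy data with the columns in different positions than in the real data set.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: name, rating count, genre.
toy_rows = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
averages = average_by_group(toy_rows, group_index=2, value_index=1)

# Sort by average, highest first, so the ranking is easy to read.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for genre, average in ranked:
    print(genre, ':', average)
```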

Genres with a high average number of user ratings seem to be Social Networking, Navigation and Music, but these averages are likely skewed by a few very popular apps like Facebook, Google Maps and Spotify.

Another popular genre that jumps out is Reference. Let's see which apps this genre contains to get a better understanding.

In [24]:

for app in apple_clean_english_free:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0

Many of the apps in the genre seem to have taken a popular book and made it available on a mobile device, possibly with some extra features.
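The listing also illustrates how much a single app can pull up an average. As a quick sketch, comparing the mean with the median of the Reference rating counts (copied from the output above) shows the gap; the `statistics` module is part of the standard library.

```python
from statistics import mean, median

# rating_count_tot values for the Reference genre, copied from the listing above.
reference_ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122,
                     8535, 4693, 1497, 826, 762, 718, 14, 8, 0, 0]

reference_mean = mean(reference_ratings)
reference_median = median(reference_ratings)

print('Mean:  ', reference_mean)
print('Median:', reference_median)
```

The mean matches the 74942.11 reported for Reference earlier, while the median is 6614.0, which suggests that the typical Reference app has far fewer ratings than the genre average implies.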

Let's move on to the Google data, to see which app genres attract the most users in Google Play.

While the data in the Installs column are not very precise (100,000+, 1,000,000+ etc.), they serve our purpose. We will have to remove the commas and plus characters, and convert the data from string to float so we can perform calculations on them.

In [30]:

# creating a frequency table for the `Category` column (index 1)
# using the previously created freq_table() function
freq_category = freq_table(google_clean_english_free, 1)

# looping over the categories
# per category, sum the number of installs and count the apps in that category
# calculate the average installs per category and print the result

for category in freq_category:
    total = 0
    len_category = 0
    for app in google_clean_english_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            # using the `str.replace(old, new)` method to remove unwanted characters
            # by replacing them with an empty string,
            # then converting the result to a float
            clean_installs = installs.replace('+', '')
            clean_installs = clean_installs.replace (',', '')
            clean_installs = float(clean_installs)
            total += clean_installs
            len_category += 1
    average = total / len_category
    print (category + ": " + str(average))

PERSONALIZATION: 5201482.6122448975
MAPS_AND_NAVIGATION: 4056941.7741935486
PARENTING: 542603.6206896552
VIDEO_PLAYERS: 24727872.452830188
HEALTH_AND_FITNESS: 4188821.9853479853
BUSINESS: 1712290.1474201474
FOOD_AND_DRINK: 1924897.7363636363
COMMUNICATION: 38456119.167247385
TRAVEL_AND_LOCAL: 13984077.710144928
SOCIAL: 23253652.127118643
FINANCE: 1387692.475609756
ART_AND_DESIGN: 1986335.0877192982
LIBRARIES_AND_DEMO: 638503.734939759
SHOPPING: 7036877.311557789
WEATHER: 5074486.197183099
ENTERTAINMENT: 11640705.88235294
MEDICAL: 120550.61980830671
LIFESTYLE: 1437816.2687861272
GAME: 15588015.603248259
FAMILY: 3695641.8198090694
PHOTOGRAPHY: 17840110.40229885
TOOLS: 10801391.298666667
EDUCATION: 1833495.145631068
SPORTS: 3638640.1428571427
BOOKS_AND_REFERENCE: 8767811.894736841
NEWS_AND_MAGAZINES: 9549178.467741935
EVENTS: 253542.22222222222
HOUSE_AND_HOME: 1331540.5616438356
DATING: 854028.8303030303
COMICS: 817657.2727272727
BEAUTY: 513151.88679245283
PRODUCTIVITY: 16787331.344927534
AUTO_AND_VEHICLES: 647317.8170731707

Genres with a high average number of installs seem to be Communication and Video Players, but again, these averages are likely skewed by a few very popular apps like WhatsApp and YouTube. Another popular genre that jumps out is Books and Reference, which coincides with our finding from the Apple data set. Let's see which apps this genre contains to get a better understanding.

In [31]:

for app in google_clean_english_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 100,000+
Pdf Book Download - Read Pdf Book : 100,000+
Free Book Reader : 100,000+
eBoox new: Reader for fb2 epub zip books : 50,000+
Only 30 days in English, the guideline is guaranteed : 500,000+
Moon+ Reader : 10,000,000+
SH-02J Owner's Manual (Android 8.0) : 50,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Azpen eReader : 500,000+
URBANO V 02 instruction manual : 100,000+
Bible : 100,000,000+
C Programs and Reference : 50,000+
C Offline Tutorial : 1,000+
C Programs Handbook : 50,000+
Amazon Kindle : 100,000,000+
Aab e Hayat Full Novel : 100,000+
Aldiko Book Reader : 10,000,000+
Google I/O 2018 : 500,000+
R Language Reference Guide : 10,000+
Learn R Programming Full : 5,000+
R Programing Offline Tutorial : 1,000+
Guide for R Programming : 5+
Learn R Programming : 10+
R Quick Reference Big Data : 1,000+
V Made : 100,000+
Wattpad 📖 Free Books : 100,000,000+
Dictionary - WordWeb : 5,000,000+
Guide (for X-MEN) : 100,000+
AC Air condition Troubleshoot,Repair,Maintenance : 5,000+
AE Bulletins : 1,000+
Ae Allah na Dai (Rasa) : 10,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Ag PhD Field Guide : 10,000+
Ag PhD Deficiencies : 10,000+
Ag PhD Planting Population Calculator : 1,000+
Ag PhD Soybean Diseases : 1,000+
Fertilizer Removal By Crop : 50,000+
A-J Media Vault : 50+
Al-Quran (Free) : 10,000,000+
Al Quran (Tafsir & by Word) : 500,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al-Muhaffiz : 50,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Al-Quran 30 Juz free copies : 500,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
Hafizi Quran 15 lines per page : 1,000,000+
Quran for Android : 10,000,000+
Surah Al-Waqiah : 100,000+
Hisnul Al Muslim - Hisn Invocations & Adhkaar : 100,000+
Satellite AR : 1,000,000+
Audiobooks from Audible : 100,000,000+
Kinot & Eichah for Tisha B'Av : 10,000+
AW Tozer Devotionals - Daily : 5,000+
Tozer Devotional -Series 1 : 1,000+
The Pursuit of God : 1,000+
AY Sing : 5,000+
Ay Hasnain k Nana Milad Naat : 10,000+
Ay Mohabbat Teri Khatir Novel : 10,000+
Arizona Statutes, ARS (AZ Law) : 1,000+
Oxford A-Z of English Usage : 1,000,000+
BD Fishpedia : 1,000+
BD All Sim Offer : 10,000+
Youboox - Livres, BD et magazines : 500,000+
B&H Kids AR : 10,000+
B y H Niños ES : 5,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible KJV : 5,000,000+
Borneo Bible, BM Bible : 10,000+
MOD Black for BM : 100+
BM Box : 1,000+
Anime Mod for BM : 100+
NOOK: Read eBooks & Magazines : 10,000,000+
NOOK Audiobooks : 500,000+
NOOK App for NOOK Devices : 500,000+
Browsery by Barnes & Noble : 5,000+
bp e-store : 1,000+
Brilliant Quotes: Life, Love, Family & Motivation : 1,000,000+
BR Ambedkar Biography & Quotes : 10,000+
BU Alsace : 100+
Catholic La Bu Zo Kam : 500+
Khrifa Hla Bu (Solfa) : 10+
Kristian Hla Bu : 10,000+
SA HLA BU : 1,000+
Learn SAP BW : 500+
Learn SAP BW on HANA : 500+
CA Laws 2018 (California Laws and Codes) : 5,000+
Bootable Methods(USB-CD-DVD) : 10,000+
cloudLibrary : 100,000+
SDA Collegiate Quarterly : 500+
Sabbath School : 100,000+
Cypress College Library : 100+
Stats Royale for Clash Royale : 1,000,000+
GATE 21 years CS Papers(2011-2018 Solved) : 50+
Learn CT Scan Of Head : 5,000+
Easy Cv maker 2018 : 10,000+
How to Write CV : 100,000+
CW Nuclear : 1,000+
CY Spray nozzle : 10+
BibleRead En Cy Zh Yue : 5+
CZ-Help : 5+
Modlitební knížka CZ : 500+
Guide for DB Xenoverse : 10,000+
Guide for DB Xenoverse 2 : 10,000+
Guide for IMS DB : 10+
DC HSEMA : 5,000+
DC Public Library : 1,000+
Painting Lulu DC Super Friends : 1,000+
Dictionary : 10,000,000+
Fix Error Google Playstore : 1,000+
D. H. Lawrence Poems FREE : 1,000+
Bilingual Dictionary Audio App : 5,000+
DM Screen : 10,000+
wikiHow: how to do anything : 1,000,000+
Dr. Doug's Tips : 1,000+
Bible du Semeur-BDS (French) : 50,000+
La citadelle du musulman : 50,000+
DV 2019 Entry Guide : 10,000+
DV 2019 - EDV Photo & Form : 50,000+
DV 2018 Winners Guide : 1,000+
EB Annual Meetings : 1,000+
EC - AP & Telangana : 5,000+
TN Patta Citta & EC : 10,000+
AP Stamps and Registration : 10,000+
CompactiMa EC pH Calibration : 100+
EGW Writings 2 : 100,000+
EGW Writings : 1,000,000+
Bible with EGW Comments : 100,000+
My Little Pony AR Guide : 1,000,000+
SDA Sabbath School Quarterly : 500,000+
Duaa Ek Ibaadat : 5,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+
English to Hindi Dictionary : 5,000,000+
EP Research Service : 1,000+
Hymnes et Louanges : 100,000+
EU Charter : 1,000+
EU Data Protection : 1,000+
EU IP Codes : 100+
EW PDF : 5+
BakaReader EX : 100,000+
EZ Quran : 50,000+
FA Part 1 & 2 Past Papers Solved Free – Offline : 5,000+
La Fe de Jesus : 1,000+
La Fe de Jesús : 500+
Le Fe de Jesus : 500+
Florida - Pocket Brainbook : 1,000+
Florida Statutes (FL Code) : 1,000+
English To Shona Dictionary : 10,000+
Greek Bible FP (Audio) : 1,000+
Golden Dictionary (FR-AR) : 500,000+
Fanfic-FR : 5,000+
Bulgarian French Dictionary Fr : 10,000+
Chemin (fr) : 1,000+
The SCP Foundation DB fr nn5n : 1,000+

While a few very popular apps skew the outcome (Google Play Books, Bible, Amazon Kindle), the picture is much the same as with the Apple data set: many of the apps in this category seem to have taken a popular book and made it available on a mobile device, possibly with some extra features.
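
To get a feel for how strongly a few giants can pull the average up, here is a minimal sketch that compares the mean with the median and with a mean that leaves out the largest apps. It uses a small sample copied from the listing above rather than the full category, and the cutoff of 100 million installs is an arbitrary assumption chosen for illustration.

```python
from statistics import mean, median

# A handful of (app name, installs) pairs copied from the listing above.
# This is an illustrative subset, not the full BOOKS_AND_REFERENCE category.
sample = [
    ('Google Play Books', '1,000,000,000+'),
    ('Amazon Kindle', '100,000,000+'),
    ('Bible', '100,000,000+'),
    ('Wikipedia', '10,000,000+'),
    ('Book store', '1,000,000+'),
    ('Flybook', '500,000+'),
    ('C Programs Handbook', '50,000+'),
    ('AE Bulletins', '1,000+'),
]


def to_number(installs):
    """Strip '+' and ',' from an install string and convert it to a float."""
    return float(installs.replace('+', '').replace(',', ''))


installs = [to_number(value) for _, value in sample]

# Hypothetical cutoff: treat apps with 100 million installs or more as giants.
THRESHOLD = 100_000_000
without_giants = [n for n in installs if n < THRESHOLD]

print('Mean of sample:', mean(installs))
print('Median of sample:', median(installs))
print('Mean without giants:', mean(without_giants))
```

In this sample the median and the mean without giants are far below the plain mean, which is the kind of gap we would expect when a handful of apps dominate a category.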

Time to wrap up our analysis.

Recommendation¶

In this project we analyzed data on app profiles from both the App Store and Google Play to understand which apps are most profitable. The goal of the project was to help developers understand what types of apps are likely to attract more users.

Our analysis shows that, within the segment of free English-language apps, the Books & Reference genre shows potential in both markets. Our suggestion is to take a popular book and turn it into an app, adding functionality for interaction, learning, quick reference or personal annotations.
