
Popular App Profiles on Google Play and App Store¶

Index¶

1 Introduction
2 Reading the Data
3 Data Cleaning
4 Identifying and Removing Non-English Apps
5 Identifying and Removing Paid Apps
6 Defining a Strategy for App Development
7 Assessing User Rating by Genre for Both Datasets
8 The Result
9 Conclusion

1¶

Introduction¶

With the explosive growth of the smartphone over the last decade, there has been a significant surge in app development. If you've thought of it, there is most likely an app for it.

Competition has forced developers to deliver apps that are not only highly functional but also provide basic features for free. Providing free features is of significant importance unless the app targets a niche market (of which almost none exist) or is tailored for use on specific devices. This forces companies to create apps whose main source of revenue is ads.

The goal of this project is to analyze data from Google's Play Store and Apple's App Store and identify app profiles that:

- Engage the user
- Are free
- Possibly attract a following large enough to sustain a model that depends on in-app ads

The expected end result is to identify the app classes that meet the above goals and to consider them for future app development.


2¶

Reading the Data¶

Reading in the data from the data files

The focus for this project will be apps on Google's Play Store and Apple's App Store. As of 2018 the Google Play Store had more than 2.1 million apps while the App Store had about 2 million apps. Because of the volume of data, and considering that this is a learning project, we will work with samples: roughly 7,000 apps from the App Store and roughly 10,000 apps from the Play Store. The links to the original data sets can be found below.

In [21]:

#Read the data from the sample datasets
from csv import reader

#App store data (the with-block closes the file once it has been read)
with open('AppleStore.csv', encoding="utf-8") as opened_file:
    apple_list = list(reader(opened_file))

#Playstore data
with open('googleplaystore.csv', encoding="utf-8") as opened_file:
    google_list = list(reader(opened_file))

In [22]:

def explore_data(dataset, start, end, rows_and_columns=False):
    """
    Helps with quick analysis of Play Store and App Store data
    by displaying the data slice specified by the user along with the column names.

    The header (dataset[0]) is always printed first. The first row of the slice
    is skipped, so a slice that starts at 0 does not print the header twice.

    Args:
        dataset (list): Data the user wants to analyse, with the header as its first row
        start (int): Start row of the data slice (this row itself is not printed)
        end (int): End row of the data slice (exclusive)
        rows_and_columns (boolean): If True, prints the number of rows and columns of the whole dataset
    """
    dataset_slice = dataset[start:end]
    print(dataset[0])
    print('\n')
    # Skip the first row of the slice by position; list.index() would also
    # skip any later row that happens to equal the first one.
    for row in dataset_slice[1:]:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [23]:

explore_data(apple_list, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16

In [24]:

explore_data(google_list, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13

Based on the above listing, many columns from both stores could help in our analysis. A more detailed description is provided in the Column tab of the pages linked at the beginning of this section.


3¶

Data Cleaning¶

Filling in missing data and removing duplicates

Before proceeding to data analysis, we have to ensure that the data is relevant and accurate. On the point of relevance, note that the analysis focuses on apps that target an English-speaking audience and that are free.

Both the data sets have a discussion section.

The discussions here should give some idea of issues found by others in specific areas.

One of the discussions in the Playstore data mentions that the Rating value of one app is missing, which causes the other columns to shift to the left. Further analysis is required to determine whether the problem exists as described.

In [25]:

explore_data(google_list, 10472, 10474)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

As mentioned, a value is missing. However, a closer look reveals that the missing value is the category to which the app belongs: the second field holds the rating (1.9) instead of a category name. A quick check on Google reveals that the category is Lifestyle. The value can be inserted.

In [26]:

google_list[10473] = ['Life Made WI-Fi Touchscreen Photo Frame', 'Lifestyle', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
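The repair above can also be written as two small helpers. This is a minimal sketch on toy data, not the notebook's own method: `find_short_rows` and `insert_missing_value` are hypothetical names, and the rows are illustrative. Note that the Play Store listing shown earlier spells categories in upper case (for example ART_AND_DESIGN), so the sketch inserts 'LIFESTYLE'; a value spelled that way would be grouped with the other apps of the same category.

```python
# Toy data standing in for google_list; the values are illustrative only.
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],  # Category is missing, so Rating has shifted left
]

def find_short_rows(header, rows):
    """Return the positions of rows that have fewer columns than the header."""
    return [i for i, row in enumerate(rows) if len(row) < len(header)]

def insert_missing_value(row, position, value):
    """Return a copy of row with value inserted at the given position."""
    return row[:position] + [value] + row[position:]

short = find_short_rows(header, rows)
for i in short:
    # Upper case matches the spelling of the other Category values.
    rows[i] = insert_missing_value(rows[i], 1, 'LIFESTYLE')

print(short)
print(rows[1])
```

Building a new list instead of editing the row in place keeps the original row available for comparison while the fix is being checked.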

In order to ensure that the App store data does not have any such issues we could verify that information for each app has the same set of columns as the header.

In [27]:

#Verify the App store data for missing information
row_with_missing_column = []
for row_number, each_app in enumerate(apple_list[1:], start=1):
    if len(each_app) != len(apple_list[0]):
        row_with_missing_column.append(row_number)

if len(row_with_missing_column) == 0:
    print("No rows with missing columns")
else:
    print(row_with_missing_column)

No rows with missing columns

That makes it clear that every row in the App Store data has all the required columns. Further reading of the discussions on the Google Play Store dataset reveals that there are multiple instances of duplicate data. Before taking any action, it is essential to verify how much this issue affects the dataset.

In [28]:

def print_list(a_list):
    """
    Prints each row in a list.
    
    Args:
        a_list (list): The list to be printed
    """
    for row in a_list:
        print(row)

In [29]:

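def bold_print(a_string,a_value=None):
    """
    Prints a string in bold (using ANSI escape codes), followed by a blank line
    
    Args:
        a_string (string): String to be printed in bold
        a_value: Unused; kept so that existing calls keep working
    """
    print("\033[1m"+a_string+"\033[0m"+'\n')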

In [30]:

unique_apps = []
duplicate_apps = []
for an_app in google_list[1:]:
    app_name = an_app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("Number of instances of duplicate apps:",len(duplicate_apps))

Number of instances of duplicate apps: 1181
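The same count can be obtained without repeated list lookups. The sketch below assumes only the standard library's `collections.Counter`; the app names are toy values standing in for the first column of `google_list`.

```python
from collections import Counter

# Toy app names standing in for the first column of google_list.
names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']

counts = Counter(names)
# Every occurrence beyond the first one counts as a duplicate instance.
duplicate_instances = sum(n - 1 for n in counts.values())
duplicated_names = sorted(name for name, n in counts.items() if n > 1)

print(duplicate_instances)
print(duplicated_names)
```

Counting with a dictionary-like structure avoids scanning a growing list for every row, which matters more as the dataset gets larger.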

In [31]:

print("\033[1m"+"Some of the apps with multiple entries:"+"\033[0m")
print_list(duplicate_apps[:20])

Some of the apps with multiple entries:
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack
FreshBooks Classic
Insightly CRM
QuickBooks Accounting: Invoicing & Expenses
HipChat - Chat Built for Teams
Xero Accounting Software
MailChimp - Email, Marketing Automation
Crew - Free Messaging and Scheduling
Asana: organize team projects
Google Analytics
AdWords Express

In [32]:

bold_print("Example of an app with duplicate records:")
print(google_list[0])
for apps in google_list[1:]:
    name = apps[0]
    if name == 'Box':
        print(apps)

Example of an app with duplicate records:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

As can be seen above, some apps have multiple entries. One criterion for choosing which entry to keep is the number of reviews: the entry with the highest review count is kept.

An app with a large number of reviews is a sign that the app is in use. So for each app, the entry with the most reviews will be kept.

In [33]:

#Identify the highest number of reviews recorded for each app in the Playstore dataset
reviews_max = {}
for app in google_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
bold_print("Some of the apps and their review counts:")
print_list(list(reviews_max.items())[:10])

Some of the apps and their review counts:

('Photo Editor & Candy Camera & Grid & ScrapBook', 159.0)
('Coloring book moana', 974.0)
('U Launcher Lite – FREE Live Cool Themes, Hide Apps', 87510.0)
('Sketch - Draw & Paint', 215644.0)
('Pixel Draw - Number Art Coloring Book', 967.0)
('Paper flowers instructions', 167.0)
('Smoke Effect Photo Maker - Smoke Editor', 178.0)
('Infinite Painter', 36815.0)
('Garden Coloring Book', 13791.0)
('Kids Paint Free - Drawing Fun', 121.0)

Now that we have the highest number of reviews for each app, it becomes easier to select a single entry for the apps with multiple entries.

In [34]:

#Remove all instances of the same app that do not have the most reviews
android_clean = []
already_added = []

for row in google_list[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

bold_print("Duplicates removed.")

Duplicates removed.
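For comparison, the two steps above (find the highest review count, then keep one matching row) can be folded into a single pass. This is a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, the rows are toy data, and the result is ordered by the first appearance of each app name, which may differ from the order produced by the cell above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # The strict comparison keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews (illustrative values only).
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Coloring book moana', 'FAMILY', '3.9', '974'],
]
cleaned = keep_most_reviewed(toy_rows)
for row in cleaned:
    print(row)
```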

To verify that we have a clean list without multiple instances of the same app, we can count the duplicates in the cleaned set.

In [35]:

#Verify that there are no apps with multiple instances in the Playstore dataset
#android_clean has no header row, so every row is checked
unique_apps = []
duplicate_apps = []
for an_app in android_clean:
    app_name = an_app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("Number of instances of duplicate apps:",len(duplicate_apps))

Number of instances of duplicate apps: 0

Seeing as we have cleaned the android set it would be beneficial to find out whether the apple data set suffers from such duplicates.

In [36]:

#Verify the App store dataset for existence of multiple entries
unique_apps = []
duplicate_apps = []
for an_app in apple_list[1:]:
    app_name = an_app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("Number of instances of duplicate apps:",len(duplicate_apps))

Number of instances of duplicate apps: 0

The App Store dataset does not seem to contain multiple entries for the same app (the check above compares app ids).


4¶

Identifying and Removing Non-English Apps¶

Keeping only apps aimed at an English-speaking audience

As mentioned earlier our analysis is on apps that are focused on an English speaking audience. There are many apps in this list that are for audiences of other languages. We first need to identify those non-English apps.

In [37]:

def is_english(app_name):
    """
    Identify whether an app name is likely to be English

    An app name counts as English if it has at most 3 characters outside
    the ASCII range (code points above 127).
    
    Args:
        app_name (string): Name of the app to be verified
    
    Returns:
        check (boolean): True if the app name looks English, False otherwise
    """
    count = 0 
    check = True
    for letter in app_name:
        #ord() returns the integer code point of a Unicode character
        if ord(letter)>127:
            count+=1
            #Many English app names contain symbols such as ™ or emoji,
            #so up to 3 non-ASCII characters are tolerated
            if count>3:
                check = False
    return check

In [38]:

#Verify the is_english() function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
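The rule behind `is_english()` can be stated compactly: count the characters whose code point is above 127 and compare the count with a tolerance. The sketch below is one possible restatement, not a replacement for the function above; `is_mostly_ascii` and its `max_non_ascii` parameter are hypothetical names introduced for illustration.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters with a code point above 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

samples = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for sample in samples:
    print(sample, is_mostly_ascii(sample))
```

Making the tolerance a parameter shows how sensitive the filter is: with `max_non_ascii=0`, names such as 'Docs To Go™ Free Office Suite' would be rejected even though they are English.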

In [39]:

#Identify and filter the number of english apps in the Playstore dataset
english_apps = []
non_english_apps = []

for row in android_clean:
    if is_english(row[0]):
        english_apps.append(row)
    else:
        non_english_apps.append(row[0])

print("The number of english apps in the Playstore:",len(english_apps))
print("The number of non-english apps in the Playstore:",len(non_english_apps))

The number of english apps in the Playstore: 9615
The number of non-english apps in the Playstore: 45

We can repeat the same exercise for the App Store apps and eliminate the non-English apps from that data set.

In [40]:

#Identify and filter the number of english apps in the App store dataset
appl_english_apps = []
appl_non_english_apps = []

#Skip the header row so that it is not counted as an app
for row in apple_list[1:]:
    if is_english(row[1]):
        appl_english_apps.append(row)
    else:
        appl_non_english_apps.append(row[1])

print("The number of App store english apps:",len(appl_english_apps))
print("The number of App store non-english apps:",len(appl_non_english_apps))

The number of App store english apps: 6183
The number of App store non-english apps: 1014

5¶

Identifying and Removing Paid Apps¶

Keeping only the free English apps

The goal is to identify apps that are both for an English-speaking audience and free. This requires that we remove paid apps from the data sets. In the Google data set the price of an app is stored in the eighth column of each row (index 7). To identify the free apps, only those apps whose price is 0 are kept.

In [41]:

#Identify and filter free english apps from the Playstore dataset
free_apps = []
priced_apps = []

for app in english_apps:
    if app[7] != '0':
        priced_apps.append(app)
    else:
        free_apps.append(app)

print("The number of free english apps in the android data set:",len(free_apps))
print("The number of paid english apps in the android data set:",len(priced_apps))

The number of free english apps in the android data set: 8865
The number of paid english apps in the android data set: 750

We can have the App Store data set go through the same filtering process to identify the free apps in the App Store.

In [42]:

#Identify and filter free english apps from the App store dataset
#The price is stored in the fifth column (index 4)
appl_free_apps = []
appl_priced_apps = []

for app in appl_english_apps:
    if app[4] != '0.0':
        appl_priced_apps.append(app)
    else:
        appl_free_apps.append(app)

print("The number of free english apps in the app store data set:",len(appl_free_apps))
print("The number of paid english apps in the app store data set:",len(appl_priced_apps))

The number of free english apps in the app store data set: 3222
The number of paid english apps in the app store data set: 2961


6¶

Defining a Strategy for App Development¶

Now that we have a clean data set for apps from both the App Store and Google Play, the next step is to work out how the data can point us to a class of apps that would help generate revenue.

After discussions with certain stakeholders the planned strategy for development is as follows.

- First, identify an app class or app type that is popular in both the App Store and the Play Store.
- Next, develop the Android app and get it up and running on the Play Store.
  - If the response is good, create a similar app for the App Store and publish it there.

Since the planned app is meant for both the Playstore and App store we need to identify which app types are popular in both stores.

Apps in both stores belong to a genre. Apps in the Play Store also have a Category in addition to a Genre. We can group apps by genre and identify which genres have the most apps. This helps to identify the share of each genre in both stores.

In [43]:

def freq_table(dataset, index):
    """
    Calculate the relative frequency of each value in a column of a dataset
    
    Args:
        dataset (list): List of lists containing the data rows (without a header row)
        index (int): Column for which the relative frequencies must be generated
    
    Returns:
        temp_dict (dictionary): Relative frequency (in percent) of each value of the column supplied as input
    """
    #Create Frequency table
    temp_dict = {}
    #The filtered lists passed to this function carry no header row,
    #so every row is counted
    for row in dataset:
        value_in_column = row[index]
        if value_in_column in temp_dict:
            temp_dict[value_in_column]+=1
        else:
            temp_dict[value_in_column]=1
            
    #Sum the frequencies for all genres        
    sum_of_freq = sum(temp_dict.values())
    
    # Assign percentage values to identify share of each genre in app store
    for row in temp_dict:
        temp_dict[row] = round(((temp_dict[row] / sum_of_freq) * 100),2)   
    return temp_dict

In [44]:

#Function to display the values in the above dictionary in descending order of percentage.
#Each entry is stored as a (percentage, key) tuple so that sorted() orders the entries by percentage.
import pandas as pd
def display_table(dataset, index):
    """
    Builds a dataframe of the relative frequencies of a column, sorted in descending order
    
    Args:
        dataset (list): List of lists containing the data
        index (int): Column for which the relative frequencies must be generated
    
    Returns:
        percentage_table_df (dataframe): Each value of the column and its share in percent
    """
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    percentage_table = []
    for entry in table_sorted:
        percentage_table.append([entry[1],entry[0]])
    
    percentage_table_df = pd.DataFrame(percentage_table, columns = ["app_group","percentage"])
    return percentage_table_df
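The two functions above count the values of a column, convert the counts to percentages and sort them. As a hedged alternative, the sketch below does the same with the standard library's `collections.Counter`; `percentage_table` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, largest share first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Toy rows: app name and category (illustrative values only).
toy_rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'FAMILY'],
    ['App D', 'TOOLS'],
]
table = percentage_table(toy_rows, 1)
for value, share in table:
    print(value, share)
```

`Counter.most_common()` already returns the entries from most to least frequent, so no separate tuple-swapping step is needed for sorting.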

Using the functions created above it is possible to summarize the prime_genre column of the Apple data set and the Category column of the filtered Google data set.

In [45]:

#Percentage of apps by genre in the filtered App store data set
appl_percent = display_table(appl_free_apps, 11)
#Percentage of apps by category in the filtered Playstore data set
google_percent = display_table(free_apps, 1)

In [73]:

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (13,13))

ax1 = fig.add_subplot(121)
ax1.set_title("Apps by Genre in App Store", size=16)
ax1.pie(x = appl_percent["percentage"],
        labels = appl_percent["app_group"].str.lower(),
        rotatelabels=True)

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Apps by Genre in Playstore", size=16)
ax2.pie(x = google_percent["percentage"],
        labels = google_percent["app_group"].str.lower(),
        rotatelabels=True)

plt.tight_layout()
plt.show()

A significant percentage of apps in the App Store belong to the Games genre. However, this does not make Games the most sought-after genre, because the chart is based only on the filtered list of free English apps. As a generalization, apps built for fun (i.e. Games, Entertainment, Photo & Video) have a larger share of the pie.

Contrast this with the Play Store data, where most categories have an almost equal share. Even so, the Family and Game categories have a clear lead. What's more interesting is that apps in the Family category are mostly games aimed at children, as can be seen below.

Taken together, this gives games slightly more than a quarter of the share. However, since non-English apps were eliminated, the lead of the Game category cannot be considered conclusive.

In [92]:

bold_print("Some apps in the FAMILY Category of PlayStore data")
count = 0
print('\033[4m'+"NAME"+'\033[0m'+":"+'\033[4m'+"CATEGORY"+'\033[0m')
for an_app in free_apps:
    if (an_app[1] == 'FAMILY')and (count<10):
        print(an_app[0],":",an_app[1])
        count+=1

Some apps in the FAMILY Category of PlayStore data

NAME:CATEGORY
Jewels Crush- Match 3 Puzzle : FAMILY
Coloring & Learn : FAMILY
Mahjong : FAMILY
Super ABC! Learning games for kids! Preschool apps : FAMILY
Toy Pop Cubes : FAMILY
Educational Games 4 Kids : FAMILY
Candy Pop Story : FAMILY
Princess Coloring Book : FAMILY
Hello Kitty Nail Salon : FAMILY
Candy Smash : FAMILY

Note that the Genres column of the Play Store dataset was not used, because its purpose is primarily to give the sub-category of an app. Most apps have the same Category and Genre. However, apps that belong to the Family and Game categories can belong to different genres, as shown below.

Since we are not analysing games separately, this column can be ignored.

In [93]:

bold_print("Apps in the GAME and FAMILY Category of PlayStore data")
count = 0

print('\033[4m'+"NAME"+'\033[0m'+":"+'\033[4m'+"CATEGORY"+'\033[0m'+":"+'\033[4m'+"GENRE"+'\033[0m')
for an_app in free_apps:
    if (an_app[1] == 'FAMILY' or an_app[1] == 'GAME') and (count<30):
        print(an_app[0],":",an_app[1],":",an_app[-4])
        count+=1

Apps in the GAME and FAMILY Category of PlayStore data

NAME:CATEGORY:GENRE
Solitaire : GAME : Card
Sonic Dash : GAME : Arcade
PAC-MAN : GAME : Arcade
Bubble Witch 3 Saga : GAME : Puzzle
Race the Traffic Moto : GAME : Racing
Marble - Temple Quest : GAME : Puzzle
Shooting King : GAME : Sports
Geometry Dash World : GAME : Arcade
Jungle Marble Blast : GAME : Casual
Roll the Ball® - slide puzzle : GAME : Puzzle
Block Craft 3D: Building Simulator Games For Free : GAME : Simulation
Farm Fruit Pop: Party Time : GAME : Casual
Love Balls : GAME : Puzzle
Piano Tiles 2™ : GAME : Arcade
Pokémon GO : GAME : Adventure
Paint Hit : GAME : Casual
Snake VS Block : GAME : Arcade
Rolly Vortex : GAME : Arcade
Woody Puzzle : GAME : Puzzle
Stack Jump : GAME : Arcade
The Cube : GAME : Arcade
Extreme Car Driving Simulator : GAME : Racing
Bricks n Balls : GAME : Casual
The Fish Master! : GAME : Arcade
Color Road : GAME : Arcade
Draw In : GAME : Arcade
PLANK! : GAME : Arcade
Looper! : GAME : Puzzle
Trivia Crack : GAME : Trivia
Will it Crush? : GAME : Simulation


7¶

Assessing User Rating by Genre for Both Datasets¶

The number of users per genre could be a more reliable measure of a genre's popularity, since more users for a genre means more users for apps of that genre.

However, the data requires a couple of adjustments, detailed below:

The App Store data does not provide the number of downloads. As a proxy, the total number of user ratings per app (the rating_count_tot column) is used, and the average number of ratings per app is calculated for each genre. The Play Store data does provide the number of installs, so the average number of installs per app is calculated for each category.
The Play Store install counts are open-ended ranges such as 500,000+. Since it is impossible to know the exact number of downloads, the value is taken as is: 500,000+ is treated as 500,000 downloads, and the average is computed from these values.
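The conversion described above can be sketched as follows. `parse_installs` is a hypothetical helper and the labels are illustrative; because each label is only a lower bound, averages computed this way understate the true number of downloads.

```python
def parse_installs(text):
    """Convert an install label such as '500,000+' to the number 500000."""
    return int(text.replace(',', '').replace('+', ''))

labels = ['10,000+', '500,000+', '5,000,000+', '0']
lower_bounds = [parse_installs(label) for label in labels]
average = sum(lower_bounds) / len(lower_bounds)

print(lower_bounds)
print(average)
```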

In [124]:

apple_user_rating = []
prime_genre_dict = freq_table(appl_free_apps, -5)

#Calculate the average number of user ratings (rating_count_tot) for each genre in the filtered App Store data
for genre in prime_genre_dict:
    total = 0
    len_genre = 0
    #appl_free_apps has no header row, so every row is included
    for row in appl_free_apps:
        genre_app = row[-5]
        if genre_app == genre:
            user_rating = float(row[5])
            total+=user_rating
            len_genre+=1
            
    avg_user_rating = round((total/len_genre),2)
    apple_user_rating.append([genre,avg_user_rating])

#Generating a Dataframe for a graph
apple_user_rating_df = pd.DataFrame(apple_user_rating, columns = ["genre","apple_avg_user_rating"])
apple_user_rating_df.sort_values(by = ["apple_avg_user_rating"],inplace = True)
apple_user_rating_df['genre'] = apple_user_rating_df['genre'].str.lower()

In [125]:

category_dict = freq_table(free_apps, 1)
google_user_rating = []

#Calculate the average number of installs for each category in the filtered Playstore data
for category in category_dict:
    total = 0
    len_category = 0
    #free_apps has no header row, so every row is included
    for row in free_apps:
        category_app = category
        if category_app == row[1]:
            #Installs are given as labels such as '500,000+'; strip '+' and ',' before converting
            number_of_installs = row[5].replace('+', '')
            number_of_installs = float(number_of_installs.replace(',', ''))
            total+=number_of_installs
            len_category+=1
    google_user_rating.append([category,round((total/len_category),2)])
    
#Generating a Dataframe for a graph
google_user_rating_df = pd.DataFrame(google_user_rating, columns = ["genre","google_avg_user_downloads"])
google_user_rating_df.sort_values(by = ["google_avg_user_downloads"],inplace = True)
google_user_rating_df['genre'] = google_user_rating_df['genre'].str.lower().str.replace('_',' ').str.replace('and','&')
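Both cells above loop over the whole dataset once per genre. The per-group averages can also be accumulated in a single pass, as in this sketch. `average_by_group` and `installs` are hypothetical names and the rows are toy data; only the standard library is assumed.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_of):
    """Return {group: average value}, reading each row exactly once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += value_of(row)
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

def installs(row):
    """Read the install label in column 2 as a number, e.g. '1,000,000+' as 1000000.0."""
    return float(row[2].replace(',', '').replace('+', ''))

# Toy rows: app name, category, installs (illustrative values only).
toy_rows = [
    ['App A', 'WEATHER', '1,000,000+'],
    ['App B', 'WEATHER', '5,000,000+'],
    ['App C', 'FINANCE', '500,000+'],
]
averages = average_by_group(toy_rows, 1, installs)
print(averages)
```

Passing the value extraction in as a function lets the same helper serve both stores: the rating count for the App Store and the install label for the Play Store.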

To make comparison easier, the averages are shown as bar charts below.

In [135]:

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12,12))
ax1 = fig.add_subplot(121)
ax1.set_title("User Ratings by Genre - App Store",size=16)
ax1.barh(width = apple_user_rating_df['apple_avg_user_rating'], y = apple_user_rating_df['genre'])
for key, values in ax1.spines.items():
    if key!="top":
        ax1.spines[key].set_visible(False)
ax1.xaxis.tick_top()
ax1.tick_params(left = False)

ax2 = fig.add_subplot(122)
ax2.set_title("User Downloads by Genre - PlayStore (million)",size=16)
ax2.barh(width = google_user_rating_df['google_avg_user_downloads']/1000000, y = google_user_rating_df['genre'],color = 'red')
for key, values in ax2.spines.items():
    if key!="top":
        ax2.spines[key].set_visible(False)
ax2.xaxis.tick_top()
ax2.tick_params(left = False)
plt.tight_layout()
plt.show()

What stands out immediately is that many genres have exactly the same name in both stores, such as productivity and finance, while many others look similar but are not identical, such as navigation and maps & navigation.

Assuming that Google and Apple define their Categories and Genres in broadly similar ways, genres with exactly the same name should contain comparable apps. However, a Play Store category with a slightly different name may not fit the App Store definition of the corresponding genre.

Consider the navigation and maps & navigation example below.

In [128]:

bold_print("Some apps from the Navigation genre of the App store")
for app in appl_free_apps:
    if app[-5] == "Navigation":
        print(app[1],":",app[5])

Some apps from the Navigation genre of the App store

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5

In [127]:

bold_print("Some apps from the MAPS AND NAVIGATION category of the Playstore")
count = 0
for app in free_apps:
    if app[1] == "MAPS_AND_NAVIGATION" and count<30:
        print(app[0],":",app[5])
        count+=1

Some apps from the MAPS AND NAVIGATION category of the Playstore

Waze - GPS, Maps, Traffic Alerts & Live Navigation : 100,000,000+
T map (te map, T map, navigation) : 5,000,000+
MapQuest: Directions, Maps, GPS & Navigation : 10,000,000+
Yahoo! transit guide free timetable, operation information, transfer search : 10,000,000+
乗換NAVITIME　Timetable & Route Search in Japan Tokyo : 5,000,000+
Transit: Real-Time Transit App : 5,000,000+
Mapy.cz - Cycling & Hiking offline maps : 1,000,000+
Uber : 100,000,000+
GPS Navigation & Offline Maps Sygic : 50,000,000+
Map and Router Badge : 500,000+
Yandex.Transport : 10,000,000+
Air Traffic : 1,000,000+
Speed Cameras Radar : 1,000,000+
Atlan3D Navigation: Korea navigator : 1,000,000+
Compass : 10,000,000+
Mappy - Plan, route comparison, GPS : 1,000,000+
Gps Route Finder : 100,000+
My Location: GPS Maps, Share & Save Places : 5,000,000+
Yanosik: "antyradar", traffic jams, navigation, camera : 5,000,000+
NAVITIME - Map & Transfer Navi : 5,000,000+
Sygic Car Navigation : 5,000,000+
Czech Public Transport IDOS : 1,000,000+
Karta GPS - Offline Navigation : 1,000,000+
Circle ratio : 1,000,000+
Soviet Military Maps Free : 1,000,000+
Truck Car Navi by Navitime Large size car, traffic jam, traffic closure, live camera, typhoon / precipitation map : 100,000+
Sentin Information Map : 100,000+
Snapp : 1,000,000+
GPS Speedometer and Odometer : 1,000,000+
GPS Traffic Speedcam Route Planner by ViaMichelin : 5,000,000+

While the App Store apps are clearly navigation related, the MAPS AND NAVIGATION category in the Play Store also includes apps like Mapy.cz and Compass, which are more related to maps than to actual navigation.

Based on this observation, we will only consider genres that have exactly the same name in both stores and use them to come to a conclusion.

The Finance genre is a good example in support of this approach. The listings from both data sets show that most apps are directly finance related.
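Restricting the comparison to identically named genres amounts to a set intersection after the names have been brought to a common spelling. The sketch below illustrates this on a handful of names taken from the listings above; `normalise_category` is a hypothetical helper, and the short lists stand in for the full sets of genres and categories.

```python
def normalise_category(name):
    """Lower-case a Play Store category and spell it like an App Store genre."""
    return name.lower().replace('_', ' ').replace(' and ', ' & ')

# A few names from each store; the full data has many more.
app_store_genres = ['Finance', 'Navigation', 'Weather', 'Food & Drink']
play_store_categories = ['FINANCE', 'MAPS_AND_NAVIGATION', 'WEATHER', 'FOOD_AND_DRINK']

apple = {genre.lower() for genre in app_store_genres}
google = {normalise_category(category) for category in play_store_categories}

shared = sorted(apple & google)       # compared directly
apple_only = sorted(apple - google)   # left out of the comparison
google_only = sorted(google - apple)  # left out of the comparison

print(shared)
print(apple_only)
print(google_only)
```

Listing the names that fall outside the intersection makes it easy to review near matches, such as navigation and maps & navigation, by hand.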

In [122]:

bold_print("Some apps from the Finance genre of the App store")
for app in appl_free_apps:
    if app[-5] == "Finance":
        print(app[1],":",app[5])

Some apps from the Finance genre of the App store

Chase Mobile℠ : 233270
Mint: Personal Finance, Budget, Bills & Money : 232940
Bank of America - Mobile Banking : 119773
PayPal - Send and request money safely : 119487
Credit Karma: Free Credit Scores, Reports & Alerts : 101679
Capital One Mobile : 56110
Citi Mobile® : 48822
Wells Fargo Mobile : 43064
Chase Mobile : 34322
Square Cash - Send Money for Free : 23775
Capital One for iPad : 21858
Venmo : 21090
USAA Mobile : 19946
TaxCaster – Free tax refund calculator : 17516
Amex Mobile : 11421
TurboTax Tax Return App - File 2016 income taxes : 9635
Bank of America - Mobile Banking for iPad : 7569
Wells Fargo for iPad : 2207
Stash Invest: Investing & Financial Education : 1655
Digit: Save Money Without Thinking About It : 1506
IRS2Go : 1329
Capital One CreditWise - Credit score and report : 1019
U by BB&T : 790
Paribus - Rebates When Prices Drop : 768
KeyBank Mobile : 623
VyStar Mobile Banking for iPhone : 434
Sparkasse - Your mobile branch : 77
VyStar Mobile Banking for iPad : 57
Zaim : 44
Ma Banque : 17
Lloyds Bank Mobile Banking : 17
Suica : 10
Halifax Mobile Banking : 8
La Banque Postale : 8
币优铺 : 0
Impots.gouv : 0

In [123]:

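bold_print("Some apps from the Finance genre of the Playstore")
# app[1] holds the category, app[0] the app name and app[5] the install count
count = 0  # number of apps printed so far; the output is capped at 20
for app in free_apps:
    if app[1] == "FINANCE" and count<20:
        print(app[0],":",app[5])
        count+=1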

Some apps from the Finance genre of the Playstore

K PLUS : 10,000,000+
ING Banking : 1,000,000+
Citibanamex Movil : 5,000,000+
The postal bank : 5,000,000+
KTB Netbank : 5,000,000+
Mobile Bancomer : 10,000,000+
Nedbank Money : 500,000+
SCB EASY : 5,000,000+
CASHIER : 10,000,000+
Rabo Banking : 1,000,000+
Capitec Remote Banking : 1,000,000+
Itau bank : 10,000,000+
Nubank : 5,000,000+
The Societe Generale App : 1,000,000+
IKO : 1,000,000+
Cash App : 10,000,000+
Standard Bank / Stanbic Bank : 1,000,000+
Bualuang mBanking : 5,000,000+
Intesa Sanpaolo Mobile : 1,000,000+
UBA Mobile Banking : 1,000,000+

Working under the above assumption, we can compare the shared genres across the two datasets and pick a genre for which to develop an app.

In [70]:

merged_apple_google = pd.merge(apple_user_rating_df,google_user_rating_df,on = 'genre')
merged_apple_google.sort_values(by = 'apple_avg_user_rating', ascending = False)

Out[70]:

	genre	apple_avg_user_rating	google_avg_user_downloads
12	weather	52279.89	5074486.20
4	food & drink	33333.92	1924897.74
3	finance	31467.94	1387692.48
10	shopping	26919.69	7036877.31
5	health & fitness	23298.02	4188821.99
11	sports	23008.90	3638640.14
9	productivity	21028.41	16787331.34
6	lifestyle	16485.76	1437816.27
7	lifestyle	16485.76	1000.00
2	entertainment	14029.83	11640705.88
0	business	7491.12	1712290.15
1	education	7003.98	1833495.15
8	medical	612.00	120550.62
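The merged table lists lifestyle twice (rows 6 and 7), which is what an inner merge produces when one of the input frames holds two rows with the same genre key. A minimal sketch of how such a duplicate could be spotted before merging is given here; the frames `apple` and `google` are illustrative stand-ins built from values in the table, not the notebook's actual `apple_user_rating_df` and `google_user_rating_df`.

```python
import pandas as pd

# Illustrative frames; the numbers are copied from the merged table.
apple = pd.DataFrame({
    "genre": ["lifestyle", "medical"],
    "apple_avg_user_rating": [16485.76, 612.00],
})
google = pd.DataFrame({
    "genre": ["lifestyle", "lifestyle", "medical"],
    "google_avg_user_downloads": [1437816.27, 1000.00, 120550.62],
})

# Rows whose genre key occurs more than once in the Google frame.
dupes = google[google["genre"].duplicated(keep=False)]
print(dupes)

# validate="one_to_one" makes the merge raise an error instead of
# silently repeating the Apple row for every duplicate key.
try:
    pd.merge(apple, google, on="genre", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print("one-to-one merge possible:", merge_ok)
```

Whether the second lifestyle row is a data-entry artefact or a genuinely separate category would have to be checked against the source data.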

The top 5 genres by average downloads in the Playstore are productivity, entertainment, shopping, weather and health & fitness.

The top 5 genres by average number of user ratings in the App Store are weather, food & drink, finance, shopping and health & fitness.
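The two rankings can be reproduced, and their overlap found, with a few lines of plain Python. This is a sketch only: the `rows` list is typed in from the merged table, and the helper `top_genres` is a hypothetical name introduced for illustration.

```python
# (genre, avg. App Store rating count, avg. Playstore downloads),
# copied from the merged table.
rows = [
    ("weather", 52279.89, 5074486.20),
    ("food & drink", 33333.92, 1924897.74),
    ("finance", 31467.94, 1387692.48),
    ("shopping", 26919.69, 7036877.31),
    ("health & fitness", 23298.02, 4188821.99),
    ("sports", 23008.90, 3638640.14),
    ("productivity", 21028.41, 16787331.34),
    ("lifestyle", 16485.76, 1437816.27),
    ("lifestyle", 16485.76, 1000.00),
    ("entertainment", 14029.83, 11640705.88),
    ("business", 7491.12, 1712290.15),
    ("education", 7003.98, 1833495.15),
    ("medical", 612.00, 120550.62),
]

def top_genres(rows, column, n=5):
    """Return the n genre names with the largest value at tuple index `column`."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    return [r[0] for r in ranked[:n]]

top_ratings = top_genres(rows, 1)
top_downloads = top_genres(rows, 2)

# Genres that make the top 5 in both stores.
in_both = [g for g in top_ratings if g in top_downloads]
print(in_both)  # ['weather', 'shopping', 'health & fitness']
```

Only weather, shopping and health & fitness appear in both top-5 lists.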

Index

8¶

The Result¶

Based on the comparison above, apps in the Shopping genre are the clear favorite for further development. They perform well in terms of average downloads in the Playstore and attract a respectable number of user ratings in the App Store.

However, Productivity apps also deserve strong consideration, since our strategy calls for App Store development only if we see strong outcomes in the Playstore.

While Weather apps could be given some consideration, I am unsure whether the genre can generate enough revenue. Unlike apps in the Shopping and Productivity genres, a weather app is unlikely to be opened multiple times.

However, this result must only be considered while keeping the following points in mind:

An assumption was made to get to this result. It was tested on only a few genres and may not hold for all of them; confirming it would require further analysis of the data.
The problem with comparing genres is that the definition of a genre, and which apps belong to it, depends entirely on Google and Apple, as there is no standard definition for either to follow.
A couple of genres were left out of both sets, such as Utilities from the App Store set and Tools from the Playstore set. While these sound similar, Apple and Google may define them differently.
Ultimately, we are comparing genres from two tech companies with two very different mindsets. This is evident from the proportion of paid to free apps in the respective stores, so some assumptions need to be made to get a result.

In [130]:

fig = plt.figure(figsize = (10,10))
plt.subplot(1,2,1)
plt.title("Free vs. Paid apps in Playstore",size=15)
plt.pie(x = [8865,750])
plt.legend(['Free apps','Paid apps'], loc = 'upper right')
plt.subplot(1,2,2)
plt.title("Free vs. Paid apps in App Store", size=15)
plt.pie(x = [3222,2962])
plt.legend(['Free apps','Paid apps'], loc = 'upper right')

plt.show()
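The difference between the two stores is easier to state as a percentage. A short sketch using the same counts that are hard-coded in the pie charts follows; the helper name `free_share` is introduced here for illustration only.

```python
def free_share(free, paid):
    """Percentage of free apps among all (free + paid) apps."""
    return 100 * free / (free + paid)

# Counts taken from the pie-chart cell.
playstore_share = free_share(8865, 750)
app_store_share = free_share(3222, 2962)

print(f"Playstore: {playstore_share:.1f}% free")  # Playstore: 92.2% free
print(f"App Store: {app_store_share:.1f}% free")  # App Store: 52.1% free
```

In these samples roughly nine in ten Playstore apps are free, against about half of the App Store apps.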

Index

9¶

Conclusion¶

The aim of this project was to identify an app profile that could help generate revenue in both Google's Playstore and Apple's App Store. The project began by cleaning the associated datasets and analysing them. We first attempted to identify a popular app profile by going over the most popular genres. However, since the datasets were filtered down to free English-language apps, we switched to a strategy of analysing user preferences for each genre.

Based on the analysis, we concluded that apps in the Shopping genre could prove to be a safe bet, as user preference for them is balanced across both stores. This result, however, comes with a few caveats, as it is built on an assumption.

Index