Notebook

Mobile App Analysis for Android and iOS

Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users. We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.

In [2]:

from csv import reader

# The App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
apple_head = ios[0]  # header row
apple = ios[1:]      # data rows

# The Google Play data set
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    droid = list(reader(opened_file))
android_head = droid[0]  # header row
android = droid[1:]      # data rows

In [3]:

# Print rows [start:end] of a data set and, optionally, its dimensions.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
 

In [4]:

                             
print(apple_head, '\n')
explore_data(apple, 0, 3, True)
# Columns that can help with the analysis: 'track_name', 'currency', 'price',
# 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16

In [5]:

print(android_head, '\n')
explore_data(android, 0, 3, True)
# Columns that can help with the analysis: 'App', 'Category', 'Reviews',
# 'Installs', 'Type', 'Price', and 'Genres'.

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13

In [6]:

# The row at index 10472 is malformed: its rating reads 19, above the maximum of 5.
print(android_head, '\n')  # header, for reference
print(len(android))
del android[10472]  # deletes the malformed row; run this cell only once
print(len(android))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

10841
10840
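
One way to locate a malformed row like this, rather than spotting it by eye, is to compare each row's column count with the header's. The sketch below assumes that a malformed row shows up as a row with the wrong number of columns; the helper name find_malformed_rows and the toy rows are hypothetical, not part of the original analysis.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical), not the real Google Play rows.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],  # a missing value shifts the remaining columns left
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(toy_header, toy_rows)
print(malformed)
```

On the real data this would be called as find_malformed_rows(android_head, android) before any row is deleted.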

In [7]:

app_names = ['Instagram']

print('Instagram' in app_names)
print('Twitter' in app_names)
print(232 in app_names)
print('Facebook' in app_names)

True
False
False
False

In [24]:

# Searching for duplicate entries of one app in the Google Play data set
for app in android:
    name = app[0]
    if name == 'Instagram':
        print('duplicate apps'+'\n',app,'\n')

duplicate apps
 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

duplicate apps
 ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

duplicate apps
 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

duplicate apps
 ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

There are a number of duplicate apps. The review counts differ between entries, so one probable cause is that the data was collected at different times.

In [ ]:

# Identifying duplicate apps in the Google Play data set
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:',len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:',duplicate_apps[:15])

In [ ]:

# Record the highest review count seen for each app name; the row with the
# most reviews is the one we keep when removing duplicates.
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# The dictionary length should be 9,659

In [ ]:

# Expected number of apps once the 1,181 duplicates are removed
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

In [ ]:

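# Removing duplicate rows: android_clean collects the rows we keep and
# already_added tracks app names that are already in android_clean.
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    # The second condition handles duplicates that tie on the maximum review count.
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)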

In [ ]:

#Explore the android_clean data set to ensure everything went as expected. 
#The data set should have 9,659 rows.
explore_data(android_clean, 0, 2, True)
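
The two-step de-duplication above (build reviews_max, then filter) could also be sketched as a single pass that stores the best row per app name directly in a dictionary. This is only an alternative sketch, not the method used in this notebook; the function name dedupe_by_max_reviews and the toy rows are hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row when two rows tie on review count.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy rows (hypothetical): name, category, rating, reviews.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Other App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

toy_clean = dedupe_by_max_reviews(toy_rows)
print(len(toy_clean))
```

A dictionary lookup also avoids the repeated list scan of "name not in already_added", which may matter on larger data sets.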

In [ ]:

# Examples of non-English apps that we want to remove from the data sets
print(apple[813][1])
print(apple[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

In [ ]:

# To detect non-English names, we check for characters whose code point is above 127 (outside ASCII)
def eng_charac(string):
    for character in string:
        if ord(character) > 127:
            return False

    return True

In [ ]:

# Checking English and non-English app names
print(eng_charac('Instagram'))
print(eng_charac('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_charac('Docs To Go™ Free Office Suite'))

# Note: special characters such as ™ fall outside the ASCII range, so this
# version of the function would also remove useful English apps.

We'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [ ]:

def eng_charac(string):
    non_ascii = 0
    
    for non_eng_charac in string:
        if ord(non_eng_charac) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

print(eng_charac('Docs To Go™ Free Office Suite'))
print(eng_charac('Instachat 😜'))

Below, we use the eng_charac() function to filter out the non-English apps for both data sets:

In [ ]:

android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if eng_charac(name):
        android_english.append(app)


for app in apple:
    name = app[1]
    if eng_charac(name):
        apple_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

We're now left with 9164 Android apps and 6183 iOS apps.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [ ]:

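android_free = []
ios_free = []

# Google Play stores the price in column 7; free apps have the price '0'
for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)

# The App Store stores the price in column 4; free apps have the price '0.0'
for app in apple_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))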
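
The string comparisons above depend on the exact spelling of a zero price in each file ('0' versus '0.0'). A slightly more forgiving check, sketched below, converts the price to a number first. The helper name is_free is hypothetical, and the handling of a leading '$' is an assumption about how paid prices may be written, not something shown in the output above.

```python
def is_free(raw_price):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99' (the '$' prefix is an assumption).
    """
    cleaned = raw_price.strip().lstrip('$')
    return float(cleaned) == 0.0


print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```

With such a helper, both loops could share the same condition, for example "if is_free(app[7])" for Google Play and "if is_free(app[4])" for the App Store.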

So far, we:

Removed inaccurate data
Removed duplicate app entries
Removed non-English apps
Isolated the free apps

We want to find an app profile that fits both the App Store and Google Play. Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

Now we will inspect both data sets and identify the columns we could use to generate frequency tables and find out which genres are the most common in each market.

We'll build two functions we can use to analyze the frequency tables:

One function to generate frequency tables that show percentages
Another function that we can use to display the percentages in a descending order

In [ ]:

# function to generate frequency tables that show percentages
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages



# Function to display the percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
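
For comparison, the same kind of sorted percentage table can be sketched with collections.Counter from the standard library, which counts values and sorts them by frequency via most_common(). The function name freq_table_counter and the toy rows are hypothetical; this is an alternative sketch, not a replacement for the two functions above.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy rows (hypothetical): app name, genre.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

toy_table = freq_table_counter(toy_rows, 1)
for value, percentage in toy_table:
    print(value, ':', round(percentage, 2))
```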

We can now examine the frequency table for the prime_genre column of the App Store data set.

In [ ]:

display_table(ios_free, -5)  # prime_genre column

We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not be the same as the offer.

Below, we examine the Google Play store.

In [ ]:

display_table(android_free, 1)  # Category column

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [ ]:

# Average number of user ratings per genre (prime_genre is at index -5, rating_count_tot at index 5)
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

In [ ]:

for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings        

On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together.

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.
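
To see how much a few giants can pull an average, it can help to put the median next to the mean for each genre. The sketch below uses made-up numbers to illustrate the effect; the function name summarize_by_genre and the toy rows are hypothetical, and the figures are not taken from the App Store data.

```python
from statistics import mean, median


def summarize_by_genre(rows, genre_idx, count_idx):
    """Return {genre: (mean, median)} of the numeric column at count_idx."""
    by_genre = {}
    for row in rows:
        by_genre.setdefault(row[genre_idx], []).append(float(row[count_idx]))
    return {genre: (mean(values), median(values)) for genre, values in by_genre.items()}


# Toy rows (hypothetical): app name, genre, number of ratings.
toy_rows = [
    ['Giant Maps', 'Navigation', '300000'],
    ['Small Nav 1', 'Navigation', '1000'],
    ['Small Nav 2', 'Navigation', '2000'],
]

summary = summarize_by_genre(toy_rows, 1, 2)
print(summary['Navigation'])
```

In this toy case the mean is far above the median, which is the signature of a genre dominated by one very large app.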

In [ ]:

for app in ios_free:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

However, the reference niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

Weather apps: people generally don't spend too much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.
Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Most Popular Apps by Genre on Google Play

In [ ]:

display_table(android_free, 5)  # Installs column

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
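
The conversion described above can be isolated in a small helper, which makes it easy to check on a few sample strings before running it over the whole data set. The name parse_installs is hypothetical; the sketch assumes the only non-numeric characters in the Installs column are commas and a trailing plus sign.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float.

    Assumes commas and '+' are the only non-numeric characters.
    """
    return float(raw.replace(',', '').replace('+', ''))


print(parse_installs('100,000+'))
print(parse_installs('1,000,000+'))
```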

In [ ]:

categories_android = freq_table(android_free,1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category,':', avg_n_installs)

Our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [ ]:

for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

If we removed all the communication apps that have over 100 million installs, the average would be roughly ten times smaller:

In [ ]:

under_100_m = []

for app in android_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    if app[1] == 'COMMUNICATION' and ( float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Let's take a look at some of the apps from the books and reference genre and their number of installs.

In [ ]:

for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])
    

In [ ]:

for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                           or app[5] == '500,000,000+'
                                           or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. A small number of extremely popular apps still skew the average.

However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [ ]:

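# Apps in the middle of the popularity range (1,000,000 to 100,000,000 installs)
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])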

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.
