
PROFITABLE APPS ON THE MARKET - A GUIDED DATA ANALYSIS PROJECT

Project Description

Data analysis is the process of exploring, cleaning and transforming data into a usable format. Most of us use a smartphone every day, and we use different apps for different purposes. In this analysis, we explore the apps on both the Apple App Store and the Google Play Store to see what types of apps are currently attracting users.

Project Goal

The goal of this analysis is to recommend the type of app that developers should work on. To arrive at that recommendation, we explore, clean, transform and analyze the data.

Defining the functions to open, read and explore the dataset

Before we start exploring, cleaning and analyzing the data, we define two functions: one that loads a dataset and one that explores it.

In [1]:

from csv import reader

def DatasetCsv(file):
    '''Read a CSV file and return its contents as a list of lists.
    file = the file name / file path of the CSV file to read
    '''
    # The with block closes the file once it has been read
    with open(file, encoding = 'utf8') as dataset:
        data = list(reader(dataset))

    return data

In [2]:

def explore_data(dataset,start,end,rows_and_columns = False):
    '''This function takes in four parameters:
    dataset = the dataset as a list of lists, containing all the rows and an optional header.
    start and end = integers marking the slice of rows to print; the row at index end is not included.
    rows_and_columns = boolean; if True, the function also prints the number of columns and rows in the dataset.
    '''
    dataset_slice = dataset[start:end]
    for i in dataset_slice:
        print(i)
    
    if rows_and_columns:
        print('')
        print('The number of column is', len(dataset[0]))
        print('The number of row is', len(dataset))
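
To see what these helpers work with without needing the CSV files, here is a minimal sketch that parses a small in-memory sample. The sample rows and the name sample_csv are made up for illustration, and io.StringIO stands in for the opened file.

```python
from csv import reader
from io import StringIO

# Hypothetical three-line CSV (a header plus two apps), used only for illustration
sample_csv = "track_name,price,prime_genre\nPAC-MAN Premium,3.99,Games\nBible,0,Reference\n"

# csv.reader accepts any iterable of lines, so StringIO can stand in for a file
sample = list(reader(StringIO(sample_csv)))

for row in sample[0:2]:   # the slice end is exclusive: this prints rows 0 and 1
    print(row)

print('The number of columns is', len(sample[0]))
print('The number of rows is', len(sample))   # the count includes the header row
```

Note that every value comes back as a string, including prices and counts, which is why later cells convert values with float() before comparing them.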
       
        

Initial Dataset Exploration

To identify the variables that will drive the decision on which type of app to build, we'll look at the header and the first five rows of each dataset.

iOS App Store

First, we will explore the iOS dataset. The source for this dataset can be found here, and it was last updated on 2018-06-05.

In [3]:

ios = DatasetCsv(r'C:\Users\Mico\OneDrive\Desktop\DATASETS\KAGGLE\APPLE STORE\AppleStore.csv') #Using the defined function DatasetCsv(file) to store the dataset of 'AppleStore.csv' to the variable 'ios'
explore_data(ios,0,6,rows_and_columns = True) #Using the defined function explore_data(dataset,start,end,rows_and_columns = False) to do an initial exploration for the 'ios' dataset.

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']

The number of column is 17
The number of row is 7198

The dataset has 7,198 rows including the header, so 7,197 apps are available. Of the 17 columns that are present (the first one is an unnamed row index), we identified the following columns as useful for this analysis:

Column	Description	Ideal Datatype
size_bytes	Size (in bytes)	Integer
price	Price amount	Float
rating_count_tot	Total number of user ratings	Integer
user_rating	Average user rating value	Float
cont_rating	Content rating	String
prime_genre	Primary genre	String
ipadSc_urls.num	Number of screenshots shown for display	Integer
lang.num	Number of supported languages	Integer

For more information about the columns: Kaggle - Mobile App Store (7200 apps)

Google Play Store

The source for the Google Play Store dataset can be found here. This dataset was last updated on 2018-09-04.

In [4]:

android = DatasetCsv(r'C:\Users\Mico\OneDrive\Desktop\DATASETS\KAGGLE\GOOGLE PLAY STORE\googleplaystore.csv')
explore_data(android,0,6,rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']

The number of column is 13
The number of row is 10842

The dataset has 10,842 rows including the header, so 10,841 apps are available. Of the 13 columns that are present, we identified the following columns as useful for this analysis:

Column	Description	Ideal Datatype
Category	Category the app belongs to	String
Rating	Average user rating value	Float
Reviews	Total number of user reviews	Integer
Size	Size of the app (in megabytes)	Float
Installs	Number of user downloads/installs for the app	Integer
Type	Paid or Free	Boolean
Content Rating	Age group the app is targeted at - Children / Mature 21+ / Adult	String
Genres	An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to	String

For more information about the columns: Kaggle - Google Play Store Apps
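
Because the CSV reader returns every value as a string, columns such as Installs and Price need converting before they can be treated as numbers. The helpers below are a minimal sketch of that conversion; the names parse_installs and parse_price are hypothetical and are not part of the notebook. They assume the formats seen in the data, such as '10,000+' for Installs and an optional leading '$' for Price.

```python
def parse_installs(value):
    """Convert an Installs string such as '10,000+' to an integer lower bound."""
    return int(value.replace(',', '').replace('+', ''))

def parse_price(value):
    """Convert a Price string such as '0' or '$4.99' to a float."""
    return float(value.replace('$', ''))

print(parse_installs('10,000+'))
print(parse_price('$4.99'))
```

The Installs values are open-ended buckets, so the converted number is only a lower bound on the real install count.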

Data Cleaning

Before we analyze further, we have to ensure that the dataset is accurate and in a usable format. Remember that our goal is to understand what types of apps attract more users. Our target audience is English-speaking users, and we will only build free apps.

In this section we are going to do the following:

Detect inaccurate data, and correct or remove it.
Detect duplicate data, and remove the duplicates.
Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
Remove apps that aren't free.

Data inaccuracy

For the Google Play dataset, a discussion on Kaggle reports that the Rating attribute is missing in entry 10472. Rating is the third column, so its index is 2. Our dataset has a header row, and we don't know whether the reported entry number counts the header, so we will inspect entries 10472 and 10473.

In [5]:

explore_data(android,10472,10474) #Note that a slice [start:end] stops at index end - 1, so this prints entries 10472 and 10473
print('Number of columns in entry 10472:',len(android[10472]),'columns')
print('Number of columns in entry 10473:',len(android[10473]),'columns')

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Number of columns in entry 10472: 13 columns
Number of columns in entry 10473: 12 columns

Comparing the two entries, we see that entry 10473 is one column short. The Rating is actually present ('1.9'); it is the Category that is missing, which shifts every later value one position to the left. The entry also contains an empty string ''; judging from the values next to it, this is the Genres attribute.

This entry would cause errors in the analysis, and we've identified Category and Genres as important attributes. We can either look up the app Life Made WI-Fi Touchscreen Photo Frame in the Google Play Store to fill in the missing data, or delete the observation. For this analysis, we will delete the observation.

In [6]:

del android[10473]

To verify that the dataset has changed, we will run the explore_data function again and check the number of columns.

In [7]:

explore_data(android,10472,10474)
print('Number of columns in entry 10472:',len(android[10472]),'columns')
print('Number of columns in entry 10473:',len(android[10473]),'columns')

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Number of columns in entry 10472: 13 columns
Number of columns in entry 10473: 13 columns
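
The faulty entry above was found through a Kaggle discussion. A more general check is to compare the length of every row with the length of the header. The function below is a minimal sketch of that idea; the name find_malformed_rows and the toy data are hypothetical.

```python
def find_malformed_rows(dataset):
    """Return (index, column_count) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(index, len(row)) for index, row in enumerate(dataset) if len(row) != expected]

toy = [
    ['App', 'Category', 'Rating'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],   # Category is missing, so the row is one column short
]
print(find_malformed_rows(toy))   # → [(2, 2)]
```

An empty list means that every row has as many columns as the header.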

Duplicate data

Now we will investigate whether the dataset has any duplicates. We identify duplicates by the App attribute (index 0), which holds the name of the app.

In [8]:

android_rows = android[1:] # Extracting the rows of android dataset
unique_android_apps = list() # List of unique applications in Google Play Store
duplicate_android_apps = list() # List of apps that has a duplicate
most_dup_app = list() # App with the most number of duplicate


# Using this for loop, we are going to populate the unique_android_apps and duplicate_android_apps

for i in android_rows:
    name = i[0]
    
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    else:
        unique_android_apps.append(name)

frequency_of_duplicate = dict() # Frequency table of the duplicates

for i in duplicate_android_apps:
    frequency_of_duplicate[i] = frequency_of_duplicate.get(i,0) + 1

# Using this for loop, we are going to identify which app has the most number of duplicates

max_duplicates = max(frequency_of_duplicate.values()) # Computed once instead of on every iteration

for i in frequency_of_duplicate:
    if frequency_of_duplicate[i] == max_duplicates:
        most_dup_app.append(i)
        
print('The unique number of apps is:',len(unique_android_apps))
print('The number of duplicate apps is:',len(duplicate_android_apps))
print('Dupicated apps examples:',duplicate_android_apps[:5])
print('The most duplicated apps is/are:', most_dup_app)

The unique number of apps is: 9659
The number of duplicate apps is: 1181
Dupicated apps examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
The most duplicated apps is/are: ['ROBLOX']

As we can see, 1,181 entries are duplicates of an app that already appears in the dataset. The app with the most duplicates is 'ROBLOX'.
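
The same counts can also be obtained with collections.Counter from the standard library, which avoids searching a list for every row. This is a sketch on made-up names, not the notebook's own method; toy_names is hypothetical.

```python
from collections import Counter

# Hypothetical app names standing in for the first column of the dataset
toy_names = ['ROBLOX', 'Box', 'ROBLOX', 'Sketch', 'Box', 'ROBLOX']

name_counts = Counter(toy_names)
# An app that appears n times contributes n - 1 duplicate entries
duplicates = {name: count - 1 for name, count in name_counts.items() if count > 1}

print('Unique apps:', len(name_counts))
print('Duplicate entries:', sum(duplicates.values()))
print('Most duplicated:', max(duplicates, key=duplicates.get))
```

On the real data, the names would come from a comprehension such as [row[0] for row in android_rows].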

We will investigate this duplicated app since we cannot remove duplicates randomly.

In [9]:

# Using this for loop, we are going to verify the data of the most_dup_app list

for i in most_dup_app:
    for x in android_rows:
        if x[0] == i: # Compare against the app name only, not every column of the row
            print(x)

['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4450855', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4450890', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4443407', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']

The Category and Reviews attributes differ between the 'ROBLOX' entries. We can use this information to define a criterion for removing duplicates. For this project, we keep the entry with the highest number of reviews among the duplicates, on the assumption that the highest review count belongs to the most recent record.

As we start removing the duplicates, we have to keep in mind that the total number of apps should be 9,659.

In [10]:

reviews_max = dict() # Maps each app name to the highest number of reviews found among its entries. Dictionary keys are unique, so every app name is stored only once.

for i in android_rows:
    name = i[0]
    reviews = float(i[3])
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

#To verify that we have the correct number of unique apps
print('The unique number of apps is:',len(reviews_max))

The unique number of apps is: 9659

We will now remove the duplicate apps by looping through the android_rows list and storing the cleaned dataset in a new list, android_clean.

In [11]:

android_clean = list()
already_added = list() # Names already added to android_clean, used to avoid adding a duplicate

for i in android_rows:
    name = i[0]
    reviews = float(i[3])
    
    
    if (name in reviews_max) and (reviews_max[name] == reviews) and (name not in already_added):
        android_clean.append(i)
        already_added.append(name) # Needed because duplicate entries of an app can share the same highest number of reviews

# To verify that all the data is unique, remember that there are 9,659 number of unique apps.
print("The number of rows in our new dataset 'android_clean' is",len(android_clean))
    

The number of rows in our new dataset 'android_clean' is 9659
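
The two passes above (finding the highest review count, then filtering) can also be folded into a single pass that stores the best row per app name. The function below is a minimal sketch under that assumption; keep_most_reviewed and the toy rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # A strict '>' keeps the first row when two duplicates tie on reviews
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy_rows = [
    ['ROBLOX', 'GAME', '4.5', '4447388'],
    ['ROBLOX', 'FAMILY', '4.5', '4450890'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['ROBLOX', 'FAMILY', '4.5', '4443407'],
]
print(keep_most_reviewed(toy_rows))
```

One difference from the two-pass version: the result is ordered by each app's first appearance rather than by the position of the row that was kept.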

Removing non-English apps

Since the app that we are going to build targets an English-speaking audience, we have to remove the apps whose names contain non-English characters. In the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text have values in the range 0 to 127. We can get the value of an individual character by passing it to the function ord(). Below, we create a function that loops through the characters of an app name. It returns False if the name contains more than three characters with a value above 127, and True otherwise. Allowing up to three such characters keeps English apps whose names include an emoji or a symbol such as ™.

In [12]:

def english(string):
    '''This function takes in one argument:
    string = the string that we check for non-English characters. If more than three characters have a value greater than 127, the function returns False; otherwise it returns True.'''
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            
    if count > 3:
        return False
    return True
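
To illustrate the threshold of three characters, the sketch below applies the same rule to a few app names. The function is_mostly_ascii is a hypothetical stand-in that mirrors the logic of english(), so the block runs on its own.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return False when name has more than max_non_ascii characters outside ASCII (0-127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜',
             '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    print(name, '->', is_mostly_ascii(name))
```

The first three names are kept, since each has at most one character above 127, while the last one is rejected. The rule is a heuristic: an English name with four or more symbols would be dropped, and a non-English name written in Latin letters would pass.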

Using the newly defined function english(), we will create a new list, non_english_app, containing the non-English apps, and apply the explore_data() function to get an overview of them.

In [13]:

non_english_app = list()
for i in android_clean:
    if not english(i[0]):
        non_english_app.append(i)
print('Non-english app example:\n')
explore_data(non_english_app,0,6,rows_and_columns = True)

Non-english app example:

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']
['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']
['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']
['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up']
['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']
['RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'FAMILY', 'NaN', '4', '64M', '1+', 'Free', '0', 'Everyone', 'Education', 'July 17, 2018', '1.0.1', '4.4 and up']

The number of column is 13
The number of row is 45

To continue with our data cleaning, we are going to remove the non_english_app rows from our dataset and create a new dataset that will be assigned to the variable english_android_app.

In [14]:

english_android_app = list()

for row in android_clean:
    if row not in non_english_app:
        english_android_app.append(row)
print('English android app:',len(english_android_app),'rows')

English android app: 9614 rows

Remember that the number of unique apps before we removed the non-English apps was 9,659, and the total number of non-English apps is 45. Since 9,659 - 45 = 9,614, we have successfully removed them from our dataset.

Extracting free apps

As we've mentioned, our focus is to build a free app whose main source of revenue will be in-app ads. So in this section, we are going to separate the free apps from the paid apps.

First we are going to identify the free apps in the english_android_app dataset.

In [15]:

free_english_app = list()

for row in english_android_app:
    free_paid = row[6]
    price = float(row[7].replace('$','')) # I added the price attribute to ensure that no Free app will have an error of having a price

    if free_paid == 'Free' and price == 0.0:
        free_english_app.append(row)

#Exploring our cleaned dataset
explore_data(free_english_app,0,10,rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']
['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up']
['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up']
['Kids Paint Free - Drawing Fun', 'ART_AND_DESIGN', '4.7', '121', '3.1M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'July 3, 2018', '2.8', '4.0.3 and up']
['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']

The number of column is 13
The number of row is 8863

After exploring the data, we now have 8,863 rows from the initial 10,841 rows.

Now that we've finished the following:

Detect inaccurate data, and correct or remove it.
Detect duplicate data, and remove the duplicates.
Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
Remove apps that aren't free.

We can say that the dataset free_english_app from Google Play Store Dataset is now cleaned.

iOS

For the iOS dataset, we're going to do the same thing that we did to the Google Play Store Dataset.

To start, while skimming the discussion on Kaggle, we found that a user named Marjan reported two duplicates in the dataset. We will investigate this report.

In our initial exploration, the iOS dataset had 7,198 rows. Excluding the header, we expect 7,197 rows.

In [16]:

ios_rows = ios[1:] # Extracting the rows of the iOS dataset
unique_ios_apps = list() # List of unique applications in the Apple App Store
duplicate_ios_apps = list() # List of apps that have a duplicate
most_dup_ios_app = list()


for i in ios_rows:
    name = i[2]

    if name in unique_ios_apps:
        duplicate_ios_apps.append(name)
    elif name not in unique_ios_apps:
        unique_ios_apps.append(name)
        
frequency_of_ios_duplicate = dict()
for i in duplicate_ios_apps:
    frequency_of_ios_duplicate[i] = frequency_of_ios_duplicate.get(i,0) + 1

print(frequency_of_ios_duplicate)

{'VR Roller Coaster': 1, 'Mannequin Challenge': 1}

As we can see, two app names are duplicated in the track_name attribute. We will investigate these two apps further.

In [17]:

for i in duplicate_ios_apps:
    for x in ios_rows:
        name = x[2]
        if i == name:
            print('Index number:',ios_rows.index(x))
            print(x)

Index number: 3319
['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
Index number: 5603
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
Index number: 7092
['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']
Index number: 7128
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

Upon further investigation, the id (index 1) and size_bytes (index 3) attributes differ within each pair, so we can conclude that these are different apps that share a name, and we will keep them.

To check the data for accuracy, we'll loop over all the rows and look for any row that does not contain 17 columns.

In [18]:

inaccurate_columns = list()
for i in ios_rows:
    if len(i) != 17:
        inaccurate_columns.append(i)
if len(inaccurate_columns) > 0:
    print(inaccurate_columns)
elif len(inaccurate_columns) == 0:
    print('All observations has 17 columns')
    

All observations has 17 columns

Since every observation has 17 columns, indexing into the rows will not raise an IndexError: list index out of range.

Now we will check and remove the non-english apps using the function english() that we defined earlier.

In [19]:

non_english_ios_app = list()

for i in ios_rows:
    name = i[2]
    
    if english(name) == False:
        non_english_ios_app.append(i)
explore_data(non_english_ios_app,0,5,rows_and_columns = True)

['80', '299853944', '新浪新闻-阅读最新时事热门头条资讯视频', '115143680', 'USD', '0', '2229', '4', '3.5', '1', '6.2.1', '17+', 'News', '37', '0', '1', '1']
['96', '303191318', '同花顺-炒股、股票', '122886144', 'USD', '0', '1744', '0', '3.5', '0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1']
['239', '331259725', '央视影音-海量央视内容高清直播', '54648832', 'USD', '0', '2070', '0', '2.5', '0', '6.2.0', '4+', 'Sports', '37', '0', '1', '1']
['268', '336141475', '优酷视频', '204959744', 'USD', '0', '4885', '0', '3.5', '0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']
['295', '340368403', 'クックパッド - No.1料理レシピ検索アプリ', '76644352', 'USD', '0', '115', '0', '3.5', '0', '17.5.1.0', '4+', 'Food & Drink', '37', '5', '1', '1']

The number of column is 17
The number of row is 1014

As we can see, the iOS dataset contains 1,014 non-English apps. We're going to filter them out.

In [20]:

english_ios_apps = list()

for i in ios_rows:
    if i not in non_english_ios_app:
        english_ios_apps.append(i)
explore_data(english_ios_apps,0,5,rows_and_columns = True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']

The number of column is 17
The number of row is 6183

Remember that the iOS dataset has 7,197 rows. By keeping only the English apps, we've removed 1,014 non-English apps, which gives us a dataset with 6,183 rows.

In [21]:

free_english_ios_apps = list()

for i in english_ios_apps:
    price = float(i[5])
    
    if price == 0.0:
        free_english_ios_apps.append(i)
print('Free english ios apps example:')
print('')
explore_data(free_english_ios_apps,100,106,rows_and_columns = True)

Free english ios apps example:

['252', '334235181', 'Trainline UK: Live Train Times, Tickets & Planner', '110198784', 'USD', '0', '248', '0', '4', '0', '22', '4+', 'Travel', '37', '4', '1', '1']
['253', '334256223', 'CBS News - Watch Free Live Breaking News', '78047232', 'USD', '0', '11691', '44', '3.5', '4.5', '3.5.1', '12+', 'News', '37', '5', '1', '1']
['255', '334503000', 'The Impossible Quiz!', '44652544', 'USD', '0', '18884', '451', '4', '4.5', '1.62', '9+', 'Entertainment', '37', '0', '1', '1']
['261', '335364882', 'Walgreens – Pharmacy, Photo, Coupons and Shopping', '169138176', 'USD', '0', '88885', '333', '4.5', '4', '6.5', '12+', 'Shopping', '37', '5', '1', '1']
['264', '335744614', 'NBA', '112074752', 'USD', '0', '43682', '19', '3.5', '2.5', '2013.4.3', '4+', 'Sports', '37', '5', '1', '1']
['266', '335875911', 'My Cycles Period and Ovulation Tracker', '77686784', 'USD', '0', '7469', '68', '3.5', '5', '5.10.3', '12+', 'Health & Fitness', '37', '0', '2', '1']

The number of column is 17
The number of row is 3222

From the 6,183 rows of the english_ios_apps dataset, we've isolated the free apps, which gives us a clean dataset with 3,222 rows.

Data Analysis

After cleaning the data, we can now proceed to our analysis. Remember that our goal is to build a free app whose main revenue will come from advertisements.

We are going to follow these steps:

Build a minimal Android version of the app, and add it to Google Play Store.
If the app has a good response from users, we develop it further.
If the app is profitable after six months, we will build an iOS version of the app and add it to the Apple Store.

In this analysis process, we are going to analyze both the market for Google Play Store and Apple store since our end goal is to release the app on both platforms.

We will begin our analysis by looking at the most common genres in both markets. We are going to use the dataset free_english_app with 8,863 rows for Android apps and free_english_ios_apps with 3,222 rows for iOS apps.

In [22]:

explore_data(free_english_app, 0,5,rows_and_columns = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']

The number of column is 13
The number of row is 8863

In [23]:

explore_data(free_english_ios_apps,0,5,rows_and_columns = True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']

The number of column is 17
The number of row is 3222

Next, we define two functions. frequency_column() builds a frequency table for a column of a dataset, with the values expressed as percentages. display_table() prints that table sorted in descending order, limited to the first numofrows entries (10 by default).

In [24]:

def frequency_column(dataset,column):
    '''This function takes two arguments:
    dataset - the dataset from which the frequency table is built
    column - the index of the column to count
    It returns a dictionary that maps each value to its share of the rows, in percent.'''
    genre_list = list()
    for i in dataset:
        genre = i[column]
        genre_list.append(genre)

    frequency_table = dict()
    
    for i in genre_list:
        frequency_table[i] = (frequency_table.get(i,0) + 1)
        
    total_value = sum(frequency_table.values())
    
    frequency_percentage = dict()
    for i in frequency_table:
        frequency_percentage[i] = round(((frequency_table[i]/total_value)*100),2)
    
    return frequency_percentage

def display_table(dataset,column,numofrows=10):
    '''This function takes three arguments:
    dataset - the dataset from which the displayed table is built
    column - the index of the column to display
    numofrows - the number of entries to print (10 by default)'''
    table = frequency_column(dataset,column)
    displayed_table = list()
    
    for x,y in table.items():
        key_and_value = (y,x)
        displayed_table.append(key_and_value)
    displayed_table_sorted = sorted(displayed_table,reverse = True)
    for x,y in displayed_table_sorted[:numofrows]:
        print(y,':',x)
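
The frequency table built by frequency_column() can also be sketched with collections.Counter from the standard library. This is only an illustration of the same idea, not a replacement for the functions above: the name frequency_percentages and the sample rows are hypothetical and are not taken from the datasets.

```python
from collections import Counter

def frequency_percentages(rows, column):
    """Return {value: percentage share} for one column of a list of rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: round(count / total * 100, 2) for value, count in counts.items()}

# Hypothetical sample rows (not from the datasets): [name, genre]
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Weather'],
    ['App D', 'Games'],
]

shares = frequency_percentages(sample_rows, 1)

# Print the largest share first
for genre, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```

Counter does the counting that the manual dictionary loop does in frequency_column(), so the function body stays short.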

In [25]:

print('Apple Store Genre: \n')
display_table(free_english_ios_apps,-5)

Apple Store Genre: 

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02

Using the functions defined above, we can see the 10 most common genres in our free_english_ios_apps dataset. More than half of the apps are games. The top 3 genres (Games, Entertainment, and Photo & Video) make up about 71% of the dataset, and all three are entertainment-oriented.
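
As a quick arithmetic check of the combined share of the three most common genres, the percentages printed above for Games, Entertainment and Photo & Video can be added up. The variable names here are only illustrative.

```python
# Shares of the three most common Apple Store genres, copied from the output above
top_three = {'Games': 58.16, 'Entertainment': 7.88, 'Photo & Video': 4.97}

combined_share = round(sum(top_three.values()), 2)
print(combined_share)
```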

In [26]:

print('Google Play Store Categories: \n')
display_table(free_english_app,1)
print('\nGoogle Play Store Genre: \n')
display_table(free_english_app,-4)

Google Play Store Categories: 

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32

Google Play Store Genre: 

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32

The Google Play Store data in free_english_app is split across more categories and genres. Most of the top entries are apps for practical use. Games rank high in the list as well, but the shares are much closer to one another than on the Apple Store, which creates a balance between entertainment apps and practical apps.

To find out which genres attract the most users, we can look at the number of installs per app. For the Google Play dataset, we can find this information in the Installs column, but it is missing from the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

In [27]:

genre_in_ios = frequency_column(free_english_ios_apps,-5)
average_per_genre = list()

for i in genre_in_ios:
    total_rating = 0
    len_rating = 0
    for x in free_english_ios_apps:
        ratingcount = int(x[6])
        if x[-5] == i:  # compare the genre column only, not every field in the row
            total_rating += ratingcount
            len_rating += 1
    average_rating = total_rating/len_rating
    addtogenre = i,average_rating
    average_per_genre.append(addtogenre)
displayed_table = list()
for x,y in average_per_genre:
    key_and_value = (y,x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table,reverse = True)
for x,y in displayed_table_sorted[:10]:
    print(y,':',round(x,2))

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8

As we can observe, navigation apps have the highest average number of user ratings on the Apple Store, followed by reference and social networking apps. Even though Games is the most common genre, it does not appear among the 10 genres with the most ratings per app.

For Google Play, we'll analyze both the genre and the category.

In [28]:

print('Average installs for genre in android:')
print('')
genre_in_android = frequency_column(free_english_app,-4)
average_per_genre = list()
for i in genre_in_android:
    total_install = 0
    len_install = 0
    for x in free_english_app:
        
        installs = int(x[5].replace('+','').replace(',',''))
        if x[-4] == i:  # compare the genre column only, not every field in the row
            total_install += installs
            len_install += 1
    average_install = total_install/len_install
    addtogenre = i,average_install
    average_per_genre.append(addtogenre)
displayed_table = list()
for x,y in average_per_genre:
    key_and_value = (y,x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table,reverse = True)
for x,y in displayed_table_sorted[:10]:
    print(y,':',round(x,2))

print('\n')
print('Average installs for categories in android:')
print('')
categories_in_android = frequency_column(free_english_app,1)
average_per_categories = list()
for i in categories_in_android:
    total_install = 0
    len_install = 0
    for x in free_english_app:
        
        installs = int(x[5].replace('+','').replace(',',''))
        if x[1] == i:  # compare the category column only, not every field in the row
            total_install += installs
            len_install += 1
    average_install = total_install/len_install
    addtogenre = i,average_install
    average_per_categories.append(addtogenre)
displayed_table = list()
for x,y in average_per_categories:
    key_and_value = (y,x)
    displayed_table.append(key_and_value)
displayed_table_sorted = sorted(displayed_table,reverse = True)
for x,y in displayed_table_sorted[:10]:
    print(y,':',round(x,2))

Average installs for genre in android:

Communication : 38456119.17
Adventure;Action & Adventure : 35333333.33
Video Players & Editors : 24947335.8
Social : 23253652.13
Arcade : 22888365.49
Casual : 19569221.6
Puzzle;Action & Adventure : 18366666.67
Photography : 17840110.4
Educational;Action & Adventure : 17016666.67
Productivity : 16787331.34


Average installs for categories in android:

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47

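For both genre and category, Communication has the highest average number of installs per app. The averages are also much larger for the Google Play Store than for the Apple Store, although the two are not directly comparable: the Google Play figures are install counts, while the Apple Store figures are rating counts used as a proxy.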
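
The averaging loops above all follow the same pattern: group the rows by one column and average a numeric column within each group. A minimal sketch of that pattern as a reusable helper is shown below. The names parse_installs and average_by_group are hypothetical, and the sample rows are made up for illustration rather than taken from the dataset.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))

def average_by_group(rows, group_column, value_column, parse=int):
    """Return {group: average of value_column} for a list of rows."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_column]
        totals[group] = totals.get(group, 0) + parse(row[value_column])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows (not from the dataset): [name, category, installs]
sample_rows = [
    ['App A', 'COMMUNICATION', '1,000,000+'],
    ['App B', 'COMMUNICATION', '500,000+'],
    ['App C', 'TOOLS', '10,000+'],
]

averages = average_by_group(sample_rows, 1, 2, parse=parse_installs)

# Print the groups with the highest average first
for category, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', round(average, 2))
```

Because the helper goes through the rows only once, it avoids rescanning the whole dataset for every genre or category. It also compares the group column directly instead of searching the entire row.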

App Recommendation¶

Since this generation is heavily engaged with social media, sharing photos, videos and clips, building an app whose output can be shared on those platforms can be a good idea. In the following section, we'll explore the Photography-related apps on the Google Play Store.

In [35]:

for i in categories_in_android:
    if i == 'PHOTOGRAPHY':
        print(i,categories_in_android[i])

PHOTOGRAPHY 2.94

About 2.94% of the free English apps in our Google Play dataset are in the Photography category.

Next, we'll list the Photography apps whose install counts are above the category average.

In [68]:

# Average number of installs for the apps in the PHOTOGRAPHY category
total_install_category = 0
len_install_category = 0
for i in free_english_app:
    category = i[1]
    install = float(i[5].replace('+','').replace(',',''))
    if category == 'PHOTOGRAPHY':
        total_install_category += install
        len_install_category += 1

average_install = total_install_category/len_install_category

# Keep only the PHOTOGRAPHY apps with more installs than that average
above_average = list()
for i in free_english_app:
    install = float(i[5].replace('+','').replace(',',''))
    category = i[1]
    if install > average_install and category == 'PHOTOGRAPHY' :
        above_average.append(i)

# Build (installs, name) pairs and sort them from most to least installed
sorted_list = list()
for i in above_average:
    install = float(i[5].replace('+','').replace(',',''))
    name = i[0]
    key_val = install,name
    sorted_list.append(key_val)
sorted(sorted_list, reverse = True)

Out[68]:

[(1000000000.0, 'Google Photos'),
 (100000000.0, 'Z Camera - Photo Editor, Beauty Selfie, Collage'),
 (100000000.0, 'YouCam Perfect - Selfie Photo Editor'),
 (100000000.0, 'YouCam Makeup - Magic Selfie Makeovers'),
 (100000000.0, 'Sweet Selfie - selfie camera, beauty cam, photo edit'),
 (100000000.0, 'S Photo Editor - Collage Maker , Photo Collage'),
 (100000000.0, 'Retrica'),
 (100000000.0, 'PicsArt Photo Studio: Collage Maker & Pic Editor'),
 (100000000.0, 'PhotoGrid: Video & Pic Collage Maker, Photo Editor'),
 (100000000.0, 'Photo Editor Pro'),
 (100000000.0, 'Photo Editor Collage Maker Pro'),
 (100000000.0, 'Photo Collage Editor'),
 (100000000.0, 'LINE Camera - Photo editor'),
 (100000000.0, 'Cymera Camera- Photo Editor, Filter,Collage,Layout'),
 (100000000.0, 'Candy Camera - selfie, beauty camera, photo editor'),
 (100000000.0, 'Camera360: Selfie Photo Editor with Funny Sticker'),
 (100000000.0, 'BeautyPlus - Easy Photo Editor & Selfie Camera'),
 (100000000.0, 'B612 - Beauty & Filter Camera'),
 (100000000.0, 'AR effect'),
 (50000000.0, 'Video Editor Music,Cut,No Crop'),
 (50000000.0, 'VSCO'),
 (50000000.0, 'Square InPic - Photo Editor & Collage Maker'),
 (50000000.0, 'Snapseed'),
 (50000000.0, 'Selfie Camera - Photo Editor & Filter & Sticker'),
 (50000000.0, 'SNOW - AR Camera'),
 (50000000.0, 'Pixlr – Free Photo Editor'),
 (50000000.0, 'Pic Collage - Photo Editor'),
 (50000000.0, 'PhotoWonder: Pro Beauty Photo Editor Collage Maker'),
 (50000000.0, 'Photo Lab Picture Editor: face effects, art frames'),
 (50000000.0, 'Photo Effects Pro'),
 (50000000.0, 'Photo Editor by Aviary'),
 (50000000.0, 'Photo Editor Selfie Camera Filter & Mirror Image'),
 (50000000.0, 'Motorola Camera'),
 (50000000.0, 'MomentCam Cartoons & Stickers'),
 (50000000.0, 'MakeupPlus - Your Own Virtual Makeup Artist'),
 (50000000.0, 'Keepsafe Photo Vault: Hide Private Photos & Videos'),
 (50000000.0, 'InstaSize Photo Filters & Collage Editor'),
 (50000000.0, 'InstaBeauty -Makeup Selfie Cam'),
 (50000000.0, 'Boomerang from Instagram'),
 (50000000.0, 'Adobe Photoshop Express:Photo Editor Collage Maker'),
 (50000000.0, 'ASUS Gallery')]
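
The selection above (average the installs within one category, then keep the apps above that average) can be sketched more compactly as a single function. The function name above_average_apps and the sample rows are hypothetical. The default column indexes are an assumption that mirrors the indexes used in this notebook (name at 0, category at 1, installs at 5).

```python
def above_average_apps(rows, category, name_col=0, category_col=1, installs_col=5):
    """Return (installs, name) pairs for the apps in `category` whose installs
    exceed the category average, sorted from most to least installed."""
    in_category = [
        (int(row[installs_col].replace('+', '').replace(',', '')), row[name_col])
        for row in rows
        if row[category_col] == category
    ]
    if not in_category:
        return []
    average = sum(installs for installs, _ in in_category) / len(in_category)
    return sorted((pair for pair in in_category if pair[0] > average), reverse=True)

# Hypothetical rows (not from the dataset): [name, category, installs]
sample_rows = [
    ['Camera A', 'PHOTOGRAPHY', '1,000,000+'],
    ['Camera B', 'PHOTOGRAPHY', '10,000+'],
    ['Editor C', 'PHOTOGRAPHY', '500,000+'],
    ['Notes D', 'PRODUCTIVITY', '5,000,000+'],
]

result = above_average_apps(sample_rows, 'PHOTOGRAPHY', installs_col=2)
print(result)
```

In the sample, the category average is about 503,333 installs, so only 'Camera A' is kept. The 'Notes D' row is ignored because it belongs to another category.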

Conclusion¶

As we can see from the results, photography apps that apply pre-editing to photos, such as selfie filters and collage makers, are among the most downloaded apps on the market. It is a good idea to investigate these kinds of apps further and build on them. Since advertising would be the main source of revenue for such an app, we can show an advertisement before the app displays the edited result. Alternatively, users can pay for a pro version of the app to get an ad-free experience. We can also do a deeper study of the other categories and apply some of their ideas to the photography app that we're going to build. Since many apps of this kind have already been released on the market, we will have to come up with new features in order to stand out.