
First Practical Project: Profitable App Profile for the App Store and Google Play Markets

This project explores how we could make a profit from free apps. The main source of revenue for a free app is in-app ads, so revenue depends on how many people use the app. If we can determine which types of apps attract the most users, we can focus on building those.

First, we load the data from the .csv files of both the App Store and the Google Play Store. We create two kinds of functions: one to load the data from a .csv file, and one to explore (print a slice of) the data.

In [187]:

def open_data(file_name = 'AppleStore.csv'):## load the App Store data from the CSV file AppleStore.csv
    ## note: file_name is not used below; the path is hard-coded
    from csv import reader
    with open('C:/Users/X1 Carbon/.jupyter/AppleStore.csv', encoding='utf8') as open_file:
        data = list(reader(open_file))
    return data

def open_data_gg(file_name = 'googleplaystore.csv'):## load the Google Play data from googleplaystore.csv
    ## note: file_name is not used below; the path is hard-coded
    from csv import reader
    with open('C:/Users/X1 Carbon/.jupyter/googleplaystore.csv', encoding='utf8') as open_file:
        data_gg = list(reader(open_file))
    return data_gg
                     
def explore_data(data, start, end, rows_and_columns=True):## to render and explore data
    data_slice = data[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(data))
        print('Number of columns:', len(data[1:][0]))
        
def explore_data_gg(data_gg, start, end, rows_and_columns=True):
    data_slice = data_gg[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(data_gg))
        print('Number of columns:', len(data_gg[1:][0]))
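The two loaders above differ only in their hard-coded path. As a minimal sketch, a single loader that actually uses its argument could look like this. The names `load_csv` and `split_header` are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
import csv

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return all rows (header included) as a list of lists."""
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f))

def split_header(rows):
    """Return (header, body) so the header never gets mixed into the data rows."""
    return rows[0], rows[1:]
```

With such a helper, the same function could serve both files by passing a different path, and the `[1:]` slicing in the explore functions would no longer be needed.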

The data is now loaded, so let's take a quick look at some rows of the App Store and Google Play Store data. In this block we can see how many columns and rows each dataset has, as well as its header.

In [188]:

data_check = open_data('AppStore.csv')
explore = explore_data(data_check,0,5,)
print('\n')
print(data_check[0])

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows: 7198
Number of columns: 17


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

In [189]:

data_check_gg = open_data_gg('googleplaystore.csv')
explore = explore_data(data_check_gg, 0, 5,)
print('\n')
print(data_check_gg[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10842
Number of columns: 13


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

For the full description of the column names, see the dataset pages:
Google Play Store: https://www.kaggle.com/lava18/google-play-store-apps
App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

To determine what kinds of apps attract users, these are the columns that play a role:
AppStore: 'rating_count_tot', 'user_rating', 'track_name', 'cont_rating', 'prime_genre', 'size_bytes', 'price'
Google Play Store: 'App', 'Category', 'Rating', 'Content Rating', 'Genres', 'Price'

Before analyzing the data, we must detect faulty rows. We start with the Google Play Store data.

To find malformed rows (missing values, etc.), we loop over the Google Play Store data and print every row whose length differs from the header's, together with its row number:

In [190]:

data_gg = open_data_gg('googleplaystore.csv')
for row in data_gg:
    if len(row) != len(data_gg[0]):
        print(row)
        print('\n')
        print('Number of row:', data_gg[1:].index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of row: 10472
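One caveat about the loop above: `list.index(row)` returns the position of the first row equal to `row`, so on data with duplicate rows it could report an earlier position. A minimal sketch that tracks the position with `enumerate` instead is shown below; `find_malformed_rows` is a hypothetical helper name, not part of the notebook.

```python
def find_malformed_rows(rows):
    """Return (index, row) pairs for data rows whose length differs from the header's.

    The index counts data rows only (header excluded), like data[1:].
    """
    expected = len(rows[0])
    return [(i, row) for i, row in enumerate(rows[1:]) if len(row) != expected]

sample = [
    ['App', 'Category', 'Rating'],
    ['Foo', 'GAME', '4.1'],
    ['Bar', '3.9'],  # Category is missing
]
print(find_malformed_rows(sample))  # [(1, ['Bar', '3.9'])]
```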

The row at index 10472 is missing its Category value. We repair it in two steps: step 1 inserts 'LIFESTYLE' at position 1 ('Category'), and step 2 inserts 'Lifestyle' at position 9 ('Genres'). After that, we fix any position that has become inconsistent. Code for step 1 and step 2:

....list[10472].insert(1, 'LIFESTYLE')
....list[10472].insert(9, 'Lifestyle')

In the printed row above, the Genres value is an empty string. After step 1 it sits at position 9, and inserting 'Lifestyle' at position 9 pushes it to position 10, where it would shift every later column one place to the right => we use the code below to delete that leftover empty string:

....list[10472].pop(10)
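To see how the positions shift, the three operations can be traced on a copy of the faulty row printed above. This is only a worked example on a standalone list, not a replacement for the notebook cell.

```python
row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
       'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

row.insert(1, 'LIFESTYLE')  # Category goes to index 1; the empty string moves from 8 to 9
row.insert(9, 'Lifestyle')  # Genres goes to index 9; the empty string moves to 10
row.pop(10)                 # drop the leftover empty string

print(len(row))             # 13, the same as the header
print(row[1], row[9], row[10])
```

After the repair the row has 13 fields again, and the date is back at index 10 ('Last Updated').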

In [191]:

data_gg = open_data_gg('googleplaystore.csv')
data_modify = data_gg[1:]
data_modify[10472].insert(1, 'LIFESTYLE')
data_modify[10472].insert(9, 'Lifestyle')
data_modify[10472].pop(10) ## delete the leftover empty string, now at position 10
## print the last rows to check that the values landed in the right positions
print(data_modify[10470:10473])

[['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up'], ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'], ['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'February 11, 2018', '1.0.19', '4.0 and up']]

Next, we check whether the data contains duplicates; with a dataset this large, duplicate entries are hard to avoid. We go through the Google Play Store data first to see which apps are duplicated. To do that, we create two lists, a unique list and a duplicate list, and fill them with a for-loop:

In [192]:

data_gg = open_data_gg('googleplaystore.csv')
unique_list = []
duplicate_list = []
for row in data_gg:
    app_name = row[0]
    if app_name in unique_list:
        duplicate_list.append(app_name)
    else:
        unique_list.append(app_name)
print('Number of duplicate app: ', len(duplicate_list))
print('\n')
print('Some of duplicate app: ', duplicate_list[:20])

Number of duplicate app:  1181


Some of duplicate app:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
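The check `app_name in unique_list` scans the whole list each time, which gets slow as the list grows. A minimal sketch of the same idea with a set, whose membership test takes roughly constant time, is shown below; `split_unique_duplicates` is a hypothetical helper name.

```python
def split_unique_duplicates(names):
    """Return (unique_names, duplicate_names) for a sequence of app names."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)  # every repeat after the first is a duplicate
        else:
            seen.add(name)
    return seen, duplicates

unique, duplicates = split_unique_duplicates(['Box', 'Slack', 'Box', 'Box'])
print(len(unique), duplicates)  # 2 ['Box', 'Box']
```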

As we can see, there are 1181 duplicate entries in total. Now we examine some of them to find out how the duplicate rows differ, and to determine which duplicates can be deleted and which cannot:

In [193]:

for row in data_gg:
    app_name = row[0]
    if app_name == 'Quick PDF Scanner + OCR FREE':
        print(row)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']

Let's try another duplicated app to see what else happens:

In [194]:

for row in data_gg:
    app_name = row[0]
    if app_name == 'HipChat - Chat Built for Teams':
        print(row)

['HipChat - Chat Built for Teams', 'BUSINESS', '3.8', '5868', '20M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 3, 2018', '3.19.005', '4.1 and up']
['HipChat - Chat Built for Teams', 'BUSINESS', '3.8', '5868', '20M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 3, 2018', '3.19.005', '4.1 and up']

In [195]:

for row in data_gg:
    app_name = row[0]
    if app_name == 'QuickBooks Accounting: Invoicing & Expenses':
        print(row)

['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']

The results above suggest two cases: the duplicate rows of an app are either completely identical (e.g. 'HipChat - Chat Built for Teams'), or they differ only in the number of reviews (e.g. 'Quick PDF Scanner + OCR FREE', with 80805 vs. 80804). This gives us a rule for removing duplicates: for each app name, keep only the row with the highest number of reviews, because a lower count was probably collected earlier. When the rows are identical, keep just one of them.

Before deleting the duplicates, we compute how many rows should remain once they are removed:

In [196]:

print('Expected length :', len(data_gg) - 1181)

Expected length : 9661

The result is 9661 because len(data_gg) also counts the header row: 10842 rows minus 1181 duplicates leaves 9660 unique apps plus 1 header row. The faulty row (index 10472) was repaired above instead of being deleted, so it is still counted among the apps.

To remove the duplicates, we build a dictionary that maps each unique app name to its highest number of reviews. The process is shown in the code below:

In [197]:

review_max = {}
for row in data_modify:
    app_name = row[0]
    n_review = float(row[3])
    if app_name in review_max and review_max[app_name] < n_review:
        ## the app is already in the dictionary AND this row has more reviews
        ## than the stored value, so update it to the higher review count
        review_max[app_name] = n_review
    if app_name not in review_max:
        ## the app is not in the dictionary yet, so add it with this row's review count
        review_max[app_name] = n_review
print(len(review_max))
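The two `if` branches above can also be collapsed into a single update. The sketch below is one possible compact form using `dict.get` and `max`; `build_review_max` and its parameters are hypothetical names, and the column positions default to the Google Play layout used in this notebook (name at index 0, Reviews at index 3).

```python
def build_review_max(rows, name_index=0, reviews_index=3):
    """Map each app name to the highest review count seen among its rows."""
    review_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # For a new name, get() falls back to n_reviews, so max() just stores it.
        review_max[name] = max(n_reviews, review_max.get(name, n_reviews))
    return review_max

sample = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['HipChat - Chat Built for Teams', 'BUSINESS', '3.8', '5868'],
]
print(build_review_max(sample))
```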

We'll use this dictionary to remove the duplicate rows as follows:
1> Create two lists: one to store the cleaned rows, and one to keep track of the app names that have already been added.
2> For each row of the Google Play Store data, compare its number of reviews with the value stored in the review_max dictionary, and at the same time check that its app name is not yet in the tracking list. If both conditions hold => append the row to the first list and the app name to the tracking list.
We already know that the data without duplicates has 9660 rows (not including the header), so after cleaning we check the length of the cleaned list to make sure the cleaning succeeded.

In [198]:

android_clean = [] ## stores the cleaned rows
already_added = [] ## tracks which app names have already been added
for row in data_modify:
    app_name = row[0]
    n_review = float(row[3])
    if app_name not in already_added and n_review == review_max[app_name]:
        android_clean.append(row)
        already_added.append(app_name)
print(len(android_clean))
print(android_clean[:10])

9660
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up'], ['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up'], ['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up'], ['Kids Paint Free - Drawing Fun', 'ART_AND_DESIGN', '4.7', '121', '3.1M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'July 3, 2018', '2.8', '4.0.3 and up'], ['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']]
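The tracking list is what keeps identical duplicates (same name, same review count) from being added twice. A minimal sketch of the same step with a set as the tracker is shown below; `keep_best_rows` is a hypothetical helper, and the small `review_max` dictionary is written by hand for the example.

```python
def keep_best_rows(rows, review_max, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row whose review count equals its maximum."""
    cleaned = []
    added = set()  # names already kept; set lookups stay fast on large data
    for row in rows:
        name = row[name_index]
        if name not in added and float(row[reviews_index]) == review_max[name]:
            cleaned.append(row)
            added.add(name)
    return cleaned

rows = [
    ['A', 'BUSINESS', '4.2', '80805'],
    ['A', 'BUSINESS', '4.2', '80805'],  # identical duplicate
    ['A', 'BUSINESS', '4.2', '80804'],  # older snapshot with fewer reviews
    ['B', 'BUSINESS', '3.8', '5868'],
]
review_max = {'A': 80805.0, 'B': 5868.0}
print(len(keep_best_rows(rows, review_max)))  # 2
```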

Now it's time to return to one of the duplicated apps detected above and see whether it was cleaned properly. As an example we take 'QuickBooks Accounting: Invoicing & Expenses', which had three rows in the original data:
...
for row in data_gg:
    app_name = row[0]
    if app_name == 'QuickBooks Accounting: Invoicing & Expenses':
        print(row)
...
Result:
...
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
...

In [199]:

for row in android_clean:
    app_name = row[0]
    if app_name == 'QuickBooks Accounting: Invoicing & Expenses':
        print(row)

['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']

The duplicates were removed successfully. Next, we check whether the app names are in English. Both the Google Play Store and the App Store contain apps with non-English names, and keeping them could lead to inaccurate results => we filter the data to keep only apps with English names. The filter relies on character codes: the characters commonly used in English text fall in the ASCII range (0 to 127), e.g. 'a' has code 97, while '爱' has code 29233 => we use Python's built-in ord() function, which returns a character's Unicode code point, to check this. Names with up to three characters outside the ASCII range are still accepted, so that names containing a symbol such as '™' or an emoji are not thrown away.

In [200]:

def isEnglish(string):
    check_string = [] ## collects the characters whose code point is above 127 (outside ASCII)
    for i in string:
        if ord(i) > 127:
            check_string.append(i)
            if len(check_string) > 3:
                return False
    return True      

In [201]:

isEnglish('Instagram')

Out[201]:

True

In [202]:

isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播')

Out[202]:

False

In [203]:

isEnglish('Docs To Go™ Free Office Suite')

Out[203]:

True

In [204]:

isEnglish('Instachat 😜')

Out[204]:

True
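The same rule can be written more compactly by counting the non-ASCII characters directly. This is a sketch with hypothetical names (`count_non_ascii`, `is_english`, `max_non_ascii`); it is meant to behave like isEnglish above for the four names tested, not to replace it.

```python
def count_non_ascii(text):
    """Number of characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

print(ord('a'))   # 97
print(ord('爱'))  # 29233
print(is_english('Docs To Go™ Free Office Suite'))  # True
```

The threshold of 3 is a heuristic: a name written in Latin letters from another language would still pass, so the filter only approximates "English".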

Now we use this function to check all the App Store data. We also use the explore_data function to print how many rows remain with English names.

In [205]:

data = open_data('AppStore.csv')
data_final = []
data_error = []
for row in data[1:]:
    app_name = row[2]
    if isEnglish(app_name) == True:
        data_final.append(row)
    else:
        data_error.append(row)

In [206]:

def explore_data(file_name, start, end, rows_and_columns = True):
    data_slice = file_name[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of row: ', len(file_name))
        print('Number of column: ', len(data_slice[0]))

In [207]:

explore_data(data_final, 1, 5,)

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1']


Number of row:  6183
Number of column:  17

Let's compare the App Store data before and after filtering by app name:
'''
Before:
Number of row: 7198 (including the header)
Number of column: 17
'''
'''
After:
Number of row: 6183
Number of column: 17
'''
Subtracting 6183 from 7197 (the 7198 rows minus the header) gives 1014, so the App Store data contains 1014 apps with non-English names. Let's look at some of them:

In [208]:

print(len(data_error))
print(data_error[1:10])

1014
[['96', '303191318', '同花顺-炒股、股票', '122886144', 'USD', '0', '1744', '0', '3.5', '0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1'], ['239', '331259725', '央视影音-海量央视内容高清直播', '54648832', 'USD', '0', '2070', '0', '2.5', '0', '6.2.0', '4+', 'Sports', '37', '0', '1', '1'], ['268', '336141475', '优酷视频', '204959744', 'USD', '0', '4885', '0', '3.5', '0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1'], ['295', '340368403', 'クックパッド - No.1料理レシピ検索アプリ', '76644352', 'USD', '0', '115', '0', '3.5', '0', '17.5.1.0', '4+', 'Food & Drink', '37', '5', '1', '1'], ['343', '351091731', '大众点评-发现品质生活', '244516864', 'USD', '0', '844', '1', '4', '1', '9.2.4', '17+', 'Lifestyle', '37', '5', '2', '1'], ['374', '356968629', 'ヤフオク! 利用者数NO.1のオークション、フリマアプリ', '187040768', 'USD', '0', '9', '0', '3', '0', '6.14.0', '17+', 'Shopping', '37', '0', '1', '1'], ['459', '370139302', 'QQ 浏览器-搜新闻、选小说漫画、看视频', '119812096', 'USD', '0', '1750', '19', '3.5', '5', '7.4.1', '17+', 'Utilities', '38', '0', '1', '1'], ['477', '373454750', '随手记（专业版）-好用的记账理财工具', '83899392', 'USD', '0.99', '1267', '0', '4.5', '0', '10.6.3', '4+', 'Finance', '38', '0', '3', '1'], ['501', '376561911', 'かなもじ', '61272064', 'USD', '3.99', '19', '0', '4.5', '0', '1.9.3', '4+', 'Education', '24', '5', '2', '1']]

To get the final version of both datasets, we clean the rest of the Google Play Store data by applying the isEnglish function to android_clean, the latest Google Play Store data, which has already gone through duplicate removal:

In [209]:

data_GG_final = []
data_error_GG = []
for row in android_clean:
    app_name = row[0]
    if isEnglish(app_name) == True:
        data_GG_final.append(row)
    else:
        data_error_GG.append(row)
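The same keep-or-reject loop is used for both stores, only with a different name column. A minimal sketch of a reusable version is shown below; `partition` is a hypothetical helper, and the predicate here is a simplified all-ASCII check rather than the isEnglish function above.

```python
def partition(rows, predicate):
    """Split rows into (kept, rejected) according to predicate(row)."""
    kept, rejected = [], []
    for row in rows:
        (kept if predicate(row) else rejected).append(row)
    return kept, rejected

rows = [['Instagram', 'SOCIAL'], ['РИА Новости', 'NEWS_AND_MAGAZINES'], ['Bible', 'BOOKS']]
kept, rejected = partition(rows, lambda row: all(ord(ch) < 128 for ch in row[0]))
print(len(kept), len(rejected))  # 2 1
```

Because every row lands in exactly one of the two lists, their lengths always add up to the input length, which is the same sanity check as 6183 + 1014 = 7197 for the App Store data.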

In [210]:

explore_data(data_GG_final, 1, 5,)

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


Number of row:  9615
Number of column:  13

As with the App Store, let's check how many non-English apps we have here, and see how much our function affects the dataset:

In [211]:

print(len(data_error_GG))
print(data_error_GG[:5])

45
[['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up'], ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'], ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'], ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']]

As the result shows, 45 apps have non-English names here. One case stands out: 'သိင်္ Astrology - Min Thein Kha BayDin'. Most of this name is written in Latin letters, but it begins with several non-ASCII characters (they appear to be Burmese script), which is why the function filters it out. I can't tell what 'Min Thein Kha BayDin' means, so I can't say whether this app is really aimed at English speakers or whether the name is simply meant to draw attention in the Play Store.

Now we have the final datasets from both the App Store and the Google Play Store. It's time to isolate the free apps and answer the question: what kinds of apps do users like?

In [212]:

pre_analysis_data = []
rest_data = []
for row in data_final:
    Price_check = row[5]
    if Price_check == '0':## condition to isolate all free-app
        pre_analysis_data.append(row)
    elif Price_check != '0':
        rest_data.append(row)

In [213]:

pre_analysis_data_GG = []
rest_GG_data = []
for row in data_GG_final:
    Price_check_GG = row[7]
    if Price_check_GG == '0':
        pre_analysis_data_GG.append(row)
    else:
        rest_GG_data.append(row)
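Both loops compare the price with the exact string '0'. In the rows printed in this notebook, paid Google Play apps carry prices such as '$4.99' while free apps carry '0'. As a hedged sketch, a numeric comparison would also tolerate a variant like '0.0' (an assumption; that format does not appear in the output above). `parse_price` and `is_free` are hypothetical helper names.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.99' or '$4.99' to a float."""
    return float(text.strip().lstrip('$'))

def is_free(text):
    """True when the price string represents zero."""
    return parse_price(text) == 0.0

print(parse_price('$4.99'))            # 4.99
print(is_free('0'), is_free('3.99'))   # True False
```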

We explore the App Store data first, to check that the final data contains only free apps:

In [214]:

explore_data(pre_analysis_data,1,5,)
print('\n')
print(len(rest_data)) ## All must-pay app number will appear at here
print(rest_data[:5])

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']


['8', '284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0', '1126879', '3594', '4', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of row:  3222
Number of column:  17


2961
[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1'], ['9', '284666222', 'PCalc - The Best Calculator', '49250304', 'USD', '9.99', '1117', '4', '4.5', '5', '3.6.6', '4+', 'Utilities', '37', '5', '1', '1'], ['10', '284736660', 'Ms. PAC-MAN', '70023168', 'USD', '3.99', '7885', '40', '4', '4', '4.0.4', '4+', 'Games', '38', '0', '10', '1'], ['11', '284791396', 'Solitaire by MobilityWare', '49618944', 'USD', '4.99', '76720', '4017', '4.5', '4.5', '4.10.1', '4+', 'Games', '38', '4', '11', '1']]

Next is the Google Play Store data. Similarly, we explore the final data and see how many free apps and how many paid apps we have:

In [215]:

explore_data(pre_analysis_data_GG,1,5,)
print('\n')
print(len(rest_GG_data)) ## Must-pay app number will appear here
print(rest_GG_data[:5])

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


Number of row:  8865
Number of column:  13


750
[['TurboScan: scan documents and receipts in PDF', 'BUSINESS', '4.7', '11442', '6.8M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'March 25, 2018', '1.5.2', '4.0 and up'], ['Tiny Scanner Pro: PDF Doc Scan', 'BUSINESS', '4.8', '10295', '39M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'April 11, 2017', '3.4.6', '3.0 and up'], ['Puffin Browser Pro', 'COMMUNICATION', '4.0', '18247', 'Varies with device', '100,000+', 'Paid', '$3.99', 'Everyone', 'Communication', 'July 5, 2018', '7.5.3.20547', '4.1 and up'], ['Truth or Dare Pro', 'DATING', 'NaN', '0', '20M', '50+', 'Paid', '$1.49', 'Teen', 'Dating', 'September 1, 2017', '1.0', '4.0 and up'], ['Private Dating, Hide App- Blue for PrivacyHider', 'DATING', 'NaN', '0', '18k', '100+', 'Paid', '$2.99', 'Everyone', 'Dating', 'July 25, 2017', '1.0.1', '4.0 and up']]

After the cleaning process, we have two final datasets, pre_analysis_data (App Store) and pre_analysis_data_GG (Google Play Store), which contain only free apps with English names and no faulty or duplicate rows. Now we're ready to analyze these datasets.

Since our revenue comes from users (the more users we have, the more revenue we get), we have to determine the kinds of apps that are likely to attract more users.

Normally, when an app is developed, it goes through three steps:

Build a minimal Android version of the app, and add it to Google Play.
If the app gets a good response from users, we develop it further.
If the app is profitable after 6 months, we build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app on both Google Play and the App Store, we need to find the app profiles that are successful in both markets. For those reasons, we're going to begin the analysis process by determining the most common genres for each market. Let's review both GooglePlay Store and AppStore header:

AppStore: ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Google Play Store: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Our analysis targets genres, so in each store we focus on these columns:
'''
AppStore: 'prime_genre'
'''
'''
Google Play Store: 'Category', 'Genres'
'''

Below, we take the two final datasets and create a frequency table of genres:

In [216]:

def freq_table(dataset, index):
    freq_tab = {}
    for element in dataset:
        check_con = element[index]
        if check_con in freq_tab:
            freq_tab[check_con] += 1
        else:
            freq_tab[check_con] = 1
    summarize = sum(freq_tab.values()) ## total number of rows counted
    freq_tab_perc = {}
    for key in freq_tab:
        perc = round((freq_tab[key]/summarize)*100,2)
        freq_tab_perc[key] = perc
    return freq_tab_perc
## We want to display the frequencies in descending order, so we use the built-in sorted() function.
## sorted() applied to a dictionary only sorts its keys, so we first convert
## the frequency dictionary into a list of (value, key) tuples
def display_table(freq_tab_perc):
    table_display = [] ## the frequency table as a list of (value, key) tuples
    for key in freq_tab_perc:
        table_display.append((freq_tab_perc[key], key))
    table_sorted = sorted(table_display, reverse = True)
    for value, key in table_sorted:
        ## print each row as Genre : Value %
        print(key,':',value,'%')
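For comparison, the counting step can also be done with `collections.Counter` from the standard library. The sketch below is an alternative form, not the notebook's own function; `freq_table_counter` is a hypothetical name.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at position index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Weather': 25.0}
```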

This is the frequency of app types based on 'prime_genre' in the App Store data ('prime_genre' is at index 12 in the App Store dataset):

In [217]:

display_table(freq_table(pre_analysis_data, 12))

Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %

Based on the result, we can draw some conclusions:

The most common genre is Games (58.16%), followed by Entertainment (7.88%) => The market is crowded with game apps, so users have many options to choose from => We have to keep this in mind when deciding our scope.
Among free apps, genres such as Weather and Food & Drink are rare, so we can probably skip them.
Free apps in entertainment-oriented genres seem to attract users the most.
According to the result, Games has the largest share of apps, but this happens because many publishers release many game apps, and because user demand on iOS devices is mostly for entertainment. This frequency table contains no rating information for the apps in each genre, so it gives us no overview of how well apps in the Games genre actually perform.

=> We need to determine the number of installs for each genre to understand what users actually prefer.
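As a rough sketch of that next step for the Google Play data: the 'Installs' column (index 5) holds strings such as '10,000+', which have to be converted to numbers before averaging per category. `parse_installs` and `average_installs` are hypothetical helpers; note that a value like '10,000+' is only a lower bound, so the averages are approximate. The App Store data shown above has no installs column, so a different column would have to stand in there.

```python
def parse_installs(text):
    """Convert an Installs string such as '10,000+' to an integer (10000)."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs(rows, category_index=1, installs_index=5):
    """Return {category: average installs} over the given rows."""
    totals, counts = {}, {}
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

sample = [
    ['A', 'GAME', '4.0', '10', '1M', '10,000+'],
    ['B', 'GAME', '4.0', '10', '1M', '30,000+'],
    ['C', 'TOOLS', '4.0', '10', '1M', '500+'],
]
print(average_installs(sample))  # {'GAME': 20000.0, 'TOOLS': 500.0}
```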

Below are the frequency tables of app types in GooglePlayStore: the first is based on 'Category' (index 1), the second on 'Genres' (index 9).

In [218]:

display_table(freq_table(pre_analysis_data_GG, 1))

FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.91 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.24 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.92 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %

In [219]:

display_table(freq_table(pre_analysis_data_GG, 9))

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Lifestyle : 3.9 %
Productivity : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.24 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.92 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
Racing;Action & Adventure : 0.17 %
Puzzle;Brain Games : 0.17 %
Entertainment;Music & Video : 0.17 %
Casual;Brain Games : 0.14 %
Casual;Action & Adventure : 0.14 %
Arcade;Action & Adventure : 0.12 %
Action;Action & Adventure : 0.1 %
Educational;Pretend Play : 0.09 %
Simulation;Action & Adventure : 0.08 %
Parenting;Education : 0.08 %
Entertainment;Brain Games : 0.08 %
Board;Brain Games : 0.08 %
Parenting;Music & Video : 0.07 %
Educational;Brain Games : 0.07 %
Casual;Creativity : 0.07 %
Art & Design;Creativity : 0.07 %
Education;Pretend Play : 0.06 %
Role Playing;Pretend Play : 0.05 %
Education;Creativity : 0.05 %
Role Playing;Action & Adventure : 0.03 %
Puzzle;Action & Adventure : 0.03 %
Entertainment;Creativity : 0.03 %
Entertainment;Action & Adventure : 0.03 %
Educational;Creativity : 0.03 %
Educational;Action & Adventure : 0.03 %
Education;Music & Video : 0.03 %
Education;Brain Games : 0.03 %
Education;Action & Adventure : 0.03 %
Adventure;Action & Adventure : 0.03 %
Video Players & Editors;Music & Video : 0.02 %
Sports;Action & Adventure : 0.02 %
Simulation;Pretend Play : 0.02 %
Puzzle;Creativity : 0.02 %
Music;Music & Video : 0.02 %
Entertainment;Pretend Play : 0.02 %
Casual;Education : 0.02 %
Board;Action & Adventure : 0.02 %
Video Players & Editors;Creativity : 0.01 %
Trivia;Education : 0.01 %
Travel & Local;Action & Adventure : 0.01 %
Tools;Education : 0.01 %
Strategy;Education : 0.01 %
Strategy;Creativity : 0.01 %
Strategy;Action & Adventure : 0.01 %
Simulation;Education : 0.01 %
Role Playing;Brain Games : 0.01 %
Racing;Pretend Play : 0.01 %
Puzzle;Education : 0.01 %
Parenting;Brain Games : 0.01 %
Music & Audio;Music & Video : 0.01 %
Lifestyle;Pretend Play : 0.01 %
Lifestyle;Education : 0.01 %
Health & Fitness;Education : 0.01 %
Health & Fitness;Action & Adventure : 0.01 %
Entertainment;Education : 0.01 %
Communication;Creativity : 0.01 %
Comics;Creativity : 0.01 %
Casual;Music & Video : 0.01 %
Card;Action & Adventure : 0.01 %
Books & Reference;Education : 0.01 %
Art & Design;Pretend Play : 0.01 %
Art & Design;Action & Adventure : 0.01 %
Arcade;Pretend Play : 0.01 %
Adventure;Education : 0.01 %

In the first frequency table ('Category'), the top three categories show a different distribution: FAMILY : 18.91 %, GAME : 9.72 %, TOOLS : 8.46 %. Family, not Game, comes first, so we can say that the apps released on GooglePlayStore are much more balanced than on AppStore (where Games is 58.16%).
The balanced distribution is even clearer in the 'Genres' frequency table: Tools : 8.45 %, Entertainment : 6.07 %, Education : 5.35 %, Business : 4.59 %, Lifestyle : 3.9 %, ..., Puzzle : 1.13 %, Racing : 0.99 %, Role Playing : 0.94 %, Libraries & Demo : 0.94 %, Auto & Vehicles : 0.92 %, Strategy : 0.91 %, ... Game genres are split into many sub-genres, each with a small share that is often below 1%.

Now that we have looked through both datasets, we want to know the average number of installs for each app genre in order to determine which genres are the most popular. For GooglePlayStore, we can find this in the Installs column. For AppStore, we take the total number of user ratings ('rating_count_tot') as a proxy and work with that information.

We will process AppStore first, following the three steps below:

Isolate the apps of each genre
Add up the number of user ratings for the apps of that genre
Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

We start from the table of unique genres produced by freq_table above and create a nested loop: the outer loop iterates over the unique genres, and the inner loop iterates over the cleaned data (pre_analysis_data) to find the apps that belong to each genre.
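The nested loop scans the whole dataset once per genre. As a minimal sketch of an alternative (not the approach used in this notebook), the same per-group average can be collected in a single pass with two dictionaries. The helper name `average_by_group` and the rows in `sample_apps` are hypothetical and made up for illustration; they are not taken from the real dataset.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: average of the value column}, computed in one pass."""
    totals = defaultdict(float)   # running sum of values per group
    counts = defaultdict(int)     # number of rows seen per group
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows in the form [genre, rating count]; for illustration only.
sample_apps = [
    ['Games', '100'],
    ['Games', '300'],
    ['Navigation', '500'],
]
averages = average_by_group(sample_apps, group_index=0, value_index=1)
print(averages)
```

With the real data, the two index arguments would have to point at the genre column and the rating-count column of the cleaned dataset.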

Before calculating the average, we create a function to sort and display the result, so we can quickly check the top three genres with the highest number of installs.

In [220]:

def display_table(table):
    ## Build (value, key) tuples so that sorting orders the entries by value first
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    ## Print every entry as Key : Value, from the highest value to the lowest
    for entry in table_sorted:
        print(entry[1],':',entry[0],'times')
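As a hedged alternative to building (value, key) tuples by hand, a dictionary can also be sorted by value with the `key` argument of the built-in `sorted`. The helper name `sorted_by_value` and the values in `toy_table` are hypothetical, chosen only for illustration.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy values for illustration only.
toy_table = {'Games': 3, 'Music': 7, 'Weather': 5}
for key, value in sorted_by_value(toy_table):
    print(key, ':', value)
```

One difference to keep in mind: when two entries share the same value, this sketch keeps them in dictionary order, while the tuple-based version falls back to comparing the keys.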

In [221]:

unique_generes = freq_table(pre_analysis_data, 12)
check_freq = {} ## Stores the average per genre so that display_table (defined above) can print it in descending order
for genres in unique_generes:
    total = 0 # Sum of the user rating counts for the current genre
    len_genre = 0 # Number of apps in the current genre
    for app in pre_analysis_data:
        genre_app = app[12]
        if genre_app == genres:
            len_genre += 1
            total += int(app[6])
    average_user_rating = round(total / len_genre,0)
    check_freq[genres] = average_user_rating

In [222]:

display_table(check_freq)

Navigation : 86090.0 times
Reference : 74942.0 times
Social Networking : 71548.0 times
Music : 57327.0 times
Weather : 52280.0 times
Book : 39758.0 times
Food & Drink : 33334.0 times
Finance : 31468.0 times
Photo & Video : 28442.0 times
Travel : 28244.0 times
Shopping : 26920.0 times
Health & Fitness : 23298.0 times
Sports : 23009.0 times
Games : 22789.0 times
News : 21248.0 times
Productivity : 21028.0 times
Utilities : 18684.0 times
Lifestyle : 16486.0 times
Entertainment : 14030.0 times
Business : 7491.0 times
Education : 7004.0 times
Catalogs : 4004.0 times
Medical : 612.0 times

Looking at this distribution, recall that the genre frequency table above did not include any measure of installs (ratings) per genre, so it could not tell us whether Games is the most popular genre among users. Games is the most common genre (58.16%), but its average is only 22789 ratings per app, about 3.8 times less than the highest average, Navigation (86090). These results also suggest that user demand for navigation is higher than for the other genres. A plausible reason is that smartphones have largely taken over the role of dedicated navigation devices, which usually charge fees for use and for map updates, whereas navigation is easy with a GPS-equipped smartphone and a free navigation app.

Similarly, we calculate the average number of installs per category in the GooglePlayStore data. First, we render a frequency table of the Installs column (index 5) with the freq_table() function we created. Note that the values below are percentages, even though the redefined display_table labels them as 'times':

In [223]:

display_table(freq_table(pre_analysis_data_GG, 5))

1,000,000+ : 15.72 times
100,000+ : 11.55 times
10,000,000+ : 10.55 times
10,000+ : 10.2 times
1,000+ : 8.4 times
100+ : 6.91 times
5,000,000+ : 6.82 times
500,000+ : 5.56 times
50,000+ : 4.77 times
5,000+ : 4.51 times
10+ : 3.54 times
500+ : 3.25 times
50,000,000+ : 2.3 times
100,000,000+ : 2.13 times
50+ : 1.92 times
5+ : 0.79 times
1+ : 0.51 times
500,000,000+ : 0.27 times
1,000,000,000+ : 0.23 times
0+ : 0.05 times
0 : 0.01 times

The install numbers in GooglePlayStore are not exact: they are summarized as open-ended buckets such as '1,000,000+'. We want plain numbers without the ',' and '+' characters, so we clean each value with the str.replace(old, new) method inside the nested loop below:
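The cleaning step can be sketched in isolation as follows. The helper name `parse_installs` is hypothetical, and it converts to `int` rather than the `float` used in this notebook; the labels are taken from the frequency table above.

```python
def parse_installs(raw):
    """Convert an install label such as '1,000,000+' to an integer."""
    cleaned = raw.replace('+', '').replace(',', '')  # drop '+' and thousands separators
    return int(cleaned)

for label in ['1,000,000+', '500+', '0']:
    print(label, '->', parse_installs(label))
```

Because each label is only a lower bound ('1,000,000+' means at least one million installs), any average built from these numbers is an approximation.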

In [224]:

unique_genre = freq_table(pre_analysis_data_GG, 1)
freq_install = {}
for category in unique_genre:
    total = 0 # Sum of installs for the current category
    c_category = 0 # Number of apps in the current category
    for app in pre_analysis_data_GG:
        app_category = app[1]
        if app_category == category:
            install_num = app[5].replace('+', '')## remove the trailing '+'
            install_num2 = install_num.replace(',', '')## remove the ',' thousands separators
            total += float(install_num2)
            c_category += 1
    average_install_num = round(total/ c_category,0)
    freq_install[category] = average_install_num

In [225]:

display_table(freq_install)

COMMUNICATION : 38456119.0 times
VIDEO_PLAYERS : 24727872.0 times
SOCIAL : 23253652.0 times
PHOTOGRAPHY : 17840110.0 times
PRODUCTIVITY : 16787331.0 times
GAME : 15588016.0 times
TRAVEL_AND_LOCAL : 13984078.0 times
ENTERTAINMENT : 11640706.0 times
TOOLS : 10801391.0 times
NEWS_AND_MAGAZINES : 9549178.0 times
BOOKS_AND_REFERENCE : 8767812.0 times
SHOPPING : 7036877.0 times
PERSONALIZATION : 5201483.0 times
WEATHER : 5074486.0 times
HEALTH_AND_FITNESS : 4188822.0 times
MAPS_AND_NAVIGATION : 4056942.0 times
FAMILY : 3695642.0 times
SPORTS : 3638640.0 times
ART_AND_DESIGN : 1986335.0 times
FOOD_AND_DRINK : 1924898.0 times
EDUCATION : 1833495.0 times
BUSINESS : 1712290.0 times
LIFESTYLE : 1433676.0 times
FINANCE : 1387692.0 times
HOUSE_AND_HOME : 1331541.0 times
DATING : 854029.0 times
COMICS : 817657.0 times
AUTO_AND_VEHICLES : 647318.0 times
LIBRARIES_AND_DEMO : 638504.0 times
PARENTING : 542604.0 times
BEAUTY : 513152.0 times
EVENTS : 253542.0 times
MEDICAL : 120551.0 times

From the result, the top three categories most likely to earn us ad revenue are COMMUNICATION (38456119 installs), VIDEO_PLAYERS (24727872 installs), and SOCIAL (23253652 installs). As in the AppStore data, GAME is not the most popular category, so we can keep games as a secondary option in our development plan.