First, we load the data from the .csv files of both the App Store and the Google Play Store. We create two kinds of function: one to load data from a .csv file, and another to explore (look through) some of the data.
def open_data(file_name = 'AppleStore.csv'):
    ## load data from the CSV file AppleStore.csv
    ## note: the path below is hard-coded, so file_name is not actually used
    from csv import reader
    open_file = open('C:/Users/X1 Carbon/.jupyter/AppleStore.csv', encoding='utf8')
    read_file = reader(open_file)
    data = list(read_file)
    open_file.close()
    return data

def open_data_gg(file_name = 'googleplaystore.csv'):
    ## same as open_data, but for googleplaystore.csv (path is hard-coded as well)
    from csv import reader
    open_file = open('C:/Users/X1 Carbon/.jupyter/googleplaystore.csv', encoding='utf8')
    read_file = reader(open_file)
    data_gg = list(read_file)
    open_file.close()
    return data_gg
def explore_data(data, start, end, rows_and_columns=True):
    ## print a slice of the data rows (header skipped) and, optionally, the dataset size
    data_slice = data[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(data))
        print('Number of columns:', len(data[1:][0]))

def explore_data_gg(data_gg, start, end, rows_and_columns=True):
    data_slice = data_gg[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(data_gg))
        print('Number of columns:', len(data_gg[1:][0]))
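The two loaders above differ only in the path they open. As a rough sketch, assuming the caller passes the path of the file (the name `load_csv` and the header/rows split are illustrative and not part of the original notebook), a single loader could look like this:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: read a CSV file and return (header, rows)."""
    # The with-statement closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    # The first row is assumed to be the header.
    return data[0], data[1:]
```

With this, something like `header, rows = load_csv('AppleStore.csv')` would replace both functions, provided the file is in the working directory.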
The data is now loaded, so let's take a quick look at some rows of the App Store and Google Play Store datasets. This block shows how many rows and columns each dataset has, along with its header.
data_check = open_data('AppleStore.csv')
explore_data(data_check, 0, 5)
print('\n')
print(data_check[0])
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'] Number of rows: 7198 Number of columns: 17 ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
data_check_gg = open_data_gg('googleplaystore.csv')
explore_data(data_check_gg, 0, 5)
print('\n')
print(data_check_gg[0])
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] Number of rows: 10842 Number of columns: 13 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
For full descriptions of the column names, see the dataset pages: [Google Play Store][1] and [App Store][2].

[1]: https://www.kaggle.com/lava18/google-play-store-apps
[2]: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

To determine what kind of app attracts users, these are the columns that play a role:
AppStore: 'rating_count_tot', 'user_rating', 'track_name', 'cont_rating', 'prime_genre', 'size_bytes', 'price'
Google Play Store: 'App', 'Category', 'Rating', 'Content Rating', 'Genres', 'Price'
Before analyzing the data, we must detect the erroneous rows. We start with the Google Play Store data.
We look for malformed rows (missing values, etc.) by looping over the whole Google Play Store dataset and printing every row whose length differs from the header's, together with its row number:
data_gg = open_data_gg('googleplaystore.csv')
for row in data_gg:
    if len(row) != len(data_gg[0]):
        print(row)
        print('\n')
        print('Number of row:', data_gg[1:].index(row))
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] Number of row: 10472
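The same check can be sketched with `enumerate`, which yields the row index directly instead of searching for it with `index()`. The function name `find_bad_rows` and the toy `sample` data are assumptions made for illustration only:

```python
def find_bad_rows(dataset):
    """Return (index, row) pairs for rows whose length differs from the header's.

    Assumes dataset[0] is the header. The index counts data rows from 0,
    like data_gg[1:].index(row) in the loop above.
    """
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:]) if len(row) != expected]

# Toy data: the second data row is missing its Category value.
sample = [['App', 'Category', 'Rating'],
          ['A', 'TOOLS', '4.1'],
          ['B', '1.9'],
          ['C', 'GAME', '3.9']]
print(find_bad_rows(sample))  # [(1, ['B', '1.9'])]
```

Unlike `index()`, which returns the position of the first equal row, this reports the position of each malformed row even if two rows have identical content.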
The row at index 10472 is missing its Category value. We repair it in two steps: step 1 inserts 'LIFESTYLE' at position 1 ('Category'), and step 2 inserts 'Lifestyle' at position 9 ('Genres'). After that, we fix any position that has become inconsistent. Code for step 1 and step 2:
....list[10472].insert(1, 'LIFESTYLE')
....list[10472].insert(9, 'Lifestyle')
In the printed row above, the Genres field is present but empty (''). After step 1 that empty string sits at position 9, so inserting 'Lifestyle' at position 9 pushes it one place to the right, to position 10, where it becomes another error. We therefore delete that position with the code below:
....list[10472].pop(10)
data_gg = open_data_gg('googleplaystore.csv')
data_modify = data_gg[1:]
data_modify[10472].insert(1, 'LIFESTYLE')
data_modify[10472].insert(9, 'Lifestyle')
data_modify[10472].pop(10)  ## delete the empty string left at position 10
## print the rows to check that the values were added at the right positions
print(data_modify[10470:10473])
[['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up'], ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'], ['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'February 11, 2018', '1.0.19', '4.0 and up']]
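To make the position shifts easier to follow, here is the same repair applied to a standalone copy of the malformed row as it was printed earlier. The variable names `row` and `removed` are illustrative only:

```python
# The malformed row exactly as printed earlier (12 fields instead of 13).
row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
       'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

row.insert(1, 'LIFESTYLE')   # Category goes to position 1; '' moves from 8 to 9
row.insert(9, 'Lifestyle')   # Genres goes to position 9; '' moves to 10
removed = row.pop(10)        # drop the leftover empty string

print(len(row), repr(removed))  # 13 ''
```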
Next, we check whether the data contains duplicates, because duplicate entries are hard to avoid in a large dataset. We scan the whole Google Play Store dataset first to see which apps are duplicated. To do that, we create two lists, a unique list and a duplicate list, and fill them with a for-loop:
data_gg = open_data_gg('googleplaystore.csv')
unique_list = []
duplicate_list = []
for row in data_gg:
    app_name = row[0]
    if app_name in unique_list:
        duplicate_list.append(app_name)
    else:
        unique_list.append(app_name)
print('Number of duplicate app: ', len(duplicate_list))
print('\n')
print('Some of duplicate app: ', duplicate_list[:20])
Number of duplicate app: 1181 Some of duplicate app: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
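Checking `app_name in unique_list` scans the whole list each time. A possible variant, sketched here on made-up names, tracks the names already seen in a set, where membership tests are fast. The function name `find_duplicates` is an assumption, not part of the notebook:

```python
def find_duplicates(names):
    """Return every name that repeats an earlier one, in order of appearance."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates

names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']
print(find_duplicates(names))  # ['Box', 'Box', 'Slack']
```

As in the loop above, an app that appears three times contributes two entries to the duplicate list.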
As we can see, there are 1181 duplicate entries in total. Now we explore one of these apps to find out how its duplicates differ, and to determine which duplicates can be deleted and which cannot:
for row in data_gg:
    app_name = row[0]
    if app_name == 'Quick PDF Scanner + OCR FREE':
        print(row)
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'] ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'] ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
Let's try other duplicated apps to see what else happens:
for row in data_gg:
    app_name = row[0]
    if app_name == 'HipChat - Chat Built for Teams':
        print(row)
['HipChat - Chat Built for Teams', 'BUSINESS', '3.8', '5868', '20M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 3, 2018', '3.19.005', '4.1 and up'] ['HipChat - Chat Built for Teams', 'BUSINESS', '3.8', '5868', '20M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 3, 2018', '3.19.005', '4.1 and up']
for row in data_gg:
    app_name = row[0]
    if app_name == 'QuickBooks Accounting: Invoicing & Expenses':
        print(row)
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up'] ['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up'] ['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
The results above suggest two things. Some app names simply appear more than once (e.g. Google My Business), and the entries of one app are either completely identical or differ only in the number of reviews (80805 vs. 80804 for 'Quick PDF Scanner + OCR FREE'; the Rating is the same). This gives us two possible rules for deleting duplicates: delete repeated app names arbitrarily, or, for each duplicated app, keep only the entry with the highest number of reviews (the entry with fewer reviews was probably collected earlier).
Before deleting the duplicate apps, we compute how many rows should remain once the duplicates are removed:
print('Expected length :', len(data_gg) - 1181)
Expected length : 9661
The expected length of 9661 still counts the header row, so we expect 9660 unique apps. The malformed row found above (row 10472) is part of that total, because we repaired it instead of deleting it.
To remove the duplicates, we build a dictionary that maps each unique app name to its highest number of reviews. The process is shown in the code below:
review_max = {}
for row in data_modify:
    app_name = row[0]
    n_review = float(row[3])
    if app_name in review_max and review_max[app_name] < n_review:
        ## the app name is already in the dictionary, but this row has more reviews,
        ## so update the stored maximum
        review_max[app_name] = n_review
    elif app_name not in review_max:
        ## the app name is not in the dictionary yet, so add it with this row's number of reviews
        review_max[app_name] = n_review
print(len(review_max))
9660
We use this dictionary to remove the duplicate rows as follows:
1> Create two lists: one to store the cleaned rows (android_clean), and one to keep track of the app names that have already been added (already_added).
2> For each row of the Google Play Store data, compare its number of reviews with the maximum stored in review_max, and check at the same time that the app name is not yet in the tracking list. If both conditions hold, add the row to the first list and the app name to the tracking list. The second check is needed because several rows of the same app can share the same maximum number of reviews.
We already know that the data contains 9660 apps without duplicates (header not included), so after cleaning we check the length of the cleaned list to make sure the process worked.
android_clean = []  ## stores the cleaned rows
already_added = []  ## tracks the app names that have already been added
for row in data_modify:
    app_name = row[0]
    n_review = float(row[3])
    if app_name not in already_added and n_review == review_max[app_name]:
        android_clean.append(row)
        already_added.append(app_name)
print(len(android_clean))
print(android_clean[:10])
9660 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up'], ['Infinite Painter', 'ART_AND_DESIGN', '4.1', '36815', '29M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 14, 2018', '6.1.61.1', '4.2 and up'], ['Garden Coloring Book', 'ART_AND_DESIGN', '4.4', '13791', '33M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'September 20, 2017', '2.9.2', '3.0 and up'], ['Kids Paint Free - Drawing Fun', 'ART_AND_DESIGN', '4.7', '121', '3.1M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'July 3, 2018', '2.8', '4.0.3 and up'], ['Text on Photo - Fonteee', 'ART_AND_DESIGN', '4.4', '13880', '28M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'October 27, 2017', '1.0.4', '4.1 and up']]
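The two passes above (build the dictionary of maximum review counts, then filter) can also be combined into one. The following is only a sketch under the same rule, on shortened toy rows; `dedupe_by_reviews` and its parameters are hypothetical names:

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows with only four columns: App, Category, Rating, Reviews.
rows = [['Quick PDF', 'BUSINESS', '4.2', '80804'],
        ['Box', 'BUSINESS', '4.2', '159'],
        ['Quick PDF', 'BUSINESS', '4.2', '80805'],
        ['Quick PDF', 'BUSINESS', '4.2', '80805']]
clean = dedupe_by_reviews(rows)
print(len(clean))  # 2
```

The result contains the same rows as the two-pass approach, although their order can differ, since each app keeps the position at which its name was first seen.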
Now it's time to return to one of the duplicates detected earlier and see whether it was cleaned properly. Our example is 'QuickBooks Accounting: Invoicing & Expenses', which appeared three times in the raw data (data_gg) with identical rows. Running the same check on android_clean should print a single row:
for row in android_clean:
    app_name = row[0]
    if app_name == 'QuickBooks Accounting: Invoicing & Expenses':
        print(row)
['QuickBooks Accounting: Invoicing & Expenses', 'BUSINESS', '4.3', '23175', '41M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 13, 2018', '18.7', '4.1 and up']
The cleaning worked. Next, we check whether the app names are in English. Both the Google Play Store and the App Store contain apps whose names are not in English, and keeping them in our data could lead to inaccurate results, so we filter the data to keep only apps with English names. The filter relies on character codes: 'a' has the code point 97, while '爱' has the code point 29233, and the characters commonly used in English text all fall in the ASCII range (0 to 127). We use Python's built-in ord() function to get the code point of each character. Because some English names contain a few characters outside that range (such as ™ or an emoji), a name is treated as non-English only if it has more than three such characters.
def isEnglish(string):
    check_string = []  ## stores the characters whose code point is above 127 (non-ASCII)
    for i in string:
        if ord(i) > 127:
            check_string.append(i)
    if len(check_string) > 3:
        return False
    return True
isEnglish('Instagram')
True
isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播')
False
isEnglish('Docs To Go™ Free Office Suite')
True
isEnglish('Instachat 😜')
True
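The same rule can be written more compactly by counting the non-ASCII characters directly. This is only a sketch: the name `is_mostly_ascii` and the `max_non_ascii` parameter are assumptions, with the threshold of 3 taken from the function above:

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(ord('a'), ord('爱'))  # 97 29233
for name in ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, is_mostly_ascii(name))
```

Note that this heuristic detects names written mostly in ASCII characters rather than English as such, so a name in another language that uses the Latin alphabet would pass as well.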
Now we use this function to check every row of the App Store data. We also use the explore_data function to print how many rows with English names remain.
data = open_data('AppleStore.csv')
data_final = []
data_error = []
for row in data[1:]:
    app_name = row[2]
    if isEnglish(app_name):
        data_final.append(row)
    else:
        data_error.append(row)

def explore_data(dataset, start, end, rows_and_columns = True):
    ## note: dataset[1:] skips the first element; for the cleaned lists,
    ## which have no header, that element is a data row
    data_slice = dataset[1:][start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of row: ', len(dataset))
        print('Number of column: ', len(data_slice[0]))

explore_data(data_final, 1, 5)
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'] ['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1'] Number of row: 6183 Number of column: 17
Let's compare the App Store data before and after filtering by app name:
Before: Number of rows: 7198 (including the header), Number of columns: 17
After: Number of rows: 6183, Number of columns: 17
A simple subtraction, 7197 - 6183, gives 1014: there are 1014 apps with non-English names in the App Store data. Let's check out some of them:
print(len(data_error))
print(data_error[1:10])
1014 [['96', '303191318', '同花顺-炒股、股票', '122886144', 'USD', '0', '1744', '0', '3.5', '0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1'], ['239', '331259725', '央视影音-海量央视内容高清直播', '54648832', 'USD', '0', '2070', '0', '2.5', '0', '6.2.0', '4+', 'Sports', '37', '0', '1', '1'], ['268', '336141475', '优酷视频', '204959744', 'USD', '0', '4885', '0', '3.5', '0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1'], ['295', '340368403', 'クックパッド - No.1料理レシピ検索アプリ', '76644352', 'USD', '0', '115', '0', '3.5', '0', '17.5.1.0', '4+', 'Food & Drink', '37', '5', '1', '1'], ['343', '351091731', '大众点评-发现品质生活', '244516864', 'USD', '0', '844', '1', '4', '1', '9.2.4', '17+', 'Lifestyle', '37', '5', '2', '1'], ['374', '356968629', 'ヤフオク! 利用者数NO.1のオークション、フリマアプリ', '187040768', 'USD', '0', '9', '0', '3', '0', '6.14.0', '17+', 'Shopping', '37', '0', '1', '1'], ['459', '370139302', 'QQ 浏览器-搜新闻、选小说漫画、看视频', '119812096', 'USD', '0', '1750', '19', '3.5', '5', '7.4.1', '17+', 'Utilities', '38', '0', '1', '1'], ['477', '373454750', '随手记(专业版)-好用的记账理财工具', '83899392', 'USD', '0.99', '1267', '0', '4.5', '0', '10.6.3', '4+', 'Finance', '38', '0', '3', '1'], ['501', '376561911', 'かなもじ', '61272064', 'USD', '3.99', '19', '0', '4.5', '0', '1.9.3', '4+', 'Education', '24', '5', '2', '1']]
To get the final version of both datasets, we clean the rest of the Google Play Store data by applying the isEnglish function to every row of android_clean, the latest Google Play Store data, which has already gone through duplicate removal:
data_GG_final = []
data_error_GG = []
for row in android_clean:
    app_name = row[0]
    if isEnglish(app_name):
        data_GG_final.append(row)
    else:
        data_error_GG.append(row)
explore_data(data_GG_final, 1, 5)
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'] ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up'] Number of row: 9615 Number of column: 13
As with the App Store, let's check how many non-English apps we have here, and see how much our function affected the dataset:
print(len(data_error_GG))
print(data_error_GG[:5])
45 [['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up'], ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'], ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'], ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']]
As the result shows, we have 45 apps with non-English names here. One case is worth a closer look: 'သိင်္ Astrology - Min Thein Kha BayDin'. Most of this name is written in Latin letters, but it starts with several non-Latin characters, which is why the function flags it. In my opinion this is not a native English app: I could not work out what 'Min Thein Kha BayDin' means or which country the phrase comes from. If anyone can explain it, please let me know. Or is it simply a random name meant to catch customers' attention on the Play Store?
We now have the final datasets for both the App Store and the Google Play Store. It's time to isolate the free apps and answer the question: what kind of apps do customers like?
pre_analysis_data = []
rest_data = []
for row in data_final:
    Price_check = row[5]
    if Price_check == '0':  ## condition to isolate all free apps
        pre_analysis_data.append(row)
    else:
        rest_data.append(row)

pre_analysis_data_GG = []
rest_GG_data = []
for row in data_GG_final:
    Price_check_GG = row[7]
    if Price_check_GG == '0':
        pre_analysis_data_GG.append(row)
    else:
        rest_GG_data.append(row)
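Comparing the price with the string '0' works for the values printed in this notebook, but it would misclassify a free app whose price happened to be written as '0.0' or '0.00'. A slightly more defensive sketch converts the price to a number first. It assumes that prices only ever carry an optional leading '$', as in the outputs shown here; `parse_price` and `split_free_paid` are hypothetical names:

```python
def parse_price(price):
    """Convert a price string such as '0', '3.99' or '$4.99' to a float."""
    return float(price.replace('$', '').strip())

def split_free_paid(rows, price_idx):
    """Split rows into (free, paid) lists based on the column price_idx."""
    free, paid = [], []
    for row in rows:
        if parse_price(row[price_idx]) == 0:
            free.append(row)
        else:
            paid.append(row)
    return free, paid

# Toy rows with two columns: name and price.
rows = [['Sketch', '0'], ['TurboScan', '$4.99'], ['PAC-MAN Premium', '3.99'], ['Bible', '0']]
free, paid = split_free_paid(rows, 1)
print(len(free), len(paid))  # 2 2
```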
We explore the pre-processed App Store data first, to check that the final dataset contains only free apps:
explore_data(pre_analysis_data,1,5,)
print('\n')
print(len(rest_data)) ## the number of paid apps is printed here
print(rest_data[:5])
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'] ['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1'] ['8', '284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0', '1126879', '3594', '4', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1'] Number of row: 3222 Number of column: 17 2961 [['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1'], ['9', '284666222', 'PCalc - The Best Calculator', '49250304', 'USD', '9.99', '1117', '4', '4.5', '5', '3.6.6', '4+', 'Utilities', '37', '5', '1', '1'], ['10', '284736660', 'Ms. PAC-MAN', '70023168', 'USD', '3.99', '7885', '40', '4', '4', '4.0.4', '4+', 'Games', '38', '0', '10', '1'], ['11', '284791396', 'Solitaire by MobilityWare', '49618944', 'USD', '4.99', '76720', '4017', '4.5', '4.5', '4.10.1', '4+', 'Games', '38', '4', '11', '1']]
Next is the Google Play Store data. Similarly, we explore the final data to see how many free apps and how many paid apps we have:
explore_data(pre_analysis_data_GG,1,5,)
print('\n')
print(len(rest_GG_data)) ## the number of paid apps is printed here
print(rest_GG_data[:5])
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'] ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up'] Number of row: 8865 Number of column: 13 750 [['TurboScan: scan documents and receipts in PDF', 'BUSINESS', '4.7', '11442', '6.8M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'March 25, 2018', '1.5.2', '4.0 and up'], ['Tiny Scanner Pro: PDF Doc Scan', 'BUSINESS', '4.8', '10295', '39M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'April 11, 2017', '3.4.6', '3.0 and up'], ['Puffin Browser Pro', 'COMMUNICATION', '4.0', '18247', 'Varies with device', '100,000+', 'Paid', '$3.99', 'Everyone', 'Communication', 'July 5, 2018', '7.5.3.20547', '4.1 and up'], ['Truth or Dare Pro', 'DATING', 'NaN', '0', '20M', '50+', 'Paid', '$1.49', 'Teen', 'Dating', 'September 1, 2017', '1.0', '4.0 and up'], ['Private Dating, Hide App- Blue for PrivacyHider', 'DATING', 'NaN', '0', '18k', '100+', 'Paid', '$2.99', 'Everyone', 'Dating', 'July 25, 2017', '1.0.1', '4.0 and up']]
After the cleaning process, we have two final datasets, one for the Google Play Store and one for the App Store, which contain no missing values, duplicates, or wrongly formatted names, and only free apps: pre_analysis_data (AppStore) and pre_analysis_data_GG (GooglePlayStore). We are now ready to analyze these datasets.
Since our revenue comes from users (the more users we have, the more revenue we earn), we have to determine the kinds of apps that are likely to attract more users.
Normally, when an app is developed, it goes through three steps:
Because the end goal is to add the app to both Google Play and the App Store, we need to find the app profiles that are successful in both markets. For this reason, we begin the analysis by determining the most common genres in each market. Let's review the headers of both the Google Play Store and the App Store:
AppStore: ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Google Play Store: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Our analysis targets genres, so in each app store we focus on:
AppStore: 'prime_genre'
Google Play Store: 'Category', 'Genres'
Below, we take the two final datasets and build a frequency table of genres:
def freq_table(dataset, index):
    freq_tab = {}
    for element in dataset:
        check_con = element[index]
        if check_con in freq_tab:
            freq_tab[check_con] += 1
        else:
            freq_tab[check_con] = 1
    value = []
    for key in freq_tab:  ## to calculate the sum of all frequencies
        value.append(freq_tab[key])
    summarize = sum(value)
    freq_tab_perc = {}
    for key in freq_tab:
        perc = round((freq_tab[key] / summarize) * 100, 2)
        freq_tab_perc[key] = perc
    return freq_tab_perc
## We want to display the frequencies in descending order, so we use the built-in sorted() function.
## Sorting the dictionary directly would only sort its keys, so we first convert it to a list of (value, key) tuples.
def display_table(freq_tab_perc):
    table = freq_tab_perc
    table_display = []  ## the frequency table converted to a list
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        ## each entry is a (value, key) tuple; print it as Genre : Value % on its own row
        print(entry[1], ':', entry[0], '%')
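For comparison, the two functions above can be approximated with `collections.Counter` from the standard library. This is a sketch on toy data, and `freq_table_sorted` is a hypothetical name:

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    table = [(value, round(count / total * 100, 2)) for value, count in counts.items()]
    return sorted(table, key=lambda pair: pair[1], reverse=True)

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
for genre, perc in freq_table_sorted(sample, 1):
    print(genre, ':', perc, '%')
```

One difference from the functions above: entries with equal percentages keep their order of first appearance here, whereas sorting (value, key) pairs in reverse orders ties by key.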
This is the frequency of app types based on 'prime_genre' in the App Store ('prime_genre' is index 12 in the App Store dataset):
display_table(freq_table(pre_analysis_data, 12))
Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %
Based on the result, we have some conclusions:
=> We need to determine the number of installs for each app genre to understand what users actually favor.
Below are the frequencies of app types in the Google Play Store: the first table is based on 'Category' (index 1), and the second on 'Genres' (index 9):
display_table(freq_table(pre_analysis_data_GG, 1))
FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.91 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.24 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.92 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %
display_table(freq_table(pre_analysis_data_GG, 9))
Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Lifestyle : 3.9 %
Productivity : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.24 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.92 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
Racing;Action & Adventure : 0.17 %
Puzzle;Brain Games : 0.17 %
Entertainment;Music & Video : 0.17 %
Casual;Brain Games : 0.14 %
Casual;Action & Adventure : 0.14 %
Arcade;Action & Adventure : 0.12 %
Action;Action & Adventure : 0.1 %
Educational;Pretend Play : 0.09 %
Simulation;Action & Adventure : 0.08 %
Parenting;Education : 0.08 %
Entertainment;Brain Games : 0.08 %
Board;Brain Games : 0.08 %
Parenting;Music & Video : 0.07 %
Educational;Brain Games : 0.07 %
Casual;Creativity : 0.07 %
Art & Design;Creativity : 0.07 %
Education;Pretend Play : 0.06 %
Role Playing;Pretend Play : 0.05 %
Education;Creativity : 0.05 %
Role Playing;Action & Adventure : 0.03 %
Puzzle;Action & Adventure : 0.03 %
Entertainment;Creativity : 0.03 %
Entertainment;Action & Adventure : 0.03 %
Educational;Creativity : 0.03 %
Educational;Action & Adventure : 0.03 %
Education;Music & Video : 0.03 %
Education;Brain Games : 0.03 %
Education;Action & Adventure : 0.03 %
Adventure;Action & Adventure : 0.03 %
Video Players & Editors;Music & Video : 0.02 %
Sports;Action & Adventure : 0.02 %
Simulation;Pretend Play : 0.02 %
Puzzle;Creativity : 0.02 %
Music;Music & Video : 0.02 %
Entertainment;Pretend Play : 0.02 %
Casual;Education : 0.02 %
Board;Action & Adventure : 0.02 %
Video Players & Editors;Creativity : 0.01 %
Trivia;Education : 0.01 %
Travel & Local;Action & Adventure : 0.01 %
Tools;Education : 0.01 %
Strategy;Education : 0.01 %
Strategy;Creativity : 0.01 %
Strategy;Action & Adventure : 0.01 %
Simulation;Education : 0.01 %
Role Playing;Brain Games : 0.01 %
Racing;Pretend Play : 0.01 %
Puzzle;Education : 0.01 %
Parenting;Brain Games : 0.01 %
Music & Audio;Music & Video : 0.01 %
Lifestyle;Pretend Play : 0.01 %
Lifestyle;Education : 0.01 %
Health & Fitness;Education : 0.01 %
Health & Fitness;Action & Adventure : 0.01 %
Entertainment;Education : 0.01 %
Communication;Creativity : 0.01 %
Comics;Creativity : 0.01 %
Casual;Music & Video : 0.01 %
Card;Action & Adventure : 0.01 %
Books & Reference;Education : 0.01 %
Art & Design;Pretend Play : 0.01 %
Art & Design;Action & Adventure : 0.01 %
Arcade;Pretend Play : 0.01 %
Adventure;Education : 0.01 %
Now that we have looked through both datasets, we want to know the average number of installs for each app genre in order to determine which genres are the most popular. For GooglePlayStore, we can read this from the Installs column. The AppStore data has no install count, so we take the total number of user ratings ('rating_count_tot') as a proxy and work with that instead.
We will process the AppStore data first, because it needs the three steps below:
We start from the unique genres table produced by freq_table above and create a nested loop: an outer loop over the unique genres (to take each genre in turn) and an inner loop over the cleaned data (pre_analysis_data) (to determine which apps belong to that genre).
Before calculating the average, we'll create a function to sort and display the results so we can quickly check the top 3 genres with the highest install counts.
def display_table(table):
    # Build (value, key) tuples so that sorting orders the entries by value
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    # Sort in descending order so the largest values are printed first
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], 'times')
unique_generes = freq_table(pre_analysis_data, 12)
check_freq = {}  # collects the averages so display_table() (defined above) can show them in descending order
for genres in unique_generes:
    total = 0  # sum of user ratings for this genre
    len_genre = 0  # number of apps in this genre
    for app in pre_analysis_data:
        genre_app = app[12]
        if genre_app == genres:
            len_genre += 1
            total += int(app[6])
    average_user_rating = round(total / len_genre, 0)
    check_freq[genres] = average_user_rating
display_table(check_freq)
Navigation : 86090.0 times Reference : 74942.0 times Social Networking : 71548.0 times Music : 57327.0 times Weather : 52280.0 times Book : 39758.0 times Food & Drink : 33334.0 times Finance : 31468.0 times Photo & Video : 28442.0 times Travel : 28244.0 times Shopping : 26920.0 times Health & Fitness : 23298.0 times Sports : 23009.0 times Games : 22789.0 times News : 21248.0 times Productivity : 21028.0 times Utilities : 18684.0 times Lifestyle : 16486.0 times Entertainment : 14030.0 times Business : 7491.0 times Education : 7004.0 times Catalogs : 4004.0 times Medical : 612.0 times
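The nested loop above rescans the whole dataset once per genre. As a minimal sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries. The function name average_by_column and the toy rows are hypothetical, made up for illustration; only the column positions (12 for prime_genre, 6 for rating_count_tot) come from the dataset header.

```python
def average_by_column(rows, key_index, value_index):
    """Return {key: average of the numeric value column}, computed in one pass."""
    totals = {}  # sum of values seen so far for each key
    counts = {}  # number of rows seen so far for each key
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy rows with made-up values: only indexes 6 and 12 matter here.
toy_rows = [
    [''] * 6 + ['100'] + [''] * 5 + ['Games'],
    [''] * 6 + ['300'] + [''] * 5 + ['Games'],
    [''] * 6 + ['50'] + [''] * 5 + ['Weather'],
]
averages = average_by_column(toy_rows, 12, 6)
print(averages)
```

The result could then be rounded and passed to display_table() in the same way as check_freq.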
When looking at these averages, remember that the genre frequency table above did not take each genre's installs (ratings) into account, so we cannot conclude from it alone that Games is what users want most. Notice that Games is the most common genre (58.16% of apps), but by average number of user ratings, our proxy for installs, it reaches only 22789, more than three times lower than the top genre, Navigation (86090). These results also suggest that demand for navigation is higher than for the other genres, likely because smartphones have largely taken over the role of dedicated handheld navigation devices, which require paying fees to use and to update map data. A smartphone with GPS and a free navigation app makes this much easier.
Similarly, we calculate the install numbers per app category in the GooglePlayStore data. First, we render a frequency table of the Installs column of the GooglePlayStore dataset with the freq_table() function we created:
display_table(freq_table(pre_analysis_data_GG, 5))
1,000,000+ : 15.72 times 100,000+ : 11.55 times 10,000,000+ : 10.55 times 10,000+ : 10.2 times 1,000+ : 8.4 times 100+ : 6.91 times 5,000,000+ : 6.82 times 500,000+ : 5.56 times 50,000+ : 4.77 times 5,000+ : 4.51 times 10+ : 3.54 times 500+ : 3.25 times 50,000,000+ : 2.3 times 100,000,000+ : 2.13 times 50+ : 1.92 times 5+ : 0.79 times 1+ : 0.51 times 500,000,000+ : 0.27 times 1,000,000,000+ : 0.23 times 0+ : 0.05 times 0 : 0.01 times
The install numbers on the GooglePlayStore side have the form 'number+' (bucketed, summary-style data), but we want only the number, without the commas and the '+' character. We will strip these characters with str.replace(old, new) inside a nested loop, as below:
unique_genre = freq_table(pre_analysis_data_GG, 1)
freq_install = {}
for category in unique_genre:
    total = 0  # sum of install counts for this category
    c_category = 0  # number of apps in this category
    for app in pre_analysis_data_GG:
        app_category = app[1]
        if app_category == category:
            install_num = app[5].replace('+', '')  # strip the trailing '+'
            install_num2 = install_num.replace(',', '')  # strip the thousands separators
            total += float(install_num2)
            c_category += 1
    average_install_num = round(total / c_category, 0)
    freq_install[category] = average_install_num
display_table(freq_install)
COMMUNICATION : 38456119.0 times VIDEO_PLAYERS : 24727872.0 times SOCIAL : 23253652.0 times PHOTOGRAPHY : 17840110.0 times PRODUCTIVITY : 16787331.0 times GAME : 15588016.0 times TRAVEL_AND_LOCAL : 13984078.0 times ENTERTAINMENT : 11640706.0 times TOOLS : 10801391.0 times NEWS_AND_MAGAZINES : 9549178.0 times BOOKS_AND_REFERENCE : 8767812.0 times SHOPPING : 7036877.0 times PERSONALIZATION : 5201483.0 times WEATHER : 5074486.0 times HEALTH_AND_FITNESS : 4188822.0 times MAPS_AND_NAVIGATION : 4056942.0 times FAMILY : 3695642.0 times SPORTS : 3638640.0 times ART_AND_DESIGN : 1986335.0 times FOOD_AND_DRINK : 1924898.0 times EDUCATION : 1833495.0 times BUSINESS : 1712290.0 times LIFESTYLE : 1433676.0 times FINANCE : 1387692.0 times HOUSE_AND_HOME : 1331541.0 times DATING : 854029.0 times COMICS : 817657.0 times AUTO_AND_VEHICLES : 647318.0 times LIBRARIES_AND_DEMO : 638504.0 times PARENTING : 542604.0 times BEAUTY : 513152.0 times EVENTS : 253542.0 times MEDICAL : 120551.0 times
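The two str.replace() calls used for the install labels can be checked in isolation. The sketch below wraps them in a helper named parse_installs, a hypothetical name that is not part of the notebook, and runs it on a few labels taken from the frequency table of the Installs column.

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to a float."""
    # Drop the trailing '+' first, then the thousands separators.
    return float(text.replace('+', '').replace(',', ''))


for label in ['1,000,000+', '500+', '0']:
    print(label, '->', parse_installs(label))
```

Note that a label like '1,000,000+' only gives the lower edge of a bucket, so averages built from these values are rough estimates rather than exact install counts.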
From these results, the top three app categories by average installs, and therefore the ones most likely to bring us revenue, are COMMUNICATION (38456119), VIDEO_PLAYERS (24727872), and SOCIAL (23253652). As with the AppStore data, GAME is not the most popular category, so we could keep games as a secondary option in our development plan.
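To pull out only the top three categories instead of printing the whole table, one option is to sort the dictionary items by value and slice the result. This is a minimal sketch: freq_install_sample is a hypothetical dictionary holding five of the averages copied from the output above, not the full freq_install.

```python
# Five averages copied from the GooglePlayStore output above.
freq_install_sample = {
    'GAME': 15588016.0,
    'SOCIAL': 23253652.0,
    'COMMUNICATION': 38456119.0,
    'PHOTOGRAPHY': 17840110.0,
    'VIDEO_PLAYERS': 24727872.0,
}

# Sort (category, average) pairs by the average, largest first, and keep three.
top_three = sorted(
    freq_install_sample.items(),
    key=lambda item: item[1],
    reverse=True,
)[:3]

for category, average in top_three:
    print(category, ':', average)
```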