This project focuses on analyzing and cleaning data from the Apple App Store and the Google Play store, with an emphasis on basic Python concepts: lists and loops, conditional statements, dictionaries and frequency tables, and functions. This project is done in Jupyter Notebook.
Here is the link for the Apple Store dataset and the link for the Google Store dataset.
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.
Isolating the free apps will be our last step in the data cleaning process; after that, we'll start analyzing the data.
# Import the reader function from the csv module
from csv import reader

# Create a list of lists for both datasets, splitting off the header rows
open_apple = open(r'C:\Users\david\OneDrive\Desktop\AppleStore.csv', encoding="utf8")
read_apple = reader(open_apple)
apple_data = list(read_apple)
apple_header = apple_data[0]
apple_data = apple_data[1:]

open_google = open(r'C:\Users\david\OneDrive\Desktop\googleplaystoremaster.csv', encoding="utf8")
read_google = reader(open_google)
google_data = list(read_google)
google_header = google_data[0]
google_data = google_data[1:]
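Since the two datasets are loaded with identical steps, the repetition could be factored into a small helper. This is only a sketch; the `load_dataset` name and its signature are my own, not part of the notebook:

```python
from csv import reader

def load_dataset(file_path):
    # Open the CSV, parse it into a list of lists,
    # and split it into a header row and the data rows.
    # Using a with-block also closes the file handle afterwards.
    with open(file_path, encoding="utf8") as opened_file:
        data = list(reader(opened_file))
    return data[0], data[1:]
```

With this helper, each dataset becomes a one-liner, e.g. `apple_header, apple_data = load_dataset(r'C:\Users\david\OneDrive\Desktop\AppleStore.csv')`.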
# Create a function that prints a slice of the data and, optionally, its dimensions
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds an empty line between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
explore_data(google_data, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

Number of rows: 10841
Number of columns: 13
There was a row in the Google Play data with a missing data point, so we removed it with del google_data[10472]. There are now 10,840 rows instead of 10,841. Furthermore, in the cell below I created a for loop to count the number of duplicate app titles in the Google Play dataset. There are 1,181 duplicate app names, as shown below.
duplicate_apps = []
unique_apps = []

del google_data[10472]  # remove the row with the missing data point

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
Number of duplicate apps: 1181

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
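As a sanity check, the same duplicate count can be computed with collections.Counter from the standard library (this is an alternative sketch, not part of the original notebook; the name list here is a small toy stand-in for the app-name column):

```python
from collections import Counter

# Toy stand-in for the app-name column of google_data
names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(names)
# An app appearing n times contributes n - 1 duplicate rows
n_duplicates = sum(count - 1 for count in name_counts.values() if count > 1)
print(n_duplicates)  # 3 (two extra 'Box' rows, one extra 'Slack' row)
```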
To clean our data, I will create a dictionary where each app name is a key and its number of reviews is the value. Iterating over every row with a for loop, we keep only the highest review count seen for each name, so every duplicated app ends up with a single entry in reviews_max, as shown below.
reviews_max = {}

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length:', len(google_data) - 1181)
print('Actual length:', len(reviews_max))
Expected length: 9659
Actual length: 9659
As shown above, I filtered the duplicate data using a dictionary. Our calculated number of duplicate rows was 1,181, so we expect the cleaned data to have 9,659 entries (the google_data list has 10,840 rows after removing row 10,472, and 10,840 - 1,181 = 9,659).
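On a small toy example (hypothetical rows with the same column layout: name in column 0, review count in column 3) the dictionary keeps only the highest review count per name:

```python
# Hypothetical rows mimicking google_data's layout
rows = [
    ['Box', 'BUSINESS', '4.2', '150'],
    ['Box', 'BUSINESS', '4.2', '300'],    # duplicate with more reviews
    ['Slack', 'BUSINESS', '4.4', '51500'],
]

reviews_max = {}
for app in rows:
    name = app[0]
    n_reviews = float(app[3])
    # Keep the largest review count seen for each name
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(reviews_max)  # {'Box': 300.0, 'Slack': 51500.0}
```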
Below I created two lists: google_clean, where the cleaned google_data will be stored, and google_app_names for names already added. The google_app_names check is a secondary condition that prevents adding an app more than once when several of its duplicate rows share the exact same (highest) review count; without it, the reviews_max comparison alone would let all of those tied rows through.
google_clean = []
google_app_names = []

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    # Keep only the row with the highest review count, and only once per name
    if (reviews_max[name] == n_reviews) and (name not in google_app_names):
        google_clean.append(app)
        google_app_names.append(name)
Now that the duplicate data has been removed, we need to attend to the data that is non-English. To accomplish this, I've created a function below that checks the code point of each character in a string and counts those outside the ASCII range (greater than 127).

Furthermore, some app names contain symbols, dashes, and emojis, yet are usually still English-based. So that such names still pass while names with excessive amounts of emojis or special characters are filtered out, we allow up to three non-ASCII characters per string.
def is_eng(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 3:
        return False
    else:
        return True
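A few spot checks show the threshold in action (the function is repeated here so the snippet runs on its own; the sample names are arbitrary illustrations):

```python
def is_eng(string):
    # Allow up to three characters outside the ASCII range (ord > 127),
    # so names with a stray symbol or emoji still pass
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

print(is_eng('Instagram'))                      # True  (all ASCII)
print(is_eng('Docs To Go™ Free'))               # True  (one non-ASCII character)
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False (many non-ASCII characters)
```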
Time to run both datasets through the function:
google_english = []
apple_english = []

for row in google_clean:
    name = row[0]
    if is_eng(name):
        google_english.append(row)

for row in apple_data:
    name = row[1]
    if is_eng(name):
        apple_english.append(row)

explore_data(google_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

Number of rows: 9614
Number of columns: 13

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

Number of rows: 6183
Number of columns: 16
Our next step is to isolate the free applications in both datasets, as we are only analyzing free, English-based apps in this project.
google_free = []
apple_free = []

for row in google_english:  # if the price is free, add the whole row to the new dataset
    price = row[7]
    if price == '0':
        google_free.append(row)  # append the row (not the price string), so all columns are kept

for row in apple_english:  # if the price is free, add the whole row to the new dataset
    price = row[4]
    if price == '0.0':
        apple_free.append(row)  # append the row (not the price string), so all columns are kept

print('Number of free apps on Google Play: ', len(google_free))
print('Number of free apps on Apple Store: ', len(apple_free))
Number of free apps on Google Play:  8864
Number of free apps on Apple Store:  3222
Before analyzing, it's worth recalling our validation strategy: we want our apps to reach as many users as possible, so an app profile needs to be successful on both Google Play and the App Store; an idea that works in only one market doesn't fit our goals.

To find profiles that do well in both markets, we'll begin by producing frequency tables (as percentages) of the genre columns in each dataset to determine the most common genres.
def freq_table(dataset, index):
    table = {}
    total = 0

    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
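An equivalent, slightly more idiomatic way to sort a dictionary by value uses sorted with a key function instead of building value-key tuples by hand (a standard-library sketch, not from the original notebook; the percentages below are made up for illustration):

```python
# Hypothetical genre percentages, just to demonstrate the sort
table = {'Entertainment': 7.9, 'Games': 58.2, 'Photo & Video': 5.0}

# Sort the (key, value) pairs by value, largest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```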
print(apple_header[-5])
prime_genre
display_table(apple_free, -5)
With full rows (rather than bare price strings) stored in apple_free, this call prints each prime_genre value alongside its percentage of the free English apps, sorted from most to least common.