This project focuses on cleaning and analyzing data from the Apple Store and the Google Play Store, with an emphasis on basic Python concepts: lists and loops, conditional statements, dictionaries and frequency tables, and functions. The work is done in a Jupyter Notebook.
# Imports the reader function from the csv module
from csv import reader
Here is the link for the Apple Store dataset and the link for the Google Store dataset.
# Create a list of lists for both datasets
open_apple = open(r'C:\Users\david\OneDrive\Desktop\AppleStoreMaster.csv', encoding="utf8")
read_apple = reader(open_apple)
apple_data = list(read_apple)
apple_header = apple_data[0]
apple_data = apple_data[1:]
open_google = open(r'C:\Users\david\OneDrive\Desktop\googleplaystoremaster.csv', encoding="utf8")
read_google = reader(open_google)
google_data = list(read_google)
google_header = google_data[0]
google_data = google_data[1:]
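The two loading blocks above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that, not the approach this notebook uses: `load_csv` is a hypothetical name, and the `with` block is an addition that closes the file once reading finishes.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline="" is what the csv module recommends when opening files for reader()
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    # The first row is the header; everything after it is data
    return rows[0], rows[1:]
```

With this helper, each dataset would load in one line, for example `google_header, google_data = load_csv(path)` with `path` set to the same file location used above.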
# Create a function to give us insight into the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds an empty line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
One row in the Google Play Store data had a missing data point, so we removed it with del google_data[10473]. There are now 10,842 rows instead of 10,843. In the cell below, I use a for loop to count the duplicate app titles in the Google Play dataset. There are 1,181 duplicate app names, as shown below.
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
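The `name in unique_apps` test scans a list, so the loop above gets slower as that list grows. The same count can be taken from a frequency table. The sketch below uses `collections.Counter` from the standard library; `count_duplicates` is a hypothetical helper, and `sample_rows` is made-up data used only to illustrate the idea.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of duplicate rows, {name: occurrences}) for repeated names."""
    # Frequency table: how many rows carry each app name
    counts = Counter(row[name_index] for row in dataset)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence after the first counts as a duplicate
    n_duplicates = sum(n - 1 for n in repeated.values())
    return n_duplicates, repeated

# Made-up rows for illustration only
sample_rows = [['Box', '4.2'], ['Slack', '4.4'], ['Box', '4.2'], ['Box', '4.1']]
n_dupes, repeated = count_duplicates(sample_rows)
print('Number of duplicate apps:', n_dupes)
print('Repeated names:', repeated)
```

Like the loop above, this counts every repeat after the first occurrence, so a name that appears three times contributes two duplicates.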
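The next cell builds a dictionary, reviews_max, that maps each app name to the highest review count found among its rows in the Google Play dataset.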
reviews_max = {}

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_17864\1458982666.py in <module>
      3 for app in google_data:
      4     name = app[0]
----> 5     n_reviews = float(app[3])
      6 
      7     if name in reviews_max and reviews_max[name] < n_reviews:

ValueError: could not convert string to float: '3.0M'
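The ValueError above means at least one row still holds '3.0M' at index 3, where the loop expects a review count. One way to find such rows is to flag any row whose length differs from the header's or whose review field cannot be converted to a number. This is a minimal sketch under those assumptions: `find_bad_rows` is a hypothetical helper, and the sample header and rows are invented for illustration.

```python
def find_bad_rows(dataset, header, reviews_index=3):
    """Return indices of rows with a wrong column count or a non-numeric review field."""
    bad = []
    for i, row in enumerate(dataset):
        # A row with a missing value is shorter than the header
        if len(row) != len(header):
            bad.append(i)
            continue
        try:
            float(row[reviews_index])
        except (ValueError, IndexError):
            bad.append(i)
    return bad

# Made-up data for illustration only
sample_header = ['App', 'Category', 'Rating', 'Reviews']
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Broken App', '1.9', '19', '3.0M'],   # values shifted one column to the left
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Short Row', 'TOOLS', '4.0'],         # one column missing
]
bad_indices = find_bad_rows(sample_rows, sample_header)
print(bad_indices)
```

When deleting the flagged rows with del, it is safer to go from the highest index to the lowest, because each deletion shifts the position of every row after it.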
len(google_data)
10841
#del google_data[10472]
print(google_data[10472])