The aim of this project is to find profitable app profiles for the Google Play and App Store markets. We are working as data analysts for a company, and our goal is to help our developers build apps that are likely to generate profit in the App Store and Google Play markets.
At our company, we only build apps that are free to download. Our main source of revenue is in-app ads, so our profit depends largely on the number of installs of each app we release. Through data analysis, we want to make data-driven decisions that help our dev team deliver apps that are likely to attract a large number of users.
#opening the App store dataset
from csv import reader
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[:1]
#opening the Google Play store dataset
from csv import reader
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[:1]
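The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name, and the `utf8` encoding is an assumption about how the CSV files are stored. Unlike the cells above, it returns the header separately from the data rows, so row indexes would shift down by one compared with the rest of this notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: read a CSV file and return (header, rows).

    The file is opened in a `with` block so it is closed automatically.
    The default encoding is an assumption, not something the datasets state.
    """
    with open(path, encoding=encoding, newline='') as opened_file:
        data = list(reader(opened_file))
    # data[0] is the header row, everything after it is app data
    return data[0], data[1:]

# Possible usage (file names taken from the cells above):
# ios_header, ios_rows = load_csv('AppleStore.csv')
# android_header, android_rows = load_csv('googleplaystore.csv')
```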
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
explore_data(ios, 0, 2, True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

Number of rows: 7198
Number of columns: 16
Taking a first glance at the iOS dataset, we believe several columns could help with our analysis, for example: track_name, currency, price, rating_count_tot, user_rating, and prime_genre. Not all column names are self-explanatory, so please refer to the dataset documentation linked below for details.
ios_dataset: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home
explore_data(android, 0, 2, True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

Number of rows: 10842
Number of columns: 13
Taking a first glance at the Google Play dataset, we see 10842 rows (a count that includes the header row) and 13 columns. Among those, several columns could help with our analysis, for example: App, Category, Rating, Reviews, Installs, Price, and Genres.
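Since the rest of the notebook refers to columns by position (for example `app[0]` for the name and `app[3]` for the review count), it can help to look positions up by name. The sketch below uses the header and first data row copied from the output above; `column_indexes` and `android_idx` are hypothetical names introduced only for illustration.

```python
# Header row copied from the Google Play output above.
android_columns = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                   'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                   'Current Ver', 'Android Ver']

def column_indexes(header):
    """Hypothetical helper: map each column name to its position in a row."""
    return {name: index for index, name in enumerate(header)}

android_idx = column_indexes(android_columns)

# First data row copied from the output above.
sample_row = ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN',
              '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone',
              'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

print(android_idx['Reviews'])                # 3
print(sample_row[android_idx['Installs']])   # 10,000+
```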
print(android_header)
print('\n')
print(android[10473])
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
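The row printed above (index 10473) is missing its Category value, so it has only 12 values instead of 13 and everything after the app name is shifted one column to the left (for example, '1.9' appears in the Category position). We remove this row from the Play Store dataset as a first data cleaning step, using the `del android[10473]` statement further below. Many more to come :)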
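A row like this can also be found without knowing its index in advance, by comparing each row's length with the header's. The sketch below is one possible approach, not the method used in this notebook; `find_malformed_rows` is a hypothetical helper, and the sample data is copied from the outputs above.

```python
def find_malformed_rows(dataset):
    """Hypothetical helper: return (index, row) pairs whose length
    differs from the header's. Assumes dataset[0] is the header row."""
    expected = len(dataset[0])
    return [(index, row) for index, row in enumerate(dataset)
            if len(row) != expected]

# Small sample: header, one well-formed row, and the shifted row from above.
sample = [
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
     'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
     'Android Ver'],
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

for index, row in find_malformed_rows(sample):
    print(index, len(row), row[0])
# 2 12 Life Made WI-Fi Touchscreen Photo Frame
```

On the full dataset the same call would be `find_malformed_rows(android)`, which should be run before any rows are deleted so that the reported indexes still match.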
In the last step, we started the data cleaning process and deleted a row with incorrect data from the Google Play data set. If we explore the Google Play data set long enough, we'll notice some apps have duplicate entries. For instance, Instagram has four entries:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Example of duplicate apps: ', duplicate_apps[:5])
Number of duplicate apps: 1181

Example of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
As we see above, many apps appear more than once in the dataset, so our next step is to remove the duplicate entries and make the list more consistent and reliable. We could remove duplicates at random, but it makes more sense to keep, for each app, the entry with the highest number of reviews (the Reviews column), since that entry is likely the most recent one, and to remove all other entries for the same app.
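The criterion described above can be sketched in a single pass that remembers, for each app name, the row with the highest review count seen so far. This is only one way to do it and not necessarily the approach this notebook ends up with; `keep_most_reviewed` is a hypothetical helper, and the sample rows are the Instagram and osmino entries from this notebook trimmed to their first four columns. It assumes the header row is not part of the input and that every Reviews value is numeric.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Hypothetical helper: keep one row per app name, the one with the
    highest review count. Assumes `rows` contains no header row."""
    best = {}
    for row in rows:
        name = row[name_col]
        n_reviews = float(row[reviews_col])
        # Store the row if the app is new or this entry has more reviews.
        if name not in best or n_reviews > float(best[name][reviews_col]):
            best[name] = row
    # Dicts keep insertion order, so apps stay in first-seen order.
    return list(best.values())

# Rows trimmed to App, Category, Rating, Reviews for readability.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203'],
]

cleaned = keep_most_reviewed(sample_rows)
print(len(cleaned))   # 2
print(cleaned[0])     # ['Instagram', 'SOCIAL', '4.5', '66577446']
```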
del android[10473]
print(android[10473])
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
ValueError                                Traceback (most recent call last)
<ipython-input-9-1fc1a52e355c> in <module>()
      4 for app in android:
      5     name = app[0]
----> 6     n_reviews = float(app[3])
      7 
      8     if name in reviews_max and reviews_max[name] < n_reviews:

ValueError: could not convert string to float: 'Reviews'
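The error message points at the cause: the string 'Reviews' is the column name, which means the loop is also processing the header row stored at `android[0]`. One possible fix, sketched below on a trimmed sample rather than the full dataset, is to skip the first row by slicing; on the real data the loop would start with `for app in android[1:]:`.

```python
# Trimmed sample: a header row followed by two Instagram entries
# (first four columns only, values copied from the output above).
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
]

reviews_max = {}

for app in sample[1:]:  # [1:] skips the header row at index 0
    name = app[0]
    n_reviews = float(app[3])

    # Keep the highest review count seen so far for each app name.
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(reviews_max)  # {'Instagram': 66577446.0}
```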