This project aims to identify which types of apps are likely to attract the most users. The findings will help our development team decide what kinds of apps to build. Since we only develop free apps, our main source of revenue is in-app ads, which is why we are interested in finding out what types of apps attract more users.
Our goal for this project is to determine which apps currently available on the App Store and Google Play attract the most users, and why.
As a first step, we're going to open both data sets and split each one into two lists: one holding the header row and one holding the actual data. The lists are named with an ios or android prefix to distinguish between the two operating systems.
from csv import reader
# import ios data set and convert to two separate lists
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios_data = ios[1:]
# import android data set and convert to two separate lists
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]
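The two loading blocks above repeat the same steps, so they could be folded into a single helper. The sketch below is one possible way to do that. `load_csv` is a hypothetical name that does not appear in the original code, and the `with` block is an added choice so that the file is closed once it has been read.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`.

    `load_csv` is a hypothetical helper, not part of the original notebook.
    """
    # The with-block closes the file even if reading fails part-way through.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, `ios_header, ios_data = load_csv('AppleStore.csv')` would stand in for the first loading block.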
Now that both data sets are loaded as lists, we're going to inspect them to see what kinds of data they contain. To do this, we first define a function that prints a slice of a data set with a blank line after each row, which makes the rows easier to read. The function takes four inputs: the data set, the start and end indices of the rows to display, and an optional flag that prints the number of rows and columns in the data set. Because we pass in the data without its header row, the header is not included in the row count.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row
    if rows_and_columns is True:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
    return
print(ios_header)
print('\n')
explore_data(ios_data, 0, 2, True)
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
Number of rows: 7197
Number of columns: 16
Looking at the iOS data, we see 7,197 apps and 16 columns. Of these columns, "track_name", "currency", "price", "rating_count_tot", "rating_count_ver", and "prime_genre" look useful for our analysis. For more details on what each column means, see the data set's documentation.
print(android_header)
print('\n')
explore_data(android_data, 0, 2, True)
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
Number of rows: 10841
Number of columns: 13
Looking at the Google Play data, we see 10,841 apps and 13 columns. The following columns look useful for our analysis: "App", "Category", "Reviews", "Installs", "Type", "Price", and "Genres".
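Later steps refer to columns by position, such as index 3 for the number of reviews. As a small aid, the sketch below builds a name-to-index lookup from the Android header shown above. The `android_col` dictionary is a hypothetical addition and is not used elsewhere in this project.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Map each column name to its position so later code can avoid magic numbers.
# `android_col` is a hypothetical name introduced for this example.
android_col = {name: index for index, name in enumerate(android_header)}

sample_row = ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
              '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
              'January 15, 2018', '2.0.0', '4.0.3 and up']

print(sample_row[android_col['Reviews']])   # 967
print(sample_row[android_col['Installs']])  # 500,000+
```

Looking columns up by name makes it easier to notice when a variable is reading from a different column than its name suggests.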
Next, we're going to check both data sets to make sure the data is reliable. The first step is to confirm that every row has the same number of fields as the header row. A row with a different length has a missing or extra value, so its values may no longer line up with the column names.
length_ios_header = len(ios_header)
for index, row in enumerate(ios_data):
    if len(row) != length_ios_header:
        print(row)
        print(index)  # position of the malformed row in ios_data
The code above produced no output, which means every row in the iOS data set has the same number of fields as the header. Now we're going to run the same test on the Android data set.
length_android_header = len(android_header)
for index, row in enumerate(android_data):
    if len(row) != length_android_header:
        print(row)
        print(index)  # position of the malformed row in android_data
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472
The output shows that the row at index 10472 has 12 values instead of 13, which means a data point is missing from this row. We're going to delete this row from the data set in the next step.
del android_data[10472]
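Deleting by a hard-coded index works, but running the `del` statement a second time would remove a different, valid row. A possible alternative is to filter on row length instead, as in the sketch below. `split_by_length` is a hypothetical helper, and the toy rows are made up for illustration.

```python
def split_by_length(header, rows):
    """Separate rows whose length matches the header from those that do not."""
    expected = len(header)
    good, bad = [], []
    for index, row in enumerate(rows):
        if len(row) == expected:
            good.append(row)
        else:
            bad.append((index, row))  # keep the index for reporting
    return good, bad


# Toy data (made up for illustration): the second row is missing a value.
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '1.9'], ['C', 'GAME', '3.8']]

good, bad = split_by_length(header, rows)
print(len(good))  # 2
print(bad)        # [(1, ['B', '1.9'])]
```

Because the function builds new lists rather than modifying its input, it can be run repeatedly with the same result.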
Next, we're going to check each data set for duplicate apps. To do this, we create two empty lists, one for unique apps and one for duplicates, and loop over the data. We'll run the check on the iOS data set first, and then on the Android data set.
unique_apps = []
duplicate_apps = []
for row in ios_data:
    app_id = row[0]  # the first iOS column is 'id', not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])
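Number of duplicate apps: 0
Examples of duplicate apps: []
As we can see, no app ID appears more than once in the iOS data set. Note that this check compares the "id" column, so it would not catch two entries that share a name but have different IDs. Now we're going to run the same code on the Android data set, where the first column is the app name.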
unique_apps = []
duplicate_apps = []
for row in android_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])
Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
The result shows 1,181 duplicate entries in the Android data set. To keep things simple, I reused the list names from the iOS check above.
To find out more about the duplicates, I'm going to print all the rows for one duplicated app and see what the data looks like.
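Checking `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. As an alternative sketch, `collections.Counter` from the standard library can tally every name in one pass. The app names below are toy values, and `n_duplicates` and `duplicated_names` are hypothetical names introduced here.

```python
from collections import Counter

# Toy app names (made up for illustration).
names = ['Slack', 'Box', 'Slack', 'Zoom', 'Box', 'Slack']

counts = Counter(names)
# Every occurrence beyond the first counts as one duplicate entry.
n_duplicates = sum(count - 1 for count in counts.values())
duplicated_names = sorted(name for name, count in counts.items() if count > 1)

print(n_duplicates)      # 3
print(duplicated_names)  # ['Box', 'Slack']
```

Counting "occurrences beyond the first" matches what the loop above collects in `duplicate_apps`, so the two approaches should report the same total.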
for row in android_data:
    name = row[0]
    if name == 'Slack':
        print(row)
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
Taking the Slack app as an example, the only difference between the three entries is in the fourth column, which holds the number of reviews. The third entry has the highest number of reviews, so it is presumably the most recently collected one. That is the entry we want to keep, and the other two have to be removed. We will use this logic to remove duplicates in the following steps.
reviews_max = {}
for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
expected_length = len(android_data) - len(duplicate_apps)
print(expected_length)
print(len(reviews_max))
9659
9659
Above, we created an empty dictionary and looped through the Android data set. For each row, we looked up the app name in the dictionary. If the name was not there yet, we added it together with the row's number of reviews. If the name was already there and the stored number of reviews was lower than the number in the current row, we replaced the stored value with the higher number.
Then we computed the expected length of the cleaned data set by subtracting the number of duplicate entries from the length of the original data set, and compared it with the length of the new dictionary. Both values are 9,659, which confirms that the dictionary holds exactly one entry per app.
android_clean = []
already_added = []
for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))
9659
Above, we finished cleaning the Android data set. We first created two new empty lists: one for the cleaned data, and one to keep track of the names of apps that have already been added. We then looped through the Android data set and compared each row's number of reviews with the maximum stored in our dictionary for that app. If the two numbers matched and the app's name was not yet in the already_added list, we appended the full row to android_clean and the name to already_added. The second condition matters because two duplicate rows can share the same maximum number of reviews, and without it both would be kept.
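The same result can be reached in a single pass by storing the whole row in the dictionary rather than only the review count. The sketch below shows this variant on toy rows shaped like the first four Android columns. `keep_most_reviewed` is a hypothetical helper, and the Box row is made up for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: name, category, rating, reviews.
rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

for row in keep_most_reviewed(rows):
    print(row)
# ['Slack', 'BUSINESS', '4.4', '51510']
# ['Box', 'BUSINESS', '4.2', '159872']
```

One difference from the two-pass approach is ordering: here each app keeps the position of its first appearance, whereas the loop above keeps the position of the row that holds the maximum.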
Now we have completed the cleaning of duplicate apps.
Next, we're going to remove non-English apps from our data. Since our company only builds apps for an English-speaking audience, we are only interested in analyzing English-language apps. To begin, we're going to write a function that checks whether an app's name is in English.
def english_check(a_string):
    special_char = 0
    for character in a_string:
        if ord(character) > 127:
            special_char += 1
    if special_char > 3:
        return False
    else:
        return True
Above, we wrote a function that goes through each character of a string and uses the built-in ord function to get the character's code point number. The characters commonly used in English text (letters, digits, and punctuation) are all part of ASCII, which covers the numbers 0 to 127, so a character with a number above 127 falls outside that set. The function counts these characters and returns False, meaning the name is probably not English, when it finds more than 3 of them.
A stricter version that returned False at the first character above 127 would wrongly reject apps such as Docs To Go™ Free Office Suite and Instachat 😜, because emojis and symbols like ™ are outside the 0-127 ASCII range. That is why the function tolerates up to 3 such characters: any name with more than 3 of them is treated as non-English, even if the rest of the name is in English.
Now that the function is ready, we're going to loop through both data sets and append every app identified as English to a new list.
ios_english = []
android_english = []
for row in ios_data:
    if english_check(row[1]):
        ios_english.append(row)
for row in android_clean:
    if english_check(row[0]):
        android_english.append(row)
explore_data(ios_english, 0, 3, True)
print('\n')
explore_data(android_english, 0, 3, True)
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 6183
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 9614
Number of columns: 13
After running the loop above, we are left with 6,183 English apps in the iOS list and 9,614 in the Android list. For each app that passed the English check, we added the whole row to a new list. These two lists will now be the basis of our analysis.
We have one more step before we begin our analysis: removing the non-free apps from both data sets. Since we are only interested in free apps, we need to filter out all the paid ones.
ios_free = []
android_free = []
for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)
for row in android_english:
    app_type = row[6]  # index 6 is the 'Type' column, not 'Price'
    if app_type == 'Free':
        android_free.append(row)
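explore_data(ios_free, 0, 3, True)
print('\n')
# Note: this call inspects android_english, so the Android row count shown
# is the count before the free filter; pass android_free to see the filtered list.
explore_data(android_english, 0, 3, True)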
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 3222
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 9614
Number of columns: 13
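The iOS filter above compares the price to the exact string '0.0', which would miss a free app whose price is written as '0', the form seen in the Android "Price" column. A possible, more tolerant check is to parse the price as a number, as sketched below. `is_free` is a hypothetical helper, and the '$4.99' format for paid apps is an assumption, since no paid row appears in the samples above.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    # Assumption: paid prices may carry a leading '$', e.g. '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than guessed at.
        return False


print(is_free('0.0'))    # True
print(is_free('0'))      # True
print(is_free('$4.99'))  # False
print(is_free('Free'))   # False
```

The last call shows that this helper is meant for a price column only. The Android "Type" column, which holds the text 'Free', still needs the string comparison used above.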