#!/usr/bin/env python
# coding: utf-8

# # Profitable App Profiles for the App Store and Google Play Markets 

# ## Introduction

# We are working as data analysts for a company that builds *Android* and *iOS* mobile apps. The company builds only *free apps* (free to download and install), and its main source of revenue is *in-app ads*. Revenue therefore depends on the number of users: the more users who see and engage with the ads, the higher the revenue. Our aim here is to help our developers understand what types of apps attract more users. At the end, we come up with a list of app genres that are likely to be profitable on both the *App Store* and *Google Play*.

# * Dataset containing ~ 10,000 Android apps from Google Play Store __[link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)__
# 
# * Dataset containing ~ 7,000 iOS apps from App Store __[link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)__ 

# ## Exploring the Data for Better Comprehension

# In[1]:


# Open both CSV files, read the data and transform it into lists of lists.
# The encoding is set explicitly rather than relying on the platform default,
# and the 'with' blocks close the files once they have been read.
import csv

with open('AppleStore.csv', encoding='utf8') as open_Apple:
    data_Apple = list(csv.reader(open_Apple))

with open('googleplaystore.csv', encoding='utf8') as open_google:
    data_google = list(csv.reader(open_google))
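

# As a small illustration of the "list of lists" structure, here is a hedged sketch that parses an in-memory CSV instead of a file. The sample text and the names *sample_csv*, *sample_data*, *sample_header* and *sample_body* are made up for this example and are not part of the datasets.

# In[ ]:


import csv
import io

# hypothetical CSV text with a header line and two apps
sample_csv = 'App,Category,Rating\nApp A,TOOLS,4.1\nApp B,GAME,3.9\n'

# csv.reader accepts any iterable of lines, including a StringIO object
sample_data = list(csv.reader(io.StringIO(sample_csv)))
sample_header = sample_data[0]   # the first row holds the column names
sample_body = sample_data[1:]    # the remaining rows hold the apps

# every value is read as a string, which is why numeric columns
# are converted with float() later in the analysis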


# In[2]:


# Define a function 'explore_data()' which prints a slice of the rows and,
# optionally, the number of rows and columns of the dataset
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    #loop through the dataset 
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows', len(dataset))
        print('Number of columns', len(dataset[0]))


# In[3]:


explore_data(data_Apple, 1, 4) #exploring the first three rows of the App Store data


# In[4]:


explore_data(data_google, 1, 4) #exploring the first three rows of the Google Play data


# In[5]:


#Printing the no.of rows (excluding the header) & no.of columns for the App Store data
print('Number of rows', len(data_Apple[1:]))
print('Number of columns', len(data_Apple[0]))


# In[6]:


#Printing the no.of rows (excluding the header) and no.of columns for the Google Play data
print('Number of rows', len(data_google[1:]))
print('Number of columns', len(data_google[0]))


# In[7]:


print('AppleStore column names:', data_Apple[0]) #Printing the header row for the App Store data
print('Required Columns: rating_count_tot, user_rating, cont_rating')


# `For more clarity check:` [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

# In[8]:


print('googleplaystore column names:', data_google[0]) #Printing the header row for the Google Play data
print('Required Columns: Rating, Reviews, Installs, Content Rating')


# `For more clarity check:` [link](https://www.kaggle.com/lava18/google-play-store-apps)

# **Above we explored both datasets by** 
# * Printing the *number of rows and columns*.
# * Looking at the *header row*.
# * Looking at the *body* of the data.

# ## Cleaning the Data for Ease of Analysis

# We are going to check whether any row in the *Google Play* data has missing values. To do so, we check whether the *length of any row is 'not equal' to the length of the header row*. We will delete such rows.

# In[9]:


#select the header row for the Google Play data
header_google = data_google[0]
header_len = len(header_google)

#loop over the data, keeping track of each row's index in data_google
for index, row in enumerate(data_google[1:], start=1):
    row_len = len(row)

    #if the length of the row differs from the length of the
    #header row, print the row and its index
    if row_len != header_len:
        print(row)
        print(index)


# In[10]:


del data_google[10473] #delete the row found above (run this cell only once)


# In[11]:


print(data_google[10473]) #check that the particular row has been deleted


# We will perform the same check on the *App Store* data as well.

# In[77]:


header_Apple = data_Apple[0]

no_row_len = 0
for row in data_Apple[1:]:
    header_len = len(header_Apple)
    row_len = len(row)
    
    if row_len != header_len:
        no_row_len += 1
        
if no_row_len == 0:
    print("There are no rows with missing data")
else:
    print(no_row_len)


# We found one row with missing data in the *Google Play* data and deleted it. There are no rows with missing data in the *App Store* data.
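
# The row-length check can also be wrapped in a reusable helper. This is a minimal sketch rather than part of the original analysis: the name *find_malformed_rows()* and the sample rows are assumptions made for illustration.

# In[ ]:


def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    header_len = len(dataset[0])
    return [index for index, row in enumerate(dataset)
            if len(row) != header_len]


# hypothetical sample: the row at index 2 is missing one value
sample_rows = [['App', 'Category', 'Rating'],
               ['App A', 'TOOLS', '4.1'],
               ['App B', '4.5'],
               ['App C', 'GAME', '3.9']]

malformed = find_malformed_rows(sample_rows)

# delete from the end so that the remaining indices stay valid
for index in sorted(malformed, reverse=True):
    del sample_rows[index]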

# ## Removing Duplicate Entries for Android Store: Part 1

# Here we are going to find out whether there are any duplicate apps in the *Google Play* data.

# In[13]:


#loop through the android data
for app in data_google:
    name = app[0]
    if name == 'Facebook': #we have taken 'Facebook' for checking duplicate entries 
        print(app)


# Below we will calculate the number of duplicate apps for the *Android Play Store* 

# In[14]:


duplicate_apps = [] #create an empty list called 'duplicate_apps'
unique_apps = []    #create an empty list called 'unique_apps'

#loop through the data and append each app name to the relevant list above
for app in data_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

#Calculate the number of duplicate apps
print('Number of duplicate apps:', len(duplicate_apps))


# Removing the duplicate rows manually (or at random) is a cumbersome and laborious process. So we should come up with a programmatic way to carry out this process.
# 
# **Here are a few methods we can implement:**

# `Option1:` Keeping the row with the highest number of reviews (column 4), on the assumption that the higher the number of reviews, the more recent the data, and removing the other duplicates. 

# `Option2:` Keeping the row with the highest number of installs (column 6), on the assumption that it is the most recent one, and removing the other duplicates.

# `Option3:` Keeping the row with the latest 'last updated' date (column 11), which will be the most recently updated entry, and removing the other duplicates. 

# `Option4:` Keeping the row with the latest current version (column 12), as it will be more recent than the others.
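
# `Option3` could be sketched as follows. This is only an illustration and not the method used in this project: it assumes dates written like 'January 7, 2018' (parsed with the format `%B %d, %Y`), and the helper name *keep_latest_update()* and the sample rows are hypothetical.

# In[ ]:


from datetime import datetime


def keep_latest_update(rows, name_index=0, date_index=1):
    """Keep, for each app name, the row with the most recent date."""
    latest = {}
    for row in rows:
        name = row[name_index]
        # assumed date format, e.g. 'January 7, 2018'
        updated = datetime.strptime(row[date_index], '%B %d, %Y')
        if name not in latest or updated > latest[name][0]:
            latest[name] = (updated, row)
    return [row for _, row in latest.values()]


# made-up rows of the form [name, last updated]
sample_updates = [['App A', 'January 7, 2018'],
                  ['App A', 'August 1, 2018'],
                  ['App B', 'June 20, 2018']]
latest_rows = keep_latest_update(sample_updates)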

# **Here we are going to use the first method**
# 
# * We will create a dictionary called *reviews_max*, where each *key* is an *app name* and each *value* is *max_reviews* (i.e. the maximum number of reviews recorded for that app).
# 
# * We will find out the length of the dictionary (in order to cross-check the answer):  
# `10840(total apps) - 1181(duplicate apps) = 9659`
#        
# * We will create a list called *android_clean*, to which we add the complete row of each app with the maximum number of reviews.
# 
# * We will create a list called *already_added*, to which we add the names of the apps that are already included in the *android_clean* list. (This extra check takes care of the case where more than one duplicate of an app has the same maximum number of reviews.) 
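# 
# The steps above can first be tried on a toy dataset. A minimal sketch, assuming made-up rows of the form [name, reviews]: the function name *dedupe_by_max_reviews()* is hypothetical, and a set is used for the already-added names (instead of a list) to keep the membership checks fast.

# In[ ]:


def dedupe_by_max_reviews(rows, name_index=0, reviews_index=1):
    """Keep one row per app name: the first row with the most reviews."""
    max_reviews = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in max_reviews or max_reviews[name] < n_reviews:
            max_reviews[name] = n_reviews

    clean = []
    added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # the 'added' check handles duplicates that tie on the maximum
        if n_reviews == max_reviews[name] and name not in added:
            clean.append(row)
            added.add(name)
    return clean


# hypothetical sample: 'App A' appears three times, twice with 25 reviews
sample_dupes = [['App A', '10'], ['App A', '25'],
                ['App A', '25'], ['App B', '3']]
sample_clean = dedupe_by_max_reviews(sample_dupes)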

# ## Removing Duplicate Entries from the Android Store: Part 2

# Below we apply the method described above.

# In[15]:


reviews_max = {}    #Create a dictionary

#loop through the rows and add the 'key' & 'values' to the dictionary
for row in data_google[1:]:
    name = row[0]
    n_reviews = float(row[3]) #change the row[3] data type to float & name it 'n_reviews'
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


# In[16]:


#Calculate the length of the dictionary (the expected length is 9659)
print('Actual length:', len(reviews_max))


# The length of the dictionary *reviews_max* exactly matches the expected length. Below we are going to use the *reviews_max* dictionary to remove the duplicate rows. 

# In[17]:


android_clean = []  #rows that are kept after removing the duplicates
already_added = []  #names of the apps already added to 'android_clean'

#loop through the rows 
for row in data_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    #append the row to 'android_clean' if it has the maximum number of
    #reviews for this app, and record the name in 'already_added' so
    #that the same app is not added twice
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
    

# In[18]:


#Here we are going to check our method by observing the cleaned data
# and by checking the number of rows
print(android_clean[:3])
print('Number of rows:', len(android_clean))


# As expected, we have *9659* rows.  

# ## Removing Non-English Apps: Part 1

# The company we are working for builds apps for an English-speaking audience, so we need only English apps for our analysis. We will remove all other apps.
# 
# First we will define a function which uses the *ord()* function to check whether an app name contains only characters commonly used in English text, i.e. ASCII characters, whose code points fall in the range 0 to 127. 
# 
# Below we define a function *english_app()*, with *string* as a parameter. It checks whether the given app name is English or not.

# In[19]:


def english_app(string):
    
    #loop through the string and return False as soon as a
    #character falls outside the ASCII range (0 to 127)
    for character in string:
        if ord(character) > 127:
            return False
    
    #every character is within the ASCII range
    return True


# In[20]:


print(english_app('Instagram'))


# In[21]:


print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# In[22]:


print(english_app('Docs To Go™ Free Office Suite'))


# In[23]:


print(english_app('Instachat 😜'))


# To check the outcomes of the function, we ran it with *English* and *non-English* app names, including names that contain a special character ('™') or an emoji.

# ## Removing Non-English Apps: Part 2

# If we use the above function, we may lose some English apps along with the non-English ones, because names such as 'Docs To Go™ Free Office Suite' or 'Instachat 😜' contain characters outside the ASCII range. So we are going to define one more function, very similar to the earlier one. Here we will only remove an app if its name has more than three characters falling outside the ASCII range. This means that if the name has up to three emoji or other special characters, the app will still be labelled as an English app.

# In[24]:


# define a function E_A() with string as a parameter
def E_A(string):
    ord_list = 0  #counter for the characters outside the ASCII range
    
    #loop through the string; if the character's code point is greater 
    #than 127, increment the counter
    for character in string:
        if ord(character) > 127:
            ord_list += 1
    
    #return False if more than three characters fall outside the
    #ASCII range, else True
    return ord_list <= 3


# In[25]:


print(E_A('Docs To Go™ Free Office Suite'))


# In[26]:


print(E_A('Instachat 😜'))


# In[27]:


print(E_A('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# Above we checked the function on a few app names, and it works properly.
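
# A more compact variant of the same idea is given here only as a hedged sketch: the name *is_probably_english()* and the *max_non_ascii* parameter are assumptions introduced for illustration, with the default mirroring the threshold of three described earlier.

# In[ ]:


def is_probably_english(name, max_non_ascii=3):
    """Return True if at most max_non_ascii characters are non-ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


# a stricter threshold can be passed when needed (hypothetical usage)
strict_check = is_probably_english('Instachat 😜', max_non_ascii=0)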

# Now we are going to apply the *E_A()* function to both the *Google Play* and the *App Store* data.

# In[28]:


android_English_App = [] #create an empty list 

#loop through the cleaned data and append the rows 
#of the English apps to the list above
for row in android_clean:
    app = row[0]
    
    if E_A(app):
        android_English_App.append(row)


# Apply the function *explore_data()* on the list *android_English_App* and observe the results

# In[29]:


#explore_data() prints its results itself and returns None,
#so it is called without an extra print()
explore_data(android_English_App, 0, 4, rows_and_columns=True)


# Repeat the same process on *Apple Store Data*

# In[30]:


apple_English_App = [] #create an empty list

#loop through the data and append the rows of the
#English apps to the above list
for row in data_Apple[1:]:
    app = row[1]
    
    if E_A(app):
        apple_English_App.append(row)


# Apply the *explore_data()* function on the list *apple_English_App* and observe the data

# In[31]:


explore_data(apple_English_App, 0, 4, rows_and_columns=True)


# We have successfully removed the non-English apps from both the *Google Play* and the *App Store* data.

# ## Isolating the Free Apps from the Paid Apps

# The last step of the data cleaning is to isolate the free apps from the paid apps. As mentioned at the beginning, the company is only interested in free apps, and its main revenue comes from in-app ads. 
# 
# Below we separate the free apps from the paid apps for both the *Google Play* and the *App Store* data.

# In[32]:


#create empty lists 
android_free_app = []
ios_free_app = []

#loop through the data and append the rows with zero price
#to the appropriate list
for row in android_English_App:
    price = row[7]
    if price == '0':
        android_free_app.append(row)
    
for row in apple_English_App:
    price = row[4]
    #exploration of the data shows that for iOS
    #zero price is listed as '0.0'
    if price == '0.0':
        ios_free_app.append(row)
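

# The string comparisons above depend on how each store writes a zero price. A hedged alternative is to parse the price as a number first. The helper *is_free()* below is hypothetical, and it assumes that paid prices look like '$4.99' or '4.99'.

# In[ ]:


def is_free(price):
    """Return True if the price string represents zero."""
    # strip surrounding whitespace and an optional leading dollar sign
    return float(price.strip().lstrip('$')) == 0.0


# hypothetical price strings in both styles
sample_prices = ['0', '0.0', '$4.99', '2.99']
sample_free_flags = [is_free(price) for price in sample_prices]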


# In[33]:


explore_data(android_free_app, 0, 3, rows_and_columns=True)


# In[34]:


explore_data(ios_free_app, 0, 3, rows_and_columns=True)


# We applied the *explore_data()* function to both lists and observed the number of rows and columns.

# ## Most Common Apps by Genre: Part 1

# Our goal is to find an app profile that attracts users on both the *App Store* and *Google Play*. Once we identify such a profile, we would like to validate our recommendation by first building a new app fitting this profile on one of the platforms (say Android), observing its usage and, if it is successful, porting the app to the other platform.
# 
# **Our validation strategy for an app idea has 3 steps:**
# 
# 1) Build a minimal Android version of the app and add it to *Google Play*.
# 
# 2) If the app has a good response from users, develop it further.
# 
# 3) If the app is profitable after six months in *Google Play*, build an iOS version of the app and add it to the *App Store*.
# 
# 

# **For generating frequency tables to find out the most common genres we can use:**
# 
# For Google Play: `Column 2 (Category)` and `Column 10 (Genres)`
# 
# For the App Store: `Column 12 (prime_genre)`
# 
# 

# ## Most Common Apps by Genre: Part 2

# First we are going to define a function which creates a frequency table. This frequency table shows the percentage of apps in each genre.

# In[35]:


#create a function called freq_table() which takes 
#dataset & index as parameters
def freq_table(dataset, index): 
    dict_freq = {}
    total = 0
    
    #loop through the rows, count them in 'total' and
    #generate the frequency table with genre as key 
    #and number of apps in that genre as value
    for row in dataset:
        total += 1
        genre = row[index]
        
        if genre in dict_freq:
            dict_freq[genre] += 1
        else:
            dict_freq[genre] = 1         
                   
    #generate a second table with genre as key and the
    #percentage of apps in that genre as value
    dict_freq_percentage = {}
    for genre in dict_freq:
        percentage = (dict_freq[genre] / total) * 100
        dict_freq_percentage[genre] = percentage
        
    return dict_freq_percentage


# We will create one more function called *display_table()*, which will display the genre percentages in a descending order 

# In[36]:


#create a function display_table with dataset and index as parameters
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []    #create an empty list
    
    #loop through the key in the above table and append the 
    #above list with a tuple of table[key] & key
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    #sort the list using sorted function in descending order
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
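

# For comparison, the same kind of sorted percentage table can be built with *collections.Counter* from the standard library. This is a sketch on made-up rows, not a replacement for the functions above; the name *freq_table_counter()* is hypothetical.

# In[ ]:


from collections import Counter


def freq_table_counter(dataset, index):
    """Return (genre, percentage) pairs sorted by descending count."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (genre, count) pairs, highest count first
    return [(genre, count / total * 100)
            for genre, count in counts.most_common()]


# hypothetical rows of the form [name, genre]
sample_genres = [['a', 'Games'], ['b', 'Games'],
                 ['c', 'Games'], ['d', 'Music']]
sample_table = freq_table_counter(sample_genres, 1)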


# Below we apply the *display_table()* function on *'android Category'*, *'android Genre'* and *'ios prime_genre'* and observe the output

# In[37]:


#display_table() prints the table itself and returns None,
#so it is called without an extra print()
print('android Category:')
display_table(android_free_app, 1)
print('\n')
print('android Genre:')
display_table(android_free_app, 9)
print('\n')
print('ios prime_genre:')
display_table(ios_free_app, 11)


# ## Most Common Apps by Genre: Part 3

# **Our analysis of the frequency table generated for prime_genre of the App Store data set:**
# 
# * The frequency table shows that the most common genre is *Games* (more than 50%), and the next most common genre is *Entertainment*. 
# 
# * The majority of the apps (i.e. more than 70%) are designed for *entertainment* rather than for *practical purposes*.
# 
# * It is premature to recommend an app profile based on the above frequency table, as this table is built from the *app genre* and not from any kind of user information. 
# 
# 
# **Our analysis of the frequency tables generated for the Category and Genres columns of the Google Play data set:**
# 
# * For *Google Play* the most common genres are *Entertainment* and *Tools*. Among the *categories*, *Family* with *~ 18%* is at the top of the table. Among the *genres*, *Tools* with *~ 8%* is at the top of the table.  
# 
# * Here the shares of *practical purpose apps* and *entertainment purpose apps* are almost equal, unlike in the *App Store*.
# 
# * As in the case of the *App Store*, it is not possible to recommend an app profile for *Google Play* either, as these tables are based on the *app genre/category* and not on any user information.

# ## Most Popular Apps by Genre on Apple Store

# Below we are going to list each *genre* and its respective *average rating count* for the 'App Store'.

# In[38]:


#generate the unique genres using the frequency table
#of the 'prime_genre' column
unique_genre = freq_table(ios_free_app, 11)

#loop through the unique genres
for genre in unique_genre:
    total = 0 #sum of the rating counts
    len_genre = 0 #no. of apps specific to each genre
    
    #loop through ios_free_app and pick the rows
    #that belong to the current genre
    for row in ios_free_app:
        genre_app = row[11]
        if genre_app == genre:
            rating_counts = float(row[5])
            total += rating_counts
            len_genre += 1
    
    #calculate the average rating counts
    avg_rating_counts = total / len_genre
    print(genre, ':', avg_rating_counts)
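

# The nested loops above scan the whole dataset once per genre. A single pass is also possible; the sketch below uses *collections.defaultdict* on made-up rows, and the function name *average_by_group()* is an assumption for illustration.

# In[ ]:


from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: average value} computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# hypothetical rows of the form [genre, rating count]
sample_ratings = [['Games', '100'], ['Games', '300'], ['Music', '50']]
sample_averages = average_by_group(sample_ratings, 0, 1)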


# Here we recommend an app profile for the *iOS App Store* based on the number of user ratings.
# 
# The average rating counts per genre show that there are *five* genres with an average of *>50000* user ratings. We list them in descending order:
# 
# * Navigation
# * Reference 
# * Social Networking
# * Music
# * Weather
# 
# A few other genres with average rating counts of *>30000* are listed below in descending order:
# 
# * Book
# * Food & Drink
# * Finance
# 
# Some genres with *<30000* are listed below:
# 
# * Travel
# * Photo & Video
# * Shopping

# Below we list the *Navigation* apps with their respective rating counts for the 'App Store', as this genre has the highest average rating count of *~86000*. 

# In[39]:


for row in ios_free_app:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])


# ## Most Popular Apps by Genre: Google Play

# Below we are going to list the *category* and respective *average number of installs* for 'Google Play Store'

# In[40]:


#generate the unique categories using the frequency table
unique_app_genre_android = freq_table(android_free_app, 1)

#loop through the unique categories
for category in unique_app_genre_android:
    total = 0
    len_category = 0
    
    #loop through the android free app & select the category
    for row in android_free_app:
        category_app = row[1]
        
        #apply the conditional statement to the category
        #and select installs
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs #add the installs to total
            len_category += 1   #count the apps in the category
            
    #calculate the average installs and print it
    avg_no_n_installs = total / len_category
    print(category, ':', avg_no_n_installs)
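

# The clean-up of the install counts can be isolated in a small helper. A hedged sketch: the name *parse_installs()* is hypothetical, and it assumes values formatted like '1,000,000+'. Because such a value is open-ended, the averages computed from it are only approximate.

# In[ ]:


def parse_installs(installs):
    """Convert a string such as '1,000,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))


# hypothetical install strings
sample_installs = ['1,000,000+', '500+', '0']
sample_install_values = [parse_installs(value) for value in sample_installs]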


# Here we recommend an app profile for *Google Play* based on the number of installs.
# 
# The average number of installs per category shows that there are *nine* categories with an average of *> 10000000* installs. We list them in descending order.
# 
# * Communication
# * Video-Players
# * Social
# * Photography
# * Productivity
# * Game
# * Travel & Local
# * Entertainment
# * Tools

# Below we list a few *Communication* apps (a category with an average of *~ 38456119* installs) for 'Google Play'.

# In[41]:


#only the first 290 rows are scanned to keep the output short
for row in android_free_app[:290]:
    if row[1] == 'COMMUNICATION':
        print(row[0], ':', row[5])


# Based on our analysis and observations, we have come up with a list of app genres that are popular on both the *App Store* and *Google Play*. 
# 
# * Social/Social Networking 
# * Photography/Photo & Video
# * Travel & Local
# * Entertainment/Music
# * Book & Reference
# * Finance
# * Communication
# * Games
# * Food & Drink

# ## Conclusion
# 
# In this project we analyzed app data from the *App Store* and *Google Play*. We needed to recommend a free app profile that will be profitable on both. We have come up with a list of app genres that can be profitable in both stores. These genres are listed below:
# 
# * Social/Social Networking 
# * Photography/Photo & Video
# * Travel & Local
# * Entertainment/Music
# * Book & Reference
# * Finance
# * Communication
# * Games
# * Food & Drink
#