#!/usr/bin/env python
# coding: utf-8

# ## Profitable App Profiles for the App Store and Google Play Markets
# 
# This project identifies which mobile app profiles are profitable for the App Store and Google Play markets.
# 
# The apps are free to download, so the company's main source of revenue is in-app ads. The more users who see and engage with the ads, the more revenue. To gain more users, our goal is to help our developers understand which types of apps are more attractive.
# 
# 
# ## Opening and Exploring the Data
# 
# To avoid spending too much time and resources collecting data, we will use existing sample data. We found 2 datasets that will help us achieve our goal.
# 
# - The 1st dataset ([Link](https://www.kaggle.com/lava18/google-play-store-apps)) contains around **10,000** Android apps from Google Play, collected in August 2018. The dataset can be downloaded directly [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
# 
# 
# - The 2nd dataset ([Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)) contains around **7,000** apps from the App Store, collected in July 2017. The dataset can be downloaded directly [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
# 
# First we will *open* the 2 datasets and then *explore* them:

# In[1]:


### Opening the 2 datasets ##
from csv import reader

## Google Play Dataset ##
opened_file = open('googleplaystore.csv', encoding='utf8')  ## explicit encoding: some app names are non-ASCII ##
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

## App Store Dataset ##
opened_file = open('AppleStore.csv', encoding='utf8')  ## explicit encoding: some app names are non-ASCII ##
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
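

# A minimal sketch of the same loading step wrapped in a helper, assuming Python 3. The name *load_csv* is hypothetical and not part of the original project. It uses a *with* block so the file is closed automatically, and passes an explicit encoding because some app names contain non-ASCII characters.
# 
# Possible usage, assuming the CSV files are in the working directory: `android_header, android = load_csv('googleplaystore.csv')`

# In[ ]:


from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at path (hypothetical helper)."""
    ## newline='' is the setting the csv module documentation recommends for reader ##
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]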


# Now that the 2 datasets have been *opened*, we can proceed to *exploring* them.
# 
# First we will write a function called *explore_data* with an option to show the number of rows and columns per dataset:

# In[2]:


## Generic helper, used below for both the Google Play and App Store datasets ##

def explore_data(dataset, start, end, rows_and_columns=False): ## 4 parameters ##
    dataset_slice = dataset[start:end]    ## slices the dataset ##
    for row in dataset_slice:  ## loops through the slice ##
        print(row)     ## each iteration prints a row ##
        print('\n')    ## adds empty lines after the row ##

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)
    

# After running the *explore_data* function on the Google Play dataset, we find that it has **10,841** apps and **13** columns.
# 
# The columns that will help us the most with our analysis are: *'App'*, *'Category'*, *'Reviews'*, *'Installs'*, *'Type'*, *'Price'*, *'Content Rating'* and *'Genres'*.
# 
# Now let's take a look at the App Store dataset:
# 
# 

# In[3]:


print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)


# As shown above, the App Store dataset contains **7,197** apps and **16** columns.
# 
# The columns that will help us with our analysis are: *'track_name'*, *'currency'*, *'price'*, *'rating_count_tot'*, *'rating_count_ver'* and *'prime_genre'*.
# 
# Details about the column names can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
# 

# # Wrong Data Input and How to Delete It
# 
# Datasets sometimes contain data entry errors, and the affected rows must be removed or corrected before analysis. We will look for such an error in the Google Play dataset.
# 
# The dataset's [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) is a place to communicate with other users and resolve problems we may run into.
# 
# One [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in particular reports an error in row **10472**. Below we will find this row and delete it.

# In[4]:


### To find the error, we look for any row whose length differs from the header's ###

print(android_header)  ### print the header to show the right length ###
print('\n')

for index, row in enumerate(android):  ### loop through Google Play data set ###
    if len(row) != len(android_header):
        print(row)
        print('\n')
        print(index)  ### print the location of the row ###
        

# The output above shows that row ***10472*** corresponds to the app *'Life Made WI-Fi Touchscreen Photo Frame'*.
# 
# Directly after the app name, the *'Category'* value is missing, so every later value sits one column to the left of where it belongs.
# 
# As a result, *'Category'* is matched with ***'1.9'*** and *'Rating'* is matched with ***'19'***. The row is invalid and needs to be deleted.

# In[5]:


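### deleting the error ###

print(len(android))
### the length check makes this cell safe to re-run: a valid row is never deleted ###
if len(android[10472]) != len(android_header):
    del android[10472]
print(len(android))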


# # Removing Duplicate Entries
# 
# ## Part One:
# 
# As we explore further into the Google Play dataset, we come across another problem: some apps have duplicate entries. For example, *'Facebook'* has a total of 2 entries.

# In[6]:


for app in android:
    name = app[0]
    if name == 'Facebook':
        print(app)


# Next we will figure out how many duplicate entries there are in total within the Google Play dataset:

# In[7]:


### Start by creating 2 lists ###

duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:  ### the in operator checks for membership within a list ###
        duplicate_apps.append(name) ### name already in unique_apps: record it as a duplicate ###
    else:
        unique_apps.append(name) ### first time we see this name: add it to unique_apps ###
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:' , duplicate_apps[:15])


# In total, there are ***1,181*** cases where an app occurs more than once.
# 
# It's very important to keep only one entry per app so that we do not waste time and resources, or count the same app twice, while analyzing the data.
# 
# If you look at the two rows we printed for *'Facebook'*, there is one difference between them. The value at index 3 of each list, *'Reviews'*, is not the same, which shows that the data was collected at different times.
# 
# The higher the number of reviews, the more recent, and therefore the more reliable, the data should be. With this in mind, we will keep the row with the highest review count and remove the other entries.
# 

# ## Part Two:

# First we will create a dictionary so that each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

# In[8]:


reviews_max = {}

for app in android: 
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews    
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
 

# Previously we found that there are ***1,181*** cases where an app occurs more than once.
# 
# With this information, we can subtract ***1,181*** from the length of our dataset and check that the result equals the length of our dictionary (*reviews_max*).

# In[9]:


print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))


# Now we will use the *reviews_max* dictionary created in the cell above to remove all the duplicates.
# 
# 1st, we will start by creating two empty lists:
# - android_clean (store our cleaned data set)
# - already_added (store app names)
# 
# 2nd, we will loop through the Android dataset and for every iteration:
# - isolate app name and number of reviews
# - add current row (app) to android_clean list ***and*** the app name (name) to the already_added list if:
#    - number of reviews of current app matches the number of reviews of that app as described in the reviews_max dictionary; and
#    - name of the app is not in the already_added list. (In some cases, more than one duplicate entry of an app has the same highest number of reviews. Without this condition, all of those entries would be added.)

# In[10]:


android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

# Let's explore the *android_clean* dataset to confirm that we have ***9,659*** rows:

# In[11]:


explore_data(android_clean, 0, 3, True)


# We can now confirm that we have ***9,659*** rows as anticipated.
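
# The two passes above can also be collapsed into one. Below is a minimal sketch, assuming the app name is at index 0 and the review count at index 3. The helper name *keep_most_reviewed* and the sample rows are hypothetical. When two rows tie on review count, the first one seen is kept, which is the same row the *already_added* check selects. The order of the returned rows may differ from *android_clean*, since rows come back in order of each name's first appearance.

# In[ ]:


def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  ### app name -> row with the most reviews seen so far ###
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        ### strict '>' keeps the earliest row when review counts tie ###
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


### made-up rows in the Google Play column order (App, Category, Rating, Reviews) ###
sample_rows_dedup = [
    ['App A', 'SOCIAL', '4.1', '100'],
    ['App B', 'TOOLS', '4.5', '50'],
    ['App A', 'SOCIAL', '4.1', '250'],
    ['App B', 'TOOLS', '4.5', '50'],
]
sample_clean = keep_most_reviewed(sample_rows_dedup)
print(len(sample_clean))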

# # Time to Remove Non-English Apps
# 
# For this project, we are only interested in apps aimed at an English-speaking audience. Each letter in the English alphabet has a corresponding number associated with it in the ASCII standard.
# 
# In the ASCII standard, all English characters correspond to numbers between ***0*** and ***127***.
# 
# Let's build a function that checks an app's name and tells us if it contains non-ASCII characters.
# 
# ## Part One:
# 

# Using the built-in *ord( )* function, we will find out the corresponding number of each character.
# 

# In[12]:


def english_lang(string):
    for character in string:
        if ord(character) > 127:
            return False        
        
    return True  

### Time to check our work with these app names ###

print(english_lang('Instagram'))
print(english_lang('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# The function is working properly.
# 
# Yet we come across another problem. Some English app names contain emojis or other symbols outside the ASCII range. The function would remove these apps, and we would lose useful data.

# In[13]:


### useful apps being removed because of emojis, symbols, etc... ###

print(english_lang('Docs To Go™ Free Office Suite'))
print(english_lang('Instachat 😜'))
print('\n')

print(ord('™'))   ### symbol is > 127 so it thinks the app is not in English ###
print(ord('😜'))


# ## Part Two:

# It's very important to minimize the impact of data loss.
# 
# We will only remove an app if its name has more than ***3*** non-ASCII characters.

# In[14]:


def english_lang(string):
    non_ascii = 0 
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True
        
### time to check the above function ###

print(english_lang('Docs To Go™ Free Office Suite'))
print(english_lang('Instachat 😜'))
print(english_lang('爱奇艺PPS -《欢乐颂2》电视剧热播'))


# Without spending too much time on optimization, our function works the way we want it to, with very few non-English apps passing through.
# 
# Now we will use the *english_lang* function we created to filter out the non-English apps from both datasets.

# In[15]:


android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if english_lang(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if english_lang(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)


# We are left with ***9,614*** Android apps and ***6,183*** iOS apps.
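
# The counting rule behind *english_lang* can also be written more compactly. This is only a sketch of the same heuristic: the names *count_non_ascii* and *looks_english* and the *max_non_ascii* parameter are hypothetical, and the default of 3 mirrors the threshold chosen above.

# In[ ]:


def count_non_ascii(name):
    """Count the characters whose code point falls outside the ASCII range."""
    return sum(1 for character in name if ord(character) > 127)


def looks_english(name, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii


### the same three names used to check english_lang above ###
print(looks_english('Docs To Go™ Free Office Suite'))
print(looks_english('Instachat 😜'))
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))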

# # Isolating the Free Apps
# 
# The last step of our data cleaning process is to isolate the free apps from the non-free apps.
# 
# 

# In[16]:


android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
### checking the length of each dataset to see how many apps remain ###

print(len(android_final))
print(len(ios_final))    


# After isolating the free apps from the non-free ones, we are left with ***8,864*** Android apps and ***3,222*** iOS apps.
# 
# Time to begin our analysis!

# # Most Common Apps by Genre
# 
# ## Part one:
# 
# With the apps being free to download and use, our revenue comes from in-app ads. Our goal is to find what kinds of apps attract the most users. The more users an app has, the more revenue it brings in.
# 
# Our validation strategy for an app, meant to minimize risk and overhead, consists of three steps:
# 1. Build an Android version of the app, and add it to Google Play.
# 2. If it has a positive response from users, develop it further.
# 3. If the app is profitable after 6 months, build an iOS version of the app and add it to the App Store.
# 
# Our end goal is to add the app to both Google Play and the App Store. For this to happen, we must find app profiles that are successful in both markets.
# 
# We will begin our analysis by determining the most common genres for each market.
# 
# We start by building a frequency table for the *prime_genre* column of the App Store dataset and for the *Genres* and *Category* columns of the Google Play dataset.
# 
# 

# ## Part Two:
# 
# To analyze these frequency tables we will build two functions:
# 1. One function to generate frequency tables that show percentages
# 2. Another function to display the percentages in descending order
# 

# In[17]:


def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_as_tuple = (table[key], key)
        table_display.append(key_value_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
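

# For comparison, here is a minimal sketch of the same percentage table built with *collections.Counter* from the standard library. The name *freq_table_counter* and the sample rows are hypothetical. The returned dictionary is already ordered from most to least common. Ties keep first-appearance order, which can differ from the tuple sort used in *display_table*.

# In[ ]:


from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    ### most_common() yields (value, count) pairs sorted by descending count ###
    return {value: count / total * 100 for value, count in counts.most_common()}


### made-up rows: (name, genre) ###
sample_rows_freq = [
    ['App A', 'Games'],
    ['App B', 'Travel'],
    ['App C', 'Games'],
    ['App D', 'Games'],
]
for value, percentage in freq_table_counter(sample_rows_freq, 1).items():
    print(value, ':', percentage)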
        

# ## Part Three:
# 
# Let's analyze the frequency table we generated for the *prime_genre* column of the App Store dataset.

# In[18]:


display_table(ios_final, 11)


# Above, we can see that more than half (***58.16%***) of the apps are in the *Games* genre, which makes it the most common genre on the App Store.
# 
# It is followed by *Entertainment* (***7.88%***), then *Photo & Video* (***4.96%***), and so on.
# 
# Most of the free apps are designed for entertainment, such as *Games*, *Photo & Video*, *Social Networking* and *Sports*. Apps designed for practical purposes, such as *Education*, *Shopping* and *Lifestyle*, are much less common.
# 
# This tells us that a large share of the free apps on the App Store are games. Yet this does not imply that games have the most users: supply might not match demand.
# 
# We will continue by analyzing the frequency tables for the *Category* and *Genres* columns of the Google Play dataset.
# 

# In[19]:


### Start with the Category column ###

display_table(android_final, 1)


# Google Play has a stronger showing of apps designed for practical purposes (*Family*, *Tools*, *Business*, etc.).
# 
# The leading category is *Family* at ***18.90%***, followed by *Game* at ***9.72%***, and so on.
# 
# As we investigate further, we find that the *Family* category contains a lot of games for kids. The apps under the family category can be seen on the Google Play store: [Google Play store: Family](https://play.google.com/store/search?q=family&c=apps&hl=en_US&gl=US).
# 

# In[20]:


### Genre Column ###

display_table(android_final, -4)


# The difference between the *Category* column and the *Genres* column is that the *Genres* column has more distinct values. Since we are looking at the bigger picture, we will only use the *Category* column going forward.
# 
# Now let's summarize what we know. The App Store is dominated by apps for entertainment and fun, while Google Play is more diverse, with a better balance between entertainment apps and practical apps.
# 
# The frequency tables we generated only tell us what percentage of apps falls into each genre and category. Now it's time to find out which genres are the most popular (have the most active users).

# # The Most Popular Apps by Genre on App Store
# 
# To find which genres are the most popular (have the most users), we will calculate the average number of installs for each app genre.
# 
# For the Google Play dataset, this information is in the *Installs* column. The App Store dataset has no install counts, so as a proxy we will use the total number of user ratings, found in the *rating_count_tot* column.
# 
# Let's start by calculating the average number of user ratings per app genre on the App Store:

# In[21]:


genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
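

# The nested loop above scans the whole dataset once per genre. As a sketch of an alternative, the averages can be accumulated in a single pass with two dictionaries. The helper name *average_by_group* and the sample rows are hypothetical; for the App Store data the call would be *average_by_group(ios_final, -5, 5)*.

# In[ ]:


def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} using one pass over rows."""
    totals = {}  ### group -> running sum of values ###
    counts = {}  ### group -> number of rows seen ###
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


### made-up rows: (genre, rating count) ###
sample_rows_avg = [
    ['Navigation', '300'],
    ['Travel', '50'],
    ['Navigation', '100'],
]
for group, average in average_by_group(sample_rows_avg, 0, 1).items():
    print(group, ':', average)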


# Looking at the list above, we can see that the *Navigation* genre has the highest average number of user ratings. This is not a surprise given the influence of big players such as Waze and Google Maps.
# 
# Waze and Google Maps together have close to half a million user ratings:

# In[22]:


for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])


# With these big players dominating the genre, competition will be cutthroat.
# 
# Let's take a look at the *Travel* genre:

# In[23]:


for app in ios_final:
    if app[-5] == 'Travel':
        print(app[1], ':', app[5])


# The *Travel* genre shows more potential for revenue because dominance is spread across a more diverse set of big players.
# 
# Several apps in this genre offer a multitude of services on one platform, such as purchasing an airline ticket, a hotel stay and a rental car (*TripAdvisor Hotels Flights Restaurants*). Others focus strictly on one purpose, such as purchasing airline tickets (*United Airlines*).
# 
# This diversity offers openings that do not really exist in a genre such as *Navigation*.
# 
# Now it's time to analyze the Google Play market.

# # The Most Popular Apps by Genre on Google Play
# 
# Since we have data on the number of installs, it should be easier to gauge genre popularity. However, the numbers are not precise, and most are open-ended (***1,000,000+***, ***100,000+***, ***10,000+***, etc.).

# In[25]:


display_table(android_final, 5)


# With this information we can't tell whether an app listed at ***10,000+*** installs really has 10,000, 20,000, or as many as 49,000 installs, so the data is not very precise.
# 
# However, we only want to find out which app genres attract the most users, so we don't need perfect precision in the number of users. We will leave the numbers as they are.
# 
# ***100,000+*** installs will count as 100,000 installs, ***10,000+*** installs will count as 10,000 installs, and so on.
# 
# We will start by computing the average number of installs for each genre (category), converting each install number to a float. We must remove the commas and plus characters first; otherwise the conversion will raise an error.

# In[26]:


categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
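

# The cleaning step inside the loop above can be isolated into a small helper, which also makes it easy to check on its own. This is a minimal sketch. The name *parse_installs* is hypothetical, and it assumes install values look like the ones in this dataset (digits, commas and an optional trailing plus sign).

# In[ ]:


def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float lower bound."""
    return float(raw.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))
print(parse_installs('0'))

### without the cleanup, float() rejects the raw string ###
try:
    float('10,000+')
except ValueError as error:
    print('ValueError:', error)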


# The list above tells us that *Communication* apps have the most installs on average at ***38,456,119***, followed by *Video_Players* at ***24,727,872*** and *Social* at ***23,253,652***.
# 
# Each of these three categories is heavily influenced by big players.
# - **Communication** - WhatsApp, Facebook Messenger, Skype, etc.
# - **Video_Players** - YouTube, Google Play Movies & TV, etc.
# - **Social** - Facebook, Instagram, etc.
# 
# This tells us that a small number of apps account for most of the installs:

# In[27]:


for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])


# This shows that the category is dominated by a few giants, which makes competition for revenue cutthroat.
# 
# Let's try to find a category with a decent number of installs and fewer hugely popular apps that skew the average: *TRAVEL_AND_LOCAL*
# 

# In[28]:


for app in android_final:
    if app[1] == 'TRAVEL_AND_LOCAL':
        print(app[0], ':', app[5])


# There is a variety of apps in the *TRAVEL_AND_LOCAL* category, yet fewer hugely popular apps that skew the average:

# In[29]:


for app in android_final:
    if app[1] == 'TRAVEL_AND_LOCAL' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])


# With very few hugely popular apps, this market shows promising potential.
# 
# Let's get some app ideas based on the apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 50,000,000 downloads):

# In[30]:


for app in android_final:
    if app[1] == 'TRAVEL_AND_LOCAL' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])


# Above are the apps in the *TRAVEL_AND_LOCAL* category with between 1 million and 50 million downloads. Let's break them down into intervals to see if we can find a pattern:
# - ***1,000,000*** - ***5,000,000***
# - ***5,000,000*** - ***10,000,000***
# - ***10,000,000*** - ***50,000,000***

# In[31]:


### print the apps one install bucket at a time, with empty lines between buckets ###
install_buckets = ['1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+']

for position, bucket in enumerate(install_buckets):
    if position > 0:
        print('\n')
    for app in android_final:
        if app[1] == 'TRAVEL_AND_LOCAL' and app[5] == bucket:
            print(app[0], ':', app[5])
        

# We will group the apps in each install bucket (***1,000,000*** - ***50,000,000***) by the service they provide.
# 
# (***1,000,000***)  Category : Amount
# - Airline: 11
# - Travel Organizer (Flights, hotels, cars): 10
# - Lodging: 7
# - Navigation: 5
# - Railroad: 1
# - Restaurant: 1
# - Car rental: 1
# 
# So in the lowest install bucket, the most common apps are those where a consumer can purchase an airline ticket (*Qatar Airways*, *British Airways*, etc.). Close behind are the one-stop-shop apps where a consumer can purchase tickets, hotels and rental cars in the same app (*TripIt: Travel Organizer*, *CityMaps2Go Plan Trips Travel Guide Offline Maps*, etc.).
# 
# (***5,000,000***) Category : Amount
# - Airline: 8 
# - Travel Organizer (Flights, hotels, cars): 4
# - Navigation: 4
# - Lodging: 3
# - Railroad: 1
# - Restaurant: 1
# - Car rental: 0
# 
# As the number of installs increases, there are no major changes, except that there are ***0*** apps strictly for renting a car and Navigation apps overtake Lodging apps.
# 
# 
# (***10,000,000***) Category : Amount
# - Travel Organizer (Flights, hotels, cars): 6
# - Airline: 4
# - Navigation: 4
# - Lodging: 4
# - Railroad: 2
# - Restaurant: 1
# - Gas: 1
# - Car rental: 0
# 
# We are starting to see a pattern where Travel Organizer apps, Airline apps and Navigation apps are the most popular in terms of installs.
# 
# (***50,000,000***) Category : Amount
# - Navigation: 3
# - Travel Organizer (Flights, hotels, cars): 1
# - Airline: 0
# - Lodging: 0
# - Railroad: 0
# - Restaurant: 0
# - Gas: 0
# - Car rental: 0
# 
# 
# There is strong competition among *Airline* apps, *Travel Organizer* apps and *Navigation* apps. There is significantly less competition among *Railroad* apps, *Restaurant* apps and *Car rental* apps.

# # Conclusion
# 
# In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.
# 
# The *Travel* genre has a decent number of installs and active users in both markets. This alone gives it the potential to be a profitable venture under our in-app ad revenue model.
# 
# We took the analysis a step further and found that, within the *Travel* genre, some apps offer many different services in one app while others offer one select service per app. This gives consumers a diverse set of options when deciding which app to use for their travels.
# 
# This is where we can profit. With such diversity, we have a larger audience. We can experiment with both approaches: one app that is a one-stop shop for the consumer, and apps that are strictly dedicated to one purpose. We could even take it a step further and let one app promote the other when it offers a better deal for the consumer: a win-win.