TOPIC:
Identifying the profile of a profitable app for the Android and iOS markets.
ABSTRACT:
As a company that builds free apps, our main source of revenue is in-app advertising. That means our profit depends mostly on the number of active users. Our goal is to analyze the data and tell our developers what kinds of apps are likely to attract more active users.
DATA:
We will analyze two data sets that seem suitable for our purpose:
A data set containing data about approximately ten thousand Android apps from Google Play.
A data set containing data about approximately seven thousand iOS apps from the App Store.
#Load libraries
import csv
from csv import reader
import os
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Load data
android=pd.read_csv('googleplaystore.csv')
android.head(2)
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
android['Category'].unique()
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9'], dtype=object)
#Genres gives a more detailed description than Category, so we will use it because it carries more information
android['Genres'].unique()[:10]
array(['Art & Design', 'Art & Design;Pretend Play', 'Art & Design;Creativity', 'Art & Design;Action & Adventure', 'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business', 'Comics', 'Comics;Creativity'], dtype=object)
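The genre strings above can hold several labels separated by `;`. As a minimal sketch on made-up values (not the original analysis), one way to count each label separately is to split the strings and expand them into one row per label:

```python
import pandas as pd

# Toy genre strings in the same "Label;Label" format as the Genres column.
genres = pd.Series(["Art & Design", "Art & Design;Pretend Play", "Comics;Creativity"])

# Split on ';' and give every label its own row, then count occurrences.
label_counts = genres.str.split(";").explode().value_counts()
print(label_counts)
```

Whether to count combined genres as one label or several is a design choice; the rest of this notebook keeps the combined strings as they are.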
DROP THE DUPLICATES, KEEPING THE ENTRY WITH THE HIGHEST NUMBER OF REVIEWS
clean_android=android.sort_values('Reviews',ascending=False).drop_duplicates('App',keep='first')
clean_android.reset_index(drop=True,inplace=True)
clean_android.head(2)
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GollerCepte Live Score | SPORTS | 4.2 | 9992 | 31M | 1,000,000+ | Free | 0 | Everyone | Sports | May 23, 2018 | 6.5 | 4.1 and up |
1 | Ad Block REMOVER - NEED ROOT | TOOLS | 3.3 | 999 | 91k | 100,000+ | Free | 0 | Everyone | Tools | December 17, 2013 | 3.2 | 2.2 and up |
#confirm the number of duplicated rows
dupli_lines=len(android)-len(android.sort_values('Reviews',ascending=False).drop_duplicates('App',keep='first'))
print('Duplicated lines:',dupli_lines)
print('Original number of rows:',len(android))
print('No duplicates, number of rows:',len(clean_android))
Duplicated lines: 1181 Original number of rows: 10841 No duplicates, number of rows: 9660
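One caveat: at this point the Reviews column still holds text, so sorting it compares strings character by character. The first two rows of the deduplicated table (9992 followed by 999) suggest that this is what happened. The sketch below uses a toy frame with hypothetical app names to show the difference; `Reviews_num` is a helper column introduced only for this illustration.

```python
import pandas as pd

# Toy data; Reviews is stored as text, as in the raw CSV.
toy = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Reviews": ["999", "12000", "85", "9"],
})

# Sorting the text column is lexicographic, so "999" ranks above "12000".
by_text = toy.sort_values("Reviews", ascending=False).drop_duplicates("App", keep="first")

# Converting to numbers first keeps the row with the most reviews.
# errors="coerce" turns non-numeric entries into NaN, which sort last.
toy["Reviews_num"] = pd.to_numeric(toy["Reviews"], errors="coerce")
by_number = toy.sort_values("Reviews_num", ascending=False).drop_duplicates("App", keep="first")

print(by_text[["App", "Reviews"]])
print(by_number[["App", "Reviews"]])
```

The number of rows after deduplication is the same either way; only the choice of which duplicate survives changes.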
Find NaN values. We have 1463 rows with no rating.
clean_android.isnull().sum()
App 0 Category 0 Rating 1463 Reviews 0 Size 0 Installs 0 Type 1 Price 0 Content Rating 1 Genres 0 Last Updated 0 Current Ver 8 Android Ver 3 dtype: int64
#The NaN rows of Rating
clean_android[clean_android['Rating'].isnull()].head(2)
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | Voice Tables - no internet | PARENTING | NaN | 970 | 71M | 100,000+ | Free | 0 | Everyone | Parenting | May 28, 2018 | 2.0 | 4.0.3 and up |
113 | Gold Quote - Gold.fr | FINANCE | NaN | 96 | 1.5M | 10,000+ | Free | 0 | Everyone | Finance | May 19, 2016 | 2.3 | 2.2 and up |
#check the installs (our proxy for users) of the rows with NaN in Rating
na_rows=clean_android[clean_android['Rating'].isnull()]
na_rows.sort_values('Installs',ascending=False).head()
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5704 | Young Speeches | LIBRARIES_AND_DEMO | NaN | 2221 | 2.4M | 500,000+ | Free | 0 | Everyone | Libraries & Demo | January 8, 2017 | 1.1 | 2.3 and up |
8793 | EJ.by | NEWS_AND_MAGAZINES | NaN | 10 | 2.3M | 500+ | Free | 0 | Everyone | News & Magazines | October 27, 2015 | 1.2 | 4.0.3 and up |
4503 | Poteau BA | FAMILY | NaN | 3 | 4.0M | 500+ | Free | 0 | Everyone | Education | July 16, 2017 | 1.0.2 | 5.0 and up |
1896 | CD JUANITO | SPORTS | NaN | 6 | 16M | 500+ | Free | 0 | Everyone | Sports | October 26, 2017 | 6.0 | 4.1 and up |
1900 | F | TOOLS | NaN | 6 | 4.9M | 500+ | Free | 0 | Everyone | Tools | May 15, 2018 | 1.0.2 | 4.0 and up |
- Rows with a NaN Rating have a very low number of installs compared with the most installed apps, and very few reviews. They do not seem to add valuable information to our analysis, so we can drop them.
- In addition, row 4484 has values shifted into the wrong columns (its Category is '1.9' and its Rating is 19.0), so we will drop it too. It contains NaN values in other columns, so it is already among the NaN rows we will drop.
#Let's compare with the apps at the top of the installs ranking
clean_android.sort_values('Installs',ascending=False).head(3)
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4484 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
7372 | My Talking Tom | GAME | 4.5 | 14892469 | Varies with device | 500,000,000+ | Free | 0 | Everyone | Casual | July 19, 2018 | 4.8.0.132 | 4.1 and up |
5682 | Candy Crush Saga | GAME | 4.4 | 22430188 | 74M | 500,000,000+ | Free | 0 | Everyone | Casual | July 5, 2018 | 1.129.0.2 | 4.1 and up |
#Find the rows with NaN
na_rows=clean_android.isnull().sum(axis=1)
rows_to_drop=na_rows[na_rows!=0].index
rows_to_drop
Int64Index([ 74, 113, 118, 146, 148, 175, 270, 308, 309, 312, ... 9650, 9651, 9652, 9653, 9654, 9655, 9656, 9657, 9658, 9659], dtype='int64', length=1470)
#Drop the rows that contain NaN (the index was reset, so labels equal positions)
clean_android.drop(rows_to_drop,inplace=True)
clean_android.reset_index(drop=True,inplace=True)
#confirm that no NaN values are left
clean_android.isnull().sum()
App 0 Category 0 Rating 0 Reviews 0 Size 0 Installs 0 Type 0 Price 0 Content Rating 0 Genres 0 Last Updated 0 Current Ver 0 Android Ver 0 dtype: int64
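The same clean-up can probably be written more compactly with `dropna`. The following is a minimal sketch on a toy frame with hypothetical values, not a replacement for the steps above:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, float("nan"), 3.9],
    "Type": ["Free", "Free", None],
})

# Drop every row that has at least one missing value, then renumber the index.
cleaned = toy.dropna(how="any").reset_index(drop=True)

# Alternative: drop only the rows whose Rating is missing.
rated_only = toy.dropna(subset=["Rating"]).reset_index(drop=True)

print(cleaned)
print(rated_only)
```

Dropping on `subset=["Rating"]` keeps apps that only lack, say, a version string, which may matter if those rows are still useful.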
Take a look at how frequently each app genre appears in the Android market.
top10_frequent_app=clean_android['Genres'].value_counts().head(10)
top10_frequent_app
Tools 717 Entertainment 471 Education 429 Finance 302 Productivity 301 Lifestyle 300 Personalization 296 Action 292 Medical 290 Sports 266 Name: Genres, dtype: int64
fig,ax=plt.subplots(figsize=(16,5))
#pass x and y by keyword; recent seaborn versions no longer accept them positionally
ax=sns.barplot(x=top10_frequent_app.index,y=top10_frequent_app.values,ax=ax)
print(type(clean_android['Rating'][0]))
print(type(clean_android['Reviews'][0]))
print(type(clean_android['Installs'][0]))
<class 'numpy.float64'> <class 'str'> <class 'str'>
Convert the values of the Reviews and Installs columns to numbers so that we can check their correlation and find the top 10 most installed, reviewed and rated apps and their categories.
clean_android['Reviews']=[int(i) for i in clean_android['Reviews']]
#installs is a string
print(type(clean_android['Installs'][0]))
clean_android['Installs'].unique()
<class 'str'>
array(['1,000,000+', '100,000+', '500,000+', '50,000+', '10,000,000+', '5,000,000+', '10,000+', '50,000,000+', '100,000,000+', '5,000+', '1,000+', '1,000,000,000+', '500+', '100+', '10+', '50+', '500,000,000+', '5+', '1+'], dtype=object)
#strip '+' and ',' from the strings and convert them to float
installs_as_float=[]
for i in clean_android['Installs']:
    s=i.replace('+','').replace(',','')
    installs_as_float.append(float(s))
clean_android['Installs']=installs_as_float
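The same conversion can also be done without an explicit loop. This is a minimal sketch on a few labels taken from the list of unique values above; the variable names are only for illustration:

```python
import pandas as pd

labels = pd.Series(["1,000,000+", "500+", "1+"])

# Remove every '+' and ',' with one regular expression, then convert to float.
installs = labels.str.replace("[+,]", "", regex=True).astype(float)
print(installs.tolist())
```

Note that each label is a bucket ("at least this many installs"), so the resulting numbers are lower bounds rather than exact install counts.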
#top_10 of the most installed apps
top_10_installs=clean_android.sort_values('Installs',ascending=False).head(10)
top_10_installs
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7823 | Google Photos | PHOTOGRAPHY | 4.5 | 10859051 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device |
1397 | SOCIAL | 4.5 | 66577446 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device | |
405 | Google News | NEWS_AND_MAGAZINES | 3.9 | 878065 | 13M | 1.000000e+09 | Free | 0 | Teen | News & Magazines | August 1, 2018 | 5.2.0 | 4.4 and up |
4802 | YouTube | VIDEO_PLAYERS | 4.3 | 25655305 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Video Players & Editors | August 2, 2018 | Varies with device | Varies with device |
2541 | Google+ | SOCIAL | 4.2 | 4831125 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | July 26, 2018 | Varies with device | Varies with device |
2686 | Gmail | COMMUNICATION | 4.3 | 4604483 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 2, 2018 | Varies with device | Varies with device |
95 | Google Chrome: Fast & Secure | COMMUNICATION | 4.3 | 9643041 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 1, 2018 | Varies with device | Varies with device |
3738 | Hangouts | COMMUNICATION | 4.0 | 3419513 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | July 21, 2018 | Varies with device | Varies with device |
6873 | Google Play Books | BOOKS_AND_REFERENCE | 3.9 | 1433233 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Books & Reference | August 3, 2018 | Varies with device | Varies with device |
823 | SOCIAL | 4.1 | 78158306 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | August 3, 2018 | Varies with device | Varies with device |
The most reviewed apps are not the top rated ones. However, since we have no information about login and logout times, we treat reviews as a sign of steadier interaction: a user who reviews an app has likely spent more time exploring and evaluating it. Advertisements are therefore more likely to be seen by these users.
top_10_reviewed=clean_android.sort_values('Reviews',ascending=False).head(10)
top_10_reviewed.head()
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
823 | SOCIAL | 4.1 | 78158306 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | August 3, 2018 | Varies with device | Varies with device | |
1271 | WhatsApp Messenger | COMMUNICATION | 4.4 | 69119316 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 3, 2018 | Varies with device | Varies with device |
1397 | SOCIAL | 4.5 | 66577446 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device | |
1968 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.0 | 56646578 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 1, 2018 | Varies with device | Varies with device |
2791 | Clash of Clans | GAME | 4.6 | 44893888 | 98M | 1.000000e+08 | Free | 0 | Everyone 10+ | Strategy | July 15, 2018 | 10.322.16 | 4.1 and up |
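One possible way to put the engagement argument into numbers is to look at reviews per install. This ratio is an assumption of this sketch, not something the analysis above computes, and the toy values below are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Genres": ["Social", "Social", "Tools"],
    "Reviews": [500, 100, 10],
    "Installs": [1000.0, 1000.0, 1000.0],
})

# Hypothetical engagement proxy: reviews left per reported install.
toy["reviews_per_install"] = toy["Reviews"] / toy["Installs"]

# Average the proxy per genre, highest first.
engagement = (
    toy.groupby("Genres")["reviews_per_install"]
    .mean()
    .sort_values(ascending=False)
)
print(engagement)
```

Because Installs are bucketed lower bounds, such a ratio would only be a rough indicator.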
top10_reviewed_and_installed=clean_android.sort_values(['Installs','Reviews'],ascending=[False,False]).head(10)
top10_reviewed_and_installed.head(10)
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
823 | SOCIAL | 4.1 | 78158306 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | August 3, 2018 | Varies with device | Varies with device | |
1271 | WhatsApp Messenger | COMMUNICATION | 4.4 | 69119316 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 3, 2018 | Varies with device | Varies with device |
1397 | SOCIAL | 4.5 | 66577446 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device | |
1968 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.0 | 56646578 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 1, 2018 | Varies with device | Varies with device |
4544 | Subway Surfers | GAME | 4.5 | 27725352 | 76M | 1.000000e+09 | Free | 0 | Everyone 10+ | Arcade | July 12, 2018 | 1.90.0 | 4.1 and up |
4802 | YouTube | VIDEO_PLAYERS | 4.3 | 25655305 | Varies with device | 1.000000e+09 | Free | 0 | Teen | Video Players & Editors | August 2, 2018 | Varies with device | Varies with device |
7823 | Google Photos | PHOTOGRAPHY | 4.5 | 10859051 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Photography | August 6, 2018 | Varies with device | Varies with device |
7921 | Skype - free IM & video calls | COMMUNICATION | 4.1 | 10484169 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 3, 2018 | Varies with device | Varies with device |
95 | Google Chrome: Fast & Secure | COMMUNICATION | 4.3 | 9643041 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Communication | August 1, 2018 | Varies with device | Varies with device |
218 | Maps - Navigate & Explore | TRAVEL_AND_LOCAL | 4.3 | 9235373 | Varies with device | 1.000000e+09 | Free | 0 | Everyone | Travel & Local | July 31, 2018 | Varies with device | Varies with device |
We can see below that most of the most installed apps are rated between 4.0 and 4.5 and also have the highest review counts.
#Check relations between the columns Rating, Reviews and Installs
scatter_matrix(clean_android[['Rating','Reviews','Installs']],figsize=(16,6))
plt.show()
clean_android.plot.scatter('Rating','Reviews')
We observe two clusters among the most installed apps.
clean_android.plot.scatter('Rating','Installs')
clean_android.plot.scatter('Installs','Reviews')
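The scatter plots give a visual impression; a correlation matrix could back it with numbers. The sketch below runs on made-up values in which Installs is exactly ten times Reviews, so it illustrates the call and not the real result:

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": [4.0, 4.5, 3.5, 4.2],
    "Reviews": [100, 1000, 10, 500],
    "Installs": [1000.0, 10000.0, 100.0, 5000.0],
})

# Pairwise Pearson correlation between the three numeric columns.
corr = toy[["Rating", "Reviews", "Installs"]].corr()
print(corr)
```

On the real data, the same call on `clean_android[['Rating','Reviews','Installs']]` should work once Reviews and Installs have been converted to numbers.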
Let's have a look at the iOS apps.
ios=pd.read_csv('AppleStore.csv')
ios.head(2)
id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284882215 | 389879808 | USD | 0.0 | 2974676 | 212 | 3.5 | 3.5 | 95.0 | 4+ | Social Networking | 37 | 1 | 29 | 1 | |
1 | 389801252 | 113954816 | USD | 0.0 | 2161558 | 1289 | 4.5 | 4.0 | 10.23 | 12+ | Photo & Video | 37 | 0 | 29 | 1 |
Here we have neither installs nor reviews, so we cannot know the total number of users. We do have the total number of ratings, which shows how many users interacted with each app. Based on that, we will try to see which categories are the most popular here.
#top 10 apps by total number of ratings
ios_top10_rated=ios.sort_values('rating_count_tot',ascending=False).head(10)
ios_top10_rated.head()
id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284882215 | 389879808 | USD | 0.0 | 2974676 | 212 | 3.5 | 3.5 | 95.0 | 4+ | Social Networking | 37 | 1 | 29 | 1 | |
1 | 389801252 | 113954816 | USD | 0.0 | 2161558 | 1289 | 4.5 | 4.0 | 10.23 | 12+ | Photo & Video | 37 | 0 | 29 | 1 | |
2 | 529479190 | Clash of Clans | 116476928 | USD | 0.0 | 2130805 | 579 | 4.5 | 4.5 | 9.24.12 | 9+ | Games | 38 | 5 | 18 | 1 |
3 | 420009108 | Temple Run | 65921024 | USD | 0.0 | 1724546 | 3842 | 4.5 | 4.0 | 1.6.2 | 9+ | Games | 40 | 5 | 1 | 1 |
4 | 284035177 | Pandora - Music & Radio | 130242560 | USD | 0.0 | 1126879 | 3594 | 4.0 | 4.5 | 8.4.1 | 12+ | Music | 37 | 4 | 1 | 1 |
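The table above ranks individual apps. To compare categories, the rating counts could be aggregated per `prime_genre`. The following is a minimal sketch on toy rows; restricting to free apps is an assumption based on the company's business model, and the numbers are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "prime_genre": ["Games", "Games", "Music", "Social Networking"],
    "price": [0.0, 0.0, 0.0, 1.99],
    "rating_count_tot": [200, 100, 50, 900],
})

# Keep only free apps (assumption: paid apps are out of scope for us).
free = toy[toy["price"] == 0.0]

# Total, average and number of apps per genre, ordered by total ratings.
by_genre = (
    free.groupby("prime_genre")["rating_count_tot"]
    .agg(["sum", "mean", "count"])
    .sort_values("sum", ascending=False)
)
print(by_genre)
```

Looking at the mean next to the sum helps to see whether a genre is popular overall or dominated by a few very large apps.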
CONCLUSION:
Here too, the first positions are dominated by social apps, with Instagram sitting between the social and photography categories. The iOS ranking has more games in the top 10, but the leading apps still belong to the social and communication categories. An idea for a new app could therefore be one that teaches users how to edit video and photos, and lets each user publish their work on a social app (Facebook, Instagram) as part of a competition. Part of the score would come from criteria built into the app, and part from other users of the app or of the connected social app (for example Facebook). Combined with a prize provided by commercial companies, this could attract the interest of many customers and therefore be profitable for the company through advertisements.