Mozilla releases new versions of its Firefox web browser over time. Users around the world run these versions on different platforms. Some versions make users happy; others, not so much. Users across platforms and versions generate a huge amount of feedback.
The aim of this project is to analyze the feedback data and find interesting behaviours in it.
The feedback is available at the following link.
https://input.mozilla.org/en-US/?product=Firefox
Since the data is not readily available in the form of a table, it is necessary to scrape relevant information from the website. As a sample dataset to play around with, I scraped a week's worth of data (around 200 pages) using BeautifulSoup, a Python library for pulling data out of HTML and XML files.
The code for scraping is as follows:
# I have commented out the scraping code here.
'''
import requests
from bs4 import BeautifulSoup
import traceback

# base URL; the GET parameters are passed as a dictionary
url = 'https://input.mozilla.org/en-US'
feedback1 = open('feedback1.txt', 'a')
# feedback1 - the first 202 pages are valid as of 19th Oct 8:30AM IST
for i in range(1, 203):
    try:
        params = dict(product='Firefox', page=i)
        r = requests.get(url, params=params)
        bs = BeautifulSoup(r.text)
        for opinion in bs.findAll('li', 'opinion'):
            senti = opinion.find('span', 'sprite').contents[0]
            datetime = opinion.find('time')['datetime']
            # not a very useful value; we want the absolute date and time
            # (to remove whitespace on either side of the string, use e.g. time.strip())
            time = opinion.find('time').string
            version = time.find_next('a').find_next('a').contents[0]
            platform = version.find_next('a').contents[0]
            locale = platform.find_next('a').contents[0]
            feedback1.write(senti+'\t'+datetime+'\t'+version+'\t'+platform+'\t'+locale+'\n')
    except Exception:
        # The try/except block was added because certain opinions on a page raised errors:
        # some characters couldn't be encoded by the ascii codec, e.g. the 'a' in (Norwegian Bokmal)
        print("Error while retrieving page %d" % i)
        print(traceback.format_exc())
        continue
feedback1.close()
'''
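The encoding errors mentioned in the scraping comments can usually be avoided by writing the file as UTF-8 text instead of relying on the ascii codec. The following is only a minimal sketch of that idea, not the code that produced the dataset: `write_rows`, `read_rows`, the file name and the sample row are all hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_rows(path, rows):
    """Write tab-separated rows as UTF-8 text (hypothetical helper)."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        for row in rows:
            handle.write(u'\t'.join(row) + u'\n')


def read_rows(path):
    """Read the tab-separated rows back as lists of unicode strings."""
    with io.open(path, 'r', encoding='utf-8') as handle:
        return [line.rstrip(u'\n').split(u'\t') for line in handle]


# Made-up sample row containing a non-ASCII character (the a-with-ring in "Bokmal").
rows = [[u'Sad', u'2015-10-18-08:00', u'41.0', u'Linux', u'Norwegian Bokm\xe5l']]

path = os.path.join(tempfile.mkdtemp(), 'feedback_utf8.txt')
write_rows(path, rows)
round_trip = read_rows(path)
```

With an explicit encoding, the non-ASCII locale name survives the round trip, so a whole page no longer has to be skipped because of a single character.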
# Scraped data has been saved in feedback2.txt
!head feedback2.txt
Sad 2015-10-18-08:00 41.0 Linux English (US)
Sad 2015-10-18-08:00 41.0 Windows XP Vietnamese
Happy 2015-10-18-08:00 41.0.2 Windows 7 English (US)
Sad 2015-10-18-08:00 37.0 Windows 7 English (US)
Sad 2015-10-18-08:00 38.3.0 Windows 7 English (US)
Happy 2015-10-18-08:00 8.0 Linux Spanish (Spain)
Sad 2015-10-18-08:00 41.0.2 Windows 7 English (US)
Sad 2015-10-18-08:00 41.0 Windows 10 English (US)
Sad 2015-10-18-08:00 41.0.2 Windows XP German
Sad 2015-10-18-08:00 41.0.2 Windows 8.1 Spanish (Spain)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
names = ['sentiment', 'date', 'version', 'platform', 'locale']
data = pd.read_csv('feedback2.txt', delimiter='\t', names=names).dropna()
print("Number of rows: %i" % data.shape[0])
data.head()
Number of rows: 3760
 | sentiment | date | version | platform | locale |
---|---|---|---|---|---|
0 | Sad | 2015-10-18-08:00 | 41.0 | Linux | English (US) |
1 | Sad | 2015-10-18-08:00 | 41.0 | Windows XP | Vietnamese |
2 | Happy | 2015-10-18-08:00 | 41.0.2 | Windows 7 | English (US) |
3 | Sad | 2015-10-18-08:00 | 37.0 | Windows 7 | English (US) |
4 | Sad | 2015-10-18-08:00 | 38.3.0 | Windows 7 | English (US) |
Now we have the data in the right format.
For the next part, I'm using an ad hoc way of dealing with the date column: I extract only the day of the month. This is a quick and dirty approach, but it suffices for the one-week sample scraped here. A later analysis should work with the full date instead.
# Extract the day of the month from the date string.
# This is not good practice, since it throws away the month and year.
# Ideally we should parse the entire date so that dates can be compared. Will do that later.
data['date'] = [int(d.split('-')[2]) for d in data['date']]
data.head()
 | sentiment | date | version | platform | locale |
---|---|---|---|---|---|
0 | Sad | 18 | 41.0 | Linux | English (US) |
1 | Sad | 18 | 41.0 | Windows XP | Vietnamese |
2 | Happy | 18 | 41.0.2 | Windows 7 | English (US) |
3 | Sad | 18 | 37.0 | Windows 7 | English (US) |
4 | Sad | 18 | 38.3.0 | Windows 7 | English (US) |
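The day-only shortcut works for a single week but breaks as soon as the data spans a month boundary. As a rough sketch of the alternative, the snippet below parses the full calendar date with `pd.to_datetime`. It assumes that the first ten characters of each string are the date in `YYYY-MM-DD` form and that the trailing `-08:00` can be ignored; the second sample value is made up for illustration.

```python
import pandas as pd

# Sample strings in the same shape as the scraped 'date' column.
raw = pd.Series(['2015-10-18-08:00', '2015-10-11-08:00'])

# Assumption: characters 0-9 hold the calendar date; the rest is dropped.
parsed = pd.to_datetime(raw.str[:10], format='%Y-%m-%d')

# The day of the month is still available, but the full dates are now comparable.
days = parsed.dt.day.tolist()
```

Parsed this way, the column can be sorted, compared and resampled without losing the month and year.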
data[['sentiment', 'version', 'platform', 'locale']].describe()
 | sentiment | version | platform | locale |
---|---|---|---|---|
count | 3760 | 3760 | 3760 | 3760 |
unique | 2 | 69 | 31 | 48 |
top | Sad | 41.0.1 | Windows 7 | English (US) |
freq | 3134 | 1546 | 1382 | 2090 |
So we can infer that the majority of the feedback has been negative: 3134 of the 3760 rows are 'Sad'.
The most frequent version, platform and locale in the feedback can also be read off the table.
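To put a number on "majority", the share of negative feedback can be worked out from the `count` and `freq` values in the summary table. This is just the arithmetic spelled out; the variable names are mine.

```python
# Values taken from the describe() output: 3760 rows in total, 3134 of them 'Sad'.
total = 3760
sad = 3134

happy = total - sad               # the remaining rows are 'Happy'
sad_share = sad / float(total)    # fraction of feedback that is negative
happy_share = happy / float(total)

print("Sad: %.1f%%, Happy: %.1f%%" % (100 * sad_share, 100 * happy_share))
```

On the full DataFrame, `data.sentiment.value_counts(normalize=True)` should give the same proportions directly.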
data['date'].describe()
count    3760.000000
mean       14.633511
std         2.175523
min        11.000000
25%        13.000000
50%        15.000000
75%        16.000000
max        18.000000
Name: date, dtype: float64
We can now look at answering some questions that we may have regarding the given data.
What exactly are we looking for? Some questions instantly come to mind.
Although the above are important questions, we may want to dig a little deeper into the data to find certain correlations between attributes.
# Custom function to make graphs simple and pretty
%matplotlib inline
# tell pandas to display wide tables as pretty HTML tables
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

#--------------------------To remove borders from the matplotlib plots-------------------------
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks.
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn.
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    # turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    # now re-enable the ticks on the visible borders
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
# Negative and positive feedback over time
# Resources: http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html
#            http://matplotlib.org/examples/statistics/histogram_demo_multihist.html
# Pass plain arrays (.values): passing the pandas Series directly gave a KeyError for 'Happy'
sad_days = data[data.sentiment == 'Sad'].date.values
happy_days = data[data.sentiment == 'Happy'].date.values
fig = plt.figure(1)
plt.hist([sad_days, happy_days],
         bins=np.arange(11, 21),
         color=['crimson', 'chartreuse'],
         label=['Negative', 'Positive'])
plt.xlabel('Date (day of October 2015)')
plt.title('Feedback over Time')
plt.legend(prop={'size': 10})
remove_border()
fig.savefig('fig1.png', bbox_inches='tight')
We observe that positive feedback has been low and almost constant on all days.
Negative feedback has varied slightly with time, peaking on 16 October 2015.
The result is probably biased: users tend to report problems immediately, whereas not everyone who is satisfied bothers to leave positive feedback.
So the analysis needs to place more emphasis on negative feedback in order to pinpoint the areas where Mozilla needs to fix issues.
# Versions with the most negative and positive feedback. Only the nine most frequent
# versions are plotted, and rows whose version is 'Unknown' are excluded.
# Resources: http://pandas.pydata.org/pandas-docs/version/0.13.1/visualization.html
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sad = data[(data.sentiment == 'Sad') & (data.version != 'Unknown')]
sad.version.value_counts()[:9].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Version')
axes[0].set_title('Versions with negative feedback')
remove_border(axes=axes[0])
happy = data[(data.sentiment == 'Happy') & (data.version != 'Unknown')]
happy.version.value_counts()[:9].plot(kind="bar", ax=axes[1], color='chartreuse')
axes[1].set_ylim([0, 1400])  # fixed y-range for the positive panel
axes[1].set_xlabel('Version')
axes[1].set_title('Versions with positive feedback')
remove_border(axes=axes[1])
fig.savefig('fig2.png', bbox_inches='tight')
We find that the version causing the most problems is also the one with the most positive feedback.
This holds for other versions as well. So version 41.0.1 is probably the most widely used version among the users who leave feedback, as it generates the most feedback (positive and negative) overall. But as I mentioned already, the positive feedback doesn't help much because the numbers are small. We may have to look into the feedback text to see if there is any information in there.
# Platforms with the most negative and positive feedback. Only the eight most frequent
# platforms are plotted, and rows whose platform is 'Unknown' are excluded.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sad = data[(data.sentiment == 'Sad') & (data.platform != 'Unknown')]
sad.platform.value_counts()[:8].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Platform')
axes[0].set_title('Platforms with negative feedback')
remove_border(axes=axes[0])
happy = data[(data.sentiment == 'Happy') & (data.platform != 'Unknown')]
happy.platform.value_counts()[:8].plot(kind="bar", ax=axes[1], color='chartreuse')
axes[1].set_ylim([0, 1200])  # fixed y-range for the positive panel
axes[1].set_xlabel('Platform')
axes[1].set_title('Platforms with positive feedback')
remove_border(axes=axes[1])
fig.savefig('fig3.png', bbox_inches='tight')
The Windows platforms seem to be generating the most problems. One thing to note here is that a platform's feedback count is not directly proportional to the percentage of that platform's users who face problems, because the overall number of users varies from platform to platform.
For example: Windows 7 problems > Windows 8.1 problems.
But this does not imply that %(Windows 7 users facing problems) > %(Windows 8.1 users facing problems),
because there may be far fewer Windows 8.1 users than Windows 7 users, which alone would lead to a lower feedback count.
While dealing with the issues, Mozilla probably needs to give equal importance to Windows 8.1 users so that it doesn't lose that user base.
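The number of users per platform is not part of this dataset, so the true percentage cannot be computed here. What can be computed is the share of negative feedback within each platform's own feedback, which at least removes the effect of raw volume. The sketch below does this for two platforms, using the Firefox 41.0.1 counts that appear in the pivot tables of this notebook (513 Sad / 55 Happy on Windows 7, 133 Sad / 17 Happy on Windows 8.1); treat it as an illustration rather than a substitute for real usage data.

```python
import pandas as pd

# Feedback counts for Firefox 41.0.1, copied from the pivot tables in this notebook.
counts = pd.DataFrame(
    {'Sad': [513, 133], 'Happy': [55, 17]},
    index=['Windows 7', 'Windows 8.1'])

# Share of negative feedback within each platform's own feedback.
sad_share = counts['Sad'] / counts.sum(axis=1)

print(sad_share.round(3))
```

Although Windows 7 has almost four times as many negative reports, the share of negative feedback is close on both platforms (roughly 90% versus 89%), which supports the point that raw counts alone can mislead.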
# Locales with the most negative and positive feedback. Only the eight most frequent
# locales are plotted, and rows whose locale was scraped as 'Permalink' are excluded.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sad = data[(data.sentiment == 'Sad') & (data.locale != 'Permalink')]
sad.locale.value_counts()[:8].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Locale')
axes[0].set_title('Locales with negative feedback')
remove_border(axes=axes[0])
happy = data[(data.sentiment == 'Happy') & (data.locale != 'Permalink')]
happy.locale.value_counts()[:8].plot(kind="bar", ax=axes[1], color='chartreuse')
axes[1].set_ylim([0, 1800])  # fixed y-range for the positive panel
axes[1].set_xlabel('Locale')
axes[1].set_title('Locales with positive feedback')
remove_border(axes=axes[1])
fig.savefig('fig4.png', bbox_inches='tight')
Users with the English (US) locale account for a huge share of the feedback, both positive and negative.
After dabbling with a lot of bar graphs and getting a sense of the general behaviour of the data, it would be interesting to see whether there are any correlations between versions and platforms. The most apt graph for this purpose is a heatmap. Since the attributes are non-numerical, it is quite a task to get a heatmap out of them. But with a bit of effort (and a LOT of googling), I was successful in getting the desired output!
In the following sections, I explain in detail how I went about the entire process.
# I'm using a dummy column that holds a 1 for every row as the count value. This is needed for the pivot table.
data.insert(len(data.columns), column='dummy', value=[1]*len(data))
# Resources: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
pt1 = pd.pivot_table(data[(data.sentiment=='Sad') & (data.version!='Unknown') & (data.platform!='Unknown')],
                     index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt1.head()
# use pt1 to look at the entire pivot table
platform | Android 4.2.2 | Android 4.3 | Fedora | FreeBSD | Linux | Maemo | OS X | Windows 10 | Windows 2000 | Windows 7 | Windows 8 | Windows 8.1 | Windows NT | Windows NT 4.10 | Windows Vista | Windows XP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
version | ||||||||||||||||
10.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
11.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
12.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 2 |
13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 2 |
14.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
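The dummy column of 1s is one way to count rows per version/platform cell. As an alternative sketch, `pd.crosstab` builds the same kind of count table directly from two columns, with no helper column to add and delete afterwards. The small `sample` frame below is made up for illustration; it is not part of the scraped data.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame.
sample = pd.DataFrame({
    'version': ['41.0', '41.0', '41.0.1', '41.0.1', '41.0.1'],
    'platform': ['Linux', 'Windows 7', 'Windows 7', 'Windows 7', 'Linux'],
})

# Rows are versions, columns are platforms, and each cell is a row count.
table = pd.crosstab(sample['version'], sample['platform'])

print(table)
```

Combinations that never occur are filled with 0, just as `fill_value=0` does in the pivot table.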
# Removing versions having a total negative feedback count of less than 30.
# A boolean mask on the row sums keeps only the rows that reach the threshold,
# so there is no need to drop rows one at a time while looping over a changing index.
pt1 = pt1[pt1.sum(axis=1) >= 30]
pt1
platform | Android 4.2.2 | Android 4.3 | Fedora | FreeBSD | Linux | Maemo | OS X | Windows 10 | Windows 2000 | Windows 7 | Windows 8 | Windows 8.1 | Windows NT | Windows NT 4.10 | Windows Vista | Windows XP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
version | ||||||||||||||||
40.0.3 | 0 | 0 | 0 | 0 | 2 | 0 | 5 | 14 | 0 | 21 | 2 | 10 | 1 | 0 | 2 | 9 |
41.0 | 0 | 0 | 0 | 0 | 19 | 0 | 27 | 89 | 0 | 124 | 7 | 25 | 0 | 1 | 12 | 39 |
41.0.1 | 1 | 0 | 1 | 0 | 58 | 0 | 117 | 290 | 0 | 513 | 19 | 133 | 0 | 0 | 56 | 153 |
41.0.2 | 0 | 1 | 0 | 0 | 11 | 0 | 56 | 206 | 0 | 316 | 14 | 86 | 0 | 0 | 30 | 95 |
42.0 | 0 | 1 | 0 | 0 | 2 | 0 | 17 | 47 | 0 | 74 | 8 | 8 | 0 | 0 | 16 | 14 |
44.0a1 | 0 | 0 | 0 | 1 | 9 | 0 | 3 | 33 | 0 | 23 | 0 | 5 | 0 | 0 | 2 | 0 |
# You can also plot a heatmap with normed values if needed. Replace pt1 with pt1_norm in the next code snippet.
# And uncomment below line.
#pt1_norm = (pt1 - pt1.mean()) / (pt1.max() - pt1.min())
#Resources:
#http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
#https://plot.ly/python/heatmaps/
fig, ax = plt.subplots()
heatmap = ax.pcolor(pt1, cmap=plt.cm.Reds, alpha=0.89)
# Format
fig = plt.gcf()
fig.set_size_inches(15,5)
# turn off the frame
ax.set_frame_on(False)
# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt1.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt1.shape[1]) + 0.5, minor=False)
# want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()
# Set the labels
ax.set_xticklabels(pt1.columns, minor=False)
ax.set_yticklabels(pt1.index, minor=False)
# rotate the x labels
plt.xticks(rotation=90)
ax.grid(False)
# Turn off all the tick marks (the tick labels are kept)
ax = plt.gca()
ax.tick_params(axis='both', which='major', length=0)
# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version')
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having negative feedback',y=-0.08)
handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
fig.savefig('fig5.png', bbox_inches='tight')
Voila! Here is our much-awaited heatmap!
pt2 = pd.pivot_table(data[(data.sentiment=='Happy') & (data.version!='Unknown') & (data.platform!='Unknown')],
index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt2.head()
#use pt2 to look at the entire pivot table
platform | Android 4.2.2 | Linux | Maemo | OS X | Windows 10 | Windows 7 | Windows 8 | Windows 8.1 | Windows Vista | Windows XP |
---|---|---|---|---|---|---|---|---|---|---|
version | ||||||||||
10.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
10.0.2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
11.0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 |
12.0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 |
13.0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 |
# Removing versions having a total positive feedback count of less than 10
pt2 = pt2[pt2.sum(axis=1) >= 10]
pt2
platform | Android 4.2.2 | Linux | Maemo | OS X | Windows 10 | Windows 7 | Windows 8 | Windows 8.1 | Windows Vista | Windows XP |
---|---|---|---|---|---|---|---|---|---|---|
version | ||||||||||
40.0 | 0 | 0 | 0 | 1 | 4 | 3 | 0 | 1 | 0 | 2 |
40.0.3 | 0 | 2 | 0 | 1 | 1 | 8 | 0 | 2 | 0 | 5 |
41.0 | 0 | 7 | 0 | 3 | 12 | 24 | 2 | 10 | 2 | 9 |
41.0.1 | 1 | 9 | 0 | 15 | 48 | 55 | 4 | 17 | 5 | 45 |
41.0.2 | 0 | 3 | 0 | 4 | 26 | 34 | 2 | 7 | 4 | 26 |
42.0 | 0 | 2 | 0 | 2 | 6 | 23 | 0 | 3 | 5 | 5 |
44.0a1 | 0 | 3 | 0 | 1 | 5 | 3 | 1 | 1 | 0 | 0 |
8.0 | 0 | 12 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 |
#Resources:
#http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
#https://plot.ly/python/heatmaps/
fig, ax = plt.subplots()
heatmap = ax.pcolor(pt2, cmap=plt.cm.Greens, alpha=0.89)
# Format
fig = plt.gcf()
fig.set_size_inches(15,5)
# turn off the frame
ax.set_frame_on(False)
# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt2.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt2.shape[1]) + 0.5, minor=False)
# want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()
# Set the labels
ax.set_xticklabels(pt2.columns, minor=False)
ax.set_yticklabels(pt2.index, minor=False)
# rotate the x labels
plt.xticks(rotation=90)
ax.grid(False)
# Turn off all the tick marks (the tick labels are kept)
ax = plt.gca()
ax.tick_params(axis='both', which='major', length=0)
# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version')
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having positive feedback',y=-0.08)
handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
fig.savefig('fig6.png', bbox_inches='tight')
We notice a similar trend in this heatmap as well. But it should be noted that the positive counts are much lower than the negative ones. A heatmap is always relative to the data used to plot it, so the colours of the two heatmaps cannot be compared directly.
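One way to make heatmaps more comparable is to plot shares instead of raw counts. The sketch below divides each version's row by its row total, so every row sums to 1 and the colours show how a version's feedback is distributed across platforms. It uses a three-platform excerpt of the negative-feedback counts for versions 41.0.1 and 42.0 from the pivot table earlier in this notebook; the names `neg` and `row_share` are mine, and this is a different normalisation from the `pt1_norm` formula in the commented-out line.

```python
import pandas as pd

# Excerpt of the negative-feedback pivot table (three platforms, two versions).
neg = pd.DataFrame(
    {'Windows 7': [513, 74], 'Windows 10': [290, 47], 'Windows XP': [153, 14]},
    index=['41.0.1', '42.0'])

# Divide every row by its own total so that each row sums to 1.
row_share = neg.div(neg.sum(axis=1), axis=0)

print(row_share.round(3))
```

Passing `row_share` to the plotting code in place of the raw table would show the distribution within each version rather than which version has the most feedback.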
Finally we need to delete the dummy column that we had created.
#deleting the dummy column
data.drop('dummy', axis=1, inplace=True)
data.head()
 | sentiment | date | version | platform | locale |
---|---|---|---|---|---|
0 | Sad | 18 | 41.0 | Linux | English (US) |
1 | Sad | 18 | 41.0 | Windows XP | Vietnamese |
2 | Happy | 18 | 41.0.2 | Windows 7 | English (US) |
3 | Sad | 18 | 37.0 | Windows 7 | English (US) |
4 | Sad | 18 | 38.3.0 | Windows 7 | English (US) |
In this analysis, a week's worth of Mozilla Firefox feedback data was analyzed, and some insights were drawn into user behaviour and into how feedback depends on version and platform.