Introduction to Non-Personalized Recommenders¶

The recommendation problem¶

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

Amazon
Netflix
Facebook
Last.fm.

What exactly do they do?

Definitions from the literature¶

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992

In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

Notation¶

$U$ is the set of users in our domain. Its size is $|U|$.
$I$ is the set of items in our domain. Its size is $|I|$.
$I(u)$ is the set of items that user $u$ has rated.
$-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet rated by user $u$.
$U(i)$ is the set of users that have rated item $i$.
$-U(i)$ is the complement of $U(i)$.

Goal of a recommendation system¶

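$$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} \forall{u \in U},\; i^* = \argmax_{i \in -I(u)} [S(u,i)] $$

Here $S(u,i)$ is the score (for example, the estimated rating) that the system assigns to item $i$ for user $u$.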
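As a minimal sketch of this goal, the snippet below picks, for a given user, the unrated item with the highest score. It assumes a few made-up rows of ratings and uses the item's mean observed rating as a placeholder for $S(u,i)$; the names `ratings`, `score` and `recommend` are illustrative and not part of any library.

```python
# Toy ratings: user -> {item: rating}; items absent from the inner dict are unrated.
ratings = {
    "u_1": {"m_3": 4, "m_5": 1},
    "u_2": {"m_1": 3, "m_4": 2, "m_5": 2},
    "u_8": {"m_1": 3, "m_2": 1, "m_3": 5},
}
items = {"m_1", "m_2", "m_3", "m_4", "m_5"}


def score(user, item):
    """Placeholder S(u, i): the item's mean rating over all users (ignores `user`)."""
    observed = [r[item] for r in ratings.values() if item in r]
    return sum(observed) / len(observed) if observed else 0.0


def recommend(user):
    """Return the unrated item with the highest score, or None if nothing is left."""
    unrated = items - set(ratings[user])  # this is -I(u)
    if not unrated:
        return None
    # Sort first so that ties are broken deterministically by item name.
    return max(sorted(unrated), key=lambda item: score(user, item))


for user in ratings:
    print(user, "->", recommend(user))
```

Because the placeholder score ignores the user, this is a non-personalized recommender: two users only get different suggestions when they have rated different items.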

Problem statement¶

The recommendation problem in its most basic form is quite simple to define:

|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|

Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.

Challenges¶

Availability of item metadata¶

Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:

Ingredients
Nutrition facts
Brand
Description
Country of origin

New user problem¶

A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.

New item problem¶

Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.

Data sparsity¶

When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings to the $|U| \times |I|$ possible ones. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
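To make the ratio concrete, here is a small sketch that computes the density (and its complement, the sparsity) of the toy matrix from the problem statement. `None` stands in for the `?` entries, and the variable names are illustrative.

```python
# Toy ratings matrix from the problem statement; None marks a missing rating.
matrix = [
    [None, None, 4, None, 1],         # u_1
    [3, None, None, 2, 2],            # u_2
    [3, None, None, None, None],      # u_3
    [None, 1, 2, 1, 1],               # u_4
    [None, None, None, None, None],   # u_5
    [2, None, 2, None, None],         # u_6
    [None, None, None, None, None],   # u_7
    [3, 1, 5, None, None],            # u_8
    [None, None, None, None, 2],      # u_9
]

n_users = len(matrix)
n_items = len(matrix[0])
n_ratings = sum(value is not None for row in matrix for value in row)

density = n_ratings / (n_users * n_items)  # existing ratings / (|U| x |I|)
sparsity = 1.0 - density

print(f"{n_ratings} ratings out of {n_users * n_items} cells")
print(f"density = {density:.3f}, sparsity = {sparsity:.3f}")
```

Even in this tiny example, 16 of the 45 cells are filled, and two users (u_5 and u_7) have no ratings at all.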

Flow chart: the big picture¶

In [3]:

from IPython.display import Image
Image(filename='./imgs/recsys_arch.png')

Out[3]:

The CourseTalk dataset: loading and first look¶

Loading of the CourseTalk database.

The CourseTalk data is spread across three files. Using the pd.read_table function, we load each file into a DataFrame:

In [5]:

import pandas as pd

unames = ['user_id', 'username']
users = pd.read_table('./data/users_set.dat',
                      sep='|', header=None, names=unames)

rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('./data/ratings.dat',
                        sep='|', header=None, names=rnames)

mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('./data/cursos.dat',
                       sep='|', header=None, names=mnames)

# show how one of them looks
ratings.head(10)

Out[5]:

	user_id	course_id	rating
0	1	1	5
1	2	1	5
2	3	1	5
3	4	1	5
4	5	1	5
5	6	1	5
6	7	1	5
7	8	1	5
8	9	1	5
9	10	1	5

In [293]:

# show how one of them looks
users[:5]

Out[293]:

	user_id	username
0	1	patrickdijusto1
1	2	natalya_ivanova
2	3	justineittreim
3	4	ronmay
4	5	paulstock

In [254]:

courses[:5]

Out[254]:

	course_id	title	avg_rating	workload	university	difficulty	provider
0	1	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera
1	2	Modern & Contemporary American Poetry	4.9	5-9 hours/week	University of Pennsylvania	Easy/medium	coursera
2	3	A Beginner's Guide to Irrational Behavior	4.9	7-10 hours/week	Duke University	Medium	coursera
3	4	Design: Creation of Artifacts in Society	4.9	5-10 hours/week	University of Pennsylvania	Medium	coursera
4	5	Greek and Roman Mythology	4.9	8-10 hours/week	University of Pennsylvania	Medium	coursera

Using pd.merge we get it all into one big DataFrame.

In [6]:

coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk

Out[6]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2773 entries, 0 to 2772
Data columns (total 10 columns):
user_id       2773  non-null values
course_id     2773  non-null values
rating        2773  non-null values
title         2773  non-null values
avg_rating    2773  non-null values
workload      2773  non-null values
university    2616  non-null values
difficulty    2773  non-null values
provider      2773  non-null values
username      2773  non-null values
dtypes: float64(1), int64(2), object(7)

In [295]:

coursetalk.iloc[0]

Out[295]:

user_id                                                       1
course_id                                                     1
rating                                                        5
title         An Introduction to Interactive Programming in ...
avg_rating                                                  4.9
workload                                        7-10 hours/week
university                                      Rice University
difficulty                                               Medium
provider                                               coursera
username                                        patrickdijusto1
Name: 0, dtype: object
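When pd.merge is called without an `on` argument, it joins on the column names the two frames share and performs an inner join by default. A minimal sketch with made-up rows (the `toy_*` frames and their values are illustrative, not taken from CourseTalk):

```python
import pandas as pd

# Made-up miniature versions of the three tables (illustrative values only).
toy_ratings = pd.DataFrame({"user_id": [1, 2, 2],
                            "course_id": [10, 10, 20],
                            "rating": [5, 4, 3]})
toy_courses = pd.DataFrame({"course_id": [10, 20],
                            "title": ["Course A", "Course B"]})
toy_users = pd.DataFrame({"user_id": [1, 2],
                          "username": ["alice", "bob"]})

# With no `on` argument, merge joins on the shared column names:
# toy_ratings and toy_courses share 'course_id'; the result and toy_users share 'user_id'.
toy_merged = pd.merge(pd.merge(toy_ratings, toy_courses), toy_users)

print(toy_merged)
```

Each rating row ends up with the matching course and user columns attached, which is what the one-row view above shows for the real data.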

Collaborative filtering: generalizations of the aggregation function¶

Non-personalized recommendations¶

Groupby¶

The idea of groupby is that of split-apply-combine:

split data in an object according to a given key;
apply a function to each subset;
combine results into a new object.
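The three steps can be sketched on a small made-up frame (the data and the name `toy` are illustrative). The sketch uses current pandas parameter names, which may differ from those of older releases.

```python
import pandas as pd

# Made-up ratings (illustrative only).
toy = pd.DataFrame({
    "provider": ["coursera", "coursera", "edx", "edx", "udacity"],
    "rating": [5.0, 4.0, 3.0, 5.0, 4.0],
})

# split: one group per provider; apply: mean of 'rating';
# combine: a Series indexed by provider.
mean_by_provider = toy.groupby("provider")["rating"].mean()

# The same aggregation expressed as a pivot table (a one-column DataFrame).
pivoted = toy.pivot_table(values="rating", index="provider", aggfunc="mean")

print(mean_by_provider.sort_values(ascending=False))
```

Both forms give the same numbers; `groupby` makes the split-apply-combine steps explicit, while `pivot_table` is convenient once a second key is spread across the columns.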

To get mean course ratings grouped by the provider, we can use the pivot_table method:

In [284]:

# pivot_table returns a DataFrame; select the 'rating' column to get a Series
mean_ratings = coursetalk.pivot_table('rating', index='provider', aggfunc='mean')['rating']
mean_ratings.sort_values(ascending=False)

Out[284]:

provider
None            4.562500
coursera        4.527835
edx             4.491620
codecademy      4.450000
udacity         4.241071
udemy           4.200000
open2study      4.083333
khanacademy     4.000000
novoed          3.281250
mruniversity    3.250000
Name: rating, dtype: float64

Now let's filter down to courses that received at least 20 ratings (a completely arbitrary number). To do this, I group the data by title and use size() to get a Series with the number of ratings for each title:

In [297]:

ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]

Out[297]:

title
14.73x: The Challenges of Global Poverty                     2
2.01x: Elements of Structures                                2
3.091x: Introduction to Solid State Chemistry                3
6.002x: Circuits and Electronics                            10
6.00x: Introduction to Computer Science and Programming     21
7.00x: Introduction to Biology - The Secret of Life          3
8.02x: Electricity and Magnetism                             3
8.MReVx: Mechanics ReView                                    1
A Beginner&#39;s Guide to Irrational Behavior              147
A Crash Course on Creativity                                 5
dtype: int64

In [298]:

active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]

Out[298]:

Index([u'6.00x: Introduction to Computer Science and Programming', u'A Beginner&#39;s Guide to Irrational Behavior', u'An Introduction to Interactive Programming in Python', u'An Introduction to Operations Management', u'CS-191x: Quantum Mechanics and Quantum Computation', u'CS188.1x Artificial Intelligence', u'Calculus: Single Variable', u'Computing for Data Analysis', u'Critical Thinking in Global Challenges', u'Cryptography I'], dtype=object)

The index of titles receiving at least 20 ratings can then be used to select rows from the mean ratings per title, which we compute first:

In [300]:

mean_ratings = coursetalk.pivot_table('rating', index='title', aggfunc='mean')['rating']
mean_ratings

Out[300]:

title
14.73x: The Challenges of Global Poverty                        4.250000
2.01x: Elements of Structures                                   4.750000
3.091x: Introduction to Solid State Chemistry                   4.166667
6.002x: Circuits and Electronics                                4.800000
6.00x: Introduction to Computer Science and Programming         4.166667
7.00x: Introduction to Biology - The Secret of Life             4.666667
8.02x: Electricity and Magnetism                                4.333333
8.MReVx: Mechanics ReView                                       5.000000
A Beginner&#39;s Guide to Irrational Behavior                   4.874150
A Crash Course on Creativity                                    3.500000
A History of the World since 1300                               4.318182
A Look at Nuclear Science and Technology                        3.000000
A New History for a New China, 1700-2000: New Data and New Methods, Part 1    0.500000
AIDS                                                            5.000000
Aboriginal Worldviews and Education                             4.333333
...
The Modern World: Global History since 1760        4.775862
The Modern and the Postmodern                      4.777778
The Science of Gastronomy                          4.000000
The Social Context of Mental Health and Illness    4.333333
Think Again: How to Reason and Argue               3.815789
Useful Genetics Part 1                             4.500000
VLSI CAD:  Logic to Layout                         4.500000
Vaccine Trials: Methods and Best Practices         5.000000
Vaccines                                           3.750000
Web Development                                    4.625000
Web Intelligence and Big Data                      3.802326
Women and the Civil Rights Movement                5.000000
Writing for the Web (WriteWeb)                     5.000000
Writing in the Sciences                            4.000000
jQuery                                             4.250000
Name: rating, Length: 211, dtype: float64

With the mean rating of each course computed, we select the active titles and sort them so that the highest-rated course is listed first.

In [301]:

mean_ratings.loc[active_titles].sort_values(ascending=False)

Out[301]:

title
An Introduction to Interactive Programming in Python            4.915652
Modern &amp; Contemporary American Poetry                       4.901515
Design: Creation of Artifacts in Society                        4.879581
A Beginner&#39;s Guide to Irrational Behavior                   4.874150
Greek and Roman Mythology                                       4.864198
Calculus: Single Variable                                       4.854167
CS188.1x Artificial Intelligence                                4.833333
Machine Learning                                                4.830000
Functional Programming Principles in Scala                      4.822581
Gamification                                                    4.796296
An Introduction to Operations Management                        4.785714
The Modern World: Global History since 1760                     4.775862
Programming Languages                                           4.770833
CS-191x: Quantum Mechanics and Quantum Computation              4.727273
Cryptography I                                                  4.700000
Discrete Optimization                                           4.695652
Introduction to Computer Science                                4.687500
Learn to Program: Crafting Quality Code                         4.585714
Model Thinking                                                  4.578125
Internet History, Technology, and Security                      4.541667
Fantasy and Science Fiction: The Human Mind, Our Modern World    4.522727
Learn to Program: The Fundamentals                              4.303571
6.00x: Introduction to Computer Science and Programming         4.166667
Critical Thinking in Global Challenges                          3.961538
Web Intelligence and Big Data                                   3.802326
Computing for Data Analysis                                     3.187500
Introduction to Finance                                         3.086957
Introduction to Data Science                                    3.060000
Name: rating, dtype: float64
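The filter-then-rank steps above can also be sketched as a single groupby pass that computes the mean and the number of ratings together. The data, the name `toy` and the threshold `min_ratings` are made up for illustration.

```python
import pandas as pd

# Made-up ratings (illustrative only).
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 5, 3, 4, 5],
})
min_ratings = 2  # arbitrary threshold, playing the role of the 20 used above

# One row per title with its mean rating and its number of ratings.
stats = toy.groupby("title")["rating"].agg(["mean", "count"])

# Keep sufficiently rated titles, then list the best-rated first.
top = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)

print(top)
```

Title C has a perfect mean but only one rating, so the threshold removes it. This is the reason for filtering on the number of ratings before ranking by mean.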

To see the top courses among Coursera students, we first pivot the mean ratings so that each provider gets its own column, and then sort by the 'coursera' column in descending order:

In [7]:

mean_ratings = coursetalk.pivot_table('rating', index='title', columns='provider', aggfunc='mean')
mean_ratings[:10]

Out[7]:

provider	None	codecademy	coursera	edx	khanacademy	mruniversity	novoed	open2study	udacity	udemy
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	4.250000	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	4.750000	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	4.166667	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	4.800000	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	4.166667	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	4.666667	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	4.333333	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	5.000000	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	4.87415	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	3.5	NaN	NaN	NaN

In [303]:

mean_ratings['coursera'].loc[active_titles].sort_values(ascending=False)[:10]

Out[303]:

title
An Introduction to Interactive Programming in Python    4.915652
Modern &amp; Contemporary American Poetry               4.901515
Design: Creation of Artifacts in Society                4.879581
A Beginner&#39;s Guide to Irrational Behavior           4.874150
Greek and Roman Mythology                               4.864198
Calculus: Single Variable                               4.854167
Programming Languages                                   4.850000
Machine Learning                                        4.830000
Functional Programming Principles in Scala              4.822581
Gamification                                            4.796296
Name: coursera, dtype: float64

Now, let's go further! How about ranking the courses by the percentage of their ratings that are 4 or higher (% of ratings 4+)?

Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:

In [8]:

# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
                                             rows='user_id',
                                             cols='title')
ratings_mtx_df.ix[ratings_mtx_df.index[:15], ratings_mtx_df.columns[:15]]

Out[8]:

title	14.73x: The Challenges of Global Poverty	2.01x: Elements of Structures	3.091x: Introduction to Solid State Chemistry	6.002x: Circuits and Electronics	6.00x: Introduction to Computer Science and Programming	7.00x: Introduction to Biology - The Secret of Life	8.02x: Electricity and Magnetism	8.MReVx: Mechanics ReView	A Beginner's Guide to Irrational Behavior	A Crash Course on Creativity	A History of the World since 1300	A Look at Nuclear Science and Technology	A New History for a New China, 1700-2000: New Data and New Methods, Part 1	AIDS	Aboriginal Worldviews and Education
user_id
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
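A sketch of what this pivot does, on made-up data (the names `toy` and `toy_mtx` are illustrative): each (user_id, title) pair becomes one cell, and pairs with no rating become NaN.

```python
import pandas as pd

# Made-up ratings in "long" form: one row per (user, title) pair.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "B", "C"],
    "rating": [5, 3, 4, 2],
})

# Rows are users, columns are titles; unrated combinations are filled with NaN.
toy_mtx = toy.pivot_table(values="rating", index="user_id", columns="title")

print(toy_mtx)
print(toy_mtx.count())  # number of ratings per title (NaN is not counted)
```

Note that `pivot_table` averages duplicates by default, so if a user had rated the same title twice, the cell would hold the mean of the two ratings.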

Let's extract only the ratings that are 4 or higher.

In [19]:

# keep ratings of 4 or higher; everything else becomes NaN
ratings_gte_4 = ratings_mtx_df[ratings_mtx_df >= 4.0]

# show the top-left 15 x 15 corner
ratings_gte_4.iloc[:15, :15]

Out[19]:

title	14.73x: The Challenges of Global Poverty	2.01x: Elements of Structures	3.091x: Introduction to Solid State Chemistry	6.002x: Circuits and Electronics	6.00x: Introduction to Computer Science and Programming	7.00x: Introduction to Biology - The Secret of Life	8.02x: Electricity and Magnetism	8.MReVx: Mechanics ReView	A Beginner's Guide to Irrational Behavior	A Crash Course on Creativity	A History of the World since 1300	A Look at Nuclear Science and Technology	A New History for a New China, 1700-2000: New Data and New Methods, Part 1	AIDS	Aboriginal Worldviews and Education
user_id
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Now, taking the total number of ratings for each course and the count of ratings that are 4 or higher, we can combine them into one DataFrame.

In [90]:

ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)

Out[90]:

	gte_4	total
title
14.73x: The Challenges of Global Poverty	2	2
2.01x: Elements of Structures	2	2
3.091x: Introduction to Solid State Chemistry	2	3
6.002x: Circuits and Electronics	10	10
6.00x: Introduction to Computer Science and Programming	15	21
7.00x: Introduction to Biology - The Secret of Life	3	3
8.02x: Electricity and Magnetism	2	3
8.MReVx: Mechanics ReView	1	1
A Beginner's Guide to Irrational Behavior	146	147
A Crash Course on Creativity	2	5

In [92]:

ratings_gte_4_pd['gte_4_ratio'] = (ratings_gte_4_pd['gte_4'] * 1.0)/ ratings_gte_4_pd.total
ratings_gte_4_pd.head(10)

Out[92]:

	gte_4	total	gte_4_ratio
title
14.73x: The Challenges of Global Poverty	2	2	1.000000
2.01x: Elements of Structures	2	2	1.000000
3.091x: Introduction to Solid State Chemistry	2	3	0.666667
6.002x: Circuits and Electronics	10	10	1.000000
6.00x: Introduction to Computer Science and Programming	15	21	0.714286
7.00x: Introduction to Biology - The Secret of Life	3	3	1.000000
8.02x: Electricity and Magnetism	2	3	0.666667
8.MReVx: Mechanics ReView	1	1	1.000000
A Beginner's Guide to Irrational Behavior	146	147	0.993197
A Crash Course on Creativity	2	5	0.400000

In [86]:

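# select the columns explicitly so the tuple order is (title, gte_4, total, ratio)
columns = ['gte_4', 'total', 'gte_4_ratio']
ranking = list(ratings_gte_4_pd[columns].itertuples(name=None))

# sort by ratio first, then by total number of ratings
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)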

Functional Programming Principles in Scala 31 31 1.0
Introduction to Computer Science 24 24 1.0
Programming Languages 24 24 1.0
Web Development 16 16 1.0
6.002x: Circuits and Electronics 10 10 1.0
Compilers 8 8 1.0
Archaeology&#39;s Dirty Little Secrets 7 7 1.0
How to Build a Startup 7 7 1.0
Introduction to Sociology 7 7 1.0
Stat2.1X: Introduction to Statistics: Descriptive Statistics 7 7 1.0
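The same ratio can also be computed directly from the long ratings table, without building the full ratings matrix first, by counting a boolean "4 or higher" flag per title. A sketch on made-up data (the names `toy`, `summary` and `ranked` are illustrative):

```python
import pandas as pd

# Made-up ratings (illustrative only).
toy = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 2, 5, 3, 4, 5],
})

# True where the rating is 4 or higher; summing booleans counts the True values.
is_gte_4 = toy["rating"] >= 4

summary = pd.DataFrame({
    "gte_4": is_gte_4.groupby(toy["title"]).sum(),
    "total": toy.groupby("title").size(),
})
summary["gte_4_ratio"] = summary["gte_4"] / summary["total"]

# Rank by ratio, breaking ties by the number of ratings.
ranked = summary.sort_values(["gte_4_ratio", "total"], ascending=False)
print(ranked)
```

As in the ranking above, a title with a single high rating (C here) reaches a ratio of 1.0, so the ratio alone favours courses with very few ratings.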

Now for something simpler: let's count the number of ratings for each course and sort so that the most-rated courses come first.

In [96]:

ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.sort_values(ascending=False)[:10]

Out[96]:

title
An Introduction to Interactive Programming in Python    575
Design: Creation of Artifacts in Society                191
A Beginner&#39;s Guide to Irrational Behavior           147
Modern &amp; Contemporary American Poetry               132
An Introduction to Operations Management                 98
Greek and Roman Mythology                                81
Critical Thinking in Global Challenges                   65
Gamification                                             54
Machine Learning                                         50
Web Intelligence and Big Data                            43
dtype: int64

With this information, we can list the most-rated courses first and break ties by the percentage of 4+ ratings.

In [97]:

# each tuple is (title, gte_4, total, ratio); sort by total first, then by ratio
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)

An Introduction to Interactive Programming in Python 572 575 0.994782608696
Design: Creation of Artifacts in Society 190 191 0.994764397906
A Beginner&#39;s Guide to Irrational Behavior 146 147 0.993197278912
Modern &amp; Contemporary American Poetry 130 132 0.984848484848
An Introduction to Operations Management 96 98 0.979591836735
Greek and Roman Mythology 80 81 0.987654320988
Critical Thinking in Global Challenges 47 65 0.723076923077
Gamification 52 54 0.962962962963
Machine Learning 48 49 0.979591836735
Web Intelligence and Big Data 26 43 0.604651162791

Finally, let's find out which courses most often co-occur with the popular MOOC An Introduction to Interactive Programming in Python, using the "(x and y) / x" method: for each course y, we calculate the percentage of raters of the Python course (x) who also rated y. Ordering with the highest percentage first, et voilà, we have the top co-occurring MOOCs.

In [102]:

course_users = coursetalk.pivot_table('rating', index='title', columns='user_id')
course_users.iloc[:15, :15]

Out[102]:

user_id	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A History of the World since 1300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Look at Nuclear Science and Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AIDS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aboriginal Worldviews and Education	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

First, let's get only the users that rated the course An Introduction to Interactive Programming in Python:

In [122]:

ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python']
# reassign instead of using inplace=True, since ratings_by_course is a filtered copy
ratings_by_course = ratings_by_course.set_index('user_id')

Now, for all other courses, let's keep only the ratings from users that rated the Python course.

In [138]:

their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
their_ratings.iloc[:15, :15]

Out[138]:

user_id	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A History of the World since 1300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Look at Nuclear Science and Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AIDS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aboriginal Worldviews and Education	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Dividing the number of users who rated both the Python course and the given course by the total number of users who rated the Python course gives us our percentage.

In [158]:

course_count = their_ratings.loc['An Introduction to Interactive Programming in Python'].count()
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count), axis=1)

We order by score, highest first, and skip the first entry, which is the Python course itself.

In [162]:

sims.sort_values(ascending=False)[1:11]

Out[162]:

title
Machine Learning                           0.006957
Cryptography I                             0.006957
Web Development                            0.005217
Python                                     0.005217
Learn to Program: Crafting Quality Code    0.005217
Introduction to Computer Science           0.005217
Human-Computer Interaction                 0.005217
Gamification                               0.005217
Computational Investing, Part I            0.005217
CS-169.1x: Software as a Service           0.005217
dtype: float64
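The "(x and y) / x" computation can also be sketched without pandas, using plain sets of user ids. The data and the names `raters`, `co_occurrence` and `co_ranking` are made up for illustration.

```python
# Made-up data: course -> set of user ids who rated it (illustrative only).
raters = {
    "Python": {1, 2, 3, 4},
    "Machine Learning": {2, 3, 9},
    "Cryptography": {4},
    "Poetry": {7, 8},
}


def co_occurrence(reference, raters):
    """For each other course y, return |raters(x) & raters(y)| / |raters(x)|."""
    base = raters[reference]
    scores = {
        course: len(base & users) / len(base)
        for course, users in raters.items()
        if course != reference
    }
    # Highest share first; sort by name as a deterministic tie-breaker.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


co_ranking = co_occurrence("Python", raters)
for course, share in co_ranking:
    print(f"{course}: {share:.2f}")
```

In the CourseTalk output above, the top score of 0.006957 is consistent with 4 of the 575 Python-course ratings, so very few of those raters rated any other course. With overlaps this small, the resulting ranking should be read with caution.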