# Analysis of the escort market in UK¶

In [95]:
from IPython.core.display import display, HTML, Markdown
display(HTML('''
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide(); } else {$('div.input').show();
}
code_show = !code_show
}
</script>
<a onclick="code_toggle()" href=#>Show source</a>'''))

Show source
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import numpy as np
import sys
import cPickle
from collections import Counter
from IPython.core.display import display, HTML
from IPython.display import Image
import datetime
from stop_words import get_stop_words
import locale
locale.setlocale(locale.LC_TIME, "en_US.UTF-8")
from wordcloud import WordCloud, STOPWORDS
from dateutil.relativedelta import relativedelta
import operator
from ipy_table import *
from collections import Counter
stop_words = get_stop_words('en')
stop_words.extend(['will','her','time','get','can','got','just','didnt','much','done','thing'])

import sys
sys.setdefaultencoding('utf-8')

import seaborn as sb
sb.set_style("darkgrid")

def displayTable(table, theme='basic'):
make_table(table)
apply_theme(theme)
return set_global_style(float_format="%.2f")

def displaySeries(s, width=False):
res = []
res.append([s.index.name, s.name])
for k,v in s.iteritems():
if width:
if len(v)>width:
res.append([k, v[:width]+"..."])
else:
res.append([k, v[:width]])
else:
res.append([k, v])
return displayTable(res)

def displayDataFrame(s, width=False):
res = []
for k,v in s.iteritems():
if width:
if len(str(v.iloc[0]))>width:
res.append([k, str(v.iloc[0])[:width]+"..."])
else:
res.append([k, str(v.iloc[0])[:width]])
else:
res.append([k, str(v.iloc[0])])
return displayTable(res)

def makeWordCloud(words):
wordcloud = WordCloud(#font_path='/home/fcheung/CabinSketch-Bold.ttf'
stopwords=STOPWORDS,
background_color='white',
width=1800,
height=1400,
).generate(words)
plt.imshow(wordcloud)
plt.axis('off')

%pylab inline
pylab.rcParams['figure.figsize'] = (10, 4)

display(HTML("<style>.container { width:70% !important; }</style>"))

In [3]:
with open("data/escort/datasets/reviews.pickle") as f:
df_reviews = df_reviews[df_reviews.visit_date.dt.year>1999]
df_reviews = df_reviews[df_reviews.visit_date.dt.year<=2016]
# Until 31-oct-2016 (full month)
df_reviews = df_reviews[df_reviews.visit_date<datetime.datetime(2016,11,1)]

In [4]:
with open("data/escort/datasets/providers.pickle") as f:

In [5]:
df = pd.merge(left = df_reviews, right = df_providers, on = 'id_service_provider', how = 'left')


Disclaimer and Motivation

I've experienced that many otherwise quiet and thoughtful people are outraged when some subjects are treated. The escort business —and prostitution in general— is such a topic. In case you find yourself offended by the subject, please read this section in full (and thanks for reaching here).

I am a data analyst, and my motivation is to analyze a market somehow in the shadows, or at least not in the public spot. The decision to go for this market was threefold:

1. The business owners are not necessarily technical people. That means, I cannot impress them with Principal Component Analysis, unsupervised clustering or clever dimension reductions. If I don't give them business insights I wasted their time.

2. There is a wealth of data available thanks to the review site Punternet.com. It was structured enough to be able to efficiently extract information, and at the same time free enough to be a challenge -i.e. free text descriptions, free text in the date field, etc.-. I applied basic statistics as well as Natural Language Processing techniques for this report.

3. It is an interesting market from an economic point of view. I find a challenge to price the time of an escort. Usually the equilibrium price —given from the offer-supply— comes from the cost of production (in very broad terms), but here, the cost of the provider is null. It is not even like the case of consultants, where their time is priced (more or less) according to the cost of their education and opportunity costs. In any case, there must be a cost for the escort, or all girls in the world would be doing it (and driving their price down). So, that's why it's interesting. That, and that it is not a openly, analyzed-to-death market, like, for example, real estate.

# Foreword¶

There is an old joke about a drunkard who is searching for something he has lost

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them somewhere along his way home. The policeman asks why he is searching here, and the drunk replies, "this is where the light is."

This joke shows what is called the Streetlight Effect, the bias often present in scientific research. I noticed that reviewers take their time to write long, well articulated reviews, that led me to the conclusion that these clients are mostly well educated. This fact may bias the results, since less educated —or even illiterate— clients are not represented. It may be possible that the most part of clients are aggressive, illiterate individuals, and that this dataset shows just the nice ones. Maybe, but unlikely. There's another aphorism that says If it looks like a duck, and quacks like a duck, we have probably a duck, so probably we are analyzing, at least, a representative sample of the population. Just don't forget that although data might not lie, we can only search where there is light.

# A global glimpse¶

## Origin of the data¶

The review site Punternet.com is a meeting point of punters (escort clients) to share their experiences. It has been active from 1999 and to date (2016) reviews keep arriving in a steady fashion. This provides a wealth of information to analyze and extract information from such closed market.

Here is a screenshot of one random review and one provider.

Example Review
Example Provider

In addition to standard data like location, date and price there are two free text sections, The Lady, where the looks and first impression of the lady is described, and The Story, where the client tells in detail how the session evolved.

Regarding the providers, profiles may be more or less complete, but commonly found information includes Name of the Lady, Prices, List of services, Phone Number, a short description and possibly a Photo Gallery.

In [6]:
display(Markdown("This dataset contains **%d** reviews of **%d** providers from 1999 to today (2016)" % (df.shape[0], df_providers.shape[0])))


This dataset contains 32115 reviews of 715 providers from 1999 to today (2016)

As a first exploratory step, let's analyze the number of visits per month, in order to gauge the popularity (and thus relevance) of the site Punternet.com in the escort industry.

In [7]:
plt.figure()
by_month = df.groupby(pd.Grouper(key='visit_date', freq='1M')).size().plot()
plt.title("Number of total reviews per month (in Punternet.com)")
plt.xlabel("Date")
plt.ylabel("Total reviews")
plt.show()


It seems that Punternet.com grew from less than 10 reviews per month in the year 2000 (according to their site they Established [in] 1999)to a peak of almost 400 reviews per month about 2011. Since then, it seems to be declining to the current value, 200-250 reviews per month. It could mean the escort industry has been declining or that Punternet it losing popularity. For that purpose, we check Google trends.

In [8]:
gt = pd.read_csv("data/escort/aux/googletrends.csv", index_col="Monat")
gt.index = pd.to_datetime(gt.index, format='%Y-%m')
gt.plot()
plt.xlabel("Year")
plt.show()


What's interesting in this chart is that (according to Google Trends) Punternet.com has been in decline since at least 2005, with an anecdotal spike in 2009 (maybe an article in mainstream media?). That means that the previous peak in visits in 2011 existed despite the fact that the website's popularity was decreasing. That leads to believe that in 2011 there were a decreasing number of very active users.

In order to verify that let's plot the number of different users per month:

In [9]:
plt.figure()
by_month = df.groupby(pd.Grouper(key='visit_date', freq='1M')).author_link.apply(lambda x: len(x.unique())).plot()
plt.title("Number of unique users per month (in Punternet.com)")
plt.xlabel("Year")
plt.ylabel("Number of Unique Users")
plt.show()


Well, that was unexpected. It seems that around 2011 the number of unique reviewers also spiked! Indeed, the number of reviews per user (per month) is fairly constant:

In [10]:
by_month = df.groupby(pd.Grouper(key='visit_date', freq='1M')).size()
by_month2 = df.groupby(pd.Grouper(key='visit_date', freq='1M')).author_link.apply(lambda x: len(x.unique()))
(by_month/by_month2).plot()
plt.title("Number reviews per user (per month)")
plt.xlabel("Year")
plt.ylabel("Reviews per user")
plt.show()


All this points to the fact that Punternet was growing in popularity around end 2010, despite Google searches declining. That means clients do not reach the site through Google. Two hypothesis can explain this fact:

1. Clients reach the website directly (Word of mouth).
2. There are other advertising networks (like escort websites associations?) that send traffic to the site.

Let's show the number of different clients that ever posted a review on the site by month

In [11]:
plt.figure(figsize=(15,5))
d0 = datetime.datetime(2000,1,1)
d1 = datetime.datetime(2016,10,2)
d = d0

labels= []
Y= []
while d<d1:

labels.append(d.strftime("%b %Y"))

d+= relativedelta(months=3)

plt.bar(range(len(labels)), [0]+list(np.diff(Y)), tick_label=labels, align='center')
plt.title("Number of new accounts per month")
plt.xticks(rotation='vertical')
plt.xlabel("Date")
plt.ylabel("New accounts")
plt.xlim([0,len(labels)])
plt.show()


This values are to be taken with a grain of salt since due to the anonymity of the site a single user can create multiple accounts. Anyway it gives a global insight of the trend. It seems that most new accounts have been created around end 2010.

Data points to the fact that Punternet.com is declining in both usage and engagement. The question arises, where are these people going? And, is in this new place information so well structured like in Punternet.com?

## Analyzing the time of visits¶

This a posteriori obvious fact caught me by surprise. Let's take a look at the hour at which the visits take place more often.

In [12]:
plt.figure()
plt.subplot(121)
df.visit_date.dt.hour.hist(bins = np.arange(25)-0.5)
plt.xlim(0.5,23.5)
plt.xlabel("Hour of the day")
plt.ylabel("Number of visits")

plt.subplot(122)
ax = df.visit_date.dt.weekday.hist(bins = np.arange(8)-0.5, width=0.8)
tmp = ax.get_xticks().tolist()
tmp[1:8] = ["Mon.","Tue.","Wed.","Thu.","Fri.","Sat.","Sun."]
ax.set_xticklabels(tmp)
plt.xlim(-0.5,6.5)
plt.xlabel("Day of week")
plt.ylabel("Number of visits")
plt.show()
plt.show()


The graph on the left shows that visits are not during night-time, but instead during working hours. This is complemented by the graph on the right, that shows the number of visits per day of week.

We see that visits during the weeks are a 50% higher than on weekends! So, visits take place on work days during working hours. The question is, who goes to the ladies on Sundays? Let's check if this distribution is the same every day of the week.

In [13]:
plt.figure(figsize=(10, 20))
weekdays = ["Mondays","Tuesdays","Wednesdays","Thursdays","Fridays","Saturdays","Sundays"]
for i in range(7):
plt.subplot(4,2,i+1)
df[df.visit_date.dt.weekday==i].visit_date.dt.hour.hist(bins = np.arange(25)-0.5)
plt.title("Visits by hour on "+weekdays[i])
plt.xlim(0.5,23.5)
plt.xlabel("Hour of the day")
plt.ylabel("Number of visits")
plt.ylim((0,610))
plt.show()


Interestingly enough, people who go on Sundays also go on working hours. Maybe people who work on Sundays too?

To confirm, this distribution is also constant across the years.

In [14]:
plt.figure(figsize=(20,10))
for i, year in enumerate(range(2009,2017)):
plt.subplot(2,4,i+1)
plt.title("Year %s" % year)
df[df.visit_date.dt.year==year].visit_date.dt.hour.hist(bins = np.arange(25)-0.5)
plt.xlim(0.5,23.5)
plt.ylim(0,450)
plt.xlabel("Hour of the day")
plt.ylabel("Number of visits")
plt.show()


If clients mostly go at working hours on working days, does it mean that in summer (and thus, vacation time) there are less visits?

In [15]:
plt.figure()
times = pd.to_datetime(df.visit_date)
df_month = df.groupby(times.dt.month).size()
ax = df_month.plot(kind="bar")
plt.xlim(-0.5,12)
plt.title("Number of visits by month")
plt.xticks(rotation='horizontal')
months = ["Jan.","Feb.","March","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov","Dec"]
ax.set_xticklabels(months)
plt.xlabel("Month")
plt.ylabel("Number of visits")

plt.show()

In [16]:
display(Markdown("""The drop in number of visits between the highest (**%s** visits in **%s**) and the lowest (**%s** visits in **%s**) months is just **%s**, a small but *noticeable* value. It's remarkable that the highest value is achieved in **July**, in the middle of the summer, while the lowest value falls in winter (probably because of *Christmas*).""" %  (df_month.max(), months[df_month.argmax()-1], df_month.min(), months[df_month.argmin()-1], "%.d%%" % ((df_month.max()-df_month.min())*1.0/df_month.max()/0.01)) ))


The drop in number of visits between the highest (2892 visits in Jul.) and the lowest (2378 visits in Dec) months is just 17%, a small but noticeable value. It's remarkable that the highest value is achieved in July, in the middle of the summer, while the lowest value falls in winter (probably because of Christmas).

Lastly, let's see if the number of visits are constant throughout the month. If there are more visits at the beginning of the month than at the end, it would indicate that the clients are sensitive to the payday. A priori I believe that is not the case. Let's see the data:

In [17]:
plt.figure()
times = pd.to_datetime(df.visit_date)
df_day_month = df.groupby(times.dt.day).size()
ax = df_day_month.plot(kind="bar")
plt.xlim(-0.5,31)
plt.title("Number visits by day in month")
plt.xlabel("Day in month")
plt.xticks(rotation='horizontal')
plt.ylabel("Number of visits")
plt.show()

In [18]:
display(Markdown("""After ignoring the **31st** (and maybe also the **30**, because of *February*), there seems to be a slight decreasing slope between the **1st** (**%s** visits) and the **29th** (**%s** visits). That means a drop of **%s**.""" %  (df_day_month[1], df_day_month[29], "%.d%%" % ((df_day_month[1]-df_day_month[29])*1.0/df_day_month[1]/0.01)) ))


After ignoring the 31st (and maybe also the 30, because of February), there seems to be a slight decreasing slope between the 1st (1086 visits) and the 29th (955 visits). That means a drop of 12%.

[Note for the mathematically inclined reader]: In order to carry out a significance analysis we would need to make further hypothesis about the rate of the arrival of clients. A first approach would be to model the number of clients in a given day as a Poisson process (with lambda estimated on the first half of the year), and then calculate the probability of getting the results of the second half, knowing that the process is controlled by said Poisson process. This is left as an exercise to the reader since such a proof is out of the scope of this report.

## How many go to escorts?¶

Let's see if the escort market is dominated by a handful of very active clients, or if many people try a few times along their lives. Next graph shows the number of visits per client. As before, since Punternet.com is anonymous, the actual number of visits may be larger, if the same user posted from multiple accounts.

In order to make the following graph, all clients have been sorted by number of visits along the horizontal axis. Then, for each client, the height of the line shows his number of visits.

In [19]:
#plt.figure(figsize=(15,8))
visits_client_old_index = copy(visits_client.index)
visits_client.hist(bins = np.arange(0,20,1)-0.5)

#plt.plot([0,len(visits_client)], [visits_client.median()]*2,linestyle='dashed', c='r', label="Median visits per client(%.2f)" % (visits_client.median()))
#plt.yticks(list(plt.yticks()[0]) + [visits_client.median()])
plt.legend()

plt.title("Recurrence of visits")
plt.xlabel("Number of visits")
plt.xlim((0.5,21))
plt.ylabel("Number of clients")
plt.show()


As normal, most people have low values and there are small groups with a high value (as with income and number of visits). The most common value if 1 visit.

In [20]:
display(Markdown("* **Top 10%%** punters made more than **%d** visits\n" % (visits_client.quantile(0.9))))
display(Markdown("* **Top 1%%** punters made more than **%d** visits\n" % (visits_client.quantile(0.99))))
display(Markdown("* **Most profilif** punter made **%d** visits\n" % (visits_client.max())))

• Top 10% punters made more than 5 visits
• Top 1% punters made more than 22 visits
• Most profilif punter made 213 visits

Wow! Who's this guy that made 213 visits? Let's find out...

In [21]:
display(Markdown("""
His *nickname* is **%s**. He seems to have spent **&#163;%s** so far in *at least* **%d different girls**.

Let's look at how many times he went with each one (the rest are unknown):

girl_id="Indian Palace"


His nickname is COLL. He seems to have spent £13,409 so far in at least 10 different girls.

Let's look at how many times he went with each one (the rest are unknown):

In [22]:
g = df[df.username==username].groupby("name").author_link.size()
g =g[g.index>0]
g.sort_values(inplace=True,ascending=False)
g.name = "Number of visits"
name = g.index[0]
displaySeries(g)

Out[22]:
 Lady Number of visits Indian Palace 12 Covent Garden Girls 4 VIP Massage 1 Stunning Indian Sonam - Highly Recommended 1 Sara-100% Independent Escort 1 Chantelle 1 Anjali 1 Amira Pretty Indian 1
In [23]:
display(Markdown("Who is this **%s** he visited **%d** times?" % (girl_id, g[girl_id])))


Who is this Indian Palace he visited 12 times?

In [24]:
tmp = df.set_index("name", inplace=False)
pd.set_option('max_colwidth', 1000)
displaySeries(a, width=100)

Out[24]:
 name lady_description Indian Palace Sara is a young punjabi girl in her early 20s and although she claimed to be Hindu....she had the lo... Indian Palace Young 20?s, striking indian girl from Birmingham. Slim sexy figure and a very nice face. Shes a very... Indian Palace Sara is a sexy fair skinned Indian babe, 5' 5" & athletic, wavy long hair, perfectly shaped cup C bo... Indian Palace Pretty Anglo-Indian girl Indian Palace Sara is apparently 21 and studying for a sociology degree. Sara's pictures on the site do not do her...
In [25]:
display(Markdown("Seems to be an Indian lady named **Sara**, who worked **%d** times with **%d** different clients. Here is her profile:" % ((df.name==girl_id).sum(),df[df.name==girl_id].author_link.unique().shape[0])))


Seems to be an Indian lady named Sara, who worked 208 times with 112 different clients. Here is her profile:

In [26]:
displayDataFrame(df_providers[df_providers.name==name][['about', 'age', 'category', 'location', 'name']])

Out[26]:
 about This profile has expired. age nan category Service Provider Category: Agency/Parlour     Profile Type: Company location Central London, Bayswater W2 name Indian Palace
In [27]:
display(Markdown("""
It seems an expired profile. Well, that's odd considering that her last review was on **%s**. It should be noted that it is also an **agency**, not a single lady.
Anyway, why is this profile *abandoned*? It seems to be the preferred place of a good customer. Maybe it's not so good *overall*? Let's see the global ranking by month.
"""% (df[df.name==girl_id].visit_date.max().strftime('%A %B %-d %Y at %H:%M') )))


It seems an expired profile. Well, that's odd considering that her last review was on Friday September 30 2016 at 23:00. It should be noted that it is also an agency, not a single lady. Anyway, why is this profile abandoned? It seems to be the preferred place of a good customer. Maybe it's not so good overall? Let's see the global ranking by month.

To make the following graph, ladies have been sorted according to the number of visits per month and distributed along the horizontal axis. Then, for each lady, the height of the line shows her exact number of clients per month.

In [28]:
first_visit_by_lady = df.groupby("name").visit_date.min()
number_of_total_visits = df.groupby("name").visit_date.count()

plt.legend()

plt.title("Number of visits per month per lady")
plt.xticks(rotation='vertical')
plt.ylabel("Visits per month")
plt.show()


The dashed red line shows the position of Indian Palace, which is among the Top performers. That's really odd. Maybe there's a reason why they let their profile expire, but probably it wouldn't hurt to keep it updated.

Let's analyze now the most active ladies/clubs

In [29]:
visits_by_provider = df[df.name!=-1].groupby("name").size().sort_values(ascending=False).head()
visits_by_provider.name ="Number of Visits"
displaySeries(visits_by_provider)

Out[29]:
 Lady Number of Visits Sandy's Superstars 3233 Annabellas MK 1527 House Of Divine 772 Arabesque 646 Max's Angels 505

Interestingly there seems to be two large providers with more than 1000 reviews.

First one:

In [30]:
most_active_providers = visits_by_provider.index
first = df_providers[df_providers.name==most_active_providers[0]]

 name Sandy's Superstars about This limited profile was automatically created as a result of reviews being submitted. If you are th... category Service Provider Category: Agency/Parlour     Profile Type: Company location Manchester phone 01619029902

It doesn't seem to give much information, other than it's a Mancherster agency named Sandy's Superstars. The name implies multiple ladies and thus it's normal to have such a high number of visits.

Let's take a look at the second one:

In [31]:
second = df_providers[df_providers.name==most_active_providers[1]]
pd.set_option('max_colwidth', 100)
display(displayDataFrame(second[['name','about', 'category','location','rate_text', 'number_of_pictures', 'tags', 'website']], width=100))

 name Annabellas MK about Whether you're a local to Milton Keynes or a visitor, Annabella's is worth the trip! Our Milton Key... category Service Provider Category: Agency/Parlour     Profile Type: Company location Milton Keynes rate_text 20min  'Quicky'—£50 30min -  fully inclusive—£60 45min - fully inclusive—£90 60min - fully... number_of_pictures 20 tags [u'Assisted Bath/Shower', u'Breast Relief', u'Domination - Mild', u'Domination - Severe', u'Face Sit... website http://www.annabellasescorts.com/home.php

Now this one seems more interesting. As expected, it's also an agency and with a fairly complete profile. It's located in Milton Keynes.

Now that we have these two let's compare some relevant metrics, like income distribution and satisfaction of clients. Both are expected to be high, but let data speak for itself.

In [32]:
ax = visit_per_month_lady.hist(bins=np.arange(0,20,1)-0.5)
plt.legend()

ax.xaxis.set_major_locator(plticker.MultipleLocator(base=1.0))
plt.title("Activity of providers")
plt.xlabel("Number of clients per month")
plt.xlim(-0.5,20.5)
plt.xticks(rotation='horizontal')
plt.ylabel("Number of providers")
plt.show()


This graph shows that most of the providers have one single review per month. Of course, not all clients leave review, so this graph shouls be read with an at least in mind.

In [33]:
table = [ ['Top %', 'Visits per month'],
]
displayTable(table)

Out[33]:
 Top % Visits per month Best 18.36 Top 0.1% 11.78 Top 1% 3.24 Top 2% 2.11 Median (50%) 0.40

Indeed the difference between the 0.1% and the rest is extremely large. Who are those ladies? Or maybe are they agencies?

In [34]:
visit_per_month_lady.name = "Visits per month"

Out[34]:
 Lady id Visits per month Sandy's Superstars 18.36 Annabellas MK 8.90 House Of Divine 5.76 Arabesque 3.63 Ego Massage 3.49

They are the same as in the previous graph. The first one, Id 13521 is the Manchester agency named Sandy's Superstars. They don't have a profile on the site yet, but data points that they should really create one and keep it updated.

## Cost. Price. Income. Let's talk about money.¶

### Income per visit¶

In [35]:
plt.figure()
earnings_per_visit = (df[df.name!=-1].groupby("name").amount_paid.sum()/df[df.name!=-1].groupby("name").amount_paid.count()).sort_values(ascending=False)
plt.title("Frequency of income per visit")
ax = earnings_per_visit.hist(bins = np.arange(0,1000,50)-25)

plt.xlim((25,1000))
plt.xlabel("Inocome per visit")
ax.xaxis.set_major_locator(plticker.MultipleLocator(base=50.0))
plt.show()


As shown in the graph, most ladies earn around £100 per visit. The Top 1% performers earn most of the money in the industry, while the median income is not particularly high.

Let's look at the numbers:

In [36]:
table = [ ['Top %', 'Income per visit'],
['Best' , earnings_per_visit.max()],
['Top 1%', earnings_per_visit.quantile(0.99)],
['Top 2%', earnings_per_visit.quantile(0.98)],
['Top 3%', earnings_per_visit.quantile(0.97)],
['Median (50%)', earnings_per_visit.quantile(0.5)]
]
displayTable(table)

Out[36]:
 Top % Income per visit Best 2296.25 Top 1% 900.42 Top 2% 674.63 Top 3% 569.92 Median (50%) 141.82
In [37]:
display(Markdown("""
Now it's clear that **the Top 1%% earns a %d%% more than the median lady**.
For info, *the top earner per visit* is **%s** from **%s**.
""" % ((earnings_per_visit.quantile(0.99)-earnings_per_visit.quantile(0.5))*100/earnings_per_visit.quantile(0.5),
df[df.name==earnings_per_visit.argmax()].name.iloc[0],
df[df.name==earnings_per_visit.argmax()].location.iloc[0] )))


Now it's clear that the Top 1% earns a 534% more than the median lady. For info, the top earner per visit is Elegant Escort Agency from Nottingham.

On a micro level, let's view the income distribution for both the top performer and the second best.

In [38]:
plt.figure()
plt.subplot(121)
plt.title("Amount paid per visit for top performer")
df[df.name==most_active_providers[0]].amount_paid.hist(bins = np.arange(0,300,10))
plt.xlabel("Amount paid")
plt.ylabel("Number of visits")

plt.subplot(122)
plt.title("Amount paid per visit for second top performer")
df[df.name==most_active_providers[1]].amount_paid.hist(bins = np.arange(0,300,10))
plt.xlabel("Amount paid")
plt.ylabel("Number of visits")
plt.ylim((0,1400))
plt.show()


The most common price seems to be slightly above £50 per visit, with a long tail of high paying clients.

We can go a step further and analyze the income per hour, instead of per visit. Although the author does not find this metric particularly interesting (and even misleading), since extremely short services (or even those for where the user filled with a zero) give inflated values. To account for that only visit longer than 30 minutes have been considered.

### Income per hour¶

In [39]:
plt.figure(figsize=(15,5))
df2 = df[df.duration_visit_minutes>=30]

earnings_per_hour = (df2.groupby("name").amount_paid.sum()/df2.groupby("name").duration_visit_minutes.sum()).sort_values(ascending=False)*60.0
earnings_per_hour.hist(bins=np.arange(0,1000,50))

plt.title("Frequency of income per visit")
plt.xlabel("Income per visit")
plt.show()


We can repeat the table above to view the distribution of income per hour

In [40]:
table = [ ['Top %', 'Income per hour'],
['Best' , earnings_per_hour.max()],
['Top 1%', earnings_per_hour.quantile(0.99)],
['Top 2%', earnings_per_hour.quantile(0.98)],
['Top 3%', earnings_per_hour.quantile(0.97)],
['Median (50%)', earnings_per_hour.quantile(0.5)]
]
displayTable(table)

Out[40]:
 Top % Income per hour Best 3125.00 Top 1% 1204.03 Top 2% 531.02 Top 3% 331.30 Median (50%) 131.62
In [41]:
display(Markdown("""
Compared to the previous table (*income per visit*) here the values are sensibly larger. In particular, the top earner *per hour* is larger than the top earner *per visit*. For info, *the top earner per visit* is **%s** from **%s**.
""" % (df2[df2.name==earnings_per_hour.argmax()].name.iloc[0], df2[df2.name==earnings_per_hour.argmax()].location.iloc[0]) ))


Compared to the previous table (income per visit) here the values are sensibly larger. In particular, the top earner per hour is larger than the top earner per visit. For info, the top earner per visit is Exquisite Cristal from North West/Manchester.

### Cost of visit at different times¶

Let's see if some hours of the day are more expensive than others.

In [42]:
plt.figure()
times = pd.to_datetime(df.visit_date)
df.groupby(times.dt.hour).amount_paid.median().plot(kind="bar")
plt.xlim(-0.5,23.5)
plt.title("Median cost per visit by hour of day")
plt.xticks(rotation='horizontal')
plt.xlabel("Hour of the day")
plt.ylabel("Median cost of visit")
plt.show()


It seems that they might be. Notice how the more expensive hours are those with very few visits according to the first graph (see above), so that might very well be statistical noise (and thus not a proof of anything). That is the case for 8:00-9:00 a.m.

For hours in the evening, around 20:00, this surge in price indicates that visits starting at that time or later include all-night, longer services.

We saw before that most visits take place on workdays. But what happens with their price? Is it the same every day of the week?

In [43]:
plt.figure()
times = pd.to_datetime(df.visit_date)
df_weekdays = df.groupby(times.dt.weekday).amount_paid.median()
ax = df_weekdays.plot(kind="bar")
plt.xlim(-0.5,6.5)
plt.title("Median cost per visit by day of week")
plt.xticks(rotation='horizontal')
tmp = ["Mon.","Tue.","Wed.","Thu.","Fri.","Sat.","Sun."]
ax.set_xticklabels(tmp)
plt.xlabel("Day of the week")
plt.ylabel("Median cost of visit")
plt.show()

In [44]:
display(Markdown("""
Weekends are slightly more economical, being the lowest on **Sunday** a **%d%%** cheaper than the most expensive, **Tuesday-Thursday**.
""" % ((df_weekdays.max()-df_weekdays.min())/df_weekdays.max()/0.01, ) ))


Weekends are slightly more economical, being the lowest on Sunday a 20% cheaper than the most expensive, Tuesday-Thursday.

Concerning the median price of a visit along the year, we get this:

In [45]:
plt.figure()
by_year = df[df.visit_date>datetime.datetime(2003,1,1)].groupby(pd.Grouper(key='visit_date', freq='12M')).amount_paid.median()
ax = by_year.plot(kind='bar')
plt.title("Median amount paid by year (from 2003 onwards)")
plt.xlabel("Year")
plt.xticks(rotation='horizontal')
plt.ylabel("Median cost per visit")
ax.set_xticklabels(by_year.index.year-1)
plt.show()


Median cost passed from £80 in 2002 to £100 in 2016. Which doesn't seem to be higher than the inflation rate, according to the Bank Of England.

The next quesiton I answered was if the number of visits and the cost per visit are constant along the month. It seems to be so.

In [46]:
plt.figure(figsize=(20,5))
plt.subplot(121)
times = pd.to_datetime(df.visit_date)
ax = df.groupby(times.dt.day).amount_paid.median().plot(kind="bar")
plt.xlim(-0.5,31)
plt.title("Median cost per visit by day in month")
plt.xlabel("Day in month")
plt.xticks(rotation='horizontal')
plt.ylabel("Median cost of visit")

plt.subplot(122)

times = pd.to_datetime(df.visit_date)
ax = df.groupby(times.dt.month).amount_paid.median().plot(kind="bar")
plt.xlim(-0.5,12)
plt.title("Median cost per visit by month")
plt.xticks(rotation='horizontal')
tmp = ["Jan.","Feb.","March","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov","Dec"]
ax.set_xticklabels(tmp)
plt.xlabel("Month")
plt.ylabel("Median cost of visit")

plt.show()


Next graph shows the median cost per visit for each location. In order to stabilize the results only locations with more than 10 visits have been considered.

In [47]:
plt.figure()
g = df[df.visit_date.dt.year>2010].groupby("location")
g.filter(lambda e: len(e)>50).groupby("location").amount_paid.median().sort_values(ascending=False).plot(kind='bar', label="")

plt.title("Median cost per visit per city (with 50+ visits)")
plt.xlabel("City")
plt.ylabel("Median cost per visit")
plt.legend()
plt.show()


The most expensive area is Earl's Court district in London, with a representative price £200 per visit.

To conclude, in order to give an impression of the amount of money this industry moves per year, take a look at the following graph.

In [48]:
def sumClampToPct95(v):
th = v.quantile(0.95)
v[v>th]= th
return v.sum()

plt.figure()
df.groupby(pd.Grouper(key='visit_date', freq='12M')).amount_paid.apply(sumClampToPct95).plot(label="Total amount paid (k)")
(df.groupby(pd.Grouper(key='visit_date', freq='12M')).size()*100).plot(label="Number of visits (x100)", alpha=0.3, linestyle='dashed')

plt.legend()
plt.show()


This graph shows, naturally, the total amount of money paid by the people who post reviews in the site. It is, then, a fraction of the actual value. In any case, during the peak in 2011 just the punters who shared their experience in Punternet.com spent more than £400,000 per year.

Be careful when over-analyzing this graph. The amount of money spent as reported by Punternet.com depend on the popularity of the site. Indeed, this graph is basically mirroring the number of visits to Punternet.com (dashed green line).

## Satisfaction criteria¶

To analyze the satisfaction of the client is tricky. A first approach was to measure the frequency of recommendation. Unfortunately, this approach is too coarse for useful analysis. The client cannot say so so recommended, but either a full yes/no. In order to increase the accuracy of satisfaction the text of the story has been run through a sentiment analyzer. These relatively novel techniques in Natural Language Processing have been nicely implemented by the VADER (Valence Aware Dictionary and sEntiment Reasoner) project.

In [49]:
display(Markdown("""Each of the **%s** reviews has been analyzed and a score between **-1** (most negative) and **+1** (most positive) has been given.""" %  (len(df_reviews)) ))


Each of the 32115 reviews has been analyzed and a score between -1 (most negative) and +1 (most positive) has been given.

Then, the average satisfaction of all the visits for each price point has been plotted.

In [50]:
df.groupby(pd.cut(df.amount_paid, np.arange(0, 500,50)))[['lady_sentiment','story_sentiment']].mean().plot(kind='bar')
plt.legend(bbox_to_anchor=(1.2, 1.2))
plt.title("Average satisfaction (according to story description) by amount paid")
plt.xticks(rotation='horizontal')
plt.xlabel("Amount paid")
plt.ylabel("Average satisfaction (from all reviews)")
plt.show()


It seems to indicate that no clear correlation between price and satisfaction exists, except for that in the bottom range. In order to check for consistency let's plot the same graph, but with the average recommendation.

In [51]:
plt.figure()
df.groupby(pd.cut(df.amount_paid, np.arange(0, 500,50))).recommend.mean().plot(kind='bar')
plt.title("Average satisfaction (according to if recommended) by amount paid")
plt.xticks(rotation='horizontal')
plt.xlabel("Amount paid")
plt.ylabel("Average satisfaction (from 'if recommended')")
plt.show()