Notebook

Department of Informatics and Telecommunications - University of Athens

Data Mining

Data Analysis for AirBnB dataset

Konstantinos Nikoletos | Myrto Iglezou

Spring 2020

Import of essential libraries

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number
import folium as fl
import wordcloud as wc
import collections
import seaborn as sbn
import nltk
import itertools
import matplotlib.image as mpimg
from pandas import DataFrame, read_csv
from string import punctuation 
from wordcloud import STOPWORDS,WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import OrderedDict
from operator import itemgetter
from nltk import word_tokenize, BigramCollocationFinder
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer,PorterStemmer
%matplotlib inline
punctuation = list(punctuation)
punctuation.append('’')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nikol\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nikol\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Read of files and constructing the train.csv¶

In [2]:

months = ['\\april','\\febrouary','\march']
months2 = ['febrouary','march','april']
files = ['\listings.csv','\listings0.csv']
# inputPath= r"C:\Users\myrto\Desktop\data"
inputPath= r"C:\Users\nikol\Desktop\dataMining_p1\data\data"
columnlist = ['id','zipcode','transit','bedrooms','beds','review_scores_rating','number_of_reviews','neighbourhood','name','latitude','longitude','last_review','instant_bookable','host_since','host_response_rate','host_identity_verified','host_has_profile_pic','first_review','description','city','cancellation_policy','bed_type','bathrooms','accommodates','amenities','room_type','property_type','price','availability_365','minimum_nights','host_id']

framelist = []
monthlist = []
i=0
for month in months:
    for file in files:
        tempfile = pd.read_csv(inputPath+month+file,index_col=False)
        tempframe = pd.DataFrame(data=tempfile)
        framelist.append(tempframe)
    for l in range(0,len(framelist)-1):
        train = framelist[0].combine_first(framelist[l+1])
    df = pd.DataFrame(data = train, columns=columnlist)
    df.drop_duplicates(subset='id',ignore_index=True,inplace=True)  # id is primary key, no duplicates allowed in the same month
    df.insert(loc=len(df.columns),column='month_id',value=months2[i])
    monthlist.append(df)
    i=i+1

df = pd.concat(monthlist)
df['price'] = df['price'].apply(lambda x: x.translate(str.maketrans({',':'','$':''}))) # changing data types of column price
df['price'] = pd.to_numeric(df['price'])

# replacing greek words that already exist with their existent english one 
df.replace(to_replace='ΠΑΓΚΡΑΤΙ',value='Pangrati',inplace=True)
df.replace(to_replace='ΕΜΠΟΡΙΚΟ ΤΡΙΓΩΝΟ-ΠΛΑΚΑ',value='Emporiko Trigono-Plaka',inplace=True)
df.replace(to_replace='ΑΓΙΟΣ ΚΩΝΣΤΑΝΤΙΝΟΣ-ΠΛΑΤΕΙΑ ΒΑΘΗΣ',value='Agios Konstantinos-Plateia Vathis',inplace=True)
df.replace(to_replace='ΜΟΥΣΕΙΟ-ΕΞΑΡΧΕΙΑ-ΝΕΑΠΟΛΗ',value=' Mouseio-Exarcheia-Neapoli',inplace=True)
df.replace(to_replace='ΠΕΝΤΑΓΩΝΟ',value='Pentagono',inplace=True)

# making a copy of dataframe for query 3
originaldf = df.copy(deep=True)

for x in df.select_dtypes('number').columns:
    df[x].fillna(df[x].mean(),inplace=True) # replacing NAN with the mean , in numeric columns
df.fillna(method='backfill',inplace=True) # filling the NAN of the string columns with the previous valid
df.dropna(inplace=True) # if there are some still existing -> drop
df.to_csv("train.csv")

C:\Users\nikol\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (61,62) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Merge of files reviews, reviews0 into a Dataframe¶

In [4]:

# preparing wanted dataframe for the wordcloud of last reviews
temp = df[['id','neighbourhood']]
# reviews = r'C:\Users\myrto\Desktop\data\april\reviews.csv'
# reviews0 = r'C:\Users\myrto\Desktop\data\april\reviews0.csv'
reviews = r'C:\Users\nikol\Desktop\dataMining_p1\data\data\april\reviews.csv'
reviews0 = r'C:\Users\nikol\Desktop\dataMining_p1\data\data\april\reviews0.csv'
reviews = pd.read_csv(reviews,usecols=['id','comments'])
reviews0 = pd.read_csv(reviews0)
reviews = pd.DataFrame(data=reviews)
reviews0 = pd.DataFrame(data=reviews0)
reviews = reviews.combine_first(reviews0)
reviews.dropna(inplace=True)

1.1 The most common room type¶

In [5]:

explode = (0.1,0,0)
dt = df['room_type'].value_counts()
dt.plot(kind='pie',figsize=(5,5),title='Most frequent room type',fontsize=10,explode=explode,startangle=90,colors=['blue','darkblue','red'],shadow=True)
plt.title("Pie Chart of Room Type",fontweight='bold',pad=10)
plt.ylabel("")
plt.show()

1.2 Average price escalation for the three months¶

In [6]:

groupbymonth = df.groupby(by='month_id',sort=False)['price'].mean().plot(kind='line',x='per_month',y='price',color='blue',figsize=(20,5),title='Average price scaling for three months')
plt.title('Average price escalation for the three months',fontweight='bold',fontsize=15,pad=10)
plt.ylabel('#Average price',fontweight='bold',fontsize=10)
plt.xlabel('Months',fontweight='bold',fontsize=10)
plt.show()

1.3 Five most reviewed neighbourhoods¶

In [11]:

temp = df[['neighbourhood','number_of_reviews']].groupby('neighbourhood',as_index=False).sum()
temp = temp.sort_values(['number_of_reviews'],ascending=False)

explode = (0.1,0.1,0.1,0.1,0.1,0,0,0,0,0)
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('5 most reviewed neighbourhoods',fontweight='bold',fontsize=25)
temp.head(5).plot.bar(x='neighbourhood',color='brown',figsize=(20,15),fontsize=20,ax=ax1)
ax1.set_title('Top 5',fontweight='bold',fontsize=20)
ax1.set_ylabel('#Number of reviews',fontweight='bold',fontsize=15)
ax1.set_xlabel('Neighbourhoods',fontweight='bold',fontsize=15)
temp.head(10).plot(kind='pie',x='neighbourhood',y='number_of_reviews',figsize=(30,10),startangle=90, shadow=True, labels=temp['neighbourhood'], legend = False,fontsize=20,explode=explode,ax=ax2)
ax2.set_title('Top 10',fontweight='bold',fontsize=15)
ax2.set_ylabel('')
ax2.set_xlabel('')
fig.subplots_adjust(hspace=0.5)

1.4 Neighbourhoods with the most entries¶

In [14]:

df.groupby(by='neighbourhood',as_index=False).agg({'id':'nunique'}).sort_values(['id'],ascending=False,ignore_index=True).plot.bar(x='neighbourhood',color='darkgreen',fontsize=15,figsize=(20,5))
plt.title('Neighbourhoods with the most entries',fontweight='bold',fontsize=15,pad=10)
plt.ylabel('#Entries',fontweight='bold',fontsize=15)
plt.xlabel('Neighbourhoods',fontweight='bold',fontsize=15)
plt.show()

1.5 Number of entries per neighbourhood and per month¶

In [24]:

d = {}
templist = []
aprilList = []
marchList = []
febrouaryList = []
groupbyMonth = df
groupbyMonth = groupbyMonth.groupby(['neighbourhood','month_id']).agg({'id':'count'}).reset_index()
groupbyMonth = groupbyMonth.rename(columns = {'id':'Count'})
groupbyMonth.apply(lambda row : templist.append(tuple([row['neighbourhood'],row['month_id'],row['Count']])),axis=1)

for neighbourhood,month,count in templist:
    d[neighbourhood] = {'april':0,'march':0,'febrouary':0}

for neighbourhood,month,count in templist:
    d[neighbourhood][month] = count

neighbourhoodList = [x for x in d.keys()]
for x in neighbourhoodList:
    aprilList.append(d[x]['april'])
    marchList.append(d[x]['march'])
    febrouaryList.append(d[x]['febrouary'])

barWidth = 0.3
plt.figure(figsize=(30,10))
# Set position of bar on X axis
r1 = np.arange(len(neighbourhoodList))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
 
# Make the plot
plt.bar(r1, febrouaryList, color='maroon', width=barWidth, edgecolor='white', label='febrouary')
plt.bar(r2, marchList, color='wheat', width=barWidth, edgecolor='white', label='march')
plt.bar(r3, aprilList, color='olive', width=barWidth, edgecolor='white', label='april')
 
# Add xticks on the middle of the group bars
plt.title('Number of entries per neighbourhood and per month',fontweight='bold',fontsize=30,pad=10)
plt.xlabel('Neighbourhoods', fontweight='bold',fontsize=15)
plt.ylabel('#Entries', fontweight='bold',fontsize=15)
plt.xticks([r + 0.3 for r in range(len(neighbourhoodList))], neighbourhoodList,rotation='vertical',fontsize=20)
 
# Create legend & Show graphic
plt.legend()
plt.show()

1.6 Neighbourhood column histogram¶

In [26]:

graph = df.groupby(by='neighbourhood').size()
graph.plot.bar(y='Count',color='darkgreen',fontsize=15,figsize=(30,10))
plt.title('Neighbourhood column histogram',fontweight='bold',fontsize=20,pad=10)
plt.ylabel('#Entries',fontweight='bold',fontsize=15)
plt.xlabel('Neighbourhoods',fontweight='bold',fontsize=15)
plt.show()

1.7 Most frequent room type per neighbourhood¶

In [28]:

d = {}
templist = []
aptList = []
privateRoomList = []
entireHomeAptList = []
tempDf = df[['room_type','neighbourhood','id']]
tempDf = tempDf.groupby(['neighbourhood','room_type']).agg({'id':'nunique'}).reset_index()
tempDf = tempDf.rename(columns = {'id':'Count'})
tempDf.apply(lambda row : templist.append(tuple([row['neighbourhood'],row['room_type'],row['Count']])),axis=1)

for neighbourhood,type,count in templist:
    d[neighbourhood] = {'Private room':0,'Entire home/apt':0,'Shared room':0}

for neighbourhood,type,count in templist:
    d[neighbourhood][type] = count

neighbourhoodList = [x for x in d.keys()]
for x in neighbourhoodList:
    privateRoomList.append(d[x]['Private room'])
    entireHomeAptList.append(d[x]['Entire home/apt'])
    aptList.append(d[x]['Shared room'])

barWidth = 0.3
plt.figure(figsize=(30,10))
# Set position of bar on X axis
r1 = np.arange(len(neighbourhoodList))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
 
# Make the plot
plt.bar(r1, privateRoomList, color='lightblue', width=barWidth, edgecolor='white', label='Private room')
plt.bar(r2, entireHomeAptList, color='darkblue', width=barWidth, edgecolor='white', label='Entire home/apt')
plt.bar(r3, aptList, color='olive', width=barWidth, edgecolor='white', label='Shared room')
 
# Add xticks on the middle of the group bars
plt.title('Most frequent room type per neighbourhood',fontweight='bold',fontsize=25,pad=10)
plt.xlabel('Neighbourhoods', fontweight='bold',fontsize=15)
plt.ylabel('Frequency', fontweight='bold',fontsize=15)
plt.xticks([r + 0.3 for r in range(len(neighbourhoodList))], neighbourhoodList,rotation='vertical',fontsize=20)
 
# Create legend & Show graphic
plt.legend()
plt.show()

1.8 Most expensive room type¶

In [31]:

tempdf = df[['room_type','id','price']]
# tempdf.drop_duplicates(subset='id',inplace=True)
temp = tempdf.groupby(by=['room_type'],as_index=False).agg({'price':'mean'}).sort_values(['price'],ascending=False,ignore_index=True)
explode = (0.1,0,0)
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Most expensive room type',fontweight='bold',fontsize=25)
temp.plot(kind='bar',x='room_type',y='price',color=['darkred','salmon','tomato'],figsize=(30,10),fontsize=15,ax=ax1)
ax1.set_title('Histogram',fontweight='bold',fontsize=15)
ax1.set_ylabel('#Average price',fontweight='bold',fontsize=15)
ax1.set_xlabel('Room type',fontweight='bold',fontsize=15)
temp.plot(kind='pie',x='room_type',y='price',figsize=(30,10),startangle=90, shadow=True,colors=['darkred','salmon','tomato'], labels=temp['room_type'].unique(), legend = False,fontsize=14,explode=explode,ax=ax2)
ax2.set_title('Pie',fontweight='bold',fontsize=15)
ax2.set_ylabel('')
ax2.set_xlabel('')
fig.subplots_adjust(hspace=0.5)

1.9 Folium Map of 100 accommodations in Athens¶

In [32]:

latitude = []
longitude = []
tempdf = DataFrame(data=df[['latitude','longitude','id','price','bed_type','room_type','month_id']])
tempdf = tempdf.loc[tempdf['month_id'] == 'april']

latitude = tempdf['latitude'].tolist()
longitude = tempdf['longitude'].tolist()
price = tempdf['price'].tolist()
btype = tempdf['bed_type'].tolist()
rtype = tempdf['room_type'].tolist()

tooltip = 'Click me!'

m = fl.Map(location=[latitude[0],longitude[0]], zoom_start=12,tiles='Stamen Terrain')

for i, j in zip(range(1,100),range(1,100)):
    fl.Marker(location=[latitude[i], longitude[j]], icon=fl.Icon(color='red', icon='info-sign'),tooltip=tooltip,popup=('$'+str(price[i])+'\n'+str(rtype[i])+'\n'+str(btype[i]))).add_to(m)

img=mpimg.imread('map.png')
plt.figure(figsize=(20,10))
plt.imshow(img,interpolation='bilinear')
plt.axis("off")
plt.show()

1.10 Wordcloud of neighbourhoods¶

In [33]:

text = df['neighbourhood']
text = text.tolist()
tuples = collections.Counter(text)

words = WordCloud(background_color='white').generate_from_frequencies(frequencies=dict(tuples))

plt.figure(figsize=(20,10))
plt.imshow(words,interpolation='bilinear')
plt.axis("off")
plt.show()

1.10 Wordcloud of transit information¶

In [34]:

stopwords = set(STOPWORDS)
greek_stopwords = nltk.corpus.stopwords.words('greek')
greek_stopwords = set(greek_stopwords)
stopwords.update(greek_stopwords)

textWords = []
for x in df['transit']:
    for y in word_tokenize(x) : # trasnforming sentences to words
        y = y.lower() # making them lowercase
        y = WordNetLemmatizer().lemmatize(y) # lemmatizing 
        if (y not in stopwords) and (y not in punctuation):
            textWords.append(y)
            
counter = collections.Counter(textWords)
words = WordCloud(background_color='white',stopwords=stopwords).generate_from_frequencies(frequencies=dict(counter))

plt.figure(figsize=(20,10))
plt.imshow(words,interpolation='bilinear')
plt.axis("off")
plt.show()

1.10 Wordcloud of Descriptions¶

In [17]:

textWords = []
for x in df['description']:
    for y in word_tokenize(x) :
        y = y.lower()
        if (y not in stopwords) and (y not in punctuation):
            y = WordNetLemmatizer().lemmatize(y)
            textWords.append(y)
            
counter = collections.Counter(textWords)
words = WordCloud(background_color='white',stopwords=stopwords).generate_from_frequencies(frequencies=dict(counter))

plt.figure(figsize=(20,10))
plt.imshow(words,interpolation='bilinear')
plt.axis("off")
plt.show()

1.10 Wordcloud of last reviews¶

In [18]:

temp = df['last_review']
dates = temp.tolist()
temp = df['id']
ids = temp.tolist()

rev = DataFrame(data=reviews[['date','comments','listing_id']])
temp  = rev.loc[(rev['date'].isin(dates)) & (rev['listing_id'].isin(ids))]

textWords = []
for x in temp['comments']:
    for y in word_tokenize(x) :
        y = y.lower()
        if (y not in stopwords) and (y not in punctuation):
            y = WordNetLemmatizer().lemmatize(y)
            textWords.append(y)
            
counter = collections.Counter(textWords)
words = WordCloud(background_color='white',stopwords=stopwords).generate_from_frequencies(frequencies=dict(counter))

plt.figure(figsize=(20,10))
plt.imshow(words,interpolation='bilinear')
plt.axis("off")
plt.show()

12.a Most expensive property type¶

In [19]:

propertyTypeGB = df.groupby('property_type').agg({'price':'mean'}).reset_index()

my_norm = matplotlib.colors.Normalize(vmin=0, vmax=len(propertyTypeGB))
my_cmap = matplotlib.cm.get_cmap('twilight')
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Most expensive property type',fontweight='bold',fontsize=25)
propertyTypeGB.plot(kind='bar',x='property_type',y='price',figsize=(30,5),ax=ax1)
ax1.set_title('Histogram',fontweight='bold',fontsize=10)
ax1.set_ylabel('#Average price',fontweight='bold')
ax1.set_xlabel('Property type',fontweight='bold')
propertyTypeGB.plot(kind='pie',x='property_type',y='price',figsize=(30,10),startangle=90, shadow=True, labels=propertyTypeGB['property_type'].unique(),legend = False,fontsize=14,ax=ax2)
ax2.set_title('Pie',fontweight='bold',fontsize=10)
ax2.set_ylabel('')
ax2.set_xlabel('')
fig.subplots_adjust(hspace=0.5)

12.b Heat map of average price and average availabilty in Athens for the three months¶

In [22]:

sbn.set()
tempdf = df.drop_duplicates('id')
neighbourhoodsCost = tempdf.groupby('neighbourhood').agg({'price':'mean','availability_365':'mean','id':'count'}).reset_index()
neighbourhoodsCost = neighbourhoodsCost.rename(columns = {'price':'Average price','availability_365':'Average availability','id':'#hotels'})
plt.figure(figsize = (20,10))
plt.title('Heat map of average price and average availabilty in Athens for the three months',fontweight='bold',fontsize=15,pad=10)
plt.xlabel('Average price', fontweight='bold',fontsize=15)
plt.ylabel('Average availability in one year', fontweight='bold',fontsize=15)
sbn.kdeplot(data= neighbourhoodsCost['Average price'],data2=neighbourhoodsCost['Average availability'],cbar=True,cmap="inferno",shade=True,bw='silverman',gridsize=100)

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x2521dbffbc8>

Recommendation System¶

In [35]:

df = originaldf.copy(deep=True)
df.fillna('NULL',inplace=True) # replacing NAN with NULL string
table = DataFrame(data=df[['id','name','description']])
table['concat'] = table['name']+ " " +table['description']
table.drop_duplicates(subset='id',inplace=True,ignore_index=True)
stopwords = set(STOPWORDS)
stopwords = [x for x in stopwords]

3.1 TF-IDF vector from unigrams and bigrams of name and description concatenation¶

In [36]:

temp = [x for x in table['concat']]
vectorizer = TfidfVectorizer(max_df=1.0,min_df=1,stop_words=stopwords)
vectors = vectorizer.fit_transform(temp)

C:\Users\nikol\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:385: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

3.2 100 most similar accommodations in Athens¶

In [37]:

similarity_matrix = cosine_similarity(vectors)
similarity_matrix = np.triu(similarity_matrix, k=1)  # transforming the table to its upper tronic table

similarity_dictionary = {}
xindex = 0
listid = table['id'].tolist()
for x in similarity_matrix:    # loop that finds the max of every row from the similarity table
    max = x[0]
    maxindex = 0
    yindex = xindex+1
    for yindex in range(xindex+1,len(x)): 
        if x[yindex]>max:
            max=x[yindex]
            maxindex=yindex
        yindex+=1
    idtuple = (listid[xindex],listid[maxindex])  # the max will be the most similar word , so we insert them as a tuple to a dictionary
    similarity_dictionary[idtuple] = max 
    xindex+=1
    
# sorting the dictionary
sortedDict = {k : similarity_dictionary[k] for k in sorted(similarity_dictionary,key = similarity_dictionary.get,reverse=True)}
out = dict(itertools.islice(sortedDict.items(), 100))
i=1
print('--------------> 100 most similar accomodations of Athens <------------------\n')
for k,v in out.items():
    print(str(i)+". "+str(k[0])+" with "+str(k[1]) )
    i+=1

--------------> 100 most similar accomodations of Athens <------------------

1. 22074774 with 22074946
2. 22074946 with 22075076
3. 25207595 with 25207888
4. 25941499 with 25941675
5. 25941675 with 26092250
6. 28357206 with 28357761
7. 29122586 with 31879553
8. 29718779 with 33355636
9. 30439676 with 30514173
10. 30514173 with 30514220
11. 32680393 with 32680769
12. 32680769 with 32681311
13. 32681311 with 32704544
14. 32704544 with 32704974
15. 32704974 with 32705240
16. 32825922 with 32851335
17. 32851335 with 32851721
18. 32851721 with 32852415
19. 32881153 with 32881267
20. 32881267 with 32881617
21. 8594460 with 33764094
22. 10603975 with 27162716
23. 18735366 with 20210253
24. 22173279 with 24214247
25. 23552881 with 23554220
26. 23554220 with 23554464
27. 23554464 with 23554584
28. 26125014 with 27282238
29. 26383438 with 26620126
30. 26497308 with 30424563
31. 27282238 with 27686070
32. 27958339 with 27958876
33. 30089884 with 30341020
34. 30097307 with 30917255
35. 30098895 with 30917430
36. 30268642 with 30286826
37. 30314691 with 30314693
38. 30314693 with 30314695
39. 30314695 with 30314700
40. 30314700 with 30314701
41. 30314701 with 30314703
42. 30341020 with 30362269
43. 30362269 with 30363015
44. 30363015 with 30760835
45. 30760835 with 30906405
46. 30906405 with 30980429
47. 30980429 with 30980510
48. 31509474 with 31509785
49. 32434173 with 33654961
50. 32535436 with 32882143
51. 32536448 with 32730342
52. 32540793 with 32540918
53. 32677773 with 32678527
54. 32678527 with 32679653
55. 32708365 with 32708730
56. 32708730 with 32708948
57. 32730342 with 32731713
58. 32731713 with 32732369
59. 32732369 with 32732785
60. 32733160 with 32733614
61. 32878646 with 32878875
62. 32878875 with 32879287
63. 32879287 with 32879563
64. 32882143 with 32882427
65. 32882427 with 32882802
66. 32882802 with 32882960
67. 33067136 with 33092281
68. 33091714 with 33093112
69. 33093112 with 33093322
70. 33223592 with 33224318
71. 33290253 with 33291079
72. 33291079 with 33291140
73. 33291140 with 33291208
74. 1223199 with 29057234
75. 5426236 with 29786418
76. 9648099 with 9648354
77. 15793210 with 27774209
78. 19771906 with 28658067
79. 20294387 with 28622064
80. 20686418 with 20692552
81. 20692552 with 20704187
82. 20704187 with 20705564
83. 20705564 with 20706206
84. 20706206 with 20707116
85. 20707394 with 20708277
86. 21195111 with 21402471
87. 22042880 with 32104748
88. 22164074 with 24201549
89. 22171981 with 24214048
90. 23075516 with 29390694
91. 25711811 with 30723595
92. 26214131 with 26237466
93. 26237466 with 26237805
94. 26402249 with 26402383
95. 27254673 with 30009754
96. 27736231 with 30593649
97. 28978327 with 28978330
98. 29925898 with 29926293
99. 29926293 with 29945264
100. 29945264 with 30363275

3.3 Recommendation function¶

In [38]:

def recommend(item_id,topN,table):

    similarity_dictionary = {}
    stopwords = set(STOPWORDS)
    greek_stopwords = nltk.corpus.stopwords.words('greek')  # preprocessing stop words
    greek_stopwords = set(greek_stopwords)
    stopwords.update(greek_stopwords)
    stopwords = [x for x in stopwords]
    rowDocuments = [x for x in table['concat']]
    vectorizer = TfidfVectorizer(max_df=1.0,min_df=1,stop_words=stopwords)
    vectors = vectorizer.fit_transform(rowDocuments)
    similarity_matrix = cosine_similarity(vectors)
    similarity_matrix = np.triu(similarity_matrix, k=1)
    listid = table['id'].tolist()
    wantedIndex =  listid.index(item_id)    

    for index in range(wantedIndex+1,len(similarity_matrix[wantedIndex])):  # finding the most similar to the wanted id
        idtuple = (listid[wantedIndex],listid[index])
        similarity_dictionary[idtuple] = similarity_matrix[wantedIndex][index]
        index+=1

    sortedDict = {k : similarity_dictionary[k] for k in sorted(similarity_dictionary,key = similarity_dictionary.get,reverse=True)}
    if topN>len(similarity_matrix[wantedIndex]): # checking if topN is greater that the size of list
         topN=len(similarity_matrix[wantedIndex])
         print("topN is greater than all other hotels so "+str(topN)+"will be presented")
    out = dict(itertools.islice(sortedDict.items(),topN))
    i=1
    print('Recommending '+ str(topN) +' listings similar to id ' + str(item_id) + ' with name: ' + str(table['name'].values[wantedIndex]))
    print('----------------------------------------------------------------------------------------------------')
    for k,v in out.items():
        print(""+str(i)+'. id: '+str(k[1]))
        similarityindex = listid.index(k[1])
        print('-Recommended: '+str(table['name'].values[similarityindex]))
        print('-Description: '+'\n\t'+str(table['description'].values[similarityindex]))
        print('-> Score : %.5f' % v)
        i+=1
        print("\n")

recommend(33558429,15,table)

C:\Users\nikol\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:385: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn', 'δι', 'ἀλλ'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

Recommending 15 listings similar to id 33558429 with name: JR SUITE CENTRAL ATHENS GAZI VIEW "PERSEFONI"
----------------------------------------------------------------------------------------------------
1. id: 33636471
-Recommended: Luxury modern apartment , near the center !!
-Description:
Το διαμέρισμα είναι πλήρως ανακαινισμένο προσφέροντας μια όμορφη διαμονή . Η δυνατότητα φιλοξενίας είναι από 1 έως 3 άτομα . Βρίσκεται λίγα λεπτά μακριά μακριά από το σταθμό του τρένου κάτω Πατήσια Το διαμέρισμα είναι πλήρως ανακαινισμένο προσφέροντας μια όμορφη διαμονή. Η δυνατότητα φιλοξενίας είναι για ένα έως τρία άτομα . Παρέχεται ένα διπλό κρεβάτι , ένας καναπές-κρεβάτι , μια κουζίνα με όλο τον οικιακό εξοπλισμό και ένα άνετο απολαυστικό μπάνιο Ολοκληρωτικά σε όλο το διαμέρισμα Δεν θα βρίσκομαι στην ίδια ιδιοκτησία αλλα θα είμαι διαθέσιμη για οτιδήποτε κατά τη διάρκεια της διαμονής σας Ο χώρος μου βρίσκεται σε μια ήσυχη γειτονιά κοντά στο κέντρο της Αθήνας . Γύρω από αυτή βρίσκεις οτιδήποτε χρειαστείς όπως supermarkets , τράπεζες , φαρμακεία , νοσοκομείο , νυχτερινά κέντρα και πολλών ειδών εστιατόρια . Βρίσκεται στη καλύτερη θέση για να φτάσετε σε οποιαδήποτε μέρος της πόλης σε πολύ λίγο χρόνο με τα μέσα μεταφοράς που διαθέτει . Χρησιμοποιόντας όλα τα μέσα μεταφοράς περπατώντας
-> Score : 0.20925

2. id: 33759095
-Recommended: CALLIOPE
-Description:
Ένας όμορφος χώρος 40 τμ φτιαγμενος και ανακαινισμένος από την αρχή με πολύ αγάπη και μεράκι.Την μούσα Καλλιόπη επικαλούνται οι ραψωδοι για να τους εμπνεύσει. Με την επίκληση της ξεκινούν τα ομηρικά επη.Σε ακτίνα 1.3χλμ βρίσκεται ο Ιερός βράχος της Ακρόπολης το μουσείο της Ακρόπολης το μουσείο σύγχρονης Τέχνης το Ίδρυμα Σταύρος Νιάρχος (Λυρικη )και ο σταθμός Μετρό Φιξ ενώ κοντά βρίσκονται καφέ και εστιατόρια. Το διαμέρισμα διαθέτει ιδιωτικη εισοδο.Περιλαμβανει δωρεάν wifi επίπεδη τηλεοράση κλιματισμο και σύστημα συναγερμού πόρτα ασφαλείας κουτί φυλαξης ειδη πρώτης ανάγκης όπως πετσέτες σεντόνια .Κουζίνα όπου μπορείτε να ετοιμάσετε πλήρης γευματα Επειδή γνωρίζουμε την έννοια ΔΙΑΚΟΠΕΣ θα είμαστε στην διάθεσή για ότι χρειαστείτε για μια ευχάριστη διαμονή. Έχει πολλά καφέ και εστιατόρια αλλά αυτό που ξεχωρίζει είναι το little John ακριβώς απέναντι από το διαμέρισμα όπως και το σούπερ μάρκετ ΑΒκαι αυτό απέναντι από το διαμέρισμα . Δίπλα ακριβώς από το διαμέρισμα υπάρχει υπάρχει στάση λεωφο
-> Score : 0.17952

3. id: 33760753
-Recommended: Discover Koukaki's Neighborhood
-Description:
Ένα ευρύχωρο διαμέρισμα στο Κουκάκι, την πιο ήσυχη και ασφαλή περιοχή της Αθήνας, λίγα μόλις λεπτά απο την Ακρόπολη, το Ιστορικό Κέντρο, Πλάκα, Θησείο και άμεση πρόσβαση στο σταθμό του μετρό Συγγρού Φιξ. Αυτό το άνετο και κομψό διαμέρισμα 50τμ είναι ιδανική επιλογή τόσο για φίλους όσο για οικογένειες, με χώρο για να φιλοξενήσει 3 ενήλικες (δυνατότητα και για τέταρτο ενήλικα κατόπιν συνεννόησης). Το διαμέρισμα διαθέτει πλήρως ανακαινισμένη κουζίνα με όλες τις οικοσυσκευές, τραπεζαρία για τέσσερα άτομα, άνετο και μοντέρνο κλιματιζόμενο σαλόνι με καναπέ κρεβάτι και υπνοδωμάτιο επίσης κλιματιζόμενο με διπλό κρεβάτι και τηλεόραση 32' . Εξωτερική πόρτα ασφαλείας και πόρτες αλουμινίου στα δωμάτια που προσφέρουν ασφάλεια και φως στον χώρο. Ελεύθερη πρόσβαση Wifi αλλά και χρήση της βιβλιοθήκης . Ο οικοδεσπότης θα είναι διαθέσιμος να προσφέρει οποιαδήποτε πληροφορία ή βοήθεια, για να είναι πιο ευχάριστη η διαμονή σας. Στον πεζόδρομο της Δράκου θα βρείτε καταστήματα καφέ και φαγητού. Μεγάλο
-> Score : 0.17067

4. id: 33730265
-Recommended: Loft apartment with unique Acropolis view
-Description:
Διαμέρισμα στον 6ο όροφο με υπέροχη θέα της Ακρόπολης .Υπάρχουν παράθυρα γύρω γύρω και ιδιωτική βεράντα με εκπηκτική θέα του λόφου της Ακρόπολης.Είναι πάρα πολύ κοντά στο μετρό του Συντάγματος .Η γειτονιά έχει εξαιρετικά εστιατόρια για όλα τα γούστα , πολλά καφέ και πολλά μπαρ. Η οδός Ερμού με τα μαγαζιά είναι σε απόσταση 500μ. Το διαμέρισμα είναι πολύ φωτεινο πολυ ιδιαιτερο με τζαμια γυρω γυρω με μοναδικη θεα στον λοφο Ακροπολης .Είναι πολύ καλόγουστο και προσφέρει όλες τις παροχές για μια υπέροχη διαμονή. Βρίσκεται στο Σύνταγμα στο πιο κεντρικό σημείο της Αθήνας.
-> Score : 0.11064

5. id: 33581524
-Recommended: Comfortable, spacious studio flat @ Kolonaki
-Description:
Comfortable spacious studio flat 75 sq.m situated in the heart of "Kolonaki" the most prestigious and happening part of Athens. Ευρύχωρο στούντιο διαμέρισμα 75 τμ στην καρδιά του Κολωνακίου και κοντά στον Λυκαβυττό. Excellent customer service ,the hosts are available any time to answer your questions and address tour needs. Αψογη εξυπηρέτηση, ο ιδιοκτήτης είναι στην διάθεσή σας ανά πάσα στιγμή να σας εξυπηρετήσει. The apartment is only 100 meters away from the cable car that leads to Lycabetus hill, and 200 meters from Kolonaki square and 15 min. walk to Syntagma square. Great coffee spots (Queen Bee, Ante Post, Da Capo) Restaurants (Markonis, Barbounaki, Oikio) Great shopping, museums, theaters, bars and more, all within 10-15 min. walking distance . Το διαμέρσμα βρίσκεται 100 μέτρα από το τελεφερίκ για τον Λυκαβητό , 200 μέτρα από την πλατεία Κολωνάκιου και μόλις 15 λεπτά περπάτημα για την πλατεία Συντάγματος. Δημοφιλή καφέ (Queen Bee, Ante Post, Da Capo), Εστιατόρια (Malkonis, Barbo
-> Score : 0.09461

6. id: 33745949
-Recommended: Αττική
-Description:
Η διακόσμηση είναι σύνθετο στο σαλόνι, καναπέ γωνιακό. Στην κουζίνα έχουμε ψυγείο, φούρνο, τραπεζαρία. Το υπνοδωμάτιο παρέχει ντουλάπες διπλό κρεβάτι 2 κομοδίνα. Το σπίτι είναι με αυτόνομη θέρμανση φυσικού αερίου σε όλα. Υπάρχει και έξτρα Ράτζο ενός ατόμου. Το σπίτι βρίσκεται κοντά στα 200 μετρα απο το σταθμό ηλεκτρικού-μετρό, τρόλεϊ, λεωφορεία. Είναι ήσυχη Περιοχή. Σχετικά είναι ήσυχη στην γειτονοα. Καφε έχει στον Αγ. Παντελεήμονα όπου βρίσκεται πάλη στα 150 μέτρα από την τοποθεσία σάς Το σπίτι βρίσκεται κοντά στα 200 μετρα απο το σταθμό ηλεκτρικού-μετρό, τρόλεϊ, λεωφορεία.
-> Score : 0.07416

7. id: 33610063
-Recommended: Plato's Academy - Luxury Studio 2/4 persons
-Description:
Αρχαιολογικό Πάρκο Ακαδημίας Πλάτωνος. Πολυτελές διαμέρισμα μπροστά στο πάρκο με απεριόριστη θέα, ενδοδαπέδια θέρμανση, κλιματισμό, συναγερμό, πάρκινγκ, δίπλα σε στάση λεωφορείου, στο κέντρο της Αθήνας. Ήσυχο, studio, loft design, για φιλοξενία έως και 4 επισκέπτες. Μοντέρνο, δρύινα πατώματα, αυτονομία θέρμανσης, φυσικό άεριο, internet, πρωινό, αλουμίνια ασφαλείας, ηλεκτρικά στόρια, μεγάλη βεράντα προσόψεως με απεριόριστη θέα
-> Score : 0.07071

8. id: 33818761
-Recommended: Omonia Square Apartment
-Description:
A cozy, semi renovated, fully airconditioned 55 sq.m. sixth floor apartment in the Center of Athens behind the National Theatre, consisting of a bedroom with a double bed, a living room with a couch, a kitchen and a bathroom. Η Ομόνοια αποτελεί σημαντικό συγκοινωνιακό κόμβο, συνδεόμενη με τις δημοτικές ενότητες της πόλης προς πάσα κατεύθυνση μέσω των οδών Σταδίου, Αθηνάς, Παναγή Τσαλδάρη (Πειραιώς), Πανεπιστημίου, Αγίου Κωνσταντίνου και 3ης Σεπτεμβρίου. Αρχικά η ομώνυμη πλατεία είχε σχήμα έλλειψης και αποτελούσε κόμβο διασύνδεσης όλων των δρόμων, όμως, τροποποιήθηκε με την ενοποίηση τμημάτων αυτής και την απομάκρυνση του τραμ που διερχόταν από την οδό Αθηνάς.[1] Η πλατεία αποτελεί σημείο συνάντησης Αθηναίων και επισκεπτών και φημίζεται για τα εκλεπτυσμένα καφέ και την αρχιτεκτονική των παραδοσιακών της κτιρίων, όπως αυτή του Μπάγκειου μεγάρου και του "Μέγας Αλέξανδρος", που οικοδομήθηκαν τη δεκαετία του 1880 και λειτούργησαν αρχικά ως ξενοδοχεία[2]. Κάτω από την πλατεία διακλαδώνονται
-> Score : 0.04523

9. id: 33662015
-Recommended: Άνετο δυάρι στο Γαλάτσι
-Description:
Αυτό είναι το σπίτι μου! Όταν λείπω μου αρέσει να το μοιραζομαι με ανθρώπους που τους αρέσει να φιλοξενούν και να φιλοξενούνται! Είμαι σίγουρη ότι αμέσως θα νιώσετε ζεστά και οικεία! Κατάλληλο για ζευγάρι, οικογένεια ή παρέα φίλων. Βρίσκεται σε κεντρικό δρόμο του Γαλατσίου, 100 μ. από την αφετηρία των τρόλεϊ, 500μ.απο το Ολυμπιακό Στάδιο Γαλατσίου και 700μ. από το Άλσος Βεΐκου.
-> Score : 0.03507

10. id: 33660586
-Recommended: BRAND NEW CENTRAL ATHENS RESORT
-Description:
Brand new apartment located In the heart of the city, furnitured, fully equipped and fully airconditioned. There are three comfortable bedrooms where 6 persons can be easily accommodated and 2 bathrooms. In the living room there is a long sofa consisted of 2 single beds. The metro stations of "Omonia square" & "Metaxourgio" are only 5min walking distance. National Archeological Museum and National Theater are only 3 blocks away. Stores, restaurants and mini markets are all around the building. Πολύ κοντά σε δυο σταθμούς Μέτρο, εύκολο parking (ειδικά απο το μεσημέρι και μετά), Μέσα μεταφορας 50μ απο το σπίτι .
-> Score : 0.03443

11. id: 33583191
-Recommended: Spacious Suite w. huge balcony + Acropolis view
-Description:
Conveniently located in the heart of historical Athens, this 2019 utterly reconstructed 2bdr is in the premises of an iconic shoe-factory from the 70’s, which was flipped into a serviced condo-building. It provides its guests with direct access to all MUST visit attractions. Bars and restaurants, subway, bus and train stops, as well as all the major sightseeings are within 3 minutes walking distance. Constant WiFi and a fully equipped private kitchen are a few of the free offered amenities. This 2019 fully renovated 46 sqm. (495 sf.) 2-bedroom suite is situated in Psyri neighborhood on the fourth floor of the 5-storey building. This home consists of two parts, the living space on the fourth floor and the shared rooftop on the fifth floor which offers an outstanding view of Acropolis and the surrounding neighborhood of Monastiraki and Psyri. The rooftop is equipped with tables and chairs in order for you to enjoy your breakfast, your evening drink or just chill under the ancient Acropol
-> Score : 0.02644

12. id: 33657359
-Recommended: Central pristine & sunny Suite with cute balcony
-Description:
The perfect interior and spot for a couple who wants to explore the multifarious city center of Athens and the picturesque Plaka & Parthenon. Fully equipped and tranquil, spacious with clean bedroom & bathroom & all the amenities a contemporary flat should have for a hassle free stay. Ideal for guests who like to be at the very center without having to live in the noisy, touristic area, which starts 50 metres away, but also recommended for professionals as it offers quick commuting anywhere. We tried to combine the luxury the contemporary traveller expects without lacking the coziness and homey environment a guest seeks, and here we are! In the center of Athens (700m from Syntagm square), brand new and clean airy Suite with adjustable lights, spacious living room and pristine bathroom. The kitchen is equipped with all the kitchenware to provide you the nessecary independence, if you would like a dinner at its cozy and beautiful balcony or you rather stay in your cozy interior. The mar
-> Score : 0.02196

13. id: 33655278
-Recommended: The Acropolis Penthouse
-Description:
This 70 sq meter penthouse apartment is full of light, has great views of the antiquities is located 5 min from the acropolis 5 min from the new acropolis museum and the subway station. The center of Athens (shops and business district) is a short walk through the historic Plaka area. Underneath the Acropolis,a few hundred feet away from the Parthenon,is our 70 sq.m. penthouse apartment which has been recently (Website hidden by Airbnb) is full of light and one can sit in the front porch looking at the Hill of the Nymphs (Philopapou)and the ancient Herodion Theater.The old part of the city including the Plaka district,the Monastiraki flea market and the old area of Thision are within walking distance.The subway station is 5 minutes away.The new Acropolis museum is also a five minute walk.The center of Athens with the shops and business center are a twenty minute walk through the historical district of Plaka.A total of four people can be accomodated. With respect to habits and living r
-> Score : 0.01355

14. id: 33742566
-Recommended: Irresistible cute&lux Suite in the ❤ of Athens!
-Description:
Dream pristine flat with nice interior! A rare gem in Athens' center located a breath away from Parthenon and major attractions. Fully equipped and tranquil, airy with clean bedroom & bathroom & all the amenities a contemporary flat should have for a hassle free stay. Ideal for guests who like to be at the very center without having to live in the noisy, touristic area, which starts 50 metres away, but also recommended for professionals as it offers quick commuting by its proximity to metro. We tried to combine the luxury the contemporary traveller expects without lacking the coziness and homey environment a guest seeks, and here we are. In the center of Athens (700m from Syntagm square), brand new and clean airy apartment with adjustable lights, spacious living room and pristine bathroom. The kitchen is equipped with all the kitchenware to provide you the nessecary independence, if you would like a dinner at its cozy and beautiful balcony or you rather stay in the spacious living roo
-> Score : 0.01279

15. id: 33796363
-Recommended: Very central meters to Acropolis
-Description:
Beautiful small space in Acropolis
-> Score : 0.00995

3.4 10 most common-used word bigrams¶

In [29]:

textWords = []
for x in table['concat']:
    for y in word_tokenize(x):
        if y not in punctuation:
            textWords.append(y)

stopwords = set(STOPWORDS)
greek_stopwords = nltk.corpus.stopwords.words('greek')
greek_stopwords = set(greek_stopwords)
stopwords.update(greek_stopwords)
stopwords = [x for x in stopwords]

finder = BigramCollocationFinder.from_words(textWords)
finder.apply_word_filter(lambda w: w.lower() in stopwords)
bigram_measures = nltk.collocations.BigramAssocMeasures()
mostCommon = finder.nbest(bigram_measures.likelihood_ratio,10)

print("-------- 10 most common-used word bigrams ------------")
i=1
for word in mostCommon:
    print(str(i)+". "+str(word[0])+" -> "+str(word[1]))
    i+=1

-------- 10 most common-used word bigrams ------------
1. living -> room
2. fully -> equipped
3. walking -> distance
4. metro -> station
5. double -> bed
6. equipped -> kitchen
7. washing -> machine
8. brand -> new
9. minutes -> walk
10. sofa -> bed