TripAdvisor, one of the world's largest travel sites and a leading publisher of traveler reviews, has become hard to ignore. In this project, we scrape hotel data for Lyon from 20 pages of the TripAdvisor site. For each hotel we collect the hotel ID (a number unique to each hotel), name, rating, amenities, booking provider, and reviews. We then examine features such as price, amenities, and ratings that can help you book a hotel the next time you plan a trip to Lyon. We also analyze the reviews to answer one question: when you pay a higher price for a highly rated hotel, is it really worth it?
**For this analysis, we focus on the following:**
1.- We will scrape the data from TripAdvisor
2.- We will look at the amenities and the average hotel price by rating
3.- We will explore the most popular hotel amenities
4.- We will find the top-rated hotels
5.- We will find the leading hotel booking providers
6.- We will analyze the hotel reviews to find the 20 most frequent words
7.- We will analyze the sentiment of the reviews
8.- We will assess whether highly rated hotels are worth the higher price
All right, let's get started!
#Import libraries
from requests import get
import re
import seaborn as sns
import matplotlib.pyplot as plt
import string
from plotly.offline import init_notebook_mode, iplot
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()
from bs4 import BeautifulSoup
from random import randint
from time import time
from time import sleep
#warn is used in the scraping loop to report unexpected responses
from warnings import warn
from IPython.display import clear_output
import pandas as pd
import numpy as np
from wordcloud import WordCloud,STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#Requests
page = get("https://www.tripadvisor.com/Hotels-g187265-Lyon_Rhone_Auvergne_Rhone_Alpes-Hotels.html")
#Create a BeautifulSoup instance to parse the document
parser = BeautifulSoup(page.content, "lxml")
#Get data
hotels = parser.find_all("div", {'class' : 'prw_rup prw_meta_hsx_responsive_listing ui_section listItem'})
print (parser.head.find('title').text)
print ("Found", len(hotels), "hotels")
THE 10 BEST Hotels in Lyon for 2020 (from $50) - Tripadvisor
Found 30 hotels
#Each results page on this site lists 30 hotels; the offset in the URL is the only thing that changes from page to page.
pages = [str(i) for i in range(0,600,30)]
print (pages)
['0', '30', '60', '90', '120', '150', '180', '210', '240', '270', '300', '330', '360', '390', '420', '450', '480', '510', '540', '570']
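The offsets above plug into the listing URL through the `-oa<offset>-` segment used in the scraping loop. A minimal sketch of that pagination scheme follows; the helper name `build_page_urls` and its parameters are hypothetical and only illustrate how the 20 page URLs are formed.

```python
def build_page_urls(n_pages=20, page_size=30):
    """Return one listing URL per results page, using the -oa<offset>- pattern."""
    prefix = "https://www.tripadvisor.com/Hotels-g187265-oa"
    suffix = "-Lyon_Rhone_Auvergne_Rhone_Alpes-Hotels.html"
    # Offsets advance by one page of results at a time: 0, 30, 60, ...
    offsets = range(0, n_pages * page_size, page_size)
    return [f"{prefix}{offset}{suffix}" for offset in offsets]


urls = build_page_urls()
print(len(urls))
print(urls[1])
```

Building the full list up front makes it easy to check the number of requests before any of them is sent.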
#Scraper
name = []
new_price = []
orig_price = []
bubbles = []
review_count = []
review_url = []
amenities = []
provider = []
reviews = []
start_time = time()
requests = 0
for offset in pages:
    #Keep the response in its own variable instead of overwriting the loop variable
    response = get('https://www.tripadvisor.com/Hotels-g187265-oa' + offset + '-Lyon_Rhone_Auvergne_Rhone_Alpes-Hotels.html')
    #Pause between requests to avoid overloading the server
    sleep(randint(8,15))
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    #Warn on any non-200 response (needs `from warnings import warn`)
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    #Stop if the loop issues more requests than expected
    if requests > 72:
        warn('Number of requests was greater than expected.')
        break
    parser = BeautifulSoup(response.content, "lxml")
    hotels = parser.find_all("div", {'class' : 'prw_rup prw_meta_hsx_responsive_listing ui_section listItem'})
    for hotel in hotels:
        nm = hotel.find("div", {"class" : "listing_title"}).get_text()
        name.append(nm)
        #Named price_tag rather than np so that the numpy alias is not overwritten
        price_tag = hotel.find('div', {"class" : "price autoResize"})
        if price_tag:
            new_price.append(price_tag.text)
        else:
            new_price.append("")
        op = hotel.find('div', {"class" : "xthrough"})#.text
        orig_price.append(op)
        bub = hotel.find(class_="ui_bubble_rating")
        if bub:
            bubbles.append(bub.get("alt"))
        else:
            bubbles.append("")
        rc = hotel.find(class_="review_count")
        if rc:
            review_count.append(rc.get_text(strip=True))
        else:
            review_count.append("")
        ru = hotel.find(class_="ui_bubble_rating")
        if ru:
            review_url.append(ru.get("href"))
        else:
            review_url.append("")
        am = hotel.find("div", {"class" : "prw_rup prw_common_hotel_icons_list linespace is-shown-at-tablet"}).text
        amenities.append(am)
        prov = hotel.find(class_="provider_text")
        if prov:
            provider.append(prov.text)
        else:
            provider.append("")
        rev = hotel.find("a", {"class" : "review-link"})
        if rev:
            reviews.append(rev["title"])
        else:
            reviews.append("")
Request:20; Frequency: 0.06608568872918887 requests/s
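The monitoring logic inside the loop (warn on a non-200 status, stop once the request count passes a limit) can be isolated and checked without touching the network. The sketch below assumes a hypothetical helper, `check_request`, that mirrors those two checks; the limit of 72 is the value used in the loop.

```python
import warnings


def check_request(status_code, request_count, max_requests=72):
    """Return True if scraping should continue, warning on anything unexpected."""
    # A non-200 status is reported but does not stop the loop.
    if status_code != 200:
        warnings.warn(f"Request: {request_count}; Status code: {status_code}")
    # Exceeding the expected number of requests stops the loop.
    if request_count > max_requests:
        warnings.warn("Number of requests was greater than expected.")
        return False
    return True


print(check_request(200, 1))
print(check_request(200, 73))
```

Keeping the checks in one function makes the stop condition explicit: only the request limit ends the loop, while a bad status code is merely logged.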
#Build a DataFrame by passing each list as an entry of a dictionary.
hotels = pd.DataFrame({
    "name" : name,
    "discounted_price" : new_price,
    "original_price" : orig_price,
    "bubbles" : bubbles,
    "review_count" : review_count,
    "review_url" : review_url,
    "amenities" : amenities,
    "provider" : provider,
    "reviews" : reviews
})
#Check the data
hotels.head(2)
| | name | discounted_price | original_price | bubbles | review_count | review_url | amenities | provider | reviews |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Staycity Aparthotels Rue Garibaldi | $91 | None | 4.5 of 5 bubbles | 924 reviews | /Hotel_Review-g187265-d12136060-Reviews-Stayci... | Free Wifi Fitness center | Expedia.com | A little further out than I’d have liked but t... |
| 1 | Hotel Lyon Metropole | $108 | None | 4 of 5 bubbles | 1,837 reviews | /Hotel_Review-g187265-d232351-Reviews-Hotel_Ly... | Free Wifi Free parking | Expedia.com | The room was large, modern and very clean it... |
#Check the number of records
hotels.shape
(592, 9)
We extracted 592 hotel listings for the city of Lyon.
The data scraped from the website contains many anomalies that must be fixed to obtain a clean dataset before we can start the analysis. We will clean the data with the steps below:
1.- Remove "Sponsored" from the hotel name.
2.- Remove duplicate rows.
3.- Remove "property_" from the hotel ID.
4.- Remove "$" from the discounted hotel price.
5.- Remove "#REVIEWS" and "/" from the hotel review URLs.
6.- Clean the amenities column so that amenities are separated by commas.
7.- Clean the bubbles, review_count, and discounted_price columns.
8.- For some hotels the original price is missing but the discounted price is available, so create a new price column.
9.- Change the data types.
10.- Handle null values.
def clean_name(name):
    if "Sponsored" in name:
        return (name.replace("Sponsored", ""))
    else:
        return name
hotels["name"] = hotels["name"].apply(clean_name)
hotels = hotels.drop_duplicates()
#Check the number of hotels after removing duplicate rows
hotels.shape
(326, 9)
The number of rows dropped from 592 to 326.
#regex=False makes every pattern a literal string, so "$" is not read as an end-of-string anchor
hotels["name"] = hotels["name"].str.replace("property_", "", regex=False)
hotels["discounted_price"] = hotels["discounted_price"].str.replace("$", "", regex=False)
hotels["review_url"] = hotels["review_url"].str.replace("/", "", regex=False)
hotels["review_url"] = hotels["review_url"].str.replace("#REVIEWS", "", regex=False)
def split_func(amenities):
    if amenities:
        #Insert a space before every capital letter, then turn spaces into commas
        output = re.sub('(?=[A-Z])', ' ', amenities).strip()
        output1 = re.sub(r' ([^ ]*(?: |$))', r',\1', output)
        #Drop the first comma so that "Free Wifi" stays together
        if "Free" in output1:
            output1 = output1.replace(',', '', 1)
        if not output1:
            return np.nan
        else:
            return (output1)
    else:
        return "None"
hotels["amenities"] = hotels["amenities"].apply(split_func)
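The regex-based split also puts commas inside multi-word labels, as in "Fitness,center". One alternative is to match the raw text against a fixed list of amenity labels. The sketch below assumes the raw text is simply the labels run together and that the eight labels listed are the only ones that occur. Both are assumptions, and `KNOWN_AMENITIES` and `split_amenities` are hypothetical names.

```python
import re

# Labels taken from the amenity values seen in this dataset (assumed to be complete).
KNOWN_AMENITIES = [
    "Free Wifi", "Free parking", "Restaurant", "Bar/Lounge",
    "Pool", "Room service", "Fitness center", "Spa",
]


def split_amenities(raw):
    """Return a comma-separated list of the known labels found in raw."""
    if not raw or not raw.strip():
        return "None"
    # Compare without whitespace and case so "Free WifiSpa" and "Free Wifi Spa" match alike.
    compact = re.sub(r"\s+", "", raw).lower()
    found = []
    for label in KNOWN_AMENITIES:
        position = compact.find(re.sub(r"\s+", "", label).lower())
        if position != -1:
            found.append((position, label))
    if not found:
        # Unknown labels are returned unchanged rather than guessed at.
        return raw.strip()
    # Keep the labels in the order in which they appear in the raw text.
    return ", ".join(label for _, label in sorted(found))


print(split_amenities("Free WifiFitness center"))
```

A fixed vocabulary trades generality for predictability: a new label on the site would pass through unsplit instead of being cut in the wrong place.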
def clean_bubbles(rating):
    #Keep only the leading number of a string such as "4.5 of 5 bubbles"
    return (rating.split(" ")[0])
hotels["bubbles"] = hotels["bubbles"].apply(clean_bubbles)
hotels["review_count"] = hotels["review_count"].str.replace("reviews", "", regex=False)
hotels["review_count"] = hotels["review_count"].str.replace("review", "", regex=False)
hotels["review_count"] = hotels["review_count"].str.replace(",", "", regex=False)
hotels["discounted_price"] = hotels["discounted_price"].str.replace(",", "", regex=False)
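The string replacements above leave the rating and the review count as text. As a sketch of the same cleaning done with parsing instead, the two hypothetical helpers below (`parse_rating` and `parse_review_count`) turn the raw strings seen in the scraped table directly into numbers and return `None` for empty values.

```python
import re


def parse_rating(text):
    """Extract the leading number from a string such as '4.5 of 5 bubbles'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+of\s+5\s+bubbles", text or "")
    return float(match.group(1)) if match else None


def parse_review_count(text):
    """Extract the integer from a string such as '1,837 reviews'."""
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


print(parse_rating("4.5 of 5 bubbles"))
print(parse_review_count("1,837 reviews"))
```

Either helper can be passed to `Series.apply`, which avoids a later type conversion failing on an empty string.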
hotels.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 326 entries, 0 to 546
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              326 non-null    object
 1   discounted_price  326 non-null    object
 2   original_price    3 non-null      object
 3   bubbles           326 non-null    object
 4   review_count      326 non-null    object
 5   review_url        326 non-null    object
 6   amenities         326 non-null    object
 7   provider          326 non-null    object
 8   reviews           326 non-null    object
dtypes: object(9)
memory usage: 25.5+ KB
hotels["original_price"].value_counts(dropna = False).head(5)
NaN         323
[[$159]]      1
[[$63]]       1
[[$78]]       1
Name: original_price, dtype: int64
def clean_price(price):
    price = str(price)
    #Missing prices were filled with 0 and become NaN
    if price == "0":
        return np.nan
    else:
        #Keep the digits between "$" and the closing tag
        return (price.split("$")[1].split("<")[0])
hotels["original_price"] = hotels["original_price"].fillna(0)
hotels["original_price"] = hotels["original_price"].apply(clean_price)
hotels["review_count"] = hotels["review_count"].astype(int)
hotels.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 326 entries, 0 to 546
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              326 non-null    object
 1   discounted_price  326 non-null    object
 2   original_price    3 non-null      object
 3   bubbles           326 non-null    object
 4   review_count      326 non-null    int32
 5   review_url        326 non-null    object
 6   amenities         326 non-null    object
 7   provider          326 non-null    object
 8   reviews           326 non-null    object
dtypes: int32(1), object(8)
memory usage: 24.2+ KB
hotels["review_count"] = hotels["review_count"].fillna(0)
hotels["bubbles"] = hotels["bubbles"].fillna(0)
hotels.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 326 entries, 0 to 546
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              326 non-null    object
 1   discounted_price  326 non-null    object
 2   original_price    3 non-null      object
 3   bubbles           326 non-null    object
 4   review_count      326 non-null    int32
 5   review_url        326 non-null    object
 6   amenities         326 non-null    object
 7   provider          326 non-null    object
 8   reviews           326 non-null    object
dtypes: int32(1), object(8)
memory usage: 24.2+ KB
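The cleaning steps call for a new price column, and the later charts read `hotels["price"]`, but the code that builds it is not shown. The following is a minimal sketch under one assumed rule: use the original price where it exists, otherwise fall back to the discounted price. The helper name `add_price_column` and the sample rows are hypothetical, and the rule actually used in the notebook may differ.

```python
import pandas as pd


def add_price_column(df):
    """Return a copy of df with a numeric price column (assumed rule, see above)."""
    out = df.copy()
    # Non-numeric values such as None or "" become NaN instead of raising.
    original = pd.to_numeric(out["original_price"], errors="coerce")
    discounted = pd.to_numeric(out["discounted_price"], errors="coerce")
    # Prefer the original price; fall back to the discounted one when it is missing.
    out["price"] = original.fillna(discounted)
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "name": ["Hotel A", "Hotel B", "Hotel C"],
    "discounted_price": ["91", "128", ""],
    "original_price": [None, "159", None],
})
priced = add_price_column(sample)
print(priced[["name", "price"]])
```

Converting with `errors="coerce"` keeps hotels without any price as NaN, so they drop out of averages instead of counting as zero.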
#Check the data after fixing the errors and changing the data types
hotels.head()
| | name | discounted_price | original_price | bubbles | review_count | review_url | amenities | provider | reviews | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Staycity Aparthotels Rue Garibaldi | 91 | None | 4.5 | 924 | Hotel_Review-g187265-d12136060-Reviews-Staycit... | Free Wifi,, Fitness,center | Expedia.com | A little further out than I’d have liked but t... | 91 |
| 1 | Hotel Lyon Metropole | 108 | None | 4 | 1837 | Hotel_Review-g187265-d232351-Reviews-Hotel_Lyo... | Free Wifi,, Free,parking | Expedia.com | The room was large, modern and very clean it... | 108 |
| 2 | Radisson Blu Hotel, Lyon | 128 | 159 | 4 | 1205 | Hotel_Review-g187265-d472240-Reviews-Radisson_... | Free Wifi,, Restaurant | Expedia.com | We had a fantastic 3 nights in a superior roo... | 128 |
| 3 | Ibis Budget Aeroport Lyon Saint Exupery | 69 | None | 4 | 1420 | Hotel_Review-g187265-d3716594-Reviews-Ibis_Bud... | Free Wifi,, Restaurant | Expedia.com | a small but comfortable stay for a stopover go... | 69 |
| 4 | Hotel Carlton Lyon - MGallery Collection | 163 | None | 4.5 | 1879 | Hotel_Review-g187265-d232355-Reviews-Hotel_Car... | Free Wifi,, Room,service | Expedia.com | ... the river and the old town with it's wonde... | 163 |
fig = go.Figure(data=go.Bar(x = hotels["name"], y = hotels["price"]))
fig.update_layout(title='Lyon Hotels', xaxis_title='Hotel', yaxis_title='Price')
fig.show()
hotels["amenities"].value_counts().head(10)
Free Wifi                      110
None                            75
Free Wifi,, Restaurant          46
Free Wifi,, Bar/,,Lounge        24
Free Wifi,, Free,parking        17
Free Wifi,, Pool                17
Free Wifi,, Room,service         8
Free Wifi,, Fitness,center       6
Free Wifi,, Spa                  5
Bar/,,Lounge                     4
Name: amenities, dtype: int64
#Bar chart of the 10 most common hotel amenity combinations
hotel_amenities_bar = go.Bar(y = hotels["amenities"].value_counts().head(10).values,
                             x = hotels["amenities"].value_counts().head(10).index)
layout = go.Layout(title = "POPULAR HOTEL AMENITIES", yaxis_title = "Count")
fig = go.Figure(hotel_amenities_bar, layout)
iplot(fig)
rating_price_bar = go.Bar(x=hotels["bubbles"], y=hotels["name"])
layout = go.Layout(title = "Rank Class", yaxis_title = "Name", xaxis_title = "Rating")
fig = go.Figure(rating_price_bar,layout)
iplot(fig)
4.5 rated hotels have the highest average price, but you can find a good hotel at rating 4.0 and 5.0 for a lower cost!
Ratings are hard to compare when a hotel with a single review is rated 5.0 while a hotel rated 3.0 has 1,000 reviews. So we look at the mean review count, choose a cut-off close to it, and keep only the hotels above that cut-off.
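The comparison of average price by rating, restricted to hotels with enough reviews, can be sketched as follows. The helper name `mean_price_by_rating` and the sample rows are hypothetical; the column names follow the `hotels` DataFrame, and the sketch assumes `bubbles` and `price` are already numeric.

```python
import pandas as pd


def mean_price_by_rating(df, min_reviews):
    """Average price per rating, using only hotels above the review cut-off."""
    popular = df[df["review_count"] > min_reviews]
    return popular.groupby("bubbles")["price"].mean()


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "bubbles": [5.0, 4.5, 4.5, 4.0, 3.0],
    "review_count": [1, 900, 1200, 1500, 1000],
    "price": [60.0, 150.0, 170.0, 100.0, 80.0],
})
avg = mean_price_by_rating(sample, min_reviews=400)
print(avg)
```

In the sample, the hotel rated 5.0 on a single review is excluded, so it cannot distort the per-rating averages.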
#Check review mean to decide the cut-off
hotels["review_count"].mean()
380.2188552188552
#We choose a cut-off of 400 and keep only hotels with more than 400 reviews to get rating-wise hotel prices and names.
review_hotels = hotels[hotels["review_count"] > 400]
#Sort data on rating and price
max_price_sorted_hotels = review_hotels.sort_values(["bubbles", "price"], ascending = False)
#Get the maximum price for each rating (columns are selected with a list to avoid the pandas FutureWarning)
rating_max_price = max_price_sorted_hotels.groupby(["bubbles"], as_index=False)[["name", "bubbles", "price"]].head(1)
#Plot the graph of rating-wise hotels with maximum price
rating_max_price_bar = go.Bar(x = rating_max_price["name"], y = rating_max_price["price"],
                              text=rating_max_price['bubbles'], hovertemplate = "<b>%{text}</b><br><br>" + "<extra></extra>")
layout = go.Layout(title = "RATING-WISE HOTELS WITH MAXIMUM PRICE", yaxis_title = "Price in $")
fig = go.Figure(rating_max_price_bar,layout)
iplot(fig)
#Sort data on rating descending and price ascending to get the lowest price for each rating
min_price_sorted_hotels = review_hotels.sort_values(["bubbles", "price"], ascending = [False, True])
#Get the minimum price for each rating (columns are selected with a list to avoid the pandas FutureWarning)
rating_min_price = min_price_sorted_hotels.groupby(["bubbles"], as_index=False)[["name", "bubbles", "price"]].head(1)
#Plot the graph of rating-wise hotels with minimum price
rating_min_price_bar = go.Bar(x = rating_min_price["name"], y = rating_min_price["price"],
                              text=rating_min_price['bubbles'], hovertemplate = "<b>%{text}</b><br><br>" + "<extra></extra>")
layout = go.Layout(title = "RATING-WISE HOTELS WITH MINIMUM PRICE", yaxis_title = "Price in $")
fig = go.Figure(rating_min_price_bar,layout)
iplot(fig)
fig = go.Figure(data=go.Bar(x=hotels["name"], y=hotels["price"]))
fig.update_layout(title='DISCOUNT AND RATING', xaxis_title='Rating', yaxis_title='Discounts')
fig.show()
We can see that as the rating increases, more and more discounts are offered except for the rating of 5.0.
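The chart code above plots price against hotel name, so the discount itself is never computed. A minimal sketch of one way to relate discount to rating follows, assuming the discount is the original price minus the discounted price and that both columns are numeric. The helper name `mean_discount_by_rating` and the sample rows are hypothetical.

```python
import pandas as pd


def mean_discount_by_rating(df):
    """Average discount per rating, using only hotels that list both prices."""
    both = df.dropna(subset=["original_price", "discounted_price"]).copy()
    both["discount"] = both["original_price"] - both["discounted_price"]
    return both.groupby("bubbles")["discount"].mean()


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "bubbles": [4.0, 4.0, 4.5, 5.0],
    "original_price": [159.0, 78.0, 63.0, None],
    "discounted_price": [128.0, 70.0, 50.0, 90.0],
})
discounts = mean_discount_by_rating(sample)
print(discounts)
```

Hotels without an original price are dropped first, since no discount can be derived for them. With only a handful of original prices available, such averages should be read with caution.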
#Plot bar graph to analyze providers
providers_bar = go.Bar(x=hotels["provider"].value_counts().head(10).index,
y=hotels["provider"].value_counts().head(10).values)
layout = go.Layout(title = "POPULAR HOTEL BOOKING PROVIDERS", yaxis_title = "Count")
fig = go.Figure(providers_bar,layout)
iplot(fig)
Booking.com appears to be the most common provider for hotel bookings and discounts, with Expedia.com a distant second.
hotels.sort_values("review_count", ascending = False)[["name", "review_count"]].head(5)
| | name | review_count |
|---|---|---|
| 23 | Sofitel Lyon Bellecour | 2470 |
| 229 | Sofitel Lyon Bellecour | 2470 |
| 222 | Ibis Lyon Part Dieu Les Halles | 1969 |
| 3 | Hotel Carlton Lyon - MGallery Collection | 1879 |
| 531 | Hotel Carlton Lyon - MGallery Collection | 1879 |
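The same hotel appears more than once here because `drop_duplicates()` with no arguments only removes rows that match in every column. Assuming the name identifies a hotel, duplicates can instead be collapsed on that column alone. The sketch below uses made-up rows in which the provider values are purely illustrative.

```python
import pandas as pd

# Made-up rows: the same hotel listed twice with a different provider.
sample = pd.DataFrame({
    "name": ["Sofitel Lyon Bellecour", "Sofitel Lyon Bellecour",
             "Ibis Lyon Part Dieu Les Halles"],
    "provider": ["Expedia.com", "Booking.com", "Booking.com"],
    "review_count": [2470, 2470, 1969],
})

# Full-row comparison keeps both listings because the provider differs.
full_row = sample.drop_duplicates()
# Comparing on the name only keeps the first listing of each hotel.
by_name = sample.drop_duplicates(subset="name", keep="first")

print(len(full_row), len(by_name))
```

Which listing to keep is a design choice: `keep="first"` retains the earliest scraped row, which drops any other provider or price offered for the same hotel.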
We will analyze the hotel reviews to find the most frequent words and to assign each review a sentiment (positive, negative, or neutral) based on its text. Before we can do that, we need to clean the text by removing punctuation marks and stopwords.
#Clean the review column to remove unwanted characters
pd.set_option('max_colwidth', 5000)
def clean_review(review):
    return review.replace("", "")
hotels["reviews"] = hotels["reviews"].apply(clean_review)
# Create our list of punctuation marks
punctuations = string.punctuation
#Lowercase the reviews and remove punctuation for further analysis
hotels["reviews"] = hotels["reviews"].apply(lambda x : x.lower())
hotels["reviews"] = hotels["reviews"].apply(lambda x : ''.join([a for a in x if a not in punctuations]))
# Get length of review for EDA
hotels['review_length'] = hotels["reviews"].apply(lambda x: 0 if x == "none" else len(x))
hotels.head(2)
| | name | discounted_price | original_price | bubbles | review_count | review_url | amenities | provider | reviews | price | review_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Radisson Blu Hotel, Lyon | 107 | 112 | 4 | 1205 | Hotel_Review-g187265-d472240-Reviews-Radisson_Blu_Hotel_Lyon-Lyon_Rhone_Auvergne_Rhone_Alpes.html | Free Wifi, Restaurant | Agoda.com | we had a fantastic 3 nights in a superior room with the most amazing river and city views | 112 | 90 |
| 1 | Staycity Aparthotels Rue Garibaldi | 74 | None | 4.5 | 924 | Hotel_Review-g187265-d12136060-Reviews-Staycity_Aparthotels_Rue_Garibaldi-Lyon_Rhone_Auvergne_Rhone_Alpes.html | Free Wifi, Fitness,center | Booking.com | a little further out than i’d have liked but the metro pass is so cheap making getting anywhere an absolute breeze large room with so much floor space clean and modern | 74 | 168 |
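`string.punctuation` contains only ASCII punctuation, which is why the curly apostrophe in "i’d" survives the cleaning. A minimal sketch of the same lowercase-and-strip step that also removes typographic quotes follows. The helper name `clean_text` and the extra character set are assumptions, not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(review, extra_chars="’‘“”"):
    """Lowercase a review and drop ASCII punctuation plus typographic quotes."""
    lowered = review.lower().translate(_PUNCT_TABLE)
    return "".join(ch for ch in lowered if ch not in extra_chars)


print(clean_text("Clean, modern room!"))
print(clean_text("I’d stay again."))
```

`str.translate` removes all listed characters in a single pass, which tends to be faster than testing each character against the punctuation string.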
print('The mean for the length of review:',hotels['review_length'].mean())
print('The standard deviation for the length of reviews:',hotels['review_length'].std())
print('The maximum for the length of reviews:',hotels['review_length'].max())
The mean for the length of review: 124.58922558922559
The standard deviation for the length of reviews: 83.80446767588707
The maximum for the length of reviews: 214
#Plot a word cloud to look at the frequently occurring words
words = " ".join(hotels['reviews'])
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black',
                      width=3000,
                      height=2500
                     ).generate(words)
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
We will use CountVectorizer. It counts the number of times a token shows up in the reviews and uses this value as its weight.
#Instantiate the vectorizer object
cv = CountVectorizer(stop_words = 'english')
#Convert the reviews into a matrix - where each row represents a specific text in the reviews and each column represents
#a word in vocabulary. words[i,j] is the occurrence of word j in the text i.
words = cv.fit_transform(hotels["reviews"])
#sum_words is a vector that contains the sum of each word occurrence in all texts in the reviews.
#We are adding the elements for each column of words matrix.
sum_words = words.sum(axis=0)
#Create a list of tuples with the word and the frequency. cv.vocabulary_ is a dict, where the keys are
#the words (features) and the values are indices
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
#Sort the list of tuples that contain the word and their occurrence in the corpus.
words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
#Create a dataframe
frequency = pd.DataFrame(words_freq, columns=['word', 'freq'])
#Plot a bar chart of the 20 most frequent words
word_freq_bar = go.Bar(x = frequency["word"].head(20), y = frequency["freq"].head(20))
layout = go.Layout(title = "TOP 20 FREQUENTLY OCCURRING WORDS", yaxis_title = "Count")
fig = go.Figure(word_freq_bar,layout)
iplot(fig)
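The counting that the vectorizer performs can be illustrated without scikit-learn. The sketch below is a stand-in built on `collections.Counter`, not the `CountVectorizer` implementation: it splits on whitespace only and takes the stop-word list as an argument. The helper name `top_words` and the sample reviews are hypothetical.

```python
from collections import Counter


def top_words(texts, stop_words, n=20):
    """Count every non-stop-word token across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            if token not in stop_words:
                counts[token] += 1
    return counts.most_common(n)


# Made-up reviews for illustration only.
reviews = ["the room was clean", "clean room and great location", "great staff"]
stop = {"the", "was", "and"}
print(top_words(reviews, stop, n=3))
```

Each token's count here corresponds to one column sum of the document-term matrix built above, which is what `sum_words` holds.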
We will use VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER not only classifies a text as positive or negative but also scores how positive or negative it is. It requires no training data; it is built from a generalizable, valence-based, human-curated gold-standard sentiment lexicon.
#Create a single SentimentIntensityAnalyzer object and reuse it for every review.
sid_obj = SentimentIntensityAnalyzer()
def sentiment_scores(sentence):
    # polarity_scores returns a dictionary with the pos, neg, neu, and compound scores for the given sentence.
    # The compound score sums the lexicon ratings of the words and normalizes the result to lie
    # between -1 (most extreme negative) and +1 (most extreme positive).
    sentiment_dict = sid_obj.polarity_scores(sentence)
    #We use the usual VADER thresholds on the compound score:
    if sentiment_dict['compound'] >= 0.05 :
        return ("Positive")
    elif sentiment_dict['compound'] <= - 0.05 :
        return ("Negative")
    else :
        return ("Neutral")
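The threshold rule can be separated from the analyzer so that it is easy to check on its own. The sketch below assumes a hypothetical helper, `label_from_compound`, that takes an already computed compound score and does not call VADER itself.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    # Scores strictly between -threshold and +threshold count as neutral.
    return "Neutral"


for score in (0.6, 0.0, -0.3):
    print(score, label_from_compound(score))
```

Both boundaries are inclusive, so a score of exactly 0.05 is labeled positive and exactly -0.05 negative.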
#Calculate hotel sentiments
hotels["sentiment"] = hotels["reviews"].apply(sentiment_scores)
#Check overall hotel sentiments in Lyon
hotel_sentiments = hotels.groupby("sentiment", as_index=False)["name"].count()
word_freq_bar = go.Bar(x = hotel_sentiments["sentiment"], y = hotel_sentiments["name"])
layout = go.Layout(title = "HOTEL SENTIMENTS", yaxis_title = "Count")
fig = go.Figure(word_freq_bar,layout)
iplot(fig)
Overall, the hotel reviews carry positive sentiment, with very few negative ones.
Let us check if the high priced hotels have positive sentiments and are worth the money we spend!
#Get the sentiment-wise hotel count for each rating
rating_sentiment = hotels.groupby(["bubbles", "sentiment"], as_index = False)["name"].count()
#Plot a bar graph to check the sentiment for each rating
plt.figure(figsize=(12, 8))
sns.barplot(x="bubbles", hue="sentiment", y="name", data=rating_sentiment)
plt.ylabel("Count")
plt.xlabel("Rating")
plt.title("RATING-WISE HOTEL SENTIMENTS")
plt.show()
#Get the sentiment-wise hotel count for each provider
provider_sentiment = hotels.groupby(["provider", "sentiment"], as_index = False)["name"].count()
#Plot a bar graph to check the sentiment for each provider
plt.figure(figsize=(12, 8))
sns.barplot(x="provider", hue="sentiment", y="name", data=provider_sentiment)
plt.ylabel("Count")
plt.xlabel("Provider")
plt.title("PROVIDER-WISE HOTEL SENTIMENTS")
plt.show()
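Raw counts per rating are hard to compare when some ratings cover many more hotels than others. A sketch that converts the counts into shares per rating with `pd.crosstab` follows. The helper name `sentiment_share_by_rating` and the sample rows are hypothetical; the column names follow the `hotels` DataFrame.

```python
import pandas as pd


def sentiment_share_by_rating(df):
    """Share of each sentiment label within every rating (each row sums to 1)."""
    return pd.crosstab(df["bubbles"], df["sentiment"], normalize="index")


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "bubbles": ["3", "3", "4.5", "4.5", "4.5", "1"],
    "sentiment": ["Positive", "Neutral", "Positive", "Positive", "Negative", "Negative"],
})
shares = sentiment_share_by_rating(sample)
print(shares)
```

With `normalize="index"`, each rating is scaled by its own number of hotels, so a rating with few hotels can be compared directly with a crowded one.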
Most hotels rated 3.0 have a high number of reviews with positive sentiment, while hotels rated below 1.5 have more negative sentiment. So, we can conclude that the high price of a good hotel is worth the luxury and convenience!
In our Lyon hotel analysis, we found that: