
Quora Question Pairs

1. Business Problem

1.1 Description

Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.


Credits: Kaggle

Problem Statement

  • Identify which questions asked on Quora are duplicates of questions that have already been asked.
  • This could be useful to instantly provide answers to questions that have already been answered.
  • We are tasked with predicting whether the two questions in a pair are duplicates of each other.

1.2 Sources/Useful Links

1.3 Real world/Business Objectives and Constraints

  1. The cost of a misclassification can be very high.
  2. We want the probability that a pair of questions is duplicate, so that we can choose a classification threshold of our choice (see the sketch below).
  3. No strict latency requirements.
  4. Interpretability is partially important.
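
For constraint 2, the intended usage looks like the hedged sketch below, assuming a hypothetical fitted scikit-learn classifier clf and feature matrix X (neither is defined in this notebook):

In [0]:
# Hedged sketch: turn predicted duplicate-probabilities into labels at a
# chosen operating threshold. `clf` and `X` are hypothetical placeholders.
import numpy as np

proba = clf.predict_proba(X)[:, 1]      # P(is_duplicate = 1) for each pair
threshold = 0.8                         # set according to the misclassification cost
predictions = (proba >= threshold).astype(int)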

2. Machine Learning Problem

2.1 Data

2.1.1 Data Overview

- The data comes in a single file, train.csv
- train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
- Size of train.csv: 60 MB
- Number of rows in train.csv: 404,290

2.1.2 Example Data point

"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0"
"1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0"
"7","15","16","How can I be a good geologist?","What should I do to be a great geologist?","1"
"11","23","24","How do I read and find my YouTube comments?","How can I see all my Youtube comments?","1"

2.2 Mapping the real world problem to an ML problem

2.2.1 Type of Machine Learning Problem

It is a binary classification problem: for a given pair of questions, we need to predict whether or not they are duplicates.

2.2.2 Performance Metric
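
A natural primary metric here is log loss, since constraint 2 asks for well-calibrated probabilities and log loss heavily penalizes confident wrong predictions; a binary confusion matrix is a useful secondary check. A minimal example on toy values:

In [0]:
from sklearn.metrics import log_loss, confusion_matrix

# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]          # predicted P(is_duplicate = 1)

print("log loss :", log_loss(y_true, y_prob))
print(confusion_matrix(y_true, [int(p >= 0.5) for p in y_prob]))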

3. Exploratory Data Analysis

In [0]:
!pip install distance
Collecting distance
  Downloading https://files.pythonhosted.org/packages/5c/1a/883e47df323437aefa0d0a92ccfb38895d9416bd0b56262c2e46a47767b8/Distance-0.1.3.tar.gz (180kB)
     |████████████████████████████████| 184kB 2.9MB/s 
Building wheels for collected packages: distance
  Building wheel for distance (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/d5/aa/e1/dbba9e7b6d397d645d0f12db1c66dbae9c5442b39b001db18e
Successfully built distance
Installing collected packages: distance
Successfully installed distance-0.1.3
In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc

import re
from nltk.corpus import stopwords
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
In [0]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive

3.1 Reading data and basic stats

In [0]:
!ls
drive  sample_data
In [0]:
df = pd.read_csv("drive/My Drive/Quora/train.csv",nrows = 100000)

print("Number of data points:",df.shape[0])
Number of data points: 100000
In [0]:
df.head()
Out[0]:
id qid1 qid2 question1 question2 is_duplicate
0 0 1 2 What is the step by step guide to invest in sh... What is the step by step guide to invest in sh... 0
1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... What would happen if the Indian government sto... 0
2 2 5 6 How can I increase the speed of my internet co... How can Internet speed be increased by hacking... 0
3 3 7 8 Why am I mentally very lonely? How can I solve... Find the remainder when [math]23^{24}[/math] i... 0
4 4 9 10 Which one dissolve in water quikly sugar, salt... Which fish would survive in salt water? 0
In [0]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
id              100000 non-null int64
qid1            100000 non-null int64
qid2            100000 non-null int64
question1       100000 non-null object
question2       100000 non-null object
is_duplicate    100000 non-null int64
dtypes: int64(4), object(2)
memory usage: 4.6+ MB

We are given a minimal number of data fields here, consisting of:

  • id: Looks like a simple row ID
  • qid{1, 2}: The unique ID of each question in the pair
  • question{1, 2}: The actual textual contents of the questions.
  • is_duplicate: The label that we are trying to predict - whether the two questions are duplicates of each other.

3.2.1 Distribution of data points among output classes

  • Number of duplicate (similar) and non-duplicate (dissimilar) question pairs
In [0]:
df.groupby("is_duplicate")['id'].count().plot.bar()
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe317c43978>
In [0]:
print('~> Total number of question pairs for training:\n   {}'.format(len(df)))
~> Total number of question pairs for training:
   100000
In [0]:
print('~> Question pairs are not Similar (is_duplicate = 0):\n   {}%'.format(100 - round(df['is_duplicate'].mean()*100, 2)))
print('\n~> Question pairs are Similar (is_duplicate = 1):\n   {}%'.format(round(df['is_duplicate'].mean()*100, 2)))
~> Question pairs are not Similar (is_duplicate = 0):
   62.75%

~> Question pairs are Similar (is_duplicate = 1):
   37.25%

3.2.2 Number of unique questions

In [0]:
qids = pd.Series(df['qid1'].tolist() + df['qid2'].tolist())
unique_qs = len(np.unique(qids))
qs_morethan_onetime = np.sum(qids.value_counts() > 1)
print('Total number of unique questions: {}\n'.format(unique_qs))

print('Number of unique questions that appear more than once: {} ({:.2f}%)\n'.format(qs_morethan_onetime, qs_morethan_onetime/unique_qs*100))

print ('Max number of times a single question is repeated: {}\n'.format(max(qids.value_counts()))) 

q_vals=qids.value_counts()

q_vals=q_vals.values
Total number of unique questions: 165931

Number of unique questions that appear more than once: 19446 (11.72%)

Max number of times a single question is repeated: 32

In [0]:
x = ["unique_questions" , "Repeated Questions"]
y =  [unique_qs , qs_morethan_onetime]

plt.figure(figsize=(10, 6))
plt.title ("Plot representing unique and repeated questions  ")
sns.barplot(x,y)
plt.show()

3.2.3 Checking for Duplicates

In [0]:
# Checking whether any (qid1, qid2) pair occurs more than once: if every pair
# is unique, the grouped frame has exactly as many rows as df, so the
# difference below is 0 (a negative value would indicate repeated pairs).

pair_duplicates = df[['qid1','qid2','is_duplicate']].groupby(['qid1','qid2']).count().reset_index()

print ("Number of duplicate question pairs", pair_duplicates.shape[0] - df.shape[0])
Number of duplicate question pairs 0

3.2.4 Number of occurrences of each question

In [0]:
plt.figure(figsize=(20, 10))

plt.hist(qids.value_counts(), bins=10)

plt.yscale('log', nonposy='clip')

plt.title('Log-Histogram of question appearance counts')

plt.xlabel('Number of occurrences of question')

plt.ylabel('Number of questions')

print ('Maximum number of times a single question is repeated: {}\n'.format(max(qids.value_counts()))) 
Maximum number of times a single question is repeated: 32

3.2.5 Checking for NULL values

In [0]:
#Checking whether there are any rows with null values
nan_rows = df[df.isnull().any(axis=1)]
print (nan_rows)
Empty DataFrame
Columns: [id, qid1, qid2, question1, question2, is_duplicate]
Index: []
  • There are no null values
In [0]:
# # Filling the null values with ' '
# df = df.fillna('')
# nan_rows = df[df.isnull().any(1)]
# print (nan_rows)
In [0]:
y_true = df['is_duplicate'].values
# df.drop(['is_duplicate'],axis=1,inplace = True)
In [0]:
df.shape
Out[0]:
(100000, 6)
In [0]:
y_true.shape
Out[0]:
(100000,)
In [0]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y_true, test_size=0.33, stratify=y_true)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.33, stratify=y_train)
In [0]:
print("Number of data points in train data :",X_train.shape)
print("Number of data points in cross-val data :",X_cv.shape)
print("Number of data points in test data :",X_test.shape)
Number of data points in train data : (44890, 5)
Number of data points in cross-val data : (22110, 5)
Number of data points in test data : (33000, 5)

3.3 Basic Feature Extraction (before cleaning)

Let us now construct a few features like:

  • freq_qid1 = Frequency of qid1
  • freq_qid2 = Frequency of qid2
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = Number of common unique words in Question 1 and Question 2
  • word_Total = Total number of unique words in Question 1 + total number of unique words in Question 2
  • word_share = word_Common / word_Total
  • freq_q1+q2 = Sum of the frequencies of qid1 and qid2
  • freq_q1-q2 = Absolute difference of the frequencies of qid1 and qid2

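The next three cells apply this same recipe separately to the train, CV, and test frames. As a hedged sketch, the shared logic could live in one helper (the name add_basic_features is illustrative, not from the original notebook):

In [0]:
def add_basic_features(frame):
    # Frequency of each question id within this frame.
    frame['freq_qid1'] = frame.groupby('qid1')['qid1'].transform('count')
    frame['freq_qid2'] = frame.groupby('qid2')['qid2'].transform('count')
    # Character and word counts.
    frame['q1len'] = frame['question1'].str.len()
    frame['q2len'] = frame['question2'].str.len()
    frame['q1_n_words'] = frame['question1'].apply(lambda q: len(q.split(" ")))
    frame['q2_n_words'] = frame['question2'].apply(lambda q: len(q.split(" ")))

    def word_sets(row):
        # Unique, lower-cased words of each question.
        w1 = set(w.lower().strip() for w in row['question1'].split(" "))
        w2 = set(w.lower().strip() for w in row['question2'].split(" "))
        return w1, w2

    frame['word_Common'] = frame.apply(lambda r: len(set.intersection(*word_sets(r))), axis=1)
    frame['word_Total'] = frame.apply(lambda r: sum(len(s) for s in word_sets(r)), axis=1)
    frame['word_share'] = frame['word_Common'] / frame['word_Total']
    frame['freq_q1+q2'] = frame['freq_qid1'] + frame['freq_qid2']
    frame['freq_q1-q2'] = abs(frame['freq_qid1'] - frame['freq_qid2'])
    return frame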
Basic Feature Extraction for train set

In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing.csv'):
    X_train = pd.read_csv("drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing.csv",encoding='latin-1')
else:
    X_train['freq_qid1'] = X_train.groupby('qid1')['qid1'].transform('count') 
    X_train['freq_qid2'] = X_train.groupby('qid2')['qid2'].transform('count')
    X_train['q1len'] = X_train['question1'].str.len() 
    X_train['q2len'] = X_train['question2'].str.len()
    X_train['q1_n_words'] = X_train['question1'].apply(lambda row: len(row.split(" ")))
    X_train['q2_n_words'] = X_train['question2'].apply(lambda row: len(row.split(" ")))

    def normalized_word_Common(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)
    X_train['word_Common'] = X_train.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * (len(w1) + len(w2))
    X_train['word_Total'] = X_train.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    X_train['word_share'] = X_train.apply(normalized_word_share, axis=1)

    X_train['freq_q1+q2'] = X_train['freq_qid1']+X_train['freq_qid2']
    X_train['freq_q1-q2'] = abs(X_train['freq_qid1']-X_train['freq_qid2'])

    X_train.to_csv("drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing_train.csv", index=False)

X_train.head()
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2
38970 38970 70704 70705 Can you suggest me a good name that is related... What are the precautions to be taken for const... 0 1 1 73 69 15 12 0.0 27.0 0.000000 2 0
99681 99681 165453 165454 How do I know if I am a lesbian trapped in a m... I am 15 and just came out to my dad as a lesbi... 0 1 1 83 118 19 30 5.0 39.0 0.128205 2 0
45433 45433 81424 81425 I'm not afraid of my future. What can I do? I'm terribly afraid of my future. What should ... 0 1 1 43 51 10 10 8.0 20.0 0.400000 2 0
91049 91049 138056 152668 How much is one million and one billion in lak... How do I spend one million dollar? 0 1 1 60 34 12 7 3.0 17.0 0.176471 2 0
82061 82061 139239 139240 Is there an advanced search syntax for Amazon'... How was Amazon search in its initial stages? 0 1 1 53 44 9 8 1.0 17.0 0.058824 2 0

Basic Feature Extraction for CV set

In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv'):
    X_cv = pd.read_csv("drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv",encoding='latin-1')
else:
    X_cv['freq_qid1'] = X_cv.groupby('qid1')['qid1'].transform('count') 
    X_cv['freq_qid2'] = X_cv.groupby('qid2')['qid2'].transform('count')
    X_cv['q1len'] = X_cv['question1'].str.len() 
    X_cv['q2len'] = X_cv['question2'].str.len()
    X_cv['q1_n_words'] = X_cv['question1'].apply(lambda row: len(row.split(" ")))
    X_cv['q2_n_words'] = X_cv['question2'].apply(lambda row: len(row.split(" ")))

    def normalized_word_Common(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)
    X_cv['word_Common'] = X_cv.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * (len(w1) + len(w2))
    X_cv['word_Total'] = X_cv.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    X_cv['word_share'] = X_cv.apply(normalized_word_share, axis=1)

    X_cv['freq_q1+q2'] = X_cv['freq_qid1']+X_cv['freq_qid2']
    X_cv['freq_q1-q2'] = abs(X_cv['freq_qid1']-X_cv['freq_qid2'])

    X_cv.to_csv("drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv", index=False)

X_cv.head()
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2
60746 60746 106185 106186 What are some of the innovative startups in In... What are some of the new innovative startups i... 0 1 1 50 54 9 10 9.0 19.0 0.473684 2 0
36802 36802 67067 67068 What are the tips and hacks for getting the cl... What are the tips and hacks for getting the cl... 0 1 1 97 94 19 19 17.0 36.0 0.472222 2 0
78720 78720 134178 134179 Is it grammatically correct to put a comma aft... What is relation between kp and kc? 0 1 1 71 35 11 7 1.0 17.0 0.058824 2 0
83051 83051 140697 140698 What is intermittent fasting? What was your intermittent fasting experience? 0 1 1 29 46 4 6 2.0 10.0 0.200000 2 0
60336 60336 105528 105529 Why do black people have white palms? Why do so many Asian people say "whites" or "b... 0 1 1 37 84 7 15 3.0 21.0 0.142857 2 0

Basic Feature Extraction for test set

In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing.csv'):
    X_test = pd.read_csv("drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing.csv",encoding='latin-1')
else:
    X_test['freq_qid1'] = X_test.groupby('qid1')['qid1'].transform('count') 
    X_test['freq_qid2'] = X_test.groupby('qid2')['qid2'].transform('count')
    X_test['q1len'] = X_test['question1'].str.len() 
    X_test['q2len'] = X_test['question2'].str.len()
    X_test['q1_n_words'] = X_test['question1'].apply(lambda row: len(row.split(" ")))
    X_test['q2_n_words'] = X_test['question2'].apply(lambda row: len(row.split(" ")))

    def normalized_word_Common(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)
    X_test['word_Common'] = X_test.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * (len(w1) + len(w2))
    X_test['word_Total'] = X_test.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    X_test['word_share'] = X_test.apply(normalized_word_share, axis=1)

    X_test['freq_q1+q2'] = X_test['freq_qid1']+X_test['freq_qid2']
    X_test['freq_q1-q2'] = abs(X_test['freq_qid1']-X_test['freq_qid2'])

    X_test.to_csv("drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing_train.csv", index=False)

X_test.head()
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2
13403 13403 25742 20236 Smartphones: What is the best phone camera at ... Which phone has the best camera? 1 1 2 57 32 10 6 3.0 15.0 0.200000 3 1
21490 21490 40457 16040 Is the agricultural sector a failing one in In... What are the problems in the agricultural sect... 1 1 1 50 58 9 10 5.0 18.0 0.277778 2 0
34199 34199 62703 62704 When bacteria die do they also decay, will the... Do bacteria reason in a very simplified way or... 0 1 1 80 108 14 19 4.0 31.0 0.129032 2 0
34671 34671 63487 63488 How can you identify a phishing attack and how... How can I identify phishing Emails? 1 1 1 81 35 15 6 4.0 16.0 0.250000 2 0
75567 75567 19868 129309 Can using birth control cause complications in... Can birth control pills cause me to become per... 1 1 1 66 65 9 10 4.0 19.0 0.210526 2 0

3.3.1 Analysis of some of the extracted features

  • Some questions contain only a single word.
In [0]:
print ("Minimum length of the questions in question1 : " , min(X_train['q1_n_words']))

print ("Minimum length of the questions in question2 : " , min(X_train['q2_n_words']))

print ("Number of Questions with minimum length [question1] :", X_train[X_train['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_train[X_train['q2_n_words']== 1].shape[0])
Minimum length of the questions in question1 :  1
Minimum length of the questions in question2 :  1
Number of Questions with minimum length [question1] : 6
Number of Questions with minimum length [question2] : 3
In [0]:
print ("Minimum length of the questions in question1 : " , min(X_cv['q1_n_words']))

print ("Minimum length of the questions in question2 : " , min(X_cv['q2_n_words']))

print ("Number of Questions with minimum length [question1] :", X_cv[X_cv['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_cv[X_cv['q2_n_words']== 2].shape[0])
Minimum length of the questions in question1 :  1
Minimum length of the questions in question2 :  2
Number of Questions with minimum length [question1] : 6
Number of Questions with minimum length [question2] : 4
In [0]:
print ("Minimum length of the questions in question1 : " , min(X_test['q1_n_words']))

print ("Minimum length of the questions in question2 : " , min(X_test['q2_n_words']))

print ("Number of Questions with minimum length [question1] :", X_test[X_test['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_test[X_test['q2_n_words']== 1].shape[0])
Minimum length of the questions in question1 :  1
Minimum length of the questions in question2 :  1
Number of Questions with minimum length [question1] : 3
Number of Questions with minimum length [question2] : 2

3.3.1.1 Feature: word_share

In [0]:
plt.figure(figsize=(12, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_share', data = X_train)

plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['word_share'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['word_share'][0:] , label = "0" , color = 'blue' )
plt.show()
  • The word_share distributions of the two classes overlap towards the right-hand side, i.e., quite a lot of question pairs have high word similarity
  • The average word_share and the number of common words are higher when qid1 and qid2 are duplicates (similar); a quick numeric check follows below
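
A quick numeric check of the second bullet; the exact values depend on the random split, so they are illustrative only:

In [0]:
# Mean word_share and word_Common per class; duplicate pairs should score higher.
print(X_train.groupby('is_duplicate')[['word_share', 'word_Common']].mean())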

3.3.1.2 Feature: word_Common

In [0]:
plt.figure(figsize=(12, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_Common', data = X_train)

plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['word_Common'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['word_Common'][0:] , label = "0" , color = 'blue' )
plt.show()

3.3.5 EDA: Advanced Feature Extraction

In [0]:
!pip install fuzzywuzzy
Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/d8/f1/5a267addb30ab7eaa1beab2b9323073815da4551076554ecc890a3595ec9/fuzzywuzzy-0.17.0-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.17.0
In [0]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc

import re
from nltk.corpus import stopwords
# This package is used for finding the longest common substring between two strings;
# you can write your own DP code for this
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from sklearn.manifold import TSNE
# Import the Required lib packages for WORD-Cloud generation
# https://stackoverflow.com/questions/45625434/how-to-install-wordcloud-in-python3-6
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image
In [0]:
# #https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
# if os.path.isfile('drive/My Drive/Quora_assignment/df_fe_without_preprocessing_train.csv'):
#     df = pd.read_csv("drive/My Drive/Quora_assignment/df_fe_without_preprocessing_train.csv",encoding='latin-1')
#     df = df.fillna('')
#     df.head()
# else:
#     print("get df_fe_without_preprocessing_train.csv from drive or run the previous notebook")
In [0]:
# df.head(2)
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2
0 0 1 2 What is the step by step guide to invest in sh... What is the step by step guide to invest in sh... 0 1 1 66 57 14 12 10.0 23.0 0.434783 2 0
1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... What would happen if the Indian government sto... 0 1 1 51 88 8 13 4.0 20.0 0.200000 2 0

3.4 Preprocessing of Text

  • Preprocessing:
    • Removing HTML tags
    • Removing punctuation
    • Performing stemming
    • Removing stopwords
    • Expanding contractions, etc.
In [0]:
import nltk
nltk.download('stopwords')
# Small constant added to denominators below to avoid division by zero
SAFE_DIV = 0.0001 

STOP_WORDS = stopwords.words("english")


def preprocess(x):
    x = str(x).lower()
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)
    
    
    porter = PorterStemmer()
    pattern = re.compile(r'\W')

    if type(x) == type(''):
        x = re.sub(pattern, ' ', x)

    if type(x) == type(''):
        # Note: this stems the whole cleaned string as a single token,
        # a quirk kept from the original implementation.
        x = porter.stem(x)
        example1 = BeautifulSoup(x, "html.parser")
        x = example1.get_text()

    return x

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
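
As a quick sanity check, here is preprocess applied to a made-up question (the exact output is not asserted here; note that the whole cleaned string is passed through the Porter stemmer as a single token, a quirk kept from the original code):

In [0]:
print(preprocess("What's the best way to invest $1000 in share market?"))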
  • Next, a function that computes these features, taking question1 and question2 as its two parameters

3.5 Advanced Feature Extraction (NLP and Fuzzy Features)

Definitions:

  • Token: obtained by splitting a sentence on spaces
  • Stop_Word: a stop word as per NLTK
  • Word: a token that is not a stop word (illustrated just below)

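To make these definitions concrete, a small illustration using the STOP_WORDS list loaded earlier:

In [0]:
q = "what is the best way to learn python"
tokens = q.split()                                   # tokens: split on spaces
stops = [t for t in tokens if t in STOP_WORDS]       # stop words per NLTK
words = [t for t in tokens if t not in STOP_WORDS]   # words: non-stop-word tokens
print(tokens)
print(stops)    # e.g. ['what', 'is', 'the', 'to']
print(words)    # e.g. ['best', 'way', 'learn', 'python']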
Features:

  • cwc_min : Ratio of common_word_count to the minimum word count of Q1 and Q2
    cwc_min = common_word_count / min(len(q1_words), len(q2_words))

  • cwc_max : Ratio of common_word_count to the maximum word count of Q1 and Q2
    cwc_max = common_word_count / max(len(q1_words), len(q2_words))

  • csc_min : Ratio of common_stop_count to the minimum stop-word count of Q1 and Q2
    csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))

  • csc_max : Ratio of common_stop_count to the maximum stop-word count of Q1 and Q2
    csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))

  • ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2
    ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))

  • ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2
    ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))

  • last_word_eq : Check whether the last word of both questions is equal
    last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])

  • first_word_eq : Check whether the first word of both questions is equal
    first_word_eq = int(q1_tokens[0] == q2_tokens[0])

  • abs_len_diff : Absolute difference of the token counts
    abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))

  • mean_len : Average token count of the two questions
    mean_len = (len(q1_tokens) + len(q2_tokens)) / 2

  • longest_substr_ratio : Ratio of the length of the longest common substring to the minimum character length of Q1 and Q2
    longest_substr_ratio = len(longest common substring) / min(len(q1), len(q2))

For example, if Q1 has 5 non-stop words, Q2 has 7, and 3 of them are shared, then cwc_min = 3/5 = 0.6 and cwc_max = 3/7 ≈ 0.43 (the code below adds a small SAFE_DIV constant to each denominator to avoid division by zero).
In [0]:
def get_token_features(q1, q2):
    token_features = [0.0]*10
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()

    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features
    # Get the non-stopwords in Questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    #Get the stopwords in Questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from Question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from Question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from Question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Whether the last word of both questions is the same
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # Whether the first word of both questions is the same
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    token_features[8] = abs(len(q1_tokens) - len(q2_tokens))
    
    #Average Token Length of both Questions
    token_features[9] = (len(q1_tokens) + len(q2_tokens))/2
    return token_features

# Get the longest-common-substring ratio

def get_longest_substr_ratio(a, b):
    strs = list(distance.lcsubstrings(a, b))
    if len(strs) == 0:
        return 0
    else:
        return len(strs[0]) / (min(len(a), len(b)) + 1)

def extract_features(df):
    # preprocessing each question
    df["question1"] = df["question1"].fillna("").apply(preprocess)
    df["question2"] = df["question2"].fillna("").apply(preprocess)

    print("token features...")
    
    # Merging Features with dataset
    
    token_features = df.apply(lambda x: get_token_features(x["question1"], x["question2"]), axis=1)
    
    df["cwc_min"]       = list(map(lambda x: x[0], token_features))
    df["cwc_max"]       = list(map(lambda x: x[1], token_features))
    df["csc_min"]       = list(map(lambda x: x[2], token_features))
    df["csc_max"]       = list(map(lambda x: x[3], token_features))
    df["ctc_min"]       = list(map(lambda x: x[4], token_features))
    df["ctc_max"]       = list(map(lambda x: x[5], token_features))
    df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
    df["first_word_eq"] = list(map(lambda x: x[7], token_features))
    df["abs_len_diff"]  = list(map(lambda x: x[8], token_features))
    df["mean_len"]      = list(map(lambda x: x[9], token_features))
   
    #Computing Fuzzy Features and Merging with Dataset
    
    # do read this blog: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
    # https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings
    # https://github.com/seatgeek/fuzzywuzzy
    print("fuzzy features..")

    df["token_set_ratio"]       = df.apply(lambda x: fuzz.token_set_ratio(x["question1"], x["question2"]), axis=1)
    # The token-sort approach tokenizes each string, sorts the tokens alphabetically,
    # and joins them back into a string; the transformed strings are then compared with a simple ratio().
    df["token_sort_ratio"]      = df.apply(lambda x: fuzz.token_sort_ratio(x["question1"], x["question2"]), axis=1)
    df["fuzz_ratio"]            = df.apply(lambda x: fuzz.QRatio(x["question1"], x["question2"]), axis=1)
    df["fuzz_partial_ratio"]    = df.apply(lambda x: fuzz.partial_ratio(x["question1"], x["question2"]), axis=1)
    df["longest_substr_ratio"]  = df.apply(lambda x: get_longest_substr_ratio(x["question1"], x["question2"]), axis=1)
    return df
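
Before running extract_features on the full frames, a hedged spot check of the token and fuzzy features on one made-up pair (fuzz scores are integers in [0, 100]; exact values are not asserted here):

In [0]:
q1 = "how do i learn python quickly"
q2 = "what is the fastest way to learn python"
print(get_token_features(q1, q2))
print(fuzz.token_set_ratio(q1, q2), fuzz.token_sort_ratio(q1, q2),
      fuzz.QRatio(q1, q2), fuzz.partial_ratio(q1, q2))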
In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_train.csv'):
    X_train= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_train.csv",encoding='latin-1')
    X_train = X_train.fillna('')
else:
    print("Extracting features for train:")
#     df = pd.read_csv("drive/My Drive/Quora_assignment/train.csv")
    X_train= extract_features(X_train)
    X_train.to_csv("drive/My Drive/Quora_assignment/nlp_features_train.csv", index=False)
X_train.head(2)
Extracting features for train:
token features...
fuzzy features..
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2 cwc_min cwc_max csc_min csc_max ctc_min ctc_max last_word_eq first_word_eq abs_len_diff mean_len token_set_ratio token_sort_ratio fuzz_ratio fuzz_partial_ratio longest_substr_ratio
38970 38970 70704 70705 can you suggest me a good name that is related... what are the precautions to be taken for const... 0 1 1 73 69 15 12 0.0 27.0 0.000000 2 0 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 4.0 14.0 36 36 36 43 0.057143
99681 99681 165453 165454 how do i know if i am a lesbian trapped in a m... i am 15 and just came out to my dad as a lesbi... 0 1 1 83 118 19 30 5.0 39.0 0.128205 2 0 0.142855 0.142855 0.454541 0.35714 0.299999 0.199999 0.0 0.0 10.0 25.0 43 49 35 42 0.127907
In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_cv.csv'):
    X_cv= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_cv.csv",encoding='latin-1')
    X_cv = X_cv.fillna('')
else:
    print("Extracting features for cv:")
#     df = pd.read_csv("drive/My Drive/Quora_assignment/cv.csv")
    X_cv= extract_features(X_cv)
    X_cv.to_csv("drive/My Drive/Quora_assignment/nlp_features_cv.csv", index=False)
X_cv.head(2)
Extracting features for cv:
token features...
fuzzy features..
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2 cwc_min cwc_max csc_min csc_max ctc_min ctc_max last_word_eq first_word_eq abs_len_diff mean_len token_set_ratio token_sort_ratio fuzz_ratio fuzz_partial_ratio longest_substr_ratio
60746 60746 106185 106186 what are some of the innovative startups in in... what are some of the new innovative startups i... 0 1 1 50 54 9 10 9.0 19.0 0.473684 2 0 0.999967 0.749981 0.999983 0.999983 0.999989 0.899991 1.0 1.0 1.0 9.5 100 96 96 92 0.588235
36802 36802 67067 67068 what are the tips and hacks for getting the cl... what are the tips and hacks for getting the cl... 0 1 1 97 94 19 19 17.0 36.0 0.472222 2 0 0.874989 0.874989 0.999990 0.999990 0.894732 0.894732 1.0 1.0 0.0 19.0 97 95 95 94 0.873684
In [0]:
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_test.csv'):
    X_test= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_test.csv",encoding='latin-1')
    X_test = X_test.fillna('')
else:
    print("Extracting features for test:")
#     df = pd.read_csv("drive/My Drive/Quora_assignment/test.csv")
    X_test= extract_features(X_test)
    X_test.to_csv("drive/My Drive/Quora_assignment/nlp_features_test.csv", index=False)
X_test.head(2)
Extracting features for test:
token features...
fuzzy features..
Out[0]:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2 cwc_min cwc_max csc_min csc_max ctc_min ctc_max last_word_eq first_word_eq abs_len_diff mean_len token_set_ratio token_sort_ratio fuzz_ratio fuzz_partial_ratio longest_substr_ratio
13403 13403 25742 20236 smartphones what is the best phone camera at ... which phone has the best camera 1 1 2 57 32 10 6 3.0 15.0 0.200000 3 1 0.999967 0.599988 0.333322 0.249994 0.666656 0.399996 0 0 4 8.0 81 60 57 66 0.333333
21490 21490 40457 16040 is the agricultural sector a failing one in in... what are the problems in the agricultural sect... 1 1 1 50 58 9 10 5.0 18.0 0.277778 2 0 0.749981 0.599988 0.499988 0.399992 0.555549 0.499995 1 0 1 9.5 79 68 62 79 0.490196

3.5.1 Analysis of extracted features

3.5.1.1 Plotting Word clouds

  • Creating word clouds for duplicate and non-duplicate question pairs
  • We can observe the most frequently occurring words in each class
In [0]:
X_train_duplicate = X_train[X_train['is_duplicate'] == 1]
X_train_nonduplicate = X_train[X_train['is_duplicate'] == 0]

# Stack question1 and question2 into a 2-D array, then flatten it: [[1,2],[3,4]] -> [1,2,3,4]
p = np.dstack([X_train_duplicate["question1"], X_train_duplicate["question2"]]).flatten()
n = np.dstack([X_train_nonduplicate["question1"], X_train_nonduplicate["question2"]]).flatten()

print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))

#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/train_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/train_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 33446
Number of data points in class 0 (non duplicate pairs) : 56334
In [0]:
X_cv_duplicate = X_cv[X_cv['is_duplicate'] == 1]
X_cv_nonduplicate = X_cv[X_cv['is_duplicate'] == 0]

# Stack question1 and question2 into a 2-D array, then flatten it: [[1,2],[3,4]] -> [1,2,3,4]
p = np.dstack([X_cv_duplicate["question1"], X_cv_duplicate["question2"]]).flatten()
n = np.dstack([X_cv_nonduplicate["question1"], X_cv_nonduplicate["question2"]]).flatten()

print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))

#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/cv_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/cv_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 16474
Number of data points in class 0 (non duplicate pairs) : 27746
In [0]:
X_test_duplicate = X_test[X_test['is_duplicate'] == 1]
X_test_nonduplicate = X_test[X_test['is_duplicate'] == 0]

# Stack question1 and question2 into a 2-D array, then flatten it: [[1,2],[3,4]] -> [1,2,3,4]
p = np.dstack([X_test_duplicate["question1"], X_test_duplicate["question2"]]).flatten()
n = np.dstack([X_test_nonduplicate["question1"], X_test_nonduplicate["question2"]]).flatten()

print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))

#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/test_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/test_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 24588
Number of data points in class 0 (non duplicate pairs) : 41412
In [0]:
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'

textp_w = open(path.join(d, 'train_p.txt')).read()
textn_w = open(path.join(d, 'train_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")

stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w))
print ("Total number of words in non duplicate pair questions :",len(textn_w))
Total number of words in duplicate pair questions : 1804520
Total number of words in non duplicate pair questions : 3663966
In [0]:
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'

textp_w_cv = open(path.join(d, 'cv_p.txt')).read()
textn_w_cv = open(path.join(d, 'cv_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")

stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w_cv))
print ("Total number of words in non duplicate pair questions :",len(textn_w_cv))
Total number of words in duplicate pair questions : 884086
Total number of words in non duplicate pair questions : 1804102
In [0]:
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'

textp_w_test = open(path.join(d, 'test_p.txt')).read()
textn_w_test = open(path.join(d, 'test_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")

stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w_test))
print ("Total number of words in non duplicate pair questions :",len(textn_w_test))
Total number of words in duplicate pair questions : 1333031
Total number of words in non duplicate pair questions : 2681939

WordCloud For Train Set

Word Clouds generated from duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textp_w), stopwords=stopwords)
wc.generate(textp_w)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs

Word Clouds generated from non-duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textn_w),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs:

WordCloud For CV Set

Word Clouds generated from duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textp_w_cv), stopwords=stopwords)
wc.generate(textp_w_cv)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs

Word Clouds generated from non-duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textn_w_cv),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w_cv)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs:

WordCloud For Test Set

Word Clouds generated from duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textp_w_test), stopwords=stopwords)
wc.generate(textp_w_test)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs

Word Clouds generated from non-duplicate-pair questions' text

In [0]:
wc = WordCloud(background_color="white", max_words=len(textn_w_test),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w_test)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs: