Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
__ Problem Statement __
____ Useful Links ____
- Data is provided in a single file: Train.csv
- Train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
- Size of Train.csv - 60MB
- Number of rows in Train.csv = 404,290
"id","qid1","qid2","question1","question2","is_duplicate" "0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0" "1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0" "7","15","16","How can I be a good geologist?","What should I do to be a great geologist?","1" "11","23","24","How do I read and find my YouTube comments?","How can I see all my Youtube comments?","1"
This is a binary classification problem: for a given pair of questions, we need to predict whether or not they are duplicates.
Source: https://www.kaggle.com/c/quora-question-pairs#evaluation
Metric(s): log loss (the evaluation metric used by the Kaggle competition linked above).
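Since log loss is the target metric, here is a minimal sketch of how it can be computed with scikit-learn; the label and probability arrays below are hypothetical placeholders, not values from this dataset.

from sklearn.metrics import log_loss
import numpy as np

# hypothetical true labels and predicted probabilities of class 1 (duplicate)
y_true_example = np.array([0, 1, 1, 0])
y_prob_example = np.array([0.1, 0.8, 0.6, 0.3])

# log loss heavily penalizes confident but wrong predictions
print(log_loss(y_true_example, y_prob_example))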
!pip install distance
Collecting distance Downloading https://files.pythonhosted.org/packages/5c/1a/883e47df323437aefa0d0a92ccfb38895d9416bd0b56262c2e46a47767b8/Distance-0.1.3.tar.gz (180kB) |████████████████████████████████| 184kB 2.9MB/s Building wheels for collected packages: distance Building wheel for distance (setup.py) ... done Stored in directory: /root/.cache/pip/wheels/d5/aa/e1/dbba9e7b6d397d645d0f12db1c66dbae9c5442b39b001db18e Successfully built distance Installing collected packages: distance Successfully installed distance-0.1.3
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc
import re
from nltk.corpus import stopwords
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
!ls
drive sample_data
df = pd.read_csv("drive/My Drive/Quora/train.csv",nrows = 100000)
print("Number of data points:",df.shape[0])
Number of data points: 100000
df.head()
id | qid1 | qid2 | question1 | question2 | is_duplicate | |
---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 6 columns): id 100000 non-null int64 qid1 100000 non-null int64 qid2 100000 non-null int64 question1 100000 non-null object question2 100000 non-null object is_duplicate 100000 non-null int64 dtypes: int64(4), object(2) memory usage: 4.6+ MB
We are given a minimal number of data fields here, consisting of:
- id: a unique identifier for the question pair
- qid1, qid2: unique identifiers for each question
- question1, question2: the full text of each question
- is_duplicate: the target label, 1 if the two questions have the same intent and 0 otherwise
df.groupby("is_duplicate")['id'].count().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fe317c43978>
print('~> Total number of question pairs for training:\n {}'.format(len(df)))
~> Total number of question pairs for training: 100000
print('~> Question pairs are not Similar (is_duplicate = 0):\n {}%'.format(100 - round(df['is_duplicate'].mean()*100, 2)))
print('\n~> Question pairs are Similar (is_duplicate = 1):\n {}%'.format(round(df['is_duplicate'].mean()*100, 2)))
~> Question pairs are not Similar (is_duplicate = 0): 62.75% ~> Question pairs are Similar (is_duplicate = 1): 37.25%
qids = pd.Series(df['qid1'].tolist() + df['qid2'].tolist())
unique_qs = len(np.unique(qids))
qs_morethan_onetime = np.sum(qids.value_counts() > 1)
print ('Total number of Unique Questions are: {}\n'.format(unique_qs))
#print len(np.unique(qids))
print ('Number of unique questions that appear more than one time: {} ({}%)\n'.format(qs_morethan_onetime,qs_morethan_onetime/unique_qs*100))
print ('Max number of times a single question is repeated: {}\n'.format(max(qids.value_counts())))
q_vals=qids.value_counts()
q_vals=q_vals.values
Total number of Unique Questions are: 165931 Number of unique questions that appear more than one time: 19446 (11.719329118730075%) Max number of times a single question is repeated: 32
x = ["unique_questions" , "Repeated Questions"]
y = [unique_qs , qs_morethan_onetime]
plt.figure(figsize=(10, 6))
plt.title ("Plot representing unique and repeated questions ")
sns.barplot(x,y)
plt.show()
#checking whether there are any repeated pair of questions
pair_duplicates = df[['qid1','qid2','is_duplicate']].groupby(['qid1','qid2']).count().reset_index()
print ("Number of duplicate questions",(pair_duplicates).shape[0] - df.shape[0])
Number of duplicate questions 0
plt.figure(figsize=(20, 10))
plt.hist(qids.value_counts(), bins=10)
plt.yscale('log', nonposy='clip')
plt.title('Log-Histogram of question appearance counts')
plt.xlabel('Number of occurrences of question')
plt.ylabel('Number of questions')
print ('Maximum number of times a single question is repeated: {}\n'.format(max(qids.value_counts())))
Maximum number of times a single question is repeated: 32
#Checking whether there are any rows with null values
nan_rows = df[df.isnull().any(1)]
print (nan_rows)
Empty DataFrame Columns: [id, qid1, qid2, question1, question2, is_duplicate] Index: []
# # Filling the null values with ' '
# df = df.fillna('')
# nan_rows = df[df.isnull().any(1)]
# print (nan_rows)
y_true = df['is_duplicate'].values
# df.drop(['is_duplicate'],axis=1,inplace = True)
df.shape
(100000, 6)
y_true.shape
(100000,)
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y_true, test_size=0.33, stratify=y_true)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.33, stratify=y_train)
print("Number of data points in train data :",X_train.shape)
print("Number of data points in cross-val data :",X_cv.shape)
print("Number of data points in test data :",X_test.shape)
Number of data points in train data : (44890, 5) Number of data points in cross-val data : (22110, 5) Number of data points in test data : (33000, 5)
Let us now construct a few basic features for each question pair:
- freq_qid1, freq_qid2: number of times qid1 / qid2 appears in the split
- q1len, q2len: character length of question1 / question2
- q1_n_words, q2_n_words: number of words in question1 / question2
- word_Common: count of unique lowercased words shared by the two questions
- word_Total: sum of the unique-word counts of the two questions
- word_share: word_Common / word_Total
- freq_q1+q2, freq_q1-q2: sum and absolute difference of the two qid frequencies
if os.path.isfile('drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing_train.csv'):
X_train = pd.read_csv("drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
X_train['freq_qid1'] = X_train.groupby('qid1')['qid1'].transform('count')
X_train['freq_qid2'] = X_train.groupby('qid2')['qid2'].transform('count')
X_train['q1len'] = X_train['question1'].str.len()
X_train['q2len'] = X_train['question2'].str.len()
X_train['q1_n_words'] = X_train['question1'].apply(lambda row: len(row.split(" ")))
X_train['q2_n_words'] = X_train['question2'].apply(lambda row: len(row.split(" ")))
def normalized_word_Common(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)
X_train['word_Common'] = X_train.apply(normalized_word_Common, axis=1)
def normalized_word_Total(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * (len(w1) + len(w2))
X_train['word_Total'] = X_train.apply(normalized_word_Total, axis=1)
def normalized_word_share(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
X_train['word_share'] = X_train.apply(normalized_word_share, axis=1)
X_train['freq_q1+q2'] = X_train['freq_qid1']+X_train['freq_qid2']
X_train['freq_q1-q2'] = abs(X_train['freq_qid1']-X_train['freq_qid2'])
X_train.to_csv("drive/My Drive/Quora_assignment/X_train_fe_without_preprocessing_train.csv", index=False)
X_train.head()
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38970 | 38970 | 70704 | 70705 | Can you suggest me a good name that is related... | What are the precautions to be taken for const... | 0 | 1 | 1 | 73 | 69 | 15 | 12 | 0.0 | 27.0 | 0.000000 | 2 | 0 |
99681 | 99681 | 165453 | 165454 | How do I know if I am a lesbian trapped in a m... | I am 15 and just came out to my dad as a lesbi... | 0 | 1 | 1 | 83 | 118 | 19 | 30 | 5.0 | 39.0 | 0.128205 | 2 | 0 |
45433 | 45433 | 81424 | 81425 | I'm not afraid of my future. What can I do? | I'm terribly afraid of my future. What should ... | 0 | 1 | 1 | 43 | 51 | 10 | 10 | 8.0 | 20.0 | 0.400000 | 2 | 0 |
91049 | 91049 | 138056 | 152668 | How much is one million and one billion in lak... | How do I spend one million dollar? | 0 | 1 | 1 | 60 | 34 | 12 | 7 | 3.0 | 17.0 | 0.176471 | 2 | 0 |
82061 | 82061 | 139239 | 139240 | Is there an advanced search syntax for Amazon'... | How was Amazon search in its initial stages? | 0 | 1 | 1 | 53 | 44 | 9 | 8 | 1.0 | 17.0 | 0.058824 | 2 | 0 |
if os.path.isfile('drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv'):
X_cv = pd.read_csv("drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv",encoding='latin-1')
else:
X_cv['freq_qid1'] = X_cv.groupby('qid1')['qid1'].transform('count')
X_cv['freq_qid2'] = X_cv.groupby('qid2')['qid2'].transform('count')
X_cv['q1len'] = X_cv['question1'].str.len()
X_cv['q2len'] = X_cv['question2'].str.len()
X_cv['q1_n_words'] = X_cv['question1'].apply(lambda row: len(row.split(" ")))
X_cv['q2_n_words'] = X_cv['question2'].apply(lambda row: len(row.split(" ")))
def normalized_word_Common(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)
X_cv['word_Common'] = X_cv.apply(normalized_word_Common, axis=1)
def normalized_word_Total(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * (len(w1) + len(w2))
X_cv['word_Total'] = X_cv.apply(normalized_word_Total, axis=1)
def normalized_word_share(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
X_cv['word_share'] = X_cv.apply(normalized_word_share, axis=1)
X_cv['freq_q1+q2'] = X_cv['freq_qid1']+X_cv['freq_qid2']
X_cv['freq_q1-q2'] = abs(X_cv['freq_qid1']-X_cv['freq_qid2'])
X_cv.to_csv("drive/My Drive/Quora_assignment/X_cv_fe_without_preprocessing.csv", index=False)
X_cv.head()
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60746 | 60746 | 106185 | 106186 | What are some of the innovative startups in In... | What are some of the new innovative startups i... | 0 | 1 | 1 | 50 | 54 | 9 | 10 | 9.0 | 19.0 | 0.473684 | 2 | 0 |
36802 | 36802 | 67067 | 67068 | What are the tips and hacks for getting the cl... | What are the tips and hacks for getting the cl... | 0 | 1 | 1 | 97 | 94 | 19 | 19 | 17.0 | 36.0 | 0.472222 | 2 | 0 |
78720 | 78720 | 134178 | 134179 | Is it grammatically correct to put a comma aft... | What is relation between kp and kc? | 0 | 1 | 1 | 71 | 35 | 11 | 7 | 1.0 | 17.0 | 0.058824 | 2 | 0 |
83051 | 83051 | 140697 | 140698 | What is intermittent fasting? | What was your intermittent fasting experience? | 0 | 1 | 1 | 29 | 46 | 4 | 6 | 2.0 | 10.0 | 0.200000 | 2 | 0 |
60336 | 60336 | 105528 | 105529 | Why do black people have white palms? | Why do so many Asian people say "whites" or "b... | 0 | 1 | 1 | 37 | 84 | 7 | 15 | 3.0 | 21.0 | 0.142857 | 2 | 0 |
if os.path.isfile('drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing_train.csv'):
X_test = pd.read_csv("drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
X_test['freq_qid1'] = X_test.groupby('qid1')['qid1'].transform('count')
X_test['freq_qid2'] = X_test.groupby('qid2')['qid2'].transform('count')
X_test['q1len'] = X_test['question1'].str.len()
X_test['q2len'] = X_test['question2'].str.len()
X_test['q1_n_words'] = X_test['question1'].apply(lambda row: len(row.split(" ")))
X_test['q2_n_words'] = X_test['question2'].apply(lambda row: len(row.split(" ")))
def normalized_word_Common(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)
X_test['word_Common'] = X_test.apply(normalized_word_Common, axis=1)
def normalized_word_Total(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * (len(w1) + len(w2))
X_test['word_Total'] = X_test.apply(normalized_word_Total, axis=1)
def normalized_word_share(row):
w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
X_test['word_share'] = X_test.apply(normalized_word_share, axis=1)
X_test['freq_q1+q2'] = X_test['freq_qid1']+X_test['freq_qid2']
X_test['freq_q1-q2'] = abs(X_test['freq_qid1']-X_test['freq_qid2'])
X_test.to_csv("drive/My Drive/Quora_assignment/X_test_fe_without_preprocessing_train.csv", index=False)
X_test.head()
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13403 | 13403 | 25742 | 20236 | Smartphones: What is the best phone camera at ... | Which phone has the best camera? | 1 | 1 | 2 | 57 | 32 | 10 | 6 | 3.0 | 15.0 | 0.200000 | 3 | 1 |
21490 | 21490 | 40457 | 16040 | Is the agricultural sector a failing one in In... | What are the problems in the agricultural sect... | 1 | 1 | 1 | 50 | 58 | 9 | 10 | 5.0 | 18.0 | 0.277778 | 2 | 0 |
34199 | 34199 | 62703 | 62704 | When bacteria die do they also decay, will the... | Do bacteria reason in a very simplified way or... | 0 | 1 | 1 | 80 | 108 | 14 | 19 | 4.0 | 31.0 | 0.129032 | 2 | 0 |
34671 | 34671 | 63487 | 63488 | How can you identify a phishing attack and how... | How can I identify phishing Emails? | 1 | 1 | 1 | 81 | 35 | 15 | 6 | 4.0 | 16.0 | 0.250000 | 2 | 0 |
75567 | 75567 | 19868 | 129309 | Can using birth control cause complications in... | Can birth control pills cause me to become per... | 1 | 1 | 1 | 66 | 65 | 9 | 10 | 4.0 | 19.0 | 0.210526 | 2 | 0 |
print ("Minimum length of the questions in question1 : " , min(X_train['q1_n_words']))
print ("Minimum length of the questions in question2 : " , min(X_train['q2_n_words']))
print ("Number of Questions with minimum length [question1] :", X_train[X_train['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_train[X_train['q2_n_words']== 1].shape[0])
Minimum length of the questions in question1 : 1 Minimum length of the questions in question2 : 1 Number of Questions with minimum length [question1] : 6 Number of Questions with minimum length [question2] : 3
print ("Minimum length of the questions in question1 : " , min(X_cv['q1_n_words']))
print ("Minimum length of the questions in question2 : " , min(X_cv['q2_n_words']))
print ("Number of Questions with minimum length [question1] :", X_cv[X_cv['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_cv[X_cv['q2_n_words']== 2].shape[0])
Minimum length of the questions in question1 : 1 Minimum length of the questions in question2 : 2 Number of Questions with minimum length [question1] : 6 Number of Questions with minimum length [question2] : 4
print ("Minimum length of the questions in question1 : " , min(X_test['q1_n_words']))
print ("Minimum length of the questions in question2 : " , min(X_test['q2_n_words']))
print ("Number of Questions with minimum length [question1] :", X_test[X_test['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", X_test[X_test['q2_n_words']== 1].shape[0])
Minimum length of the questions in question1 : 1 Minimum length of the questions in question2 : 1 Number of Questions with minimum length [question1] : 3 Number of Questions with minimum length [question2] : 2
plt.figure(figsize=(12, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_share', data = X_train)
plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['word_share'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['word_share'][0:] , label = "0" , color = 'blue' )
plt.show()
plt.figure(figsize=(12, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_Common', data = X_train)
plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['word_Common'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['word_Common'][0:] , label = "0" , color = 'blue' )
plt.show()
!pip install fuzzywuzzy
Collecting fuzzywuzzy Downloading https://files.pythonhosted.org/packages/d8/f1/5a267addb30ab7eaa1beab2b9323073815da4551076554ecc890a3595ec9/fuzzywuzzy-0.17.0-py2.py3-none-any.whl Installing collected packages: fuzzywuzzy Successfully installed fuzzywuzzy-0.17.0
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc
import re
from nltk.corpus import stopwords
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
# This package is used for finding the longest common substrings between two strings
# you can write your own DP code for this (a small sketch follows after the imports)
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from sklearn.manifold import TSNE
# Import the Required lib packages for WORD-Cloud generation
# https://stackoverflow.com/questions/45625434/how-to-install-wordcloud-in-python3-6
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image
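The distance package imported above is used only for longest common substrings (see get_longest_substr_ratio further down); the same quantity can be computed with a small dynamic-programming table. A minimal sketch, assuming plain Python strings as input:

def longest_common_substring_len(a, b):
    # dp over characters: prev[j] / curr[j] hold the length of the common
    # substring ending at a[i-1] and b[j-1]
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# e.g. longest_common_substring_len("geologist", "great geologist") returns 9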
# #https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c
# if os.path.isfile('drive/My Drive/Quora_assignment/df_fe_without_preprocessing_train.csv'):
# df = pd.read_csv("drive/My Drive/Quora_assignment/df_fe_without_preprocessing_train.csv",encoding='latin-1')
# df = df.fillna('')
# df.head()
# else:
# print("get df_fe_without_preprocessing_train.csv from drive or run the previous notebook")
# df.head(2)
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 | 1 | 1 | 66 | 57 | 14 | 12 | 10.0 | 23.0 | 0.434783 | 2 | 0 |
1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 | 1 | 1 | 51 | 88 | 8 | 13 | 4.0 | 20.0 | 0.200000 | 2 | 0 |
import nltk
nltk.download('stopwords')
# SAFE_DIV is a small constant added to denominators below to avoid division by zero
SAFE_DIV = 0.0001
STOP_WORDS = stopwords.words("english")
def preprocess(x):
x = str(x).lower()
x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
.replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
.replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
.replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
.replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
.replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
.replace("€", " euro ").replace("'ll", " will")
x = re.sub(r"([0-9]+)000000", r"\1m", x)
x = re.sub(r"([0-9]+)000", r"\1k", x)
porter = PorterStemmer()
pattern = re.compile('\W')
if type(x) == type(''):
x = re.sub(pattern, ' ', x)
if type(x) == type(''):
x = porter.stem(x)
example1 = BeautifulSoup(x)
x = example1.get_text()
return x
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
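A quick, hypothetical sanity check of the cleaning function defined above (the exact output depends on the replacement rules and on how the stemmer handles the full string):

sample = "Can't I invest $5,000 in the share market?"
print(preprocess(sample))
# lowercased, contractions expanded ("can't" -> "can not"),
# "$" -> " dollar ", ",000" -> "k", and punctuation replaced by spaces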
Definition:
The token features below are smoothed ratios (SAFE_DIV is added to each denominator) computed from the two tokenized questions, together with fuzzy string-matching scores.
Features:
- cwc_min, cwc_max: number of common non-stopwords divided by the smaller / larger non-stopword count of the two questions
- csc_min, csc_max: number of common stopwords divided by the smaller / larger stopword count
- ctc_min, ctc_max: number of common tokens divided by the smaller / larger token count
- last_word_eq, first_word_eq: 1 if the last / first tokens of the two questions match, else 0
- abs_len_diff: absolute difference in token counts
- mean_len: mean token count of the two questions
- fuzz_ratio, fuzz_partial_ratio, token_sort_ratio, token_set_ratio: fuzzy string-matching scores, explained in http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
- longest_substr_ratio: length of the longest common substring divided by the smaller question length
def get_token_features(q1, q2):
token_features = [0.0]*10
# Converting the Sentence into Tokens:
q1_tokens = q1.split()
q2_tokens = q2.split()
if len(q1_tokens) == 0 or len(q2_tokens) == 0:
return token_features
# Get the non-stopwords in Questions
q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
#Get the stopwords in Questions
q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
# Get the common non-stopwords from Question pair
common_word_count = len(q1_words.intersection(q2_words))
# Get the common stopwords from Question pair
common_stop_count = len(q1_stops.intersection(q2_stops))
# Get the common Tokens from Question pair
common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
# Last word of both question is same or not
token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
# First word of both question is same or not
token_features[7] = int(q1_tokens[0] == q2_tokens[0])
token_features[8] = abs(len(q1_tokens) - len(q2_tokens))
#Average Token Length of both Questions
token_features[9] = (len(q1_tokens) + len(q2_tokens))/2
return token_features
# get the Longest Common sub string
def get_longest_substr_ratio(a, b):
strs = list(distance.lcsubstrings(a, b))
if len(strs) == 0:
return 0
else:
return len(strs[0]) / (min(len(a), len(b)) + 1)
def extract_features(df):
# preprocessing each question
df["question1"] = df["question1"].fillna("").apply(preprocess)
df["question2"] = df["question2"].fillna("").apply(preprocess)
print("token features...")
# Merging Features with dataset
token_features = df.apply(lambda x: get_token_features(x["question1"], x["question2"]), axis=1)
df["cwc_min"] = list(map(lambda x: x[0], token_features))
df["cwc_max"] = list(map(lambda x: x[1], token_features))
df["csc_min"] = list(map(lambda x: x[2], token_features))
df["csc_max"] = list(map(lambda x: x[3], token_features))
df["ctc_min"] = list(map(lambda x: x[4], token_features))
df["ctc_max"] = list(map(lambda x: x[5], token_features))
df["last_word_eq"] = list(map(lambda x: x[6], token_features))
df["first_word_eq"] = list(map(lambda x: x[7], token_features))
df["abs_len_diff"] = list(map(lambda x: x[8], token_features))
df["mean_len"] = list(map(lambda x: x[9], token_features))
#Computing Fuzzy Features and Merging with Dataset
# do read this blog: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
# https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings
# https://github.com/seatgeek/fuzzywuzzy
print("fuzzy features..")
df["token_set_ratio"] = df.apply(lambda x: fuzz.token_set_ratio(x["question1"], x["question2"]), axis=1)
# The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and
# then joining them back into a string. We then compare the transformed strings with a simple ratio() (see the sketch after this cell).
df["token_sort_ratio"] = df.apply(lambda x: fuzz.token_sort_ratio(x["question1"], x["question2"]), axis=1)
df["fuzz_ratio"] = df.apply(lambda x: fuzz.QRatio(x["question1"], x["question2"]), axis=1)
df["fuzz_partial_ratio"] = df.apply(lambda x: fuzz.partial_ratio(x["question1"], x["question2"]), axis=1)
df["longest_substr_ratio"] = df.apply(lambda x: get_longest_substr_ratio(x["question1"], x["question2"]), axis=1)
return df
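To make the token-sort comment inside extract_features concrete, here is a rough re-implementation using difflib. This is only a sketch of the mechanism; fuzzywuzzy's token_sort_ratio applies its own string normalization (and python-Levenshtein when available), so the scores will not always match exactly.

from difflib import SequenceMatcher

def token_sort_ratio_sketch(a, b):
    # sort the tokens alphabetically, join them back, then compare with a simple ratio
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return int(round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio()))

# token_sort_ratio_sketch("best phone camera", "camera phone best") -> 100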
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_train.csv'):
X_train= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_train.csv",encoding='latin-1')
X_train.fillna('')
else:
print("Extracting features for train:")
# df = pd.read_csv("drive/My Drive/Quora_assignment/train.csv")
X_train= extract_features(X_train)
X_train.to_csv("drive/My Drive/Quora_assignment/nlp_features_train.csv", index=False)
X_train.head(2)
Extracting features for train: token features... fuzzy features..
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | cwc_min | cwc_max | csc_min | csc_max | ctc_min | ctc_max | last_word_eq | first_word_eq | abs_len_diff | mean_len | token_set_ratio | token_sort_ratio | fuzz_ratio | fuzz_partial_ratio | longest_substr_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38970 | 38970 | 70704 | 70705 | can you suggest me a good name that is related... | what are the precautions to be taken for const... | 0 | 1 | 1 | 73 | 69 | 15 | 12 | 0.0 | 27.0 | 0.000000 | 2 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 4.0 | 14.0 | 36 | 36 | 36 | 43 | 0.057143 |
99681 | 99681 | 165453 | 165454 | how do i know if i am a lesbian trapped in a m... | i am 15 and just came out to my dad as a lesbi... | 0 | 1 | 1 | 83 | 118 | 19 | 30 | 5.0 | 39.0 | 0.128205 | 2 | 0 | 0.142855 | 0.142855 | 0.454541 | 0.35714 | 0.299999 | 0.199999 | 0.0 | 0.0 | 10.0 | 25.0 | 43 | 49 | 35 | 42 | 0.127907 |
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_cv.csv'):
X_cv= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_cv.csv",encoding='latin-1')
X_cv.fillna('')
else:
print("Extracting features for cv:")
# df = pd.read_csv("drive/My Drive/Quora_assignment/cv.csv")
X_cv= extract_features(X_cv)
X_cv.to_csv("drive/My Drive/Quora_assignment/nlp_features_cv.csv", index=False)
X_cv.head(2)
Extracting features for cv: token features... fuzzy features..
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | cwc_min | cwc_max | csc_min | csc_max | ctc_min | ctc_max | last_word_eq | first_word_eq | abs_len_diff | mean_len | token_set_ratio | token_sort_ratio | fuzz_ratio | fuzz_partial_ratio | longest_substr_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60746 | 60746 | 106185 | 106186 | what are some of the innovative startups in in... | what are some of the new innovative startups i... | 0 | 1 | 1 | 50 | 54 | 9 | 10 | 9.0 | 19.0 | 0.473684 | 2 | 0 | 0.999967 | 0.749981 | 0.999983 | 0.999983 | 0.999989 | 0.899991 | 1.0 | 1.0 | 1.0 | 9.5 | 100 | 96 | 96 | 92 | 0.588235 |
36802 | 36802 | 67067 | 67068 | what are the tips and hacks for getting the cl... | what are the tips and hacks for getting the cl... | 0 | 1 | 1 | 97 | 94 | 19 | 19 | 17.0 | 36.0 | 0.472222 | 2 | 0 | 0.874989 | 0.874989 | 0.999990 | 0.999990 | 0.894732 | 0.894732 | 1.0 | 1.0 | 0.0 | 19.0 | 97 | 95 | 95 | 94 | 0.873684 |
if os.path.isfile('drive/My Drive/Quora_assignment/nlp_features_test.csv'):
X_test= pd.read_csv("drive/My Drive/Quora_assignment/nlp_features_test.csv",encoding='latin-1')
X_test.fillna('')
else:
print("Extracting features for test:")
# df = pd.read_csv("drive/My Drive/Quora_assignment/test.csv")
X_test= extract_features(X_test)
X_test.to_csv("drive/My Drive/Quora_assignment/nlp_features_test.csv", index=False)
X_test.head(2)
Extracting features for test: token features... fuzzy features..
id | qid1 | qid2 | question1 | question2 | is_duplicate | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | cwc_min | cwc_max | csc_min | csc_max | ctc_min | ctc_max | last_word_eq | first_word_eq | abs_len_diff | mean_len | token_set_ratio | token_sort_ratio | fuzz_ratio | fuzz_partial_ratio | longest_substr_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13403 | 13403 | 25742 | 20236 | smartphones what is the best phone camera at ... | which phone has the best camera | 1 | 1 | 2 | 57 | 32 | 10 | 6 | 3.0 | 15.0 | 0.200000 | 3 | 1 | 0.999967 | 0.599988 | 0.333322 | 0.249994 | 0.666656 | 0.399996 | 0 | 0 | 4 | 8.0 | 81 | 60 | 57 | 66 | 0.333333 |
21490 | 21490 | 40457 | 16040 | is the agricultural sector a failing one in in... | what are the problems in the agricultural sect... | 1 | 1 | 1 | 50 | 58 | 9 | 10 | 5.0 | 18.0 | 0.277778 | 2 | 0 | 0.749981 | 0.599988 | 0.499988 | 0.399992 | 0.555549 | 0.499995 | 1 | 0 | 1 | 9.5 | 79 | 68 | 62 | 79 | 0.490196 |
X_train_duplicate = X_train[X_train['is_duplicate'] == 1]
X_train_nonduplicate = X_train[X_train['is_duplicate'] == 0]
# Stack the question1 and question2 columns and flatten so the pairs interleave, e.g. [[1,2],[3,4]] -> [1,2,3,4] (see the toy example after this cell)
p = np.dstack([X_train_duplicate["question1"], X_train_duplicate["question2"]]).flatten()
n = np.dstack([X_train_nonduplicate["question1"], X_train_nonduplicate["question2"]]).flatten()
print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))
#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/train_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/train_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 33446 Number of data points in class 0 (non duplicate pairs) : 56334
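A tiny illustration of the dstack-and-flatten trick used above, with hypothetical toy arrays, showing how the question1/question2 pairs end up interleaved in the flattened output:

a = np.array(['q1_a', 'q1_b'])   # stands in for the question1 column
b = np.array(['q2_a', 'q2_b'])   # stands in for the question2 column
print(np.dstack([a, b]).flatten())   # ['q1_a' 'q2_a' 'q1_b' 'q2_b']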
X_cv_duplicate = X_cv[X_cv['is_duplicate'] == 1]
X_cv_nonduplicate = X_cv[X_cv['is_duplicate'] == 0]
# Stack the question1 and question2 columns and flatten so the pairs interleave, e.g. [[1,2],[3,4]] -> [1,2,3,4]
p = np.dstack([X_cv_duplicate["question1"], X_cv_duplicate["question2"]]).flatten()
n = np.dstack([X_cv_nonduplicate["question1"], X_cv_nonduplicate["question2"]]).flatten()
print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))
#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/cv_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/cv_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 16474 Number of data points in class 0 (non duplicate pairs) : 27746
X_test_duplicate = X_test[X_test['is_duplicate'] == 1]
X_test_nonduplicate = X_test[X_test['is_duplicate'] == 0]
# Stack the question1 and question2 columns and flatten so the pairs interleave, e.g. [[1,2],[3,4]] -> [1,2,3,4]
p = np.dstack([X_test_duplicate["question1"], X_test_duplicate["question2"]]).flatten()
n = np.dstack([X_test_nonduplicate["question1"], X_test_nonduplicate["question2"]]).flatten()
print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))
#Saving the np array into a text file
np.savetxt('drive/My Drive/Quora_assignment/test_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('drive/My Drive/Quora_assignment/test_n.txt', n, delimiter=' ', fmt='%s')
Number of data points in class 1 (duplicate pairs) : 24588 Number of data points in class 0 (non duplicate pairs) : 41412
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'
textp_w = open(path.join(d, 'train_p.txt')).read()
textn_w = open(path.join(d, 'train_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")
stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w))
print ("Total number of words in non duplicate pair questions :",len(textn_w))
Total number of words in duplicate pair questions : 1804520 Total number of words in non duplicate pair questions : 3663966
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'
textp_w_cv = open(path.join(d, 'cv_p.txt')).read()
textn_w_cv = open(path.join(d, 'cv_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")
stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w_cv))
print ("Total number of words in non duplicate pair questions :",len(textn_w_cv))
Total number of words in duplicate pair questions : 884086 Total number of words in non duplicate pair questions : 1804102
# reading the text files and removing the Stop Words:
d = 'drive/My Drive/Quora_assignment/'
textp_w_test = open(path.join(d, 'test_p.txt')).read()
textn_w_test = open(path.join(d, 'test_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")
stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
print ("Total number of words in duplicate pair questions :",len(textp_w_test))
print ("Total number of words in non duplicate pair questions :",len(textn_w_test))
Total number of words in duplicate pair questions : 1333031 Total number of words in non duplicate pair questions : 2681939
__ Word Clouds generated from duplicate pair question's text __
wc = WordCloud(background_color="white", max_words=len(textp_w), stopwords=stopwords)
wc.generate(textp_w)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs
__ Word Clouds generated from non duplicate pair question's text __
wc = WordCloud(background_color="white", max_words=len(textn_w),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs:
wc = WordCloud(background_color="white", max_words=len(textp_w_cv), stopwords=stopwords)
wc.generate(textp_w_cv)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs
__ Word Clouds generated from non duplicate pair question's text __
wc = WordCloud(background_color="white", max_words=len(textn_w_cv),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w_cv)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs:
wc = WordCloud(background_color="white", max_words=len(textp_w_test), stopwords=stopwords)
wc.generate(textp_w_test)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for Duplicate Question pairs
__ Word Clouds generated from non duplicate pair question's text __
wc = WordCloud(background_color="white", max_words=len(textn_w_test),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w_test)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud for non-Duplicate Question pairs:
n = X_train.shape[0]
sns.pairplot(X_train[['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio', 'is_duplicate']][0:n], hue='is_duplicate', vars=['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio'])
plt.show()
# Distribution of the token_sort_ratio
plt.figure(figsize=(10, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'token_sort_ratio', data = X_train[0:] , )
plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['token_sort_ratio'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['token_sort_ratio'][0:] , label = "0" , color = 'blue' )
plt.show()
plt.figure(figsize=(10, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'fuzz_ratio', data = X_train[0:] , )
plt.subplot(1,2,2)
sns.distplot(X_train[X_train['is_duplicate'] == 1.0]['fuzz_ratio'][0:] , label = "1", color = 'red')
sns.distplot(X_train[X_train['is_duplicate'] == 0.0]['fuzz_ratio'][0:] , label = "0" , color = 'blue' )
plt.show()
# Using t-SNE for dimensionality reduction of the 15 engineered features (generated after cleaning the data) to 2 and 3 dimensions
from sklearn.preprocessing import MinMaxScaler
X_trainp_subsampled = X_train[0:5000]
X = MinMaxScaler().fit_transform(X_trainp_subsampled[['cwc_min', 'cwc_max', 'csc_min', 'csc_max' , 'ctc_min' , 'ctc_max' , 'last_word_eq', 'first_word_eq' , 'abs_len_diff' , 'mean_len' , 'token_set_ratio' , 'token_sort_ratio' , 'fuzz_ratio' , 'fuzz_partial_ratio' , 'longest_substr_ratio']])
y = X_trainp_subsampled['is_duplicate'].values
tsne2d = TSNE(
n_components=2,
init='random', # pca
random_state=101,
method='barnes_hut',
n_iter=1000,
verbose=2,
angle=0.5
).fit_transform(X)
[t-SNE] Computing 91 nearest neighbors... [t-SNE] Indexed 5000 samples in 0.025s... [t-SNE] Computed neighbors for 5000 samples in 0.438s... [t-SNE] Computed conditional probabilities for sample 1000 / 5000 [t-SNE] Computed conditional probabilities for sample 2000 / 5000 [t-SNE] Computed conditional probabilities for sample 3000 / 5000 [t-SNE] Computed conditional probabilities for sample 4000 / 5000 [t-SNE] Computed conditional probabilities for sample 5000 / 5000 [t-SNE] Mean sigma: 0.138864 [t-SNE] Computed conditional probabilities in 0.306s [t-SNE] Iteration 50: error = 80.9635162, gradient norm = 0.0418726 (50 iterations in 2.776s) [t-SNE] Iteration 100: error = 70.6095810, gradient norm = 0.0121856 (50 iterations in 1.885s) [t-SNE] Iteration 150: error = 68.8774948, gradient norm = 0.0057241 (50 iterations in 1.751s) [t-SNE] Iteration 200: error = 68.1385345, gradient norm = 0.0037310 (50 iterations in 1.779s) [t-SNE] Iteration 250: error = 67.7121658, gradient norm = 0.0037579 (50 iterations in 1.776s) [t-SNE] KL divergence after 250 iterations with early exaggeration: 67.712166 [t-SNE] Iteration 300: error = 1.7984626, gradient norm = 0.0012046 (50 iterations in 1.838s) [t-SNE] Iteration 350: error = 1.4062829, gradient norm = 0.0004809 (50 iterations in 1.900s) [t-SNE] Iteration 400: error = 1.2428937, gradient norm = 0.0002774 (50 iterations in 1.877s) [t-SNE] Iteration 450: error = 1.1549420, gradient norm = 0.0001883 (50 iterations in 1.876s) [t-SNE] Iteration 500: error = 1.1012262, gradient norm = 0.0001429 (50 iterations in 1.899s) [t-SNE] Iteration 550: error = 1.0672253, gradient norm = 0.0001188 (50 iterations in 1.880s) [t-SNE] Iteration 600: error = 1.0446147, gradient norm = 0.0001088 (50 iterations in 1.902s) [t-SNE] Iteration 650: error = 1.0294924, gradient norm = 0.0000967 (50 iterations in 1.901s) [t-SNE] Iteration 700: error = 1.0191708, gradient norm = 0.0000847 (50 iterations in 1.917s) [t-SNE] Iteration 750: error = 1.0112816, gradient norm = 0.0000823 (50 iterations in 1.943s) [t-SNE] Iteration 800: error = 1.0050118, gradient norm = 0.0000816 (50 iterations in 1.977s) [t-SNE] Iteration 850: error = 1.0003502, gradient norm = 0.0000795 (50 iterations in 1.912s) [t-SNE] Iteration 900: error = 0.9965533, gradient norm = 0.0000753 (50 iterations in 1.914s) [t-SNE] Iteration 950: error = 0.9932551, gradient norm = 0.0000722 (50 iterations in 1.901s) [t-SNE] Iteration 1000: error = 0.9902268, gradient norm = 0.0000671 (50 iterations in 1.884s) [t-SNE] KL divergence after 1000 iterations: 0.990227
df = pd.DataFrame({'x':tsne2d[:,0], 'y':tsne2d[:,1] ,'label':y})
# draw the plot in appropriate place in the grid
sns.lmplot(data=df, x='x', y='y', hue='label', fit_reg=False, size=8,palette="Set1",markers=['s','o'])
plt.title("perplexity : {} and max_iter : {}".format(30, 1000))
plt.show()
from sklearn.manifold import TSNE
tsne3d = TSNE(
n_components=3,
init='random', # pca
random_state=101,
method='barnes_hut',
n_iter=1000,
verbose=2,
angle=0.5
).fit_transform(X)
[t-SNE] Computing 91 nearest neighbors... [t-SNE] Indexed 5000 samples in 0.019s... [t-SNE] Computed neighbors for 5000 samples in 0.437s... [t-SNE] Computed conditional probabilities for sample 1000 / 5000 [t-SNE] Computed conditional probabilities for sample 2000 / 5000 [t-SNE] Computed conditional probabilities for sample 3000 / 5000 [t-SNE] Computed conditional probabilities for sample 4000 / 5000 [t-SNE] Computed conditional probabilities for sample 5000 / 5000 [t-SNE] Mean sigma: 0.138864 [t-SNE] Computed conditional probabilities in 0.292s [t-SNE] Iteration 50: error = 81.5270157, gradient norm = 0.0352745 (50 iterations in 11.530s) [t-SNE] Iteration 100: error = 69.6421280, gradient norm = 0.0034818 (50 iterations in 6.108s) [t-SNE] Iteration 150: error = 68.3586655, gradient norm = 0.0017799 (50 iterations in 5.151s) [t-SNE] Iteration 200: error = 67.7689056, gradient norm = 0.0011547 (50 iterations in 5.504s) [t-SNE] Iteration 250: error = 67.4351501, gradient norm = 0.0009449 (50 iterations in 5.696s) [t-SNE] KL divergence after 250 iterations with early exaggeration: 67.435150 [t-SNE] Iteration 300: error = 1.5505812, gradient norm = 0.0007420 (50 iterations in 7.915s) [t-SNE] Iteration 350: error = 1.2073249, gradient norm = 0.0002019 (50 iterations in 9.791s) [t-SNE] Iteration 400: error = 1.0615990, gradient norm = 0.0001004 (50 iterations in 9.455s) [t-SNE] Iteration 450: error = 0.9877149, gradient norm = 0.0000709 (50 iterations in 9.534s) [t-SNE] Iteration 500: error = 0.9476711, gradient norm = 0.0000618 (50 iterations in 9.568s) [t-SNE] Iteration 550: error = 0.9255852, gradient norm = 0.0000533 (50 iterations in 9.727s) [t-SNE] Iteration 600: error = 0.9111718, gradient norm = 0.0000374 (50 iterations in 9.724s) [t-SNE] Iteration 650: error = 0.9011445, gradient norm = 0.0000320 (50 iterations in 9.687s) [t-SNE] Iteration 700: error = 0.8930086, gradient norm = 0.0000285 (50 iterations in 9.660s) [t-SNE] Iteration 750: error = 0.8859218, gradient norm = 0.0000272 (50 iterations in 9.589s) [t-SNE] Iteration 800: error = 0.8800092, gradient norm = 0.0000234 (50 iterations in 9.455s) [t-SNE] Iteration 850: error = 0.8747241, gradient norm = 0.0000224 (50 iterations in 9.364s) [t-SNE] Iteration 900: error = 0.8700252, gradient norm = 0.0000250 (50 iterations in 9.427s) [t-SNE] Iteration 950: error = 0.8659607, gradient norm = 0.0000280 (50 iterations in 9.474s) [t-SNE] Iteration 1000: error = 0.8629290, gradient norm = 0.0000268 (50 iterations in 9.553s) [t-SNE] KL divergence after 1000 iterations: 0.862929
trace1 = go.Scatter3d(
x=tsne3d[:,0],
y=tsne3d[:,1],
z=tsne3d[:,2],
mode='markers',
marker=dict(
sizemode='diameter',
color = y,
colorscale = 'Portland',
colorbar = dict(title = 'duplicate'),
line=dict(color='rgb(255, 255, 255)'),
opacity=0.75
)
)
data=[trace1]
layout=dict(height=800, width=800, title='3d embedding with engineered features')
fig=dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
warnings.filterwarnings("ignore")
import sys
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
# extract word2vec vectors
# https://github.com/explosion/spaCy/issues/1721
# http://landinghub.visualstudio.com/visual-cpp-build-tools
import spacy
# # avoid decoding problems
# df = pd.read_csv("drive/My Drive/Quora/train.csv",nrows=10000)
# y_true = data['is_duplicate'].values
# df.drop(['is_duplicate'], axis=1, inplace=True)
# X_train,X_test, y_train, y_test = train_test_split(df, y_true, stratify=y_true, test_size=0.3)
# print("Number of data points in train data :",X_train.shape)
# print("Number of data points in test data :",X_test.shape)
# # encode questions to unicode
# # https://stackoverflow.com/a/6812069
# # ----------------- python 2 ---------------------
# # df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))
# # df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))
# # ----------------- python 3 ---------------------
# df['question1'] = df['question1'].apply(lambda x: str(x))
# df['question2'] = df['question2'].apply(lambda x: str(x))
!ls
sample_data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# merge texts
questions = list(df['question1']) + list(df['question2'])
tfidf = TfidfVectorizer(lowercase=False, )
tfidf.fit_transform(questions)
# dict with key:word and value:idf score (used below to weight the word vectors)
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))
<808580x109679 sparse matrix of type '<class 'numpy.float64'>' with 8146555 stored elements in Compressed Sparse Row format>
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,4), max_features=5000)
# questions_train = list(X_train['question1']) + list(X_train['question2'])
# questions_cv = list(X_cv['question1']) + list(X_cv['question2'])
# questions_test = list(X_test['question1']) + list(X_test['question2'])
tfidf_train_q1 = vectorizer.fit_transform(X_train['question1'].values)
tfidf_cv_q1 = vectorizer.transform(X_cv['question1'].values)
tfidf_test_q1 = vectorizer.transform(X_test['question1'].values)
feat_tfidf = vectorizer.get_feature_names()
print("Train tfidf ",tfidf_train_q1.shape)
print("CV tfidf",tfidf_cv_q1.shape)
print("Test tfidf",tfidf_test_q1.shape)
Train tfidf (44890, 5000) CV tfidf (22110, 5000) Test tfidf (33000, 5000)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=10,ngram_range=(1,4), max_features=5000)
# questions_train = list(X_train['question1']) + list(X_train['question2'])
# questions_cv = list(X_cv['question1']) + list(X_cv['question2'])
# questions_test = list(X_test['question1']) + list(X_test['question2'])
tfidf_train_q2 = vectorizer.fit_transform(X_train['question2'].values)
tfidf_cv_q2 = vectorizer.transform(X_cv['question2'].values)
tfidf_test_q2 = vectorizer.transform(X_test['question2'].values)
feat_tfidf_q2 = vectorizer.get_feature_names()
print("Train tfidf ",tfidf_train_q2.shape)
print("CV tfidf",tfidf_cv_q2.shape)
print("Test tfidf",tfidf_test_q2.shape)
Train tfidf (44890, 5000) CV tfidf (22110, 5000) Test tfidf (33000, 5000)
# spaCy's en_vectors_web_lg package includes over 1 million unique vectors; the smaller en_core_web_sm model is loaded here instead.
nlp = spacy.load('en_core_web_sm')
vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(list(df['question1'])):
doc1 = nlp(qu1)
# 384 is the number of dimensions of vectors
mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
for word1 in doc1:
# word2vec
vec1 = word1.vector
# fetch df score
try:
idf = word2tfidf[str(word1)]
except:
idf = 0
# compute final vec
mean_vec1 += vec1 * idf
mean_vec1 = mean_vec1.mean(axis=0)
vecs1.append(mean_vec1)
df['q1_feats_m'] = list(vecs1)
100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [2:13:51<00:00, 50.34it/s]
vecs2 = []
for qu2 in tqdm(list(df['question2'])):
doc2 = nlp(qu2)
mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
for word2 in doc2:
# word2vec
vec2 = word2.vector
# fetch df score
try:
idf = word2tfidf[str(word2)]
except:
#print word
idf = 0
# compute final vec
mean_vec2 += vec2 * idf
mean_vec2 = mean_vec2.mean(axis=0)
vecs2.append(mean_vec2)
df['q2_feats_m'] = list(vecs2)
100%|████████████████████████████████████████████████████████████████████████| 404290/404290 [1:47:52<00:00, 62.46it/s]
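The two loops above repeat the same idea, so a compact helper makes the computation easier to follow. This is only a sketch under the same assumptions as the loops (the nlp model and the word2tfidf dictionary built earlier); note that the row-wise mean in the loops reduces to an idf-weighted sum of the token vectors, which is what this helper returns.

def tfidf_weighted_w2v(question, nlp, word2tfidf, dim):
    # dim: dimensionality of the token vectors for the loaded spaCy model
    doc = nlp(str(question))
    vec = np.zeros(dim)
    for word in doc:
        # words never seen by the TfidfVectorizer get zero weight, as in the loops above
        idf = word2tfidf.get(str(word), 0)
        vec += word.vector * idf
    return vec

# hypothetical usage, inferring dim from one parsed question:
# dim = len(nlp(str(df['question1'][0]))[0].vector)
# df['q1_feats_m'] = [tfidf_weighted_w2v(q, nlp, word2tfidf, dim) for q in df['question1']]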
#prepro_features_train.csv (Simple Preprocessing Features)
#nlp_features_train.csv (NLP Features)
if os.path.isfile('nlp_features_train.csv'):
dfnlp = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
else:
print("download nlp_features_train.csv from drive or run previous notebook")
if os.path.isfile('df_fe_without_preprocessing_train.csv'):
dfppro = pd.read_csv("df_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
print("download df_fe_without_preprocessing_train.csv from drive or run previous notebook")
df1 = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)
df2 = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3 = df.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3_q1 = pd.DataFrame(df3.q1_feats_m.values.tolist(), index= df3.index)
df3_q2 = pd.DataFrame(df3.q2_feats_m.values.tolist(), index= df3.index)
# dataframe of nlp features
df1.head()
id | is_duplicate | cwc_min | cwc_max | csc_min | csc_max | ctc_min | ctc_max | last_word_eq | first_word_eq | abs_len_diff | mean_len | token_set_ratio | token_sort_ratio | fuzz_ratio | fuzz_partial_ratio | longest_substr_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.999980 | 0.833319 | 0.999983 | 0.999983 | 0.916659 | 0.785709 | 0.0 | 1.0 | 2.0 | 13.0 | 100 | 93 | 93 | 100 | 0.982759 |
1 | 1 | 0 | 0.799984 | 0.399996 | 0.749981 | 0.599988 | 0.699993 | 0.466664 | 0.0 | 1.0 | 5.0 | 12.5 | 86 | 63 | 66 | 75 | 0.596154 |
2 | 2 | 0 | 0.399992 | 0.333328 | 0.399992 | 0.249997 | 0.399996 | 0.285712 | 0.0 | 1.0 | 4.0 | 12.0 | 66 | 66 | 54 | 54 | 0.166667 |
3 | 3 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 2.0 | 12.0 | 36 | 36 | 35 | 40 | 0.039216 |
4 | 4 | 0 | 0.399992 | 0.199998 | 0.999950 | 0.666644 | 0.571420 | 0.307690 | 0.0 | 1.0 | 6.0 | 10.0 | 67 | 47 | 46 | 56 | 0.175000 |
# data before preprocessing
df2.head()
id | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 66 | 57 | 14 | 12 | 10.0 | 23.0 | 0.434783 | 2 | 0 |
1 | 1 | 4 | 1 | 51 | 88 | 8 | 13 | 4.0 | 20.0 | 0.200000 | 5 | 3 |
2 | 2 | 1 | 1 | 73 | 59 | 14 | 10 | 4.0 | 24.0 | 0.166667 | 2 | 0 |
3 | 3 | 1 | 1 | 50 | 65 | 11 | 9 | 0.0 | 19.0 | 0.000000 | 2 | 0 |
4 | 4 | 3 | 1 | 76 | 39 | 13 | 7 | 2.0 | 20.0 | 0.100000 | 4 | 2 |
# Questions 1 tfidf weighted word2vec
df3_q1.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 121.929927 | 100.083900 | 72.497894 | 115.641800 | -48.370870 | 34.619058 | -172.057787 | -92.502617 | 113.223315 | 50.562441 | ... | 12.397642 | 40.909519 | 8.150261 | -15.170692 | 18.007709 | 6.166999 | -30.124163 | 3.700902 | -1.757693 | -1.818058 |
1 | -78.070939 | 54.843781 | 82.738482 | 98.191872 | -51.234859 | 55.013510 | -39.140730 | -82.692352 | 45.161489 | -9.556289 | ... | -21.987077 | -12.389279 | 20.667979 | 2.202714 | -17.142454 | -5.880972 | -10.123963 | -4.890663 | -13.018389 | -5.219310 |
2 | -5.355015 | 73.671810 | 14.376365 | 104.130241 | 1.433537 | 35.229116 | -148.519385 | -97.124595 | 41.972195 | 50.948731 | ... | 3.027700 | 14.025767 | -2.960312 | -3.206544 | 4.355141 | 2.936152 | -20.199555 | 9.816351 | 11.894366 | -8.798819 |
3 | 5.778359 | -34.712038 | 48.999631 | 59.699204 | 40.661263 | -41.658731 | -36.808594 | 24.170655 | 0.235600 | -29.407290 | ... | 13.100007 | 1.405670 | -1.891076 | -7.882638 | 18.000561 | 12.106918 | -10.507835 | 5.243834 | 10.158340 | 5.886351 |
4 | 51.138220 | 38.587312 | 123.639488 | 53.333041 | -47.062739 | 37.356212 | -298.722753 | -106.421119 | 106.248914 | 65.880707 | ... | 13.906532 | 43.461721 | 11.519207 | -22.468284 | 45.431128 | 8.161224 | -35.373910 | 7.728865 | 9.592849 | 5.447336 |
5 rows × 384 columns
# Questions 2 tfidf weighted word2vec
df3_q2.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 125.983301 | 95.636485 | 42.114702 | 95.449980 | -37.386295 | 39.400078 | -148.116070 | -87.851475 | 110.371966 | 62.272814 | ... | 16.165592 | 33.030668 | 7.019996 | -14.793959 | 15.437511 | 8.199658 | -25.070834 | 1.571619 | 1.603738 | 0.305645 |
1 | -106.871904 | 80.290331 | 79.066297 | 59.302092 | -42.175328 | 117.616655 | -144.364237 | -127.131513 | 22.962533 | 25.397575 | ... | -4.901128 | -4.565393 | 41.520751 | -0.727564 | -16.413776 | -7.373778 | 2.638877 | -7.403457 | 2.703070 | 0.408040 |
2 | 7.072875 | 15.513378 | 1.846914 | 85.937583 | -33.808811 | 94.702337 | -122.256856 | -114.009530 | 53.922293 | 60.131814 | ... | 8.359966 | -2.165985 | 10.936580 | -16.531660 | 14.681230 | 15.633759 | -1.210901 | 14.183826 | 11.703135 | 10.148075 |
3 | 39.421531 | 44.136989 | -24.010929 | 85.265863 | -0.339022 | -9.323137 | -60.499651 | -37.044763 | 49.407848 | -23.350150 | ... | 3.311411 | 3.788879 | 13.398598 | -6.592596 | 6.437365 | 5.993293 | 2.732392 | -3.727647 | 5.614115 | 6.023693 |
4 | 31.950101 | 62.854106 | 1.778164 | 36.218768 | -45.130875 | 66.674880 | -106.342341 | -22.901008 | 59.835938 | 62.663961 | ... | -2.403870 | 11.991204 | 8.088483 | -15.090201 | 8.375166 | 1.727225 | -6.601129 | 11.317413 | 11.544603 | 2.478689 |
5 rows × 384 columns
print("Number of features in nlp dataframe :", df1.shape[1])
print("Number of features in preprocessed dataframe :", df2.shape[1])
print("Number of features in question1 w2v dataframe :", df3_q1.shape[1])
print("Number of features in question2 w2v dataframe :", df3_q2.shape[1])
print("Number of features in final dataframe :", df1.shape[1]+df2.shape[1]+df3_q1.shape[1]+df3_q2.shape[1])
Number of features in nlp dataframe : 17 Number of features in preprocessed dataframe : 12 Number of features in question1 w2v dataframe : 384 Number of features in question2 w2v dataframe : 384 Number of features in final dataframe : 794
# storing the final features to csv file
if not os.path.isfile('final_features.csv'):
    df3_q1['id'] = df1['id']
    df3_q2['id'] = df1['id']
    df1 = df1.merge(df2, on='id', how='left')
    df2 = df3_q1.merge(df3_q2, on='id', how='left')
    result = df1.merge(df2, on='id', how='left')
    result.to_csv('final_features.csv')
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import sqlite3
from sqlalchemy import create_engine # database connection
import csv
import os
warnings.filterwarnings("ignore")
import datetime as dt
import numpy as np
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from collections import Counter, defaultdict
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import math
from sklearn.metrics import normalized_mutual_info_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from mlxtend.classifier import StackingClassifier
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc, roc_curve
# Creating db file from csv
if not os.path.isfile('drive/My Drive/Quora/train.db'):
    # note: create_engine('sqlite:///train.db') writes ./train.db in the working directory,
    # while the file is later read from 'drive/My Drive/Quora/train.db'
    disk_engine = create_engine('sqlite:///train.db')
    start = dt.datetime.now()
    chunksize = 180000
    j = 0
    index_start = 1
    # final_features.csv has 29 leading columns followed by the 384-dim tfidf weighted w2v
    # vectors of question1 (suffix _x) and question2 (suffix _y)
    base_cols = ['Unnamed: 0', 'id', 'is_duplicate', 'cwc_min', 'cwc_max', 'csc_min', 'csc_max',
                 'ctc_min', 'ctc_max', 'last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len',
                 'token_set_ratio', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio',
                 'longest_substr_ratio', 'freq_qid1', 'freq_qid2', 'q1len', 'q2len', 'q1_n_words',
                 'q2_n_words', 'word_Common', 'word_Total', 'word_share', 'freq_q1+q2', 'freq_q1-q2']
    w2v_cols = ['{}_x'.format(i) for i in range(384)] + ['{}_y'.format(i) for i in range(384)]
    for df in pd.read_csv('final_features.csv', names=base_cols + w2v_cols,
                          chunksize=chunksize, iterator=True, encoding='utf-8'):
        df.index += index_start
        j += 1
        print('{} rows'.format(j*chunksize))
        df.to_sql('data', disk_engine, if_exists='append')
        index_start = df.index[-1] + 1
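After the loop finishes it is worth a quick sanity check that every chunk actually landed in the data table. A minimal added sketch, assuming the train.db file written by the engine above sits in the working directory:
# count the rows stored in the 'data' table of the freshly written train.db
conn_check = sqlite3.connect('train.db')
n_rows = conn_check.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print("rows stored in train.db:", n_rows)
conn_check.close()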
# http://www.sqlitetutorial.net/sqlite-python/create-tables/
def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except sqlite3.Error as e:
        print(e)
    return None
def checkTableExists(dbcon):
    cursr = dbcon.cursor()
    query = "select name from sqlite_master where type='table'"
    table_names = cursr.execute(query)
    print("Tables in the database:")
    tables = table_names.fetchall()
    print(tables[0][0])
    return(len(tables))
read_db = 'drive/My Drive/Quora/train.db'
conn_r = create_connection(read_db)
checkTableExists(conn_r)
conn_r.close()
Tables in the database: data
# sample the data according to the computing power you have
if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    if conn_r is not None:
        # to select the first rows instead of a random sample:
        # data = pd.read_sql_query("""SELECT * FROM data LIMIT 100001;""", conn_r)
        # here we pick 100001 rows at random
        data = pd.read_sql_query("""SELECT * FROM data ORDER BY RANDOM() LIMIT 100001;""", conn_r)
        conn_r.commit()
conn_r.close()
data.shape
(100001, 798)
# remove the first row
data.drop(data.index[0], inplace=True)
y_true = data['is_duplicate'].values
data.drop(['Unnamed: 0', 'id','index','is_duplicate'], axis=1, inplace=True)
data.shape
(100000, 794)
y_true.shape
(100000,)
data.head()
cwc_min | cwc_max | csc_min | csc_max | ctc_min | ctc_max | last_word_eq | first_word_eq | abs_len_diff | mean_len | token_set_ratio | token_sort_ratio | fuzz_ratio | fuzz_partial_ratio | longest_substr_ratio | freq_qid1 | freq_qid2 | q1len | q2len | q1_n_words | q2_n_words | word_Common | word_Total | word_share | freq_q1+q2 | freq_q1-q2 | 0_x | 1_x | 2_x | 3_x | 4_x | 5_x | 6_x | 7_x | 8_x | 9_x | 10_x | 11_x | 12_x | 13_x | ... | 344_y | 345_y | 346_y | 347_y | 348_y | 349_y | 350_y | 351_y | 352_y | 353_y | 354_y | 355_y | 356_y | 357_y | 358_y | 359_y | 360_y | 361_y | 362_y | 363_y | 364_y | 365_y | 366_y | 367_y | 368_y | 369_y | 370_y | 371_y | 372_y | 373_y | 374_y | 375_y | 376_y | 377_y | 378_y | 379_y | 380_y | 381_y | 382_y | 383_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.599988000239995 | 0.374995312558593 | 0.66664444518516 | 0.399992000159997 | 0.624992187597655 | 0.384612426058261 | 0.0 | 1.0 | 5.0 | 10.5 | 74 | 62 | 59 | 71 | 0.325581395348837 | 1 | 3 | 82 | 42 | 13 | 8 | 5.0 | 21.0 | 0.238095238095238 | 4 | 2 | 57.1058412827551 | 10.6745837330818 | 23.2628687694669 | -33.6145638562739 | 33.9563886523247 | -54.3863698244095 | -38.9512281417847 | -5.65749143064022 | 19.4633273463696 | -14.4108464121819 | 0.413780856877565 | -90.0927007198334 | -25.9417567253113 | 39.9195522069931 | ... | -1.12130831182003 | -6.98040819168091 | -6.33985570073128 | 7.6240513920784 | 22.3642468806356 | -6.40829902887344 | -5.13831973075867 | 5.59842221438885 | -2.4262265637517 | 5.13488824665546 | 4.47589510679245 | 9.35799412429333 | -1.54873855412006 | -18.6881028264761 | -0.318126812577248 | -27.2441799640656 | -12.7842939198017 | 3.23777961730957 | 10.4648281931877 | 3.1136802136898 | -16.1753657013178 | -6.6811613291502 | 8.9008380137384 | -8.09360283613205 | 10.669078797102 | 13.805245757103 | -3.37707404792309 | 11.9117770791054 | -13.6444728076458 | -9.17912252247334 | -4.08385995030403 | 14.0245571732521 | 5.93214458227158 | -6.84712482988834 | 9.7337102573365 | 2.93709091842175 | -8.35303891450167 | 3.64266863837838 | -2.83494944870472 | 0.795595072209835 |
2 | 0.0 | 0.0 | 0.33332222259258 | 0.199996000079998 | 0.14285510206997 | 0.111109876556927 | 0.0 | 0.0 | 2.0 | 8.0 | 47 | 47 | 34 | 42 | 0.21875 | 1 | 2 | 47 | 31 | 9 | 7 | 1.0 | 16.0 | 0.0625 | 3 | 1 | 8.77402806282043 | 26.4920916557312 | -25.6423462629318 | 151.584264576435 | -12.4909554719925 | 10.1954989433289 | -80.1942399442196 | -8.31146311759949 | 10.3102250099182 | 65.4081729203463 | 23.3336698561907 | -45.3783755302429 | 54.6424433588982 | -33.0326794981956 | ... | -17.9594589471817 | 3.58886796236038 | -6.29027318954468 | 5.16409941017628 | 20.1258146762848 | -6.72773689031601 | -6.86605572700501 | 9.4583829343319 | -5.7872397005558 | 5.50593328475952 | -0.359634578227997 | -3.37525436282158 | 2.41916358470917 | -9.87290219590068 | 0.113897413015366 | -18.8911101222038 | -10.9363823086023 | 16.1445648670197 | 7.70360687375069 | 9.39097859710455 | -19.3960244655609 | -9.99294281005859 | -0.711765699088573 | -11.8579163914546 | 9.96197581291199 | 4.71993850171566 | -0.404253512620926 | 5.98044725437649 | -18.5462153553963 | -7.54864060878754 | -6.00434684753418 | 2.09378188848496 | -4.36162620782852 | -2.86872500181198 | 9.95783281326294 | 3.96753732697107 | -6.04816972836852 | 13.1370285153389 | -4.59646981954575 | -2.10167560027912 |
3 | 0.571420408279882 | 0.39999600004 | 0.249993750156246 | 0.124998437519531 | 0.416663194473379 | 0.277776234576475 | 0.0 | 0.0 | 6.0 | 15.0 | 63 | 55 | 50 | 59 | 0.193548387096774 | 18 | 1 | 92 | 61 | 18 | 12 | 5.0 | 29.0 | 0.172413793103448 | 19 | 17 | 111.379810988903 | 29.0309159681201 | -17.0490934848785 | 163.781318992376 | 65.5736330747604 | 5.04822397232056 | -123.368053853512 | 18.1942257136107 | 104.339100897312 | 65.868305683136 | 33.2836086694151 | -76.4499957561493 | 80.5674305558205 | -67.4319585561752 | ... | -16.6893599927425 | 4.52293415740132 | -11.1851584911346 | 20.2472121715546 | 36.1004081815481 | -3.33331942558289 | -13.2763716876507 | 8.09458849206567 | -17.5483088828623 | 3.25249992311001 | 5.05329053103924 | -10.4370025992393 | -11.4497925713658 | -20.9479368925095 | -17.451088488102 | -36.9104922413826 | -8.31025653332472 | 14.6365592330694 | 3.99670352041721 | 14.1713411808014 | -31.4632322192192 | -10.0113016366959 | -0.462194621562958 | -10.113861897029 | 27.8743404261768 | 23.7049300670624 | -7.57172939367592 | 7.93385477364063 | -14.4578313231468 | -9.78929396728927 | 2.83521626889706 | 16.6881836429238 | -3.64371617138386 | -9.01050212979317 | 7.68894088082016 | 12.0409759283066 | -15.9408362954855 | 18.9260811805725 | 7.33242186903953 | 7.84472468495369 |
4 | 0.499991666805553 | 0.374995312558593 | 0.33332222259258 | 0.199996000079998 | 0.444439506227709 | 0.307689940846609 | 0.0 | 0.0 | 4.0 | 11.0 | 62 | 51 | 37 | 43 | 0.234042553191489 | 2 | 1 | 46 | 65 | 9 | 13 | 2.0 | 22.0 | 0.0909090909090909 | 3 | 1 | 30.6566087305546 | 51.383349776268 | -14.1957220435143 | 117.821711063385 | -15.1299973726273 | 12.4104707539082 | -122.302688717842 | -27.978790640831 | 38.1506505087018 | 4.2088098526001 | 49.3372248411179 | -32.5690251588821 | 44.6788084805012 | -13.8778696358204 | ... | -14.8087664395571 | -16.8481034561992 | -17.6492288708687 | 14.7391265034676 | 30.414575509727 | -1.46904289722443 | -25.7505879290402 | -3.48447132110596 | -35.4570420384407 | 10.8127281069756 | 16.6586848348379 | -20.1382753551006 | -17.5363703109324 | -39.0970457335934 | -19.068716943264 | -54.4635823965073 | -14.0194551348686 | 20.4581990391016 | 27.132982048206 | 18.852943867445 | -31.7658805251121 | -6.10707682371139 | 24.6091204285622 | -8.91956734657288 | 35.5811465382576 | 31.1579805016518 | -1.13222348690033 | 22.4273942857981 | -18.6040524542332 | -24.644641160965 | -1.57668825238943 | 26.7479315027595 | -0.426841732114554 | -1.77006526530022 | 27.6149563193321 | 3.78940312564373 | -5.8349786400795 | 7.09428422152996 | 14.4553550630808 | -4.88616823777556 |
5 | 0.66664444518516 | 0.333327777870369 | 0.799984000319994 | 0.799984000319994 | 0.749990625117186 | 0.545449586821938 | 0.0 | 1.0 | 3.0 | 9.5 | 82 | 77 | 72 | 86 | 0.288461538461538 | 1 | 1 | 71 | 51 | 11 | 8 | 4.0 | 19.0 | 0.210526315789474 | 2 | 0 | 66.5820848941803 | 79.7439966201782 | 72.8350574970246 | 58.9022444486618 | -54.2756164073944 | -12.6337466053665 | -163.805635280907 | -46.6827344894409 | 55.8445687294006 | 97.7997419312596 | 90.573876529932 | -102.867875941098 | 14.7288395762444 | -36.5945960283279 | ... | -8.35292768478394 | -0.0757631957530975 | -19.4078957438469 | 23.4379781782627 | 21.874951839447 | -1.70280447602272 | -0.266398936510086 | 4.7552992105484 | -10.2854710556567 | 10.2418370693922 | 0.0553742945194244 | 10.1732945069671 | 9.15499824285507 | -17.3461922854185 | 4.12215511500835 | -14.4531718040816 | -7.96267762780189 | 18.1066705882549 | 2.71714213490486 | 5.34526636265218 | -13.5789397358894 | -10.2646422609687 | 3.89758487045765 | -5.82636298239231 | 20.0318053662777 | 3.9367158387322 | -12.4601362505928 | 4.76215037703514 | -3.27821743488312 | 3.07628712058067 | 4.75916939880699 | 7.9967329595238 | 0.294882056768984 | -4.27459578216076 | 12.0830285754055 | -10.966049104929 | -10.1352945417166 | 7.26292437314987 | 4.26316990330815 | 1.2718748524785 |
5 rows × 794 columns
# after reading from the sql table, each entry comes back as a string
# we convert all the features to numeric before applying any model
cols = list(data.columns)
for i in cols:
    data[i] = data[i].apply(pd.to_numeric)
    print(i)
cwc_min cwc_max csc_min csc_max ctc_min ctc_max last_word_eq first_word_eq abs_len_diff mean_len token_set_ratio token_sort_ratio fuzz_ratio fuzz_partial_ratio longest_substr_ratio freq_qid1 freq_qid2 q1len q2len q1_n_words q2_n_words word_Common word_Total word_share freq_q1+q2 freq_q1-q2 0_x 1_x 2_x 3_x 4_x 5_x 6_x 7_x 8_x 9_x 10_x 11_x 12_x 13_x 14_x 15_x 16_x 17_x 18_x 19_x 20_x 21_x 22_x 23_x 24_x 25_x 26_x 27_x 28_x 29_x 30_x 31_x 32_x 33_x 34_x 35_x 36_x 37_x 38_x 39_x 40_x 41_x 42_x 43_x 44_x 45_x 46_x 47_x 48_x 49_x 50_x 51_x 52_x 53_x 54_x 55_x 56_x 57_x 58_x 59_x 60_x 61_x 62_x 63_x 64_x 65_x 66_x 67_x 68_x 69_x 70_x 71_x 72_x 73_x 74_x 75_x 76_x 77_x 78_x 79_x 80_x 81_x 82_x 83_x 84_x 85_x 86_x 87_x 88_x 89_x 90_x 91_x 92_x 93_x 94_x 95_x 96_x 97_x 98_x 99_x 100_x 101_x 102_x 103_x 104_x 105_x 106_x 107_x 108_x 109_x 110_x 111_x 112_x 113_x 114_x 115_x 116_x 117_x 118_x 119_x 120_x 121_x 122_x 123_x 124_x 125_x 126_x 127_x 128_x 129_x 130_x 131_x 132_x 133_x 134_x 135_x 136_x 137_x 138_x 139_x 140_x 141_x 142_x 143_x 144_x 145_x 146_x 147_x 148_x 149_x 150_x 151_x 152_x 153_x 154_x 155_x 156_x 157_x 158_x 159_x 160_x 161_x 162_x 163_x 164_x 165_x 166_x 167_x 168_x 169_x 170_x 171_x 172_x 173_x 174_x 175_x 176_x 177_x 178_x 179_x 180_x 181_x 182_x 183_x 184_x 185_x 186_x 187_x 188_x 189_x 190_x 191_x 192_x 193_x 194_x 195_x 196_x 197_x 198_x 199_x 200_x 201_x 202_x 203_x 204_x 205_x 206_x 207_x 208_x 209_x 210_x 211_x 212_x 213_x 214_x 215_x 216_x 217_x 218_x 219_x 220_x 221_x 222_x 223_x 224_x 225_x 226_x 227_x 228_x 229_x 230_x 231_x 232_x 233_x 234_x 235_x 236_x 237_x 238_x 239_x 240_x 241_x 242_x 243_x 244_x 245_x 246_x 247_x 248_x 249_x 250_x 251_x 252_x 253_x 254_x 255_x 256_x 257_x 258_x 259_x 260_x 261_x 262_x 263_x 264_x 265_x 266_x 267_x 268_x 269_x 270_x 271_x 272_x 273_x 274_x 275_x 276_x 277_x 278_x 279_x 280_x 281_x 282_x 283_x 284_x 285_x 286_x 287_x 288_x 289_x 290_x 291_x 292_x 293_x 294_x 295_x 296_x 297_x 298_x 299_x 300_x 301_x 302_x 303_x 304_x 305_x 306_x 307_x 308_x 309_x 310_x 311_x 312_x 313_x 314_x 315_x 316_x 317_x 318_x 319_x 320_x 321_x 322_x 323_x 324_x 325_x 326_x 327_x 328_x 329_x 330_x 331_x 332_x 333_x 334_x 335_x 336_x 337_x 338_x 339_x 340_x 341_x 342_x 343_x 344_x 345_x 346_x 347_x 348_x 349_x 350_x 351_x 352_x 353_x 354_x 355_x 356_x 357_x 358_x 359_x 360_x 361_x 362_x 363_x 364_x 365_x 366_x 367_x 368_x 369_x 370_x 371_x 372_x 373_x 374_x 375_x 376_x 377_x 378_x 379_x 380_x 381_x 382_x 383_x 0_y 1_y 2_y 3_y 4_y 5_y 6_y 7_y 8_y 9_y 10_y 11_y 12_y 13_y 14_y 15_y 16_y 17_y 18_y 19_y 20_y 21_y 22_y 23_y 24_y 25_y 26_y 27_y 28_y 29_y 30_y 31_y 32_y 33_y 34_y 35_y 36_y 37_y 38_y 39_y 40_y 41_y 42_y 43_y 44_y 45_y 46_y 47_y 48_y 49_y 50_y 51_y 52_y 53_y 54_y 55_y 56_y 57_y 58_y 59_y 60_y 61_y 62_y 63_y 64_y 65_y 66_y 67_y 68_y 69_y 70_y 71_y 72_y 73_y 74_y 75_y 76_y 77_y 78_y 79_y 80_y 81_y 82_y 83_y 84_y 85_y 86_y 87_y 88_y 89_y 90_y 91_y 92_y 93_y 94_y 95_y 96_y 97_y 98_y 99_y 100_y 101_y 102_y 103_y 104_y 105_y 106_y 107_y 108_y 109_y 110_y 111_y 112_y 113_y 114_y 115_y 116_y 117_y 118_y 119_y 120_y 121_y 122_y 123_y 124_y 125_y 126_y 127_y 128_y 129_y 130_y 131_y 132_y 133_y 134_y 135_y 136_y 137_y 138_y 139_y 140_y 141_y 142_y 143_y 144_y 145_y 146_y 147_y 148_y 149_y 150_y 151_y 152_y 153_y 154_y 155_y 156_y 157_y 158_y 159_y 160_y 161_y 162_y 163_y 164_y 165_y 166_y 167_y 168_y 169_y 170_y 171_y 172_y 173_y 174_y 175_y 176_y 177_y 178_y 179_y 180_y 181_y 182_y 183_y 184_y 185_y 186_y 187_y 188_y 189_y 190_y 191_y 192_y 193_y 194_y 195_y 
196_y 197_y 198_y 199_y 200_y 201_y 202_y 203_y 204_y 205_y 206_y 207_y 208_y 209_y 210_y 211_y 212_y 213_y 214_y 215_y 216_y 217_y 218_y 219_y 220_y 221_y 222_y 223_y 224_y 225_y 226_y 227_y 228_y 229_y 230_y 231_y 232_y 233_y 234_y 235_y 236_y 237_y 238_y 239_y 240_y 241_y 242_y 243_y 244_y 245_y 246_y 247_y 248_y 249_y 250_y 251_y 252_y 253_y 254_y 255_y 256_y 257_y 258_y 259_y 260_y 261_y 262_y 263_y 264_y 265_y 266_y 267_y 268_y 269_y 270_y 271_y 272_y 273_y 274_y 275_y 276_y 277_y 278_y 279_y 280_y 281_y 282_y 283_y 284_y 285_y 286_y 287_y 288_y 289_y 290_y 291_y 292_y 293_y 294_y 295_y 296_y 297_y 298_y 299_y 300_y 301_y 302_y 303_y 304_y 305_y 306_y 307_y 308_y 309_y 310_y 311_y 312_y 313_y 314_y 315_y 316_y 317_y 318_y 319_y 320_y 321_y 322_y 323_y 324_y 325_y 326_y 327_y 328_y 329_y 330_y 331_y 332_y 333_y 334_y 335_y 336_y 337_y 338_y 339_y 340_y 341_y 342_y 343_y 344_y 345_y 346_y 347_y 348_y 349_y 350_y 351_y 352_y 353_y 354_y 355_y 356_y 357_y 358_y 359_y 360_y 361_y 362_y 363_y 364_y 365_y 366_y 367_y 368_y 369_y 370_y 371_y 372_y 373_y 374_y 375_y 376_y 377_y 378_y 379_y 380_y 381_y 382_y 383_y
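The per-column loop above does the job; the same conversion can also be written as a single vectorized call (an equivalent sketch, minus the per-column printout):
# convert every column to a numeric dtype in one call
data = data.apply(pd.to_numeric)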
y_true[:10]
array(['1', '0', '0', '0', '0', '1', '0', '0', '0', '0'], dtype=object)
# https://stackoverflow.com/questions/7368789/convert-all-strings-in-a-list-to-int
y_true = pd.Series(map(int,list(y_true)))
# X_train,X_test, y_train, y_test = train_test_split(data, y_true, stratify=y_true, test_size=0.3)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y_true, test_size=0.33, stratify=y_true)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.33, stratify=y_train)
print("Number of data points in train data :",X_train.shape)
print("Number of data points in test data :",X_cv.shape)
print("Number of data points in test data :",X_test.shape)
Number of data points in train data : (44890, 794) Number of data points in test data : (22110, 794) Number of data points in test data : (33000, 794)
print("Number of data points in train data :",X_train.shape)
print("Number of data points in test data :",X_cv.shape)
print("Number of data points in test data :",X_test.shape)
Number of data points in train data : (44890, 32) Number of data points in test data : (22110, 32) Number of data points in test data : (33000, 32)
print("-"*10, "Distribution of output variable in train data", "-"*10)
train_distr = Counter(y_train)
train_len = len(y_train)
print("Class 0: ",int(train_distr[0])/train_len,"Class 1: ", int(train_distr[1])/train_len)
print("-"*10, "Distribution of output variable in test data", "-"*10)
test_distr = Counter(y_test)
test_len = len(y_test)
print("Class 0: ",int(test_distr[0])/test_len, "Class 1: ",int(test_distr[1])/test_len)
---------- Distribution of output variable in train data ---------- Class 0: 0.6274671419024281 Class 1: 0.37253285809757186 ---------- Distribution of output variable in test data ---------- Class 0: 0.6274545454545455 Class 1: 0.37254545454545457
# This function plots the confusion matrices given y_i, y_i_hat.
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    # C is a 2x2 matrix: cell (i,j) is the number of points of class i predicted as class j

    A = (((C.T)/(C.sum(axis=1))).T)
    # divide each element of the confusion matrix by the sum of elements in that row
    # C = [[1, 2],
    #      [3, 4]]
    # C.T = [[1, 3],
    #        [2, 4]]
    # C.sum(axis=1): axis=0 corresponds to columns and axis=1 corresponds to rows in a two dimensional array
    # C.sum(axis=1) = [3, 7]
    # (C.T)/(C.sum(axis=1)) = [[1/3, 3/7],
    #                          [2/3, 4/7]]
    # ((C.T)/(C.sum(axis=1))).T = [[1/3, 2/3],
    #                              [3/7, 4/7]]
    # each row sums to 1, so A is the recall matrix

    B = (C/C.sum(axis=0))
    # divide each element of the confusion matrix by the sum of elements in that column
    # C.sum(axis=0) = [4, 6]
    # C/C.sum(axis=0) = [[1/4, 2/6],
    #                    [3/4, 4/6]]
    # each column sums to 1, so B is the precision matrix

    plt.figure(figsize=(20,4))
    labels = [0, 1]   # class labels: 0 = not duplicate, 1 = duplicate
    cmap = sns.light_palette("blue")

    # representing C in heatmap format
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")

    # representing B in heatmap format
    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")

    # representing A in heatmap format
    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")
    plt.show()
# we need to generate 2 probabilities per data point and their sum should be 1
# one solution is to generate 2 random numbers and divide each of them by their sum
# ref: https://stackoverflow.com/a/18662466/4084039
# we create an output array that has exactly the same size as the test data
predicted_y = np.zeros((test_len,2))
for i in range(test_len):
    rand_probs = np.random.rand(1,2)
    predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(y_test, predicted_y, eps=1e-15))
predicted_y =np.argmax(predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y)
Log loss on Test Data using Random Model 0.8900928865618952
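As a point of comparison for the ~0.89 figure above, constant-probability predictors set an even simpler floor that any trained model should clearly beat. The sketch below is an addition that reuses y_train, y_test and log_loss from this notebook; it scores a model that predicts 0.5 for every pair and one that predicts the training class prior for every pair:
# constant-probability baselines for comparison with the random model above
n = len(y_test)
# predicting 0.5 for every pair
uniform_pred = np.full((n, 2), 0.5)
print("Log loss predicting 0.5 everywhere:", log_loss(y_test, uniform_pred, eps=1e-15))
# predicting the training class prior for every pair
prior_1 = np.mean(y_train)
prior_pred = np.tile([1 - prior_1, prior_1], (n, 1))
print("Log loss predicting the class prior:", log_loss(y_test, prior_pred, eps=1e-15))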
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state =None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link:
#------------------------------
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(X_train, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_test)
    log_error_array.append(log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_train, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.592800211149 For values of alpha = 0.0001 The log loss is: 0.532351700629 For values of alpha = 0.001 The log loss is: 0.527562275995 For values of alpha = 0.01 The log loss is: 0.534535408885 For values of alpha = 0.1 The log loss is: 0.525117052926 For values of alpha = 1 The log loss is: 0.520035530431 For values of alpha = 10 The log loss is: 0.521097925307
For values of best alpha = 1 The train log loss is: 0.513842874233 For values of best alpha = 1 The test log loss is: 0.520035530431 Total number of data points : 30000
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link:
#------------------------------
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', loss='hinge', random_state=42)
    clf.fit(X_train, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_test)
    log_error_array.append(log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l1', loss='hinge', random_state=42)
clf.fit(X_train, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.657611721261 For values of alpha = 0.0001 The log loss is: 0.489669093534 For values of alpha = 0.001 The log loss is: 0.521829068562 For values of alpha = 0.01 The log loss is: 0.566295616914 For values of alpha = 0.1 The log loss is: 0.599957866217 For values of alpha = 1 The log loss is: 0.635059427016 For values of alpha = 10 The log loss is: 0.654159467907
For values of best alpha = 0.0001 The train log loss is: 0.478054677285 For values of best alpha = 0.0001 The test log loss is: 0.489669093534 Total number of data points : 30000
# 'cwc_min','cwc_max','csc_min','csc_max','ctc_min','ctc_max','last_word_eq','first_word_eq','abs_len_diff','mean_len','token_set_ratio','token_sort_ratio','fuzz_ratio','fuzz_partial_ratio','longest_substr_ratio','freq_qid1','freq_qid2','q1len','q2len','q1_n_words','q2_n_words','word_Common','word_Total','word_share','freq_q1+q2','freq_q1-q2
cols = list(X_train.columns)
cols[5:]
['is_duplicate', 'freq_qid1', 'freq_qid2', 'q1len', 'q2len', 'q1_n_words', 'q2_n_words', 'word_Common', 'word_Total', 'word_share', 'freq_q1+q2', 'freq_q1-q2', 'cwc_min', 'cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max', 'last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len', 'token_set_ratio', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio', 'longest_substr_ratio']
X_train.shape
(44890, 32)
X_train[cols[6:]].shape
(44890, 26)
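The tfidf_train_q1/q2, tfidf_cv_q1/q2 and tfidf_test_q1/q2 matrices used in the next cell are assumed to have been built earlier in the notebook from the raw question text. Judging from the hstack shapes printed below (10026 = 2 x 5000 + 26), each question appears to get its own TfidfVectorizer capped at 5000 features, but that cap is an assumption. A minimal sketch of how such matrices could be produced; q1_train, q1_cv, q1_test and the q2_* lists are hypothetical stand-ins for the question texts aligned with X_train, X_cv and X_test:
# hypothetical sketch: vectorize question1 and question2 text separately
# q1_train/q1_cv/q1_test and q2_train/q2_cv/q2_test are assumed lists of raw question strings
tfidf_q1 = TfidfVectorizer(max_features=5000)   # 5000 is an assumption inferred from the shapes below
tfidf_q2 = TfidfVectorizer(max_features=5000)
tfidf_train_q1 = tfidf_q1.fit_transform(q1_train)   # fit only on the training questions
tfidf_cv_q1 = tfidf_q1.transform(q1_cv)
tfidf_test_q1 = tfidf_q1.transform(q1_test)
tfidf_train_q2 = tfidf_q2.fit_transform(q2_train)
tfidf_cv_q2 = tfidf_q2.transform(q2_cv)
tfidf_test_q2 = tfidf_q2.transform(q2_test)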
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((tfidf_train_q1,tfidf_train_q2,X_train[['freq_qid1', 'freq_qid2', 'q1len', 'q2len', 'q1_n_words', 'q2_n_words','word_Common', 'word_Total', 'word_share', 'freq_q1+q2', 'freq_q1-q2','cwc_min','cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max','last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len','token_set_ratio', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio', 'longest_substr_ratio']])).tocsr()
X_cr = hstack((tfidf_cv_q1,tfidf_cv_q2,X_cv[['freq_qid1', 'freq_qid2', 'q1len', 'q2len', 'q1_n_words', 'q2_n_words','word_Common', 'word_Total', 'word_share', 'freq_q1+q2', 'freq_q1-q2','cwc_min','cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max','last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len','token_set_ratio', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio', 'longest_substr_ratio']])).tocsr()
X_te = hstack((tfidf_test_q1,tfidf_test_q2,X_test[['freq_qid1', 'freq_qid2', 'q1len', 'q2len', 'q1_n_words', 'q2_n_words','word_Common', 'word_Total', 'word_share', 'freq_q1+q2', 'freq_q1-q2','cwc_min','cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max','last_word_eq', 'first_word_eq', 'abs_len_diff', 'mean_len','token_set_ratio', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio', 'longest_substr_ratio']])).tocsr()
print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)
Final Data matrix (44890, 10026) (44890,) (22110, 10026) (22110,) (33000, 10026) (33000,) ====================================================================================================
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state =None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link:
#------------------------------
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', loss='log', random_state=42)
    clf.fit(X_tr, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_tr, y_train)
    predict_y = sig_clf.predict_proba(X_cr)
    log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l1', loss='log', random_state=42)
clf.fit(X_tr, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_cr, y_cv)
predict_y = sig_clf.predict_proba(X_tr)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_te)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.486122125885547 For values of alpha = 0.0001 The log loss is: 0.5331278421475841 For values of alpha = 0.001 The log loss is: 0.5478625643422692 For values of alpha = 0.01 The log loss is: 0.5246757351176663 For values of alpha = 0.1 The log loss is: 0.5677369451296661 For values of alpha = 1 The log loss is: 0.6136396550348843 For values of alpha = 10 The log loss is: 0.6602970012154868
For values of best alpha = 1e-05 The train log loss is: 0.5043990230156697 For values of best alpha = 1e-05 The test log loss is: 0.5045227813866615 Total number of data points : 33000
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state =None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link:
#------------------------------
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(X_tr, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_tr, y_train)
    predict_y = sig_clf.predict_proba(X_cr)
    log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_tr, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_cr, y_cv)
predict_y = sig_clf.predict_proba(X_tr)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_te)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.4789651965991101 For values of alpha = 0.0001 The log loss is: 0.4796432993221746 For values of alpha = 0.001 The log loss is: 0.49076128888211723 For values of alpha = 0.01 The log loss is: 0.5174224819637837 For values of alpha = 0.1 The log loss is: 0.5377837665263189 For values of alpha = 1 The log loss is: 0.5667679418881167 For values of alpha = 10 The log loss is: 0.5905260940044674
For values of best alpha = 1e-05 The train log loss is: 0.5026869940073284 For values of best alpha = 1e-05 The test log loss is: 0.5049064228811674 Total number of data points : 33000
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)
# some of methods
# fit(X, y[, coef_init, intercept_init, …]) Fit linear model with Stochastic Gradient Descent.
# predict(X) Predict class labels for samples in X.
#-------------------------------
# video link:
#------------------------------
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', loss='hinge', random_state=42)
    clf.fit(X_tr, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_tr, y_train)
    predict_y = sig_clf.predict_proba(X_cr)
    log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()
best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l1', loss='hinge', random_state=42)
clf.fit(X_tr, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_cr, y_cv)
predict_y = sig_clf.predict_proba(X_tr)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_te)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.4848332825423125 For values of alpha = 0.0001 The log loss is: 0.5378475958582134 For values of alpha = 0.001 The log loss is: 0.5417904056869058 For values of alpha = 0.01 The log loss is: 0.5288605937385088 For values of alpha = 0.1 The log loss is: 0.5582942249037874 For values of alpha = 1 The log loss is: 0.6230572564022668 For values of alpha = 10 The log loss is: 0.6459682453560096
For values of best alpha = 1e-05 The train log loss is: 0.49829496672249113 For values of best alpha = 1e-05 The test log loss is: 0.49742513210493855 Total number of data points : 33000
import xgboost as xgb
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 4
d_train = xgb.DMatrix(X_tr, label=y_train)
d_test = xgb.DMatrix(X_cr , label=y_cv)
watchlist = [(d_train, 'train'), (d_test, 'valid')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)
xgdmat = xgb.DMatrix(X_tr,y_train)
d_test = xgb.DMatrix(X_te, label=y_test)
predict_y = bst.predict(d_test)
print("The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
[0] train-logloss:0.686037 valid-logloss:0.686264 Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping. Will train until valid-logloss hasn't improved in 20 rounds. [10] train-logloss:0.627941 valid-logloss:0.630236 [20] train-logloss:0.585854 valid-logloss:0.58967 [30] train-logloss:0.554472 valid-logloss:0.559685 [40] train-logloss:0.530438 valid-logloss:0.536768 [50] train-logloss:0.511657 valid-logloss:0.519057 [60] train-logloss:0.496642 valid-logloss:0.504604 [70] train-logloss:0.483971 valid-logloss:0.492629 [80] train-logloss:0.473931 valid-logloss:0.48303 [90] train-logloss:0.465607 valid-logloss:0.475318 [100] train-logloss:0.45863 valid-logloss:0.468825 [110] train-logloss:0.45258 valid-logloss:0.463205 [120] train-logloss:0.447511 valid-logloss:0.458575 [130] train-logloss:0.443278 valid-logloss:0.45462 [140] train-logloss:0.439687 valid-logloss:0.451355 [150] train-logloss:0.436489 valid-logloss:0.448471 [160] train-logloss:0.433715 valid-logloss:0.446005 [170] train-logloss:0.431153 valid-logloss:0.443634 [180] train-logloss:0.429026 valid-logloss:0.441811 [190] train-logloss:0.426707 valid-logloss:0.439764 [200] train-logloss:0.424854 valid-logloss:0.438183 [210] train-logloss:0.42299 valid-logloss:0.436568 [220] train-logloss:0.420913 valid-logloss:0.434812 [230] train-logloss:0.419246 valid-logloss:0.433372 [240] train-logloss:0.417655 valid-logloss:0.431996 [250] train-logloss:0.415866 valid-logloss:0.430539 [260] train-logloss:0.414382 valid-logloss:0.429371 [270] train-logloss:0.413073 valid-logloss:0.428351 [280] train-logloss:0.411905 valid-logloss:0.427444 [290] train-logloss:0.410746 valid-logloss:0.426491 [300] train-logloss:0.409722 valid-logloss:0.425766 [310] train-logloss:0.408668 valid-logloss:0.42502 [320] train-logloss:0.407459 valid-logloss:0.424034 [330] train-logloss:0.406495 valid-logloss:0.423334 [340] train-logloss:0.405659 valid-logloss:0.422758 [350] train-logloss:0.404737 valid-logloss:0.422061 [360] train-logloss:0.403716 valid-logloss:0.421269 [370] train-logloss:0.402872 valid-logloss:0.420671 [380] train-logloss:0.40199 valid-logloss:0.419973 [390] train-logloss:0.401122 valid-logloss:0.419321 [399] train-logloss:0.400299 valid-logloss:0.418737 The test log loss is: 0.4175338218190432
predicted_y =np.array(predict_y>0.5,dtype=int)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
Total number of data points : 33000
# https://gist.github.com/wrwr/3f6b66bf4ee01bf48be965f60d14454d
import time
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
x_train, y_train, x_valid, y_valid, x_test, y_test = X_tr , y_train , X_cr , y_cv , X_te , y_test # load datasets
clf = xgb.XGBClassifier()
param_grid = {
'silent': [False],
'max_depth': [1,2,3,4,5],
'learning_rate': [0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
'gamma': [0, 0.25, 0.5, 1.0],
'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
'n_estimators': [100]}
fit_params = {'eval_metric': 'logloss',
'early_stopping_rounds': 10,
'eval_set': [(x_valid, y_valid)]}
rs_clf = RandomizedSearchCV(clf, param_grid, n_iter=20,n_jobs=-1, verbose=2, cv=2,scoring='neg_log_loss', refit=False, random_state=42)
print("Randomized search..")
search_time_start = time.time()
rs_clf.fit(x_train, y_train,**fit_params)
print("Randomized search time:", time.time() - search_time_start)
best_score = rs_clf.best_score_
best_params = rs_clf.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))
Randomized search.. Fitting 2 folds for each of 20 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers. [Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 3.8min
Randomized search time: 233.99588918685913 Best score: -0.3964163938054521 Best params: colsample_bylevel: 1.0 colsample_bytree: 0.7 gamma: 0.25 learning_rate: 0.2 max_depth: 4 min_child_weight: 5.0 n_estimators: 100 reg_lambda: 1.0 silent: False subsample: 0.9
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 3.9min finished
import xgboost as xgb
params = {'objective' : 'binary:logistic',
'eval_metric' : 'logloss' ,
'colsample_bylevel': 1.0 ,
'colsample_bytree': 0.7,
'gamma': 0.25 ,
'learning_rate': 0.2,
'max_depth': 4,
'min_child_weight': 5.0,
'n_estimators': 100,
'reg_lambda': 1.0,
'silent': False,
'subsample': 0.9}
d_train = xgb.DMatrix(X_tr, label=y_train)
d_test = xgb.DMatrix(X_cr , label=y_cv)
watchlist = [(d_train, 'train'), (d_test, 'valid')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)
xgdmat = xgb.DMatrix(X_tr,y_train)
d_test = xgb.DMatrix(X_te, label=y_test)
predict_y = bst.predict(d_test)
print("The test log loss is:",log_loss(y_test, predict_y, labels=np.array([0, 1]), eps=1e-15))
[0] train-logloss:0.630701 valid-logloss:0.633702 Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping. Will train until valid-logloss hasn't improved in 20 rounds. [10] train-logloss:0.450203 valid-logloss:0.461041 [20] train-logloss:0.423017 valid-logloss:0.435876 [30] train-logloss:0.409264 valid-logloss:0.424347 [40] train-logloss:0.399364 valid-logloss:0.417086 [50] train-logloss:0.392632 valid-logloss:0.412569 [60] train-logloss:0.385623 valid-logloss:0.408459 [70] train-logloss:0.380434 valid-logloss:0.40503 [80] train-logloss:0.376415 valid-logloss:0.402892 [90] train-logloss:0.372699 valid-logloss:0.400642 [100] train-logloss:0.368354 valid-logloss:0.398424 [110] train-logloss:0.365952 valid-logloss:0.397567 [120] train-logloss:0.363346 valid-logloss:0.396579 [130] train-logloss:0.359589 valid-logloss:0.395004 [140] train-logloss:0.356627 valid-logloss:0.393626 [150] train-logloss:0.353823 valid-logloss:0.392205 [160] train-logloss:0.351752 valid-logloss:0.391396 [170] train-logloss:0.348841 valid-logloss:0.39009 [180] train-logloss:0.346663 valid-logloss:0.388971 [190] train-logloss:0.344432 valid-logloss:0.388092 [200] train-logloss:0.34291 valid-logloss:0.387582 [210] train-logloss:0.340718 valid-logloss:0.387115 [220] train-logloss:0.338635 valid-logloss:0.386149 [230] train-logloss:0.335902 valid-logloss:0.385041 [240] train-logloss:0.334184 valid-logloss:0.384558 [250] train-logloss:0.332432 valid-logloss:0.38429 [260] train-logloss:0.33064 valid-logloss:0.383657 [270] train-logloss:0.329221 valid-logloss:0.383348 [280] train-logloss:0.327746 valid-logloss:0.382928 [290] train-logloss:0.326367 valid-logloss:0.382502 [300] train-logloss:0.325176 valid-logloss:0.382358 [310] train-logloss:0.323829 valid-logloss:0.38203 [320] train-logloss:0.321791 valid-logloss:0.381645 [330] train-logloss:0.320545 valid-logloss:0.381565 [340] train-logloss:0.319565 valid-logloss:0.381679 Stopping. Best iteration: [324] train-logloss:0.321273 valid-logloss:0.381502 The test log loss is: 0.37883156924394973
predicted_y =np.array(predict_y>0.5,dtype=int)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
Total number of data points : 33000
# https://www.geeksforgeeks.org/understanding-python-pickling-example/
import pickle
# it's important to open the file in binary mode
pickle_file = open('drive/My Drive/Quora_assignment/Ass_Tfidf_W2V.pkl', 'wb')
# pickle.dump(object, file)
pickle.dump(X_train, pickle_file)
pickle.dump(y_train, pickle_file)
pickle.dump(X_cv, pickle_file)
pickle.dump(y_cv, pickle_file)
pickle.dump(X_test, pickle_file)
pickle.dump(y_test, pickle_file)
pickle_file.close()
import pickle
pickle_file = open('drive/My Drive/Quora_assignment/Ass_Tfidf_W2V.pkl', 'rb')
X_train = pickle.load(pickle_file)
y_train = pickle.load(pickle_file)
X_cv = pickle.load(pickle_file)
y_cv = pickle.load(pickle_file)
X_test = pickle.load(pickle_file)
y_test = pickle.load(pickle_file)
pickle_file.close()
print(X_train.shape,X_cv.shape,X_test.shape)
(44890, 794) (22110, 794) (33000, 794)
import xgboost as xgb
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 4
d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)
watchlist = [(d_train, 'train'), (d_test, 'valid')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)
xgdmat = xgb.DMatrix(X_train,y_train)
predict_y = bst.predict(d_test)
print("The test log loss is:",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
[0] train-logloss:0.684819 valid-logloss:0.684845 Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping. Will train until valid-logloss hasn't improved in 20 rounds. [10] train-logloss:0.61583 valid-logloss:0.616104 [20] train-logloss:0.564616 valid-logloss:0.565273 [30] train-logloss:0.525758 valid-logloss:0.52679 [40] train-logloss:0.496661 valid-logloss:0.498021 [50] train-logloss:0.473563 valid-logloss:0.475182 [60] train-logloss:0.455315 valid-logloss:0.457186 [70] train-logloss:0.440442 valid-logloss:0.442482 [80] train-logloss:0.428424 valid-logloss:0.430795 [90] train-logloss:0.418803 valid-logloss:0.421447 [100] train-logloss:0.41069 valid-logloss:0.413583 [110] train-logloss:0.403831 valid-logloss:0.40693 [120] train-logloss:0.398076 valid-logloss:0.401402 [130] train-logloss:0.393305 valid-logloss:0.396851 [140] train-logloss:0.38913 valid-logloss:0.392952 [150] train-logloss:0.385469 valid-logloss:0.389521 [160] train-logloss:0.382327 valid-logloss:0.386667 [170] train-logloss:0.379541 valid-logloss:0.384148 [180] train-logloss:0.377014 valid-logloss:0.381932 [190] train-logloss:0.374687 valid-logloss:0.379883 [200] train-logloss:0.372585 valid-logloss:0.378068 [210] train-logloss:0.370615 valid-logloss:0.376367 [220] train-logloss:0.368559 valid-logloss:0.374595 [230] train-logloss:0.366545 valid-logloss:0.372847 [240] train-logloss:0.364708 valid-logloss:0.371311 [250] train-logloss:0.363021 valid-logloss:0.369886 [260] train-logloss:0.36144 valid-logloss:0.368673 [270] train-logloss:0.359899 valid-logloss:0.367421 [280] train-logloss:0.358465 valid-logloss:0.366395 [290] train-logloss:0.357128 valid-logloss:0.365361 [300] train-logloss:0.355716 valid-logloss:0.364315 [310] train-logloss:0.354425 valid-logloss:0.363403 [320] train-logloss:0.353276 valid-logloss:0.362595 [330] train-logloss:0.352084 valid-logloss:0.361823 [340] train-logloss:0.351051 valid-logloss:0.361167 [350] train-logloss:0.349867 valid-logloss:0.36043 [360] train-logloss:0.348829 valid-logloss:0.359773 [370] train-logloss:0.347689 valid-logloss:0.359019 [380] train-logloss:0.346607 valid-logloss:0.358311 [390] train-logloss:0.345568 valid-logloss:0.357674 The test log loss is: 0.357054433715
predicted_y =np.array(predict_y>0.5,dtype=int)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
Total number of data points : 30000
# https://gist.github.com/wrwr/3f6b66bf4ee01bf48be965f60d14454d
import time
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
x_train, y_train, x_valid, y_valid, x_test, y_test = X_train , y_train , X_cv , y_cv , X_test , y_test # load datasets
clf = xgb.XGBClassifier()
param_grid = {
'silent': [False],
'max_depth': [1,2,3,4],
'learning_rate': [0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
'gamma': [0, 0.25, 0.5, 1.0],
'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
'n_estimators': [100]}
fit_params = {'eval_metric': 'logloss',
'early_stopping_rounds': 10,
'eval_set': [(x_valid, y_valid)]}
rs_clf = RandomizedSearchCV(clf, param_grid, n_iter=20,n_jobs=-1, verbose=2, cv=2,scoring='neg_log_loss', refit=False, random_state=42)
print("Randomized search..")
search_time_start = time.time()
rs_clf.fit(x_train, y_train,**fit_params)
print("Randomized search time:", time.time() - search_time_start)
best_score = rs_clf.best_score_
best_params = rs_clf.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
print('%s: %r' % (param_name, best_params[param_name]))
Randomized search.. Fitting 2 folds for each of 20 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers. [Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 20.7min
Randomized search time: 1294.6718440055847 Best score: -0.37062025775996893 Best params: colsample_bylevel: 0.9 colsample_bytree: 0.5 gamma: 0 learning_rate: 0.2 max_depth: 2 min_child_weight: 0.5 n_estimators: 100 reg_lambda: 50.0 silent: False subsample: 0.7
[Parallel(n_jobs=-1)]: Done 40 out of 40 | elapsed: 21.6min finished
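Since refit=False, rs_clf holds only the best parameters, not a fitted model. A sketch of fitting a final sklearn-wrapper model directly from them (the notebook instead copies the parameters into xgb.train below); the early-stopping arguments assume the older xgboost fit() API used throughout this notebook:
best_clf = xgb.XGBClassifier(**rs_clf.best_params_)
best_clf.fit(x_train, y_train,
             eval_set=[(x_valid, y_valid)],
             eval_metric='logloss',
             early_stopping_rounds=10,
             verbose=False)
cv_probs = best_clf.predict_proba(x_valid)[:, 1]   # P(is_duplicate = 1) on the CV split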
import xgboost as xgb
from sklearn.metrics import accuracy_score, log_loss
params = {'objective' : 'binary:logistic',
'eval_metric' : 'logloss' ,
'colsample_bylevel': 0.9 ,
'colsample_bytree': 0.5,
'gamma': 0 ,
'learning_rate': 0.2,
'max_depth': 2,
'min_child_weight': 0.5,
'n_estimators': 100,
'reg_lambda': 50,
'silent': False,
'subsample': 0.7}
d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_cv , label=y_cv)
watchlist = [(d_train, 'train'), (d_test, 'valid')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)
xgdmat = xgb.DMatrix(X_train, y_train)
d_test = xgb.DMatrix(X_test, label=y_test)   # re-point d_test at the held-out test set for final evaluation
predict_y = bst.predict(d_test)
print("The test log loss is:",log_loss(y_test, predict_y, labels=np.array([0, 1]), eps=1e-15))
[0] train-logloss:0.640877 valid-logloss:0.641782 Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping. Will train until valid-logloss hasn't improved in 20 rounds. [10] train-logloss:0.461213 valid-logloss:0.462357 [20] train-logloss:0.418264 valid-logloss:0.421439 [30] train-logloss:0.402297 valid-logloss:0.406791 [40] train-logloss:0.390498 valid-logloss:0.396445 [50] train-logloss:0.380775 valid-logloss:0.387824 [60] train-logloss:0.3734 valid-logloss:0.38167 [70] train-logloss:0.368185 valid-logloss:0.377755 [80] train-logloss:0.36413 valid-logloss:0.375146 [90] train-logloss:0.360052 valid-logloss:0.372069 [100] train-logloss:0.356098 valid-logloss:0.368832 [110] train-logloss:0.353312 valid-logloss:0.367147 [120] train-logloss:0.350631 valid-logloss:0.365369 [130] train-logloss:0.348349 valid-logloss:0.363831 [140] train-logloss:0.346149 valid-logloss:0.362506 [150] train-logloss:0.34407 valid-logloss:0.361372 [160] train-logloss:0.342202 valid-logloss:0.360198 [170] train-logloss:0.34043 valid-logloss:0.359128 [180] train-logloss:0.33888 valid-logloss:0.358736 [190] train-logloss:0.336996 valid-logloss:0.357674 [200] train-logloss:0.335251 valid-logloss:0.356733 [210] train-logloss:0.333718 valid-logloss:0.355938 [220] train-logloss:0.33239 valid-logloss:0.355529 [230] train-logloss:0.330993 valid-logloss:0.354991 [240] train-logloss:0.329721 valid-logloss:0.354817 [250] train-logloss:0.328159 valid-logloss:0.354091 [260] train-logloss:0.326777 valid-logloss:0.353394 [270] train-logloss:0.325536 valid-logloss:0.352998 [280] train-logloss:0.324304 valid-logloss:0.352504 [290] train-logloss:0.323037 valid-logloss:0.352028 [300] train-logloss:0.321881 valid-logloss:0.351794 [310] train-logloss:0.320591 valid-logloss:0.351145 [320] train-logloss:0.319404 valid-logloss:0.350745 [330] train-logloss:0.318148 valid-logloss:0.350618 [340] train-logloss:0.317022 valid-logloss:0.350512 [350] train-logloss:0.316006 valid-logloss:0.350387 [360] train-logloss:0.31491 valid-logloss:0.35006 [370] train-logloss:0.313916 valid-logloss:0.349903 [380] train-logloss:0.312867 valid-logloss:0.349528 [390] train-logloss:0.311883 valid-logloss:0.349432 [399] train-logloss:0.311026 valid-logloss:0.349117 The test log loss is: 0.34759434860614374
predicted_y = np.array(predict_y > 0.5, dtype=int)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
Total number of data points : 33000
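Optionally, the tuned booster can be saved so the 400-round training does not have to be repeated. A sketch; the path below is only an assumption, mirroring the Drive folder used for train.csv:
bst.save_model("drive/My Drive/Quora/xgb_tfidf_w2v_tuned.model")
# to restore later:
# loaded_bst = xgb.Booster()
# loaded_bst.load_model("drive/My Drive/Quora/xgb_tfidf_w2v_tuned.model")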
# Please compare all your models using Prettytable library
# http://zetcode.com/python/prettytable/
from prettytable import PrettyTable
# If you get a ModuleNotFoundError, install prettytable with: pip3 install prettytable
x = PrettyTable()
x.field_names = ["Vectorizer", "Model", "Hyper Parameter", "Log Loss"]
x.add_row(["NA", "RANDOM", "NA", 0.89])
x.add_row(["TFIDF", "Linear SVM", "Alpha:10**-5 Penalty:L1", 0.497])
x.add_row(["TFIDF", "Logistic Regression", "Alpha:10**-5 Penalty:L1", 0.504])
x.add_row(["TFIDF", "Logistic Regression", "Alpha:10**-5 Penalty:L2", 0.505])
x.add_row(["TFIDF-weighted W2V", "Linear SVM", "Alpha:0.0001 Penalty:L1", 0.48])
x.add_row(["TFIDF-weighted W2V", "Logistic Regression", "Alpha:1 Penalty:L1", 0.52])
x.add_row(["TFIDF", "XGBoost", "Default Parameters", 0.41])
x.add_row(["TFIDF", "XGBoost", '''colsample_bylevel: 1.0,colsample_bytree: 0.7,gamma: 0.25,learning_rate: 0.2,max_depth: 4,
min_child_weight: 5.0,n_estimators: 100,reg_lambda: 1.0,silent: False,subsample: 0.9''', 0.378])
x.add_row(["TFIDF-weighted W2V", "XGBoost", "Default Parameters", 0.357])
x.add_row(["TFIDF-weighted W2V", "XGBoost", ''''colsample_bylevel': 0.9 ,'colsample_bytree': 0.5,'gamma': 0 ,'learning_rate': 0.2,
'max_depth': 2,'min_child_weight': 0.5,'n_estimators': 100,'reg_lambda': 50,'silent': False,'subsample': 0.7''', 0.347])
print(x)
Vectorizer | Model | Hyper Parameter | Log Loss
---|---|---|---
NA | RANDOM | NA | 0.89
TFIDF | Linear SVM | Alpha:10**-5 Penalty:L1 | 0.497
TFIDF | Logistic Regression | Alpha:10**-5 Penalty:L1 | 0.504
TFIDF | Logistic Regression | Alpha:10**-5 Penalty:L2 | 0.505
TFIDF-weighted W2V | Linear SVM | Alpha:0.0001 Penalty:L1 | 0.48
TFIDF-weighted W2V | Logistic Regression | Alpha:1 Penalty:L1 | 0.52
TFIDF | XGBoost | Default Parameters | 0.41
TFIDF | XGBoost | colsample_bylevel: 1.0, colsample_bytree: 0.7, gamma: 0.25, learning_rate: 0.2, max_depth: 4, min_child_weight: 5.0, n_estimators: 100, reg_lambda: 1.0, silent: False, subsample: 0.9 | 0.378
TFIDF-weighted W2V | XGBoost | Default Parameters | 0.357
TFIDF-weighted W2V | XGBoost | colsample_bylevel: 0.9, colsample_bytree: 0.5, gamma: 0, learning_rate: 0.2, max_depth: 2, min_child_weight: 0.5, n_estimators: 100, reg_lambda: 50, silent: False, subsample: 0.7 | 0.347