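This shallow model scores 0.88160, considerably better than the 0.84 I got earlier by following the official tutorial. The classifier alone is unlikely to explain a gap that large. The improvement most likely comes from this line in BOW_LR.py: lab_fea = select_feature('../../data/feature_chi.txt', max_feature)["1"]
, which selects 1000 words to use as features.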
%load_ext autoreload
%autoreload 2
import pandas as pd
"""
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised,
and no DataFrame will be returned.
If False, then these “bad lines” will be dropped from the DataFrame that is returned.
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True,
a warning for each “bad line” will be output.
"""
# Preprocessing Training Data
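# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions pass on_bad_lines="skip" instead of error_bad_lines=False.
train = pd.read_csv("../Sentiment/data/labeledTrainData.tsv", header=0, delimiter='\t', quoting=3, error_bad_lines=False)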
train.head()
 | id | sentiment | review |
---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
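The ids and reviews in the table above still carry their surrounding double quotes. That follows from quoting=3 in the read_csv call, which is the numeric value of csv.QUOTE_NONE. A minimal sketch using only the standard csv module (the two-line sample is made up for illustration) shows the difference:

```python
import csv
import io

# Made-up sample in the same layout as labeledTrainData.tsv (illustrative only).
sample = 'id\tsentiment\treview\n"1_8"\t1\t"Nice film"\n'

def read_tsv(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))

rows_none = read_tsv(sample, csv.QUOTE_NONE)        # quotes are kept as data
rows_minimal = read_tsv(sample, csv.QUOTE_MINIMAL)  # quotes are stripped

print(csv.QUOTE_NONE)      # 3
print(rows_none[1])        # ['"1_8"', '1', '"Nice film"']
print(rows_minimal[1])     # ['1_8', '1', 'Nice film']
```

With QUOTE_NONE the quote characters stay in the text, which is harmless here because the later cleaning step replaces every non-letter with a space.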
train.shape
(25000, 3)
train['review'].size
25000
To speed things up, use only the first 100 samples here.
train = train[:100]
train.shape
(100, 3)
num_reviews = train['review'].size
num_reviews
100
Cleaning and parsing the training set movie reviews
ls
Part 1 Shallow Model.ipynb feature_chi.txt
__init__.py utils/
from utils.TextPreprocess import review_to_words
clean_train_reviews = []
for i in range(0, num_reviews):
    clean_train_reviews.append(review_to_words(train['review'][i]))
To see what is going on, let's step through review_to_words:
raw_review = train['review'][0]
raw_review
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
# 1. Remove HTML
review_text = BeautifulSoup(raw_review, "lxml")
review_text
<html><body><p>"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br/><br/>Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br/><br/>The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br/><br/>Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br/><br/>Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."</p></body></html>
# 1. Remove HTML
review_text = BeautifulSoup(raw_review, "lxml").get_text()
review_text
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'
# 2. Remove non-letters
letters_only = re.sub("[^a-zA-Z]", " ", review_text)
letters_only
' With all this stuff going down at the moment with MJ i ve started listening to his music watching the odd documentary here and there watched The Wiz and watched Moonwalker again Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent Moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord Why he wants MJ dead so bad is beyond me Because MJ overheard his plans Nah Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence Also the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line this movie is for people who like MJ on one level or another which i think is most people If not then stay away It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty Well with all the attention i 
ve gave this subject hmmm well i don t know because people can be different behind closed doors i know this for a fact He is either an extremely nice but stupid guy or one of the most sickest liars I hope he is not the latter '
# 3. Convert to lower case, split into individual words
words = letters_only.lower().split()
words[:20]
['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the']
# 4. In Python, searching a set is much faster than searching
# a list, so convert the stop words to a set
stops = set(stopwords.words("english"))
list(stops)[:20]
['out', 'why', 'because', 'that', 'other', 'themselves', 'not', 'just', 'this', 'so', 't', 'now', 'itself', 'most', 'didn', 'did', 'ourselves', 'i', 'very', 'which']
# 5. Remove stop words
meaningful_words = [w for w in words if w not in stops]
meaningful_words[:10]
['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary']
# 6. Join the words back into one string separated by space,
# and return the result.
print(" ".join( meaningful_words ))
stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)
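The six steps can also be tried without bs4 or nltk installed. The sketch below is a simplified stand-in, not the function used in this notebook: it strips tags with a regular expression instead of BeautifulSoup, and STOPS is a tiny hypothetical stop-word set rather than the NLTK English list.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer).
STOPS = frozenset({"the", "is", "a", "this", "and", "of", "to", "i", "it"})

def review_to_words_sketch(raw_review, stops=STOPS):
    """Simplified version of the cleaning pipeline, standard library only."""
    # 1. Remove HTML tags (regex stand-in for BeautifulSoup(...).get_text())
    no_tags = re.sub(r"<[^>]+>", " ", raw_review)
    # 2. Replace every non-letter with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # 3. Lower-case and split into words
    words = letters_only.lower().split()
    # 4-6. Drop stop words and join the rest with single spaces
    return " ".join(w for w in words if w not in stops)

cleaned = review_to_words_sketch('"This is a GREAT movie.<br /><br />I loved it!"')
print(cleaned)  # great movie loved
```

One difference worth knowing: get_text() removes tags without inserting a space, so two words separated only by a tag would be joined, whereas this sketch replaces each tag with a space.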
Back to runBow.py:
print("Cleaning and parsing the training set movie reviews...")
clean_train_reviews = []
for i in range(0, num_reviews):
    clean_train_reviews.append(review_to_words(train["review"][i]))
Cleaning and parsing the training set movie reviews...
Prepare the test data
test = pd.read_csv("../Sentiment/data/testData.tsv", header = 0, delimiter = "\t", quoting = 3)
test = test[:100]
num_reviews = len(test["review"])
clean_test_reviews = []
print("Cleaning and parsing the test set movie reviews...")
for i in range(0, num_reviews):
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)
Cleaning and parsing the test set movie reviews...
test.shape
(100, 2)
Next comes classification. For that we need to go into BOW_LR.py and look at the class BagOfWords(object) defined there.
'''
Train and Test
'''
import BOW_LR
bow = BOW_LR.BagOfWords(vocab = True, tfidf = True, max_feature = 19000)
bow.train_lr(clean_train_reviews, list(train["sentiment"]), C = 1)
result = bow.test_lr(clean_test_reviews)
print(result)
print("output...")
# The backslash path separator below is Windows-specific.
# The with-block closes the file even if a write fails.
with open("result\\BOW_chi_tfidf.csv", 'w') as out:
    out.write("\"id\"" + "," + "\"sentiment\"")
    out.write("\n")
    for i, key in enumerate(list(test["id"])):
        out.write(str(key) + "," + str(result[i]) + "\n")
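A portable way to write the same file is sketched below. It assumes only the layout used in the cell above (a quoted header line, then one id,probability pair per line). write_submission is a hypothetical helper name, and the ids and probabilities are made up.

```python
import tempfile
from pathlib import Path

def write_submission(out_dir, ids, probs, name="BOW_chi_tfidf.csv"):
    """Write id,probability rows under out_dir using OS-independent paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create result/ if it is missing
    path = out_dir / name
    with path.open("w", encoding="utf-8") as f:
        f.write('"id","sentiment"\n')
        for key, prob in zip(ids, probs):
            # ids read with quoting=3 already contain their double quotes
            f.write(f"{key},{prob}\n")
    return path

# Demo in a temporary directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    path = write_submission(Path(tmp) / "result", ['"a_1"', '"b_2"'], [0.91, 0.07])
    lines = path.read_text(encoding="utf-8").splitlines()

print(lines)  # ['"id","sentiment"', '"a_1",0.91', '"b_2",0.07']
```

Building the path with pathlib avoids hard-coding either separator, so the same code runs on Windows, Linux and macOS.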
There is no need to run the Train and Test code above; its output path is Windows-style. Let's step into BOW_LR.py first. The complete file follows.
from utils.feature_select import select_feature
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from scipy.sparse import bsr_matrix
import numpy as np

class BagOfWords(object):
    def __init__(self, vocab = False, tfidf = False, max_feature = 1000):
        # Optionally restrict the vocabulary to the pre-selected feature words
        lab_fea = None
        if vocab:
            print("select features...")
            lab_fea = select_feature('data\\feature_chi.txt', max_feature)["1"]
        # Use TF-IDF weights or raw counts, depending on the tfidf flag
        self.vectorizer = None
        if tfidf:
            self.vectorizer = TfidfVectorizer(analyzer = "word",
                                              tokenizer = None,
                                              preprocessor = None,
                                              stop_words = None,
                                              vocabulary = lab_fea,
                                              max_features = max_feature)
        else:
            self.vectorizer = CountVectorizer(analyzer = "word",
                                              tokenizer = None,
                                              preprocessor = None,
                                              stop_words = None,
                                              vocabulary = lab_fea,
                                              max_features = max_feature)
        self.lr = None

    def train_lr(self, train_data, lab_data, C = 1.0):
        # Fit the vectorizer on the training text, then fit the classifier
        train_data_features = self.vectorizer.fit_transform(train_data)
        train_data_features = bsr_matrix(train_data_features)
        print(train_data_features.shape)
        print("Training the logistic regression...")
        self.lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
        self.lr = self.lr.fit(train_data_features, lab_data)

    def test_lr(self, test_data):
        # Reuse the fitted vectorizer; return the probability of class 1
        test_data_features = self.vectorizer.transform(test_data)
        test_data_features = bsr_matrix(test_data_features)
        result = self.lr.predict_proba(test_data_features)[:,1]
        return result

    def validate_lr(self, train_data, lab_data, C = 1.0):
        # Mean ROC AUC over 10-fold cross-validation
        train_data_features = self.vectorizer.fit_transform(train_data)
        train_data_features = bsr_matrix(train_data_features)
        lab_data = np.array(lab_data)
        print("start k-fold validate...")
        lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
        cv = np.mean(cross_val_score(lr, train_data_features, lab_data, cv=10, scoring='roc_auc'))
        return cv
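validate_lr scores each fold with scoring='roc_auc', and test_lr returns probabilities rather than hard labels, which is what a ranking metric such as AUC needs. As a rough illustration of what that metric measures (a pairwise sketch in plain Python, not scikit-learn's implementation), AUC is the fraction of positive/negative pairs in which the positive example gets the higher score:

```python
def roc_auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

This quadratic loop is only meant to make the definition concrete; it would be slow on the full 25000-review set.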
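Next, select_feature. I have not worked out exactly what this part is meant to do, or where the feature_chi.txt file comes from. Its output is a list of 1000 words, though, so it is presumably what selects the 1000 words used as feature dimensions.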
import heapq

def select_feature(filePath, k):
    # Each line is parsed as: <label> <word>:<score> <word>:<score> ...
    lab_fea = {}
    with open(filePath, 'r') as read:
        for line in read:
            line_arr = line.strip().split()
            if len(line_arr) - 1 <= k:
                # k or fewer candidates: keep all of them
                lab_fea[line_arr[0]] = [kv.split(':')[0] for kv in line_arr[1 : ]]
            else:
                # Min-heap of size k holding the k highest-scoring words so far
                heap = []
                for kv in line_arr[1 : ]:
                    key, val = kv.split(':')
                    if len(heap) < k:
                        heapq.heappush(heap, (float(val), key))
                    else:
                        if float(val) > heap[0][0]:
                            heapq.heappop(heap)
                            heapq.heappush(heap, (float(val), key))
                # Pop the survivors in ascending score order
                lab_fea[line_arr[0]] = [heapq.heappop(heap)[1] for i in range(len(heap))]
    return lab_fea
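Reading the function, each line of the input file appears to hold a label followed by word:score pairs, and the function keeps the k highest-scoring words per label. A compact sketch of the same idea with heapq.nlargest follows; the sample line is invented, and top_k_features is a hypothetical name.

```python
import heapq

def top_k_features(line, k):
    """Return (label, k highest-scoring words) for one 'label w:s w:s ...' line."""
    label, *pairs = line.strip().split()
    scored = []
    for kv in pairs:
        word, score = kv.rsplit(":", 1)
        scored.append((float(score), word))
    best = heapq.nlargest(k, scored)  # highest score first
    return label, [word for _, word in best]

# Invented line in the format the parser seems to expect.
sample_line = "1 bad:310.5 great:250.0 movie:3.2 worst:402.1 film:1.1"
label, words = top_k_features(sample_line, 3)
print(label, words)  # 1 ['worst', 'bad', 'great']
```

The selected set is the same as with the manual heap, but the order differs: select_feature returns words in ascending score order, nlargest in descending order. Because the list is later passed as vocabulary, the order decides which column each word occupies.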
lab_fea = select_feature('feature_chi.txt', 1000)['1']
len(lab_fea)
1000
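The origin of feature_chi.txt is not explained anywhere in this notebook. Its name hints at chi-square feature scores, but that is only a guess. Under that assumption, a per-word score could be computed from a 2x2 table of document counts, as in this sketch (the function and argument names are hypothetical):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/class contingency table.

    n11: class-1 documents containing the word
    n10: class-0 documents containing the word
    n01: class-1 documents without the word
    n00: class-0 documents without the word
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0

# Made-up counts: a word concentrated in class 1 versus one spread evenly.
informative = chi_square(40, 10, 10, 40)
uninformative = chi_square(25, 25, 25, 25)
print(informative, uninformative)  # 36.0 0.0
```

A word that appears equally often in both classes scores 0, while a word concentrated in one class scores high, which is the kind of ranking a top-k selection would then use.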
Continuing from BagOfWords above, let's first walk through what train_lr does:
max_feature = 1000
vectorizer = TfidfVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             vocabulary = lab_fea,
                             max_features = max_feature)

def train_lr(self, train_data, lab_data, C = 1.0):
    train_data_features = self.vectorizer.fit_transform(train_data)
    train_data_features = bsr_matrix(train_data_features)
    print(train_data_features.shape)
    print("Training the logistic regression...")
    self.lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=C, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
    self.lr = self.lr.fit(train_data_features, lab_data)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features
<100x1000 sparse matrix of type '<class 'numpy.float64'>' with 2607 stored elements in Compressed Sparse Row format>
train_data_features = bsr_matrix(train_data_features)
train_data_features
<100x1000 sparse matrix of type '<class 'numpy.float64'>' with 2607 stored elements (blocksize = 1x1) in Block Sparse Row format>
train_data_features.shape
(100, 1000)
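The matrix has exactly one column per word in lab_fea. According to the scikit-learn documentation, max_features is ignored when a vocabulary is supplied, so the width comes from the vocabulary list. The plain-Python sketch below mimics what TfidfVectorizer computes with its default settings as I understand them (smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row). It splits on whitespace, whereas the real default tokenizer keeps only tokens of two or more word characters. The documents and vocabulary are made up.

```python
import math
from collections import Counter

def tfidf_fixed_vocab(docs, vocabulary):
    """Dense TF-IDF rows restricted to a fixed vocabulary (illustrative only)."""
    vocab_set = set(vocabulary)
    # Term counts per document, ignoring words outside the vocabulary
    counts = [Counter(w for w in doc.split() if w in vocab_set) for doc in docs]
    n = len(docs)
    # Document frequency: number of documents containing each word
    df = Counter()
    for c in counts:
        df.update(c.keys())
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for c in counts:
        row = [c[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

docs = ["great movie great cast", "bad movie", "dull plot"]
vocab = ["great", "bad", "movie"]
rows = tfidf_fixed_vocab(docs, vocab)
print(len(rows), len(rows[0]))  # 3 3
```

A document containing none of the vocabulary words becomes an all-zero row, which is why only 2607 of the 100 x 1000 entries above are stored.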
# target
lab_data = np.array(list(train["sentiment"]))
lr = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
lr = lr.fit(train_data_features, lab_data)
Predict on the test data:
# prepare test data
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = bsr_matrix(test_data_features)
# predict
result = lr.predict_proba(test_data_features)
result[0]
array([ 0.49322008, 0.50677992])
These two values should be the probabilities of class 0 and class 1 respectively; we want the probability of class 1.
result = lr.predict_proba(test_data_features)[:, 1]
result2 = lr.predict(test_data_features)
result2
array([1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])
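For a binary logistic regression the two outputs are closely related: the class-1 probability is the sigmoid of a linear score, and predict returns 1 when that probability exceeds 0.5. A minimal sketch in plain Python with made-up weights (not the fitted model above):

```python
import math

def predict_proba_row(weights, bias, x):
    """Return [P(class 0), P(class 1)] for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]

def predict_row(weights, bias, x):
    """Hard label: 1 if the class-1 probability is above 0.5, else 0."""
    return 1 if predict_proba_row(weights, bias, x)[1] > 0.5 else 0

# Made-up two-feature model.
weights, bias = [1.5, -2.0], 0.1
proba = predict_proba_row(weights, bias, [0.4, 0.1])
label = predict_row(weights, bias, [0.4, 0.1])
print(proba, label)
```

Keeping the probability column, as result does, preserves the ranking of the reviews, whereas the hard labels in result2 discard it.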
test.head()
 | id | sentiment | review |
---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |