import nltk
from nltk.stem import PorterStemmer  # explicit import instead of a wildcard
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
True
text = """For the second time this year, the coronavirus has found its way to the very top of British politics and forced Prime Minister Boris Johnson into self-quarantine. On Sunday night, Johnson tweeted that he must "self-isolate for two weeks, after being in contact with someone with Covid-19." "It doesn't matter that I've had the disease and I'm bursting with antibodies," he said in a Monday video message, adding that he "felt great" and would keep leading the UK virus response, as well as his government's plans to "#BuildBackBetter." Yet the optimism in that message, including the hashtag, masks the reality of exactly how enormous a week this is for the Johnson premiership, and how much of a blow it is for the PM to be trapped in solitude. Downing Street had spent the weekend dealing with the fallout from three straight days of chaos, in which two of his most senior advisers dramatically resigned following allegations that they had been briefing viciously against both Johnson himself and his fiancée, Carrie Symonds. The advisers in question, Lee Cain and Dominic Cummings, were among the most controversial and disliked members of Johnson's inner circle and have been accused by numerous people in government of being power hungry and self-interested. Before Johnson's self-quarantining, the turmoil in Downing Street had dominated five days of coverage in the UK, overshadowing what is arguably an even bigger headache for the PM than the coronavirus. Brexit really is now on the home stretch. The current transition period -- which was designed to prevent a sudden halt of the flow of goods, among other things, between the UK and the European Union -- ends on December 31. If the two sides are unable to strike a free trade agreement before that date, then the chaotic no-deal cliff edge -- which many fear would lead to shortages in things like food, toilet paper and medicine -- would be the new reality. 
Thursday's video conference of the EU27 is the penultimate time that the heads of government from the bloc's member states are scheduled to meet before the end of the year. The final meeting of 2020, on December 10, is considered too late in the day. As has been the case for months, a deal is in sight and the areas of agreement vastly outweigh the areas of disagreement. However, the key stumbling blocks that have prevented a deal remain. The first and most important is Brussels' insistence on a level playing field in exchange for access to the EU's single market. This, for some time, has been a red line for the UK, which objects to being bound by EU competition rules prohibiting how the government could use state aid to help the growth of British enterprise. The two other key sticking points -- fishing rights and the involvement of EU law in the arbitration of any deal -- are also difficult, though it's easier to see a path to agreement in both. It has for some time been assumed that when the talks reached the stages that they are at now (with legal texts on the table and a deal within grasp), the negotiators, who are civil servants acting on the mandate of their political leadership, would make way for political leaders to bridge the final gaps."""
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
nltk.sent_tokenize(text)
['For the second time this year, the coronavirus has found its way to the very top of British politics and forced Prime Minister Boris Johnson into self-quarantine.', 'On Sunday night, Johnson tweeted that he must "self-isolate for two weeks, after being in contact with someone with Covid-19."', '"It doesn\'t matter that I\'ve had the disease and I\'m bursting with antibodies," he said in a Monday video message, adding that he "felt great" and would keep leading the UK virus response, as well as his government\'s plans to "#BuildBackBetter."', 'Yet the optimism in that message, including the hashtag, masks the reality of exactly how enormous a week this is for the Johnson premiership, and how much of a blow it is for the PM to be trapped in solitude.', 'Downing Street had spent the weekend dealing with the fallout from three straight days of chaos, in which two of his most senior advisers dramatically resigned following allegations that they had been briefing viciously against both Johnson himself and his fiancée, Carrie Symonds.', "The advisers in question, Lee Cain and Dominic Cummings, were among the most controversial and disliked members of Johnson's inner circle and have been accused by numerous people in government of being power hungry and self-interested.", "Before Johnson's self-quarantining, the turmoil in Downing Street had dominated five days of coverage in the UK, overshadowing what is arguably an even bigger headache for the PM than the coronavirus.", 'Brexit really is now on the home stretch.', 'The current transition period -- which was designed to prevent a sudden halt of the flow of goods, among other things, between the UK and the European Union -- ends on December 31.', 'If the two sides are unable to strike a free trade agreement before that date, then the chaotic no-deal cliff edge -- which many fear would lead to shortages in things like food, toilet paper and medicine -- would be the new reality.', "Thursday's video conference of the EU27 
is the penultimate time that the heads of government from the bloc's member states are scheduled to meet before the end of the year.", 'The final meeting of 2020, on December 10, is considered too late in the day.', 'As has been the case for months, a deal is in sight and the areas of agreement vastly outweigh the areas of disagreement.', 'However, the key stumbling blocks that have prevented a deal remain.', "The first and most important is Brussels' insistence on a level playing field in exchange for access to the EU's single market.", 'This, for some time, has been a red line for the UK, which objects to being bound by EU competition rules prohibiting how the government could use state aid to help the growth of British enterprise.', "The two other key sticking points -- fishing rights and the involvement of EU law in the arbitration of any deal -- are also difficult, though it's easier to see a path to agreement in both.", 'It has for some time been assumed that when the talks reached the stages that they are at now (with legal texts on the table and a deal within grasp), the negotiators, who are civil servants acting on the mandate of their political leadership, would make way for political leaders to bridge the final gaps.']
from nltk.probability import FreqDist
freq = FreqDist(words)
freq.most_common(10)
[('the', 47), ('of', 20), ('and', 15), ('in', 15), ('to', 13), ('a', 11), ('that', 10), ('for', 10), ('is', 8), ('s', 7)]
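The ('s', 7) entry in the list above is a side effect of the `\w+` pattern: an apostrophe is not a word character, so a possessive such as "Johnson's" is split into two tokens. The sketch below reproduces this with the standard-library `re` module only, on the assumption that `re.findall` behaves like `RegexpTokenizer(r'\w+')` for this pattern. The sample string is made up for illustration.

```python
import re
from collections import Counter

# Made-up sample containing two possessives
sample = "Johnson's government and the EU's single market"

# Every maximal run of word characters becomes a token,
# so the text after each apostrophe turns into a separate 's' token
tokens = re.findall(r"\w+", sample)
print(tokens)

counts = Counter(tokens)
print(counts["s"])
```

If the stray 's' tokens are unwanted, one option is to drop single-letter tokens before counting, or to use a pattern that keeps apostrophes inside words.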
import matplotlib.pyplot as plt
freq.plot(100,cumulative=False)
plt.show()
Check for "Stop Words"
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords.words('english')
# With no argument, stopwords.words() concatenates the lists for every language;
# in this run the slice [620:680] happened to fall inside the English list.
print(stopwords.words()[620:680])
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at']
print(stopwords.fileids())
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
en_stops = set(stopwords.words('english'))
filterd = []
# Keep only the tokens that are not English stop words (case-sensitive match)
for x in words:
    if x not in en_stops:
        filterd.append(x)
filterd_dist = FreqDist(filterd)
filterd_dist.most_common(10)
[('Johnson', 6), ('The', 5), ('deal', 5), ('time', 4), ('self', 4), ('two', 4), ('would', 4), ('UK', 4), ('government', 4), ('agreement', 3)]
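('The', 5) survives the filter above because the comparison is case-sensitive and the NLTK English stop words are all lowercase. A minimal sketch of the difference follows. It uses a tiny hypothetical stop set (`toy_stops`) and a made-up token list rather than the notebook's data, and lowercases each token only for the membership test.

```python
from collections import Counter

# Hypothetical miniature stop list; the notebook uses stopwords.words('english')
toy_stops = {"the", "of", "a", "is"}
tokens = ["The", "deal", "is", "the", "end", "of", "a", "deal"]

# Case-sensitive check: capitalized 'The' is not in the set, so it is kept
case_sensitive = [t for t in tokens if t not in toy_stops]

# Lowercase only for the comparison, so the original casing of kept tokens survives
case_insensitive = [t for t in tokens if t.lower() not in toy_stops]

print(case_sensitive)
print(case_insensitive)
print(Counter(case_insensitive).most_common(1))
```

Comparing on `t.lower()` keeps proper nouns such as "Johnson" or "UK" in their original form while still removing sentence-initial stop words.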
Stemming Words
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmered = []
# Reduce every remaining token to its Porter stem
for x in filterd:
    stemmered.append(ps.stem(x))
stemmered_dist = FreqDist(stemmered)
stemmered_dist.most_common(10)
[('johnson', 6), ('deal', 6), ('the', 5), ('time', 4), ('self', 4), ('two', 4), ('would', 4), ('UK', 4), ('govern', 4), ('polit', 3)]
Lemmatizing Words
%pip install -U textblob
Requirement already up-to-date: textblob in /usr/local/lib/python3.6/dist-packages (0.15.3)
Requirement already satisfied, skipping upgrade: nltk>=3.1 in /usr/local/lib/python3.6/dist-packages (from textblob) (3.2.5)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from nltk>=3.1->textblob) (1.15.0)
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word
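def lemmatize_with_postag(sentence):
    """Lemmatize each word in `sentence` using its part-of-speech tag."""
    sent = TextBlob(sentence)
    # Map the first letter of a Penn Treebank tag to a WordNet POS code
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    # Tags outside these four classes fall back to noun ('n')
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)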
# Lemmatize
sentence = text
lemmatize_with_postag(sentence)
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-date!
"For the second time this year the coronavirus have find it way to the very top of British politics and force Prime Minister Boris Johnson into self-quarantine On Sunday night Johnson tweet that he must self-isolate for two week after be in contact with someone with Covid-19 It do n't matter that I 've have the disease and I 'm burst with antibody he say in a Monday video message add that he felt great and would keep lead the UK virus response as well a his government 's plan to BuildBackBetter Yet the optimism in that message include the hashtag mask the reality of exactly how enormous a week this be for the Johnson premiership and how much of a blow it be for the PM to be trap in solitude Downing Street have spend the weekend deal with the fallout from three straight day of chaos in which two of his most senior adviser dramatically resign follow allegation that they have be brief viciously against both Johnson himself and his fiancée Carrie Symonds The adviser in question Lee Cain and Dominic Cummings be among the most controversial and disliked member of Johnson 's inner circle and have be accuse by numerous people in government of be power hungry and self-interested Before Johnson 's self-quarantining the turmoil in Downing Street have dominate five day of coverage in the UK overshadow what be arguably an even big headache for the PM than the coronavirus Brexit really be now on the home stretch The current transition period which be design to prevent a sudden halt of the flow of good among other thing between the UK and the European Union end on December 31 If the two side be unable to strike a free trade agreement before that date then the chaotic no-deal cliff edge which many fear would lead to shortage in thing like food toilet paper and medicine would be the new reality Thursday 's video conference of the EU27 be the penultimate time that the head of government from the bloc 's member state be schedule to meet before the end of the year The final meeting of 
2020 on December 10 be consider too late in the day As have be the case for month a deal be in sight and the area of agreement vastly outweigh the area of disagreement However the key stumbling block that have prevent a deal remain The first and most important be Brussels ' insistence on a level playing field in exchange for access to the EU 's single market This for some time have be a red line for the UK which object to be bind by EU competition rule prohibit how the government could use state aid to help the growth of British enterprise The two other key stick point fish right and the involvement of EU law in the arbitration of any deal be also difficult though it 's easy to see a path to agreement in both It have for some time be assume that when the talk reach the stage that they be at now with legal text on the table and a deal within grasp the negotiator who be civil servant act on the mandate of their political leadership would make way for political leader to bridge the final gap"
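The tag mapping inside `lemmatize_with_postag` can be looked at on its own. The sketch below isolates it in a helper called `penn_to_wordnet` (a name introduced here for illustration, not part of TextBlob or NLTK) and runs it on a few Penn Treebank tags, with no NLP libraries required.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    tag_dict = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # Only the first letter of the tag decides the word class
    return tag_dict.get(tag[0], "n")

# Adjective, plural noun, past-tense verb, adverb, determiner
for tag in ["JJ", "NNS", "VBD", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Any tag outside the four mapped classes, such as the determiner tag `DT`, is lemmatized as a noun, which in practice usually leaves the word unchanged.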
new = lemmatize_with_postag(sentence)
new_words = tokenizer.tokenize(new)
new_list = []
# Remove stop words from the lemmatized tokens
for x in new_words:
    if x not in en_stops:
        new_list.append(x)
freq_new = FreqDist(new_list)
freq_new.most_common(10)
[('Johnson', 6), ('deal', 6), ('The', 5), ('time', 4), ('self', 4), ('two', 4), ('would', 4), ('UK', 4), ('government', 4), ('day', 3)]
Sentiment Analysis
from nltk import sentiment
from nltk.sentiment.vader import SentimentIntensityAnalyzer
def sentiment_scores(sentence):
    # Create a VADER analyzer and score the text
    sid_obj = SentimentIntensityAnalyzer()
    sentiment_dict = sid_obj.polarity_scores(sentence)
    print("Overall sentiment dictionary is : ", sentiment_dict)
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative")
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral")
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
    print("Sentence Overall Rated As", end=" ")
    # Classify on the compound score using thresholds of +/-0.05
    if sentiment_dict['compound'] >= 0.05:
        print("Positive")
    elif sentiment_dict['compound'] <= -0.05:
        print("Negative")
    else:
        print("Neutral")
import nltk
nltk.download('vader_lexicon')
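# Note: text2 is defined in the "Bag of Words" section below; run that cell first.
if __name__ == "__main__":
    print("\n1st statement :")
    sentence = new  # lemmatized version of text
    sentiment_scores(sentence)
    print("\n2nd Statement :")
    sentence = text  # original first article
    sentiment_scores(sentence)
    print("\n3rd Statement :")
    sentence = text2  # second article
    sentiment_scores(sentence)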
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!

1st statement :
Overall sentiment dictionary is : {'neg': 0.072, 'neu': 0.846, 'pos': 0.082, 'compound': 0.7547}
sentence was rated as 7.199999999999999 % Negative
sentence was rated as 84.6 % Neutral
sentence was rated as 8.200000000000001 % Positive
Sentence Overall Rated As Positive

2nd Statement :
Overall sentiment dictionary is : {'neg': 0.074, 'neu': 0.848, 'pos': 0.078, 'compound': 0.3727}
sentence was rated as 7.3999999999999995 % Negative
sentence was rated as 84.8 % Neutral
sentence was rated as 7.8 % Positive
Sentence Overall Rated As Positive

3rd Statement :
Overall sentiment dictionary is : {'neg': 0.097, 'neu': 0.815, 'pos': 0.088, 'compound': -0.7655}
sentence was rated as 9.700000000000001 % Negative
sentence was rated as 81.5 % Neutral
sentence was rated as 8.799999999999999 % Positive
Sentence Overall Rated As Negative
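The overall label depends only on the compound score and the two 0.05 thresholds used in `sentiment_scores`. The sketch below pulls that rule into a small helper, `label_from_compound` (a hypothetical name, not part of NLTK), and applies it to the three compound scores reported above. It also shows how rounding removes floating-point noise such as 7.199999999999999 from the printed percentages.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a Positive/Negative/Neutral label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound scores copied from the output above
scores = [("lemmatized text", 0.7547), ("text", 0.3727), ("text2", -0.7655)]
for name, compound in scores:
    print(f"{name}: {label_from_compound(compound)}")

# Rounding the percentage hides binary floating-point artifacts
neg_percent = round(0.072 * 100, 1)
print(neg_percent, "% Negative")
```

Returning the label instead of printing it makes the rule easy to reuse, for example when scoring each sentence of an article separately.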
Bag of Words
text2 = """President Donald Trump is facing a barrage of calls to permit potentially life-saving transition talks between his health officials and incoming President-elect Joe Biden's aides on a fast-worsening pandemic he is continuing to ignore in his obsessive effort to discredit an election that he clearly lost. The increasingly urgent pleas are coming from inside his administration, the President-elect's team and independent public health experts as Covid-19 cases rage out of control countrywide, claiming more than 1,000 US lives a day. More than 246,000 Americans have now died from the disease, and a bitter winter lies ahead even amid encouraging news such as Monday's announcement that a vaccine developed by Moderna is demonstrating a high success rate in early clinical trials, the second such positive vaccine news in about a week. But instead of listening or mobilizing to tackle what some medical experts warn is becoming a "humanitarian" crisis, Trump spent the weekend during which the US passed 11 million infections amplifying lies and misinformation about his election loss. At one point, he appeared to acknowledge Sunday in a tweet that Biden won, before backtracking with a stream of defiance on Twitter. This came as the nation's top infectious disease expert, Dr. Anthony Fauci, said on CNN's "State of the Union" Sunday that "of course it would be better if we could start working with" the Biden team that will take office on January 20. Biden's incoming White House chief of staff Ron Klain said Sunday that the President-elect's team had been unable to talk to current top health officials like Fauci about the pandemic owing to Trump's refusal to trigger ascertainment — the formal process of opening a transition to a new administration. "Joe Biden's going to become president of the United States in the midst of an ongoing crisis. 
That has to be a seamless transition," Klain said on NBC's "Meet the Press," adding that while the new administration planned to contact top pharmaceutical firms making the vaccine like Pfizer, it was particularly key to get in touch with Department of Health and Human Services officials responsible for rolling it out in the coming months. But the official who is currently most influential with the President, Dr. Scott Atlas, who critics say favors a herd immunity approach that could lead to thousands of deaths, wrote an inflammatory tweet on Sunday that exemplified the White House's contempt for unifying leadership during the pandemic. Atlas called on the people of Michigan to "rise up" against new Covid-19 restrictions introduced in schools, theaters and restaurants by Democratic Gov. Gretchen Whitmer -- who was recently the target of an alleged domestic terrorism kidnapping plot."""
words2 = tokenizer.tokenize(text2)
filtered2 = [w for w in words2 if w not in en_stops]
stemmered2 = [ps.stem(w) for w in filtered2]
freq_words2 = FreqDist(stemmered2)
import pandas as pd
df = pd.DataFrame([dict(stemmered_dist),dict(freq_words2)])
df.fillna(0,inplace=True)
df.head()
 | for | second | time | year | coronaviru | found | way | top | british | polit | forc | prime | minist | bori | johnson | self | quarantin | On | sunday | night | tweet | must | isol | two | week | contact | someon | covid | 19 | It | matter | I | diseas | burst | antibodi | said | monday | video | messag | ad | ... | particularli | get | touch | depart | human | servic | roll | influenti | scott | atla | critic | say | favor | herd | immun | approach | thousand | death | wrote | inflammatori | exemplifi | contempt | unifi | michigan | rise | restrict | introduc | school | theater | restaur | democrat | gov | gretchen | whitmer | recent | target | domest | terror | kidnap | plot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | 4.0 | 2.0 | 2.0 | 1.0 | 2.0 | 1 | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 | 4.0 | 2.0 | 1.0 | 1 | 1.0 | 1 | 1.0 | 1.0 | 4.0 | 2 | 1 | 1.0 | 1 | 1 | 2.0 | 1.0 | 2.0 | 1 | 1.0 | 1.0 | 1 | 1 | 2.0 | 2.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 0.0 | 2 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 | 2 | 2 | 0.0 | 0.0 | 0.0 | 2 | 0.0 | 0.0 | 3 | 1 | 0.0 | 0.0 | 1 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2 rows × 409 columns
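The table above is a bag-of-words matrix: one row per document, one column per stem, and 0 wherever a stem does not occur in a document. The sketch below builds the same kind of aligned rows in plain Python. The two count dictionaries are toy values invented for illustration, not the notebook's real frequencies.

```python
# Toy stem counts for two documents (made-up numbers)
counts_a = {"deal": 6, "govern": 4, "covid": 1}
counts_b = {"covid": 2, "vaccin": 3}

# Shared vocabulary: stems of the first document, then new stems of the second
vocab = list(dict.fromkeys(list(counts_a) + list(counts_b)))

# One row per document; a missing stem counts as 0, like fillna(0) on the DataFrame
row_a = [counts_a.get(term, 0) for term in vocab]
row_b = [counts_b.get(term, 0) for term in vocab]

print(vocab)
print(row_a)
print(row_b)
```

Both rows have the same length and column order, which is what allows them to be compared as vectors.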
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
array([[1. , 0.14682763], [0.14682763, 1. ]])
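Cosine similarity is the dot product of two count vectors divided by the product of their lengths. The sketch below computes it by hand for two toy vectors (made-up counts, not the notebook's data) to show what the off-diagonal entry of the matrix above measures. It assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy bag-of-words rows over a shared four-term vocabulary
row_a = [6, 4, 1, 0]
row_b = [0, 0, 2, 3]

print(cosine(row_a, row_b))  # the vectors overlap in one term only
print(cosine(row_a, row_a))  # a vector compared with itself
```

Read this way, the diagonal of the matrix above is each article compared with itself, and the off-diagonal value of about 0.147 indicates that the two articles share only a small part of their stemmed vocabulary.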