import pandas as pd
import plotly.express as px
import math
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from typing import Dict, List
TRAIN_PATH = "../data/raw/train2.tsv"
VAL_PATH = "../data/raw/val2.tsv"
TEST_PATH = "../data/raw/test2.tsv"
# Column names shared by all three splits (the raw TSV files have no header row)
COLUMNS = [
    "id", "statement_json", "label", "statement", "subject", "speaker",
    "speaker_title", "state_info", "party_affiliation", "barely_true_count",
    "false_count", "half_true_count", "mostly_true_count", "pants_fire_count",
    "context", "justification",
]
train_df = pd.read_csv(TRAIN_PATH, sep="\t", names=COLUMNS)
val_df = pd.read_csv(VAL_PATH, sep="\t", names=COLUMNS)
test_df = pd.read_csv(TEST_PATH, sep="\t", names=COLUMNS)
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 500
len(train_df)
10267
len(val_df)
1284
len(test_df)
1283
train_df.columns
Index(['id', 'statement_json', 'label', 'statement', 'subject', 'speaker', 'speaker_title', 'state_info', 'party_affiliation', 'barely_true_count', 'false_count', 'half_true_count', 'mostly_true_count', 'pants_fire_count', 'context', 'justification'], dtype='object')
train_df.head()
| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2635.json | false | Says the Annies List political group supports third-trimester abortions on demand. | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer | That's a premise that he fails to back up. Annie's List makes no bones about being comfortable with candidates who oppose further restrictions on late-term abortions. Then again, this year its backing two House candidates who voted for more limits. |
1 | 1 | 10540.json | half-true | When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. | Surovell said the decline of coal "started when natural gas took off That started to begin in President (George W. ) Bushs administration. "No doubt, natural gas has been gaining ground on coal in generating electricity. The trend started in the 1990s but clearly gained speed during the Bush administration when the production of natural gas -- a competitor of coal -- picked up. But analysts give little credit or blame to Bush for that trend. They note that other factors, such as technologic... |
2 | 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran." | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver | Obama said he would have voted against the amendment if he had been present. So though Clinton may have "agreed" with McCain on the issue, they did not technically vote the same way on it. To say that voting for Kyl-Lieberman is "giving George Bush the benefit of the doubt on Iran" remains a contentious issue. But Obama's main point is that Clinton and McCain were on the same side, and that is correct. |
3 | 3 | 1123.json | false | Health care reform legislation is likely to mandate free sex change surgeries. | health-care | blog-posting | NaN | NaN | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release | The release may have a point that Mikulskis comment could open the door to "medically necessary" coverage which conceivably may include sex-change operations. But it's unclear whether her amendment will remain in the legislation, and there's nothing specific in the legislation on sex-change procedures and nothing else solid that indicates such coverage will be provided. The news release cherry-picked a few fleeting references to gender and sexual orientation in completely unrelated contexts ... |
4 | 4 | 9028.json | half-true | The economic turnaround started at the end of my term. | economy,jobs | charlie-crist | NaN | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN | Crist said that the economic "turnaround started at the end of my term. "During Crists last year in office, Floridas economy experienced notable gains in personal income and industrial production, and more marginal improvements in the unemployment rate and in payroll employment. But GDP didnt grow again until Scott took office. Economists say Crist deserves some credit for the economic turnaround because he accepted federal stimulus dollars, but they add that any state is inevitably buffeted... |
# Drop rows with no label
train_df.dropna(subset=["label"], inplace=True)
len(train_df)
10267
# Normalized distribution of labels (roughly balanced, except that "pants-fire", the label for flagrantly false statements, is about half as frequent as the others)
train_df.label.value_counts(normalize=True)
half-true      0.206682
false          0.194604
mostly-true    0.191487
true           0.163826
barely-true    0.161391
pants-fire     0.082010
Name: label, dtype: float64
label_ratios = train_df.label.value_counts(normalize=True)
px.bar(label_ratios, x=label_ratios.index, y=label_ratios.values, labels={"index": "label", "y": "ratios"}, title="Label Distribution")
# Notice a huge number of speaker titles
train_df.speaker_title.nunique()
1187
train_df.speaker_title[train_df.speaker_title.notnull()]
0        State representative
1        State delegate
2        President
5        Wisconsin Assembly speaker
7        President
         ...
10255    President-Elect
10257    Senator
10258    State Senator, 8th District
10259    Senior editor, The Atlantic
10266    chairman of the Republican National Committee
Name: speaker_title, Length: 7367, dtype: object
# A lot of repetition in speaker_title - not canonicalized
train_df.speaker_title.value_counts()[:20]
President                        497
U.S. Senator                     480
Governor                         391
President-Elect                  274
U.S. senator                     263
Presidential candidate           254
Former governor                  180
U.S. Representative              172
Milwaukee County Executive       150
Senator                          148
State Senator                    108
U.S. representative              103
U.S. House of Representatives    102
Attorney                          81
Congressman                       80
Governor of New Jersey            78
Social media posting              78
Co-host on CNN's "Crossfire"      73
State Representative              73
State representative              66
Name: speaker_title, dtype: int64
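The counts above show the same title split across case variants ("U.S. Senator" vs. "U.S. senator", "State Representative" vs. "State representative"). One possible first pass at canonicalization is sketched below. It only lowercases and collapses whitespace, so it would not merge synonyms such as "Senator" and "U.S. Senator". The helper `canonicalize_title` and the toy `titles` series are hypothetical and not part of the original notebook.

```python
import pandas as pd


def canonicalize_title(title):
    """Lowercase a title and collapse whitespace; missing values pass through."""
    if pd.isna(title):
        return title
    return " ".join(str(title).lower().split())


# Toy values standing in for train_df.speaker_title (illustrative only).
titles = pd.Series(
    ["U.S. Senator", "U.S. senator", "State Representative", "State representative ", None]
)
canonical = titles.apply(canonicalize_title)

# Case variants now fall into the same bucket.
title_counts = canonical.value_counts()
print(title_counts)
```

On the real data this would be applied as `train_df.speaker_title.apply(canonicalize_title)`, after which `nunique()` can be compared against the 1187 raw titles.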
train_df.speaker.value_counts()
barack-obama                      493
donald-trump                      274
hillary-clinton                   239
mitt-romney                       180
scott-walker                      150
                                 ...
mike-coffman                        1
donna-edwards                       1
national-education-association      1
protect-families-first              1
frank-pallone-jr                    1
Name: speaker, Length: 2915, dtype: int64
train_df.speaker.nunique()
2915
affiliation_counts = train_df.party_affiliation.value_counts()
px.bar(affiliation_counts, x=affiliation_counts.index, y=affiliation_counts.values, labels={"index": "affiliation", "y": "counts"}, title="Counts Per Affiliation")
# Convert from 6-way scale to binary scale
def get_binary_label(label: str) -> bool:
    if label in {"pants-fire", "barely-true", "false"}:
        return False
    elif label in {"true", "half-true", "mostly-true"}:
        return True
train_df["binary_label"] = train_df.label.apply(get_binary_label)
party_groups = train_df.groupby(["party_affiliation"])
party_groups.get_group("republican").binary_label.value_counts(normalize=True)
True     0.502329
False    0.497671
Name: binary_label, dtype: float64
party_groups.get_group("democrat").binary_label.value_counts(normalize=True)
True     0.661584
False    0.338416
Name: binary_label, dtype: float64
train_df.binary_label.value_counts(normalize=True)
True     0.561995
False    0.438005
Name: binary_label, dtype: float64
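The per-party ratios above are computed one group at a time. As a sketch of an alternative, the share of `True` labels for every party can be read off in a single pass, since the mean of a boolean column is the fraction of `True` values. The toy frame `toy_df` below is made up for illustration and does not reproduce the real ratios.

```python
import pandas as pd

# Toy frame standing in for train_df (illustrative values only).
toy_df = pd.DataFrame(
    {
        "party_affiliation": ["republican", "republican", "democrat", "democrat", "democrat", "none"],
        "binary_label": [True, False, True, True, False, False],
    }
)

# Mean of a boolean column = fraction of True rows in each group.
true_ratio = toy_df.groupby("party_affiliation")["binary_label"].mean()
print(true_ratio.sort_values(ascending=False))
```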
unigram_lens = train_df.statement.str.split().str.len()
px.histogram(unigram_lens, x=unigram_lens.values, labels={"x": "unigram lens"}, title="Unigram Length Distribution")
unigram_lens.median()
17.0
unigram_lens.mean()
17.90990552254797
unigram_lens.max()
66
# Ran into noisy values in some columns (e.g. free text in `pants_fire_count`), so the affected rows have to be removed
train_df[train_df.pants_fire_count == "a television interview"]
/Users/mihaileric/anaconda3/envs/fake-news/lib/python3.8/site-packages/pandas/core/computation/expressions.py:68: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification | binary_label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Drop the row with an invalid value for `pants_fire_count`
train_df.drop(6134, inplace=True)
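The row above is dropped by its hard-coded index. A possible way to locate such rows programmatically is to coerce the column to numeric and look for values that fail to parse. This is only a sketch under the assumption that the noise shows up as non-numeric text in a count column; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for train_df (illustrative values only).
toy_df = pd.DataFrame(
    {"pants_fire_count": ["0.0", "9.0", "a television interview", "44.0"]}
)

# Unparseable entries become NaN instead of raising.
as_numeric = pd.to_numeric(toy_df["pants_fire_count"], errors="coerce")

# Rows that held a value but could not be parsed as a number.
bad_rows = toy_df[as_numeric.isna() & toy_df["pants_fire_count"].notna()]
print(bad_rows)

# Drop the bad rows and keep the column as floats.
clean_df = toy_df.drop(bad_rows.index)
clean_df = clean_df.assign(pants_fire_count=as_numeric.loc[clean_df.index])
```

The same check could be repeated for the other four count columns before any statistics are computed on them.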
# Separate true samples from false ones
true_ex = train_df[train_df.binary_label == True]
false_ex = train_df[train_df.binary_label == False]
train_df.barely_true_count.describe()
count    10266.000000
mean        11.557276
std         19.001815
min          0.000000
25%          0.000000
50%          2.000000
75%         12.000000
max         70.000000
Name: barely_true_count, dtype: float64
# TODO (mihail): Include feature for credit history counts (binned)
barely_true_counts = train_df.barely_true_count.value_counts().sort_index()
px.bar(barely_true_counts, x=barely_true_counts.index, y=barely_true_counts.values, labels={"index": "credit", "y": "counts"}, title="Barely True Credit Distribution")
px.histogram(train_df, x="barely_true_count", labels={"barely_true_count": "credit score"}, title="Barely True Credit Histogram", nbins=10)
barely_true_counts.values
array([3032, 1516, 817, 490, 236, 317, 190, 237, 171, 247, 104, 289, 112, 50, 115, 70, 69, 63, 135, 56, 150, 115, 142, 148, 117, 180, 93, 239, 273, 493])
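The TODO above mentions a binned credit-history feature. A minimal sketch with `pd.cut` follows. The bin edges and labels are arbitrary assumptions chosen for illustration, not values taken from the notebook, and would need tuning against the skewed distribution shown above.

```python
import pandas as pd

# Toy counts standing in for train_df.barely_true_count (illustrative only).
counts = pd.Series([0, 1, 2, 12, 70], name="barely_true_count")

# Hypothetical bins: (-1, 0], (0, 5], (5, 20], (20, inf)
bin_edges = [-1, 0, 5, 20, float("inf")]
bin_labels = ["none", "low", "medium", "high"]

# Intervals are right-inclusive by default, so 0 lands in "none".
binned = pd.cut(counts, bins=bin_edges, labels=bin_labels)
print(binned.value_counts().sort_index())
```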
train_df.false_count.describe()
count    10266.000000
mean        13.306546
std         24.122985
min          0.000000
25%          0.000000
50%          2.000000
75%         15.000000
max        114.000000
Name: false_count, dtype: float64
train_df.half_true_count.describe()
count    10266.000000
mean        17.195695
std         35.951114
min          0.000000
25%          0.000000
50%          3.000000
75%         13.000000
max        160.000000
Name: half_true_count, dtype: float64
train_df.mostly_true_count.describe()
count    10266.000000
mean        16.491720
std         36.255254
min          0.000000
25%          0.000000
50%          3.000000
75%         11.000000
max        163.000000
Name: mostly_true_count, dtype: float64
train_df.pants_fire_count.describe()
count    10266.000000
mean         6.198617
std         16.110747
min          0.000000
25%          0.000000
50%          1.000000
75%          5.000000
max        105.000000
Name: pants_fire_count, dtype: float64
train_df.pants_fire_count.astype(float).describe()
count    10266.000000
mean         6.198617
std         16.110747
min          0.000000
25%          0.000000
50%          1.000000
75%          5.000000
max        105.000000
Name: pants_fire_count, dtype: float64
true_ex.statement.str.split().str.len().describe()
count    5770.000000
mean       18.337782
std         7.798941
min         2.000000
25%        13.000000
50%        17.000000
75%        23.000000
max        66.000000
Name: statement, dtype: float64
false_ex.statement.str.split().str.len().describe()
count    4496.000000
mean       17.361210
std         7.668194
min         2.000000
25%        12.000000
50%        16.000000
75%        22.000000
max        60.000000
Name: statement, dtype: float64
# Sample true and false examples to observe characteristics
true_ex.sample(frac=0.2).head(25)
| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification | binary_label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10262 | 10264 | 5473.json | mostly-true | There are a larger number of shark attacks in Florida than there are cases of voter fraud. | animals,elections | aclu-florida | NaN | Florida | none | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | interview on "The Colbert Report" | They compounded their error by combining full and partial years of data -- even though they (like the governor himself) were told not to do so. | True |
570 | 570 | 5688.json | half-true | Mitt Romneys pledged to protect (oil companies) record profits and their billions in special tax breaks, too. | energy,taxes | priorities-usa-action | NaN | NaN | democrat | 3.0 | 1.0 | 4.0 | 2.0 | 1.0 | a television ad | The Priorities USA Action ad says "Mitt Romneys pledged to protect their (oil companies) record profits and their billions in special tax breaks, too. "We couldnt find any pledge from Romney to protect the oil industries tax breaks -- if anything, Romney appears to be trying to avoid making any clear statement on how he would handle those subsidies. But there are signs that he is favorable toward maintaining those tax breaks. "Romney hasnt made a "pledge" but there are signs he supports the... | True |
210 | 210 | 7123.json | true | Says that in the 1985 election former Gov. Tom Kean had the largest winning margin for a gubernatorial candidate in Jersey history. | elections,states | raymond-bateman | NaN | NaN | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a column | Bateman said that in the 1985 general election Kean had "the largest winning margin for a gubernatorial candidate in Jersey history. "Kean trounced his opponent that year. The former Republican governor won more than 1. 37 million votes, compared with the roughly 578,000 ballots cast for his Democratic rival. That means Kean won by more than 790,000 votes. Thats the largest margin of victory for a gubernatorial election in New Jerseys history. | True |
6951 | 6953 | 8527.json | mostly-true | We have created new jobs here in Cranston -- more than 1,000. | city-government,corporations,economy,government-regulation,job-accomplishments,jobs,market-regulation,small-business,workers | allan-fung | mayor, city of Cranston, R.I. | Rhode Island | republican | 2.0 | 2.0 | 0.0 | 2.0 | 1.0 | a campaign speech | Still, even if Todd Palin paid more for his snack food fix, it doesnt support Sarah Palins argument that food prices are skyrocketing. Food prices -- an always-volatile sector -- are indeed going up, and that may or may not be a worry for the longer term. However, food prices are not rising by anything approaching 169 percent. Her anecdote offers spice, but not a lot of meat. | True |
2714 | 2714 | 5402.json | mostly-true | The governor is trying to take credit for recent actions taken by these companies. The problem is that for all three companies, the decision to move or stay in New Jersey happened before Chris Christie became governor. | corporations | new-jersey-senate-democrats | State Senators | New Jersey | democrat | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | a press release | NaN | True |
8367 | 8369 | 6332.json | true | Twenty two years ago, when he was running for governor, Bill Nelson missed 56 percent of his votes in the U.S. House. | candidates-biography,voting-record | connie-mack | U.S. representative from Fort Myers | Florida | republican | 3.0 | 3.0 | 1.0 | 3.0 | 1.0 | a letter from the campaign to the Tampa Bay Times | More than 115,000 social media users passed along a story headlined, "Newly Elected Republican Senators Sign Pledge to Eliminate Food Stamp Program in 2015. "But they failed to do due diligence and were snookered, since the story came from a publication that bills itself (quietly) as a "satirical, parody website. "UPDATE, Jan. 8, 2015: After we published our fact-check, we received a response from Kevin Gallagher of the Daily Leak. We had asked him whether he considers the sites articles to... | True |
5574 | 5574 | 12850.json | mostly-true | Russia and China are doing naval exercises together someplace. | china,foreign-policy,terrorism | donald-trump | President-Elect | New York | republican | 63.0 | 114.0 | 51.0 | 37.0 | 61.0 | a rally in Columbus, Ohio | But thats a direct tax on gas, not on the oil companies. Our rating Ryan said: "President Obamas proposed oil tax would cost consumers 24 cents a gallon. "Estimates are in the 24 cents range for the $10-per-barrel tax that Obama proposed on oil companies. But its not a sure thing, if the tax became law, that all of the tax would be passed onto consumers. | True |
6587 | 6589 | 589.json | half-true | In the wake of hurricanes Katrina and Rita, offshore drilling "did not cause any real difficulties." | environment | john-mccain | U.S. senator | Arizona | republican | 31.0 | 39.0 | 31.0 | 37.0 | 8.0 | Albuquerque, N.M. | Gohmert said Obama has "not proposed one thing that would change" the fact that Warren Buffett pays a lower tax rate than his secretary. In fact, Obama has on multiple occasions proposed changing the tax code so that it complies with the "Buffett rule," which attempts to ensure that high-income people pay a certain percentage of their income in taxes that is at least higher than what middle-class people pay. Theres debate over whether or not this is good policy or how many millionaires would... | True |
5484 | 5484 | 8943.json | mostly-true | In Oregon, women earn an average of 79 cents for every dollar that men earn for doing the same job. Thats just wrong. | income | brad-avakian | NaN | NaN | state-official | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | a campaign website statement | ; Tammy Baldwin, D-Wis. ; Heidi Heitkamp, D-N. D. ; and Mazie Hirono, D-Hawaii. Footnote: Hutchison was among seven female senators on her initial swearing-in in June 1993, according to the Senate web page. Women have held more than 10 Senate seats since 2001, more than 15 since 2007. | True |
10110 | 10112 | 9676.json | half-true | Having an active father makes children 98 percent more likely to graduate from college. | children,education,gays-and-lesbians,marriage | marco-rubio | U.S. Senator | Florida | republican | 33.0 | 24.0 | 32.0 | 35.0 | 5.0 | a speech at Catholic University | PolitiFact Oregon talked to Rasmussen to see if he could explain his remarks. "We were discussing Oregons volatile revenue," he said, "and I was just summarizing the two general proposals that I had heard other people talking about neither of which were policy proposals that I was advocating. "Upon review of the actual comments, it is clear that while Rasmussen might have mentioned a sales tax (and, yes, we think "transactions that may be taxable" is pretty much code for sales tax), he isnt ... | True |
4705 | 4705 | 2535.json | half-true | Marco Rubio wants to raise the Social Security retirement age, and cut benefits. | message-machine,social-security | charlie-crist | NaN | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | a TV ad. | McDonald said, "Nine hundred people have been fired since I became secretary. Weve got 60 people that we fired who have manipulated wait times. "He also said that those 900 people "were with us before I became secretary. "While the data shows that 900 people have been let go under McDonald, half those dismissals were probationary employees, meaning they were just starting work as the scandal had come to light, or werent even there when it was going on. Looking at historical trends, the numbe... | True |
9065 | 9067 | 3448.json | mostly-true | Says for the equivalent cost of a single mile of freeway, we have a bike infrastructure. | transportation | sam-adams | Mayor of Portland | Oregon | democrat | 3.0 | 2.0 | 5.0 | 2.0 | 0.0 | a video online | UPDATE: This item uses updated numbers on the debt limit votes that were provided to us by the McCollumn campaign after this item was published. | True |
790 | 790 | 1791.json | half-true | She (Kagan) took money from Goldman Sachs just like her boss, Obama. | campaign-finance,kagan-nomination,pundits,supreme-court | michael-savage | NaN | NaN | none | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | his website | That put Goldman Sachs employees No. 2 on Obama's top donors list. But again, getting a check from Goldman Sachs for services rendered (as Kagan did, albeit a small one) is far different than accepting campaign contributions from the company's employees. By that measure, Obama's Republican opponent Sen. John McCain took money from Goldman Sachs too -- $230,000 in campaign contributions from its employees. | True |
3104 | 3104 | 4583.json | half-true | We just cant afford to pay 100 percent of government employee benefits. | labor,state-finances,unions | alliance-americas-future | lobbying group | Virginia | none | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | a campaign mailing | That is a critical distinction in the world of science and of politics. Barretts ad ignores it, presenting Walker as so extreme on the matter that he is "against hope. | True |
4645 | 4645 | 7370.json | true | Says, Since 1994 when VAWA was first passed, incidents of domestic violence have dropped more than 50 percent. | congress,women | jeff-merkley | U.S. Senator | Oregon | democrat | 0.0 | 1.0 | 3.0 | 6.0 | 0.0 | a press call | In its tweet, the Republican Party overreaches when it says the administration now calls the mandate a tax. The administration (still) isn't doing that. But it does cite Congress' power to levy taxes as authority for the mandate. And that enables the GOP to score its own political point, based on what looks like the administration's runs at having it both ways. | True |
9900 | 9902 | 165.json | mostly-true | Mayor Giuliani's lawsuit killed the line-item veto. | federal-budget | mitt-romney | Former governor | Massachusetts | republican | 34.0 | 32.0 | 58.0 | 33.0 | 19.0 | NaN | Some of the contributions come from people who work forExxonMobil, ConocoPhillips and the other multinational gas and oil companies commonly considered "Big Oil. "But, many come from smaller firms, including locally owned heating companies and gas stations, that do not fall under the category. And as PolitiFact has noted in previous rulings, just because people who work for an industry donate to a campaign, it doesnt necessarily mean its coming from the industry itself. | True |
8613 | 8615 | 12742.json | half-true | Says Hillary Clinton has called for a radical 550 percent increase in Syrian ... refugees . . . despite the fact that theres no way to screen these refugees in order to find out who they are or where they come from. | foreign-policy,homeland-security,human-rights,immigration,terrorism | donald-trump | President-Elect | New York | republican | 63.0 | 114.0 | 51.0 | 37.0 | 61.0 | a speech at the Republican convention | Like many chain emails we've checked, this one is so flimsy that it needs to fabricate its credibility. Critics of Obama are free to speculate about the first lady's words, but the "damn flag" translation was not done by anyone at the River School. | True |
4277 | 4277 | 6677.json | mostly-true | Says Obama doubled funding for the Pell Grant. | education | kal-penn | actor | NaN | democrat | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | his Democratic National Convention speech | Sullivan said Mark Begich "has taken campaign cash from the Kochs" but that Sullivan "hasnt taken a dime. "In 2010, KochPAC donated $5,000 to a Begich-affiliated PAC. But that's a modest amount by campaign-finance standards, and Begich hasnt received any more in the past four years. While Sullivan hasnt received money directly from the Kochs, he and Begichs other Republican challengers certainly benefit from the much larger TV ad buys by Koch-affiliated groups attacking the Democrat. A Koch-... | True |
5417 | 5417 | 3581.json | true | President Obama has stopped using the phrase war on terror. | homeland-security,terrorism | tim-pawlenty | NaN | NaN | republican | 2.0 | 3.0 | 3.0 | 6.0 | 1.0 | his book, "Courage to Stand: An American Story" | Clinton said, "A trade war is something very different (than curbing new trade agreements). We went down that road in the 1930s. It made the Great Depression longer and more painful. "Numerous experts we checked with said the Smoot-Hawley tariffs and the resulting trade war werent the only factor to worsen the Great Depression. However, they agreed that the trade war undeniably had a negative impact. | True |
5442 | 5442 | 1203.json | half-true | On whether he would put a missile shield in Poland. | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | a White House announcement | There are two ways to look at Medicaid spending. However, if we only count state dollars, then education eats up a bigger piece of the budget. It's important to understand that the federal government contributes to Medicaid, so the statement is accurate but needs additional information. | True |
3073 | 3073 | 10152.json | mostly-true | Figures for September 2014s job growth in Wisconsin mark the largest private-sector job creation weve had in the month of September in more than a decade | jobs | scott-walker | Milwaukee County Executive | Wisconsin | republican | 26.0 | 41.0 | 32.0 | 40.0 | 11.0 | a speech | Theres little question more IRS workers will have to be added, but the agency has not set a number. | True |
2329 | 2329 | 6198.json | half-true | Statistics show that more people at this time telecommute than they ride carpools, mass transit, bicycle or walk. | transportation | debbie-dooley | NaN | Georgia | tea-party-member | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | a speech | Nelson, along with many other Democrats, said the Zika funding bill would limit access to family planning and contraceptives that would help stop the spread of the Zika virus. The legislation would have blocked the flow of money to one organization, Profamilias, the Planned Parenthood chapter in Puerto Rico. However, the bill also provided funds that would potentially help clinics and hospitals in nearly every municipality on the island. There would be some pockets without services, but it i... | True |
5779 | 5779 | 10513.json | half-true | In states that have private-sale background checks for handguns 49 percent fewer women are shot and killed. | crime,guns,states,women | lori-haas | Virginia | Virginia director, Coalition to Stop Gun Violence | activist | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | a news conference | Murphy said Floridas regulations on the payday lending industry are "stronger than almost any other state. "Consumer advocates, Pew researchers and the head of the Consumer Financial Protection Bureau have several criticisms of Floridaslaw, including the high interest rate. Pew, an independent organization, says that Colorado has the best model law in the country. The Center for Responsible Lending points to 14 states -- not including Florida -- that cap interest rates at 36 percent as a bet... | True |
7425 | 7427 | 4830.json | half-true | When you sanction the Iranian central bank, that will shut down (Irans) economy. | economy,foreign-policy | rick-perry | Governor | Texas | republican | 30.0 | 30.0 | 42.0 | 23.0 | 18.0 | the CNN Republican presidential debate | We understand the everlasting appeal of Houston as the first word flung from the moon. However, we were surprised to learn, it's not so. Armstrong's call from the moon to mission control was preceded by various other and easy-to-overlook words. | True |
3691 | 3691 | 7926.json | mostly-true | In the past four years, (the U.S. Senate) has only passed nine out of 48 appropriation bills. | congress,federal-budget | jack-kingston | U.S. Representative | Georgia | republican | 3.0 | 1.0 | 4.0 | 3.0 | 0.0 | a speech | Parents have a secondary role. "Thats a bogus quote attributed to Clintons book It Takes a Village. | True |
false_ex.sample(frac=0.2).head(25)
| | id | statement_json | label | statement | subject | speaker | speaker_title | state_info | party_affiliation | barely_true_count | false_count | half_true_count | mostly_true_count | pants_fire_count | context | justification | binary_label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2831 | 2831 | 403.json | false | Hillary Clinton and Barack Obama want to raise taxes on all income brackets. | taxes | chain-email | NaN | NaN | none | 11.0 | 43.0 | 8.0 | 5.0 | 105.0 | a chain e-mail | "John McCain believes Roe vs. Wade is a flawed decision that must be overturned, and as president he will nominate judges who understand that courts should not be in the business of legislating from the bench," the Web site states. "Constitutional balance would be restored by the reversal of Roe vs. Wade, returning the abortion question to the individual states. "We find McCain's brief remark of support falls well short of a full-fledged change in position. His voting record on abortion ap... | False |
7949 | 7951 | 4350.json | barely-true | Says direct shipment of wine makes underage drinking as simple as a mouse click. | Alcohol,government-regulation | joseph-cryan | State Assemblyman | New Jersey | democrat | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | an op-ed piece published on CourierPostOnline.com | "In that respect, we certainly aren't the wealthiest state or the poorest state. We do have a large representation of very disadvantaged people. At the same time, we have a robust economy. "Much debate lies in the details. | False |
9224 | 9226 | 8295.json | false | Every engine manufacturer in the United States is now in the state of Texas. | jobs,states | rick-perry | Governor | Texas | republican | 30.0 | 30.0 | 42.0 | 23.0 | 18.0 | a discussion on CNN's "Crossfire" | That sounds like a broad-brush statement of Obama's taxation philosophy. But Obama does not promise those things; in fact, he promises more taxes for taxpayers with the highest incomes. | False |
10088 | 10090 | 1126.json | pants-fire | Seniors and the disabled will have to stand in front of Obamas death panel so his bureaucrats can decide, based on a subjective judgment of their level of productivity in society, whether they are worthy of health care. | health-care | sarah-palin | NaN | Alaska | republican | 9.0 | 19.0 | 9.0 | 6.0 | 6.0 | a message posted on Facebook | In fact, the AAAN's focus is on local initiatives, and has no foreign policy. | False |
1228 | 1228 | 5478.json | barely-true | President Obamas budget would call for about $25 trillion in debt by the end of his term, if he was re-elected. | deficit,federal-budget | bob-mcdonnell | Governor | Virginia | republican | 6.0 | 5.0 | 7.0 | 6.0 | 3.0 | a symposium. | McDonnell said that if Obama is reelected this year, his budget policies would push total U. S. debt to about $25 trillion by the time the presidents second term expired in 2017. The governors spokesman qualified the statement. McDonnell, he said, cited a White House projection of the national debt in 2021 if Obamas policies are continued. Indeed, the administration has estimated a $25 trillion debt that year. But in 2017 -- at the end of a second Obama term that McDonnell addressed with dr... | False |
7556 | 7558 | 10364.json | false | The Milwaukee Bucks are actually younger than the Marquette team. | economy,education,recreation,sports | peter-feigin | President, Milwaukee Bucks | NaN | none | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a speech | His statement is not far off the annual decline found in Ohio using statistics the USDA prefers. And ultimately, the differences in numbers dont impact his underlying point. As for the estate taxs role in the decline, its difficult to say exactly how many family farms were lost for that reason. But Cornely, an authority on Ohio agriculture, said his is sure "some" of the loses would be due to taxes. | False |
7228 | 7230 | 2965.json | false | Sen. Jim Webb persists on negating Sen. Mark Warners votes | voting-record | george-allen | consultant | Virginia | republican | 2.0 | 8.0 | 3.0 | 4.0 | 1.0 | a news release. | West tweeted that "more Americans receive food aid than work in (the) private sector. "However, the data West used appears to have undercounted the number of people with a private-sector job and overcounted the number of people receiving food aid. In addition, the comparison isnt really apples to apples. | False |
1288 | 1288 | 3168.json | pants-fire | Illegal aliens cost the state of Rhode Island $400 million a year. | census,crime,education,health-care,immigration,state-budget,taxes | terry-gorman | President, Rhode Islanders for Immigration Law Enforcement | Rhode Island | newsmaker | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | a radio interview | NaN | False |
1044 | 1044 | 10008.json | false | Mary Burkes record: 130,000 fewer jobs. | economy,job-accomplishments,jobs | republican-governors-association | NaN | Washington, D.C. | republican | 5.0 | 1.0 | 6.0 | 4.0 | 2.0 | a campaign TV ad | Our rating The Republican Governors Association said Mary Burkes record is one of "130,000 fewer jobs. "But that number corresponds to Doyles second term, not Burkes time as head of Commerce -- a period that saw an increase. | False |
5379 | 5379 | 10989.json | pants-fire | Says South Carolina Gov. Nikki Haley is an immigrant. | candidates-biography,history,immigration | ann-coulter | Columnist and author | New York | republican | 2.0 | 3.0 | 3.0 | 0.0 | 4.0 | comments on Fox Business Network | Thomas said that mass shootings have tripled since the 2000 to 2008 period and the country now sees about 15episodes a year. The study that matches those figures looked at a different form of violence and included instances where no one was murdered. The studys author told us that mass murders, as defined by the FBI, have not increased. Using a more narrow definition, Mother Jones found that the country now has between three and four mass killings a year, a doubling since about a decade ago.... | False |
757 | 757 | 12457.json | false | It was allowed, referring to her email practices. | homeland-security,technology | hillary-clinton | Presidential candidate | New York | democrat | 40.0 | 29.0 | 69.0 | 76.0 | 7.0 | an interview on ABC | Regarding her decision to use a private email server, Clinton said, "It was allowed. "No one ever stopped Clinton from conducting work over her private email server exclusively. But thats not the same thing as it beingallowed. Offices within the State Department told an independent inspector general that if she had asked, they would not have allowed it. The report from the State Departments Office of the Inspector General shatters one of Clintons go-to phrases about her email practice. | False |
77 | 77 | 3005.json | barely-true | Says one out of three U.S. homeless men is a veteran. | poverty,veterans | texas-veterans-commission | NaN | NaN | none | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | a newspaper article. | Using U. S. Census Bureau statistics to extrapolate the percentage of males among veterans (93 percent), McGhee comes up with a figure of 99,720 homeless male veterans slightly more than one-third of the total number of adult homeless people calculated from the January 2009 one-night survey. | False |
4860 | 4860 | 11622.json | barely-true | The United States has 10,000 IRS agents making sure that you dont take an improper charity deduction, but to fight terrorism, it has less than two dozen people focusing on countering violent extremism at home. | homeland-security,taxes,terrorism | martha-mcsally | NaN | NaN | republican | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | a news conference | ActionAid UK said "tax dodging costs developing countries $200 billion every year. "Even the IMF report that provided the figure called the estimate "highly speculative. "For the most part, the outside experts we reached were even more cautious. ActionAid didnt reflect any of that uncertainty. In addition, the total includes taxes lost by countries that are doing much better than low income nations. The statement is partially correct but the core figure comes with many caveats. Update,April ... | False |
3119 | 3119 | 4455.json | false | Says CNNs Wolf Blitzer was wrong to say that the wealthiest Americans, they pay the most in taxes already -- 50 percent of Americans dont even pay any federal income tax. | taxes | debbie-wasserman-schultz | U.S. Representative, Florida District 23 | Florida | democrat | 7.0 | 9.0 | 8.0 | 15.0 | 3.0 | an interview before the CNN/Tea Party Express presidential debate | All that seems to be changing is where the division is housed for budgeting purposes. He's really just redrawing lines on an org chart, and it's notas groundbreaking a change as it sounds. | False |
9116 | 9118 | 4760.json | false | Over half of the people who would be taxed under (a millionaire surtax) are, in fact, small businesspeople. | taxes,abc-news-week | john-boehner | Speaker of the House of Representatives | Ohio | republican | 13.0 | 22.0 | 11.0 | 4.0 | 2.0 | an ABC interview with Christiane Amanpour | The NRCC said that Kuster has turned "a blind eye to those in need of funding" by voting "against funding for our nations veterans, low-income women and children, the FDA and the National Institutes of Health. "The NRCC is correct that Kuster did vote against each of the resolutions that would have temporarily provided funding for veterans, low-income women and children, the FDA and the National Institutes of Health. However this only tells part of the story. The Democrats say they are pushi... | False |
10016 | 10018 | 607.json | false | John McCain will keep the estate tax at 0 percent, the same as it is now. | taxes | chain-email | NaN | NaN | none | 11.0 | 43.0 | 8.0 | 5.0 | 105.0 | a chain e-mail | According to one analysis of Romneys financial disclosure, his wealth may total $264 million -- but it may also be closer to $85 million. Its clear that most of it has been held in blind trusts since 2003, which is common practice among elected officials to avoid conflicts of interest. His tax returns show he previously had a Swiss account and that he made almost $3 million in foreign income in 2010. But the tax returns and financial disclosures indicate that a relatively small share of Romn... | False |
4750 | 4750 | 7648.json | false | Under Obamacare, Virginia taxpayers would have been forced to pay for abortions if the General Assembly had not recently intervened. | abortion,health-care,taxes | virginia-society-human-life | NaN | Virginia | none | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a statement. | The Obama ad says Romney personally approved $70 million in fictional tax losses through the Son of Boss tax shelter. Marriott International used Son of Boss. Romney was head of Marriotts audit committee at the time. Experts disagree on whether the corporate board would have known about the deal and had the chance to question it. The company neither confirmed nor denied that the board approved the transaction. At some point, the board would have approved filings that included the fraudulent ... | False |
2919 | 2919 | 4199.json | false | Since 1965, the United States has spent untold trillions yet the poverty rate hasnt budged. | federal-budget,history,poverty,pundits | bill-oreilly | Fox News Channel host | NaN | none | 4.0 | 6.0 | 3.0 | 5.0 | 1.0 | a comment on Fox News' "The O'Reilly Factor" | Sanders tweeted, "Increasing the min. wage to $15 an hour would reduce spending on food stamps, public housing and other programs by over $7. 6 billion a year. "Sanders based this on a study that looked at what would happen if the minimum wage were raised to $10. 10, not to $15. In reality, no such study of a $15 wage hike exists -- and economists say theres good reason to believe that jobs lostfrom a wage hike that large could be significant. It might even be big enough to increase the cost... | False |
9132 | 9134 | 8749.json | barely-true | When I was Mayor of South Pasadena, we actually reduced the property taxes we collected. | taxes | kathleen-peters | District 69 state representative | Florida | republican | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | a campaign mailer | Sanders said that for African-Americans between the ages of 17 and 20, "the real unemployment rate is 51 percent. "His terminology was off, but the numbers he used check out, and his general point was correct -- that in an apples-to-apples comparison, African-American youth have significantly worse prospects in the job market than either Hispanics or whites do. | False |
8631 | 8633 | 9825.json | false | Every 28 hours an unarmed black person is shot by a cop. | crime,diversity,public-safety | marc-lamont-hill | NaN | NaN | democrat | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a CNN segment | As weve shown, the Kochs hold oil leases in Canada. We're not ruling out that they could benefit in some ways from the Keystone XLpipeline. But trying to extrapolate their oil-sands leasesinto a specific profit figure is sheer folly. | False |
7298 | 7300 | 4990.json | false | I balanced the budget for four straight years, paid off $405 billion in debt. | deficit,federal-budget,history | newt-gingrich | Co-host on CNN's "Crossfire" | Georgia | republican | 16.0 | 15.0 | 20.0 | 10.0 | 11.0 | a Republican presidential debate in Sioux City, Iowa | Caputo, a Trump supporter and former adviser, said, Trump"spent the least amount of money of any of the competitive primary contenders that he beat so badly. "When Trumps spending is looked at broadly over the course of the campaign, he appears to outspend the majority of his opponents through the end of May. But a review of his spending during the time rival candidates were still in the race shows a different result. All except Kasich spent more money than Trump before dropping out. It's al... | False |
4549 | 4549 | 10519.json | false | Says award-winning Milwaukee Public Schools teacher Megan Sampson was laid off because Gov. Scott Walker cut state aid to education. | education,labor | gail-collins | Columnist for the New York Times | New York | journalist | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | a column | Scott's office did not specify how he arrived at 7. 4 percent, but he's close enough. He's right that there's a clear Canadian infatuation with Florida. As one Canadian-turned-Floridian-real-estate-agent joked in a National Public Radio report on the housing phenomenon, "If there ever was an 11th (Canadian) province, it probably would be Florida. "Scott has his Canada trivia down cold, and we were unable to find any statistics that contradict him. | False |
4746 | 4746 | 2387.json | barely-true | DeWine took $1.9 million from big banks, supported legislation that helped Bernie Madoff make millions and protected predatory lenders while families lost their homes. | campaign-finance,economy,job-accomplishments | ohio-democratic-party | NaN | Ohio | democrat | 4.0 | 1.0 | 2.0 | 2.0 | 2.0 | an election ad | Voight said that the word progressive was created as a substitute for communist. The historic record shows that the progressive movement emerged around the turn of the last century in response to the conditions created by runaway capitalism. Its policies aimed to regulate private industry, not eliminate it. That agenda enjoyed broad support from people who identified with both parties and led to many of the basic features of government today. Many of the movements mainstream supporters backe... | False |
8466 | 8468 | 5610.json | barely-true | Says we got a chance to pass what I think is Oregons first human trafficking bill which has increased by 66 percent the calls to the human trafficking hotline. | human-rights | jefferson-smith | state representative | Oregon | democrat | 1.0 | 1.0 | 0.0 | 4.0 | 0.0 | an interview. | Tant said Scott only got 60 percent of the teacher pay raise he wanted. The final budget compromise adds thousands of noninstructors into the pay raise mix, which could affect the number of teachers who get raises or at least how much those raises are. Still, we dont know what that will mean on a statewide level because teacher pay is worked out by local districts and their unions. Still, this isnt even what Tant was trying to analyze. She just used a mismatched number. | False |
1237 | 1237 | 1531.json | pants-fire | I'm glad for the wording of it (an ethics report on corporate-sponsored Congressional trips) because clearly the wording exonerates me. | ethics | charles-rangel | U.S. Congressman | New York | democrat | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | a press conference | But we have to place our confidence appropriately, and were held responsible for that. "You can agree with the ethics panel ruling or not, but the fact is it found that even if Rangel had no direct knowledge of the staff memos warning him about corporate sponsorship, Rangel "was responsible for the knowledge and actions of his staff in the performance of their official duties. "There's just no way to spin that into "exoneration. "To the contrary, the ethics report included a public adm... | False |
stripped = false_ex.state_info.copy().str.strip()
false_ex.loc[:, "state_info"] = stripped
stripped = true_ex.state_info.copy().str.strip()
true_ex.loc[:, "state_info"] = stripped
/Users/mihaileric/anaconda3/envs/fake-news/lib/python3.8/site-packages/pandas/core/indexing.py:1745: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# Clean up the variants of state info
CANONICAL_TO_VARIANTS = {
    "Tennessee": {"Tennessee", "Tennesse"},
    "Washington D.C.": {"District of Columbia", "Washington D.C.", "Washington, D.C.", "Washington DC"},
    "Texas": {"Tex", "Texas"},
    "Washington": {"Washington", "Washington state"},
    "Virginia": {"Virginia", "Virgina", "Virgiia"},
    "Pennsylvania": {"Pennsylvania", "PA - Pennsylvania"},
    "Rhode Island": {"Rhode Island", "Rhode island"},
    "Ohio": {"Ohio", "ohio"},
}
def get_variant_to_canonical(can_to_var: Dict) -> Dict:
    # Invert the mapping so that each variant spelling points to its canonical name
    variant_to_canonical = {}
    for canonical, variants in can_to_var.items():
        for var in variants:
            variant_to_canonical[var] = canonical
    return variant_to_canonical
variant_to_canonical = get_variant_to_canonical(CANONICAL_TO_VARIANTS)
def clean_variant(state_info, variant_to_canonical):
    # Known variants map to their canonical name; any other value (including NaN) is returned unchanged
    return variant_to_canonical.get(state_info, state_info)
true_ex.loc[:, "state_info"] = true_ex.state_info.apply(lambda x: clean_variant(x, variant_to_canonical))
false_ex.loc[:, "state_info"] = false_ex.state_info.apply(lambda x: clean_variant(x, variant_to_canonical))
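As a quick sanity check of the normalization step, the following minimal sketch inverts a small canonical-to-variants mapping and applies it to a handful of raw values. The two-entry mapping, the `toy_`-prefixed names, and the `raw_states` list are illustrative assumptions, not the notebook's data; unknown values pass through unchanged, mirroring `clean_variant`.

```python
from typing import Dict, List, Set

# Hypothetical two-entry subset of a canonical-to-variants mapping
toy_canonical_to_variants: Dict[str, Set[str]] = {
    "Virginia": {"Virginia", "Virgina", "Virgiia"},
    "Ohio": {"Ohio", "ohio"},
}

# Invert the mapping: each variant spelling points to its canonical name
toy_variant_to_canonical: Dict[str, str] = {
    variant: canonical
    for canonical, variants in toy_canonical_to_variants.items()
    for variant in variants
}


def normalize(values: List[str]) -> List[str]:
    # Known variants are replaced; anything else is returned unchanged
    return [toy_variant_to_canonical.get(value, value) for value in values]


raw_states = ["Virgina", "ohio", "Texas", "Virginia"]  # illustrative input
cleaned_states = normalize(raw_states)
print(cleaned_states)
```

Values that are not listed as a variant, such as "Texas" here, are left as they are, so new misspellings only need to be added to the mapping.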
Takeaway from the two plots below: no state appears considerably more inclined toward "True" or "False" statements, since the most frequent states in each category are roughly the same.
state_true_counts = true_ex.state_info.value_counts()
px.bar(state_true_counts, x=state_true_counts.index, y=state_true_counts.values, labels={"index": "state", "y": "counts"}, title="True Statement State Distribution")
state_false_counts = false_ex.state_info.value_counts()
px.bar(state_false_counts, x=state_false_counts.index, y=state_false_counts.values, labels={"index": "state", "y": "counts"}, title="False Statement State Distribution")
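Raw counts in the two bar charts are dominated by states that contribute many statements overall, so comparing within-class proportions may make the "roughly the same" reading easier to check. The sketch below assumes two plain lists of state labels (hypothetical toy values here); in the notebook these would presumably come from `true_ex.state_info` and `false_ex.state_info` after dropping missing values. The helper names `state_shares` and `share_gaps` are not part of the original code.

```python
from collections import Counter
from typing import Dict, Iterable


def state_shares(states: Iterable[str]) -> Dict[str, float]:
    # Convert raw counts into the fraction of statements per state
    counts = Counter(states)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}


def share_gaps(true_states: Iterable[str], false_states: Iterable[str]) -> Dict[str, float]:
    # Positive gap: the state makes up a larger share of true statements than of false ones
    true_shares = state_shares(true_states)
    false_shares = state_shares(false_states)
    all_states = set(true_shares) | set(false_shares)
    return {
        state: true_shares.get(state, 0.0) - false_shares.get(state, 0.0)
        for state in all_states
    }


# Hypothetical toy labels, not taken from the dataset
toy_true = ["Texas", "Texas", "Ohio", "Florida"]
toy_false = ["Texas", "Ohio", "Ohio", "Florida"]
gaps = share_gaps(toy_true, toy_false)
for state, gap in sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{state}: {gap:+.2f}")
```

Gaps close to zero for the most frequent states would support the takeaway; a large gap for a frequent state would be worth a closer look.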
def get_top_ngrams(corpus, ngram_len: int = 1, num: int = None) -> List:
    # Count n-grams of a fixed length, excluding English stop words
    vec = CountVectorizer(ngram_range=(ngram_len, ngram_len), stop_words="english").fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each vocabulary column to get corpus-wide counts
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    # With num=None the full sorted list is returned
    return words_freq[:num]
top_unigrams_true = get_top_ngrams(true_ex.statement, 1, 30)
top_unigrams_true
[('says', 1217), ('percent', 840), ('state', 508), ('000', 467), ('year', 397), ('tax', 386), ('years', 382), ('states', 359), ('million', 356), ('people', 337), ('obama', 337), ('health', 301), ('jobs', 290), ('president', 287), ('new', 266), ('texas', 244), ('care', 230), ('taxes', 228), ('billion', 225), ('country', 223), ('federal', 204), ('united', 202), ('said', 194), ('rate', 186), ('budget', 186), ('10', 177), ('pay', 177), ('voted', 176), ('time', 171), ('government', 163)]
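The token `'000'` in the list above is most likely an artifact of tokenization: `CountVectorizer`'s default `token_pattern` is `(?u)\b\w\w+\b`, which splits on punctuation and drops single-character tokens, so a figure such as `130,000` yields the tokens `130` and `000`. The sketch below applies that regular expression directly to a made-up sentence and counts unigrams with `Counter`. It is a simplified stand-in for the vectorizer (no stop-word removal), not a reimplementation of it, and `simple_tokenize` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import List

# Default token pattern documented for scikit-learn's CountVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_tokenize(text: str) -> List[str]:
    # Lowercase, then keep runs of two or more word characters
    return TOKEN_PATTERN.findall(text.lower())


# Made-up example sentence, not a row from the dataset
example = "The plan cost 130,000 jobs and $1,000 per family."
tokens = simple_tokenize(example)
unigram_counts = Counter(tokens)
print(tokens)
print(unigram_counts.most_common(3))
```

If numeric magnitudes matter for the analysis, one option would be to normalize numbers before vectorizing, for example by removing thousands separators.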
true_state_distr = pd.DataFrame(top_unigrams_true, columns=["unigram", "count"])
px.bar(true_state_distr, x="unigram", y="count", title="Top True Unigrams Frequency")
top_unigrams_false = get_top_ngrams(false_ex[false_ex.statement.notnull()].statement.str.lower(), num=30)
false_state_distr = pd.DataFrame(top_unigrams_false, columns=["unigram", "count"])
px.bar(false_state_distr, x="unigram", y="count", title="Top False Unigrams Frequency")
top_bigrams_false = get_top_ngrams(false_ex[false_ex.statement.notnull()].statement.str.lower(), ngram_len=2, num=30)
false_state_distr = pd.DataFrame(top_bigrams_false, columns=["bigram", "count"])
px.bar(false_state_distr, x="bigram", y="count", title="Top False Bigrams Frequency")
top_bigrams_true = get_top_ngrams(true_ex[true_ex.statement.notnull()].statement.str.lower(), ngram_len=2, num=30)
true_state_distr = pd.DataFrame(top_bigrams_true, columns=["bigram", "count"])
px.bar(true_state_distr, x="bigram", y="count", title="Top True Bigrams Frequency")
true_ex.statement.str.split().str.len().describe()
count 5770.000000 mean 18.337782 std 7.798941 min 2.000000 25% 13.000000 50% 17.000000 75% 23.000000 max 66.000000 Name: statement, dtype: float64
false_ex.statement.str.split().str.len().describe()
count 4496.000000 mean 17.361210 std 7.668194 min 2.000000 25% 12.000000 50% 16.000000 75% 22.000000 max 60.000000 Name: statement, dtype: float64
train_df.statement.str.split().str.len().describe()
count 10266.000000 mean 17.910092 std 7.756725 min 2.000000 25% 12.000000 50% 17.000000 75% 22.000000 max 66.000000 Name: statement, dtype: float64
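True statements are about one word longer on average (18.34 vs. 17.36 words). To gauge whether such a gap is more than noise, one option is Welch's t statistic together with Cohen's d, both computable from the summary statistics printed above. This is a rough sketch that assumes the two groups are independent samples and that the reported `std` values are sample standard deviations (the pandas default); the helper names `welch_t` and `cohens_d` are hypothetical.

```python
import math


def welch_t(mean_a: float, std_a: float, n_a: int,
            mean_b: float, std_b: float, n_b: int) -> float:
    # Welch's t statistic: mean difference divided by the unpooled standard error
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def cohens_d(mean_a: float, std_a: float, n_a: int,
             mean_b: float, std_b: float, n_b: int) -> float:
    # Cohen's d: mean difference divided by the pooled standard deviation
    pooled_var = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# (mean, std, count) copied from the describe() outputs above
true_len = (18.337782, 7.798941, 5770)
false_len = (17.361210, 7.668194, 4496)

t_stat = welch_t(*true_len, *false_len)
effect = cohens_d(*true_len, *false_len)
print(f"t = {t_stat:.2f}, d = {effect:.3f}")
```

With samples this large, even a small effect can produce a large t value, so the effect size is arguably the more informative of the two numbers.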
In True statements:
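def print_topics(model: TruncatedSVD, vectorizer: TfidfVectorizer, top_n: int = 10) -> None:
    # Look up the vocabulary once instead of once per term.
    # Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d: " % (idx))
        # argsort is ascending, so step backwards to take the top_n largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])
        print("\n")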
def run_lsa_and_print_topics(df: pd.DataFrame, num_topics: int = 5, num_words: int = 5) -> None:
    tfidf_vectorizer = TfidfVectorizer(stop_words="english", use_idf=True, smooth_idf=True)
    lsa_model = TruncatedSVD(n_components=num_topics)
    tfidf_transformed = tfidf_vectorizer.fit_transform(df.statement)
    lsa_transform = lsa_model.fit_transform(tfidf_transformed)
    # Note: num_words is not forwarded, so print_topics falls back to its default top_n=10
    print_topics(lsa_model, tfidf_vectorizer)
run_lsa_and_print_topics(true_ex)
Topic 0: [('percent', 0.3446066859852471), ('says', 0.30134235709426055), ('tax', 0.17570181183559286), ('000', 0.17537535212814193), ('state', 0.17432700724205186), ('years', 0.1637992974569964), ('year', 0.15946360298300044), ('health', 0.15223179554013794), ('jobs', 0.1512468345357739), ('obama', 0.15023338680030865)] Topic 1: [('percent', 0.7130245965769676), ('rate', 0.1284049542600459), ('income', 0.12122599079377616), ('unemployment', 0.07846849345941984), ('40', 0.0778244000009848), ('highest', 0.0647159077703515), ('states', 0.0571891410621218), ('90', 0.05610926476450938), ('10', 0.054398506567232024), ('pay', 0.052111081137052234)] Topic 2: [('health', 0.6121872483190378), ('care', 0.5155687801326315), ('insurance', 0.20703308144524285), ('percent', 0.1730531525177484), ('americans', 0.09819725734870934), ('plan', 0.09325383691447942), ('reform', 0.07681252909630736), ('law', 0.06790690769043992), ('coverage', 0.06526842084089038), ('people', 0.059433004101819994)] Topic 3: [('jobs', 0.40371897959093284), ('000', 0.3451014301020803), ('year', 0.1779840231677903), ('created', 0.17731598266193382), ('million', 0.15981600826172734), ('new', 0.1328380956845921), ('states', 0.12775569676343954), ('state', 0.10584583954852417), ('lost', 0.09924218485512679), ('sector', 0.08960744564317102)] Topic 4: [('says', 0.29169832295606907), ('tax', 0.26597605849363154), ('state', 0.22262641284967005), ('states', 0.22132740450524047), ('highest', 0.17778499827354302), ('united', 0.1524035639570866), ('world', 0.12172744840584018), ('rate', 0.10149670311086846), ('texas', 0.08673194692196448), ('romney', 0.08325273167638415)]
run_lsa_and_print_topics(false_ex)
Topic 0: [('says', 0.3713131865433048), ('health', 0.25918786332360555), ('obama', 0.24361909530305434), ('care', 0.23891105739101826), ('president', 0.2041384203426522), ('percent', 0.181556702902801), ('tax', 0.18134775197568065), ('barack', 0.18043088027681578), ('state', 0.15384441594684212), ('000', 0.13395247996062468)] Topic 1: [('health', 0.5362594773337584), ('care', 0.5095311190637265), ('law', 0.15962497932327632), ('insurance', 0.1190311410969168), ('government', 0.09612352440341311), ('reform', 0.07624963656718822), ('plan', 0.0548456239711609), ('affordable', 0.05226139390917967), ('takeover', 0.049974737396065505), ('federal', 0.047618463589091775)] Topic 2: [('obama', 0.43263828126153625), ('barack', 0.3654004481916315), ('president', 0.3528303655470234), ('health', 0.25232884260135696), ('care', 0.2387796515096658), ('obamas', 0.07340978013293742), ('law', 0.06190046959374422), ('muslim', 0.04801688367820748), ('insurance', 0.03985947037815977), ('government', 0.03510742846599261)] Topic 3: [('says', 0.3601926378062941), ('tax', 0.29914237709692576), ('voted', 0.2090570961781729), ('taxes', 0.16708420182393732), ('security', 0.13306598973258443), ('social', 0.12751584759756707), ('clinton', 0.12148999142698667), ('hillary', 0.11948774383975123), ('increase', 0.1160830061483816), ('medicare', 0.09987800559590629)] Topic 4: [('tax', 0.5696738942959733), ('percent', 0.27862096191758035), ('increase', 0.23370453182670134), ('history', 0.13298358877532798), ('taxes', 0.11574315902491318), ('largest', 0.11034773532780529), ('middle', 0.10329336319112692), ('income', 0.10310490206430784), ('class', 0.10125439217511539), ('rate', 0.08622352209226332)]
run_lsa_and_print_topics(true_ex, num_topics=10, num_words=8)
Topic 0: [('percent', 0.34464970280088003), ('says', 0.30136858342051503), ('tax', 0.17581041906159198), ('000', 0.1754249262137547), ('state', 0.17434933458660695), ('years', 0.16385160106239952), ('year', 0.15961304277565858), ('health', 0.15220337133723724), ('jobs', 0.15118547394373813), ('obama', 0.15023166146065628)] Topic 1: [('percent', 0.7104995722985197), ('rate', 0.13039463945785099), ('income', 0.12298361734013426), ('40', 0.07615673815721166), ('unemployment', 0.07609073459883278), ('highest', 0.07196722575532691), ('10', 0.0582961801849593), ('90', 0.055239834560814646), ('pay', 0.054807897120837855), ('states', 0.0538145413263559)] Topic 2: [('health', 0.6127669595485653), ('care', 0.5158195751657895), ('insurance', 0.20677873337158775), ('percent', 0.17085946352051343), ('americans', 0.1021210415077282), ('plan', 0.09286467177153919), ('reform', 0.07552356833153973), ('coverage', 0.06511718559401251), ('law', 0.064600352922588), ('people', 0.05631684729133986)] Topic 3: [('jobs', 0.39186334685557306), ('000', 0.3499856835677836), ('year', 0.19200833218755234), ('million', 0.1817461771958524), ('created', 0.17186836572100928), ('new', 0.15123506378125176), ('state', 0.1256521250688246), ('lost', 0.09488502940246446), ('sector', 0.09295815365181305), ('private', 0.08561777827570806)] Topic 4: [('states', 0.49057478714807023), ('united', 0.3653021685420567), ('highest', 0.23840000906301523), ('tax', 0.19764501790919237), ('world', 0.19558657220732956), ('says', 0.18869435909104107), ('rate', 0.18647918813363873), ('country', 0.10897356346384747), ('corporate', 0.09906563655116386), ('rates', 0.08028796254880786)] Topic 5: [('states', 0.37277915514386106), ('obama', 0.32802387965486113), ('president', 0.31383377656967026), ('united', 0.30608917171119954), ('barack', 0.22128257811355806), ('jobs', 0.14442257536347833), ('million', 0.09806146933462641), ('people', 0.09353037420322818), ('world', 0.08930008438420224), ('created', 0.07708632949290355)] 
Topic 6: [('tax', 0.46986742012845295), ('billion', 0.2500813432545964), ('taxes', 0.2377607548339805), ('year', 0.20093892461461285), ('cut', 0.16935595733969636), ('income', 0.11941529826080334), ('cuts', 0.11785769499061632), ('debt', 0.11627099898190965), ('obama', 0.10326603027257057), ('highest', 0.09464474632252631)] Topic 7: [('state', 0.36327393215548387), ('years', 0.2913904979373814), ('billion', 0.2021482765517757), ('year', 0.19596219524951175), ('budget', 0.16842879817106066), ('spending', 0.14794025723331494), ('states', 0.14032179819430154), ('debt', 0.13759548305067246), ('united', 0.10987159646747006), ('education', 0.08772849889449348)] Topic 8: [('000', 0.3713904599424029), ('year', 0.30263332712067104), ('people', 0.21947369671686773), ('states', 0.14602392302577663), ('united', 0.13708088289959092), ('trump', 0.12933821439309043), ('says', 0.12788520949241758), ('donald', 0.11804823343276882), ('average', 0.10437467387759754), ('billion', 0.10356949695811583)] Topic 9: [('million', 0.39551188680778687), ('taxes', 0.3875452081027945), ('cut', 0.26689761742451057), ('states', 0.2066096701507176), ('united', 0.16234442830836723), ('budget', 0.12911764751718238), ('billion', 0.12813496210885858), ('jobs', 0.12668494890390622), ('years', 0.08559420105643778), ('created', 0.07602658584659931)]
run_lsa_and_print_topics(false_ex, num_topics=10, num_words=8)
Topic 0: [('says', 0.3712933547096524), ('health', 0.2591965063122843), ('obama', 0.24362255975433417), ('care', 0.23891208445532328), ('president', 0.2040692485051366), ('percent', 0.18159046145243976), ('tax', 0.18118826470974575), ('barack', 0.18042537912651432), ('state', 0.15382168539836216), ('000', 0.13401033599128065)] Topic 1: [('health', 0.5359168000529226), ('care', 0.5099104865146392), ('law', 0.1597688423271019), ('insurance', 0.11811153543328377), ('government', 0.09754349187061077), ('reform', 0.07612505832910675), ('plan', 0.054788264819049935), ('affordable', 0.052444410624017376), ('takeover', 0.05020500506216276), ('federal', 0.04953565053959517)] Topic 2: [('obama', 0.4318319468696116), ('barack', 0.36529452164564635), ('president', 0.3559452810276161), ('health', 0.2523673816645355), ('care', 0.23887771362635574), ('obamas', 0.07415884014526254), ('law', 0.06411084741735812), ('muslim', 0.048281772601301995), ('insurance', 0.03956856127918926), ('government', 0.035858673495733503)] Topic 3: [('says', 0.3539600368989182), ('tax', 0.3078196495384467), ('voted', 0.1957383494630742), ('taxes', 0.1889878939673833), ('clinton', 0.12082510280345107), ('hillary', 0.11884023757009579), ('increase', 0.11272776731877572), ('security', 0.11110225351801019), ('social', 0.10670533703330345), ('raise', 0.10490053142499428)] Topic 4: [('tax', 0.5848570707174481), ('percent', 0.266356768886472), ('increase', 0.2078556452951394), ('history', 0.11328847525081341), ('income', 0.10766965031707744), ('middle', 0.10411677904246334), ('plan', 0.10261492563557403), ('class', 0.10088797148056364), ('taxes', 0.09733809583500684), ('largest', 0.0947958745915399)] Topic 5: [('percent', 0.6227440815791139), ('unemployment', 0.1638942336471306), ('rate', 0.1414109190270333), ('state', 0.08894402759719573), ('states', 0.08102280044468621), ('clinton', 0.07642665947346683), ('hillary', 0.07383609843982977), ('texas', 0.06101047589916523), ('90', 0.054148489059908794), 
('united', 0.05168707477363369)] Topic 6: [('state', 0.5529115759659462), ('billion', 0.19148657227426047), ('budget', 0.15483990450423527), ('president', 0.14880276110887147), ('scott', 0.1453258283692525), ('tax', 0.11566178158532843), ('wisconsin', 0.10460598786765822), ('walker', 0.10338007216182425), ('gov', 0.10045506776703537), ('barack', 0.08720440858250163)] Topic 7: [('security', 0.2888721921044027), ('social', 0.2710233705393134), ('states', 0.27094480720203423), ('medicare', 0.2632679335587358), ('years', 0.23862807711262857), ('united', 0.218339928858858), ('billion', 0.175600347050041), ('voted', 0.16771166232642518), ('obamacare', 0.1429282467290355), ('budget', 0.1368641576959897)] Topic 8: [('states', 0.3834340865309137), ('united', 0.3048686806353401), ('state', 0.21781976649340778), ('tax', 0.2164884574434894), ('people', 0.16476753065958635), ('texas', 0.13843465330682697), ('history', 0.09933434918234924), ('senate', 0.08641740376925208), ('illegal', 0.08595859838469107), ('country', 0.08561298674616122)] Topic 9: [('states', 0.3351068935017282), ('united', 0.2500461472071464), ('scott', 0.22543926487868446), ('walker', 0.17215860903970068), ('gov', 0.15890258473784705), ('wisconsin', 0.15709835439962302), ('says', 0.1324516717931882), ('years', 0.1295491210133638), ('billion', 0.08304769307517522), ('taxes', 0.0812893881119258)]
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
def extract_compound_sentiment(statement: str) -> float:
    return analyzer.polarity_scores(statement)["compound"]


def extract_pos_sentiment(statement: str) -> float:
    return analyzer.polarity_scores(statement)["pos"]


def extract_neg_sentiment(statement: str) -> float:
    return analyzer.polarity_scores(statement)["neg"]


def extract_neu_sentiment(statement: str) -> float:
    return analyzer.polarity_scores(statement)["neu"]
true_compound = true_ex.statement.apply(lambda x: extract_compound_sentiment(x))
px.histogram(true_compound, x=true_compound.values, labels={"x": "polarity"}, title="True Compound Polarity")
true_compound.describe()
count 5770.000000 mean -0.019625 std 0.394974 min -0.974400 25% -0.296000 50% 0.000000 75% 0.250000 max 0.942300 Name: statement, dtype: float64
true_pos = true_ex.statement.apply(lambda x: extract_pos_sentiment(x))
true_neg = true_ex.statement.apply(lambda x: extract_neg_sentiment(x))
false_compound = false_ex.statement.apply(lambda x: extract_compound_sentiment(x))
false_pos = false_ex.statement.apply(lambda x: extract_pos_sentiment(x))
false_neg = false_ex.statement.apply(lambda x: extract_neg_sentiment(x))
px.histogram(false_compound, x=false_compound.values, labels={"x": "polarity"}, title="False Compound Polarity")
false_compound.describe()
count 4496.000000 mean -0.002768 std 0.382121 min -0.973500 25% -0.273200 50% 0.000000 75% 0.273200 max 0.937100 Name: statement, dtype: float64
true_pos.describe()
count 5770.000000 mean 0.066705 std 0.095489 min 0.000000 25% 0.000000 50% 0.000000 75% 0.124000 max 0.598000 Name: statement, dtype: float64
false_pos.describe()
count 4496.000000 mean 0.073712 std 0.102112 min 0.000000 25% 0.000000 50% 0.000000 75% 0.137000 max 0.672000 Name: statement, dtype: float64
true_neg.describe()
count 5770.000000 mean 0.075679 std 0.110518 min 0.000000 25% 0.000000 50% 0.000000 75% 0.137000 max 0.796000 Name: statement, dtype: float64
false_neg.describe()
count 4496.000000 mean 0.074186 std 0.110158 min 0.000000 25% 0.000000 50% 0.000000 75% 0.138000 max 0.783000 Name: statement, dtype: float64
train_df.speaker_title.value_counts()
President 497 U.S. Senator 480 Governor 391 President-Elect 273 U.S. senator 263 ... Georgia Transportation COmmissioner 1 Lobbyst for the Rhode Island Federation of Teachers and Health Professionals 1 Milwaukee Alderman 1 Director of research, Catholic Family and Human Rights Institute 1 president, Massachusetts Prevention Alliance 1 Name: speaker_title, Length: 1187, dtype: int64
train_df.speaker_title.nunique()
1187
lower_speaker_title = train_df.speaker_title.dropna().astype(str).apply(lambda val: val.lower().strip().replace("-", " "))
lower_speaker_title.value_counts().plot.hist()
<AxesSubplot:ylabel='Frequency'>
lower_speaker_title.dropna(inplace=True)
lower_speaker_title
0 state representative 1 state delegate 2 president 5 wisconsin assembly speaker 7 president ... 10255 president elect 10257 senator 10258 state senator, 8th district 10259 senior editor, the atlantic 10266 chairman of the republican national committee Name: speaker_title, Length: 7366, dtype: object
import editdistance
unique_speaker_title = lower_speaker_title.unique()
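# Print every pair of distinct titles within edit distance 2 of each other
for i in range(len(unique_speaker_title)):
    # Start at i + 1 so each unordered pair is compared once and never with itself
    for j in range(i + 1, len(unique_speaker_title)):
        if editdistance.eval(unique_speaker_title[i].strip(), unique_speaker_title[j].strip()) <= 2:
            print(i, j, unique_speaker_title[i], ", ", unique_speaker_title[j])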
7 1005 u.s. house member 4th district , u.s. house member 7th district 7 1006 u.s. house member 4th district , u.s. house member 6th district 7 1027 u.s. house member 4th district , u.s. house member 8th district 12 65 house minority leader , house majority leader 16 478 state senator , state senators 17 19 u.s. house of representative , u.s. house of representatives 19 782 u.s. house of representatives , n.c. house of representatives 31 1035 talk show host , talks show host 39 310 u.s. congressman , u. s. congressman 44 938 senate minority leader , senate majority leader 46 104 congressman , congresswoman 49 362 u.s. representative , u.s. representativej 50 468 state assembly member, 78th district , state assembly member, 95th district 52 1043 constable, travis county, precinct 5 , constable, travis county, precinct 2 55 177 politican action committee , political action committee 77 588 n.c. secretary of commerce , u.s. secretary of commerce 92 636 assemblywoman , assemblyman 95 157 u.s. representative, florida district 22 , u.s. representative, florida district 23 95 365 u.s. representative, florida district 22 , u.s. representative, florida district 17 95 529 u.s. representative, florida district 22 , u.s. representative, florida district 10 95 534 u.s. representative, florida district 22 , u.s. representative, florida district 2 95 553 u.s. representative, florida district 22 , u.s. representative, florida district 25 95 1015 u.s. representative, florida district 22 , u.s. representative, florida district 8 99 1018 former u.s. representative from ohio's 11th district , former u.s. 
representative from ohio's 18th district 106 637 co host on cnn's "crossfire" , co host of cnn's "crossfire" 113 230 businessman , businesswoman 114 209 journalists , journalist 118 226 author , actor 131 707 state assembly member, 62nd district , state assembly member, 22nd district 131 714 state assembly member, 62nd district , state assembly member, 42nd district 138 979 nonprofit organization , nonproft organization 141 174 state senator, 20th district , state senator, 8th district 141 562 state senator, 20th district , state senator, 24th district 141 845 state senator, 20th district , state senator, 13th district 143 210 state senator, district 27 , state senator, district 33 143 211 state senator, district 27 , state senator, district 16 143 272 state senator, district 27 , state senator, district 4 143 289 state senator, district 27 , state senator, district 23 143 314 state senator, district 27 , state senator, district 26 143 396 state senator, district 27 , state senator, district 18 143 691 state senator, district 27 , state senator, district 32 144 288 state assemblyman , state assemblywoman 157 365 u.s. representative, florida district 23 , u.s. representative, florida district 17 157 529 u.s. representative, florida district 23 , u.s. representative, florida district 10 157 534 u.s. representative, florida district 23 , u.s. representative, florida district 2 157 553 u.s. representative, florida district 23 , u.s. representative, florida district 25 157 1015 u.s. representative, florida district 23 , u.s. 
representative, florida district 8 174 562 state senator, 8th district , state senator, 24th district 174 845 state senator, 8th district , state senator, 13th district 210 211 state senator, district 33 , state senator, district 16 210 272 state senator, district 33 , state senator, district 4 210 289 state senator, district 33 , state senator, district 23 210 314 state senator, district 33 , state senator, district 26 210 396 state senator, district 33 , state senator, district 18 210 691 state senator, district 33 , state senator, district 32 211 272 state senator, district 16 , state senator, district 4 211 289 state senator, district 16 , state senator, district 23 211 314 state senator, district 16 , state senator, district 26 211 396 state senator, district 16 , state senator, district 18 211 691 state senator, district 16 , state senator, district 32 215 459 retired , retiree 226 527 actor , pastor 226 728 actor , doctor 252 379 representative from ohio's 11th congressional district , representative from ohio's 18th congressional district 254 1068 pac , cpa 261 415 state senate majority leader , state senate minority leader 272 289 state senator, district 4 , state senator, district 23 272 314 state senator, district 4 , state senator, district 26 272 396 state senator, district 4 , state senator, district 18 272 691 state senator, district 4 , state senator, district 32 281 940 district 69 state representative , district 43 state representative 289 314 state senator, district 23 , state senator, district 26 289 396 state senator, district 23 , state senator, district 18 289 691 state senator, district 23 , state senator, district 32 314 396 state senator, district 26 , state senator, district 18 314 691 state senator, district 26 , state senator, district 32 323 637 co host of cnn's crossfire , co host of cnn's "crossfire" 331 471 atlanta city councilman , atlanta city councilwoman 334 1077 physician , physicist 344 447 restauranteur , restaurateur 358 870 
oregon house member , oregon house members 365 529 u.s. representative, florida district 17 , u.s. representative, florida district 10 365 534 u.s. representative, florida district 17 , u.s. representative, florida district 2 365 553 u.s. representative, florida district 17 , u.s. representative, florida district 25 365 1015 u.s. representative, florida district 17 , u.s. representative, florida district 8 396 691 state senator, district 18 , state senator, district 32 467 540 district 3 member, austin city council , district 8 member, austin city council 467 763 district 3 member, austin city council , district 1 member, austin city council 529 534 u.s. representative, florida district 10 , u.s. representative, florida district 2 529 553 u.s. representative, florida district 10 , u.s. representative, florida district 25 529 1015 u.s. representative, florida district 10 , u.s. representative, florida district 8 534 553 u.s. representative, florida district 2 , u.s. representative, florida district 25 534 1015 u.s. representative, florida district 2 , u.s. representative, florida district 8 540 763 district 8 member, austin city council , district 1 member, austin city council 553 1015 u.s. representative, florida district 25 , u.s. representative, florida district 8 562 845 state senator, 24th district , state senator, 13th district 658 1068 ceo , cpa 707 714 state assembly member, 22nd district , state assembly member, 42nd district 718 997 florida democratic party chairman , florida democratic party chairwoman 854 1075 cbs news chief white house correspondent , abc news chief white house correspondent 1005 1006 u.s. house member 7th district , u.s. house member 6th district 1005 1027 u.s. house member 7th district , u.s. house member 8th district 1006 1027 u.s. house member 6th district , u.s. house member 8th district
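Most of the pairs printed above differ only in a district or precinct number, so they are distinct titles rather than misspellings. The following is a minimal sketch of one possible way to narrow the list to likely typos, not part of the original analysis. It assumes that titles which become identical after masking digits should be skipped. The helpers `levenshtein`, `mask_digits`, and `likely_typo_pairs` are hypothetical names introduced for illustration, and the pure-Python `levenshtein` stands in for `editdistance.eval` so the example runs without the extra dependency.

```python
import re
from itertools import combinations
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Pure-Python edit distance (hypothetical stand-in for editdistance.eval)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Cheapest of deletion, insertion, substitution.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mask_digits(title: str) -> str:
    """Replace each number (with optional ordinal suffix) by a placeholder."""
    return re.sub(r"\d+(st|nd|rd|th)?", "#", title)


def likely_typo_pairs(titles: List[str], max_dist: int = 2) -> List[Tuple[str, str]]:
    """Return close pairs, skipping those that differ only in a number (assumption)."""
    pairs = []
    for a, b in combinations(titles, 2):
        a, b = a.strip(), b.strip()
        if mask_digits(a) == mask_digits(b):
            continue  # e.g. "district 27" vs "district 33": not a typo
        if levenshtein(a, b) <= max_dist:
            pairs.append((a, b))
    return pairs


# A few titles taken from the output above.
sample_titles = [
    "state senator, district 27",
    "state senator, district 33",
    "nonprofit organization",
    "nonproft organization",
    "state senator, 20th district",
    "state senator, 8th district",
]
typo_pairs = likely_typo_pairs(sample_titles)
print(typo_pairs)
```

Pairs such as `congressman` / `congresswoman` or `actor` / `pastor` would still pass this filter, so the result is a shortlist for manual review rather than a list of confirmed duplicates.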
def compute_bin_idx(val: float, bins: List[float]) -> int:
    for idx, bin_val in enumerate(bins):
        if val <= bin_val:
            return idx