This notebook counts the occurrences of words in the cleaned text files. By default the cleaned text files are expected in the folder Cleaned
and the count files are written into the folder Counts
. We leave the counting to CountVectorizer from sklearn.feature_extraction.text
. Most of the time is spent splitting the matrix of all counts and storing the counts of each file separately. We invest this time so that the counts can easily be reviewed manually.
A few examples at the end of the notebook illustrate the result of the process.
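The core counting step can be sketched on a toy corpus (the texts and words below are illustrative, not from the project data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned text files.
toy_texts = ["apple banana apple", "banana cherry", "apple cherry cherry"]

counter = CountVectorizer(analyzer='word', lowercase=False)
matrix = counter.fit_transform(toy_texts)  # sparse document-term matrix

print(sorted(counter.vocabulary_))  # ['apple', 'banana', 'cherry']
print(matrix.toarray())             # rows = documents, columns = words
```

Each row of the resulting sparse matrix holds the word counts of one document; the notebook later slices this matrix row by row to write one count file per document.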
This notebook writes to and reads from your file system. By default all directories used are within ~/TextData/AbgeordnetenWatch
, where ~
stands for whatever your operating system considers your home directory. To change this configuration either change the default values in the second next cell or edit LDA Spike - Configuration.ipynb and run it before you run this notebook.
This notebook operates on text files. In our case we retrieved these texts from www.abgeordnetenwatch.de, guided by data that was made available under the Open Database License (ODbL) v1.0 at that site.
import time
import random as rnd
from pathlib import Path
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
%store -r own_configuration_was_read
if 'own_configuration_was_read' not in globals(): raise Exception(
    '\nReminder: You might want to run your configuration notebook before you run this notebook.' +
    '\nIf you want to manage your configuration from each notebook, just remove this check.')
%store -r project_name
if 'project_name' not in globals(): project_name = 'AbgeordnetenWatch'
%store -r text_data_dir
if 'text_data_dir' not in globals(): text_data_dir = Path.home() / 'TextData'
cleaned_dir = text_data_dir / project_name / 'Cleaned'
counts_dir = text_data_dir / project_name / 'Counts'
assert cleaned_dir.exists(), 'Directory should exist.'
assert cleaned_dir.is_dir(), 'Directory should be a directory.'
assert next(cleaned_dir.iterdir(), None) is not None, 'Directory should not be empty.'
counts_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!
update_only_missing_counts = True
min_df = 3    # Ignore words that do not occur in at least min_df documents. Helps to ignore misspelled words.
              # 3 is a rather low threshold that leads to a big vocabulary.
max_df = 0.5  # Ignore words that occur in more than half of the documents. Helps to ignore regular phrases.
              # 0.5 still keeps words that occur in almost every second document.
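The effect of the two thresholds can be seen on a small hypothetical corpus (the words are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four toy documents with known document frequencies:
# "common" appears in 4, "beta" in 3, "alpha" in 2, "rare" in 1.
docs = [
    "common rare alpha",
    "common alpha beta",
    "common beta",
    "common beta",
]

# min_df=2 drops words in fewer than 2 documents ("rare").
# max_df=0.5 drops words in strictly more than 50% of documents ("common", "beta").
counter = CountVectorizer(min_df=2, max_df=0.5, lowercase=False)
counter.fit(docs)
print(sorted(counter.vocabulary_))  # ['alpha']
```

Note that a float max_df is a fraction of the corpus and the cutoff is strict, so a word in exactly half of the documents survives.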
notebook_start_time = time.perf_counter()
filenames = []
texts = []
files = list(cleaned_dir.glob('*A*.txt')) # Answers
files.sort()
for file in files:
filenames.append(file.stem)
texts.append(file.read_text())
print('Read {} documents: "{}" ... "{}"'.format(len(filenames), filenames[0], filenames[-1]))
Read 7696 documents: "achim-kessler_die-linke_Q0001_2017-08-06_A01_2017-08-11_gesundheit" ... "zaklin-nastic_die-linke_Q0008_2017-10-25_A01_2018-09-24_demokratie-und-bürgerrechte"
counter_start_time = time.perf_counter()
counter = CountVectorizer(analyzer='word', min_df=min_df, max_df=max_df, lowercase=False)
word_counts = counter.fit_transform(texts)
words = counter.get_feature_names()  # Deprecated since sklearn 1.0; use get_feature_names_out() there.
print('Counted {} unique words.'.format(len(words)))
counter_end_time = time.perf_counter()
print('Counting took {:.2f}s.'.format(counter_end_time - counter_start_time))
Counted 19670 unique words. Counting took 0.85s.
dump_start_time = time.perf_counter()
for doc, filename in enumerate(filenames):
target_file = counts_dir / (filename + '.count')
if update_only_missing_counts and target_file.exists(): continue
counts = {}
    doc_word_counts = word_counts[doc, :]
    _, word_indices = doc_word_counts.nonzero()
for word in word_indices:
counts[words[word]] = str(doc_word_counts[0, word])
target_file.write_text(json.dumps(counts, ensure_ascii=False, indent=0, sort_keys=True))
print('\rWrote ' + filename, end='')
dump_end_time = time.perf_counter()
print('\nDumping the word counts to files took {:.2f}s.'.format(dump_end_time - dump_start_time))
Dumping the word counts to files took 0.10s.
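A count file written this way can be read back with json.loads; a sketch using a temporary file instead of a real file from Counts (the example words are invented):

```python
import json
import tempfile
from pathlib import Path

# Write a small example count file the same way the loop above does.
example_counts = {'Frage': '1', 'Gesetz': '4', 'erreichen': '3'}
with tempfile.TemporaryDirectory() as tmp:
    target_file = Path(tmp) / 'example.count'
    target_file.write_text(json.dumps(example_counts, ensure_ascii=False,
                                      indent=0, sort_keys=True))

    # Read it back and convert the string values to integers.
    counts = {word: int(n)
              for word, n in json.loads(target_file.read_text()).items()}

print(counts['Gesetz'])  # 4
```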
# For slice the notation [from:to:step] see the
# reference https://docs.python.org/3/library/stdtypes.html?highlight=slice%20notation#common-sequence-operations or the
# explanation https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation/509295#509295
# For sorting with argsort see
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# https://docs.scipy.org/doc/numpy/reference/routines.sort.html
sample_documents = rnd.sample(range(len(filenames)), 7)
for doc in sample_documents:
filename = filenames[doc]
print('{:32.32}: '.format(filename), end ='')
word_count = word_counts[doc, :].toarray().flatten()
most_frequent = np.argsort(word_count)[:-6:-1]
for word in most_frequent:
        print('{:4} {:12.12}'.format(word_count[word], '"' + words[word] + '"'), end = '')
print('')
tobias-pfluger_die-linke_Q0007_2:    2 "entsprechen    2 "Traditionse    1 "bw"            1 "http"         1 "Mitglied"
katja-kipping_die-linke_Q0059_20:    4 "Open"         3 "öffentlich"    2 "digitale"      2 "Source"       2 "sehen"
dr-daniela-de-ridder_spd_Q0002_2:    4 "neue"         2 "gelingen"      2 "Mensch"        2 "de"           2 "finden"
irene-mihalic_die-grünen_Q0003_2:   12 "Tierschutz"   7 "Tier"          5 "Massentierh    5 "grün"         5 "wollen"
jorg-schneider_afd_Q0002_2017-09:    3 "Bundeswehr"   2 "Ausstattung    2 "materiell"     1 "hören"        1 "schnellen"
bernd-riexinger_die-linke_Q0005_:    4 "sollen"       2 "persoenlich    2 "gut"           2 "Gerechtigke    2 "denken"
mahmut-ozdemir_spd_Q0010_2018-06:    3 "Frage"        2 "engagieren"    2 "politische"    2 "sozialdemok    2 "Gespräch"
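The argsort-plus-negative-step-slice idiom used above can be checked in isolation on a hand-made count vector:

```python
import numpy as np

# Toy per-document count vector (one entry per vocabulary word).
word_count = np.array([3, 0, 7, 1, 5])

# argsort returns indices in ascending order of count; the slice [:-6:-1]
# walks backwards, yielding up to the five indices with the highest counts.
most_frequent = np.argsort(word_count)[:-6:-1]
print(most_frequent.tolist())  # [2, 4, 0, 3, 1]
```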
min_len = 400
max_len = 800
example_text = ''
while (len(example_text) < min_len or len(example_text) > max_len):
    example = rnd.randrange(len(texts))  # randint(0, len(texts)) could index one past the end
example_text = texts[example]
print(30 * '-' + ' Cleaned text: ' + 30 * '-')
print(example_text)
print(30 * '-' + ' Word counts: ' + 30 * '-')
counts = json.loads((counts_dir / (filenames[example] + '.count')).read_text())
print(counts)
print(30 * '-' + ' Words not counted: ' + 30 * '-')
print(', '.join([word for word in example_text.split(' ') if word not in counts]))
def df_to_text(df):
return "{}".format(df) if isinstance(df, int) else '{:.0%} of the'.format(df)
print('(We did not count words that are in less than {} documents or in more than {} documents.)'.format(
df_to_text(min_df), df_to_text(max_df)))
------------------------------ Cleaned text: ------------------------------ Dank Frage persönlich konkret neu Gesetz freuen vorangetrieben gut Schutz Bewohner Bahnstrecken gemeinsam Bürgerinitiativen entsprechend drucken erreichen laut Güterzüge Schiene fahren dürfen Betroffene Pankow konkret helfen Wahl deutsche Bundestag einsetzen Mieterinnen Mieter Energieversorgung gleich Recht einräumen Hausbesitzer Verabschiedung Mieterstrommodells erreichen mögen Aufstockung Städtebaufördermittel erwähnen SPD durchsetzen können mitteln können Wahlkreis u.a. Schule sanieren Gesetz zeigen Beharrlichkeit ständig Thematisieren Politik auszahlen Gesetz Mittelaufstockung Mehrheit absehbar freuen Gesetz Broschüre finden weit dingen Bundestag Wahlkreis erreichen können finden Download http://www.klaus-mindrup.de/content/pressebilder-downloads Rückfrage stehen Verfügung ------------------------------ Word counts: ------------------------------ {'Aufstockung': '1', 'Bahnstrecken': '1', 'Betroffene': '1', 'Bewohner': '1', 'Broschüre': '1', 'Bundestag': '2', 'Bürgerinitiativen': '1', 'Download': '1', 'Energieversorgung': '1', 'Frage': '1', 'Gesetz': '4', 'Güterzüge': '1', 'Hausbesitzer': '1', 'Mehrheit': '1', 'Mieter': '1', 'Mieterinnen': '1', 'Pankow': '1', 'Politik': '1', 'Recht': '1', 'Rückfrage': '1', 'SPD': '1', 'Schiene': '1', 'Schule': '1', 'Schutz': '1', 'Verabschiedung': '1', 'Verfügung': '1', 'Wahl': '1', 'Wahlkreis': '2', 'absehbar': '1', 'auszahlen': '1', 'content': '1', 'de': '1', 'deutsche': '1', 'dingen': '1', 'downloads': '1', 'drucken': '1', 'durchsetzen': '1', 'dürfen': '1', 'einräumen': '1', 'einsetzen': '1', 'entsprechend': '1', 'erreichen': '3', 'erwähnen': '1', 'fahren': '1', 'finden': '2', 'freuen': '2', 'gemeinsam': '1', 'gleich': '1', 'gut': '1', 'helfen': '1', 'http': '1', 'klaus': '1', 'konkret': '2', 'laut': '1', 'mindrup': '1', 'mitteln': '1', 'mögen': '1', 'neu': '1', 'persönlich': '1', 'sanieren': '1', 'stehen': '1', 'ständig': '1', 'vorangetrieben': '1', 
'weit': '1', 'www': '1', 'zeigen': '1'} ------------------------------ Words not counted: ------------------------------ Dank, Mieterstrommodells, Städtebaufördermittel, können, können, u.a., Beharrlichkeit, Thematisieren, Mittelaufstockung, können, http://www.klaus-mindrup.de/content/pressebilder-downloads (We did not count words that are in less than 3 documents or in more than 50% of the documents.)
notebook_end_time = time.perf_counter()
print()
print(' Runtime of the notebook ')
print('-------------------------')
print('{:8.2f}s Counting the words'.format(
counter_end_time - counter_start_time))
print('{:8.2f}s Dumping the word counts to files'.format(
dump_end_time - dump_start_time))
print('{:8.2f}s All calculations together'.format(
notebook_end_time - notebook_start_time))
 Runtime of the notebook 
-------------------------
    0.85s Counting the words
    0.10s Dumping the word counts to files
    4.00s All calculations together
© D. Speicher, T. Dong. Licensed under CC BY-NC 4.0.
Acknowledgments: This material was prepared within the project P3ML which is funded by the Ministry of Education and Research of Germany (BMBF) under grant number 01/S17064. The authors gratefully acknowledge this support.