In this tutorial, we demonstrate how to read text data in R, tokenize texts and create a document-term matrix.
Set global options at the beginning.
options(stringsAsFactors = FALSE)
Then we load the required dependencies.
require(quanteda)
require(magrittr)
require(dplyr)
require(data.table)
The read.csv
command reads a CSV (Comma Separated Value) file from disk. Such files represent a table whose rows are represented by single lines in the files and columns are marked by a separator character within lines. Arguments of the command can be set to specify whether the CSV file contains a line with column names (header = TRUE
or FALSE
) and the character set.
We read a CSV containing 19.000 "Job Description". The texts are freely available from https://www.kaggle.com/madhab/jobposts. Our CSV file has the format: "jobpost";"date";"Title"; etc.
. Text is encapsualted into quotes ("
). Since sepration is marked by ,
, we need to specify the separator char.
textdata <- read.csv("data/data job posts.csv", header = TRUE, sep = ",", encoding = "UTF-8",quote = "\"")
The texts are now available in a data frame together with some metadata. Let us first see how many documents and metadata we have read.
# dimensions of the data frame
dim(textdata)
# column names of text and metadata
colnames(textdata)
How many job postings do we have per year? This can easily be counted with the command table
, which can be used to create a cross table of different values. If we apply it to a column, e.g. Year of our data frame, we get the counts of the unique Year values.
table(textdata[, "Year"])
Now we want to transfer the loaded text source into a corpus object of the tm
-package.
For this we crate with readTabular a mapping between column names in the data frame and placeholders in the tm corpus object. A corpus object is created with the Corpus command. As parameter, the command gets the data source wrapped by a specific reader function (DataframeSource
, other reader functions are available, e.g. for simple vectors). The reader control parameter takes the previously defined mapping of metadata as input.
textdata <- as.data.table(textdata)
english_stopwords <- readLines("data/stopwords_en.txt", encoding = "UTF-8")
textdata %<>% filter(!duplicated(jobpost))
textdata %<>% mutate(d_id = 1:nrow(textdata))
#Build a dictionary of lemmas
lemmaData <- read.csv2("data/baseform_en.tsv", sep="\t", header=FALSE, encoding = "UTF-8", stringsAsFactors = F)
data_corpus <- corpus(textdata$jobpost, docnames = textdata$d_id)
A corpus is an extension of R list objects. With the [[]]
brackets, we can access single list elements, here documents, within a corpus.
# accessing a single document object
data_corpus[1]
paste0(substring(as.character(data_corpus[1]), 0, 120), "...")
Success!!! We now have
length(data_corpus$documents$texts)
job descriptions for further analysis available in a convenient quanteda corpus object!
A further aim of this exercise is to learn about statistical characteristics of text data. At the moment, our texts are represented as long character strings wrapped in document objects of a corpus. To analyze which word forms the texts contain, they must be tokenized. This means that all the words in the texts need to be identified and separated. Only in this way is it possible to count the frequency of individual word forms. A word form is also called "type". The occurrence of a type in a text is a "token".
For text mining, text are further transformed into a numeric representation. The basic idea is that the texts can be represented as statistics about the contained words (or other content fragments such as sequences of two words). The list of every distinct word form in the entire corpsu forms the vocabulary of a corpus. For each document, we can count how often each word of the vocabulary occurs in it. By this, we get a term frequency vector for each document. The dimensionality of this term vector corresponds to the size of the vocabulary. Hence, the word vectors have the same form for each document in a corpus. Consequently, multiple term vectors representing different documents can be combined into a matrix. This data structure is called document-term matrix (DTM).
The function DocumentTermMatrix
of the tm package creates such a DTM. If this command is called without further parameters, the individual word forms are identified by using the "space" as the word separator.
data_dfm_entries <- data_corpus %>% tokens() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_tolower() %>%
tokens_replace(., lemmaData$V1, lemmaData$V2) %>%
tokens_ngrams(1) %>% dfm()
data_dfm_entries_sub <- data_dfm_entries %>%
dfm_select(pattern = "[a-z]", valuetype = "regex", selection = 'keep')
colnames(data_dfm_entries_sub) <- colnames(data_dfm_entries_sub) %>% stringi::stri_replace_all_regex("[^_a-z]", "")
DTM <- dfm_compress(data_dfm_entries_sub, "features")
DTM
A first impression of text statistics we can get from a word list. Such a word list represents the frequency counts of all words in all documents. We can get that information easily from the DTM by summing all of its column vectors.
A so-called sparse matrix data structure is used for the document term matrix in the tm package. Since most entries in a document term vector are 0, it would be very inefficient to actually store all these values. A sparse data structure instead stores only those values of a vector/matrix different from zero. The Matrix package provides arithmetic operations on sparse DTMs.
# sum columns for word counts
freqs <- colSums(DTM)
# get vocabulary vector
words <- colnames(DTM)
# combine words and their frequencies in a data frame
wordlist <- data.frame(words, freqs)
# re-order the wordlist by decreasing frequency
wordIndexes <- order(wordlist[, "freqs"], decreasing = TRUE)
wordlist <- wordlist[wordIndexes, ]
# show the most frequent words
head(wordlist, 25)
The words in this sorted list have a ranking depending on the position in this list. If the word ranks are plotted on the x axis and all frequencies on the y axis, then the Zipf distribution is obtained. This is a typical property of language data and its distribution is similar for all languages.
plot(wordlist$freqs , type = "l", lwd=2, main = "Rank frequency Plot", xlab="Rank", ylab ="Frequency")
The distribution follows an extreme power law distribution (very few words occur very often, very many words occur very rare). The Zipf law says that the frequency of a word is reciprocal to its rank (1 / r). To make the plot more readable, the axes can be logarithmized.
plot(wordlist$freqs , type = "l", log="xy", lwd=2, main = "Rank-Frequency Plot", xlab="log-Rank", ylab ="log-Frequency")
In the plot, two extreme ranges can be determined. Words in ranks between ca. 10,000 and r nrow(wordlist)
can be observed only 10 times or less. Words below rank 100 can be oberved more than 1000 times in the documents. The goal of text mining is to automatically find structures in documents. Both mentioned extreme ranges of the vocabulary often are not suitable for this. Words which occur rarely, on very few documents, and words which occur extremely often, in almost every document, do not contribute much to the meaning of a text.
Hence, ignoring very rare / frequent words has many advantages:
To illustrate the range of ranks best to be used for analysis, we augment information in the rank frequency plot. First, we mark so-called stop words. These are words of a language that normally do not contribute to semantic information about a text. In addition, all words in the word list are identified which occur less than 10 times.
The %in%
operator can be used to compare which elements of the first vector are contained in the second vector. At this point, we compare the words in the word list with a loaded stopword list (retrieved by the function stopwords
of the tm package) . The result of the %in%
operator is a boolean vector which contains TRUE or FALSE values.
A boolean value (or a vector of boolean values) can be inverted with the !
operator (TRUE
gets FALSE
and vice versa). The which
command returns the indices of entries in a boolean vector which contain the value TRUE
.
We also compute indices of words, which occur less than 10 times. With a union set operation, we combine both index lists. With a setdiff operation, we reduce a vector of all indices (the sequence 1:nrow(wordlist)
) by removing the stopword indices and the low freuent word indices.
With the command "lines" the range of the remining indices can be drawn into the plot.
plot(wordlist$freqs, type = "l", log="xy",lwd=2, main = "Rank-Frequency plot", xlab="Rank", ylab = "Frequency")
stopwords_idx <- which(wordlist$words %in% english_stopwords)
low_frequent_idx <- which(wordlist$freqs < 10)
insignificant_idx <- union(stopwords_idx, low_frequent_idx)
meaningful_range_idx <- setdiff(1:nrow(wordlist), insignificant_idx)
lines(meaningful_range_idx, wordlist$freqs[meaningful_range_idx], col = "green", lwd=2, type="p", pch=20)
The green range marks the range of meaningful terms for the collection.
head(wordlist[meaningful_range_idx, ], 25)
sum(wordlist$freqs == 1) / nrow(wordlist)
ncol(DTM) / sum(DTM)