Word Cooccurrence

  1. Cooccurrene_Matrices
  2. Project_topics
  3. Resources

Cooccurrene_Matrices

There are many ways to define distributional word cooccurrence matrices. Here's a schematic overview that highlights the major decisions for building a $word\times word$ matrix:

  1. What is a co-occurrence. Cooccurrence happens in a context. The context could be an entire document, a paragraph, a sentence, a clause. In word2vec, it is a sliding window.
  2. How to count the co-occurrences. We consider the case that the context is a window of a fixed size $k$. This is the approach that is used by most distributional repreentations. There are several approaches to count the co-occurrences. The simplest method just counts everything in the context window, giving equal weight to every word inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $(k-d)$, where $d$ is the distance the token from the target, and k is the window size. This method is also used in word2vec. Another method is to using the weight $1/d$. This is used in Glove.

To create the cooccurrence pairs, we scan through the corpus, build a dictionary of word pairs along with their co-occurrence values. Every time a pair of words $w$ and $w′$ occurs in the same context, we incarese the pair weight. It can be 1 (in theory only, never used in prqactice), 1/d, or k-d.

In [16]:
from collections import Counter
import re
import numpy as np
from scipy.sparse import csr_matrix,dok_matrix
In [23]:
def read_vocab(corpus_file, min_count):
    vocab = Counter()
    with open(corpus_file) as f:
        for line in f:
            tokens=re.findall('[a-zA-Z]+', line.lower())            
            vocab.update(Counter(tokens))
    return dict([(token, count) for token, count in vocab.items() if count >= min_count])


MIN_COUNT=1
corpus="../data/vldb" 
vocab=read_vocab(corpus+".txt", MIN_COUNT)
#vocab
In [24]:
words = list(vocab.keys())
sorted_words = sorted(words)
wi = dict([(w, i) for i, w in enumerate(sorted_words)]) 
# words and their indexes. For efficiency purpose

Count cooccurrences within a window

Here is an example for counting co-occurrences inside windows. The output is list of (word-pair, freq). Words farther away receive less weight according to (k-d+1), where d is the distance from the center word.

More precisely, for a seqeunce of words

$$w_i w_{i+1}...w_{i+d}$$

The weight of pair (wi, w(i+d)) is k-d+1. I.e., it is proportional to the proximity.

In [25]:
#this one is logically clearer. But need to be improved for large data
def collectPairs(corpus_file, WIN):
    counter = Counter()
    with open(corpus_file) as f: 
        for line in f:
            tokens=re.findall('[a-zA-Z]+', line.lower())
            len_tokens = len(tokens)
            for i, tok in enumerate(tokens):
                end = i + WIN 
                if end > len_tokens:
                    end = len_tokens
                for j in range(i+1,end):
                    distance=j-i
                    proximity=WIN-distance+1;
                    if tokens[i] < tokens[j]:
                          counter[tokens[i]+' '+tokens[j]] +=proximity+1
                    else: 
                          counter[tokens[j]+' '+tokens[i]] +=proximity+1
    return counter

#Pair collection is computationally intensive. We can improve the speed by treating each term as an intger index. 
def collectPairs_fast(corpus_file, vocab, wi):
    count=0
    pair_count = dict()
    with open(corpus_file) as f:
        for line in f:
            tokens=re.findall('[a-zA-Z]+', line.lower())
            len_tokens = len(tokens)
            i=-1
            for tok in tokens:
                i+=1
                end = i + WIN
                if end > len_tokens:
                    end = len_tokens
                w1index=wi[tokens[i]]
                for j in range(i+1,end): # range is (i, end) if include the self similarity
                    w2index=wi[tokens[j]]
                    v= pair_count.get((w1index, w2index))
                    val=WIN-(j-i)  # WIN WIN-1, ...,  1
                    if  v==None:
                        pair_count[(w1index, w2index)]=val
                    else:
                        pair_count[(w1index, w2index)]=v+val
                    count+=1
            if (count%1000000)==0:
                print(".", end='',  flush=True)
    return pair_count
 

Re-Weighting schemes

Construct matrix

Once the occurrences are obtained, we can read the counts into a sparse matrix (CSR) from the word-word frequncy count as below.

.. .. wj ..
.. ..
-- -- --- --
wi .. Fij ..
-- -- --- --
.. ..
-- -- --- --

Normally, the raw count $F_{ij}$ are seldom used directly. Often, they are reweighted using metrics such as mutual information. PMI essentially measures the ratio between the observed frequency and expected frequency: $$ PMI_{ij} \propto \frac{f_{ij}}{f_if_j} $$

Efficient implementation

dok_matrix is slow when adding elements. It is more efficient to use dict directly, then use update() to transform it into dok_matrix. Note that update() is no longer supported in python3, so we use update() instead. Sppedup is approximately by a factor of 5.

In [20]:
def get_matrix(counter, N, vocab, win,lamb,method):
    words = list(vocab.keys())
    iw = sorted(words)
    wi = dict([(w, i) for i, w in enumerate(iw)])
    n=len(wi)
    counts = csr_matrix((n, n), dtype=np.float32)
    tmp_counts = dok_matrix((n, n), dtype=np.float32)
    update_batch_size = 100000
    i = 0
    factor=N/win/(win-1)/win/(win-1) # 

    for c in counter:
            wa, wb = c.strip().split() 
            fab = float(counter[c])
            fa = vocab[wa] 
            fb = vocab[wb]
            pmi= float(fab)/fa/fb*factor
            if method==PMI:
                value=pmi 
            elif (method==SPPMI):
                value=pmi/5
            else:    #method=DMI
                d = lamb/np.sqrt(fab)+1
                value= pmi/d
                
            if value>1:  ## keep positive values when they are logged. logging is done in eval.py
                tmp_counts[wi[wa], wi[wb]] = value
                tmp_counts[wi[wb], wi[wa]] = value
                i += 1
            if i == update_batch_size:
                counts = counts + tmp_counts.tocsr()
                tmp_counts = dok_matrix((n, n), dtype=np.float32)
                i = 0
    counts = counts + tmp_counts.tocsr()
    return counts

Collect word pairs

In [21]:
WIN=5
PMI=0
SPPMI=1
DMI=2 
METHODS=[PMI, SPPMI, DMI]
mnames=["PMI","SPPMI" , "DMI"]
pairs=collectPairs(corpus+".txt", 5)
n=sum(vocab.values()) #corpus length
N=sum(pairs.values())
N2=n*WIN*(WIN+1)/2 # approximately equal to N when line length is long. 
m=len(wi) # voc size

print(N2)
print(N)
print ("Unique pairs:"+str(len(pairs)))
541365.0
476942
Unique pairs:53459
$$ w_iw_{i+1}\dots w_{i+k}$$

For each window of of size k, there are $$\sum_{i=1}^k i =\frac{k(k+1)}{2}$$ pairs. If the text is one very long line, the total number of pairs is $$ n \frac{k(k+1)}{2} $$

When there are lot of lines, the last a few words in each line will have less pairs. So, the actual pair count should be less than $n k (k+1)/2$.

In [22]:
import matplotlib.pyplot as plt
rank=np.array(range(len(pairs)))+1
pairs_sorted=dict(sorted(pairs.items(), key=lambda item: item[1], reverse=True))

pair_freq=list(pairs_sorted.values())
plt.loglog(rank, pair_freq, marker='o', linestyle='None')
plt.title("Co-ocurrence freq")
plt.ylabel("Number of Occurrences")
plt.xlabel("Rank of pairs")
plt.show()
In [14]:
hist = {}
for i in pair_freq:
    hist[i] = hist.get(i, 0) + 1
Y=list(hist.values())
X=list(hist.keys())
plt.loglog(X,Y,marker="x", linestyle='None')
plt.xlabel("Pair Frequency")
plt.ylabel("Freq of Freq") 
plt.show()

Overall it follows the power law, similar to word frequency distribution. A slight difference can be observed for the rare pairs--The number of pairs that occur only once is similar in size with pairs that occur twice, 3 times and 4 times. For word frequency, hapax is almost twice the count of the words that occur twice.

The reason of less rare pairs is due to the way pairs are taken: pairs only 5 spaces away will have freq 1. Pairs with distances from 1 to 4 automatically have higher counts. Also, there are clusters of five--e.g., pairs with size 5,6,7,8,9 clusters together. They are caused by the way they are counted. Randomized selection as in word2vec will iron out those wrinkles in the plot.

In [15]:
i=0
freq=np.zeros(m)
for w in sorted_words:
    freq[i]=vocab[w]
    i+=1
pairs=collectPairs_fast(corpus+".txt", vocab, wi)
N=sum(pairs.values())    
print(f"{corpus}\tPairs actual={N}\tpairs={N2}\tVoc={len(vocab)}\tcorpus length ={n}\t{N/n}")
factor=2*N/WIN/(WIN-1)/WIN/(WIN-1) # approximate Fi from fi. not accurate when lines are short. 

for method in METHODS:
    print("\n Creating sparse matrix for "+ mnames[method]+"\t", flush=True)
    pairs_val=dict()
    i = 0
    for w1,w2 in pairs:   
        Fab =pairs[(w1,w2)] 
        fa = freq[w1] 
        fb = freq[w2]
        fa_b=fa*fb
        pmi= Fab/fa_b*factor
        if method==PMI:
            value = pmi
        elif (method==SPPMI):
            value=pmi/5
        if value>1:
            i+=1
            pairs_val[(w1,w2)]=value
            pairs_val[(w2,w1)]=value
            #if (i%1000000)==0:
            #print('.', end='',  flush=True)
    tmp_counts = dok_matrix((m, m), dtype=np.float32)
    tmp_counts._update(pairs_val)
    print("Items:"+ str(i))
    matrix = csr_matrix((m, m), dtype=np.float32)
    matrix = matrix + tmp_counts.tocsr()
    np.savez_compressed(corpus+'_'+mnames[method]+'.mat', data=matrix.data, indices=matrix.indices, indptr=matrix.indptr, shape=matrix.shape)
  
../data/vldb	Pairs actual=274518	pairs=541365.0	Voc=4583	corpus length =36091	7.606273032057854

 Creating sparse matrix for PMI	
Items:44405

 Creating sparse matrix for SPPMI	
Items:28264

 Creating sparse matrix for DMI	
Items:0

Project_topics

  • Co-occurrence counting
    • What is the distribution plot of pair counts if pairs are selected using weighting scheme 1/d as in Glove? (you will see that they no longer follow a power law).
      • What is the distribution plot if pairs are selected uing a probability as in Word2Vec?
    • Can we improve Glove using differnt pair selection methods?
      • weighting using arithmetic serials.
      • randomized pair selection.
  • Matrix reweighting

    • Can you coompare the performance for different reweigthing schemes (PMI, SPPMI, etc).
  • Shifting. The estimation is not good when pair count is small. One approach is to remove PMI values when the value is small. Can we use the variance as a guideline to remove values that are not certain?

    • Disintuish actual observations from multipliers due to the proximity.

Resources

*Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.