NLP analysis of Osama bin Laden documents

By David Taylor, www.prooffreader.com, dtdata.io

On May 20, 2015, the Office of the Director of National Intelligence released documents written by Osama bin Laden that were captured during the raid in Abbottabad.

My first thought was, "Great, a corpus!"

Bear in mind that these documents are translated, almost certainly by many different people. That said, let's explore.

1. Download the files

The requests library makes this pretty easy. The documents themselves are all PDFs of 12-point Courier. That's government info release for you. They're encouraged to do it, but they don't gotta make it easy for you to do anything but read one document at a time.

I pulled the list of links from the HTML source with a regex; nothing fancy.

In [29]:
%%time

urls = """http://www.odni.gov/files/documents/ubl/english/06%20Ramadan.pdf
http://www.odni.gov/files/documents/ubl/english/A%20Letter%20to%20the%20Sunnah%20people%20in%20Syria.pdf
http://www.odni.gov/files/documents/ubl/english/Afghani%20Opportunity.pdf
http://www.odni.gov/files/documents/ubl/english/CALL%20FOR%20GUIDANCE%20AND%20REFORM%2013%20April%201994.pdf
http://www.odni.gov/files/documents/ubl/english/Despotism%20of%20Big%20Money.pdf
http://www.odni.gov/files/documents/ubl/english/The%20German%20Economy.pdf
http://www.odni.gov/files/documents/ubl/english/Gist%20of%20conversation%20Oct%2011.pdf
http://www.odni.gov/files/documents/ubl/english/Ideas%20as%20discussion%20with%20the%20sons%20of%20the%20Peninsula.pdf
http://www.odni.gov/files/documents/ubl/english/Instructions%20to%20Applicants.pdf
http://www.odni.gov/files/documents/ubl/english/Jihad%20and%20Reform%20Front%2022%20May%202007.pdf
http://www.odni.gov/files/documents/ubl/english/Lessons%20Learned%20Following%20the%20Fall%20of%20the%20Islamic%20Emirate.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20about%20revolutions.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20Addressed%20to%20Atiyah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20addressed%20to%20Shaykh.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20Ansar%20Al-Sunnah%20Group.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2007%20August%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2009%20August%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2013%20Oct%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2016%20December%202007.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2018%20JUL%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%2021%20May%202007.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%205%20April%202011.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%20March%202008.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20dtd%20November%2024%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Abu%20Abdullah%20to%20his%20Mother%202.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Abu%20Abdullah%20to%20his%20mother.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Al-Zawahiri%20dtd%20August%202003.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Hafiz.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Hamzah%20to%20father%20dtd%20July%202009.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Khalid%20to%20Abd-al-Latif.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Khalid%20to%20Abdullah%20and%20Abu%20al-Harish.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Khalid%20to%20his%20son.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Qari%20early%20April.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20UBL%20to%20Atiyah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20from%20Zamray%20dtd%2007%20August%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20Implications%20of%20Climate%20Change.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20re%20Fatwas%20of%20the%20Permanent%20Committee.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20regarding%20Abu%20al-Hasan.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Abd%20Al-Latif%20dtd%2029%20December%202009.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Abdallah.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Abd%20al%20Rahman.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Abu%20Abdallah%20al-Hajj.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Abu%20Sulayman.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Aunt.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Aunt%20Umm-Khalid.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Badr%20Khan%203%20Dec%202002.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Brother%20Fatimah.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Brother%20from%20Abu%20Abdallah.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20brother%20Hamzah.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Brother%20Ilyas%20al-.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20brother%20Yahya%20-%20Arabic.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20daughter%20Umm-Muadh.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Hakimullah%20Mahsud%20Leader%20of%20the%20Taliban%20Movement.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Hamza.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Islamic%20Emirate%20of%20Afghanistan.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Muhammad%20Aslam%20dtd%2022%20April%202011.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Mujahidin%20in%20Somalia%20dtd%2028%20December%202006.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20my%20beloved%20Brother.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Shaykh%20Abu%20Abdallah%20dtd%2017%20July%202010.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Shaykh%20Abu%20Abdallah%20dtd%202%20September%202009.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Shaykh%20%20Abu%20Yahya.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Shaykh%20Abu%20Yahya%202.pdf
http://www.odni.gov/files/documents/ubl/english2/Letter%20to%20Shaykh%20Abu-al-Layth%20Shaykh%20Abu-Yahya%20Shaykh%20Abdallah%20Said.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Shaykh%20Azmaray%20dtd%204%20February%202008.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Shaykh%20from%20Abu%20Abdallah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Shaykh%20Mahmud.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Shaykh%20Mahmud%2026%20September%202010.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Shaykh%20Mahmud%20and%20Shaykh%20Abu%20Yahya.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20sister%20Um-Abd-al-Rahman.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20sons%20Uthman%20Muhammad%20Hamzah%20wife%20Um%20Hamzah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Special%20Committee%20of%20al-Jihads%20Qaida%20of%20the%20Mujahidin%20Affairs%20in%20Iraq%20and%20to%20the%20Ansar%20al-Sunnah%20Army.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20the%20American%20people.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20UBL%20from%20daughter%20Khadijah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Um%20Abd-al-Rahman%20dtd%2026%20April%202011.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Um%20Abid%20al-Rahman.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Um%20Saad%20from%20aunt%20Um%20Khalid.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Umm%20Khalid%20from%20Sarah.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20Uthman.pdf
http://www.odni.gov/files/documents/ubl/english/Letter%20to%20wife.pdf
http://www.odni.gov/files/documents/ubl/english/Message%20for%20all%20Muslims%20following%20US%20State%20of%20the%20Union%20Address.pdf
http://www.odni.gov/files/documents/ubl/english/Message%20for%20general%20Islamic%20nation.pdf
http://www.odni.gov/files/documents/ubl/english/Message%20for%20Islamic%20Ummah%20in%20general.pdf
http://www.odni.gov/files/documents/ubl/english/Message%20from%20Abu%20Hammam%20al-Ghurayb.pdf
http://www.odni.gov/files/documents/ubl/english/Message%20to%20%20Muslim%20brothers%20in%20Iraq%20and%20to%20the%20Islamic%20nation.pdf
http://www.odni.gov/files/documents/ubl/english/Report%20on%20External%20Operations.pdf
http://www.odni.gov/files/documents/ubl/english2/Request%20for%20Documents%20from%20CTC.pdf
http://www.odni.gov/files/documents/ubl/english/Spreadsheet%202.pdf
http://www.odni.gov/files/documents/ubl/english/Study%20Paper%20about%20the%20Kampala%20Raid%20in%20Uganda.pdf
http://www.odni.gov/files/documents/ubl/english/Suggestion%20to%20End%20the%20Yemen%20Revolution.pdf
http://www.odni.gov/files/documents/ubl/english/Summary%20on%20situation%20in%20Afghanistan%20and%20Pakistan.pdf
http://www.odni.gov/files/documents/ubl/english/Terror%20Franchise.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20letter.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20letter%202.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20Letter%203.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20letter%20from%20Khalid%20Habib.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20letter%20re%20Afghanistan.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20message%20re%20Egypt%20demonstrations.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20statement.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20statement%202.pdf
http://www.odni.gov/files/documents/ubl/english/Undated%20statement%20re%20American%20conversions%20to%20Islam.pdf
http://www.odni.gov/files/documents/ubl/english/Verbally%20Released%20doc%20for%20Naseer%20trial.pdf
http://www.odni.gov/files/documents/ubl/english/Zamrai%20UBL%20letter%20to%20Unis.pdf"""

urls = urls.split('\n')

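import requests
import os.path
for url in urls:
    # the last path component is the (still percent-encoded) filename
    filename = os.path.split(url)[1]
    r = requests.get(url)
    # fail loudly instead of saving an HTML error page as a .pdf
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)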

Those %20s are hard to read; let's convert them back to spaces.

In [95]:
import os
import glob
from urllib.parse import unquote
for filename in glob.glob('*.pdf'):
    # unquote turns percent-escapes such as %20 back into the characters they encode
    os.rename(filename, unquote(filename))
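
The decoding could also happen before anything is written to disk, so no rename pass is needed. A minimal sketch, assuming URLs shaped like the ones listed above; `filename_from_url` is a hypothetical helper name, not part of the original notebook:

```python
import posixpath
from urllib.parse import unquote, urlsplit

def filename_from_url(url):
    """Return the percent-decoded last path component of a URL."""
    path = urlsplit(url).path              # drops scheme, host and any query string
    return unquote(posixpath.basename(path))

print(filename_from_url(
    'http://www.odni.gov/files/documents/ubl/english/Letter%20to%20wife.pdf'))
# Letter to wife.pdf
```

Using `urlsplit` rather than splitting the raw string means a query string or fragment, if one ever appeared, would not end up in the filename.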

2. Convert to text

I'm sure there are perfectly good ways to do this in Python, but I used another program, Calibre, which I know does a good job. Unfortunately, Calibre truncates the filenames of the new .txt files; two sets of files were thus given the same name, so I put (2) in one of the names to differentiate them. I could write a script to rename the files to match the PDFs, but I don't see the point right now.
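
For what it's worth, such a renaming script might start from something like the sketch below. It assumes the truncated .txt names are plain prefixes of the PDF names (which fits names like 'Letter to Shaykh Abu Abdallah d' seen later); `match_truncated_names` is a hypothetical helper, and names edited by hand, such as the one with (2) added, would not match and need separate handling.

```python
def match_truncated_names(txt_stems, pdf_stems):
    """Map each truncated .txt stem to every PDF stem that starts with it."""
    return {stem: [p for p in pdf_stems if p.startswith(stem)]
            for stem in txt_stems}

pdf_stems = ['Letter to Shaykh Abu Abdallah dtd 17 July 2010',
             'Letter to Shaykh Abu Abdallah dtd 2 September 2009',
             'Letter to wife']
txt_stems = ['Letter to Shaykh Abu Abdallah d', 'Letter to wife']

for stem, candidates in match_truncated_names(txt_stems, pdf_stems).items():
    # only a single candidate is safe to rename automatically
    status = 'unique' if len(candidates) == 1 else 'ambiguous, rename by hand'
    print(repr(stem), '->', candidates, '({})'.format(status))
```

The ambiguous case is exactly the one described above: two PDFs collapse to the same truncated name, so a script alone cannot tell which text file came from which PDF.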

3. Make lists of tokens

The data is arranged in a list, documents, containing one dict per document with the following keys:

  • name : the filename (without .txt extension)
  • text : the full text
  • tokens : a list of tokens
In [1]:
import re
import glob
from nltk import word_tokenize
documents = []
for filename in glob.glob('*.txt'):
    # remove .txt extension
    name = filename[:-4]
    collector = {}
    collector['name'] = name
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    # remove page numbers, line breaks, multiple spaces
    # and translators' notes
    # (raw strings keep the regex backslashes from being read as string escapes)
    text = re.sub(r'Page [0-9]{1,2}', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'\(.+?trans.+?\)', '', text)
    text = re.sub(r'\(TN:.+?\)', '', text)
    collector['text'] = text
    tokens = word_tokenize(text.lower())
    # keep any token that has at least one alphabetic character
    # e.g. tokens with apostrophes in them, which happen a lot
    # in Arabic transliterations.
    # there's probably a way to do this with a list comprehension,
    # but sometimes it's best just to code quickly and accurately
    # and let a few lines of code that you'll only ever run
    # once take a bit longer. That's my philosophy, anyway.
    # A lazy man's philosophy. If only I could comment the
    # rest of my code so thoroughly!
    tokens_alpha = []
    for token in tokens:
        if re.search('[a-z]', token):
            tokens_alpha.append(token)
    collector['tokens'] = tokens_alpha
    documents.append(collector)
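
One caveat about the note-stripping patterns in the cell above: `.+?` is lazy but can still run past a closing parenthesis, so a match can start at an earlier, unrelated `(` and swallow everything up to a later note containing "trans". The sketch below is a stricter variation, not the code used for the numbers in this notebook (token counts would differ slightly); `clean_text` and `alpha_tokens` are hypothetical names, and `[^()]*` assumes notes never contain nested parentheses.

```python
import re

def clean_text(text):
    """Strip page numbers and translators' notes, then normalise whitespace."""
    text = re.sub(r'Page [0-9]{1,2}', '', text)
    # [^()]* cannot cross a parenthesis, so each match stays inside one note
    text = re.sub(r'\([^()]*trans[^()]*\)', '', text)
    text = re.sub(r'\(TN:[^()]*\)', '', text)
    # collapse whitespace last, so gaps left by removed notes are closed too
    return re.sub(r'\s+', ' ', text).strip()

def alpha_tokens(tokens):
    """Keep tokens that contain at least one lowercase ASCII letter."""
    return [t for t in tokens if re.search(r'[a-z]', t)]

sample = "Page 3\nPeace be upon you (TN: greeting) and\nGod's  mercy."
print(clean_text(sample))
# Peace be upon you and God's mercy.
print(alpha_tokens(['2010', ',', "al-qar'awi", 'god']))
# ["al-qar'awi", 'god']
```

The second function is the list-comprehension form of the token filter mentioned in the comments above.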

Let's have a look at our data structure, and the first few things in each.

In [2]:
print('DOCUMENT NAME:', documents[0]['name'])
print('TEXT:', documents[0]['text'][:60], '...')
print('TOKENS:', documents[0]['tokens'][:10], '...')
DOCUMENT NAME: 06 Ramadan
TEXT:   The beginning of the decision: The enemy is the Crusader a ...
TOKENS: ['the', 'beginning', 'of', 'the', 'decision', 'the', 'enemy', 'is', 'the', 'crusader'] ...

4. Explore documents and corpus size

Here we'll look at some properties of the corpus as a whole.

First, a histogram of the documents' length. Most of the documents were titled 'Letter to...' or 'Letter from...', so we'd expect them to be short.

In [112]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
print(plt.style.available)

lengths = [len(x['tokens']) for x in documents]
plt.figure(figsize=(10,6))
plt.hist(lengths, bins=20)
plt.title('Lengths of documents in Bin Laden correspondence')
plt.xlabel('Number of tokens/words')
plt.ylabel('Number of documents')
plt.show()
['grayscale', 'dark_background', 'bmh', 'ggplot', 'fivethirtyeight']

What are those two long ones?

In [8]:
for i, length in enumerate(lengths):
    if length > 4000:
        print('DOCUMENT NAME', documents[i]['name'])
        print('TEXT:', documents[i]['text'][:500], '...')
        print('TOKENS:', documents[i]['tokens'][:50], '...')
        print('')
                                      
DOCUMENT NAME A Letter to the Sunnah people i
TEXT:   From the series (and you will uncover the road of the criminals) Report 2 A letter from the Mujahid brother Salih Abdullah Al-Qar’awi May God bless him To the Sunnah people in Syria Abdul ‘Azzam Brigades  Wednesday, 24 November 2010 Production: Al-Fajr Center for information In the name of the Lord the merciful the compassionate Thanks to the Lord who makes the believers and workers victorious. Who made the Sunnah his worshippers {if you make the Lord victorious then he will make you victoriou ...
TOKENS: ['from', 'the', 'series', 'and', 'you', 'will', 'uncover', 'the', 'road', 'of', 'the', 'criminals', 'report', 'a', 'letter', 'from', 'the', 'mujahid', 'brother', 'salih', 'abdullah', 'al-qar’awi', 'may', 'god', 'bless', 'him', 'to', 'the', 'sunnah', 'people', 'in', 'syria', 'abdul', '‘azzam', 'brigades', 'wednesday', 'november', 'production', 'al-fajr', 'center', 'for', 'information', 'in', 'the', 'name', 'of', 'the', 'lord', 'the', 'merciful'] ...

DOCUMENT NAME Letter to Shaykh Abu Abdallah d
TEXT:  In the name of God, the Merciful, the Compassionate To: Our Dear Shaykh, ((Abu 'Abdallah)), may God keep him and look after him and gird him and guide him and instill wisdom in his words and his deeds, that it may fill his heart, amen. Peace be upon you, and God's mercy and His blessings, Praise be to God that you are well. We have received your latest messages on Thursday, 3 Sha'ban , and we had previously prepared that which you are reading now. These are some of the points that we can write  ...
TOKENS: ['in', 'the', 'name', 'of', 'god', 'the', 'merciful', 'the', 'compassionate', 'to', 'our', 'dear', 'shaykh', 'abu', "'abdallah", 'may', 'god', 'keep', 'him', 'and', 'look', 'after', 'him', 'and', 'gird', 'him', 'and', 'guide', 'him', 'and', 'instill', 'wisdom', 'in', 'his', 'words', 'and', 'his', 'deeds', 'that', 'it', 'may', 'fill', 'his', 'heart', 'amen', 'peace', 'be', 'upon', 'you', 'and'] ...

Apparently the correspondence includes both short and long letters. 6000 words is quite a long letter; it would be about 20 pages double-spaced.

From the above we can verify that punctuation was removed, yet words with non-alphabetic characters, like al-qar’awi, were retained.

Now let's put all the tokens together and see what we can see.

In [9]:
all_tokens = []
for doc in documents:
    all_tokens += doc['tokens']
print('There are {} tokens altogether.'.format(len(all_tokens)))
There are 74908 tokens altogether.

Let's do a comparison with some novels' word counts, taken from http://commonplacebook.com/culture/literature/books/word-count-for-famous-novels/

In [78]:
word_counts, novels = zip((36363, "The Lion, the Witch and the Wardrobe"),
        (59635, 'Black Beauty'),
        (67203, 'The Fault in Our Stars'),
        (74908, 'Osama bin Laden correspondence'),
        (77325, "Harry Potter and the Philosopher’s Stone"),
        (95022, "The Hobbit"),
        (109571, "The Adventures of Huckleberry Finn"),
        (174269, "Catch-22"))

import numpy as np
y_pos = np.arange(len(novels))
plt.figure(figsize = (10,7))
plt.barh(y_pos, word_counts, align='center')
plt.yticks(y_pos, novels, fontsize=14)
plt.xlabel('Number of words', fontsize=14)
plt.title('Comparison of size of Bin Laden correspondence', fontsize=17)
plt.show()

Now let's determine the reading level using the Flesch-Kincaid formula -- bearing in mind these are translations! -- and compare it to some famous works.

You can do this in Python, but I'll be lazy and just save all the text to a file and upload it to http://readability-score.com. Comparison scores are from http://countwordsworth.com/statistics/fleschkincaid.
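
For readers who would rather stay in Python, here is a rough sketch of the Flesch-Kincaid grade level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a crude vowel-group heuristic and both function names are made up for illustration, so expect scores to differ somewhat from what a dedicated tool reports.

```python
import re

def count_syllables(word):
    """Rough estimate: one syllable per run of vowels (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r'[aeiouy]+', word))
    # treat a trailing 'e' as silent ('made'), but never go below one syllable
    if word.endswith('e') and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of an English text."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError('text needs at least one sentence and one word')
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade('The cat sat.'), 2))
# -2.62
```

Very short, simple sentences can score below zero, as here; the formula is only meaningful on longer passages.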

In [77]:
# join with a newline so the last word of one document
# doesn't run into the first word of the next
all_txt = '\n'.join(doc['text'] for doc in documents)
with open('all_bin_laden_correspondence', 'w', encoding='utf-8') as f:
    f.write(all_txt)
In [83]:
score, novels = zip((7.6, "Peter Pan"),
        (7.9, 'Anne of Green Gables'),
        (9.2, 'A Christmas Carol'),
        (10, 'Osama bin Laden correspondence'),
        (10.1, "A Connecticut Yankee in King Arthur's Court"),
        (11.6, "Pride and Prejudice"),
        (12.5, "Frankenstein"),
        (13.7, "The Legend of Sleepy Hollow"),
        (16.1, "Gulliver's Travels"),
        (20, "Robinson Crusoe"))

y_pos = np.arange(len(novels))
plt.figure(figsize = (10,7))
plt.barh(y_pos, score, align='center')
plt.yticks(y_pos, novels, fontsize=14)
plt.xlabel('Flesch-Kincaid grade level', fontsize=14)
plt.title('Comparison of reading level of Bin Laden correspondence\n(using Flesch-Kincaid grade level)', fontsize=17)
plt.show()

5. Explore vocabulary

Always bearing in mind that these are translations, let's see OBL's word usage using the NLTK library.

In [108]:
# First let's get a list of stopwords to remove, like 'the', 'and', etc.
from nltk.corpus import stopwords
# a set makes the membership tests below much faster than a list
stop = set(stopwords.words('english'))

# Now build a frequency distribution of non-stopwords
from nltk import FreqDist
freqdist = FreqDist([x for x in all_tokens if x not in stop])

# and determine the 40 most common

words, freqs = zip(*freqdist.most_common(40))
# reverse lists so that longest bars will be topmost
words = words[::-1]
freqs = freqs[::-1]

# and graph them
y_pos = np.arange(len(words))
plt.figure(figsize = (10,20))
plt.barh(y_pos, freqs, align='center')
plt.yticks(y_pos, words, fontsize=14)
plt.xlabel('Frequency of use', fontsize=14)
plt.title('Most common words in Osama Bin Laden correspondence\n(stopwords like \'the\' and \'and\' removed)', fontsize=17)

plt.show()

Now let's do the same thing with bigrams (two-word phrases).

In [86]:
from nltk import bigrams as nltk_bigrams
bigrams = list(nltk_bigrams(all_tokens))
In [110]:
freqdist = FreqDist([' '.join(x) for x in bigrams if (x[0] not in stop and x[1] not in stop)])

# and determine the 40 most common

words, freqs = zip(*freqdist.most_common(40))
# reverse lists so that longest bars will be topmost
words = words[::-1]
freqs = freqs[::-1]

# and graph them
y_pos = np.arange(len(words))
plt.figure(figsize = (10,20))
plt.barh(y_pos, freqs, align='center')
plt.yticks(y_pos, words, fontsize=14)
plt.xlabel('Frequency of use', fontsize=14)
plt.title('Most common bigrams in Osama Bin Laden correspondence\n(stopwords like \'the\' and \'and\' removed)', fontsize=17)
plt.show()
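
A side note on the bigram cell: because all_tokens is one flat list, a few bigrams pair the last word of one document with the first word of the next. A minimal sketch of counting bigrams per document instead, using only the standard library; `bigram_counts` and the toy data are made up for illustration:

```python
from collections import Counter

def bigram_counts(token_lists, stopwords=frozenset()):
    """Count two-word phrases within each document, skipping any that contain a stopword."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor; pairs never span two documents
        for first, second in zip(tokens, tokens[1:]):
            if first not in stopwords and second not in stopwords:
                counts[first + ' ' + second] += 1
    return counts

toy_docs = [['praise', 'be', 'to', 'god'],
            ['god', 'willing', 'god', 'willing']]
counts = bigram_counts(toy_docs, stopwords={'be', 'to'})
print(counts.most_common(2))
# [('god willing', 2), ('willing god', 1)]
```

With the flat-list approach the toy data would also yield a spurious 'god god' bigram across the document boundary. On a corpus of this size the effect on the top 40 is probably negligible.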

6. Topic modeling

For those unfamiliar with the concept, topic modeling is the use of automated methods to determine what words best sort the documents into different topics. You can spend hours and hours and hours and hours on this; I'd like to get this done within 24 hours of the publication of these texts, so I'll just do a simple NMF analysis liberally cribbed from http://derekgreene.com/nmf-topic/.

The number of topics is arbitrary; it's good to choose a low enough number that there will be a non-subtle differentiation between topics. I went with 5.

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition

from nltk.stem import WordNetLemmatizer
lemmatize = WordNetLemmatizer()

# lemmatize() treats every token as a noun by default, which is why
# 'was' and 'has' show up as 'wa' and 'ha' in the topics
bag_of_words = []
for doc in documents:
    bag_of_words.append(' '.join([lemmatize.lemmatize(token) for token in doc['tokens']]))

# stop_words='english' selects scikit-learn's built-in ENGLISH_STOP_WORDS list;
# the sklearn.feature_extraction.stop_words module is gone from recent releases
tfidf = TfidfVectorizer(stop_words='english', lowercase=True, strip_accents="unicode", use_idf=True, norm="l2", min_df=5)
matrix = tfidf.fit_transform(bag_of_words)

num_terms = len(tfidf.vocabulary_)
terms = [""] * num_terms
for term in tfidf.vocabulary_.keys():
    terms[ tfidf.vocabulary_[term] ] = term

model = decomposition.NMF(init="nndsvd", n_components=5, max_iter=200)
W = model.fit_transform(matrix)
H = model.components_

for topic_index in range( H.shape[0] ):
    top_indices = np.argsort( H[topic_index,:] )[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
Topic 0: god, peace, dear, praise, sister, blessing, mercy, willing, letter, prayer
Topic 1: al, shaykh, brother, abu, letter, wa, mahmud, muhammad, god, informed
Topic 2: allah, brother, mercy, ask, wa, al, praise, father, child, know
Topic 3: god, said, people, ha, crusader, jihad, nation, war, ye, islam
Topic 4: revolution, people, muslim, regime, ummah, egypt, opportunity, blood, ruler, wa
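
To make the TF-IDF step above less of a black box, here is a pure-Python sketch of the weighting as I understand scikit-learn's defaults: raw term counts, idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each document row. Treat that formula as an assumption to check against the scikit-learn documentation; `tfidf_rows` and the toy documents are hypothetical, and tokenisation, stopword removal and accent stripping are left out.

```python
import math
from collections import Counter

def tfidf_rows(docs, min_df=1):
    """Return (vocabulary, rows), each row a dict of L2-normalised TF-IDF weights."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in vocab if tf[t]}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows

toy_docs = [['god', 'peace'],
            ['god', 'revolution'],
            ['god', 'peace', 'peace']]
vocab, rows = tfidf_rows(toy_docs, min_df=2)
print(vocab)
# ['god', 'peace']
print(rows[0])
```

Here 'revolution' is dropped by min_df=2 because it appears in only one document, and 'peace' outweighs 'god' in the first row because 'god' appears everywhere and so carries less information. NMF then factorises the full matrix of such rows into document-topic weights (W) and topic-term weights (H).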