In [44]:

# Functional Progamming: Text Summarization Using SumBasic Method and NLKT

import wikipedia
import string
from nltk.tokenize import sent_tokenize, regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.probability import FreqDist


def data(*args):
    contents = []
    for arg in args:
        contents.append((arg, wikipedia.page(arg).content))
    return contents

def sentence_tokenizer(data):
    header, datum = data
    print(header)
    print()
    return sent_tokenize(datum)

def word_tokenizer(sentence_tokenizer):
    
    word_tokens = []
    sentence_tokens = {sentence: [] for sentence in sentence_tokenizer}
    
    for one_sentence in sentence_tokenizer:
        for token in regexp_tokenize(one_sentence.lower(), '\w+'):
            if token not in string.punctuation:
                if token not in stopwords.words('english'):
                    word_tokens.append(token)
                    sentence_tokens[one_sentence].append(token)
    return sentence_tokens, word_tokens
    

def lemmatizer(word_tokenizer):
    sentence_tokens, word_tokens = word_tokenizer
    lem = WordNetLemmatizer()
    
    lem_words = [lem.lemmatize(word) for word in word_tokens]
    lem_sentences = {sentence: [lem.lemmatize(word) for word in sentence_tokens[sentence]] for sentence in sentence_tokens}
    return lem_sentences, lem_words

def sumbasic(lemmatizer):
    lem_sentences, lem_words = lemmatizer
    
    freq = FreqDist(lem_words)
    total = sum(freq.values())
    probs = {k: v/total for k, v in freq.items()}
    
    len_summary = int(0.1 * len(lem_sentences))
    
    summary = []
    
    for _ in range(len_summary):
        
        scores = {k: [] for k in lem_sentences}
        importance = {k: 0 for k in scores}
        for key, value in lem_sentences.items():
            for word in value:
                scores[key].append(probs[word])
            importance[key] = sum(scores[key]) / len(scores[key])         
            
        most_importance_sentence = max(scores, key=scores.get)
        summary.append(most_importance_sentence)
        
        for word in lem_sentences[most_importance_sentence]:
            probs[word] = probs[word] * probs[word]
            
    for sentence in lem_sentences:
        if sentence in summary:
            print(sentence)

for i, datum in enumerate(data('Functional programming', 'Automatic summarization')):
    sumbasic(lemmatizer(word_tokenizer(sentence_tokenizer(datum))))
    if i == 0:
        print('*'*100, sep='\n')
    

Functional programming

In functional programming, functions are treated as first-class citizens, meaning that they can be bound to names (including local identifiers), passed as arguments, and returned from other functions, just as any other data type can.
Lambda calculus forms the basis of all functional programming languages.
=== Pure functions ===

Pure functions (or expressions) have no side effects (memory or I/O).
Some compilers, such as gcc, add extra keywords for a programmer to explicitly mark external functions as pure, to enable such optimizations.
Tail recursion optimization can be implemented by transforming the program into continuation passing style during compiling, among other approaches.
Common patterns of recursion can be abstracted away using higher-order functions, with catamorphisms and anamorphisms (or "folds" and "unfolds") being the most obvious examples.
Such recursion schemes play a role analogous to built-in control structures such as loops in imperative languages.
=== Strict versus non-strict evaluation ===

Functional languages can be categorized by whether they use strict (eager) or non-strict (lazy) evaluation, concepts that refer to how function arguments are processed when an expression is being evaluated.
Lazy evaluation does not evaluate function arguments unless their values are required to evaluate the function call itself.
While these languages are mainly of interest in academic research (including in formalized mathematics), they have begun to be used in engineering as well.
For example, the array with constant access and update times is a basic component of most imperative languages, and many imperative data-structures, such as the hash table and binary heap, are based on arrays.
Purely functional data structures have persistence, a property of keeping previous versions of the data structure unmodified.
For programs that perform intensive numerical computations, functional languages such as OCaml and Clean are only slightly slower than C according to The Computer Language Benchmarks Game.
However, the most general implementations of lazy evaluation making extensive use of dereferenced code and data perform poorly on modern processors with deep pipelines and multi-level caches (where a cache miss may cost hundreds of cycles).
Python had support for "lambda", "map", "reduce", and "filter" in 1994, as well as closures in Python 2.2, though Python 3 relegated "reduce" to the functools standard library module.
== Applications ==

=== Academia ===
Functional programming is an active area of research in the field of programming language theory.
There are several peer-reviewed publication venues focusing on functional programming, including the International Conference on Functional Programming, the Journal of Functional Programming, and the Symposium on Trends in Functional Programming.
Haskell, though initially intended as a research language, has also been applied by a range of companies, in areas such as aerospace systems, hardware design, and web programming.Other functional programming languages that have seen use in industry include Scala, F#, Wolfram Language, Lisp, Standard ML, and Clojure.Functional "platforms" have been popular in finance for risk analytics (particularly with the larger investment banks).
Some use it as their introduction to programming, while others teach it after teaching imperative programming.Outside of computer science, functional programming is being used as a method to teach problem solving, algebra and geometric concepts.
Higher-Order Perl.
****************************************************************************************************
Automatic summarization

Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
This problem is called multi-document summarization.
Image collection summarization is another application example of automatic summarization.
Video summarization is a related domain, where the system automatically creates a trailer of a long video.
These algorithms model notions like diversity, coverage, information and representativeness of the summary.
Keyphrases have many applications.
Using the known keyphrases, we can assign positive or negative labels to the examples.
The two measures can be combined in an F-score, which is the
harmonic mean of the two (F = 2PR/(P + R) ).
Thus, recall may suffer.
In the case of Turney's GenEx algorithm, a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm.
While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
Edges are created based on word co-occurrence in this application of TextRank.
As a result, potentially more or less than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text.
One way to think about it is the following.
A word that appears multiple times throughout a text may have many different co-occurring neighbors.
=== Document summarization ===
Like keyphrase extraction, document summarization aims to identify the essence of a text.
Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner.
First summarizes that perform adaptive summarization have been created.
==== TextRank and LexRank ====
The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data.
Then the sentences can be ranked with regard to their similarity to this centroid sentence.
The task remains the same in both cases—only the number of sentences to choose from has grown.
Each article is likely to have many similar sentences, and you would only want to include distinct ideas in the summary.
The methods are domain-independent and easily portable.
The set cover function attempts to find a subset of objects which cover a given set of concepts.
All these important models encouraging coverage, diversity and information are all submodular.
Similarly, work by Lin and Bilmes, 2011, shows that many existing systems for automatic summarization are instances of submodular functions.
Submodular Functions have also successfully been used for summarizing machine learning datasets.
== Evaluation techniques ==
The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.
For example, automatic summarization research on medical text generally attempts to utilize the various sources of codified medical knowledge and ontologies.
The Use of Topic Segmentation for Automatic Summarization.