After executing the code cell below, you can see further details for your devices in the Jupyter Console.
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_devices()
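On newer TensorFlow releases, the same information can also be read through the tf.config API. The snippet below is an optional sketch, not something this tutorial requires; it assumes a build where tf.config (or tf.config.experimental on 1.x) is available.

import tensorflow as tf

# Optional alternative: list GPUs through the tf.config API
# (tf.config.list_physical_devices exists on TF 2.1+; older builds expose it under tf.config.experimental)
try:
    gpus = tf.config.list_physical_devices('GPU')
except AttributeError:
    gpus = tf.config.experimental.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', gpus)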
from __future__ import print_function
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras_text_summarization.library.utility.plot_utils import plot_and_save_history
from keras_text_summarization.library.seq2seq import Seq2SeqSummarizer
from keras_text_summarization.library.applications.fake_news_loader import fit_text
Using TensorFlow backend.
LOAD_EXISTING_WEIGHTS = True
np.random.seed(42)
data_dir_path = './L4_data/data'
report_dir_path = './L4_data/reports'
model_dir_path = './L4_data/models'
We will use a provided news dataset that contains articles and titles from various news sources.
This data is pre-processed by the custom functions in the 'keras_text_summarization' folder.
# Load CSV into DataFrame
print('Loading CSV . . .')
df = pd.read_csv(data_dir_path + "/news.csv")
# Extract text for configuration
print('Extracting for config . . . ')
Y = df.title
X = df['text']
config = fit_text(X, Y)
print('-> Complete')
Loading CSV . . .
Extracting for config . . .
-> Complete
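Before building the model, it can be helpful to peek at what fit_text produced. The sketch below assumes config is a plain Python dict; the exact keys depend on the keras_text_summarization implementation, so treat this as an optional inspection step rather than part of the tutorial.

# Inspect the configuration produced by fit_text
# (assumed to be a plain dict; the exact keys depend on the library)
for key, value in config.items():
    if isinstance(value, dict):
        print(key, ': <dict with', len(value), 'entries>')
    else:
        print(key, ':', value)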
...there are two different approaches for automatic summarization currently:
Extraction and Abstraction.
Extractive summarization methods work by identifying important sections of the text and generating them verbatim;
...Abstractive summarization methods aim at producing important material in a new way. In other words, they interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text.
- [Text Summarization Techniques: A Brief Survey, 2017](https://arxiv.org/abs/1707.02268)
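To make the distinction concrete, here is a minimal, hypothetical sketch of the extractive approach; it is not part of this tutorial's library. It simply scores sentences by word frequency and returns the top-scoring ones verbatim, whereas the seq2seq model we train below is abstractive and generates new wording.

from collections import Counter

def naive_extractive_summary(text, num_sentences=2):
    # Split into rough sentences and count word frequencies across the whole text
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    word_counts = Counter(w.lower() for s in sentences for w in s.split())
    # Score each sentence by the frequency of the words it contains
    ranked = sorted(sentences, key=lambda s: sum(word_counts[w.lower()] for w in s.split()), reverse=True)
    top = set(ranked[:num_sentences])
    # Return the selected sentences verbatim, in their original order
    return '. '.join(s for s in sentences if s in top) + '.'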
summarizer = Seq2SeqSummarizer(config)
# Set LOAD_EXISTING_WEIGHTS to False above to start fresh!
if LOAD_EXISTING_WEIGHTS:
    summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)
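As a quick, optional sanity check on the split: with test_size=0.2 roughly 80% of the rows should land in the training set.

# Optional sanity check on the 80/20 split
print('Training samples:', len(Xtrain))
print('Test samples:', len(Xtest))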
With the data split into training and test sets - let's start training our model!
# Optional TF Device Selection (the code below must be indented under the with-block)
with tf.device('/GPU:0'):
    history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=100, batch_size=5, model_dir_path=model_dir_path)
    history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history.png'
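If you are not sure a GPU is present, you can pick the device string dynamically instead of hard-coding '/GPU:0'. This is a sketch that reuses the get_available_devices() helper defined earlier; it is an alternative to the cell above, not an addition to it.

# Optional: fall back to the CPU when no GPU is visible
device_name = '/GPU:0' if any('GPU' in name for name in get_available_devices()) else '/CPU:0'
print('Training on', device_name)
with tf.device(device_name):
    history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=100, batch_size=5, model_dir_path=model_dir_path)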
AI/Hub Team Members can also use 'The Beast' to run this training code at a faster rate!
An informational document on using The Beast is being created; it will be available on the ORSIE AI/Hub Internal Site once it has been completed!
Please ask your Lead Researcher for more information.
However, you can still test the current model locally, even with limited training!
(. . . Mind the results)
- 'history' is created on completion of the summarizer.fit() function
- If you manually stop the training, you will not be able to run this cell!
if LOAD_EXISTING_WEIGHTS:
    history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history-v' + str(summarizer.version) + '.png'

# Plot and Save History
plot_and_save_history(history, summarizer.model_name, history_plot_file_path, metrics={'loss', 'acc'})
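If you prefer not to use the helper, the value returned by summarizer.fit() can also be plotted directly with matplotlib. A minimal sketch, assuming history is a standard Keras History object whose history dict contains 'loss' and 'val_loss':

import matplotlib.pyplot as plt

# Plot training vs. validation loss directly from the Keras History object
# (assumes the 'loss' and 'val_loss' keys are present)
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.savefig(history_plot_file_path)
plt.show()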
# Set the random seed for reproducibility
np.random.seed(42)

# Define Directory Paths
data_dir_path = './L4_data/data'      # refers to the L4_data/data folder
model_dir_path = './L4_data/models'   # refers to the L4_data/models folder
# Load CSV from Directory
print('Loading CSV . . .')
df = pd.read_csv(data_dir_path + "/news.csv")
# Assign dataframe text and title to X and Y values
print('Extracting features . . .')
X = df['text']
Y = df.title
print('-> Complete')
Loading CSV . . .
Extracting features . . .
-> Complete
# Load the stored model configuration using np.load()
config = np.load(Seq2SeqSummarizer.get_config_file_path(model_dir_path=model_dir_path)).item()
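Note: on NumPy 1.16.3 and later, np.load() refuses to load pickled object arrays by default. If the cell above raises a ValueError about allow_pickle, the variant below should work:

# NumPy >= 1.16.3 requires allow_pickle=True when loading a pickled object array
config = np.load(Seq2SeqSummarizer.get_config_file_path(model_dir_path=model_dir_path),
                 allow_pickle=True).item()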
# Re-Initialize the model using the stored configuration
summarizer = Seq2SeqSummarizer(config)
# Load the stored weights into the model
summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))
# Print predicted headlines along with their original title
print('Predicting Headlines . . .')
for i in range(10):
    x = X[i]
    actual_headline = Y[i]
    headline = summarizer.summarize(x)
    print('\n', 'Original: ', actual_headline)
    # print('Article: ', x)
    print('Generated: ', headline)
print('\n', '-> Complete')
Predicting Headlines . . .

Original:  You Can Smell Hillary’s Fear
Generated:  clinton campaign biggest national are - the onion - america's finest news source

Original:  Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)
Generated:  the trump is what trump's rick of gop debate

Original:  Kerry to go to Paris in gesture of sympathy
Generated:  not to back to back at least time

Original:  Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!'
Generated:  the gop debate on the party is in against trump is a bit to twitter

Original:  The Battle of New York: Why This Primary Matters
Generated:  the battle of new why why many could go to win

Original:  Tehran, USA
Generated:  john obama: political top to daily

Original:  Girl Horrified At What She Watches Boyfriend Do After He Left FaceTime On
Generated:  of be hillary’s why trump’s campaign in 2016

Original:  ‘Britain’s Schindler’ Dies at 106
Generated:  re: clinton’s email and coming

Original:  Fact check: Trump and Clinton at the 'commander-in-chief' forum
Generated:  is republicans the jeb bush director up the gop debate in the

Original:  Iran reportedly makes new push for uranium concessions in nuclear talks
Generated:  election is coming in the world war iii - the onion - america's finest news source

-> Complete
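You can also feed the model any text of your own. A tiny sketch (the article string below is a placeholder; expect rough output unless the model has been trained for a while):

# Summarize an arbitrary article (placeholder text; replace with your own)
custom_article = "Paste the body of any news article here to see what headline the model generates."
print('Generated: ', summarizer.summarize(custom_article))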
This tutorial showed how to generate headlines for news articles of various lengths using a Keras sequence-to-sequence (seq2seq) text summarizer.
These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.
You may want to back up this Notebook before making any changes.
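For example, one simple exercise is to change the training schedule and compare the resulting history plots. The values below are hypothetical starting points, not recommendations:

# Hypothetical exercise: retrain with a different schedule and compare the loss curves
history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=20, batch_size=16, model_dir_path=model_dir_path)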