NOTEBOOK 19

Text-To-Text Transfer Transformer (T5)

T5, the Text-To-Text Transfer Transformer, is a versatile language model developed by Google Research. It was introduced in 2019 by Colin Raffel et al. in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". The model builds on the Transformer architecture, which has become the de facto standard for many natural language processing (NLP) tasks.

Design and Philosophy

T5 takes a unified approach to natural language processing: it treats every NLP task as a "text-to-text" problem. Whether the task is translation, summarization, sentiment classification, or anything else, it is formulated so that both the input and the output are text sequences. For example:

Translation: The input is text in one language, and the output is text in another language.
Summarization: The input is a long document, and the output is its summary.
Sentiment classification: The input is a review, and the output is a sentiment label such as "positive" or "negative".
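
To make the text-to-text idea concrete, here is a minimal sketch of how examples from different tasks can be turned into plain (input text, output text) pairs. The helper `to_text_pair`, the `TASK_PREFIXES` table, and the "sentiment" prefix are assumptions made for illustration; they are not part of T5 or of any library.

```python
# Illustrative task prefixes. "sentiment" is a made-up prefix for this sketch.
TASK_PREFIXES = {
    "translation": "translate English to French",
    "summarization": "summarize",
    "sentiment": "sentiment",
}

def to_text_pair(task, source, target):
    """Format one example as an (input text, output text) pair."""
    prefix = TASK_PREFIXES[task]
    return f"{prefix}: {source}", target

examples = [
    to_text_pair("translation", "The house is small.", "La maison est petite."),
    to_text_pair("summarization", "A long document ...", "A short summary."),
    to_text_pair("sentiment", "I loved this film.", "positive"),
]

for model_input, model_output in examples:
    print(repr(model_input), "->", repr(model_output))
```

Whatever the task, the model only ever sees a string going in and a string coming out; the prefix is what tells it which task to perform.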

Architecture

T5 is an encoder-decoder model built on the Transformer architecture:

Encoder: Converts the input text into a sequence of intermediate representations (contextual embeddings) that capture the context and meaning of the text.
Decoder: Uses the encoder's representations, together with the output generated so far, to produce the output text one token at a time.
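
The interaction between the two blocks can be sketched with a toy generation loop. This is not T5's implementation: `toy_encoder` and `toy_decoder_step` are hypothetical stand-ins (the "decoder" simply upper-cases the input), and the point is only to show that each step sees the encoder output plus everything generated so far.

```python
EOS = "</s>"

def toy_encoder(tokens):
    # Stand-in for the encoder: a real one returns contextual vectors.
    return list(tokens)

def toy_decoder_step(memory, generated):
    # Stand-in for the decoder: looks at the encoder output and at what
    # has been generated so far, then proposes the next token.
    position = len(generated)
    if position < len(memory):
        return memory[position].upper()
    return EOS

def greedy_decode(tokens, max_length=10):
    """Generate tokens one at a time until EOS or max_length is reached."""
    memory = toy_encoder(tokens)  # the encoder runs once
    generated = []
    while len(generated) < max_length:
        next_token = toy_decoder_step(memory, generated)
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated

print(greedy_decode(["the", "beatles"]))  # ['THE', 'BEATLES']
```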

Pretraining

T5 was pretrained on a large, diverse dataset called the "Colossal Clean Crawled Corpus" (C4), a cleaned and filtered subset of Common Crawl.

Training phases

T5 is trained in two phases:

Pretraining: The model learns to understand and generate text in general from large amounts of unlabeled text.
Fine-tuning: The model is adapted to specific NLP tasks using smaller labeled datasets. This is where the "text-to-text" approach pays off: adapting the model to a new task only requires changing the format of the input and output data.
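
The pretraining objective described in the T5 paper is span corruption: spans of the input are replaced with sentinel tokens, and the model learns to produce the missing spans. The sketch below illustrates the input/target format only. The function `span_corrupt` is hypothetical, and the spans are chosen by hand here, whereas real pretraining samples them at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # followed by the hidden tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because both sides are plain text, this objective fits the same text-to-text format that fine-tuning uses later.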

Variants

The Hugging Face Transformers library offers T5 in several sizes, each with a different trade-off between speed, memory usage, and accuracy. These are the available variants:

T5 Small
- Parameters: Approximately 60 million.
- Use: Suited to resource-constrained applications and quick proofs of concept.
T5 Base
- Parameters: Approximately 220 million.
- Use: A good balance between performance and size, suitable for many production applications.
T5 Large
- Parameters: Approximately 770 million.
- Use: For tasks that need higher accuracy when more compute is available.
T5 3B
- Parameters: Approximately 3 billion.
- Use: For scenarios where accuracy is critical and the infrastructure can handle large models.
T5 11B
- Parameters: Approximately 11 billion.
- Use: Extremely large; used mainly in research and in situations that need the model's maximum capabilities.

How to choose the right size

The right model size depends on several factors:

Available resources: More parameters generally require more memory and compute.
Task requirements: More complex tasks can benefit from larger models.
Latency: Smaller models respond faster, which is crucial for real-time applications.
Cost: Training and inference with larger models cost more in compute and time.

All of these models can be loaded directly through Hugging Face Transformers, which makes it easy to experiment with them on a wide range of NLP tasks.
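
As a rough guide to the resource question, the memory needed just to hold the weights can be estimated from the parameter count. The sketch below uses the approximate counts from the list of variants above and assumes 4 bytes per parameter for float32 and 2 for float16. It ignores activations, gradients, and optimizer state, so real usage during training is considerably higher.

```python
# Approximate parameter counts, taken from the list of variants above.
APPROX_PARAMS = {
    "t5-small": 60e6,
    "t5-base": 220e6,
    "t5-large": 770e6,
    "t5-3b": 3e9,
    "t5-11b": 11e9,
}

def weights_size_gib(num_params, bytes_per_param=4):
    """Memory needed just to hold the weights (4 bytes each for float32)."""
    return num_params * bytes_per_param / 1024**3

for name, num_params in APPROX_PARAMS.items():
    fp32 = weights_size_gib(num_params)
    fp16 = weights_size_gib(num_params, bytes_per_param=2)
    print(f"{name:9s} float32: {fp32:6.2f} GiB   float16: {fp16:6.2f} GiB")
```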

Usage examples with Hugging Face Transformers

Here is an example of how to load T5 with Hugging Face Transformers and use it to summarize text:

In [1]:

from transformers import T5ForConditionalGeneration, T5Tokenizer

def summarize(text, model_name="t5-base", task="summarize"):
    # Load the tokenizer and the model
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Build the input: the task prefix tells the model what to do
    input_text = f"{task}: {text}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")

    # Generate the output
    outputs = model.generate(input_ids, max_length=100)
    summarized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return summarized_text

# Summarization example
text = """
"The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison and Ringo Starr. They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and the recognition of popular music as an art form.
"""
print("Summary:", summarize(text, task="summarize"))  # <-- Note the task argument: it determines which task the model performs

Summary: the Beatles were an english rock band formed in 1960. they are regarded as the most influential band of all time. they were integral to the development of 1960s counterculture.

Now let's translate from English to French with T5:

In [2]:

from transformers import T5ForConditionalGeneration, T5Tokenizer

def translate(text, model_name="t5-base", task="translate English to French"):
    # Load the tokenizer and the model
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Build the input: the task prefix selects the language pair
    input_text = f"{task}: {text}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")

    # Generate the output
    outputs = model.generate(input_ids, max_length=100)
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

# English-to-French translation example
text_en_to_fr = "The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison and Ringo Starr."
print("English to French:", translate(text_en_to_fr, task="translate English to French"))
text_en_to_fr = "They are regarded as the most influential band of all time and were integral to the development of 1960s counterculture and the recognition of popular music as an art form."
print("English to French:", translate(text_en_to_fr, task="translate English to French"))

English to French: Les Beatles sont un groupe rock anglais formé à Liverpool en 1960, composé de John Lennon, Paul McCartney, George Harrison et Ringo Starr.

English to French: Ils sont considérés comme le groupe le plus influent de tous les temps et ont joué un rôle essentiel dans le développement de la contreculture des années 1960 et la reconnaissance de la musique populaire comme forme d'art.

Example: Transcribing numbers to text

T5 can be used directly through the Hugging Face transformers library, which provides high-level APIs to load the model, tokenize text, and generate predictions. This makes it relatively straightforward to build advanced NLP solutions on top of T5.

Let's implement an example that shows how T5 works and how to use it for our own NLP tasks. We will use T5 Base to transcribe a number written in digits into English words. For example, for the number "123" the transcription would be "one hundred twenty three". We work in English rather than Spanish to take advantage of T5's ability to work with English text and because, for translation tasks, the model was only trained on German, French, and Romanian in addition to English.

We start by importing the libraries needed to build the dataset.

In [1]:

from datasets import load_dataset, DatasetDict

# Load the dataset from a CSV file
dataset = load_dataset('csv', data_files='data/numbers.csv')

# The dataset has no train/test split, so we create one manually
train_test_split = dataset['train'].train_test_split(test_size=0.1)  # 90% train, 10% test

dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': train_test_split['test']
})

Let's look at an arbitrary example from the dataset. Note that the input is stored as an integer, not as text.

In [2]:

dataset['train'][42]

Out[2]:

{'input_text': 103093641,
 'output_text': 'one hundred three million ninety three thousand six hundred forty one'}
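
The notebook does not show how data/numbers.csv was built. The sketch below is one hypothetical way to generate such a file: the functions `three_digits_to_words` and `number_to_words` are assumptions, written to reproduce the format of the example above (no hyphens, no "and"), and the column names match the ones in the dataset.

```python
import csv
import io
import random

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion"]

def three_digits_to_words(n):
    """Spell out 0 <= n < 1000 as a list of words (empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens])
        if ones:
            words.append(ONES[ones])
    elif rest:
        words.append(ONES[rest])
    return words

def number_to_words(n):
    """Spell out 0 <= n < 10**12 without hyphens or 'and'."""
    if n == 0:
        return "zero"
    groups = []  # groups of three digits, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    words = []
    for index in reversed(range(len(groups))):
        if groups[index]:
            words += three_digits_to_words(groups[index])
            if SCALES[index]:
                words.append(SCALES[index])
    return " ".join(words)

# Write a few random rows in CSV form (to a string buffer here, not to disk).
random.seed(0)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input_text", "output_text"])
for _ in range(3):
    number = random.randint(0, 999_999_999)
    writer.writerow([number, number_to_words(number)])
print(buffer.getvalue())
```

Calling `number_to_words(103093641)` returns the same string as the `output_text` shown above.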

We will fine-tune a pretrained T5 Base model for this task.

In [3]:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')

Recall that the input in the dataset is stored as an integer. To use it with T5, we need to convert it to text. We also want to prefix each example with the specific task we want the model to perform, in this case "number to text". The add_task function does both. Note that numbers can be either a single integer or a list of integers.

In [4]:

def add_task(numbers):
    if isinstance(numbers, int):
        return "number to text: " + str(numbers)
    else:
        res = []
        for number in numbers:
            text = str(number)
            text = "number to text: " + text
            res.append(text)
        return res

Let's see what a few examples look like after applying add_task.

In [5]:

add_task([123456789, 111111111, 987654321])

Out[5]:

['number to text: 123456789',
 'number to text: 111111111',
 'number to text: 987654321']

In [6]:

def preprocess_function(examples):
    # Turn the integers into prompts such as "number to text: 123"
    prompts = add_task(examples['input_text'])
    # Tokenize inputs and targets, padding or truncating to fixed lengths
    text_input = tokenizer(prompts, truncation=True, padding="max_length", max_length=15)
    labels = tokenizer(examples['output_text'], truncation=True, padding="max_length", max_length=32)

    return {
        'input_ids': text_input['input_ids'],
        'labels': labels['input_ids']
    }

Let's compare a batch of examples before and after applying preprocess_function.

In [7]:

print(dataset['train'][:3])
print("------------------------------")
print(preprocess_function(dataset['train'][:3]))

{'input_text': [330237353, 517569757, 313407361], 'output_text': ['three hundred thirty million two hundred thirty seven thousand three hundred fifty three', 'five hundred seventeen million five hundred sixty nine thousand seven hundred fifty seven', 'three hundred thirteen million four hundred seven thousand three hundred sixty one']}
------------------------------
{'input_ids': [[381, 12, 1499, 10, 3, 17225, 2773, 4552, 4867, 1, 0, 0, 0, 0, 0], [381, 12, 1499, 10, 305, 2517, 4834, 4327, 3436, 1, 0, 0, 0, 0, 0], [381, 12, 1499, 10, 2664, 21129, 4552, 4241, 1, 0, 0, 0, 0, 0, 0]], 'labels': [[386, 6189, 12010, 770, 192, 6189, 12010, 2391, 7863, 386, 6189, 18358, 386, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [874, 6189, 30552, 770, 874, 6189, 27757, 4169, 7863, 2391, 6189, 18358, 2391, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [386, 6189, 27255, 770, 662, 6189, 2391, 7863, 386, 6189, 27757, 80, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
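
In the output above, the labels are padded with 0, which is T5's pad token id. Hugging Face models skip label positions set to -100 when computing the loss, so a common refinement is to replace the padding in the labels with -100 so that the model is not trained to predict padding. The helper below is a sketch of that step on plain lists of token ids; `mask_label_padding` is a hypothetical name and is not used in this notebook.

```python
PAD_TOKEN_ID = 0      # T5's pad token id, visible as the trailing zeros above
IGNORE_INDEX = -100   # label value skipped by the loss in Hugging Face models

def mask_label_padding(labels, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in labels
    ]

labels = [[386, 6189, 1, 0, 0], [874, 1, 0, 0, 0]]
print(mask_label_padding(labels))
# [[386, 6189, 1, -100, -100], [874, 1, -100, -100, -100]]
```

Inside preprocess_function this could be applied to labels['input_ids'] before returning. Returning text_input['attention_mask'] as well would let the encoder ignore the padding on the input side.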

In [8]:

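# Tokenize a few number words to see how they are split into subword tokens
tokens = tokenizer("eighteen seventeen sixteen")

for token_id in tokens['input_ids']:
    print(tokenizer.decode([token_id]))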

eight
e
en
seventeen
sixteen
</s>

Now we load the pretrained T5 Base model and fine-tune it on the number-to-text task.

In [ ]:

from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer

model = T5ForConditionalGeneration.from_pretrained('t5-base')

training_args = TrainingArguments(
    output_dir='./results',  # output directory
    num_train_epochs=3,  # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir='./logs',  # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=dataset['train'].map(preprocess_function, batched=True),  # training dataset
    eval_dataset=dataset['test'].map(preprocess_function, batched=True),  # evaluation dataset
)

trainer.train()

Once training has finished, let's run a quick test.

In [17]:

# Move the input to the same device as the model (the original run used 'mps')
input_ids = tokenizer("number to text: 1000", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids, max_length=50, num_beams=1)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

one hundred one million
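
A single prompt says little about overall quality. One simple way to measure it, sketched below under the assumption that predictions and references are available as lists of strings, is exact-match accuracy. The function `exact_match_accuracy` is hypothetical, and the second pair of strings is made up for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# First pair: the prediction printed above versus the expected transcription of 1000.
# Second pair: a made-up correct prediction, for illustration only.
predictions = ["one hundred one million", "one hundred twenty three"]
references = ["one thousand", "one hundred twenty three"]
print(exact_match_accuracy(predictions, references))  # 0.5
```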

Model size and resources

T5 models, including T5-base, handle input sequences with a default maximum length of 512 tokens. This limit is configured in the pretrained checkpoints available in the Hugging Face Transformers library.

Let's check the maximum input length of T5 Base, along with its memory footprint and parameter count.

In [12]:

print(f"Maximum input length: {tokenizer.model_max_length} tokens")  # Usually 512

# Memory taken by the model weights in MB (assuming 4 bytes per parameter, i.e. float32)
model_size = sum(p.numel() for p in model.parameters())
print(f"Model memory footprint: {round(model_size * 4 / 1024**2, 2)} MB")

# Number of model parameters, in millions
print(f"Number of model parameters: {round(model_size/10**6,2)} million")

Maximum input length: 512 tokens
Model memory footprint: 850.31 MB
Number of model parameters: 222.9 million