A good resource to use alongside this notebook is the original GPT paper [1], which this notebook largely relies on for model architecture and implementation details.
This article will walk through building a simple GPT-style model from scratch using PyTorch [1,2]. The goal of this article is to train a basic large language model from start to finish in one notebook. We will train an LLM that is small enough to fit on a single GPU during training and inference, so the notebook can be run on popular cloud GPU services (Google Colab, Kaggle, Paperspace, etc.). The computation graph of the model that we will build in this article is as follows:
This architecture resembles the original GPT model and is quite similar to GPT-2 and GPT-3, the main difference being that it is smaller (fewer decoder blocks and smaller embedding sizes) [1,3,4]. We will zoom into each step of this diagram throughout this article to discuss the math, code, and intuition behind it.
According to the original GPT paper, there are two main training stages for the early GPT models: pretraining and supervised fine-tuning [1]. Pretraining is a self-supervised learning task, where parts of the input data are omitted and used as target variables. Supervised fine-tuning works like a traditional supervised learning task, with human-annotated labels for the input data.
The first stage in building a GPT model is pretraining. Pretraining builds the "base" of an LLM. It allows the model to understand statistical properties of language, grammar, and context.
The goal of pretraining is simple: to have a model that can reliably predict the next token given the previous k tokens in a sequence. The final result of pretraining is a deep learning model that takes in k tokens and produces a discrete probability distribution over what the (k+1)th token should be. We want this distribution to assign a high probability to the correct token and low probabilities to the incorrect ones.
To achieve this, we start off with a large dataset of raw text. This text can be taken from books, blogs, wikis, research papers, and other text sources. After compiling the large dataset of text, we split it into "chunks" of tokens, where each chunk has a fixed number of tokens (512 for GPT, 1024 for GPT-2, 2048 for GPT-3). This chunk size is known as the "context window". A pretrained model takes in that many tokens and outputs the most likely next token.
When dealing with LLMs, we use the word "token" to describe the smallest "unit" of text that an LLM can analyze [5]. Tokens can loosely be thought of as words. When analyzing a sequence of text, an LLM first has to convert the text to tokens. This works like a dictionary lookup: each word/token has an integer "index" in the lookup table, and this index is what is actually fed into the network to be analyzed.
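Conceptually, the lookup can be pictured as a plain Python dictionary. This is a toy illustration with a made-up three-word vocabulary, not the real GPT-2 tokenizer (which uses byte pair encoding and tens of thousands of entries):

```python
# Toy vocabulary: this word-to-index mapping is hypothetical;
# real tokenizers learn tens of thousands of entries.
vocab = {"hello": 0, "world": 1, "!": 2}
inverse_vocab = {idx: tok for tok, idx in vocab.items()}

def encode(words):
    # Look up the integer index of each word/token
    return [vocab[w] for w in words]

def decode(indices):
    # Map integer indices back to their words/tokens
    return [inverse_vocab[i] for i in indices]

ids = encode(["hello", "world", "!"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # ['hello', 'world', '!']
```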
Each example in the pretraining dataset is a chunk of tokens. The same chunk of tokens is used for the input and the output, but the output is shifted one token into the "future". The reason for this has to do with the parallel processing capabilities of the transformer, which we will cover in depth in the transformer section. The following visual helps show what the training data looks like for the pretraining model.
Because the model uses transformers and parallel processing, a single example like the one above is actually in a sense 6 different examples. The model is learning the following predictive patterns:
This will be clearer in the transformer section of the article. The main point to know now is what the format of the input and outputs of the training data should look like in the pretraining step. The outputs are the inputs, shifted by one token so that each input token aligns with the output token that comes directly after it in the original sequence.
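The shift can be sketched in a few lines (the token ids below are made up for illustration):

```python
# Hypothetical chunk of token ids from a pretraining corpus.
chunk = [15496, 995, 11, 428, 318, 257, 1332]

inputs = chunk[:-1]   # everything except the last token
targets = chunk[1:]   # the same tokens shifted one step into the "future"

# Each input position i is trained to predict targets[i],
# the token that directly follows it in the original sequence.
for x, y in zip(inputs, targets):
    print(x, "->", y)
```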
Before doing a full pretraining loop, we will do a "test run" using a small dataset that fits into memory. This lets us focus on the internals of the model rather than the complexities of data processing. We will use the Salesforce wikitext dataset, which consists of an extract of good and featured Wikipedia articles [6].
We will load the dataset from the Hugging Face datasets hub. The Hugging Face datasets package provides an easy way to load, preprocess, and use a variety of datasets for deep learning [7].
! pip install --upgrade datasets
! pip install tiktoken
! pip install transformers
! pip install torch
! pip install matplotlib
import warnings
import torch
import math
import time
import os
import matplotlib.pyplot as plt
from itertools import cycle
from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from torch.optim.lr_scheduler import _LRScheduler
warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
cuda
from datasets import load_dataset
dataset = load_dataset("EleutherAI/wikitext_document_level", "wikitext-2-raw-v1", split="train")
For pretraining language models, a simple approach to tokenizing and chunking text is as follows:
1. Concatenate all of the raw text into one long string.
2. Tokenize the entire string into a single list of token ids.
3. Slice the token list into chunks of context window length + 1 tokens.
4. For each chunk, use all but the last token as the input and the same tokens shifted one position forward as the target.
This process will change slightly when using datasets that are too large to fit into memory.
One easy way to tokenize our dataset is to use OpenAI's tokenizer implementation tiktoken, which implements BPE (Byte Pair Encoding) [8]. This article will not go into detail on how a tokenizer is implemented; just know that it converts strings of text into lists of integers, and can also convert those lists of integers back into strings of text.
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2") # Get the same tokenizer used for GPT-2
print("Vocabulary size:", tokenizer.n_vocab) # Vocabulary size is how many unique tokens the tokenizer can encode
print("End of text token:", tokenizer.eot_token) # End of text token is used to indicate the end of a text sequence
print("Example tokenization:", tokenizer.encode("Hello world!"))
# Convert entire dataset into a single string.
# This dataset is small enough to fit into memory.
# For larger datasets, you may need to use more
# sophisticated methods to process the data.
all_text = ""
all_data = dataset["page"]
for example in all_data:
    all_text += "<page> " + example + " </page>"
# Tokenize the entire text at once
tokenized_text = tokenizer.encode(all_text)
# We will create a function that generates a dataset of examples
# for the language model. The function will take in the number of
# examples to generate, the block size, and the test split.
# It will return the training and test datasets.
def get_dataset(num_examples, context_window_length, test_split=0.1):
    input_blocks = []  # List to store input sequences
    target_blocks = []  # List to store target sequences
    # Use a sliding window to create input/target sequences
    for i in range(0, len(tokenized_text), context_window_length + 1):
        block = tokenized_text[i:i + context_window_length + 1]
        # Skip blocks that are too short
        if len(block) < context_window_length + 1:
            continue
        input_seq = block[:-1]
        target_seq = block[1:]
        input_blocks.append(input_seq)
        target_blocks.append(target_seq)
        # Stop if we have enough examples
        if len(input_blocks) >= num_examples:
            break
    # Convert to tensors for pytorch and move to gpu
    inputs = torch.tensor(input_blocks, dtype=torch.long).to(device)
    targets = torch.tensor(target_blocks, dtype=torch.long).to(device)
    # Calculate train/test split point
    split_idx = int(num_examples * (1 - test_split))
    # Split into train/test
    train_inputs = inputs[:split_idx]
    train_targets = targets[:split_idx]
    test_inputs = inputs[split_idx:]
    test_targets = targets[split_idx:]
    return train_inputs, train_targets, test_inputs, test_targets
# Get a small dataset
i, o, _, _ = get_dataset(2, 4, 0)
print("Input Shape", i.shape)
print("Output Shape", o.shape)
print("Input Example:")
print(i)
print("Output Example:")
print(o)
Vocabulary size: 50257
End of text token: 50256
Example tokenization: [15496, 995, 0]
Input Shape torch.Size([2, 4])
Output Shape torch.Size([2, 4])
Input Example:
tensor([[   27,  7700,    29,   220],
        [  569, 18354,  7496, 17740]], device='cuda:0')
Output Example:
tensor([[ 7700,    29,   220,   796],
        [18354,  7496, 17740,  6711]], device='cuda:0')
Using our tokenizer methods, we have generated a "dummy" dataset that will be used for the rest of the diagrams / examples of the article to show the shapes of the matrices as they flow through the model.
This means that we have a context length of 4 tokens, and a batch size of 2. The full dummy dataset has a total of 2 examples. This is far smaller than the dataset would be in reality - but is useful for introducing the architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F
# A simple configuration container
class GPTConfig:
    def __init__(
        self,
        vocab_size,  # size of the vocabulary, from tokenizer; for the gpt2 tokenizer it is 50257
        n_layer,  # number of transformer blocks
        n_head,  # number of attention heads for each transformer block
        n_embd,  # embedding dimension for each token
        seq_len,  # sequence length for the model - e.g. the "context window"
    ):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd
        self.seq_len = seq_len

test_config = GPTConfig(
    vocab_size=tokenizer.n_vocab,
    n_layer=2,
    n_head=3,
    n_embd=6,
    seq_len=4,
)
Our first layer of the network is going to be a token embedding layer. This layer is a little bit different than traditional neural network layers. It is essentially a lookup table that returns an "embedding vector" for a given integer index. The goal of this layer is to convert tokens to vectors. These vectors are tuned as the network is trained so that their position in space relative to the other tokens reflects their statistical relationships with each other.
The embedding layer converts a discrete token (integer) into a semantic representation of that token (vector). Before the embedding layer, the model has no idea of what the token means or how it relates to other tokens. After the embedding layer, the model understands the semantic meaning of the token by its relationship with other tokens in the embedding space. For more information on word embeddings see the Word2Vec paper [13].
These are vectors that start off as random, but slowly assume values within embedding space that reflect the semantic meaning of the token. This process happens during training.
For our dummy dataset, the input to this layer will be a matrix of size 2x4 (batch x token indices). The output will be 2x4x6 (batch x tokens x embedding dimensions). This transformation can be visualized as follows:
token_embedding = nn.Embedding(test_config.vocab_size, test_config.n_embd).to(device)
test_batch_inputs, _, _, _ = get_dataset(2, test_config.seq_len, 0)
print("Batch shape:", test_batch_inputs.shape, "Batch x Seq Len")
print("After embedding:", token_embedding(test_batch_inputs).shape, "Batch x Seq Len x Embedding Dim")
print("")
print("Before embedding")
print(test_batch_inputs)
print("After embedding")
print(token_embedding(test_batch_inputs))
Batch shape: torch.Size([2, 4]) Batch x Seq Len
After embedding: torch.Size([2, 4, 6]) Batch x Seq Len x Embedding Dim

Before embedding
tensor([[   27,  7700,    29,   220],
        [  569, 18354,  7496, 17740]], device='cuda:0')
After embedding
tensor([[[-0.0850, -0.5393, -0.4814, -0.5201,  0.8528,  1.5614],
         [ 1.7124,  0.9409,  0.5425, -0.2307,  1.0325,  0.6966],
         [ 1.8344, -0.6808,  0.1047,  0.9956,  0.6336,  0.1757],
         [-0.3960, -1.9960, -0.7287,  0.4127,  0.1946, -0.3331]],

        [[ 2.4173, -1.0166,  0.2398, -0.3766, -0.5538,  0.2580],
         [-1.2367, -0.7099, -0.0859, -0.8818, -0.9876,  0.7716],
         [ 1.1070, -0.5770,  0.2477,  2.5843,  0.5314,  1.3516],
         [ 1.0932, -0.0932, -1.0295,  0.9911, -0.2331,  2.5532]]],
       device='cuda:0', grad_fn=<EmbeddingBackward0>)
In this example, we are using an embedding dimension of 6, so each original token is mapped to a vector of length 6. As of right now, these vectors don't have any actual meaning, they are randomly initialized. However, during the training process, these entries will be slowly nudged via backpropagation and over time they will start to assume meaning for their respective tokens.
After embedding the tokens into embedding vectors, we will add a positional encoding to the vectors. Why do we need a positional encoding? Consider the following sentence:
The planet is smaller than the other planet.
A positional encoding allows the model to differentiate the two instances of the word "planet". Without a positional encoding, the two token embedding vectors for each instance of the word planet would be exactly the same. With a positional encoding, the model can distinguish the two usages within the same sentence.
We will use the positional encoding formula that was used in the original transformer paper [9]. The formula works by starting out with a matrix of shape sequence length x embedding dimension. The matrix is then filled in with the following formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where pos is the position of the token in the sequence, i indexes pairs of embedding dimensions within the token, and d is the embedding dimension size of the model. This formula outputs a matrix, and the matrix it outputs depends on the embedding size. The resulting matrix will be (seq_length x embedding size). The matrix starts out as all zeros, and then the formula is applied.
def get_position_encoding(seq_len, d, n=10000):
    """
    Computes the positional encoding matrix of shape (seq_len, d).
    Args:
        seq_len (int): Length of the sequence.
        d (int): Dimension of the embedding.
        n (float): The base for the exponential term (default 10000 in many Transformer implementations).
    Returns:
        torch.Tensor: A tensor of shape (seq_len, d) containing the positional encodings.
    """
    P = torch.zeros(seq_len, d).to(device)
    for pos in range(seq_len):
        for i in range(0, d // 2):
            P[pos, 2 * i] = math.sin(pos / (n ** ((2 * i) / d)))
            if 2 * i + 1 < d:
                P[pos, 2 * i + 1] = math.cos(pos / (n ** ((2 * i) / d)))
    return P.unsqueeze(0)
# Example usage:
position_encoding = get_position_encoding(seq_len=test_config.seq_len, d=test_config.n_embd)
print("Position encoding shape:", position_encoding.shape)
Position encoding shape: torch.Size([1, 4, 6])
Once we have the positional encoding, we add it to the embedding vectors using element-wise addition. Since we are using PyTorch, the addition will "broadcast" across the first dimension, meaning the 4x6 positional encoding matrix is added to each batch example in parallel.
test_embeddings = token_embedding(test_batch_inputs)
test_embeddings_with_pos = test_embeddings + position_encoding
print("Token embeddings shape:", test_embeddings.shape)
print("Position encodings shape:", position_encoding.shape)
print("Sum of token embeddings and position encodings:",test_embeddings_with_pos.shape)
Token embeddings shape: torch.Size([2, 4, 6])
Position encodings shape: torch.Size([1, 4, 6])
Sum of token embeddings and position encodings: torch.Size([2, 4, 6])
At first, it can be challenging to intuit what the positional encoding is doing. The positional encoding is just a constant matrix (given the sequence length and embedding size), with its values set to a desirable pattern. Each row of the matrix aligns to a token position, meaning one constant vector is added to the token at position 1 every time, a different constant vector is added to the token at position 2 every time, and so on.
This differentiates the value of the word "planet" coming at the beginning vs the end of the sentence. However, sometimes relative position of words in a sentence is more important than absolute position. So how do we take that into account? The answer is that the relative relationships between words are emergent. These happen through the process of attention, which we will discuss later.
The key point here is that without positional encoding, these two sentences would look the same:
The positional encoding makes the vectors for dog and owner different in the two sentences, which allows attention to catch onto the relative relationships between these two words.
The below image shows an example of a positional encoding matrix. It looks interesting, but what exactly are we looking at? Why does this help the model encode the position of each embedding vector? Remember, each row in our embedding matrix represents a word/token, and we will be adding this matrix to the embedding matrix to encode positions. One thing to note about this matrix is that each row is unique. There is also a smooth transition between rows: if you take rows 27 and 28 from this matrix, they will have very similar patterns, while rows 1 and 120 will differ much more. This smoothness is an important feature that helps the model understand position [10].
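We can check this smoothness numerically. The sketch below recomputes a positional encoding matrix with the same sinusoidal formula (on the CPU, independent of the notebook's variables) and compares rows with cosine similarity; neighboring positions come out far more similar than distant ones:

```python
import math
import torch

def pe(seq_len, d, n=10000):
    # Sinusoidal positional encoding: sin on even dims, cos on odd dims
    P = torch.zeros(seq_len, d)
    for pos in range(seq_len):
        for i in range(d // 2):
            angle = pos / (n ** ((2 * i) / d))
            P[pos, 2 * i] = math.sin(angle)
            P[pos, 2 * i + 1] = math.cos(angle)
    return P

P = pe(seq_len=128, d=64)
cos = torch.nn.functional.cosine_similarity
print(cos(P[27], P[28], dim=0).item())  # adjacent rows: high similarity
print(cos(P[1], P[120], dim=0).item())  # distant rows: much lower
```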
There is nothing inherently special about the formula above, there are other formulas for positional encoding. The key thing to note is that there needs to be some matrix that we can add to our embedding matrix that encodes position. This formula has certain properties that are biased towards making it easy for the model to do that.
After positional encoding, we get to the core of the LLM - the (decoder only) transformer. The first step of the transformer is masked multiheaded self attention. We can break down the internals of the transformer into three parts: self attention, then masking, then the multiple heads.
The core idea behind self attention is that it allows every token to "talk" to the other tokens. Attention "reframes" a word's meaning as a combination of all the other words in the context window. A single self attention head does one of many possible "reframings" of each token. This allows the model to understand each word's context in relation to the other words of the sentence.
Self attention starts with just the token embedding matrix with position encodings. It "decomposes" this matrix into queries, keys, and values. In reality all of these are just vectors / matrices that get tuned during training, but we can conceptually think of them as queries, keys, and values due to their dot product operations that take place in the attention operation.
The original equation for scaled dot product attention is as follows [9]:

Attention(Q, K, V) = softmax(QK^T / √dk) V
Q, K, and V are the query, key, and value matrices. They are produced by matrix projections of the input embedding matrix: the token embeddings are multiplied by the Wq, Wk, and Wv matrices. These weight matrices start off random and are tuned during training. In other words, during training the network learns what "queries" to ask and what "keys" and "values" to set via backpropagation by tuning these matrices. It learns how to transform the embedding matrix into "queries", "keys", and "values" in order to best reduce the loss of the network.
The projection operation to generate Q,K, and V are shown below using the dimensions for our dummy dataset/network.
Q, K, and V are all matrices of shape num tokens x embedding size. Each token has a query vector in "query space". Each token also has a key vector in "key space". When we do the QK^T operation, we are calculating how well each token's query matches each key. This can be thought of as a sort of "fuzzy lookup" using vector dot products. If a query and a key have a high dot product, the vectors are pointing in nearly the same direction, which means those two tokens are important to take into account together.
After doing the matrix multiplication between Q and K^T, we end up with a similarity matrix of tokens. This similarity matrix tells us how much each token attends to each other token. Each row of the QK^T matrix is put through the softmax function so that each row becomes a probability distribution that sums to one. This distribution can be interpreted as how strong of a match each key is to the query of that row - how much each key "attends" to each query.
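A tiny numeric illustration of the row-wise softmax (the similarity scores here are made up):

```python
import torch
import torch.nn.functional as F

# Stand-in for a QK^T similarity matrix for 3 tokens (made-up scores).
qkt = torch.tensor([[2.0, 0.5, 0.1],
                    [0.3, 1.5, 0.2],
                    [0.1, 0.4, 2.2]])

# Softmax is applied independently to each row,
# turning every row into a probability distribution.
weights = F.softmax(qkt, dim=-1)
print(weights)
print(weights.sum(dim=-1))  # every row sums to 1
```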
The value matrix can be thought of as the actual content/information that each token has to offer. This value matrix is weighted by the key/query similarities to produce the final output of self attention.
There are some alternative ways to conceive of the individual operations of attention that can help at a conceptual/intuitive level. Let's go through each operation in attention and try to describe in plain English what it is doing conceptually.
We know that the Q, K, V matrices are created by a matrix operation to the input of the transformer (for the first block, this is our position encoded word embeddings). We also know that the weights to create these matrices are tuned through the process of backpropagation. But how can we think of these matrices themselves? What information do they actually contain?
The Q matrix can be thought of as n rows of queries or questions, where n is the number of tokens in the input. When thinking about the Q matrix, think of it as n vectors rather than a single matrix, where each vector is a query or question about the corresponding word that could be answered by some combination of the other words. Remember, we are "reframing" the given word as some combination of the other words. For example, it could look like the following:
In this case each token has a corresponding question. These questions or queries are going to be questions that can be answered by the surrounding tokens. So how are these questions created? Wq is responsible for creating the right questions for each token (with position). Wq maps a token to a relevant query about that token. These queries become relevant through the process of training via backpropagation.
We can think of the K matrix as n row vectors of keys, where n is the number of tokens in the input. What do we mean by "keys"? It is easiest to think of keys as facts that can help answer queries. Above in the query section we asked questions like "what noun do I describe?". A key that might closely match this query would be "I am a noun that can be described". Similar to the queries, Wk creates these keys by learning the right mapping from token to corresponding key. These keys become good matches for the queries because of the QK^T operation that is performed in training.
Overall, each key can be conceived of as a fact about that token that could help answer queries that the other tokens might have.
Now that we have an intuition for the Q and K matrices, we can think about what the matrix multiplication QK^T in the attention equation is doing. The QK^T operation is a matching operation, where each query is compared with each key by performing a dot product. If the dot product is large, the key answers or "attends" to the query. If the dot product is small, the key is unrelated and does not help answer the query. The QK^T operation "reframes" each query into a set of keys. The resulting matrix can be thought of as n row vectors, where every dimension or coordinate of these row vectors is a weight for a token key/fact. So a vector in this space is some weighted combination of all of the tokens (keys).
Basically, what we are doing is redescribing the original token query/question as a weighted vector of all of the token keys/answers. Instead of asking a question about a token, we have n different answers, all with their own weights.
When doing the QK^T operation, we are reframing the query row vectors as a combination of the keys. Remember, each query has to do with how that token relates to the other tokens, so the answers can be formed as some combination of the other tokens.
Dividing by √dk is done to make the output of the softmax more stable. The dot product of two random vectors of dimension dk has a variance proportional to dk, so its typical magnitude grows like √dk. Scaling ensures that no matter how large dk is, the softmax works as expected and does not produce extreme values.
This is an elementwise division, so every element of the matrix is divided by this value. The resulting matrix can be thought of in the same way as the QK^T result, just scaled.
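A quick sanity check of the scaling (this demo is mine, not part of the model code): the standard deviation of unscaled dot products grows like √dk as the dimension increases, while the scaled version stays roughly constant:

```python
import torch

torch.manual_seed(0)

raw_stds, scaled_stds = [], []
for d_k in [16, 256, 4096]:
    # Many random query/key pairs with unit-variance components
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)           # raw dot products
    raw_stds.append(dots.std().item())   # grows like sqrt(d_k)
    scaled_stds.append((dots / d_k ** 0.5).std().item())  # stays near 1
    print(d_k, raw_stds[-1], scaled_stds[-1])
```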
The softmax operation is performed row-wise on the QK^T/√dk matrix, meaning every row becomes a probability distribution. We can still think of each row as a token represented by a "reframed" query vector, but now we know that each row vector adds up to one.
The V matrix is a bit harder to conceive of, but it can be thought of as a matrix whose columns are learned features: each element is the value of that feature for the token in that row. The rows are "feature" vectors that contain information about specific learned features for each token. When we do the final operation, these feature vectors are weighted, meaning that the feature values of certain tokens are focused on more than others. The V matrix is the actual content or output of attention. This content is adjusted by the weights from the softmax(QK^T/√dk) operation.
Now for the final operation of attention, multiplying by the V matrix. We can think of the V matrix as containing the original content of the embeddings. We weight this content based on the query/key matches. In other words, we weight the content based on the specific questions we are trying to ask and how the other words in context answer those questions.

softmax(QK^T/√dk)V

When putting this all together (using the original dimensions of our "test" config object, as we do in the code), we can see all the matrix operations and dimensions through the self attention operation.
Self attention can be written as a self contained pytorch module as shown below.
class SelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Create the weight matrices directly on the device so they stay registered as parameters
        self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Query weights - transform input embeddings into queries
        self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Key weights - transform input embeddings into keys
        self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Value weights - transform input embeddings into values
    def forward(self, x):
        print("Attention input shape:", x.shape)
        print("")
        print("Query weights shape:", self.Wq.shape)
        print("Key weights shape:", self.Wk.shape)
        print("Value weights shape:", self.Wv.shape)
        queries = x @ self.Wq  # Matrix multiplication to transform input embeddings into queries
        keys = x @ self.Wk  # Matrix multiplication to transform input embeddings into keys
        values = x @ self.Wv  # Matrix multiplication to transform input embeddings into values
        print("")
        print("Queries shape:", queries.shape)
        print("Keys shape:", keys.shape)
        print("Values shape:", values.shape)
        qkt = queries @ keys.transpose(-2, -1)  # Calculate QK^T
        qkt_scaled = qkt / math.sqrt(queries.size(-1))  # Scale QK^T by the square root of the key dimension
        qkt_softmax = F.softmax(qkt_scaled, dim=-1)  # Apply softmax row-wise to get attention weights
        print("")
        print("QK^T shape:", qkt.shape)
        attn_output = qkt_softmax @ values  # Multiply softmax(QK^T) by values
        print("")
        print("Attention output shape:", attn_output.shape)
        return attn_output
attention = SelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
Attention input shape: torch.Size([2, 4, 6])

Query weights shape: torch.Size([6, 6])
Key weights shape: torch.Size([6, 6])
Value weights shape: torch.Size([6, 6])

Queries shape: torch.Size([2, 4, 6])
Keys shape: torch.Size([2, 4, 6])
Values shape: torch.Size([2, 4, 6])

QK^T shape: torch.Size([2, 4, 4])

Attention output shape: torch.Size([2, 4, 6])
Now that we have implemented self attention, we can move on to causal self attention. During training, the transformer tries to predict the next token at every time step in parallel. However, we would be cheating if we allowed attention to see future tokens during training - the model could simply copy them. For this reason, we mask the attention matrix so that future tokens are hidden from the self attention layers. We perform this masking after the QK^T operation [11].
The masking process zeroes out the upper triangle of the softmax output. This makes it so that the following occurs:
When we say a query is able to be reframed, what we mean mathematically is that the value in that matrix entry can be greater than 0.
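A small standalone demo of the mask (the attention scores here are random stand-ins for the scaled QK^T matrix): after softmax, each row only distributes probability over the current and earlier positions:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # stand-in for scaled QK^T

# Put -inf on the upper triangle (future positions) so softmax sends them to 0
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
scores = scores.masked_fill(mask == 1, float("-inf"))
weights = F.softmax(scores, dim=-1)

print(weights)
# Row 0 attends only to token 0, row 1 to tokens 0-1, and so on;
# each row still sums to 1.
```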
We can modify our self attention block above to add masking with the following changes:
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Create the weight matrices directly on the device so they stay registered as parameters
        self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Query weights - transform input embeddings into queries
        self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Key weights - transform input embeddings into keys
        self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device))  # Value weights - transform input embeddings into values
    def forward(self, x):
        seq_len = x.shape[1]  # Get sequence length (number of tokens / context window length)
        queries = x @ self.Wq  # Matrix multiplication to transform input embeddings into queries
        keys = x @ self.Wk  # Matrix multiplication to transform input embeddings into keys
        values = x @ self.Wv  # Matrix multiplication to transform input embeddings into values
        qkt = queries @ keys.transpose(-2, -1)  # Calculate QK^T
        qkt_scaled = qkt / math.sqrt(queries.size(-1))  # Scale QK^T by the square root of the key dimension
        # MASKING
        # THIS IS THE ONLY DIFFERENCE, USE -inf FOR UPPER TRIANGLE MASK SO THAT SOFTMAX WILL BE 0
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)
        causal_mask = causal_mask.masked_fill(causal_mask == 1, float('-inf'))  # Upper triangle masked with -inf
        qkt_scaled = qkt_scaled + causal_mask  # Add the mask to the scaled QK^T
        # END MASKING
        qkt_softmax = F.softmax(qkt_scaled, dim=-1)  # Apply softmax row-wise; the -inf values become 0 here
        attn_output = qkt_softmax @ values  # Multiply softmax(QK^T) by values
        return attn_output
attention = CausalSelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
Now that we have causal self attention, we can add in the "multi-headed" part of the attention layer. We do this by running multiple CausalSelfAttention operations in parallel and concatenating their outputs. We then add a layer to project the final output back down to the input size.
What is this actually doing conceptually? It allows each head to have the tokens attend to each other in different ways. For instance, one head might focus on grammatical structure, another on semantic meaning, and another on real-world meaning. If viewing the sentence "the sky is blue" from a grammatical structure perspective, the word "the" might attend heavily to the word "sky" because that is what it refers to. However, if viewing attention through the lens of real-world meaning, the word "the" won't attend to the word "sky" very much because their meanings are not similar. Each word's relationship to the other words might be different depending on what "lens" (or "head") you view them through.
To reiterate, this is a helpful conceptual way to think about multi-headed attention, but the meaning of each head is not always human-understandable in this way. The heads take on whatever meaning helps minimize the loss function on the training set the most.
The final output of multi-headed causal self attention is the exact same size as the input, because of the final feedforward layer that projects the concatenated attention outputs back down.
The following code snippet shows an implementation of multi-headed causal self attention, building on our previous attention blocks. This is not the most compute efficient implementation due to the for loop for each head, but it is easier to read than the fully vectorized version and works for our use case due to the small datasets we are using.
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn_heads = nn.ModuleList([
            CausalSelfAttention(config) for _ in range(config.n_head)
        ])  # Create n_head attention heads
        self.projection = nn.Linear(config.n_embd * config.n_head, config.n_embd).to(device)  # Linear layer to project multi-head attention outputs

    def forward(self, x):
        head_outputs = [head(x) for head in self.attn_heads]  # Get the output of each attention head
        multihead_output = torch.cat(head_outputs, dim=-1)  # Concatenate the outputs
        return self.projection(multihead_output)  # Project the concatenated outputs
multihead_attn = MultiHeadAttention(test_config)
test_out = multihead_attn(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
We have now successfully implemented multi-headed attention. There are just a few steps left until we have a GPT "block" that we can stack onto the network over and over again. The architecture of a GPT block is as follows:
So far we have built the text embedding, positional encoding, and masked multiheaded self attention parts. Now we need to add in the normalization layers and the feedforward layers. These are straightforward pytorch layers that are common across many neural network architectures.
Layer normalization is used in many deep learning architectures. It normalizes the values of the incoming matrix across the feature dimension (dimension 2 in our case), which stabilizes training and speeds up convergence.
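As a quick sanity check (this snippet is illustrative and not part of the model code), we can verify that `nn.LayerNorm` normalizes each token's feature vector independently to roughly zero mean and unit variance, regardless of the input's scale:

```python
import torch
import torch.nn as nn

# LayerNorm normalizes each (batch, position) feature vector over the last
# dimension, then applies a learned scale and shift (initialized to 1 and 0).
ln = nn.LayerNorm(6)
x = torch.randn(2, 4, 6) * 10 + 5  # arbitrary scale and offset
out = ln(x)

# Each position's vector is normalized independently of the others.
print(out.mean(dim=-1))                  # approximately 0 everywhere
print(out.std(dim=-1, unbiased=False))   # approximately 1 everywhere
```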
The feedforward layer of the transformer block operates under a different paradigm than attention. While attention captures relationships between tokens, the feedforward layer applies the same transformation to each token independently and in parallel. It can be implemented using standard pytorch linear layers. We use a hidden size of 4 × the embedding dimension, as was done in the original "Attention Is All You Need" paper, and the Gaussian Error Linear Unit (GELU) activation function, as in the original GPT paper.
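One way to see the "position-wise" property concretely (again, an illustrative aside rather than model code): permuting the tokens in the sequence simply permutes the feedforward outputs, because each token is transformed independently. Attention would not behave this way, since each attention output depends on the other tokens.

```python
import torch
import torch.nn as nn

n_embd = 6
ffn = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.GELU(),
    nn.Linear(4 * n_embd, n_embd),
)
x = torch.randn(1, 4, n_embd)
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the 4 positions

# Applying the FFN then permuting equals permuting then applying the FFN.
out_then_perm = ffn(x)[:, perm, :]
perm_then_out = ffn(x[:, perm, :])
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-6))  # True
```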
class GPTBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mha = MultiHeadAttention(config)
        self.ln1 = nn.LayerNorm(config.n_embd).to(device)
        self.ffn = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        ).to(device)
        self.ln2 = nn.LayerNorm(config.n_embd).to(device)

    def forward(self, x):
        x = x + self.mha(self.ln1(x))  # Attention sublayer with residual connection
        x = x + self.ffn(self.ln2(x))  # Feedforward sublayer with residual connection
        return x
block = GPTBlock(test_config)
test_out = block(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
Now that we have a block, we can stack multiple blocks together to form a GPT-style LLM.
class GPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd).to(device)
        self.position_encoding = get_position_encoding(config.seq_len, config.n_embd)
        self.blocks = nn.Sequential(*[GPTBlock(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd).to(device)
        self.head = nn.Linear(config.n_embd, config.vocab_size).to(device)

    def forward(self, x):
        x = self.token_embedding(x) + self.position_encoding  # Add positional information to token embeddings
        x = self.blocks(x)  # Pass through the stack of GPT blocks
        x = self.ln_f(x)  # Final layer norm
        return self.head(x)  # Project to vocabulary-sized logits
gpt = GPTModel(test_config)
print(test_batch_inputs.shape)
test_out = gpt(test_batch_inputs)
print(test_out.shape)
torch.Size([2, 4])
torch.Size([2, 4, 50257])
That is a full forward pass through the LLM: the input has shape [batch, tokens] and the output has shape [batch, tokens, vocab_size]. For each token in the input, the model produces logits that, after a softmax, form a discrete probability distribution over the next token.
The transformer makes all of these predictions in parallel, one for each token in the input. All of them are used during training, but only the last prediction (for token n) is used at inference time to produce the final prediction.
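A minimal sketch of how that inference loop could look. The `generate` helper and `toy` model below are hypothetical illustrations, not the article's code: any model mapping (batch, seq) token ids to (batch, seq, vocab) logits works, and only the last position's logits are used to pick each new token.

```python
import torch

def generate(model, tokens, max_new_tokens, seq_len):
    # Greedy autoregressive decoding sketch (assumed helper, not from the article).
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            context = tokens[:, -seq_len:]        # crop to the context window
            logits = model(context)               # (batch, seq, vocab)
            next_logits = logits[:, -1, :]        # keep only the last position
            next_token = next_logits.argmax(dim=-1, keepdim=True)  # greedy pick
            tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

# Toy stand-in for the full GPTModel, just to exercise the shapes.
toy = torch.nn.Sequential(torch.nn.Embedding(50, 8), torch.nn.Linear(8, 50))
out = generate(toy, torch.tensor([[1, 2, 3]]), max_new_tokens=5, seq_len=4)
print(out.shape)  # (1, 8): the 3 prompt tokens plus 5 generated tokens
```

In practice one would sample from the softmax distribution (with a temperature) rather than always taking the argmax, but the loop structure is the same.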
The following diagram shows the full forward pass, with the shapes of one example as it moves through the model.
Now that we have gone through the forward pass of the model, we can train it. The model is trained using next-token prediction.
According to the original GPT paper, the objective function of pretraining is the following [1]:
$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$
Maximizing this objective function is essentially the same as minimizing the cross entropy loss function:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
This is because during training we use a one-hot encoded vector for the true distribution, so p(x) is 1 for the correct token and 0 for all other tokens. The sum therefore collapses, and the cross entropy loss simplifies to:
$$H(p, q) = -\log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$
Pytorch has a pre-built cross-entropy loss function that can be used as our criterion to minimize [12].
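We can quickly confirm the simplification numerically (a toy check, not part of the training code): with an integer class target, PyTorch's `F.cross_entropy` is exactly the negative log of the softmax probability assigned to the correct token.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10)   # unnormalized scores over a toy 10-token vocab
target = torch.tensor([3])    # index of the "correct" token

loss = F.cross_entropy(logits, target)
manual = -torch.log_softmax(logits, dim=-1)[0, target]
print(torch.allclose(loss, manual))  # True
```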
We will first train the model with a small dataset (10 examples) and see if we can get the model to memorize/overfit to the dataset. This is a good test to ensure that our architecture is correct and that the loss decreases as expected.
# Example config:
batch_size = 10
sequence_len = 128
num_steps = 1000
train_inputs, train_targets, _, _ = get_dataset(10, sequence_len, 0)
config = GPTConfig(
    vocab_size=tokenizer.n_vocab,
    n_layer=4,  # fewer layers for a quick demo
    n_head=4,
    n_embd=128,
    seq_len=sequence_len,
)
# Create the GPT model
model = GPTModel(config)

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Define the scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.2, patience=20, min_lr=5e-6, threshold=1e-4
)

# Training loop
i = 1
losses = []
while i < num_steps:
    for j in range(0, len(train_inputs), batch_size):
        x = train_inputs[j:j+batch_size]
        y = train_targets[j:j+batch_size]

        # Forward pass
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        losses.append(loss.item())
        optimizer.step()
        optimizer.zero_grad()

        loss = loss.item()
        scheduler.step(loss)

        # Print the loss and learning rate for the step
        lr = optimizer.param_groups[0]["lr"]
        print(f"Step {i+1}/{num_steps}, Loss: {loss}, LR: {lr}")
        i += 1
Step 2/1000, Loss: 10.967658996582031, LR: 0.0005
Step 3/1000, Loss: 10.85688591003418, LR: 0.0005
Step 4/1000, Loss: 10.777162551879883, LR: 0.0005
Step 5/1000, Loss: 10.668031692504883, LR: 0.0005
...
Step 541/1000, Loss: 0.12451115995645523, LR: 0.0005
Step 542/1000, Loss: 0.12552228569984436, LR: 0.0005
Step 543/1000, Loss: 0.12251965701580048, LR: 0.0005
...
544/1000, Loss: 0.12409601360559464, LR: 0.0005 Step 545/1000, Loss: 0.11980509757995605, LR: 0.0005 Step 546/1000, Loss: 0.11921397596597672, LR: 0.0005 Step 547/1000, Loss: 0.12421467155218124, LR: 0.0005 Step 548/1000, Loss: 0.12658628821372986, LR: 0.0005 Step 549/1000, Loss: 0.13326843082904816, LR: 0.0005 Step 550/1000, Loss: 0.12381921708583832, LR: 0.0005 Step 551/1000, Loss: 0.11581559479236603, LR: 0.0005 Step 552/1000, Loss: 0.11535737663507462, LR: 0.0005 Step 553/1000, Loss: 0.13160692155361176, LR: 0.0005 Step 554/1000, Loss: 0.12189646065235138, LR: 0.0005 Step 555/1000, Loss: 0.11585875600576401, LR: 0.0005 Step 556/1000, Loss: 0.1124630942940712, LR: 0.0005 Step 557/1000, Loss: 0.1211152896285057, LR: 0.0005 Step 558/1000, Loss: 0.1169012039899826, LR: 0.0005 Step 559/1000, Loss: 0.112946055829525, LR: 0.0005 Step 560/1000, Loss: 0.10744037479162216, LR: 0.0005 Step 561/1000, Loss: 0.11075057834386826, LR: 0.0005 Step 562/1000, Loss: 0.11693501472473145, LR: 0.0005 Step 563/1000, Loss: 0.10756738483905792, LR: 0.0005 Step 564/1000, Loss: 0.09948016703128815, LR: 0.0005 Step 565/1000, Loss: 0.10241025686264038, LR: 0.0005 Step 566/1000, Loss: 0.112420953810215, LR: 0.0005 Step 567/1000, Loss: 0.11103066056966782, LR: 0.0005 Step 568/1000, Loss: 0.0980253666639328, LR: 0.0005 Step 569/1000, Loss: 0.09519194066524506, LR: 0.0005 Step 570/1000, Loss: 0.10033954679965973, LR: 0.0005 Step 571/1000, Loss: 0.11113598197698593, LR: 0.0005 Step 572/1000, Loss: 0.1067027673125267, LR: 0.0005 Step 573/1000, Loss: 0.10017013549804688, LR: 0.0005 Step 574/1000, Loss: 0.08589186519384384, LR: 0.0005 Step 575/1000, Loss: 0.08522863686084747, LR: 0.0005 Step 576/1000, Loss: 0.08796512335538864, LR: 0.0005 Step 577/1000, Loss: 0.08914685994386673, LR: 0.0005 Step 578/1000, Loss: 0.10033176839351654, LR: 0.0005 Step 579/1000, Loss: 0.09569718688726425, LR: 0.0005 Step 580/1000, Loss: 0.09215684235095978, LR: 0.0005 Step 581/1000, Loss: 0.09104704856872559, LR: 0.0005 
Step 582/1000, Loss: 0.08370417356491089, LR: 0.0005 Step 583/1000, Loss: 0.08882755786180496, LR: 0.0005 Step 584/1000, Loss: 0.09333042800426483, LR: 0.0005 Step 585/1000, Loss: 0.09395621716976166, LR: 0.0005 Step 586/1000, Loss: 0.09483806788921356, LR: 0.0005 Step 587/1000, Loss: 0.10696176439523697, LR: 0.0005 Step 588/1000, Loss: 0.11702649295330048, LR: 0.0005 Step 589/1000, Loss: 0.11695094406604767, LR: 0.0005 Step 590/1000, Loss: 0.11708281934261322, LR: 0.0005 Step 591/1000, Loss: 0.11911292374134064, LR: 0.0005 Step 592/1000, Loss: 0.10642746835947037, LR: 0.0005 Step 593/1000, Loss: 0.10568960011005402, LR: 0.0005 Step 594/1000, Loss: 0.09062449634075165, LR: 0.0005 Step 595/1000, Loss: 0.08835713565349579, LR: 0.0005 Step 596/1000, Loss: 0.08599722385406494, LR: 0.0005 Step 597/1000, Loss: 0.08588212728500366, LR: 0.0005 Step 598/1000, Loss: 0.08590028434991837, LR: 0.0005 Step 599/1000, Loss: 0.08308453857898712, LR: 0.0005 Step 600/1000, Loss: 0.09541193395853043, LR: 0.0005 Step 601/1000, Loss: 0.08812487870454788, LR: 0.0005 Step 602/1000, Loss: 0.08792287111282349, LR: 0.0005 Step 603/1000, Loss: 0.08447670191526413, LR: 0.0005 Step 604/1000, Loss: 0.08426915854215622, LR: 0.0005 Step 605/1000, Loss: 0.08540277183055878, LR: 0.0005 Step 606/1000, Loss: 0.08323539793491364, LR: 0.0005 Step 607/1000, Loss: 0.08528713136911392, LR: 0.0005 Step 608/1000, Loss: 0.09510378539562225, LR: 0.0005 Step 609/1000, Loss: 0.07850505411624908, LR: 0.0005 Step 610/1000, Loss: 0.0889134556055069, LR: 0.0005 Step 611/1000, Loss: 0.07955340296030045, LR: 0.0005 Step 612/1000, Loss: 0.07885599136352539, LR: 0.0005 Step 613/1000, Loss: 0.08034021407365799, LR: 0.0005 Step 614/1000, Loss: 0.06894141435623169, LR: 0.0005 Step 615/1000, Loss: 0.07147861272096634, LR: 0.0005 Step 616/1000, Loss: 0.0682905912399292, LR: 0.0005 Step 617/1000, Loss: 0.07605328410863876, LR: 0.0005 Step 618/1000, Loss: 0.08166414499282837, LR: 0.0005 Step 619/1000, Loss: 
0.07115388661623001, LR: 0.0005 Step 620/1000, Loss: 0.0779682919383049, LR: 0.0005 Step 621/1000, Loss: 0.07298511266708374, LR: 0.0005 Step 622/1000, Loss: 0.1083405613899231, LR: 0.0005 Step 623/1000, Loss: 0.1199323982000351, LR: 0.0005 Step 624/1000, Loss: 0.09794630110263824, LR: 0.0005 Step 625/1000, Loss: 0.07463125884532928, LR: 0.0005 Step 626/1000, Loss: 0.08452042192220688, LR: 0.0005 Step 627/1000, Loss: 0.07313860952854156, LR: 0.0005 Step 628/1000, Loss: 0.07254444062709808, LR: 0.0005 Step 629/1000, Loss: 0.08945196866989136, LR: 0.0005 Step 630/1000, Loss: 0.11651842296123505, LR: 0.0005 Step 631/1000, Loss: 0.08153659105300903, LR: 0.0005 Step 632/1000, Loss: 0.07211168110370636, LR: 0.0005 Step 633/1000, Loss: 0.0758204236626625, LR: 0.0005 Step 634/1000, Loss: 0.07265093922615051, LR: 0.0005 Step 635/1000, Loss: 0.0628904327750206, LR: 0.0005 Step 636/1000, Loss: 0.06499655544757843, LR: 0.0005 Step 637/1000, Loss: 0.0601317398250103, LR: 0.0005 Step 638/1000, Loss: 0.05996574088931084, LR: 0.0005 Step 639/1000, Loss: 0.062205951660871506, LR: 0.0005 Step 640/1000, Loss: 0.06599584966897964, LR: 0.0005 Step 641/1000, Loss: 0.05893300846219063, LR: 0.0005 Step 642/1000, Loss: 0.056494928896427155, LR: 0.0005 Step 643/1000, Loss: 0.06202076002955437, LR: 0.0005 Step 644/1000, Loss: 0.058871109038591385, LR: 0.0005 Step 645/1000, Loss: 0.049011193215847015, LR: 0.0005 Step 646/1000, Loss: 0.05677228048443794, LR: 0.0005 Step 647/1000, Loss: 0.056192588061094284, LR: 0.0005 Step 648/1000, Loss: 0.05172703415155411, LR: 0.0005 Step 649/1000, Loss: 0.057096682488918304, LR: 0.0005 Step 650/1000, Loss: 0.05319962650537491, LR: 0.0005 Step 651/1000, Loss: 0.04814205318689346, LR: 0.0005 Step 652/1000, Loss: 0.051470182836055756, LR: 0.0005 Step 653/1000, Loss: 0.053208671510219574, LR: 0.0005 Step 654/1000, Loss: 0.04990384727716446, LR: 0.0005 Step 655/1000, Loss: 0.04441634565591812, LR: 0.0005 Step 656/1000, Loss: 0.04084813967347145, LR: 0.0005 Step 
657/1000, Loss: 0.044170595705509186, LR: 0.0005 Step 658/1000, Loss: 0.04632062092423439, LR: 0.0005 Step 659/1000, Loss: 0.04488269239664078, LR: 0.0005 Step 660/1000, Loss: 0.0449594184756279, LR: 0.0005 Step 661/1000, Loss: 0.057353146374225616, LR: 0.0005 Step 662/1000, Loss: 0.05477500706911087, LR: 0.0005 Step 663/1000, Loss: 0.053335029631853104, LR: 0.0005 Step 664/1000, Loss: 0.051657289266586304, LR: 0.0005 Step 665/1000, Loss: 0.05923281982541084, LR: 0.0005 Step 666/1000, Loss: 0.05572907254099846, LR: 0.0005 Step 667/1000, Loss: 0.05607025697827339, LR: 0.0005 Step 668/1000, Loss: 0.057905398309230804, LR: 0.0005 Step 669/1000, Loss: 0.06375221163034439, LR: 0.0005 Step 670/1000, Loss: 0.06039778143167496, LR: 0.0005 Step 671/1000, Loss: 0.06644179672002792, LR: 0.0005 Step 672/1000, Loss: 0.06224949285387993, LR: 0.0005 Step 673/1000, Loss: 0.060910988599061966, LR: 0.0005 Step 674/1000, Loss: 0.055030010640621185, LR: 0.0005 Step 675/1000, Loss: 0.05241537094116211, LR: 0.0005 Step 676/1000, Loss: 0.060969460755586624, LR: 0.0005 Step 677/1000, Loss: 0.058927275240421295, LR: 0.0001 Step 678/1000, Loss: 0.06402787566184998, LR: 0.0001 Step 679/1000, Loss: 0.06407123059034348, LR: 0.0001 Step 680/1000, Loss: 0.05825525522232056, LR: 0.0001 Step 681/1000, Loss: 0.05595303699374199, LR: 0.0001 Step 682/1000, Loss: 0.05010877922177315, LR: 0.0001 Step 683/1000, Loss: 0.04850585758686066, LR: 0.0001 Step 684/1000, Loss: 0.045953407883644104, LR: 0.0001 Step 685/1000, Loss: 0.04297474026679993, LR: 0.0001 Step 686/1000, Loss: 0.03996017202734947, LR: 0.0001 Step 687/1000, Loss: 0.0372818298637867, LR: 0.0001 Step 688/1000, Loss: 0.03529901057481766, LR: 0.0001 Step 689/1000, Loss: 0.03363201767206192, LR: 0.0001 Step 690/1000, Loss: 0.032860495150089264, LR: 0.0001 Step 691/1000, Loss: 0.03063543513417244, LR: 0.0001 Step 692/1000, Loss: 0.029094496741890907, LR: 0.0001 Step 693/1000, Loss: 0.027840223163366318, LR: 0.0001 Step 694/1000, Loss: 
0.028953909873962402, LR: 0.0001 Step 695/1000, Loss: 0.02756868302822113, LR: 0.0001 Step 696/1000, Loss: 0.026567626744508743, LR: 0.0001 Step 697/1000, Loss: 0.02641737461090088, LR: 0.0001 Step 698/1000, Loss: 0.02756710723042488, LR: 0.0001 Step 699/1000, Loss: 0.024215396493673325, LR: 0.0001 Step 700/1000, Loss: 0.02322467789053917, LR: 0.0001 Step 701/1000, Loss: 0.0222491268068552, LR: 0.0001 Step 702/1000, Loss: 0.021404754370450974, LR: 0.0001 Step 703/1000, Loss: 0.020932018756866455, LR: 0.0001 Step 704/1000, Loss: 0.020201925188302994, LR: 0.0001 Step 705/1000, Loss: 0.0204827431589365, LR: 0.0001 Step 706/1000, Loss: 0.019123699516057968, LR: 0.0001 Step 707/1000, Loss: 0.018694665282964706, LR: 0.0001 Step 708/1000, Loss: 0.018212392926216125, LR: 0.0001 Step 709/1000, Loss: 0.01778905838727951, LR: 0.0001 Step 710/1000, Loss: 0.017347224056720734, LR: 0.0001 Step 711/1000, Loss: 0.01690581627190113, LR: 0.0001 Step 712/1000, Loss: 0.0165349543094635, LR: 0.0001 Step 713/1000, Loss: 0.016211170703172684, LR: 0.0001 Step 714/1000, Loss: 0.015871090814471245, LR: 0.0001 Step 715/1000, Loss: 0.015569751150906086, LR: 0.0001 Step 716/1000, Loss: 0.015336255542933941, LR: 0.0001 Step 717/1000, Loss: 0.014996635727584362, LR: 0.0001 Step 718/1000, Loss: 0.015334012918174267, LR: 0.0001 Step 719/1000, Loss: 0.01463706512004137, LR: 0.0001 Step 720/1000, Loss: 0.014448484405875206, LR: 0.0001 Step 721/1000, Loss: 0.014257475733757019, LR: 0.0001 Step 722/1000, Loss: 0.013931864872574806, LR: 0.0001 Step 723/1000, Loss: 0.013719340786337852, LR: 0.0001 Step 724/1000, Loss: 0.013499232940375805, LR: 0.0001 Step 725/1000, Loss: 0.013291612267494202, LR: 0.0001 Step 726/1000, Loss: 0.014798027463257313, LR: 0.0001 Step 727/1000, Loss: 0.01305677555501461, LR: 0.0001 Step 728/1000, Loss: 0.013935339637100697, LR: 0.0001 Step 729/1000, Loss: 0.01494878251105547, LR: 0.0001 Step 730/1000, Loss: 0.013428077101707458, LR: 0.0001 Step 731/1000, Loss: 
0.013509241864085197, LR: 0.0001 Step 732/1000, Loss: 0.013482932932674885, LR: 0.0001 Step 733/1000, Loss: 0.013618290424346924, LR: 0.0001 Step 734/1000, Loss: 0.0130748450756073, LR: 0.0001 Step 735/1000, Loss: 0.01275942288339138, LR: 0.0001 Step 736/1000, Loss: 0.012535681016743183, LR: 0.0001 Step 737/1000, Loss: 0.01289400178939104, LR: 0.0001 Step 738/1000, Loss: 0.012860444374382496, LR: 0.0001 Step 739/1000, Loss: 0.01354834996163845, LR: 0.0001 Step 740/1000, Loss: 0.01242379005998373, LR: 0.0001 Step 741/1000, Loss: 0.011869930662214756, LR: 0.0001 Step 742/1000, Loss: 0.01180819422006607, LR: 0.0001 Step 743/1000, Loss: 0.011539941653609276, LR: 0.0001 Step 744/1000, Loss: 0.01195994671434164, LR: 0.0001 Step 745/1000, Loss: 0.011274044401943684, LR: 0.0001 Step 746/1000, Loss: 0.011409847065806389, LR: 0.0001 Step 747/1000, Loss: 0.011070836335420609, LR: 0.0001 Step 748/1000, Loss: 0.010908852331340313, LR: 0.0001 Step 749/1000, Loss: 0.010839962400496006, LR: 0.0001 Step 750/1000, Loss: 0.010682085528969765, LR: 0.0001 Step 751/1000, Loss: 0.010563052259385586, LR: 0.0001 Step 752/1000, Loss: 0.010426904074847698, LR: 0.0001 Step 753/1000, Loss: 0.010330324992537498, LR: 0.0001 Step 754/1000, Loss: 0.010155769065022469, LR: 0.0001 Step 755/1000, Loss: 0.010123392567038536, LR: 0.0001 Step 756/1000, Loss: 0.010016610845923424, LR: 0.0001 Step 757/1000, Loss: 0.00990261323750019, LR: 0.0001 Step 758/1000, Loss: 0.009779585525393486, LR: 0.0001 Step 759/1000, Loss: 0.009657101705670357, LR: 0.0001 Step 760/1000, Loss: 0.009539510123431683, LR: 0.0001 Step 761/1000, Loss: 0.010143707506358624, LR: 0.0001 Step 762/1000, Loss: 0.009367367252707481, LR: 0.0001 Step 763/1000, Loss: 0.009283630177378654, LR: 0.0001 Step 764/1000, Loss: 0.009189988486468792, LR: 0.0001 Step 765/1000, Loss: 0.009099817834794521, LR: 0.0001 Step 766/1000, Loss: 0.0090244235470891, LR: 0.0001 Step 767/1000, Loss: 0.008941411040723324, LR: 0.0001 Step 768/1000, Loss: 
0.0089210644364357, LR: 0.0001 Step 769/1000, Loss: 0.00884384848177433, LR: 0.0001 Step 770/1000, Loss: 0.008725387044250965, LR: 0.0001 Step 771/1000, Loss: 0.008657841011881828, LR: 0.0001 Step 772/1000, Loss: 0.008593878708779812, LR: 0.0001 Step 773/1000, Loss: 0.008531033992767334, LR: 0.0001 Step 774/1000, Loss: 0.00847670715302229, LR: 0.0001 Step 775/1000, Loss: 0.008415953256189823, LR: 0.0001 Step 776/1000, Loss: 0.008363204076886177, LR: 0.0001 Step 777/1000, Loss: 0.008309995755553246, LR: 0.0001 Step 778/1000, Loss: 0.008257298730313778, LR: 0.0001 Step 779/1000, Loss: 0.008202975615859032, LR: 0.0001 Step 780/1000, Loss: 0.008149301633238792, LR: 0.0001 Step 781/1000, Loss: 0.008095674216747284, LR: 0.0001 Step 782/1000, Loss: 0.008045346476137638, LR: 0.0001 Step 783/1000, Loss: 0.00799850095063448, LR: 0.0001 Step 784/1000, Loss: 0.00794619508087635, LR: 0.0001 Step 785/1000, Loss: 0.007898970507085323, LR: 0.0001 Step 786/1000, Loss: 0.007858267053961754, LR: 0.0001 Step 787/1000, Loss: 0.007810953073203564, LR: 0.0001 Step 788/1000, Loss: 0.007768124341964722, LR: 0.0001 Step 789/1000, Loss: 0.007733012083917856, LR: 0.0001 Step 790/1000, Loss: 0.00769192585721612, LR: 0.0001 Step 791/1000, Loss: 0.007650978863239288, LR: 0.0001 Step 792/1000, Loss: 0.007609694264829159, LR: 0.0001 Step 793/1000, Loss: 0.007569072302430868, LR: 0.0001 Step 794/1000, Loss: 0.007529460825026035, LR: 0.0001 Step 795/1000, Loss: 0.007494403515011072, LR: 0.0001 Step 796/1000, Loss: 0.007457704283297062, LR: 0.0001 Step 797/1000, Loss: 0.0074221668764948845, LR: 0.0001 Step 798/1000, Loss: 0.0073849596083164215, LR: 0.0001 Step 799/1000, Loss: 0.00735092256218195, LR: 0.0001 Step 800/1000, Loss: 0.007318601012229919, LR: 0.0001 Step 801/1000, Loss: 0.007284875959157944, LR: 0.0001 Step 802/1000, Loss: 0.007249526679515839, LR: 0.0001 Step 803/1000, Loss: 0.007216329220682383, LR: 0.0001 Step 804/1000, Loss: 0.007182500325143337, LR: 0.0001 Step 805/1000, Loss: 
0.007149656303226948, LR: 0.0001 Step 806/1000, Loss: 0.007117958273738623, LR: 0.0001 Step 807/1000, Loss: 0.007086853496730328, LR: 0.0001 Step 808/1000, Loss: 0.007057369686663151, LR: 0.0001 Step 809/1000, Loss: 0.00702676922082901, LR: 0.0001 Step 810/1000, Loss: 0.006997970398515463, LR: 0.0001 Step 811/1000, Loss: 0.006961732171475887, LR: 0.0001 Step 812/1000, Loss: 0.006932959891855717, LR: 0.0001 Step 813/1000, Loss: 0.006906145717948675, LR: 0.0001 Step 814/1000, Loss: 0.006877778563648462, LR: 0.0001 Step 815/1000, Loss: 0.006850270088762045, LR: 0.0001 Step 816/1000, Loss: 0.00683822575956583, LR: 0.0001 Step 817/1000, Loss: 0.006811405532062054, LR: 0.0001 Step 818/1000, Loss: 0.006784955505281687, LR: 0.0001 Step 819/1000, Loss: 0.006758543197065592, LR: 0.0001 Step 820/1000, Loss: 0.006731419358402491, LR: 0.0001 Step 821/1000, Loss: 0.006704866886138916, LR: 0.0001 Step 822/1000, Loss: 0.006678593344986439, LR: 0.0001 Step 823/1000, Loss: 0.006652729120105505, LR: 0.0001 Step 824/1000, Loss: 0.0066271573305130005, LR: 0.0001 Step 825/1000, Loss: 0.006602270994335413, LR: 0.0001 Step 826/1000, Loss: 0.006577847059816122, LR: 0.0001 Step 827/1000, Loss: 0.006553393788635731, LR: 0.0001 Step 828/1000, Loss: 0.006529939826577902, LR: 0.0001 Step 829/1000, Loss: 0.0065064216032624245, LR: 0.0001 Step 830/1000, Loss: 0.006483516655862331, LR: 0.0001 Step 831/1000, Loss: 0.006460773292928934, LR: 0.0001 Step 832/1000, Loss: 0.00643813144415617, LR: 0.0001 Step 833/1000, Loss: 0.006415648851543665, LR: 0.0001 Step 834/1000, Loss: 0.006394172552973032, LR: 0.0001 Step 835/1000, Loss: 0.006372210569679737, LR: 0.0001 Step 836/1000, Loss: 0.006349983159452677, LR: 0.0001 Step 837/1000, Loss: 0.006329013500362635, LR: 0.0001 Step 838/1000, Loss: 0.006308399140834808, LR: 0.0001 Step 839/1000, Loss: 0.006287516560405493, LR: 0.0001 Step 840/1000, Loss: 0.006265987642109394, LR: 0.0001 Step 841/1000, Loss: 0.0062498170882463455, LR: 0.0001 Step 842/1000, Loss: 
0.006231148727238178, LR: 0.0001 Step 843/1000, Loss: 0.006210663355886936, LR: 0.0001 Step 844/1000, Loss: 0.0061917076818645, LR: 0.0001 Step 845/1000, Loss: 0.006172120105475187, LR: 0.0001 Step 846/1000, Loss: 0.006152748595923185, LR: 0.0001 Step 847/1000, Loss: 0.006133679766207933, LR: 0.0001 Step 848/1000, Loss: 0.0061145988292992115, LR: 0.0001 Step 849/1000, Loss: 0.006095671094954014, LR: 0.0001 Step 850/1000, Loss: 0.006076617632061243, LR: 0.0001 Step 851/1000, Loss: 0.006057293154299259, LR: 0.0001 Step 852/1000, Loss: 0.006044856738299131, LR: 0.0001 Step 853/1000, Loss: 0.006020909175276756, LR: 0.0001 Step 854/1000, Loss: 0.00600203825160861, LR: 0.0001 Step 855/1000, Loss: 0.00598194170743227, LR: 0.0001 Step 856/1000, Loss: 0.005963760893791914, LR: 0.0001 Step 857/1000, Loss: 0.005946979857981205, LR: 0.0001 Step 858/1000, Loss: 0.005930349230766296, LR: 0.0001 Step 859/1000, Loss: 0.0059133851900696754, LR: 0.0001 Step 860/1000, Loss: 0.005896122194826603, LR: 0.0001 Step 861/1000, Loss: 0.005879182368516922, LR: 0.0001 Step 862/1000, Loss: 0.005862336605787277, LR: 0.0001 Step 863/1000, Loss: 0.0058457693085074425, LR: 0.0001 Step 864/1000, Loss: 0.005828892812132835, LR: 0.0001 Step 865/1000, Loss: 0.005812444724142551, LR: 0.0001 Step 866/1000, Loss: 0.005797059275209904, LR: 0.0001 Step 867/1000, Loss: 0.005875434260815382, LR: 0.0001 Step 868/1000, Loss: 0.005909770727157593, LR: 0.0001 Step 869/1000, Loss: 0.005759359337389469, LR: 0.0001 Step 870/1000, Loss: 0.0057429363951087, LR: 0.0001 Step 871/1000, Loss: 0.005729073658585548, LR: 0.0001 Step 872/1000, Loss: 0.00571393733844161, LR: 0.0001 Step 873/1000, Loss: 0.0056982324458658695, LR: 0.0001 Step 874/1000, Loss: 0.005684015341103077, LR: 0.0001 Step 875/1000, Loss: 0.005669407546520233, LR: 0.0001 Step 876/1000, Loss: 0.00565493106842041, LR: 0.0001 Step 877/1000, Loss: 0.005640487652271986, LR: 0.0001 Step 878/1000, Loss: 0.005625112913548946, LR: 0.0001 Step 879/1000, Loss: 
0.0056097134947776794, LR: 0.0001 Step 880/1000, Loss: 0.005594415124505758, LR: 0.0001 Step 881/1000, Loss: 0.005579091142863035, LR: 0.0001 Step 882/1000, Loss: 0.005563954822719097, LR: 0.0001 Step 883/1000, Loss: 0.005548883695155382, LR: 0.0001 Step 884/1000, Loss: 0.005533585324883461, LR: 0.0001 Step 885/1000, Loss: 0.0055184317752718925, LR: 0.0001 Step 886/1000, Loss: 0.005526943132281303, LR: 0.0001 Step 887/1000, Loss: 0.0055147456005215645, LR: 0.0001 Step 888/1000, Loss: 0.00550167728215456, LR: 0.0001 Step 889/1000, Loss: 0.005514197982847691, LR: 0.0001 Step 890/1000, Loss: 0.005448635667562485, LR: 0.0001 Step 891/1000, Loss: 0.005436110310256481, LR: 0.0001 Step 892/1000, Loss: 0.005424098111689091, LR: 0.0001 Step 893/1000, Loss: 0.005414609797298908, LR: 0.0001 Step 894/1000, Loss: 0.0053999098017811775, LR: 0.0001 Step 895/1000, Loss: 0.005386935546994209, LR: 0.0001 Step 896/1000, Loss: 0.005373725667595863, LR: 0.0001 Step 897/1000, Loss: 0.0053641302511096, LR: 0.0001 Step 898/1000, Loss: 0.005347576458007097, LR: 0.0001 Step 899/1000, Loss: 0.00533481827005744, LR: 0.0001 Step 900/1000, Loss: 0.005318287294358015, LR: 0.0001 Step 901/1000, Loss: 0.005304806865751743, LR: 0.0001 Step 902/1000, Loss: 0.00529113644734025, LR: 0.0001 Step 903/1000, Loss: 0.005278221797198057, LR: 0.0001 Step 904/1000, Loss: 0.005264527164399624, LR: 0.0001 Step 905/1000, Loss: 0.005250594578683376, LR: 0.0001 Step 906/1000, Loss: 0.005237361881881952, LR: 0.0001 Step 907/1000, Loss: 0.005224616266787052, LR: 0.0001 Step 908/1000, Loss: 0.00521545996889472, LR: 0.0001 Step 909/1000, Loss: 0.005209008231759071, LR: 0.0001 Step 910/1000, Loss: 0.005186073947697878, LR: 0.0001 Step 911/1000, Loss: 0.005171170458197594, LR: 0.0001 Step 912/1000, Loss: 0.005156806204468012, LR: 0.0001 Step 913/1000, Loss: 0.005143319256603718, LR: 0.0001 Step 914/1000, Loss: 0.00513050053268671, LR: 0.0001 Step 915/1000, Loss: 0.00511697493493557, LR: 0.0001 Step 916/1000, Loss: 
0.0051078288815915585, LR: 0.0001 Step 917/1000, Loss: 0.005091926082968712, LR: 0.0001 Step 918/1000, Loss: 0.0050815800204873085, LR: 0.0001 Step 919/1000, Loss: 0.005070338025689125, LR: 0.0001 Step 920/1000, Loss: 0.005053011700510979, LR: 0.0001 Step 921/1000, Loss: 0.0050390674732625484, LR: 0.0001 Step 922/1000, Loss: 0.005026121623814106, LR: 0.0001 Step 923/1000, Loss: 0.005011658184230328, LR: 0.0001 Step 924/1000, Loss: 0.005003025289624929, LR: 0.0001 Step 925/1000, Loss: 0.004990539513528347, LR: 0.0001 Step 926/1000, Loss: 0.004974867217242718, LR: 0.0001 Step 927/1000, Loss: 0.00496390275657177, LR: 0.0001 Step 928/1000, Loss: 0.0049548642709851265, LR: 0.0001 Step 929/1000, Loss: 0.004941609688103199, LR: 0.0001 Step 930/1000, Loss: 0.004928720183670521, LR: 0.0001 Step 931/1000, Loss: 0.004916149191558361, LR: 0.0001 Step 932/1000, Loss: 0.004906709771603346, LR: 0.0001 Step 933/1000, Loss: 0.004893041215837002, LR: 0.0001 Step 934/1000, Loss: 0.004879443906247616, LR: 0.0001 Step 935/1000, Loss: 0.00487054418772459, LR: 0.0001 Step 936/1000, Loss: 0.004848017822951078, LR: 0.0001 Step 937/1000, Loss: 0.004834645893424749, LR: 0.0001 Step 938/1000, Loss: 0.004822378512471914, LR: 0.0001 Step 939/1000, Loss: 0.004809106700122356, LR: 0.0001 Step 940/1000, Loss: 0.0047960104420781136, LR: 0.0001 Step 941/1000, Loss: 0.004782628268003464, LR: 0.0001 Step 942/1000, Loss: 0.004768282175064087, LR: 0.0001 Step 943/1000, Loss: 0.004754784516990185, LR: 0.0001 Step 944/1000, Loss: 0.004741511773318052, LR: 0.0001 Step 945/1000, Loss: 0.004728074185550213, LR: 0.0001 Step 946/1000, Loss: 0.0047147138975560665, LR: 0.0001 Step 947/1000, Loss: 0.004701375495642424, LR: 0.0001 Step 948/1000, Loss: 0.004688064102083445, LR: 0.0001 Step 949/1000, Loss: 0.0046744076535105705, LR: 0.0001 Step 950/1000, Loss: 0.0046604531817138195, LR: 0.0001 Step 951/1000, Loss: 0.004647126886993647, LR: 0.0001 Step 952/1000, Loss: 0.004633200354874134, LR: 0.0001 Step 953/1000, 
Loss: 0.004619373474270105, LR: 0.0001 Step 954/1000, Loss: 0.004605497233569622, LR: 0.0001 Step 955/1000, Loss: 0.00459182308986783, LR: 0.0001 Step 956/1000, Loss: 0.004577638581395149, LR: 0.0001 Step 957/1000, Loss: 0.004563467111438513, LR: 0.0001 Step 958/1000, Loss: 0.004549335688352585, LR: 0.0001 Step 959/1000, Loss: 0.004535208456218243, LR: 0.0001 Step 960/1000, Loss: 0.004521423019468784, LR: 0.0001 Step 961/1000, Loss: 0.004507188685238361, LR: 0.0001 Step 962/1000, Loss: 0.00449378602206707, LR: 0.0001 Step 963/1000, Loss: 0.004479792434722185, LR: 0.0001 Step 964/1000, Loss: 0.00446388591080904, LR: 0.0001 Step 965/1000, Loss: 0.004451134707778692, LR: 0.0001 Step 966/1000, Loss: 0.004437240771949291, LR: 0.0001 Step 967/1000, Loss: 0.0044227722100913525, LR: 0.0001 Step 968/1000, Loss: 0.004408222157508135, LR: 0.0001 Step 969/1000, Loss: 0.004393703304231167, LR: 0.0001 Step 970/1000, Loss: 0.004378416109830141, LR: 0.0001 Step 971/1000, Loss: 0.004363610874861479, LR: 0.0001 Step 972/1000, Loss: 0.004348609130829573, LR: 0.0001 Step 973/1000, Loss: 0.004333472810685635, LR: 0.0001 Step 974/1000, Loss: 0.004318246617913246, LR: 0.0001 Step 975/1000, Loss: 0.004302807617932558, LR: 0.0001 Step 976/1000, Loss: 0.004287301562726498, LR: 0.0001 Step 977/1000, Loss: 0.0042717475444078445, LR: 0.0001 Step 978/1000, Loss: 0.00425606919452548, LR: 0.0001 Step 979/1000, Loss: 0.004240184091031551, LR: 0.0001 Step 980/1000, Loss: 0.004223822616040707, LR: 0.0001 Step 981/1000, Loss: 0.004206301178783178, LR: 0.0001 Step 982/1000, Loss: 0.004190454259514809, LR: 0.0001 Step 983/1000, Loss: 0.004175176378339529, LR: 0.0001 Step 984/1000, Loss: 0.004159011412411928, LR: 0.0001 Step 985/1000, Loss: 0.004143113736063242, LR: 0.0001 Step 986/1000, Loss: 0.0041270749643445015, LR: 0.0001 Step 987/1000, Loss: 0.0041109356097877026, LR: 0.0001 Step 988/1000, Loss: 0.004095254000276327, LR: 0.0001 Step 989/1000, Loss: 0.004079075995832682, LR: 0.0001 Step 990/1000, 
Loss: 0.004062770865857601, LR: 0.0001 Step 991/1000, Loss: 0.00404686713591218, LR: 0.0001 Step 992/1000, Loss: 0.004031156189739704, LR: 0.0001 Step 993/1000, Loss: 0.004015577025711536, LR: 0.0001 Step 994/1000, Loss: 0.004000104498118162, LR: 0.0001 Step 995/1000, Loss: 0.003984970506280661, LR: 0.0001 Step 996/1000, Loss: 0.003970731981098652, LR: 0.0001 Step 997/1000, Loss: 0.0039593251422047615, LR: 0.0001 Step 998/1000, Loss: 0.0039571551606059074, LR: 0.0001 Step 999/1000, Loss: 0.00398876890540123, LR: 0.0001 Step 1000/1000, Loss: 0.004170993808656931, LR: 0.0001
plt.plot(losses)
[<matplotlib.lines.Line2D at 0x7fa1b9abb110>]
To perform inference, we autoregressively feed tokens through the transformer, appending each selected output token back onto the input. We can test this on one of our training examples: since the model has been deliberately overfit to the data, it should reproduce the training example token for token.
def inference(prompt, max_new_tokens):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        num_tokens = len(tokens)
        # Pad out to the context window; with causal attention the padding
        # cannot influence the positions before it.
        tokens_padded = tokens + [tokenizer.eot_token] * (config.seq_len - num_tokens)
        tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
        logits = model(tokens_padded)
        # Greedily take the most likely token at the last real position.
        predicted_token = torch.argmax(logits[0, num_tokens - 1, :]).item()
        tokens.append(predicted_token)
    return tokenizer.decode(tokens)
print("Original: ", tokenizer.decode(train_inputs[2].tolist())[:90])
print("Predicted:", inference(" director Takeshi Ozawa . A large team of writers handled the script", max_new_tokens=6))
Original:  director Takeshi Ozawa . A large team of writers handled the script . The game 's opening
Predicted: director Takeshi Ozawa . A large team of writers handled the script . The game 's opening
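The `inference` function above decodes greedily with `torch.argmax`, always picking the single most likely token; that is exactly what we want when checking that an overfit model reproduces its training data. For generating varied text, a common alternative is temperature sampling. Below is a minimal pure-Python sketch of the idea; the `sample_with_temperature` helper and its arguments are our own illustration, not part of the model code above:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=None):
    """Sample a token index from raw logits after temperature scaling."""
    # Temperature < 1 sharpens the distribution (closer to argmax);
    # temperature > 1 flattens it (more diverse samples).
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# With a very low temperature, sampling behaves like argmax.
sample_with_temperature([2.0, 0.5, 0.1], temperature=0.05)
```

In the inference loop, this would replace the `torch.argmax` line, with the logits slice converted to a list first.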
Using tiktoken and a small dataset, we were able to overfit the model and run inference on training examples. However, to train an LLM that can do anything useful, we need a much larger dataset, one that won't fit in memory, along with an efficient way to tokenize it and load it into pytorch tensors.
Hugging Face's datasets library makes this process very easy.
# Load dataset in streaming mode
ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train", streaming=True)
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def check_dataset_exists():
    try:
        # Attempt to load the locally saved tokenized parquet files
        load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
        load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False
if not check_dataset_exists():
    print("Tokenized dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, train_rows=100_000, test_rows=500):
        """
        Tokenizes and chunks the dataset into fixed-length chunk_size-token segments.
        The 'target' sequence is the input shifted left by 1 token.
        Stops after generating `train_rows + test_rows` tokenized chunks.
        """
        buffer = []  # Rolling buffer of tokens spanning article boundaries
        row_count = 0
        for example in dataset:
            tokens = tokenizer(example["article"], truncation=False, padding=False)['input_ids']
            buffer.extend(tokens)
            # Yield full chunks until we reach train_rows + test_rows
            while len(buffer) >= chunk_size + 1:  # +1 so the target can be shifted
                if row_count >= (train_rows + test_rows):
                    return  # Stop yielding once enough rows are generated
                # Create input-target pairs
                input_chunk = buffer[:chunk_size]        # First chunk_size tokens
                target_chunk = buffer[1:chunk_size + 1]  # Shifted by 1 token
                # Assign to train or test split
                split = "train" if row_count < train_rows else "test"
                yield {
                    "split": split,
                    "input": input_chunk,
                    "target": target_chunk
                }
                buffer = buffer[chunk_size:]  # Remove used tokens
                row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 1_400_000  # Adjust as needed
    TEST_ROWS = 500         # Adjust as needed
    CHUNK_SIZE = 128

    # Convert generator to a Hugging Face Dataset
    tokenized_ds = Dataset.from_generator(lambda: tokenize_and_chunk(ds, hf_tokenizer, chunk_size=CHUNK_SIZE, train_rows=TRAIN_ROWS, test_rows=TEST_ROWS))

    # Split the dataset into `train` and `test`
    dataset_splits = tokenized_ds.train_test_split(test_size=TEST_ROWS / (TRAIN_ROWS + TEST_ROWS), seed=42)

    # Save to disk
    dataset_splits["train"].to_parquet("cnn_dailymail_train.parquet")
    dataset_splits["test"].to_parquet("cnn_dailymail_test.parquet")
    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows.")
else:
    print("Tokenized dataset already exists locally.")
Tokenized dataset does not exist locally... Generating and saving to disk.
Token indices sequence length is longer than the specified maximum sequence length for this model (1194 > 1024). Running this sequence through the model will result in indexing errors
✅ Saved 1400000 train rows and 500 test rows.
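The input/target layout produced by `tokenize_and_chunk` can be sanity-checked with a toy token stream. Below is a minimal pure-Python sketch of the same chunking logic, independent of the parquet files (the `chunk_pairs` helper is our own, for illustration only):

```python
def chunk_pairs(token_ids, chunk_size):
    """Split a token stream into (input, target) pairs, target shifted by one."""
    pairs = []
    buffer = list(token_ids)
    # Mirror the generator above: we need chunk_size + 1 tokens to build one pair.
    while len(buffer) >= chunk_size + 1:
        input_chunk = buffer[:chunk_size]
        target_chunk = buffer[1:chunk_size + 1]  # shifted left by one token
        pairs.append((input_chunk, target_chunk))
        buffer = buffer[chunk_size:]
    return pairs

pairs = chunk_pairs(range(10), chunk_size=4)
# First pair: input [0, 1, 2, 3], target [1, 2, 3, 4]
```

Each target token is the ground-truth "next token" for the input position directly beneath it, which is exactly what the cross-entropy loss in the training loop compares against.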
We have tokenized the dataset in chunks and saved it to disk as parquet files. This is a scalable approach that lets us train the model without ever holding the entire dataset in memory. Let's build a more robust training loop that saves off the model at various checkpoints.
# Example config:
batch_size = 64
sequence_len = 128
num_steps = 150000
accumulation_steps = 100
# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")
# Convert dataset to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])
# Create DataLoaders for training and testing
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
config = GPTConfig(
    vocab_size=hf_tokenizer.vocab_size,
    n_layer=8,  # fewer layers for a quick demo
    n_head=8,
    n_embd=128,
    seq_len=sequence_len,
)
# Create the GPT model and move it to the training device
model = GPTModel(config).to(device)
# Check if a pre-trained model already exists on disk
use_existing_model = os.path.exists("./pretrain_final.pth")
if use_existing_model:
    model = torch.load("./pretrain_final.pth")
    print("Loaded pre-trained model from ./pretrain_final.pth, skipping training loop.")
else:
    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # Define the scheduler: reduce the learning rate when the test loss plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.3, patience=10, min_lr=5e-6, threshold=1e-4)
    # Training loop
    losses = []
    test_losses = []
    accumulator = 0
    accumulator_loss = 0
    start_time = time.time()
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)
        logits = model(train_input)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # Update weights (gradients are applied every step; `accumulation_steps`
        # only controls how often we log and evaluate)
        optimizer.step()
        optimizer.zero_grad()
        accumulator += 1
        accumulator_loss += loss.item()
        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0
            # Evaluate on the held-out test set
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
            test_losses.append(test_loss / test_accumulator)
            elapsed_time = time.time() - start_time
            print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}, LR: {optimizer.param_groups[0]['lr']}, Elapsed Time: {elapsed_time:.2f} seconds")
            scheduler.step(test_losses[-1])
        if (i+1) % 50000 == 0:
            # Save a model checkpoint
            print(f"Saving model checkpoint at step {i+1}")
            torch.save(model, f"./model_checkpoint_{i+1}.pt")
Step 100/150000, Loss: 8.435567221641541, Test Loss: 7.553930699825287, LR: 0.0005, Elapsed Time: 10.30 seconds
Step 1000/150000, Loss: 7.381043090820312, Test Loss: 7.369001686573029, LR: 0.0005, Elapsed Time: 103.01 seconds
Step 5000/150000, Loss: 7.071612229347229, Test Loss: 7.077954411506653, LR: 0.0005, Elapsed Time: 520.12 seconds
Step 10000/150000, Loss: 6.79480315208435, Test Loss: 6.7904172539711, LR: 0.0005, Elapsed Time: 1046.81 seconds
Step 15000/150000, Loss: 6.569562268257141, Test Loss: 6.557141661643982, LR: 0.0005, Elapsed Time: 1575.60 seconds
Step 20000/150000, Loss: 6.402195482254029, Test Loss: 6.379953682422638, LR: 0.0005, Elapsed Time: 2105.93 seconds
Step 25000/150000, Loss: 6.25300696849823, Test Loss: 6.2479347586631775, LR: 0.0005, Elapsed Time: 2631.87 seconds
Step 27700/150000, Loss: 6.168271150588989, Test Loss: 6.181223809719086, LR: 0.0005, Elapsed Time: 2914.18 seconds
(intermediate log lines omitted; training output truncated)
Loss: 6.190137052536011, Test Loss: 6.178031742572784, LR: 0.0005, Elapsed Time: 2924.62 seconds Step 27900/150000, Loss: 6.1977942752838135, Test Loss: 6.1742554903030396, LR: 0.0005, Elapsed Time: 2935.06 seconds Step 28000/150000, Loss: 6.170997138023377, Test Loss: 6.172465562820435, LR: 0.0005, Elapsed Time: 2945.51 seconds Step 28100/150000, Loss: 6.182390213012695, Test Loss: 6.173965871334076, LR: 0.0005, Elapsed Time: 2956.00 seconds Step 28200/150000, Loss: 6.176844487190246, Test Loss: 6.17498505115509, LR: 0.0005, Elapsed Time: 2966.42 seconds Step 28300/150000, Loss: 6.168439087867736, Test Loss: 6.165223658084869, LR: 0.0005, Elapsed Time: 2976.95 seconds Step 28400/150000, Loss: 6.171880202293396, Test Loss: 6.170812368392944, LR: 0.0005, Elapsed Time: 2987.50 seconds Step 28500/150000, Loss: 6.174358925819397, Test Loss: 6.161598026752472, LR: 0.0005, Elapsed Time: 2998.14 seconds Step 28600/150000, Loss: 6.157126908302307, Test Loss: 6.161474049091339, LR: 0.0005, Elapsed Time: 3008.74 seconds Step 28700/150000, Loss: 6.162721390724182, Test Loss: 6.163299143314362, LR: 0.0005, Elapsed Time: 3019.36 seconds Step 28800/150000, Loss: 6.158284425735474, Test Loss: 6.158563911914825, LR: 0.0005, Elapsed Time: 3029.87 seconds Step 28900/150000, Loss: 6.151708302497863, Test Loss: 6.163728952407837, LR: 0.0005, Elapsed Time: 3040.34 seconds Step 29000/150000, Loss: 6.149479413032532, Test Loss: 6.156359791755676, LR: 0.0005, Elapsed Time: 3050.78 seconds Step 29100/150000, Loss: 6.150423216819763, Test Loss: 6.153329014778137, LR: 0.0005, Elapsed Time: 3061.25 seconds Step 29200/150000, Loss: 6.145188775062561, Test Loss: 6.155669152736664, LR: 0.0005, Elapsed Time: 3071.68 seconds Step 29300/150000, Loss: 6.16031277179718, Test Loss: 6.1513184905052185, LR: 0.0005, Elapsed Time: 3082.16 seconds Step 29400/150000, Loss: 6.145023317337036, Test Loss: 6.144165277481079, LR: 0.0005, Elapsed Time: 3092.70 seconds Step 29500/150000, Loss: 6.140694913864135, 
Test Loss: 6.141407072544098, LR: 0.0005, Elapsed Time: 3103.17 seconds Step 29600/150000, Loss: 6.154821062088013, Test Loss: 6.142060399055481, LR: 0.0005, Elapsed Time: 3113.60 seconds Step 29700/150000, Loss: 6.133518748283386, Test Loss: 6.136374115943909, LR: 0.0005, Elapsed Time: 3124.05 seconds Step 29800/150000, Loss: 6.151553411483764, Test Loss: 6.137484014034271, LR: 0.0005, Elapsed Time: 3134.49 seconds Step 29900/150000, Loss: 6.131563773155213, Test Loss: 6.1362468004226685, LR: 0.0005, Elapsed Time: 3144.91 seconds Step 30000/150000, Loss: 6.137345452308654, Test Loss: 6.130531370639801, LR: 0.0005, Elapsed Time: 3155.27 seconds Step 30100/150000, Loss: 6.135455660820007, Test Loss: 6.135493695735931, LR: 0.0005, Elapsed Time: 3165.66 seconds Step 30200/150000, Loss: 6.133080949783325, Test Loss: 6.1243502497673035, LR: 0.0005, Elapsed Time: 3176.09 seconds Step 30300/150000, Loss: 6.135017194747925, Test Loss: 6.124425292015076, LR: 0.0005, Elapsed Time: 3186.54 seconds Step 30400/150000, Loss: 6.133797154426575, Test Loss: 6.12871378660202, LR: 0.0005, Elapsed Time: 3196.91 seconds Step 30500/150000, Loss: 6.12192367553711, Test Loss: 6.129230976104736, LR: 0.0005, Elapsed Time: 3207.38 seconds Step 30600/150000, Loss: 6.119800090789795, Test Loss: 6.117292582988739, LR: 0.0005, Elapsed Time: 3217.82 seconds Step 30700/150000, Loss: 6.1218158149719235, Test Loss: 6.116804301738739, LR: 0.0005, Elapsed Time: 3228.26 seconds Step 30800/150000, Loss: 6.123212518692017, Test Loss: 6.109973728656769, LR: 0.0005, Elapsed Time: 3238.70 seconds Step 30900/150000, Loss: 6.1166763639450075, Test Loss: 6.11478978395462, LR: 0.0005, Elapsed Time: 3249.16 seconds Step 31000/150000, Loss: 6.111529121398926, Test Loss: 6.111300349235535, LR: 0.0005, Elapsed Time: 3259.59 seconds Step 31100/150000, Loss: 6.117410712242126, Test Loss: 6.109693169593811, LR: 0.0005, Elapsed Time: 3270.03 seconds Step 31200/150000, Loss: 6.108611512184143, Test Loss: 
6.105596363544464, LR: 0.0005, Elapsed Time: 3280.49 seconds Step 31300/150000, Loss: 6.100753235816955, Test Loss: 6.107708156108856, LR: 0.0005, Elapsed Time: 3291.00 seconds Step 31400/150000, Loss: 6.10855836391449, Test Loss: 6.102022230625153, LR: 0.0005, Elapsed Time: 3301.49 seconds Step 31500/150000, Loss: 6.112256917953491, Test Loss: 6.103033483028412, LR: 0.0005, Elapsed Time: 3311.93 seconds Step 31600/150000, Loss: 6.109223289489746, Test Loss: 6.1005332469940186, LR: 0.0005, Elapsed Time: 3322.37 seconds Step 31700/150000, Loss: 6.090992574691772, Test Loss: 6.1010043025016785, LR: 0.0005, Elapsed Time: 3332.80 seconds Step 31800/150000, Loss: 6.105441083908081, Test Loss: 6.09434449672699, LR: 0.0005, Elapsed Time: 3343.24 seconds Step 31900/150000, Loss: 6.093516616821289, Test Loss: 6.097595751285553, LR: 0.0005, Elapsed Time: 3353.67 seconds Step 32000/150000, Loss: 6.109400243759155, Test Loss: 6.087871611118317, LR: 0.0005, Elapsed Time: 3364.08 seconds Step 32100/150000, Loss: 6.093338103294372, Test Loss: 6.0981621742248535, LR: 0.0005, Elapsed Time: 3374.52 seconds Step 32200/150000, Loss: 6.096166443824768, Test Loss: 6.088769733905792, LR: 0.0005, Elapsed Time: 3384.92 seconds Step 32300/150000, Loss: 6.102646465301514, Test Loss: 6.087979018688202, LR: 0.0005, Elapsed Time: 3395.40 seconds Step 32400/150000, Loss: 6.09002049446106, Test Loss: 6.082152545452118, LR: 0.0005, Elapsed Time: 3405.82 seconds Step 32500/150000, Loss: 6.093568787574768, Test Loss: 6.085061728954315, LR: 0.0005, Elapsed Time: 3416.24 seconds Step 32600/150000, Loss: 6.088463568687439, Test Loss: 6.080061972141266, LR: 0.0005, Elapsed Time: 3426.73 seconds Step 32700/150000, Loss: 6.09066370010376, Test Loss: 6.080202102661133, LR: 0.0005, Elapsed Time: 3437.21 seconds Step 32800/150000, Loss: 6.091840500831604, Test Loss: 6.074113607406616, LR: 0.0005, Elapsed Time: 3447.64 seconds Step 32900/150000, Loss: 6.078388471603393, Test Loss: 6.075653076171875, LR: 
0.0005, Elapsed Time: 3458.03 seconds Step 33000/150000, Loss: 6.070182995796204, Test Loss: 6.072208881378174, LR: 0.0005, Elapsed Time: 3468.46 seconds Step 33100/150000, Loss: 6.0636973524093625, Test Loss: 6.067295849323273, LR: 0.0005, Elapsed Time: 3478.94 seconds Step 33200/150000, Loss: 6.081500978469848, Test Loss: 6.063136339187622, LR: 0.0005, Elapsed Time: 3489.34 seconds Step 33300/150000, Loss: 6.069124450683594, Test Loss: 6.064612507820129, LR: 0.0005, Elapsed Time: 3499.83 seconds Step 33400/150000, Loss: 6.076454834938049, Test Loss: 6.060621738433838, LR: 0.0005, Elapsed Time: 3510.28 seconds Step 33500/150000, Loss: 6.068816485404969, Test Loss: 6.058933198451996, LR: 0.0005, Elapsed Time: 3520.70 seconds Step 33600/150000, Loss: 6.058795394897461, Test Loss: 6.067911863327026, LR: 0.0005, Elapsed Time: 3531.08 seconds Step 33700/150000, Loss: 6.059148707389832, Test Loss: 6.055261969566345, LR: 0.0005, Elapsed Time: 3541.53 seconds Step 33800/150000, Loss: 6.0572194528579715, Test Loss: 6.053225934505463, LR: 0.0005, Elapsed Time: 3552.02 seconds Step 33900/150000, Loss: 6.067165002822876, Test Loss: 6.051542341709137, LR: 0.0005, Elapsed Time: 3562.41 seconds Step 34000/150000, Loss: 6.064450054168701, Test Loss: 6.055772602558136, LR: 0.0005, Elapsed Time: 3572.83 seconds Step 34100/150000, Loss: 6.0601071882247926, Test Loss: 6.051725506782532, LR: 0.0005, Elapsed Time: 3583.27 seconds Step 34200/150000, Loss: 6.060403504371643, Test Loss: 6.046834647655487, LR: 0.0005, Elapsed Time: 3593.69 seconds Step 34300/150000, Loss: 6.04362729549408, Test Loss: 6.044512093067169, LR: 0.0005, Elapsed Time: 3604.16 seconds Step 34400/150000, Loss: 6.052924880981445, Test Loss: 6.04515153169632, LR: 0.0005, Elapsed Time: 3614.63 seconds Step 34500/150000, Loss: 6.053833904266358, Test Loss: 6.046478092670441, LR: 0.0005, Elapsed Time: 3625.14 seconds Step 34600/150000, Loss: 6.043240647315979, Test Loss: 6.038540065288544, LR: 0.0005, Elapsed Time: 
3635.56 seconds Step 34700/150000, Loss: 6.044962048530579, Test Loss: 6.041688144207001, LR: 0.0005, Elapsed Time: 3645.97 seconds Step 34800/150000, Loss: 6.051593914031982, Test Loss: 6.036306083202362, LR: 0.0005, Elapsed Time: 3656.43 seconds Step 34900/150000, Loss: 6.035616126060486, Test Loss: 6.0310288071632385, LR: 0.0005, Elapsed Time: 3666.81 seconds Step 35000/150000, Loss: 6.026905360221863, Test Loss: 6.030189037322998, LR: 0.0005, Elapsed Time: 3677.23 seconds Step 35100/150000, Loss: 6.042121119499207, Test Loss: 6.028798162937164, LR: 0.0005, Elapsed Time: 3687.69 seconds Step 35200/150000, Loss: 6.039070744514465, Test Loss: 6.024654865264893, LR: 0.0005, Elapsed Time: 3698.12 seconds Step 35300/150000, Loss: 6.0257692050933835, Test Loss: 6.026781260967255, LR: 0.0005, Elapsed Time: 3708.55 seconds Step 35400/150000, Loss: 6.030350117683411, Test Loss: 6.01838618516922, LR: 0.0005, Elapsed Time: 3719.04 seconds Step 35500/150000, Loss: 6.019498686790467, Test Loss: 6.0221928358078, LR: 0.0005, Elapsed Time: 3729.55 seconds Step 35600/150000, Loss: 6.022944941520691, Test Loss: 6.017552554607391, LR: 0.0005, Elapsed Time: 3739.99 seconds Step 35700/150000, Loss: 6.0129370641708375, Test Loss: 6.012584745883942, LR: 0.0005, Elapsed Time: 3750.37 seconds Step 35800/150000, Loss: 6.017200932502747, Test Loss: 6.014536023139954, LR: 0.0005, Elapsed Time: 3760.82 seconds Step 35900/150000, Loss: 6.02465003490448, Test Loss: 6.021388590335846, LR: 0.0005, Elapsed Time: 3771.25 seconds Step 36000/150000, Loss: 6.01933720111847, Test Loss: 6.009196877479553, LR: 0.0005, Elapsed Time: 3781.64 seconds Step 36100/150000, Loss: 6.012976698875427, Test Loss: 6.01215660572052, LR: 0.0005, Elapsed Time: 3792.11 seconds Step 36200/150000, Loss: 6.019264578819275, Test Loss: 6.008389890193939, LR: 0.0005, Elapsed Time: 3802.55 seconds Step 36300/150000, Loss: 6.014468836784363, Test Loss: 6.007692992687225, LR: 0.0005, Elapsed Time: 3812.99 seconds Step 
36400/150000, Loss: 6.009807748794556, Test Loss: 5.997376441955566, LR: 0.0005, Elapsed Time: 3823.48 seconds Step 36500/150000, Loss: 6.00594925403595, Test Loss: 6.002587914466858, LR: 0.0005, Elapsed Time: 3833.93 seconds Step 36600/150000, Loss: 5.995894179344178, Test Loss: 5.999123573303223, LR: 0.0005, Elapsed Time: 3844.38 seconds Step 36700/150000, Loss: 5.990535726547241, Test Loss: 6.002555668354034, LR: 0.0005, Elapsed Time: 3854.84 seconds Step 36800/150000, Loss: 6.013510704040527, Test Loss: 5.997834742069244, LR: 0.0005, Elapsed Time: 3865.27 seconds Step 36900/150000, Loss: 5.999830737113952, Test Loss: 5.991721570491791, LR: 0.0005, Elapsed Time: 3875.67 seconds Step 37000/150000, Loss: 6.005103611946106, Test Loss: 5.991996765136719, LR: 0.0005, Elapsed Time: 3886.08 seconds Step 37100/150000, Loss: 5.990902214050293, Test Loss: 5.989355683326721, LR: 0.0005, Elapsed Time: 3896.50 seconds Step 37200/150000, Loss: 5.979286127090454, Test Loss: 5.9909480810165405, LR: 0.0005, Elapsed Time: 3907.01 seconds Step 37300/150000, Loss: 5.990879378318787, Test Loss: 5.986709415912628, LR: 0.0005, Elapsed Time: 3917.48 seconds Step 37400/150000, Loss: 6.000451536178589, Test Loss: 5.979166567325592, LR: 0.0005, Elapsed Time: 3927.87 seconds Step 37500/150000, Loss: 5.976320447921753, Test Loss: 5.976952910423279, LR: 0.0005, Elapsed Time: 3938.41 seconds Step 37600/150000, Loss: 5.997237071990967, Test Loss: 5.978242993354797, LR: 0.0005, Elapsed Time: 3948.89 seconds Step 37700/150000, Loss: 5.986970491409302, Test Loss: 5.982095420360565, LR: 0.0005, Elapsed Time: 3959.35 seconds Step 37800/150000, Loss: 5.978527455329895, Test Loss: 5.977895975112915, LR: 0.0005, Elapsed Time: 3969.85 seconds Step 37900/150000, Loss: 5.96548282623291, Test Loss: 5.975106358528137, LR: 0.0005, Elapsed Time: 3980.31 seconds Step 38000/150000, Loss: 5.985452189445495, Test Loss: 5.977665364742279, LR: 0.0005, Elapsed Time: 3990.79 seconds Step 38100/150000, Loss: 
5.969007239341736, Test Loss: 5.975178778171539, LR: 0.0005, Elapsed Time: 4001.19 seconds Step 38200/150000, Loss: 5.979971027374267, Test Loss: 5.9702741503715515, LR: 0.0005, Elapsed Time: 4011.64 seconds Step 38300/150000, Loss: 5.974960370063782, Test Loss: 5.968958735466003, LR: 0.0005, Elapsed Time: 4022.10 seconds Step 38400/150000, Loss: 5.968921222686768, Test Loss: 5.965389609336853, LR: 0.0005, Elapsed Time: 4032.57 seconds Step 38500/150000, Loss: 5.982145557403564, Test Loss: 5.963244736194611, LR: 0.0005, Elapsed Time: 4043.03 seconds Step 38600/150000, Loss: 5.968026962280273, Test Loss: 5.964480221271515, LR: 0.0005, Elapsed Time: 4053.51 seconds Step 38700/150000, Loss: 5.972889919281005, Test Loss: 5.956547379493713, LR: 0.0005, Elapsed Time: 4063.97 seconds Step 38800/150000, Loss: 5.960025534629822, Test Loss: 5.959240734577179, LR: 0.0005, Elapsed Time: 4074.45 seconds Step 38900/150000, Loss: 5.955317902565002, Test Loss: 5.957756757736206, LR: 0.0005, Elapsed Time: 4084.87 seconds Step 39000/150000, Loss: 5.966357536315918, Test Loss: 5.951875865459442, LR: 0.0005, Elapsed Time: 4095.26 seconds Step 39100/150000, Loss: 5.957729330062866, Test Loss: 5.951167702674866, LR: 0.0005, Elapsed Time: 4105.69 seconds Step 39200/150000, Loss: 5.962760801315308, Test Loss: 5.950755715370178, LR: 0.0005, Elapsed Time: 4116.12 seconds Step 39300/150000, Loss: 5.959100346565247, Test Loss: 5.948922038078308, LR: 0.0005, Elapsed Time: 4126.55 seconds Step 39400/150000, Loss: 5.928620958328247, Test Loss: 5.943805694580078, LR: 0.0005, Elapsed Time: 4136.96 seconds Step 39500/150000, Loss: 5.943184213638306, Test Loss: 5.943460822105408, LR: 0.0005, Elapsed Time: 4147.41 seconds Step 39600/150000, Loss: 5.948457698822022, Test Loss: 5.946314573287964, LR: 0.0005, Elapsed Time: 4157.87 seconds Step 39700/150000, Loss: 5.953649568557739, Test Loss: 5.941150426864624, LR: 0.0005, Elapsed Time: 4168.22 seconds Step 39800/150000, Loss: 5.942794432640076, Test 
Loss: 5.942875444889069, LR: 0.0005, Elapsed Time: 4178.65 seconds Step 39900/150000, Loss: 5.94575873374939, Test Loss: 5.93720930814743, LR: 0.0005, Elapsed Time: 4189.09 seconds Step 40000/150000, Loss: 5.938135714530945, Test Loss: 5.937519073486328, LR: 0.0005, Elapsed Time: 4199.59 seconds Step 40100/150000, Loss: 5.931557874679566, Test Loss: 5.932717740535736, LR: 0.0005, Elapsed Time: 4210.05 seconds Step 40200/150000, Loss: 5.9232305860519405, Test Loss: 5.9313605427742, LR: 0.0005, Elapsed Time: 4220.49 seconds Step 40300/150000, Loss: 5.936990547180176, Test Loss: 5.928319573402405, LR: 0.0005, Elapsed Time: 4230.97 seconds Step 40400/150000, Loss: 5.938570022583008, Test Loss: 5.931916415691376, LR: 0.0005, Elapsed Time: 4241.46 seconds Step 40500/150000, Loss: 5.937492055892944, Test Loss: 5.9264161586761475, LR: 0.0005, Elapsed Time: 4251.92 seconds Step 40600/150000, Loss: 5.923813109397888, Test Loss: 5.933529496192932, LR: 0.0005, Elapsed Time: 4262.36 seconds Step 40700/150000, Loss: 5.934683403968811, Test Loss: 5.92292720079422, LR: 0.0005, Elapsed Time: 4272.81 seconds Step 40800/150000, Loss: 5.933063597679138, Test Loss: 5.924282014369965, LR: 0.0005, Elapsed Time: 4283.23 seconds Step 40900/150000, Loss: 5.919202733039856, Test Loss: 5.91918671131134, LR: 0.0005, Elapsed Time: 4293.68 seconds Step 41000/150000, Loss: 5.914374885559082, Test Loss: 5.915018618106842, LR: 0.0005, Elapsed Time: 4304.13 seconds Step 41100/150000, Loss: 5.90896996974945, Test Loss: 5.915798664093018, LR: 0.0005, Elapsed Time: 4314.53 seconds Step 41200/150000, Loss: 5.913657970428467, Test Loss: 5.918470680713654, LR: 0.0005, Elapsed Time: 4324.99 seconds Step 41300/150000, Loss: 5.917114014625549, Test Loss: 5.913124740123749, LR: 0.0005, Elapsed Time: 4335.39 seconds Step 41400/150000, Loss: 5.9099878358840945, Test Loss: 5.910626828670502, LR: 0.0005, Elapsed Time: 4345.88 seconds Step 41500/150000, Loss: 5.903722825050354, Test Loss: 5.9093546867370605, LR: 
0.0005, Elapsed Time: 4356.33 seconds Step 41600/150000, Loss: 5.913482222557068, Test Loss: 5.908509850502014, LR: 0.0005, Elapsed Time: 4366.76 seconds Step 41700/150000, Loss: 5.921222243309021, Test Loss: 5.897541165351868, LR: 0.0005, Elapsed Time: 4377.20 seconds Step 41800/150000, Loss: 5.902980070114136, Test Loss: 5.901935279369354, LR: 0.0005, Elapsed Time: 4387.64 seconds Step 41900/150000, Loss: 5.914592185020447, Test Loss: 5.897011876106262, LR: 0.0005, Elapsed Time: 4398.11 seconds Step 42000/150000, Loss: 5.90191499710083, Test Loss: 5.908784806728363, LR: 0.0005, Elapsed Time: 4408.55 seconds Step 42100/150000, Loss: 5.891466546058655, Test Loss: 5.894311368465424, LR: 0.0005, Elapsed Time: 4419.05 seconds Step 42200/150000, Loss: 5.901901683807373, Test Loss: 5.8902939558029175, LR: 0.0005, Elapsed Time: 4429.43 seconds Step 42300/150000, Loss: 5.885509872436524, Test Loss: 5.888437330722809, LR: 0.0005, Elapsed Time: 4439.88 seconds Step 42400/150000, Loss: 5.901699542999268, Test Loss: 5.886762082576752, LR: 0.0005, Elapsed Time: 4450.43 seconds Step 42500/150000, Loss: 5.884639883041382, Test Loss: 5.884190082550049, LR: 0.0005, Elapsed Time: 4460.85 seconds Step 42600/150000, Loss: 5.900937023162842, Test Loss: 5.88665235042572, LR: 0.0005, Elapsed Time: 4471.28 seconds Step 42700/150000, Loss: 5.8818558502197265, Test Loss: 5.885623753070831, LR: 0.0005, Elapsed Time: 4481.71 seconds Step 42800/150000, Loss: 5.8879849195480345, Test Loss: 5.880548894405365, LR: 0.0005, Elapsed Time: 4492.15 seconds Step 42900/150000, Loss: 5.899122343063355, Test Loss: 5.877642631530762, LR: 0.0005, Elapsed Time: 4502.64 seconds Step 43000/150000, Loss: 5.873701076507569, Test Loss: 5.878451943397522, LR: 0.0005, Elapsed Time: 4513.13 seconds Step 43100/150000, Loss: 5.890672583580017, Test Loss: 5.876022815704346, LR: 0.0005, Elapsed Time: 4523.67 seconds Step 43200/150000, Loss: 5.8862994146347045, Test Loss: 5.876372456550598, LR: 0.0005, Elapsed Time: 
4534.17 seconds Step 43300/150000, Loss: 5.882007732391357, Test Loss: 5.877281308174133, LR: 0.0005, Elapsed Time: 4544.62 seconds Step 43400/150000, Loss: 5.8748514938354495, Test Loss: 5.873298645019531, LR: 0.0005, Elapsed Time: 4555.14 seconds Step 43500/150000, Loss: 5.876629953384399, Test Loss: 5.871146380901337, LR: 0.0005, Elapsed Time: 4565.62 seconds Step 43600/150000, Loss: 5.876505823135376, Test Loss: 5.877763330936432, LR: 0.0005, Elapsed Time: 4576.11 seconds Step 43700/150000, Loss: 5.867029638290405, Test Loss: 5.864722847938538, LR: 0.0005, Elapsed Time: 4586.57 seconds Step 43800/150000, Loss: 5.866069984436035, Test Loss: 5.8683894872665405, LR: 0.0005, Elapsed Time: 4597.02 seconds Step 43900/150000, Loss: 5.855792636871338, Test Loss: 5.862679600715637, LR: 0.0005, Elapsed Time: 4607.46 seconds Step 44000/150000, Loss: 5.869513192176819, Test Loss: 5.858771324157715, LR: 0.0005, Elapsed Time: 4617.95 seconds Step 44100/150000, Loss: 5.864533257484436, Test Loss: 5.857348561286926, LR: 0.0005, Elapsed Time: 4628.41 seconds Step 44200/150000, Loss: 5.862675275802612, Test Loss: 5.854569733142853, LR: 0.0005, Elapsed Time: 4638.88 seconds Step 44300/150000, Loss: 5.851214408874512, Test Loss: 5.8506457805633545, LR: 0.0005, Elapsed Time: 4649.36 seconds Step 44400/150000, Loss: 5.861044640541077, Test Loss: 5.8536030650138855, LR: 0.0005, Elapsed Time: 4659.86 seconds Step 44500/150000, Loss: 5.859848227500915, Test Loss: 5.854314982891083, LR: 0.0005, Elapsed Time: 4670.35 seconds Step 44600/150000, Loss: 5.849500579833984, Test Loss: 5.843581020832062, LR: 0.0005, Elapsed Time: 4680.78 seconds Step 44700/150000, Loss: 5.838568511009217, Test Loss: 5.846759557723999, LR: 0.0005, Elapsed Time: 4691.26 seconds Step 44800/150000, Loss: 5.834904537200928, Test Loss: 5.848383784294128, LR: 0.0005, Elapsed Time: 4701.72 seconds Step 44900/150000, Loss: 5.850045833587647, Test Loss: 5.845510482788086, LR: 0.0005, Elapsed Time: 4712.13 seconds Step 
45000/150000, Loss: 5.84927390575409, Test Loss: 5.8463550209999084, LR: 0.0005, Elapsed Time: 4722.60 seconds Step 45100/150000, Loss: 5.830613083839417, Test Loss: 5.837386608123779, LR: 0.0005, Elapsed Time: 4733.01 seconds Step 45200/150000, Loss: 5.853322615623474, Test Loss: 5.835253477096558, LR: 0.0005, Elapsed Time: 4743.50 seconds Step 45300/150000, Loss: 5.832345261573791, Test Loss: 5.838078081607819, LR: 0.0005, Elapsed Time: 4753.96 seconds Step 45400/150000, Loss: 5.843818249702454, Test Loss: 5.8386335372924805, LR: 0.0005, Elapsed Time: 4764.49 seconds Step 45500/150000, Loss: 5.842650179862976, Test Loss: 5.8367942571640015, LR: 0.0005, Elapsed Time: 4774.94 seconds Step 45600/150000, Loss: 5.828404932022095, Test Loss: 5.829302608966827, LR: 0.0005, Elapsed Time: 4785.38 seconds Step 45700/150000, Loss: 5.82669798374176, Test Loss: 5.830215394496918, LR: 0.0005, Elapsed Time: 4795.85 seconds Step 45800/150000, Loss: 5.838660144805909, Test Loss: 5.828903555870056, LR: 0.0005, Elapsed Time: 4806.26 seconds Step 45900/150000, Loss: 5.825883069038391, Test Loss: 5.828873038291931, LR: 0.0005, Elapsed Time: 4816.69 seconds Step 46000/150000, Loss: 5.824935231208801, Test Loss: 5.828267276287079, LR: 0.0005, Elapsed Time: 4827.10 seconds Step 46100/150000, Loss: 5.8315033435821535, Test Loss: 5.821519494056702, LR: 0.0005, Elapsed Time: 4837.53 seconds Step 46200/150000, Loss: 5.825870933532715, Test Loss: 5.825476706027985, LR: 0.0005, Elapsed Time: 4847.98 seconds Step 46300/150000, Loss: 5.821720261573791, Test Loss: 5.8179327845573425, LR: 0.0005, Elapsed Time: 4858.45 seconds Step 46400/150000, Loss: 5.821828060150146, Test Loss: 5.815400719642639, LR: 0.0005, Elapsed Time: 4868.89 seconds Step 46500/150000, Loss: 5.811000561714172, Test Loss: 5.8163570165634155, LR: 0.0005, Elapsed Time: 4879.38 seconds Step 46600/150000, Loss: 5.804362273216247, Test Loss: 5.817447006702423, LR: 0.0005, Elapsed Time: 4889.85 seconds Step 46700/150000, Loss: 
5.814602522850037, Test Loss: 5.822064518928528, LR: 0.0005, Elapsed Time: 4900.30 seconds Step 46800/150000, Loss: 5.806937408447266, Test Loss: 5.810829401016235, LR: 0.0005, Elapsed Time: 4910.71 seconds Step 46900/150000, Loss: 5.810415759086609, Test Loss: 5.804976999759674, LR: 0.0005, Elapsed Time: 4921.10 seconds Step 47000/150000, Loss: 5.815558662414551, Test Loss: 5.810133934020996, LR: 0.0005, Elapsed Time: 4931.57 seconds Step 47100/150000, Loss: 5.807269148826599, Test Loss: 5.807037353515625, LR: 0.0005, Elapsed Time: 4942.00 seconds Step 47200/150000, Loss: 5.817147989273071, Test Loss: 5.799008011817932, LR: 0.0005, Elapsed Time: 4952.40 seconds Step 47300/150000, Loss: 5.806409749984741, Test Loss: 5.799977898597717, LR: 0.0005, Elapsed Time: 4962.83 seconds Step 47400/150000, Loss: 5.800261840820313, Test Loss: 5.7986374497413635, LR: 0.0005, Elapsed Time: 4973.30 seconds Step 47500/150000, Loss: 5.8027080631256105, Test Loss: 5.7976155281066895, LR: 0.0005, Elapsed Time: 4983.72 seconds Step 47600/150000, Loss: 5.8087613487243654, Test Loss: 5.796683490276337, LR: 0.0005, Elapsed Time: 4994.12 seconds Step 47700/150000, Loss: 5.791577043533326, Test Loss: 5.790598809719086, LR: 0.0005, Elapsed Time: 5004.61 seconds Step 47800/150000, Loss: 5.791577301025391, Test Loss: 5.786646485328674, LR: 0.0005, Elapsed Time: 5015.06 seconds Step 47900/150000, Loss: 5.782914552688599, Test Loss: 5.786516070365906, LR: 0.0005, Elapsed Time: 5025.47 seconds Step 48000/150000, Loss: 5.78526882648468, Test Loss: 5.789699554443359, LR: 0.0005, Elapsed Time: 5035.84 seconds Step 48100/150000, Loss: 5.801855220794677, Test Loss: 5.797843515872955, LR: 0.0005, Elapsed Time: 5046.25 seconds Step 48200/150000, Loss: 5.780281414985657, Test Loss: 5.784226775169373, LR: 0.0005, Elapsed Time: 5056.63 seconds Step 48300/150000, Loss: 5.784627995491028, Test Loss: 5.783224582672119, LR: 0.0005, Elapsed Time: 5067.09 seconds Step 48400/150000, Loss: 5.776226944923401, Test 
Loss: 5.78382009267807, LR: 0.0005, Elapsed Time: 5077.53 seconds Step 48500/150000, Loss: 5.783359613418579, Test Loss: 5.780032217502594, LR: 0.0005, Elapsed Time: 5087.93 seconds Step 48600/150000, Loss: 5.784785137176514, Test Loss: 5.779132187366486, LR: 0.0005, Elapsed Time: 5098.33 seconds Step 48700/150000, Loss: 5.781630439758301, Test Loss: 5.776617169380188, LR: 0.0005, Elapsed Time: 5108.74 seconds Step 48800/150000, Loss: 5.770406036376953, Test Loss: 5.776076078414917, LR: 0.0005, Elapsed Time: 5119.19 seconds Step 48900/150000, Loss: 5.768522310256958, Test Loss: 5.774195849895477, LR: 0.0005, Elapsed Time: 5129.68 seconds Step 49000/150000, Loss: 5.762334475517273, Test Loss: 5.771038472652435, LR: 0.0005, Elapsed Time: 5140.18 seconds Step 49100/150000, Loss: 5.7668607139587404, Test Loss: 5.770519375801086, LR: 0.0005, Elapsed Time: 5150.65 seconds Step 49200/150000, Loss: 5.777495994567871, Test Loss: 5.766781866550446, LR: 0.0005, Elapsed Time: 5161.07 seconds Step 49300/150000, Loss: 5.76821852684021, Test Loss: 5.762523293495178, LR: 0.0005, Elapsed Time: 5171.49 seconds Step 49400/150000, Loss: 5.762973222732544, Test Loss: 5.760149657726288, LR: 0.0005, Elapsed Time: 5181.92 seconds Step 49500/150000, Loss: 5.7599051523208615, Test Loss: 5.766585886478424, LR: 0.0005, Elapsed Time: 5192.36 seconds Step 49600/150000, Loss: 5.747420997619629, Test Loss: 5.759222030639648, LR: 0.0005, Elapsed Time: 5202.81 seconds Step 49700/150000, Loss: 5.7633383131027225, Test Loss: 5.767602980136871, LR: 0.0005, Elapsed Time: 5213.27 seconds Step 49800/150000, Loss: 5.767099347114563, Test Loss: 5.756198346614838, LR: 0.0005, Elapsed Time: 5223.71 seconds Step 49900/150000, Loss: 5.74473021030426, Test Loss: 5.754476547241211, LR: 0.0005, Elapsed Time: 5234.19 seconds Step 50000/150000, Loss: 5.753892431259155, Test Loss: 5.752468109130859, LR: 0.0005, Elapsed Time: 5244.55 seconds Saving model checkpoint at step 50000 Step 50100/150000, Loss: 
5.755419297218323, Test Loss: 5.748484790325165, LR: 0.0005, Elapsed Time: 5255.12 seconds Step 50200/150000, Loss: 5.740385332107544, Test Loss: 5.7500370144844055, LR: 0.0005, Elapsed Time: 5265.56 seconds Step 50300/150000, Loss: 5.751336932182312, Test Loss: 5.748665630817413, LR: 0.0005, Elapsed Time: 5275.98 seconds Step 50400/150000, Loss: 5.737398219108582, Test Loss: 5.744154453277588, LR: 0.0005, Elapsed Time: 5286.40 seconds Step 50500/150000, Loss: 5.735332345962524, Test Loss: 5.7421650886535645, LR: 0.0005, Elapsed Time: 5296.77 seconds Step 50600/150000, Loss: 5.74207944393158, Test Loss: 5.744273662567139, LR: 0.0005, Elapsed Time: 5307.19 seconds Step 50700/150000, Loss: 5.736642251014709, Test Loss: 5.738032579421997, LR: 0.0005, Elapsed Time: 5317.55 seconds Step 50800/150000, Loss: 5.727261786460876, Test Loss: 5.74356883764267, LR: 0.0005, Elapsed Time: 5327.99 seconds Step 50900/150000, Loss: 5.72660535812378, Test Loss: 5.733484387397766, LR: 0.0005, Elapsed Time: 5338.34 seconds Step 51000/150000, Loss: 5.7266232824325565, Test Loss: 5.731435835361481, LR: 0.0005, Elapsed Time: 5348.78 seconds Step 51100/150000, Loss: 5.717036581039428, Test Loss: 5.7295753955841064, LR: 0.0005, Elapsed Time: 5359.26 seconds Step 51200/150000, Loss: 5.734103055000305, Test Loss: 5.739583492279053, LR: 0.0005, Elapsed Time: 5369.70 seconds Step 51300/150000, Loss: 5.72184910774231, Test Loss: 5.728099346160889, LR: 0.0005, Elapsed Time: 5380.23 seconds Step 51400/150000, Loss: 5.71991144657135, Test Loss: 5.725741326808929, LR: 0.0005, Elapsed Time: 5390.71 seconds Step 51500/150000, Loss: 5.7273851585388185, Test Loss: 5.726957738399506, LR: 0.0005, Elapsed Time: 5401.20 seconds Step 51600/150000, Loss: 5.715541696548462, Test Loss: 5.720550954341888, LR: 0.0005, Elapsed Time: 5411.69 seconds Step 51700/150000, Loss: 5.718877835273743, Test Loss: 5.716041564941406, LR: 0.0005, Elapsed Time: 5422.09 seconds Step 51800/150000, Loss: 5.7087333345413205, Test 
Loss: 5.71467524766922, LR: 0.0005, Elapsed Time: 5432.57 seconds Step 51900/150000, Loss: 5.715588955879212, Test Loss: 5.711975276470184, LR: 0.0005, Elapsed Time: 5443.05 seconds Step 52000/150000, Loss: 5.712391657829285, Test Loss: 5.711428105831146, LR: 0.0005, Elapsed Time: 5453.52 seconds Step 52100/150000, Loss: 5.718744149208069, Test Loss: 5.708009660243988, LR: 0.0005, Elapsed Time: 5463.99 seconds Step 52200/150000, Loss: 5.706308016777038, Test Loss: 5.710836112499237, LR: 0.0005, Elapsed Time: 5474.44 seconds Step 52300/150000, Loss: 5.708078117370605, Test Loss: 5.711504578590393, LR: 0.0005, Elapsed Time: 5484.84 seconds Step 52400/150000, Loss: 5.6987220287323, Test Loss: 5.7051613330841064, LR: 0.0005, Elapsed Time: 5495.26 seconds Step 52500/150000, Loss: 5.6947549629211425, Test Loss: 5.707547843456268, LR: 0.0005, Elapsed Time: 5505.73 seconds Step 52600/150000, Loss: 5.696124267578125, Test Loss: 5.698739230632782, LR: 0.0005, Elapsed Time: 5516.17 seconds Step 52700/150000, Loss: 5.698147439956665, Test Loss: 5.694122016429901, LR: 0.0005, Elapsed Time: 5526.59 seconds Step 52800/150000, Loss: 5.6836727380752565, Test Loss: 5.693726480007172, LR: 0.0005, Elapsed Time: 5537.00 seconds Step 52900/150000, Loss: 5.691180486679077, Test Loss: 5.695088565349579, LR: 0.0005, Elapsed Time: 5547.41 seconds Step 53000/150000, Loss: 5.6981464385986325, Test Loss: 5.692256033420563, LR: 0.0005, Elapsed Time: 5557.82 seconds Step 53100/150000, Loss: 5.681665859222412, Test Loss: 5.687709212303162, LR: 0.0005, Elapsed Time: 5568.25 seconds Step 53200/150000, Loss: 5.675207862854004, Test Loss: 5.69061952829361, LR: 0.0005, Elapsed Time: 5578.69 seconds Step 53300/150000, Loss: 5.6948469591140745, Test Loss: 5.6901357769966125, LR: 0.0005, Elapsed Time: 5589.13 seconds Step 53400/150000, Loss: 5.683055510520935, Test Loss: 5.6843090653419495, LR: 0.0005, Elapsed Time: 5599.56 seconds Step 53500/150000, Loss: 5.686636686325073, Test Loss: 5.680770814418793, 
LR: 0.0005, Elapsed Time: 5609.97 seconds
Step 53600/150000, Loss: 5.671270871162415, Test Loss: 5.681318700313568, LR: 0.0005, Elapsed Time: 5620.43 seconds
...
Step 55000/150000, Loss: 5.646475939750672, Test Loss: 5.648631989955902, LR: 0.0005, Elapsed Time: 5766.62 seconds
Step 57500/150000, Loss: 5.589829106330871, Test Loss: 5.594490647315979, LR: 0.0005, Elapsed Time: 6027.89 seconds
Step 60000/150000, Loss: 5.530522193908691, Test Loss: 5.53598940372467, LR: 0.0005, Elapsed Time: 6288.58 seconds
Step 62500/150000, Loss: 5.469401154518128, Test Loss: 5.472087383270264, LR: 0.0005, Elapsed Time: 6549.60 seconds
Step 65000/150000, Loss: 5.405455746650696, Test Loss: 5.390476167201996, LR: 0.0005, Elapsed Time: 6810.54 seconds
Step 67500/150000, Loss: 5.266166205406189, Test Loss: 5.278250753879547, LR: 0.0005, Elapsed Time: 7073.09 seconds
Step 70000/150000, Loss: 5.148491468429565, Test Loss: 5.137582063674927, LR: 0.0005, Elapsed Time: 7335.44 seconds
Step 72500/150000, Loss: 5.026360702514649, Test Loss: 5.036626815795898, LR: 0.0005, Elapsed Time: 7597.06 seconds
Step 75000/150000, Loss: 4.9420176601409915, Test Loss: 4.95599091053009, LR: 0.0005, Elapsed Time: 7858.16 seconds
Step 77500/150000, Loss: 4.8801361131668095, Test Loss: 4.888567507266998, LR: 0.0005, Elapsed Time: 8118.98 seconds
Step 80000/150000, Loss: 4.832617559432983, Test Loss: 4.837338924407959, LR: 0.0005, Elapsed Time: 8380.40 seconds
Step 82500/150000, Loss: 4.789124612808227, Test Loss: 4.792892754077911, LR: 0.0005, Elapsed Time: 8641.39 seconds
Step 85000/150000, Loss: 4.7656351709365845, Test Loss: 4.758599102497101, LR: 0.0005, Elapsed Time: 8903.39 seconds
Step 87500/150000, Loss: 4.7362952280044555, Test Loss: 4.724602282047272, LR: 0.0005, Elapsed Time: 9165.24 seconds
...
Step 87800/150000, Loss: 4.723467698097229, Test Loss: 4.724494278430939, LR: 0.0005, Elapsed Time: 9196.60 seconds
Step 87900/150000, Loss: 4.716660871505737, Test Loss: 4.7248504757881165,
LR: 0.0005, Elapsed Time: 9207.01 seconds Step 88000/150000, Loss: 4.723094177246094, Test Loss: 4.7154620885849, LR: 0.0005, Elapsed Time: 9217.45 seconds Step 88100/150000, Loss: 4.715024509429932, Test Loss: 4.714540064334869, LR: 0.0005, Elapsed Time: 9227.88 seconds Step 88200/150000, Loss: 4.724880037307739, Test Loss: 4.719093859195709, LR: 0.0005, Elapsed Time: 9238.31 seconds Step 88300/150000, Loss: 4.713695855140686, Test Loss: 4.723946630954742, LR: 0.0005, Elapsed Time: 9248.74 seconds Step 88400/150000, Loss: 4.706941809654236, Test Loss: 4.7153637409210205, LR: 0.0005, Elapsed Time: 9259.15 seconds Step 88500/150000, Loss: 4.695304732322693, Test Loss: 4.7155386209487915, LR: 0.0005, Elapsed Time: 9269.53 seconds Step 88600/150000, Loss: 4.708195424079895, Test Loss: 4.71875137090683, LR: 0.0005, Elapsed Time: 9279.98 seconds Step 88700/150000, Loss: 4.7044768714904786, Test Loss: 4.71229875087738, LR: 0.0005, Elapsed Time: 9290.49 seconds Step 88800/150000, Loss: 4.696226706504822, Test Loss: 4.712920665740967, LR: 0.0005, Elapsed Time: 9300.89 seconds Step 88900/150000, Loss: 4.704901070594787, Test Loss: 4.712480962276459, LR: 0.0005, Elapsed Time: 9311.29 seconds Step 89000/150000, Loss: 4.718432235717773, Test Loss: 4.705884158611298, LR: 0.0005, Elapsed Time: 9321.67 seconds Step 89100/150000, Loss: 4.702246980667114, Test Loss: 4.705253064632416, LR: 0.0005, Elapsed Time: 9332.07 seconds Step 89200/150000, Loss: 4.703167734146118, Test Loss: 4.7067548632621765, LR: 0.0005, Elapsed Time: 9342.47 seconds Step 89300/150000, Loss: 4.712045331001281, Test Loss: 4.710320353507996, LR: 0.0005, Elapsed Time: 9352.99 seconds Step 89400/150000, Loss: 4.690465898513794, Test Loss: 4.70288223028183, LR: 0.0005, Elapsed Time: 9363.43 seconds Step 89500/150000, Loss: 4.701580586433411, Test Loss: 4.701967239379883, LR: 0.0005, Elapsed Time: 9373.83 seconds Step 89600/150000, Loss: 4.709697279930115, Test Loss: 4.702018618583679, LR: 0.0005, Elapsed Time: 
9384.22 seconds Step 89700/150000, Loss: 4.69653694152832, Test Loss: 4.701568067073822, LR: 0.0005, Elapsed Time: 9394.66 seconds Step 89800/150000, Loss: 4.698929047584533, Test Loss: 4.699881494045258, LR: 0.0005, Elapsed Time: 9405.14 seconds Step 89900/150000, Loss: 4.698907523155213, Test Loss: 4.698285818099976, LR: 0.0005, Elapsed Time: 9415.58 seconds Step 90000/150000, Loss: 4.694145135879516, Test Loss: 4.6999791264534, LR: 0.0005, Elapsed Time: 9426.02 seconds Step 90100/150000, Loss: 4.693687815666198, Test Loss: 4.698132336139679, LR: 0.0005, Elapsed Time: 9436.48 seconds Step 90200/150000, Loss: 4.685910682678223, Test Loss: 4.693688452243805, LR: 0.0005, Elapsed Time: 9446.91 seconds Step 90300/150000, Loss: 4.687323231697082, Test Loss: 4.6941112875938416, LR: 0.0005, Elapsed Time: 9457.29 seconds Step 90400/150000, Loss: 4.691967983245849, Test Loss: 4.6985161900520325, LR: 0.0005, Elapsed Time: 9467.68 seconds Step 90500/150000, Loss: 4.685854907035828, Test Loss: 4.692946970462799, LR: 0.0005, Elapsed Time: 9478.07 seconds Step 90600/150000, Loss: 4.683881001472473, Test Loss: 4.694627404212952, LR: 0.0005, Elapsed Time: 9488.47 seconds Step 90700/150000, Loss: 4.6915514850616455, Test Loss: 4.694908678531647, LR: 0.0005, Elapsed Time: 9498.94 seconds Step 90800/150000, Loss: 4.689019637107849, Test Loss: 4.692712366580963, LR: 0.0005, Elapsed Time: 9509.33 seconds Step 90900/150000, Loss: 4.690668997764587, Test Loss: 4.690618216991425, LR: 0.0005, Elapsed Time: 9519.75 seconds Step 91000/150000, Loss: 4.694623308181763, Test Loss: 4.688620090484619, LR: 0.0005, Elapsed Time: 9530.22 seconds Step 91100/150000, Loss: 4.693570137023926, Test Loss: 4.689449310302734, LR: 0.0005, Elapsed Time: 9540.62 seconds Step 91200/150000, Loss: 4.68170777797699, Test Loss: 4.692605435848236, LR: 0.0005, Elapsed Time: 9551.09 seconds Step 91300/150000, Loss: 4.6848638725280765, Test Loss: 4.690497636795044, LR: 0.0005, Elapsed Time: 9561.63 seconds Step 
91400/150000, Loss: 4.687048277854919, Test Loss: 4.683878183364868, LR: 0.0005, Elapsed Time: 9572.22 seconds Step 91500/150000, Loss: 4.687886872291565, Test Loss: 4.6849524974823, LR: 0.0005, Elapsed Time: 9582.77 seconds Step 91600/150000, Loss: 4.667567391395568, Test Loss: 4.679482638835907, LR: 0.0005, Elapsed Time: 9593.32 seconds Step 91700/150000, Loss: 4.672996950149536, Test Loss: 4.67811793088913, LR: 0.0005, Elapsed Time: 9603.77 seconds Step 91800/150000, Loss: 4.680358791351319, Test Loss: 4.683247745037079, LR: 0.0005, Elapsed Time: 9614.28 seconds Step 91900/150000, Loss: 4.67656976222992, Test Loss: 4.6828858852386475, LR: 0.0005, Elapsed Time: 9624.75 seconds Step 92000/150000, Loss: 4.674556722640991, Test Loss: 4.68213951587677, LR: 0.0005, Elapsed Time: 9635.16 seconds Step 92100/150000, Loss: 4.676723132133484, Test Loss: 4.6823811531066895, LR: 0.0005, Elapsed Time: 9645.66 seconds Step 92200/150000, Loss: 4.674354152679443, Test Loss: 4.676217079162598, LR: 0.0005, Elapsed Time: 9656.12 seconds Step 92300/150000, Loss: 4.681895899772644, Test Loss: 4.684450745582581, LR: 0.0005, Elapsed Time: 9666.58 seconds Step 92400/150000, Loss: 4.679564142227173, Test Loss: 4.678245842456818, LR: 0.0005, Elapsed Time: 9677.02 seconds Step 92500/150000, Loss: 4.661452374458313, Test Loss: 4.6795238852500916, LR: 0.0005, Elapsed Time: 9687.45 seconds Step 92600/150000, Loss: 4.663585071563721, Test Loss: 4.673865914344788, LR: 0.0005, Elapsed Time: 9697.90 seconds Step 92700/150000, Loss: 4.672209591865539, Test Loss: 4.679299831390381, LR: 0.0005, Elapsed Time: 9708.32 seconds Step 92800/150000, Loss: 4.659889693260193, Test Loss: 4.678571283817291, LR: 0.0005, Elapsed Time: 9718.69 seconds Step 92900/150000, Loss: 4.667638483047486, Test Loss: 4.677476406097412, LR: 0.0005, Elapsed Time: 9729.09 seconds Step 93000/150000, Loss: 4.665724058151245, Test Loss: 4.667974412441254, LR: 0.0005, Elapsed Time: 9739.54 seconds Step 93100/150000, Loss: 
4.672451729774475, Test Loss: 4.676164627075195, LR: 0.0005, Elapsed Time: 9749.94 seconds Step 93200/150000, Loss: 4.6606781816482545, Test Loss: 4.671724379062653, LR: 0.0005, Elapsed Time: 9760.35 seconds Step 93300/150000, Loss: 4.650898904800415, Test Loss: 4.670389354228973, LR: 0.0005, Elapsed Time: 9770.72 seconds Step 93400/150000, Loss: 4.659914307594299, Test Loss: 4.672396183013916, LR: 0.0005, Elapsed Time: 9781.21 seconds Step 93500/150000, Loss: 4.674565973281861, Test Loss: 4.668088674545288, LR: 0.0005, Elapsed Time: 9791.69 seconds Step 93600/150000, Loss: 4.657446746826172, Test Loss: 4.6658895611763, LR: 0.0005, Elapsed Time: 9802.15 seconds Step 93700/150000, Loss: 4.664053239822388, Test Loss: 4.665336430072784, LR: 0.0005, Elapsed Time: 9812.61 seconds Step 93800/150000, Loss: 4.662183227539063, Test Loss: 4.6684606075286865, LR: 0.0005, Elapsed Time: 9823.06 seconds Step 93900/150000, Loss: 4.6505805778503415, Test Loss: 4.666433274745941, LR: 0.0005, Elapsed Time: 9833.49 seconds Step 94000/150000, Loss: 4.654260153770447, Test Loss: 4.666564524173737, LR: 0.0005, Elapsed Time: 9843.91 seconds Step 94100/150000, Loss: 4.66292402267456, Test Loss: 4.6626468896865845, LR: 0.0005, Elapsed Time: 9854.41 seconds Step 94200/150000, Loss: 4.660408835411072, Test Loss: 4.663703203201294, LR: 0.0005, Elapsed Time: 9864.92 seconds Step 94300/150000, Loss: 4.645924735069275, Test Loss: 4.655661582946777, LR: 0.0005, Elapsed Time: 9875.46 seconds Step 94400/150000, Loss: 4.650605092048645, Test Loss: 4.658944606781006, LR: 0.0005, Elapsed Time: 9885.95 seconds Step 94500/150000, Loss: 4.6522779512405394, Test Loss: 4.658105671405792, LR: 0.0005, Elapsed Time: 9896.46 seconds Step 94600/150000, Loss: 4.643304872512817, Test Loss: 4.659991383552551, LR: 0.0005, Elapsed Time: 9906.87 seconds Step 94700/150000, Loss: 4.650188775062561, Test Loss: 4.655910968780518, LR: 0.0005, Elapsed Time: 9917.27 seconds Step 94800/150000, Loss: 4.644950227737427, Test 
Loss: 4.65347558259964, LR: 0.0005, Elapsed Time: 9927.73 seconds Step 94900/150000, Loss: 4.655018644332886, Test Loss: 4.652925968170166, LR: 0.0005, Elapsed Time: 9938.23 seconds Step 95000/150000, Loss: 4.64109251499176, Test Loss: 4.6555516719818115, LR: 0.0005, Elapsed Time: 9948.70 seconds Step 95100/150000, Loss: 4.647016501426696, Test Loss: 4.65079939365387, LR: 0.0005, Elapsed Time: 9959.20 seconds Step 95200/150000, Loss: 4.650451273918152, Test Loss: 4.653127074241638, LR: 0.0005, Elapsed Time: 9969.71 seconds Step 95300/150000, Loss: 4.637032742500305, Test Loss: 4.653068959712982, LR: 0.0005, Elapsed Time: 9980.21 seconds Step 95400/150000, Loss: 4.643475980758667, Test Loss: 4.649152159690857, LR: 0.0005, Elapsed Time: 9990.67 seconds Step 95500/150000, Loss: 4.635945873260498, Test Loss: 4.651387095451355, LR: 0.0005, Elapsed Time: 10001.12 seconds Step 95600/150000, Loss: 4.642398982048035, Test Loss: 4.652744114398956, LR: 0.0005, Elapsed Time: 10011.59 seconds Step 95700/150000, Loss: 4.6379948282241825, Test Loss: 4.652747809886932, LR: 0.0005, Elapsed Time: 10022.07 seconds Step 95800/150000, Loss: 4.651347227096558, Test Loss: 4.646367132663727, LR: 0.0005, Elapsed Time: 10032.51 seconds Step 95900/150000, Loss: 4.638998565673828, Test Loss: 4.645727097988129, LR: 0.0005, Elapsed Time: 10042.93 seconds Step 96000/150000, Loss: 4.644238095283509, Test Loss: 4.646568834781647, LR: 0.0005, Elapsed Time: 10053.35 seconds Step 96100/150000, Loss: 4.637891778945923, Test Loss: 4.647608697414398, LR: 0.0005, Elapsed Time: 10063.81 seconds Step 96200/150000, Loss: 4.633581829071045, Test Loss: 4.646522760391235, LR: 0.0005, Elapsed Time: 10074.22 seconds Step 96300/150000, Loss: 4.62905830860138, Test Loss: 4.643860340118408, LR: 0.0005, Elapsed Time: 10084.81 seconds Step 96400/150000, Loss: 4.638504371643067, Test Loss: 4.641965210437775, LR: 0.0005, Elapsed Time: 10095.45 seconds Step 96500/150000, Loss: 4.643032631874084, Test Loss: 
4.642145156860352, LR: 0.0005, Elapsed Time: 10106.03 seconds Step 96600/150000, Loss: 4.6287583112716675, Test Loss: 4.6412633061409, LR: 0.0005, Elapsed Time: 10116.47 seconds Step 96700/150000, Loss: 4.6401792335510255, Test Loss: 4.640094876289368, LR: 0.0005, Elapsed Time: 10126.91 seconds Step 96800/150000, Loss: 4.640670771598816, Test Loss: 4.6432560086250305, LR: 0.0005, Elapsed Time: 10137.32 seconds Step 96900/150000, Loss: 4.625831456184387, Test Loss: 4.636908531188965, LR: 0.0005, Elapsed Time: 10147.70 seconds Step 97000/150000, Loss: 4.619618358612061, Test Loss: 4.639626145362854, LR: 0.0005, Elapsed Time: 10158.12 seconds Step 97100/150000, Loss: 4.636686840057373, Test Loss: 4.643510103225708, LR: 0.0005, Elapsed Time: 10168.53 seconds Step 97200/150000, Loss: 4.632449345588684, Test Loss: 4.639756202697754, LR: 0.0005, Elapsed Time: 10179.03 seconds Step 97300/150000, Loss: 4.6224003553390505, Test Loss: 4.635087072849274, LR: 0.0005, Elapsed Time: 10189.55 seconds Step 97400/150000, Loss: 4.628466601371765, Test Loss: 4.634986162185669, LR: 0.0005, Elapsed Time: 10200.03 seconds Step 97500/150000, Loss: 4.624435496330261, Test Loss: 4.632292091846466, LR: 0.0005, Elapsed Time: 10210.50 seconds Step 97600/150000, Loss: 4.631333756446838, Test Loss: 4.633491098880768, LR: 0.0005, Elapsed Time: 10221.03 seconds Step 97700/150000, Loss: 4.632350621223449, Test Loss: 4.635118305683136, LR: 0.0005, Elapsed Time: 10231.54 seconds Step 97800/150000, Loss: 4.623516960144043, Test Loss: 4.6315866112709045, LR: 0.0005, Elapsed Time: 10241.99 seconds Step 97900/150000, Loss: 4.634884152412415, Test Loss: 4.630384981632233, LR: 0.0005, Elapsed Time: 10252.47 seconds Step 98000/150000, Loss: 4.6223526239395145, Test Loss: 4.631523549556732, LR: 0.0005, Elapsed Time: 10262.89 seconds Step 98100/150000, Loss: 4.633681287765503, Test Loss: 4.631156325340271, LR: 0.0005, Elapsed Time: 10273.32 seconds Step 98200/150000, Loss: 4.614708132743836, Test Loss: 
4.632968544960022, LR: 0.0005, Elapsed Time: 10283.72 seconds Step 98300/150000, Loss: 4.636079020500183, Test Loss: 4.632040321826935, LR: 0.0005, Elapsed Time: 10294.18 seconds Step 98400/150000, Loss: 4.636147599220276, Test Loss: 4.629549622535706, LR: 0.0005, Elapsed Time: 10304.65 seconds Step 98500/150000, Loss: 4.629519305229187, Test Loss: 4.630109131336212, LR: 0.0005, Elapsed Time: 10315.11 seconds Step 98600/150000, Loss: 4.611019201278687, Test Loss: 4.628794252872467, LR: 0.0005, Elapsed Time: 10325.52 seconds Step 98700/150000, Loss: 4.616982908248901, Test Loss: 4.627687573432922, LR: 0.0005, Elapsed Time: 10335.96 seconds Step 98800/150000, Loss: 4.620314059257507, Test Loss: 4.628237307071686, LR: 0.0005, Elapsed Time: 10346.46 seconds Step 98900/150000, Loss: 4.618053684234619, Test Loss: 4.625787377357483, LR: 0.0005, Elapsed Time: 10356.96 seconds Step 99000/150000, Loss: 4.62555016040802, Test Loss: 4.622941672801971, LR: 0.0005, Elapsed Time: 10367.43 seconds Step 99100/150000, Loss: 4.621173882484436, Test Loss: 4.628370583057404, LR: 0.0005, Elapsed Time: 10377.93 seconds Step 99200/150000, Loss: 4.617442541122436, Test Loss: 4.626085937023163, LR: 0.0005, Elapsed Time: 10388.38 seconds Step 99300/150000, Loss: 4.617632808685303, Test Loss: 4.62497866153717, LR: 0.0005, Elapsed Time: 10398.84 seconds Step 99400/150000, Loss: 4.6072718524932865, Test Loss: 4.622231721878052, LR: 0.0005, Elapsed Time: 10409.31 seconds Step 99500/150000, Loss: 4.624561305046082, Test Loss: 4.622986972332001, LR: 0.0005, Elapsed Time: 10419.73 seconds Step 99600/150000, Loss: 4.619499030113221, Test Loss: 4.619997501373291, LR: 0.0005, Elapsed Time: 10430.15 seconds Step 99700/150000, Loss: 4.623346285820007, Test Loss: 4.616527736186981, LR: 0.0005, Elapsed Time: 10440.63 seconds Step 99800/150000, Loss: 4.619135618209839, Test Loss: 4.618362128734589, LR: 0.0005, Elapsed Time: 10451.10 seconds Step 99900/150000, Loss: 4.611783108711243, Test Loss: 
4.626369416713715, LR: 0.0005, Elapsed Time: 10461.53 seconds Step 100000/150000, Loss: 4.6088929367065425, Test Loss: 4.621346056461334, LR: 0.0005, Elapsed Time: 10471.96 seconds Saving model checkpoint at step 100000 Step 100100/150000, Loss: 4.6205266046524045, Test Loss: 4.621374487876892, LR: 0.0005, Elapsed Time: 10482.52 seconds Step 100200/150000, Loss: 4.612310762405396, Test Loss: 4.618714928627014, LR: 0.0005, Elapsed Time: 10492.95 seconds Step 100300/150000, Loss: 4.613575110435486, Test Loss: 4.614742815494537, LR: 0.0005, Elapsed Time: 10503.43 seconds Step 100400/150000, Loss: 4.607941718101501, Test Loss: 4.619343996047974, LR: 0.0005, Elapsed Time: 10513.84 seconds Step 100500/150000, Loss: 4.610837459564209, Test Loss: 4.612621366977692, LR: 0.0005, Elapsed Time: 10524.25 seconds Step 100600/150000, Loss: 4.60339282989502, Test Loss: 4.612646639347076, LR: 0.0005, Elapsed Time: 10534.72 seconds Step 100700/150000, Loss: 4.605064792633057, Test Loss: 4.610822558403015, LR: 0.0005, Elapsed Time: 10545.18 seconds Step 100800/150000, Loss: 4.616553387641907, Test Loss: 4.615645110607147, LR: 0.0005, Elapsed Time: 10555.62 seconds Step 100900/150000, Loss: 4.60834499835968, Test Loss: 4.611374258995056, LR: 0.0005, Elapsed Time: 10566.09 seconds Step 101000/150000, Loss: 4.604138746261596, Test Loss: 4.610837936401367, LR: 0.0005, Elapsed Time: 10576.60 seconds Step 101100/150000, Loss: 4.595677237510682, Test Loss: 4.616147518157959, LR: 0.0005, Elapsed Time: 10587.13 seconds Step 101200/150000, Loss: 4.610685257911682, Test Loss: 4.608050048351288, LR: 0.0005, Elapsed Time: 10597.55 seconds Step 101300/150000, Loss: 4.596573357582092, Test Loss: 4.61631852388382, LR: 0.0005, Elapsed Time: 10608.07 seconds Step 101400/150000, Loss: 4.6035781717300415, Test Loss: 4.610188841819763, LR: 0.0005, Elapsed Time: 10618.62 seconds Step 101500/150000, Loss: 4.611757030487061, Test Loss: 4.608919441699982, LR: 0.0005, Elapsed Time: 10629.08 seconds Step 
101600/150000, Loss: 4.601540565490723, Test Loss: 4.614411234855652, LR: 0.0005, Elapsed Time: 10639.51 seconds Step 101700/150000, Loss: 4.6011603307724, Test Loss: 4.608978509902954, LR: 0.0005, Elapsed Time: 10649.98 seconds Step 101800/150000, Loss: 4.606809692382813, Test Loss: 4.606192767620087, LR: 0.0005, Elapsed Time: 10660.42 seconds Step 101900/150000, Loss: 4.5982136726379395, Test Loss: 4.608429670333862, LR: 0.0005, Elapsed Time: 10670.89 seconds Step 102000/150000, Loss: 4.606473655700683, Test Loss: 4.607306778430939, LR: 0.0005, Elapsed Time: 10681.35 seconds Step 102100/150000, Loss: 4.600328321456909, Test Loss: 4.608515739440918, LR: 0.0005, Elapsed Time: 10691.84 seconds Step 102200/150000, Loss: 4.583628311157226, Test Loss: 4.609188795089722, LR: 0.0005, Elapsed Time: 10702.28 seconds Step 102300/150000, Loss: 4.594375777244568, Test Loss: 4.604035496711731, LR: 0.0005, Elapsed Time: 10712.78 seconds Step 102400/150000, Loss: 4.598370118141174, Test Loss: 4.608293652534485, LR: 0.0005, Elapsed Time: 10723.29 seconds Step 102500/150000, Loss: 4.592981562614441, Test Loss: 4.60769259929657, LR: 0.0005, Elapsed Time: 10733.73 seconds Step 102600/150000, Loss: 4.600937805175781, Test Loss: 4.600850403308868, LR: 0.0005, Elapsed Time: 10744.20 seconds Step 102700/150000, Loss: 4.587853412628174, Test Loss: 4.601161420345306, LR: 0.0005, Elapsed Time: 10754.57 seconds Step 102800/150000, Loss: 4.586189322471618, Test Loss: 4.607200562953949, LR: 0.0005, Elapsed Time: 10765.05 seconds Step 102900/150000, Loss: 4.583870034217835, Test Loss: 4.603475332260132, LR: 0.0005, Elapsed Time: 10775.48 seconds Step 103000/150000, Loss: 4.6077166938781735, Test Loss: 4.603028476238251, LR: 0.0005, Elapsed Time: 10785.89 seconds Step 103100/150000, Loss: 4.578835482597351, Test Loss: 4.600737929344177, LR: 0.0005, Elapsed Time: 10796.28 seconds Step 103200/150000, Loss: 4.607456932067871, Test Loss: 4.5980218052864075, LR: 0.0005, Elapsed Time: 10806.75 
seconds Step 103300/150000, Loss: 4.596224827766418, Test Loss: 4.598835349082947, LR: 0.0005, Elapsed Time: 10817.22 seconds Step 103400/150000, Loss: 4.581506223678589, Test Loss: 4.600177109241486, LR: 0.0005, Elapsed Time: 10827.71 seconds Step 103500/150000, Loss: 4.578043236732483, Test Loss: 4.599627673625946, LR: 0.0005, Elapsed Time: 10838.20 seconds Step 103600/150000, Loss: 4.5954876899719235, Test Loss: 4.597754061222076, LR: 0.0005, Elapsed Time: 10848.68 seconds Step 103700/150000, Loss: 4.589528551101685, Test Loss: 4.603450775146484, LR: 0.0005, Elapsed Time: 10859.21 seconds Step 103800/150000, Loss: 4.589122161865235, Test Loss: 4.59776383638382, LR: 0.0005, Elapsed Time: 10869.65 seconds Step 103900/150000, Loss: 4.591271123886108, Test Loss: 4.5991591811180115, LR: 0.0005, Elapsed Time: 10880.08 seconds Step 104000/150000, Loss: 4.58751446723938, Test Loss: 4.59814590215683, LR: 0.0005, Elapsed Time: 10890.55 seconds Step 104100/150000, Loss: 4.589705247879028, Test Loss: 4.593555808067322, LR: 0.0005, Elapsed Time: 10901.10 seconds Step 104200/150000, Loss: 4.590589661598205, Test Loss: 4.593409478664398, LR: 0.0005, Elapsed Time: 10911.61 seconds Step 104300/150000, Loss: 4.589026546478271, Test Loss: 4.588950991630554, LR: 0.0005, Elapsed Time: 10922.09 seconds Step 104400/150000, Loss: 4.585087246894837, Test Loss: 4.5934566259384155, LR: 0.0005, Elapsed Time: 10932.57 seconds Step 104500/150000, Loss: 4.581825428009033, Test Loss: 4.590896666049957, LR: 0.0005, Elapsed Time: 10943.00 seconds Step 104600/150000, Loss: 4.57651686668396, Test Loss: 4.590027153491974, LR: 0.0005, Elapsed Time: 10953.45 seconds Step 104700/150000, Loss: 4.594912619590759, Test Loss: 4.591535449028015, LR: 0.0005, Elapsed Time: 10963.98 seconds Step 104800/150000, Loss: 4.588951959609985, Test Loss: 4.589764475822449, LR: 0.0005, Elapsed Time: 10974.44 seconds Step 104900/150000, Loss: 4.589153542518615, Test Loss: 4.588797032833099, LR: 0.0005, Elapsed Time: 
10984.85 seconds Step 105000/150000, Loss: 4.562139172554016, Test Loss: 4.594941318035126, LR: 0.0005, Elapsed Time: 10995.38 seconds Step 105100/150000, Loss: 4.578693151473999, Test Loss: 4.586896657943726, LR: 0.0005, Elapsed Time: 11005.88 seconds Step 105200/150000, Loss: 4.5765238809585576, Test Loss: 4.584712326526642, LR: 0.0005, Elapsed Time: 11016.40 seconds Step 105300/150000, Loss: 4.5896961212158205, Test Loss: 4.586831510066986, LR: 0.0005, Elapsed Time: 11026.87 seconds Step 105400/150000, Loss: 4.580262274742126, Test Loss: 4.590433776378632, LR: 0.0005, Elapsed Time: 11037.32 seconds Step 105500/150000, Loss: 4.582105946540833, Test Loss: 4.585211396217346, LR: 0.0005, Elapsed Time: 11047.71 seconds Step 105600/150000, Loss: 4.580722970962524, Test Loss: 4.584090828895569, LR: 0.0005, Elapsed Time: 11058.15 seconds Step 105700/150000, Loss: 4.578978352546692, Test Loss: 4.588916778564453, LR: 0.0005, Elapsed Time: 11068.58 seconds Step 105800/150000, Loss: 4.567564072608948, Test Loss: 4.588899314403534, LR: 0.0005, Elapsed Time: 11078.99 seconds Step 105900/150000, Loss: 4.573529272079468, Test Loss: 4.5820518136024475, LR: 0.0005, Elapsed Time: 11089.40 seconds Step 106000/150000, Loss: 4.581840133666992, Test Loss: 4.587289273738861, LR: 0.0005, Elapsed Time: 11099.79 seconds Step 106100/150000, Loss: 4.583699650764466, Test Loss: 4.585439503192902, LR: 0.0005, Elapsed Time: 11110.26 seconds Step 106200/150000, Loss: 4.584826011657714, Test Loss: 4.588802814483643, LR: 0.0005, Elapsed Time: 11120.69 seconds Step 106300/150000, Loss: 4.575369172096252, Test Loss: 4.5852895975112915, LR: 0.0005, Elapsed Time: 11131.10 seconds Step 106400/150000, Loss: 4.57934070110321, Test Loss: 4.578591227531433, LR: 0.0005, Elapsed Time: 11141.61 seconds Step 106500/150000, Loss: 4.56543420791626, Test Loss: 4.580428600311279, LR: 0.0005, Elapsed Time: 11152.05 seconds Step 106600/150000, Loss: 4.577886719703674, Test Loss: 4.583413362503052, LR: 0.0005, 
Elapsed Time: 11162.51 seconds Step 106700/150000, Loss: 4.567098631858825, Test Loss: 4.582394361495972, LR: 0.0005, Elapsed Time: 11173.00 seconds Step 106800/150000, Loss: 4.562952451705932, Test Loss: 4.584019482135773, LR: 0.0005, Elapsed Time: 11183.46 seconds Step 106900/150000, Loss: 4.5743641710281375, Test Loss: 4.5805052518844604, LR: 0.0005, Elapsed Time: 11193.93 seconds Step 107000/150000, Loss: 4.56983811378479, Test Loss: 4.58242803812027, LR: 0.0005, Elapsed Time: 11204.42 seconds Step 107100/150000, Loss: 4.569725260734558, Test Loss: 4.579241335391998, LR: 0.0005, Elapsed Time: 11214.88 seconds Step 107200/150000, Loss: 4.562703409194946, Test Loss: 4.579744279384613, LR: 0.0005, Elapsed Time: 11225.42 seconds Step 107300/150000, Loss: 4.57162504196167, Test Loss: 4.580317854881287, LR: 0.0005, Elapsed Time: 11235.93 seconds Step 107400/150000, Loss: 4.577780499458313, Test Loss: 4.582172513008118, LR: 0.0005, Elapsed Time: 11246.40 seconds Step 107500/150000, Loss: 4.577816696166992, Test Loss: 4.573326230049133, LR: 0.0005, Elapsed Time: 11256.87 seconds Step 107600/150000, Loss: 4.56252010345459, Test Loss: 4.576704025268555, LR: 0.0005, Elapsed Time: 11267.41 seconds Step 107700/150000, Loss: 4.562086844444275, Test Loss: 4.577490568161011, LR: 0.0005, Elapsed Time: 11277.84 seconds Step 107800/150000, Loss: 4.558477053642273, Test Loss: 4.581203401088715, LR: 0.0005, Elapsed Time: 11288.37 seconds Step 107900/150000, Loss: 4.567818946838379, Test Loss: 4.574652671813965, LR: 0.0005, Elapsed Time: 11298.83 seconds Step 108000/150000, Loss: 4.572022051811218, Test Loss: 4.573378562927246, LR: 0.0005, Elapsed Time: 11309.26 seconds Step 108100/150000, Loss: 4.555449562072754, Test Loss: 4.573389649391174, LR: 0.0005, Elapsed Time: 11319.70 seconds Step 108200/150000, Loss: 4.5733958339691165, Test Loss: 4.575595676898956, LR: 0.0005, Elapsed Time: 11330.22 seconds Step 108300/150000, Loss: 4.551444945335388, Test Loss: 4.568536460399628, LR: 
0.0005, Elapsed Time: 11340.69 seconds Step 108400/150000, Loss: 4.553409218788147, Test Loss: 4.568656146526337, LR: 0.0005, Elapsed Time: 11351.10 seconds Step 108500/150000, Loss: 4.574968419075012, Test Loss: 4.570334851741791, LR: 0.0005, Elapsed Time: 11361.55 seconds Step 108600/150000, Loss: 4.567140502929687, Test Loss: 4.569110691547394, LR: 0.0005, Elapsed Time: 11371.99 seconds Step 108700/150000, Loss: 4.570112390518188, Test Loss: 4.570746004581451, LR: 0.0005, Elapsed Time: 11382.39 seconds Step 108800/150000, Loss: 4.569946856498718, Test Loss: 4.56686133146286, LR: 0.0005, Elapsed Time: 11392.81 seconds Step 108900/150000, Loss: 4.559395890235901, Test Loss: 4.565597832202911, LR: 0.0005, Elapsed Time: 11403.26 seconds Step 109000/150000, Loss: 4.580220856666565, Test Loss: 4.564338445663452, LR: 0.0005, Elapsed Time: 11413.78 seconds Step 109100/150000, Loss: 4.552979741096497, Test Loss: 4.5680171847343445, LR: 0.0005, Elapsed Time: 11424.31 seconds Step 109200/150000, Loss: 4.567926630973816, Test Loss: 4.56605076789856, LR: 0.0005, Elapsed Time: 11434.79 seconds Step 109300/150000, Loss: 4.549776530265808, Test Loss: 4.565725564956665, LR: 0.0005, Elapsed Time: 11445.26 seconds Step 109400/150000, Loss: 4.5639936876297, Test Loss: 4.566766262054443, LR: 0.0005, Elapsed Time: 11455.72 seconds Step 109500/150000, Loss: 4.552523670196533, Test Loss: 4.565463423728943, LR: 0.0005, Elapsed Time: 11466.23 seconds Step 109600/150000, Loss: 4.556932649612427, Test Loss: 4.5612112283706665, LR: 0.0005, Elapsed Time: 11476.67 seconds Step 109700/150000, Loss: 4.562925319671631, Test Loss: 4.562435030937195, LR: 0.0005, Elapsed Time: 11487.08 seconds Step 109800/150000, Loss: 4.55955801486969, Test Loss: 4.561127066612244, LR: 0.0005, Elapsed Time: 11497.51 seconds Step 109900/150000, Loss: 4.563456506729126, Test Loss: 4.565692245960236, LR: 0.0005, Elapsed Time: 11507.93 seconds Step 110000/150000, Loss: 4.553442587852478, Test Loss: 4.563076615333557, 
LR: 0.0005, Elapsed Time: 11518.37 seconds
[... training log condensed: the repetitive per-100-step entries are omitted. Over steps 110100 to 144100 the learning rate steps down from 5e-4 to 1.5e-4, 4.5e-5, 1.35e-5, and finally 5e-6, while the test loss falls from roughly 4.567 to 4.478 ...]
Step 110100/150000, Loss: 4.566080374717712, Test Loss: 4.5667847990989685, LR: 0.0005, Elapsed Time: 11528.86 seconds
...
Step 111800/150000, Loss: 4.5364692640304565, Test Loss: 4.535502731800079, LR: 0.00015, Elapsed Time: 11706.47 seconds
...
Step 120500/150000, Loss: 4.479715437889099, Test Loss: 4.494948208332062, LR: 4.4999999999999996e-05, Elapsed Time: 12617.06 seconds
...
Step 128600/150000, Loss: 4.449896788597107, Test Loss: 4.481815874576569, LR: 1.3499999999999998e-05, Elapsed Time: 13465.55 seconds
...
Step 130000/150000, Loss: 4.444230542182923, Test Loss: 4.479863166809082, LR: 5e-06, Elapsed Time: 13612.22 seconds
...
Step 144100/150000, Loss: 4.453219032287597, Test Loss: 4.477742433547974, LR: 5e-06, Elapsed Time: 15090.25 seconds
Step 144200/150000, Loss: 4.458648386001587, Test Loss:
4.477952420711517, LR: 5e-06, Elapsed Time: 15100.76 seconds Step 144300/150000, Loss: 4.44669912815094, Test Loss: 4.477637588977814, LR: 5e-06, Elapsed Time: 15111.29 seconds Step 144400/150000, Loss: 4.44953161239624, Test Loss: 4.47789466381073, LR: 5e-06, Elapsed Time: 15121.81 seconds Step 144500/150000, Loss: 4.455354633331299, Test Loss: 4.477940082550049, LR: 5e-06, Elapsed Time: 15132.31 seconds Step 144600/150000, Loss: 4.4668526935577395, Test Loss: 4.477588534355164, LR: 5e-06, Elapsed Time: 15142.88 seconds Step 144700/150000, Loss: 4.454244751930236, Test Loss: 4.477806568145752, LR: 5e-06, Elapsed Time: 15153.41 seconds Step 144800/150000, Loss: 4.442389678955078, Test Loss: 4.477844536304474, LR: 5e-06, Elapsed Time: 15163.93 seconds Step 144900/150000, Loss: 4.451730027198791, Test Loss: 4.477531850337982, LR: 5e-06, Elapsed Time: 15174.41 seconds Step 145000/150000, Loss: 4.448106727600098, Test Loss: 4.477595031261444, LR: 5e-06, Elapsed Time: 15184.89 seconds Step 145100/150000, Loss: 4.453576111793518, Test Loss: 4.47765439748764, LR: 5e-06, Elapsed Time: 15195.36 seconds Step 145200/150000, Loss: 4.452858099937439, Test Loss: 4.47746068239212, LR: 5e-06, Elapsed Time: 15205.81 seconds Step 145300/150000, Loss: 4.456533522605896, Test Loss: 4.4775460958480835, LR: 5e-06, Elapsed Time: 15216.31 seconds Step 145400/150000, Loss: 4.455827317237854, Test Loss: 4.477603614330292, LR: 5e-06, Elapsed Time: 15226.78 seconds Step 145500/150000, Loss: 4.451658873558045, Test Loss: 4.4774357080459595, LR: 5e-06, Elapsed Time: 15237.24 seconds Step 145600/150000, Loss: 4.447464776039124, Test Loss: 4.477752447128296, LR: 5e-06, Elapsed Time: 15247.78 seconds Step 145700/150000, Loss: 4.46410873413086, Test Loss: 4.477738380432129, LR: 5e-06, Elapsed Time: 15258.27 seconds Step 145800/150000, Loss: 4.448485341072082, Test Loss: 4.4776571393013, LR: 5e-06, Elapsed Time: 15268.75 seconds Step 145900/150000, Loss: 4.440375475883484, Test Loss: 
4.477726697921753, LR: 5e-06, Elapsed Time: 15279.20 seconds Step 146000/150000, Loss: 4.446234831809997, Test Loss: 4.477728307247162, LR: 5e-06, Elapsed Time: 15289.62 seconds Step 146100/150000, Loss: 4.4494277858734135, Test Loss: 4.477596640586853, LR: 5e-06, Elapsed Time: 15300.07 seconds Step 146200/150000, Loss: 4.448816304206848, Test Loss: 4.477940499782562, LR: 5e-06, Elapsed Time: 15310.56 seconds Step 146300/150000, Loss: 4.453729848861695, Test Loss: 4.477714240550995, LR: 5e-06, Elapsed Time: 15321.09 seconds Step 146400/150000, Loss: 4.447273626327514, Test Loss: 4.47768098115921, LR: 5e-06, Elapsed Time: 15331.65 seconds Step 146500/150000, Loss: 4.444137463569641, Test Loss: 4.477687835693359, LR: 5e-06, Elapsed Time: 15342.18 seconds Step 146600/150000, Loss: 4.4381750249862675, Test Loss: 4.477806985378265, LR: 5e-06, Elapsed Time: 15352.66 seconds Step 146700/150000, Loss: 4.463432188034058, Test Loss: 4.477578282356262, LR: 5e-06, Elapsed Time: 15363.18 seconds Step 146800/150000, Loss: 4.44898639202118, Test Loss: 4.477652311325073, LR: 5e-06, Elapsed Time: 15373.75 seconds Step 146900/150000, Loss: 4.446751656532288, Test Loss: 4.477584898471832, LR: 5e-06, Elapsed Time: 15384.22 seconds Step 147000/150000, Loss: 4.455843539237976, Test Loss: 4.477757453918457, LR: 5e-06, Elapsed Time: 15394.79 seconds Step 147100/150000, Loss: 4.44806161403656, Test Loss: 4.47775000333786, LR: 5e-06, Elapsed Time: 15405.38 seconds Step 147200/150000, Loss: 4.436804265975952, Test Loss: 4.478011071681976, LR: 5e-06, Elapsed Time: 15415.84 seconds Step 147300/150000, Loss: 4.447864966392517, Test Loss: 4.477620720863342, LR: 5e-06, Elapsed Time: 15426.33 seconds Step 147400/150000, Loss: 4.45115873336792, Test Loss: 4.477466642856598, LR: 5e-06, Elapsed Time: 15436.77 seconds Step 147500/150000, Loss: 4.442023587226868, Test Loss: 4.4776612520217896, LR: 5e-06, Elapsed Time: 15447.24 seconds Step 147600/150000, Loss: 4.458568711280822, Test Loss: 
4.477674961090088, LR: 5e-06, Elapsed Time: 15457.75 seconds Step 147700/150000, Loss: 4.445251317024231, Test Loss: 4.477506756782532, LR: 5e-06, Elapsed Time: 15468.26 seconds Step 147800/150000, Loss: 4.444106588363647, Test Loss: 4.477727293968201, LR: 5e-06, Elapsed Time: 15478.74 seconds Step 147900/150000, Loss: 4.455885577201843, Test Loss: 4.477564752101898, LR: 5e-06, Elapsed Time: 15489.28 seconds Step 148000/150000, Loss: 4.4548672103881835, Test Loss: 4.47752833366394, LR: 5e-06, Elapsed Time: 15499.78 seconds Step 148100/150000, Loss: 4.4496666526794435, Test Loss: 4.477524757385254, LR: 5e-06, Elapsed Time: 15510.30 seconds Step 148200/150000, Loss: 4.444072656631469, Test Loss: 4.477586030960083, LR: 5e-06, Elapsed Time: 15520.83 seconds Step 148300/150000, Loss: 4.43893572807312, Test Loss: 4.477665185928345, LR: 5e-06, Elapsed Time: 15531.29 seconds Step 148400/150000, Loss: 4.450906119346619, Test Loss: 4.4776611328125, LR: 5e-06, Elapsed Time: 15541.79 seconds Step 148500/150000, Loss: 4.449767684936523, Test Loss: 4.47739976644516, LR: 5e-06, Elapsed Time: 15552.28 seconds Step 148600/150000, Loss: 4.458567199707031, Test Loss: 4.47760272026062, LR: 5e-06, Elapsed Time: 15562.77 seconds Step 148700/150000, Loss: 4.432878155708313, Test Loss: 4.477497458457947, LR: 5e-06, Elapsed Time: 15573.21 seconds Step 148800/150000, Loss: 4.435658230781555, Test Loss: 4.4777191281318665, LR: 5e-06, Elapsed Time: 15583.67 seconds Step 148900/150000, Loss: 4.444917263984681, Test Loss: 4.477657496929169, LR: 5e-06, Elapsed Time: 15594.08 seconds Step 149000/150000, Loss: 4.44398530960083, Test Loss: 4.477760970592499, LR: 5e-06, Elapsed Time: 15604.50 seconds Step 149100/150000, Loss: 4.45354805469513, Test Loss: 4.47755765914917, LR: 5e-06, Elapsed Time: 15614.99 seconds Step 149200/150000, Loss: 4.445557546615601, Test Loss: 4.477566242218018, LR: 5e-06, Elapsed Time: 15625.53 seconds Step 149300/150000, Loss: 4.447668471336365, Test Loss: 
4.477498114109039, LR: 5e-06, Elapsed Time: 15636.07 seconds Step 149400/150000, Loss: 4.440193943977356, Test Loss: 4.477578401565552, LR: 5e-06, Elapsed Time: 15646.57 seconds Step 149500/150000, Loss: 4.445153255462646, Test Loss: 4.4774104952812195, LR: 5e-06, Elapsed Time: 15657.05 seconds Step 149600/150000, Loss: 4.432530446052551, Test Loss: 4.477606415748596, LR: 5e-06, Elapsed Time: 15667.58 seconds Step 149700/150000, Loss: 4.444754137992859, Test Loss: 4.477676808834076, LR: 5e-06, Elapsed Time: 15678.08 seconds Step 149800/150000, Loss: 4.454492030143737, Test Loss: 4.477569818496704, LR: 5e-06, Elapsed Time: 15688.60 seconds Step 149900/150000, Loss: 4.450811109542847, Test Loss: 4.477332055568695, LR: 5e-06, Elapsed Time: 15699.09 seconds Step 150000/150000, Loss: 4.440347175598145, Test Loss: 4.477678954601288, LR: 5e-06, Elapsed Time: 15709.59 seconds Saving model checkpoint at step 150000
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()

if not use_existing_model:
    torch.save(model, "./pretrain_final.pth")
Now that we have pretrained the model, we can run a few inference examples to see what kinds of outputs it produces. The model outputs legible English, and most of the words make sense; however, its small size keeps it from being as robust as larger models. It is still good enough to show the "sparks" of language understanding.
Since we trained on news articles, I've started the prompts with phrases that could plausibly appear in the news. If you rerun the cell below, you will get different outputs every time, due to the randomness of the next-token selection step.
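The variability comes from sampling the next token from the model's output distribution instead of always taking the most likely token. A minimal sketch with made-up logits over a toy 5-token vocabulary:

```python
import torch

torch.manual_seed(0)  # remove the seed to get different samples each run

# Hypothetical next-token logits over a toy 5-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
probs = torch.softmax(logits, dim=-1)

greedy = torch.argmax(probs).item()           # deterministic: always the highest-probability token
sampled = torch.multinomial(probs, 1).item()  # stochastic: usually token 0, sometimes others
print(greedy, sampled)
```

Greedy decoding would make every rerun identical; `torch.multinomial` is what makes the generations below differ.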
def inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt)
        for _ in range(max_new_tokens):
            num_tokens = len(tokens)
            # Pad the sequence to the full context window with eos tokens
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens)
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded)
            # Sample the next token from the distribution at the last real position
            probabilities = torch.softmax(logits[0, num_tokens - 1, :], dim=-1)
            predicted_token = torch.multinomial(probabilities, 1).item()
            tokens.append(predicted_token)
    return hf_tokenizer.decode(tokens)
print("Predicted:", inference("The president signed a bill to pass", model, max_new_tokens=20))
print("Predicted:", inference("There was a large division in", model, max_new_tokens=20))
print("Predicted:", inference("Reports are showing that", model, max_new_tokens=20))
Predicted: The president signed a bill to pass the bailout policy, and that's not necessarily in the Democratic Republic." Obama seems to have his closest
Predicted: There was a large division in some kind disguised battles to film Chinese. I voted myself again, but I think the Internet's terms
Predicted: Reports are showing that their advances in optimal traffic culture where each athlete's shift is allowable using drones, an 8-strong
To make the model more usable, we can take the pretrained model and put it through a process called supervised fine tuning. This process uses high quality supervised text datasets to teach the model to respond the way we want.
We can use the Fact Q&A dataset from Hugging Face for this. It consists of short question and answer examples, which suits our use case since we have a small context window of 128 tokens.
Supervised fine tuning is also where we can introduce "tags" and other special tokens that help the model distinguish different roles in the text. For our dataset, we will use a "question" tag and an "answer" tag. We add these when we create the dataset, and again at inference time when a user submits a query. We also add eos tokens to end and pad the examples that do not fill the full context window.
After fine tuning on this dataset, we should ideally have an LLM that answers the questions we ask it.
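Concretely, each training example becomes a single string with the question and answer wrapped in role tags. A small sketch (the Q&A text is made up; the tag names mirror the dataset-building code that follows):

```python
# Made-up Q&A pair; the <Question>/<Answer> tags mirror the SFT dataset format
question = "What is the capital of France?"
answer = "Paris"
formatted = "<Question>" + question + "</Question>" + "<Answer>" + answer + "</Answer>"
print(formatted)
# <Question>What is the capital of France?</Question><Answer>Paris</Answer>
```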
# Load dataset in streaming mode
sft_ds = load_dataset("rubenroy/GammaCorpus-Fact-QA-450k", split="train", streaming=True)

def check_sft_dataset_exists():
    try:
        # Attempt to load the locally saved parquet splits
        load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
        load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False

if not check_sft_dataset_exists():
    print("Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, rows=1000):
        """
        Tokenizes each Q&A example and pads it to a fixed-length
        `chunk_size`-token segment. The 'target' sequence is the input
        shifted left by 1 token. Stops after yielding `rows` chunks.
        """
        row_count = 0
        for example in dataset:
            question_plus_answer = "<Question>" + example["question"] + "</Question>" + "<Answer>" + example["answer"] + "</Answer>"
            input_tokens = tokenizer(question_plus_answer, truncation=False, padding=False)['input_ids']
            if row_count >= rows:
                return
            if len(input_tokens) >= chunk_size:
                # Skip examples that do not fit in the context window
                continue
            # Pad to the full chunk size with eos tokens
            input_tokens = input_tokens + [tokenizer.eos_token_id] * (chunk_size - len(input_tokens))
            target_tokens = input_tokens[1:] + [tokenizer.eos_token_id]  # Shifted by 1 token
            yield {
                "input": input_tokens,
                "target": target_tokens
            }
            row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 440000  # Adjust as needed
    TEST_ROWS = 500      # Adjust as needed
    CHUNK_SIZE = 128

    # Convert generator to a Hugging Face Dataset
    tokenized_sft_dataset = Dataset.from_generator(
        lambda: tokenize_and_chunk(sft_ds, hf_tokenizer, chunk_size=CHUNK_SIZE, rows=TRAIN_ROWS + TEST_ROWS))

    # Split the dataset into `train` and `test`
    sft_dataset_splits = tokenized_sft_dataset.train_test_split(train_size=TRAIN_ROWS, test_size=TEST_ROWS, seed=42)

    # Save to disk
    sft_dataset_splits["train"].to_parquet("fact_qa_train.parquet")
    sft_dataset_splits["test"].to_parquet("fact_qa_test.parquet")
    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows for supervised fine tuning.")
else:
    print("SFT Tokenized dataset already exists locally.")
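To make the shifted-target construction concrete, here is a toy 6-token chunk (the token ids are made up, with the eos id assumed to be 0 for illustration):

```python
eos = 0  # assumed eos_token_id, for illustration only

input_tokens = [5, 8, 3, eos, eos, eos]   # a padded chunk
target_tokens = input_tokens[1:] + [eos]  # shifted left by one token
print(target_tokens)
# [8, 3, 0, 0, 0, 0]
```

At every position, the target is simply the token that comes next in the input, which is exactly the next-token prediction objective.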
Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.
✅ Saved 440000 train rows and 500 test rows for supervised fine tuning.
A very similar training loop can be used for supervised fine tuning.
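The objective is unchanged from pretraining: next-token cross entropy over every position. A minimal sketch of the loss computation with illustrative shapes (batch=2, seq_len=4, vocab=10):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 4, 10)          # (batch, seq_len, vocab) — stand-in for model output
targets = torch.randint(0, 10, (2, 4))  # next-token ids, one per position

# Flatten so every token position is treated as one classification example
loss = F.cross_entropy(logits.view(-1, 10), targets.view(-1))
print(loss.item())  # positive scalar averaged over all 8 positions
```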
# Example config:
batch_size = 64
sequence_len = 128
num_steps = 50000
accumulation_steps = 100

# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")

# Convert dataset to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])

# Create DataLoaders for training and testing
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

use_existing_model = os.path.exists("./sft_final.pth")

# Check if a fine tuned model already exists
if use_existing_model:
    model = torch.load("./sft_final.pth")
    print("Loaded fine tuned model from ./sft_final.pth, skipping training loop.")
else:
    # For SFT we start with the pretrained model
    model = torch.load("./pretrain_final.pth")

    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

    # Scheduler with dynamic step size
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.2, patience=10, min_lr=5e-6, threshold=1e-4)

    # Training loop
    losses = []
    test_losses = []
    accumulator = 0       # steps since the last logging/evaluation pass
    accumulator_loss = 0  # running sum of train losses for averaging
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)

        logits = model(train_input)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights
        optimizer.step()
        optimizer.zero_grad()

        accumulator += 1
        accumulator_loss += loss.item()
        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0

            # Evaluate on the held-out test split
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
            test_losses.append(test_loss / test_accumulator)
            print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}")

            # Recreate the test dataloader for the next evaluation pass
            test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
            scheduler.step(test_losses[-1])

        if (i + 1) % 50000 == 0:  # parenthesized: `i+1 % 50000` would always be truthy
            torch.save(model.state_dict(), f"./sft_model_checkpoint_{i}.pt")
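One detail worth seeing in isolation: `ReduceLROnPlateau` only cuts the learning rate after the monitored test loss fails to improve for `patience` consecutive evaluations. A standalone sketch with the same scheduler settings as above (the dummy parameter and flat metric are made up for illustration):

```python
import torch

# Dummy parameter/optimizer just to host a learning rate
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=5e-4)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode='min', factor=0.2, patience=10, min_lr=5e-6, threshold=1e-4)

# Feed a flat (non-improving) metric: once patience=10 is exhausted,
# the learning rate is multiplied by factor=0.2
for _ in range(12):
    sched.step(1.0)
print(opt.param_groups[0]['lr'])  # ~1e-4, reduced from 5e-4
```

This is why the LR in the pretraining log above eventually bottoms out at the `min_lr` of 5e-6 once the test loss plateaus.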
Step 100/50000, Loss: 1.9918309170007706, Test Loss: 0.7444852367043495
Step 1000/50000, Loss: 0.6548500311374664, Test Loss: 0.6566447466611862
Step 5000/50000, Loss: 0.5989030289649964, Test Loss: 0.5991090014576912
Step 10000/50000, Loss: 0.5584480208158493, Test Loss: 0.5752606242895126
Step 20000/50000, Loss: 0.5228851708769798, Test Loss: 0.5447744950652122
[... log truncated ...] Step 
21900/50000, Loss: 0.509901857972145, Test Loss: 0.5442083030939102 Step 22000/50000, Loss: 0.5127878683805466, Test Loss: 0.5434536635875702 Step 22100/50000, Loss: 0.5088658592104912, Test Loss: 0.5432475730776787 Step 22200/50000, Loss: 0.503545526266098, Test Loss: 0.5430518910288811 Step 22300/50000, Loss: 0.509440864622593, Test Loss: 0.5440312474966049 Step 22400/50000, Loss: 0.504312878549099, Test Loss: 0.5440322384238243 Step 22500/50000, Loss: 0.5105202877521515, Test Loss: 0.5433548018336296 Step 22600/50000, Loss: 0.5068865966796875, Test Loss: 0.543488435447216 Step 22700/50000, Loss: 0.5061589232087136, Test Loss: 0.543672688305378 Step 22800/50000, Loss: 0.5050221633911133, Test Loss: 0.54234429448843 Step 22900/50000, Loss: 0.5060675710439682, Test Loss: 0.5421250462532043 Step 23000/50000, Loss: 0.5028709018230438, Test Loss: 0.5417612642049789 Step 23100/50000, Loss: 0.5007539334893226, Test Loss: 0.5417554974555969 Step 23200/50000, Loss: 0.5032428222894668, Test Loss: 0.5419873222708702 Step 23300/50000, Loss: 0.4976426234841347, Test Loss: 0.5417612642049789 Step 23400/50000, Loss: 0.5037221732735634, Test Loss: 0.5416203439235687 Step 23500/50000, Loss: 0.5026455554366112, Test Loss: 0.541393369436264 Step 23600/50000, Loss: 0.5041168466210365, Test Loss: 0.5412039160728455 Step 23700/50000, Loss: 0.499170041680336, Test Loss: 0.5414468348026276 Step 23800/50000, Loss: 0.49886377930641174, Test Loss: 0.5415316671133041 Step 23900/50000, Loss: 0.5008612561225891, Test Loss: 0.5415767431259155 Step 24000/50000, Loss: 0.5027919802069664, Test Loss: 0.541602335870266 Step 24100/50000, Loss: 0.4981607446074486, Test Loss: 0.5416673794388771 Step 24200/50000, Loss: 0.5020068144798279, Test Loss: 0.5414736419916153 Step 24300/50000, Loss: 0.49790079534053805, Test Loss: 0.5420249551534653 Step 24400/50000, Loss: 0.4945983332395554, Test Loss: 0.5420288145542145 Step 24500/50000, Loss: 0.4918954423069954, Test Loss: 0.5416672825813293 Step 
24600/50000, Loss: 0.4987477654218674, Test Loss: 0.5413290336728096 Step 24700/50000, Loss: 0.4991921219229698, Test Loss: 0.5409770160913467 Step 24800/50000, Loss: 0.5160313996672631, Test Loss: 0.5407248362898827 Step 24900/50000, Loss: 0.5089896723628045, Test Loss: 0.5407268553972244 Step 25000/50000, Loss: 0.5133795315027236, Test Loss: 0.540549747645855 Step 25100/50000, Loss: 0.5142266032099724, Test Loss: 0.5403367578983307 Step 25200/50000, Loss: 0.5106305554509163, Test Loss: 0.5404777601361275 Step 25300/50000, Loss: 0.514853127002716, Test Loss: 0.5402576178312302 Step 25400/50000, Loss: 0.5116986194252968, Test Loss: 0.540235199034214 Step 25500/50000, Loss: 0.5178939509391784, Test Loss: 0.5401464626193047 Step 25600/50000, Loss: 0.5128523615002633, Test Loss: 0.5403817817568779 Step 25700/50000, Loss: 0.5107192620635033, Test Loss: 0.5402729734778404 Step 25800/50000, Loss: 0.5151638248562813, Test Loss: 0.540238045156002 Step 25900/50000, Loss: 0.5079279285669327, Test Loss: 0.5405196249485016 Step 26000/50000, Loss: 0.5137662106752395, Test Loss: 0.5403135642409325 Step 26100/50000, Loss: 0.5081922098994255, Test Loss: 0.5401801690459251 Step 26200/50000, Loss: 0.5133265718817711, Test Loss: 0.5400106459856033 Step 26300/50000, Loss: 0.5071106451749802, Test Loss: 0.5401815176010132 Step 26400/50000, Loss: 0.5103961524367332, Test Loss: 0.5401192381978035 Step 26500/50000, Loss: 0.5102493834495544, Test Loss: 0.5405233800411224 Step 26600/50000, Loss: 0.5107258838415146, Test Loss: 0.540264330804348 Step 26700/50000, Loss: 0.5076950311660766, Test Loss: 0.5403445139527321 Step 26800/50000, Loss: 0.508688750565052, Test Loss: 0.5403786599636078 Step 26900/50000, Loss: 0.5121384325623513, Test Loss: 0.5402709916234016 Step 27000/50000, Loss: 0.5072655525803565, Test Loss: 0.5402552932500839 Step 27100/50000, Loss: 0.5062821558117867, Test Loss: 0.5403524935245514 Step 27200/50000, Loss: 0.5098704776167869, Test Loss: 0.5401039719581604 Step 
27300/50000, Loss: 0.5111740103363991, Test Loss: 0.5400716066360474 Step 27400/50000, Loss: 0.506802773475647, Test Loss: 0.5399189367890358 Step 27500/50000, Loss: 0.5019376620650291, Test Loss: 0.5398764163255692 Step 27600/50000, Loss: 0.5078087207674981, Test Loss: 0.539827011525631 Step 27700/50000, Loss: 0.49983825743198396, Test Loss: 0.5397513061761856 Step 27800/50000, Loss: 0.501948290169239, Test Loss: 0.5397870093584061 Step 27900/50000, Loss: 0.503223777115345, Test Loss: 0.5396886020898819 Step 28000/50000, Loss: 0.4988864180445671, Test Loss: 0.5397669076919556 Step 28100/50000, Loss: 0.5000418615341187, Test Loss: 0.539934940636158 Step 28200/50000, Loss: 0.5037692028284073, Test Loss: 0.5399342402815819 Step 28300/50000, Loss: 0.5065342092514038, Test Loss: 0.5399300828576088 Step 28400/50000, Loss: 0.49868389159440996, Test Loss: 0.5400566309690475 Step 28500/50000, Loss: 0.49928132712841033, Test Loss: 0.5400507599115372 Step 28600/50000, Loss: 0.5010258322954178, Test Loss: 0.5399254709482193 Step 28700/50000, Loss: 0.49798482537269595, Test Loss: 0.540141187608242 Step 28800/50000, Loss: 0.5000904050469398, Test Loss: 0.5402135774493217 Step 28900/50000, Loss: 0.5013855120539665, Test Loss: 0.5401042699813843 Step 29000/50000, Loss: 0.4986867892742157, Test Loss: 0.5401381179690361 Step 29100/50000, Loss: 0.49214051008224485, Test Loss: 0.5402148738503456 Step 29200/50000, Loss: 0.4973500269651413, Test Loss: 0.5402180105447769 Step 29300/50000, Loss: 0.4951749128103256, Test Loss: 0.5403081923723221 Step 29400/50000, Loss: 0.49519805639982223, Test Loss: 0.5403984859585762 Step 29500/50000, Loss: 0.4949836692214012, Test Loss: 0.5403021425008774 Step 29600/50000, Loss: 0.49874858170747755, Test Loss: 0.5402622520923615 Step 29700/50000, Loss: 0.5020810437202453, Test Loss: 0.5400966182351112 Step 29800/50000, Loss: 0.5026862496137618, Test Loss: 0.5399602949619293 Step 29900/50000, Loss: 0.5002500921487808, Test Loss: 0.5399456769227982 Step 
30000/50000, Loss: 0.4972523659467697, Test Loss: 0.5399295464158058 Step 30100/50000, Loss: 0.49687450855970383, Test Loss: 0.540132001042366 Step 30200/50000, Loss: 0.4967585051059723, Test Loss: 0.5400266796350479 Step 30300/50000, Loss: 0.5007252091169357, Test Loss: 0.5401216298341751 Step 30400/50000, Loss: 0.49998929440975187, Test Loss: 0.5400102064013481 Step 30500/50000, Loss: 0.49902217119932174, Test Loss: 0.5399828031659126 Step 30600/50000, Loss: 0.4965375950932503, Test Loss: 0.5401788353919983 Step 30700/50000, Loss: 0.49841643542051317, Test Loss: 0.5401521772146225 Step 30800/50000, Loss: 0.49873902708292006, Test Loss: 0.5401478558778763 Step 30900/50000, Loss: 0.4986496239900589, Test Loss: 0.5402649715542793 Step 31000/50000, Loss: 0.4976932209730148, Test Loss: 0.540209136903286 Step 31100/50000, Loss: 0.4981874457001686, Test Loss: 0.5402702316641808 Step 31200/50000, Loss: 0.4924463939666748, Test Loss: 0.5404428169131279 Step 31300/50000, Loss: 0.4926974037289619, Test Loss: 0.5405407398939133 Step 31400/50000, Loss: 0.4927178320288658, Test Loss: 0.5404405742883682 Step 31500/50000, Loss: 0.4961189603805542, Test Loss: 0.5402893200516701 Step 31600/50000, Loss: 0.502119165956974, Test Loss: 0.5400217920541763 Step 31700/50000, Loss: 0.5106057554483414, Test Loss: 0.539872981607914 Step 31800/50000, Loss: 0.5088860255479812, Test Loss: 0.5397540032863617 Step 31900/50000, Loss: 0.5092998299002648, Test Loss: 0.5397043526172638 Step 32000/50000, Loss: 0.5100585091114044, Test Loss: 0.5396628379821777 Step 32100/50000, Loss: 0.508292889893055, Test Loss: 0.5396415516734123 Step 32200/50000, Loss: 0.514079284965992, Test Loss: 0.5395453348755836 Step 32300/50000, Loss: 0.5094090312719345, Test Loss: 0.5395669266581535 Step 32400/50000, Loss: 0.5140978255867958, Test Loss: 0.5395563840866089 Step 32500/50000, Loss: 0.5108561027050018, Test Loss: 0.5395280569791794 Step 32600/50000, Loss: 0.508008286356926, Test Loss: 0.5395460426807404 Step 
32700/50000, Loss: 0.5117560002207756, Test Loss: 0.5395319536328316 Step 32800/50000, Loss: 0.5059553095698357, Test Loss: 0.5396299511194229 Step 32900/50000, Loss: 0.5118668919801712, Test Loss: 0.5395850613713264 Step 33000/50000, Loss: 0.5050633817911148, Test Loss: 0.5395438820123672 Step 33100/50000, Loss: 0.5107882875204086, Test Loss: 0.5395044758915901 Step 33200/50000, Loss: 0.5029929068684578, Test Loss: 0.5395789965987206 Step 33300/50000, Loss: 0.5088842526078224, Test Loss: 0.5396023169159889 Step 33400/50000, Loss: 0.5078984281420708, Test Loss: 0.5396974086761475 Step 33500/50000, Loss: 0.5050894203782081, Test Loss: 0.5396722480654716 Step 33600/50000, Loss: 0.5062958505749703, Test Loss: 0.5396731942892075 Step 33700/50000, Loss: 0.5092896395921707, Test Loss: 0.5396324619650841 Step 33800/50000, Loss: 0.5085639423131942, Test Loss: 0.5396186858415604 Step 33900/50000, Loss: 0.5020746979117393, Test Loss: 0.5396337956190109 Step 34000/50000, Loss: 0.5059955576062203, Test Loss: 0.539700098335743 Step 34100/50000, Loss: 0.5049130553007126, Test Loss: 0.5396822318434715 Step 34200/50000, Loss: 0.5080492258071899, Test Loss: 0.539713479578495 Step 34300/50000, Loss: 0.5059011596441269, Test Loss: 0.539629191160202 Step 34400/50000, Loss: 0.5042031505703926, Test Loss: 0.5396181717514992 Step 34500/50000, Loss: 0.5040303328633309, Test Loss: 0.5395740866661072 Step 34600/50000, Loss: 0.5003925260901451, Test Loss: 0.5395526438951492 Step 34700/50000, Loss: 0.5008617109060287, Test Loss: 0.5396481305360794 Step 34800/50000, Loss: 0.5013332226872445, Test Loss: 0.5395035743713379 Step 34900/50000, Loss: 0.4993538862466812, Test Loss: 0.5396741777658463 Step 35000/50000, Loss: 0.49984502136707304, Test Loss: 0.5397976487874985 Step 35100/50000, Loss: 0.5055657151341438, Test Loss: 0.5397628992795944 Step 35200/50000, Loss: 0.5025000894069671, Test Loss: 0.5398406684398651 Step 35300/50000, Loss: 0.4979968535900116, Test Loss: 0.5399135574698448 Step 
35400/50000, Loss: 0.5027144092321396, Test Loss: 0.5398437827825546 Step 35500/50000, Loss: 0.49810495495796203, Test Loss: 0.5398181229829788 Step 35600/50000, Loss: 0.49869739830493925, Test Loss: 0.539912611246109 Step 35700/50000, Loss: 0.4993435364961624, Test Loss: 0.540031224489212 Step 35800/50000, Loss: 0.5010734874010087, Test Loss: 0.539951279759407 Step 35900/50000, Loss: 0.4961173927783966, Test Loss: 0.5399811267852783 Step 36000/50000, Loss: 0.49067836165428164, Test Loss: 0.5400380566716194 Step 36100/50000, Loss: 0.4983882322907448, Test Loss: 0.5400507599115372 Step 36200/50000, Loss: 0.49570639073848727, Test Loss: 0.5401374697685242 Step 36300/50000, Loss: 0.49497527778148653, Test Loss: 0.5402403101325035 Step 36400/50000, Loss: 0.4943352746963501, Test Loss: 0.5401946380734444 Step 36500/50000, Loss: 0.5013378807902336, Test Loss: 0.5400375798344612 Step 36600/50000, Loss: 0.5009453481435776, Test Loss: 0.5398918315768242 Step 36700/50000, Loss: 0.49999902069568636, Test Loss: 0.5398625582456589 Step 36800/50000, Loss: 0.4986932471394539, Test Loss: 0.5397756099700928 Step 36900/50000, Loss: 0.4988508015871048, Test Loss: 0.5398235246539116 Step 37000/50000, Loss: 0.49667459309101103, Test Loss: 0.5399428531527519 Step 37100/50000, Loss: 0.4973614087700844, Test Loss: 0.5398567691445351 Step 37200/50000, Loss: 0.4985528939962387, Test Loss: 0.5399526283144951 Step 37300/50000, Loss: 0.5001018795371056, Test Loss: 0.5398374199867249 Step 37400/50000, Loss: 0.4983680948615074, Test Loss: 0.5398883894085884 Step 37500/50000, Loss: 0.49329759150743485, Test Loss: 0.5400660261511803 Step 37600/50000, Loss: 0.5014468815922737, Test Loss: 0.5399715825915337 Step 37700/50000, Loss: 0.49771746158599856, Test Loss: 0.5399486199021339 Step 37800/50000, Loss: 0.4979658380150795, Test Loss: 0.5401404649019241 Step 37900/50000, Loss: 0.4978672149777412, Test Loss: 0.5400984436273575 Step 38000/50000, Loss: 0.4970002031326294, Test Loss: 0.5400849655270576 
Step 38100/50000, Loss: 0.49319078475236894, Test Loss: 0.5402737855911255 Step 38200/50000, Loss: 0.4919032683968544, Test Loss: 0.540255218744278 Step 38300/50000, Loss: 0.49013256341218947, Test Loss: 0.5403067767620087 Step 38400/50000, Loss: 0.49624019861221313, Test Loss: 0.5401993095874786 Step 38500/50000, Loss: 0.5070158150792122, Test Loss: 0.539745643734932 Step 38600/50000, Loss: 0.5106859561800957, Test Loss: 0.5397117808461189 Step 38700/50000, Loss: 0.5072772273421288, Test Loss: 0.5395833551883698 Step 38800/50000, Loss: 0.5087108266353607, Test Loss: 0.5395811051130295 Step 38900/50000, Loss: 0.5086913874745369, Test Loss: 0.539511114358902 Step 39000/50000, Loss: 0.5101708760857582, Test Loss: 0.5394201874732971 Step 39100/50000, Loss: 0.5126127651333809, Test Loss: 0.5393865928053856 Step 39200/50000, Loss: 0.5107463571429253, Test Loss: 0.5393551588058472 Step 39300/50000, Loss: 0.5136496108770371, Test Loss: 0.5393457487225533 Step 39400/50000, Loss: 0.5069820827245712, Test Loss: 0.539389118552208 Step 39500/50000, Loss: 0.5111293032765388, Test Loss: 0.5393685102462769 Step 39600/50000, Loss: 0.5073767012357712, Test Loss: 0.5394056066870689 Step 39700/50000, Loss: 0.5081001633405685, Test Loss: 0.5394421964883804 Step 39800/50000, Loss: 0.5096010833978653, Test Loss: 0.539382092654705 Step 39900/50000, Loss: 0.5044914534687996, Test Loss: 0.5393995344638824 Step 40000/50000, Loss: 0.5096197336912155, Test Loss: 0.5393679961562157 Step 40100/50000, Loss: 0.5065052941441536, Test Loss: 0.5393980294466019 Step 40200/50000, Loss: 0.5058723211288452, Test Loss: 0.5394647270441055 Step 40300/50000, Loss: 0.5071479165554047, Test Loss: 0.5395227521657944 Step 40400/50000, Loss: 0.5047757157683372, Test Loss: 0.5394424721598625 Step 40500/50000, Loss: 0.5058568915724755, Test Loss: 0.5395368114113808 Step 40600/50000, Loss: 0.5097119781374931, Test Loss: 0.5394527763128281 Step 40700/50000, Loss: 0.505585141479969, Test Loss: 0.5394986569881439 Step 
40800/50000, Loss: 0.505511117875576, Test Loss: 0.5394658669829369 Step 40900/50000, Loss: 0.5023908773064614, Test Loss: 0.5395566001534462 Step 41000/50000, Loss: 0.506652555167675, Test Loss: 0.5394731014966965 Step 41100/50000, Loss: 0.507455221414566, Test Loss: 0.5395409390330315 Step 41200/50000, Loss: 0.5043680673837662, Test Loss: 0.5394297987222672 Step 41300/50000, Loss: 0.5050949704647064, Test Loss: 0.5395004898309708 Step 41400/50000, Loss: 0.501577168405056, Test Loss: 0.5394115447998047 Step 41500/50000, Loss: 0.5013879150152206, Test Loss: 0.5393913984298706 Step 41600/50000, Loss: 0.5003279143571854, Test Loss: 0.5394631996750832 Step 41700/50000, Loss: 0.5016007670760154, Test Loss: 0.5393634587526321 Step 41800/50000, Loss: 0.49823493242263794, Test Loss: 0.5395021811127663 Step 41900/50000, Loss: 0.49926349729299546, Test Loss: 0.5396210700273514 Step 42000/50000, Loss: 0.5054442447423935, Test Loss: 0.5396415814757347 Step 42100/50000, Loss: 0.50202587723732, Test Loss: 0.539661668241024 Step 42200/50000, Loss: 0.4960733976960182, Test Loss: 0.5397117212414742 Step 42300/50000, Loss: 0.5030219155550003, Test Loss: 0.5396576225757599 Step 42400/50000, Loss: 0.4974089801311493, Test Loss: 0.5397517904639244 Step 42500/50000, Loss: 0.49898953795433043, Test Loss: 0.5396833717823029 Step 42600/50000, Loss: 0.4989222511649132, Test Loss: 0.5398036763072014 Step 42700/50000, Loss: 0.5000161120295524, Test Loss: 0.5398104265332222 Step 42800/50000, Loss: 0.4932199516892433, Test Loss: 0.5398836359381676 Step 42900/50000, Loss: 0.4919256231188774, Test Loss: 0.5398688986897469 Step 43000/50000, Loss: 0.4985833743214607, Test Loss: 0.5398799479007721 Step 43100/50000, Loss: 0.4971668937802315, Test Loss: 0.5400027558207512 Step 43200/50000, Loss: 0.49428201764822005, Test Loss: 0.5399693325161934 Step 43300/50000, Loss: 0.4935976222157478, Test Loss: 0.540062665939331 Step 43400/50000, Loss: 0.5008321458101272, Test Loss: 0.5398197025060654 Step 
43500/50000, Loss: 0.5021645167469978, Test Loss: 0.5396685376763344 Step 43600/50000, Loss: 0.49895587891340254, Test Loss: 0.53972277790308 Step 43700/50000, Loss: 0.49847632586956026, Test Loss: 0.5396436974406242 Step 43800/50000, Loss: 0.4981184619665146, Test Loss: 0.5397310703992844 Step 43900/50000, Loss: 0.49537402898073196, Test Loss: 0.539734773337841 Step 44000/50000, Loss: 0.498755077123642, Test Loss: 0.5396531820297241 Step 44100/50000, Loss: 0.4986911591887474, Test Loss: 0.5397176146507263 Step 44200/50000, Loss: 0.5000812029838562, Test Loss: 0.5396725684404373 Step 44300/50000, Loss: 0.49518556147813797, Test Loss: 0.5398207157850266 Step 44400/50000, Loss: 0.49464826852083205, Test Loss: 0.5398502722382545 Step 44500/50000, Loss: 0.5006935778260231, Test Loss: 0.5397677645087242 Step 44600/50000, Loss: 0.49818256229162217, Test Loss: 0.5398132503032684 Step 44700/50000, Loss: 0.49620632976293566, Test Loss: 0.5399656072258949 Step 44800/50000, Loss: 0.49781897008419035, Test Loss: 0.5399337857961655 Step 44900/50000, Loss: 0.49707835257053373, Test Loss: 0.5399251356720924 Step 45000/50000, Loss: 0.49177643567323687, Test Loss: 0.5400127246975899 Step 45100/50000, Loss: 0.4867770627140999, Test Loss: 0.5401008650660515 Step 45200/50000, Loss: 0.49645036190748215, Test Loss: 0.5400954708456993 Step 45300/50000, Loss: 0.4933033186197281, Test Loss: 0.5400835126638412 Step 45400/50000, Loss: 0.5118939998745918, Test Loss: 0.5395145788788795 Step 45500/50000, Loss: 0.5083662039041519, Test Loss: 0.5394980236887932 Step 45600/50000, Loss: 0.5080907043814659, Test Loss: 0.5394219979643822 Step 45700/50000, Loss: 0.509898764193058, Test Loss: 0.5393904894590378 Step 45800/50000, Loss: 0.5071603581309319, Test Loss: 0.5393557548522949 Step 45900/50000, Loss: 0.5108535829186439, Test Loss: 0.5392103716731071 Step 46000/50000, Loss: 0.5099755051732063, Test Loss: 0.5392666682600975 Step 46100/50000, Loss: 0.5120785105228424, Test Loss: 0.5391957387328148 
Step 46200/50000, Loss: 0.5120513769984245, Test Loss: 0.5391103476285934 Step 46300/50000, Loss: 0.5055394196510314, Test Loss: 0.539196103811264 Step 46400/50000, Loss: 0.5103425335884094, Test Loss: 0.5391378477215767 Step 46500/50000, Loss: 0.5065678125619888, Test Loss: 0.5392791852355003 Step 46600/50000, Loss: 0.507869057059288, Test Loss: 0.5393331274390221 Step 46700/50000, Loss: 0.5094516175985336, Test Loss: 0.5392355620861053 Step 46800/50000, Loss: 0.5073900875449181, Test Loss: 0.5392070561647415 Step 46900/50000, Loss: 0.5064733734726906, Test Loss: 0.5392192676663399 Step 47000/50000, Loss: 0.5072105729579925, Test Loss: 0.5391889214515686 Step 47100/50000, Loss: 0.5037602409720421, Test Loss: 0.539274089038372 Step 47200/50000, Loss: 0.5082229962944984, Test Loss: 0.5393294245004654 Step 47300/50000, Loss: 0.5041952931880951, Test Loss: 0.5392620787024498 Step 47400/50000, Loss: 0.5062178316712379, Test Loss: 0.5393047109246254 Step 47500/50000, Loss: 0.5090050908923149, Test Loss: 0.539303220808506 Step 47600/50000, Loss: 0.5047084537148475, Test Loss: 0.5393324419856071 Step 47700/50000, Loss: 0.5045905429124832, Test Loss: 0.5393098443746567 Step 47800/50000, Loss: 0.5039082843065262, Test Loss: 0.5393583551049232 Step 47900/50000, Loss: 0.5065311944484711, Test Loss: 0.5392666757106781 Step 48000/50000, Loss: 0.5071438843011856, Test Loss: 0.5393249616026878 Step 48100/50000, Loss: 0.5004611688852311, Test Loss: 0.5393214821815491 Step 48200/50000, Loss: 0.5083751136064529, Test Loss: 0.5392718464136124 Step 48300/50000, Loss: 0.4980458009243012, Test Loss: 0.539208821952343 Step 48400/50000, Loss: 0.499648708999157, Test Loss: 0.5392308905720711 Step 48500/50000, Loss: 0.5020219036936759, Test Loss: 0.539259284734726 Step 48600/50000, Loss: 0.49979176551103593, Test Loss: 0.5391851142048836 Step 48700/50000, Loss: 0.49803889751434327, Test Loss: 0.5393860414624214 Step 48800/50000, Loss: 0.49939071893692016, Test Loss: 0.5394993796944618 Step 
48900/50000, Loss: 0.5076718246936798, Test Loss: 0.5394344180822372 Step 49000/50000, Loss: 0.5004595616459846, Test Loss: 0.5394701510667801 Step 49100/50000, Loss: 0.4938800597190857, Test Loss: 0.5395425483584404 Step 49200/50000, Loss: 0.502016199529171, Test Loss: 0.5394943058490753 Step 49300/50000, Loss: 0.49617588609457014, Test Loss: 0.5396153926849365 Step 49400/50000, Loss: 0.49824530065059663, Test Loss: 0.5396235510706902 Step 49500/50000, Loss: 0.5010240325331687, Test Loss: 0.5395853817462921 Step 49600/50000, Loss: 0.49718677312135695, Test Loss: 0.5396591052412987 Step 49700/50000, Loss: 0.49210875332355497, Test Loss: 0.5397243052721024 Step 49800/50000, Loss: 0.49780592769384385, Test Loss: 0.5396812334656715 Step 49900/50000, Loss: 0.4927149027585983, Test Loss: 0.5397619679570198 Step 50000/50000, Loss: 0.4986410740017891, Test Loss: 0.5398522838950157
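Assuming the reported losses are mean per-token cross-entropy in nats (the usual convention for `nn.CrossEntropyLoss`), the final values from the log above can be converted into perplexity, which is often easier to interpret than raw loss:

```python
import math

# Final values taken from the training log above. Perplexity = exp(loss)
# assumes the loss is the mean per-token negative log-likelihood in nats.
final_train_loss = 0.4986410740017891
final_test_loss = 0.5398522838950157

train_ppl = math.exp(final_train_loss)  # ~1.65
test_ppl = math.exp(final_test_loss)    # ~1.72
print(f"train perplexity: {train_ppl:.3f}")
print(f"test perplexity: {test_ppl:.3f}")
```

A perplexity this low reflects the narrow, templated Q&A data, not general language modeling ability.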
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./sft_loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Supervised Fine Tuning - Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()
if not use_existing_model:
    torch.save(model, "./sft_final.pth")
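Note that `torch.save(model, ...)` pickles the entire module, which ties the checkpoint to the exact class definition. A more portable pattern is to save only the `state_dict` and rebuild the architecture before loading. A minimal sketch (the tiny `nn.Linear` and the `tiny_sft.pth` path are placeholders for illustration):

```python
import torch
import torch.nn as nn

# Saving a state_dict stores only the parameters, so the checkpoint does not
# depend on pickling the model class itself (unlike torch.save(model, ...)).
tiny = nn.Linear(4, 2)
torch.save(tiny.state_dict(), "./tiny_sft.pth")

# To restore, rebuild the same architecture and load the weights into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("./tiny_sft.pth"))
restored.eval()  # disable dropout etc. before running inference
```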
With the fine tuned model, we can perform a more natural form of inference. Instead of formatting all of our prompts as next token prediction, we can interact with the model in a more natural Q&A style format.
We are using a very small model and a very small dataset compared to modern LLMs, so our model is not going to perform well on most questions. However, it outputs responses that are at least related to the prompt and correctly formatted. It is very cool to see the LLM starting to come together! As we scale up the model and data, the responses will become more factual, realistic, and contextually accurate. At this point, the majority of the responses are hallucinations.
def sft_inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    prompt = "<Question>" + prompt + "</Question>" + "<Answer>" # Wrap the prompt in <Question> tags and start inference with <Answer>
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt) # Tokenize the prompt
        for _ in range(max_new_tokens):
            if tokens[-1] == hf_tokenizer.eos_token_id: # Stop if the model emits the end-of-sequence token
                break
            num_tokens = len(tokens)
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens) # Pad the sequence with eos tokens
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded) # Forward pass through the model
            probabilities = torch.softmax(logits[0, num_tokens - 1, :], dim=-1) # Distribution over the next token
            predicted_token = torch.argmax(probabilities).item() # Greedy decoding, change to sampling for more diversity
            tokens.append(predicted_token)
    # Strip the text to between the <Answer></Answer> tags
    full_answer = hf_tokenizer.decode(tokens)
    answer_start = full_answer.find("<Answer>") + len("<Answer>")
    answer_end = full_answer.find("</Answer>")
    if answer_end == -1: # The model never closed the tag; keep everything after <Answer>
        answer_end = len(full_answer)
    return full_answer[answer_start:answer_end]
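The greedy argmax above always picks the single most likely token, which is one reason answers can fall into repetitive loops. A minimal temperature / top-k sampling sketch that could replace the argmax line (the function name and default values here are just illustrative choices, not part of the original notebook):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = logits / temperature
    # Restrict sampling to the top_k most likely tokens
    topk_vals, topk_idx = torch.topk(scaled, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Draw one token from the truncated distribution
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()
```

Inside the generation loop, `predicted_token = sample_next_token(logits[0, num_tokens - 1, :])` would replace the argmax; higher temperatures give more diverse but less reliable answers.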
print("Predicted:", sft_inference("Who is the most powerful leader in the west?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the sun?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the ocean", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many planets are in the solar system", model, max_new_tokens=20))
print("Predicted:", sft_inference("What three countries are in north america?", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many eyes do humans have?", model, max_new_tokens=20))
Predicted: The President of the Republic of the Republic of the Congo
Predicted: Yellow
Predicted: Red
Predicted: Two
Predicted: United States and Canada
Predicted: Two-four-four-four-four-four-four-four-four-four