[PyTorch Deep Learning Project] Text Generation
‣ This post is a write-up of what I learned while studying the book ⌜실전! 파이토치 딥러닝 프로젝트⌟.
What is text generation?
The process by which an AI model produces new text from existing text.
Think of it as the branch of natural language processing (NLP) that covers generating text such as news articles, poems, code, scripts, novels, and magazine copy. Text generation is widely used wherever text has to be produced creatively, for example songwriting, headlines and titles for articles, and drafting emails.
How text generation works in generative AI
Text generation uses a machine learning model pre-trained on a large body of text to produce new text. The model learns the statistical relationships between words and phrases and, by feeding each prediction back in as input for the next one, generates new text that resembles its training data.
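To make this concrete, here is a tiny sketch of my own (not from the book; the vocabulary and probabilities are invented): generating text boils down to repeated next-word prediction, where the model assigns a probability to every word in the vocabulary and the next word is chosen or sampled from that distribution.

# tiny illustration (not from the book); vocabulary and probabilities are invented
import torch

vocab = ["sunny", "cloudy", "rainy"]             # hypothetical vocabulary
next_word_probs = torch.tensor([0.6, 0.3, 0.1])  # hypothetical model output
next_word = vocab[torch.multinomial(next_word_probs, 1).item()]
print(next_word)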
Two representative types of text generation models
Seq2Seq (Sequence-to-Sequence) models:
A Seq2Seq model transforms one sequence into another sequence (e.g., translation, summarization, dialogue generation).
It usually consists of two parts, an encoder and a decoder. More recently, transformer-based text generators are used instead; they are an evolution of the Seq2Seq model that can handle long-range dependencies without RNNs, relying on the self-attention mechanism described in an earlier post.
Generative Adversarial Networks (GANs):
A GAN consists of two networks, a generator and a discriminator, that compete with each other during training. The generator tries to produce data that looks real, while the discriminator tries to tell generated data apart from real data. GANs are used not only for text generation but also, very widely, for image generation. They tend to be hard to train stably, but they are an innovative family of models that can deliver strong results.
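As a rough sketch of my own (not from the book; the layer sizes and data dimensions are arbitrary), the adversarial setup looks like this in PyTorch:

# minimal GAN sketch (illustrative only; sizes are arbitrary)
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 32
generator = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(8, noise_dim)    # a batch of random noise
fake = generator(z)              # the generator tries to produce realistic samples
d_fake = discriminator(fake)     # the discriminator scores how "real" they look
# training pushes d_fake toward 1 for the generator and toward 0 for the discriminator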
Generation strategies for text generation models
What is a generation strategy?
When we generate text with a trained model, we usually predict one word at a time and then join the predicted word sequence into the output text. Each time we repeat the word prediction, we have to specify how the next word is found or predicted given the previous K predictions. Such a method is called a text generation strategy.
<Greedy search>
The name greedy comes from the fact that, no matter how many time steps lie ahead, the model picks the word with the highest probability at the current iteration. Because the strategy never adopts a low-probability word, it can miss a high-probability word that happens to be hiding behind a low-probability one.
The diagram illustrates the result of the exercise later in this post. At each time step the text generation model outputs the candidate words together with their probabilities. Looking more closely, at every generation step the model followed the greedy search strategy and picked the word with the highest probability. At the second-to-last step, however, the words System, People, and Future were predicted with similar probabilities, and the model simply took the highest one, System. This is the main limitation of greedy search. Greedy search also lacks randomness, so it tends to produce repetitive output.
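Here is a minimal greedy-decoding sketch of my own, assuming a placeholder model that returns next-token logits for a 1-D token sequence (model and tokens are assumptions, not code from the book):

# minimal greedy-search sketch (illustrative; `model` and `tokens` are assumed)
import torch

def greedy_decode(model, tokens, steps):
    for _ in range(steps):
        logits = model(tokens)           # (seq_len, vocab_size) logits
        next_id = logits[-1].argmax()    # always take the most probable next token
        tokens = torch.cat([tokens, next_id.view(1)])
    return tokens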
<Beam search>
Beam search builds on greedy search by maintaining a list of candidate sequences ranked by the probability of the whole predicted sequence rather than just the next-word probability. The following diagram makes this easier to follow.
The diagram illustrates the beam search exercise: with a beam size of 3, beam search produced 3 candidate sequences of 5 words each. Looking more closely, the 3 most likely candidate sequences are kept at every iteration. As the sequence grows, the number of possible candidates grows exponentially, but we only care about the top 3, so only those are tracked. This makes it less likely to miss a potentially better sequence than greedy search does. However, beam search still tends to produce monotonous output.
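For comparison, here is a minimal beam-search sketch of my own, again assuming a placeholder model that returns next-token logits; it keeps the beam_size sequences with the highest cumulative log-probability:

# minimal beam-search sketch (illustrative; `model` and `tokens` are assumed)
import torch

def beam_search(model, tokens, steps, beam_size=3):
    beams = [(tokens, 0.0)]    # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            log_probs = model(seq)[-1].log_softmax(-1)
            top_lp, top_id = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp, top_id):
                candidates.append((torch.cat([seq, idx.view(1)]), score + lp.item()))
        # keep only the beam_size best sequences by total log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams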
<Top-k and Top-p sampling>
Instead of always taking the next word with the highest probability, we can randomly sample a word from the set of possible next words according to their relative probabilities. In the greedy search diagram above, for example, Be, Know, and Show have probabilities of 0.7, 0.2, and 0.1. Rather than always picking Be, we sample one of these three words at random based on those probabilities. If we repeated this 10 times and generated 10 separate texts, be would be picked about 7 times, know about 2 times, and show about once. This goes a long way toward fixing the monotonous output of the earlier strategies.
The most widely used sampling methods are top-k and top-p. Top-k sampling predefines a parameter k, the number of candidate words to sample the next word from. Every other word is discarded and the probabilities of the top k words are renormalized. For be, know, and show with k = 2, show is dropped, the two highest-probability words are kept, and their probabilities are renormalized to about 0.78 and 0.22.
Top-p sampling, instead of fixing the top k words, defines a cumulative probability threshold p and keeps words until their cumulative probability reaches p. With the probabilities above, if p is between 0.7 and 0.9, know and show are dropped; if p is between 0.9 and 1.0, only show is dropped; and if p is 1.0, all three words be, know, and show are kept.
In short, using these sampling methods together lets us generate more creative text than the earlier, more deterministic generation strategies; a small sketch follows.
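Here is a small sketch of my own that reproduces the be/know/show numbers above:

# top-k / top-p sketch using the be, know, show example (illustrative only)
import torch

probs = torch.tensor([0.7, 0.2, 0.1])    # be, know, show

# top-k with k = 2: drop everything outside the top 2, then renormalize
k = 2
topk_probs, topk_ids = probs.topk(k)
topk_probs = topk_probs / topk_probs.sum()    # roughly [0.78, 0.22]

# top-p with p = 0.95: keep words until the cumulative probability reaches p
p = 0.95
sorted_probs, sorted_ids = probs.sort(descending=True)
keep = sorted_probs.cumsum(0) <= p            # keeps be and know, drops show
topp_probs = sorted_probs[keep] / sorted_probs[keep].sum()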
<Implementing a transformer-based text generator - PyTorch>
In this exercise we build a text generator on top of a transformer-based language model.
Text generation with a language model
import math
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
import torchtext
from torchtext.data.utils import get_tokenizer
class Transformer(nn.Module):
    def __init__(self, num_token, num_inputs, num_heads, num_hidden, num_layers, dropout=0.3):
        super(Transformer, self).__init__()
        self.model_name = 'transformer'
        self.mask_source = None
        self.position_enc = PosEnc(num_inputs, dropout)
        layers_enc = TransformerEncoderLayer(num_inputs, num_heads, num_hidden, dropout)
        self.enc_transformer = TransformerEncoder(layers_enc, num_layers)
        self.enc = nn.Embedding(num_token, num_inputs)
        self.num_inputs = num_inputs
        self.dec = nn.Linear(num_inputs, num_token)
        self.init_params()

    def _gen_sqr_nxt_mask(self, size):
        msk = (torch.triu(torch.ones(size, size)) == 1).transpose(0, 1)
        msk = msk.float().masked_fill(msk == 0, float('-inf'))
        msk = msk.masked_fill(msk == 1, float(0.0))
        return msk

    def init_params(self):
        initial_rng = 0.12
        self.enc.weight.data.uniform_(-initial_rng, initial_rng)
        self.dec.bias.data.zero_()
        self.dec.weight.data.uniform_(-initial_rng, initial_rng)

    def forward(self, source):
        if self.mask_source is None or self.mask_source.size(0) != len(source):
            dvc = source.device
            msk = self._gen_sqr_nxt_mask(len(source)).to(dvc)
            self.mask_source = msk
        source = self.enc(source) * math.sqrt(self.num_inputs)
        source = self.position_enc(source)
        op = self.enc_transformer(source, self.mask_source)
        op = self.dec(op)
        return op
class PosEnc(nn.Module):
    def __init__(self, d_m, dropout=0.2, size_limit=5000):
        super(PosEnc, self).__init__()
        self.dropout = nn.Dropout(dropout)
        p_enc = torch.zeros(size_limit, d_m)
        pos = torch.arange(0, size_limit, dtype=torch.float).unsqueeze(1)
        divider = torch.exp(torch.arange(0, d_m, 2).float() * (-math.log(10000.0) / d_m))
        p_enc[:, 0::2] = torch.sin(pos * divider)
        p_enc[:, 1::2] = torch.cos(pos * divider)
        p_enc = p_enc.unsqueeze(0).transpose(0, 1)
        self.register_buffer('p_enc', p_enc)

    def forward(self, x):
        return self.dropout(x + self.p_enc[:x.size(0), :])
# note: this uses torchtext's legacy Field/dataset API, which newer torchtext releases have moved or removed
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"), lower=True, eos_token='<eos>', init_token='<sos>')
training_text, validation_text, testing_text = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(training_text)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def gen_batches(text_dataset, batch_size):
    text_dataset = TEXT.numericalize([text_dataset.examples[0].text])
    # divide text dataset into parts of size equal to batch_size
    num_batches = text_dataset.size(0) // batch_size
    # remove data points that lie outside batches (remainders)
    text_dataset = text_dataset.narrow(0, 0, num_batches * batch_size)
    # distribute dataset across batches evenly
    text_dataset = text_dataset.view(batch_size, -1).t().contiguous()
    return text_dataset.to(device)
training_batch_size = 32
evaluation_batch_size = 16
training_data = gen_batches(training_text, training_batch_size)
validation_data = gen_batches(validation_text, evaluation_batch_size)
testing_data = gen_batches(testing_text, evaluation_batch_size)
max_seq_len = 64
def return_batch(src, k):
    sequence_length = min(max_seq_len, len(src) - 1 - k)
    sequence_data = src[k:k+sequence_length]
    sequence_label = src[k+1:k+1+sequence_length].view(-1)
    return sequence_data, sequence_label
num_tokens = len(TEXT.vocab.stoi) # vocabulary size
embedding_size = 256 # dimension of embedding layer
num_hidden_params = 256 # transformer encoder's hidden (feed forward) layer dimension
num_layers = 2 # num of transformer encoder layers within transformer encoder
num_heads = 2 # num of heads in (multi head) attention models
dropout = 0.25 # value (fraction) of dropout
loss_func = nn.CrossEntropyLoss()
lrate = 4.0 # learning rate
transformer_model = Transformer(num_tokens, embedding_size, num_heads, num_hidden_params, num_layers,
dropout).to(device)
optim_module = torch.optim.SGD(transformer_model.parameters(), lr=lrate)
sched_module = torch.optim.lr_scheduler.StepLR(optim_module, 1.0, gamma=0.88)
def train_model():
    transformer_model.train()
    loss_total = 0.
    time_start = time.time()
    num_tokens = len(TEXT.vocab.stoi)
    for b, i in enumerate(range(0, training_data.size(0) - 1, max_seq_len)):
        train_data_batch, train_label_batch = return_batch(training_data, i)
        optim_module.zero_grad()
        op = transformer_model(train_data_batch)
        loss_curr = loss_func(op.view(-1, num_tokens), train_label_batch)
        loss_curr.backward()
        torch.nn.utils.clip_grad_norm_(transformer_model.parameters(), 0.6)
        optim_module.step()
        loss_total += loss_curr.item()
        interval = 100
        if b % interval == 0 and b > 0:
            loss_interval = loss_total / interval
            time_delta = time.time() - time_start
            # `ep` is the epoch counter set in the training loop below
            print(f"epoch {ep}, {b}/{len(training_data)//max_seq_len} batches, training loss {loss_interval:.2f}, training perplexity {math.exp(loss_interval):.2f}")
            loss_total = 0
            time_start = time.time()
def eval_model(eval_model_obj, eval_data_source):
    eval_model_obj.eval()
    loss_total = 0.
    num_tokens = len(TEXT.vocab.stoi)
    with torch.no_grad():
        for j in range(0, eval_data_source.size(0) - 1, max_seq_len):
            eval_data, eval_label = return_batch(eval_data_source, j)
            op = eval_model_obj(eval_data)
            op_flat = op.view(-1, num_tokens)
            loss_total += len(eval_data) * loss_func(op_flat, eval_label).item()
    return loss_total / (len(eval_data_source) - 1)
min_validation_loss = float("inf")
eps = 50
best_model_so_far = None
for ep in range(1, eps + 1):
    ep_time_start = time.time()
    train_model()
    validation_loss = eval_model(transformer_model, validation_data)
    print()
    print(f"epoch {ep:}, validation loss {validation_loss:.2f}, validation perplexity {math.exp(validation_loss):.2f}")
    print()
    if validation_loss < min_validation_loss:
        min_validation_loss = validation_loss
        best_model_so_far = transformer_model
    sched_module.step()
epoch 1, 100/1018 batches, training loss 8.63, training perplexity 5614.45
epoch 1, 200/1018 batches, training loss 7.23, training perplexity 1380.31
epoch 1, 300/1018 batches, training loss 6.79, training perplexity 892.50
epoch 1, 400/1018 batches, training loss 6.55, training perplexity 701.84
epoch 1, 500/1018 batches, training loss 6.45, training perplexity 634.57
epoch 1, 600/1018 batches, training loss 6.32, training perplexity 553.86
epoch 1, 700/1018 batches, training loss 6.24, training perplexity 513.65
epoch 1, 800/1018 batches, training loss 6.13, training perplexity 459.07
epoch 1, 900/1018 batches, training loss 6.11, training perplexity 450.48
epoch 1, 1000/1018 batches, training loss 6.07, training perplexity 433.88
epoch 1, validation loss 5.82, validation perplexity 337.70
...
epoch 50, 100/1018 batches, training loss 4.45, training perplexity 85.55
epoch 50, 200/1018 batches, training loss 4.38, training perplexity 79.68
epoch 50, 300/1018 batches, training loss 4.39, training perplexity 80.61
epoch 50, 400/1018 batches, training loss 4.39, training perplexity 80.27
epoch 50, 500/1018 batches, training loss 4.39, training perplexity 80.31
epoch 50, 600/1018 batches, training loss 4.38, training perplexity 80.17
epoch 50, 700/1018 batches, training loss 4.41, training perplexity 82.47
epoch 50, 800/1018 batches, training loss 4.26, training perplexity 71.00
epoch 50, 900/1018 batches, training loss 4.33, training perplexity 76.24
epoch 50, 1000/1018 batches, training loss 4.36, training perplexity 78.51
epoch 50, validation loss 4.98, validation perplexity 145.72
testing_loss = eval_model(best_model_so_far, testing_data)
print(f"testing loss {testing_loss:.2f}, testing perplexity {math.exp(testing_loss):.2f}")
testing loss 4.92, testing perplexity 136.85
Once the model is trained, it is best to save it locally so that it does not have to be retrained from scratch.
So we save the model as follows.
# save the model
mdl_pth = './transformer.pth'
torch.save(best_model_so_far.state_dict(), mdl_pth)
We then load the saved model and extend the language model into a text generation model.
# load the best trained model
transformer_cached = Transformer(num_tokens, embedding_size, num_heads, num_hidden_params, num_layers,
dropout).to(device)
transformer_cached.load_state_dict(torch.load(mdl_pth))
<All keys matched successfully>
# define the target number of words to generate, then feed the model an initial word sequence as a cue
ln = 10
sntc = 'It will _'
sntc_split = sntc.split()
# generate one word at a time in a loop
torch.manual_seed(799)
with torch.no_grad():
    for i in range(ln):
        sntc = ' '.join(sntc_split)
        txt_ds = TEXT.numericalize([sntc_split])
        num_b = txt_ds.size(0)
        txt_ds = txt_ds.narrow(0, 0, num_b)
        txt_ds = txt_ds.view(1, -1).t().contiguous().to(device)
        ev_X, _ = return_batch(txt_ds, i+1)
        op = transformer_cached(ev_X)
        op_flat = op.view(-1, num_tokens)
        res = TEXT.vocab.itos[op_flat.argmax(1)[0]]
        sntc_split.insert(-1, res)
print(sntc[:-2])
It will be used to the first season , and the
At every iteration we append the word predicted in that iteration to the input sequence, and this extended sequence becomes the model's input in the next iteration. A random seed is set for reproducibility; changing the seed produces different text, as the code above shows.
This shows that once a language model (here, a transformer-based one) has been trained, it only takes a few extra lines to use it for text generation.
<Using pre-trained GPT-2 as a text generator - PyTorch>
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# instantiate the GPT2Tokenizer and the language model
torch.manual_seed(799)
tkz = GPT2Tokenizer.from_pretrained("gpt2")
mdl = GPT2LMHeadModel.from_pretrained('gpt2')
ln = 10
cue = "It will"
gen = tkz.encode(cue)
ctx = torch.tensor([gen])
To generate different text each time, just change the seed.
prv = None
for i in range(ln):
    # note: this matches the older transformers API, where the model returns (logits, past);
    # recent versions use `past_key_values=` and return an output object with a .logits field
    op, prv = mdl(ctx, past=prv)
    tkn = torch.argmax(op[..., -1, :])
    gen += [tkn.tolist()]
    ctx = tkn.unsqueeze(0)
seq = tkz.decode(gen)
print(seq)
It will be interesting to see how the new system works out
Given an input word sequence, we repeatedly use the language model to predict the next word; at each iteration, the predicted word is appended to the input sequence used in the next iteration.
# text generation using greedy search
ip_ids = tkz.encode(cue, return_tensors='pt')
op_greedy = mdl.generate(ip_ids, max_length=ln)
seq = tkz.decode(op_greedy[0], skip_special_tokens=True)
print(seq)
It will be interesting to see how the new system
The code above does much the same as the earlier for loop, but this time using the transformers library. Its output is one word shorter than the previously generated sentence because in GPT-2's generate call the max_length argument also counts the cue words.
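If you want the cue words excluded from that count, recent versions of the transformers library also accept a max_new_tokens argument in generate() (whether it is available depends on your installed version):

# sketch: count only newly generated tokens (assumes a recent transformers version)
op_greedy = mdl.generate(ip_ids, max_new_tokens=ln)
seq = tkz.decode(op_greedy[0], skip_special_tokens=True)
print(seq)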
# text generation using beam search
op_beam = mdl.generate(
    ip_ids,
    max_length=5,
    num_beams=3,
    num_return_sequences=3,
)
for op_beam_cur in op_beam:
    print(tkz.decode(op_beam_cur, skip_special_tokens=True))
It will be interesting to
It will be a long
It will be a great
# text generation using top-k sampling
for i in range(3):
    torch.manual_seed(i)
    op = mdl.generate(
        ip_ids,
        do_sample=True,
        max_length=5,
        top_k=2
    )
    seq = tkz.decode(op[0], skip_special_tokens=True)
    print(seq)
It will also be a
It will be a long
It will also be interesting
# greedy search for comparison - different seeds still give identical output
for i in range(3):
    torch.manual_seed(i)
    op_greedy = mdl.generate(ip_ids, max_length=5)
    seq = tkz.decode(op_greedy[0], skip_special_tokens=True)
    print(seq)
It will be interesting to
It will be interesting to
It will be interesting to
# text generation using top-p (nucleus) sampling
for i in range(3):
    torch.manual_seed(i)
    op = mdl.generate(
        ip_ids,
        do_sample=True,
        max_length=5,
        top_p=0.75,
        top_k=0
    )
    seq = tkz.decode(op[0], skip_special_tokens=True)
    print(seq)
It will require work in
It will be an interesting
It will likely be important