ChatGPT: Everything You Need to Know
GPT-3 (Generative Pretrained Transformer 3) is a state-of-the-art Large Language Model (LLM) developed by OpenAI. It has 175 billion parameters, making it one of the largest and most powerful LLMs in existence. This allows it to generate human-like text and perform a wide variety of natural language tasks, such as translation, summarization, and question answering. ChatGPT (which is based on GPT-3) has been trained on a vast amount of text data, which gives it a deep understanding of natural language and allows it to generate text that is often difficult to distinguish from text written by humans.
Did you read that? What you just read was fully generated by ChatGPT, the new tool released by OpenAI, after we asked it "What is GPT-3?"
So, what really is GPT-3? In this article we give an overview of how ChatGPT works, covering:
- A brief history behind the Generative Pretrained Transformer (GPT) models.
- Transformers.
- Encoder and Decoder models.
- GPT-3 and Meta-Learning
- GPT-2 code using Hugging Face.
- ChatGPT experiments in different domains.
A brief history behind the GPT models
Improving Language Understanding by Generative Pre-Training [1], published by OpenAI in June 2018, was the first GPT paper. It used the Transformer architecture [2] with generative pre-training followed by supervised fine-tuning on specific natural language tasks.
In February 2019, a second paper, "Language Models are Unsupervised Multitask Learners" [3], introduced GPT-2, the largest language model at the time and the first multitask language model that could perform well on several tasks "without task-specific training," in a semi-supervised manner [4].
Then, in May 2020, OpenAI trained GPT-3, presented in the paper "Language Models are Few-Shot Learners" [5]: the biggest language model at the time, roughly 100x bigger than GPT-2, with 175 billion parameters. It outperforms all previous GPT models while sharing the same underlying principles.
The researchers trained GPT-3 on data from Common Crawl, WebText, Wikipedia, and a corpus of books in an unsupervised way, showing that a large language model trained on enough data without supervision can perform well across many tasks.
Transformers
The main building block of commonly used language models is the Transformer block [2], which is composed of encoders and decoders, as shown in Figure 1.
Transformers in language models are used for various applications, such as:
- Text classification
- Text summarization
- Sentiment analysis
- Text generation
The attention layer is the core idea of a Transformer. It can be described as follows:
- The first step is word encoding, which assigns each word (token) a unique integer ID.
- Each word is embedded into a feature vector space with a fixed dimension d, say 512 (an architecture choice that keeps the computation of multi-headed attention constant). For more about word embeddings, see [6].
- Next, the algorithm linearly projects each vector into 3 values (Query, Key, and Value).
- Multiplying the Query and Key matrices yields higher values for correlated words, which mainly carry the general context of the text (Fig 2).
The output vector for each word now contains information about each vector (Word) in the sequence. This means that the algorithm attends to correlated words by multiplying the Query (Q) and Key (K) of each word with each other. The idea is similar to a cosine similarity matrix or the adjacency matrix from the Graph Laplacian equation.
The Softmax function makes the outputs positive and ensures they sum to 1, determining how strongly each word is emphasized at this position. Clearly, the word at this position will usually have the highest Softmax score, but sometimes it's useful to attend to another word that is relevant to the current one.
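To make these steps concrete, below is a minimal NumPy sketch of single-head scaled dot-product attention for a toy sequence; the dimensions and the random projection matrices are illustrative assumptions, not values from any real model:
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
seq_len, d = 4, 8                      # toy sequence of 4 words, embedding dimension d = 8
X = np.random.randn(seq_len, d)        # word embeddings for the sequence

# Learned linear projections (random here) map each embedding to Query, Key, and Value
W_q, W_k, W_v = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d, d)
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)          # Q.K^T: higher values for correlated words
weights = softmax(scores, axis=-1)     # each row is positive and sums to 1
output = weights @ V                   # each output vector mixes information from all words

print(weights.round(2))                # attention matrix, similar to a similarity matrix
print(output.shape)                    # (4, 8): one context-aware vector per word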
Encoder and Decoder Text generation models in NLP
While the original Transformer is an encoder-decoder model, many language models are encoder-only or decoder-only. Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and, more generally, word classification), and extractive question answering. Examples: BERT, RoBERTa, and DistilBERT.
Decoder models use only the decoder of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models. The pretraining of decoder models usually revolves around predicting the next word in the sentence. These models are best suited for tasks involving text generation. Examples: GPT, GPT-2, and GPT-3.
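As a quick illustration of the two families, the sketch below uses the Hugging Face pipeline API to load a typical encoder-only model (BERT) for masked-word prediction and a decoder-only model (GPT-2) for text generation; the prompts are just made-up examples:
from transformers import pipeline

# Encoder-only model: looks at the whole sentence at once, here used to fill in a masked word
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The goal of a language model is to [MASK] the next word."))

# Decoder-only model: generates text left to right, one token at a time
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_length=20))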
ChatGPT is an autoregressive model, meaning that each output token (word) is fed back as input to predict the next word, until the sequence is completely generated. Model size varies with the data it was trained on; the largest GPT-2 model, for instance, was trained on WebText and has about 1.5 billion parameters.
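To see what "autoregressive" means in practice, here is a minimal sketch of a greedy generation loop with the small GPT-2 model from Hugging Face Transformers (assuming a recent version of the library); the prompt and the number of steps are arbitrary:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer.encode("The weather today is", return_tensors="tf")
for _ in range(15):
    logits = model(ids).logits                                # (1, seq_len, vocab_size)
    next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)
    ids = tf.concat([ids, next_id[:, tf.newaxis]], axis=-1)   # feed the prediction back in as input
print(tokenizer.decode(ids[0]))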
The difference between GPT models and BERT can be summarized as follows:
- GPT models are decoder-only models, while BERT is an encoder-only model (Fig 3).
- GPT models produce one output token at a time.
- GPT models are autoregressive: each output token is appended to the input sequence to predict the next word, much like the next-word predictor on a mobile keyboard (Fig 4).
If you want to get a deeper look into transformers, refer to this blog post [5].
GPT-3
The GPT-3 model [5] is a multitask zero-shot learning model. If you're not familiar with zero-shot learning (meta-learning), you can refer to this survey [7]. In simple terms, meta-learning is an approach that involves providing the model with support and query examples (labeled examples and the examples of interest, respectively) to obtain the desired output (see Figure 5).
Meta-learning is a potential solution to the problem of needing a large labeled dataset for every new task. In the context of language models, meta-learning involves training the model to develop a wide range of skills and pattern recognition abilities, which it can then use during inference to quickly adapt to or recognize the target task.
Meta-Learning Techniques
The authors of the GPT-3 paper employed several techniques in their model:
- They assumed that Large Language models do not necessarily require large supervised datasets to learn most language tasks. Instead, they utilized numerous unrelated supervised tasks during training and transferred the weights of each task to the next in an unsupervised manner (refer to Figure 6).
- During the evaluation stage, the researchers tested GPT-3 under three conditions:
- (a) “few-shot learning” which involves in-context learning and permits as many demonstrations as possible within the model’s context window (usually 10 to 100)
- (b) “one-shot learning” which only allows for one demonstration
- (c) “zero-shot learning” (refer to Figure 7).
- To test their hypothesis, the researchers trained a 175 billion parameter model.
As shown in Figure 6, the training process involves utilizing multiple tasks for various purposes, and for each task, the training begins from the previous task's checkpoint. This approach employs an inner loop per task, which is referred to as "in-context learning" in the original paper, and an outer loop for unsupervised learning.
In Figure 7, you are presented with three inference options:
- Zero-shot, which involves the model relying solely on its understanding of the task without any example answers.
- One-shot, where one example is provided to assist with the task.
- Few-shot, where multiple examples are given to aid in the completion of the desired task (a prompt sketch follows this list).
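To make the three settings concrete, here is a sketch of how the prompts differ, using the English-to-French translation format from the GPT-3 paper (the example pairs are for illustration):
# Zero-shot: a task description only, no solved examples
zero_shot = "Translate English to French:\ncheese =>"

# One-shot: one solved example before the query
one_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "cheese =>")

# Few-shot: several solved examples before the query
few_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "peppermint => menthe poivrée\n"
            "plush giraffe => girafe en peluche\n"
            "cheese =>")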
ChatGPT also employs another trick: Reinforcement Learning from Human Feedback (RLHF). This technique is discussed in detail in a separate blog post, with implementation details available in the first open-source implementation related to it [8].
For those who are new to NLP, we recommend our Ultimate Guide To Natural Language Processing (NLP). It will take you through the process of encoding your tokens (words) and developing advanced text classification models step-by-step.
GPT-2 Inference Code
In this notebook, we will dive into the world of text generation using a GPT-2 model. This state-of-the-art language model was trained on a massive 40GB of internet text data, enabling it to predict the next word or sequence of words with impressive accuracy. While the fully trained model was initially withheld due to concerns about its potential misuse, smaller versions are accessible for enthusiasts to experiment with, and we will be using one of them in this notebook.
As we explore text generation with GPT-2, we will also take a closer look at different decoding methods such as Beam Search, Top-K Sampling, and Top-P Sampling. Through various demonstrations, we will showcase the performance of these methods and highlight their unique features.
At its core, a language model is a machine learning model that can analyze a sentence and accurately predict the next word or sequence of words. GPT-2 takes this one step further and can generate sophisticated text on a much larger scale. For context, the smallest version of GPT-2 has 117 million parameters, while the largest model has over 1.5 billion parameters, which was initially not released publicly.
Luckily, importing the model and tokenizer is easy with Hugging Face Transformers, which supports both PyTorch and TensorFlow. In this notebook, we will be using TensorFlow, but PyTorch is equally simple to use. After installing the library with a simple "!pip install transformers" command, we can access the model and tokenizer effortlessly.
# For reproducibility
SEED = 34
# Maximum number of tokens in the output text (including the prompt)
MAX_LEN = 70
# Input prompt to Complete
input_sequence = "I don't know about you, but there's only one thing I want to do after a long day of work"
Import the GPT2 Model and the Tokenizer:
# Import transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
# Get large GPT2 tokenizer and GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
# View model parameters
GPT2.summary()
II. Different Decoding Methods – First Pass (Greedy Search)
The greedy search algorithm simply predicts the word with the highest probability as the next word, i.e. it updates the sequence via:
w_t = argmax_w P(w | w_{1:t-1})
at each timestep t. Let's see how this naïve approach performs:
# import Tensorflow
import tensorflow as tf
tf.random.set_seed(SEED)
input_sequence
Now, let’s tokenize the input sequence using the tokenizer
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')
input_ids.shape
# Output: TensorShape([1, 23])
Now, let’s generate text
# generate text until the output length (which includes the context length) reaches MAX_LEN
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)
print("Output:\n")
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))
And there we go: generating text is that easy. Our results are not great, though; as we can see, the model starts repeating itself rather quickly. Greedy search has a major issue: a high-probability word can be hidden behind a low-probability word that comes right before it, so the model never explores more diverse combinations of words. We can mitigate this by implementing beam search:
III. Different Decoding Methods – Beam Search with N-Gram Penalties
Beam search is a modified version of Greedy Search that allows the model to track and maintain multiple hypotheses at each time step. By setting the parameter ‘num_beams,’ we can specify the number of hypotheses the model should keep. This approach enables the model to compare and explore different paths while generating text. We can also set a n-gram penalty by defining the ‘no_repeat_ngram_size’ parameter, which ensures that no 2-grams are repeated. To get a better understanding of the output, we will set ‘num_return_sequences’ to 5, allowing us to compare the different beams.
To use Beam Search, we need to modify the parameters in the ‘generate’ function, which is simple and straightforward:
# set return_num_sequences > 1
beam_outputs = GPT2.generate(
input_ids,
max_length = MAX_LEN,
num_beams = 5,
no_repeat_ngram_size = 2,
num_return_sequences = 5,
early_stopping = True
)
print('')
print("Output:\n" + 100 * '-')
# now we have 5 output sequences
for i, beam_output in enumerate(beam_outputs):
print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
Great, we have a much clearer output now! While the 5 beam hypotheses we generated are quite similar, increasing the ‘num_beams’ parameter would result in more diverse outputs. However, it’s worth noting that Beam Search isn’t perfect either. It works best when generating text of a consistent length, such as in translation or summarization tasks. But for open-ended problems like dialogue or story generation, it can be challenging to find the right balance between ‘num_beams’ and ‘no_repeat_ngram_size.’
IV. Different Decoding Methods – Basic Sampling
Now, let’s explore the concept of indeterministic decoding – sampling. Instead of strictly following a path to find the text with the highest probability, we randomly select the next word based on its conditional probability distribution:
w_t ~ P(w | w_{1:t-1})
However, this randomness can lead to incoherent generated text [11]. To address this, we can use the ‘temperature’ parameter: with a temperature below 1, the distribution is sharpened, which increases the likelihood of selecting high-probability words and decreases the chances of selecting low-probability words during sampling.
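A tiny NumPy sketch of what the temperature does to the next-word distribution (the logit values are made up for illustration):
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy next-token logits
print(softmax(logits))               # original sampling distribution
print(softmax(logits / 0.8))         # temperature 0.8: sharper, favors the most likely word
print(softmax(logits / 1.5))         # temperature 1.5: flatter, more random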
To implement sampling, we need to set the ‘do_sample’ parameter to True. For demonstration purposes, we’ll set ‘top_k’ to 0:
# Use temperature to decrease the sensitivity to low probability candidates
sample_output = GPT2.generate(
input_ids,
do_sample = True,
max_length = MAX_LEN,
top_k = 0,
temperature = 0.8
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))
# Output:
#----------------------------------------------------------------------
# I don't know about you, but there's only one thing I want to do after a long day of work: put on a pair of lightweight jeans. (I don't know if I even have enough jeans on to wear them for another 30 days, though.) Lucky for me, I had a pair of these in my closet along with my
V. Different Decoding Methods – Top-K and Top-P Sampling
As you may have already guessed, we can use both Top-K and Top-P sampling to improve the sampling process. This approach reduces the likelihood of generating uncommon or low-probability words while still allowing for a dynamic selection size. To implement this method, we simply specify values for the ‘top_k’ and/or ‘top_p’ parameters, and we can keep the temperature parameter if we want to. Let's first see how the model performs with Top-K alone, and then combine it with Top-P and compare several returned sequences to see how diverse the answers are.
# Sample from only top_k most likely words
sample_output = GPT2.generate(
input_ids,
do_sample = True,
max_length = MAX_LEN,
top_k = 50
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')
# Output:
#---------------------------------------------------------------------
# I don't know about you, but there's only one thing I want to do after a long day of work: I want to relax with a good book on my lap. One thing that I know for sure is that the most important factor in writing a good book is the author's ability to write a good book.
Top-K Sampling seems to generate more coherent text than our random sampling before.
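To combine the two filters, we can pass ‘top_k’ and ‘top_p’ together and ask for several sequences to compare; the values below are just reasonable defaults to experiment with, not settings from the original notebook:
# Combine Top-K with Top-P (nucleus) sampling and return several candidates
sample_outputs = GPT2.generate(
input_ids,
do_sample = True,
max_length = MAX_LEN,
top_k = 50,
top_p = 0.9,
num_return_sequences = 3
)
print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
Each filter can also be used on its own; combining them simply applies both constraints before sampling.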
ChatGPT – Our experiments
ChatGPT, developed by OpenAI, is a cutting-edge Large Language Model that interacts with users in a natural and conversational manner. Its advanced capabilities enable it to provide insightful and nuanced answers to complex questions, even in highly technical and demanding fields.
In order to test ChatGPT’s ability to answer questions related to Machine Learning, we conducted a series of experiments. The results of these experiments are presented below:
Figures 7 and 8 illustrate how ChatGPT excels at answering machine learning questions. In addition to providing code, ChatGPT goes beyond that by attempting to explain the concepts to the readers, making the answers more accessible and informative.
Furthermore, ChatGPT can also help you with debugging by providing solutions to code errors. For instance, you can input a piece of code with an error and ask ChatGPT to solve it for you, as shown below:
You could also ask ChatGPT to explain a piece of code to you. The next code block can be used for generating images with Generative Adversarial Networks (GANs); training GANs is a fairly complex task that involves several steps and hyperparameters:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Define the Generator network
class Generator(nn.Module):
    def __init__(self, nz):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(nz, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, 784),
            nn.Tanh()
        )

    def forward(self, input):
        return self.main(input)
# Define the Discriminator network
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, input):
        input = input.view(-1, 784)
        return self.main(input)
# Initialize the Generator and Discriminator networks
nz = 100 # size of the latent vector
generator = Generator(nz)
discriminator = Discriminator()
# Define the loss functions
criterion = nn.BCELoss()
# Define the optimizers
lr = 0.0002
beta1 = 0.5
optimizerG = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerD = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, 0.999))
# Load the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
# Train the GAN
num_epochs = 50
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
generator.to(device)
discriminator.to(device)
for epoch in range(num_epochs):
    for i, data in enumerate(dataloader, 0):
        real_images, _ = data
        real_images = real_images.to(device)

        # Train the Discriminator
        discriminator.zero_grad()
        real_labels = torch.full((real_images.size(0), 1), 1.0, device=device)
        fake_labels = torch.full((real_images.size(0), 1), 0.0, device=device)

        # Train on real images
        real_outputs = discriminator(real_images)
        real_loss = criterion(real_outputs, real_labels)
        real_loss.backward()

        # Train on fake images
        noise = torch.randn(real_images.size(0), nz, device=device)
        fake_images = generator(noise)
        fake_outputs = discriminator(fake_images.detach())
        fake_loss = criterion(fake_outputs, fake_labels)
        fake_loss.backward()

        # Update the Discriminator parameters
        optimizerD.step()

        # Train the Generator
        generator.zero_grad()
        noise = torch.randn(real_images.size(0), nz, device=device)
        fake_images = generator(noise)
        fake_outputs = discriminator(fake_images)
        # The Generator tries to make the Discriminator label its fakes as real
        gen_loss = criterion(fake_outputs, real_labels)
        gen_loss.backward()

        # Update the Generator parameters
        optimizerG.step()
Let's see how ChatGPT can explain this code block; it's rather advanced and not at all beginner-friendly:
Pretty cool, Huh?!
Conclusion
Although many people have expressed satisfaction with ChatGPT, there is an ongoing debate surrounding its effectiveness and reliability. While some domain experts have criticized its output, citing instances where it made incorrect decisions and misled users, others have been highly pleased with the results.
As with any new technology, there are both strengths and weaknesses to consider. It is our belief that the model will continue to improve in the near future, leaving us all in awe of the era we are living in. We encourage readers to experiment with the model and share their feedback from their unique domain perspectives.
References & Resources
- [1] Improving Language Understanding by Generative Pre-Training
- [2] Attention Is All You Need
- [3] Language Models are Unsupervised Multitask Learners
- [4] An Overview of Deep Semi-Supervised Learning
- [5] Language Models are Few-Shot Learners
- [6] What is Word Embedding | Word2Vec | GloVe
- [7] Meta-Learning in Neural Networks: A Survey
- [8] First open-source implementation of ChatGPT
- [9] Text classification using DistilBERT
- [10] ChatGPT by OpenAI
- [11] The Curious Case of Neural Text Degeneration