GPT Technical Evolutionary History (1)

Jinpeng Zhang
8 min read · Feb 15, 2025


GPT, short for Generative Pre-trained Transformer, is a series of large-scale language models developed by OpenAI. To understand the technical evolutionary history of the GPT models, I went through the GPT technical papers and blog posts published by OpenAI. In this blog I will summarize the key techniques each version of GPT adopted and the journey GPT has made, to give you a clear technical insight into the GPT models.

GPT-1 (2018)

Improving Language Understanding by Generative Pre-Training is the paper that introduced GPT-1. It demonstrated that large gains on language understanding tasks such as textual entailment, question answering, semantic similarity assessment, and document classification can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. That language model is GPT-1.

The basic architecture of GPT-1 is shown in the left part of the following diagram. It is a multi-layer Transformer (if you are not familiar with the Transformer, please refer to the “transformer clear explanation” blog), a variant of the original architecture that keeps only the decoder part. There are 12 identical layers, and each layer consists of a masked multi-head self-attention sub-layer followed by a feed-forward network sub-layer.

GPT-1 Architecture
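If it helps to see the layer structure in code, here is a minimal PyTorch sketch of a GPT-1-style block. The dimensions (768 hidden units, 12 heads, 3072-dim feed-forward, GELU, 0.1 dropout) follow the paper, but the class itself is a simplified illustration, not OpenAI's implementation.

```python
import torch
import torch.nn as nn

class GPT1Block(nn.Module):
    """One GPT-1-style decoder layer: masked multi-head self-attention plus a
    feed-forward network, each followed by a residual add and LayerNorm (post-LN)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn_out)       # post-LN: LayerNorm(residual + sublayer)
        x = self.ln2(x + self.ff(x))
        return x

# GPT-1 stacks 12 such layers on top of token + learned position embeddings.
blocks = nn.ModuleList([GPT1Block() for _ in range(12)])
```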

The training procedure of GPT-1 consists of two stages: 1) learning a high-capacity language model on a large corpus of unlabeled text; 2) a fine-tuning stage, where the model is adapted to a discriminative task with labeled data.

In the unsupervised pre-training stage, given an unsupervised corpus of tokens U = {u1, . . . , un}, a standard language modeling objective was used to maximize the following likelihood:
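L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)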

where k is the size of the context window and the conditional probability P is modeled by a neural network with parameters Θ.

After training the model with the above objective, GPT-1 adapts the parameters to the supervised target task. It assumes a labeled dataset C, where each instance consists of a sequence of input tokens x1, . . . , xm along with a label y. The inputs are passed through the pre-trained model to obtain the final transformer block’s activation h, which is then fed into an added linear output layer with parameters Wy to predict y:
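P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h W_y)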

This gives the following objective to maximize:
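L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)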

During the GPT-1 training process, OpenAI found that including language modeling as an auxiliary objective during fine-tuning helped learning by improving the generalization of the supervised model and accelerating convergence. So the goal of the supervised fine-tuning stage is to optimize the following combined objective (with weight λ):
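L_3(C) = L_2(C) + \lambda \cdot L_1(C)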

In the unsupervised pre-training stage, the BooksCorpus dataset was used, which contains over 7,000 unique unpublished books from a variety of genres. Residual, embedding, and attention dropouts with a rate of 0.1 were applied for regularization, and a modified version of L2 regularization proposed in [Fixing Weight Decay Regularization in Adam] (decoupled weight decay, later known as AdamW) was employed in GPT-1.
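Putting the pieces together, here is a minimal PyTorch-style sketch of the combined fine-tuning loss. The `model` interface here is a hypothetical stand-in; λ = 0.5 and the 6.25e-5 fine-tuning learning rate are the values reported in the paper, but the rest is illustrative, not OpenAI's code.

```python
import torch
import torch.nn.functional as F

def fine_tune_loss(model, input_ids, label, lm_weight=0.5):
    """Combined fine-tuning objective L3 = L2 + lambda * L1 (illustrative sketch).

    `model` is assumed to return per-token LM logits and a task logit computed
    from the final transformer block's activation.
    """
    lm_logits, task_logits = model(input_ids)

    # L2: supervised task loss on the label y.
    task_loss = F.cross_entropy(task_logits, label)

    # L1: auxiliary language-modeling loss (predict the next token).
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return task_loss + lm_weight * lm_loss

# Paired with Adam plus decoupled weight decay (AdamW); learning rate from the paper.
# optimizer = torch.optim.AdamW(model.parameters(), lr=6.25e-5)
```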

After fine-tuning, GPT-1, a general task-agnostic model, outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 of the 12 tasks studied.

GPT-1 scores on NLP tasks

GPT-2 (2019)

Language Models are Unsupervised Multitask Learners introduced GPT-2, a language model trained without any explicit supervision on millions of webpages (WebText).

The key differences between GPT-2 and GPT-1 are:

  • GPT-2 has around 1.5B parameters, roughly 13x GPT-1’s 117 million.
  • GPT-2 is trained on millions of unlabeled webpages without any explicit supervision, and it is evaluated on NLP tasks in a zero-shot setting, versus GPT-1’s per-task fine-tuning.
  • In GPT-2, layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization was added after the final self-attention block. In the original Transformer and GPT-1, LayerNorm is applied after the residual connection, i.e. Output = LayerNorm(x + SubLayer(x)); in GPT-2 it becomes Output = x + SubLayer(LayerNorm(x)), inspired by pre-activation residual networks (see the sketch after this list).
  • In GPT-2, a modified initialization which accounts for the accumulation on the residual path with model depth is used: the weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers. This keeps the magnitude of activations consistent across layers and helps stabilize training.
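The layer-norm placement and the residual scaling are easier to see in code. Here is a minimal Python sketch contrasting the two placements and applying the 1/√N scaling at initialization; `sublayer`, `norm`, and the `is_residual_projection` marker are placeholders for illustration, not GPT-2's actual implementation.

```python
import math
import torch
import torch.nn as nn

def post_ln_block(x, sublayer, norm):
    # Original Transformer / GPT-1: normalize after the residual add.
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # GPT-2: normalize the sub-block input, then add the residual.
    return x + sublayer(norm(x))

def scale_residual_projections(model: nn.Module, n_residual_layers: int):
    """GPT-2-style init tweak: scale residual-path projection weights by 1/sqrt(N)."""
    for module in model.modules():
        # `is_residual_projection` is a hypothetical marker for the output
        # projections that write into the residual stream (the attention output
        # projection and the second feed-forward linear).
        if getattr(module, "is_residual_projection", False):
            with torch.no_grad():
                module.weight.mul_(1.0 / math.sqrt(n_residual_layers))
```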

We can see that GPT-2 inherited GPT-1’s architecture with few modifications. The key changes in GPT-2 are its much larger parameter count (1.5B parameters with 48 layers) and training dataset (WebText), and there is no fine-tuning process in GPT-2. When scaling the model size and training dataset, OpenAI observed better language understanding. In fact, this was part of the journey toward “Scaling Laws for Neural Language Models”.

GPT-3 (2020)

Language Models are Few-Shot Learners introduced GPT-3, an autoregressive language model with 175B parameters (more than 100x GPT-2’s 1.5B), and tested its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question answering, and cloze tasks, as well as on several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

GPT-3’s basic pre-training approach is similar to GPT-2’s, with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. The paper presents a 175-billion-parameter language model that shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and showing strong qualitative performance at tasks defined on-the-fly.

GPT-3 adopted the same architecture as GPT-2, with the exception that GPT-3 used alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer (which reduces the time and memory cost that grows quadratically, O(n²), with the sequence length to O(n√n)). During the training of GPT-3, OpenAI trained 8 different model sizes, ranging over three orders of magnitude from 125M parameters to 175B parameters, to test the scaling-law hypothesis.
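To make “locally banded sparse attention” more concrete, here is a small, simplified sketch of the masks involved. The band width and the strict even/odd alternation are illustrative assumptions; the real Sparse Transformer patterns are more elaborate.

```python
import torch

def causal_dense_mask(seq_len):
    """Standard causal mask: position i attends to all positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def causal_banded_mask(seq_len, band=4):
    """Locally banded causal mask: position i attends only to the last `band`
    positions (including itself), so cost per row is O(band) instead of O(n)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - band)

# Alternating pattern across layers, in the spirit of GPT-3's description:
# even layers use dense causal attention, odd layers use the local band.
masks = [causal_dense_mask(8) if layer % 2 == 0 else causal_banded_mask(8)
         for layer in range(4)]
```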

When training GPT-3, OpenAI did a lot of work to improve the average quality of the training data:

  • Downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora (using the original WebText as a proxy for high-quality documents, OpenAI trained a classifier to distinguish these from raw Common Crawl);
  • Performed fuzzy deduplication at the document level, within and across datasets, using Spark’s MinHashLSH implementation with 10 hashes, to prevent redundancy and preserve the integrity of the held-out validation set as an accurate measure of overfitting (a simplified MinHash sketch follows this list); and
  • Added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.
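To give a feel for what MinHash-based fuzzy deduplication does, here is a small pure-Python sketch. It is not the Spark MinHashLSH pipeline OpenAI used; the shingle size and similarity threshold are arbitrary choices for illustration, and only the count of 10 hashes comes from the paper.

```python
import hashlib

NUM_HASHES = 10  # matches the "10 hashes" mentioned in the paper; the rest is illustrative

def shingles(text, n=5):
    """Split a document into overlapping word n-grams."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text):
    """One minimum per seeded hash function; similar documents tend to share minimums."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return tuple(sig)

def looks_duplicate(sig_a, sig_b, threshold=0.8):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / NUM_HASHES >= threshold
```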

In this paper, OpenAI also mentioned the potential contamination issue: a major methodological concern with language models pre-trained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, OpenAI searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper.

“To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network.” This means GPT-3 adopted two different kinds of model parallelism (for more about model parallelism during training, please refer to this blog):

  • Parallelism within each matrix multiply: each large matrix multiplication (e.g., in the self-attention and feed-forward layers) is split across multiple GPUs, so different GPUs handle different parts of the operation instead of one GPU computing all of it (see the conceptual sketch after this list).
  • Parallelism across layers: instead of storing all layers of the network on a single GPU, different layers are assigned to different GPUs.
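As a conceptual illustration of the first kind (splitting a matrix multiply), the sketch below simulates a column-parallel linear layer by chunking the weight matrix on a single device. In a real system each chunk would live on a different GPU and the shards would be combined with collective communication, which is omitted here; the tensor sizes are illustrative.

```python
import torch

def column_parallel_matmul(x, weight, n_devices=4):
    """Simulate splitting Y = X @ W across devices by columns of W.

    Each "device" computes X @ W_shard for its slice of output columns;
    concatenating the shards reproduces the full result.
    """
    shards = weight.chunk(n_devices, dim=1)   # one column block per device
    partial = [x @ w for w in shards]         # computed on separate GPUs in practice
    return torch.cat(partial, dim=-1)         # gathered with communication in a real setup

x = torch.randn(2, 16, 768)   # (batch, seq, hidden); sizes chosen for illustration
w = torch.randn(768, 3072)    # e.g. the first feed-forward projection
assert torch.allclose(column_parallel_matmul(x, w), x @ w, atol=1e-5)
```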

All models were trained on V100 GPUs on part of a high-bandwidth cluster.

Performance on various tasks improves as the model size scales. Language Models are Few-Shot Learners is a rather long paper at 75 pages; it reveals a lot of empirical training details and discusses limitations such as:

  • On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs.
  • The work focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class; as a result, the experiments do not include any bidirectional architectures or other training objectives such as denoising.
  • A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form.

Well, if you are an ML engineer, this paper is worth reading in detail, and more than once, from both practical and theoretical perspectives.


Written by Jinpeng Zhang

Director of Engineering @ TiDB, focus on building large scale distributed system and high performance engineering team.
