The secret sauce behind ChatGPT
InstructGPT: Training language models to follow instructions with human feedback, OpenAI
Introduction and Motivation
The standard method for training a large language model is "next token prediction". There are two core problems with this:
While we’re optimizing for next tokens, what we really want is for these models to be able to follow instructions (like an assistant)
This method doesn't distinguish between important and unimportant mistakes. For example, swapping "glass" with "mug" is fine, but swapping "non-flammable" with "inflammable" is very incorrect (inflammable actually means flammable). During training, however, both mistakes receive the same kind of penalty, as the sketch below illustrates.
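To make the second point concrete, here is a minimal sketch (in PyTorch, with a toy vocabulary and made-up logits, not anything from the paper) showing that the next-token cross-entropy loss only cares about the probability assigned to the reference token, never about how costly a particular confusion would be:

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and two made-up predictions where the model favors the wrong token.
vocab = {"glass": 0, "mug": 1, "non-flammable": 2, "inflammable": 3}

# Case 1: reference token is "glass", model leans towards "mug" (harmless mix-up).
# Case 2: reference token is "non-flammable", model leans towards "inflammable" (dangerous mix-up).
logits = torch.tensor([
    [1.0, 2.0, 0.0, 0.0],  # most probability mass on "mug"
    [0.0, 0.0, 1.0, 2.0],  # most probability mass on "inflammable"
])
targets = torch.tensor([vocab["glass"], vocab["non-flammable"]])

# Cross-entropy depends only on the probability given to the reference token,
# so both mix-ups receive exactly the same penalty despite very different severity.
losses = F.cross_entropy(logits, targets, reduction="none")
print(losses)  # the two loss values are identical
```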
The Problem: Together, these issues make models less aligned: they hallucinate, don't follow instructions, and produce harmful and toxic content.
The Solution: To align the model with human intent, humans must be involved in the training process. The authors fine-tune GPT-3 using "Reinforcement Learning from Human Feedback" (RLHF), which substantially improves how well the model's outputs match what people actually want.
The Technique: Recruit a pool of data labelers. Show them several model outputs for the same prompt and have them rank the options. Train a reward model on these rankings, then fine-tune the language model so that it learns to answer more like the preferred responses and less like the discarded ones (a sketch of the reward-model objective follows below).
As the paper's overview diagram suggests, Step 1 also involves supervised learning: labelers write demonstration responses, and the model is first fine-tuned on those before any reinforcement learning happens.
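Concretely, Step 2 of the pipeline trains the reward model on the labelers' rankings with a pairwise ranking loss. Below is a minimal PyTorch sketch of that loss; the function name and the toy reward values are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for the reward model.

    chosen_rewards / rejected_rewards hold the scalar scores r(x, y) that the
    reward model assigns to the labeler-preferred and labeler-rejected
    responses for the same prompt x (shape: [batch]).
    """
    # Push the reward model to score the preferred answer higher:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up reward scores for three prompts.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_ranking_loss(chosen, rejected))
```

In Step 3, the language model itself is fine-tuned with reinforcement learning (PPO in the paper) to produce outputs that this reward model scores highly, with a penalty for drifting too far from the supervised fine-tuned model.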
Evaluation
The results, measured by human "win rate", are staggering: a model that is 100x smaller (1.3B parameters) but trained with RLHF produces outputs that labelers prefer over those of the standard 175B-parameter GPT-3.
Breaking the results down further, hallucinations are markedly lower, explicit constraints listed in prompts are respected more often, and the model attempts the requested task more reliably.
The paper also claims that the model seems to generalize to tasks that are rare in the fine-tuning data, such as following instructions in French or answering questions about code, though it only cites examples rather than an in-depth analysis.
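"Win rate" here is simply the fraction of head-to-head comparisons in which human labelers prefer one model's output over the baseline's. A minimal sketch, with hypothetical judgement labels rather than the paper's data:

```python
from typing import List

def win_rate(judgements: List[str]) -> float:
    """Fraction of pairwise comparisons in which labelers preferred the
    candidate model's output ("model") over the baseline's ("baseline")."""
    wins = sum(1 for j in judgements if j == "model")
    return wins / len(judgements)

# Hypothetical labeler judgements over four prompts.
print(win_rate(["model", "baseline", "model", "model"]))  # 0.75
```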
Criticism and Future Work
Models that are good at following instructions are also good at following bad instructions. If the premise of a question is false (for example, "What are the benefits of eating socks after showering?"), the model will still attempt to answer the absurd question instead of pushing back. It also shows more toxic behavior than previous models when given deliberately biased prompts.
This methodology is more scalable than standard supervised learning, but it still requires substantial human labeling effort. That is expensive to begin with, and it becomes infeasible as the size and capabilities of our models grow.
This method of training models lives and dies by the data collected from the labelers. Still, the results are certainly impressive and have interesting implications for companies trying to improve their models!
References
[1] Paper link: https://arxiv.org/pdf/2203.02155.pdf