Paper Summary and Thoughts:
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
https://arxiv.org/abs/2201.11903
Large Language Models (LLMs) are great at many things, but they often fail at tasks that require multiple steps of reasoning, like math word problems. With standard prompting, if you ask a model a multi-step question, it tries to answer directly, which often results in a wrong answer. This groundbreaking paper introduces a simple but powerful technique to fix this: Chain-of-Thought Prompting.
The paper first reviews prior ways researchers have tried to improve AI models. One approach was generating Natural Language Rationales, which improves performance by producing step-by-step explanations in plain English that lead to the final answer; this required complicated methods like training a model from scratch or fine-tuning it. Another approach has been In-Context Few-Shot Learning, where an LLM learns a new task simply by being shown a few examples (input-output pairs) in the prompt itself. This avoids the need to retrain or fine-tune a separate model for each new task and has proven successful for a variety of simple question-answering problems, but it fails when multi-step reasoning is needed.
This paper introduces Chain-of-Thought Prompting. By combining the ideas behind rationale-augmented training and fine-tuning methods with traditional few-shot prompting, the researchers developed Chain-of-Thought Prompting: you give the LLM a few examples in the prompt that show the entire reasoning process (every step) for solving a question. Each example contains a Question, Step-by-Step Reasoning, and a Final Answer, and the examples are similar to the new, unanswered question that the LLM is asked to solve in the same prompt. This lets the model break multi-step problems into intermediate steps; the model recognizes the pattern of reasoning and replicates it for new, complex problems. It also lets researchers see where the model may have gone wrong in its reasoning.
Here is an example the researchers included:
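The gist of the paper's Figure 1, with the worked exemplar in the prompt and the model's generated chain of thought after it (wording closely follows the figure):

```
Prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
   6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
   bought 6 more, how many apples do they have?
A:

Model output:
The cafeteria had 23 apples originally. They used 20 to make lunch.
So they had 23 - 20 = 3. They bought 6 more apples, so they have
3 + 6 = 9. The answer is 9.
```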
The paper also discovered that this kind of reasoning is an “Emergent Ability”: the skill suddenly appears once a model is large enough (has enough parameters). Chain-of-thought prompting only starts to work effectively on models with around 100 billion parameters or more. For smaller models, for example those under 10 billion parameters, the technique doesn’t help and actually makes performance worse, because they generate fluent but illogical reasoning steps.
Here are the graphs the researchers produced. You can see that once the model scale (x-axis) increases past a certain point, the solve rate (y-axis) improves significantly.
The researchers then tested chain-of-thought prompting on three main categories of reasoning.
First was Arithmetic Reasoning (math word problems). They found that chain-of-thought prompting significantly improved performance on challenging math benchmarks like GSM8K, SVAMP, and MAWPS. For example, the PaLM 540B model with just eight chain-of-thought exemplars achieved new state-of-the-art results, outperforming prior methods that required extensive fine-tuning.
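As a rough sketch of how you might try this comparison yourself (this is not the authors’ evaluation code: `query_model` is a hypothetical placeholder for a real LLM API, the exemplar is a paraphrase of one from the paper’s appendix, and a real run would use all eight exemplars rather than one):

```python
# Sketch: standard vs. chain-of-thought few-shot prompting on a
# GSM8K-style word problem.

STANDARD_EXEMPLAR = (
    "Q: There are 15 trees in the grove. Grove workers will plant trees "
    "today. After they are done, there will be 21 trees. How many trees "
    "did the workers plant?\n"
    "A: The answer is 6.\n\n"
)

COT_EXEMPLAR = (
    "Q: There are 15 trees in the grove. Grove workers will plant trees "
    "today. After they are done, there will be 21 trees. How many trees "
    "did the workers plant?\n"
    "A: There are 15 trees originally. Then there were 21 trees after. "
    "So the workers must have planted 21 - 15 = 6 trees. "
    "The answer is 6.\n\n"
)


def build_prompt(exemplar: str, question: str) -> str:
    # Few-shot prompt: worked exemplar(s) first, then the new question.
    return f"{exemplar}Q: {question}\nA:"


def extract_answer(completion: str) -> str:
    # Exemplars end with "The answer is X.", so the model's completion
    # should too; grab whatever follows that phrase.
    return completion.rsplit("The answer is", 1)[-1].strip(" .\n")


def query_model(prompt: str) -> str:
    # Hypothetical LLM call -- swap in a real API client. Returns a
    # canned chain-of-thought completion so the sketch runs end to end.
    return ("Leah had 32 chocolates and her sister had 42, so they had "
            "32 + 42 = 74. They ate 35, so 74 - 35 = 39. "
            "The answer is 39.")


question = ("Leah had 32 chocolates and her sister had 42. If they ate "
            "35, how many pieces do they have left in total?")

for name, exemplar in [("standard", STANDARD_EXEMPLAR), ("cot", COT_EXEMPLAR)]:
    completion = query_model(build_prompt(exemplar, question))
    print(f"{name}: {extract_answer(completion)}")
```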
Next was Commonsense Reasoning, tested with problems requiring world knowledge on datasets like StrategyQA and Sports Understanding. The PaLM 540B model with chain of thought surpassed the prior best on StrategyQA and even outperformed a human sports enthusiast on the Sports Understanding task.
Finally, they tested Symbolic Reasoning: tasks that are simple for humans but hard for AI, like concatenating the last letters of words or tracking coin flips. They found that chain of thought didn’t just solve these problems; it allowed the models to generalize to longer, more complex versions of the problems than they had seen in the examples, something standard prompting failed at completely.
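To make the last-letter task concrete, here is a minimal Python sketch (the names and the exemplar wording are my own illustration, not copied from the paper):

```python
# The last-letter concatenation task: the ground truth is trivial to
# compute in code, which makes scoring model outputs easy.

def last_letter_concat(name: str) -> str:
    # "Amy Brown" -> "yn": take the last letter of each word.
    return "".join(word[-1] for word in name.split())

# A chain-of-thought exemplar spells the steps out:
COT_EXEMPLAR = (
    'Q: Take the last letters of the words in "Amy Brown" and '
    'concatenate them.\n'
    'A: The last letter of "Amy" is "y". The last letter of "Brown" is '
    '"n". Concatenating them gives "yn". The answer is yn.\n\n'
)

# Length generalization: the exemplars use 2-word names, but the models
# are then tested on longer names, which standard prompting fails on.
print(last_letter_concat("Anna Marie Lee Park"))  # -> "aeek"
```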
The researchers ran “ablation studies” to pin down why chain-of-thought prompting works.
Their first question was whether the benefit comes just from producing the final equation. They tested this by prompting the model to output only the mathematical equation, and found that this didn’t work for complex problems: the model needs the natural-language reasoning steps to arrive at the correct equation.
Then they asked whether it is just about giving the model more “thinking” time. They tested this by prompting the model to output a series of dots (...) to simulate extra computation without actual reasoning. This didn’t help either; the benefit comes from the meaningful steps, not the extra computation.
Finally, they asked whether the benefit of chain of thought could simply be that the prompts help the model access relevant knowledge it gained during pre-training. They tested this with an alternative setup where the chain of thought was given only after the final answer, which was designed to test whether the model actually depends on the generated chain of thought to arrive at the answer. This variant performed about the same as the baseline, showing no significant improvement, which suggests that chain-of-thought prompting is not just about “activating knowledge.” These prompt variants are illustrated below.
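Roughly what each ablation’s answer format looks like for the tennis-ball example (my paraphrase, not verbatim from the paper):

```
Standard chain of thought:
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
   6 tennis balls. 5 + 6 = 11. The answer is 11.

Equation only:
A: 5 + 2 * 3 = 11

Variable compute (dots only, similar length but no content):
A: ........................ The answer is 11.

Chain of thought after answer:
A: The answer is 11. Roger started with 5 balls. 2 cans of 3 tennis
   balls each is 6 tennis balls. 5 + 6 = 11.
```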
These ablation studies interested me because they really demonstrated how the researchers were able to narrow down why chain-of-thought prompting works.
Final Thoughts:
Overall, I found this study very informative, and it built on my current knowledge. What I found most interesting was that the researchers took two previously researched ways of improving LLMs and combined them to create a new method. This strengthens my desire to look at more papers to see if there is some way I can combine methods from previous papers to create a novel idea. The paper has also left me curious to recreate the researchers’ results for my internship, since the technique can be easily applied to the work I have already done. Exploring “emergent abilities” also interested me. The model only improved after increasing the scale by a lot (upwards of 100 billion parameters), which suggests that some future AI developments may emerge simply from a sufficient scale of data and computation.