Reasoning has long been regarded as a cornerstone of intelligence, traditionally dominated by symbolic approaches (Russell & Norvig, 2016). The recent launch of the o1 series by OpenAI marked the productionization of large language model (LLM)-based reasoning systems. At the core of these systems lies a key yet simple technique known as chain of thought (CoT). This blog explores the concept of chain-of-thought reasoning, tracing its origins, examining its variants, and uncovering its limitations. Our discussion summarizes the insights shared by Denny Zhou in his Berkeley lecture.
Ideas
The term CoT was introduced by Jason Wei et al. (2022). It simply refers to a series of intermediate reasoning steps that lead to the final answer of a problem. Similar concepts have been explored under different terms, such as rationale (Ling et al. 2017), natural language solutions (Cobbe et al. 2021), or scratchpad (Nye et al. 2021).
A simple example of chain-of-thought reasoning is as follows:
Problem:
John has 5 apples. He buys 3 more apples from the store and then gives 2 apples to his friend. How many apples does John have left?
Without Chain-of-Thought:
Output: "John has 6 apples."
(Note: Correct, but reasoning is not provided.)
With Chain-of-Thought:
1. John starts with 5 apples.
2. He buys 3 more apples. Adding them together: 5 + 3 = 8.
3. He gives 2 apples to his friend. Subtracting that: 8 - 2 = 6.
4. Final Answer: "John has 6 apples left."
The use of intermediate steps to solve problems was pioneered by Ling et al. (2017) at DeepMind. Their work involved training a sequence-to-sequence model from scratch on a novel algebraic word problem dataset that included explicitly collected intermediate steps. Cobbe et al. (2021) [1] at OpenAI followed this idea by creating a much larger math word problem dataset (GSM8K) with intermediate steps, which they used to fine-tune GPT-3 [2].
By combining intermediate steps with the in-context few-shot learning technique introduced by Brown et al. (2020) for GPT-3, Wei et al. (2022) discovered that few-shot CoT prompting significantly improves the reasoning capabilities of larger language models [3]. By providing a few examples of problems along with their step-by-step reasoning in the input prompt, they demonstrated that larger models could emulate this reasoning structure and apply it to new problems. This method proved particularly effective in tasks requiring arithmetic, logical inference, and symbolic reasoning, achieving notable performance improvements over traditional prompting approaches.
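To make this concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled. The exemplar text and the `complete` placeholder for an LLM completion call are assumptions for illustration, not code from Wei et al.

```python
# Few-shot CoT prompting: prepend a few worked examples (question, step-by-step
# rationale, final answer) so the model imitates the same format on a new question.

FEW_SHOT_EXEMPLARS = """\
Q: John has 5 apples. He buys 3 more and gives 2 to a friend. How many apples does he have?
A: John starts with 5 apples. After buying more, 5 + 3 = 8. After giving some away, 8 - 2 = 6. The answer is 6.

Q: A train travels 60 miles per hour for 2 hours. How far does it go?
A: Distance is speed times time. 60 * 2 = 120. The answer is 120 miles.
"""

def build_few_shot_cot_prompt(question: str) -> str:
    """Prepend worked exemplars so the model continues with step-by-step reasoning."""
    return f"{FEW_SHOT_EXEMPLARS}\nQ: {question}\nA:"

# `complete` stands for whatever text-completion API is available:
# print(complete(build_few_shot_cot_prompt("A pen costs $2 and a notebook costs twice as much. What do both cost together?")))
```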
More surprisingly, Kojima et al. (2022), in their work on zero-shot CoT prompting, demonstrated that explicit examples are not always necessary to elicit reasoning. They found that appending a simple instruction like “Let’s think step by step” to the problem prompt could trigger models to generate intermediate reasoning steps, even without additional labeled examples.
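Under the same assumption of a placeholder completion call, the zero-shot variant reduces to appending the trigger phrase; the sketch below is illustrative rather than the exact prompt template from the paper.

```python
def build_zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase instead of providing worked examples."""
    return f"Q: {question}\nA: Let's think step by step."

# Kojima et al. (2022) then apply a second prompt (e.g. "Therefore, the answer is")
# on top of the generated reasoning to extract the final answer.
```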
More recently, Wang and Zhou (2024) discovered that even a prompt is not necessary. They proposed CoT-decoding, which leverages the LLM’s confidence scores during the decoding process to select the most reliable CoT path, bypassing prompting altogether.
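The sketch below illustrates one way this can be set up: branch on the top-k candidate tokens at the first decoding step, continue each branch greedily, and keep the branch whose answer tokens the model is most confident about. The helper callables (`greedy_continue`, `answer_token_probs`) are placeholders for model-specific code, not a real library API.

```python
from typing import Callable, List, Tuple

def cot_decode(
    first_step_candidates: List[str],                                # top-k tokens at decoding step 1
    greedy_continue: Callable[[str], str],                           # greedy completion given the branch prefix
    answer_token_probs: Callable[[str], List[Tuple[float, float]]],  # (p_top1, p_top2) for each answer token
) -> str:
    """Return the decoded path whose final answer the model is most confident about."""
    best_path, best_confidence = "", float("-inf")
    for token in first_step_candidates:
        path = token + greedy_continue(token)
        margins = answer_token_probs(path)
        # Confidence = average gap between the two most likely tokens over the answer span;
        # paths that contain a chain of thought tend to score higher.
        confidence = sum(p1 - p2 for p1, p2 in margins) / max(len(margins), 1)
        if confidence > best_confidence:
            best_path, best_confidence = path, confidence
    return best_path
```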
Improvements
Several techniques for improving CoT prompting have been proposed, such as:
- Self-Consistency (Wang et al. 2023): This approach improves CoT reasoning by generating multiple reasoning paths and selecting the most consistent answer based on a voting mechanism or other criteria (a minimal sketch follows this list).
- Least-to-Most Prompting (Zhou et al. 2023): This technique breaks down complex problems into simpler, manageable subproblems, allowing LLMs to solve more challenging tasks than those presented in the prompt itself.
- Analogical Prompting (Yasunaga et al. 2024): This approach guides LLMs to generate their own relevant examples and knowledge, leading to improved generalization and reasoning capabilities.
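Here is a minimal sketch of self-consistency, assuming a hypothetical `sample_cot_answer` helper that runs one CoT-prompted completion at a nonzero temperature and returns the parsed final answer.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    question: str,
    sample_cot_answer: Callable[[str], str],  # hypothetical: one sampled CoT run -> parsed final answer
    num_samples: int = 10,
) -> str:
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_cot_answer(question) for _ in range(num_samples)]
    # Majority vote over final answers only; the intermediate reasoning may differ across samples.
    return Counter(answers).most_common(1)[0][0]
```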
Limitations
Despite its strengths, CoT reasoning has several limitations:
- Irrelevant Context (Shi et al. 2023): LLMs can be easily distracted by irrelevant or extraneous information within the input, leading to reasoning errors or incorrect conclusions.
- Limited Self-Correction (Huang et al. 2023): LLMs struggle to self-correct their reasoning without external feedback or oracle labels, highlighting a need for improved intrinsic self-correction capabilities.
- Premise Order Sensitivity (Chen et al. 2024): LLMs exhibit sensitivity to the order in which information is presented, which impacts reasoning performance even when the underlying information remains the same.
[1]: Many authors of the GSM8K paper are also key contributors to o1.
[2]: Note, however, that one of the main focuses of the paper is to demonstrate that a verifier, trained on a smaller GPT-3 model to judge the correctness of model-generated solutions, can achieve performance comparable to or better than fine-tuning a 30x larger GPT-3 model, particularly by leveraging verification's favorable scaling with data size.
[3]: They note that the benefits of CoT prompting are limited for smaller models.
Last modified on 2024-09-21