Microsoft’s new tech helps AI improve its own reasoning

Tech in Asia·2025-06-19 17:00

🔍 In one sentence

Researchers introduced a framework called Direct Reasoning Optimization (DRO), which enables large language models (LLMs) to refine their reasoning on open-ended tasks without external feedback.

🏛️ Paper by:

Microsoft, University of California, Los Angeles

✏️ Authors:

Yifei Xu et al.

🧠 Key discovery

The study proposes a new reward signal, Reasoning Reflection Reward (R3), that allows LLMs to internally assess their reasoning processes. This helps apply reinforcement learning to open-ended tasks, where defining reward signals has traditionally been difficult.

📊 Surprising results

Key stat: DRO reduced training costs by about 45% compared to standard methods, while maintaining high performance on two different datasets.

Breakthrough: R3 highlights key tokens that represent the model’s reasoning steps, improving its ability to self-evaluate.

Comparison: DRO showed clear gains over baseline models in both reasoning efficiency and output accuracy.
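To make the idea of a reward that emphasizes key tokens more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual formula: it assumes we already have the log-probability a model assigns to each token of a reference answer after several different sampled reasoning traces, and it weights each token by how much that log-probability varies across traces. The function name `r3_style_rewards` and the input numbers are hypothetical.

```python
from statistics import pvariance


def r3_style_rewards(token_logprobs):
    """Toy token-weighted self-reward (illustrative assumption, not the paper's method).

    token_logprobs[i][t] is the log-probability of reference token t
    given sampled reasoning trace i. All inputs here are made up.
    """
    n_tokens = len(token_logprobs[0])

    # Tokens whose likelihood changes a lot between traces are treated as
    # the ones most sensitive to the reasoning, so they get more weight.
    variances = [
        pvariance([trace[t] for trace in token_logprobs])
        for t in range(n_tokens)
    ]
    total = sum(variances)

    if total == 0:
        # No token distinguishes the traces: fall back to uniform weights.
        weights = [1.0 / n_tokens] * n_tokens
    else:
        weights = [v / total for v in variances]

    # Reward for each trace: weighted sum of its reference-token log-probs.
    rewards = [
        sum(w * lp for w, lp in zip(weights, trace))
        for trace in token_logprobs
    ]
    return weights, rewards


# Hypothetical example: two reasoning traces, three reference tokens.
# Only the last token's likelihood depends on the reasoning.
example = [
    [-0.1, -0.2, -3.0],  # trace 0: weak reasoning, last token unlikely
    [-0.1, -0.2, -0.5],  # trace 1: better reasoning, last token likely
]
weights, rewards = r3_style_rewards(example)
```

In this toy input, all of the weight falls on the third token, so the second trace receives the higher reward. A reinforcement learning loop could then favor reasoning traces with higher scores, with no external evaluator involved.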

📌 Why this matters

The research offers an alternative to using external evaluators or complex reward models when training LLMs. In practical applications like scientific writing or document editing, a model that can assess and refine its own reasoning could help improve output quality and relevance.

💡 What are the potential applications?

Document Revision: Useful for tasks such as revising academic papers based on reviewer feedback.

Creative Writing: Could help improve reasoning in plot or character development.

Educational Tools: May support automated feedback in student essays by analyzing reasoning.

⚠️ Limitations

Because the method relies on the model’s own evaluations, its reward signal may not always align with human judgment, particularly in tasks involving subjective reasoning.

👉 Bottom line:

DRO offers a way for LLMs to improve their reasoning without external supervision, with potential implications for various text-based tasks.

📄 Read the full paper: Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

……

Read full article on Tech in Asia
