Microsoft’s new tech helps AI improve its own reasoning

Tech in Asia·2025-06-19 17:00

🔍 In one sentence

Researchers introduced a framework called Direct Reasoning Optimization (DRO), which enables large language models (LLMs) to refine their reasoning on open-ended tasks without external feedback.

🏛️ Paper by:

Microsoft, University of California, Los Angeles

✏️ Authors:

Yifei Xu et al.

🧠 Key discovery

The study proposes a new reward signal, Reasoning Reflection Reward (R3), that allows LLMs to internally assess their reasoning processes. This helps apply reinforcement learning to open-ended tasks, where defining reward signals has traditionally been difficult.
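To make the idea of an internal, token-level reward more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes we already have the log-probability the model assigns to each token of a reference answer after each of several sampled reasoning traces, and it assumes that tokens whose likelihood varies most across traces are the ones that reflect the reasoning. The function name `token_weighted_reward` and the variance-based weighting are hypothetical choices made for this example.

```python
from statistics import pvariance


def token_weighted_reward(logprobs_per_trace):
    """Score each reasoning trace with a token-weighted log-likelihood.

    logprobs_per_trace[i][t] is the (assumed, precomputed) log-probability
    of reference-answer token t given sampled reasoning trace i.
    """
    num_tokens = len(logprobs_per_trace[0])

    # Assumption for illustration: a token is "reasoning-sensitive" if its
    # log-probability changes a lot from one reasoning trace to another.
    variances = [
        pvariance([trace[t] for trace in logprobs_per_trace])
        for t in range(num_tokens)
    ]

    total = sum(variances)
    if total == 0:
        # No token reacts to the reasoning: fall back to uniform weights.
        weights = [1.0 / num_tokens] * num_tokens
    else:
        weights = [v / total for v in variances]

    # Reward for a trace = weighted sum of its per-token log-probabilities.
    return [
        sum(w * lp for w, lp in zip(weights, trace))
        for trace in logprobs_per_trace
    ]


# Two hypothetical traces scored against a two-token reference answer.
# Token 0 is equally likely under both traces; token 1 is not.
rewards = token_weighted_reward([[-0.1, -2.0], [-0.1, -0.5]])
```

In this toy input only the second token responds to the reasoning, so it receives all of the weight and the second trace scores higher. A real system would obtain the log-probabilities from the model itself and feed the resulting rewards into a reinforcement learning update.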

📊 Surprising results

Key stat: DRO reduced training costs by about 45% compared to standard methods, while maintaining high performance on two different datasets.

Breakthrough: R3 highlights key tokens that represent the model’s reasoning steps, improving its ability to self-evaluate.

Comparison: DRO showed clear gains over baseline models in both reasoning efficiency and output accuracy.

📌 Why this matters

The research offers an alternative to using external evaluators or complex reward models when training LLMs. In practical applications like scientific writing or document editing, a model that can assess and refine its own reasoning could help improve output quality and relevance.

💡 What are the potential applications?

Document Revision: Useful for tasks such as revising academic papers based on reviewer feedback.

Creative Writing: Could help improve reasoning in plot or character development.

Educational Tools: May support automated feedback on student essays by analyzing reasoning.

⚠️ Limitations

Since the method relies on the model’s own evaluations, it may not always align with human judgment, particularly in tasks involving subjective reasoning.

👉 Bottom line:

DRO offers a way for LLMs to improve their reasoning without external supervision, with potential implications for various text-based tasks.

📄 Read the full paper: Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks
