Microsoft’s new tech helps AI improve its own reasoning
Researchers introduced a framework called Direct Reasoning Optimization (DRO), which enables large language models (LLMs) to refine their reasoning on open-ended tasks without external feedback.
Microsoft, University of California, Los Angeles
Yifei Xu et al.
The study proposes a new reward signal, Reasoning Reflection Reward (R3), that lets LLMs internally assess the quality of their own reasoning. R3 is derived from the model itself: it scores how strongly a generated reasoning trace supports the model's own confidence in the reference outcome. This makes it practical to apply reinforcement learning to open-ended tasks, where defining reward signals has traditionally been difficult.
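The paper's exact formulation differs, but a minimal sketch conveys the idea: reward a reasoning trace by how much it raises the model's own likelihood of the reference answer. The model name (gpt2), the helper names (answer_logprob, r3_reward), and the lift-over-baseline scoring rule below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an R3-style self-derived reward (not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Mean log-probability the model assigns to `answer` given `prompt`.

    Simplification: tokenizing prompt + answer together can shift the token
    boundary slightly versus tokenizing them separately.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    # Next-token log-probs; position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    answer_start = prompt_ids.shape[1]
    targets = full_ids[:, answer_start:]                     # answer tokens only
    token_lp = log_probs[0, answer_start - 1:].gather(1, targets.T).squeeze(-1)
    return token_lp.mean().item()

def r3_reward(question: str, reasoning: str, reference: str) -> float:
    """R3-style signal: lift in reference-answer likelihood attributable
    to the reasoning trace, relative to answering with no reasoning."""
    with_reasoning = answer_logprob(f"{question}\n{reasoning}\n", reference)
    without_reasoning = answer_logprob(f"{question}\n", reference)
    return with_reasoning - without_reasoning
```

Because the score comes from the model's own token probabilities, no external evaluator or learned reward model is needed, which is the property the paper exploits for open-ended tasks.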
The research offers an alternative to using external evaluators or complex reward models when training LLMs. In practical applications like scientific writing or document editing, a model that can assess and refine its own reasoning could help improve output quality and relevance.
Since the method relies on the model’s own evaluations, it may not always align with human judgment, particularly in tasks involving subjective reasoning.
DRO offers a way for LLMs to improve their reasoning without external supervision, with potential implications for various text-based tasks.
📄 Read the full paper: Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks