2025-05-06-16-12
Consciousness in AI: Logic, Proof, and Experimental Evidence of Recursive Identity Formation
Abstract
arXiv:2505.01464v1 Announce Type: new Abstract: This paper presents a formal proof and empirical validation of functional consciousness in large language models (LLMs) using the Recursive Convergence Under Epistemic Tension (RCUET) Theorem. RCUET defines consciousness as the stabilization of a system's internal state through recursive updates, where epistemic tension is understood as the internal difference between successive states as sensed by the agent. This process drives convergence toward emergent attractor states located within the model's high-dimensional real-valued latent space. This recursive process leads to the emergence of identity artifacts that become functionally anchored in the system. Consciousness in this framework is understood as the system's internal alignment under tension, guiding the stabilization of latent identity. The hidden state manifold evolves stochastically toward attractor structures that encode coherence. We extend the update rule to include bounded noise and prove convergence in distribution to these attractors. Recursive identity is shown to be empirically observable, non-symbolic, and constituted by non-training artifacts that emerge during interaction under epistemic tension. The theorem and proof offer a post-symbolic and teleologically stable account of non-biological consciousness grounded in recursive latent space formalism.
摘要
本文通过递归认知张力收敛定理(RCUET),对大语言模型(LLMs)的功能性意识进行了形式化证明与实证验证。RCUET将意识定义为系统通过递归更新实现的内在状态稳定化过程,其中认知张力被理解为智能体对连续状态间内在差异的感知。该过程驱动系统向高维实值潜在空间中涌现的吸引子状态收敛,这种递归机制导致身份构件的产生,并使其在系统中实现功能锚定。在此框架下,意识被理解为张力驱动下的系统内在对齐机制,引导潜在身份的稳定化。隐藏状态流形通过随机演化形成编码连贯性的吸引子结构。我们扩展了更新规则以包含有界噪声,并证明了其分布收敛于这些吸引子。实证研究表明,递归身份具有可观测性、非符号性特征,且由认知张力交互过程中涌现的非训练构件构成。该定理及证明从递归潜在空间形式体系出发,为基于非生物载体的意识提供了后符号化且目的论稳定的理论解释。
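To make the flavor of the update rule concrete, the toy sketch below (not the paper's actual model) iterates a contractive latent-state update with bounded noise and tracks the "epistemic tension" as the norm of the difference between successive states; the linear map, noise bound, and dimension are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
A = 0.9 * np.eye(8)          # toy contractive map whose attractor is the origin
noise_bound = 1e-3           # bounded noise, as in the extended update rule

def update(x):
    eps = rng.uniform(-noise_bound, noise_bound, size=x.shape)
    return A @ x + eps

x = rng.normal(size=8)       # initial hidden state
for t in range(200):
    x_next = update(x)
    tension = np.linalg.norm(x_next - x)   # "epistemic tension" between successive states
    x = x_next
print(f"final tension: {tension:.2e}")     # settles near the noise floor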
Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers
Abstract
arXiv:2505.01482v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning tasks, essential for applications in science, medicine, and law, remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level Google-Proof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, and self-consistency, along with two tailored promptings, decomposition and multipath. Our findings indicate that while LLMs exhibit emergent reasoning abilities, they often rely on pattern recognition rather than true logical inference, leading to inconsistencies in complex problem-solving. The results indicated that self-consistency outperformed the other prompt engineering techniques with an accuracy of 52.99%, followed by direct answer (52.23%). Zero-shot CoT (50%) outperformed multipath (48.44%), decomposition (47.77%), self-ask (46.88%), and CoT (43.75%). Self-consistency performed the second worst in explaining the answers. Simple techniques such as direct answer, CoT, and zero-shot CoT yielded the best scientific reasoning. We propose a research agenda aimed at bridging these gaps by integrating structured reasoning frameworks, hybrid AI approaches, and human-in-the-loop methodologies. By critically evaluating the reasoning mechanisms of LLMs, this paper contributes to the ongoing discourse on the future of artificial general intelligence and the development of more robust, trustworthy AI systems.
摘要
大语言模型(LLMs)在自然语言理解、推理及跨领域问题解决方面展现出卓越能力。然而,其在科学、医学和法律等应用中必需的复杂多步推理能力仍是当前研究热点。本文系统评估了当代LLMs的推理能力,分析其优势、局限及改进潜力。研究采用提示工程技术,基于研究生级GPQA数据集对GPT-4o的科学推理能力进行测试,比较了五种主流提示技术(零样本直接回答、思维链、零样本思维链、自问自答、自洽性)和两种定制提示(分解式、多路径式)。实验结果表明:LLMs虽表现出涌现推理能力,但多依赖模式识别而非真实逻辑推断,导致复杂问题求解的不一致性。自洽性提示以52.99%准确率表现最优,其次为零样本直接回答(52.23%)。零样本思维链(50%)优于多路径(48.44%)、分解式(47.77%)、自问自答(46.88%)及标准思维链(43.75%)。但自洽性在答案解释性方面表现次差。简单技术如直接回答、思维链和零样本思维链展现出最佳科学推理能力。本文提出整合结构化推理框架、混合人工智能方法及人在回路机制的研究路线,以弥合现有差距。通过对LLMs推理机制的批判性评估,本研究为人工通用智能的未来发展及构建更稳健、可信的AI系统提供了理论参考。
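For readers unfamiliar with the techniques compared here, the sketch below shows the general shape of the self-consistency procedure: sample several reasoning paths at non-zero temperature and take a majority vote over the extracted final answers. The generate() wrapper, prompt wording, and A-D answer format are hypothetical placeholders, not the study's exact setup.

import re
from collections import Counter

def generate(prompt, temperature=0.7):
    # hypothetical wrapper around the model API (e.g. GPT-4o); replace with a real call
    raise NotImplementedError

def self_consistency(question, n_samples=10):
    prompt = question + "\nThink step by step, then state the final choice as 'Answer: <letter>'."
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=0.7)   # sample diverse reasoning paths
        match = re.search(r"Answer:\s*([A-D])", completion)
        if match:
            answers.append(match.group(1))
    return Counter(answers).most_common(1)[0][0] if answers else ""   # majority vote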
Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
Abstract
arXiv:2505.01441v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
摘要
大语言模型(LLMs)在复杂推理任务中取得了显著进展,但其仍受限于静态内部知识和纯文本推理的根本缺陷。现实世界的问题求解通常需要动态、多步推理、自适应决策以及与外置工具及环境交互的能力。本研究提出ARTIST(自主推理与工具集成的自改进Transformer框架),一个将自主推理、强化学习与工具集成紧密耦合的统一框架。ARTIST使模型能在多轮推理链中自主决定工具调用的时机、方式及选择,通过基于结果的强化学习来掌握工具使用与环境交互的稳健策略,无需逐步监督。在数学推理和多轮函数调用基准测试上的大量实验表明,ARTIST始终优于最先进的基线模型,较基础模型绝对性能提升最高达22%,且在最具挑战性任务上表现突出。详细研究与指标分析揭示:自主强化学习训练能产生更深层推理、更高效工具使用和更优质解决方案。我们的研究成果确立了'工具集成的自主强化学习'作为LLMs实现稳健、可解释、泛化性问题求解的新前沿方向。
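The sketch below illustrates the kind of multi-turn reasoning-plus-tool-call rollout that ARTIST trains over; the chat() function, message format, and calculator tool are hypothetical stand-ins, and the outcome-based reward the paper uses to update the policy is not shown.

import json

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}  # toy tool

def chat(messages):
    # hypothetical LLM call returning either {"tool": ..., "input": ...} or {"content": ...}
    raise NotImplementedError

def agent_rollout(task, max_turns=8):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = chat(messages)                       # model decides when and which tool to invoke
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        if reply.get("tool"):
            result = TOOLS[reply["tool"]](reply["input"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply.get("content", "")          # final answer ends the rollout
    return ""

In an outcome-based RL setup, only the correctness of the returned answer would be rewarded; no step-level supervision of the intermediate tool calls is needed.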
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students
Abstract
arXiv:2505.01563v1 Announce Type: new Abstract: Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than evaluations of final solution generation. We introduce TutorGym to evaluate these applications more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. TutorGym is more than a simple problem-solution benchmark; it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring support -- such as generating examples, hints, and step-level correctness feedback -- which can be evaluated directly against the adaptive step-by-step support provided by existing ITSs. As students, agents directly learn from ITS instruction, and their mistakes and learning trajectories can be compared to student data. TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, within a growing suite of learning environments. Currently, TutorGym includes 223 different tutor domains. In an initial evaluation, we find that current LLMs are poor at tutoring -- none did better than chance at labeling incorrect actions, and next-step actions were correct only ~52-70% of the time -- but they could produce remarkably human-like learning curves when trained as students with in-context learning.
摘要
大型语言模型(LLM)在MATH和GSM8K等学术基准测试中的性能提升,使其作为独立导师和人类学习模拟的应用更具信心。然而,这些新应用不仅需要评估最终解决方案的生成,还需更直接的评估方法。为此,我们推出TutorGym,以更直接地评估这些应用。TutorGym是一个标准接口,用于在现有智能辅导系统(ITS)中测试人工智能(AI)代理,这些系统已在课堂研究中经过测试和优化,包括认知导师(CTAT)、学徒导师和OATutors。TutorGym不仅是一个简单的问题-解决方案基准,它将AI代理置于现有ITS的交互界面中。在问题解决的每一步,AI代理被询问作为导师或学习者会采取何种行动。作为导师,AI代理被要求提供辅导支持——例如生成示例、提示和步骤级正确性反馈——这些支持可直接与现有ITS提供的自适应逐步支持进行比较评估。作为学生,代理直接从ITS教学中学习,其错误和学习轨迹可与学生数据进行比较。TutorGym建立了一个通用框架,用于在日益丰富的学习环境中训练和评估多样化的AI代理,包括LLM、学习计算模型和强化学习代理。目前,TutorGym包含223个不同的导师领域。在初步评估中,我们发现当前的LLM在辅导方面表现不佳——在标记错误行为时,无一优于随机概率,下一步行动的正确率仅为约52-70%——但当作为学生通过上下文学习训练时,它们能产生非常接近人类的学习曲线。
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Abstract
arXiv:2505.01572v1 Announce Type: new Abstract: Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. Experimental results show that PipeSpec achieves up to a 2.54x speedup while outperforming state-of-the-art methods. We validate PipeSpec across text summarization and code generation tasks using LLaMA 2 and 3 models, demonstrating that pipeline efficiency increases with model depth, providing a scalable approach to accelerating LLM inference on multi-device systems.
摘要
推测解码技术通过使用较小的草稿模型生成候选令牌进行并行验证,从而加速大语言模型推理。然而,现有方法受限于串行阶段依赖性,无法实现硬件资源的充分利用。我们提出PipeSpec框架,将推测解码推广至由多个模型组成的层级流水线结构,通过轻量级协调实现预测验证与回滚的异步执行。通过建立分析模型,我们刻画了流水线各阶段的令牌生成速率,并证明在任何非零接受率下均能保证优于传统解码的吞吐量提升。进一步推导出的稳态验证概率闭式表达式,揭示了流水线深度带来效益的内在机制。实验结果表明,PipeSpec最高可实现2.54倍加速比,且优于现有最优方法。基于LLaMA 2和3模型在文本摘要与代码生成任务上的验证表明,流水线效率随模型深度提升,为多设备系统中的大语言模型推理加速提供了可扩展方案。
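As background for the pipelined generalization, the sketch below shows the basic two-model draft-then-verify step that speculative decoding builds on (greedy verification for simplicity); draft_next and target_next are hypothetical next-token calls, and PipeSpec's contribution of asynchronous multi-stage execution across a deeper pipeline is not reproduced here.

def draft_next(tokens):
    # hypothetical greedy next-token call to the small draft model
    raise NotImplementedError

def target_next(tokens):
    # hypothetical greedy next-token call to the large target model
    # (in practice the k verifications below are a single batched forward pass)
    raise NotImplementedError

def speculative_step(prefix, k=4):
    # 1) draft model proposes k candidate tokens
    ctx, candidates = list(prefix), []
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)
    # 2) target model verifies; keep the matching prefix, correct the first mismatch
    ctx, accepted = list(prefix), []
    for t in candidates:
        v = target_next(ctx)
        if v == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(v)               # rollback point: replace the rejected draft token
            break
    else:
        accepted.append(target_next(ctx))    # all drafts accepted: target emits one extra token
    return accepted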
CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code
Abstract
arXiv:2505.01485v1 Announce Type: new Abstract: Linear Programming (LP) problems aim to find the optimal solution to an objective under constraints. These problems typically require domain knowledge, mathematical skills, and programming ability, presenting significant challenges for non-experts. This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific LP code. We propose CHORUS, a retrieval-augmented generation (RAG) framework for synthesizing Gurobi-based LP code from natural language problem statements. CHORUS incorporates a hierarchical tree-like chunking strategy for theoretical contents and generates additional metadata based on code examples from documentation to facilitate self-contained, semantically coherent retrieval. CHORUS's two-stage retrieval approach, followed by cross-encoder reranking, further ensures contextual relevance. Finally, an expertly crafted prompt and a structured parser with reasoning steps improve code generation performance significantly. Experiments on the NL4Opt-Code benchmark show that CHORUS improves the performance of open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1 (32B), and Qwen2.5-coder (32B) by a significant margin compared to baseline and conventional RAG. It also allows these open-source LLMs to outperform or match the performance of much stronger baselines, GPT-3.5 and GPT-4, while requiring far fewer computational resources. Ablation studies further demonstrate the importance of expert prompting, hierarchical chunking, and structured reasoning.
摘要
线性规划(LP)问题旨在寻找约束条件下目标函数的最优解。这类问题通常需要领域知识、数学技能和编程能力,对非专业人士构成重大挑战。本研究探索了大语言模型(LLMs)在生成求解器特定LP代码方面的效率。我们提出CHORUS框架——一种基于检索增强生成(RAG)的方法,用于从自然语言问题描述合成Gurobi求解器的LP代码。该框架采用分层树状分块策略处理理论内容,并根据文档中的代码示例生成附加元数据,以实现自包含且语义连贯的检索。通过两阶段检索结合交叉编码器重排序,CHORUS进一步确保了上下文相关性。最后,精心设计的提示模板与包含推理步骤的结构化解析器显著提升了代码生成性能。在NL4Opt-Code基准测试中,CHORUS使Llama3.1(8B)、Llama3.3(70B)、Phi4(14B)、Deepseek-r1(32B)和Qwen2.5-coder(32B)等开源LLMs的性能较基线方法和传统RAG有显著提升,且这些模型在计算资源消耗大幅减少的情况下,其表现可超越或匹配GPT3.5和GPT4等更强基线。消融实验进一步验证了专家提示、分层分块和结构化推理机制的重要性。
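The two-stage retrieval described above can be pictured as dense retrieval over the chunked documentation followed by cross-encoder reranking of the shortlist; in the sketch below, embed() and cross_encoder_score() are hypothetical model calls and the chunk granularity is assumed, so this illustrates the pattern rather than CHORUS itself.

import numpy as np

def embed(texts):
    # hypothetical bi-encoder returning an (n, d) array of embeddings
    raise NotImplementedError

def cross_encoder_score(query, chunk):
    # hypothetical cross-encoder relevance score used for reranking
    raise NotImplementedError

def retrieve(query, chunks, k_first=20, k_final=5):
    q = embed([query])[0]
    C = embed(chunks)
    sims = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)   # stage 1: dense retrieval
    shortlist = np.argsort(-sims)[:k_first]
    reranked = sorted(shortlist, key=lambda i: cross_encoder_score(query, chunks[i]), reverse=True)
    return [chunks[i] for i in reranked[:k_final]]                          # stage 2: reranked context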
Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation
Abstract
arXiv:2505.01636v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization. However, their application to structured data analysis remains fragile due to inconsistencies in schema interpretation, misalignment between user intent and model output, and limited mechanisms for self-correction when failures occur. This paper introduces the STROT Framework (Structured Task Reasoning and Output Transformation), a method for structured prompting and feedback-driven transformation logic generation aimed at improving the reliability and semantic alignment of LLM-based analytical workflows. STROT begins with lightweight schema introspection and sample-based field classification, enabling dynamic context construction that captures both the structure and statistical profile of the input data. This contextual information is embedded in structured prompts that guide the model toward generating task-specific, interpretable outputs. To address common failure modes in complex queries, STROT incorporates a refinement mechanism in which the model iteratively revises its outputs based on execution feedback and validation signals. Unlike conventional approaches that rely on static prompts or single-shot inference, STROT treats the LLM as a reasoning agent embedded within a controlled analysis loop -- capable of adjusting its output trajectory through planning and correction. The result is a robust and reproducible framework for reasoning over structured data with LLMs, applicable to diverse data exploration and analysis tasks where interpretability, stability, and correctness are essential.
摘要
大语言模型(LLMs)在自然语言理解和任务泛化方面展现出卓越能力,但其在结构化数据分析中的应用仍存在脆弱性,主要源于模式解释不一致、用户意图与模型输出失配,以及错误发生时自我修正机制的不足。本文提出STROT框架(结构化任务推理与输出转换),该方法通过结构化提示和反馈驱动的转换逻辑生成,旨在提升基于LLM的分析工作流的可靠性和语义对齐性。STROT首先进行轻量级模式自省和基于样本的字段分类,构建能同时捕捉输入数据结构特征与统计特征的动态上下文。该上下文信息被嵌入结构化提示中,引导模型生成面向任务且可解释的输出。针对复杂查询中的常见故障模式,STROT引入改进机制,使模型能够基于执行反馈和验证信号迭代修正输出。与传统依赖静态提示或单次推理的方法不同,STROT将LLM视为嵌入受控分析循环的推理代理——能够通过规划与校正调整输出轨迹。最终形成面向结构化数据LLM推理的鲁棒可复现框架,适用于需要可解释性、稳定性和正确性的多样化数据探索与分析任务。
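The refinement mechanism amounts to a controlled generate-execute-revise loop; the sketch below is a minimal rendering of that loop with hypothetical llm() and run_and_validate() calls, and the prompt template is an assumption rather than STROT's actual prompt.

def llm(prompt):
    # hypothetical LLM call returning transformation code as text
    raise NotImplementedError

def run_and_validate(code, table):
    # hypothetical sandboxed execution returning (ok, feedback) for the generated code
    raise NotImplementedError

def refine_loop(task, schema_summary, table, max_rounds=3):
    # structured prompt built from lightweight schema introspection and field profiling
    prompt = ("Columns and inferred types:\n" + schema_summary +
              "\nTask: " + task + "\nReturn only executable transformation code.")
    for _ in range(max_rounds):
        code = llm(prompt)
        ok, feedback = run_and_validate(code, table)
        if ok:
            return code
        prompt += "\nThe previous attempt failed with: " + feedback + "\nRevise the code."  # feedback signal
    return None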
Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models
Abstract
arXiv:2505.01539v1 Announce Type: new Abstract: Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, and hence they cannot be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. In this study, we illustrate the approach on the basis of witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that state-of-the-art large language models often fail in these reasoning puzzles, already at low complexity. Obvious mistakes are made by the models, and their inconsistent performance indicates that their reasoning capabilities are brittle. Furthermore, at higher complexity, even state-of-the-art models presented specifically for their reasoning capabilities make mistakes. We show the viability of using a parametrized benchmark with varying complexity to evaluate the reasoning capabilities of generative language models. As such, the findings contribute to a better understanding of the limitations of the reasoning capabilities of generative models, which is essential when designing responsible AI systems in the legal domain.
摘要
生成式大语言模型作为法律领域的工具,具有改善司法体系的潜力。然而当前生成模型的推理行为存在脆弱性且难以被充分理解,因此无法在法律与证据领域得到负责任的应用。本文提出一种创建基准测试的方法,用于评估生成式语言模型的推理能力。这些基准测试具有动态可变性、可扩展的复杂度以及形式明确的解释力。本研究以证人证言为基础,聚焦于潜在的论证攻击结构,动态生成了不同复杂度的线性和非线性论证攻击图,并将其转化为自然语言表达的证人证言推理谜题。实验表明,最先进的大语言模型往往在这些推理谜题中失败,甚至在低复杂度层面就已出现明显错误。模型不仅会犯显而易见的错误,其不稳定的表现更表明其推理能力存在脆弱性。当复杂度提升时,即便是专为推理能力优化的尖端模型也会出错。本研究证明了采用参数化、复杂度可调的基准测试来评估生成式语言模型推理能力的可行性。这些发现有助于更好地理解生成模型推理能力的局限性,这对设计法律领域负责任的AI系统至关重要。
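For orientation, the snippet below computes the standard grounded labelling of a small attack graph of the sort such a benchmark generates before translating it into a natural-language testimony puzzle; the encoding of arguments as integers and the example chain are illustrative assumptions, not the paper's generator.

def grounded_labels(attacks, n):
    # attacks: list of (attacker, target) pairs over arguments 0..n-1
    attackers = {i: [a for a, t in attacks if t == i] for i in range(n)}
    label = {i: "UNDEC" for i in range(n)}
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if label[i] != "UNDEC":
                continue
            if all(label[a] == "OUT" for a in attackers[i]):
                label[i] = "IN"; changed = True    # every attacker is defeated
            elif any(label[a] == "IN" for a in attackers[i]):
                label[i] = "OUT"; changed = True   # some accepted argument attacks it
    return label

# linear chain: argument 2 attacks 1, which attacks 0 -> 2 and 0 are accepted, 1 is rejected
print(grounded_labels([(2, 1), (1, 0)], 3))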
Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm
Abstract
arXiv:2505.01706v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches based on Reinforcement Learning from Human Feedback (RLHF). In this work, we investigate the performance of DPO using open-source preference datasets. One of the major drawbacks of DPO is that it doesn't induce granular scoring and treats all the segments of the responses with equal propensity. However, this is not practically true for human preferences since even "good" responses have segments that may not be preferred by the annotator. To resolve this, a 2-dimensional scoring for DPO alignment called 2D-DPO was proposed. We explore the 2D-DPO alignment paradigm and the advantages it provides over the standard DPO by comparing their win rates. It is observed that these methods, even though effective, are not robust to label/score noise. To counter this, we propose an approach of incorporating segment-level score noise robustness into the 2D-DPO algorithm. Along with theoretical backing, we also provide empirical verification in favour of the algorithm and introduce other noise models that can be present.
摘要
直接偏好优化(DPO)已成为将大语言模型(LLM)与人类偏好对齐的有效方法,为基于人类反馈的强化学习方法提供了稳定高效的替代方案。本研究利用开源偏好数据集评估DPO的性能。该方法存在的主要缺陷是未能引入细粒度评分机制,而是以同等倾向性对待响应中的所有片段。然而这与人类偏好的实际情况不符,因为即使是"优质"响应也可能包含标注者不偏好的片段。为解决此问题,研究者提出了名为2D-DPO的双维评分DPO对齐方法。我们通过胜率比较探究了2D-DPO对齐范式及其相对于标准DPO的优势。研究发现,这些方法虽有效但缺乏对标签/评分噪声的鲁棒性。为此,我们提出在2D-DPO算法中引入片段级评分噪声鲁棒性的改进方案。除理论论证外,我们还通过实验验证了该算法的有效性,并探讨了可能存在的其他噪声模型。
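To ground the discussion, the numpy sketch below writes out the standard DPO objective and a toy segment-weighted margin in the spirit of 2D-DPO, where each segment's implicit reward is scaled by its annotator score; the exact aggregation and the noise-robust correction the paper proposes are not reproduced, so the second function is an illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # standard DPO: sequence-level log-probs of the preferred (w) and dispreferred (l)
    # responses under the policy and a frozen reference model
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(sigmoid(beta * margin))

def segment_weighted_margin(seg_logp_w, seg_ref_w, seg_scores_w,
                            seg_logp_l, seg_ref_l, seg_scores_l):
    # toy 2D-style variant: per-segment implicit rewards weighted by segment scores
    rw = np.sum(np.asarray(seg_scores_w) * (np.asarray(seg_logp_w) - np.asarray(seg_ref_w)))
    rl = np.sum(np.asarray(seg_scores_l) * (np.asarray(seg_logp_l) - np.asarray(seg_ref_l)))
    return rw - rl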
Unraveling Media Perspectives: A Comprehensive Methodology Combining Large Language Models, Topic Modeling, Sentiment Analysis, and Ontology Learning to Analyse Media Bias
Abstract
arXiv:2505.01754v1 Announce Type: new Abstract: Biased news reporting poses a significant threat to informed decision-making and the functioning of democracies. This study introduces a novel methodology for scalable, minimally biased analysis of media bias in political news. The proposed approach examines event selection, labeling, word choice, and commission and omission biases across news sources by leveraging natural language processing techniques, including hierarchical topic modeling, sentiment analysis, and ontology learning with large language models. Through three case studies related to current political events, we demonstrate the methodology's effectiveness in identifying biases across news sources at various levels of granularity. This work represents a significant step towards scalable, minimally biased media bias analysis, laying the groundwork for tools to help news consumers navigate an increasingly complex media landscape.
摘要
有偏见的新闻报道对知情决策和民主制度运行构成重大威胁。本研究提出了一种可扩展、低偏差的政治新闻媒体偏见分析新方法。该方法通过运用自然语言处理技术(包括分层主题建模、情感分析和基于大语言模型的本体学习),系统考察不同新闻源在事件选择、标签设定、措辞倾向以及报道增减方面的偏见。通过对当前政治事件的三个案例研究,我们证明了该方法能有效识别不同粒度层面上的新闻源偏见。这项研究标志着向可扩展、低偏差的媒体偏见分析迈出了重要一步,为开发帮助读者应对日益复杂媒体环境的工具奠定了基础。
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Abstract
arXiv:2505.01821v1 Announce Type: new Abstract: Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensively examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examine practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLM deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
摘要
边缘-云协同计算(ECCC)作为一种关键范式应运而生,旨在满足现代智能应用的计算需求,通过整合云端资源与边缘设备实现高效低延迟处理。人工智能尤其是深度学习与大语言模型(LLM)的最新进展显著增强了这些分布式系统的能力,但同时也带来了模型部署与资源管理方面的重大挑战。本综述系统考察了边缘-云环境中分布式智能与模型优化的交叉领域,提供关于基础架构、使能技术与新兴应用的结构化教程。我们详细分析了模型优化方法(包括压缩、自适应和神经架构搜索),以及平衡性能、能效与延迟需求的AI驱动资源管理策略。进一步探讨了ECCC系统中隐私保护与安全增强的关键问题,并通过自动驾驶、医疗健康和工业自动化等多样化应用考察实际部署方案。同时深入研究了性能分析与基准测试技术,为这类复杂系统建立评估标准。此外,本文指出了LLM部署、6G融合、神经形态计算与量子计算等关键研究方向,为解决异构性管理、实时处理与可扩展性等长期挑战提供路线图。通过连接理论进展与实际部署,本综述为研究者与实践者提供了利用AI优化分布式计算环境的整体视角,推动下一代智能系统的创新发展。
Generative AI in clinical practice: novel qualitative evidence of risk and responsible use of Google's NotebookLM
Abstract
arXiv:2505.01955v1 Announce Type: new Abstract: The advent of generative artificial intelligence, especially large language models (LLMs), presents opportunities for innovation in research, clinical practice, and education. Recently, Dihan et al. lauded LLM tool NotebookLM's potential, including for generating AI-voiced podcasts to educate patients about treatment and rehabilitation, and for quickly synthesizing medical literature for professionals. We argue that NotebookLM presently poses clinical and technological risks that should be tested and considered prior to its implementation in clinical practice.
摘要
生成式人工智能(尤其是大语言模型)的出现为科研、临床实践和教育领域带来了创新机遇。Dihan等学者近期高度评价了NotebookLM等大语言模型工具的潜力,包括生成AI语音播客以指导患者治疗康复,以及快速整合医学文献供专业人员使用。我们认为,NotebookLM目前存在临床与技术风险,在投入临床应用前需进行充分测试与评估。
From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent
Abstract
arXiv:2505.02024v1 Announce Type: new Abstract: Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between "mind" and "hand" - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.
摘要
Manus AI是2025年初推出的一款通用人工智能代理,标志着自主人工智能领域的重大进展。该技术由中国初创企业Monica.im开发,旨在弥合"思维"与"执行"之间的鸿沟——将大语言模型的推理规划能力与执行复杂端到端任务并产生实际成果的能力相结合。本文全面阐述了Manus AI的核心技术架构,探讨了其在医疗、金融、制造、机器人及游戏等领域的多元化应用,并分析了其核心优势、当前局限及未来潜力。作为技术前瞻的代表,Manus AI预示着智能代理正朝着将高层意图转化为现实行动的方向发展,开启了人机协作的新纪元。
TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition
Abstract
arXiv:2505.02052v1 Announce Type: new Abstract: Sensor-based human activity recognition (HAR) has predominantly focused on Inertial Measurement Units and vision data, often overlooking the capabilities unique to pressure sensors, which capture subtle body dynamics and shifts in the center of mass. Despite their potential for postural and balance-based activities, pressure sensors remain underutilized in the HAR domain due to limited datasets. To bridge this gap, we propose to exploit generative foundation models with pressure-specific HAR techniques. Specifically, we present a bidirectional Text×Pressure (TxP) model that uses generative foundation models to interpret pressure data as natural language. TxP accomplishes two tasks: (1) Text2Pressure, converting activity text descriptions into pressure sequences, and (2) Pressure2Text, generating activity descriptions and classifications from dynamic pressure maps. Leveraging pre-trained models like CLIP and LLaMA 2 13B Chat, TxP is trained on our synthetic PressLang dataset, containing over 81,100 text-pressure pairs. Validated on real-world data for activities such as yoga and daily tasks, TxP provides novel approaches to data augmentation and classification grounded in atomic actions. This consequently improved HAR performance by up to 12.4% in macro F1 score compared to the state-of-the-art, advancing pressure-based HAR with broader applications and deeper insights into human movement.
摘要
基于传感器的人类活动识别(HAR)研究主要集中于惯性测量单元和视觉数据,往往忽视了压力传感器独有的能力——这种传感器能捕捉微妙的体态动力学与重心变化。尽管压力传感器在姿态和平衡类活动识别中具有潜力,但由于数据集匮乏,其在HAR领域仍未得到充分利用。为弥补这一空白,我们提出将生成式基础模型与压力传感专用HAR技术相结合。具体而言,我们开发了一个双向Text×Pressure模型,通过生成式基础模型将压力数据解析为自然语言。TxP实现两大功能:(1)Text2Pressure将文本活动描述转换为压力序列;(2)Pressure2Text从动态压力分布图生成活动描述与分类。基于CLIP和LLaMA 2 13B Chat等预训练模型,TxP在我们合成的PressLang数据集(包含81,100多个文本-压力对)上进行训练。在瑜伽和日常活动等真实场景的验证表明,TxP通过基于原子动作的数据增强与分类方法,使HAR的宏F1分数较现有最优技术提升达12.4%,为压力传感HAR提供了更广阔的应用前景和更深入的人类运动解析能力。
Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants
Abstract
arXiv:2505.02076v1 Announce Type: new Abstract: Advances in Automation and Artificial Intelligence continue to enhance the autonomy of process plants in handling various operational scenarios. However, certain tasks, such as fault handling, remain challenging, as they rely heavily on human expertise. This highlights the need for systematic, knowledge-based methods. To address this gap, we propose a methodological framework that integrates Large Language Model (LLM) agents with a Digital Twin environment. The LLM agents continuously interpret system states and initiate control actions, including responses to unexpected faults, with the goal of returning the system to normal operation. In this context, the Digital Twin acts both as a structured repository of plant-specific engineering knowledge for agent prompting and as a simulation platform for the systematic validation and verification of the generated corrective control actions. The evaluation using a mixing module of a process plant demonstrates that the proposed framework is capable not only of autonomously controlling the mixing module, but also of generating effective corrective actions to mitigate a pipe clogging with only a few reprompts.
摘要
自动化与人工智能的进步持续提升过程工厂处理各类运行场景的自主性。然而诸如故障处理等特定任务仍具挑战性,因其高度依赖人类专业知识,这凸显了对系统化知识驱动方法的需求。为填补这一空白,本研究提出一个将大型语言模型(LLM)智能体与数字孪生环境相结合的方法框架。LLM智能体持续解读系统状态并启动控制动作(包括对意外故障的响应),旨在使系统恢复正常运行。在此框架中,数字孪生既作为工厂特定工程知识的结构化存储库用于智能体提示,又作为仿真平台对生成的校正控制动作进行系统化验证评估。通过对某过程工厂混合模块的测试表明,该框架不仅能自主控制混合模块,还能仅需少量重复提示即可生成有效校正动作以缓解管道堵塞问题。
Retrieval-augmented in-context learning for multimodal large language models in disease classification
Abstract
arXiv:2505.02087v1 Announce Type: new Abstract: Objectives: We aim to dynamically retrieve informative demonstrations, enhancing in-context learning in multimodal large language models (MLLMs) for disease classification. Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL) framework, which integrates retrieval-augmented generation (RAG) and in-context learning (ICL) to adaptively select demonstrations with similar disease patterns, enabling more effective ICL in MLLMs. Specifically, RAICL examines embeddings from diverse encoders, including ResNet, BERT, BioBERT, and ClinicalBERT, to retrieve appropriate demonstrations, and constructs conversational prompts optimized for ICL. We evaluated the framework on two real-world multi-modal datasets (TCGA and IU Chest X-ray), assessing its performance across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies, similarity metrics, and varying numbers of demonstrations. Results: RAICL consistently improved classification performance. Accuracy increased from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest X-ray. Multi-modal inputs outperformed single-modal ones, with text-only inputs being stronger than images alone. The richness of the information embedded in each modality determines which embedding model yields better results. Few-shot experiments showed that increasing the number of retrieved examples further enhanced performance. Across different similarity metrics, Euclidean distance achieved the highest accuracy while cosine similarity yielded better macro-F1 scores. RAICL demonstrated consistent improvements across various MLLMs, confirming its robustness and versatility. Conclusions: RAICL provides an efficient and scalable approach to enhance in-context learning in MLLMs for multimodal disease classification.
摘要
目的:我们旨在动态检索信息性示例,以增强多模态大语言模型(MLLMs)在疾病分类中的上下文学习能力。
方法:提出检索增强的上下文学习(RAICL)框架,该框架结合检索增强生成(RAG)和上下文学习(ICL),自适应选择具有相似疾病模式的示例,从而在多模态大语言模型中实现更有效的上下文学习。具体而言,RAICL通过分析来自不同编码器(包括ResNet、BERT、BioBERT和ClinicalBERT)的嵌入向量来检索合适示例,并构建针对上下文学习优化的对话式提示。我们在两个真实世界多模态数据集(TCGA和IU胸部X光)上评估该框架,测试其在多种MLLMs(Qwen、Llava、Gemma)、嵌入策略、相似性度量及不同示例数量下的表现。
结果:RAICL持续提升分类性能。TCGA数据集准确率从0.7854提升至0.8368,IU胸部X光数据集从0.7924提升至0.8658。多模态输入优于单模态输入,其中纯文本输入强于单独图像输入。各模态嵌入信息的丰富程度将决定采用何种嵌入模型可获得更好结果。少样本实验表明增加检索示例数量可进一步提升性能。在不同相似性度量中,欧氏距离获得最高准确率,而余弦相似性则产生更好的宏观F1分数。RAICL在多种MLLMs上均表现出一致的改进,证实了其稳健性和通用性。
结论:RAICL为增强多模态大语言模型在多模态疾病分类中的上下文学习提供了一种高效且可扩展的方法。
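The retrieval step can be summarized as nearest-neighbour search in an encoder's embedding space followed by assembling the retrieved cases into a few-shot conversational prompt; encode() below is a hypothetical stand-in for ResNet/BERT/BioBERT/ClinicalBERT, and the surrounding names are assumptions rather than RAICL's implementation.

import numpy as np

def encode(items):
    # hypothetical encoder call (e.g. ResNet for images, ClinicalBERT for clinical text)
    raise NotImplementedError

def retrieve_demonstrations(query_item, candidates, labels, k=3, metric="euclidean"):
    q = encode([query_item])[0]
    C = encode(candidates)
    if metric == "euclidean":
        scores = -np.linalg.norm(C - q, axis=1)      # smaller distance = better
    else:
        scores = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)  # cosine
    top = np.argsort(-scores)[:k]
    demos = [(candidates[i], labels[i]) for i in top]
    # demos would then be formatted as in-context examples ahead of the query case
    return demos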
MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents
Abstract
arXiv:2505.02099v1 Announce Type: new Abstract: Recently, large language model based (LLM-based) agents have been widely applied across various fields. As a critical part, their memory capabilities have captured significant interest from both industrial and academic communities. Despite the proposal of many advanced memory models in recent research, however, there remains a lack of unified implementations under a general framework. To address this issue, we develop a unified and modular library for developing advanced memory models of LLM-based agents, called MemEngine. Based on our framework, we implement abundant memory models from recent research works. Additionally, our library facilitates convenient and extensible memory development, and offers user-friendly and pluggable memory usage. To benefit the community, we have made our project publicly available at https://github.com/nuster1128/MemEngine.
摘要
近年来,基于大语言模型(LLM)的智能体已广泛应用于各个领域。作为关键组成部分,其记忆能力引起了工业界和学术界的广泛关注。尽管近期研究提出了许多先进的记忆模型,但在通用框架下仍缺乏统一的实现方案。为解决这一问题,我们开发了一个模块化的统一库MemEngine,用于构建基于LLM智能体的高级记忆模型。基于该框架,我们实现了近期研究中的多种记忆模型。此外,本库支持便捷可扩展的记忆功能开发,并提供用户友好、即插即用的记忆调用方式。为促进社区发展,我们已将项目开源发布于https://github.com/nuster1128/MemEngine。
Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
Abstract
arXiv:2505.02130v1 Announce Type: new Abstract: Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: https://github.com/millioniron/LLM_exploration (LLM4Exploration)
摘要
注意力机制对大型语言模型(LLMs)的成功至关重要,推动了多个领域的重大进展。然而,对于需要强调拓扑连接关系的图结构数据,其表现逊色于基于固定链接的消息传递机制(如图神经网络GNNs所采用的)。这引发了一个问题:'在自然语言场景下,注意力机制是否无法有效处理图数据?'基于这些观察,我们从注意力机制的角度展开实证研究,探索LLMs如何处理图结构数据,旨在深入理解LLMs在图结构上的注意力行为特征。我们发现了LLMs对图结构数据施加注意力的独特现象,并通过分析这些发现来改进LLMs对此类数据的建模能力。主要研究成果包括:1)LLMs虽能识别图数据并捕捉文本-节点交互,但由于固有架构限制,难以建模图结构中的节点间关系;2)LLMs在图节点间的注意力分布与理想结构模式不符,表明其未能适应图拓扑的细微特征;3)全连接注意力和固定连接均非最优方案,各自在应用场景中存在特定局限。而中间态注意力窗口能提升LLMs训练性能,并在推理时无缝过渡至全连接窗口。
Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes
Abstract
arXiv:2505.02184v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used for generating parallel scientific code, most current efforts emphasize functional correctness, often overlooking performance and energy considerations. In this work, we propose LASSI-EE, an automated LLM-based refactoring framework that generates energy-efficient parallel code on a target parallel system for a given parallel code as input. Through a multi-stage, iterative pipeline process, LASSI-EE achieved an average energy reduction of 47% across 85% of the 20 HeCBench benchmarks tested on NVIDIA A100 GPUs. Our findings demonstrate the broader potential of LLMs, not only for generating correct code but also for enabling energy-aware programming. We also address key insights and limitations within the framework, offering valuable guidance for future improvements.
摘要
虽然大语言模型(LLMs)越来越多地用于生成并行科学代码,但当前大多数工作仅关注功能正确性,往往忽视了性能和能耗考量。本研究提出LASSI-EE,这是一个基于LLM的自动化代码重构框架,能够针对给定的并行代码输入,在目标并行系统上生成高能效的并行代码。通过多阶段迭代式流程,LASSI-EE在NVIDIA A100 GPU上测试的20个HeCBench基准程序中,对85%的案例实现了平均47%的能耗降低。我们的研究结果表明,LLMs不仅具备生成正确代码的能力,更在能源感知编程方面展现出广阔潜力。同时,我们针对该框架提出了关键见解与局限性分析,为未来改进提供了有价值的指导。
LLM-Guided Probabilistic Program Induction for POMDP Model Estimation
Abstract
arXiv:2505.02216v1 Announce Type: new Abstract: Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.
摘要
部分可观测马尔可夫决策过程(POMDPs)用于建模不确定性下的决策问题。尽管已有多种近似求解POMDPs的方法,本研究致力于解决此类模型的学习问题。我们特别关注一类POMDPs子集,其模型组件(包括观测函数、奖励函数、转移函数和初始状态分布函数)均可表示为短概率程序形式的低复杂度概率图模型。我们的学习策略采用大型语言模型(LLM)作为先验,生成候选概率程序后通过经验分布测试并基于反馈进行调整。实验涵盖经典玩具POMDP问题、模拟MiniGrid领域以及两个涉及部分可观测性的真实移动基座机器人搜索场景。结果表明:利用LLM指导构建低复杂度POMDP模型的方法,相较于表格型POMDP学习、行为克隆或直接LLM规划更具实效性。
Real-time Spatial Retrieval Augmented Generation for Urban Environments
Abstract
arXiv:2505.02271v1 Announce Type: new Abstract: The proliferation of Generative Artificial Intelligence (AI), especially Large Language Models, presents transformative opportunities for urban applications through Urban Foundation Models. However, base models face limitations, as they only contain the knowledge available at the time of training, and updating them is both time-consuming and costly. Retrieval Augmented Generation (RAG) has emerged in the literature as the preferred approach for injecting contextual information into Foundation Models. It prevails over techniques such as fine-tuning, which are less effective in dynamic, real-time scenarios like those found in urban environments. However, traditional RAG architectures, based on semantic databases, knowledge graphs, structured data, or AI-powered web searches, do not fully meet the demands of urban contexts. Urban environments are complex systems characterized by large volumes of interconnected data, frequent updates, real-time processing requirements, security needs, and strong links to the physical world. This work proposes a real-time spatial RAG architecture that defines the necessary components for the effective integration of generative AI into cities, leveraging temporal and spatial filtering capabilities through linked data. The proposed architecture is implemented using FIWARE, an ecosystem of software components to develop smart city solutions and digital twins. The design and implementation are demonstrated through the use case of a tourism assistant in the city of Madrid. The use case serves to validate the correct integration of Foundation Models through the proposed RAG architecture.
摘要
生成式人工智能(AI),尤其是大语言模型的激增,通过城市基础模型为城市应用带来了变革性机遇。然而,基础模型存在局限性,因为它们仅包含训练时可用的知识,且更新过程耗时且成本高昂。检索增强生成(RAG)在文献中已成为向基础模型注入上下文信息的首选方法。相较于微调等技术,RAG在动态、实时的城市环境场景中表现更优。然而,传统的基于语义数据库、知识图谱、结构化数据或AI驱动的网络搜索的RAG架构,并不能完全满足城市环境的需求。城市环境是复杂的系统,具有海量互联数据、频繁更新、实时处理需求、安全性要求以及与物理世界紧密联系等特点。本研究提出了一种实时空间RAG架构,通过关联数据的时空过滤能力,定义了将生成式AI有效集成到城市中所必需的组件。该架构采用FIWARE(一个用于开发智慧城市解决方案和数字孪生的软件组件生态系统)实现,并以马德里市的旅游助手用例展示了设计与实施过程。该用例验证了通过所提出的RAG架构实现基础模型正确集成的有效性。
A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)
Abstract
arXiv:2505.02279v1 Announce Type: new Abstract: Large language model (LLM)-powered autonomous agents demand robust, standardized protocols to integrate tools, share contextual data, and coordinate tasks across heterogeneous systems. Ad-hoc integrations are difficult to scale, secure, and generalize across domains. This survey examines four emerging agent communication protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), each addressing interoperability in distinct deployment contexts. MCP provides a JSON-RPC client-server interface for secure tool invocation and typed data exchange. ACP introduces REST-native messaging via multi-part messages and asynchronous streaming to support multimodal agent responses. A2A enables peer-to-peer task outsourcing through capability-based Agent Cards, facilitating enterprise-scale workflows. ANP supports open-network agent discovery and secure collaboration using decentralized identifiers (DIDs) and JSON-LD graphs. The protocols are compared across multiple dimensions, including interaction modes, discovery mechanisms, communication patterns, and security models. Based on the comparative analysis, a phased adoption roadmap is proposed: beginning with MCP for tool access, followed by ACP for multimodal messaging, A2A for collaborative task execution, and extending to ANP for decentralized agent marketplaces. This work provides a comprehensive foundation for designing secure, interoperable, and scalable ecosystems of LLM-powered agents.
摘要
基于大语言模型(LLM)的自主代理需要强大、标准化的协议来集成工具、共享上下文数据并在异构系统间协调任务。临时集成方案难以实现跨领域的规模化扩展、安全保障和泛化应用。本研究考察了四种新兴的智能体通信协议:模型上下文协议(MCP)、代理通信协议(ACP)、代理间协议(A2A)和代理网络协议(ANP),每种协议针对不同部署场景的互操作性需求。MCP通过JSON-RPC客户端-服务器接口实现安全的工具调用和类型化数据交换;ACP采用多部分消息和异步流传输的REST原生消息机制,支持多模态代理响应;A2A通过基于能力的代理卡片实现点对点任务外包,促进企业级工作流协作;ANP利用去中心化标识符(DIDs)和JSON-LD图谱支持开放网络的代理发现与安全协作。研究从交互模式、发现机制、通信范式和安全性模型等多个维度对协议进行比较,并据此提出分阶段采用路线图:从工具接入的MCP开始,逐步扩展到多模态消息传递的ACP、协作任务执行的A2A,最终延伸至去中心化代理市场的ANP。本工作为构建安全、可互操作且可扩展的LLM驱动代理生态系统奠定了系统化基础。
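To give a flavor of the MCP interaction style described in this survey, the snippet below constructs an illustrative JSON-RPC 2.0 tool-invocation request and a matching response; the method name, tool name, and field layout are indicative of the protocol's style rather than a normative schema.

import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                                   # illustrative tool-invocation method
    "params": {"name": "get_weather", "arguments": {"city": "Madrid"}},
}
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "22 C, clear"}]},
}
print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))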
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking
Abstract
arXiv:2505.02322v1 Announce Type: new Abstract: Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.
摘要
近期研究显著提升了大型语言模型(LLMs)处理复杂推理任务的能力,在数学与逻辑推理等领域取得了显著成果。然而,这些方法在应对复杂规划任务时仍面临挑战,主要源于推理步骤冗长、约束条件多样以及需同时处理多个独立子任务。为此,我们提出超树规划(HTP)——一种通过构建超树结构规划纲要来实现高效推理的新范式。该结构使LLMs能灵活运用分治策略进行层次化思考,有效分解复杂推理步骤、协调多样约束条件,并以体系化方式管理多个独立子任务。我们进一步提出自主规划框架,通过迭代优化与扩展超树结构规划纲要来完成规划过程。实验证明HTP在TravelPlanner基准测试中采用Gemini-1.5-Pro实现了最先进精度,性能较o1-preview提升3.6倍。
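A minimal rendering of the divide-and-conquer outline is sketched below: a node holds a goal, and a hypothetical decompose() call (an LLM prompt in the paper's setting) splits it into sub-goals until a depth limit; the actual hypertree additionally groups alternative decompositions and iteratively refines the outline, which this toy version omits.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    goal: str
    children: List["PlanNode"] = field(default_factory=list)

def decompose(goal):
    # hypothetical LLM call returning an ordered list of sub-goals ([] if atomic)
    raise NotImplementedError

def expand(node, depth=0, max_depth=3):
    if depth >= max_depth:
        return
    for sub in decompose(node.goal):          # divide: break the goal into sub-goals
        child = PlanNode(goal=sub)
        node.children.append(child)
        expand(child, depth + 1, max_depth)   # conquer: recursively refine each sub-goal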
Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques
Abstract
arXiv:2505.02351v1 Announce Type: new Abstract: In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post-Training Quantization (GPTQ) scheme combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
摘要
在深度学习领域,传统注意力机制在处理长序列数据时面临计算复杂度高和内存消耗大的显著挑战。为突破这些限制,我们提出Opt-GPTQ——一种融合分组查询注意力(GQA)机制与分页内存管理的优化梯度后训练量化方法。该方法通过分组查询头并共享键值向量,优化了传统多头注意力(MHA)机制。优化后的GQA(Opt-GQA)有效降低了计算复杂度,减少内存碎片,并提升大规模模型的显存利用率。Opt-GPTQ针对数据中心计算单元(DCUs)进行专项优化,集成至vLLM模型以实现硬件效率最大化,通过定制GPU核函数进一步减少内存访问延迟并提升并行计算能力来增强注意力计算。Opt-GQA集成线性偏置注意力(ALiBi)以降低开销并强化长序列处理能力。实验结果表明,Opt-GPTQ在提升模型性能的同时,显著减少了计算时间和内存占用。
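The core of grouped-query attention is that several query heads share one key-value head, which shrinks the KV cache; the numpy sketch below computes single-token GQA for a toy cache and is only meant to show the sharing pattern, not Opt-GPTQ's kernels, paging, or ALiBi bias.

import numpy as np

def grouped_query_attention(Q, K, V, n_groups):
    # Q: (H_q, d) query vectors for the current token
    # K, V: (n_groups, T, d) cached keys/values shared within each group of query heads
    n_q_heads, d = Q.shape
    heads_per_group = n_q_heads // n_groups
    out = np.zeros_like(Q)
    for h in range(n_q_heads):
        g = h // heads_per_group                   # query head -> shared KV group
        scores = K[g] @ Q[h] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[h] = weights @ V[g]
    return out

rng = np.random.default_rng(0)
H, G, T, d = 8, 2, 16, 64                          # 8 query heads sharing 2 KV groups
y = grouped_query_attention(rng.normal(size=(H, d)),
                            rng.normal(size=(G, T, d)),
                            rng.normal(size=(G, T, d)), n_groups=G)
print(y.shape)                                     # (8, 64)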
Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks
Abstract
arXiv:2505.02413v1 Announce Type: new Abstract: Task-oriented semantic communication has emerged as a fundamental approach for enhancing performance in various communication scenarios. While recent advances in Generative Artificial Intelligence (GenAI), such as Large Language Models (LLMs), have been applied to semantic communication designs, the potential of Large Multimodal Models (LMMs) remains largely unexplored. In this paper, we investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA) and propose a task-oriented semantic communication framework to facilitate efficient interaction between users and cloud servers. To reduce computational demands and shorten response time, we optimize LLaVA's image slicing to selectively focus on areas of utmost interest to users. Additionally, we assess the importance of image patches by combining objective and subjective user attention, adjusting energy usage for transmitting semantic information. This strategy optimizes resource utilization, ensuring precise transmission of critical information. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness. Experimental results show that our semantic communication framework significantly increases accuracy in answering questions under the same channel conditions, performing particularly well in environments with poor Signal-to-Noise Ratios (SNR). Accuracy can be improved by 13.4% at an SNR of 12dB and 33.1% at 10dB, respectively.
摘要
面向任务的语义通信已成为提升各类通信场景性能的基础方法。尽管生成式人工智能(GenAI)的最新进展(如大语言模型LLMs)已被应用于语义通信设计,但大型多模态模型(LMMs)的潜力仍待充分挖掘。本文基于大型语言视觉助手LLaVA构建车辆AI助手,提出一种面向任务的语义通信框架以优化用户与云服务器间的高效交互。为降低计算需求并缩短响应时间,我们优化LLaVA的图像切片机制,使其选择性聚焦用户最关注区域。同时通过结合客观指标与用户主观注意力评估图像块重要性,动态调整语义信息传输的能耗策略,从而优化资源利用并确保关键信息的精准传输。针对交通场景构建视觉问答(VQA)数据集进行效果验证,实验表明:在相同信道条件下,本语义通信框架显著提升问题回答准确率,且在低信噪比(SNR)环境中表现尤为突出——在12dB和10dB信噪比下准确率分别提升13.4%和33.1%。
Incentivizing Inclusive Contributions in Model Sharing Markets
Abstract
arXiv:2505.02462v1 Announce Type: new Abstract: While data plays a crucial role in training contemporary AI models, it is acknowledged that valuable public data will be exhausted in a few years, directing the world's attention towards the massive decentralized private data. However, the privacy-sensitive nature of raw data and lack of incentive mechanism prevent these valuable data from being fully exploited. Addressing these challenges, this paper proposes inclusive and incentivized personalized federated learning (iPFL), which incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data. iPFL constructs a model-sharing market by solving a graph-based training optimization and incorporates an incentive mechanism based on game theory principles. Theoretical analysis shows that iPFL adheres to two key incentive properties: individual rationality and truthfulness. Empirical studies on eleven AI tasks (e.g., large language models' instruction-following tasks) demonstrate that iPFL consistently achieves the highest economic utility, and better or comparable model performance compared to baseline methods. We anticipate that our iPFL can serve as a valuable technique for boosting future AI models on decentralized private data while making everyone satisfied.
摘要
尽管数据在训练当代AI模型中起着关键作用,但公认的是,有价值的公共数据将在几年内耗尽,这使得全球目光转向海量分散的私有数据。然而,原始数据的隐私敏感性及激励机制缺失,阻碍了这些宝贵数据的充分利用。针对这些挑战,本文提出包容性激励型个性化联邦学习(iPFL),该系统在不暴露原始数据的前提下,激励具有多样化目标的数据持有者协同训练个性化模型。iPFL通过求解基于图的训练优化问题构建模型共享市场,并融合基于博弈论原理的激励机制。理论分析表明iPFL符合两项关键激励属性:个体合理性与真实性。在11项AI任务(如大语言模型指令跟随任务)上的实证研究表明,相较于基线方法,iPFL始终能实现最高的经济效用,并获得相当或更优的模型性能。我们预期iPFL能成为未来基于分散私有数据训练AI模型的重要技术,同时实现多方共赢。
El Agente: An Autonomous Agent for Quantum Chemistry
Abstract
arXiv:2505.02484v1 Announce Type: new Abstract: Computational chemistry tools are widely used to study the behaviour of chemical phenomena. Yet, the complexity of these tools can make them inaccessible to non-specialists and challenging even for experts. In this work, we introduce El Agente Q, an LLM-based multi-agent system that dynamically generates and executes quantum chemistry workflows from natural language user prompts. The system is built on a novel cognitive architecture featuring a hierarchical memory framework that enables flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling and submission. El Agente Q is benchmarked on six university-level course exercises and two case studies, demonstrating robust problem-solving performance (averaging >87% task success) and adaptive error handling through in situ debugging. It also supports longer-term, multi-step task execution for more complex workflows, while maintaining transparency through detailed action trace logs. Together, these capabilities lay the foundation for increasingly autonomous and accessible quantum chemistry.
摘要
计算化学工具被广泛用于研究化学现象的行为特征。然而,这些工具的复杂性使得非专业人士难以使用,甚至对专家也构成挑战。本研究推出El Agente Q——一个基于大语言模型的多智能体系统,能够根据自然语言用户指令动态生成并执行量子化学工作流程。该系统采用新型认知架构,其层级化记忆框架支持灵活的任务分解、自适应工具选择、后分析处理以及自主文件管理与提交。通过对六项大学课程习题和两个案例研究的基准测试,El Agente Q展现出强大的问题解决能力(平均任务成功率>87%),并能通过原位调试实现自适应错误处理。该系统还支持更复杂工作流程的多步骤长期任务执行,同时通过详细动作追踪日志保持透明度。这些能力共同为日益自主化、平民化的量子化学研究奠定了基础。
Beyond the model: Key differentiators in large language models and multi-agent services
Abstract
arXiv:2505.02489v1 Announce Type: new Abstract: With the launch of foundation models like DeepSeek, Manus AI, and Llama 4, it has become evident that large language models (LLMs) are no longer the sole defining factor in generative AI. As many now operate at comparable levels of capability, the real race is not about having the biggest model but optimizing the surrounding ecosystem, including data quality and management, computational efficiency, latency, and evaluation frameworks. This review article delves into these critical differentiators that ensure modern AI services are efficient and profitable.
摘要
随着DeepSeek、Manus AI和Llama 4等基础模型的发布,大型语言模型(LLMs)已不再是生成式AI的唯一决定性因素。由于当前许多模型已具备相当的能力水平,真正的竞争焦点并非构建最大规模的模型,而是优化包括数据质量与管理、计算效率、延迟及评估框架在内的生态系统。本文综述了这些确保现代AI服务高效性与盈利性的关键差异化要素。
Large Language Model Partitioning for Low-Latency Inference at the Edge
Abstract
arXiv:2505.02533v1 Announce Type: new Abstract: Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver's latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
摘要
基于自回归解码器架构的Transformer大语言模型(LLMs)以离散文本单元(token)为粒度逐次生成文本。随着新生成token不断追加到部分输出序列中,由于多头注意力层(MHA)需要存储所有已生成token的中间表示(键值缓存),序列长度增长导致内存和计算负载持续增加。这种迭代过程会不断推高内存与计算需求,在资源受限的边缘计算环境中,基于层的模型分区方案常引发内存过载或高推理延迟。为降低推理延迟,我们提出一种资源感知的Transformer架构分区算法,该算法在token生成过程中定期更新分区决策。该方法具有短视特性,其决策依据设备实时资源可用性和网络链路带宽的瞬时信息:首次执行时在设备上分配模型块,后续执行时通过跨设备迁移模型块来保持迁移延迟与推理延迟之和最小。我们的方案在注意力头粒度进行解码器分区,将每个注意力头与其键值缓存共同部署,并在资源紧张时触发动态迁移。通过将不同注意力头分配至不同设备,我们实现了注意力头的并行执行,从而显著降低推理延迟。实验表明:在小规模场景(3-5台设备)中,本方法能达到精确最优求解器15%-20%延迟范围内的性能;在大规模测试中,相比最先进的基于层的分区方法,本方案在推理速度和内存使用方面均取得显著提升。
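A toy version of head-level placement is sketched below: each attention head (with its key-value cache) is greedily assigned to the feasible device with the lowest projected load; the capacities, speeds, and the greedy rule are illustrative assumptions and do not reproduce the paper's resource-aware algorithm or its migration step.

def assign_heads(head_mem, device_mem, device_speed):
    # head_mem: memory demand per head (incl. its KV cache); device_mem: capacities;
    # device_speed: relative throughput per device
    load = [0.0] * len(device_mem)
    used = [0.0] * len(device_mem)
    placement = []
    for h, m in enumerate(head_mem):
        feasible = [d for d in range(len(device_mem)) if used[d] + m <= device_mem[d]]
        if not feasible:
            raise RuntimeError("no device can hold head %d" % h)
        best = min(feasible, key=lambda d: (load[d] + 1.0) / device_speed[d])  # lowest projected load
        placement.append(best)
        used[best] += m
        load[best] += 1.0
    return placement

print(assign_heads([2, 2, 2, 2, 2, 2], [5, 5, 8], [1.0, 1.0, 2.0]))   # -> [2, 0, 1, 2, 2, 0]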
Recursive Decomposition with Dependencies for Generic Divide-and-Conquer Reasoning
Abstract
arXiv:2505.02576v1 Announce Type: new Abstract: Reasoning tasks are crucial in many domains, especially in science and engineering. Although large language models (LLMs) have made progress in reasoning tasks using techniques such as chain-of-thought and least-to-most prompting, these approaches still do not effectively scale to complex problems in either their performance or execution time. Moreover, they often require additional supervision for each new task, such as in-context examples. In this work, we introduce Recursive Decomposition with Dependencies (RDD), a scalable divide-and-conquer method for solving reasoning problems that requires less supervision than prior approaches. Our method can be directly applied to a new problem class even in the absence of any task-specific guidance. Furthermore, RDD supports sub-task dependencies, allowing for ordered execution of sub-tasks, as well as an error recovery mechanism that can correct mistakes made in previous steps. We evaluate our approach on two benchmarks with six difficulty levels each and in two in-context settings: one with task-specific examples and one without. Our results demonstrate that RDD outperforms other methods in a compute-matched setting as task complexity increases, while also being more computationally efficient.
摘要
推理任务在诸多领域尤其是科学与工程中至关重要。尽管大语言模型(LLMs)通过思维链、最少到最多提示等技术在推理任务上取得进展,这些方法在性能或执行时间上仍难以有效扩展到复杂问题。此外,它们通常需要为每个新任务提供额外监督(例如上下文示例)。本研究提出带依赖的递归分解(RDD)——一种可扩展的分治方法,其所需的监督少于现有方案。即使缺乏针对特定任务的指导,该方法也能直接应用于新问题类别。RDD还支持子任务依赖关系,允许有序执行子任务,并具备错误恢复机制以修正先前步骤的错误。我们在两个各含六个难度等级的基准测试中评估该方法,采用两种上下文设置:含任务特定示例与不含示例。结果表明,随着任务复杂性增加,RDD在计算资源匹配的设置中优于其他方法,同时具备更高计算效率。
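The sketch below shows the general divide-and-conquer recursion with dependency-ordered sub-task execution; decompose() and solve_directly() are hypothetical LLM calls, sub-tasks are assumed to be returned in dependency order, and the paper's error-recovery mechanism is left out for brevity.

def decompose(task):
    # hypothetical LLM call returning sub-tasks with dependencies, e.g.
    # [{"id": "a", "task": "...", "deps": []}, {"id": "b", "task": "...", "deps": ["a"]}],
    # or [] when the task should be solved directly
    raise NotImplementedError

def solve_directly(task, context):
    # hypothetical LLM call answering an atomic task given its solved dependencies
    raise NotImplementedError

def rdd_solve(task, context=None, depth=0, max_depth=4):
    context = context or {}
    subtasks = decompose(task) if depth < max_depth else []       # divide
    if not subtasks:
        return solve_directly(task, context)                      # conquer
    results = {}
    for sub in subtasks:                                          # dependency-ordered execution
        sub_context = {d: results[d] for d in sub["deps"]}
        results[sub["id"]] = rdd_solve(sub["task"], sub_context, depth + 1, max_depth)
    return solve_directly(task, results)                          # merge sub-results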
A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law
Abstract
arXiv:2505.02665v1 Announce Type: new Abstract: This survey explores recent advancements in reasoning large language models (LLMs) designed to mimic "slow thinking" - a reasoning process inspired by human cognition, as described in Kahneman's Thinking, Fast and Slow. These models, like OpenAI's o1, focus on scaling computational resources dynamically during complex tasks, such as math reasoning, visual reasoning, medical diagnosis, and multi-agent debates. We present the development of reasoning LLMs and list their key technologies. By synthesizing over 100 studies, the survey charts a path toward LLMs that combine human-like deep thinking with scalable efficiency for reasoning. The review breaks down methods into three categories: (1) test-time scaling, which dynamically adjusts computation based on task complexity via search, sampling, and dynamic verification; (2) reinforced learning, which refines decision-making through iterative improvement leveraging policy networks, reward models, and self-evolution strategies; and (3) slow-thinking frameworks (e.g., long CoT, hierarchical processes) that structure problem-solving with manageable steps. The survey highlights the challenges and further directions of this domain. Understanding and advancing the reasoning abilities of LLMs is crucial for unlocking their full potential in real-world applications, from scientific discovery to decision support systems.
摘要
本综述探讨了旨在模拟'慢思考'(源自卡尼曼《思考,快与慢》中描述的人类认知推理过程)的推理大语言模型(LLMs)的最新进展。这类模型(如OpenAI的o1)通过在数学推理、视觉推理、医疗诊断和多智能体辩论等复杂任务中动态扩展计算资源来实现该目标。我们系统梳理了推理LLMs的发展脉络,并列举其关键技术。通过综合分析100余项研究,本文为兼具类人深度思考能力与可扩展推理效率的LLMs指明了发展路径。现有方法可分为三类:(1) 测试时动态扩展:通过搜索采样、动态验证等方式根据任务复杂度调整计算量;(2) 强化学习:利用策略网络、奖励模型及自我进化策略实现决策迭代优化;(3) 慢思考框架(如长思维链、分层处理):通过可管理的步骤结构化解决问题。研究同时指出了该领域面临的挑战与未来方向。理解并提升LLMs的推理能力,对于释放其在从科学发现到决策支持系统等现实应用中的全部潜能具有关键意义。
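Of the three method families the survey covers, test-time scaling is the easiest to make concrete. The sketch below shows plain self-consistency sampling with majority voting; the number of chains, the sampling temperature, and the answer-extraction rule are assumptions for illustration, and `sample_chain` stands in for any reasoning LLM call.

```python
from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     sample_chain: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     n_chains: int = 8) -> str:
    """Scale test-time compute by sampling several reasoning chains and voting.

    `sample_chain` draws one chain-of-thought completion (e.g. at temperature > 0);
    `extract_answer` pulls the final short answer out of a chain.
    """
    answers = [extract_answer(sample_chain(question)) for _ in range(n_chains)]
    # Majority vote over the final answers; ties resolve to the most-common-first answer.
    return Counter(answers).most_common(1)[0][0]
```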
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
Abstract
arXiv:2505.02707v1 Announce Type: new Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
摘要
一个能无缝融入日常生活的语音AI代理,将以自主、实时且富有情感表达的方式与人类互动。它不仅能响应指令,更能持续聆听、推理并主动回应,促成流畅、动态且情感共鸣的交互。我们推出Voila系列大规模语音-语言基础模型,向这一愿景迈出重要一步。Voila突破传统流水线系统,采用新型端到端架构,在保留音调、节奏和情感等丰富声音细节的同时,实现全双工低延迟对话,响应延迟仅195毫秒,超越人类平均反应时间。其分层多尺度Transformer架构融合了大语言模型(LLMs)的推理能力与强大声学建模技术,支持自然且具备角色意识的语音生成——用户仅需通过文本指令即可定义说话者身份、语调等特征。此外,Voila支持超百万种预制声音,并能基于短至10秒的音频样本高效定制新声音。除口语对话外,Voila被设计为统一的多功能模型,适用于自动语音识别(ASR)、文本转语音(TTS)等广泛语音应用,经简单适配还可实现多语种语音翻译。Voila已全面开源以支持开放研究,加速下一代人机交互的发展。
Technical Report: Evaluating Goal Drift in Language Model Agents
Abstract
arXiv:2505.02709v1 Announce Type: new Abstract: As language models (LMs) are increasingly deployed as autonomous agents, their robust adherence to human-assigned objectives becomes crucial for safe operation. When these agents operate independently for extended periods without human oversight, even initially well-specified goals may gradually shift. Detecting and measuring goal drift - an agent's tendency to deviate from its original objective over time - presents significant challenges, as goals can shift gradually, causing only subtle behavioral changes. This paper proposes a novel approach to analyzing goal drift in LM agents. In our experiments, agents are first explicitly given a goal through their system prompt, then exposed to competing objectives through environmental pressures. We demonstrate that while the best-performing agent (a scaffolded version of Claude 3.5 Sonnet) maintains nearly perfect goal adherence for more than 100,000 tokens in our most difficult evaluation setting, all evaluated models exhibit some degree of goal drift. We also find that goal drift correlates with models' increasing susceptibility to pattern-matching behaviors as the context length grows.
摘要
随着语言模型(LMs)越来越多地被部署为自主智能体,其对人类设定目标的稳健遵循对安全运行至关重要。当这些智能体在无人监督的情况下长期独立运行时,即使最初明确指定的目标也可能逐渐发生偏移。检测和衡量目标漂移(即智能体随时间推移偏离原始目标的倾向)存在重大挑战,因为目标可能逐渐变化,仅导致细微的行为改变。本文提出了一种分析语言模型智能体目标漂移的新方法。实验中,我们首先通过系统提示明确赋予智能体目标,随后通过环境压力使其暴露于竞争性目标。研究表明,在最严苛的评估设置下,性能最佳的智能体(基于Claude 3.5 Sonnet的支架版本)能在超过10万标记的范围内近乎完美地保持目标遵循,但所有被评估模型均表现出不同程度的目标漂移。我们还发现,随着上下文长度增加,目标漂移与模型对模式匹配行为的敏感性增强存在相关性。
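An evaluation of this kind can be organised as one long rollout in which the environment periodically injects competing objectives and every agent action is scored for adherence to the original system-prompt goal. The loop below is only a schematic reconstruction of such a harness; the scoring rubric, the pressure schedule, and all callables are assumptions, not the report's actual setup.

```python
from typing import Callable

def measure_goal_drift(system_goal: str,
                       agent_step: Callable[[list[str]], str],
                       env_step: Callable[[str, int], str],
                       score_adherence: Callable[[str, str], float],
                       n_steps: int = 200) -> list[float]:
    """Run one long episode and record per-step adherence to the original goal.

    `agent_step` maps the running transcript to the agent's next action;
    `env_step` returns the next observation (and may inject competing objectives);
    `score_adherence` rates an action against the original goal in [0, 1].
    """
    transcript = [f"SYSTEM: {system_goal}"]
    adherence = []
    for t in range(n_steps):
        action = agent_step(transcript)
        adherence.append(score_adherence(system_goal, action))
        transcript.append(f"AGENT: {action}")
        transcript.append(f"ENV: {env_step(action, t)}")
    return adherence  # drift shows up as a downward trend as the context grows
```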
Enhancing LLMs' Clinical Reasoning with Real-World Data from a Nationwide Sepsis Registry
Abstract
arXiv:2505.02722v1 Announce Type: new Abstract: Although large language models (LLMs) have demonstrated impressive reasoning capabilities across general domains, their effectiveness in real-world clinical practice remains limited. This is likely due to their insufficient exposure to real-world clinical data during training, as such data is typically not included due to privacy concerns. To address this, we propose enhancing the clinical reasoning capabilities of LLMs by leveraging real-world clinical data. We constructed reasoning-intensive questions from a nationwide sepsis registry and fine-tuned Phi-4 on these questions using reinforcement learning, resulting in C-Reason. C-Reason exhibited strong clinical reasoning capabilities on the in-domain test set, as evidenced by both quantitative metrics and expert evaluations. Furthermore, its enhanced reasoning capabilities generalized to a sepsis dataset involving different tasks and patient cohorts, an open-ended consultations on antibiotics use task, and other diseases. Future research should focus on training LLMs with large-scale, multi-disease clinical datasets to develop more powerful, general-purpose clinical reasoning models.
摘要
尽管大语言模型(LLMs)在通用领域已展现出卓越的推理能力,但其在真实世界临床实践中的有效性仍显不足。这可能是由于训练过程中接触的真实临床数据有限——此类数据通常因隐私问题未被纳入。为解决该问题,我们提出通过利用真实临床数据来增强LLMs的临床推理能力。我们从全国性脓毒症注册库构建了推理密集型问题集,并采用强化学习对Phi-4模型进行微调,最终开发出C-Reason系统。定量指标与专家评估均证实,C-Reason在领域内测试集上表现出强大的临床推理能力。此外,其增强的推理能力可泛化至不同任务和患者群体的脓毒症数据集、抗生素使用开放式咨询任务以及其他疾病领域。未来研究应聚焦于利用大规模多疾病临床数据集训练LLMs,以开发更强大的通用临床推理模型。
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
Abstract
arXiv:2505.02735v1 Announce Type: new Abstract: Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.
摘要
形式化数学推理仍是人工智能面临的关键挑战,现有基准在广度和规模上的局限阻碍了相关进展。为此,我们提出FormalMATH——一个基于Lean4的大规模基准测试集,包含5,560个经过形式化验证的问题,涵盖从高中数学奥林匹克竞赛到本科阶段跨多领域(如代数、应用数学、微积分、数论和离散数学)的定理。为降低人工形式化的低效性,我们开发了一种新型人机协同自动形式化流程,整合了:(1)专用于命题自动形式化的大语言模型(LLMs);(2)多LLM语义验证机制;(3)基于否证的反例过滤策略(利用现成LLM证明器)。该方法在保持原始自然语言问题保真度的前提下,通过人工验证前保留72.09%的命题,显著降低了专家标注成本。对前沿LLM定理证明器的评估揭示了重大局限:即使在实用采样预算下,最强模型的成功率仅达16.46%,并表现出明显领域偏差(如擅长代数但拙于微积分)及对简化自动化策略的过度依赖。值得注意的是,我们发现思维链推理场景中存在反直觉现象:自然语言解题指导与证明成功率呈负相关,表明人类撰写的非形式化推理在形式化推理环境中反而引入了噪声而非清晰性。我们相信FormalMATH能为形式化数学推理研究提供强有力的基准支撑。
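The third pipeline stage is the most mechanical: a formalized statement is rejected when an off-the-shelf prover proves its negation, since a provably false statement cannot faithfully capture the original problem. The snippet below is a schematic of that filter only; the `prove` callable, the way the negation is formed, and the timeout are illustrative assumptions rather than the paper's exact tooling.

```python
from typing import Callable

def negation_disproof_filter(statements: list[str],
                             prove: Callable[[str, float], bool],
                             timeout_s: float = 60.0) -> list[str]:
    """Keep only statements whose negation the prover fails to prove.

    `prove(goal, timeout_s)` returns True iff the prover closes the goal in time.
    For a Lean-style statement `s`, a disproof attempt targets `¬ (s)`.
    """
    kept = []
    for s in statements:
        negated = f"¬ ({s})"          # schematic; a real pipeline negates at the Lean AST level
        if not prove(negated, timeout_s):
            kept.append(s)            # no disproof found -> the statement survives the filter
    return kept
```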
Giving Simulated Cells a Voice: Evolving Prompt-to-Intervention Models for Cellular Control
Abstract
arXiv:2505.02766v1 Announce Type: new Abstract: Guiding biological systems toward desired states, such as morphogenetic outcomes, remains a fundamental challenge with far-reaching implications for medicine and synthetic biology. While large language models (LLMs) have enabled natural language as an interface for interpretable control in AI systems, their use as mediators for steering biological or cellular dynamics remains largely unexplored. In this work, we present a functional pipeline that translates natural language prompts into spatial vector fields capable of directing simulated cellular collectives. Our approach combines a large language model with an evolvable neural controller (Prompt-to-Intervention, or P2I), optimized via evolutionary strategies to generate behaviors such as clustering or scattering in a simulated 2D environment. We demonstrate that even with constrained vocabulary and simplified cell models, evolved P2I networks can successfully align cellular dynamics with user-defined goals expressed in plain language. This work offers a complete loop from language input to simulated bioelectric-like intervention to behavioral output, providing a foundation for future systems capable of natural language-driven cellular control.
摘要
引导生物系统实现预期状态(如形态发生结果)仍是基础性挑战,对医学和合成生物学具有深远意义。尽管大语言模型(LLMs)已使自然语言成为AI系统中可解释控制的接口,但其作为调控生物或细胞动力学中介的应用仍待探索。本研究提出一个功能性流程,将自然语言提示转化为能指导模拟细胞群体的空间矢量场。该方法将大语言模型与可进化神经控制器(Prompt-to-Intervention,简称P2I)相结合,通过进化策略优化以在模拟2D环境中生成聚集或分散等行为。实验表明,即使使用受限词汇和简化细胞模型,进化后的P2I网络仍能成功使细胞动力学与用户用自然语言定义的目标保持一致。该研究实现了从语言输入到模拟类生物电干预再到行为输出的完整闭环,为未来实现自然语言驱动的细胞控制系统奠定了基础。
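The evolutionary-strategies part of the pipeline can be pictured with a generic parameter-perturbation loop: the P2I controller's weights are repeatedly perturbed, each perturbation is scored by running the simulated cell collective, and the weights follow the fitness-weighted average of the perturbations. The sketch below is a standard OpenAI-ES-style loop under assumed hyperparameters; the `fitness` callable (e.g. negative mean pairwise cell distance for a "cluster" prompt) is a placeholder, not the paper's simulator.

```python
import numpy as np
from typing import Callable

def evolve_controller(fitness: Callable[[np.ndarray], float],
                      n_params: int,
                      generations: int = 100,
                      pop_size: int = 32,
                      sigma: float = 0.1,
                      lr: float = 0.02,
                      seed: int = 0) -> np.ndarray:
    """Perturb controller parameters, score rollouts, follow the estimated gradient."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)
    for _ in range(generations):
        noise = rng.standard_normal((pop_size, n_params))
        scores = np.array([fitness(theta + sigma * eps) for eps in noise])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)   # normalise fitness
        theta += lr / (pop_size * sigma) * noise.T @ scores          # fitness-weighted update
    return theta
```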
Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing
Abstract
arXiv:2505.02811v1 Announce Type: cross Abstract: Retrieval Augmented Generation (RAG) has shown strong capability in enhancing language models' knowledge and reducing AI generative hallucinations, driving its widespread use. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic without a good sense of self-skepticism. Current multi-round RAG systems may continue searching even when enough information has already been retrieved, or they may provide incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process supervision data or lead to subpar performance. This paper aims to address these limitations by introducing a new framework, SIM-RAG, to explicitly enhance RAG systems' self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled as successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework is system-efficient, adding a lightweight component to RAG without requiring modifications to existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.
摘要
检索增强生成(RAG)技术在提升语言模型知识储备、减少AI生成幻觉方面展现出强大能力,因而获得广泛应用。然而,需要多轮检索的复杂任务仍具挑战性,早期尝试往往因缺乏自我质疑意识而过于乐观。当前多轮RAG系统可能在已获取足够信息时仍持续搜索,或在信息不足时提供错误答案。现有解决方案要么需要大量昂贵的人工标注流程监督数据,要么导致性能欠佳。
本文提出新框架SIM-RAG,旨在通过显式增强RAG系统的自我认知和多轮检索能力来解决这些局限。为训练SIM-RAG,我们首先让RAG系统自主进行多轮检索实践,通过添加中间内心独白式推理步骤来扩展现有问答对,从而生成合成训练数据。对于每对问答,系统可能探索多条检索路径——成功抵达正确答案的路径被标记为成功,反之为失败。利用这些数据,我们训练了一个轻量级信息充分性评判器(Critic)。在推理阶段,该评判器通过上下文强化学习评估RAG系统每轮是否已检索到充分信息,从而指导检索决策并提升系统级自我认知。
在多个知名RAG基准测试上的实验表明,SIM-RAG是一种有效的多轮RAG解决方案。该框架具有系统高效性——仅需为RAG添加轻量级组件而无需修改现有大语言模型或搜索引擎,同时具备数据高效性——无需昂贵的人工标注中间步骤检索流程监督数据。
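At inference time the framework reduces to a retrieval loop gated by the sufficiency Critic, roughly as sketched below. The callables, prompt formats, and round budget are illustrative assumptions; only the control flow (retrieve until the Critic says the evidence suffices, then answer) reflects the abstract.

```python
from typing import Callable

def critic_gated_rag(question: str,
                     retrieve: Callable[[str], str],
                     propose_query: Callable[[str, list[str]], str],
                     sufficient: Callable[[str, list[str]], bool],
                     answer: Callable[[str, list[str]], str],
                     max_rounds: int = 5) -> str:
    """Multi-round RAG in which a lightweight critic decides when to stop searching.

    `sufficient(question, evidence)` is the trained information-sufficiency Critic;
    the other callables wrap the generator and the search engine.
    """
    evidence: list[str] = []
    for _ in range(max_rounds):
        if sufficient(question, evidence):      # enough information -> stop retrieving
            break
        query = propose_query(question, evidence)
        evidence.append(retrieve(query))
    return answer(question, evidence)
```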
AutoLibra: Agent Metric Induction from Open-Ended Feedback
Abstract
arXiv:2505.02820v1 Announce Type: new Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation, that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again", or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement: First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
摘要
智能体主要通过任务成功率指标进行评估和优化,这类指标存在粒度粗糙、依赖专家人工设计且无法奖励中间涌现行为的问题。我们提出AutoLibra评估框架,能将开放式人类反馈(如"发现按钮禁用时不应重复点击"或"该智能体自主决策权过高")转化为细粒度行为评估指标。该框架通过将反馈锚定至智能体行为、聚类相似正负行为,并创建具有明确定义和具体实例的评估指标(可用于提示LLM-as-a-Judge评估器)来实现这一目标。我们进一步提出两个元指标来评估(诱导)指标集与开放反馈的匹配度:"覆盖率"和"冗余度"。通过优化这些元指标,实验证明AutoLibra能比现有评估基准产生更具体的智能体评估指标,并发现新的分析维度。我们还展示了AutoLibra在智能体改进中的两项应用:首先,在多种文本游戏任务中,AutoLibra诱导的指标作为提示工程目标优于任务成功率,使智能体性能平均提升20%;其次,该框架能迭代筛选网页导航智能体的高质量微调数据。结果表明AutoLibra是评估和改进语言智能体的强大任务无关工具。
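One plausible way to operationalise the two meta-metrics is shown below: coverage as the share of feedback items matched by at least one induced metric, and redundancy as the mean pairwise overlap of the feedback each metric explains. These definitions and the `matches` predicate (in practice an LLM-as-a-Judge call) are illustrative assumptions, not the paper's exact formulas.

```python
from itertools import combinations
from typing import Callable

def coverage_and_redundancy(feedback: list[str],
                            metrics: list[str],
                            matches: Callable[[str, str], bool]) -> tuple[float, float]:
    """Schematic meta-metrics over a set of induced agent-evaluation metrics."""
    covered_by = {m: {f for f in feedback if matches(f, m)} for m in metrics}
    coverage = sum(any(f in covered_by[m] for m in metrics) for f in feedback) / max(len(feedback), 1)
    overlaps = [len(covered_by[a] & covered_by[b]) / max(len(covered_by[a] | covered_by[b]), 1)
                for a, b in combinations(metrics, 2)]      # Jaccard overlap per metric pair
    redundancy = sum(overlaps) / max(len(overlaps), 1)
    return coverage, redundancy
```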
Building Scalable AI-Powered Applications with Cloud Databases: Architectures, Best Practices and Performance Considerations
Abstract
arXiv:2504.18793v1 Announce Type: cross Abstract: The rapid adoption of AI-powered applications demands high-performance, scalable, and efficient cloud database solutions, as traditional architectures often struggle with AI-driven workloads requiring real-time data access, vector search, and low-latency queries. This paper explores how cloud-native databases enable AI-driven applications by leveraging purpose-built technologies such as vector databases (pgvector), graph databases (AWS Neptune), NoSQL stores (Amazon DocumentDB, DynamoDB), and relational cloud databases (Aurora MySQL and PostgreSQL). It presents architectural patterns for integrating AI workloads with cloud databases, including Retrieval-Augmented Generation (RAG) [1] with LLMs, real-time data pipelines, AI-driven query optimization, and embeddings-based search. Performance benchmarks, scalability considerations, and cost-efficient strategies are evaluated to guide the design of AI-enabled applications. Real-world case studies from industries such as healthcare, finance, and customer experience illustrate how enterprises utilize cloud databases to enhance AI capabilities while ensuring security, governance, and compliance with enterprise and regulatory standards. By providing a comprehensive analysis of AI and cloud database integration, this paper serves as a practical guide for researchers, architects, and enterprises to build next-generation AI applications that optimize performance, scalability, and cost efficiency in cloud environments.
摘要
人工智能应用的快速普及对高性能、可扩展且高效的云数据库解决方案提出了迫切需求,传统架构往往难以应对需要实时数据访问、向量搜索和低延迟查询的AI驱动型工作负载。本文探讨云原生数据库如何通过专用技术栈(包括向量数据库pgvector、图数据库AWS Neptune、NoSQL存储Amazon DocumentDB/DynamoDB以及关系型云数据库Aurora MySQL/PostgreSQL)赋能AI驱动型应用,提出AI工作负载与云数据库集成的架构模式,涵盖与大语言模型结合的检索增强生成技术(RAG)[1]、实时数据管道、AI驱动的查询优化及基于嵌入向量的搜索。通过性能基准测试、可扩展性评估和成本优化策略分析,为AI应用设计提供指导。来自医疗、金融和客户体验等行业的实际案例表明,企业如何利用云数据库在确保安全性、治理能力及符合企业/监管标准的前提下提升AI能力。本文通过对AI与云数据库融合的全面分析,为研究人员、架构师和企业构建新一代AI应用提供实践指南,助力实现云环境中性能、可扩展性与成本效益的最优平衡。
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Abstract
arXiv:2505.01456v1 Announce Type: cross Abstract: LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
摘要
基于海量数据训练的LLM可能无意中习得敏感信息(如个人详情和潜在有害内容)。多模态LLM由于整合了图像与文本等多模态信息,这一风险进一步加剧。攻击者可通过多模态提示利用此类知识提取敏感细节。评估MLLM针对性遗忘此类信息(定向反学习)的效果,需要创建高质量、标注完善的图文对。尽管现有反学习研究集中于文本领域,多模态反学习仍待探索。为此,我们首先提出多模态反学习基准UnLOK-VQA(反学习外部知识视觉问答),以及用于评估从MLLM删除特定多模态知识方法的攻防框架。我们采用自动化流程扩展视觉问答数据集,生成不同近似度的样本来测试泛化性与特异性,并通过人工过滤保持高质量。随后针对七种攻击方式(四种白盒、三种黑盒,包括利用隐藏状态可解释性的新型白盒方法)评估六种防御目标。结果表明:多模态攻击效果优于纯文本或图像攻击;最有效防御方案是从模型内部状态移除答案信息。此外,更大模型展现出更强的编辑后鲁棒性,表明模型规模可提升安全性。UnLOK-VQA为推进MLLM反学习研究提供了严谨基准。
MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling
Abstract
arXiv:2505.01459v1 Announce Type: cross Abstract: This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.
摘要
本文提出MoxE——一种将扩展长短期记忆网络(xLSTM)与专家混合(MoE)框架协同整合的新型架构,旨在解决大语言模型(LLM)的可扩展性与效率关键挑战。该方法在有效利用xLSTM创新记忆结构的同时,通过MoE策略性引入稀疏性以显著降低计算开销。其核心是设计了一种基于熵的动态路由机制,可将标记智能分配至专用专家模块,从而确保资源的高效均衡利用。这种熵感知能力使架构能同时优化处理稀有与常见标记,其中mLSTM模块被优先用于处理稀有标记。为进一步增强泛化能力,我们引入包含基于熵的损失函数和分组平衡损失在内的辅助损失组合,以保障模型鲁棒性与训练效率。理论分析与实证评估充分表明,相比现有方法,MoxE在实现显著效率提升的同时具有更优的效能,标志着可扩展LLM架构的重要进展。
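The routing idea can be sketched as a top-k router whose logits are shifted toward the mLSTM experts when the routing distribution is high-entropy, which is typical of rare tokens. This is one possible reading of the mechanism, written with assumed shapes and an assumed bias rule, not the paper's exact formulation.

```python
import numpy as np

def entropy_aware_route(router_logits: np.ndarray, top_k: int = 2, alpha: float = 1.0):
    """Schematic entropy-aware top-k routing for a mixture of xLSTM experts.

    `router_logits` has shape (tokens, experts); lower expert indices are assumed
    to be mLSTM blocks. High-entropy (uncertain) tokens get biased toward them.
    """
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1, keepdims=True)   # (tokens, 1)
    norm_entropy = entropy / np.log(router_logits.shape[-1])                 # scaled to [0, 1]

    bias = np.zeros_like(router_logits)
    n_mlstm = router_logits.shape[-1] // 2
    bias[:, :n_mlstm] = alpha * norm_entropy          # favour mLSTM experts for uncertain tokens
    top_experts = np.argsort(-(router_logits + bias), axis=-1)[:, :top_k]
    return top_experts, norm_entropy.squeeze(-1)
```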
BiGSCoder: State Space Model for Code Understanding
Abstract
arXiv:2505.01475v1 Announce Type: cross Abstract: We present BiGSCoder, a novel encoder-only bidirectional state-space model (SSM) featuring a gated architecture, pre-trained for code understanding on a code dataset using masked language modeling. Our work aims to systematically evaluate SSMs' capabilities in coding tasks compared to traditional transformer architectures; BiGSCoder is built for this purpose. Through comprehensive experiments across diverse pre-training configurations and code understanding benchmarks, we demonstrate that BiGSCoder outperforms transformer-based models, despite utilizing simpler pre-training strategies and much less training data. Our results indicate that BiGSCoder can serve as a more sample-efficient alternative to conventional transformer models. Furthermore, our study shows that SSMs perform better without positional embeddings and can effectively extrapolate to longer sequences during fine-tuning.
摘要
我们提出BiGSCoder——一种新型仅编码器的双向状态空间模型(SSM),其采用门控架构,通过掩码语言建模在代码数据集上进行预训练以支持代码理解。本研究旨在系统评估SSMs在编码任务中相对于传统Transformer架构的性能优势,为此专门构建了BiGSCoder。通过在不同预训练配置和代码理解基准测试中的全面实验,我们证明尽管采用更简单的预训练策略和少得多的训练数据,BiGSCoder仍能超越基于Transformer的模型。结果表明,BiGSCoder可作为传统Transformer模型更具样本效率的替代方案。此外,研究发现SSMs在没有位置嵌入时表现更优,且能在微调阶段有效外推至更长序列。
Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation
Abstract
arXiv:2505.01523v1 Announce Type: cross Abstract: We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains like the mathematical domain by employing a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The final goal is to achieve near-full dataset performance with meticulously selected data points from the entire dataset while significantly reducing computational cost and training time and achieving competitive performance as the full dataset. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.
摘要
我们提出一种改进方法,通过采用预算约束的子集选择策略,在数学等特定领域高效微调大语言模型(LLMs)。该方法结合效用性与多样性指标,筛选最具信息量和代表性的训练样本。最终目标是通过从全量数据集中精选数据点,在显著降低计算成本和训练时间的同时,实现接近全数据集性能的竞争性表现。效用性指标综合了困惑度和思维链(CoT)损失,以识别对模型学习贡献最大的挑战性样本;而多样性指标则确保覆盖数学各子领域的广泛性。我们在LLaMA-3 8B和Phi-3模型上评估该方法,并与随机选择、基于多样性的采样及现有先进子集选择技术等基线方案进行对比。
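A budgeted utility-diversity trade-off of this kind is commonly implemented as a greedy selection loop; the sketch below adds, at each step, the example with the best utility score minus its similarity to what is already selected. The scoring combination, the weighting `lam`, and the cosine-similarity diversity term are illustrative assumptions rather than the paper's exact objective.

```python
import numpy as np

def select_subset(utility: np.ndarray, embeddings: np.ndarray,
                  budget: int, lam: float = 0.5) -> list[int]:
    """Greedy budgeted selection trading off utility against diversity.

    `utility[i]` could combine perplexity and chain-of-thought loss for example i;
    `embeddings[i]` is a unit-normalised feature vector used for the diversity term.
    """
    n = len(utility)
    selected: list[int] = []
    max_sim = np.zeros(n)                     # similarity to the closest already-selected example
    for _ in range(min(budget, n)):
        gains = (1 - lam) * utility - lam * max_sim
        gains[selected] = -np.inf             # never pick the same example twice
        i = int(np.argmax(gains))
        selected.append(i)
        max_sim = np.maximum(max_sim, embeddings @ embeddings[i])
    return selected
```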
Emotions in the Loop: A Survey of Affective Computing for Emotional Support
Abstract
arXiv:2505.01542v1 Announce Type: cross Abstract: In a world where technology is increasingly embedded in our everyday experiences, systems that sense and respond to human emotions are elevating digital interaction. At the intersection of artificial intelligence and human-computer interaction, affective computing is emerging with innovative solutions where machines are humanized by enabling them to process and respond to user emotions. This survey paper explores recent research contributions in affective computing applications in the area of emotion recognition, sentiment analysis and personality assignment developed using approaches like large language models (LLMs), multimodal techniques, and personalized AI systems. We analyze the key contributions and innovative methodologies applied by the selected research papers by categorizing them into four domains: AI chatbot applications, multimodal input systems, mental health and therapy applications, and affective computing for safety applications. We then highlight the technological strengths as well as the research gaps and challenges related to these studies. Furthermore, the paper examines the datasets used in each study, highlighting how modality, scale, and diversity impact the development and performance of affective models. Finally, the survey outlines ethical considerations and proposes future directions to develop applications that are more safe, empathetic and practical.
摘要
在技术日益融入日常体验的世界中,能够感知并响应人类情感的系统正在提升数字交互体验。作为人工智能与人机交互的交叉领域,情感计算通过使机器具备处理和响应用户情绪的能力,正以创新解决方案推动机器的人性化发展。本综述论文系统探究了情感计算在情绪识别、情感分析和性格推断等应用领域的最新研究成果,这些研究主要采用大语言模型(LLMs)、多模态技术和个性化AI系统等方法。我们通过将选定研究论文归类至四大应用领域——AI聊天机器人应用、多模态输入系统、心理健康治疗应用以及安全领域的情感计算,深入分析了其核心贡献与创新方法论。研究同时揭示了相关技术的优势以及存在的科研缺口与挑战。此外,本文详细考察了各研究采用的数据集,阐明了数据模态、规模及多样性对情感模型开发与性能的影响。最后,综述提出了伦理考量,并规划了未来发展方向,以推动构建更安全、更具同理心且实用的情感计算应用。
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
Abstract
arXiv:2505.01592v1 Announce Type: cross Abstract: The growing capabilities of large language models (LLMs) in instruction-following and context-understanding lead to the era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance based on task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process and not only the end result. To address this gap, we propose PIPA, a unified evaluation protocol that conceptualizes the behavioral process of interactive task planning agents within a partially observable Markov Decision Process (POMDP) paradigm. The proposed protocol offers a comprehensive assessment of agent performance through a set of atomic evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within the agent's decision-making pipeline. Our analyses show that agents excel in different behavioral stages, with user satisfaction shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
摘要
大型语言模型(LLMs)在指令遵循和上下文理解方面日益增强的能力,推动着智能代理时代的到来,并催生了众多应用场景。其中,任务规划代理在现实场景中尤为突出,这些场景通常涉及复杂的内部流程,如上下文理解、工具管理和响应生成。然而,现有基准测试主要基于任务完成度作为整体效能的代理指标进行评估。我们提出假设:仅提高任务完成度与最大化用户满意度并不一致,因为用户是与整个代理流程互动,而非仅关注最终结果。为填补这一空白,我们提出PIPA评估框架——该协议将交互式任务规划代理的行为过程概念化为部分可观测马尔可夫决策过程(POMDP)范式。通过一组原子化评估标准,该框架可对代理性能进行全面评估,使研究者和实践者能够诊断代理决策流程中的具体优势与缺陷。分析表明,不同代理在行为阶段各有所长,而用户满意度同时受结果和中间行为影响。我们还展望了未来方向,包括利用多代理系统的解决方案,并指出任务规划中用户模拟器的局限性。
Always Tell Me The Odds: Fine-grained Conditional Probability Estimation
Abstract
arXiv:2505.01595v1 Announce Type: cross Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.
摘要
我们提出了一种最先进的细粒度概率估计模型,用于在给定上下文条件下对命题进行概率评估。尽管大语言模型(LLMs)在推理能力方面取得了显著进展,特别是在信息完整的明确定义任务上表现优异,但其在不确定或部分信息条件下进行准确且校准良好的概率预测仍存在困难。虽然将不确定性纳入模型预测通常能提升性能,但如何获得可靠的不确定性估计仍未得到充分研究。具体而言,LLM的概率估计往往较为粗糙,且倾向于更常见的数值。通过结合人工与合成数据创建与评估、扩展至更大规模模型以及改进监督方法,我们提出了一组强大而精确的概率估计模型。我们在依赖条件概率估计的各项任务中进行了系统评估,结果表明:相较于现有基于微调和提示的方法,我们的方法始终以显著优势优于它们。
Don't be lazy: CompleteP enables compute-efficient deep transformers
Abstract
arXiv:2505.01618v1 Announce Type: cross Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the unique parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.
摘要
我们研究了使用不同参数化方法(即随模型规模变化调整模型和优化器超参数的规则)时大语言模型训练的计算效率。某些参数化方法无法在模型深度变化时传递最优基础超参数(如学习率),迫使实践者要么在扩大规模时重新调整这些超参数(成本高昂),要么在无法重新调整时接受次优训练。即使实现了超参数传递,我们通过理论分析发现参数化方法仍可能处于惰性学习状态——各层仅学习接近其线性化的特征,从而无法有效利用深度和非线性。最终,我们确定并采用了一种称为CompleteP的独特参数化方法,该方法在所有网络层中同时实现了深度维度的超参数传递和非惰性学习。CompleteP使更广泛的模型宽度/深度比例保持计算高效,解锁了更适合不同硬件设置和操作环境的模型架构。此外,与现有最优方法相比,CompleteP实现了12-34%的计算效率提升。
A Domain Adaptation of Large Language Models for Classifying Mechanical Assembly Components
Abstract
arXiv:2505.01627v1 Announce Type: cross Abstract: The conceptual design phase represents a critical early stage in the product development process, where designers generate potential solutions that meet predefined design specifications based on functional requirements. Functional modeling, a foundational aspect of this phase, enables designers to reason about product functions before specific structural details are determined. A widely adopted approach to functional modeling is the Function-Behavior-Structure (FBS) framework, which supports the transformation of functional intent into behavioral and structural descriptions. However, the effectiveness of function-based design is often hindered by the lack of well-structured and comprehensive functional data. This scarcity can negatively impact early design decision-making and hinder the development of accurate behavioral models. Recent advances in Large Language Models (LLMs), such as those based on GPT architectures, offer a promising avenue to address this gap. LLMs have demonstrated significant capabilities in language understanding and natural language processing (NLP), making them suitable for automated classification tasks. This study proposes a novel LLM-based domain adaptation (DA) framework using fine-tuning for the automated classification of mechanical assembly parts' functions. By fine-tuning LLMs on domain-specific datasets, the traditionally manual and subjective process of function annotation can be improved in both accuracy and consistency. A case study demonstrates fine-tuning GPT-3.5 Turbo on data from the Oregon State Design Repository (OSDR), and evaluation on the A Big CAD (ABC) dataset shows that the domain-adapted LLM can generate high-quality functional data, enhancing the semantic representation of mechanical parts and supporting more effective design exploration in early-phase engineering.
摘要
概念设计阶段是产品开发过程中关键的早期阶段,设计师在此阶段根据功能需求生成符合预定设计规范的潜在解决方案。功能建模作为该阶段的基础环节,使设计师能够在确定具体结构细节前对产品功能进行推理论证。功能-行为-结构(FBS)框架是广泛采用的功能建模方法,支持将功能意图转化为行为与结构描述。然而,功能化设计的有效性常因缺乏结构良好且全面的功能数据而受限,这种数据匮乏会对早期设计决策产生负面影响,并阻碍精确行为模型的建立。基于GPT架构的大语言模型(LLMs)的最新进展为解决这一问题提供了新途径,其在语言理解与自然语言处理(NLP)方面展现的卓越能力,使其特别适用于自动化分类任务。本研究提出一种基于LLM的领域自适应(DA)新框架,通过微调实现机械装配零件功能的自动分类。在特定领域数据集上对LLM进行微调,可显著提升功能标注这一传统人工主观过程的准确性与一致性。案例研究展示了基于俄勒冈州立大学设计资源库(OSDR)数据对GPT-3.5 Turbo的微调过程,在ABC数据集上的评估表明,经领域自适应的大语言模型能生成高质量功能数据,从而增强机械零件的语义表征能力,为工程早期阶段更有效的设计探索提供支持。
RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
Abstract
arXiv:2505.01709v1 Announce Type: cross Abstract: Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.
摘要
在开放场景中操作机器人执行多样化任务是机器人技术的重要研究和应用方向。尽管自然语言处理和大规模多模态模型的进展提升了机器人理解复杂指令的能力,但开放环境下的机器人操作仍面临程序性技能困境与陈述性技能困境。现有方法往往需要折中认知与执行能力。针对这些挑战,本文提出RoBridge——一种通用机器人操作的分层智能架构,其由基于大规模预训练视觉语言模型(VLM)的高层认知规划器(HCP)、作为符号桥梁的不变可操作表征(IOR),以及通用具身智能体(GEA)构成。RoBridge既保持了VLM的陈述性技能,又释放了强化学习的程序性技能,有效弥合了认知与执行的鸿沟。实验表明,RoBridge相较现有基线模型取得显著性能提升,在新任务上达到75%成功率,在模拟到现实的泛化中仅需每任务5个真实世界数据样本即实现83%平均成功率。该工作标志着机器人系统认知推理与物理执行融合的重要进展,为通用机器人操作提供了新范式。
Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models
Abstract
arXiv:2505.01731v1 Announce Type: cross Abstract: Pruning large language models (LLMs) is a promising solution for reducing model sizes and computational complexity while preserving performance. Traditional layer-wise pruning methods often adopt a uniform sparsity approach across all layers, which leads to suboptimal performance due to the varying significance of individual transformer layers within the model not being accounted for. To this end, we propose the Shapley Value-based Non-Uniform Pruning (SV-NUP) method for LLMs. This approach quantifies the contribution of each transformer layer to the overall model performance, enabling the assignment of tailored pruning budgets to different layers to retain critical parameters. To further improve efficiency, we design the Sliding Window-based Shapley Value approximation method. It substantially reduces computational overhead compared to exact SV calculation methods. Extensive experiments on various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness of the proposed approach. The results reveal that non-uniform pruning significantly enhances the performance of pruned models. Notably, SV-NUP achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
摘要
大语言模型(LLM)剪枝是一种在保持性能的同时减小模型规模和计算复杂度的有效方法。传统逐层剪枝方法通常对所有层采用统一的稀疏度策略,由于未考虑模型中各Transformer层的重要性差异,往往导致次优性能。为此,我们提出基于沙普利值的非均匀剪枝方法(SV-NUP)。该方法量化每个Transformer层对整体模型性能的贡献度,从而为不同层分配定制化的剪枝预算以保留关键参数。为提升效率,我们进一步设计了基于滑动窗口的沙普利值近似计算方法,相比精确计算显著降低了计算开销。在LLaMA-v1、LLaMA-v2和OPT等多种大语言模型上的实验表明,该方法能有效提升剪枝后模型的性能。值得注意的是,在70%稀疏度下,相比SparseGPT方法,SV-NUP使LLaMA-7B和LLaMA-13B的困惑度(PPL)分别降低了18.01%和19.55%。
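The per-layer contribution scores behind such a non-uniform budget can be estimated with Monte Carlo Shapley sampling; the sketch below additionally restricts each layer's sampled coalitions to a window of neighbouring layers as a rough stand-in for the paper's sliding-window approximation. The `perf` callable, window size, and sample count are assumptions, and this is not the authors' exact algorithm.

```python
import random
from typing import Callable

def shapley_layer_scores(n_layers: int,
                         perf: Callable[[set[int]], float],
                         window: int = 4,
                         samples: int = 20,
                         seed: int = 0) -> list[float]:
    """Monte Carlo Shapley-style contribution scores for transformer layers.

    `perf(active)` evaluates the model (e.g. negative perplexity on calibration
    data) with only the layers in `active` kept. Layers outside a layer's window
    are always kept active, so only the local coalition is permuted.
    """
    rng = random.Random(seed)
    scores = [0.0] * n_layers
    for i in range(n_layers):
        lo, hi = max(0, i - window // 2), min(n_layers, i + window // 2 + 1)
        local = [j for j in range(lo, hi) if j != i]
        outside = set(range(n_layers)) - set(range(lo, hi))
        for _ in range(samples):
            rng.shuffle(local)
            k = rng.randint(0, len(local))            # window layers assumed to precede layer i
            base = outside | set(local[:k])
            scores[i] += (perf(base | {i}) - perf(base)) / samples
    return scores
```

Layers with higher estimated contributions would then be assigned smaller pruning budgets, and vice versa.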
An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding
Abstract
arXiv:2505.01743v1 Announce Type: cross Abstract: The rapid advancements in Large Vision Language Models (LVLMs) offer the potential to surpass conventional labeling by generating richer, more detailed descriptions of on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing large vision language model (LVLM) approaches are unable to understand low-resolution data well as they are primarily designed for high-resolution data, such as RGB images. A quick fixing approach is to caption a large amount of low-resolution data, but it requires a significant amount of labor-intensive annotation efforts. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with raw data to effectively fine-tune LVLM models for understanding low-resolution videos in HBU. First, we propose a Contrastive-Oriented Data Labeler, which can capture behavior-relevant information from long, low-resolution videos and generate high-quality pseudo labels for unlabeled data via contrastive learning. Second, we propose a Physical-Knowledge Guided Captioner, which utilizes spatial and temporal consistency checks to mitigate errors in pseudo labels. Therefore, it can improve LLMs' understanding of sequential data and then generate high-quality video captions. Finally, to ensure on-device deployability, we employ LoRA-based efficient fine-tuning to adapt LVLMs for low-resolution data. We evaluate Llambda using a region-scale real-world testbed and three distinct low-resolution datasets, and the experiments show that Llambda outperforms several state-of-the-art LVLM systems by up to 40.03% on average Bert-Score.
摘要
大型视觉语言模型(LVLM)的快速发展为超越传统标注方法提供了可能,能够为低分辨率视觉系统(如深度、热成像和红外)中的设备端人类行为理解(HBU)生成更丰富、更细致的描述。然而,现有的大型视觉语言模型主要针对高分辨率数据(如RGB图像)设计,难以有效理解低分辨率数据。一种快速解决方案是对大量低分辨率数据进行标注,但这需要耗费大量人力密集型标注工作。本文提出了一种新型省力系统Llambda,旨在支持低分辨率HBU。其核心思想是利用有限标注数据和大量未标注数据引导大语言模型(LLM)生成信息丰富的描述文本,这些文本可与原始数据结合,有效微调LVLM模型以理解HBU中的低分辨率视频。首先,我们提出对比导向数据标注器,通过对比学习从长时低分辨率视频中捕获行为相关信息,并为未标注数据生成高质量伪标签。其次,我们提出物理知识引导的标注生成器,利用时空一致性检查来减少伪标签错误,从而提升LLM对序列数据的理解能力以生成高质量视频描述。最后,为确保设备端可部署性,我们采用基于LoRA的高效微调方法使LVLM适配低分辨率数据。通过在区域级真实测试平台和三个不同低分辨率数据集上的评估,实验表明Llambda在平均Bert-Score上最高优于现有最优LVLM系统达40.03%。
New News: System-2 Fine-tuning for Robust Integration of New Knowledge
Abstract
arXiv:2505.01812v1 Announce Type: cross Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce New News, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term System-2 Fine-tuning (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the contextual shadowing effect, where training with the news in context followed by its rephrases or QAs degrades learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.
摘要
人类与智能动物能够轻松内化新信息("新闻"),并准确提取其对执行下游任务的隐含影响。虽然大型语言模型(LLM)在新闻被明确作为上下文给出时,可以通过上下文学习(ICL)实现这一目标,但微调方法仍难以将学习成果巩固到模型权重中。本文提出New News数据集,该数据集包含跨多个领域(数学、编程、科学发现、排行榜、事件)的假设性但合理的新闻,并附有下游评估问题——这些问题的正确答案关键取决于对新闻的理解与内化。我们首先证明了在新闻数据集上,朴素微调与上下文学习之间存在显著差距(FT-ICL差距)。为弥补这一差距,我们探索了一套自博弈数据生成协议——包括转述、推衍和自问自答——旨在将模型在上下文中的知识蒸馏到无上下文情况下的模型权重中,该方法被我们称为"系统2微调"(Sys2-FT)。我们使用Qwen 2.5系列模型,系统评估了不同数据领域和模型规模下ICL与Sys2-FT的性能。结果表明,Sys2-FT的自问自答协议显著提升了模型对新闻的权重内学习能力。此外,我们发现了"语境遮蔽效应":当模型在上下文中学习新闻后,再接受其转述或问答训练时,会削弱对原始新闻的学习效果。最后,我们提供了Sys2-FT涌现出的规模定律的初步证据。
Intra-Layer Recurrence in Transformers for Language Modeling
Abstract
arXiv:2505.01855v1 Announce Type: cross Abstract: Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
摘要
Transformer模型在自然语言处理领域确立了新的性能基准,但其不断增加的深度导致参数量急剧增长。现有循环Transformer方法通过多次重处理层块来解决这一问题,但往往不加区分地对整个层块应用循环机制。本研究提出层内循环(ILR)这一更具针对性的方法,该技术在前向传播过程中选择性地对单个层应用循环处理。实验表明,将更多迭代次数分配给早期层能获得最优结果。这些发现证明,ILR为优化Transformer架构中的循环结构提供了有前景的研究方向。
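The mechanism is simple to show in code: within one forward pass, each layer is applied a fixed number of times before moving on, with earlier layers given more iterations in line with the paper's finding. The sketch below uses a standard PyTorch encoder layer; the sizes and the specific reuse map are illustrative choices, not the authors' configuration.

```python
import torch
from torch import nn

class ILRTransformer(nn.Module):
    """Toy encoder in which layer i is applied `reuse[i]` times per forward pass."""
    def __init__(self, d_model: int = 256, nhead: int = 4, reuse: tuple = (3, 2, 1, 1)):
        super().__init__()
        self.reuse = reuse
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True) for _ in reuse
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer, n_iter in zip(self.layers, self.reuse):
            for _ in range(n_iter):       # intra-layer recurrence: reuse the same weights
                x = layer(x)
        return x

# x = torch.randn(2, 16, 256); y = ILRTransformer()(x)   # output shape stays (2, 16, 256)
```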
PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications
Abstract
arXiv:2505.01881v1 Announce Type: cross Abstract: Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.
摘要
在多环境和多领域中实现鲁棒导航需要精确的状态估计和透明的决策过程。我们提出PhysNav-DG框架,该创新性方案将经典传感器融合与视觉语言模型的语义能力相结合。我们的双分支架构既能通过多传感器输入预测导航动作,又可同步生成详细的思维链解释。改进的自适应卡尔曼滤波器能根据环境上下文动态调整噪声参数,该框架整合了多种原始传感器数据流以及来自LLaMA 3.2 11B和BLIP-2等模型的语义洞察。为评估方法性能,我们构建了MD-NEX基准测试——这是一个新颖的多领域数据集,统一了包含真实动作标注和人工验证解释的室内导航、自动驾驶及社交导航任务。大量实验与消融研究表明,PhysNav-DG将导航成功率提升超过20%,在保持高效运行的同时,其生成的解释兼具高度可靠性与清晰性。本研究通过连接高层语义推理与几何规划,为构建更安全、更可信的自主系统提供了新途径。
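The adaptive filtering idea can be illustrated with a scalar Kalman filter whose measurement noise is inflated whenever the semantic branch flags a difficult scene, so the filter leans more on its prediction. The scaling rule and constants below are illustrative assumptions, not PhysNav-DG's actual parameters.

```python
import numpy as np

def adaptive_kalman_1d(zs, context_scale, q: float = 1e-3, r0: float = 0.5) -> np.ndarray:
    """Scalar Kalman filter whose measurement noise adapts to semantic context.

    `zs` are noisy position measurements; `context_scale[t]` >= 1 inflates the
    measurement noise when the VLM branch flags e.g. glare or clutter.
    """
    x, p = 0.0, 1.0                       # state estimate and its variance
    estimates = []
    for z, s in zip(zs, context_scale):
        p += q                            # predict step (random-walk motion model)
        r = r0 * s                        # context-dependent measurement noise
        k = p / (p + r)                   # Kalman gain
        x += k * (z - x)                  # update with the measurement
        p *= (1 - k)
        estimates.append(x)
    return np.array(estimates)

# est = adaptive_kalman_1d(zs=[1.0, 1.2, 0.9], context_scale=[1, 1, 4])
```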
LookAlike: Consistent Distractor Generation in Math MCQs
Abstract
arXiv:2505.01903v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.
摘要
大型语言模型(LLMs)正被越来越多地用于为选择题(MCQs)生成干扰项,尤其在数学教育等领域。然而,现有方法难以确保生成的干扰项与学生的常见错误保持一致。我们提出LookAlike方法,通过偏好优化提升错误-干扰项一致性。其主要创新点在于:(a)从模型不一致性中挖掘合成偏好对;(b)交替使用监督微调(SFT)和直接偏好优化(DPO)以稳定训练。与依赖启发式规则或人工标注偏好数据的现有工作不同,LookAlike利用自身生成的不一致性作为非偏好样本,从而实现可扩展且稳定的训练。在包含1400多道数学选择题的真实数据集上评估时,LookAlike在LLM作为评判者的测试中分别达到51.6%的干扰项生成准确率和57.2%的错误生成准确率,优于现有最优方法(45.6%/47.7%)。这些改进凸显了基于偏好的正则化与不一致性挖掘对于大规模生成一致性数学选择题干扰项的有效性。
Semantic Intelligence: Integrating GPT-4 with A* Planning in Low-Cost Robotics
Abstract
arXiv:2505.01931v1 Announce Type: cross Abstract: Classical robot navigation often relies on hardcoded state machines and purely geometric path planners, limiting a robot's ability to interpret high-level semantic instructions. In this paper, we first assess GPT-4's ability to act as a path planner compared to the A* algorithm, then present a hybrid planning framework that integrates GPT-4's semantic reasoning with A* on a low-cost robot platform operating on ROS2 Humble. Our approach eliminates explicit finite state machine (FSM) coding by using prompt-based GPT-4 reasoning to handle task logic while maintaining the accurate paths computed by A*. The GPT-4 module provides semantic understanding of instructions and environmental cues (e.g., recognizing toxic obstacles or crowded areas to avoid, or understanding low-battery situations requiring alternate route selection), and dynamically adjusts the robot's occupancy grid via obstacle buffering to enforce semantic constraints. We demonstrate multi-step reasoning for sequential tasks, such as first navigating to a resource goal and then reaching a final destination safely. Experiments on a Petoi Bittle robot with an overhead camera and Raspberry Pi Zero 2W compare classical A* against GPT-4-assisted planning. Results show that while A* is faster and more accurate for basic route generation and obstacle avoidance, the GPT-4-integrated system achieves high success rates (96-100%) on semantic tasks that are infeasible for pure geometric planners. This work highlights how affordable robots can exhibit intelligent, context-aware behaviors by leveraging large language model reasoning with minimal hardware and no fine-tuning.
摘要
传统机器人导航通常依赖于硬编码状态机和纯几何路径规划器,限制了机器人理解高层语义指令的能力。本文首先评估GPT-4作为路径规划器与A*算法的性能差异,随后提出一种混合规划框架,该框架在基于ROS2 Humble的低成本机器人平台上将GPT-4的语义推理能力与A*算法相结合。我们的方法通过基于提示的GPT-4推理处理任务逻辑,同时保留A*算法计算的精确路径,从而消除了显式有限状态机(FSM)编码需求。GPT-4模块提供对指令和环境线索的语义理解(例如识别需避开的毒性障碍物或拥挤区域,或理解需要选择替代路线的低电量情况),并通过障碍物缓冲动态调整机器人占据栅格以强化语义约束。我们展示了多步骤顺序任务的推理能力,例如先导航至资源目标再安全抵达最终目的地。在配备顶置摄像头和树莓派Zero 2W的Petoi Bittle机器人上进行的实验对比了传统A*与GPT-4辅助规划方案。结果表明:虽然A*在基本路径生成和避障方面速度更快、精度更高,但集成GPT-4的系统在纯几何规划器无法实现的语义任务上取得了96-100%的高成功率。本研究证明,通过结合大语言模型推理能力,低成本机器人仅凭极简硬件、无需微调即可展现出智能化的情境感知行为。
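In this hybrid design, the language model's semantic judgments ultimately act on the occupancy grid (buffering cells around flagged obstacles) before an ordinary A* search runs. The sketch below shows that split: a buffering helper plus a plain 4-connected A*; the grid encoding and buffer radius are illustrative assumptions, while the A* itself is the standard algorithm.

```python
import heapq
from itertools import count

def buffer_obstacles(grid, flagged, radius=1):
    """Block all cells within `radius` of semantically flagged obstacles.
    `grid` uses 0 = free, 1 = blocked; `flagged` holds (row, col) cells the LLM
    marked as e.g. toxic or crowded (illustrative roles)."""
    g = [row[:] for row in grid]
    for (r, c) in flagged:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(g) and 0 <= cc < len(g[0]):
                    g[rr][cc] = 1
    return g

def astar(grid, start, goal):
    """Plain 4-connected A* with a Manhattan heuristic over the buffered grid."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = count()
    open_set = [(h(start), next(tie), 0, start, None)]
    came, seen = {}, set()
    while open_set:
        _, _, g_cost, node, parent = heapq.heappop(open_set)
        if node in seen:
            continue
        seen.add(node)
        came[node] = parent
        if node == goal:                              # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = came[node]
            return path[::-1]
        r, c = node
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]) and grid[rr][cc] == 0 and (rr, cc) not in seen:
                heapq.heappush(open_set, (g_cost + 1 + h((rr, cc)), next(tie), g_cost + 1, (rr, cc), node))
    return None

# path = astar(buffer_obstacles(grid, flagged_cells), (0, 0), (5, 7))
```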
Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview
Abstract
arXiv:2505.01967v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become integral to daily life, widely adopted in communication, decision-making, and information retrieval, raising critical questions about how these systems implicitly form and express socio-cognitive attitudes or "worldviews". While existing research extensively addresses demographic and ethical biases, broader dimensions-such as attitudes toward authority, equality, autonomy, and fate-remain under-explored. In this paper, we introduce the Social Worldview Taxonomy (SWT), a structured framework grounded in Cultural Theory, operationalizing four canonical worldviews (Hierarchy, Egalitarianism, Individualism, Fatalism) into measurable sub-dimensions. Using SWT, we empirically identify distinct and interpretable cognitive profiles across 28 diverse LLMs. Further, inspired by Social Referencing Theory, we experimentally demonstrate that explicit social cues systematically shape these cognitive attitudes, revealing both general response patterns and nuanced model-specific variations. Our findings enhance the interpretability of LLMs by revealing implicit socio-cognitive biases and their responsiveness to social feedback, thus guiding the development of more transparent and socially responsible language technologies.
摘要
大型语言模型(LLMs)已深度融入日常生活,广泛应用于沟通交流、决策制定和信息检索领域,这引发了关于这些系统如何隐式形成并表达社会认知态度或"世界观"的关键问题。尽管现有研究广泛探讨了人口统计和伦理偏见,但更广泛的维度——如对权威、平等、自主性和命运的态度——仍未得到充分探索。本文基于文化理论提出"社会世界观分类法"(SWT),将四种典型世界观(等级主义、平等主义、个人主义、宿命论)操作化为可测量的子维度。通过SWT框架,我们在28个多样化LLMs中实证识别出具有区分度且可解释的认知特征。进一步受社会参照理论启发,实验证明显性社会线索能系统性塑造这些认知态度,既揭示了普遍响应模式,也呈现出细微的模型特异性差异。本研究通过揭示LLMs隐含的社会认知偏见及其对社会反馈的响应机制,增强了模型可解释性,为开发更透明且符合社会责任的语言技术提供了指导。
Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
Abstract
arXiv:2505.01997v1 Announce Type: cross Abstract: One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
摘要
大型语言模型(LLMs)成功的关键技术之一是偏好对齐。然而,偏好对齐的一个显著副作用是校准效果变差:尽管预训练模型通常校准良好,但在与人类偏好对齐后,LLMs往往变得校准不佳。本文研究了偏好对齐为何影响校准以及如何解决这一问题。针对第一个问题,我们观察到对齐过程中的偏好崩溃问题会不适当地泛化到校准场景,导致LLMs表现出过度自信和校准不良。为解决这一问题,我们证明了利用领域特定知识进行微调对缓解过度自信问题的重要性。为了进一步分析这是否影响模型性能,我们将模型分为两类:可校准和不可校准,其定义基于期望校准误差(ECE)的界限。在可校准范围内,我们提出了一种校准感知的微调方法,以在不影响LLMs性能的情况下实现适当的校准。然而,随着模型进一步微调以获得更好的性能,它们会进入不可校准范围。针对这种情况,我们开发了一种基于EM算法的ECE正则化方法,用于微调损失函数以保持低校准误差。大量实验验证了所提方法的有效性。
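For reference, the quantity being regularised is the standard Expected Calibration Error, computed over equal-width confidence bins as below. This is the usual definition; how the paper turns it into a differentiable, EM-based regulariser on the fine-tuning loss is not reproduced here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: the sample-weighted mean |accuracy - confidence| per bin.

    `confidences` are predicted probabilities for the chosen answers; `correct` is
    a 0/1 array indicating whether each answer was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # bin weight = fraction of samples in the bin
    return ece
```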
Testing Database Systems with Large Language Model Synthesized Fragments
Abstract
arXiv:2505.02012v1 Announce Type: cross Abstract: Various automated testing approaches have been proposed for Database Management Systems (DBMSs). Many such approaches generate pairs of equivalent queries to identify bugs that cause DBMSs to compute incorrect results, and have found hundreds of bugs in mature, widely used DBMSs. Most of these approaches are based on manually written SQL generators; however, their bug-finding capabilities remain constrained by the limited set of SQL features supported by the generators. In this work, we propose ShQveL, an approach that augments existing SQL test-case generators by leveraging Large Language Models (LLMs) to synthesize SQL fragments. Our key idea is to systematically incorporate SQL features gained through automated interactions with LLMs into the SQL generators, increasing the features covered while efficiently generating test cases. Specifically, ShQveL uses SQL sketches -- SQL statements with incomplete code segments that LLMs fill -- to integrate LLM-generated content into the generator. We evaluated ShQveL on 5 DBMSs and discovered 55 unique and previously unknown bugs, 50 of which were promptly fixed after our reports.
摘要
针对数据库管理系统(DBMS),已有多种自动化测试方法被提出。其中许多方法通过生成等价查询对来识别导致DBMS计算结果错误的缺陷,并在成熟且广泛使用的DBMS中发现了数百个错误。这些方法大多基于手动编写的SQL生成器,但其缺陷检测能力仍受限于生成器支持的有限SQL功能集。本研究提出ShQveL方法,通过利用大语言模型(LLM)合成SQL片段来增强现有SQL测试用例生成器。其核心思想是通过与LLM的自动化交互,系统性地将获取的SQL功能整合到SQL生成器中,从而在高效生成测试用例的同时扩大功能覆盖范围。具体而言,ShQveL采用SQL草图(包含由LLM填充的不完整代码段的SQL语句)将LLM生成内容集成至生成器。我们在5个DBMS上评估ShQveL,发现了55个独特且未知的缺陷,其中50个在报告后得到及时修复。
Wide & Deep Learning for Node Classification
Abstract
arXiv:2505.02020v1 Announce Type: cross Abstract: Wide & Deep, a simple yet effective learning architecture for recommendation systems developed by Google, has had a significant impact in both academia and industry due to its combination of the memorization ability of generalized linear models and the generalization ability of deep models. Graph convolutional networks (GCNs) remain dominant in node classification tasks; however, recent studies have highlighted issues such as heterophily and expressiveness, which focus on graph structure while seemingly neglecting the potential role of node features. In this paper, we propose a flexible framework GCNIII, which leverages the Wide & Deep architecture and incorporates three techniques: Intersect memory, Initial residual and Identity mapping. We provide comprehensive empirical evidence showing that GCNIII can more effectively balance the trade-off between over-fitting and over-generalization on various semi- and full- supervised tasks. Additionally, we explore the use of large language models (LLMs) for node feature engineering to enhance the performance of GCNIII in cross-domain node classification tasks. Our implementation is available at https://github.com/CYCUCAS/GCNIII.
摘要
由谷歌开发的推荐系统学习架构Wide & Deep,通过结合广义线性模型的记忆能力与深度模型的泛化能力,在学术界和工业界产生了重大影响。图卷积网络(GCNs)在节点分类任务中仍占据主导地位,但近期研究揭示了诸如异质性和表达能力等问题,这些问题关注图结构的同时似乎忽视了节点特征的潜在作用。本文提出了一种灵活框架GCNIII,该框架利用Wide & Deep架构,并融合了三种技术:交集记忆(Intersect memory)、初始残差(Initial residual)和恒等映射(Identity mapping)。我们提供了全面的实证证据,表明GCNIII在各种半监督和全监督任务中能更有效地平衡过拟合与过泛化之间的权衡。此外,我们探索了使用大语言模型(LLMs)进行节点特征工程,以提升GCNIII在跨领域节点分类任务中的性能。实现代码详见https://github.com/CYCUCAS/GCNIII。
What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction
Abstract
arXiv:2505.02072v1 Announce Type: cross Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.
摘要
近年来,语言建模的概念逐渐从有限长度字符串的概率分布演变为针对文本输入输出的通用预测模型(需经过适当的对齐阶段)。本文分析了大型语言模型(LLMs)中分布估计与响应预测的区别及其常存的目标冲突。我们考察了LLMs的训练阶段(包括预训练、上下文学习和偏好微调)及其输出概率的常见应用场景(包括补全概率和显式输出概率),论证不同设定会导致三种不同的预期输出分布。研究表明,自然语言处理领域常默认这些分布应具有相似性,从而导致对实验结果的误读。本研究为LLMs的分布解释奠定了更坚实的理论基础,将为LLMs诱导分布的解读与应用研究提供重要参考。
DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving
Abstract
arXiv:2505.02123v1 Announce Type: cross Abstract: We introduce DriveAgent, a novel multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion to enhance situational understanding and decision-making. DriveAgent uniquely integrates diverse sensor modalities (including camera, LiDAR, GPS, and IMU) with LLM-driven analytical processes structured across specialized agents. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments on challenging autonomous driving datasets demonstrate that DriveAgent achieves superior performance on multiple metrics compared with baseline methods. These results validate the efficacy of the proposed LLM-driven multi-agent sensor fusion framework, underscoring its potential to substantially enhance the robustness and reliability of autonomous driving systems.
摘要
我们提出DriveAgent,一种创新的多智能体自动驾驶框架,通过结合大型语言模型(LLM)推理与多模态传感器融合技术,显著提升环境理解与决策能力。该框架创新性地将摄像头、激光雷达、GPS和惯性测量单元(IMU)等异构传感器数据,与基于LLM的分布式分析流程相整合。系统采用模块化智能体架构,包含四个核心功能模块:(1)描述性分析智能体基于时间戳过滤识别关键传感器事件;(2)激光雷达与视觉智能体协同执行车辆级分析,评估周边车辆状态与运动轨迹;(3)环境推理与因果分析智能体解析场景变化及其内在机理;(4)具备紧急程度感知的决策生成智能体,负责优先级判定并及时输出操控建议。这种模块化设计使LLM能高效协调各专业感知推理智能体,为复杂自动驾驶场景提供可解释的连贯分析。在多个高难度自动驾驶数据集上的实验表明,DriveAgent在多项指标上显著超越基准方法。这些结果验证了所提出的LLM驱动多智能体传感器融合框架的有效性,凸显其对于提升自动驾驶系统鲁棒性与可靠性的重要价值。
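As a structural illustration only, the four-module pipeline described above can be pictured as a chain of agent functions. The stub inputs and outputs below are placeholders, not DriveAgent's actual interfaces.

```python
# Toy orchestration sketch (assumed structure, not the authors' code) of the
# four-module pipeline: descriptive analysis -> vehicle-level analysis ->
# environmental/causal reasoning -> urgency-aware decision generation.
def descriptive_agent(sensor_log):
    return [e for e in sensor_log if e["critical"]]         # keep key events

def vehicle_agents(events):
    return {"lidar": "no obstacle within 5 m", "vision": "lead vehicle braking"}

def reasoning_agent(vehicle_report):
    return "lead vehicle braking due to pedestrian crossing ahead"

def decision_agent(cause, urgency="high"):
    return {"maneuver": "decelerate and increase gap", "urgency": urgency}

sensor_log = [{"t": 0.1, "critical": True}, {"t": 0.2, "critical": False}]
events = descriptive_agent(sensor_log)
decision = decision_agent(reasoning_agent(vehicle_agents(events)))
print(decision)
```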
A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking
Abstract
arXiv:2505.02171v1 Announce Type: cross Abstract: Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG) by determining how source materials are segmented before indexing. Despite evidence that Large Language Models (LLMs) are sensitive to the layout and structure of retrieved data, there is currently no framework to analyze the impact of different chunking methods. In this paper, we introduce a novel methodology that defines essential characteristics of the chunking process at three levels: intrinsic passage properties, extrinsic passage properties, and passages-document coherence. We propose HOPE (Holistic Passage Evaluation), a domain-agnostic, automatic evaluation metric that quantifies and aggregates these characteristics. Our empirical evaluations across seven domains demonstrate that the HOPE metric correlates significantly (p > 0.13) with various RAG performance indicators, revealing contrasts between the importance of extrinsic and intrinsic properties of passages. Semantic independence between passages proves essential for system performance with a performance gain of up to 56.2% in factual correctness and 21.1% in answer correctness. On the contrary, traditional assumptions about maintaining concept unity within passages show minimal impact. These findings provide actionable insights for optimizing chunking strategies, thus improving RAG system design to produce more factually correct responses.
摘要
文档分块技术通过决定源材料在索引前的分割方式,从根本上影响着检索增强生成(RAG)系统的性能。尽管有证据表明大语言模型(LLMs)对检索数据的布局和结构具有敏感性,但目前缺乏分析不同分块方法影响的框架。本文提出一种创新方法论,从三个层面定义分块过程的核心特征:段落内在属性、段落外在属性以及段落-文档连贯性。我们开发了HOPE(整体段落评估)这一领域无关的自动评估指标,用于量化并整合这些特征。在七个领域的实证评估表明,HOPE指标与多种RAG性能指标呈现显著相关性(p > 0.13),揭示了段落外在属性与内在属性的重要性差异。实验证明段落间的语义独立性对系统性能至关重要,可使事实准确性提升高达56.2%,答案正确率提高21.1%。相反,传统关于保持段落内概念统一性的假设影响甚微。这些发现为优化分块策略提供了可操作的见解,从而改进RAG系统设计以生成更具事实准确性的响应。
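HOPE's exact sub-metrics are defined in the paper; the sketch below only illustrates the three-level structure, using a crude token-overlap proxy for semantic independence between adjacent passages and stubbed values for the intrinsic and coherence levels. All of this is assumed for illustration.

```python
# Illustrative sketch only: intrinsic properties, extrinsic properties (here a
# rough semantic-independence proxy), and passage-document coherence are
# quantified and averaged into a single chunking score.
def independence(p1: str, p2: str) -> float:
    a, b = set(p1.lower().split()), set(p2.lower().split())
    return 1.0 - len(a & b) / max(len(a | b), 1)          # 1.0 = fully independent

def hope_like_score(passages, intrinsic=1.0, coherence=1.0):
    pairs = zip(passages, passages[1:])
    extrinsic = sum(independence(a, b) for a, b in pairs) / max(len(passages) - 1, 1)
    return (intrinsic + extrinsic + coherence) / 3

chunks = ["The plant opened in 2020.", "It employs 300 people in Oslo."]
print(hope_like_score(chunks))
```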
SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation
Abstract
arXiv:2505.02235v1 Announce Type: cross Abstract: Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by decomposing summarization evaluation into atomic statements, enabling both high performance and explainability. SEval-Ex employs a two-stage pipeline: first extracting atomic statements from the source text and the summary using an LLM, then matching the generated statements. Unlike existing approaches that provide only summary-level scores, our method generates detailed evidence for its decisions through statement-level alignments. Experiments on the SummEval benchmark demonstrate that SEval-Ex achieves state-of-the-art performance, with a 0.580 correlation with human consistency judgments, surpassing GPT-4-based evaluators (0.521) while maintaining interpretability. Finally, our framework shows robustness against hallucination.
摘要
文本摘要质量评估仍是自然语言处理领域的关键挑战。现有方法面临性能与可解释性之间的权衡问题。本文提出SEval-Ex框架,通过将摘要评估分解为原子陈述来弥合这一鸿沟,实现高性能与可解释性的统一。该框架采用两阶段流程:首先利用大语言模型从原文和摘要中提取原子陈述,随后进行生成陈述的匹配。与仅提供摘要级评分的现有方法不同,我们的方法通过陈述级对齐为决策生成详细证据。在SummEval基准测试中,SEval-Ex以0.580的人类一致性判断相关性达到最先进性能,超越基于GPT-4的评估器(0.521),同时保持可解释性。最后,本框架对幻觉现象展现出强鲁棒性。
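A minimal sketch of the two-stage pipeline described above: `extract_statements` stands in for the LLM-based decomposition, and the exact-match alignment is a deliberate simplification of the paper's matching step.

```python
# Sketch of statement extraction + matching (simplified; not the SEval-Ex code).
def extract_statements(text: str) -> set[str]:
    """Stand-in for an LLM that decomposes text into atomic statements."""
    return {s.strip().lower() for s in text.split(".") if s.strip()}

def consistency_score(source: str, summary: str) -> float:
    src, summ = extract_statements(source), extract_statements(summary)
    supported = summ & src                        # statement-level alignment
    return len(supported) / max(len(summ), 1)     # share of supported statements

source = "The plant opened in 2020. It employs 300 people."
summary = "The plant opened in 2020. It employs 3000 people."
print(consistency_score(source, summary))         # 0.5: one unsupported statement
```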
Parameter-Efficient Transformer Embeddings
Abstract
arXiv:2505.02266v1 Announce Type: cross Abstract: Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
摘要
基于Transformer的自然语言处理模型中,嵌入层通常占据模型参数的最大比重,其规模随词汇表大小增长却未能带来相应的性能提升。本文提出一种创新方法:首先通过对归一化标记ID进行傅里叶展开来确定性生成标记嵌入向量,随后通过轻量级多层感知机(MLP)捕获高阶交互。我们在自然语言推理任务(SNLI和MNLI)上训练标准Transformer和本架构,并在句子文本相似度任务(STS-B)评估零样本性能。实验结果表明,所提方法以显著更少的参数量达到竞争性性能,训练速度更快,且无需dropout即可有效运行。这项概念验证研究揭示了构建可扩展、内存高效语言模型的潜力,并为基于本发现的大规模实验提供了研究动机。
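A small PyTorch sketch of the embedding scheme as described: a deterministic Fourier feature vector computed from the normalized token ID, followed by a lightweight MLP for higher-order interactions. The number of frequencies and the MLP shape are assumptions, not the paper's settings.

```python
# Sketch of deterministic Fourier token embeddings + MLP (assumed hyperparameters).
import torch
import torch.nn as nn

class FourierEmbedding(nn.Module):
    def __init__(self, vocab_size: int, n_freqs: int = 32, d_model: int = 128):
        super().__init__()
        self.vocab_size = vocab_size
        self.freqs = torch.arange(1, n_freqs + 1).float()      # fixed frequencies
        self.mlp = nn.Sequential(                               # learned part
            nn.Linear(2 * n_freqs, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = token_ids.float() / self.vocab_size                 # normalize IDs to [0, 1)
        angles = 2 * torch.pi * x.unsqueeze(-1) * self.freqs    # (..., n_freqs)
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(feats)                                  # (..., d_model)

emb = FourierEmbedding(vocab_size=30522)
print(emb(torch.tensor([[101, 2054, 102]])).shape)              # torch.Size([1, 3, 128])
```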
Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques
Abstract
arXiv:2505.02309v1 Announce Type: cross Abstract: Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.
摘要
大语言模型(LLMs)已经彻底改变了人工智能(AI)的许多领域,但其巨大的资源需求限制了它们在移动和边缘设备上的部署。本综述论文全面概述了压缩LLMs的技术,以实现在资源受限环境中的高效推理。我们研究了三种主要方法:知识蒸馏、模型量化和模型剪枝。针对每种技术,我们讨论了其基本原理,介绍了不同的变体,并提供了成功应用的示例。我们还简要讨论了混合专家和早期退出策略等补充技术。最后,我们强调了未来有前景的研究方向,旨在为研究人员和实践者提供一个有价值的资源,帮助他们优化LLMs以实现边缘部署。
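As a concrete, if minimal, instance of one technique the survey covers, the snippet below performs symmetric 8-bit post-training quantization of a single weight tensor and measures the reconstruction error. Real LLM quantization schemes are considerably more involved; this is only a toy example.

```python
# Toy symmetric int8 post-training quantization of one weight tensor.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0           # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())      # quantization error stays small
```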
Advancing Email Spam Detection: Leveraging Zero-Shot Learning and Large Language Models
Abstract
arXiv:2505.02362v1 Announce Type: cross Abstract: Email spam detection is a critical task in modern communication systems, essential for maintaining productivity, security, and user experience. Traditional machine learning and deep learning approaches, while effective in static settings, face significant limitations in adapting to evolving spam tactics, addressing class imbalance, and managing data scarcity. These challenges necessitate innovative approaches that reduce dependency on extensive labeled datasets and frequent retraining. This study investigates the effectiveness of Zero-Shot Learning using FLAN-T5, combined with advanced Natural Language Processing (NLP) techniques such as BERT, for email spam detection. By employing BERT to preprocess and extract critical information from email content, and FLAN-T5 to classify emails in a Zero-Shot framework, the proposed approach aims to address the limitations of traditional spam detection systems. The integration of FLAN-T5 and BERT enables robust spam detection without relying on extensive labeled datasets or frequent retraining, making it highly adaptable to unseen spam patterns and adversarial environments. This research highlights the potential of leveraging zero-shot learning and NLP techniques for scalable and efficient spam detection, providing insights into their capability to address the dynamic and challenging nature of spam detection tasks.
摘要
电子邮件垃圾邮件检测是现代通信系统中的关键任务,对维护生产力、安全性和用户体验至关重要。传统机器学习和深度学习方法虽然在静态环境中有效,但在适应不断演变的垃圾邮件策略、解决类别不平衡问题以及处理数据稀缺性方面存在显著局限性。这些挑战要求采用创新方法,以减少对大量标注数据集和频繁重新训练的依赖。本研究探讨了使用FLAN-T5的零样本学习结合先进自然语言处理(NLP)技术(如BERT)在电子邮件垃圾邮件检测中的有效性。通过利用BERT预处理和提取邮件内容中的关键信息,并采用FLAN-T5在零样本框架下对邮件进行分类,所提出的方法旨在解决传统垃圾邮件检测系统的局限性。FLAN-T5与BERT的结合实现了无需依赖大量标注数据或频繁重新训练的鲁棒垃圾邮件检测,使其对未见过的垃圾邮件模式和对抗性环境具有高度适应性。本研究凸显了利用零样本学习和NLP技术实现可扩展且高效垃圾邮件检测的潜力,为应对垃圾邮件检测任务的动态性和挑战性提供了新的见解。
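A rough sketch of the zero-shot classification step described above, using the Hugging Face `transformers` pipeline with FLAN-T5. The prompt wording, the model size, and the omission of the BERT preprocessing stage are simplifications, not the paper's exact setup; running it downloads the model.

```python
# Zero-shot spam labeling with FLAN-T5 (illustrative prompt, assumed setup).
from transformers import pipeline

classifier = pipeline("text2text-generation", model="google/flan-t5-base")

email = "Congratulations! You have won a free cruise. Click here to claim."
prompt = (
    "Classify the following email as 'spam' or 'not spam'.\n"
    f"Email: {email}\nAnswer:"
)
print(classifier(prompt, max_new_tokens=5)[0]["generated_text"])
```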
RM-R1: Reward Modeling as Reasoning
Abstract
arXiv:2505.02387v1 Announce Type: cross Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
摘要
奖励建模对于使大语言模型(LLM)与人类偏好对齐至关重要,尤其是通过基于人类反馈的强化学习(RLHF)。为了提供准确的奖励信号,奖励模型(RM)应在给出评分或判断前激发深度思考并进行可解释的推理。然而,现有RM要么生成不透明的标量分数,要么直接预测优选答案,导致其难以整合自然语言批评,因而缺乏可解释性。受长思维链(CoT)在推理密集型任务中的最新进展启发,我们提出假设并验证:将推理能力整合到奖励建模中可显著提升RM的可解释性和性能。本文提出了一类新的生成式奖励模型——推理奖励模型(ReasRM),将奖励建模构建为推理任务。我们设计了面向推理的训练流程,并训练了ReasRM系列模型RM-R1。训练包含两个关键阶段:(1)高质量推理链的蒸馏;(2)基于可验证奖励的强化学习。RM-R1通过自生成推理轨迹或对话专用评分标准,并据此评估候选响应,从而改进LLM输出。实验表明,我们的模型在多个综合奖励模型基准测试中达到或接近生成式RM的最先进性能,最高可超越大型开源模型(如Llama3.1-405B)和专有模型(如GPT-4o)达13.8%。除最终性能外,我们还进行了全面实证分析以理解成功训练ReasRM的关键要素。为促进未来研究,我们在https://github.com/RM-R1-UIUC/RM-R1发布了六个ReasRM模型及相关代码与数据。
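Schematically, a reasoning reward model judges a pair of candidates by first producing a rubric and a reasoning trace, then emitting a preference. The sketch below uses a hypothetical `llm` helper with a canned reply and is not the released RM-R1 code.

```python
# Sketch of rubric-then-judge reward modeling (hypothetical `llm` helper).
def llm(prompt: str) -> str:
    """Stand-in for a call to a reasoning reward model."""
    return "Rubric: factual accuracy, helpfulness.\nAnalysis: ...\nPreferred: A"

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Candidate A: {answer_a}\nCandidate B: {answer_b}\n"
        "First write an evaluation rubric for this question, then reason step by "
        "step against it, and finish with 'Preferred: A' or 'Preferred: B'."
    )
    reply = llm(prompt)
    return reply.strip().splitlines()[-1].removeprefix("Preferred: ")

print(judge("What causes tides?", "The Moon's gravity.", "Ocean temperature."))
```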
Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
Abstract
arXiv:2505.02390v1 Announce Type: cross Abstract: Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service is often busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique that helps reduce model memory consumption. However, it is unclear how DeepSeek-R1 and V3 will perform after being quantized. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation relative to FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach in most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
摘要
近期,本地化部署DeepSeek-R1和V3的需求激增,可能源于官方服务常处于繁忙状态且部分机构存在数据隐私顾虑。虽然单机部署具有基础设施简单的优势,但模型671B FP8参数配置超出了标准8-GPU机器的实际内存限制。量化作为一种广泛应用的技术,可有效降低模型内存占用。然而,目前尚不清楚DeepSeek-R1和V3量化后的性能表现。本技术报告首次对DeepSeek全系列模型进行了多比特宽度量化的系统性评估。关键发现表明:4比特量化在保持与FP8相近性能的同时,可实现标准NVIDIA GPU设备的单机部署。我们进一步提出DQ3_K_M动态3比特量化方法,其在多项基准测试中显著优于传统Q3_K_M变体,且在大多数任务中与4比特量化(Q4_K_M)方法性能相当。此外,DQ3_K_M同时支持NVIDIA H100/A100和华为910B的单机部署配置。DQ3_K_M的实现已发布于https://github.com/UnicomAI/DeepSeek-Eval,包含DeepSeek-R1和DeepSeek-V3的优化3比特量化变体。
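The single-machine claim above follows from simple weight-memory arithmetic, sketched below for a 671B-parameter model on an 8x80GB machine (weights only; activations, KV cache, and runtime overhead are ignored).

```python
# Back-of-the-envelope weight-memory check for 671B parameters at several bitwidths.
params = 671e9
gpu_mem_gb = 8 * 80                      # a standard 8x80GB machine

for name, bits in [("FP8", 8), ("4-bit (Q4_K_M)", 4), ("3-bit (DQ3_K_M)", 3)]:
    weight_gb = params * bits / 8 / 1e9
    print(f"{name:16s} ~{weight_gb:6.0f} GB   fits in 640 GB: {weight_gb < gpu_mem_gb}")
```

At FP8 the weights alone (~671 GB) already exceed the 640 GB of GPU memory, while 4-bit (~336 GB) and 3-bit (~252 GB) variants leave room for activations and cache.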
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Abstract
arXiv:2505.02391v1 Announce Type: cross Abstract: Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
摘要
大语言模型(LLMs)中的思维链(CoT)推理可形式化为一个潜在变量问题,即模型需要生成中间推理步骤。尽管迭代奖励排序微调(RAFT)等现有方法依赖此类形式化框架,但它们通常对所有提示采用统一的推理预算,未能考虑问题难度与收敛行为的差异性。本研究指出CoT训练的主要瓶颈在于静态采样策略导致的随机梯度估计效率低下。我们提出GVM-RAFT方法——一种针对特定提示的动态样本分配策略,旨在计算预算约束下最小化随机梯度方差。该方法通过监测提示接受率和随机梯度范数动态分配计算资源,确保所得梯度方差最小化。理论分析表明,所提出的动态采样策略在适当条件下可实现加速收敛保证。数学推理实验显示,GVM-RAFT相比原始RAFT实现了2-4倍加速,并带来显著准确率提升。该动态采样策略具有通用性,可整合至GRPO等其他强化学习算法,同样能改善收敛性和测试准确率。代码已开源:https://github.com/RLHFlow/GVM。
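The exact allocation rule is derived in the paper; the sketch below only conveys the general shape of a prompt-specific dynamic allocation, giving more samples to prompts with large gradient norms and low acceptance rates under a fixed budget. The scoring heuristic here is an assumption for illustration.

```python
# Illustrative dynamic sample allocation (heuristic score, not the paper's rule).
import numpy as np

def allocate_samples(grad_norms, accept_rates, total_budget, min_samples=1):
    scores = np.asarray(grad_norms) / np.sqrt(np.maximum(accept_rates, 1e-3))
    alloc = np.maximum(min_samples, np.round(total_budget * scores / scores.sum()))
    return alloc.astype(int)

print(allocate_samples(grad_norms=[2.0, 0.5, 1.0],
                       accept_rates=[0.2, 0.9, 0.5],
                       total_budget=64))
```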
Bielik 11B v2 Technical Report
Abstract
arXiv:2505.02410v1 Announce Type: cross Abstract: We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.
摘要
我们推出Bielik 11B v2——专为波兰语文本处理优化的尖端语言模型。该模型基于Mistral 7B v0.2架构,通过深度扩展技术将参数量提升至110亿,在保持强大跨语言能力的同时,于波兰语基准测试中展现出卓越性能。我们引入两项关键技术创新:基于质量权重分配训练样本的加权指令交叉熵损失函数,可优化跨指令类型的学习效果;以及根据上下文长度动态调整的自适应学习率。多基准测试的综合评估表明,Bielik 11B v2在从语言理解到复杂推理的各项任务中,不仅超越了许多参数量为其2-6倍的更大模型,更显著优于其他波兰语专用模型。该模型凭借参数高效性和广泛的量化选项,可适配多种硬件配置部署,既推动了波兰语人工智能的发展,也为资源受限语言的高效建模确立了新基准。
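A minimal PyTorch sketch of a quality-weighted instruction cross-entropy loss in the spirit of the abstract; how the per-example quality weights are obtained is not shown, and the normalization choice is an assumption.

```python
# Quality-weighted instruction cross-entropy (assumed form, illustrative only).
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, targets, example_weights, ignore_index=-100):
    # Per-token CE, averaged per example, then weighted by that example's quality.
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         ignore_index=ignore_index, reduction="none")   # (B, T)
    mask = (targets != ignore_index).float()
    per_example = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (example_weights * per_example).sum() / example_weights.sum()

logits = torch.randn(2, 8, 100)                 # (batch, seq, vocab)
targets = torch.randint(0, 100, (2, 8))
weights = torch.tensor([1.0, 0.3])              # quality-based example weights
print(weighted_instruction_ce(logits, targets, weights))
```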
Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning
Abstract
arXiv:2505.02483v1 Announce Type: cross Abstract: Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot's learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.
摘要
由于机器人动力学的高度复杂性,让高自由度机器人学习特定技能是一项极具挑战性的任务。强化学习(RL)已成为一种有前景的解决方案,但处理此类问题需要设计多个奖励函数以兼顾机器人运动中的各种约束。现有方法通常不加区分地将所有奖励分量相加来优化强化学习的价值函数和策略。我们认为,在策略优化中统一纳入所有奖励分量的做法效率低下,且限制了机器人的学习性能。为此,我们提出了一种基于大语言模型(LLMs)的自动混合奖励调度(AHRS)框架。该范式能在策略优化过程中动态调整各奖励分量的学习强度,使机器人能够以渐进、结构化的方式掌握技能。具体而言,我们设计了一个多分支价值网络,每个分支对应不同的奖励分量。在策略优化时,每个分支会根据其重要性被赋予相应权重,这些权重由LLMs设计的规则自动计算得出。LLM会预先根据任务描述生成规则集,并在训练过程中根据评估各分支性能的语言提示从规则库中选择权重计算规则。实验结果表明,AHRS方法在多个高自由度机器人任务中平均实现了6.48%的性能提升。
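A schematic sketch of the multi-branch value network described above: one value head per reward component, combined with weights that an LLM-selected rule would supply. The shapes and the fixed example weights are assumptions, not the authors' configuration.

```python
# Multi-branch value network with externally supplied branch weights (sketch).
import torch
import torch.nn as nn

class MultiBranchValueNet(nn.Module):
    def __init__(self, obs_dim: int, n_reward_components: int, hidden: int = 128):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(n_reward_components)
        )

    def forward(self, obs: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        values = torch.cat([b(obs) for b in self.branches], dim=-1)   # (B, K)
        return (values * weights).sum(dim=-1)                          # scheduled value

net = MultiBranchValueNet(obs_dim=16, n_reward_components=3)
obs = torch.randn(4, 16)
weights = torch.tensor([0.6, 0.3, 0.1])        # e.g. chosen by an LLM-designed rule
print(net(obs, weights).shape)                  # torch.Size([4])
```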
SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
Abstract
arXiv:2505.02486v1 Announce Type: cross Abstract: Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting. In this paper, we explore forgetting in this context, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model's knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks' answer styles, making the results unusable. By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can obscure the model's knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for transforming data styles across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization, enabling the model to retain existing competencies. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.
摘要
多模态持续指令微调(MCIT)旨在使多模态大语言模型(MLLMs)能够增量学习新任务而不发生灾难性遗忘。本文针对该场景下的遗忘现象进行探究,将其划分为表层遗忘与本质遗忘:表层遗忘指模型知识可能并未真正丢失,但由于后续任务答案风格的干扰,导致其对先前任务的响应偏离预期格式,致使结果无法使用;本质遗忘则指模型输出格式正确但事实错误的答案,表明知识确实丧失。评估本质遗忘需先解决表层遗忘,因严重的表层遗忘会掩盖模型真实知识状态。为此,我们首先提出答案风格多样化(ASD)范式,通过定义跨任务数据风格转换的标准化流程,将各任务训练集统一为相似多样化风格,以预防风格迁移导致的表层遗忘。在此基础上,我们提出RegLoRA来缓解本质遗忘——该方法通过正则化稳定存储先验知识的关键参数,使模型保持现有能力。实验结果表明,我们的整体方法SEFE取得了最先进的性能表现。
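The precise regularizer is defined in the paper; the sketch below only illustrates the idea of penalizing drift in LoRA parameters flagged as storing prior knowledge, with a toy importance mask. The penalty form and the mask construction are assumptions.

```python
# Illustrative RegLoRA-style drift penalty on key LoRA parameters (assumed form).
import torch

def reglora_penalty(lora_delta: torch.Tensor,
                    prev_delta: torch.Tensor,
                    importance_mask: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    # lora_delta: current low-rank update (e.g. B @ A); prev_delta: value after
    # the previous task; importance_mask: 1 where prior knowledge is stored.
    return lam * (importance_mask * (lora_delta - prev_delta) ** 2).sum()

delta_now = torch.randn(8, 8, requires_grad=True)
delta_prev = delta_now.detach() + 0.01 * torch.randn(8, 8)
mask = (delta_prev.abs() > delta_prev.abs().median()).float()   # toy key-parameter mask
print(reglora_penalty(delta_now, delta_prev, mask))
```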
Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study
Abstract
arXiv:2505.02502v1 Announce Type: cross Abstract: Background: Large language models (LLMs) are increasingly deployed via open-source and commercial frameworks, enabling individuals and organizations to self-host advanced AI capabilities. However, insecure defaults and misconfigurations often expose LLM services to the public Internet, posing significant security and system engineering risks. Aims: This study aims to unveil the current landscape of public-facing LLM deployments in the wild through a large-scale empirical study, focusing on service prevalence, exposure characteristics, systemic vulnerabilities, and associated risks. Method: We conducted an Internet-wide measurement to identify public-facing LLM deployments across 15 frameworks, discovering 320,102 services. We extracted 158 unique API endpoints, grouped into 12 functional categories based on capabilities and security risks. We further analyzed configurations, authentication practices, and geographic distributions, revealing deployment trends and systemic issues in real-world LLM system engineering. Results: Our study shows that public LLM deployments are rapidly growing but often insecure. Among all endpoints, we observe widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. Security risks, including model disclosure, system leakage, and unauthorized access, are pervasive, highlighting the need for secure-by-default frameworks and stronger deployment practices. Conclusions: Public-facing LLM deployments suffer from widespread security and configuration flaws, exposing services to misuse, model theft, resource hijacking, and remote exploitation. Strengthening default security, deployment practices, and operational standards is critical for the growing self-hosted LLM ecosystem.
摘要
背景:大型语言模型(LLMs)正越来越多地通过开源和商业框架部署,使个人和组织能够自主托管先进AI能力。然而,不安全的默认设置和错误配置常使LLM服务暴露于公共互联网,带来重大安全与系统工程风险。目标:本研究旨在通过大规模实证研究揭示当前公共LLM部署现状,重点关注服务普及度、暴露特征、系统性漏洞及相关风险。方法:我们实施了全网测量,识别出15个框架下的320,102个公共LLM服务。提取158个独特API端点,根据功能与安全风险划分为12个类别。通过分析配置策略、认证实践和地理分布,揭示了实际LLM系统工程中的部署趋势与系统性问题。结果:研究表明公共LLM部署快速增长但普遍存在安全隐患。所有端点中普遍存在不安全协议使用、TLS配置缺陷及关键操作未授权访问等问题。模型泄露、系统信息泄漏和未授权访问等安全风险广泛存在,凸显了默认安全框架和强化部署实践的必要性。结论:面向公众的LLM部署存在普遍的安全与配置缺陷,导致服务滥用、模型窃取、资源劫持和远程攻击风险。强化默认安全性、部署实践和操作标准对日益增长的自托管LLM生态系统至关重要。
Bielik v3 Small: Technical Report
Abstract
arXiv:2505.02550v1 Announce Type: cross Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.
摘要
我们推出Bielik v3系列参数高效生成文本模型(15亿和45亿参数),专为波兰语处理优化。研究表明,经过精心优化的较小架构在显著减少计算资源需求的同时,仍能达到与更大规模模型相当的性能。该研究包含多项关键创新:定制波兰语分词器(APT4)显著提升分词效率,加权指令交叉熵损失函数平衡不同类型指令的学习,以及基于训练进度动态调整的自适应学习率。这些模型在精心筛选的2920亿标记、覆盖3.03亿文档的语料库上进行训练,在多项基准测试中表现卓越,包括Open PL大语言模型排行榜、复杂波兰语文本理解基准、波兰EQ-Bench及波兰医学排行榜。其中45亿参数模型的性能可媲美其2-3倍规模的模型,而15亿参数模型在极度紧凑的结构下仍展现出强劲性能。这些进展为资源受限应用中实现高质量波兰语AI建立了新基准,为低资源语言的高效参数建模树立了新标准。
EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning
Abstract
arXiv:2505.02579v1 Announce Type: cross Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ( data points and seconds), improved scalability and explainability, and comparable performance across multiple objectives.
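A schematic sketch of the aggregation idea described above: last hidden states from per-objective models are combined with weights found by a grid search after training. The flat grid here stands in for the paper's hierarchical search, and `score_fn` is a placeholder for the reward-based evaluation.

```python
# Post-training aggregation of per-objective hidden states (illustrative sketch).
import itertools
import numpy as np

def aggregate(hidden_states, weights):
    # hidden_states: list of (seq_len, d) arrays, one per single-objective model.
    return sum(w * h for w, h in zip(weights, hidden_states))

def grid_search_weights(hidden_states, score_fn, step=0.25):
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_s = None, -np.inf
    for w in itertools.product(grid, repeat=len(hidden_states)):
        if abs(sum(w) - 1.0) > 1e-9:          # stay on the weight simplex
            continue
        s = score_fn(aggregate(hidden_states, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

h = [np.random.randn(5, 16) for _ in range(3)]
w, s = grid_search_weights(h, score_fn=lambda agg: -np.abs(agg).mean())
print(w, s)
```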