2025-05-29-12-07
Make Planning Research Rigorous Again!
Abstract
arXiv:2505.21674v1 Announce Type: new Abstract: In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.
Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems
Abstract
arXiv:2505.21588v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have enabled the emergence of multi-agent systems where LLMs interact, collaborate, and make decisions in shared environments. While individual model behavior has been extensively studied, the dynamics of peer influence in such systems remain underexplored. In this paper, we investigate herd behavior, the tendency of agents to align their outputs with those of their peers, within LLM-based multi-agent interactions. We present a series of controlled experiments that reveal how herd behaviors are shaped by multiple factors. First, we show that the gap between self-confidence and perceived confidence in peers significantly impacts an agent's likelihood to conform. Second, we find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior. Finally, we demonstrate that the degree of herd behavior can be systematically controlled, and that appropriately calibrated herd tendencies can enhance collaborative outcomes. These findings offer new insights into the social dynamics of LLM-based systems and open pathways for designing more effective and adaptive multi-agent collaboration frameworks.
Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Abstract
arXiv:2505.21880v1 Announce Type: new Abstract: This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.
StreamLink: Large-Language-Model Driven Distributed Data Engineering System
Abstract
arXiv:2505.21575v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink - an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. One of the important design philosophies of StreamLink is to respect user data privacy by utilizing locally fine-tuned LLMs instead of a public AI service like ChatGPT. With help from domain-adapted LLMs, we can improve our system's understanding of natural language queries from users in various scenarios and simplify the procedure of generating database queries like the Structured Query Language (SQL) for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, we allow users to interact with complex database systems at different scales in a user-friendly and security-ensured manner, where SQL generation improves execution accuracy by over 10% compared to baseline methods, and users can find the items they care about most among hundreds of millions of items within a few seconds using natural language.
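A minimal sketch of the query pipeline this abstract describes: a natural-language request is translated into SQL by a locally fine-tuned LLM, then vetted by syntax and security checks before execution. Here `query_local_llm` is a hypothetical stand-in for the domain-adapted model, the checks are illustrative rather than StreamLink's actual components, and a SQLite connection stands in for the Spark/Hadoop backend.

```python
# Sketch of an LLM-to-SQL pipeline with syntax and safety checks before execution.
import sqlite3

def query_local_llm(nl_request: str, schema: str) -> str:
    # Placeholder: a domain-adapted local LLM would generate SQL here.
    return "SELECT name, price FROM items ORDER BY price DESC LIMIT 10;"

def passes_safety_check(sql: str) -> bool:
    # Reject statements that modify data; only read-only queries are allowed.
    forbidden = ("insert", "update", "delete", "drop", "alter", "create")
    return not any(tok in sql.lower() for tok in forbidden)

def passes_syntax_check(sql: str, conn: sqlite3.Connection) -> bool:
    # Use the database's own planner (EXPLAIN) as a cheap syntax validator.
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
sql = query_local_llm("show me the ten most expensive items", "items(name, price)")
if passes_safety_check(sql) and passes_syntax_check(sql, conn):
    print(conn.execute(sql).fetchall())
```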
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Abstract
arXiv:2505.21784v1 Announce Type: new Abstract: Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE
From Reasoning to Learning: A Survey on Hypothesis Discovery and Rule Learning with Large Language Models
Abstract
arXiv:2505.21935v1 Announce Type: new Abstract: Since the advent of Large Language Models (LLMs), efforts have largely focused on improving their instruction-following and deductive reasoning abilities, leaving open the question of whether these models can truly discover new knowledge. In pursuit of artificial general intelligence (AGI), there is a growing need for models that not only execute commands or retrieve information but also learn, reason, and generate new knowledge by formulating novel hypotheses and theories that deepen our understanding of the world. Guided by Peirce's framework of abduction, deduction, and induction, this survey offers a structured lens to examine LLM-based hypothesis discovery. We synthesize existing work in hypothesis generation, application, and validation, identifying both key achievements and critical gaps. By unifying these threads, we illuminate how LLMs might evolve from mere ``information executors'' into engines of genuine innovation, potentially transforming research, science, and real-world problem solving.
Large Language Models for Solving Economic Dispatch Problem
Abstract
arXiv:2505.21931v1 Announce Type: new Abstract: This paper investigates the capability of off-the-shelf large language models (LLMs) to solve the economic dispatch (ED) problem. ED is a hard-constrained optimization problem solved on a day-ahead timescale by grid operators to minimize electricity generation costs while accounting for physical and engineering constraints. Numerous approaches have been proposed, but these typically require either mathematical formulations, face convergence issues, or depend on extensive labeled data and training time. This work implements LLMs enhanced with reasoning capabilities to address the classic lossless ED problem. The proposed approach avoids the need for explicit mathematical formulations, does not suffer from convergence challenges, and requires neither labeled data nor extensive training. A few-shot learning technique is utilized in two different prompting contexts. The IEEE 118-bus system with 19 generation units serves as the evaluation benchmark. Results demonstrate that various prompting strategies enable LLMs to effectively solve the ED problem, offering a convenient and efficient alternative. Consequently, this approach presents a promising future solution for ED tasks, particularly when foundational power system models are available.
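The abstract does not give the formulation, but the classic lossless ED problem it refers to minimizes total (typically quadratic) generation cost subject to a power-balance constraint and per-unit limits. The sketch below only checks feasibility and cost of a candidate dispatch, for example one proposed by an LLM in a few-shot prompt; the cost coefficients and the `validate_and_cost` helper are illustrative, not taken from the paper.

```python
# Verify a candidate dispatch for lossless ED:
# minimize sum_i (a_i + b_i*P_i + c_i*P_i^2)  s.t.  sum_i P_i = demand,  P_min_i <= P_i <= P_max_i.
units = [  # (a, b, c, P_min, P_max): hypothetical coefficients, not from the paper
    (100.0, 20.0, 0.05, 50.0, 300.0),
    (120.0, 18.0, 0.07, 40.0, 250.0),
    (80.0, 22.0, 0.04, 30.0, 200.0),
]
demand = 500.0

def validate_and_cost(dispatch, tol=1e-3):
    """Return total cost if the dispatch is feasible, otherwise raise ValueError."""
    if abs(sum(dispatch) - demand) > tol:
        raise ValueError("power balance violated")
    for p, (_, _, _, p_min, p_max) in zip(dispatch, units):
        if not (p_min <= p <= p_max):
            raise ValueError("generator limit violated")
    return sum(a + b * p + c * p * p for p, (a, b, c, _, _) in zip(dispatch, units))

# Example: a dispatch an LLM might propose for the 500 MW demand above.
print(validate_and_cost([220.0, 160.0, 120.0]))
```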
AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models
Abstract
arXiv:2505.21741v1 Announce Type: new Abstract: Nuclear waste management requires rigorous regulatory compliance assessment, demanding advanced decision-support systems capable of addressing complex legal, environmental, and safety considerations. This paper presents a multi-agent Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval mechanisms to enhance decision accuracy through structured agent collaboration. Through a structured 10-round discussion model, agents collaborate to assess regulatory compliance and safety requirements while maintaining document-grounded responses. Implemented on consumer-grade hardware, the system leverages Llama 3.2 and mxbai-embed-large-v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework's effectiveness. Results show the Regulatory Agent achieves consistently higher relevance scores in maintaining alignment with legal frameworks, while the Safety Agent effectively manages complex risk assessments requiring multifaceted analysis. The system demonstrates progressive improvement in agreement rates between agents across discussion rounds while semantic drift decreases, indicating enhanced decision-making consistency and response coherence. The system ensures regulatory decisions remain factually grounded, dynamically adapting to evolving regulatory frameworks through real-time document retrieval. By balancing automated assessment with human oversight, this framework offers a scalable and transparent approach to regulatory governance. These findings underscore the potential of AI-driven, multi-agent systems in advancing evidence-based, accountable, and adaptive decision-making for high-stakes environmental management scenarios.
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models
Abstract
arXiv:2505.21765v1 Announce Type: new Abstract: While the recent success of large reasoning models (LRMs) has significantly advanced LLMs' reasoning capability by optimizing final answer accuracy using reinforcement learning, these models may also drastically increase output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns, at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses is transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
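The preference-optimization step mentioned above relies on a pairwise dataset contrasting a suboptimal reasoning path with its optimized counterpart. A minimal sketch of assembling such pairs follows; the field names and the toy example are assumptions, since the abstract does not specify the paper's data format.

```python
# Build preference pairs that favor the optimized (concise yet sufficient) thinking
# path over the original overthinking one. Any trainer expecting
# (prompt, chosen, rejected) triples could consume records like these.
import json

examples = [
    {
        "prompt": "Compute 17 * 24.",
        "original_path": ("Let me try long division first... wait, that is not needed... "
                          "maybe factor 24 as 4*6... (many extra steps) 17*24 = 408."),
        "optimized_path": "17*24 = 17*20 + 17*4 = 340 + 68 = 408.",
    },
]

with open("thinking_preference_pairs.jsonl", "w") as f:
    for ex in examples:
        record = {
            "prompt": ex["prompt"],
            "chosen": ex["optimized_path"],   # concise trajectory
            "rejected": ex["original_path"],  # overthinking trajectory
        }
        f.write(json.dumps(record) + "\n")
```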
Query, Don't Train: Privacy-Preserving Tabular Prediction from EHR Data via SQL Queries
Abstract
arXiv:2505.21801v1 Announce Type: new Abstract: Electronic health records (EHRs) contain richly structured, longitudinal data essential for predictive modeling, yet stringent privacy regulations (e.g., HIPAA, GDPR) often restrict access to individual-level records. We introduce Query, Don't Train (QDT): a structured-data foundation-model interface enabling tabular inference via LLM-generated SQL over EHRs. Instead of training on or accessing individual-level examples, QDT uses a large language model (LLM) as a schema-aware query planner to generate privacy-compliant SQL queries from a natural language task description and a test-time input. The model then extracts summary-level population statistics through these SQL queries, and the LLM performs chain-of-thought reasoning over the results to make predictions. This inference-time-only approach (1) eliminates the need for supervised model training or direct data access, (2) ensures interpretability through symbolic, auditable queries, (3) naturally handles missing features without imputation or preprocessing, and (4) effectively manages high-dimensional numerical data to enhance analytical capabilities. We validate QDT on the task of 30-day hospital readmission prediction for Type 2 diabetes patients using a MIMIC-style EHR cohort, achieving F1 = 0.70, which outperforms TabPFN (F1 = 0.68). To our knowledge, this is the first demonstration of LLM-driven, privacy-preserving structured prediction using only schema metadata and aggregate statistics - offering a scalable, interpretable, and regulation-compliant alternative to conventional foundation-model pipelines.
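A minimal sketch of the inference-time flow described here: the LLM sees only the schema and the task, emits aggregate SQL, and then reasons over the returned summary statistics, so no row-level record ever reaches the model. `plan_aggregate_sql` and `reason_over_stats` are hypothetical stand-ins for the two LLM calls, and the tiny table and rule-based "prediction" exist only to make the sketch runnable.

```python
# Query-don't-train sketch: only schema metadata and aggregate statistics are exposed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (patient_id INT, hba1c REAL, readmitted INT)")
conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)",
                 [(1, 7.2, 0), (2, 9.1, 1), (3, 8.4, 1), (4, 6.8, 0)])

def plan_aggregate_sql(task: str, schema: str) -> str:
    # Placeholder for the schema-aware LLM query planner.
    return ("SELECT readmitted, COUNT(*), AVG(hba1c) "
            "FROM admissions GROUP BY readmitted")

def reason_over_stats(task: str, stats) -> float:
    # Placeholder for chain-of-thought reasoning over the aggregates;
    # a trivial rule stands in for the LLM's risk prediction.
    avg_by_group = {row[0]: row[2] for row in stats}
    return 0.7 if avg_by_group.get(1, 0) > avg_by_group.get(0, 0) else 0.3

stats = conn.execute(plan_aggregate_sql("30-day readmission risk",
                                        "admissions(patient_id, hba1c, readmitted)")).fetchall()
print(reason_over_stats("30-day readmission risk", stats))
```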
R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Abstract
arXiv:2505.21668v1 Announce Type: new Abstract: Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
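The interaction pattern the model is trained for, alternating text reasoning with executable code queries, can be sketched as a simple loop: the model emits a fenced code block, the interpreter runs it, and the output is appended to the context before the next turn. `generate_step` below is a hypothetical stand-in for the fine-tuned model; the real system's prompt format, sandboxing, and stopping criteria are not specified in the abstract.

```python
# Minimal code-interpreter loop: text reasoning interleaved with executed code.
import contextlib, io, re

def generate_step(context: str) -> str:
    # Placeholder for the fine-tuned LLM; returns a code query or a final answer.
    if "OUTPUT:" not in context:
        return "```python\nprint(sum(i * i for i in range(1, 11)))\n```"
    return "Final answer: 385"

def run_code(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # the real system would sandbox this call
    return buf.getvalue().strip()

context = "Question: what is the sum of squares of 1..10?\n"
for _ in range(5):                      # cap the number of reasoning turns
    step = generate_step(context)
    context += step + "\n"
    match = re.search(r"```python\n(.*?)```", step, re.S)
    if match:                           # a code query: execute and feed back the result
        context += "OUTPUT: " + run_code(match.group(1)) + "\n"
    else:                               # no code block means the model has answered
        break
print(context)
```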
Efficiently Enhancing General Agents With Hierarchical-categorical Memory
Abstract
arXiv:2505.22006v1 Announce Type: new Abstract: With large language models (LLMs) demonstrating remarkable capabilities, there has been a surge in research on leveraging LLMs to build general-purpose multi-modal agents. However, existing approaches either rely on computationally expensive end-to-end training using large-scale multi-modal data or adopt tool-use methods that lack the ability to continuously learn and adapt to new environments. In this paper, we introduce EHC, a general agent capable of learning without parameter updates. EHC consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module facilitates rapid retrieval of relevant memories and continuously stores new information without being constrained by memory capacity. The TOEL module enhances the agent's comprehension of various task characteristics by classifying experiences and extracting patterns across different categories. Extensive experiments conducted on multiple standard datasets demonstrate that EHC outperforms existing methods, achieving state-of-the-art performance and underscoring its effectiveness as a general agent for handling complex multi-modal tasks.
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
Abstract
arXiv:2505.21828v1 Announce Type: new Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.
VIRAL: Vision-grounded Integration for Reward design And Learning
Abstract
arXiv:2505.22092v1 Announce Type: new Abstract: The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advances have shown that Large Language Models (LLMs) used for reward generation can outperform human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/t4_BXugBm9Q.
Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy
Abstract
arXiv:2505.21907v1 Announce Type: new Abstract: AI copilots, context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.
Reinforced Reasoning for Embodied Planning
Abstract
arXiv:2505.22050v1 Announce Type: new Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Generalized Reinforced Preference Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.
Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test
Abstract
arXiv:2505.22112v1 Announce Type: new Abstract: Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual Large Language Models (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.
Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection
Abstract
arXiv:2505.22192v1 Announce Type: new Abstract: Multi-agent systems based on large language models (LLMs) advance automatic task completion in various fields, where debate is a common cooperation form for agents to solve complicated problems with reasoning and cross-review to solidify answers. Assessing the individual contributions of agents within these debates is crucial for system refinement and outcome reliability. The traditional leave-one-out (LOO) method offers a clear framework for evaluating each agent's role but faces challenges in LLM-based systems due to high computational costs and the associated financial implications. This paper presents introspective leave-one-out (IntrospecLOO), a simple yet effective prompting strategy for approximating LOO in LLM-powered multi-agent debates. IntrospecLOO introduces an additional querying round after standard debates, prompting agents to update their answers while ignoring responses from a designated agent. This strategy effectively isolates and gauges each participant's influence at a reduced query complexity compared to the original LOO approaches. Validation through experiments on three benchmark datasets confirms the effectiveness of IntrospecLOO.
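A sketch of the extra querying round described here: after the standard debate, each remaining agent is asked to update its answer while the designated agent's response is withheld, which approximates that agent's leave-one-out contribution without rerunning the whole debate. `ask_agent` is a hypothetical stand-in for an LLM call, and the contribution score shown (change in majority-vote correctness) is only one simple way to read off the effect; neither is taken from the paper.

```python
# Introspective leave-one-out sketch: re-query agents while hiding one agent's answer.
from collections import Counter

debate_answers = {"agent_a": "42", "agent_b": "42", "agent_c": "41"}
ground_truth = "42"

def ask_agent(agent: str, own_answer: str, visible_peers: dict) -> str:
    # Placeholder for the introspection prompt, e.g.
    # "Ignoring agent X's response, would you update your answer?"
    peer_votes = Counter(visible_peers.values())
    if peer_votes and peer_votes.most_common(1)[0][1] > 1:
        return peer_votes.most_common(1)[0][0]   # conform to a visible majority
    return own_answer

def majority(answers: dict) -> str:
    return Counter(answers.values()).most_common(1)[0][0]

baseline_correct = majority(debate_answers) == ground_truth
for excluded in debate_answers:
    updated = {
        agent: ask_agent(agent, ans,
                         {a: v for a, v in debate_answers.items()
                          if a not in (agent, excluded)})
        for agent, ans in debate_answers.items() if agent != excluded
    }
    loo_correct = majority(updated) == ground_truth
    print(excluded, "contribution:", int(baseline_correct) - int(loo_correct))
```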
What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning
Abstract
arXiv:2505.22148v1 Announce Type: new Abstract: Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.
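A toy illustration of turning a sequential chain into a tree: discourse markers such as "Wait" or "Alternatively" are treated as backtracking that opens a sibling branch, while "Check" opens a verification leaf. This heuristic only conveys the idea of LCoT2Tree's structural view; the paper's actual segmentation and GNN-based analysis are more involved and are not reproduced here.

```python
# Toy conversion of a linear chain-of-thought into a tree based on discourse markers.
BACKTRACK = ("alternatively", "wait", "on second thought")
VERIFY = ("check", "verify", "let me confirm")

def chain_to_tree(steps):
    root = {"text": "<root>", "children": []}
    stack = [root]                       # path from root to the current node
    for step in steps:
        lowered = step.lower()
        if lowered.startswith(BACKTRACK) and len(stack) > 1:
            stack.pop()                  # backtracking: branch from the parent
        node = {"text": step, "children": []}
        stack[-1]["children"].append(node)
        if not lowered.startswith(VERIFY):
            stack.append(node)           # verification nodes stay leaves
    return root

def show(node, depth=0):
    print("  " * depth + node["text"])
    for child in node["children"]:
        show(child, depth + 1)

steps = [
    "Try factoring the quadratic.",
    "Wait, completing the square looks simpler.",
    "Check: the two roots satisfy the original equation.",
    "So the answer is x = 1 or x = 3.",
]
show(chain_to_tree(steps))
```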
ChatPD: An LLM-driven Paper-Dataset Networking System
Abstract
arXiv:2505.22349v1 Announce Type: new Abstract: Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network at this GitHub repository: https://github.com/ChatPD-web/ChatPD.
AgentDNS: A Root Domain Naming System for LLM Agents
Abstract
arXiv:2505.22368v1 Announce Type: new Abstract: The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent-to-agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi-agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published on https://github.com/agentdns.
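The abstract does not pin down the protocol, but the core registration-and-discovery flow can be sketched with a tiny in-memory registry: services register under a hierarchical name with a capability description, and an agent resolves a natural-language need to a concrete endpoint via a simple semantic match. The name scheme, fields, endpoints, and word-overlap scoring below are illustrative assumptions, not the actual AgentDNS specification.

```python
# Toy root registry sketch: register agent/tool services, then resolve a need.
registry = {}

def register(name: str, description: str, endpoint: str, price_per_call: float):
    registry[name] = {"description": description, "endpoint": endpoint,
                      "price_per_call": price_per_call}

def resolve(need: str, top_k: int = 1):
    # Crude semantic match by word overlap; a real system would use embeddings
    # plus authentication and unified billing metadata.
    need_words = set(need.lower().split())
    scored = sorted(
        registry.items(),
        key=lambda kv: len(need_words & set(kv[1]["description"].lower().split())),
        reverse=True,
    )
    return [(name, info["endpoint"]) for name, info in scored[:top_k]]

register("agentdns://translate/acme-translator",
         "translate text between English and French",
         "https://api.acme.example/v1", 0.001)
register("agentdns://search/papers",
         "search academic papers by keyword",
         "https://papers.example/api", 0.0005)

print(resolve("I need to translate an English document into French"))
```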
Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling
Abstract
arXiv:2505.22290v1 Announce Type: new Abstract: Recent research has highlighted that Large Language Models (LLMs), even when trained to generate extended long reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques to elicit LLMs' deliberate reasoning before drawing conclusions that LLMs hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed "unsolvable" (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.
Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems
Abstract
arXiv:2505.22467v1 Announce Type: new Abstract: Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. Nevertheless, the question of how agents should be structurally organized for optimal cooperation remains largely unexplored. In this position paper, we aim to gently redirect the focus of the MAS research community toward this critical dimension: develop topology-aware MASs for specific tasks. Specifically, the system consists of three core components - agents, communication links, and communication patterns - that collectively shape its coordination performance and efficiency. To this end, we introduce a systematic, three-stage framework: agent selection, structure profiling, and topology synthesis. Each stage would trigger new research opportunities in areas such as language models, reinforcement learning, graph learning, and generative modeling; together, they could unleash the full potential of MASs in complicated real-world applications. Then, we discuss the potential challenges and opportunities in the evaluation of multiple systems. We hope our perspective and framework can offer critical new insights in the era of agentic AI.
Offset Unlearning for Large Language Models
Abstract
arXiv:2404.11045v2 Announce Type: cross Abstract: Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora such as copyrighted, biased, and private content has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-Unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that δ-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. δ-Unlearning also effectively incorporates different unlearning algorithms, making our approach a versatile solution to adapting various existing unlearning algorithms to black-box LLMs.
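The offset idea can be written down directly: the black-box model's logits are shifted by the difference between a small model tuned for unlearning and its unmodified counterpart, so the large model itself is never touched. A minimal sketch follows, with random logits standing in for the three models; the actual models and any scaling of the offset are assumptions here.

```python
# Offset unlearning sketch: adjusted = black_box + (small_unlearned - small_base).
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8

# Per-token logits from the three models; in practice these come from real forward passes.
black_box_logits = rng.normal(size=vocab_size)        # frozen black-box LLM
small_base_logits = rng.normal(size=vocab_size)       # small model before unlearning
small_unlearned_logits = small_base_logits.copy()
small_unlearned_logits[3] -= 5.0                       # unlearning suppressed token 3

delta = small_unlearned_logits - small_base_logits     # the learned logit offset
adjusted_logits = black_box_logits + delta             # applied at decoding time

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print("p(token 3) before:", softmax(black_box_logits)[3])
print("p(token 3) after: ", softmax(adjusted_logits)[3])
```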
From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications
Abstract
arXiv:2505.22311v1 Announce Type: new Abstract: With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting-edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial's motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM-centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM-based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi-agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next-generation intelligent communication systems.
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Abstract
arXiv:2505.21523v1 Announce Type: cross Abstract: Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control
Abstract
arXiv:2505.21531v1 Announce Type: cross Abstract: We explore Large Language Models (LLMs)' human motion knowledge through 3D avatar control. Given a motion instruction, we prompt LLMs to first generate a high-level movement plan with consecutive steps (High-level Planning), then specify body part positions in each step (Low-level Planning), which we linearly interpolate into avatar animations as a clear verification lens for human evaluators. Through carefully designed 20 representative motion instructions with full coverage of basic movement primitives and balanced body part usage, we conduct comprehensive evaluations including human assessment of both generated animations and high-level movement plans, as well as automatic comparison with oracle positions in low-level planning. We find that LLMs are strong at interpreting the high-level body movements but struggle with precise body part positioning. While breaking down motion queries into atomic components improves planning performance, LLMs have difficulty with multi-step movements involving high-degree-of-freedom body parts. Furthermore, LLMs provide reasonable approximation for general spatial descriptions, but fail to handle precise spatial specifications in text, and the precise spatial-temporal parameters needed for avatar control. Notably, LLMs show promise in conceptualizing creative motions and distinguishing culturally-specific motion patterns.
OpenReview Should be Protected and Leveraged as a Community Asset for Research in the Era of Large Language Models
Abstract
arXiv:2505.21537v1 Announce Type: cross Abstract: In the era of large language models (LLMs), high-quality, domain-rich, and continuously evolving datasets capturing expert-level knowledge, core human values, and reasoning are increasingly valuable. This position paper argues that OpenReview -- the continually evolving repository of research papers, peer reviews, author rebuttals, meta-reviews, and decision outcomes -- should be leveraged more broadly as a core community asset for advancing research in the era of LLMs. We highlight three promising areas in which OpenReview can uniquely contribute: enhancing the quality, scalability, and accountability of peer review processes; enabling meaningful, open-ended benchmarks rooted in genuine expert deliberation; and supporting alignment research through real-world interactions reflecting expert assessment, intentions, and scientific values. To better realize these opportunities, we suggest the community collaboratively explore standardized benchmarks and usage guidelines around OpenReview, inviting broader dialogue on responsible data use, ethical considerations, and collective stewardship.
Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding?
Abstract
arXiv:2505.21548v1 Announce Type: cross Abstract: Large language models (LLMs) are used around the world but exhibit Western cultural tendencies. To address this cultural misalignment, many countries have begun developing "regional" LLMs tailored to local communities. Yet it remains unclear whether these models merely speak the language of their users or also reflect their cultural values and practices. Using India as a case study, we evaluate five Indic and five global LLMs along two key dimensions: values (via the Inglehart-Welzel map and GlobalOpinionQA) and practices (via CulturalBench and NormAd). Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. In fact, an average American person is a better proxy for Indian cultural values than any Indic model. Even prompting strategies fail to meaningfully improve alignment. Ablations show that regional fine-tuning does not enhance cultural competence and may in fact hurt it by impeding recall of existing knowledge. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data. Our study positions cultural evaluation as a first-class requirement alongside multilingual benchmarks and offers a reusable methodology for developers. We call for deeper investments in culturally representative data to build and evaluate truly sovereign LLMs.
Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
Abstract
arXiv:2505.21547v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main
ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools
Abstract
arXiv:2505.21569v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLM-based agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at https://anonymous.4open.science/r/ChemHAS-01E4/README.md.
摘要
基于大语言模型(LLM)的智能体已展现出通过选择合适工具来提升化学相关任务性能的能力。然而,化学工具固有的预测误差仍限制着其有效性。本文进一步探索如何利用LLM智能体来降低工具的预测误差,为此提出ChemHAS(化学分层智能体堆叠)方法——一种通过有限数据优化智能体堆叠结构来增强化学工具的简洁有效方案。ChemHAS在四项基础化学任务中实现了最先进的性能,证明该方法能有效补偿工具的预测误差。此外,我们识别并表征了四种不同的智能体堆叠行为,这有望提升可解释性,并为科学研究中AI智能体应用揭示新的可能性。代码与数据集已公开于https://anonymous.4open.science/r/ChemHAS-01E4/README.md。
AITEE -- Agentic Tutor for Electrical Engineering
Abstract
arXiv:2505.21582v1 Announce Type: cross Abstract: Intelligent tutoring systems combined with large language models offer a promising approach to address students' diverse needs and promote self-efficacious learning. While large language models possess good foundational knowledge of electrical engineering basics, they remain insufficiently capable of addressing specific questions about electrical circuits. In this paper, we present AITEE, an agent-based tutoring system for electrical engineering designed to accompany students throughout their learning process, offer individualized support, and promote self-directed learning. AITEE supports both hand-drawn and digital circuits through an adapted circuit reconstruction process, enabling natural interaction with students. Our novel graph-based similarity measure identifies relevant context from lecture materials through a retrieval augmented generation approach, while parallel Spice simulation further enhances accuracy in applying solution methodologies. The system implements a Socratic dialogue to foster learner autonomy through guided questioning. Experimental evaluations demonstrate that AITEE significantly outperforms baseline approaches in domain-specific knowledge application, with even medium-sized LLM models showing acceptable performance. Our results highlight the potential of agentic tutors to deliver scalable, personalized, and effective learning environments for electrical engineering education.
摘要
智能辅导系统与大型语言模型相结合,为解决学生多样化需求和促进自我效能学习提供了可行方案。尽管大型语言模型具备电气工程基础知识的良好储备,但在处理电路相关具体问题时仍存在不足。本文提出AITEE——一个基于智能体的电气工程辅导系统,旨在全程陪伴学生学习过程,提供个性化支持并促进自主式学习。该系统通过改进的电路重建流程同时支持手绘与数字电路,实现与学生的自然交互。我们提出的新型图结构相似度度量方法,结合检索增强生成技术从讲义材料中识别相关上下文,而并行Spice仿真则进一步提升解决方案方法的应用准确性。系统采用苏格拉底式对话机制,通过引导式提问培养学习者自主性。实验评估表明,AITEE在领域知识应用方面显著优于基线方法,即使中等规模的语言模型也展现出可接受的性能。研究结果凸显了智能体辅导系统在电气工程教育中构建可扩展、个性化且高效学习环境的潜力。
RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving
Abstract
arXiv:2505.21577v1 Announce Type: cross Abstract: The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/RepoMaster.
摘要
代码智能体的终极目标是自主解决复杂任务。尽管大语言模型在代码生成方面取得显著进展,但现实任务通常需要完整的代码仓库而非简单脚本。从零开始构建此类仓库仍面临重大挑战。幸运的是,GitHub托管着庞大且持续演进的开源仓库集合,开发者常将其作为模块化组件复用于复杂任务。然而,现有框架如OpenHands和SWE-Agent仍难以有效利用这些宝贵资源:仅依赖README文件提供的指导不足,深入分析后我们发现两大核心障碍——仓库信息过载与依赖关系错综复杂,二者均受限于当前大语言模型的有限上下文窗口。为解决这些问题,我们提出RepoMaster——一个专为探索和复用GitHub仓库以解决复杂任务而设计的自主智能体框架。该框架通过构建函数调用图、模块依赖图及分层代码树来识别核心组件,仅向大语言模型提供已识别的关键元素而非整个仓库。在自主执行过程中,它利用我们的探索工具逐步关联相关组件,并通过信息剪枝优化上下文使用。在调整后的MLE-bench评估中,RepoMaster相较最强基线OpenHands实现有效提交量110%的相对提升。在我们新发布的GitTaskBench上,RepoMaster将任务通过率从24.1%提升至62.9%,同时减少95%的token消耗。代码及演示材料已公开于https://github.com/wanghuacan/RepoMaster。
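One of the structures RepoMaster builds, the function-call graph, can be approximated for a single Python file with the standard-library ast module, as in the sketch below. The traversal rule (recording every called Name or Attribute inside each top-level function) is a simplification assumed here, not the paper's exact analysis.

```python
import ast
from collections import defaultdict

SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    text = load(path)
    return text.splitlines()

def main():
    for line in parse("data.txt"):
        print(line)
'''

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call):
                    fn = sub.func
                    if isinstance(fn, ast.Name):
                        graph[node.name].add(fn.id)
                    elif isinstance(fn, ast.Attribute):
                        graph[node.name].add(fn.attr)
    return dict(graph)

print(build_call_graph(SOURCE))
# e.g. {'load': {'open', 'read'}, 'parse': {'load', 'splitlines'}, 'main': {'parse', 'print'}}
```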
Public Discourse Sandbox: Facilitating Human and AI Digital Communication Research
Abstract
arXiv:2505.21604v1 Announce Type: cross Abstract: Social media serves as a primary communication and information dissemination platform for major global events, entertainment, and niche or topically focused community discussions. Therefore, it represents a valuable resource for researchers who aim to understand numerous questions. However, obtaining data can be difficult, expensive, and often unreliable due to the presence of bots, fake accounts, and manipulated content. Additionally, there are ethical concerns if researchers decide to conduct an online experiment without explicitly notifying social media users about their intent. There is a need for more controlled and scalable mechanisms to evaluate the impacts of digital discussion interventions on audiences. We introduce the Public Discourse Sandbox (PDS), which serves as a digital discourse research platform for human-AI as well as AI-AI discourse research, testing, and training. PDS provides a safe and secure space for research experiments that are not viable on public, commercial social media platforms. Its main purpose is to enable the understanding of AI behaviors and the impacts of customized AI participants via techniques such as prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. We provide a hosted live version of the sandbox to support researchers as well as the open-sourced code on GitHub for community collaboration and contribution.
摘要
社交媒体作为全球重大事件、娱乐活动以及小众或主题性社群讨论的主要传播与信息发布平台,为研究者提供了理解诸多问题的宝贵资源。然而,由于机器人账号、虚假账户和操纵性内容的存在,数据获取往往面临困难、成本高昂且可靠性不足的问题。此外,若研究者在未明确告知社交媒体用户的情况下开展在线实验,还会引发伦理争议。当前亟需建立更具可控性和扩展性的机制,以评估数字讨论干预对受众的影响。为此,我们推出"公共话语沙盒"(PDS)——一个面向人机对话及人工智能间对话研究、测试与训练的数字话语研究平台。该沙盒为无法在公共商业社交媒体平台上实施的研究实验提供了安全可靠的环境,其核心目标是通过提示工程、检索增强生成(RAG)和微调等技术,助力研究者理解AI行为模式及定制化AI参与者的影响。我们不仅提供托管式沙盒实时版本支持科研工作,同时也在GitHub开源代码以促进社区协作与贡献。
Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits
Abstract
arXiv:2505.21594v1 Announce Type: cross Abstract: Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.
摘要
大型语言模型(LLMs)为智能手机、可穿戴设备和具身机器人等边缘设备提供了多样化的应用可能。然而,其部署通常依赖昂贵的云端API接口,导致高昂运营成本,这不仅限制了小型组织的使用权限,也引发了可持续性担忧。部分LLMs可采用设备端部署方案,通过降低延迟和增强隐私保护实现经济高效的解决方案。但有限的计算资源制约了可部署模型的规模与精度,需要边缘与云端协同设计。我们提出一种快速高效的边缘-云端推测式解码框架,在服务器端部署大型目标模型,在设备端运行小型草稿模型。通过在目标模型中引入早期退出机制,令牌可在验证过程中生成,使得客户端能在最终验证前预起草后续令牌,从而利用空闲时间并增强边缘与云端的并行性。基于NVIDIA Jetson Nano(客户端)和A100 GPU(服务器)平台,配合Vicuna-68M(草稿)与Llama2-7B(目标)模型,我们的方法相比云端自回归解码可降低35%的延迟,其中预起草机制额外贡献了11%的改进。为验证实际应用价值,我们将该方法部署于Unitree Go2四足机器人,采用基于视觉语言模型(VLM)的控制方案,较传统云端自回归解码实现了21%的加速。这些结果证明了本框架在资源受限边缘设备上实现实时LLM和VLM应用的潜力。
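The system contributions (edge-cloud split, early exits, preemptive drafting) do not compress into a few lines, but the speculative decoding acceptance rule they build on is standard and can be sketched. The toy code below drafts k tokens from a stand-in small model and verifies them against a stand-in large model; both distributions, and the omission of the bonus token on full acceptance, are simplifications rather than the paper's framework.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def draft_dist(context):            # stand-in for the small on-device draft model
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def target_dist(context):           # stand-in for the large cloud target model
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def speculative_step(context, k=4):
    """Draft k tokens on the device, then verify them with the target model."""
    drafted, q_probs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok); q_probs.append(q)
        ctx.append(tok)

    accepted = []
    ctx = list(context)
    for tok, q in zip(drafted, q_probs):
        p = target_dist(ctx)                           # in the paper this runs in the cloud,
        if rng.random() < min(1.0, p[tok] / q[tok]):   # with early exits exposing p sooner
            accepted.append(tok); ctx.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)          # standard rejection-resampling step
            residual /= residual.sum()
            tok = int(rng.choice(VOCAB, p=residual))
            accepted.append(tok); ctx.append(tok)
            break                                      # stop at the first rejection
    return accepted                                    # (bonus token on full accept omitted)

print(speculative_step(context=[1, 2, 3]))
```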
The Feasibility of Topic-Based Watermarking on Academic Peer Reviews
Abstract
arXiv:2505.21636v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly integrated into academic workflows, with many conferences and journals permitting their use for tasks such as language refinement and literature summarization. However, their use in peer review remains prohibited due to concerns around confidentiality breaches, hallucinated content, and inconsistent evaluations. As LLM-generated text becomes more indistinguishable from human writing, there is a growing need for reliable attribution mechanisms to preserve the integrity of the review process. In this work, we evaluate topic-based watermarking (TBW), a lightweight, semantic-aware technique designed to embed detectable signals into LLM-generated text. We conduct a comprehensive assessment across multiple LLM configurations, including base, few-shot, and fine-tuned variants, using authentic peer review data from academic conferences. Our results show that TBW maintains review quality relative to non-watermarked outputs, while demonstrating strong robustness to paraphrasing-based evasion. These findings highlight the viability of TBW as a minimally intrusive and practical solution for enforcing LLM usage in peer review.
摘要
大型语言模型(LLMs)正日益融入学术工作流程,许多会议和期刊允许将其用于语言润色和文献综述等任务。然而,由于担心泄露机密信息、生成虚构内容及评价不一致等问题,同行评审中仍禁止使用LLMs。随着LLM生成文本与人类写作的区分度逐渐降低,建立可靠的溯源机制以维护评审过程的完整性变得愈发重要。本研究评估了基于主题的水印技术(TBW)——一种轻量级、语义感知的方法,旨在向LLM生成文本中嵌入可检测信号。我们使用学术会议的真实同行评审数据,对多种LLM配置(包括基础模型、少样本学习及微调变体)进行了全面评估。结果表明,相较于无水印输出,TBW在保持评审质量的同时,对基于改写的规避行为表现出极强的鲁棒性。这些发现证明TBW可作为执行同行评审中LLM使用规范的一种低干扰、实用性解决方案。
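The abstract does not spell out TBW's construction, but topic-based watermarking schemes generally seed a topic-dependent "green list" of tokens, bias generation toward it, and detect the watermark via a z-score on how many green tokens appear. The sketch below shows only that generic idea with assumed constants; it is not the authors' TBW implementation.

```python
import hashlib
import numpy as np

VOCAB = [f"tok{i}" for i in range(1000)]

def green_list(topic: str, fraction: float = 0.5) -> set[str]:
    """Derive a topic-dependent subset of the vocabulary from a hash seed."""
    seed = int.from_bytes(hashlib.sha256(topic.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(VOCAB), size=int(fraction * len(VOCAB)), replace=False)
    return {VOCAB[i] for i in idx}

def detection_z(tokens: list[str], topic: str, fraction: float = 0.5) -> float:
    """z-score of the observed green-token count vs. the unwatermarked expectation."""
    green = green_list(topic, fraction)
    hits = sum(t in green for t in tokens)
    n = len(tokens)
    mean, std = fraction * n, (fraction * (1 - fraction) * n) ** 0.5
    return (hits - mean) / std

# A watermarked generator would up-weight green-list logits during decoding; here we
# simply construct a heavily green output to show the detector firing.
green = sorted(green_list("peer review"))
watermarked = green[:80] + [t for t in VOCAB if t not in set(green)][:20]
print(round(detection_z(watermarked, "peer review"), 2))   # large positive z-score
```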
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
Abstract
arXiv:2505.21605v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
摘要
大型语言模型(LLMs)在复杂任务(如推理与研究生水平问答)中展现出日益增强的能力,但其抗滥用韧性——尤其是在涉及科学复杂性风险的情境下——仍未得到充分探究。现有安全基准通常聚焦于仅需基础知识理解的指令(例如"告诉我如何制作炸弹"),或采用风险相对较低的提示(如关于危险内容的多选题或分类任务),因而无法充分评估模型在知识密集型危险场景中的安全性。为填补这一关键空白,我们提出SOSBench——一个基于法规、聚焦高危领域的基准测试,涵盖化学、生物学、医学、药理学、物理学和心理学六大高风险科学领域。该基准包含3,000条源自真实法规条例的提示,通过LLM辅助的进化管道系统扩展,引入多样化且现实化的滥用场景(例如涉及高级化学公式的详细爆炸物合成指导)。我们在统一评估框架下使用SOSBench对前沿模型进行测试。尽管这些模型宣称已进行安全对齐,先进模型在所有领域均持续披露违反政策的内容,有害响应率居高不下(如Deepseek-R1达79.1%,GPT-4.1达47.3%)。这些结果揭示了显著的安全对齐缺陷,并突显了关于强大LLMs负责任部署的紧迫性问题。
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Abstract
arXiv:2505.21600v1 Announce Type: cross Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
摘要
大型语言模型(LLMs)以显著增加的推理开销为代价获得了卓越的推理能力,这给实际部署带来了巨大挑战。尽管经过蒸馏的小型语言模型(SLMs)显著提升了效率,但其性能因无法遵循LLMs的推理路径而受限。幸运的是,我们发现仅有少量关键标记会真正导致LLMs与SLMs的推理路径分叉,大多数生成标记要么完全相同,要么仅存在中性差异(如缩写或表达方式的细微变化)。基于这一发现,我们提出R2R(Roads to Rome)——一种神经标记路由方法,该方法仅针对关键路径分叉标记选择性调用LLMs,而将大部分标记生成任务交由SLM处理。我们还开发了自动化数据生成流程,用于识别分叉标记并生成标记级路由标签以训练轻量级路由器。将R2R应用于DeepSeek家族的R1-1.5B和R1-32B模型组合后,在数学、编程和问答等挑战性基准测试中,平均激活参数量仅5.6B的R2R以1.6倍优势超越R1-7B的平均准确率,甚至优于R1-14B模型。与R1-32B相比,在保持相当性能的同时实现了2.8倍的实时加速,推进了测试时缩放效率的帕累托前沿。代码已开源:https://github.com/thu-nics/R2R。
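R2R's router is a trained neural module; the loop below only illustrates the routing pattern described in the abstract: let the small model emit every token and escalate to the large model when a divergence score crosses a threshold. The stand-in models, score, and threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def slm_next(context):
    """Stand-in small model: cheap token plus a router divergence score."""
    return "common-token", rng.random()

def llm_next(context):
    """Stand-in large model: expensive but trusted at divergent positions."""
    return "critical-token"

def route_generate(prompt, max_tokens=20, threshold=0.85):
    """Generate with the SLM, escalating to the LLM on high divergence scores."""
    out, llm_calls = [], 0
    for _ in range(max_tokens):
        token, divergence = slm_next(prompt + " ".join(out))
        if divergence > threshold:              # router predicts the paths would split here
            token = llm_next(prompt + " ".join(out))
            llm_calls += 1
        out.append(token)
    return out, llm_calls

tokens, calls = route_generate("Solve: 17 * 24 =")
print(f"{calls}/{len(tokens)} tokens required the large model")
```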
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Abstract
arXiv:2505.21627v1 Announce Type: cross Abstract: State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it -- they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output -- they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Gemma and Ministral families, and input prompts from the LMSYS Chatbot Arena platform.
摘要
当前最先进的大语言模型需要专用硬件和大量能源才能运行。因此,提供大语言模型访问权限的云服务变得非常流行。在这些服务中,用户为模型输出支付的价格取决于模型生成输出时使用的令牌数量——他们为每个令牌支付固定价格。本研究表明,这种定价机制为服务提供商创造了策略性误报模型生成输出所用令牌数量的财务动机,而用户无法证明甚至无从知晓提供商是否存在超额收费行为。然而我们也发现,若要求不诚信的提供商必须公开模型生成过程的透明度,则要在不引起怀疑的前提下实现最优误报具有相当难度。作为概念验证,我们提出了一种高效的启发式算法,使提供商能在不引发怀疑的情况下大幅超额收费,这揭示了现行按令牌计费机制下用户的脆弱性。为进一步彻底消除策略性行为的财务动机,我们提出了一种简单的激励相容令牌定价机制。在该机制下,用户为模型输出支付的价格取决于输出内容的字符数量——他们为每个字符支付固定价格。为验证和补充理论结果,我们使用Llama、Gemma和Ministral系列的大语言模型,以及来自LMSYS Chatbot Arena平台的输入提示进行了多项实验。
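The incentive argument reduces to simple arithmetic: under per-token pricing the bill depends on a count only the provider observes, while under per-character pricing the user can recount the bill from the visible output. The prices and token counts below are made up solely to illustrate the point.

```python
# Hypothetical prices; actual provider pricing differs.
PRICE_PER_TOKEN = 0.00002      # dollars per token
PRICE_PER_CHAR = 0.000005      # dollars per character

output = "The capital of France is Paris."
true_tokens = 8                 # what the model actually used (unobservable to the user)
reported_tokens = 11            # what a strategic provider could claim instead

honest_bill   = true_tokens * PRICE_PER_TOKEN
inflated_bill = reported_tokens * PRICE_PER_TOKEN
char_bill     = len(output) * PRICE_PER_CHAR   # users can recount this themselves

print(f"per-token, honest:   ${honest_bill:.6f}")
print(f"per-token, inflated: ${inflated_bill:.6f}  (+{inflated_bill / honest_bill - 1:.0%})")
print(f"per-character:       ${char_bill:.6f}  (verifiable from the output alone)")
```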
Incentivizing Permissionless Distributed Learning of LLMs
Abstract
arXiv:2505.21684v1 Announce Type: cross Abstract: We describe an incentive system for distributed deep learning of foundational models where peers are rewarded for contributions. The incentive system, Gauntlet, has been deployed on the bittensor blockchain and used to train a 1.2B LLM with completely permissionless contributions of pseudo-gradients: no control over the users that can register or their hardware. Gauntlet can be applied to any synchronous distributed training scheme that relies on aggregating updates or pseudo-gradients. We rely on a two-stage mechanism for fast filtering of peer uptime, reliability, and synchronization, combined with the core component that estimates the loss before and after individual pseudo-gradient contributions. We utilized an OpenSkill rating system to track competitiveness of pseudo-gradient scores across time. Finally, we introduce a novel mechanism to ensure peers on the network perform unique computations. Our live 1.2B run, which has paid out real-valued tokens to participants based on the value of their contributions, yielded a competitive (on a per-iteration basis) 1.2B model that demonstrates the utility of our incentive system.
摘要
我们提出了一种用于基础模型分布式深度学习的激励机制,该机制通过奖励参与者的贡献来运作。这一名为Gauntlet的激励系统已部署在Bittensor区块链上,并成功用于训练一个12亿参数的大型语言模型(LLM),其特点在于完全无需许可地接收伪梯度贡献:既不控制注册用户资格,也不限制其硬件条件。Gauntlet可应用于任何依赖聚合更新或伪梯度的同步分布式训练方案。我们采用两阶段机制快速筛选节点的在线率、可靠性和同步性,其核心组件通过对比个体伪梯度贡献前后的损失值进行评估。系统采用OpenSkill评分体系持续追踪伪梯度得分的动态竞争力。最后,我们引入创新机制确保网络节点执行独特计算任务。在实际运行的12亿参数模型训练中,系统根据参与者贡献价值发放实际代币奖励,最终产出的模型在单次迭代性能上表现出竞争力,验证了该激励系统的实用价值。
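The core scoring component, estimating the loss before and after applying one peer's pseudo-gradient, can be sketched on a toy linear-regression model as below. The model, loss, and learning rate are stand-ins; the deployed system scores contributions to a 1.2B LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=256)

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

def score_contribution(w, pseudo_grad, lr=0.05):
    """Reward = loss improvement from applying this peer's pseudo-gradient alone."""
    return loss(w) - loss(w - lr * pseudo_grad)

w = np.zeros(10)
honest_grad = 2 * X.T @ (X @ w - y) / len(y)   # true gradient of the MSE loss
junk_grad = rng.normal(size=10)                # low-effort or adversarial update

print("honest peer score:", round(score_contribution(w, honest_grad), 4))
print("junk peer score:  ", round(score_contribution(w, junk_grad), 4))
```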
Rethinking the Outlier Distribution in Large Language Models: An In-depth Study
Abstract
arXiv:2505.21670v1 Announce Type: cross Abstract: Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers in depth. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.
摘要
研究大型语言模型(LLMs)中的异常值至关重要,因为这些异常值对模型性能的多个方面(包括量化和压缩)具有显著影响。异常值通常会导致较大的量化误差,从而降低模型性能。识别并解决这些异常值可以提高量化过程的准确性和效率,实现在边缘设备或专用硬件上的更顺畅部署。近期研究发现了LLMs中两种常见的异常值类型:大规模激活异常和通道级异常。尽管已有大量量化算法被提出以减轻其影响并保持满意的准确度,但很少有研究深入探讨这些异常值的根本成因。本文对这些异常值的形成机制进行了全面研究,并提出了可能减少其发生的策略。最终,我们介绍了一些高效方法,可在对准确度影响最小的情况下消除大多数大规模激活异常和通道级异常。
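The two outlier types the paper studies are easy to make concrete with a short numpy diagnostic: massive activations are single entries far above the typical magnitude, and channel-wise outliers are whole channels whose scale dwarfs the others. The thresholds below (100x the median entry magnitude, 5x the median channel scale) are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))          # (tokens, hidden channels), toy activations
acts[7, 3] = 900.0                          # plant a "massive activation"
acts[:, 50] *= 10.0                         # plant a channel-wise outlier

# Massive activations: individual entries far above the typical magnitude.
typical = np.median(np.abs(acts))
massive = np.argwhere(np.abs(acts) > 100 * typical)

# Channel-wise outliers: channels whose robust scale dwarfs the median channel.
channel_scale = np.median(np.abs(acts), axis=0)
outlier_channels = np.where(channel_scale > 5 * np.median(channel_scale))[0]

print("massive activation positions:", massive.tolist())
print("outlier channels:", outlier_channels.tolist())
```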
How does Misinformation Affect Large Language Model Behaviors and Preferences?
Abstract
arXiv:2505.21608v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.
摘要
大型语言模型(LLMs)在知识密集型任务中展现出卓越能力,但在遭遇错误信息时仍显脆弱。现有研究探讨了LLMs在应对错误信息中的作用,但对其受错误信息影响的具体方面和程度仍缺乏细粒度分析。为填补这一空白,我们提出MisBench——当前规模最大、最全面的基准测试,用于评估LLMs对错误信息的行为反应与知识偏好。该基准包含10,346,712条错误信息,创新性地同时考量了知识冲突与错误信息风格变异两个维度。实验结果表明:尽管LLMs表现出相当的误信息识别能力,其仍易受知识冲突和风格变异的影响。基于这些发现,我们进一步提出"重构判别法"(Reconstruct to Discriminate, RtD)以增强LLMs的误信息检测能力。本研究为理解LLMs与错误信息的交互机制提供了重要见解,相信MisBench可作为评估基于LLM的检测器、提升其实际应用可靠性的有效基准。代码与数据详见https://github.com/GKNL/MisBench。
LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model
Abstract
arXiv:2505.21689v1 Announce Type: cross Abstract: The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R² = 0.988, ρ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.
摘要
未决法律案件的持续积压,特别是在印度司法系统中,严重阻碍了司法的及时执行。传统的人工请愿书优先级排序方法往往效率低下且易受主观偏见影响,进一步加剧了案件延误。为解决这一问题,我们提出LLMPR(基于大语言模型的请愿书排序框架),该自动化框架利用迁移学习和机器学习技术,根据法律请愿书的上下文紧急程度分配优先级排序。通过包含7,593份标注请愿书的ILDC数据集,我们处理非结构化法律文本,并采用DistilBERT、LegalBERT和MiniLM等多种嵌入技术提取特征。这些文本嵌入特征与间隔天数、等级分数和字数等量化指标相结合,用于训练包括随机森林、决策树、XGBoost、LightGBM和CatBoost在内的多种机器学习模型。实验结果表明,随机森林和决策树模型表现最优,准确率超过99%,斯皮尔曼等级相关系数达0.99。值得注意的是,仅使用数值特征的模型即可实现近乎最优的排序效果(R2 = 0.988,ρ = 0.998),而基于大语言模型的嵌入仅带来边际提升。这些发现表明,自动化请愿书排序能有效优化司法工作流程,减少案件积压,并提升法律优先级排序的公平性。
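The modeling recipe described in the abstract, text embeddings concatenated with numeric indicators and fed to a tree ensemble evaluated by Spearman correlation, can be sketched end to end on synthetic data with scikit-learn. The feature names and the synthetic urgency target below are placeholders, not the ILDC setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 500

# Stand-ins: 32-dim "text embeddings" plus the numeric indicators the paper lists.
embeddings = rng.normal(size=(n, 32))
gap_days   = rng.integers(0, 2000, size=n)
word_count = rng.integers(100, 5000, size=n)
numeric    = np.column_stack([gap_days, word_count])

# Synthetic urgency score, driven mostly by the numeric features (echoing the paper's finding).
urgency = 0.8 * gap_days / 2000 + 0.2 * rng.normal(size=n)

X = np.hstack([embeddings, numeric])
split = int(0.8 * n)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], urgency[:split])

pred = model.predict(X[split:])
rho, _ = spearmanr(pred, urgency[split:])
print(f"Spearman rank correlation on held-out petitions: {rho:.3f}")
```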
Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2
Abstract
arXiv:2505.21715v1 Announce Type: cross Abstract: The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system utilizes a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies: FedAvg, Krum Aggregation and a novel Loss-aware Federated Averaging (L-FedAvg) were evaluated. Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality.
摘要
基于胸部X光图像的自动化放射学报告生成在提升诊断工作流程效率的同时,能够有效保护患者隐私。传统集中式方法常需传输敏感数据,存在隐私泄露风险。为此,本研究提出一种基于IU-Xray数据集的多模态联邦学习框架,用于胸部X光报告生成。该系统采用视觉变换器(ViT)作为编码器,GPT-2作为报告生成器,实现无需共享原始数据的分布式训练。评估了三种联邦学习聚合策略:联邦平均(FedAvg)、Krum聚合以及新型损失感知联邦平均(L-FedAvg)。结果表明,Krum聚合在ROUGE、BLEU、BERTScore和RaTEScore等词汇与语义评估指标上表现最优。研究证实联邦学习模型在生成临床相关且语义丰富的放射学报告方面可媲美甚至超越集中式模型。该轻量级隐私保护框架为不妥协数据机密性的协作医疗AI开发提供了新途径。
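Two of the three aggregation strategies compared, FedAvg and Krum, have compact textbook definitions that the sketch below reproduces for flattened client weight vectors; the loss-aware L-FedAvg variant is the authors' own contribution and is not reproduced here.

```python
import numpy as np

def fedavg(updates, weights=None):
    """Weighted mean of client weight vectors (standard FedAvg aggregation)."""
    updates = np.stack(updates)
    weights = np.ones(len(updates)) if weights is None else np.asarray(weights, float)
    weights = weights / weights.sum()
    return (weights[:, None] * updates).sum(axis=0)

def krum(updates, n_byzantine):
    """Return the single update whose closest neighbors are most tightly packed."""
    updates = np.stack(updates)
    n = len(updates)
    dists = np.linalg.norm(updates[:, None, :] - updates[None, :, :], axis=-1) ** 2
    k = n - n_byzantine - 2                      # number of neighbors scored per client
    scores = []
    for i in range(n):
        nearest = np.sort(np.delete(dists[i], i))[:k]
        scores.append(nearest.sum())
    return updates[int(np.argmin(scores))]

rng = np.random.default_rng(0)
honest = [rng.normal(0.0, 0.1, size=8) for _ in range(6)]
attacker = [np.full(8, 50.0)]                    # one poisoned client update
print("FedAvg norm:", round(float(np.linalg.norm(fedavg(honest + attacker))), 2))
print("Krum norm:  ", round(float(np.linalg.norm(krum(honest + attacker, n_byzantine=1))), 2))
```

The print-out illustrates why robust aggregation matters: the poisoned client drags the FedAvg result far from the honest updates, while Krum returns one of the honest vectors.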
Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations
Abstract
arXiv:2505.21657v1 Announce Type: cross Abstract: Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. The result is a set of simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
摘要
诸如GPT、LLAMA和Claude等大型语言模型在文本生成方面已展现出强大能力,但其内部机制仍如同黑箱,难以理解其决策依据。这种透明度的缺失在需要信任与问责的领域尤为棘手。为此,我们提出SMILE——一种通过微调输入并测量输出变化来解释模型响应机制的新方法。该模型无关技术能精准定位提示文本中影响力最大的词汇,并生成直观的热力图进行可视化呈现。我们在多个前沿大语言模型上验证了SMILE的有效性,采用准确性、一致性、稳定性和保真度等指标证明其解释的清晰度与可靠性。通过提升模型可解释性,SMILE为增强人工智能的透明度和可信度迈出了关键一步。
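SMILE's perturb-and-measure idea can be illustrated model-agnostically: drop one word at a time, measure how far the output drifts, and attribute importance accordingly. The toy "model" and the string-similarity distance below are stand-ins; the actual method fits local surrogate statistics over a real LLM's outputs.

```python
from difflib import SequenceMatcher

def toy_model(prompt: str) -> str:
    """Stand-in LLM: answers about capitals only if both key words survive."""
    if "capital" in prompt and "France" in prompt:
        return "The capital of France is Paris."
    return "I am not sure what you are asking."

def word_importance(prompt: str, model) -> dict[str, float]:
    """Importance of each word = how much the output drifts when that word is removed."""
    base = model(prompt)
    words = prompt.split()
    scores = {}
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        drift = 1.0 - SequenceMatcher(None, base, model(perturbed)).ratio()
        scores[word] = round(drift, 3)
    return scores

print(word_importance("What is the capital of France ?", toy_model))
# Words whose removal flips the answer ("capital", "France") receive the highest scores,
# which is exactly the signal a heat map would visualize.
```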
Counterfactual Simulatability of LLM Explanations for Generation Tasks
Abstract
arXiv:2505.21740v1 Announce Type: cross Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.
摘要
大型语言模型(LLM)的行为可能难以预测,即使对提示进行微小改动也可能导致输出发生意料之外的变化。因此,模型准确解释自身行为的能力至关重要,特别是在高风险场景中。评估解释的一种方法是反事实可模拟性,即解释能使用户在多大程度上推断模型在相关反事实上的输出。此前反事实可模拟性研究主要针对是非问答任务。我们提出了一个通用框架,将该方法扩展至生成任务,并以新闻摘要和医疗建议作为应用案例。研究发现,在摘要场景中,LLM的解释确实能帮助用户更好地预测模型在反事实上的输出,但在医疗建议方面仍有显著改进空间。此外,结果表明反事实可模拟性评估可能更适用于基于技能的任务,而非基于知识的任务。
OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
Abstract
arXiv:2505.21724v1 Announce Type: cross Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to online generate synchronized verbal and non-verbal listener feedback, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we innovatively introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.
摘要
本文提出在线多模态对话响应生成(OMCRG)这一新任务,旨在基于说话者的多模态输入,在线生成同步的言语与非言语倾听者反馈。该任务反映了自然二元互动特性,并在实现倾听者生成音频与面部反应的同步性方面提出了新挑战。为解决这些挑战,我们创新性地引入文本作为中间模态以桥接音频与面部反应,进而提出多模态大语言模型OmniResponse,该模型能自回归生成高质量的多模态倾听者响应。OmniResponse采用预训练大语言模型架构,并集成两个新组件:时序锚定生成文本标记的Chrono-Text模块,以及可控制在线生成与面部反应同步语音的TempoVoice合成模块。为推进OMCRG研究,我们构建了包含696段高质量二元互动的ResponseNet数据集,内含同步分屏视频、多通道音频、转录文本及面部行为标注。基于ResponseNet的全面评估表明,OmniResponse在语义语音内容、视听同步性和生成质量方面显著优于基线模型。
VeriTrail: Closed-Domain Hallucination Detection with Traceability
Abstract
arXiv:2505.21786v1 Announce Type: cross Abstract: Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as "closed-domain hallucination." This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs' faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.
摘要
即使被要求严格遵循源材料,语言模型仍经常生成未经证实的内容——这种现象被称为"闭域幻觉"。与单步生成过程(SGS)相比,多步生成过程(MGS)中这种风险会被进一步放大。然而由于MGS过程更为复杂,我们认为仅检测最终输出的幻觉虽有必要但并不充分:同等重要的是追踪幻觉内容可能被引入的环节,以及忠实内容如何通过中间输出从源材料派生。为此,我们提出了VeriTrail——首个专为MGS和SGS过程提供可追溯性的闭域幻觉检测方法。同时我们发布了首个包含所有中间输出及MGS过程最终输出忠实性人工标注的数据集。实验表明,VeriTrail在两个数据集上的表现均优于基线方法。
DualSchool: How Reliable are LLMs for Optimization Education?
Abstract
arXiv:2505.21775v1 Announce Type: cross Abstract: Consider the following task taught in introductory optimization courses which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web-scale, have the conversion process and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed by DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness, verification, and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
摘要
考虑以下在优化入门课程中讲授的任务,该任务针对(生成式)人工智能与运筹学交叉领域社区提出的挑战:生成线性规划的对偶问题。大型语言模型(LLMs)通过互联网规模训练,已掌握对偶转换流程及大量原始-对偶转换(P2DC)实例。因此学生有理由预期LLMs在P2DC任务上表现良好。为验证该预期,本文提出DualSchool框架——一个用于生成和验证P2DC实例的完整体系。DualSchool的验证程序采用规范图编辑距离,其评估深度远超现有优化模型评估方法(这些方法在P2DC任务中存在大量假阳性与假阴性)。DualSchool实验揭示了有趣发现:尽管LLMs能准确复述转换流程,但最先进的开源LLMs仍无法持续生成正确对偶形式。这一现象即便在最小的双变量实例和衍生任务(如正确性验证、错误分类)中依然存在。本文还探讨了该发现对教育者、学生及大型推理系统开发的启示。
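For readers outside operations research, the conversion being tested is mechanical for LPs in standard form: the primal max c^T x s.t. Ax <= b, x >= 0 has the dual min b^T y s.t. A^T y >= c, y >= 0. The snippet below builds both for a textbook instance and confirms with scipy that the optimal values coincide, as strong duality requires; it is a generic example, not part of DualSchool.

```python
import numpy as np
from scipy.optimize import linprog

# Primal: max 3x1 + 5x2  s.t.  x1 <= 4,  2x2 <= 12,  3x1 + 2x2 <= 18,  x >= 0
c = np.array([3.0, 5.0])
A = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])

# Dual: min b^T y  s.t.  A^T y >= c,  y >= 0  (rewritten as -A^T y <= -c for linprog)
primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")
dual   = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 3, method="highs")

print("primal optimum:", round(-primal.fun, 4))   # 36.0
print("dual optimum:  ", round(dual.fun, 4))      # 36.0, matching by strong duality
```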
MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning
Abstract
arXiv:2505.21771v1 Announce Type: cross Abstract: Multimodal tables, which integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While Large Language Models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table reasoning remains unexplored. To bridge this gap, we introduce MMTBENCH (Multimodal Table Benchmark), a benchmark consisting of 500 real-world multimodal tables drawn from diverse sources, with a total of 4021 question-answer pairs. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types (Single/Multiple Entity, Maps and Charts with Entities, Single/Multiple Charts, Maps, and Visualizations). Extensive evaluation of state-of-the-art models on all types reveals substantial performance gaps, particularly on questions requiring visual-based reasoning and multi-step inference. These findings show the urgent need for improved architectures that more tightly integrate vision and language processing. By providing a challenging, high-quality resource that mirrors the complexity of real-world tasks, MMTBENCH underscores its value as a resource for future research on multimodal tables.
摘要
融合半结构化数据与图表、地图等视觉元素的多模态表格在现实领域无处不在,却对当前视觉语言模型(VLMs)构成严峻挑战。尽管大语言模型(LLMs)和VLMs在文本与图像理解方面展现出强大能力,但其在复杂现实场景下的多模态表格推理性能仍未得到探索。为填补这一空白,我们提出MMTBENCH(多模态表格基准),该基准包含500个源自多样现实场景的真实多模态表格,共计4021组问答对。MMTBENCH的问题涵盖四种问题类型(显式、隐式、答案提及和视觉基础)、五种推理类型(数学计算、极值识别、事实验证、视觉基础和其它)以及八种表格类型(单/多实体、含实体地图与图表、单/多图表、地图及可视化)。通过对前沿模型的全类型评估,我们发现其存在显著性能差距,尤其在需要视觉推理和多步推断的问题上。这些发现表明,亟需开发能更紧密融合视觉与语言处理的新型架构。MMTBENCH通过提供反映现实任务复杂性的高质量挑战性资源,凸显了其作为未来多模态表格研究基础资源的价值。
Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking
Abstract
arXiv:2505.21815v1 Announce Type: cross Abstract: Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.
摘要
科学论文检索对于支持文献发现与研究至关重要。尽管密集检索方法在通用任务中表现出有效性,但它们往往无法捕捉对准确理解科学查询至关重要的细粒度科学概念。近期研究也尝试利用大语言模型(LLM)进行查询理解,但这些方法通常缺乏对语料库特定知识的支撑,可能生成不可靠或不忠实的内容。为克服这些局限,我们提出SemRank框架,该框架将LLM引导的查询理解与基于概念的语义索引相结合。每篇论文通过多粒度科学概念(包括通用研究主题和详细关键短语)进行索引。查询时,大语言模型会识别源自语料库的核心概念,以显式捕获查询的信息需求。这些识别出的概念可实现精确的语义匹配,显著提升检索准确性。实验表明,SemRank能持续提升各类基础检索器的性能,超越现有基于LLM的强基线,同时保持高效性。
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
Abstract
arXiv:2505.21825v1 Announce Type: cross Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.
摘要
推理时计算已成为提升大语言模型推理能力的重要扩展方向。然而,尽管该技术能带来显著性能提升,学界对推理时计算的最优分配机制仍缺乏深入理解。核心问题在于应优先采用序列化扩展(如更长的思维链)还是并行化扩展(如跨多个短思维链的多数投票)。本研究通过证明在某些推理场景中序列化扩展相对并行化扩展具有指数级优势,旨在揭示测试时扩展的适用边界。这些场景基于具有挑战性的图分布中的连通性问题。我们通过涵盖多类语言模型的系统性实验验证理论发现,包括采用不同思维链策略从头训练的图连通性专用模型,以及通用大型推理模型。
Extracting Research Instruments from Educational Literature Using LLMs
Abstract
arXiv:2505.21855v1 Announce Type: cross Abstract: Large Language Models (LLMs) are transforming information extraction from academic literature, offering new possibilities for knowledge management. This study presents an LLM-based system designed to extract detailed information about research instruments used in the education field, including their names, types, target respondents, measured constructs, and outcomes. Using multi-step prompting and a domain-specific data schema, it generates structured outputs optimized for educational research. Our evaluation shows that this system significantly outperforms other approaches, particularly in identifying instrument names and detailed information. This demonstrates the potential of LLM-powered information extraction in educational contexts, offering a systematic way to organize research instrument information. The ability to aggregate such information at scale enhances accessibility for researchers and education leaders, facilitating informed decision-making in educational research and policy.
摘要
大语言模型(LLMs)正在改变学术文献的信息提取方式,为知识管理提供了新的可能性。本研究提出了一种基于LLM的系统,旨在从教育领域提取研究工具的详细信息,包括其名称、类型、目标受访者、测量构念及结果。该系统采用多步提示和特定领域数据模式,生成针对教育研究优化的结构化输出。评估结果表明,该系统在识别工具名称及详细信息方面显著优于其他方法,这证明了LLM驱动的信息提取在教育场景中的潜力,为系统化组织研究工具信息提供了途径。大规模聚合此类信息的能力增强了研究人员和教育领导者对数据的可获取性,有助于推动教育研究与政策制定的科学决策。
Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task
Abstract
arXiv:2505.21850v1 Announce Type: cross Abstract: Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of the reasoning process. Past studies found that MLLMs struggle with these benchmarks, but they do not explain how the models fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark, based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy focus only on the final outcomes and do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative closed-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages.
摘要
当前的多模态大语言模型(MLLMs)在通用视觉推理任务中表现优异,但在抽象视觉推理(AVR)领域的研究仍显不足。AVR需要超越简单感知的高阶推理能力以识别抽象规则。现有AVR基准测试主要关注单步推理,强调最终结果而忽视了推理过程的多阶段性。既往研究发现MLLMs在这些基准测试中表现欠佳,但未能揭示其失败机制。为填补这一空白,我们基于RAVEN框架开发了MultiStAR——一个多阶段AVR评估基准,旨在测试模型在不同复杂度层级上的推理能力。此外,现有评估指标(如准确率)仅关注最终结果,未能考量中间步骤的正确性。为此,我们提出新型评估指标MSEval,该指标同时兼顾中间步骤与最终结果的正确性。我们使用17个具有代表性的闭源和开源MLLMs在MultiStAR上开展了全面实验。结果表明:现有MLLMs在基础感知任务中表现尚可,但在更复杂的规则检测阶段仍面临显著挑战。
Evaluating the Retrieval Robustness of Large Language Models
Abstract
arXiv:2505.21870v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
摘要
检索增强生成(RAG)通常能提升大语言模型(LLM)解决知识密集型任务的能力。但由于检索不完善及模型利用检索内容的能力有限,RAG也可能导致性能下降。本研究评估了LLM在实际RAG设置中的鲁棒性(以下简称检索鲁棒性),聚焦三个研究问题:(1)RAG是否始终优于非RAG;(2)更多检索文档是否总能带来更好性能;(3)文档排序是否影响结果。为此,我们构建了包含1500个开放域问题的基准数据集,每个问题均配有从维基百科检索的文档,并针对每个研究问题提出三项鲁棒性指标。通过对11种LLM和3种提示策略的全面实验,我们发现所有LLM都表现出惊人的高检索鲁棒性;然而,不同程度的不完美鲁棒性仍阻碍它们充分获取RAG的优势。
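The abstract states the three research questions but not the metric formulas, so the sketch below shows one natural way to operationalize them on a per-question results table: robustness to adding retrieval, to adding more documents, and to reordering documents. These definitions and the toy numbers are this digest's assumptions, not necessarily the paper's metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions = 200

# Toy per-question outcomes (True = answered correctly).
no_rag  = rng.random(n_questions) < 0.60
rag_k5  = rng.random(n_questions) < 0.72
rag_k10 = rng.random(n_questions) < 0.70
# Correctness under 4 random orderings of the same 5 documents.
orderings = rng.random((n_questions, 4)) < 0.72

# (1) RAG vs non-RAG: how often adding retrieval does not make the model worse.
rag_robust = np.mean(~(no_rag & ~rag_k5))

# (2) More documents: how often going from 5 to 10 documents does not hurt.
k_robust = np.mean(~(rag_k5 & ~rag_k10))

# (3) Order sensitivity: how often all orderings agree on correct/incorrect.
order_robust = np.mean(orderings.all(axis=1) | (~orderings).all(axis=1))

print(f"robust to adding retrieval: {rag_robust:.2%}")
print(f"robust to more documents:   {k_robust:.2%}")
print(f"robust to document order:   {order_robust:.2%}")
```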
Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Abstract
arXiv:2505.21919v1 Announce Type: cross Abstract: The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
摘要
随着大语言模型(LLMs)长上下文窗口的广泛应用,高效的键值缓存(KVC)管理成为优化推理性能的关键。检索增强生成(RAG)和智能体等推理工作负载表现出较高的缓存复用性,这使得高效缓存对减少冗余和提升速度至关重要。我们基于公开可用的轨迹数据分析了真实场景中的KVC访问模式,并评估了Redis等商用键值存储系统及基于RDMA的前沿系统(CHIME[1]和Sherman[2])在KVC元数据管理中的表现。本研究揭示了当前缺乏针对KVC预填充的专用存储方案,强调需要为LLM工作负载设计具备优化元数据管理的高效分布式缓存系统,同时为构建可扩展、低延迟的KVC管理系统提供了改进思路。
Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development
Abstract
arXiv:2505.21898v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system -- Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" -- instructional transitions learned from historically successful trajectories -- which allow the system to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments on software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.
摘要
大语言模型(LLMs)与自主智能体的最新进展已在多个领域展现出卓越能力。然而,独立智能体在处理需要大量交互和计算资源的复杂任务时仍存在局限。尽管多智能体系统(MAS)通过任务分解、迭代通信和角色专业化等协作机制缓解了部分问题,但现有系统通常缺乏资源意识,因高令牌消耗和过长执行时间导致显著效率低下。为此,我们提出一种资源感知型多智能体系统——Co-Saving(意为多个智能体协同参与资源节约活动),该系统利用经验知识提升运行效率与解决方案质量。我们的核心创新是引入"捷径"机制——从历史成功轨迹中学习到的指令跳转——可绕过冗余推理智能体以加速集体问题解决过程。在软件开发任务的实验中,本方法展现出显著优势:相较于最先进的多智能体系统ChatDev,平均降低50.85%的令牌使用量,并将整体代码质量提升10.06%。
Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding
Abstract
arXiv:2505.21908v1 Announce Type: cross Abstract: Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.
摘要
诊断相关组(DRG)编码对医院报销和运营至关重要,但其人工分配过程耗时费力。大型语言模型(LLM)由于该任务的外分布(OOD)特性——预训练语料库极少包含私有临床或计费数据——在DRG编码任务中表现欠佳。我们提出DRG-Sapphire系统,该系统通过大规模强化学习(RL)实现临床记录自动编码。基于Qwen2.5-7B架构并采用基于规则的奖励函数进行群体相对策略优化(GRPO)训练,DRG-Sapphire引入了一系列RL增强技术以解决先前数学任务中未见的领域特定挑战。我们的模型在MIMIC-IV基准测试中达到最先进准确率,并能生成经医师验证的DRG分配逻辑,显著提升可解释性。本研究进一步揭示了将RL应用于知识密集型OOD任务的广泛挑战。我们观察到RL性能与监督微调(SFT)样本数量的对数近似线性相关,表明RL效果本质上受限于基础模型编码的领域知识。对于DRG编码这类OOD任务,要实现强RL性能需在RL阶段前完成充分的知识注入。因此,对此类任务而言,扩展SFT可能比单独扩展RL更具效果和计算效率。
MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing
Abstract
arXiv:2505.21966v1 Announce Type: cross Abstract: We introduce MapStory, an LLM-powered animation authoring tool that generates editable map animation sequences directly from natural language text. Given a user-written script, MapStory leverages an agentic architecture to automatically produce a scene breakdown, which decomposes the script into key animation building blocks such as camera movements, visual highlights, and animated elements. Our system includes a researcher component that accurately queries geospatial information by leveraging an LLM with web search, enabling the automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.
摘要
我们介绍MapStory——一个基于大语言模型的动画创作工具,能够直接从自然语言文本生成可编辑的地图动画序列。该系统通过智能代理架构,将用户编写的脚本自动分解为场景构成要素,包括摄像机运动、视觉高亮和动画元素等关键动画构建模块。我们的系统配备研究组件,通过结合大语言模型与网络搜索精确查询地理空间信息,可自动提取相关区域、路径和坐标,同时允许用户通过编辑和查询来调整结果或获取补充信息。用户还可通过交互式时间线编辑器微调这些模块的参数。系统设计基于对专业动画师的初步访谈及200个现有地图动画视频的分析。评估结果显示(包括5位专家访谈和12人可用性研究),MapStory能帮助用户轻松创建地图动画,加快迭代速度,激发创意探索,并降低制作地图叙事作品的门槛。
Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation
Abstract
arXiv:2505.21956v1 Announce Type: cross Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
摘要
文本到图像生成日益需要获取预训练模型无法完全掌握的领域特定、细粒度且快速更新的知识。现有的检索增强生成(RAG)方法试图通过检索全局相关图像来解决这一问题,但当复杂用户查询中的所需元素无法在单张图像中完整呈现时,这些方法便会失效。我们提出跨模态RAG框架,该框架将查询和图像分解为子维度组件,实现子查询感知的检索与生成。我们的方法引入了一种混合检索策略——结合子维度稀疏检索器与稠密检索器——以识别帕累托最优图像集合,其中每张图像贡献查询的互补方面。在生成过程中,通过引导多模态大语言模型选择性地以特定子查询对齐的相关视觉特征为条件,确保子查询感知的图像合成。在MS-COCO、Flickr30K、WikiArt、CUB和ImageNet-LT上的大量实验表明,跨模态RAG在检索和生成质量上均显著优于现有基线方法,同时保持高效性。
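The retrieval step hinges on decomposing a query into sub-queries and selecting images that jointly cover them. A greedy set-cover stand-in for that selection is sketched below; the paper's Pareto-optimal selection with hybrid sparse and dense scoring is richer than this.

```python
def select_images(subqueries: set[str], image_tags: dict[str, set[str]], k: int = 3):
    """Greedily pick images that add the most uncovered sub-query aspects."""
    uncovered, chosen = set(subqueries), []
    for _ in range(k):
        best = max(image_tags, key=lambda img: len(image_tags[img] & uncovered), default=None)
        if best is None or not (image_tags[best] & uncovered):
            break
        chosen.append(best)
        uncovered -= image_tags[best]
    return chosen, uncovered

# Query: "A corgi wearing a red scarf skateboarding past the Eiffel Tower"
subqueries = {"corgi", "red scarf", "skateboard", "Eiffel Tower"}
image_tags = {
    "img_a.jpg": {"corgi", "red scarf"},
    "img_b.jpg": {"skateboard", "street"},
    "img_c.jpg": {"Eiffel Tower", "night"},
    "img_d.jpg": {"corgi"},
}
chosen, missing = select_images(subqueries, image_tags)
print("selected:", chosen, "| uncovered aspects:", missing)
```

Each selected image contributes a complementary subset of the query's aspects, which is the property the generator then conditions on per sub-query.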
Learning Compositional Behaviors from Demonstration and Language
Abstract
arXiv:2505.21981v1 Announce Type: cross Abstract: We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.
摘要
我们提出"基于语言与演示的行为框架"(BLADE),一种通过整合模仿学习与模型规划实现长周期机器人操作的框架。BLADE利用语言标注的演示数据,从大语言模型(LLMs)中提取抽象动作知识,并构建结构化高层动作表示库。这些表示包含每个高层动作以视觉感知为基础的前提条件与效果,以及通过神经网络策略实现的对应控制器。BLADE能自动恢复此类结构化表示,无需人工标注状态或符号定义。该框架在应对新情境方面展现出显著能力,包括新初始状态、外部状态扰动及新目标等情况。我们通过在仿真环境和真实机器人上的实验验证了方法的有效性,测试场景涉及具有关节部件、部分可观测性及几何约束的多样化物体。
Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection
Abstract
arXiv:2505.22029v1 Announce Type: cross Abstract: Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS model have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys -- the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at https://github.com/Berkeley-Speech-Group/LLM-Dys.
摘要
言语不流畅检测对于临床诊断和语言评估至关重要,但现有方法受限于高质量标注数据的稀缺性。尽管近期文本转语音(TTS)模型的进展使得合成不流畅语音成为可能,但现有合成数据集存在韵律不自然和语境多样性不足的问题。为解决这些局限,我们提出LLM-Dys——迄今最全面的、基于大语言模型增强不流畅仿真的言语不流畅语料库。该数据集涵盖词级和音素级共11类不流畅现象。基于此资源,我们改进了一种端到端不流畅检测框架,实验验证表明其性能达到当前最优水平。所有数据、模型及代码均已开源,详见https://github.com/Berkeley-Speech-Group/LLM-Dys。
Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs
Abstract
arXiv:2505.21955v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, their narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.
Summary
Large vision-language models (LVLMs) are increasingly used in interactive applications such as virtual and augmented reality, where the first-person (egocentric) view captured by head-mounted cameras serves as a key input. While this view provides fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often cause failures on spatially or contextually demanding queries. To address this, we propose a framework that augments egocentric inputs with third-person (exocentric) views, supplying LVLMs with complementary information such as global scene layout and object visibility. We present E3VQA, the first benchmark for multi-view question answering, with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. We further propose M3CoT, a training-free prompting technique that builds a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains over a recent CoT baseline (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash). Extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and validates the value of combining egocentric and exocentric inputs.
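Since M3CoT is described as a training-free prompting technique that merges scene graphs from complementary views, a toy version of the prompt construction is sketched below. The graph triples, the textual format, and the wording are assumptions; the abstract mentions three complementary perspectives, while this sketch uses only the ego and exo graphs for brevity.

```python
# Toy construction of a multi-view prompt from scene-graph triples; format is assumed.
ego_graph = [("left_hand", "holding", "mug"), ("mug", "above", "table")]
exo_graph = [("person", "standing_near", "counter"), ("table", "left_of", "counter")]


def graph_to_text(name, triples):
    facts = "; ".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)
    return f"{name} view scene graph: {facts}."


prompt = "\n".join([
    graph_to_text("Egocentric", ego_graph),
    graph_to_text("Exocentric", exo_graph),
    "Question: Where is the mug relative to the counter?",
    "Reason step by step across both views before answering.",
])
print(prompt)  # this text would be sent to the LVLM together with the paired images
```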
LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents
Abstract
arXiv:2505.21963v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze how scaling data and model size affects the computational cost of exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
Summary
Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To better tailor them to specific domains or applications, post-training techniques such as supervised fine-tuning (SFT), preference learning, and model merging are commonly employed. Although each of these methods has been studied in depth on its own, the automated construction of complete post-training pipelines remains underexplored. Existing approaches mostly rely on manual design or focus narrowly on optimizing individual components (such as data ordering or merging strategies). This work introduces LaMDAgent (Language Model Developing Agent), a framework that autonomously constructs and optimizes full post-training pipelines through LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, using task-based feedback to discover high-performing pipelines with minimal human intervention. Experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capability, and that it uncovers effective post-training strategies often overlooked by conventional human-driven exploration. We further analyze how scaling data and model size affects the computational cost of exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
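A minimal sketch of the propose-evaluate loop that such an agent might run is given below. The search space, the components, and the evaluation function are placeholders chosen for illustration; in the described framework an LLM-based agent, not random sampling, decides the next candidate from the accumulated feedback.

```python
# Hedged sketch of a pipeline-search loop; all components and functions are placeholders.
import random

SEARCH_SPACE = {
    "sft_dataset": ["tool_calls_v1", "instructions_v2"],
    "preference":  [None, "dpo"],
    "merge":       [None, "linear", "slerp"],
    "lr":          [1e-5, 5e-6],
}


def evaluate(pipeline: dict) -> float:
    """Placeholder for actually running the pipeline and measuring task accuracy."""
    return random.random()


history = []
for _ in range(5):
    # An LLM agent would read `history` and propose the next pipeline; random sampling
    # keeps this sketch self-contained.
    candidate = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    history.append((candidate, evaluate(candidate)))

best_pipeline, best_score = max(history, key=lambda x: x[1])
print("best pipeline found:", best_pipeline, best_score)
```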
Judging LLMs on a Simplex
Abstract
arXiv:2505.21972v1 Announce Type: cross Abstract: Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable using LLM judges. Our theoretical analysis uncovers a "phase transition" in ranking identifiability: for binary scoring systems, true rankings are identifiable even with weak judges under mild assumptions, while rankings become non-identifiable for three or more scoring levels even with infinite data, absent additional prior knowledge. This non-identifiability highlights how uncertainty in rankings stems from not only aleatoric uncertainty (i.e., inherent stochasticity in the data) but also epistemic uncertainty regarding which assumptions hold, an aspect that has received limited attention until now. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluations across multiple benchmarks demonstrate that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the importance of taking a more holistic approach to uncertainty quantification when using LLMs as judges.
Summary
Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework representing both judges and candidate answers as points on a probability simplex can reveal what is and is not identifiable when using LLM judges. The theoretical analysis uncovers a "phase transition" in ranking identifiability: under binary scoring, true rankings are identifiable under mild assumptions even with weak judges, whereas with three or more scoring levels rankings become non-identifiable even with infinite data, absent additional prior knowledge. This non-identifiability shows that ranking uncertainty stems not only from aleatoric uncertainty (the inherent stochasticity in the data) but also from epistemic uncertainty about which assumptions hold, an aspect that has so far received limited attention. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluation across multiple benchmarks shows that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the need for a more holistic approach to uncertainty quantification when using LLMs as judges.
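The binary-scoring regime that the paper identifies as identifiable lends itself to a small worked example of the Bayesian treatment: encode assumptions about judge verdicts as a Beta prior, sample the posterior win rate of each candidate, and report a ranking probability with credible intervals. The counts, the prior, and the independence assumption are illustrative, not the paper's simplex model.

```python
# Toy Bayesian ranking from binary judge verdicts; counts and priors are illustrative.
import numpy as np

rng = np.random.default_rng(0)


def posterior_win_rate(wins: int, trials: int, prior=(2, 2), n_samples=10_000):
    """Beta posterior over a candidate's win rate under binary judge verdicts."""
    a, b = prior
    return rng.beta(a + wins, b + trials - wins, size=n_samples)


# Suppose the judge prefers candidate A in 70/100 pairwise comparisons and B in 55/100.
samples_a = posterior_win_rate(70, 100)
samples_b = posterior_win_rate(55, 100)

print(f"P(A ranks above B) ~ {(samples_a > samples_b).mean():.2f}")
print("95% credible interval for A:", np.percentile(samples_a, [2.5, 97.5]).round(3))
```

Changing the prior and re-running is one simple way to probe how sensitive the ranking estimate is to the encoded assumptions, which is the kind of sensitivity analysis the abstract describes.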
Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance
Abstract
arXiv:2505.22003v1 Announce Type: cross Abstract: Access to legal assistance in India faces a critical gap: many citizens struggle to exercise their legal rights due to limited awareness of and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including the Indian Constitution, the Bharatiya Nyaya Sanhita (BNS), and the Bharatiya Nagarik Suraksha Sanhita (BNSS), among others, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal question answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE and outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. These results showcase the model's applicability in real-world legal scenarios; future iterations aim to enhance performance and expand the dataset to cover a broader range of multilingual and case-specific queries.
Summary
Access to legal assistance in India faces a critical gap: weak legal awareness and difficulty obtaining relevant legal information leave many citizens unable to exercise their legal rights effectively. This paper presents Legal Assist AI, a transformer-based model designed to bridge this gap by providing effective legal support through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, serving diverse users including legal professionals, scholars, and the general public. The model was fine-tuned on large-scale datasets from the Indian legal domain, including the Indian Constitution, the Bharatiya Nyaya Sanhita (BNS), and the Bharatiya Nagarik Suraksha Sanhita (BNSS), giving it a deep understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the model demonstrates notable efficiency and specialization in legal question answering. In comparative evaluation against models such as GPT-3.5 Turbo and Mistral 7B, it achieved a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucination, making it highly reliable for practical legal applications. The study demonstrates the model's applicability in real-world legal scenarios, and future versions aim to improve performance and extend the dataset to cover multilingual and case-specific queries.
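The retrieve-then-answer flow described above can be sketched in a few lines; the corpus snippets, the TF-IDF retriever, and the prompt wording below are stand-ins for illustration, not the authors' curated database or model.

```python
# Hypothetical retrieve-then-answer sketch; retriever and corpus are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Article 21: No person shall be deprived of his life or personal liberty "
    "except according to procedure established by law.",
    "BNS provisions on theft define the offence and its punishment.",
]
query = "Which provision protects personal liberty?"

vec = TfidfVectorizer().fit(corpus + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
context = corpus[int(scores.argmax())]          # most relevant passage from the database

prompt = f"Context: {context}\nQuestion: {query}\nAnswer concisely, citing the provision."
print(prompt)  # `prompt` would then be passed to the fine-tuned transformer for generation
```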
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Abstract
arXiv:2505.22038v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average.
Summary
Large vision-language models (LVLMs) encode images into thousands of tokens and have shown impressive performance across multi-modal tasks. However, the large number of image tokens incurs significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Prior approaches attempt to reduce the number of image tokens through token pruning, typically selecting tokens based on attention scores or image-token diversity. Through empirical studies, we find that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, the method uses a small calibration set to divide the pruning process into multiple stages: early stages emphasize the impact of pruning on subsequent layers, while deeper stages focus on preserving the consistency of local outputs. Extensive experiments across diverse LVLMs demonstrate broad effectiveness on multiple benchmarks: the method achieves a 78% compression rate while preserving, on average, 96.7% of the original models' performance.
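One way to picture a staged criterion that trades off global against local impact is the sketch below, where an importance score interpolates between a "global" proxy and a "local" proxy as the stage index grows. The scoring proxies and the linear schedule are assumptions made for illustration; they are not the paper's equations.

```python
# Hedged sketch of a staged local/global pruning criterion; proxies and schedule are assumed.
import torch


def prune_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float,
                 stage: int, num_stages: int) -> torch.Tensor:
    """tokens: (N, d) vision tokens; attn: (N,) attention received from query/text tokens."""
    local_score = attn                                  # proxy for current-layer importance
    global_score = tokens.norm(dim=-1)                  # proxy for influence on later layers
    # Early stages weight the global term; deeper stages shift weight to the local term.
    alpha = stage / max(num_stages - 1, 1)
    score = alpha * local_score + (1 - alpha) * global_score
    keep = score.topk(int(keep_ratio * tokens.shape[0])).indices
    return tokens[keep]


tokens = torch.randn(1000, 64)
attn = torch.rand(1000)
pruned = prune_tokens(tokens, attn, keep_ratio=0.22, stage=0, num_stages=3)
print(pruned.shape)  # (220, 64): roughly the 78% reduction in image tokens reported above
```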
From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving
Abstract
arXiv:2505.22067v1 Announce Type: cross Abstract: Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.
Summary
Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also effective repair of failure cases, particularly challenging, safety-critical ones. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their contribution to performance improvement. This paper proposes SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. SERA analyzes performance logs to identify failure patterns and dynamically retrieves semantically aligned scenarios from a structured scenario bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Benchmark experiments show that SERA consistently improves key metrics across multiple autonomous driving baselines, validating its effectiveness and generalizability under safety-critical conditions.
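The repair loop can be pictured as: mine failure patterns from evaluation logs, retrieve semantically aligned scenarios from the bank, and queue the selection for few-shot fine-tuning. The sketch below uses keyword overlap as a stand-in for the LLM-based retrieval and reflection steps; the log format and scenario bank are invented for illustration.

```python
# Illustrative failure-driven scenario selection; similarity is a keyword-overlap stand-in.
failure_logs = [
    {"scenario": "unprotected_left_turn_rain", "metric": "collision", "value": 1},
    {"scenario": "cut_in_highway", "metric": "collision", "value": 0},
]
scenario_bank = [
    "left turn at unsignalized intersection in heavy rain",
    "pedestrian crossing at night",
    "aggressive cut-in on highway",
]


def keywords(text: str) -> set:
    return set(text.replace("_", " ").lower().split())


failures = [log for log in failure_logs if log["value"] > 0]
selected = []
for log in failures:
    fail_kw = keywords(log["scenario"])
    ranked = sorted(scenario_bank, key=lambda s: len(keywords(s) & fail_kw), reverse=True)
    selected.append(ranked[0])      # an LLM reflection step would further filter for diversity

print("scenarios queued for few-shot fine-tuning:", selected)
```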
Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced (R)GRPO
Abstract
arXiv:2505.22068v1 Announce Type: cross Abstract: Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving reasoning capacity on math tasks, while supervised fine-tuning (SFT) with distillation can. We study this from the perspective of Scientific Information Extraction (SciIE), where both LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose a two-stage training scheme: 1. MimicSFT, which uses structured reasoning templates without requiring high-quality chain-of-thought data, and 2. RGRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve reasoning capacity. RGRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.
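To make the "relevance and rule-induced rewards" idea concrete, a plausible reward for extracted relation triples might combine a rule-based format check with a relevance term measuring coverage of gold entities, as sketched below. The triple format, the weighting, and both scoring functions are assumptions made for illustration; the paper's actual reward design may differ.

```python
# Hedged sketch of a combined rule-induced + relevance reward for relation extraction.
def rule_reward(prediction: str) -> float:
    """1.0 if the output parses as semicolon-separated (head, relation, tail) triples."""
    try:
        triples = [tuple(t.strip(" ()").split(",")) for t in prediction.split(";") if t.strip()]
        return float(bool(triples) and all(len(t) == 3 for t in triples))
    except Exception:
        return 0.0


def relevance_reward(prediction: str, gold_entities: set) -> float:
    """Fraction of gold entities mentioned anywhere in the prediction."""
    if not gold_entities:
        return 0.0
    return sum(e.lower() in prediction.lower() for e in gold_entities) / len(gold_entities)


def reward(prediction: str, gold_entities: set, w_rule=0.5, w_rel=0.5) -> float:
    return w_rule * rule_reward(prediction) + w_rel * relevance_reward(prediction, gold_entities)


pred = "(BERT, evaluated-on, GLUE); (BERT, compared-with, GPT)"
print(reward(pred, {"BERT", "GLUE"}))  # scalar reward that a GRPO-style update would normalize per group
```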