2025-05-09-12-03
Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models
Abstract
arXiv:2505.04914v1 Announce Type: new Abstract: Transformer-decoder language models are a core innovation in text-based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning, we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.
摘要
基于Transformer解码器的语言模型是文本生成人工智能的核心创新技术。这些模型正作为通用智能系统被部署于众多应用场景。其功能的核心在于理解自然语言指令的能力,以及利用人类文本语料库中蕴含的推理机制,将某种形式的推理过程应用于各类新颖任务。为理解这种推理生成方法的局限性,我们认为需要考察这些系统的架构约束。通过分析Transformer解码器模型的潜在变量结构,我们得以设计出能够探测其推理能力边界的测试任务。本文提出Enigme——一个开源的文本谜题生成库,用于训练和评估Transformer解码器模型及未来AI架构的推理能力。
Position: Epistemic Artificial Intelligence is Essential for Machine Learning Models to Know When They Do Not Know
Abstract
arXiv:2505.04950v1 Announce Type: new Abstract: Despite the impressive achievements of AI, including advancements in generative models and large language models, there remains a significant gap in the ability of AI to handle uncertainty and generalize beyond the training data. We argue that AI models, especially in autonomous systems, fail to make robust predictions when faced with unfamiliar or adversarial data, as evidenced by incidents with autonomous vehicles. Traditional machine learning approaches struggle to address these issues due to an overemphasis on data fitting and domain adaptation. This position paper posits a paradigm shift towards epistemic artificial intelligence, emphasizing the need for models to learn not only from what they know but also from their ignorance. This approach, which focuses on recognizing and managing uncertainty, offers a potential solution to improve the resilience and robustness of AI systems, ensuring that they can better handle unpredictable real-world environments.
摘要
尽管人工智能已取得令人瞩目的成就,包括生成模型和大语言模型的进步,但其在处理不确定性和训练数据外泛化能力方面仍存在显著不足。我们认为,人工智能模型(尤其是自主系统中的模型)在面对陌生或对抗性数据时无法做出稳健预测,自动驾驶汽车的相关事故便佐证了这一点。传统机器学习方法因过度强调数据拟合和领域适应而难以解决这些问题。本立场论文提出向认知人工智能的范式转变,强调模型不仅需要从已知知识中学习,更需从未知中学习。这种以识别和管理不确定性为核心的方法,为提升人工智能系统的韧性和鲁棒性提供了潜在解决方案,从而确保其能更好地应对不可预测的现实环境。
Towards Artificial Intelligence Research Assistant for Expert-Involved Learning
Abstract
arXiv:2505.04638v1 Announce Type: new Abstract: Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.
摘要
大语言模型(LLMs)与大模态模型(LMMs)已成为科学研究的变革性工具,但其在生物医学应用中的可靠性和具体贡献仍缺乏充分表征。本研究提出ARIEL(专家参与学习的人工智能研究助手),这是一个多模态数据集,旨在评估并增强LLMs与LMMs在生物医学研究中的两项关键能力:总结长篇科学文本和解析复杂生物医学图表。为支持严谨评估,我们创建了两套开源数据集,包含生物医学文献与图表及其配套问题。我们系统性地对开源与闭源基础模型进行基准测试,并引入博士级专家主导的人工评估。此外,通过针对性提示工程与微调策略提升研究论文摘要任务的模型性能,并应用测试时计算扩展增强LMMs的推理能力,其准确率已超越人类专家修正结果。我们还探索了利用LMM智能体从多模态输入生成科学假设的潜力。总体而言,研究结果明确了当前基础模型的优势,同时揭示了显著局限,为生物医学研究中大规模语言与多模态模型的部署提供了可行见解与发展方向。
Large Language Models are Autonomous Cyber Defenders
Abstract
arXiv:2505.04843v1 Announce Type: new Abstract: Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them in multi-agent scenarios or in interaction with other ACD agents. In this paper, we present the first study of how LLMs perform in multi-agent ACD environments by proposing a new integration with the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.
摘要
快速有效的应急响应对于防范恶意网络攻击至关重要。自主网络防御(ACD)旨在通过规划与执行行动的人工智能(AI)代理实现响应自动化。现有ACD方法多聚焦于单代理场景并采用强化学习(RL),但RL训练的ACD代理存在训练成本高昂、决策过程缺乏可解释性及可迁移性等局限。大型语言模型(LLMs)能通过提供通用安全场景下的可解释行动来应对这些问题。尽管已有研究探索LLM代理在ACD中的应用,但尚未评估其在多代理场景或与其他ACD代理交互时的表现。本文通过提出CybORG CAGE 4环境的新集成方案,首次研究了LLM在多代理ACD环境中的性能表现。我们设计新型通信协议,考察LLM与RL代理组成的ACD团队如何协作。实验结果揭示了LLM与RL的优势与不足,为未来ACD代理团队的创建、训练和部署指明了研究方向。
Exploring Influence Factors on LLM Suitability for No-Code Development of End User IoT Applications
Abstract
arXiv:2505.04710v1 Announce Type: new Abstract: With the increasing popularity of IoT applications, end users demand more personalized and intuitive functionality. A major obstacle, however, is that custom IoT functionality today still requires at least some coding skills. To address this, no-code development platforms have been proposed as a solution for empowering non-technical users to create applications. However, such platforms still require a certain level of technical expertise for structuring process steps or defining event-action relations. The advent of LLMs can further enhance no-code platforms by enabling natural language-based interaction, automation of complex tasks, and dynamic code generation. By allowing users to describe their requirements in natural language, LLMs can significantly streamline no-code development. As LLMs vary in performance, architecture, training data, and the use cases they target, it is still unclear which models are best suited and which factors determine this fit. In particular, no-code development of IoT applications by non-technical users places completely different demands on LLMs than, e.g., code generation for more open-ended applications or for supporting professional developers. In this paper, we explore the factors influencing the suitability of LLMs for no-code development of IoT applications. We also examine the role of input prompt language in the accuracy and quality of generated applications, as well as the influence of LLM training data. By conducting comprehensive experiments with a range of LLMs, we provide valuable insights for optimizing LLM-powered no-code platforms, guiding the selection of suitable LLMs and their effective application. Our findings contribute to improving the accessibility, efficiency, and user experience of no-code IoT development, ultimately enabling broader adoption of IoT technologies among non-expert users.
摘要
随着物联网应用的日益普及,终端用户对个性化和直观功能的需求不断增长。然而当前定制化物联网功能仍需至少具备一定编程能力,这成为主要障碍。为解决该问题,无代码开发平台被提出作为赋能非技术用户创建应用的解决方案。但此类平台在构建流程步骤或定义事件-动作关系时仍需要一定技术专长。大型语言模型(LLM)的出现通过实现基于自然语言的交互、复杂任务自动化及动态代码生成,可进一步提升无代码平台能力。当用户能够以自然语言描述需求时,LLM可显著简化无代码开发流程。由于LLM在性能、架构、训练数据及应用场景方面存在差异,目前尚不清楚哪些模型最适合以及决定适配性的影响因素。特别是非技术用户进行物联网应用的无代码开发对LLM的要求,与开放式应用的代码生成或专业开发者辅助等场景存在本质区别。本文探究了影响LLM适用于物联网无代码开发的关键因素,研究了输入提示语言对生成应用准确性和质量的作用,以及LLM训练数据的影响。通过针对多种LLM开展综合实验,我们为优化基于LLM的无代码平台提供了重要见解,指导合适LLM的选择及其有效应用。本研究有助于提升无代码物联网开发的易用性、效率和用户体验,最终促进非专业用户更广泛地采用物联网技术。
Text2Cypher: Data Pruning using Hard Example Selection
Abstract
arXiv:2505.05122v1 Announce Type: new Abstract: Database query languages such as SQL for relational databases and Cypher for graph databases have been widely adopted. Recent advancements in large language models (LLMs) enable natural language interactions with databases through models like Text2SQL and Text2Cypher. Fine-tuning these models typically requires large, diverse datasets containing non-trivial examples. However, as dataset size increases, the cost of fine-tuning also rises. This makes smaller, high-quality datasets essential for reducing costs for the same or better performance. In this paper, we propose five hard-example selection techniques for pruning the Text2Cypher dataset, aiming to preserve or improve performance while reducing resource usage. Our results show that these hard-example selection approaches can halve training time and costs with minimal impact on performance, and demonstrate that hard-example selection provides a cost-effective solution.
摘要
关系型数据库的SQL和图数据库的Cypher等查询语言已被广泛采用。大型语言模型(LLMs)的最新进展使得通过Text2SQL和Text2Cypher等模型实现与数据库的自然语言交互成为可能。微调这些模型通常需要包含非平凡示例的大规模多样化数据集。然而,随着数据集规模增大,微调成本也随之上升。这使得在保持或提升性能的同时,小型高质量数据集对于降低成本至关重要。本文提出五种困难样本选择技术用于修剪Text2Cypher数据集,旨在减少资源使用的同时保持或提升性能。实验结果表明,这些困难样本选择方法可将训练时间和成本减半且对性能影响极小,证明困难样本选择是一种高性价比的解决方案。
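The abstract does not name its five selection techniques, so the following is only a minimal sketch of the general idea of hard-example pruning. The difficulty proxy (counting Cypher clause keywords), the `keep_fraction` parameter, and the toy dataset are all assumptions made for illustration, not the paper's method.

```python
import re

# Hypothetical difficulty proxy: the number of clause keywords in the reference query.
CLAUSES = re.compile(r"\b(MATCH|WHERE|WITH|RETURN|ORDER BY|UNWIND|CALL)\b", re.IGNORECASE)


def difficulty(example: dict) -> int:
    """Score an example by how many clause keywords its Cypher query contains."""
    return len(CLAUSES.findall(example["cypher"]))


def select_hard_examples(dataset: list, keep_fraction: float = 0.5) -> list:
    """Keep the hardest `keep_fraction` of the dataset (at least one example)."""
    keep = max(1, int(len(dataset) * keep_fraction))
    return sorted(dataset, key=difficulty, reverse=True)[:keep]


# Toy dataset, invented for this sketch.
dataset = [
    {"question": "List all people", "cypher": "MATCH (p:Person) RETURN p"},
    {"question": "Who acted in The Matrix?",
     "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) "
               "WHERE m.title = 'The Matrix' RETURN p.name"},
    {"question": "Count movies", "cypher": "MATCH (m:Movie) RETURN count(m)"},
    {"question": "Top actors by movie count",
     "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) "
               "WITH p, count(m) AS n RETURN p.name, n ORDER BY n DESC"},
]

pruned = select_hard_examples(dataset)
print([example["question"] for example in pruned])
```

Halving the dataset this way mirrors the "halve training time and costs" claim only in spirit; a real pipeline would more likely score difficulty from model behaviour (for example, per-example loss) than from query syntax.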
The Promise and Limits of LLMs in Constructing Proofs and Hints for Logic Problems in Intelligent Tutoring Systems
Abstract
arXiv:2505.04736v1 Announce Type: new Abstract: Intelligent tutoring systems have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized student feedback. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved superior performance with 84.4% accuracy on stepwise proof construction and excelled particularly on simpler rules. We further used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic ITS and evaluated them on 4 criteria with both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and rated highly by human evaluators on consistency and clarity, but did not perform as well at explaining why the hint was provided or its larger context. Our results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but require additional modifications to ensure accuracy and pedagogical appropriateness.
摘要
智能辅导系统在教授形式命题逻辑证明方面已显示出有效性,但其基于模板的解释方式限制了提供个性化学生反馈的能力。虽然大型语言模型(LLMs)为动态反馈生成提供了有前景的能力,但它们存在产生幻觉或教学上不合理的解释的风险。我们评估了LLMs在构建多步符号逻辑证明中的逐步准确性,在358个命题逻辑问题上比较了四种最先进LLMs的六种提示技术。结果显示,DeepSeek-V3在逐步证明构建中以84.4%的准确率表现出色,尤其在简单规则上表现优异。我们进一步使用性能最佳的LLM为逻辑智能辅导系统中的1,050个独特学生问题解决状态生成解释性提示,并通过LLM评分器和人类专家对20%样本的4项标准进行评估。分析发现,LLM生成的提示准确率为75%,在一致性和清晰度方面获得人类评估者的高度评价,但在解释提示的提供原因及其更大背景方面表现不佳。我们的结果表明,LLMs可用于为辅导系统增强逻辑辅导提示,但需要进一步修改以确保准确性和教学适宜性。
Enhancing Text2Cypher with Schema Filtering
Abstract
arXiv:2505.05118v1 Announce Type: new Abstract: Knowledge graphs represent complex data using nodes, relationships, and properties. Cypher, a powerful query language for graph databases, enables efficient modeling and querying. Recent advancements in large language models allow translation of natural language questions into Cypher queries - Text2Cypher. A common approach is incorporating the database schema into prompts. However, complex schemas can introduce noise, increase hallucinations, and raise computational costs. Schema filtering addresses these challenges by including only relevant schema elements, improving query generation while reducing token costs. This work explores various schema filtering methods for the Text2Cypher task and analyzes their impact on token length, performance, and cost. Results show that schema filtering effectively optimizes Text2Cypher, especially for smaller models. Consistent with prior research, we find that larger models benefit less from schema filtering due to their longer context capabilities. However, schema filtering remains valuable for both larger and smaller models in cost reduction.
摘要
知识图谱通过节点、关系和属性来表征复杂数据。Cypher作为一种强大的图数据库查询语言,能够实现高效的数据建模与查询。随着大语言模型的发展,自然语言问题到Cypher查询的转换(Text2Cypher)成为可能。当前主流方法是将数据库模式整合至提示词中,但复杂模式可能引入噪声、加剧幻觉现象并增加计算成本。模式过滤技术通过仅保留相关模式元素来解决这些问题,在提升查询生成质量的同时降低标记开销。本研究系统探讨了Text2Cypher任务中不同模式过滤方法,并分析了其对标记长度、性能及成本的影响。实验结果表明,模式过滤能有效优化Text2Cypher任务,尤其对小型模型效果显著。与既有研究一致,我们发现大型模型因其长上下文处理能力从模式过滤中获益较少。但值得注意的是,模式过滤在降低各类模型使用成本方面仍具有重要价值。
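To make "including only relevant schema elements" concrete, here is a deliberately naive sketch that keeps a node label only when the label or one of its properties is mentioned in the question. The keyword-overlap rule, the `filter_schema` helper, and the toy schema are assumptions for illustration; the abstract does not say which filtering methods the paper evaluates.

```python
import re


def filter_schema(question: str, schema: dict) -> dict:
    """Keep node labels (and their properties) whose names appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    filtered = {}
    for label, properties in schema.items():
        label_hit = label.lower() in words
        kept_props = [p for p in properties if p.lower() in words]
        if label_hit or kept_props:
            # Keep the matched properties, or all of them if only the label matched.
            filtered[label] = kept_props or properties
    return filtered


# Toy schema, invented for this sketch: node label -> property names.
schema = {
    "Movie": ["title", "released"],
    "Person": ["name", "born"],
    "Genre": ["name"],
}

question = "Which movie has the title Inception?"
prompt_schema = filter_schema(question, schema)
print(prompt_schema)  # {'Movie': ['title']}
```

Exact word matching misses plurals and synonyms ("films", "movies"), so a practical filter would probably need stemming or embedding similarity; the point here is only that the prompt carries one label instead of three.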
ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints
Abstract
arXiv:2505.05232v1 Announce Type: new Abstract: The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.
摘要
化学文献的快速扩张对研究人员高效获取领域特定知识提出了重大挑战。为支持化学领域自然语言处理(NLP)的发展,我们推出ChemRxivQuest——一个从17个化学子学科的155篇ChemRxiv预印本中提取的970组高质量问答对(QA)的精选数据集。每个问答对均明确关联至源文本片段,确保可追溯性和上下文准确性。该数据集通过结合光学字符识别(OCR)、基于GPT-4o的问答生成及模糊匹配答案验证技术的自动化流程构建而成,重点关注概念性、机理性、应用性和实验性问题,可应用于检索式问答系统、搜索引擎开发及领域适配大语言模型的微调。我们分析了数据集的结构、覆盖范围和局限性,并规划了未来扩展与专家验证的方向。ChemRxivQuest为化学NLP研究、教育及工具开发提供了基础性资源。
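The fuzzy-matching verification step can be illustrated with Python's standard `difflib`. This is a sketch under stated assumptions: the sliding word window, the 0.8 threshold, and the example sentences are invented here, and the abstract does not specify which matching technique or library the pipeline uses.

```python
from difflib import SequenceMatcher


def best_window_ratio(answer: str, source: str) -> float:
    """Best similarity between the answer and any same-length word window of the source."""
    answer_words = answer.lower().split()
    source_words = source.lower().split()
    n = len(answer_words)
    if n == 0 or len(source_words) < n:
        return 0.0
    target = " ".join(answer_words)
    return max(
        SequenceMatcher(None, target, " ".join(source_words[i:i + n])).ratio()
        for i in range(len(source_words) - n + 1)
    )


def is_supported(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Accept an answer only if it approximately appears in its source segment."""
    return best_window_ratio(answer, source) >= threshold


source_segment = "The catalyst lowers the activation energy of the reaction."
print(is_supported("lowers the activation energy", source_segment))
print(is_supported("increases the boiling point", source_segment))
```

Tolerating small differences matters because OCR output often contains character-level noise that would defeat an exact substring check.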
MARK: Memory Augmented Refinement of Knowledge
Abstract
arXiv:2505.05177v1 Announce Type: new Abstract: Large Language Models (LLMs) assist in specialized tasks but struggle to align with evolving domain knowledge without costly fine-tuning. Domain knowledge consists of: Knowledge: Immutable facts (e.g., 'A stone is solid') and generally accepted principles (e.g., ethical standards); Refined Memory: Evolving insights shaped by business needs and real-world changes. However, a significant gap often exists between a domain expert's deep, nuanced understanding and the system's domain knowledge, which can hinder accurate information retrieval and application. Our Memory-Augmented Refinement of Knowledge (MARK) framework enables LLMs to continuously learn without retraining by leveraging structured refined memory, inspired by the Society of Mind. MARK operates through specialized agents, each serving a distinct role: Residual Refined Memory Agent: Stores and retrieves domain-specific insights to maintain context over time; User Question Refined Memory Agent: Captures user-provided facts, abbreviations, and terminology for better comprehension; LLM Response Refined Memory Agent: Extracts key elements from responses for refinement and personalization. These agents analyse stored refined memory, detect patterns, resolve contradictions, and improve response accuracy. Temporal factors like recency and frequency prioritize relevant information while discarding outdated insights. MARK enhances LLMs in multiple ways: Ground Truth Strategy: Reduces hallucinations by establishing a structured reference; Domain-Specific Adaptation: Essential for fields like healthcare, law, and manufacturing, where proprietary insights are absent from public datasets; Personalized AI Assistants: Improves virtual assistants by remembering user preferences, ensuring coherent responses over time.
摘要
大语言模型(LLMs)能够辅助专业任务,但在不进行昂贵微调的情况下难以适应不断演进的领域知识。领域知识包含两方面:知识:不可变事实(如"石头是固体")和普遍接受的原则(如伦理标准);精炼记忆:由业务需求和现实变化塑造的演进见解。然而,领域专家的深刻、细致理解与系统领域知识之间常存在显著差距,这可能阻碍准确的信息检索和应用。受"心智社会"启发,我们提出的知识精炼记忆增强框架(MARK)使LLMs无需重新训练即可持续学习。MARK通过专业代理运作,每个代理承担特定职能:残余精炼记忆代理:存储并检索领域特定见解以维持长期上下文;用户问题精炼记忆代理:捕获用户提供的事实、缩写和术语以提升理解;LLM响应精炼记忆代理:从响应中提取关键要素进行精炼和个性化。这些代理分析存储的精炼记忆,检测模式,解决矛盾并提高响应准确性。通过时效性和频率等时间因素对相关信息进行优先级排序,同时淘汰过时见解。MARK从多维度增强LLMs:基准事实策略:通过建立结构化参照减少幻觉;领域特定适配:对医疗、法律和制造等缺乏公开数据专有见解的领域至关重要;个性化AI助手:通过记忆用户偏好改进虚拟助手,确保长期响应连贯性。
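The statement that recency and frequency prioritize relevant information while discarding outdated insights can be sketched as a simple scoring rule. Everything below is an assumption for illustration: the `MemoryItem` fields, the log-frequency weighting, the 30-day half-life, and the cutoff are not taken from MARK.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    hits: int        # how often the insight has been retrieved or confirmed
    age_days: float  # days since it was last confirmed


def relevance(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Log-scaled frequency multiplied by an exponential recency decay."""
    decay = 0.5 ** (item.age_days / half_life_days)
    return math.log1p(item.hits) * decay


def prune_and_rank(items: list, min_score: float = 0.1) -> list:
    """Drop items scoring below `min_score`, then order the rest by score."""
    scored = [(relevance(item), item) for item in items]
    kept = [(score, item) for score, item in scored if score >= min_score]
    return [item for score, item in sorted(kept, key=lambda p: p[0], reverse=True)]


# Toy memory entries, invented for this sketch.
memory = [
    MemoryItem("'PO' means purchase order in this team", hits=10, age_days=1),
    MemoryItem("Old pricing tier names", hits=50, age_days=365),
    MemoryItem("User prefers metric units", hits=2, age_days=10),
]

ranked = prune_and_rank(memory)
print([item.text for item in ranked])
```

In this toy run the frequently used but year-old entry is discarded, while the two recent entries survive in order of score.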
Multi-agent Embodied AI: Advances and Future Directions
Abstract
arXiv:2505.05108v1 Announce Type: new Abstract: Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real-world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single-agent systems that often assume static, closed environments, whereas real-world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real-time learning, and collaborative problem-solving. Despite increasing interest in multi-agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi-agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi-agent embodied AI to address the challenges presented by real-world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.
摘要
具身人工智能(Embodied AI)在智能时代先进技术应用中发挥着关键作用,其通过将AI系统与物理载体结合,使系统能够感知、推理并与环境交互。这些系统利用传感器获取输入,通过执行器采取行动,并基于现实世界反馈进行学习与适应,从而在动态不可预测的环境中高效执行任务。随着深度学习(DL)、强化学习(RL)和大语言模型(LLM)等技术的成熟,具身AI已成为学界与工业界的前沿领域,应用涵盖机器人、医疗、交通和制造业。然而,现有研究多集中于假设静态封闭环境的单智能体系统,而现实世界的具身AI需应对更复杂的场景。在此类场景中,智能体不仅需与环境交互,还需与其他智能体协作,这就要求其具备适应机制、实时学习及协同问题解决等高级能力。尽管多智能体系统研究日益受到关注,现有工作仍局限于简化模型,未能充分捕捉动态开放环境中多智能体具身AI的完整复杂性。此外,目前尚无系统性综述全面梳理该领域的进展。随着具身AI的快速发展,深入理解多智能体具身AI对应对实际应用挑战至关重要。为填补这一空白并推动领域发展,本文回顾了当前研究现状,分析了关键贡献,指出挑战与未来方向,旨在为该领域的创新与进步提供指导性见解。
CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models
Abstract
arXiv:2505.05130v1 Announce Type: new Abstract: Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.
摘要
大规模预训练视觉语言模型(VLMs),例如对比语言-图像预训练(CLIP),在各种图像分类任务中展现出卓越的零样本性能。在特定领域数据集上对这些模型进行微调,可进一步提升其在下游应用中的有效性。然而,云端环境中的微调引发了数据安全与隐私方面的重大隐忧。联邦学习(FL)通过允许模型在本地客户端上进行训练而无需集中敏感数据,提供了一种去中心化解决方案,但训练期间传输完整预训练模型的高通信与计算成本限制了其可扩展性。此外,本地客户端间的非独立同分布(non-IID)数据可能对模型收敛与性能产生负面影响。为解决这些挑战,我们提出CacheFL——一种新颖的联邦学习方法,以轻量级缓存模型微调替代传统的完整模型微调。该缓存模型通过生成式预训练模型生成的类别平衡数据集初始化,有效缓解非IID数据的影响。随后将该缓存模型分发至本地客户端进行微调,并将各客户端的更新参数在服务器端聚合后重新分发。借助更新的缓存模型,CLIP的分类性能仅需少量训练周期即可提升。通过将训练与通信限制在缓存模型内,CacheFL在确保数据隐私与安全的同时显著降低了资源需求。在ImageNet及另外10个数据集上的大量实验表明,CacheFL在分类准确率、资源效率与隐私保护方面均优于传统方法。
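The server-side step, in which updated cache parameters from each client are aggregated and redistributed, can be sketched as a sample-weighted average in the style of federated averaging. The abstract does not state the aggregation rule, so the weighting by local sample count, the flat parameter lists, and the `aggregate` helper are assumptions.

```python
def aggregate(client_updates: list) -> list:
    """Average client parameters, weighted by each client's number of local samples.

    client_updates: list of (num_samples, params), where params is a flat list of floats.
    """
    total = sum(num_samples for num_samples, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(num_samples * params[k] for num_samples, params in client_updates) / total
        for k in range(dim)
    ]


# Two hypothetical clients holding 100 and 300 local samples.
updates = [
    (100, [1.0, 2.0]),
    (300, [3.0, 6.0]),
]

global_cache = aggregate(updates)
print(global_cache)  # [2.5, 5.0]
```

Only the small cache model's parameters cross the network in this scheme, which is where the communication savings described in the abstract would come from.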
EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Abstract
arXiv:2505.05440v1 Announce Type: new Abstract: Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage. In case of failure, the Planning Agent retrieves screen history and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains high task success rates while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
摘要
基于云端、由(多模态)大语言模型((M)LLMs)驱动的移动智能体虽具备强大的推理能力,但存在高延迟和高成本问题。虽然经过微调的(M)SLMs可实现边缘部署,但通常会丧失通用能力且难以处理复杂任务。为此,我们提出EcoAgent——一种面向移动自动化的边缘-云端协同多智能体框架。该框架通过云端规划智能体与两个边缘智能体(执行智能体负责动作执行,观察智能体负责结果验证)形成闭环协作。观察智能体采用预理解模块将屏幕图像压缩为简洁文本,显著降低token消耗。当任务失败时,规划智能体通过反射模块检索屏幕历史并重新规划。在AndroidWorld上的实验表明,EcoAgent在保持高任务成功率的同时,大幅减少了大语言模型的token消耗,实现了高效实用的移动自动化。
HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow
Abstract
arXiv:2505.05286v1 Announce Type: new Abstract: Recent advances in the agentic use of large language models (LLMs) have significantly enhanced Text-to-SQL capabilities, enabling users without specialized database expertise to query data intuitively. However, deploying these agentic LLM-based Text-to-SQL systems in production poses substantial challenges due to their inherently multi-stage workflows, stringent latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variability, and resource heterogeneity, leading to suboptimal performance and frequent service-level objective (SLO) violations. In this paper, we introduce HEXGEN-TEXT2SQL, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach combining global workload-balanced task dispatching and local adaptive urgency-guided prioritization, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our extensive evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, HEXGEN-TEXT2SQL reduces latency deadlines by up to 1.67× (average: 1.41×) and improves system throughput by up to 1.75× (average: 1.65×) compared to vLLM under diverse, realistic workload conditions. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.
摘要
近年来,基于大语言模型(LLMs)智能体范式应用的重大进展显著提升了文本到SQL(Text-to-SQL)的能力,使得不具备专业数据库知识的用户能够直观地进行数据查询。然而,由于这类基于智能体LLM的文本到SQL系统本质上具有多阶段工作流程、严格的延迟约束以及企业环境中潜在的异构GPU基础设施,将其部署到生产环境面临巨大挑战。当前LLM服务框架缺乏有效机制来处理相互依赖的推理任务、动态延迟变化和资源异构性,导致性能欠佳和频繁违反服务级别目标(SLO)。本文提出HEXGEN-TEXT2SQL,这是一个专为在异构GPU集群上调度和执行基于智能体多阶段LLM的文本到SQL工作流而设计的新框架,可处理多租户端到端查询。HEXGEN-TEXT2SQL引入了一种分层调度方法,结合全局负载均衡的任务分发和局部自适应紧急度引导的优先级排序,该方法基于对智能体文本到SQL工作流的系统分析。此外,我们提出了一种基于轻量级模拟的关键调度超参数调优方法,进一步增强了系统的鲁棒性和适应性。在真实文本到SQL基准测试中的广泛评估表明,HEXGEN-TEXT2SQL显著优于最先进的LLM服务框架。具体而言,与vLLM相比,HEXGEN-TEXT2SQL在不同真实工作负载条件下将延迟截止时间缩短了最高1.67倍(平均1.41倍),并将系统吞吐量提高了最高1.75倍(平均1.65倍)。我们的代码可在https://github.com/Relaxed-System-Lab/Hexgen-Flow获取。
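As a rough illustration of local urgency-guided prioritization, the sketch below orders queued inference requests by slack, meaning the time left before the deadline minus the estimated remaining work. The slack formula, the `Request` class, and the stage names are assumptions for this example and are not taken from the HEXGEN-TEXT2SQL code.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    slack: float                     # smaller slack means more urgent
    name: str = field(compare=False)


def make_request(name: str, deadline: float, now: float, est_remaining: float) -> Request:
    """Slack = time until the deadline minus the estimated remaining processing time."""
    return Request(slack=(deadline - now) - est_remaining, name=name)


now = 0.0
queue = []
# Hypothetical requests: (name, deadline in seconds, estimated remaining seconds).
for name, deadline, est in [
    ("q1-schema-linking", 10.0, 2.0),
    ("q2-sql-generation", 6.0, 4.0),
    ("q3-self-correction", 8.0, 1.0),
]:
    heapq.heappush(queue, make_request(name, deadline, now, est))

# Pop requests in order of increasing slack.
order = [heapq.heappop(queue).name for _ in range(len(queue))]
print(order)
```

A least-slack-first queue serves the request closest to missing its deadline before the others; the global, workload-balanced dispatch across heterogeneous GPUs described in the abstract would sit a level above this per-instance queue.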
How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
Abstract
arXiv:2505.04628v1 Announce Type: cross Abstract: Expanding the application of large language models (LLMs) to societal life, rather than using them only as auxiliary assistants that communicate with one person at a time, requires LLMs to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, this capability has not yet been systematically measured by available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (HSII below), designed to assess LLMs' social capabilities in comprehensive social agent tasks and to benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs on a dataset of realistic social interaction scenarios, HSII-Dataset. The dataset is derived step by step from a news dataset. We perform an ablation study by clustering the dataset. Additionally, we investigate the impact of the chain of thought (COT) method on enhancing LLMs' social performance. Since COT costs more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COT on specific social tasks and to strike a better trade-off between measuring correctness and efficiency. The results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.
摘要
扩大大型语言模型(LLMs)在社会生活中的应用,而不仅限于作为与单一用户交互的辅助工具,需要LLMs具备在复杂社会情境中独立承担多用户、多轮次社交代理任务的能力。然而,当前尚缺乏系统性评估该能力的基准测试。为此,我们首先基于社会学原理提出了一个代理任务分级框架,同时设计了一个名为“How Social Is It”(简称HSII)的新型基准测试,用于全面评估LLMs在社交代理任务中的社会能力并对代表性模型进行基准测试。HSII包含四个阶段:格式解析、目标选择、目标切换对话和稳定对话,通过源自新闻数据集逐步构建的真实社交互动场景数据集HSII-Dataset,综合评估LLMs的沟通与任务完成能力。我们通过对数据集进行聚类分析开展了消融实验,并研究了思维链(COT)方法对提升LLMs社交表现的影响。鉴于COT会消耗更多计算资源,我们进一步提出了新的统计指标COT复杂度,用以量化特定LLMs在完成特定社交任务时使用COT的效率,从而在正确性与效率评估之间实现更好平衡。大量实验结果表明,我们的基准测试能有效评估LLMs的社交技能。
Conversational Process Model Redesign
Abstract
arXiv:2505.05453v1 Announce Type: new Abstract: With the recent success of large language models (LLMs), the idea of AI-augmented Business Process Management systems is becoming more feasible. One of their essential characteristics is the ability to be conversationally actionable, allowing humans to interact with the LLM effectively to perform crucial process life cycle tasks such as process model design and redesign. However, most current research focuses on single-prompt execution and evaluation of results, rather than on continuous interaction between the user and the LLM. In this work, we aim to explore the feasibility of using LLMs to empower domain experts in the creation and redesign of process models in an iterative and effective way. The proposed conversational process model redesign (CPD) approach receives as input a process model and a redesign request by the user in natural language. Instead of just letting the LLM make changes, the LLM is employed to (a) identify process change patterns from literature, (b) re-phrase the change request to be aligned with an expected wording for the identified pattern (i.e., the meaning), and then to (c) apply the meaning of the change to the process model. This multi-step approach allows for explainable and reproducible changes. To ensure the feasibility of the CPD approach, and to find out how well the LLM can handle the patterns from literature, we performed an extensive evaluation. The results show that some patterns are hard for both LLMs and users to understand. Within the scope of the study, we demonstrated that users need support to describe the changes clearly. Overall, the evaluation shows that LLMs can handle most changes well according to a set of completeness and correctness criteria.
摘要
随着大型语言模型(LLM)近年来的成功应用,人工智能增强型业务流程管理系统的构想正变得愈发可行。其核心特征之一是具备对话式可操作性,允许人类通过与LLM的有效交互来执行关键流程生命周期任务,如流程模型设计与再设计。然而当前大多数研究仅关注单次提示的执行与结果评估,而非用户与LLM间的持续交互。本研究旨在探索如何利用LLM以迭代有效的方式赋能领域专家进行流程模型的创建与再设计。我们提出的对话式流程模型再设计(CPD)方法接收两个输入:流程模型和用户以自然语言表述的再设计请求。该方法并非直接让LLM实施修改,而是分三步:(a)从文献中识别流程变更模式;(b)将变更请求重新表述为与识别模式预期表述方式(即语义)相一致的表达;(c)将变更语义应用于流程模型。这种多步骤方法确保了变更的可解释性与可复现性。为验证CPD方法的可行性并评估LLM对文献中模式的处理能力,我们开展了全面评估。结果表明部分模式对LLM和用户而言较难理解。研究范围内,我们证实用户需要辅助才能清晰描述变更需求。总体评估显示,根据完整性与正确性标准,LLM能较好地处理大多数变更需求。
Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
Abstract
arXiv:2505.04637v1 Announce Type: cross Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.
摘要
尽管多模态大语言模型(MLLMs)的最新进展展现了处理多样化数据类型的卓越能力,但人类认知过程与计算式多模态信息整合方法之间仍存在显著差异。本研究系统性地探究了人类跨模态组块机制与MLLMs中令牌表征方法的相似性。通过对比人类在视觉-语言任务中的表现模式与模型行为的实证研究,我们发现传统静态令牌化方案从根本上限制了现有模型模拟人类动态、上下文敏感信息处理的能力。为此,我们提出了一种新颖的动态跨模态令牌化框架,该框架融合了自适应边界、分层表征以及基于认知科学原理的对齐机制。定量评估表明,我们的方法在基准任务上实现了相对于最先进模型的统计学显著提升(视觉问答任务提升7.8%,复杂场景描述任务提升5.3%),同时表现出更符合人类认知的错误模式和注意力分布。这些发现深化了人类认知与人工智能关系的理论理解,并为开发更具认知合理性的AI系统提供了实证依据。
Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising
Abstract
arXiv:2505.04665v1 Announce Type: cross Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.
摘要
尽管大型语言模型在实验环境中已展现出个性化广告推荐的潜力,但在实际运营中,广告推荐系统如何与用户隐私保护、数据安全等措施相结合仍是值得深入探讨的领域。为此,本文研究了数字广告中大型语言模型的个性化风险与监管策略。本研究首先阐述了大型语言模型(LLM)的原理,特别是基于Transformer架构的自注意力机制,以及如何使模型理解并生成自然语言文本。随后,结合BERT(基于Transformer的双向编码器表示)模型与注意力机制,构建了面向个性化广告推荐与用户因素风险保护的算法模型。具体步骤包括:数据收集与预处理、特征选择与构建、利用BERT等大型语言模型进行广告语义嵌入、基于用户画像的广告推荐。进而通过本地模型训练与数据加密来保障用户隐私安全,避免个人数据泄露。本文设计了基于BERT大型语言模型的个性化广告推荐实验,并采用真实用户数据进行验证。实验结果表明:基于BERT的广告推送能有效提升广告点击率与转化率;同时通过本地模型训练与隐私保护机制,可在一定程度上降低用户隐私泄露风险。
MatMMFuse: Multi-Modal Fusion model for Material Property Prediction
Abstract
arXiv:2505.04634v1 Announce Type: cross Abstract: The recent progress of using graph based encoding of crystal structures for high throughput material property prediction has been quite successful. However, using a single modality model prevents us from exploiting the advantages of an enhanced feature space by combining different representations. Specifically, pre-trained large language models (LLMs) can encode a large amount of knowledge which is beneficial for training of models. Moreover, the graph encoder is able to learn the local features while the text encoder is able to learn global information such as space group and crystal symmetry. In this work, we propose Material Multi-Modal Fusion (MatMMFuse), a fusion based model which uses a multi-head attention mechanism for the combination of structure aware embedding from the Crystal Graph Convolution Network (CGCNN) and text embeddings from the SciBERT model. We train our model in an end-to-end framework using data from the Materials Project Dataset. We show that our proposed model shows an improvement compared to the vanilla CGCNN and SciBERT model for all four key properties: formation energy, band gap, energy above hull and Fermi energy. Specifically, we observe an improvement of 40% compared to the vanilla CGCNN model and 68% compared to the SciBERT model for predicting the formation energy per atom. Importantly, we demonstrate the zero shot performance of the trained model on small curated datasets of Perovskites, Chalcogenides and the Jarvis Dataset. The results show that the proposed model exhibits better zero shot performance than the individual plain vanilla CGCNN and SciBERT model. This enables researchers to deploy the model for specialized industrial applications where collection of training data is prohibitively expensive.
摘要
近年来,基于晶体结构图编码的高通量材料性质预测方法取得了显著进展。然而,单模态模型无法通过结合不同表征方式来利用增强特征空间的优势。具体而言,预训练大语言模型(LLMs)能够编码大量知识,这对模型训练大有裨益。此外,图编码器擅长学习局部特征,而文本编码器则能捕获空间群和晶体对称性等全局信息。本研究提出材料多模态融合模型(MatMMFuse),该融合模型采用多头注意力机制,将来自晶体图卷积网络(CGCNN)的结构感知嵌入与SciBERT模型的文本嵌入相结合。我们使用材料项目数据集的数据,以端到端框架训练模型。实验结果表明,在形成能、带隙、能量高于壳层以及费米能这四项关键性质预测上,本模型相比原始CGCNN和SciBERT模型均有提升。特别是在预测单原子形成能时,较原始CGCNN模型提升40%,较SciBERT模型提升68%。值得注意的是,我们在钙钛矿、硫族化合物小型精选数据集及Jarvis数据集上验证了训练模型的零样本性能。结果显示,所提模型比单独的原始CGCNN和SciBERT模型具有更好的零样本性能。这使得研究人员可将该模型应用于专业工业领域——在这些场景中,训练数据的采集成本往往极其高昂。
When Bad Data Leads to Good Models
Abstract
arXiv:2505.04741v1 Announce Type: cross Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
摘要
在大语言模型(LLM)预训练中,数据质量通常被认为决定模型质量。本文从训练前后协同设计的视角重新审视"质量"这一概念。具体而言,我们探讨了预训练阶段使用更多毒性数据可能增强后训练阶段的控制能力,从而最终降低模型输出毒性的可能性。首先,我们通过玩具实验研究数据构成如何影响表征空间中特征的几何分布。接着,通过对Olmo-1B模型在不同比例清洁数据与毒性数据下的控制实验,发现随着毒性数据比例增加,毒性概念会获得更少纠缠的线性表征。进一步研究表明,虽然毒性数据会提升基础模型的生成毒性,但也使得毒性更易被消除。在Toxigen和Real Toxicity Prompts数据集上的评估表明,当应用推理时干预(ITI)等去毒技术时,使用毒性数据训练的模型能在降低生成毒性和保持通用能力之间取得更好平衡。我们的研究结果表明,当考虑后训练环节时,劣质数据可能反而有助于构建优质模型。
Advancing Conversational Diagnostic AI with Multimodal Reasoning
Abstract
arXiv:2505.04653v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
摘要
大型语言模型(LLMs)在开展诊断对话方面展现出巨大潜力,但现有评估主要局限于纯文本交互,这与远程医疗服务的实际需求存在偏差。即时通讯平台允许临床医生和患者在诊疗过程中无缝上传并讨论多模态医疗资料,然而LLMs在保持合格诊断对话其他属性的同时处理此类数据的能力尚未可知。本研究通过赋予清晰医疗智能探索系统(AMIE)收集解读多模态数据并在问诊中精准推理的新能力,提升了其对话式诊断与管理性能。基于Gemini 2.0 Flash构建的系统采用状态感知对话框架,通过反映患者状态和动态诊断的中间模型输出来控制对话流程。后续提问策略性地针对患者状态的不确定性展开,形成模拟资深临床医生的结构化多模态病史采集过程。在一项随机双盲OSCE风格研究中,我们将AMIE与初级保健医生(PCPs)通过聊天咨询患者演员进行对比。研究构建了105个评估场景,涵盖智能手机皮肤照片、心电图和临床文档PDF等多种形式的医疗资料,涉及不同病症和人群。评估标准包含多模态能力及其他临床重要维度,如病史采集、诊断准确性、管理推理、沟通能力和同理心。专家评估显示AMIE在多模态维度7/9项、非多模态维度29/32项(包括诊断准确性)上优于PCPs。结果表明多模态对话诊断AI取得明显进展,但实际应用转化仍需进一步研究。
A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models
Abstract
arXiv:2505.04784v1 Announce Type: cross Abstract: The emergence of Generative AI (Gen AI) and Large Language Models (LLMs) has enabled more advanced chatbots capable of human-like interactions. However, these conversational agents introduce a broader set of operational risks that extend beyond traditional cybersecurity considerations. In this work, we propose a novel, instrumented risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders: the service-providing organization, end users, and third parties. Our approach incorporates the technical complexity required to induce erroneous behaviors in the chatbot--ranging from non-induced failures to advanced prompt-injection attacks--as well as contextual factors such as the target industry, user age range, and vulnerability severity. To validate our metric, we leverage Garak, an open-source framework for LLM vulnerability testing. We further enhance Garak to capture a variety of threat vectors (e.g., misinformation, code hallucinations, social engineering, and malicious code generation). Our methodology is demonstrated in a scenario involving chatbots that employ retrieval-augmented generation (RAG), showing how the aggregated risk scores guide both short-term mitigation and longer-term improvements in model design and deployment. The results underscore the importance of multi-dimensional risk assessments in operationalizing secure, reliable AI-driven conversational systems.
摘要
生成式人工智能(Gen AI)和大型语言模型(LLM)的出现使得能够实现更先进、具备类人交互能力的聊天机器人。然而,这些对话代理引入了超越传统网络安全考量的更广泛运营风险。本研究提出了一种新颖的、工具化的风险评估指标,可同时评估对三个关键利益相关方的潜在威胁:服务提供组织、终端用户及第三方。我们的方法综合了诱发聊天机器人错误行为所需的技术复杂性(从非诱导性故障到高级提示注入攻击),以及目标行业、用户年龄范围和漏洞严重性等情境因素。为验证该指标,我们利用开源框架Garak进行LLM漏洞测试,并进一步扩展其功能以捕获多种威胁向量(如错误信息、代码幻觉、社会工程和恶意代码生成)。通过采用检索增强生成(RAG)技术的聊天机器人场景,我们展示了该方法如何通过聚合风险评分指导短期风险缓解及长期模型设计与部署改进。研究结果强调了多维风险评估对于实现安全可靠的人工智能驱动对话系统运营的重要性。
QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort
Abstract
arXiv:2505.04732v1 Announce Type: cross Abstract: The Query-By-Document (QBD) problem is an information retrieval problem where the query is a document, and the retrieved candidates are documents that match the query document, often in a domain or query specific manner. This can be crucial for tasks such as patent matching, legal or compliance case retrieval, and academic literature review. Existing retrieval methods, including keyword search and document embeddings, can be optimized with domain-specific datasets to improve QBD search performance. However, creating these domain-specific datasets is often costly and time-consuming. Our work introduces a process to generate custom QBD-search datasets and compares a set of methods to use in this problem, which we refer to as QBD-RankedDatagen. We provide a comparative analysis of our proposed methods in terms of cost, speed, and the human interface with the domain experts. The methods we compare leverage Large Language Models (LLMs) which can incorporate domain expert input to produce document scores and rankings, as well as explanations for human review. The process and methods for it that we present can significantly reduce human effort in dataset creation for custom domains while still obtaining sufficient expert knowledge for tuning retrieval models. We evaluate our methods on QBD datasets from the Text Retrieval Conference (TREC) and finetune the parameters of the BM25 model -- which is used in many industrial-strength search engines like OpenSearch -- using the generated data.
摘要
文档查询(Query-By-Document,QBD)是一种信息检索问题,其查询本身为文档形式,检索目标是获取与查询文档相匹配的候选文档,通常需结合特定领域或查询需求进行处理。该技术对专利匹配、法律合规案例检索及学术文献综述等任务至关重要。现有检索方法(包括关键词搜索和文档嵌入)可通过领域专用数据集优化以提升QBD搜索性能,但构建此类数据集往往成本高昂且耗时。本研究提出了一种生成定制化QBD搜索数据集的流程(称为QBD-RankedDatagen),并比较了适用于该问题的一系列方法。我们从成本、速度及领域专家的人机交互维度对所提方法进行了对比分析。这些方法利用大型语言模型(LLMs)整合领域专家输入,生成文档评分、排序结果及可人工审核的解释说明。我们提出的流程与方法能显著降低定制领域数据集构建中的人工投入,同时为检索模型调优获取足够的专家知识。基于文本检索会议(TREC)的QBD数据集,我们评估了所提方法,并利用生成数据对BM25模型(广泛应用于OpenSearch等工业级搜索引擎)的参数进行了微调。
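The abstract above mentions tuning the parameters of BM25 with generated ranking data. As a minimal sketch of what those parameters control, the snippet below implements a standard Okapi BM25 scorer in plain Python. The function name `bm25_scores`, the toy documents, and the two parameter settings are illustrative assumptions and are not taken from the paper; `k1` controls term-frequency saturation and `b` controls document-length normalization.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "patent claim for a battery electrode".split(),
    "court ruling on a contract dispute".split(),
    "battery electrode coating patent application".split(),
]
query = "battery electrode patent".split()

# Two hypothetical parameter settings, to show how k1 and b change the scores.
for k1, b in [(1.2, 0.75), (0.6, 0.3)]:
    print(k1, b, [round(s, 3) for s in bm25_scores(query, docs, k1, b)])
```

In a tuning loop of the kind the abstract describes, one would presumably search over `k1` and `b` and keep the setting whose ranking agrees best with the LLM-generated document rankings; that loop is not shown here.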
REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM
Abstract
arXiv:2505.04673v1 Announce Type: cross Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate (16.55%) while Qwen2-VL showed the highest MT refusal rate (19.1%).
摘要
视觉大语言模型(VLLMs)通过整合图像处理能力与文本理解技术,显著推动了人工智能的发展,从而提升了用户交互体验并拓展了应用领域。然而,其复杂性的增加也带来了新的安全与伦理挑战,尤其在多模态多轮对话场景中。传统基于文本单轮交互的安全评估框架难以应对这些复杂问题。为此,我们提出REVEAL(视觉赋能AI大模型责任评估)框架——一个可扩展的自动化流程,用于评估VLLMs中的图像输入危害。该框架包含自动图像挖掘、合成对抗数据生成、基于渐进式攻击策略的多轮对话扩展,以及通过GPT-4o等评估器进行的全面危害分析。
我们对五款前沿VLLMs(GPT-4o、Llama-3.2、Qwen2-VL、Phi3.5V和Pixtral)在三大关键危害类别(性危害、暴力及错误信息)进行了深入评估。研究发现: 相较于单轮评估,多轮交互会导致缺陷率显著上升,暴露出VLLMs更深层的脆弱性。值得注意的是,根据我们设计的安全-可用性指数(SUI)衡量,GPT-4o展现出最均衡的性能表现,Pixtral紧随其后。此外,错误信息被证明是需要加强上下文防御的关键领域。其中Llama-3.2的多轮缺陷率最高(16.55%),而Qwen2-VL的多轮拒绝率最高(19.1%)。
Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
Abstract
arXiv:2505.04842v1 Announce Type: cross Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. In this work, we propose RL that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL boosts MATH accuracy by over 20% with parallel sampling and enables efficient test-time compute scaling compared to the base RL method. RL also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL achieves higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
摘要
当前用于微调大型语言模型(LLM)推理器的强化学习(RL)方法(如GRPO或留一法PPO)通常会放弃已学习的价值函数,转而采用经验估计的回报。这种做法阻碍了依赖价值函数进行验证的测试时计算扩展。本研究提出RL方法,通过联合训练LLM作为推理器和生成式验证器(利用RL生成的数据),在不显著增加开销的情况下增强任何“无价值”RL方法的验证能力。实验表明,RL在并行采样条件下将MATH准确率提升超过20%,与基础RL方法相比可实现更高效的测试时计算扩展。RL在易到难任务及域外任务中均表现出强大的泛化能力。此外,当与长推理R1模型联合扩展并行和顺序测试时计算时,RL能实现更高的性能。
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
Abstract
arXiv:2505.04847v1 Announce Type: cross Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.
摘要
幻觉问题仍是大型语言模型(LLM)面临的持续挑战。检索增强生成(RAG)技术试图通过将响应锚定于上下文来减少幻觉。然而即使提供上下文,LLMs仍频繁生成缺乏依据的信息或矛盾内容。本文重点研究摘要任务中的LLM幻觉测量,评估不同LLMs在文档摘要时产生幻觉的频率。我们基于Hughes幻觉评估模型(HHEM)讨论了Vectara现有的LLM幻觉排行榜。尽管HHEM和Vectara幻觉排行榜已引发广泛研究兴趣,我们仍通过分析现有幻觉数据集上这些方法的有效性,检验了HHEM及当前幻觉检测方法面临的挑战。针对这些局限,我们提出FaithJudge——一种基于少量人工幻觉标注指导的LLM-as-a-judge方法,相较现有方法显著提升了LLM幻觉自动评估效果。我们推出了以FaithJudge为核心的增强版幻觉排行榜,与现有排行榜并列呈现,从而为RAG场景下的LLM幻觉提供更可靠的基准评估体系。
HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights
Abstract
arXiv:2505.04846v1 Announce Type: cross Abstract: The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
摘要
科学文献数量正呈指数级增长,这导致大量研究成果未被充分利用、科研工作重复以及跨学科合作受限。检索增强生成(RAG)技术通过提升大语言模型(LLMs)处理海量信息时的事实准确性,为科研人员提供了有效辅助。然而,将RAG扩展至处理数百万篇文献时面临重大挑战:包括文档解析与科学知识嵌入的高计算成本,以及将这些表征与科学内容复杂语义对齐的算法复杂性。为此,我们提出HiPerRAG——一种基于高性能计算(HPC)的RAG工作流,能够对超过360万篇科学文献进行知识索引与检索。其核心是Oreo(一个高通量多模态文档解析模型)和ColTrast(一种查询感知的编码器微调算法,通过对比学习与延迟交互技术提升检索精度)。HiPerRAG在现有科学问答基准及本文提出的两个新基准上表现优异:在SciQ达到90%准确率,在PubMedQA达到76%准确率,优于PubMedGPT等领域专用模型及GPT-4等商用LLMs。通过在Polaris、Sunspot和Frontier超级计算机上部署数千块GPU,HiPerRAG实现了百万级文献规模的RAG工作流,为整合科学知识与促进跨学科创新提供了解决方案。
GroverGPT-2: Simulating Grover's Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization
Abstract
arXiv:2505.04880v1 Announce Type: cross Abstract: Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, prompting exploration into their potential for this challenge. In this work, we introduce GroverGPT-2, an LLM-based method for simulating Grover's algorithm using Chain-of-Thought (CoT) reasoning and quantum-native tokenization. Building on its predecessor, GroverGPT-2 performs simulation directly from quantum circuit representations while producing logically structured and interpretable outputs. Our results show that GroverGPT-2 can learn and internalize quantum circuit logic through efficient processing of quantum-native tokens, providing direct evidence that classical models like LLMs can capture the structure of quantum algorithms. Furthermore, GroverGPT-2 outputs interleave circuit data with natural language, embedding explicit reasoning into the simulation. This dual capability positions GroverGPT-2 as a prototype for advancing machine understanding of quantum algorithms and modeling quantum circuit logic. We also identify an empirical scaling law for GroverGPT-2 with increasing qubit numbers, suggesting a path toward scalable classical simulation. These findings open new directions for exploring the limits of classical simulatability, enhancing quantum education and research, and laying groundwork for future foundation models in quantum computing.
摘要
量子计算在特定任务上具有超越经典计算的理论优势,但实际量子优势的边界仍是一个悬而未决的问题。为探究这一边界,理解经典机器是否及如何能够学习并模拟量子算法至关重要。大型语言模型(LLMs)近期展现出的强大推理能力,促使我们探索其应对这一挑战的潜力。本研究提出GroverGPT-2,这是一种基于LLM的方法,通过思维链(CoT)推理和量子原生标记化来模拟Grover算法。相较于前代模型,GroverGPT-2能直接从量子电路表示进行模拟,同时生成具有逻辑结构且可解释的输出。结果表明,GroverGPT-2能通过高效处理量子原生标记来学习并内化量子电路逻辑,这为LLM等经典模型可捕捉量子算法结构提供了直接证据。此外,GroverGPT-2的输出将电路数据与自然语言交织,将显式推理嵌入模拟过程。这种双重能力使GroverGPT-2成为推进机器理解量子算法和建模量子电路逻辑的原型。我们还发现了GroverGPT-2随量子比特数增加的经验缩放规律,为可扩展经典模拟指明了路径。这些发现为探索经典可模拟性极限、加强量子教育与研究,以及构建未来量子计算基础模型开辟了新方向。
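For context on what "simulating Grover's algorithm" means, here is a minimal classical state-vector simulation in plain Python. It is a reference simulator of the textbook algorithm (oracle phase flip followed by inversion about the mean), not the LLM-based method the abstract describes. The function name `grover` and the choice of marked index are illustrative assumptions.

```python
import math


def grover(n_qubits, marked):
    """Simulate Grover search with one marked item; return P(measuring marked)."""
    size = 2 ** n_qubits
    # Start in the uniform superposition (all amplitudes are real here).
    state = [1 / math.sqrt(size)] * size
    # Standard iteration count for a single marked item.
    iterations = math.floor(math.pi / 4 * math.sqrt(size))
    for _ in range(iterations):
        # Oracle: flip the sign of the marked amplitude.
        state[marked] = -state[marked]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(state) / size
        state = [2 * mean - amp for amp in state]
    return state[marked] ** 2


for n in range(2, 8):
    print(n, round(grover(n, marked=1), 4))
```

The state vector doubles in length with every added qubit, which is why the scaling behavior of any classical simulator, learned or hand-written, is the interesting question.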
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
Abstract
arXiv:2505.04881v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. Existing compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
摘要
大规模推理模型(LRMs)通过思维链(CoT)提示在复杂推理任务中表现优异,但常因冗余内容导致输出冗长,从而增加计算开销并降低用户体验。现有压缩方法要么采用事后剪枝,可能破坏推理连贯性;要么依赖基于采样的选择,无法在生成过程中有效干预。本研究提出一种置信度引导的视角来解释LRMs中冗余反思的产生,识别出两种关键模式:置信赤字(模型因内部置信度低而重新考虑正确步骤)和终止延迟(模型在获得置信答案后仍持续推理)。基于此分析,我们提出ConCISE(逐步高效推理中的置信度引导压缩框架),通过增强模型推理过程中的置信度来简化推理链,从而避免生成冗余反思步骤。该框架整合置信注入(稳定中间步骤)和早期终止(当置信度充足时停止推理)。大量实验表明,基于ConCISE生成数据微调的LRMs能显著缩短输出长度(在SimPO下减少约50%),同时保持高任务准确率。ConCISE在多个推理基准测试中始终优于现有基线方法。
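The Early Stopping idea in the abstract above (terminate reasoning once confidence is sufficient) can be illustrated with a very small sketch. Everything here is hypothetical: the abstract does not say how confidence is computed, so the snippet simply takes per-step confidence values as given inputs, and the function name `truncate_reasoning` and the threshold are assumptions made for illustration.

```python
def truncate_reasoning(steps, threshold=0.9):
    """Keep steps up to and including the first one whose confidence reaches the threshold."""
    kept = []
    for text, confidence in steps:
        kept.append(text)
        if confidence >= threshold:
            break  # confident enough: stop instead of generating further reflection
    return kept


# Hypothetical reasoning trace with made-up confidence values.
steps = [
    ("Let x be the unknown; 2x + 3 = 11.", 0.55),
    ("So 2x = 8 and x = 4.", 0.95),
    ("Wait, let me double-check: 2*4 + 3 = 11.", 0.97),
]
print(truncate_reasoning(steps))
```

With these made-up values the third step is dropped, which corresponds to the Termination Delay pattern: reasoning that continues after a confident answer has already been reached.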
SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
Abstract
arXiv:2505.04911v1 Announce Type: cross Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.
摘要
本研究提出了一种名为SpatialPrompting的创新框架,该框架利用现成多模态大语言模型涌现的推理能力,实现了三维(3D)环境中的零样本空间推理。与现有方法依赖昂贵的3D专用微调(需使用点云或体素特征等专业3D输入)不同,SpatialPrompting采用关键帧驱动的提示生成策略。该框架通过视觉语言相似度、马氏距离、视场角和图像清晰度等指标,从图像序列中选择多样化的信息关键帧,并将其与对应相机位姿数据整合,从而有效抽象空间关系并推断复杂3D结构。该框架不仅建立了利用直观视觉与位置线索进行灵活空间推理的新范式,还在ScanQA和SQA3D等基准数据集上实现了多项指标的零样本最先进性能。所提方法彻底消除了对专业3D输入和微调的依赖,为传统方法提供了更简单且可扩展的替代方案。
An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education
Abstract
arXiv:2505.04916v1 Announce Type: cross Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
摘要
人工智能的最新进展推动了智能教育工具的普及,但现有语义检索系统仍难以适应学术内容独特的语言与结构特征。本研究提出两种专为教育问答任务优化的开源嵌入模型,特别针对课程大纲场景。通过人工筛选与大语言模型辅助生成相结合,构建了包含3,197个句对的合成数据集,涵盖同义术语、问题复述及显隐式映射三类语义关系。评估了两种训练策略:(1)采用多重负样本排序损失(MNRL)的基线模型,(2)融合MNRL与余弦相似度损失的双损失模型以同步优化语义排序与相似度校准。基于28份大学课程大纲及预设的自然语言问题集(课程信息、教师信息、助教信息三类)的测试表明:两种微调模型均优于all-MiniLM-L6-v2和multi-qa-MiniLM-L6-cos-v1等强开源基线,且双损失模型缩小了与OpenAI text-embedding-3系列等高性能商业嵌入的差距。本研究贡献了可复用的领域适配嵌入模型,并为教育语义检索提供了可复制的技术框架,可支持学术聊天机器人、检索增强生成(RAG)系统及学习管理系统(LMS)集成等下游应用。
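To make the dual-loss idea concrete, the sketch below computes both objectives in plain Python on toy embeddings. It follows the usual definitions as I understand them: the ranking loss is cross-entropy over scaled cosine similarities with in-batch negatives, and the similarity loss is mean squared error between cosine similarity and a target score. The weighted combination via `alpha`, the `scale` value, and all function names are assumptions for illustration; the abstract does not state how the two losses are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def mnrl(anchors, positives, scale=20.0):
    """Ranking loss with in-batch negatives: positives[i] is the target for anchors[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cosine similarity and the target similarity score."""
    return sum((cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)) / len(labels)


def dual_loss(anchors, positives, labels, alpha=0.5):
    """Hypothetical weighted sum of the two objectives (alpha is an assumption)."""
    pairs = list(zip(anchors, positives))
    return alpha * mnrl(anchors, positives) + (1 - alpha) * cosine_similarity_loss(pairs, labels)


# Toy 2-D "embeddings": each anchor matches the positive at the same index.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
print(dual_loss(anchors, positives, labels=[1.0, 1.0]))
```

The ranking term only cares that the right pair outranks the other pairs in the batch, while the similarity term pushes the absolute cosine value toward a target, which is the "similarity calibration" the abstract refers to.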
Chain-of-Thought Tokens are Computer Program Variables
Abstract
arXiv:2505.04955v1 Announce Type: cross Abstract: Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective to help LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
摘要
思维链(CoT)要求大语言模型(LLM)在得出最终答案前生成中间步骤,已被证明能有效帮助LLM解决复杂推理任务。然而,CoT的内在机制仍不甚明晰。本文通过实证研究,探讨了LLM中CoT标记在两个组合任务中的作用:多位数乘法与动态规划。虽然CoT对解决这些问题至关重要,但我们发现仅保留存储中间结果的标记即可实现相当的性能。此外,我们观察到以替代潜在形式存储中间结果不会影响模型表现。通过随机干预CoT中的部分数值,我们注意到后续CoT标记及最终答案会相应改变。这些发现表明,CoT标记可能类似于计算机程序中的变量,但也存在潜在缺陷,如无意形成的捷径以及标记间计算复杂度的限制。代码与数据详见https://github.com/solitaryzero/CoTs_are_Variables。
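The "tokens as variables" reading can be illustrated on the multi-digit multiplication task the abstract mentions. The sketch below stores each partial product as an explicit intermediate value and lets a caller overwrite one of them, loosely mirroring the intervention experiment described above. It is an ordinary program, not a model, and the function name `multiply_with_trace` and the `intervene` parameter are hypothetical.

```python
def multiply_with_trace(a, b, intervene=None):
    """Long multiplication that keeps each partial product as an intermediate value.

    `intervene` optionally maps a step index to a replacement value,
    mimicking an intervention on a stored intermediate result.
    """
    partials = []
    for i, digit in enumerate(reversed(str(b))):
        value = a * int(digit) * 10 ** i
        if intervene and i in intervene:
            value = intervene[i]  # overwrite the stored "variable"
        partials.append(value)
    # The final answer depends only on the stored intermediate values.
    return partials, sum(partials)


print(multiply_with_trace(123, 45))
print(multiply_with_trace(123, 45, intervene={0: 600}))
```

Changing one stored partial product changes the final sum accordingly, which is the behavior the abstract reports for CoT tokens after intervention.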
LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture
Abstract
arXiv:2505.04980v1 Announce Type: cross Abstract: This paper proposes a novel Large Vision-Language Model (LVLM) and Model Predictive Control (MPC) integration framework that delivers both task scalability and safety for Autonomous Driving (AD). LVLMs excel at high-level task planning across diverse driving scenarios. However, since these foundation models are not specifically designed for driving and their reasoning is not consistent with the feasibility of low-level motion planning, concerns remain regarding safety and smooth task switching. This paper integrates LVLMs with MPC Builder, which automatically generates MPCs on demand, based on symbolic task commands generated by the LVLM, while ensuring optimality and safety. The generated MPCs can strongly assist the execution or rejection of LVLM-driven task switching by providing feedback on the feasibility of the given tasks and generating task-switching-aware MPCs. Our approach provides a safe, flexible, and adaptable control framework, bridging the gap between cutting-edge foundation models and reliable vehicle operation. We demonstrate the effectiveness of our approach through a simulation experiment, showing that our system can safely and effectively handle highway driving while maintaining the flexibility and adaptability of LVLMs.
摘要
本文提出了一种新颖的大型视觉语言模型(LVLM)与模型预测控制(MPC)集成框架,旨在为自动驾驶(AD)同时实现任务可扩展性和安全性。LVLM擅长处理多样化驾驶场景中的高层任务规划,但由于这些基础模型并非专为驾驶设计,其推理过程与底层运动规划的可行性存在不一致性,因此在安全性和平滑任务切换方面仍存在隐患。本研究将LVLM与MPC构建器相结合,该系统能根据LVLM生成的符号化任务指令自动生成MPC控制器,同时确保最优性与安全性。生成的MPC控制器通过提供任务可行性反馈及生成支持任务切换的MPC方案,能够有效辅助执行或否决LVLM驱动的任务切换。该框架构建了一个安全、灵活且适应性强的控制体系,弥合了前沿基础模型与可靠车辆操作之间的鸿沟。通过仿真实验验证,我们的系统在保持LVLM灵活性与适应性的同时,能够安全高效地完成高速公路驾驶任务。
Rethinking Invariance in In-context Learning
Abstract
arXiv:2505.04994v1 Announce Type: cross Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples, even when those examples are mutually independent. To address this issue, recent studies have introduced several variant ICL algorithms that achieve permutation invariance. However, many of these do not match the performance of the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which no existing method achieves simultaneously. These investigations lead us to propose Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring both properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, on most benchmark datasets, showcasing superior generalization across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.
摘要
上下文学习(ICL)已成为自回归大语言模型的核心能力,但其对上下文示例顺序的显著敏感性阻碍了发展,即使这些示例相互独立。为解决该问题,近期研究提出了多种实现排列不变性的ICL变体算法,然而其中多数未能达到标准自回归ICL算法的可比性能。本研究发现,构建不变性ICL算法需满足两个关键要素:信息无泄漏和上下文互依性,现有方法均未能同时实现这两点。基于此,我们提出不变性上下文学习(InvICL)方法,该设计在保证两个特性的同时实现ICL不变性。实验表明,InvICL在多数基准数据集上超越以往不变性与非不变性模型,并展现出对不同输入长度的卓越泛化能力。代码发布于https://github.com/PKU-ML/InvICL。
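As a rough illustration of what permutation invariance means for the abstract above, the sketch below checks whether a function of the context examples returns the same value under every ordering. It is not InvICL and makes no claim about the linked repository: both scoring functions are hypothetical toy stand-ins, with a position-weighted average mimicking an order-sensitive predictor and a plain mean mimicking an invariant one.

```python
from itertools import permutations
from typing import Callable, Sequence


def order_sensitive_score(examples: Sequence[float]) -> float:
    """Toy order-sensitive aggregator: later examples receive larger weights."""
    weights = range(1, len(examples) + 1)
    return sum(w * x for w, x in zip(weights, examples)) / sum(weights)


def order_invariant_score(examples: Sequence[float]) -> float:
    """Toy invariant aggregator: a plain mean ignores example order."""
    return sum(examples) / len(examples)


def is_permutation_invariant(
    fn: Callable[[Sequence[float]], float],
    examples: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """Return True if fn gives the same value for every ordering of examples."""
    reference = fn(list(examples))
    return all(
        abs(fn(list(p)) - reference) <= tol for p in permutations(examples)
    )


context = [1.0, 2.0, 4.0]
print(is_permutation_invariant(order_sensitive_score, context))
print(is_permutation_invariant(order_invariant_score, context))
```

The exhaustive check over all orderings is only practical for a handful of examples; for longer contexts one would sample permutations instead.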
Understanding In-context Learning of Addition via Activation Subspaces
Abstract
arXiv:2505.05145v1 Announce Type: cross Abstract: To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer k to the input. We find that Llama-3-8B attains high accuracy on this task for a range of k, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.