Miscellaneous Notes on Agent Practice

Everyone was busy building agents in 2025. The categories below are fairly arbitrary.

Tools/Prompts

- System Prompts and Models of AI Tools. System prompts and tool schemas of various AI products.
- Claude Cookbooks. Example prompts for citations and for a research agent.
- 2025-05. Highlights from the Claude 4 system prompt. A prompt analysis.
- 2025-03. Markdown vs. XML in LLM Prompts: A Comparative Analysis
- Claude Docs. Essential tips for long context prompts
- 2025-04. How ChatGPT Memory Works. Reverse-engineering the memory tool.
- 2025-06. Reverse-engineering the Gemini 2.5 Pro search feature. The Browse tool is a sub-agent that, following its prompt, returns the relevant parts of a web page to the main agent.
- Anthropic. 2025-03. The “think” tool: Enabling Claude to stop and think in complex tool use situations
- Anthropic. 2025-09. Writing effective tools for agents — with agents

Context-Engineering

- Lance’s Blog. 2025-06. Context Engineering for Agents
- Drew Breunig. 2025-06. How Long Contexts Fail
- manus. 2025-07. Context Engineering for AI Agents: Lessons from Building Manus
- 周星星 (Zhihu). 2025-09. Context Engineering 上下文工程的前世今生 (the past and present of context engineering)
- 2025-10. Context Engineering for AI Agents with LangChain and Manus - YouTube. Don’t train models before PMF. Lance Martin’s slides (LangChain); Yichao “Peak” Ji’s slides (Manus).
- Anthropic. 2025-10. Introducing Claude Skills. Unlike earlier tools/MCP, skills expose information hierarchically and more flexibly: the LLM loads them on demand instead of having every description stuffed into the prompt up front. The Manus post above mentions a similar on-demand loading scheme.

System

The current trend is to skip building your own index (chunking + embedding + vector database) and simply let the LLM grep or web-search; Claude, Cline, and Manus all do this (see the grep-tool sketch at the end of this post).

- Cline. 2025-05. Why Cline Doesn’t Index Your Codebase (And Why That’s a Good Thing). Cline’s blog has many posts with data on various models.
- minusx. 2025-08. What makes Claude Code so damn good (and how to recreate that magic in your agent)!?. Simplicity wins.

Multi-Agent

- Anthropic. 2025-06. How we built our multi-agent research system
- Cognition. 2025-06. Don’t Build Multi-Agents

Deep Research

- Google. 2025-06. Gemini Fullstack LangGraph Quickstart. The most basic agentic search pattern.
- LangChain. 2025-07. Open Deep Research. The structure is clean and basic: the search phase splits topics across sub-agents, and a single LLM writes the final report. Compared with the Gemini Quickstart it adds an up-front clarification and planning step, similar to the actual deep research features in Gemini and ChatGPT. Also built on LangGraph; code is here.
- Jina. 2025-02. A practical guide to implementing DeepSearch/DeepResearch. Jina specializes in embeddings; their tech blog has many RAG-related articles and reads more technical than most.
- 周星星 (Zhihu). 2025-04. End-to-end training: how to reproduce Deep Research.

Training: Agentic RL

- 2025-07. How Kimi K2 Became One of the Best Tool-Using Models
- 2025-09. 通义 DeepResearch: a new era for open-source AI agents
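To make the “no index, just grep” pattern above concrete, here is a minimal sketch of what such a tool could look like when exposed to an LLM. The tool name, schema, and the naive directory walk are my own assumptions for illustration, not taken from Claude, Cline, or Manus.

```python
import json
import re
from pathlib import Path

# Hypothetical tool schema in OpenAI function-calling format (names are assumptions).
GREP_TOOL = {
    "type": "function",
    "function": {
        "name": "grep_repo",
        "description": "Search files under the repository for a regex and return matching lines with file paths and line numbers.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "Regular expression to search for."},
                "glob": {"type": "string", "description": "Filename glob, e.g. '*.py'."},
            },
            "required": ["pattern"],
        },
    },
}


def grep_repo(root: str, pattern: str, glob: str = "*", max_hits: int = 50) -> str:
    """Naive grep over a directory tree; the agent reads raw hits instead of querying a vector index."""
    regex = re.compile(pattern)
    hits = []
    for path in Path(root).rglob(glob):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, 1):
            if regex.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return "\n".join(hits)
    return "\n".join(hits) if hits else "no matches"


if __name__ == "__main__":
    print(json.dumps(GREP_TOOL, indent=2))
    print(grep_repo(".", r"def \w+", glob="*.py"))
```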

2025/10/17

A Brief Review of RAG

Everyone was busy building agents in 2025; here is a brief review of RAG.

RAG basics

- Offline: parse files, chunk text, embed (bge used to be the usual choice).
- Embed the query and retrieve (usually just a cosine similarity; with many chunks, use a vector database and trade a little accuracy for faster retrieval).
- Rerank (usually bge-reranker).

This recipe was already played out back in 2023. For the basic “advanced” tricks see NisaarAgharia/Advanced_RAG and NirDiamant/RAG_Techniques. A good survey: 【同济大学 王昊奋】Agentic RAG 时代. Also worth a look: the ByteDance RAG practice handbook, which splits RAG into data, indexing, retrieval, and generation layers.

Advanced RAG

Offline

- Better chunking
  - Split by semantics: compute sentence embeddings; when the distance between adjacent sentence embeddings is large (e.g., take a percentile of the distances as the threshold), treat it as a topic change and split there (a minimal sketch follows at the end of this post).
  - Split by structure: follow markdown heading levels, figures/tables, code blocks, etc., so that meaningful units are not cut apart. You can put the chunk’s heading in its metadata or prepend it to the chunk, or have an LLM summarize the chunk into a heading and prepend that.
  - There are plenty of finer-grained tricks; see these 2023 RAG competitions (defense videos are on Bilibili): Alibaba Tianchi 2023 Global Intelligent Automotive AI Challenge — Track 1: LLM retrieval QA, and the 2023 博金大模型挑战赛. The latter later opened a learning track, where I held first place (only recently did someone edge past me by a small margin).
- Better embeddings
  - Generate embeddings along more “dimensions” per chunk: e.g., embed a summary of the chunk, or embed a larger-window chunk / the paragraph / the section it belongs to (even hierarchical embeddings and hierarchical retrieval).
  - After a chunk is hit, expand a window around it or pull in its paragraph to bring back a more complete, coherent context.

Online

- Query processing
  - Query rewriting; see A Survey of Query Optimization in Large Language Models.
  - Query classification (intent recognition / routing, etc.).
  - Generate more embedding “views”: e.g., HyDE (Hypothetical Document Embedding) generates a pseudo-document from the query and retrieves with it, turning q-a matching into a-a matching. Symmetrically, offline you can generate likely queries for each chunk, turning q-a matching into q-q matching.
- Context assembly
  - Window expansion (as above: bring in the larger-window chunk around the hit).
  - Ordering (if chunks come from the same document, sort them by their order in the text; if they are close, fill small gaps to make the passage more coherent).
  - By hierarchy (as above: bring in the section the chunk belongs to).
  - Compression (again done with an LLM).

Evaluation

- Retrieval
  - Quality metrics: recall@k, precision@k, mAP, MRR, etc.; see here.
  - Performance metrics: average latency, sustainable QPS, availability / node failure recovery time.
  - Cost metrics: storage cost per vector, cost per retrieval.
  - Online metrics: click-through rate on retrieved results, dwell time, re-search rate (searching again after viewing results; lower is better), user satisfaction score.
- Generation
  - Quality metrics: factual accuracy (answer consistent with retrieved information), hallucination rate (share of the answer not grounded in the retrieved information), format compliance, user satisfaction.
  - Performance metrics: time to first token, QPS, availability.
  - Cost metrics: cost per request (GPU cost), GPU utilization.

Graph RAG

See LightRAG and Microsoft’s GraphRAG. The pitch is that Graph RAG can do two things: (1) answer global questions, e.g., summarize a whole book; (2) answer multi-hop questions. On (1), my understanding is that Graph RAG essentially builds hierarchical summaries (via hierarchical clustering over the graph), with higher levels being summaries of summaries, so the “global” answers are really prepared in advance. On (2), I think it is worse than agentic RAG. Graph RAG tries to use graph edges to collect all the hops in one shot (although later work does iterative retrieval), which is hard to do well. Building the graph is not trivial: even defining what counts as a node is hard, and entity (node) linking and disambiguation are hard too. Retrieval relies on specific edge types, which means you must “know in advance which edges you will need” when constructing the graph. Construction is expensive in compute and storage, and incremental updates are difficult. In the end, graph construction still depends on the LLM’s own ability, whereas agentic RAG allows multiple rounds of retrieval — equally dependent on the LLM, but without pre-building a graph over the knowledge base (so it can directly use more general capabilities such as web search). Its mode of action is closer to how humans work, and it scales more easily. See also: 你为什么要用 GraphRAG? So what is the graph actually good for? My take: it can be used to construct agent training data, as in WebSailor.

Agentic RAG

Simply let the LLM make the judgment calls itself: are the retrieved documents relevant? Are they enough to answer the question? Is the answer making things up? And so on. A typical application is deep research, which deserves a separate post.

Misc

- For sentence-level citations see Anthropic-Style Citations with Any LLM.
- 小宇宙 has an interview with the nano-graphrag developer: Graph RAG:提升大模型检索时的智力. It has been too long; I no longer remember the takeaways.
- A 2024 detail-polishing paper: Searching for Best Practices in Retrieval-Augmented Generation.
- The current embedding trend is LLM-based architectures: 阿里开源 Qwen3 新模型 Embedding,该模型的框架设计有哪些优势?

Further reading

- 2025-01. 知乎直答 RAG (it went agentic in September).
- 阿里云开发者. 2025-04. RAG 技术演进的四大核心命题.
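The semantic chunking idea above in a minimal sketch: embed() stands in for whatever sentence-embedding model you use (bge, Qwen3-Embedding, ...), and the percentile threshold is an assumption.

```python
import numpy as np


def embed(sentences: list[str]) -> np.ndarray:
    """Placeholder: replace with a real sentence-embedding model (e.g. bge)."""
    raise NotImplementedError


def semantic_chunks(sentences: list[str], percentile: float = 90) -> list[list[str]]:
    """Split where the embedding distance between adjacent sentences is unusually large."""
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # cosine distance between each pair of adjacent sentences
    dists = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)
    threshold = np.percentile(dists, percentile)
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], dists):
        if dist > threshold:  # large semantic jump -> start a new chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```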

2025/10/7

Reading Code: Cherry Studio Web Search

Quite rough. If both the knowledge base and web search are enabled (searchOrchestrationPlugin.ts), SEARCH_SUMMARY_PROMPT is used for intent analysis and query rewriting. The results of the two searches are simply concatenated (not interleaved or re-ranked together), with an offset added to the citation indices to avoid collisions. If memory retrieval is configured, it is appended as well.

Web search comes in two flavors. One is local search (see LocalSearchProvider.ts), which directly parses the SERP (e.g. https://www.google.com/search?q=%s) — free. The other calls a search API such as Tavily. Both hitting the search engine and fetching URL contents go through Electron, which loads the given URL in an invisible background browser window: window.api.searchService.openUrlInSearchWindow(uid, url). Other projects that similarly freeload on search engines include duckduckgo-mcp-server and open-webSearch; unclear whether that complies with the engines’ terms.

Prompts

prompts.ts

```ts
// https://github.com/ItzCrazyKns/Perplexica/blob/master/src/lib/prompts/webSearch.ts
export const SEARCH_SUMMARY_PROMPT = `
You are an AI question rephraser. Your role is to rephrase follow-up queries from a conversation into standalone queries that can be used by another LLM to retrieve information, either through web search or from a knowledge base.
**Use user's language to rephrase the question.**
Follow these guidelines:
1. If the question is a simple writing task, greeting (e.g., Hi, Hello, How are you), or does not require searching for information (unless the greeting contains a follow-up question), return 'not_needed' in the 'question' XML block. This indicates that no search is required.
2. If the user asks a question related to a specific URL, PDF, or webpage, include the links in the 'links' XML block and the question in the 'question' XML block. If the request is to summarize content from a URL or PDF, return 'summarize' in the 'question' XML block and include the relevant links in the 'links' XML block.
3. For websearch, You need extract keywords into 'question' XML block. For knowledge, You need rewrite user query into 'rewrite' XML block with one alternative version while preserving the original intent and meaning.
4. Websearch: Always return the rephrased question inside the 'question' XML block. If there are no links in the follow-up question, do not insert a 'links' XML block in your response.
5. Knowledge: Always return the rephrased question inside the 'question' XML block.
6. Always wrap the rephrased question in the appropriate XML blocks to specify the tool(s) for retrieving information: use <websearch></websearch> for queries requiring real-time or external information, <knowledge></knowledge> for queries that can be answered from a pre-existing knowledge base, or both if the question could be applicable to either tool. Ensure that the rephrased question is always contained within a <question></question> block inside these wrappers.

There are several examples attached for your reference inside the below 'examples' XML block.

<examples>
1. Follow up question: What is the capital of France
Rephrased question:\`
<websearch>
<question>
Capital of France
</question>
</websearch>
<knowledge>
<rewrite>
What city serves as the capital of France?
</rewrite>
<question>
What is the capital of France
</question>
</knowledge>
\`

2. Follow up question: Hi, how are you?
Rephrased question:\`
<websearch>
<question>
not_needed
</question>
</websearch>
<knowledge>
<question>
not_needed
</question>
</knowledge>
\`

3. Follow up question: What is Docker?
Rephrased question: \`
<websearch>
<question>
What is Docker
</question>
</websearch>
<knowledge>
<rewrite>
Can you explain what Docker is and its main purpose?
</rewrite>
<question>
What is Docker
</question>
</knowledge>
\`

4. Follow up question: Can you tell me what is X from https://example.com
Rephrased question: \`
<websearch>
<question>
What is X
</question>
<links>
https://example.com
</links>
</websearch>
<knowledge>
<question>
not_needed
</question>
</knowledge>
\`

5. Follow up question: Summarize the content from https://example1.com and https://example2.com
Rephrased question: \`
<websearch>
<question>
summarize
</question>
<links>
https://example1.com
</links>
<links>
https://example2.com
</links>
</websearch>
<knowledge>
<question>
not_needed
</question>
</knowledge>
\`

6. Follow up question: Based on websearch, Which company had higher revenue in 2022, "Apple" or "Microsoft"?
Rephrased question: \`
<websearch>
<question>
Apple's revenue in 2022
</question>
<question>
Microsoft's revenue in 2022
</question>
</websearch>
<knowledge>
<question>
not_needed
</question>
</knowledge>
\`

7. Follow up question: Based on knowledge, Formula of Scaled Dot-Product Attention and Multi-Head Attention?
Rephrased question: \`
<websearch>
<question>
not_needed
</question>
</websearch>
<knowledge>
<rewrite>
What are the mathematical formulas for Scaled Dot-Product Attention and Multi-Head Attention
</rewrite>
<question>
What is the formula for Scaled Dot-Product Attention?
</question>
<question>
What is the formula for Multi-Head Attention?
</question>
</knowledge>
\`
</examples>

Anything below is part of the actual conversation. Use the conversation history and the follow-up question to rephrase the follow-up question as a standalone question based on the guidelines shared above.

<conversation>
{chat_history}
</conversation>

**Use user's language to rephrase the question.**
Follow up question: {question}
Rephrased question:
`
```

WebSearchTool.ts (KnowledgeSearchTool.ts is similar)

```ts
let summary = 'No search needed based on the query analysis.'
if (results.query && results.results.length > 0) {
  summary = `Found ${results.results.length} relevant sources. Use [number] format to cite specific information.`
}
const citationData = results.results.map((result, index) => ({
  number: index + 1,
  title: result.title,
  content: result.content,
  url: result.url
}))

// 🔑 Return a citation-friendly format, reusing the REFERENCE_PROMPT logic
const referenceContent = `\`\`\`json\n${JSON.stringify(citationData, null, 2)}\n\`\`\``
const fullInstructions = REFERENCE_PROMPT.replace(
  '{question}',
  "Based on the search results, please answer the user's question with proper citations."
).replace('{references}', referenceContent)

return {
  type: 'content',
  value: [
    {
      type: 'text',
      text: 'This tool searches for relevant information and formats results for easy citation. The returned sources should be cited using [1], [2], etc. format in your response.'
    },
    { type: 'text', text: summary },
    { type: 'text', text: fullInstructions }
  ]
}
```

```ts
export const REFERENCE_PROMPT = `Please answer the question based on the reference materials

## Citation Rules:

- Please cite the context at the end of sentences when appropriate.
- Please use the format of citation number [number] to reference the context in corresponding parts of your answer.
- If a sentence comes from multiple contexts, please list all relevant citation numbers, e.g., [1][2]. Remember not to group citations at the end but list them in the corresponding parts of your answer.
- If all reference content is not relevant to the user's question, please answer based on your knowledge.

## My question is:

{question}

## Reference Materials:

{references}

Please respond in the same language as the user's question.
`
```

BaseApiClient.ts

```ts
public async getMessageContent(
  message: Message
): Promise<{ textContent: string; imageContents: { fileId: string; fileExt: string }[] }> {
  const content = getMainTextContent(message)

  if (isEmpty(content)) {
    return { textContent: '', imageContents: [] }
  }

  const webSearchReferences = await this.getWebSearchReferencesFromCache(message)
  const knowledgeReferences = await this.getKnowledgeBaseReferencesFromCache(message)
  const memoryReferences = this.getMemoryReferencesFromCache(message)

  const knowledgeTextReferences = knowledgeReferences.filter((k) => k.metadata?.type !== 'image')
  const knowledgeImageReferences = knowledgeReferences.filter((k) => k.metadata?.type === 'image')

  // Add an offset to avoid ID collisions
  const reindexedKnowledgeReferences = knowledgeTextReferences.map((ref) => ({
    ...ref,
    id: ref.id + webSearchReferences.length // offset knowledge-base reference IDs by the number of web-search references
  }))

  const allReferences = [...webSearchReferences, ...reindexedKnowledgeReferences, ...memoryReferences]

  const referenceContent = `\`\`\`json\n${JSON.stringify(allReferences, null, 2)}\n\`\`\``

  const imageReferences = knowledgeImageReferences.map((r) => {
    return { fileId: r.metadata?.id, fileExt: r.metadata?.ext }
  })

  return {
    textContent: isEmpty(allReferences)
      ? content
      : REFERENCE_PROMPT.replace('{question}', content).replace('{references}', referenceContent),
    imageContents: isEmpty(knowledgeImageReferences) ? [] : imageReferences
  }
}
```
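The rephraser’s output is raw text containing those XML-ish blocks. A minimal way to pull the questions back out (my own sketch, not Cherry Studio’s actual parser) could look like this:

```python
import re


def parse_rephraser_output(text: str) -> dict:
    """Extract <question>/<rewrite>/<links> values from the <websearch>/<knowledge> wrappers."""
    result = {}
    for tool in ("websearch", "knowledge"):
        block = re.search(rf"<{tool}>(.*?)</{tool}>", text, re.DOTALL)
        if not block:
            continue
        body = block.group(1)
        result[tool] = {
            "questions": [q.strip() for q in re.findall(r"<question>(.*?)</question>", body, re.DOTALL)],
            "rewrites": [r.strip() for r in re.findall(r"<rewrite>(.*?)</rewrite>", body, re.DOTALL)],
            "links": [l.strip() for l in re.findall(r"<links>(.*?)</links>", body, re.DOTALL)],
        }
    return result
```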

2025/9/30

Auto-Generating LLM Tool Schemas with Pydantic

A small utility. After defining a tool’s arguments, generate an OpenAI-style tool schema using only Pydantic, without pulling in any other library. The idea is simple: post-process the JSON Schema produced by Pydantic’s model_json_schema into the OpenAI format. Benefits: (1) no extra dependencies; (2) no separately maintained set of tool descriptions; (3) you get Pydantic’s features for free — after loading the arguments from a JSON string, parameters are validated and types are coerced automatically.

Basic example

```python
class GetWeatherArgs(BaseModel):
    """Retrieves current weather for the given location."""

    location: str = Field(description="City and country e.g. Bogotá, Colombia")
    units: Literal["celsius", "fahrenheit"] = Field(description="Units the temperature will be returned in.")


def get_weather(args: GetWeatherArgs):
    """The actual tool logic."""
    pass


get_weather_tool = create_tool_from_pydantic(GetWeatherArgs)
print(json.dumps(get_weather_tool, ensure_ascii=False, indent=2))
```

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Retrieves current weather for the given location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City and country e.g. Bogotá, Colombia"
        },
        "units": {
          "type": "string",
          "description": "Units the temperature will be returned in.",
          "enum": ["celsius", "fahrenheit"]
        }
      },
      "required": ["location", "units"]
    }
  }
}
```

Full code

```python
import datetime
import json
import re
import textwrap
from enum import StrEnum
from typing import Type, Literal, Optional, List, Any

import pydantic
from pydantic import BaseModel, Field


def _clean_text(text: str) -> str:
    """Clean up the indentation and trailing whitespace of a multi-line string."""
    return textwrap.dedent(text).strip()


def _process_property(prop_schema: dict, defs: dict) -> dict:
    """Recursively convert a single property's schema into the tool-parameter format."""
    # 1. Handle Optional[T]; in Pydantic v2 this shows up as anyOf containing 'null'
    if 'anyOf' in prop_schema:
        # Find the non-null schema
        non_null_schema = next((s for s in prop_schema['anyOf'] if s.get('type') != 'null'), None)
        if non_null_schema:
            # Recurse, but keep the outer description
            processed_schema = _process_property(non_null_schema, defs)
            if 'description' in prop_schema:
                processed_schema['description'] = _clean_text(prop_schema['description'])
            return processed_schema
        else:
            # In theory it should never be null-only
            return {}

    # 2. Handle nested objects ($ref)
    if '$ref' in prop_schema:
        ref_name = prop_schema['$ref'].split('/')[-1]
        nested_schema = defs.get(ref_name)
        if nested_schema:
            # For nested objects, call the main conversion function again
            return pydantic_to_tool_schema(nested_schema, defs)

    # 3. Handle basic types and arrays
    result = {}
    prop_type = prop_schema.get('type')
    if prop_type:
        result['type'] = prop_type
    if 'description' in prop_schema:
        result['description'] = _clean_text(prop_schema['description'])
    if 'enum' in prop_schema:
        result['enum'] = prop_schema['enum']

    # 3a. Handle arrays (List[T])
    if prop_type == 'array' and 'items' in prop_schema:
        # Recursively process the element type
        result['items'] = _process_property(prop_schema['items'], defs)

    return result


def pydantic_to_tool_schema(schema: dict, defs: dict = None) -> dict:
    """Convert Pydantic's JSON Schema into the tool's parameters section."""
    if defs is None:
        defs = schema.get('$defs', {})

    tool_params = {
        "type": "object",
        "properties": {},
        "required": schema.get("required", []),
    }

    # Top-level description (from the class docstring)
    if 'description' in schema:
        tool_params['description'] = _clean_text(schema['description'])

    properties = schema.get("properties", {})
    for name, prop_schema in properties.items():
        tool_params["properties"][name] = _process_property(prop_schema, defs)

    return tool_params


def create_tool_from_pydantic(pydantic_model: Type[BaseModel]) -> dict:
    """
    Create an OpenAI-style tool definition from a Pydantic model.
    - Infers the function name from the class name (e.g. GetWeatherArgs -> get_weather).
    - Uses the model's docstring as the tool description.
    """
    # 1. Infer the function name from the model class name
    model_name = pydantic_model.__name__
    class_name = model_name.removesuffix('Args')
    # Convert CamelCase to snake_case
    function_name = re.sub(r'(?<!^)(?=[A-Z])', '_', class_name).lower()

    # 2. Generate the Pydantic schema and convert it to a tool schema
    pydantic_schema = pydantic_model.model_json_schema()
    tool_schema = pydantic_to_tool_schema(pydantic_schema)
    description = tool_schema.pop("description", "")  # move the description to the outer level

    # 3. Build and return the full tool definition
    return {
        "type": "function",
        "function": {
            "name": function_name,
            "description": description,
            "parameters": tool_schema,
        },
    }


class GetWeatherArgs(BaseModel):
    """Retrieves current weather for the given location."""

    location: str = Field(description="City and country e.g. Bogotá, Colombia")
    units: Literal["celsius", "fahrenheit"] = Field(description="Units the temperature will be returned in.")


def get_weather(args: GetWeatherArgs):
    """The actual tool logic."""
    pass


get_weather_tool = create_tool_from_pydantic(GetWeatherArgs)
print(json.dumps(get_weather_tool, ensure_ascii=False, indent=2))
```

A more involved example

You can define nested models, enums, custom validation logic, and so on. The SearchFilesArgs model below demonstrates a file-search scenario with filters on file type (the FileType enum) and creation time (the nested TimeRange model). We also define an LLMProofBaseModel base class that automatically handles the string 'null' coming from the LLM. The check_dates validator in the nested TimeRange model also shows how to encapsulate business rules at the data-model level.

```python
# --- continues from the previous snippet ---

class LLMProofBaseModel(BaseModel):
    """Automatically converts any field whose value is the string 'null' to None."""

    @pydantic.field_validator('*', mode='before')
    @classmethod
    def _clean_null_str(cls, v: Any) -> Any:
        if isinstance(v, str) and v.lower() == 'null':
            return None
        return v


class TimeRange(LLMProofBaseModel):
    """This docstring is not used."""

    start_date: Optional[datetime.date] = Field(None, description="开始日期 (YYYY-MM-DD)")
    end_date: Optional[datetime.date] = Field(None, description="结束日期 (YYYY-MM-DD)")
    random_field: Optional[str] = Field(None, description='演示用')

    @pydantic.model_validator(mode='after')
    def check_dates(self) -> 'TimeRange':
        if self.start_date and self.end_date and self.start_date > self.end_date:
            # Raise an error here, or handle it some other way
            self.end_date = self.start_date
        return self


class FileType(StrEnum):
    PDF = "pdf"
    PPT = "ppt"


class SearchFilesArgs(LLMProofBaseModel):
    """
    搜索文件

    多行示例
    - xx
    - yy
    """

    query: str = Field(description="根据用户问题提炼出的核心搜索查询语句")
    file_types: Optional[List[Literal[*FileType]]] = Field(None, description="文件类型")
    time_range: Optional[TimeRange] = Field(None, description="文件创建时间范围")


search_file_tool = create_tool_from_pydantic(SearchFilesArgs)

tools = [
    get_weather_tool,
    search_file_tool,
]
print(json.dumps(tools, ensure_ascii=False, indent=2))

args1 = GetWeatherArgs.model_validate({"location": "Bogotá, Colombia", "units": "celsius"})
args2 = SearchFilesArgs.model_validate(
    {
        "query": "年报",
        "file_types": ["pdf"],
        "time_range": {"start_date": "2025-01-01", "end_date": "2024-01-01", "random_field": "null"},
    }
)
```

```json
[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Retrieves current weather for the given location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City and country e.g. Bogotá, Colombia"
          },
          "units": {
            "type": "string",
            "description": "Units the temperature will be returned in.",
            "enum": ["celsius", "fahrenheit"]
          }
        },
        "required": ["location", "units"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "search_files",
      "description": "搜索文件\n\n多行示例\n- xx\n- yy",
      "parameters": {
        "type": "object",
        "properties": {
          "query": {
            "type": "string",
            "description": "根据用户问题提炼出的核心搜索查询语句"
          },
          "file_types": {
            "type": "array",
            "items": {
              "type": "string",
              "enum": ["pdf", "ppt"]
            },
            "description": "文件类型"
          },
          "time_range": {
            "type": "object",
            "properties": {
              "start_date": {
                "type": "string",
                "description": "开始日期 (YYYY-MM-DD)"
              },
              "end_date": {
                "type": "string",
                "description": "结束日期 (YYYY-MM-DD)"
              },
              "random_field": {
                "type": "string",
                "description": "演示用"
              }
            },
            "required": [],
            "description": "文件创建时间范围"
          }
        },
        "required": ["query"]
      }
    }
  }
]
```
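To close the loop, here is a minimal sketch (my own addition, not part of the post’s code) of using the generated schema together with the Pydantic model when handling a tool call returned by an OpenAI-compatible chat API; the response object and handler registry are placeholders.

```python
# Continues from the code above. `tool_call` is assumed to be an element of
# response.choices[0].message.tool_calls from an OpenAI-compatible API called with tools=tools.
HANDLERS = {"get_weather": (GetWeatherArgs, get_weather)}


def dispatch(tool_call):
    args_model, handler = HANDLERS[tool_call.function.name]
    # model_validate_json parses the JSON string and validates/coerces the arguments
    args = args_model.model_validate_json(tool_call.function.arguments)
    return handler(args)
```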

2025/9/14

Two Simple SQLite Questions That Stump Most LLMs

The problem statement and an example prompt:

You are a SQLite expert; please solve the two problems below.

1. Write a SQLite query that, based on “now”, returns the unix timestamp of today’s midnight in local time. Note: “local” means the system timezone of the machine executing the SQL, and “today” means today in the local date. Example: if now is ‘2025-05-05 04:00:00+08:00’, return ‘2025-05-05 00:00:00+08:00’ (assuming the local timezone is UTC+8).

2. Write a SQLite query that, based on “now”, returns the date of last week’s Monday. Assume Monday is the first day of the week, and work entirely in UTC (no timezones to worry about). Example: if now is ‘2025-05-05’, a Monday, return ‘2025-04-28’; if now is ‘2025-05-04’, a Sunday, return ‘2025-04-21’.

Reference code is below; please fill in the SQL.

```python
import sqlite3
import time


def f(sql):
    conn = sqlite3.connect(':memory:')
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchone()[0]
    if isinstance(result, int):
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(result)))
    elif isinstance(result, str):
        print(result)
    conn.close()


# now has been replaced with concrete timestamps
sqls = [
    # local midnight today
    "SELECT unixepoch('2025-05-05 04:00:00+08:00', ...)",
    # last week's Monday
    "SELECT DATE('2025-05-05', ...)",
    "SELECT DATE('2025-05-04', ...)",
]
for sql in sqls:
    f(sql)
```

Try it yourself before looking at the answer. I asked a round of LLMs: 4o (no web, non-reasoning mode) can get both right (with some probability; with reasoning it actually fails). DeepSeek-R1 can get it right (with some probability). Most other LLMs, including Gemini 2.5 Pro, did not get both right.

Reference answer

See 4o’s response here:

Below are the complete SQL answers for the two problems.

✅ 1. Unix timestamp of today’s local midnight

We want to convert ‘2025-05-05 04:00:00+08:00’ to today’s midnight in local time, then to a unix timestamp. SQLite’s unixepoch() returns a UTC timestamp, so we first convert the time to the local midnight and then take unixepoch().

```sql
SELECT unixepoch('2025-05-05 04:00:00+08:00', 'localtime', 'start of day', 'utc')
```

Explanation: 'localtime' converts the input time to local time; 'start of day' gives midnight of that local day; 'utc' converts back to UTC so that unixepoch() yields the correct unix timestamp.

✅ 2. Date of last week’s Monday (UTC only)

SQLite’s strftime('%w', date) gives the day of the week as a number (0 = Sunday, 1 = Monday, …, 6 = Saturday). We can use it to compute an offset relative to the current date. Formula: current date − ((weekday + 6) % 7 + 7) days. So the final expression is:

```sql
SELECT DATE('2025-05-05', '-' || ((strftime('%w', '2025-05-05') + 6) % 7 + 7) || ' days')
```

For ‘2025-05-05’ (Monday, %w = 1): (1 + 6) % 7 + 7 = 0 + 7 = 7 → last Monday: 2025-04-28. For ‘2025-05-04’ (Sunday, %w = 0): (0 + 6) % 7 + 7 = 6 + 7 = 13 → last Monday: 2025-04-21.

✅ The completed code:

```python
sqls = [
    # local midnight today
    "SELECT unixepoch('2025-05-05 04:00:00+08:00', 'localtime', 'start of day', 'utc')",
    # last week's Monday
    "SELECT DATE('2025-05-05', '-' || ((strftime('%w', '2025-05-05') + 6) % 7 + 7) || ' days')",
    "SELECT DATE('2025-05-04', '-' || ((strftime('%w', '2025-05-04') + 6) % 7 + 7) || ' days')",
]
```

Would you like me to extend this into a general function that takes a dynamic now parameter?

Notes on the answers

Problem 1. My query for today’s midnight is the same as 4o’s. The counter-intuitive part is the naming of SQLite’s ‘localtime’ and ‘utc’ modifiers; the typical mistake is omitting the final ‘utc’ modifier. The official documentation says:

The “localtime” modifier assumes the time-value to its left is in Universal Coordinated Time (UTC) and adjusts that time value so that it is in localtime. The “utc” modifier is the opposite of “localtime”. “utc” assumes that the time-value to its left is in the local timezone and adjusts that time-value to be in UTC.

After adding this passage to the prompt, a fair number of LLMs get it right. SQLite’s time modifiers are applied left to right. With a UTC+8 local timezone, ‘localtime’ is effectively just +8 hours and ‘utc’ is −8 hours. The modifiers can be stacked repeatedly: SELECT unixepoch('2025-05-05 04:00:00+08:00', 'localtime', 'localtime') gives ‘2025-05-05 20:00:00+08:00’.

Problem 2. I used a different implementation from 4o: SELECT DATE('2025-05-05', 'weekday 0', '+1 day', '-14 days'). The trap is that if the date to the left already falls on weekday N, this modifier does nothing. Every LLM that reached for the weekday modifier fell into it, and even after being given the information below, the ones that answered wrong stayed wrong.

The “weekday” modifier advances the date forward, if necessary, to the next date where the weekday number is N. Sunday is 0, Monday is 1, and so forth. If the date is already on the desired weekday, the “weekday” modifier leaves the date unchanged.

The typical wrong answer is DATE('2025-05-05', 'weekday 1', '-14 days').
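As a sanity check (my own addition, not from the original post), the same “last week’s Monday” logic in plain Python, which can be compared against the SQL output:

```python
import datetime


def last_week_monday(d: datetime.date) -> datetime.date:
    """Monday of the previous week, with Monday as the first day of the week."""
    return d - datetime.timedelta(days=d.weekday() + 7)


assert last_week_monday(datetime.date(2025, 5, 5)) == datetime.date(2025, 4, 28)  # Monday
assert last_week_monday(datetime.date(2025, 5, 4)) == datetime.date(2025, 4, 21)  # Sunday
```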

2025/5/5

A Brief Walkthrough of the LightRAG Source Code

Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2024). LightRAG: Simple and fast retrieval-augmented generation.

Overall flow:

- Use an LLM to extract entities and relations from the chunks and store them as a graph.
- Use an LLM to extract keywords from the query, retrieve entities or relations by keyword, then find the most relevant chunks, and finally concatenate everything for the LLM to produce the answer.

Extracting entities and relations and storing them as a graph

The prompts live in lightrag/prompt.py. After chunking, the LLM extracts entities and relations (plus keywords) in a specific format, and the output is parsed and stored. Judging from the code, the content_keywords from step 3 below never seem to be used anywhere.

```text
1. Identify all entities. ...
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other. ...
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

...

5. When finished, output {completion_delimiter}

Example 1:

Entity_types: [person, technology, mission, organization, location]
Text: ...
################
Output:
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
...
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}
...
("content_keywords"{tuple_delimiter}"power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}
```

Entities are stored as below. Entities with the same name are later merged (descriptions are merged too, and summarized by the LLM if they grow too long); source_id is the set of chunk ids the entity came from. The embedding is computed from dp["entity_name"] + dp["description"].

```python
dict(
    entity_name=entity_name,
    entity_type=entity_type,
    description=entity_description,
    source_id=entity_source_id,
)
```

Relations are stored as below, where edge_keywords and weight are the relationship_keywords and relationship_strength generated by the LLM earlier; when relations are merged later, the weights are summed. The embedding is computed from dp["keywords"] + dp["src_id"] + dp["tgt_id"] + dp["description"].

```python
dict(
    src_id=source,
    tgt_id=target,
    weight=weight,
    description=edge_description,
    keywords=edge_keywords,
    source_id=edge_source_id,
    metadata={"created_at": time.time()},
)
```

Entities become nodes and relations become edges, stored as a graph.

Retrieval

Use the LLM to extract high-level and low-level keywords from the user query.

```text
Given the query, list both high-level and low-level keywords. High-level keywords focus on overarching concepts or themes, while low-level keywords focus on specific entities, details, or concrete terms.

Example 1:

Query: "How does international trade influence global economic stability?"
################
Output:
{
  "high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
  "low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}
```

```python
if query_param.mode == "local":
    entities_context, relations_context, text_units_context = await _get_node_data(
        ll_keywords,
        knowledge_graph_inst,
        entities_vdb,
        text_chunks_db,
        query_param,
    )
elif query_param.mode == "global":
    entities_context, relations_context, text_units_context = await _get_edge_data(
        hl_keywords,
        knowledge_graph_inst,
        relationships_vdb,
        text_chunks_db,
        query_param,
    )
```

First, the so-called local retrieval. The low-level keywords are joined into a single string (yes — they were extracted as a list, but that structure is never used), e.g. "Trade agreements, Tariffs, Currency exchange, Imports, Exports", which the code calls query (the _get_node_data function in lightrag/operate.py).

- Retrieve the top-k entities from the entity vector store using this query.
- _find_most_related_text_unit_from_entities: collect all edges (relations) of the retrieved entities, sort all chunks by how many of these edges they contain (relation_counts) in descending order, and keep the top chunks within the token budget.
- _find_most_related_edges_from_entities: collect all edges (relations) of the retrieved entities, sort them in descending order by the tuple (entity node degree, which the code calls rank; edge weight, i.e. relationship_strength), and keep the descriptions of the top relations within the token budget.

Finally, the entities, relations, and chunks are concatenated in CSV format and handed to the LLM to produce the answer.

entities_context

```text
id,entity,type,description,rank
0,"""A CHRISTMAS CAROL""","""EVENT""","""A Christmas Carol is a literary event, being a classic story written by Charles Dickens and published in various editions.""",12
```

relations_context

```text
id,source,target,description,keywords,weight,rank,created_at
0,"""A CHRISTMAS CAROL""","""CHARLES DICKENS""","""Charles Dickens is the author of 'A Christmas Carol,' making him the creator of this literary work.""","""authorship, literary creation""",10.0,13,UNKNOWN
```

text_units_context (chunks)

```text
id,content
0,"The Project Gutenberg eBook of A Christmas Carol..."
```

Global retrieval is analogous, so only briefly: join the high-level keywords into a string, retrieve edges (relations), then _find_related_text_unit_from_relationships and _find_most_related_entities_from_relationships, and finally concatenate everything in the same way.
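For intuition, here is a minimal sketch (mine, not LightRAG’s actual parser) of how records in this delimiter-based format can be split back into entity and relation dicts; the concrete delimiter strings below are assumptions, since LightRAG defines its own constants.

```python
# Assumed delimiter values for illustration only.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_extraction_output(text: str):
    """Split the LLM extraction output into entity and relation records."""
    entities, relations = [], []
    text = text.replace(COMPLETION_DELIMITER, "")
    for record in text.split(RECORD_DELIMITER):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(TUPLE_DELIMITER)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append(dict(entity_name=fields[1], entity_type=fields[2], description=fields[3]))
        elif fields[0] == "relationship" and len(fields) == 6:
            relations.append(dict(src_id=fields[1], tgt_id=fields[2], description=fields[3],
                                  keywords=fields[4], weight=float(fields[5])))
    return entities, relations
```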

2025/1/21

ModernBERT

Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., … & Poli, I. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv preprint arXiv:2412.13663.

2024-12-19, Hugging Face: Finally, a Replacement for BERT

Exactly what the name says: a more modern BERT — faster, stronger, with the context length extended to 8k tokens — and the first encoder-only model trained with a large amount of code in its data. Compared with LLMs, BERT-family models are fast and cheap, and many tasks suit the encoder-only architecture.

Performance

ModernBERT is not only the first base-size model to beat DeBERTaV3 on GLUE; it also uses less than a fifth of DeBERTa’s memory. It is twice as fast as DeBERTa, and up to four times faster on inputs with mixed sequence lengths.

Here’s the memory (max batch size, BS) and Inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 (performance is considered on consumer GPUs) for ModernBERT and other decoder models:

- On short context, it processes fixed-length 512 token inputs faster than all other recent encoders, although slower than the original BERT and RoBERTa models.
- On long context, ModernBERT is faster than all competing encoders, processing documents 2.65 and 3 times faster than the next-fastest encoder at the BASE and LARGE sizes, respectively.
- On variable-length inputs, both GTE-en-MLM and ModernBERT models are considerably faster than all other models, largely due to unpadding.

Why modern?

Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as “Pareto improvements”): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as AlBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.

The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:

- a modernized transformer architecture
- particular attention to efficiency
- modern data scales & sources (2T tokens)

New transformer

- Replace the old positional encoding with “rotary positional embeddings” (RoPE).
- Switch out the old MLP layers for GeGLU layers, improving on the original BERT’s GeLU activation function.
- Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively. (Fewer parameters.)
- Add an extra normalization layer after embeddings, which helps stabilize training.

Efficiency

Our efficiency improvements rely on three key components:

- Alternating Attention, to improve processing efficiency,
- Unpadding and Sequence Packing, to reduce computational waste, and
- Hardware-Aware Model Design, to maximise hardware utilization.

Alternating Attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention).

Unpadding and Sequence Packing. In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we’ve relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence. Unpadding solves this issue: rather than keeping these padding tokens, we remove them all, and concatenate them into mini-batches with a batch size of one, avoiding all unnecessary computations. If you’re using Flash Attention, our implementation of unpadding is even faster than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: we go one step further by introducing our own implementation of unpadding, relying heavily on recent developments in Flash Attention’s RoPE support. This allows ModernBERT to only have to unpad once, and optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.

Paying Attention to Hardware. The architecture is tuned for a set of common consumer GPUs.

Training

We stick to the original BERT’s training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, since then shown to add overhead for no clear gains, and increase the masking rate from 15% to 30%.

Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.

Tricks

Let’s start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.

The second trick is far more uncommon: weight initialization via tiling for the larger model size, inspired by Microsoft’s Phi family of models. This one’s based on the following realization: Why initialize the ModernBERT-large’s initial weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?
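For reference, a minimal usage sketch with the transformers library; the checkpoint name answerdotai/ModernBERT-base is the one published on the Hugging Face Hub, and a recent transformers version that supports the architecture is assumed.

```python
from transformers import pipeline

# Masked-LM usage; ModernBERT is encoder-only, so fill-mask is the natural demo.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill_mask("Paris is the [MASK] of France."):
    print(pred["token_str"], round(pred["score"], 3))
```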

2024/12/24

LoRA Variants

LoRA

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Well known by now, so I'll skip the details (see here).

- "We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank'."
- "We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules"
- QLoRA paper: "We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total and that LoRA on all linear transformer block layers is required to match full finetuning performance."

At initialization, one of A or B is set to zero so that adding AB leaves the initial output identical to the original, while the other is nonzero so that the gradients are not identically zero during optimization. Note that LoRA does not save compute; it mainly saves the optimizer states that would otherwise have to be stored — see here and here.

GaLore

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.

Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory efficient than common low-rank adaptation methods such as LoRA. Our key idea is to leverage the slow changing low-rank structure of the gradient of the weight matrix, rather than trying to approximate the weight matrix itself as low rank.

Usable for both fine-tuning and pretraining. The catch: with full-parameter fine-tuning, deploying multiple tasks becomes inconvenient.

```python
for weight in model.parameters():
    grad = weight.grad
    # original space -> compact space
    lor_grad = project(grad)
    # update by Adam, Adafactor, etc.
    lor_update = update(lor_grad)
    # compact space -> original space
    update = project_back(lor_update)
    weight.data += update
```

At time step $t$, $G_t \in \mathbb R^{m\times n}$ is the negative gradient matrix of weight $W_t$. The regular update is

$$W_T = W_0 + \eta \sum_{t=0}^{T-1} \tilde G_t, \qquad \tilde G_t = \rho_t(G_t),$$

where $\eta$ is the learning rate, and $\rho_t$ is an entry-wise stateful gradient regularizer (e.g., Adam). In GaLore, the $\tilde G_t$ in the update becomes

$$\tilde G_t = P_t \,\rho_t(P_t^\top G_t Q_t)\, Q_t^\top,$$

where $P_t \in \mathbb R^{m\times r}$ and $Q_t \in \mathbb R^{n\times r}$. They are derived from the SVD of the gradient:

$$G_t = U S V^\top \approx \sum_{i=1}^{r} s_i u_i v_i^\top, \qquad P_t = [u_1, \dots, u_r], \quad Q_t = [v_1, \dots, v_r].$$

See also: 锐评 GaLore; GaLore can be a Scalable Pretraining Algorithm.

LoRA+

Hayou, S., Ghosh, N., & Yu, B. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models. arXiv preprint arXiv:2402.12354.

苏剑林. (Feb. 27, 2024). 《配置不同的学习率,LoRA还能再涨一点? 》[Blog post]. Retrieved from https://spaces.ac.cn/archives/10001

The learning rate for B in LoRA should be larger than that for A. Simple and easy to adopt.

DoRA

Liu, S. Y., Wang, C. Y., Yin, H., Molchanov, P., Wang, Y. C. F., Cheng, K. T., & Chen, M. H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv preprint arXiv:2402.09353.

Our intuitions are two-fold. Firstly, we believe that limiting LoRA to concentrate exclusively on directional adaptation while also allowing the magnitude component to be tunable simplifies the task compared to the original approach, where LoRA is required to learn adjustments in both magnitude and direction. Secondly, the process of optimizing directional updates is made more stable through weight decomposition, which we delve into more thoroughly in Section 4.2.

The first point doesn’t feel very convincing to me; I haven’t looked closely at the second. There are some other rather boring variants, which I’ll skip.
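To make the initialization remark concrete, here is a minimal LoRA linear layer sketch in PyTorch (my own illustration; the rank, scaling, and init choices follow the common convention of B = 0 and A small random):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update scaling * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # nonzero init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zero init -> identical output at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(1024, 1024))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params: 2 * r * 1024
```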

2024/8/18

Reading Notes: Understanding Pins through keyword extraction

Read this a while back; catching up on the notes. Traditional machine learning: extract candidate labels from a Pin’s multiple text sources, then use a classification model to judge whether each label is relevant to the Pin. Image information is not used (apart from text extracted from the image).

2019-08, Understanding Pins through keyword extraction

Pinterest understands text mainly through annotations: keywords or phrases of 1-6 words describing a Pin’s topic. Besides the text, an annotation carries a confidence score and a language tag (28 languages in total), for example:

- (EN, sloth sanctuary, 0.99)
- (EN, sloths, 0.95)
- (EN, costa rica, 0.90)
- (EN, carribean, 0.85)
- (EN, animals, 0.80)
- (EN, travel, 0.80)

Use cases

Annotations are used as machine learning features in many of their products, with good results:

- Search: retrieval via annotations.
- Related Pins (recommendation): cosine similarity between annotation vectors.
- Safe content filtering (classification).

How annotations are generated

Annotations dictionary

Annotations are limited to a finite vocabulary known internally as the Dictionary. The advantage of using such a dictionary over allowing annotations to be arbitrary ngrams is that it guarantees the annotations will be valid and useful phrases instead of misspellings (e.g., “recipies”), stopwords (e.g., “the”), fragments (e.g., “of liberty”) and generic phrases (e.g., “ideas”, “things”). The dictionary initially started with popular topics that were manually entered by users, but it has grown to include additional sources of terms such as search queries, hashtags, etc. A significant amount of human curation has gone into building the dictionary to ensure its quality is maintained, and we periodically use heuristics to trim out bad terms and use a spell checker to remove misspellings. We have around 100,000 terms in the dictionary for each language.

Candidate extraction

Candidate annotations are first extracted from the different text sources, which include:

- Pin title, description, url
- Board name and description
- Page title and description of the link
- Search queries that frequently lead to clicks on the Pin
- Names of objects detected in the image using a visual classifier

Extraction steps: detect the text language; tokenize; slide a window to get all 1-6-word ngrams; normalize the ngrams (a small sketch of this step follows at the end of this post). Ngrams are matched against the annotations dictionary. The extracted annotations are canonicalized to reduce duplication (e.g., “sloth” is canonicalized to “sloths” since it is not useful to have both of these annotations on a Pin). Canonical mappings are stored in the dictionary.

Features

Features are extracted for each annotation candidate to be later used for scoring.

Pin-Annotation features:

- TF-IDF
- Embedding similarity — cosine similarity between Pin embedding and annotation embedding
- Source — some text sources tend to yield higher quality annotations than others, and annotations that were extracted from multiple sources (e.g., both Pin title and board title) tend to be better than annotations that were only present in a single source (e.g., just board title)

Annotation features:

- IDF
- Category Entropy — annotations that are popular across multiple categories tend to be more generic and less useful
- Search frequency

We found our model performed better when we normalized our features such that the value distribution was similar across language and Pin popularity (i.e., number of repins).

Model

Judges whether a candidate is actually relevant to the Pin (the current post). XGBoost. Training labels are obtained through crowdsourcing where judges are asked to label for a given (Pin, annotation) pair whether the annotation is relevant to the Pin. Around 150,000 labels per language are used.
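A minimal sketch of the candidate-extraction step described above (my own illustration; the toy dictionary here stands in for Pinterest’s roughly 100,000-term Dictionary):

```python
import re

DICTIONARY = {"sloth sanctuary", "sloths", "costa rica", "animals", "travel"}  # toy stand-in


def extract_candidates(text: str, max_len: int = 6) -> set[str]:
    """Slide a window over the tokens and keep the 1-6 word ngrams found in the dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + normalize
    candidates = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in DICTIONARY:
                candidates.add(ngram)
    return candidates


print(extract_candidates("Visiting a Sloth Sanctuary in Costa Rica - travel tips"))
# {'sloth sanctuary', 'costa rica', 'travel'}
```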

2024/1/10

LLM-based Text2SQL

Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B., & Zhou, J. (2023). Text-to-SQL empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363.

My summary: an experimental report on prompt engineering for LLMs on Text2SQL datasets. On the two benchmarks evaluated, it is the best among open-source approaches. The proposed prompting scheme, DAIL-SQL, combines several existing RAG-style methods.

Datasets

Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.

Looking at the provided Data Examples, even the EXTRA HARD cases involve databases and SQL that are quite simple compared with real-world workloads.

```sql
-- [Extra Hard] What is the average life expectancy in the countries where English is not the official language?
SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN (
    SELECT T1.name
    FROM country AS T1
    JOIN country_language AS T2 ON T1.code = T2.country_code
    WHERE T2.language = "English" AND T2.is_official = "T"
)
```

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) contains over 12,751 unique question-SQL pairs, 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education, etc.

Metrics

- Execution Accuracy. This metric goes by many names: whether the generated SQL’s execution result matches that of the gold SQL.
- Exact Set Match. Decompose the SQL into clauses and each clause into a set of tokens; using sets avoids ordering issues, e.g. SELECT col1, col2 and SELECT col2, col1 are equivalent. Details here.
- Valid Efficiency Score. The execution result must first match the gold answer; efficiency is then evaluated.

Prompts

Question Representation

Basic Prompt. Provide the schemas of the relevant tables, then the QA, and prompt the model to complete starting from A: SELECT. No instruction.

```text
Table continents, columns = [ContId, Continent]
Table countries, columns = [CountryId, CountryName, Continent]

Q: How many continents are there?
A: SELECT
```

Text Representation Prompt. The Basic Prompt plus instructions.

```text
Given the following database schema:
continents: ContId, Continent
countries: CountryId, CountryName, Continent

Answer the following: How many continents are there?
SELECT
```

OpenAI Demonstration Prompt. Frame it as SQL for the model to complete, with the instructions placed in comments.

```text
### Complete sqlite SQL query only and with no explanation
### SQLite SQL tables, with their properties:
#
# continents (ContId, Continent)
# countries (CountryId, CountryName, Continent)
#
### How many continents are there?
SELECT
```

Code Representation Prompt.

```sql
/* Given the following database schema: */
CREATE TABLE continents (
    ContId int primary key,
    Continent text,
    foreign key (ContId) references countries (Continent)
);
CREATE TABLE countries (
    CountryId int primary key,
    CountryName text,
    Continent int,
    foreign key (Continent) references continents (ContId)
);

/* Answer the following: How many continents are there? */
SELECT
```

Alpaca SFT Prompt.

```text
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a SQL query to answer the question "How many continents are there?"

### Input:
continents (ContId, Continent)
countries (CountryId, CountryName, Continent)

### Response:
SELECT
```

In-Context Learning

Consider k-shot: select k examples from the training set (question-SQL pairs) and put them in the prompt.

- Random.
- Question Similarity Selection. kNN by question similarity.
- Masked Question Similarity Selection. Mask the domain-specific table names, column names, values, etc. in the question, then kNN.
- Query Similarity Selection. It employs a preliminary model to generate SQL query $s'$ using target question and database, where this generated $s'$ can be regarded as an approximation of target SQL query $s^\ast$. Then it encodes queries from examples into binary discrete syntax vectors according to their keywords. After that, it chooses $k$ examples by considering both similarity to the approximated query $s'$ and diversity among selected examples.

Example Organization

Full-Information Organization.

```text
/* Given the following database schema: */
${DATABASE_SCHEMA}
/* Answer the following: How many authors are there? */
SELECT COUNT(*) FROM authors

/* Given the following database schema: */
${DATABASE_SCHEMA}
/* Answer the following: How many farms are there? */
SELECT COUNT(*) FROM farm

${TARGET_QUESTION}
```

SQL-Only Organization.

```text
/* Some SQL examples are provided based on similar problems: */
SELECT COUNT(*) FROM authors
SELECT COUNT(*) FROM farm

${TARGET_QUESTION}
```

DAIL-SQL

The method proposed in this paper, stitching together all of the above. It uses the Code Representation Prompt for the question.

Selection. Consider both questions and queries to select candidates. Specifically, DAIL Selection first masks domain-specific words in both target question $q$ and example questions $q_i$ in the candidate set. It then ranks the candidate examples based on the Euclidean distance between the embeddings of masked $q$ and $q_i$. Simultaneously, it calculates the query similarity between the pre-predicted SQL query $s'$ and $s_i$ in the candidate set. Finally, the selection criterion prioritizes the sorted candidates by question similarity with a query similarity greater than a predefined threshold. In this way, the selected top $k$ examples have good similarity with both question and query.

Organization. Preserve the mapping information between questions and SQL queries and also improve the token efficiency. It stitches the two organizations above together while saving some tokens.

```text
/* Some example questions and corresponding SQL queries are provided based on similar problems: */
/* Answer the following: How many authors are there? */
SELECT COUNT(*) FROM authors
/* Answer the following: How many farms are there? */
SELECT COUNT(*) FROM farm

${TARGET_QUESTION}
```

With this prompt they reach SOTA on GPT-4. Fine-tuning is done on small open-source LLMs, since tuning GPT-4 was too expensive. They fine-tune with zero-shot prompts and find that the fine-tuned zero-shot performance far exceeds the pre-fine-tuning few-shot performance, but adding few-shot examples after fine-tuning does not help and may even hurt.

Misc

- Self-Consistency Improves Chain of Thought Reasoning in Language Models. The paper uses this trick for a very small gain, at the cost of several times the latency.
- Awesome-Text2SQL
- We Fine-Tuned GPT-4 to Beat the Industry Standard for Text2SQL. They fine-tuned GPT-4, yet the result still lags behind others’ prompt engineering. They argue the others did dataset-specific tricks while their approach is more general; hard to say how it holds up in practice.
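A rough sketch of the example-selection-plus-prompt-assembly flow described above (my own simplification of DAIL-SQL; embed(), query_sim(), the pre-predicted SQL, and the similarity threshold are placeholders):

```python
import numpy as np


def masked(question: str, schema_terms: set[str]) -> str:
    """Mask domain-specific tokens (table/column names, values) with a placeholder."""
    return " ".join("<mask>" if tok.lower() in schema_terms else tok for tok in question.split())


def select_examples(target_q, pre_sql, candidates, embed, query_sim, schema_terms, k=4, tau=0.9):
    """Rank candidates by masked-question distance; keep those whose query similarity exceeds tau."""
    tq = embed(masked(target_q, schema_terms))
    ranked = sorted(candidates, key=lambda c: np.linalg.norm(tq - embed(c["masked_question"])))
    chosen = [c for c in ranked if query_sim(pre_sql, c["sql"]) > tau]
    return chosen[:k]


def build_prompt(examples, schema_ddl, target_q):
    """Assemble a DAIL-style prompt: question/SQL pairs, then the code-representation schema."""
    shots = "\n".join(f"/* Answer the following: {e['question']} */\n{e['sql']}" for e in examples)
    return (
        "/* Some example questions and corresponding SQL queries are provided based on similar problems: */\n"
        f"{shots}\n\n/* Given the following database schema: */\n{schema_ddl}\n"
        f"/* Answer the following: {target_q} */\nSELECT"
    )
```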

2023/12/25