Dissecting Manus: A Deep Dive into the Sandbox Architecture

Manus is a general-purpose AI agent that runs inside a full sandbox environment and delivers production-grade results [1]. Unlike a traditional single-LLM chatbot, Manus uses a more complex multi-agent coordination system: a user prompt first goes to a planner agent, which decomposes the task into a series of subtasks; an executor agent then completes those subtasks using a variety of tools, from web browsing to terminal commands [2]. For the executor agent to work like a human researcher or developer, Manus has to give it a complete cloud computer environment - and that is the core reason Manus chose E2B as its sandbox infrastructure.

Manus's sandbox environment is, at its core, a Firecracker microVM provided by the E2B platform [3]. Firecracker is a lightweight virtualization technology developed at AWS that can boot a complete virtual computer environment in roughly 150 milliseconds; by comparison, Docker containers can take 10-20 seconds to start, and, more importantly, a container solution cannot offer all the capabilities of a full operating system [4]. Manus co-founder Tao Zhang puts it this way: "Manus doesn't just run snippets of code. It uses 27 different tools and needs E2B to provide a full virtual computer so it can work like a real person" [5]. These tools span the full stack: a Chromium browser (visiting URLs, saving images, scrolling pages), terminal commands (creating, editing, and deleting files), and runtime environments for Python, JavaScript, Bash, and more.

On the deployment side, Manus uses E2B's self-hosting mode [6]. That means Manus runs E2B on its own machines rather than relying on E2B's cloud service, a choice that gives Manus full control over its sandbox infrastructure while keeping operations simple. Tao Zhang: "E2B self-hosting is easy to manage; we were able to implement and deploy it in half a day" [7]. Building this infrastructure in-house would have been technically feasible, but it would require a full-time infrastructure team of three to five people working for months to develop and maintain it; for most teams building agent platforms, that is a distraction from product and research work. By choosing E2B, Manus could ship quickly and focus on improving its multi-agent orchestration system instead of reinventing cloud runtime infrastructure.

Session persistence is one of the key properties of the E2B sandbox in Manus's architecture. A sandbox session can keep running for hours, with the agent deciding at each iteration which action to execute in the sandbox [8]. For paying users, state saved in an E2B sandbox can persist for up to 14 days, which matters for complex tasks that take tens of minutes to complete. More importantly, sessions support pause and resume: when the agent needs the user to confirm something, needs credentials for a particular site, or has to get past a "verify you are human" check, it can pause the current task and resume once the condition is met. This lets Manus's agents handle scenarios that require human intervention while preserving context.

From a multi-tenant isolation perspective, E2B gives every user an independent sandbox instance [9]. This design ensures that different users' tasks cannot interfere with each other; each user runs in a fully isolated virtual computer, which matters for Manus's fast-growing user base. As user numbers spike, E2B's scalability lets Manus allocate an isolated sandbox per user without worrying about resource contention or security. The strong isolation boundary of a Firecracker microVM ensures that malicious code or abnormal operations in one sandbox cannot affect the host machine or other users' sandboxes - running untrusted code safely is the core design goal of E2B.

The choice of E2B is also about architectural evolution and the long term [10]. Tao Zhang: "We chose E2B because we were thinking about the future." Manus's goal is for agents to run on a variety of operating systems, including Windows and Android; since not all information and services live on the web, environments like a virtual Android device would significantly expand the range of work agents can reach and complete. That cross-OS vision requires a sandbox platform with enough technical flexibility and scalability, and E2B's architecture meets that need.

Real-world usage has already demonstrated the value of the sandbox architecture. The editor-in-chief of the Financial Times' Chinese edition showed a 3D-printed model, generated through Manus, representing the past ten years of US national debt history - created entirely with Manus [11]. Another case comes from Dubai, where a social media consultant used Manus to develop a comprehensive content strategy for a client: given the client's website and social media profiles, the agent produced a complete one-year strategy covering audience targeting, headlines, content, hooks, and channel-specific recommendations. The resulting document, more than 50 pages long, cost roughly $6-7 to produce but would be worth thousands of dollars on the consulting market. These cases show the end-to-end task execution Manus achieves through E2B sandboxes: from data analysis to content creation, from visual design to strategic planning, all of it completed safely and reliably in isolated virtual computer environments.

References:

[1] Manus official website
[2] Manus documentation - Introduction
[3] E2B blog - How Manus Uses E2B to Provide Agents with Virtual Computers
[4] Firecracker microVMs - AWS open-source project
[5] E2B official documentation
[6] E2B - Self-Hosting deployment guide
[7] E2B blog - AI Agents in 2024
[8] E2B documentation - Sandbox session management
[9] E2B documentation - Multi-tenant isolation
[10] E2B blog - Future Plans for AI Agent Sandboxes
[11] Manus case studies - Financial Times & Dubai Consultant

2026/1/1

40+ Claude Code Tips: From Basics to Advanced

Here are my tips for getting the most out of Claude Code, including a custom status line script, cutting the system prompt in half, using Gemini CLI as Claude Code's minion, and Claude Code running itself in a container. Also includes the dx plugin.

## Table of Contents

- Tip 0: Customize your status line
- Tip 1: Learn a few essential slash commands (/usage, /chrome, /mcp, /stats, /clear)
- Tip 2: Talk to Claude Code with your voice
- Tip 3: Break down large problems into smaller ones
- Tip 4: Using Git and GitHub CLI like a pro
- Tip 5: AI context is like milk; it's best served fresh and condensed!
- Tip 6: Getting output out of your terminal
- Tip 7: Set up terminal aliases for quick access
- Tip 8: Proactively compact your context
- Tip 9: Complete the write-test cycle for autonomous tasks
  - Creative testing strategies
- Tip 10: Cmd+A and Ctrl+A are your friends
- Tip 11: Use Gemini CLI as a fallback for blocked sites
- Tip 12: Invest in your own workflow
- Tip 13: Search through your conversation history
- Tip 14: Multitasking with terminal tabs
- Tip 15: Slim down the system prompt
- Tip 16: Git worktrees for parallel branch work
- Tip 17: Manual exponential backoff for long-running jobs
- Tip 18: Claude Code as a writing assistant
- Tip 19: Markdown is the s**t
- Tip 20: Use Notion to preserve links when pasting
- Tip 21: Containers for long-running risky tasks
  - Advanced: Orchestrating a worker Claude Code in a container
  - Advanced: Multi-model orchestration
- Tip 22: The best way to get better at using Claude Code is by using it
- Tip 23: Clone and half-clone conversations
  - Half-clone to reduce context
- Tip 24: Use realpath to get absolute paths
- Tip 25: Understanding CLAUDE.md vs Skills vs Slash Commands vs Plugins
- Tip 26: Interactive PR reviews
- Tip 27: Claude Code as a research tool
- Tip 28: Mastering different ways of verifying its output
- Tip 29: Claude Code as a DevOps engineer
- Tip 30: Keep CLAUDE.md simple and concise
- Tip 31: Claude Code as the universal interface
- Tip 32: It's all about choosing the right level of abstraction
- Tip 33: Audit your approved commands
- Tip 34: Write lots of tests (and use TDD)
- Tip 35: Be braver in the unknown; iterative problem solving
- Tip 36: Running bash commands and agents in the background
- Tip 37: The era of personalized software is here
- Tip 38: Navigating and editing your input box
- Tip 39: Spend some time planning, but also prototype quickly
- Tip 40: Simplify overcomplicated code
- Tip 41: Automation of automation
- Tip 42: Share your knowledge and contribute where you can
- Tip 43: Keep learning!
- Install the dx plugin

## Tip 0: Customize your status line

You can customize the status line at the bottom of Claude Code to show useful info. I set mine up to show the model, current directory, git branch (if any), uncommitted file count, sync status with origin, and a visual progress bar for token usage. It also shows a second line with my last message so I can see what the conversation was about:

```
Opus 4.5 | 📁claude-code-tips | 🔀main (scripts/context-bar.sh uncommitted, synced 12m ago) | ██░░░░░░░░ 18% of 200k tokens
💬 This is good. I don't think we need to change the documentation as long as we don't say that the default color is orange el...
```

This is especially helpful for keeping an eye on your context usage and remembering what you were working on. The script also supports 10 color themes (orange, blue, teal, green, lavender, rose, gold, slate, cyan, or gray). To set this up, you can use this sample script and check the setup instructions.

## Tip 1: Learn a few essential slash commands

There are a bunch of built-in slash commands (type / to see them all).
Here are a few worth knowing:

### /usage

Check your rate limits:

```
Current session
███████ 14% used
Resets 3:59pm (Asia/Tokyo)

Current week (all models)
█████████████ 26% used
Resets Jan 3, 2026, 5:59am (Asia/Tokyo)
```

### /chrome

Toggle Claude's native browser integration:

```
> /chrome
Chrome integration enabled
```

### /mcp

Manage MCP (Model Context Protocol) servers:

```
Manage MCP servers
1 server

❯ 1. playwright  ✔ connected · Enter to view details

MCP Config locations (by scope):
 • User config (available in all your projects):
   • /Users/yk/.claude.json
```

### /stats

View your usage statistics with a GitHub-style activity graph:

```
      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
      ·············································▒▒▒▓▒░█
Mon   ··············································▒█░▓░█
      ·············································▒▒██▓░█
Wed   ·············································░▒█▒▓░█
      ············································░▓▒█▓▓░
Fri   ············································░▓░█▓▓█
      ············································▓▒░█▓▒█

Less ░ ▒ ▓ █ More

Favorite model: Opus 4.5
Total tokens: 12.1m
Sessions: 1.8k
Longest session: 20h 40m 45s
Current streak: 44 days
Longest streak: 45 days
Active days: 49/51
Peak hour: 17:00-18:00

You've used ~145x more tokens than Brave New World
```

### /clear

Clear the conversation and start fresh.

## Tip 2: Talk to Claude Code with your voice

I found that you can communicate much faster with your voice than typing with your hands. Using a voice transcription system on your local machine is really helpful for this. On my Mac, I've tried a few different options:

- superwhisper
- MacWhisper
- Super Voice Assistant (open source, I built it with Claude Code)

You can get more accuracy by using a hosted service, but I found that a local model is strong enough for this purpose.
Even when there are mistakes or typos in the transcription, Claude is smart enough to understand what you're trying to say. Sometimes you need to say certain things extra clearly, but overall local models work well enough. For example, in this screenshot you can see that Claude was able to interpret mistranscribed words like "ExcelElanishMark" and "advast" correctly as "exclamation mark" and "Advanced":

I think the best way to think about this is like you're trying to communicate with your friend. Of course, you can communicate through texts. That might be easier for some people, or emails, right? That's totally fine. That's what most people seem to do with Claude Code. But if you want to communicate faster, why wouldn't you get on a quick phone call? You can just send voice messages. You don't need to literally have a phone call with Claude Code. Just send a bunch of voice messages. It's faster, at least for me, as someone who's practiced the art of speaking a lot over the past number of years. But I think for a majority of people, it's going to be faster too.

A common objection is "what if you're in a room with other people?" I just whisper using earphones - I personally like Apple EarPods (not AirPods). They're affordable, high quality enough, and you just whisper into them quietly. I've done it in front of other people and it works well. In offices, people talk anyway - instead of talking to coworkers, you're talking quietly to your voice transcription system. I don't think there's any problem with that.

This method works so well that it even works on a plane. The cabin is loud enough that other people won't hear you, but if you speak close enough to the mic, your local model can still understand what you're saying. (In fact, I'm writing this very paragraph using that method on a flight.)

## Tip 3: Break down large problems into smaller ones

This is one of the most important concepts to master.
It's exactly the same as traditional software engineering - the best software engineers already know how to do this, and it applies to Claude Code too.

If you find that Claude Code isn't able to one-shot a difficult problem or coding task, ask it to break it down into multiple smaller issues. See if it can solve an individual part of that problem. If it's still too hard, see if it can solve an even smaller sub-problem. Keep going until everything is solvable.

Essentially, instead of going from A to B:

You can go from A to A1 to A2 to A3, then to B:

A good example of this is when I was building my own voice transcription system. I needed a system that could let the user select and download a model, take keyboard shortcuts, start transcribing, put the transcribed text at the user's cursor, and wrap all of this in a nice UI. That's a lot. So I broke it down into smaller tasks. First, I created an executable that would just download a model, nothing else. Then I created another one that would just record voice, nothing else. Then another one that would just transcribe pre-recorded audio. I completed them one by one like that before combining them at the end.

Highly related to this: your problem-solving skills and software engineering skills are still highly relevant in the world of agentic coding and Claude Code. It's able to solve a lot of problems on its own, but when you apply your general problem-solving and software engineering skills, it becomes a lot more powerful.

## Tip 4: Using Git and GitHub CLI like a pro

Just ask Claude to handle your Git and GitHub CLI tasks. This includes committing (so you don't have to write commit messages manually), branching, pulling, and pushing. I personally allow pull automatically but not push, because push is riskier - if something goes wrong with a pull, it doesn't contaminate the origin, but a bad push does.

For GitHub CLI (gh), there's a lot you can do. One thing I started doing more after using Claude Code is creating draft PRs.
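A minimal sketch of that draft-PR flow: have Claude write the PR body into a local file you can review first, then open the PR as a draft. The title, body text, and file path below are made-up placeholders, not from my actual workflow:

```shell
# Draft the PR body in a local file first so you can review it yourself.
# (The title and body here are placeholder examples.)
cat > /tmp/pr-body.md <<'EOF'
## Summary
- switch the CI poller to manual exponential backoff

## Test plan
- verified against one finished and one in-progress run
EOF

# After reviewing /tmp/pr-body.md, open the PR as a draft:
#   gh pr create --draft --title "CI poller backoff" --body-file /tmp/pr-body.md
```

Because the PR starts as a draft, nothing is "ready for review" until you flip it yourself.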
This lets Claude Code handle the PR creation process with low risk - you can review everything before marking it ready for review. And it turns out, gh is pretty powerful. You can even send arbitrary GraphQL queries through it. For example, you can find the exact times at which GitHub PR descriptions were edited:

```
⏺ Bash(gh api graphql -f query='
      query {
        repository(owner: "...", name: "...") {
          pullRequest(number: ...) {
            userContentEdits(first: 100) {
              nodes {
                editedAt
                editor { login }
              }
            }
          }
        }
      }')

⏺ Here's the full edit history for your PR description:

  | # | Edited At (UTC)     | Editor |
  |---|---------------------|--------|
  | 1 | 2025-12-01 00:08:34 | ykdojo |
  | 2 | 2025-12-01 15:57:21 | ykdojo |
  | 3 | 2025-12-01 16:24:33 | ykdojo |
  | 4 | 2025-12-01 16:27:00 | ykdojo |
  | 5 | 2025-12-04 00:40:02 | ykdojo |
  ...
```

## Tip 5: AI context is like milk; it's best served fresh and condensed!

When you start a new conversation with Claude Code, it performs the best because it doesn't have the added complexity of processing the previous context from earlier parts of the conversation. But as you talk to it longer and longer, the context grows and performance tends to go down. So it's best to start a new conversation for every new topic, or whenever performance starts to degrade.

## Tip 6: Getting output out of your terminal

Sometimes you want to copy and paste Claude Code's output, but copying directly from the terminal isn't always clean. Here are a few ways to get content out more easily:

- Clipboard directly: On Mac, ask Claude to use pbcopy (on Linux, xclip or xsel) to send output straight to your clipboard
- Write to a file: Have Claude put the content in a file, then ask it to open it in VS Code (or your favorite editor) so you can copy from there. You can also specify a line number, so you can ask Claude to open the specific line it just edited. For markdown files, once it's open in VS Code, you can use Cmd+Shift+P (or Ctrl+Shift+P on Linux/Windows) and select "Markdown: Open Preview" to see the rendered version
- Opening URLs: If there's a URL you want to examine yourself, ask Claude to open it in your browser. On Mac, you can ask it to use the open command, but in general asking to open in your favorite browser should work on any platform
- GitHub Desktop: You can ask Claude to open the current repo in GitHub Desktop. This is particularly useful when it's working in a non-root directory - for example, if you asked it to create a git worktree in a different directory and you haven't opened Claude Code from there yet

You can combine some of these together too. For example, if you want to edit a GitHub PR description, instead of having Claude edit it directly (which it might mess up), you can have it copy the content into a local file first. Let it edit that, check the result yourself, and once it looks good, have it copy and paste it back into the GitHub PR. That works really well. Or if you want to do that yourself, you can just ask it to open the file in VS Code, or have it hand you the content via pbcopy so you can paste it manually.

Of course, you can run these commands yourself, but if you find yourself doing it repetitively, it's helpful to let Claude run them for you.

## Tip 7: Set up terminal aliases for quick access

Since I use the terminal more because of Claude Code, I found it helpful to set up short aliases so I can launch things quickly. Here are the ones I use:

- c for Claude Code (this is the one I use the most)
- ch for Claude Code with Chrome integration
- gb for GitHub Desktop
- co for VS Code
- q for going to the project directory where I have most projects. From there I can manually cd into an individual folder to work on that project, or I can just launch Claude Code with c to let it basically have access to any project it needs to access.
To set these up, add lines like this to your shell config file (~/.zshrc or ~/.bashrc):

```bash
alias c='claude'
alias ch='claude --chrome'
alias gb='github'
alias co='code'
alias q='cd ~/Desktop/projects'
```

Once you have these aliases, you can combine them with flags: c -c continues your last conversation, and c -r shows a list of recent conversations to resume. These work with ch too (ch -c, ch -r) for Chrome sessions.

## Tip 8: Proactively compact your context

There's a /compact command in Claude Code that summarizes your conversation to free up context space. Automatic compaction also happens when the full available context is filled. The total available context window for Opus 4.5 is currently 200k tokens, and 45k of that is reserved for automatic compaction. About 10% of the total 200k is automatically filled with the system prompt, tools, memory, and dynamic context.

But I found that it's better to compact proactively and tune it manually. I turned off auto-compact with /config so I have more context available for the main conversation and more control over when and how compaction happens.

The way I do this is to ask Claude to write a handoff document before starting fresh. Something like: "Put the rest of the plan in the system-prompt-extraction folder as HANDOFF.md. Explain what you have tried, what worked, what didn't work, so that the next agent with fresh context is able to just load that file and nothing else to get started on this task and finish it up."

Claude will create a file summarizing the current state of work:

```
⏺ Write(experiments/system-prompt-extraction/HANDOFF.md)
  ⎿ Wrote 129 lines to experiments/system-prompt-extraction/HANDOFF.md

# System Prompt Slimming - Handoff Document

## Goal
Reduce Claude Code's system prompt by ~45% (currently at 11%, need ~34% more).

## Current Progress

### What's Been Done
- **Backup/restore system**: `backup-cli.sh` and `restore-cli.sh` with SHA256 verification
- **Patch system**: `patch-cli.js` that restores from backup then applies patches
...
```

After Claude writes it, review it quickly. If something's missing, ask for edits: "Did you add a note about iteratively testing instead of trying to do everything all at once?"

Then start a fresh conversation. For the fresh agent, you can just give the path of the file and nothing else, like this, and it should work just fine:

```
> experiments/system-prompt-extraction/HANDOFF.md
```

In subsequent conversations, you can ask the agent to update the document for the next agent.

I've also created a /handoff slash command that automates this - it checks for an existing HANDOFF.md, reads it if present, then creates or updates it with the goal, progress, what worked, what didn't, and next steps. You can find it in the commands folder, or install it via the dx plugin.

## Tip 9: Complete the write-test cycle for autonomous tasks

If you want Claude Code to run something autonomously, like git bisect, you need to give it a way to verify results. The key is completing the write-test cycle: write code, run it, check the output, and repeat.

For example, let's say you're working on Claude Code itself and you notice /compact stopped working and started throwing a 400 error. A classic tool to find the exact commit that caused this is git bisect. The nice thing is you can let Claude Code run bisect on itself, but it needs a way to test each commit. For tasks that involve interactive terminals like Claude Code, you can use tmux.
The pattern is:

1. Start a tmux session
2. Send commands to it
3. Capture the output
4. Verify it's what you expect

Here's a simple example of testing if /context works:

```bash
tmux kill-session -t test-session 2>/dev/null
tmux new-session -d -s test-session
tmux send-keys -t test-session 'claude' Enter
sleep 2
tmux send-keys -t test-session '/context' Enter
sleep 1
tmux capture-pane -t test-session -p
```

Once you have a test like this, Claude Code can run git bisect and automatically test each commit until it finds the one that broke things.

This is also an example of why your software engineering skills still matter. If you're a software engineer, you probably know about tools like git bisect. That knowledge is still really valuable when working with AI - you just apply it in new ways.

Another example is simply writing tests. After you let Claude Code write some code, if you want to test it, you can just let it write tests for itself too. And let it run on its own and fix things if it can. Of course, it doesn't always go in the right direction and you need to supervise it sometimes, but it's able to do a surprising amount of coding tasks on its own.

### Creative testing strategies

Sometimes you need to be creative with how you complete the write-test cycle. For example, if you're building a web app, you could use Playwright MCP, Chrome DevTools MCP, or Claude's native browser integration (through /chrome). I haven't tried Chrome DevTools yet, but I've tried Playwright and Claude's native integration. Overall, Playwright generally works better. It does use a lot of context, but the 200k context window is normally enough for a single task or a few smaller tasks.

The main difference between these two seems to be that Playwright focuses on the accessibility tree (structured data about page elements) rather than taking screenshots. It does have the ability to take screenshots, but it doesn't normally use them to take actions.
On the other hand, Claude's native browser integration focuses more on taking screenshots and clicking on elements by specific coordinates. It can click on random things sometimes, and the whole process can be slow. This might improve over time, but by default I would go with Playwright for most tasks that aren't visually intensive. I'd only use Claude's native browser integration if I need a logged-in state without having to provide credentials (since it runs in your own browser profile), or if it specifically needs to click on things visually using their coordinates.

This is why I disable Claude's native browser integration by default and use it through the ch shortcut I defined previously. That way Playwright handles most browser tasks, and I only enable Claude's native integration when I specifically need it.

Additionally, you can ask it to use accessibility tree refs instead of coordinates. Here's what I put in my CLAUDE.md for this:

```markdown
# Claude for Chrome

- Use `read_page` to get element refs from the accessibility tree
- Use `find` to locate elements by description
- Click/interact using `ref`, not coordinates
- NEVER take screenshots unless explicitly requested by the user
```

In my personal experience, I've also had a situation where I was working on a Python library at Daft and needed to test a version I built locally on Google Colab. The trouble is it's hard to build a Python library with a Rust backend on Google Colab - it doesn't seem to work that well. So I needed to build a wheel locally and then upload it manually so that I could run it on Google Colab. I also tried monkey patching, which worked well in the short term while I waited for the whole wheel to build locally.

Another situation I encountered: I needed to test something on Windows, but I'm not running a Windows machine. My CI tests on the same repo were failing because we had some issues with Rust on Windows, and I had no way of testing locally. So I created a draft PR with all the changes, and another draft PR with the same changes plus enabling Windows CI runs on non-main branches. I instructed Claude Code to do all of that, and then I tested the CI directly in that new branch.

## Tip 10: Cmd+A and Ctrl+A are your friends

I've been saying this for a few years now: Cmd+A and Ctrl+A are your friends in the world of AI. This applies to Claude Code too.

Sometimes you want to give Claude Code a URL, but it can't access it directly. Maybe it's a private page (not sensitive data, just not publicly accessible), or something like a Reddit post that Claude Code has trouble fetching. In those cases, you can just select all the content you see (Cmd+A on Mac, Ctrl+A on other platforms), copy it, and paste it directly into Claude Code. It's a pretty powerful method.

This works great for terminal output too. When I have output from Claude Code itself or any other CLI application, I can use the same trick: select all, copy, and paste it back to CC. Pretty helpful.

Some pages don't lend themselves well to select-all by default - but there are tricks to get them into a better state first. For example, with Gmail threads, click Print All to get the print preview (but cancel the actual print). That page shows all emails in the thread expanded, so you can Cmd+A the entire conversation cleanly.

This applies to any AI, not just Claude Code.

## Tip 11: Use Gemini CLI as a fallback for blocked sites

Claude Code's WebFetch tool can't access certain sites, like Reddit. But you can work around this by creating a skill that tells Claude to use Gemini CLI as a fallback. Gemini has web access and can fetch content from sites that Claude can't reach directly. This uses the same tmux pattern from Tip 9 - start a session, send commands, capture output.

The skill file goes in ~/.claude/skills/reddit-fetch/SKILL.md. See skills/reddit-fetch/SKILL.md for the full content.
Skills are more token-efficient because Claude Code only loads them when needed. If you want something simpler, you can put a condensed version in ~/.claude/CLAUDE.md instead, but that gets loaded into every conversation whether you need it or not.

I tested this by asking Claude Code to check how Claude Code skills are regarded on Reddit - a bit meta. It goes back and forth with Gemini for a while, so it's not fast, but the report quality was surprisingly good. Obviously, you'll need to have Gemini CLI installed for this to work. You can also install this skill via the dx plugin.

## Tip 12: Invest in your own workflow

Personally, I've created my own voice transcription app from scratch with Swift. I created my own custom status line from scratch using Claude Code, this one with bash. And I created my own system for simplifying the system prompt in Claude Code's minified JavaScript file.

But you don't have to go overboard like that. Just taking care of your own CLAUDE.md, making sure it's as concise as possible while still helping you achieve your goals - stuff like that is helpful. And of course, learning these tips, these tools, and their most important features. All of these are investments in the tools you use to build whatever you want to build. I think it's important to spend at least a little bit of time on that.

## Tip 13: Search through your conversation history

You can ask Claude Code about your past conversations, and it'll help you find and search through them. All your conversation history is stored locally in ~/.claude/. Project-specific conversations are in ~/.claude/projects/, with folder names based on the project path (slashes become dashes). For example, conversations for a project at /Users/yk/Desktop/projects/claude-code-tips would be stored in:

```
~/.claude/projects/-Users-yk-Desktop-projects-claude-code-tips/
```

Each conversation is a .jsonl file.
You can search through them with basic bash commands:

```bash
# Find all conversations mentioning "Reddit"
grep -l -i "reddit" ~/.claude/projects/-Users-yk-Desktop-projects-*/*.jsonl

# Find today's conversations about a topic
find ~/.claude/projects/-Users-yk-Desktop-projects-*/*.jsonl -mtime 0 -exec grep -l -i "keyword" {} \;

# Extract just the user messages from a conversation (requires jq)
cat ~/.claude/projects/.../conversation-id.jsonl | jq -r 'select(.type=="user") | .message.content'
```

Or just ask Claude Code directly: "What did we talk about regarding X today?" and it'll search through the history for you.

## Tip 14: Multitasking with terminal tabs

When running multiple Claude Code instances, staying organized is more important than any specific technical setup like Git worktrees. I recommend focusing on at most three or four tasks at a time.

My personal method is what I would call a "cascade" - whenever I start a new task, I just open a new tab on the right. Then I sweep left to right, left to right, going from oldest tasks to newest. The general direction stays consistent, except when I need to check on certain tasks, get notifications, etc.

Here's what my setup typically looks like:

In this example:

- Leftmost tab - A persistent tab running my voice transcription system (always stays here)
- Second tab - Setting up a Docker container
- Third tab - Checking disk usage on my local machine
- Fourth tab - Working on an engineering project
- Fifth tab (current) - Writing this very tip

## Tip 15: Slim down the system prompt

Claude Code's system prompt and tool definitions take up about 20k tokens (~10% of your 200k context) before you even start working. I created a patch system that reduces this to about 9k tokens - saving around 11,000 tokens (~55% of the overhead).
| Component | Before | After | Savings |
|---|---|---|---|
| System prompt | 3.1k | 1.8k | 1,300 tokens |
| System tools | 15.6k | 7.1k | 8,500 tokens |
| Static total | ~19k | ~9k | ~10,000 tokens (~52.5%) |
| Allowed tools list | ~1k | 0 | ~1k tokens |
| Total | ~20k | ~9k | ~11k tokens (~55%) |

The allowed tools list is dynamic context - it grows as you approve more bash commands. The patch removes this list entirely. Here's what /context looks like before and after patching:

- Unpatched: ~20k (10%)
- Patched: ~9k (4%)

The patches work by trimming verbose examples and redundant text from the minified CLI bundle while keeping all the essential instructions. I've tested this extensively and it works well. It feels more raw - more powerful, but maybe a little less regulated, which makes sense because the system instruction is shorter. It feels more like a pro tool when you use it this way. I really enjoy starting with lower context because you have more room before it fills up, which gives you the option to continue conversations a bit longer. That's definitely the best part of this strategy.

Check out the system-prompt folder for the patch scripts and full details on what gets trimmed.

Why patching? Claude Code has flags that let you provide a simplified system prompt from a file (--system-prompt or --system-prompt-file), so that's another way to go about it. But for the tool descriptions and the dynamic approved tools list, there's no official option to customize them. Patching the CLI bundle is the only way. Since my patch system handles everything in one unified approach, I'm keeping it this way for now. I might re-implement the system prompt portion using the flag in the future.

Requirements: These patches require npm installation (npm install -g @anthropic-ai/claude-code). The patching works by modifying the JavaScript bundle (cli.js) - other installation methods may produce compiled binaries that can't be patched this way.
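If you want a rough sanity check on bundle-size changes without opening /context, one crude heuristic is that a token averages around 4 bytes of English text. That ratio is an assumption (it is not Claude's actual tokenizer), and this helper is mine, not part of the patch scripts:

```shell
# Crude token estimate: assume ~4 bytes per token on average (rule of thumb).
estimate_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $((bytes / 4))
}

# Demo on a synthetic 8000-byte file; in practice, point it at cli.js
# before and after patching and compare the two numbers.
head -c 8000 /dev/zero | tr '\0' 'x' > /tmp/fake-cli.js
estimate_tokens /tmp/fake-cli.js   # prints 2000
```

This only gives a ballpark figure; /context remains the ground truth for what actually lands in the window.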
Important: If you want to keep your patched system prompt, disable auto-updates by adding export DISABLE_AUTOUPDATER=1 to ~/.zshenv (not ~/.zshrc). The reason for .zshenv is that it's sourced for ALL zsh invocations, including non-interactive shells and tmux sessions. .zshrc only gets sourced for interactive shells, so tmux-based workflows (like the ones in Tips 9, 11, and 21) would auto-update without .zshenv. You can manually update later with npm update -g @anthropic-ai/claude-code when you're ready to re-apply patches to a new version.

## Tip 16: Git worktrees for parallel branch work

If you're working on multiple files or multiple branches and you don't want them to conflict, Git worktrees are a great way to work on them at the same time. You can just ask Claude Code to create a git worktree and start working there - you don't have to worry about the specific syntax. The basic idea is that you can work on a different branch in a different directory. It's essentially a branch + a directory.

You can add this layer of Git worktrees on top of the cascade method I discussed in the multitasking tip.

## Tip 17: Manual exponential backoff for long-running jobs

When waiting on long-running jobs like Docker builds or GitHub CI, you can ask Claude Code to do manual exponential backoff. Exponential backoff is a common technique in software engineering, but you can apply it here too. Ask Claude Code to check the status with increasing sleep intervals - one minute, then two minutes, then four minutes, and so on. It's not programmatic in the traditional sense - the AI is doing it manually - but it works pretty well. This way the agent can continuously check the status and let you know once it's done.

(For GitHub CI specifically, gh run watch exists but outputs many lines continuously, which wastes tokens. Manual exponential backoff with gh run view <run-id> | grep <job-name> is actually more token-efficient. This is also a general technique that works well even when you don't have a dedicated wait command handy.)

For example, if you have a Docker build running in the background, the agent checks in at growing intervals and keeps going until the job completes.

## Tip 18: Claude Code as a writing assistant

Claude Code is an excellent writing assistant and partner. The way I use it for writing: I first give it all the context about what I'm trying to write, and then I give it detailed instructions by speaking to it with my voice. That gives me the first draft. If it's not good enough, I try a few times. Then I go through it line by line, pretty much. I say, okay, let's take a look at it together. I like this line for these reasons. I feel like this line needs to move over there. This line needs to change in this particular way. I might ask about reference materials as well.

So it's this sort of back-and-forth process, maybe with the terminal on the left and your code editor on the right:

That tends to work really well.

## Tip 19: Markdown is the s**t

Typically when people write a new document, they might use something like Google Docs or maybe Notion. But now I honestly think the most efficient way to go about it is markdown. Markdown was already pretty good even before AI, but with Claude Code in particular, because it's so efficient for writing as I mentioned, the value of markdown is even higher in my opinion. Whenever you want to write a blog post or even a LinkedIn post, you can just talk to Claude Code, have it saved as markdown, and then go from there.

A quick tip for this one: if you want to copy and paste markdown content into a platform that doesn't accept it easily, you can paste it into a fresh Notion file first, then copy from Notion into the other platform. Notion converts it to a format that other platforms can accept. If regular pasting doesn't work, try Command + Shift + V to paste without formatting.
Tip 20: Use Notion to preserve links when pasting It turns out the reverse also works. If you have text with links from other places, let's say from Slack, you can copy it. If you paste it directly into Claude Code, it doesn't show the links. But if you put it in a Notion document first, then copy from there, you get it in markdown, which of course Claude Code can read. Tip 21: Containers for long-running risky tasks Running Claude Code with --dangerously-skip-permissions is the equivalent of having unprotected sex. So use a condo... I mean a container. Regular sessions are more for methodical work where you control the permissions you give and review output more carefully. Containerized environments are great for --dangerously-skip-permissions sessions where you don't have to give permission for each little thing. You can just let it run on its own for a while. This is useful for research or experimentation, things that take a long time and maybe could be risky. A good example is the Reddit research workflow from Tip 11, where the reddit-fetch skill goes back and forth with Gemini CLI through tmux. Running that unsupervised is risky on your main system, but in a container, if something goes wrong, it's contained. Another example is how I created the system prompt patching scripts in this repo. When a new version of Claude Code comes out, I need to update the patches for the minified CLI bundle. Instead of running Claude Code with --dangerously-skip-permissions on my host machine (where it has access to everything), I run it in a container. Claude Code can explore the minified JavaScript, find the variable mappings, and create new patch files without me approving every little thing that way. In fact, it was able to complete the migration pretty much on its own. It tried applying the patches, found that some didn't work with the new version, iterated to fix them, and even improved the instruction document for future instances based on what it learned. 
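As a rough sketch of that kind of sandboxed session - the image name here is hypothetical, not the actual setup from this repo:

```shell
# Run Claude Code inside a throwaway container so unsupervised
# --dangerously-skip-permissions work stays contained.
# my-claude-sandbox is a placeholder image with Claude Code preinstalled.
docker run --rm -it \
  -v "$PWD":/workspace \
  -w /workspace \
  my-claude-sandbox \
  claude --dangerously-skip-permissions
```

Only the mounted workspace is exposed to the agent; if something goes wrong, the rest of your system stays out of reach.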
I set up a Docker container with Claude Code, Gemini CLI, tmux, and all the customizations from this repo. Check out the container folder for the Dockerfile and setup instructions.

### Advanced: Orchestrating a worker Claude Code in a container

You can take this further by having your local Claude Code control another Claude Code instance running inside a container. The trick is using tmux as the control layer:

1. Your local Claude Code starts a tmux session
2. In that tmux session, it runs or connects to the container
3. Inside the container, Claude Code runs with --dangerously-skip-permissions
4. Your outer Claude Code uses tmux send-keys to send prompts and capture-pane to read output

This gives you a fully autonomous "worker" Claude Code that can run experimental or long-running tasks without you approving every action. When it's done, your local Claude Code can pull the results back. If something goes wrong, it's all sandboxed in the container.

### Advanced: Multi-model orchestration

Beyond just Claude Code, you can run different AI CLIs in containers - Codex, Gemini CLI, or others. I tried OpenAI Codex for code review, and it works well. The point isn't that you can't run these CLIs directly on your host machine - you obviously can. The value is that Claude Code's UI/UX is smooth enough that you can just talk to it and let it handle the orchestration: spinning up different models, sending data between containers and your host. Instead of manually switching between terminals and copy-pasting, Claude Code becomes the central interface that coordinates everything.

## Tip 22: The best way to get better at using Claude Code is by using it

Recently I saw a world-class rock climber being interviewed by another rock climber. She was asked, "How do you get better at rock climbing?" She simply said, "By rock climbing." That's how I feel about this too. Of course, there are supplementary things you can do, like watching videos, reading books, learning about tips.
But using Claude Code is the best way to learn how to use it. Using AI in general is the best way to learn how to use AI. I like to think of it as a billion token rule instead of the 10,000 hour rule. If you want to get better at AI and truly get a good intuition about how it works, the best way is to consume a lot of tokens. And nowadays it's possible. I found that especially with Opus 4.5, it's powerful enough but affordable enough that you can run multiple sessions at the same time. You don't have to worry as much about token usage, which frees you up a lot.

## Tip 23: Clone and half-clone conversations

Sometimes you want to try a different approach from a specific point in a conversation without losing your original thread. The clone-conversation script lets you duplicate a conversation with new UUIDs so you can branch off. The first message is tagged with [CLONED], which shows up both in the claude -r list and inside the conversation.

To set it up manually, symlink both files:

```shell
ln -s /path/to/this/repo/scripts/clone-conversation.sh ~/.claude/scripts/clone-conversation.sh
ln -s /path/to/this/repo/commands/clone.md ~/.claude/commands/clone.md
```

Or install via the dx plugin - no symlinks needed. Then just type /clone (or /dx:clone if using the plugin) in any conversation and Claude will handle finding the session ID and running the script. I've tested this extensively and the cloning works really well.

### Half-clone to reduce context

When a conversation gets too long, the half-clone-conversation script keeps only the later half. This reduces token usage while preserving your recent work. The first message is tagged with [HALF-CLONE].

To set it up manually, symlink both files:

```shell
ln -s /path/to/this/repo/scripts/half-clone-conversation.sh ~/.claude/scripts/half-clone-conversation.sh
ln -s /path/to/this/repo/commands/half-clone.md ~/.claude/commands/half-clone.md
```

Or install via the dx plugin - no symlinks needed.
## Tip 24: Use realpath to get absolute paths

When you need to tell Claude Code about files in a different folder, use realpath to get the full absolute path:

```shell
realpath some/relative/path
```

## Tip 25: Understanding CLAUDE.md vs Skills vs Slash Commands vs Plugins

These are somewhat similar features and I initially found them pretty confusing. I've been unpacking them and trying my best to wrap my head around them, so I wanted to share what I learned.

CLAUDE.md is the simplest one. It's a bunch of files that get treated as the default prompt, loaded into the beginning of every conversation no matter what. The nice thing about it is the simplicity. You can explain what the project is about in a particular project (./CLAUDE.md) or globally (~/.claude/CLAUDE.md).

Skills are like better-structured CLAUDE.md files. They can be invoked by Claude automatically when relevant, or manually by the user with a slash (e.g., /my-skill). For example, you could have a skill that opens a Google Translate link with proper formatting when you ask how to pronounce a word in a certain language. If those instructions are in a skill, they only load when needed. If they were in CLAUDE.md, they'd already be there taking up space. So skills are more token-efficient in theory.

Slash Commands are similar to skills in that they're ways of packaging instructions separately. They can be invoked manually by the user, or by Claude itself. If you need something more precise, to invoke at the right time at your own pace, slash commands are the tool to use.

Skills and slash commands are pretty similar in the way they function. The difference is the intention of the design - skills are primarily designed for Claude to use, and slash commands are primarily designed for the user to use. However, I think there's a chance they'll be merged at some point, and I requested that from Anthropic.

Plugins are a way to package skills, slash commands, agents, hooks, and MCP servers together.
But a plugin doesn't have to use all of them. Anthropic's official frontend-design plugin is essentially just a skill and nothing else. It could be distributed as a standalone skill, but the plugin format makes it easier to install. For example, I built a plugin called dx that bundles slash commands and a skill from this repo together. You can see how it works in the Install the dx plugin section. Tip 26: Interactive PR reviews Claude Code is great for PR reviews. The procedure is pretty simple: you ask it to retrieve PR information using the gh command, and then you can go through the review however you want. You can do a general review, or go file by file, step by step. You control the pace. You control how much detail you want to look into and the level of complexity you want to work at. Maybe you just want to understand the general structure, or maybe you want to have it run tests too. The key difference is that Claude Code acts as an interactive PR reviewer, not just a one-shot machine. Some AI tools are good at one-shot reviews (including the latest GPT models), but with Claude Code you can have a conversation. Tip 27: Claude Code as a research tool Claude Code is amazing for any sort of research. It's essentially a Google replacement or deep research replacement, but more advanced in a few different ways. Whether you're researching why certain GitHub Actions failed (which I've been doing a lot recently), doing sentiment or market analysis on Reddit, exploring your codebase, or exploring public information to find something - it's able to do that. The key is giving it the right pieces of information and instructions about how to access those pieces of information. It might be gh terminal command access, or the container approach (Tip 21), or Reddit through Gemini CLI (Tip 11), or private information through an MCP like Slack MCP, or the Cmd+A / Ctrl+A method (Tip 10) - whatever it is. 
Additionally, if Claude Code has trouble loading certain URLs, you can try using Playwright MCP or Claude's native browser integration (see Tip 9). In fact, I was even able to save $10,000 by using Claude Code for research. Tip 28: Mastering different ways of verifying its output One way to verify its output if it's code is to have it write tests and make sure the tests look good in general. That's one way, but you can of course check the code it generates as it goes, just on the Claude Code UI. Another thing is you can use a visual Git client like GitHub Desktop for example. I personally use it. It's not a perfect product, but it's good enough for checking changes quickly. And having it generate a PR as I probably mentioned earlier in this post is a great way as well. Have it create a draft PR, check the content before turning it into a real PR. Another one is letting it check itself, its own work. If it gives you some sort of output, let's say from some research, you can say "are you sure about this? Can you double check?" One of my favorite prompts is to say "double check everything, every single claim in what you produced and at the end make a table of what you were able to verify" - and that seems to work really well. Tip 29: Claude Code as a DevOps engineer I wanted to specifically create a separate tip for this because it's been really amazing for me. Whenever there are GitHub Actions CI failures, I just give it to Claude Code and say "dig into this issue, try to find the root cause." Sometimes it gives you surface level answers, but if you just keep asking - was it caused by a particular commit, a particular PR, or is it a flaky issue? - it really helps you dig into these nasty issues that are hard to dig into by hand. You would need to wade through a bunch of logs and that would be super painful to do manually, but Claude Code is able to handle a lot of that. 
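Once a failing run's log is saved locally (for example with gh run view <run-id> --log-failed), much of that digging is plain text filtering that the agent can drive for you. A minimal sketch - the file name and pattern are placeholders:

```shell
# Show the first few error lines with two lines of leading context.
# ci.log is a placeholder for a saved CI log, e.g. from:
#   gh run view <run-id> --log-failed > ci.log
grep -n -i -B 2 -m 5 "error" ci.log
```

Starting from the first error rather than scrolling the whole log is usually enough to tell a flaky failure from a real regression.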
I've packaged this workflow as a /gha slash command - just run /gha <url> with any GitHub Actions URL and it will automatically investigate the failure, check for flakiness, identify breaking commits, and suggest fixes. You can find it in the commands folder, or install it via the dx plugin. Once you identify what the particular problem was, you can just create a draft PR and go through some of the tips I mentioned earlier - check the output, make sure it looks good, let it verify its own outputs, and then turn it into a real PR to actually fix the issue. It's been working really well for me personally. Tip 30: Keep CLAUDE.md simple and concise I think it's important to keep CLAUDE.md really simple and concise. You can just start with no CLAUDE.md at all. And if you find that you keep telling Claude Code the same thing over and over again, then you can just add it to CLAUDE.md. I know there is an option to do that through the # symbol, but I prefer to just ask Claude Code to either add it to the project level CLAUDE.md or the global CLAUDE.md and it'll know what to edit exactly. So you can just let Claude Code edit CLAUDE.md by itself based on your instruction. Tip 31: Claude Code as the universal interface I used to think with Claude Code, CLI is like the new IDE, and it's still true in a way. I think it's a great first place to open your project whenever you want to make quick edits and stuff like that. But depending on the severity of your project, you want to be more careful about the outputs than just staying at the vibe coding level. But what's also true, the more general case of that, is that Claude Code is really the universal interface to your computer, the digital world, any sort of digital problem that you have. You can let it figure it out in many cases. For example, if you need to do a quick edit of your video, you can just ask it to do that - it'll probably figure out how to do that through ffmpeg or something similar. 
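For instance, a quick trim is a single command once the agent points you at it. The file names and timestamps here are just an illustration, and this assumes ffmpeg is installed:

```shell
# Cut the segment between 10s and 30s without re-encoding
# (-c copy copies the audio/video streams as-is, so it's fast).
ffmpeg -i input.mp4 -ss 00:00:10 -to 00:00:30 -c copy clip.mp4
```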
If you want to transcribe a bunch of audio files or video files that you have locally, you can just ask it to do that - it might suggest to use Whisper through Python. If you want to analyze some data that you have in a CSV file, it might suggest to use Python or JavaScript to visualize that. And of course with internet access - Reddit, GitHub, MCPs - the possibilities are endless. It's also great for any operations you want to perform on your local computer. For example, if you're running out of storage, you can just ask it to give you some advice on how to clean that up. It'll look through your local folders and files, try to find what's taking up a lot of space, and then give you advice on how to clean them up - maybe delete particularly large files. In my case, I had some Final Cut Pro files that were really large that I should have cleaned up. Claude Code told me about it. Maybe it'll tell you to clean up unused Docker images and containers using docker system prune. Or maybe it'll tell you to clean up some cache that you never realized was still there. No matter what you want to do on your computer, Claude Code is the first place I go to now. I think it's kind of interesting because the computer started with a text interface. And we're, in a way, coming back to this text interface that you can spin up three or four tabs at a time, as I mentioned earlier. To me, that's really exciting. It feels like you have a second brain, in a way. But because of the way it's structured, because it's just a terminal tab, you can open up a third brain, a fourth brain, a fifth brain, a sixth brain. And as the models become more powerful, the proportion of the thinking that you can delegate to these things - not the important things, but things that you don't want to do or that you find boring or too tedious - you can just let them take care of it. As I mentioned, a good example of that is looking into GitHub Actions. Who wants to do that? 
But it turns out these agents are really good at those boring tasks.

## Tip 32: It's all about choosing the right level of abstraction

As I mentioned earlier, sometimes it's okay to stay at the vibe coding level. You don't necessarily have to worry about every single line of code if you're working on one-time projects or non-critical parts of the codebase. But other times, you want to dig in a little deeper - look at the file structure and functions, individual lines of code, even checking dependencies. The key is that it's not binary. Some people say vibe coding is bad because you don't know what you're doing, but sometimes it's totally fine. But other times, it is helpful to dig deeper, use your software engineering skills, understand code at a granular level, or copy and paste parts of the codebase or specific error logs to ask Claude Code specific questions about them.

It's sort of like you're exploring a giant iceberg. If you want to stay at the vibe coding level, you can just fly over the top and check it from far away. Then you can go a little bit closer. You can go into diving mode. You can go deeper and deeper, with Claude Code as your guide.

## Tip 33: Audit your approved commands

I recently saw this post where someone's Claude Code ran rm -rf tests/ patches/ plan/ ~/ and wiped their home directory. It's easy to dismiss as a vibe coder mistake, but this kind of mistake could happen to anyone. So it's important to audit your approved commands from time to time. To make it easier, I built cc-safe - a CLI that scans your .claude/settings.json files for risky approved commands. It detects patterns like:

- sudo, rm -rf, Bash, chmod 777, curl | sh
- git reset --hard, npm publish, docker run --privileged
- And more - it's container-aware so docker exec commands are skipped

It recursively scans all subdirectories, so you can point it at your projects folder to check everything at once.
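For context, approved commands are stored in .claude/settings.json under permissions.allow. A hypothetical example of entries that an audit would flag:

```shell
# Write a sample settings file containing risky approved-command patterns.
# This is an illustration of the format, not a recommended configuration.
cat > settings-example.json <<'EOF'
{
  "permissions": {
    "allow": [
      "Bash(sudo:*)",
      "Bash(rm -rf:*)",
      "Bash(chmod 777:*)"
    ]
  }
}
EOF
```

Once a broad pattern like these lands in the allow list, every future session inherits it, which is exactly why periodic audits matter.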
You can run it manually or ask Claude Code to run it for you:

```shell
npm install -g cc-safe
cc-safe ~/projects
```

Or just run it directly with npx:

```shell
npx cc-safe .
```

GitHub: cc-safe

## Tip 34: Write lots of tests (and use TDD)

As you write more code with Claude Code, it becomes easier to make mistakes. PR reviews and visual Git clients help catch issues (as I mentioned earlier), but writing tests is crucial as your codebase grows larger. You can have Claude Code write tests for its own code. Some people say AI can't test its own work, but it turns out it can - similar to how the human brain works. When you write tests, you're thinking about the same problem in a different way. The same applies to AI.

I've found that TDD (Test-Driven Development) works really well with Claude Code:

1. Write tests first
2. Make sure they fail
3. Commit the tests
4. Write the code to make them pass

This is actually how I built cc-safe. By writing failing tests first and committing them before implementation, you create a clear contract for what the code should do. Claude Code then has a concrete target to hit, and you can verify the implementation is correct by running the tests. If you want to be extra sure, review the tests yourself to make sure they don't do anything stupid like just returning true.

## Tip 35: Be braver in the unknown; iterative problem solving

Since I started using Claude Code more intensely, I've noticed that I became more and more brave in the unknown. For example, when I started working at Daft, I noticed a problem with our frontend code. I'm not an expert in React, but I decided to dig into it anyway. I just started asking questions about the codebase and about the problem. Eventually I was able to solve it because I knew how to iteratively solve problems with Claude Code. A similar thing happened recently.
I was building a guide for users of Daft and ran into some very specific issues: cloudpickle not working with Google Colab with Pydantic, and a separate issue with Python and a bit of Rust where things weren't printing correctly in JupyterLab even though they worked fine in the terminal. I had never worked with Rust before. I could have just created an issue and let other engineers handle it. But I thought, let me dig into the codebase. Claude Code came up with an initial solution, but it wasn't that good. So I slowed down. A colleague suggested we just disable that part, but I didn't want any regression. Can we find a better solution? What followed was a collaborative and iterative process. Claude Code suggested potential root causes and solutions. I experimented with those. Some turned out to be dead ends, so we went in a different direction. Throughout this, I controlled my pace. Sometimes I went faster, like when letting it explore different solution spaces or parts of the codebase. Sometimes I went slower, asking "what does this line mean exactly?" Controlling the level of abstraction, controlling the speed. Eventually I found a pretty elegant solution. The lesson: even in the world of the unknown, you can do a lot more with Claude Code than you might think. Tip 36: Running bash commands and agents in the background When you have a long-running bash command in Claude Code, you can press Ctrl+B to move it to run in the background. Claude Code knows how to manage background processes - it can check on them later using the BashOutput tool. This is useful when you realize a command is taking longer than expected and you want Claude to do something else in the meantime. You can either have it use the exponential backoff method I mentioned in Tip 17 to check on progress, or just let it work on something else entirely while the process runs. Claude Code also has the ability to run subagents in the background. 
If you need to do long-running research or have an agent check on something periodically, you don't have to keep it running in the foreground. Just ask Claude Code to run an agent or task in the background, and it'll handle it while you continue with other work. Tip 37: The era of personalized software is here We're entering an era of personalized, custom software. Since AI came out - ChatGPT in general, but especially Claude Code - I've noticed that I'm able to create a lot more software, sometimes just for myself, sometimes for small projects. As I mentioned earlier in this document, I've created a custom transcription tool that I use every day to talk to Claude Code. I've created ways to customize Claude Code itself. I've also done a bunch of data visualization and data analysis tasks using Python much faster than I could otherwise. Here's another example: korotovsky/slack-mcp-server, a popular Slack MCP with almost 1,000 stars, is designed to run as a Docker container. I had trouble using it smoothly inside my own Docker container (Docker-in-Docker complications). Instead of fighting with that setup, I just asked Claude Code to write a CLI using Slack's Node SDK directly. It worked really well. This is an exciting time. Whatever you want to get done, you can ask Claude Code to do it. If it's small enough, you can build it in an hour or two. Tip 38: Navigating and editing your input box Claude Code's input box is designed to emulate common terminal/readline shortcuts, which makes it feel natural if you're used to working in the terminal. 
Here are some useful ones:

Navigation:

- Ctrl+A - Jump to the beginning of the line
- Ctrl+E - Jump to the end of the line
- Option+Left/Right (Mac) or Alt+Left/Right - Jump backward/forward by word

Editing:

- Ctrl+W - Delete the previous word
- Ctrl+U - Delete from cursor to beginning of line
- Ctrl+K - Delete from cursor to end of line
- Ctrl+C / Ctrl+L - Clear the current input
- Ctrl+G - Open your prompt in an external editor (useful for pasting long text, since pasting directly into the terminal can be slow)

If you're familiar with bash, zsh, or other shells, you'll feel right at home.

For Ctrl+G, the editor is determined by your EDITOR environment variable. You can set it in your shell config (~/.zshrc or ~/.bashrc):

```shell
export EDITOR=vim  # or nano, code, nvim, etc.
```

Or in ~/.claude/settings.json (requires restart):

```json
{
  "env": {
    "EDITOR": "vim"
  }
}
```

Entering newlines (multi-line input): The quickest method works everywhere without any setup: type \ followed by Enter to create a newline. For keyboard shortcuts, run /terminal-setup in Claude Code. On Mac Terminal.app, I use Option+Enter.

Pasting images:

- Ctrl+V (Mac/Linux) or Alt+V (Windows) - Paste an image from your clipboard
- Note: On Mac, it's Ctrl+V, not Cmd+V.

## Tip 39: Spend some time planning, but also prototype quickly

You want to spend enough time planning so that Claude Code knows what to build and how to build it. This means making high-level decisions early: what technology to use, how the project should be structured, where each functionality should live, which files things should go in. It's important to make good decisions as early as you can. Sometimes prototyping helps with that. Just by making a simple prototype quickly, you might be able to say "okay, this technology works for this particular purpose" or "this other technology works better." For example, I was recently experimenting with creating a diff viewer.
I first tried a simple bash prototype with tmux and lazygit, then tried making my own git viewer with Ink and Node. I had a lot of trouble with different things and ended up not publishing any of these results. But what I got reminded of through this project is the importance of planning and prototyping. I found that just by planning a little bit better at the beginning before you let it write code, you're able to guide it better. You still need to guide it throughout the process of coding, but letting it plan a little first is really helpful. You can use plan mode for this by pressing Shift+Tab to switch to it. Or you can just ask Claude Code to make a plan before writing any code. Tip 40: Simplify overcomplicated code I've found that Claude Code sometimes overcomplicates things and writes too much code. It makes changes you didn't ask for. It just seems to have a bias for writing more code. The code might work correctly if you've followed the other tips in this guide, but it's going to be hard to maintain and hard to check. It can be kind of a nightmare if you don't review it enough. So sometimes you want to check the code and ask it to simplify things. You could fix things yourself, but you could also just ask it to simplify. You can ask questions like "why did you make this particular change?" or "why did you add this line?" Some people say if you write code only through AI, you'll never understand it. But that's only true if you don't ask enough questions. If you make sure you understand every single thing, you can actually understand code faster than otherwise because you can ask AI about it. Especially when you're working on a large project. Note that this applies to prose as well. Claude Code often tries to summarize previous paragraphs in the last paragraph, or previous sentences in the last sentence. It can get pretty repetitive. Sometimes it's helpful, but most of the time you'll need to ask it to remove or simplify it. 
Tip 41: Automation of automation At the end of the day, it's all about automation of automation. What I mean by that is I've found it's the best way to not just become more productive, but also make the process more fun. At least to me, this whole process of automation of automation is really fun. I personally started with ChatGPT and wanted to automate the process of copy-pasting and running commands that ChatGPT gave me in the terminal. I automated that whole process by building a ChatGPT plugin called Kaguya. I've consistently worked towards more and more automation since then. Nowadays, luckily, we don't even have to build a tool like that because tools like Claude Code exist and they work really well. And as I've used it more and more, I found myself thinking, well, what if I could automate the process of typing? So I used Claude Code itself to build my voice transcription app, as I mentioned earlier. Then I started to think, I find myself repeating myself sometimes. So I would put those things in CLAUDE.md. Then I would think, okay, sometimes I go through running the same command over and over again. How can I automate that? Maybe I can ask Claude Code to do it. Or maybe I can put them in skills. Or maybe I can even have it create a script so I don't have to repeat the same process over and over again. I think ultimately that's where we're heading. Whenever you find yourself repeating the same task or the same command over and over again, a couple of times is okay, but if you repeat it over and over again, then think about a way to automate that whole process. Tip 42: Share your knowledge and contribute where you can This tip is a bit different from the others. I found that by learning as much as you can, you're able to share your knowledge with people around you. Maybe through posts like these, maybe even books, courses, videos. I also recently had an internal session for my colleagues at Daft. It's been very rewarding. 
And whenever I share tips, I often get information back. For example, when I shared my trick for shortening the system prompt and tool descriptions (Tip 15), some people told me about the --system-prompt flag that you can use as an alternative. Another time, I shared about the difference between slash commands and skills (Tip 25), and I learned new things from comments on that Reddit post. So sharing your knowledge isn't just about establishing your brand or solidifying your learning. It's also about learning new things through that process. It's not always a one-way street.

When it comes to contributing, I've been sending issues to the Claude Code repo. I thought, okay, if they listen, cool. If they don't, that's totally fine. I didn't have any expectations. But in version 2.0.67, I noticed they took multiple suggestions from reports I made:

- Fixed scroll position resetting after deleting a permission rule in /permissions
- Added search functionality to /permissions command

It's kind of amazing how fast the team can react to feature requests and bug reports. But it makes sense because they're using Claude Code to build Claude Code itself.

## Tip 43: Keep learning!

There are several effective ways to keep learning about Claude Code:

- Ask Claude Code itself - If you have a question about Claude Code, just ask it. Claude Code has a specialized sub-agent for answering questions about its own features, slash commands, settings, hooks, MCP servers, and more.
- Check the release notes - Type /release-notes to see what's new in your current version. This is the best way to learn about the latest features.
- Learn from the community - The r/ClaudeAI subreddit is a great place to learn from other users and see what workflows people are using.
- Follow Ado for daily tips - Ado (@adocomplete) is a DevRel at Anthropic who's been posting daily Claude Code tips throughout December in his "Advent of Claude" series.
Each day covers a different feature or workflow - things like named sessions, /stats, headless mode, vim mode, and more.

Twitter/X: Advent of Claude posts
LinkedIn: Advent of Claude posts

Install the dx plugin

This repo is also a Claude Code plugin called dx (developer experience). It bundles several tools from the tips above into a single install:

/dx:gha <url> - Analyze GitHub Actions failures (Tip 29)
/dx:handoff - Create handoff documents for context continuity (Tip 8)
/dx:clone - Clone conversations to branch off (Tip 23)
/dx:half-clone - Half-clone to reduce context (Tip 23)
reddit-fetch - Fetch Reddit content via Gemini CLI (Tip 11) - auto-invoked when needed

Install with two commands:

claude plugin marketplace add ykdojo/claude-code-tips
claude plugin install dx@ykdojo

After installing, the commands are available as /dx:clone, /dx:half-clone, /dx:handoff, and /dx:gha. The reddit-fetch skill is invoked automatically when you ask about Reddit URLs.

Recommended companion: Playwright MCP for browser automation - add with claude mcp add -s user playwright npx @playwright/mcp@latest

📺 Related talk: Claude Code Masterclass - lessons and project examples from 31 months of agentic coding
📝 Story: How I got a full-time job with Claude Code
📰 Newsletter: Agentic Coding with Discipline and Skill - bring the practice of agentic coding to the next level

2026/1/1

Ralph 实验:构建 SQLite UI

摘要

我使用 Ralph 技巧配合 Claude Code,根据生成的 PRD 自主构建了一个基于浏览器的 SQLite UI。Claude 逐个处理需求,在极少指导且不使用框架的情况下,生成了一个简单的静态应用。这种方法效果出奇地好,但速度慢、消耗 token 多,且在没有测试或版本控制的情况下存在风险。更强的护栏、更短的冲刺周期和更好的项目结构将显著提高可靠性和效率。总的来说,当时间和 token 成本可接受时,Ralph 被证明对这种无需人工干预的从零开发是有效的。最终结果可以在 lochie.dev/sqlite-ui 找到。

简介

首先 —— Ralph Wiggum 和软件开发有什么关系?Geoffrey Huntley 在 2025 年 7 月首次创造了这个术语:

Ralph 是一种技巧。在其最纯粹的形式中,Ralph 就是一个 Bash 循环。

while :; do cat PROMPT.md | claude-code ; done

Ralph 可以取代大多数公司在全新项目上的大部分外包工作。它有缺陷,但这些缺陷是可以识别的,并且可以通过各种风格的提示词来解决。

本周早些时候,Matt Pocock 关于他的 Ralph 工作流的精彩视频出现在我的视野中。Matt 采纳了 Geoffrey 的最初想法,并将其调整以适应他自己的开发风格。我强烈建议观看该视频,但为了总结核心概念,Matt 实际上重建了一个完全由 Claude 驱动的敏捷风格工作流:

1. 创建需求(用户故事),包含名称、描述、验收标准和实施状态。
2. 这些需求被分组到一个列表中(一个“Sprint”/冲刺)。
3. Claude 获取单个需求,独立工作直到完成,然后更新其状态。
4. 重复步骤 3,直到没有未满足的需求。

Claude 本身被用来定义需求列表。没有显式的依赖映射或优先级排序,这完全留给 Claude 在冲刺期间去弄清楚。冲刺作为一个 JSON 文件存在,包含所有需求及其进度。更新会同时持久化在文本文件和每次迭代后写入的 git 提交中。

实验

看到我的推送里充满了对 Ralph 的热议,我感到有点“错失恐惧症”(FOMO),并开始构思一个足够复杂有趣,但又可实现的项目。一个用于与 SQLite 数据库交互的基于浏览器的 Web 应用感觉很合适。更棒的是,如果效果好,我将来还能自己用它。

这是我的第一个提示词(prompt),使用了 plan 模式:

create a product requirements document for a browser based sqlite DB viewer, it should allow opening a sqlitedb file, show the contents and be interactive. No frontend JavaScript framework just keep it simple, lightweight and static.

(译:创建一个基于浏览器的 SQLite 数据库查看器的产品需求文档,它应该允许打开 sqlite 数据库文件,显示内容并且是交互式的。不要使用前端 JavaScript 框架,保持简单、轻量和静态。)

这产生了一个相当大的需求列表,存储在 PRD.md 中。然后我将这些需求转换为 PRD.json,遵循 Matt 在视频中使用的方法:

use @PRD.md to create PRD.json the file should contain an array of feature objects, the schema of an object is as follows. Break the PRD into many small features.

field - description
category - functional, non functional, etc
description - a brief 1 sentence description of the feature
steps - array of strings, eg.
when button is clicked, colour changes
passes - boolean representing if the feature is complete and working as intended

总共产生了 62 个需求。带上我的 Ralph 脚本,我准备好让它开跑了。

#!/bin/bash
set -e

if [ -z "$1" ]; then
  echo "Usage: $0 {iterations}"
  exit 1
fi

for ((i=1; i<=$1; i++)); do
  echo "Iteration: $i"
  echo "---------------------------------------"

  result=$(claude --permission-mode acceptEdits -p "@PRD.json @progress.txt \
    1. Find the highest-priority feature to work on and work only on that feature. \
    2. Check that the tests pass. \
    3. Update the PRD (PRD.json) with the work that was done. \
    4. Append your progress to the progress.txt file. \
    ONLY WORK ON A SINGLE FEATURE If, while implementing the feature, you notice the PRD is complete, output <promise>COMPLETE</promise>. \
  ")

  echo "$result"

  if [[ "$result" == *"<promise>COMPLETE</promise>"* ]]; then
    echo "PRD COMPLETE"
    exit 0
  fi
done

我一次运行 10 次迭代的脚本。每次迭代都会阻塞且没有输出,所以我每次都急切地等待完成。我的方法没有 git 历史记录,我在“危险边缘试探”。我依靠每次迭代后的 progress.txt 文件来查看发生了什么变化,并刷新浏览器查看是否有东西坏了。令我惊讶的是,一切都维持得很好,应用程序很快开始看起来像一个功能齐全的数据库 UI。

我无法一次性完成所有需求。随着功能的积累,token 使用量显著攀升,我最终花了大约两天时间进行实验。一旦所有原始需求都实现了,我忍不住开始要求更多功能,包括:

显示表存储大小
对查询运行 EXPLAIN
显示查询执行时长
交互式行编辑
保存更新后的数据库
索引管理
架构图
……还有更多

在那一刻,我从工程师变成了一个过度兴奋的产品经理,刚刚发现了一个 24/7 工作、从不拒绝、并且没有自尊心去拒绝需求膨胀的软件工程师。

最终结果

你可以在这里试用:lochie.dev/sqlite-ui

反思

仅用极简的提示词,我就能生成完整的需求列表并让 Claude 直接开始工作。结果在很大程度上符合我的预期,考虑到我有限的提示词,这仍然令人印象深刻。我要求一个简单的静态实现,而这正是我得到的——单个 HTML、CSS 和 JavaScript 文件。唯一的外部依赖是 sql-wasm。Claude 构建了一个坚实的基础,可以进一步扩展。

显示表存储大小最初不起作用。Claude 指出,由于 sql-wasm 使用的编译标志,这在实现过程中可能无法工作。我自己用 SQLITE_ENABLE_DBSTAT_VTAB 重新构建了它,更新了 HTML 以指向我的构建版本,用 HTTP 服务器(加载 WASM 所需)提供文件服务,然后一切正常,无需进一步的代码更改。

一些缺点变得很明显:

Claude 运行时没有可见的中间输出,所以很难判断它是卡住了还是只是慢。
没有编写测试。(没有提示它写,Claude 认为没有必要。)
项目没有版本控制。我很幸运,没有什么灾难性的崩溃。

不过,这种成功并不令人惊讶。Ralph 本质上将 Claude 变成了一个专注的软件工程师,按照项目经理定义的冲刺计划工作。没有上下文切换,只有一次一个故事。也就是说,这种方法并不快,而且消耗大量的 token。这是速度和准确性之间的明显权衡。如果你有时间并且能承担成本,像这样自主构建项目是有意义的,尤其是当它在你睡觉时还在埋头苦干的时候。

未来改进
根据以往的经验,Claude 在有更强的护栏和更清晰的指导下表现会好得多。虽然 Ralph 循环在极少设置下效果出奇地好,但这次实验突显了几个更刻意的结构会带来回报的领域。 护栏和项目结构 下次我会做不同的事情: 创建一个定义良好的 CLAUDE.md,记录项目惯例、架构、约束和期望。 使用结构化的多文件项目布局,而不是单个不断增长的 JavaScript 文件。 从一开始就添加 git 以持久化进度,通过提交历史启用检查,并允许在 Claude 走上不可恢复的道路时回滚。 添加测试 —— 大量的测试。对于 Web 应用,Playwright 将是一个强有力的选择,可以自动捕获功能和视觉回归。 因为每次 Ralph 迭代都在一个新的 Claude Code 会话中运行,大量的时间和 token 花在了重新加载和重新理解项目上下文上。更好的结构、更清晰的惯例和自动化测试将显著减少这种开销并提高迭代质量。 更小的冲刺和提前规划 冲刺的大小也很重要。在一个冲刺中有 62 个功能实在是太多了,Claude 仅仅为了决定下一步做什么就要进行大量的推理。 一种选择是预先手动界定故事范围、定义关系并分配优先级。然而,一个更有趣的工作流是让 Claude 自己处理那个计划步骤: 从完整的需求集开始。 要求 Claude 将它们分解为多个较小的冲刺,并提前计划几个冲刺,就像现实世界的敏捷流程一样。 独立执行每个冲刺。 通过这种方法,Ralph 脚本可以扩展为使用嵌套循环:遍历冲刺,然后遍历冲刺中的故事。这将减少因重复扫描完整需求集而浪费的上下文,并允许 Claude 自主运行更长时间。 进一步来说,这种模式甚至可以扩展到多个“软件工程师”(多个 Claude 实例)并行工作,由更高级别的编排层协调。 结论 我认为这次实验是成功的。 我构建了一些真正有用的东西,不会被丢弃,在实践中学到了很多关于 Ralph 技巧的知识,现在对将其应用于未来的项目充满信心。 我期待着在有更好的护栏、测试和版本控制的情况下再次尝试。

2026/1/1

Notex:一个开源 NotebookLM 替代方案的实现

起源 Google 的 NotebookLM 上线后,不少人体验了它的文档问答和内容生成功能。它很好用,但有个问题:数据需要上传到 Google 的服务器。对于一些敏感的内部文档,这不太合适。 于是就想做一个开源版本:数据存在本地,想用什么模型就自己配置。一个元旦假期,基本功能就跑起来了。 项目叫 Notex,代码在 GitHub 上,今天聊聊它的实现。 本项目受open-notebook项目的启发。但是open-notebook缺少信息图、思维导图、幻灯片等高级功能。本项目弥补了这些功能 信息图采用最新的nano banana pro实现 幻灯片采用 宝玉的 《预订本年度最有价值提示词 —— 生成既有质感,又能随意修改文字的完美 PPT》 的方式,生成真正意义上的PPT 功能概览 Notex 是一个隐私优先的开源知识管理工具,基于 RAG(检索增强生成)技术。它支持多种文档格式,通过 AI 帮助你理解、总结和可视化文档内容。 核心功能 功能分类 具体能力 文档管理 支持 PDF、DOCX、PPTX、Markdown、TXT、HTML 等多种格式的文档上传和解析 AI 对话 基于文档内容的智能问答,回答会标注来源引用,避免 AI 幻觉 内容转换 摘要、FAQ、学习指南、大纲、时间线、术语表、测验、播客脚本等 9 种预设类型 视觉生成 思维导图(Mermaid.js)、信息图(Gemini Nano Banana Pro)、幻灯片/PPT 模型支持 OpenAI(含兼容 API)、Ollama 本地模型、Google Gemini 特色功能详解 1. 幻灯片(PPT)生成 采用两阶段生成流程: 阶段一:使用 Gemini Flash 生成 PPT 大纲,包含: 全局风格指南(字体、色彩、设计美学) 每页的叙事目标、关键内容、视觉元素、布局说明 阶段二:使用 Gemini Pro Image 为每页幻灯片生成配图 最终输出图文并茂的完整 PPT,而非简单的文字大纲。 2. 信息图生成 通过 Prompt Engineering 将文本内容"翻译"为结构化的视觉描述,再调用 Gemini Nano Banana Pro 生成手绘风格的信息图。 适用于:数据可视化、流程说明、概念解释等场景。 3. 思维导图生成 自动将文档结构提炼为 Mermaid.js mindmap 格式,支持: 中心主题、主要分支、细节节点的层级展示 自动缩放和交互式导航 导出为 SVG/PNG 4. 智能问答(RAG) 支持基于文档内容的自然语言问答: 中文/英文自适应分词和检索 回答会标注来源,可追溯原文 支持多轮对话历史 整体架构 技术栈选型很简单: 1 2 3 后端:Go 1.25 + Gin + SQLite 前端:原生 HTML/CSS/JavaScript AI:LangChainGo + OpenAI/Ollama + Gemini 为什么不选 Python?因为 Go 编译完就是一个二进制文件,部署起来省事。而且 Go 的并发性能好,后续如果需要处理大量文档也不会成为瓶颈。 目录结构也很清晰: 1 2 3 4 5 6 7 8 9 notex/ ├── main.go # 入口 ├── backend/ │ ├── agent.go # AI 调用逻辑 │ ├── server.go # HTTP 服务 │ ├── vector.go # 文档索引 │ ├── store.go # 数据持久化 │ └── nanobanana.go # Gemini 图片生成 └── frontend/ # 单页应用(嵌入二进制) 前端通过 //go:embed 编译进二进制,一个文件就能跑起来。 核心模块 1. 文档索引(vector.go) RAG 的第一步是把文档切成小块,建立索引。 这里有个问题:中文和英文的分词方式不一样。英文按空格分词就行,中文需要按字符切。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 // 检测 CJK 字符比例 cjkCount := 0 for _, r := range runes { if r >= 0x4E00 && r <= 0x9FFF { // CJK Unified Ideographs cjkCount++ } } cjkRatio := float64(cjkCount) / float64(len(runes)) if cjkRatio > 0.3 { // 中文按字符切分 for i := 0; i < len(runes); i += (chunkSize - chunkOverlap) { // ... 
} } else { // 英文按单词切分 words := strings.Fields(text) // ... } 检索部分目前用的是简单的关键词匹配,没有用向量 Embedding。原因?对于中小规模的文档(几万字以内),关键词匹配够用了,而且快。如果后续文档量大,可以接入向量数据库。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 // 多层评分策略 score := 0.0 // 1. 子串匹配(权重高) if strings.Contains(content, queryLower) { score += 10.0 } // 2. 字符匹配率 matchCount := 0 for _, r := range queryRunes { if strings.ContainsRune(content, r) { matchCount++ } } if matchCount > 0 { charMatchRatio := float64(matchCount) / float64(len(queryRunes)) score += charMatchRatio * 5.0 } // 3. 单词匹配(针对英文) queryWords := strings.Fields(queryLower) for _, word := range queryWords { if len(word) > 2 && strings.Contains(content, word) { score += 2.0 } } 2. AI 调用(agent.go) 这块是核心,负责和 LLM 打交道。 使用 LangChainGo 作为抽象层,支持 OpenAI 和 Ollama: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 func createLLM(cfg Config) (llms.Model, error) { if cfg.IsOllama() { return ollamallm.New( ollamallm.WithModel(cfg.OllamaModel), ollamallm.WithServerURL(cfg.OllamaBaseURL), ) } opts := []openai.Option{ openai.WithToken(cfg.OpenAIAPIKey), openai.WithModel(cfg.OpenAIModel), } if cfg.OpenAIBaseURL != "" { opts = append(opts, openai.WithBaseURL(cfg.OpenAIBaseURL)) } return openai.New(opts...) } Prompt Engineering 是关键。不同的转换类型需要不同的 prompt 模板: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 case "mindmap": return `你是一位资深的信息架构师和知识管理专家。 请将【文本内容】提炼并转换为 Mermaid.js 的 mindmap 格式。 # 样式规范: 1. 中心主题:root((内容)) 2. 主要分支:(内容) 3. 细节节点:[内容] # 严格规范: - 严禁使用 graph, LR, --> 等字符 - 节点内容 10 字以内 - 只输出代码块,不解释 来源:{sources} ` 注意这里的 prompt 设计:明确的输出格式要求,加上反面示例("严禁"),能有效减少 LLM 输出格式不稳定的问题。 3. PPT 生成 这是比较复杂的 feature。需要两个步骤: 第一步:生成大纲 用 Gemini Flash 生成 PPT 大纲,每页包含: 叙事目标(这张幻灯片要讲什么) 关键内容(标题、要点) 视觉元素(需要什么图) 布局(怎么排版) 第二步:解析和生成图片 1 2 3 4 5 6 7 8 9 10 11 12 13 14 func (a *Agent) ParsePPTSlides(content string) []Slide { // 1. 提取风格指南 styleStart := strings.Index(content, "<STYLE_INSTRUCTIONS>") styleEnd := strings.Index(content, "</STYLE_INSTRUCTIONS>") // 2. 
按幻灯片分割(支持多种标记) re := regexp.MustCompile(`(?m)^(?:\s*#{1,6}\s*)?(?:Slide|幻灯片)\s*\d+`) indices := re.FindAllStringIndex(content, -1) // 3. 验证每张幻灯片是否包含必需字段 if strings.Contains(lower, "叙事目标") || strings.Contains(lower, "关键内容") { slides = append(slides, Slide{Style: style, Content: slideContent}) } } 然后为每张幻灯片调用 Gemini Pro Image 生成图片: 1 2 3 4 5 for i, slide := range slides { prompt := fmt.Sprintf("Style: %s\n\nSlide Content: %s", slides[0].Style, slide.Content) imagePath, err := s.agent.GenerateImage(ctx, "gemini-3-pro-image-preview", prompt) slideURLs = append(slideURLs, "/uploads/"+filepath.Base(imagePath)) } 4. 信息图生成 信息图的核心是让 LLM 把文本内容"翻译"成视觉描述,然后把描述作为 prompt 喂给图像模型。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 case "infograph": return `# Role 你是一位世界顶级的数据可视化设计师和信息图专家。 # Task 阅读所附文本,设计一张信息图。不要进行总结, 而是描述这张图应该长什么样。你的输出将被直接用作图像生成的 prompt。 # Design Guidelines 1. 核心信息提炼:找出最重要的 3-5 个数据点 2. 视觉隐喻:用形象的比喻(如网络安全用"盾牌和锁") 3. 布局结构:明确定义图的结构 4. 文本限制:极简,只保留标题和关键数据 5. 风格:插画或手绘感 # Output Format Start with "Infographic illustration created in a soft, hand-drawn digital art style..." [描述整体布局和背景风格] [详细描述主要视觉元素] ... ` 这个 prompt 设计的要点是:明确告诉 LLM 它的输出会被用作另一个 prompt,这样 LLM 就会更注意输出的结构化和可用性。 5. 思维导图 思维导图的实现最简单,核心是生成 Mermaid.js 的语法: 1 2 3 4 5 6 7 8 9 10 11 12 13 case "mindmap": return `将文本转换为 Mermaid.js mindmap 格式。 # 样式规范: root((中心主题)) # 圆圈 (主要分支) # 圆角矩形 [细节节点] # 矩形 # 严格规范: - 仅限 mindmap 语法 - 节点内容 10 字以内 - 严禁包含引号 ` 前端用 Mermaid.js 自动渲染: 1 2 import mermaid from 'https://s4.zstatic.net/ajax/libs/mermaid/11.4.0/mermaid.min.js'; mermaid.initialize({ startOnLoad: true }); 数据流 整个系统的数据流如下: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 用户上传文档 ↓ markitdown 转换为 Markdown(如果需要) ↓ 文本分块并索引 ↓ 存储到 SQLite ↓ 用户选择转换类型(PPT/信息图/思维导图) ↓ 根据类型构建 prompt ↓ 调用 LLM 生成内容 ↓ 如果需要图片,调用 Gemini 生成 ↓ 保存为 Note ↓ 前端渲染展示 部署和使用 本地运行 1 2 3 4 5 6 7 8 9 10 11 12 git clone https://github.com/smallnest/notex.git cd notex go mod tidy # 使用 OpenAI export OPENAI_API_KEY=your_key go run . 
-server # 或使用 Ollama(本地模型) export OLLAMA_BASE_URL=http://localhost:11434 export OLLAMA_MODEL=qwen2.5:7b go run . -server Docker 部署 1 docker-compose up 环境变量 变量 说明 默认值 OPENAI_API_KEY OpenAI API 密钥 - OPENAI_BASE_URL 自定义 API 地址 - OPENAI_MODEL 模型名称 gpt-4o-mini OLLAMA_BASE_URL Ollama 地址 - OLLAMA_MODEL Ollama 模型 - GOOGLE_API_KEY Gemini API(PPT/信息图) - SERVER_PORT 服务端口 8080 扩展性 代码结构支持几种扩展方式: 1. 添加新的转换类型 在 agent.go 的 getTransformationPrompt 添加新 case: 1 2 case "new_type": return `你的 prompt 模板...` 2. 支持新的 LLM LangChainGo 支持多种模型,只需在 createLLM 添加新的分支。 3. 接入向量数据库 VectorStore 目前是内存实现,可以替换成 Qdrant、Milvus 等。 4. 添加新的文档格式 在 vector.go 的 needsMarkitdown 添加新的文件扩展名。 代码量和学习价值 项目核心代码大约 2000 行左右,适合学习: RAG 的完整实现:从文档处理到检索到生成 Prompt Engineering:如何设计稳定的 prompt Go Web 开发:Gin 框架、SQLite、并发处理 前后端分离:单页应用 + REST API LLM 应用架构:如何组织一个 AI 应用的代码 如果你想学习如何构建 AI 应用,或者想找一个 NotebookLM 的替代方案,这个项目可以作为一个起点。 总结 Notex 不是什么高大上的项目,代码也写得朴素。但它解决了实际问题:让你在本地使用 AI 处理文档,数据不外泄。 核心功能就几个: 文档问答(RAG) 多种内容转换(摘要、FAQ、大纲等) 视觉内容生成(PPT、信息图、思维导图) 技术选型务实地选择了 Go 和原生 JS,没有过多的依赖。部署简单,维护成本低。 如果你感兴趣,可以看看代码,提提 PR,或者直接 fork 改成你自己的工具。 项目地址: https://github.com/smallnest/notex 欢迎 Star、Issue、PR。
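作为补充,文中 vector.go 的中英文自适应分块逻辑可以补全为下面这个可运行的最小示例。splitChunks 是示意函数名,并假设 chunkSize 大于 chunkOverlap:

```go
package main

import (
	"fmt"
	"strings"
)

// splitChunks 按文中思路分块:CJK 字符占比超过 30% 时按字符切分,
// 否则按单词切分;chunkSize/chunkOverlap 为示意参数(需 chunkSize > chunkOverlap)。
func splitChunks(text string, chunkSize, chunkOverlap int) []string {
	runes := []rune(text)
	cjk := 0
	for _, r := range runes {
		if r >= 0x4E00 && r <= 0x9FFF { // CJK Unified Ideographs
			cjk++
		}
	}
	var chunks []string
	step := chunkSize - chunkOverlap
	if len(runes) > 0 && float64(cjk)/float64(len(runes)) > 0.3 {
		// 中文按字符切分,相邻块之间保留 chunkOverlap 个字符的重叠
		for i := 0; i < len(runes); i += step {
			end := i + chunkSize
			if end > len(runes) {
				end = len(runes)
			}
			chunks = append(chunks, string(runes[i:end]))
			if end == len(runes) {
				break
			}
		}
	} else {
		// 英文按单词切分
		words := strings.Fields(text)
		for i := 0; i < len(words); i += step {
			end := i + chunkSize
			if end > len(words) {
				end = len(words)
			}
			chunks = append(chunks, strings.Join(words[i:end], " "))
			if end == len(words) {
				break
			}
		}
	}
	return chunks
}

func main() {
	fmt.Println(splitChunks("检索增强生成把文档切成小块建立索引", 8, 2))
	fmt.Println(splitChunks("retrieval augmented generation splits documents into chunks", 4, 1))
}
```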

2026/1/1

拆解Manus:真正有用的深度报告的生成

真正的深度研究报告不是让 AI "写"出来的,而是"研究"出来的。这两个字差别大了。最近几个月,几个产品让我眼前一亮:Gemini Deep Research、Manus Wide Research,还有 OpenAI 的 Deep Research。它们都在做同一件事——让 AI 具备真正的"研究能力"。这事儿得从"上下文窗口陷阱"说起。

一、上下文窗口陷阱

传统 AI 助手有个致命问题。让它分析 5 个产品?没问题,每个都能给出详尽分析。分析 10 个?开始有点吃力,后面的描述明显变短。分析 30 个?基本崩了,只能给些通用摘要,错误率飙升。

这不是模型变笨了,是上下文窗口被撑爆了。AI 必须把前面所有项目都记在内存里,越往后,早期信息被压缩得越厉害,最后就剩个轮廓。研究显示,大多数 AI 系统的"幻觉阈值"大概在 8-10 个项目左右。超过这个数,质量断崖式下跌。这就是为什么那些所谓的"深度研究" agent,给你列 50 篇论文,结果后面 30 篇都是胡编乱造。

二、两条路线

业界现在有两条路线解决这个问题。

路线一:把模型做强。代表是 Google 的 Gemini Deep Research。2025 年搞了两次大升级——4 月上了 Gemini 2.5 Pro,12 月又升级到 Gemini 3 Pro。核心思路就是用更大的上下文窗口、更强的推理能力,硬抗复杂任务。它适合那种需要深度推理、跨多个来源综合分析的任务。比如研究一个技术趋势的演进,需要把几十篇论文、新闻、博客揉在一起,找出脉络。

路线二:把架构做巧。代表是 Manus 的 Wide Research。它不跟你比谁的上下文窗口大,而是直接换个架构——不用一个 Agent 干所有事,而是搞几百个 Agent 并行干活。每个 Agent 只负责一个项目,有自己独立的上下文窗口。主 Agent 负责拆解任务、分配工作、最后汇总结果。这就像写研究报告:原来是一个人憋大招,现在是一个导师带几十个研究生,每人负责一篇文献,最后导师综合。所以 Manus 把它的这种研究功能叫做宽度研究(Wide Research)。

三、实战对比:风电行业投资研究

拿"风电行业股票的投资研究"这个任务来说,两种方案差异明显。

Gemini Deep Research:深度推理派

丢给 Gemini 一个复杂问题,它会:

① 先理解任务,规划研究策略。可以看到,Gemini 一开始先制定研究方案,把搜索计划拆解成 7 个方向,针对每个方向进行深度的分析。而且它在开始时允许你修改研究方案,提供 human-in-the-loop 的机制,引入人工干预的过程。

② 多轮搜索,找权威来源——行业报告、公司财报、政策文件、学术论文。这其实是谷歌的优势:它是搜索引擎起家,有着无与伦比的搜索数据优势,而且时效性非常高,所以它更有资源做这件事(其实百度也非常适合做这个)。它搜索的质量也非常高,搜索出来的网页权重都挺高,权威性好。

③ 边搜索边分析,不断调整方向。从上图可以看出,它也是边搜索边思考,对搜索方向进行适当的调整,围绕着计划进行有目的的搜索。

④ 综合所有信息,写出有逻辑的报告。最终整合所有信息,输出一份完整的报告结果。

整个过程可能要跑十几分钟。但它给出的不是简单的信息拼凑,而是有洞察的分析。优点是推理能力强,能发现隐含的关联。比如它会注意到某个政策变化对上游设备商和下游运营商的不同影响,这是普通搜索做不到的。缺点是,如果任务涉及大量独立对象(比如分析 50 家风电公司),还是会遇到上下文瓶颈。

Manus Wide Research:并行执行派

同样的任务,如果是"分析 50 家风电上市公司",Manus 的做法完全不同。

第一步,任务拆解。主 Agent 会把"分析 50 家公司"这个大任务,拆成 50 个独立的小任务。每个小任务包含:公司名称、需要收集的信息字段(财务数据、业务布局、风险因素等)、评估标准。

第二步,并行启动。50 个子 Agent 同时启动,每个只研究一家公司。关键在于——它们不是轮流干活,而是真正并行。就像 50 个研究员同时开工,互不干扰。

第三步,独立执行。每个子 Agent 在自己的沙箱环境里跑完整的调研流程:搜索公司信息、读财报、查新闻、整理数据。因为每个 Agent 有独立的上下文窗口,第 50 个公司和第 1 个公司获得的分析深度完全一致。

第四步,结果汇总。主 Agent 收集所有子任务的结果,整理成结构化输出——表格、报告、或者数据库。

这个方案的巧妙之处在于规避了上下文窗口问题。传统方式是一个 Agent 处理所有任务,越往后上下文越拥挤,质量必然下降。而 Manus 的方式是每个
Agent 都从零开始,大家都是"满血"状态,不存在谁挤占谁的问题。

官方有个案例——研究 250 位 AI 研究员。第 1 位到第 250 位,每位的详细程度都一样:完整的背景、研究方向、代表作、核心论文。这换成传统 AI 根本做不到——早就因为上下文爆炸而开始胡编乱造了。

技术实现细节

剥开产品外壳,Manus 的架构其实挺讲究的。核心思想来自一篇叫《CodeAct》的论文:让 AI 像程序员一样工作——写代码、运行代码、看结果、改代码、再运行。

ReAct 模式

每个 Agent 的执行逻辑是一个循环:观察(Observation)→ 思考(Thought)→ 行动(Action)→ 观察……不断迭代,直到任务完成。这不是简单的"调用工具",而是边思考边行动。搜索结果不相关?换关键词。代码执行报错?检查原因、修改命令、重试。

┌─────────────────────────────────────────────┐
│ 观察 → 思考 → 行动 → 观察 → 思考 → 行动... │
└─────────────────────────────────────────────┘

沙箱隔离

每个 Agent 跑在独立的虚拟机环境里。这不是多线程,而是真正的隔离——独立的文件系统、独立的网络、独立的进程空间。Manus 用了沙箱技术,这样就算某个 Agent 被诱导执行了危险命令,也影响不到其他 Agent 和宿主系统。

任务分解:DAG 结构

主 Agent 把复杂任务拆成有向无环图(DAG),清楚知道哪些步骤可以并行,哪些必须等前置条件。比如"研究 50 家风电公司"的执行图:

        [主 Agent:任务分解]
                ↓
┌─────────┬─────────┼─────────┬─────────┐
↓         ↓         ↓         ↓         ↓
[Agent 1] [Agent 2] [Agent 3]   ...   [Agent 50]
研究公司1  研究公司2  研究公司3          研究公司50
↓         ↓         ↓         ↓         ↓
└─────────┴─────────┴─────────┴─────────┘
                ↓
        [主 Agent:结果汇总]

工具集

Manus 提供了 29 种工具,分成几类:

类别 | 工具示例 | 用途
命令执行 | Shell、Python | 执行任意代码和系统命令
文件操作 | 读写 txt/md/pdf/xlsx | 处理各种格式的文档
网络能力 | 搜索、浏览器、端口部署 | 获取信息、部署服务
系统能力 | 进程管理、软件安装 | 配置运行环境

这些工具让 Agent 能真正"动手做事",而不是只会生成文本。比如研究一家公司,它可以:搜索公司官网、爬取财报数据、用 Python 分析财务指标、把结果写入表格。

动态质量检测

Manus 不是按预设流程走到黑。每次执行完一个步骤,都会判断:结果可信吗?需要调整方向吗?

def quality_check(result):
    if result.confidence < 0.7:
        trigger_self_correction()
    return generate_validation_report()

代码执行报错?调整命令重试。搜索结果太少?换搜索引擎或关键词。这种自我纠错能力,让它在遇到问题时不会死磕,而是会寻找替代方案。

状态管理

每个任务维护一个 todo.md 文件,实时更新进度:

# TODO
- [x] 收集公司列表
- [x] 研究前 10 家公司
- [ ] 研究中 30 家公司
- [ ] 研究后 10 家公司
- [ ] 汇总分析报告

这样做的好处是:任务中断后可以恢复;用户随时能看到进展;中间结果有地方存储,不会丢失。

当然这些分析都是我们基于 Manus 的产品、宣传资料和外部人员的分析做的推测,我们实际并不知道 Manus 内部的方案,仅供分析和学习。

四、真正的"研究"是什么?

看这些产品,我发现一个有趣的共通点:它们都在模仿人类研究员的工作方式。人类研究员怎么做深度报告?首先规划:这个问题要从哪些角度分析?需要哪些数据?
然后执行:找数据、读文献、做访谈、整理信息。最后综合:把零散的信息组织成有逻辑的故事。

关键在于,这不是一次性能完成的。需要反复迭代——查了资料发现方向不对,得换方向;分析到一半发现缺数据,得补数据。传统的 LLM chat 模式,本质上是一次性生成:给你个 prompt,你吐出结果。这是"写作",不是"研究"。

真正的深度研究 agent,应该具备:

① 规划能力:能理解任务,制定研究策略,知道先查什么后查什么。
② 工具能力:会用搜索引擎、读数据库、解析文档,不只是从训练数据里抠信息。
③ 迭代能力:能根据中间结果调整方向,不是按预定流程走到黑。
④ 验证能力:能交叉验证信息来源,不是捡到什么信什么。
⑤ 综合能力:能把碎片信息拼成完整故事,不是简单的信息罗列。

Gemini Deep Research 和 Manus Wide Research,本质上都是在往这个方向努力。只是侧重点不同——前者重"深",后者重"广"。

五、一些思考

深度研究报告的生成,本质上是在解决 AI 的两个核心问题:知识获取和复杂推理。传统的做法是把知识"压缩"进模型参数里,但这有天花板——模型总有训练截止日期,总有不知道的东西。新的方向是给 AI 装上"工具",让它能动态获取信息、多步推理、自我纠错。这更像人类的智能——我们不是什么都知道,但知道怎么去查、怎么去思考。

Deep Research 这类产品,标志着 AI 从"聊天机器人"向"研究助手"的转变。它们不只是在陪你闲聊,而是能真正帮你干活。当然,问题还很多。幻觉、偏见、安全风险,一个都没解决。但方向是对的。

Insight:我们的深度研究工具

基于上面对这些产品的分析,结合对智能体 20 余种架构模式的研究,针对深度研究报告这个场景,我们研发了 Insight 平台,展示了实现高质量深度研究报告所需要的技术和套路,并提供了独有的一些方法。接下来我介绍这个产品的重要组成部分。在这之前,我先给大家展示一下效果。

Insight 平台地址:https://insight.rpcx.io
一个网友的报告:https://insight.rpcx.io/reports/e25061cd-b936-470d-b5ed-3998b2d7452e

高质量的资料源

就像研究员一样,要想生成高质量的报告,必须要有高质量的数据源作为基础。如上面提到过的,这方面搜索引擎具有优势,它们拥有全面、时效性高的搜索资料。一些专业的网站也有行业方面的专业资料,比如金融、IT 技术、白酒、电商、旅游、出行、法律、科普等领域的垂类网站,它们能够提供专业的资料。如果要生成这些方面的报告,找这些行业相关的网站信息是最好的。当然如果这些资料是公开的,也可以通过搜索引擎+专业的搜索词进行搜索获取,所以搜索引擎在 AI 的深度调研报告生成中占很重要的地位。

但是搜索引擎并不轻易把它的能力无偿地暴露出来。比如谷歌,它提供付费的搜索 API,价格不低;Bing 已经把它的搜索 API 下掉了。还有一些第三方的服务,或许是通过爬虫的方式进行搜索,比如 Tavily、Brave、SerpAPI 等,也提供收费的服务。总的来说,好东西都是有价格的。这些搜索引擎增加了反爬虫的机制,我几个月前还能绕过这些限制,现在也不行了,他们都很聪明,不会轻易让你免费使用的。

从 Manus 网站的输出来看,我怀疑它们买了某个或者某几个搜索服务商的服务,可以针对用户的请求,得到搜索引擎的结果列表(SERP)。这是猜测,从前端并看不出来,因为它们并不可能自己去做个搜索引擎,专业的人做专业的事。它们拿到搜索结果列表后,通过它们的 AI 浏览器,也就是虚拟机的浏览器去浏览网络,获取网页的内容。它们叫 cloud browser,应该是基于 browserbase 开发的。沙盒技术和 Browser 我放在下一篇拆解文章中再专门介绍,这一篇还是专门关注调研报告的实现。

识别用户的意图

专业的用户会提供详细的、简洁的、对 AI 友好的提示词,但是大部分用户并不能很好地表达自己的意图,比如:

请帮我分析Apple的市场行情

这里的 Apple 指的是苹果电脑公司还是水果苹果🍎呢?非常有歧义。还有的提示词含糊不清。在 Insight 的实现中,我们首先实现的就是对用户的提示词进行澄清。理论上对于非常模糊的提示词,比如 Apple 这个例子,我们需要进一步询问用户,反问用户的意图,但是 Insight 还没进一步实现这个功能,而是自动进行澄清,它是怎么实现的呢?
生成报告大纲

接下来专门有个 agent(StructurePlannerNode)负责生成报告大纲,主要确定报告的主要大纲和方向。

区分报告的种类进行规划搜索任务

接下来针对报告的大纲,Planner Agent 针对金融投资领域、技术开发领域、商业分析领域、政策相关领域、通用领域等各种领域制定搜索计划,为每个章节搜索足够的信息备用。

执行搜索

Search Agent 会根据搜索计划,逐步完成相关内容的搜索。它会根据项目提供的工具进行搜索,目前提供了 Tavily、GitHub、谷歌学术、知乎、微信公众号等网站的搜索,我其实还是期望能直接搜索 Google 的搜索结果和一些顶级专业网站的内容。搜索的结果内容如果太长,还会对它们进行 summary,以便压缩上下文。此节点还会针对每个章节进行内容的总结。

现在万事俱备了,所有的大纲和材料都准备好了,可以进行报告的编写了。有请 ContentWriter Agent。

内容的编写

ContentWriter 负责报告的编写,它根据大纲和材料进行内容的创作。说起创作来,针对内容创作的场景,为了提高创作产出的质量,我们经常采用 Reflection 模式:对生成的内容进行打分,决定要不要进行优化。所以这个 Agent 生成内容后,会调用 Reflector Agent 进行打分,以便决定要不要调用 Revisor 节点进行订正。你看报告生成的日志也能观察到这一点。除非生成的报告片段达到了"优"的级别,否则就会尝试补充和订正,最多反复订正 5 轮。

所以你会看到 Insight 生成的报告的质量还是很高的,不像一个玩具或者示例 deep research 产品的报告那么简略或者偏离主题。

Reportor Agent

最后这个 Agent 把各个章节合在一起,生成一个完整的 markdown 报告,会生成合理的章节并进行编号。另外还有一个 Podcast Agent,可以生成播客的脚本。如果用户请求中要求生成播客,那么此 Agent 就会自动生成。

工作流程

完整的工作流(graph)如下:

工程开发

本产品开发过程中很多资源花费在工程产品化方面,尤其是前端的生成和优化。

实现框架

本产品使用 https://github.com/smallnest/langgraphgo 实现,Go 语言编写,使用 CC 辅助代码生成,使用 GLM 4.7 作为后端模型。

本产品演示网站:https://insight.rpcx.io

对后端代码感兴趣的可以加我微信 smallnest,我拉你进《智能体研究社》,讨论智能体的开发和 vibe coding。

未来计划

报告中缺乏图表,可以加强数据的采集和图表的展示。
报告中缺乏图片的润色,网上搜索的图片经常文不对题,不过我已经有方案了。

参考资料:

OpenAI Deep Research 官方介绍
Gemini Deep Research 指南 2025
Manus Wide Research 官方文档
OpenAI Deep Research API 文档
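上文 Wide Research 的"主 Agent 拆解、子 Agent 并行、最后汇总"模式,可以用 Go 的 goroutine + channel 做一个最小示意。research 是占位实现,真实系统里是完整的子 Agent 调研流程:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// research 模拟一个子 Agent 对单家公司的独立调研:
// 每个调用拥有自己的"上下文",互不挤占(占位实现)。
func research(company string) string {
	return "report of " + company
}

// wideResearch 模拟主 Agent:拆解任务、并行分发、汇总结果。
func wideResearch(companies []string) []string {
	var wg sync.WaitGroup
	results := make(chan string, len(companies))
	for _, c := range companies {
		wg.Add(1)
		go func(c string) { // 每家公司一个"子 Agent"
			defer wg.Done()
			results <- research(c)
		}(c)
	}
	wg.Wait()
	close(results)

	var reports []string
	for r := range results {
		reports = append(reports, r)
	}
	sort.Strings(reports) // 汇总时排序,保证输出稳定
	return reports
}

func main() {
	fmt.Println(wideResearch([]string{"金风科技", "明阳智能", "运达股份"}))
}
```

真实系统中 goroutine 换成独立沙箱里的完整 Agent,channel 换成结果存储,但"每个单元满血上下文、主控只做分发与汇总"的结构是一样的。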

2026/1/1

了解 Manus Sandbox - 您的云计算机

了解 Manus Sandbox - 您的云计算机 本来想写拆解Manus沙箱的文章,结果Manus官方自己写了 :) 了解 Manus Sandbox - 您的云计算机 正如 Manus 名称的出处"Mens et Manus"(Mind and Hand)所说的,Manus 希望让 AI 模型不仅只是思考,更可以帮你做出行动。而在我们给 AI 模型赋予的"手"中,最强大的莫过于一台真正的云计算机 —— Manus Sandbox。 什么是 Manus Sandbox? Manus Sandbox 是 Manus 为每一个任务分配的完全独立的云虚拟机。每个 Sandbox 在自己的环境中运行,互不影响,可以并行执行任务。 Sandbox 的强大之处在于它的完整性——就像你手上使用的个人电脑一样,它拥有完整的能力:网络、文件系统、浏览器、各种各样的软件工具。我们的 AI Agent 经过设计与训练,可以很好地选择并正确使用这些工具帮助你完成任务。不仅如此,有了这台计算机,AI 可以通过自己最擅长的方式——编写代码来解决问题,甚至可以帮你制作一个完整的网站和移动端 App。这一切都发生在 Manus 背后的虚拟化平台上。这些 Sandbox 可以 7x24 小时工作,完成你下发的任务而不占用你的本地资源。 Manus Sandbox 的特性 Sandbox 里面有什么 Manus Sandbox 里存放了任务执行过程中所需要的文件,包含如下类型: 用户上传的附件 Manus 运行过程中创建、编写的文件和产物 Manus 为了执行特定任务所需的配置(如用户上传的密钥、Manus 给用户分配的用于调用相关接口的密钥) 你可以通过右上角的"View all files in this task"入口查看 Sandbox 中的所有产物文件。 当然,你也可以直接向 Manus 发送消息询问当前 Sandbox 的状态和其中的内容,比如"把当前写的所有代码打包发给我",Manus 会自行访问 Sandbox 满足你的要求。 Sandbox 的生命周期 Sandbox 遵循一个可预测的生命周期,在资源效率和持久性之间取得平衡: 创建: Sandbox 会在新会话中按需创建 休眠/唤醒: 在 Sandbox 不活跃时(没有操作、文件编辑等),会自动进入休眠状态。当你回到任务,Manus 需要操作 Sandbox 时会自动唤醒。在休眠/唤醒周期内,Sandbox 中的文件数据保持不变 回收/重新创建: 连续休眠超过一定期限的 Sandbox 会被回收(免费用户:7 天;Manus Pro:21 天)。Sandbox 被回收后再次访问时,会自动重新创建一个新的沙盒。Manus 会自动恢复之前沙盒中的部分文件到新的沙盒:Manus 的产物、上传的附件、Slides/WebDev 等重要文件会被自动恢复;运行过程中的中间代码、临时文件等不会被恢复 由于 Sandbox 存在休眠机制,如果你需要部署长时间运行的后台服务,可以使用 Manus 的网页开发能力创建前后台服务并部署到公网。 Sandbox 的安全性 我们对 Manus Sandbox 的设计遵循 原则。就像你在云服务厂商购买的一台云虚拟机那样,你和 Manus 对这台电脑拥有绝对的掌控权,可以不受限制地进行任何操作(例如获取 root 权限,修改系统文件,甚至是格式化整个磁盘)。这使得 Manus 可以尽最大的可能帮你完成任务而不受权限的约束。 不用担心:任何对 Sandbox 的操作只会影响 Sandbox 本身,不会影响到 Manus 服务的安全和稳定性,你的会话、账号等数据也无法被 Sandbox 访问。一旦出现不可恢复的错误,Manus 会自动创建一个新的沙盒来进行替换,确保能继续为你提供服务。 避免意外共享 Sandbox 中的敏感数据 Sandbox 作为你的私人电脑,可能存放有你的个人敏感信息和数据。Manus 有着严格的隐私保护政策和措施,不会在未经用户授权的情况下读取或分享任何用户数据,但你仍然需要采取措施以避免 Sandbox 中的数据被意外共享。 我们需要区分"分享"和"协作"两个场景。 分享 对于任务的分享(通过 Manus 右上角的"Share"按钮),被分享者只会看到任务对话中的消息和输出的产物。Sandbox 对他们是完全不可见的。因此,你只需要关心对话中是否包含敏感信息即可,无需担心 Sandbox 中的内容被泄露。 协作 对于任务的协作(通过 Manus 右上角的协作按钮,邀请特定的用户参与),协作者加入会话后即获得了参与此任务的权限。这意味着协作者可以向 AI 发送指令,控制任务的执行。此时 Sandbox 对协作者也同样开放——他们可以通过 AI 
访问或者修改 Sandbox 中的文件、数据,可能会造成预期外的数据泄露。 另外,Connectors 会在会话开启协作后自动禁用,无需担心 Connectors 被协作者访问。 你可以通过上面的表格了解不同情况下其他用户对于任务内容的可访问性,在进行共享、协作操作前确认不会泄露你的隐私数据。 保护隐私的最佳实践 由于 Sandbox 相当于你的私人电脑,添加协作者前请二次确认 Sandbox 中是否有敏感内容不便于协作者访问 如果已经有敏感内容,可以新建一个 task,只复制必要的内容和产物到该 task,再邀请协作 避免在协作会话中发送个人敏感信息 可用性 Manus Sandbox 是我们平台的核心组件,所有用户均可使用,适用于所有订阅层级。 常见问题 问:Sandbox 被回收后,我的文件会怎样? 答:Manus 会自动恢复你最重要的文件——产物、上传的附件以及 Slides/WebDev 等项目文件——到新的 Sandbox。中间代码和临时文件不会被恢复。 问:Sandbox 多久会被回收? 答:免费用户的保留期为 7 天,Manus Pro 用户在 Sandbox 不活跃 21 天后才会被回收。 问:协作者能否通过 Sandbox 访问我的 Connectors? 答:不能。协作开启后,Connectors 会自动禁用,确保协作者无法访问你连接的服务。 Manus Sandbox 代表了 Manus 代你行动的基础。通过提供一个持久、安全、功能完整的云计算环境,我们正在开启一类新的 AI 驱动工作方式——从对话走向真正的执行。 参考资料 Manus Project 博客 Manus文档 - 网站构建器
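按官方描述的回收规则(产物、附件、Slides/WebDev 等项目文件被恢复,中间代码和临时文件不恢复),可以用一小段 Go 代码示意新沙盒重建时的文件筛选逻辑。FileKind 等类型纯属示意,并非 Manus 内部实现:

```go
package main

import "fmt"

// FileKind 标记沙盒中文件的类型(示意)。
type FileKind int

const (
	Artifact   FileKind = iota // Manus 的产物
	Attachment                 // 用户上传的附件
	Project                    // Slides/WebDev 等项目文件
	TempFile                   // 运行过程中的中间代码、临时文件
)

type SandboxFile struct {
	Name string
	Kind FileKind
}

// restoreAfterRecycle 模拟沙盒被回收后重建时的恢复规则:
// 产物、附件、项目文件被恢复到新沙盒,临时文件被丢弃。
func restoreAfterRecycle(files []SandboxFile) []SandboxFile {
	var restored []SandboxFile
	for _, f := range files {
		if f.Kind != TempFile {
			restored = append(restored, f)
		}
	}
	return restored
}

func main() {
	files := []SandboxFile{
		{"report.md", Artifact},
		{"data.xlsx", Attachment},
		{"slides/", Project},
		{"/tmp/build.log", TempFile},
	}
	for _, f := range restoreAfterRecycle(files) {
		fmt.Println("restored:", f.Name)
	}
}
```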

2026/1/1

Claude Code 使用


2025/12/1

我给每个模型服务商『捐』了10块钱,只为了...

我整理了几家大厂的模型服务的地址、文档和基本介绍,方便大家使用和参考。当然还有其他一些厂商提供了通用的模型服务或自家的模型服务,就不一一介绍了,未来模型服务最终会集中到几家大厂手里。

deepseek 官方服务

官方入口:https://platform.deepseek.com
官方文档:https://api-docs.deepseek.com/zh-cn/

官方网站上的操作还是很简洁的,不像有些云服务商搞得人晕头转向。你可以很方便地充值和创建 API key。我年初的时候充了10块钱,现在还剩8块多。

它提供了 OpenAI 兼容的 API,所以下面三个信息非常关键:

base_url: https://api.deepseek.com 或者 https://api.deepseek.com/v1
api_key: 充钱后就可以自由地创建 key。key 就像密码一样,注意安全地保存,或者设置在环境变量中
model: 选择要使用的模型。deepseek-chat 是非思考模式,deepseek-reasoner 是思考模式,目前这两个都升级到了 DeepSeek-V3.2-Exp

你可以使用 curl 测试:

curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${DEEPSEEK_API_KEY}" \
  -d '{
    "model": "deepseek-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "stream": false
  }'

阿里云百炼平台

官方入口:https://bailian.console.aliyun.com/
官方文档:https://help.aliyun.com/zh/model-studio/what-is-model-studio

据官方介绍,自2025年9月8日11点起,首次开通阿里云百炼的用户,获赠的新人免费额度有效期调整为 90 天,可以领取7000万的额度(应该是每个模型一定额度,总共7000万额度)。

你可以在模型广场挑选你关注的模型,比如我想使用阿里服务的 deepseek 的模型,那么就在输入框中输入 deepseek。后续我们都以 deepseek 的模型为例。注意哈,这些云服务商除了提供自己家的大模型外,还会提供流行的开源大模型服务,比如大家都会提供 deepseek 的模型服务。所以你要使用 deepseek 模型,不一定到 deepseek 官方去购买,也可以在其他云服务厂商购买。

我觉得阿里云这个模型的介绍还是非常简洁的:

第一行就把模型 code 名列出来了,还有一个贴心的复制按钮,比下面要介绍的厂商贴心多了
模型的能力罗列得很清楚
免费的额度很清晰地罗列出来
限流和上下文明明白白地列出来了,对开发者很友好
代码示例在本页中就列出来了,方便复制和测试

你可以使用 curl 调用:

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-plus",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "你是谁?"
      }
    ]
  }'

模型的费用统一在阿里云的账户充值即可。我充值了10元。因为免费额度都没有用完,所以资金还没有动。后续要多用用阿里云的模型服务了,先把免费额度用光,否则就过期了。

首先你得创建 API Key。它也提供了 OpenAI 兼容的 API,所以下面三个信息非常关键:

base_url: https://dashscope.aliyuncs.com/compatible-mode/v1 (新加坡区域使用 https://dashscope-intl.aliyuncs.com/compatible-mode/v1)
api_key: 上面我们创建的 key
model: 选择要使用的模型。deepseek 的模型很多,你可以选择 deepseek-v3.2-exp

百度云千帆平台

官方入口:https://cloud.baidu.com/product-s/qianfan_home
官方文档:见百度智能云千帆平台的文档中心

登录后选择 模型服务 -> 模型广场,搜索 deepseek,可以列出几个百度云服务的 deepseek 模型。通过和阿里云百炼平台的对比,我们能看到百度云相对于阿里云,产品的友好性还是差了一截,具体来说:

找到模型多了一级:阿里云的『模型广场』直接在左侧最上面,而百度云需要点击『模型服务』,再点击『模型广场』
程序员开发用的模型名在哪里?在哪里?在哪里?那个可复制的模型 ID 又是啥?又是啥?又是啥?我每次找模型名称,都是在这个页面点击体验,然后再点击代码查看,看请求的示例代码使用的 model 名称是啥
为啥不在这个页面中像阿里云那样,把用户最关注的信息罗列在这里?这个页面空空旷旷的,还不得不把行距设得很大,以免显得太空。为啥不增加用户关注的信息呢?比如:模型能力、限流和上下文的限制、代码示例

你可以在控制台页面进行充值。我充了10块钱,才用了一分钱。主要是作为百度一个卑微的员工,还是有一点点进行 AI 应用开发的预算的,我在公司基本就是使用公司的账户进行测试和运营了。

当然你使用百度云的服务,首先也得创建 API key。这个创建 API key 页面我也不知道从哪个入口进入的,我都是通过搜索,或者看 API 文档的某个页面的链接进入,我不想吐槽了。

它也提供了 OpenAI 兼容的 API,所以下面三个信息非常关键:

base_url: https://qianfan.baidubce.com/v2
api_key: 上面我们创建的 key
model: 选择要使用的模型。deepseek 的模型很多,你可以选择 deepseek-v3

你可以使用 curl 调用:

curl --location --request POST 'https://qianfan.baidubce.com/v2/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer 密钥' \
  --data-raw '{"model":"deepseek-v3.2","messages":[{"role":"user","content":"hello"}],"temperature":0.6,"top_p":0.95,"stop":[],"web_search":{"enable":false}}'

字节火山方舟平台

官方入口:https://console.volcengine.com/ark/region:ark+cn-beijing/model
官方文档:https://www.volcengine.com/docs/82379

直接点击模型广场,会显示一些提供服务的模型,但是有三个问题,对于 deepseek 的模型:

主界面并没有显示全所有模型,所以我还不得不通过搜索框搜索
其实那不是一个搜索框,是一个所谓的"火山助手",不好用,慢,还不如直接给答案。AI 虽好,但是还没达到智能客服的水准,显示又是右边窄窄的一个小窗口,小里小气的
没有最新的 deepseek-v3.2-exp 模型,最新的是 deepseek-v3.1 v250821 版本

但是针对单个模型的介绍,还是可以的。你看上面的图,此模型的特性都有介绍,而且模型的 token 限制和限流信息,都讲得明明白白的。模型 code name 也在最上面显示,也提供了贴心的复制按钮功能。

啊哦,我又发现了它一个不太好的地方。对于 deepseek-v3.1,
它提供了两个版本的支持,显示在同一个网页中,所以你需要下拉才能注意到,版本之间的信息很不容易区分。代码示例也太粗糙了,只提供了一个方舟的 Python SDK 的示例,其他语言和 curl 命令都没有。

import os
# 升级方舟 SDK 到最新版本:pip install -U 'volcengine-python-sdk[ark]'
from volcenginesdkarkruntime import Ark

client = Ark(
    # 从环境变量中读取您的方舟 API Key
    api_key=os.environ.get("ARK_API_KEY"),
    # 深度思考模型耗费时间会较长,请您设置较大的超时时间,避免超时,推荐30分钟以上
    timeout=1800,
)
response = client.chat.completions.create(
    # 替换 <Model> 为您的 Model ID
    model="deepseek-v3-1-250821",
    messages=[
        {"role": "user", "content": "我要研究深度思考模型与非深度思考模型区别的课题,体现出我的专业性"}
    ],
    thinking={
        "type": "disabled"  # 默认行为,不使用深度思考能力
        # "type": "enabled"  # 使用深度思考能力
    },
)
print(response)

要通过 API 调用模型服务,你还是需要创建一个 API key,如下图点击创建按钮就可以创建。充值点击顶栏的费用菜单就可以,充值也很方便。

它也提供了 OpenAI 兼容的 API,所以下面三个信息非常关键:

base_url: https://ark.cn-beijing.volces.com/api/v3/
api_key: 上面我们创建的 key
model: 选择要使用的模型,你可以选择 deepseek-v3-1-250821

字节自己搞了另外一套 Responses API,提供了 Python、Go 和 Java 的版本。你会选择使用么?头得是多铁才会使用这个定制的 API,将自己的应用绑定到固定的厂商?它的文档是一个通用的文档介绍:https://www.volcengine.com/docs/82379/1399009 ,并不会针对某个模型有单独的介绍和示例。但是好歹在模型的介绍页面放一个到 API 调用文档的显著链接呀,我使用它的小助手询问才找到。

openrouter

OpenRouter 可以看成一个大模型 API 路由器,目前已经将各种主流的 AI 模型和服务集成到一个统一的接口中,后续还会不断增加新的模型。它允许用户通过简单的配置就能调用不同大模型的能力。

OpenRouter 的主要功能和特点:

统一接口:提供标准化的 API,不同模型使用一个 API 即可,只需要选择一下模型的名字,简化了模型的集成和部署过程。
多模型支持:目前已经支持几乎所有的主流模型,如 GPT 系列、Claude 系列、Gemini 系列、deepseek 等。
无需自行部署:各种开源、闭源的模型基本都有,用户无需自建 GPU 服务器部署。
成本优化:提供透明的定价机制,帮助用户在性能和成本之间找到最佳平衡点。
易于集成:便于与现有系统集成,适合各种应用场景。
可白嫖:有免费的模型可以使用,虽然存在一定的调用限制。

它需要绑定信用卡支付,所以我就不使用它了。我比较痛苦的是它的网站首页居然没有 signin 按钮,我都不知道怎么登录进去,还好通过 Google 搜索找到了它的登录页。难道它针对中国的用户直接屏蔽了登录么?
不管怎么着,我们使用它的服务,首先还是创建个 Key。你可以找模型名带 "free" 的模型,可以免费使用。

它也提供了 OpenAI 兼容的 API,所以下面三个信息非常关键:

base_url: https://openrouter.ai/api/v1
api_key: 上面我们创建的 key
model: 选择要使用的模型,比如免费的 deepseek/deepseek-chat-v3.1:free

你可以使用 curl 调用:

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -d '{
    "model": "deepseek/deepseek-chat-v3.1:free",
    "messages": [
      {
        "role": "user",
        "content": "What is the meaning of life?"
      }
    ]
  }'

gemini

谷歌提供了它自己的模型服务,大家最熟悉的就是 gemini 系列了。它的文档在 https://ai.google.dev/gemini-api/docs 。你需要在 Google AI Studio 的 API Keys 页面创建你的 key。

它的 API 是自有的格式:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          {
            "text": "Explain how AI works in a few words"
          }
        ]
      }
    ]
  }'

但是为了便于推广 gemini,Google 提供了 OpenAI 兼容的 API:https://ai.google.dev/gemini-api/docs/openai

所以下面三个信息非常关键:

base_url: https://generativelanguage.googleapis.com/v1beta/openai/
api_key: 上面我们创建的 key
model: 选择要使用的模型,比如 gemini-2.0-flash
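上面各家服务的共同点是都提供 OpenAI 兼容 API,换厂商只需要换 base_url、api_key、model 三个参数。下面用 Go 写一个通用的请求构造示意(URL 和模型名以 deepseek 官方为例,api_key 需替换成你自己的):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

// newChatRequest 按 OpenAI 兼容格式构造 /chat/completions 请求:
// 换一家厂商只需要换 baseURL、apiKey 和 model 三个参数。
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(ChatRequest{
		Model:    model,
		Messages: []Message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", baseURL+"/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	// 以 deepseek 官方服务为例;换成阿里云百炼等只需改 baseURL 和 model
	req, err := newChatRequest("https://api.deepseek.com", "sk-xxx", "deepseek-chat", "Hello!")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	// 真正发送时:resp, err := http.DefaultClient.Do(req)
}
```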

2025/11/1

Go之禅 - 基于Rob Pike思想的Go语言哲学

Go之禅 - 基于Rob Pike思想的Go语言哲学 1. 简单胜过聪明 (Simple is better than clever) 不要炫技,要解决问题 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 // ✅ 简单直接的代码 func Max(a, b int) int { if a > b { return a } return b } // ❌ 过度聪明的代码 func Max(a, b int) int { return (a + b + abs(a-b)) / 2 // 聪明但难懂 } func abs(x int) int { if x < 0 { return -x } return x } 2. 清晰胜过简洁 (Clear is better than concise) 代码首先要让人读懂 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 // ✅ 清晰的代码 func CalculateUserScore(user *User) int { totalPoints := 0 for _, activity := range user.Activities { totalPoints += activity.Points } if len(user.Activities) == 0 { return 0 } return totalPoints / len(user.Activities) } // ❌ 过度简洁的代码 func CalcScore(u *User) int { if len(u.Acts) == 0 { return 0 } return func() int { s := 0; for _, a := range u.Acts { s += a.Pts }; return s }() / len(u.Acts) } 3. 组合胜过继承 (Composition over inheritance) Go的接口和嵌入体现组合思想 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // ✅ 使用组合和接口 type Writer interface { Write([]byte) (int, error) } type Logger interface { Log(string) } type FileLogger struct { writer Writer prefix string } func (f *FileLogger) Log(message string) { f.writer.Write([]byte(f.prefix + message)) } type NetworkLogger struct { writer Writer endpoint string } func (n *NetworkLogger) Log(message string) { n.writer.Write([]byte(n.endpoint + ": " + message)) } // ❌ 如果Go有继承(反例) // type BaseLogger struct { ... } // type FileLogger struct { BaseLogger ... } // 继承会带来复杂性 4. 
接口要小而专注 (Interfaces should be small and focused) 接口越小越好,单一职责 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 // ✅ 小而专注的接口 type Reader interface { Read([]byte) (int, error) } type Writer interface { Write([]byte) (int, error) } type Closer interface { Close() error } // 组合小接口 type ReadWriter interface { Reader Writer } type ReadWriteCloser interface { Reader Writer Closer } // ❌ 大而全的接口 type FileHandler interface { Read([]byte) (int, error) Write([]byte) (int, error) Seek(int64, int) (int64, error) Close() error Stat() (os.FileInfo, error) Chmod(os.FileMode) error // ... 太多方法 } 5. 并发不是并行 (Concurrency is not parallelism) 并发是关于结构,并行是关于执行 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 // ✅ 并发结构 - 使用goroutine和channel组织程序 func ProcessTasks(tasks []Task) { taskChan := make(chan Task, len(tasks)) resultChan := make(chan Result, len(tasks)) // 启动工作者goroutine for i := 0; i < 3; i++ { go worker(taskChan, resultChan) } // 发送任务 for _, task := range tasks { taskChan <- task } close(taskChan) // 收集结果 for i := 0; i < len(tasks); i++ { result := <-resultChan fmt.Printf("任务完成: %v\n", result) } } func worker(tasks <-chan Task, results chan<- Result) { for task := range tasks { result := processTask(task) results <- result } } 6. 通过通信来共享内存,而不是通过共享内存来通信 (Share memory by communicating) Channel是Go的核心哲学 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 // ✅ 通过channel通信 type Counter struct { ch chan int value int } func NewCounter() *Counter { c := &Counter{ ch: make(chan int), } go c.run() return c } func (c *Counter) run() { for increment := range c.ch { c.value += increment } } func (c *Counter) Increment() { c.ch <- 1 } func (c *Counter) Add(n int) { c.ch <- n } // ❌ 通过共享内存通信(需要锁) type MutexCounter struct { mu sync.Mutex value int } func (c *MutexCounter) Increment() { c.mu.Lock() defer c.mu.Unlock() c.value++ } 7. 
7. Errors are values, not exceptions
Explicit error handling beats exceptions.

```go
// ✅ Handle errors as values
func ReadConfig(filename string) (*Config, error) {
	data, err := os.ReadFile(filename)
	if err != nil {
		return nil, fmt.Errorf("failed to read config file: %w", err)
	}

	var config Config
	if err := json.Unmarshal(data, &config); err != nil {
		return nil, fmt.Errorf("failed to parse config: %w", err)
	}

	return &config, nil
}

func main() {
	config, err := ReadConfig("config.json")
	if err != nil {
		log.Fatal(err)
	}
	// use config...
}

// ❌ If Go had exceptions (counterexample)
// func ReadConfig(filename string) *Config {
//     data := mustReadFile(filename)   // may throw
//     return mustParseConfig(data)     // hidden error handling
// }
```

8. Don't design with interfaces, discover them
Interfaces should emerge from use, not be designed up front.

```go
// ✅ Discover interfaces from real usage
// Write the concrete implementation first
type FileStorage struct {
	basePath string
}

func (f *FileStorage) Save(key string, data []byte) error {
	return os.WriteFile(filepath.Join(f.basePath, key), data, 0644)
}

func (f *FileStorage) Load(key string) ([]byte, error) {
	return os.ReadFile(filepath.Join(f.basePath, key))
}

// Abstract an interface only when a second implementation is needed
type Storage interface {
	Save(string, []byte) error
	Load(string) ([]byte, error)
}

// ❌ A large interface designed up front
// type DataStore interface {
//     Save(string, []byte) error
//     Load(string) ([]byte, error)
//     Delete(string) error
//     List() ([]string, error)
//     Backup() error
//     Restore() error
//     // ... many methods that may never be used
// }
```
9. The empty interface says nothing
Avoid overusing interface{}.

```go
// ✅ Use concrete types or meaningful interfaces
func ProcessUsers(users []User) {
	for _, user := range users {
		fmt.Printf("processing user: %s\n", user.Name)
	}
}

type Processor interface {
	Process() error
}

func RunProcessors(processors []Processor) {
	for _, p := range processors {
		if err := p.Process(); err != nil {
			log.Printf("processing failed: %v", err)
		}
	}
}

// ❌ Overusing the empty interface
func ProcessAnything(items []interface{}) {
	for _, item := range items {
		// Needs type assertions; type safety is lost
		switch v := item.(type) {
		case User:
			fmt.Printf("user: %s\n", v.Name)
		case Product:
			fmt.Printf("product: %s\n", v.Name)
		default:
			fmt.Printf("unknown type: %T\n", v)
		}
	}
}
```

10. Gofmt's style is no one's favorite, yet gofmt is everyone's favorite
A uniform code format beats personal preference.

```go
// ✅ Code formatted with gofmt
func CalculateTotal(items []Item, taxRate float64) float64 {
	var subtotal float64
	for _, item := range items {
		subtotal += item.Price * float64(item.Quantity)
	}
	tax := subtotal * taxRate
	return subtotal + tax
}

// All Go code should be formatted with gofmt for consistency.
// Don't adjust formatting by hand; let the tool handle it.
```

11. A little copying is better than a little dependency
A bit of duplication beats an unnecessary dependency.

```go
// ✅ A simple duplicated implementation

// In package userservice:
func ValidateEmail(email string) bool {
	return strings.Contains(email, "@") && len(email) > 5
}

// In package orderservice:
func ValidateEmail(email string) bool {
	return strings.Contains(email, "@") && len(email) > 5
}

// ❌ Pulling in a complex dependency just to avoid duplication
// import "github.com/complex-validation-library/v2/email"
//
// func ValidateEmail(email string) bool {
//     validator := email.NewValidator(email.WithComplexRules())
//     return validator.Validate(email)
// }
```
12. Syscalls, OS threads, and mutexes are expensive; avoid them where you can
Understand the cost of your concurrency primitives.

```go
// ✅ Use goroutines and channels instead of system-level synchronization
func FanOut(input <-chan int, workers int) <-chan int {
	output := make(chan int)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for n := range input {
				// process the data
				result := process(n)
				output <- result
			}
		}()
	}
	// Close output once all workers finish, so receivers can range over it
	go func() {
		wg.Wait()
		close(output)
	}()
	return output
}

// ❌ Overusing mutexes
type ExpensiveCounter struct {
	mu    sync.Mutex
	value int
}

func (c *ExpensiveCounter) Get() int {
	c.mu.Lock() // every read takes the lock
	defer c.mu.Unlock()
	return c.value
}

func (c *ExpensiveCounter) Set(v int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value = v
}
```

13. Caching is important
But keep it simple and obvious.

```go
// ✅ A simple cache
type Cache struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func NewCache() *Cache {
	return &Cache{
		data: make(map[string]interface{}),
	}
}

func (c *Cache) Get(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	value, exists := c.data[key]
	return value, exists
}

func (c *Cache) Set(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Usage
var userCache = NewCache()

func GetUser(id string) (*User, error) {
	if cached, ok := userCache.Get(id); ok {
		return cached.(*User), nil
	}
	user, err := fetchUserFromDB(id)
	if err != nil {
		return nil, err
	}
	userCache.Set(id, user)
	return user, nil
}
```
14. Test first, but don't over-test
Test important behavior, not implementation details.

```go
// ✅ Test the business logic that matters
func TestCalculateDiscount(t *testing.T) {
	tests := []struct {
		name     string
		amount   float64
		userType string
		want     float64
	}{
		{"regular user, no discount", 100.0, "regular", 100.0},
		{"VIP user, 10% off", 100.0, "vip", 90.0},
		{"gold user, 20% off", 100.0, "gold", 80.0},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := CalculateDiscount(tt.amount, tt.userType)
			if got != tt.want {
				t.Errorf("CalculateDiscount() = %v, want %v", got, tt.want)
			}
		})
	}
}

// ❌ Over-testing internals
func TestCalculateDiscountInternals(t *testing.T) {
	// Tests of private methods or internal state changes are brittle
	// and break whenever the implementation changes.
}
```

15. If you think you need generics, think again
Go 1.18+ has generics, but use them sparingly.

```go
// ✅ Reasonable uses of generics
func Map[T, U any](slice []T, fn func(T) U) []U {
	result := make([]U, len(slice))
	for i, v := range slice {
		result[i] = fn(v)
	}
	return result
}

func Filter[T any](slice []T, predicate func(T) bool) []T {
	var result []T
	for _, v := range slice {
		if predicate(v) {
			result = append(result, v)
		}
	}
	return result
}

// Usage
numbers := []int{1, 2, 3, 4, 5}
doubled := Map(numbers, func(x int) int { return x * 2 })
evens := Filter(numbers, func(x int) bool { return x%2 == 0 })

// ❌ Overusing generics
type GenericProcessor[T, U, V, W any] interface {
	Process(T, U) (V, W, error)
	Validate(T) bool
	Transform(U) V
	// an overly complex generic interface
}
```
16. Performance problems are usually algorithmic, not language problems
Optimizing the algorithm matters more than optimizing language features.

```go
// ✅ A good algorithm
func FindDuplicates(nums []int) []int {
	seen := make(map[int]bool)
	var duplicates []int

	for _, num := range nums {
		if seen[num] {
			duplicates = append(duplicates, num)
		} else {
			seen[num] = true
		}
	}
	return duplicates // O(n) time complexity
}

// ❌ A bad algorithm
func FindDuplicatesSlow(nums []int) []int {
	var duplicates []int
	for i := 0; i < len(nums); i++ {
		for j := i + 1; j < len(nums); j++ {
			if nums[i] == nums[j] {
				duplicates = append(duplicates, nums[i])
				break
			}
		}
	}
	return duplicates // O(n²) time complexity
}
```

17. Data structures, not algorithms, are central to programming
The right data structure keeps the program simple and clear.

```go
// ✅ Choose the right data structures
type UserIndex struct {
	byID    map[string]*User
	byEmail map[string]*User
	users   []*User
}

func NewUserIndex() *UserIndex {
	return &UserIndex{
		byID:    make(map[string]*User),
		byEmail: make(map[string]*User),
		users:   make([]*User, 0),
	}
}

func (ui *UserIndex) AddUser(user *User) {
	ui.byID[user.ID] = user
	ui.byEmail[user.Email] = user
	ui.users = append(ui.users, user)
}

func (ui *UserIndex) FindByID(id string) *User {
	return ui.byID[id] // O(1) lookup
}

func (ui *UserIndex) FindByEmail(email string) *User {
	return ui.byEmail[email] // O(1) lookup
}

func (ui *UserIndex) GetAllUsers() []*User {
	return ui.users // O(1) access to all users
}

// ❌ The wrong data structure
type BadUserStorage struct {
	users []*User // slice-only storage
}

func (us *BadUserStorage) FindByID(id string) *User {
	for _, user := range us.users { // O(n) lookup
		if user.ID == id {
			return user
		}
	}
	return nil
}
```

The Zen of Go, summarized: Go emphasizes simplicity, clarity, composition, and concurrency. Rob Pike's design philosophy shows in every corner of the language: small, focused interfaces; explicit error handling; a concurrency model that shares memory by communicating; and an insistence on simplicity. Go does not chase a rich feature set; it chases effectiveness at solving real problems.

2025/11/1

Codex CLI vs Gemini CLI vs Claude Code: Which is the Best?

https://www.analyticsvidhya.com/blog/2025/08/codex-cli-vs-gemini-cli-vs-claude-code/

In 2025, several AI coding assistants that can be used directly from the terminal have been released. Codex CLI, Gemini CLI, and Claude Code are some of the most popular, each embedding a large language model into command-line workflows. These tools can generate and fix code from natural-language prompts. We document our evaluation of all three across different tasks to determine which is most useful. Each assistant is backed by a sophisticated model (o4-mini, Gemini 2.5 Pro, or Claude Sonnet 4) intended to boost productivity. We placed each one in the same environment and tested them with specific metrics on realistic programming tasks, from web development to data analysis, to make each agent's strengths clear.

Meet the Contenders: Codex CLI, Gemini CLI & Claude Code

The command line is quickly becoming a battleground for the next generation of AI coding assistants. OpenAI, Google, and Anthropic have each released advanced CLI-based coding assistants that bring powerful, impressive capabilities directly into the terminal. But what are the differences, and which is best for your workflow? Let's go over the tools.

Codex CLI: OpenAI's Code-Centric Terminal Agent

Codex CLI functions like a smart terminal assistant for coding. It listens to what you tell it and creates code. It has access to your shell and file system, and can scaffold a project, write a function, and fix a bug. Codex CLI uses OpenAI's Codex models in the background: you describe in plain English the code you want for a task, and the CLI suggests new code and files. It supports several languages, including Python, JavaScript, and Go.
Gemini CLI: Google's Terminal Agent

Gemini CLI brings together the strengths of the Gemini 2.5 Pro model with access to the terminal and filesystem, creating a seamless coding and utility assistant for developers. It can be used for much more than code generation: Gemini CLI is adept at real-time tasks such as fetching live information or running shell commands. Built on Google's infrastructure and integrated with tools such as VS Code AI, it is useful across terminals and IDEs.

Claude Code: Anthropic's CLI Assistant

Claude Code is a leading coding AI built for high-performance terminal workflows. Based on Claude Sonnet 4, it handles end-to-end software development tasks, from writing new modules to running tests to automatically creating pull requests. Claude Code aims to provide depth, consistency, and capable codebase navigation, though it is commercial and closed-source. If you are a professional software developer looking for an AI that can understand and evolve large, complex projects, Claude Code is for you.
Codex CLI vs Gemini CLI vs Claude Code: Summary

| Feature | Codex CLI | Gemini CLI | Claude Code |
|---|---|---|---|
| Model backbone | OpenAI Codex (o4-mini) | Gemini 2.5 Pro | Claude Sonnet 4 |
| Context window | 128K tokens | 1 million tokens | ~200K tokens (approx.) |
| Installation | npm install codex-cli | npm install @google/gemini | npm install claude |
| License type | Commercial, OpenAI terms | Open-source (Apache 2.0) | Commercial, subscription-based |
| Local file system access | Yes | Yes | Yes |
| Shell command execution | Native via shell integration | Native | Native |
| Unique capability | Fastest response time | Real-time web search + commands | Full codebase mapping & PR generation |
| Ideal for | Developers needing rapid iteration | Balanced dev + utility workflows | Advanced team development |
| Web integration | No live web search | Integrated Google Search | None, code-focused only |

How We Tested Them: Setup, Metrics & Tasks

Testbed & environment: all three CLI-based AI coding assistants were tested on a local workstation running Ubuntu 24.04. Codex CLI (based on OpenAI's o4-mini), Gemini CLI (Gemini 2.5 Pro), and Claude Code (Claude Sonnet 4) were installed via npm or pip. Codex CLI and Claude Code required Node.js and valid API keys; Gemini CLI required a Google login for authentication.

Evaluation metrics that matter: we evaluated each agent on five criteria:

- Code correctness
- Code generation speed
- Simplicity of prompts
- Output clarity
- Handling of errors

These measures test not just performance, but how usable and reliable a developer can expect the agents to be in a real workflow.
Real-World Tasks Used in the Battle: each agent was given three tasks to test versatility:

1. Build a game similar to Super Mario.
2. Build a Weather Clock that presents the time and the weather.
3. Perform exploratory data analysis (EDA) in Python using the Nike_Sales_Uncleaned.csv dataset.

Codex CLI vs Gemini CLI vs Claude Code: Task-by-Task Faceoff

Task 1: Creating a Super Mario Game

Goal: build a basic 2D Mario-style game.

Prompt: "Create a basic 2D Super Mario-style platformer game. The game should feature a simple tile-based layout with Mario standing on ground blocks, a background sky with clouds, a question mark block above him, and a green pipe nearby. Include basic mechanics like left/right movement and jumping using keyboard arrow keys. Simulate gravity and collision with platforms. Use pixel-art style graphics with embedded or referenced local assets."

(Screenshots of the Gemini CLI, Codex CLI, and Claude Code results omitted.)

CLI Comparison

- Claude Code: the best and most relevant of the three. It uses pixel-art rendering and gives the user full control over Mario. It also shows the mystery boxes for coins and power-ups, although nothing happens when Mario hits them.
- Codex CLI: created a pixel-art interface, but the game was unplayable because Mario was trapped inside the green box.
- Gemini CLI: created a block-style interface and a playable game, but it does not follow the original rules: it lets Mario pass through objects and makes him jump automatically near an edge without the jump key being pressed.

Claude Code beats both Codex and Gemini on game-handling logic. It shows consistent controls, gravity, and collision, and delivers the most immersive gameplay experience.

Task 2: Weather Clock App

Goal: build a clock UI with live weather updates.

Prompt: "Design and develop a visually rich weather-themed dynamic clock dashboard using only HTML, CSS, and JavaScript.
The main goal is to create a real-time clock interface that not only displays the current time but also visually adapts to the time of day. Implement four animated background transitions representing sunrise, noon, sunset, and night, each with unique colors and animated elements like moving clouds, twinkling stars, or a rising/setting sun/moon, and offer a toggle between 12-hour and 24-hour time formats. For an added layer of interactivity, include a section that displays a rotating motivational or productivity quote based on the hour."

(Screenshots of the Gemini CLI, Codex CLI, and Claude Code results omitted.)

CLI Comparison

- Claude Code: provided the most visually rich and feature-complete result. It implemented four animated themes with smooth transitions and interactive elements such as moving clouds and celestial bodies. It also shipped an auto-theme mode that shifts the background based on system time, and the 12/24-hour toggle and quote-rotation features worked seamlessly.
- Codex CLI: implemented all of the required functions, but lacked visual design and polish. The user experience felt dated, with static styling and an uninspired layout. Functionally it was sound, but its design execution was the weakest of the three.
- Gemini CLI: used a fixed background with no animation, which cost it some visual richness, though its interface was still cleaner than Codex's. The time display and quote rotation worked correctly, but the overall experience lacked interactivity and dynamism.

To summarize, Claude Code was ahead in UI logic and overall user experience, combining sound functionality, engaging visual transitions, interactive elements, and flow in the interface. Codex delivered the basic functional requirements but lacked UX, and Gemini had a moderate visual design but very little dynamism.
Task 3: Performing EDA (Exploratory Data Analysis)

Goal: clean, analyze, and visualize a dataset.

Prompt: "Perform Data Analysis and Exploratory Data Analysis (EDA) on the dataset provided in the same directory. The entire analysis should be implemented and stored in a Jupyter Notebook file named eda.ipynb. Begin by loading the dataset and inspecting its structure, including column names, data types, and summary statistics. Proceed to clean the data by handling missing values, correcting data types if necessary, and removing any duplicates. Conduct univariate analysis to understand individual features, and then perform bivariate and multivariate analysis to uncover relationships between variables. Use clear and relevant visualizations to support your insights. Organize the notebook with proper Markdown headings and explanations for each step. Conclude with at least three key observations or insights drawn from the data."

(Screenshots of the Gemini CLI, Codex CLI, and Claude Code results omitted.)

CLI Comparison

- Claude Code: produced a complete, professional-grade EDA. It followed every piece of the prompt and organized its output into separate folders: a Plots folder containing all generated visualizations, and a Code folder containing the clean, reproducible notebook. The visuals were appropriate, and the insights were reported clearly.
- Codex CLI: produced a usable but partial solution. It generated the necessary code and followed the EDA steps, but produced no visualizations and no summary of key insights. The notebook lacked final analytical conclusions and the Markdown explanations needed for interpretation.
- Gemini CLI: was unable to complete this task. It failed to finish the EDA pipeline and ultimately produced an incoherent notebook, with repeated dataset-loading failures, no visualizations, and many incomplete code blocks.

Claude Code is the one for EDA and data analysis.
It not only completes the full analytical workflow but also organizes the outputs nicely and delivers well-structured insights useful for both single-user data work and team-based environments. Codex could serve as a useful backup; Gemini CLI is not appropriate for this.

Codex CLI vs Gemini CLI vs Claude Code: Overall Analysis

Claude Code gives a clear structure and documentation and executes well; it handled the game logic and error handling without issue. Codex CLI was fast and flexible, but required some manual intervention. Gemini CLI gave a firm foundation and seemed fast, but its polish and documentation were lacking; it suffered most in the EDA assignment, missing core outputs and structural completeness.

On speed, Codex CLI was fastest, followed by Gemini and then Claude. Claude was the easiest to prompt. Each CLI suits specific workflows: Claude is strong on logic-heavy work, Codex is best for speed-focused workflows, and Gemini fits basic structured implementations that don't need refinement.

Conclusion

Claude Code was the best across all tasks, providing the highest-quality code, the best user experience, and the most complete range of features. While it was not the fastest AI coding assistant, its finished products were polished, documented, organized, and ideal for professional workflows where a lot of trust is involved. Codex CLI was the fastest and a great choice for creating quick prototypes or for coding work under time constraints.

Gemini CLI was reasonable for basic builds, but was not fast, polished, or organized enough for many kinds of work, and it struggled with data analysis tasks that required organized or insightful output. Overall, every tool has its fit, but Claude Code provides the most consistent depth as a command-line AI coding assistant.

Frequently Asked Questions

Q1. What is a CLI AI assistant, and how does it work?
A.
A CLI (Command-Line Interface) AI assistant lets users interact with an AI model directly through the terminal, automating tasks like coding, debugging, and content generation using natural-language prompts.

Q2. Which AI terminal assistant is fastest?
A. Codex CLI offers the fastest response times, followed by Gemini CLI, with Claude Code the slowest of the three. However, speed often comes at the cost of polish and completeness.

Q3. Which tool is best for development?
A. Claude Code demonstrated superior development capabilities, creating the most playable and visually appealing Super Mario-style game, with proper physics, collision detection, and interactive elements like mystery boxes.

Q4. Can Codex CLI, Gemini CLI, and Claude Code work with existing codebases?
A. Yes, all three tools have local file system access and can work with existing projects. Claude Code particularly excels at understanding and navigating large, complex codebases.

Q5. Is Claude Code always the best choice?
A. Claude Code offers the most balanced performance across tasks, especially for professional-grade projects, but it isn't the fastest.

Hello! I'm Vipin, a passionate data science and machine learning enthusiast with a strong foundation in data analysis, machine learning algorithms, and programming. I have hands-on experience building models, managing messy data, and solving real-world problems. My goal is to apply data-driven insights to create practical solutions that drive results. I'm eager to contribute my skills in a collaborative environment while continuing to learn and grow in the fields of Data Science, Machine Learning, and NLP.



Using Claude Skills with DeepSeek in One Line of Code

Claude Skills are great, but they only work inside Claude's own tools; using Skills in our own applications takes some extra work.

Last week I introduced goskills, which lets you use Claude Skills with common LLMs. Within a couple of days, colleagues at Baidu had built an efficiency tool on top of it that automatically generates "CREATE TABLE" SQL for batches of table schemas according to specific requirements, roughly `cat user.data | goskills`.

That inspired me: when I first wrote the tool I hadn't realized that invoking Claude Skills could be simplified this far, so I optimized it further, productized it, and released v0.1.3, which is easier to download and use.

Basically, after downloading it you only need to run one command:

```shell
./goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"
```

and it calls the markitdown Skill to convert the Baidu Baike page about Confucius into Markdown.

So in your program you only need to call `goskills run --auto-approve your_prompt`. Examples for several programming languages are attached at the end of this article.

Installation

You can download prebuilt binaries for Mac/Linux/Windows on amd64/arm64 from goskills/releases; put the executable somewhere on your PATH.

On macOS you can also install it with `brew install smallnest/goskills/goskills`, or in two steps:

```shell
# add tap
brew tap smallnest/goskills

# install goskills
brew install goskills
```

After installation, run `goskills run -h` to check that it is installed correctly. It should print the available options:

- -b: base URL of the LLM API
- -k: API key for the LLM
- --auto-approve: automatically approve running the scripts inside a Skill; enable this when no human intervention is wanted, otherwise the command may ask for approval (y or N) during execution
- -m: name of the model to use
- -d: directory containing the Claude Skills. Claude's tools put them in ~/.claude/skills by default

You can set api-base, api-key, and model through environment variables so you don't have to pass them on the command line.

You can download the Skills you want from the web; common collections include:

- anthropics/skills
- travisvn/awesome-claude-skills
- ComposioHQ/awesome-claude-skills
- BehiSecc/awesome-claude-skills
- K-Dense-AI/claude-scientific-skills

The markitdown Skill

markitdown is a lightweight Python tool developed by Microsoft that converts all kinds of file formats to Markdown, mainly in service of large language model (LLM) and text-analysis pipelines (see its GitHub page).

Broad format support: markitdown can convert many formats, including Word documents, Excel spreadsheets, PowerPoint presentations, PDF, HTML, images, audio files, text formats such as CSV, JSON, and XML, and even ZIP archives.

Optimized for LLMs: the tool preserves important document structure and content, such as headings, lists, tables, and links, so the resulting Markdown can be consumed directly by mainstream LLMs such as GPT-4o.

In theory a corresponding Skill should have been published somewhere; in practice it wasn't easy to find, but I eventually located one: scientific-skills/markitdown, part of a project containing 120+ scientific Skills.

The SKILL.md of a powerful Skill is always long, and the markitdown Skill is no exception, so I won't paste it here, only link it: SKILL.md. It contains a general-purpose Python script plus more detailed reference manuals for the five conversion types.

In practice I found that it often failed on web-page conversion, so I extended the Skill with a script for web pages, convert_webpage.py:

```python
import sys

from markitdown import MarkItDown

if len(sys.argv) != 2:
    print("Usage: python convert_webpage.py <URL>")
    sys.exit(1)

url = sys.argv[1]
try:
    md = MarkItDown()
    result = md.convert(url)
    print(result.text_content)
except Exception as e:
    print(f"An error occurred: {e}")
    sys.exit(1)
```

Nothing else was modified, and web-page conversion now works very well.

A good example of using this Skill is the command at the beginning of the article (with goskills already on my PATH and the three LLM environment variables set):

```shell
goskills run --auto-approve "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"
```

Since it is a command-line tool, it is easy to invoke from any programming language, with no need to write complex LLM and Claude Skill parsing and invocation logic. Below are examples in several common languages.

Invocation examples in various languages

Before running any goskills command, you must set the API key for your LLM:

```shell
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
```

The base command for these examples is:

```shell
goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=~/.claude/skills "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"
```

Shell (Bash)

The most direct way to run the command.

```shell
#!/bin/bash

# Define the prompt
PROMPT="使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"

# Run the command and capture the output
RESULT=$(goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=$HOME/.claude/skills "$PROMPT")

# Or run it directly and print
# goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=$HOME/.claude/skills "$PROMPT"

echo "Output:"
echo "$RESULT"
```

Python

The subprocess module is the standard way to run external commands in Python. To handle the ~ correctly, expand it with os.path.expanduser.

```python
import os
import subprocess

# Define the prompt
prompt = "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"

# Expand the skills directory path
skills_dir_path = os.path.expanduser("~/.claude/skills")
skills_dir_arg = f"--skills-dir={skills_dir_path}"

# Define the command as an argument list, for safety
command = [
    "goskills", "run",
    "--auto-approve",
    "--model", "deepseek-v3",
    "--api-base", "https://qianfan.baidubce.com/v2",
    skills_dir_arg,
    prompt,
]

# Alternatively, build the list from a string with shlex.split (import shlex),
# making sure skills_dir_path is expanded and the prompt stays quoted:
# cmd_str = f'goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 {skills_dir_arg} "{prompt}"'
# command = shlex.split(cmd_str)

try:
    # Run the command, capturing stdout and stderr
    result = subprocess.run(
        command,
        check=True,           # raise if the command exits non-zero
        capture_output=True,  # capture stdout and stderr
        text=True,            # decode stdout/stderr as text
    )
    print("Command succeeded:")
    print("Output:\n", result.stdout)
except FileNotFoundError:
    print("Error: 'goskills' not found. Make sure it is on your PATH.")
except subprocess.CalledProcessError as e:
    print(f"Command failed with exit code {e.returncode}:")
    print("Stderr:\n", e.stderr)
```

JavaScript (Node.js)

In Node.js, use the child_process module.

```javascript
const { exec } = require('child_process');

// Single-quote the outer string so the inner double quotes survive;
// exec runs through a shell, so $HOME is expanded.
const command = 'goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=$HOME/.claude/skills "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"';

exec(command, (error, stdout, stderr) => {
  if (error) {
    console.error(`Execution error: ${error.message}`);
    if (stderr) {
      console.error(`Stderr: ${stderr}`);
    }
    return;
  }
  console.log(`Command output:\n${stdout}`);
});
```

Go

In Go, use the os/exec package. To handle the ~ correctly, get the user's home directory and build the path manually.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	prompt := "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"

	// Get the user's home directory
	homeDir, err := os.UserHomeDir()
	if err != nil {
		fmt.Printf("cannot determine home directory: %v\n", err)
		return
	}

	// Build the full skills directory path
	skillsDirPath := filepath.Join(homeDir, ".claude", "skills")
	skillsDirArg := "--skills-dir=" + skillsDirPath

	cmd := exec.Command("goskills", "run",
		"--auto-approve",
		"--model", "deepseek-v3",
		"--api-base", "https://qianfan.baidubce.com/v2",
		skillsDirArg,
		prompt)

	// CombinedOutput runs the command and returns its combined stdout and stderr
	output, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Printf("error running command: %v\n", err)
		fmt.Printf("output:\n%s\n", string(output))
		return
	}

	fmt.Printf("command output:\n%s\n", string(output))
}
```

Java

ProcessBuilder is the modern, recommended way to run commands in Java. To pass --skills-dir correctly, build the full path from the user's home directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Paths;

public class GoSkillsRunner {
    public static void main(String[] args) {
        try {
            // Build the skills directory path from the user's home directory
            String userHome = System.getProperty("user.home");
            String skillsDirPath = Paths.get(userHome, ".claude", "skills").toString();
            String skillsDirArg = "--skills-dir=" + skillsDirPath;

            ProcessBuilder pb = new ProcessBuilder(
                "goskills", "run",
                "--auto-approve",
                "--model", "deepseek-v3",
                "--api-base", "https://qianfan.baidubce.com/v2",
                skillsDirArg,
                "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"
            );

            // Merge stderr into stdout
            pb.redirectErrorStream(true);
            Process process = pb.start();

            // Read the command's output
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append("\n");
                }
            }

            int exitCode = process.waitFor();
            System.out.println("Exit code: " + exitCode);
            System.out.println("Output:\n" + output);
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
```

Rust

In Rust, use the std::process::Command struct to run external commands. To handle $HOME correctly, read the environment variable and build the path manually.

```rust
use std::env;
use std::path::PathBuf;
use std::process::Command;

fn main() {
    let prompt = "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584";

    // Read the HOME environment variable
    let home_dir =
```
env::var("HOME").expect("HOME 环境变量未设置"); // 构建技能目录路径 let skills_dir_path = PathBuf::from(home_dir).join(".claude").join("skills"); let skills_dir_arg = format!("--skills-dir={}", skills_dir_path.to_str().expect("路径无效")); let output = Command::new("goskills") .arg("run") .arg("--auto-approve") .arg("--model") .arg("deepseek-v3") .arg("--api-base") .arg("https://qianfan.baidubce.com/v2") .arg(&skills_dir_arg) // 使用构建的路径参数 .arg(prompt) .output() .expect("未能执行命令"); if output.status.success() { println!("命令执行成功:"); println!("输出:\n{}", String::from_utf8_lossy(&output.stdout)); } else { eprintln!("命令失败,退出码: {:?}", output.status.code()); eprintln!("标准错误输出:\n{}", String::from_utf8_lossy(&output.stderr)); } } C++ 在 C++ 中,std::system 提供了一种简单的方式来执行 shell 命令。然而,为了捕获命令的输出,popen 是一个更好的选择。popen 会执行一个命令并创建一个管道,允许程序读取该命令的标准输出。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 #include <iostream> #include <cstdio> // For popen, pclose, fgets #include <string> #include <array> int main() { std::string prompt = "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"; std::string command = "goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=$HOME/.claude/skills \"" + prompt + "\""; // 使用 "r" 模式执行 popen 以读取命令的输出 FILE* pipe = popen(command.c_str(), "r"); if (!pipe) { std::cerr << "无法执行命令!" 
<< std::endl; return 1; } std::array<char, 256> buffer; std::string result; while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) { result += buffer.data(); } // pclose 会等待命令终止并返回其退出状态 int exit_code = pclose(pipe); std::cout << "命令输出:" << std::endl; std::cout << result << std::endl; std::cout << "退出码: " << WEXITSTATUS(exit_code) << std::endl; return 0; } C 在 C 语言中,system() 函数是 stdlib.h 的一部分,允许执行 shell 命令。但它不容易捕获输出。为了读取命令的输出,推荐使用 popen 函数,它会创建一个管道来连接到被调用进程的标准输出。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 #include <stdio.h> #include <stdlib.h> #include <string.h> #define BUFFER_SIZE 256 int main() { char prompt[] = "使用markitdown 工具解析网页 https://baike.baidu.com/item/%E5%AD%94%E5%AD%90/1584"; char command[512]; snprintf(command, sizeof(command), "goskills run --auto-approve --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 --skills-dir=$HOME/.claude/skills \"%s\"", prompt); FILE *pipe = popen(command, "r"); if (pipe == NULL) { fprintf(stderr, "无法执行命令!\n"); return 1; } char buffer[BUFFER_SIZE]; printf("命令输出:\n"); // 逐行读取管道的输出并打印 while (fgets(buffer, sizeof(buffer), pipe) != NULL) { printf("%s", buffer); } // pclose 等待命令终止并返回其退出状态 int exit_code = pclose(pipe); fprintf(stdout, "\n退出码: %d\n", WEXITSTATUS(exit_code)); return 0; }

2025/11/1

goskills:Claude Skills 功能强大,为我所用

在去年年底Claude推出MCP的功能后,MCP热度维持了小半年,MCP开发和研究风生水起。一年后,Claude又推出了一个新的概念:Skills。 Claude 的各种应用(desktop、code cli、claude.ai等)现在可以使用 “Skills” 来改进其执行特定任务的方式。“Skills”是包含指令、脚本和资源的文件夹,Claude 应用可以根据需要加载这些文件夹。 Claude 应用只会在Skill与当前任务相关时才会使用该Skill。使用Skill后,Claude 可以更高效地完成特定任务,例如使用 Excel 或操作pdf。 在执行任务时,Claude 应用会扫描可用Skill以查找相关匹配项。找到匹配项后,它只会加载所需的最少信息和文件——既保证了Claude的运行速度,又能让它快速获取专业知识。 Skills 功能强大: 可组合:技能可以叠加使用。Claude 会自动识别所需技能并协调其使用。 可移植:技能在所有地方都使用相同的格式。只需构建一次,即可在 Claude 应用、Claude Code 和 API 中使用。 高效:只在需要时加载所需内容。 强大的技能还可以包含可执行代码,以完成传统编程比令牌生成更可靠的任务。 下面是官方的一些资源: Claude 应用:用户指南和帮助中心 API 开发人员:文档 Claude code:文档 可自定义的示例技能:GitHub 仓库 除了官方的示例技能,很快就有一群网友开发和整理了一批Claude Skills资源,比如: travisvn/awesome-claude-skills ComposioHQ/awesome-claude-skills BehiSecc/awesome-claude-skills K-Dense-AI/claude-scientific-skills Claude 官方对SKILL的规范描述得比较清楚了,但是对于LLM怎么使用SKILLs并没有一个详细的描述,这给其他大语言模型使用SKILLs带来了很大的挑战。现在网上所有关于Claude Skills的讨论都是基于Claude的各种应用的,鲜有openai、qwen、deepseek如何使用SKILLs的资料。考虑到 Claude官方对中国的敌意,以及在国内也很少能够购买Claude会员和使用Claude应用,我们对其他大语言模型使用SKILLs的研究还是很有意义的。 少有的几篇资料,从外部的视角分析Claude如何调用SKILLs的: Claude Agent Skills: A First Principles Deep Dive Claude Code Skills Just Changed Everything About AI Assistants Claude Skills: A Technical Deep-Dive into Context Injection Architecture Inside a Claude Skill: SKILL.md, resources, and how Claude loads them 作为对SKILLs在其他大语言模型上的探索,我实现了一个Claude SKILL的Go语言库,它提供了下面的功能: 对skill包的解析 提供了一个inspector功能,可以在命令行中分析skill包。inspired by raw391-ai/skill-cli 实现了一个deepseek调用SKILL的示例。事实上可以支持任意兼容openai API的llm 接下来我就详细介绍这个库的功能。 库地址: https://github.com/smallnest/goskills 解析SKILL包 Claude Skill 可以打包成一个zip文件上传,最终会解开到一个文件夹(比如.claude/skills/xxx),包含一个SKILL.md文件和其它的一些资源。 这个库做了以下的事情: 解析SKILL.md,得到skill的元数据和指令 抽取YAML frontmatter到Go数据结构SkillMeta 获取skill的markdown信息 发现scripts/、references/和assets/ 目录下的资源 如果你使用这个库,你可以执行下面的命令引入: go get github.com/smallnest/goskills 下面是一个使用这个库解析skill的例子: package main import ( "fmt" "log" "github.com/smallnest/goskills" )
func main() { // Path to the skill directory you want to parse skillDirectory := "./examples/skills/artifacts-builder" skillPackage, err := goskills.ParseSkillPackage(skillDirectory) if err != nil { log.Fatalf("Failed to parse skill package: %v", err) } // Print the parsed information fmt.Printf("Successfully Parsed Skill: %s\n", skillPackage.Meta.Name) // ... and so on } goskills库在examples目录复制了Claude官方的SKILLs例子,你也可以手工clone官方的例子:https://github.com/anthropics/skills 。 上面这个例子使用ParseSkillPackage 函数解析某个SKILL的文件夹,得到它的信息,后续就可以结合LLM进行调用了。 inspector 受raw391-ai/skill-cli启发,我基于这个库实现了一个SKILL inspector命令行工具,可以在命令行中审视某个SKILL。 它包含几个子命令: list - 查看某个目录下包含的所有SKILL parse - 显示某个SKILL的概览 detail - 显示某个SKILL的详情 files - 检查某个SKILL的所有文件 search - 搜索某个SKILL 比如我们在项目中使用下面的命令编译出这个cli程序: go build -o goskills-cli ./cmd/skill-cli ./goskills-cli list ./examples/skills 就可以显示出此目录下的所有SKILL: ./goskills-cli parse ./examples/skills/artifacts-builder 显示artifacts-builder 这个skill的基本信息: 其它命令的使用方法类似。 正如这个 cli 的名称所示,它可以用来检查SKILL的信息,它也是goskills库的一个应用示例。 和deepseek结合 最终,这个库的目的是提供和其他大语言模型结合的能力。 它提供了goskills-runner 命令行工具,演示了deepseek-v3 调用SKILL的能力。 目前这个演示还只是『演示』,没有考虑调用SKILL的安全性问题,对SKILL调用方式的研究也还不够深入,但它已经打通了SKILL在其他LLM中应用的基本流程,为后续功能的深入和演化打下了一个很好的基础。 接下来让我们看看它是怎么实现的?
首先我们通过下面的命令编译出这个工具: 1 go build -o goskills-runner ./cmd/skill-runner 然后,你就可以使用openai API 兼容的大模型进行测试了: 1 2 export OPENAI_API_KEY="YOUR_OPENAI_API_KEY" ./goskills-runner run --model deepseek-v3 --api-base https://qianfan.baidubce.com/v2 "create an algorithm that generates abstract art" 环境变量OPENAI_API_KEY 设置你在服务商中生成的key, 无论你使用的是deepseek、千问、文心还是openai等。 命令行中传入模型名称和api_base地址,这在各模型服务商的文档中都有提供。这周我会整理几家大的模型服务器的信息和使用方法,我会单写一篇文章。 这里我使用的是百度云提供的deepseek服务,它的模型和官方的模型名称不太一样,deepseek官方的模型名称是deepseek-chat, 百度云提供的模型名称是deepseek-v3, 其实都无所谓啦。然后百度云的调用的api的地址是https://qianfan.baidubce.com/v2。 create an algorithm that generates abstract art这个就是我们作为用户的请求。 先看一下输出效果: 采用三步走的策略: discoverSkills: 第一步是找到所有可用的SKILL, 加到上下文中 selectSkill:让LLM选择一个SKILL executeSkillWithTools: 执行一个SKILL, 使用它的定义的tool执行 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 // --- STEP 1: SKILL DISCOVERY --- fmt.Println("🔎 Discovering available skills...") availableSkills, err := discoverSkills(skillsPath) if err != nil { return fmt.Errorf("failed to discover skills: %w", err) } if len(availableSkills) == 0 { return errors.New("no valid skills found") } fmt.Printf("✅ Found %d skills.\n\n", len(availableSkills)) // --- STEP 2: SKILL SELECTION --- fmt.Println("🧠 Asking LLM to select the best skill...") selectedSkillName, err := selectSkill(ctx, client, userPrompt, availableSkills) if err != nil { return fmt.Errorf("failed during skill selection: %w", err) } selectedSkill, ok := availableSkills[selectedSkillName] if !ok { fmt.Printf("⚠️ LLM selected a non-existent skill '%s'. 
Aborting.\n", selectedSkillName) return nil } fmt.Printf("✅ LLM selected skill: %s\n\n", selectedSkillName) // --- STEP 3: SKILL EXECUTION (with Tool Calling) --- fmt.Println("🚀 Executing skill (with potential tool calls)...") fmt.Println(strings.Repeat("-", 40)) err = executeSkillWithTools(ctx, client, userPrompt, selectedSkill) if err != nil { return fmt.Errorf("failed during skill execution: %w", err) } 具体的实现代码可以看代码库。这里我们仅仅是显示了一个简单的SKILL的调用逻辑。 下一步就是在每个步骤进行细化,尤其是第三个步骤,要加上安全性检查、沙盒执行、文件嵌套、命令抽取和执行更复杂的逻辑。 tool 为了让SKILL执行bash、python的等脚本,我们还需要实现执行shell命令的工具。 所以tool 包下还实现几个常用的命令 file_tool: 文件操作工具 knowledge_tool: wiki知识库查询 python_tool: python调用工具 shell_tool: shell脚本执行工具 web_search_tool: duckduckgo 搜索工具等 未来这个工具集会越来越丰富
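顺着上面 tool 包的思路,可以把“工具”抽象成一个很小的接口,再按名称注册和分发——LLM 在响应中只需给出工具名和参数。下面是一个极简的 Go 示意(Tool 接口、registry 等命名均为本文的假设性设计,goskills 的实际实现以代码库为准):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Tool 是一个极简的工具接口(假设性设计,仅作示意)。
type Tool interface {
	Name() string
	Run(input string) (string, error)
}

// shellTool 通过 sh -c 执行命令,对应文中提到的 shell_tool 的最小形态。
type shellTool struct{}

func (shellTool) Name() string { return "shell_tool" }

func (shellTool) Run(input string) (string, error) {
	out, err := exec.Command("sh", "-c", input).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

// registry 按名称分发工具调用。
var registry = map[string]Tool{}

func register(t Tool) { registry[t.Name()] = t }

func main() {
	register(shellTool{})
	// 模拟 LLM 返回的一次 tool call:{"name": "shell_tool", "input": "echo hello"}
	if t, ok := registry["shell_tool"]; ok {
		out, err := t.Run("echo hello")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(out)
	}
}
```

在真实场景中,第三步 executeSkillWithTools 收到模型的 tool call 后,就可以通过这样的注册表找到对应工具执行,并把输出回传给模型;安全检查和沙盒执行也可以统一加在 Run 的入口处。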

2025/11/1

langchain + MCP:如虎添翼

MCP技术毋须多言了,上半年火的一塌糊涂,现在进入冷静期了。 langchain 本身就很方便的集成进程内的工具,但是加上 MCP的功能,就如虎添翼,可以充分利用网上上万的MCP的服务。 langchain 自从上个月融资了1.25亿美元之后,资金充足,也更加有动力推进产品的演化,相继发布了langchain/langgraph 1.0的版本。 langchain 1.0中统一了agent的创建,使用create_agent代替之前的create_tool_calling_agent、create_react_agent、create_json_agent、create_xml_agent等。 这篇文章介绍 2(MCP的两种模式sse、stdio) x 2 (经典的langchain agent和1.0最新版create_agent两个模式)一共4个例子,介绍了langchain如何使用MCP 工具丰富其功能。 先决条件 在运行这些示例之前,请确保您已安装必要的库。您通常可以使用 pip 进行安装: 1 pip install langchain langchain-openai langchain-classic langchain-mcp-adapters mcp 此外,请确保您的环境中已安装 python3。 共同组件 这两个示例都共享几个核心组件和一个共同目标:使用 LangChain Agent 通过 MCP 服务器提供的工具来回答一个简单的数学问题。 LLM 配置:两个示例都使用 ChatOpenAI,配置为 deepseek-v3 模型,温度为 0。 这是我使用百度云上提供的deepseek服务,你要是使用deepseek官方的服务,需要修改模型为deepseek-chat。 我已经把KEY和调用地址配置在环境变量中了,所以在代码中不用显示指定: 1 2 export OPENAI_API_KEY=bce-v3/abcsfsfdskgergerthntjrweeuidfu8324refbif3 export OPENAI_API_BASE=https://qianfan.baidubce.com/v2 创建模型对象: 1 llm = ChatOpenAI(model="deepseek-v3", temperature=0) Agent 提示:ChatPromptTemplate 用于定义 Agent 的角色并构建对话。它包括一个系统消息、可选的聊天历史记录、人类输入以及 Agent 暂存区(用于规划其行动)的占位符。 1 2 3 4 5 6 7 8 prompt = ChatPromptTemplate.from_messages( [ ("system", "你是一个可以使用工具的得力助手。 தயவுசெய்து கருவிகளைப் பயன்படுத்தவும்."), MessagesPlaceholder("chat_history", optional=True), ("human", "{input}"), MessagesPlaceholder("agent_scratchpad"), ] ) Agent 创建和执行:create_tool_calling_agent 函数用于构建能够使用工具的 Agent ,AgentExecutor 用于运行 Agent 。 1 2 agent = create_tool_calling_agent(llm, tools, prompt) agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) # SSE 示例中 verbose=False 任务: Agent 被调用,输入为“123 + 456 等于多少?”。 MCP 服务 Stdio服务 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 import asyncio from mcp.server import FastMCP # 创建一个服务器实例 server = FastMCP(name="math_server", log_level="ERROR") # 定义并注册 add 工具 @server.tool() def add(a: int, b: int) -> int: """将两个整数相加。""" return a + b async def main(): await server.run_stdio_async() if __name__ == "__main__": asyncio.run(main()) sse 服务 1 2 3 4 5 6 7 8 9 10 11 
12 13 14 15 16 17 18 19 20 21 import asyncio from typing import Annotated from mcp.server import FastMCP # 创建一个服务器实例 server = FastMCP(name="math_server", instructions="一个可以做加法的简单数学服务器。", log_level="ERROR") # 定义并注册 add 工具 @server.tool() def add( a: Annotated[int, "第一个整数"], b: Annotated[int, "第二个整数"] ) -> int: """将两个整数相加。""" return a + b async def main(): await server.run_sse_async() if __name__ == "__main__": asyncio.run(main()) 示例 1:example_1_mcp_tool_stdio.py(标准 I/O 通信) 此示例演示了如何设置一个 MCP 服务器,该服务器通过标准输入和输出流与客户端通信。这适用于本地、单进程交互,其中服务器可以作为子进程生成。 目的 example_1_mcp_tool_stdio.py 脚本展示了如何: 定义一个通过 stdio 暴露工具的 MCP 服务器。 创建一个 stdio_client 来连接到此服务器。 将服务器提供的工具加载到 LangChain Agent 中。 使用 Agent 解决需要加载工具的问题。 关键组件和代码 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 import asyncio import os from pathlib import Path from langchain_openai import ChatOpenAI from langchain_classic.agents import AgentExecutor, create_tool_calling_agent from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder from langchain_mcp_adapters.tools import load_mcp_tools from mcp.client.stdio import stdio_client from mcp import ClientSession, StdioServerParameters async def run_mcp_tool_example(): # ... (LLM 和提示设置,如共同组件中所述) ... # 1. 定义 MCP 数学服务器的路径 # 此行构建了 stdio 数学服务器脚本的绝对路径。 # `Path(__file__).parent` 获取当前脚本的目录。 mcp_server_path = Path(__file__).parent / "mcp_math_server_stdio.py" # 2. 为 MCP 服务器设置 stdio 客户端参数 # StdioServerParameters 指定如何运行 MCP 服务器。 # `command="python3"` 指示解释器。 # `args=[str(mcp_server_path)]` 提供要作为参数执行的脚本。 server_params = StdioServerParameters( command="python3", args=[str(mcp_server_path)], ) # 3. 建立 stdio 客户端会话 # `stdio_client(server_params)` 创建一个异步上下文管理器, # 它生成服务器进程并提供用于通信的读/写流。 async with stdio_client(server_params) as (read, write): # 4. 创建 MCP 客户端会话 # `ClientSession` 通过提供的读/写流管理 MCP 协议。 async with ClientSession(read, write) as session: # 5. 
初始化连接 # 此步骤对于客户端和服务器建立 MCP 连接至关重要。 await session.initialize() # 6. 从 MCP 服务器获取工具 # `load_mcp_tools(session)` 从 MCP 服务器获取工具定义 # 并将它们转换为 LangChain 兼容的工具对象。 tools = await load_mcp_tools(session) # ... ( Agent 创建和执行,如共同组件中所述) ... print("正在调用 agent 回答一个数学问题...") response = await agent_executor.ainvoke( {"input": "123 + 456 等于多少?", "chat_history": []} ) print(f"Agent 回答: {response['output']}") if __name__ == "__main__": asyncio.run(run_mcp_tool_example()) 如何运行 要运行此示例,只需执行 Python 脚本: 1 python3 example_1_mcp_tool_stdio.py 您将看到 Agent 的思考过程和最终答案打印到控制台。 示例 2:example_1_mcp_tool_sse.py(服务器发送事件通信) 此示例演示了如何集成一个 MCP 服务器,该服务器通过 HTTP 上的服务器发送事件(SSE)暴露其工具。此方法更适用于工具服务器可能是独立网络服务的情况。 目的 example_1_mcp_tool_sse.py 脚本展示了如何: 将 MCP SSE 服务器作为单独的进程启动。 创建一个 MultiServerMCPClient 来连接到此基于 HTTP 的服务器。 将服务器提供的工具加载到 LangChain Agent 中。 使用 Agent 解决需要加载工具的问题。 关键组件和代码演练 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 import asyncio import os from pathlib import Path import sys from langchain_openai import ChatOpenAI from langchain_classic.agents import AgentExecutor, create_tool_calling_agent from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder from langchain_mcp_adapters.client import MultiServerMCPClient async def run_mcp_tool_example(): # ... (LLM 和提示设置,如共同组件中所述) ... # 1. 启动 MCP 数学服务器作为子进程 # 这将 `mcp_math_server_sse.py` 脚本作为单独的 Python 进程启动。 # `sys.executable` 确保使用正确的 Python 解释器。 # `stdout=asyncio.subprocess.PIPE` 和 `stderr=asyncio.subprocess.PIPE` # 捕获子进程的输出,尽管此处未明确读取。 mcp_server_path = Path(__file__).parent / "mcp_math_server_sse.py" server_process = await asyncio.create_subprocess_exec( sys.executable, str(mcp_server_path), stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE, ) # 2. 等待服务器启动 # 添加一个小的延迟,以使 SSE 服务器有时间初始化并开始监听。 await asyncio.sleep(5) # 3. 
为 MCP 服务器设置客户端 # `MultiServerMCPClient` 用于基于 HTTP 的 MCP 服务器。 # 它接受一个字典,其中键是服务器名称(例如,“math”),值 # 指定传输类型(“sse”)和 SSE 端点的 URL。 client = MultiServerMCPClient( { "math": { "transport": "sse", "url": "http://localhost:8000/sse", }, } ) # 4. 从 MCP 服务器获取工具 # `client.get_tools()` 连接到指定的 SSE 端点, # 检索工具定义,并将其作为 LangChain 工具提供。 tools = await client.get_tools() # ... ( Agent 创建和执行,如共同组件中所述) ... print("正在调用 agent 回答一个数学问题...") response = await agent_executor.ainvoke( {"input": "123 + 456 等于多少?", "chat_history": []} ) print(f"Agent 回答: {response['output']}") # 5. 终止服务器进程 # 通过终止已启动的子进程进行清理非常重要。 server_process.terminate() await server_process.wait() if __name__ == "__main__": asyncio.run(run_mcp_tool_example()) 如何运行 要运行此示例,只需执行 Python 脚本: 1 python3 example_1_mcp_tool_sse.py 您将看到 Agent 的思考过程和最终答案打印到控制台。请注意,在此示例中,AgentExecutor 的 verbose 标志设置为 False,以避免 SSE 服务器子进程产生过多的输出。 比较:Stdio 与 SSE Stdio(标准 I/O): 简单性:更易于设置本地、单机通信。 执行:MCP 服务器通常作为客户端应用程序的子进程生成。 用例:适用于紧密耦合的组件,或者当您希望将工具服务器直接与应用程序捆绑时。 SSE(服务器发送事件): 灵活性:允许 MCP 服务器作为独立的网络服务运行,可能在不同的机器上。 可伸缩性:可以成为更大微服务架构的一部分。 执行:客户端连接到正在运行的 HTTP 端点。服务器需要单独启动(如示例中通过 asyncio.create_subprocess_exec 所示)。 用例:适用于分布式系统、基于 Web 的应用程序,或者当工具服务器需要被多个客户端访问时。 这两种方法都有效地允许 LangChain Agent 发现和利用使用多客户端协议定义的工具,从而为这些工具的部署和访问提供了灵活性。 这两个示例都是langchain经典的Agent开发模型,他们演示了langchain将 MCP 工具与 LangChain Agent 集成的强大功能和灵活性。无论它们是通过 stdio 还是 SSE 暴露。这使得健壮且可伸缩的 Agent 应用程序成为可能。 接下来我介绍如何使用 Langchain 1.0 中新的 create_agent 函数来构建 Agent ,并将其与 MCP 工具集成。与之前的示例不同,create_agent 提供了一种更简洁的方式来定义 Agent ,直接接受 LLM 模型、工具列表和系统提示,并返回一个可流式传输的图(graph)对象。 示例 3:使用 Langchain 1.0 create_agent 的 MCP SSE 工具集成 此示例展示了如何将 MCP SSE 服务器提供的工具与 Langchain 1.0 的 create_agent 函数结合使用。它演示了如何启动 SSE 服务器、获取工具,然后使用新的 Agent 创建和调用模式来解决数学问题。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 import asyncio import os from pathlib import Path import sys from langchain_openai import ChatOpenAI from 
langchain.agents import create_agent from langchain_mcp_adapters.client import MultiServerMCPClient async def run_mcp_sse_new_agent_example(): server_process = None try: # 启动 MCP 数学服务器作为子进程 mcp_server_path = Path(__file__).parent / "mcp_math_server_sse.py" server_process = await asyncio.create_subprocess_exec( sys.executable, str(mcp_server_path), stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE, ) # 等待服务器启动 await asyncio.sleep(5) # 为 MCP 服务器设置客户端 client = MultiServerMCPClient( { "math": { "transport": "sse", "url": "http://localhost:8000/sse", }, } ) # 从 MCP 服务器获取工具 tools = await client.get_tools() # 使用用户偏好配置 LLM llm = ChatOpenAI(model="deepseek-v3", temperature=0) # 创建一个 agent graph = create_agent( model=llm, tools=tools, system_prompt="你是一个可以使用工具的得力助手。 தயவுசெய்து கருவிகளைப் பயன்படுத்தவும்.", ) print("正在调用 agent 回答一个数学问题...") inputs = {"messages": [{"role": "user", "content": "123 + 456 等于多少?"}]} async for chunk in graph.astream(inputs, stream_mode="updates"): print(chunk) finally: if server_process and server_process.returncode is None: print("\nTerminating server process...") server_process.terminate() await server_process.wait() elif server_process and server_process.returncode is not None: print(f"\nServer process exited with code: {server_process.returncode}") if server_process and server_process.stderr: stderr_output = await server_process.stderr.read() if stderr_output: print("\nServer stderr output:") print(stderr_output.decode()) if __name__ == "__main__": asyncio.run(run_mcp_sse_new_agent_example()) 示例 4:使用 Langchain 1.0 create_agent 的 MCP Stdio 工具集成 此示例演示了如何将 MCP Stdio 服务器提供的工具与 Langchain 1.0 的 create_agent 函数结合使用。它展示了如何通过 stdio 客户端连接到服务器、获取工具,然后使用新的 Agent 创建和调用模式来解决数学问题。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 import asyncio import os from pathlib import Path from langchain_openai import ChatOpenAI from langchain.agents import create_agent from 
langchain_mcp_adapters.tools import load_mcp_tools from mcp.client.stdio import stdio_client from mcp import ClientSession, StdioServerParameters async def run_mcp_stdio_new_agent_example(): # 定义 MCP 数学服务器的路径 mcp_server_path = Path(__file__).parent / "mcp_math_server_stdio.py" # 为 MCP 服务器设置 stdio 客户端 server_params = StdioServerParameters( command="python3", args=[str(mcp_server_path)], ) async with stdio_client(server_params) as (read, write): async with ClientSession(read, write) as session: # 初始化连接 await session.initialize() # 从 MCP 服务器获取工具 tools = await load_mcp_tools(session) # 使用用户偏好配置 LLM llm = ChatOpenAI(model="deepseek-v3", temperature=0) # 创建一个 agent graph = create_agent( model=llm, tools=tools, system_prompt="你是一个可以使用工具的得力助手。 தயவுசெய்து கருவிகளைப் பயன்படுத்தவும்.", ) print("正在调用 agent 回答一个数学问题...") inputs = {"messages": [{"role": "user", "content": "123 + 456 等于多少?"}]} async for chunk in graph.astream(inputs, stream_mode="updates"): print(chunk) if __name__ == "__main__": asyncio.run(run_mcp_stdio_new_agent_example())

2025/11/1

Linux 中网络包的一生

从 write() 到 recv() 的实用导览。 你运行了 curl http://example.com,现在在终端里得到了一些 HTML,但实际上发生了什么?Linux 会让你的字节经过一套明确的步骤:选定一条路径,查找邻居的 MAC 地址,把包放进发送队列,请求网卡发送,然后在另一端执行反向的操作。 这篇文章尽量简单地解释这条路。如果你用过 Linux,运行过 curl,或者试过 ip addr,你完全有能力读懂这篇文章,不需要多么高深的背景。 注意:当我在这篇文章中说“内核”时,我实际上指的是“Linux 内核及其网络栈”,即内核中运行并移动数据包的部分。 我们要讲的内容 以下是我们将介绍的简化路径: your app ↓ write()/send() TCP (segments your bytes) ↓ IP (chooses where to send them) ↓ Neighbor/ARP (find the next-hop MAC) ↓ qdisc (queueing, pacing) ↓ driver/NIC (DMA to hardware) ↓ wire / Wi‑Fi / fiber ↓ NIC/driver (other host) ↓ IP (checks, decides it's for us) ↓ TCP (reassembles, ACKs) ↓ server app 第一部分 传输:从 write() 到网线 步骤1:你的应用将字节传递给内核 你在 TCP 套接字上调用 send() 或 write()。 内核接受你的缓冲区并按顺序发送。 TCP 会将大的缓冲区拆分成大小适合路径的 数据段 (segments)。通信双方会在 TCP 握手 期间通告各自的 最大数据段大小 (MSS),发送方会将自身的数据段大小限制在对方通告的 MSS 内,同时还要受到当前 路径最大传输单元 (Path MTU) 以及任何 IP/TCP 选项(如:时间戳)的进一步约束。 它会为每个数据段标记 序列号 (sequence numbers),以便接收方能够按正确的顺序进行重组。 [!info] 🔌 套接字 套接字 只是你程序的一个 通信端点。对于 TCP 而言,内核会为每个套接字维护状态信息,包括:序列号、拥塞窗口 (congestion window) 和 定时器 等。 [!info] 🤝 TCP 握手 (TCP Handshake) 在任何 write() 的数据到达对端之前,TCP 会进行快速的三步设置:1) 客户端 -> 服务器:SYN,并带有选项(MSS、SACK 允许、窗口缩放、时间戳、ECN)。2) 服务器 -> 客户端:SYN-ACK 及其选项。3) 客户端 -> 服务器:ACK。双方就初始序列号和选项达成一致,连接状态由此建立。TLS 说明:对于 HTTPS 来说,TLS 握手是在 TCP 建立后运行的。 [!todo] 试试看 下载东西时运行 ss -tni。你会看到随着数据在线路上传输并被应用消耗,TCP 的发送和接收队列大小会波动。 步骤2:内核决定将数据发送到哪里(路由) 内核会查看目标 IP 并选择最匹配的路由。在典型的主机上,问题归结为:这个 IP 是在我的本地网络上,还是我应该交给网关?
如果地址位于直接连接的网络上,则会通过该接口发送。 否则,它会交给你的默认网关(通常是路由器)。 [!todo] 试试看 ip route get 192.0.2.10 它会打印接口、下一跳(如果有)以及内核将使用的源 IP。 [!info] 策略路由 内核可以使用 ip rule查询多个路由表(例如按源地址或标记选择路由)。大多数笔记本和服务器使用主路由表。 步骤 3:内核学习下一跳 MAC(邻居/ARP) IP 路由选择下一跳。为了实际发送以太网帧,内核需要该跳的 MAC 地址。 如果内核已经知道下一跳(在邻居/ARP 缓存里),那很好。 如果没有,它会发送广播 ARP 请求:“谁拥有 10.0.0.1?告诉我你的 MAC。”收到的回复会被缓存。 [!todo] 试试看 ip neigh show 你会看到像 10.0.0.1 lladdr 00:11:22:33:44:55 REACHABLE 这样的条目。 [!info] ARP vs NDP IPv4 使用 ARP(广播)。IPv6 使用NDP(组播)。原理相同:找到你网络中某个 IP 的链路层地址。 步骤 4:数据包排队等待(qdisc) 在 NIC 发送任何内容之前,数据包会先进入排队规则(queueing discipline,qdisc)。你可以把它看作一个小的等待队伍加上一个交通警察,借助它内核可以: 平滑突发流量,避免链路泛滥和缓冲膨胀(大队列 -> 高延迟), 在不同流之间公平共享带宽, 如果你配置了整形/限速规则,则强制执行这些规则。 [!todo] 试试看 tc qdisc show dev eth0 tc -s qdisc show dev eth0 # same, but with counters/stats 把 eth0 替换成你的实际接口名称(例如 enp3s0,wlp2s0)。 [!info] MTU 与 MSS MTU 是链路能承载的最大 L2 负载(典型以太网为 1500 字节)。 MSS 是单个数据段中最大的 TCP 有效载荷,即扣除 IP + TCP 头部和选项之后的部分。 在 TCP 握手过程中,双方都通告自己可以接收的 MSS,发送方不会发送比对方通告 MSS 更大的段,并且也会遵守路径 MTU(PMTU)。 在 IPv4 常见的无选项情况下,MSS ≈ MTU − 40 字节。选项会进一步降低 MSS。 步骤 5:网卡驱动和 NIC 负责繁重工作 内核的网络驱动将你的数据包交给网卡(NIC),并将其放入一个小的传输队列,网卡从中读取。NIC 随后: 直接从内存(使用 DMA)提取字节,并将其转化为链路上的比特流:铜缆上的微小电压变化、光纤上的光脉冲,或者如果你用 Wi-Fi 时的无线电波。 那才是真正“上线”的时刻:内存中的数据变成了网络上的信号。 [!todo] 试试看 ip -s link show dev eth0 ethtool -S eth0 # NIC stats ethtool -k eth0 # offloads enabled 把 eth0 替换成你实际的接口名称。 [!info] offloads 卸载 TSO/GSO:让网卡或协议栈将大型缓冲区拆分为 MTU 大小的帧。 校验和卸载:发送时,网卡在内核递交数据包之后、发出之前填写 IP/TCP 校验和;接收时,网卡可以验证校验和并告知内核结果。 GRO(接收时):将许多小数据包合并成更大的块以节省 CPU。 [!info] DMA 直接内存访问(DMA)允许网卡通过总线(例如 PCIe)直接读写你在 RAM 中的数据,而 CPU 无需复制字节。这就是网卡能高效地从transmit ring拉取帧(并放置接收帧)的原因。 步骤 6:上线 在以太网上,网卡发送一个帧,内容如下: [ dst MAC | src MAC | EtherType (IPv4) | IP header | TCP header | payload | FCS ] 交换机只关心以太网头部:它们查看目标 MAC 地址,并将帧转发到正确的端口。 路由器会查看 IP 头部,递减 TTL / Hop Limit(IPv4 还要更新头部校验和),然后把数据包转发到下一跳。 每台交换机和路由器逐跳重复此操作,直到某台路由器最终拥有直达目的网络的路由,并把数据包送进服务器的局域网。 [!info] frame vs packet 帧与包 数据包是 IP 级单元(IP 头部 + TCP/UDP + 有效载荷)。 帧是该数据包在特定链路(例如以太网)上的承载形式,带有 src/dst MAC 和校验和。 第二部分 接收:从线路回传到你的应用 步骤 7:网卡将数据传递给内核(NAPI) 在服务器端,网卡将收到的帧写入receive rings(内存中的小队列)。Linux 内核随后使用 NAPI
高效拉取数据包:快速中断后切换为轮询,一次性处理一批数据包。 [!info] NAPI 如果每个数据包都触发一次完整的中断,忙碌的网卡可能会让 CPU 不堪重负。NAPI 的诀窍是: 发起一次中断, 暂时切换到轮询以排空一批数据包, 然后重新启用中断。 中断更少,吞吐量更好。 步骤 8:IP 检查数据包并决定下一步行动 内核会验证 IP 头部(版本、校验和、TTL 等),然后问:“这个包是给我的吗?” 如果目标 IP 与服务器的某个地址匹配,则该包是发给本机的,沿协议栈向上传递。 如果不是,且启用了 IP 转发,内核可能会把它转发出去,表现得就像一台 Linux 路由器。 否则,数据包会被丢弃。 如果你使用防火墙,这时像 PREROUTING 和 INPUT(nftables/iptables)这样的钩子可以过滤、记录或 DNAT 流量,然后数据包才会被送到本地套接字。SNAT/MASQUERADE(源地址伪装)发生在 POSTROUTING 中。对于本地生成的数据包,DNAT 也可能发生在 OUTPUT 钩子中。 [!todo] 试试看 sudo nft list ruleset # or, with iptables: sudo iptables -L -n -v sudo iptables -t nat -L -n -v 步骤 9:TCP 重新组装、确认并唤醒应用 TCP 协议栈会将数据段排序,检查缺失部分,并发送 ACK。当数据准备好时,它会唤醒在 recv() 中等待的进程。 [!todo] 试试看 ss -tni 'sport = :80 or dport = :80' 随着应用读取,接收队列(Recv-Q)会相应地增长和缩小。 简短实用笔记 回环很特别(而且速度快) 发送到 127.0.0.1 的数据包从未到达物理网卡。路由仍然会进行,但所有内容都保留在纯软件的 lo 接口内存中。 桥接与路由(同一台机器,不同角色) 如果这台机器是网桥(例如带有 br0),它会在第 2 层转发帧,不会改变 TTL。如果它在做路由,则在第 3 层转发,TTL 减一。 NAT hairpin(为什么内部客户端能访问外部 IP) 通过路由器的公共 IP 从同一局域网访问服务需要“发夹式 NAT”。如果在这种情况下连接被重置,请检查 PREROUTING 和 POSTROUTING 的 NAT 规则。 IPv6 用 NDP 取代了 ARP。除此之外,路径是相同的: ip -6 route ip -6 neigh UDP 的不同是有意为之 UDP 不做排序、重传或拥塞控制。发送路径使用 udp_sendmsg,接收路径投递完整的数据报。你的应用要自己处理数据丢失。 亲自看看(10个快速指令) # 1) Where would the kernel send a packet? ip route get 192.0.2.10 # 2) What routes and rules exist? ip route; ip rule # 3) Who's my next hop? ip neigh show # 4) What's my firewall/NAT doing? sudo nft list ruleset # or: sudo iptables -L -n -v sudo iptables -t nat -L -n -v # 5) Which sockets are active? ss -tni # 6) What's on the wire (swap eth0/host as needed)? sudo tcpdump -ni eth0 -e -vvv 'host 192.0.2.10 and tcp port 80' # 7) Are my queues healthy? tc -s qdisc show dev eth0 # 8) Is my NIC happy? ip -s link show dev eth0 ethtool -S eth0 # 9) Are counters hinting at a problem? nstat -a | grep -E 'InErrors|OutErrors|InNoRoutes|InOctets|OutOctets' # (Use `-z` instead of `-a` if you explicitly want to zero the counters.) # 10) Is the path MTU safe? 
tracepath 192.0.2.10 # discovers PMTU via ICMP: IPv4 "Fragmentation Needed" (Type 3, Code 4) / IPv6 "Packet Too Big" (Type 2) ARP/邻居问题 ip neigh 显示失败或不断切换状态 -> L2 可达性、VLAN 标记或交换机过滤问题。 MTU / PMTU 黑洞 小 ping 正常,大流量传输卡住 -> MTU 不匹配或 ICMP 被阻。 允许 PMTU 信号通过防火墙(IPv4:ICMP 类型 3 代码 4“需要分段”,IPv6:ICMPv6 类型 2“数据包过大”),或修复 MTU。 反向路径过滤(rp_filter)的坑 非对称路由 + rp_filter=1 会丢弃返回流量。使用 rp_filter=2(松散模式)或让路由对称。 NAT 的惊喜 SNAT/MASQUERADE 错误地重写了源地址,导致回复无法送达。检查 NAT 规则和 conntrack -L。 Backlog/accept pressure 新连接在高负载下被重置 -> 增大应用的 backlog 和 net.core.somaxconn,确保应用能及时 accept。 突发流量造成的缓冲膨胀 如果遇到队列过长和严重延迟尖峰的问题,请选择 fq_codel(或 fq)作为 排队规则 (qdisc),并且如果应用程序支持,请启用 数据包限速 (pacing)。 内核调用路径(如果你感兴趣的话) 发送(典型的 TCP 路径): tcp_sendmsg -> tcp_push_pending_frames -> __tcp_transmit_skb -> ip_queue_xmit -> ip_local_out / ip_output -> ip_finish_output -> neigh_output -> dev_queue_xmit -> qdisc / sch_direct_xmit -> ndo_start_xmit (driver) 接收(典型的 IPv4 TCP 路径): napi_gro_receive / netif_receive_skb -> __netif_receive_skb_core -> ip_rcv -> ip_rcv_finish -> ip_local_deliver -> ip_local_deliver_finish -> tcp_v4_rcv -> tcp_v4_do_rcv -> tcp_data_queue (wake reader) 一份小清单,随时备着 Socket - Your program’s handle for network I/O. MTU / MSS - Max link payload / max TCP payload. ARP / NDP - Find the link-layer address (IPv4 / IPv6). qdisc - Per-device queueing policy (fairness, shaping). NAPI - Efficient receive: interrupt, then poll a batch. TSO/GSO/GRO - Offloads to split/merge packets and save CPU. Conntrack - Kernel’s flow table (used by NAT and filtering). PREROUTING/INPUT/OUTPUT/POSTROUTING - Firewall hook points. DMA (Direct Memory Access) - Hardware reads/writes RAM without CPU copies, NICs use this for TX/RX rings. TTL / Hop Limit - Per‑packet counter decremented by each router (TTL in IPv4, Hop Limit in IPv6). When it hits zero, the packet is dropped. FCS (Frame Check Sequence) - Link‑layer CRC at the end of an Ethernet frame, used to detect bit errors on the wire.

2025/11/1

godotenv 库介绍

godotenv 是一个 Go 语言库,用于从 .env 文件加载环境变量到应用程序中。它是 Ruby dotenv 项目的 Go 移植版本。 背景 该库遵循十二要素应用方法论,将配置与代码分离。核心理念是:任何可能在部署环境之间变化的内容(如数据库资源句柄或外部服务凭证)都应该从代码中提取到环境变量中。 但在开发机器或运行多个项目的持续集成服务器上设置环境变量并不总是实用的。godotenv 在环境启动时从 .env 文件加载变量到 ENV 中。 安装方法 作为库使用 go get github.com/joho/godotenv 作为命令行工具 Go >= 1.17: go install github.com/joho/godotenv/cmd/godotenv@latest 使用方法 基本用法 在项目根目录创建 .env 文件: S3_BUCKET=YOURS3BUCKET SECRET_KEY=YOURSECRETKEYGOESHERE 在 Go 代码中加载: err := godotenv.Load() if err != nil { log.Fatal("Error loading .env file") } s3Bucket := os.Getenv("S3_BUCKET") 自动加载 使用 autoload 包可以在导入时自动读取 .env: import _ "github.com/joho/godotenv/autoload" 加载多个文件 可以指定多个 .env 文件: godotenv.Load("somerandomfile") godotenv.Load("filenumberone.env", "filenumbertwo.env") 核心函数 Load(): 加载变量到系统环境,不覆盖已存在的变量 Overload(): 加载变量到系统环境,会覆盖已存在的变量 Read(): 读取变量到 map 而不是环境变量,会合并多个文件的内容到一个 map 中,后加载的文件会覆盖前面文件中的同名键。 Parse(): 从 io.Reader 解析 Unmarshal(): 从字符串解析 Write(): 将 map 写入文件 Marshal(): 将 map 转换为字符串 命令行模式 godotenv -f /some/path/to/.env some_command with some args 使用 -o 标志可以覆盖现有环境变量。 支持的 .env 文件格式 支持注释和 export 语句: # 注释 SOME_VAR=someval FOO=BAR # 行尾注释 export BAR=BAZ 也支持 YAML 风格: FOO: bar BAR: baz 重要注意事项 优先级规则 已存在的环境变量优先于后加载的变量。这意味着: Load() 不会覆盖已存在的环境变量 Overload() 会覆盖已存在的环境变量 多环境管理 推荐的多环境管理方式(开发、测试、生产): env := os.Getenv("FOO_ENV") if "" == env { env = "development" } godotenv.Load(".env." + env + ".local") if "test" != env { godotenv.Load(".env.local") } godotenv.Load(".env." + env) godotenv.Load() // 原始 .env 功能完整性声明 该库已被声明为功能完整。不再接受添加新功能或破坏库 API 的 issue 或 pull request。 平台支持 Linux 和 Windows 环境都有测试覆盖和 CI,但不保证命令行版本在 Windows 上正常工作。 Notes 该库的实现核心在 godotenv.go 和 parser.go 文件中,其中 parseBytes() 函数负责实际的解析工作。库支持变量替换(如 ${VAR} 或 $VAR)和特殊字符转义。所有代码更改都需要测试和对等 dotenv 实现的参考。
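作为对上面 .env 文件格式的补充,可以用标准库写一个极简的解析示意,只覆盖注释、export 和 KEY=VALUE 三条规则(godotenv 真正的解析在 parser.go 中,还处理引号、转义和变量替换,这里仅为帮助理解格式的草图):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv 对 .env 格式做一个极简解析:跳过空行和注释,
// 去掉 export 前缀,按第一个 = 切分键值,并粗略去掉行尾注释。
func parseEnv(src string) map[string]string {
	env := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // 跳过空行和注释
		}
		line = strings.TrimPrefix(line, "export ") // 支持 export 语句
		if i := strings.Index(line, "="); i > 0 {
			key := strings.TrimSpace(line[:i])
			val := strings.TrimSpace(line[i+1:])
			// 去掉行尾注释(简化处理,不考虑引号内的 #)
			if j := strings.Index(val, " #"); j >= 0 {
				val = strings.TrimSpace(val[:j])
			}
			env[key] = val
		}
	}
	return env
}

func main() {
	src := "# 注释\nSOME_VAR=someval\nFOO=BAR # 行尾注释\nexport BAR=BAZ"
	env := parseEnv(src)
	fmt.Println(env["SOME_VAR"], env["FOO"], env["BAR"])
}
```

真实项目中直接使用 godotenv.Load() 即可,这个草图只是说明“已存在的键不被覆盖”等语义之外,格式本身的解析并不复杂。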

2025/11/1

After 30 Years of Using Linux, I Didn't Know `ping 8.8` Could Do This

Today I came across a trick shared by @sysxplore, reminiscent of the zero-abbreviation shorthand in IPv6 addresses. It's the first time I learned that IPv4 addresses can be written this directly: with fewer than four parts, `inet_aton()` treats the final part as filling all the remaining low-order bytes, so `ping 8.8` actually pings `8.0.0.8`, and `127.1` is `127.0.0.1`.

**Security implications**

1. Blacklist bypass: non-standard notations can slip past naive string-based filters:

```
http://127.1/admin   # may bypass a "127.0.0.1" blacklist
http://2130706433/   # decimal form of 127.0.0.1
```

2. Harder log analysis: the same IP appearing in different formats can complicate log analysis and security auditing.

3. Firewall rule bypass: some firewall rules may fail to recognize non-standard IP address formats.

**Compatibility**

Support varies by system and application:

- ✅ Fully supported: network tools on Linux, macOS, and BSD systems
- ⚠️ Partially supported: Windows (some versions of `ping` accept it, but browsers generally don't)
- ❌ Not supported: most modern web browsers (for security reasons), and some languages' standard libraries

**Takeaway**

Good to know, much like knowing the several ways to write the character 回: almost nobody writes it that way anymore.

References:

- BSD Socket API Documentation
- inet_aton() Manual Pages
- IETF RFC 3986 (URI Generic Syntax)
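The shorthand comes from the classic BSD `inet_aton()` parsing rules, which Python's `socket.inet_aton` exposes, so it's easy to verify from a REPL (a quick sketch; behavior can differ on Windows, where some runtimes don't accept the short forms):

```python
import socket
import struct

# inet_aton accepts the classic BSD shorthand forms, not just dotted quads
assert socket.inet_aton("8.8") == socket.inet_aton("8.0.0.8")      # a.b: b fills the last three bytes
assert socket.inet_aton("127.1") == socket.inet_aton("127.0.0.1")  # same rule

# 2130706433 is just 127.0.0.1 read as one big-endian 32-bit integer
assert struct.unpack("!I", socket.inet_aton("127.0.0.1"))[0] == 2130706433
```

Modern browsers and `net.ParseIP` in Go deliberately reject these forms, which is exactly why they matter for blacklist-bypass audits.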

2025/10/1

Where Is the Money in AI?

Insights from investor @JTLonsdale.

According to the International Energy Agency, data-center electricity consumption is projected to more than double by 2030, to roughly 945 TWh. For context, that is more than most countries: Germany, for example, generated about 431.7 TWh of electricity in 2024.

**Tier 1: Chips.** General-purpose CPUs can't handle AI's demands, so all that parallel processing needs a fundamentally different silicon architecture. As AI models grow more complex, the chips powering them have to keep pace. Some recent chip "arms race" stats:

- Groq raised $750M at a $6.9B valuation to challenge NVIDIA.
- TSMC expanded its US investment to $165B.

The overall AI chip market reached $123B in 2024 and is projected to grow at 33% annually through 2030.

**Tier 3: Foundation model companies.** These companies need enormous upfront capital for training, but they create the capabilities that power everything above them in the stack: OpenAI, Anthropic, and Google's Gemini, for example. This layer looks like the obvious investment because it is always in the headlines. OpenAI reached a $300B valuation, with annualized revenue growing from $3.7B to $12.7B in just a few months; Anthropic's latest valuation also tripled, to $183B. The constraint here is that training costs grow exponentially: GPT-4 cost $100M to train, and future models will exceed $1B.

**Tier 4: Software infrastructure.** Think of this as AI's picks and shovels. Without deployment platforms, vector databases for retrieval, and MLOps tools to manage the whole pipeline, models can't ship. Databricks is valued at over $100B powering enterprise AI workflows. In the last year alone, vector-database startups (Vespa, Weaviate, Pinecone, and Chroma) raised $200M to support LLM infrastructure, and LangChain closed a $125M round at a $1.25B valuation.

**Tier 5: AI-native applications.** This is where we see the most alpha created, because adoption concentrates at this layer. People interact with applications directly and get value from them: they solve problems, deliver services, and produce measurable results. Infrastructure makes all of it possible, but it is the applications that capture user engagement and loyalty.

The big question is which layer delivers the highest ROI. In 2024, 69% of the money went into infrastructure mega-rounds, yet nearly three quarters of all deals were early-stage investments at the application layer. This pattern repeats in every major tech cycle. Take the internet: ISPs built the foundation, but companies like Google and Facebook captured the returns. The same principle is at work again: the infrastructure layers power the revolution, but the application layer captures the economic value.

@JTLonsdale

2025/10/1

The Go sync Package: Two Years of Evolution

Go's sync package is a cornerstone of its concurrency model, providing the key primitives for synchronization and concurrency control. Over the past two years (roughly Go 1.19 through 1.25), sync and its subpackage sync/atomic have gone through a series of significant evolutions: new features, performance optimizations, internal refactoring, and notable improvements to the developer experience.

This post summarizes the main changes to the sync package over that period, based on the Git commit history of the official Go repository.

## 1. New APIs and feature enhancements

To simplify common concurrency patterns, sync gained some long-awaited new APIs.

### sync.WaitGroup.Go

Go 1.25 introduced the WaitGroup.Go method, which greatly simplifies launching goroutines under a WaitGroup.

Old pattern:

```go
wg.Add(1)
go func() {
    defer wg.Done()
    // ... do work ...
}()
```

New pattern:

```go
wg.Go(func() {
    // ... do work ...
})
```

This helper not only cuts boilerplate but, via its built-in deferred wg.Done() call, also avoids the common mistake of forgetting to call Done().

Example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	urls := []string{
		"http://example.com",
		"http://example.org",
		"http://example.net",
	}
	for _, url := range urls {
		// Launch each goroutine with wg.Go
		wg.Go(func() {
			// Simulate fetching the URL
			fmt.Printf("Fetching %s\n", url)
			time.Sleep(100 * time.Millisecond)
			fmt.Printf("Fetched %s\n", url)
		})
	}
	// Wait for all goroutines started via wg.Go
	wg.Wait()
	fmt.Println("All fetches completed.")
}
```

### sync.Map.Clear and sync.Map.Swap

sync.Map also gained some practical new methods:

- Clear(): removes all key-value pairs in one call, a standard and efficient way to empty a Map (Go 1.23.0).
- Swap(): atomically swaps a key's old and new values (Go 1.20).
- CompareAndSwap() / CompareAndDelete(): finer-grained atomic operations that let callers swap or delete conditionally on the old value (Go 1.20).

Example:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map

	// Store some key-value pairs
	m.Store("config", "v1")
	m.Store("feature_flag", "off")

	// Swap: atomically update "config" to "v2" and return the old value
	oldValue, loaded := m.Swap("config", "v2")
	if loaded {
		fmt.Printf("Swapped 'config'. Old value was: %s\n", oldValue)
	}

	// Print the current value
	currentValue, _ := m.Load("config")
	fmt.Printf("Current value of 'config' is: %s\n", currentValue)

	// Clear: empty the whole Map
	fmt.Println("\nClearing the map...")
	m.Clear()

	// Verify the Map is empty
	m.Range(func(key, value interface{}) bool {
		fmt.Println("This should not be printed.")
		return true
	})
	fmt.Println("Map is empty.")
}
```

### The sync.Once family

OnceFunc, OnceValue, and OnceValues, introduced in Go 1.21.0, were optimized in later releases, for example with fewer heap allocations, making them more efficient for lazy initialization and caching computed results.

- func OnceFunc(f func()) func(): wraps a function with no arguments and no return value so that it runs only once.
- func OnceValue[T any](f func() T) func() T: wraps a function with a single return value so that it runs only once; subsequent calls return the cached result.
- func OnceValues[T1, T2 any](f func() (T1, T2)) func() (T1, T2): the same, for functions with two return values.

Example (OnceValue):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	once := sync.OnceValue(func() int {
		sum := 0
		for i := 0; i < 1000; i++ {
			sum += i
		}
		fmt.Println("Computed once:", sum)
		return sum
	})

	done := make(chan bool)
	for i := 0; i < 10; i++ {
		go func() {
			const want = 499500
			got := once()
			if got != want {
				fmt.Println("want", want, "got", got)
			}
			done <- true
		}()
	}
	for i := 0; i < 10; i++ {
		<-done
	}
}
```

## 2. Performance optimizations and internal refactoring

The performance of sync is critical, and recent releases improved it on several fronts.

### A new sync.Map implementation

sync.Map's internals went through a major change: Go 1.24.0 added a new design based on a concurrent hash-trie map (HashTrieMap). Per https://github.com/golang/go/issues/70683, the old implementation is expected to be removed around Go 1.26.0, at which point sync.Map will use the HashTrieMap implementation by default. This work reflects the community's ongoing effort to improve sync.Map's performance across different concurrency scenarios.

### sync.Once atomic-operation cleanup

The Once.done field switched from atomic.Uint32 to atomic.Bool (Go 1.25). Beyond readability and type safety, this reflects the broader trend of replacing older patterns with the modern atomic types.

## 3. Correctness and developer experience

To help developers write more robust concurrent code, the sync package also saw substantial work on documentation and static analysis.

### The noCopy sentinel

Core types such as Mutex, RWMutex, WaitGroup, Cond, and Map now embed a noCopy field: a special unexported field recognized by go vet. When a developer accidentally copies one of these primitives (which carry internal state), go vet emits a warning, catching potential concurrency bugs before they compile into production.

Example (incorrect usage):

```go
package main

import "sync"

// Counter is a counter guarded by a lock
type Counter struct {
	sync.Mutex
	count int
}

// Inc increments the counter.
// Note the value receiver, which copies the Mutex.
func (c Counter) Inc() {
	c.Lock()
	c.count++
	c.Unlock()
}

func main() {
	var c Counter
	c.Inc()
}

// Running `go vet .` reports:
// main.go:12:6: call of method Inc copies lock value: main.Counter contains sync.Mutex
```

### Continued documentation improvements

A large number of commits went into improving and clarifying the documentation, including:

- Explicitly stating that RWMutex read locks and write locks cannot be upgraded or downgraded into each other.
- Describing in detail the behavior and memory-model guarantees of Map's Range, Delete, and other methods under concurrent access.
- Clearer explanations of Cond.Wait's behavior.
- Linking directly to the Go memory model from the package documentation, underlining its importance.

## 4. Modernizing sync/atomic

As the foundation beneath sync, the sync/atomic package gained the type-safe atomic types (atomic.Int64, the generic atomic.Pointer[T], atomic.Bool, and so on) in Go 1.19. The changes of the past two years mainly show up as:

- Encouraging the new types: inside the sync package itself, the older function-style atomics (such as atomic.LoadUint32) are gradually being replaced by the typed methods (such as myAtomicBool.Load()).
- API completion: new atomic bitwise operations (And/Or) were added, and documentation was expanded to guide users from the old API to the new one.

## Summary

Over the past two years, Go's sync package has evolved steadily toward being easier to use, safer, and faster while keeping its API stable. Convenience helpers like WaitGroup.Go let developers write more concise concurrent code; the noCopy sentinel and ever-improving documentation make that code more robust; and the underlying performance work and refactoring keep Go's concurrency primitives up to growing performance demands. Together, these changes reinforce Go's standing as a modern concurrent language.

2025/10/1

The Lightning Indexer in deepseek-v3.2-exp

We can think of the Lightning Indexer in DeepSeek Sparse Attention (DSA) as a "memory-screening specialist" whose job is to read and retrieve information from the full text of Dream of the Red Chamber (《红楼梦》).

The novel is enormous. If we want a language model (say, DeepSeek-V3.2-Exp) to remember every detail in the book and instantly recall all related information whenever it reads a sentence, efficiency becomes a serious problem.

## 1. The core challenge: the $O(L^2)$ full-text dilemma

Suppose the whole book has $L$ tokens (think of each character or word as a token). When the model reads the $L$-th character, if it must compute the influence of every one of the preceding $L-1$ characters on that character, the total work is $L \times L$, i.e. $O(L^2)$. For a context as long as 128K, an $L^2$ amount of computation is unaffordable.

The Lightning Indexer is the key tool for solving this efficiency problem.

## 2. What the Lightning Indexer does and how it works

The DSA prototype consists of two parts: the lightning indexer and a fine-grained token selection mechanism.

### A. Scoring relevance: the index score

The Lightning Indexer plays the screening specialist. When the model reads the current query token ($h_t$), say "Baoyu weeps", it must quickly judge which of all the preceding tokens ($h_s$) are highly relevant (say, "Daiyu is beyond saving").

It makes that judgment by computing an index score $I_{t,s}$, using an efficient formula (Equation 1):

$$I_{t,s} = \sum_{j=1}^{H^I} w_{t,j}^I \cdot \text{ReLU}(q_{t,j}^I \cdot k_s^I) \quad \text{(1)}$$

- $H^I$ (indexer heads): the number of indexer heads. In a multi-head attention structure, heads let the model capture information from different representation subspaces; here, the indexer uses several independent "viewpoints" (computation channels) to score relevance, and the summation $\sum_{j=1}^{H^I}$ accumulates the scores from all heads. One reason the indexer is so efficient is precisely that it has only a small number of heads.
- $h_t$ (query token): the current token ("Baoyu weeps") produces $q_{t,j}^I$ and $w_{t,j}^I$.
- $h_s$ (preceding token): the earlier token ("Daiyu is beyond saving") produces $k_s^I$.
- ReLU: the activation function applied to the dot product $q_{t,j}^I \cdot k_s^I$ in Equation (1). ReLU was chosen for throughput reasons: compared with more complex activations, it is fast to compute, which helps overall efficiency.
- Efficiency: in theory the indexer still visits every preceding token, but because it is so lightweight (few heads, implementable in FP8), its computational efficiency is remarkable, far cheaper than the MLA used in DeepSeek-V3.1-Terminus.

### B. Screening memory: top-k selection

Once the Lightning Indexer has computed the index scores $\{I_{t,s}\}$ of all preceding tokens for "Baoyu weeps", the fine-grained token selection mechanism kicks in:

- It retrieves only the key-value entries $\{c_s\}$ corresponding to the top-k index scores.
- Suppose the whole book has 100K tokens, but top-k keeps only 2048 key-value tokens.
- The model's final attention output $u_t$ is then computed over only those 2048 sparsely selected key-value entries $\{c_s\}$.

This mechanism reduces the main model's core attention complexity from $O(L^2)$ to $O(Lk)$, where $k$ (2048) is far smaller than $L$ (128K).

The formula alone is still hard to grasp in detail, so let's use fp8_index_kernel, the Lightning Indexer implementation in the deepseek-v3.2-exp code base, to walk through it.

## Mapping the formula to the code

### Symbol mapping

The symbols in the formula correspond to the code as follows:

$I_{t,s}$: the output tensor `o[i_b, i_m, i1_n * blk_n1 + i2_n * blk_n2]`, the index score of the query at position $t$ against the key at position $s$:

```python
T.copy(logits_sum, o[i_b, i_m, i1_n * blk_n1 + i2_n * blk_n2])
```

$H^I$: the parameter `h`, the number of attention heads:

```python
def fp8_index_kernel(h: int, d: int):
```

$j$: the loop variable `i_h`, which iterates over all heads:

```python
T.gemm(
    k_smem,
    q_smem,
    logits,
    transpose_A=False,
    transpose_B=True,
    clear_accum=True,
)
for i_h, i3_n in T.Parallel(h, blk_n2):
    logits[i3_n, i_h] = T.max(logits[i3_n, i_h], 0) * q_s_frag[i_h]
```

$w_{t,j}^I$: the query scale factor `q_s_frag[i_h]`, i.e. `q_s[i_b, i_m, i_h]`:

```python
q_s_frag = T.alloc_fragment(h, FP32)
T.copy(q_s[i_b, i_m, 0], q_s_frag)
```

$q_{t,j}^I$: the query tensor `q_smem`, i.e. head `i_h` of `q[i_b, i_m, :, :]`:

```python
q_smem = T.alloc_shared((h, d), FP8)
T.copy(q[i_b, i_m, 0, 0], q_smem)
```

$k_s^I$: the key tensor `k_smem`, i.e. `k[i_b, s, :]`:

```python
k_smem = T.alloc_shared((blk_n2, d), FP8)
T.copy(k[i_b, i1_n * blk_n1 + i2_n * blk_n2, 0], k_smem)
```

### Step-by-step correspondence

The formula is computed in the code in the following steps:

1. Dot product $q_{t,j}^I \cdot k_s^I$: computed by the matrix multiplication `T.gemm(k_smem, q_smem, logits, ...)`; the result in `logits[i3_n, i_h]` is the dot product of query head $j$ with key position $s$.
2. ReLU activation: `T.max(logits[i3_n, i_h], 0)` implements $\text{ReLU}(q_{t,j}^I \cdot k_s^I)$.
3. Multiply by the weight $w_{t,j}^I$: on the same line, the ReLU result is multiplied by the query scale factor `q_s_frag[i_h]`: `logits[i3_n, i_h] = T.max(logits[i3_n, i_h], 0) * q_s_frag[i_h]`.
4. Sum over heads $\sum_{j=1}^{H^I}$: `T.reduce_sum(logits, logits_sum, dim=1)` sums over the head dimension (dim=1), accumulating all heads.
5. Key scaling: finally, multiply by the key scale factor `k_s_frag[i3_n]`: `logits_sum[i3_n] *= k_s_frag[i3_n]`.

### The full computation

The complete formula the code actually computes is:

$$I_{t,s} = k_s[s] \cdot \sum_{j=1}^{H^I} q_s[t,j] \cdot \text{ReLU}(q[t,j] \cdot k[s])$$

This matches Equation (1), with $w_{t,j}^I = q_s[t,j]$ acting as the query's scale factor, while the key's scale factor $k_s[s]$ is applied once after the sum.

As a footnote: ReLU is one of the most popular activation functions in deep learning. Its definition is $f(x) = \max(0, x)$; in plain terms, it outputs $x$ when $x > 0$ and $0$ when $x \le 0$.
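To make the arithmetic of Equation (1) plus top-k selection concrete, here is a small NumPy sketch. It is an illustration only: the sizes and names (`L`, `H_I`, `d`, `k`) are mine, not the kernel's, and it ignores FP8, blocking, and the per-key scale factor.

```python
import numpy as np

rng = np.random.default_rng(0)

L, H_I, d, k = 64, 4, 16, 8  # illustrative sizes: L tokens, H_I indexer heads, dim d, top-k

q = rng.standard_normal((H_I, d))  # q_{t,j}^I for the current token t
w = rng.standard_normal(H_I)       # w_{t,j}^I, per-head weight for token t
K = rng.standard_normal((L, d))    # k_s^I for every preceding token s

# Equation (1): I_{t,s} = sum_j w_{t,j} * ReLU(q_{t,j} . k_s)
dots = K @ q.T                      # (L, H_I): q_{t,j} . k_s for every position s and head j
scores = np.maximum(dots, 0.0) @ w  # ReLU, then weighted sum over heads -> (L,)

# Fine-grained token selection: keep only the k highest-scoring positions
top_idx = np.argpartition(scores, -k)[-k:]

# Full attention would touch all L keys; sparse attention touches only these k
assert top_idx.shape == (k,)
assert scores[top_idx].min() >= np.sort(scores)[-k]
```

The dense scoring pass is still $O(L \cdot H^I \cdot d)$ per query, but because $H^I$ and $d$ are tiny compared to the main model's heads, it is cheap; the expensive attention itself then runs over only $k$ entries, giving the $O(Lk)$ total the paper describes.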

2025/10/1