Apache SkyWalking – AI

Zh: AI Coding 如何重塑软件架构师的工作方式

Sun, 15 Mar 2026 00:00:00 +0000

以 SkyWalking GraalVM Distro 为例，看 AI Coding 如何把一批探索性 PoC 打磨成一条可重复的迁移流水线。

这个项目给我最大的启发，不是 AI 能写多少代码，而是 AI Coding 改变了架构设计的试错成本。当一个想法可以很快做成 PoC、跑起来验证、不行就推翻重来时，架构师就更有机会逼近自己真正想要的设计，而不是过早停在“团队现在做得出来”的折中方案上。

这种变化在成熟开源系统里尤其重要。Apache SkyWalking OAP 长期以来一直是一个功能强大且经过生产验证的可观测性后端，但大型 Java 平台该有的问题它一个不少：运行时字节码生成、重反射初始化、classpath 扫描、基于 SPI 的模块装配，以及动态 DSL 执行——这些机制方便扩展，但做 GraalVM Native Image 时全是障碍。

SkyWalking GraalVM Distro 的出现，源于我们把这个挑战当成一个架构设计问题来处理，而不是一次性的移植工程。目标不仅是让 OAP 能以原生二进制运行，更是把 GraalVM 迁移本身做成一条可重复执行、能够持续跟上上游演进的自动化流水线。

如果你想看完整的技术设计、基准数据和上手方式，请阅读配套文章：SkyWalking GraalVM Distro：设计与基准测试。

从停滞的想法到可运行的系统

这件事其实很多年前就开始了。在这个仓库创建不久之后，yswdqz 曾花了数个月探索迁移方案。真正做下来才发现，这个项目远比 GraalVM 文档里列出的那些单点限制复杂得多，这项工作最终也因此搁置了很多年。

这段停滞很重要。缺少的并不是想法。成熟维护者通常从来不缺想法，真正稀缺的，是把这些想法真正做出来的时间、人力和精力。即使架构师已经看到了几条很有前景的路线，有限的开发资源也会迫使大家更早做出权衡：优先选择实现成本最低的方案，而不是那个更干净、更可复用、更经得起未来变化的方案。

这种情况非常普遍，并不特殊。在开源社区里，很多工作依赖志愿者或有限的企业赞助；在商业产品里，约束的形式不同，但本质仍然一样：路线图承诺、团队规模和交付压力都会让工程资源始终紧张。在这两种环境里，很多好想法被放弃，并不是因为它们错了，而是因为要把它们真正验证清楚、实现完整，成本太高。

还有一个同样重要的约束：架构师通常同时也是非常资深的工程师，而不是一个可以全职扑在实现细节上的人。问题在于个人编码精力有限、时间高度碎片化，同时还要在代码尚未出现之前，不断向其他资深工程师解释自己的设计意图。传统上，这种解释主要通过图、文档和沟通完成。它很慢、信息损失大，而且充满不确定性。我们都体验过“传话游戏”：哪怕是很简单的意思，也很容易被误解，而等误解真正暴露出来时，时间已经过去很多了。

到了 2025 年末，AI Coding 让”同时尝试多条路线”这件事终于变得现实。我们不必再因为实现能力稀缺而过早接受折中，而是可以在多个设计之间来回切换，用代码验证，快速淘汰弱方案，持续迭代，直到架构本身变得足够稳固、足够实用、足够高效。

这种设计自由度至关重要。GraalVM 文档对单个限制讲得很清楚，但成熟 OSS 平台遇到的是一整套彼此牵连的系统性问题。只修补一个动态机制远远不够。要让 native image 真正落地，我们必须把整类运行时行为改造成构建期产物和自动生成的元数据。

在这条路的早期历史中，还有一座非常具体的大山。那时上游 SkyWalking 仍然大量依赖 Groovy 来处理 LAL、MAL 和 Hierarchy 脚本。理论上，这只不过是另一个“不支持运行时动态行为”的例子；但在实践中，Groovy 是整条路径上最大的障碍。它不仅意味着脚本执行，还意味着一整套在 JVM 里极其便利、在 native image 里却极其不友好的动态模型。

为了跨过这道坎，我们围绕 AOT-first 模式重新设计了 OAP 的核心引擎。早期实验必须直接面对 Groovy 时代的运行时行为，并尝试不同的脚本编译方案来绕过去。最终方案走得更远：对齐上游编译器流水线，把动态生成前移到构建期，并引入自动化机制，让这条迁移路径在上游持续演进时依然保持可控。具体来说，就是把 OAL、MAL、LAL 和 Hierarchy 的生成过程变成构建期预编译器的输出，而不是继续保留为启动期的动态行为。

AI Coding 如何改写架构迭代

这次转变的关键，并不只是“写代码更快了”。AI 真正改变的，是想法、原型、验证和重设计之间来回迭代的速度。围绕同一个问题，我们可以很快做出几个可运行的 PoC，迅速淘汰不成立的方向，再把值得保留的抽象慢慢沉淀成一套连贯的迁移系统。

这并不会削弱人的架构价值，反而会放大它。哪些行为应该前移到构建期，哪些地方应该保留可配置性，哪里应该引入 same-FQCN 替换，如何让上游同步保持可控，以及哪些抽象值得不惜代价保留下来，这些判断仍然只能由人来做。不同的是，AI 的速度让我们终于有机会把这些更好的设计真正做出来，而不是过早退回到更简单、也更差的折中方案。

这才是软件架构师工作方式真正发生变化的地方。过去，架构师往往已经知道更干净的方向在哪里，但有限的工程产能会逼着那个愿景退回到一个更便宜的妥协方案。现在，架构师在某种意义上又重新变回了“能快速动手的人”：可以直接用代码把思路搭出来，把高层抽象落成接口，再用真实运行的实现去证明设计。

这不仅改变了实现，也改变了沟通方式。在开源里，我们常说：talk is cheap, show me the code。在 AI Coding 时代，“把代码拿出来”这件事变得容易多了。设计不再那么依赖一个缓慢的、自上而下的翻译过程：从想法到文档，再到解释，再到实现。代码可以更早出现，也可以更早跑起来。

这也让其他资深工程师受益。他们不必只靠图、会议或长篇解释来还原整个设计，而是可以直接审查抽象、阅读真实代码、运行它、质疑它，并在具体实现上一起打磨。这让架构协作更快、更清晰，也少了很多沟通误差。

也正因为如此，我总觉得今天很多 AI 讨论有点跑偏。很多项目确实很有趣、也很好玩，拿来体验当然没问题，但高级工程工作并不会因为“给代码库接了个 agent”就自然变好。真正重要的，不是哪个 demo 看起来最炫，而是哪些工程能力真的被放大了，同时软件开发本身的纪律有没有被保留下来。

对于架构师和资深工程师来说，这里真正重要的能力包括：

快速做对比式原型验证：不是只用 slides 和文档去论证某个想法，而是直接把多个方案做成可运行代码来比较。
大规模代码理解能力：能在大量模块之间快速阅读，同时保持对整个系统的全局认识。
系统性的重构能力：把基于反射、依赖运行时动态行为的路径，系统性地改造成适配 AOT 约束的设计。
搭建自动化的能力：当一个迁移步骤在每次上游同步时都必须重做一次，靠手工处理本身就很费时费力，而且越往后只会越累。AI 让我们真正有条件去投资生成器、清单、一致性检查和漂移检测，把重复的人力劳动变成可重复的自动化流程。
大范围审查能力：在很大的代码面上检查边界条件、兼容性约束，以及方案是否经得起反复执行。

这些能力也都体现在最终的设计结果里。same-FQCN 替换为 GraalVM 特定行为建立了清晰、受控的边界；反射元数据不再依赖手工维护的猜测清单，而是直接从构建产物中生成；各种清单机制和漂移检测，则把原本模糊的“上游同步风险”变成了显式的工程工作流。

对于初级工程师，我觉得这里的启发同样重要。AI 不会让架构设计、系统约束、接口设计、测试和可维护性这些基本功变得不重要。恰恰相反，这些能力只会变得更重要，因为它们决定了“被加速的实现”最终产出的是一个可持续演进的系统，还是只是更快地制造出更多代码。真正的杠杆来自工程判断力，而不是新鲜感。

Claude Code 和 Gemini AI 在整个过程中都扮演了工程加速器的角色。在 GraalVM Distro 这个项目里，它们具体帮我们做了几件事：

把迁移思路直接做成可运行代码：不是争论哪个方向可能行得通，而是把多个真实原型做出来、跑起来、比较掉，把不成立的方向淘汰掉。
重构重反射、重动态的代码路径：把不适合运行时的模式系统性替换成 AOT 友好的实现方式。
让上游同步真正可持续：每次 distro 从上游 SkyWalking 拉取变更后，元数据扫描、配置再生成和重新编译都必须再来一次。AI 帮助我们把这些过程做成流水线，使每次同步都变成一个可控、且大部分自动化的过程，而不是一次比一次更长的手工重复劳动。
在大范围内审查逻辑和边界情况：特别是在功能对等性比纯实现速度更重要的地方。

最终产出的，不只是一次大重写，而是一套可重复的系统：预编译器、manifest 驱动的加载、反射配置生成、替换边界，以及让上游迁移可审查、可自动化的漂移检测机制。

如果你想看这种开发方法背后的更广泛背景，可以读这篇文章：在成熟开源大型项目中实践 Agentic Vibe Coding：软件工程与工程控制论还在延续。这篇文章则是这个故事的下一步：不仅是在一个成熟代码库里增强功能，而是重新激活一项曾经停滞的工作，并把它真正做成可运行系统。

真正改变的到底是什么

这个项目最重要的结果，并不是一张 benchmark 表。基准数据当然属于 distro 本身，而且它们很重要，因为它们证明这套系统是真实可运行的。但对这篇文章来说，更深层的变化发生在方法论层面：AI Coding 改变了我们探索、验证和打磨架构方案的方式。

过去，架构往往更像一项以文档为主、后面拖着漫长而昂贵实现过程的活动。现在，我们可以更快地在想法、原型、比较和重设计之间切换。这让我们真正有机会去追求更高抽象层次的方案，保留更干净的边界，并建设那些让迁移过程可持续维护的自动化机制。

这项工作的技术证据，就是 SkyWalking GraalVM Distro 本身：它不仅是一个可运行的系统，更是一条由预编译器、自动生成的反射元数据、受控替换边界和漂移检查组成的迁移流水线。基准数据之所以重要，是因为它们证明这套系统在实践里是成立的；但从架构角度看，真正的结果是：这次迁移不再是一场一次性的移植，而是变成了一套可重复执行的系统工程。关于完整测试方法、原始数据和技术设计，请阅读配套文章：SkyWalking GraalVM Distro：设计与基准测试。

项目仓库位于 apache/skywalking-graalvm-distro。我们欢迎社区成员测试这个新发行版、提交 issue，并帮助它逐步走向生产可用。

对我来说，更深层的启发并不止于这个发行版。AI Coding 不会让架构变得不重要，反而会让架构更值得被认真追求。当实现速度提升到一定程度时，我们终于有机会在真实代码里验证更多想法，保留那些真正好的抽象，并把那些过去常常因为投入太大而半途妥协的系统真正做出来。

对于资深工程师来说，瓶颈正在从单纯的代码实现速度，转向品味、系统判断力，以及定义稳定边界的能力。对于初级工程师来说，真正该走的路不是追逐每一种看上去都很刺激的 AI 工作流，而是把基础能力练得更扎实，让加速真正产生复利：理解需求、阅读陌生系统、质疑假设，并识别出在系统快速变化时仍然必须保持正确的那些部分。AI Coding 降低了验证好设计的代价，但并没有降低工程判断本身的门槛。

Blog: How AI Changed the Economics of Architecture

Fri, 13 Mar 2026 00:00:00 +0000

SkyWalking GraalVM Distro: A case study in turning runnable PoCs into a repeatable migration pipeline.

The most important lesson from this project is not that AI can generate a large amount of code. It is that AI changes the economics of architecture. When runnable PoCs become cheap to build, compare, discard, and rebuild, architects can push further toward the design they actually want instead of stopping early at a compromise they can afford to implement.

That shift matters a lot in mature open source systems. Apache SkyWalking OAP has long been a powerful and production-proven observability backend, but it also carries all the realities of a large Java platform: runtime bytecode generation, reflection-heavy initialization, classpath scanning, SPI-based module wiring, and dynamic DSL execution that are friendly to extensibility but hostile to GraalVM native image.

SkyWalking GraalVM Distro is the result of treating that challenge as a design-system problem instead of a one-off porting exercise. The goal was not only to make OAP run as a native binary, but to turn GraalVM migration itself into a repeatable automation pipeline that can stay aligned with upstream evolution.

For the full technical design, benchmark data, and getting-started guide, see the companion post: SkyWalking GraalVM Distro: Design and Benchmarks.

From Paused Idea to Runnable System

This journey actually began years ago. Shortly after this repository was created, yswdqz spent several months exploring the transition. The project proved much harder in practice than the individual GraalVM limitations sounded on paper, and the work eventually paused for years.

That pause is important. The missing ingredient was not ideas. Mature maintainers usually have more ideas than time. The real constraint was implementation economics. Even when the architect can see several promising directions, limited developer resources force an earlier trade-off: choose the path that is cheapest to implement, not necessarily the path that is cleanest, most reusable, or most future-proof.

This is a very common reality, not an exceptional one. In open source communities, much of the work depends on volunteers or limited company sponsorship. In commercial products, the pressure is different but the constraint is still real: roadmap commitments, staffing limits, and delivery deadlines keep engineering resources tight. In both worlds, good ideas are often abandoned not because they are wrong, but because they are too expensive to validate and implement thoroughly.

There is another constraint that matters just as much: the architect is usually also a very senior engineer, not a full-time implementation machine. That means limited personal coding energy, fragmented time, and a constant need to explain ideas to other senior engineers before the code exists. Traditionally, that explanation happens through diagrams, documents, and conversations. It is slow, lossy, and unpredictable. We all know some version of the Telephone Game: even simple words are easy to misunderstand, and by the time the misunderstanding becomes visible, a lot of time has already passed.

What changed in late 2025 was that AI engineering made multiple runnable ideas affordable. Instead of picking an early compromise because implementation capacity was scarce, we could switch repeatedly between designs, validate them with code, discard weak directions quickly, and keep iterating until the architecture became solid, practical, and efficient enough to hold.

That design freedom was critical. GraalVM documentation gives clear guidance on isolated limitations, but a mature OSS platform hits them as a connected system. Fixing only one dynamic mechanism is not enough. To make native image practical, we had to turn whole categories of runtime behavior into build-time artifacts and automated metadata generation.

There was also a very concrete mountain in front of us in the early history of this distro. In the first several commits of the repository, upstream SkyWalking still relied heavily on Groovy for LAL, MAL, and Hierarchy scripts. In theory, that was just one more unsupported runtime-heavy component. In practice, Groovy was the biggest obstacle in the whole path. It represented not only script execution, but a whole dynamic model that was deeply convenient on the JVM side and deeply unfriendly to native image.

To bridge the gap, we re-architected the core engines of OAP around an AOT-first model. Earlier experiments had to confront Groovy-era runtime behavior directly and explore alternative script-compilation approaches to get around it. The finalized direction went further: align with the upstream compiler pipeline, move dynamic generation to build time, and add automation so the migration stays controllable as upstream keeps moving. Concretely, that meant turning OAL, MAL, LAL, and Hierarchy generation into build-time precompiler outputs instead of leaving them as startup-time dynamic behavior.

AI Speed Changed the Design Loop

The scale of this transformation was not only about coding faster. AI changed the loop between idea, prototype, validation, and redesign. We could build runnable PoCs for different approaches, throw away weak ones quickly, and preserve the promising abstractions until they formed a coherent migration system.

That does not reduce the role of human architecture. It raises the value of it. Human judgment was still required to decide what should become build-time, what should stay configurable, where to introduce same-FQCN replacements, how to keep upstream sync controllable, and which abstractions were worth preserving. But AI speed made it realistic to pursue those better designs instead of settling for a simpler compromise too early.

This is the real change in the economics of architecture. In the past, an architect might already know the cleaner direction, but limited engineering capacity often forced that vision back toward a cheaper compromise. Now the architect can return much closer to being a fast developer again: building code, shaping high-abstraction interfaces, and using design patterns to prove the vision directly in the real world.

That changes communication as much as implementation. In open source, we often say, talk is cheap, show me the code. With AI engineering, showing the code becomes much more straightforward. The design no longer depends so heavily on a slow top-down translation from idea to documents to interpretation to implementation. The code can appear earlier, and it can run earlier.

Other senior engineers benefit from this too. They do not need to reconstruct the whole design only from diagrams, meetings, or long explanations. They can review the actual abstraction, see the behavior in code, run it, challenge it, and refine it from something concrete. That makes architectural collaboration faster, clearer, and less lossy.

This is also where I think the current AI discussion is often noisy. Many projects are fun, surprising, and worth exploring, but advanced engineering work is not improved merely by attaching an agent to a codebase. The important question is not which demo looks most magical. The important question is which engineering capabilities are actually being accelerated without losing the discipline of software development itself.

For architects and senior engineers, the capabilities that mattered most here were:

Fast comparative prototyping: Building several runnable approaches in code instead of defending one idea with slides and documents.
Large-scale code comprehension: Reading across many modules quickly enough to keep the whole system in view.
Systematic refactoring: Converting reflection-heavy or runtime-dynamic paths into designs that fit AOT constraints.
Automation construction: When a migration step must be repeated every upstream sync, doing it manually once is already expensive. Doing it manually again next time is even more expensive. AI made it practical to invest in generators, inventories, consistency checks, and drift detectors that turn repeated manual work into repeatable automation.
Review at breadth: Checking edge cases, compatibility boundaries, and repeatability across a large surface area.

Those capabilities were visible in the resulting design. Same-FQCN replacements created a controlled boundary for GraalVM-specific behavior. Reflection metadata was generated from build outputs instead of maintained as a hand-written guess list. Inventories and drift detectors turned upstream sync from a vague maintenance risk into an explicit engineering workflow.

For junior engineers, I think the lesson is equally important. AI does not remove the need to learn architecture, invariants, interfaces, testing, or maintenance. It makes those skills more valuable, because they determine whether accelerated implementation produces a durable system or just more code faster. The leverage comes from engineering judgment, not from novelty.

Claude Code and Gemini AI acted as engineering accelerators throughout this process. In the GraalVM Distro specifically, they helped us:

Explore migration strategies as running code: Instead of debating which approach might work, we built and compared multiple real prototypes, discarded the weak ones, and kept what held up.
Refactor reflection-heavy and dynamic code paths: Replace runtime-hostile patterns with AOT-friendly alternatives across the codebase.
Make upstream sync sustainable: Every time the distro pulls from upstream SkyWalking, metadata scanning, config regeneration, and recompilation must happen again. AI helped build the pipeline so that each sync is a controlled, largely automated process rather than a fresh manual effort that grows longer each time.
Review logic and edge cases at scale: Especially in places where feature parity mattered more than raw implementation speed.

The result was not just a large rewrite. It was a repeatable system: precompilers, manifest-driven loading, reflection-config generation, replacement boundaries, and drift detectors that make upstream migration reviewable and automatable.

For the broader methodology behind this style of development, see Agentic Vibe Coding in a Mature OSS Project. This post is the next step in that story: not only enhancing an active mature codebase, but reviving a paused effort and making it actually runnable.

What Actually Changed

The most important outcome of this project is not a benchmark table. The benchmark results belong to the distro itself, and they matter because they prove the system is real. But for this post, the deeper result is methodological: AI engineering changed how architecture could be explored, validated, and refined.

Instead of treating architecture as a mostly document-driven activity followed by a long and expensive implementation phase, we were able to move much faster between idea, prototype, comparison, and redesign. That made it realistic to pursue higher-abstraction solutions, preserve cleaner boundaries, and build the automation needed to keep the migration maintainable over time.

The technical evidence for that work is the SkyWalking GraalVM Distro itself: not only a runnable system, but a migration pipeline expressed as precompilers, generated reflection metadata, controlled replacement boundaries, and drift checks. The benchmark data matter because they prove the system works in practice, but the architectural result is that the migration became a repeatable system rather than a one-time port. For detailed benchmark methodology, per-pod data, and the full technical design, see SkyWalking GraalVM Distro: Design and Benchmarks.

The project is hosted at apache/skywalking-graalvm-distro. We invite the community to test it, report issues, and help move it toward production readiness.

For me, the deeper takeaway is broader than this distro. AI engineering does not make architecture less important. It makes architecture more worth pursuing. When implementation speed rises enough, we can afford to test more ideas in code, keep the good abstractions, and build systems that would previously have been judged too expensive to finish well.

For senior engineers, that means the bottleneck shifts away from raw typing speed and toward taste, system judgment, and the ability to define stable boundaries. For junior engineers, it means the path forward is not to chase every exciting AI workflow, but to become stronger at the fundamentals that let acceleration compound: understanding requirements, reading unfamiliar systems, questioning assumptions, and recognizing what must remain correct as everything around it changes. AI changed the economics of architecture because it lowered the cost of validating better designs without lowering the bar for engineering judgment.

Blog: Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn't

Sun, 08 Mar 2026 00:00:00 +0000

Most “vibe coding” stories start with a greenfield project. This one doesn’t.

Apache SkyWalking is a 9-year-old observability platform with hundreds of production deployments, a complex DSL stack, and an external API surface that users have built dashboards, alerting rules, and automation scripts against. When I decided to replace the core scripting engine — purging the Groovy runtime from four DSL compilers — the constraint wasn’t “can AI write the code?” It was: “can AI write the code without breaking anything for existing users?”

The answer turned out to be yes — ~77,000 lines changed across 10 major PRs in about 5 weeks — but only because the AI was tightly guided by a human who understood the project’s architecture, its compatibility contracts, and its users. This post is about the methodology: what worked, what didn’t, and what mature open-source maintainers should know before handing their codebase to AI agents.

The Project in Brief

The task was to replace SkyWalking’s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, matching the architecture already proven by the OAL compiler. The internal tech stack was completely overhauled; the external interface had to remain identical.

Beyond the compiler rewrites, the scope included a new queue infrastructure (threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E test modernization. By conventional estimates, this was 5-8 months of senior engineer work.

For the full technical details on the compiler architecture, see the Groovy elimination discussion.

What is Agentic Vibe Coding?

“Vibe coding” — a term coined by Andrej Karpathy — describes a style of programming where you describe intent and let AI write the code. It’s powerful for prototyping, but on its own, it’s risky for production systems.

Agentic vibe coding takes this further: instead of a single AI autocomplete, you orchestrate multiple AI agents — each with different strengths — under your architectural direction, with automated tests as the safety net. In my workflow:

Claude Code (plan mode): Primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions — I steer the design, Claude handles the implementation.
Gemini: Code review, concurrency analysis, and verification reports. Gemini reviewed every major PR for thread-safety, feature parity, and edge cases.
Codex: Autonomous task execution for well-defined, bounded work items.

The key insight: AI writes the code, but the architect owns the design. Without deep domain knowledge of SkyWalking’s internals, no AI could have planned these changes. Without AI, I couldn’t have executed them in 5 weeks.

How TDD Made AI Coding Safe

The reason I could move this fast without breaking things comes down to one principle: never let AI code without a test harness.

My workflow for each major change:

Plan mode first: Describe the goal to Claude, review the plan, iterate on architecture before any code is written.
Write the test contract: Define what “correct” means — for the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines, asserting identical results across 1,290+ expressions.
Let AI implement: With the test contract in place, Claude can write thousands of lines of implementation code. If it’s wrong, the tests catch it immediately.
E2E as the final gate: Every PR must pass the full E2E test suite — Docker-based integration tests that boot the entire server with real storage backends.
AI code review: Gemini reviewed each PR for concurrency issues, thread-safety, and feature parity — catching things that unit tests alone wouldn’t find.

This is the opposite of “hope it works” vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes instead of days.

Lessons Learned

AI is a force multiplier, not a replacement. Before any AI agent wrote a single line, a human had to define the replacement solution: what gets replaced, how it gets replaced, and — critically — where the boundaries are. Which APIs could break? The internal compilation pipeline was fair game for a complete overhaul. Which APIs must stay aligned? Every external-facing DSL syntax, every YAML configuration key, every metrics name and tag structure had to remain byte-for-byte identical — because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries required deep knowledge of the codebase and its users. AI executed the plan at extraordinary speed, but the plan itself — the scope, the invariants, the compatibility contract — had to come from a human who understood the blast radius of every change.

Plan mode is non-negotiable for architectural work. Letting AI jump straight to code on a compiler rewrite would be a disaster. Plan mode’s strength is that it collects code context — scanning imports, tracing call chains, mapping class hierarchies — and uses that context to help me fill in implementation details I’d otherwise have to look up manually. But it can’t tell you the design principles. That direction had to come from me, stated clearly upfront, so the AI’s planning stayed on the right track instead of optimizing toward a locally reasonable but architecturally wrong solution.

Know when to hit ESC. Claude has a clear tendency to dive deep into solution code writing once it starts — and it won’t stop on its own when it encounters something that conflicts with the original plan’s concept. Instead of pausing to flag the conflict, it will push forward, improvising around the obstacle in ways that silently violate the design intent. I had to learn to watch for this: when Claude’s output started drifting from the plan, I’d manually cancel the task (ESC), call it off, identify where the plan and reality diverged, adjust the plan, and restart. This interrupt-replan cycle was a regular part of the workflow, not an exception. The architect has to stay in the loop — not just at planning time, but during execution — because AI agents don’t yet know when to stop and ask.

Spec-driven testing is necessary but not sufficient — the logic workflow matters more. It’s tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. It doesn’t work for anything non-trivial. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass — the inputs went in, the expected outputs came out, and everything looked green. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving the specific test case through a hack rather than a general mechanism. A spec only checks what the code produces; it says nothing about how the code produces it. For a mature project, the “how” matters enormously — the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable long-term. That’s why I needed cross-version testing and human review of the implementation path, not just the results.

Testing at two levels kept the rewrite honest. Cross-version testing was part of my design plan from the start — I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project’s existing infrastructure safety net — Docker-based integration tests that boot the entire server with real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify the system actually works end-to-end. For infrastructure-level changes like queue replacement and thread model changes, E2E is the only gate that truly matters. Together, the two layers — designed-for-this-rewrite cross-version tests and pre-existing E2E infrastructure — caught different classes of bugs and made shipping with confidence possible.

Multiple AIs have different strengths. Claude excels at large-scale code generation with plan mode. Gemini is exceptional at logic review — it can mentally trace code branches with given input data, simulating execution without actually running the code. This is significant for reviewing AI-generated code: Gemini would walk through a generated compiler method step by step, flagging where a null check was missing or where a branch would produce wrong output for a specific edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make wrong assumptions and then write tests that pass by setting expected values to match the wrong behavior — effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values that happened to make tests green, masking logic errors that would have surfaced in production. Using all three as checks on each other was far more effective than relying on any single one.

The Mythical Man-Month still applies — and so does the Mythical Token-Month. Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity — these software engineering fundamentals do not disappear just because your workforce is artificial. Worse, when the direction is wrong — when there’s a conceptual error in the design or an unreasonable architectural choice — AI will not recognize it. It will charge down the wrong path at extraordinary speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, adding workarounds on top of workarounds — each iteration making the codebase look more “complete” while drifting further from correctness. AI vibe coding cannot break out of this spiral on its own. Only a human who understands the domain can recognize “this is fundamentally wrong, stop,” discard the work, and redirect. Speed without direction is just expensive chaos.

The Bigger Picture

The agentic vibe coding approach worked because it combined AI’s speed with human architectural judgment and automated test discipline. It’s not magic — it’s engineering, accelerated.

Brooks also gave us “No Silver Bullet,” and its core distinction matters more than ever: software complexity comes in two kinds. Essential complexity comes from the problem itself — the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate this; it must be understood, modeled, and reasoned about by someone who knows the domain. Accidental complexity comes from the tools and implementation — boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project work was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI demolished the accidental complexity (generating 77K lines of implementation, scaffolding test harnesses, rewriting repetitive patterns across dozens of config files). Confuse the two — let AI make essential decisions, or waste human time on accidental work — and you get the worst of both worlds.

Qian Xuesen(Tsien Hsue-shen)’s Engineering Cybernetics offers another lens that proved surprisingly relevant. His core framework — feedback, control, optimization — describes how to keep complex systems running toward their target. AI vibe coding at full speed is like a hypersonic missile: extraordinarily fast, but without a guidance system it just creates a bigger crater in the wrong place. The feedback loop in my workflow was the test harness — cross-version tests and E2E tests providing continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI to trust for which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, each Codex audit caught assumptions the coding agent had smuggled past the tests. Without all three — feedback to detect deviation, control to correct course, optimization to converge — the speed of AI coding would be not an advantage but a liability. The faster the missile, the more precise the guidance must be.

For more details or to share your own experience with agentic coding on production systems, feel free to reach me on GitHub.

Zh: 在成熟开源大型项目中实践 Agentic Vibe Coding：软件工程与工程控制论还在延续

Sun, 08 Mar 2026 00:00:00 +0000

大多数"vibe coding"的故事都从一个全新项目开始，讲述一个快速构建原型或者可运行项目的过程，但这篇不是。

Apache SkyWalking 是一个有 9 年历史的Apache顶级项目，线上数以千计的集群部署，内部有一套复杂的 DSL 编译栈，对外暴露的 API 上承载着用户构建的仪表盘、告警规则和自动化脚本。当我决定替换核心脚本引擎——从四个 DSL 编译器中彻底移除 Groovy 运行时——面临的问题不是"AI 能不能写出代码"，而是"也许只有AI能完成如此大规模的一致性迭代"，以及"AI 能不能在不破坏系统的前提下写出完整且高效的代码"。

答案是可以——约 7.7 万行代码变更，10 个主要 PR，历时约 5 周——但前提是 AI 始终在一个深刻理解项目架构、兼容性要求和用户场景的人的引导下工作。这篇文章分享了我在过去几个月的实践体验，以及成熟开源项目的维护者在把代码库交给 AI 智能体之前应该知道什么。

项目概况

这次的任务是将 SkyWalking 基于 Groovy 的脚本引擎（MAL、LAL、Hierarchy）替换为统一的 ANTLR4 + Javassist 字节码编译管线，对齐 OAL 编译器已经验证过的架构。内部技术栈彻底重构，但对外接口必须保持完全一致。

除了编译器重写，范围还包括新的线程管理策略（线程数从 36 降到 15）、JDK 25+ 虚拟线程支持，以及端到端测试的现代化改造。按传统估算，这是 5-8 个月的资深工程师（以我自己为例）工作量。

编译器架构的完整技术细节，参见 Groovy 移除讨论。

什么是 Agentic Vibe Coding？

“Vibe coding”——Andrej Karpathy 提出的概念——描述的是一种你表达意图、让 AI 来写代码的编程风格。整个AI编程过程，一直以来都是用来做原型，效果强大且速度迅猛，但单独用于生产系统是有风险的。

Agentic vibe coding 更进一步：不是单一的 AI 自动补全，而是在你的架构指导下编排多个 AI 智能体——各有所长——以自动化测试作为安全网。我的工作流是这样的：

Claude Code（plan 模式）：主力编码智能体。Plan 模式让我在生成任何代码之前先审查方案。这对架构决策至关重要——我把控设计方向，Claude 负责实现。
Gemini：代码审查、并发分析和验证报告。每个主要 PR 都经过 Gemini 审查线程安全性、功能对等性和边界情况。
Codex：对定义明确、边界清晰的工作项进行自主任务执行。

核心洞察：AI 写代码，但架构师掌控设计。 没有对 SkyWalking 内部机制的深入领域知识，任何 AI 都无法规划这些变更。没有 AI，我也不可能在 5 周内完成执行。

TDD 如何让 AI 编程变得安全

我能以这样的速度推进而不搞砸，归结为一个原则：绝不让 AI 在没有测试保护的情况下写代码。

每次重大变更的工作流：

先进 plan 模式：向 Claude 描述目标，审查方案，在写任何代码之前先在架构层面迭代。
编写测试契约：定义"正确"意味着什么——对于编译器重写，这意味着交叉版本对比测试，让每个表达式同时通过新旧两个引擎运行，在 1290+ 个表达式上断言结果完全一致。
让 AI 实现：有了测试契约，Claude 可以写出数千行实现代码。如果写错了，测试会立即捕获。
端到端测试作为最终关卡：每个 PR 都必须通过完整的端到端测试套件——基于 Docker 的集成测试，启动整个服务器并连接真实存储后端。
AI 代码审查：Gemini 审查每个 PR 的并发问题、线程安全性和功能对等性——捕获单元测试无法发现的问题。

这和"写完祈祷能跑"的 vibe coding 完全相反。AI 写得快，测试验证得快，我把控架构方向。反馈循环足够紧凑，让我能在几分钟而不是几天内迭代复杂的编译器代码。

经验教训

AI 是力量倍增器，不是替代品。 在任何 AI 智能体写下第一行代码之前，必须由人来定义替换方案：替换什么、怎么替换，以及——至关重要的——边界在哪里。哪些 API 可以破坏性变更？内部编译管线可以彻底重构。哪些 API 必须保持对齐？每一个对外的 DSL 语法、每一个 YAML 配置键、每一个指标名称和标签结构都必须逐字节保持一致——因为数百个已部署的仪表盘、告警规则和用户脚本依赖于它们。划定这些边界需要对代码库及其用户的深入了解。AI 以惊人的速度执行了计划，但计划本身——范围、不变量、兼容性契约——必须来自一个理解每次变更影响半径的人。

架构级工作，plan 模式不可妥协。 让 AI 在编译器重写上直接跳到写代码，那是灾难。Plan 模式的价值在于它会收集代码上下文——扫描 import、追踪调用链、映射类继承关系——并利用这些上下文帮我补全那些我本来需要手动查找的实现细节。但它无法告诉你设计原则。方向必须由我在前期明确给出，这样 AI 的规划才能沿着正确的轨道走，而不是朝着一个局部合理但架构上错误的方案去优化。

要知道什么时候该按 ESC。 Claude 有一个明显的倾向：一旦开始写解决方案代码就会一头扎进去——当遇到与原始计划概念冲突的东西时，它不会自己停下来。它不会暂停来标记冲突，而是会继续推进，用即兴的方式绕过障碍，悄无声息地违背设计意图。我必须学会观察这个信号：当 Claude 的输出开始偏离计划时，我会手动取消任务（ESC），叫停它，找出计划和现实的分歧点，调整计划，然后重新开始。这种中断-重新规划的循环是工作流的常态，而非例外。架构师必须始终在环路中——不仅是在规划阶段，执行阶段也是——因为 AI 智能体还不知道什么时候该停下来问一句。

Spec-Driven 更多的运用于测试，而非开发。它只是一个必要的但不充分条件，而逻辑工作流更重要。 很容易产生一种想法：只要把输入/输出规格定义得足够清楚，AI 就能填充实现，测试会捕获任何错误。我试过。对于任何复杂的生产场景，这行不通。在表达式编译器重写过程中，Claude 有时会以不合理的方式修改代码，仅仅为了让规格测试通过——输入进去了，预期输出出来了，一切看起来都是正常的。但内部逻辑是错的：与代码库其他部分依赖的设计模式不一致，无法扩展，或者通过 hack （代码反射、字段名称静态比较等不可接受的工程方法）而非通用机制来解决特定测试用例。规格只检查代码产出了什么；它对代码如何产出一无所知。对于成熟项目，“如何"极其重要——解决方案需要与现有架构一致，能被贡献者广泛采用，并且长期可维护可扩展。这就是为什么我需要交叉版本测试加上对实现路径的人工审查，而不仅仅是审查结果。

两个层次的测试让重写的代码验证更有保障。 交叉版本测试从一开始就是我设计方案的一部分——我架构了双路径对比框架，让每个生产环境的 DSL 表达式同时通过新旧两个引擎运行，在 1290+ 个表达式上断言结果完全一致。这给了我任何人工审查都无法匹敌的信心，而且这是一个刻意的规划决策：我知道 AI 生成的编译器代码需要行为等价性的机械证明，而不仅仅是肉眼审查。在此之上，端到端测试作为项目已有的基础设施安全网——基于 Docker/K8s 的集成测试，启动整个服务器并连接真实存储后端。单元测试和交叉版本测试在隔离环境中验证逻辑；端到端测试验证系统真正能端到端地工作。对于队列替换和线程模型变更这样的基础设施级变更，端到端测试是唯一真正重要的关卡。两个层次——为本次重写专门设计的交叉版本测试和预先存在的端到端基础设施——捕获了不同类别的 bug，使得有信心地发布成为可能。

多个 AI 各有所长。 Claude 擅长配合 plan 模式进行大规模代码生成。Gemini 在逻辑审查方面表现出色——它能在给定输入数据的情况下在脑中追踪代码分支，模拟执行而无需实际运行代码。这对审查 AI 生成的代码意义重大：Gemini 会逐步走查一个编译器生成的方法，标记出哪里缺少空值检查，或者哪个分支在特定边界情况下会产生错误输出。Codex 作为测试审查者和诚实性检查者最有价值。AI 生成的代码有一种微妙的失败模式：编码智能体可能做出错误假设，然后编写测试时将期望值设置为匹配错误行为——实际上绕过了测试安全网。Codex 捕获了 Claude 设置不合理期望值使测试变绿的情况，掩盖了本会在生产环境中暴露的逻辑错误。将三者互相校验，远比依赖其中任何一个更有效。

人月神话依然适用——基于Token的AI月神话同样如此。 Brooks 告诉我们，一个需要 12 人月的任务不意味着 12 个人能在一个月内完成。同样的定律适用于 AI：你不能简单地投入更多 token、更多智能体或更多并行会话，就指望问题更快收敛。沟通成本、协调开销、需求分析和概念完整性——这些软件工程的基本规律不会因为你的劳动力是人工智能就消失。更糟糕的是，当方向错误时——当设计中存在概念性错误或不合理的架构选择时——AI 不会识别出来。它会以惊人的速度冲向错误的方向，疯狂消耗 token，同时陷入自我辩护的漩涡：修补代码让失败的测试通过，调整期望值去匹配错误行为，在变通方案上叠加变通方案——每次迭代都让代码库看起来更"完整”，实际上却离正确越来越远。AI vibe coding 无法自行跳出这个螺旋。只有理解领域的人才能认识到"这从根本上就是错的，停下来"，丢弃这些工作，重新引导方向。没有方向的速度，只是昂贵的混乱。

更大的图景

Agentic vibe coding 之所以有效，是因为它将 AI 的速度与人的架构判断力和自动化测试纪律结合在了一起。这不是魔法——这是被加速的工程。

Brooks 还给了我们《没有银弹》，其核心区分在今天比以往任何时候都更重要：软件复杂性分为两种。本质复杂性来自问题本身——领域语义、行为契约、并发不变量。没有任何工具能消除它；它必须由理解领域的人去理解、建模和推理。偶然复杂性来自工具和实现——样板代码、跨数百个文件的手动重构、将设计翻译成可编译源码的机械工作。这恰恰是 AI 擅长的地方。这个项目之所以成功，在于认清了哪种复杂性是哪种：我掌控本质复杂性（架构、API 边界、正确性不变量），AI 消灭偶然复杂性（生成 7.7 万行实现代码、搭建测试框架、跨数十个配置文件重写重复模式）。搞混这两者——让 AI 做本质决策，或者让人浪费时间在偶然工作上——你会得到两个世界中最差的结果。

钱学森的《工程控制论》提供了另一个视角，在实践中出人意料地切题。他的核心框架——反馈、控制、优化——描述的是如何让复杂系统持续朝目标运行。全速运转的 AI vibe coding 就像一枚高超音速导弹：速度惊人，但没有制导系统只会在错误的地方炸出一个更大的坑。我工作流中的反馈回路是测试体系——交叉版本测试和端到端测试持续提供系统是否仍在航线上的信号。控制是人类架构师决定何时介入：在执行前审查方案，在方向偏移时按 ESC，选择哪个 AI 负责哪项任务。优化是迭代式的：每次中断-重新规划的循环都在精炼方法，每次 Gemini 审查都在收紧逻辑，每次 Codex 审计都在捕获编码智能体偷偷绕过测试的假设。缺少其中任何一个——检测偏差的反馈、纠正航向的控制、趋向收敛的优化——AI 编程的速度就不是优势而是负债。导弹越快，制导就必须越精确。

AI Vibe Coding以及它的迭代，正在快速地走进每一个开发者，也正在广泛地融入开源和商业软件。我们都在见证这种新的开发模式，以及AI Vibe Coding和软件工程理论的融合。如果你想和我探讨更多的AI + OSS话题，欢迎在 GitHub 上联系我。