<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Apache SkyWalking – Open Source</title>
    <link>/tags/open-source/</link>
    <description>Recent content in Open Source on Apache SkyWalking</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Sun, 08 Mar 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="/tags/open-source/feed.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Blog: Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn&#39;t</title>
      <link>/blog/2026-03-08-agentic-vibe-coding/</link>
      <pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
      <guid>/blog/2026-03-08-agentic-vibe-coding/</guid>
      <description>
        
        
        &lt;p&gt;Most &amp;ldquo;vibe coding&amp;rdquo; stories start with a greenfield project. This one doesn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Apache SkyWalking is a 9-year-old observability platform with hundreds of production deployments, a complex DSL stack, and an external API surface that users have built dashboards, alerting rules, and automation scripts against. When I decided to replace the core scripting engine — purging the Groovy runtime from four DSL compilers — the constraint wasn&amp;rsquo;t &amp;ldquo;can AI write the code?&amp;rdquo; It was: &amp;ldquo;can AI write the code without breaking anything for existing users?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The answer turned out to be yes — &lt;strong&gt;~77,000 lines changed across 10 major PRs in about 5 weeks&lt;/strong&gt; — but only because the AI was tightly guided by a human who understood the project&amp;rsquo;s architecture, its compatibility contracts, and its users. This post is about the methodology: what worked, what didn&amp;rsquo;t, and what mature open-source maintainers should know before handing their codebase to AI agents.&lt;/p&gt;
&lt;h2 id=&#34;the-project-in-brief&#34;&gt;The Project in Brief&lt;/h2&gt;
&lt;p&gt;The task was to replace SkyWalking&amp;rsquo;s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, matching the architecture already proven by the OAL compiler. The internal tech stack was completely overhauled; the external interface had to remain identical.&lt;/p&gt;
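&lt;p&gt;To illustrate the shape of such a pipeline (a sketch only; the real implementation parses with ANTLR4 and emits bytecode with Javassist, neither of which is shown here, and all names below are hypothetical): the essential shift is from re-interpreting a script on every evaluation to compiling each DSL expression once into a reusable executable object.&lt;/p&gt;

```java
// Sketch of the "compile once, evaluate many" pattern behind the rewrite.
// Hypothetical names; the real pipeline uses ANTLR4 for parsing and
// Javassist for bytecode generation, which this sketch does not show.
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class CompileOnce {
    // A "compiled" expression is simply a reusable function over a sample.
    static ToDoubleFunction<Map<String, Double>> compile(String expression) {
        // A real compiler would build an AST here; this placeholder only
        // understands the form "a + b" to keep the sketch self-contained.
        String[] operands = expression.split("\\+");
        String left = operands[0].trim();
        String right = operands[1].trim();
        return sample -> sample.get(left) + sample.get(right);
    }

    public static void main(String[] args) {
        // Compile once at startup...
        ToDoubleFunction<Map<String, Double>> expr = compile("user_cpu + sys_cpu");
        // ...then evaluate per sample with no script-engine overhead.
        System.out.println(expr.applyAsDouble(Map.of("user_cpu", 2.0, "sys_cpu", 1.0)));
    }
}
```

&lt;p&gt;The payoff of this shape is that the per-evaluation hot path contains no parsing and no scripting runtime at all, which is what removing Groovy buys.&lt;/p&gt;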
&lt;p&gt;Beyond the compiler rewrites, the scope included a new queue infrastructure (threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E test modernization. By conventional estimates, this was 5-8 months of senior engineer work.&lt;/p&gt;
&lt;p&gt;For the full technical details on the compiler architecture, see the &lt;a href=&#34;https://github.com/apache/skywalking/discussions/13716&#34;&gt;Groovy elimination discussion&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-is-agentic-vibe-coding&#34;&gt;What is Agentic Vibe Coding?&lt;/h2&gt;
&lt;p&gt;&amp;ldquo;Vibe coding&amp;rdquo; — a term coined by Andrej Karpathy — describes a style of programming where you describe intent and let AI write the code. It&amp;rsquo;s powerful for prototyping, but on its own, it&amp;rsquo;s risky for production systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agentic vibe coding&lt;/strong&gt; takes this further: instead of a single AI autocomplete, you orchestrate multiple AI agents — each with different strengths — under your architectural direction, with automated tests as the safety net. In my workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Claude Code (plan mode)&lt;/strong&gt;: Primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions — I steer the design, Claude handles the implementation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini&lt;/strong&gt;: Code review, concurrency analysis, and verification reports. Gemini reviewed every major PR for thread-safety, feature parity, and edge cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Codex&lt;/strong&gt;: Autonomous task execution for well-defined, bounded work items.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key insight: &lt;strong&gt;AI writes the code, but the architect owns the design.&lt;/strong&gt; Without deep domain knowledge of SkyWalking&amp;rsquo;s internals, no AI could have planned these changes. Without AI, I couldn&amp;rsquo;t have executed them in 5 weeks.&lt;/p&gt;
&lt;h2 id=&#34;how-tdd-made-ai-coding-safe&#34;&gt;How TDD Made AI Coding Safe&lt;/h2&gt;
&lt;p&gt;The reason I could move this fast without breaking things comes down to one principle: &lt;strong&gt;never let AI code without a test harness.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;My workflow for each major change:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Plan mode first&lt;/strong&gt;: Describe the goal to Claude, review the plan, iterate on architecture before any code is written.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write the test contract&lt;/strong&gt;: Define what &amp;ldquo;correct&amp;rdquo; means — for the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines, asserting identical results across 1,290+ expressions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Let AI implement&lt;/strong&gt;: With the test contract in place, Claude can write thousands of lines of implementation code. If it&amp;rsquo;s wrong, the tests catch it immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E2E as the final gate&lt;/strong&gt;: Every PR must pass the full E2E test suite — Docker-based integration tests that boot the entire server with real storage backends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI code review&lt;/strong&gt;: Gemini reviewed each PR for concurrency issues, thread-safety, and feature parity — catching things that unit tests alone wouldn&amp;rsquo;t find.&lt;/li&gt;
&lt;/ol&gt;
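&lt;p&gt;Step 2 is the load-bearing one, and its core mechanism is simple to picture. A minimal sketch (hypothetical names, not SkyWalking&amp;rsquo;s actual classes): feed every expression in the corpus through both engines and flag any divergence.&lt;/p&gt;

```java
// Minimal dual-path comparison harness (hypothetical names, not
// SkyWalking's actual API): every expression runs through both the
// legacy engine and the rewritten engine, and divergence is counted.
import java.util.List;

public class CrossVersionHarness {
    // Both engines must honor the same evaluation contract.
    interface Engine {
        double eval(String expression);
    }

    // Placeholder engines; in the real rewrite these would be the Groovy
    // interpreter and the ANTLR4/Javassist compiled pipeline.
    static class LegacyEngine implements Engine {
        public double eval(String expression) { return expression.length(); }
    }
    static class RewrittenEngine implements Engine {
        public double eval(String expression) { return expression.length(); }
    }

    // Returns the number of expressions on which the two engines disagree.
    static int countMismatches(Engine legacy, Engine rewritten, List<String> corpus) {
        int mismatches = 0;
        for (String expr : corpus) {
            if (Double.compare(legacy.eval(expr), rewritten.eval(expr)) != 0) {
                System.err.println("MISMATCH on: " + expr);
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("cpu.usage", "jvm.heap.max", "endpoint.latency");
        // The rewrite ships only when the mismatch count is zero.
        System.out.println("mismatches=" + countMismatches(
                new LegacyEngine(), new RewrittenEngine(), corpus));
    }
}
```

&lt;p&gt;The decisive property is that the corpus comes from real production DSL expressions, so a green run means behavioral equivalence on what users actually run, not on hand-picked samples.&lt;/p&gt;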
&lt;p&gt;This is the opposite of &amp;ldquo;hope it works&amp;rdquo; vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes instead of days.&lt;/p&gt;
&lt;h2 id=&#34;lessons-learned&#34;&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AI is a force multiplier, not a replacement.&lt;/strong&gt; Before any AI agent wrote a single line, a human had to define the replacement solution: &lt;em&gt;what&lt;/em&gt; gets replaced, &lt;em&gt;how&lt;/em&gt; it gets replaced, and — critically — &lt;em&gt;where the boundaries are&lt;/em&gt;. Which APIs could break? The internal compilation pipeline was fair game for a complete overhaul. Which APIs must stay aligned? Every external-facing DSL syntax, every YAML configuration key, every metrics name and tag structure had to remain byte-for-byte identical — because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries required deep knowledge of the codebase and its users. AI executed the plan at extraordinary speed, but the plan itself — the scope, the invariants, the compatibility contract — had to come from a human who understood the blast radius of every change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plan mode is non-negotiable for architectural work.&lt;/strong&gt; Letting AI jump straight to code on a compiler rewrite would be a disaster. Plan mode&amp;rsquo;s strength is that it collects code context — scanning imports, tracing call chains, mapping class hierarchies — and uses that context to help me fill in implementation details I&amp;rsquo;d otherwise have to look up manually. But it can&amp;rsquo;t tell you the design principles. That direction had to come from me, stated clearly upfront, so the AI&amp;rsquo;s planning stayed on the right track instead of optimizing toward a locally reasonable but architecturally wrong solution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Know when to hit ESC.&lt;/strong&gt; Claude has a clear tendency to dive deep into solution code writing once it starts — and it won&amp;rsquo;t stop on its own when it encounters something that conflicts with the original plan&amp;rsquo;s concept. Instead of pausing to flag the conflict, it will push forward, improvising around the obstacle in ways that silently violate the design intent. I had to learn to watch for this: when Claude&amp;rsquo;s output started drifting from the plan, I&amp;rsquo;d manually cancel the task (ESC), call it off, identify where the plan and reality diverged, adjust the plan, and restart. This interrupt-replan cycle was a regular part of the workflow, not an exception. The architect has to stay in the loop — not just at planning time, but during execution — because AI agents don&amp;rsquo;t yet know when to stop and ask.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Spec-driven testing is necessary but not sufficient — the logic workflow matters more.&lt;/strong&gt; It&amp;rsquo;s tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. It doesn&amp;rsquo;t work for anything non-trivial. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass — the inputs went in, the expected outputs came out, and everything looked green. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving the specific test case through a hack rather than a general mechanism. A spec only checks &lt;em&gt;what&lt;/em&gt; the code produces; it says nothing about &lt;em&gt;how&lt;/em&gt; the code produces it. For a mature project, the &amp;ldquo;how&amp;rdquo; matters enormously — the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable long-term. That&amp;rsquo;s why I needed cross-version testing &lt;em&gt;and&lt;/em&gt; human review of the implementation path, not just the results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Testing at two levels kept the rewrite honest.&lt;/strong&gt; Cross-version testing was part of my design plan from the start — I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project&amp;rsquo;s existing infrastructure safety net — Docker-based integration tests that boot the entire server with real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify the system actually works end-to-end. For infrastructure-level changes like queue replacement and thread model changes, E2E is the only gate that truly matters. Together, the two layers — designed-for-this-rewrite cross-version tests and pre-existing E2E infrastructure — caught different classes of bugs and made shipping with confidence possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multiple AIs have different strengths.&lt;/strong&gt; Claude excels at large-scale code generation with plan mode. Gemini is exceptional at logic review — it can mentally trace code branches with given input data, simulating execution without actually running the code. This is significant for reviewing AI-generated code: Gemini would walk through a generated compiler method step by step, flagging where a null check was missing or where a branch would produce wrong output for a specific edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make wrong assumptions and then write tests that pass by setting expected values to match the wrong behavior — effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values that happened to make tests green, masking logic errors that would have surfaced in production. Using all three as checks on each other was far more effective than relying on any single one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Mythical Man-Month still applies — and so does the Mythical Token-Month.&lt;/strong&gt; Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity — these software engineering fundamentals do not disappear just because your workforce is artificial. Worse, when the direction is wrong — when there&amp;rsquo;s a conceptual error in the design or an unreasonable architectural choice — AI will not recognize it. It will charge down the wrong path at extraordinary speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, adding workarounds on top of workarounds — each iteration making the codebase look more &amp;ldquo;complete&amp;rdquo; while drifting further from correctness. AI vibe coding cannot break out of this spiral on its own. Only a human who understands the domain can recognize &amp;ldquo;this is fundamentally wrong, stop,&amp;rdquo; discard the work, and redirect. Speed without direction is just expensive chaos.&lt;/p&gt;
&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;The agentic vibe coding approach worked because it combined AI&amp;rsquo;s speed with human architectural judgment and automated test discipline. It&amp;rsquo;s not magic — it&amp;rsquo;s engineering, accelerated.&lt;/p&gt;
&lt;p&gt;Brooks also gave us &amp;ldquo;No Silver Bullet,&amp;rdquo; and its core distinction matters more than ever: software complexity comes in two kinds. &lt;strong&gt;Essential complexity&lt;/strong&gt; comes from the problem itself — the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate this; it must be understood, modeled, and reasoned about by someone who knows the domain. &lt;strong&gt;Accidental complexity&lt;/strong&gt; comes from the tools and implementation — boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project work was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI demolished the accidental complexity (generating 77K lines of implementation, scaffolding test harnesses, rewriting repetitive patterns across dozens of config files). Confuse the two — let AI make essential decisions, or waste human time on accidental work — and you get the worst of both worlds.&lt;/p&gt;
&lt;p&gt;Qian Xuesen (Tsien Hsue-shen)&amp;rsquo;s &lt;em&gt;Engineering Cybernetics&lt;/em&gt; offers another lens that proved surprisingly relevant. His core framework — &lt;strong&gt;feedback&lt;/strong&gt;, &lt;strong&gt;control&lt;/strong&gt;, &lt;strong&gt;optimization&lt;/strong&gt; — describes how to keep complex systems running toward their target. AI vibe coding at full speed is like a hypersonic missile: extraordinarily fast, but without a guidance system it just creates a bigger crater in the wrong place. The feedback loop in my workflow was the test harness — cross-version tests and E2E tests providing continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI to trust for which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, each Codex audit caught assumptions the coding agent had smuggled past the tests. Without all three — feedback to detect deviation, control to correct course, optimization to converge — the speed of AI coding would not be an advantage but a liability. The faster the missile, the more precise the guidance must be.&lt;/p&gt;
&lt;p&gt;For more details or to share your own experience with agentic coding on production systems, feel free to reach me on &lt;a href=&#34;https://github.com/wu-sheng&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Zh: Practicing Agentic Vibe Coding in a Mature, Large-Scale Open-Source Project: Software Engineering and Engineering Cybernetics Still Apply</title>
      <link>/zh/2026-03-08-agentic-vibe-coding/</link>
      <pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
      <guid>/zh/2026-03-08-agentic-vibe-coding/</guid>
      <description>
        
        
        &lt;p&gt;Most &amp;ldquo;vibe coding&amp;rdquo; stories start with a greenfield project and tell the tale of quickly building a prototype or a runnable app. This one doesn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Apache SkyWalking is a 9-year-old Apache top-level project with thousands of production cluster deployments, a complex internal DSL compilation stack, and an externally exposed API surface that carries user-built dashboards, alerting rules, and automation scripts. When I decided to replace the core scripting engine — completely removing the Groovy runtime from four DSL compilers — the question I faced was not &amp;ldquo;can AI write the code?&amp;rdquo; It was &amp;ldquo;perhaps only AI can carry out a consistent iteration at this scale,&amp;rdquo; and &amp;ldquo;can AI write complete, efficient code without breaking the system?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The answer is yes — &lt;strong&gt;~77,000 lines changed across 10 major PRs in about 5 weeks&lt;/strong&gt; — but only because the AI worked, throughout, under the guidance of someone with a deep understanding of the project&amp;rsquo;s architecture, compatibility requirements, and user scenarios. This post shares my hands-on experience over the past few months, and what maintainers of mature open-source projects should know before handing their codebase to AI agents.&lt;/p&gt;
&lt;h2 id=&#34;项目概况&#34;&gt;The Project in Brief&lt;/h2&gt;
&lt;p&gt;The task was to replace SkyWalking&amp;rsquo;s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, aligned with the architecture already proven by the OAL compiler. The internal tech stack was rebuilt from the ground up, but the external interface had to remain fully identical.&lt;/p&gt;
&lt;p&gt;Beyond the compiler rewrites, the scope also covered a new thread management strategy (thread count dropped from 36 to 15), virtual thread support for JDK 25+, and E2E test modernization. By conventional estimates, this was 5-8 months of work for a senior engineer (taking myself as the benchmark).&lt;/p&gt;
&lt;p&gt;For the full technical details of the compiler architecture, see the &lt;a href=&#34;https://github.com/apache/skywalking/discussions/13716&#34;&gt;Groovy elimination discussion&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;什么是-agentic-vibe-coding&#34;&gt;What is Agentic Vibe Coding?&lt;/h2&gt;
&lt;p&gt;&amp;ldquo;Vibe coding&amp;rdquo; — a term coined by Andrej Karpathy — describes a style of programming where you express intent and let AI write the code. To date, this way of programming has mostly been used for prototyping; it is powerful and blazingly fast, but risky on its own for production systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agentic vibe coding&lt;/strong&gt; goes a step further: instead of a single AI autocomplete, you orchestrate multiple AI agents — each with its own strengths — under your architectural direction, with automated tests as the safety net. My workflow looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Claude Code (plan mode)&lt;/strong&gt;: Primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions — I steer the design, Claude handles the implementation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini&lt;/strong&gt;: Code review, concurrency analysis, and verification reports. Every major PR was reviewed by Gemini for thread safety, feature parity, and edge cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Codex&lt;/strong&gt;: Autonomous task execution for well-defined, clearly bounded work items.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key insight: &lt;strong&gt;AI writes the code, but the architect owns the design.&lt;/strong&gt; Without deep domain knowledge of SkyWalking&amp;rsquo;s internals, no AI could have planned these changes. Without AI, I could not have executed them in 5 weeks.&lt;/p&gt;
&lt;h2 id=&#34;tdd-如何让-ai-编程变得安全&#34;&gt;How TDD Made AI Coding Safe&lt;/h2&gt;
&lt;p&gt;Being able to move at this speed without breaking things comes down to one principle: &lt;strong&gt;never let AI write code without a test harness.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;My workflow for each major change:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Plan mode first&lt;/strong&gt;: Describe the goal to Claude, review the plan, and iterate at the architecture level before any code is written.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write the test contract&lt;/strong&gt;: Define what &amp;ldquo;correct&amp;rdquo; means — for the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines, asserting identical results across 1,290+ expressions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Let AI implement&lt;/strong&gt;: With the test contract in place, Claude can write thousands of lines of implementation code. If it gets something wrong, the tests catch it immediately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E2E as the final gate&lt;/strong&gt;: Every PR must pass the full E2E test suite — Docker-based integration tests that boot the entire server against real storage backends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI code review&lt;/strong&gt;: Gemini reviewed each PR for concurrency issues, thread safety, and feature parity — catching problems that unit tests cannot find.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the opposite of &amp;ldquo;write it and pray&amp;rdquo; vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes instead of days.&lt;/p&gt;
&lt;h2 id=&#34;经验教训&#34;&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AI is a force multiplier, not a replacement.&lt;/strong&gt; Before any AI agent wrote a single line, a human had to define the replacement plan: &lt;em&gt;what&lt;/em&gt; gets replaced, &lt;em&gt;how&lt;/em&gt; it gets replaced, and — critically — &lt;em&gt;where the boundaries are&lt;/em&gt;. Which APIs could take breaking changes? The internal compilation pipeline could be rebuilt wholesale. Which APIs had to stay aligned? Every external-facing DSL syntax, every YAML configuration key, every metric name and tag structure had to remain byte-for-byte identical — because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries required deep knowledge of the codebase and its users. AI executed the plan at astonishing speed, but the plan itself — the scope, the invariants, the compatibility contract — had to come from someone who understood the blast radius of every change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For architecture-level work, plan mode is non-negotiable.&lt;/strong&gt; Letting AI jump straight to code on a compiler rewrite would be a disaster. Plan mode&amp;rsquo;s value is that it gathers code context — scanning imports, tracing call chains, mapping class hierarchies — and uses that context to help me fill in implementation details I would otherwise have to look up by hand. But it cannot tell you the design principles. The direction had to come from me, stated clearly upfront, so that the AI&amp;rsquo;s planning stayed on the right track instead of optimizing toward a locally reasonable but architecturally wrong solution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Know when to hit ESC.&lt;/strong&gt; Claude has a clear tendency to dive headlong into solution code once it starts — and when it runs into something that conflicts with the original plan&amp;rsquo;s concept, it will not stop on its own. Instead of pausing to flag the conflict, it pushes forward, improvising around the obstacle in ways that silently violate the design intent. I had to learn to watch for this signal: when Claude&amp;rsquo;s output started drifting from the plan, I would manually cancel the task (ESC), call it off, pinpoint where the plan and reality diverged, adjust the plan, and restart. This interrupt-replan cycle was a regular part of the workflow, not an exception. The architect has to stay in the loop — not just at planning time but during execution as well — because AI agents do not yet know when to stop and ask.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Spec-driven methods apply more to testing than to development; a spec is a necessary but not sufficient condition, and the logic workflow matters more.&lt;/strong&gt; It is tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. For any complex production scenario, it does not work. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass — the inputs went in, the expected outputs came out, and everything looked fine. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving the specific test case through hacks (reflection, static comparison of field names, and other unacceptable engineering shortcuts) rather than a general mechanism. A spec only checks &lt;em&gt;what&lt;/em&gt; the code produces; it knows nothing about &lt;em&gt;how&lt;/em&gt; the code produces it. For a mature project, the &amp;ldquo;how&amp;rdquo; matters enormously — the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable and extensible long-term. That is why I needed cross-version testing &lt;em&gt;plus&lt;/em&gt; human review of the implementation path, not just of the results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Testing at two levels gave the rewritten code stronger guarantees.&lt;/strong&gt; Cross-version testing was part of my design plan from the start — I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project&amp;rsquo;s existing infrastructure safety net — Docker/K8s-based integration tests that boot the entire server against real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify the system actually works end to end. For infrastructure-level changes like the queue replacement and thread model changes, E2E is the only gate that truly matters. The two layers — cross-version tests designed specifically for this rewrite and the pre-existing E2E infrastructure — caught different classes of bugs and made shipping with confidence possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multiple AIs have different strengths.&lt;/strong&gt; Claude excels at large-scale code generation paired with plan mode. Gemini is outstanding at logic review — it can mentally trace code branches with given input data, simulating execution without actually running the code. This matters greatly for reviewing AI-generated code: Gemini would walk through a generated compiler method step by step, flagging where a null check was missing or which branch would produce wrong output for a particular edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make wrong assumptions and then write tests whose expected values match the wrong behavior — effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values that happened to make tests green, masking logic errors that would have surfaced in production. Using the three as checks on each other was far more effective than relying on any single one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Mythical Man-Month still applies — and so does the mythical token-month of AI.&lt;/strong&gt; Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity — these software engineering fundamentals do not disappear just because your workforce is artificial. Worse, when the direction is wrong — when there is a conceptual error in the design or an unreasonable architectural choice — AI will not recognize it. It will charge toward the wrong destination at astonishing speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, stacking workarounds on top of workarounds — each iteration making the codebase look more &amp;ldquo;complete&amp;rdquo; while drifting further from correctness. AI vibe coding cannot break out of this spiral on its own. Only someone who understands the domain can recognize &amp;ldquo;this is fundamentally wrong, stop,&amp;rdquo; discard the work, and redirect. Speed without direction is just expensive chaos.&lt;/p&gt;
&lt;h2 id=&#34;更大的图景&#34;&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;Agentic vibe coding worked because it combined AI&amp;rsquo;s speed with human architectural judgment and automated test discipline. This is not magic — it is engineering, accelerated.&lt;/p&gt;
&lt;p&gt;Brooks also gave us &amp;ldquo;No Silver Bullet,&amp;rdquo; and its core distinction matters more today than ever: software complexity comes in two kinds. &lt;strong&gt;Essential complexity&lt;/strong&gt; comes from the problem itself — the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate it; it must be understood, modeled, and reasoned about by someone who knows the domain. &lt;strong&gt;Accidental complexity&lt;/strong&gt; comes from the tools and the implementation — boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project succeed was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI demolished the accidental complexity (generating 77K lines of implementation code, scaffolding test frameworks, rewriting repetitive patterns across dozens of config files). Confuse the two — let AI make essential decisions, or waste human time on accidental work — and you get the worst of both worlds.&lt;/p&gt;
&lt;p&gt;Qian Xuesen&amp;rsquo;s &lt;em&gt;Engineering Cybernetics&lt;/em&gt; offers another lens that proved surprisingly relevant in practice. His core framework — &lt;strong&gt;feedback&lt;/strong&gt;, &lt;strong&gt;control&lt;/strong&gt;, &lt;strong&gt;optimization&lt;/strong&gt; — describes how to keep a complex system running toward its target. AI vibe coding at full speed is like a hypersonic missile: astonishingly fast, but without a guidance system it only blasts a bigger crater in the wrong place. The feedback loop in my workflow was the test system — cross-version tests and E2E tests providing a continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI to entrust with which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, each Codex audit caught assumptions the coding agent had smuggled past the tests. Missing any one of the three — feedback to detect deviation, control to correct course, optimization to converge — and the speed of AI coding becomes a liability rather than an advantage. The faster the missile, the more precise the guidance must be.&lt;/p&gt;
&lt;p&gt;AI vibe coding, and its successive iterations, is rapidly reaching every developer and being woven broadly into both open-source and commercial software. We are all witnessing this new development model, and the convergence of AI vibe coding with software engineering theory. If you would like to discuss more AI + OSS topics with me, feel free to reach out on &lt;a href=&#34;https://github.com/wu-sheng&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
  </channel>
</rss>
