Agentic Project Management: Why Vibe Coding Fails and How to Fix It
Context
Intuitive prompt engineering - often called vibe coding - promises a flow state for software engineers, but the reality is usually a repetitive loop of review and correction. We’ve traded writing code for supervising agents that write code. The question is how to make that trade worthwhile.
The Illusion of Effortless AI Coding
You've probably seen the demos. An agent scaffolds an entire feature in minutes, commits cleanly, and moves on to the next task. Then you try it on your own codebase.
The first task goes well. The second introduces a subtle regression because the agent "forgot" the component hierarchy it just modified. By the third, you're back to reading diffs line-by-line, except now there are three times as many lines and half of them are hallucinated imports.
AI companies are racing towards AGI, and models are being released at an unprecedented rate, claiming higher and higher scores on benchmarks like SWE-bench. The goal there is to land in the top bracket of those tests, which is orthogonal to what we need at work: code produced repeatedly, reliably, and in a controlled manner.
I have been vibe coding for the last couple of months. While the pace of development surprised me, so did the failure scenarios. We’ve all heard the horror stories: deleted databases, hallucinated file paths, and uncommitted work vanishing into the ether. But there are also subtle failures—like an agent centering a div vertically but forgetting the horizontal alignment because it drifted from the design system.
Based on my research and experience, I’ve categorized the primary challenges most engineers face with coding agents:
| Failure Mode | Cause | Symptom |
|---|---|---|
| Attention Drift | Large context windows | Agent "forgets" earlier instructions, hallucinates file paths |
| Scope Creep | No explicit boundaries | Simple bug fix becomes architectural refactor |
| Context Exhaustion | Dumping entire codebase | Token limits hit mid-task |
| Coordination Issues | No persistent state | Multi-session work loses continuity |
| Reviewer Fatigue | Overly long reviews | Humans stop checking the AI's work thoroughly |
Each of these failures traces back to a single root cause: the mismatch between the agent's almost infinite stamina and its finite understanding of intent. That is exactly why we need to direct its attention deliberately, so we do not waste tokens on subpar implementations.
My goal was to stop micro-managing and start directing. I needed a system to maximize token efficiency and enforce architectural boundaries. I call this framework Agentic Project Management.
The Intern Model
The core philosophy of agentic project management is to treat the agent not as a senior engineer who knows everything and can judge when to ignore details or expand on requirements, but as a brilliant, fast intern with zero short-term memory who is easily distracted.
To make this work, we need three pillars of guardrails:
- Context Curation: Manually restricting the "surface area" of the code exposed to the agent to reduce hallucination and attention drift.
- Externalized Memory: Moving project rules, architectural decisions, and current status (including project management tickets) into persistent files that live in the repo (see the sketch after this list).
- Atomic Scoping: Breaking complex features down into tasks so small that "Context Exhaustion" becomes very rare.
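One way the second pillar can look in practice is a handful of version-controlled files at the repo root. The layout below is a hypothetical sketch, not a prescription; the file names are placeholders:

```text
repo/
├── AGENTS.md            # standing rules the agent must follow (style, boundaries)
├── docs/
│   ├── architecture.md  # decisions the agent is not allowed to override
│   └── prd/             # one PRD per major feature
└── tasks/
    └── TASK-123.md      # atomic task: scope, files to touch, acceptance criteria
```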
With these in mind, we can build a workflow that puts all three pillars into practice.
Definition Before Execution
If we treat the agent as an intern with no short-term memory, we cannot expect it to infer requirements as it works. The instructions need to be complete before execution begins. This means separating the workflow into distinct phases: first define what needs to be done, then execute, then verify.
This structure forces clarity. A vague idea must become a concrete specification before any code is written. The agent receives bounded context rather than an open-ended chat, which reduces both hallucination and scope creep.
The workflow operates in three tiers.
Tier 1: Definition
- Create PRD: The PRD captures high-level intent and constraints. I use a prompt like: "Given @create-prd.md, create a PRD for a simple todo list app." For larger projects, each major feature gets its own PRD.
- Task Breakdown: A second prompt takes the PRD and decomposes it into epics and individual tasks. Each task should be small enough that an agent can complete it without exhausting its context window.
- Context Preparation: Before execution, each task gets a structured context block—typically XML—that lists only the files, types, and dependencies the agent needs, as in the sketch below. This prevents the agent from reading irrelevant code and losing focus.
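As an illustration, a prepared context block for one task might look like the following. The element names are my own rather than a fixed schema, and the file paths refer to the hypothetical todo list app from the PRD prompt above:

```xml
<task_context id="TASK-123">
  <goal>Add a "completed" filter to the todo list view</goal>
  <files>
    <file path="src/components/TodoList.tsx">component to modify</file>
    <file path="src/types/todo.ts">Todo type definition, read-only</file>
  </files>
  <constraints>
    <constraint>Do not change the Todo type or any API calls</constraint>
    <constraint>Follow the existing design system for all controls</constraint>
  </constraints>
</task_context>
```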
Tier 2: Execution
- Execute Task: The agent receives a task ID, reads the prepared context, and implements the change. It commits to a branch and opens a pull request. The prompt constrains the agent to the defined scope.
- Execute Epic: For larger units of work, an orchestrator agent can coordinate multiple task executions. This is still experimental—the goal is to spin up isolated containers, each handling a single task in parallel.
Tier 3: Verification
- Code Review: A separate agent reviews the changes against the original PRD and task specification. This agent starts with fresh context, free from the execution agent's accumulated assumptions. It checks for scope creep, regressions, hallucinated imports, and architectural violations before human review.
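The review step can reuse the same structured format. A minimal, hypothetical sketch of a review context, with file paths matching the earlier examples:

```xml
<review_task id="TASK-123">
  <inputs>
    <pull_request>branch: task/TASK-123</pull_request>
    <spec>tasks/TASK-123.md</spec>
    <prd>docs/prd/todo-app.md</prd>
  </inputs>
  <checks>
    <check>Changes stay within the files listed in the task context</check>
    <check>No hallucinated imports, dead code, or unrelated refactors</check>
    <check>No rules in docs/architecture.md are violated</check>
  </checks>
</review_task>
```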
The following diagram shows how these tiers connect.
Shifting Quality Left
With this workflow, I rolled back far less work and used my token budget more efficiently. This is anecdotal, since I did not think to measure it from the start; proving it with more data points would be a fun follow-up project. The approach is similar to the "shift-left" movement in security engineering.
Moving quality checks earlier in the pipeline catches defects earlier in the development loop, before they compound. The same principle applies to coding agents. Default vibe coding puts all quality control on the right side: generate code, review, find problems, correct, repeat.
Shifting left means investing in task definition before execution begins. The PRD catches intent misalignment before any code is written. The task breakdown enforces scope boundaries. The XML context preparation prevents attention drift by keeping the agent from opening files it should not touch and losing focus.
This mirrors how Rust's toolchain works: the borrow checker and clippy catch issues at compile time rather than runtime. You pay upfront—fighting the compiler, satisfying the type system—but you rarely pay the much higher cost of debugging a mysterious null pointer in production.
Crafting detailed task specifications feels like overhead when you could just let the agent start coding. The payoff only becomes obvious after you have experienced enough late-stage failures—enough agent sessions that spiraled into incoherence, enough diffs so large you stopped reading carefully. Once you have experienced that pain, putting in the work stops feeling like overhead and starts feeling like the way forward.
Parallelism Without Chaos
There are several ways to create a multi-agent environment. Locally, git worktrees allow multiple working directories within the same repo, each checked out to a different branch.
From the documentation:
Manage multiple working trees attached to the same repository. A git repository can support multiple working trees, allowing you to check out more than one branch at a time. With git worktree add a new working tree is associated with the repository, along with additional metadata that differentiates that working tree from others in the same repository. The working tree, along with this metadata, is called a "worktree".
It works very well with multiple agents running on the same node.
For maximum separation, and to further reduce the chance of unintended consequences, I have also implemented a simple Docker-based workflow.
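A minimal sketch of both setups; branch names, paths, and the Docker image are placeholders:

```sh
# One worktree per agent, each on its own branch, all sharing one repository
git worktree add ../todo-app-task-123 -b task/TASK-123
git worktree add ../todo-app-task-124 -b task/TASK-124

# Stronger isolation: run the agent inside a throwaway container that only
# sees a single worktree (image name and agent command are hypothetical)
docker run --rm -it \
  -v "$(realpath ../todo-app-task-123):/workspace" \
  -w /workspace \
  my-agent-image run-task TASK-123

# Clean up once the branch has been merged
git worktree remove ../todo-app-task-123
```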
Supporting Toolchain
I use the following toolset for getting the most out of agentic work:
- Mise is a polyglot tool version manager with environment variable support and a task runner. Tool requirements (programming languages, executables, and so on) can otherwise end up captured in multiple locations with multiple strategies; Mise consolidates them in one place (see the sketch after this list) and supports multiple backends for installing software, even custom implementations.
- Beads is a Git-persisted, dependency-aware graph of tasks. By structuring tasks logically, it prevents context drift, allowing coding agents to execute multi-step engineering workflows reliably. Instead of giving the agent every single file in the repo, we can list the files we need in the ticket description together with all the other details. It has an agent-aware CLI that makes task management for agents easy.
- XML: Finally, after 27 years of waiting, we figured out what XML is good for. It turns out agents love it.
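For the Mise configuration mentioned above, a single mise.toml at the repo root can pin everything both humans and agents need. The tools, versions, and task commands below are placeholders:

```toml
[tools]
node = "22"
python = "3.12"

[env]
APP_ENV = "development"

[tasks.test]
run = "npm test"

[tasks.lint]
run = "npm run lint"
```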
What Changed in Practice
Before this framework, I spent a significant portion of my time reviewing code and fixing subtle regressions, trying to get the agent back on track. With the three-tier workflow, that dynamic no longer exists.
Now, roughly 90% of agent output is correct on the first pass. When something is wrong, I do not fix the code directly—I adjust the task description and run the agent again. The feedback loop has shifted from debugging implementation to refining specification. This is a more sustainable way to work: specifications are reusable, and improving them benefits future tasks.
Babysitting sessions are gone. I observe outcomes, verify against the PRD, and move on.
Where This Goes Next
The immediate next step is true parallelism. I am experimenting with using the Orchestrator to spin up disposable containers, each assigned a specific task ID and context. This would allow an "epic" to be implemented by five "interns" simultaneously.
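A rough sketch of what that orchestration could look like, assuming one git worktree and one prepared context file per task, plus a Docker image with the coding agent installed. The image name, entrypoint, and directory layout are all hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical setup: one git worktree and one prepared context file per task ID.
TASKS = ["TASK-123", "TASK-124", "TASK-125", "TASK-126", "TASK-127"]

def run_task(task_id: str) -> int:
    """Run one 'intern' in a disposable container, constrained to a single task."""
    worktree = Path("worktrees", task_id).resolve()
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{worktree}:/workspace",
            "-w", "/workspace",
            "my-agent-image",      # placeholder image with the coding agent installed
            "run-task", task_id,   # placeholder entrypoint that reads the task's context
        ],
        check=False,
    )
    return result.returncode

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(TASKS)) as pool:
        for task_id, code in zip(TASKS, pool.map(run_task, TASKS)):
            status = "ok" if code == 0 else f"exited with {code}"
            print(f"{task_id}: {status}")
```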
In the end, agentic productivity gains do not come from letting go of control. They come from deciding where control actually matters.