LLM-Driven Software Development

The core insight: the best LLMs can now write correct, idiomatic, production-quality code. The engineering discipline around that code (requirements, architecture, validation, review) still needs the same rigour you’d apply with human teams.

The Shift: What Humans Do Now

The Programmer Is Replaced — The Engineer Is Not

LLMs are now genuinely replacing the programmer. Not augmenting, not assisting — replacing. A well-directed LLM will produce code that is correct, idiomatic, and often more consistent than what a human would write under time pressure. This is a structural change, not an incremental one.

What remains essential — and is now the primary human contribution — is everything that surrounds the code:

Product thinking: Understanding users, defining value propositions, designing workflows, making trade-off decisions about what to build and what to leave out.
Architecture and technical direction: Choosing the right patterns, setting structural constraints, knowing when a problem calls for a particular algorithm, library, or approach — and being able to discuss these choices with the LLM in a way that produces the right outcome.
Principal-developer judgement: The pragmatic wisdom of how software should be built — the kind of knowledge a senior engineer or architect carries about maintainability, operability, and long-term health of a system.

Language Choice Is Now an Optimisation Decision

A profound consequence of this shift: the choice of programming language and ecosystem is no longer constrained by the skills of your team. The LLM codes fluently in any language. This means language selection should now be driven purely by what produces the best software — type safety, performance, correctness guarantees, ecosystem maturity — rather than by hiring concerns or existing team expertise. The talent pool for any given language is no longer a factor.

This reframes an argument that previously had real practical limits. Haskell is an extraordinary language — strong static types, purity by default, principled abstractions, a compiler that acts as an automated safety net. The traditional objection was always hiring: you can’t staff a Haskell team. That objection is now gone. The LLM is the Haskell developer. The question reduces to: does this language produce better software? The answer for Haskell, for most application development, is yes.

The Human’s Highest-Value Contribution: Directive Wisdom

LLMs are excellent at generating code that meets an explicit specification. They are much less likely to volunteer the kind of operational and architectural directives that distinguish production-grade software from a working prototype. For example:

A component should be capable of self-initialisation — detecting and setting up its own execution environment if it isn’t already in place.
Certain data operations should be idempotent, so that retries and recovery are safe by design.
Specific failure conditions should trigger operator notifications for human intervention, rather than failing silently or retrying indefinitely.
Interfaces should be designed for orthogonality — clean separation of concerns so that changes in one area don’t cascade unpredictably.
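As a rough illustration of two of the directives above (idempotent operations, and operator notification instead of silent failure or indefinite retry), here is a minimal Python sketch. Every name in it (`PaymentLedger`, `apply_once`, `run_with_escalation`, `notify_operator`) is hypothetical and invented for illustration; a real system would persist the processed keys in a database and route notifications to an alerting service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class PaymentLedger:
    """Hypothetical in-memory ledger; a real one would live in a database."""
    balance: int = 0
    # Maps each request key to the balance that resulted from applying it.
    applied: Dict[str, int] = field(default_factory=dict)

    def apply_once(self, request_key: str, amount: int) -> int:
        """Idempotent credit: replaying the same key never double-applies."""
        if request_key in self.applied:
            return self.applied[request_key]
        self.balance += amount
        self.applied[request_key] = self.balance
        return self.balance


def run_with_escalation(
    operation: Callable[[], int],
    notify_operator: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Retry a bounded number of times, then hand the problem to a human."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            operation()
            return True
        except Exception as exc:  # sketch only; catch narrower errors in practice
            last_error = f"attempt {attempt}: {exc}"
    notify_operator(f"giving up after {max_attempts} attempts ({last_error})")
    return False
```

Because `apply_once` is keyed by request, wrapping it in `run_with_escalation` is safe: a retry after a partial failure cannot credit the same request twice, and exhausting the attempts produces exactly one operator message rather than an endless loop.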

These directives can make an enormous difference to the quality, operability, and maintainability of the resulting software. The overhead of generating code is now so low that the dominant cost is thinking clearly about what to ask for — and doing so at the right time, during specification, not after the code is written.

Additive Strength, Refactoring Weakness

LLMs are now excellent at generating new code — building forward from a clear specification. They are notably weaker at refactoring and design changes after the fact. Common failure modes include:

Orphaned code — old implementations left in place after new ones are added, because the LLM doesn’t fully trace all call paths.
Unwired functions — new code is generated but not properly connected into existing execution paths, especially when old entry points appear to still be in place.
False backward compatibility — the LLM leaves vestigial code in the name of “backward compatibility,” even when the codebase is pre-alpha and no such compatibility is required.

The dual-LLM review cycle described in this guide catches many of these issues. But it is far more efficient to get requirements right up front and minimise refactoring in the first place. When refactoring is unavoidable, the human director should be especially vigilant — actively pushing the LLM to remove duplication, verify that new execution paths are fully wired, and strip out dead code rather than preserving it.

UI Development: Vision Is Strong, Dynamics Are Weak

LLMs now have excellent machine vision and can consume mockups, wireframes, and screenshots very effectively — static layout and visual composition are largely solved. Where they remain weak is in the dynamic aspects: transitions, interaction sequences, conditional visibility, drag behaviours, animation timing, and the overall feel of a workflow.

This means the human must carefully articulate workflows and desired user experiences in words, even when the static design is easy to convey visually. Don’t assume the LLM will infer how a multi-step form should flow, how errors should surface, or what should happen when a user navigates away mid-task.

Build Bench Tools — Code Is Cheap Now

Most LLM coding agents are constrained (often for security reasons) from launching executables and driving them interactively via stdin/stdout. This limits their ability to explore and diagnose running software dynamically. A practical workaround is to have the LLM create test harnesses and bench tools: small utilities that watch a command file for instructions, execute them, and write results to a log or output file that the LLM can then read.

Think of these as the software equivalent of workshop jigs — purpose-built fixtures that make a specific task faster and more reliable. The key insight is that the cost of creating code is now so low that building throwaway tooling to accelerate diagnosis and testing is almost always worth it. The human director should cultivate the habit of asking: “Would a small tool make this easier to verify?” — and if so, just have the LLM build one.
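A minimal sketch of such a bench tool, assuming Python and a line-oriented command file. The file layout, the `handlers` mapping, and the function names are all hypothetical; the point is only the shape: poll a file, execute any new lines, and append results to a log the LLM can read.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List

Handler = Callable[[List[str]], str]


def process_new_commands(
    command_file: Path, log_file: Path, handlers: Dict[str, Handler], already_done: int
) -> int:
    """Run commands not yet processed; return the new count of lines seen."""
    if not command_file.exists():
        return already_done
    lines = command_file.read_text().splitlines()
    with log_file.open("a") as log:
        for line in lines[already_done:]:
            parts = line.split()
            if not parts:
                continue  # blank lines are skipped but still counted as seen
            name, args = parts[0], parts[1:]
            handler = handlers.get(name)
            try:
                result = handler(args) if handler else f"unknown command {name!r}"
            except Exception as exc:  # keep the tool alive; report the failure
                result = f"error: {exc}"
            log.write(f"{line} -> {result}\n")
    return len(lines)


def watch(command_file: Path, log_file: Path, handlers: Dict[str, Handler],
          poll_seconds: float = 0.5, max_polls: int = 20) -> None:
    """Poll the command file a bounded number of times (a jig, not a daemon)."""
    done = 0
    for _ in range(max_polls):
        done = process_new_commands(command_file, log_file, handlers, done)
        time.sleep(poll_seconds)
```

With a handler registered under the hypothetical command name `add`, a session might append `add 2 3` to the command file and later read `add 2 3 -> 5` from the log. The sketch assumes whole lines are appended at once; a half-written line would be processed as it stands.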

Part 1 — Technology Stack

Choose the Best LLMs

LLM capability is advancing rapidly; always use the current best-in-class. As of mid-2025:

Role     Recommended           Notes
Coding   Claude Code (Opus)    Primary implementation and self-review
Review   ChatGPT (via Codex)   Independent review with source access

Using two different LLMs for coding and review is a deliberate choice — it avoids blind spots that a single model might consistently miss.

Choose Languages That Constrain Bugs

The stricter the type system, the more classes of bugs are eliminated at compile time — before the code ever runs. This is even more valuable with LLM-generated code, where you want the compiler acting as an automated safety net.

Backend — Haskell (preferred) or Rust:

Extremely strict yet expressive type system — if it compiles, entire categories of bugs are ruled out
Native code generation for high performance
Concise syntax — fewer lines for the LLM to manage and for you to review
Simple memory management via garbage collection in Haskell (contrast with Rust’s ownership model and borrow checker, which can add complexity as codebases grow)
Uniquely powerful concurrency: lightweight threads, Software Transactional Memory (STM) for composable in-process transactions
Rich ecosystem of type-safe database and web frameworks
Excellent testing support, including property-based testing (QuickCheck)
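To make the property-based testing point concrete: instead of checking a few hand-picked cases, a property test states an invariant and checks it against many generated inputs. QuickCheck does this in Haskell with generators and shrinking. The sketch below is only a hand-rolled Python approximation using the standard library, with hypothetical names (`run_length_encode`, `run_length_decode`, `check_property`, `round_trips`) and without shrinking.

```python
import random
from typing import Callable, List, Optional, Tuple


def run_length_encode(text: str) -> List[Tuple[str, int]]:
    """Collapse runs of equal characters: 'aab' -> [('a', 2), ('b', 1)]."""
    runs: List[Tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def run_length_decode(runs: List[Tuple[str, int]]) -> str:
    """Expand (character, count) pairs back into a string."""
    return "".join(ch * count for ch, count in runs)


def check_property(
    prop: Callable[[str], bool], trials: int = 500, seed: int = 0
) -> Optional[str]:
    """Return the first generated input that falsifies `prop`, or None."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        length = rng.randint(0, 12)
        candidate = "".join(rng.choice("ab") for _ in range(length))
        if not prop(candidate):
            return candidate
    return None


def round_trips(text: str) -> bool:
    """Property: decoding an encoding gives back the original text."""
    return run_length_decode(run_length_encode(text)) == text
```

Calling `check_property(round_trips)` returns `None` when no counterexample is found; a non-`None` result is a concrete failing input the LLM can be handed directly.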

The Haskell advantage is particularly sharp in the agentic context. When the LLM makes a type error (misapplying an abstraction, confusing two conceptually similar types, or constructing a state the types rule out), GHC rejects the code and reports an error that usually points at the offending expression. The LLM reads the error, corrects it, and tries again. This tight feedback loop reduces the number of iterations needed to reach correct code. Python gives you a stack trace at runtime, often after the agent has already moved on, and only if the faulty path is actually exercised. Haskell gives you a compiler error immediately.

Frontend — TypeScript:

Applies similar type-safety principles to the browser/client tier
Catches common mistakes at build time rather than at arbitrary runtime moments
Well-supported by all major LLMs

Build & Deploy — Nix:

Fully declarative, reproducible builds
Pinned dependency graphs — no version drift
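Builds that depend only on declared inputs, so results do not vary with build history or host environment (outputs are bit-for-bit identical for most, though not all, packages)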
Configuration-as-code for all dependencies, services, and even OS-level concerns

Part 2 — The Process

Guiding Principles

These principles recur throughout every phase:

Session isolation. Start each major phase in a fresh LLM session. This prevents context drift, avoids the LLM anchoring on earlier mistakes, and ensures the model works from your documents rather than stale conversational memory.
Dual-LLM review cycle. The coding LLM reviews its own work first, then an independent LLM performs a separate review. Iterate between them until the reviewer finds no further issues. This mimics the author/reviewer discipline in human teams.
Test suite as a guiding light. Tests are not an afterthought — they are built alongside each feature, serve as regression protection, and must always pass. Never tolerate failing tests, even “old” ones. The test suite is your ground truth.
Explicit review criteria. When asking an LLM to review, always name the criteria to review against: correctness, completeness, and performance. Vague review requests produce vague results.
Version control discipline. Commit at every phase boundary and before/after every review cycle. This gives you rollback points and a clear audit trail of how the code evolved.

Phase 1 — Requirements & Validation Specification

Goal: Produce a clear, complete proposal document that defines what to build and how to know it’s right.

In a dedicated LLM session: describe the required functionality framed around user value, solicit LLM feedback on edge cases and interface boundaries, define validation criteria, and ask the LLM to produce a formatted proposal document covering purpose, problem definition, solution elements, and validation criteria.

Then run the review cycle: authoring LLM self-reviews, then the alternate LLM reviews independently. Iterate until both agree the document is sound.

Phase 2 — Implementation Plan

Goal: Produce a detailed, phased implementation plan that a coding LLM can follow from a standing start.

In a new session, provide the approved proposal plus implementation directives — preferred libraries, anti-patterns to avoid, style conventions. Ensure the plan includes sensible phases with clear checkpoints, a testing strategy at each phase, and enough context for a cold-start session to begin coding without ambiguity.

Run the same dual-LLM review cycle until the plan is declared correct, consistent, and high-quality.

Phase 3 — Implementation

Goal: Execute the implementation plan phase by phase, with continuous quality assurance.

In a new coding session, provide the implementation plan and instruct the LLM to follow it phase by phase. Monitor actively during generation — watch for anti-patterns, deviations from the plan, unnecessary complexity. At each phase completion: self-review, run the full test suite (all tests must pass), and commit.
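The phase-completion rule (tests must pass before the commit happens) is easy to enforce mechanically. Below is a minimal Python sketch of such a gate. The names `phase_gate` and `run_command` are hypothetical, the commit message format is invented, and the test command is whatever your project uses; the `run` parameter exists only so the gate can be exercised without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def phase_gate(
    phase_name: str,
    test_command: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Commit the phase only if the full test suite exits with status 0."""
    if run(test_command) != 0:
        return False  # failing tests block the commit
    if run(["git", "add", "-A"]) != 0:
        return False
    return run(["git", "commit", "-m", f"Complete {phase_name}"]) == 0
```

For a Haskell project this might be invoked as `phase_gate("phase 2", ["cabal", "test"])`. A `False` result means the phase is not finished, whatever the coding LLM reports.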

After implementation is complete, the coding LLM performs a final self-review and you address any deferred items. Tell the coding LLM that an independent reviewer will examine its work (this tends to improve output quality), then submit the code to the review LLM and cycle until no further issues are found.

Phase 4 — Final Review

Goal: Comprehensive cross-LLM review of the complete, finished implementation.

Run the full dual-LLM review-remediation cycle one final time. The review LLM should also evaluate test suite coverage and quality — are there gaps, are edge cases tested, are tests meaningful rather than ceremonial? Add any missing tests and ensure the full suite passes.

Phase 5 — Documentation

Goal: Produce clear, maintainable documentation from the finished implementation.

Ask the coding LLM to generate: technical documentation (a post-hoc restatement of the implementation plan that accurately describes the actual structure as built) and user documentation (a manual for the personas identified in Phase 1). Update both whenever the software is extended.

The Dual-LLM Review Pattern

The recurring pattern at the heart of this process:

┌─────────────┐         ┌─────────────┐
│ Coding LLM  │◄───────►│ Review LLM  │
│  (creates)  │ iterate │ (evaluates) │
└─────────────┘         └─────────────┘
        │                       │
        ▼                       ▼
   Self-review             Independent
   first, then             review against
   hand off                spec/plan

Each cycle tightens quality. The two-model approach catches blind spots that self-review alone misses — analogous to code review in human teams, but faster and more systematic.
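The control flow of this cycle can be sketched in a few lines of Python. The `self_review`, `revise`, and `independent_review` callables are hypothetical stand-ins for whatever mechanism you use to invoke each model; no real LLM API is assumed, and the `max_rounds` cap is an added safeguard rather than part of the process described above.

```python
from typing import Callable, List, Tuple


def review_cycle(
    draft: str,
    self_review: Callable[[str], str],
    revise: Callable[[str, List[str]], str],
    independent_review: Callable[[str], List[str]],
    max_rounds: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate revision and review until the reviewer reports no issues.

    Returns the final draft plus any issues still open when rounds ran out.
    """
    draft = self_review(draft)          # the author checks its own work first
    issues = independent_review(draft)  # then the second model reviews it
    rounds = 0
    while issues and rounds < max_rounds:
        draft = self_review(revise(draft, issues))
        issues = independent_review(draft)
        rounds += 1
    return draft, issues
```

A non-empty second element in the result means the two models did not converge within the cap, which is a signal for the human director to step in.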

This guide reflects the state of LLM capabilities and tooling as of mid-2025. The specific model recommendations will evolve; the process principles are durable.