The task sounds simple. A customer photographs a broken remote control. The system identifies the exact model and links them to the correct replacement product in a WooCommerce catalog of 600+ items.
In practice, it is not simple at all. Customer photos arrive in every condition: awkward angles, poor lighting, partial crops, varying distances, cluttered backgrounds. Remote controls within the same product family often differ only in subtle ways — button layout, label positioning, minor casing proportions. Recognising the right family but selecting the wrong variant still means sending the customer to the wrong product.
The question became: how do you build something reliable enough for production, without training a custom model?
The Architecture
The solution is a local-first Python pipeline with three stages:
- Image preprocessing — crop and normalise the input photo
- Fingerprint extraction — derive visual signatures for matching
- Local ranking — multi-signal scoring against the catalog
An optional vision API fallback exists for edge cases, but the core recognition path runs entirely locally. No cloud dependency, no latency spikes, no per-call costs.
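For orientation, the core path can be sketched as three small functions. This is a minimal illustration rather than the project's actual interfaces; the function names and signatures below are hypothetical.

```python
# Minimal sketch of the three-stage local pipeline (hypothetical names).
from pathlib import Path

def preprocess(photo_path: Path):
    """Stage 1: locate the remote in the photo, crop it, normalise size and contrast."""
    ...

def extract_fingerprint(image):
    """Stage 2: derive visual signatures (an embedding plus shape/layout features)."""
    ...

def rank_against_catalog(fingerprint, catalog) -> list[str]:
    """Stage 3: multi-signal scoring against every catalog entry, best match first."""
    ...

def recognise(photo_path: Path, catalog) -> list[str]:
    image = preprocess(photo_path)
    fingerprint = extract_fingerprint(image)
    return rank_against_catalog(fingerprint, catalog)   # ranked product IDs
```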
That local-first choice was deliberate. Reliability in production often matters more than peak performance in a demo. A system you can observe, debug, and improve is more useful than one you cannot. This is a principle that applies broadly — as discussed in why historical data determines whether AI delivers real value, the gap between a working demo and a reliable production system is almost always about the quality of the inputs, not the power of the model.
The Real Insight: A Test-Driven Optimisation Loop
The architecture is conventional. What was unconventional was how the system was improved.
Rather than training a neural network, the entire pipeline was treated as the object being optimised — and Claude Code was used as the optimisation engine, operating inside a disciplined evaluate-hypothesise-modify-retest loop.
Step 1 — Build the test harness before touching the pipeline
Before any optimisation, a proper evaluation framework was established:
- Batch test suite: 27 real customer-style photos, taken by 4 different people, covering 14 products across both white and non-white backgrounds
- Regression suite: 6 original cases that must never break — any change improving new tests while breaking these was treated as a failure, not a trade-off
Once this existed, every subsequent decision became falsifiable. That changed the nature of the work entirely.
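What such a harness can look like is sketched below, assuming expected product IDs live in a JSON file per suite and reusing the hypothetical recognise() entry point from the earlier sketch; the module and file layout are illustrative, not the project's real structure.

```python
# Hypothetical evaluation harness: top-1 accuracy per suite, hard stop on regressions.
import json
from pathlib import Path

from pipeline import recognise, load_catalog   # hypothetical module, as sketched above

def run_suite(suite_dir: Path, catalog) -> tuple[int, int]:
    expected = json.loads((suite_dir / "expected.json").read_text())   # {photo: product_id}
    hits = 0
    for photo, product_id in expected.items():
        ranked = recognise(suite_dir / photo, catalog)
        if ranked and ranked[0] == product_id:   # only a top-1 match counts as correct
            hits += 1
    return hits, len(expected)

if __name__ == "__main__":
    catalog = load_catalog()
    batch = run_suite(Path("tests/batch"), catalog)
    regression = run_suite(Path("tests/regression"), catalog)
    print(f"batch: {batch[0]}/{batch[1]}   regression: {regression[0]}/{regression[1]}")
    # Any change that breaks a regression case is treated as a failure, not a trade-off.
    assert regression[0] == regression[1], "regression suite must stay at 6/6"
```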
Step 2 — Track progress through git checkpoints
Each meaningful improvement was tagged, producing a clear and auditable trajectory:
| Checkpoint | Accuracy | Key Change |
|---|---|---|
| checkpoint-pipeline-41pct | 41% (11/27) | CLIP ViT-B/32 baseline |
| checkpoint-pipeline-52pct | 52% (14/27) | Two-stage ranking + 4-way image comparison |
| checkpoint-pipeline-78pct | 78% (21/27) | Switch from CLIP to DINOv2 |
| checkpoint-pipeline-85pct | 85% (23/27) | Upgrade to DINOv3 |
These are not just performance snapshots. Each checkpoint represents a specific architectural decision, which means the improvement history is also a reasoning history.
Step 3 — The feedback loop in practice
Each iteration inside a single Claude Code session followed the same cycle:
- Run both test suites
- Analyse failure patterns — not individual photos, but families of failures
- Form a hypothesis about the likely cause
- Implement a targeted change
- Retest immediately
- If a regression appears: revert or adjust; never advance with a broken baseline
This disciplined iteration requires explicit structure — the same principle that underlies the PRD.json pattern for preventing AI agent drift. Just as a JSON specification prevents an agent from hallucinating unplanned features, a test suite prevents a pipeline optimiser from straying from measurable objectives.
A concrete example. At the 41% stage, the 16 failures were analysed and a clear pattern emerged: CLIP embeddings were being compared only crop-to-crop. When the preprocessor produced a poor crop on a difficult background, the embedding quality degraded — and that degradation propagated through the entire scoring stage.
The fix was not to abandon embeddings. It was to make the comparison more resilient: evaluate four combinations (user-cropped × catalog-cropped, user-cropped × catalog-full, user-full × catalog-cropped, user-full × catalog-full) and take the maximum similarity score.
That single change raised accuracy from 41% to 52%.
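In code, the resilient comparison amounts to scoring all four pairings and keeping the best, rather than trusting a single crop. A minimal sketch, assuming embeddings are L2-normalised vectors and embed() is a placeholder for the pipeline's embedding function:

```python
# 4-way comparison: score every crop/full pairing and keep the strongest match,
# so one bad crop cannot sink an otherwise correct candidate.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))   # inputs are assumed to be unit-length vectors

def best_similarity(user_crop, user_full, cat_crop, cat_full, embed) -> float:
    pairs = [
        (user_crop, cat_crop),
        (user_crop, cat_full),
        (user_full, cat_crop),
        (user_full, cat_full),
    ]
    return max(cosine(embed(u), embed(c)) for u, c in pairs)
```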
Step 4 — Failed experiments are evidence, not waste
Not every hypothesis improved the system. Three instructive failures:
- CLIP-only at 95% weight — scored 48% on new tests but broke 4 of 6 regression cases. Conclusion: geometric signals still carry essential information for distinguishing variants within the same product family. Embedding dominance alone is not enough.
- CLIP at 70% weight — improved regression performance to 83% but dropped new-test accuracy to 44%. The balance point was still wrong.
- Dual-model ensemble (ViT-B/32 + RN50) — performed worse than the single-model configuration (41% vs 52%). Concatenating embeddings from architecturally different models introduced noise rather than complementarity.
Each failed experiment removed a false assumption and narrowed the search space for better decisions. That is often how progress actually happens in applied AI work.
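For context, the weights in those experiments refer to how the embedding score is blended with the geometric signals. A simplified sketch of that blend, with illustrative signal names and weights:

```python
# Illustrative blend: the embedding weight is the tunable these experiments probed.
def combined_score(embedding_sim: float, shape_sim: float, layout_sim: float,
                   embedding_weight: float = 0.5) -> float:
    geometric = 0.5 * shape_sim + 0.5 * layout_sim      # shape and layout signals
    return embedding_weight * embedding_sim + (1.0 - embedding_weight) * geometric
```

At a weight of 0.95 the geometric signals barely register, which is exactly what the regression suite caught.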
Step 5 — Architecture emerged from evidence
The two-stage ranking design was not planned upfront. It emerged from observing a repeated pattern across many test cycles:
- CLIP is strong at family-level matching (“this is a remote from product family X”) but struggles to distinguish variants within a family
- Shape and layout signals can distinguish variants, but degrade when crop quality is inconsistent
The solution: let a strong embedding signal narrow the field to a shortlist, then let fine-grained multi-signal scoring discriminate within that shortlist. One stage for breadth, one stage for precision.
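A sketch of that two-stage shape, reusing the cosine() helper from the earlier sketch; the fine_grained_score parameter stands in for the multi-signal scorer and the attribute names are illustrative:

```python
# Stage 1 for breadth, stage 2 for precision.
def rank(user, catalog, fine_grained_score, shortlist_size: int = 10) -> list:
    # Stage 1: embedding similarity narrows 600+ catalog items to a shortlist.
    shortlist = sorted(
        catalog,
        key=lambda item: cosine(user.embedding, item.embedding),
        reverse=True,
    )[:shortlist_size]

    # Stage 2: fine-grained multi-signal scoring discriminates within the shortlist.
    return sorted(shortlist, key=lambda item: fine_grained_score(user, item), reverse=True)
```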
Good pipeline architecture frequently emerges from accumulated failure analysis, not from whiteboard design sessions.
Step 6 — Model upgrades as controlled experiments
Once the architecture was stable, embedding model upgrades could be tested cleanly as isolated experiments:
- CLIP ViT-B/32 → DINOv2-small: +26 percentage points (52% → 78%)
- DINOv2-small → DINOv3-small: +7 percentage points (78% → 85%)
One detail worth noting: DINOv3 initially appeared to perform worse than DINOv2 (74% vs 78%). Failure analysis revealed that 5 of those apparent failures were catalog naming mismatches — the model’s output was correct, but the test expectations were wrong. Raw embedding accuracy was 93%; the full pipeline, including preprocessing and multi-signal ranking, reached 85%.
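Keeping the embedding backbone behind one small interface is what makes such upgrades cheap to test. A sketch using the Hugging Face transformers API; the DINOv2 checkpoint name is a real public one, while CLIP and DINOv3 would each need their own checkpoint and, in CLIP's case, its vision-tower class.

```python
# One small wrapper so a backbone upgrade is a config change, arbitrated by the test suite.
import torch
from transformers import AutoImageProcessor, AutoModel

class Embedder:
    def __init__(self, model_name: str):
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def __call__(self, image) -> torch.Tensor:
        inputs = self.processor(images=image, return_tensors="pt")
        cls = self.model(**inputs).last_hidden_state[:, 0]            # CLS token
        return torch.nn.functional.normalize(cls, dim=-1).squeeze(0)  # unit-length embedding

# DINOv2-small via its public checkpoint; CLIP would go through CLIPVisionModel,
# and a DINOv3 checkpoint would slot in the same way.
embedder = Embedder("facebook/dinov2-small")
```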
What “Self-Improving” Actually Means Here
The system does not learn in the traditional machine learning sense. There is no gradient descent, no training epochs, no weight updates. What happens instead is closer to iterative decision logic optimisation:
- The test suite is the loss function. 27 photos plus 6 regression cases define what correct means.
- Claude Code is the optimiser. It analyses failure patterns, forms hypotheses, implements changes, and validates results — all within a single session.
- Git checkpoints are the training history. Each tagged state is a verifiable, reproducible improvement.
- The loop runs in minutes, not days. A full cycle — test, analyse, fix, retest — takes roughly 5 to 10 minutes.
The key point: the system improves by restructuring its decision logic, not by adjusting weights in a neural network. The scoring weights, pipeline architecture, and model selection are all optimised by an AI agent running test-driven experiments in a tight feedback loop.
Current State
- 85% accuracy on 27 diverse real-world photos
- 103 unit tests passing, 6/6 regression cases stable
- ~7–10 seconds per recognition, running locally without API calls
- 4 remaining failures: all rank-2 or rank-3 cases — correct product family, wrong close variant
- Integrated with WooCommerce via a WordPress plugin for end-to-end product lookup
The remaining problem is no longer broad visual recognition. It is fine-grained disambiguation inside visually dense product families. That is a much better problem to have.
The Takeaway
You do not always need to train a model.
In many applied settings, the more valuable step is to build a system that can be measured, challenged, and improved under controlled conditions. A well-designed test suite can function as a practical loss function. A disciplined iteration loop can produce rapid, compounding gains. An AI coding agent becomes genuinely powerful when it operates inside that structure — not as an oracle, but as a fast operator inside an evidence-based process. This is also a concrete example of software behaving as a living structure rather than a passive archive — the system does not simply store and retrieve, it actively refines its own decision-making through structured feedback.
The system improved not because a neural network was retrained, but because the decision logic around the model was repeatedly reworked in response to evidence. The same iterative, evidence-based approach applies to AI-assisted internal bug reporting, where the agent’s triage logic improves the quality of operational inputs through structured conversation rather than blind automation. This test-driven discipline extends beyond pipeline development — as explored in why AI-written code still needs serious testing after the first commit, the same structured verification approach is what separates reliable production systems from plausible-looking ones. The same principle applies to data-driven AI systems more broadly — what separates useful AI from impressive AI is the quality of the structured inputs, whether those inputs are historical business data or a carefully designed test harness.
For real business problems where reliability matters, that is often the more important capability.