The task sounds simple. A customer photographs a broken remote control. The system identifies the exact model and links them to the correct replacement product in a WooCommerce catalog of 600+ items.
In practice, it is not simple at all. Customer photos arrive in every condition: awkward angles, poor lighting, partial crops, varying distances, cluttered backgrounds. Remote controls within the same product family often differ only in subtle ways — button layout, label positioning, minor casing proportions. Recognising the right family but selecting the wrong variant still means sending the customer to the wrong product.
The question became: how do you build something reliable enough for production, without training a custom model?
The Architecture
The solution is a local-first Python pipeline with three stages:
- Image preprocessing — crop and normalise the input photo
- Fingerprint extraction — derive visual signatures for matching
- Local ranking — multi-signal scoring against the catalog
An optional vision API fallback exists for edge cases, but the core recognition path runs entirely locally. No cloud dependency, no latency spikes, no per-call costs.
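For orientation, the core path can be sketched as three small functions. This is a minimal illustration rather than the project's actual interfaces; the function names and signatures below are hypothetical.

```python
# Minimal sketch of the three-stage local pipeline (hypothetical names).
from pathlib import Path

def preprocess(photo_path: Path):
    """Stage 1: locate the remote in the photo, crop it, normalise size and contrast."""
    ...

def extract_fingerprint(image):
    """Stage 2: derive visual signatures (an embedding plus shape/layout features)."""
    ...

def rank_against_catalog(fingerprint, catalog) -> list[str]:
    """Stage 3: multi-signal scoring against every catalog entry, best match first."""
    ...

def recognise(photo_path: Path, catalog) -> list[str]:
    image = preprocess(photo_path)
    fingerprint = extract_fingerprint(image)
    return rank_against_catalog(fingerprint, catalog)   # ranked product IDs
```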
That local-first choice was deliberate. Reliability in production often matters more than peak performance in a demo. A system you can observe, debug, and improve is more useful than one you cannot. This is a principle that applies broadly — as discussed in why historical data determines whether AI delivers real value, the gap between a working demo and a reliable production system is almost always about the quality of the inputs, not the power of the model.
The Real Insight: A Test-Driven Optimisation Loop
The architecture is conventional. What was unconventional was how the system was improved.
Rather than training a neural network, the entire pipeline was treated as the object being optimised — and Claude Code was used as the optimisation engine, operating inside a disciplined evaluate-hypothesise-modify-retest loop.
Step 1 — Build the test harness before touching the pipeline
Before any optimisation, a proper evaluation framework was established:
- Batch test suite: 27 real customer-style photos, taken by 4 different people, covering 14 products across both white and non-white backgrounds
- Regression suite: 6 original cases that must never break — any change improving new tests while breaking these was treated as a failure, not a trade-off
Once this existed, every subsequent decision became falsifiable. That changed the nature of the work entirely.
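What such a harness can look like is sketched below, assuming expected product IDs live in a JSON file per suite and reusing the hypothetical recognise() entry point from the earlier sketch; the module and file layout are illustrative, not the project's real structure.

```python
# Hypothetical evaluation harness: top-1 accuracy per suite, hard stop on regressions.
import json
from pathlib import Path

from pipeline import recognise, load_catalog   # hypothetical module, as sketched above

def run_suite(suite_dir: Path, catalog) -> tuple[int, int]:
    expected = json.loads((suite_dir / "expected.json").read_text())   # {photo: product_id}
    hits = 0
    for photo, product_id in expected.items():
        ranked = recognise(suite_dir / photo, catalog)
        if ranked and ranked[0] == product_id:   # only a top-1 match counts as correct
            hits += 1
    return hits, len(expected)

if __name__ == "__main__":
    catalog = load_catalog()
    batch = run_suite(Path("tests/batch"), catalog)
    regression = run_suite(Path("tests/regression"), catalog)
    print(f"batch: {batch[0]}/{batch[1]}   regression: {regression[0]}/{regression[1]}")
    # Any change that breaks a regression case is treated as a failure, not a trade-off.
    assert regression[0] == regression[1], "regression suite must stay at 6/6"
```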
Step 2 — Track progress through git checkpoints
Each meaningful improvement was tagged, producing a clear and auditable trajectory:
| Checkpoint | Accuracy | Key Change |
|---|---|---|
| checkpoint-pipeline-41pct | 41% (11/27) | CLIP ViT-B/32 baseline |
| checkpoint-pipeline-52pct | 52% (14/27) | Two-stage ranking + 4-way image comparison |
| checkpoint-pipeline-78pct | 78% (21/27) | Switch from CLIP to DINOv2 |
| checkpoint-pipeline-85pct | 85% (23/27) | Upgrade to DINOv3 |
These are not just performance snapshots. Each checkpoint represents a specific architectural decision, which means the improvement history is also a reasoning history.
Step 3 — The feedback loop in practice
Each iteration inside a single Claude Code session followed the same cycle:
- Run both test suites
- Analyse failure patterns — not individual photos, but families of failures
- Form a hypothesis about the likely cause
- Implement a targeted change
- Retest immediately
- If a regression appears: revert or adjust; never advance with a broken baseline
This disciplined iteration requires explicit structure — the same principle that underlies the PRD.json pattern for preventing AI agent drift. Just as a JSON specification prevents an agent from hallucinating unplanned features, a test suite prevents a pipeline optimiser from straying from measurable objectives.
A concrete example. At the 41% stage, the 16 failures were analysed and a clear pattern emerged: CLIP embeddings were being compared only crop-to-crop. When the preprocessor produced a poor crop on a difficult background, the embedding quality degraded — and that degradation propagated through the entire scoring stage.
The fix was not to abandon embeddings. It was to make the comparison more resilient: evaluate four combinations (user-cropped × catalog-cropped, user-cropped × catalog-full, user-full × catalog-cropped, user-full × catalog-full) and take the maximum similarity score.
That single change raised accuracy from 41% to 52%.
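In code, the resilient comparison amounts to scoring all four pairings and keeping the best, rather than trusting a single crop. A minimal sketch, assuming embeddings are L2-normalised vectors and embed() is a placeholder for the pipeline's embedding function:

```python
# 4-way comparison: score every crop/full pairing and keep the strongest match,
# so one bad crop cannot sink an otherwise correct candidate.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))   # inputs are assumed to be unit-length vectors

def best_similarity(user_crop, user_full, cat_crop, cat_full, embed) -> float:
    pairs = [
        (user_crop, cat_crop),
        (user_crop, cat_full),
        (user_full, cat_crop),
        (user_full, cat_full),
    ]
    return max(cosine(embed(u), embed(c)) for u, c in pairs)
```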
Step 4 — Failed experiments are evidence, not waste
Not every hypothesis improved the system. Three instructive failures:
- CLIP-only at 95% weight — scored 48% on new tests but broke 4 of 6 regression cases. Conclusion: geometric signals still carry essential information for distinguishing variants within the same product family. Embedding dominance alone is not enough.
- CLIP at 70% weight — improved regression performance to 83% but dropped new-test accuracy to 44%. The balance point was still wrong.
- Dual-model ensemble (ViT-B/32 + RN50) — performed worse than the single-model configuration (41% vs 52%). Concatenating embeddings from architecturally different models introduced noise rather than complementarity.
Each failed experiment removed a false assumption and narrowed the search space for better decisions. That is often how progress actually happens in applied AI work.
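For context, the weights in those experiments refer to how the embedding score is blended with the geometric signals. A simplified sketch of that blend, with illustrative signal names and weights:

```python
# Illustrative blend: the embedding weight is the tunable these experiments probed.
def combined_score(embedding_sim: float, shape_sim: float, layout_sim: float,
                   embedding_weight: float = 0.5) -> float:
    geometric = 0.5 * shape_sim + 0.5 * layout_sim      # shape and layout signals
    return embedding_weight * embedding_sim + (1.0 - embedding_weight) * geometric
```

At a weight of 0.95 the geometric signals barely register, which is exactly what the regression suite caught.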
Step 5 — Architecture emerged from evidence
The two-stage ranking design was not planned upfront. It emerged from observing a repeated pattern across many test cycles:
- CLIP is strong at family-level matching (“this is a remote from product family X”) but struggles to distinguish variants within a family
- Shape and layout signals can distinguish variants, but degrade when crop quality is inconsistent
The solution: let a strong embedding signal narrow the field to a shortlist, then let fine-grained multi-signal scoring discriminate within that shortlist. One stage for breadth, one stage for precision.
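A sketch of that two-stage shape, reusing the cosine() helper from the earlier sketch; the fine_grained_score parameter stands in for the multi-signal scorer and the attribute names are illustrative:

```python
# Stage 1 for breadth, stage 2 for precision.
def rank(user, catalog, fine_grained_score, shortlist_size: int = 10) -> list:
    # Stage 1: embedding similarity narrows 600+ catalog items to a shortlist.
    shortlist = sorted(
        catalog,
        key=lambda item: cosine(user.embedding, item.embedding),
        reverse=True,
    )[:shortlist_size]

    # Stage 2: fine-grained multi-signal scoring discriminates within the shortlist.
    return sorted(shortlist, key=lambda item: fine_grained_score(user, item), reverse=True)
```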
Good pipeline architecture frequently emerges from accumulated failure analysis, not from whiteboard design sessions.
Step 6 — Model upgrades as controlled experiments
Once the architecture was stable, embedding model upgrades could be tested cleanly as isolated experiments:
- CLIP ViT-B/32 → DINOv2-small: +26 percentage points (52% → 78%)
- DINOv2-small → DINOv3-small: +7 percentage points (78% → 85%)
One detail worth noting: DINOv3 initially appeared to perform worse than DINOv2 (74% vs 78%). Failure analysis revealed that 5 of those apparent failures were catalog naming mismatches — the model’s output was correct, but the test expectations were wrong. Raw embedding accuracy was 93%; the full pipeline, including preprocessing and multi-signal ranking, reached 85%.
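Keeping the embedding backbone behind one small interface is what makes such upgrades cheap to test. A sketch using the Hugging Face transformers API; the DINOv2 checkpoint name is a real public one, while CLIP and DINOv3 would each need their own checkpoint and, in CLIP's case, its vision-tower class.

```python
# One small wrapper so a backbone upgrade is a config change, arbitrated by the test suite.
import torch
from transformers import AutoImageProcessor, AutoModel

class Embedder:
    def __init__(self, model_name: str):
        self.processor = AutoImageProcessor.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def __call__(self, image) -> torch.Tensor:
        inputs = self.processor(images=image, return_tensors="pt")
        cls = self.model(**inputs).last_hidden_state[:, 0]            # CLS token
        return torch.nn.functional.normalize(cls, dim=-1).squeeze(0)  # unit-length embedding

# DINOv2-small via its public checkpoint; CLIP would go through CLIPVisionModel,
# and a DINOv3 checkpoint would slot in the same way.
embedder = Embedder("facebook/dinov2-small")
```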
What “Self-Improving” Actually Means Here
The system does not learn in the traditional machine learning sense. There is no gradient descent, no training epochs, no weight updates. What happens instead is closer to iterative decision logic optimisation:
- The test suite is the loss function. 27 photos plus 6 regression cases define what correct means.
- Claude Code is the optimiser. It analyses failure patterns, forms hypotheses, implements changes, and validates results — all within a single session.
- Git checkpoints are the training history. Each tagged state is a verifiable, reproducible improvement.
- The loop runs in minutes, not days. A full cycle — test, analyse, fix, retest — takes roughly 5 to 10 minutes.
The key point: the system improves by restructuring its decision logic, not by adjusting weights in a neural network. The scoring weights, pipeline architecture, and model selection are all optimised by an AI agent running test-driven experiments in a tight feedback loop.
Current State
- 85% accuracy on 27 diverse real-world photos
- 103 unit tests passing, 6/6 regression cases stable
- ~7–10 seconds per recognition, running locally without API calls
- 4 remaining failures: all rank-2 or rank-3 cases — correct product family, wrong close variant
- Integrated with WooCommerce via a WordPress plugin for end-to-end product lookup
The remaining problem is no longer broad visual recognition. It is fine-grained disambiguation inside visually dense product families. That is a much better problem to have.
The Takeaway
You do not always need to train a model.
In many applied settings, the more valuable step is to build a system that can be measured, challenged, and improved under controlled conditions. A well-designed test suite can function as a practical loss function. A disciplined iteration loop can produce rapid, compounding gains. An AI coding agent becomes genuinely powerful when it operates inside that structure — not as an oracle, but as a fast operator inside an evidence-based process. This is also a concrete example of software behaving as a living structure rather than a passive archive — the system does not simply store and retrieve, it actively refines its own decision-making through structured feedback.
The system improved not because a neural network was retrained, but because the decision logic around the model was repeatedly reworked in response to evidence. The same iterative, evidence-based approach applies to AI-assisted internal bug reporting, where the agent’s triage logic improves the quality of operational inputs through structured conversation rather than blind automation. This test-driven discipline extends beyond pipeline development — as explored in why AI-written code still needs serious testing after the first commit, the same structured verification approach is what separates reliable production systems from plausible-looking ones. The same principle applies to data-driven AI systems more broadly — what separates useful AI from impressive AI is the quality of the structured inputs, whether those inputs are historical business data or a carefully designed test harness.
For real business problems where reliability matters, that is often the more important capability.