I'm building Doctrina, a free Spanish legal research platform. It indexes 17,736 binding tax consultations. This guide explains why generic retrieval didn't work, what I tried, and where things stand now.
Doctrina is my attempt at a free alternative to Aranzadi, Spain's dominant legal database. Aranzadi is owned by Thomson Reuters. Their cheapest tier (Aranzadi One) starts at €117/month, which is €1,404/year. The full Aranzadi Instituciones database runs €1,700 to €2,000/year depending on version and number of users. Most people who need Spanish tax answers can't afford that.
I scraped three public government databases. All of them publish their content as open public records under Spanish transparency law (Ley 19/2013, de transparencia). Scraping publicly accessible government legal data for research and tool-building is legal in the EU; these aren't behind paywalls or authentication.
| Database | Source | Documents | Content |
|---|---|---|---|
| DGT Consultas Vinculantes | petete.tributos.hacienda.gob.es | 17,736 | Binding tax consultations. A taxpayer asks a question; the DGT gives an official, citable answer. |
| TEAC Resoluciones | serviciostelematicosext.hacienda.gob.es | 8,400 | Tax tribunal rulings. Appeals of tax assessments decided by the economic-administrative courts. |
| CENDOJ Jurisprudencia | cendoj.poder-judicial.es | 350,000+ | Court decisions from all levels of the Spanish judiciary. The largest corpus by far. |
The DGT consultas vinculantes are uniquely suited for training a retrieval model. Each one has a clean question-answer pair built in: the cuestión (what the taxpayer asked) and the contestación (what the government answered). That structure is exactly what retrieval training needs: "given this question, find this answer."
TEAC decisions and CENDOJ rulings don't have that same clean pair structure. They're longer, more complex documents without a single question mapped to a single answer. Training on them would require manually creating query-document pairs or building a synthetic dataset, which is the next step.
So the plan is: train on the 17,736 DGT pairs first (because the data is ready), prove that the approach works, then extend to TEAC and CENDOJ.
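Extracting the training pairs is the easy part once the fields are parsed. A minimal sketch, assuming each scraped consulta has already been reduced to a dict with the question and answer fields (the field names here are illustrative, not the actual scraper's schema):

```python
def to_training_pairs(consultas):
    """Turn each DGT consulta vinculante into a (query, positive) pair
    for contrastive retrieval training."""
    pairs = []
    for c in consultas:
        q = c["cuestion"].strip()        # what the taxpayer asked
        a = c["contestacion"].strip()    # what the DGT answered
        if q and a:                      # skip records missing either half
            pairs.append({"query": q, "positive": a})
    return pairs
```

TEAC and CENDOJ documents would never pass through a function this simple, which is exactly why they come later.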
These documents all look the same. Same bureaucratic voice. Same laws referenced. Nearly identical vocabulary. A consultation about foreign asset reporting and one about real estate tax obligations share 80% of their terminology.
Keyword search works when someone types the exact legal phrase. But real people don't search that way. They type "do I need to report my Finnish property" and expect the system to understand what they mean.
That requires semantic search. And semantic search requires an embedding model that actually understands Spanish legal text.
First thing I tried was the standard approach. Take an off-the-shelf embedding model, convert every document into a vector (a list of numbers that represents its meaning), store them, and at search time find the vectors closest to the user's query.
The model I used was nomic-embed-text, running locally through Ollama. 79,000 vectors indexed. This is the setup you'd find in any RAG tutorial.
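The whole pattern fits in a few lines: embed everything, normalize, rank by cosine similarity. A sketch with a deterministic stand-in for the real embedding call (in the actual pipeline, `embed()` would call nomic-embed-text through Ollama instead):

```python
import numpy as np

def embed(text):
    # Stand-in for the real embedding call (e.g. nomic-embed-text
    # via Ollama). Deterministic per text within one process.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(768)

def build_index(docs):
    vecs = np.stack([embed(d) for d in docs])
    # Unit-normalize once so a dot product is cosine similarity.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def search(query, docs, index, k=10):
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                 # cosine similarity against all docs
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]
```

This is the entire architecture of the first attempt. Every design decision that matters is hidden inside `embed()`.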
Recall@10 came out at 13.8%: out of every 10 queries, the correct document appeared in the top 10 results about 1.4 times. The other 8.6 times, the system returned consultations that looked right but weren't. For a legal research tool, that's worse than returning nothing; it creates false confidence.
Generic embedding models are trained on the internet. Wikipedia, news articles, Reddit threads. They understand language broadly, but they don't understand this language.
Spanish tax law has its own dialect. "Tributación de no residentes" and "impuesto sobre la renta de personas físicas" are closely related concepts, but a generic model treats them as vaguely similar instead of recognizing they belong to the same legal framework.
When 17,736 documents all share most of their vocabulary, a generic model clusters them into one amorphous blob. The distances between vectors become too small to mean anything. It's like asking someone who speaks tourist Spanish to distinguish between two nearly identical legal clauses.
If the generic model doesn't understand the domain, teach it. I set up jina-ai/mlx-retrieval on a Mac Studio M2 Ultra and started fine-tuning Google's Gemma-3-270M on the Doctrina corpus.
- Model: Gemma-3-270M, Google's small language model, trained primarily on English.
- Training: LoRA fine-tuning, 150 steps, contrastive loss; 8 experiments run by an autonomous agent.
- Hardware: Mac Studio M2 Ultra with 192GB unified memory, training via MLX at ~7,000 tokens/sec.
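"Contrastive loss" here means in-batch negatives: each query's matched contestación is the positive, and every other answer in the same batch is a negative. A numpy sketch of the InfoNCE objective, to show the math rather than the MLX training code:

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """q: (B, dim) query embeddings; d: (B, dim) matched answer embeddings.
    Row i of q should score highest against row i of d."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                # diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy, label = own index
```

The temperature is one of the knobs the agent varied; lower values sharpen the distinction between the positive and the in-batch negatives.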
The untrained model scored 0.67%. Basically random. After 8 automated experiments trying different loss functions, temperatures, and step counts, it reached 1.0%. Three correct results out of 300 queries.
The model didn't know enough Spanish. Gemma-3-270M was trained primarily on English text. Asking it to learn Spanish legal retrieval through LoRA alone is too much. No amount of hyperparameter tuning compensates for a fundamental language gap.
The agent tried three different loss functions, multiple temperature values, varying step counts. Everything plateaued at 1.0%. When the search space doesn't contain a good solution, optimizing within that space is a waste of time.
The fix wasn't algorithmic. It was choosing the right starting point.
Qwen3-Embedding-0.6B is a 600M-parameter model built specifically for multilingual retrieval. Trained on 100+ languages including Spanish. Not a general-purpose language model forced into retrieval work; it was designed for exactly this.
Qwen3-Embedding-0.6B hit 27.3% recall@10 out of the box. No training. No tuning. That's roughly 27x better than the fine-tuned Gemma model, and 2x better than the off-the-shelf RAG setup. And this is the starting line for fine-tuning, not the finish line.
This part is inspired by Karpathy's autoresearch. The concept: give an AI agent a model, a metric, and a time budget per experiment. Let it modify training settings, run the experiment, check the result, keep it or throw it away, and repeat. You go to sleep. You wake up to a log of experiments and a better model.
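The control flow is simple; the agent's intelligence lives in deciding what to change. A greedy sketch of the loop under the "one change per experiment" rule (the `run_experiment` callback is where the real system has aider edit the training script and launch MLX; the search-space values here are illustrative):

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [16, 32, 64],
    "warmup_ratio": [0.0, 0.1],
}

def autoresearch(run_experiment, n_experiments=8, seed=0):
    """Mutate one setting at a time; keep the config only if
    recall@10 improves, otherwise throw the change away."""
    rng = random.Random(seed)
    best_cfg = {k: v[0] for k, v in SEARCH_SPACE.items()}
    best_score = run_experiment(best_cfg)            # baseline run
    log = [(dict(best_cfg), best_score)]
    for _ in range(n_experiments):
        cfg = dict(best_cfg)
        key = rng.choice(list(SEARCH_SPACE))         # one change per experiment
        cfg[key] = rng.choice(SEARCH_SPACE[key])
        score = run_experiment(cfg)                  # train + eval recall@10
        log.append((dict(cfg), score))
        if score > best_score:
            best_cfg, best_score = cfg, score        # keep it
    return best_cfg, best_score, log
```

The log is the part you read in the morning: every config tried, every score, and the winner.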
| Layer | Tool | Notes |
|---|---|---|
| Agent brain | qwen3:32b | Runs locally via Ollama. Zero API cost. |
| Agent harness | aider | Autonomous mode with auto-commits. |
| Training framework | sentence-transformers | Fine-tunes Qwen3-Embedding-0.6B. |
| Eval metric | recall@10 | 300 queries against 17,736 documents. |
| Hardware | Mac Studio M2 Ultra | 192GB unified memory. Runs the agent and training simultaneously. |
| Cost per night | $0 | Electricity. |
The agent can modify learning rate, batch size, warmup ratio, sequence length, weight decay, and loss function. One change per experiment. Always measured against the same benchmark of 300 held-out queries.
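The benchmark itself is just recall@10 over cosine similarity. A self-contained sketch of the metric, assuming query and document embeddings are already computed and each query has exactly one correct document:

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, gold, k=10):
    """gold[i] is the index of the one correct document for query i.
    Returns the fraction of queries whose document lands in the top k."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                                 # (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = sum(gold[i] in topk[i] for i in range(len(gold)))
    return hits / len(gold)
```

In Doctrina's case that's 300 held-out queries scored against all 17,736 document vectors, so every experiment is judged on exactly the same terms.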
| Approach | Model | Params | Recall@10 | Cost | Time |
|---|---|---|---|---|---|
| Off-the-shelf RAG | nomic-embed-text | 137M | 13.8% | $0 | 1 day |
| Fine-tuned English model | Gemma-3 270M | 270M | 1.0% | $0 | 3 hours |
| Multilingual retrieval (zero-shot) | Qwen3-Embedding-0.6B | 600M | 27.3% | $0 | 25 min |
| + Autoresearch fine-tuning | Qwen3-Embedding-0.6B | 600M | 🔄 running | $0 | overnight |
I spent hours tuning hyperparameters on Gemma-3 and got nowhere. Switching to a model that actually speaks Spanish gave me 27x the performance with zero tuning. If your retrieval model doesn't understand the language and domain of your documents, no optimization will fix that.
RAG works well when your documents are different from each other. A recipe, a weather report, and a legal brief are easy to tell apart. But 17,736 Spanish tax consultations written by the same bureaucratic voice, referencing the same laws? Generic embeddings collapse them into one cluster. The distances between vectors lose meaning.
One night of automated experiments replaces days of manually adjusting settings and rerunning. The key insight from Karpathy's framing: you're not writing training code; you're writing the program that tells an agent how to write training code.
The entire pipeline runs simultaneously on a single Mac Studio. Agent inference (32B model), training (0.6B model), and evaluation (17,736-document benchmark). 192GB unified memory means the GPU and CPU share the same pool. Total cost for overnight experimentation: the electricity bill.
Each failure taught something specific. But the biggest leap came from model selection, not algorithmic cleverness. If I had started with Qwen3-Embedding, I would have saved two attempts and gotten to fine-tuning a day earlier. The lesson is simple: before optimizing anything, make sure you're optimizing the right thing.