I'm building Doctrina, a free Spanish legal research platform. It indexes 17,736 binding tax consultations. This guide explains why generic retrieval didn't work, what I tried, and where things stand now.
Doctrina is my attempt at a free alternative to Aranzadi, Spain's dominant legal database. Aranzadi is owned by Thomson Reuters. Their cheapest tier (Aranzadi One) starts at €117/month, which is €1,404/year. The full Aranzadi Instituciones database runs €1,700 to €2,000/year depending on version and number of users. Most people who need Spanish tax answers can't afford that.
I scraped three public government databases. All of them publish their content as open public records under Spanish transparency law (Ley 19/2013, de transparencia). Scraping publicly accessible government legal data for research and tool-building is legal in the EU; these aren't behind paywalls or authentication.
| Database | Source | Documents | Content |
|---|---|---|---|
| DGT Consultas Vinculantes | petete.tributos.hacienda.gob.es | 17,736 | Binding tax consultations. A taxpayer asks a question; the DGT gives an official, citable answer. |
| TEAC Resoluciones | serviciostelematicosext.hacienda.gob.es | 8,400 | Tax tribunal rulings. Appeals of tax assessments decided by the economic-administrative courts. |
| CENDOJ Jurisprudencia | cendoj.poder-judicial.es | 350,000+ | Court decisions from all levels of the Spanish judiciary. The largest corpus by far. |
The DGT consultas vinculantes are uniquely suited for training a retrieval model. Each one has a clean question-answer pair built in: the cuestión (what the taxpayer asked) and the contestación (what the government answered). That structure is exactly what retrieval training needs: "given this question, find this answer."
TEAC decisions and CENDOJ rulings don't have that same clean pair structure. They're longer, more complex documents without a single question mapped to a single answer. Training on them would require manually creating query-document pairs or building a synthetic dataset, which is the next step.
So the plan is: train on the 17,736 DGT pairs first (because the data is ready), prove that the approach works, then extend to TEAC and CENDOJ.
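Extracting the training pairs is the easy part once the fields are parsed. A minimal sketch, assuming each scraped consulta has already been reduced to a dict with the question and answer fields (the field names here are illustrative, not the actual scraper's schema):

```python
def to_training_pairs(consultas):
    """Turn each DGT consulta vinculante into a (query, positive) pair
    for contrastive retrieval training."""
    pairs = []
    for c in consultas:
        q = c["cuestion"].strip()        # what the taxpayer asked
        a = c["contestacion"].strip()    # what the DGT answered
        if q and a:                      # skip records missing either half
            pairs.append({"query": q, "positive": a})
    return pairs
```

TEAC and CENDOJ documents would never pass through a function this simple, which is exactly why they come later.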
These documents all look the same. Same bureaucratic voice. Same laws referenced. Nearly identical vocabulary. A consultation about foreign asset reporting and one about real estate tax obligations share 80% of their terminology.
Keyword search works when someone types the exact legal phrase. But real people don't search that way. They type "do I need to report my Finnish property" and expect the system to understand what they mean.
That requires semantic search. And semantic search requires an embedding model that actually understands Spanish legal text.
First thing I tried was the standard approach. Take an off-the-shelf embedding model, convert every document into a vector (a list of numbers that represents its meaning), store them, and at search time find the vectors closest to the user's query.
The model I used was nomic-embed-text, running locally through Ollama. 79,000 vectors indexed. This is the setup you'd find in any RAG tutorial.
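The whole pattern fits in a few lines: embed everything, normalize, rank by cosine similarity. A sketch with a deterministic stand-in for the real embedding call (in the actual pipeline, `embed()` would call nomic-embed-text through Ollama instead):

```python
import numpy as np

def embed(text):
    # Stand-in for the real embedding call (e.g. nomic-embed-text
    # via Ollama). Deterministic per text within one process.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(768)

def build_index(docs):
    vecs = np.stack([embed(d) for d in docs])
    # Unit-normalize once so a dot product is cosine similarity.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def search(query, docs, index, k=10):
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                 # cosine similarity against all docs
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]
```

This is the entire architecture of the first attempt. Every design decision that matters is hidden inside `embed()`.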
Recall@10 came out at 13.8%: out of every 10 queries, the correct document appeared in the top 10 results about 1.4 times. The other 8.6 times, the system returned consultations that looked right but weren't. For a legal research tool, that's worse than returning nothing; it creates false confidence.
Generic embedding models are trained on the internet. Wikipedia, news articles, Reddit threads. They understand language broadly, but they don't understand this language.
Spanish tax law has its own dialect. "Tributación de no residentes" and "impuesto sobre la renta de personas físicas" are closely related concepts, but a generic model treats them as vaguely similar instead of recognizing they belong to the same legal framework.
When 17,736 documents all share most of their vocabulary, a generic model clusters them into one amorphous blob. The distances between vectors become too small to mean anything. It's like asking someone who speaks tourist Spanish to distinguish between two nearly identical legal clauses.
If the generic model doesn't understand the domain, teach it. I set up jina-ai/mlx-retrieval on a Mac Studio M2 Ultra and started fine-tuning Google's Gemma-3-270M on the Doctrina corpus.
- Model: Gemma-3-270M, Google's small language model, trained primarily on English.
- Training: LoRA fine-tuning, 150 steps, contrastive loss; 8 experiments run by an autonomous agent.
- Hardware: Mac Studio M2 Ultra with 192GB unified memory, training via MLX at ~7,000 tokens/sec.
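"Contrastive loss" here means in-batch negatives: each query's matched contestación is the positive, and every other answer in the same batch is a negative. A numpy sketch of the InfoNCE objective, to show the math rather than the MLX training code:

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """q: (B, dim) query embeddings; d: (B, dim) matched answer embeddings.
    Row i of q should score highest against row i of d."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                # diagonal = positives
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy, label = own index
```

The temperature is one of the knobs the agent varied; lower values sharpen the distinction between the positive and the in-batch negatives.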
The untrained model scored 0.67%. Basically random. After 8 automated experiments trying different loss functions, temperatures, and step counts, it reached 1.0%. Three correct results out of 300 queries.
The model didn't know enough Spanish. Gemma-3-270M was trained primarily on English text. Asking it to learn Spanish legal retrieval through LoRA alone is too much. No amount of hyperparameter tuning compensates for a fundamental language gap.
The agent tried three different loss functions, multiple temperature values, varying step counts. Everything plateaued at 1.0%. When the search space doesn't contain a good solution, optimizing within that space is a waste of time.
The fix wasn't algorithmic. It was choosing the right starting point.
Qwen3-Embedding-0.6B is a 600M-parameter model built specifically for multilingual retrieval. Trained on 100+ languages including Spanish. Not a general-purpose language model forced into retrieval work; it was designed for exactly this.
Qwen3-Embedding-0.6B hit 27.3% recall@10 out of the box. No training. No tuning. That's roughly 27x better than the fine-tuned Gemma model, and 2x better than the off-the-shelf RAG setup. And this is the starting line for fine-tuning, not the finish line.
This part is inspired by Karpathy's autoresearch. The concept: give an AI agent a model, a metric, and a time budget per experiment. Let it modify training settings, run the experiment, check the result, keep it or throw it away, and repeat. You go to sleep. You wake up to a log of experiments and a better model.
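The control flow is simple; the agent's intelligence lives in deciding what to change. A greedy sketch of the loop under the "one change per experiment" rule (the `run_experiment` callback is where the real system has aider edit the training script and launch MLX; the search-space values here are illustrative):

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [16, 32, 64],
    "warmup_ratio": [0.0, 0.1],
}

def autoresearch(run_experiment, n_experiments=8, seed=0):
    """Mutate one setting at a time; keep the config only if
    recall@10 improves, otherwise throw the change away."""
    rng = random.Random(seed)
    best_cfg = {k: v[0] for k, v in SEARCH_SPACE.items()}
    best_score = run_experiment(best_cfg)            # baseline run
    log = [(dict(best_cfg), best_score)]
    for _ in range(n_experiments):
        cfg = dict(best_cfg)
        key = rng.choice(list(SEARCH_SPACE))         # one change per experiment
        cfg[key] = rng.choice(SEARCH_SPACE[key])
        score = run_experiment(cfg)                  # train + eval recall@10
        log.append((dict(cfg), score))
        if score > best_score:
            best_cfg, best_score = cfg, score        # keep it
    return best_cfg, best_score, log
```

The log is the part you read in the morning: every config tried, every score, and the winner.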
| Layer | Tool | Notes |
|---|---|---|
| Agent brain | qwen3:32b | Runs locally via Ollama. Zero API cost. |
| Agent harness | aider | Autonomous mode with auto-commits. |
| Training framework | sentence-transformers | Fine-tunes Qwen3-Embedding-0.6B. |
| Eval metric | recall@10 | 300 queries against 17,736 documents. |
| Hardware | Mac Studio M2 Ultra | 192GB unified memory. Runs the agent and training simultaneously. |
| Cost per night | $0 | Electricity. |
The agent can modify learning rate, batch size, warmup ratio, sequence length, weight decay, and loss function. One change per experiment. Always measured against the same benchmark of 300 held-out queries.
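The benchmark itself is just recall@10 over cosine similarity. A self-contained sketch of the metric, assuming query and document embeddings are already computed and each query has exactly one correct document:

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, gold, k=10):
    """gold[i] is the index of the one correct document for query i.
    Returns the fraction of queries whose document lands in the top k."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                                 # (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = sum(gold[i] in topk[i] for i in range(len(gold)))
    return hits / len(gold)
```

In Doctrina's case that's 300 held-out queries scored against all 17,736 document vectors, so every experiment is judged on exactly the same terms.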
| Approach | Model | Params | Recall@10 | Cost | Time |
|---|---|---|---|---|---|
| Off-the-shelf RAG | nomic-embed-text | 137M | 13.8% | $0 | 1 day |
| Fine-tuned English model | Gemma-3 270M | 270M | 1.0% | $0 | 3 hours |
| Multilingual retrieval (zero-shot) | Qwen3-Embedding-0.6B | 600M | 27.3% | $0 | 25 min |
| + Autoresearch fine-tuning | Qwen3-Embedding-0.6B | 600M | 🔄 running | $0 | overnight |
I spent hours tuning hyperparameters on Gemma-3 and got nowhere. Switching to a model that actually speaks Spanish gave me 27x the performance with zero tuning. If your retrieval model doesn't understand the language and domain of your documents, no optimization will fix that.
RAG works well when your documents are different from each other. A recipe, a weather report, and a legal brief are easy to tell apart. But 17,736 Spanish tax consultations written by the same bureaucratic voice, referencing the same laws? Generic embeddings collapse them into one cluster. The distances between vectors lose meaning.
One night of automated experiments replaces days of manually adjusting settings and rerunning. The key insight from Karpathy's framing: you're not writing training code; you're writing the program that tells an agent how to write training code.
The entire pipeline runs simultaneously on a single Mac Studio. Agent inference (32B model), training (0.6B model), and evaluation (17,736-document benchmark). 192GB unified memory means the GPU and CPU share the same pool. Total cost for overnight experimentation: the electricity bill.
Each failure taught something specific. But the biggest leap came from model selection, not algorithmic cleverness. If I had started with Qwen3-Embedding, I would have saved two attempts and gotten to fine-tuning a day earlier. The lesson is simple: before optimizing anything, make sure you're optimizing the right thing.