# AI Embedding Evaluation
How to measure embedding quality — benchmarks, retrieval metrics, and intrinsic evaluation.
## Benchmarks
| Benchmark | Coverage | Notes |
|---|---|---|
| MTEB | 56 tasks across 8 task categories | De facto text embedding leaderboard (HuggingFace) |
| BEIR | 18 retrieval tasks | Zero-shot retrieval focus |
| C-MTEB | 35 Chinese tasks across 6 categories | Chinese counterpart to MTEB |
| MIRACL | Multilingual IR (18 languages) | Human-annotated relevance judgments |
| BUCC | Bitext mining | Parallel sentence alignment |
| STS benchmarks | Semantic textual similarity | STS-B, SICK-R |
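As a sketch of what a run looks like in practice, the snippet below scores a model on two MTEB tasks via the `mteb` package. The model name, task picks, and output folder are illustrative choices, and the exact API has shifted across `mteb` versions:

```python
# Illustrative MTEB run — model, tasks, and output folder are example choices.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model exposing .encode()
evaluation = MTEB(tasks=["STSBenchmark", "SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```

Results are written per task under the output folder, which makes diffing candidate models straightforward.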
## Retrieval metrics
| Metric | Meaning |
|---|---|
| Recall@k | Fraction of a query's relevant docs that appear in the top-k |
| Precision@k | Fraction of top-k that are relevant |
| MRR | Mean reciprocal rank — average over queries of 1/rank of the first relevant result |
| MAP | Mean average precision — precision at each relevant hit, averaged per query, then over queries |
| NDCG@k | Normalized discounted cumulative gain — graded relevance |
| Hit@k | Binary: any relevant in top-k? |
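These metrics are simple enough to implement directly. A minimal sketch in plain NumPy, assuming each query comes with a ranked list of doc ids plus a set (or graded dict) of relevant ids; names here are illustrative:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of this query's relevant docs found in the top-k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant doc; 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, grades, k):
    """NDCG@k with graded relevance; `grades` maps doc_id -> relevance grade."""
    dcg = sum(grades.get(d, 0) / np.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Per-query values; average over all queries to get the reported number.
print(recall_at_k(["d3", "d1", "d9"], {"d1", "d7"}, k=3))    # 0.5
print(mrr(["d3", "d1", "d9"], {"d1", "d7"}))                 # 0.5
print(ndcg_at_k(["d3", "d1", "d9"], {"d1": 3, "d7": 1}, k=3))
```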
## Intrinsic vs extrinsic
- Intrinsic: STS tasks, analogies, word similarity — measure embedding geometry directly (see the sketch after this list).
- Extrinsic: retrieval, classification, clustering — measure embedding usefulness for downstream tasks.
- Rule: always evaluate on tasks close to your actual use case, not just benchmark rank.
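For a concrete intrinsic check, the usual STS protocol is Spearman correlation between cosine similarities and human ratings. The sketch below uses random stand-in embeddings and ratings purely to show the shape of the computation:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 384))   # stand-in: embeddings of first sentences
emb_b = rng.normal(size=(100, 384))   # stand-in: embeddings of second sentences
gold = rng.uniform(0, 5, size=100)    # stand-in: human ratings (STS-B uses 0-5)

rho, _ = spearmanr(cosine_rows(emb_a, emb_b), gold)
print(f"Spearman rho: {rho:.3f}")     # near 0 for random stand-ins
```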
## Practical tips
- Build a gold set from your domain — 100–500 query/document pairs.
- Use semi-hard negatives when fine-tuning — too-easy negatives teach the model little (see the sketch after this list).
- Evaluate at the k your product shows: recall@10 vs recall@1 can pick different winners.
- Check latency and cost alongside quality — a 1% quality gain is rarely worth 10× cost.
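A sketch of the semi-hard selection rule from the second tip, assuming unit-normalized embeddings: a negative qualifies when it is less similar to the anchor than the positive, but only by less than a margin, so the triplet still produces a useful gradient. All names and the margin value are illustrative:

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, margin=0.1):
    """Indices of candidates n with sim(a,p) - margin < sim(a,n) < sim(a,p).

    Assumes unit-normalized vectors, so dot products are cosine similarities.
    """
    sim_pos = float(anchor @ positive)
    sims = candidates @ anchor
    mask = (sims < sim_pos) & (sims > sim_pos - margin)
    return np.nonzero(mask)[0]

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
anchor, positive = unit(rng.normal(size=384)), unit(rng.normal(size=384))
candidates = unit(rng.normal(size=(1000, 384)))  # stand-in corpus embeddings
print(semi_hard_negatives(anchor, positive, candidates, margin=0.05))
```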