AI Embedding Evaluation

How to measure embedding quality — benchmarks, retrieval metrics, and intrinsic evaluation.

Updated Apr 19, 2026

Benchmarks

Benchmark        Coverage                          Notes
MTEB             56 tasks across 8 categories      De-facto text embedding leaderboard (HuggingFace)
BEIR             18 retrieval tasks                Zero-shot retrieval focus
C-MTEB           Chinese-language tasks            Chinese counterpart to MTEB
MIRACL           Multilingual IR (18 languages)
BUCC             Bitext mining                     Parallel sentence alignment
STS benchmarks   Semantic textual similarity       STS-B, SICK-R
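
MTEB results can also be reproduced locally with the open-source mteb package. A minimal sketch, assuming the mteb and sentence-transformers packages are installed; the task name and model checkpoint are only examples, and the exact API can vary between versions:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any model exposing .encode() works; this checkpoint is just an example.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Run a single task rather than the full suite while iterating.
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```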

Retrieval metrics

Metric        Meaning
Recall@k      Fraction of relevant documents retrieved in the top-k (with one gold doc per query, the fraction of queries answered in the top-k)
Precision@k   Fraction of the top-k results that are relevant
MRR           Mean reciprocal rank: average over queries of 1/rank of the first relevant result
MAP           Mean average precision: average precision per query, averaged over all queries
NDCG@k        Normalized discounted cumulative gain; supports graded relevance and rewards early ranking
Hit@k         Binary: is any relevant document in the top-k?
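
These metrics are straightforward to compute from a ranked result list and a set of judged documents. A minimal, dependency-free sketch for a single query; the function name and the graded-relevance convention are illustrative, and DCG here uses linear gains:

```python
import math

def evaluate_ranking(ranked_ids, relevant, k=10):
    """Compute common retrieval metrics for one query.

    ranked_ids: list of doc ids, best first (e.g. sorted by cosine similarity).
    relevant:   dict mapping relevant doc id -> graded relevance (> 0);
                binary metrics treat any listed id as relevant.
    """
    top_k = ranked_ids[:k]
    rel_in_k = [doc for doc in top_k if doc in relevant]

    # Hit@k, Precision@k, Recall@k (binary relevance)
    hit_at_k = 1.0 if rel_in_k else 0.0
    precision_at_k = len(rel_in_k) / k
    recall_at_k = len(rel_in_k) / len(relevant)

    # MRR: reciprocal rank of the first relevant document (0 if none retrieved)
    mrr = 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            mrr = 1.0 / rank
            break

    # Average precision: precision at each relevant hit, divided by #relevant.
    # Averaging this value over all queries gives MAP.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    average_precision = sum(precisions) / len(relevant) if relevant else 0.0

    # NDCG@k: DCG with linear gains, normalized by the ideal DCG
    dcg = sum(relevant.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(top_k, start=1))
    ideal_gains = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    ndcg_at_k = dcg / idcg if idcg > 0 else 0.0

    return {"hit@k": hit_at_k, "precision@k": precision_at_k,
            "recall@k": recall_at_k, "mrr": mrr,
            "ap": average_precision, "ndcg@k": ndcg_at_k}

# Example: one query, graded relevance (2 = highly relevant, 1 = relevant)
print(evaluate_ranking(["d3", "d7", "d1", "d9"], {"d1": 2, "d9": 1}, k=3))
```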

Intrinsic vs extrinsic

  • Intrinsic: STS tasks, analogy, and word-similarity tests measure embedding geometry directly (a minimal STS sketch follows this list).
  • Extrinsic: retrieval, classification, and clustering measure how useful the embeddings are for downstream tasks.
  • Rule: always evaluate on tasks close to your actual use case, not just benchmark rank.
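
For the intrinsic case, the standard STS protocol is the Spearman correlation between the cosine similarity of the two sentence embeddings and the human similarity rating. A small sketch, assuming an encode callable that maps a list of strings to a 2-D NumPy array (a hypothetical stand-in for your model):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(pairs, gold_scores, encode):
    """Spearman correlation between embedding cosine similarity and human ratings.

    pairs:       list of (sentence_a, sentence_b) tuples
    gold_scores: human similarity ratings (e.g. 0-5 on STS-B)
    encode:      function mapping a list of strings to a 2-D embedding array
    """
    a = encode([s1 for s1, _ in pairs])
    b = encode([s2 for _, s2 in pairs])
    # Row-wise cosine similarity between the paired sentences
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cosine = (a * b).sum(axis=1)
    return spearmanr(cosine, gold_scores).correlation
```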

Practical tips

  • Build a gold set from your domain — 100–500 query/document pairs.
  • Use semi-hard negatives when fine-tuning; too-easy negatives teach the model little (a mining sketch follows this list).
  • Evaluate at the k your product shows: recall@10 vs recall@1 can pick different winners.
  • Check latency and cost alongside quality — a 1% quality gain is rarely worth 10× cost.
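
One way to pick semi-hard negatives is to keep candidates that the current model already scores close to, but still below, the positive. A rough sketch, assuming L2-normalized embeddings; the margin value is an illustrative hyperparameter, not a recommendation:

```python
import numpy as np

def mine_semi_hard_negatives(query_emb, positive_emb, corpus_embs, margin=0.05):
    """Pick semi-hard negatives for one (query, positive) training pair.

    Semi-hard: sim(q, neg) < sim(q, pos), but within `margin` of it.
    Embeddings are assumed L2-normalized, so dot product = cosine similarity.
    """
    pos_sim = float(query_emb @ positive_emb)
    sims = corpus_embs @ query_emb              # similarity to every candidate
    in_band = (sims < pos_sim) & (sims > pos_sim - margin)
    # Return candidate indices in the semi-hard band, hardest first
    candidates = np.where(in_band)[0]
    return candidates[np.argsort(-sims[candidates])]
```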