# AI Embedding Evaluation
How to measure embedding quality — benchmarks, retrieval metrics, and intrinsic evaluation.
## Benchmarks
| Benchmark | Coverage | Notes |
|---|---|---|
| MTEB | 56 tasks across 8 task categories | De facto text embedding leaderboard (HuggingFace) |
| BEIR | 18 retrieval tasks | Zero-shot retrieval focus |
| C-MTEB | 35 Chinese tasks across 6 categories | Chinese counterpart to MTEB |
| MIRACL | Multilingual IR (18 languages) | Human-annotated relevance judgments |
| BUCC | Bitext mining | Parallel sentence alignment |
| STS benchmarks | Semantic textual similarity | STS-B, SICK-R |
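As a sketch of what a run looks like in practice, the snippet below scores a model on two MTEB tasks via the `mteb` package. The model name, task picks, and output folder are illustrative choices, and the exact API has shifted across `mteb` versions:

```python
# Illustrative MTEB run — model, tasks, and output folder are example choices.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model exposing .encode()
evaluation = MTEB(tasks=["STSBenchmark", "SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```

Results are written per task under the output folder, which makes diffing candidate models straightforward.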
## Retrieval metrics
| Metric | Meaning |
|---|---|
| Recall@k | Fraction of a query's relevant docs that appear in the top-k |
| Precision@k | Fraction of top-k that are relevant |
| MRR | Mean reciprocal rank — average over queries of 1/rank of the first relevant result |
| MAP | Mean average precision — precision at each relevant hit, averaged per query, then over queries |
| NDCG@k | Normalized discounted cumulative gain — graded relevance |
| Hit@k | Binary: any relevant in top-k? |
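These metrics are simple enough to implement directly. A minimal sketch in plain NumPy, assuming each query comes with a ranked list of doc ids plus a set (or graded dict) of relevant ids; names here are illustrative:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of this query's relevant docs found in the top-k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant doc; 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, grades, k):
    """NDCG@k with graded relevance; `grades` maps doc_id -> relevance grade."""
    dcg = sum(grades.get(d, 0) / np.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Per-query values; average over all queries to get the reported number.
print(recall_at_k(["d3", "d1", "d9"], {"d1", "d7"}, k=3))    # 0.5
print(mrr(["d3", "d1", "d9"], {"d1", "d7"}))                 # 0.5
print(ndcg_at_k(["d3", "d1", "d9"], {"d1": 3, "d7": 1}, k=3))
```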
## Intrinsic vs extrinsic
- Intrinsic: STS tasks, analogies, word similarity — measure embedding geometry directly (see the sketch after this list).
- Extrinsic: retrieval, classification, clustering — measure embedding usefulness for downstream tasks.
- Rule: always evaluate on tasks close to your actual use case, not just benchmark rank.
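For a concrete intrinsic check, the usual STS protocol is Spearman correlation between cosine similarities and human ratings. The sketch below uses random stand-in embeddings and ratings purely to show the shape of the computation:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 384))   # stand-in: embeddings of first sentences
emb_b = rng.normal(size=(100, 384))   # stand-in: embeddings of second sentences
gold = rng.uniform(0, 5, size=100)    # stand-in: human ratings (STS-B uses 0-5)

rho, _ = spearmanr(cosine_rows(emb_a, emb_b), gold)
print(f"Spearman rho: {rho:.3f}")     # near 0 for random stand-ins
```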
## Practical tips
- Build a gold set from your domain — 100–500 query/document pairs.
- Use semi-hard negatives when fine-tuning — too-easy negatives teach the model little (see the sketch after this list).
- Evaluate at the k your product shows: recall@10 vs recall@1 can pick different winners.
- Check latency and cost alongside quality — a 1% quality gain is rarely worth 10× cost.
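A sketch of the semi-hard selection rule from the second tip, assuming unit-normalized embeddings: a negative qualifies when it is less similar to the anchor than the positive, but only by less than a margin, so the triplet still produces a useful gradient. All names and the margin value are illustrative:

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, margin=0.1):
    """Indices of candidates n with sim(a,p) - margin < sim(a,n) < sim(a,p).

    Assumes unit-normalized vectors, so dot products are cosine similarities.
    """
    sim_pos = float(anchor @ positive)
    sims = candidates @ anchor
    mask = (sims < sim_pos) & (sims > sim_pos - margin)
    return np.nonzero(mask)[0]

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
anchor, positive = unit(rng.normal(size=384)), unit(rng.normal(size=384))
candidates = unit(rng.normal(size=(1000, 384)))  # stand-in corpus embeddings
print(semi_hard_negatives(anchor, positive, candidates, margin=0.05))
```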