AI Embedding Evaluation

How to measure embedding quality: benchmarks, retrieval metrics, and intrinsic versus extrinsic evaluation.

Benchmarks

Benchmark | Coverage | Notes
MTEB | 56 tasks across 8 categories | De facto text embedding leaderboard (Hugging Face)
BEIR | 18 retrieval tasks | Zero-shot retrieval focus
C-MTEB | Chinese counterpart to MTEB | -
MIRACL | Multilingual IR (18 languages) | -
BUCC | Bitext mining | Parallel sentence alignment
STS benchmarks | Semantic textual similarity | STS-B, SICK-R

Retrieval metrics

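Metric | Meaning
Recall@k | Fraction of a query's relevant docs found in the top-k, averaged over queries; with one correct doc per query, this is the fraction of queries whose correct doc is in the top-k
Precision@k | Fraction of the top-k results that are relevant
MRR | Mean reciprocal rank: 1/rank of the first correct result, averaged over queries
MAP | Mean average precision: precision at each rank that holds a relevant doc, averaged per query and then across queries
NDCG@k | Normalized discounted cumulative gain: supports graded relevance and discounts hits lower in the ranking
Hit@k | Binary: 1 if any relevant doc is in the top-k, else 0; averaged over queries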
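
The metrics above are short enough to compute by hand, which helps when checking what a library reports. The following is a minimal sketch in plain Python, not a reference implementation. The function names, the representation of a ranking as a list of document ids, and the `gains` dictionary of graded relevance labels are all assumptions made for illustration.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (k must be positive)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant doc is in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Precision at each rank holding a relevant doc, divided by the number of relevant docs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """gains maps doc id -> graded relevance; missing docs count as 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values):
    """Average per-query scores to get MRR, MAP, mean recall@k, and so on."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

Each function scores a single query. MRR and MAP are the `mean` of `reciprocal_rank` and `average_precision` over all queries in the evaluation set, and the @k metrics are averaged the same way.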

Intrinsic vs extrinsic

  • Intrinsic: STS tasks, analogy, word similarity — measure embedding geometry directly.
  • Extrinsic: retrieval, classification, clustering — measure embedding usefulness for downstream tasks.
  • Rule: always evaluate on tasks close to your actual use case rather than relying on benchmark rank alone.
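
As an illustration of the intrinsic side, STS-style evaluation is commonly scored as the Spearman rank correlation between the cosine similarity of each sentence pair's embeddings and the human similarity rating for that pair. The sketch below uses plain Python and assumes the embeddings are already computed; the function names and the input layout (a list of vector pairs plus a parallel list of gold scores) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for t in range(i, j + 1):
            result[order[t]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def sts_score(pairs, human_scores):
    """pairs: list of (vec_a, vec_b); human_scores: gold rating per pair."""
    predicted = [cosine(a, b) for a, b in pairs]
    return spearman(predicted, human_scores)
```

Because the score depends only on rank order, it rewards a model for ordering pairs the way annotators do, regardless of the absolute scale of its cosine values.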

Practical tips

  • Build a gold set from your domain — 100–500 query/document pairs.
  • Use semi-hard negatives when fine-tuning — too-easy negatives don't teach much.
  • Evaluate at the k your product shows: recall@10 vs recall@1 can pick different winners.
  • Check latency and cost alongside quality — a 1% quality gain is rarely worth 10× cost.
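
The tip on semi-hard negatives can be made concrete with a small selection routine. One common definition, borrowed from triplet-loss training, treats a negative as semi-hard when it is less similar to the query than the positive but falls within a margin of it. The sketch below assumes precomputed embeddings and cosine similarity; the function name, the `candidates` dictionary, and the default `margin=0.2` are illustrative assumptions, not recommended values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semi_hard_negatives(query, positive, candidates, margin=0.2):
    """Return ids of candidates with sim_pos - margin < sim_neg < sim_pos.

    candidates: dict mapping doc id -> embedding of a known non-relevant doc.
    margin: hypothetical default; tune it for your model and data.
    """
    sim_pos = cosine(query, positive)
    picked = []
    for doc_id, vec in candidates.items():
        sim_neg = cosine(query, vec)
        # Skip too-easy negatives (far below the positive) and negatives
        # that already score at or above the positive.
        if sim_pos - margin < sim_neg < sim_pos:
            picked.append((doc_id, sim_neg))
    # Hardest of the semi-hard negatives first.
    picked.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in picked]

# Toy example with 2-d vectors standing in for real embeddings.
query = [1.0, 0.0]
positive = [0.9, 0.436]
candidates = {
    "easy": [0.0, 1.0],      # nearly unrelated to the query
    "semi": [0.8, 0.6],      # close to the positive, but below it
    "too_hard": [1.0, 0.1],  # scores above the positive
}
print(semi_hard_negatives(query, positive, candidates))
```

Candidates that score above the positive are excluded here because in practice they are often unlabeled positives; whether to keep them is a design choice that depends on how trustworthy the gold set is.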