AI Embedding Evaluation

How to measure embedding quality — benchmarks, retrieval metrics, and intrinsic evaluation.

Updated Apr 19, 2026

Benchmarks

Benchmark	Coverage	Notes
MTEB	56 datasets across 8 task types	De-facto text embedding leaderboard (hosted on Hugging Face)
BEIR	18 retrieval datasets	Zero-shot retrieval focus
C-MTEB	Chinese text embedding tasks	Chinese counterpart to MTEB
MIRACL	Multilingual IR	18 languages
BUCC	Bitext mining	Parallel sentence alignment
STS benchmarks	Semantic textual similarity	STS-B, SICK-R

Retrieval metrics

Metric	Meaning
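Recall@k	Fraction of a query's relevant docs that appear in the top-k, averaged over queries
Precision@k	Fraction of the top-k results that are relevant
MRR	Mean reciprocal rank: 1/rank of the first relevant result, averaged over queries
MAP	Mean average precision: precision at each relevant result's rank, averaged per query and then over queries
NDCG@k	Normalized discounted cumulative gain: rewards graded relevance, discounted by rank
Hit@k	Binary: is any relevant doc in the top-k? Equals Recall@k when each query has exactly one relevant doc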
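
The metrics in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy document ids, and the gain values are all made up for the example. Recall here is computed per query as the share of relevant documents retrieved, which coincides with Hit@k when a query has a single relevant document. NDCG uses the common linear-gain form with a log2 rank discount.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant docs that appear in the top-k (one query)."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant (k assumed >= 1)."""
    return len(set(ranked[:k]) & relevant) / k

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant doc is in the top-k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each relevant result's rank."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, gains, k):
    """gains maps doc id -> graded relevance; missing ids count as 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: the system returned four docs, two of which are relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 2, "d2": 1}  # hypothetical graded labels

print("recall@2   ", recall_at_k(ranked, relevant, 2))
print("precision@2", precision_at_k(ranked, relevant, 2))
print("hit@2      ", hit_at_k(ranked, relevant, 2))
print("RR         ", reciprocal_rank(ranked, relevant))
print("AP         ", average_precision(ranked, relevant))
print("NDCG@4     ", ndcg_at_k(ranked, gains, 4))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries in the evaluation set.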

Intrinsic vs extrinsic

Intrinsic: STS tasks, analogy, word similarity — measure embedding geometry directly.
Extrinsic: retrieval, classification, clustering — measure embedding usefulness for downstream tasks.
Rule: evaluate on tasks close to your actual use case rather than relying on benchmark rank alone.
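
As an illustration of the intrinsic side, STS-style evaluation is commonly scored as the Spearman rank correlation between the cosine similarity of each sentence pair's embeddings and the human similarity rating. The sketch below uses only the standard library. The two-dimensional vectors and gold scores are invented stand-ins for real embeddings and labels, and `sts_score` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def sts_score(embedding_pairs, gold_scores):
    """Rank-correlate model similarities with human ratings."""
    sims = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(sims, gold_scores)

# Toy data: three sentence pairs, from near-paraphrase to unrelated.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [4.8, 2.5, 0.2]  # hypothetical human ratings on a 0-5 scale

print("Spearman:", sts_score(pairs, gold))
```

A high score here says the embedding geometry orders pairs the way annotators do. It does not by itself show that retrieval on your own data will be good, which is why the extrinsic check still matters.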

Practical tips

Build a gold set from your domain — 100–500 query/document pairs.
Use semi-hard negatives when fine-tuning — too-easy negatives don't teach much.
Evaluate at the k your product shows: recall@10 vs recall@1 can pick different winners.
Check latency and cost alongside quality — a 1% quality gain is rarely worth 10× cost.
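
The semi-hard negative tip can be made concrete with a small filter. One common definition, assumed here, keeps negatives that score below the positive but within a margin of it, so they are neither trivially easy nor ranked above the true match. The function name, the margin of 0.2, and the toy vectors are illustrative choices, not values from any particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semi_hard_negatives(query, positive, candidates, margin=0.2):
    """Return (doc_id, similarity) for candidates that satisfy
    sim(query, positive) - margin < sim(query, candidate) < sim(query, positive),
    sorted from hardest to easiest.

    candidates maps doc id -> embedding and must not contain the positive.
    margin is an assumed hyperparameter; tune it for your model.
    """
    sim_pos = cosine(query, positive)
    picked = []
    for doc_id, vec in candidates.items():
        sim = cosine(query, vec)
        if sim_pos - margin < sim < sim_pos:
            picked.append((doc_id, sim))
    return sorted(picked, key=lambda item: item[1], reverse=True)

# Toy embeddings (hypothetical): one candidate in each difficulty band.
query = [1.0, 0.0]
positive = [0.9, 0.1]
candidates = {
    "too_easy": [0.0, 1.0],    # far from the query, teaches little
    "semi_hard": [0.8, 0.4],   # close, but still below the positive
    "too_hard": [1.0, 0.01],   # scores above the positive
}

for doc_id, sim in semi_hard_negatives(query, positive, candidates):
    print(doc_id, round(sim, 3))
```

Candidates that score above the positive are skipped in this sketch because in mined data they are often unlabeled true matches. Whether to train on them is a separate judgment call.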
