AI Evaluation Metrics

Metrics for classifiers, regression models, LLMs, and embeddings: what each measures and what to watch out for.

Updated Apr 19, 2026

Classification

| Metric | Formula / meaning | Watch-out |
| --- | --- | --- |
| Accuracy | (TP + TN) / total | Deceptive on imbalanced classes |
| Precision | TP / (TP + FP): of predicted positives, how many are correct | Ignores false negatives |
| Recall (sensitivity) | TP / (TP + FN): of actual positives, how many are found | Ignores false positives |
| F1 | 2 · P · R / (P + R): harmonic mean, balances P and R | Ignores true negatives |
| Specificity | TN / (TN + FP): of actual negatives, how many are correctly rejected | Useful when error costs are symmetric |
| ROC-AUC | Area under TPR-vs-FPR curve | Ignores class balance |
| PR-AUC | Area under precision-recall curve | More informative than ROC-AUC on imbalanced data |
| Log loss | −(1/N) Σ [y·log(p) + (1−y)·log(1−p)] | Measures probabilistic calibration; punishes confident errors |
| Matthews CC | Correlation coefficient over the confusion matrix (−1 to +1) | Robust to class imbalance |
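
A minimal sketch of these metrics with scikit-learn; the `y_true`/`y_prob` arrays are toy placeholders, and specificity is derived from the confusion matrix since scikit-learn has no dedicated function for it:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix, f1_score,
    log_loss, matthews_corrcoef, precision_score, recall_score, roc_auc_score,
)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth labels (toy)
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])   # model scores (toy)
y_pred = (y_prob >= 0.5).astype(int)                          # hard labels at a 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))            # takes scores, not labels
print("PR-AUC   :", average_precision_score(y_true, y_prob))  # average precision ~ PR-AUC
print("log loss :", log_loss(y_true, y_prob))
print("MCC      :", matthews_corrcoef(y_true, y_pred))

# Specificity from the binary confusion matrix (tn, fp, fn, tp order).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("specificity:", tn / (tn + fp))
```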

Regression

| Metric | Formula / meaning | Watch-out |
| --- | --- | --- |
| MAE | (1/N) Σ \|y − ŷ\| | In target units; weights all errors equally |
| MSE | (1/N) Σ (y − ŷ)² | Penalizes outliers more heavily |
| RMSE | √MSE | Units match the target |
| MAPE | (1/N) Σ \|(y − ŷ) / y\| (×100 for percent) | Explodes near y = 0 |
| R² (coef. of determination) | 1 − SS_res / SS_tot | Can be negative when the model underperforms the mean baseline |
| Huber loss | Quadratic for small errors, linear beyond a threshold δ | Outlier-resistant; δ must be tuned |
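
A plain-NumPy sketch of the same quantities on toy data; the Huber `delta` is an assumed example value, not a recommendation:

```python
import numpy as np

y    = np.array([3.0, 5.0, 2.5, 7.0])   # targets (toy values)
yhat = np.array([2.8, 5.4, 2.0, 8.0])   # predictions (toy values)

err  = y - yhat
mae  = np.abs(err).mean()
mse  = (err ** 2).mean()
rmse = np.sqrt(mse)
mape = np.abs(err / y).mean()            # fraction; multiply by 100 for percent
r2   = 1 - (err ** 2).sum() / ((y - y.mean()) ** 2).sum()

delta = 1.0                              # Huber threshold (assumed value)
huber = np.where(np.abs(err) <= delta,
                 0.5 * err ** 2,                        # quadratic inside the threshold
                 delta * (np.abs(err) - 0.5 * delta)    # linear outside it
                 ).mean()

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} "
      f"MAPE={mape:.3%} R2={r2:.3f} Huber={huber:.3f}")
```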

NLP / generation

| Metric | Use | Notes |
| --- | --- | --- |
| BLEU | Translation overlap | n-gram precision; doesn't capture meaning |
| ROUGE-1/2/L | Summarization | Recall of reference n-grams |
| METEOR | Translation | Considers synonyms and stemming |
| chrF | Translation | Character-level F-score |
| BERTScore | Semantic similarity | Matches tokens via contextual embeddings |
| Perplexity | LLM fluency | exp(cross-entropy); lower is better |
| Pass@k | Code benchmarks | Prob. that at least one of k samples is correct |
| Exact match | QA, structured output | Strict; brittle to formatting |
| LLM-as-judge | Open-ended eval | Noisy and biased; use pairwise comparisons and an ensemble of judges |
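
Pass@k is worth a worked example: generating exactly k samples per problem gives a high-variance estimate, so the HumanEval/Codex paper draws n ≥ k samples, counts c correct, and uses the unbiased estimator 1 − C(n−c, k)/C(n, k). A sketch with toy sample counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: prob. at least one of k draws (without replacement) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws, so one must be correct
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. n = 20 samples per problem; correct counts per problem are toy values.
correct = [0, 3, 20, 1]
print(np.mean([pass_at_k(20, c, 10) for c in correct]))
```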

Retrieval / embedding

| Metric | Meaning |
| --- | --- |
| Recall@k | Is the correct doc in the top-k results? |
| MRR | Mean reciprocal rank of the first relevant result |
| NDCG | Normalized DCG; handles graded relevance |
| MTEB | Embedding benchmark suite (56 tasks) |
| Cosine similarity | Embedding proximity by angle, independent of magnitude |
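
A NumPy sketch of Recall@k, MRR, and NDCG@k; the relevance arrays are toy values and the function names are illustrative, not from any library:

```python
import numpy as np

def recall_at_k(ranked_relevant: np.ndarray, k: int, n_relevant: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results (binary labels)."""
    return float(ranked_relevant[:k].sum() / n_relevant)

def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean reciprocal rank over queries; ranks are 1-based, 0 means never found."""
    return float(np.mean([1.0 / r if r > 0 else 0.0 for r in first_relevant_ranks]))

def ndcg_at_k(gains: np.ndarray, k: int) -> float:
    """NDCG@k for graded relevance: DCG = sum(gain_i / log2(i + 1)), normalized by ideal DCG."""
    def dcg(g: np.ndarray) -> float:
        g = g[:k]
        return float(np.sum(g / np.log2(np.arange(2, len(g) + 2))))
    ideal = dcg(np.sort(gains)[::-1])
    return dcg(gains) / ideal if ideal > 0 else 0.0

rel   = np.array([1, 0, 1, 0, 0])      # binary relevance in retrieved order (toy)
gains = np.array([3, 0, 2, 1, 0])      # graded relevance in retrieved order (toy)
print(recall_at_k(rel, k=3, n_relevant=2))
print(mrr([1, 3, 0]))                  # first hits at rank 1, rank 3, and never
print(ndcg_at_k(gains, k=5))

# Cosine similarity between two embeddings.
a, b = np.array([0.1, 0.9, 0.2]), np.array([0.2, 0.8, 0.1])
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```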
