AI Evaluation Metrics

Metrics for classifiers, regression models, LLMs, and embeddings: what each measures and what to watch out for.

Updated Apr 19, 2026

Classification

| Metric | Formula / meaning | Watch-out |
| --- | --- | --- |
| Accuracy | (TP + TN) / total | Deceptive on imbalanced classes |
| Precision | TP / (TP + FP): of predicted positives, how many are correct | Ignores false negatives |
| Recall (sensitivity) | TP / (TP + FN): of actual positives, how many are found | Ignores false positives |
| F1 | 2 · P · R / (P + R): harmonic mean, balances P and R | Ignores true negatives |
| Specificity | TN / (TN + FP): of actual negatives, how many are correctly rejected | Useful when error costs are symmetric |
| ROC-AUC | Area under TPR-vs-FPR curve | Ignores class balance |
| PR-AUC | Area under precision-recall curve | More informative than ROC-AUC on imbalanced data |
| Log loss | −(1/N) Σ [y·log(p) + (1−y)·log(1−p)] | Measures probabilistic calibration; punishes confident errors |
| Matthews CC | Correlation coefficient over the confusion matrix (−1 to +1) | Robust to class imbalance |
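
A minimal sketch of these metrics with scikit-learn; the `y_true`/`y_prob` arrays are toy placeholders, and specificity is derived from the confusion matrix since scikit-learn has no dedicated function for it:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix, f1_score,
    log_loss, matthews_corrcoef, precision_score, recall_score, roc_auc_score,
)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth labels (toy)
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])   # model scores (toy)
y_pred = (y_prob >= 0.5).astype(int)                          # hard labels at a 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))            # takes scores, not labels
print("PR-AUC   :", average_precision_score(y_true, y_prob))  # average precision ~ PR-AUC
print("log loss :", log_loss(y_true, y_prob))
print("MCC      :", matthews_corrcoef(y_true, y_pred))

# Specificity from the binary confusion matrix (tn, fp, fn, tp order).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("specificity:", tn / (tn + fp))
```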

Regression

| Metric | Formula / meaning | Watch-out |
| --- | --- | --- |
| MAE | (1/N) Σ \|y − ŷ\| | In target units; weights all errors equally |
| MSE | (1/N) Σ (y − ŷ)² | Penalizes outliers more heavily |
| RMSE | √MSE | Units match the target |
| MAPE | (1/N) Σ \|(y − ŷ) / y\| (×100 for percent) | Explodes near y = 0 |
| R² (coef. of determination) | 1 − SS_res / SS_tot | Can be negative when the model underperforms the mean baseline |
| Huber loss | Quadratic for small errors, linear beyond a threshold δ | Outlier-resistant; δ must be tuned |
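
A plain-NumPy sketch of the same quantities on toy data; the Huber `delta` is an assumed example value, not a recommendation:

```python
import numpy as np

y    = np.array([3.0, 5.0, 2.5, 7.0])   # targets (toy values)
yhat = np.array([2.8, 5.4, 2.0, 8.0])   # predictions (toy values)

err  = y - yhat
mae  = np.abs(err).mean()
mse  = (err ** 2).mean()
rmse = np.sqrt(mse)
mape = np.abs(err / y).mean()            # fraction; multiply by 100 for percent
r2   = 1 - (err ** 2).sum() / ((y - y.mean()) ** 2).sum()

delta = 1.0                              # Huber threshold (assumed value)
huber = np.where(np.abs(err) <= delta,
                 0.5 * err ** 2,                        # quadratic inside the threshold
                 delta * (np.abs(err) - 0.5 * delta)    # linear outside it
                 ).mean()

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} "
      f"MAPE={mape:.3%} R2={r2:.3f} Huber={huber:.3f}")
```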

NLP / generation

| Metric | Use | Notes |
| --- | --- | --- |
| BLEU | Translation overlap | n-gram precision; doesn't capture meaning |
| ROUGE-1/2/L | Summarization | Recall of reference n-grams |
| METEOR | Translation | Considers synonyms and stemming |
| chrF | Translation | Character-level F-score |
| BERTScore | Semantic similarity | Matches tokens via contextual embeddings |
| Perplexity | LLM fluency | exp(cross-entropy); lower is better |
| Pass@k | Code benchmarks | Prob. that at least one of k samples is correct |
| Exact match | QA, structured output | Strict; brittle to formatting |
| LLM-as-judge | Open-ended eval | Noisy and biased; use pairwise comparisons and an ensemble of judges |
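
Pass@k is worth a worked example: generating exactly k samples per problem gives a high-variance estimate, so the HumanEval/Codex paper draws n ≥ k samples, counts c correct, and uses the unbiased estimator 1 − C(n−c, k)/C(n, k). A sketch with toy sample counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: prob. at least one of k draws (without replacement) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws, so one must be correct
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. n = 20 samples per problem; correct counts per problem are toy values.
correct = [0, 3, 20, 1]
print(np.mean([pass_at_k(20, c, 10) for c in correct]))
```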

Retrieval / embedding

| Metric | Meaning |
| --- | --- |
| Recall@k | Is the correct doc in the top-k results? |
| MRR | Mean reciprocal rank of the first relevant result |
| NDCG | Normalized DCG; handles graded relevance |
| MTEB | Embedding benchmark suite (56 tasks) |
| Cosine similarity | Embedding proximity by angle, independent of magnitude |
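
A NumPy sketch of Recall@k, MRR, and NDCG@k; the relevance arrays are toy values and the function names are illustrative, not from any library:

```python
import numpy as np

def recall_at_k(ranked_relevant: np.ndarray, k: int, n_relevant: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results (binary labels)."""
    return float(ranked_relevant[:k].sum() / n_relevant)

def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean reciprocal rank over queries; ranks are 1-based, 0 means never found."""
    return float(np.mean([1.0 / r if r > 0 else 0.0 for r in first_relevant_ranks]))

def ndcg_at_k(gains: np.ndarray, k: int) -> float:
    """NDCG@k for graded relevance: DCG = sum(gain_i / log2(i + 1)), normalized by ideal DCG."""
    def dcg(g: np.ndarray) -> float:
        g = g[:k]
        return float(np.sum(g / np.log2(np.arange(2, len(g) + 2))))
    ideal = dcg(np.sort(gains)[::-1])
    return dcg(gains) / ideal if ideal > 0 else 0.0

rel   = np.array([1, 0, 1, 0, 0])      # binary relevance in retrieved order (toy)
gains = np.array([3, 0, 2, 1, 0])      # graded relevance in retrieved order (toy)
print(recall_at_k(rel, k=3, n_relevant=2))
print(mrr([1, 3, 0]))                  # first hits at rank 1, rank 3, and never
print(ndcg_at_k(gains, k=5))

# Cosine similarity between two embeddings.
a, b = np.array([0.1, 0.9, 0.2]), np.array([0.2, 0.8, 0.1])
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```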
