AI Evaluation Metrics

Metrics for classifiers, regression models, LLMs, and embeddings: what each one measures and what to watch out for.

Classification

Metric                Formula / meaning                     Watch-out
Accuracy              (TP + TN) / total                     Deceptive on imbalanced classes
Precision             TP / (TP + FP)                        Of predicted positives, the fraction that are correct
Recall (sensitivity)  TP / (TP + FN)                        Of actual positives, the fraction that are found
F1                    2 · P · R / (P + R)                   Harmonic mean of P and R; ignores true negatives
Specificity           TN / (TN + FP)                        Of actual negatives, the fraction correctly rejected
ROC-AUC               Area under TPR-vs-FPR curve           Insensitive to class balance; can look optimistic when positives are rare
PR-AUC                Area under precision-recall curve     More informative than ROC-AUC on imbalanced data
Log loss              −(1/N) Σ [y·log(p) + (1−y)·log(1−p)]  Rewards calibrated probabilities; punishes confident wrong predictions
Matthews CC           Correlation coefficient (−1 to +1)    Robust to imbalance
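
As a rough illustration of the classification formulas, the sketch below computes them from raw confusion-matrix counts using only the standard library. The function names (`classification_metrics`, `log_loss`), the zero-division fallbacks, and the `eps` clipping value are assumptions made for this example, not part of any particular library.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics; returns 0.0 where a ratio is undefined (an assumption)."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient, in [-1, +1]
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# A model that always predicts "negative" on 10 positives and 990 negatives:
always_negative = classification_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_negative)  # accuracy 0.99, but recall 0.0 and MCC 0.0
```

The always-negative example shows why accuracy alone is deceptive on imbalanced classes: it scores 0.99 while recall and MCC are both zero.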

Regression

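Metric                       Formula / meaning                    Watch-out
MAE                          (1/N) Σ |y − ŷ|                      Same units as the target; weights all errors linearly
MSE                          (1/N) Σ (y − ŷ)²                     Squared units; penalizes large errors (outliers) more heavily
RMSE                         √MSE                                 Same units as the target; still outlier-sensitive
MAPE                         (1/N) Σ |(y − ŷ) / y|                Explodes near y = 0; undefined at y = 0
R² (coef. of determination)  1 − SS_res / SS_tot                  Negative when the model is worse than predicting the mean
Huber loss                   MSE for small errors, MAE for large  Outlier-resistant; the threshold δ must be chosen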
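
The regression formulas can be sketched in plain Python as follows. This is a minimal example under stated assumptions: the helper names (`regression_metrics`, `huber_loss`) and the default threshold `delta=1.0` are illustrative choices, and the Huber form used here is the common one with a 0.5 factor on the quadratic branch.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, MAPE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [y - yh for y, yh in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE divides by y, so it raises ZeroDivisionError if any target is 0
    mape = sum(abs(e / y) for e, y in zip(errors, y_true)) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "mape": mape, "r2": r2}

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic when |error| <= delta, linear beyond it."""
    total = 0.0
    for y, yh in zip(y_true, y_pred):
        e = abs(y - yh)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)

y_true = [2.0, 4.0, 6.0, 8.0]
print(regression_metrics(y_true, [2.5, 3.5, 6.0, 10.0]))
# Predictions in reverse order are worse than predicting the mean, so R² < 0:
print(regression_metrics(y_true, [8.0, 6.0, 4.0, 2.0])["r2"])  # -3.0
```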

NLP / generation

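Metric        Use                    Notes
BLEU          Translation overlap    n-gram precision with a brevity penalty; does not capture meaning
ROUGE-1/2/L   Summarization          Recall of reference n-grams (ROUGE-L uses the longest common subsequence)
METEOR        Translation            Accounts for synonyms and stemming
chrF          Translation            Character n-gram F-score
BERTScore     Semantic similarity    Token similarity computed from contextual embeddings
Perplexity    LLM fluency            exp(cross-entropy); lower is better; only comparable across models that share a tokenizer
Pass@k        Code benchmarks        Probability that at least one of k samples passes the tests
Exact match   QA, structured output  Strict; brittle to formatting
LLM-as-judge  Open-ended eval        Noisy and biased; use pairwise comparisons and an ensemble of judges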
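
Three of these metrics are simple enough to sketch directly. The example below is illustrative only: `perplexity` assumes you already have per-token natural-log probabilities from a model, `pass_at_k` uses the widely cited unbiased estimator 1 − C(n−c, k) / C(n, k) for n samples of which c pass (the table does not prescribe an estimator), and the normalization in `exact_match` (lowercasing, collapsing whitespace) is one hypothetical choice among many.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes), given c passes among n samples."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so a pass is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def exact_match(prediction, reference, normalize=False):
    """Strict string equality, optionally after a light normalization."""
    if normalize:
        prediction = " ".join(prediction.lower().split())
        reference = " ".join(reference.lower().split())
    return prediction == reference

# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
print(pass_at_k(n=10, c=3, k=1))   # 3 of 10 samples pass, one draw
print(pass_at_k(n=10, c=3, k=5))
print(exact_match("Paris ", "paris"), exact_match("Paris ", "paris", normalize=True))
```

The last line illustrates why exact match is brittle: a trailing space and a capital letter are enough to turn a correct answer into a miss unless some normalization is applied.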

Retrieval / embedding

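Metric             Meaning
Recall@k           Is the correct doc in the top k results?
MRR                Mean reciprocal rank: average of 1 / rank of the first relevant result
NDCG               Normalized discounted cumulative gain; supports graded relevance
MTEB               Embedding benchmark suite (56 datasets in the original English benchmark)
Cosine similarity  Embedding proximity: cosine of the angle between two vectors (−1 to +1)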
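
The retrieval metrics can be sketched as below. This is a minimal example, not a reference implementation: the function names are made up for illustration, `recall_at_k` returns the fraction of relevant documents found (which reduces to the yes/no question in the table when there is a single correct document), and `ndcg_at_k` assumes the linear-gain form rel / log2(rank + 1) rather than the 2^rel − 1 variant.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant docs that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of each result, in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

ranked = ["d3", "d1", "d7"]
print(recall_at_k(ranked, {"d1"}, k=1))   # 0.0: the correct doc is at rank 2
print(recall_at_k(ranked, {"d1"}, k=2))   # 1.0
print(reciprocal_rank(ranked, {"d1"}))    # 0.5
print(ndcg_at_k([0, 2, 3], k=3))          # below 1.0: best result is ranked last
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```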