AI Evaluation Metrics
Metrics for classifiers, regressions, LLMs, and embeddings — what each measures and watch-outs.
Classification
| Metric | Formula / meaning | Watch-out |
|---|---|---|
| Accuracy | (TP + TN) / total | Deceptive on imbalanced classes |
| Precision | TP / (TP + FP); of predicted positives, how many are correct | Ignores false negatives |
| Recall (sensitivity) | TP / (TP + FN); of actual positives, how many are found | Ignores false positives |
| F1 | 2 · P · R / (P + R); harmonic mean of P and R | Hides the precision/recall trade-off |
| Specificity | TN / (TN + FP); of actual negatives, how many are found | Pair with recall when error costs differ by class |
| ROC-AUC | Area under TPR-vs-FPR curve | Insensitive to class balance; can look strong on imbalanced data |
| PR-AUC | Area under precision-recall curve | Prefer over ROC-AUC on imbalanced data |
| Log loss | −(1/N) Σ [y·log(p) + (1−y)·log(1−p)] | Heavily penalizes confident wrong predictions |
| Matthews CC | Correlation between predictions and labels, in [−1, +1] | Robust to imbalance; undefined if a confusion-matrix row or column is all zero |
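A minimal sketch of the confusion-matrix formulas above in plain Python (the function name is illustrative, not from any library). The example shows the accuracy trap on an imbalanced dataset:

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute the table's confusion-matrix metrics from raw counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        # Matthews correlation coefficient, in [-1, +1]
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Imbalanced data: 990 negatives, 10 positives; model finds only 2 positives.
# Accuracy looks great (0.992) while recall is a poor 0.2.
m = classification_metrics(tp=2, fp=0, fn=8, tn=990)
```

Note the sketch assumes no denominator is zero; production code needs guards for degenerate confusion matrices.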
Regression
| Metric | Formula / meaning | Watch-out |
|---|---|---|
| MAE | (1/N) Σ \|y − ŷ\| | Same units as target; weights all errors equally |
| MSE | (1/N) Σ (y − ŷ)² | Squared units; penalizes outliers more |
| RMSE | √MSE | Same units as target; still outlier-sensitive |
| MAPE | (1/N) Σ \|(y − ŷ) / y\| (×100 for %) | Explodes near y = 0 |
| R² (coef. of determination) | 1 − SS_res / SS_tot | Can be negative on bad models |
| Huber loss | Quadratic for small errors, linear beyond a threshold δ | Outlier-resistant, but δ must be tuned |
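The regression formulas translate directly to a few lines of plain Python (illustrative function name, no library assumed):

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R² from paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ss_res / ss_tot,
    }

m = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
# mae = 0.25, mse = 0.125 (note the squared units), r2 = 0.975
```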
NLP / generation
| Metric | Use | Notes |
|---|---|---|
| BLEU | Translation overlap | n-gram precision; doesn't capture meaning |
| ROUGE-1/2/L | Summarization | Recall of reference n-grams |
| METEOR | Translation | Considers synonyms, stemming |
| chrF | Translation | Character-level F-score |
| BERTScore | Semantic similarity | Contextual embeddings |
| Perplexity | LLM fluency | exp(cross-entropy); lower is better |
| Pass@k | Code benchmark | Prob. at least one of k samples passes the tests |
| Exact match | QA, structured | Strict; brittle to formatting |
| LLM-as-judge | Open-ended eval | Noisy, biased; use with pairwise + ensemble |
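Two of these have simple closed forms worth sketching: perplexity is exp of the mean per-token negative log-likelihood, and pass@k is usually computed with the unbiased estimator 1 − C(n−c, k)/C(n, k) over n generations of which c pass (a sketch, not a benchmark harness):

```python
from math import comb, exp

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token); lower is better."""
    return exp(sum(token_nlls) / len(token_nlls))

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 generations pass: pass@1 = 0.3, and pass@k grows with k.
p1 = pass_at_k(10, 3, 1)
```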
Retrieval / embedding
| Metric | Meaning |
|---|---|
| Recall@k | Is the correct doc in top-k? |
| MRR | Mean reciprocal rank of the first relevant result |
| NDCG | Normalized DCG — graded relevance |
| MTEB | Embedding benchmark suite (56 tasks); a leaderboard, not a single metric |
| Cosine similarity | Dot product of L2-normalized embeddings; in [−1, 1] |
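These retrieval metrics can be sketched per query in plain Python (names are illustrative); system-level scores are the mean over queries:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the correct doc appears in the top-k results, else 0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the first relevant result; averaged over queries gives MRR."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevances, k):
    """relevances: graded relevance scores in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

Example: for the ranking `["b", "a", "c"]` with `"a"` relevant, recall@1 is 0 but recall@2 is 1, and the reciprocal rank is 0.5.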