AI Dataset Formats

Common dataset formats used for training and evaluating AI — JSONL, Parquet, HuggingFace datasets, and more.

Reference · Updated Apr 19, 2026

Formats

Format | Best for | Notes
CSV | Small tabular data | Simple, universal; type info lost
TSV | Text fields containing commas | Like CSV but tab-delimited
JSON | Nested or hierarchical data | Verbose; requires a whole-file parse
JSONL (ND-JSON) | Streaming records, LLM fine-tuning | One JSON object per line
Parquet | Columnar analytics | Compressed, efficient column reads
Feather / Arrow IPC | Fast in-memory handoff | Zero-copy to pandas / polars
HDF5 | Multidimensional arrays | Common in older ML (Keras)
NumPy .npy / .npz | Raw arrays | Simple dump / restore
TFRecord | TensorFlow pipelines | Protocol Buffer records
WebDataset (tar) | Large-scale vision / multimodal | Streams shards from disk or cloud
HuggingFace datasets | Training / evaluation | Memory-mapped Arrow backend
MLflow model / signature | Serving | Metadata + artifacts
Safetensors | Model weights (not data) | Safer alternative to pickle
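
The JSONL row above is the simplest to demonstrate end to end: one JSON object per line means files can be appended to and read back line by line without a whole-file parse. A minimal sketch using only the standard library (the records and file name are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

# Toy records; any JSON-serializable dicts work.
records = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world"},
]

path = Path(tempfile.mkdtemp()) / "data.jsonl"

# Write: one json.dumps() call per line.
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: parse line by line -- constant memory regardless of file size.
loaded = []
with path.open(encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))

print(loaded == records)  # True
```

The same loop shape is why JSONL streams well: a reader never needs more than one record in memory at a time.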

LLM fine-tuning formats

// OpenAI chat fine-tune (JSONL) — each record is one line in the file
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris."}]}

// Alpaca-style (JSONL)
{"instruction": "...", "input": "...", "output": "..."}

// Preference (DPO)
{"prompt": "...", "chosen": "...", "rejected": "..."}

Picking a format

  • Tabular, < 1 GB: CSV or Parquet.
  • Tabular, > 1 GB: Parquet — columnar reads + compression.
  • Records, streaming: JSONL.
  • Images / video at scale: WebDataset shards in tar.
  • Cross-team reproducibility: HuggingFace datasets or versioned Parquet.
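
The rules of thumb above can be collapsed into one small function. This is a hypothetical helper, not a library API — the categories and the 1 GB threshold come straight from the bullet list:

```python
def choose_format(kind: str, size_gb: float = 0.0) -> str:
    """Suggest a dataset format from a data kind and rough size in GB."""
    if kind == "tabular":
        # Under ~1 GB either works; above it, columnar reads + compression win.
        return "Parquet" if size_gb > 1 else "CSV or Parquet"
    if kind == "records":
        return "JSONL"
    if kind in {"images", "video", "multimodal"}:
        return "WebDataset (tar shards)"
    # Default for shared / reproducible pipelines.
    return "HuggingFace datasets or versioned Parquet"

print(choose_format("tabular", size_gb=5))  # Parquet
print(choose_format("records"))             # JSONL
```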
