Web & Dev

AI Dataset Formats

Common dataset formats used for training and evaluating AI — JSONL, Parquet, HuggingFace datasets, and more.

Formats

FormatBest forNotes
CSVSmall tabular dataSimple, universal; type info lost
TSVText with commasSame as CSV but tab-delimited
JSONNested or hierarchical dataVerbose; whole-file parse
JSONL (ND-JSON)Streaming records, LLM fine-tuningOne JSON object per line
ParquetColumnar analyticsCompressed, efficient column reads
Feather / Arrow IPCFast in-memory handoffZero-copy to pandas / polars
HDF5Multidimensional arraysCommon in older ML (Keras)
NumPy .npy / .npzRaw arraysSimple dump / restore
TFRecordTensorFlow pipelinesProtocol Buffer records
WebDataset (tar)Large-scale vision / multimodalStreams shards from disk or cloud
HuggingFace datasetsTraining / evaluationMemory-mapped Arrow backend
MLflow model / signatureServingMetadata + artifacts
SafetensorsModel weights (not data)Secure alternative to pickle

LLM fine-tuning formats

// OpenAI chat fine-tune (JSONL)
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "Paris."}
]}

// Alpaca-style (JSONL)
{"instruction": "...", "input": "...", "output": "..."}

// Preference (DPO)
{"prompt": "...", "chosen": "...", "rejected": "..."}

Picking a format

  • Tabular, < 1 GB: CSV or Parquet.
  • Tabular, > 1 GB: Parquet — columnar reads + compression.
  • Records, streaming: JSONL.
  • Images / video at scale: WebDataset shards in tar.
  • Cross-team reproducibility: HuggingFace datasets or versioned Parquet.
Was this article helpful?