AI Dataset Formats

Formats

Format	Best for	Notes
CSV	Small tabular data	Simple, universal; type info lost
TSV	Tabular text whose fields contain commas	Same as CSV but tab-delimited
JSON	Nested or hierarchical data	Verbose; typically parsed as a whole file
JSONL (NDJSON)	Streaming records, LLM fine-tuning	One JSON object per line
Parquet	Columnar analytics	Compressed, efficient column reads
Feather / Arrow IPC	Fast in-memory handoff	Zero-copy reads into Arrow memory; conversion to pandas may still copy
HDF5	Multidimensional arrays	Common in older ML workflows (e.g. Keras .h5 files)
NumPy .npy / .npz	Raw arrays	Simple dump / restore
TFRecord	TensorFlow pipelines	Protocol Buffer records
WebDataset (tar)	Large-scale vision / multimodal	Streams shards from disk or cloud
Hugging Face datasets	Training / evaluation	Memory-mapped Arrow backend
MLflow model / signature	Serving	Metadata + artifacts
Safetensors	Model weights (not data)	Safer alternative to pickle; loading does not execute code
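
The "type info lost" note for CSV is easiest to see with a round trip. The sketch below uses only the Python standard library; the sample rows and variable names are made up for illustration.

```python
import csv
import io
import json

# Illustrative rows with an int, a float, and a missing value.
rows = [
    {"id": 1, "score": 0.5, "label": None},
    {"id": 2, "score": 1.0, "label": "cat"},
]

# CSV round trip: write the rows, then read them back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "score", "label"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
csv_rows = list(csv.DictReader(buf))

# JSONL round trip: one JSON object per line.
jsonl_text = "\n".join(json.dumps(row) for row in rows)
jsonl_rows = [json.loads(line) for line in jsonl_text.split("\n")]

# Every CSV value comes back as a string, and None becomes "".
print(csv_rows[0])
# JSONL keeps ints, floats, and nulls distinct.
print(jsonl_rows[0])
```

Columnar formats such as Parquet go further and store an explicit schema, which is one reason they are preferred once column types matter.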

LLM fine-tuning formats

// OpenAI chat fine-tune (JSONL); pretty-printed here, but each record must occupy a single line in the file
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "Paris."}
]}

// Alpaca-style (JSONL)
{"instruction": "...", "input": "...", "output": "..."}

// Preference pairs for DPO (JSONL)
{"prompt": "...", "chosen": "...", "rejected": "..."}
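
A quick shape check before training catches most malformed files. The following is a minimal sketch, assuming the three record layouts shown above; the helper names (to_jsonl, read_jsonl, record_kind) and the required-key sets are illustrative assumptions, not part of any library or vendor specification.

```python
import json

# Keys each layout is assumed to require, taken from the samples above.
REQUIRED_KEYS = {
    "chat": {"messages"},
    "alpaca": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
}


def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def read_jsonl(text):
    """Parse JSONL text, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def record_kind(record):
    """Return 'chat', 'alpaca', or 'preference'; raise ValueError otherwise."""
    for kind, keys in REQUIRED_KEYS.items():
        if keys <= record.keys():
            if kind == "chat":
                # Each chat message needs at least a role and content.
                for message in record["messages"]:
                    if not {"role", "content"} <= message.keys():
                        raise ValueError("chat message lacks role or content")
            return kind
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


# Usage: round-trip one record of each kind and classify it.
records = [
    {"messages": [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"},
]
text = to_jsonl(records)
kinds = [record_kind(record) for record in read_jsonl(text)]
print(kinds)
```

Splitting on "\n" rather than on all Unicode line boundaries keeps records intact even when a string value contains unusual line-separator characters.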

Picking a format

Tabular, < 1 GB: CSV or Parquet (Parquet if column types must survive).
Tabular, > 1 GB: Parquet, for columnar reads and compression.
Records, streaming: JSONL.
Images / video at scale: WebDataset shards in tar.
Cross-team reproducibility: Hugging Face datasets or versioned Parquet.
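
For the image / video case, the WebDataset convention is that files in a tar shard sharing a basename form one sample (for example 000000.jpg and 000000.json). The sketch below imitates that layout with the standard tarfile module rather than the webdataset library itself; write_shard and read_shard are hypothetical helper names, and the byte strings stand in for real image data.

```python
import io
import tarfile


def write_shard(samples):
    """Pack {key: {extension: bytes}} into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples.items():
            for ext, data in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Group consecutive members by key into (key, {extension: bytes}) pairs.

    Simplification: the key is everything before the first dot in the
    member name, so directory names are assumed to contain no dots.
    """
    samples = []
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            # Start a new sample whenever the key changes.
            if not samples or samples[-1][0] != key:
                samples.append((key, {}))
            samples[-1][1][ext] = tar.extractfile(member).read()
    return samples


# Usage: two fake samples, each with an "image" and a JSON label.
example = {
    "000000": {"jpg": b"fake-image-0", "json": b'{"label": 3}'},
    "000001": {"jpg": b"fake-image-1", "json": b'{"label": 7}'},
}
shard = write_shard(example)
loaded = read_shard(shard)
print([key for key, _ in loaded])
```

Because the files of each sample sit next to each other, a reader can consume the archive front to back without seeking, which is what makes such shards convenient to stream.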
