AI Dataset Formats
Common dataset formats used for training and evaluating AI — JSONL, Parquet, HuggingFace datasets, and more.
Formats
| Format | Best for | Notes |
|---|---|---|
| CSV | Small tabular data | Simple, universal; type info lost |
| TSV | Text fields containing commas | Same as CSV but tab-delimited |
| JSON | Nested or hierarchical data | Verbose; whole-file parse |
| JSONL (NDJSON) | Streaming records, LLM fine-tuning | One JSON object per line |
| Parquet | Columnar analytics | Compressed, efficient column reads |
| Feather / Arrow IPC | Fast in-memory handoff | Zero-copy to pandas / polars |
| HDF5 | Multidimensional arrays | Common in older ML (Keras) |
| NumPy .npy / .npz | Raw arrays | Simple dump / restore |
| TFRecord | TensorFlow pipelines | Protocol Buffer records |
| WebDataset (tar) | Large-scale vision / multimodal | Streams shards from disk or cloud |
| HuggingFace datasets | Training / evaluation | Memory-mapped Arrow backend |
| MLflow model / signature | Serving | Metadata + artifacts |
| Safetensors | Model weights (not data) | Secure alternative to pickle |
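As a quick orientation for the tabular rows above, here is a minimal Python sketch (assuming pandas with pyarrow installed; the file names are placeholders) that round-trips one small table through CSV, Parquet, and JSONL:

```python
import pandas as pd

# A small demo table; any tabular data works the same way.
df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})

# CSV: universal, but dtypes must be re-inferred on load.
df.to_csv("data.csv", index=False)

# Parquet: compressed, columnar; reads back with types intact
# (requires pyarrow or fastparquet).
df.to_parquet("data.parquet")
ids = pd.read_parquet("data.parquet", columns=["id"])  # column-pruned read

# JSONL: one record per line, easy to stream and append.
df.to_json("data.jsonl", orient="records", lines=True)
```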
LLM fine-tuning formats
```
// OpenAI chat fine-tune (JSONL); wrapped here for readability,
// but each record occupies a single line in the actual file.
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "Paris."}
]}

// Alpaca-style (JSONL)
{"instruction": "...", "input": "...", "output": "..."}

// Preference (DPO)
{"prompt": "...", "chosen": "...", "rejected": "..."}
```
Picking a format
- Tabular, < 1 GB: CSV or Parquet.
- Tabular, > 1 GB: Parquet — columnar reads + compression.
- Records, streaming: JSONL.
- Images / video at scale: WebDataset shards in tar.
- Cross-team reproducibility: HuggingFace datasets or versioned Parquet.
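For the streaming and reproducibility bullets, a sketch of loading JSONL with the HuggingFace `datasets` library (the file name `train.jsonl` is a placeholder); `streaming=True` iterates records without materializing the whole file:

```python
from datasets import load_dataset

# Memory-mapped load of a local JSONL file into an Arrow-backed dataset.
ds = load_dataset("json", data_files="train.jsonl", split="train")

# For very large files, stream records instead of loading everything.
streamed = load_dataset("json", data_files="train.jsonl",
                        split="train", streaming=True)
for row in streamed:
    print(row)
    break
```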