Raw Binary Data Architecture: How Bytes Become Files

Bits, bytes, hex, endianness, and how numbers and text are encoded into the raw binary that every file is made of — the foundation behind hex editors and file forensics.

Open any file in a hex editor and you are looking at the same thing every file ultimately is: a sequence of bytes. A photo, a song, a program, and a spreadsheet differ only in how those bytes are arranged and interpreted. Learn the handful of conventions that govern that arrangement — how numbers and text are encoded, how multi-byte values are ordered, how a format lays out its fields — and the wall of hex turns into something you can read.

This guide builds up from a single bit to the structure of a real file, the foundation that makes tools like the File Inspector and a hex editor make sense.

Bits, bytes, and nibbles

The smallest unit of data is the bit: a single 1 or 0. Bits are grouped into bytes of eight, and one byte can represent 2^8 = 256 distinct values, from 0 to 255. The byte is the fundamental addressable unit of almost all computing: file sizes, memory, and network transfers are all measured in bytes. Four bits, or half a byte, is occasionally called a nibble, and it matters because a nibble maps exactly to one hexadecimal digit.

Why binary is read in hexadecimal

Writing bytes as raw binary is unwieldy: the single byte 11111111 is hard to scan, and files contain millions of them. Hexadecimal (base 16) solves this elegantly. Because 16 is 2^4, one hex digit encodes exactly four bits, so every byte is exactly two hex digits, from 00 for all-zeros up to FF for all-ones. That clean 1-byte-to-2-characters mapping is why hex is the universal language of hex editors, colour codes, memory addresses, and magic numbers.

Binary                 Hex    Decimal
0000                   0      0
1010                   A      10
1111                   F      15
11111111 (one byte)    FF     255
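
The nibble-to-hex-digit mapping in the table can be checked with a few lines of Python. This is a minimal sketch; the helper names `split_nibbles` and `byte_to_hex` are made up for illustration and are not part of any tool mentioned in this guide.

```python
HEX_DIGITS = "0123456789ABCDEF"

def split_nibbles(byte):
    """Return the (high, low) 4-bit halves of a value in 0..255."""
    if not 0 <= byte <= 255:
        raise ValueError("a byte must be in the range 0..255")
    return byte >> 4, byte & 0x0F

def byte_to_hex(byte):
    """Build the two-digit hex form by looking up each nibble separately."""
    high, low = split_nibbles(byte)
    return HEX_DIGITS[high] + HEX_DIGITS[low]

for value in (0b0000, 0b1010, 0b1111, 0b11111111):
    # Binary, hex, and decimal views of the same value.
    print(f"{value:08b}  {byte_to_hex(value)}  {value}")
```

Python's built-in formatting (`f"{value:02X}"`) gives the same two digits; the lookup version is only there to make the one-nibble-per-digit rule visible.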

Endianness: byte order matters

A single byte holds 0–255, but most numbers need more room, so they span several bytes. That raises a question: in what order are those bytes stored? The two answers are endianness. In little-endian, the least-significant byte comes first; in big-endian, the most-significant byte comes first. The 32-bit value 1 is stored as 01 00 00 00 little-endian but 00 00 00 01 big-endian.

This is not academic. Intel and AMD x86 chips and most ARM devices are little-endian, while many network protocols and some file formats are big-endian (“network byte order”). Read a multi-byte field with the wrong endianness and you get a nonsensical value — which is why a good hex editor’s data inspector shows both interpretations side by side.
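
As a small illustration of the two byte orders, the sketch below uses Python's built-in `int.to_bytes` and `int.from_bytes`. The sample field `00 00 01 00` is invented for the example.

```python
value = 1

# The same 32-bit value laid out in each byte order.
little = value.to_bytes(4, "little")
big = value.to_bytes(4, "big")
print(little.hex(" "))  # 01 00 00 00
print(big.hex(" "))     # 00 00 00 01

# Reading one 4-byte field both ways, as a data inspector would.
raw = bytes.fromhex("00 00 01 00")
as_big = int.from_bytes(raw, "big")
as_little = int.from_bytes(raw, "little")
print(as_big, as_little)  # 256 65536
```

The two readings of the same four bytes differ by a factor of 256 here, which is the kind of mismatch that makes a wrongly decoded length or offset field look nonsensical.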

Encoding numbers: integers and floats

Unsigned integers are the simplest case: the bytes are just the number written in base 256. Signed integers (which can be negative) use a scheme called two’s complement, where the top bit indicates sign and negative numbers wrap around from the top of the range, so the 32-bit pattern FF FF FF FF is 4,294,967,295 when read as unsigned but -1 when read as signed. Fractional numbers use the IEEE 754 floating-point standard, which packs a sign, an exponent, and a fraction into 32 or 64 bits. Because values such as 0.1 have no exact binary representation in that scheme, 0.1 + 0.2 famously does not exactly equal 0.3 in most languages. The point is that the same four bytes mean different things depending on how you are told to read them; the bytes carry no inherent type.
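
To see that bytes carry no inherent type, the following sketch decodes the same four bytes three ways with Python's standard `struct` module. The function name `interpret` and the sample byte patterns are chosen for illustration; little-endian order is assumed throughout.

```python
import struct

def interpret(raw):
    """Decode 4 little-endian bytes as unsigned int, signed int, and float."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return {
        "uint32": struct.unpack("<I", raw)[0],
        "int32": struct.unpack("<i", raw)[0],
        "float32": struct.unpack("<f", raw)[0],
    }

all_ones = interpret(bytes.fromhex("ff ff ff ff"))
print(all_ones["uint32"], all_ones["int32"])  # 4294967295 -1

one_point_zero = interpret(bytes.fromhex("00 00 80 3f"))
print(one_point_zero["uint32"], one_point_zero["float32"])  # 1065353216 1.0

# The classic floating-point surprise: the sum is close to 0.3, not equal.
print(0.1 + 0.2 == 0.3)  # False
```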

Encoding text

Text is stored as numbers too. ASCII assigns the values 0–127 to English letters, digits, punctuation, and control codes — the letter A is 65 (0x41). ASCII cannot represent most of the world’s characters, so modern files use UTF-8, which encodes the entire Unicode set using one to four bytes per character. UTF-8’s clever design keeps the first 128 values identical to ASCII, so plain English text is byte-for-byte the same in both — which is why readable strings jump straight out of the ASCII pane of a hex view.
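
The one-to-four-byte behaviour of UTF-8 is easy to observe directly. This is a minimal sketch with a hypothetical helper `describe`; the sample characters are written as escapes so the snippet does not depend on the source file's own encoding.

```python
def describe(text):
    """List each character with its code point, UTF-8 bytes, and byte count."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        rows.append((f"U+{ord(ch):04X}", encoded.hex(" "), len(encoded)))
    return rows

# "A", e-acute, the euro sign, and an emoji: 1, 2, 3, and 4 bytes.
sample = "A\u00e9\u20ac\U0001F600"
for code_point, hex_bytes, size in describe(sample):
    print(code_point, hex_bytes, size)

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))  # True
```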

From bytes to structure

A format is a contract about what the bytes at each position mean. Most files share a common skeleton:

  • A header at the start — beginning with the magic number — declaring the format and key parameters (a PNG header carries image width, height, and colour type).
  • A body of fields or chunks, each at a known offset (distance in bytes from the start) or introduced by a length value that says how many bytes follow.
  • Sometimes a footer marking the end and helping detect truncation.
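
As a concrete case of a header with fields at known offsets, the sketch below checks the PNG magic number and reads the width, height, bit depth, and colour type that follow it. The helpers `make_png_header` and `read_png_header` are hypothetical names; the first one builds a header in memory only so the example runs without a file on disk. The offsets assume the standard PNG layout, where the 8-byte signature is followed by the IHDR chunk's 4-byte length and 4-byte type, so the fields start at offset 16.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_png_header(width, height, bit_depth=8, colour_type=2):
    """Build a PNG signature plus IHDR chunk in memory (test data only)."""
    ihdr = struct.pack(">IIBBBBB", width, height, bit_depth, colour_type, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + ihdr)
    return (PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR"
            + ihdr + struct.pack(">I", crc))

def read_png_header(data):
    """Check the magic number, then read big-endian fields at fixed offsets."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG: magic number mismatch")
    if len(data) < 26:
        raise ValueError("truncated header")
    width, height, bit_depth, colour_type = struct.unpack_from(">IIBB", data, 16)
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "colour_type": colour_type}

header = make_png_header(640, 480)
print(read_png_header(header))
```

Note the `>` in the format strings: PNG stores its multi-byte integers big-endian, so reading them with a little-endian format would give the wrong dimensions.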

Many robust formats are chunked: PNG, RIFF/WAV, and MP4 all store data as a series of typed, length-prefixed blocks. This is why a parser can walk a file precisely — read a chunk’s type and length, jump that many bytes, read the next — and why a wrong length value immediately signals corruption or something hidden.
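
The read-length-then-jump walk described above might look like the following for PNG-style chunks, where each chunk is a 4-byte big-endian length, a 4-byte type, the payload, and a 4-byte CRC. This is a sketch rather than a full parser: `make_chunk` and `walk_chunks` are illustrative names, and the sample data is synthetic, not a decodable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, payload):
    """Assemble one length-prefixed chunk (used here to build test data)."""
    crc = zlib.crc32(chunk_type + payload)
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)

def walk_chunks(data):
    """Yield (type, length, crc_ok) for each chunk after the signature."""
    offset = len(PNG_SIGNATURE)
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError(f"truncated chunk header at offset {offset}")
        length, chunk_type = struct.unpack_from(">I4s", data, offset)
        payload_end = offset + 8 + length
        if payload_end + 4 > len(data):
            # A length that points past the end of the file signals corruption.
            raise ValueError(f"chunk at offset {offset} overruns the file")
        payload = data[offset + 8:payload_end]
        (stored_crc,) = struct.unpack_from(">I", data, payload_end)
        crc_ok = zlib.crc32(chunk_type + payload) == stored_crc
        yield chunk_type.decode("ascii"), length, crc_ok
        offset = payload_end + 4  # jump to the next chunk

sample = (PNG_SIGNATURE + make_chunk(b"IHDR", bytes(13))
          + make_chunk(b"IDAT", b"hello") + make_chunk(b"IEND", b""))
for entry in walk_chunks(sample):
    print(entry)
```

RIFF/WAV and MP4 follow the same idea with different details (field order, byte order, and whether the length counts the header), so the offsets here apply to the PNG layout only.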

What randomness looks like

Finally, the statistics of the bytes tell a story even before you parse them. Entropy measures how unpredictable the data is, from 0 to 8 bits per byte. Structured data — text, code, simple images — has low, uneven entropy with lots of repetition. Compressed or encrypted data has near-maximum entropy and looks like pure noise. A sudden jump in entropy partway through an otherwise ordinary file is a classic sign of an embedded or hidden payload, which is why entropy visualisations are a staple of file forensics.
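
The entropy figure can be computed as Shannon entropy over byte frequencies. The sketch below assumes that definition; the function names and the 1024-byte window size are arbitrary choices for illustration, and random bytes from `os.urandom` stand in for compressed or encrypted data.

```python
import math
import os
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def window_entropy(data, size=1024):
    """Entropy of each consecutive window, for spotting sudden jumps."""
    return [shannon_entropy(data[i:i + size]) for i in range(0, len(data), size)]

text = b"The quick brown fox jumps over the lazy dog. " * 100
payload = os.urandom(4096)  # stand-in for an embedded encrypted blob

print(shannon_entropy(b"\x00" * 1024))     # 0.0: one value, fully predictable
print(shannon_entropy(bytes(range(256))))  # 8.0: every value equally likely
print([round(e, 2) for e in window_entropy(text[:4096] + payload)])
```

In the last line the first four windows sit well below the last four, which is the kind of step an entropy visualisation makes obvious at a glance.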

Explore real bytes

Theory clicks fast once you poke at actual data. Open the Hex Editor or the File Inspector, drop in a small file, and watch the offset column, the hex bytes, and the ASCII pane line up. Click a value to see it decoded as an integer, float, and timestamp in both byte orders — and the abstract ideas above become something you can see.
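
If no hex editor is to hand, the three-column view described above can be approximated in a few lines. This is a rough sketch of a generic hex dump, not the output format of the Hex Editor or File Inspector; the function name `hexdump` and the 16-byte row width are assumptions.

```python
def hexdump(data, width=16):
    """Format bytes as offset, hex bytes, and an ASCII pane, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row)
        # Non-printable bytes are shown as "." in the ASCII pane.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  |{ascii_part}|")
    return "\n".join(lines)

print(hexdump(b"\x89PNG\r\n\x1a\n" + b"Hello, bytes!"))
```

To dump a real file, read it in binary mode first, for example `hexdump(open(path, "rb").read()[:256])`, where `path` is whatever small file you want to inspect.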

Frequently asked questions

Why is binary data shown in hexadecimal?

Because one hexadecimal digit represents exactly four bits, so one byte (eight bits) is always exactly two hex digits. Hex is a compact, unambiguous shorthand for binary that is far easier to read than long strings of 1s and 0s.

What is endianness?

The order in which the bytes of a multi-byte number are stored. Little-endian puts the least-significant byte first (used by x86 and most ARM devices); big-endian puts the most-significant byte first (used in many network protocols). Reading a number with the wrong endianness gives a wildly wrong value.

How is text stored in a binary file?

As numbers, via an encoding. ASCII maps characters to values 0–127; UTF-8 extends that to all of Unicode using one to four bytes per character while staying backward-compatible with ASCII.

What does high entropy in a file mean?

That the bytes are highly unpredictable, which usually means the data is compressed or encrypted. Low, uneven entropy means structured data like text or code.
