Definition
German String
A German string is a fixed-size, immutable string representation, used mainly in analytical database systems, in which each value is stored in a 128-bit record. Short strings are stored directly inside that record. Long strings store their length, a cached four-byte prefix, and a pointer to the payload. 1
The format is designed for read-heavy workloads, where strings are compared, filtered, and scanned far more often than they are modified.
Motivation
Compared with C-style strings or std::string-style layouts, a German string is tuned for analytical workloads:
- most values are short
- many operations inspect only the first few characters
- strings are usually read many times but rarely updated in place
- avoiding per-value capacity metadata saves space
This makes immutability and cached prefixes more useful than cheap append operations.
Layout
Every German string occupies 16 bytes (128 bits).
Short strings
If the length is at most 12 bytes, the payload is stored inline.
- the record stores the length
- the remaining bytes store the payload directly
- reading the prefix needs no pointer dereference and no extra allocation
Long strings
If the length is greater than 12 bytes, the payload is stored out of line.
- a 32-bit length field stores the total length
- a four-byte prefix caches the beginning of the string
- a pointer refers to the full payload
- the payload buffer has exactly the required size; there is no spare capacity field
The cached prefix lets the system reject many comparisons early. This is useful for equality tests, prefix tests, and lexicographic order.
The representation is determined by the length alone: strings of at most 12 bytes use the inline form, and longer strings use the out-of-line form.
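The layout described above can be sketched in C++ as follows. This is a minimal illustration, not the layout of any particular database system: it assumes 64-bit pointers, a 32-bit length, and a 12-byte inline limit, and the names GermanString, kInlineLimit, and make_german_string are hypothetical. Long strings here simply borrow the caller's buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <limits>
#include <stdexcept>
#include <string_view>

// Assumed inline limit: 16-byte record minus the 4-byte length field.
constexpr std::uint32_t kInlineLimit = 12;

// Hypothetical 16-byte descriptor (field names are illustrative).
struct GermanString {
    std::uint32_t length;   // total length in bytes
    char prefix[4];         // first four bytes, available in both forms
    union {
        char rest[8];       // short form: payload bytes 4..11
        const char* data;   // long form: pointer to the full payload
    };

    bool is_inline() const { return length <= kInlineLimit; }
};

static_assert(sizeof(GermanString) == 16, "sketch assumes 64-bit pointers");

// Builds a descriptor; the form is chosen from the length alone.
inline GermanString make_german_string(std::string_view s) {
    if (s.size() > std::numeric_limits<std::uint32_t>::max())
        throw std::length_error("payload exceeds 32-bit length field");

    GermanString g{};  // zero-initialised; `rest` is the active union member
    g.length = static_cast<std::uint32_t>(s.size());
    if (!s.empty())
        std::memcpy(g.prefix, s.data(), std::min<std::size_t>(s.size(), 4));

    if (g.is_inline()) {
        if (s.size() > 4)
            std::memcpy(g.rest, s.data() + 4, s.size() - 4);
    } else {
        g.data = s.data();  // borrows the caller's buffer, no copy
    }
    return g;
}
```

For example, make_german_string("hello") yields an inline record, while a 28-byte input yields a record whose data member points at the original buffer.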
Storage classes
German strings are classified by who owns the payload and how long it stays valid.
- persistent: the payload stays valid for the whole program lifetime, or otherwise outlives all uses
- temporary: the string owns its payload and releases it when its lifetime ends
- transient: the string only borrows externally managed memory
Transient strings are cheap to construct because they avoid copying. The cost is that the programmer must ensure that the referenced memory remains valid for as long as the string is in use; otherwise the pointer becomes dangling.
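The difference between borrowing and owning can be sketched as follows. This is an illustration under stated assumptions: the Payload struct models only the out-of-line part of a long string, and StorageClass, make_transient, make_temporary, and release are hypothetical names. How a real system records the storage class is not specified here.

```cpp
#include <cstdint>
#include <cstring>

enum class StorageClass : std::uint8_t { Persistent, Temporary, Transient };

// Hypothetical model of a long string's out-of-line payload.
struct Payload {
    const char* data;
    std::uint32_t length;
    StorageClass storage;
};

// Borrows caller-managed memory: no copy, but the caller must keep
// `buf` alive for as long as the result is used.
inline Payload make_transient(const char* buf, std::uint32_t len) {
    return Payload{buf, len, StorageClass::Transient};
}

// Copies the bytes into an exact-size buffer that the result owns.
inline Payload make_temporary(const Payload& src) {
    char* copy = new char[src.length];
    std::memcpy(copy, src.data, src.length);
    return Payload{copy, src.length, StorageClass::Temporary};
}

// Only temporary payloads are freed by their holder; persistent and
// transient payloads are left to whoever manages that memory.
inline void release(Payload& p) {
    if (p.storage == StorageClass::Temporary)
        delete[] p.data;
    p.data = nullptr;
    p.length = 0;
}
```

In this sketch, a transient payload that must outlive its source buffer is first converted with make_temporary, which trades one copy for an independent lifetime.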
Properties
Advantages
- Space efficiency: one fixed 16-byte record per value, with no separate capacity field
- Fast short strings: short values avoid heap allocation entirely
- Fast prefix access: the first four bytes are always directly available
- Read-friendly immutability: immutable payloads simplify repeated reads and comparisons
- Cheap passing: the descriptor fits into two 64-bit machine words
Trade-offs
Limits
- Long strings use a 32-bit length, so the maximum size is about 4 GiB (2^32 − 1 bytes).
- Appending is expensive because the representation is immutable and uses exact-size buffers.
- Transient strings require careful lifetime management.
Example
Prefix filtering
In a query such as starts_with(content, 'http'), a long string can often be rejected by inspecting only the cached four-byte prefix. Because the pattern 'http' is itself four bytes long, a matching prefix also settles the test. The system needs to follow the pointer only when the pattern is longer than four bytes and the cached prefix matches.
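A prefix-first filter of this kind might look like the sketch below. It assumes a 16-byte record with a 32-bit length, a four-byte prefix, and 64-bit pointers. GermanString, borrow_german_string, and german_starts_with are hypothetical names, and the optional followed_pointer flag exists only to make the pointer access observable.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string_view>

// Assumed 16-byte record; field names are illustrative.
struct GermanString {
    std::uint32_t length;
    char prefix[4];
    union {
        char rest[8];       // inline payload bytes 4..11 (length <= 12)
        const char* data;   // out-of-line payload (length > 12)
    };
};

// Hypothetical helper: builds a record that borrows `s` when it is long.
// Assumes s.size() fits in 32 bits.
inline GermanString borrow_german_string(std::string_view s) {
    GermanString g{};
    g.length = static_cast<std::uint32_t>(s.size());
    if (!s.empty())
        std::memcpy(g.prefix, s.data(), std::min<std::size_t>(s.size(), 4));
    if (s.size() <= 12) {
        if (s.size() > 4) std::memcpy(g.rest, s.data() + 4, s.size() - 4);
    } else {
        g.data = s.data();
    }
    return g;
}

// Checks the cached prefix first and touches the payload only if needed.
inline bool german_starts_with(const GermanString& s, std::string_view pat,
                               bool* followed_pointer = nullptr) {
    if (followed_pointer) *followed_pointer = false;
    if (pat.empty()) return true;
    if (pat.size() > s.length) return false;

    const std::size_t head = std::min<std::size_t>(pat.size(), 4);
    if (std::memcmp(s.prefix, pat.data(), head) != 0)
        return false;                       // rejected by the prefix alone
    if (pat.size() <= 4) return true;       // decided by the prefix alone

    const std::size_t tail = pat.size() - 4;
    if (s.length <= 12)                     // inline: remaining bytes are local
        return std::memcmp(s.rest, pat.data() + 4, tail) == 0;

    if (followed_pointer) *followed_pointer = true;
    return std::memcmp(s.data + 4, pat.data() + 4, tail) == 0;
}
```

With a four-byte pattern such as "http", this sketch never dereferences the pointer. A longer pattern such as "https://" does so only for strings whose cached prefix already equals "http".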