Definition

German String

A German string is a fixed-size, immutable string representation, used mainly in analytical database systems, in which each value is stored in a 128-bit record. Short strings are stored directly inside that record. Long strings store their length, a cached four-byte prefix, and a pointer to the payload. 1

The format is designed for read-heavy workloads, where strings are compared, filtered, and scanned far more often than they are modified.

Motivation

Compared with C-style strings or std::string-style layouts, a German string is tuned for analytical workloads, based on these observations:

  • most values are short
  • many operations inspect only the first few characters
  • strings are usually read many times but rarely updated in place
  • avoiding per-value capacity metadata saves space

This makes immutability and cached prefixes more useful than cheap append operations.

Layout

Every German string occupies 16 bytes.

Short strings

If the length is at most 12 bytes, the payload is stored inline.

  • a 32-bit length field stores the length
  • the remaining 12 bytes store the payload directly
  • reading the prefix needs no pointer dereference and no extra allocation

Long strings

If the length is greater than 12 bytes, the payload is stored out of line.

  • a 32-bit length field stores the total length
  • a four-byte prefix caches the beginning of the string
  • a pointer refers to the full payload
  • the payload buffer has exactly the required size; there is no spare capacity field

The cached prefix lets the system reject many comparisons early. This is useful for equality tests, prefix tests, and lexicographic order.

The representation is determined by the length alone: a string of at most 12 bytes uses the short form, and any longer string uses the long form.
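The layout described in this section can be sketched in C++ roughly as follows. This is a minimal illustration, not the layout of any particular system: the names GermanString, body, view, and make_german_string are hypothetical, the sketch assumes 64-bit pointers, and the builder simply borrows the caller's bytes for long strings.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Assumption of this sketch: pointers are 8 bytes wide.
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Hypothetical 16-byte record; all names are illustrative.
struct GermanString {
    static constexpr std::uint32_t kInlineMax = 12;

    std::uint32_t length;  // payload length in bytes
    // Short form: the payload itself.
    // Long form: bytes [0, 4) cache the prefix, bytes [4, 12) hold the pointer.
    char body[12];

    // The form is decided by the length alone.
    bool is_inline() const { return length <= kInlineMax; }

    // Returns the payload; only long strings follow the pointer.
    // For short strings the view points into this record.
    std::string_view view() const {
        if (is_inline()) return std::string_view(body, length);
        const char* data;
        std::memcpy(&data, body + 4, sizeof data);
        return std::string_view(data, length);
    }
};

static_assert(sizeof(GermanString) == 16, "one record is 128 bits");

// Builds a record. Long inputs are borrowed, not copied, so the caller
// must keep the bytes behind `s` alive for as long as the record is used.
inline GermanString make_german_string(std::string_view s) {
    assert(s.size() <= UINT32_MAX);
    GermanString g{};  // zero-initialized
    g.length = static_cast<std::uint32_t>(s.size());
    if (g.is_inline()) {
        if (!s.empty()) std::memcpy(g.body, s.data(), s.size());
    } else {
        std::memcpy(g.body, s.data(), 4);  // cached prefix
        const char* data = s.data();
        std::memcpy(g.body + 4, &data, sizeof data);
    }
    return g;
}
```

In both forms the first four payload bytes sit at the same offset in the record, which is what makes the prefix available without a pointer dereference.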

Storage classes

German strings distinguish three storage classes, which differ in how the payload is owned and how long it stays valid.

  • persistent: the payload stays valid for the whole program lifetime, or otherwise outlives all uses
  • temporary: the string owns its payload and releases it when its lifetime ends
  • transient: the string only borrows externally managed memory

Transient strings are cheap to construct because they avoid copying. The cost is that the programmer must ensure that the referenced memory stays valid for as long as the string is used. Otherwise the string is left holding a dangling pointer.
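The difference between the three classes shows up mainly when a payload is created and released. The following C++ sketch illustrates that under simplifying assumptions: Payload, StorageClass, the make_* helpers, and release are hypothetical names, and the class is kept in a separate field purely for readability. Where a real system records the class is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum class StorageClass { Persistent, Temporary, Transient };

// Hypothetical handle for an out-of-line payload.
struct Payload {
    const char* data;
    std::size_t length;
    StorageClass storage;
};

// Persistent: wraps memory that outlives every use, such as a string literal.
inline Payload make_persistent(const char* literal) {
    return {literal, std::strlen(literal), StorageClass::Persistent};
}

// Temporary: copies the bytes into an exact-size buffer owned by the handle.
inline Payload make_temporary(const char* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    char* buf = new char[n];
    if (n != 0) std::memcpy(buf, src, n);
    return {buf, n, StorageClass::Temporary};
}

// Transient: borrows without copying; the caller guarantees validity.
inline Payload make_transient(const char* src, std::size_t n) {
    return {src, n, StorageClass::Transient};
}

// Frees the buffer only if the handle owns it; returns whether it did.
inline bool release(Payload& p) {
    const bool owned = (p.storage == StorageClass::Temporary);
    if (owned) delete[] p.data;
    p.data = nullptr;
    p.length = 0;
    return owned;
}
```

A transient payload sees later changes to the borrowed buffer, and it dangles once that buffer is freed. A temporary payload is unaffected by either, because it holds its own copy.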

Properties

Advantages

  • Space efficiency: one fixed 16-byte record per value, with no separate capacity field
  • Fast short strings: short values avoid heap allocation entirely
  • Fast prefix access: the first four bytes are always directly available
  • Read-friendly immutability: immutable payloads simplify repeated reads and comparisons
  • Cheap passing: the descriptor fits into two 64-bit machine words

Trade-offs

Limits

  • Long strings use a 32-bit length, so the maximum size is about 4 GiB.
  • Appending is expensive because the representation is immutable and uses exact-size buffers.
  • Transient strings require careful lifetime management.
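To make the append cost concrete, the sketch below shows what "appending" amounts to when payloads are immutable and sized exactly: a new buffer is allocated and both inputs are copied into it. OwnedPayload, make_payload, and concat are hypothetical names, and the length check reflects the 32-bit length field assumed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <memory>
#include <stdexcept>

// Hypothetical owning payload: an exact-size buffer plus a 32-bit length.
struct OwnedPayload {
    std::unique_ptr<char[]> data;
    std::uint32_t length;
};

// Copies `n` bytes from `src` into a freshly allocated exact-size buffer.
inline OwnedPayload make_payload(const char* src, std::size_t n) {
    assert(n <= std::numeric_limits<std::uint32_t>::max());
    OwnedPayload p{std::make_unique<char[]>(n), static_cast<std::uint32_t>(n)};
    if (n != 0) std::memcpy(p.data.get(), src, n);
    return p;
}

// "Appending" builds a third payload; both inputs are left untouched.
// The cost is one allocation plus a copy of every byte of both inputs.
inline OwnedPayload concat(const OwnedPayload& a, const OwnedPayload& b) {
    const std::uint64_t total = std::uint64_t{a.length} + b.length;
    if (total > std::numeric_limits<std::uint32_t>::max())
        throw std::length_error("result exceeds the 32-bit length field");
    OwnedPayload out{std::make_unique<char[]>(total),
                     static_cast<std::uint32_t>(total)};
    std::memcpy(out.data.get(), a.data.get(), a.length);
    std::memcpy(out.data.get() + a.length, b.data.get(), b.length);
    return out;
}
```

Because there is no spare capacity, building a string from k pieces this way copies the early pieces again on every step, which is why the format suits values that are written once and read many times.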

Example

Prefix filtering

In a query such as starts_with(content, 'http'), a long string can often be rejected by inspecting the cached four-byte prefix. The system only needs to follow the pointer when that prefix matches, or when more bytes must be checked.
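A prefix test along these lines might look like the following C++ sketch. It covers only the long form, and LongString, make_long, and the dereferenced out-parameter are hypothetical names added for illustration; the out-parameter exists only to show when the payload pointer is actually followed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical view of a long string's record.
struct LongString {
    std::uint32_t length;  // total payload length
    char prefix[4];        // cached first four bytes
    const char* data;      // full payload (borrowed in this sketch)
};

// Builds a long-form record over bytes that the caller keeps alive.
inline LongString make_long(std::string_view s) {
    assert(s.size() > 12 && s.size() <= UINT32_MAX);
    LongString r{static_cast<std::uint32_t>(s.size()), {}, s.data()};
    std::memcpy(r.prefix, s.data(), 4);
    return r;
}

// Tests whether `s` begins with `needle`. `dereferenced` is set to true
// only if the payload pointer had to be followed.
inline bool starts_with(const LongString& s, std::string_view needle,
                        bool& dereferenced) {
    dereferenced = false;
    if (needle.empty()) return true;
    if (needle.size() > s.length) return false;
    // Step 1: compare against the cached prefix inside the record.
    const std::size_t head = std::min<std::size_t>(needle.size(), 4);
    if (std::memcmp(s.prefix, needle.data(), head) != 0) return false;
    if (needle.size() <= 4) return true;  // the prefix alone decided it
    // Step 2: the prefix matched and more bytes remain, so read the payload.
    dereferenced = true;
    return std::memcmp(s.data + 4, needle.data() + 4, needle.size() - 4) == 0;
}
```

For a four-byte pattern such as 'http', the cached prefix settles both matches and mismatches without touching the payload. Longer patterns reach the payload only after their first four bytes have matched.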

Footnotes

  1. "Why German Strings are Everywhere", CedarDB.