Key concepts
A pointer is a region or a collection of other pointers
High-level languages allow programmers to describe and manipulate conceptual ideas as succinct, individual building blocks of machine execution. Nowadays, thanks to decades of compiler research, resulting low-level machine states are often reliably indecipherable, bearing no resemblance at all to even basic data abstractions.
The ethdebug/format/pointer schema provides a tree-based syntax for representing complete (and often minutely detailed) address information for finding a particular high-level data object. (For instance: a compiler may need to inform a debugger about where to find a particular array in memory.) As such, this schema is specified recursively: a pointer is either a single, continuous sequence of byte addresses (a region), or it aggregates other pointers (a collection).
A region is a single continuous range of byte addresses
For simple allocations (like those that fit into a single word), the ethdebug/format/pointer representation is also quite simple: just a single, optionally named, continuous chunk of bytes in the machine state.
Example: Pointing to the first 32 bytes of memory
{
"name": "memory-start",
"location": "memory",
"offset": "0x0000000000000000000000000000000000000000000000000000000000000000",
"length": 32
}
This schema defines the concept of a region to be the representation of the addressing details for a particular block of continuous bytes. Different data locations use different, location-specific schemas for regions (since, e.g., stack regions are very different from storage regions). The ethdebug/format/pointer/region schema aggregates these using the "location" field as a polymorphic discriminator.
A collection aggregates other pointers
Other allocations are not so cleanly represented by a single continuous block of bytes anywhere. In these situations, the ethdebug/format/pointer representation can describe the aggregation of other composed pointers.
Example: Solidity struct in memory
Consider the struct definition:
struct Record {
uint256 x;
uint256 y;
}
A minimal way to represent one possible memory allocation for this type is to "group" together two single-region pointers, one for each of the two struct members.
{
"group": [{
"location": "memory",
"offset": "0x40",
"length": 32
}, {
"location": "memory",
"offset": "0x60",
"length": 32
}]
}
This kind of pointer is defined to be a collection of other pointers. This schema includes several sub-schemas for different kinds of collections (since, e.g., the allocation of struct types is very different from the allocation of array types). The ethdebug/format/pointer/collection schema aggregates these.
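The recursive shape described above (a pointer is either a region or a collection) can be walked with a few lines of code. The following is a minimal sketch, not part of the format itself: the function name iter_regions is hypothetical, and it only recognizes the "group" collection shown in this section, treating any object with a "location" field as a region.

```python
def iter_regions(pointer):
    """Yield every region reachable from a pointer, depth first."""
    if "group" in pointer:
        # A collection: recurse into each child pointer in order.
        for child in pointer["group"]:
            yield from iter_regions(child)
    elif "location" in pointer:
        # A region: a leaf of the pointer tree.
        yield pointer
    else:
        raise ValueError(f"unsupported pointer shape: {sorted(pointer)}")


# The struct Record allocation from the example above.
record = {
    "group": [
        {"location": "memory", "offset": "0x40", "length": 32},
        {"location": "memory", "offset": "0x60", "length": 32},
    ]
}

regions = list(iter_regions(record))
```

Because groups may nest, the same function would also flatten a group of groups without modification.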
Pointers allow describing a value as a complex expression
The examples above all use literal values for "offset" and "length", but these will very often be impossible to predict at compile-time. This format provides a JSON-based expression syntax for relating the data components involved in a complex allocation.
Besides representing byte offsets, word addresses, byte range lengths, etc. using just unsigned integer literals, this schema also allows representing addressing details via numeric operations, references to named regions, explicit EVM lookup, and so on.
Example: Region defined using an arithmetic operation
{
"location": "memory",
"offset": {
"$sum": [1, 1]
},
"length": 1
}
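To illustrate how a debugger might reduce such an expression to a concrete number, here is a minimal sketch of an evaluator. It is an illustration under assumptions rather than the format's reference behavior: the function name evaluate is hypothetical, and it handles only integer literals, "0x"-prefixed hex strings, and "$sum".

```python
def evaluate(expression):
    """Evaluate a literal or "$sum" expression to an unsigned integer."""
    if isinstance(expression, int) and not isinstance(expression, bool):
        # Plain JSON number literal.
        return expression
    if isinstance(expression, str) and expression.startswith("0x"):
        # Hex string literal such as "0x40".
        return int(expression, 16)
    if isinstance(expression, dict) and "$sum" in expression:
        # Arithmetic operation: add up every operand, recursively.
        return sum(evaluate(operand) for operand in expression["$sum"])
    raise ValueError(f"unsupported expression: {expression!r}")


region = {"location": "memory", "offset": {"$sum": [1, 1]}, "length": 1}

# Replace each expression with its concrete value.
resolved = {
    **region,
    "offset": evaluate(region["offset"]),
    "length": evaluate(region["length"]),
}
```

Here the resolved region starts at byte offset 2 in memory and is 1 byte long.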
More expressively, regions whose representations include a "name": "<identifier>" property allow the use of this <identifier> in references to this region elsewhere in the pointer representation.
This can be used, for example, to indicate a storage slot whose address is known at some point in execution to be the value at the top of the stack.
Example: Naming a region and reading this region's data from machine state
{
"group": [{
"name": "pointer-to-storage-slot",
"location": "stack",
"slot": 0
}, {
"name": "storage-slot",
"location": "storage",
"slot": {
"$read": "pointer-to-storage-slot"
}
}]
}
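A sketch of how "$read" could be resolved against machine state follows. Everything about the state model is an assumption made for illustration: the state is a plain dictionary, the stack is a list of 32-byte values with index 0 as the top, storage is a dictionary keyed by slot number, and the helper names are hypothetical. Bytes read from a region are interpreted as a big-endian unsigned integer.

```python
WORD = 32  # bytes per EVM word


def read_region(state, region):
    """Return the bytes a resolved region refers to (stack and storage only)."""
    if region["location"] == "stack":
        return state["stack"][region["slot"]]
    if region["location"] == "storage":
        # Unset storage slots read as zero.
        return state["storage"].get(region["slot"], bytes(WORD))
    raise NotImplementedError(region["location"])


def evaluate(expression, state, named):
    """Evaluate an integer literal or a "$read" of a previously named region."""
    if isinstance(expression, int):
        return expression
    if isinstance(expression, dict) and "$read" in expression:
        data = read_region(state, named[expression["$read"]])
        return int.from_bytes(data, "big")
    raise NotImplementedError(expression)


def resolve_group(pointer, state):
    """Resolve each region's "slot" in order, remembering named regions."""
    named, resolved = {}, []
    for region in pointer["group"]:
        concrete = dict(region, slot=evaluate(region["slot"], state, named))
        if "name" in concrete:
            named[concrete["name"]] = concrete
        resolved.append(concrete)
    return resolved


pointer = {
    "group": [
        {"name": "pointer-to-storage-slot", "location": "stack", "slot": 0},
        {"name": "storage-slot", "location": "storage",
         "slot": {"$read": "pointer-to-storage-slot"}},
    ]
}

# Hypothetical machine state: top of stack holds 7; storage slot 7 holds 42.
state = {
    "stack": [(7).to_bytes(WORD, "big")],
    "storage": {7: (42).to_bytes(WORD, "big")},
}

resolved = resolve_group(pointer, state)
value = int.from_bytes(read_region(state, resolved[1]), "big")
```

With this state, the second region resolves to storage slot 7, and reading it yields 42.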
Regions can also be referenced for the purposes of copying fields (to avoid duplicating constants, etc.). This is useful, e.g., with structs, whose members are often allocated one after the next with no gap. In this example, the sub-pointer corresponding to y is positioned based on x's offset and length.
Example: Defining regions in terms of other regions
{
"group": [{
"name": "record-pointer",
"location": "stack",
"slot": 0
}, {
"name": "record-x",
"location": "memory",
"offset": {
"$read": "record-pointer"
},
"length": 32
}, {
"name": "record-y",
"location": "memory",
"offset": {
"$sum": [{ ".offset": "record-x" }, { ".length": "record-x" }]
},
"length": 32
}]
}
This schema is designed to allow compilers maximal expressiveness in producing self-contained representations that are completely knowable at compile-time.
Collections can be dynamic
For collections whose cardinality or configuration is unpredictable at compile-time (e.g., uint256[] or string storage allocations, respectively), and for collections whose static representation would simply be too cumbersome (e.g., bytes32[3200][5600][111] allocations), ethdebug/format/pointer/collection provides sub-schemas for describing the full set of inter-related data addresses in dynamic terms.
Example: Representing a dynamically-sized array allocation
The following represents an allocation where the array's item count is stored as the leading word-sized sequence of bytes, and each item in the array has the same fixed size and appears in memory sequentially following that with no gaps.
Notice how "list": { ... } expects an object with an expression for the value of "count", the name of the scalar variable to represent "each" item's index in the list, and the underlying pointer for the item itself.
{
"group": [
{
"name": "array-count",
"location": "memory",
"offset": "0x40",
"length": 32
},
{
"list": {
"count": { "$read": "array-count" },
"each": "item-index",
"is": {
"name": "array-item",
"location": "memory",
"offset": {
"$sum": [
{ ".offset": "array-count" },
{ ".length": "array-count" },
{
"$product": [
"item-index",
{ ".length": "array-item" }
]
}
]
},
"length": 32
}
}
}
]
}
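The following sketch expands the "list" collection above into concrete regions against a mock memory. It is illustrative only and rests on several assumptions: the helper names are hypothetical, memory is modeled as a Python bytearray, values are read as big-endian unsigned integers, and a region's "length" is resolved before its "offset" so that the self-reference { ".length": "array-item" } can be looked up.

```python
def evaluate(expr, memory, regions, variables):
    """Evaluate literals, variables, arithmetic, reads, and region lookups."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        # A string is either a bound variable (e.g. "item-index") or hex.
        return variables[expr] if expr in variables else int(expr, 16)
    if "$sum" in expr:
        return sum(evaluate(e, memory, regions, variables) for e in expr["$sum"])
    if "$product" in expr:
        result = 1
        for e in expr["$product"]:
            result *= evaluate(e, memory, regions, variables)
        return result
    if "$read" in expr:
        r = regions[expr["$read"]]
        return int.from_bytes(memory[r["offset"]:r["offset"] + r["length"]], "big")
    if ".offset" in expr:
        return regions[expr[".offset"]]["offset"]
    if ".length" in expr:
        return regions[expr[".length"]]["length"]
    raise ValueError(f"unsupported expression: {expr!r}")


def resolve_region(region, memory, regions, variables):
    """Resolve a memory region; length first so self-references work."""
    concrete = {"location": region["location"]}
    concrete["length"] = evaluate(region["length"], memory, regions, variables)
    if "name" in region:
        regions[region["name"]] = concrete
    concrete["offset"] = evaluate(region["offset"], memory, regions, variables)
    return concrete


def expand_list(spec, memory, regions):
    """Produce one concrete region per item index in [0, count)."""
    count = evaluate(spec["count"], memory, regions, {})
    return [resolve_region(spec["is"], memory, regions, {spec["each"]: i})
            for i in range(count)]


count_region = {"name": "array-count", "location": "memory",
                "offset": "0x40", "length": 32}
list_spec = {
    "count": {"$read": "array-count"},
    "each": "item-index",
    "is": {
        "name": "array-item",
        "location": "memory",
        "offset": {"$sum": [
            {".offset": "array-count"},
            {".length": "array-count"},
            {"$product": ["item-index", {".length": "array-item"}]},
        ]},
        "length": 32,
    },
}

# Mock memory: the word at 0x40 holds the item count (3).
memory = bytearray(0x40 + 32 + 3 * 32)
memory[0x40:0x60] = (3).to_bytes(32, "big")

regions = {}
resolve_region(count_region, memory, regions, {})
items = expand_list(list_spec, memory, regions)
offsets = [item["offset"] for item in items]
```

With a count of 3, the items resolve to offsets 0x60, 0x80, and 0xa0, each 32 bytes long.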
A region is specified in terms of an addressing scheme
The EVM models its various data locations in a couple different ways based on how bytes are defined to be arranged in each location: e.g., storage is arranged in slots, but memory is just one long bytes array.
This pointer schema does not make any attempt to unify these different access abstractions, but instead it defines the concept of an addressing scheme.
Each location-specific region schema is defined as an extension of the schema for a particular addressing scheme.
This format currently defines two such addressing schemes: ethdebug/format/pointer/scheme/slice and ethdebug/format/pointer/scheme/segment, for addressing ranges within a single continuous byte array and addressing a slot or collection of slots in a word-arranged location (respectively).
Example: Slice-based region
This example addresses the 32 bytes starting at location 0x3334 in calldata:
{
"location": "calldata",
"offset": "0x3334",
"length": "0x20"
}
Example: Segment-based region
This example addresses the first four slots of transient storage (beginning at slot 0).
{
"location": "transient",
"slot": 0
}
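As a rough illustration of the distinction between the two schemes, the sketch below guesses a region's addressing scheme from the fields it carries. This is an assumption based only on the examples in this section (slice-based regions use "offset" and "length"; segment-based regions use "slot"); in the actual schema each location-specific region schema determines its scheme, and the function name addressing_scheme is hypothetical.

```python
def addressing_scheme(region):
    """Guess the addressing scheme of a region from the fields it carries."""
    if "slot" in region:
        # Word-arranged locations are addressed by slot.
        return "segment"
    if "offset" in region and "length" in region:
        # Byte-array locations are addressed by offset and length.
        return "slice"
    raise ValueError(f"cannot determine scheme for: {sorted(region)}")


calldata_region = {"location": "calldata", "offset": "0x3334", "length": "0x20"}
transient_region = {"location": "transient", "slot": 0}

schemes = [addressing_scheme(calldata_region), addressing_scheme(transient_region)]
```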