@jtoman's format prototype
Status of this document
This is an initial draft for review and comment. It does not have consensus and should only be cited as work in progress.
Scope of this Document
This document proposes a "general" shape of the ultimate debugging format to be decided upon by the ethdebug working group. As such, it does not aim to be a complete formal specification of a JSON format that is expected to cover every single case. Rather, it aims to provide a basis from which a fully formal specification will be developed based on discussions generated around this document.
Under this vague scope, it is worth noting some non-goals. This document describes a debug format for EVM bytecode; support for other VMs is at least initially out of scope. In addition, this proposal is necessarily geared towards the state of the Solidity/Vyper languages as they exist now. It cannot (and will not) account for any possible future changes to the source language (Solidity, Vyper, etc.); rather, any significant changes to source languages/target VMs that require fundamental extensions to this format should be developed as needed and gated with a versioning scheme.
Goals of this Format
Before describing the format, it is useful to lay out the information that this initial proposal is attempting to provide to end-users of the format. Extensions to this format to support other use cases not explicitly identified here are expected.
Local Variable Information
Currently decompilers and formal methods tools must use internally generated names to give names to the values stored on the stack. The debugging format should provide information about what stack slots correspond to which source level identifiers.
Type Information
The EVM has one "type": a 256-bit word. Source languages that compile to the EVM have richer type information which can aid in debugging and fuzzing; for example, the counterexample generation used by the Certora prover could use type information to pretty print values according to their high-level type.
Jump Resolution
The EVM allows jumping to arbitrary values on the stack (subject to the restriction that the destination has a corresponding JUMPDEST opcode). This makes construction of a static control flow graph challenging (albeit not impossible). The format should provide reasonable hints about possible targets of jump commands.
Internal Function Calls
The EVM has no built-in concept of internal functions. Rather, Solidity internal function implementations are placed at some offset in the contract's bytecode, and callers jump to this predetermined location, passing arguments on the stack along with the return location (this is one possible complication when attempting to statically resolve jump destinations).
Statically detecting these internal calls and informing an end-user can be surprisingly complicated.
For example, the Solidity compiler will in some cases perform a "tail-call" optimization: for nested calls like f(g(x))
the compiler will push the entry point of f
as the return address for the call to g
. The format should
help explicitly identify the targets of internal function calls and what arguments are being passed on the stack.
Mapping key identification
EVM languages commonly include non-enumerable mappings. As such, it is useful to be able to dynamically identify any mapping keys that may appear while analyzing a transaction trace or debugging.
The Format
The format will be JSON so that it may be included in the standard input/output APIs that the Vyper and Solidity compilers support.
Top Level
The "top-level" artifact of the debug format will be a JSON dictionary with (at least) the following fields:
version
: A representation of the major/minor version of the format. The actual representation of this version (a string, an array, etc.) can be decided later.types
: An array describing the layout of user-defined types defined in contracts referenced during compilation (see below).bytecode
: Debug information about the bytecode output by the compiler.
Type Descriptions
When describing user defined types in contracts or describing the types of values on the stack, the format
will use type descriptors
to describe the type in question. There is one type descriptor per type in the
source language. Each descriptor is a JSON object with at least the following fields:
id
: a unique numeric id. This may be referenced by type descriptors for aggregate types (arrays, structs, etc.)sort
: A string representing the sort of the type. Possible values include:"mapping"
for a dynamic mapping from a key type to a value type"primitive"
built in primitive type"array"
for a homogeneous dynamic array of bounded/unbounded size"static_array"
for homogeneous static arrays"struct"
for user defined aggregate struct types"enum"
user defined enumeration types"contract"
a refinement of an address primitive with information about the contract deployed at the address"alias"
a user defined alias for some type"located"
a reference to another type with a data location attached
label
: a (not necessarily human-readable) string representation of the type. Expected to be used for debugging
Depending on the value of sort
the type descriptor will have additional fields.
Discussion The types here do not include events or errors. These can be described elsewhere in the format,
and indeed, they will likely reference the types defined here. However, as events and errors are not currently
first class in any language targeting the EVM that I'm aware of (i.e., you cannot declare a variable x
to be of
type error Foo()
) they should be described elsewhere.
Notes: some preference was expressed for kind
over sort
. In addition, it was suggested we use pointer
or reference
over located
.
Mappings
The type descriptor for a mapping type has the following additional fields defined.
keyType
: contains theid
of the type that is the domain of the mapping.valueType
: contains theid
of the type that is the codomain of the mapping.
Primitives
The type descriptor for a primitive has the following additional fields:
keyword
: the source keyword for the type. Examples includeuint256
,boolean
etc.bitwidth
: the maximum number of bits a value of this type may occupyalignment
: one ofhigh
/low
, indicating if the bits occur in the most significant bits (high
) or least significant bits (low
) of 256-bit EVM word.
Discussion: The bitwidth field is an initial attempt to come up with some language agnostic way to
describe primitive types. It is expected that further fields may be added, or perhaps the Primitive sort
should be split up into more specific units, like Integral
and Real
etc.
Array
The type descriptor for an array is further subdivided depending on whether the array is a bytes array or any other array. It has at least the following fields:
arraySort
: either the string"bytes"
or"generic"
(names not final).bound
: a field indicating the statically known upper bound on the size of this array (for Vyper). If null the array is unbounded.
If arraySort
is "bytes"
then the descriptor has the following field:
keyword
: the keyword used to declare this type, to account forstring
vsbytes
If the arraySort
is "generic"
then descriptor has the following field:
elementType
: a numeric id that references the type of values held in each element of the array.
Discussion: Here, as elsewhere, no attempt is made here in the type descriptors to describe the physical representation of the type. Short of some semi-turing complete DSL, there doesn't seem to be a compact way to describe declaratively the packed storage representation of strings in storage for example.
Static Arrays
The type descriptor for a static array has the following additional fields:
size
: the static, pre-declared size of the fixed size array/listelementType
: a numeric id that references the type of values held in each element of the array.
Struct
This format assumes that all struct types are user defined types and thus have a declaration site. The type descriptor for a struct has the following addition fields:
declaration
: A dictionary describing the definition site of the struct, see below.fields
: An ordered list of dictionaries describing the fields of the struct.name
: The name of the struct without thestruct
keyword and without contract qualifiers.
The order of the elements in fields
is significant, and should match the order that fields are declared in the source file.
Each element of the fields
array is a dictionary with the following fields:
name
: the name of the fieldtype
: the numeric id of the type held in this field
Enums
As with structs, this format assumes that all enumeration types are user defined. The descriptor for an enum contains the following fields:
declaration
: A dictionary describing the definition site of the enum, see elow.name
: the name of the enum, without theenum
keyword and without any contract qualifiers.members
: A list of members of the enum, as strings.
The order of elements within members
is significant, and should match the order that members of the enum are declared in the source file.
Contracts
The contract type refers to a primitive value that is known/expected to be an address of a contract deployed on the blockchain which implements the given type. It contains the following field:
contractDeclaration
: The AST id of the declaration of the contract type.name
: A string holding the (fully qualified) name of the contract type.
Discussion It is unclear to me whether this should actually be separate from primitives. I lean towards no, but it is presented this way to prompt discussion. Note that this format assumes that the declaration of the contract type is "visible" to the compiler during compilation and thus the declaration site is available for reference.
Aliases
As with enums and structs, this format assumes that all aliases are user defined, but this restriction could be relaxed by making the definitionScope
field optional.
An alias type descriptor has the following additional fields:
aliasName
: The user provided name of the alias type, without qualifiers.definitionScope
: A dictionary describing the site of the definition, see belowaliasedType
: The numeric id of the type for which this is an alias.
Discussion: This could be extended with information such as "is this alias opaque" a la private types in OCaml.
Located Types
A "located" type is simply a type that is additionally qualified with a data location, that is, a refinement on some other type to restrict its location. A located type has the following fields defined:
location
: A string describing EVM data locations. Possible values are"memory"
,"storage"
,"calldata"
,"returndata"
,"code"
.type
: The numeric ID of the type with this location.
It is expected that the type referenced in type
is not itself a located type, as this would indicate a type like uint[] calldata memory
which is not
valid and is never expected to be.
Discussion: The lack of a stack
or default
location is intentional, but can be added if needed. The choice to separate the location from rest of
the type was to avoid multiple descriptors for a struct depending on where that struct is located. Under this design, there is a single definition for the
shape of the struct, and the different data locations of that struct are handled by located type descriptors.
Definition Scopes
To provide information about where a user defined type was declared, the descriptors for those type include a definitionScope
field.
This field is a dictionary with the following fields:
definitionScope
: A dictionary describing where the type is defined. It has at least the following fieldssort
: a string, either"file"
indicating a top-level declaration or"contract"
indicating a type defined within a contract
name
: The string representation of the type name. For struct types this is the name of the struct, and does not include thestruct
keyword, and similarly for enums.
The definitionScope
dictionary has additional fields depending on the value of sort
. If it is "contract"
then it has the following field:
definingContract
: A dictionary with the following fields:name
: the source name of the defining contractastId
: the numeric AST id of the declaration which holds this definition
If the field is "file"
, then it instead has:
definingFile
: A dictionary with the following fields:name
: The path to the file (John: Fully resolved path? The path as understood by the compiler?)
It is expected that the combination of definitionScope
and name
is unique within the types
array
(otherwise we would have multiple declarations in the same scope).
Unresolved Questions
What about generics? Do we want to try to describe their format before any implementation is ready?
Bytecode Debug Information
The debug information for the bytecode is a dictionary of bytecode offsets to debug information. It is not required that every opcode in the bytecode has a corresponding entry in the debug dictionary. Implementers are encouraged, however, to have as much coverage as possible. Each entry in the debug information dictionary is itself a dictionary that (optionally) includes some of the following:
- The source location(s) that "correspond" to the opcode
- The AST ID(s) that "correspond" to the opcode
- The layout of the stack, including type information and local variable names (if available)
- Jump target information (if available/applicable)
- Identification of mapping key information
In the above "correspond" roughly means "what source code caused the generation of this opcode".
Specifically the dictionary may have the following fields:
source
: a list of source location specifiers. The format of these source location specifiers should be decided later. Every element should provide the location of the textual source code that contributed to the generation of this opcode.ast
: A list of AST ids for the "closest" AST node that contributed to the generation of this opcode.stack
A layout of the stack as understood by the compiler, represented as a list.jumps
: If present, provides hints about the location being jumped to by a jumping command (JUMP or JUMPI)mappings
: If present, contains information about how the opcode relates to mapping keys.
Source Locations
The choice of which source location should be attached to each opcode is likely an inexact science. However, implementers are encouraged to be as exact as possible: while it
is technically correct to give the entirety of the contract file as the "source" of every opcode, this is not a useful result. Consumers of this information should also take care
to assume that source code operations may map to (surprising) AST ids. For example, an optimizing compiler may tag a PUSH
of a constant 16
with the AST id of the following expression
(5 + 11)
. An even more aggressive optimizing compiler could even tag the same push with the AST ids of the literals 5
and 11
in the following (5 + x) + 11
.
Stack Information
Given internal function calls, the format will not (and cannot) represent the entire stack at every point during execution; a program can be reached at many different stack depths.
However, it is expected that all compilers will have a view of some "prefix" of the stack at each program point analogous to an activation frame in low-level assembly code.
The list contained in the stack
field exposes this view; consumers can combine this information with the jumps
information to build a complete representation of the stack.
The list is ordered such that the first element provides information about the top of the stack, the second element is the next element below it, and so on. Each element is a dictionary with the following fields:
type
: The type of the value stored in this stack slot. This is not a reference to a type descriptor or an embedding of the type descriptor, see below.sourceName
: A nullable string representation of the identifier held in this stack slot. A value of null indicates that the value does not come from any single identifier.sourceId
: A nullable numerical AST id that holds the definition (John: declaration?) of the identifier held in this stack slot. A value of null indicates the value does not come from any single identifier.
Note that due to dup
commands, multiple stack locations may hold the same variable name. If a compiler knows that a stack slot that holds
a variable will be later overwritten with a new value, it should mark the to be overwritten value with the "junk" type (see below).
The type
dictionary provides information about the value stored in the stack slot. The types used here are a superset of the types described by type descriptors.
The type
dictionary has the following field:
sort
: A string indicating the sort of value stored in the stack slot, drawn from one of the following values:"junk"
indicates a value that is dead or about to be popped."pc"
A refinement of the numeric type, indicating the slot holds a location which is a jump destination target"program"
The stack slot holds a value with a "program" type, i.e., one that can be expressed using type descriptors."internal"
Indicates that the stack slot holds a value that is being used by the compiler but does not correspond to a user type.
The dictionaries for pc
and junk
sorts do not have any additional information. The internal
type is to be used for, e.g., "scratch" pointers that are used to
marshal calldata buffers or hash storage keys. Compilers may insert their own information into the internal
dictionary but this format remains intentionally agnostic
on these contents. (John: every time a standard has allowed a "vendor specific" extension, it goes badly. Maybe we want to just say, consumers shouldn't look at this field)
If the sort
is "program"
then the dictionary has the following field:
typeId
: The numeric ID of the type held in this slot
Additionally, the compiler may insert a field to provide additional information about the representation on the stack. This field, if present, has the name representation
and holds a dictionary.
This dictionary has the following optional fields:
published
: A boolean field which, if present, indicates that this stack slot holds a pointer to some location in memory/storage. Further, if the field is true, then the object is "fully initialized" (the formal definition of fully initialized is to be decided on later)componentOf
: If the representation of a single value spans multiple stack slots, this field provides information about how the value is spread across the stack. It is a dictionary with the following fields:id
: an ID unique within each stack list. All stack slots with the same value ofid
are considered to represent the same logical value. It is allowed to re-use the same ID in different entries of thestack
list.componentName
: The name of the component. The only known use case for this is the decomposition of calldata arrays, so there are two possible values"ELEM_PTR"
and"LENGTH"
indicating the stack slots hold the pointer to the calldata location of the array's elements or the logical length of the array respectively.
Jumps
For jumping commands, the jumps
field provides information about the expected target of the jump, and information about the internal function stack.
The value of the jumps
field is a dictionary with the following (potentially optional) fields:
targets
: if present, a list of known PCs to which this command may jump. For JUMPI, this does not include the fallthrough case, as this is readily computable. This list may be non-singleton due to, e.g., function pointers, but the compiler is able to restrict the potential callees.sort
: A string indicating the type of jump being performed. One of the following values:"return"
: Used for a jump out of an internal function"call"
: Used for a jump into an internal function"normal"
: Used for all other jumps
Discussion: It may be useful to ask compilers to provide richer information about some jumps. For example, tagging a loop exit as a "break" or a backjump as a "continue". This may be redundant given sufficiently reliable source information however.
As elsewhere, the dictionary may contain additional fields depending on the value in sort
.
If the value is "call"
, then the dictionary contains the following fields:
arguments
: A list describing the calling convention. As in thestack
layout, the first element of this list describes the value on the top of the stack (after popping the jump destination). Each element is a dictionary described below.
If the callee of the call is known, then the dictionary with sort "call"
has the following field:
callee
: a dictionary with the following fields:target
: a human readable string name for the function being calledastId
: the AST id of the declaration site of the callee
Note that if the function being called is virtual
then the declaration site may not have any corresponding body.
Each element of the arguments
array is a dictionary with the following fields:
sort
:"program"
or"return_address"
."program"
has the same interpretation as in thetype
dictionary above."return_address"
is a refinement of thepc
type indicating this stack slot holds the return address of the call being performed.position
: The logical position of the parameter represented by this stack value. The ordering of parameters is defined by their program declaration order, where the first formal parameter to a function has position0
, the next1
, etc. As with the stack, a single logical argument can be spread across multiple stack slots. If multiple entries share the sameposition
value, then those arguments should have arepresentation
field that has acomponentOf
entry.
Note Due to named arguments, the order given in the debug information may not match the order of parameters as they appear at a call-site. For example, given a declaration:
function myFunction(uint a, uint b) ...
and an invocation:
myFunction(b = 3, a = 4)
the stack location which contains the 4
argument value will be tagged with position 0
, as that is the position of parameter a
in the declaration.
If the value of sort
is "return"
, then the dictionary has the following field:
returns
: A list of dictionaries with the same format as thearguments
array ofcall
, but without anyreturn_address
entries.
Discussion: The above proposal doesn't really handle the case of "tail-calls" identified at the beginning of this document, where multiple return addresses can be pushed onto the stack. Is that something the debug format must explicitly model?
Mapping key identification
The value of this field (when present) is a dictionary with (some of) the following fields:
isMappingHash
: A boolean that identifies whether the opcode is computing a hash for a mapping.isMappingPreHash
: For mappings that use two hashes, this boolean can identify whether the opcode is computing the first of the two hashes. Possibly this field should be combined with a previous one into some sort of enum?mappingHashFormat
: An enumeration; specifies the format of what gets hashed for the mapping. Formats could include "prefix" (for Solidity), "postfix" (for Vyper value types), and "postfix-prehashed" (for Vyper strings and bytestrings). Possibly "prefix" could be split further into "prefix-padded" (for Solidity value types) and "prefix-unpadded" (for Solidity strings and bytestrings). This could be expanded in the future if necessary. (Also, potentially"prefix-padded"
, if split out, could be broken down even further, by padding type -- zero padding (left) vs sign-padding vs zero-padding (right)...)