EVM Assembly Translator
The EVM assembly translator converts legacy EVM assembly (the default solc output) into LLVM IR via an intermediate representation called Ethereal IR (EthIR). The Yul pipeline (--via-ir) bypasses this translator entirely.
Why EthIR?
EVM assembly is stack-based with dynamic jumps, so it is difficult to translate directly to LLVM IR, which requires an explicit control flow graph. EthIR bridges this gap by:
- Tracking stack state to identify jump destinations at compile time
- Cloning blocks reachable from predecessors with different stack states
- Reconstructing control flow from stack-based jumps into a static CFG
- Resolving function calls using metadata from the solc fork
Translation Pipeline
Solidity source
│
▼
solc (solx-solidity fork)
│ Emits EVM assembly JSON + extraMetadata
▼
Assembly parsing
│ Parses instructions, resolves dependencies
▼
Block construction
│ Groups instructions between Tag labels
▼
EthIR traversal
│ DFS with stack simulation, block cloning
▼
LLVM IR generation
│ Creates LLVM functions, basic blocks, instructions
▼
LLVM optimizer
│
▼
EVM bytecode (via LLVM EVM backend)
Key Data Structures
Assembly
The Assembly struct represents the raw solc output. It contains:
- code: Flat list of instructions (deploy code)
- data["0"]: Nested assembly for runtime code
- data[hex]: Referenced data entries — sub-assemblies, hashes, or resolved contract paths (for CREATE/CREATE2)
Each instruction has a name (opcode), optional value (operand), and optional source location.
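As a rough illustration, the layout described above could be modeled as follows. This is a minimal sketch, not the translator's actual definitions: the field types, the `SourceLocation` and `Data` names, and the `runtime_code` helper are assumptions made for this example.

```rust
use std::collections::BTreeMap;

/// Hypothetical source location; the real fields are not described here.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceLocation {
    pub begin: isize,
    pub end: isize,
}

/// One instruction: opcode name, optional operand, optional source location.
#[derive(Debug, Clone, PartialEq)]
pub struct Instruction {
    pub name: String,
    pub value: Option<String>,
    pub source: Option<SourceLocation>,
}

/// A referenced data entry (variant names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Assembly(Box<Assembly>),
    Hash(String),
    Path(String),
}

/// Raw assembly: a flat instruction list plus keyed data entries.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Assembly {
    pub code: Vec<Instruction>,
    pub data: BTreeMap<String, Data>,
}

impl Assembly {
    /// Returns the nested runtime assembly stored under the data key "0", if any.
    pub fn runtime_code(&self) -> Option<&Assembly> {
        match self.data.get("0") {
            Some(Data::Assembly(inner)) => Some(&**inner),
            _ => None,
        }
    }
}
```

Boxing the nested assembly keeps the recursive type sized, which is one straightforward way to express "deploy code that contains runtime code".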
EtherealIR
The top-level container holding:
- entry_function: The main contract function (deploy + runtime)
- defined_functions: Internal functions discovered during traversal
Function
The Function struct is the core of the translator. It contains:
- blocks: BTreeMap<BlockKey, Vec<Block>>, which maps each block tag to one or more instances (clones for different stack states)
- block_hash_index: HashMap<BlockKey, HashSet<u64>>, used for fast duplicate detection by stack hash
- stack_size: Maximum stack height observed, used to size LLVM stack allocations
Block
Each Block represents a sequence of instructions between two Tag labels:
- key: BlockKey (code segment + tag number)
- instance: Clone index (0, 1, 2... for blocks visited with different stack states)
- elements: Instructions with full stack state snapshots
- initial_stack / stack: Stack state at entry and after processing
Stack Elements
The stack tracks six kinds of values:
| Variant | Description | Example |
|---|---|---|
| Value(String) | Runtime value (opaque) | Result of ADD, MLOAD |
| Constant(BigUint) | Compile-time 256-bit constant | 0x60, 0xFFFF |
| Tag(u64) | Block tag (jump target) | Tag 42 |
| Path(String) | Contract dependency path | "SubContract" |
| Data(String) | Hex data chunk | "deadbeef" |
| ReturnAddress(usize) | Function return marker | Return with 2 outputs |
Block Cloning and Stack Hashing
The same block may be reached via different code paths with different stack contents. Since the stack determines jump targets (a JUMP pops its destination from the stack), the translator must handle each unique stack state separately.
How It Works
- When entering a block, the translator computes a stack hash using XxHash3_64
- The hash considers only Tag elements: tags determine control flow, while constants and runtime values affect only data flow
- The pair (BlockKey, stack_hash) uniquely identifies a block instance
- If this pair has been visited before, the block is skipped (cycle detection)
- Otherwise, a new block instance is created
Block "process" reached with stack [T_10, V_x]: → instance 0
Block "process" reached with stack [T_20, V_y]: → instance 1 (different tag)
Block "process" reached with stack [T_10, V_z]: → instance 0 (same hash, reused)
Stack Hash Algorithm
fn hash(&self) -> u64 {
    let mut hasher = XxHash3_64::default();
    for element in self.elements.iter() {
        match element {
            Element::Tag(tag) => hasher.write(&tag.to_le_bytes()),
            _ => hasher.write_u8(0),
        }
    }
    hasher.finish()
}
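Only Tag values contribute to the hash. Every other element is written as a single zero byte, so it still occupies a position in the hashed sequence but its contents are ignored. This is intentional: two stack states with the same tags in the same positions but different runtime values will follow the same control flow path.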
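The deduplication from the example above can be sketched in a self-contained form. This is an illustration under stated assumptions: the standard library's `DefaultHasher` stands in for XxHash3_64 to avoid an external dependency, `Element` is reduced to two variants, and `BlockHashIndex` with its `visit` method is a hypothetical stand-in for the real block_hash_index bookkeeping.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::Hasher;

/// Simplified stack element; only the variants needed for this sketch.
#[derive(Debug, Clone)]
pub enum Element {
    Tag(u64),
    Value(String),
}

/// Hashes tags by value and every other element as a single zero byte.
/// DefaultHasher is a stand-in for XxHash3_64 (assumption for this sketch).
pub fn stack_hash(stack: &[Element]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for element in stack {
        match element {
            Element::Tag(tag) => hasher.write(&tag.to_le_bytes()),
            _ => hasher.write_u8(0),
        }
    }
    hasher.finish()
}

/// Hypothetical index: block tag -> stack hashes already seen for that block.
#[derive(Default)]
pub struct BlockHashIndex {
    seen: HashMap<u64, HashSet<u64>>,
}

impl BlockHashIndex {
    /// Returns true if this (block, stack hash) pair is new,
    /// meaning a new block instance would have to be created.
    pub fn visit(&mut self, block_tag: u64, stack: &[Element]) -> bool {
        self.seen
            .entry(block_tag)
            .or_default()
            .insert(stack_hash(stack))
    }
}
```

With this model, the stacks [T_10, V_x] and [T_10, V_z] map to the same instance, while [T_20, V_y] produces a second one.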
Traversal Algorithm
The Function::traverse() method performs a depth-first traversal of blocks, simulating EVM execution:
traverse(blocks, extra_metadata):
    queue ← [(entry_block, empty_stack)]
    visited ← {}
    while queue is not empty:
        (block_key, stack) ← queue.pop()
        hash ← stack.hash()
        if (block_key, hash) in visited:
            continue
        visited.add((block_key, hash))
        block ← blocks[block_key].clone_with(stack)
        for instruction in block:
            simulate_instruction(instruction, stack)
            if instruction is JUMP/JUMPI:
                queue.push((target_tag, stack))
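The pseudocode above can be turned into a small runnable model. This is a sketch, not the actual `Function::traverse()` implementation: the instruction set is reduced to five hypothetical variants, blocks are keyed by a bare tag number, every block is assumed to end with `Jump` or `Stop`, and the visited set stores the tag-only view of the stack directly instead of its hash.

```rust
use std::collections::{BTreeMap, HashSet};

#[derive(Debug, Clone)]
pub enum Element {
    Tag(u64),
    Value,
}

/// Reduced, hypothetical instruction set for this sketch.
#[derive(Debug, Clone)]
pub enum Instruction {
    PushTag(u64),
    PushValue,
    Jump,
    JumpI,
    Stop,
}

/// Tag-only view of the stack; plays the role of the stack hash.
fn stack_key(stack: &[Element]) -> Vec<Option<u64>> {
    stack
        .iter()
        .map(|element| match element {
            Element::Tag(tag) => Some(*tag),
            Element::Value => None,
        })
        .collect()
}

/// Returns the visited (tag, instance) pairs in depth-first order.
pub fn traverse(blocks: &BTreeMap<u64, Vec<Instruction>>, entry: u64) -> Vec<(u64, usize)> {
    // A Vec used as a LIFO worklist gives depth-first order.
    let mut worklist = vec![(entry, Vec::<Element>::new())];
    let mut visited: HashSet<(u64, Vec<Option<u64>>)> = HashSet::new();
    let mut instances: BTreeMap<u64, usize> = BTreeMap::new();
    let mut order = Vec::new();

    while let Some((tag, mut stack)) = worklist.pop() {
        // Skip (block, stack state) pairs that were already processed.
        if !visited.insert((tag, stack_key(&stack))) {
            continue;
        }
        let instance = instances.entry(tag).or_insert(0);
        order.push((tag, *instance));
        *instance += 1;

        let Some(block) = blocks.get(&tag) else { continue };
        for instruction in block {
            match instruction {
                Instruction::PushTag(target) => stack.push(Element::Tag(*target)),
                Instruction::PushValue => stack.push(Element::Value),
                Instruction::Jump | Instruction::JumpI => {
                    let destination = stack.pop();
                    if matches!(instruction, Instruction::JumpI) {
                        stack.pop(); // the condition sits below the destination
                    }
                    // Only statically known destinations can be queued.
                    if let Some(Element::Tag(target)) = destination {
                        worklist.push((target, stack.clone()));
                    }
                    if matches!(instruction, Instruction::Jump) {
                        break;
                    }
                }
                Instruction::Stop => break,
            }
        }
    }
    order
}
```

A block that is entered with two different return tags on the stack shows up twice in the result, once per instance, and a block that jumps back to itself with an unchanged stack is visited only once.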
Instruction Simulation
For each instruction, the translator:
- Pops the required number of inputs from the simulated stack
- Computes the output (compile-time if possible, runtime value otherwise)
- Pushes the result onto the stack
- For control flow instructions, queues successor blocks
Compile-Time Constant Folding
Arithmetic operations on known values are folded at compile time:
| Operands | Result |
|---|---|
| Constant + Constant | Constant (computed) |
| Tag + Constant | Tag (if result is valid block) |
| Tag + Tag | Tag (if result is valid block) |
| Any other combination | Value (runtime, opaque) |
This is critical for resolving jump targets: solc often computes jump destinations via PUSH tag + arithmetic.
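The folding rules in the table can be sketched for ADD as follows. Several things here are assumptions for illustration: `u128` stands in for the 256-bit BigUint constant, the `fold_add` and `opaque_value` names are hypothetical, the set of valid tags is passed in explicitly, and Constant + Tag is treated the same as Tag + Constant.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq)]
pub enum Element {
    Constant(u128), // stand-in for a 256-bit BigUint
    Tag(u64),
    Value(String),
}

/// Opaque runtime result of an ADD that could not be folded.
fn opaque_value() -> Element {
    Element::Value("ADD".to_string())
}

/// Turns a sum into a Tag only if it names an existing block.
fn as_valid_tag(sum: Option<u128>, valid_tags: &BTreeSet<u64>) -> Option<Element> {
    let tag = u64::try_from(sum?).ok()?;
    valid_tags.contains(&tag).then(|| Element::Tag(tag))
}

/// Folds ADD at compile time where the table above allows it.
pub fn fold_add(a: &Element, b: &Element, valid_tags: &BTreeSet<u64>) -> Element {
    match (a, b) {
        // EVM addition wraps around; wrapping_add mirrors that for the stand-in width.
        (Element::Constant(x), Element::Constant(y)) => Element::Constant(x.wrapping_add(*y)),
        // Assumption: operand order does not matter for Tag + Constant.
        (Element::Tag(t), Element::Constant(c)) | (Element::Constant(c), Element::Tag(t)) => {
            as_valid_tag(u128::from(*t).checked_add(*c), valid_tags).unwrap_or_else(opaque_value)
        }
        (Element::Tag(x), Element::Tag(y)) => {
            as_valid_tag(u128::from(*x).checked_add(u128::from(*y)), valid_tags)
                .unwrap_or_else(opaque_value)
        }
        // Any other combination is only known at runtime.
        _ => opaque_value(),
    }
}
```

Falling back to an opaque value when the sum is not a valid block tag keeps the simulation conservative: a jump through such a value simply cannot be resolved statically.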
Function Call Detection
The translator identifies function calls using extra metadata from the solc fork. The extraMetadata JSON field lists all user-defined functions with their:
- Entry tag (in deploy and/or runtime code)
- Input parameter count
- Output return value count
- Function name and AST node ID
When a JUMP targets a known function entry:
- The stack is split: return address, arguments, and remaining caller state
- A RecursiveCall pseudo-instruction replaces the JUMP
- A new Function is created and recursively traversed from the entry block
- The caller's stack receives output_size opaque return values
Before JUMP to function "add(uint,uint)":
    Stack: [... | return_tag | arg1 | arg2 | function_entry_tag]
After call detection:
    Instruction: RecursiveCall add(uint,uint), input=2, output=1
    Caller stack: [... | return_value]
    Callee: new Function traversed from entry tag
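The stack split shown above can be sketched as a small function. This is an illustrative model only: `FunctionInfo` is a hypothetical subset of an extraMetadata entry, `detect_call` and the `RecursiveCall` struct fields are names chosen for this example, and the recursive traversal of the callee is left out.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Element {
    Tag(u64),
    Value(String),
}

/// Hypothetical subset of one extraMetadata function entry.
pub struct FunctionInfo {
    pub name: String,
    pub entry_tag: u64,
    pub input_size: usize,
    pub output_size: usize,
}

/// Pseudo-instruction that replaces the JUMP (fields are illustrative).
#[derive(Debug, PartialEq)]
pub struct RecursiveCall {
    pub name: String,
    pub return_address: Element,
    pub arguments: Vec<Element>,
}

/// If the jump destination on top of the stack is a known function entry,
/// splits the stack and rewrites it to the caller's state after the call.
pub fn detect_call(stack: &mut Vec<Element>, functions: &[FunctionInfo]) -> Option<RecursiveCall> {
    let target = match stack.last() {
        Some(Element::Tag(tag)) => *tag,
        _ => return None,
    };
    let function = functions.iter().find(|f| f.entry_tag == target)?;
    // The stack must hold the entry tag, the arguments, and the return address.
    if stack.len() < function.input_size + 2 {
        return None;
    }
    stack.pop(); // function entry tag
    let arguments = stack.split_off(stack.len() - function.input_size);
    let return_address = stack.pop()?;
    // The caller sees output_size opaque return values.
    for index in 0..function.output_size {
        stack.push(Element::Value(format!("{}_out{}", function.name, index)));
    }
    Some(RecursiveCall {
        name: function.name.clone(),
        return_address,
        arguments,
    })
}
```

A jump to a tag that is not listed in the metadata leaves the stack untouched and returns `None`, so it can be handled as an ordinary jump.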
LLVM IR Generation
After traversal, the translator generates LLVM IR in several phases:
1. Function Declaration
- Entry function: Uses the pre-declared contract entry point
- Defined functions: Creates private LLVM functions with N × i256 parameters and return values (multiple returns use LLVM struct types)
2. Stack Variable Allocation
For each function, stack_size stack slots are allocated as LLVM alloca instructions. These represent the simulated EVM stack as addressable memory:
%stack_0 = alloca i256 ; bottom of stack
%stack_1 = alloca i256
...
%stack_N = alloca i256 ; top of stack
For defined functions, slot 0 is reserved for the return address marker, and input parameters are stored starting from slot 1.
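The slot layout for a defined function can be illustrated with a text-only sketch that prints the prologue as LLVM IR lines. It does not use any LLVM API. The `emit_function_prologue` name and the `%arg_N` parameter names are hypothetical, and the sketch assumes that `stack_size` is large enough to hold the return address marker plus all inputs.

```rust
/// Emits textual LLVM IR for the stack slots of a defined function,
/// followed by stores that copy each input parameter into its slot.
/// Slot 0 is left for the return address marker; inputs start at slot 1.
pub fn emit_function_prologue(stack_size: usize, input_size: usize) -> Vec<String> {
    // Assumption of this sketch: marker slot + inputs must fit in the stack.
    assert!(input_size < stack_size, "stack too small for marker and inputs");

    let mut lines: Vec<String> = (0..stack_size)
        .map(|index| format!("%stack_{} = alloca i256", index))
        .collect();
    for parameter in 0..input_size {
        lines.push(format!(
            "store i256 %arg_{}, ptr %stack_{}",
            parameter,
            parameter + 1
        ));
    }
    lines
}
```

For a function with two inputs and a stack size of 3, this yields three `alloca` lines followed by stores into `%stack_1` and `%stack_2`.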
3. Basic Block Creation
Each (BlockKey, instance) pair becomes an LLVM BasicBlock:
block_runtime_42/0: ; tag 42, first instance
...
block_runtime_42/1: ; tag 42, second instance (different stack state)
...
4. Instruction Translation
Each EthIR element calls into_llvm() to generate LLVM instructions. Stack operations map to loads/stores on the allocated stack variables:
| EVM Operation | LLVM Translation |
|---|---|
| PUSH 0x42 | store i256 66, ptr %stack_N |
| DUP2 | %v = load i256, ptr %stack_(N-2); store i256 %v, ptr %stack_(N+1) |
| ADD | %a = load ...; %b = load ...; %r = add i256 %a, %b; store ... |
| MLOAD | %ptr = load ...; %v = load i256, ptr addrspace(1) %ptr; store ... |
| JUMP | br label %target_block |
| JUMPI | %cond = ...; br i1 %cond, label %taken, label %fallthrough |
solc Fork Modifications
The EVM assembly translator relies on several modifications in the solx-solidity fork. The most relevant to this pipeline are:
- extraMetadata output: reports all user-defined functions with entry tags, parameter counts, and AST IDs. Without this, the translator cannot distinguish function calls from arbitrary jumps.
- Dispatch tables for function pointers: indirect calls are lowered to static dispatch tables instead of dynamic jumps.
- DUPX/SWAPX instructions: extend stack access beyond depth 16, eliminating "stack too deep" errors.
- Disabled optimizer: the solc optimizer is disabled to preserve function boundaries and metadata validity. All optimization is handled by the LLVM backend.
For the full list of fork modifications, see Limitations and Differences from solc.