Version 0.1-draft-20230529
Tagged RISC‑V is a parallel universe in the RISC‑V multiverse. Its purpose is to support enhanced security. It is based on a 72‑bit memory doubleword that supports 64 bits of instruction and data, just like the original RISC‑V 64‑bit universe, but with an 8‑bit tag for memory doublewords that is used for two purposes. The primary purpose of the tags is to support probabilistic bounds checking with conventional 64‑bit pointers in production code. The goal is not to catch every possible error, but to make the chances of undetected errors much smaller than in an untagged architecture. This is accomplished with cliques. The clique field of a pointer used to access memory must agree with the clique stored in memory for that pointer. A clique mismatch results in an exception that indicates that either the pointer was used to access beyond the memory allocated for the pointer, or was used to access memory after an explicit deallocation. There are performance costs to Tagged RISC‑V but these are expected to be acceptable in return for the safety achieved. The largest issue is one of main memory width (but see Synergy With Memory Encryption below).
A secondary goal of tagging is to support the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) programming model, in particular CHERI‑128 capabilities, which require a 1‑bit tag. Instead of the CHERI 1‑bit tag, Tagged RISC‑V reserves two tag values for CHERI: one for the least significant 64 bits of a CHERI‑128 capability, and another for the most significant 64 bits.
Adding 12.5% overhead to main memory is a heavy lift. However, the continued inability of the software industry to address security issues that arise from undisciplined languages such as C++ would seem to necessitate this drastic step.
Tagged RISC‑V general-purpose x registers are widened to 128 bits to accommodate CHERI‑128 capabilities. Otherwise, the unprivileged state is identical to RVI64. New LC (Load CHERI) and SC (Store CHERI) instructions would be introduced to load and store the full 128‑bit values with tag checking.
As introduced above, an 8‑bit tag is added to each 64‑bit aligned main-memory location, as shown below:
Bits | 71:64 | 63:0 |
---|---|---|
Field | mc | data |
Width | 8 | 64 |

Field | Width | Bits | Description |
---|---|---|---|
data | 64 | 63:0 | The RISC‑V memory contents |
mc | 8 | 71:64 | Clique of memory storing data, which must match bits 63:56 of the pointer used to access it. |
Misaligned loads and stores are supported. For misaligned accesses that cross an 8‑byte boundary, the cliques of both memory locations accessed are checked.
A new Physical Memory Attribute (PMA) indicates whether memory is tagged or not. Untagged (64‑bit) memory is equivalent to having a tag of 252 for clique checking purposes.
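The check described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `check_access`, `CliqueMismatch`, and the `tags` dictionary (doubleword index to tag) are hypothetical names, address translation is ignored, and doublewords missing from `tags` stand in for untagged memory.

```python
UNTAGGED_CLIQUE = 252          # untagged memory behaves as if tagged 252
ADDRESS_MASK = (1 << 48) - 1   # pointer bits 47:0 hold the virtual address


class CliqueMismatch(Exception):
    """Models the exception taken when pointer and memory cliques differ."""


def pointer_clique(ptr):
    """Return pointer bits 63:56, the clique used for checking."""
    return (ptr >> 56) & 0xFF


def check_access(ptr, size, tags):
    """Check every 8-byte doubleword touched by an access of `size` bytes.

    `tags` is a hypothetical dict mapping doubleword index -> 8-bit tag;
    indices missing from it stand in for untagged memory.
    """
    clique = pointer_clique(ptr)
    addr = ptr & ADDRESS_MASK
    first = addr >> 3
    last = (addr + size - 1) >> 3
    for dw in range(first, last + 1):
        if tags.get(dw, UNTAGGED_CLIQUE) != clique:
            raise CliqueMismatch(
                f"doubleword {dw}: tag does not match clique {clique}")
```

An aligned access touches one doubleword, so the loop runs once. A misaligned access that crosses an 8‑byte boundary touches two, and both tags must match.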
The LT (Load Tag) instruction is used for reading memory tags, primarily for diagnostic purposes, but perhaps also so that an allocator can avoid the tag values of adjacent blocks. The ST (Store Tag) instruction is used for writing memory tags, and would occur primarily in memory allocators and compiler-generated stack frame allocation. The ST8 instruction would exist for storing 8 aligned tags in a single operation for performance purposes. Whether a corresponding LT8 instruction is appropriate is TBD.
An attempt to use the ST instruction to write tag values 252..255 would take an exception except in machine mode. This prevents CHERI capabilities from being created except by derivation using CHERI instructions in a chain from root capabilities initially created in machine mode.
While ST8 allows a somewhat efficient memory allocator, the code size required for stack frame initialization is excessive unless stack frames are 64‑byte aligned, which has other size issues. An STM (Store Tag Multiple) instruction with base and length would be much better for code size, and in a processor supporting vector instructions, might be of acceptable complexity.
Tag Value | Usage |
---|---|
0..251 | Clique ids for tagged memory |
252 | Clique id for untagged memory |
253 | Reserved |
254 | CHERI doubleword 0 tag |
255 | CHERI doubleword 1 tag |
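As a rough illustration of the tag value table and the ST restriction, the following Python sketch classifies a tag and decides whether an ST of that value would be permitted. The function names and the returned strings are assumptions made for this example; they are not part of the proposal.

```python
def classify_tag(tag):
    """Map an 8-bit tag value to its usage in the tag value table."""
    if not 0 <= tag <= 255:
        raise ValueError("tag must fit in 8 bits")
    if tag <= 251:
        return "clique"
    if tag == 252:
        return "untagged"
    if tag == 253:
        return "reserved"
    if tag == 254:
        return "cheri_dw0"
    return "cheri_dw1"


def st_allowed(tag, machine_mode):
    """ST of tag values 252..255 takes an exception except in machine mode."""
    if not 0 <= tag <= 255:
        raise ValueError("tag must fit in 8 bits")
    return machine_mode or tag <= 251
```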
It may be appropriate to add an option to detect loads from uninitialized memory locations. This could redefine the low bit of the tag to not participate in tag comparisons, but instead to cause an exception if set. Stores would clear the bit. Allocators for the stack and heap would set the bit so that if a location is referenced before being written, this bug is detected. Such detection is not complete (that would require a bit per byte). For example, this would not trigger when portions of the location are stored and the non-written portions are loaded, but it is likely to catch many bugs.
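One way to picture this option is the small Python model below. It is a sketch, not a definition. It assumes that the low bit of the pointer clique is also ignored in the comparison and that the clique check happens before the uninitialized check; neither is stated above. The `memory` dictionary and all function names are hypothetical.

```python
UNINIT_BIT = 0x01   # low tag bit, redefined as "not yet written"


class CliqueMismatch(Exception):
    """Pointer clique and memory tag disagree in the compared bits."""


class UninitializedLoad(Exception):
    """A load hit a doubleword whose uninitialized bit is still set."""


def cliques_match(ptr_clique, tag):
    """Compare cliques with the low bit excluded (assumption: on both sides)."""
    return (ptr_clique | UNINIT_BIT) == (tag | UNINIT_BIT)


def allocate(memory, index, clique):
    """Allocator sets the tag with the uninitialized bit set."""
    memory[index] = (clique | UNINIT_BIT, 0)


def store(memory, index, ptr_clique, value):
    """A store checks the clique, writes the data, and clears the bit."""
    tag, _ = memory[index]
    if not cliques_match(ptr_clique, tag):
        raise CliqueMismatch(index)
    memory[index] = (tag & ~UNINIT_BIT & 0xFF, value)


def load(memory, index, ptr_clique):
    """A load checks the clique, then traps if the location was never written."""
    tag, data = memory[index]
    if not cliques_match(ptr_clique, tag):
        raise CliqueMismatch(index)
    if tag & UNINIT_BIT:
        raise UninitializedLoad(index)
    return data
```

Because the bit is tracked per doubleword, a byte store followed by a load of a different byte in the same doubleword would pass in this model, which matches the incompleteness noted above.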
Tags, whether 8‑bit (as proposed here) or 1‑bit (as in pure CHERI), are a heavy lift given standard non-ECC DRAM modules, but there is also a movement to add data-at-rest encryption to main memory, and this has the potential to make Tagged RISC‑V work with standard DRAM modules. In a 64‑byte cache line, there would also be eight 8‑bit tags, for a total of 576 bits. Consider the 576 bits to be 9 words of 64 bits, use a standard 128‑bit block cipher (e.g. standard AES) nine times, and add the 128‑bit authentication code, resulting in 704 bits, or eight words of 88 bits. Adding 8 bits of ECC results in 96 bits per memory word, which might use three 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 GCM xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would take much less time than the ECC check. Regenerating ECC for the decrypted data for writing into the L2 cache can be done by also precomputing the 64 bits to xor with the 8 ECC codes. Only if an ECC error is detected and corrected is it necessary to recompute the ECC before writing into the L2 cache. Writes would incur the GCM computation latency (primarily nine AES computations). Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code without doing a read, modify, write).
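The bit counts in the paragraph above can be checked with a short calculation. The function and key names below are illustrative only, and the parameters simply restate the figures given in the text.

```python
def encrypted_line_layout(line_bytes=64, tag_bits_per_dword=8,
                          auth_bits=128, ecc_bits_per_word=8,
                          memory_words=8):
    """Recompute the per-cache-line bit budget described in the text."""
    data_bits = line_bytes * 8                    # 64 bytes of data
    dwords = line_bytes // 8                      # one tag per doubleword
    tag_bits = dwords * tag_bits_per_dword
    plaintext_bits = data_bits + tag_bits         # data plus tags
    protected_bits = plaintext_bits + auth_bits   # plus authentication code
    bits_per_word, leftover = divmod(protected_bits, memory_words)
    if leftover:
        raise ValueError("protected bits do not divide evenly into words")
    return {
        "plaintext_bits": plaintext_bits,
        "words_of_64": plaintext_bits // 64,
        "protected_bits": protected_bits,
        "bits_per_word": bits_per_word,
        "stored_bits_per_word": bits_per_word + ecc_bits_per_word,
    }


layout = encrypted_line_layout()
print(layout)
```

With the default parameters this reproduces the 576, 704, 88, and 96 bit figures, and 96 bits per memory word is exactly three 32‑bit module widths.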
Memory and pointer cliques allow accesses to be bounds checked even in undisciplined languages such as C++, where compiler-generated bounds checking is not always possible. When memory is allocated, the tags of the memory locations are filled in with a clique identifier, chosen to be different from those of nearby allocations, and the pointer to that memory is given the same clique. When the pointer is used in a load or store instruction, the memory tag is compared to the clique in the pointer, and if they are the same, the access succeeds. If the pointer and memory cliques differ, the access takes an exception. For example, if the pointer is used to access beyond the memory allocated when the pointer was created, typically a different clique id would be seen, indicating an out-of-bounds situation. When memory is explicitly deallocated, the tags in memory words would be changed so that pointers to the deallocated memory no longer match and later accesses through them take an exception.
There are multiple ways this mechanism might be used, but one way a slab allocator might work is by setting the tags of each N words of the slab to incrementing values mod 252, and then incrementing the tags in the words of a freed block by 16 or 32 mod 252. A non-slab allocator might choose clique values randomly, but with an additional test that the value chosen is not in either adjacent block.
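The slab scheme sketched above might look like the following in Python. This is a minimal model, assuming a slab is represented as a flat list of per-word tags; `init_slab_tags`, `free_block`, and their parameters are hypothetical names chosen for the example.

```python
CLIQUE_COUNT = 252   # clique ids 0..251 are available for tagged memory


def init_slab_tags(num_blocks, words_per_block, first_clique=0):
    """Give each block of N words the next clique id, wrapping mod 252."""
    tags = []
    for block in range(num_blocks):
        clique = (first_clique + block) % CLIQUE_COUNT
        tags.extend([clique] * words_per_block)
    return tags


def free_block(tags, block, words_per_block, bump=16):
    """On free, advance the block's tags by `bump` (16 or 32) mod 252."""
    start = block * words_per_block
    for i in range(start, start + words_per_block):
        tags[i] = (tags[i] + bump) % CLIQUE_COUNT


tags = init_slab_tags(num_blocks=4, words_per_block=2)
free_block(tags, block=1, words_per_block=2)
print(tags)   # [0, 0, 17, 17, 2, 2, 3, 3]
```

A stale pointer to block 1 still carries clique 1, so it no longer matches the tag 17 now stored in memory. The bump of 16 keeps the freed block's new clique away from the cliques of its immediate neighbors.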
It is also desirable that stack frame allocation write clique values into the new frame and that the stack pointer reflect this new clique value in bits 63:56. A new ALLOC sp, sp, framesize instruction might be created for this purpose, where the clique number comes from incrementing an unprivileged CSR, except that it would increment twice if it matched the source sp value. See the discussion of a possible STM instruction above, as a vector-like unaligned tag write would be needed to keep code size acceptable. In addition, arrays within the stack frame would need to be given different clique values from the rest of the frame. This might be accomplished by using the ALLOC instruction multiple times when arrays are present. Finally a separate clique value could be used to protect the return address to prevent Return-Oriented Programming. This area needs further exploration. It may obviate the need for a shadow stack mechanism.
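A possible reading of the ALLOC semantics is modelled below. Several details are assumptions made only for this sketch: the stack grows downward, frame sizes and the stack pointer are multiples of 8 bytes, the CSR wraps mod 252, and the reserved pointer bits 55:48 are carried over unchanged. The `tags` dictionary (doubleword index to tag) is a hypothetical stand-in for memory tags.

```python
CLIQUE_COUNT = 252
ADDRESS_MASK = (1 << 48) - 1
RESERVED_MASK = 0xFF << 48


def alloc(sp, framesize, clique_csr, tags):
    """Model of ALLOC sp, sp, framesize; returns (new_sp, new_csr_value)."""
    addr = sp & ADDRESS_MASK
    if framesize % 8 or addr % 8 or framesize > addr:
        raise ValueError("sketch assumes 8-byte aligned frames that fit")
    # Increment the CSR; increment again if it equals the source sp clique.
    clique = (clique_csr + 1) % CLIQUE_COUNT
    if clique == (sp >> 56) & 0xFF:
        clique = (clique + 1) % CLIQUE_COUNT
    new_addr = addr - framesize
    # Write the new clique into every doubleword of the new frame.
    for dw in range(new_addr >> 3, (new_addr + framesize) >> 3):
        tags[dw] = clique
    new_sp = (clique << 56) | (sp & RESERVED_MASK) | new_addr
    return new_sp, clique
```

Skipping the source clique means the new frame's clique always differs from the caller's, so a pointer into the new frame cannot be used to reach the caller's frame without a mismatch.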
Tagged RISC‑V employs a 48‑bit virtual address space, with bits 47:0 being the virtual address. Pointer bits 63:56 are used as the clique for checking memory accesses. Bits 55:48 are reserved, and may be used for address space expansion (e.g. as in Ssv64) or for Garbage Collection (GC) support.
Bits | 63:56 | 55:48 | 47:0 |
---|---|---|---|
Field | ac | reserved | address |
Width | 8 | 8 | 48 |

Bits | 71:64 | 63:56 | 55:48 | 47:0 |
---|---|---|---|---|
Field | mc | ac | reserved | address |
Width | 8 | 8 | 8 | 48 |

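The pointer layout can be expressed as a pair of pack and unpack helpers. This is an illustrative sketch; the function names are not part of the proposal.

```python
ADDRESS_BITS = 48
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def pack_pointer(ac, address, reserved=0):
    """Build a 64-bit pointer: ac in 63:56, reserved in 55:48, address in 47:0."""
    if not 0 <= ac <= 0xFF:
        raise ValueError("ac must fit in 8 bits")
    if not 0 <= reserved <= 0xFF:
        raise ValueError("reserved must fit in 8 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address must fit in 48 bits")
    return (ac << 56) | (reserved << 48) | address


def unpack_pointer(ptr):
    """Split a 64-bit pointer into (ac, reserved, address)."""
    return (ptr >> 56) & 0xFF, (ptr >> 48) & 0xFF, ptr & ADDRESS_MASK


ptr = pack_pointer(ac=0x2A, address=0x7FFF_0000_1000)
print(hex(ptr))              # 0x2a007fff00001000
print(unpack_pointer(ptr))   # (42, 0, 140733193392128)
```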
Tagged RISC‑V supports satp.mode of Sv39 and Sv48. In addition, we propose a new satp.mode value that employs Ssv64 page table formats and separates the user-mode and supervisor-mode page tables, using bit 55 to pick either satp or a new ssatp as the page table address. This extends the user-mode address space by one bit, and may avoid issues that come up with sign-extension in the RISC‑V universe.
The RVI64 load instructions (e.g. LB, LH, LW, and LD) set bits 127:64 of the destination x register to 0, which marks the register as not containing a CHERI‑128 capability (perhaps a single bit of the P field?). When used as the base register for a load or store instruction, such values indicate that clique checking is employed. When the x register does contain a CHERI‑128 capability, CHERI access rules apply. Whether CHERI accesses check memory tags against the ac field is TBD.
The CHERI‑128 format is changed to reduce the address bits from 64 to 56 and expand the top and bottom values as shown below:
Bits | 71:64 | 63:56 | 55:48 | 47:0 |
---|---|---|---|---|
Field | 254 | ac | 0 | byteaddress |
Width | 8 | 8 | 8 | 48 |

Field | Width | Bits | Description |
---|---|---|---|
byteaddress | 48 | 47:0 | Address of addressed memory |
0 | 8 | 55:48 | Reserved |
ac | 8 | 63:56 | Clique of addressed memory (TBD whether used by CHERI or not) |
254 | 8 | 71:64 | CHERI doubleword 0 tag |
Bits | 71:64 | 63:49 | 48:42 | 41 | 40 | 39:24 | 23:21 | 20:3 | 2:0 |
---|---|---|---|---|---|---|---|---|---|
Field | 255 | P | 0 | Z | L | T | TE | B | BE |
Width | 8 | 15 | 7 | 1 | 1 | 16 | 3 | 18 | 3 |

Field | Width | Bits | Description |
---|---|---|---|
BE | 3 | 2:0 | Bottom or Exponent bits |
B | 18 | 20:3 | Bottom bits |
TE | 3 | 23:21 | Top or Exponent bits |
T | 16 | 39:24 | Top bits |
L | 1 | 40 | Length bit 19 or Sealed |
Z | 1 | 41 | Internal Exponent flag |
0 | 7 | 48:42 | Reserved |
P | 15 | 63:49 | Permissions |
255 | 8 | 71:64 | CHERI doubleword 1 tag |
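The doubleword 1 layout can be captured as a table of bit ranges, from which fields are extracted or assembled. The sketch below only splits and joins bit fields; it does not interpret them as bounds (see CHERI Concentrate for that). The names `DW1_FIELDS`, `decode_dw1`, and `encode_dw1` are hypothetical.

```python
# (high bit, low bit) of each field within the 64 data bits of doubleword 1
DW1_FIELDS = {
    "BE": (2, 0),
    "B": (20, 3),
    "TE": (23, 21),
    "T": (39, 24),
    "L": (40, 40),
    "Z": (41, 41),
    "reserved": (48, 42),
    "P": (63, 49),
}


def decode_dw1(dw):
    """Split the 64 data bits of doubleword 1 into named fields."""
    fields = {}
    for name, (hi, lo) in DW1_FIELDS.items():
        width = hi - lo + 1
        fields[name] = (dw >> lo) & ((1 << width) - 1)
    return fields


def encode_dw1(fields):
    """Assemble doubleword 1 from named fields; missing fields are zero."""
    dw = 0
    for name, (hi, lo) in DW1_FIELDS.items():
        width = hi - lo + 1
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        dw |= value << lo
    return dw
```

The field widths sum to 64, so `decode_dw1(encode_dw1(f))` returns the same values for any in-range `f`.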
See CHERI Concentrate for details.
The LC (Load CHERI) instruction takes an exception if the memory tag in doubleword 0 is not 254 or if the memory tag in doubleword 1 is not 255. The SC (Store CHERI) instruction sets the memory tag of doubleword 0 to 254 and of doubleword 1 to 255. This makes these locations somewhat inaccessible to other load and store instructions, since these tags would never be returned by an allocator. A pointer whose bits 63:56 hold one of these tag values could be created, which would allow the CHERI‑128 fields to be read. Outside of machine mode, stores to memory locations with CHERI tags would take an exception to prevent tampering (an alternative would be to change the tag to a non-CHERI tag).
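The LC and SC tag rules can be modelled as follows. This is a sketch with hypothetical names: `memory` maps a doubleword index to a `(tag, data)` pair, and `store_doubleword` models only the CHERI tamper check on ordinary stores, leaving out the clique comparison.

```python
CHERI_DW0_TAG = 254
CHERI_DW1_TAG = 255
MASK64 = (1 << 64) - 1


class TagException(Exception):
    """Models the exception taken on a CHERI tag violation."""


def store_cheri(memory, index, capability):
    """SC: write both halves of a 128-bit value and set tags 254 and 255."""
    memory[index] = (CHERI_DW0_TAG, capability & MASK64)
    memory[index + 1] = (CHERI_DW1_TAG, (capability >> 64) & MASK64)


def load_cheri(memory, index):
    """LC: trap unless doubleword 0 is tagged 254 and doubleword 1 is 255."""
    tag0, low = memory[index]
    tag1, high = memory[index + 1]
    if tag0 != CHERI_DW0_TAG or tag1 != CHERI_DW1_TAG:
        raise TagException(index)
    return (high << 64) | low


def store_doubleword(memory, index, value, machine_mode=False):
    """Ordinary store: traps on CHERI-tagged locations outside machine mode."""
    tag, _ = memory[index]
    if tag in (CHERI_DW0_TAG, CHERI_DW1_TAG) and not machine_mode:
        raise TagException(index)
    memory[index] = (tag, value & MASK64)
```

In this model the store variant that traps was chosen; the alternative mentioned above, changing the tag to a non-CHERI value, would instead let the store proceed and cause a later LC to trap.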
Because the x registers support 128‑bit load/store for LC and SC, it might be appropriate to add instructions that load/store this width for non-capability data. This becomes problematic without a register tag, and so is not proposed here.
<webmaster at securerisc.org>
2023-05-29