SecureRISC Instruction Set Architecture

Version 0.61-draft-20250711

Up Front

Documentation Outline

This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So, if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fitted.

Work In Progress Documentation

SecureRISC was created to develop old ideas and notes of mine. It is not a complete Instruction Set Architecture (ISA), but only includes the things I have had time to consider and work on. It is certainly not a specification. At present, this document exists only for discussion purposes.

The ISA is mostly just ideas at this point. The opcode assignments and instruction specifications are little more than hints. The Virtual Memory architecture needs work. This is not a formal specification at this point; it is a document for discussion. Should it progress to a specification, much rewriting would be required (e.g., to adopt RFC 2119 requirement level keywords and more precise definitions.

Open Source

SecureRISC began as an exploration of what a security-conscious ISA might look like. I hope I can improve it over time to live up to its name. Should it turn into something more than an exploration, I would intend to make it an Open-Source ISA, along the lines of RISC‑V.

Software

There is no software (simulator, compiler, operating system, etc.) for SecureRISC. This is a paper-only set of ideas at this point. A compiler, simulator, and FPGA implementation might be created at some point, but that is probably years in the future.

Terminology

The reader of this document is likely already familiar with the acronyms, terminology, and concepts used herein. Occasionally the reader might encounter something unfamiliar. So just in case, there is a set of glossaries at the end of this document. Links to glossary entries are marked with a icon. However, often links are not provided, so the reader may have to search for unfamiliar terminology. One glossary is for general terms used in instruction set, processor, and system software, with a coda to the general glossary for terms specific to RISC‑V that this document cites. There are also references for programming languages, operating systems, and other processor architectures cited herein. Since cryptography terminology is used in a few places, there is a specialized crypto glossary for that. Finally, a glossary of security vulnerabilities that have tripped up many processor designs is included, since this document refers to many such things.

SecureRISC Instruction Set Architecture

Introduction

SecureRISC is an attempt to define a security-conscious Instruction Set Architecture (ISA) appropriate for server-class systems, but which, with modern process technology (e.g., 5 nm), could even be used for IoT computing given that the die area for a single such processor is a small fraction of one mm². SecureRISC starts with the assumption that processors should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and both discretionary and mandatory (non-discretionary) access control. I also propose a new block-structured instruction set that allows for better Control Flow Integrity (CFI) and performance. For performance, several features support highly parallel execution and latency tolerance, even in implementations that avoid highly speculative execution, which can lead to security vulnerabilities.

A comment about Multics is appropriate here. There seems to be an impression among many in the computer architecture world that many Multics features are complex. However, they are simple and general, easy to implement, and remove pressure on the architecture to add warts for specialized purposes. Computer architecture from the 1980s to the present is often an oversimplification of Multics. For example, segmentation in Multics served to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) creates pressure to keep the number of bits devoted to access control minimal when security considerations might suggest a more robust set. Even though processor hardware has eliminated the segment concept, operating systems (e.g., Linux) have reinvented it. Linux calls a contiguous set of virtual pages with similar attributes (e.g., protection) a Virtual Memory Area (VMA). It may be that calling it VMA instead of segments was necessary given that a completely different usage of segmentation terminology was the 8086 microprocessor trying to extend its address spaces from tiny to small as an alternative to having a reasonable number of bits in a pointer.

As another example, many contemporary processor architectures (e.g., RISC‑V) have two rings (User and Supervisor), with a single bit in PTEs (the U bit in RISC‑V) serving as a ring bracket. (RISC‑V has, in addition, a third ring-like privilege level called Machine Mode, but this privilege is mostly operates without virtual memory and so PTE bits are not needed for it.) Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. It was true that rings were not well utilized on Multics, but we now have more uses for multiple rings, such as hypervisors, Just-In-Time (JIT) compilation, WebAssembly, and sandboxing.

Goals

The goals for SecureRISC, in order of priority, are as follows:

Security
Performance
Power efficiency
Compatibility where not in conflict with the above
Code size, primarily for performance
Support for Garbage Collection (GC)
Support for languages with dynamic typing (e.g., Lisp, Python, Julia)
Suitable for implementations that execute many instructions per cycle (wide issue)
Support very large (2⁶⁴ bytes) address spaces efficiently
Suitable for implementations that minimize speculative execution to avoid security vulnerabilities (e.g., in-order and narrow issue) while maintaining some latency tolerance
Simplicity where beneficial, but not at the expense of the higher-priority goals

Non-goals for SecureRISC include (this list will probably grow):

Support for low-end processors:
- 32‑bit datapaths and addresses
- Omitting hypervisor support
- Omitting memory translation
- Omitting security functionality
- Omitting floating-point, vector, and matrix operations
Support for integer Digital Signal Processing (DSP) (prefer floating-point for DSP)
Ideological purity (e.g., pure RISC)

(An exception for low-end processor support might be a small secure enclave chip that omits floating-point, vector, and matrix operations. It might also have minimal memory translation while incorporating cryptographic features.)

Security encompasses many aspects. One of the most critical is preventing unassisted infiltration—for example, attacks exploiting buffer overflows, use-after-free errors, and other programming mistakes. SecureRISC has two defenses against buffer overflows: bounds checking for disciplined languages such as Rust and pointer/memory tagging for undisciplined languages such as C++. Another key security issue is unintentionally assisted infiltration, such as phishing attacks that install Trojans. This risk may be mitigated through mandatory access control.

The goal of SecureRISC is to provide a comprehensive security solution, but further work is required to achieve this. At present, SecureRISC focuses on areas where meaningful improvements can be made while remaining practical to implement with current processor technology.

I believe SecureRISC has yet to fully exploit the potential of mandatory access control for security, and I continue to explore opportunities in this area.

Synergy Between Security and Other Goals

Security, garbage collection, and dynamic typing may appear to be orthogonal, but they are synergistic. SecureRISC attempts to minimize the impact of programming mistakes in several ways, such as making bounds checking somewhat automatic and making compiler-generated checking more efficient for disciplined languages where bounds checking is possible. To keep pointers a single word, the architecture supports encoding the bounds in extra information per memory location. For undisciplined languages (e.g., C++) the compiler does not, in general, know the bounds that would be required to perform a check, and the two best methods so far invented to solve this also require some sort of extra information per memory location, such as the pointer and memory coloring used in AddressSanitizer or the tag bit in CHERI. AddressSanitizer uses an extra N bits per the minimum allocation unit (where that unit may be increased to reduce overhead) to detect errors with approximate probability 2^−N. To address memory allocation error detection, other techniques are necessary. One possibility is garbage collection (GC), which eliminates these errors, but to be a substitute for explicit memory deallocation, GC needs to be efficient, hence the synergy of these two SecureRISC goals. Some implementations of GC are made more efficient by being able to distinguish pointers from integers that look like addresses at runtime, and some sort of tagging helps with things. For languages requiring explicit deallocation of memory, AddressSanitizer may be used. However, AddressSanitizer on most architectures is too inefficient to use in production and is typically employed only during development as a debugging tool. SecureRISC seeks to make it efficient enough for production use. CHERI accomplishes its extra bounds checking by implementing a 129‑bit capability encoding a 64‑bit pointer, 64‑bit base, 64‑bit length, type, and permissions (note the extra bit over each 64‑bit memory location required for making capabilities non-forgeable). Thus bounds checking, GC, or memory allocation error detection are all made possible or more efficient by having extra information per memory location. Since SecureRISC must support 64‑bit integer and floating-point arithmetic, this extra information needs to be in addition to the 64 bits required for that data.

As justified above, SecureRISC targets its goals through what will likely be the most controversial aspect of SecureRISC: tags on words in memory and registers. The Basic Block descriptors may be more unusual, but the reader may come to appreciate this aspect of SecureRISC with familiarity (especially given the Control Flow Integrity advantages as a security feature). However, the reader may, in the end, not find memory tags convincing because of the non-standard word size that results. An efficient and secure alternative is not known, and as a result, SecureRISC adds tags to memory locations. Tags simultaneously provide an efficient mechanism for bounding pointers, support use-after-free detection, support bounds checking with single-word pointers for undisciplined languages such as C++ (HWASAN or CHERI), and support more efficient Garbage Collection (the best solution to allocation errors). They also happen to support dynamically typed languages.

SecureRISC has not yet explored another use for tagging data, which is taint checking.

If the reader is tempted to dismiss SecureRISC because of tagged memory, consider jumping ahead to Memory Options for SecureRISC. That section elaborate several ways in which tags may be added to main memory using existing DRAM modules.

Language Specific Mechanisms

The above discussion suggests at least five different uses of memory tags:

Implementing CHERI capabilities (including bounds checking for undisciplined languages at the cost of doubling the width of pointers).
Implementing AddressSanitizer bounds checking and use-after-free detection for undisciplined languages (e.g., C++).
Making Garbage Collection more efficient (for disciplined languages).
Efficient support of dynamically typed languages (e.g., being able to type 64‑bit integer and floating-point data without indirection).
More efficient bounds checking for disciplined languages while keeping pointers small (unlike CHERI).

While memory tagging is useful for the above, it is used in different ways for the above. Instead of a single unified mechanism, SecureRISC uses memory tagging in two ways, one for AddressSanitizer, and then combining CHERI and disciplined language support into the other.

Is SecureRISC actually RISC?

Is SecureRISC Reduced Instruction Set Computing? It is certainly not a small instruction set, but RISC no longer stands for that and has primarily become a marketing term. As one wag put it, RISC is any instruction set architecture defined after 1980. A more accurate description might be ISAs suitable as advanced compiler targets, as the general trend is to depend on the compiler to exploit features of the ISA, such as redundancy elimination, sophisticated register allocation, instruction scheduling, etc. Such features have generally favored ISAs organized along the load-and-store model and simple addressing modes. By this criterion, I believe SecureRISC is a RISC architecture, but it is not a simplistic or reduced instruction set. Contemporary processors, even for simple instruction sets, are very complex, and that complexity will probably grow until Moore’s Law fails. The design challenges are significant. In the contemporary world, simplicity is a goal when it furthers other goals such as performance (e.g., by maximizing clock rate), efficiency (e.g., by reducing power consumption), and so on.

Background

The original motivation for block-structured ISAs came from Instruction-Level Parallelism (ILP) studies that I did back in 1997 at SGI that showed that instruction fetch was the limiting factor in ILP. This was before modern branch prediction, e.g., TAGE, so that result may no longer be true. The idea was that instruction fetch is analogous to linked-list processing, with parsing at each list node to find the next link. Linked-list processing is inherently slow in modern processors, and with parsing, it is even worse. I wanted to replace linked lists with vectors (i.e., to vectorize instruction fetch), but couldn’t figure out how, and settled for reducing the parsing at each list node. HP Labs’ Playdoh ISA’s Prepare-to-branch may have also been an influence. Despite improved branch prediction, I still feel that block-structured ISAs are worthwhile, but the exact trade-offs might require updating older work in this area. The best validation of this dates from 2007, when Professor Christoforos Kozyrakis convinced his Ph.D. student Dr. Ahmad Zmily to look at this approach in a Ph.D. thesis. In the introduction of Block-Aware Instruction Set Architecture, Dr. Zmily wrote, We demonstrate that the new architecture improves upon conventional superscalar designs by 20% in performance and 16% in energy. Such an advantage is not enough on which to foist a new ISA upon the world, but it encourages me to think that it provides an impetus for using such a base when creating a new ISA for other purposes, such as security. Since 2007, improvements in the proposed block-structured ISA should result in greater performance improvements, while improvements in branch prediction (e.g., TAGE predictors) decrease some of the advantages. Also, Dr. Zmily’s work was based on the MIPS ISA, and SecureRISC is quite different in many aspects. Should SecureRISC be developed to the point where binaries can be produced and simulated, a more appropriate performance estimate will be possible.

I have long been aware for the need for latency tolerance microarchitecture. Later, I encountered the idea of decoupling portions of the pipeline to enhance latency tolerance, by allowing address generation to occur early, but allow some execution units to operate later (e.g., by at least a L2 Data Cache latency). I now attribute this idea to the Astronautics ZS-1. While this was certain to not match OoO microarchitectures in performance, the parade of security vulnerabilities exploiting microarchitecture speculation suggested that supporting this early vs. late pipeline split in the instruction set could benefit microarchitectures targeting security at the cost of performance.

Before starting SecureRISC, my previous experience was with many ISAs and operating systems. Long after starting my block-structured ISA thoughts, I became involved in the RISC‑V ISA project. RISC‑V is, in many ways, a cleaned-up version of the MIPS ISA (e.g., minus load and branch delay slots) and it seems likely to become the next major ISA after x86-64 and ARM. Being open source, RISC‑V has easy-to-access documentation. As such, I have used it for comparisons in the current description of SecureRISC and modified some of its virtual memory model to be slightly more RISC‑V compatible (e.g., choosing the number of segment bits to be compatible with RISC‑V Sv48). However, most aspects of the SecureRISC ISA predate my knowledge of RISC‑V and were not influenced by it, except that I found that RISC‑V’s Vector ISA was more developed than my thoughts (which were most influenced by the Cray-1, which supported only 64‑bit precision).

In 2022, I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. Despite my concerns, I thought that SecureRISC might be a good platform for CHERI, so I have extended SecureRISC to outline how it might support CHERI capabilities as an exploration. I also modified SecureRISC’s sized pointers to include a simple exponent to extend the size range based on ideas from CHERI but kept them as single words by not including both upper and lower bounds. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has the advantage of size. There is a more detailed discussion of CHERI and SecureRISC below.

In 2023, I took the virtual memory ideas in SecureRISC and created a proposal for RISC‑V, tentatively called Ssv64. I made Ssv64 much more RISC‑V compatible than SecureRISC had been (e.g., in PTE formats), and have recently been backporting some of those changes into SecureRISC since there is no reason to be needlessly different.

SecureRISC does depend upon a few new microarchitecture structures to realize its potential. There should be a Basic Block Descriptor Cache (BBDC), though this could be thought of as an ISA-defined Branch Target Buffer (BTB). The BBDC is in addition to the usual L1 Instruction Cache. While the BTB and BBDC are similar, the BBDC is likely to be sized such that it requires more than one cycle to access (resulting in a target prediction in two cycles), making another structure useful (in the An Example Microarchitecture section at the end, this is called a Next Descriptor Index Predictor) to enable a new basic block to be fetched every cycle by providing just the set index bits one cycle early. The most novel new microarchitecture structure suggested for SecureRISC is the Segment Size Cache, which caches the segment size log2 for a segment number, which is used for segment bounds checking on the base register of loads. This cache might also provide the GC generation number of the segment (TBD). While these are new structures, in the context of a modern microarchitecture, especially one with two or three levels of caches and a vector unit, they are tiny and worthwhile.

It is possible that the Segment Size Cache would be generalized to a Segment Descriptor Cache by storing more than just the ssize field of Segment Descriptor Entries (SDEs). This would be used to save an L2 Data Cache reference on many Translation Cache (TLB) misses.

Conventional Aspects of SecureRISC

Some things remain unchanged from other RISCs. Addresses are byte-addressed. Like other RISC ISAs, SecureRISC is mostly based upon loads and stores for memory access. Integers and floating-point values have 8, 16, 32, or 64‑bit precision. Floating-point would be IEEE-754-2019 compatible. The Vector/Matrix ISA will probably be similar to the RISC‑V Vector ISA but might, however, use the 48‑bit or 64‑bit instruction format to do more in the instruction word and less with vset (perhaps a subset of vector instructions would exist as the 32‑bit instructions). Also, there are multiple explicit vector mask registers, rather than using v0. (There are sixteen vector mask registers, but only vm0 to vm3 are usable to mask vector operations in 32‑bit vector instructions—the others primarily exist for vector compare results.)

Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture since it echoes the Cray-1 ISA from 1976 but is somewhat different from RISCs since the 1980s. (The additional register files in SecureRISC serve multiple purposes, but an important one is supporting the execution of many instructions per cycle without the wiring bottleneck that a single unified register file would create.)

Unconventional Aspects of SecureRISC

Much more in SecureRISC is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.

Programming Model
- SecureRISC’s sized pointers are oriented towards array indexing rather than pointer arithmetic. An array pointer provides the address of element 0 and the size of the array. The array reference a[i] loads or stores to that location only after checking that i is within the bounds specified in the array pointer. C++ *p++ sort of programming is less amenable to SecureRISC bounds checking and must use either CHERI-128 pointers with bottom and top encoded in addition to the pointer itself, or use the alternative AddressSanitizer memory tag method of bounds checking.
Control Flow
- The conventional instruction stream is split into Basic Block (BB) descriptors that carry most of the control flow specification of the program, leaving the instructions to specify loads, stores, operations, and checking the predictions made in the control flow front-end. This has several performance and security advantages and, in particular, facilitates Control Flow Integrity (CFI) checks quite naturally, and defeats most Return Oriented Programming. The BB descriptors get their own cache, which replaces the dynamically created Branch Target Buffer (BTB) found in other microprocessors. The advantages of this approach are outlined in the next section.
- Conditional branches, indirect calls, and jumps exist in the instruction stream to check the predictions made in the front-end of the pipeline, but they may occur anywhere in the basic block (not just as the last instruction), which reduces the misprediction penalty and increases scheduling opportunities. The branch determines what happens at the end of the basic block, not to the next instruction, which is why it can occur anywhere. Conditional branch instructions (as opposed to descriptors) do not include a target address, only the taken/not-taken decision; the target address is taken from the basic block descriptor. Jump instructions may include a memory read, as in a load. Calls, returns, and jumps can be conditional (i.e., a basic block may include both a conditional branch and a jump or call).
- Variable length instructions are supported in a manner that still allows efficient parallel decoding. The instruction stream also does not need bits to specify instruction length, as this information comes from the basic block descriptor.
- There is inner loop iteration prediction support to avoid some branch misprediction on loops such as for i ← a to b (where the loop iteration count is b − a + 1) and for i ← a to b step -1 (where the loop iteration count is a − b + 1). The loop may be exited early with a conditional branch; only the loop back is predicted with the hint.
- For Control Flow Integrity (CFI), the return address stack is managed in a protected manner by the processor and is separate from the program data stack. This stack is expected to be implemented with a specialized cache rather than as a prediction-only structure. This is a tentative feature and not central to the SecureRISC idea, but simply a response to one of the major exploits of modern systems, overwriting return addresses on the program stack. It also happens to eliminate vulnerabilities introduced by conventional return address prediction (e.g. SpectreRSB). If this problem is completely addressed by bounds checking, the return address stack is not needed. This feature is also motivated by modern processors having special return address stack hardware for prediction purposes, so having hardware push/pop is not really new. Additionally, this stack is used for exceptions, which avoids the need for per-ring CSRs for exception PCs and scratch registers to bootstrap the exception handler.
Tagged Memory and Registers
- Words in memory and registers are tagged, and tags are checked. For pointers, the tag supplies the size of the memory addressed or indicates that the size is stored relative to the pointer. Using the pointer as a base with an index checks the index against the size. Tags are stored in memory. Address registers are wide enough to hold the pointer, tag, and full size when that size is loaded from the pointer.
- SecureRISC defines words in registers and memory to be 72 bits, consisting of 8 bits for the tag and 64 bits for the data. In some ISAs a word is 16 bits and in many others it is 32 bits. In the Cray ISA it is 64, but here it is 72.
- Tags on words in memory and registers may be the most controversial aspect of SecureRISC. As mentioned in the introduction, the Basic Block (BB) descriptors may be more unfamiliar, but the reader may come to appreciate this aspect of SecureRISC. However, the reader may, in the end, not find memory tags worth the non-standard word size. Tags in SecureRISC achieve several goals, none of which might be sufficiently motivating by itself. However, I find that, viewed in combination, bounds checking, support for Garbage Collection, and support for dynamically typed languages justify the non-standard word size. Tagging is also an area of the ISA that is currently underspecified; the reader should expect some gaps for the time being.
- One downside of tags is that pointers must be aligned on word boundaries in memory. The sub-word load and store instructions for 8, 16, 32, and 64‑bit data can be implemented to handle misaligned data, but there is nowhere to load the tag from on a misaligned word load. Misaligned loads optionally (for debugging) check that the memory word tag (or multiple tags if a misaligned load crosses a word boundary) has non-pointer data (tags 240 to 253). Sub-word stores write the memory tag(s) to an integer tag 240. Words tagged with tag 255 are not overwritten but cause an exception (most likely implemented by 8 bits per 64 B block stored in the tags of data caches). SecureRISC will take an exception on word loads and stores to memory addresses that are not 8 B aligned. This is not a problem with High-Level Languages but poses a problem for languages such as C++, where decades of bad programming have made misaligned pointers common enough to cause porting problems to stricter architectures. In most cases, C++ code will only use AddressSanitizer cliques and sub-word loads and stores, and sub-word accesses do support unaligned access.
- SecureRISC defines one tag value (254) that causes an exception when the memory location is loaded from, and another (255) that causes an exception on loads or stores (with a special instruction required to overwrite tag 255). These tags have several uses. One use is for dynamic linking (tag 254). Another could be to detect references to uninitialized memory and dynamically allocated memory used after being freed, but pointer and memory cliques are likely the preferred method for languages that require explicit deallocation.
- A goal of SecureRISC is to support the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project programming model, while not requiring it. Two tags are dedicated to CHERI 128‑bit capabilities; these reserved tag values replace the CHERI tag bit. More on this follows in a later section.
Register Files
- The trend in recent ISAs has been toward large (e.g., 31 or 32 entries) homogeneous register files, but SecureRISC opts for multiple smaller (16) files of greater heterogeneity for several reasons. One reason is specialization (the same reason recent ISAs have separate integer and floating-point files), and another is to simplify layout and wiring in very wide issue implementations by allowing separate files with fewer read and write ports to be placed nearer the functional units that operate on their values. This places an additional burden on the compiler, but the suitability of the ISA for very wide issue implementations makes this burden-shifting appropriate. Due to having only 16 registers of each type, generally, no register is hardwired to 0.
- The usual distinction between integer and floating-point register files is replaced by address registers, index registers, scalar registers, vector registers, and matrix accumulators. Scalar registers may hold any tagged value and may be used for integer and floating-point computation. Address and index registers may likewise hold any tagged value but are specialized for bounds checking. Integer arithmetic involved in creating addresses is done in the address and index registers, while other integer arithmetic is done in the scalar registers, which are also used for floating-point operands. Scalar registers support 3 reads and one write per instruction and support fused operations (e.g., multiply-add). SIMD registers, found in some ISAs, are replaced by vector registers. Matrix accumulators, unlike the other register files, are not tagged.
  - The address registers (ARs) are particularly important to SecureRISC, as the instructions that load them or move to them are designed to separate portions that are actual address bits from portions that are used for checking. This allows loads, which use ARs as a base address, to feed bits into the data and translation caches without masking off portions. (Also, there are no sub-word loads and therefore no load-aligner for AR loads, which helps the timing of address/size separation.)
  - Some microarchitectures may exploit the heterogeneous register files to use a simpler mechanism than full Out-of-Order (OoO) instruction scheduling to tolerate cache miss latency in the scalar registers, providing higher performance for mid-range implementations and implementations that want to avoid Spectre, Meltdown, Foreshadow, Downfall, Inception, etc. vulnerabilities that arise from speculation. In particular, SecureRISC implementations, even OoO ones, will often have execution units early in the pipeline for the AR/XR operations and separate execution units later in the pipeline (e.g., separated by the L2 data cache latency or more) for the SR/BR/VR/VM/MA operations.
  - While the heterogeneous register file may require some additional compiler work, the reward is the savings in processor hardware, which can use register files with many fewer read and write ports that are also located physically closer to the functional units they serve. It also keeps the individual register files smaller with register renaming (e.g., 64 or 128 physical registers should be enough for each file).
  - Recent ISAs have had 31 integer registers and 32 floating-point registers. SecureRISC register fields are only 4 bits and support only 16 registers in the instruction word. For pure integer scalar algorithms, this is 48 registers (Address, Index, Scalar), rather than 31. Increasing register fields to 5 bits would add 3 to 4 more bits in instructions. SecureRISC needs additional opcode bits to specify operations on different register files (e.g., integer subtract on both index registers and scalar registers). However, the number of additional bits required is less than the 3–4 from a register field expansion. For floating-point algorithms, there are still 32 for address calculations but only 16 for scalar floating-point computations. However, many floating-point algorithms will be able to employ the vector register file and matrix accumulators, and this will be a better overall mix of resources for these algorithms.
  - The Boolean Registers (BRs) are introduced for comparisons and selection on the scalar registers, which in some microarchitectures may be much later in the pipeline. Having these registers keeps their computation and use local to that portion of the pipeline and facilitates branch avoidance.
- In addition to address and scalar register files, there is also a vector register file, as in the Cray-1 and RISC‑V, replacing the SIMD registers found in earlier RISC ISAs. Vectors are both the ancient and modern alternative to SIMD. Ancient in that processors such as the Cray-1 (1976) offered a vector ISA using vector registers, and modern (as in RISC‑V) because they are potentially more efficient and scalable. Vector ISAs are, however, somewhat complex because they have to package up all the common idioms that might be composable by a compiler using scalar instructions. The vector registers are also joined by a set of vector mask registers, and a set of matrix accumulators.

Virtual Memory

Pointer addresses are translated using one stage of segmentation with a choice of direct mapping or paging, as specified by the supervisor. Segments are the primary unit of access control and sharing and are bounds-checked. Segments in SecureRISC are somewhat similar to an abstraction called Virtual Memory Areas (VMAs)

in operating systems that target processors without segments. VMAs are a variable number upper bits of the processor-specific virtual address width, with bits above that being zero or sign extended from there. For example, the VMA for the application address space might be as shown below on RISC‑V’s Sv57:

RISC-V Sv57 Application Address with VMA
63	56	55		12	11	0
0		VMA	VPN		byte
8		V	44−V		12

In contrast, SecureRISC moves the VMA to bits 63..48 of the virtual address and makes it fixed at 16 bits in size. When paged, Translation Cache (TLB) miss table walks use segment size and OS-specified table sizes to determine the number of page table stages. Each page table stage may use independently chosen sizes, selected by the operating system, to reduce the number of stages in the page table, and multiple page sizes are supported to reduce TLB miss rates. Because page table walks start with the segment size, small segments (e.g., code segments) have a reduced TLB miss penalty, as illustrated below where SS is the Segment Size log2 and PTS is the root page table size log2 chosen by the operating system.

Root page table for segment starts with segment size
63	48	47			0
segment		fill	tableindex0	offset
16		48−SS	PTS	SS−PTS

Direct mapping, along with the segment size feature, is particularly useful for I/O regions of the address space. Because the operating system specifies non-leaf page table sizes, operating systems with flexible physical memory allocation can use larger tables (along with the segment size) to minimize the TLB miss penalty for these segments.

Support for hypervisors is not an add-on but is built-in from the start. Hypervisors specify a second level of either direct mapping or paging, which allows them to page the memory used by the supervisor. There is no pretense that the supervisor is running on bare hardware; rather than virtualize privileged operations, SecureRISC has supervisors explicitly call their hypervisor for certain services.
A pointer that is relative to the segment base is supported, allowing such pointers to appear in files that are mapped to different locations in different address spaces. These local pointers may be 64 or 32 bits in width, and the 32‑bit local pointers may reduce memory footprints in some circumstances.
Page tables for memory mapped into multiple address spaces are sharable, even when these address spaces have different access permissions, because permissions are generally kept per segment rather than per page. (Though per-page permissions also exist for backward compatibility and can further reduce segment permissions if used.) Note: The segment descriptor could provide a bit to turn off PTE permission bits to give more bits for the operating system to use in PTEs, if needed.
New segment permissions are introduced in addition to the standard Read, Write, and Execute (R, W, and X). Pointer permission (P) allows segments to contain pointers to other segments, while Capability permission (C) allows segments to contain CHERI capability pointers. Pointer permission would be enabled for stack and heap segments, but not for mapped files, code segments, etc.
SecureRISC uses new terminology for what are typically called virtual and physical addresses (where physical address is often split into guest physical address and system physical address): local virtual addresses, system virtual addresses, and system interconnect addresses. The mapping of system virtual addresses to system interconnect addresses is system-wide, and is used by both processors and I/O. This mapping also provides unconventional attributes and permissions (e.g., beyond RISC‑V’s Physical Memory Attributes or PMAs), such as whether the memory is tagged or not, and whether it supports CHERI capabilities. Loading from non-tagged memory automatically synthesizes an integer tag as the data is supplied to the system interconnect fabric.
SecureRISC does not bypass translation, except during boot. After boot, all loads and stores go through the lvaddr → svaddr → siaddr translation and access checks. The only places where svaddrs and siaddrs are used are in segment and page tables.

Modes
- The concepts of user and supervisor modes are generalized into an older, more sophisticated model called rings, which not only improves performance but also enables sandboxing. Recent ISAs have introduced complicated workarounds for securely running code downloaded from the internet (e.g., in web browsers), but the ring model is simpler and offers better performance.
  Like Multics, SecureRISC defines 8 rings numbered 0–7, but with inverted numbering; ring 0 is the least privileged, while ring 7 is the most privileged. For example, ring 7 could be analogous to RISC‑V’s Machine Mode, used for the Trusted Execution Environment (TEE), or serve a similar role as POWER 9’s Ultravisor. Hypervisors might run in ring 6, with their device drivers in ring 5. Guest supervisors could operate in ring 4, with their device drivers in ring 3, while a user-mode browser might run in ring 2, executing JavaScript JIT code in ring 0.
Other
- SecureRISC more efficiently supports dynamically typed languages, such as Lisp, Python, and Julia.
- SecureRISC more efficiently supports generational garbage collection, by defining a few extra bits in segment descriptors, page table entries, and the address match/trap registers.
- In contrast to recent RISCs, which have reduced reliance on exceptions (except for loads, stores, and jumps), SecureRISC leans into runtime exceptions for increased security. The compiler is encouraged to use trap instructions to check assertions, bounds, and other conditions at runtime.
- SecureRISC supports speculative and non-speculative load instructions, like a few recent ISAs (e.g., IA‑64), but unlike many others such as MIPS, SPARC, Alpha, POWER, ARM, RISC‑V, etc. When beneficial, the compiler can split a load instruction into a speculative load and check instruction pair, hoisting the speculative load to an earlier basic block while leaving the check instruction in the original basic block to take the exception based on the validity bits saved by the speculative load. While this may not significantly benefit Out-of-Order (OoO) processors, it is included so that processors can eschew OoO to avoid Spectre, Meltdown, Foreshadow, Downfall, Inception, etc. pitfalls and still get better performance than simple in-order processors. (Perhaps this could be extended to tracking mandatory access control for data?) It will be possible to turn off this feature to simplify context switch.
- While not unconventional in older processors, RISC processors have avoided push and pop instructions. As RISC processors have gotten increasingly complex (for example hypervisor support is complex and the RISC‑V vector ISA is surely not RISC), it seems less necessary to avoid push and pop. On the other hand, push and pop are generally not especially useful (and sometimes create unnecessary dependencies in instruction scheduling), and so RISC ISAs have not particularly missed them. SecureRISC adds push and pop in the places where it is useful as hinted earlier: a protected stack used for return addresses and exception context. For exceptions, this avoids several other features such as scratch registers and global interrupt enable bits.

Advantages of the Basic Block Descriptor

The Basic Block (BB) descriptor aspect listed above is perhaps the most unfamiliar. Below are some rationale and advantages of this approach.

Contemporary processors have various structures that are created and updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Stack Buffers (RSB), Conditional Branch predictors, Indirect Jump and Call predictors, prefetchers, and so on. In SecureRISC, the BTB is moved into the ISA for performance and security. In particular, the BTB becomes a Basic Block Descriptor Cache (BBDC) in SecureRISC. The BBDC caches lines of Basic Block Descriptors, generated by the compiler, in a section separate from the instructions. SecureRISC also separates the return address stack from the program stack for security^*, making the RSB more of a cache (with a speculative version in the Return Address Predictor) and further enhancing performance^†. SecureRISC adds some additional ISA support for loop prediction so that branch prediction is not needed, in most cases, to predict loop iteration counts. This frees up bits in the branch predictor for other branches.

* Both by protecting return addresses from modification and preventing predictions from privilege levels and context switches from affecting each other.

† Allowing underflow to fill a line of return addresses from the L2 data cache.

Fetch prediction operates solely on the BB descriptors (cached in the BBDC), which are fixed in size and do not need to be parsed, unlike the instruction stream. This allows for much more efficient operation. The BBDC is filled a block at a time just like the Instruction Cache (with prefetch as well), and locality improves performance here too.
In SecureRISC, it is much easier to rearrange instructions in memory to place hot blocks together to improve instruction cache miss rates and make prefetching more effective. In conventional ISAs, a basic block ending in a conditional branch has two possible successors. If the branch is usually taken, the fall-through code may be relatively cold. Moving the fall-through code out of line introduces an extra unconditional branch at its end to return after execution. In SecureRISC, BB descriptors fall through to subsequent descriptors, but each has a pointer to the instructions to fetch. The instruction blocks of a bage could simply be sorted by frequency, placing the hottest first and the coldest last, or some similar arrangement^*, all without introducing new instructions or changing anything other the pointers in the descriptors.
* For example, one might want to place hot code differently on even and odd pages to reduce set conflicts.
This freedom to place instructions separately from BB descriptor order allows for code sharing, thereby reducing instruction cache misses. One copy of identical instruction sequences can be stored and shared between descriptors within a given bage. This is most likely to occur for register restore in function epilogues, but it would be interesting to see how often it occurs elsewhere. This is more likely to occur if register save stack locations are standardized by the compiler, and if stack frames are deallocated without reference to the stack frame size (which is what is proposed for the ABI).
The size of the instruction stream is reduced slightly by removing branch targets, making the L1 instruction Cache (which does not contain BB descriptors) somewhat more effective, but this is likely insignificant. More significantly, it also reduces the penalty of having a larger L1 Instruction Cache with an additional cycle of latency, which if exploited would improve performance. The translation performed by the L1 Instruction TLB during the BB Descriptor Cache access also supplies the translation required for the L1 Instruction Cache. The translation time no longer affects the timing of the instruction cache. It is the latency of the BB descriptor cache that becomes critical to performance since this is the primary dependency in following the program control flow. Latency in the instruction cache affects the misprediction penalty but is not particularly timing critical.
The size of the BBDC is critical to performance, just as the size of the BTB in microarchitectures for other ISAs is crucial to performance. I expect the BBDC to be a similar size to the L1 Instruction Cache in other architectures, which can cache significantly more descriptors since it does not contain instructions. It would also have an improved hit rate because accesses are dense (BTBs are accessed with more random instruction addresses).

Capability Hardware Enhanced RISC Instructions (CHERI)

I started with the assumption that pointers are a single word, which are expanded based on the 8‑bit tag to a base and size when loaded into the doubleword (144‑bit) Address Registers (ARs). This enables automatic bounds checking. The effective address calculation uses the ARs base to check the offset/index value against the size. This supports programs oriented toward a[i] pointer usage, but not C++ *p++ pointer arithmetic (such arithmetic is possible in SecureRISC at the expense of bounds checking). In contrast, the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project started with the assumption that capability pointers are four words (including lower and upper bounds, the pointer itself, and permissions and object type), and invented a compression technique to get them down to two words. SecureRISC can support CHERI by using its 128‑bit AR load and store instructions to transfer capabilities to and from the 144‑bit ARs, and therefore be able to accommodate either singleword or doubleword pointers. Support for the CHERI bottom and top decoding, its permissions, and its additional instructions would be required. The CHERI tag bit is replaced with two SecureRISC reserved tag values (one tag value in word 0, another in word 1). I would expect languages such as Julia and Lisp would prefer singleword pointers, so supporting both singleword and doubleword pointers allows both to exist on the same processor depending on the instructions generated by the compiler.

Unlike CHERI, SecureRISC pointers have only a size and not bottom and top values encoded. As a result, SecureRISC’s bounds checking is more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so bounds checking is better suited to disciplined languages, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code on would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. Bounds checking is a better target for Rust, Swift, Julia, or Lisp. SecureRISC can use unsized pointers for C++, but using these would represent a less secure mode of operation. The supervisor would need to enable on a per process basis whether such C++ pointers can be used; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity. Instead, undisciplined languages (such as C++) are likely to either use CHERI-128 pointers or memory and pointer cliques for security.

SecureRISC and CHERI Variants

Tagged memory words are separable from other aspects of SecureRISC, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA—call it SemiSecureRISC—I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag, since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/XR/SR separation, with the ARs specialized for CHERI capabilities, and the XRs/SRs for non-capability data. My primary thought on the BlockCHERI is that the difference between a 65‑bit memory (73 bits with ECC) and 72‑bit memory (80 bits with ECC) may find that 7 extra bits may be put to good use.

One could imagine variants of SecureRISC that have only some of its features:

Name	Block ISA	Segmentation	Rings	Tags	CHERI	Word	Pointer
SecureRISC	✔	✔	✔	✔	✔	72	72/144
SemiSecureRISC	✔	✔	✔			64	64
BlockRISC	✔					64	64
BlockCHERI	✔	?	?		✔	65	130

As I indicated earlier, I don’t think that BlockRISC is sufficient to justify a new ISA. I am concentrating on the full package.

Open Aspects

I need to think more carefully about I/O in a SecureRISC system. Some I/O will be done in terms of byte streams transferred via DMA to/and from main memory (e.g., DRAM). Such I/O if directed to tagged main memory writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be for paging and so forth. It may be that a general facility for reading tagged memory including the tag as a stream of 8‑bit bytes could be provided along with a cryptographic signing, and for writing such a stream back with signature checking will be useful.

Ports onto the system interconnect fabric will have to have rights and permissions assigned by hypervisors, and perhaps hypervisor guests. This needs to be worked out.

Documentation Conventions

Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC is similarly Little Endian in its byte addressing.

Links to Wikipedia articles are followed with a icon. Links to documents in PDF format are followed by a icon. Linsk to off-site HTML content are followed by a icon. Links to the glossaries in this document are followed by a icon.

To augment English descriptions of things, SecureRISC uses notation that operates on bits. This notation is sketched out here, but it is still only a guide to the reader (i.e., it is not a complete formal specification language such as SAIL). Its advantage is brevity.

Register file access is denoted RF[fieldname], as for example in AR[a].
Assignment is designated with ← as in XR[d] ← XR[a] − XR[b].
Subscripts are used for bit-field extraction: AR[a]₆₃ is bit 63 of AR[a], and SR[c]_7..0 is the least significant 8 bits of SR[c].
Superscripts on single-bit quantities are used for bit replication: 1¹⁶ represents a bit string of sixteen ones, but confusingly I also use 2^N to mean the Nth power of two (2⁸ = 256)—my excuse is that 2 is not a single-bit value.
a∥b represents bit-string catenation.
Thus SR[a]_64−s..0 ∥ 0^s represents a left shift of SR[a] by s. Similarly, (SR[a]₆₃)^s ∥ SR[a]_63..s represents an arithmetic right shift of SR[a] by s.
Where necessary, operators are subscripted with a u for unsigned, s for signed, and us for unsigned-signed. Thus <_u is unsigned less than, and >>_s is signed arithmetic shift.
The ×_u and ×_s multiplication operators produce a bit-string product of width equal to the sum of the widths of the two operands (e.g., the unsigned product of two 64‑bit numbers is 128 bits).
The +_u, +_s, −_u, and −_s operators produce a bit-string one bit wider than the widest operand, but the unscripted +, −, and × operators produce a result of the same width as the widest operand. Finally, the +_p operator is used for tagged, bounds-checked addition of pointer+integer producing pointer.
The +_d, +_f, −_h, etc. operators represent floating-point addition in IEEE 754 binary64, binary32, and binary16 formats respectively, and similarly for subscripts on the −, ×, and ÷ operators. The subscript fmt is used when the floating-point format is specified in an instruction field.
a ⊕ b bitwise xor, or addition in a GF(2^N) polynomial field.
a ⊗ b is carryless multiplication producing a product of size(a)+size(b)-1 bits.
a ⊗ b mod 𝑥^N+…+1 is polynomial multiplication with GF(2) coefficients reduced to N bits modulo the specified polynomial.
However, in a matrix context, $u \otimes v$ is used as the outer product of vectors $u$ and $v$ .

Basic Terminology

Multics Terminology (Multicians may mostly skim)

For those familiar with Multics, the primary thing to know is that SecureRISC has up to 8 rings (0–7) and inverts the ring numbers so that ring 0 is the least privileged, and that each ring has its own address space, but ring-tagged pointers allow privileged rings to reference the address space of less privileged rings. The ring bracket rules are slightly modified from Multics in the case of pointers without ring tags. Segment sizes are powers of two.

Segment

Segments are the primary point of access control and sharing. They implement in the ISA an operating system abstraction: Virtual Memory Areas (VMAs)

, which are contiguous regions of virtual memory of virtual pages with the same attributes. Segments are used for heaps, stacks, and memory mapped files (e.g., code and databases). SecureRISC Segments may be paged or directly mapped (e.g., to an I/O device). Segments have power-of-two sizes that are used for bounds checking and for determining the depth of the page table walk when paging is used. (Multics segment lengths were not limited to powers of two, but arbitrary lengths significantly increase the size of Segment Descriptors, whereas only six bits are required for powers of two.) Segment sizes < 12 (4096 bytes) would not be supported in most microarchitectures, and the maximum segment size is 61 (2 EiB or exbibytes). Segment sizes > 48 (256 TiB or terabytes) require the Segment Descriptor Table (set by the supervisor) to have 2^size-48 entries with consistent values.
Segments enable several features not available in page table-only virtual memory systems. The segment descriptors allow for a larger set of permissions to be specified than would typically fit in a Page Table Entry (e.g., ring brackets described below). They also allow page tables for segments to be shared between different address spaces, even when different permissions are in effect. A typical application would use dozens of segments. Separate segments would be used for thread stacks and heaps (to avoid contention for allocation), and a separate segment would be used for each shared library. For some operating systems, reading and writing files would be done by mapping each file into a separate segment.
Segments and pages may also be used to implement generational garbage collection (GC) by extending Segment Descriptors, Page Table Entries (PTEs), and TLB entries to have a generation number so that stores of pointers from an old generation to a new one are noted in PTE bits in a fashion similar to the usual Dirty bits. This allows pages that are about to be swapped out to be scanned for pointers to newer generations and noted, so that these pages need not be swapped in during GC. (SecureRISC may need a way to disable this feature in PTEs to provide more bits to software.)

Ring

Rings provide Read, Write, Execute, and Gate permissions on a nested basis for different layers of privilege. Many recent and simpler architectures provide only two layers of privilege (typically called User and Supervisor), with Read (R), Write (W), and Execute (X) permission bits, and a fourth bit (e.g., the RISC‑V U bit in PTEs) determining whether these permissions apply to both privilege levels or only to the most privileged, with access denied to the least privileged. Rings are the older generalization, where separate Read, Write, and Execute permissions are specified for multiple privilege levels (typically 4 or 8), in a nested fashion that takes only a few more bits. Compare RISC‑V’s 4 bits (RWXU) to 9 bits for four rings (RWX plus three 2‑bit ring brackets) and 12 bits for eight rings (RWX three 3‑bit ring brackets). In addition to Execute permission, rings allow Multics and SecureRISC to implement Gate permission for privilege transitions on procedure calls to and returns from gates.
Rings enforce a layering on what may be read or written by code on a per-segment basis, with ring 0 being the least privileged and ring 7 being the most privileged. (SecureRISC inverts the ring number-to-privilege mapping chosen by Multics. This allows less privileged code to be unaware of higher privilege levels, and the number of rings supported by an implementation may vary: some implementations might have fewer than 8 rings.) Less privileged rings may also call gates in more privileged rings to request services from those rings. Ring numbers are stored in some pointer tags so that pointer parameters passed to more privileged rings result in access to virtual memory using the access rights of the caller, not the rights of the more privileged ring. Loading a pointer from memory sets the ring field to the minimum of the current ring of execution, the ring number in the base register of the load, and the ring number stored in memory (if any).
The RINGA instruction is used in gates on pointers passed in registers to apply this minimum calculation using the ring number of the caller.
The number of rings could be reduced from 8 to 4 or even just 2 in some implementations, though the savings from this would be minimal. Perhaps in a four-ring system, ring 2 would be for the operating system, ring 1 for user code, and ring 0 for sandboxed user code.
Many recent ISAs can be thought of as having only two rings, with ring permissions being allow or deny (again, the RISC‑V U bit in PTEs is one example).
Michael D. Schroeder’s Ph.D. thesis, MAC TR-104, Cooperation of Mutually Suspicious Subsystems in a Computer Utility, September 1972

presented a generalization of rings to domains where permissions were specified without nesting. This is straightforward until the procedure for evaluating permissions of reference parameters using the privilege of the calling domain is attempted. SecureRISC does not attempt to generalize rings to domains due to this complexity. SecureRISC does support encrypted main memory, which potentially allows data protection from more privileged rings, but without a mechanism for decrypting this data when passed by reference. This approach requires further development.
In SecureRISC, each ring potentially has its own address space. The ring number in unsized pointers (tags 192–215) is used to allow a privileged ring to load or store from a less privileged ring. Accesses to more privileged rings cause an exception without going through the translation process, which prevents less privileged rings from even executing timing probes of the address space of more privileged rings. When rings share address spaces (i.e., share a Segment Descriptor Table (SDT) and TCAT), the Segment Table Size (STS) field of sdtp[ring] may be used to limit the access of the less privileged ring to segments of the privileged ring by specifying a smaller subset of the SDT. The STS field check should be implemented in constant time to prevent timing probes of more privileged rings. Ring brackets will also prevent less privileged rings from accessing more privileged segments, but these ring brackets are only known after the Segment Descriptor Entry (SDE) is read from the Segment Descriptor Table (SDT) after a TLB miss.

Ring brackets

Each segment has three 3‑bit ring numbers—R1, R2, and R3—stored in Segment Descriptor Table Entries (SDEs) and used for bracketing accesses by ring of execution in addition to the Read, Write, Execute permissions from the segment descriptor table. To reiterate, SecureRISC inverts the ring number to privilege mapping chosen by Multics: ring 7 is the most privileged and ring 0 the least privileged. Typically, R3≤R2≤R1. These numbers define four brackets as in Multics for accesses with a ring tagged pointer (tags 192..215):

[R1,7]	write bracket	writedeny ← earing < R1
[R2,7]	read bracket	readdeny ← earing < R2
[R2,R1]	execute bracket	executedeny ← earing < R2 \| earing > R1
[R3,R1−1]	gate bracket	gatedeny ← earing < R3 \| earing ≥ R1

The execute bracket denies execute permission when a less privileged ring can write the segment.
For loads and stores with base address registers without ring tags, the rules are modified from the Multics rules to detect bugs:

[R1,R1]	write bracket	writedeny ← PC.ring ≠ R1
[R2,R1]	read bracket	readdeny ← PC.ring < R2 \| PC.ring > R1

The ring level of an access is PC.ring for Basic Block Descriptor (BBD) and instruction fetch. Execution is permitted when R2 ≤ PC.ring ≤ R1, and calls to gates when R3 ≤ PC.ring < R1^*. The ring level of an access is also PC.ring for loads and stores when the base register AR[a] is not ring tagged (AR[a]_71..64 not in 192..215). In this case, writes are permitted when PC.ring = R1 and reads are permitted when R2 ≤ PC.ring ≤ R1^†. For loads and stores when AR[a] is ring tagged (i.e., AR[a]_71..64 is in 192..215), an exception occurs when a less privileged ring attempts to reference a higher privilege ring (PC.ring < AR[a]_66..64), and otherwise writes are permitted when AR[a]_66..64 ≥ R1 and reads are permitted when AR[a]_66..64 ≥ R2.
† This further restricts the Multics ring brackets rules, which require only that R2 ≤ PC.ring. This change prevents reading from segments writeable by less privileged rings (i.e., R1 < PC.ring) unless such reads use pointers with ring tags (192..215). A reference using a non-ring pointer would cause an exception, making it difficult for rings to accidentally use untrusted data (a partial Biba integrity

enforcement). This might occur when privilege levels share an address space and would detect bugs making such references.
* The ring number of the caller and the ring brackets of the target segment are used to calculate the new ring number of execution, as per the Multics documentation

modified for the inverted ring order:

If the caller’s ring is within the execute bracket [R2:R1], execution will continue in the same ring as the caller.
If the caller’s ring is within the gate bracket [R3:R2−1] and the target Basic Block descriptor has the gate bit set, the process will switch to ring R2 before executing the target, increasing privilege.
If the caller’s ring is above the execute bracket (> R1), the process will switch to ring R1 before executing the target, decreasing privilege.
If the caller’s ring is above the gate bracket (< R3) or the target Basic Block Descriptor does not have the gate bit set, an access fault exception occurs.

To illustrate the utility of rings, the following example shows how all 8 rings might be used. Indeed, if there were one more ring available, it might be used for the user-mode dynamic linker, so that links are readable by applications, but not writeable.

Example Ring Brackets
What	R1,R2,R3	seg RWX	R b	W b	X b	G b	Ring 0	Ring 1	Ring 2	Ring 3	Ring 4	Ring 5	Ring 6
User code	`2,2,2`	`R-X`	`[2,7]`	`-`	`[2,2]`	`-`	`----`	`----`	`R-X-`	`R---`	`R---`	`R---`	`R---`
User execute only	`2,2,2`	`--X`	`-`	`-`	`[2,2]`	`-`	`----`	`----`	`--X-`	`----`	`----`	`----`	`----`
User stack or heap	`2,2,2`	`RW-`	`[2,7]`	`[2,7]`	`-`	`-`	`----`	`----`	`RW--`	`RW--`	`RW--`	`RW--`	`RW--`
User read-only file	`2,2,2`	`R--`	`[2,7]`	`-`	`-`	`-`	`----`	`----`	`R---`	`R---`	`R---`	`R---`	`R---`
User return stack	`4,2,4`	`RW-`	`[2,7]`	`[4,7]`	`-`	`-`	`----`	`----`	`R---`	`R---`	`RW--`	`RW--`	`RW--`
Compiler library	`7,0,0`	`R-X`	`[0,7]`	`-`	`[0,7]`	`-`	`R-X-`	`R-X-`	`R-X-`	`R-X-`	`R-X-`	`R-X-`	`R-X-`
Super driver code	`3,3,3`	`R-X`	`[3,7]`	`-`	`[3,3]`	`-`	`----`	`----`	`----`	`R-X-`	`R---`	`R---`	`R---`
Super driver data	`3,3,3`	`RW-`	`[3,7]`	`[3,7]`	`-`	`-`	`----`	`----`	`----`	`RW--`	`RW--`	`RW--`	`RW--`
Super code	`4,4,4`	`R-X`	`[4,7]`	`-`	`[4,4]`	`-`	`----`	`----`	`----`	`----`	`R-X-`	`R---`	`R---`
Super gates for user	`4,4,2`	`R-X`	`[4,7]`	`-`	`[4,4]`	`[2,3]`	`----`	`----`	`---G`	`---G`	`R-X-`	`R---`	`R---`
Super heap or stack	`4,4,4`	`RW-`	`[4,7]`	`[4,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`RW--`	`RW--`	`RW--`
Super return stack	`6,4,6`	`RW-`	`[4,7]`	`[6,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`R---`	`R---`	`RW--`
Hyper driver code	`5,5,5`	`R-X`	`[5,7]`	`-`	`[5,5]`	`-`	`----`	`----`	`----`	`----`	`----`	`R-X-`	`R---`
Hyper driver data	`5,5,5`	`RW-`	`[5,7]`	`[5,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`----`	`RW--`	`RW--`
Hyper code	`6,6,6`	`R-X`	`[6,7]`	`-`	`[6,6]`	`-`	`----`	`----`	`----`	`----`	`----`	`----`	`R-X-`
Hyper heap or stack	`6,6,6`	`RW-`	`[6,7]`	`[6,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`----`	`----`	`RW--`
Hyper return stack	`6,6,6`	`RW-`	`[6,7]`	`[6,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`----`	`----`	`RW--`
Hyper gates for supervisor	`6,6,4`	`R-X`	`[6,7]`	`-`	`[6,6]`	`[4,5]`	`----`	`----`	`----`	`----`	`---G`	`---G`	`R-X-`
TEE code	`7,7,7`	`R-X`	`[7,7]`	`-`	`[7,7]`	`-`	`----`	`----`	`----`	`----`	`----`	`----`	`----`
TEE data	`7,7,7`	`RW-`	`[7,7]`	`[7,7]`	`-`	`-`	`----`	`----`	`----`	`----`	`----`	`----`	`----`
Sandboxed JIT code	`1,0,0`	`RWX`	`[0,7]`	`[1,7]`	`[0,1]`	`-`	`R-X-`	`RWX-`	`RW--`	`RW--`	`RW--`	`RW--`	`RW--`
Sandboxed JIT stack or heap	`0,0,0`	`RW-`	`[0,7]`	`[0,7]`	`-`	`-`	`RW--`	`RW--`	`RW--`	`RW--`	`RW--`	`RW--`	`RW--`
Sandboxed non-JIT code	`1,1,1`	`R-X`	`[1,7]`	`-`	`[1,1]`	`-`	`----`	`R-X-`	`R---`	`R---`	`R---`	`R---`	`R---`
User gates for sandboxes	`2,2,0`	`R-X`	`[2,7]`	`-`	`[2,2]`	`[0,1]`	`---G`	`---G`	`R-X-`	`R---`	`R---`	`R---`	`R---`

Gate

Gates are the entry points into more privileged rings from less privileged rings and are marked as such by a bit in basic block descriptors. Less privileged rings may call directly to more privileged rings without employing the exception mechanism (not employing exceptions is a performance advantage). When the target segment does not allow execution to the current ring (i.e., the current ring is less than the target segment’s R2), but does allow gate calls (i.e., the current ring is in the target segment’s [R3:R2−1]), a ring transition takes place. Only basic block descriptors marked as gates (as indicated in the descriptor) may be used for such transfers. Gates are responsible for stack switching, validating the ring numbers of pointer arguments passed in registers, and clearing non-preserved registers before return.

Discretionary Access Control

The operating system maintains an Access Control List (ACL) for objects, such as files. When files are mapped into a user address space, this ACL is mapped to permissions in the Segment Descriptor for that user. Those permissions are Read (R), Write (W), Execute (X), Pointer (P), and Capability (C) permissions.

Mandatory Access Control

Mandatory Access Control

(aka Non-Discretionary Access Control) adds a second test that must be satisfied in addition to ACLs

. The primary motivation for mandatory access control is the Bell-LaPadula security model

, which Multics implemented in 1973 to satisfy the old Orange Book

requirements for multilevel security

and the U.S. government classification system

. In addition to that purpose, it can protect against Trojan attacks and enforce separation between multiple hypervisors and supervisors on a system, which is one reason it remains relevant, despite the existence of newer standards. The old Orange Book called for implementing both levels representing hierarchical classifications and categories for non-hierarchical compartments. Following Bell-LaPadula (with the strong star property), read access is granted when
DataLevel ≤ ProcessLevel ∧ DataCategories ⊂ ProcessCategories,
and write access is granted when
DataLevel = ProcessLevel ∧ DataCategories = ProcessCategories.
(Note: x⊂y is implemented in logic on bit-strings x and y as (x&~y)=0.)
This could be simplified to just categories, with N levels encoded with 2^N−1 category bits. Access would then be the simpler tests, with read granted to data when DataCategories ⊂ ProcessCategories, and write when DataCategories = ProcessCategories.
An argument against using 2^N−1 category bits to encode N levels is the number of bits required, e.g., if these bits are stored in a TLB. In that case, using two mechanisms rather than one may be worth the complexity. For example, a classification system with six levels might use 3 bits for a binary level or 5 bits for a set representation (00000, 00001, 00011, 00111, 01111, 11111), a savings of 2 bits. For three levels, the savings is only 1 bit. Because SecureRISC has more uses for mandatory access control than just government classification systems, it has chosen to implement both to maximize the utility of the fixed 16 bits provided.
SecureRISC generalizes mandatory access tests to allow multiple parallel tests of levels and categories for orthogonal security considerations (the failure of any test denies access). For example, a Trusted Execution Environment (TEE) category might be independent of Secret/TopSecret levels. For example, ARM’s TrustZone has two levels (Non-Secure and Secure) for protecting the TEE, but does this in a hardwired way. Implementing such a partition is just one possibility of a generalized levels and categories mechanism. In particular, orthogonal tests are implemented with masks that allow ring 7 to divide the 16 bits into 0–4 levels and 0–4 category sets. This is described further with the AccessLevels and AccessCats XCSRs.
Using the complement of the levels and categories with this mechanism almost implements the Biba integrity model

(This is not perfect because Biba allows write down, but the inverse of strong star is still write same.)
SecureRISC systems may implement mandatory access control in memory controllers, e.g., storing a 16 bits for regions of memory. For example, storing 16 bits for every 256 KiB of memory would require 4 MiB for each 512 GiB DIMM, which might even be feasible in SRAM to not compromise access time (less than 1 mm² in 2 nm nodes). Adding a 16‑bit identifier might enforce virtual machine separation. Increasing the size or making this table finer grain would probably require moving into the DRAM, and perhaps caching it in SRAM in the memory controller.

Address Terminology

SecureRISC implements two levels of address translation, as in processors with hypervisor support and virtualization, but uses new terminology for the process because physical address is somewhat ambiguous in a two-level translation. Programs operate using local virtual addresses. These addresses are translated to a system virtual address in a mapping specified by guest operating systems. The guest operating systems consider system virtual addresses as representing physical memory, but actually these addresses are translated again by a system-wide mapping specified by hypervisors to system interconnect addresses that are used in the routing of accesses in the system fabric. All ports on the system interconnect translate system virtual addresses to system interconnection addresses in local Translation Caches (TLBs) at the boundary into the system interconnect. This allows guest operating systems to transmit system virtual addresses directly to I/O devices, which may transfer data to or from these addresses, employing the system-wide translation at the port boundary.

Making the svaddr → siaddr translation system-wide is a somewhat radical simplification compared to other virtualization systems. Whether SecureRISC retains this simplification or adopts a more traditional second level translation is open at this point, but my intent is to see if the simplification can suffice. A system-wide mapping means that hypervisors must give each supervisor unique system virtual addresses for its memory and I/O, and the supervisors must be prevented from referencing the system virtual addresses of the other supervisors via the protection mechanism. This requires that supervisors must not expect memory and I/O in fixed locations. The advantage of a single mapping is that a single 64‑bit svaddr is all that is required when communicating with I/O devices, rather than two 64‑bit addresses (i.e., a page table address and the address within the address space specified by the page table).

The following elaborates on the above:

Local Virtual Address

This is the 64‑bit address generated for Basic Block descriptor fetches, Instruction fetches, and load and store instructions based on address arithmetic on the address and index register files. Local Virtual Addresses are translated in by the first level translation mechanism starting with the Segment Descriptor Tables specified by the sdtp registers. The result of this translation is a System Virtual Address. This first-level translation is usually specified by the guest operating system supervisor. Local Virtual Addresses are sometimes abbreviated to lvaddr.

Local Virtual Address
63	48	47			0
SEG		fill	VPN	byte
16		48−ssize	ssize−PS	PS

where ssize is the segment size, PS is the page size given by the segment mapping, and fill is all 0s for upward-growing segments and all 1s for downward-growing segments.
The segment number size of 16 bits was chosen to limit the offset to 48 bits, which keeps simplistic operating system page tables (one page at each stage) to four stages with 4 KiB pages. If all operating systems for SecureRISC are likely to take advantage of flexible page table sizes SecureRISC might reduce the segment number to 14 bits.
To exploit the full potential of what SecureRISC offers, an operating system needs to be able to allocate at least two different power of two memory sizes (e.g., some size larger than 4 KiB) efficiently, but that was done in commercially successful operating system in the 1990s

, and is supported today in Linux (using buddy memory allocation

). A discussion of page sizes is included below in Page Size Issues.
As will be explained later, lvaddrs have are associated with a VMID and ASID. The tuple (VMID, ASID, LVADDR) is a system-wide virtual address, which is useful for DMA.

System-wide Virtual Addresses for I/O
127	96	95	64	63	0
VMID		ASID		lvaddr
32		32		64

System Virtual Address

These 64‑bit addresses are used by processors and I/O devices when interfacing to the System Interconnect. Initiator ports on the System Interconnect translate (and check) these addresses to System Interconnect Addresses in Initiator TLBs based on the system Region Descriptor Table. This second-level translation is usually specified by hypervisors. System Virtual Addresses are sometimes abbreviated to svaddr.
Note: SecureRISC is undergoing some evolution with regard to svaddrs. It may be that I/O devices will be programmed with lvaddrs instead, capturing the current VMID and ASID as part of the vaddr write transaction. This would allow I/O devices to be programmed by application ring processes if the page fault issue can be addressed. Separately, there is a question whether second-level address translation can be moved into ring 7, with supervisors calling hypervisor rings to set up first-level translation, and hypervisors calling ring 7 to set up second-level translation, all with software algorithms instead of two-level translation tables. Thus svaddrs are TBD.
It would be desirable to support >64‑bit svaddrs, but there is limited room in Page Table Entries with a 4 KiB page size. If SecureRISC were to raise the minimum page size, I think this should be increased a little.

System Virtual Address
63	48	47	0
region		byte address
16		48

System Interconnect

The system-specific logic with multiple ports that allows these ports to communicate with each other. It may be implemented with a bus, cross-bar, ring, 2D or 3D mesh, HexaMesh

, Diametrical Mesh

(mesh with wormhole routes), dragonfly network, hypercube, or still other mechanisms. Ports on the System Interconnect may be either Initiators or Responders or both.

System Interconnect Address

The address used for routing data transfers on the system interconnect. The width of the system interconnect address is system dependent. System interconnect addresses are the result of translating System Virtual Addresses using the system-wide second-level translation specified by the Region Descriptor Table (RDT). SecureRISC currently defines the maximum width of these addresses as 64 bits (e.g., by the width of fields in various translation structures). Some systems may have a smaller System Interconnect Address width, and would then generate an exception when these addresses exceed the maximum value supported. System Interconnect Addresses are usually abbreviated to siaddr.

System Interconnect Address
63		0
byteaddress
64

This document does not attempt to define the format of System Interconnect Addresses (siaddrs). That is left to the system designers. However, just to illustrate one possibility, what follows is an example of how a hypothetical system might interpret an siaddr.

Example System Interconnect Address
63	50	49	6	5	3	2	0
port		block		word		byte
14		44		3		3

In addition to the System Interconnect Address (siaddr) for reads and writes, SecureRISC processors would typically provide a 32‑bit Virtual Machine Identifier (VMID), 32‑bit Address Space Identifier (ASID), 8‑bit Quality of Service (QoS), and 16‑bit Mandatory Access Control (MAC) Set values as part of transactions. QoS values are used for prioritization. MAC values are used for additional permission testing. VMID and ASID values are used to interpret lvaddrs provided for DMA.

Tagged Pointer Terminology

Word

Memory Word
71	64	63	0
tag		data
8		64

Words are 72 bits in memory with 64 bits of data, 8 bits of tag, and addresses that are multiples of 8 (i.e., aligned). SecureRISC supports using the tag portion of words in two ways, one appropriate to languages with strong bounds checking and garbage collection, and one for less disciplined languages (e.g., C++).
In the first tag usage, the tag is primarily used as a type, and by distinguishing pointers from non-pointers, facilitates garbage collection, and dynamic typing. When the type indicates a pointer, it may sometimes specify the size of the memory addressed by the pointer. Non-pointer tag values (≥240) are reserved for 64 bits of data contained directly in the word, rather than what the word points to. This allows dynamically typed languages such as Lisp to have 64‑bit integer and 64‑bit floating-point data as objects without allocating memory to contain it. Tags <240 represent pointers.
In the second tag usage, the tag is primarily used for checking for accesses beyond allocated memory or after explicit deallocation. This works probabilistically by setting the tags of allocated memory to a new 8‑bit value (0..231), tagging pointers to this allocation with this value, and checking on reads and writes that the pointer value matches the memory value. A bit in the segment descriptor word indicates whether tags are used in this way, and this bit must match the opcodes used to load and store to the segment. SecureRISC borrows the word clique to refer to this usage of tags; the clique of memory and pointers must match on access. Cliqued pointers in memory use the tag to represent the allocation containing the pointer, and so different bits must be used to specify the pointer clique, reducing the address space size by 8 bits for such pointers (making only 256 segments addressable). SecureRISC has CLA64 and CSA64 instructions that decode cliqued pointers on load and encode them on store. Cliqued pointers do not need to be word aligned in memory. When a load or store instruction checks memory tags (i.e., when the AR base register memtag field is not 251), if the address is not word aligned and the access crosses a word boundary, then all accessed word tags must match.
The CLA64 instructions supply the tag to write to AR[d] (typically 222, 240, or 244). The decode and encode allow the format in address registers (ARs) to be compatible with non-cliqued address usage, by setting the AR tag to newtag, moving ac to bits 143..136, setting the ring field to PC.ring, and setting the size field to the segment size. The memory forms of data and pointers are as shown below:

Cliqued Non-Pointer Stored in Memory accessed with {L,S}{X,S}n{U,S} etc.
71	64	63	0
mc		data
8		64

Cliqued Pointer Stored in Memory accessed with CLA64/CSA64 etc.
71	64	63	56	55	0
mc		ac		address
8		8		56

Field	Width	Bits	Description
address	56	55:0	Byte address
ac	8	63:56	Clique of addressed memory (0..231)
mc	8	71:64	Clique assigned by allocator to memory containing the pointer (0..231)

The CLA64 transformation is as follows:
t ← lvload64cliquecheck(ea, AR[a]_143..136)
AR[d] ← t_63..56 ∥ PC.ring ∥ segsize(ea) ∥ newtag ∥ 0⁸ ∥ t_55..0
There are many ways this mechanism might be used, but one way a slab allocator might work is by setting the tags of each N words of the slab to incrementing values mod 232, and then incrementing the tags in the words of a freed block by 16 or 32 mod 232.

Doubleword

A doubleword is two words of course. Doublewords are stored at addresses that are multiples of 16 (16‑byte aligned in memory).

Sub-word

SecureRISC avoids the terms half-word, quarter-word, etc. because words have tags and there is no such thing as half or a quarter of tag. SecureRISC does have 8‑bit, 16‑bit, 32‑bit, 64‑bit, and 128‑bit load and store instructions, and these instructions support misaligned accesses. Collectively these accesses are called sub-word loads and stores because they extract data from the data portion of one to three memory words. Sub-word loads tag the resulting data with an integer tag, and typed sub-word stores change the memory tag to integer.

Clique

A clique is the grouping of memory and pointers that access that memory into one of 232 cliques to detect errors in one of the two uses of tags. Memory allocation assigns the words one of the 232 cliques distinct from adjacent memory, and distinct from previous uses, and returns pointers that include this value, which is checked on accesses through the pointer. Thus, if pointer arithmetic moves the pointer outside of the allocation accesses are likely to fail. When memory is freed, the tags are changed so that old pointers will no longer match. With strong bounds checking and garbage collection, cliques are not necessary, and the tag is not used in this way.

Decoded Pointer

Pointers live in memory and in XRs/SRs/VRs in memory format, undecoded. When loaded into ARs however, some decoding is done to prepare the pointer for use for calculating effective addresses for loads and stores. The decoded form is 144 bits. The decoding depends on the instruction used to load the AR. Storing an AR encodes it back to memory format, depending on the store opcode used. There are also instructions for saving and restoring the full decoded form of ARs to memory. The four forms of AR loads and stores are then for typed/sized pointers (e.g., LA or SA), cliqued pointers (e.g., CLA64 or CSA64), CHERI pointers (e.g., LAC or SAC), and decoded pointers (e.g., LAD or SAD). The 144‑bit AR format is shown below (note that CHERI internal format is shown in CHERI Capabilities):

AR bits 71..0
71	64	63	0
type		data
8		64

AR bits 143..72
71	64	63	61	60	0
memtag		ring		size
8		3		61

The memtag field is set to 251 for word loads (e.g., LA), but is set to the expected memory tag on cliqued loads (e.g., CLA64) by copying from bits 63..56 (which are then cleared). For doubleword loads (e.g., LAD or LAC) it is set from the 144‑bit memory read, but LAC traps if it is not 251. The ring field is set from bits 66..64 for tags 192..199 and 208..215, and to PC.ring otherwise.

Tagged Non-pointer Data

64‑bit integer data
71	64	63	0
240		integer
8		64

64‑bit unsigned integer data
71	64	63	0
241		unsigned integer
8		64

<64‑bit integer data
71	64	63	48	47	0
242		type		integer
8		16		48

IEEE-754 64‑bit floating-point data
71	64	63	0
244		float64
8		64

<64‑bit floating point data
71	64	63	48	47	0
245		type		small float
8		16		48

Null/Nil pointer

Null Pointer
71	64	63	0
0		0
8		64

A pointer to 0-length data (tag 0) and address 0 is used as a null pointer. Any reference through this pointer causes an exception. There are BEQN and BNEN 16‑bit instructions for branching on null pointers since this is so common. Other uses of the data field are reserved.

Sized Pointers

Sized pointers encode the size of the addressed memory in the tag for small sizes. This size is decoded into a full size by the LA instruction, and this size is used by bounds checking loads and stores that use the decoded form as a base address. Small sizes support many common cases, such as the pointers returned by memory allocators (which typically increase allocation size to prevent fragmentation, e.g., as in slab allocators). Array slices would be rounded up to the next supported size. When bounds checking is called for on things where the size cannot be encoded in the tag, the WSIZE may be used to set the decoded from an XR. These pointers lack a ring number.
Adding to an AR decreases its size field by the increment.
Sized pointers may be created by using the LIMIT instruction using a base and new size, where the new size is bounds checked against base.size.
The tag specifies the size in an unsigned floating-point format with a 4‑bit exponent, 4‑bit significand with a hidden bit for non-zero exponents, and is expanded to a 61‑bit size as follows:
e ← tag_7..4
size ← e = 0 ? 0⁵⁴∥tag_3..0∥0³ : 0^53−e∥1∥tag_3..0∥0^2+e

Sized Pointer with size encoded in tag
71	70	64	63	48	47		0
0	SS		segment		fill	byte address in segment
1	7		16		48−SEGSIZE	SEGSIZE

Small Size Encoding SS
tag	SS	Size in Words								G
tag	2:0 6:3	0	1	2	3	4	5	6	7	G
0..7	0	0	1	2	3	4	5	6	7	1
8..15	1	8	9	10	11	12	13	14	15	1
16..23	2	16	18	20	22	24	26	28	30	2
24..31	3	32	36	40	44	48	52	56	60	4
32..39	4	64	72	80	88	96	104	112	120	8
40..47	5	128	144	160	176	192	208	224	240	16
48..55	6	256	288	320	352	384	416	448	480	32
56..63	7	512	576	640	704	768	832	896	960	64
64..71	8	1024	1152	1280	1408	1536	1664	1792	1920	128
72..79	9	2048	2304	2560	2816	3072	3328	3584	3840	256
80..87	10	4096	4608	5120	5632	6144	6656	7168	7680	512
88..95	11	8192	9216	10240	11264	12288	13312	14336	15360	1024
96..103	12	16384	18432	20480	22528	24576	26624	28672	30720	2048
104..111	13	32768	36864	40960	45056	49152	53248	57344	61440	4096
112..119	14	65536	73728	81920	90112	98304	106496	114688	122880	8192
120..127	15	131072	147456	163840	180224	196608	212992	229376	245760	16384

Possible Size Extension to Tags 128..191
tag	SS	Size in Words								G
tag	2:0 7:3	0	1	2	3	4	5	6	7	G
128..135	16	2¹⁸	2¹⁸+2¹⁵	2¹⁸+2×2¹⁵	2¹⁸+3×2¹⁵	2¹⁸+4×2¹⁵	2¹⁸+5×2¹⁵	2¹⁸+6×2¹⁵	2¹⁸+7×2¹⁵	2¹⁵
136..143	17	2¹⁹	2¹⁹+2¹⁶	2¹⁹+2×2¹⁶	2¹⁹+3×2¹⁶	2¹⁹+4×2¹⁶	2¹⁹+5×2¹⁶	2¹⁹+6×2¹⁶	2¹⁹+7×2¹⁶	2¹⁶
144..151	18	2²⁰	2²⁰+2¹⁷	2²⁰+2×2¹⁷	2²⁰+3×2¹⁷	2²⁰+4×2¹⁷	2²⁰+5×2¹⁷	2²⁰+6×2¹⁷	2²⁰+7×2¹⁷	2¹⁷
152..159	19	2²¹	2²¹+2¹⁸	2²¹+2×2¹⁸	2²¹+3×2¹⁸	2²¹+4×2¹⁸	2²¹+5×2¹⁸	2²¹+6×2¹⁸	2²¹+7×2¹⁸	2¹⁸
160..167	20	2²²	2²²+2¹⁹	2²²+2×2¹⁹	2²²+3×2¹⁹	2²²+4×2¹⁹	2²²+5×2¹⁹	2²²+6×2¹⁹	2²²+7×2¹⁹	2¹⁹
168..175	21	2²³	2²³+2²⁰	2²³+2×2²⁰	2²³+3×2²⁰	2²³+4×2²⁰	2²³+5×2²⁰	2²³+6×2²⁰	2²³+7×2²⁰	2²⁰
176..183	22	2²⁴	2²⁴+2²¹	2²⁴+2×2²¹	2²⁴+3×2²¹	2²⁴+4×2²¹	2²⁴+5×2²¹	2²⁴+6×2²¹	2²⁴+7×2²¹	2²¹
184..191	23	2²⁵	2²⁵+2²²	2²⁵+2×2²²	2²⁵+3×2²²	2²⁵+4×2²²	2²⁵+5×2²²	2²⁵+6×2²²	2²⁵+7×2²²	2²²

Alternate Use for Read-Only Sized Pointers Using Tags 128..191
tag	SS	Size in Words								G
tag	2:0 7:3	0	1	2	3	4	5	6	7	G
128..135	16	0	1	2	3	4	5	6	7	1
136..143	17	8	9	10	11	12	13	14	15	1
144..151	18	16	18	20	22	24	26	28	30	2
152..159	19	32	36	40	44	48	52	56	60	4
160..167	20	64	72	80	88	96	104	112	120	8
168..175	21	128	144	160	176	192	208	224	240	16
176..183	22	256	288	320	352	384	416	448	480	32
184..191	23	512	576	640	704	768	832	896	960	64

The maximum size supported for 128 tags is 245,750 words or 1.875 MiB; with the possible extension to 176 tags is 15,728,640 words or 120 MiB; and with 192 tags is 62,914,560 words or 48 MiB. The maximum allocation overhead with this encoding is 12.5% (e.g., 131073 words rounding up to 147456). With fewer exponent bits, this could be reduced to 6.2% (e.g., 16385 words rounded up to 17408) at the cost of reduced range (31744 words or 136 KiB for tag 191). Beyond the maximum size, or when a memory ring number is required, unsized pointers must be used. However, an alternative use of tags 128..192 might be to implement read-only pointers (e.g., for parameters passed by reference).

Pointers to blocks with header/trailer sizes

Pointer with size at virtual address − 8
71	64	63	4	3	0
221		doubleword address		0
8		60		4

Providing a special tag for headers and trailers of allocated blocks allows a backward scan to find the start of the block. This may be useful in some applications.
There may be associated words before a header word that give additional information. If these exist, they are called leader words.

Size word stored at pointer − 8
71	64	63	4	3	0
250		doubleword count		0
8		60		4

Size word stored at pointer + size
71	64	63	4	3	0
250		− doubleword count		0
8		60		4

Unsized Pointers

Unsized Pointer
71	67	66	64	63	48	47	0
24		ring		segment		byte address
5		3		16		48

The only size check for unsized pointers comes from the segment size stored in the Segment Descriptor Word as cached in the TLB.
Unsized pointers are used for memory regions too large to have the size encoded in the tag and for pointers that have a ring number less than PC.ring. They may also be used for undisciplined language (e.g., C++) pointers to disable checking, but this is an insecure mode of operation, and cliqued pointers are preferred in such cases.
Perhaps rename these to ringed pointers?

Code Pointers

Code pointers are used for function calls and returns, and for implementing switch statements. CHERI capabilities may also be used as code pointers. Calls and jumps using pointers without tags in the range 208..215 or 232 trap.
Code pointers must address words with tag 252 (or tag 253 when that is defined).
The following diagram shows the low two bits as 0, looking forward to when compressed BBDs with tag 253 are implemented. Pointers to tag 252 basic block descriptors would always have bit 2 zero.

Pointer to Basic Block Descriptor
71	67	66	64	63	2	1	0
26		ring		BB descriptor address		0
5		3		62		2

Segment Relative Pointers

Segment relative pointers allow segments to contain address-space-independent pointers to locations within the segment. For example, a database could be mapped to different addresses in different address spaces, but still contain pointers to other data in the segment. There is no ring field in these pointers. The RLA64, RLA32, RLA64I, and RLA32I instructions load these pointers and convert to a sized pointer using the segment size. These instructions fill in the ring field with the current ring of execution (PC.ring). The RSA64, RSA32, RSA64I, and RSA32I instructions store pointers and convert to this format, checking that the segment number matches the segment number of the store address register and that the ring number is equal to the current ring of execution.

Segment Relative Pointers
71	64	63	61	60	0
223		0		offset
8		3		61

CHERI Capabilities

CHERI capabilities are stored in memory doublewords and may be loaded into ARs with the LAC instruction and stored with the SAC instruction. Word 1 of a CHERI capability is given a special tag. The word 0 and 1 tag values of CHERI capabilities may only be created by ring 7 and by CHERI instructions that derive from other CHERI capabilities.

Word 0 of CHERI capability
71	64	63	0
232		Local virtual address
8		64

Word 1 of CHERI capability
71	64	63	61	60	57	56	53	52	47	46	28	27	26	25	17	16	14	13	3	2	0
251		R		0		SDP		AP		0		S	F	T		TE		B		BE
8		3		4		4		6		19		1	1	9		3		11		3

The following gives an overview of the above. See CHERI Concentrate Section 6 for details, except for the ring number field, which is SecureRISC specific.

Fields of CHERI Capability Word 1
Field	Width	Bits	Description
BE	3	2:0	Bottom bits 2:0 or Exponent bits 2:0
B	11	13:3	Bottom bits 13:3
TE	3	16:14	Top bits 2:0 or Exponent bits 5:3
T	9	25:17	Top bits 11:3
F	1	26	Exponent format flag indicating the encoding for T, B and E: The exponent is stored in T and B if EF=0, so it is internal The exponent is zero if EF=1
S	1	27	Sealed
AP	6	52:47	Architectural permissions
SDP	4	56:53	Software defined permissions
R	3	63:61	Ring number (SecureRISC specific)
251	8	71:64	Tag for CHERI Word 1

The interpretation of the above fields is approximately (the CHERI Concentrate documention is definitive) as follows:
e ← F ? 0⁶ : 52 − (TE ∥ BE)
ba ← F ? (B ∥ BE) : (B ∥ 0³)
ta ← F ? (T ∥ TE) : (T ∥ 0³)
carry ← ta_11..0 < ba_11..0
bot ← (lvaddr_63..14+e + cb) ∥ ba_13..0 ∥ 0^e
top ← (lvaddr_63..14+e + ct) ∥ (ba_13..12 + ~F + carry) ∥ ta ∥ 0^e

CHERI Alternative

The CHERI-128 format above may be appropriate for a capability architecture, but something simpler could drop the capability aspects and provide more bits the top and bottom bounds and a clique for dangling pointer detection. Here is one possibility:

Word 1 of CHERI Alternative
71	64	63	61	60	59	55	54	53	28	27	0
CLIQUE		R		W	E		L	T		B
8		3		1	5		1	26		28

Fields of cherialt
Field	Width	Bits	Description
B	28	27:0	Bottom bits 30..3
T	26	53:28	Top bits 28..3
L	1	54	Length bit
E	5	59:55	Exponent
W	1	60	Write permission
R	3	63:61	Ring
CLIQUE	8	71:64	Clique

carry ← T_25..0 < B_25..0
T2 ← B_27..26 + L + carry
bot ← (lvaddr_63..32+e + cb) ∥ B ∥ 0^3+e
top ← (lvaddr_63..32+e + ct) ∥ T2 ∥ T ∥ 0^3+e

Trap on load or store

One tag is defined to cause an exception when it is referenced on a load or store. This is useful for detecting accesses to freed memory, which is a source of security issues. A special instruction is provided to overwrite such words. Trap on load is also useful for dynamic linking and for setting breakpoints on basic block descriptors.
Trap on store either requires either a read before write, which is undesirable, or more likely storing an extra bit per word in the data cache tag RAMs (e.g., 8 bits for a 64 B block) which seems worth the checking this feature provides.

Trap on load tag
71	64	63	0
254		data
8		64

Trap on load or store tag
71	64	63	0
255		data
8		64

Dynamic Typing

As noted earlier, it is useful to provide tags for Lisp, Python, and Julia types, even when they are simply pointers to fixed-sized memory, and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following with the assumption that other types could employ the structure type or something like it (perhaps some of following could do so as well).

Tag	Lisp	Julia	Data use
0	nil?		0
1..31	simple-vector?	Tuple?	TBD (pointers with exact size in words)
32..127			? (pointer with inexact sizes)
128..191			no dynamic typing use (Reserved)
192..199			no dynamic typing use (unsized pointer with ring)
200..207			no dynamic typing use (Reserved—possible unsized pointers with ring)
208..215			Code pointer with ring
216..220			no dynamic typing use (Reserved)
221	simple-vector?	Tuple?	TBD (pointer to words with size header)
222			no dynamic typing use (Cliqued pointer in AR)
223			no dynamic typing use (Segment Relative)
224	CONS		Pointer to a pair
225	Function		Pointer to a pair
226	Symbol		Pointer to structure
227	Structure	Structure?	Pointer to structure
228	Array		Pointer to structure
229	Vector		Pointer to structure
230	String		Pointer to structure
231	Bit-vector		Pointer to structure
232			CHERI-128 capability word 0
233			no dynamic typing use (Reserved)
234	Ratio	Rational	Pointer to pair
235	Complex	Complex	Pointer to pair
236	Bigfloat	BigFloat	Pointer to structure
237	Bignum	BigInt	Pointer to structure
238		Int128	Pointer to pair, −2¹²⁷..2¹²⁷−1 tag 241 in word 0, tab 240 in word 1
239		UInt128	Pointer to pair, 0..2¹²⁸−1 tag 241 in both word 0 and word 1
240	Fixnum	Int64	−2⁶³..2⁶³−1
241		UInt64	0..2⁶⁴−1
242	Character	Bool, Char, Int8, Int16, Int32, UInt8, Uint16, Uint32	UTF-32 + modifiers, subtype in upper 32 bits
243			no dynamic typing use (Reserved)
244	Float	Float64	IEEE-754 binary64
245		Float16, Float32	subtype in upper 32 bits
246..249			no dynamic typing use (Reserved)
250			no dynamic typing use (header/trailer words)
251			no dynamic typing use (AR and CHERI word 1)
252..253			no dynamic typing use (BB descriptor)
254			no dynamic typing use (trap on load or BBD fetch (breakpoint))
255			no dynamic typing use (trap on load or store)

Python and Other Language Types

In addition to Lisp types, SecureRISC could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.

Block-Oriented ISA Terminology

Basic Block: A series of instructions with control transfers only before the first instruction and after the last.
Bage: A SecureRISC invented term for Basic Block Page, which is a 4 KiB aligned portion of the virtual address space containing basic block descriptors and the instructions addressed by those descriptors. The bage size should be less than or equal to the minimum page size so that the bage lvaddr → siaddr translation can be used for the L1 instruction cache access; if the SecureRISC minimum page size is increased, the bage size might be increased as well.
Basic Block descriptor: All control transfers are to Basic Block (BB) descriptors, which have tags 252 and 253. Transfers are not to instructions; a jump addressing instruction words would take an exception based on a tag mismatch. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks. For basic blocks with conditional branches, the conditional branch prediction is made when the basic block descriptor is executed, and checked when the conditional branch instruction in the basic block is executed. The conditional branch instruction only has the operands to decide on taken or not-taken; the branch offset is stored in the descriptor, not the instruction. Thus, conditional branches look like other ALU instructions and may occur anywhere in the basic block and need not be the last instruction (earlier placement may reduce the branch misprediction penalty).
Basic block descriptors (BBDs) would typically be cached in their own specialized cache earlier in the pipeline than the instruction cache, and all program path prediction would be done based on this cache. This takes the instruction cache hierarchy out of the critical path. Filling the BBD cache is done a block at a time, perhaps with prefetch, which further helps performance. Descriptors also enable wide, parallel instruction decode even when variable sized instructions are present.
Program Counter: The Program Counter (PC) is a processor register giving the current Basic Block Descriptor (a 8‑byte aligned lvaddr). In normal operation a Basic Block is executed in its entirety, and so the instruction within the Basic Block need only be identified when exceptions stop execution in the middle of a basic block. Calls only store 8‑byte aligned values and returns trap on non-aligned values. Exceptions do store the full PC (along with a 208+ring tag value) and the offset within the BB. When packed Basic Block Descriptors (tag 253) are implemented, the PC may become 4‑byte or 2‑byte aligned.

Other Features and Aspects of SecureRISC

Segment Growth: Segments may grow upward or downward, but not both. Downward-growing segments are supported by setting the downward bit in the Segment Descriptor Entry (SDE), which checks that the fill bits (bits 47..ssize) are all set. To grow a segment the supervisor increases ssize in the SDE and doubles the area allocated to the first-level page table.
Stacks: Stacks are typically contained in a single segment so that they may grow as needed, but still be bounds checked. It is up to the ABI whether stacks grow upward or downward. Downward-growing stacks made sense when the heap and stack were at opposite ends of a small address space, and bi-directional growth allowed each to grow to fill the space between without predetermined limits to each. This is no longer necessary with stack and heap each occupying its own segment. For both upward and downward stacks the size of stack frames is encoded in the sp register size field, which causes references using sp as a base register to be bounds checked to the stack frame.
Upward-growing stacks increment sp by the frame size, and the operating system knows that the stack includes that size. The ENTRY instruction accomplishes this, given an immediate value representing the tag of the new sp.
Downward-growing stacks simply decrement sp by the new frame size. This is accomplished with the ENTRYD instruction.
In both stack directions, the new size is written into the size field of sp.^* The old value of sp is written into the new stack frame so that the frame can deallocated by loading this value. Attempting to deallocate a frame with only an increment or decrement would lose the frame size. This method also makes stack frames a linked list, which is convenient for debugging. (If the compiler chooses frame size that can be exactly encoded in the tag field, then this sp store can occupy a single word in the stack frame, rather than a doubleword.)
Each process thread is typically given two stacks per ring, each in its own segment. One stack of the pair is used only for return addresses, which are pushed and popped when calls, returns, exceptions, and exception returns commit in the pipeline using the CSP[ring] RCSRs.† The other stack is used for everything else. The return address stack segment is typically write-protected from the current ring of execution except during call operations and can only be written by calling a more privileged ring. This aids in Control Flow Integrity (CFI) by preventing return-oriented programming (ROP) from overwriting return addresses. The return address stack also avoids wasting an AR.
* When pointer and memory cliques are used for bounds checking undisciplined languages, it will be desirable to set the memory tags of the frame to a unique value, e.g., by using a call count mod the number of cliques, except if this results in the same clique as the previous frame. Initializing all the words of a stack frame is likely to be the major performance cost to the clique method.
† Most implementations will provide two specialized caches of several lines (typically 2–8 lines representing 16–64 return addresses) for the return address stack, one at the commit point of the pipeline and one at the front-end used for prediction. The commit cache is kept coherent with the processor data caches used by load and store instructions and the prediction cache may be restored from the commit cache on mispredicts.
While both stack directions are supported, the ABI is expected to prefer upward growing stacks, as this may simplify hardware prefetch in some implementations.
Control-flow Integrity: Many attacks on conventional processors exploit sneaking Trojan data into the memory of a process. Since that memory typically lacks execute permission, the attacker instead depends upon causing existing instructions to execute the attacker’s algorithm using bogus data. Return-oriented programming (ROP) is one method to exploit existing instructions by overwriting the return address on the stack so that the return transfers to carefully chosen address that executes a few instructions and then returns to a new address. Often only a portion of a basic block containing a return is executed in this way. The basic block descriptor mechanism defeats this since a return to a basic block descriptor executes the entire basic block. In addition, basic block descriptors contain a field indicating whether a return ever targets the block and returns to blocks that do not expect a return take an exception. In addition, by moving the return address stack into protected memory, overwrites are prohibited.
Branch avoidance: SecureRISC has several features that reduces the demands on branch prediction, which improves performance. The Boolean Registers (BRs) are one aspect of the ISA that enables some branch avoidance. Boolean registers are especially useful in implementations that employ the early/late pipeline split.
Trap instructions: SecureRISC contains a rich set of trap instructions that cause an exception based on various conditional tests. This allows the compiler to supplement the checking mandated by the SecureRISC ISA with its own checks. Trap instructions do not use branch prediction resources and in some microarchitectures are almost free to execute with minor performance impact, except for their code size and fetch bandwidth requirements.
Loop count: Rather than depending upon the conditional branch predictor to predict loop iteration counts, SecureRISC defines instructions to communicate inner loop iteration counts to the BB engine and to indicate how to check predictions made thereby. This feature is initiated by the LOOPX or LOOPXI instructions with the number of iterations prior to the start of the loop in a BB with the c bit set in its descriptor. The microarchitecture employs count prediction on such BBs, most likely in a specialized structure. This prediction is be replaced by the actual value when the LOOPX or LOOPXI executes in the AR engine, which is often before the first or second loop back. When the last BB of the loop wants to loop back, it uses a BB descriptor next code of loop back or loop back conditional decrementing the predicted count and branching to the target if not zero. This feature allows SecureRISC to achieve DSP-like performance on simple loops and reduces the burden on the branch predictor, making it more effective on real conditional branches. The BB containing a loop test must also contain the SOBX instruction to decrement the actual loop iteration count in an XR and check the prediction.
Pointer Permission: In addition to Read, Write, and Execute permissions, SecureRISC includes Pointer and Capability permission bits in Segment Descriptors. Only segments with the P bit set are allowed to contain pointers to other segments. Stack and heap segments would typically have P set, but code and mapped data files would have P clear. Segments with P clear may only contain local pointers, which consists of just the offset within the segment. The RLA64, RLA32, RLA64I, and RLA32I instructions allow such pointers to be converted to full pointers when loaded into an Address Register. This allows a database to contain internal pointers that are independent of the address to which the segment is mapped at runtime.
Capability Permission: Capability Permission allows the segment to contain CHERI capabilities.

Sandboxing

At times it can be useful to be able to execute untrusted code in an environment where that code has no direct access to the rest of the system, but where it can communicate with the system efficiently. Hierarchical protection domains (aka protection rings) provide an efficient way to provide such an environment. Imagine a web browser that wants to be able to download code from an untrusted source, perhaps use Just-In-Time Compilation to generate native code, and then execute to provide some service as part of displaying the web page. The downloaded code should not be able to access any files or the state of the user browser. For this scenario on SecureRISC, where ring 0 is the least privileged and ring 7 the most privileged (the opposite of the usual convention), the web browser might execute in ring 2, generate machine code to a segment that is writeable from ring 2, but only Read and Execute to ring 0, and then transfer to that ring 0 code. Rings may share the same address space and TLB entries for a given process, but the ring brackets stored in the TLB change access to data based on the current ring of execution. Ring 0 would have access only to its code, stack, and heap segments, and nothing else. It would not be able to make system calls or access files, except indirectly by making requests to ring 2. The only access ring 0 would have outside of its three segments might be to call a limited set of gates in ring 2, causing a ring transition. Interrupts and such would be delivered to the browser in ring 2, allowing it to regain control if the ring 0 code does not terminate. The browser and the rest of the system is completely protected by the code executing in ring 0. When a more privileged ring accesses a less privileged ring’s address space, it does so through pointers that include the ring number of the less privileged ring, and the permissions enforced by SecureRISC are those of the less privileged ring. Thus ring 0 may pass pointers to its data when calling ring 2 gates, and these pointers are checked with ring 0 permissions. Because of the ring number in pointers and ring brackets, the ring 0 address space is a subset of the address space of ring 2, ring 2 has complete access to all the data in ring 0, but ring 0 has access only to the segments granted to it by ring 2. Ring 2 has the option to grow or not grow the code, heap, and stack segments of ring 0 as appropriate. Less privileged rings cannot use the ring number in pointers to gain access, as the permissions are computed for min(PC.ring, AR[a]_66..64).

Garbage Collection

One goal for SecureRISC is to support languages, such as Lisp, Julia, Javascript, and Java, that rely on Garbage Collection (GC), as this eliminates many programming errors that introduce bugs and vulnerabilities. GC is the automatic reclamation of allocated memory by tracing all reachable allocations and freeing the remainder. GC needs to be both low overhead while meeting application response time requirements (e.g., by not pausing the application excessively). SecureRISC will achieve this by including features (described in subsequent sections) for generational GC and per-thread access restrictions to allow concurrent GC to be performed by one processor while another continues to run the application.

GC Terminology

Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1–4 generations, each generation consisting of some data structures and many small independent incremental regions that are used to implement incremental GC. The purpose of the incremental regions is to bound the timing of certain GC operations making program delays not proportional to memory size but only to incremental region size. When the application program needs to access an incremental region that has not been processed, the application switches to process it immediately, and then proceeds. The incremental region is small enough that the delay in processing it is acceptable to application performance, but large enough that its overhead is not excessive. A group of incremental regions is called a macro region, and a generation might be one or more macro regions. Macro regions are further divided into those for small and large objects, which use different algorithms for their incremental regions.

The SecureRISC Garbage Collection (GC) terminology introduced so far is briefly summarized below:

Area: Programmers allocate objects in areas, which provides grouping. In SecureRISC, an area might consist of a data structure and several segments, one per generation.
Generation: Generations group data into four lifetimes, from ephemeral (generation 0) to long-lived (generation 3), with generations 1 and 2 having intermediate object lifetimes. Generations consist of small-object macro regions and large-object macro regions.
Incremental Region: To minimize the time the application is paused by the GC algorithm, objects are stored in incremental regions that can be compacted quickly in response to an access by the application. Once the incremental region is compacted, the application continues.
Macro Region: Incremental regions are small to minimize processing time, and so cannot hold all of the application data. A set of incremental regions provides the capacity required by the application. Macro regions have incremental regions for large and small objects, managed by different algorithms.
Small Object Region: Small objects (less than a small multiple of the page size) benefit from compaction, and are allocated sequentially from free space.
Large Object Region: Large objects (greater than a small multiple of the page size) are allocated as many pages as required, and are never moved. When they are no longer referenced, GC frees the pages they occupied.

Generational GC

New allocations are presumed to how short lifetimes until proven otherwise. Such allocations are ephemeral and done in a generation 0, which is reclaimed frequently. The ephemeral allocations store pointers to all generations, but have few pointers from longer-lived generations to the more ephemeral allocations. For efficiency, reclamation operates without scanning all older allocations. Over time as data remains live in the ephemeral generation for many reclamations, it may be moved to an older generation. To work correctly, pointers in older areas that point to recent ones need to be known and used as roots for recent area scans. The processor hardware helps this process by taking an exception when a pointer to a newer generation is first stored to location in an older generation; the trap handler can note the page being stored to and then continue. The translation cache access for the store will provide both the generation dirty level for the target page and the generation number of the target segment. For the data being stored, the tag indicates whether it is a pointer or not, and if so then the Segment Size Cache provides the generation number of the pointer being stored, and the translation cache provides the generation of the page being stored to. If the page generation is greater than the generation of the pointer being stored, an exception occurs. SecureRISC has support for 4 generations, with generation 0 being the most ephemeral and generation 3 being the least frequently reclaimed. Rather than storing the location of all pointers on a page to more recent generations, the trap might simply note which pages need to be scanned when GC happens later. Because words in memory are tagged, pages can be scanned later without concern that an integer might be interpreted as a pointer. With sufficient operating system sophistication, it is even possible that a page could be scanned prior to being swapped to secondary storage, to prevent it needing to be read back in during GC. After the first trap on storing a recent generation pointer to an older generation page, if only the page is noted for later scan, then the GC field in the PTE would typically be changed by the trap handler so that future stores to the page are not trapped.

Before describing the mechanisms for incremental GC, it is helpful to have a specific GC algorithm in mind. The next section presents the preferred algorithm. After the preferred algorithm, the details of per-thread access restriction for incremental GC are presented.

Garbage Collection Algorithm

David Moon, architect of Symbolics Lisp Machines, kindly offered suggestions on Garbage Collection (GC). I have dubbed his algorithm MoonGC. He began by observing the following:

Compacting garbage collection is better than copying garbage collection because it uses physical memory more efficiently.

Compacting garbage collection is better than non-moving garbage collection and C-style heap allocation because it does not cause fragmentation.

First, divide objects into small and large. Large objects are too large to be worth the overhead of compacting, larger than a few times the page size. Large objects are allocated from a heap and never change their address. The garbage collector frees a large object when it is no longer in use. By putting each large object on its own pages, physical memory is not fragmented and the heap only has to track free space at page granularity. Virtual address space gets fragmented, but it is plentiful so that is not a problem.

Small objects are allocated consecutively in fixed-size regions by advancing a per-region fill pointer until it hits the end of the region or there is not enough free space left in that region; at that point allocation switches to a different region. The region size is a small multiple of the page size. The allocator chooses the region from among a set of regions that belong to a user-specified area. Garbage collection will compact all in-use objects in a region to the low-address end of that region, consolidating all the free space at the high-address end where it can easily be reused for new allocations.

SecureRISC now uses incremental region for what MoonGC called simply region above. Before continuing, this proposal introduces this and other terminology in the next section.

One other advantage of compaction, not mentioned above, is that it provides a natural mechanism for determining the long-lifetime data in ephemeral generations: it is the data compacted to the lowest addresses.

MoonGC, as originally presented, is a four phase algorithm to implement the above using only virtual memory and changing page permissions. The following adapts MoonGC to take advantage of the address restriction feature described below, as using virtual memory protection changes is costly. The restriction allows GC to deny application threads access to incremental regions when they are in an inconsistent state. The following also makes other minor changes so that the exposition below is new. The credit goes to David Moon, but problems and bugs are likely the result of these changes and exposition.

The application threads run concurrently with the GC threads, except in phase 3 (the stack scan). Application threads may be slowed during phase 4 as will be explained. The four phases of MoonGC are as follows:

Mark phase. Mark data reachable from roots in incremental region bitmaps. This phase is concurrent with the application. New allocations also mark the bitmaps.
Preparation phase. Process the bitmaps to prepare for small object relocation and free large object pages. This phase is concurrent with the application.
Relocation phase. Translate all roots, including the stack and the pointers from longer-lived generations, converting pre-compaction addresses to post-compaction addresses. Deny the application access to all small and large object regions using address restriction. This trap is handled by a handler in the same ring, and is therefore fairly low-overhead. The application is paused during this phase, but this phase takes a short time and is not proportional to memory size, only the size of the stack and other roots. After this phase, the roots, stack, and all accessible memory contain only compacted addresses, and the application works only with such addresses. Application access to memory with pre-compaction addresses is denied (GC threads continue to have access).
Compaction phase. This phase is concurrent with the application and occurs primarily in the GC threads, but application threads may join the work when needed. The compaction phase goes through all small object incremental regions and moves marked objects from their pre-compaction to post-compaction address, and also translates the pointers contained in the objects as well. It also goes through all large objects and translates the pointers contained therein. Once compaction is completed for an incremental region, the permission for that incremental region is enabled for the application. If an application thread tries to access a disabled region, the trap will, if compaction has not already started on this region, the thread switches to compaction and translation of the incremental region (only translation for large object regions). (If a GC thread is already working on the incremental region, the application thread will just wait.) When an application thread joins the compaction threads, it will temporarily enable access, and then restore access when it finishes its incremental region, and then return to application work, which now that the region is compacted, will no longer trap. The time spent in the application thread doing compaction and translation is proportional to incremental region size and not memory size, bounding the pause.

Occasionally an extra phase of the algorithm might compact two incremental regions into one. Still additional phases might migrate objects from a frequent generation to a less frequent one.

Virtual Address Restriction

This proposal starts with the assumption that software will designate one or more macro regions of the virtual address space to be subject to additional access control for rings ≤ R (controlled by a privileged register so that, for example, user mode cannot affect supervisor mode). For example, when Garbage Collection is used for reclaiming allocated storage, only the heap might be subject to additional protection to implement read or write barriers. These macro regions of the virtual address space are specified using a Naturally Aligned Power-of-2 (NAPOT) matching mechanism to provide flexible sizing. Matching for eight macro regions is currently proposed, which would support four generations of small object macro regions, and four generations of large object regions. This restriction is implemented in a fully associative 8‑entry CAM matching the effective address of loads and stores. A match results in 128 access restriction bits, with one bit selected by the address bits below the match. In particular, there are eight Virtual Access Match registers (amatch0 to amatch7), eight 128‑bit Virtual Address Region Trap registers (atrap0 to atrap7), and eight 128‑bit Virtual Address Region Write Trap registers (awtrap0 to awtrap7). The atrapi/awtrapi registers are read and written 64 bits at a time using low and high suffixes, i.e., atrapil/atrapih and awtrapil/awtrapih. The format of the amatchi registers is as follows, using a NAPOT encoding of the bits to compare when testing for a match.

Virtual Address Match Registers
63	22	21	18	17	4	3	0
vaddr_63..19+S		2^S		0		TYP
45−S		1+S		14		4

Fields of amatchi registers
Field	Width	Bits	Description
TYP	4	3:0	0 ⇒ Disabled 1 ⇒ Address restriction for GC 2..15 Reserved
2^S	1+S	18+S:18	NAPOT encoding of virtual address region to match
vaddr_63..19+S	45−S	63:19+S	Virtual address to match

When bits 63:19+S of a virtual address match the same bits of amatchi, then the corresponding atrapil/atrapih and awtrapil/awtrapih pairs specify 128 additional access and write denial bits for the incremental regions thereof. In particular, on a match to amatchi, bits 18+S:12+S of the effective address are used to select bits from the atrapi pair and the awtrapi pair. If the atrapi bit is 1, then loads and stores generate an access fault; else if the awtrapi bit is 1, then only stores generate an access fault. The value of S comes from the NAPOT encoding of amatchi registers, as the number zero bits starting from bit 18 (i.e., S=0 if bit 18 is 1, S=1 if bits 19:18 are 10, and so on). Setting bits 63:18 to 2⁴⁵ causes it to match the entire virtual address space. The lowest numbered amatchi match has priority. If no amatchi register matches then there is no additional access check.

How to control ring access to the above CSRs is TBD, as what ring accesses are trapped.

A atest instruction will be specified to return the incremental region that matches the effective address ea. If there is not a match, these instructions return the null pointer (tag 0). On a match to amatchi they return a pointer (with the appropriate size tag) to ea_63..12+S∥0^12+S based on the S from the matching register.

awtrapi registers are not required for MoonGC, described above, and may be left set to zero for that algorithm. They could be omitted if another use is not found for them, but they may be useful for other GC algorithms.

The efficiency of translating pre-compaction to post-compaction addresses is critical. The original MoonGC proposal recognized that this time is probably limited by data cache misses, and used the preparation phase to convert the bitmaps into a relocation tree that would require only three cache block accesses per translation with binary searching. The following modification is proposed to reduce this to just two cache blocks by making extensive use of population count (popcount) operations.

Within a small object incremental region, the post-compaction offset of an object is the number of mark bits set in the incremental region bitmap for objects up to but not including that object. For translation, summing the popcount on all the words in the bitmap prior to the word for the pre-compaction address would touch too many cache blocks, so in phase 2 (preparation) compute the popcounts of each bitmap cache block and store them for lookup in phases 3 and 4. Each translation is then one popcount cache block access and one bitmap cache block access. For a small object incremental region holding N objects and a cache block size of 512 bits (64 B), the number of bitmap cache blocks B is ⌈N/512⌉. Store 0 in summary[0]; store popcount(bitmap[0..511]) in summary[1]; store summary[1]+popcount(bitmap[512..1023]) in summary[2]; and so on … and finally store summary[B-2]+popcount(bitmap[N-1024..N-511]) in summary[B-1]. If N ≤ 65536 then the summary count array elements fit in 16 bits, and so the size of the summary array is ⌈B/32⌉ cache blocks, and if N ≤ 16384 the summary array fits in only one cache block. To translate from the pre-compaction offset to the post-compaction offset in phases 3 and 4, simply take the ⌊offset/512⌋ as the index into this array to get the number of objects before the bitmap cache block. Now read the bitmap cache block. Add the popcount of the 1–8 words up to the object of interest (using a mask on the last word read) to the lookup value. This sum is the post-compaction offset in the small object incremental region. If eight popcounts are too costly, then the summary popcount array may be doubled in size to cover just four words, or a vector popcount reduction instruction might be added to make this even more efficient.

As an example, to illustrate the above, consider NAPOT matches on 16 MiB (S=5), which provides 128 access controlled incremental regions of 128 KiB (131072 B) each. An object pointer is converted to its containing incremental region by simply clearing the low 17 bits. There are 16104 words (2013 cache blocks) of object store (98.29%), which are stored starting at offset 0 in the incremental region. The bitmap summary popcounts are 64 bytes starting at 128832. Bitmaps are 2016 bytes (31.5 cache blocks) starting at 128896. Finally there are 160 bytes (20 words, 2.5 cache blocks) of incremental region overhead for locks, freelists, etc. available starting at 130912. To go from the pointer to its bitmap byte, add bits 16:6 to the region pointer plus 128896 and the bit is given by bits 5:3.

Incremental region layout examples
Mregion		Iregion	Objects		Summary		Bitmap		Other
S	MiB	words	words	%	words	%	words	%	words	%
0	0.5	512	480	93.75	1	0.20	8	1.56	23	4.49
1	1	1024	984	96.09	1	0.10	16	1.56	23	2.25
2	2	2048	1992	97.27	1	0.05	32	1.56	23	1.12
3	4	4096	4008	97.85	2	0.05	63	1.54	23	0.56
4	8	8192	8040	98.14	4	0.05	126	1.54	22	0.27
5	16	16384	16104	98.29	8	0.05	252	1.54	20	0.12
6	32	32768	32232	98.36	16	0.05	504	1.54	16	0.05
7	64	65536	64480	98.39	32	0.05	1008	1.54	16	0.02
8	128	131072	128976	98.40	63	0.05	2016	1.54	17	0.01
9	256	262144	257968	98.41	126	0.05	4031	1.54	19	0.01
10	512	524288	515952	98.41	252	0.05	8062	1.54	22	0.00
11	1024	1048576	1031928	98.41	504	0.05	16124	1.54	20	0.00

Smaller incremental regions may provide better real-time response, but limit the size of a macro region due to the 128 access denial bits provided by atrapi pairs. Larger incremental regions pause the application for longer and also require a larger summary popcount array, but allow for larger memory spaces. Generations might choose different incremental region sizes. Typically generation 0 (the most ephemeral) would use small incremental regions, while generation 3 (the most stable) would use incremental regions sized to fit the amount of data required.

With eight amatch sets of registers, half might be used for four generations of small object regions, and half for four generations of large object regions. In the above example, if each bit of atrap controls a 128 KiB small object region, then the ephemeral generation can be as large as 16 MiB. Less ephemeral generations might be larger.

A possible improvement to the algorithm is to have areas use slab allocation for a few common sizes. For example, there might be separate incremental regions for 1, 2, …, 8, and >8‑word objects. This allows a simple freelist to be used for objects ≤8 words so that compaction is not required on every GC. Incremental regions for ≤8 words might only be compacted when it would allow pages to be reclaimed or cache locality to be increased. Note that different trade-offs may be appropriate for triggering compaction in ephemeral vs. long-lived generations. Also, bitmaps could be potentially use only one bit per object rather than one bit per word in 2‑word, 4‑word, and 8‑word regions, making these even more efficient. However, that requires a more complicated mark and translation code sequence.

When a GC thread finishes compaction of an incremental region, application access is not immediately enabled since that would require sending an interrupt to all the application threads telling them to update their atrap registers. Instead the updated atrap bits are stored in memory, and the next application exception will load the updated value before testing whether compaction is required, in progress, or still needs to be done.

Setting the TYP to 0 in amatchi registers may be used by operating systems to reduce context switch overhead; disabled registers may be treated as having amatchi/atrapi/awtrapi all zero.

Exceptions And Interrupts

This section is preliminary at this point.

Each ring is capable of handling some of its own exceptions and interrupts. For example, ring R assertion failures (attempts to writes of 1 to b0) abort the execution of the instruction making the assertion and transfers to the ring R handler. This exception call pushes the PC, the offset in the basic block, plus three words of additional information onto the Call Stack, and a return pops this information. The exception handler is specified in a per ring register. The additional information includes a cause code and may include further information that is cause dependent. The details of the exception mechanism are TBD. Of course, in some cases exceptions should be handled by a more privileged ring (e.g., user page faults should go a supervisor exception handler since the user exception handler might take a page fault, and similarly for second-level page faults for the supervisor and hypervisor). Again the details are still TBD. Also, exceptions in exception handlers may go to a higher ring.

Also, it is conceivable that ring R page faults could be handled by the ring R exception handler. A page fault in the handler would then be handled by a more privileged ring. This might allow rings to implement their own page replacement algorithms. However, this would not be the typical case.

The Basic Block Descriptor (BBD) addresses of the exception handlers for ring R are given by the RCSR ExcHandler[R], which must be 8‑byte aligned (typically these values are 128‑byte aligned). As with other RCSRs, only rings of equal or higher privilege may write the register. In addition, values written to this register must have a code pointer tag designating a ring of equal or higher privilege, but not higher than PC.ring. Thus the validity test is as follows:
h ← X[a]
if (h_2..0 ≠ 0) | (R > PC.ring) | (h_71..67 ≠ 26) | (h_66..64 < R) | (h_66..64 > PC.ring) then
exception
endif
In addition the basic block descriptor (BBD) at ExcHandler[R] must have tag 252 with prev = 4 (Cross-Ring Exception entry) and the BBD at ExcHandler[R] | 64 must have tag 252 with prev = 12 (Same-ring Exception entry).

ExcHandler[R] specifies the BBD address for exceptions from less privileged rings to ring R (i.e., for PC.ring < R). Exceptions from ring R to R (i.e., for PC.ring = R) use the modified BBD address ExcHandler[R] | 64. This allows cross-ring exceptions to additional state save and restore (e.g., stack switching), while same-ring exceptions are fast (and for example stay on the same stack).

The exception process may itself encounter an exception that must be serviced by a more privileged ring (e.g., a virtual memory exception in writing the call stack). This will be designed so that after the virtual memory exception is remedied, the lower privilege ring exception can proceed. Also, programming or hardware errors might result in an attempt to take an exception in the critical portion of the exception handler, which will be detected, and signal a new exception to a more privileged ring, or a halt in the case of ring 7.

When an exception handler is entered, its first task is usually to save some registers and load them with values that the exception handler requires. In some ISAs these registers are pushed onto to the stack. On other ISAs, special scratch registers are provided that are used to save General-Purpose Registers (GPRs). The loading of useful values may come from various sources (e.g., a PC-relative load). For RISC‑V a GPR is swapped with a special scratch register, both saving and loading in a single instruction. RISC‑V has 3–4 scratch registers for different privilege levels. SecureRISC would require 8 RCSRs (one per ring) if it chooses this method. SecureRISC may instead provide an instruction to push or store an AR pair and an XR pair onto the Call Stack rather than providing per-ring scratch registers. This is still TBD. For loading new values, SecureRISC has the RCSR ThreadData[ring], which can be moved to an AR.

Each ring has its own set of interrupt enable and pending bits, and these are distinct from other rings’ bits. Interrupts also use the general exception handler, with a cause code indicating the reason is for an interrupt. Their additional information includes the previous IntrEnable mask for the target ring. When the interrupt exception occurs, IntrEnable[ring] is set to zero, disabling further interrupts and the original interrupt enables are saved on the Call Stack. The interrupt handler is expected to reenable higher-priority interrupts based on clearing same and lower priority interrupts from the saved enables and writing that back to IntrEnable[PC.ring]. The bits from the saved enables to clear might be a bitmask from a per-thread lookup table, which allows for all 64 interrupts to be prioritized relative to each other. The RFI instruction restores the interrupt enable bits from the Call Stack. Any pending interrupts that are thereby enabled will be taken before executing the instruction returned to. The RFI instruction may optimize this case by simply transferring back to the handler address rather than popping and pushing the call stack.

Interrupt pending bits are set by writing to a memory address specific to the target process. When the process is running, this memory address is redirected to the process’ pending register; otherwise, it will receive the interrupt when it switches to running.

The mechanism for clearing an interrupt pending bit is interrupt dependent. For level-triggered interrupts it is interaction with the interrupt signal source that deasserts the signal, and thus clears the pending bit. For edge-triggered and message-signaled interrupts, the RCSRRCXC instruction may be used clear the interrupt pending bit.

Processors check for interrupts at the start of each instruction. An interrupt is taken instead of executing the instruction if (IntrPending[ring] & IntrEnable[ring]) ≠ 0 with the check done in order for ring from 7 down to PC.ring.

Three interrupts are generated by the processor itself and are assigned fixed bits in the IntrPending and IntrEnable registers. Bit 0 is for the ICount interrupt; bit 1 is for the CCount interrupt; and bit 2 is for the Wait interrupt. Wait interrupts occur whenever less privileged rings attempt to use one of the wait instructions that would suspend execution. Enabling Wait interrupts allows intercept such waits to switch to other work. This interrupt would typically be enabled when other work exists, and disabled otherwise. In addition, the supervisor is expected to define certain interrupts for user rings. For example, a timer interrupt would typically be created from cycle counts for bit 3. (Need to either define per-ring Wait interrupts or have a rule that the least privileged ring of higher privilege gets the interrupt.)

Virtualization of Interrupts

Interrupts need to be virtualized. SecureRISC expects systems to primarily employ Message Signaled Interrupts (MSIs), where interrupts are transactions on the system interconnect. MSIs are directed to a specific process. If the process is currently executing on a processor, then the interrupt goes to that processor. If the process is not running, then the interrupt must be stored in memory structures (e.g., by setting a bit), and then the scheduler for that process (stored in the in the memory data structure) must be notified (e.g., by a new interrupt message MSI). This process may repeat several times until a running scheduler is reached. When a process is scheduled on a processor, the interrupts stored in memory are loaded into the processor state, and future interrupts are directed to the processor rather than to memory.

To implement this, interrupt messages are directed to one or more specialized Interrupt Processing Units (IPUs). Creating a process allocates system interconnect memory for the process’ interrupt data structures and provides this memory to the chosen IPU. When the process is scheduled, the IPU is informed to forward interrupts directly to it. When a process is descheduled, the IPU is informed to store its interrupts in the allocated memory and send an interrupt to the scheduler.

For some systems a single Interrupt Processing Unit (IPU) may be sufficient. In others it may be appropriate to have multiple IPUs, e.g., one unit per major priority level, so that lower priority interrupts do not impede the processing of higher priority ones. (There may be some sequential processing in IPUs, such as a limitation on outstanding memory operations.) NUMA systems may also want distributed IPUs. It is also possible there would be an IPU associated with each processor or processor cluster (e.g., a set of a processors sharing a LLC), which would take advantage of the interface to the system interconnect.

The details of the above are TBD. Conceptually, MSIs would probably address a process through a triple of Interrupt Processor Unit (IPU) number, an opaque identifier^* referencing a process or thread, and an interrupt number for the process. The opaque identifier would be translated to its associated memory by the IPU, and the interrupt number bounds checked against the number of interrupts configured for the process. Forwarding interrupts to running processes would specify a processor port on the system interconnect, a ring number, and the interrupt number. It may be desirable to fit the interrupt state for a process into a single cache block to help manage atomic transfers between IPUs and processors.

* If rings do not share address spaces the (VMID, ASID) tuple might be used as an interrupt identifier, perhaps with the addition of a thread ID. For rings sharing an address space, adding a ring number could be sufficient. This would require a mapping table of this identifier to an IPU and memory address.

The advantage of this outline is that no specialized storage is required per process. Main memory replaces the specialized storage for non-running processes, and the processor interrupt mechanisms are used for running processes.

Dynamic Linking and Loading

Most RISC ISAs use a set of mechanisms to implement dynamic loading, and dynamic linking that are less efficient than what SecureRISC can do using tags and a different ABI. Because the RISC‑V ABI for dynamic linking is slightly better than some older ABIs, it will be the basis of comparison here.

Most dynamic linking implementations do lazy linking on procedure calls; the first call to a procedure invokes the dynamic linker, which converts the symbol being referenced into an address and arranges that subsequent calls go directly to the specified address. This speeds up program loading because symbol lookup is somewhat expensive, and not all symbols need to be resolved in every program invocation. Lazy linking is not typically done for data symbols because the cost of detecting the first reference and invoking the dynamic linker would require too much extra code at every data access. So data symbol links are typically resolved when a shared library is loaded. In contrast, SecureRISC’s trap on load tag (tag value 254), allows links to external data symbols to be resolved on first reference, which should lead to faster execution initiation.

In RISC‑V, because external symbols are resolved by the dynamic linker when the object is dynamically loaded, it suffices to reference indirect through the link filled in by the linker, which is stored in a section called the Global Offset Table (GOT). In RISC‑V the GOT is a copy-on-write section of the mapped file and is addressed using PC-relative addressing (using the RISC‑V AUIPC instruction).

External symbol and function references are given in the C++, RISC‑V, and SecureRISC code examples below to illustrate the differences between the RISC‑V ABI and the proposed SecureRISC ABI. Starting with the C++ code:

	extern uint64_t count;		// external data
	extern void doit(void);		// external function
	static void
	count_doit(void)
	{
		count += 1;
		doit();
	}
could be implemented as follows for RISC‑V:
count_doit:
	addi	sp, sp, -16	// allocate stack frame
	sd	ra, 0(sp)	// save return address
.Ltmp:	auipc	t0, %got_pcrel_hi(count)	// load link to count from GOT
	ld	t0, %pcrel_lo(.Ltmp)(t0)	// (PC-relative)
	ld	t1, (t0)	// load count
	addi	t2, t1, 1	// increment
	sd	t2, (t0)	// store count
	call	doit@plt	// call doit indirectly through PLT
	ld	ra, 0(sp)	// restore return address
	ret		// return from count_doit
where the call pseudoinstruction above is initially:
	auipc	ra, 0	// with relocation R_RISCV_CALL_PLT
	jalr	ra, ra, 0
but potentially relaxed to:
	jal	ra, 0	// with relocation R_RISCV_JAL
when the PLT is within the 1 MiB reach of the JAL (see RISC-V ABIs Specification version 1.0).
The PLT target of the above AUIPC/JALR or JAL is a 16‑byte stub with the following contents:
1:	auipc	t3, %pcrel_hi(doit@.got.plt)
	ld	t3, %pcrel_lo(1b)(t3)
	jalr	t1, t3
	nop

As seen above, the external variable reference is three instructions initially (and subsequently just one, as long as the link is held in the register). The SecureRISC ABI generally requires only two instructions to the first reference.

Also as seen above, the external procedure call is 4–5 instructions with two changes of instruction fetch (two BTB entries), one in the body and one in the PLT. If there are multiple calls to doit in the library, the PLT entry is shared by all the calls. When the number of frequent calls to doit is N, then N+1 BTB entries are required (N from the body, 1 from the PLT). The SecureRISC ABI requires 2 instructions and N BTB entries, which is not significantly different from N+1 for large N, but for N=1 represents half the BTB entries.

The typical POSIX ABI, such as the RISC‑V ABI, is as based on the C/C++ notion that all functions are top level. Other languages allow functions nesting, and is typically implemented by making function variables two pointers: a pointer to the code to call, and a context pointer specifying the stack frame of the function’s parent, which is used when referencing the parent’s local variables. The SecureRISC ABI proposal is to adopt the idea that all functions are specified by a code and context pointer pair, where the context for top-level functions is a pointer to their global variables and function links.

One the consequences of the proposed SecureRISC ABI is that copy-on-write is not required. An operating system that implements copy-on-write could use it (the context pointer would point to the data section of the mapped file), but it might avoid copy-on-write by copying the mapped file’s data template to a data segment with read and write permission, which allows page table sharing for the mapped file.

Another consequence is that the method of access to globals and links is the same in both the main application code and dynamically loaded shared libraries. In RISC‑V and other ABIs, the application code typically references global variables via the Global Pointer (gp) register, but with PC-relative references in shared libraries. For SecureRISC, each shared library has a register (the context pointer) for addressing its top-level data.

The C++ function above could be implemented on SecureRISC as follows (with shading used to highlight the basic blocks):
count_doit:
	bb	%prcall,%icall\|%nohist
	entryi	sp, sp, 32	// allocate stack frame
	sadi	sp, sp, 0	// save return address
	sadi	a10, sp, 16	// save a10
	mova	a10, a1	// move context to a10
	lai	a2, a10, count_link	// load count link
	lsi	s0, a2, 0	// load count
	addsi	s1, s0, 1	// increment
	ssi	s1, a2, 0	// store count
	ljmpi	a0, a10, doit_link+0	// doit code pointer
	lai	a1, a10, doit_link+8	// doit context pointer
	bb	%preturn,%return
	ladi	a10, sp, 16	// restore saved register
	ladi	sp, sp, 0	// restore stack pointer

Note that the LJMPI is a load instruction that checks the call prediction performed by the fetch engine when the BB descriptor at the start of count_doit is processed; it does not end the basic block.

Bounds Checking

Various checks are performed on all load and store instructions:

The tag of the base register AR[a] is checked for a valid pointer tag (range 0..239):
tag ← AR[a]_71..64
if tag ≥ 240 then exception
The offset from the base address is calculated. This is typically done in one of three ways:
- offset ← (XR[b]<<sa)
- offset ← imm
- offset ← 0
Unsigned overflow during the shift raises an exception.
The offset is compared to the size field of AR[a] and an exception is raised if if the offset exceeds the size:
trap if offset ≥ 0³∥AR[a]_133..72
The effective address is computed: ea ← AR[a]_63..0 + offset
Unsigned overflow during the addition raises an exception.
The segment size cache is accessed for AR[a]_63..48 to determine the segment size, ssize and this is checked:
if AR[a]_63..ssize ≠ ea_63..ssize then exception
The effective ring for the load or store is calculated:
earing ← tag ≥ 224 | tag_2..0 ≥ PC.ring ? PC.ring : tag_2..0
The data translation cache(s) are used to translate the lvaddr ea to a siaddr and Read, Write, Execute permissions and ring brackets. These are checked as described in the virtual memory section using the access type and earing.

Memory Model

SecureRISC, as originally conceived, was simply going to specify its memory model as Release Consistency), but after encountering RISC‑V, it seemed wise to look at what had been done there for memory model specification, so this is on hold. This section will be expanded when the memory model is defined.

Instruction Set

The following overview is meant to give a general framework to help the reader appreciate the details presented subsequently.

The SecureRISC Instruction Set is designed around six register files, two intended for use early in the pipeline, and four later in the pipeline. While some implementations may not have an early/late distinction, they are described this way here to indicate the possibility of such a split.

Name(s)

Description

Comment

Early Pipeline

These instructions have at most three register operands, at most two sources and one destination except stores, which have up to three register sources, but never more than two sources from a given register file.
Operations are grouped into classes represented by schemas for conciseness in instruction exposition:

Early Pipeline Class	Operation
Address comparison	ac
Index arithmetic	xa
Index bitwise logical	xl
Index comparison	xc

AR/AV

Address Registers

Used as base address for load and store instructions, where the effective address calculation is either
AR[a] + (XR[a]<<sa)
where sa is 0..3, or
AR[a] + immediate.
The single-bit AVs are valid bits for speculative load validity propagation.

XR/XV

Index Registers

Used as for integer calculations related to addressing. Often the general non-memory format is:
XR[d] ← XR[a] xa XR[b]
where xa is a fairly simple operation (e.g., + or <<_u).
The single-bit XVs are valid bits for speculative load validity propagation.

Late Pipeline

These instructions have up to three sources and one destination. SecureRISC makes use of the three source operands more than other ISAs. Often the general format is:
RF[d] ← RF[c] accop (RF[a] op RF[b])
where accop is an accumulation operation (e.g., + or −) and op is more general operation (e.g., ×).
Operations are grouped into classes represented by schemas for conciseness in instruction exposition, and most classes have an associated accumulation operation schema:

Late Pipeline Class	Operation	Accumulation
Boolean operation	bo	ba
Integer comparison	ic	ba
Integer arithmetic	io	ia
Bitwise logical	lo	la
Floating-point arithmetic	fo	fa
Floating-point comparison	fc	ba

BR/BV

Boolean Registers

Use for comparisons, selection, and branches on scalar registers.

SR/SV

Scalar Registers

Used for both integer and floating-point scalar calculations. Not associated with address calculation.

Vector Registers

Used for both integer and floating-point vector and matrix calculations. VRs have no associated valid bits and are typically not renamed.

Mask Registers

Used for both integer and floating-point vector masking. VMs have no associated valid bits and are typically not renamed.

Matrix Accumulators

Used for to accumulate the outer product of two vectors.

Processor State

Access to the state of more privileged rings is prohibited. For example, attempting to read or write CSP[ring] when ring > PC.ring causes an exception. Unprivileged state (e.g., the register files) may be accessed by any ring.

In the table below, the Type field values CSR, RCSR (per-ring CSRs), and ICSR (indirect CSRs) are described in Control and Status Register Operations. The type R is used for a simple register, RF for a Register File, VRF for a Vector Register File, and MRF for a Matrix Accumulator.

SecureRISC processor state includes:

Name	Type	Depth	Width	Read ports	Write ports	Description
PC	R	1	3 + 62 + 5			The Program Counter holds the current ring number (3 bits), Basic Block descriptor address (62 bits), and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions and interrupts. Until compressed BBDs (tag 253) are defined, only be 61 bits are required.
CSP	RCSR	8	3 + 61			The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks, and by exceptions and interrupts. This is not the same as the Program Stack Pointer, which is held in an AR designated by the Software ABI. There is one CSP per ring. CSP[PC.ring] generally increments and decrements by 8, but exceptions increment and decrement by a TBD value (probably 32). Some implementations are expected to provide a dedicated cache for CSP values, e.g., containing 8–16 entries of 8 words per entry. In addition, a separate fetch prediction structure may attempt to speculatively predict this cache. The tag on values written to CSP[ring] must be in the range 192..192+PC.ring (unsized pointers with ring number); otherwise an access fault exception is taken.
CSPRE	RECSR	1	3			An exception occurs on access to CSP[ring] when PC.ring < CSPRE, in addition to the other access checks.
ThreadData	RCSR	8	72			Thread Data is a per-ring storage location for a pointer to Thread-Local Storage (TLS). Functions that require access to per-thread data typically move this to an AR. It is also typically used in cross-ring exception handlers to save and restore the registers that ring requires to handle exceptions.
ThreadDataRE	RECSR	1	3			An exception occurs on access to ThreadData[ring] when PC.ring < ThreadDataRE, in addition to the other access checks.
ExcHandler	RCSR	8	3 + 61			ExcHandler[ring] holds the ring number and address to which the processor redirects execution on an exception for that ring. The tag on values written to CSP[ring] must be in the range 208..208+PC.ring (BB descriptor pointers with ring number); otherwise an access fault exception is taken.
ExcHandlerRE	RECSR	1	3			An exception occurs on access to ExcHandler[ring] when PC.ring < ExcHandlerRE, in addition to the other access checks.
OptionEnable	RCSR	8	16			OptionEnable[ring] holds enable bits for various groupings of instructions, CSRs, etc. that SecureRISC defines to either trap if disabled or operate as specified. The set of enables is currently not defined, but Vector instructions will be given a bit here. This also allows future extensions to be detected (their enable bits will be read-only zero if not implemented) and disabled if software does not support the extension. The mechanism for privileged rings preventing less privileged rings from enabling options is TBD, but might be an AllowOption[ring] RCSR.
OptionEnableRE	RECSR	1	3			An exception occurs on access to OptionEnable[ring] when PC.ring < OptionEnableRE, in addition to the other access checks.
InstCount	RCSR	8	64			InstCount[ring] holds the count of executed instructions in each ring.
InstCountRE	RECSR	1	3			An exception occurs on access to InstCount[ring] when PC.ring < InstCountRE, in addition to the other access checks.
BBCount	RCSR	8	64			BBCount[ring] holds the count of executed Basic Blocks in each ring.
BBCountRE	RECSR	1	3			An exception occurs on access to BBCount[ring] when PC.ring < BBCountRE, in addition to the other access checks.
ICountIntr	RCSR	8	64			The ICount bit in IntrPending[PC.ring] is set when (InstCount[PC.ring] − ICountIntr[PC.ring]) > 0. This may be used for single stepping.
ICountIntrRE	RECSR	1	3			An exception occurs on access to ICountIntr[ring] when PC.ring < ICountIntrRE, in addition to the other access checks.
CycleCount	RCSR	8	64			CycleCount[ring] holds the number of cycles executed by ring.
CCountIntr	RCSR	8	64			The CCount bit in IntrPending[PC.ring] is set when (CycleCount[PC.ring] − CCountIntr[PC.ring]) > 0.
CCountIntrRE	RECSR	1	3			An exception occurs on access to CCountIntr[ring] when PC.ring < CCountIntrRE, in addition to the other access checks.
IntrEnable	RCSR	8	64			IntrEnable[ring] holds interrupt enable bits for each ring. Interrupts for each ring are distinct. Application rings are expected to use the interrupts for inter-process communication. Supervisor and hypervisor rings will also use interrupts for communication with I/O devices.
IntrEnableRE	RECSR	1	3			An exception occurs on access to IntrEnable[ring] when PC.ring < IntrEnableRE, in addition to the other access checks.
IntrPending	RCSR	8	64			IntrPending[ring] holds interrupt pending bits for each ring.
IntrPendingRE	RECSR	1	3			An exception occurs on access to IntrPending[ring] when PC.ring < IntrPendingRE, in addition to the other access checks.
AccessRights	RCSR	8	16			AccessRights[ring] holds the current Mandatory Access Control Set per ring. It is writeable only by ring 7. These rights are tested against the MAC level of svaddr regions specified in the Region Protection Table and potentially by the System Interconnect.
AccessRightsRE	RECSR	1	3			An exception occurs on access to AccessRights[ring] when PC.ring < AccessRightsRE, in addition to the other access checks.
AccessLevels	XCSR	1	64			This CSR is writeable only by ring 7. It contains four 16‑bit masks that divide AccessRights into 0–4 orthogonal Bell-LaPadula levels. Typically these four masks are non-overlapping between themselves and AccessCats. Setting this CSR to 0 in effect disables Bell-LaPadula level checking. Read and write access to data with Data Access Set DAS denial is computed as follows (where ering is the effective ring access level): PAS ← AccessRights[ering] set0 ← AccessLevels_15..0 set1 ← AccessLevels_31..16 set2 ← AccessLevels_47..32 set3 ← AccessLevels_63..48 readdeny ← (DAS&set0)>_u(PAS&set0) \|(DAS&set1)>_u(PAS&set1) \|(DAS&set2)>_u(PAS&set2) \|(DAS&set3)>_u(PAS&set3) writedeny ← (DAS&set0)≠(PAS&set0) \|(DAS&set1)≠(PAS&set1) \|(DAS&set2)≠(PAS&set2) \|(DAS&set3)≠(PAS&set3)
AccessCats	XCSR	1	64			This CSR is writeable only by ring 7. It contains four 16‑bit masks that divide AccessRights into 0–4 orthogonal Bell-LaPadula category sets. Typically these four masks are non-overlapping between themselves and AccessLevels. Setting this CSR to 0 in effect disables Bell-LaPadula category checking. Read and write access to data with Data Access Set DAS denial is computed as follows (where ering is the effective ring access level): PAS ← AccessRights[ering] set0 ← AccessCats_15..0 set1 ← AccessCats_31..16 set2 ← AccessCats_47..32 set3 ← AccessCats_63..48 readdeny ← (DAS&set0)&~(PAS&set0)≠0 \|(DAS&set1)&~(PAS&set1)≠0 \|(DAS&set2)&~(PAS&set2)≠0 \|(DAS&set3)&~(PAS&set3)≠0 writedeny ← (DAS&set0)≠(PAS&set0) \|(DAS&set1)≠(PAS&set1) \|(DAS&set2)≠(PAS&set2) \|(DAS&set3)≠(PAS&set3)
ALLOWQOS	RCSR	8	6			ALLOWQOS[ring] holds the minimum value that may be written to QOS by a ring. Rings may not write values to QOS[ring] less than ALLOWQOS[PC.ring]. Only ring 7 may write ALLOWQOS[ring].
QOS	RCSR	8	6			QOS[ring] holds the current Quality of Service (QoS) identifier per ring. QoS identifiers are used on system interconnect transactions for prioritization. Rings may only set QOS to values allowed by ALLOWQOS[PC.ring]. Attempts to write smaller values trap.
QOSRE	RECSR	1	3			An exception occurs on access to QOS[ring] when PC.ring < QOSRE, in addition to the other access checks.
KEYSET	XCSR	1	16			This register is writeable only by ring 7, and specifies which encryption key indexes are currently usable. A reference to a disabled key in the ENC field of the Region Descriptor Table causes an exception. This allows ring 7 to partition the system based on which encryption keys are usable.
ENCRYPTION	ICSR	15	8 + 256			These registers are readable and writeable only by ring 7, and provide the 8‑bit encryption algorithm and 256‑bit encryption key for main memory encryption as specified in Region Descriptor as an index into this array. The encryption algorithm and key are selected by the ENC of the Region Descriptor Table, with 0 being hardwired to no-encryption. Up to 15 pairs may be specified, but some implementations may support a smaller number. This is further defined in Memory Encryption below.
AMATCH	ICSR	8	64 + 128			These registers are described in Virtual Address Restriction.
VMID	RCSR	8	32			VMID[ring] holds the per-ring Virtual Machine Identifier (VMID). This is used to annotate system interconnect reads and writes so that I/O devices can interpret lvaddrs used for DMA. An exception occurs on access to VMID[ring] when PC.ring < VMIDRE, in addition to the other access checks.
VMIDRE	RECSR	1	3			An exception occurs on access to VMID[ring] when PC.ring < VMIDRE, in addition to the other access checks.
ASTP	RCSR	8	64			ASTP[ring] holds the per-ring Address Space Table Pointer. This is typically set by the hypervisor on context switch to VMT[VMID[ring]].astp.
ASTPRE	RECSR	1	3			An exception occurs on access to ASTP[ring] when PC.ring < ASTPRE, in addition to the other access checks.
ASID	RCSR	8	32			ASID[ring] holds the per-ring Address Space Identifier (ASID). This is used to annotate system interconnect reads and writes so that I/O devices can interpret lvaddrs used for DMA.
ASIDRE	RECSR	1	3			An exception occurs on access to ASID[ring] when PC.ring < ASIDRE, in addition to the other access checks.
SDTP	RCSR	8	64			SDTP[ring] holds the per-ring Segment Descriptor Table Pointer. This is typically set by the supervisor on context switch to ASTP[ASID[ring]].
SDTPRE	RECSR	1	3			An exception occurs on access to SDTP[ring] when PC.ring < SDTPRE, in addition to the other access checks.
RPTP	XCSR	1	64			RPTP holds the Region Protection Table Pointer. This is typically set by hypervisors to VMT[VMID[ring]].rptp on context switch from one supervisor to another. An exception occurs on access to RPTP when PC.ring < VMIDRE, in addition to the other access checks.
AR	RF	16	144	2	1	The Address Register file holds pointers and integers to perform calculations related to control flow and to load and store address generation. No AR is hardwired to 0. In all cases, pointer and non-pointer, bits 63..0 are address or data and bits 71..674 are the tag. When non-pointers are loaded into an AR, with a word load or move (LA, LAI, MOVAX, or MOVAS), bits 71..0 are the value loaded and bits 143..72 are 0. When pointer tagged values are loaded or moved to an AR, bits 143..72 are set to decoded values to prepare the pointer to be used as a base address. In this case, bits 63..0 are the address, bits 71..64 are the original tag, bits 135..133 are the ring number, bits 132..72 are the size expanded from the tag, or as written by the WSIZE instruction, and bits 143..136 are used for the expected memory tag for cliqued pointers, or are the value 251 for other pointers. In some microarchitectures, operations on ARs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.) Most instructions that read ARs read only AR[a]. When two ARs are read, it is sometimes using the b field and sometimes the c field (AR stores read AR[c] and a few branches and SUBXAA read AR[b]). The b/c multiplexing can be done during instruction decode. The assembler designation for individual ARs is by the names a0, a1, …, a15.
AV	RF	16	1	1	1	The Address Register Valid file holds valid bits from speculative loads and propagation therefrom.
XR	RF	16	72	2	1	The Index Register file holds integers to perform calculations related to control flow and to load and store address generation. No XR is hardwired to 0. Bits 63..0 are data and bits 71..64 are the tag. The XR primarily holds integer-tagged data, but other tags may be loaded. In some microarchitectures, operations on XRs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction. Most instructions that read XRs read XR[a] and XR[b], but XR stores read XR[b] and XR[c]. The b/c multiplexing can be done during instruction decode. The assembler designation for individual XRs is by the names x0, x1, …, x15.
XV	RF	16	1	2	1	The Index Register Valid file holds valid bits from speculative loads and propagation therefrom.
SR	RF	16	72	3	1	The Scalar Register file holds data for computations not involved in address generation and primarily holds integer or floating-point values. Tags are stored, and so SRs may be used for copying arbitrary data, including pointers, but no instruction uses SRs as an address (e.g., base) register. Integer operations check for integer tags, and floating-point operations check for float tags. No SR is hardwired to 0. In some microarchitectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The most important instructions with three SR source operands are multiply/add (both integer and floating-point), and funnel shifts. The three SR read ports handle the a, b, and c register specifier fields, with writes specified by the d register field. SRs are late pipeline state. The assembler designation for individual SRs is by the names s0, s1, …, s15.
SV	RF	16	1	3	1	The Scalar Register Valid file holds valid bits from speculative loads and propagation therefrom. SVs are late pipeline state.
BR	RF	16	1	3	1	Boolean Registers hold 0/False or 1/True, such as the result of comparisons and logical operations on other Boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions. BRs are late pipeline state. The assembler designation for individual BRs is by the names b0, b1, …, b15.
BV	RF	16	1	3	1	The Boolean Register Valid file holds valid bit propagation from speculative loads (primarily SR loads). Branches with an invalid BR operand take an exception. BVs are late pipeline state.
CARRY	RF	1	64			The CARRY register is used on multiword arithmetic (addition, subtraction, multiplication, division, and carryless multiplication). See below. Consider expansion of CARRY to a 4-entry register file (c0 to c3). CARRY is late pipeline state.
VL	RF	16	8			The Vector Length registers specify the length of vector loads, stores, and operations. Vector length registers are paired with vector registers (VRs). VLs are late pipeline state.
VSTART	SCSR	1	7			The Vector Start register is used to restart vector operations after exceptions. Details to follow. VSTART is late pipeline state.
VM	RF	16	128	3	1	The Vector Mask register file stores a bit mask for elements of vector operations. VM[0] is hardwired to all 1s and is used for unmasked operations. Only VM[0] to VM[3] may be specified for masking vector operations in 32-bit instructions. VM[4] to VM[15] are available for vector comparison results and Boolean operations and in 48‑bit and 64‑bit formats. VMs are late pipeline state. The assembler designation for individual VMs is by the names vm0, vm1, …, vm15.
VR	VRF	16	72 × 128	3	1	Vector Registers hold vectors of tagged data, typically integers or floating-point data. (There are no speculative loads for the VRs and no associated valid bits. Vector operations with an invalid non-vector operand take an exception.) VRs are late pipeline state. The assembler designation for individual VMs is by the names v0, v1, …, v15.
MA	MRF	4	32 × MR×MT	1	1	Matrix accumulators hold matrixes of untagged data, typically integers or floating-point data and are used to accumulate the outer product of two vectors. (There are no speculative loads for the MAs and no associated valid bits. Matrix operations with an invalid non-vector operand take an exception.) MAs are late pipeline state. The assembler designation for individual MAs is by the names m0, m1, m2, and m3. The number of rows and columns of the matrix accumulators is implementation dependent.

The SR register file must support 3 read and 1 write port per instruction for floating-point multiply/add instructions at least. Since it does, other operations on SRs may take advantage of the third source operand.

Basic Block Descriptor Words

Basic Block Descriptor
71	64	63	58	57	47	46	38	37	34	33	29	28	13	12	11	10	9	0
252		hint		targr		targl		next		prev		start		s	c		offset
8		6		11		9		4		5		16		1	2		10

Fields of BB Descriptors
Field	Width	Bits	Description
offset	10	9:0	Instruction offset in bage for this BB
c	2	11:10	LOOPX present 0 ⇒ no LOOPX 1 ⇒ LOOPX present 2..3 ⇒ Reserved (possible use for nested loops)
s	1	12	Instruction size restriction: 0 ⇒ 16, 32, 48, and 64 bit instructions 1 ⇒ 32 and 64 bit instructions only
start	16	28:13	Instruction start mask (interpretation depends on s field)
prev	5	33:29	Mask of things targeting this BB for CFI checking
next	4	37:34	BB type / exit method
targl	9	46:38	Target BB offset in bage (virtual address bits 11:3)
targr	11	57:47	Target BB bage relative to this bage (±1024 4 KiB bages)
hint	6	63:58	Prediction hints specific to BB type
252	8	71:64	BB Descriptor Tag

Basic block descriptors are words with tags 252..253 aligned on a word boundary. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks. (Tag 253 is reserved for future use, most likely for compressed descriptors.)

The s and 16‑bit start fields specify both the size of the basic block and the location of all the instructions in it. If s is set, then all instructions are 32‑bit or 64‑bit; if clear then 16‑bit and 48‑bit instructions may also be present. For s = 0, each bit represents 16 bits at offset in the bage, and the BB size can be up to sixteen 16‑bit locations, which could contain eight 32‑bit instructions, sixteen 16‑bit instructions, or an intermediate number of a mixture of the two, or a lesser number if 48‑bit and 64‑bit instructions are included. For s = 1, each bit represents 32 bits and the BB size can be up to sixteen 32‑bit locations, which could contain sixteen 32‑bit instructions, eight 64‑bit instructions, or an intermediate number of a mixture of the two. If the block is larger than these limits, then it is continued using a fall-through next field. The 16‑bit start field gives a bit mask specifying which 2‑byte locations start instructions, which allows parallel instruction decoding to begin as soon as the instruction bytes are read from the instruction cache. For example, sixteen instruction decoders could be fed in a single cycle from a single 8‑word instruction cache block fetch, using the start mask to specify which bytes to decode. The start bit for the first 16 bits is implicitly 1 and is not stored. The last 1 bit in the start field represents the 2‑byte position after the last instruction. Thus, the number of instructions is the number of 1 bits in the start field (if 0 bits are set, then there are 0 instructions). If the last instruction ends before a 32‑bit boundary, the last 16 bits should be filled with an illegal instruction. The s = 1 case is intended for floating-point intensive basic blocks which tend to have few 16‑bit instructions and also tend to be longer.

To increase locality and keep pointers short, SecureRISC stores basic block descriptors and instructions into 4 KiB regions of the address space (called bages) with the basic block descriptors in the one half and the instructions in the other half (the compiler should alternate the half used for even and odd bages to minimize set conflicts). This allows the pointer from the descriptor to 32‑bit aligned instructions to be only 10 bits, and in a paged system, the same TLB entry maps both the descriptors and instructions (since bage size ≤ page size), so only the BB engine requires a TLB (its translations are simply forwarded to the instruction fetch engine). The instructions are fetched from
PC_63..12 ∥ offset ∥ 0²
in the L1 instruction cache in parallel with the BB engine moving to fetch the next BB descriptor. For non-indirect branches and calls, the target is given by an 11‑bit signed relative 4 KiB delta from the current bage and a 9‑bit unsigned 8‑byte aligned descriptor address within that bage. Specifically
TargetPC ← PC_66..64 ∥ (PC_63..12 +_p (targr₁₀⁴¹∥targr)) ∥ targl ∥ 0³.
(Note: the name targr is short for target relative and targl is short for target low.)
For indirect branches and calls, the targr/targl fields may be used as a hint or default.

Instructions are stored in the bage with tag 240, which may be helpful when code reads and writes instructions in memory. A future option may be to use tags 240..243 to provide two more bits for instruction encoding per word, or one bit per 32‑bit location. Using 16 tags would provide four more bits per word, or one bit per 16‑bit location.

The low targl field is sufficient to index a set-associative BB descriptor cache that uses bits 11..3 (or a subset) as a set index without waiting for the targr addition giving the high bits. As an example, a 32 KiB, 8‑way set associate BB descriptor could read the tags in parallel with completing the addition giving the high address bits for tag comparison. If the minimum page size can be increased, then the number of bits allocated to the targl and offset fields might be increased and the bits to targr decreased; the current values were chosen for a minimum page size of 4 KiB, which encourages a bage size of 4 KiB to match. When targr = 0, the TLB translation for the current BB remains valid, and energy can be saved by detecting this case.

For even bages, it is recommended that BB descriptors start at the beginning of a bage, and instructions start on a 64‑byte boundary in the bage. Any full word padding between the last BB descriptor and the first instruction would use an illegal tag. For odd bages, BB descriptors would be packed at the end starting on a 64‑byte boundary and the instructions start at the beginning. Intermixing BB descriptors and instructions is possible but is not ideal for prefetch or cache utilization.

A non-zero c field (assembler %loopx) indicates that the BB contains a LOOPX/LOOPXI instruction, and therefore the BB engine should initialize its iteration count to zero and should predict the count until the AR engine executes the LOOPX and sends the actual loop count value back. If no prediction is available, 2⁶⁴−1 may be used. Often the AR engine does so before the final iteration, and the loop is predicted precisely even if this default loop count prediction is used. The iteration count increments when the next field contains a loop back or conditional loop back, and these are predicted as taken based on the iteration count being unequal to the predicted or actual loop count.

The next field specifies how the next basic block after this one is selected. It is sufficient to enable branch prediction, jump prediction, return address prediction, loop back prediction, etc. to occur without seeing the instructions involved in the basic block. Its values are described in Basic Block Descriptor Types in the subsequent section.

The prev field is used for Control Flow Integrity (CFI) checking and to implement the gates for calls to more privileged rings. It too is described in Basic Block Descriptor Types in the subsequent section.

The hint field will be defined in the future for prediction hints specific to each next field value. For example, conditional branches will use the hint field with a taken/not-taken initial value for prediction, a hysteresis bit (strong/weak), and an encoded 4‑bit history length (8, 11, 15, …, 1024) indication of what global history is most likely to be useful in prediction. Similarly indirect jumps and calls may have hints appropriate to their prediction. More hint bits would be nice to have, for example to encode Whisper’s Boolean function^*.

Note: I expect to use tag 253 for packing multiple Basic Block Descriptors in a single word. However, the details of this would probably be driven by statistics gathered once a compiler is generating the unpacked descriptors. This is expected to be limited to the BBDs that are internal to functions (simply branching).

* Whisper could be accomodated by using a new next field value. If unpredictable branches are short range, this value could then use most of the 26 bits of hint, targr, and targl for Whisper’s 21 bits, and use only a few bits for a BBD offset. Alternatively, the extra information could be stored in the following word, at the cost of significant relocation.

Basic Block Descriptor Types

The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:

Value

Description

Unconditional branch (%ubranch): The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check.

Conditional branch (%cbranch): The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. There should be exactly one conditional branch instruction, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8.

Call (%rcall): The address PC + 8 is written to the word pointed to CSP[TargetPC_66..64], and CSP[TargetPC_66..64] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check.

Conditional Call (%crcall): The branch predictor is used to determine whether this call is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. There is no instruction for the call itself in the basic block, as this is not predicted. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to CSP[TargetPC_66..64], and CSP[TargetPC_66..64] is incremented by 8.

Loop back (%loop): The predicted loop iteration count is used to predict whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions. There should be no other branch or jump instructions in the basic block. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8.

Conditional Loop back (%cloop): The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. If the loop back is enabled by the branch, the predicted loop iteration count is used to determine whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions and exactly one conditional branch instruction, but no jump instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described below or is the fall-through BB descriptor at PC + 8.

Fall through (%fallthrough): This Basic Block is unconditionally followed by the BB at PC + 8.
The targr/targl/start fields are not required for fall-through, so instead they may be used for prefetch. The targr/targl fields would then specify the first of several lines to prefetch into the BB Descriptor Cache (BBDC). The three least-significant bits of the targl field are not needed to specify a block in the BBDC, and are instead a sub-type:

0	No prefetch suggested
1	PC-relative prefetch suggested of one or more lines in the BBDC starting at PC_66..64 ∥ (PC_63..12 +_p (targr₁₀⁴¹∥targr)) ∥ targl_8..3 ∥ 0⁶ The hint field would specify a bitmask of the lines to be prefetched subsequent to the designated block. For example, this could be used on entry to the function to indicate up to 6 hot blocks after the specified one to prefetch, which allows up to 56 basic block descriptors of the function to be prefetched).
2..7	Reserved

BBDC prefetch might be queued for cycles when the BBDC is not being accessed or the tags might be dual-ported to allow parallel tag checks.

Reserved.

Jump Indirect (%ijump): The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/SWITCHX/etc. instructions in the instructions of the basic block. There should be exactly one jump, which may be located anywhere in the basic block instructions, but no conditional branches.
The targr/targl may be used as a hint for the most likely destination when hint₀ is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.

Conditional Jump Indirect (%cijump): The branch predictor is used to determine whether this jump indirect is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. If the jump indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/SWITCHX/etc. instruction in the instructions of the basic block. There should be exactly one jump and exactly one conditional branch, which each may be located anywhere in the basic block instructions. In the case the jump is not taken the destination is fall-through BB descriptor at PC + 8.
This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JMPA/LJMP/LJMPI/SWITCHX uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB bage (if not then a table and PC ← lvload72(AR[a] + XR[b]) should be used).
The targr/targl may be used as a hint for the most likely destination when hint₀ is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.

Call Indirect (%icall): The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the LJMP/LJMPI instruction in the instructions of the basic block. There should be exactly one LJMP/LJMPI, which may be located anywhere in the basic block instructions, but no conditional branch instructions. The address PC + 8 is written to the word pointed to CSP[TargetPC_66..64], and CSP[TargetPC_66..64] is incremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint₀ is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.

Conditional Call Indirect (%cicall): The branch predictor is used to determine whether this call indirect is taken or not, and this prediction is checked by the branch decision is given by a branch instruction in the instructions of the basic block. If the call indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/etc. instruction in the instructions of the basic block. There should be exactly one jump, which may be located anywhere in the basic block instructions, and exactly one conditional branch. In the case the call is not taken the destination is fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to CSP[TargetPC_66..64], and CSP[TargetPC_66..64] is incremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint₀ is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.

Return (%return): The Call Stack cache is used to predict the return using CSP[PC_66..64] − 8 as the index and CSP[PC_66..64] is decremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint₀ is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
It may be desirable to encode Exception Return with this BB type. Using hint₁ might be used to distinguish this case.

Conditional return (%creturn): This is probably only useful in leaf functions without a stack frame, unless register windows are added.

Reserved. Potential use compiler-generated data along the lines of for Instruction Presending

for high-level prefetch. Alternatively, another tag could be used for this.

Reserved.

The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB for Control Flow Integrity (CFI) checking. It is a set of values bits, with the least significant bits of prev controlling interpretation of the more significant bits as follows:

prev₀ = 1
Bit	Description	Assembler
1	Fall through to this BB allowed	%pfallthrough
2	Branch/Loopback to this BB allowed	%pbranch
3	Jump to this BB allowed (for case dispatch)	%pswitch
4	Return to this BB allowed	%preturn

prev_1..0 = 2
Bit	Description
2	Call relative allowed	%prcall
3	Call indirect allowed	%picall
4	Gate allowed	%pgate

prev_2..0 = 4
Bits 4..3	Description
0	Cross-ring Exception Entry	%pxrexc
1	Same-ring Exception Entry	%psrexc
2	Reserved
3	Reserved

Call/Return Details

Basic Block descriptors with one of the four call types (Call, Conditional Call, Call Indirect, Conditional Call Indirect), push the return address on a protected stack addressed by the CSP indexed by the target ring number (which is the same as the current ring number unless a gate is addressed). Returns pop the address from the protected stack and jump to it. The ring number of the CSP pointer is used for the stores and loads, and typically this ring is not writeable by the current ring.
The call semantics are as follows:
lvstore72(CSP[TargetPC_66..64]) ← PC
CSP[TargetPC_66..64]) ← CSP[TargetPC_66..64]) +_p 8
The return semantics are as follows:
PC ← lvload72(CSP[TargetPC_66..64] −_p 8)
CSP[TargetPC_66..64]) ← CSP[TargetPC_66..64]) −_p 8

Debugging

Support for debuggers in SecureRISC has yet to be considered and thus TBD. Instruction count interrupts provide a single-step mechanism, and basic block descriptors may be patched with a 254 tag as a breakpoint mechanism, but some mechanism for debugging ROM code and for setting memory read and write breakpoints is also required. Note that amatch ICSRs could be used for read and write breakpoints, if changed to have finer resolution (e.g., start the NAPOT encoding at bit 7). This however might complicate debugging programs with Garbage Collection. Something similar to amatch could be defined on the fetch side for debugging ROM code, e.g., 256 bits per bage to indicate which trap, but probably something much simpler would suffice.

Overflow Checking

Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC provides a reasonably complete set of such instructions in addition to the usual mod 2⁶⁴ add, subtract, negate, multiply, and shift left.

Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDOU (Add Overflow trapped Unsigned).

In addition, the ADDOS (Add Overflow Trapped Signed), ADDOUS (Add Overflow trapped Unsigned Signed), SUBOS (Subtract Overflow trapped Signed), SUBOU (Subtract Overflow trapped Unsigned), SUBOUS (Subtract Overflow trapped Unsigned Signed with Overflow), and NEGO (Negate Overflow trapped) instructions provide overflow checking for signed addition, subtraction, and negation, and signed-unsigned addition and subtraction. There is also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.

Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 63..size. A special small cache is required for this purpose, but the data portion is only eight bits of the Segment Descriptor Entry (a 6‑bit segment size and a 2‑bit generation).

Comparisons

SecureRISC handles comparisons differently in the early and late pipeline instruction sets. Comparisons on ARs/XRs are done with conditional branch instructions. Comparisons on SRs are done with instructions that write one of the 15 boolean registers (BRs). The boolean registers may be branched on, but also used in selection and logical operations.

For the SRs, SecureRISC has comparisons that produce both true and complement values (e.g., = and ≠ or < and ≥) so that they can be used with b0 as assertions. If b1 were hardwired to 1 and writes of 0 trapped, SecureRISC could have half as many comparisons, but would have to add more accumulation functions and SELI would have to have an inverted form. This would also require more compiler support to track whether Booleans in BRs are inverted or not. For the moment, SecureRISC has more comparisons, but might change.

Floating-Point Comparisons

SecureRISC provides floating-point comparisons that store 0 or 1 to a BR. These comparisons do not trap on NaN operands. The compiler can generate an unordered comparison to b0 to trap before doing the equal, less than, etc. test if traps on NaNs are required.

Branch Avoidance

SecureRISC has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC‑V ISA would use conditional branches:

	move x6, x1
	blt x1, x3, L
	move x6, x3
L:

The performance of the above on contemporary microarchitectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC, the sequence could be as follows:

	lts b2, s1, s3
	sels s6, b2, s1, s3

This sequence involves no conditional branches and has consistent performance. (Note: there is actually a minss instruction that would be preferred here, but this illustrated a general point.)

As another example, the range test

	assert ((lo <= x) && (x <= hi));

on RISC‑V would compile to

	blt x, lo, T
	bge hi, x, L
T:
	jal assertionfailed
L:

but on SecureRISC would compile to

	lts b1, x, lo
	orles b0, b1, hi, x

which involves no conditional branches, but instead using writes to b0 as a negative assertion check (trap if the value to be written is 1). The assembler would also accept torles b1, hi, x as equivalent to the above orles by supplying the b0 destination operand.

Even when conditional branches are used, the Boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be

	lts b1, x, lo
	borles b1, hi, x, outofrange

which has one branch rather than two.

Tag Checking

Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer with all integer sources, or pointer if one operand is a pointer. Integer bitwise logical operands and shifts require integer-tagged operands and produce an integer-tagged result. Floating-point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. To perform integer operands on floating-point tagged values (e.g., to extract the exponent) requires a CAST instruction to first change the tag. Similarly, to perform logical operations on a pointer, a CAST instruction to integer type is required.

Comparisons of tagged values compare the entire word in its entirety for =, ≠, <_u, ≥_u etc. This allows sorting regardless of type. Similarly, the CMPU operation produces −1, 0, 1 based on <_u, =, >_u of word values.

Multiword Multiplication

The ideal integer multiplication operation would be
SR[e],SR[d] ← (SR[a] ×_u SR[b]) + SR[c] + SR[f]
to efficiently support multiword multiplication, but that requires 4 reads and 2 writes, which we clearly don’t want. The chosen alternative is to introduce a 64‑bit CARRY register to provide the additional 64‑bit input to the 128‑bit product and a place to store the high 64 bits of the product as follows:
p ← SR[c] + (SR[a] ×_u SR[b]) + CARRY
SR[d] ← p_63..0
CARRY ← p_127..64
The CARRY register is potentially awkward for OoO microarchitectures. The simplest option is to rename it to a small register file (e.g., 4 or 8‑entry) in the multiword arithmetic unit. It is also possible that even an OoO processor will be called on to have a subset of instructions that are to be executed in-order relative to each other, and the multiword arithmetic instructions can be put in this queue.

Multiword Division

The ideal integer division operation would be
SR[e],SR[d] ← SR[c]∥SR[a] ÷_u SR[b]
to efficiently support multiword division, but that requires 3 reads and 2 writes for quotient and remainder, which we clearly don’t want. As with multiplication, the alternative is to use the proposed 64‑bit CARRY register to provide the additional 64‑bit input to form the 128‑bit dividend and a place to store the remainder. The remainder of the previous division then naturally becomes the high bits of the current division. Thus the definition of DIVC is:
q,r ← (CARRY∥SR[a]_63..0) ÷_u SR[b]_63..0
CARRY ← r
SR[d] ← 240 ∥ q

Arithmetic for Polynomials over GF(2)

Addition of polynomials over GF(2) is just xor (addition without carries), and so the existing bitwise logical XORS instruction provides this functionality. Polynomial multiplication requires carryless multiplication instructions. Three forms are provided:
CARRY,SR[d] ← SR[a] ⊗ SR[b]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c] ⊕ CARRY
A modulo reduction instruction may not be required, as illustrated by the following example. In many applications, the field uses a polynomial such as 𝑥¹²⁸+𝑥⁷+𝑥²+𝑥+1 and in this case a 256→128 reduction can be implemented by further multiplication. First a series of carryless multiplication instructions are used to form the 255‑bit product p of two 128‑bit values. Bits 254..128 of this product have weight 𝑥¹²⁸, i.e., represent (p₂₅₄𝑥¹²⁶+…+p₁₂₉𝑥+p₁₂₈)𝑥¹²⁸. Because 𝑥¹²⁸ mod 𝑥¹²⁸+𝑥⁷+𝑥²+𝑥+1 is just 𝑥⁷+𝑥²+𝑥+1, multiplication of p_254..128 by this value results in a product q with a maximum term of 𝑥¹³³. q_127..0 is added to p_127..0 and q_133..128 of that product can then be multiplied again by 𝑥⁷+𝑥²+𝑥+1 resulting in a product with a maximum term of 𝑥¹², which can then be added to the low 128 bits of the original product (p_127..0). This generalizes to any modulo polynomial with no term after 𝑥¹²⁸ greater than 𝑥⁶³. If most modulo reductions are of this form, then no specialized support is required.

Multiword Addition

The ideal instructions for multiword addition and subtraction need additional single bit inputs and outputs for the carry-in and carry-out. The BRs would be natural for this purpose, but this would result in undesirable five-operand instructions, e.g., Add with Carry (ADDC) would be:
s ← SR[a] +_u SR[b] +_u BR[c]
SR[d] ← s_63..0
BR[e] ← s₆₄.
To avoid five operand instructions, SecureRISC instead defines the Add with Carry (ADDC) and Subtract with Carry (SUBC) instructions to use one bit in the 64‑bit CARRY register. ADDC is defined as:
s ← SR[a] +_u SR[b] +_u CARRY₀
SR[d] ← s_63..0
CARRY ← 0⁶³ ∥ s₆₄.
SUBC is defined as:
s ← SR[a] −_u SR[b] −_u CARRY₀
SR[d] ← s_63..0
CARRY ← 0⁶³ ∥ s₆₄.

Multiword Shifts

One advantage of the 3 read SR file is that shifts can be based upon a funnel shift where the value to be shifted is the catenation of SR[a] and SR[b], allowing for rotates by specifying the same operand for the high and low funnel operands, and multiword shifts by supplying adjacent source words of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left shifts supply 0 for the lo side of the funnel and use a negative shift amount. Logical right shifts supply 0 on the high side of the funnel and arithmetic right shifts supply a signed-extended version of SR[a] on the high side of the funnel. Need to decide whether overflow detecting left shifts are required.

The CARRY register could be use as funnel shift operand instead of an SR, but that seems less flexible.

Floating-Point Rounding Modes

SecureRISC has the floating-point rounding mode encoded in the instruction word to allow various uses of rounding mode where changing a CSR would be too costly to the use. For example, round to odd might be used in a sequence to do operations in higher precision and then round correctly to a lower precision. In such a case dynamic rounding mode changes are likely to make the sequence slower than necessary.

Floating-Point Flags

The floating-point flag mechanism for SecureRISC is TBD, but it is likely to be similar to other ISAs, where an unprivileged CSR has flag bits that are set by operations until cleared by software. Also, there is a need for a few other floating-point control items, such as the dynamic rounding mode of the previous section. There would be individual SCSRs that for various purposes, and one SCSR that combines all or most of them into a single place for efficient context switch.

Vector Register File

SecureRISC has not adopted flexible vector register file sizing, such as found in RISC‑V. This could change, but for now it is simpler. Instead there are 16 vector registers (VRs) that consist of 128 72‑bit words (9216 bits per register, 8192 of data, 1024 of tag). This size was chosen to target implementations with up to sixteen parallel execution units, which for a full-length vector would require eight iterations to perform the vector operation, giving the processor sufficient time to set up the next vector operation. Flexible sizing would allow smaller implementations of the vector unit, but 144 Kib (128 Kib of data, 32 Kib of tag) is acceptable area in modern process nodes. This also support a 128×128 outer product of two source vectors accumulating results in the MAs (Matrix Accumulators).

In addition, SecureRISC does not pack 8, 16, and 32 bit elements into the 64‑bit elements of vector registers. This reduces the need for cross datapath data movement on operations (but not loads and stores). For example, unlike RISC‑V, a widening operation writes the same datapath column as the source operands. However, it does reduce the performance of elements of these smaller sizes. This is one reason that SecureRISC vector registers are defined to have 128 elements. Small element packing benefits from having two 32‑bit and four 16‑bit units per 64‑bit unit, which increases datapath cost; without packing it is possible to have more twice the number of 64‑bit units, which matches the number of 32‑bit units that would be provided with small element packing. Even smaller element performance is addressed by the matrix extension.

Each vector register (VR), v0 to v15, has an associated vector length register, vl0 to vl15. In contrast, the RISC‑V Vector Extension has a single vector length register (VL). Most vector operations (e.g., vector multiply/add) trap if their operand vector lengths are not equal, and write the same length to the destination vector register length. Outer product instructions allow the vector lengths to be different. Software typically set VLs before doing load instructions, which use the associated length. This length then propagates through vector operations, and finally to vector store instructions. Vector compress instructions write a length equal to the popcount of the vector mask, and uncompress writes the maximum mask bit set plus one. Vector code is encouraged to set (VLs) to 0 when done, which may reduce context save time (context restore may also benefit if VRs are zeroed for security before loading only the portion specified by the associated VL).

Vector Boolean Registers

There are sixteen vector boolean registers, 128 bits each. These are used by vector comparison instructions, which set the number of bits specified by the comparison vector length based on the comparison, and the rest of the bits to zero.

Matrix Multiply

This section attempts to explain the reasoning for the SecureRISC approach to matrix multiplication, which is to use a large array of multiply/add units with local storage to accumulate outer products of vectors from the source matrixes to a submatrix of the product matrix. The following exposition presents definitions and theory related to straightforward^* matrix multiplication, then explores characteristics of various implementations, leading to the reasoning and explanation for this proposal’s approach to matrix computation.

* This exploration is for classic, or schoolbook, O(N³) matrix multiplication. Other algorithms, such as the Strassen algorithm, ≈O(N^2.8074) are not considered, partially due to stability issues. Even faster algorithms, ≈O(N^2.37), exist, but are even less applicable due to the requirement that N be extremely large to be faster.

Matrix Algebra

If 𝐴 is an $m \times k$ matrix and 𝐵 is an $k \times n$ matrix

A = (\begin{matrix} a_{1, 1} & a_{1, 2} & \dots & a_{1, k} \\ a_{2, 1} & a_{2, 2} & \dots & a_{2, k} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ a_{m, 1} & a_{m, 2} & \dots & a_{m, k} \end{matrix}), B = (\begin{matrix} b_{1, 1} & b_{1, 2} & \dots & b_{1, n} \\ b_{2, 1} & b_{2, 2} & \dots & b_{2, n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ b_{k, 1} & b_{k, 2} & \dots & b_{k, n} \end{matrix})

the matrix product $C = A B$ is defined to be the $m \times n$ matrix

C = (\begin{matrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, n} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ c_{m, 1} & c_{m, 2} & \dots & c_{m, n} \end{matrix})

such that

c_{i, j} = a_{i, 1} b_{1, j} + a_{i, 2} b_{2, j} + \dots + a_{i, k} b_{k, j} = \sum_{l = 1}^{k} a_{i, l} b_{l, j}

Equivalently, $c_{i, j}$ is the inner or dot product of row 𝑖 of 𝐴 and column 𝑗 of 𝐵. Unless parallel summation techniques are employed, the inner product is the sequential portion of the computation.

Straightforward matrix multiplication $C + A B$ is $m n k$ multiplications and $m n k$ additions with each matrix element $c_{i, j}$ being independent of the others but potentially sequential due the inner product additions. The $m n k$ multiplications are all independent (potentially done in parallel), but only $m n$ of the additions are parallel when floating-point rounding is preserved. With unbounded hardware, the execution time of matrix multiply with floating-point rounding is $k Δ$ where $Δ$ is the add latency. This is achieved by using $m n$ multiply/add units 𝑘 times every $Δ$ cycles, but a smarter implementation would use $m n / Δ$ units pipelined to produce a value every cycle, thereby adding only $Δ - 1$ additional cycles for the complete result.

The parallel portion of the computation is called the outer product. If 𝒖 is a 𝑚‑element vector and 𝒗 is a 𝑛‑element vector,

𝒖 = [\begin{matrix} u_{1} \\ u_{2} \\ ⋮ \\ u_{m} \end{matrix}], 𝒗 = [\begin{matrix} v_{1} \\ v_{2} \\ ⋮ \\ v_{n} \end{matrix}]

then the outer product is defined to be the $m \times n$ matrix

C = (\begin{matrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, n} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ c_{m, 1} & c_{m, 2} & \dots & c_{m, n} \end{matrix}) = 𝒖 \otimes 𝒗 = (\begin{matrix} u_{1}, v_{1} & u_{1}, v_{2} & \dots & u_{1}, v_{n} \\ u_{2}, v_{1} & u_{2}, v_{2} & \dots & u_{2}, v_{n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ u_{m} v_{1} & u_{m} v_{2} & \dots & u_{m} v_{n} \end{matrix})

i.e.

c_{i, j} = u_{i} v_{j}

The outer product is a fully parallel computation of $m n$ multiplications.

Using this formulation, the matrix product can be expressed as the sum of 𝑘 outer products of the columns of 𝐴 with the rows of 𝐵:

C = A_{*, 1} \otimes B_{1, *} + A_{*, 2} \otimes B_{2, *} + \dots + A_{*, n} \otimes B_{n, *} = \sum_{l = 1}^{k} A_{*, l} \otimes B_{l, *}

where $A_{*, l}$ is column 𝑙 of 𝐴 and $B_{l, *}$ is row 𝑙 of 𝐵 and $\otimes$ is the outer product operator. (Note that elsewhere in this document $\otimes$ denotes carryless multiply, but in a vector context, it is use for the outer product.)

The matrix multiplication and outer product definitions (i.e., the equations above) work both when the elements are scalars or are themselves matrixes. For example, $c_{i, j}$ could be a $s \times t$ matrix that is sum of the products of a row of $s \times r$ matrixes from 𝐴 with a column of $r \times t$ matrixes from 𝐵. Such submatrixes are called tiles.

Matrix Tiling

It is possible to code the above in many ways, but the most common is simply:

    for i ← 0 to m-1
      for j ← 0 to n-1
	for l ← 0 to k-1
	  c[i,j] ← c[i,j] + a[i,l] * b[l,j]

Unfortunately, this an inefficient use of memory bandwidth when the matrixes are large. However, by recoding the matrix multiplication, efficiency can be restored. As noted above, the same equations apply when the elements are scalar values, or are themselves matrixes. When $c_{i, j}$ , $a_{i, j}$ , $b_{i, j}$ , are matrixes rather than scalars, they are called tiles. When a matrix multiplication is performed by tiles, typically the elements of 𝐶 are loaded into local storage, and all of the operations targeting that tile are performed, and then that local storage is written back to 𝐶. In this scenario, each element of 𝐶 is read once and written once. Matrix multiplication of elements (or matrixes as tiles) is illustrated in the following figure showing multiplying elements (or tiles) from a column of of 𝐴 with elements (or tiles) from a row of 𝐵, accumulating to an element (tile) of the product 𝐶:
Tile of C = Column of A times Row of B

Larger tiles reduce references to memory and increase parallelism opportunities, but require more local storage.

It was noted above that each element of 𝐶 is read once and written once. In contrast, the elements of 𝐴 are read $n / t$ times and the elements of 𝐵 are read $m / s$ times.

Note also that software often transposes 𝐴 or 𝐵 prior to performing the matrix multiply, to avoided strided memory accesses. The matrix transposed depends on whether row or column-major order is used and the access pattern of the algorithm employed. The appropriate transpose is not reflected in the material below, and is left as an exercise to the reader.

The description above is represented by the tiling loops shown below, with loops for the tile multiplication replaced by a optimized kernel:

    for ti ← 0 to m-1 step s		// tile i
      for tj ← 0 to n-1 step t		// tile j
	for tl ← 0 to k-1 step r	// tile l
	  matmul(c[ti..ti+s-1,tj..tj+t-1],
		 a[ti..ti+s-1,tl..tl+r-1],
		 b[tl..tl+r-1,tj..tj+t-1])

Since c[ti..ti+s-1,tj..tj+t-1] is inner loop invariant, given the above loop order, it can be allocated to local storage (represented by acc below):

    for ti ← 0 to m-1 step s		// tile i
      for tj ← 0 to n-1 step t		// tile j
	acc[0..s-1,0..t-1] ← c[ti..ti+s-1,tj..tj+t-1]
	for tl ← 0 to k-1 step r	// tile l
	  matmul(acc[0..s-1,0..t-1],
		 a[ti..ti+s-1,tl..tl+r-1],
		 b[tl..tl+r-1,tj..tj+t-1])
	c[ti..ti+s-1,tj..tj+t-1] ← acc[0..s-1,0..t-1]

Other Loop Orderings

Other orderings of the i,j,l loops are possible, but are generally inferior to keeping 𝐶 tiles in registers during the inner loop, as summarized in the following table. References to 𝐶 have to be both loaded and stored, represented below by the factor of 2, and in addition the factor of $w$ represents the ratio of the bits in 𝐴 and 𝐵 elements to the number of bits in elements of 𝐶, i.e., the widening ratio. $w = 1$ represents no widening. Typically, FP32 and FP64 do not require widening (though widening FP32 to FP64 is seen occasionally). Smaller data formats, such as FP16, BF16, FP8, FP4, and int8 are usually accumulated in a wider format such as FP32 or int32, so for FP8 products with FP32 accumulation, $w = 4$ , and for FP16 with FP32 accumulation, $w = 2$ .

Elements loaded/stored from memory for each matrix multiply tile loop orderings
Order	Inner loop invariant				Inner 1					Inner 2
i,j,l	c[i,j]	$2 w s t$	$m n / s t$	$2 w m n$	a[i,l]	$s r$	$m k / s r$	$n / t$	$m n k / t$	b[l,j]	$r t$	$n k / r t$	$m / s$	$m n k / s$
i,l,j	a[i,l]	$s r$	$m k / s r$	$m k$	c[i,j]	$2 w s t$	$m n / s t$	$k / r$	$2 w m n k / r$	b[l,j]	$r t$	$n k / r t$	$m / s$	$m n k / s$
j,i,l	c[i,j]	$2 w s t$	$m n / s t$	$2 w m n$	a[i,l]	$s r$	$m k / s r$	$n / t$	$m n k / t$	b[l,j]	$r t$	$n k / r t$	$m / s$	$m n k / s$
j,l,i	b[l,j]	$r t$	$n k / r t$	$n k$	c[i,j]	$2 w s t$	$m n / s t$	$k / r$	$2 w m n k / r$	a[i,l]	$s r$	$m k / s r$	$n / t$	$m n k / t$
l,i,j	a[i,l]	$s r$	$m k / s r$	$m k$	c[i,j]	$2 w s t$	$m n / s t$	$k / r$	$2 w m n k / r$	b[l,j]	$r t$	$n k / r t$	$m / s$	$m n k / s$
l,j,i	b[l,j]	$r t$	$n k / r t$	$n k$	c[i,j]	$2 w s t$	$m n / s t$	$k / r$	$2 w m n k / r$	a[i,l]	$s r$	$m k / s r$	$n / t$	$m n k / t$

Summary of the above
Order	Total	$w = 4$ $s = t = r = q$	$m = n = k = d$
i,j,l	$2 w m n + m n k / t + m n k / s$	$m n (8 + k / q)$	$8 d^{2} + d^{3} / q$
i,l,j	$m k + 2 w m n k / r + m n k / s$	$m k (1 + 9 n / q)$	$d^{2} + 9 d^{3} / q$
j,i,l	$2 w m n + m n k / t + m n k / s$	$m n (8 + k / q)$	$8 d^{2} + d^{3} / q$
j,l,i	$n k + 2 w m n k / r + m n k / t$	$n k (1 + 9 m / q)$	$d^{2} + 9 d^{3} / q$
l,i,j	$m k + 2 w m n k / r + m n k / s$	$m k (1 + 9 n / q)$	$d^{2} + 9 d^{3} / q$
l,j,i	$n k + 2 w m n k / r + m n k / t$	$n k (1 + 9 m / q)$	$d^{2} + 9 d^{3} / q$

The above table makes it clear why the 𝐶 tile is typically kept stationary in the inner loop, as it must be both loaded and stored (a factor of two in references), and for some data formats, may be wider than the source matrixes, which is potentially another factor of 2, 4, or 8 again (the table provides a $w = 4$ widening example, with the combination being a factor of 8).

Matrix Multiplication Implementation

With 𝐶 held in registers—no loads or stores required—tile multiplication requires loads from 𝐴 and 𝐵. Above, it was observed that 𝑚𝑛 of matrix multiplications are parallel, and equivalently 𝑠𝑡 of a tile are parallel.

For practical implementation, hardware is bounded and should lay out in a regular fashion. The number of multiply/add units is usually much smaller than $m n$ , in which case there is flexibility in how these units are allocated to the calculations to be performed, but the allocation that minimizes data movement between the units and memory is to complete a tile of 𝐶 using the hardware array before moving on to a new tile (see preceding section).

To discuss implementation, it helps to consider the number of bits that can be loaded in a single cycle (the width of the cache feeding the vector unit). This is designated 𝐿 here. It is more useful in this exposition to define $V = L / S$ , where 𝑆 is the number of bits in a scalar element. 𝑉 is then the the number of elements that can be loaded in a single cycle. Loads of less than 𝑉 elements do not efficiently use the cache datapaths of the processor. Loads of 𝐿 ≥ 𝑉 elements take $⌈ L / V ⌉$ cycles and may be appropriate, but they are not necessarily more efficient than doing that many independent 𝑉‑element loads.

The other important implementation parameter is the add latency Δ. The multiply/add units are usually pipelined so that N units will accomplish $N Δ$ multiply/adds in $Δ + Δ - 1$ cycles. Turning this around, if there are W parallel multiply/adds to be accomplished, then it is appropriate to use $W / Δ$ multiply/add units to accomplish this computation, starting a new batch of 𝑊 every Δ cycles. Using less than $W / Δ$ units increases the time, and using more units leaves the units underutilized. In most cases $Δ \geq 1$ . However, carry-save arithmetic can result in $Δ < 1$ , creating possibilities to be discussed later.

For floating-point formats, the sums are typically done sequentially from 1 to 𝑘 to give the same rounding as the scalar implementation, which results in the latency $k Δ + Δ - 1$ when pipelined. The order of integer summation is not constrained, and is considerably faster, with $Δ \leq 1$ possible using carry-save arithmetic. While it is possible to use more parallel techniques for floating-point summation, this is often avoided due to rounding differences, and as a consequence the outer product is often the parallel portion of the computation, and the inner product is the sequential portion. For integer combining $a_{i, l} b_{l, j} + a_{i, l + 1} b_{l + 1, j} + \dots + a_{i, l + q - 1} b_{l + q - 1, j}$ into a single 𝑞‑element computation with carry-save arithmetic is possible.

For efficient loads, we load at least 𝑝𝑉 elements from 𝐴 and 𝑞𝑉 elements from 𝐵. These loads take $p + q$ cycles (the minimum $p = q = 1$ takes 2 cycles). The parallel computation is then $W = p q V^{2}$ multiply/adds, so using $p q V^{2} / (p + q)$ units results in balanced load and compute cycles. The units are fully utilized for $Δ \leq p + q$ . For example, for $Δ \leq 2$ the minimum $p = q = 1$ is balanced using fully utilized $V^{2} / 2$ multiply/add units. The multiply/add units are underutilized for $Δ > p + q$ . The underutilization can be solved by increasing 𝑝 and/or 𝑞 (thereby increasing the parallel computation). For even Δ, $p = q = Δ / 2$ is one possibility, in which case loading $\frac{Δ}{2} V$ elements from each of 𝐴 and 𝐵, taking Δ total cycles, making the parallel computation $W = {(Δ V)}^{2} / 4$ . The computation is performed using $W / Δ = \frac{Δ}{4} V^{2}$ multiply/add units for Δ cycles. For the typical floating-point add latency $Δ = 4$ this simplifies to $V^{2}$ units for 4 cycles.

Summarizing the above for the common cases, the analysis suggests that integer computation ( $Δ = 1$ ) use 𝑉‑element loads and $V^{2} / 2$ units with balanced 2‑cycle load and compute into a 𝑉 × 𝑉 tile. For floating-point computation ( $Δ = 4$ ), use $2 V$ ‑element loads and $V^{2}$ pipelined units with balanced 4‑cycle load and compute into a $2 V \times 2 V$ tile. Because SecureRISC supports both integer and floating-point, it appropriate to set the tile size based on meeting the performance requirements for each data type, and then using that tile storage for other data types. For example, if FP8 has the most demanding performance requirement, then the tile size would be at least $2 V \times 2 V$ and then int8 might also use that tile size, despite only requiring 𝑉 × 𝑉. The larger int8 tile size results in fewer references to the upper levels of the memory hierarchy, and therefore better energy efficiency.

Efficient parallel computation of matrix multiplication becomes more challenging due to the storage requirements for 𝐶 tiles as the data width is reduced. Above, $\frac{Δ}{2} V$ elements (typically $2 V$ ) for floating-point and 𝑉 elements for integer were derived as efficient load widths that balance load and compute and fully utilize multiply/add units. For these load widths the outer product is $4 V^{2}$ and $V^{2}$ computation respectively. Floating-point requires $4 V^{2} S = 4 L^{2} / S$ bits of storage for 𝑆 ≥ 32, which is proportional to the inverse of 𝑆. With widening to 32 bits for narrower floating-point data types (𝑆 < 32), the requirement is $4 V^{2} w S = 128 {(L / S)}^{2}$ , which is quadratic in the 32/𝑆. Thus FP8 requires 𝐶 tiles 16 times larger than FP32, and FP4 requires 4 times larger than FP8. Integer is generally 𝑆 < 32, and the storage requirement is $V^{2} w S = 32 {(L / S)}^{2}$ , Examples of these sizes are given in the table below.

C tile bits 𝑏 for efficient parallel computation for 𝐿 and 𝑆
𝑆 𝐿	Δ=1		Δ=4
𝑆 𝐿	int4	int8	FP4	FP8	FP16	FP32	FP64
128	32768	8192	131072	32768	8192	2048	1024
256	131072	32768	524288	131072	32768	8192	4096
512	524288	131072	2097152	524288	131072	32768	16384
1024	2097152	524288	8388608	2097152	524288	131072	65536

This calculation can be inverted to give the 𝐿 and MACs/cycle for various 𝐶 tile sizes. Here again we assume 32‑bit accumulation for 𝑆≤32 (if 16‑bit accumulation is possible for 𝑆≤8 then savings are possible). For bits 𝑏, the number of tile elements is $b / 32$ for 𝑆≤32 and $b / 64$ for FP64. The MACs/cycle is half of tile elements for Δ=1 integer and a quarter for Δ=4 floating-point. The square root of the tile elements is the outer product elements. This times 𝑆 is the 𝐿.

𝐿 for efficient parallel computation by C tile bits 𝑏
𝑆 𝐿	Δ=1		Δ=4
𝑆 𝐿	int4	int8	FP4	FP8	FP16	FP32
2048					64	128
8192	64	128		64	128	256
32768	128	256	64	128	256	512
131072	256	512	128	256	512	1024
524288	512	1024	256	512	1024	2048
2097152	1024	2048	512	1024	2048	4096
8388608	2048	4096	1024	2048	4096

MACs/cycle by C tile bits 𝑏
Bits	Δ=1 int4 / int8	Δ=4 FP4 ⋯ FP32
2048		16
8192	128	64
32768	512	256
131072	2048	1024
524288	8192	4096
2097152	32768	16384
8388608	131072	65536

FP64 𝐿 and MACs/cycle by C bits
Bits	Δ=4 FP64
Bits	𝐿	MACs/cycle
1024	128	4
4096	256	16
16384	512	64
65536	1024	256
262144	2048	1024

Matrix Multiply Using Vectors

The following series of transforms demonstrates how the simple, schoolbook matrix multiply written as three nested loops shown below is transformed for a vector ISA using outer products. (Note that the pseudo-code switches from 1‑origin indexing of Matrix Algebra to 0‑origin indexing of computer programming. Note also that, for clarity, the pseudo-code below does not attempt to handle the case of the matrix dimensions not being a multiple of the tile size.)

    for i ← 0 to m-1
      for j ← 0 to n-1
	for l ← 0 to k-1
	  c[i,j] ← c[i,j] + a[i,l] * b[l,j]

The scalar version above would typically then move c[i,j] references to a register to reduce the load/store to multiply/add ratio from 4:1 to 2:1.

    for i ← 0 to m-1
      for j ← 0 to n-1
	acc ← c[i,j]
	for l ← 0 to k-1
	  acc ← acc + a[i,l] * b[l,j]
	c[i,j] ← acc

However, in the vector version this step is delayed until after tiling. For vector, the above code is first tiled to become the following:

    // iterate over 8×TC tiles of C
    for ti ← 0 to m-1 step 8
      for tj ← 0 to n-1 step TC
	// add product of eight columns of a (a[ti..ti+7,0..k-1])
	// and eight rows of b (b[0..k-1,tj..tj+TC-1]) to product tile
	for i ← 0 to 7
	  for j ← 0 to TC-1
	    for l ← 0 to k-1
	      c[ti+i,tj+j] ← c[ti+i,tj+j] + a[ti+i,l] * b[l,tj+j]

The above code is then modified to use eight vector registers as an 8×TC tile accumulator (typically TC would be 128 for SecureRISC), and all i and j loops replaced by vector loads:

    for ti ← 0 to m-1 step 8		// tile i
      for tj ← 0 to n-1 step TC		// tile j
	// copy to accumulator
	v0 ← c[ti+0,tj..tj+TC-1]		// TC-element vector loads
	v1 ← c[ti+1,tj..tj+TC-1]
	v2 ← c[ti+2,tj..tj+TC-1]
	v3 ← c[ti+3,tj..tj+TC-1]
	v4 ← c[ti+4,tj..tj+TC-1]
	v5 ← c[ti+5,tj..tj+TC-1]
	v6 ← c[ti+6,tj..tj+TC-1]
	v7 ← c[ti+7,tj..tj+TC-1]
	// add product of a[ti..ti+7,0..k-1]
	// and b[0..k-1,tj..tj+TC-1] to tile
	for l ← 0 to k-1
	  va ← a[ti..ti+i+7,l]		// 8-element vector load
	  vb ← b[l,tj..tj+i+TC-1]		// TC-element vector load
	  v0 ← v0 + va[0] * vb		// vector * scalar
	  v1 ← v1 + va[1] * vb
	  v2 ← v2 + va[2] * vb
	  v3 ← v3 + va[3] * vb
	  v4 ← v4 + va[4] * vb
	  v5 ← v5 + va[5] * vb
	  v6 ← v6 + va[6] * vb
	  v7 ← v7 + va[7] * vb
	// copy accumulator back to tile
	c[ti+0,tj..tj+TC-1] ← v0	  // TC-element vector stores
	c[ti+1,tj..tj+TC-1] ← v1
	c[ti+2,tj..tj+TC-1] ← v2
	c[ti+3,tj..tj+TC-1] ← v3
	c[ti+4,tj..tj+TC-1] ← v4
	c[ti+5,tj..tj+TC-1] ← v5
	c[ti+6,tj..tj+TC-1] ← v6
	c[ti+7,tj..tj+TC-1] ← v7

One limitation of some vector instruction sets is the lack of a vector × scalar instruction where the scalar is an element of a vector register, which would add many scalar loads to the above loop. SecureRISC provides scalar operands from elements of vector registers.

Besides the obvious parallelism advantage, another improvement is that each element of the 𝐴 matrix is used TC times per load, and each element of the 𝐵 matrix is used eight times per load, which improves energy efficiency. However, one limitation of the vector implementation of matrix multiply is the limited number of multiply/add units that can be used in parallel. It is obvious that the above can use TC units in parallel (one for each element of the vectors). Slightly less obvious is that an implementation could employ $8 \times TC / Δ$ units to execute the above code, issuing groups of $8 / Δ$ vector instructions in a single cycle, and parceling these vector operations out to the various units to proceed in parallel. After $8 / Δ$ instructions, the next group can be issued to the pipelined units. Implementing this requires a factor of $8 / Δ$ increase in VRF read bandwidth. However a better solution is possible by providing more direct support for the outer product formulation. The goals are to obtain better energy efficiency on the computation by reducing the data movement in the above, particularly the VRF bandwidth, and to allow even more multiply/add units to be employed on matrix operations (the above limited to 8×TC tiles by the number of vector registers).

When $V > 8$ the load from 𝐴 uses only a portion of data from the memory hierarchy. It is possible to load more of 𝐴, so long as the adder latency $Δ \leq 8$ :

    for ti ← 0 to m-1 step 16       // tile i
      for tj ← 0 to n-1 step 128    // tile j
	// copy to accumulator
	v0 ← c[ti+0,tj..tj+127]     // 128-element vector loads
	v1 ← c[ti+1,tj..tj+127]
	v2 ← c[ti+2,tj..tj+127]
	v3 ← c[ti+3,tj..tj+127]
	v4 ← c[ti+4,tj..tj+127]
	v5 ← c[ti+5,tj..tj+127]
	v6 ← c[ti+6,tj..tj+127]
	v7 ← c[ti+7,tj..tj+127]
	// add product of a[ti..ti+7,0..k-1]
	// and b[0..k-1,tj..tj+7] to tile
	for l ← 0 to k-1
	  va ← a[ti..ti+i+15,l]   // 16-element vector load
	  vb ← b[l,tj..tj+i+127]  // 128-element vector load
	  v0 ← v0 + va[ 0] * vb   // vector * scalar
	  v1 ← v1 + va[ 1] * vb
	  v2 ← v2 + va[ 2] * vb
	  v3 ← v3 + va[ 3] * vb
	  v4 ← v4 + va[ 4] * vb
	  v5 ← v5 + va[ 5] * vb
	  v6 ← v6 + va[ 6] * vb
	  v7 ← v7 + va[ 7] * vb
	  v0 ← v0 + va[ 8] * vb
	  v1 ← v1 + va[ 9] * vb
	  v2 ← v2 + va[10] * vb
	  v3 ← v3 + va[11] * vb
	  v4 ← v4 + va[12] * vb
	  v5 ← v5 + va[13] * vb
	  v6 ← v6 + va[14] * vb
	  v7 ← v7 + va[15] * vb
	// copy accumulator back to tile
	c[ti+0,tj..tj+127] ← v0     // 128-element vector stores
	c[ti+1,tj..tj+127] ← v1
	c[ti+2,tj..tj+127] ← v2
	c[ti+3,tj..tj+127] ← v3
	c[ti+4,tj..tj+127] ← v4
	c[ti+5,tj..tj+127] ← v5
	c[ti+6,tj..tj+127] ← v6
	c[ti+7,tj..tj+127] ← v7

Matrix Multiply Using An Outer Product Array

It is desirable to match the number of multiply/add units to the load bandwidth when practical, as this results in a balanced set of resources (memory and computation are equally limiting). We use 𝑉 to represent the vector load bandwidth as the number of elements per cycle. Assuming that loads and computation are done in parallel, next we ask what is the tile size that balances results in equal time loading and computing. We have already seen that the multiply/adds in a matrix multiply is O(N³) but with O(N²) parallelism, so the time can be made as fast as O(N). However loading the data from memory is O(N²), so with sufficient hardware, data load time will be O(N) times the compute time. When load time grows quadratically with problem size while compute time grows linearly, a balanced system will scale up the compute hardware to match the load bandwidth available but not go any further. Of course, to achieve O(N) compute time requires O(N²) hardware, which is feasible for typical T×T matrix tiles, but usually not for the entire problem size N. Conversely, for balanced systems, when load bandwidth increases linearly, the computation array increases quadratically.

Since a vector load provides 𝑉 elements in a single cycle, it makes sense to find the tile size that matches this load bandwidth. This turns out to be a tile of $V \times V$ . This tile can be computed by 𝑉 outer products. Take one cycle to load 𝑉 elements from 𝐴 and one cycle to load 𝑉 elements from 𝐵. Processing these values in two cycles matches load bandwidth to computation. For $Δ \leq 2$ , a $V \times (V / 2)$ array of multiply/add units with $V^{2}$ accumulators (two per multiply/add unit) accomplishes this by taking the outer product of all of the $𝒖$ vector (from 𝐴) and the even elements of the $𝒗$ vector (from 𝐵) in the first cycle, and all of $𝒖$ with the odd elements of $𝒗$ in the second cycle. The full latency is $Δ + 1$ cycles, but with pipelining a new set of values can be started every two cycles. For $Δ > 2$ , using a $V \times (V / Δ)$ pipelined array for $Δ$ cycles is a natural implementation but does not balance load cycles to computation cycles. For example, for $Δ = 4$ , a $V \times (V / 4)$ array completes the outer product in 4 cycles, which is half of the load bandwidth limit. For $Δ = 4$ there are multiple ways to match the load bandwidth and adder latency. A good way would be to target a $2 V \times 2 V$ accumulation tile taking four load cycles and four computation cycles, but this requires $4 V^{2}$ accumulators, with four accumulators for each multiply/add unit. The method that minimizes hardware is to process two tiles of 𝐶 in parallel using pipelined multiply/add units by doing four cycles of loads followed by two 2‑cycle outer products to two sets of $V^{2}$ accumulators. For example, the loads might be 𝑉 elements from an even column of 𝐴, 𝑉 elements from an even row of 𝐵, 𝑉 elements from an odd column of 𝐴, and 𝑉 elements from an odd row of 𝐵. The computation would consist of two $V \times (V / 2)$ outer product accumulates, each into $V^{2}$ accumulators (total $2 V^{2}$ ). The total latency is seven cycles but the hardware is able to start a new outer product every four cycles by alternating the accumulators used, thereby matching the load bandwidth. If any of these array sizes is too large for the area budget, then it will be necessary to reduce performance, and no longer match the memory hierarchy. However, in 2024 process nodes (e.g., 3 nm), it would take fairly large 𝑉 to make the multiply/add unit array visible on a die.

A $V \times V$ multiply/add array with one accumulator per unit is illustrated below for $V = 4$ :
4×4 Outer Product Array with Local Accumulators

The above array is not suggested for use, as compute exceeds the load bandwidth. Instead one proposal developed above is a $V \times (V / 2)$ multiply/add array with two accumulators per unit for two cycle accumulation to $V \times V$ accumulators. This is illustrated below for $V = 4$ :
4×2 Outer Product Array with 4×4 Accumulators

A $V \times V$ multiply/add array with four accumulators per unit for $2 V \times 2 V$ accumulation is illustrated below for $V = 4$ . Such an array would be used four times over four cycles, each cycle sourcing from a different combination of 𝑉 elements from the $2 V$ elements loaded from 𝐴 and the $2 V$ elements loaded from 𝐵. This is one possibility explained above for supporting $Δ = 4$ or simply to improve performance energy efficiency for $Δ \leq 2$ .
4×4 Outer Product Array with 8×8 Accumulators

For the general case of a $m V \times n V$ tile, the load cycles are $m + n$ and the computation cycles using a $p V \times q V$ array are $m n / p q$ . Balancing these is not always possible.

The $V \times (V / 2)$ sequence for $Δ = 2$ is illustrated below, using superscripts to indicate cycle numbers, as in $C^{0} = 0$ to indicate accumulators being zero on cycle 0, $𝒖^{0}$ the value loaded on cycle 0, $𝒗^{1}$ the vector loaded on cycle 1, $C^{3}$ the result of the first half of the two-cycle latency outer product, $C^{4}$ the result of the second half of the outer product, etc.

C^{0} = (\begin{matrix} 0 & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & 0 \end{matrix}), 𝒖^{0} = [\begin{matrix} a_{1, 1} \\ a_{2, 1} \\ ⋮ \\ a_{m, 1} \end{matrix}], 𝒗^{1} = [\begin{matrix} b_{1, 1} \\ b_{1, 2} \\ ⋮ \\ b_{1, n} \end{matrix}], 𝒖^{2} = [\begin{matrix} a_{1, 2} \\ a_{2, 2} \\ ⋮ \\ a_{m, 2} \end{matrix}], 𝒗^{3} = [\begin{matrix} b_{2, 1} \\ b_{2, 2} \\ ⋮ \\ b_{2, n} \end{matrix}],

C^{3} = C^{0} + 𝒖^{0} \otimes 𝒖^{1} = (\begin{matrix} u_{1}^{0} v_{1}^{1} & 0 & \dots & u_{1}^{0} v_{n - 1}^{1} & 0 \\ u_{2}^{0} v_{1}^{1} & 0 & \dots & u_{2}^{0} v_{n - 1}^{1} & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ u_{m}^{0} v_{1}^{1} & 0 & \dots & u_{m}^{0} v_{n - 1}^{1} & 0 \end{matrix}) = (\begin{matrix} a_{1, 1} b_{1, 1} & 0 & \dots & a_{1, 1} b_{1, n - 1} & 0 \\ a_{2, 1} b_{1, 1} & 0 & \dots & a_{2, 1} b_{1, n - 1} & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ a_{m, 1} b_{1, 1} & 0 & \dots & a_{m, 1} b_{1, n - 1} & 0 \end{matrix}),

C^{4} = (\begin{matrix} u_{1}^{0} v_{1}^{1} & u_{1}^{0} v_{2}^{1} & \dots & u_{1}^{0} v_{n - 1}^{1} & u_{1}^{0} v_{n}^{1} \\ u_{2}^{0} v_{1}^{1} & u_{2}^{0} v_{2}^{1} & \dots & u_{2}^{0} v_{n - 1}^{1} & u_{2}^{0} v_{n}^{1} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ u_{m}^{0} v_{m}^{1} & u_{m}^{0} v_{m}^{1} & \dots & u_{m}^{0} v_{n - 1}^{1} & u_{m}^{0} v_{n}^{1} \end{matrix}) = (\begin{matrix} a_{1, 1} b_{1, 1} & a_{1, 1} b_{1, 2} & \dots & a_{1, 1} b_{1, n - 1} & a_{1, 1} b_{1, n} \\ a_{2, 1} b_{1, 1} & a_{2, 1} b_{1, 2} & \dots & a_{2, 1} b_{1, n - 1} & a_{2, 1} b_{1, n} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ a_{m, 1} b_{1, 1} & a_{m, 1} b_{1, 2} & \dots & a_{m, 1} b_{1, n - 1} & a_{m, 1} b_{1, n} \end{matrix}),

C^{5} = (\begin{matrix} c_{1, 1}^{3} + u_{1}^{2} v_{1}^{3} & c_{1, 2}^{4} & \dots & c_{1, n -1}^{3} + u_{1}^{2} v_{n - 1}^{3} & c_{1, n}^{4} \\ c_{2, 1}^{3} + u_{2}^{2} v_{1}^{3} & c_{2, 2}^{4} & \dots & c_{2, n -1}^{3} + u_{2}^{2} v_{n - 1}^{3} & c_{2, n}^{4} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ c_{m, 1}^{3} + u_{m}^{2} v_{1}^{3} & c_{m, 2}^{4} & \dots & c_{m, n -1}^{3} + u_{m}^{2} v_{n - 1}^{3} & c_{m, n}^{4} \end{matrix}),

C^{6} = (\begin{matrix} c_{1, 1}^{5} & c_{1, 2}^{4} + u_{1}^{2} v_{2}^{3} & \dots & c_{1, n -1}^{5} & c_{1, n}^{4} + u_{1}^{2} v_{n}^{3} \\ c_{2, 1}^{5} & c_{2, 2}^{4} + u_{2}^{2} v_{2}^{3} & \dots & c_{2, n -1}^{5} & c_{2, n}^{4} + u_{2}^{2} v_{n}^{3} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ c_{m, 1}^{5} & c_{m, 2}^{4} + u_{m}^{2} v_{2}^{3} & \dots & c_{m, n -1}^{5} & c_{m, n}^{4} + u_{m}^{2} v_{n}^{3} \end{matrix}), \dots

The following series of transforms demonstrates how the simple, classic matrix multiply written as three nested loops shown below is transformed to use tiles with an outer product multiply/add/accumulator array. For the tiling, usually TR=TC=V or TR=TC=2V, but there may be implementations that choose other vector lengths for microarchitectural reasons, and this should be supported.

    for i ← 0 to m-1
      for j ← 0 to n-1
	for l ← 0 to k-1
	  c[i,j] ← c[i,j] + a[i,l] * b[l,j]

The above code is then tiled to become the following:

    // iterate over TR×TC tiles of C
    for ti ← 0 to m-1 step TR
      for tj ← 0 to n-1 step TC
	// add product of a[ti..ti+TR-1,0..k-1]
	// and b[0..k-1,tj..tj+TC-1] to tile
	for i ← 0 to TR-1
	  for j ← 0 to TC-1
	    for l ← 0 to k-1
	      c[ti+i,tj+j] ← c[ti+i,tj+j] + a[ti+i,l] * b[l,tj+j]

The above code is modified to use an accumulator tile:

    for ti ← 0 to m-1 step TR
      for tj ← 0 to n-1 step TC
	// copy to accumulator
	for i ← 0 to TR-1
	  for j ← 0 to TC-1
	    acc[i,j] ← c[ti+i,tj+j]
	// add product of a[ti..ti+TR-1,0..k-1]
	// and b[0..k-1,tj..tj+TC-1] to tile
	for i ← 0 to TR-1
	  for j ← 0 to TC-1
	    for l ← 0 to k-1
	      acc[i,j] ← acc[i,j] + a[ti+i,l] * b[l,tj+j]
	// copy accumulator back to tile
	for i ← 0 to TR-1
	  for j ← 0 to TC-1
	    c[ti+i,tj+j] ← acc[i,j]

The above code is then vectorized by moving the l loop outside and the i and j loops into the outer product instruction:

    for ti ← 0 to m-1 step TR
      for tj ← 0 to n-1 step TC
	for i ← 0 to TR-1
	  acc[i,0..TC-1] ← c[ti+i,tj..tj+TC-1]	// TC-element vector load + acc write
	for l ← 0 to k-1
	  va ← a[ti..ti+i+TR-1,l]		// TR-element vector load col of A
	  vb ← b[l,tj..tj+i+TC-1]		// TC-element vector load row of B
	  acc ← acc + outerproduct(va, vb)	// 2-cycle outer product instruction
	for i ← 0 to TR-1
	  c[ti+i,tj..tj+TC-1] ← acc[i,0..TC-1]	// acc read + TC-element vector store

where the outerproduct(va, vb) operation invoked above is defined as follows:

	for i ← 0 to TR-1
	  for j ← 0 to TC-1
	    product[i,j] ← va[i] * vb[j]
	return product

In the Matrix Algebra section it was observed that cycle count for matrix multiplication with the smarter variant of unbounded multiply/add units (i.e., $N^{2} / Δ$ units) pipelined to produce a value every cycle takes $N \times Δ + Δ - 1$ cycles. It is worth answering how the above method fares relative to this standard applied to a single tile. Because we cut the number of multiply/add units in half to match the load bandwidth, we expect at least twice the cycle count, and this expectation is met: matching a memory system that delivers 𝑉 elements per cycle, a tile of $V \times V$ processed by an array of $V \times (V / 2)$ multiply/add units ( $Δ \leq 2$ ) produces the tile in $2 V + 1$ cycles. It may help to work an example. For a memory system delivering one 512‑bit cache block per cycle and 16‑bit data (e.g., BF16), $V = 32$ , and the 32×32 tile is produced using 2 vector loads and one 2‑cycle outer product instruction iterated 32 times taking 64 cycles yielding 512 multiply/adds per cycle. However, this does not include the time to load the accumulators before and transfer them back to 𝐶 after. When this 64 cycle tile computation is part of a 1024×1024 matrix multiply, this tile loop will be called 32 times for each tile of 𝐶. If it takes 64 cycles to load the accumulators from memory and 64 cycles to store back to memory, then this is 64+32×64+64=2176 total cycles. There are a total of 1024 output tiles, so the matrix multiply is 2228224 cycles (not counting cache misses) for 1024³ multiply/adds, which works out to 481.88 multiply/adds per cycle, or 94% of peak.

Note that there is no point in loading entire tiles, as this would not benefit performance. Rows and columns are loaded and consumed, and not used again. Storing whole tiles of the 𝐴 and 𝐵 matrixes would only be useful in situation when such a tile is used repeatedly, which does not occur in a larger matrix multiply. This does occur for the accumulation tile of the 𝐶 matrix, which does make that worth storing locally. The question is where it should be stored.

It is worth noting that the l loop above can be unrolled to create more scheduling opportunities as shown below:

    for ti ← 0 to m-1 step TR
      for tj ← 0 to n-1 step TC
	for i ← 0 to TR-1
	  acc[i,0..TC-1] ← c[ti+i,tj..tj+TC-1]	// TC-element vector load + acc write
	for l ← 0 to k-1 step 4
	  acc ← acc + outerproduct(a[ti..ti+i+TR-1,l+0], b[l+0,tj..tj+i+TC-1])
	  acc ← acc + outerproduct(a[ti..ti+i+TR-1,l+1], b[l+1,tj..tj+i+TC-1])
	  acc ← acc + outerproduct(a[ti..ti+i+TR-1,l+2], b[l+2,tj..tj+i+TC-1])
	  acc ← acc + outerproduct(a[ti..ti+i+TR-1,l+3], b[l+3,tj..tj+i+TC-1])
	for i ← 0 to TR-1
	  c[ti+i,tj..tj+TC-1] ← acc[i,0..TC-1]	// acc read + TC-element vector store

The transpose and packing phase would then arrange the above non-contiguous vectors to be packed contiguously so that 4×TR and 4×TC element loads would be used. It is also apparent that either the outerproduct operation should be at least Δ cycles so that the acc dependency does not stall. It would also be possible to implement an inner product operation of reduced latency for the above (e.g., using carry-save arithmetic).

Matrix Accumulators

The bandwidth of reads and writes to outer product accumulators far exceeds what a Vector Register File (VRF) generally targets, which suggests that that these structures be kept separate. Also the number of bits in the accumulators is potentially large relative to VRF sizes. Increasing the bandwidth and potentially the size of the VRF to meet the needs of outer product accumulation is not a good solution. Rather the accumulator bits should located in the multiply/add array, and be transferred to memory when a tile is completed. This transfer might be one row at a time through the VRF, since the VRF has the necessary store operations and datapaths to the cache hierarchy. The appropriateness of separate accumulator storage may be illustrated by examples. A typical vector load width might be the cache block size of 512 bits. This represents 64 8‑bit elements. If the products of these 8‑bit elements is accumulated in 16 bits (e.g., int16 for int8 or fp16 for fp8), then for $Δ \leq 2$ , 16×64² = 65536 bits of accumulator are required. The entire SecureRISC VRF is only twice as many bits, and these bits require more area than accumulator bits, as the VRF must support at least 4 read ports and 2 write ports for parallel execution of vector multiply/acc and a vector load or vector store. In contrast, accumulator storage within the multiply/add array is local, small, and due to locality consumes negligible power. As another example, consider the same 512 bits as sixteen IEEE 754 binary32 elements with $Δ = 4$ . The method for this latency suggests a 16×8 array of binary32 multiply/add units with 2048 32‑bit accumulators, which is again a total of 65536 bits of accumulator storage, but now embedded in much larger multiply/add units.

The number of bits require for accumulation needs to be determined (the example above is not meant to be anything other than an example). Recently the TF32 format appears to be gaining popularity for AI applications, and so accumulation in TF32 for BF16 inputs is one option. However, this needs further investigation.

SecureRISC has instructions that produce the outer product of two vectors and add this to one of four matrix accumulators. The matrix accumulators are expected to be stored within the logic producing the outer product, and so are distinct from the vector register file. The outer product hardware allows a large number of multiply/accumulate units to accelerate matrix multiply more efficiently than using vector operations.

The purpose of providing 4 accumulators per multiply/add unit is to allow the accumulators to be loaded and stored by software while outer products are being accumulated to other registers and to allow multiple tiles to be pipelined.

Control and Status Register Operations

SecureRISC has a large number of registers that affect instruction execution. These registers, called CSRs, are accessed by special instructions that support reading, writing, swapping, setting bits, and clearing bits. Many ISAs have such instructions; the unusual aspect of SecureRISC is first that CSRs are split into early (XCSRs) and late (SCSRs), per-ring registers (RCSRs), ring-enable registers (RECSRs), and indirect CSRs (ICSRs).

RCSRs can be accessed in two ways: first, via the CSR number in the immediate field and ring from an XR; and second via an encoding that refers to the register for the current ring of execution (PC.ring). In addition, RCSRs also have an associated enable CSR, with a 3‑bit ring number specifying which rings may access the register (if it proves useful, an 8‑bit mask could be used). The access test for the x[ring] RCSR is ring ≤ PC.ring & ring ≤ xRE. Only ring 7 may access xRE RECSRs. RECSRs may be read and written individually, or in groups of sixteen, packed into 4‑bit fields of a 64‑bit read or write, which facilitates context switch.

ICSRs are accessed with a CSR base immediate, CSR index from an XR and an offset for the word of data at that index. For example, the ENCRYPTION ICSRs have five 64‑bit values for a given index (an 8‑bit algorithm and 256 bits of encryption key). Similarly the amatch ICSRs have five 64‑bit values for a given index (the address to match, 128 bits of region access permission, and 128 bits of region write permission).

XCSRs, RECSRs, RCSRs, and ICSRs are read and written to and from the XRs. Late pipeline SCSRs are read and written to and from the SRs.

Read and writing CSRs have no side effects. However, reads and writes may affect performance and so it is desirable to provide read-only and write-only versions of CSR operations. However, this introduces five variants, which requires a three-bit instruction field. It may be that CSR operations are changed to always return the old value of the CSR (i.e., always perform a read), which if not useful, wastes a register and perhaps affects performance. This is TBD.

Per-ring CSRs (RCSRs) appear to be fairly expensive, but the invention of SRAM cells in 5 nm and later process nodes that support efficient small RAM means that some RCSRs can by implemented by tiny 8‑entry SRAM arrays, provided that multiple ring values are not required in the same cycle. Unfortunately OoO microarchitectures might produce such a situation, but in some cases this could be handled by reading the necessary RCSRs during instruction decode and pipelining that value along. Other tricks might be used to keep RCSRs as tiny SRAMs.

In the specifications below, the definition of n ← op(o,v,m) comes from the opcode (mnemonic RW for Read-Write, mnemonic RS for Read-Set, mnemonic RC for Read-Clear). Here o is the old value, v is the operand value, and m is the per-CSR bitmask specifying which bits are writeable (some bits possibly being read-only).

Description	op	Definition	Supported	Reserved
Read	R	n ← o	All
Read-Write	RW	n ← (o & ~m) \| (v & m)	All
Write	W	n ← (o & ~m) \| (v & m)	All
Read-Set	RS	n ← o \| (v & m)	XCSRs, RCSRs, ICSRs, SCSRs	RECSRs
Read-Clear	RC	n ← o & ~(v & m)	XCSRs, RCSRs, ICSRs, SCSRs	RECSRs
Read-Write Group	RW	see below	RECSRs	XCSRs, RCSRs, ICSRs, SCSRs

Note: the following ignores whether the CSRs in question can store a tag or not. These details will be provided in the future. At the present time, there are no XCSRs, SCSRs, or ICSRs that have tags. This could, however, change in the future. Also, at present, no RCSRs have full tags, but CSP and ExcHandler store a ring number in bits 66..64. CSP reads supply 24 as the five high bits on a read, giving the tag the values 192..199. ExcHandler reads supply 26 as the five high bits on a read, giving the tag the values 208..215.

CSRopX d,imm,a: // XCSR op
// XR[d] ← op(XCSR[imm], XR[a])
v ← XR[a]
trap if (v & XCSRreserved[imm]) ≠ 0
o ← XCSR[imm]
n ← op(o,v,XCSRwriteable[imm])
if op ≠ R then XCSR[imm] ← n
if op ≠ W then XR[d] ← o
CSRopS d,imm,a: // SCSR op
// SR[d] ← op(SCSR[imm], SR[a])
v ← SR[a]
trap if (v & SCSRreserved[imm]) ≠ 0
o ← SCSR[imm]
n ← op(o,v,SCSRwriteable[imm])
if op ≠ R then SCSR[imm] ← n
if op ≠ W then SR[d] ← o
RECSRopX d,imm,a,b: // RECSR R,W,RW for ring given by XR[b]
// XR[d] ← op(RCSR[imm][XR[b].ring], XR[a])
v ← XR[a]
r ← XR[b]_66..64
trap if PC.ring ≠ 7
trap if v_71..64 ≠ 240
trap if XR[b]_71..67 ≠ 24
o ← RECSR[imm]
n ← op(o,v,7)
if op ≠ R then RECSR[imm] ← n_2..0
if op ≠ W then XR[d] ← 240 ‖ 0⁶¹ ‖ o
REGCSRRWX d,imm,a,b: // RECSR Group R,W,RW
v ← XR[a]
r ← XR[b]_66..64
trap if PC.ring ≠ 7
trap if v_71..64 ≠ 240
trap if XR[b]_71..67 ≠ 24
trap if (v & 0x8888888888888888) ≠ 0
trap if imm_3..0 ≠ 0
o ← 0¹‖RECSR[imm+15] ‖ 0¹‖RECSR[imm+14] ‖ ⋯ ‖ 0¹‖RECSR[imm+0]
n ← op(o,v,0x7777777777777777)
if op ≠ R then
  RECSR[imm+0] ← n_2..0
  RECSR[imm+1] ← n_6..4
  ⋮
  RECSR[imm+15] ← n_62..60
if op ≠ W then
  XR[d] ← o
RCSRopXC d,imm,a: // RCSR current ring op
// XR[d] ← op(RCSR[imm][PC.ring], XR[a])
v ← XR[a]
trap if (v & RCSRreserved[imm]) ≠ 0
r ← PC.ring
trap if PC.ring < RECSR[imm]
o ← RCSR[imm][r]
n ← op(o,v,RCSRwriteable[imm])
if op ≠ R then RCSR[imm][r] ← n
if op ≠ W then XR[d] ← o
RCSRopXR d,imm,a,b: // RCSR specified ring op
// XR[d] ← op(RCSR[imm][XR[b].ring], XR[a])
v ← XR[a]
trap if (v & RCSRreserved[imm]) ≠ 0
trap if XR[b]_71..67 ≠ 24
r ← XR[b]_66..64
trap if PC.ring < r
trap if PC.ring < RECSR[r]
o ← RCSR[imm][r]
n ← op(o,v,RCSRwriteable[imm])
if op ≠ R then RCSR[imm][r] ← n
if op ≠ W then XR[d] ← o
ICSRopX d,a,b,c,e: // ICSR op
// XR[d] ← op(ICSR[c][XR[b]][e], XR[a])
v ← XR[a]
trap if (v & ICSRreserved[c][e]) ≠ 0
i ← XR[b]_7..0
trap if i ≥ ICSRcount[c]
o ← ICSR[c][i][e]
n ← op(o,v,ICSRwriteable[c][e])
if op ≠ R then ICSR[c][i][e] ← n
if op ≠ W then XR[d] ← o

Atomic Memory Operations

SecureRISC has two sorts of instructions for synchronization via memory locations. The first is one of the primitives that can implement most synchronization methods: Compare And Swap). Compare And Swap (CAS) exists for the SRs (CASS, CASSD, CASS64, CASS128, CASSI, CASSDI, CASS64I, CASS128I). and perhaps the XRs (CASX, CASXD, CASX64, CASX128, CASXI, CASXDI, CASX64I, CASX128I) It is possible that 8, 16, and 32‑bit versions of Compare And Swap might also be provided. It is also plausible that 288‑bit (half cache block) and 576‑bit (whole cache block) CAS may be provided from the VRs. The basic schema of CAS is illustrated by the following simplified semantics of CASS64, with the other instruction formats being similar:
ea ← AR[a]
expected ← SR[b]
new ← SR[c]
m ← lvload64(ea)
if m = expected then
lvstore64(ea) ← new
endif
SR[d] ← m
This specification clearly violates the number of read and write ports for the XRs, and the CASX forms might be omitted, but CAS instructions are likely at least two cycle instructions, and might read the register file over two cycles. However, it is possible that a CSR could be introduced for the expected value, though this would mean longer instruction sequences for synchronization. TBD.

The second synchronization is not as powerful as Compare And Swap, and could be implemented by CAS, but it may be more efficient in some circumstances. It is atomic load and add (AADDX64, and AADDS64). These instructions load the specified memory location into the destination register and then atomically increment the memory location, as illustrated by the following simplified semantics of AADDX64:
m ← lvload64(ea)
t ← m + 1
lvstore64(ea) ← t
XR[d] ← m
These operations correspond to the ticket(S) operation on a sequencer, as defined in Synchronization with Eventcounts and Sequencers by Reed and Kanodia, though sequencers only require an atomic increment, the generalization to AADDX64 etc. keeps the system interconnect transaction for uncached atomic add similar to atomic AND and OR below.

The third synchronization instructions are even less powerful than atomic increment, and could be implemented by CAS, but may be more efficient in some circumstances, such as the GC mark phase for updating bitmaps. The instructions are atomic AND (AANDX64), atomic OR (AORX64), and atomic XOR (AXORX64). These instructions load the specified memory location into the destination register and then atomically AND, OR, or XOR the memory location, as illustrated by the following simplified semantics of AANDX64:
m ← lvload64(ea)
t ← m & XR[b]
if t ≠ m then
lvstore64(ea) ← t
endif
XR[d] ← m
The case for RISC‑V’s AMOSWAP, AMOMIN, and AMOMAX seem unclear at this point. (The case for RISC‑V’s AMOXOR is also unclear to this author, but it is trivial given support for AANDX64 and AORX64, and also called for by C++11 std::atomic and so included.) Some APIs (e.g., CUDA) may expect these operations, but they could be implemented on SecureRISC with CAS instructions. C++20 added atomic operations on floating-point types, but these are best done using CAS (e.g., it is not appropriate to have floating-point addition in memory controller for uncached operations).

Atomic operations may be performed by the processor on coherent memory locations in the cache by holding off coherency transactions during the operations involved, or on uncached locations by sending a transaction to the memory, which performs the operation atomically there and returns the result. The System Interconnect Address Attributes section describes main memory attributes indicating which locations implement uncached atomic memory operations. The locations to be modified by atomic operations must not cross a 64‑byte boundary; for example, the address for CASS64 must be in the range 0..56 mod 64.

SecureRISC, having committed to providing Compare and Swap (CAS), has not included the Load and Store Conditional paradigm. RISC architectures, have typically begun with this simpler primitive to avoid CAS, and then later added CAS. SecureRISC seeks to avoid this evolution.

Wait Instructions

SecureRISC will have the usual instructions to wait for an interrupt. Such instructions increase efficiency. While the details are TBD, for example, there might be a WAIT instruction that takes a value to write to IntrEnable[PC.ring], and then suspends fetch before the next instruction (so that the return from the interrupt exception returns to that instruction).

A more interesting instruction under consideration is one that waits for a memory location to change, which may be useful for reducing the overhead of memory based synchronization. The ITS .HANG UUO and the x86 MONITOR/MWAIT instructions may be one model.

Fence Instructions

Note: SecureRISC has acquire and release options for loads and stores, which reduces (but does not eliminate) the need for some memory fences. Fences for virtual memory changes may be necessary, though it may be possible to handle those in the coherence protocol. Some fence instructions may also be useful in mitigating security vulnerabilities due to microarchitecture bugs.

The details of SecureRISC’s fence instructions are TBD, but it is likely to specify a first set of (e.g., as a bitmask) of operations that must complete before a second set (also a bitmask) of operations are allowed initiate. This is similar to what RISC‑V adopted for memory fences (their FENCE instruction, where there are only four set elements), but for a larger set of instructions. The goal is to encompass the variety of fences found in other ISAs. The set elements might include instruction fetch, loads, stores, CSR reads, CSR writes, and other instructions. Loads and stores could be further categorized by System Interconnect Address Attributes or acquire and release attributes. Other operations that might be receive bits in the sets might be related to prediction, system interconnect transactions, error checking, privilege level changes, prefetch, address space changes, waits, interrupts, and exceptions, and so on. One goal is to correctly handle Just-in-Time (JIT) compilation in the presence of processor migration, which should be easier in SecureRISC because stores must invalidate instruction caches. An enlarged set of things to fence should also allow for finer-grain patching of security vulnerability bugs that seem to plague speculative processors; even though these should be handled correctly by processor designers, they seem to often not be handled properly. Not all of this is thought out. Again, the details are TBD.

Note: Need to look at POWER persistent synchronization instructions (phwsync and plwsync). See Power ISA Version 3.1B section 4.6.3 Memory Barrier Instructions.

System Call Instructions

SecureRISC lacks a System Call instruction (e.g., RISC‑V ECALL), as gates are the preferred method of making requests of higher privilege levels.

Prefetch and Cache Operations

SecureRISC has instructions for compiler-directed prefetching and to control automatic prefetching. These instructions operate on 8‑word cache lines. The C prefix to these assembler mnemonics represents Cache. Rather than identify caches as L1 BBDC, L1 Instruction, L2 Instruction, L1 Data, L2 Data, L3, etc. we designate caches by referencing the instructions that use those caches. Further work is required for things that operate on or stop at some intermediate level of the hierarchy. These instructions operate on cache block specified by an lvaddr and are subject to access permissions. They are available to all rings. There will be privileged instructions not yet listed here.

SecureRISC requires that writes invalidate or update all caches that contain previous fetches, including the BBDC and L1 and L2 Instruction caches. Previously fetched instructions still in the pipeline are not invalidated, so a fence is required. Thus, cache operations are not required for JIT compilers, merely the fence. This is typically implemented by having a directory at what ARM calls the Point of Unification (PoU) in the cache hierarchy. This directory records the locations in lower levels which may contain the a copy. Stores consult the directory and when other locations are noted, those locations are invalidated or updated. For multiprocessor systems, a first processor may write instructions that a second will execute. The first processor must execute a fence to ensure all writes have completed before signaling the second processor to proceed. The second processor must also use a fence to ensure that the pipeline has stale instructions (e.g., fetched speculatively). The details will be spelled out when the fence instructions are specified.

Is TLB prefetching required?

Instruction	Operation
Fetch prefetching and eviction designation (these may be executed too late in the pipeline to be useful an so may be replaced by BBD features)
CPBB	Prefetch into Basic Block Descriptor Cache (BBDC)
CPI	Prefetch into Basic Block Descriptor Cache (BBDC) and Instruction Cache
CEBB	Designate eviction candidate for Basic Block Descriptor Cache (BBDC)
CEI	Designate eviction candidate for Basic Block Descriptor Cache (BBDC) and Instruction Caches
Early pipeline prefetching, zeroing, writeback, invalidation, and eviction designation
CPLA	Prefetch for LA/LAC/etc.
CPLX	Prefetch for LX/etc. (probably identical to CPLA in most cases)
CPSA	Prefetch for SA/SAC/etc.
CPSX	Prefetch for SX/etc. (probably identical to CPSA in most cases)
CZA	Zero cache block used for SA/SAC/etc.
CZX	Zero cache block used for SX/etc. (probably identical to CZA in most cases)
CEA	Designate eviction candidate for LA/SA
CEX	Designate eviction candidate for LX/SX
CCX	Clean (writeback) for SX cache
CCIX	Clean (writeback) and invalidate for SX cache
Late pipeline prefetching, zeroing, writeback, invalidation, and eviction designation (the primary difference from early prefetching is some microarchitectures may not prefetch to the first stage(s) of the data cache hierarchy)
CPLS	Prefetch for LS
CPSS	Prefetch for SS
CZS	Zero cache block used for SS/etc.
CES	Designate eviction candidate for LS/SS
CCS	Clean (writeback) for SS cache
CCIS	Clean (writeback) and invalidate for SS cache

Need to look at POWER dcbstps (data cache block store to persistent storage) and dcbfps (data cache block flush to persistent storage).

The primary issue with fetch prefetching is that some implementations may execute explicit instructions too late to be useful. Eventually I expect to define new next codes in Basic Block Descriptors for L1 BBDC and L1 Instruction Cache prefetch and eviction designation to solve this problem. Whether some of the above instructions are removed by such a solution is TBD.

Prefetch may want additional options for rereference interval prediction and similar hints to avoid removing useful cache blocks when streaming data larger than the cache size.

Code Size Reduction

It is likely appropriate to add some instructions that exist only for code size reduction, which expand into multiple SecureRISC instructions early in the pipeline (e.g., before register renaming). The best candidates for this so far are doubleword load/store instructions, which would expand into two singleword load/store instructions. This expansion and execution as separate instructions in the backend of the pipeline avoids the issues with register renaming that would otherwise exist. The partial execution of part of the pair would be allowed (and loads to the source registers would be not allowed). Doubleword load/store significantly reduce the size of function call entry and exit and may be useful for loading a code pointer and context pointer pair for indirect calls.

Instruction Formats and Overview

The following outlines some of the instructions without giving them their full definitions, which includes tag and bounds checking. The full definitions will follow later.

The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream; in effect it provides an extra bit of information for every 16 bits of the instruction stream.

Assembler Mnemonics

Because the identical or nearly identical instructions may exist in multiple widths, a convention for distinguishing them is required. Since 32‑bit instructions are most common, these have the shortest form. Mnemonics for instruction sizes other than 32 bits are indicated by their first letter:

Size	Mnemonic prefix
16	1
32
48	3
64	4

Instructions that calculate an effective address are distinguished by the first letter of their mnemonic: Address, Load, or Store. For loads and stores, the second letter of the mnemonic gives the destination register file either A for ARs, X for XRs, S for SRs, M for VMs, or V for VRs. (There are no loads or stores to the BRs.) The next field of the mnemonic is empty for word loads and stores, or the size (8, 16, 32, or 64—possibly 128?) for sub-word loads and stores to the XRs or SRs. Word stores must be word-aligned, but 64‑bit (possibly 128? sub-word stores may be misaligned and generate an integer tag. Sub-word loads for 8, 16, or 32 bits next include S for signed or U for unsigned. Finally, the last letter is I for an immediate offset (as opposed to a XR[b] offset).

As examples of the above rules: A stores the address calculation AR[a] +_p XR[b]<<sa to destination AR[d] while 1AI stores the address calculation AR[a] +_p imm8. LA and LAI loads the contents of those two address calculation to the destination AR[d]. LX32U loads XR[d] from an unsigned 32‑bit memory location located using a XR[b] offset and LS16SI loads SR[d] from a signed 16‑bit memory location located using an immediate offset.

Arithmetic instructions use the operation (e.g., ADD or SUB) with a suffix X or S for the register file of the source and destination operands. If an immediate value is one of the operands, a final I is appended. For vector operations the suffixes are VV for vector-vector VS for vector-scalar, and VI for vector-scalar immediate.

As examples of the above rules:

Assembler	Simplified meaning (ignoring details)
ADDXI d, a, imm	XR[d] ← XR[a] + imm
ADDS d, a, b	SR[d] ← SR[a] + SR[b]

For Floating-Point operations, F is used for IEEE754 binary32 (single-precision), D is used for IEEE754 binary64 (double-precision), H is used for IEEE754 binary16 (half-precision), B is used for the non-standard Machine Learning (ML) 16‑bit Brain Float format, and P3, P4, and P5 are used for the three proposed IEEE binary8pp formats for ML quarter-precision (8‑bit) with 5‑bit, 4‑bit, 3‑bit exponents. Q is reserved for a future IEEE754 binary128 (quad-precision).

Some floating-point examples are as follows:

Assembler	Simplified meaning (ignoring details)	Comment
FNMADDS d, a, b, c	SR[d] ← −(SR[a] ×_f SR[b]) +_f SR[c]
DMADDVS d, a, b, c	VR[d] ← (VR[a] ×_d SR[b]) +_d VR[c]
P4MBSUBVV d, c, a, b	VR[d] ← (VR[a] ×_p4b VR[b]) −_b VR[c]	P4 widening to BF multiply-subtract

*fmt* Floating-Point Precision schema
Mnemonic	Definition	Comment	Exp	Prec
Q	binary128	quadruple-precision	15	113
D	binary64	double-precision	11	53
F	binary32	single-precision	8	24
H	binary16	half-precision	5	11
B	bfloat16	ML alternative to half-precision	8	8
P5	binary8p5	quarter-precision for ML alternative	3	5
P4	binary8p4	quarter-precision for ML alternative	4	4
P3	binary8p3	quarter-precision for ML	5	3

SecureRISC has not yet considered inclusion of NVIDIA’s Tensor Float format.

In the following sections, sometimes a set of instructions are defined with a mnemonic schema using the following:

What	Schema	Mnemonic	Definition	Comment
Operation Mnemonic schemas for ARs
Address Comparison	ac	EQ	x_63..0 = y_63..0
		NE	x_63..0 ≠ y_63..0
		LTU	x_63..0 <_u y_63..0
		GEU	x_63..0 ≥_u y_63..0
		TEQ	x_71..64 = y_7..0	tag equal
		TNE	x_71..64 ≠ y_7..0	tag not-equal
		TLTU	x_71..64 <_u y_7..0	tag less than
		TGEU	x_71..64 ≥_u y_7..0	tag greater than or equal
		WEQ	x_71..0 = y_71..0	word equal
		WNE	x_71..0 ≠ y_71..0	word not-equal
		WLTU	x_71..0 <_u y_71..0	word less than
		WGEU	x_71..0 ≥_u y_71..0	word greater than or equal
Operation Mnemonic schemas for XRs
Index Arithmetic	xa	ADD	x_63..0 + y_63..0	mod 2⁶⁴ addition
		SUB	x_63..0 − y_63..0	mod 2⁶⁴ subtraction
		MINU	min_u(x_63..0, y_63..0)
		MINS	min_s(x_63..0, y_63..0)
		MINUS	min_us(x_63..0, y_63..0)
		MAXU	max_u(x_63..0, y_63..0)
		MAXS	max_s(x_63..0, y_63..0)
		MAXUS	max_us(x_63..0, y_63..0)
		Index Logical	xl	AND	x_63..0 & y_63..0
				OR	x_63..0 \| y_63..0
				XOR	x_63..0 ^ y_63..0
				SLL	x_63..0 <<_u y_5..0
				SRL	x_63..0 >>_u y_5..0
				SRA	x_63..0 >>_s y_5..0
				Index Comparison	xc	EQ	x_63..0 = y_63..0
						NE	x_63..0 ≠ y_63..0
						LTU	x_63..0 <_u y_63..0
						LT	x_63..0 <_s y_63..0
GEU	x_63..0 ≥_u y_63..0
GE	x_63..0 ≥_s y_63..0
NONE	(x_63..0&y_63..0)=0	Check statistics
ANY	(x_63..0&y_63..0)≠0	Check statistics
ALL	(x_63..0&~y_63..0)=0	Check statistics
NALL	(x_63..0&~y_63..0)≠0	Check statistics
BITC	x_{y_5..0}=0	Check statistics
BITS	x_{y_5..0}≠0	Check statistics
TEQ	x_71..64 = y_7..0	tag equal
TNE	x_71..64 ≠ y_7..0	tag not-equal
TLTU	x_71..64 <_u y_7..0	tag less than
TGEU	x_71..64 ≥_u y_7..0	tag greater than or equal
WEQ	x_71..0 = y_71..0	word equal
WNE	x_71..0 ≠ y_71..0	word not-equal
WLTU	x_71..0 <_u y_71..0	word less than
WGEU	x_71..0 ≥_u y_71..0	word greater than or equal
Operation Mnemonic schemas for SRs, BRs, VRs, and VMs
Boolean	bo	AND	x & y
		ANDTC	x & ~y
		NAND	~(x & y)
		NOR	~(x \| y)
		OR	x \| y
		ORTC	x \| ~y
		XOR	x ^ y
		EQV	x ^ ~y
Boolean accumulation	ba	AND	x & y
		OR	x \| y	OR with b0 used by assembler for non-accumulation
		Integer Comparison	ic	EQ	x_63..0 = y_63..0
NE	x_63..0 ≠ y_63..0
LTU	x_63..0 <_u y_63..0
LT	x_63..0 <_s y_63..0
GEU	x_63..0 ≥_u y_63..0
GE	x_63..0 ≥_s y_63..0
NONE	(x_63..0&y_63..0)=0			Check statistics
ANY	(x_63..0&y_63..0)≠0			Check statistics
ALL	(x_63..0&~y_63..0)=0			Check statistics
NALL	(x_63..0&~y_63..0)≠0			Check statistics
BITC	x_{y_5..0}=0			Check statistics
BITS	x_{y_5..0}≠0			Check statistics
Integer Arithmetic	io	ADD	x_63..0 + y_63..0	mod 2⁶⁴ addition
		SUB	x_63..0 − y_63..0	mod 2⁶⁴ subtraction
		ADDOU	x_63..0 +_u y_63..0	Trap on unsigned overflow
		ADDOS	x_63..0 +_s y_63..0	Trap on signed overflow
		ADDOUS	x_63..0 +_us y_63..0	Trap on unsigned-signed overflow
		SUBOU	x_63..0 −_u y_63..0	Trap on unsigned overflow
		SUBOS	x_63..0 −_s y_63..0	Trap on signed overflow
		SUBOUS	x_63..0 −_us y_63..0	Trap on unsigned-signed overflow
		MINU	min_u(x_63..0, y_63..0)
		MINS	min_s(x_63..0, y_63..0)
		MINUS	min_us(x_63..0, y_63..0)
		MAXU	max_u(x_63..0, y_63..0)
		MAXS	max_s(x_63..0, y_63..0)
		MAXUS	max_us(x_63..0, y_63..0)
		MUL	x_63..0 × y_63..0	least-significant 64 bits of product
		MULOU	x_63..0 ×_u y_63..0	Trap on unsigned overflow
		MULOS	x_63..0 ×_s y_63..0	Trap on signed overflow
		MULUS	x_63..0 ×_us y_63..0	Trap on unsigned-signed overflow
Integer 1-operand (should these be logical accumulations instead?)	a1	NEG	− x_63..0	negate
		ABS	abs(x_63..0)	absolute value
		POPC	popcount(x_63..0)	count number of one bits
		COUNTS	countsign(x_63..0)	count most-significant bits equal to sign bit
		COUNTMS0	countms0(x_63..0)
		COUNTMS1	countms1(x_63..0)
		COUNTLS0	countls0(x_63..0)
		COUNTLS1	countls1(x_63..0)
Integer Arithmetic accumulation	ia	ADD	x_63..0 + y_63..0
		SUB	x_63..0 − y_63..0
			y_63..0	Non-accumulation Mnemonic omitted in assembler: e.g., just ADDS d, a, b is encoded with this ia to perform SR[d] ← SR[a] + SR[b]
Bitwise Logical	lo	AND	x_63..0 & y_63..0
		ANDTC	x_63..0 & ~y_63..0
		NAND	~(x_63..0 & y_63..0)
		NOR	~(x_63..0 \| y_63..0)
		OR	x_63..0 \| y_63..0
		ORTC	x_63..0 \| ~y_63..0
		XOR	x_63..0 ^ y_63..0
		EQV	x_63..0 ^ ~y_63..0
		SLL	x_63..0 <<_u y_5..0
		SRL	x_63..0 >>_u y_5..0
		SRA	x_63..0 >>_s y_5..0
		CLMUL	x_63..0 ⊗ y_63..0	Carryless multiplication
Bitwise Logical accumulation	la	AND	x_63..0 & y_63..0
		OR	x_63..0 \| y_63..0
		XOR	x_63..0 ^ y_63..0	Primarily for CLMUL
			y_63..0	Non-accumulation Mnemonic omitted in assembler: e.g., just ANDS d, [c,] a, b is encoded with this la to perform SR[d] ← SR[a] & SR[b] with SR[c] ignored.
Floating-Point Arithmetic	fo	ADD	x +_fmt y
		SUB	x −_fmt y
		MIN	min_fmt(x, y)
		MAX	max_fmt(x, y)
		MINM	minmag_fmt(x, y)
		MAXM	maxmag_fmt(x, y)
		M	x ×_fmt y
		NM	−(x ×_fmt y)	negative multiply
		Mw	w(x) ×_w w(y)	widening multiply
NMw	−(w(x) ×_w w(y))	widening negative multiply
DIV	x_63..0 ÷_fmt y_63..0	Must be no-accumulation
Floating-Point accumulation	fa	ADD	x +_fmt y
		SUB	x −_fmt y
			y_63..0	Non-accumulation Mnemonic omitted in assembler: e.g., just DADDS d, a, b is encoded with this fa to perform SR[d] ← SR[a] +_d SR[b]
Floating-Point 1-operand	f1	MOV	x
		NEG	−_fmt x_63..0
		ABS	abs_fmt(x_63..0)
		RECIP	1.0 ÷_fmt x_63..0
		SQRT	sqrt_fmt(x_63..0)
		RSQRT	rsqrt_fmt(x_63..0)
		FLOOR	floor_fmt(x_63..0)
		CEIL	ceil_fmt(x_63..0)
		TRUNC	trunc_fmt(x_63..0)
		ROUND	round_fmt(x_63..0)
		CVTI	convert_i,fmt(x_63..0)
		CVTB	convert_b,fmt(x_63..0)
		CVTH	convert_h,fmt(x_63..0)
		CVTF	convert_f,fmt(x_63..0)
		CVTD	convert_d,fmt(x_63..0)
		FLOATU	float_fmt,u(x_63..0, imm)
		FLOATS	float_fmt,s(x_63..0, imm)
		CLASS	class_fmt(x_63..0)
Floating-Point Comparison	fc	OR	x_63..0 ?_fmt y_63..0	ordered
		UN	x_63..0 ~?_fmt y_63..0	unordered
		EQ	x_63..0 =_fmt y_63..0
		NE	x_63..0 ≠_fmt y_63..0
		LT	x_63..0 <_fmt y_63..0
		GE	x_63..0 ≥_fmt y_63..0
		LE	x_63..0 ≤_fmt y_63..0
		GT	x_63..0 >_fmt y_63..0

Schema combinations for the late pipeline
Class	Schema	Definition	Examples
Integer Arithmetic	ioia	SR[d] ← SR[c] ia (SR[a] io SR[b])	MADDS
Bitwise Logical	lola	SR[d] ← SR[c] la (SR[a] lo SR[b])	ANDORS
Floating-Point	fofa	SR[d] ← SR[c] fa_fmt (SR[a] fo_fmt SR[b])	DNMSUBS
Boolean	boba	BR[d] ← BR[c] ba (BR[a] bo BR[b])	ANDORS
Boolean	boba	VM[d] ← VM[c] ba (VM[a] bo VM[b])	ORANDM
Integer Comparison	icba	BR[d] ← BR[c] ba (SR[a] ic SR[b])	LTUANDS
Floating-Point Comparison	fcba	BR[d] ← BR[c] ba (SR[a] fc SR[b])	DLTANDS

logical operation encodings
Value	Mnemonic	Function	Mnemonic	Function
0000	F	0
0001	NOR	a ~\| b	ANDCC	~a & ~b
0010	ANDTC	a & ~b
0011	NOTB	~b
0100	ANDCT	~a & b
0101	NOTA	~a
0110	XOR	a ^ b
0111	NAND	a ~& b	ORCC	~a \| ~b
1000	AND	a & b
1001	EQV	a ~^ b	XNOR	~(a ^ b)
1010	A	a
1011	ORTC	a \| ~b
1100	B	b
1101	ORCT	~a \| b
1110	OR	a \| b
1111	T	1

signed/unsigned encoding
m	What	Example mnemonic
0	Reserved
1	Unsigned	MINU
2	Signed	MAXS
3	Unsigned Signed	MINUS

overflow signed/unsigned encoding
m	What	Example mnemonic
0	wrap	ADD
1	Overflow Unsigned	SUBOU
2	Overflow Signed	MULOS
3	Overflow Unsigned Signed	ADDOUS

rounding mode encoding
field TBD	Static	Dynamic
0	Nearest, ties to Even
1	Round to odd
2	Nearest, ties to Min Magnitude
3	Nearest, ties to Max Magnitude
4	Toward −∞ (floor)
5	Toward +∞ (ceiling)
6	Toward 0 (truncate)
7	Dynamic	Away from 0

load/store size encoding
field TBD	Data width	Aligned	MemTag check	Examples
0	8		240..245	LX8U, LS8SI
1	16		240..245	LS16S, LS16UI
2	32		240..245	SX32I, SS32
3	64		240..252	LX64UI, SS64
4	72	word		LAI, LX, LS, SSI
5	144	doubleword		LAD, SADI
6	144	doubleword	232/251	LAC, SAC
7	64		clique	CLA64, CSA64

load/store ordering encoding
field TBD	Mnemonic	Semantics
0		Neither acquire nor release
1	.a	Acquire
2	.r	Release
3	.ar	Acquire and Release

The table below lists the indexed load/store opcode mnemonics, but the same encoding is used for the immediate offset opcodes (i.e., with the appended I suffix). The {L,S}{X,S}128 instructions marked with a ? are possible placeholders for future code density instructions that expand into a pair of load or store instructions, similar to the existing {L,S}{X,S}D instructions.

load/store operation encoding
field TBD	Reg file	Operation	field TBD
field TBD	Reg file	Operation	0	1	2	3	4	5	6	7
0	XR	Load Unsigned	LX8U	LX16U	LX32U	LX64U	LX	LXD		LX128?
1	SR	Load Unsigned	LS8U	LS16U	LS32U	LS64U	LS	LSD		LS128?
2	XR	Speculative Load Unsigned	SLX8U	SLX16U	SLX32U	SLX64U	SLX	SLXD
3	SR	Speculative Load Unsigned	SLS8U	SLS16U	SLS32U	SLS64U	SLS	SLSD
4	XR	Load Signed	LX8S	LX16S	LX32S	LX64S
5	SR	Load Signed	LS8S	LS16S	LS32S	LS64S
6	XR	Speculative Load Signed	SLX8S	SLX16S	SLX32S	SLX64S
7	SR	Speculative Load Signed	SLS8S	SLS16S	SLS32S	SLS64S
8	AR	Load			RLA32	RLA64	LA	LAD	LAC	CLA64
9	VM	Load								LM
10	AR	Speculative Load			SRLA32	SRLA64	SLA	SLAD	SLAC	SCLA64
11	Reserved
12	XR	Store	SX8	SX16	SX32	SX64	SX	SXD		SX128?
13	SR	Store	SS8	SS16	SS32	SS64	SS	SSD		SS128?
14	AR	Store			RSA32	RSA64	SA	SAD	SAC	CSA64
15	VM	Store								SM

scalar vector encoding
n	Suffix	What	Example	m usage	f usage
0	S	Scalar integer	SR[d] ← SR[c] ia (SR[a] io SR[b])	su or osu
	B	Boolean	BR[d] ← BR[c] ba (BR[a] bo BR[b])
	S	Scalar floating	SR[d] ← SR[c] fa_fmt (SR[a] fo_fmt SR[b])		round
1	SV	Vector reduction to scalar integer	SR[d] ← reduction(SR[a], VR[b]) masked by VM[m] and n	vector mask
	SV	Vector reduction to scalar floating	SR[d] ← reduction_fmt(SR[a], VR[b]) masked by VM[m] and n	vector mask	round
	M	Mask	VM[d] ← VM[c] ba (VM[a] bo VM[b])
2	VS	Vector Scalar integer	VR[d] ← VR[c] ia (VR[a] io SR[b]) masked by VM[m] and n	vector mask
	VS	Vector Scalar floating	VR[d] ← VR[c] fa_fmt (VR[a] fo_fmt SR[b]) masked by VM[m] and n	vector mask	round
	VI	Vector Immediate integer	VR[d] ← VR[a] io imm masked by VM[m] and n	vector mask
	VS	Vector Scalar integer compare	VM[d] ← VM[c] ba (VR[a] ic SR[b]) masked by VM[m] and n
	VI	Vector Immediate integer compare	VM[d] ← VR[a] ic imm masked by VM[m] and n
	VS	Vector Scalar floating compare	VM[d] ← VM[c] ba (VR[a] fc_fmt SR[b]) masked by VM[m] and n
3	VV	Vector Vector integer	VR[d] ← VR[c] ia (VR[a] io VR[b]) masked by VM[m] and n	vector mask
	VV	Vector Vector floating	VR[d] ← VR[c] fa_fmt (VR[a] fo_fmt VR[b]) masked by VM[m] and n	vector mask	round
	VV	Vector Vector integer compare	VM[d] ← VM[c] ba (VR[a] ic VR[b]) masked by VM[m] and n
	VV	Vector Vector floating compare	VM[d] ← VM[c] ba (VR[a] fc_fmt VR[b]) masked by VM[m] and n

Vector operations write only the destination elements enabled by the vector mask operand. Destination element i is written if VM[m]_i = n. Since VM[0] is hardwired to 0, setting m to 0 and n to 0 writes unconditionally. The combination of m = 0 and n = 1 is reserved.

The following are a sketch of the 16‑bit instruction encodings, but the actual encodings will be determined by analyzing instruction frequency in the 32‑bit instruction set.

16‑bit op16
1:0 3:2	0	1	2	3
0	1A	1LA	1SA	i16da
1	1AI	1LAI	1SAI	i16ab0
2	1ADDX	1LX	1SX	1XI
3	1ADDXI	1LXI	1SXI	i16ab1

16‑bit instruction format destination 2 source
15	12	11	8	7	4	3	0
b		a		d		op16
4		4		4		4

Word address calculation with indexed addressing: 1A
1A	d, a, b	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 AR[d] ← AR[a] +_p XR[b]<<3 AV[d] ← v
Index register addition
1ADDX	d, a, b	XR[d] ← XR[a] + XR[b] XV[d] ← XV[a] & XV[b]
Non-speculative tagged word loads with indexed addressing: L{A,X,S} (LS in 32‑bit table)
1LA	d, a, b	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 AR[d] ← v ? lvload72(AR[a] +_p XR[b]<<3) : 0 AV[d] ← v
1LX	d, a, b	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<3) = 0 XR[d] ← v ? lvload72(AR[a] +_p XR[b]<<3) : 0 XV[d] ← v

16‑bit instruction format destination source immediate
15	12	11	8	7	4	3	0
imm4		a		d		op16
4		4		4		4

Word address calculation with immediate addressing: AI
1AI	d, a, imm4	v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 AR[d] ← AR[a] +_p imm4<<3 AV[d] ← v
Index register addition immediate
1ADDXI	d, a, imm4	XR[d] ← XR[a] + imm4 XV[d] ← XV[a]
Non-speculative tagged word loads with indexed addressing: L{A,X,S}I (LSI and wider immediate LA and LX in 32‑bit table)
1LAI	d, a, imm4	v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 AR[d] ← v ? lvload72(AR[a] +_p imm4<<3) : 0 AV[d] ← v
1LXI	d, a, imm4	v ← AV[a] trap if v & boundscheck(AR[a], imm4<<3) = 0 XR[d] ← v ? lvload72(AR[a] +_p imm4<<3) : 0 XV[d] ← v

16‑bit op16da
13:12 15:14	0	1	2	3
0	1NEGX	1NOTX	1MOVSX	1MOVXS
1	1RTAGX		1RTAGA	1RSIZEA
2	1MOVAX	1MOVXA	1MOVAS	1MOVSA
3	1SOBX		1RTAGS

16‑bit instruction format destination 1 source
15	12	11	8	7	4	3	0
op16da		a		d		i16da
4		4		4		4

1NEGX	d, a	XR[d] ← −XR[a] XV[d] ← XV[a]
1NOTX	d, a	XR[d] ← ~XR[a] XV[d] ← AV[a]
1MOVA	d, a	AR[d] ← AR[a] AV[d] ← AV[a]
1MOVX	d, a	XR[d] ← XR[a] XV[d] ← XV[a]
1MOVS	d, a	SR[d] ← SR[a] SV[d] ← SV[a]
1MOVAX	d, a	AR[d] ← XR[a] AV[d] ← XV[a]
1MOVXA	d, a	XR[d] ← AR[a] XV[d] ← AV[a]
1MOVSX	d, a	SR[d] ← XR[a] SV[d] ← XV[a]
1MOVXS	d, a	XR[d] ← SR[a] XV[d] ← SV[a]
1MOVAS	d, a	AR[d] ← SR[a] AV[d] ← SV[a]
1MOVSA	d, a	SR[d] ← AR[a] SV[d] ← AV[a]
1RTAGA	d, a	XR[d] ← 240 ∥ 0⁵⁶ ∥ AR[a]_71..64 XV[d] ← XV[a]
1RTAGX	d, a	XR[d] ← 240 ∥ 0⁵⁶ ∥ XR[a]_71..64 XV[d] ← XV[a]
1RTAGS	d, a	SR[d] ← 240 ∥ 0⁵⁶ ∥ SR[a]_71..64 SV[d] ← SV[a]
1RSIZEA	d, a	XR[d] ← 240 ∥ 0³ ∥ AR[a]_132..72 XV[d] ← XV[a]
1SOBX		trap if XV[a] = 0 XR[d] ← XR[a] − 1 XV[d] ← 1 loop back if XR[d] ≠ 0

1NEGX is identical to RSUBXI with an immediate of 0 but is 16 bits rather than 32. Whether this is important is unclear. 1NOTX is identical to RSUBXI with an immediate of -1 but is 16 bits rather than 32. Whether this is important is unclear. Whether to include these will depend on code size statistics.

16‑bit op16ab0
5:3 7:6	0	1	2	3
0	1BEQA	1BNEA	1BLTAU	1BGEAU
1	1BEQX	1BNEX	1BLTXU	1BGEXU
2	1BNONEX	1BANYX	1BLTX	1BGEX
3	i16a0

16‑bit op16ab1
5:4 7:6	0	1	2	3
0	1TEQA	1TNEA	1TLTAU	1TGEAU
1	1TEQX	1TNEX	1TLTXU	1TGEXU
2	1TNONEX	1TANYX	1TLTX	1TGEX
3	i16a1

16‑bit instruction format 2 source
15	12	11	8	7	4	3	0
b		a		op16ab		i16ab
4		4		4		4

All of the following first do either
trap if AV[a] & AV[b] = 0
or
trap if XV[a] & XV[b] = 0
as appropriate.

1BEQA	a, b	branch if AR[a] = AR[b]
1BEQX	a, b	branch if XR[a] = XR[b]
1BNEA	a, b	branch if AR[a] ≠ AR[b]
1BNEX	a, b	branch if XR[a] ≠ XR[b]
1BLTAU	a, b	branch if AR[a] <_u AR[b]
1BLTXU	a, b	branch if XR[a] <_u XR[b]
1BGEAU	a, b	branch if AR[a] ≥_u AR[b]
1BGEXU	a, b	branch if XR[a] ≥_u XR[b]
1BLTX	a, b	branch if XR[a] <_s XR[b]
1BGEX	a, b	branch if XR[a] ≥_s XR[b]
1BNONEX	a, b	branch if (XR[a] & XR[b]) = 0
1BANYX	a, b	branch if (XR[a] & XR[b]) ≠ 0
1TEQA	a, b	trap if AR[a] = AR[b]
1TEQX	a, b	trap if XR[a] = XR[b]
1TNEA	a, b	trap if AR[a] ≠ AR[b]
1TNEX	a, b	trap if XR[a] ≠ XR[b]
1TLTAU	a, b	trap if AR[a] <_u AR[b]
1TLTXU	a, b	trap if XR[a] <_u XR[b]
1TGEAU	a, b	trap if AR[a] ≥_u AR[b]
1TGEXU	a, b	trap if XR[a] ≥_u XR[b]
1TLTX	a, b	trap if XR[a] <_s XR[b]
1TGEX	a, b	trap if XR[a] ≥_s XR[b]
1TNONEX	a, b	trap if (XR[a] & XR[b]) = 0
1TANYX	a, b	trap if (XR[a] & XR[b]) ≠ 0

16‑bit op16a0
13:12 15:14	0	1	2	3
0	1BEQNA	1BNENA	1BF	1BT
1	1BEQZX	1BNEZX	1BLTZX	1BGEZX
2	1JMPA
3	1SWITCHX		1BLEZX	1BGTZX

16‑bit op16a1
13:12 15:14	0	1	2	3
0	1TEQNA	1TNENA	1TF	1TT
1	1TEQZX	1TNEZX	1TLTZX	1TGEZX
2	1CHKVA	1CHKVX	1CHKVS
3			1TLEZX	1TGTZX

16‑bit instruction format 1 source
15	12	11	8	7	4	3	0
op16a		a		i16a		i16ab
4		4		4		4

All of the following first do either
trap if AV[a] = 0
or
trap if XV[a] = 0
or
trap if BV[a] = 0
as appropriate.

1BEQNA	a	branch if AR[a]_63..0 = 0
1BNENA	a	branch if AR[a]_63..0 ≠ 0
1BEQZX	a	branch if XR[a]_63..0 = 0
1BNEZX	a	branch if XR[a]_63..0 ≠ 0
1BLTZX	a	branch if XR[a]_63..0 <_s 0
1BGEZX	a	branch if XR[a]_63..0 ≥_s 0
1BLEZX	a	branch if XR[a]_63..0 ≤_s 0
1BGTZX	a	branch if XR[a]_63..0 >_s 0
1BF	a	branch if BR[a] = 0
1BT	a	branch if BR[a] ≠ 0
1TEQZX	a	trap if XR[a]_63..0 = 0
1TNEZX	a	trap if XR[a]_63..0 ≠ 0
1TLTZX	a	trap if XR[a]_63..0 <_s 0
1TGEZX	a	trap if XR[a]_63..0 ≥_s 0
1TLEZX	a	trap if XR[a]_63..0 ≤_s 0
1TGTZX	a	trap if XR[a]_63..0 >_s 0
1TF	a	trap if BR[a] = 0
1TT	a	trap if BR[a] ≠ 0
1JMPA	a	trap if AR[a]_71..68 ≠ 12 trap if AR[a]_2..0 ≠ 0 trap if PC_66..64 ≠ AR[a]_66..64 PC ← AR[a]_66..0
1SWITCHX	a	trap if XR[a]_71..65 ≠ 120 PC ← PC +_p (XR[a]<<3)
1CHKVA	a	trap if AV[a] = 0
1CHKVX	a	trap if XV[a] = 0
1CHKVS	a	trap if SV[a] = 0

16‑bit instruction format destination immediate
15	8	7	4	3	0
imm8		d		op16
8		4		4

1XI

d, imm8

XR[d] ← 240 ∥ imm8₇⁴⁸ ∥ imm8
XV[d] ← 1

32‑bit op32
1:0 3:2	0	1	2	3
0	AXload	AXstore	SVload	SVstore
1
2	ARop	XRop	SRop	VRop
3	XI	XUI	SI	SUI

32‑bit instruction format 3 sources 1 destination
31	28	27	24	23	22	21	20	19	16	15	12	11	8	7	4	3	0
op32g		f		n		m		c		b		a		d		op32
4		4		2		2		4		4		4		4		4

Scalar Integer
ioiaS	d, c, a, b	SR[d] ← SR[c] ia (SR[a] io SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
lolaS	d, c, a, b	SR[d] ← SR[c] la (SR[a] lo SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
SELS	d, c, a, b	SR[d] ← BR[c] ? SR[a] : SR[b] SV[d] ← BV[c] & (BR[c] ? SV[a] : SV[b])
i1S	d, a	SR[d] ← i1(SR[a]) SV[d] ← SV[a]
Scalar Integer Multiword
FUNS	d, b, a, c	t ← (SR[b]_63..0∥SR[a]_63..0) >> SR[c]_5..0 SR[d] ← 240 ∥ t_63..0 SV[d] ← SV[a] & SV[b] & SV[c]
ROTRS	d, a, b	assembler expands to FUNS d, a, a, b
FUNNS	d, b, a, c	t ← (SR[b]_63..0∥SR[a]_63..0) >> (−SR[c])_5..0 SR[d] ← 240 ∥ t_63..0 SV[d] ← SV[a] & SV[b] & SV[c]
ROTLS	d, a, b	assembler expands to FUNNS d, a, a, b
ADDC	d, b, a	trap if (SV[a] & SV[b]) = 0 t ← SR[a]_63..0 + SR[b]_63..0 + CARRY₀ CARRY ← 0⁶³ ∥ t₆₄ SR[d] ← 240 ∥ t_63..0 SV[d] ← 1
MULC	d, b, a, c	trap if (SV[a] & SV[b] & SV[c]) = 0 t ← (SR[a]_63..0 ×_u SR[b]_63..0) + SR[c]_63..0 + CARRY CARRY ← t_127..64 SR[d] ← 240 ∥ t_63..0 SV[d] ← 1
DIVC	d, b, a, c	trap if (SV[a] & SV[b]) = 0 q,r ← (CARRY∥SR[a]_63..0) ÷_u SR[b]_63..0 CARRY ← r SR[d] ← 240 ∥ q SV[d] ← 1
Boolean
boba	d, c, a, b	BR[d] ← BR[c] ba (BR[a] bo BR[b]) BV[d] ← BV[a] & BV[b] & BV[c]
bobaM	d, c, a, b	VR[d] ← VM[c] ba (VM[a] bo VM[b])
Integer Comparison
acbaA	d, c, a, b	BR[d] ← BR[c] ba (AR[a] ac AR[b]) BV[d] ← AV[a] & AV[b] & BV[c]
xcbaX	d, c, a, b	BR[d] ← BR[c] xa (XR[a] xc XR[b]) BV[d] ← XV[a] & XV[b] & BV[c]
icbaS	d, c, a, b	BR[d] ← BR[c] ba (SR[a] ic SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
Floating-Point
Df1S	d, a	SR[d] ← f1_d(SR[a]) SV[d] ← SV[a]
Ff1S	d, a	SR[d] ← f1_f(SR[a]) SV[d] ← SV[a]
Hf1S	d, a	SR[d] ← f1_h(SR[a]) SV[d] ← SV[a]
Bf1S	d, a	SR[d] ← f1_b(SR[a]) SV[d] ← SV[a]
P4f1S	d, a	SR[d] ← f1_p4(SR[a]) SV[d] ← SV[a]
P3f1S	d, a	SR[d] ← f1_p3(SR[a]) SV[d] ← SV[a]
DfofaS	d, c, a, b	SR[d] ← SR[c] fa_d (SR[a] fo_d SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
FfofaS	d, c, a, b	SR[d] ← SR[c] fa_f (SR[a] fo_f SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
HfofaS	d, c, a, b	SR[d] ← SR[c] fa_h (SR[a] fo_h SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
BfofaS	d, c, a, b	SR[d] ← SR[c] fa_b (SR[a] fo_b SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
P4fofaS	d, c, a, b	SR[d] ← SR[c] fa_p4 (SR[a] fo_p4 SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
P3fofaS	d, c, a, b	SR[d] ← SR[c] fa_p3 (SR[a] fo_p3 SR[b]) SV[d] ← SV[a] & SV[b] & SV[c]
Boolean Floating-Point Comparison
DfcbaS	d, c, a, b	BR[d] ← BR[c] ba_d (SR[a] fc_d SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
FfcbaS	d, c, a, b	BR[d] ← BR[c] ba_f (SR[a] fc_f SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
HfcbaS	d, c, a, b	BR[d] ← BR[c] ba_h (SR[a] fc_h SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
BfcbaS	d, c, a, b	BR[d] ← BR[c] ba_b (SR[a] fc_b SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
P4fcbaS	d, c, a, b	BR[d] ← BR[c] ba_p4 (SR[a] fc_p4 SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]
P3fcbaS	d, c, a, b	BR[d] ← BR[c] ba_p3 (SR[a] fc_p3 SR[b]) BV[d] ← SV[a] & SV[b] & BV[c]

32‑bit instruction format with 2 sources 1 destination and 12‑bit immediate
31	28	27	20	19	16	15	12	11	8	7	4	3	0
op32g		i		c		i		a		d		op32
4		8		4		4		4		4		4

Index comparison immediate
xcbaXI	d, c, a, imm	BR[d] ← BR[c] ba (XR[a] xc imm12) BV[d] ← XV[a] & BV[c]
Scalar comparison immediate
icbaSI	d, c, a, imm	BR[d] ← BR[c] ba (SR[a] ic imm12) BV[d] ← SV[a] & BV[c]
Scalar arithmetic immediate
ioiaSI	d, c, a, imm	SR[d] ← SR[c] ia (SR[a] io imm12) SV[d] ← SV[a] & SV[c]
lolaSI	d, c, a, imm	SR[d] ← SR[c] la (SR[a] lo imm12) SV[d] ← SV[a] & SV[c]
SELSI	d, c, a, imm	SR[d] ← BR[c] ? SR[a] : imm12 SV[d] ← BV[c] & (~BR[c] \| SV[a])

32‑bit instruction format 2 sources 1 destination
31	28	27	22	21	20	19	16	15	12	11	8	7	4	3	0
op32g		op32f		m		op32c		b		a		d		op32
4		6		2		4		4		4		4		4

Address arithmetic: SUBXAA, RINGA
SUBXAA	d, a, b	XR[d] ← 240 ∥ (AR[a]_63..0 − AR[b]_63..0) XV[d] ← AV[a] & AV[b]
RINGA	d, a, b	trap if XR[b]_71..64 ≠ 26 \| XR[b]_66..64 > PC.ring AR[d] ← sizedecode(24 ∥ XR[b]_66..64 ∥ AR[a]_63..0) AV[d] ← AV[a] & XV[b]
Index arithmetic
ADDX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 + XR[b]_63..0) XV[d] ← XV[a] & XV[b]
SUBX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 − XR[b]_63..0) XV[d] ← XV[a] & XV[b]
MINUX	d, a, b	XR[d] ← 240 ∥ min_u(XR[a]_63..0, XR[b]_63..0) XV[d] ← XV[a] & XV[b]
MINSX	d, a, b	XR[d] ← 240 ∥ min_s(XR[a]_63..0, XR[b]_63..0) XV[d] ← XV[a] & XV[b]
MAXUX	d, a, b	XR[d] ← 240 ∥ max_u(XR[a]_63..0, XR[b]_63..0) XV[d] ← XV[a] & XV[b]
MAXSX	d, a, b	XR[d] ← 240 ∥ max_s(XR[a]_63..0, XR[b]_63..0) XV[d] ← XV[a] & XV[b]
Possible changes
ADDX	d, a, b, sa	XR[d] ← 240 ∥ (XR[a]_63..0 + XR[b]_63..0<<sa) XV[d] ← XV[a] & XV[b]
SUBX	d, a, b, sa	XR[d] ← 240 ∥ (XR[a]_63..0 − XR[b]_63..0<<sa) XV[d] ← XV[a] & XV[b]
Instructions for loop iteration count prediction
LOOPX	d	trap if XV[a] & XV[b] = 0 XR[d] ← XR[a] − XR[b] XV[d] ← 1
Possible additions: ADDOUX, ADDOSX, ADDUSX, SUBOUX, SUBOSX, SUBUSX, MINOUSX, MAXOUSX
Index logical
ANDX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 & XR[b]_63..0) XV[d] ← XV[a] & XV[b]
ORX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 \| XR[b]_63..0) XV[d] ← XV[a] & XV[b]
XORX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 ^ XR[b]_63..0) XV[d] ← XV[a] & XV[b]
SLLX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 <<_u XR[b]_5..0) XV[d] ← XV[a] & XV[b]
SRLX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 >>_u XR[b]_5..0) XV[d] ← XV[a] & XV[b]
SRAX	d, a, b	XR[d] ← 240 ∥ (XR[a]_63..0 >>_s XR[b]_5..0) XV[d] ← XV[a] & XV[b]
Address calculation with index shift: A
A	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← ea AV[d] ← 1 endif
Non-speculative tagged word loads with indexed addressing: L{A,X,S}
LA	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif
LX	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 XR[d] ← lvload72(ea) XV[d] ← 1 endif
LS	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 SR[d] ← lvload72(ea) SV[d] ← 1 endif
Non-speculative doubleword loads with indexed addressing: LAD (save/restore) and LAC (CHERI)
LAD	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 AR[d] ← lvload144(ea) AV[d] ← 1 endif
LAC	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload144(ea) trap if t_71..64 ≠ 232 trap if t_143..136 ≠ 251 AR[d] ← t AV[d] ← 1 endif
Non-speculative segment relative loads with indexed addressing: RLA{64,32}
RLA64	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload64(ea) AR[d] ← segrelative(AR[a], t) AV[d] ← 1 endif
RLA32	d, a, b, sa	v ← AV[a] & XV[b] if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 t ← lvload32(ea) AR[d] ← segrelative(AR[a], 0³² ∥ t) AV[d] ← 1 endif
Non-speculative sub-word loads with indexed addressing: L{A,X,S}{8,16,32,64}{U,S}
LX64	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload64(AR[a] +_p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t XV[d] ← v
LS64	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload64(AR[a] +_p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t SV[d] ← v
LX32U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +_p XR[b]<<sa) : 0 XR[d] ← 240 ∥ 0³² ∥ t XV[d] ← v
LS32U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +_p XR[b]<<sa) : 0 SR[d] ← 240 ∥ 0³² ∥ t SV[d] ← v
LX32S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +_p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t₃₁³² ∥ t XV[d] ← v
LS32S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload32(AR[a] +_p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t₃₁³² ∥ t SV[d] ← v
LX16U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +_p XR[b]<<sa) : 0 XR[d] ← 240 ∥ 0⁴⁸ ∥ t XV[d] ← v
LS16U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +_p XR[b]<<sa) : 0 SR[d] ← 240 ∥ 0⁴⁸ ∥ t SV[d] ← v
LX16S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +_p XR[b]<<sa) : 0 XR[d] ← 240 ∥ t₁₅⁴⁸ ∥ t XV[d] ← v
LS16S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]<<sa) = 0 t ← v ? lvload16(AR[a] +_p XR[b]<<sa) : 0 SR[d] ← 240 ∥ t₁₅⁴⁸ ∥ t SV[d] ← v
LX8U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← v ? lvload8(AR[a] +_p XR[b]) : 0 XR[d] ← 240 ∥ 0⁵⁶ ∥ t XV[d] ← v
LS8U	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← v ? lvload8(AR[a] +_p XR[b]) : 0 SR[d] ← 240 ∥ 0⁵⁶ ∥ t SV[d] ← v
LX8S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← lvload8(AR[a] +_p XR[b]) : 0 XR[d] ← 240 ∥ t₇⁵⁶ ∥ t XV[d] ← v
LS8S	d, a, b, sa	v ← AV[a] & XV[b] trap if v & boundscheck(AR[a], XR[b]) = 0 t ← lvload8(AR[a] +_p XR[b]) SR[d] ← 240 ∥ t₇⁵⁶ ∥ t SV[d] ← v
Load vector mask instructions with indexed addressing
LM	d, a, b, sa	v ← AV[a] & XV[b] trap if v = 0 ea ← AR[a] +_p XR[b]<<sa trap if boundscheck(AR[a], XR[b]<<sa) = 0 VM[d] ← lvload128(ea)
Speculative tagged word loads with indexed addressing: SL{A,X,S}
SLA	d, a, b, sa	v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif
SLX	d, a, b, sa	v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ XR[d] ← lvload72(ea) XV[d] ← 1 endif
SLS	d, a, b, sa	v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa) if v = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +_p XR[b]<<sa trap if ea_2..0 ≠ 0³ SR[d] ← lvload72(ea) SV[d] ← 1 endif
Speculative sub-word loads with indexed addressing: SL{X,S}{8,16,32,64}{U,S} (TBD)

32‑bit instruction format with 1 source 1 destination and 12‑bit immediate
31	28	27	20	19	16	15	12	11	8	7	4	3	0
op32g		i		op32c		i		a		d		op32
4		8		4		4		4		4		4

Index arithmetic immediate
ADDXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 + imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
ANDXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 & imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
MINUXI	d, a, imm	XR[d] ← 240 ∥ min_u(XR[a]_63..0, imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
MINSXI	d, a, imm	XR[d] ← 240 ∥ min_s(XR[a]_63..0, imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
MAXUXI	d, a, imm	XR[d] ← 240 ∥ max_u(XR[a]_63..0, imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
MAXSXI	d, a, imm	XR[d] ← 240 ∥ max_s(XR[a]_63..0, imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
RSUBXI	d, imm, a	XR[d] ← 240 ∥ ((imm12₁₁⁵²∥imm12) − XR[a]_63..0) XV[d] ← XV[a]
RSUBI	d, imm, a	SR[d] ← 240 ∥ ((imm12₁₁⁵²∥imm12) − SR[a]_63..0) SV[d] ← SV[a]
Index logical immediate
ORXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 \| imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
XORXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 ^ imm12₁₁⁵²∥imm12) XV[d] ← XV[a]
Scalar integer logical immediate
loSI	d, b, imm	SR[d] ← 240 ∥ (SR[a]_63..0 lo imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
Scalar integer arithmetic immediate
ioSI	d, b, imm	SR[d] ← 240 ∥ (SR[a]_63..0 io imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
ADDSI	d, b, imm	SR[d] ← 240 ∥ (SR[a]_63..0 + imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
SUBSI	d, b, imm	SR[d] ← 240 ∥ (SR[a]_63..0 − imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MINUSI	d, b, imm	SR[d] ← 240 ∥ min_u(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MINSSI	d, b, imm	SR[d] ← 240 ∥ min_s(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MINUSSI	d, b, imm	SR[d] ← 240 ∥ min_us(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MAXUSI	d, b, imm	SR[d] ← 240 ∥ max_u(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MAXSSI	d, b, imm	SR[d] ← 240 ∥ max_s(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
MAXUSSI	d, b, imm	SR[d] ← 240 ∥ max_us(SR[a]_63..0, imm12₁₁⁵²∥imm12) SV[d] ← SV[a]
Non-speculative tagged word load/store with immediate addressing: L{A,X,S}I
AI	d, a, imm	if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else trap if boundscheck(AR[a], imm12) = 0 AR[d] ← AR[a] +_p imm12 AV[d] ← 1 endif
LAI	d, a, imm	if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p imm12 trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], imm12) = 0 AR[d] ← sizedecode(lvload72(ea)) AV[d] ← 1 endif
LXI	d, a, imm	if AV[a] = 0 then XR[d] ← 0 XV[d] ← 0 else ea ← AR[a] +_p imm12 trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], imm12) = 0 XR[d] ← lvload72(ea) XV[d] ← 1 endif
LSI	d, a, imm	if AV[a] = 0 then SR[d] ← 0 SV[d] ← 0 else ea ← AR[a] +_p imm12 trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], imm12) = 0 SR[d] ← lvload72(ea) SV[d] ← 1 endif
Non-speculative doubleword loads with indexed addressing: LADI (save/restore) and LACI (CHERI)
LADI	d, a, imm	if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p imm12 trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], imm12) = 0 AR[d] ← lvload144(ea) AV[d] ← 1 endif
LACI	d, a, imm	if AV[a] = 0 then AR[d] ← 0 AV[d] ← 0 else ea ← AR[a] +_p imm12 trap if ea_2..0 ≠ 0³ trap if boundscheck(AR[a], imm12) = 0 t ← lvload144(ea) trap if t_71..64 ≠ 232 trap if t_143..136 ≠ 251 AR[d] ← t AV[d] ← 1 endif
Non-speculative sub-word load/store with immediate addressing: L{X,S}{8,16,32,64}{U,S}I
LX64I	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload64(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ t XR[d] ← v
LX32UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ 0³² ∥ t XR[d] ← v
LS32UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ 0³² ∥ t SV[d] ← v
LX32SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ t₃₁³² ∥ t XV[d] ← v
LS32SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload32(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ t₃₁³² ∥ t SV[d] ← v
LX16UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ 0⁴⁸ ∥ t XV[d] ← v
LS16UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ 0⁴⁸ ∥ t SV[d] ← v
LX16SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ t₁₅⁴⁸ ∥ t XV[d] ← v
LS16SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload16(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ t₁₅⁴⁸ ∥ t SV[d] ← v
LX8UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ 0⁵⁶ ∥ t XV[d] ← v
LS8UI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ 0⁵⁶ ∥ t SV[d] ← v
LX8SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +_p imm12) : 0 XR[d] ← 240 ∥ t₇⁵⁶ ∥ t XV[d] ← v
LS8SI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm12) = 0 t ← v ? lvload8(AR[a] +_p imm12) : 0 SR[d] ← 240 ∥ t₇⁵⁶ ∥ t SV[d] ← v
Speculative word load/store with immediate addressing: L{A,X,S}I
SLAI	d, a, imm	v ← AV[a] & boundscheck(AR[a], imm12) AR[d] ← v ? lvload72(AR[a] +_p imm12) : 0 AV[d] ← v
SLXI	d, a, imm	v ← AV[a] & boundscheck(AR[a], imm12) XR[d] ← v ? lvload72(AR[a] +_p imm12) : 0 XV[d] ← v
SLSI	d, a, imm	v ← AV[a] & boundscheck(AR[a], imm12) SR[d] ← v ? lvload72(AR[a] +_p imm12) : 0 SV[d] ← v
Instructions for loop iteration count prediction
LOOPXI	d	trap if XV[a] = 0 XR[d] ← XR[a] + imm12₁₁⁵²∥imm12 XV[d] ← 1

32‑bit instruction format 2 sources 1 destination
31	28	27	22	21	16	15	12	11	8	7	4	3	0
op32g		op32f		op32c		op32b		a		d		op32
4		6		6		4		4		4		4

Instructions for save/restore
MOVSB	d, a	SR[d] ← 240 ∥ 0⁶³ ∥ BR[a] SV[d] ← BV[a]
MOVBS	d, a, imm6	BR[d] ← SR[a]_imm6 BV[d] ← SV[a]
MOVSBALL	d	SR[d] ← 240 ∥ 0³² ∥ BV[15]∥BV[14]∥…∥BV[1]∥1 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0 SV[d] ← 1
MOVBALLS	d	BR[1] ← SR[a]₁ BR[2] ← SR[a]₂ ︙ BR[15] ← SR[a]₁₅ BV[1] ← SR[a]₁₇ BV[2] ← SR[a]₁₈ ︙ BV[15] ← SR[a]₃₁
MOVXAVALL	d	XR[d] ← 240 ∥ 0⁴⁸ ∥ AV[15]∥AV[14]∥…∥AV[1]∥AV[0] XV[d] ← 1
MOVAVALLX	d	AV[1] ← XR[a]₁ AV[2] ← XR[a]₂ ︙ AV[15] ← XR[a]₁₅
MOVXXVALL	d	XR[d] ← 240 ∥ 0⁴⁸ ∥ XV[15]∥XV[14]∥…∥XV[1]∥XV[0] XV[d] ← 1
MOVXVALLX	d	XV[1] ← XR[a]₁ XV[2] ← XR[a]₂ ︙ XV[15] ← XR[a]₁₅
MOVSM	d, m, w	SR[d] ← 240 ∥ VM[a]_{w×64+63..w×64} SV[d] ← 1
MOVMS	d, a, w	trap if SV[a] = 0 VM[d]_{w×64+63..w×64} ← SR[a]

32‑bit instruction format with 2 sources 1 destination and 6‑bit immediate
31	28	27	22	21	16	15	12	11	8	7	4	3	0
op32g		op32f		imm6		b		a		d		op32
4		6		6		4		4		4		4

FUNSI	d, a, b, i	t ← (SR[b]_63..0∥SR[a]_63..0) >> imm6 SR[d] ← 240 ∥ t_63..0 SV[d] ← SV[a] & SV[b]
ROTRSI	d, a, i	assembler expands to FUNSI d, a, a, i
ROTLSI	d, a, i	assembler expands to FUNSI d, a, a, (−i)_5..0
SLLXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 <<_u imm6) XV[d] ← XV[a]
SRLXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 >>_u imm6) XV[d] ← XV[a]
SRAXI	d, a, imm	XR[d] ← 240 ∥ (XR[a]_63..0 >>_s imm6) XV[d] ← XV[a]

32‑bit instruction format 3 sources 0 destination
31	28	27	22	21	20	19	16	15	12	11	8	7	4	3	0
op32g		op32f		m		c		b		a		op32d		op32
4		6		2		4		4		4		4		4

Store address instructions with indexed addressing
SA	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]_2..0 + XR[b]_2..0) ≠ 0³ lvstore72(AR[a] +_p XR[b]<<sa) ← AR2mem72(AR[c])
SAD	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]_3..0 + XR[b]_3..0) ≠ 0⁴ lvstore144(AR[a] +_p XR[b]<<sa) ← AR2mem144(AR[c])
SAC	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 trap if (AR[a]_3..0 + XR[b]_3..0) ≠ 0⁴ lvstore144(AR[a] +_p XR[b]<<sa) ← AR2CHERImem144(AR[c])
Store index instructions with indexed addressing
SX	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore72(AR[a] +_p XR[b]<<sa) ← XR[c]
SX64	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore64(AR[a] +_p XR[b]<<sa) ← XR[c]_63..0
SX32	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore32(AR[a] +_p XR[b]<<sa) ← XR[c]_31..0
SX16	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore16(AR[a] +_p XR[b]<<sa) ← XR[c]_15..0
SX8	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore8(AR[a] +_p XR[b]<<sa) ← XR[c]_7..0
Store scalar instructions with indexed addressing
SS	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore72(AR[a] +_p XR[b]<<sa) ← SR[c]
SS64	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore64(AR[a] +_p XR[b]<<sa) ← SR[c]_63..0
SS32	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore32(AR[a] +_p XR[b]<<sa) ← SR[c]_31..0
SS16	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore16(AR[a] +_p XR[b]<<sa) ← SR[c]_15..0
SS8	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore8(AR[a] +_p XR[b]<<sa) ← SR[c]_7..0
Store vector mask instructions with indexed addressing
SM	c, a, b, sa	trap if (AV[a] & XV[b] & AV[c]) = 0 lvstore128(AR[a] +_p XR[b]<<sa) ← VM[c]
Branch instructions
Bboba	c, a, b	branch if BR[c] ba (BR[a] bo BR[b])
BbaEQA	c, a, b	branch if BR[c] ba (AR[a] = AR[b])
BbaEQX	c, a, b	branch if BR[c] ba (XR[a] = XR[b])
BbaNEA	c, a, b	branch if BR[c] ba (AR[a] ≠ AR[b])
BbaNEX	c, a, b	branch if BR[c] ba (XR[a] ≠ XR[b])
BbaLTAU	c, a, b	branch if BR[c] ba (AR[a] <_u AR[b])
BbaLTXU	c, a, b	branch if BR[c] ba (XR[a] <_u XR[b])
BbaGEAU	c, a, b	branch if BR[c] ba (AR[a] ≥_u AR[b])
BbaGEXU	c, a, b	branch if BR[c] ba (XR[a] ≥_u XR[b])
BbaLTX	c, a, b	branch if BR[c] ba (XR[a] <_s XR[b])
BbaGEX	c, a, b	branch if BR[c] ba (XR[a] ≥_s XR[b])
BbaNONEX	c, a, b	branch if BR[c] ba ((XR[a] & XR[b]) = 0)
BbaANYX	c, a, b	branch if BR[c] ba ((XR[a] & XR[b]) ≠ 0)
assembler simplified versions of the above
Bbo	a, b	equivalent to BORbo b0, a, b
BEQA	a, b	equivalent to BOREQA b0, a, b
BEQX	a, b	branch if XR[a] = XR[b]
BNEA	a, b	branch if AR[a] ≠ AR[b]
BNEX	a, b	branch if XR[a] ≠ XR[b]
BLTAU	a, b	branch if AR[a] <_u AR[b]
BLTXU	a, b	branch if XR[a] <_u XR[b]
BGEAU	a, b	branch if AR[a] ≥_u AR[b]
BGEXU	a, b	branch if XR[a] ≥_u XR[b]
BLTX	a, b	branch if XR[a] <_s XR[b]
BGEX	a, b	branch if XR[a] ≥_s XR[b]
BNONEX	a, b	branch if (XR[a] & XR[b]) = 0
BANYX	a, b	branch if (XR[a] & XR[b]) ≠ 0

32‑bit instruction format 2 sources 0 destination with 12‑bit immediate
31	28	27	20	19	16	15	12	11	8	7	4	3	0
op32g		i		c		i		a		op32d		op32
4		8		4		4		4		4		4

Store address instructions with immediate addressing
SAI	c, a, imm	lvstore72(AR[a] +_p imm12) ← AR[c]
SADI	c, a, imm	lvstore144(AR[a] +_p imm12) ← AR[c]
Store index instructions with immediate addressing
SXI	c, a, imm	lvstore72(AR[a] +_p imm12) ← XR[c]
SX64I	c, a, imm	lvstore64(AR[a] +_p imm12) ← AR[c]_63..0
SX32I	c, a, imm	lvstore32(AR[a] +_p imm12) ← AR[c]_31..0
SX16I	c, a, imm	lvstore16(AR[a] +_p imm12) ← AR[c]_15..0
SX8I	c, a, imm	lvstore8(AR[a] +_p imm12) ← AR[c]_7..0
Store scalar instructions with immediate addressing
SSI	c, a, imm	lvstore72(AR[a] +_p imm12) ← SR[c]
SS64I	c, a, imm	lvstore64(AR[a] +_p imm12) ← SR[c]_63..0
SS32I	c, a, imm	lvstore32(AR[a] +_p imm12) ← SR[c]_31..0
SS16I	c, a, imm	lvstore16(AR[a] +_p imm12) ← SR[c]_15..0
SS8I	c, a, imm	lvstore8(AR[a] +_p imm12) ← SR[c]_7..0
Branch instructions with immediate comparison
BbaEQXI	c, b, imm12	branch if BR[c] ba (XR[a] = imm12)
BbaNEXI	c, a, imm12	branch if BR[c] ba (XR[a] ≠ imm12)
BbaLTUXI	c, a, imm12	branch if BR[c] ba (XR[a] <_u imm12)
BbaGEUXI	c, a, imm12	branch if BR[c] ba (XR[a] ≥_u imm12)
BbaLTXI	c, a, imm12	branch if BR[c] ba (XR[a] <_s imm12)
BbaGEXI	c, a, imm12	branch if BR[c] ba (XR[a] ≥_s imm12)
BbaNONEXI	c, a, imm12	branch if BR[c] ba ((XR[a] & imm12) = 0)
BbaANYXI	c, a, imm12	branch if BR[c] ba ((XR[a] & imm12) ≠ 0)
assembler simplified versions of the above
BEQXI	a, imm	equivalent to BOREQXI b0, a, imm
BNEXI	a, imm	equivalent to BORNEXI b0, a, imm
BLTUXI	a, imm	equivalent to BORLTUXI b0, a, imm
BGEUXI	a, imm	equivalent to BORGEUXI b0, a, imm
BLTXI	a, imm	equivalent to BORLTXI b0, a, imm
BGEXI	a, imm	equivalent to BORGEXI b0, a, imm
BNONEXI	a, imm	equivalent to BORNONEXI b0, a, imm
BANYXI	a, imm	equivalent to BORANYXI b0, a, imm
Instructions yet to be grouped
SWITCHI	a, imm	PC ← AR[a] +_p imm12
LJMPI	a, imm	PC ← lvload72(AR[a] +_p imm12)
LJMP	a, b	PC ← lvload72(AR[a] +_p XR[b]<<sa)
FENCE		This is a placeholder for various FENCE instructions that need to be defined.
WFI	a	Wait For Interrupt for the current ring. May be intercepted by more privileged rings. Execution resumes after the interrupt is serviced (that is the return from interrupt goes to the following instruction). (Perhaps consider making this a BB descriptor type?) This is intended to be used when the processor has nothing to do, and is expected to reduce power consumption. For the duration of the wait, the interrupt enables are set to IntrEnable[PC.ring] \| XR[a]. That is, the operand specifies additional interrupts to enable. This allows software to disable interrupts, check for work, and if not, use WFI to wait for work to arrive without a window where an interrupt could occur before the WFI, return, and then wait when there is work to be done.
WFP	a	Wait For Interrupt Pending for the current ring. May be intercepted by more privileged rings. Execution resumes after InterruptPending[PC.ring] & XR[a] becomes non-zero. This may be used to wait until a particular cycle count is reached.
WAIT	a	Wait For memory location change. May be intercepted by more privileged rings.
HALT		The processor finishes all outstanding operands and halts. It will only be woken by Soft Reset. Ring 7 only.
BREAK		This is a placeholder for later definition.
ILL		This is a placeholder for later definition.
CSR*		This is a placeholder for later definition.
fmtCLASSS		This is a placeholder for later definition.

32‑bit instruction format with 1 source 1 destination and 16‑bit immediate
31	28	27	12	11	8	7	4	3	0
op32g		imm16		a		d		op32
4		16		4		4		4

AI	d, a, imm	v ← AV[a] trap if v & boundscheck(AR[a], imm16) = 0 AR[d] ← AR[a] +_p imm16 AV[d] ← v
Stack frame allocation for upward and downward stacks
ENTRY	d, a, imm8	trap if imm8 ≥ 192 osp ← AR[a] oring ← osp_135..133 osize ← 0³ ∥ osp_132..72 oaddr ← osp_63..0 naddr ← oaddr + osize e ← imm8_7..4 nsize ← e = 0 ? 0⁵⁴∥imm8_3..0∥0³ : 0^53−s∥1∥imm8_3..0∥0^e∥0² nring ← min(PC.ring, oring) ssize ← segsize(oaddr) trap if naddr_63..ssize ≠ oaddr_63..ssize nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr lvstore72(nsp) ← osp_71..0 AR[d] ← nsp
ENTRYD	d, a, imm8	trap if imm8 ≥ 192 osp ← AR[a] oring ← osp_135..133 oaddr ← osp_63..0 e ← imm8_7..4 nsize ← e = 0 ? 0⁵⁴∥imm8_3..0∥0³ : 0^53−s∥1∥imm8_3..0∥0^e∥0² naddr ← oaddr − nsize nring ← min(PC.ring, oring) ssize ← segsize(oaddr) trap if naddr_63..ssize ≠ oaddr_63..ssize nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr lvstore72(nsp) ← osp_71..0 AR[d] ← nsp

32‑bit instruction format with 24‑bit immediate
31	8	7	4	3	0
imm24		d		op32
24		4		4

XI	d, imm	XR[d] ← 240 ∥ imm24₂₃⁴⁰∥imm24 XV[d] ← 1
XUI	d, imm	XR[d] ← 240 ∥ imm24∥0⁴⁰ XV[d] ← 1
SI	d, imm	SR[d] ← 240 ∥ imm24₂₃⁴⁰∥imm24 SV[d] ← 1
SUI	d, imm	SR[d] ← 240 ∥ imm24∥0⁴⁰ SV[d] ← 1
DUI	d, imm	SR[d] ← 244 ∥ imm24∥0⁴⁰ SV[d] ← 1
FI	d, imm	SR[d] ← 245 ∥ 0³²∥imm24∥0⁸ SV[d] ← 1

48‑bit op48
1:0 3:2	0	1	2	3
0	3XI	3XUI	3SI	3SUI
1	3ADDXI	3ADDXUI	3ADDSI	3ADDSUI
2	3ANDXI	3ANDUXI	i48v
3	3ORXI	3ORXUI	3FI	3DUI

48‑bit instruction format
47	24	23	20	19	16	15	12	11	8	7	4	3	0
op48dabcm		e		c		b		a		d		op48
24		4		4		4		4		4		4

Vector-Vector Untyped
3SELVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] VR[d] ← select(VM[c], VR[a], VR[b]) VL[d] ← VL[a]
3EOVV	d, c, a, b	trap if VL[a] ≠ VL[b] trap if VL[a] > 64 VR[d] ← interleave(VR[a], VR[b]) VL[d] ← VL[a] × 2
Vector-Scalar Untyped
3SELTVS	d, a, b	VR[d] ← select(VM[c], VR[a], SR[b]) VL[d] ← VL[a]
3SELFVS	d, c, a, b	VR[d] ← select(~VM[c], VR[a], SR[b]) VL[d] ← VL[a]
Vector-Immediate Untyped
3SELTVI	d, c, a, imm12	VR[d] ← select(VM[c], VR[a], imm12) VL[d] ← VL[a]
3SELFVI	d, c, a, imm12	VR[d] ← select(~VM[c], VR[a], imm12) VL[d] ← VL[a]
Vector Integer
3i1V	d, a	VR[d] ← i1(VR[a]) VL[d] ← VL[a]
Vector-Vector-Vector Integer
3ioiaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] ia (VR[a] io VR[b]) VL[d] ← VL[a]
3lolaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] la (VR[a] lo VR[b]) VL[d] ← VL[a]
Vector-Vector-Scalar Integer
3ioiaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] ia (VR[a] io SR[b]) VL[d] ← VL[a]
3lolaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] la (VR[a] lo SR[b]) VL[d] ← VL[a]
Vector-Vector-Immediate Integer
3ioiaVI	d, c, a, imm, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] ia (VR[a] io imm) VL[d] ← VL[a]
3lolaVI	d, c, a, imm, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] la (VR[a] lo imm) VL[d] ← VL[a]
Vector-Vector integer comparison Comparisons set all bits of VM[d] regardless of VL[a]
3icbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] ic VR[b])
Vector-Scalar integer comparison
3icbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] ic SR[b])
Vector-Immediate integer comparison
3icbaVI	d, c, a, imm	VM[d] ← VM[c] ba (VR[a] ic imm12)
Vector-Vector-Vector Floating-Point
3DfofaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_d (VR[a] fo_d VR[b]) VL[d] ← VL[a]
3FfofaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_f (VR[a] fo_f VR[b]) VL[d] ← VL[a]
3HfofaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_s (VR[a] fo_h VR[b]) VL[d] ← VL[a]
3BfofaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_s (VR[a] fo_b VR[b]) VL[d] ← VL[a]
3P4fofaVV	d, c, a, b, e	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_b (VR[a] fo_p4 VR[b]) VL[d] ← VL[a]
3P3fofaVV	d, c, a, b, m	trap if VL[a] ≠ VL[b] trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_b (VR[a] fo_p3 VR[b]) VL[d] ← VL[a]
Vector-Vector-Scalar Floating-Point
3DfofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_d (VR[a] fo_d SR[b]) VL[d] ← VL[a]
3FfofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_f (VR[a] fo_f SR[b]) VL[d] ← VL[a]
3HfofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_h (VR[a] fo_h SR[b]) VL[d] ← VL[a]
3BfofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_s (VR[a] fo_b SR[b]) VL[d] ← VL[a]
3P4fofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_b (VR[a] fo_p4 SR[b]) VL[d] ← VL[a]
3P3fofaVS	d, c, a, b, m	trap if VL[a] ≠ VL[c] VR[d] ← VR[c] fa_b (VR[a] fo_p3 SR[b]) VL[d] ← VL[a]
Vector-Vector floating comparison
3DfcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_d VR[b])
3FfcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_f VR[b])
3HfcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_h VR[b])
3BfcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_b VR[b])
3P4fcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_p4 VR[b])
3P3fcbaVV	d, c, a, b	trap if VL[a] ≠ VL[b] VM[d] ← VM[c] ba (VR[a] fc_p3 VR[b])
Vector-Scalar floating comparison
3DfcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_d SR[b])
3SfcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_s SR[b])
3HfcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_h SR[b])
3BfcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_b SR[b])
3P4fcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_p4 SR[b])
3P3fcbaVS	d, c, a, b	VM[d] ← VM[c] ba (VR[a] fc_p3 SR[b])
Matrix Floating-Point Outer Product (matrix dimensions VL[a] × VL[b])
3DOPVV	d, c, a, b	MA[d] ← MA[c] +_d outerproduct_d(VR[a], VR[b])
3FOPVV	d, c, a, b	MA[d] ← MA[c] +_f outerproduct_f(VR[a], VR[b])
3HOPVV	d, c, a, b	MA[d] ← MA[c] +_h outerproduct_h(VR[a], VR[b])

48‑bit instruction format with source and 36‑bit immediate
47	12	11	8	7	4	3	0
imm36		a		d		op48
36		4		4		4

3ADDXI	d, a, imm	XR[d] ← XR[a] + imm36₃₅²⁸∥imm36 XV[d] ← XV[a]
3ADDXUI	d, a, imm	XR[d] ← XR[a] + imm36∥0²⁸ XV[d] ← XV[a]
3ANDSI	d, a, imm	SR[d] ← SR[a] & imm36₃₅²⁸∥imm36 SV[d] ← SV[a]
3ANDSUI	d, a, imm	SR[d] ← SR[a] & imm36∥0²⁸ SV[d] ← SV[a]
3ORSI	d, a, imm	SR[d] ← SR[a] \| imm36₃₅²⁸∥imm36 SV[d] ← SV[a]
3ORSUI	d, a, imm	SR[d] ← SR[a] \| imm36∥0²⁸ SV[d] ← SV[a]
3ADDDUI	d, a, imm	SR[d] ← SR[a] +_d imm36∥0²⁸ SV[d] ← SV[a]

48‑bit instruction format with 40‑bit immediate
47	8	7	4	3	0
imm40		d		op48
40		4		4

3XI	d, imm	XR[d] ← 240 ∥ imm40₃₉²⁴∥imm40 XV[d] ← 1
3XUI	d, imm	XR[d] ← 240 ∥ imm40∥0²⁴ XV[d] ← 1
3SI	d, imm	SR[d] ← 240 ∥ imm40₃₉²⁴∥imm40 SV[d] ← 1
3SUI	d, imm	SR[d] ← 240 ∥ imm40∥0²⁴ SV[d] ← 1
3DUI	d, imm	SR[d] ← 244 ∥ imm40∥0²⁴ SV[d] ← 1
3FI	d, imm	SR[d] ← 245 ∥ 0²⁴∥imm40 SV[d] ← 1

64‑bit op64
1:0 3:2	0	1	2	3
0	4I	4UI	4SI	4SUI
1
2
3	4FI	4DUI

64‑bit instruction format with 56‑bit immediate
63	8	7	4	3	0
imm56		d		op64
56		4		4

4XI	d, imm	XR[d] ← 240 ∥ imm56₅₅⁸∥imm56 XV[d] ← 1
4XUI	d, imm	XR[d] ← 240 ∥ imm56∥0⁸ XV[d] ← 1
4SI	d, imm	SR[d] ← 240 ∥ imm56₅₅⁸∥imm56 SV[d] ← 1
4SUI	d, imm	SR[d] ← 240 ∥ imm56∥0⁸ SV[d] ← 1
4DUI	d, imm	SR[d] ← 244 ∥ imm56∥0⁸ SV[d] ← 1
4FI	d, imm	SR[d] ← 245 ∥ 0⁸∥imm56 SV[d] ← 1

64‑bit instruction format
63	24	23	20	19	16	15	12	11	8	7	4	3	0
op64dabcm		e		c		b		a		d		op64
40		4		4		4		4		4		4

Software Conventions

Data Types

I expect SecureRISC software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).

Register Names and Uses

Address registers are named a0 to a15.
Another name for a15 is sp (Stack Pointer).
Another name for a14 is fp (Frame Pointer), but use of a frame pointer is optional, and is typically only manifest when variable sized stack allocation is done.
Another name for a1 is cp (Context Pointer).
Index registers are named x0 to x15.
Boolean registers are named b0 to b15. b0 is hardwired to zero, and writes to b0 are negative assertions (i.e., attempts to write 1 trap). Thus, the assembler provides trap mnemonics that implicitly supply b0 as a destination.
Scalar registers are named s0 to s15.
Vector/matrix registers are named v0 to v15.
Vector Length registers are named vl0 to vl15.
Vector mask registers are named vm0 to vm15. Omitting a mask register specification is equivalent to specifying vm0, which is hardwired to all ones.
Matrix accumulators are named m0 to m3.
a2 to a9 are used for passing pointers to functions.
x2 to x9 are used for passing integers to functions.
a2 to a9 are used for returning pointers from functions (though most languages only require one or two return values). This allows the return values of a function to be passed to another as arguments.
x2 to x9 are used for returning integers from functions (though most languages only require one or two return values). This allows the return values of a function to be passed to another as arguments.
a10 to a15 are preserved by the callee (i.e., saved and restored on entry and exit if used). The callee need not preserve a0 to a9. Compilers are free to use different rules within a compilation unit, and in particular may decide whether a1 is preserved or not in a given set of siblings, since they are all in the same compilation unit. Typically, a1 would be preserved by the compiler in most cases, since this avoids saves and restores.
x10 to x15 are preserved by the callee (i.e., saved and restored on entry and exit if used). The callee need not preserve x0 to x9.
None of v0 to v15, nor vm0 to vm15 are preserved by callees.
Because call and return push and pop return addresses onto a separate stack (addressed by CSP[PC.ring]), some functions may not need stack frames at all. This is most likely in functions that make no calls but could be true in non-leaf functions as well.
For functions that allocate stack frames, the caller’s a15 (sp) is stored into the stack frame using SAI or SADI in the prologue and restored using LAI LADI in the epilogue. This preserves the caller’s stack frame size in sp, which would be lost if sp addition were used. The SAI/SADI will be bundled into an ENTRY sp, sp, framesizetag instruction when that gets defined, something along the lines of the following for upward-growing stacks:
trap if framesizetag ≥ 128
framesize ← tagsize(framesizetag)
o ← AR[a]
trap if o_71..64 ≥ 128 | o_143..136 ≠ 251 | o_135..133 > PC.ring
ssize ← segsize(o_63..0)
n ← o_63..0 + 0³∥o_132..72
trap if n_63..ssize ≠ o_63..ssize
AR[d] ← 251 ∥ framesize ∥ framesizetag ∥ n
lvstore144(AR[d]) ← o
This happens to make the stack a linked list of sequentially allocated blocks.
Function pointers (e.g., as used for dynamic linking, or function variables) consist of a code pointer and a context pointer, which might be loaded into a0 and a1, and then indirect call made to a0. Thus, a1 is passed by the caller to callee as the context pointer for the callee. For a top-level function in a compilation unit, this is typically the Thread-Local Storage (TLS) of the compilation unit, if any, or if not, the global variables and dynamic links of the compilation unit. When TLS is present, a pointer to the global variables is stored in the TLS. For a method, a1 is the object (the self-pointer). For a function lexically nested within a parent function, this a pointer to its lexical parent stack frame. Compilers may (and are encouraged to) use skip parent, grandparent, etc. frames as the context pointer for a callee when these are not needed by the callee (i.e., when the callee does not reference values from its parent’s stack frame), but to only use as a context pointer the first ancestor required. This allows the callee to make the fewest series of loads to get to the lexically scoped variables TLS or globals. Functions with a lineal descendant that references both its variables and variables of ancestor variables, store their a1 in a stack frame offset known to the descendant.
While a1 is unpreserved in the general calling convention, the compiler is free to know which functions in the same compilation unit happen to preserve a1. This may allow calls between siblings in the compilation unit to simply leave a1 constant from caller to callee. In particular, the top-level functions of a compilation unit all use a1 to reference global variables and need not save and restore it when calling each other, if a1 is known to be preserved.
a15 is the program stack pointer. Functions typically only use a frame pointer when variable allocation is done on the stack, or the stack frame is large. a14 is typically used as a frame pointer in these cases.
s0 to s9 are used for passing floating-point arguments to functions.
s0 to s9 are used for returning floating-point values from functions (though most languages only require one or two return values).
s10 to s15 are preserved by the callee (i.e., saved and restored on entry and exit if used). The callee need not preserve s0 to s9.
b1 to b15 are not used for argument passing and return values and are not preserved by callees. Booleans are passed as integers. A caller that wants to preserve BR values may store them to memory or to a preserved SR, and later restore.
The assembler will need to have pseudo-instructions that help build the BB descriptors. For example, to give the target of relative conditional branches and calls, the assembler
bb PREV|...,TYPE[|OPTIONorHINT|...][,PARAM]
will not generate an instruction, but will indicate the start of a basic block of TYPE and with a prev field from the |-separated list PREV. Optional parameters may follow TYPE, such as a LABEL for types %ubranch, %cbranch, %rcall or %crcall.
The typical function with a stack frame will begin with
[LABEL:] bb %prcall|%picall,TYPE[|OPTIONorHINT][,PARAM]
entryi sp, sp, FRAMESIZE
where the assembler will encode FRAMESIZE into the expected tag. The function will end with
bb PREV|...,%return
// restore other registers here lai sp, sp, 0
As an example, the fibonacci(n) function might be (no stack frame required):
fibonacci:
  bb %prcall,%creturn|%loopx
  loopxi x0, x2, 0 // count ← n
  beqz x2 // return if n = 0
  xi x2, 0 // f ← fibonacci(0)
  xi x3, 1 // g ← fibonacci(1)
l:
  bb %pfallthrough,%loop,l
  movex x4, x2 // prevf ← f
  movex x2, x3 // f ← g
  addx x3, x4, x3 // g ← prevf + g
  sobx x0, x0 // count ← count - 1, if count ≠ 0 goto l
  bb %pfallthrough,%return
  // no instructions (return value is x2)
which could be optimized to (mostly to illustrate the BB types)
fibonacci:
  bb %prcall,%creturn|%loopx
  loopxi x0, x2, 0 // count ← n
  beqz x2 // return if n = 0
  xi x2, 0 // f ← fibonacci(0)
  xi x3, 1 // g ← fibonacci(1)
leven:
  bb %pfallthrough,%loop,lodd
  addx x2, x2, x3 // f ← f + g
  sobx x0, x0 // count ← count - 1, if count ≠ 0 goto lodd
  bb %pfallthrough,%return
  movex x2, x3 // return g
lodd:
  bb %pbranch,%loop,leven
  addx x3, x3, x2 // g ← g + f
  sobx x0, x0 // count ← count - 1, if count ≠ 0 goto leven
  bb %fallthrough,%return
  // no instructions (return value is x2)
As an example of a function with a stack frame:
example:
  bb %prcall|%picall,%cbranch
  entry sp, sp, 128
  sx x10, sp, 8 // save three registers
  sx x11, sp, 16 // ...
  sad a10, sp, 32 // ...
  ⋮
done:
  bb %pfallthrough|%pbranch,%return
  lx x10, sp, 8 // restore saved registers
  lx x11, sp, 16 // ...
  lad a10, sp, 32 // ...
  lai sp, sp, 0
Code segments may be protected execute-only (X), execute and read (RX), or execute, read, and write (RWX, e.g., for dynamic code generation). Code segments mapped from files would typically not be writeable. Any static data in the code segment is typically copied into the heap on loading and the pointer to this data is passed to functions in the code segment as the context pointer (aka cp or a1). Dynamic linking is accomplished by loading the context pointer into a1 and LJMPI to load the code pointer.
CARRY, VLs, VMs, VRs, and MAs are not preserved by callees.

Direct Mapping and Paging

Introduction to Translation

Translation is a two-stage process, where in the first stage a Local Virtual Address (lvaddr) is translated to a System Virtual Address (svaddr), and then in the second stage that address is then translated to a System Interconnect Address (siaddr). The lvaddr→svaddr translation may involve multiple svaddr reads, each of which has to also be translated to a siaddr during the process. A full translation is therefore very costly and is typically cached as direct lvaddr→siaddr to make the process much faster after the first time. The following sections first describe the lvaddr→svaddr process, and then subsequent sections describe the svaddr→siaddr process. These translations are somewhat similar, with minor differences. Once the first stage lvaddr→svaddr process is understood, the second stage svaddr→siaddr process will be straightforward. Some systems may set up a minimal second stage translation process, but the process is still important for determining the cache and memory attributes and permissions, as stored in Region Descriptor Table (RDT).

Translation of local virtual addresses (lvaddrs) to system interconnect addresses (siaddrs) is typically performed in a single processor cycle in one of several L1 translation caches (often called TLBs), which may be supplemented with one or more L2 TLBs. If the TLBs fail to provide translate the address, then the processor performs a lengthier procedure, and if that succeeds, then the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described earlier, SecureRISC uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC segmentation is that is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size). After the segment descriptor, 0 to 4 levels of page table walk are used to complete the translation, as depending on the table values set by the supervisor. While 4‑level page tables are supported, SecureRISC is designed to avoid this if the operating system can use its features, as multiple-level page tables needlessly increase the TLB miss penalty.

SecureRISC segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate to I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits and leaving the low bits (below the segment size) unchanged.

Paging

Processors today all implement some form of paging in their virtual address translation. Paging exists for several reasons. The most critical today is to simplify memory allocation in the operating system, as without paging, it would be necessary to find contiguous regions of memory to assign to address spaces. A secondary purpose is to allow a larger virtual address space than physical memory, which performs reasonably if the working set of the process fits in the physical memory (i.e., it does not use all of its virtual memory all the time).

Page Size Issues

A critical processor design decision is the choice of a page size or page sizes. If minimizing memory overhead is the criteria, it is well known that the optimal page size for an area of virtual memory is proportional to the square root of that memory size. Back in the 1960s, 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size back to minimize the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty is then a factor in the page size consideration that did not exist in the 1960s.

In addition, today regions of memory vary wildly in size in computer systems, with many processes having small code regions, a small stack region, and a heap that may be small, large, or huge, and sometimes the size is dependent upon input parameters. Even in processors that support multiple page sizes, size is often set for the entire system. When page size is variable at runtime, there may be only one value for the entire process virtual address space, which makes the value for setting be sub-optimal for code, stack, or heap, depending on which is chosen for optimization. Further, memory overhead is not the only criteria of importance. Larger page sizes minimize translation cache misses and therefore improve performance at the cost of memory wastage. Larger page sizes may also reduce the translation cache miss penalty when multi-level page tables are used (as is common today), by potentially reducing the number of levels to be read on a miss.

A major advantage of segmentation is that it becomes possible to choose different page sizes on a per segment basis. Each shared library and the main program are individual segments containing code, and each could have a page size appropriate to its size. The stack and heap segments can likewise have different page sizes from the code segments and each other. Choosing a page size based on the square root of the segment size not only minimizes memory wastage, but it can also keep the page table a single level, which minimizes the translation cache miss penalty.

There is a cost to implementing multiple page sizes in the operating system. Typically, free lists are maintained for each page size, and when a smaller page free list is empty, a large page is split up. The reverse process, of coalescing pages, is more involved, as it may be necessary to migrate one or more small pages to put back together what was split apart. This however has been implemented in operating systems and made to work well.

There is also a cost to implementing multiple page sizes in translation caches (typically called TLBs though that is a terrible name). The most efficient hardware for translation caches would prefer a single page size, or failing that, a fairly small number of page sizes. Page size flexibility can affect critical processor timing paths. Despite this, the trend has been toward supporting a small number of page sizes. The inclusion of a vector architecture helps to address this issue, as vector loads and stores are not as latency sensitive as scalar loads and stores, and therefore can go directly to an L2 translation cache, which is both larger, and as a result of being larger slower, and therefore better able to absorb the cost of multiple page size matching. Much of the need for larger sizes occurs in applications with huge memory needs, and these applications are often able to exploit the vector architecture.

It may help to consider what historical architectures have for page size options. According to Wikipedia other 64‑bit architectures have supported the following page sizes:

Page Sizes in Other 64‑bit Architectures
Architecture	4 KiB	8 KiB	16 KiB	64 KiB	2 MiB	1 GiB	Other
MIPS	✔		✔	✔			256 KiB, 1 MiB, 4 MiB, 16 MiB
x86-64	✔				✔	✔
ARM	✔		✔	✔	✔	✔	32 MiB, 512 MiB
RISC‑V	✔				✔	✔	512 GiB, 256 TiB
Power	✔			✔			16 MiB, 16 GiB
UltraSPARC		✔		✔			512 KiB, 4 MiB, 32 MiB, 256 MiB, 2 GiB, 16 GiB
IA-64	✔	✔		✔			256 KiB, 1 MiB, 4 MiB, 16 MiB, 256 MiB
SecureRISC?	✔		✔	✔			256 KiB

The only very common page size is 4 KiB, with 64 KiB, 2 MiB, and 1 GiB being somewhat common second page sizes. I believe 4 KiB has been carried forward from the 1960s for compatibility reasons as there probably exists some application and device driver software where page size assumptions exist. It would be interesting to know how often UltraSPARC encountered porting problems with its 8 KiB minimum page size. Today 8 KiB or 16 KiB pages make more technical sense for a minimum page size, but application assumptions may suggest keeping the old 4 KiB minimum and introducing at least one more page size to reduce translation cache miss rates.

RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads (they are all derived from the 4 KiB page being used at each level of the table walk). These early outs do reduce translation cache miss penalties, but they do complicate TLB matching, as mentioned earlier. To RISC‑V’s credit, it introduced a new PTE format (under the Svnapot extension) that communicates to processors that can take advantage of it that groups of PTEs are consistent and can be implemented with a larger unit in the translation cache. SecureRISC will adopt this idea.

Even a large memory system (e.g., HPC) will have many small segments (e.g., code segments, small files, non-computational processes such as editors, command line interpreters, etc.), and a smaller page size, such as 8 KiB may be appropriate for these segments. However, 4 KiB is probably not so sub-optimal to warrant incompatibility by not supporting this size. Therefore, the question is what is the most appropriate other page size, or page sizes, besides 4 KiB (which supports up to 2 MiB with one level, and up to 1 GiB with two levels). If only one other page size were possible for all implementations, 256 KiB might be a good choice, since this supports segment sizes up to 2³³ bytes with one level, and segment sizes of 2³⁴ to 2⁴⁸ bytes with two levels. But not all implementations need to support physical memory appropriate to a ≥2⁴⁸‑byte working set.

Instead, it makes sense to choose a second page size in addition to the 4 KiB compatibility size to extend the range of 1 and 2‑level page tables, and then allow implementations targeted at huge physical memories to employ even larger page sizes. In particular, there is a 4 KiB page size intended for backward compatibility, but the suggested page size is 16 KiB. Sophisticated operating systems will primarily operate with a pool of 16 KiB pages, with a mechanism to split these into 4 KiB pages and coalesce these back for applications that require the smaller page size.

x86-64 and RISC‑V Paging

For comparison purposes, this section provides a few details on x86-64 virtual memory. RISC‑V and ARM are similar. All use fixed radix page tables, unlike SecureRISC‑s variable radix page tables. x86-64 has both 4‑Level Paging and 5‑Level Paging and similarly RISC‑V has Sv48 and Sv57 (also Sv39). ARM is similar with a 4 KiB granule for 4‑level, and supports an optional 16‑entry fifth level for a 52‑bit virtual address, but ARM also has separate page table registers for user and kernel, which may make the equivalent of the first-level table half as many entries?

Since the Intel, RISC‑V, and Linux all use different names for the virtual address fields, the following uses the Linux names in the diagrams below. The signext field is usually the sign-extension of bit 47 for 4‑level paging, and bit 56 for 5‑level paging.

Linux 4‑Level Paging
63	48	47	39	38	30	29	21	20	12	11	0
signext		PGD		PUD		PMD		PTE		offset
16		9		9		9		9		12

Linux 5‑Level Paging
63	57	56	48	47	39	38	30	29	21	20	12	11	0
signext		PML5		PGD		PUD		PMD		PTE		offset
7		9		9		9		9		9		12

Bits	Linux		Intel	RISC‑V	ARM
Bits	Abbr	Long	Intel	RISC‑V	ARM
56:48	PML5	Page Map Level 5	Page Map Level 5 (PML5)	VPN[4]	L0
47:39	PGD	Page Global Directory	Page Map Level 4 (PML4)	VPN[3]	L1
38:30	PUD	Page Upper Directory	Page Directory Pointer Table (PDPT)	VPN[2]	L2
29:21	PMD	Page Middle Directory	Page Directory Table (PDT)	VPN[1]	L3
20:12	PTE	Page Table Entry Directory		VPN[0]	L4

SecureRISC Paging

SecureRISC has three improvements on paging found in recent architectures. First, it takes advantage of segment sizes to reduce page table walk latency. Second, it allows the operating system to specify the sizes of tables used at each level of the page table walk, rather than tying this to the page size used in translation caches. Decoupling the non-leaf table sizes from the leaf page sizes provides a mechanism that sophisticated operating systems may use for better performance, and on such systems this reduces some of the pressure for larger page sizes. Large leaf page sizes are still however useful for reducing TLB miss rates, and as the third improvement, SecureRISC borrows from RISC‑V and allows the operating system to indicate where larger pages can be exploited by translation caches to reduce miss rates, but without requiring that all implementations do so.

Paging in SecureRISC takes advantage of segment size field in Segment Descriptors to be more efficient than in some architectures. Even a simple operating system—one that specifies tables with the same size at every level—benefits when small segments need fewer levels of tables to cover the specified size specified in the Segment Descriptor. Just because the maximum segment size is 2⁶¹ bytes doesn’t mean that every segment requires six levels of 4 KiB tables.

Segment descriptors and non-leaf page tables give the page size to be used at the next level, which allows the operating system to employ larger or smaller tables to optimize trade-offs appropriate to the implementation and the application. Some implementations may add additional page sizes beyond these basic two in their translation cache matching hardware, such as 64 KiB and 256 KiB, some implementation targeting huge memory systems and applications (e.g., HPC) may add even larger pages to target reduced TLB miss rates. The paging architecture allows this flexibility with Page Table Size (PTS) encoding in segment descriptors and non-leaf PTEs, and for leaf PTEs by an encoding borrowed from RISC‑V called NAPOT that allows enabled translation caches to take advantage of multiple consistent page table entries.

As mentioned earlier, the page size that optimizes memory wastage for a single-level page table is proportional to the square root of the memory size, or in a segmented memory, to segment size, and a single-level page table also minimizes the TLB miss penalty, with a 2‑level page table being second best for TLB miss penalty. SecureRISC’s goal is to allow the operating system to choose page sizes per segment that keep the page tables to 1 or 2 levels. It is therefore interesting to consider what segment sizes are supported with this criterion with various page sizes. This is illustrated in the following table, assuming an 8 B PTE:

Segment size reached in 1 or 2 levels by page size
Page Size		1-Level	2-Level	3-Level	Level
Last	Other	1-Level	2-Level	3-Level	bits
4 KiB		2 MiB	1 GiB	512 GiB	21	30	39
4 KiB	16 KiB	8 MiB	16 GiB	32 TiB	23	34	45
16 KiB		32 MiB	64 GiB	128 TiB	25	36	47
16 KiB	64 KiB	128 MiB	1 TiB	8 PiB	27	40	53
64 KiB		512 MiB	4 TiB	32 PiB	29	42	55
256 KiB		8 GiB	256 TiB	8 EiB	33	48	63
2 MiB		512 GiB	128 PiB		39	57

The other consideration for page size is covering matrix operations in the L1 TLB. Matrix algorithms typically operate on smaller sub-blocks of the matrices to maximize data reuse and to fit into either the more constraining of the L1 TLB and L2 data cache (with other larger blocking done to fit into the L2 TLB and L3, and smaller blocking to fit into the register file). Matrices are often large enough that each row is in a different page for small page sizes. For an algorithm with 8 B or 16 B per element, each row is in a different page at the following column dimension:

Columns equal to page size
Page size	Columns		×1024 rows per page
Page size	8 B	16 B	8 B	16 B
4 KiB	512	256	0.5	0.25
8 KiB	16	512	1	0.5
16 KiB	2048	1024	2	1
64 KiB	8192	4096	8	4
256 KiB	32768	16384	32	16

For large computations (e.g., ≥1024 columns of 16 B elements), every a row increment is going to require a new TLB entry for page sizes ≤16 KiB. Even a 16 KiB page with 16 B per element results in a TLB entry per row. For an L1 TLB of 32 entries and three matrices (e.g., matrix multiply A = A + B × C), the blocking needs to limited to only 8 rows of each matrix (e.g., 8×8 blocking), which is on the low-side for the best performance. In contrast, the 64 KiB page size fits 4 rows in a single page, and so allows 32×32 blocking for three matrices using 24 entries.

If the vector unit is able to use the L2 TLB rather than the L1 TLB for its translation, which is plausible, then these larger page sizes are not quite as critical. An L2 TLB is likely to be 128 or 256 entries, and so able to hold 32 or 64 rows of ×1024 matrices of 16 B elements.

A possible goal for page size might be to balance the TLB and L2 cache sizes for matrix blocking. For example, an L2 cache size of 512 KiB can fit up to 100×100 blocks of three matrices of 16 B elements (total 480 KiB) given sufficient associativity. To fit 100 rows of 3 matrices in the L2 TLB requires ≥300 entries when pages are ≤16 KiB, but only ≥75 entries when pages ≥64 KiB. A given implementation should make similar trade-offs based on the target applications and candidate TLB and cache sizes, and page size is another parameter that factors into the trade-offs here. What is clear is that the architecture should allow implementations to efficiently support multiple page sizes if the translation cache timing allows it.

Because multiple page sizes do affect timing-critical paths in the translation caches, it is worth pointing out that implementations are able to reduce the page size stored in translation caches to equal the matching hardware. An implementation could for example synthesize 16 KiB pages for the translation cache even when the operating system specifies a 64 KiB page. This will however increase the miss rate. Conversely, some hardware may support an even larger set of page sizes. SecureRISC adopts the NAPOT encoding from RISC‑V’s PMPs and PTEs (with the Svnapot extension) to allow the TLB to use larger matching for groups of consistent PTEs without requiring it. Thus, it up to implementations whether to adopt larger page matching to lower the TLB miss rate at the cost of a potential TLB critical path. The cost of this feature is one bit in the PTE (taken from the bits reserved for software).

Should it become possible to eliminate the 4 KiB compatibility page size in favor of a 16 KiB minimum page size, it may be appropriate to use the extra two bits the increase the svaddr and siaddr widths from 64 to 66 bits.

Translation Cache Tags (TCATs) for TLB Flush Optimization

Translation Caches (TLBs) introduce one other complication. Typically, when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is both costly in the cycles to flush and the misses that follow reloading the TLBs on memory references following the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in MIPS translation hardware. SecureRISC instead calls these values Translation Cache Tags (TCATs) as ASID is used for another purpose in SecureRISC (see the next section). The TCAT is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous TCAT, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows its previous TLB entries to be used if they are still present in the TLB, and also avoids the TLB flush. When the TCATs are used up, the TLB is flushed, and then TCAT assignment starts fresh as processes are run. For example, a 5‑bit TCAT would then require a TLB flush only when the 33rd distinct process is run after the last flush. TCATs may also be used when the memory tables that TLBs refill from change. For example, when clearing a valid or permission bit in a page table entry, entries in the TLBs need to be invalidated so that the change is visible. If the page in question is present in only one address space, then it may suffice to invalidate a single TLB entry, but for pages in multiple address spaces, which would require many invalidates, it may be appropriate to assign such address spaces a new TCAT instead.

The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit caused the TLB entry to ignore the TCAT during the match process; such entries match any TCAT. This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the TCATs in the TLB. RISC‑V for example introduced a VMID controlled by the hypervisor that works analogously to the TCAT.

For security it is recommended that supervisors and hypervisors have their own address spaces and TCATs. This prevents less privileged rings from probing these address spaces, for example to learn of Address Space Layout Randomization (ASLR) done in these address spaces. In this case, SecureRISC avoids the need for selective matching of TCATs in the TLB by providing per-ring TCATs. However, should system software choose to share the address space between some privilege levels, a way to have some mappings shared when the TCAT changes is useful. SecureRISC implements this on a per-segment rather than a per-page basis. The G bit in Segment Descriptor Entries (SDEs) specifies that the TCAT is ignored when matching TLB entries, similar to the MIPS and RISC‑V PTE G bits. Such selective matching complicates and potentially impacts critical timing paths in translation, and would be eliminated if all system software for SecureRISC uses per-ring address spaces.

Virtual Machine Identifiers (VMIDs) and Address Space Identifiers (ASIDs)

SecureRISC Virtual Machine Identifiers (VMIDs) and Address Space Identifiers (ASIDs) are global values used to identify supervisors and the address spaces created by those supervisors in the system. These identifiers will be communicated to I/O devices, and should not be reused unless all I/O is complete.

An address space is defined by its Segment Descriptor Table Pointer (SDTP), and an ASID is the index into the system Address Space Table (AST) that gives a SDTP. Typically supervisors create and manage the AST. For systems with a hypervisor, the supervisor is running in a virtual machine, and the ASID is augmented with a Virtual Machine Identifier (VMID) that is an index into Virtual Machine Table (VMT) that gives AST and RPT pointers. Thus the triple (VMID, ASID, lvaddr) specifies a system-wide 128-bit virtual address. When threads initiate DMA, they transmit lvaddrs to I/O devices, and the system transaction includes the current VMID and ASID values, which allows the I/O device to translate the lvaddr by using the VMID to find the AST, and the ASID to find the SDTP.

SecureRISC currently defines ASIDs and VMIDs to be 32‑bit quantities, but supervisors are unlikely to allocate ASTs of 2³² entries. Similarly, hypervisors are unlikely to allocate a VMT of that size. Instead the ASTs and VMT may be of any power-of-two size from 1024 to 2³² entries (8 KiB to 32 GiB).

As discussed in the previous section, SecureRISC uses Address Space Identifier (ASID) differently from MIPS and RISC‑V. What those ISAs called ASIDs are called Translation Cache Tags (TCATs) in SecureRISC (see the previous section). TCATs may thought of as a version number for ASIDs, indicating when changes to the translation tables require translation caches to update to see changes.

VMIDs currently exist only in RISC‑V to locate Address Space Tables of virtual machines (e.g., for I/O devices), and are not required as part of the translation cache lookup because second-level translation is not per-process. If second-level translation is generalized, then it would be necessary to introduce a Translation Cache Virtual Machine Tag to use a version number for VMIDs, as the TCAT is to the ASID.

It is unclear whether the following CSR is required. For the time being, it is defined as follows. The AST is located at svaddr addressed by the astp RCSR:

Address Space Table Pointer
63		12	11	0
svaddr_63..13+ASTS	2^ASTS		0
51−ASTS	1+ASTS		12

Fields of Address Space Table Pointer
Field	Width	Bits	Description
2^ASTS	1+ASTS	15:12	NAPOT encoding of Address Space Table Size
svaddr_63..13+ASTS	51−ASTS	63:13+ASTS	The Address Space Table (AST) is located at svaddr_63..13+ASTS ∥ 0^13+ASTS.

Segment Descriptor Table Pointer Registers and TCATs

The address space of each ring is defined by a Segment Descriptor Table (SDT) which is located by a per-ring Segment Descriptor Table Pointer (SDTP) RCSR. The SDT is the first-level table of the lvaddr → svaddr translation process. It is followed by zero to four levels of page table (zero in the case of direct mapping of segments). The segment size in the SDT allows the length of the walk to be per segment, so most code segments (e.g., shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped and require only the SDT level of mapping. In addition, if a hypervisor is not paging its supervisors, it might direct map many supervisor segments.

After a TLB miss, the processor starts by using the ring number of the access (ring ← PC.ring for Basic Block Descriptor and instruction fetch and most load and stores, but loads and stores with ring number tags (192..215) instead use ring ← AR[a]_66..64 when AR[a]_66..64 ≤ PC.ring). The 16 bits of the segment field are then an index into the table at the system virtual address in the specified register. The STS encoding of sdtp[ring] registers is used to bounds check the segment number before indexing, which allows the Segment Descriptor Table to be 512, 1024, 2048, …, 65536 entries (8 KiB to 1 MiB). The bounds check is STS = 7 | lvaddr_63..57+STS = 0^8−STS. If the bound check succeeds, the doubleword Segment Descriptor Entry (SDE) is read from
(sdtp[ring]_63..13+STS ∥ 0^13+STS) | (svaddr_63..48 ∥ 0⁴)
and this descriptor is used to bounds check the segment offset, and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Translation Cache Tag (TCAT) specified in bits 11..0 of sdtp[ring]. Implementations may choose to implement less than the full 12 bits of the TCAT, which is WARL; software must determine the number implemented by writing all ones to this field and reading back to see which remain set.

Segment Descriptor Table Pointers
71	64	63		12	11	0
240		svaddr_63..13+STS	2^STS		TCAT
8		51−STS	1+STS		12

Field	Width	Bits	Description
TCAT	12	11:0	Translation Cache Tag (WARL) 0 Reserved for hardware use
2^STS	1+STS	12+STS:12	Encoding of Segment Table Size (values ≥ 8 reserved)
svaddr_63..13+STS	51−STS	63:13+STS	Pointer to Segment Descriptor Table

SecureRISC provides per-ring sdtp RCSRs so that each ring can have its own address space. In this case, each ring would also have its own TCAT value, and the G bit in Segment Descriptor Entries (SDEs) is not needed. Systems where some rings share address space would program the sdtp RCSRs with the same svaddr and TCAT values. The ring brackets in the SDE then restrict access as required. In addition, it is possible to subset the address space of less privileged ring by giving it a smaller STS value than the more privileged ring in the same address space. For example, to use bit 63 to distinguish between user and supervisor portions of the address space^*, setting STS in the range 0..6 leaves the segments from 32768..65535 inaccessible. However, this requires a 1 MiB SDT, which is needlessly large. If this usage were common, then it would be preferable to provide two sdtp RCSRs per ring, with bit 63 selecting one or the other.

SecureRISC provides per-ring address spaces because that prevents probes of higher privilege address spaces (e.g., attempts to measure access violation timing variation). For ringed pointers (tags 192..199), trapping in constant time when AR[a]_66..64 > PC.ring prevents this. It also allows the user address space to be a full 64 bits.

* This is what Linux does on many ISAs, including MIPS, x86-64, RISC‑V, and ARM. In MIPS this is built into the ISA. For RISC‑V the Svukte extension likewise enforces the bit 63 separation between user and kernel space.

VMID Enable

Rings of hypervisor and higher privilege need to be able to set their own VMID, and of less privileged rings, but application rings must not have access to VMID CSRs. Rather than hardwire a hypervisor ring number in the architecture, SecureRISC uses a separate RECSR to specify access to VMID[ring] in addition to the normal RCSR checks. In particular, no access to VMID registers is permitted when PC.ring < VMIDRE.

ASID Enable

Rings of supervisor and higher privilege need to be able to set their own ASID and SDTP, and of less privileged rings, but application rings must not have access to ASID and SDTP CSRs. This restriction is implemented by the RCSR enable feature: each RCSR has an associated RECSR to specify which rings may access their own RCSR. So, for example, the ASIDRE RECSR controls access to the ASID[ring] RCSR, and the SDTPRE RECSR controls access to the SDTP[ring] RCSR. In particular, no access to the ASID and SDTP registers is permitted when PC.ring < ASIDRE and PC.ring < SDTPRE respectively.

Segment Descriptors

The segment descriptor can be thought of the first-level page table, but with a 16 B descriptor instead of an 8 PTE. The first 8 B of the descriptor is made very similar to the PTE format, with the extra permissions, attributes, etc. in the second 8 B of the descriptor.

Possible future changes:

Add 8 tags for physical addresses and use instead of tag 240 for segment descriptor entries to extend the System Virtual Address space from 64 bits to 67 bits by using the ring field. This would complicate software, since the ring field is computed by some pointer operations, and integer computations cannot be used.
Require direct mapped segments to have bit ssize-1 be 1 and bits ssize-2..3 be zero. This would allow the same logic as for page tables to be used but does require clearing bit ssize-1 before performing the logical or with the lvaddr.
For paged segments, SecureRISC could require that SDE doubleword 0 MAP field be identical to SDE doubleword 1 XWR value so that hardware can use either. But XWR is still required for direct-mapped segments, so it is unclear that this would provide an advantage.

Segment Descriptor Entry Word 0
71	64	63		3	2	0
240		svaddr_63..4+PTS	2^PTS		MAP
8		60−PTS	1+PTS		3

Segment Descriptor Entry Word 1
63	62	61	60	59	58	57	56	55	40	39	32	31	27	26	24	23	22	20	19	18	16	15	14	13	8	7	6	5	4	3	2	0
G1		G0		GC	T	IJ	NS	0		SIAO		0		R3		0	R2		0	R1		0	D	ssize		G	0		C	P	XWR
2		2		1	1	1	1	16		8		5		3		1	3		1	3		1	1	6		1	2		1	1	3

Field

Width

Bits

Description

MAP

2:0

0 ⇒ invalid SDE: bits 135..72, 63..3 available for software use
2 ⇒ svaddr_63..4+PTS is first level page table
7 ⇒ svaddr_63..ssize are high-bits of mapping
1, 3..6 Reserved

2^PTS

1+PTS

3+PTS:3

See non-leaf PTE

svaddr_63..4+PTS

60−PTS

63:4+PTS

MAP = 2 ⇒ svaddr_63..4+PTS is first level page table
MAP = 3 ⇒ svaddr_63..ssize are high-bits of mapping

XWR

2:0

Read, Write, Execute permission:

0	Reserved
1	Read-only
2	Reserved
3	Read-write
4	Execute-only
5	Read-execute
6	Reserved
7	Read-write-execute

Pointer permission (pointers with segment numbers are permitted)
0 ⇒ Stores of tags 0..222 and 224..239 to this segment take an access fault

CHERI Capability permission
0 ⇒ Stores of tags 232 to this segment take an access fault

0 ⇒ TCAT matched in translation caches
1 ⇒ TCAT ignored on translation cache matching (a Global mapping)

ssize

13:8

Segment size is 2^ssize bytes for values 12..61.
Values 0..11 and 62..63 are reserved.

Downward segment (must be 0 if ssize ≥ 48)
0 ⇒ address bits 47..ssize must be clear, i.e., = 0^48−ssize
1 ⇒ address bits 47..ssize must be set, i.e., = 1^48−ssize

18:16

Ring bracket 1

22:20

Ring bracket 2

26:24

Ring bracket 3

SIAO

39:32

System Interconnect Attribute (SIA) override, addition, hints, etc. (e.g., cache bypass, as for example seen in most ISAs, such as RISC‑V’s PBMT). Details are TBD.

No Speculative Access

Injective mapping
0 ⇒ Software is responsible for translation cache consistency
1 ⇒ Hardware implements TLB shootdown for this segment
(this depends upon only one PTE referencing each page)

0 ⇒ Memory tags give type and size
1 ⇒ Memory tags are clique

Garbage collection in PTEs enabled
0 ⇒ GC field in PTEs available for software use
0 ⇒ GC field in PTEs causes traps on stores of older generation pointers

61:60

Generation number of this segment for GC.

63:62

Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps and software lowers the G1 field. This feature is turned off by setting G1 to 3.

Many of these fields are described elsewhere. For example, R1, R2, and R3 are described in Ring brackets. The NS field serves as one defense against Spectre and similar attacks (e.g., as described in SpectreGuard: An Efficient Data-centric Defense Mechanism against Spectre Attacks). The SIAO field is not currently defined, but is expected to change how caches are used for references to this segment. Future additions might affect prefetching, translation caching, and so on.

One reason segmentation is a valuable addition to SecureRISC is the extra functionality provided by the additional 64 bits in Segment Descriptor Entries (SDEs) compared to Page Table Entries (PTEs). These extra bits enable additional access control, enhanced memory protection, cache control, etc. that is not possible with 64‑bit PTEs.

Local Virtual Address Direct Mapping

For direct mapping, the segment mapping consists of:

Checking that the offset is not out of bounds for segments < 2⁴⁸ bytes or clearing bits 63..size for segments ≥ 2⁴⁸ bytes.
Checking that the mapping is aligned to the segment size.
Logical-or the offset with the mapping. The two checks above ensures that the logical-or never sees two ones in the same bit position.

For segments ≤ 2⁴⁸ bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero (or all ones if downward is set in the Segment Descriptor Entry), or equivalently that svaddr_47..0 < 2^size. For segments > 2⁴⁸ bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking address arithmetic), but bits 60..size must be cleared before the logical-or. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > 2⁴⁸ bytes. Thus, paging does not need to handle segments larger than 2⁴⁸ bytes (the SDT for such segments is in effect the first level of the page table).

Local Virtual Address Paging

When paging is used, the page tables can have one or more levels. Each level has the flexibility to use a different table size, which is chosen by the operating system when setting up the tables. A simple operating system might use a single table size (e.g., 4 KiB or 16 KiB) at every level except the first, which would be a fraction of this size.

The following examples provide examples of how the local virtual address can be used to index levels of the page table for various page and segment sizes in this simple operating system. This is not the recommended way to utilize SecureRISC’s capabilities but rather illustrates a backward-compatible option.

Local Virtual Address with 4 KiB page size and 2²¹ segment size — 1‑level page table
63	48	47	21	20	12	11	0
SEG		0		V1		offset
16		27		9		12

Local Virtual Address with 4 KiB page size and 2³⁰ segment size — 2‑level page table
63	48	47	30	29	21	20	12	11	0
SEG		0		V1		V2		offset
16		18		9		9		12

Local Virtual Address with 4 KiB page size and 2⁴⁸ segment size — 4‑level page table
63	48	47	39	38	30	29	21	20	12	11	0
SEG		V1		V2		V3		V4		offset
16		9		9		9		9		12

Local Virtual Address with 16 KiB page size and 2²⁵ segment size — 1‑level page table
63	48	47	25	24	14	13	0
SEG		0		V1		offset
16		23		11		14

Local Virtual Address with 16 KiB page size and 2⁴⁷ segment size — 3‑level page table
63	48	47	46	36	35	25	24	14	13	0
SEG		0	V2		V3		V4		offset
16		1	11		11		11		14

At the other end of the spectrum, an operating system capable of allocating page tables in any power-of-two size and that does not require demand paging for page tables might use a single table of 2^ssize-14 16 KiB PTEs for most small segments. If the segment size is large enough that TLB miss rates are high, the operating system might allocate the segment’s pages in units of 64 KiB or 256 KiB and use NAPOT encoding to take advantage of translation caches that can match sizes greater than 16 KiB.

The follow examples illustrate how SecureRISC’s architecture might be used by such an operating system. Linux, which has a buddy memory allocator, is one such system.

Local Virtual Address with 256 KiB page size and 2⁴⁸ segment size — 2‑level page table
63	48	47	33	32	18	17	0
SEG		V1		V2		offset
16		15		15		18

The format of a segment page table is multiple levels, with all but the last level consisting of 8 B‑aligned 72‑bit words with integer tags in the following format:

Non-Leaf Page Table Entry (PTE)
71	64	63		3	2	0
240		svaddr_63..4+PTS	2^PTS		XWR
8		60−PTS	1+PTS		3

Fields of Non-Leaf Page Table Entries (PTEs)

Field(s)

Width

Bits

Description

XWR

2:0

0 ⇒ invalid PTE: bits 63..3 available for software
2 ⇒ non-leaf PTE (this figure)
6 Reserved
1, 3..5, 7 indicate a Leaf PTE (see below)

2^PTS

1+PTS

3+PTS:3

Table size of next level is 2^1+PTS entries (2^4+PTS bytes):

0	2	16	B
1	4	32	B
2	8	64	B
3	16	128	B
⋮	⋮	⋮
8	512	4	KiB
9	1024	8	KiB
10	2048	16	KiB
⋮	⋮	⋮
14	32768	256	KiB
⋮	⋮	⋮
34	2³⁵	256	GiB
35	2³⁶	512	GiB
≥36	reserved

svaddr_63..4+PTS

60−PTS

63:4+PTS

Pointer to the next level of table

The last level (leaf) Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:

Leaf Page Table Entry (PTE)
71	64	63		11	10	7	6	5	4	3	2	0
240		svaddr_63..12+S	2^S		SW		GC		D	A	XWR
8		52−S	1+S		4		2		1	1	3

Segments are meant as the primary unit of access control, but including Read, Write, and Execute permissions in the PTE might simplify porting of less-aware operating systems. If RWX permissions are not required in PTEs for operating system ports, this field could be reduced to just 2 bits—one bit for leaf/non-leaf distinction and a Valid bit for PTEs—freeing up one bit for another purpose. The most likely use of such a change would be to add a bit to System Virtual Addresses.

Field

Width

Bits

Description

XWR

2:0

Read, Write, Execute permission (additional restriction on segment permissions):

0	invalid, bits 63..3 available for software
1	Read-only
2	Non-leaf PTE (see above)
3	Read-write
4	Execute-only
5	Read-execute
6	Reserved
7	Read-write-execute

Accessed:
0 ⇒ trap on any access (software sets A to continue)
1 ⇒ access allowed

Dirty:
0 ⇒ trap on any write (software sets D to continue)
1 ⇒ writes allowed

6:5

Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps and software lowers the GG field. This feature is turned off by setting GC to 3.

10:7

For supervisor software use

2^S

1+S

11+S:11

This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of 2¹² bytes (4 KiB).

svaddr_63..12+S

52−S

63:12+S

For last level of page table, this is the translation
For earlier levels, this is the pointer to the next level

As example of the NAPOT 0^S encoding, the following examples illustrate three page sizes:

4 KiB Leaf Page Table Entry (PTE)
71	64	63	12	11	10	7	6	5	4	3	2	0
240		svaddr_63..12		1	SW		GC		D	A	XWR
8		52		1	4		2		1	1	3

16 KiB Leaf Page Table Entry (PTE)
71	64	63	14	13	12	11	10	7	6	5	4	3	2	0
240		svaddr_63..14		1	0²		SW		GC		D	A	XWR
8		50		1	2		4		2		1	1	3

256 KiB Leaf Page Table Entry (PTE)
71	64	63	18	17	16	11	10	7	6	5	4	3	2	0
240		svaddr_63..18		1	0⁶		SW		GC		D	A	XWR
8		46		1	6		4		2		1	1	3

System Interconnect Address Attributes

SecureRISC‛s System Interconnect Address Attributes (SIAA) are inspired by RISC‑V’s Physical Memory Attributes (PMA). SIAAs are specified on Naturally Aligned Powers of Two (NAPOT) siaddrs. The first attribute is the memory type, described below. Attributes are further distinguished for some memory types based on the cacheability software chooses for a portion of the NAPOT address space. Cacheability options are instruction and data caching with a specified coherence protocol, instruction and data caching without coherence, instruction caching only, and uncached. The set of coherency protocols to be enumerated is TBD, but is likely to include at least MESI and MOESI. Uncached instruction accesses may require full cache block transfers on some processors to keep things simpler, and the cache block transfer used multiple times before being discarded on a reference to another cache block (so there is a limited amount of caching even for uncached instruction accesses).

The attributes are organized into the following categories:

Category	Applicability
Category	Void	ROM	Main	I/O
Memory type	✔	✔	✔	✔
Dynamic configuration (e.g., hotplug)		✔	✔	✔
Non-volatile		1	✔	✔
Error correction: type (e.g., SECDED, Reed-Solomon, etc.) and granularity (e.g., 72, 144, etc. bits)		✔	✔	✔
Error reporting (how detected errors are reported)		✔	✔	✔
Mandatory Access Control Set		✔	✔	✔
Read access widths supported		✔	✔	✔
Write access widths supported			✔	✔
Execute access widths supported		✔	✔	✔
Uncached Alignment		✔	✔	✔
Uncached Atomic Compare and Swap (CAS) widths			✔	✔
Uncached Atomic AND/OR widths			✔	✔
Uncached Atomic ADD widths			✔	✔
Coherence Protocols (e.g., uncached, cached without coherence, cached coherent (one of MESI, MOESI), directory-based coherence type)			✔	?
NUMA location (for computing distances)		✔	✔	✔
Read idempotency		1	1	✔
Write idempotency			1	✔

Memory type is one of four values:

Void: Access to these siaddrs are errors. There are no further relevant attributes for this memory type. Void could be considered to be a subtype of Main Memory or I/O without any access options, but separating it is perhaps more straightforward.
ROM: These siaddrs contain Read-Only-Memory devices (e.g., the boot ROM) with various attributes described below. ROM may be cached without coherence, instruction cached without coherence, or uncached. ROM is always read idempotent. ROM could be considered to be a subtype of Main Memory without any write access, but separating it is perhaps more straightforward.
Main Memory: These siaddrs serve as system memory with various attributes described below. All caching options apply to main memory. Main memory is always read and write idempotent.
I/O: These siaddrs are I/O device registers and memory with various attributes described below. The caching options for I/O siaddrs is TBD. In particular, do I/O areas allow cache coherency support, or are they always non-coherent?

Access Widths
Width	Tag		Align	Comment
Width	UT	T	Align	Comment
8	✔	TI	any	LX8, LS8, SX8, SS8, etc.
16	✔	TI	0..62 mod 64	Crossing cache block boundary not supported
32	✔	TI	0..60 mod 64	Crossing cache block boundary not supported
64	✔	TI	0..56 mod 64	Crossing cache block boundary not supported
72		✔	0 mod 8	Uncached LA, LX, LS, SA, SX, SS, etc.
128	✔	TI	0..48 mod 64	Crossing cache block boundary not supported
144		✔	0 mod 16	Uncached LAD, LXD, LSD, SAD, SXD, SSD, etc.
256	✔	TI	0..32 mod 64	Uncached vector load/store
288	✔	TI	0 mod 32	Uncached vector load/store
512	✔	TI	0 mod 64	Uncached vector load/store, cached untagged refill and writeback
576		✔	0 mod 64	Uncached vector load/store, cached tagged refill and writeback
768		✔	0 mod 64	Cached tagged refill and writeback with encryption

In the table above, the UT column indicates untagged memory support, the T column indicates tagged memory support, and the TI entry in the tagged column indicaes Tagged Immediate, defined on tagged memory where the word contains a tag in the range 240..255. Untagged memory supplies a 240 tag to the system interconnect on a read, and requires a 240 tag from the system interconnect on writes. Tagged writes (cached or uncached) to untagged memory siaddrs fail if the tag is not 240. Main memory and ROMs may impose additional uncached alignment requirements (e.g., Naturally Aligned Power Of Two (NAPOT) rather than arbitrary alignment within cache blocks).

Main memory must support reads and writes. ROMs only support reads. I/O memory may support reads, writes, or both, and may be idempotent or non-idempotent.

Question: Should there be a type enumeration, including, for example:

etc. Perhaps bandwidth, error rate, etc. too?

Cached Main Memory SIAAs
Attribute	Width
Attribute	512	576	768
Read	☐	☐	☐
Write	☐	☐	☐
Execute		☐	☐
Coherence protocols	TBD

Cached ROM SIAAs
Attribute	Width
Attribute	512	576	768
Read	☐	☐	☐
Write	n.a.
Execute		☐	☐
Coherence protocols	n.a.

Cached I/O SIAAs
Attribute	Width
Attribute	512	576	768
Read	TBD
Write
Execute
Coherence protocols

Uncached Main Memory SIAAs
Attribute	Width
Attribute	8	16	32	64	72	128	144	256	288	512	576	768
Read	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Write	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Execute	0				☐	0	☐	0	☐	0	☐	☐
Atomic CAS	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Atomic AND/OR	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Atomic ADD	☐	☐	☐	☐	0	☐	0
Coherence protocols	n.a.
Read Idempotency	1
Write Idempotency	1

Uncached ROM SIAAs
Attribute	Width
Attribute	8	16	32	64	72	128	144	256	288	512	576	768
Read	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Write	0
Execute	0				☐	0	☐	0	☐	0	☐	☐
Atomic CAS	0
Atomic AND/OR	0
Atomic ADD	0
Coherence protocols	n.a.
Read Idempotency	1
Write Idempotency	n.a.

Uncached I/O SIAAs
Attribute	Width
Attribute	8	16	32	64	72	128	144	256	288	512	576	768
Read	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Write	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Execute	0				☐	0	☐	0	☐	0	☐	☐
Atomic CAS	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Atomic AND/OR	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐	☐
Atomic ADD	☐	☐	☐	☐	0	☐	0
Coherence protocols	n.a.
Read Idempotency	☐
Write Idempotency	☐

Tagged memory is an attribute derived from the above. Tagged is true for ROM and main memory that supports uncached 72‑bit reads or cached 576‑bit or 768‑bit (for authentication and optional encryption) reads and optionally writes. Untagged memory supports some subset of uncached 8‑bit, …, 64‑bit, 128‑bit reads and optionally writes, or cached 512‑bit reads and optionally writes, and supplies a 240 tag on read, and accepts a 240 or 241 tag on writes. Code ROM (e.g., the boot ROM) might support only tags 241, 252, and 253.

Encryptable is an attribute derived from the above. Encryptable is true for ROM and main memory that supports cached 768‑bit (for authentication and optional encryption) reads and optionally writes.

CHERI capable is an attribute derived from the above. CHERI capable is true for tagged main memory that supports tags 240, 232, and 251. This could be a cacheable 512‑bit that synthesizes tags on read from a a in-DRAM tag table with cache and compression.

System Virtual to System Interconnect Address Mapping

After 64‑bit Local Virtual Addresses (lvaddrs) are mapped to 64‑bit System Virtual Addresses (svaddrs), these 64‑bit svaddrs are mapped to 64‑bit System Interconnect Addresses (siaddrs). This mapping is similar, but not identical to the mapping above. There is one such mapping set by the hypervisor for the entire system using a Region Descriptor Table (RDT) at a fixed system address. The RDT may be hardwired, or read-only, or read/write by the hypervisor. For the maximum 65,536 regions, with 16 bytes for a RDT entry, the maximum size RDT is 1 MiB in size. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).

The format of the Region Descriptor Entries is shown below. It is similar to Segment Descriptor Entries, but without the D, X, P, C, R1, R2, R3, G0, G1, and SIAO fields, and with the addition of the MAC, and ATTR fields.

A possible future addition would be a permission bit that prohibits execution from privileged rings. Alternatively, there could be a mandatory access bit required in MAC for this.

Region Descriptor Entry Word 0
71	64	63		3	2	0
240		siaddr_63..4+PTS	2^PTS		MAP
8		60−PTS	1+PTS		3

Region Descriptor Entry Word 1
71	64	63	36	35	32	31	16	15	14	12	11	10	9	8	7	6	5	0
240		ATTR		ENC		MAC		0	RPT		0		WR		0		rsize
8		28		4		16		1	3		2		2		2		6

Fields of Region Descriptor Entries

Field

Width

Bits

Description

MAP

2:0

0 ⇒ invalid RDE: bits 135..72, 63..3 available for software use
2 ⇒ siaddr_63..4+PTS is first level page table
3 ⇒ siaddr_63..rsize are high-bits of mapping
1, 4..7 Reserved

2^PTS

1+PTS

3+PTS:3

Table size of next level is 2^1+PTS entries (2^4+PTS bytes):

0	2	16	B
1	4	32	B
2	8	64	B
3	16	128	B
⋮	⋮	⋮
8	512	4	KiB
9	1024	8	KiB
10	2048	16	KiB
⋮	⋮	⋮
14	32768	256	KiB
⋮	⋮	⋮
34	2³⁵	256	GiB
35	2³⁶	512	GiB
≥36	reserved

siaddr_63..4+PTS

60−PTS

63:4+PTS

MAP = 2 ⇒ siaddr_63..4+PTS is first level page table
MAP = 3 ⇒ siaddr_63..rsize are high-bits of mapping

rsize

5:0

Region size is 2^rsize bytes for 12..61.
Values 0..11 and 62..63 are reserved.

9:8

Write Read permission

RPT

14:12

Region Protection Table ring
Accesses by rings less than or equal to this value apply permissions specified by rptp.

MAC

31:16

Mandatory Access Control Set

ENC

35:32

Encryption index
0 ⇒ no encryption
1..15 ⇒ index into table giving algorithm and 256‑bit key

ATTR

63:36

Physical Memory Attributes

The format of a region page table is multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except there is no X or G fields.

Second-Level Non-Leaf Page Table Entry (SPTE)
71	64	63		3	2	0
240		siaddr_63..4+PTS	2^PTS		XWR
8		60−PTS	1+PTS		3

Field

Width

Bits

Description

XWR

2:0

0 ⇒ invalid PTE: bits 63..3 available for software
2 ⇒ non-leaf PTE (this figure)
1, 3 indicate valid Second-Level Leaf PTE (see below)
4..7 Reserved

2^PTS

1+PTS

3+PTS:3

Table size of next level is 2^1+PTS entries (2^4+PTS bytes):

0	2	16	B
1	4	32	B
2	8	64	B
3	16	128	B
⋮	⋮	⋮
8	512	4	KiB
9	1024	8	KiB
10	2048	16	KiB
⋮	⋮	⋮
14	32768	256	KiB
⋮	⋮	⋮
34	2³⁵	256	GiB
35	2³⁶	512	GiB
≥36	reserved

siaddr_63..4+PTS

60−PTS

63:4+PTS

Pointer to the next level of table

The Second-Level Leaf Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:

Second-Level Leaf Page Table Entry (SPTE)
71	64	63		11	10	7	6	5	4	3	2	0
240		siaddr_63..12+S	2^S		SW		0		D	A	XWR
8		52−S	1+S		4		2		1	1	3

Field

Width

Bits

Description

XWR

2:0

Read, Write permission:

0	invalid, bits 63..3 available for software
1	Read-only
2	Non-leaf PTE (see above)
3	Read-write
4..7	Reserved

Accessed:
0 ⇒ trap on any access (software sets A to continue)
1 ⇒ access allowed

Dirty:
0 ⇒ trap on any write (software sets D to continue)
1 ⇒ writes allowed

10:7

For hypervisor software use

2^S

1+S

11+S:11

This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of 2¹² bytes (4 KiB).

siaddr_63..12+S

52−S

63:12+S

For last level of page table, this is the translation
For earlier levels, this is the pointer to the next level

Alternative to Second Level Address Translation

An alternative to Second Level Address Translation (SLAT) is sketched out in this section. In the author’s opinion, this would be preferable if it could be made to perform as well as SLAT. It is much more flexible than simple SLAT, as it generalizes to multiple levels easily. This alternative is named Delegated Translation (DT) for the purposes of exposition here. The possibility of performing well is the result of efficient cross-ring function calls. SLAT replaced Shadow Page Tables because the efficiency of exceptions on page table writes was poor, but a function call approach could result in improved performance.

In Delegated Translation, only ring 7 maintains page tables. Other rings request changes to address translation by calling more privileged rings. This is best explained by starting with the system boot in ring 7. The boot code identifies all of system resources (e.g., main memory, I/O, processors) and then allocates subsets of those resources to one or more ring 6 clients. It initializes the clients’ initial code and data segments and invokes their start code, passing each a set of abstract addresses to use as translation targets. For example, it might pass a base and size of ring 6 main memory abstract addresses and a base and size of ring 6 I/O abstract addresses. Call these r6mmaddr and r6ioaddr respectively. Ring 6 then makes calls to ring 7 requesting translations that translate ring 6 virtual addresses (r6vaddrs) to these r6mmaddr and r6ioaddr abstract addresses. Ring 7 checks that the abstract addresses are valid and translates the abstract addresses to actual siaddrs and writes the segment descriptor and page tables to create the requested lvaddr → siaddr translation. Ring 6 may in turn perform the same for multiple ring 5 clients, passing each r5mmaddr and r5ioaddr, and receiving said back in requests to set up translations, which it does by translating these to a subset of the r6mmaddr and r6ioaddr that it received from ring 7. The allocations need not be static, as any ring may provide gates for requesting additional abstract addresses or releasing previously requested abstract addresses. Any ring might swap or page as needed to reallocate its set of resources to its clients.

As an example, when a user application in ring 2 requests more memory, ring 4 may simply pull r4mmaddr pages from a free list, and call ring 6 to make the translation. Ring 6’s check may be as simple as a range comparison, and its translation from r4mmaddr to r6mmaddr be as simple as an addition. It would then in turn call ring 7 to change the translation tables, which might also use range comparison for checking and addition for translation, or a more complicated translation might involve table lookup. In any case, ring 7 would then write the translation tables for ring 2 and return. We therefore have a request to add a translation r2vaddr → r4mmaddr be implemented by r4mmaddr → r6mmaddr → siaddr and installing the r2vaddr → siaddr translation in the translation tables read by the hardware. This mechanism allows each ring to support any number of less privileged clients. For example, there might be 𝐻 hypervisors running at ring 6, each with 𝑆₁+𝑆₂+…+𝑆_𝐻 supervisors running in ring 4, and each supervisor having many ring 2 clients. Even a few ring 2 clients might create ring 0 sandbox address spaces for JIT code. Address Space Identifiers (ASIDs), Quality of Service (QoS) Identifiers, etc. would be handled in the same way as abstract addresses. The system operates as a tree of clients with each client able to pass a subset of its resources to its clients.

Translation Cache (TLB) Flushing

Cache coherency protocols automatically transfer and invalidate cache data in response to loads and stores from multiple processors. It is tempting to find a similar mechanism to avoid requiring software to perform translation cache invalidations. However, unlike coherent instruction and data caches, the same translation entries may exist in multiple translation cache locations, making a directory-based approach difficult. SecureRISC has an advantage over other ISAs in this regard, as its segment descriptors allow page tables to be shared. When segments are used as the unit of sharing, a physical page is found in only one segment page table (i.e., the segment page table forms an injective mapping). If SecureRISC can exploit this property to store a reverse mapping from a PTE address to a TLB location in a directory, processors could implement TLB shootdown on top of the cache coherency protocol. For instances where this is not possible, conventional TLB flush mechanisms will be provided. The instructions for this are TBD.

The following explores the possibility of translation coherence in more detail. Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with respect to this and other processors in the system, it may be possible for the L2 on translation cache fills to store the location associated with the cache block and when the L2 block is invalidated, send an invalidation to the translation cache for that location. This might reduce the need for some translation cache flushes and interprocessor shootdown interrupts. However, implementing this would require the L2 to store the associated translation cache locations in the block metadata, which might be considered too costly. An alternative approach might be to have translations check the L2 when required, which would require only a single value rather than the multiple values an L2 directory would need. This could work by having the L2 note when a block has been fetched by the translation caches and, if the block is modified or invalidated, increment a counter. If the counter stored in a translation cache entry is lower than the L2 counter, the entry would need to be checked before use. Counter wrapping would necessitate flushing the translation caches.

While it is questionable whether this level of mechanism would be worthwhile, it is documented here in case further analysis suggests a different evaluation.

Region Protection

Hypervisors and supervisors share a unified 64‑bit address space of System Virtual Addresses (svaddrs), divided into 65536 regions. Since such software is generally adept at adapting to arbitrary memory layouts, this does not typically pose a functional issue. Moreover, the unified address space simplifies communication between supervisors and I/O devices. However, it does introduce security concerns, which region protection is designed to address.

A SecureRISC system typically hosts one or more hypervisors. When multiple hypervisors are present, ring 7 is responsible for allocating resources to them. Region protection supports this process, and typically, only ring 7 modifies region protection settings. In a unified address space, each supervisor is capable of attempting references to the addresses of other supervisors or even to hypervisor addresses. Only the region protection mechanism prevents such access attempts from succeeding. The first level of protection is simple Execute, Write, and Read permissions that the ring 7 sets for each region and hypervisor. This is implemented as a table of up to 65536 entries, one for each region, of 1‑bit (up to 8 KiB), 2‑bit (up to 16 KiB), or 4‑bit values (up to 32 KiB):

Region Protection Table 1‑bit encoding
Value	Permission
0	None (`RPT_1BIT_NONE`)
1	Execute, Write, Read (`RPT_1BIT_XWR`)

Region Protection Table 2‑bit encoding
Value	Permission
0	None (`RPT_2BIT_NONE`)
1	Execute and Read (`RPT_2BIT_XR`)
2	Execute-only (`RPT_2BIT_X`)
3	Execute, Write, Read (`RPT_2BIT_XWR`)

Region Protection Table 4‑bit encoding
Value	Permission
0	None (`RPT_4BIT_NONE`)
1	Read-only (`RPT_4BIT_R`)
2	Reserved
3	Write, Read (`RPT_4BIT_WR`)
4	Execute-only (`RPT_4BIT_X`)
5	Execute, Read (`RPT_4BIT_XR`)
6	Reserved
7	Execute, Write, Read (`RPT_4BIT_XWR`)
8..15	Reserved

Ring 7 allocates main memory and I/O devices to hypervisors using the Region Descriptor Table (RDT), either with direct-mapped NAPOT memory spaces, or with page tables. When page tables are used, they are likely to use a large page size, such as 256 KiB. Ring 7 then creates an initial segment for the hypervisor code and data (since supervisors do not execute without translation) and transfers to the hypervisor, which then sets up further segments as needed. To protect hypervisors from each other, ring 7 uses the Region Protection Table (RPT), granting each hypervisor access only to its own regions.

Hypervisors in turn allocate one or more regions of the System Virtual Address (svaddr) space given to them by ring 7 to supervisors. This is also done using sub-regions of the their regions. Because hypervisors cannot modify the RDT and RPT directly, they make ring 7 calls to request the creation of region translation and protection tables for their guest supervisors. These are derived as subsets of the hypervisor’s own region protection table. Hypervisors also call ring 7 to change the rptp XCSR when context switching between supervisors.

For SecureRISC processors, ring 7 specifies region permissions by storing a siaddr to the table in the rptp XCSR. This would typically be context switched by hypervisors with a ring 7 call. While most supervisors would have a unique rptp value, in theory a single protection domain could be shared by a cooperating group of supervisors. Region protection is cached in translation caches along with other permissions and the lvaddr→siaddr mapping. The TCPT field exists to allow cached values to be differentiated when the rptp value or its contents changes in a fashion similar to the TCAT field of sdtp registers. Implementations may choose to implement less than the full 9 bits of the TCPT, which is WARL; software must determine the number implemented by writing all ones to this field and reading back to see which remain set.

Region Protection Table Pointer
71	64	63		11	10	2	1	0
240		siaddr_63..12+PDS	2^PDS		TCPT		PB
8		52−PDS	1+PDS		9		2

Field	Width	Bits	Description
PB	2	1:0	Protection Bits 0 ⇒ Table of single-bit values (no-access/access) 1 ⇒ Table of 2‑bit values (see above) 2 ⇒ Table of 4‑bit values (see above) 3 ⇒ Reserved
TCPT	9	10:2	Protection Domain ID
2^PDS	1+PDS	11+PDS:11	Encoding of Region Protection Table Size
siaddr_63..12+PDS	52−PDS	63:12+PDS	Pointer to Region Protection Table

Because translation cache misses in many microarchitectures will access the Region Protection Table through the L2 data cache, hypervisors may find it benefits performance to allocate regions to supervisors in a clustered fashion, so that a single L2 data cache block serves all RPT accesses during a supervisor’s quantum. Hypervisors and supervisors are not expected to need very many regions to describe the space allocated to them.

Non-processor initiating ports into the system interconnect (Initiators) must also be checked for region permission. When DMA is set up, the transaction includes the VMID and ASID. The initiators must use the VMID to access the VMT to access the associated Segment Descriptor Table Pointer (SDTP) and Region Protection Table Pointer (RPTP) values.

As an additional check on access by ports, each Initiator is programmed by the ring 7 with three or more Mandatory Access Control (MAC) sets. One is for the Initiator’s control registers, one for the Initiator’s TLB accesses, and the others are for accesses made by agents that the Initiator services. The MAC set for a region is stored as part of the Region Descriptors and cached in the Initiator’s TLB. The Initiator tests each access and rejects those that fail. Multiple orthogonal MAC sets are supported in the same manner as in the processors’ AccessLevels and AccessCats XCSRs. For example, the Region Descriptor Table and the page tables those reference might have a MAC category bit that would prevent reads and writes from anything but Initiator TLBs. Processors have Mandatory Access Control sets per ring. This would allow the same system to support multiple classification levels, e.g., Orange Book Secret and Top-Secret, with Top-Secret peripherals able to read both Secret and Top-Secret memory, but Secret peripherals denied access to Top-Secret memory.

It is recommended that SecureRISC main memory support MAC sets for each 256 KiB region of memory in a table writeable only by the Secure Enclave. Storing 16 bits of MAC for every 256 KiB of memory would require 4 MiB for each 512 GiB DIMM, which is feasible in SRAM in DRAM controllers. This provides a secondary access check of the RDT and RPT tables.

Encryption can also be used to protect multiple levels of data in a system. For example, if Secret and Top-Secret data in memory are encrypted with different keys, and Secret Initiators are only programmed with that encryption key, then reading Top-Secret memory will result in an authentication failure in most cases, or rarely just garbage being read. Writing Top-Secret memory from a peripheral programmed only with a Secret key will likewise result in an authentication failure when the memory is read with the proper key.

Because authentication codes are only 64 bits, there is a tiny probability that authentication will succeed, but the data would be unintelligible. Because garbage data creates debugging difficulties, it will be desirable to employ both MAC sets and encryption.

Memory Encryption

An optional system feature of Region Descriptor Entries (RDEs) is to specify that the contents of the memory of the region should be protected by authenticated encryption on a cache block basis. If the keys are sufficiently protected, e.g., in a secure enclave, the data may be protected even when system software is compromised. A separate table in the secure enclave gives the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system interconnect address would be used as the tweak. The challenge of cache block encryption, with only 64 bytes of data, is providing sufficient security with a smaller storage overhead than is typical for larger data blocks, while keeping the added latency of a cache miss minimal.

Cache lines are 576 tag and data bits. To encrypt, use a standard 128‑bit block cipher (e.g., standard AES256) six times in counter mode using 128 bits of the key to xor, producing ciphertext. Append a 64‑bit authentication code and the 64‑bit nonce used for encryption and authentication yielding 704 bits. The authentication code is a hash of the 576‑bit ciphertext added to the other 128 bits of the key applied to a different counter value. Add 8 ECC bits to each 88 bits produces a memory block of 768 bits. Main memory might then be implemented with three standard 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 counter mode xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would be much less time than the ECC check. Writes would incur the counter mode computation latency (primarily six AES256 computations). Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to efficiently support non-block writes (it being impossible to update the authentication code when without doing a read, modify, write).

Encryption would not be supported for untagged memory, as the purpose of untagged memory is primarily for I/O devices. Were encryption to be supported it would have to be with a tweakable block cipher (e.g., XTS-AES), because such memory would not support the extra bits required for tags and authentication.

In particular the encryption of the 576‑bit (ignoring ECC) cache block CL to a 768‑bit memory block ML (including ECC) using cache block address siaddr_63..6 and the 64‑bit internal state nextnonce would be as follows:

nonce ← nextnonce
nextnonce ← (nextnonce ⊗ 𝑥) mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
T0 ← AESenc(Key_127..0, nonce_63..0∥siaddr_63..6∥000000)
T1 ← AESenc(Key_127..0, nonce_63..0∥siaddr_63..6∥000001)
T2 ← AESenc(Key_127..0, nonce_63..0∥siaddr_63..6∥000010)
T3 ← AESenc(Key_127..0, nonce_63..0∥siaddr_63..6∥000100)
T4 ← AESenc(Key_127..0, nonce_63..0∥siaddr_63..6∥001000)
T5 ← AESenc(Key_255..128, nonce_63..0∥siaddr_63..6∥100000)_63..0
C ← CL ⊕ (T4∥T3∥T2∥T1∥T0)
A0 ← C_63..0 ⊗ K0 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A1 ← C_127..64 ⊗ K1 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A2 ← C_191..128 ⊗ K2 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A3 ← C_255..192 ⊗ K3 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A4 ← C_319..256 ⊗ K4 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A5 ← C_383..320 ⊗ K5 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A6 ← C_447..384 ⊗ K6 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A7 ← C_511..448 ⊗ K7 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
A8 ← C_575..512 ⊗ K8 mod 𝑥⁶⁴+𝑥⁴+𝑥³+𝑥+1
AC ← A8⊕A7⊕A6⊕A5⊕A4⊕A3⊕A2⊕A1⊕A0⊕T5
AE ← C ∥ nonce_63..0 ∥ AC
M0 ← ECC(AE_87..0) ∥ AE_87..0
M1 ← ECC(AE_175..88) ∥ AE_175..88
M2 ← ECC(AE_263..176) ∥ AE_263..176
M3 ← ECC(AE_351..264) ∥ AE_351..264
M4 ← ECC(AE_439..352) ∥ AE_439..352
M5 ← ECC(AE_527..440) ∥ AE_527..440
M6 ← ECC(AE_615..528) ∥ AE_615..528
M7 ← ECC(AE_703..616) ∥ AE_703..616
ML ← M7 ∥ M6 ∥ M5 ∥ M4 ∥ M3 ∥ M2 ∥ M1 ∥ M0

The inverse of the above for decrypting ML to CL and checking the authentication is obvious. A variant where a 144‑bit block cipher with a 144‑bit key (e.g., AES with 9‑bit S-boxes or an obvious Simon144/144) is used instead of 128‑bit AES is fairly obvious, and might make sense for datapath width matching, but the nonce and authentication would remain 64 bits to fit the result in 768, which probably makes datapath matching consideration secondary, and the extra key width is a slight annoyance (but see PQC note below where a 144‑bit key might be an advantage).

It is not yet determined whether K0, K1, …, K8 are constants or generated from the 256‑bit key via a key schedule algorithm, or simply provided by the software.

It would be nice to have a memory to memory copy engine that decrypts and encrypts while changing the nonce. This can be used for transmitting a cache block from a sender to a recipient, each using different encryption keys. To go at 704 non-ECC bits every two cycles requires 88×14=1232 S-boxes. The S-boxes compute the xor values for decryption on even cycles and for encryption on odd cycles.

The table in the secure enclave specifies up to 15 algorithm and key pairs (where typically a key is an encryption key and authentication key pair).

Value	Encryption	Auth	Extra bits	What
0	None	None	0	no encryption, no authentication
1	None	CWMAC	64+64	No encryption authentication using 64‑bit Carter-Wegman with 64‑bit nonce and Key_255..128
2	AES128	CWMAC	64+64	AES128 CTR encryption with 64‑bit nonce, 64‑bit tweak 64‑bit Carter-Wegman Authentication code Key_127..0 used for AES128 CTR, Key_255..128 used for authentication
3..15	Reserved

It is possible that Simon128/128 could be used in place of AES128 to reduce the amount of area required. The area of 16 S-boxes for one round AES being somewhat expensive, and six iterations of 14 rounds is too slow, so perhaps 96 S-boxes are required to keep the write latency latency reasonable (the read latency being covered by the DRAM access time with this many S-boxes).

Post-Quantum Cryptography (PQC) may require 192‑bit or 256‑bit keys for symmetric cryptography due to Grover’s Algorithm. AES192 and AES256 however require 12 and 14 rounds respectively (20% and 40% more than AES128), which may add too much latency to cache block write-back, which is already somewhat affected by 10 rounds of AES128, which are each relatively slow due the S-box complexity. It is possible that Simon128/192 or Simon128/256 become better choices at larger key sizes, as 192‑bit keys are only 1.5% and 5.9% additional rounds. On the other hand, it is also possible to use additional S-boxes for parallel AES computation. AES S-boxes are somewhat costly, which argues against this, but in counter-mode encryption inverse S-boxes are not required, so perhaps this is acceptable. For example, by using 32 S-boxes, the computations specified above allow for producing two of the six computations in parallel, with the S-boxes being used only three times rather than six. It would be nice to have cryptographers weigh in some of these issues (this author is definitely not qualified).

Given the 8×88=704 bits to be protected with 64 bits of ECC, which can detect up to 16 bits of errors and correct up to 8, it might be interesting to consider Reed-Soloman error correction and detection for block of 88 8‑bit codewords with eight check symbols, which would be able to detect up to 8 symbol errors (32 to 64 bits) and correct up to 4 symbol errors (16 to 32 bits). However, the latency of detection and ECC generation for cache fills becomes an issue.

Reset and Boot

A SecureRISC system is expected to include the following components:

One or more SecureRISC processors
Boot ROM (e.g., 1 MiB)
NVRAM (e.g., 1 MiB)
Main Memory (typically DRAM)
I/O devices
- Non-volatile storage (e.g., SSD, hard disk, etc.)
- Communication interface for TCP/IP network (e.g., Ethernet, WiFi, etc.)
- Boot device (e.g., non-volatile storage or network interface)

The most secure systems would also include a TPM or Secure Enclave. This can be thought of as ring 8 in the system. Simpler systems will use ring 7 of SecureRISC processors for the same functions.

Using the System Interconnect Address (siaddr) example earlier, with the high 14 bits being a port number for multiprocessor routing, the low 50 bits (1 PiB) might be organized as follows:

Example Port Address Map for SecureRISC systems
Lo	Hi	Size		Use
0_0000_0000_0000	0_0000_3FFF_FFFF	1	GiB	Reserved
0_0000_4000_0000	0_0000_FFFF_FFFF	3	GiB	I/O Device Registers
0_0001_0000_0000	0_00FD_FFFF_FFFF	244	GiB	Reserved
0_00FE_0000_0000	0_00FE_FFFF_FFFF	4	GiB	NVRAM
0_00FF_0000_0000	0_00FF_FFFF_FFFF	4	GiB	Boot ROM
0_0100_0000_0000	0_03FF_FFFF_FFFF	3	TiB	Persistent Memory
0_0100_0000_0000	3_FFFF_FFFF_FFFF	1020	TiB	Main Memory

The NVRAM would typically be protected by Mandatory Access Control. Ring 7 would typically enable such access only for itself.

On coming out of System Reset all ports in and out of the system, including memories, would initialize their required mandatory access control set and access set masks to all ones. Some SecureRISC processors will have a separate TPM or Secure Enclave. For these processors, boot will begin there, and that processor will likely initialize before other SecureRISC processors are allowed to boot. In this case it will be responsible for giving itself an all ones access control set and then programming the system devices and memories to their operating access control set and masks. It will then cryptographically verify the SecureRISC Boot ROM, and then take the SecureRISC processors out of Reset. In simpler systems, SecureRISC ring 7 executing from the Boot ROM and using NVRAM for configuration will implement the Root of Trust (RoT) and Trusted Execution Environment (TEE.) While it simplifies the system to use ring 7, the security of a separate Secure Enclave is recommended when feasible.

In systems with a TPM or Secure Enclave, devices and memories may implement two additional mandatory access control bits (bits 17..16) so that the Secure Enclave can restrict access to itself. Even ring 7 of system SecureRISC processors would not be able to add these bits to AccessRights[7], as they do not exist, thus making it impossible to override the Secure Enclave. I/O devices would likewise not be able to set bit 17, so the Secure Enclave might use bit 16 for I/O (though this is unlikely), but reserve bit 17 for its own accesses only.

SecureRISC processors have three levels of Reset and one Non-maskable Interrupt (NMI):

Power-on Reset
Hard Reset
Soft Reset
Non-maskable Interrupts (NMIs)

Power-on Reset is required when power is first applied to the processor, and may require thousands of cycles, during which time various non-digital circuits may be brought into synchronization. In addition the processor may run Built-in Self Test (BIST), which may leave the caches initialized, thereby eliminating the need for some of the steps below. Software detection of this might be based on reading status set by BIST as the first step (details TBD). After this initialization, Power-on Reset is identical to Hard Reset. Hard Reset forces the processor to reset even when there are outstanding operations in process, e.g., queued reads and writes, and will require system logic to be similarly reset to maintain synchronization. Power-on Reset and Hard Reset begin execution at the same hardwired ROM address. Soft Reset simply forces the processor to begin executing at the separate Soft Reset ROM address, while maintaining its existing interface to the system interconnect (e.g., queued reads and writes). Soft Reset may be used to restart a processor that has entered the Halt state. Finally Non-Maskable Interrupts cause an interrupt to a ring 7 address for ultra-timing-critical events. NMIs are initiated with an edge-triggered signal and should not be repeated while an earlier NMI is being handled. Timing-critical events that can be delayed during other interrupt processing should use normal message interrupts, to be serviced at their specified interrupt priority.

In many cases mBIST will have already initialized the caches. If not, Power-on Reset and Hard Reset begin with the vmbypass, icachebypass, and dcachebypass bits set. The first forces an lvaddr→siaddr identity mapping. This allows the hardwired reset PC to be fetched from a system ROM and initialize the rest of the processor state, including the lvaddr→svaddr and svaddr→siaddr translation tables. At this point it should clear the vmbypass bit. vmbypass cannot be reenabled once clear, and thus is available only to the Boot ROM. If mBIST has already initialized the instruction cache hierarchy, then icachebypass and dcachebypass will be clear on boot, and these steps may be skipped. If mBIST has already initialized the translation caches and region table, then vmbypass will be clear on boot, and this step may be skipped as well. However, the mBIST along is unlikely to properly initialize the region table, unless it is performed by a separate Root of Trust.

The Boot ROM is expected to initialize the various instruction fetch caches and then clear the icachebypass bit. Once clear, this bit may not be reenabled except by Power-on or Hard Reset. Next the Boot ROM is expected to initialize the various data caches and clear the dcachebypass bit. This bit also may not be reenabled except by Power-on or Hard Reset. Finally the Boot ROM is then responsible for starting the Root of Trust verification process and once that is complete, transferring to the hypervisor.

SecureRISC processors reset in an implementation-specific manner. During all three resets, the architecture requires some processor state to be set to specific values, and other state is left undefined and must be initialized by the boot ROM. In particular the following is required:

State	Initial value	Comment
PC	0xD7FFFFFFFFFF000000	Basic block descriptor pointer, ring 7, 16 MiB from end of address space
AccessRights[7]	0xFFF	Full access (e.g., NVRAM allowed)
QOS[7]	0	Highest quality of service
ALLOWQOS[7]	0	Allow writes
IntrEnable[7]	0	All ring 7 interrupts disabled.
vmbypass	1	Force lvaddr→siaddr identity map.
icachebypass	1	Bypass all instruction fetch caching.
dcachebypass	1	Bypass all data cache caching.

Once the Boot ROM has completed initialization of SecureRISC processor state not initialized by the hardware reset process, the Boot ROM consults the NVRAM to determine how to proceed. This NVRAM might direct the loading of software into main memory from non-volatile memory (e.g., a hard or solid state drive) or from the network. This software would then be cryptographically authenticated, and if successful, invoked.

Memory Options for SecureRISC

Most contemporary processors use a cache block size of 576 bits (512 data bits plus 8 bits of ECC for every 64 bits of data), and provide efficient block transfers of this size between main memory and the processor caches by using interconnect of 72, 144, 288, or 576 bits. The equivalent natural width for SecureRISC is 640 bits (512 data bits, 64 tag bits, and 8 bits of ECC for every 72 bits of data and tag). However, there are multiple ways to provide the additional tag bits for SecureRISC, including the use of a conventional 576‑bit main memory. A simple possibility is to set aside ⅛ of main memory for tag storage. Misses from the Last Level Cache (LLC) would then do two main memory accesses, one reading 576 and then another access reading 72 bits (a total of 648 bits—the additional 8 bits the result of not sharing ECC over tags and data).^* (There might be a specialized write-thru cache for the ⅛ of main memory after the LLC reserved for tag block read to exploit locality, but the coherence of this would need to be figured out.) Support for encryption of data in memory is a goal of SecureRISC, and good encryption requires the storage of authentication bits, increasing the size of cache blocks stored in main memory. The encryption proposed for SecureRISC encrypts 512 bits of data, 64 bits of tag into 704 bits of encrypted authenticated ciphertext, and then appends 64 bits of ECC (8 bits per 88) giving a 768‑bit memory storage, which conveniently fits three non-ECC DIMM widths. Alternatively, in a system with 512‑bit main memory, 1.5 main memory blocks could be used for a SecureRISC cache block (e.g., three transfers of 256 bits or six of 128 bits or twelve of 64 bits). Thus the cost for encrypted and tagged memory is the difference between two ECC DIMMs and three non-ECC DIMMs.

* If the system interconnect fabric is wide enough to support it, it may be preferable to move the read of the ≈⅛ of main reserved for tags into the memory controller, and then supply cache blocks with tags throughout the rest of the system.

The above is summarized in the follow table:

SecureRISC Memory Options
Data	Tags	Enc	ECC	Total	Organization	Type	Use
Cached, Tagged
512	64	128	64	768	96×8, 192×4, …, 768×1 or 64×12, 128×6, 256×3	Main	All
512	64		64	640	80×8, 160×4, …, 640×1	Main	All unencrypted
512	64		72	648	72×8, 144×4, …, 576×1 + 72×1	Main	All unencrypted ≈⅛ of main reserved for tags[1][3]
512	64	128	88	792	72×8, 144×4, …, 576×1 + 72×3	Main	All (≈⅓ of main reserved for tags + encryption)[2][4]
Cached, Untagged
512			64	576	72×8, 144×4, …, 576×1	I/O	Data only (no pointers or code)
512		128	64	704	?	?	Encrypted data only
Uncached
n.a.					8, 16, 32, 64, 128	I/O	Registers

Footnotes:

Actually about 88.88888% of main memory would be data storage, and about 11.11111% would be tag storage because the tag portion of memory doesn't require tags.
Similar to [1], about 66.66666% of main memory would be data storage, and about 33.33333% would tag storage. This could be compressed further, but only by crossing cache block boundaries for the extension (3 words per 8 required, so 6 words per 8 fit without crossing a boundary).
The L1 to L3 caches would likely store tags as part of the cache block. A L3 refill would read most of the block from the computed system interconnect address siaddr, and the remainder from
0³∥siaddr_MB-1..6∥0³ + OFFSET, where MB is the number of bits in the main memory siaddr, and OFFSET is approximately ⌈MEMSIZE×0.8888888÷64⌉×64 (the exact value being somewhat dependent on the main memory size MEMSIZE and would probably be configured by system software after boot). There would likely be a cache after the L3 to hold the rest of the cache block read for this remainder in case it is referenced subsequently. Such a tag remainder cache would need to be checked by coherency transactions on the system interconnect (e.g., invalidates), which has the potential to create false sharing at the 512 byte level, but this can be avoided by not targeting the L3 on an invalidate the hits in the tag remainder cache that does not hit in the L3. Since this cache would only be written on L3 evictions, it could be write-thru (i.e., L3 eviction writes a single 64‑bit group of 8 tags to the the tag portion of main memory) so that a L3 eviction simply writes to the 72 bits of the tag remainder.
Similar to [3], the tag plus encryption remainder could be cached in a remainder cache after the L3. The amount required for L3 miss is 3 words per cache block. Only two 3‑word remainders would probably be stored block in main memory to avoid crossing block boundaries, so only 3 words need be stored in the remainder cache with a single bit indicating which three. A L3 refill would read most of the block from the computed system interconnect address siaddr, and the remainder from
0∥siaddr_MB-1..6∥0⁵ + OFFSET, where MB is the number of bits in the main memory siaddr, and OFFSET is approximately ⌈MEMSIZE×0.3333333÷64⌉×64 (the exact value being somewhat dependent on the main memory size MEMSIZE and would probably be configured by system software after boot). As in [3], L3 evictions would probably write the remainder directly to main memory in a single 288‑bit transaction and invalidate the remainder cache to prevent false sharing.

It may be possible to add tags selectively to portions of memory. For example, slab allocators are typically page based. Thus one would direct the processor to read tags just from the beginning or end of the page. For example, the tag for vaddr might be read
from vaddr_63..12 ∥ 0³ ∥ vaddr_11..3
and the slab allocator made aware to start is allocation at offset 512 in pages, so tags are stored at offsets 32..511 (0..31 not being used as tags on tags are not required—these offset are available for allocator overhead). A Page Table Entry (PTE) bit might indicate this form of tag storage is in use. Separate mechanisms for bage tags, stack tags, and slab allocations larger than a page would still be required.

An Example Microarchitecture

Note: This example microarchitecture needs a lot of work. Most of it is quite old, and outdated by contemporary standards.

I expect both moderately speculative (e.g., 3–4 wide in-order with branch prediction) and highly speculative (e.g., 4–12 wide Out-of-Order (OoO)) implementations of SecureRISC to be appropriate, albeit with the highly-speculative implementations having solutions for Meltdown, SPECTRE, Foreshadow, Downfall, Inception, etc. and similar attacks that result from speculation. The moderately speculative processors are likely to be less vulnerable to future attacks, and the ISA should strive to enable such processors to still perform well (i.e., not depend upon OoO for reasonable performance, only for the highest performance). This is one reason I prefer the AR/XR/SR/BR/VR model (inspired by the ZS-1), where operations on the ARs/XRs may get ahead of operations on the SRs/BRs/VRs/MAs, and end up generating pipelined cache misses on SR/VR/MA load/store without stalling, thus being more latency tolerant. This is likely to work well for floating-point values, which naturally will be allocated to the SRs/VRs, but will depend on the compiler to put non-address generation integer arithmetic in the SRs/VRs. It may be that some microarchitectures will choose to handle SR load/store from the L2 data cache due to this latency tolerance, and the SR execution units will end up operating by at least the L2 data cache latency after the AR/XR execution units, causing branch mispredicts on BRs to have additional penalty, and for moves from SRs back to ARs to be costly, but this is better than penalizing every SR load miss.

An OoO implementation might choose to rename the AR/XR/SR registers to a unified physical register file but doing so would give up the reduced number of register file read and write ports that separating these files offers. I expect the preferred implementation will rename each to their own physical files.

The following example goes for full OoO (rather than the moderately speculative possibility mentioned above) but exploits the AR/XR vs. SR separation by targeting SR/VR/VM/MA load/store to the L2 data cache. The L1 data cache exists for address generation acceleration.

The challenge with highly speculative microarchitectures is avoiding vulnerabilities such as Spectre, Meltdown, RIDL, Foreshadow, Inception, etc. One possibility under consideration (not detailed in the table below) is to have all caches (including translation and control-flow caches) have a per-index way dedicated to speculative fills, and when the fill becomes not speculative, then you designate a different way as the speculative fill way for that index. Speculation pipeline flushes have to then kill the speculative fills, which is likely to reduce performance, so it might be necessary to introduce a per-process option. It is also a potential performance issue that there would only be one speculative fill way per index. It is the control-flow caches that are the most problematic because they usually have only probabilistic matching, but Inception shows that there is a potential hole here.

Another general consideration when employing speculative execution is to carefully separate privilege levels in microarchitectural state. For example, low-privilege execution should not be able to affect the prediction (branch, cache, prefetch, etc.) of higher privilege levels, or different processes at the same privilege level. Flushing microarchitectural state would be sufficient, but would unacceptably affect performance, so where possible, privilege level and process identifiers should be included in the matching used in microarchitectural optimizations (e.g., prediction). For example, the Next Descriptor Index and Return Address predictors suggested below include the ring number in its tag to prevent one class of attacks. For bypassing based upon siaddrs, a ring may be included; if the ring of the data is greater than the execution ring, this should force a fence. This does not address an attack from one process to another at the same privilege level, which would require inclusion of some sort of process id, which might be too expensive.

Note: Size in Kib (1024 bits) and Mib (1048576 bits) below do not include parity, ECC, column, or row redundancy. A + is appended to indicate there may be additional SRAM bits.

To illustrate how the heterogenous register files support wide issue, consider a microarchitecture targeting 12 instructions per cycle. In a conventional RISC ISA, this might require 12 read ports and 6 write ports for the integer register file, and 16 read and 8 write ports for the floating-point register file. For SecureRISC, the SRs would be similar to the floating-point register file, but the requirements for the ARs/XRs is reduced. For SecureRISC, 12 instructions per cycle might target eight early pipeline instructions and eight late pipeline instructions. The eight early instructions would be dispatched to four load/store units (4 AR read ports, 4 XR read ports), two branch units (2 AR read ports, 2 XR read ports), and four integer computation units (arithmetic, logical, shift) (8 XR read ports, 4 XR write ports). The eight late instructions would be dispatched to four computation units (four floating-point multiply/accumulate units or four integer units) (12 SR read ports, 4 SR write ports), and the vector/matrix unit. SR loads require 4 write ports, and stores 4 read ports. Writes to the ARs would either be from loads, or address calculations done in the load/store units (without resulting in a load or store); 2 write ports would be allocated for this purpose. The four write ports for XR integer computations would be also used for load results. AR/XR stores would use computational read ports. For the ARs this totals to 6 read, 2 write; for the XRs this totals to 10 read, 4 write; and for the SRs this totals to 16 read and 8 write ports.

Structure	Description
Basic Block Descriptor Fetch
Predicted PC	62‑bit lvaddr and 3‑bit ring
Predicted loop iteration	64‑bit count (initially from prediction, later from LOOPX) 64‑bit iteration (no loop back when iteration = count) 1‑bit Boolean whether LOOPX value received 64‑bit BB descriptor address with c set that started the loop
Predicted CSP	8×(61+3)‑bit 61‑bit lvaddr_63..3 and 3‑bit ring
Predicted Basic Block Count	8‑bit bbc
Predicted Basic Block History	128‑entry circular buffer indexed by bbc_6..0 (see below) ~9 Kib, not including register rename snapshot (~48 KiB?), and CSR reads
TLB IDs	TCAT: 10 bits TCPT: 4 bits
BB/Instruction TLB Page Sizes	2¹⁴ 4 KiB 2¹⁸ 256 KiB
Load/store TLB Page Sizes	2¹² 4 KiB (compatibility and I/O) 2¹⁸ 256 KiB 2³³ 8 GiB
Second-level Translation Page Sizes	2¹² 4 KiB (compatibility and I/O) 2¹⁸ 256 KiB 2³³ 8 GiB
L1 BB Descriptor TLB	256 entry, 8‑way set associative, 640‑bit block (8 SDE/PTEs), mapping TCAT‖lvaddr_63..12 to siaddr_63..12 in parallel with BB Descriptor Cache, block index: lvaddr_14..12, set index: lvaddr_16..15, tag: TCAT‖lvaddr_63..17, data: siaddr_63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Descriptor/Instruction TLB with 640‑bit read 20+ Kib data, 1.9+ Kib tag
L2 BB Descriptor TLB	2048 entry, 8‑way set associative, 640‑bit block (8 SDE/PTEs), block index: lvaddr_14..12, set index: lvaddr_19..15, tag: TCAT‖lvaddr_63..20, data: siaddr_63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data Cache with 512‑bit read and augmented with SDE bits 160+ Kib data, 7.4+ Kib tag
L2 TLB miss L1 walk cache	(used on L2 TLB miss to reduce L2 Data Cache latency for reading SDEs and PTEs) 20 entry, fully associative, 1 cycle latency, 512‑bit block (4 SDEs or 8 PTEs), tag: siaddr_63..6, block index: siaddr_5..3, filled from L2 Data Cache with 512‑bit read 10+ Kib data, 1.13+ Kib tag
L2 TLB miss L2 walk cache	(used on L2 TLB miss to reduce L2 Data Cache latency for reading RDEs and PTEs) 20 entry, fully associative, 1 cycle latency, 512‑bit block (4 SDEs or 8 PTEs), tag: siaddr_63..6, block index: siaddr_5..3, filled from L2 Data Cache with 512‑bit read 10+ Kib data, 1.13+ Kib tag
L1 Page Table Cache	(used on L2 TLB miss to reduce L2 Data Cache latency) 20 entry, fully associative, 512‑bit block (8 PTEs), tag: siaddr_63..6, block index: siaddr_5..3, filled from L2 Data Cache with 512‑bit read 10+ Kib data, 1.13+ Kib tag
L2 Page Table Cache	20 entry, fully associative, 512‑bit block (8 PTEs), tag: siaddr_63..6, block index: siaddr_5..3, filled from L2 Data Cache with 512‑bit read 10+ Kib data, 1.13+ Kib tag
BB Descriptor Cache	4096 descriptors (65 bits each), 8‑way set associative, 520‑bit block size, 65‑bit read, 520‑bit tagged refill, block index: lvaddr_5..3, index: lvaddr_11..6, tag: siaddr_63..12, 1.5 cycles latency, 2 cycles to predicted PC, filled from L2 Descriptor/Instruction Cache on miss and by prefetch. Might include some branch prediction bits that are initialized from hint bits, but then updated (whether to do this depends on the whether a separate write port is required, in which case a separate RAM is probably appropriate). For example, a simple 2‑bit counter might serve as a first stage for YAGS or TAGE. 260+ Kib data, 26+ Kib tag
Next Descriptor Index Predictor	512×(10+10+3), direct mapped (sized to access in less than a cycle) index: lvaddr_10..2, tag: lvaddr_20..11 + 3‑bit ring, data: lvaddr_11..2, 1 cycle to predicted BB Descriptor Cache index, This predictor is accessed in parallel with the BB Descriptor Cache (BBDC). It contains the most recent flow change hits from the BBDC and is used to accelerate fetch of next BB Descriptor by starting a new BBDC read 1 cycle after the last. If 2-cycle BBDC access and prediction yields the same index, then the read of the target BB Descriptor is accelerated by one cycle. If predicted next index differs then BBDC value fetched early is discarded. The index and data start at bit 2 anticipating tag 253 packed Basic Block Descriptors. The ring is included in the tag, and the data is only used if PC.ring ≤ tag.ring. 11.5+ Kib
Return Address Prediction	The committed version of return addresses are stored on per-ring call stacks in memory. This structure maintains speculative versions of those lines for the BB Descriptor next field types Call, Conditional Call, Call Indirect, and Conditional Call Indirect. Exceptions also speculatively update this structure. Attempts to write a block not in this structure fetch the block from memory unless CSP[PC.ring]_5..3 = 0, since in that case the call stack is initializing a new block. Lines from this structure are never written back to memory. This structure is read on the BB Descriptor next field type Return and Exception Return to predict the target PC. Unlike other microarchitecture Return Address Stacks, this structure is block-oriented, tagged, and searched by the predicted CSP[PC.ring], and may be filled from a block at a time from memory with non-speculative values as needed (and thus more likely to predict successfully compared to the typical wrapping Return Address Stack or after a context switch that changes CSP[PC.ring]). It is 8 entries and fully associative to handle cross-ring call and return gracefully. index: lvaddr_5..3, tag: TCAT ∥ lvaddr_63..6 + 3‑bit ring An entry matches only if PC.ring = tag.ring. 4.5+ Kib data, 464+ bits tag
Branch Predictor	~16 KiB BATAGE Whisper add-on? Consider using YAGS with 8192 entries of 2‑bit saturating counters in the choice table, and 1024 entries of 2‑bit saturating counters with 8‑bit tags for the T and NT tables (total 36,864 bits) as a replacement for the first two TAGE stages. ~128 Kib
Indirect Jump/Call Predictor	~16 KiB ITTAGE?
Loop Count Predictor	Predict loop count after fetching BB descriptor with c set. TAGE-like, based on history, no hit is equivalent to 2¹⁶−1 first-level (no history): 128 entry, 2‑way set associative index: lvaddr_8..3 of BB descriptor with c set tag: lvaddr_16..9 + 3‑bit ring data: 16‑bit count (0..65535) Prediction used only if PC.ring ≤ tag.ring. Written only on mispredicts that occur prior to LOOPX value received. 2+ Kib first-level data, 1+ Kib tag (other levels TBD)
BB Fetch Output	8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr_63..2, instruction start mask, branch and jump prediction to check
Instruction Fetch
L1 Instruction Cache	2048 entry (128 KiB), 4‑way set associative, 512‑bit block, read, write index: siaddr_14..6, tag: siaddr_63..15, 2-cycle latency, use 0-2 times per basic block descriptor, so 0 or 2–3 cycles for entire BB instruction fetch, filled from L2 Descriptor/Instruction Cache on miss and prefetch, experiment with prefetch on BB descriptor fill experiment with a larger cache and 3-cycle latency 256+ Kib, 24.5+ Kib 0 fetches required if the previous 512‑bit fetch covers the current one
Instruction Fetch Output	32‑entry Instruction Queue of 80‑bit decoded AR/XR instructions 32‑entry Instruction Queue of 96‑bit decoded SR/BR/VR/VM/MA instructions (16‑bit, 32‑bit, 48‑bit, and 64‑bit instructions expanded to canonical formats)
L2 Fetch
L2 Combined Descriptor/Instruction Cache	8192 entry (512 KiB), 8‑way set associative, 520‑bit block, read, write, index: siaddr_15..6, tag: siaddr_63..16, filled from system interconnect or L3 on miss and prefetch, evictions to L3 2080+ Kib data, 192+ Kib tag
AR/XR (Early) Execution Unit
PC, CSP	Committed values
Register renaming for ARs	16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from AR free list. 96 bits
Register renaming for XRs (and CSP?)	16×6 8‑read, 4‑write register file mapping 4‑bit a, b, fields to physical XR numbers and assigning d from XR free list. 96 bits
Register renaming for BRs	16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from BR free list. 96 bits
Register renaming for SRs	16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from SR free list. 96 bits
Register renaming for CARRY	3‑bit register for 1→8 mapping, 8‑bit bitmap of free entries for allocation 3 bits
(VRs/VMs/MAs are not renamed)
AR physical register file	128×144 (+ parity) 6‑read, 4‑write
XR physical register file	128×72 (+ parity) 6‑read, 4‑write
Segment Size Cache	for segment bounds checking: 128 entry, 4‑way set associative, parity protected, mapping TCAT‖lvaddr_63..48 to 6‑bit segment size log2 ssize and 2‑bit G0 for eight segments (one L2 TLB block), index: lvaddr_55..51 per way tag: 22 bits (10‑TCAT and lvaddr_63..56), per way data: 64 bits indexed by lvaddr_50..48 filled from L2 Data TLB 8+ Kib data, 2.75+ Kib tag
Segment Descriptor Cache	An alternative to the Segment Size Cache would be a cache of full Segment Descriptor Entries (SDEs). This would be used to save a L2 Data Cache access on some Translation Cache (TLB) misses at the cost of more complexity in the page table walk process, specifically a conditional based on hit or miss in the new cache). 128 entry, 4‑way set associative, parity protected, mapping TCAT‖lvaddr_63..48 to four 90‑bit SDEs index: lvaddr_54..50 per way tag: 23 bits (10‑TCAT and lvaddr_63..55), per way data: 360 bits indexed by lvaddr_49..48 filled from L2 Data TLB 45+ Kib data, 2.9+ Kib tag
L1 Data TLB	512 entry, 8‑way set associative, 640‑bit block (8 SDE/PTEs), mapping TCAT‖lvaddr_63..12 to siaddr_63..12 in parallel with L1 Data Cache, block index: lvaddr_14..12, set index: lvaddr_17..15, tag: TCAT‖lvaddr_63..18, data: siaddr_63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data TLB with 640‑bit read 40+ Kib data, 3.8+ Kib tag
L2 Data TLB	2048 entry, 8‑way set associative, 640‑bit block (8 SDE/PTEs), block index: lvaddr_14..12, set index: lvaddr_19..15, tag: TCAT‖lvaddr_63..20, data: siaddr_63..12, XWR, R1/R2/R3, etc. (80 bits), filled from L2 Data Cache with 512‑bit read and augmented with SDE bits 160+ Kib data, 7.4+ Kib tag
L1 Data Cache	512 entry (~36 KiB), 4‑way set associative, 576‑bit block, 144‑bit read, 576‑bit refill, index: lvaddr_12..6, tag: siaddr_63..13, write-thru, filled from L2 Data Cache on miss or prefetch 288+ Kib data, 25.5+ Kib tag
Return Address Stack Cache	8‑entry, fully associative, 576‑bit block size, fill and writeback to L2 Data Cache, subset and coherent with L2 Data Cache tag: siaddr_63..6 + 3‑bit ring, 4.5+ Kib data, 488+ bit tag
L2 Data Cache	32768 entry (~2.25 MiB), 8‑way set associative, 576‑bit block, read, write, index: siaddr_17..6, tag: siaddr_63..18 + state, write-back, used for SR/VR/VM/MA load/store and L1 Data Cache misses, filled from system interconnect or L3 on miss or prefetch, eviction to L3 18+ Mib data, 1.5+ Mib tag
L2 Data Cache Prefetch	TBD, possibly based on Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetch (16.7 KiB).
AR Engine Output	64‑entry BR/SR/VR/VM/MA operation queue
BR/SR/VR/VM/MA (Late) Execution Unit (tends to run about an L2 Data Cache latency behind the AR Execution Unit)
BR physical register file	64×1 6‑read, 2‑write
SR physical register file	128×72 (+ parity) 8‑read, 4‑write
CARRY physical register file	8×64 (+ parity) 1‑read, 1‑write
VL register file	If not renamed: 16×9 (+ parity) 3‑read, 1‑write 160 bits If renamed: 32×9 (+ parity) 2‑read, 1‑write 320 bits
VR register file	16×72×128 (+ parity) 4‑read, 2‑write 144+ Kib
VM register file	16×128 (+ parity) 3‑read, 1‑write 2080 bits
MA register file	4×32×128×128 (32+1 for parity) 1‑read, 1‑write 2.1 Mib
Combined Fetch/Data
System virtual address TLB	256 entry, 8‑way set associative, 640‑bit block (8 RDE/PTEs), mapping system virtual addresses to system interconnect addresses (maintained by hypervisor) block index: lvaddr_14..12, set index: lvaddr_16..15, filled from L2 Data Cache with 512‑bit read, expanded with RDE bits sized small because large page sizes expected 160+ Kib data, 12+ Kib tag
L3 Eviction Cache serving multiple processor L2 Instruction and L2 Data caches	262144 entries (~18 MiB), 8‑way set associative, 576‑bit block size, non-inclusive, index: siaddr_20..6, tag: siaddr_63..21 + state, write-back, plus 8‑way set associative directory for sub caches, filled by evictions from L2 Instruction and Data caches 144+ Mib data, 11.5+ Mib tag, 11.5+ Mib directory

Using a block size in TLBs is unusual, but could represent a performance boost, given that the L2 data cache read is going to supply a whole block anyway. Without the block size, the L1 TLBs would only contain 32 or 64 entries for critical path reasons, and this is quite small. The issue is second level translation and svaddr protections. Performing these lookups for 8 PTEs would slow the TLB refill, so I expect the example microarchitecture to mark 7 of the 8 PTEs as requiring secondary checks and continue. On a match to an entry that requires secondary checks, these would be performed then, and the entry updated.

For tracking Basic Blocks (BBs) in the pipeline, there would be a 8‑bit internal basic block counter bbc (independent of the larger BBcount[ring] counters) incremented on each BB descriptor fetch. The low bits, bbc_6..0, would be used as index to write basic block information into a 128‑entry circular buffer for basic blocks in the pipeline, including the BB descriptor address, the prediction to check (including the conditional branch taken/not-taken, loop back prediction, and full target descriptor address for indirect jumps and returns), and so on. The circular buffer entry would also include a bit mask of completed instructions, and the entry may only be overwritten when all instructions are completed. Completion of all instructions of the BB causes state updates to commit (e.g., PC, CSP, and call stack writes).

Basic block ordering is tested testing the sign bit of subtraction: BBx is after BBy if (bbcx − bbcy) > 0. Each instruction in the pipeline includes its bbc value and the offset in the basic block. When a misprediction is detected, all instructions with bbc values after the basic block with the misprediction (using the above test) are flushed from the pipeline, bbc is reset to the bbc value of the mispredict plus one, and basic block descriptor fetch is restarted using the correct next descriptor (e.g., PC+8 for a not-taken conditional branch or the target calculated from the targr and targl fields, or the JMPA/LJMP/LJMPI/SWITCH destination for an indirect jump). Whether the circular buffer stores the targr/targl values or refetches them is TBD. 128 basic block predictions may seem large, but with the SecureRISC loop count prediction, 100% accuracy might be achieved, which means the 128‑entry circular buffer supports 128 loop iterations, and each loop iteration might only be three or four instructions. Note that in SecureRISC, there are 0, 1, or 2 predictions to check per basic block (e.g., a conditional branch and indirect jump, e.g., for a case dispatch), so 0, 1, or 2 mispredicts are possible (i.e., there might be two flushes).

I expect that immediately after each 512‑bit read of the L1 instruction cache, the start mask from the Basic Block (BB) descriptor will be used to feed the specified bits to eight parallel decoders which will convert them to a canonical form, something along the lines of the following. These canonicalized instructions would then be put into queues for the early pipeline (e.g., operations and branches on XRs and loads to and stores from ARs/XRs), late pipeline (SRs/BRs/VRs/VMs/MAs, or both of these (e.g., for loads to and stores from SRs/VRs/VMs/MAs and moves between early and late).

80‑bit early-pipeline canonical instruction example
79	24	23	22	21	17	16	12	11	7	6	0
i		sa		b		a		d		op80
56		2		5		5		5		7

96‑bit late-pipeline canonical instruction example
95	42	41	38	37	32	31	26	25	20	19	14	13	0
i		e		c		b		a		d		op96
54		4		6		6		6		6		14

Secure Enclave Example

While there are advantages to using a certified TPM or Secure Enclave in a SecureRISC system, there are issues that result from differences in main memory width, encryption, mandatory access control, and address size (many Secure Enclaves support physical addresses less than 64 bits and so might not work in a SecureRISC system). One possibility is to use a simple SecureRISC processor for the enclave. This would be implemented with a much simpler microarchitecture (e.g., no speculation) to avoid the security issues with today’s processors. Also, it would not run application code, which further reduces the attack surface. This enclave processor would not need rings, virtual to physical translation, floating-point, vector instructions, etc. Formal verification might be used in its design. The code size for the enclave is expected to be less than 1 MiB (perhaps much less). Either the SRs could be omitted as well, or provided for cryptography algorithms (there is even the possibility of widening the SRs to 512 bits for this purpose). This is TBD. In-order, 2-instruction superscalar should be sufficient, with the capability to execute a load/store and computation instruction in parallel every cycle. A target design might be along the following lines:

16 KiB BB Descriptor Cache + 8 KiB BB Descriptor ROM (1 cycle)
16 KiB L1 Instruction Cache + 8 KiB Instruction ROM (1 cycle)
16 KiB L1 D-cache, 4-way set associative (1 cycle, 2 cycles from load to use)
128 L2 cache, 4-way set associative (4 cycles)

The BBDC and Instruction caches would be organized as 5‑way set associative, with the 5th way being ROM (a single hardwired address for the ROM tag compare).

An Example Secure Enclave Pipeline
Name	Left Pipe	Right Pipe	L2
B	BBD ROM/RAM
I	Instruction ROM/RAM
Q	Instruction Queue
S	Instruction select/issue
R	AR/XR read, bypass, decode, etc.
A	Address Generation	ALU/Branch
D	Data cache	ALU stage 2
E	Exception resolution		L2 1
W	AR/XR write, instruction commit, queue SR		L2 2
X			L2 3
Y	SR read		L2 4
Y	SR execute
Z	SR write

In a full development cycle, a simple Secure Enclave processor might be developed first, and used to accelerate compiler and performance analysis while the more complex, higher performance processors are still in design.

Questions and Things Still Undecided

The following list is in no particular order. Also, some items are old and should be pruned.

Invent mechanisms to protect Function as a Service (FaaS) containers sharing a single address space from each other. The issue is not actual accesses, but side-channel attacks, as mitigations for things like Spectre are often based on tagging microarchitectural state with address space and privilege level, which is the same for containers sharing an address space. One possibility is to restrict speculative references to a single segment number.
Look at the following (unsorted) list of papers related to instruction fetch.
- Instruction Block Movement with Coupled High-Level Program Sequencing
- Instruction Presending
- Blasting through the Front-End Bottleneck with Shotgun
- AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
- Warming Up a Cold Front-End with Ignite
- Exploiting Page Table Locality for Agile TLB Prefetching
- Rebasing Instruction Prefetching: An Industry Perspective
- I-SPY: Context-Driven Conditional Instruction Prefetching with Coalescing
- Micro BTB: A High Performance and Lightweight Last-Level Branch Target Buffer for Servers
- A Storage-Effective BTB Organization for Servers
- All Your PC Are Belong to Us: Exploiting Non-control-Transfer Instruction BTB Updates for Dynamic PC Extraction
- Branch Target Buffer Organizations
- ACIC: Admission-Controlled Instruction Cache
- Confluence: Unified Instruction Supply for Scale-Out Servers
- MANA: Microarchitecting an Instruction Prefetcher
- Proactive Instruction Fetch
- The Last-Level Branch Predictor
- Criticality Driven Fetch
- Path-Based Next Trace Prediction
SecureRISC already embodies many of these ideas in this ISA via its BBDs, but may be able to dynamically augment the BBDC with information suggested by some of the above. For example, a SecureRISC microarchitecture could augment its BBDC with a bit mask of following BBDs likely to be next fetched. These can be prefetched using BBDC banking, and the instructions of referenced by these BBDs prefetched. Such BBDs could be held in a single-cycle access CAM. Shotgun suggets having a dedicated unconditional BTB and augmenting the RSB. There may also be an opportunity for keeping some information in the L2 BBD/Instruction cache.
Should CSRs with addresses not have other fields for CHERI? CHERI wants to store addresses as capabilities, and extra fields complicate this.
Look at Common Weakness Enumeration.
The currently defined SecureRISC CSRs do not alias with each other. That is, the underlying state bits are only referenced by a single CSR address. The obvious definition of a set of floating-point CSRs similar to RISC‑V for individual field access and for context switch would change that. Is this a concern, or should SecureRISC seek to maintain the no-alias property? Note: the C99 standard functions fesetround and fegetround mandate a dynamic rounding mode. The primary question is whether CSRs that combine state bits for context switch is useful, and a secondary question is how important is it that the flags and rounding mode be read and written separately.
Look at Hardware Secure Boot.
Add prefetch algorithm overrides to SDE word 1?
Look at ConBOOM.
Look at Intel TDX Connect TEE-IO Device Guide.
Look at Confidential computing for OpenPOWER.
Should there be support for computing the bit position of the least-significant bit set of IntrPending[ring] & IntrEnable[ring] for the purpose of quickly dispatching to the next interrupt handler?
Add fractional power of two to segment size (SS) in SDEs? For example, augment ssize with one bit to indicate segment size is 2^ssize−2^ssize−2. This reduces the size of the first-level page table, allowing the supervisor to use ¼ of the first-level page table for other purposes, which might be useful for very large segments.
The maximum width of VMIDs and ASIDs may be excessive. Making each 24 bits would allow a (VMID, ASID, SEGMENT) tuple to be 64 bits instead of 80 bits. But probably 16‑bit VMIDs are sufficient. Or the VMID and ASID could be combined into an even smaller number of bits, and ring 7 software would allocate ranges to hypervisors, which would in turn allocate subranges to supervisors. This would primarily work when all VM is handled by ring 7, and supervisors call hypervisors for VM changes, and hypervisors call ring 7.
Now that CAM and TCAM cells are available in ≤5 nm nodes, it makes sense for the example to switch to fully-associative L1 TLBs. The cells should support 32 entries in the time allotted to the L1 TLBs, using CAM cells for TCPT ‖ TCAT ‖ lvaddr_63..33 and to implement multiple page sizes, TCAM cells would be used for lvaddr_32..12. In the case of a miss in the L1 and L2 TLBs, the CAM cells should also be used to implement a single-cycle 64‑entry Segment Descriptor Cache. Most page tables are expected to be single level after the Segment Descriptor is available, and so page walk caches are generally not necessary. However, some supervisors might chose two-level page tables for huge HPC address spaces, and processors targeting that situation might add a CAM on TCPT ‖ TCAT ‖ lvaddr_63..33 to shorten the latency to the last-level page table.
The G in SDEs means that TCAM cells are required on the TCAT bits, which is unfortunate. If the G can be eliminated, that would be helpful. Alternatively, could TCAT = 0 be reserved, and the hardware set TCAT to zero in translation cache entries for segments with G set?
Should add my favorite version of performance measurement (aka counters).
Would it work to address SDT and Page Tables with lvaddrs? This would allow single-level tables that could be sparse or scattered in siaddr. A L1 TLB miss would then form the PTE address to read and access the L2 TLB to translate it to an siaddr. This could trap to software to write a L2 TLB entry that would allow the translation to succeed on return. A L2 TLB miss on a page table translation should be infrequent. It is a bit complicated, but has the potential for replacing specialized page walk caches with just the L2 TLB.
Figure out if sandboxes can use one ring instead of two (can ring 1 be omitted?). The issue is how to make ring 0 RX and RW by ring 1. Current ring brackets make it RWX by ring 1, which is not good. The Multics execute ring bracket with the SecureRISC inversion is [R2,R1]. Changing it to [R2,R1−1] when W is set would make it not possible to have write and execute in the same ring, which would force JIT compilers to use two rings. Would this be acceptable?
Does SecureRISC need a simple multiply instruction the operations on XRs for address arithmetic (e.g., multiplication by the stride of a multi-dimensional array)? Using the multiply on SRs may be too slow in some implementations.
How might a (VMID, ASID) tuple hash work? SecureRISC systems would provide one or more components that convert this tuple into a Segment Descriptor Table Pointer (SDTP) using a cuckoo hash table. The tuple is a 64-bit value. The component is programmed with a system interconnect address (siaddr) hash table address and size (with NAPOT encoding). It uses a cuckoo hash table using two parallel CRCs appropriate to the specified size (logic for each size is instantiated, and the appropriate computation is selected). Two cache blocks are read in parallel based on the hash values. Each 512‑bit cache block holds four key/SDTP pairs in 128 bits. All eight keys are checked (four from each block), and the matching key supplies the associated SDTP. All system interconnect initiators (including processors) will have a cache, keyed by this tuple, with SDTP data, and only query the component on misses.
Alternatively, how might a (VMID, ASID, SEGMENT) tuple hash work? SecureRISC systems would provide one or more components that convert this tuple into a Segment Descriptor Entry (SDE) using a cuckoo hash table. The tuple is a 80-bit value. The component is programmed with a system interconnect address (siaddr) hash table address and size (with NAPOT encoding). It uses a cuckoo hash table using two parallel CRCs appropriate to the specified size. Two cache blocks are read based on the hash values. Each 512‑bit cache block holds two key/SDE pairs in 256 bits. All four keys are checked (two from each block), and the matching key supplies the associated SDE. All system interconnect initiators (including processors) will have a tuple to SDE cache, and only query the component on misses. If the key/SDE pairs can be compressed to 170 bits or less, the a cache block can contain three instead of two. This may be possible by combining the VMID and ASID into a single 32‑bit ID and then squeezing 6 bits out of SDEs, but this becomes somewhat ugly.
The reason to use only two hashes rather than three is simply to reduce L2 cache pollution, if this is implemented in the processor. In a separate component, the issue may be DRAM row cycle time.
Insertion into these tables and setting the table pointers in the components would be performed by ring 7.
If dynamic resizing is required, then elastic cuckoo hash tables could be provided.
Rather than a separate component, this logic could exist in each processor, and providing a mechanism for the processor to handle requests from other initiators would be created.
Look at DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory, Optimizing the TLB Shootdown Algorithm with Page Access Tracking, and Hardware Translation Coherence for Virtualized Systems.
Is the execute permission bit in second-level RDEs and PTEs needed?
Should there be a separate exception stack? There could be a separate ESP[ring] RCSR. This could be cache block aligned to simplify implementations.
Need to work on Translation Cache microarchitecture a lot. Need to figure out what page walk caches (PWCs) are appropriate. Also, is there a way to have the microarchitecture handle translation cache invalidation on PTE updates for where there is only page table referencing a page? Add an Injective bit in SDEs and RDEs to indicate this E.g., if the PWCs were based on physical addresses (as in AMD processors), then they could be easily invalidated using a bit in L2 directory, and these PWCs could say which TLB entry to invalidate.
As a start, consider a 64‑entry fully-associative CAM-based Segment Descriptor Cache (SDC) using TCPT ‖ TCAT ‖ lvaddr[63..48] as the tag, and the SDE as the data. On TLB misses, this allows the page table reads to start right away. In addition, provide a 32‑entry 8 GiB CAM-based cache on using TCPT ‖ TCAT ‖ lvaddr[63..33] that is accessed in parallel with the SDC for locating second-level page tables, when an operating system uses those.
What about moving TCPT values from tag side of TLB to data side? The data side could then store 1 or 2 TCPT values and their associated protection. This reduces the tag compare time and if 2 TCPT values are stored, the number of entries required for sharing, but at the cost of potential trashing if more than 2 TCPT values share a region.
Rename TCAT to TC1T and TCPT to TC2T?
Use a hash table to look up VMID, ASID, REGION to get a sdtp? Program a CRC polynomial and hash table address.
Look at Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism and Parallel Virtualized Memory Translation with Nested Elastic Cuckoo Page Tables as an alternative to radix page tables. Elastic Cuckoo Page Tables (ECPTs) don’t help (and likely hurt) segments with a single level of page table (a goal of SecureRISC is to encourage this for most segments). If it were not for the SDE lookup, ECPTs might help in more cases, but after the SDE lookup (which is required for extended access control that won’t fit in PTEs), many segments will require only a single level page table unless the operating system can only manage one allocation size (e.g., 4 KiB) for pages and tables. However, the SDE can help elastic cuckoo hash tables by specifying a per-segment page size, eliminating Cuckoo Walk Tables (CWTs). A follow-on to ECPTs from the same authors is Memory-Efficient Hashed Page Tables. Also need to consider Direct Memory Translation for Virtualized Clouds.
Explore design of a memory to memory DMA engine with AES-256-CTR encryption/decryption options that could be used by all rings, but with keys handled by the Secure Enclave. Reads could be optionally decrypted and write optionally encrypted, potentially with different keys and nonces. This should provide the option to copy encrypted data (changing the nonce) without having access to the key, so that hypervisors and supervisors can do page movement on encrypted main memory for confidential computing.
I have been writing TPM or Secure Enclave. Perhaps just decide on one or the other.
Need to define cryptographic instructions. These should probably start with PQC algorithms. In particular, need to define Random Bit Generation support. Probably only AES-256 for encryption?
Is there a use for a data structure holding information for each main memory page in system interconnect address space. AMD’s Reverse Map Table (RMP) is an example of one such structure using 0.4% of main memory. This could be 4 MiB at the end of every 1 GiB of main memory (or 1 MiB at the end of every 256 MiB). Or it could be tied to region size. Why does AMD RMP need 16 bytes per page? Could it be half the size? Could it just store a Mandatory Access Set? Linux also has reverse page maps for internal purposes; should these be merged? One issue with reverse page maps is that the access check is sequential (the translation is required to access the reverse map) rather than parallel.
Now that RPT is associated with VMID in VMT, perhaps just support multiple RDTs? The ring 7, hypervisor, supervisor layering wants three levels of protection, and two level translation only naturally provides two, so is some third protection layer required? Need to consider. It does seem important if multiple hypervisors are to be supported, but perhaps even just for Ultravisor/TEE/etc. layering? Can RPT be replaced with mandatory access control? Can second-level translation be replaced by per-ring SDT and calling more privileged ring to write PTEs? Consider whether to move all RDT, SDT, and page table control into ring 7. This would eliminate the need for second level translation, which would boost performance. It would eliminate the Region Protection Table, as that could be done in ring 7 software, checking and translating the addresses to be written to the RDT, SDTs, and PTEs. The RDT, SDTs, and page tables would probably be in memory allocated for hypervisors and supervisors, but would have to be prevented from being accessed by them, which creates some issues. This would allow ring 7 to also partition ASIDs among various hypervisors and supervisors.
Adopt IBM’s Ultravisor terminology for ring 7?
Need to figure out how RPT interacts with I/O.
Where is the NMI vector address specified? Should the NMI vector address be in ROM? The ROM code would then call a handler registered by less secure code.
Direct mapping of segments (e.g., for I/O space) might benefit from having a simple permission table for portions of that space. The question is how to specify the permission table address in addition to the direct-map address. Perhaps add a region MAP type where the siaddr field is a pointer to a protection table (similar to rptp) but at finer grain? In this case, siaddr would be identical to svaddr. This would be used for mapping physical memory to supervisors.
RCSRs used to be written by rings ≥ PC.ring. This test was recently broadened by introducing RECSRs to generalize VMIDRE etc. Is this both necessary and sufficient?
Should pointers include a Write permission bit? Other permissions? A read-only pointer might be used for reference parameters for example. In particular, a store using such a pointer as a base register would take an access fault. Probably SecureRISC should leave this sort of checking to CHERI capabilities, which are much wider and therefore better suited, to support pointer permissions. The sized pointer text above does lay out a possible read-only pointer encoding for small sizes, but eating into the sizes that can be encoded.
Add an instruction to load the memory attributes for a given address? These are currently called System Interconnect Address Attributes (SIAAs), but would best be renamed to just memory attributes or something similar if exposed to user mode.
Does SecureRISC need instructions that privileged rings can use to read BBD and instructions from execute-only segments for emulation?
Consider not masking vector operations, but instead just depend on the existing vector select on mask bits. This makes renaming simpler. Loads and stores might still have masking options for the memory access (perhaps loads would just load 0 for masked memory accesses?).
Remove GEN field from SDEs, add to amatch registers?
The SW field is now 4 bits to match x86-64 and ARM for Linux. Is this really necessary? Are generation dirty levels in PTEs useful, or does it cost too much to query the PTEs (it requires a gate call to the supervisor ring)? Do we need a user ring method of getting this? Generation dirty levels for macro regions larger than 512ªMiB would see finer granularity than 4 KiB (512 words) rather than tying dirty levels to amatch registers.
If applications could specify their own encryption keys for certain segments, then more privileged rings could have the ability to read and write the encrypted data (e.g., for paging), but would not be able to know the plaintext, allowing them to keep their data secret. This allows mutually suspicious processes. The question is how to keep the encryption key from the more privileged rings, since it needs to be context switched by them. Perhaps this could be accomplished by calls to ring 7 (the TEE).
For SIAAs, should hotplug be a category rather than an attribute?
Implement fences with a basic block code rather than an instruction? This would be a variation of a fall-through basic block
Should look at OpenPOWER Technical Specifications. Also A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.
Think about CSRs that could be read and stored in the microarchitectural buffer indexed by bbc.
Need to change CHERI capability format to match their most recent thinking. Need to investigate CHERI PTE additions.
It is good to avoid CHERI tags for cliqued pointers but could give more values for such pointers by moving CHERI-128 word 0 tag to 250. But that would violate the rule that pointers have tags <240. How important is the <240 rule?
How necessary is a full 64‑bit address space? Would it be so bad to go back to the 61‑bit address space in earlier versions of SecureRISC? Then bits 63..61 would always be a ring number in sized pointers.
CHERI capabilities cannot live in XRs/SRs, because SRs are only 72 bits wide. Loading them into the XRs/SRs would probably cause an exception, which means that capability-aware memcpy would be required. Is this a problem?
Rename GC areas to arenas? Jemalloc uses the arena terminology.
Need to add inverse masking for vector operations.
Investigate HardBound: Architectural Support for Spatial Safety of the C Programming Language.
Should there be a way to mark segments or pages as requiring less aggressive microarchitectural tricks such as DMP? DMP seems to be important to performance, but exposes memory contents. A safe mode is not as helpful, as this is really a memory property (i.e., the memory stores secrets). Of course a processor should take care to not leak anything, not just secrets, but some might opt for performance over security given the choice. Note that SecureRISC tagging may help DMP be more secure, as one would only prefetch real pointers, not things that happen to look like pointers. Perhaps a PTE bit that says that the page contents may only be accessed non-speculatively?
Look at the proposed RISC‑V FENCE.T.
RISC‑V has non-temporal locality hints. SecureRISC has cache block instructions that suggest eviction candidates. One question is whether something better is possible. For example, should the programmer be able to specify the size of a stream. The hardware could use that for deciding whether non-temporal locality exists at each level of the cache hierarchy.
How should processors tell I/O devices (Initiators) to flush their cached VMT and AST entries?
Need to add instructions to indicate changes to VM Table, ASID Table, Segment Descriptor Tables, Page Tables, Region Descriptor Table, and Region Protection Table.
Using RISC‑V mnemonics as shorthand, here are some instructions that are yet to be fully explicated in this document (if not present, then some discussion may be appropriate):
- SFENCE.VMA
In addition, there are the RISC‑V Zba, Zbb, Zbs, Zbx, and Zbkb extension instructions that need to be explained as absent in SecureRISC:
- REV8
- BCLR
- BEXT
- BINV
- BSET
- SEXT
- ZEXT
- ORC
- XPERM
- PACK*
- REV
- ZIP
- UNZIP
Similarly, the following CSRs need equivalents in SecureRISC (never got around to it—pretty obvious):
- FCSR
An instruction not in RISC‑V (but should be) that SecureRISC needs to document is HALT.
Need to investigate the DBOS model.
SecureRISC does not have anything corresponding to RISC‑V’s mvendorid, marchid, mimpid, and mhartid CSRs. These may be added at some point. Equivalents to mepc, mcause, and mscratch are not required. SecureRISC does not currently have exception and interrupt delegation, and so has no corresponding equivalents to these RISC‑V features. There is also not an equivalent to mstatus, which is deliberate, as SecureRISC has other mechanisms for that register’s SIE, MIE, SPIE, MPIE, SPP, MPRV, and SUM fields. Whether and how to provide functionality similar to FS, VS, XS, MXR, TVM, TW, and TSR is TBD. A FS equivalent is probably not useful, but the VRs may want lazy switching (albeit done properly to avoid the LazyFP vulnerability). SecureRISC has not considered instruction emulation, but that could become an issue with running operating systems at low privilege levels, which might suggest something like MXR, TSR and TVM. TW might be handled more generally.
Add support for 512‑bit blocks in main memory without encryption by storing tags in ⅛ of the memory with a tag cache covering 8 cache blocks (i.e., 512 bits of tags). Fills into the L3 use the tag cache to supply the tags. Coherency transactions on the system interconnect (e.g., invalidates) need to also target this cache, which has the potential to create false sharing at the 512 byte level. Since this cache would only be written on L3 evictions, it could be write-thru (i.e., L3 eviction writes a single 64‑bit group of 8 tags to the tag portion of main memory). There are four sorts of main memory: 640‑bit (72‑bit words with 8‑bit ECC), 768‑bit (encrypted 72‑bit words), 576‑bit untagged (64‑bit data, no tags, 8‑bit ECC, no code, no pointers), and 576‑bit tagged (64‑bit data, 8‑bit ECC, tags stored separately).
Add ISA support for MPI (e.g., MPI_Send direct from the vector unit without first storing to local memory)?
Add 576‑bit (and 288‑bit for completeness) loads and stores to the SRs? This might be useful for message passing (writing a whole cache block without reading it first and for uncached transfers). VR load/store can already generate the system interconnect transactions, but it is harder to compose a message to be written in the VRs. Somewhat questionable. If wider loads and stores are added, should the names be derived from Latin roots (duword, quadword, octoword) or Greek roots (diword, tetraword, octoword)? One issue is that IEEE 754 binary128 floating-point, typically called quad precision is 128 bits, and quadword on SecureRISC would be 288 bits, which would be confusing. Or add a block transfer between SRs and VRs?
Should one ring be used for the dynamic linker? That would allow some data to be writeable by the linker, but read-only to the application. Do we need 16 rings instead of 8? SecureRISC currently pads most ring fields to provide for future expansion from 8 to 16 rings, but is this a waste? One place which is difficult to expand is the ring number in ARs, which is currently bits 63:61. Adding 1 bit would reduce the maximum segment size from 2⁶¹ to 2⁶⁰ bytes. Is this an issue? A downside to expansion is the number of RCSRs.
Another alternative is to not use one ring for supervisor device drivers, because those drivers are likely for a minimal set of virtualized devices provided by the hypervisor, rather than the wilder world of device drivers for actual devices.
A third alternative might be to implement half rings, where SDEs have extra precision for write brackets. PC.ring would have to be 4 bits (3 usual ring bits, 1 half-ring bit).
Sized pointers (tags 1..127 or 1..193) detect out of bounds, but not realloc-after-free errors. Cliqued pointers do detect most realloc-after-free if the allocator does the right thing. How can realloc-after-free detection be added to the former? (Sized pointers can detect use-after-free errors if the freed tags are set to 255, but would lose this after realloc.)
Should unsigned loads supply a 241 tag rather than 240?
Should exceptions be vectored beyond the existing same-ring less-privileged ring vectoring? Exceptions could be grouped into major and minor codes, with major codes adding to ExcHandler[ring] to implementing vectoring.
Add an instruction for calculating the least-significant 0 (LS0) or least-significant 1 (LS1) in a VM register. For example, searching nodes of a B+-tree (one sort of m-ary search tree) might use a m-element vector-scalar less-than, followed by LS0 to determine which subnode to descend, or the final value index to use.
Conditional branches and loop back do not need the number of targr bits in BB descriptors currently allocated, as few functions exceed 4 MiB in size. Some of these bits (5–6?) might be reused for other purposes for conditional branches (e.g., expanding the hint field). The existing ±4 MiB range of targr is intended for calls within the current compilation unit. Unconditional branches also do not generally need the full targr range, but do provide a convenient way to extend conditional branch range when necessary. Conversely, it might be desirable to allocate more targr bits for calls, since some code segments are quite large. Using the hint bits to extend targr bits for calls may be appropriate. Merge the BB descriptor next and hint fields so that only the next values that need hint bits use them?
The code size of SecureRISC is probably larger than non-block ISAs due to the 64‑bit size of the BB Descriptors. Since the BBDC is separate from the L1 instruction cache, this is probably not a performance issue, but it would still be good to find a way to reduce code size. Some way to pack two or four BBDs into 64 bits might help, but it would really help to have some statistics to drive this packing. One possibility is to use tag 253 for two 32‑bit descriptors packed into a word for density. Such descriptors would be used only within a function for branching within a bage. For two descriptors in a word, this might look like a 10‑bit offset, a 10‑bit targl, a 2‑bit next field, leaving room for a 10‑bit start field. It is also possible to use tags to extend instruction size slightly (e.g., not store all instructions with an integer 240 tag), but the code size benefits of this are unclear as it would take 16 tags to provide one bit to each of four possible 16‑bit instructions in a word (4 tags would add one bit to each 32‑bit instruction).
Add an indirect BBDC prefetch to fall-through type?
Use one of the reserved values of the BBD next field for BBD and/or Instruction prefetch?
Add L1 instruction cache prefetch bit to BB descriptor? Checked on BBDC fill for target descriptor and those after it. If set, L1 instruction cache fill initiated early. Compiler or profiling would set this bit for the common code path, but not for cold instructions.
Should there be 4 CARRY registers (c0 to c3)?
To get rid of per ring scratch registers for exception handlers, it may be necessary to use the return address stack to push at least one AR. This needs to be worked out. Rather than push, there could be a store to a reserved slot? Anyway, add some method for exception/interrupt handlers to save one or more ARs onto the CSP[PC.ring] stack.
Get rid of 16‑bit and 48‑bit instructions and only support 32‑bit and 64‑bit? To buy back the code size loss, go with register windows (fixed increment of 4 on ARs/XRs).
Is it worth incorporating CRISP critical instruction designation as a prefix, or as a bit?
Add details of loop count prediction to example microarchitecture. Loop count prediction might be TAGE-like, storing values 1..32, with ∞ being the no-hit default. Start with first-level (no history) being 128 entries, 2-way set associative, with 5‑bit prediction and 8‑bit tags of the BB descriptor address with c set? Should loop count prediction be generalized from the innermost loop to 2 or 3 levels of loop count prediction by making c file 2 bits? Should loop count prediction state be context switchable and saved/restored on exceptions? Predictor only stores values when predicting ∞ and the actual value from LOOPX execution does not arrive in time to prevent mispredict. Having c set in a BB descriptor initializes the loop count predictor with the BB descriptor address, count predictor TAGE hit level or miss, predicted count, iterations to 0, and a Boolean for LOOPX value received or not. Loop back basic blocks increment the iterations, and if equal to the predicted count, predict not take, else taken. Mispredicts will write the predictor if value not received. LOOPX execution sets the received bit and the prediction.
Should the microarchitecture use the targr/targl fields in the BBDC as a first-level indirect predictor? Seems natural. Tempting to try to write back to other levels of the cache hierarchy, but write permission would be required, so forget that.
Add SIMD operations on SRs? For example, two single-precision multiply adds, or four 16‑bit integer adds. If SR operations use the same functional units as the vector operations, these will naturally be possible. Actually, we need to decide whether 8, 16, and 32 bit data is packed in VRs as in RISC‑V, or stored in individual elements, as in the SRs.
I am leaning towards converting all floating-point values to a canonical form on loads to SRs/VRs (as is done on integer). The operations would still be specific to the precision for rounding purposes, but it would be possible to operate on different precisions (e.g., add or multiply) after loading by using the larger precision. Stores would do the opposite (without rounding—that was already done by the operation). Thus converts would not be necessary, though an instruction from a higher precision to lower precision would save a store/load. I already know of 16 floating-point formats ≤64 bits†. Conversion on load/store seems preferable than on reading and writing the register file. There is an implicit assumption here that packed SIMD will not be done in SRs/VRs, unlike, for example, RISC‑V vector registers. This is possible because the SecureRISC VRs are large enough, and the performance advantage of packed SIMD in the VRs isn’t as necessary given the outer product instructions. Not having packed SIMD also makes widening operations much easier to implement. While the common SR floating-point loads and stores might be 32‑bit instructions, others might be 64‑bit to specify the format. It is likely that only some of these formats may survive, but without knowing which those might be, a general scheme for support might be best. This would also allow an 80‑bit extended format to be implemented if necessary (stored in memory as IEEE 754 binary128 perhaps?).
† binary64, binary32, binary16, TF32, BF16, binary8p5, binary8p4, binary8p3, OCP E5M2 (similar to binary8p3), OCP E4M3 (similar to binary8p4), and OCP block format E8M0 with any of six formats (E5M2, E4M3, E2M3, E3M2, E2M1, and INT8)
I am leaning towards making all vector/matrix instructions be 64 bits.

Do floating-point 8‑bit, 16‑bit, 32‑bit loads and stores to SRs transfer bits or convert to a register format? Should FP scalar loads of 8, 16, and 32‑bit precision expand to 64‑bit precision? What about the same question for vector FP loads? For the same rounding to result, it would be necessary to specify the rounding precision in the instruction, but there would only be one FP computation format. Should SRs have 80 bits of data instead 64 bits of data for IEEE extended? This would also allow extra guard bits for integer DSP

. However, this would give different results from the vector unit, which would not do this expansion. Is there any point in ISA support for 80‑bit floating-point? If we consider widening SRs/VRs from 72 bits (64+8) to 88 bits (80+8) and supporting IEEE extended formats with conversion to the extended formats on loads, what would it look like? The trade-offs include register spills, prologue/epilogue save/restore, and context switch, which must save the extra bits.

Width	Name	Mnemonic	Format			Extended Format
Width	Name	Mnemonic	S	E	P	S	E	P	Total
128	binary128	Q	1	15	113	1	19	141	160
64	binary64	D	1	11	53	1	15	64	80
32	binary32	F	1	8	24	1	11	29	40
32	TF	-	1	8	11	1	8	11	19
16	binary16	H	1	5	11	1	8	12	20
16	bfloat	B	1	8	8	1	8	12	20
8	binary8p3	P3	1	5	3	1	5	5	10
	binary8p4	P4	1	4	4
	binary8p5	P5	1	3	5

Note that the extended format for bfloat does not meet IEEE754 rules, as those would require a 9‑bit exponent to hold a product of two bfloats.
Note that the extended format for P3 does not meet IEEE754 rules, as those would require a 6‑bit exponent to hold a product of two P3s.

Should calls push sp on the call stack and returns pop it? Unclean but useful. Maybe register windows would be better?
?Add a AOCB (Add One Compare and Branch) instruction that does
XR[d] ← XR[a] + 1
loop back if XR[d] ≠ XR[b]
that works for LOOPX prediction and loop iteration end tests?
This allows the XR specified in the d and a fields to be used in indexing. However, it may not reduce register pressure compared to the proposed SOB instruction, because the end test in XR[b] may not otherwise be required in the loop, and reducing register pressure would be the point of this instruction. Yes, this would be nicer in hardware as
loop if XR[a] ≠ XR[b]
XR[d] ← XR[a] + 1
at the cost of making software decrement the end test value.
Look at Apple Secure Enclave. The use of encryption to allow the Secure Enclave to use system memory is very attractive.
Read up on FDIP, BOLT, Propeller, Whisper, Ripple, Twig, and CRISP. Whisper is branch prefix hints for branches that are hard to predict using TAGE, and the main text includes one thought on how this might be incorporated. Ripple is L1 instruction cache eviction via an invalidate instruction, which SecureRISC has, but which may be appropriate to add with a BBD encoding. Twig is Profile-Guided BTB Prefetching using a brprefetch instruction, which SecureRISC may already do well with the BBDC and if needed the CPI instruction, but further investigation is warranted. Ripple and Twig may also superceded by Instruction Presending, which could be incorporated into a SecureRISC microarchitecture, or perhaps even in the BBD. CRISP allows OoO processors to prioritize certain instruction slices to get a head start on DRAM accesses by using prefixes to indicate criticality. How to represent this in SecureRISC is the question.
Figure out if there is a way to incorporate Whisper’s 4‑bit history length into BB Descriptors. Adding one unused bit to the hint field might do it, perhaps adding a new next field to specify a new type of conditional branch. The 15‑bit Boolean formula will be trickier. At some point might need to support 128‑bit descriptors using a new tag?
Look at SiFive WorldGuard. Programing ports with encryption keys to use for DMA is stronger protection on reads; without the correct key, ports would see random bits on reads, but writes would trash the destination, which is still an issue. However, does WorldGuard degenerate to 2‑level TrustZone when using standard buses such as CHI?
Tensilica Xtensa Register windows are rather useful for both performance of function-call intensive programs and for code size. SecureRISC does not implement this because it is unclear how it might work with its heterogeneous register files, each of which might want a different increment. One possibility would be to only implement an increment by a small, fixed amount (e.g., 4) for both the ARs and XRs, leaving the BRs/SRs/VRs/VMs/MAs unwindowed. This results in the compiler having to use a hybrid approach: getting 4 registers saved for free and having to potentially save/restore explicitly additional registers in function prologues and epilogues. Of course, it may end up saving more than required, which is potential performance hit. How might this work? The AR XRs register files would each architecturally have 32 registers before renaming with a 3‑bit WindowBase CSR to provide the base of the 16-entry window into these files. Thus, the architectural register for XR[a] would actually be XR[(WindowBase_4..2 + 0∥a_3..2)∥a_1..0]. Calls would increment WindowBase by 4, in effect saving 4 of the caller’s registers in each register file (8 total): a0-a3 and x0-x3 and creating fresh a12-a15 and x12-x15 for the callee. Unlike Xtensa, SecureRISC should do overflow and underflow without using exceptions. This definition allows five call levels in 32 architectural registers of each register file. For example, when all 32 architectural registers are live, the processor begins saving four ARs and four XRs to the stack, reducing the number live to 28 in anticipation of the next call. Similarly, if the number live drops to 20, the processor begins loading registers in anticipation of the next return. The reason to target 32 architectural registers rather than 64 is to keep the post-renaming register files smaller, since these files need to hold all architectural registers plus registers for all registers being written by instructions scheduled in hardware. Thus, this is more of a code size feature than performance feature because there is relatively little hysteresis to significantly reduce saves and restores.
Under this scenario, sp would become a parameter to the called function (a0?), similar to Xtensa, and this would eliminate the sp save/restore needed for deallocating SecureRISC stack frames.
Would it work to have a register window increment of only 2 providing nine levels of call? This would save only a0 (i.e., sp), a1, x0, and x1, which means the compiler would be using explicit saves most of the time, but less than without the feature. In particular, this makes the sp save/restore automatic.
In a register windows version of SecureRISC, is the Xtensa WindowStart feature useful to get an extra level or two if the callee never references high-numbered registers? It would be nice to have statistics.
Could a register windows option provide separate increments for ARs and XRs?
Could use multiple tags to indicate the location of instructions in instruction words either in addition or instead of the start mask. Is this useful?
Should there be instructions that load sub-word values into the most significant bits rather than the least-significant bits of registers? Should there be separate floating-point loads?
Some algorithms (e.g., matrix multiply) would like to use a VR element as a scalar in a vector-scalar (VS) instruction (i.e., VR[b][imm]). Can enough 7+ bits for the immediate be found to specify this?
Should the RLA64 and RLA32 instructions be changed to a two-instruction sequence, e.g., LX64I to load the value and a new SEGREL instruction that combines that with the segment start address from the AR[a] base register? The transformation is only a 2:1 mux of the effective address and a mask generated from the segment size (along with a bounds check), which is less timing-critical than a LX8S for example. The better question might be what size should RLA64 write into AR[d]? The segment size is the only obvious possibility, so this instruction might typically be followed by an instruction to set a narrower size.
Add polymorphic instructions for Julia and Lisp to do the right thing with various integer and floating-point types? For example, PADDOX might add things with tags 240 or 241, and detect signed or unsigned overflow as appropriate. PADDS might handle single-precision and double depending on the tag.
Perhaps a 56‑bit immediate in the 64‑bit format is excessive. Make it 48 bits instead? Or is the 40‑bit format in 48‑bit instructions good enough? Is the 64‑bit format justified?
Should indexed load/store also add an immediate for effective address calculation?
Should the register file valid bits be replaced by a special tag value? This could then exist in memory. It might also represent the undefined value sometimes present in languages.
Should the co compare operation schema add bit-set and bit-clear tests? So for example, bbs a, b would be equivalent to srl t, a, b and banyi t, 1.
Should SecureRISC expand 64‑bit XR/SR load instructions to have both signed and unsigned forms, supplying tag 240 for signed and 241 or unsigned? Should L{X,S}{8,16,32}U{,I} supply tag 241? This would be convenient for Julia; will it cause any trouble elsewhere?
Should COUNTSIGN, COUNTMS0, and COUNTMS1 instead be magnitude calculations (MAGS and MAGU), i.e., how many bits are required to hold the value? Is it more useful to know x or 64−x?
Instead of having add and subtract with overflow trapping, perhaps there should be comparisons that indicate overflow, and these could be used in parallel with the arithmetic operation to trap on overflow using b0 as a destination? It is more instructions but has the advantage of being able to branch on overflow rather than just trap.
Should system virtual addresses (svaddrs) and system interconnect addresses (siaddrs) be 72 bits instead of 64 bits? The system interconnect fabric will often be this wide for data, and if data and addresses share the same wires, this would be natural. On the other hand, typically these days data and address are on separate wires. Seventy-two-bit addresses supports larger systems, but it makes it more difficult to generate these addresses in software to send to I/O devices since software has to use the tag field, and some tag values are reserved. It is fairly easy to use for processors using lvaddr to svaddr translation, as it just means providing a larger svaddr field in the SDE or PTE. There are not many extra bits in the PTE however, at least for the 4 KiB page size. At this point, SecureRISC specifies 64‑bit svaddrs and siaddrs, but the potential of 72 exists for future consideration. An easier way to extend svaddrs and siaddrs by 3 just bits (from 64 to 67 bits by using the ring field) is to use tags 184–191 (unsized pointers) for SDEs, PTEs, RDEs, and SPTEs instead of tag 240 (integer) for these data structures. This could be slightly tricky for software because the ring field is calculated separately by some instructions, but instructions for this purpose might be added to make this easier. If the physical address space needs to be extended only a little, this seems like a somewhat plausible way to do it. It seems most useful for svaddrs (since many svaddrs might address the same siaddr with different protection), but if done, it might as well be done in both. An alternative is to use tags 176..183 (reserved pointers) or tags in the 208..223 space (moving segment relative pointers), which might allow for expansion to 68 bits. (The Symbolics I-machine had a tag for physical address, even if it was only one value.)
Should there be a tag for leader words?
Should there be tags for forwarding pointers as on the Symbolics I-machine?
The header/trailer size words for blocks may be useful to memory allocators. One question is whether bits 2..0 (or 3..0) must be 0, or whether these bits should be ignored, to make them available to the allocator. Some could be ignored, and others trapped on (e.g., a free bit). It is also possible to provide a different tag to be used by allocators for freed blocks, or perhaps tag 255 would be used for this purpose. This would cause references to the header word of a freed block to trap, detecting the read-after-free error. It would not help with write-after-free without some further feature.
When does the size field of ARs get set when a pointer with tags 176–184 is loaded? Is there a LSIZE instruction, which compilers will only use in some cases?
Can the size word (normally tag 250 at pointer − 8 for a tag of 176–183) be used with a different tag for Lisp arrays that grow by giving a pointer to the new array? This could be implemented as a special instruction following LSIZE that either copies AR[a] or loads from the address in AR[a] depending on the tag.
My current thought is the hypervisor environment does not need to be identical to the supervisor environment, despite what most hypervisors do. Most machines start out with a supervisor environment and then virtualize that but retain the original supervisor environment for backward compatibility. SecureRISC could instead require the hypervisor from the start so that even a non-VM system would have both a hypervisor (very minimal) and a supervisor (somewhat like RISC‑V M-mode and OpenSBI being the basis for supervisors). This seems to require less duplication to me, and it is the direction I am headed, but is there a problem I am missing here?
Could generalize the valid bit idea to a Mandatory Access Control set so that levels and categories flow with data? This increases context switch cost a fair bit, so it is only appropriate if it is really useful. And the valid bit stuff is uncertain to be retained anyway.
How should the last BB of a bage handle fall-through? Add a bit to indicate that fall-through is the first BB of the following bage? When the even/odd page layout suggestion is used, this is unnecessary on the odd pages and doesn’t work on the even pages.
Should instruction words have their own tag, or just an integer tag? If their own tag, do various load instructions work as if an integer tag is present? The only purpose would be to check the offset field of BB descriptors, which is not terribly important.
Do Segment Descriptor Entries need Accessed and Dirty bits?
Is this sized pointer stuff silly? Should the compiler just do the right thing by always passing the size to check as a separate word? Are there enough XRs to make this work?
Should there be an attribute and permission lookup on the high 16 bits of siaddrs as a third-level to replace some region descriptor stuff? This would be akin to RISC‑V PMAs and PMPs. It could be accomplished in parallel with second-level page table walks, so the performance would be acceptable, except for direct-mapped regions. (Direct-mapped regions might want a specialized cache to reduce latency of this.) If the SIA and protection information could be kept to 8 or 16 bits, the lookup table would be only 64 KiB or 128 KiB.

Tag Summary

Tag	Use
0	Nil/Null pointer
1..31	Sized pointers exact
32..127	Sized pointers inexact
128..191	Reserved (possible sized pointer extension)
192..199	Unsized pointers with ring
200..207	Reserved (possible unsized pointers with ring)
208..215	Code pointers with ring
216..220	Reserved
221	Pointer to blocks with header/trailer sizes
222	Cliqued pointer in AR
223	Segment Relative pointers
224	Lisp CONS
225	Lisp Function
226	Lisp Symbol
227	Lisp/Julia Structure
228	Lisp Array
229	Lisp Vector
230	Lisp String
231	Lisp Bit-vector
232	CHERI-128 capability word 0
233	Reserved
234	Lisp Ratio, Julia Rational
235	Lisp/Julia Complex
236	Bigfloat
237	Bignum
238	128‑bit integer
239	128‑bit unsigned integer
240	64‑bit integer
241	64‑bit unsigned integer
242	Small integer types
243	Reserved
244	Double-precision floating-point
245	8, 16, and 32‑bit floating-point
246..249	Reserved
250	Size header/trailer words
251	CHERI capability word 1. Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities)
252	Basic Block Descriptor
253	Reserved for packed Basic Block Descriptors
254	Trap on load or BBD fetch (breakpoint)
255	Trap on load or store

Glossaries of Terminology Used

General Glossary

ABI

An acronym for Application Binary Interface

, which is a set of software conventions for compilers, operating systems, and programmers that defines how code should use the processor architecture to allow interoperability of different components. It typically covers register usage, data type alignment, calling convention, etc.

Access Control

In computer systems, Access Control

determines whether a subject (e.g., users, processes) should be granted or denied access to an object. Subjects are the who of access control and are represented by an identifier after authentication. Objects are the what of access control and typically include data, files, peripherals, etc. of the computer system. The decision to grant or deny access is governed by the policies of the access control system as configured by its authorization mechanisms. Access control policies may follow models such as Discretionary Access Control

(e.g., via ACLs

), or Mandatory Access Control

, or a combination of both. Access rights are typically refined by the type of access, such as read, write, execute, and other specific permissions.

ACL

An acronym for Access Control List

, a form of Discretionary Access Control

ACPI

An acronym for Advanced Configuration and Power Interface

, is a standard from the UEFI Forum

for device configuration and power management. See also the ACPI at OSDev Wiki

. ACPI defines system states, device states, CPU states, and performance states. System states include working, sleeping, and off, with sleeping states are further divided, and include low power idle, standby and hibernation. CPU states include halt, stop-clock, and sleep. Performance states represent power and frequency scaling, with P0 being maximum power and frequency, and higher numbers being lower power and frequency. ACPI also defines tables that describe system hardware and for communication between the operating system and firmware (e.g. UEFI

). Linux systems tend to use devicetree

instead of the ACPI tables.

Acquire and Release

The paper Memory consistency and event ordering in scalable shared-memory multiprocessors

by Gharachorloo et al. created a taxonomy of shared writeable memory accesses, first dividing into competing and non-competing, then subdividing competing into synchronization and non-synchronization, and finally subdividing synchronization into Acquire and Release specifying that Acquire accesses (typically loads) are performed to gain access to a set of shared locations, and Release accesses (typically stores) are performed to grant access to sets of locations. The paper went on to introduce release consistency

API

An acronym for Application Programming Interface

, which is an interface specification for programming.

ASID

In many ISAs, ASID is an acronym for Address Space Identifier, which is a small integer used to reduce the frequency of translation cache flushing. In SecureRISC this is instead called a Translation Cache Tag (TCAT), and ASID is instead an index into the Address Space Table. For implementations with hypervisors implementing Virtual Machines, a similar tag (e.g., the Virtual Machine Identifier or VMID in RISC‑V) extends the ASID concept for translation cache management; this is called the Translation Cache Protection Tag (TCPT) in SecureRISC.

ASLR

An acronym for Address Space Layout Randomization

, which is an operating system security feature to make exploiting vulnerabilities more difficult by varying the placement of code and data in the address space.

Atomic Operations

In an ISA, atomic operations are a series of operations on memory locations that are indivisible, such that they all occur without other intervening operations on the memory location, or none occur (if the atomic primitive allows for failure). As an example, a memory increment involves the series of operations: read the memory location, increment the value read, write the memory location. An atomic increment prevents parallel entities from interleaving these operations (e.g., A load, A increment, B load, A store, B increment, B store) such that the memory location is increased by only 1 rather than 2. Atomic operations may be performed in the processor caches by preventing cache coherence operations from interrupting the sequence, or for non-cached locations may be performed in the memory controller by sending the operands required there. One of the most important atomic operations is Compare-And-Swap (CAS)

BB

An acronym for Basic Block

, which is a sequence of instructions without potential transfers of control in or out (except due to exceptions).

Bell-LaPadula Model

Bell-LaPadula

is a computer security access control policy to enforce data confidentiality in a system with Multilevel security

. It is the inverse of the Biba Integrity model

for integrity. In Bell-LaPadula, data and subjects are ordered by levels of confidentiality and subjects may not read data with higher confidentiality than their own level, and may not write data at lower confidentiality (with a strong star variant where subjects may only write objects at their own level).

BF16

An abbreviation of bfloat16

, which was originally a storage format introduced by Google for AI algorithms. It consists of simply throwing away (round toward 0) the low 16 bits of IEEE 754 binary32 (single precision) floating-point format to save space and padding with 16 zero bits when converting back to single precision. It has evolved from a storage format into a computation format, and now many implementations use round-to-nearest-even when converting from single precision to BF16 and when computing with BF16. It is second of two 16‑bit floating-point formats, the other being FP16, which was originally invented for graphics and standardized by IEEE in 2008.

Biba Integrity Model

Biba

a computer security access control policy to enforce data integrity. It is the inverse of the Bell-LaPadula model

for confidentiality. In Biba, data and subjects are ordered by levels of integrity and subjects may not write data with higher integrity than their own level and cannot read data of lower integrity.

Binary Prefix

Binary Prefixes are unit prefixes denoting powers of 1024 rather than 1000, and are typically used to modify B for bytes or b for bits, as in 4 KiB for 4,096 bytes, or 2 MiB for 2,097,152 bytes.

Prefix	Value
Ki	1024	2¹⁰	1,024
Mi	1024²	2²⁰	1,048,576
Gi	1024³	2³⁰	1,073,741,824
Ti	1024⁴	2⁴⁰	1,099,511,627,776
Pi	1024⁵	2⁵⁰	1,125,899,906,842,624
Ei	1024⁶	2⁶⁰	1,152,921,504,606,846,976
Zi	1024⁷	2⁷⁰	1,180,591,620,717,411,303,424
Yi	1024⁸	2⁸⁰	1,208,925,819,614,629,174,706,176

BIST

An acronym for Built-in self-test

BLAS

An acronym for Basic Linear Algebra Subprograms

, which is an API for linear algebra operations such as:

vector-scalar multiplication (scal)
dot products (dot)
linear combination of vectors (axpy)
matrix-vector product (gemv)
and matrix multiplication (gemm)

RISC‑V Glossary

Other acronyms and terminology may be less familiar as they come from RISC‑V. This section lists a few of the acronyms and terms unique to that ISA.

To quote from RISC‑V International’s About RISC‑V: RISC-V is an open standard Instruction Set Architecture (ISA) enabling a new era of processor innovation through open collaboration.

The official RISC‑V ISA specifications may be downloaded from RISC‑V specifications while working versions may be found at the GitHub RISC‑V ISA Manual repository.
The two primary specifications are:

A few RISC‑V security related (mostly draft) specifications are as follows:

Machine Mode

While RISC‑V can be thought of having two rings for most purposes (Supervisor mode and User mode), there is a privilege level higher than Supervisor mode called Machine Mode, that originally appeared to be for functionality similar to Alpha PALcode

. However, as with most systems that oversimplify privilege levels, pressue to provide more levels has resulted in more functionality being added to Machine Mode over time.

NAPOT

An acronym for Naturally Aligned Power-of-2

(Chapter “Svnapot” Standard Extension for NAPOT Translation Contiguity)
NAPOT refers to things of size 2^N bytes being aligned to that size (i.e., the address is a multiple of 2^N, viz. the least significant N bits of the address are zero). In RISC‑V NAPOT sizes are represented with repeated 1s (e.g., in PMP entries) or repeated 0s (e.g., in PTEs) in the lower address bits, then the opposite bit, and then from that point actual address bits, which allows the address and size to encoded in 1 bit more than the address alone. (Note: SecureRISC in the early days had separate address and size fields but changed to adopt this clever bit-saving idea.)

PBMT

An acronym for Page-Based Memory Types defined in the Svpbmt extension in the RISC‑V Privileged ISA (Chapter 14).

PMA

An acronym for Physical Memory Attributes

, which are attributes used by processors for regions of the system’s physical address space.
Example RISC‑V Physical Memory Attributes are as follows (SecureRISC PMAs are likely to be simpler, except for the addition of tagging):

Memory type	Vacant, Main memory, I/O
Read access width	Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit reads supported.
Write access width	Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit writes supported.
Execute access width	Subset of 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit instruction fetch supported.
Atomic Memory Swap (AMOSwap) width	Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOSwaps supported.
Atomic Memory Logical (AMOLogical) width	Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOLogicals supported.
Atomic Memory Arithmetic (AMOArithmetic) width	Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOArithmetics supported.
Page Table Reads	Supported or not.
Page Table Writes	Supported or not.
LR/SC support level	None, NonEventual, Eventual.
Coherence	Not coherent or coherence channel number
Cacheability	Yes or No
Idempotency	Whether reads and writes have side effects.

PMP

An acronym for Physical Memory Protection

, which is a mechanism for providing Read, Write, and Execute permission for ranges of physical memory addresses independent of the translation mechanism, and which is logically anded with those permissions.

RISC-V Worlds

WorldGuard is technology developed by SiFive

for RISC‑V, proposed for standardization

, by RISC‑V International as RISC‑V Worlds. It provides isolation by constraining access to system physical addresses based upon identifiers (called Worlds) assigned to system interconnect ports (e.g., processors and devices) that are checked in a distributed fashion by resources (such as memories and peripheral devices) or checkers located before such resources. Worlds are created and configured by a Trusted Execution Environment (TEE), usually at system boot time. A two-world simplification of WorldGuard could be used to provide similar functionality to ARM’s TrustZone.

Sv39

RISC‑V has Sv39, Sv48, and Sv57 virtual to physical address translation modes, specified in the MODE field of the satp CSR, where the name specifies the number of virtual address bits supported as a signed integer (e.g., Sv39 supports virtual addresses in the range −2³⁸ to 2³⁸−1).

WARL

An acronym for Write Any Values, Reads Legal Values, which is a specification of a register field that allows processors to implement a subset of the functionality described in a CSR (e.g., by hardwiring certain bits to fixed values, or by allowing some values to be written but not others). If an unsupported value is written, the implementation substitutes some legal value.

WPRI

An acronym for Writes Preserve values, Reads Ignore Values. The RISC‑V privileged specification states, Software should ignore the values read from these fields, and should preserve the values held in these fields when writing values to other fields of the same register. For forward compatibility, implementations that do not furnish these fields must make them read-only zero.

ARM Glossary

A very small number of acronyms come primarily from the ARM Architecture.. This author has little experience with ARM, and so this is a very small set. At the moment this set is mostly related to cache maintenance, and the instructions that writeback or invalidate cache blocks. This terminology is related to the point in the cache hierarchy that writebacks target, or the point to invalidate to, such as:

Point of Coherency (PoC)
Point of Physical Aliasing (PoPA)
Point of Encryption (PoE)
Point of Unification (PoU)
Point of Persistence (PoP)
Point of Deep Persistence (PoDP)

The reader is referred to Arm® Architecture Reference Manual for A-profile architecture section About cache maintenance in AArch64 state.

TrustZone is another ARM term that is relevant in this document.

IPA

An acronym for Intermediate Physical Address (IPA), which in second-level address translation is what SecureRISC calls a System Virtual Address (svaddr). In ARM two-stage address translation:

Stage 1 provides an Intermediate Physical Address (IPA).
Stage 2 provides the Physical Address (PA).

PoC

An acronym for Point of Coherence (PoC), which is a point at which all agents that can access memory are guaranteed to see the same copy of a memory location.

PoDP

An acronym for Point of Deep Persistence (PoDP), which is a point in the memory system where the data is preserved even when the power and the back up battery fail simultaneously.

PoE

An acronym for Point of Encryption (PoE), which is the point in a memory system where any writes that have reached that point are encrypted.

PoP

An acronym for Point of Persistence (PoP), which is a potential point in a memory system at or beyond the Point of Coherency, where a write to memory is maintained when system power is removed, and reliably recovered when power is restored to the affected locations in memory.

PoPA

An acronym for Point of Physical Aliasing (PoPA), which is a point where updates to a location in one Physical Address Space are visible to all other Physical Address Spaces.

PoS

An acronym for Point of Serialization (PoS), which is a point within the interconnect where the ordering between requests from different agents is determined.

PoU

An acronym for Point of Unification

, which is the place in the cache hierarchy for a processor where instruction, data, and page table accesses are guaranteed to see the same copy of a memory location.

TrustZone

TrustZone provides two Security states: Secure and Non-secure. Non-secure is used to implement the Normal software stack, which has all the complexity of today’s system software, from hypervisor, supervisor, and application code. In contrast, Secure is used to implement a simpler Trusted software stack, a Trusted Execution Environment (TEE). One technique used for Secure state is a separate physical address space (e.g., by adding a Non-secure bit to the address). See TrustZone for Armv8-A

. SiFive’s WorldGuard

proposal provides similar functionality.

x86 Glossary

The smallest number of acronyms come from the x86-64 Architecture.. This author has little experience with x86-64, and so this is a very small set.

Negative Rings

Intel gave the 80286 architecture 4 rings numbered 0–3 with ring 0 being the most privileged. With the 80386, only rings 0 and 3 were typically used as segment registers were rendered obsolete by the flat address space. Intel later added more privileged modes of execution such as SMM

and VT-x (see below) in the i386 SL and some Pentium 4 models respectively. These other modes are sometimes described as negative rings (although the ring number register is not used).

−1	Hypervisor (Intel VT-x or AMD-V
−2	System Management Mode (SMM)
−3	Intel Management Engine (ME)

Segmentation

The Intel 8086 had segment registers for the purpose of extending the address space from 16 bits to 20 bits via a 16‑bit segment register that was shifted 4 bits and added to the 16‑bit effective address. This had little relation to the concept of segments as found in Burroughs and Multics CPUs, despite the name. The subsequent 80286 processor added a bounds check and ring-based access control. The 80386 took the x86 ISA from 16 bits to 32 bits, and introduced a flat 32‑bit address space with virtual memory. While segmentation continued to exist, it was not much used by software in the flat memory model (it was possible to use segment registers but they were typically set to 0), and the finer granularity of the rings (beyond a simple kernel/user distinction) provided by segmentation was largely unused by common operating systems. Furthermore, Page Table Entries (PTEs) contained only a single User/Supervisor (U/S) bit, not per-ring access control. With the 64-bit x86-64 ISA, most of the segment registers, while present, were not even included in address calculations. Thus the x86 never had Burroughs or Multics-style segmentation, but it did serve to give segmentation a bad reputation.

Software Guard Extensions (SGX)

Intel Software Guard Extensions (SGX)

is a Trusted Execution Environment (TEE) technology introduced by Intel to enhance the security of application code and data. It allows developers to create enclaves, which are protected, private regions of memory within an application's address space. Code and data inside an enclave are protected from disclosure or modification by software outside the enclave, including the operating system, hypervisor, System Management Mode (SMM), and other applications. SGX aims to ensure confidentiality and integrity for sensitive computations by leveraging hardware-based cryptographic protections, even if the privileged software layers are compromised. Despite its goals, SGX has been attacked.

Supervisor Mode Access Prevention (SMAP)

Intel Supervisor Mode Access Prevention

is an Intel x86-64 CPU security feature that restricts kernel (supervisor mode, Ring 0) access to user-mode (Ring 3) memory pages. When SMAP is enabled, the CPU will automatically generate a page fault if code running in supervisor mode attempts to fetch instructions from or read/write data to a page marked as user-accessible in the page tables. This protection helps mitigate a class of security vulnerabilities where a compromised kernel might be tricked into executing malicious code or accessing sensitive data planted by an attacker in user space.

System Management Mode (SMM)

Intel System Management Mode (SMM)

is a special-purpose operating mode in Intel x86 processors, conceptualized as Ring -2 due to its privilege level being even higher than the operating system's kernel (Ring 0). Entered via a System Management Interrupt (SMI), SMM suspends all normal CPU activity to execute firmware-level code, typically residing in a protected memory region (SMRAM) invisible to the OS. This mode is used by BIOS/UEFI for critical low-level system management functions such as power management, hardware error handling, and security features, features, often interacting with the operating system through the Advanced Configuration and Power Interface (ACPI

). SMM provides a transparent and isolated environment for platform-specific tasks.

Trusted Execution Technology (TXT)

Intel Trusted Execution Technology (TXT)

is an Intel hardware-based security technology designed to establish a trusted launch or measured launch of a computing platform. It serves as a Root of Trust (RoT), leveraging cryptographic principles and working in conjunction with a Trusted Platform Module (TPM) to measure the integrity of critical components, including firmware, BIOS, and operating system loaders, before they are executed. By verifying that the boot process has not been tampered with, TXT aims to create a trustworthy environment for sensitive workloads and can provide evidence of the platform's integrity (attestation) to remote parties.

Programming Language Glossary

Java

Julia

Lisp

See

Pastel

Pastel, A Colorful Systems Programming Language for Amber (or an off color Pascal), was the language created by the Amber Operating System

group of the LLNL S-1 project. Pastel, while based upon Pascal

, had many language extensions for systems programming. The language and compiler were created by Jeffrey Broughton. It was also used by the early MIPS

compiler group for writing the MIPS compiler and by Richard Stallman

as the template for GCC

. A small number of the Pastel extensions to the Pascal language ended up in MIPS Pascal. There are unfortunately no known surviving copies of the Pastel compiler.

Pastel

Python

Rust

Swift

Operating System Glossary

Amber

Amber was an operating system written for the LLNL S-1 processor in the early 1980s (these were the days when each new ISA had its own OS and when Unix was too immature to have become a de facto standard). The LLNL S-1 had segmented virtual memory and Amber intended to emulate many of the features of Multics

. Amber was written in Pastel

, a highly extended version of Pascal, created by Jeffrey Broughton of the Amber team for the purpose of writing the OS.

The Amber Operating System, by Charles Frankston
Amber: LLNL technical report UCID-21567, 1988, by Jay A. Pattin
Amber Kernel Specification, S-1 Project," LLNL Technical Report UCID-20378, 1985

CheriBSD

CheriBSD

DBOS

ITS

The Incompatible Timesharing System (ITS) was the MIT AI Lab and Project MAC operating system for their PDP-10s

Linux

Multics

Multics
MIT ILCS/TR-196 Final Report of the Multics Kernel Design Project,
by Schroeder, Clark, Saltzer & Wells
MAC TR-104 Cooperation of Mutually Suspicious Subsystems In A Computer Utility,
by Michael D. Schroeder
Multicians.org
- Introduction and Overview of the Multics System,
  by F. J. Corbató and V. A. Vyssotsky
- The Multics Virtual Memory: Concepts and Design,
  by A. Bensoussan, C.T. Clingen, and R.C. Daley
Multics Wiki

Redox

Architecture Glossary

There is much more on Computer Architecture History at Selected Historical Computer Designs by Professor Mark Smotherman.

Alpha

ARM

Cray-1

HP Labs Playdoh

HPL-PD Architecture Specification: Version 1.1

IA-64

MIPS

PA-RISC

PA-RISC

PDP-10

POWER

PowerPC

RISC-V

SPARC

Symbolics Lisp Machine

Symbolics

was one of several spin-offs from the MIT AI Laboratory Lisp Machine Project that designed and built Lisp Machines

. Symbolics initially productized the MIT Lisp Machine as their LM-2 product. This was followed by the 3600 and finally the Ivory family in 2 µ CMOS. The MIT and Symbolics Lisp Machines were were stack machine

architectures.

I-Machine Architecture Specification

x86-64

x86-64
x86 Glossary

Xtensa

Xtensa is the architecture created by Tensilica

for their configurable and extensible semiconductor IP products. It featured a base instruction set designed for code density (24‑bit and 16‑bit instruction formats) with register windows to save on register save/restore sequences in function prologs and epilogs. VLIW instructions could be added to the base instruction set for applications such as Digital Signal Processing

(DSP).

ZS-1

The Astronautics ZS-1 had an interesting latency tolerance architecture before OoO

Crypto Glossary

Other terminology and acronyms are associated with cryptography and are summarized below. The reader should return here if encountering unfamiliar cryptographic terminology.

AEAD

An acronym for the Authenticated encryption with associated data

, which are methods that provide both encryption and authentication.

AES

An acronym for the Advanced Encryption Standard

, which is a block cipher

for 128‑bit blocks with three key sizes: 128, 192, and 256 bits. See FIPS 197

for details.

AES128

AES

with a 128‑bit key.

AES256

AES

with a 256‑bit key.

Block Cipher

A block cipher

is a pair of algorithms (encryption and its inverse decryption) that operate on two inputs, a fixed-length key and fixed-length data (the block). The key length need not be the same as the data length. The input to encryption is called plaintext, and the output is called ciphertext. Decryption with the same key takes ciphertext and produces the original plaintext. This is represented as follows for encryption algorithm E, decryption algorithm D, plaintext P, ciphertext C, key K:
C ← E(K, P)
P ← D(K, C)
or sometimes written as
C ← E_K(P)
P ← D_K(C)
Thus the following property holds:
∀P: D_K(E_K(P)) = P

Carter-Wegman MAC

A Message Authentication Code

based on an encrypted hash function using a separate key from the encrypted data. Carter-Wegman MACs are sometimes abbreviated as CWMAC in this document.

Counter Mode

Encryption by xor with successive applications of a block cipher on a counter. Counter mode is sometimes abbreviated as CTR in encryption mode specifications (e.g., AES128-CTR).

Feistel Cipher

A Feistel Cipher

is a structure used in the construction of symmetric block ciphers.

GCM

An acronym for Galois/Counter Mode

, which is a mode of operation for authenticated encryption with associated data (AEAD)

encryption using a block cipher in counter mode with a Galois Message Authentication Code (GMAC)

of the authenticated data, ciphertext, authenticated data length, and ciphertext length. See NIST Special Publication 800-38D

for details.

MAC

An acronym for the Message authentication code

, which is a fixed-length bit stream used to authenticate other data. MACs prevent data tampering by checking data integrity while being designed to resist forgery.

ML-DSA

See Lattice-based cryptography

and FIPS 204 Module-Lattice-Based Digital Signature Standard

ML-KEM

See Key encapsulation mechanism

, Lattice-based cryptography

, and FIPS 203 Module-Lattice-Based Key-Encapsulation Mechanism Standard

Nonce

A nonce

is a bit string that is only used once, and which prevents data from being encrypted or authenticated twice, e.g., to prevent replay attacks.

PQC

An acronym for Post-Quantum Cryptography

. See NIST IR 8547 ipd Transition to Post-Quantum Cryptography Standards

for further information.

S-Box

A S-box

(or substitution box) is a function of input bits used in symmetric ciphers designed to resist certain attacks, such as linear and differential cryptanalysis.

SHA-3

See SHA-3

and FIPS 202 SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions

Simon

Simon

is a family of block ciphers for various block and key sizes. It was proposed as a lightweight (e.g., compared to AES) block cipher optimized for hardware. The Simonm/n family employs a Feistel

structure to encrypt m‑bit blocks with a n‑bit key.

Tweakable Block Cipher

A tweakable block cipher is a block cipher

whose encryption and decryption algorithms take an additional tweak input t as shown below for tweakable algorithm (E,D), key K, plaintext P, ciphertext C, and tweak t:
Encryption: C ← E(K, P, t)
Decryption: P ← D(K, C, t)
Property: ∀P: D(K, E(K, P, t), t) = P
Unlike keys, tweaks are typically publicly known (e.g., to adversaries). Tweakable block ciphers are used in applications where encryption must not expand the data as occurs in authenticated encryption when a MAC is appended. The tweak is often the location of the data, and ensures that even if the same data is encrypted twice in different locations, the resulting ciphertext is different.

XEX

An acronym for Xor-Encrypt-Xor

, which is a tweakable mode of operation a block cipher

typically for Data at rest protection

, for example in XTS-AES.

XTS-AES

XTS-AES is one use of XEX

and AES

defined in IEEE Std 1619-2007

for cryptographic protection of data stored in constant length blocks. The encryption of the 𝑗^th 128 bits of a block with tweak 𝑖 is as follows where the block length is a multiple of 128 bits (i.e., the following does not cover ciphertext stealing):
T_𝑗 ← AESenc(Key₂, 𝑖) ⊗ 𝑥^𝑗
C_𝑗 ← AESenc(Key₁, P_𝑗 ⊕ T_𝑗) ⊕ T_𝑗
and decryption is the obvious inverse:
T_𝑗 ← AESenc(Key₂, 𝑖) ⊗ 𝑥^𝑗
P_𝑗 ← AESdec(Key₁, C_𝑗 ⊕ T_𝑗) ⊕ T_𝑗
where:

⊕	Bitwise xor
⊗	Multiplication of two polynomials over the binary field GF(2) mod 𝑥¹²⁸+𝑥⁷+𝑥²+𝑥+1
Key	Is a two-part encryption key, consisting of Key₁ and Key₂ where Key = Key₁∥Key₂. For AES-128, Key would be 256 bits, and for AES-256 it would be 512 bits.
𝑖	is the value of 128-bit tweak
P_𝑗	is the 𝑗^th block of 128 bits (the plaintext)
C_𝑗	is the 𝑗^th block 128 bits of ciphertext for P_𝑗

Vulnerability Glossary

There are so many security vulnerabilities that instruction set and microarchitecture design should be aware of that these are now separated from the conventional glossary above.

Many computer vulnerabilities stem from insufficient tagging in microarchitectural structures used for speculation, such as address prediction. Attackers exploit these weaknesses to manipulate predictions and leak privileged information. Several design features can help mitigate these vulnerabilities:

Incorporate privilege level in tags. This ensures that predictions from less privileged levels cannot influence higher privilege levels. This is particularly important in control-flow predictors (e.g., BTB, RSB, indirect jump, branch), memory disambiguation, prefetch predictors, and other speculative structures.
For virtually tagged structures, include virtual address bits that distinguish privilege levels in the unhashed portion of tags whenever feasible. Additionally, incorporating an address space identifier (ASID) in virtual tags helps prevent cross-context attacks at the same privilege level.
Virtually tagged structures that lack ASIDs support should be flushed when the ASID changes to prevent leakage across address spaces. However, to minimize performance overhead, selective invalidation techniques (e.g., tagged invalidation) can be explored instead of full structure flushes.
ASIDs may also be necessary in structures using physical address tags for segments that are shared between different subjects at the same privilege level where a predictor could be trained by one subject to attack processes running for another subject.

Augury: Augury is a security vulnerability resulting from a data memory-dependent prefetcher (DMP) interpreting memory contents as a pointer and prefetching the blocks addressed. A newer vulnerability resulting from DMP is GoFetch.
Cacheout: Cacheout is a vulnerability of Intel processors where cache blocks evicted from the L1 data cache are store in the Line Fill Buffer (LFB), and loads search and can bypass from this buffer. This can be exploited by the attacker to leak the L1 data cache contents.
CrossTalk: CrossTalk is a vulnerability of Intel processors where certain microarchitectural data shared between cores can leak information from one core to an attacker on another. For example, the attacker can observe the hardware random number generator values sent to another core.
Downfall: Downfall is a security vulnerability of Intel processors that exploits the vector gather instruction to leak data from buffers shared across security domains and VMs running on the same processor. For example, it allows SIMD register loads containing encryption keys to be leaked. Downfall is called Gather Data Sampling (GDS) by Intel.
Fallout: Fallout is a Microarchitectural Data Sampling (MDS) attack that exploits two store buffer behaviors of Intel processors. Other MDS include RIDL, Fallout, CrossTalk, LVI, Snoop, L1DES, VRS, and TAA.
FLOP: See FLOP: Breaking the Apple M3 CPU via False Load Output Predictions.
Foreshadow and Foreshadow-NG: Foreshadow is a computer security flaw attack in the Meltdown class targeting Intel SGX technology which defeats enclave memory isolation, sealing, and attestation guarantees. Intel calls Foreshadow a L1 Terminal Fault (L1TF) vulnerability. Intel’s analysis identified two closely related variants of Foreshadow, which we collectively call Foreshadow-NG (quotes from Foreshadow-NG). These attacks allow the entire L1 data cache to be dumped, which potentially exposes data from other address spaces that otherwise not be nameable for leaking by other techniques: Foreshadow-NG-type attacks variants exploit a subtle L1TF microarchitectural condition that allows to transiently compute on unauthorized physical memory locations that are currently not mapped in the attacker’s virtual address space view. As such, Foreshadow-NG is the first transient execution attack that fully escapes the virtual memory sandbox-traditional page table isolation is no longer sufficient to prevent unauthorized memory access.
GhostRace: GhostRace is an attack on code with critical regions and synchronization. It involves training branches testing for mutual exclusion to mispredict, and exploiting the microarchitectural changes that result. Our key finding is that all the common synchronization primitives can be microarchitecturally bypassed on speculative paths, turning all architecturally race-free critical regions into Speculative Race Conditions (SRCs).
GhostWrite: See GhostWrite.
GoFetch: GoFetch is a security vulnerability resulting from a data memory-dependent prefetcher (DMP) interpreting memory contents as a pointer and prefetching the blocks addressed.
Inception: Inception is a transient execution attack that leaks arbitrary data on all AMD Zen CPUs.
LVI: Load Value Injection (LVI) exploits transiently incorrect forwarding, with variants exploiting forwarding in the L1 Data Cache, Store Buffer, Line Fill Buffer, and load ports.
Meltdown: Meltdown is a computer security flaw attack that is allowed when processors delay privilege checks until after subsequent instructions have been speculatively executed and thereby modified shared microarchitectural state. Vulnerable processors thereby allow low privilege levels to read the memory of higher privilege levels, breaking the privilege model. Meltdown is also known as Rogue Data Cache Load (RDCL) and Speculative execution exploit variant 3.
PACMAN: PACMAN.
Retbleed: Retbleed.
RIDL: Rogue In-Flight Data Load (RIDL) is a Microarchitectural Data Sampling (MDS) attack. Other MDS include Fallout, CrossTalk, LVI, Snoop, L1DES, VRS, and TAA.
S2C: An acronym for Synchronization Storage Channels (S²C) described in Synchronization Storage Channels (S²C): Timer-less Cache Side-Channel Attacks on the Apple M1 via Hardware Synchronization Instructions, which is an attack using the LL/SC instructions.
SLAP: See SLAP: Data Speculation Attacks via Load Address Prediction on Apple Silicon.
Spectre: Spectre is a computer security flaw attack based training branch prediction to transiently direct execution of the victim to code that in turn exposes secrets through shared microarchitectural state. The original Spectre attacks were bounds check bypass (variant 1) and branch target injection (variant 2). Spectre attacks have been extended beyond branch prediction and branch target buffer to the return address stack (SpectreRSB), and store to load bypass (Spectre-STL) (also known as Spectre variant 4). A Systematic Evaluation of Transient Execution Attacks and Defenses provides a breakdown of the many Spectre and Meltdown types.
SpectreRSB: SpectreRSB is a Spectre-like attack using the Return Stack Buffer.
Spoiler: Spoiler. Also SPOILER: Speculative Load Hazards Boost Rowhammer and Cache Attacks.
SWAPGS: SWAPGS.
Transient Execution Vulnerability: Transient execution vulnerabilities are a class of vulnerabilities caused by speculative execution. This class includes Spectre, Speculative Code Store Bypass (SCSB), Floating-Point Value Injection (FPVI), Branch History Injection. Retbleed, SQUIP, Cross-Thread Return Address Predictions, Zenbleed, Inception, Downfall, GhostRace, Register File Data Sampling (RFDS).
Zenbleed: Zenbleed.
ZombieLoad: ZombieLoad is a transient-execution attack which observes the values of memory loads and stores on the current CPU core. ZombieLoad exploits that the fill buffer is used by all logical CPUs of a CPU core and that it does not distinguish between processes or privileges.

		Earl Killian <webmaster at securerisc.org>
2025-07-11

71	64	63	61	60	57	56	53	52	47	46	28	27	26	25	17	16	14	13	3	2	0
251		R		0		SDP		AP		0		S	F	T		TE		B		BE
8		3		4		4		6		19		1	1	9		3		11		3

D	FP64	IEEE 754 binary64	11	53
S	FP32	IEEE 754 binary32	8	24
-	TF32	TensorFloat-32	8	11
H	FP16	IEEE 754 binary16	5	11
B	BF16	bfloat16	8	8
P3	FP8	IEEE 754 binary8p3	5	3	similar to OCP FP8 E5M2
P4		IEEE 754 binary8p4	4	4	similar to OCP FP8 E4M3
P5		IEEE 754 binary8p5	3	5
-	FP6	OCP FP6 E2M3	2	4
-	FP6	OCP FP6 E3M2	3	3
-	FP4	OCP FP4	2	2

Size	Name	Sign	Exponent	Precision hidden+stored	Comment
32	FP32	1	8	1+23	Single Precision
32	TF32	1	8	1+10	Nvidia 4-element dot product
16	FP16	1	5	1+10	2008 IEEE format for graphics
16	BF16	1	8	1+7	Google Brain Float ~2017
8	E5M2	1	5	1+2	Open Compute Project
	E4M3	1	4	1+3	Open Compute Project
	E8M0	0	8	1+0	Open Compute Project
	E0M7	1	0	7	8-bit integer Sb.bbbbbbb (int8)
6	E3M2	1	3	1+2	Open Compute Project
6	E2M3	1	2	1+3	Open Compute Project
4	E2M1	1	2	1+1	Open Compute Project
4	E0M3	1	0	3	4-bit integer Sb.bb (int4)

71	64	63	61	60	57	56	53	52	47	46	28	27	26	25	17	16	14	13	3	2	0
251		R		0		SDP		AP		0		S	F	T		TE		B		BE
8		3		4		4		6		19		1	1	9		3		11		3

SecureRISC Instruction Set Architecture

Up Front

Documentation Outline

Work In Progress Documentation

Open Source

Software

Terminology

Table of Contents

Introduction

Goals

Synergy Between Security and Other Goals

Language Specific Mechanisms

Is SecureRISC actually RISC?

Background

Conventional Aspects of SecureRISC

Unconventional Aspects of SecureRISC

Advantages of the Basic Block Descriptor

Capability Hardware Enhanced RISC Instructions (CHERI)

SecureRISC and CHERI Variants

Open Aspects

Documentation Conventions

Basic Terminology

Multics Terminology (Multicians may mostly skim)

Address Terminology

Tagged Pointer Terminology

Dynamic Typing

Python and Other Language Types

Block-Oriented ISA Terminology

Other Features and Aspects of SecureRISC

Sandboxing

Garbage Collection

GC Terminology

Generational GC

Garbage Collection Algorithm

Virtual Address Restriction

Exceptions And Interrupts

Virtualization of Interrupts

Dynamic Linking and Loading

Bounds Checking

Memory Model

Instruction Set

Processor State

Basic Block Descriptor Words

Basic Block Descriptor Types

Call/Return Details

Debugging

Overflow Checking

Comparisons

Floating-Point Comparisons

Branch Avoidance

Tag Checking

Multiword Multiplication

Multiword Division

Arithmetic for Polynomials over GF(2)

Multiword Addition

Multiword Shifts

Floating-Point Rounding Modes

Floating-Point Flags

Vector Register File

Vector Boolean Registers

Matrix Multiply

Matrix Algebra

Matrix Tiling

Other Loop Orderings

Matrix Multiplication Implementation

Matrix Multiply Using Vectors

Matrix Multiply Using An Outer Product Array

Matrix Accumulators

Control and Status Register Operations

Atomic Memory Operations

Wait Instructions

Fence Instructions

System Call Instructions

Prefetch and Cache Operations

Code Size Reduction

Instruction Formats and Overview

Assembler Mnemonics

Software Conventions

Data Types

Register Names and Uses

71	64	63	61	60	57	56	53	52	47	46	28	27	26	25	17	16	14	13	3	2	0
251		R		0		SDP		AP		0		S	F	T		TE		B		BE
8		3		4		4		6		19		1	1	9		3		11		3