This is how many registers the ISA exposes, but not the number of registers actually in the CPU. Typical CPUs have hundreds of registers. For example, Zen 4's integer register file has 224 registers, and the FP/vector register file has 192 registers (per Wikipedia). This is useful to know because it can affect behavior. E.g. I've seen results where doing a register allocation pass with a large number of registers, followed by a pass with the number of registers exposed in the ISA, leads to better performance.
What compilers do this?
One writeup I know about is: "Smlnj: Intel x86 back end compiler controlled memory."
What you describe sounds counter-intuitive. And the paper you cite seems to suggest an ISA extension to increase the number of architected (!) registers. That is something very different. It makes most sense in VLIW architectures, like the ones described in the paper. Architectures like x86 do hardware register renaming (or similar techniques, there are several) to be able to exploit as much instruction level parallelism as possible. That is why I find your claim hard to believe. VLIW architectures traditionally provide huge register sets and make less use of transparent register renaming etc.; that part is either explicit in the ISA or completely left to the compiler. These are very different animals than our good old x86...
I'm not sure we're talking about the same paper. Here's the one I'm referring to:
https://smlnj.org/compiler-notes/k32.ps
E.g. "Our strategy is to pre-allocate a small set of memory locations that will be treated as registers and managed by the register allocator."
There are more recent publications on "compiler controlled memory" that mostly seem to focus on GPUs and embedded devices.
FTA: “For design reasons that are a complete mystery to me, the MMX registers are actually sub-registers of the x87 STn registers”
I think the main argument for doing that was that it meant that existing OSes didn’t need changes for the new CPU. Because they already saved the x87 registers on context switch, they automatically saved the MMX registers, and context switches didn’t slow down.
It also may have decreased the amount of space needed, but that difference can’t have been very large, I think
Nitpick (footnote 3): "64-bit kernels can run 32-bit userspace processes, but 64-bit and 32-bit code can’t be mixed in the same process. ↩"
That isn't true on any operating system I'm aware of. If both modes are supported at all, there will be a ring 3 code selector defined in the GDT for each, and I don't think there would be any security benefit to hiding the "inactive" one. A program could even use the LAR instruction to search for them.
At least on Linux, the kernel is perfectly fine with being called from either mode. FASM example code (with hardcoded selector, works on my machine):
format elf executable at $1_0000
entry start
segment readable executable
start: mov eax,4 ;32-bit syscall# for write
mov ebx,1 ;handle
mov ecx,Msg1 ;pointer
mov edx,Msg1.len ;length
int $80
call $33:demo64
mov eax,4
mov ebx,1
mov ecx,Msg3
mov edx,Msg3.len
int $80
mov eax,1 ;exit
xor ebx,ebx ;status
int $80
use64
demo64: mov eax,1 ;64-bit syscall# for write
mov edi,1 ;handle
lea rsi,[Msg2] ;pointer
mov edx,Msg2.len ;length
syscall
retfd ;return to caller in 32 bit mode
Msg1 db "Hello from 32-bit mode",10
.len=$-Msg1
Msg2 db "Now in 64-bit mode",10
.len=$-Msg2
Msg3 db "Back to 32 bits",10
.len=$-Msg3
This is also true on Windows. Malware loves it! https://encyclopedia.kaspersky.com/glossary/heavens-gate/
Much like there is 64-bit "code", there is also 32-bit "code" that can only be executed in the 32-bit (protected) mode, namely all the BCD, segment-related, push/pop-all instructions that will trigger an invalid opcode exception (#UD) when executed under long mode. In that strictest sense, "64-bit and 32-bit code can’t be mixed".
x86 has (not counting the system-management mode stuff) 4 major modes: real mode, protected mode, virtual 8086 mode, and IA-32e mode. Protected mode and IA-32e mode rely on the bits within the code segment's descriptor to figure out whether or not it is 16-bit, 32-bit, or 64-bit. (For extra fun, you can also have "wrong-size" stack segments, e.g., 32-bit code + 16-bit stack segment!)
16-bit and 32-bit code segments work almost exactly the same in IA-32e mode (what Intel calls "compatibility mode") as they do in protected mode; I think the only real difference is that the task management stuff doesn't work in IA-32e mode (and consequently features that rely on task management--e.g., virtual-8086 mode--don't work either). It's worth pointing out that if you're running a 64-bit kernel, then all of your 32-bit applications are running in IA-32e mode and not in protected mode. This also means that it's possible to have a 32-bit application that runs 64-bit code!
But I can run the BCD instructions, the crazy segment stuff, etc. all within a 16-bit or 32-bit code segment of a 64-bit executable. I have the programs to prove it.
Isn't that how recent Wine runs 32-bit programs?
Good post! Stuff I didn't know x64 has. Sadly it doesn't answer the "how many registers are behind rax" question I was hoping for; I'd love to know how many outstanding writes one can have to the various architectural registers before the renaming machinery runs out and things stall. Not really for immediate application to life, just a missing part of my mental cost model for x64.
If you’re asking about the register file, it’s around a couple hundred registers varying by architecture.
You’d need several usages of the ISA register without dependencies to run out of physical registers. You’re more likely to be bottlenecked by execution ports or the decoder way before that happens.
I've seen claims that it's different for different architectural registers, e.g. _lots_ of backing store for rax, less for rbx. It's likely to be significant for the vector registers too which could plausibly have features like one backing store for the various widths, in which case deliberately using the smaller vectors would sometimes win out. I'll never bother to write the asm by hand with that degree of attention but would like better cost models in the compiler backend.
African or European?
Intel's next gen will add 16 more general purpose registers. Can't wait for the benchmarks.
So every function call will need to spill even more call-clobbered registers to the stack!
Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?
With an increase in available registers, every value that a compiler might newly choose to keep in a register was a value that would previously have lived in the local stack frame anyway.
It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.
> It's up to the compiler to decide how many registers it needs to preserve at a call.
But the compiler is bound by the ABI, isn't it? (at least for externally visible entrance points / calls to sub routines external to the current compilation unit)
Someone decides how to define the ABI too, and that is a free choice. CPU register count doesn't constrain this particular system design question.
Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.
That depends on whether something like LTO is possible and the function isn't declared to use one of the plethora of calling conventions. What it means is that new calling conventions will be needed and that this new platform will be able to use pass-by-register for higher-arity functions.
LTO is orthogonal to inlining functions.
Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spill. Are you talking about calls inside other calls, which cause the outer scope arguments to be preemptively spilled so the inner scope data can be pre-placed in registers?
More registers leads to less spilling not more, unless the compiler is making some really bad choices.
Any easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).
So, let's take a function with 40 alive temporaries at a point where it needs to call a helper function of, say, two arguments.
On a 16 register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer) we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function), another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but 7 other temporaries have to be spilled to survive the call. And the remaining 25 temporaries live on the stack in the first place.
If we instead take a machine with 31 registers, 19 being call-clobbered and 12 call-invariant (one of which is the stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function), and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so 17 other temporaries have to be spilled to survive the call. And the remaining 10 temporaries live on the stack in the first place.
So, there seems to be more spilling/reloading whether you count pre-emptive spills or the on-demand-at-the-call-site spills, at least to me.
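To put that arithmetic into code, here's a hedged C sketch (the function names and the count of ten temporaries are made up for illustration) of a non-leaf function where every temporary that lives across the call has to be parked somewhere:

    #include <stdint.h>

    uint64_t helper(uint64_t x, uint64_t y);  /* opaque: defined elsewhere, not inlinable */

    /* Hypothetical illustration: t0..t9 all stay live across the call to
     * helper(), so the allocator has to park each of them somewhere:
     *  - a callee-saved register -> one save in the prologue, one restore in the epilogue
     *  - a caller-saved register -> a spill before the call and a reload after it
     *  - the stack frame         -> a load every time the value is used
     * With more registers the compiler shifts values out of the stack frame
     * and into the first two categories, which is the accounting above. */
    uint64_t across_call(const uint64_t *a) {
        uint64_t t0 = a[0], t1 = a[1], t2 = a[2], t3 = a[3], t4 = a[4];
        uint64_t t5 = a[5], t6 = a[6], t7 = a[7], t8 = a[8], t9 = a[9];

        uint64_t r = helper(t0, t1);   /* every live tN must survive this call */

        return r + t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9;
    }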
The game is deeper than that. Your model is probably about right for the compiler you're using. It shouldn't be - compilers can do better - but it's all a work in progress.
The small-scale point is that you don't usually spill around every call site. One of the calls is the special "return" branch; the other N can probably share some of the register shuffling overhead if you're careful with allocation.
The bigger point is that the calling convention is not a constant. Leaf functions can get special-cased, but so can non-leaf ones. Change the pattern of arguments to fixed registers / stack, change which registers are callee/caller saved. The entry point for calls from outside the current module needs to match the platform ABI you claimed it follows, but nothing else does.
The inlining theme hints at this. Basic blocks _are_ functions that are likely to have a short list of known call sites, each of which can have the calling convention chosen by the backend, which is what the live in/out of blocks is about. It's not inlining that makes any difference to regalloc, it's being more willing to change the calling convention on each function once you've named it "basic block".
You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.
The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.
> You’re missing the fact that the compiler isn’t forced to fill every register in the first place.
Temporaries start their lives in registers (on RISCs, at least). So if you have 40 alive values, you can use the same one register to calculate them all and immediately save all 40 of them on the stack, or e.g. keep 15 of them in 15 registers, and use the 16th register to compute 25 other values and save those on the stack. But if you keep them in the call-invariant registers, those registers need to be saved at the function's prologue, and the call-clobbered registers need to be saved and restored around inner call sites. That's why academia has been playing with register windows, to get around this manual shuffling.
> The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.
Would you be willing to work through that proof? There may very well be less total memory traffic for a machine with 31 registers than with 16; but it would seem to me that there should be some sort of local optimum for the number of registers (and their clobbered/invariant assignment) for minimizing stack traffic: four registers is way too few, but 192 (there have been CPUs like that!) is way too many.
This argument doesn’t make sense to me. Generally speaking, having more registers does not result in more spilling, it results in less spilling. Obviously, if you have 100 registers here, there’s no spilling at all. And think through what happens in your example with a 4 register machine or a 1 register machine, all values must spill. You can demonstrate the general principle yourself by limiting the number of registers and then increasing it using the ffixed-reg flags. In CUDA you can set your register count and basically watch the number of spills go up by one every time you take away a register and go down by one every time you add a register.
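If you want to reproduce that experiment on an ordinary compiler, here's a rough sketch (the file name, function, and the grep are just illustrative assumptions): compile a high-pressure function, then recompile it with some registers removed via GCC/Clang's -ffixed-<reg> flags and compare the stack traffic; on CUDA the analogous knob is nvcc's -maxrregcount.

    /* spill_demo.c: hypothetical high-register-pressure kernel.
     *
     * Rough recipe for comparing spills (exact output varies by compiler):
     *   gcc -O2 -S spill_demo.c -o base.s
     *   gcc -O2 -S -ffixed-r12 -ffixed-r13 -ffixed-r14 -ffixed-r15 spill_demo.c -o fewer.s
     *   grep -c '(%rsp)' base.s fewer.s    # crude proxy for stack traffic
     * On CUDA, nvcc -maxrregcount=N (or __launch_bounds__) plays the same role.
     */
    #include <stdint.h>

    uint64_t pressure(const uint64_t *a, int n) {
        /* Twelve accumulators stay live across every iteration, so taking
         * registers away forces the allocator to start spilling them. */
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0;
        uint64_t s6 = 0, s7 = 0, s8 = 0, s9 = 0, s10 = 0, s11 = 0;
        for (int i = 0; i + 12 <= n; i += 12) {
            s0 += a[i] * 3;      s1 += a[i + 1] * 5;   s2 += a[i + 2] * 7;
            s3 += a[i + 3] * 11; s4 += a[i + 4] * 13;  s5 += a[i + 5] * 17;
            s6 += a[i + 6] * 19; s7 += a[i + 7] * 23;  s8 += a[i + 8] * 29;
            s9 += a[i + 9] * 31; s10 += a[i + 10] * 37; s11 += a[i + 11] * 41;
        }
        return s0 ^ s1 ^ s2 ^ s3 ^ s4 ^ s5 ^ s6 ^ s7 ^ s8 ^ s9 ^ s10 ^ s11;
    }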
> Obviously, if you have 100 registers here, there’s no spilling at all.
No, you still need to save/spill all the registers that you use: the call-invariant ones need to be saved at the beginning of the function, the call-clobbered at an inner call site. If your function is a leaf function, only then you can get away with using only call-clobbered registers and not preserving them.
Okay, I see what you’re saying. I was assuming the compiler or programmer knows the call graph, and you’re assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings. Your assumption is for sure safer and more common for a compiler compiling a function that’s not a leaf and not inlined.
So I can see why it might seem at first glance like having more registers would mean more spilling for a single function. But if your requirement is that you must save/spill all registers used, then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables, and not on the number of hardware registers at all? If your machine has fewer general purpose registers than live state footprint in your function, then the amount of function-internal spill and/or remat must go up. You have to spill your own live state in order to compute other necessary live state during the course of the function. More hardware registers means less function-internal spill, but I think under your function call assumptions, the amount of spill has to be constant.
For sure this topic makes it clear why inlining is so important and heavily used, and once you start talking about inlining, having more registers available definitely reduces spill, and this happens often in practice, right? Leaf calls and inlined call stacks and specialization are all a thing that more regs help, so I would expect perf to get better with more registers.
Thanks for actually engaging with my argument.
> assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings.
Most of the decision logic/business logic lives exactly in functions like this, so while I wouldn't claim that 90% of all of the code is like that... it's probably at least 50% or so.
> then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables
Yes, and this ties precisely back to my argument: whether or not larger number of GPRs "helps" depends on what kind of code is usually being executed. And most of the code, empirically, doesn't have all that many scalar variables alive simultaneously. And the code that does benefit from more registers (huge unrolled/interleaved computational loops with no function calls or with calls only to intrinsics/inlinable thin wrappers of intrinsics) would benefit even more from using SIMD or even better, being off-loaded to a GPU or the like.
I actually once designed a 256-register fantasy CPU but after playing with it for a while I realised that about 200 of its registers go completely unused, and that's with globals liberally pinned to registers. Which, I guess, explains why Knuth used some idiosyncratic windowing system for his MMIX.
It took me a minute, but yes I completely agree that whether more GPRs helps depends on the code & compiler, and that there’s plenty of code you can’t inline. Re: MMIX Yes! Theoretically it would help if the hardware could dynamically alias registers, and automatically handle spilling when the RF is full. I have heard such a thing physically exists and has been tried, but I don’t know which chip/arch it is (maybe AMD?) nor how well it works. I would bet that it can’t be super efficient with registers, and maybe the complexity doesn’t pay off in practice because it thwarts and undermines inlining.
I recall there were some new instructions added that greatly help with this. Unfortunately I'm not finding any good _webpages_ that describe the operation generally to give me a good overview / refresher. Everything seems to either directly quote published PDF documents or otherwise not present the information in a form that's useful for end use. E.g. https://www.felixcloutier.com/x86/ -- However, availability is problematic for even slightly older silicon: https://en.wikipedia.org/wiki/X86-64
- XSAVE / XRSTOR
- XSAVEOPT / XRSTOR
- XSAVEC / XRSTOR
- XSAVES / XRSTORS
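These are mostly used by kernels and context-switch code, but from user space you can at least peek at which register-state components the OS has enabled for XSAVE by reading XCR0. A minimal sketch, assuming the CPU and OS support XSAVE (otherwise the read faults) and that you build with something like gcc -mxsave:

    #include <stdio.h>
    #include <immintrin.h>   /* _xgetbv */

    int main(void) {
        /* XCR0 enumerates the state components XSAVE/XRSTOR manage;
         * bit meanings per the Intel SDM. */
        unsigned long long xcr0 = _xgetbv(0);
        printf("x87 state:        %llu\n", (xcr0 >> 0) & 1);
        printf("SSE (XMM) state:  %llu\n", (xcr0 >> 1) & 1);
        printf("AVX (YMM) state:  %llu\n", (xcr0 >> 2) & 1);
        printf("AVX-512 state:    %s\n", ((xcr0 >> 5) & 7) == 7 ? "yes" : "no");
        return 0;
    }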
Eh, pretty much nobody uses them (outside of OS kernels?); and mind you, RISC-V with its 32 registers has nothing similar to those, which is why 14-instruction-long prologues (adjust sp, save ra and s0 through s11) and epilogues are not that uncommon there.
A good compiler will only do that if the register spilling is more efficient than using more stack variables, so I don't really see the problem.
You can't operate on stack variables without loading them into the registers first, not on RISCs anyway. My main point is that this memory-shuffling traffic is unavoidable in non-leaf functions, so an extremely large amount of available registers doesn't really help them.
op is probably referring to the push all/pop all approach.
No, I'm not. I use a common "spill definitely-reused call-invariant registers at the prologue, spill call-clobbered registers that need to survive a call at precisely the call site" approach; see the sibling comment for the arithmetic.
That’s the push all/pop all approach.
Most modern compilers for modern languages do an insane amount of inlining so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs can't handle branches. CPUs can.
I looked it up. It's called APX (Advanced Performance Extensions)[1].
[1]: https://www.intel.com/content/www/us/en/developer/articles/t...
Oh my, it has three-operand instructions now. VAX vindicated...
How are they adding GPRs? Won’t that utterly break how instructions are encoded?
That would be a major headache — even if current instruction encodings were somehow preserved.
It’s not just about compilers and assemblers. Every single system implementing virtualization has a software emulation of the instruction set - easily 10k lines of very dense code/tables.
The same way AMD added 8 new GPRs, I imagine: by introducing a new instruction prefix.
Yes, two in fact. One is the same prefix (with subtle differences of course, it's x86) that was introduced for AVX512's 32 registers. The other is new and it's a two-byte extension (0xd5 0x??) of the REX prefix (0x40-0x4f).
The longer prefix has extra functionality such as adding a third operand (e.g. add r8, r15, r16), suppressing flags updates, and accessing a few new instructions (push2, pop2, ccmp, ctest, cfcmov).
x86 is broadly extendable. APX adds a REX2 prefix to address the new registers, and also allows using the EVEX prefix in new ways. And there's new conditional instructions where the encoding wasn't really described on the summary page.
Presumably this is gated behind cpuid and/or model specific registers, so it would tend to not be exposed by virtualization software that doesn't support it. But yeah, if you decode and process instructions, it's more things to understand. That's a cost, but presumably the benefit outweighs the cost, at least in some applications.
It's the same path as any x86 extension. In the beginning only specialty software uses it, at some point libraries that have specialized code paths based on processor features will support it, if it works well it becomes standard on new processors, eventually most software requires it. Or it doesn't work out and it gets dropped from future processors.
I imagine a turning point will be if/when the large HPC centers will significantly benefit from this. I need to see whether DOE facilities compile their software with these new features on a regular basis. (Well, for the average scientific user, whether the jellybean Python stack will take advantage…)
Those general purpose registers will also need to grow to twice their size, once we get our first 128bit CPU architecture. I hope Intel is thinking this through.
That's a ways out. We're not even using all bits in addresses yet. Unless they want hardware pointer tagging a la CHERI there's not going to be a need to increase address sizes, but that doesn't expose the extra bits to the user.
Data registers could be bigger. There's no reason `sizeof int` has to equal `sizeof intptr_t`, many older architectures had separate address & data register sizes. SIMD registers are already a case of that in x86_64.
You can do a lot of pointer tagging in 64 bit pointers. Do we have CPUs with true 64 bit pointers yet? Looks like the Zen 4 is up to 57 bits. IIRC the original x86_64 CPUs were 48 bit addressing and the first Intel CPUs to dabble with larger pointers were actually only 40 bit addressing.
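For illustration, a minimal user-space sketch of that kind of tagging; the 57-bit cutoff (5-level paging) and the helper names are just assumptions for the example:

    #include <stdint.h>

    /* Assumed: user pointers fit in the low 57 bits (LA57 / 5-level paging),
     * so the top 7 bits can carry a tag as long as it's stripped before use. */
    #define PTR_BITS 57

    static inline void *tag_ptr(void *p, unsigned tag) {
        return (void *)(((uintptr_t)p & ((1ULL << PTR_BITS) - 1)) |
                        ((uintptr_t)tag << PTR_BITS));
    }

    static inline void *untag_ptr(void *p) {
        /* Mask the tag back off before dereferencing. */
        return (void *)((uintptr_t)p & ((1ULL << PTR_BITS) - 1));
    }

    static inline unsigned ptr_tag(void *p) {
        return (unsigned)((uintptr_t)p >> PTR_BITS);
    }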
> There's no reason `sizeof int` has to equal `sizeof intptr_t`
Well, there is no reason `sizeof int` should be 4 on 64-bit platforms except for the historical baggage (which was so heavy for Windows that they couldn't move even long to 64 bits). But having int be a wider type than intptr_t probably wouldn't hurt things (as in, most software would work as-is after a simple recompilation).
Doubling the number of bits squares the number range that can be stored, so there's a point of diminishing returns.
* Four-bit processors can only count to 15, or from -8 to 7, so their use has been pretty limited. It is very difficult for them to do any math, and they've mostly been used for state machines.
* Eight-bit processors can count to 255, or from -128 to 127, so much more useful math can run in a single instruction, and they can directly address hundreds of bytes of RAM, which is low enough that an entire program still often requires paging, but at least a routine can reasonably fit in that range. Very small embedded systems still use 8-bit processors.
* Sixteen-bit processors can count to 65,535, or from -32,768 to 32,767, allowing far more math to work in a single instruction, and a computer can have tens of kilobytes of RAM or ROM without any paging, which was small but not uncommon when sixteen-bit processors initially gained popularity.
* Thirty-two-bit processors can count to 4,294,967,295, or from -2,147,483,648 to 2,147,483,647, so it's rare to ever need multiple instructions for a single math operation, and a computer can address four gigabytes of RAM, which was far more than enough when thirty-two-bit processors initially gained popularity. The need for more bits in general-purpose computing plateaus at this point.
* Sixty-four-bit processors can count to 18,446,744,073,709,551,615, or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, so only special-case calculations need multiple instructions for a single math operation, and a computer can address up to sixteen exbibytes of RAM, which is thousands of times more than current supercomputers use. There are so many bits that programs only rarely perform 64-bit operations, and 64-bit instructions are often performing single-instruction-multiple-data operations that use multiple 8-, 16-, or 32-bit numbers stored in a single register.
We're already at the point where we don't gain a lot from true 64-bit instructions, with the registers being more-so used with vector instructions that store multiple numbers in a single register, so a 128-bit processor is kind of pointless. Sure, we'll keep growing the registers specific to vector instructions, but those are already 512 bits wide on the latest processors, and we don't call them 512-bit processors.
Granted, before 64-bit consumer processors existed, no one would have conceived that simultaneously running a few chat interfaces, like Slack and Discord, while browsing a news web page, could fill up more RAM than a 32-bit processor can address, so software using zettabytes of RAM will likely happen as soon as we can manufacture it, thanks to Wirth's Law (https://en.wikipedia.org/wiki/Wirth%27s_law), but until then there's no likely path to 128-bit consumer processors.
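A small aside on the 64-bit point above: wide integer math by itself doesn't call for wider registers, because compilers already lower 128-bit arithmetic to a couple of 64-bit instructions. A sketch using the GCC/Clang __int128 extension:

    /* On x86-64, GCC/Clang compile this to roughly an add + adc pair:
     * one 64-bit add for the low halves, one add-with-carry for the highs. */
    unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b) {
        return a + b;
    }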
There's a first time for everything.
Related. Others?
How many registers does an x86-64 CPU have? (2020) - https://news.ycombinator.com/item?id=36807394 - July 2023 (10 comments)
How many registers does an x86-64 CPU have? - https://news.ycombinator.com/item?id=25253797 - Nov 2020 (109 comments)
These seem more “past discussion” than “related” to me tbh
The amount of accumulated cruft in the x86 architecture is astounding.
Being a geezer, I remember when there was, for a brief moment, a genuine question whether National Semiconductor, Motorola, or Intel would win the PC market. The NS processors had a nice, clean architecture. The Motorola processors, meh, ok. Intel already had cruft from earlier efforts like the 4004, and was just ugly.
Of course, Intel won, Motorola came in second, and NS became a footnote.
The x86 architecture has only gotten uglier over time.
You have to evolve to compete. Look what happened to MIPS, the classic "pure RISC". (I know about RISC-V, but at this point it's become just another cheap core for those who don't want to pay for ARM licenses.)
RISC-V is not a core, it is an ISA.
Beware: chips with high-performance microarchitectures compliant with RVA23 are coming later this year.
Same difference. You can't get high performance without more complex instructions like ARM or x86.
We'll have to agree to disagree.
Fortunately, this will be determined in practice in just months, not years from now.
Tried to answer this question years back for just the "basic" x86 registers. Quickly realized there was never going to be any single answer until I had mastered the entire ISA. Oh well.
Some minor nitpicks, but hey, we're counting registers, it's already quite nitpicky :)
As far as I can remember, you can't access the high/low 8 bits of si, di, sp. ip isn't accessible directly at all.
The ancestry of x86 can actually be traced back to 8-bit CPUs - the high/low halves of registers are remnants of an even older arch - but I'm not sure about that off the top of my head.
I think most of the "weird" choices mentioned there boil down to limitations that seem absurd right now, but were real constraints - the x87 stack can probably be traced back to exposing a minimal interface to the host processor - 1 register instead of 8 can save quite a few data lines - although a multiplexer can probably solve this - so just a wild guess. MMX probably reused the register file of x87 to save die space.
The low 8 bits of SI, DI, BP and SP weren't accessible before, but now they are in 64-bit mode.
The earliest ancestor of x86 was the CPU of the Datapoint 2200 terminal, implemented originally as a board of TTL logic chips and then by Intel in a single chip (the 8008). On that architecture, there was only a single addressing mode for memory: it used two 8-bit registers "H" and "L" to provide the high and low byte of the address to be accessed.
Next came the 8080, which provided some more convenient memory access instructions, but the HL register pair was still important for all the old instructions that took up most of the opcode space. And the 8086 was designed to be somewhat compatible with the 8080, allowing automatic translation of 8080 assembly code.
16-bit x86 didn't yet allow all GPRs to be used for addressing, only BX or BP as "base", and SI/DI as "index" (no scaling either). BP, SI and DI were 16-bit registers with no equivalent on the 8080, but BX took the place of the HL register pair, that's why it can be accessed as high and low byte.
Also the low 8 bits of the x86 flag register (Sign,Zero,always 0,AuxCarry,always 0,Parity,always 1,Carry) are exactly identical to those of the 8080 - that's why those reserved bits are there, and why the LAHF and SAHF instructions exist. The 8080 "PUSH PSW" (Z80 "PUSH AF") instruction pushed the A register and flags to the stack, so LAHF + PUSH AX emulates that (although the byte order is swapped, with flags in the high byte whereas it's the low byte on the 8080).
Fun fact, that obviously you already know but may be interesting to others.
In the encoding the registers are ordered AX, CX, DX, BX to match the order of the 8080 registers AF, BC (which the Z80 uses as count register for the DJNZ instruction, similar to x86 LOOP), DE and HL (which like BX could be used to address memory).
Conservatively though, another answer could be when not considering subset registers as distinct:
16 GP
2 state (flags + IP)
6 seg
4 TRs
11 control
32 ZMM0-31 (repurposes 8 FPU GP regs)
1 MXCSR
6 FPU state
28 important MSRs
7 bounds
6 debug
8 masks
8 CET
10 FRED
=========
145 total
And don't forget another 10-20 for the local APIC.
"The answer" depends upon the purpose and a specific set of optional extensions. Function call, task switching between processes in an OS, and emulation virtual machine process state have different requirements and expectations. YMMV.
Here's a good list for reference: https://sandpile.org/x86/initial.htm
ZMM registers are separate from the 8 FPU registers. That's because the ZMM set is separate from MM registers. So there's 8 more.
In hardware, however, renaming resources are shared between ST/MM registers and the eight Kn mask registers.
x86-64 ISA general-purpose register containers: the lower 8 to 16 bits of the 64-bit GPRs.
Heh, am I the only one who was expecting an article about register renaming?
One thing that has happened since 2020 is that recent AMD CPUs support AVX-512, so that raises the number of registers by 16+16+32.
Intel's client CPUs still don't (officially*) support AVX-512, but pretty much all AMD CPUs with DDR5 do.
* Outside of older Alder Lake CPUs, and even then, it's kind of a hack.
Don't forget that x86_64, like ARM, is IP-locked; RISC-V is not.
Fun fact: the AMD64 patents have expired, with AMD-V patents expiring this year, so there really isn't a need for an x86 license to do anything useful. All that's still protected is various AVX instruction sets, but those are generally used in heavily optimized software, like emulators and video encoders, that tend to be compiled to the specific processor instruction set anyway.
I would love to see an Open Source, free to use x86 CPU design.
There's an open-source 80486 project, called ao486, but I don't know of anything now modern.
As far as I can remember, it is not only a "patent" issue. It seems there are other legal mechanisms.
That said, I would not use a x86_64 CPU without AVX nowadays.
As far as intellectual property protections go, you wouldn't be able to copy the layout of an old AMD or Intel processor without committing copyright infringement, not that anyone would want to, because it wouldn't be cost effective to use the exact same process decades later. There's no trademark protection, as AMD was unable to register the x86-64 trademark (https://tsdr.uspto.gov/#caseNumber=76032083)
Other than protections against industrial espionage, that exhausts all forms of intellectual property rights in the US.
The microcode is protected by copyright; see NEC Corporation v. Intel Corporation.
Microcode is specific to a given implementation, so if you make your own x86 implementation, it's not going to run AMD's or Intel's microcode unless you go out of your way to make it do so. NEC didn't infringe Intel's copyright, because their processor ran different microcode than Intel's, and NEC won that lawsuit.
Alright, the ISA itself is probably protected under some copyrights (+patents), which in the US last for ~one century (cf Disney).
In the end, nobody sane would try their luck; better go for something non "IP-locked".
Aka RISC-V, not to mention that for a modern implementation RISC-V is more friendly.
How could we given you keep bringing it up whether it’s relevant or not?