Surprising New Feature in AMD Ryzen 3000

Resiliencia · September 2, 2020, 1:10pm

https://www.agner.org/forum/viewtopic.php?t=41
Source.

Kresimir · September 2, 2020, 1:24pm

Could you please tell me in your words what the surprising new feature is?

This website is asking me to disable my ad blocker, and I have no intention of doing that. They and their ads can go to heck.

Resiliencia · September 2, 2020, 1:28pm

I have just finished testing the AMD Zen 2 CPU. The results are in my microarchitecture manual and my instruction tables https://www.agner.org/optimize/#manuals.

I discovered that the Zen 2 has a new surprising feature that we have never seen before: It can mirror the value of a memory operand inside the CPU so that it can be accessed with zero latency.

This assembly code shows an example:

mov dword [rsi], eax
add dword [rsi], 5
mov ebx, dword [rsi]

When the CPU recognizes that the address [rsi] is the same in all three instructions, it will mirror the value at this address in a temporary internal register. The three instructions are executed in just 2 clock cycles, where it would otherwise take 15 clock cycles.

It can even track an address on the stack while compensating for changes in the stack pointer across push, pop, call, and return instructions. This is useful in 32-bit mode where function parameters are pushed on the stack. A simple function can read its parameters without waiting for the values to be stored on the stack and read back again. This does not work if the stack pointer is modified by any other instructions or copied to a frame pointer. Therefore, it doesn’t work with functions that set up a stack frame.
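
A rough sketch of the two cases described above, in 32-bit NASM syntax. The labels simple_func and framed_func, the register choices, and the offsets are illustrative assumptions and do not come from the original post; the comments only restate what the text claims.

```nasm
; Caller: pushes one parameter on the stack, then calls the function.
    push eax                    ; parameter is stored to the stack
    call simple_func
    add  esp, 4                 ; cleanup after the call (cdecl-style, an assumption)

simple_func:
    mov  ecx, dword [esp+4]     ; parameter read relative to esp; per the text, the
                                ; address can be tracked across the push and call
    ret

framed_func:
    push ebp
    mov  ebp, esp               ; stack pointer copied to a frame pointer
    mov  ecx, dword [ebp+8]     ; per the text, the mirroring does not work here
    pop  ebp
    ret
```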

The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.
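
To make these conditions concrete, here are a few operand forms sorted by whether the text says they can qualify. This is only a sketch in 64-bit NASM syntax: the label counter and the address 0x1000 are hypothetical, and the lines are independent examples, not a sequence meant to be run.

```nasm
; Forms that, according to the text, can qualify:
    mov   dword [rsi], eax          ; pointer register, 32-bit general purpose operand
    mov   rbx, qword [rsi+rdi]      ; pointer plus index, 64-bit operand

; Forms that, according to the text, do not qualify:
    mov   word [rsi], ax            ; operand size is 16 bits
    movss dword [rsi], xmm0         ; vector register, not a general purpose register
    mov   eax, dword [rel counter]  ; rip-relative address (counter is a hypothetical label)
    mov   eax, dword [abs 0x1000]   ; absolute address (0x1000 is a made-up value)
```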

It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing. If the second instruction in the above example has a different pointer register with the same value, you have a problem of pointer aliasing. The CPU assumes that the addresses are different so that the value of eax is directly forwarded to ebx without adding 5. It takes 40 clock cycles to undo the mistake and redo the correct calculation.
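
The aliasing case described above would look roughly like this, assuming rdi happens to hold the same value as rsi (the choice of rdi is an illustration, not from the original post):

```nasm
; Assumption: rdi == rsi, so all three instructions touch the same address.
    mov dword [rsi], eax        ; store eax
    add dword [rdi], 5          ; same address through a different pointer register
    mov ebx, dword [rsi]        ; correct result is eax + 5; per the text, the CPU first
                                ; forwards eax unchanged, then spends about 40 clock
                                ; cycles undoing the mistake and redoing the calculation
```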

Yet, this is a pretty amazing feature. Imagine how complicated it is to implement this in hardware without adding any latency. I wonder why this feature is not mentioned in any AMD documents or promotional material. At least, I can’t find any mention of it anywhere. AMD has something they call superforwarding, but this must be something else because it applies only to floating point registers.

Other interesting results for the Zen 2:
The vector execution units and data paths are all extended from 128 bits to 256 bits. A typical 256-bit AVX instruction is executed with a single micro-op, while the Zen 1 would split it into two 128-bit micro-ops. The throughput for 256-bit vector instructions is now as high as two floating point vector additions and two multiplications per clock cycle.

There is also an extra memory AGU so that it can do two 256-bit memory reads and one 256-bit write per clock cycle.

The maximum overall throughput for a mix of integer and vector instructions is five instructions or six micro-ops per clock for loops that fit into the micro-op cache. Long loops that don’t fit into the micro-op cache are limited by a fetch rate of up to 16 bytes or four instructions per clock. Intel processors have a similar limitation, and this is a very likely bottleneck for CPU intensive code.

All caches are big, the clock frequency is high, and you can get up to 64 cores. All in all, this is a quite competitive CPU as long as your software does not utilize the AVX512 instruction set. The software market is generally slow to adopt new instruction sets, so I guess it makes economic sense for AMD to lag behind Intel in the race for new instruction sets and longer vector registers.

keybreak · September 2, 2020, 1:29pm

You can use https://archive.vn for that. It comes in handy for cutting out ads and crap, as well as paywalls etc…

jonathon · September 2, 2020, 1:45pm

I’m not sure that re-posting the contents of other sites’ pages is an effective way of doing this - a summary plus a link tends to work well.

Also, that “Source” link (to the “vermaden” blog) seems to be unrelated.

Resiliencia · September 2, 2020, 1:57pm

You’re right, from now on I’ll do it that way.

Lemon · September 2, 2020, 2:13pm

This site didn’t ask me anything…strange.

Kresimir · September 2, 2020, 2:14pm

Now, it works for me, too.

keybreak · September 2, 2020, 2:17pm

@Lemon @Kresimir
Just like good old Richard Stallman said…NEVER trust javascript

Lemon · September 2, 2020, 4:48pm

I don’t even trust myself.