
IBS Profiling with uProf on AMD CPUs

27 Oct 2018
IBS (Instruction Based Sampling) requires a different point of view to fully understand

Introduction

AMD's CPUs are growing in popularity with the release of the "Zen" core (Ryzen, Threadripper, and EPYC). Programmers or IT administrators running these systems may find themselves wondering how their code is performing. However, Intel tools like VTune do not work on AMD CPUs, as they rely upon Intel-specific features (such as PEBS or LBR) to count cache hits or branch mispredictions.

Instead, programmers have to make do with the tools built into AMD CPUs, either AMD PMCs or IBS, through the AMD uProf program.

AMD "PMC" counters are the classical hardware counters that most programmers are already familiar with. A counter inside the CPU counts up specific events, such as branch mispredictions. Every branch mispredict, the counter-ticks up. Once the counter ticks up to a user-configurable value, the profiler halts the program, collects some information (the current instruction pointer, the time, etc.), and finally starts the program back up. This information is later analyzed by the profiler, and the programmer uses the analysis to optimize their code.

But this article is NOT about PMC counters. A lot of articles already cover classical performance counters, which are broadly applicable across many different CPUs in the field. This article instead covers AMD's "IBS" counters, a fundamentally different methodology.

IBS inverts the performance counter methodology and profiles instructions themselves, leading to a similar-looking set of statistics but with different implications. Overall, I suggest that programmers use a combination of the classical performance counters and IBS to understand performance. This article focuses on the lesser-known, AMD-specific IBS counters.

Instruction Based Sampling Background

While most programmers are familiar with how AMD's classical PMC counters work, their implementation has a "skew" problem. Modern CPUs are pipelined, superscalar, AND out-of-order, so it's difficult to pin down exactly what a modern "Program Counter" or "Instruction Pointer" even means. With nearly a hundred instructions simultaneously in flight across pipelines, execution units, and retirement queues, the program counter is ambiguous at the moment the profiler writes down its sample. Programmers will notice branch mispredicts attributed to add or multiply instructions, and other such nonsense.

It should be noted that Intel CPUs correct this problem through PEBS. But this article will focus on AMD's IBS methodology.

AMD's IBS methodology was invented to counteract this problem. Before continuing, it helps to have a basic understanding of a processor's pipeline. AMD's Zen processor is pipelined, which means every instruction goes through different stages of computation. For the purposes of IBS, these stages are Fetch, Decode, Execute, Completion, and Retirement.

  • Fetch is when a core actually grabs the instruction from memory. The processor may have to wait for main-memory, or maybe it already has the instruction in the L1 cache or uOp cache.
  • Decode is when the core figures out what the instruction does. Under ideal circumstances, AMD Processors can either decode 4-instructions per clock, OR issue 6-uOps from the uOp cache. After being decoded, instructions are saved as uOps in the uOp cache to speed up future decodes.
  • Execution is when a core actually executes the instruction, be it a division, addition, or multiplication. AMD Zen cores have 10 execution pipelines: 4 integer pipelines, 4 vector/floating-point pipelines, and 2 load/store units (called AGUs). Execution is strongly pipelined and parallelized.
  • Completion is where an instruction goes after it's done executing. Some instructions, like "xor eax, eax", can complete instantly without even going into an execution pipeline. But other instructions, like division, can take dozens of cycles. Completion is where instructions wait to be put back into the correct order.
  • Retirement is the final stage where instructions are finally removed from the processor. Agner Fog claims that AMD Zen can retire 8 uOps per clock cycle. Retirement ensures that code is put back in the correct order, as the original programmers expect.

Some instructions (like divide), may take many cycles in the execute stage. Modern processors will then execute "out of order", and try to complete OTHER instructions found in the program while waiting for the slow division to complete. The CPU will only stall if all paths lead to a dependency on an earlier instruction.
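As a hypothetical illustration (not from any real codebase), the additions below do not depend on the slow divide, so an out-of-order core can execute them while the divide is still in flight; the core only stalls at the final multiply, which needs the divide's result:

C++
// Hypothetical example: the divide has a long latency, but the loop below it
// has no dependency on the divide, so an out-of-order core can overlap them.
#include <cstddef>

double overlapExample(double numerator, double denominator,
                      const int* table, std::size_t count)
{
    double slow = numerator / denominator;  // long-latency operation
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += table[i];                    // independent work, free to execute early
    return slow * sum;                      // the first point that must wait on the divide
}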

With a crude understanding of the AMD Zen pipeline in hand, we can now understand how IBS works. AMD's IBS counts instructions (or cycles: it's user-configurable) that go through retirement. After X instructions (usually 50,000 or higher), the IBS hardware tags a random instruction in the Decode stage of the pipeline. The core then collects statistics on this tagged instruction until retirement. Once the instruction is retired, an interrupt is generated and the profiler is able to collect these statistics.

The AMD Zen Core requires two different performance counters to track IBS-Fetch and IBS-Ops.

Profiling Your Code

First and foremost: you must enable IBS Sampling in your BIOS before you can use it. Otherwise, you will only be able to use the PMC counters. In the case of my Asrock x399 Taichi motherboard, the BIOS setting was in Advanced -> AMD CBS -> Zen Common Options -> Enable IBS. Different motherboards and versions will likely put this setting into different spots.

Fortunately, setting up the BIOS is the only tricky step of using IBS Sampling. AMD uProf is an easy download and installation. The GUI is barebones, and the presentation of statistics isn't always intuitive or useful, but all the important information is in there somewhere.

https://developer.amd.com/amd-uprof/

I have not used the Linux version. There is a crude "API" with 2 function calls of note:

C++
bool amdProfileResume(AMD_PROFILE_CPU);

bool amdProfilePause(AMD_PROFILE_CPU);

There are options to start uProf paused, in both the GUI and the command line. This way, your code can enable and disable the profiler to skip irrelevant sections, like I/O. The AMD uProf manual provides guidance on how to use this API from both Visual Studio and GCC.
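A minimal sketch of how the pause/resume calls might be wrapped around a hot section, using the function names above (the header name, and whether AMD_PROFILE_CPU is the right argument, are assumptions on my part; check the uProf manual that ships with your version):

C++
// Sketch only: start uProf in "paused" mode, then bracket the interesting work.
// "AMDProfileController.h" is an assumed header name; see the uProf manual.
#include "AMDProfileController.h"

static void loadInput()   { /* I/O you want excluded from the profile */ }
static void processData() { /* the hot code you actually want measured */ }
static void writeOutput() { /* more I/O to exclude */ }

int main()
{
    loadInput();                        // not sampled: uProf was started paused
    amdProfileResume(AMD_PROFILE_CPU);  // begin collecting IBS / PMC samples
    processData();
    amdProfilePause(AMD_PROFILE_CPU);   // stop collecting before the output I/O
    writeOutput();
    return 0;
}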

It should be noted that the system itself can also be profiled, which is necessary to collect L3 cache statistics or Data Fabric (DF) statistics. Total-system profiling could be useful to non-programmers who are tuning their software; there was an excellent talk by Netflix engineers on how they used Intel hardware performance counters to tune their servers (http://www.brendangregg.com/linuxperf.html). Total-system profiling could, in theory, do the same thing on AMD systems.

Commentary on the Statistics

Original documentation of these statistics can be found in AMD's uProf user manual (https://developer.amd.com/wordpress/media/2013/12/User_Guide.pdf) or AMD's Family 15h Software Optimization Guide (https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf). While modern AMD Zen cores are Family 17h, the older Family 15h guide is the last manual to actually describe IBS in depth.

I will only highlight the statistics I found most important for my profiling purposes.

  • All IBS op samples -- The grand total of all tagged instructions. In IBS mode, remember that statistics are only generated for the specific instructions that are tagged. By default, uProf tags instructions based on instruction count (i.e., every 50,000 instructions) INSTEAD OF time. This has an unfortunate effect: the more optimized a stretch of code is (i.e., the more instructions it executes per clock), the more likely it is to be tagged, out of proportion to the wall-clock time it actually takes. Nonetheless, this is the "denominator" of most derived statistics, so it's important to fully understand the implications of this fundamental IBS statistic.
  • IBS Tagged-to-Retirement Cycles -- The number of cycles a particular instruction took from tagging to retirement. This is a measurement of latency, NOT of throughput. Instructions can be delayed for many reasons, from dependencies on other instructions to waiting for memory to respond.
  • IBS Completed-to-Retirement Cycles -- This value, in combination with Tagged-to-Retirement, can be used to determine if this particular instruction is the culprit... or if the stall is really caused by a "dependent" instruction elsewhere in the core.
  • IBS mispredicted branch op -- IBS precisely counts every mispredicted op.
  • IBS data cache miss load latency -- IBS precisely counts the number of cycles an instruction spent waiting for memory. An average of ~8 cycles implies that the data was found in the L2 cache. An average of ~40 cycles implies that the data was found in the L3 cache. An average of 150+ cycles is typical for data found in main memory. An average of 300+ cycles may occur for data found on a remote NUMA node (on Threadripper or EPYC systems). A small sketch after this list shows one way to interpret these numbers.
  • DTLB -- IBS has a number of statistics to track the Data Translation Lookaside Buffer, the acceleration structure within the core that implements virtual memory on modern OSes. In theory, DTLB issues may bottleneck your code, especially if you have large, gigabyte-scale data structures in RAM that need to be traversed. I haven't run across any in my use cases, but I can imagine other programmers running into this issue. AMD provides statistics for the L1 and L2 DTLBs, as well as page-walker statistics.
  • IBS Fetch Samples -- IBS Fetch is a different counter than IBS Op samples. For programmers who deal with many MBs of execution space, the instruction fetch itself may become the bottleneck. Fetches on AMD Zen occur on 64-byte boundaries (the L1 cache line size).
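To make the latency numbers above concrete, here is a small, purely illustrative C++ helper; the threshold cutoffs are rough midpoints I chose around the figures in the list, not official AMD numbers:

C++
// Purely illustrative: bucket an average "IBS data cache miss load latency"
// value using the approximate figures above. The cutoffs are rough midpoints
// picked for the sketch, not official AMD numbers.
#include <cstdio>

const char* classifyLoadLatency(double avgCycles)
{
    if (avgCycles < 20.0)  return "data likely served from the L2 cache (~8 cycles)";
    if (avgCycles < 100.0) return "data likely served from the L3 cache (~40 cycles)";
    if (avgCycles < 300.0) return "data likely coming from local main memory (150+ cycles)";
    return "data likely coming from a remote NUMA node (300+ cycles)";
}

int main()
{
    double avgCycles = 42.0;   // example value read off the uProf report
    std::printf("%.1f cycles: %s\n", avgCycles, classifyLoadLatency(avgCycles));
    return 0;
}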

The AMD uProf profiler collates these statistics together and groups them by function-call.

Common Bottlenecks and IBS Metrics to Detect Them

  • Memory Bottleneck -- Watch the IBS Data Cache Miss Load Latency statistic. If your instructions are stalling for 150 cycles or more, your core is waiting for RAM. You'll also notice large Completed-to-Retirement Cycles on instructions that completed "out of order" while waiting for the RAM, while Completed-to-Retirement Cycles will be small for the instructions that were actually stalled on the RAM itself. Try to rewrite your code to do more things while waiting on RAM, for example by performing "harder calculations". Traversing a linked list or tree will always be a RAM-heavy operation, but use the free cycles of the core to calculate an aggregate statistic during the traversal (see the sketch after this list). You have free CPU cycles while waiting for RAM, so you might as well use them. In extreme cases, compression algorithms can be run on the data structures themselves to improve performance.
  • Branch Mispredictions -- With OOP being common today, there will be a lot of objects, indirect branches, and virtual function calls. In some cases, branch mispredictions themselves will become the bottleneck (a famous StackOverflow question about branch prediction covers a classic example). Keep a careful eye on branch-misprediction statistics in tightly written code; it can make a big difference in performance!
  • Executing Outside of Cache -- A lot of benchmark code fits inside of L1 cache, or even the uOp cache. However, in practical programs, a lot of code has to be traversed. Not all loops fit inside of the cache. Keep an eye on IBS fetch latency: if it's taking many, many cycles to fetch instructions, you'll want to shrink your code down (maybe optimize for size, instead of speed) to get back to fast execution.
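As a small, hypothetical illustration of the first point above: rather than walking a linked list once and then making a second pass to compute a statistic, fold the aggregate into the traversal itself, since the core is mostly idle while it waits for each node to arrive from RAM:

C++
// The pointer-chasing loads dominate this loop; the running total rides along
// in cycles the core would otherwise spend waiting on memory.
struct Node { int value; Node* next; };

long long sumDuringTraversal(const Node* head)
{
    long long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;   // cheap work overlapped with the memory stalls
    return total;
}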

Questions...

Unfortunately, my knowledge of IBS Profiling stops here. I personally still have many questions with regards to IBS.

  • Is it possible to get a measurement of execution-unit utilization through IBS statistics? PMC counters can count FLOPs, for example, which is useful for figuring out whether the instructions are taking advantage of instruction-level parallelism.
  • The PMC statistic "Instructions-per-clock" is immediately easy to understand and useful. I have so far been unable to generate an IBS equivalent. IBS "average tagged-to-retirement cycles" is close, but because execution units execute out-of-order and in parallel, this IBS-metric only measures the latency of instructions... not the throughput of the core.
  • Is it possible to detect long dependency chains in the assembly code through IBS statistics? If so, it would provide a way to measure loops which would benefit strongly from loop unrolling.

For these cases, I am still running the classical PMC counters to answer these questions. The instruction skew is a real problem when using PMCs, but they seem to be the only real way to gather these metrics on AMD CPUs.

Pitfalls and Tradeoffs

The primary tradeoff is the sampling rate for IBS. The more samples collected, the better your statistics. However, every sample you collect is an interrupt: the CPU halts, switches to the profiler, the profiler writes down a bunch of information, and then the code finally continues to run. As such, setting the sampling rate too high hurts the performance of the program. The reverse, setting the sampling rate too low, reduces the number of samples you gather, making it harder to draw solid conclusions about the performance of your code.

Enabling more counters, such as IBS Fetch, Cycles Not in Halt (a PMC counter), #Instructions (also a PMC counter), and IBS Ops, will also cause these interrupts to run more often, further hampering the speed of your program. Decrease the sample rate if you hope to collect many different counters from the profiler.

I highly suggest learning how to use other profilers and timers. Windows has "QueryPerformanceCounter", which typically ticks at around 3 MHz. At the assembly level, the x86 / AMD64 instruction sets include the "rdtsc" and "rdtscp" instructions. "rdtscp" is especially useful in that it waits for all prior instructions to finish before reading the clock-cycle counter, giving a more consistent reading. "rdtsc" ticks at the base clock rate of the computer; modern CPUs no longer change the timestamp counter during turbo.

In any case, use these alternative timers to generate rough timing estimates within your code, and compare those estimates with and without profiling enabled. Make sure that whatever statistics you are gathering match up with your baseline.
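For example, a rough rdtscp-based timing harness might look like this (a sketch for GCC or Clang on x86-64; MSVC exposes the same __rdtscp intrinsic through <intrin.h>):

C++
// Rough timing sketch using rdtscp. The loop body is a placeholder for the
// region you actually want to estimate.
#include <x86intrin.h>   // GCC/Clang; use <intrin.h> on MSVC
#include <cstdio>

int main()
{
    unsigned int aux = 0;
    unsigned long long start = __rdtscp(&aux);  // waits for prior instructions to finish

    volatile double sink = 0.0;                 // volatile so the loop isn't optimized away
    for (int i = 0; i < 1000000; ++i)
        sink = sink + i * 0.5;

    unsigned long long end = __rdtscp(&aux);
    std::printf("elapsed: %llu reference cycles\n", end - start);
    return 0;
}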

Conclusion

While I've only begun to use IBS sampling recently, it is clear that IBS gives great latency metrics. These latency metrics can easily provide insight into the severity of memory problems (L2, L3, or main-memory stalls) or branch-prediction problems throughout the code.

However, I haven't been able to figure out how to use IBS samples to glean insight into the classical throughput-based metrics: IPC (instructions per clock), GFLOPs, or GB/s of RAM bandwidth. For now, IBS feels like a useful, but incomplete, tool for collecting certain program information.

Overall, it seems like the best path for profiling on AMD CPUs is to first use the classical PMC counters to gather IPC, GFLOPs, and RAM metrics across the program and determine where the bottlenecks generally are. Afterwards, a second profiling run with IBS can provide a more accurate, assembly-level view of the latency characteristics of those already-identified bottlenecks.

History

  • Version 1.0.0.0 -- Initial publication

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

