With particular relevance to Dynamic PGO
Andy Ayers February 2023
Dynamic PGO works by instrumenting early versions of methods (aka Tier0 codegen) to produce a profile data set that the JIT can use to optimize subsequent versions of those methods (aka Tier1 codegen).
In this note we focus on just one aspect of this instrumentation: counter based instrumentation. With PGO enabled, the JIT adds code to each Tier0 method to count how often the parts of the methods execute so that later on in Tier1 it can focus optimization efforts on the parts of the method that seem to be most important for performance. More details on Dynamic PGO can be found here.
From the outset, Dynamic PGO has used fairly simplistic methods of counting—for each distinct counter, the code the JIT adds will simply increment a shared memory location. Let's call that aspect of counting the counter implementation and this particular way of counting racing.
(A related aspect of counting, not covered here, is that the JIT tries to be efficient with counter placement, relying on an approach first pioneered by Ball to try and reduce the total number of counters needed to a minimum, and to place them in the less often executed parts of the method.)
Recently we made two surprising observations about the racing implementation:
- We started measuring the runtime cost of counting in heavily multithreaded applications by forcing them to run only the Tier0 instrumented code. We saw that compared to uninstrumented Tier0 code, instrumented code was slower by a factor of 2 or 3. Some further experimentation revealed that cache contention (both true and false sharing) was a major contributor to the high instrumentation overhead.
- We started looking at how accurate the counts were in the Tier1 compilation and discovered widespread inaccuracies. The major contributor here was lost counter updates because of the unsynchronized access across many threads of execution (hence the name racing).
This was doubly bad news: not only were we paying a lot at runtime for our racing counter implementation, but it was also doing a pretty bad job of counting.
It seemed like we should be able to do better, and indeed we can. The rest of this note explores this problem space in more detail.
The obvious fix for the lost counter updates is to stop racing and start synchronizing the updates. Our various platforms provide nice atomic counter updates in the form of InterlockedIncrement
and similar, and the JIT can emit the proper machine code forms (say lock inc[mem]
) that lay at the heart of these in place of the unsynchronized (inc [mem]
) racing variant.
So let's call this new version the interlocked implementation scheme for a counter. It is simple to update the JIT to emit this variant and make various measurements. Interlocked largely fixes the accuracy problem we'd been seeing with racing counters, but creates even more runtime overhead. So while interlocked might serve as a component of a solution, it was not on its own a solution.
The counts gathered by Dynamic PGO have very wide dynamic range, and even when we are checking accuracy we are usually interested in relatively self consistent counts. For example, if we have a simple if / then /else
construct we would expect the count for the if
block to equal sum of the counts for the then
and else
portions, and given a flow graph one can formulate this sort of conservation law more broadly (a good analogy here are Kirchhoff's laws for current flowing through a circuit). But there are some variances that we must tolerate: the method could throw an exception or the thread could be asynchronously stopped. So we typically are happy if the profile flows are accurate at each conservation point to say 1%, and a diminishing return on having results more accurate than this.
Given that we already need to tolerate some inaccuracy, the idea came about to see if we could leverage that to produce an intentionally approximate counting scheme that was nearly as accurate as interlocked but with less overhead.
A bit of research turned up an interesting paper by Dice, Lev, and Moir: Scalable Statistics Counters which was concerned with a similar problem. Here the key insight is that we can leverage randomization to count probabilistically: once the counter value is sufficiently large, instead of always trying to update its value by 1 for each increment, we update it by
The approach in the paper relies either on using standard floating-point representations for counts or a "software float" representation where part of the counter storage is an exponent and the remainder a mantissa. Those representations were not well suited for our work, so we set about to find a similar scheme that would work for regular integer data types.
The key benefit of the exponent/mantissa form is that it makes it relatively simple to estimate the magnitude of the counter—the exponent is the integer part of the
Here is a C# implementation of our counter:
static void ScalableProfileCount(ref uint counter)
{
uint count = counter;
uint delta = 1;
if (count > 0)
{
int logCount = 31 - (int) uint.LeadingZeroCount(count);
if (logCount >= 13)
{
delta = 1u << (logCount - 12);
uint rand = random();
bool update = (rand & (delta - 1)) == 0;
if (!update)
{
return;
}
}
}
Interlocked.Add(ref counter, delta);
}
Here random
is a fast source of uint
sized random numbers, ideally obtainable without synchronization (so say from a thread_local
producer).
When the count value is small (less than
Let's call this implementation scalable.
A value of xorshift
RNG we've historically used for PGO empirically gives the scalable count an overall
We first worked up a C# simulation to compare the three approaches introduced above: racing, interlocked, and scalable. We've already mentioned some of these results in text above.
As noted, interlocked is perfectly accurate. If we compare racing and scalable to interlocked we see results similar to the following:
Here the X axis is the intended final value of the count, where counting is done by #processors (here 12) threads each doing nothing more than incrementing the counter, and the Y axis shows the mean counter value along with max and min values, scaled by the total expected count.
As expected, Interlocked is just a horizontal line at Y=1.0.
Racing is losing counts even with a small final count of 12, and the loss percentage gets worse and worse the higher we try to count, ending up with at best 20% of the final count value.
Scalable is perfectly accurate up to 8192, then starts to deviate somewhat, but the total relative deviation remains constrained.
If we remove Racing from the picture and zoom in, we see that the error is indeed 2% for the 5/95 spread, and 3% for worst case, and that on average the count is very accurate.
We can also adjust the accuracy by changing when we switch to probabilistic counting; the higher the switchover point, the more accurate the results are overall. In the below we vary the switchover point from
Benchmarking the same set of computations with, we end up with the following (here the switchover is at 8192):
So scalable and racing have similar costs, while interlocked becomes quite a bit more costly at high counts.
- Quite likely the benchmark has too much fixed overhead and could be improved, and if so we might see distinctions even for lower total counts.
- It was critical for scalable to use per-thread RNGs and to seed them each differently.
Mathematically we can model the scalable update distribution as a binomial distribution. If we make
So if we start probabilistically incrementing by
The variance in the actual number of updates is
As we count higher the standard deviation is limited by
If (via the benchmark) we look at how tunable the scalability is, we see that the higher the threshold for switching to probabilistic counting, the higher the cost (but of course the better the accuracy):
We next implemented this same technique within the runtime and JIT, encoding the logic above as a new helper call.
Trying this out on some Tech Empower benchmarks where we force the benchmark to execute Tier0 instrumented code, we see significant improvements in server metrics (here on mvc/plaintext, windows, citrine, switchover at 8192):
Metric (Tier0 instr) | Racing | Scalable | |
---|---|---|---|
First Request (ms) | 155 | 155 | +0.22% |
Requests/sec | 223,665 | 352,592 | +57.64% |
Requests | 3,372,373 | 5,308,812 | +57.42% |
Mean latency (ms) | 10.60 | 7.83 | -26.08% |
Max latency (ms) | 237.25 | 279.12 | +17.65% |
Read throughput (MB/s) | 34.77 | 54.81 | +57.64% |
Latency 50th (ms) | 10.03 | 6.68 | -33.42% |
Latency 75th (ms) | 14.69 | 9.70 | -33.98% |
Latency 90th (ms) | 17.65 | 11.76 | -33.38% |
Latency 99th (ms) | 29.59 | 0.00 |
And in normal processing, the extra accuracy from Scalable seems to provide some small benefits to the Tier1 code as well:
Metric (Tiered PGO) | Racing | Scalable | |
---|---|---|---|
First Request (ms) | 90 | 89 | -1.12% |
Requests/sec | 3,083,453 | 3,187,157 | +3.36% |
Requests | 46,558,728 | 48,125,121 | +3.36% |
Mean latency (ms) | 0.97 | 0.92 | -4.61% |
Max latency (ms) | 98.20 | 109.77 | +11.77% |
Read throughput (MB/s) | 479.32 | 495.44 | +3.36% |
Latency 50th (ms) | 0.76 | 0.74 | -2.76% |
Latency 75th (ms) | 1.08 | 1.05 | -2.77% |
Latency 90th (ms) | 1.30 | 1.28 | -2.05% |
Latency 99th (ms) | 0.00 | 0.00 |
We haven't yet explored the impact across a wider set of benchmarks or tried varying the switch-over point.
We have presented a technique to implement approximately correct counters with relatively low overhead and good scaling properties. The counter accuracy can be tuned by trading off some of the scalability benefit. The counting process uses minimal runtime state: one RNG per physical thread, plus one storage location per counter. The value of the counter is readily available without any post-processing.
This sort of counter seems well-suited for use in our Dynamic PGO instrumentation.
It may be that approximate counting will be useful in other application areas where scalability is a concern and exact counts aren't needed.
The Dice et al. paper suggests that at high count values one might retain more accuracy by capping the granularity of updates. That is, once the likelihood of an update reaches say
It might also be possible to continue with increment by one counting so long as there is no contention for the counter, so that a single-threaded application gets more precision. One could for example replace the Interlocked.Add
with a compare exchange, and only switch over to probabilistic counting when the exchange fails.
Another approach not considered here is to shard the entire counter array into per-thread counters. Aside from a potentially large size increase, a sharded counter requires more work to query the counter's current value, as one must sum across the shards.
Our scalable scheme still suffers from false sharing; this seems more difficult to fix.
The paper by Dice et al. refers to a seminal paper by Morris, Counting large numbers of events in small registers where similar techniques are used to reduce the size of each counter. This might prove interesting in other contexts, for example compressing the size of some of our other DynamicPGO data, like the dynamic class profile histograms.