For one of my clients I’m doing performance assessments on a quite regular basis. For adcloud, performance is a primary key to success. To give you a little background: adcloud’s adserver system is serving thousands of requests per second — only with a couple of application servers. And since customers don’t like slow loading ads on their pages, latency besides raw throughput is a fundamental criteria defining the service’s quality.
We basically run extensive load tests with different scenarios all the time to ensure that a new feature or even a small improvement does not compromise the stability and quality of the system — especially under load. Functional correctness is covered by unit and integration tests, but non-functional aspects like throughput and latency are a bit harder to measure and to deliver. How exactly the setup looks like, which I created to help adcloud to run automated, distributed load tests, will be the topic of another post. For this article I’d like to outline an idea I have had in mind for quite some time now and which has been proven to work at least once.
Please note that I cannot present you real numbers here as they are confidential. I hope that the concept behind the idea is still comprehensible.
The Problem: Periodic Effects
A few weeks ago, while performance testing a rather important and big release, we tried to figure out why sometimes the average response times (and the variation too) were looking okay and sometimes not. I had a suspicion that we could have a semi-periodic variance in the average response times, due to tasks running in the background. By just looking at the data and plotted graphs (see next next image), it was very hard to see.
So how can you verify that you have other (background) processes, scheduled activities or other external factors that influence your performance?
A Possible Solution: Autocorrelation
It turns out, that this problem was the topic of a lecture at the university. I remembered a lecture about the “scientific basics of audio-visual media”, where our professor explained a quite simple, yet powerful tool to uncover periodicities in semi-periodic signals: Autocorrelation. Okay, we don’t have a signal in the sense of audio-visual media, but we could think of the response time as a function over time, which would make it kind of like a signal.
Autocorrelation (short ACF, autocorrelation function) is a cross-correlation of a signal with itself. By correlating a signal with itself, repetitive patterns will stand out and make it much easier to see. The (discrete) autocorrelation of a signal
x is defined by the following simple equation.
The entire signal
x is shifted by an offset
j and then multiplied by the original signal. This is repeated for every sample in the discrete signal.
R is basically the energy of the signal and therefore the maximum of the ACF.
R is the correlation of the signal with itself shifted by one sample — you get the idea. If the signal has a significant enough self-similarity, the ACF will show this relation. In the case of a suspected periodicity the signal will repeat itself after each period. All we have to do now is run the autocorrelation and check for maxima in the result. If the maxima are within a certain delta of your suspected periodicity, you were right!
So far, so good. I started to implement a discrete autocorrelation function in Ruby to give it a try.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
autocorrelate will cross-correlate the given
signal with itself, returning a new signal. I am using a sliding window here to optimize the process a bit (see below how to choose it properly). Also note that I patched the internal array’s element accessor
 in order to return
0 instead of
nil for indices that are not set. To ease up further analysis you can normalize the signal. This is easy since the 0th shift is the maxima of the signal.
1 2 3
Now, all you have to do, is read your signal with something like
signal = load_signal("data/requests.txt") (signal needs to be an array of floats) and dump it into the method shown above.
How to choose a proper size for the sliding window?
window_size has to be big enough to catch your suspected periodicity. Choosing it too big just slows down the process and does not increase the accuracy. In our case I chose
window_size to be ~300 samples wide, since I had the suspicion, that one of our periodic executed background tasks could be responsible for the variation. This task runs every 30 minutes, and since tsung generates one sample every 10 seconds, I was on the safe side using 300 samples for the
window_size in this case.
Initially I had some problems getting my peak-picker working properly, so I decided to just plot the results using gnuplot. The following image shows the plotted autocorrelated and normalized values of the initial signal,
R[j]. The x-axis represents the used time-based shift. Keep in mind that tsung takes 1 sample every 10 seconds.
As you can clearly see, we have a peak at about every 180 samples, indicating that we have a base frequency of 1/1800 seconds, or 1/30 minutes. And 30 minutes looks pretty much exactly like one of our background tasks. The process runs for a few minutes, which is the reason the result is not a perfectly sharp spike. In the end, we optimized the task to have significantly less impact on the system’s latency and shipped the feature — end of story :)
PS: adcloud is always looking for skilled developers. Feel free to contact me if you want to get in touch and make sure to visit their dev site.