Real measurement is hard
By joe
- 11 minutes read - 2261 words

I had hinted at this last week, so I figured I had better finish working on it and get it posted already. The previous bit, on language choice as a wake-up, was about the cost of Foreign Function Interfaces, and how well they were implemented. For many years I had honestly not looked as closely at Python as I should have. I’ve done some work in it, but Perl has been my go-to language. For me, the brevity of the interface and the ease of use of the FFI in Python were what made me rethink some things.

I look at languages as tools to an end: a way to implement an algorithm, which will not always be expressed in the same language. I’ve been doing one manner or another of FFI (not called that back then) since the mid-to-late 1980s. Usually Fortran and C, but as often as not Assembler, Basic, Fortran, C, etc. One of the first codes I wrote for hire, in 1984, was a terminate-and-stay-resident driver for an experiment control card. The main code was in Basic (I kid you not), and I exposed interrupt service routines to handle control functions. It was a fun thing to do, and later the experiment became famous. Regardless, the ability to understand where code was spending its time, and what it was doing, became very tightly ingrained in me from that experience. Not that Basic was fast … it wasn’t. I had to do all the hard/fast stuff in assembler. If you didn’t understand your code, chances are your measurements of the code, and your ability to effect changes, would be … limited.

I look at rapid development languages (Perl, Python, etc.) as high-level mechanisms to bridge a variety of “fast” lower-level bits (libraries, etc.) together. These tools often have good internal support for things that I frankly don’t want to waste my time implementing.
In the case below, hashing and hashing algorithms are “under test.” I’ll explain the scare quotes in a second. First off, being a recovering physicist means that I sweat the details of measurements. I can’t say I always get them right; I don’t. I do, however, question what I am measuring, and check whether what I am measuring actually reflects what I think I am measuring. Moreover, I try to do a null hypothesis test, which is another basic sanity check. If you measure something and happily report the data without bothering to understand the things that go into your data, how on earth can you draw any conclusions about the thing you are “measuring”? Hence the scare quotes.

Last week, though I didn’t report it, I profiled all of the codes I ran to see where they spent their time. And that’s where I saw the very inefficient … nay … terrible FFI implementation hooks for GMP in Perl. Had I the time, I’d try re-implementing them using FFI::Platypus. But the point is, I don’t have time to re-implement this wheel, and GMP isn’t as important to me as other things. It does leave me wondering what else is messed up.

One of the test cases I’d done (and I should really issue a pull request for this already) is a re-implementation of the Julia benchmarks in Perl. They are quite fast in Perl when I leverage the power of the language: its internals, and the PDL system. Quite fast. But I noticed something: I had agreement on the final values for all tests save one, the Mandelbrot test. I’d finished the port months ago, but was not able to get it to report the correct numbers. The code is simply too simple to be wrong. Yet I get wrong numbers out of it. Combine this with the generalized awkwardness of the PDL (seriously, if this could be fixed, Perl would be a killer data language), and I was already starting to question things.

But the test case in question below is about hashing. Perl has had hashes (aka “associative arrays”) for a very long time, and it has efficient/fast algorithms in place for its hashing work. And this is where we get to the post I had been writing. One of the great rules in measuring things is: understand what it is you are measuring. Part of this is making yourself aware of what you are measuring versus what you think you are measuring, because the two may be very … very different.

I don’t remember how or why I happened across it, but I did find this post on “Hash read performance (perl, ruby, python, node.js)” from 3 years ago. In it, the author replicates a study referenced in the book “Ruby Under A Microscope”, doing what is effectively a cut and paste of the Ruby author’s code. From this code, he generates Perl, Python, and node.js versions (very simple) to test. The problem comes later, at the end of the analysis. He plots all his measurements, and a number of things leapt out at me, not the least of which is that he didn’t measure what he thought he was measuring. Looking at experiment 1 in the book, it is pretty obvious that the post author simply channeled the book author. The (Ruby) code in question is this:
require 'benchmark'   # needed for Benchmark.bm; not shown in the original excerpt

# setup hash
# (exponent and target_key come from the surrounding script;
#  target_key is a randomly chosen key of the hash)
size = 2**exponent
hash = {}
(1..size).each do |n|
  index = rand
  hash[index] = rand
end

# "benchmark"
Benchmark.bm do |bench|
  bench.report("retrieving an element from a hash with #{size} elements 10000 times") do
    10000.times do
      val = hash[target_key]
    end
  end
end
Basically, you create a random hash, pick a random “target_key”, and then read it N times. Well, no, that’s not actually what’s happening. You execute a loop N times, and the loop body reads the hash once. If you have an optimizer of some sort which notices that val is unused, the optimizer could make the whole loop body go away, as a hash access should not have a side effect. Which makes the loop transit very fast indeed. One might think it would look … I dunno … constant in time? Or as nearly constant as the time resolution would allow, modulo other things you can’t control (OS/machine jitter). Which, if you look at the chart in the book, looks … suspiciously like … what they have.

There is a simple way to fix that. Set val = 0.0 before the loop. Change the loop assignment to val += … . Then print val at the end of the loop (or val/N_iterations). This will subvert most optimizers from making the loop go away. (A minimal sketch of this fix, in Perl, follows the sample output below.) Moreover, as the hashes are random, it is quite possible you will hit near-best and near-worst case scenarios. Measurements of real things should be repeatable (this is an understatement). There is an entire branch of statistics devoted to experiment setup and analysis, and in none of it do they say “run once and report values”. In almost all of it, they say “run many times, report averages for specific coordinates, along with standard deviations”, the latter assuming that in the limit of an infinite number of tests the errors/jitter/etc. are distributed normally. You can even do a poor man’s version of this: just run it 3 times, average the values, and compute a very rough SD. It’s not hard, and it will at least give you a rough idea of the size of your measurement signal (the average) and of the measurement noise (the standard deviation). Even worse, if your language/system is helpful and it memoizes the expensive hash lookups (ok, this is replacing one hash with another, but likely a faster-to-compute hash, and the short circuit in memoization with the faster hash might … I dunno … impact what you think you are going to measure?).

But that’s not the most egregious problem. None of the aforementioned are. Assume that the optimization and memoization don’t take place. A very reasonable question, actually the null hypothesis, is to measure the loop body timing without the interior hash read. Then you subtract this (the average over many runs, with its associated SD) from the loop with the hash read (assuming it’s not optimized away). This is what tells you what the hash lookup time actually is, after removing the impact of the loop body. And still worse, since they don’t scale up the number of values read per iteration (to select more values out of the hash rather than a fixed one), you don’t actually get a feel for how selecting 1, 10, 100, … keys would scale. And even worse than that, since they use the same exact hash key each and every single time during the test, never mind the obvious optimization aspects, they don’t really interact with the language’s hash system, just one page in memory.

Wow. This is all kinds of terrible. For the Perl code, I pulled it down and ran it on my machine. It generated output like this:
landman@lightning:~$ ./bm1.pl
retrieving an element from a hash with 2 elements 10000 times tooks 1.16 ms
retrieving an element from a hash with 4 elements 10000 times tooks 1.26 ms
...
retrieving an element from a hash with 524288 elements 10000 times tooks 1.31 ms
retrieving an element from a hash with 1048576 elements 10000 times tooks 1.27 ms
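As an aside, here is roughly what those fixes look like in practice. This is a minimal sketch of my own, not the original script and not the rewrite further below; the names ($target_key, $runs, the 1024-entry hash) and counts are illustrative only. It accumulates and prints the looked-up values so the loop body cannot be optimized away, times an empty loop of the same trip count so that overhead can be subtracted out, and repeats the whole measurement a few times to get a mean and a rough standard deviation.
#!/usr/bin/env perl
# Minimal sketch (mine, not the benchmark under discussion) of the fixes above:
# accumulate and print the looked-up values, subtract an empty loop of the same
# trip count, and repeat the measurement to get a mean and a rough SD.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $iterations = 10_000;
my $runs       = 5;

# build a random hash and pick one fixed key, as in the original test
my %hash;
$hash{ rand() } = rand() for 1 .. 1024;
my ($target_key) = keys %hash;

my @samples;
for my $run ( 1 .. $runs ) {

    # loop with the hash read, accumulating so the body has a visible side effect
    my $val = 0.0;
    my $t0  = [gettimeofday];
    $val += $hash{$target_key} for 1 .. $iterations;
    my $with_read = tv_interval($t0);

    # null hypothesis: the same loop without the hash read
    my $sum = 0.0;
    $t0 = [gettimeofday];
    $sum += 1.0 for 1 .. $iterations;
    my $without_read = tv_interval($t0);

    # print the accumulated values so nothing can be thrown away
    printf "run %d: val=%.6f sum=%.1f\n", $run, $val / $iterations, $sum;

    push @samples, ( $with_read - $without_read ) / $iterations;
}

# poor man's statistics: mean and rough standard deviation over the runs
my $mean = 0.0;
$mean += $_ for @samples;
$mean /= scalar @samples;
my $var = 0.0;
$var += ( $_ - $mean )**2 for @samples;
$var /= scalar @samples;
printf "hash read cost ~ %.4f +/- %.4f microseconds per lookup\n",
    1e6 * $mean, 1e6 * sqrt($var);
The same structure carries over directly to the Ruby, Python, and node.js variants of the test.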
Never mind that I didn’t run the original script N times and do stats against it; I am going to be lazy in this regard and not try to generate statistics. I am looking for a gross set of trends, to discover what they should have done. This isn’t the post author’s fault, BTW; it is definitely a problem with the book, though. It gets to the fact that real measurement is hard … well … understanding what you are measuring is hard.

I can tell you what wasn’t measured by this: hashing speed. You hit a single hash entry, again and again and again. You may be measuring cache speed. Or the quality of the optimizer. Or other things, depending upon the language. This bugged me greatly. So I rewrote the code in a way that actually allowed me to measure hashing speed, and to grab statistical data from it. Here’s the code:
#!/usr/bin/env perl
use strict;
use Benchmark qw(:hireswallclock :all);
use Data::Dumper;
use Statistics::Basic qw(:all);
use POSIX qw(ceil);

# better hash benchmarking
# concept and original implementation from here:
# http://gulden-pt.blogspot.com/2012/05/hash-read-performance.html

my $iterations = 10000;
my ($val,$size,%hash,$target_index,$index,$cr,$N,$M,$data);
my (@hashgentime,@indxs,@measurements,$rhgt);
my ($exponent,$n_indxs,$count,$per,$t0,$t1,$stats,$ti,$tf,$htt,@all_indxs,$ri);
my $fraction = 0.01; # hit at least 1% of indices. Relevant for $exponent >= 7

print "size,nhash,ave_T_hash,stddev_T_hash,ave_T_overhead,ave_T_hashgen,val\n";

foreach $exponent (7 .. 20) {
    $size = 2**$exponent;
    undef %hash;
    undef @indxs;
    undef @all_indxs;
    $n_indxs = int($fraction * $size);
    $per     = $size/$n_indxs;
    $count   = 0;
    undef @measurements;

    # setup hashes and hash index
    map {
        $index = rand;
        $hash{$index} = rand;
        push @all_indxs,$index;
    } ( 0 .. $size-1 );

    # benchmark
    undef @measurements;
    undef @hashgentime;
    $ti = Benchmark->new();
    map {
        $val = 0;
        $t0 = Benchmark->new();

        # generate a random sample of $n_indxs keys from the @all_indxs array
        undef @indxs;
        map {
            $ri = ceil(($#all_indxs + 1) * rand) - 1;
            push @indxs,$all_indxs[$ri];
        } ( 1 .. $n_indxs);
        $t1 = Benchmark->new();
        push @hashgentime,timediff($t1,$t0)/1000; # timediff reports microseconds

        # start timing for main test
        $t0 = Benchmark->new();
        map { $val += $hash{$_} } @indxs;
        $t1 = Benchmark->new();
        push @measurements,timediff($t1,$t0)/1000; # timediff reports microseconds
    } (0 .. $iterations);
    $tf = Benchmark->new();

    $htt  = 0 ; # hash test time total
    $rhgt = 0 ; # random hash index generation time
    map {$htt  += $_} @measurements;
    map {$rhgt += $_} @hashgentime;

    # now generate basic stats
    $stats->{$exponent}->{mean}     = mean(@measurements)/$iterations;
    $stats->{$exponent}->{stddev}   = stddev(@measurements)/$iterations;
    $stats->{$exponent}->{val}      = $val/$iterations; # average
    $stats->{$exponent}->{overhead} = (
                                        $htt - timediff($tf,$ti)/1000
                                      ) / $iterations;

    printf "%8i,%8i,%.3f,%.3f,%.3f,%.3f,%.3f\n",
        $size,
        $n_indxs,
        $stats->{$exponent}->{mean},
        $stats->{$exponent}->{stddev},
        $stats->{$exponent}->{overhead},
        $rhgt/($iterations),
        $stats->{$exponent}->{val}
        ;
}
This code outputs a nice simple table:
landman@lightning:~/play/hash$ ./bm1_opt.pl
size,nhash,ave_T_hash,stddev_T_hash,ave_T_overhead,ave_T_hashgen,val
     128,       1,3.761,0.159,37612.254,37615.697,0.000
     256,       2,3.752,0.156,37518.914,37522.708,0.000
     512,       5,3.758,0.163,37578.600,37582.801,0.000
    1024,      10,3.741,0.188,37405.454,37409.295,0.001
    2048,      20,3.898,0.122,38976.719,38980.788,0.001
    4096,      40,3.952,0.122,39522.012,39526.087,0.002
    8192,      81,3.996,0.120,39959.245,39963.332,0.004
   16384,     163,3.999,0.138,39989.444,39993.554,0.009
   32768,     327,4.103,0.125,41028.580,41032.820,0.016
   65536,     655,4.097,0.195,40972.423,40976.704,0.033
  131072,    1310,4.153,0.078,41532.110,41536.220,0.067
  262144,    2621,4.137,0.142,41370.365,41374.603,0.129
  524288,    5242,4.309,0.407,43094.588,43098.868,0.262
 1048576,   10485,4.982,2.084,49823.580,49827.418,0.521
Here, size is the size of the hash table; nhash is the number of random keys chosen (and these are regenerated on every iteration); ave_T_hash is the average time required to run that lookup loop (in ms); stddev_T_hash is its standard deviation; ave_T_overhead is the loop overhead (in ms); ave_T_hashgen is the time required to generate the random key sample (in ms); and val is the accumulated lookup value (averaged over the iterations), printed so that the lookup loop cannot be optimized away.

What we see is a very gently sloping line, with a nice simple y-intercept of about 3.7 ms for the hash time, rising to about 4.98 ms for 10485 lookups. I am testing 1% coverage of the size of the hash; we could easily change that. It is quite likely that we will see processor cache effects in this, especially if we make the hash larger.

While this seems to be more realistic (and I’ve not ported it to Python, Ruby, or C), I am quite concerned in general with the level of “measurement” and “analysis” I see going on. People talk about “measure everything”. But if they don’t understand what they are measuring, and if their measuring technique is broken, can you put faith in their results? This is what we have to look forward to with the IoT bits coming online: measure everything … grab huge amounts of data. But you will still need lots of scientists who know how to measure, to help you refine your signal-to-noise level. There’s nothing wrong with what the blog poster did, but there was a great deal wrong with the book.