On knowing what you're measuring - Thoughts of a meat popsicle
14th February 2013
09:29 pm


So someone on reddit read this blog entry on why you shouldn't use linked lists for number crunching, and was very concerned with "locality of reference".

It smelled fishy to me, mostly based on what exactly the blog author was measuring, so I spent some time last night and tonight experimenting. I found that a large cost was the time spent in malloc. Admittedly, I was measuring an O(n) operation, and I don't have pretty graphs because I'm lazy and don't blog much. I didn't make any effort to keep the items in any particular order, but I think they were always either forward or backwards through memory (a page at a time). For that experiment I used 1e6 elements.

Then I realized I hadn't properly tested the locality theory. I wanted to measure how long it took to iterate over the elements, either as a list or as a vector. To eliminate any other difference, I used an array of
struct ele {
        long link;
        long val;
};

For the list, I had a[i].link = i + 1, and the last element had -1 in the link. When iterating as a vector I just ignored the link field. So the only difference between the two iterations was in how it computed the next element; either by reading the index from the current entry, or by adding one to the index. The disassembly is pretty close on each:
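The C source isn't shown in the post, but from the description and the disassembly below it was presumably something close to this (a reconstruction; the function names match the labels in the disassembly):

```c
#include <assert.h>

struct ele {
        long link;      /* index of the next element, or -1 at the end */
        long val;
};

/* Follow the link field from element `start` until a negative link. */
long sum_array_by_list(struct ele *a, long start)
{
        long sum = 0;
        for (long i = start; i >= 0; i = a[i].link)
                sum += a[i].val;
        return sum;
}

/* Ignore the links entirely and walk the indices 0..n-1. */
long sum_array_by_index(struct ele *a, long n)
{
        long sum = 0;
        for (long i = 0; i < n; i++)
                sum += a[i].val;
        return sum;
}
```

Both functions touch exactly the same memory in the same order; only the "find the next element" step differs.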
 sum_array_by_list:                           sum_array_by_index:
        pushq   %rbp                                 pushq   %rbp
        movq    %rsp,%rbp                            movq    %rsp,%rbp
        movq    %rdi,0xe8(%rbp)                      movq    %rdi,0xe8(%rbp)
        movq    %rsi,0xe0(%rbp)                      movq    %rsi,0xe0(%rbp)
        movq    $0x00000000,0xf8(%rbp)               movq    $0x00000000,0xf0(%rbp)
                                                     movq    $0x00000000,0xf8(%rbp)
        jmp     end                                  jmp     end
cont:   movq    0xe0(%rbp),%rax               cont:  movq    0xf8(%rbp),%rax
        shlq    $0x04,%rax                           shlq    $0x04,%rax
        addq    0xe8(%rbp),%rax                      addq    0xe8(%rbp),%rax
        movq    0x08(%rax),%rax                      movq    0x08(%rax),%rax
        addq    %rax,0xf8(%rbp)                      addq    %rax,0xf0(%rbp)
                                                     incq    0xf8(%rbp)
        movq    0xe0(%rbp),%rax                      movq    0xf8(%rbp),%rax
        shlq    $0x04,%rax
        addq    0xe8(%rbp),%rax
        movq    (%rax),%rax
        movq    %rax,0xe0(%rbp)
end:    cmpq    $0x00,0xe0(%rbp)                     cmpq    0xe0(%rbp),%rax
        jns     cont                                 jl      cont
        movq    0xf8(%rbp),%rax                      movq    0xf0(%rbp),%rax
        leave                                        leave
        ret                                          ret

Both algorithms are O(n). One does an extra memory read to get the next index; the other does an increment. I had a theory that by-index would be faster due to how the Intel hardware does things: it has some specialized instructions (like loop, which decrements %ecx and branches while it's nonzero) for iterating until a count reaches zero. gcc didn't use those, though, and both versions end with about the same test to see if they're done. Now for the money shot:
Sum 10000000 elements (index) in 50013 microseconds
Sum 10000000 elements (list)  in 73902 microseconds
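The post doesn't show how the times were taken; a plausible harness (my sketch, using POSIX clock_gettime(CLOCK_MONOTONIC) — the original may have used something else) looks like:

```c
#include <stdio.h>
#include <time.h>

/* Elapsed wall-clock microseconds between two timespec samples. */
long elapsed_usec(struct timespec t0, struct timespec t1)
{
        return (t1.tv_sec - t0.tv_sec) * 1000000L
             + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}

/* Usage sketch:
 *      struct timespec t0, t1;
 *      clock_gettime(CLOCK_MONOTONIC, &t0);
 *      long sum = sum_array_by_list(a, 0);
 *      clock_gettime(CLOCK_MONOTONIC, &t1);
 *      printf("Sum %ld elements (list) in %ld microseconds\n",
 *             n, elapsed_usec(t0, t1));
 */
```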

So the difference here in run time is significant, and mostly due to having to run more instructions (9 versus 11 in the main loop). To test the locality-of-reference theory, I then randomized the links (this has to be done carefully to ensure the traversal still visits every element). Now the linked list run time is:
Sum 10000000 elements (list) in 1098717 microseconds
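For reference, one careful way to do that randomization (a sketch, not necessarily what I did): shuffle a visit order with Fisher-Yates, then chain the links in that order, so a traversal starting at order[0] still visits every element exactly once.

```c
#include <assert.h>
#include <stdlib.h>

struct ele { long link; long val; };

/* Build a random visit order with Fisher-Yates, then chain the links
 * in that order so the traversal still touches every element once. */
void randomize_links(struct ele *a, long *order, long n)
{
        for (long i = 0; i < n; i++)
                order[i] = i;
        for (long i = n - 1; i > 0; i--) {
                long j = rand() % (i + 1);
                long tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (long i = 0; i < n - 1; i++)
                a[order[i]].link = order[i + 1];
        a[order[n - 1]].link = -1;      /* terminate the chain */
}
```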

Well, that is a big difference. I don't yet have enough data to say why, though. Am I constantly missing in L1 cache now? This can be measured by keeping all the elements that fit in a cache line together while randomizing the order in which I hit each cache line. Doing this (assuming a cache line size of 128 bytes), I get:
Sum 10000000 elements (list) in 214726 microseconds

That's much better already, but still kinda bad, so L1 cache isn't the only issue. (Note the whole data set is 160MB.) So I tried some other contiguous sizes:

Contiguous bytes    Time (usec) for 1e7 element iteration
   16               1112199
   32                695746
   64                362445
  128                214169
  256                175248
  512                126300
 1024                 98137
 2048                 83341
 4096                 75693
 8192                 75257
16384                 74682
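The block-contiguous variants above can be generated by shuffling at block granularity while keeping the elements inside each block sequential. Roughly (a sketch; it assumes the block size k divides the element count n evenly):

```c
#include <assert.h>
#include <stdlib.h>

struct ele { long link; long val; };

/* Randomize the order of k-element blocks; within a block the links
 * stay sequential, so each block is walked front to back. */
void randomize_blocks(struct ele *a, long *blk, long n, long k)
{
        long nblk = n / k;
        for (long b = 0; b < nblk; b++)
                blk[b] = b;
        for (long b = nblk - 1; b > 0; b--) {   /* Fisher-Yates on blocks */
                long j = rand() % (b + 1);
                long tmp = blk[b]; blk[b] = blk[j]; blk[j] = tmp;
        }
        for (long b = 0; b < nblk; b++) {
                long base = blk[b] * k;
                for (long i = 0; i < k - 1; i++)
                        a[base + i].link = base + i + 1;
                /* last element of this block points at the next block */
                a[base + k - 1].link = (b < nblk - 1) ? blk[b + 1] * k : -1;
        }
}
```

With sizeof(struct ele) == 16, a block of k = 8 elements corresponds to the 128-contiguous-bytes row of the table.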

So that's interesting. Once the memory is accessed a page at a time (4096 contiguous bytes), performance is almost back up to maximum, which suggests the TLB is the big difference here. The L1 cache (64-byte cache line) clearly also makes some difference, based on how the speed changes with the number of contiguous bytes accessed: the time stops roughly halving with each doubling once I crossed 64 bytes.

Comments:

Date: 15th February 2013 11:03 pm (UTC)
That was geeky fun to read! Thanks for writing it. :)
From: H. Mijail Antón Quiles
Date: 24th July 2016 08:19 am (UTC)

Interesting, but...

That was an interesting experiment, thanks for reporting. The idea of testing the linked list traversal when the elements have been allocated contiguously, for different values of "contiguously", is great! :)

However, some questions come to mind:

How did you randomize the pointers? Did you make sure that, for example, for the "16 contiguous bytes" case every element access invalidated the cache line?

You only mention L1 and then jump to TLB. What about L2, L3? (which gets compounded with the first question) Did you check your ideas with Valgrind's Cachegrind/Callgrind for example, to see what it says on cache misses?

For extreme nitpicking, the count of instructions in the loop is a suspicious measurement. Are you sure that the pipelines are taking as many cycles in each case? Also, I would expect that the memory subsystem can prefetch more efficiently in the array case, given that the accesses are sequential and therefore predictable.

In general, it would be interesting to know the process, code and compilation options you used. (Maybe being really specific with the compilation flags (-O, -march, etc) the CPU's capabilities will be better exploited? What CPU was this, anyway?)

I'm asking because at the end of the day, with the array one is taking advantage of all the tricks that the hardware has developed to improve memory bandwidth, while with the linked list one is rather working *against* those tricks. The performance of the array should be rather tightly specified and close to the theoretical limit for the hardware, while the linked list's performance can vary a lot as you have shown - so it makes sense to see how bad it can get when the hardware is precluded from opportunistically jumping in to help.

And lastly, while it was a nice touch to use an element of the same size for both the array and the linked list, this is actually very unfair to the array, since it's moving around a useless 8-byte value just to be comparable to the linked list! So at the end of the day the array is only using half of the cache and memory bandwidth it could use.