2012/09/04

What I learned from my first c coding challenge

Last week at Vimeo we had a challenge. Build a better query string parameter sorter for a Varnish plugin. The gauntlet was thrown down at 5pm on a Friday and by 2am Friday night we had our first contender. I arrived home to see the code and decided to throw my hat into the ring. I had made some attempts at c plugins for php but never really got anything off the ground. The K&R book had been a fascination for me but now it was time to put it into practice.

While not in the Varnish module format yet, my still-evolving entry is skwurly.

The challenge:

Take an input const char* url:

/test.php?b=b&c=bb&a=3&a=4

and output in a standardized format:

/test.php?a=3&a=4&b=b&c=bb

The goal here is to standardize a url so that it may be used as a key in a hash function of an http cache, thereby preventing duplicate pages in cache for the same url with differently ordered parameters. The sorted order is optional, so long as a canonical url is derived from any ordering of query string parameters.

It’s worth noting that since this was a friendly competition, much code sharing and talking over strategies and optimizations took place over the course of the week.

Initial strategy

My initial strategy was slow and inefficient. Scan the url and count the number of ampersands so that I can allocate an array of structs of sufficient size. Scan again to grab pointers to the first character in the param and the param length and store them in the structs. Using qsort and strcmp sort the array of structs in alphabetical order. Allocate a string of duplicate size to the input. Iterate over the param struct and using pointer assignment to replace each character in sorted order.

With the exception of the return string all of the memory used is stack allocated. This attempt produced a program that could sort ~400k urls/sec on my macbook pro. Not bad, but not nearly as good as it gets.

After getting a code review from a few other c devs I set out to take a different tack. The goal was only going touching each character of the input url once. I very nearly achieve that and it’s brought the performance up to ~1.8m urls/sec on my mbp.

This is the breakdown of all the optimizations that I tried and kept/discarded. Many great devs assisted me in this process. They are listed at the bottom of the page and I am very grateful for the time they spent with me!

Memory Matters

So it turns out the two arrays (one of const char*, one of unsigned short) performs better than an array of pointers to structs containing the same value types. The two values to track are the starting character position of the param in the url and it's length.

struct param {
    const char* start;
    unsigned short length;
}
param* params[];

const char* params[];
unsigned short param_lengths[];

This was a bit of a surprise to me. I had assumed that sorting a single pointer array would be faster than keeping track of two arrays. I was wrong. I tried different ordering of the elements of the struct, using a long to track length as an attempt at better memory alignment, doing array style access to elements vs incrementing pointer access. Each had different impacts but at the end of it all using two arrays was just faster. Also faster was an unsigned short array vs an int array.

Variable Locality

Another optimization was to copy a const char* value into a char when I needed to test it multiple times. I had heard that this may be the case due to register use. I found it to be a minor increase in speed.

Testing for zero

When iterating through a loop and testing that my index was less than the number of elements that were in the loop, I thought I was being efficient. But what I found was taking the number of elements and assigning it to a variable and decrementing that variable each time through the loop and testing when it was zero was faster. It’s a cheaper test to test for 0 than to compare two numbers.

Pointer arithmetic

I have always heard of pointer arithmetic but now I understand why it’s so useful. Taking a pointer to the first element in an array and incrementing it to iterate through it’s elements is elegant and more efficient than taking an integer i and using it as an array index. I wish more languages afforded this.

Hashing costs

A great suggestion from a friend was to hash each param to an integer and sort by the integers. In theory this would allow for touching each character of the input url only once before sorting was possible. I used a cheap hash called djb. I inlined it to reduce any function call overhead.

// djb hash
int hash = 5381;
while (*url && *url != '&')
    hash = ((hash << 5) + hash) ^ *url++;

What I found was that it was a little bit slower. Why? Aside from the extra arithmetic that was being processed on every character, most query string parameters are sortable by looking at the first character of the param name only. name=jason&job=dev&company=vimeo for example, doesn’t require looking at any but the first character of each param. So I opted for character comparison.

`strcmp` vs hand rolled

Unforseen to me, a hand rolled string compare function performed better than native strcmp. I assumed that a native func would be all kinds of optimized but in the end a simple k&r style character by character comparison performed better than strcmp. Dumping the assembly for both showed many more instructions for the strcmp version.

`qsort` vs insertion sort

I set out to use as many native funcs as I could but again found that it’s not always the most optimal path. A hand rolled insertion sort performed better than the qsort native function. But I found an even more performant way to sort params that reduced constant time and allowed me to return early when a url was already in sorted order.

double-wide insertion sort

I was at a bar in the east village called double-wide and after much merriment came home to work on my url sorter. I had the thought that if I used an array of twice the size of the params, I could set the middle element as the zero index and this would allow me to prepend and append elements in O(1) time. This would save having to shuffle the entire array up one every time I had to put an element at the beginning. While this would be easily implemented in a linked list, the performance was still better with a double-wide array in my use case. I only needed to keep track of the head and tail elements of the array for easy iteration over them later. And if I only appended each param, I knew the params were already sorted. So I could return the original url. (In this particular context the plugin architecture we were writing against allowed for returning the original string or a new malloc’d string)

Zero or one param

Returning early when only a single param or zero params were present was another more obvious optimization.

Optimizations that didn’t work:

Loop unrolling

In my feverish search for more and more obscure optimizations I stumbled upon a mind-bending gem called Duff’s device. On first glance I assumed my iphone had rendered the code wrong. After repeated attempts to understand, it slowly emerged in my mind. A loop unroller. In the author’s own words “disgusting”, and amazing.

send(to, from, count)
register short *to, *from;
register count;
{
    register n=(count+7)/8;
    switch(count%8){
        case 0: do{ *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++; }while(--n>0);
    }
}

I built a while-less version that optimized for 6 params or fewer where the switch cases fell through to a default of my original for loop. Initially I found a minor speed up. However it turned out to be due to an inefficiency in my loop code! Once fixed, they performed the same.

A fellow coder pointed out that loops have been pretty heavily optimized by compilers and unrolling almost never produces any benefit any more.

Single character comparison before a function call to `str_compare`

I had the idea that taking the first character of each param and comparing it inline and if they were the same, only then calling the str_compare function on the two params would cut many function calls and perform better.

if ((c == 0 && str_compare(params[TAIL], p) < -1) || c < -1)

if (str_compare(params[TAIL], p) < -1)

What I found was that the extra logic was slowing things down. I learned a little bit about branching and how processors optimize code paths by executing as many possible branches as they can in parallel. So a simple function call (although declared as inline), beat the character comparison.

`memmove` vs pointer assigment

This one is probably due to the number of elements being dealt with when shuffling elements in my sorting algorithm, but there was no speedup, only slowdown, when using memmove to move elements in the array up before inserting a new one. Simple pointer assignment beat it.

Profiling and analysing

Late in the game a friend showed me how to profile my code. Using gprof, gcov, and valgrind gave me more information on how to tweak my code and how exactly it was being executed. gcov -b for branch tracking was an enlightening experience as a programmer. And in the next competition I’ll be reaching for these tools much earlier in the game.

Conclusion

While our competition may be near it’s end, I suspect we’ll never stop tweaking our code. Finding new ways to shave time and memory off our performance. It's been such a great way to get acquainted with programming tactics in a language. I'm thoroughly looking forward to the next.

At this point I am slightly behind the leader.

2.17m urls/sec
2.08m urls/sec (me)
1.83m urls/sec

Once we have a winner and we build the code into a proper Varnish module, we will open source it on Vimeo's github.

Awesome people

Thanks to Naren, Carlos, Derek, Paul, Won, Mashiat. Good hackin with you guys. ↑