What I learned from my first c coding challenge

Last week at Vimeo we had a challenge. Build a better query string parameter sorter for a Varnish plugin. The gauntlet was thrown down at 5pm on a Friday and by 2am Friday night we had our first contender. I arrived home to see the code and decided to throw my hat into the ring. I had made some attempts at c plugins for php but never really got anything off the ground. The K&R book had been a fascination for me but now it was time to put it into practice.

While this is not in the Varnish module format yet, my still-evolving entry is here:  https://github.com/jasonmoo/skwurly

The challenge:

Take an input const char* url:

/test.php?b=b&c=bb&a=3&a=4

and output in a standardized format:

/test.php?a=3&a=4&b=b&c=bb

The goal here is to standardize a url so that it may be used as a key in a hash function of an http cache, thereby preventing duplicate pages in cache for the same url with differently ordered parameters. The sorted order is optional, so long as a canonical url is derived from any ordering of query string parameters.

It’s worth noting that since this was a friendly competition, much code sharing and talking over strategies and optimizations took place over the course of the week.

Initial strategy
My initial strategy was slow and inefficient. Scan the url and count the number of ampersands so that I can allocate an array of structs of sufficient size. Scan again to grab pointers to the first character in the param and the param length and store them in the structs. Using qsort and strcmp sort the array of structs in alphabetical order. Allocate a string of duplicate size to the input. Iterate over the param struct and using pointer assignment to replace each character in sorted order.

With the exception of the return string all of the memory used is stack allocated. This attempt produced a program that could sort ~400k urls/sec on my macbook pro. Not bad, but not nearly as good as it gets.

After getting a code review from a few other c devs I set out to take a different tack. The goal was only going touching each character of the input url once. I very nearly achieve that and it’s brought the performance up to ~1.8m urls/sec on my mbp.

This is the breakdown of all the optimizations that I tried and kept/discarded. Many great devs assisted me in this process. They are listed at the bottom of the page and I am very grateful for the time they spent with me!

Memory Matters

So it turns out the two arrays (one of const char*, one of unsigned short) performs better than an array of pointers to structs containing the same value types.  The two values to track are the starting character position of the param in the url and it’s length.

struct param {
    const char* start;
    unsigned short length;
}
param* params[];

vs

const char* params[];
unsigned short param_lengths[];

This was a bit of a surprise to me. I had assumed that sorting a single pointer array would be faster than keeping track of two arrays. I was wrong. I tried different ordering of the elements of the struct, using a long to track length as an attempt at better memory alignment, doing array style access to elements vs incrementing pointer access. Each had different impacts but at the end of it all using two arrays was just faster. Also faster was an unsigned short array vs an int array.

Variable Locality
Another optimization was to copy a const char* value into a char when I needed to test it multiple times. I had heard that this may be the case due to register use. I found it to be a minor increase in speed.

Testing for zero
When iterating through a loop and testing that my index was less than the number of elements that were in the loop, I thought I was being efficient. But what I found was taking the number of elements and assigning it to a variable and decrementing that variable each time through the loop and testing when it was zero was faster. It’s a cheaper test to test for 0 than to compare two numbers.

Pointer arithmetic
I have always heard of pointer arithmetic but now I understand why it’s so useful. Taking a pointer to the first element in an array and incrementing it to iterate through it’s elements is elegant and more efficient than taking an integer i and using it as an array index. I wish more languages afforded this.

Hashing costs
A great suggestion from a friend was to hash each param to an integer and sort by the integers. In theory this would allow for touching each character of the input url only once before sorting was possible. I used a cheap hash called djb. I inlined it to reduce any function call overhead.

// djb hash
int hash = 5381;
while (*url && *url != '&')
    hash = ((hash << 5) + hash) ^ *url++;

What I found was that it was a little bit slower. Why? Aside from the extra arithmetic that was being processed on every character, most query string parameters are sortable by looking at the first character of the param name only. name=jason&job=dev&company=vimeo for example, doesn’t require looking at any but the first character of each param. So I opted for character comparison.

strcmp vs hand rolled
Unforseen to me, a hand rolled string compare function performed better than native strcmp. I assumed that a native func would be all kinds of optimized but in the end a simple k&r style character by character comparison performed better than strcmp. Dumping the assembly for both showed many more instructions for the strcmp version.

qsort vs insertion sort
I set out to use as many native funcs as I could but again found that it’s not always the most optimal path. A hand rolled insertion sort performed better than the qsort native function. But I found an even more performant way to sort params that reduced constant time and allowed me to return early when a url was already in sorted order.

double-wide insertion sort
I was at a bar in the east village called double-wide and after much merriment came home to work on my url sorter. I had the thought that if I used an array of twice the size of the params, I could set the middle element as the zero index and this would allow me to prepend and append elements in O(1) time. This would save having to shuffle the entire array up one every time I had to put an element at the beginning. While this would be easily implemented in a linked list, the performance was still better with a double-wide array in my use case. I only needed to keep track of the head and tail elements of the array for easy iteration over them later. And if I only appended each param, I knew the params were already sorted. So I could return the original url.

(In this particular context the plugin architecture we were writing against allowed for returning the original string or a new malloc’d string)

Zero or one param
Returning early when only a single param or zero params were present was another more obvious optimization.

Optimizations that didn’t work:

Loop unrolling
In my feverish search for more and more obscure optimizations I stumbled upon a mind-bending gem called Duff’s device. On first glance I assumed my iphone had rendered the code wrong. After repeated attempts to understand, it slowly emerged in my mind. A loop unroller. In the author’s own words “disgusting”, and amazing.

send(to, from, count)
register short *to, *from;
register count;
{
    register n=(count+7)/8;
    switch(count%8){
        case 0: do{ *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++; }while(--n>0);
    }
}

I built a while-less version that optimized for 6 params or fewer where the switch cases fell through to a default of my original for loop. Initially I found a minor speed up. However it turned out to be due to an inefficiency in my loop code! Once fixed, they performed the same.

A fellow coder pointed out that loops have been pretty heavily optimized by compilers and unrolling almost never produces any benefit any more.

Single character comparison before a function call to str_compare
I had the idea that taking the first character of each param and comparing it inline and if they were the same, only then calling the str_compare function on the two params would cut many function calls and perform better.

if ((c == 0 && str_compare(params[TAIL], p) < -1) || c < -1)

vs

if (str_compare(params[TAIL], p) < -1)

What I found was that the extra logic was slowing things down. I learned a little bit about branching and how processors optimize code paths by executing as many possible branches as they can in parallel. So a simple function call (although declared as inline), beat the character comparison.

memmove vs pointer assigment
This one is probably due to the number of elements being dealt with when shuffling elements in my sorting algorithm, but there was no speedup, only slowdown, when using memmove to move elements in the array up before inserting a new one. Simple pointer assignment beat it.

Profiling and analysing

Late in the game a friend showed me how to profile my code. Using gprof, gcov, and valgrind gave me more information on how to tweak my code and how exactly it was being executed. gcov -b for branch tracking was an enlightening experience as a programmer. And in the next competition I’ll be reaching for these tools much earlier in the game.

Conclusion

While our competition may be near it’s end, I suspect we’ll never stop tweaking our code. Finding new ways to shave time and memory off our performance. It’s been such a great way to get acquainted with programming tactics in a language.  I’m thoroughly looking forward to the next.

At this point I am slightly behind the leader.
#1: 2.17m urls/sec
#2: 2.08m urls/sec (me)
#3 1.83m urls/sec

Once we have a winner and we build the code into a proper Varnish module, we will open source it on Vimeo’s github.

Awesome people
Thanks to Naren, Carlos, Derek, Paul, Won, Mashiat. Good hackin with you guys.

19 Responses to “What I learned from my first c coding challenge”

  1. hobbyist Says:

    Nice Work! Where can I get the 5Murls.txt?

  2. Guillaume Says:

    Could you add a link to 5Murls.txt?

  3. Curious Says:

    How many urls/sec did the original parameter sorter put through?

  4. Olivier Says:

    Hi,

    A while ago I’ve been building a urlsort varnish plugin, it uses a binary tree:
    https://github.com/cyberroadie/varnish-urlsort

    here is a short explanation:
    http://cyberroadie.wordpress.com/2012/01/05/varnish-reordering-query-string/

    Olivier

  5. Chris Says:

    Very neat! I’ve been working in embedded (more or less) C on Linux for the last 10 years and didn’t know about gprof or gcov (though I’ve made gratuitous use of valgrind). Excellent post. :)

  6. C2H5OH Says:

    There’s one thing I can’t understand. If you could beat up strcmp() with your own C version it means that GCC did not treat strcmp() as a builtin. Have you tried using __builtin_strcmp() instead?

  7. gigi Says:

    The arrays instead of structs optimization is classic in the gaming industry.

    Google for AOS (array of structs) vs SOA (struct of arrays).

  8. guest Says:

    Interesting comment by icefox on HackerNews :

    Any more detailed information about the coding challenge?

    For example the repo references a 5Murls.txt file, but it isn’t part of the repo and the blog says it needs to “output in a standardized format:”, but doesn’t specify what “standardized” means nor does the current code actually output anything (the printf is disabled). Does it specifically need to go to stdout or just that it exists somewhere in memory? Does it require a char* or will this char* be instantly hashed and tossed away? Does the challenge forbid the use of threads/cpus/workers? In the blog it says: “In this particular context the plugin architecture we were writing against allowed for returning the original string or a new malloc’d string” Are you allowed to muck with the original string? In the code on github it has a function that takes a const char* forcing a malloc, but it could easily just re-write the string in place if allowed.
    edit: As for sorting the params, (based upon your comments about the common usage) pretty sure there is a way to do this without any string comparisons at all. Post a sanitized 5Murls.txt file and give it a go to make a patch.

  9. Jorgen Says:

    Great article and explanation that even an on-off C programmer like myself could follow.

  10. Nico Says:

    I know sorting was the challenge, but knowing what it’s going to be used for, is it really necessary to sort it? Maybe you could have a hash function such that hash(“b=b&c=bb&a=3&a=4″) == hash(“a=3&a=4&b=b&c=bb”). I guess that’s exactly what sorting does, but I wonder if there might be faster functions that have the same property, without necessarily having to sort.

  11. Wayne Says:

    Interesting work.

    I am not sure about the compiler and platform for test. If it is gcc, the “likely” macro might help with branch predictor. But I haven’t profiled it personally.

    Example from :

    const char *home;

    home = getenv (“HOME”);
    if (likely (home))
    printf (“Your home directory is %s\n”, home);
    else
    fprintf (stderr, “Environment variable HOME not set!\n”);

  12. Edwin Martin Says:

    I think the loop unrolling has no effect is not due to better compiler optimizations, but better processors. Processors use branche prediction and know to jump to the top of the loop instead of the next instruction below.

  13. lol Says:

    How do you do the benchmark timing? Have you excluded the I/O time? It seems you are using stdio with a small buffer, which takes quite some time.

  14. indrora Says:

    On the topic of optimizations, I’ve seen some evil things done by compilers. The problem is that Loop unrolling causes great performance on x86 (since it uses nothing but a program counter and some memory for what its worth) but the entire point of loop unrolling becomes blown apart on the world of ARM.

    Why? because ARM likes small code sizes. Its actually optimized for small code sizes, as it uses a L1 cache that tries to look-ahead just a bit. GCC will add bust-out lines in unrolled loops so that if something happens (e.g. you increment by too much) it can bail. Duff’s Devices are cool and all, but they still fail on devices like ARM.

    I found this out while writing a CRC (I know, its CRC) algorithm for an ARM chip I’m working with.

  15. Dan Sutton Says:

    Interesting. Just for the sake of it, here it is in C#: .Net gives us a few things which come in handy… I know all you C purists are going to get at me for not writing my own code but using the built-in .Net stuff instead… however, it works: the point here is that sometimes C just isn’t the right language if you want a decent development time:

    private string Format(string Url)
    {
    string[] Sects = Url.Split(‘?’);
    try
    {
    string[] Parms = Sects[1].Split(‘&’);
    Array.Sort(Parms);
    return Sects[0] + ‘?’ + String.Join(“&”, Parms);
    }
    catch { return Url; }
    }

  16. sv Says:

    I think a lot of what you ran into is cases where the optimizations you thought would work only work when dealing with things on a larger scale, e.g. the url’s involve only had a handful of query parameters. If there were thousands of parameters involved, qsort would have kicked your insertion sort’s butt.

    And loop unrolling is most effective when what is done what is done in the loop is trivial compared to the loop overhead and when you don’t introduce any additional branching.

  17. Gordon Says:

    Perhaps this could help you get a few more obscure optimizations?
    http://graphics.stanford.edu/~seander/bithacks.html

  18. Dieter Says:

    Great writeup Jason, hope to join in next time!
    I don’t quite understand the “Testing for zero” part. So, if I understand correctly, these are the two approaches:
    A) given number X, subtract numbers y1, y2, …y5 to arrive at Y, then check whether Y == 0
    B) check whether X == Y (with Y already given, and equaling y1+y2+ … +y5)
    (5 is an example and denotes amount of loop iterations)

    you’re saying A is faster? this is weird because it has 5 subtractions and a comparision to 0.
    B compares two arbitrary ints, and if that is so slow, the compiler could turn it into one subtraction and comparison to 0: (X-Y) == 0

  19. Rubén Romero Says:

    Maybe the Varnish Modules overview should be updated now that you have a winner?

    And… Congrats to the winnng code: https://github.com/vimeo/libvmod-boltsor

    :-)

Leave a Reply