Abstract:
Unrolling the loop in strlen function.

Created by Peter Kankowski
Last changed
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.

Vectorized strlen

A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and easy, but not very fast. That is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.

The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.

The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.

Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.

Test conditions and results

The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.

FunctionShort stringsThe long string
strlen by Microsoft14712104
strlen_my10191471
strlen by Agner Fog12791056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

Related articles

SSE2 optimised strlen by Dmitry Kostjuchenko

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

Peter Kankowski
Peter Kankowski

About the author

Peter lives in Siberia, the land of sleeping sun, beautiful mountains, and infinitely deep snow. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP (on Windows platform). He can be reached at kankowski@narod.ru.

64 comments

Ten recent comments are shown below. Show all comments

Misaligned memory accesses on x86,

On x86 and x86_64 (on core I7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line, the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.

I think I know why: I was doing performance analysis once and I enabled alignment checking in the eflags register and setup my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout windows code and in the CRT. My theory is, Intel engineers did a similar investigation and found misaligned accesses to be happening so often, that they realized that they needed to make it super fast. In Core I7, the handling of misaligned accesses is absolutely amazing.

ace,

The first Intel processors x86 with no penalty for some misaligned memory access that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.

Brad Conroy,

you could eliminate the len+=4 from the loop by using a local char* instead of s.

char *buf=s;
...
... return buf-s;
... return buf-s+1;
... return buf-s+2;
... return buf-s+3;
buf+=4;
Peter Kankowski,

Brad, thank you very much for noticing. Your version has one instruction less in the loop. However, len += 4 can be executed in parallel with s += 4 on a superscalar processor. As far as I remember, both versions have the same speed for this reason.

Denis Novikov,

Hi Peter,

I've found that standard strlen works faster than your implementation with Ubuntu Linux 12.4 x86_64 (g++-4.6.3).

I disassembled `libc.so` and saw that strlen function checks the cpu features and calls one of the functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42 or __strlen_sse2_no_bsf.

Denis Novikov,

Sometimes standard `strlen` function works 8x times faster (with the long string).

Peter Kankowski,

Hi Denis,

thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:

http://www.strchr.com/sse2_optimised_strlen

Denis Novikov,

Peter, I've seen the sse2 implementation by Dmitry.

For the short string your implementation is the fastest (1.4x faster than the glibc's implementation and 1.5x faster than the Dmitry's one).

For the long string your implementation is the slowest (7.5x slower than the glibc's and 4.3x slower than the Dmitry's).

So the fastest implementation is the `standard` glibc's one.

Peter Kankowski,
One year ago, they added a fast implementation by Ondřej Bílka to glibc. I can only congratulate the glibc developers for making it faster than my code. :)
Denis Novikov,

Ubuntu 12.04 is `LTS` and contains 2 yrs. old version of `glibc`. It would be really interesting to test the newest version of `strlen`.

Your name:
Comment:

Please ignore this field: