March 17, 2012

Comparison of strlen Implementations

The strlen() function, to find the the length of null-terminated string.

In terms of algorithm, it is very simple, start from the first character, increment the counter after you scanning a non - null character, until you reach the '\0' character which is the NULL terminator . The time complexity is linear O(n).

However, when it comes to implementation, there are many different variants.

Naive implementation:
Comparing one byte each time, check if it is NULL.


size_t naiveStrlen(const char *str)
{
  const char *p = str;

  while(*p) p++;

  return p-str;
}



Better implementation:
Comparing four or eight bytes each time.

size_t betterStrlen(const char *str)
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, himagic, lomagic;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (longword) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  longword_ptr = (unsigned long int *) char_ptr;

  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  if (sizeof (longword) > 4) {
    himagic = ((himagic << 16) << 16) | himagic;
    lomagic = ((lomagic << 16) << 16) | lomagic;
  }

  for (;;) {
    longword = *longword_ptr++;
    if (((longword - lomagic) & ~longword & himagic) != 0) {
      const char *cp = (const char *) (longword_ptr - 1);

      if (cp[0] == 0) return cp - str;
      if (cp[1] == 0) return cp - str + 1;
      if (cp[2] == 0) return cp - str + 2;
      if (cp[3] == 0) return cp - str + 3;

      if (sizeof (longword) > 4) {
        if (cp[4] == 0) return cp - str + 4;
        if (cp[5] == 0) return cp - str + 5;
        if (cp[6] == 0) return cp - str + 6;
        if (cp[7] == 0) return cp - str + 7;
      }
    }
  }
}


SSE2 implementation:
Utilizing SSE2 instruction, compare 16 bytes each time.


size_t sse2Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i xmm1;
  const char *char_ptr;
  int mask;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (__m128i) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  for(;;) {
    xmm1 = _mm_cmpeq_epi8(_mm_load_si128((__m128i *) char_ptr), zero);
    if ((mask = _mm_movemask_epi8(xmm1)) != 0) {
        return (char *) (char_ptr) + __builtin_ctz(mask) - str;
    }
    char_ptr += 16;
  }
}





SSE4 implementation:
Utilize string processing instruction ”PCMPISTRI“provided by SSE4.


size_t sse42Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i *ptr = (__m128i *) str;

  for(;;) {
    offset = _mm_cmpistri(zero, _mm_loadu_si128(ptr++), _SIDD_CMP_EQUAL_EACH);
    if (offset != 16) {
      break;
    }
  }
  return (char *) ptr - 16 - str + offset;
}






Result:


Before comparing the performance, we have further optimize the implementations by unrolling the loop. Result shows that SSE2 is the fastest implementation. We also compared the loop unrolled SSE2 and normal SSE2, it shows that loop unrolling does helps make it faster. The chart below shows the result of running time for each implementation. Length of tested string range from 2 to 1024.




Conclusion:

SSE2 implementation beats the glib's implementation and becomes fastest implementation. The string processing instruction of SSE4 seems not optimized enough. And loop unrolling really helps in the implementations.

All the source code can be downloaded from https://github.com/Jasonhcwong/blog




References:

1. Intel 64 and IA-32 Architectures Software Developer's Manual
2. Glibc's strlen() implementation


No comments:

Post a Comment