June 18, 2012

C++ callback

According to Wikipedia, a callback is a reference to a piece of executable code that is passed as an argument to other code. Callback is widely used in C++, you can find it in STL algorithm, single handler of Linux, Windows API, etc..

In C++, callback can be implemented using function pointer and function object(functor).

Function pointer is a pointer that points to executable code in the memory.Usually, there are two types of function pointer: ordinary function pointer and class member function pointer(pointer to static method is treated the same as ordinary function pointer).

Function object(functor) is a class with function call operator "()" overloaded.

Function pointer V.S. functor:
1. When passed as a parameter, you pass the function pointer itself if you are using function pointer, if you functor, you passed the object.
2. The functor can preserve the state of the object between calls.

Ordinary function pointer V.S. Member function pointer:
1. The size of the pointer is different. In my environment(Ubuntu 10.04 X86_64) the size of ordinary function pointer is 8 byte while member function pointer occupies 16 bytes.
2. When invoking the callback, you need to passed the parameter, and if you are using functor, you have to passed one addition parameter -- the object.


I implemented a unified functor generator class that will take both ordinary function pointer and member function pointer and return a functor.

Here are some code example:



  // test ordinary function
  functor g1;
  g1 = &test1;
  g1(12.123);

  // test member function
  functor g3(&Base::static_b);
  functor g4(&Base::d9);
  functor g5(&Base::virtual_c);
  g3(29) == 29);
  g4(&b, 1,2,3,4,5,6,7,8,9);
  g5(&c, 5);

  functor functorCmp = &mycmp;
  std::sort(myvector2.begin(), myvector2.end(), functorCmp);



The code can be check out at https://github.com/Jasonhcwong/blog.




Reference:

1. Boost function, http://www.boost.org/doc/libs/1_49_0/doc/html/function/
2. Member Function Pointers and the Fastest Possible C++ Delegates, http://www.codeproject.com/Articles/7150/Member-Function-Pointers-and-the-Fastest-Possible
3. Wikipedia

March 17, 2012

Comparison of strlen Implementations

The strlen() function, to find the the length of null-terminated string.

In terms of algorithm, it is very simple, start from the first character, increment the counter after you scanning a non - null character, until you reach the '\0' character which is the NULL terminator . The time complexity is linear O(n).

However, when it comes to implementation, there are many different variants.

Naive implementation:
Comparing one byte each time, check if it is NULL.


size_t naiveStrlen(const char *str)
{
  const char *p = str;

  while(*p) p++;

  return p-str;
}



Better implementation:
Comparing four or eight bytes each time.

size_t betterStrlen(const char *str)
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, himagic, lomagic;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (longword) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  longword_ptr = (unsigned long int *) char_ptr;

  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  if (sizeof (longword) > 4) {
    himagic = ((himagic << 16) << 16) | himagic;
    lomagic = ((lomagic << 16) << 16) | lomagic;
  }

  for (;;) {
    longword = *longword_ptr++;
    if (((longword - lomagic) & ~longword & himagic) != 0) {
      const char *cp = (const char *) (longword_ptr - 1);

      if (cp[0] == 0) return cp - str;
      if (cp[1] == 0) return cp - str + 1;
      if (cp[2] == 0) return cp - str + 2;
      if (cp[3] == 0) return cp - str + 3;

      if (sizeof (longword) > 4) {
        if (cp[4] == 0) return cp - str + 4;
        if (cp[5] == 0) return cp - str + 5;
        if (cp[6] == 0) return cp - str + 6;
        if (cp[7] == 0) return cp - str + 7;
      }
    }
  }
}


SSE2 implementation:
Utilizing SSE2 instruction, compare 16 bytes each time.


size_t sse2Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i xmm1;
  const char *char_ptr;
  int mask;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (__m128i) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  for(;;) {
    xmm1 = _mm_cmpeq_epi8(_mm_load_si128((__m128i *) char_ptr), zero);
    if ((mask = _mm_movemask_epi8(xmm1)) != 0) {
        return (char *) (char_ptr) + __builtin_ctz(mask) - str;
    }
    char_ptr += 16;
  }
}





SSE4 implementation:
Utilize string processing instruction ”PCMPISTRI“provided by SSE4.


size_t sse42Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i *ptr = (__m128i *) str;

  for(;;) {
    offset = _mm_cmpistri(zero, _mm_loadu_si128(ptr++), _SIDD_CMP_EQUAL_EACH);
    if (offset != 16) {
      break;
    }
  }
  return (char *) ptr - 16 - str + offset;
}






Result:


Before comparing the performance, we have further optimize the implementations by unrolling the loop. Result shows that SSE2 is the fastest implementation. We also compared the loop unrolled SSE2 and normal SSE2, it shows that loop unrolling does helps make it faster. The chart below shows the result of running time for each implementation. Length of tested string range from 2 to 1024.




Conclusion:

SSE2 implementation beats the glib's implementation and becomes fastest implementation. The string processing instruction of SSE4 seems not optimized enough. And loop unrolling really helps in the implementations.

All the source code can be downloaded from https://github.com/Jasonhcwong/blog




References:

1. Intel 64 and IA-32 Architectures Software Developer's Manual
2. Glibc's strlen() implementation


January 26, 2012

Singleton

Singleton is one of the well known design patterns. It is used to ensure that there is only one instance of the class and provide a global access point to it.

Single-threaded version

A single-threaded version of singleton is provided by <<Design Patterns>>,  here is a similar but generic version :



template 
class Singleton
{
private:
        Singleton() { }
        ~Singleton() { }

public:
        static T &getInstance()
        {
                static T *_instance = 0;
                if(!_instance) {
                        _instance = new T;
                }
                return *_instance;
        }
}





Multi-threaded version


The variable "_instance" is shared among the threads. "_instance" should be written once only during lazy initialization when the first time the "getInstance()" method called. However, there can be situation that multiple threads call the "getInstance()" method simultaneously,  so "_instance" should be protected when writing data to it, otherwise, data race will happen.

The easiest and least efficient way is

static T &getInstance()
{
        static T *_instance = 0;
        // get LOCK
        if(!_instance) {
                _instance = new T;
        }
        // release LOCK
        return *_instance;
}


It is inefficient because:

  1. lock and unlock have overhead
  2. locking is need only during initialization, once initialized we dont need lock anymore.


Here comes Double Checked Locking Pattern:

static T &getInstance()
{
        static T *_instance = 0;
        if(!_instance) {
                // get LOCK
                if(!_instance) {
                         _instance = new T;
                }
                // release LOCK

        }
        return *_instance;
}


"_instance" is checked twice. The second time is used to ensure that the variable is not written by other threads when waiting for the lock. However, this pattern has a serious problem, the reading and writing to "_instance" is NOT synchronized, that means the reading outside the lock may get an uncompleted value.

How to make the write atomic? Compare And Swap(CAS) !!



static T &getInstance()
{
        static T *_instance = 0;
        if(!_instance) {
                T *n = new T;
                if(__sync_val_compare_and_swap(&_instance, 0, n)) {
                         delete n;
                }
        }
        return *_instance;
}

__sync_val_compare_and_swap() is a GCC atomic builtin to perform CAS.


Conclusion
Singleton is not a complicated pattern in serial programming, but once it comes multi-threaded, things become very complicated.
Locking is good to prevent data race, but it is expensive. Replace Locking with atomic operation if applicable.



Reference:
  1. <<Design Patterns: Elements of Reusable Object-Oriented Software>>
  2. GCC Atomic builtin: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
  3. C++ and the Perils of Double-Checked Locking: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf