June 18, 2012

C++ callback

According to Wikipedia, a callback is a reference to a piece of executable code that is passed as an argument to other code. Callback is widely used in C++, you can find it in STL algorithm, single handler of Linux, Windows API, etc..

In C++, callback can be implemented using function pointer and function object(functor).

Function pointer is a pointer that points to executable code in the memory.Usually, there are two types of function pointer: ordinary function pointer and class member function pointer(pointer to static method is treated the same as ordinary function pointer).

Function object(functor) is a class with function call operator "()" overloaded.

Function pointer V.S. functor:
1. When passed as a parameter, you pass the function pointer itself if you are using function pointer, if you functor, you passed the object.
2. The functor can preserve the state of the object between calls.

Ordinary function pointer V.S. Member function pointer:
1. The size of the pointer is different. In my environment(Ubuntu 10.04 X86_64) the size of ordinary function pointer is 8 byte while member function pointer occupies 16 bytes.
2. When invoking the callback, you need to passed the parameter, and if you are using functor, you have to passed one addition parameter -- the object.


I implemented a unified functor generator class that will take both ordinary function pointer and member function pointer and return a functor.

Here are some code example:



  // test ordinary function
  functor g1;
  g1 = &test1;
  g1(12.123);

  // test member function
  functor g3(&Base::static_b);
  functor g4(&Base::d9);
  functor g5(&Base::virtual_c);
  g3(29) == 29);
  g4(&b, 1,2,3,4,5,6,7,8,9);
  g5(&c, 5);

  functor functorCmp = &mycmp;
  std::sort(myvector2.begin(), myvector2.end(), functorCmp);



The code can be check out at https://github.com/Jasonhcwong/blog.




Reference:

1. Boost function, http://www.boost.org/doc/libs/1_49_0/doc/html/function/
2. Member Function Pointers and the Fastest Possible C++ Delegates, http://www.codeproject.com/Articles/7150/Member-Function-Pointers-and-the-Fastest-Possible
3. Wikipedia

March 17, 2012

Comparison of strlen Implementations

The strlen() function, to find the the length of null-terminated string.

In terms of algorithm, it is very simple, start from the first character, increment the counter after you scanning a non - null character, until you reach the '\0' character which is the NULL terminator . The time complexity is linear O(n).

However, when it comes to implementation, there are many different variants.

Naive implementation:
Comparing one byte each time, check if it is NULL.


size_t naiveStrlen(const char *str)
{
  const char *p = str;

  while(*p) p++;

  return p-str;
}



Better implementation:
Comparing four or eight bytes each time.

size_t betterStrlen(const char *str)
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, himagic, lomagic;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (longword) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  longword_ptr = (unsigned long int *) char_ptr;

  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  if (sizeof (longword) > 4) {
    himagic = ((himagic << 16) << 16) | himagic;
    lomagic = ((lomagic << 16) << 16) | lomagic;
  }

  for (;;) {
    longword = *longword_ptr++;
    if (((longword - lomagic) & ~longword & himagic) != 0) {
      const char *cp = (const char *) (longword_ptr - 1);

      if (cp[0] == 0) return cp - str;
      if (cp[1] == 0) return cp - str + 1;
      if (cp[2] == 0) return cp - str + 2;
      if (cp[3] == 0) return cp - str + 3;

      if (sizeof (longword) > 4) {
        if (cp[4] == 0) return cp - str + 4;
        if (cp[5] == 0) return cp - str + 5;
        if (cp[6] == 0) return cp - str + 6;
        if (cp[7] == 0) return cp - str + 7;
      }
    }
  }
}


SSE2 implementation:
Utilizing SSE2 instruction, compare 16 bytes each time.


size_t sse2Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i xmm1;
  const char *char_ptr;
  int mask;

  for (char_ptr = str; ((unsigned long int) char_ptr
                       & (sizeof (__m128i) - 1)) != 0;
                       ++char_ptr) {
    if (*char_ptr == '\0') {
      return char_ptr - str;
    }
  }

  for(;;) {
    xmm1 = _mm_cmpeq_epi8(_mm_load_si128((__m128i *) char_ptr), zero);
    if ((mask = _mm_movemask_epi8(xmm1)) != 0) {
        return (char *) (char_ptr) + __builtin_ctz(mask) - str;
    }
    char_ptr += 16;
  }
}





SSE4 implementation:
Utilize string processing instruction ”PCMPISTRI“provided by SSE4.


size_t sse42Strlen(const char *str)
{
  __m128i zero = _mm_setzero_si128(); int offset = 0;
  __m128i *ptr = (__m128i *) str;

  for(;;) {
    offset = _mm_cmpistri(zero, _mm_loadu_si128(ptr++), _SIDD_CMP_EQUAL_EACH);
    if (offset != 16) {
      break;
    }
  }
  return (char *) ptr - 16 - str + offset;
}






Result:


Before comparing the performance, we have further optimize the implementations by unrolling the loop. Result shows that SSE2 is the fastest implementation. We also compared the loop unrolled SSE2 and normal SSE2, it shows that loop unrolling does helps make it faster. The chart below shows the result of running time for each implementation. Length of tested string range from 2 to 1024.




Conclusion:

SSE2 implementation beats the glib's implementation and becomes fastest implementation. The string processing instruction of SSE4 seems not optimized enough. And loop unrolling really helps in the implementations.

All the source code can be downloaded from https://github.com/Jasonhcwong/blog




References:

1. Intel 64 and IA-32 Architectures Software Developer's Manual
2. Glibc's strlen() implementation


January 26, 2012

Singleton

Singleton is one of the well known design patterns. It is used to ensure that there is only one instance of the class and provide a global access point to it.

Single-threaded version

A single-threaded version of singleton is provided by <<Design Patterns>>,  here is a similar but generic version :



template 
class Singleton
{
private:
        Singleton() { }
        ~Singleton() { }

public:
        static T &getInstance()
        {
                static T *_instance = 0;
                if(!_instance) {
                        _instance = new T;
                }
                return *_instance;
        }
}





Multi-threaded version


The variable "_instance" is shared among the threads. "_instance" should be written once only during lazy initialization when the first time the "getInstance()" method called. However, there can be situation that multiple threads call the "getInstance()" method simultaneously,  so "_instance" should be protected when writing data to it, otherwise, data race will happen.

The easiest and least efficient way is

static T &getInstance()
{
        static T *_instance = 0;
        // get LOCK
        if(!_instance) {
                _instance = new T;
        }
        // release LOCK
        return *_instance;
}


It is inefficient because:

  1. lock and unlock have overhead
  2. locking is need only during initialization, once initialized we dont need lock anymore.


Here comes Double Checked Locking Pattern:

static T &getInstance()
{
        static T *_instance = 0;
        if(!_instance) {
                // get LOCK
                if(!_instance) {
                         _instance = new T;
                }
                // release LOCK

        }
        return *_instance;
}


"_instance" is checked twice. The second time is used to ensure that the variable is not written by other threads when waiting for the lock. However, this pattern has a serious problem, the reading and writing to "_instance" is NOT synchronized, that means the reading outside the lock may get an uncompleted value.

How to make the write atomic? Compare And Swap(CAS) !!



static T &getInstance()
{
        static T *_instance = 0;
        if(!_instance) {
                T *n = new T;
                if(__sync_val_compare_and_swap(&_instance, 0, n)) {
                         delete n;
                }
        }
        return *_instance;
}

__sync_val_compare_and_swap() is a GCC atomic builtin to perform CAS.


Conclusion
Singleton is not a complicated pattern in serial programming, but once it comes multi-threaded, things become very complicated.
Locking is good to prevent data race, but it is expensive. Replace Locking with atomic operation if applicable.



Reference:
  1. <<Design Patterns: Elements of Reusable Object-Oriented Software>>
  2. GCC Atomic builtin: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
  3. C++ and the Perils of Double-Checked Locking: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf

December 10, 2011

Bash shortcuts - Maximum Productivity

Bash is my favorite shell, I use Bash every day. Using shortcuts help you to use Bash more efficiently and hence increase your productivity.



Here are some shortcuts:

Movement:
  • Ctrl - a: jump to the beginning of the line
  • Ctrl - e: jump to the end of the line
  • Alt - b: move backward a word
  • Alt - f: move forward a word
  • Ctrl - f: move forward a character
  • Ctrl - b: move backward a character


Editing:
  • Ctrl - k: delete the text from the cursor to the end of the line
  • Ctrl - u: delete the text backward from the cursor to the beginning of the line
  • Ctrl - w: delete the text backward from the cursor to the previous space
  • Ctrl - y: paste the kill ring before the cursor
  • Ctrl - d: delete the character under the cursor
  • Ctrl - h: delete the character before the cursor
  • Alt - d: delete from the cursor to the end of the word
  • Alt - c: uppercase the current character and move to the end of the word
  • Alt - u: uppercase from the cursor to the end of word
  • Alt - l: lowercase from the cursor to the end of word
  • Alt - t: swap the current word and previous word 
  • TABTAB: double TAB, auto complete
  • $$: process number of the current process 
  • $?: return code of last finished program


Command Recall:
  • Ctrl - r: search command history
  • Ctrl - g: exit Ctrl + r without history command, ESC exits with history command in the current line
  • Ctrl - p: fetch previous command in the history
  • Ctrl - n: fetch next command in the history
  • !!: refers to the most recent command
  • !string: refer to the most recent command starting with string.


Command Control:
  • Ctrl - l: Clear the screen, similar to "clear" command, but preserve the argument if there is any. 
  • Ctrl - s: Stop screen output
  • Ctrl - q: re-enable screen output stopped by Ctrl-s
  • Ctrl - c: generate SIGINT signal, usually kills the foreground process
  • Ctrl - z: generate SIGSTP signal, usually suspends the foreground process






November 17, 2011

Upgrading G1 from CM4.2 to CM7.1

It is not difficult to compile a CM7.1 rom by yourself. It is not fun to follow instructions to compile and flash the rom, what is really fun is the process of finding out why other people's rom works well but your rom make your G1 brick.

Initially, I just checkout the code from github, compile the code, pack my own rom and then flash it to my phone. Then the phone "brick"(It is not brick actually because it still boot into bootoloader mode and use fastboot), stuck at splash screen and ADB not yet started(no debug message) !!! Now it becomes fun.

After googling, I find that HTC released new bootloader and radio that give my G1 15MB extra RAM, I need a patched kernel to work with the new firmware and ezterry published his rom GINGERBREAD-DS-Gamma-20111107 which works properly on g1.

I boot into bootloader mode and flash the following images using fastboot:
  • New bootloader: Hboot 1.33.0013d 
  • New radio: 2.22.27.08 
  • New recovery: RA-dream-v1.7.0 cyan 
and then reboot into recovery mode and flash my CM7.1 rom and ez-nightly271-cm-2708port_S kernel.

This time, much better since I can view debug message using logcat although it still stuck at splash screen.

The logcat shows that something wrong with the propriety libcamera.so(the file is copied from cm4.2), it cannot locate the symbol "_ZN7android16CameraParametersC1Ev".

Let's find out what happen to libcamera.so and compare with libraries in GINGERBREAD-DS-Gamma-20111107 rom.
The readelf command show that libcamera.so linked to the following libraris:

[jason@jasonpc tmp]$ readelf -a old-libcamera.so|grep "Shared library"
0x00000001 (NEEDED) Shared library: [libutils.so]
0x00000001 (NEEDED) Shared library: [libui.so]
0x00000001 (NEEDED) Shared library: [liblog.so]
0x00000001 (NEEDED) Shared library: [libcamera_client.so]
0x00000001 (NEEDED) Shared library: [libbinder.so]
0x00000001 (NEEDED) Shared library: [libdl.so]
0x00000001 (NEEDED) Shared library: [libc.so]
0x00000001 (NEEDED) Shared library: [libstdc++.so]
0x00000001 (NEEDED) Shared library: [libm.so]


[jason@jasonpc tmp]$ readelf -a ezgingerbread-libcamera.so|grep "Shared library"
0x00000001 (NEEDED) Shared library: [libutils.so]
0x00000001 (NEEDED) Shared library: [libui.so]
0x00000001 (NEEDED) Shared library: [liblog.so]
0x00000001 (NEEDED) Shared library: [libcamera_client.so]
0x00000001 (NEEDED) Shared library: [libbinder.so]
0x00000001 (NEEDED) Shared library: [libdl.so]
0x00000001 (NEEDED) Shared library: [libc.so]
0x00000001 (NEEDED) Shared library: [libstdc++.so]
0x00000001 (NEEDED) Shared library: [libm.so]

ezgingerbread's libcamera.so needs libcamera_client.so while my old-libcamera.so does not.
Let's look into libui.so and libcamera_client.so.

The nm command shows the symbol in the dynamic library.
[jason@jasonpc tmp]$ nm -D old-libui.so |grep _ZN7android16CameraParametersC1Ev
000177e9 T _ZN7android16CameraParametersC1Ev

[jason@jasonpc tmp]$ nm -D ezgingerbread-libui.so |grep _ZN7android16CameraParametersC1Ev
[jason@jasonpc tmp]$

[jason@jasonpc tmp]$ nm -D ezgingerbread/system/lib/libcamera_client.so |grep _ZN7android16CameraParametersC1Ev
0000d155 T _ZN7android16CameraParametersC1Ev

git log shows that the CameraParameter class has been moved from libui.so to libcamera_client.so since Froyo. That is why the old libcamera.so cannot find the symbol and the new library linked to libui.so.

Since I dont have a workable libcamera.so, I just copied the one from GINGERBREAD-DS-Gamma-20111107 rom. After using the new libcamera.so, the phone boot up successfully though there still few other errors.



References:
ezterry 's post on xda-developers

November 12, 2011

Parallel programming Optimization tricks

Comparision of 2 __m128i:
  • SSE 4.1:
#define iszero(a) _mm_testz_si128(a, a)
#define isequal(a, b) _mm_testc_si128(_mm_cmpeq_epi32(a, b), gAllOnes)
  • Others:
#define iszero(a) (_mm_movemask_epi8(_mm_cmpeq_epi32(a, _mm_setzero_si128())) == 0xffff)
#define isequal(a, b) (_mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xffff)


Number of trailing zeros of an int:
  • intrinsic: _bit_scan_forward()
  • corresponding asm instruction: bsfl

Nehalem

Some features of Intel's Nehalem CPU architecture.


Turbo Boost:
  • allow processors to operate above the rated frequency
  • control by continual measurement of process temperature, current draw and power consumption
  • upper limit of frequency controlled by:
  • number of active(in “C0″ or “C1″ state) cores(dictates the upper limit)
  • estimated current consumption
  • estimated power consumption
  • process temperature
  • all active cores operate at the same frequency and voltage
  • BIOS can enable/disable Turbo Boost
  • ACPI awared OS need no change to support Turbo Boost
  • OS should request the highest performance state(P0), should enable ACPI ?
  • Software shows correct core frequency: i7z

Intelligent Power:
  • adjust processor frequency when process is idle

Hyper-Threading:
  • 2 threads per core

Quick Path:
  • replace FSB
  • up to 25.6 GB/s
  • each processor has its own dedicated memory that it accesses directly through an integrated memory controller
  • access the dedicated memory of another processor should go through QPI

SSE4:
  • XML, text processing

Multi-level cache:
  • Level 1: 32KB instruction cache, 32 KB data cache
  • Level 2: 256KB per core, for data and instruction
  • Level 3: cache shared across all cores