Saturday, May 31, 2008

False Sharing hits again!

You can ask what "false sharing" is? False sharing is an annoying effect that occurs when two processors apparently do not share any resource but due to undergoing hardware architecture they actually do. For example consider two threads writing in not overlapping memory locations, they do not need any synchronization so happily you are driven to think that you are able to split your data in order to implement a lock less algorithm. Unfortunately you hold a sparkling "centrino duo" processor in where both cores do share the L2 cache and then the data you partition in memory can be mapped on the same cache line. The same scenario is triggered on L1 caches, indeed due to coherency protocols if a thread write to a cache line then the cache line referring the same memory is invalidated on the other processor (cache trashing).

Consider for example the following case:

char a[10];
char b[10];

start_thread_that_works_on_a;
start_thread_that_works_on_b;


Very likely that 20 bytes will lie on contiguous memory location and then will be mapped inside the same single cache line, so each time a thread works on his own vector it invalidates the cache line for the other thread and then the hardware has to write and fetch the cache line in and from memory even if it is not strictly needed.

I've had to work on an algorithm that had that very issue and the following code, even if useless, exploits the same problem:

#include <iostream>
#include <boost/thread.hpp>

class threadS {
public:
   threadS(unsigned char *aVector, unsigned int aVSize) 
   :theVector(aVector),
    theVSize(aVSize)
   { }

   void operator()() {
      unsigned long long myCounter = 100000000;
      while(--myCounter) {
          for (int i=0; i<10; ++i) {
              ++theVector[i];
          }
      }
   }
private:
   unsigned char* theVector; 
   unsigned int   theVSize; 
};

int main() 
{
   unsigned char vectorA[10]; 
   unsigned char vectorB[10];

   std::cout << std::hex;
   std::cout << "A:[" <<  (int)&vectorA[0] << "-" << (int)&vectorA[9] << "]" << std::endl;
   std::cout << "B:[" <<  (int)&vectorB[0] << "-" << (int)&vectorB[9] << "]" << std::endl;

   threadS threadA(vectorA, 10);
   threadS threadB(vectorB, 10);

   boost::thread_group tg;
   tg.create_thread(threadA);
   tg.create_thread(threadB);

   tg.join_all();
}



You should be able to compile and link it with:

g++ main.cpp -o false_sharing -lboost_thread -O3

Let see what that codes does.
The class threadS stores the vector on which it will operate. The thread body (the operator()) just increases all the vector elements. As you can see I used the boost thread library to start two threads: threadA and threadB.

On my system I obtain an execution time that goes from 6 to 8 seconds and the following boundaries for the two vectors:

B:[bfa7d010-bfa7d019] - A:[bfa7d01a-bfa7d023]

as you can see vectorB and vectorA are at contiguous memory locations.

How to eliminate the false sharing in this case? The goal is to have both threads working on an underling different cache line, we can achieve this goal separating both data with some extra bytes. Declaring the two vector bigger than we need it's a dirty and quick way to do it, executing the same program with the following vector declaration:

   unsigned char vectorA[100]; 
   unsigned char vectorB[100];


I'm able to obtain an execution time that goes from 1 to 1.5 seconds.

Same problem would happen with a single vector "unsigned char vector[1000]" but with a thread working on elements [0,10] and another thread working on elements [11,20]. I wrote a simple program that creates two threads,one performs writes at locations [0,10] and the other that performs writes at [z,z+10] with z inside the interval [11,100]. The following graph shows the execution time while the bytes of separation between the two data increase; z=11 means that data have a 0 bytes separation.



As you can see as soon the two data "false shared" have 51 bytes separation the execution time collapses from ~8 secs to ~1.8 secs, so just wasting 51 bytes I'm able to obtain a x5 speed up, nice isn't it?