Consider for example the following case:
```cpp
char a[10];
char b[10];
start_thread_that_works_on_a;
start_thread_that_works_on_b;
```
Those 20 bytes will very likely lie at contiguous memory locations and therefore be mapped into the same single cache line. Each time one thread writes to its own array it invalidates the cache line for the other thread, so the hardware has to write the line back and fetch it from memory again, even though that is not strictly needed.
I've had to work on an algorithm that suffered from this very issue, and the following code, though contrived, exhibits the same problem:
```cpp
#include <iostream>
#include <boost/thread.hpp>

class threadS {
public:
    threadS(unsigned char* aVector, unsigned int aVSize)
        : theVector(aVector), theVSize(aVSize) {}

    // Thread body: repeatedly increment every element of the array.
    void operator()() {
        unsigned long long myCounter = 100000000;
        while (--myCounter) {
            for (unsigned int i = 0; i < theVSize; ++i) {
                ++theVector[i];
            }
        }
    }

private:
    unsigned char* theVector;
    unsigned int theVSize;
};

int main() {
    unsigned char vectorA[10];
    unsigned char vectorB[10];

    std::cout << "A:[" << static_cast<void*>(&vectorA[0]) << "-"
              << static_cast<void*>(&vectorA[9]) << "]" << std::endl;
    std::cout << "B:[" << static_cast<void*>(&vectorB[0]) << "-"
              << static_cast<void*>(&vectorB[9]) << "]" << std::endl;

    threadS threadA(vectorA, 10);
    threadS threadB(vectorB, 10);

    boost::thread_group tg;
    tg.create_thread(threadA);
    tg.create_thread(threadB);
    tg.join_all();
}
```
You should be able to compile and link it with:
```
g++ main.cpp -o false_sharing -lboost_thread -O3
```
Let's see what that code does.
The class threadS stores a pointer to the array it will operate on. The thread body (the operator()) simply increments all the array elements, over and over. As you can see, I used the Boost.Thread library to start two threads: threadA and threadB.
On my system I obtain an execution time that goes from 6 to 8 seconds and the following boundaries for the two vectors:
B:[bfa7d010-bfa7d019] - A:[bfa7d01a-bfa7d023]
As you can see, vectorB and vectorA sit at contiguous memory locations.
How can we eliminate the false sharing in this case? The goal is to have the two threads working on different underlying cache lines, and we can achieve that by separating the two arrays with some extra bytes. Declaring the two arrays bigger than we need is a quick-and-dirty way to do it. Executing the same program with the following declarations:
```cpp
unsigned char vectorA[100];
unsigned char vectorB[100];
```
I'm able to obtain an execution time that goes from 1 to 1.5 seconds.
The same problem would occur with a single array, unsigned char vector[1000], with one thread working on elements [0,10] and another thread working on elements [11,20]. I wrote a simple program that creates two threads: one performs writes at locations [0,10] and the other performs writes at [z,z+10], with z in the interval [11,100]. The following graph shows the execution time as the number of bytes separating the two regions increases; z=11 means the data have 0 bytes of separation.
As you can see, as soon as the two falsely shared regions are separated by 51 bytes, the execution time collapses from ~8 seconds to ~1.8 seconds. So, just by wasting 51 bytes, I obtain a 5x speed-up; nice, isn't it?