Thursday, September 18, 2008

Assignment operator

Today I went through a piece of code similar to this (what matters here is how the assignment operator was written):


class Test {
public:
    Test()
    :theID(0), theName()
    {}

    Test& operator=(const Test& aRhs) {
        theID = aRhs.theID;
        theName = aRhs.theName;

        return *this;
    }

private:
    int theID;
    std::string theName;
};

As it is, the code works and I have nothing to object to at first glance. However, who knows how the class will be extended in the future?

Imagine that in a future version some members are added:

class Test {
public:
    Test()
    :theID(0), theName(), theHammer(), theHeap(0)
    {}

    Test& operator=(const Test& aRhs) {
        theID = aRhs.theID;
        theName = aRhs.theName;
        theHammer = aRhs.theHammer; //This is going to be a bottleneck
        theHeap = aRhs.theHeap;     //This is wrong

        return *this;
    }

private:
    int theID;
    std::string theName;
    HugeClass theHammer; //Extra member with huge footprint
    HeapClass* theHeap;  //Extra member on heap
};
As you can see, with these two extra members the code, following the same pattern as the previous version, becomes invalid. The objection I get is: "if one day someone adds those two members, then they will take care to write it correctly". Unfortunately, in the real world it doesn't work like this. The average programmer will just add those two extra members following the pattern already in place, reasoning: "after all, if the code works for the members already present, why shouldn't it work the same for the two extra members I have been told to add?"

The trick to avoid problems like this is to write the right code from the start, thinking about what may happen in the future, when possible of course. Usually the change amounts to a few lines of code and a two-line comment to warn future coders.

The original, well-written class would have been:

class Test {
public:
    Test()
    :theID(0), theName()
    {}

    Test& operator=(const Test& aRhs) {

        //Check for a self assignment
        if (this != &aRhs) {
            theID = aRhs.theID;
            theName = aRhs.theName;
        }

        return *this;
    }

private:
    int theID;
    std::string theName;
};

At this point the objection is: why check for self-assignment when the event is never going to happen? Who will ever write:

Test t;

t = t;
Well, the code is valid, so be sure someone will. Also, it is not always easy to spot a self-assignment; consider this:

t[j] = t[i];

or even:

Test t;
Test &a = t;
...
t = a;

The real question here is: "Why was the assignment operator in that class ever implemented?" The question makes a point: the operator was not needed at all, and in that case the right thing to do is to remove it.

Let's then suppose the class is the one with the two extra members. Since the class is managing dynamically allocated memory, the operator must be implemented. The almost correct version is:



class Test {
public:
    Test()
    :theID(0), theName(), theHammer(), theHeap(new HeapClass)
    {}

    ~Test() { delete theHeap; } //Do not leak the heap member

    Test& operator=(const Test& aRhs) {
        if (this != &aRhs) {
            theID = aRhs.theID;
            theName = aRhs.theName;
            theHammer = aRhs.theHammer;

            delete theHeap;
            theHeap = new HeapClass(*aRhs.theHeap);
        }

        return *this;
    }

private:
    int theID;
    std::string theName;
    HugeClass theHammer; //Extra member with huge footprint
    HeapClass* theHeap;  //Extra member on heap
};


I wrote "almost correct" because the assignment operator is correct but not exception safe: imagine what happens if an exception is thrown; in that case the class will be left in an inconsistent state. The goal is not an easy one. To make an assignment operator exception safe, it is better, before modifying the internal state, to create a temporary object and then swap it with the internal state using operations that do not throw exceptions. This can be achieved with the Pimpl idiom: move all the internal state of Test inside another class and leave inside Test a pointer to this new class; it is better to use a boost::shared_ptr here to avoid writing exception catches:



class TestImpl {

    friend class Test;

public:
    ~TestImpl() { delete theHeap; } //Do not leak the heap member

private:
    TestImpl()
    :theID(0), theName(), theHammer(), theHeap(new HeapClass)
    {}

    TestImpl(const TestImpl& aRhs)
    :theID(aRhs.theID), theName(aRhs.theName),
     theHammer(aRhs.theHammer), theHeap(new HeapClass(*aRhs.theHeap))
    {}

    int theID;
    std::string theName;
    HugeClass theHammer; //Extra member with huge footprint
    HeapClass* theHeap;  //Extra member on heap
};

class Test {
public:
    Test()
    :theImplementation(new TestImpl)
    {}

    Test& operator=(const Test& aRhs) {
        if (this != &aRhs) {
            boost::shared_ptr<TestImpl> tmp(new TestImpl(*aRhs.theImplementation));

            std::swap(theImplementation, tmp);
        }

        return *this;
    }

private:
    boost::shared_ptr<TestImpl> theImplementation;
};

As you can see, writing correct code is not easy, and it becomes even more difficult if you want to write exception-safe code. What I suggest is to remove the assignment operator and copy constructor, and write them only if really needed:


class Test {
public:
    Test()
    :theID(0), theName(), theHammer(), theHeap(new HeapClass)
    {}

    ~Test() { delete theHeap; }

private:
    Test& operator=(const Test& aRhs); //disabled (do not even implement it)
    Test(const Test& aRhs);            //disabled (do not even implement it)

    int theID;
    std::string theName;
    HugeClass theHammer; //Extra member with huge footprint
    HeapClass* theHeap;  //Extra member on heap
};

Thursday, June 19, 2008

Threading mess (2)!

I got some comments about my last post (threading-mess) pointing out that the solution shown only detects two or more threads entering a critical section at the same time (in the last post's example, the method Shared::foo). What it doesn't answer is the real question: "is this class, during its life, used by more than a single thread? Or, more specifically, is a certain section of code used by more than one thread?" Indeed, if you remember, after a SCOPED_WATCH leaves its scope it "releases" the stored current thread ID, allowing another thread to enter.

I have added a nested Watch class to the ThreadCollisionWarning class that also detects whether a critical section is ever used by two different threads (for example, you can detect whether a given class is constructed and destroyed within the same thread).

The code is the following:

#ifndef THREAD_COLLISION_WARNING
#define THREAD_COLLISION_WARNING

#include <stdexcept>

#ifdef NDEBUG

#define THREAD_WATCH(obj)
#define SCOPED_WATCH(obj)
#define WATCH(obj)

#else

#define THREAD_WATCH(obj) ThreadCollisionWarning _##obj;
#define SCOPED_WATCH(obj) ThreadCollisionWarning::ScopedWatch scoped_watch_##obj(_##obj);
#define WATCH(obj)        ThreadCollisionWarning::Watch watch_##obj(_##obj);

#endif

class ThreadCollisionWarning {
public:
    ThreadCollisionWarning()
    :theActiveThread(0)
    { }

    ~ThreadCollisionWarning() { }

    class Watch {
        public:
            Watch(ThreadCollisionWarning& aTCW)
            :theWarner(aTCW)
            { theWarner.enter_self(); }

            ~Watch() { }

        private:
            ThreadCollisionWarning& theWarner;
    };

    class ScopedWatch {
        public:
            ScopedWatch(ThreadCollisionWarning& aTCW)
            :theWarner(aTCW)
            { theWarner.enter(); }

            ~ScopedWatch() { theWarner.leave(); }

        private:
            ThreadCollisionWarning& theWarner;
    };

private:

    void enter_self() { 
        //If the active thread is 0 then I'll write the current thread ID;
        //if two or more threads arrive here, only one will succeed in writing
        //the current thread ID into theActiveThread
        if (! __sync_bool_compare_and_swap(&theActiveThread, 0, pthread_self())) { 

            //Last chance! Maybe it is the same thread entering another
            //critical section from within a critical section
            if (!__sync_bool_compare_and_swap(&theActiveThread, pthread_self(), theActiveThread)) {
                throw std::runtime_error("Thread Collision");
            }
        }
    }

    void enter() { 
        if (!__sync_bool_compare_and_swap(&theActiveThread, 0, pthread_self())) {
            //gotcha! another thread is trying to use the same class
            throw std::runtime_error("Thread Collision");
        }
    }

    void leave() { 
        __sync_fetch_and_xor(&theActiveThread, theActiveThread);
    }

    pthread_t theActiveThread;
};

#endif


The nested Watch class (used by the WATCH macro), in its constructor, initializes the theActiveThread member with the current thread ID if it isn't yet initialized; otherwise it gives another chance, checking whether the active thread is the calling thread itself.

So let's see some examples of use:

Case #1: Check that only one thread ever uses some critical section (recursion allowed)

struct Shared {
   void foo() { 
       WATCH(CriticalSectionA);
       bar();
   }

   void bar() {
       WATCH(CriticalSectionA);
   }

   THREAD_WATCH(CriticalSectionA);
};


Case #2: Check that a class is constructed and destroyed inside the same thread

struct Shared {

    Shared() {
        WATCH(CTOR_DTOR_SECTION);
        ...
    }

   ~Shared() { 
       WATCH(CTOR_DTOR_SECTION);
       ...
   }

   THREAD_WATCH(CTOR_DTOR_SECTION);
};

Note that, doing so, the Shared destructor can throw an exception, so do not use this in production code (put the WATCH inside a try-catch and just report the collision in some way).

Case #3: Two or more different threads can enter a critical section, but in an exclusive way (useful to check whether external synchronization mechanisms are working).

struct Shared {

    void foo() {
        SCOPED_WATCH(CriticalSectionA);
    }

    THREAD_WATCH(CriticalSectionA);
};

Wednesday, June 18, 2008

Threading mess!

Software development requires discipline, you know what I mean: brainstorming, coding rules, code inspections, pair programming. Unfortunately, to management all these activities are a waste of time, so in the end you end up just acting as a "code monkey"; to rub salt in the wound, multithreaded programming requires ten times the discipline you need in a single-threaded environment. I recently stepped into a project of medium size, and to the easy question "are those class instances shared between two or more threads?" the response was: "no... wait... yes, well I'm not sure... I don't know...". Riiiight.
Let's see a quick technique that permits detecting (at runtime, sigh!) whether two or more threads are using a class concurrently.

Suppose we have the following class:

struct Shared {
   void foo() { ...  }
};

and we are unsure whether two threads are calling Shared::foo() at the same time. One way is to add a mutex to the class Shared, attempt a "try lock" as the first thing inside foo, and raise an error if the try lock fails.

Something like:

class Shared {
public:
   void foo() {
      TryScopedLock aLock(theMutex);

      if (!aLock) { throw std::runtime_error("BOOM"); }

      ...
   }

private:
   mutex theMutex; //pseudocode: your favourite mutex type

};

This approach works, but it will slow down your software, hiding other problems, and, most of all, it introduces useless synchronization; a mutex lock is not exactly cheap.

The idea is to use the technique above, but without locking. GCC gives us some functions for atomic memory access, and we can use, for example:

bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval, ...)

for our very purpose. That function assigns newval to *ptr only if the current value of *ptr is oldval; it returns true if the comparison succeeded and newval was written. We can use it to store the thread ID when we enter the critical section, "zeroing" the value when we exit.

Basically, I wrote a class that stores (with an atomic operation) the thread ID of the thread entering the critical section, and forgets it when the thread leaves. This was the result:

#ifndef THREAD_COLLISION_WARNING
#define THREAD_COLLISION_WARNING

#include <stdexcept>

#ifdef NDEBUG

#define THREAD_WATCH(obj)
#define SCOPED_WATCH(obj)

#else

#define THREAD_WATCH(obj) ThreadCollisionWarning _##obj;
#define SCOPED_WATCH(obj) ThreadCollisionWarning::ScopedWatch scoped_watch_##obj(_##obj);

#endif

class ThreadCollisionWarning {
public:
    ThreadCollisionWarning()
    :theActiveThread(0)
    { }

    ~ThreadCollisionWarning() { }

    class ScopedWatch {
        public:
            ScopedWatch(ThreadCollisionWarning& aTCW)
            :theWarner(aTCW)
            { theWarner.enter(); }

            ~ScopedWatch() { theWarner.leave(); }

        private:
            ThreadCollisionWarning& theWarner;
    };

private:

    void enter() { 
        if (!__sync_bool_compare_and_swap(&theActiveThread, 0, pthread_self())) {
            //gotcha! another thread is trying to use the same class
            throw std::runtime_error("Thread Collision");
        }
    }
    void leave() { 
        __sync_fetch_and_xor(&theActiveThread, theActiveThread);
    }

    pthread_t theActiveThread;
};

#endif

The class ThreadCollisionWarning has the responsibility of storing the thread using the class (or, more generally, entering a critical section), while the nested class ScopedWatch is used to notify entering and leaving the critical section. Look at the implementation of ThreadCollisionWarning::enter and ThreadCollisionWarning::leave: the former stores the thread ID only if the old value was 0, the latter just zeroes it. The macros simplify the usage.

So there we go, the class Shared becomes then:

struct Shared {
   void foo(char aC) { 
       SCOPED_WATCH(Shared)
       ...
   }

   THREAD_WATCH(Shared)
};

Using SCOPED_WATCH we just check that two threads are not using the method foo concurrently.

Of course the implementation above is not by any means a complete solution to the problem I exposed at the beginning, but it helps, and it can be a good starting point for a better tool to detect whether someone messed around.

Saturday, May 31, 2008

False Sharing hits again!

You may ask what "false sharing" is. False sharing is an annoying effect that occurs when two processors apparently do not share any resource but, due to the underlying hardware architecture, they actually do. For example, consider two threads writing to non-overlapping memory locations: they do not need any synchronization, so you are happily driven to think you can split your data in order to implement a lock-free algorithm. Unfortunately, you hold a sparkling "centrino duo" processor where both cores share the L2 cache, and the data you partitioned in memory can be mapped onto the same cache line. The same scenario occurs with L1 caches: due to coherency protocols, if a thread writes to a cache line then the cache line referring to the same memory is invalidated on the other processor (cache thrashing).

Consider for example the following case:

char a[10];
char b[10];

start_thread_that_works_on_a;
start_thread_that_works_on_b;


Very likely those 20 bytes will lie in contiguous memory locations and will be mapped onto the same cache line, so each time a thread works on its own vector it invalidates the cache line for the other thread, and the hardware has to write and fetch the cache line to and from memory even when it is not strictly needed.

I've had to work on an algorithm that had this very issue, and the following code, even if useless, exhibits the same problem:

#include <iostream>
#include <boost/thread.hpp>

class threadS {
public:
   threadS(unsigned char *aVector, unsigned int aVSize) 
   :theVector(aVector),
    theVSize(aVSize)
   { }

   void operator()() {
      unsigned long long myCounter = 100000000;
      while(--myCounter) {
          for (int i=0; i<10; ++i) {
              ++theVector[i];
          }
      }
   }
private:
   unsigned char* theVector; 
   unsigned int   theVSize; 
};

int main() 
{
   unsigned char vectorA[10]; 
   unsigned char vectorB[10];

   std::cout << "A:[" << (void*)&vectorA[0] << "-" << (void*)&vectorA[9] << "]" << std::endl;
   std::cout << "B:[" << (void*)&vectorB[0] << "-" << (void*)&vectorB[9] << "]" << std::endl;

   threadS threadA(vectorA, 10);
   threadS threadB(vectorB, 10);

   boost::thread_group tg;
   tg.create_thread(threadA);
   tg.create_thread(threadB);

   tg.join_all();
}



You should be able to compile and link it with:

g++ main.cpp -o false_sharing -lboost_thread -O3

Let's see what that code does.
The class threadS stores the vector on which it will operate. The thread body (operator()) just increments all the vector elements. As you can see, I used the Boost thread library to start two threads: threadA and threadB.

On my system I obtain an execution time that goes from 6 to 8 seconds and the following boundaries for the two vectors:

B:[bfa7d010-bfa7d019] - A:[bfa7d01a-bfa7d023]

As you can see, vectorB and vectorA are at contiguous memory locations.

How can we eliminate the false sharing in this case? The goal is to have the two threads working on different underlying cache lines, which we can achieve by separating the two arrays with some extra bytes. Declaring the two vectors bigger than needed is a quick and dirty way to do it. Executing the same program with the following vector declarations:

   unsigned char vectorA[100]; 
   unsigned char vectorB[100];


I'm able to obtain an execution time that goes from 1 to 1.5 seconds.

The same problem would happen with a single vector "unsigned char vector[1000]" with one thread working on elements [0,10] and another thread working on elements [11,20]. I wrote a simple program that creates two threads, one performing writes at locations [0,10] and the other performing writes at [z,z+10] with z in the interval [11,100]. The following graph shows the execution time as the separation in bytes between the two data regions increases; z=11 means the data have 0 bytes of separation.



As you can see, as soon as the two "falsely shared" data regions have 51 bytes of separation, the execution time collapses from ~8 secs to ~1.8 secs; just by wasting 51 bytes I'm able to obtain a 5x speedup. Nice, isn't it?