Toward Better Concurrency

We've all heard the same story: the "free lunch" for developers is over. The exponential rocket ride of single-processor performance is no longer feasible, and in order to sustain Moore's Law, scaling out with more processor cores is now the convention. Developers eager to give their software a performance boost can no longer simply wait for the next generation of hardware; instead they must write more concurrent code in order to maximize utilization of the available CPUs.

But concurrent programming (at least as it is practiced now) is complicated, and many developers cringe at the thought of race conditions, deadlocks, contention, and other problems that await them in the murky waters of multi-threading. In many ways they are in the same situation that they were with memory management back in the 1990s (before garbage collection became mainstream). Is there a "silver bullet" for concurrency waiting out there as well?

In this article I will cover some of the current and potentially near-future approaches to handling concurrency, and evaluate some of the potential "saviors-in-waiting".

The Current Paradigm: Threads and Locks

The dominant (although by no means the only) way of dealing with concurrency today is with threads and mutual exclusion via locks and synchronization (semaphores and monitors). When I was a student, few people even talked about any other method of handling concurrency. You did your "dining philosophers" problem using pthreads (oh, the pain), and that was it. Java greatly simplified this approach by building monitors into the language (via synchronized blocks), but the core approach to handling concurrency has remained pretty much the same as it was in C/C++. Other languages, such as C# (the lock keyword), have generally followed the same approach.

Without going too deeply into semaphores, monitors, or other concepts, the basic idea behind the traditional thread model is this:

  • Threads share memory--and more importantly, mutable state--with other threads. In the absence of any control, changes to this state by different threads can cause non-deterministic behavior (bad, really bad).
  • Locks are placed around these sensitive shared regions of code (called "critical sections") and must be acquired by a thread if it wishes to enter that section of code.
  • If a thread fails to acquire the lock, it is put into a wait state (either spinning--i.e., busy waiting--or suspended).
  • If the thread acquires the lock, it is free to execute the critical section of code.
  • After the thread is done executing the critical section, it releases the lock (signals), so that other waiting threads (usually in a queue) may now acquire the lock.
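In Java, this acquire/execute/release cycle can be made explicit with java.util.concurrent.locks.ReentrantLock rather than the synchronized keyword. A minimal sketch (the Counter class is illustrative, not from the article's figures):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: a counter whose critical section is guarded
// by an explicit lock rather than the synchronized keyword.
public class Counter {
    private final ReentrantLock lock = new ReentrantLock();
    private int count = 0;

    public void increment() {
        lock.lock();          // blocks (waits) until the lock is acquired
        try {
            count++;          // the critical section
        } finally {
            lock.unlock();    // release, so a waiting thread may acquire the lock
        }
    }

    public int get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}
```

The try/finally is the idiomatic part: the lock must be released even if the critical section throws.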

A Simple Example

A simple (and commonly used) example of how concurrency is handled is the bank account. A basic implementation of such an account in Java might look something like Figure 1 below.

public class Account 
{
     private int balance;
     public void withdraw(int amount) { balance -= amount; }
     public void deposit(int amount) { balance += amount; }
}
Figure 1: A simple (non-thread safe) bank account

If only one thread ever accesses an account at a time, no problem. However, if an account is accessed by multiple threads (potentially simultaneously), then this implementation is decidedly non-thread-safe. One thread may read the balance while another is updating it, two threads may attempt to write the balance at the same time, etc. Race conditions abound, and chaos ensues.

public class Account 
{
     private int balance;
     public synchronized void withdraw(int amount) { balance -= amount; }
     public synchronized void deposit(int amount) { balance += amount; }
}
Figure 2: A thread-safe bank account...right?

The simple solution to fixing this lack of thread safety is to use the synchronized keyword on both the withdraw and deposit methods. This is essentially the same as declaring a synchronized block (i.e., synchronized(this) { ... }) around the entire body of each method, synchronized on the Account instance itself. Now if one thread is executing the withdraw method, other threads are blocked from entering either the withdraw or deposit methods until the currently executing thread releases the lock. Does this now guarantee that the Account class is thread-safe? Well, no. As we'll see a little later on, the need for mutual exclusion can span different objects (hence different locks).
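That equivalence can be made concrete; here is the same Account written with explicit synchronized blocks instead of synchronized methods (the balance() accessor is my addition, for inspection, and is not part of Figure 2):

```java
// Behaviorally identical to Figure 2: each method body is guarded by
// the monitor of the Account instance itself (i.e., "this").
public class Account {
    private int balance;

    public void withdraw(int amount) {
        synchronized (this) {      // same monitor as the synchronized-method form
            balance -= amount;
        }
    }

    public void deposit(int amount) {
        synchronized (this) {
            balance += amount;
        }
    }

    public synchronized int balance() { return balance; }
}
```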

Problems with the Current Paradigm

There are a number of problems with the current lock-based paradigm that have caused even experts to conclude that "concurrency is difficult".

Performance Bottlenecks

The key thing to remember about the traditional thread model is this: it is inherently pessimistic in its strategy. Potential problems are prevented by serializing access (i.e., one thread at a time) to any critical section of code. This is generally not a huge problem in uni-processor systems, where threads are already sharing processor time in a serial fashion anyway, but in multi-processor systems locks can create very real performance bottlenecks as other threads wait for the currently executing thread to exit a critical section. It's somewhat analogous to a multi-lane highway suddenly narrowing down to a single lane. One of the reasons transactional memory is being touted as the "silver bullet" of concurrency is that it avoids this bottleneck.

Not Modular

Good performance therefore dictates that synchronized blocks (bottlenecks) be kept as granular as possible*. But this can be hard to do, because mutex locking is a decidedly unmodular method of concurrency control. Let's take, for example, the Account class in Figure 2 above, and introduce a new class, BankTransfer, in Figure 3 below.

public class BankTransfer {
     public void transfer(Account from, Account to, int amount) {
          from.withdraw(amount);
          to.deposit(amount);
     }
}
Figure 3: The BankTransfer class

The transfer() method needs to be atomic, i.e., no thread should be able to observe a state where an amount has been withdrawn from one account but not deposited in the other. Even if the withdraw(int amount) and deposit(int amount) methods are both synchronized, the above transfer method is not atomic, because a thread that completes the first line may be suspended by the operating system, allowing other threads to read or write either account before to.deposit(amount) is called--a race condition. Synchronizing the transfer() method won't help, because another thread (with a handle on either the to or from Account instance) may be altering the state of one of the accounts outside the method. To be atomic, this method must be synchronized on both the from and to objects (of course, by doing this you have just opened yourself up to a potential deadlock...does your head hurt now?). In other words, doing concurrency properly here with locks requires some knowledge about the internal state and implementation of the Account class; this runs counter to the principles of encapsulation in OOP. As this example also demonstrates, locks are not composable: you cannot simply put them together in a modular way without frequently needing additional locks to ensure correct behavior.
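One conventional (if tedious) fix is to acquire both account monitors in a globally consistent order before touching either balance, which heads off the deadlock just mentioned. A sketch, assuming we can give Account a stable numeric id to order on (the id field and accessors below are my additions, not part of Figure 2):

```java
// Minimal Account with a stable id to order lock acquisition on.
// The id is an illustrative addition, not part of the original figure.
class Account {
    private final long id;
    private int balance;
    Account(long id, int balance) { this.id = id; this.balance = balance; }
    long id() { return id; }
    synchronized void withdraw(int amount) { balance -= amount; }
    synchronized void deposit(int amount) { balance += amount; }
    synchronized int balance() { return balance; }
}

public class BankTransfer {
    // Always lock the lower-id account first, so two concurrent transfers
    // between the same pair of accounts cannot deadlock each other.
    public void transfer(Account from, Account to, int amount) {
        Account first  = from.id() <= to.id() ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.withdraw(amount);
                to.deposit(amount);
            }
        }
    }
}
```

Note that this only works if every piece of code that locks multiple accounts follows the same ordering discipline--exactly the kind of global, cross-module convention that makes locks hard to compose.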

Deadlocks and Other Nasties

Deadlocks can occur when one thread holds a lock (on which another thread is waiting) while trying to acquire another lock (which is held by the other thread). This circular dependency causes both threads to "deadlock" or be held in a wait state with no way out. It generally happens when the same locks are acquired in a different order by different threads. Enforcing an ordering on how these locks are acquired is the way to solve this problem, but in practice potential deadlocks can be tricky to spot, and enforcing lock ordering is tedious and error-prone.

There are other problems as well: livelocks, thread starvation, and lock contention, to name a few.

Error Prone

It is very easy to make mistakes in a lock-based concurrent environment. The programmer has to think hard about the potential (concurrent) paths through his/her code and synchronize them properly. There are also many subtleties in dealing with concurrency that are easy to miss. When errors do occur, they can be very hard to debug because of the frequently intermittent nature of concurrency bugs: most of the time the code works, sometimes it doesn't. Errors can take months, or even years, to manifest themselves, so they often go unnoticed until a catastrophe hits. Edward Lee gives an example in "The Problem With Threads" of an otherwise carefully written piece of software that deadlocked four years after it was initially deployed.

Scalability

The complexity of getting locks "right" in concurrent software systems makes scaling lock-based concurrency (from a development perspective) difficult. As the number of lines in a program increases, so inevitably do the potential code paths that different threads may take, making proper synchronization that much harder. There are some optimizations (such as lock elision) that help ease the pain, but these are generally just band-aids on the larger problem.

Avoiding Shared (Mutable) State

Is there a better way? Some languages (generally functional programming languages) avoid concurrency issues altogether by doing one or both of the following:

1. Immutable Variables

When variables are immutable--that is to say, they are initialized once and only once--it has the nice effect of making them perfectly thread-safe (after they are initialized). Programming this way, though, can be pretty uncomfortable, especially to those used to imperative programming (where variables, um, vary). But there is a lesson to be learned here even for imperative languages: make variables immutable whenever possible. This is why concurrency gurus such as Brian Goetz have been beating the drum that "final is the new private [in Java]", emphasizing that immutability is now as important as encapsulation (though final does not guarantee object immutability in Java, but that's another story).
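In Java terms this means a value type with all fields final and no setters, where every "update" returns a new instance. A sketch (this ImmutableAccount class is mine, not one of the article's figures):

```java
// An immutable account: instances never change after construction, so a
// reference can be shared across threads and read without any locking.
public final class ImmutableAccount {
    private final int balance;

    public ImmutableAccount(int balance) { this.balance = balance; }

    public int balance() { return balance; }

    // "Mutators" return a fresh instance; the original is untouched.
    public ImmutableAccount withdraw(int amount) {
        return new ImmutableAccount(balance - amount);
    }

    public ImmutableAccount deposit(int amount) {
        return new ImmutableAccount(balance + amount);
    }
}
```

Of course, swapping which instance is the "current" account is itself a mutation that still needs coordinating--which is precisely where approaches like STM come in.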

2. Not Sharing State

A concurrent process without shared state has nothing to worry about. Much like immutability, not sharing state precludes the possibility of non-deterministic behavior. (This is the basis for the concept of message passing, which I will discuss in more detail below.) If threads or lightweight processes need to communicate, information is sent as a copy (or reference to an immutable object) from one to the other.

Alternative Methods of Concurrency

In the face of the problems associated with lock-based concurrency, other alternatives have emerged. Two of the more practical ones are the actor model and transactional memory. The first avoids sharing state by using message passing between concurrent "actors"; the second uses transactional variables that are committed after validating a transactional log. The first is easier to implement in existing languages in the near term, but the second is less intrusive to the programming model.

The Actor Model

Much of the recent buzz around the Erlang programming language has been because of its elegant handling of concurrency, built around what is called the actor model. Based on a network computing paradigm (it's not hard to imagine "actors" existing on different machines, passing messages over the network), the actor model mimics a message-oriented "client-server" architecture in the language itself. Each actor is the guardian of its own state, neatly side-stepping the need to synchronize blocks of code because nothing is shared; hence actors promote the concept of encapsulation to the "thread" (or lightweight process) level. The great strength of actor-based concurrency is that it creates code that is easily distributable over multiple processors or multiple nodes on a network, and is much more scalable than the existing shared-memory paradigm.

To illustrate the actor model with an example, I've written a simple program in Scala using its actors library, shown in Figure 4. The program is a guessing game in which the Game object picks a number between 0 and 9, and the client (the Main object) attempts to guess it (to a maximum of 10 tries, but guesses are not remembered...).

import scala.actors.Actor
import scala.actors.Actor._
import java.util.Random

case class Guess(number: Int)
case class Answer(correct:Boolean)

object Main 
{
  def main(args : Array[String]) : Unit = {
    val game = new Game()
    Console.println("Starting server...");
    game.start
    for(i <- 1 to 10) {
      val n = Math.abs(new Random().nextInt() % 10);
      Console.println("Guess..." + n);
      game !? Guess(n) match {
        case result => 
          if(result==Answer(true)) {
            Console.println("CORRECT guess: " + n);
            exit
          }
      }
    }
    //if didn't exit, no Answer(true) returned
    Console.println("No correct guesses");
    exit
  }
}

class Game extends Actor {
    def act {
       // number between 0-9
       val number = Math.abs(new Random().nextInt() % 10);
       Console.println("And the number is: " + number);
       loop {
         react {
           case Guess(n) =>
             Console.println("Receiving guess..." + n);
             if (n == number)
               sender ! Answer(true)
             else
               sender ! Answer(false)
         }
       }
    }
}
Figure 4: A simple guessing game in Scala. Look, Ma, no programmer-visible locks!

Here the Main object starts a server (game.start), then goes into a loop in an attempt to guess the number the Game has randomly generated. Each guess from the Main object is sent synchronously--i.e., the client needs to wait to see if the guess is correct before guessing again--via the !? operator. The server simply replies asynchronously via the ! operator. The client then exits when the game is over. All communication between the Game and the client is done via messages, which are immutable case classes.

An example output from running the code in Figure 4 might look like:

Starting server...
Guess...0
And the number is: 2
Receiving guess...0
(...omitted for brevity...)
Guess...8
Receiving guess...8
Guess...2
Receiving guess...2
CORRECT guess: 2

Peeking inside Scala's implementation of the event-based actor, you can see that the Actor class itself is actually a trait (Scala's version of a mixin) that provides the mailbox (a non-blocking message queue) as well as the necessary definitions of the message-passing operators.

trait Actor extends AbstractActor  {
  // mailbox of the actor
  private val mailbox = new MessageQueue
  
  /**
   * Sends message to this actor (asynchronous).
   */
  def !(msg: Any) {
    send(msg, Actor.self)
  }

  /**
   * Sends msg to this actor and awaits reply (synchronous).
   */
  def !?(msg: Any): Any = {
    val replyCh = Actor.self.freshReplyChannel
    send(msg, replyCh)
    replyCh.receive {
      case x => x
    }
  }
  //...etc.
}

The Scala actors implementation itself actually owes a lot to Doug Lea's Fork/Join framework (also known as JSR-166y...why? because we like it!), which Java programmers should hopefully see in Java SE 7...hint, hint. To anyone who has ever done programming in JMS, the actor model should look very familiar. In fact, "actor model" programming in any existing language (Java, C#, etc.) should not be that hard--see Links below--though doing it correctly really requires immutability as a first-class concept in the language.
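As a rough illustration of how the pattern maps onto plain Java (my own sketch, not the API of any of the frameworks linked below): each "actor" owns a private mailbox and a single thread that drains it, so the actor's state is only ever touched by that one thread and needs no locks.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A toy actor: the count is confined to the single thread that drains
// the mailbox, so no synchronization is needed around it.
public class CounterActor {
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private int count = 0;   // only ever touched by the actor thread
    private final Thread worker = new Thread(() -> {
        try {
            while (true) mailbox.take().run();
        } catch (InterruptedException e) { /* shut down */ }
    });

    public CounterActor() { worker.start(); }

    // Asynchronous send, roughly analogous to Scala's ! operator.
    public void increment() { mailbox.add(() -> count++); }

    // Synchronous ask, roughly analogous to Scala's !? operator: enqueue a
    // message carrying a reply channel, then block awaiting the reply.
    public int get() {
        BlockingQueue<Integer> reply = new LinkedBlockingQueue<>();
        mailbox.add(() -> reply.add(count));
        try {
            return reply.take();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public void stop() { worker.interrupt(); }
}
```

Because the mailbox is FIFO with a single consumer, a get() enqueued after a series of increment() calls is guaranteed to observe all of them.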

Getting back to the simple transfer() method in Figure 3, you'll notice that this type of problem requires a higher-level actor than just an Account, which is hard to do since there may be many different banks, each with its own accounts. In this case, one would have to model a kind of in-memory "distributed transaction" between different Account actors (most likely with one serving as a "coordinator"). This is a little more work than just "withdraw from one account, deposit into another", so the actor model is not necessarily the "silver bullet" for concurrency, although it does solve certain problems very elegantly.

(Software) Transactional Memory (STM)

Where the actor model is based upon a "network" view of concurrency, the concept of transactional memory is based upon a "database" view of concurrency using optimistic locking. Transactional memory enforces both atomicity and isolation within designated "atomic" blocks. Much like a database, transaction logs are kept for transactional variables, with the possibility that a transaction must be retried if validation of the log fails; as the classic caveat of optimistic locking goes, this model relies on such validation failures being rare. Overall, the great strengths of transactional memory are that (1) ideally, it does not force the average programmer to change the way they program, other than wrapping a piece of code in some type of "atomic" block; (2) it scales/performs better because optimistic locking is used; and (3) it is easy to use because, like garbage collection, it happens "like magic". So, is transactional memory the "silver bullet"? That still remains to be proved, although it looks very promising.
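The optimistic read-validate-commit cycle at the heart of this model can be caricatured in plain Java with a compare-and-set loop. This is not STM--there is no log spanning multiple variables--but it shows the "try, then retry if someone else committed first" shape on a single variable:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Optimistic update on one variable: read a snapshot, compute the new
// value, and commit only if no other thread changed the variable in the
// meantime; otherwise loop and retry against the fresh value.
public class OptimisticCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public void add(int delta) {
        while (true) {
            int seen = value.get();                      // read
            int next = seen + delta;                     // compute
            if (value.compareAndSet(seen, next)) return; // validate + commit
            // validation failed: another thread committed first, so retry
        }
    }

    public int get() { return value.get(); }
}
```

Real STM generalizes this to whole blocks of reads and writes over many variables, which is exactly what the per-transaction log is for.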

Here is the implementation of the transfer() method of Figure 3 in Haskell (borrowed from Simon Peyton Jones):

transfer :: Account -> Account -> Int -> IO ()
transfer from to amount
   = atomically (do { deposit to amount
                    ; withdraw from amount })

The interesting thing about Haskell is that STM is actually a type in the language. Within an atomically operation, non-STM actions (such as IO) are not allowed. Hence the type of atomically itself is: atomically :: STM a -> IO a. This, of course, is where the meat of the STM implementation is--initializing transactional logs, validating the logs, committing changes, etc.

In an existing popular language, such as Java or C#, STM might look like the code in Figure 5.

public void transfer(Account from, Account to, int amount) {
     atomic {
           from.withdraw(amount);
           to.deposit(amount);
      }
}
Figure 5: Hypothetical STM Construct

Although Sun (and possibly others) is working on implementations of transactional memory in hardware, programmers are unlikely to see widespread adoption of it in the near future. Languages with built-in software implementations include Haskell and Clojure, and add-on libraries exist for other languages. Unfortunately for the average programmer, STM is unlikely to make it into more popular programming languages like Java or C# any time soon.

Conclusion

Software Transactional Memory has the makings of a "silver bullet", but most programmers are unlikely to program in a language that directly supports it in the near future. The actor model is a little closer to being pragmatic, since it exists in languages that are gaining in popularity (Erlang, Scala). For the more adventurous, there are also add-on implementations of both concurrency models in abundance (see Links). Either way, the concurrent future is looking much brighter. For the average programmer, though, how soon it will be brighter is not certain.

Parting Thought: The Santa Claus Problem

John Trono published a new exercise in concurrency in the SIGCSE Bulletin back in 1994, dubbed the "Santa Claus problem". It goes something like this:

Santa Claus sleeps at the North pole until awakened by either all
of the nine reindeer, or by a group of three out of ten elves. He
performs one of two indivisible actions:

1) If awakened by the group of reindeer, Santa harnesses them to
a sleigh, delivers toys, and finally unharnesses the reindeer who
then go on vacation.
2) If awakened by a group of elves, Santa shows them into his
office, consults with them on toy R&D, and finally shows them
out so they can return to work constructing toys.

A waiting group of reindeer must be served by Santa before a waiting group of elves. Since Santa’s time is extremely valuable, marshaling the reindeer or elves into a group must not be done by Santa.

At first this seems to be a problem tailor-made for counting semaphores, and in fact this was the first solution to the problem. But a number of other solutions to the problem have been written using decidedly different methodologies (and languages). As an exercise in studying the application of different methods of handling concurrency, I've listed various solutions to this problem below:

  • Mr. Ben-Ari has published a solution written both for Ada 95 and Java, using more traditional lock-based methods.
  • Simon Peyton Jones has a solution written in Haskell in his "Beautiful Concurrency" chapter (a PDF version of the draft can be found here), using Software Transactional Memory.
  • Richard O'Keefe wrote a solution both in Haskell (STM) and Erlang (message passing).
  • Chris Grindstaff has written a solution for Fan, which uses message passing.
  • Adam Sampson has written a solution in occam-pi (note: code comments say it is "non-compilable")

(If you know of one that I've missed, feel free to leave a comment.)

Links

  • The Kilim framework, which implements an actor model for Java.
  • The Jetlang framework, which also implements an actor model in Java
  • The Retlang framework, which implements an actor model for .NET.
  • Brian Goetz's presentation at QCon on "Concurrency: Past and Present"
  • "Beautiful Concurrency", a draft chapter by Simon Peyton Jones for the O'Reilly book "Beautiful Code" that discusses Software Transactional Memory in Haskell
  • "Actors that Unify Threads and Events", Phillip Haller and Martin Odersky, describes Scala's implementation of the actor model.
  • "Scala Actors 101", a nice introduction to actors in Scala by Debasish Ghosh
  • JSTM, an open-source library for Software Transactional Memory in Java. See the Wikipedia page (under Implementations) for others.
  • Clojure, an implementation of Lisp that uses Software Transactional Memory and runs on the JVM.
  • Jim Larus, Ravi Rajwar, "Transactional Memory", Morgan and Claypool Publishers.
  • AtomJava, an implementation of STM for Java
  • Sun's DSTM2, a library-based approach to STM. Also, a paper describing it.

Notes

* Unless, of course, you have the situation where multiple blocks of code (e.g. methods) which are guarded by the same lock must be called in sequence. But there is a compiler optimization called lock coarsening that keeps the programmer from having to worry about this case.

Comments

Hmm. Interesting article. Thanks.

But any chance of a print-friendly version of your blog?

@Frank
I'll put in the enhancement request...:-)

Very nice article with lots of details. One thing though: Scala is not faithful to message-passing semantics, since a reference is passed in the message, even for mutable data. This effectively introduces shared, mutable state in actor programs.

@Rajesh
Yes, that's right, though in Scala's defense it does encourage passing immutable objects.

Great article. I'm treading in the same water these days and I couldn't have summarized this any better.

For anyone else interested in these topics, I maintain a concurrency link blog that covers exactly this kind of stuff.

@Alex
Thanks for the link! I'm already a reader :-)

What is so hard about memory management and pthreads? Typical whining from a Java weenie. Java's concurrency sucks because all you get is mutexes. If you used Java's only-just-hacked-on *real* concurrency primitives instead of braindead "synchronized", you wouldn't be complaining. Or maybe you would, considering you think pthreads are hard.

@Anonymous
You should take a look at Edward Lee's "The Problem With Threads"...or perhaps you think that a distinguished professor and chair of the EECS dept. at Berkeley is also a whiner? I've been using Doug Lea's library since before it was integrated into Java 5, and quite frankly wonder how I lived without them before...The problem is not that pthreads or concurrency primitives are hard (at least conceptually), the problem is that they are the "assembly language" of concurrency.

"wonder how I lived without them"
Same here, when I have to trudge around with Java. Java's problem is it was meant to be a simple language, and has been mutated to be able to do anything anyone asks of it, and therefore does things poorly until some band-aid library comes along to try and make things work.

As for the rest, I'm not saying pthreads are the answer; they're just not that *hard*. I think the answer to concurrency is the same as the answer was to I/O, task scheduling, and memory handling: let a general library do it. No one has to write a function that does inline assembly to put characters on the screen, so why should anyone have to write a parallel sort? We should have scalable data structures and algorithms that handle optimizing themselves to a multi-cpu arch. As with anything in Computer Science, once the data structures and their associated algorithms are done, the rest is trivial. However, threads should not be abandoned. That is just stupid. Threads map certain real-world concepts very well, and anyone who says 'down with threads' is either selling something or begging for attention.

And as for the paper, I don't care where he teaches, someone who pushes Object Oriented Design Patterns and "better languages" that will inevitably do little more than tie hands behind backs without actually proposing a solution to the "problem" he sees is just complaining. I'm not asking him to be Don Knuth, but it would be nice if he'd put up or shut up.

I don't think that anyone is arguing that threads themselves are bad, it is that threads+locks have a number of serious issues associated with using them directly. Neither the actor model nor transactional memory "does away" with threads (e.g., Scala's actor implementation uses an underlying thread pool)--or even locks for that matter--they simply put forth easier models for working with them. Ultimately, that makes working with concurrency easier and makes programming with them more productive.

The paper you pointed out pretty much directly claims "threads are bad".

Locks can be overused. So can anything. But in that case, locks are not the problem, programmers are, through ignorance or use of bad tools. Getting rid of locks is really really hard, and mortal programmers shouldn't be trusted with doing it. Getting rid of the BKL in Linux is a good example of this. And like I said, good libraries like TBB are the way to go about handling multicores.

Not that I think multicore is the direction we should be looking, but at least automagically scalable multicore aware libraries will help.

nice overview, i voiced very similar arguments here a little while ago: http://blog.splitbody.com/2008/9/12/concurrency-part1 and http://blog.splitbody.com/2008/9/19/concurrency-part-2-actors


There is another solution to the Santa Claus Problem in occam-pi:
http://offog.org/darcs/research/santa/santa-orig.occ

Thanks, Brian. I will update the article.


