Synchronization
The most difficult thing about multithreaded programming is synchronization. Multithreading, really, is a hack created for speed. The advantage of using separate threads instead of separate processes is that you don't need to flush the TLB when switching between them and you can pass pointers around (the second point is blurred on some systems[md]for example, anything with Mondrian memory protection and a big address space). The disadvantage is that every thread can access every block of memory, so the potential number of interactions grows exponentially with the number of threads.
This is why people using languages like Java tend to create more threads than people using C. There are no pointers and (more or less) no global variables in Java, so it is easier to reduce the number of potential interactions. People using languages like Erlang, which don't support aliasing of mutable objects between threads, can easily create tens of thousands of threads, because they do not have to worry about potential interactions between them except for the very few points that they explicitly created.
With threaded programming, the primitives you typically use for synchronization are things like mutual exclusion locks (mutexes) and condition variables. With the task-based parallelism model, you often don't need any of these. If you have tasks that must be executed serially relative to each other, you can put them into the same queue. Often, however, you do need tasks to rendezvous at some point.
As a simple example, consider a media player that decodes the audio and video frames concurrently. GCD provides dispatch groups as a trivial way of implementing this. In this example, you would create a new dispatch group for each video frame, and then associate all of the relevant audio frames with it. You would then add blocks decoding the audio and video frames to two separate queues with the same priority. GCD would then select the optimum number of threads to run these on and begin scheduling them. You would then use the dispatch_group_notify() function to submit the block that handled displaying the frame and playing the audio to another queue. This one would then not be added to a work queue until the other blocks had all completed.
The nice thing about this model is that you are defining relationships between tasks (for example, this task runs when all of these have finished), rather than locking specific resources. Whether this is a good fit for your application depends on your design.
There's been a lot of research in the last decade into optimizing concurrency in Java VMs. In Java, it is possible for the VM, using the same techniques it uses for accurate garbage collection, to be able to tell exactly which threads have access to a specific lock. If all of the Java threads that hold references to a lock are running on the same OS thread, then the JVM can remove expensive lock operations and replace them by setting a simple flag indicating that the other threads should not be scheduled, or replacing their calls to the synchronization primitives with ones that aren't really thread-safe. These tricks are not available to POSIX mutexes, because it is not possible for the pthread library to know which threads hold references to a given mutex. That is not the case with the implicit locking used by GCD. It knows exactly which threads hold references to which queues, and so can dynamically remove locks if they are not needed.