In general, optimization techniques usually complicate code more than they help it. Optimized code is tricky, error prone. Thus, whenever possible OptimizeLater
; concentrate on getting something working first, then find the inefficiencies and remove them.
In many cases, threads help to simplify code; for example, they replace code that would otherwise be used to manage polling. Moreover, some problems are naturally parallel and fit nicely with threads. However sometimes, threads are used to maximize processor utilization -- that is, to distribute the workload efficiently. In these cases, threads are optimizations.
Unfortunately, the efficiency gained from threads is quickly overcome by the combinatorial explosion in complexity they create. To interact, threads must use semaphores to detect each others' activities. This makes them very sensitive to each others' private states.
don't use threads for optimizations unless you have to. If you have to, prevent threads from interacting. If you can't even do this, use coarse-grained synchronization--lock everything. If you can guarantee that only one thread will ever interact with a given subsystem at a time, you can avoid the thorny issues of synchronization.
If this creates a bottleneck, then you can focus on optimizing where it is necessary, such as using fine-grained locking. By optimizing later, you can greatly simplify the code, making it easier to avoid defects in the first case, and find and correct them once they arise.
Some people have found multi-threading hard to retrofit, and advocate designing for it up front. For some applications at least, I would expect threads to be part of the initial system metaphor. As in, "Every request to the server will be handled by its own thread" - common in HTTP servers and JavaServlets
On the other hand, good OO design is often good thread design too, since both are about minimising interactions, and some people have found they can retrofit multi-threading after all. -- DaveHarris
It's often easier to fork() new processes than to use threads. This is especially appropriate when the threads don't have to communicate with each other. Moreover, fork() will give you all kinds of nifty optimizations, such as copy-on-write, for free. -- StephanHouben
That might be true in C. With Java, creating threads is much simpler and faster than creating new processes.
Also dependent on your OS; spawning new processes is generally cheap in Unix/Linux and expensive in Windows, for instance.
Another important issue to realise is that in the general case threading an application will make it slower. In fact the only exception to this rule I am aware of is when threading permits improved cache utilisation.
PierrePhaneuf claimed that, "On SymmetricMultiprocessing systems, there is no exception. Where on a single-processor system you had improved cache utilization, on an SMP you have a cache coherency problem and performance drops sharply." This is a gross generalization, and therefore must be wrong. In fact, there used to be the mystery of the SuperLinearSpeedup? -- how some programs would go more than twice as fast if you moved from one processor to two. Eventually, people figured out that by increasing the number of processors, they were also increasing the amount of cache, which explained why the program ran so much faster. --BillTrost
If you have 12 iterations of a module, and you already have reasonable cache utilisation, you will find it faster to run the system 12 times serially then 1 time with 12 in parallel.
Optimisation is very rarely a good reason to introduce threads. -- AndraeMuys
Consider I/O latency... If there is some, Consider it a lot.
Also threads blocked while waiting for a semaphore consume very little CPU time. If you are chasing the last 2% of the possible then A) get a life, B) Buy a bigger fridge and overclock the beast some more.
Most high-performance, low-latency TCP/IP servers are non-threaded, event-driven affairs, with various tricks (SIGIO, POSIX real-time signals, FreeBSD kqueue) used.
In CPU bound programs, I can easily see how threading would, in general, reduce performance rather than improving it.
But the most common exception to the "adding threads reduces performance" rule is when the threads are I/O bound -- making blocking calls for file or network communication tasks.
Yes, you can avoid threads in this case by using asynchronous APIs with callbacks -- but only if such interfaces are available:
Try to make asynchronous calls with embedded SQL, for example; it can't be done.
And some blocking calls have no asynchronous counterpart.
So, if you're spawning threads that will block on I/O calls that block (and which can really run "simultaneously"),
then threading can really produce substantial performance improvements.
Having just survived through a three-months long exercise of testing/tuning/debugging a multithreaded realtime event processor on a 24 CPUs UNIX box, I can't help but agree - if you need parallelism, prefer multi-process whenever possible.
The exercise was a lot of fun and eventually we won, but boy, was it a real war!
No, actually there are NO "blocking calls have no asynchronous counterpart" in POSIX or
in the "BSD" TCP socket API. In fact I/O is precisely where non-threaded programs work
ideally. Consider djbdns, betaftp, X, BSD's userland pppd, and a range of other
applications that run in a single thread, and handle multiple concurrent I/O sessions.
Okay, but the threaded solution might be a lot simpler, because then your code does not have to juggle the different computational tasks--which is what threads are for anyway! I strongly advocate SynchronousThreading?
and SynchronousIPC (e.g. SendReceiveReply
, a.k.a. AdaRendezvous?
). In combination with LightweightThreads?
is the ultimate synchronization/communication building block that leads to simple systems. If only the whole damn world hadn't gone off and built their systems around pipes and message queues and other broken
abstractions! (These asynchronous mechanisms are ideal only at the boundaries of the system, i.e. the parts that interact with hardware or users or whatever. The core of the system should really be built with a synchronous IPC mechanism. I apologize for this religious flame but I needed to vent some emotional baggage on this issue..
<rant type="counter">Unfortunately, synchronous IPC forms an abstraction that caters to the "easiest to implement in the kernel" category of software, not the "easiest for the programmer to deal with." However, you touch upon the crux of the solution in your rant.
What you want, indeed, is rendezvous. The Amiga operating system is built entirely around asynchronous
message passing using only
signal bits to put the CPU to sleep. Literally everything else in the OS builds on top of these "sigbits," including message passing itself. AmigaOS relies on the send/reply cycle, as you describe, and as a result, produced incredibly
clean and simple software.
(Footnote: AmigaOS got the sigbit idea from VMS, not from Unix. Further evidence that VMS got it right after all, and Unix should either evolve to accept well-engineered IPC solutions or shrivel up and die the horrible death it deserves if it doesn't.)
When examining the over-all structure of AmigaOS applications against Erlang applications, the similarities are uncanny. One must wonder how much AmigaOS influenced the development of Erlang.
However, the issue I have is with synchronous
IPC to make this happen. While it is wonderful
from the kernel implementation perspective (hey, look, no need to dynamically allocate message buffers and perform eager copying!), it's down-right rotten to the application coder. On many occasions, one will
want to send a rich message to another thread (sometimes even expecting a reply!) without having to wait for the rendezvous. Indeed, the more real-time the application, the greater the need for this situation to occur.
The usual solution to this conundrum is the instantiation of dedicated support threads
whose sole purpose in life is to "asynchronize" an otherwise synchronous interface. This creates a not-insignificant amount of overhead, and I'm positively not at all happy with this hack. They're also prone to failure, just as any other thread in a shared memory space is, which can result in both
end-points of a communications link deadlocking. Ouch!
This is why QNX hasn't conquered the desktop market; although a great embedded real-time OS, its UI performance is positively abysmal. Moving a mouse cursor on the screen results in noticable jumpiness, because the UI layers just don't see enough updates. Why? The software sitting between the mouse driver and the Photon GUI literally is dead-locked waiting for a rendezvous to complete. It positively is not
for lack of mouse driver events.
I recently booted QNX from a floppy on a 33Mhz 486 and I never noticed any problem with UI performance. Blaming QNX itself for the above behavior is suspect, let alone blaming the whole concurrency model it uses! Given that QNX is a "great embedded real-time OS" (used in nuclear reactors, for instance) I'm sure it can manage snappy UI performance ... In fact, I've witnessed it.
Network vendors solved this problem decades ago. They call their solution datagrams.
What I think we should explore is the use of a datagram model for asynchronous message passing in memory protected systems. That way, kernel resource consumption, while non-zero, remains bounded, and thus fully predictable. Yes, messages can be lost; so can network packets. User-space reliability protocols (implemented via shared libraries, of course, so as to not duplicate code everywhere) can exist to implement reliability when it's necessary (usually, with the exception of cut-and-paste operations, most UI interactions don't need it). You gain the benefits of asynchronous "touch-n-go" messaging, while keeping the kernel implementation simple. The payment primarily exists in user-space, where reliability protocols become necessary if you really
need it. Fortunately, it is in user-space where one can thankfully afford the cost.
Even if they make an application slower in real terms, having a separate UserInterface
thread can make it more responsive and thus seem faster.
See also: SynchronizationStrategies
which claims that ThreadsAreOptimizations