Fail Fast

Failfast refers to a lightweight form of FaultTolerance, whereby an application or system service terminates itself immediately upon encountering an error. This is done upon encountering an error so serious that it is possible that the process state is corrupt or inconsistent, and immediate exit is the best way to ensure that no (more) damage is done. Common errors that cause failfast include access violations and numeric exceptions, but may also involve internal consistency checks. This is an easy way to increase the reliability and predictability of a system.

FailFast also refers to some programming techniques that cause an exception to be thrown or other redirection of control to occur upon meeting certain conditions. For example, modification to a collection that is being iterated over in another thread may immediately cause an exception to be thrown, rather than allowing clients to continue using the now-invalid iterators.
Who first thought of FailFast, I have found a reference in Jim Grays (1985) paper "Why do computers stop and what can be done about it, is there an earlier reference? -- JoeArmstrong
Joe, a much earlier reference must be the principle of Jidoka, exemplified by the weaving loom invented by Sakichi Toyoda which cut the electric circuit to the machine if one of the weaving threads broke (1924). MalcolmSparks

Errr, Kris, have you described the whole process above, or only the first phase of it? It would seem to me that a restart would be needed? -- GarryHamilton

An automatic restart is not always appropriate. The failures that lead to failfast are often the kind of thing that somebody needs to check before restarting. The purpose of failfast is to prevent damage or generation of incorrect output, not to maintain system availability. But ApplicationRecycling may be warranted.

FailFast is useful but not always appropriate. For example in an application iterating over a collection of data, you don't want to FailFast when you hit the first erroneous data. Instead, you'd want to log errors for all the data, and then rollback. Unfortunately, with exception mechanisms being so simple, many programmers code FailFast as a default, and don't think of the consequences (interminable correction/run cycles in the above example). Rather than just terminating immediately in order to prevent the application doing any damage, the programmer must also try to help the user recover from the error as painlessly as possible. -- RomanStawski?

I'm basically JustaWebMonkey? doing AllaireColdFusion, but I quite often use the concept of FailFast or rather, GetTheHellOuttaHereFast? when doing user authentification. A lot of what I do is putting together security models for corporate Intranets and Extranets, and quite often the majority of code on a given web page will be to do with ascertaining if the user is logged in, what security groups they belong to etc. and comparing the results against the security requirements of the page in question.

This processing gets done before the actual page is served to the browser, and it's common practice to FailFast at the slightest hint of a security breach and GetTheHellOuttaHereFast? redirecting to a logon or error page. -- DarrenIrvine

Perhaps I'm taking the term FailFast too literally. For me it means terminate, immediately, now (cf the common Perl idiom: "do_something() or die"). While this is acceptable in an ad-hoc development environment, I've yet to see a production system where I can get away with it. Even in your example, you're redirecting to an error page, and I suspect that you're writing something to a security log somewhere. This is more a example of FailingGracefully? -- RomanStawski?

The FailFast philosophy is central to the ErlangLanguage - the Erlang motto is "just fail" and "let some other process do the error recovery" - this has been used in very reliable production systems with millions of lines of code -- JoeArmstrong

Oh sure, I see what you mean, and yes, the code does do some bits and bobs before "failing"; I suppose what I was using FailFast in the context of "don't even think about trying to do the proper processing for this page". And yes, sometimes in the development phase rather than doing the nice intrusion logging and redirection, I'll just dump some relevant data to the browser and abort, which is what I guess you were talking about in the first place -- DarrenIrvine

I use FailFast as a personal heuristic to FightProcrastination?. (Not speaking of writing code here, but of starting any task.) I procrastinate because I want my stuff perfect, when GoodEnough will often do to help me at least get started. My goal is to fail quickly so I can try out a number of avenues to find an acceptable path that will meet the requriements and the deadline. Otherwise, I sit and stew over what's the perfect path and nothing gets done. For some reason, if I set out quickly expecting to fail, it reduces my perfection anxiety, perhaps by short-circuiting my inner critic. -- MichaelBrown
I worked in an environment where fail fast was one of the central axioms: high availability transaction processing. If a process encountered any sort of failure it had to log the details of the failure and exit immediately. Every process was constantly monitored. If one went down it would automatically restart. There were multiple distributed servers (co-located thousands of miles from each other to tolerate disasters) with message routers configured to use alternate servers if one timed out.

The philosophy was that nothing should ever fail. When it did, the best thing to do was to let someone know and restart. Processes ran for months, restarted only when the machine was rebooted for system software or hardware changes. Any failure of any system demanded immediate human attention. Until the problem was resolved, the processes affected wouldn't try to mask the problem in any way. They assumed that the problem would be fixed by the time they restarted and ran into the same problem again. This worked very well because of multiple levels of redundancy. Sometimes developers didn't understand the axiom and tried to continue after a failure. That just made it harder to diagnose and repair.

Small airplane pilots, once they've taxied to the start of the runway, put on the brakes, and rev the engine to maximum RPM.

If the engine (or the brakes) fail, it's much, much better if they fail when we're sitting motionless on the ground, than when we're most of the way to our destination.

Whenever I write a CircularBuffer, I always initialize the pointers just before the the end of the buffer (rather than at the very beginning).

That way, if I mess up the wrap-around code, I don't get a HeisenBug that bites only after hours of operation. It becomes immediately obvious after the first few inserts.

(The wrap-around code is especially tricky when inserting variable-length objects.)

-- DavidCary

"if I mess up" ? I think you mean "when".
Klipstein's Fifth Prototyping and Production Law: A transistor protected by a fast-acting fuse will protect the fuse by blowing first.

Is there any difference between FailFast vs. ExposeErrors ? Well, other than "ExposeErrors" deals only with programming bugs (or does it ?), while FailFast applies to both programming bugs and also unexpected inputs.

Is there any difference between FailFast vs. OffensiveProgramming ?

See also: RecoveryOrientedComputing, PokaYoke

View edit of May 30, 2008 or FindPage with title or text search