My Notes
I hold these to be true:
- When throwing exceptions, if the caller could reasonably have detected the error condition
ahead of time, then the caller is buggy and the exception should be a runtime exception
(e.g., parameter validation). If the caller could not reasonably have detected the error
condition on its own, then the exception should be a checked exception.
- Don't die forward; die back. If deep in processing you encounter a fatal error, don't
call a fatal error handler that makes blocking calls to stop the system; throw a runtime exception
to unwind the stack, then stop the system. Don't hide (eat) unexpected exceptions; rather,
make sure they get propagated up as soon as possible. (Think Erlang.)
- As a corollary, use finally blocks to ensure invariants are sufficiently restored during an
unexpected exception to allow a service to be successfully
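
A minimal sketch of the three points above, in Java. The names (RecordService, StoreUnavailableException, the inFlight counter) are made up for illustration and are not from any real code base. Bad arguments are a caller bug and get a runtime exception; an offline store is something the caller could not have known about and gets a checked exception; the finally block restores the invariant either way.

```java
// Hypothetical checked exception: the caller could not have known the store was offline.
class StoreUnavailableException extends Exception {
    StoreUnavailableException(String message) { super(message); }
}

// Hypothetical service whose invariant is "inFlight == 0 whenever no call is running".
class RecordService {
    private int inFlight = 0;
    private boolean storeOnline = true;

    void setStoreOnline(boolean online) { storeOnline = online; }
    int inFlight() { return inFlight; }

    void save(String record) throws StoreUnavailableException {
        // The caller could have checked this itself, so it is a caller bug: unchecked.
        if (record == null || record.isEmpty()) {
            throw new IllegalArgumentException("record must be non-empty");
        }
        inFlight++;
        try {
            // The caller cannot know ahead of time whether the store is up: checked.
            if (!storeOnline) {
                throw new StoreUnavailableException("store is offline");
            }
            // ... write the record here ...
        } finally {
            // Runs on normal return and on any exception, restoring the invariant.
            inFlight--;
        }
    }
}
```

Note that the finally block does not catch anything. The exception keeps unwinding toward whoever is responsible for stopping the system.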
How should an event dispatcher (message queue) handle unexpected exceptions?
- By the time you reach the dispatcher level, you've lost all context.
You can't "handle" it or recover from it. You can only crash gracefully.
What's the best way to crash gracefully?
- First, you have an obligation to pass the exception up to any sort of system-level
unexpected exception handler (UEH). You can't just eat it unless you can guarantee that you
are dying forward (directly calling a UEH).
- The big question is, should you continue processing messages off the queue? The fear here
is that by processing more messages, you enter a loop generating infinite exceptions. On the
other hand, if you stop processing messages, you can hamstring a system and make graceful shutdown
impossible. It could be looked at as a question of liveness vs. safety. Which is worse, deadlock
- It seems like your invariant should give you a clue what to do (i.e., retreat to a safe
state that maintains your invariant). But what is the invariant for an event dispatcher?
- I think if you mix graceful shutdown tasks with panic shutdown tasks, that creates problems.
For example, the situation that's bothering me involved hanging while trying to stop a socket when
the unhandled exception occurred on the socket. A socket is not a critical resource. It doesn't
need to be cleanly shut down during a panic.
- I suppose that if you consider the system to be panicked, it doesn't matter if you fail forward
or fail backward. Once a component has broken its invariant, any subsystem that touches it
can also expect to break its invariant. In that case the goal should be to terminate the process
as fast as possible. Again, attempting a graceful shutdown from a panic state is a bad
idea. I suppose there's a difference between a fatal error that does not break invariants and a
panic where invariants are broken.
- So, how to handle my present predicament? Multiple times I've hung when an unexpected
exception is thrown while processing on an IO thread, because the event dispatcher is left
in a broken state and the shutdown handler tries to stop the socket.
- I've come to the following conclusions.
How do you handle fatal exceptions in large asynchronous systems?
- How do you coordinate shutdown in an asynchronous system?
- Note there is a difference between a graceful shutdown and a panic shutdown.
- Once a component has broken its invariant, any subsystem that touches it or depends on it
can also expect to die unpredictably. Deadlock becomes contagious.
- It's likely that multiple fatal exceptions will be generated. Later exceptions are
probably not interesting.
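
One way to sketch these conclusions in Java, under stated assumptions: PanicLatch and Dispatcher are hypothetical names, and the policy shown (record only the first fatal exception, stop dispatching once panicked, rethrow so the exception unwinds to whatever handler sits above) is just one possible answer to the liveness vs. safety question, not a settled one.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical panic latch: remembers only the first fatal exception.
class PanicLatch {
    private final AtomicReference<Throwable> first = new AtomicReference<>();

    // Returns true only for the caller that reported the first failure.
    boolean panic(Throwable t) {
        return first.compareAndSet(null, t);
    }

    boolean isPanicked() { return first.get() != null; }
    Throwable cause() { return first.get(); }
}

// Hypothetical dispatcher that dies back: it stops dispatching and rethrows.
class Dispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final PanicLatch latch;

    Dispatcher(PanicLatch latch) { this.latch = latch; }

    void post(Runnable task) { queue.add(task); }

    // Runs queued tasks until the queue is empty or the system has panicked.
    void drain() {
        while (!latch.isPanicked()) {
            Runnable task = queue.poll();
            if (task == null) {
                return;
            }
            try {
                task.run();
            } catch (RuntimeException | Error e) {
                latch.panic(e); // later failures do not overwrite the first one
                throw e;        // propagate up instead of eating the exception
            }
        }
    }
}
```

In a panic, the top-level handler could log latch.cause() and then call Runtime.getRuntime().halt(1), which terminates the JVM without running shutdown hooks. That avoids graceful steps such as stopping a socket that may already be in a broken state.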
What can you do when the caller is separated from the callee by a dispatch queue? Where should
an exception thrown by the callee go, since the caller is no longer on the stack?
- My ReplyListener interface works pretty well. The code that receives the return value must
also be able to receive an exception. This doesn't work when there's no return value though.
- One idea I thought was interesting would be to capture the caller's stack trace when the
asynchronous call was started. When an exception was thrown, the dispatcher could report the
whole virtual stack trace. The downside is that this adds the overhead of capturing a stack
trace to every call, even the successful ones, which are the majority.
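
A rough sketch of both ideas together, in Java. The ReplyListener shape below is a guess at what such an interface might look like, not the actual one, and TracingDispatcher is a hypothetical name. The caller's stack is captured by creating a Throwable when the call is posted and is attached to any later failure with addSuppressed, so that printing the exception shows both stacks.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Callable;

// Hypothetical shape for a reply listener: one path for values, one for failures.
interface ReplyListener<T> {
    void onReply(T value);
    void onException(Throwable error);
}

// Hypothetical single-threaded dispatcher that records where each call was posted.
class TracingDispatcher {
    private final Queue<Runnable> queue = new ArrayDeque<>();

    <T> void call(Callable<T> work, ReplyListener<T> listener) {
        // Captured on every call, including the successful ones: this is the overhead.
        Throwable callSite = new Throwable("asynchronous call posted from here");
        queue.add(() -> {
            T value;
            try {
                value = work.call();
            } catch (Exception e) {
                // Attach the caller's trace so the report covers the virtual stack.
                e.addSuppressed(callSite);
                listener.onException(e);
                return;
            }
            // Outside the try block, so a failure in the listener is not
            // mistaken for a failure of the work itself.
            listener.onReply(value);
        });
    }

    // Runs every queued call on the current thread.
    void drain() {
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run();
        }
    }
}
```

If the capture cost matters, one option is to make it switchable and only enable it while debugging.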