Crash safety using domains in Node.js

A look at how domains can gracefully deal with unhandled errors and uncaught exceptions

In a previous post about handling errors in Node.js programs, I took a look at how different flow control styles handle errors. I wanted to revisit the topic of error handling in a little more detail, and take a deeper look at an important error handling tool in Node.js: domains. We’ll see how domains work, why you would use them, and how to go about it.

Before we start, I want to recommend a useful article about error handling in Node by Joyent. I found it in their “Production Practices” documentation and it explains in good detail all of the essential information about errors in Node, such as the different kinds of error in a Node program and the scenarios in which you might encounter them. It acts as a great primer to the material in this article, so take some time to absorb it and carry on with this afterwards.

Why should I care about domains?

Domains were introduced in Node.js v0.8 as a way of dealing with unhandled ‘error’ events, uncaught exceptions, and even errors passed to callbacks via domain.intercept().

The general idea of domains is that if an error is not handled and is allowed to bubble up the stack, all the way out of your program until it reaches Node itself, you can get Node.js to notify your domain about the error instead of freaking out and immediately exiting the process. The domain is set up by your program, allowing your program to decide what to do with an unhandled error in a controlled manner. Usually, the best course of action is to gracefully shut down your program and exit the process at an appropriate time.

A good example of why we should use domains is a Node.js HTTP server. Imagine we have a simple HTTP server which invokes your program’s callback to handle each incoming request:


var http = require('http');
http.createServer(function(req, res) {
  // On 10% of requests this code will run and throw an exception
  if (Math.random() > 0.9) {
    woops.thisWillThrow();
  }
}).listen(8000);

You can see there is a mistake in the if block. If one of the requests tries to run that code, a ReferenceError is thrown because woops is not defined.

In the example of a web server, we will likely be serving many clients concurrently, and if an exception happens during a request from one client, it will crash the server and abruptly hang up on that client as well as all the other clients! This can lead to all kinds of nasty consequences, but at the very least it is damaging to our users’ experience.

So, here’s where we would use a domain. If we have a domain active, Node.js will tell that domain when an exception occurs. Our domain can then perform the necessary steps to gracefully shut down the program. In the case of our server program, we would want to:

  1. Close the server to stop accepting new connections. Hopefully this will also signal a terminating proxy (such as a load balancer) to stop routing requests to this process.
  2. Wait for existing connections to finish and close normally.
  3. Only exit the process once all clients have finished and disconnected happily.
  4. Failsafe: before starting the shutdown, it’s advisable to set an exit time limit. If the shutdown hangs or takes too long, something else is probably wrong so exit the process anyway.
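
The four steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: createSafeServer, exitTimeoutMs and the injectable exit parameter are hypothetical names introduced here, and creating one domain per request is only one possible arrangement.

```javascript
var http = require('http');
var domain = require('domain');

// Hypothetical helper (not from the article): runs each request handler inside
// its own domain and shuts the server down gracefully if a handler throws.
// `exit` defaults to process.exit; it is a parameter only so the sketch can be
// exercised without killing the process.
function createSafeServer(handler, exitTimeoutMs, exit) {
  exit = exit || process.exit;
  var shuttingDown = false;

  var server = http.createServer(function(req, res) {
    var d = domain.create();
    // Explicitly bind the request and response emitters to this domain
    d.add(req);
    d.add(res);

    d.on('error', function(err) {
      console.error(err.stack);

      if (!shuttingDown) {
        shuttingDown = true;

        // Step 4 (failsafe): exit anyway if the shutdown takes too long.
        // unref() stops this timer from keeping the process alive by itself.
        var killTimer = setTimeout(function() { exit(1); }, exitTimeoutMs);
        killTimer.unref();

        // Steps 1 to 3: stop accepting new connections, then exit once the
        // existing connections have finished and the server has closed.
        server.close(function() { exit(1); });
      }

      // Best effort: tell the client whose request triggered the error
      try {
        res.statusCode = 500;
        res.setHeader('Connection', 'close');
        res.end('Internal Server Error\n');
      } catch (e) {
        console.error('Could not send the 500 response:', e.stack);
      }
    });

    // Enter the domain and run the real handler inside it
    d.run(function() {
      handler(req, res);
    });
  });

  return server;
}
```

Something like createSafeServer(handler, 5000).listen(8000) would then start the server with a five second exit limit.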

What is a domain, exactly?

A domain is an EventEmitter object which, when “entered”, is accessible globally via process.domain and require('domain').active. Multiple domains can exist in the domain stack, which is simply an array inside Node holding the entered domains in the order they were entered. Only one domain can be active in the process at a time, and a domain is only active while it is “entered”. This is so that when Node.js is running code, it knows which domain should be notified about any unhandled errors that might occur.
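
To make the idea of a single active domain more concrete, here is a small sketch that records which domain process.domain points at while two domains are entered and exited. The names outer, inner and traceActiveDomain are only labels made up for this illustration.

```javascript
var domain = require('domain');

// Hypothetical helper: returns a label for the active domain at each step
function traceActiveDomain() {
  var outer = domain.create();
  var inner = domain.create();
  var trace = [];

  function label() {
    if (process.domain === outer) return 'outer';
    if (process.domain === inner) return 'inner';
    return 'none';
  }

  trace.push(label());      // nothing has been entered yet
  outer.run(function() {
    trace.push(label());    // outer is entered and active
    inner.run(function() {
      trace.push(label());  // inner is entered on top of outer
    });
    trace.push(label());    // inner has exited, outer is active again
  });
  trace.push(label());      // both domains have exited

  return trace;
}

console.log(traceActiveDomain().join(' -> '));
// prints "none -> outer -> inner -> outer -> none"
```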

How to use domains

By attaching a listener to the domain’s 'error' event, you control what happens when the domain is notified of an error:


var domain = require('domain');
var d = domain.create();
// Domain emits 'error' when it's given an unhandled error
d.on('error', function(err) {
  console.log(err.stack);
  // Our handler should deal with the error in an appropriate way
});

// Enter this domain
d.run(function() {
  // If an unhandled error originates from here, process.domain will handle it
  console.log(process.domain === d); // true
});

// The domain has now exited. Errors thrown by code past this point will not be routed to it.

In this example, we’re setting up a domain with an ‘error’ handler. We call d.run to enter the domain. The domain will be active while the code in the callback runs. After the function completes, the domain is exited. We’re not waiting on any asynchronous I/O here so the code in this example runs synchronously in the context of the domain.

What happens when we need to run a callback asynchronously, where the callback could be called after the domain has exited? How do we still ensure the domain will capture unhandled errors thrown by these callbacks? Node has a built-in method of doing this: domain binding.

Implicit/explicit binding on event emitters and timers

The documentation explains implicit and explicit binding.

Asynchronous callbacks are called by events and timers in Node. When you have a callback which is waiting for the result of a particular I/O operation, the callback will be called on a different tick of the event loop to the one in which the I/O operation was queued, and after the domain has exited.

To make sure their callbacks are called in the context of a domain, event emitters and timers are implicitly bound to a domain if it is active when they are created. This is what is meant by implicit binding.

It is possible to re-assign the event emitter or timer to another domain using explicit binding. Likewise, the explicitly bound domain will be active while the callback is run. You can see this happening in events.js.

Let’s see what happens with an event emitter:

var EventEmitter = require('events').EventEmitter;
var domain       = require('domain');

// Create 2 domains
var d1 = domain.create();
var d2 = domain.create();

// Enter the first domain
d1.run(function() {
  // This emitter is implicitly bound to d1
  // because it is created while process.domain === d1
  var implicitEmitter = new EventEmitter();
  implicitEmitter.on('someEvent', function() {
    console.log(process.domain === d1); // true
  });

  implicitEmitter.emit('someEvent');

  // Explicitly bind this emitter to another domain
  var explicitEmitter = new EventEmitter();
  d2.add(explicitEmitter);

  explicitEmitter.on('someEvent', function() {
    console.log(process.domain === d2); // true
  });

  explicitEmitter.emit('someEvent');
});

So, let’s recap on what’s happening here.

Whenever an event is emitted on the EventEmitter object, the EventEmitter enters its bound domain immediately before the event listener is called, and exits the domain straight afterwards. This ensures that any error generated by the event listener will be routed to the EventEmitter’s domain, and not the domain that might have been active before the event was fired.

This is one of the confusing aspects of domains: they exist globally, but event emitters and timers switch the global domain to their bound domain while running their listeners. Therefore, we can think of domains as different error handling contexts that switch as the thread of execution enters different areas of your program.
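
The same switching applies to timers. The following sketch, which uses a made-up log array purely for illustration, creates a timer while a domain is active, lets the domain exit, and then throws from the timer callback.

```javascript
var domain = require('domain');

var d = domain.create();
var log = [];

d.on('error', function(err) {
  log.push('domain caught: ' + err.message);
  console.log(log.join('\n'));
});

d.run(function() {
  // The timer is created while d is active, so it is implicitly bound to d
  setTimeout(function() {
    // d.run() returned long ago, but d is entered again for this callback
    log.push('in timer, process.domain === d: ' + (process.domain === d));
    throw new Error('thrown from timer');
  }, 10);
});

// d.run() has returned, so no domain is active at this point
log.push('after run, domain active: ' + Boolean(process.domain));
```

Note that the throw happens inside the timer callback rather than directly inside d.run(). An exception thrown synchronously would unwind the whole call stack before reaching the domain, so the statements after d.run() would never execute.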

How errors are routed

Now that we’ve seen how domains are set up and used, what actually happens when errors occur? That depends on the type of error, and as we’ve seen, there are 3 types: ‘error’ events, exceptions and error arguments passed to callbacks.

‘error’ events

As we know, ‘error’ events are a special event type in Node. They are emitted by event emitters whenever there is a problem, for example socket errors such as 'EACCES' triggered by net.createServer().listen(1).

When ‘error’ is emitted on an event emitter:

  1. If there are one or more ‘error’ listeners, call them with the error.
  2. Else, if there is a bound domain, emit ‘error’ on it.
  3. Else, convert to exception and throw.

You can see this logic in events.js. The following example illustrates the propagation of steps 1-3:


var EventEmitter = require('events').EventEmitter;
var domain       = require('domain');

var emitter = new EventEmitter();

// Bind to domain
var d1 = domain.create();
d1.on('error', function(err) {
  console.log('Handled by domain:', err.stack);
});
d1.add(emitter);

// Attach listener
emitter.on('error', function(err) {
  console.log('Handled by listener:', err.stack);
});

emitter.emit('error', new Error('this will be handled by listener'));
emitter.removeAllListeners('error');
emitter.emit('error', new Error('this will be handled by domain'));
d1.remove(emitter);
emitter.emit('error', new Error('woops, unhandled error. This is converted to an exception. Time to crash!'));

Exceptions

Exceptions bubble up the call stack until they are caught by a try/catch block. If they are not caught, they climb out of your program, causing Node to do the following:

  1. Check if there is a domain active. If there is, it will pass the exception to the domain by emitting it as an 'error' event on the domain object.
  2. Else, Node will check for any listeners for the process’ 'uncaughtException' event (there shouldn’t be) and call them with the error if they exist.
  3. Else, Node will pack up shop and terminate the process. An uncaught and unhandled exception essentially means the program is broken, so exiting is probably the best thing to do.
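
The first two steps can be observed with a sketch like the one below. The routed array and the onUncaught listener are assumptions added for the demonstration; the 'uncaughtException' listener is installed only to make step 2 visible and is removed again at the end.

```javascript
var domain = require('domain');

var routed = [];

var d = domain.create();
d.on('error', function(err) {
  routed.push('domain: ' + err.message);
});

// Demonstration only: a real program would normally not install this listener
function onUncaught(err) {
  routed.push('process: ' + err.message);
}
process.on('uncaughtException', onUncaught);

// Step 1: the timer is bound to d, so the exception is emitted on d
d.run(function() {
  setTimeout(function() {
    throw new Error('inside domain');
  }, 10);
});

// Step 2: no domain is active for this timer, so the process listener gets it
setTimeout(function() {
  throw new Error('outside domain');
}, 20);

setTimeout(function() {
  process.removeListener('uncaughtException', onUncaught);
  console.log(routed);
}, 50);
```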

Callback arguments

In Node, the convention for callback arguments is function(err, arg1, ...). If the asynchronous operation failed, err should be passed as the first argument to the callback and be an instance of Error.
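
As a quick illustration of that convention, here is a minimal sketch built around a hypothetical parseJsonLater helper (it is not part of Node or of the original examples).

```javascript
// Hypothetical helper: parses JSON on a later tick and reports the outcome
// using the function(err, result) convention.
function parseJsonLater(text, callback) {
  process.nextTick(function() {
    var value;
    try {
      value = JSON.parse(text);
    } catch (err) {
      // Failure: pass the Error instance as the first argument
      return callback(err);
    }
    // Success: err is null and the results follow
    callback(null, value);
  });
}

parseJsonLater('{"ok": true}', function(err, value) {
  if (err) {
    return console.error('failed:', err.message);
  }
  console.log('parsed:', value);
});
```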

Binding domains directly to callbacks

When a callback runs, it is typically within a call stack with an event emitter or timer listener at the top. If the listener is called in the context of a bound domain, any uncaught errors coming from that callback will be handled by the domain.

However, if you’d like to ensure the callback runs in the context of a different domain, you can bind it to the desired domain using domain.bind(fn).
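
A minimal sketch of domain.bind(fn) might look like this. The events array and the onResult callback are made up for the example, and the timer is deliberately created outside of any domain so that only the binding decides where the error goes.

```javascript
var domain = require('domain');

var d = domain.create();
var events = [];

d.on('error', function(err) {
  events.push('domain caught: ' + err.message);
  console.log(events.join('\n'));
});

// This callback is defined and scheduled outside of any domain...
function onResult() {
  events.push('process.domain === d: ' + (process.domain === d));
  throw new Error('thrown from bound callback');
}

// ...but the wrapper returned by d.bind() enters d before calling it
var bound = d.bind(onResult);

setTimeout(bound, 10);
```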

Furthermore, in a callback, you’d usually check for the err argument and handle it accordingly. Using domain.intercept(fn), you can instead delegate the error checking and handling step to a domain.
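
Here is a small sketch of domain.intercept(fn) in action. fakeRead is a hypothetical stand-in for any asynchronous operation that follows the function(err, result) convention, and seen is just a log for the example.

```javascript
var domain = require('domain');

var d = domain.create();
var seen = [];

d.on('error', function(err) {
  seen.push('domain got: ' + err.message);
});

// Hypothetical async operation following the function(err, result) convention
function fakeRead(shouldFail, callback) {
  setImmediate(function() {
    if (shouldFail) {
      return callback(new Error('read failed'));
    }
    callback(null, 'file contents');
  });
}

// The intercepted callback never receives err. If the first argument is an
// Error, it is emitted on d and the callback is not called at all.
var onData = d.intercept(function(data) {
  seen.push('callback got: ' + data);
});

fakeRead(false, onData);
fakeRead(true, onData);

setTimeout(function() {
  console.log(seen);
}, 50);
```

The trade-off is that the callback can no longer react to individual failures, so this style fits best where every error should end up in the same domain handler anyway.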

In conclusion

I was a little confused by domains at first. The domain stack and global active domain concept can get confusing. It all makes sense when you understand how event emitters and timers can be implicitly and explicitly bound to domains, and know how the different types of errors travel through your program. It really helps to read through the actual Node source code to follow the path of errors through domains and event emitters.

Once that understanding is in place, domains are a very useful tool to help you handle crashes in your Node programs gracefully. They should be used with care due to the dangerous nature of uncaught exceptions, but when harnessed properly they can help take the edge off nasty crashes and error conditions.

  • Abdul Hannan Ali

    This article is really helpful Thumbs up

  • Pedro Checkos

    Thanks for this great article. One question though: in my domain error handler I try to return a 500 then immediately call server.close() before killing the process. However, each time I call server.close() I get an exception ‘Not Running’ from Node. My server is therefore not able to deliver the 500 to the client. Did you also encounter this issue?

  • Can you please update your link for “Before we start, I want to recommend a useful article about error handling in Node by Joyent”. It’s outdated.