Doctrine.fr is a legal search engine which allows to search in the French legal data, read court decisions, legislation, commentaries and receive legal news. Our backend has been entirely implemented in Node.js for more than 2 years (we started with Ruby on Rails but switched to Node.js after a few months).

Like many professional users of Node.js, we were having a hard time diagnosing errors and finding the associated client requests that caused them, especially in the case of uncaught exceptions. But a few months ago (July 2018), we were able to come up with a solution using Async Hooks, an experimental API in Node.js. In this article, we share the problem and the solution we were able to implement.


The problem

In “classic” languages used to write web applications, like PHP or Ruby on Rails, HTTP requests are typically handled by separate threads spawned to handle each request. In case of an uncaught exception, it is then obvious to determine which HTTP request caused the error, which then allows to reproduce it. For instance, an uncaught exception in Ruby on Rails is automatically reported with all the details of the HTTP request.

But Node.js is different. In Node.js, all the code you have written is executed in the same thread, and Node alternates between the different requests. For instance, look at the following very simple route declaration in Express, which simply asks the database for time and returns it to the user:

router.get('/date_from_db', function a(req, res, next) {
 db.execute('SELECT now()', function b(error, result) {
   if (error) return next(error)
   res.send(result.rows[0].now)
 })
})

The way to handle requests is one of the main reasons why Node.js is so popular, because it makes it possible to handle many simultaneous requests without paying the cost of launching several threads/processes.

However, there are drawbacks. One is that you should avoid running long computations in Node.js, since they block the main Node thread. Node.js is very efficient only if all long operations (> 100 ms) are external (DB queries, file reads, etc.).

The other main problem, which interests us here, is debugging errors. If the error is nicely handled, like the DB error above (with “if (error) return next(error)”), it will “go up” to your error handler and you will know which HTTP request caused the error, with a log typical entry like this:

GET /date_from_db 500 10.219 ms - 1041

If you use an advanced error reporting tool (we use Rollbar), you will get a lot more information, like user agent, referrer, logged in user, etc.

Now let’s see what happens with an uncaught exception. Here we create a function called crashInCallback() which only goal is to crash Node. And then we make an HTTP request to /crash1 or /crash2 (I won’t tell you which).

17    function crashInCallback() {
18      db.execute('SELECT now()', () => {
19        let x = null
20        x.crash = 4
21      })
22    }
23    router.get('/crash1', function a(req, res, next) {
24      crashInCallback()
25    })
26    router.get('/crash2', function a(req, res, next) {
27      crashInCallback()
28    })

Here is what you get in the log:

boa ➜  myapp npm start> myapp@0.0.0 start /Users/raphaelchampeimont/myapp
> node ./bin/www/Users/raphaelchampeimont/myapp/routes/index.js:20
   x.crash = 4
           ^TypeError: Cannot set property 'crash' of null
   at Timeout.db.execute [as _onTimeout] (/Users/raphaelchampeimont/myapp/routes/index.js:20:13)
   at ontimeout (timers.js:498:11)
   at tryOnTimeout (timers.js:323:5)
   at Timer.listOnTimeout (timers.js:290:5)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! myapp@0.0.0 start: `node ./bin/www`
npm ERR! Exit status 1

Here Node crashes. Not only we don’t have HTTP request information, but we don’t even have a call stack going back to the controller. Here the only information that we have is “line 20 in index.js”, which is where we do the “x.crash = 4” which crashes the application. But there is not information at all on whether /crash1 or /crash2 was requested. In a real-world application like Doctrine.fr, there may be dozens of possible routes which could have called the piece of code where the crash occurred. This makes it almost impossible to diagnose. The only way is to go to your logs, look at all the requests that were received before the crash, and try to guess which may have triggered the error.

How can Async Hooks help?

Async Hooks is an API which has been designed to “link” the execution context of the parent function to the execution context of the callback. Basically, if we go back to the execution, Async Hooks can tell you “you are going to execute the “b” function, which was triggered by the “a” function”.

Async Hooks in itself is not an automatic solution to make error messages better. You have to do some work depending on how you want to use it.

At Doctrine, we use it in 3 different ways:

  1. Be able to magically know the current Express request at any point in the code, even if we have not passed the actual “req” object (Express HTTP request).
  2. Provide long stack traces which contain the concatenation of stack traces from successive callbacks.
  3. Detect long synchronous operations.

Feature 2 looks cool but is very CPU intensive. We tested deploying it in production but it required to multiply by 4 the number of servers if we wanted to keep it. In the end, we realized that feature 1 is really the killer one.

A simplified proof of concept implementation

Here is a very simple Express app (auto-generated with the Express app generator) to which we added our Async Hooks implementation of “feature 1” above. The key parts are the asyncHooksDoctrineSimplified.js file and a section we added to app.js.

Now let’s see what happens when you try we go to http://localhost:3000/crash1 for example (crash2 is the same):

*** Application crashed (uncaught exception) ***Path triggering the error (this is the magic): /crash1
TypeError: Cannot set property 'crash' of null
   at Timeout.db.execute [as _onTimeout] (/Users/raphaelchampeimont/async-hooks-demo/myapp/routes/index.js:22:13)
   at ontimeout (timers.js:498:11)
   at tryOnTimeout (timers.js:323:5)
   at Timer.listOnTimeout (timers.js:290:5)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! myapp@0.0.0 start: `node ./bin/www`
npm ERR! Exit status 1

In order to add information to errors, we catch uncaught exceptions in app.jsand use our async hooks system to get the corresponding request. The result is that we can report the request is on /crash1, something we are unable to do without async hooks.

Since we get the “req” object, we could also display a lot of useful information, like HTTP heads, request body, logged in user, cookies, etc.

To be able to use this implementation for real in Doctrine, we also needed to instrument some of our middleware or systems with AsyncResource (which is quite simple and requires just a few lines of code). This was the case for 3 systems: our Redis-based request queue middleware, our DB access layer, and a worker-spawing computation.

How it helped us

Prior to Async Hooks, we had these 3 cases:

  1. The error goes up all the callback chain to Express, where it is reported with the associated HTTP request. This is the nice case where Async Hooks is unnecessary .
  2. We want to report an error but not to respond to the client with an HTTP error. In that case, we use to pass “req” in many function calls where its sole purpose is to allows calls like “reportError(err, req)” where reportError is our internal error-reporting function. Without this req object, the associated HTTP request cannot be known. So we created bad code by adding “req” as an argument to many functions.
  3. Uncaught exceptions. This is the case explained above, and we had not fix around that.

After deploying Async Hooks, we analyzed how it helped us improve error association, so we analyzed our most frequent error and whether we could associate them with client HTTP requests.

  • The big blue piece “both AH and explicit” which means that Async Hooks was able to guess the HTTP request, but we already had it anyway, sometimes at the cost of code readability in case 2 above. These are cases where Async Hooks can help simplify the code.
  • The green part “AH only” is the big win: cases where we did not have HTTP request information before, but have it now. In those cases, Async Hooks therefore adds information.
  • The red part “explicit only” is cases where Async Hooks was unable to guess the HTTP request, but we could get it with traditional methods. This part shows the limit of our Async Hooks implementation, which still requires explicit passing of the “req” argument. These cases are probably caused by insufficient instrumentation with AsyncResource.
  • The yellow part “no association” are errors for which we are still unable to track the associated HTTP request, neither with Async Hooks nor with traditional methods.

Next steps

The next steps for us are:

  • simplify the code by removing “req” arguments which where used only for error reporting
  • improve Async Hooks instrumentation with AsyncResource to reduce the “explicit only” and “unknown” parts of the chart above

We are also investigating the possibility to release our entire Async Hooks system as an open source project. More about this in a future post.

Finally, if you want to explore this new technology with us, we are looking for talented and motivated people to grow our backend team in Paris! You can find more information here.