Doctrine.fr is a legal search engine that lets users search French legal data, read court decisions, legislation and commentaries, and receive legal news. Our backend has been implemented entirely in Node.js for 3 years.

Most of the search engine load relies on external components that are called asynchronously by our backend code, which perfectly suits the Node.js architecture. Nevertheless, for historical reasons, some of this code still contains long synchronous operations. It was therefore mandatory for us to detect these long synchronous hotspots in order to eliminate them.

We will start by describing the problem with long synchronous operations in Node.js. We will then present our analysis method and its results, which allowed us to make the hypotheses that drove our work on fixing the most important hotspots.

On synchronous operations in Node.js

Most Web frameworks use the same approach to serve an HTTP request, which can be modeled as a different thread being used for each request. The following figure illustrates this model:

[Figure: one thread per HTTP request in the traditional framework model]

We call this approach the traditional framework model.

An asynchronous operation happens when work is delegated to another resource, for instance an I/O call, and processing is suspended while waiting for the response. In the traditional framework model, this behavior can be illustrated as follows:

[Figure: threads suspended while waiting for asynchronous responses in the traditional framework model]

This figure shows that most threads are suspended waiting for a response.

A synchronous operation is an operation where the thread is not suspended. In the figure above, the call and respond operations are synchronous. Having synchronous operations is inevitable and normal. Most synchronous operations are short.

In this article, we define a long synchronous operation, in the context of a Web server, as a long-running task that runs for tens or hundreds of milliseconds before returning a result. When a long synchronous operation occurs in the traditional framework model, the thread handling the request is busy executing the synchronous operation, but the other threads keep running and continue serving the other requests:

[Figure: a long synchronous operation occupies one thread while the other threads keep serving requests]

Node.js is based on the hypothesis that synchronous operations are always short and that most operations are asynchronous. Requests are served by a single thread that sends queries and receives asynchronous responses, thereby interleaving the execution of several requests at the same time:

[Figure: the single Node.js thread interleaving the execution of several requests]

This architecture is extremely efficient, optimizing resource usage by avoiding the cost of managing multiple threads. On the other hand, this model is defeated when a long synchronous operation occurs:

[Figure: a long synchronous operation blocking the single Node.js thread]

In this figure, we see that the thread is monopolized by the first request, and the Node.js engine is not able to serve the second request until the long synchronous operation finishes.

The extent of the problem

In a previous article, we described how we instrumented Async Hooks to analyze errors. It was only a small step to extend that instrumentation to notify us of long-running synchronous operations:

[Figure: warning emitted when a long synchronous operation is detected]

The warning threshold was set to 1 second. It appeared that we had more than 300 synchronous operations per day with a duration above 1 second.
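As an illustration only, the notification boils down to a simple threshold check on the measured duration. The helper below is a hypothetical sketch (in our code, the check lives in the reporting callback of the instrumentation described in the following sections); it uses fs.writeSync because an asynchronous log call made from inside an async_hooks callback would itself trigger the hooks.

const fs = require('fs')

const WARNING_THRESHOLD_MS = 1000 // illustrative constant: warn above 1 second

// Hypothetical helper, called with the process.hrtime() delta measured by the
// instrumentation shown below.
function warnIfLongSyncOperation(delta_time_struct) {
  const duration_ms = delta_time_struct[0] * 1e3 + delta_time_struct[1] / 1e6
  if (duration_ms >= WARNING_THRESHOLD_MS) {
    fs.writeSync(process.stderr.fd, `Long synchronous operation: ${Math.round(duration_ms)} ms\n`)
  }
}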

We had removed the most flagrant long synchronous operations long ago, so we knew that the remaining ones would probably look like harmless pieces of code, for instance a simple loop whose actual performance is dramatically impacted by the size of the user input.

Our codebase is large and complex, so we needed a way to detect these synchronous hotspots.

Classifying synchronous operations

Before going further, we first defined classes of synchronous operations according to their duration and their overall impact on the application:

Duration              Impact
More than 1 second    Monsters that can crash the application
100 ms - 1 second     Very slow, strong impact on user experience
10 ms - 100 ms        Slow, can hurt priority tasks
1 ms - 10 ms          Medium, acceptable if infrequent
Less than 1 ms        Fast

When a monster operation, with a duration above 1 second, occurs, our monitoring system may consider Node.js unresponsive, in which case it would kill and restart the server, thereby losing all the requests currently being processed. For the user, this would amount to an application crash.

The very slow operations have a visible impact on the user experience, slowing down the whole server and giving a feeling of sluggishness.

Priority tasks are simple tasks that cannot afford to be slow; the most prominent example is the autocomplete feature, which proposes completions while the user is typing. This feature is only useful if the completions arrive as fast as the user types, which requires a backend time of less than 40 ms, an impossible goal if a slow synchronous operation (10 to 100 ms) is monopolizing the execution thread.

We deem operations between 1 ms and 10 ms acceptable if they are not too frequent, and we consider durations below 1 ms fast (this class is actually divided into multiple subclasses that we will not describe further).

As discussed above, we knew that we had more than 300 monster operations per day, which may seem a lot, but is in fact a negligible percentage of the whole set of queries, and these monsters can be traced back to various infrastructure glitches and edge cases. On the other hand, we did not know anything about the other classes. In particular, we had no information about the very slow, slow and medium classes.

Recording duration of synchronous operations

In order to analyze synchronous operations, we created a module that encapsulates async_hooks.

This module first adds a hook to handle measurements, by calling createHook:

const AsyncHook = require('async_hooks')

AsyncHook.createHook({
  init: init,
  destroy: destroy,
  before: before,
  after: after,
}).enable()

The module instantiates a simple map to store information about each asynchronous call that is started.

As described in the Node.js documentation, the init and destroy callbacks are called when a class that has the possibility of emitting an asynchronous event is constructed (respectively, destroyed). The init function simply creates a tracking object in the map, the key being the asyncId of the call. Note that it is mandatory to remove this object from the map in destroy, to avoid memory leaks.
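For illustration, here is a minimal sketch of what these two callbacks can look like, assuming the module-level map described above (the exact fields of the tracking object are illustrative):

const map = new Map()

function init(asyncId, type, triggerAsyncId, resource) {
  // Create the tracking object for this asynchronous resource.
  map.set(asyncId, { type: type, before_time: null })
}

function destroy(asyncId) {
  // Mandatory cleanup: without it, the map would grow with every asynchronous resource.
  map.delete(asyncId)
}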

Once this data structure is in place, it is fairly easy to record how long each callback runs synchronously. The aptly named before callback is called just before the callback of an asynchronous resource is executed, and is implemented as follows:

function before(asyncId) {
  const async_data = map.get(asyncId)
  if (async_data == null) return
  async_data.before_time = process.hrtime()
}

The object associated with this asyncId is first fetched, then the start time is stored in the form of a high-resolution timestamp, which is far more accurate than using a Date object and has nanosecond precision.

The after callback is executed once the corresponding callback has completed:

function after(asyncId) {
  const async_data = map.get(asyncId)
  if (async_data == null) return
  const delta_time_struct = process.hrtime(async_data.before_time)
  if (reportSyncTimeCallback != null) {
    reportSyncTimeCallback({
      /* ... */
      hrtime: delta_time_struct,
    })
  }
}

The start time is retrieved from the map based on the asyncId, and the elapsed time is calculated. A reportSyncTimeCallback module variable (declared elsewhere in the module, with a setter) contains a callback used to report various information about the call (including data other than what is needed for the duration calculation).
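For reference, such a module variable and its setter can be sketched as follows (the setter name is illustrative, not the actual one):

let reportSyncTimeCallback = null

// Illustrative setter: the consumer of the module registers its reporting function here.
function setReportSyncTimeCallback(callback) {
  reportSyncTimeCallback = callback
}

module.exports = { setReportSyncTimeCallback: setReportSyncTimeCallback }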

The function we assign to reportSyncTimeCallback does not store the durations themselves, but a histogram, where each bucket is one of the duration classes discussed above. Each bucket contains the number of synchronous calls falling into that duration class. Every ten seconds, we log the histogram line and reset all the buckets to 0. We then have data looking like the following:

SYNC_TIMES_HISTOGRAM 0 0 0 139 1250 446 663 2 0 0
SYNC_TIMES_HISTOGRAM 0 0 0 162 1268 512 704 6 0 0
SYNC_TIMES_HISTOGRAM 0 0 0 139 1236 444 632 16 0 0

Storing only the counts, rather than the durations themselves, ensures a constant storage size.
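As a sketch, the aggregation can be implemented along these lines, reusing the setter shown above; the bucket boundaries here are simplified compared to our actual classes (our real histogram has ten buckets, as visible in the log lines above):

// Simplified duration classes, in milliseconds (illustrative boundaries).
const BOUNDARIES_MS = [1, 10, 100, 1000]
const buckets = new Array(BOUNDARIES_MS.length + 1).fill(0)

setReportSyncTimeCallback(function(report) {
  // Convert the process.hrtime() delta to milliseconds and increment the matching bucket.
  const duration_ms = report.hrtime[0] * 1e3 + report.hrtime[1] / 1e6
  const index = BOUNDARIES_MS.findIndex((boundary) => duration_ms < boundary)
  buckets[index === -1 ? buckets.length - 1 : index] += 1
})

// Logging happens outside the async_hooks callbacks, so console.log is safe here.
setInterval(function() {
  console.log('SYNC_TIMES_HISTOGRAM', buckets.join(' '))
  buckets.fill(0)
}, 10 * 1000)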

Analyzing duration records

During a peak period in production, we recorded durations for one hour during which about 238,000 requests were served. It was therefore trivial to calculate a distribution per 100 requests for each duration class:

[Figure: number of synchronous operations per 100 requests, for each duration class]

From this distribution, we were able to estimate the time spent in each class and, based on the number of requests, we constructed the following table, which gives, for each class, the probability that a request hits a synchronous operation:

[Table: probability of a request hitting a synchronous operation, for each duration class]

If we look at the 100 ms - 1 s class, we notice that there is a 5% chance that a request hits a long synchronous operation of this class.

Most notably, the autocomplete feature is a priority task whose performance is tightly monitored. Prior measurements showed a 95th percentile of about 100 ms for autocomplete. Considering the above 5% probability for a request to hit a 100 ms - 1 s long synchronous operation, we hypothesized that the 5% of autocomplete requests above 100 ms were all caused by synchronous operations. Solving this priority task problem was therefore a strong motivator to find and fix these long synchronous operations.

Before proceeding further, it was mandatory to set up monitoring, both to decide whether a fix had the effect we wanted and to quickly detect new hotspots if they appeared. Since the histogram data is stored in our logs, we could directly visualize its evolution over time. The following graph shows the three most critical duration classes (> 1 s, 100 ms - 1000 ms and 10 ms - 100 ms):

[Figure: evolution of the three most critical duration classes over two weeks]

The most prominent line is the 10 ms - 100 ms class. Following our hypothesis above, it is worth zooming in on the 100 ms - 1000 ms class:

[Figure: zoom on the 100 ms - 1000 ms class over the same two weeks]

Both of these graphs span two weeks, between Monday, February 25th, 2019 and Friday, March 8th, 2019. The flat part in the middle of the graph is a weekend, when traffic on Doctrine.fr reaches a low point.

Identifying synchronous operations

At this point, we knew that there was a significant number of long synchronous operations, which we were able to classify by duration. Nevertheless, we still needed to find the exact hotspots in the code responsible for these synchronous calls.

Searching for these hotspots required being far more intrusive in the code, which ruled out performing this search on the production platform. Hence, before going further, we first assessed whether the development platform exhibited the same distribution as the production platform.

This was easily done by running the whole backend test suite, and comparing the relative sizes of the different buckets for the two environments. The following figure summarizes the results:

[Figure: comparison of the duration class distribution between production and development]

As you can see, the distribution was similar enough between the two platforms for the remaining work on the development platform to be meaningful.
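For illustration, the comparison itself boils down to normalizing each histogram by its total number of operations. In the sketch below, the production line is one of the samples logged above, while the development line is made up:

// Relative size of each bucket, in percent of all recorded synchronous operations.
const production = [0, 0, 0, 139, 1250, 446, 663, 2, 0, 0]
const development = [0, 0, 0, 95, 870, 310, 450, 1, 0, 0] // hypothetical numbers

function relativeSizes(buckets) {
  const total = buckets.reduce((sum, count) => sum + count, 0)
  return buckets.map((count) => (100 * count / total).toFixed(1) + '%')
}

console.log('production :', relativeSizes(production).join(' '))
console.log('development:', relativeSizes(development).join(' '))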

First, we recorded a profiling trace of an execution of our whole backend test suite. The flamegraph is as follows:

[Figure: flamegraph of the backend test suite execution]

We then parsed the profiler logs to spot the synchronous operations. To do this, we wrote the analyzeLongSyncOpsInProfilingData tool, which we open-sourced in the Doctrine open-source GitHub repository.

This tool constructs an ordered list of the hotspots in the source code, with the source file, the line number and the function name (if any). For each entry, the following three measures are provided:

  • The worst occurrence (in ms)
  • The number of occurrences that lasted more than 10 ms
  • The number of distinct call stacks, which quickly discriminates between functions that belong to a single hot path (low distinct call stack count: optimizing the function would only benefit a small number of code paths) and functions that are genuine hotspots (high distinct call stack count: optimizing the function will have an impact on several different code paths)

The following figure shows the top results of this analysis:

[Figure: top results of the analyzeLongSyncOpsInProfilingData analysis]

With this information, it was straightforward to identify the hotspots and their impact.

Fixing long synchronous operations

We proceeded to fix some of the detected long synchronous operations. Each codebase is different, but some common lessons may be derived from our investigations.

We decided to focus only on a sample of the top-ranking functions, namely the ones linked to our search engine query analysis, because we felt that optimizing them would bring quick rewards. The functions linked to HTML parsing are based on more complex algorithms that would be harder to optimize.

All of these query analysis functions look for specific patterns in the user input (the search query), for instance to identify jurisdictions or law articles. They therefore all have a similar structure: loops over large sets of regular expressions. Of course, the longer the user input, the longer each regular expression search, and therefore the longer the overall loop.

These functions perfectly fit the traditional framework model, since executing them in a dedicated thread would not disturb the other queries. In a Node.js framework, they are anti-patterns: long synchronous code that monopolizes the main thread.

The simplest approach to tackle these synchronous hotspots was to target optimizations at the loop level (rather than, for instance, trying to use Node.js worker threads).

We decided to use async.eachSeries to replace the loops. Our initial code looked like this:

constants.JURIDICTIONS_RULES.forEach(function(rule) {
  /* CPU-intensive processing */
})

We replaced it with asynchronous calls looking like this:

async.eachSeries(constants.JURIDICTIONS_RULES, (rule, iterationCallback) => {
  /* CPU-intensive code */
  return setImmediate(iterationCallback)
}, (error) => {
  return callback(error, juridictions_in_query)
})

Note the use of setImmediate to release the main thread at each loop iteration. There are other ways to schedule the next iteration in Node.js (setTimeout and process.nextTick), but process.nextTick runs before the event loop continues and would therefore not let other requests in, while setTimeout incurs timer overhead. setImmediate callbacks are executed once all the other pending callbacks have run (to be precise, once the poll phase becomes idle), which is exactly the best way to let other requests be handled by the main thread in Node.js.

Unfortunately, after implementing this approach, we noticed that the performance penalty was huge: the async loop was 15% slower than the synchronous version, which was unacceptable. Worse, this penalty was measured without any other queries running, meaning it was solely due to the cost of scheduling the next loop iteration.

We took a forthright route to overcome this problem, adding a mechanism to call setImmediate only every N loop iterations. Depending on the function, we found that calling setImmediate every two or three iterations dropped the asynchronous performance tax to 5%.
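A minimal sketch of this batching, reusing the async.eachSeries structure above (the iteration counter and the value of 3 are illustrative):

const async = require('async')

const SET_IMMEDIATE_EVERY_N_ITERATIONS = 3 // illustrative value, tuned per function

let iteration = 0
async.eachSeries(constants.JURIDICTIONS_RULES, (rule, iterationCallback) => {
  /* CPU-intensive code */
  iteration += 1
  if (iteration % SET_IMMEDIATE_EVERY_N_ITERATIONS === 0) {
    // Release the main thread only once every N iterations, to limit the scheduling cost.
    return setImmediate(iterationCallback)
  }
  return iterationCallback()
}, (error) => {
  return callback(error, juridictions_in_query)
})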

Even though we were not entirely satisfied with this approach, we nevertheless pushed this version to production to assess its impact on long synchronous operations. The results were clearly visible on the 100 ms - 1000 ms class:

[Figure: number of 100 ms - 1000 ms synchronous operations before and after the fix]

The left-hand side of the graph shows the number of synchronous operations of this class during the day of March 13th. The modification was pushed to production in the late afternoon of that same day. We clearly see that, on March 14th, the number of synchronous operations was divided by 2.

Another interesting fact is that the number of 10 ms - 100 ms synchronous operations did not grow:

[Figure: number of 10 ms - 100 ms synchronous operations over the same period]

This indicates that the synchronous operations we modified have moved below the 10 ms threshold, thereby slipping into our medium and fast classes.

The effect on priority queries was also quite noticeable:

[Figure: autocomplete P95 response time (orange line) against the number of requests (green background)]

We see that the autocomplete response time at the P95 level (the orange line) has dropped dramatically and, furthermore, is now only marginally impacted by the number of requests (the green background), which was not the case before.

This validates our hypothesis that the 5% of slow autocomplete queries were entirely linked to long synchronous operations.

Conclusion and future work

We showed how we analyzed the synchronous hotspots in our code with async hooks, and how we fixed them. The results on application performance and on the behavior of priority tasks prove that our approach is sound and meaningful.

We still have synchronous hotspots to tackle; we initially favored an ad-hoc fix that we were able to release early. Since then, we have been iterating on this solution by setting up a framework that will automatically break up long synchronous operations when they reach a certain threshold.