How to Handle the Circuit Breaker Exception in OpenSearch

We had a search requirement in one of our projects, and that’s how I got to know Amazon OpenSearch Service. It provides a quick, relevant search experience and makes it easier to add a search feature to your applications.

The only hiccup is that sizing an OpenSearch cluster is a tricky, long-running process, and it often takes many iterations to arrive at the right specifications for your workload. The cluster also needs monitoring so that you notice problems before they grow. You make an initial estimate of the resources, test it, and verify the performance. If the performance is not good enough, you resize and iterate. Along the way, you often run into challenges such as the following:

Red / Yellow Cluster Status
Exceeded maximum shard limit
JVM OutOfMemoryError
Failed cluster nodes

While I was working with OpenSearch, I faced one such issue, which I will cover in this post.

Our cluster was ready to go after the estimation process and performance testing, but after a while we started seeing CircuitBreakerException in our logs. I found it difficult to understand, so I decided to write about it. I hope this blog helps you understand it better.

Before we get into the topic, I would like to explain a little about clusters and nodes in OpenSearch.

An OpenSearch cluster is made up of one or more nodes, which are servers that handle search queries and store your data. As the cluster grows, we can split the responsibilities among different node types (master, data, and more).

Now, let’s dive into the issue.

What is a circuit breaker?

On data nodes, 50% of the available memory, up to a maximum of 32 GB, is used by the JVM heap, and the rest is used for other operations.

Circuit breakers are limits placed on a node to stop operations that risk causing a JVM OutOfMemoryError, which could crash the node completely. Each circuit breaker has a configured limit on how much memory it may use.

When a request arrives, the breakers estimate how much memory the operation requires and compare that estimate to the configured limit. If the estimate exceeds the limit, the operation is aborted and a CircuitBreakerException is raised so that the node is not overloaded.
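To make the estimate-and-compare idea concrete, here is a minimal sketch in Python. It is a toy model and not how OpenSearch implements its breakers; the names ToyBreaker and CircuitBreakerError are hypothetical, and the overhead multiplier mirrors the overhead constants described in the settings later in this post.

```python
class CircuitBreakerError(Exception):
    """Raised when a reservation would push the breaker past its limit."""


class ToyBreaker:
    """A simplified memory breaker: tracks reserved bytes against a fixed limit."""

    def __init__(self, name, limit_bytes, overhead=1.0):
        self.name = name
        self.limit_bytes = limit_bytes
        self.overhead = overhead
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        # Scale the raw estimate by the overhead constant, then compare the
        # would-be total against the limit before reserving anything.
        needed = int(estimated_bytes * self.overhead)
        would_be = self.used_bytes + needed
        if would_be > self.limit_bytes:
            raise CircuitBreakerError(
                f"[{self.name}] would use {would_be} bytes, "
                f"limit is {self.limit_bytes} bytes"
            )
        self.used_bytes = would_be

    def release(self, estimated_bytes):
        # Give the memory back once the operation has finished.
        needed = int(estimated_bytes * self.overhead)
        self.used_bytes = max(0, self.used_bytes - needed)
```

The important detail is that the check happens before the memory is used: a rejected reservation leaves used_bytes untouched, so the node gives up one request instead of running out of heap.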

Types of Circuit breakers

There are many types of circuit breakers and a few of them are as follows:

Parent circuit breaker - It specifies the maximum amount of memory that all breakers combined can use. If the combined memory usage exceeds this limit, a parent circuit breaker exception is raised. It is configured using the settings below.
- indices.breaker.total.use_real_memory (defaults to true) - Determines whether the parent breaker takes real memory usage into account (true) or only considers the amount reserved by child circuit breakers (false).
- indices.breaker.total.limit - Limit for the overall parent breaker. Defaults to 70% of the JVM heap if indices.breaker.total.use_real_memory is false, and to 95% of the JVM heap otherwise.
Field Data circuit breaker - It estimates how much heap memory will be required to load a field into the field data cache. If loading the field would cause the cache to use more memory than allowed, the circuit breaker stops the operation and reports an error.

The field data cache contains field data (which makes text fields available for aggregations, sorting, and scripting) and global ordinals (an internal data structure used in Elasticsearch and OpenSearch to pre-compute and optimize terms aggregations).
- indices.breaker.fielddata.limit - Limit for fielddata breaker. Defaults to 40% of the JVM heap.
- indices.breaker.fielddata.overhead - A constant (1.03) by which all field data estimations are multiplied to determine the final estimation.
Request circuit breaker - It estimates how much heap memory will be required to process a request, including the memory used to calculate aggregations. If the estimated usage exceeds the limit, the request is terminated and an exception is raised.
- indices.breaker.request.limit - Defaults to 60% of the JVM heap.
- indices.breaker.request.overhead - A constant (1) by which all request estimations are multiplied to determine the final estimation.
In-flight requests circuit breaker - It is triggered when the memory usage of all active incoming requests on a node exceeds the configured threshold.
- network.breaker.inflight_requests.limit - Defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.
- network.breaker.inflight_requests.overhead - A constant (2) by which all in-flight request estimations are multiplied to determine the final estimation.
Accounting circuit breaker - It limits the memory used by items that stay in memory after a request has finished, such as Lucene segment memory. A segment is a self-contained inverted index that makes up part of a shard.
- indices.breaker.accounting.limit - Limit for accounting breaker, defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.
- indices.breaker.accounting.overhead - A constant (1) by which all accounting estimations are multiplied to determine the final estimation.
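Percentages are easier to reason about once they are turned into bytes. The sketch below, in Python, computes the default limits listed above for a node with a given amount of RAM. The function names are hypothetical, I am assuming the 32 GB heap cap means 32 GiB, and capping each child limit at the parent limit is a simplification of "bound by the parent circuit breaker".

```python
GIB = 1024 ** 3

# Default child breaker limits as fractions of the JVM heap (from the list above).
CHILD_LIMITS = {
    "fielddata": 0.40,
    "request": 0.60,
    "in_flight_requests": 1.00,
    "accounting": 1.00,
}


def heap_bytes_for_ram(ram_bytes, cap_bytes=32 * GIB):
    """Half of the node's RAM, capped (assumed here to be 32 GiB)."""
    return min(ram_bytes // 2, cap_bytes)


def parent_limit(heap_bytes, use_real_memory=True):
    """95% of the heap with real-memory accounting, otherwise 70%."""
    fraction = 0.95 if use_real_memory else 0.70
    return int(heap_bytes * fraction)


def effective_limits(heap_bytes, use_real_memory=True):
    """Per-breaker limits in bytes, with each child capped at the parent limit."""
    parent = parent_limit(heap_bytes, use_real_memory)
    limits = {"parent": parent}
    for name, fraction in CHILD_LIMITS.items():
        limits[name] = min(int(heap_bytes * fraction), parent)
    return limits


if __name__ == "__main__":
    heap = heap_bytes_for_ram(32 * GIB)  # a node with 32 GiB of RAM
    for name, limit in effective_limits(heap).items():
        print(f"{name}: {limit / GIB:.2f} GiB")
```

Running it for your own instance size shows quickly how little headroom there is between a breaker's limit and the full heap.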

Useful Commands

GET /_nodes/stats/breaker - Retrieves the current memory usage per node and per breaker.
GET _cat/nodes?v=true&h=id,r,ram,heap* - Returns heap and memory details per node.
GET /_cluster/settings?include_defaults=true - Returns the explicitly defined and default settings of the cluster, including the breaker settings.
GET _nodes/stats?filter_path=nodes.*.jvm.mem.pools.old - Returns the values needed to calculate the JVM memory pressure of each node, as JVM Memory Pressure = used_in_bytes / max_in_bytes.
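As a small sketch of the memory pressure formula, the Python function below takes the parsed JSON body of the old-pool node stats request and returns the ratio per node. The function name, the node IDs, and the byte values in the sample are made up for illustration; I am assuming the response nests the values under nodes, then the node ID, then jvm.mem.pools.old.

```python
def jvm_memory_pressure(stats):
    """Map each node ID to used_in_bytes / max_in_bytes of the old pool."""
    pressure = {}
    for node_id, node in stats.get("nodes", {}).items():
        old_pool = node["jvm"]["mem"]["pools"]["old"]
        max_bytes = old_pool["max_in_bytes"]
        if max_bytes > 0:  # skip nodes that report no usable maximum
            pressure[node_id] = old_pool["used_in_bytes"] / max_bytes
    return pressure


# Illustrative response body with made-up node IDs and sizes.
sample = {
    "nodes": {
        "node-a": {"jvm": {"mem": {"pools": {"old": {
            "used_in_bytes": 750, "max_in_bytes": 1000}}}}},
        "node-b": {"jvm": {"mem": {"pools": {"old": {
            "used_in_bytes": 200, "max_in_bytes": 1000}}}}},
    }
}

if __name__ == "__main__":
    for node_id, ratio in jvm_memory_pressure(sample).items():
        print(f"{node_id}: {ratio:.0%}")
```

Multiply the ratio by 100 if you want a percentage to compare against an alarm threshold.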

Circuit Breaker Exception Example

[Screenshot: example circuit breaker error response]

Above is an example of a circuit breaker error message that I encountered while working with Amazon OpenSearch Service. The error is returned with a 429 status code.

Let’s look into the error more closely and try to understand what is happening here.

type - It specifies the type of the exception raised.
reason - More detailed information about what led to the exception.
- [parent] - Indicates that the parent circuit breaker raised the error. The default parent circuit breaker limit is 95% of the JVM heap.
- real usage - The current heap usage.
- new bytes reserved - The number of additional bytes the operation needs.
- limit of - The maximum memory allocation for the parent circuit breaker.
- If real usage plus new bytes reserved exceeds limit of, the parent circuit breaker is triggered.
durability - Indicates whether the issue that triggered the circuit breaker will eventually resolve itself (TRANSIENT) or requires manual intervention (PERMANENT).
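The arithmetic behind the reason field can be checked with a few lines of Python. This is only a sketch: I am assuming the reason string writes each value as [bytes/human-readable size] after the labels "limit of", "real usage:" and "new bytes reserved:", and the sample message below uses made-up numbers rather than the values from my screenshot. The function name is hypothetical.

```python
import re


def parse_parent_breaker_reason(reason):
    """Extract the byte counts from a parent breaker reason string."""
    patterns = {
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes_reserved": r"new bytes reserved: \[(\d+)/",
    }
    values = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, reason)
        if match is None:
            raise ValueError(f"could not find {key!r} in reason string")
        values[key] = int(match.group(1))
    # The breaker trips when real usage + new bytes reserved exceeds the limit.
    values["would_be"] = values["real_usage"] + values["new_bytes_reserved"]
    values["over_limit_by"] = values["would_be"] - values["limit"]
    return values


# Made-up message in the assumed format, with tiny numbers for readability.
sample_reason = (
    "[parent] Data too large, data for [<http_request>] would be [1010/1010b], "
    "which is larger than the limit of [1000/1000b], "
    "real usage: [990/990b], new bytes reserved: [20/20b]"
)

if __name__ == "__main__":
    print(parse_parent_breaker_reason(sample_reason))
```

A large real usage combined with a small new bytes reserved value suggests the heap was already nearly full and the rejected request was merely the last straw, which points toward overall JVM memory pressure rather than one oversized query.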

Suggestions

Reduce JVM memory pressure - High JVM memory pressure often causes circuit breaker errors. Check the JVM memory pressure and try to reduce it using the following suggestions.
- Reduce the number of shards in each index:
  - For workloads where search latency matters, use a shard size between 10 and 30 GiB.
  - For write-heavy workloads, use a shard size between 30 and 50 GiB.
  - On a given node, have no more than 20 shards per GiB of Java heap.
- Avoid searches that might be very expensive, such as using a large size in pagination. Enable slow logs to identify expensive search queries. Aggregations, wildcards, and wide time ranges in your queries can also cause high JVM memory pressure.
- Avoid defining too many fields or nesting fields too deeply, as this can cause a mapping explosion, which consumes a lot of memory.
Avoid sending a large number of requests at the same time, and tune the bulk size according to your workload.
Increase the cluster’s size to get additional JVM heap to handle your workload.
Disable and avoid using fielddata as it can consume a large amount of heap space.
Monitor OpenSearch cluster metrics with Amazon CloudWatch and create alarms for key cluster metrics.
Enable logs for better observability of the errors and issues.
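The shard sizing guidance above is easy to turn into a quick calculation. The following Python sketch uses the numbers from the suggestions (a target shard size and 20 shards per GiB of heap); the function names and the 300 GiB index in the usage example are hypothetical.

```python
GIB = 1024 ** 3


def primary_shard_count(index_size_bytes, target_shard_size_bytes):
    """Smallest number of shards that keeps each shard at or below the target size."""
    # Integer ceiling division avoids floating-point rounding surprises.
    shards = -(-index_size_bytes // target_shard_size_bytes)
    return max(1, shards)


def max_shards_for_heap(heap_bytes, shards_per_gib=20):
    """Upper bound on shards per node for a given JVM heap size."""
    return (heap_bytes // GIB) * shards_per_gib


if __name__ == "__main__":
    # A hypothetical 300 GiB search index with 30 GiB shards.
    print(primary_shard_count(300 * GIB, 30 * GIB))
    # A node with a 16 GiB heap.
    print(max_shards_for_heap(16 * GIB))
```

Treat the results as starting points for the estimate-test-resize loop described at the beginning of this post, not as final values.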