
How to Determine Maximum Heap Size

What is a good way to determine the maximum heap size a virtual machine (VM) should be allocated in production? A simple but flawed approach is to start with an initial size and increase it by 50% to 100% whenever the VM runs out of memory. While this might be acceptable for test environments, potentially incurring multiple outages in production usually is not. Furthermore, there is a risk of discovering that the memory required exceeds what is physically available.

So assuming you need to determine the maximum heap size and are willing to put in some effort, how might you go about accomplishing this? At its essence this is a measurement problem, so you can apply the principles from the book How to Measure Anything by Douglas Hubbard.

Determine Purpose of Measurement

First you must determine the purpose of the measurement. What decision(s) will it support? This can go beyond just determining the maximum size to specify to the virtual machine. For example, there could be an upper limit to the physical memory available in the target production server that you must not exceed.

Determine When to Stop Measuring

The second step in the parlance of Hubbard's book is assessing the economic value of measuring and only performing measurements with a positive return on investment. Essentially you are determining the criteria for when to stop measuring. In the context of determining maximum heap size this corresponds to considerations such as:

  1. Determining if there are any heap size thresholds that are significant to the decision. For example, if your production server has a physical upper limit then this is a threshold to measure against, but you might not care how far below this limit your required heap size falls.
  2. What the desired precision is when measuring. This is based on a return on investment calculation comparing the cost of making an additional measurement to improve the precision versus the possible cost savings of needing less memory. For example, if each refinement in precision will take about one hour to measure at a salary cost of $100, and the enterprise class RAM in your production cloud costs $500 / GB / year, then for a one-year positive return you would seek a precision of no more than 200 MB, as the sketch below illustrates.
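To make that arithmetic concrete, here is a tiny sketch of the calculation; the class and variable names are purely illustrative, and the dollar figures are simply the ones from the example above.

```java
// Worked example of the precision / return-on-investment arithmetic above:
// precision (GB) = cost of one additional measurement / cost of RAM per GB per year.
public final class MeasurementPrecision {
    public static void main(String[] args) {
        double costPerMeasurement = 100.0;   // one hour of effort at $100
        double ramCostPerGbPerYear = 500.0;  // enterprise-class RAM cost per GB per year
        double precisionGb = costPerMeasurement / ramCostPerGbPerYear;
        System.out.printf("Measure to a precision of %.0f MB%n", precisionGb * 1000); // 200 MB
    }
}
```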

Determine Memory-Relevant Processes

This step involves analyzing the application(s) running in the virtual machine. While traditionally a separate VM is used for each application, application/web servers running within a single VM can host multiple applications.

Each application consists of one or more independent logical processes (not actual operating system processes) that are executed as one or more threads within the VM. You need to determine what these processes are and which of them consume a non-trivial amount of memory. Examples of these logical processes include:

  • Serving up web pages.
  • Synchronously processing web service requests.
  • Asynchronously processing messages from a queue.
  • Performing scheduled batch operations.
  • Nightly refresh of in-memory caches.
  • Regularly responding to a health check / heartbeat request.

From these examples you might determine that the last one involving a health check uses minimal memory and thus is not worth measuring.

Determine Process Measurement Sets

In this step you determine which of the processes identified in the prior step can be measured together versus which need to be measured separately. I refer to each group of processes that will be measured together as a process measurement set. The key consideration is the variation in memory usage over the execution of each process. If two processes are going to be measured together, then you need to ensure that the total memory used by both processes is a good representation of the maximum memory required by both processes combined.

For example, assume two processes A and B that can execute concurrently. If process A starts off quickly allocating memory up to its maximum needed then gradually releasing it while process B slowly increases its memory usage until the end, then these two processes should be measured separately, as running them together is very unlikely to reveal the combined maximum.

The best processes to measure together are those that run quickly and/or in high volume (in terms of number of concurrent threads of execution). For example, load testing of web pages is essentially testing many individual processes (serving up each individual web page). The high volume and the quick turnaround per page mean that assessing memory usage in aggregate is highly reliable. It is not necessarily the theoretical maximum, but it does find a maximum that is statistically very unlikely to be exceeded.

Measure Memory Required by Each Process Measurement Set

Now that your process measurement sets are defined you need to measure the maximum memory required for each. This requires executing the processes set by set. To obtain an accurate measurement the following conditions must be met:

  1. The processes of the set must execute in isolation - processes from other sets should not be running at the same time. Sometimes I have had to add special code to applications so I could temporarily disable certain processes. See also the next section on measuring baseline memory usage - background processes included in the baseline are not a concern.
  2. Realistic data and transaction volumes should be used when executing the processes. By realistic I mean as close as reasonably possible to production. For new applications this means you have to rely on estimates or simulations of what you expect to happen in production. For existing applications you can use measurements of production volumes and perhaps even use a copy of the production data set. If you cannot achieve production volumes in your test environment then you need to take multiple measurements at different sizing levels. This allows you to measure how the application scales, which then allows you to extrapolate memory usage at production levels.

    If you are running multiple servers for redundancy to achieve high availability then you typically want to determine the required volume per server assuming one server is down. For example, in a three-server setup, two servers need to be able to handle all of the traffic, which means each server needs to be sized for half the total volume rather than one-third (see the sketch below).
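Here is a minimal sketch of that per-server sizing rule; the class, method, and figures are illustrative, and the volume is in whatever units you use to drive your load tests.

```java
// Sizing rule: with N redundant servers, plan for one being down,
// so each remaining server must handle totalVolume / (N - 1).
public final class RedundantSizing {
    static double volumePerServer(double totalVolume, int serverCount) {
        if (serverCount < 2) {
            throw new IllegalArgumentException("Need at least two servers for redundancy");
        }
        return totalVolume / (serverCount - 1);
    }

    public static void main(String[] args) {
        // Three-server example from the text: each server is sized for half
        // the total volume, not one third.
        System.out.println(volumePerServer(1200, 3)); // 600.0
    }
}
```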

Determining the maximum memory required by a set of executing processes is actually surprisingly difficult to do with high accuracy. There are two different methods of measuring this that you can use:

Direct Observation Method

This method involves observing the memory used by the virtual machine and treating this as the maximum memory required. The observation is typically made using a performance monitoring tool that frequently samples the memory used by the VM. A primitive version of such a tool can be made by regularly logging memory usage to a log file; one easy way to add such instrumentation to an application is via an aspect (a bare-bones sketch follows the list below). The direct observation method has two limitations:

  1. Memory used at any given point is likely to be significantly higher than the true maximum. The specific behavior will vary from VM to VM, but in general virtual machines try to minimize the amount of garbage collection performed. So if an application is running well under its maximum heap size, often a VM will only perform what is called minor garbage collection which will reclaim some unused objects, typically over only a subset of the heap. Usually it is only when the heap fills up to near its maximum size that the VM performs major garbage collection that scans the entire heap and frees as much space as possible. In my recent experience with IBM JRE VMs I have seen memory usage at levels more than double the actual maximum memory required.
  2. Sampling memory used provides no indication of what happens between samples. If the rate of change of memory usage is very high compared to the sampling rate, then there is a good possibility that the point in time corresponding to maximum memory used falls between sampling points, with the observations made falling significantly below the maximum. To mitigate this problem multiple test runs can be performed and the highest result used. If you see a wide variance in memory usage reported across otherwise identical test runs, sampling error is a likely cause.
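As mentioned above, a primitive sampler can be as simple as logging the VM's used heap on a fixed interval. Here is a bare-bones sketch, assuming a one-second sampling period and comma-separated output to standard out; in practice you would route this through your logging framework or weave it in via an aspect, and it is of course subject to the sampling-gap limitation just described.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Primitive direct-observation sampler: logs used heap once per second.
// Peaks that occur between samples will not be seen.
public final class HeapSampler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Runtime rt = Runtime.getRuntime();
            long usedBytes = rt.totalMemory() - rt.freeMemory();
            System.out.printf("%d,%d%n", System.currentTimeMillis(), usedBytes);
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```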

Memory Cap Method

To avoid the limitations of the direct observation method you can use a second method of measurement that I refer to as the memory cap method. This method essentially involves trying various values for the VM maximum heap size and seeing for which values the set of processes successfully executes. The minimum such value is taken as the maximum memory required for that process measurement set.

In order to minimize the number of test runs required I prefer to use a binary search to find the maximum memory required. This involves determining an initial lower and upper bound on the memory required plus a desired precision. The precision was already determined in the second step as the criterion for when to stop measuring. For the lower bound you can use the baseline memory as per the next section. For the upper bound you can perform an initial test run using direct observation and use the observed memory usage.
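Here is a minimal sketch of that binary search, assuming a hypothetical runWithHeapLimit helper that launches the process measurement set with the given -Xmx value and reports whether it completed successfully; in practice each "run" may well be a manual test cycle rather than something you can script.

```java
// Binary search for the smallest heap size (in MB) at which the process
// measurement set completes successfully.
public final class MemoryCapSearch {

    // Hypothetical helper: run the process set with -Xmx<heapMb>m and report success.
    static boolean runWithHeapLimit(int heapMb) {
        throw new UnsupportedOperationException("launch the test run here");
    }

    static int findMaxMemoryRequired(int lowerMb, int upperMb, int precisionMb) {
        // lowerMb: a size expected to fail, e.g. the baseline memory.
        // upperMb: a size known to succeed, e.g. from a direct-observation run.
        while (upperMb - lowerMb > precisionMb) {
            int midMb = (lowerMb + upperMb) / 2;
            if (runWithHeapLimit(midMb)) {
                upperMb = midMb;   // succeeded: the true requirement is at or below midMb
            } else {
                lowerMb = midMb;   // failed: the true requirement is above midMb
            }
        }
        return upperMb;
    }
}
```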

For a given test run you want to determine whether the processes will succeed or fail with the current heap size setting. An obvious sign of failure is an out of memory exception, but there are more subtle signs. As the memory required by a process set grows close to the maximum heap size, the virtual machine tends to spend more and more time doing garbage collection. After a certain threshold the VM is essentially thrashing, and throughput / performance of the processes will drop significantly. Although the processes may still complete without error you want to avoid this nearly-out-of-memory condition. One recommendation I have seen is to increase the memory allocated when the percentage of time spent in garbage collection is more than five percent.
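To apply that rule of thumb you can read the collector statistics the Java platform exposes through its standard management beans. The sketch below computes the fraction of time spent in garbage collection since VM start; a production check would more likely compare deltas over a recent window rather than cumulative totals.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough check of the fraction of wall-clock time spent in garbage collection
// since VM start. A sustained value above roughly 5% suggests the heap is too small.
public final class GcTimeCheck {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long collectionTime = gc.getCollectionTime(); // -1 if not reported by this collector
            if (collectionTime > 0) {
                gcMillis += collectionTime;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        double gcFraction = (double) gcMillis / uptimeMillis;
        System.out.printf("Time in GC: %.1f%%%n", gcFraction * 100);
    }
}
```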

Measure Baseline Memory

Applications that are event or schedule driven typically consume a certain amount of memory even when not actively executing processes. I refer to this as the baseline memory. One very common example is applications running within an application server - the memory used by the application server itself would be part of the baseline. Another example is applications architected using the Spring framework - the instantiation of the Spring context and any defined singleton beans would be part of the baseline.

Perhaps the easiest way to think about the baseline is that it is that portion of memory consumed that is common across the process measurement sets. This is why you need to measure the baseline separately: when you add up the memory required by each process measurement set you only want to account for this common memory once.

Measuring the baseline is straightforward: ensure no processes in any process measurement set are running and observe the VM memory usage. While you could use the memory cap method to make this measurement, I find that because the baseline memory used is usually quite small in comparison to the process measurement sets, the extra accuracy over direct observation is not worth the effort.

Calculate Maximum Heap Size

In this last step you finally calculate the maximum heap size required by the virtual machine. This consists of adding together the following contributions (a sketch of the calculation follows the list):

  • Baseline memory required
  • Maximum memory required by each process measurement set, subtracting the baseline from each.
  • Operational buffer, which I usually calculate as 10 to 20 percent of the total of the prior items. Lean thinking provides the theoretical basis: as spare capacity goes to zero, throughput is reduced. Systems require 'slack' in order to absorb fluctuations in demand without affecting overall throughput. As discussed earlier in the description of the memory cap method, this applies to memory usage. As the free heap available drops close to zero, the time required to allocate new memory increases significantly, thus reducing the throughput / performance of the rest of the system. You can minimize the amount of operational buffer if you are prepared to carefully monitor garbage collection frequency in production and can use planned outages to increase the maximum heap size if required, but a cost-benefit analysis usually indicates it is cheaper to maintain a larger buffer.
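Here is a minimal sketch of that sum with illustrative figures; in practice you would plug in your own measurements, and, as the worked example below shows, you may only need to sum the sets that can be active at the same time.

```java
// Maximum heap size = baseline + sum of (set maximum - baseline) + operational buffer.
public final class HeapSizeCalculation {
    static double maxHeapGb(double baselineGb, double bufferFraction, double... setMaximumsGb) {
        double total = baselineGb;
        for (double setMaxGb : setMaximumsGb) {
            total += setMaxGb - baselineGb;  // count the shared baseline only once
        }
        return total * (1 + bufferFraction); // add the operational buffer
    }

    public static void main(String[] args) {
        // Illustrative figures: 0.2 GB baseline, two sets measured at 4 GB and 2 GB, 10% buffer.
        System.out.printf("%.1f GB%n", maxHeapGb(0.2, 0.10, 4.0, 2.0)); // 6.4 GB
    }
}
```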

Example of Determining Maximum Heap Size

I will now walk through a hypothetical example that applies the process outlined above, while adding in some of the complexities that real life brings to the mix.

  1. Determine Purpose of Measurement: Determine the size of the production server to obtain for a new system in addition to determining the maximum heap size to allocate.
  2. Determine When to Stop Measuring: Servers are available in various configurations, with memory starting at 2 GB for the smallest configuration. Larger configurations are available, each one increasing RAM by an additional 2 GB at an increased cost of roughly $1000 / year. So memory measurements are only meaningful at the boundaries of 4 GB, 6 GB, 8 GB, etc.
  3. Determine Memory-Relevant Processes: The virtual machine will be running a Java EE application server with three applications "App1", "App2", and "App3". App1 is an administrative application that is expected to have minimal memory usage so has no memory-relevant processes. App2 is a pure web application that is expected to handle a high volume of traffic, so its sole memory-relevant process is serving web pages. App3 is a complex enterprise application with a variety of memory-relevant processes: serving web pages, weekly generation of reports, a nightly batch processing task, and asynchronous message processing.
  4. Determine Process Measurement Sets: The following sets are defined:
    1. Serving of web pages by App2 and App3.
    2. Weekly generation of reports by App3.
    3. Nightly batch processing by App3.
    4. Asynchronous message processing by App3.
  5. Measure Memory Required by Each Process Measurement Set: A test environment is provisioned using a middle-of-the-line server configuration. Test data is generated based on expected production volumes. Because we only care about discrete thresholds of memory usage we start with an initial measurement of each process measurement set using the direct observation method. The results are as follows:
    1. Serving of web pages by App2 and App3: Automated test scripts are built for a web load test tool based on the expected production load. Two load levels are expected: a regular load 24x7, and a peak or burst load occasionally during business hours. Memory usage for the regular load is 0.5 GB, while at peak load it is 4 GB.
    2. Weekly generation of reports by App3: Report generation happens on the weekend only and takes 4 GB.
    3. Nightly batch processing by App3: This only runs at night, never during business hours, and takes 1 GB.
    4. Asynchronous message processing by App3: The amount of memory consumed depends on the number of threads allocated for this processing, so there's an interesting tradeoff between allocating more threads to ensure sufficient processing capacity to keep up with the rate of incoming messages, and minimizing the total number of threads to reduce overall memory usage. To further complicate this tradeoff, moving to a more powerful server in production with more memory will also provide more CPU cores that will boost throughput. Memory usage is 0.1 GB per thread, and the test environment requires 20 threads, so 2 GB. Spikes in incoming messages are possible during the day that build up the queue, but the system can catch up at night. So the system needs to handle the full load of message processing 24x7.

    We will potentially need to iteratively return to this step based on what the initial maximum heap size is determined to be.

  6. Measure Baseline Memory: Memory used by the VM is observed while all three applications are installed in the application server, but with no activity happening. Baseline memory usage is determined to be 0.2 GB. We also determine that the operating system will require 0.5 GB of RAM in addition to the RAM required by the VM.
  7. Calculate Maximum Heap Size:
    The calculation of maximum heap size is not a simple sum due to the fact that not all process measurement sets are expected to be active at the same time. The following scenarios are possible:

    1. Weekday load during business hours: Peak web page load of 4 GB + Asynchronous message processing of 2 GB = 6 GB
    2. Weekend load at night: Regular web page load of 0.5 GB + Weekly report generation of 4 GB + Nightly batch processing of 1 GB + Asynchronous message processing of 2 GB = 7.5 GB

    The second scenario is the largest. Adding in the baseline of 0.2 GB and a 10% buffer results in a total of 8.5 GB maximum heap size. The provisioning recommendation for our VM is that allocated physical RAM be 20% greater than the VM maximum heap size to account for VM overhead. This results in a total required physical RAM of 8.5 * 1.2 + the operating system baseline of 0.5 GB = 10.7 GB, which bumps up the server required to 12 GB. More accurate measurements using the memory cap method might reveal smaller maximums that allow us to drop to 10 GB and save $1000 / year, so we allocate up to 5 hours for further measurement.

  8. Measure Memory Required by Each Process Measurement Set (iteration 2): We focus first on the largest contributor to memory usage: the report generation. Using the memory cap method reveals that it only actually requires 2 GB of memory.
  9. Calculate Maximum Heap Size (iteration 2): The second scenario now comes to a total of 5.5 GB, which is now less than the first scenario. So adding the 6 GB of the first scenario to the 0.2 GB baseline and adding a 10% buffer gives us a total heap size required of 6.8 GB. Physical RAM required is 6.8 * 1.2 + 0.5 = 8.7 GB, which corresponds to a server with 10 GB of RAM. The next smaller server size is 8 GB. Since both scenarios have roughly equal usage at this point, more accurate measurements of multiple process sets would be needed, and it seems unlikely that the RAM required would drop enough to fit in 8 GB. So we decide that there is insufficient benefit to continuing measurement. We go with a server with 10 GB of RAM, and decide to use a maximum heap size of 7.4 GB to provide a 20% operational buffer since the space is available on the server.


2 Comments on “How to Determine Maximum Heap Size”

  1. Michael Padberg says:

    Nice billing model….
    but did it really only take 1 person 1 hour to go through the first iteration of measurement? I read "provisioning test environments" and "generating test data" and probably writing test scripts… in addition to the think time of actually figuring out a plan as to how your particular situation fits in with the practice of measuring, consensus to the plan, etc.

  2. @Michael, the examples are hypothetical. You are right that there are often considerations such as test planning and setup that would require a larger initial investment. However, if most of this setup is going to be done anyway for other tests (like performance and stress testing), then you should not include it.
