<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Basil Vandegriend: Professional Software Development &#187; infrastructure</title>
	<atom:link href="http://www.basilv.com/psd/blog/category/infrastructure/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.basilv.com/psd</link>
	<description></description>
	<lastBuildDate>Wed, 25 Jan 2012 13:23:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>How to Determine Maximum Heap Size</title>
		<link>http://www.basilv.com/psd/blog/2011/how-to-determine-maximum-heap-size</link>
		<comments>http://www.basilv.com/psd/blog/2011/how-to-determine-maximum-heap-size#comments</comments>
		<pubDate>Wed, 11 May 2011 13:00:22 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[measurement]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/?p=645</guid>
		<description><![CDATA[What is a good way to determine the maximum heap size a virtual machine (VM) should be allocated in production? A simple but flawed approach is to simply start with an initial size and increase it by 50% to 100% whenever it runs out of memory. While this might be acceptable for test environments, incurring [...]]]></description>
			<content:encoded><![CDATA[<p>What is a good way to determine the maximum heap size a virtual machine (VM) should be allocated in production? A simple but flawed approach is to simply start with an initial size and increase it by 50% to 100% whenever it runs out of memory. While this might be acceptable for test environments, incurring potentially multiple outages in production is usually not acceptable. Furthermore, there is a risk of discovering that the memory required exceeds what is physically available.</p>
<p>So assuming you need to determine the maximum heap size and are willing to put in some effort, how might you go about accomplishing this? At its essence this is a measurement problem, so you can apply the principles from the book <a href="http://www.amazon.ca/gp/product/0470539399/ref=as_li_tf_tl?ie=UTF8&#038;tag=basilvandegri-20&#038;linkCode=as2&#038;camp=15121&#038;creative=330641&#038;creativeASIN=0470539399">How to Measure Anything</a> by Douglas Hubbard. </p>
<h3>Determine Purpose of Measurement</h3>
<p>First you must determine the purpose of the measurement. What decision(s) will it support? This can go beyond just determining the maximum size to specify to the virtual machine. For example, there could be an upper limit to the physical memory available in the target production server that you must not exceed. </p>
<h3>Determine When to Stop Measuring</h3>
<p>The second step in the parlance of Hubbard's book is assessing the economic value of measuring and only performing measurements with a positive return on investment. Essentially you are determining the criteria for when to stop measuring. In the context of determining maximum heap size this corresponds to considerations such as:</p>
<ol>
<li>Determining if there are any heap size thresholds that are significant to the decision. For example, if your production server has a physical upper limit then this is a threshold to measure against, but you might not care how far below this limit your required heap size is.</li>
<li>What the desired precision is when measuring. This is based on a return on investment calculation comparing the cost of making an additional measurement to improve the precision against the possible cost savings of needing less memory. For example, if each refinement in precision takes about one hour to measure at a salary cost of $100, and the enterprise-class RAM in your production cloud costs $500 / GB / year, then $100 buys 0.2 GB of RAM for a year, so for a one-year positive return there is no point refining the measurement beyond a precision of about 200 MB. </li>
</ol>
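<p>The break-even arithmetic in the second item can be sketched in a few lines of Java. The class and method names here are hypothetical and only illustrate the calculation; the figures passed in <code>main</code> are the ones from the example above.</p>

```java
// Hypothetical helper for illustration only; not part of any library.
final class MeasurementPrecision {

    /**
     * Returns the memory saving (in MB) that one additional measurement
     * must deliver to pay for itself within the payback period.
     * Uses 1 GB = 1000 MB, matching the round numbers in the article.
     */
    static double breakEvenPrecisionMb(double costPerMeasurement,
                                       double ramCostPerGbPerYear,
                                       double paybackYears) {
        if (ramCostPerGbPerYear <= 0 || paybackYears <= 0) {
            throw new IllegalArgumentException("cost and period must be positive");
        }
        return costPerMeasurement * 1000.0 / (ramCostPerGbPerYear * paybackYears);
    }

    public static void main(String[] args) {
        // $100 per measurement, $500 / GB / year, one-year payback
        System.out.println(breakEvenPrecisionMb(100.0, 500.0, 1.0) + " MB");
    }
}
```

<p>A longer payback period or more expensive RAM lowers the break-even figure, which justifies measuring more precisely.</p>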
<h3>Determine Memory-Relevant Processes</h3>
<p>This step involves analyzing the application(s) running in the virtual machine. While traditionally a separate VM is used for each application, application/web servers running within a single VM can host multiple applications.</p>
<p>Each application consists of one or more independent logical processes (not actual operating system processes) that are executed as one or more threads within the VM. You need to determine what these processes are and which of them consume a non-trivial amount of memory. Examples of these logical processes include:</p>
<ul>
<li>Serving up web pages.</li>
<li>Synchronously processing web service requests.</li>
<li>Asynchronously processing messages from a queue.</li>
<li>Performing scheduled batch operations.</li>
<li>Nightly refresh of in-memory caches.</li>
<li>Regularly responding to a health check / heartbeat request.</li>
</ul>
<p>From these examples you might determine that the last one involving a health check uses minimal memory and thus is not worth measuring.</p>
<h3>Determine Process Measurement Sets</h3>
<p>In this step you determine which of the processes identified in the prior step can be measured together versus which need to be measured separately. I refer to each group of processes that will be measured together as a <em>process measurement set</em>. The key consideration is the variation in memory usage over the execution of each process. If two processes are going to be measured together, then you need to ensure that the total memory used by both processes is a good representation of the maximum memory required by both processes combined. </p>
<p>For example, assume two processes A and B that can execute concurrently. If process A starts off quickly allocating memory up to its maximum needed then gradually releasing it while process B slowly increases its memory usage until the end, then these two processes should be measured separately, as running them together is very unlikely to reveal the combined maximum. </p>
<p>The best processes to measure together are those that run quickly and/or in high volume (in terms of number of concurrent threads of execution). For example load testing of web pages is essentially testing many individual processes (serving up each individual web page). The high volume and the quick turnaround per page means that assessing memory usage in aggregate is highly reliable. It is not necessarily the theoretical maximum but it does find a maximum that is statistically very unlikely to be exceeded.</p>
<h3>Measure Memory Required by Each Process Measurement Set</h3>
<p>Now that your process measurement sets are defined you need to measure the maximum memory required for each. This requires executing the processes set by set. To obtain an accurate measurement the following conditions must be met:</p>
<ol>
<li>The processes of the set must execute in isolation - other processes in other sets should not be included. Sometimes I have had to add special code to applications so I could temporarily disable certain processes. See also the next section on measuring baseline memory usage - background processes included in the baseline are not a concern.</li>
<li>Realistic data and transaction volumes should be used when executing the processes. By realistic I mean as close as reasonably possible to production. For new applications this means you have to rely on estimates or simulations of what you expect to happen in production. For existing applications you can use measurements of production volumes and perhaps even use a copy of the production data set. If you cannot achieve production volumes in your test environment then you need to take multiple measurements at different sizing levels. This allows you to measure how the application scales, which then allows you to extrapolate memory usage at production levels.
<p>If you are running multiple servers for redundancy to achieve high availability then you typically want to determine the required volume per server assuming one server is down. E.g. in a three-server setup, two servers need to be able to handle all of the traffic, which means each server needs to be sized for half the total volume rather than one-third.</li>
</ol>
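<p>Both calculations in the second condition can be sketched as follows. This is a minimal illustration with hypothetical names. The extrapolation assumes memory grows roughly linearly with volume, which is an assumption you should check against your own measurements, since real applications can scale differently.</p>

```java
// Hypothetical helper for illustration only.
final class TestSizing {

    /** Fraction of total volume each surviving server must handle when some servers are down. */
    static double volumeSharePerServer(int servers, int failedServers) {
        if (failedServers < 0 || servers - failedServers < 1) {
            throw new IllegalArgumentException("at least one server must remain");
        }
        return 1.0 / (servers - failedServers);
    }

    /**
     * Fits a least-squares line through (volume, memory) measurements and
     * evaluates it at the production volume. Assumes linear scaling.
     */
    static double extrapolateLinear(double[] volumes, double[] memoryGb, double productionVolume) {
        int n = volumes.length;
        if (n < 2 || memoryGb.length != n) {
            throw new IllegalArgumentException("need at least two paired measurements");
        }
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += volumes[i];
            sy += memoryGb[i];
            sxx += volumes[i] * volumes[i];
            sxy += volumes[i] * memoryGb[i];
        }
        double denominator = n * sxx - sx * sx;
        if (denominator == 0.0) {
            throw new IllegalArgumentException("volumes must not all be equal");
        }
        double slope = (n * sxy - sx * sy) / denominator;
        double intercept = (sy - slope * sx) / n;
        return intercept + slope * productionVolume;
    }
}
```

<p>For the three-server example, <code>TestSizing.volumeSharePerServer(3, 1)</code> gives one half. The intercept of the fitted line is also a rough cross-check on the baseline memory discussed later.</p>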
<p>Determining the maximum memory required by a set of executing processes is actually surprisingly difficult to do with high accuracy. There are two different methods of measuring this that you can use:</p>
<h4>Direct Observation Method</h4>
<p>This method involves observing the memory used by the virtual machine and treating this as the maximum memory required. The observation is typically made using a performance monitoring tool that frequently samples the memory used by the VM. A primitive version of such a tool can be made by regularly logging memory usage to a log file. One easy way to add such instrumentation to an application is via an aspect. The direct observation method has two limitations:</p>
<ol>
<li>Memory used at any given point is likely to be significantly higher than the maximum memory actually required, because garbage that has not yet been collected still counts as used. The specific behavior will vary from VM to VM, but in general virtual machines try to minimize the amount of garbage collection performed. So if an application is running well under its maximum heap size, often a VM will only perform what is called <em>minor</em> garbage collection, which reclaims some unused objects, typically over only a subset of the heap. Usually it is only when the heap fills up to near its maximum size that the VM performs <em>major</em> garbage collection that scans the entire heap and frees as much space as possible. In my recent experience with IBM JRE VMs I have seen memory usage at levels more than double the actual maximum memory required.
</li>
<li>Sampling memory used provides no indication of what happens between samples. If the rate of change of memory usage is very high compared to the sampling rate, then there is a good possibility that the point in time corresponding to maximum memory used falls between sampling points, with the observations made falling significantly below the maximum. To mitigate this problem multiple test runs can be performed and the highest result used. If you see a wide variance in memory usage reported across otherwise identical test runs, sampling error is a likely cause.
</li>
</ol>
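<p>A primitive sampler of the kind described above could look like the following sketch. It uses a scheduled background thread rather than an aspect, reads heap usage through the standard <code>java.lang.management</code> API, and writes to standard output instead of a real log file. The class name <code>HeapSampler</code> and the log format are made up for illustration.</p>

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sampler for illustration only.
final class HeapSampler implements AutoCloseable {
    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private final AtomicLong peakUsedBytes = new AtomicLong();
    private final ScheduledExecutorService scheduler;

    HeapSampler(long periodMillis) {
        // Daemon thread so the sampler never keeps the VM alive on its own.
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heap-sampler");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(this::sample, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Takes one sample, logs it, and updates the highest value seen so far. */
    long sample() {
        long used = memory.getHeapMemoryUsage().getUsed();
        peakUsedBytes.accumulateAndGet(used, Math::max);
        System.out.printf("%d heapUsedBytes=%d%n", System.currentTimeMillis(), used);
        return used;
    }

    /** Highest sampled heap usage; subject to both limitations listed above. */
    long peakUsedBytes() {
        return peakUsedBytes.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

<p>The reported peak includes uncollected garbage and can miss spikes between samples, so treat it as a rough figure rather than the true requirement.</p>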
<h4>Memory Cap Method</h4>
<p>To avoid the limitations of the direct observation method you can use a second method of measurement that I refer to as the <em>memory cap</em> method. This method essentially involves trying various values for the VM maximum heap size and seeing for which values the set of processes successfully executes. The minimum such value is taken as the maximum memory required for that process measurement set.</p>
<p>In order to minimize the number of test runs required I prefer to use a binary search to find the maximum memory required. This involves determining an initial lower and upper bound on the memory required plus a desired precision. The precision was already determined in the second step as the criteria for when to stop measuring. For the lower bound you can use the baseline memory as per the next section. To determine the upper bound you can perform an initial test run using direct observation, and use the observed memory usage as an upper bound. </p>
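<p>The binary search can be sketched as below. The predicate <code>runSucceeds</code> is a hypothetical stand-in for one complete test run at a given maximum heap size; for a Java VM that would typically mean relaunching with a different <code>-Xmx</code> value and checking the outcome. The sketch assumes the lower bound is too small and the upper bound is large enough.</p>

```java
import java.util.function.IntPredicate;

// Hypothetical search helper for illustration only.
final class MemoryCapSearch {

    /**
     * Returns a heap size in MB at which the run succeeds and that is less than
     * precisionMb above the smallest succeeding size.
     * Assumes runs fail at lowerMb and succeed at upperMb.
     */
    static int findRequiredHeapMb(int lowerMb, int upperMb, int precisionMb,
                                  IntPredicate runSucceeds) {
        if (precisionMb < 1 || lowerMb > upperMb) {
            throw new IllegalArgumentException("invalid bounds or precision");
        }
        int failing = lowerMb;     // largest size assumed or observed to fail
        int succeeding = upperMb;  // smallest size assumed or observed to succeed
        while (succeeding - failing > precisionMb) {
            int candidate = failing + (succeeding - failing) / 2;
            if (runSucceeds.test(candidate)) {
                succeeding = candidate;
            } else {
                failing = candidate;
            }
        }
        return succeeding;
    }
}
```

<p>Each iteration halves the remaining range, so going from a 4 GB range to a 200 MB precision takes about five test runs.</p>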
<p>For a given test run you want to determine whether the processes will succeed or fail with the current heap size setting. An obvious sign of failure is an out of memory exception, but there are more subtle signs. As the memory required by a process set grows close to the maximum heap size, the virtual machine tends to spend more and more time doing garbage collection. After a certain threshold the VM is essentially thrashing, and throughput / performance of the processes will significantly drop. Although the processes may still complete without error you want to avoid this nearly-out-of-memory condition. One recommendation I have seen is to increase memory allocated when the percentage of time spent in garbage collection is more than five percent.</p>
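<p>One way to apply the five percent rule on a Java VM is to read the accumulated collection time from the standard <code>GarbageCollectorMXBean</code> instances before and after a test run. The sketch below uses hypothetical class and method names. The reported times are approximate, and what they include varies by collector, so treat the result as an indicator rather than an exact figure.</p>

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only.
final class GcOverhead {

    /** Approximate milliseconds spent in garbage collection so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Percentage of elapsed time spent in garbage collection. */
    static double gcPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    /** True when the run should be treated as failed under the given threshold. */
    static boolean exceedsThreshold(long gcMillis, long elapsedMillis, double thresholdPercent) {
        return gcPercent(gcMillis, elapsedMillis) > thresholdPercent;
    }
}
```

<p>To use it, record <code>totalGcMillis()</code> and the wall-clock time before and after the run, then pass the two differences to <code>exceedsThreshold</code> with a threshold of 5.0.</p>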
<h3>Measure Baseline Memory</h3>
<p>Applications that are event or schedule driven typically consume a certain amount of memory even when not actively executing processes. I refer to this as the <em>baseline memory</em>. One very common example is applications running within an application server - the memory used by the application server itself would be part of the baseline. Another example is applications architected using the Spring framework - the instantiation of the Spring context and any defined singleton beans would be part of the baseline.</p>
<p>Perhaps the easiest way to think about the baseline is that it is that portion of memory consumed that is common across the process measurement sets. This is why you need to measure the baseline separately: when you add up the memory required by each process measurement set you only want to account for this common memory once. </p>
<p>Measuring the baseline is straightforward: ensure no processes in any process measurement set are running and observe the VM memory usage. While you could use the memory cap method to make this measurement, I find that because the baseline memory used is usually quite small in comparison to the process measurement sets the extra accuracy compared to direct observation is not worth the effort.</p>
<h3>Calculate Maximum Heap Size</h3>
<p>In this last step you finally calculate the maximum heap size required by the virtual machine. This consists of adding together the following contributions:</p>
<ul>
<li>Baseline memory required</li>
<li>Maximum memory required by each process measurement set, subtracting the baseline from each. </li>
<li>Operational buffer, which I usually calculate as 10 to 20 percent of the total of the prior items. Lean thinking provides the theoretical basis: as spare capacity goes to zero, throughput is reduced. Systems require 'slack' in order to absorb fluctuations in demand without affecting overall throughput. As discussed earlier in the description of the memory cap method, this applies to memory usage. As the free heap available drops close to zero, the time required to allocate new memory increases significantly, thus reducing the throughput / performance of the rest of the system. You can minimize the amount of operational buffer if you are prepared to carefully monitor garbage collection frequency in production and can use planned outages to increase the maximum heap size if required, but a cost-benefit analysis usually indicates it is cheaper to maintain a larger buffer.</li>
</ul>
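<p>The sum above can be written as a short function. This is a minimal sketch with hypothetical names. It assumes every measured set maximum already includes the baseline, and that all sets can be active at the same time.</p>

```java
// Hypothetical helper for illustration only.
final class HeapSizeCalculator {

    /**
     * Baseline, plus each set's maximum with the baseline subtracted,
     * plus an operational buffer expressed as a fraction (0.10 for 10 percent).
     */
    static double maxHeapGb(double baselineGb, double[] setMaximaGb, double bufferFraction) {
        double total = baselineGb;
        for (double setMaxGb : setMaximaGb) {
            // Count the shared baseline only once.
            total += Math.max(0.0, setMaxGb - baselineGb);
        }
        return total * (1.0 + bufferFraction);
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 0.2 GB baseline, two sets, 10 percent buffer.
        System.out.println(maxHeapGb(0.2, new double[] {1.2, 2.2}, 0.10) + " GB");
    }
}
```

<p>When some sets never run concurrently, summing all of them overstates the requirement; in that case total each realistic combination separately and take the largest.</p>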
<h3>Example of Determining Maximum Heap Size</h3>
<p>I will now work through a hypothetical example that follows the process outlined above while adding in some of the complexities that real life brings to the mix. </p>
<ol>
<li><strong>Determine Purpose of Measurement: </strong>Determine the size of the production server to obtain for a new system in addition to determining the maximum heap size to allocate.
</li>
<li><strong>Determine When to Stop Measuring:</strong> Servers are available in various configurations, with memory starting at 2 GB for the smallest configuration. Larger configurations are available, each one increasing RAM by an additional 2 GB at an increased cost of roughly $1000 / year. So memory measurements are only meaningful at the boundaries of 4 GB, 6 GB, 8 GB, etc.
</li>
<li><strong>Determine Memory-Relevant Processes:</strong> The virtual machine will be running a Java EE application server with three applications "App1", "App2", and "App3". App1 is an administrative application that is expected to have minimal memory usage so has no memory-relevant processes. App2 is a pure web application that is expected to handle a high volume of traffic, so its sole memory-relevant process is serving web pages. App3 is a complex enterprise application with a variety of memory-relevant processes: serving web pages, weekly generation of reports, a nightly batch processing task, and asynchronous message processing.
</li>
<li><strong>Determine Process Measurement Sets:</strong> The following sets are defined:
<ol>
<li>Serving of web pages by App2 and App3. </li>
<li>Weekly generation of reports by App3.</li>
<li>Nightly batch processing by App3.</li>
<li>Asynchronous message processing by App3.</li>
</ol>
</li>
<li><strong>Measure Memory Required by Each Process Measurement Set:</strong> A test environment is provisioned using a middle-of-the-line server configuration. Test data is generated based on expected production volumes. Because we only care about discrete thresholds of memory usage we start with an initial measurement of each process measurement set using the direct observation method. The results are as follows:
<ol>
<li>Serving of web pages by App2 and App3: Automated test scripts are built for a web load test tool based on the expected production load. Two load levels are expected: a regular load 24x7, and a peak or burst load occasionally during business hours. Memory usage for the regular load is 0.5 GB, while at peak load it is 4 GB.</li>
<li>Weekly generation of reports by App3: Report generation happens on the weekend only and takes 4 GB.</li>
<li>Nightly batch processing by App3: This only runs at night, never during business hours, and takes 1 GB.</li>
<li>Asynchronous message processing by App3: The amount of memory consumed depends on the number of threads allocated for this processing, so there's an interesting tradeoff between allocating more threads to ensure sufficient processing capacity to keep up with the rate of incoming messages, and minimizing the total number of threads to reduce overall memory usage. To further complicate this tradeoff, moving to a more powerful server in production with more memory will also provide more CPU cores that will boost throughput. Memory usage is 0.1 GB per thread, and the test environment requires 20 threads, so 2 GB. Spikes in incoming messages are possible during the day that build up the queue, but the system can catch up at night. So the system needs to handle the full load of message processing 24x7.</li>
</ol>
<p>We will potentially need to iteratively return to this step based on what the initial maximum heap size is determined to be.
</li>
<li><strong>Measure Baseline Memory:</strong> Memory used by the VM is observed while all three applications are installed in the application server, but with no activity happening. Baseline memory usage is determined to be 0.2 GB. We also determine that the operating system will require 0.5 GB of RAM in addition to the RAM required by the VM.
</li>
<li><strong>Calculate Maximum Heap Size:</strong><br />
The calculation of maximum heap size is not a simple sum due to the fact that not all process measurement sets are expected to be active at the same time. The following scenarios are possible:</p>
<ol>
<li>Weekday load during business hours: Peak web page load of 4 GB + Asynchronous message processing of 2 GB = 6 GB</li>
<li>Weekend load at night: Regular web page load of 0.5 GB + Weekly report generation of 4 GB + Nightly batch processing of 1 GB + Asynchronous message processing of 2 GB = 7.5 GB</li>
</ol>
<p>The second scenario is the largest. Adding in the baseline of 0.2 GB and a 10% buffer results in a total of 8.5 GB maximum heap size. The provisioning recommendation of our VM is that allocated physical RAM be 20% greater than the VM maximum heap size to account for VM overhead. This results in a total required physical RAM of 8.5 * 1.2 + the operating system baseline of 0.5 GB = 10.7 GB, which bumps up the server required to 12 GB. More accurate measurements using the memory cap method might reveal smaller maximums that allow us to drop to 10 GB and save $1000 / year, so we allocate up to 5 hours on further measurement.
</li>
<li><strong>Measure Memory Required by Each Process Measurement Set (iteration 2):</strong> We focus first on the largest contributor to memory usage: the report generation. Using the memory cap method reveals that it only actually requires 2 GB of memory.
</li>
<li><strong>Calculate Maximum Heap Size (iteration 2):</strong> The second scenario now comes to a total of 5.5 GB, which is now less than the first scenario. So adding the 6 GB of the first scenario to the 0.2 GB baseline and adding a 10% buffer gives us a total heap size required of 6.8 GB. Physical RAM required is 6.8 * 1.2 + 0.5 = 8.7 GB, which corresponds to a server with 10 GB of RAM. The next smaller server size is 8 GB. Since both scenarios have roughly equal usage at this point, more accurate measurements of multiple process sets would be needed and it seems unlikely that the RAM required would drop to 8 GB. So we decide that there is insufficient benefit to continue further measurement. We go with a server with 10 GB of RAM, and decide to use a maximum heap size of 7.4 GB to provide a 20% operational buffer since the space is available on the server.
</li>
</ol>
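<p>The first-iteration arithmetic from this example can be reproduced with a short sketch. The class and method names are hypothetical, and the numbers are the ones measured in the first pass: scenario totals, the 0.2 GB baseline, a 10% buffer, 20% VM overhead, 0.5 GB for the operating system, and servers sold in 2 GB steps.</p>

```java
// Hypothetical helper for illustration only.
final class HeapExample {

    /** Adds up the memory of the process measurement sets active in one scenario. */
    static double sum(double... gb) {
        double total = 0;
        for (double value : gb) {
            total += value;
        }
        return total;
    }

    /** Largest scenario plus baseline, with the operational buffer applied. */
    static double heapGb(double largestScenarioGb, double baselineGb, double bufferFraction) {
        return (largestScenarioGb + baselineGb) * (1.0 + bufferFraction);
    }

    /** Heap plus VM overhead plus operating system memory. */
    static double physicalRamGb(double heapGb, double vmOverheadFraction, double osGb) {
        return heapGb * (1.0 + vmOverheadFraction) + osGb;
    }

    /** Smallest server size, in whole steps, that covers the required RAM. */
    static int serverSizeGb(double requiredGb, int stepGb) {
        return (int) Math.ceil(requiredGb / stepGb) * stepGb;
    }

    public static void main(String[] args) {
        double weekday = sum(4.0, 2.0);            // peak web load + message processing
        double weekend = sum(0.5, 4.0, 1.0, 2.0);  // regular web + reports + batch + messages
        double heap = heapGb(Math.max(weekday, weekend), 0.2, 0.10);
        double ram = physicalRamGb(heap, 0.20, 0.5);
        System.out.printf("heap=%.2f GB, ram=%.2f GB, server=%d GB%n",
                heap, ram, serverSizeGb(ram, 2));
    }
}
```

<p>The unrounded results are slightly below the figures in the text because the text rounds the heap size to 8.5 GB before computing physical RAM. Re-running with 2 GB for report generation drops the weekend scenario to 5.5 GB, at which point the 6 GB weekday scenario becomes the larger one.</p>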
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2011/how-to-determine-maximum-heap-size/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Digital Disaster: Preparing for a Hard Drive Crash</title>
		<link>http://www.basilv.com/psd/blog/2006/digital-disaster-preparing-for-a-hard-drive-crash</link>
		<comments>http://www.basilv.com/psd/blog/2006/digital-disaster-preparing-for-a-hard-drive-crash#comments</comments>
		<pubDate>Thu, 29 Jun 2006 15:00:28 +0000</pubDate>
		<dc:creator>Basil Vandegriend</dc:creator>
				<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[backups]]></category>
		<category><![CDATA[hardware]]></category>
		<category><![CDATA[RAID]]></category>

		<guid isPermaLink="false">http://www.basilv.com/psd/blog/2006/digital-disaster-preparing-for-a-hard-drive-crash</guid>
		<description><![CDATA[Digital data is growing in importance not just in the workplace, but also at home. We correspond by email, take digital pictures and videos, and maintain digital music collections. All this valuable digital content and more is stored on your computer's hard drive. Hard drives are delicate mechanical devices with a finite lifespan. A quick [...]]]></description>
			<content:encoded><![CDATA[<p>Digital data is growing in importance not just in the workplace, but also at home. We correspond by email, take digital pictures and videos, and maintain digital music collections. All this valuable digital content and more is stored on your computer's hard drive. Hard drives are delicate mechanical devices with a finite lifespan. A quick online search suggests that a typical lifespan for a drive is somewhere between three and six years. So a drive will fail eventually. Are you prepared for a crash?</p>
<p>I was given the opportunity to answer that question myself when my hard drive crashed recently. I certainly wasn't expecting it: the drive was less than two years old. Was I prepared? The short answer is not nearly enough.</p>
<p>I had regularly performed backups of my most important data files to CD once every two months. So I did have a five-week-old backup that contained all my emails, my development projects (including my <a href="http://subversion.tigris.org/">Subversion</a> version control repository), my writing, and other important personal documents. This backup did not include my digital photos, but I had recently printed a set by my typical process of burning them to CD. I also had a data CD that contained a significant subset of my music collection, plus a variety of audio CDs. Nothing else was backed up. Fortunately, luck was on my side. After some investigation, I discovered that my hard drive had only suffered a partial failure. Using the program <a href="http://www.stompsoft.com/recoverlostdata.html">Recover Lost Data</a>, I was able to recover some of the files that weren't part of my backup, or that I had modified in the five weeks since my last backup.</p>
<p>Despite my foresight in making backups and my luck in recovering files off the damaged drive, recovering from this crash was a painful, time-intensive process. I permanently lost many of my digital photos and personal videos. I had to reinstall and reconfigure all the applications I regularly use. Firefox was especially painful: I was unable to recover my long list of Firefox extensions, so I had to search online to try to remember which ones I regularly use. I also had to download all the development libraries that I use in my personal development projects. Fortunately I use only a few commercial software products - the rest is open source - and had the necessary license information to install them.</p>
<p>Even before I started this recovery process, I knew that it wouldn't be easy or quick, and I decided that I didn't want to go through this experience again. I needed a better backup strategy. I wanted to be able to back up all my applications, including configuration information, as well as all my data files. Since some applications store configuration information in the Windows registry, this essentially requires backing up the entire drive. Doing this would allow for a seamless recovery from a crash. I also wanted backups to be done as often as possible and require minimal effort on my part. This might sound like a tough set of requirements to fulfill, but I already knew of a solution. Disk mirroring, one form of <a href="http://en.wikipedia.org/wiki/Redundant_array_of_independent_disks">RAID</a>, stores the same data on multiple physical drives. RAID 1 is the simplest mirrored configuration, consisting of two physical drives that are treated as a single logical drive. If one drive fails, the system can continue to operate using the remaining good drive without any loss of functionality or data. I first heard of RAID from <a href="http://www.joelonsoftware.com/news/20030125.html">an article by Joel Spolsky</a>, in which he writes about deciding to switch all non-laptop machines in his company to RAID after an experience with a hard drive failure wasted at least a day of his time, even with daily backups available.</p>
<p>Like Joel, I decided I must have a RAID 1 configuration for all future computers. Rather than replace the drive in my existing computer, I went shopping for a new system. But apparently my conclusion about the value of RAID isn't shared by the average consumer, since none of the big-name retail stores I checked offered RAID as an option, even in their customized machines. My continued search led me to <a href="http://www.dell.com">Dell</a>, which offered RAID 1 as an optional extra (called DataSafe) on their <a href="http://www1.ca.dell.com/content/products/features.aspx/dt_3100">Dimension 3100 model</a>. I was surprised how cheap the RAID 1 option was: it cost only a little more than the cost of the second hard drive. So now I sit typing this article on my new Dell computer, knowing that my data is instantaneously being written to the two hard drives inside. I no longer need to worry about hard drive crashes, because I'm fully prepared.</p>
<p>Are you prepared for when your drive crashes? When I mentioned my hard drive crash to my coworkers, I was surprised by how many of them indicated that they had absolutely no backups. A few mentioned that they didn't store much of value on their computers, so a crash wouldn't bother them. But a growing majority of people do have valuable data to protect, and I think RAID 1 is the best option to do so. I'd like to predict that RAID 1 will eventually become the standard configuration for home computers, but that may be a little optimistic.</p>
<p>Where I would like to see RAID 1 as a standard is in the workplace, where a cost-benefit analysis clearly favors its use. And I'm not talking servers, but individuals' workstations. I've never worked at or heard of a company besides Joel's that provides RAID-configured machines to employees. Even if all important data is stored on network drives or in a version control repository, a drive crash still means time wasted by the employee waiting for a new drive, plus time spent reinstalling and reconfiguring the applications. And how many people don't keep at least a few important files on their local hard drive, which are almost never backed up by IT departments? Even a few hours of time wasted by a crash covers the cost of the RAID configuration for that machine, and I expect it is normally days of work lost or wasted by a crash. So I think RAID makes sense economically, and not just for developers but for executive and administrative staff as well. It pays to be prepared.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.basilv.com/psd/blog/2006/digital-disaster-preparing-for-a-hard-drive-crash/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

