
Streaming Data to Reduce Memory Usage

I recently performed a series of optimizations to reduce an application's memory usage. After completing several of them I noticed a common theme running through many of my optimizations, one that I could apply deliberately to identify further opportunities for improvement. As a recurring solution, this qualifies as a design pattern, which I refer to as Streaming Data.

Context

This pattern applies when you need to process a significant volume of data but the processing can be done incrementally on small subsets of it. A typical example is loading a list of entities and then iterating through the list to process each one. While the results (output) of processing may be combined across all the entities, the key requirement is that each processing step takes as input only a small subset of the data, not the entire list of entities. A code example illustrating this problem context is shown below:

List<Entity> entities = loadEntities();
List<ProcessingResult> results = new ArrayList<ProcessingResult>();
for (Entity entity : entities) {
  ProcessingResult result = processEntity(entity);
  results.add(result);
}

Solution

Reducing the memory usage in the above example rests on the observation that loading the entire list of objects consumes a large amount of memory and is not necessary, since we only use one object at a time. The solution is to stream these objects, retrieving them incrementally, instead of loading them all at once. For the consumer of this data the only change required is to first obtain a reference to the stream, such as an Iterable that incrementally fetches data. Updating our prior code example results in the following (only the first line changes):

Iterable<Entity> entities = streamEntities();
List<ProcessingResult> results = new ArrayList<ProcessingResult>();
for (Entity entity : entities) {
  ProcessingResult result = processEntity(entity);
  results.add(result);
}

Examples

The mechanism to use for streaming objects will depend on the source of the data and may require significant changes compared to a bulk load. Here are some specific examples.

Parsing XML

Parsing XML files using JAXB is a convenient approach for converting an entire file into a tree of Java objects, but it populates the whole tree at once. To instead stream such data, use the SAX parser provided as part of the JAXP API. The SAX parser is event-based: it iterates over the elements (and attributes) of your XML and invokes callbacks you define for each item.
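
To make this concrete, here is a minimal sketch of a SAX-based parse. The element name entity, the file name entities.xml, and the body of the callback are assumptions for illustration; only the element currently being parsed is held in memory:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityStreamParser extends DefaultHandler {

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) throws SAXException {
    // Invoked once per opening tag as the parser streams through the file.
    if ("entity".equals(qName)) {
      // Build a single entity from its attributes and process it here.
    }
  }

  public static void main(String[] args) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new File("entities.xml"), new EntityStreamParser());
  }
}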

Querying Databases using Hibernate

When using Hibernate to query for a collection of entities, it is convenient to simply ask Hibernate for the entire collection. A typical example using Hibernate's query by criteria API is shown below:

public List<Entity> queryData() {
  Criteria criteria = session.createCriteria(Entity.class);
  // Add appropriate restrictions
  // ...
  List<Entity> result = criteria.list();
  return result;
}

When the criteria matches a large volume of data, however, this approach will consume a correspondingly large amount of memory. Instead, use the scroll method on Criteria to obtain a ScrollableResults instance that can be used to iterate through the results. If you prefer not to expose the rest of the application to Hibernate classes, you can wrap the ScrollableResults in a special implementation of Iterator (one possible implementation is sketched after the example). The revision of the above example using streaming looks like the following (the method signature and the construction of the result change):

public Iterator<Entity> queryData() {
  Criteria criteria = session.createCriteria(Entity.class);
  // Add appropriate restrictions
  // ...
  ScrollableResults scrollableResults = criteria.scroll();
  Iterator<Entity> result = new ScrollableResultsIterator(scrollableResults);
  return result;
}
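
One possible implementation of the wrapper mentioned above. This sketch assumes the query selects a single entity per row, so the entity sits at index 0 of each result row:

import java.util.Iterator;
import java.util.NoSuchElementException;
import org.hibernate.ScrollableResults;

public class ScrollableResultsIterator implements Iterator<Entity> {

  private final ScrollableResults scrollableResults;
  private Entity next;

  public ScrollableResultsIterator(ScrollableResults scrollableResults) {
    this.scrollableResults = scrollableResults;
    advance();
  }

  private void advance() {
    // next() moves the cursor forward and returns false past the last row.
    if (scrollableResults.next()) {
      next = (Entity) scrollableResults.get(0);
    } else {
      scrollableResults.close();
      next = null;
    }
  }

  public boolean hasNext() {
    return next != null;
  }

  public Entity next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    Entity current = next;
    advance();
    return current;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}

Remember that the Hibernate session must stay open while this iterator is being consumed.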

This scroll approach only works when all the data can be processed within the same database transaction, since the Hibernate session must remain open for the ScrollableResults to continue fetching data. If this is not suitable, another option is to load the data using multiple queries that each return a subset of the data. One common example is displaying search results to a user: rather than showing all the results (which may number in the hundreds or thousands), show one page at a time and let the user step through the pages. Due to the frequency with which this occurs I refer to this solution as paging. Implementing it in Hibernate using the query by criteria API is fairly simple (a code sketch follows the steps below):

  1. Start by creating your criteria object and defining its restrictions as you normally would.
  2. Apply an ordering to the criteria. It is best if this ordering is consistent, by which I mean that database updates or inserts between queries will not cause invalid or unexpected results to be returned. This matters because each query for a page executes in a separate database transaction, which provides no transactional isolation for the group of queries as a whole. In some contexts consistency is not required; when it is, I prefer to sort by an auto-incrementing surrogate primary key to achieve the highest level of consistency.
  3. Apply restrictions to retrieve only the specific page. This is done using the methods setFirstResult and setMaxResults on the Criteria object.
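
Putting these steps together, a paged query might look like the following sketch. It assumes the same session field as the earlier examples; the id property used for ordering is an assumed auto-incrementing surrogate key, and Order is org.hibernate.criterion.Order:

public List<Entity> queryPage(int pageNumber, int pageSize) {
  Criteria criteria = session.createCriteria(Entity.class);
  // Add appropriate restrictions
  // ...
  // Step 2: a consistent ordering by an auto-incrementing surrogate key.
  criteria.addOrder(Order.asc("id"));
  // Step 3: restrict the query to the requested page only.
  criteria.setFirstResult(pageNumber * pageSize);
  criteria.setMaxResults(pageSize);
  return criteria.list();
}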

Consequences

One potential consequence of streaming data is a reduction in performance, because data is loaded piece by piece rather than in bulk. To mitigate this, use what I call loading sets: define subsets of the total data volume that are small enough not to impact memory usage but large enough to minimize the performance impact, then load the data one set at a time. The consuming API does not need to change: it can still iterate over each loaded set, fetching the next set once the current one is exhausted. A sketch of this approach is shown below.
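
As an illustration, the following sketch hides the loading sets behind the Iterator interface. The abstract loadSet method is a placeholder for a paged query such as the queryPage method sketched earlier; a set smaller than setSize signals that the data is exhausted:

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public abstract class LoadingSetIterator<T> implements Iterator<T> {

  private final int setSize;
  private int setNumber = 0;
  private Iterator<T> currentSet = Collections.<T>emptyList().iterator();
  private boolean exhausted = false;

  protected LoadingSetIterator(int setSize) {
    this.setSize = setSize;
  }

  // Supply the actual paged query, e.g. using setFirstResult/setMaxResults.
  protected abstract List<T> loadSet(int setNumber, int setSize);

  public boolean hasNext() {
    if (currentSet.hasNext()) {
      return true;
    }
    if (exhausted) {
      return false;
    }
    List<T> set = loadSet(setNumber++, setSize);
    if (set.size() < setSize) {
      exhausted = true; // a short set means there is no further data
    }
    currentSet = set.iterator();
    return currentSet.hasNext();
  }

  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return currentSet.next();
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}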

