I was recently doing some performance tuning and made the surprising discovery that doing less caching in Hibernate actually improved performance in a particular scenario. When I first found the problem, this seemed very counter-intuitive. In fact, my original design maximized the use of caching in order to improve performance, but the opposite happened in practice. In hindsight, of course, the reason for this was fairly obvious. So I thought I would share the details of this situation to help you avoid making the same mistake.
I was tuning a batch processing application that received XML input data sets, each consisting of thousands of separate input records. The processing logic converted each input record into multiple Hibernate entities, as many as several hundred. This logic required a number of queries to implement: some to load related, preexisting entities, and others to verify consistency with existing data. This queried data would often be needed for multiple input records in the same data set. Based on this, I decided to use a single Hibernate session to process the entire data set, committing after each input record but keeping the session open to be able to make use of cached entities for subsequent processing.
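The original design looked roughly like the following sketch (names such as `InputRecord`, `dataSet`, and `processRecord` are illustrative, not from the actual application): one session for the whole data set, with a transaction committed per record.

```java
// Sketch of the original design: a single Session kept open across the
// entire data set, committing after each input record so that entities
// loaded for one record stay cached for later records.
Session session = sessionFactory.openSession();
try {
    for (InputRecord record : dataSet.getRecords()) {
        Transaction tx = session.beginTransaction();
        processRecord(session, record); // creates/loads up to several hundred entities
        tx.commit(); // commit triggers a flush of the whole session
    }
} finally {
    session.close();
}
```

Note that because the session is never cleared, every entity created or loaded for earlier records remains attached and is considered by every subsequent flush.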
When initial performance tests were carried out, they showed a disturbing trend: the processing time required per input record in the data set increased linearly. This meant that the total time required to process a data set grew quadratically with the size of the set! This is illustrated by the diagrams below.
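To see why linear per-record cost yields quadratic total cost, consider a toy model (my own illustration, not code from the application): if the session holds k entities after k records and each flush scans every entity in the session, then record k costs roughly k units of work, and the total for n records is 1 + 2 + … + n = n(n+1)/2.

```java
public class FlushCostModel {
    // Toy model: one entity accumulates in the session per record, and
    // each flush scans every entity held so far, so record k costs ~k units.
    static long totalFlushWork(int n) {
        long total = 0;
        for (int record = 1; record <= n; record++) {
            total += record; // flush after record k scans k entities
        }
        return total; // equals n * (n + 1) / 2, i.e. quadratic in n
    }

    public static void main(String[] args) {
        // Doubling the data set roughly quadruples the total flush work.
        System.out.println(totalFlushWork(1000)); // 500500
        System.out.println(totalFlushWork(2000)); // 2001000
    }
}
```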
An analysis of where the time was being spent showed that the majority of the processing logic required only constant time per record. So where was the extra time going? The culprit seemed to be the call to commit the transaction to the database. I knew that even a few hundred database insert/update statements would execute quickly, in nearly constant time (databases are built to scale, after all). The actual database commit was equally speedy. My default assumption is that network calls will be the source of performance delays, but in this case that assumption proved incorrect.
So what exactly was happening when I committed the Hibernate transaction, before the calls to the database? Hibernate's first step is to perform a flush to write all entities with changes (called dirty entities) to the database via insert/update/delete calls. How exactly does Hibernate determine which entities are dirty? For loaded entities Hibernate uses byte-code instrumentation to add logic to track when entities become dirty. But my scenario involved new entities, for which Hibernate could not work its magic. So on each flush Hibernate scanned the fields of each entity to see if there were changes. A linearly-increasing number of entities naturally led to a linearly-increasing time per flush. To make matters worse, Hibernate's flush algorithm apparently has a performance problem when dealing with cascaded collections, which I was using in my scenario.
The solution to my performance problem was to evict all the entities from the session after committing, thus detaching them, and then reattach to the session only the few entities I reused in subsequent processing.
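In classic Hibernate Session API terms, this fix can be sketched as follows (the entity name `sharedLookupEntity` is a hypothetical stand-in for whichever entities are reused across records):

```java
tx.commit();
// Detach everything so subsequent flushes no longer dirty-check
// the entities accumulated for earlier records.
session.clear();
// Reattach only the few entities needed by later records;
// LockMode.NONE reassociates a clean detached instance without
// issuing any SQL.
session.lock(sharedLookupEntity, LockMode.NONE);
```

With this change each flush only inspects the handful of entities belonging to the current record, so the per-record cost stays constant regardless of how far into the data set the processing has progressed.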
This article is one of a series on Hibernate Tips & Tricks.
If you find this article helpful, please make a donation.