Feb 3, 2014

Choosing Hibernate's caching strategy

One of the most common use cases for Ehcache out there, if not the most common, is using it with Hibernate. With only a little configuration, you can speed up your application drastically and reduce the load on your database significantly. Who wouldn’t want that? So off you go: you add the couple of lines of configuration to your application. Nothing too complicated… You’re quickly left with two questions to answer though:

What entities and/or relations do I cache?
What strategy do I use for these?

We’ll leave the former question as an exercise for the reader… but I’d like to spend some time here discussing the second question.

Choosing the strategy

So you decide to use caching on an entity or a relationship. You now have to tell Hibernate what caching strategy to use. Caching strategies come in four flavors: read-only, read/write, non-strict read/write and transactional. What these entail is to be considered from two perspectives: your ORM’s, i.e. Hibernate’s, and the cache’s. The latter actually only applies to caches that loosen the consistency semantics, especially when you go distributed. Let’s start with the easy one first, Hibernate’s perspective.
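To make "telling Hibernate" concrete, here is a minimal sketch of how the strategy is typically declared when you map entities with annotations. The Company entity and its fields are only an illustration borrowed from the example used later in this post; the sketch assumes the javax.persistence and Hibernate annotations are on your classpath and that the second level cache itself has been enabled in your configuration.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Hypothetical entity, used only to show where the strategy is declared.
@Entity
@Cache(usage = CacheConcurrencyStrategy.NONSTRICT_READ_WRITE)
public class Company {

    @Id
    private Long id;

    private String name;

    private boolean doesCaching;

    protected Company() {
        // no-arg constructor required by Hibernate
    }

    // getters and setters omitted for brevity
}
```

The other three flavors map to CacheConcurrencyStrategy.READ_ONLY, READ_WRITE and TRANSACTIONAL.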

From Hibernate’s perspective

The Hibernate documentation does give you some hints, but we’ll go into slightly more detail here about what’s going on.

Read-only

Easiest one to reason about. The data held in the cache is immutable and as such will never be updated. Solves all isolation problems. The data simply is. Hibernate will actually go as far as prohibiting any mutation of that dataset, throwing an exception at you if you try to update any such entity. In a “strongly consistent” cache, as you’d have within a single VM from any cache vendor, there isn’t much more thought to put into this one. Straightforward… Gotta love this one!
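As a rough illustration of that contract, here is a toy model in plain Java. It is not Hibernate's or Ehcache's code: the ReadOnlyRegion class and its method names are made up for this sketch. It only shows the two properties that matter: values are installed once after a load, and updates are refused outright.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a read-only cache region (hypothetical, for illustration only).
final class ReadOnlyRegion<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();

    // Called after a load from the database; the first value installed wins.
    V putFromLoad(K key, V value) {
        V previous = entries.putIfAbsent(key, value);
        return previous != null ? previous : value;
    }

    // Returns the cached value, or null on a miss.
    V get(K key) {
        return entries.get(key);
    }

    // Any attempt to mutate cached data is treated as a programming error.
    void update(K key, V value) {
        throw new UnsupportedOperationException(
                "Cannot update read-only entry: " + key);
    }
}
```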

Non strict Read/Write

Now if you have mutable entities, that’s one of the three options you have. As the name implies it’s read and write, but in a non-strict sense… Non-strict here means that Hibernate will not try to isolate mutations to an entity from other concurrent transactions reading the same data. But what does this mean?

At this stage, I think it’s important to debunk a couple of misunderstandings about Hibernate. Firstly, and most importantly here, Hibernate does NOT store your entities in the second level cache. Instead it stores a “dehydrated” representation of the entity. Hibernate comes with an option to have it store Maps in the cache. You would never want to do that on a production system, but it can be useful for debugging an app. With that option, a cache entry is a Map of property name to property value for a given entity type (e.g. entity “Company”: [“id”: 1, “name”: “Terracotta”, “doesCaching”: true]). The default storage is merely a more efficient way of storing that same data.

Secondly, the cache is accessed/updated as the database is accessed/updated. So say a query is issued against companies, trying to select all companies that do caching, while the session holds newly created (managed) company instances: Hibernate would first flush all of these to the database. Should they be reloaded from the database within this uncommitted transaction, they’d also make it into the second level cache. This is where non-strict is important. Pending (i.e. uncommitted) changes can become visible through the cache to other transactions, breaking the I (isolation) guarantee of ACID.

Also worth mentioning is that the behavior above is taken from Ehcache’s NonStrictReadWrite strategy. We’ve tried our best to make it hard for users to shoot themselves in the foot. This strategy will invalidate cached entries on updates and only populate the cache with data loaded from the database. Note that these invalidations happen after the transaction has successfully committed. As a result there is a race where “old” values can still be seen in the cache after they have been updated in the database.
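The invalidate-after-commit race is easier to see when you step through it by hand. The following is a toy model in plain Java, not Ehcache's implementation: NonStrictDemo, commitUpdate and afterCommit are names invented for this sketch, and two HashMaps stand in for the database table and the cache region. Entries are stored in the "dehydrated" Map form described above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of invalidate-after-commit (hypothetical, for illustration only).
final class NonStrictDemo {
    // Stand-ins for the database table and the second level cache region.
    // Values are "dehydrated" entities: property name -> property value.
    final Map<Long, Map<String, Object>> database = new HashMap<>();
    final Map<Long, Map<String, Object>> cache = new HashMap<>();

    static Map<String, Object> company(long id, String name, boolean doesCaching) {
        Map<String, Object> state = new LinkedHashMap<>();
        state.put("id", id);
        state.put("name", name);
        state.put("doesCaching", doesCaching);
        return state;
    }

    // Read path: serve from the cache, otherwise load and populate.
    Map<String, Object> read(long id) {
        Map<String, Object> cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        Map<String, Object> loaded = database.get(id);
        if (loaded != null) {
            cache.put(id, loaded);
        }
        return loaded;
    }

    // Step 1 of an update: the database commit.
    void commitUpdate(long id, Map<String, Object> newState) {
        database.put(id, newState);
    }

    // Step 2 of an update: the invalidation, which only happens after commit.
    void afterCommit(long id) {
        cache.remove(id);
    }
}
```

Any read that lands between commitUpdate and afterCommit is served the old state from the cache; the first read after afterCommit misses and reloads the committed state from the database.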

Read/Write

Read/Write (or strict read/write) tries to account for the shortcomings of the non-strict approach. It does so by implementing a “soft-lock” mechanism, locking entries in the cache on a per-entry granularity as they are mutated. Only after the transaction has successfully committed will these locks be released, installing the appropriate value in the cache. Using this caching strategy, Hibernate will lock entries on flushes to the database. Every other transaction accessing a locked entry will consider it a cache miss and hit the database instead. What’s nice about this strategy is that on contention, it lets the database handle the concurrent access, meaning that the isolation level provided is resolved by the database, as if there was no cache at all. Now this seems perfectly reasonable… Yet, there is one shortcoming to this strategy as well: it stores these soft-locks in a cache. A cache that could evict (or expire) them. The absence of a lock for an entry can result in stale data being present in the cache or (as for non-strict) in uncommitted state being exposed to other transactions.
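Here is a small sketch of the soft-lock idea and of what goes wrong when the lock is evicted. It is a deliberately simplified toy in plain Java, not Hibernate's or Ehcache's actual classes: SoftLockRegion and all of its methods are hypothetical names, and a single marker object stands in for a real lock record.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of soft-locking (hypothetical, for illustration only).
final class SoftLockRegion {
    // Marker stored in place of a value while a transaction mutates the entry.
    private static final Object SOFT_LOCK = new Object();

    private final Map<Long, Object> entries = new HashMap<>();

    // Returns null on a miss. A locked entry is reported as a miss too,
    // which sends the caller to the database.
    Object get(long key) {
        Object value = entries.get(key);
        return value == SOFT_LOCK ? null : value;
    }

    // Populate after a database load, unless the entry is currently locked.
    boolean putFromLoad(long key, Object value) {
        if (entries.get(key) == SOFT_LOCK) {
            return false;
        }
        entries.put(key, value);
        return true;
    }

    // On flush: replace the cached value with a lock.
    void lock(long key) {
        entries.put(key, SOFT_LOCK);
    }

    // After commit: release the lock by installing the committed value.
    void unlock(long key, Object committedValue) {
        entries.put(key, committedValue);
    }

    // What an eviction or expiry does: the lock disappears with the entry.
    void evict(long key) {
        entries.remove(key);
    }
}
```

While the lock is in place, putFromLoad refuses to install anything. Once evict has removed the lock, the very same call succeeds, and a value read before the writer committed ends up in the cache for everyone to see.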

Transactional

Transactional deals with all these limitations. It basically expects your cache to be a full-blown JTA XAResource that can be modified alongside the database (meaning you do need to configure both to use the same isolation level). Yet you also pay the price of two-phase commit. Also, full XA compatibility does require recovery support, which for a cache might be overkill (as it only caches data held in the database, and your application could potentially work with a cold, i.e. empty, cache).

From a distributed perspective

When you move to a distributed environment (i.e. multiple application nodes hitting the same databases), your cache needs to be kept in sync across all these nodes. While enabling Terracotta clustering with Ehcache is only a couple of lines of config again, how you configure your clustered cache becomes important. Terracotta provides basically two consistency models for clustered caches: Strong and Eventual. In Strong, you basically get JMM guarantees across your cluster. Our Hibernate caching strategies will account for all that’s to be taken care of to provide you with proper visibility semantics. In Eventual consistency though, things get slightly more complex for the user to understand (yet it provides better performance in terms of both read and write throughput). Besides the consistency, Terracotta lets you configure the non-stop behavior of the clustered caches (what paragraph on distributed systems would be complete without mentioning CAP?!). If a certain operation can’t happen within a given configured time, you have multiple options. In the following paragraphs, when referencing non-stop, we will be talking about any behavior that’s not failure (i.e. favoring A over C).

Read-only

That’s again the easiest one to reason about: since the data never mutates, there is nothing to worry about in terms of isolation. Yet, in terms of visibility, multiple nodes could be putting the same value into the cache, resulting in more database hits than you would expect. Say you have a list of countries in such a cache, configured to hold all countries and never expire or evict them. This is a very common pattern for reference data that you want to keep close to your application. It could well be that some countries are loaded and put in the cache from multiple nodes, especially under very high initial load. The same is true in the face of partitioning with non-stop configured: at worst you’ll see more database hits than you’d expect. But since data is immutable, there isn’t any stale state ever…

Non strict Read/Write

While this mode will also work fine with eventual consistency, you’re basically blowing the race wider for inconsistent data making it into the cache. How much of a problem this may or may not be to your application is very use case dependent. Invalidations are propagated asynchronously to all nodes, which makes outdated values available slightly longer on the nodes that haven’t done the mutation. But since data is only ever populated from the database, when populated it’s always with the latest state. Non-stop here can result in even larger races. Say an invalidation is ignored during a partition (noop configured non-stop caches): that data will remain in the cluster until the next mutation happens. Whether this situation is acceptable to your application is again up to you…
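To picture how asynchronous invalidation widens the window, consider this toy model of two nodes in plain Java. It has nothing to do with Terracotta's actual wire protocol: EventualCluster, its maps and its queue are all invented for the sketch, and delivery is triggered by hand so that each step of the race can be observed.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of two nodes with asynchronously propagated invalidations
// (hypothetical, for illustration only).
final class EventualCluster {
    final Map<Long, String> database = new HashMap<>();
    final Map<Long, String> nodeA = new HashMap<>();
    final Map<Long, String> nodeB = new HashMap<>();

    // Invalidations sent by node A that node B has not applied yet.
    private final Queue<Long> pendingForB = new ArrayDeque<>();

    // Read through the given node's local view of the cache.
    String read(Map<Long, String> node, long id) {
        String cached = node.get(id);
        if (cached != null) {
            return cached;
        }
        String loaded = database.get(id);
        if (loaded != null) {
            node.put(id, loaded);
        }
        return loaded;
    }

    // Node A commits an update: the local invalidation is immediate,
    // the remote one is only queued.
    void updateOnA(long id, String value) {
        database.put(id, value);
        nodeA.remove(id);
        pendingForB.add(id);
    }

    // Node B eventually applies the queued invalidations.
    void deliverToB() {
        Long id;
        while ((id = pendingForB.poll()) != null) {
            nodeB.remove(id);
        }
    }

    // Models a no-op non-stop cache during a partition: invalidations are lost.
    void dropPending() {
        pendingForB.clear();
    }
}
```

Between updateOnA and deliverToB, node B keeps serving the old value. If dropPending is what happens instead, node B keeps serving it until the next mutation of that entry gets through.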

Read/Write

This is where things get more interesting! While it could sound like it would be as acceptable as the other two strategies so far, it actually isn’t. Mainly because of the soft-locks. As reads remain mainly unlocked, we can’t provide Hibernate with the guarantees it expects from such a strategy (see Hibernate 3 vs. 4 below). As explained for non-strict, with non-stop caches, you could end up with stale locks in the cache, basically rendering the whole strategy useless. Long story short, don’t use this strategy with a distributed cache that isn’t strongly consistent.

Transactional

In Terracotta land, that one is actually surprisingly easy. As you need to have your cache be a proper XAResource, Ehcache will not let you configure anything nonsensical here. That is, Ehcache will only let you use an XA transactional cache with strong consistency. And here the only sensible behavior in the face of partitioning is to fail, and as such to remain consistent.

One last thing… or two

Optimistic locking to the rescue

In order to deal with stale state (as it can happen even without a second level cache, your session being the first level cache already), you can implement an optimistic locking strategy within your data model. As a result though, some layer(s) in your application will either have to deal with optimistic locking failures such as JPA’s OptimisticLockException (e.g. you tried to update the salary of Alex version 12, but that’s not the version present in the database at flush time anymore) or have your user deal with it… The latter probably not sounding too good, it is worth understanding what corruption your data is exposed to in any given deployment (with or without second level cache, be it distributed or not).
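The version check itself is simple enough to model in a few lines. The sketch below is plain Java and not Hibernate's implementation: VersionedStore, Row and StaleVersionException are names invented here, and a HashMap plays the role of the table. It follows the general idea of an UPDATE guarded by the version that was read, where the write only succeeds if that version is still the current one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a version check at flush time (hypothetical, for illustration only).
final class VersionedStore {
    static final class Row {
        final long version;
        final int salary;

        Row(long version, int salary) {
            this.version = version;
            this.salary = salary;
        }
    }

    static final class StaleVersionException extends RuntimeException {
        StaleVersionException(String message) {
            super(message);
        }
    }

    private final Map<String, Row> rows = new HashMap<>();

    void insert(String name, int salary) {
        rows.put(name, new Row(0, salary));
    }

    Row load(String name) {
        return rows.get(name);
    }

    // Succeeds only if the caller still holds the current version,
    // then bumps the version so later stale writers are rejected.
    void update(String name, long expectedVersion, int newSalary) {
        Row current = rows.get(name);
        if (current == null || current.version != expectedVersion) {
            throw new StaleVersionException(
                    name + " is no longer at version " + expectedVersion);
        }
        rows.put(name, new Row(expectedVersion + 1, newSalary));
    }
}
```

Two callers that both loaded Alex at version 0 can't both win: the first update bumps the version to 1, and the second one fails instead of silently overwriting the first.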

Hibernate 3 vs. 4

In Hibernate 4, the caching provider can actually implement some of the behavior of the strategy itself. In Hibernate 3, nothing like it was feasible. As a result, there are smarter things (to a still limited extent though) we could do about dealing with these different modes in a clustered environment. Yet one main problem remains: Read/Write is an FSM (finite state machine), which is not implementable atop a weakened consistency model. Also, as of today, Hibernate doesn’t try to handle cache “failures”, which is probably a fair thing to do, but forces you to understand those quirks.

In conclusion

Hibernate tries as best as possible to hide the complexity of the caching layer from the user. Yet there is only so much it can do about it. Yes, it enables users to easily plug in a cache without much further thought. But as your application’s deployment complexity grows, you might be forced to revisit that initial strategy so it deals with the oddities of distributed systems in a way acceptable both to the domain and to its users.