Why a cache in an O/R mapper doesn't make it fetch data faster.
Preface
One of the biggest myths in O/R mapper land is about 'caching'. It's often believed that using a cache inside an O/R mapper makes queries much faster and thus
makes the O/R mapper more efficient. With that conclusion in hand, every O/R mapper which doesn't use a cache is therefore less efficient than the ones who do, right?
Well... not exactly. In this article I hope to explain that caching in O/R mappers is not there for making queries more efficient, but is there for uniquing.
But more on that later on. I hope that at the end of the article, I have convinced the reader that the myth Caching == more efficiency is indeed a myth. Beware,
it's perhaps a bit complicated here and there, I'll try to explain it in as much layman's terms as possible.
What's a cache?
Before I can explain what a cache is, it's important to understand what an entity is, what an entity instance is etc. Please consult this article first to learn
about what's meant with these terms.
A cache is an object store which manages objects so you don't have to re-instantiate objects over and over again, you can just re-use the instance you need from
the cache.
A cache of an O/R mapper caches entity objects. Pretty simple actually. When an entity is fetched from the
persistent storage (i.e. the database), the entity object (i.e. the entity class instance which contains the entity instance (== data)) which contains the data fetched, is
stored in the cache, if it's not there already. What exactly does that mean: "if it's not there already" ? It means that the entity object isn't there yet.
Caches in O/R mappers are above all used for a concept which is called uniquing. Uniquing is about having a single entity object for every entity (== data)
loaded. This means that if you load the entity of type Customer and with PK "CHOPS" from the Northwind database, it gets stored in an entity object, namely
an instance of the Customer entity class. What happens if you load the same entity with PK "CHOPS" again in another instance of the Customer entity class? You
would end up with two instances of the same class, but with the same data. So effectively the objects represent the same entity.
This doesn't have to be a problem. Most actions on entities don't require a unique entity object. After all, they're all mirrors of the real entities in the
database and with a multi-appdomain application (like desktop applications accessing the same database or a multi-webserver using webapplication) you have the chance of
having multiple entity objects containing the same entity data anyway.
However sometimes it can be a problem or an inconvenience. When that happens, it's good that there's a way to have unique objects per entity loaded. Most O/R
mappers use a cache for this: when an entity is loaded from the database, the cache is consulted if there's already an entity object with the entity data of the
same entity fetched. If that's the case, that instance is updated with the data read from the database, and that instance is
returned as the object holding the data. If there's no object already containing the same entity, a new instance is created, the entity data fetched is stored
in that instance, that instance is stored in the cache and the instance is returned. This leads to unique objects per entity.
Not all O/R mappers use a cache for uniquing however, or don't call it a 'cache'. You see, a central cache is really a very generalizing. What if you need for a given
semantical context a unique entity, and outside that context you don't need a unique instance or a different, unique instance? Some O/R mappers, like LLBLGen Pro, use Context objects which provide uniquing for a semantical context, e.g. inside a wizard or
an edit form. All entity objects inside that context are stored in unique objects.
Caches and queries: more overhead than efficiency
So, when does this efficiency the myth talks about occur exactly? Well, almost never. In fact, using a cache is often less efficient. I said almost,
as there are situations where a cache can help, though these are minor or require a lot of consessions. However I'll discuss them as well so you have a complete picture.
Let's state I have a cache in my O/R mapper and I want to see, by using theory, how efficient it might be. So I have my application running for a given period of time,
which means that the cache contains a number of entity objects. My application is a CRM application, so at a given time T the user wants to view all customers who
have placed at least 5 orders in the last month. This leads thus to a query for all customer entities which have at least 5 related order entities which are placed
in the last month.
What to do? What would be logical and correct? Obviously: fetching the data from the persistent storage, as the entities live there, and only then we'll
get all known and valid customer entities matching the query. We can consult our cache first, but we'll never know if the entities in the cache are all the entities
matching my query: what if there are many more in the database, matching the query?
So we can't rely on the cache alone, we always have to consult with the persistent storage as well.
This thus causes a roundtrip and a query execution on the database. As roundtrips and query executions are a big bottleneck of the complete entity fetch pipeline,
the efficiency the myth talks about is nowhere in sight. But it gets worse. With a cache, there's actually more overhead. This is caused by the
uniquing feature of a cache. So every entity fetched from the database matching the query for the customers has to be checked with the cache: is there already an instance
available? If so, update the field values and return that instance, if not, create a new instance (but that's to be done anyway) and store it in the cache.
So effectively, it has more overhead, as it has to consult the cache for each entity fetched, as well as store all new entities into the cache. Storing entities
inside a cache is mostly done with hashvalues calculated from the PK values and stored per type. As hashvalues can result in duplicates (it's just an int32 in
most cases) and compound PKs can complicate the calculation process, it's not that straight forward to get the lookup process of entities very efficient.
Some tricks can help... a bit and for a price
Before I get burned down to the ground, let me say that there are some tricks to speed things up a bit. I have to say "a bit" because it comes at a high price:
you've to store a version field in every entity and the O/R mapper of choice must support that version field. This thus means that you've no freedom over how
the entities look like or how your datamodel looks like.
This is a high price to pay, but perhaps it's something you don't care about. The trick is that instead of returning the full resultset in the
first query execution, only the PK values and the version values are returned for every entity matching the query. By checking the cache, you use the version
value to see if an entity has been changed in the db since it was fetched and stored in the cache. If it's not changed, I don't have to fetch the full entity, as the
data is already in the cache. If it is changed or not in the cache at all, I've to fetch the full entity. So then I will fetch all entities matching the PKs I've
collected from my cache investigation. The advantage of this is that it might be that the second query is very quick. It however also can bomb: what if you're
using oracle and you have 3000 customers matching your query? You then can't use an WHERE customerid IN (:a, :b, ...) query as you'll be exceeding the limit
of parameters to send in a single query. It also will cause a second
query run, which might add actually more time than simply doing a single fetch: first the PK-Version fetch query has to be run, then the second full fetch query
(which might result in less rows, but still...).
You might wonder: what if I control all access to the database? Then I know when an entity is saved, and thus can keep track of when which entities are changed as
well and thus can make assumptions based on that info whether an entity is updated or not! Well, that's true, but that's not scaling very well. Unlike Java, .NET
doesn't have a proper cross-appdomain object awareness system. This means that if you have even two systems targeting the same database
(webfarm, multiple desktops), you can't use this anymore. And even if you're in the situation where it could help (single appdomain targeting single database),
it's still takes time to get things up to steam: until all entities of a given type are inside the cache, you still have to fetch from the database.
Cache and single entity fetches
There is one situation where a cache could help and be more efficient. That is: if you know or assume the data you might find in a cache is 'up to date' enough
for you to be used. That situation is with single entity fetches using a PK value. Say the user of our previously mentioned CRM application wants to see the details of
the customer with PK value "CHOPS". So all what should happen is an entity fetch by using the PK value "CHOPS". Consulting the cache, it appears that the entity with
the PK value "CHOPS" is already loaded and available in the cache! Aren't we lucky today!
Again, what to do? What's logical in this situation? Pick the one from the cache, or consult the database and run the risk of fetching the exact same entity data as
is already contained in the entity object in the cache? I'd say: go to the database. You only know for sure you're working with the right data by consulting the persistent
storage. If you pick the one from the cache, you might run the risk that another user has updated the customer data and you're working with outdated, actually
wrong data which could lead to wrong conclusions while you could have made the right conclusions if you would have looked at the right data.
If you pick the one from the database, you might run the risk of burning a few cycles which turned out to be unnecessary. A trick here could be that
if the entity in the cache is fetched X seconds ago, it's still considered 'fresh enough', as all data outside the database is stale the moment it's read anyway.
But if correctness is in order, you can't be more sure than by reading from the database and bypass the cache.
So what's left of this efficiency?
Well, not much. We've seen that it actually adds more overhead than efficiency, so it's even less efficient than not using a cache. We've seen that it could be
solved in a way which could lead to more efficiency but only in a small group of situations and required consessions you're likely not willing to make. We've also seen
that a cache could be more efficient in single-entity fetches, but only if you're willing to sacrifice correctness, or if you're sure the data in your cache is
valid enough.
Are caches then completely bogus? Why do O/R mappers often have a cache?
They're not bogus, they're just used for a different feature than efficiency: uniquing. Some O/R mappers even use the cache for tracking changes on entity objects (and
thus the entities inside these objects). Claiming that the cache is making the O/R mapper more efficient is simply giving the wrong message: it could be a bit more
efficient in a small group of situations, and is often giving more overhead than efficiency.
I hope with this article that people stop spreading this myth and realize why caches (or contexts or whatever they're called in the O/R mapper at hand) in O/R mappers are used for uniquing, and not for object fetch efficiency.