What is the Correct Caching Strategy?

While looking for ways to speed up our application on Google App Engine, we decided to use Memcache. This led to an interesting discussion, which I am reproducing here to get your input.

If you have been following our blog, you will have seen that there are two potential ways to cache: invasive and non-invasive. Maybe there is a third way that you can tell us about. We decided that entities which do not change much but are still fetched again and again need to be cached.

For fetching the entities, there are again two possible ways:

  1. Fetch the entity individually or
  2. Fetch the entity as a part of a group of entities.

This becomes clearer with an example. Say we have to find the tasks assigned to a person, and these task assignments are based on a range of dates. The API we are talking about would be this:

[sourcecode language="java"]
List<TaskAssignment> fetchTaskAssigments(User user, Date startDate, Date endDate);
[/sourcecode]

Behind the scenes, this query goes to the datastore and fetches the TaskAssignment(s) for that user based on the date range. In our scenario the date ranges are canned: for simplicity, they are month ranges. Hence we are interested in the TaskAssignment(s) for the months of Jun, Jul, Aug, Sep and so on.

CASE I

One way is to cache lists, i.e. cache all the assignments belonging to Jun, Jul, Aug and Sep. We would then have four cached lists: a List<TaskAssignment> for Jun, one for Jul, one for Aug and one for Sep.

Benefits of caching this way:

  • Once the results are cached, there is no more computation necessary. All the lists would be fetched from the cache.
  • We can apply non-invasive caching on the methods as aspects. The results are put into the cache and the business logic does not need to know about the caching framework.
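To make the idea of non-invasive caching concrete, here is a minimal sketch that uses a JDK dynamic proxy in place of an AOP framework. It is only an illustration under stated assumptions: the `AssignmentService` interface, the `DatastoreAssignmentService` class, the string-based user and month parameters, and the in-memory map standing in for Memcache are all hypothetical and not part of our application.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical service interface; the month is a plain string such as "Jun".
interface AssignmentService {
    List<String> fetchTaskAssigments(String user, String month);
}

// Stand-in for the datastore-backed implementation; it knows nothing about caching.
class DatastoreAssignmentService implements AssignmentService {
    final AtomicInteger datastoreHits = new AtomicInteger();

    @Override
    public List<String> fetchTaskAssigments(String user, String month) {
        datastoreHits.incrementAndGet();
        return Arrays.asList(user + "-task-" + month);
    }
}

class CachingProxyDemo {
    // Wraps the service so that repeated calls with the same arguments are
    // answered from an in-memory map (an assumption, standing in for Memcache).
    static AssignmentService withCache(AssignmentService target) {
        Map<String, Object> cache = new ConcurrentHashMap<>();
        InvocationHandler handler = (proxy, method, args) -> {
            // The cache key is the method name plus its arguments.
            String key = method.getName() + Arrays.deepToString(args);
            Object cached = cache.get(key);
            if (cached != null) {
                return cached;
            }
            Object result = method.invoke(target, args);
            if (result != null) {
                cache.put(key, result);
            }
            return result;
        };
        return (AssignmentService) Proxy.newProxyInstance(
                AssignmentService.class.getClassLoader(),
                new Class<?>[] {AssignmentService.class},
                handler);
    }

    public static void main(String[] args) {
        DatastoreAssignmentService datastore = new DatastoreAssignmentService();
        AssignmentService service = withCache(datastore);
        service.fetchTaskAssigments("vikas", "Jun");
        service.fetchTaskAssigments("vikas", "Jun"); // answered from the cache
        System.out.println("datastore hits: " + datastore.datastoreHits.get());
    }
}
```

The business class never references the cache; the wrapping happens where the service is wired up, which is roughly what an around aspect would do for us.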

Limitations of caching this way:

  • TaskAssignment entities are duplicated in the cache. If Jun, Jul and Sep all contain the same TaskAssignment, then that entity is present in the cache three times.
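The duplication can be sketched with a small example. This is an illustration only: the `TaskAssignment` class below is a hypothetical, stripped-down stand-in, `java.time` types are used instead of `Date` for brevity, the year 2010 is arbitrary, and a plain map stands in for Memcache. In a real Memcache each list would be serialized separately, so every occurrence counted here would be a separate stored copy.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MonthlyListCacheDemo {
    // Hypothetical, minimal TaskAssignment: an id and an inclusive date range.
    static final class TaskAssignment {
        final String id;
        final LocalDate start;
        final LocalDate end;

        TaskAssignment(String id, LocalDate start, LocalDate end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }

        // True when the assignment's range touches any day of the month.
        boolean overlaps(YearMonth month) {
            return !start.isAfter(month.atEndOfMonth()) && !end.isBefore(month.atDay(1));
        }
    }

    // Case I: build one cached list per month.
    static Map<YearMonth, List<TaskAssignment>> cacheByMonth(
            List<TaskAssignment> all, List<YearMonth> months) {
        Map<YearMonth, List<TaskAssignment>> cache = new LinkedHashMap<>();
        for (YearMonth month : months) {
            List<TaskAssignment> forMonth = new ArrayList<>();
            for (TaskAssignment a : all) {
                if (a.overlaps(month)) {
                    forMonth.add(a);
                }
            }
            cache.put(month, forMonth);
        }
        return cache;
    }

    // Counts how many cached lists contain the assignment with the given id.
    static int cachedCopies(Map<YearMonth, List<TaskAssignment>> cache, String id) {
        int copies = 0;
        for (List<TaskAssignment> list : cache.values()) {
            for (TaskAssignment a : list) {
                if (a.id.equals(id)) {
                    copies++;
                }
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        List<TaskAssignment> all = new ArrayList<>();
        all.add(new TaskAssignment("long-task", LocalDate.of(2010, 6, 10), LocalDate.of(2010, 8, 5)));
        all.add(new TaskAssignment("short-task", LocalDate.of(2010, 9, 1), LocalDate.of(2010, 9, 3)));
        List<YearMonth> months = new ArrayList<>();
        for (int m = 6; m <= 9; m++) {
            months.add(YearMonth.of(2010, m));
        }
        Map<YearMonth, List<TaskAssignment>> cache = cacheByMonth(all, months);
        System.out.println("long-task copies: " + cachedCopies(cache, "long-task"));   // 3 (Jun, Jul, Aug)
        System.out.println("short-task copies: " + cachedCopies(cache, "short-task")); // 1 (Sep)
    }
}
```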

CASE II

Another way to cache is to get all the TaskAssignments for the user, irrespective of the date range, and cache that. Effectively, we are talking about the following API:

[sourcecode language="java"]
List<TaskAssignment> fetchTaskAssigments(User user);
[/sourcecode]

Now when there is a need to invoke a method of the following API

[sourcecode language="java"]
List<TaskAssignment> fetchTaskAssigments(User user, Date startDate, Date endDate);
[/sourcecode]

then the implementation would be something like this

[sourcecode language="java"]
public List<TaskAssignment> fetchTaskAssigments(User user, Date startDate, Date endDate) {
    // Served from the cache or the datastore, depending on the caching aspect
    List<TaskAssignment> assignments = fetchTaskAssigments(user);
    // Filter in memory instead of in the datastore
    return filterAssignmentsOnDateRange(assignments, startDate, endDate);
}
[/sourcecode]
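The filtering helper is not shown above, so here is one possible sketch of it. It assumes that a TaskAssignment carries an inclusive start and end date and that an assignment belongs to a range when the two ranges overlap; the stripped-down `TaskAssignment` class and its field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

class DateRangeFilter {
    // Hypothetical, minimal TaskAssignment with an inclusive date range.
    static final class TaskAssignment {
        final String id;
        final Date startDate;
        final Date endDate;

        TaskAssignment(String id, Date startDate, Date endDate) {
            this.id = id;
            this.startDate = startDate;
            this.endDate = endDate;
        }
    }

    // Keeps the assignments whose date range overlaps [startDate, endDate].
    static List<TaskAssignment> filterAssignmentsOnDateRange(
            List<TaskAssignment> assignments, Date startDate, Date endDate) {
        List<TaskAssignment> inRange = new ArrayList<>();
        for (TaskAssignment a : assignments) {
            boolean startsOnOrBeforeRangeEnd = !a.startDate.after(endDate);
            boolean endsOnOrAfterRangeStart = !a.endDate.before(startDate);
            if (startsOnOrBeforeRangeEnd && endsOnOrAfterRangeStart) {
                inRange.add(a);
            }
        }
        return inRange;
    }
}
```

The cost is one linear pass over the user's cached assignments per call, which is the computation that Case II trades for not storing duplicates.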

Here, a non-invasive cache aspect would be applied to the fetchTaskAssigments(User user) method, which would then fetch the list either from the cache or from the datastore.

Benefits of caching this way:

  • There is NO duplication of TaskAssignment being cached. Each TaskAssignment is cached only once.
  • This caching is also non-invasive since the business logic is not aware of the cache.

Limitations of caching this way:

  • A filtering computation has to be done every time the TaskAssignment(s) are returned on the basis of a date range.
  • Some extra logic, which was not required earlier, needs to be written to fetch all the TaskAssignment(s) for a user.

So in a nutshell, instead of filtering in the datastore, we are filtering in the code. And instead of storing duplicate entities, we are storing each entity only once.

Let us assume that the number of TaskAssignment(s) is not huge, so that fetchTaskAssigments(User user) in Case II is not very expensive. Also assume that we have enough cache space available, so that storing duplicate entities in Case I is also not very expensive.

Given these facts, which strategy would you use and why? Are there any other benefits or limitations of the above approaches that would help you make your decision? In our case we went with Case I, since we could quickly write an around aspect and inject caching, but we are not sure whether it is the best way to go. What are your thoughts and recommendations?

Written by 

Vikas is the CEO and Co-Founder of Knoldus Inc. Knoldus does niche Reactive and Big Data product development on Scala, Spark, and Functional Java. Knoldus has a strong focus on software craftsmanship which ensures high-quality software development. It partners with the best in the industry like Lightbend (Scala Ecosystem), Databricks (Spark Ecosystem), Confluent (Kafka) and Datastax (Cassandra). Vikas has been working in the cutting edge tech industry for 20+ years. He was an ardent fan of Java with multiple high load enterprise systems to boast of till he met Scala. His current passions include utilizing the power of Scala, Akka and Play to make Reactive and Big Data systems for niche startups and enterprises who would like to change the way software is developed. To know more, send a mail to hello@knoldus.com or visit www.knoldus.com