Caching Made Easy Part I or: "How I learned to stop worrying and love the memcached"
Congratulations! Your website just made the front page on [awesome social media site here]. And while the business folks begin their joy-leap whose height will only be matched by the record in revenues, the server admins brace for impact as their apache/mod_php stack begins to strain under the weight of dynamically generated content.
Your typical traffic seems tiny in comparison to the wave of new anonymous users surging to your site. To compound the complexity, site updates are expected to show up instantly, with little or no lag for users.
There is more than one way to speed up your site: caching query objects, remote variables, using static memory.
While all of these are perfectly valid, they all require complicated invalidation logic and usually some refactoring of your application's code base, neither of which our current time constraints allow for. The easiest way to speed up your site is to cache rendered pages and serve the cached page.
The rules
The concept of anonymous users is the keystone of our caching system. Logged-in users typically have a different set of permissions, content and experience in mind, so we'll just say, for now, that all of our caching work will be related to anonymous users. But before we get into actually caching the content, we need to set up some ground rules to make caching work in our favor. After the rules, we'll work through the process of caching.
- Use a single site entry point
- Store an authentication marker in the session cookie
- Decide on a simple pre-shared key (4-8 characters)
- Set the appropriate headers to prevent client caching
Using a single site entry point
This is very important to fast caching, so don't skip this section!
If your site is available via multiple web assets, such as typo domains, www or other prefixes, you will need to pick one asset to funnel all your traffic into. I'd suggest using the base domain as the primary web asset. All other assets should gracefully redirect to the primary asset, using the webserver as the redirector. Doing this will make sure that cache keys are generated using uniform URLs.
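To illustrate the redirect decision, here is a minimal Python sketch. The redirect itself belongs in the webserver configuration, as described above; this only shows the logic. The host `example.com` and the function name `canonical_redirect` are placeholders, not part of any real setup.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical primary asset; substitute your own base domain.
PRIMARY_HOST = "example.com"


def canonical_redirect(url):
    """Return the URL to redirect to, or None if `url` is already canonical."""
    parts = urlsplit(url)
    if parts.hostname == PRIMARY_HOST:
        return None
    # Keep scheme, path and query; swap only the host (any port is dropped).
    return urlunsplit((parts.scheme, PRIMARY_HOST, parts.path, parts.query, ""))
```

With this in place, a request for `http://www.example.com/blog/1?page=2` would be sent to `http://example.com/blog/1?page=2`, so both URLs end up sharing one cache key.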
Store an authentication marker in the session cookie
The only way to tell if a user is logged in without hitting the database is to set a marker in the cookie letting your web application know whether the user is logged in. Doing this allows us, or other tiers in your application, to attempt to serve a cached page before even opening a database connection. There is little security risk in doing this other than possibly exposing the UIDs of your logged-in users.
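A rough Python sketch of reading such a marker from the raw `Cookie` header, with no database involved. The cookie name `logged_in` and the value `1` are assumptions made for this example; use whatever marker your application sets at login.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name for the "logged in" marker.
AUTH_MARKER = "logged_in"


def is_logged_in(cookie_header):
    """Return True if the request's Cookie header carries the login marker."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    marker = cookies.get(AUTH_MARKER)
    return marker is not None and marker.value == "1"
```

The marker only decides whether to try the cache. A request that claims to be logged in simply falls through to the application, which checks the session as usual.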
Decide on a simple pre-shared key
A pre-shared key is just a marker to prevent any collisions in your memcached instance. For example, you might pick anon_cache or pages as your pre-shared key, but not cache, as it is too ambiguous. In part 2 I'll cover how to take advantage of this pre-shared key to serve your cache directly from the webserver instead of the application for a massive speed up!
Headers
- Set the Expires header to something far far in the past. This can be your birthday (not suggested), the year your favorite cult classic came out (better), or your favorite historical event (best).
- Set Cache-Control: must-revalidate, post-check=0, pre-check=0 to force clients to hit your website.
In my experience, webpages themselves often vary between 5kB and 80kB, in contrast to images, which vary between 5kB and 5MB (or more). Serving 10,000 80kB pages is easier for a webserver than serving 10,000 1MB images. For this reason, pages themselves use Expires headers in the past, and images use Expires headers in the future, to encourage client-side/proxy caching of images and not of your dynamic pages.
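As a sketch of the two header policies in Python: the 1985 date and the 30-day image lifetime are arbitrary choices for this example, and `cache_headers` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Arbitrary date far in the past (an assumption; any past date works).
PAST = datetime(1985, 7, 3, tzinfo=timezone.utc)


def cache_headers(is_image, now=None, image_ttl_days=30):
    """Return response headers for a static image or a dynamic page."""
    if is_image:
        # Images: Expires in the future, so clients and proxies keep them.
        now = now or datetime.now(timezone.utc)
        expires = now + timedelta(days=image_ttl_days)
        return {"Expires": format_datetime(expires, usegmt=True)}
    # Pages: Expires in the past, so clients always come back to the site.
    return {
        "Expires": format_datetime(PAST, usegmt=True),
        "Cache-Control": "must-revalidate, post-check=0, pre-check=0",
    }
```

For a page, this yields `Expires: Wed, 03 Jul 1985 00:00:00 GMT` plus the Cache-Control line above.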
Caching logic
With some very basic rules in place, we can work on the caching logic for your web application. This method works well even when anonymous content is allowed on the site.
The Request
- Is this a GET/HEAD request?
  - Only process GET/HEAD requests via cache; everything else would invalidate the cache
- Does the user have a session?
  - If not, create one for the user
- Is the user logged in?
  - If we just created a new session, this will always be false
  - If they are, skip over the remaining steps
- Check for cache
  - Since the user is an anonymous user, check if there is a page cache entry using the key $presharedkey-$base_url/request_uri(). For example: PAGE-http://example.com/blog/1?page=2
    - You could use just request_uri(), but then it would have to be the only web asset using that memcached instance.
  - If a key exists with that name, serve the page immediately
  - If no key exists with that name, continue with normal processing
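The request steps above can be sketched in Python roughly as follows. `PRESHARED_KEY`, `BASE_URL` and both function names are placeholders for this example, and `cache` stands for any object with a dict-style `get(key)` method; a real memcached client would take its place.

```python
# Hypothetical pre-shared key and base URL; substitute your own.
PRESHARED_KEY = "PAGE"
BASE_URL = "http://example.com"


def cache_key(request_uri):
    """Build the key as <pre-shared key>-<base URL><request URI>."""
    return f"{PRESHARED_KEY}-{BASE_URL}{request_uri}"


def lookup_cached_page(cache, method, logged_in, request_uri):
    """Return the cached page for an anonymous GET/HEAD request, else None."""
    if method not in ("GET", "HEAD"):
        return None  # other methods always go through normal processing
    if logged_in:
        return None  # logged-in users skip the page cache entirely
    # A miss returns None, which means "continue with normal processing".
    return cache.get(cache_key(request_uri))
```

A return value of `None` covers all three fall-through cases (wrong method, logged-in user, cache miss), so the caller has a single branch to handle.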
Caching a page
- Generate the page as normal, but don't return it just yet
  - This process must happen before the page is printed. Typically a template system will be in place to handle output.
- Store the page in memcached
  - Only do this for anonymous users!
  - Key length is important! Memcached keys are limited to 250 characters.
  - Size is important. By default, memcached can only store items of up to 1MB. Use -I to override this or you will have a lot of failed sets and cache misses in your stats. This will result in poor performance.
  - This process overwrites the cached copy regardless of whether it was a GET/POST request. There will be a small window where a bunch of requests stampede memcached with set requests. You can use locking in memcached to curb this type of behavior.
- Serve the page to the user
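Here is a hedged Python sketch of the store step, including the key and size checks and a simple lock. `FakeCache` is an in-memory stand-in written for this example, not a real client. Its `add` mimics memcached's add command, which only succeeds when the key is absent, and that is one common way to build the lock mentioned above. The `LOCK-` prefix is an assumption.

```python
MAX_KEY_BYTES = 250            # memcached key length limit
MAX_ITEM_BYTES = 1024 * 1024   # default item size limit (raised with -I)


class FakeCache(dict):
    """In-memory stand-in for a memcached client (illustration only)."""

    def add(self, key, value):
        # Like memcached's add: succeeds only if the key is absent.
        if key in self:
            return False
        self[key] = value
        return True


def store_page(cache, key, html, logged_in):
    """Store a rendered page for anonymous users; return True if stored."""
    if logged_in:
        return False
    lock_key = "LOCK-" + key  # hypothetical lock naming scheme
    if len(lock_key.encode("utf-8")) > MAX_KEY_BYTES:
        return False
    # Approximate check: the real limit also counts per-item overhead.
    if len(html.encode("utf-8")) > MAX_ITEM_BYTES:
        return False
    if not cache.add(lock_key, "1"):
        return False  # another request is already storing this page
    try:
        cache[key] = html
    finally:
        del cache[lock_key]
    return True
```

Against a real memcached instance the lock would normally get a short expiry time, so that a crashed request cannot hold it forever.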
Invalidating the cache
You can have cron run through and invalidate cache entries based on their age or key name, or just nuke everything and call it a day. The method proposed above handles cache invalidation transparently: stale content is served while a PUT/POST request is being processed, and then the updated pages are regenerated and stored in memcached.
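An age-based sweep of the kind a cron job might run could look like the Python sketch below. It assumes the application keeps its own index, `stored_at`, mapping each cache key to the time it was stored. That index is an invention of this example, since memcached offers no simple way to list its keys. `cache` is again any dict-like stand-in for a real client.

```python
import time


def sweep(cache, stored_at, max_age, now=None):
    """Delete cached pages older than max_age seconds; return removed keys."""
    now = time.time() if now is None else now
    expired = [key for key, t in stored_at.items() if now - t > max_age]
    for key in expired:
        cache.pop(key, None)   # the entry may already have been evicted
        del stored_at[key]
    return expired
```

Setting an expiration time when the page is stored gets a similar effect without any sweep, since memcached then drops the entry on its own.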