ryah ([info]four) wrote,
@ 2008-06-29 10:52:00
dedicated cache server?
I'm thinking about writing a website caching server. Something in-between memcached and squid. I don't want to write it (it will destroy my flow on other projects) but I think it's necessary. I have this written in Ruby but it's too slow. I was considering writing an Nginx module to handle it, but if I invest the sort of time that it takes to write an Nginx module, I might as well strive for a proper (web server independent) solution.

Please let me know if I can get the functionality I describe out of some existing software.

The problem is making caching exact. What is needed is a cache store which attaches IDs from database objects (from a Relational DB row or a CouchDB document) to cached output. Additionally each cache must have a request template. I will explain what I mean by request template:

Each dynamically generated web page has several parameters from the request that it uses to generate the response. For example, the HOST and PATH_INFO parameters are used to respond to a GET request to
http://four.livejournal.com/871515.html
Additionally, livejournal also checks the COOKIE header to authenticate. The necessary elements of the request (headers, http version, request uri, query parameters) along with their values are what I call a request template. In this example the request template might be
{ PATH_INFO: "/871515.html"
, HOST: "four.livejournal.com"
, cookies: { "ljdomsess.four": "v1:u1329:s17..." }
}
Any other request which matches all of the elements and their values in the request template will be served the cache.
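A minimal sketch of that matching rule in Python, assuming the request and the template are both plain dictionaries shaped like the example above. The function name `matches` is made up for illustration.

```python
def matches(template, request):
    """Return True if every element of the template is present in the
    request with the same value. Extra request elements are ignored."""
    for name, expected in template.items():
        if name not in request:
            return False
        actual = request[name]
        if isinstance(expected, dict):
            # Nested groups such as cookies: each listed entry must match.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual != expected:
            return False
    return True


template = {
    "PATH_INFO": "/871515.html",
    "HOST": "four.livejournal.com",
    "cookies": {"ljdomsess.four": "v1:u1329:s17..."},
}

# A request carrying extra headers still matches the template.
request = dict(template, ACCEPT="text/html")
print(matches(template, request))
```

A request with a different cookie value, or with no cookie at all, would not match and would fall through to the application.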

Note that for each page, different parameters from the request are needed. (For example, the "about livejournal" page might not use HOST: four.livejournal.com or the cookie - it might only depend on PATH_INFO.)

Caches are generated from dynamic web pages; each has a request template and a list of IDs. Expiration is done using the IDs. When I change post 12345, the application server (or the database) should notify the cache server that all caches involving 12345 should be expired. In the case of livejournal, that would probably mean expiring various caches of friends' pages, the page for the post itself, and the calendar page which lists post counts for each month.
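One way to picture ID-based expiration is a reverse index from each database ID to the caches that depend on it. The sketch below is only an illustration in Python; the class name `IdIndex`, its methods, and the string cache keys are assumptions, not part of any existing software.

```python
from collections import defaultdict


class IdIndex:
    """In-memory cache store with a reverse index from IDs to cache keys."""

    def __init__(self):
        self.entries = {}              # cache key -> cached output
        self.by_id = defaultdict(set)  # database ID -> keys depending on it

    def store(self, key, body, ids):
        self.entries[key] = body
        for object_id in ids:
            self.by_id[object_id].add(key)

    def expire(self, object_id):
        """Drop every cache associated with object_id.
        Returns the set of keys that were actually removed."""
        keys = self.by_id.pop(object_id, set())
        return {k for k in keys if self.entries.pop(k, None) is not None}


index = IdIndex()
index.store("post", "<html>post</html>", [12345])
index.store("friends", "<html>friends</html>", [12345, 67890])
index.store("calendar", "<html>calendar</html>", [12345])
print(sorted(index.expire(12345)))
```

Expiring 12345 removes the post page, the friends page, and the calendar page in one step. Stale references left under other IDs (here 67890) are harmless, because `expire` only reports keys that still existed.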

I don't pretend that this method of caching is the right solution for every case, but for very many simple dynamic websites this will work well, I think. Wordpress blogs and big catalog style websites, for example, would make good use of such a caching server. The main benefit is that the caching is exact and can be abstracted from the website programmer; the web framework can measure which request parameters are used and which database IDs a generated HTML chunk depends on.

How I intend to implement this

The caching server should be a simple HTTP server (written in C and using a simple HTTP server library). It will have 3 functions: serving cache, storing cache, and expiring cache. This will all be done through HTTP.

Serving Cache
The front-end webserver (Nginx or whatev) will send all GET requests to the cache server. If the cache server returns 404, it will then forward the request to the Application Server, otherwise it will serve the response. Inside the cache server, when it receives a GET request - it matches the request against all of its request templates to find a suitable cache. If it cannot find a matching request template it returns 404.
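The fall-through logic on the front-end side is small enough to sketch in Python. Here `cache_lookup` and `app_server` are hypothetical callables standing in for the two HTTP round trips; each is assumed to return a `(status, body)` pair.

```python
def handle_get(request, cache_lookup, app_server):
    """Try the cache server first; on a 404, forward to the application."""
    status, body = cache_lookup(request)
    if status == 404:
        # No request template matched, so the application generates the page.
        return app_server(request)
    return status, body
```

In a real deployment this decision would live in the front-end webserver's configuration rather than in application code.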

Expiring Cache
The Application Server will send POST /_expire?id=1234 to the cache server to expire all caches which are associated with the id 1234.

Storing Cache
The Application Server will send a POST request containing the cache and the associated IDs in the body of the request. Headers, path_info, and query params should constitute the request template for that cache. For example,
POST /871515.html HTTP/1.1
Host: four.livejournal.com
Cookie: ljdomsess.four=v1:u1329:s17...
The storage POST should not have any additional headers. The headers are exactly what will be used as the request template to filter GET requests later.
The cache would then be stored in memory, like memcached.
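As a rough illustration, the request line and headers of such a storage POST could be turned into a request template like this. The function name is hypothetical, and skipping Content-Length and Content-Type is an assumption of this sketch, since a POST with a body normally has to carry them. How the IDs are encoded in the body is left open here.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: headers that describe the POST body, not the cached page.
IGNORED = {"content-length", "content-type"}


def template_from_store_request(head):
    """Build a request template from the request line and headers."""
    lines = head.strip().splitlines()
    _method, target, _version = lines[0].split()
    url = urlsplit(target)
    template = {"PATH_INFO": url.path}
    params = dict(parse_qsl(url.query))  # values stay strings
    if params:
        template["params"] = params
    for line in lines[1:]:
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name.lower() in IGNORED:
            continue
        if name.lower() == "cookie":
            cookies = {}
            for pair in value.split(";"):
                key, _, val = pair.strip().partition("=")
                if key:
                    cookies[key] = val
            template["cookies"] = cookies
        else:
            template[name.upper()] = value
    return template


head = (
    "POST /871515.html HTTP/1.1\r\n"
    "Host: four.livejournal.com\r\n"
    "Cookie: ljdomsess.four=v1:u1329:s17...\r\n"
)
print(template_from_store_request(head))
```

For the example request this yields a template with PATH_INFO, HOST, and the one cookie, the same shape as the request template shown earlier.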

The key difference between this and memcached is filtering requests by the request template. In fact, I might use memcached as a back-end for storage (although, that's probably more overhead than it's worth). The difference between this and Squid is - well Squid doesn't do expiration or filtering at all (I think?).

suggestions? objections?



(12 comments)

Erlang
(Anonymous)
2008-06-29 11:19 am UTC (link)
Seriously consider doing this in Erlang instead of C.

Re: Erlang
[info]four
2008-06-29 11:58 am UTC (link)
playing with mochiweb would be fun.

varnish
(Anonymous)
2008-06-29 12:59 pm UTC (link)
What's wrong with Varnish plus etags? Also, Squid does expiry in a hacky way.

Re: varnish
[info]four
2008-06-29 01:54 pm UTC (link)
I haven't heard of Varnish, I'm going to check it out.


(Anonymous)
2008-06-29 01:22 pm UTC (link)
As a quick note, the following request is rather non-standard:
POST /871515.html HTTP/1.1
Host: four.livejournal.com
Cookie: ljdomsess.four=v1:u1329:s17...

(page body)

The problem is that it lacks a Content-Length header or a Transfer-Encoding: chunked specification. You'd probably have to explicitly ignore these headers.

Also, where does the /_expire?id=1234 come from? Maybe you could get on with the times ;) and use the PUT/DELETE methods to manage the cache contents (using request headers like above)?


[info]four
2008-06-29 02:25 pm UTC (link)
Yes, it'd ignore the Content-Length and Content-Type headers (which are not useful to a GET request anyway).

DELETE /_expire?id=1234 or POST /_expire?id=1234, it doesn't matter. But not DELETE /871515.html. The cache server expires based on id not path_info. Perhaps an extra feature could be added to explicitly delete single caches, in which case the DELETE method could be useful.

I have a forthcoming post about what I see as undeserved fanaticism over the PUT and DELETE methods. (Summary: GET reads, POST writes, and RESTful design works fine with just these. In fact, it has to, since browsers don't support the others. Not that I disagree with the use of DELETE/PUT in certain circumstances, e.g. DELETE a document in couchdb.)


(Anonymous)
2008-07-01 09:30 pm UTC (link)

OK, but where do you get the id from? Or the other way round, how does the cache map id=1234 to /871515.html?

As for PUT and DELETE methods, my comment was rather tongue-in-cheek :) Anyway, you'd still probably need a separate server (socket, interface etc.) or some authentication mechanism to allow cache purging only from certain clients (i.e. your real app server).


[info]evan
2008-06-29 04:38 pm UTC (link)
It seems you're just using a more-complicated cache key and doing whole-page caching. From the cache's perspective it doesn't even need to understand your key format -- it can just string-compare. The main reason memcached wouldn't work is that it has key length limits, which you could increase if you wanted.

The expiry thing needs to be handled off to the side, since you need the list of objects depending on a given id to reliably last longer than any of those objects live in your cache, and the objects are getting expired as load changes.

You mention a request potentially matching multiple cache entries, but it's not clear to me why it would. It seems you could, at development time, write your "request -> cacheid" function that can be applied to all requests.

In general, the sort of project where you have a web-request-juggling frontend that needs a bit of extra smarts (like generating cache keys here) and the ability to proxy requests from backends sounds like a Perlbal plugin to me.


[info]evan
2008-06-29 04:48 pm UTC (link)
(For example, you could have the app framework, in the normal process of generating a response, output "X-Cache-Key: " and "X-ObjectID-List: " headers that are then understood by Perlbal and inserted in the right places. Then the app only needs to notify Perlbal when an object changes so it can handle the expiry.)


[info]four
2008-06-29 05:18 pm UTC (link)
The idea is that requests never get to the application when there is a cache, so the cache server would do more than just string-compare. For the following key
{ PATH_INFO: "/friends"
, HOST: "four.livejournal.com"
, params: {skip:20}
}
a request to /friends?skip=20&something=irrelevant would be served the corresponding cache, but a request to /friends?skip=21 would not.

The cache look-up algorithm would take a full request
{ PATH_INFO: "/friends"
, HTTP_VERSION: "1.1"
, HOST: "four.livejournal.com"
, Accept: "text/html"
, Language: "en-US"
, params: {skip:20}
, cookies: { "ljdomsess.four" : "v1:u1329:s17" }
}
and search for the largest cache key that is a sub-hash of that request.
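A small Python sketch of that look-up, assuming cache keys and requests are nested dictionaries and that "largest" means the key with the most leaf elements. The linear scan and the names `is_subhash`, `size`, and `lookup` are illustrative choices only; a real server would want an index instead of checking every key.

```python
def is_subhash(key, request):
    """True if every element of key appears in request with the same value."""
    for name, expected in key.items():
        if name not in request:
            return False
        actual = request[name]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not is_subhash(expected, actual):
                return False
        elif actual != expected:
            return False
    return True


def size(key):
    """Number of leaf elements in a (possibly nested) cache key."""
    return sum(size(v) if isinstance(v, dict) else 1 for v in key.values())


def lookup(request, caches):
    """Return the body stored under the largest matching key, or None."""
    best_key, best_body = None, None
    for key, body in caches:
        if is_subhash(key, request) and (
            best_key is None or size(key) > size(best_key)
        ):
            best_key, best_body = key, body
    return best_body


caches = [
    ({"PATH_INFO": "/friends", "HOST": "four.livejournal.com",
      "params": {"skip": 20}}, "friends page, skip=20"),
]
request = {
    "PATH_INFO": "/friends",
    "HTTP_VERSION": "1.1",
    "HOST": "four.livejournal.com",
    "params": {"skip": 20, "something": "irrelevant"},
}
print(lookup(request, caches))
```

The irrelevant query parameter is ignored because the key does not mention it, while a request with skip=21 finds no matching key and `lookup` returns None, which corresponds to the 404 case.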

F5 cache?
(Anonymous)
2008-07-25 07:57 am UTC (link)
What you have proposed is a hell of a lot like what F5's caching product does: it matches the cache on GET parameters + host + URI + cookie, although it also swaps non-popular items out to disk when memory is full.

The results will be very similar to an nginx+memcache solution, though of course that may not validate against all the parameters you proposed.

It will be interesting to see nginx+memcache vs. your caching project, especially considering that the web server needs to open a new connection for every request, whereas nginx+memcache grabs the cache copy internally.

Re: F5 cache?
[info]four
2008-07-25 08:35 am UTC (link)
The additional, and to me very important, feature is the ability to associate IDs with the caches and expire based on those IDs. This allows a cache to be tied to a number of database entries and expired when any of those entries changes.

"whereas nginx+memcache grabs the cache copy internally."

In terms of data transfer, I think my project is essentially the same.
