How to set up a cheap Infinite Disk system using S3 without having to use EC2.
I don’t normally post technical stuff on my blog, but I’m putting this up because I think it might help others.
Wists images are stored in the file system with filenames based (principally) on a hash of their URL. This keeps the db small. Overloading the directories is prevented by creating subdirectory structures based on the first few letters of the hash (/b/c/bc267867 etc.).
We want to use Amazon for reliability and unlimited storage, but we need to couple the system to application logic (and our code isn’t multi-threaded), so straightforward use of S3 isn’t possible.
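The directory scheme above can be sketched roughly as follows. This is only an illustration: the post does not say which hash function Wists uses, so MD5 is an assumption here, and the function name image_path, the /images root and the two-level depth are hypothetical choices made to match the /b/c/bc267867 example.

```python
import hashlib
from pathlib import PurePosixPath


def image_path(url: str, root: str = "/images", depth: int = 2) -> PurePosixPath:
    """Map an image URL to a sharded path such as /images/b/c/bc26...

    Assumption: MD5 stands in for whatever hash the real system uses.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # One directory level per leading hex character of the hash,
    # so no single directory collects every file.
    shards = list(digest[:depth])
    return PurePosixPath(root, *shards, digest)
```

With two levels of single hex characters, files are spread over at most 256 leaf directories, and the path can always be recomputed from the URL without a db lookup.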
The problem with S3 at the moment is latency. We don’t want to manipulate and store Wists thumbnails remotely on an EC2 instance, but if we do it locally there is a delay while the images are sent over the wire and stored in S3. Furthermore, rsync and rsync-style approaches are a real load problem for us, so we have to do batch backups every hour based on the age of files (in theory rsync should be lighter weight; in practice it is not for our particular use).
We also serve images from cache.wists.com via a CDN that uses squid to pull the images once they exist. We will keep this piece because the CDN works out cheaper than S3, although it can’t readily be used as a backup service. The method below does not depend on it, however.
Step 1. Set up a cron job that calls a shell script hourly. The script backs up to S3 any files newer than 1 hour and deletes local files older than 1 week (we keep a week’s images as a buffer in case there is a failure).
Step 2. Point the CDN to S3 (instead of at Wists servers).
Step 3. Change the application code so that all image requests are for local versions, i.e. wists.com/images/foo, instead of going directly to the CDN (cache.wists.com at the moment). The application logic will check whether the file exists locally; if not, it tries cache.wists.com, and lastly it shows a placeholder by default if that doesn’t exist either.
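As a rough sketch of Step 1, here is the selection logic in Python rather than shell. The function name backup_and_prune and the upload callback are hypothetical; the callback stands in for whatever S3 client actually sends the file, and no real S3 API is called here. Ages are taken from file modification times, which is an assumption about how "newer than 1 hour" is measured.

```python
import os
import time

HOUR = 60 * 60
WEEK = 7 * 24 * HOUR


def backup_and_prune(root, upload, now=None, backup_window=HOUR, keep_for=WEEK):
    """Upload recently modified files and delete files past the local buffer.

    upload(path, key) is a hypothetical callback that stores one file in S3
    under the given key; it is assumed to raise on failure.
    """
    now = time.time() if now is None else now
    uploaded, deleted = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            age = now - os.path.getmtime(path)
            if age <= backup_window:
                # Key mirrors the local layout, e.g. b/c/bc267867
                upload(path, os.path.relpath(path, root))
                uploaded.append(path)
            elif age > keep_for:
                os.remove(path)
                deleted.append(path)
    return uploaded, deleted
```

One thing to watch with this approach: if an hourly run is missed, files from that hour are never uploaded but are still deleted after a week, so a backup window somewhat wider than the cron interval may be safer.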
This way all files newer than 1 week will be served live from wists and everything else from the CDN which will cache from the S3 backup.
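The lookup order can be sketched like this. It is a minimal illustration, not the actual Wists code: resolve_image, the cdn_has callback (for example an HTTP HEAD check against the CDN) and the placeholder path are all hypothetical names; only cache.wists.com and the /images prefix come from the description above.

```python
import os


def resolve_image(name, local_root, cdn_has,
                  local_base="/images",
                  cdn_base="http://cache.wists.com",
                  placeholder="/images/placeholder.png"):
    """Return the URL to serve for an image: local, then CDN, then placeholder.

    cdn_has(name) is a hypothetical callback that reports whether the CDN
    (backed by S3) can serve the image.
    """
    # 1. Files still inside the one-week local buffer are served directly.
    if os.path.isfile(os.path.join(local_root, name)):
        return f"{local_base}/{name}"
    # 2. Older files have been pruned locally; fall back to the CDN.
    if cdn_has(name):
        return f"{cdn_base}/{name}"
    # 3. Nothing anywhere: show the default placeholder.
    return placeholder
```

The local check is a cheap stat on disk, so the extra cost falls mainly on requests for older images that have to go through the CDN check.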
Although this usage is perversely the reverse of the normal way a squid-based CDN is used (i.e. the CDN usually checks on each call whether the file exists, and caches it if not) and will result in higher webserver loads, the advantages of reliability and simple infinite storage outweigh the disadvantages in our instance.