Scalable static file hosting and some thoughts on Amazon S3

Boris Mann
created on Thu, 2006-04-27 15:54

So, one of the big problems with static file serving is that it requires an entire Drupal bootstrap, which sucks up a lot of system resources. We have work to do in Drupal and the File API to support some interesting scalability options, remote files, and so on. But I think there are some interesting models for doing highly scalable static file serving outside of Drupal (and Amazon's S3 is cool). For the first architecture, consider the following setup for the domain example.com:

  • 3 (or more) front end web servers with Drupal installed (identical codebase checkouts, call these www1 - N)
  • a database backend (could be a cluster or just a big high performance machine, call this db1 - N)
  • Drupal files and tmp directories mounted via NFS from a separate machine (can either be a NAS or the static file serving box itself)
  • a static file serving machine (call this static1 - N)

To Drupal, all file operations are local -- the NFS mount is configured as the local file and tmp directories, and Drupal thinks it's writing to a local file system. So, no changes to Drupal code are required at all. Now, for actually serving up the files, the front www* servers have Apache mod_rewrite configured to redirect all requests for files (e.g. *.jpg, *.doc, etc.) to static.example.com. The static machine doesn't need to run PHP (or even Apache) at all -- it can be completely optimized for serving up static files. That's it. I don't (yet) have an example of the mod_rewrite rules for this, but I think this would make an interesting "recipe" for how to serve up lots of static files without any Drupal performance penalty.
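To make the redirect decision concrete, here is a rough sketch in Python of the logic those rewrite rules would need to encode. This is not mod_rewrite syntax, just the decision itself. The extension list and the `static_redirect` name are assumptions for illustration; static.example.com is the host from the setup above.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Assumed set of extensions to treat as static files; adjust per site.
STATIC_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".doc", ".pdf"}
STATIC_HOST = "static.example.com"


def static_redirect(request_uri: str) -> Optional[str]:
    """Return the redirect target for a static file request, or None
    if the request should stay on the www* server and go to Drupal."""
    parts = urlsplit(request_uri)
    # Compare extensions case-insensitively so photo.JPG also matches.
    suffix = PurePosixPath(parts.path).suffix.lower()
    if suffix not in STATIC_EXTENSIONS:
        return None
    target = f"http://{STATIC_HOST}{parts.path}"
    if parts.query:
        target += "?" + parts.query
    return target
```

Anything that returns None here (node pages, index.php, and so on) still goes through Drupal as usual; only the file requests leave the www* boxes.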

Also: this doesn't HAVE to be used with a cluster -- it can even be used on a single machine, where one vhost runs Drupal and another runs the static file web server. But even a two-machine minimum would provide quite a performance increase.

My thoughts on Amazon S3? First, for those that don't know about it, here's the quote from the page describing it:
Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.

So, you get dedicated file storage from Amazon with a web services front end and good bandwidth rates. They also have some very cool extra niceties: for example, you can request any file with ?torrent on the end to get a BitTorrent seed file automatically. A full-on Drupal module that did some more interesting things would be exciting, but is not the point of this post...

So, imagine a server-based script that synced those same file and tmp directories to Amazon's S3 service, with the mod_rewrite rules pointing at the Amazon S3 space instead of static.example.com. Now, you're not even using your own bandwidth or file storage, and you have predictable costs associated with both bandwidth and storage.

However, there is a gotcha with this that needs some work. Basically, it's going to take X amount of time for a script to transfer new files to the S3 service. During that time frame, mod_rewrite needs to NOT forward requests for just the files that haven't been synched yet. So, the synch script probably needs to dynamically update the mod_rewrite rules (insert hand waving here) as each file finishes synching.
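One way to picture the bookkeeping side of this, as a minimal sketch only: the sync script keeps a set of the files it has finished uploading, and a request is forwarded only if the file is in that set. The `upload` callable below is a hypothetical stand-in for whatever S3 client gets used, and `sync_files` and `should_forward` are names made up for illustration. How the web server actually consults the set is still the hand-waving part.

```python
import os
from typing import Callable, Set


def sync_files(files_dir: str, synced: Set[str],
               upload: Callable[[str, str], None]) -> Set[str]:
    """Upload every file under files_dir that is not yet in `synced`.

    `upload(local_path, key)` is a hypothetical callable standing in for
    a real S3 client. A key is recorded as synced only after upload
    returns without raising, so a failed transfer is retried next run.
    """
    for root, _dirs, names in os.walk(files_dir):
        for name in sorted(names):
            local_path = os.path.join(root, name)
            # Use the path relative to files_dir, with forward slashes,
            # as the remote key.
            key = os.path.relpath(local_path, files_dir).replace(os.sep, "/")
            if key in synced:
                continue
            upload(local_path, key)
            synced.add(key)
    return synced


def should_forward(key: str, synced: Set[str]) -> bool:
    """Forward a request to S3 only if the file is known to be there."""
    return key in synced
```

Until a file shows up in the set, requests for it would keep being served from the local files directory, which closes the window where a freshly uploaded file is not on S3 yet.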

Note: I am most definitely not "the server guy" -- I'm pretty decent at systems architecture, but I may very well be on crack with pieces of this. I've had ideas around this rolling around in my head for a while, so I thought it was time to just post it. Comments and feedback welcome.

File Hosting

I use Dumpplace.com for File hosting

http://www.dumpplace.com

Another Idea

Another idea would be to have a specified class of files (such as video and audio files) go directly to S3 and bypass Drupal altogether. We built a Cocoa application that does exactly this: it uploads videos to S3, then creates a node in Drupal pointing to them.

M

Conditional rewrites

"mod_rewrite needs to NOT forward requests for just the files that haven't been synched yet."

You should be able to do that with Apache and RewriteCond using the -f, "is regular file" test. I haven't tried it with Apache, but we're doing something similar with lighttpd and S3: Serving static content with lighttpd and S3
