friday, 22 march 2013

posted at 17:50

Over the last couple of weeks I've been working on integrating remote filestores into FastMail. We've had our own online file storage facility for years (long before it was cool), and you've always been able to attach a file in your store to an email and save an email attachment to your store. We've been extending that to allow you to use a "cloud" file storage service in exactly the same way.

Our file storage facility is fairly simple in concept, and operates around the traditional files-and-folders model that we've used forever. For the first external service to integrate with we chose Dropbox, mostly because it's by far the most widely used, but also because it uses the same model, so it was very easy to create an abstraction and slot Dropbox in behind it. For FastMail subscribers, you can try it out right now on beta. Once we've finished polishing and testing it we'll be releasing it to production. Should only be a couple of weeks away, but don't quote me on that!

When I developed our internal remote filestore abstraction I designed it with the idea that it would be fairly simple to integrate other remote filestores as well. Today I spent a good amount of time working on an integration for Google Drive. I'm mostly doing this to satisfy myself that I have a good abstraction in place, but of course Google is no small fish and I think that it would be wonderful if we could make this available to users as well.

This has not been a simple undertaking. To find out why, let's talk a bit about the architecture of our client.

One of the features of our internal API (the AJAX stuff that our client uses) is that it is completely stateless. This has been done deliberately, as it makes it very easy to scale our backend servers. Obviously state can be held (there is a very nice database available all the time), but the API itself has no real concept of state, so it becomes tough to know what data to store and when to expire it. So to build anything we start by assuming it will be stateless.

Our attachment system is very simple. There is a file picker that requests metadata for a given path (a standard /foo/bar/baz construction). It gets back the name, full path, timestamp and type for the requested folder and its immediate children. When the user selects a folder, a new metadata request comes in for that folder. The server does not care what's gone on before; it just turns paths into lists of metadata. Later, to actually attach a file, we call a different method with the wanted path, and the file data comes back. Like I said, very simple.
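The shape of that stateless call can be sketched in a few lines. This is a hypothetical illustration only — the store, field names and function are invented for this post, not our actual API:

```python
# Hypothetical in-memory stand-in for the file store. Field names
# (name, type, ts) are invented for illustration.
FILES = {
    "/foo":         {"name": "foo", "type": "folder", "ts": 1363910000},
    "/foo/bar":     {"name": "bar", "type": "folder", "ts": 1363910100},
    "/foo/bar/baz": {"name": "baz", "type": "folder", "ts": 1363910200},
    "/foo/bar/baz/notes.txt":
                    {"name": "notes.txt", "type": "text/plain", "ts": 1363910300},
}

def metadata(path):
    """Stateless lookup: path in, metadata for that folder and its
    immediate children out. Nothing is remembered between calls."""
    if path not in FILES:
        return None
    prefix = path.rstrip("/") + "/"
    children = [dict(FILES[p], path=p) for p in FILES
                if p.startswith(prefix) and "/" not in p[len(prefix):]]
    return {"folder": dict(FILES[path], path=path), "children": children}
```

Every request carries the full path, so any backend server can answer it without shared session state.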

So back to Drive. The major reason for it being an utter pain in the backside to integrate is that the API itself has no concept of folders or paths. Now anyone using the Drive web interface will know that it has folders. This is actually something of a lie. A Drive is just a giant pool of files with various properties that you can query on. A file can have "parent" and "child" pointers to other files, which allows a loose hierarchy of files to be constructed. A folder is simply a zero-size file with a special type (application/vnd.google-apps.folder) and appropriate parent and child pointers.

Every file has a unique and opaque ID, unrelated to the file's name. These IDs are what's used in the parent and child pointers. There's no way for us to construct the ID of a file from a path. To find the metadata for our file, we have to follow parent and child pointers around.

So let's say we want to get the metadata for the folder /foo/bar/baz and the files inside it. We start off by getting the metadata for the root "folder", helpfully called root (gotta start somewhere). Along with all the info about that root folder we get back its ID. Let's say its ID is 'root123456' (it won't be; IDs are opaque and apparently random, but this will do for our purposes).

Now we have to find foo. We request the file list, with some search filters (normally all on one line, presented here with newlines for readability):

'root123456' in parents and
title = 'foo' and
mimeType = 'application/vnd.google-apps.folder'

Gotcha 1: deleted ("trashed") and hidden files are returned by default. We don't want those. So actually the filter is:

'root123456' in parents and
title = 'foo' and
mimeType = 'application/vnd.google-apps.folder' and
trashed = false and
hidden = false

Gotcha 2: this query goes into the q= parameter of a GET request; however, it needs to be form-encoded rather than the standard URI 'percent' encoding.
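Both gotchas together can be sketched like this. The filter text matches the queries above; `folder_query` is my own helper name, not anything from the Drive API. The form encoding (spaces as '+') is what Python's `urlencode` produces by default:

```python
from urllib.parse import urlencode

def folder_query(parent_id, title):
    """Build the q= filter for finding a named subfolder, with the
    trashed/hidden exclusions from gotcha 1.
    'application/vnd.google-apps.folder' is Drive's folder MIME type."""
    return (f"'{parent_id}' in parents and "
            f"title = '{title}' and "
            f"mimeType = 'application/vnd.google-apps.folder' and "
            f"trashed = false and hidden = false")

q = folder_query("root123456", "foo")

# Gotcha 2: form encoding, not percent encoding — urlencode() turns
# spaces into '+' rather than '%20'
query_string = urlencode({"q": q})
```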

(Neither of these gotchas is documented. Good luck.)

Assuming it exists, we'll get back a "list" containing one item. I'm not actually sure if two items with the same name and type can exist. Probably, so for now I return "not found" if I don't get exactly one result. That's an implementation detail though, and it might change.

So now we have our foo metadata, we can get its ID and then repeat the process for bar, and so on.
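The drill-down is just a loop: one lookup per path component. In the sketch below, `resolve_child` stands in for a single HTTP files.list call with the filter shown above; here it's mocked with a table so the shape is visible (all names and IDs invented):

```python
# Stand-in for Drive: (parent_id, title) -> folder ID. In real life
# each resolve_child() is one HTTP files.list request.
CHILDREN = {
    ("root123456", "foo"): "foo123456",
    ("foo123456", "bar"): "bar123456",
    ("bar123456", "baz"): "baz123456",
}

def resolve_child(parent_id, title):
    return CHILDREN.get((parent_id, title))  # None means "not found"

def resolve_path(path, root_id="root123456"):
    """Turn '/foo/bar/baz' into its opaque Drive ID, one request per hop."""
    current = root_id
    for part in path.strip("/").split("/"):
        current = resolve_child(current, part)
        if current is None:
            return None
    return current
```

Each hop needs the ID from the previous one, which is why the round-trips can't be parallelised away.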

Each of these requests is a separate HTTP call. They're stateless, so various performance tricks can be utilised (keepalives, etc). My servers are on good networks so it's not that slow, but it's still a lot of round-trips.

Once we drill down that far, we do a final request for a file list with the same filter, this time leaving off the title and mimeType terms (we want everything):

'baz123456' in parents and
trashed = false and
hidden = false

Gotcha 3: this will return Google documents, spreadsheets, presentations and the like. These are identifiable by MIME type, and are not downloadable (because they're internal application magic). Their metadata has various URLs for converting to e.g. Word documents, but these aren't really appropriate for our use. We'd like to filter them out. Unfortunately that means excluding a specific set of MIME types in the filter:

'baz123456' in parents and
mimeType != 'application/vnd.google-apps.document' and
mimeType != 'application/vnd.google-apps.spreadsheet' and
mimeType != '...' and
trashed = false and
hidden = false

That sucks because you have to encode the full list of exclusions right there in the query, and update it whenever Google adds a new type of document. Instead I've opted to drop anything with a zero size, but there's no size search term available, so I've got to pull the lot and then filter.
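The client-side filter is tiny. This sketch assumes each entry is a dict of Drive metadata where a missing or zero fileSize marks a Google-native document — that's the heuristic from above, not an official API contract:

```python
def downloadable(entries):
    """Keep only real, downloadable files: anything with no usable
    fileSize is treated as a Google-native doc and dropped.
    (Heuristic, not a documented Drive guarantee.)"""
    return [e for e in entries if int(e.get("fileSize", 0)) > 0]

# Hypothetical result of the files.list call
entries = [
    {"title": "report.pdf", "fileSize": "51200"},
    {"title": "My Spreadsheet"},           # native doc: no fileSize
    {"title": "empty.bin", "fileSize": "0"},
]
```

Zero-byte genuine files get dropped too, which is the trade-off for not hard-coding the exclusion list.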

Anyway, we now have the metadata for the requested path and all its children, so we can return this back to the caller. It takes N+2 HTTP requests to get all the data we need, where N is the number of "inner" path components. This is hardly ideal, but it works well enough, is probably fast enough for most cases (ie there aren't likely to be very very deep folder hierarchies) and isn't even a lot of code.

So next up is search. Our file picker has a "search within folder" option, which looks for a given file name (or name fragment) within a folder and its subfolders. The subfolders bit is actually a significant problem for us here. Finding matching files within a single folder is pretty easy - it's just a repeat of the above, but the last query gets an additional title contains filter.

Deep search is far more difficult. The obvious approach (and the one I started implementing) is to drill down to the given path, then do a search in that folder for files matching title contains, plus any subfolders. And then loop through the subfolders, repeating as we go. The slightly more refined version of that is to drill through the folders, collecting their IDs, then constructing a single filter for the files of the form:

title = 'bananas' and (
    'root123456' in parents or
    'foo123456' in parents or
    '...' in parents
)
Gotcha 4: You can group terms with parentheses. This is not documented.

The trouble here is that this is potentially unbounded. We don't know how deep the hierarchy goes, or how many branches it has. It wouldn't be a problem if each request was negligible (as it often is with a local filesystem with metadata hot in a memory cache), but here it's hugely expensive in a deep hierarchy. As noted above, the regular metadata lookup suffers this too, but to a lesser degree, as it only ever goes down one branch of the tree.

This is where I got to when I left the office today. The approach I'm likely to take is to request a list of all folders (only), assemble an in-memory hierarchy, drill down to the folder I want, collect all the IDs and then perform the query. So it actually only becomes two requests, though potentially with a lot of metadata returned on the full folder list.
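The two-request plan can be sketched as follows. All IDs and helper names are invented, and I'm modelling a single parent per folder for simplicity (real Drive files can have several parents):

```python
from collections import defaultdict

def descendant_ids(folders, root_id, path):
    """Given one flat listing of all folders, walk down `path` in
    memory and collect every folder ID at or under it."""
    children = defaultdict(list)           # parent ID -> [(id, title)]
    for f in folders:
        children[f["parent"]].append((f["id"], f["title"]))

    current = root_id                       # drill down without HTTP calls
    for part in path.strip("/").split("/"):
        match = [cid for cid, title in children[current] if title == part]
        if not match:
            return []
        current = match[0]

    ids, stack = [], [current]              # collect the whole subtree
    while stack:
        fid = stack.pop()
        ids.append(fid)
        stack.extend(cid for cid, _ in children[fid])
    return ids

def subtree_query(ids, title):
    """Build the single search filter over all collected folder IDs."""
    parents = " or ".join(f"'{fid}' in parents" for fid in ids)
    return f"title = '{title}' and ({parents})"

# Hypothetical flat folder listing (request 1 of 2)
folders = [
    {"id": "foo123456", "title": "foo", "parent": "root123456"},
    {"id": "bar123456", "title": "bar", "parent": "foo123456"},
    {"id": "baz123456", "title": "baz", "parent": "bar123456"},
    {"id": "qux123456", "title": "qux", "parent": "baz123456"},
]
ids = descendant_ids(folders, "root123456", "/foo/bar/baz")
query = subtree_query(ids, "bananas")       # request 2 of 2 would use this
```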

And from there I guess the metadata lookup becomes the same thing really.

And I suppose if I was in the mood to cache things I could cache the folder list by its etag, and do a freshness test instead of the full lookup.

But mostly I'm at the point of "why?". I recognise that at Google, search is king, and explicit collections like folders are always implemented in terms of search (see also tags in Gmail). But folder/path-based filesystems are the way most things work. We've been doing it that way forever. Not that we shouldn't be willing to change and try new things, but surely it's not hard to see that an application might want to take a traditional path-based approach to accessing its files?

I'm doubly annoyed because Google is supposed to be far ahead of the pack in anything to do with search, yet I cannot construct a query that will do the equivalent of a subfolder search. Why can the API not know anything about paths, even in a light way? It's clearly not verboten, because parent and child pointers exist, which means a hierarchy is a valid thing. Why is there no method or even a pseudo-search term that does things with paths? Wouldn't it be lovely to use a query like:

path = '/foo/bar/baz'

to get a file explicitly? Or even cooler, to do a subfolder search:

path startswith '/foo/bar/baz' and title = 'bananas'

Instead though I'm left to get a list of all the files and do all the work myself. And that's dumb.

I'll finish this off, and I'll do whatever I have to do to make a great experience for our users, because that's who I'm serving here. It would just be nice to not have to jump through hoops to do it.