friday, 22 march 2013

posted at 17:50

Over the last couple of weeks I've been working on integrating remote filestores into FastMail. We've had our own online file storage facility for years (long before it was cool), and you've always been able to attach a file in your store to an email and save an email attachment to your store. We've been extending that to allow you to use a "cloud" file storage service in exactly the same way.

Our file storage facility is fairly simple in concept, and operates around the traditional files-and-folders model that we've used forever. For the first external service to integrate with we chose Dropbox, mostly because it's by far the most widely used, but also because it uses the same model, so it was very easy to create an abstraction and slot Dropbox in behind it. For FastMail subscribers, you can try it out right now on beta. Once we've finished polishing and testing it we'll be releasing it to production. Should only be a couple of weeks away, but don't quote me on that!

When I developed our internal remote filestore abstraction I designed it with the idea that it would be fairly simple to integrate other remote filestores as well. Today I spent a good amount of time working on an integration for Google Drive. I'm mostly doing this to satisfy myself that I have a good abstraction in place, but of course Google is no small fish and I think that it would be wonderful if we could make this available to users as well.

This has not been a simple undertaking. To see why, let's talk a bit about the architecture of our client.

One of the features of our internal API (the AJAX stuff that our client uses) is that it is completely stateless. This has been done deliberately, as it makes it very easy to scale our backend servers. Obviously state can be held (there is a very nice database available all the time), but the API itself has no real concept of state, so it's tough to know what data to store and when to expire it. So to build anything we start by assuming it will be stateless.

Our attachment system is very simple. There is a file picker that requests metadata for a given path (a standard /foo/bar/baz construction). It gets back name, full path, timestamp and type for the requested folder and its immediate children. When the user selects a folder, a new metadata request comes in for that folder. The server does not care what's gone on before; it just turns paths into lists of metadata. Later, to actually attach a file, we call a different method with the path of the wanted file, and the file data comes back. Like I said, very simple.
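
To make the shape concrete, here's a toy sketch of that stateless lookup against an in-memory tree. The field names are illustrative only, not our actual API:

```python
# A minimal sketch of the stateless metadata lookup, using a toy
# in-memory store; names and fields are illustrative, not FastMail's
# real internal API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    path: str
    timestamp: int
    type: str
    children: List["Node"] = field(default_factory=list)

def get_metadata(root: Node, path: str) -> dict:
    """Resolve a /foo/bar/baz path and return metadata for that folder
    and its immediate children. No per-session state is kept: every
    request carries the full path."""
    node = root
    for part in filter(None, path.split("/")):
        node = next(c for c in node.children if c.name == part)
    entry = lambda n: {"name": n.name, "path": n.path,
                       "timestamp": n.timestamp, "type": n.type}
    result = entry(node)
    result["children"] = [entry(c) for c in node.children]
    return result
```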

So back to Drive. The major reason for it being an utter pain in the backside to integrate is that the API itself has no concept of folders or paths. Now anyone using the Drive web interface will know that it has folders. This is actually something of a lie. A Drive is just a giant pool of files with various properties that you can query on. A file can have "parent" and "child" pointers to other files, which allows a loose hierarchy of files to be constructed. A folder is simply a zero-size file with a special type (application/vnd.google-apps.folder) and appropriate parent and child pointers.

Every file has a unique and opaque ID, unrelated to the file's name. These IDs are what's used in the parent and child pointers. There's no way for us to construct the ID of a file from a path. To find the metadata for our file, we have to follow parent and child pointers around.

So let's say we want to get the metadata for the folder /foo/bar/baz and the files inside it. We start off by getting the metadata for the root "folder", helpfully called root (gotta start somewhere). Along with all the info about that root folder we get back its ID. Let's say its ID is 'root123456' (it won't be; it's opaque and apparently random, but this will do for our purposes).

Now we have to find foo. We request the file list, with some search filters (normally all on one line, presented here with newlines for readability):

'root123456' in parents and
title = 'foo' and
mimeType = 'application/vnd.google-apps.folder'

Gotcha 1: deleted ("trashed") and hidden files are returned by default. We don't want those. So actually the filter is:

'root123456' in parents and
title = 'foo' and
mimeType = 'application/vnd.google-apps.folder' and
trashed = false and
hidden = false

Gotcha 2: this query goes into the q= parameter of a GET request; however, it needs to be form-encoded rather than using the standard URI percent-encoding.

(Neither of these gotchas is documented. Good luck.)
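
Putting the two gotchas together, building the filter looks something like this sketch (a rough illustration of the behaviour described above, not our production code):

```python
# Sketch of building the q= filter for the folder lookup. Python's
# urlencode() does form-encoding (spaces become '+'), which is what
# gotcha 2 calls for. The folder MIME type is Drive's documented
# 'application/vnd.google-apps.folder'.
from urllib.parse import urlencode

def folder_query(parent_id: str, title: str) -> str:
    q = (f"'{parent_id}' in parents and "
         f"title = '{title}' and "
         f"mimeType = 'application/vnd.google-apps.folder' and "
         f"trashed = false and hidden = false")
    # Form-encode the whole thing for use as the q= GET parameter.
    return urlencode({"q": q})
```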

Assuming it exists, we'll get back a "list" containing one item. I'm not actually sure if two items with the same name and type can exist. Probably, so for now I return "not found" if I don't get exactly one result. That's an implementation detail though, and it might change.

So now we have our foo metadata, we can get its ID and then repeat the process for bar, and so on.
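
The drill-down loop can be sketched like so, with a hypothetical find_folder() standing in for the filtered file-list call described above:

```python
# Sketch of resolving /foo/bar/baz one component at a time.
# find_folder(parent_id, title) is a hypothetical wrapper around the
# filtered files.list request, returning the matching folder's ID or
# None; each component costs one HTTP round-trip.
def resolve_path(find_folder, path: str) -> str:
    folder_id = "root"  # Drive accepts the literal alias 'root'
    for part in filter(None, path.split("/")):
        folder_id = find_folder(parent_id=folder_id, title=part)
        if folder_id is None:
            raise FileNotFoundError(path)
    return folder_id
```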

Each of these requests is a separate HTTP call. They're stateless, so various performance tricks can be utilised (keepalives, etc). My servers are on good networks so it's not that slow, but it's still a lot of round-trips.

Once we drill down that far, we do a final request for a file list with the same filter, this time leaving off the title and mimeType terms (we want everything):

'baz123456' in parents and
trashed = false and
hidden = false

Gotcha 3: this will return Google documents, spreadsheets, presentations and the like. These are identifiable by MIME type, and are not downloadable (because they're internal application magic). Their metadata has various URLs for converting to eg Word documents, but these aren't really appropriate for our use. We'd like to filter them out. Unfortunately that means excluding a specific set of MIME types in the filter:

'baz123456' in parents and
mimeType != 'application/vnd.google-apps.document' and
mimeType != 'application/vnd.google-apps.spreadsheet' and
mimeType != '...' and
trashed = false and
hidden = false

That sucks because you have to encode the full list of exclusions right there in the query, and so you have to update it when Google adds a new type of document. Instead I've opted to drop anything with a zero size, but there's no size search term available, so I've got to pull the lot and then filter.
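
The client-side filter is trivial. A sketch, assuming the (v2-era) metadata where downloadable files carry a fileSize field and Google-native documents don't:

```python
# Drop Google-native documents from a file listing by keeping only
# entries with a nonzero size. Assumes Drive's metadata reports size
# in a 'fileSize' field that is absent for internal document types.
def downloadable(entries):
    return [e for e in entries if int(e.get("fileSize", 0)) > 0]
```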

Anyway, we now have the metadata for the requested path and all its children, so we can return this back to the caller. It takes N+2 HTTP requests to get all the data we need, where N is the number of "inner" path components. This is hardly ideal, but it works well enough, is probably fast enough for most cases (ie there aren't likely to be very very deep folder hierarchies) and isn't even a lot of code.

So next up is search. Our file picker has a "search within folder" option, which looks for a given file name (or name fragment) within a folder and its subfolders. The subfolders bit is actually a significant problem for us here. Finding matching files within a single folder is pretty easy - it's just a repeat of the above, but the last query gets an additional "title contains" filter.

Deep search is far more difficult. The obvious approach (and the one I started implementing) is to drill down to the given path, then do a search for files whose title contains the fragment, plus any subfolders. And then loop through the folders, repeating as we go. The slightly more refined version of that is to drill through the folders, collecting their IDs, then constructing a single filter for the files of the form:

title = 'bananas' and (
    'root123456' in parents or
    'foo123456' in parents or
    '...' in parents
)
Gotcha 4: You can group terms with parentheses. This is not documented.

The trouble here is that this is potentially unbounded. We don't know how deep the hierarchy goes, or how many branches it has. It wouldn't be a problem if each request was negligible (as it often is with a local filesystem with metadata hot in a memory cache), but here it's hugely expensive in a deep hierarchy. As noted above, the regular metadata lookup suffers from this too, but to a lesser degree, as it only ever goes down one branch of the tree.

This is where I got to when I left the office today. The approach I'm likely to take is to request a list of all folders (only), assemble an in-memory hierarchy, drill down to the folder I want, collect all the IDs and then perform the query. So it actually only becomes two requests, though potentially with a lot of metadata returned on the full folder list.
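
The subtree-collection half of that plan is straightforward once the full folder list is in hand. A sketch, with the listing reduced to (id, parent_id) pairs for illustration:

```python
# Given the one-shot listing of all folders as (id, parent_id) pairs,
# assemble the hierarchy in memory and collect the IDs of a subtree.
# The result feeds a single "X in parents or Y in parents ..." filter.
from collections import defaultdict

def descendant_ids(folders, start_id):
    children = defaultdict(list)
    for fid, parent in folders:
        children[parent].append(fid)
    # Walk the subtree rooted at start_id, gathering every folder ID.
    ids, stack = [], [start_id]
    while stack:
        fid = stack.pop()
        ids.append(fid)
        stack.extend(children[fid])
    return ids
```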

And from there I guess the metadata lookup becomes the same thing really.

And I suppose if I was in the mood to cache things I could cache the folder list by its etag, and do a freshness test instead of the full lookup.
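
That freshness test is just a standard HTTP conditional GET. A sketch, assuming a cache keyed on the folder-list request and a server that answers 304 for a matching etag:

```python
# Sketch of etag-based caching of the folder list: send the cached
# etag as If-None-Match, and reuse the cached body on a 304 response.
# http_get(headers) is a hypothetical transport returning
# (status, etag, body); the cache is a plain dict.
def fetch_folder_list(http_get, cache):
    cached = cache.get("folders")
    headers = {"If-None-Match": cached["etag"]} if cached else {}
    status, etag, body = http_get(headers)
    if status == 304:  # unchanged: skip the full lookup
        return cached["body"]
    cache["folders"] = {"etag": etag, "body": body}
    return body
```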

But mostly I'm at the point of "why?". I recognise that at Google, search is king, and explicit collections like folders are always implemented in terms of search (see also labels in Gmail). But folder/path-based filesystems are the way most things work. We've been doing it that way forever. Not that we shouldn't be willing to change and try new things, but surely it's not hard to see that an application might want to take a traditional path-based approach to accessing its files?

I'm doubly annoyed because Google is supposed to be far ahead of the pack in anything to do with search, yet I cannot construct a query that will do the equivalent of a subfolder search. Why can the API not know anything about paths, even in a light way? It's clearly not verboten, because parent and child pointers exist, which means a hierarchy is a valid thing. Why is there no method or even a pseudo-search term that does things with paths? Wouldn't it be lovely to use a query like:

path = '/foo/bar/baz'

to get a file explicitly? Or even cooler, to do a subfolder search:

path startswith '/foo/bar/baz' and title = 'bananas'

Instead though I'm left to get a list of all the files and do all the work myself. And that's dumb.

I'll finish this off, and I'll do whatever I have to do to make a great experience for our users, because that's who I'm serving here. It would just be nice to not have to jump through hoops to do it.

tuesday, 24 july 2012

posted at 03:21

First day in the office today, and all is going well. I'll write more about that and post some pics later, but for now let's talk about that most mundane of tasks: grocery shopping.

the least-weird products available

For the non-Norwegian in Norway, at least two problems present themselves (I say "at least" because I created more problems of my own):

  1. Pretty much every single word on every single packet is in a language I don't understand.
  2. The prices are all in a different currency on a different scale, making comparisons quite difficult.

I went in thinking I needed to at least grab sugar (for my coffee), shampoo, and something for breakfast and dinner for the next couple of nights. I made a quick spaghetti bolognese last night and saved the leftovers, but I undercooked the spaghetti so I'm not keen on reheating it. Besides, I probably can't live off that for four weeks.

Most of this is what you'd expect, though I hated having to buy 1kg packets of salt and sugar and 1.5L of oil. Maybe I should bake a cake?

The most difficult/controversial item here is probably the butter ("Meierismør"). At home I eat spreadable butter, which is butter with a little oil mixed in to make it softer at room temperature. I don't like the taste of margarine. Most of the products on offer looked to be margarine, which I managed to infer from either the packet having "marg" on it somewhere or the ingredients list having more than a couple of things (ie butter, salt, preservative, that sort of thing). Anyway it looks like I lucked out but man, it took a lot of study to finally select this one. There were a few other things like that, but none quite this tough.

The other crazy thing is the prices. 1AUD is worth about 5-6NOK, depending who you ask. What you see here cost 373,10Kr. According to my bank I paid $59.06. That doesn't seem too outrageous considering there's meat and fish in there. So that's ok. When considering the difference between two products it's really hard to suppress my normal instinct about what's cheap, what's expensive and how two prices compare. Consider the shampoo. I paid 21,90Kr, which works out to about $3.50. Prices for shampoo ranged from 15-40Kr. So my warning bells go off with "holy crap, $22 for shampoo and a range of $25? wtf?" when in practice the range is around $2.50-6.50, which is pretty much spot on.

Other things you can do to make your shopping trip harder than it needs to be:

  1. Go in having no particular idea of what it is you need (this applies in Melbourne too).
  2. Go to the supermarket at the work end of the train ride, so you have to public transport your loot all the way home.
  3. Forget that this is Europe where they make you pack your own stuff and don't just give you a bag for every two items.

I was also lucky enough to find a cashier that couldn't (or wouldn't) speak English. Fortunately buying groceries is pretty much the same wherever you go.

Now, dinner!

sunday, 22 july 2012

posted at 01:46

So here I am at Brussels Airport. It's growing on me, but only because I'm sitting up in the departure lounge. The arrivals area is a complete hole. I can't decide if that's the wrong way around - on the one hand, you want to make a good first impression (failed) but on the other you want people to leave with a fond memory (succeeded). It might be a combination of being awake for the best part of the last 24 hours and the spaghetti, chips and beer talking. It's probably not important.

Anyway, the backstory for the uninitiated is that I now work for Opera Software, working from the Melbourne office. I meant to write something about leaving Monash and starting at Opera but hadn't got around to it yet. Maybe next time. In any case, I've been there a couple of months, love it, and now it's time to take the pilgrimage to the head office in Oslo, Norway. They seem to like everyone to visit overseas offices semi-frequently, and I have double reason to go in that the fine fellow that I was hired to partially replace lives and works there. It will be extremely useful to be able to get an answer in 30 seconds instead of waiting a whole day and night for an email round trip. It's a good thing!

Of course, I have to leave my dear family, a fact that none of us are particularly happy about. The timing isn't great either because everyone got sick. Francesca has some infection messing her up, so much so that she cried all the way to Coburg where we dumped her at her grandma's house and continued on to the airport. Wife tells me she's getting sick too, which is only going to make things harder for her - my children are delightful but are also about eight full time jobs to take care of. My wife is an amazing woman. I will likely write her an email telling her so very soon and I will likely have to study ancient languages to find words that go even some small way towards communicating just how magnificent she is.

Beth cried at the airport and wouldn't let me go. I'm told that Penny looked after her as well as a three-year-old can - touching her a lot and telling her "don't be sad, it's ok". I love my family so much! I'm fairly fried right now but I think once I've slept in a bed and had a moment to think I'll really start to miss them.

So the flights. I left Melbourne at around 10pm for a 14-hour flight to Abu Dhabi. Etihad is a fantastic airline - the seats were comfortable, the food was very good and they couldn't do enough to look after you. I got myself signed up for the "Etihad Guest" program which among other things gets me a 5kg increase in my baggage allowance. So between that and freeing up another 5kg by not having to carry engine parts with me on the return flight (help a friend out: buy an oil cooler in Australia, chuck it in your bag and take it to Norway for him), I should have no excuse not to bring heavy presents back with me.

I slept fitfully for a few hours which appears to have been enough to keep me alive. Either that or I'm just used to functioning on very little sleep. Most of the rest of the time was spent reading and hacking. Same old story with me.

Abu Dhabi is bizarre. The sun was up at perhaps 5am. We landed at about 6.30am and already it was over 30 C and humid. The airport is fully airconditioned of course but it never quite felt comfortable. You could tell that it was overworked. And the land is brown, so brown. It really is the middle of a desert. Seems a very strange place to attempt to eke out an existence, let alone build a throbbing metropolis, but I guess if it's your home then you know what you're doing.

The terminal was large and modern, with free internet that everyone used and as such I couldn't get near. And people everywhere, so many people. And surprisingly good coffee! And everyone spoke excellent English, so really it wasn't so hard to tell that I wasn't in Melbourne. Only a very short stop though, so I didn't really have much time to see anything.

The flight to Brussels was more of the same, and I'd gotten used to it by now. More book (finished it), more hacking (taking apart Pioneer's starsystem generator, making great progress there). Not really a whole lot to add except to say that I glanced up every now and again at the movie "Deep Impact", and it looks to be equal parts terrible and awesome. I really am tired.

So now I'm in Brussels. As noted, it's pretty crap. I'm reliably informed that this extends to the city and probably the country as a whole, so I guess I should give them points for consistency. Got myself a good grilling by the immigration official (because the airport is so backwards that you have to go through customs twice even if you're just transferring), then had to talk to a bored Brussels Airlines representative to a) figure out how to get a boarding pass and b) figure out where my luggage went. After twenty minutes and a good amount of "merci", "pardon" and "je ne comprends pas" (they didn't seem keen on speaking English, and that's about the extent of my French), I managed to get it all sorted out.

Anyway I've managed to find my way up to the departure lounge, which is actually quite nice. Quiet, stolen power available, internet paid for (when you haven't had contact with the outside world for a while, 20EUR does not seem a lot of money for four hours). Another ~20EUR got me a reasonable spaghetti bolognaise, a beer (Leffe Blond, passable) and a little can of Pringles (ahh globalisation, good wherever you go). It's not terrible, but even if it was I think my standards at this point are too low to notice.

Just had half a chat with my lady via Skype. I say half because we had no sound in either direction and only my video seemed to work, so the actual conversation was conducted via Google Talk. But that was nice just to check in. We'll have to get the technology fixed properly before we try it again.

So I'm a couple of hours away from my flight, which should get in to Oslo just after 10pm local. From there I collect my things and catch the train to Oslo S. I'm hoping to find an English-speaking human there that can a) sell me a map and b) direct me to my apartment. The office there is closed but I'm told the nearby 7-11 has been informed of my arrival and is holding the key for me. And after that it should be as simple as shower and bed. And then I get to spend the remainder of Sunday (after I wake) figuring out where the hell I am and how to get to work so that I can be there spritely and on-time on Monday morning.

While I'm sure visiting another part of the world is great, getting there is a complete pain in the arse. Someone hurry up and invent a transporter. Please.