wednesday, 12 december 2007

posted at 13:56
  • mood: elvish

I wrote a simple launcher for WebKit that creates a WebCore::Page, attaches it to a WebCore::Frame, then tries to load the Google homepage with it. Unsurprisingly, it crashed when I ran it, as most of my factory methods just return NULL. I fired up the debugger and found that the crash was coming from FrameLoaderClient::createDocumentLoader, one of my factory methods. Curiously, this function calls notImplemented(), and so should have printed something to the console. A little poking revealed that I had done a release build, not a debug build, so I recompiled with --debug.

The resulting binary was almost three times the size, up around 300MB, which makes sense because it's now carrying almost the entire source code for debugging as well. I had to start AROS with -m 512 to give it enough memory to actually load the thing. I started AROS, opened a shell, started ArosLauncher, and then the amazing fireworks began.

On my debug console, I got a line of output:

[LoadSeg] Failed to load 'ArosLauncher'

That's a problem - LoadSeg() is the program loader/linker. More exciting, though, was the line after line of pure binary appearing in my AROS shell. Do something like cat /bin/ls to see what I mean.

My first thought was that the awesome size of the binary was trampling something in memory, but a bit of poking around revealed the answer. When you type a command into the shell, it tries to load it as an executable file. If that fails, it checks if the file has the script flag enabled. If it does, it calls C:Execute with the file as an argument. Execute is the script runner, and it simply feeds the contents of the file into the shell's input buffer to be executed as though the commands were being typed.
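Roughly, the shell's decision looks like the sketch below. This is not the actual AROS shell code: the enum, the function name choose_action and the SKETCH_FIBF_SCRIPT constant are made up for illustration, and the bit value is only assumed to mirror the real script protection bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the AROS script protection bit. */
#define SKETCH_FIBF_SCRIPT (1u << 6)

/* What the shell does with a command name (illustrative only). */
typedef enum {
    RUN_BINARY,            /* LoadSeg() succeeded: run the loaded code   */
    RUN_VIA_EXECUTE,       /* not loadable, but flagged as a script      */
    REPORT_NOT_EXECUTABLE  /* not loadable and not flagged as a script   */
} ShellAction;

/* Decide what to do, given whether the loader accepted the file and
 * the file's protection bits. */
ShellAction choose_action(bool loadseg_ok, uint32_t protection)
{
    if (loadseg_ok)
        return RUN_BINARY;

    /* The loader refused the file; fall back to the script runner
     * only if the script flag is set. */
    if (protection & SKETCH_FIBF_SCRIPT)
        return RUN_VIA_EXECUTE;

    return REPORT_NOT_EXECUTABLE;
}
```

A file that fails to load but carries the script flag ends up in the second branch, which is how a binary can get fed to the shell as if it were a script.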

Execute doesn't have any smarts to determine whether what it's being passed is really a script; that would be a useful feature for it to have. The real issue, though, is that the ArosLauncher binary had the script flag set. I never set it, and it shouldn't have been set.

Closer inspection revealed that the hosted filesystem driver, which maps Unix file permissions to AROS file permissions, was setting the script flag for every file without exception. That was perhaps a reasonable choice at the time it was written. Unix does not have a script flag or anything similar, so it wouldn't have been immediately obvious what to map it to, and it was never used in AROS anyway until recently (the shell gained support for testing for it and calling Execute a couple of weeks ago). Clearly, though, it's not right, so I had to do something. I modified the permissions mapping code in emul.handler to map the AROS script flag to the Unix "sticky" (t) permission bit. I also implemented FSA_SET_PROTECT at the same time, so now typing protect +s file in AROS achieves the same as chmod +t file in Unix, and vice versa.

So with that fix in hand, I reran ArosLauncher and got a far simpler error:

ArosLauncher: file is not executable

So the next step was to dig into LoadSeg() and find out why it couldn't load the file.

A tiny bit of background: any program, library or other "executable" thing under AROS (and most Unix systems) is stored in a format called ELF. It is split into a number of "sections", each of which contains some information. It might be program code, data, symbol names or debugging info; there are lots of different types. It's up to the OS loader/linker to pull all these together into a runnable program.

So, with the ELF specs in hand I started stepping through the loader code, and quickly found the problem. When you compile something with debugging information, the compiler adds many extra sections to the binary object, containing what amounts to the entire source code for the program, so the debugger can give you the proper context and so on. Because ArosLauncher includes all of WebKit, ICU, cURL, libxml and SQLite, it has a lot of sections. Somewhere in the order of 75000, in fact.

The field in the ELF header that stores the count of sections (e_shnum) is a 16-bit field, which means it can count up to ~65000. Clearly there are too many sections in the file to fit. In this case, the count in the ELF header is set to 0, and the loader should read the first section header (index 0). In there is the real count, in a 32-bit field that is normally used for something else (the section size, sh_size) but is borrowed just for this special case.
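In code, the rule comes out something like the sketch below. It is not the AROS loader: the struct names Ehdr32 and Shdr32 and the function real_section_count are made up here, though the field names and layout follow the 32-bit ELF specification (e_shnum is the 16-bit count, e_shoff the file offset of the section header table, sh_size the borrowed field). It assumes the caller has already read section header 0 from the file.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal local copy of the 32-bit ELF file header. */
typedef struct {
    unsigned char e_ident[16];
    uint16_t e_type;
    uint16_t e_machine;
    uint32_t e_version;
    uint32_t e_entry;
    uint32_t e_phoff;
    uint32_t e_shoff;      /* file offset of the section header table */
    uint32_t e_flags;
    uint16_t e_ehsize;
    uint16_t e_phentsize;
    uint16_t e_phnum;
    uint16_t e_shentsize;
    uint16_t e_shnum;      /* section count, or 0 if it does not fit  */
    uint16_t e_shstrndx;
} Ehdr32;

/* Minimal local copy of a 32-bit ELF section header. */
typedef struct {
    uint32_t sh_name;
    uint32_t sh_type;
    uint32_t sh_flags;
    uint32_t sh_addr;
    uint32_t sh_offset;
    uint32_t sh_size;      /* in header 0: the real section count     */
    uint32_t sh_link;
    uint32_t sh_info;
    uint32_t sh_addralign;
    uint32_t sh_entsize;
} Shdr32;

/* Return the number of section headers in the file. 'first' is the
 * section header at index 0, or NULL if it could not be read. */
uint32_t real_section_count(const Ehdr32 *eh, const Shdr32 *first)
{
    if (eh->e_shnum != 0)
        return eh->e_shnum;       /* the ordinary case */

    /* A count of 0 with no section header table means the file
     * really has no sections. */
    if (eh->e_shoff == 0 || first == NULL)
        return 0;

    /* Extended numbering: the count lives in header 0's sh_size. */
    return first->sh_size;
}
```

The loader has to make this check before it sizes any buffer for the section headers, because e_shnum alone says the table is empty.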

So I implemented this, and it works - it finds the headers correctly and does the relocations as it should. It's still not at the point where it will run ArosLauncher. It would appear that there's a symbol type that the AROS loader doesn't know about and is treating as invalid, rather than handling or ignoring it. I'm not sure what's appropriate yet; I'll take more of a look on my bus ride home today.

More todo items: there are three ELF loaders in AROS currently: elf, elf64 and elf_aros. elf is the main one that I'm working on, elf64 is a copy of it taken recently with support for 64-bit symbols, and elf_aros is an old one; I have no idea what it's for or where it came from. I have no desire to make my modifications in three files, particularly when I have no 64-bit system to test on, so I'm going to look at trying to merge these three files back together.