The bug that went undetected for over a year…

Yesterday I announced that:

Just fixed a bug which has existed since the first week that I wrote FlingOS – so excited!!

I promised a blog post would follow, so here is that blog post – it’s going to be a good one. So sit back, relax and read on as I take you through the history of this bug.

When I first started writing FlingOS I had somewhere between no idea and less than no idea what I was doing. Seriously, all I knew was the structure of a C# to native compiler – I had worked on Cosmos for about 5 months but for various reasons I decided I wanted to write my own C# OS, with a different structure. I had my own compiler working relatively quickly and the next step was to implement basic features that would underpin the entire OS.

It’s probably reasonable to say that if something is going to underpin your entire system, you want to get it right, lest it have any nasty, invisible side effects later on. One such underpinning component was the Heap. Ahh the Heap… The heap implementation has caused much confusion and difficulty. This is not because a heap is difficult to understand or use, it’s just fiddly and crops up all over the place such that slight errors kill the entire OS. Some such errors have included not using spin locks (after I introduced multi-threading a few months back) and not allocating the heap enough space. (The heap space use to be allocated in the .TEXT section of the code! 120MiB .ISO files ;D With the new drivers compiler I shifted it to .BSS where it should be.)

When I first looked at implementing a heap, I wasn’t too interested in the internal workings. I knew what a heap did, I knew there were lots of ways of implementing them, each with pros and cons. I just wanted something simple and easy that I could use. Of course, FlingOS being one of only three active C# operating systems worldwide, there aren’t many (if any) suitable heap implementations floating around. However, there are lots of samples in C. The one I lumped for was a simple one from OSDev.

Over time I have adapted and updated the implementation to add things like allocation on or avoiding a boundary (as required by USB). The main Alloc function, however, has remained the same. Entirely the same. And this is where the issue lies. Early on in my development I found that with a 10MiB heap, my OS regularly seemed to run out of allocatable memory resulting in seemingly random page faults. My somewhat ignorant solution at the time was? : Make the heap 100MiB! This seemed to fix the problem, even though 90% of the heap was never allocated.

As time has gone on, page faults have appeared and disappeared sporadically until recently when I started trying to read large files from USB sticks. This used a lot of heap memory and suddenly the page faults were happening all the time. Not having really had to deal with page faults before I had no idea what was causing it. My instinct said that a page fault is due to unmapped memory, so something must be allocating an invalid pointer. But the heap wasn’t anywhere near out of memory. Unfortunately, with so many compiler changes going on in the past few months, there was no consistency to the issue. Even in the past few days the faulting address (or even instruction address) was not reproducible. It was seemingly random.

I investigated everything from interrupts to memory leaks to who knows what trying to trace this. Eventually I realised I wasn’t going to be able to unless I had a better view of two things:

  1. The sequence of events which lead to the page fault (and subsequent crash)
  2. The layout of all memory so I could see where things might be going wrong

For point (1) I had previously just outputted stuff to the screen through FlingOS’s BasicConsole class. But this wasn’t enough. I needed traceability and the ability to go back more than  a screen’s worth of information. So I implemented a new Serial class and hooked into the BasicConsole to redirect the output to a file on my host machine. As it turns out, I never needed to implement anything for point (2) – I realised what was going on before I got that far.

I realised the issue was to do with USB code, so I turned on all the trace code for the USB stack and inspected the output. The output contained the key piece of information : virtual and physical addresses of the memory the heap was allocating. They were invalid. They were valid addresses, but they didn’t fit inside the heap’s allotted block of memory! They were overrunning the end of the heap. I’d found my one, consistent piece of information.

Naturally I went looking for what could be overwritten by code writing to space beyond the end of the heap. Sure enough, immediately the heap memory were the unprotected page tables. And because USB uses physical addresses, there was no way I could have protected the page tables. I tested by allocating padding space between the page tables and the heap – sure enough the code took far longer to crash.

So the issue was in the heap. But the OSDev implementation looked pretty solid and many people had tested it. So I must’ve converted the code incorrectly to C#.  In the 18 months I’ve been doing low-level programming my understanding of pointers and pointer manipulation has grown – a lot. I went back to look at the heap in detail and found this spurious line of code:

void* result = (void*)(x * b->bsize + (UInt32*)(&b[1]));

This line of code is supposed to calculate the address of a block (x) with block size (b->bsize) and offset from the start of the heap (&b[1]). For those who understand pointer arithmetic they will easily see my mistake from when I converted the code – UInt32* will result in the pointer being multiplied by 4 (because UInt32 is 4 bytes in size). In my defence, however, the original line of code was thus:

return (void*)(x * b->bsize + (uintptr)&b[1]);

I asked a number of experienced programmers what they would expect uintptr to be a type for and all but one said: pointer to a uint (i.e. UInt32*). Only one was able to give me the actual definition (as per C99, which I found online) which is:

In C99, uintptr is “an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer”.

Great…so uintptr is not UInt32* it is in fact just UInt32 in C#.

Here’s a few caveats for those pedantic types amongst you:

  • Yes, pointers can be 8, 16, 32 or 64-bit. So it shouldn’t be UInt32 it should be some other form of UInt that allows agnostic size. However, FlingOS is entirely a 32-bit OS and C# doesn’t have a useful equivalent for uintptr (see next point).
  • Yes, C# has an IntPtr type – but it is intended for managed pointers and doesn’t play easily with most of the low-level code I’m trying to write for FlingOS. I may start using it in future, I may not. It might need some compiler updates to support it.

Welcome!

Welcome to the Fling OS blog! As the first post I thought I’d do a brief description of what the Fling OS blog will contain (and there’s a few things you might not expect). So, here goes…

The Fling OS blog will be frequently updated with short articles from myself, Ed Nutting, and future developers including regular updates from Roland Baranyi, this year’s summer intern! The blog won’t just be covering OS development though. There’ll be all sorts of information including all the following topics:

  • OS dev,
  • pieces from the Summer Interns,
  • the website,
  • documentation, articles and tutorials,
  • sponsors,
  • events across the UK
  • and the occasional rant.

So watch this space for future (more exciting, more interesting) articles here!