A Clash Of Cultures
or: All Code Is Library Code
Jun 22, 2023
Recently, I got into an interesting discussion. At its core was a function I wrote that looks something like this:
-- | Unpacks a ByteString into a temporary buffer and runs the provided 'ST'
-- function on it.
unpackByteStringCStringLen :: ByteString -> (CStringLen -> ST s a) -> ST s a
unpackByteStringCStringLen bs f = do
  let len = BS.length bs
  allocaBytes (len + 1) $ \buf -> do
    unsafeIOToST $ BS.unsafeUseAsCString bs $ \ptr -> do
      copyMem buf ptr (fromIntegral len)
      -- explicitly append a NUL terminator right after the copied payload
      pokeByteOff buf len (0 :: Word8)
    f (buf, len)
Don't get too hung up on the details; the gist is that this function does
the same thing as Data.ByteString.useAsCStringLen, except in ST instead
of IO.
The part that sparked discussion is that it explicitly appends a NUL
terminator, even though the callback takes a CStringLen, which, by
convention, is not NUL-terminated.
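To make that concrete, here is a hedged sketch of two callers (the names readPayload and readPayloadWrong are made up for illustration, and unpackByteStringCStringLen is assumed to be in scope): one respects the explicit length, the other ignores it and scans for a terminator, which, without the extra byte, would read past the end of the buffer.

import Control.Monad.ST (ST)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import Data.ByteString (ByteString)
import Foreign.C.String (peekCString, peekCStringLen)

-- Correct by convention: consult the explicit length.
readPayload :: ByteString -> ST s String
readPayload bs =
  unpackByteStringCStringLen bs $ \cstrLen ->
    unsafeIOToST (peekCStringLen cstrLen)

-- Incorrect by convention: ignore the length and scan for a NUL byte.
-- Without the appended terminator, this would read past the buffer.
readPayloadWrong :: ByteString -> ST s String
readPayloadWrong bs =
  unpackByteStringCStringLen bs $ \(ptr, _len) ->
    unsafeIOToST (peekCString ptr)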
And this is where two cultures clash.
Two Cultures
On one side, there's the Correctness culture. This culture's approach to writing working software is to ensure that all code is correct, and the underlying assumption is that correct code means working software.
This is, in general, a good and solid mindset, because incorrectness tends to stack up, and the only way to prevent that from happening and turning your codebase into a giant heap of "I'm not sure" is to bend over backwards to sustain 100% correctness (even if it's only 100% within certain well-defined boundaries). It's also one of the things that makes Haskell a "productivity superweapon": the language and its ecosystem encourage a coding and architecture style of stacking sound abstractions on top of one another to, paraphrasing Dijkstra, "be precise on a higher level". Fixing unsound abstractions with "sanity wrappers" is not a great strategy, because abstractions tend to leak, and we program against an unsound foundation with a false sense of safety.
On the other side, there's the Failure culture. A core part of its mindset is that "everything is broken". It throws out the idea of "working software" more or less entirely, and instead analyzes systems in terms of threat scenarios - what can go wrong, how likely is that scenario, what would the impact be, what can we do to lower the likelihood, what can we do to reduce the impact. Instead of attempting to prevent failure at all cost, it acknowledges that failures may occur, and tries to put up reasonable defenses to deal with them.
This is also a good and solid mindset, because failures do in fact happen, and it would be naive to assume that we will never have to deal with them. It is particularly important when it comes to information security, because that game is stacked against us - an attacker needs only one hole in the wall, but the defender has to check the entire wall for holes.
These two cultures manifest in all sorts of ways:
- Typed vs. untyped languages
- Static checks vs. live coding
- Proof of correctness vs. elaborate automated tests
- "Coconut" security (focus on hard barriers and trusted cores) vs. in-depth security (focus on redundancy and minimizing trust)
- Principled abstractions vs. convenience macros
- ...
And the members of both cultures often have strong opinions about them.
But the truth is that both also have pretty severe flaws.
The problem with the Correctness approach is that the real world is messy. We build islands of sanity by making assumptions that enable a nice clean world with neat formal constraints, but the Law of Leaky Abstractions still hits us sometimes - if anything, the hardware our code runs on is messy, and subject to all sorts of things that our code cannot capture.
One example is how reasoning about performance is so difficult in Haskell: it's because performance is one of the things we largely ignore when reasoning about soundness - Haskell's type system is based on the assumption that execution time doesn't matter, and that we have infinite amounts of memory at our disposal. This makes the type system a lot simpler, and makes it possible to reason equationally rather than operationally, and that in turn unlocks those powerful abstractions that allow us to command huge complex things with very little code while still being fully confident that it will give us the correct answer. But that only works within the implicit constraint of "...assuming that we don't run out of memory or time". The moment that constraint doesn't hold, we're back in Awful Land, and figuring out why our code is slow or eats up all our RAM is just as difficult as in a typical imperative language, if not worse.
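A standard illustration of this (a textbook example, not from the discussion itself): the two functions below are interchangeable if we reason purely equationally, but operationally they behave very differently on a large list, at least without optimizations coming to the rescue.

import Data.List (foldl')

-- Equationally the same; operationally very different.
sumLazy :: [Int] -> Int
sumLazy = foldl (+) 0
-- ^ builds a chain of unevaluated thunks proportional to the list length,
--   which can exhaust memory before producing any answer

sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0
-- ^ forces the accumulator at every step and runs in constant space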
The problem with the Failure approach is that there's a bit of a fallacy there - just because all code can fail, doesn't mean we have to throw the baby out with the bath water and just assume that we literally cannot be certain about anything. And as a result, reasoning about large codebases written in such a style tends to be hard, if not practically impossible. Just because we cannot guarantee everything, we choose to be unable to guarantee anything.
Now, back to that code sample. The core question that the discussion was about is this:
"Should unpackByteStringCStringLen add a NUL terminator or not?"
The Correctness argument is simple:
No, because CStringLen, by convention, is not NUL-terminated, so it would be incorrect to add a NUL terminator. If anyone uses this function incorrectly, then that's on them.
The Failure argument is also simple:
Yes, because adding the NUL terminator will not change the function's behavior if the caller uses it correctly, but if it's used incorrectly, then the NUL terminator will cause it to be incorrect in a much less harmful way, preventing a buffer overrun bug that could lead to an exploitable vulnerability.
In this particular case, I'm inclined to say that the Failure camp is right,
which is why I added the NUL terminator. Sure, the caller should use the
function correctly, and in a perfect world, they will - but in this
particular case, we cannot enforce this; the strongest tool we have is the
convention that CStringLen means you need to look at the length and not
expect a NUL terminator. But conventions aren't very strong; it's easy to
forget about them, not know about them in the first place, or make an honest
mistake, and in any of these cases, the compiler will not catch it.
If we could have a type that actually guarantees NUL termination, then this
wouldn't be necessary: we would write all functions that expect a NUL
terminator against a type that demands NUL termination; unpackByteStringCStringLen would not return such a type, so the incorrect usage would be rejected by the compiler. But given the state of the Haskell ecosystem, no
such type currently exists, so we can't do that.
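As a hypothetical sketch of what that could look like (none of these names exist in any library; in a real module the constructor would not be exported, so the invariant could only be established by trusted code, and unpackByteStringCStringLen is again assumed to be in scope):

import Control.Monad.ST (ST)
import Data.ByteString (ByteString)
import Foreign.C.String (CStringLen)

-- A CStringLen that is known to carry a NUL byte right after its payload.
newtype NulTerminatedCStringLen = NulTerminatedCStringLen CStringLen

-- A consumer that genuinely relies on the terminator demands the wrapper;
-- handing it the plain CStringLen from unpackByteStringCStringLen would
-- simply not type-check.
consumeTerminated :: NulTerminatedCStringLen -> ST s ()
consumeTerminated (NulTerminatedCStringLen _cstr) = pure ()

-- A trusted producer establishes the invariant by construction - here by
-- reusing the variant above that does write the terminator - and only then
-- wraps the result.
withNulTerminated :: ByteString -> (NulTerminatedCStringLen -> ST s a) -> ST s a
withNulTerminated bs f =
  unpackByteStringCStringLen bs (f . NulTerminatedCStringLen)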
Detour: Hierarchy Of Assertions
In this context, I'd like to postulate a hierarchy of assertions, from strong to weak:
- Absence. Code that does not exist cannot be used incorrectly.
- (Static) types. Type checks are mandatory and automatic, and they automatically propagate.
- Static analysis. Almost the same as type checks, but static analysis is external to the code and thus not mandatory.
- Automated tests. Weaker than static analysis, in that tests can only show the presence of incorrectness, not the absence (see the sketch after this list). Also, unlike static analysis, automated tests do not propagate automatically.
- Naming. This is where we start to fully depend on manual diligence. Good names suggest correct usage, but it is up to a human to interpret the name and ensure that the usage is in fact correct. This is also where the assertions we get can be wrong. An automated test doesn't lie: if the test succeeds, then the code it tests does what the test says it does, but a function name can easily be misleading.
- Documentation. Weaker than names, because documentation is only linked to the code it documents by implicit convention, and does not propagate - changing a name will typically cause errors prompting the programmer to look at usage sites, but changes to documentation will produce no such triggers, and will usually go unnoticed at the usage site.
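To make level 4 concrete, here is a hypothetical QuickCheck property for the function from the beginning (assuming unpackByteStringCStringLen is in scope): it can demonstrate the terminator for the inputs it happens to generate, but it cannot rule out a counterexample, and it will not follow the function to new call sites the way a type would.

import Control.Monad.ST (runST)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import qualified Data.ByteString as BS
import Data.Word (Word8)
import Foreign.Storable (peekByteOff)
import Test.QuickCheck (Property, quickCheck, (===))

-- For any payload, the byte right after it in the temporary buffer is NUL.
prop_terminatorPresent :: [Word8] -> Property
prop_terminatorPresent bytes =
  let bs = BS.pack bytes
      terminator =
        runST (unpackByteStringCStringLen bs (\(ptr, len) ->
          unsafeIOToST (peekByteOff ptr len :: IO Word8)))
  in terminator === 0

main :: IO ()
main = quickCheck prop_terminatorPresent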
Programmer Brains Are A Scarce Resource
So, why are we so concerned with those assertions in the first place? Why can't we just go and make sure all usages of our function are correct?
Well, for starters, it's pretty obvious that we can only do this if we can control those usages in the first place. If our function is a library function, then we can't, because we don't know who will use our library, so in that case, we have to be extra defensive and assume that people will use our code wrong, unless we can make that "impossible" (and a reasonable threshold for "impossible", IMO, is "steps 1 or 2 in the Hierarchy Of Assertions": use types, or simply do not expose the function at all).
In theory that means that if our function is not a library function, then we can resort to trusting usages to be correct, right?
I'm going to argue that no, we can't. And the reason for that is because programmer brains are small; we can only hold so much information in our working memory at any time, and that "so much" is frighteningly little. Estimates include "seven things, ± two", or "about one screenful of code"; it's impossible to come up with a hard metric, but I think these estimates may actually be very generous in a context where "100%" is the goal - after all, programmers regularly fail to spot bugs in code snippets as small as a single line.
Either way, limiting the amount of information that is needed to correctly and reliably reason about code to an absolute minimum is crucial for building reliable software, no matter which camp you're in. And a direct conclusion is that even if our function is not a library function, and we could in principle verify its usages for correctness, we should avoid keeping this information in the programmer's brain: the programmer should not need to think about usages when working with the function itself, and they should not need to think about the implementation of the function when using it. Often, we can achieve this by enforcing "contracts" at the interface boundary, and types are about the best way we have for that.
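A small hypothetical illustration of such a contract at the boundary (both functions below are made up; the point is that the second signature carries the convention, so no caller has to remember it, and tightening the contract forces every existing call site through the compiler exactly once):

import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)

-- Before: the caller has to remember that this Int means bytes, not
-- characters or elements; nothing checks it.
allocaBuffer :: Int -> (Ptr Word8 -> IO a) -> IO a
allocaBuffer = allocaBytes

-- After: the contract lives in the type. Call sites that pass a bare Int
-- stop compiling, get reviewed once, and from then on nobody needs to keep
-- the convention in working memory.
newtype ByteCount = ByteCount Int

allocaBufferChecked :: ByteCount -> (Ptr Word8 -> IO a) -> IO a
allocaBufferChecked (ByteCount n) = allocaBytes n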