I recently checked in some work to the main branch and in certain corner cases asserts would fire, so this topic is fresh in my mind. Though nobody has pointed direct fingers I can't help but feel like I'm being blamed for breaking the build, a message that goes beyond "kill the messenger" to "stop having these naysaying messengers in the first place" - and while not all of the assertions that fired were due to true faults, a couple of them were real logic errors, so I stand by them as mostly good things to have around.
I used to use the term "brittle" to describe good software, in that it would break when somebody coded something wrong, usually with a nice clear callstack. In most of my old code, dereferencing a nullptr wasn't the mistake of the function doing the dereferencing - it was the fault of the code that passed a nullptr in in the first place. After discovering one of these issues I might thread assertions through the code trying to find out when and why that thing was null, and catch the real mistake, frex:
void foo(shared_ptr<Thing> myThing)
// this is how we used to do not_null before we had not_null
assert(myThing != nullptr); // this assert has value because it can catch the problem sooner
… a pile of code here…
// no point in asserting !nullptr here, it'll crash nicely for us anyway
void outerFoo(shared_ptr<Thing> myThing)
assert(myThing != nullptr);
… a pile of code here …
Which was great when working on shrinkwrap or one-version games, and even fine with my recent roblox work because I could turn around a fix in an hour or two if there was a common failure. With Minecraft, where it can take days, weeks, sometimes even months, to get a fix up I learned pretty quickly that my old ways wouldn't fly anymore. We needed to be "robust", not "brittle."
Still, catching bugs early is a good thing, so I like to promote the idea of "brittle and robust" - the example here is we assert that a pointer isn't null, but then we handle the possibility anyway. This is from the old classic Writing Solid Code (though I forget if they called it that.)
void foo(shared_ptr<Thing> myThing)
… a pile of code here…
assert(myThing != nullptr); // brittle in debug/test builds. Maybe it sends telemetry in production
// but, assuming we're not worried player/customer data is corrupt, we handle the case and soldier on, so our players/customers don't hate us:
if( myThing )
// also it allows devs/testers to keep working when they encounter asserts rather than being paralyzed
(I used to be a strong advocate of fatal asserts, which made sense for my small co-located team at the time - you got an assert you could probably go grab the coder and wrote it and say, 'hey, your assert caught a problem, what's it mean?' - but on the ginormous distributed all-over-the-globe team I'm working on now it's simply untenable.)
But I've just realized that the term "brittle" is mostly considered pejorative by engineers these days so I was casting about on Twitter for a better term https://twitter.com/happionlabs/status/1441914044856766465 and the winners were "fault intolerant" and "fault evident" - I could imagine "fault intolerant" meaning it intentionally crashes, which is something you could well want to do if you're worried your MMO state may have become corrupt and you do _not_ want to save player data at that point, whereas "fault evident" might be error messages that get sent to telemetry but we still try to be robust to it, or tolerant of the fault.
Now part of me wants to just make that the new rule of thumb: be fault-evident but fault-tolerant if you're not concerned about data corruption; be fault-intolerant if you are. That would jibe nicely with the dominant philosophy in Writing Solid Code, which was one of my favorite programming books; it leveled me up as a coder overnight.
Writing Solid Code is filled with fault intolerant ideas - for example, they make it a habit to initialize memory to a garbage value in debug builds so you won't accidentally forget to initialize data and accidentally rely on that uninitialized memory being zero - code that works fine most of the time (because uninitialized memory tends to usually be zero). I got into an argument with a fellow coder about it once - he wrote an allocator for his game that initialized everything to zero for him, even in production code. I wanted to change it to the garbage value so I'd 'remember to initialize'. To me, not initializing a variable was a mistake, and I wanted to catch that mistake. But for him, forgetting to initialize a variable to 0 in a heap where it's initialized for you? No longer a mistake. Modern high-level languages tend to do things his way. It would be a misnomer to call his system "fault" tolerant because it was no longer a mistake to forget to initialize variables - it just worked - the system was designed so what used to be a fault was no longer a fault. What should we call that? Fault Forgiving? Let's go with that. If you have a better name let me know.
I can't help but notice that those high-level languages that make us so much more productive than bad old C++ tend to be fault forgiving- forget to initialize variables? It's not a fault, you probably just wanted it to be 0 anyway.. Not strictly controlling object lifetime? It's not a fault, the garbage collector will clean it up when there are no references left, that subsystem that refers to an object we no longer care about can harmlessly act on it for all we care.
But we can't just go crazy and always undefine things as faults. For example, even with garbage collection or smart pointers there are plenty of cases where when we access data that we didn't intend to it's a real logic mistake and we should have been fault intolerant/evident. At which point we can start using our assert() (and maybe expect() macros to be modern) again.
So I'm writing some code. I get to a point where I have a choice of making it fault intolerant, evident, or completely robust. So I ask myself if it's really a fault. A recent example was passing this in closures to callbacks in a network handshake function. I could imagine a case where the user cancels the operation, or halts or suspends the program, and the original object that would be passed in is not there anymore. We could attempt to make the code robust to that by not deallocating the original object until we're sure any callbacks referring to it are dead - getting rid of the "fault". I could also imagine a callback accidentally firing twice: that no doubt would be due to a coder error or mistake, something unintentional, but … do we care? Do we really want to define that as a mistake? If we define these failures as "not mistakes", as "not incorrect", and protect ourselves, perhaps with a shared or weak pointer to the object we're referring back to, our users have a good experience, we have a lower burden on the coders, but maybe we do have some corner cases or states where things have truly gotten bad and we wish that the fault was more evident.
A situation more recently, that shows how subtle the choice can be - we have a situation where the game's behavior is determined by an A/B test. The A/B test choice is determined by a website. I made it fault intolerant at first by defaulting to an invalid choice and asserting if it was invalid when we got there. That's clearly a mistake - we have an expression on our team - "Asserts are not error handling" - and here race conditions, the website being down, the computer being down, could all mean that by the time the code needs to know whether it's doing A or B it's too late, a legit error that can happen in the wild is not being handled robustly. But the problem here isn't so much with the assert; it's with the bad default. So this is a situation that needs fault tolerance/forgiveness.
But right next to that variable for whether the website is telling us A or B is a variable for whether our QA team wants to override the website's choice. That variable is set by another system in the code. It looks something like this:
ChoiceEnum settingFromWebsite = ChoiseEnum::Default; // needs to be fault tolerant
bool localOverride = false;
ChoiceEnum localSetting = …?; // should this be ChoiceEnum::Default or ChoiceEnum::Invalid?
(Really we could eliminate that variable entirely and query the other system, inject it as a dependency and "pull" instead of "push", but, well, we didn't, for expediency. But I digress.)
In code like this -
if( localOverride )
and we get there and localOveride has been set and localSetting hasn't, that is likely a coder mistake (which could maybe have been avoided with std::optional but I digress again) that we want to catch. So, for that situation, this is the right choice:
ChoiceEnum localSetting = ChoiceEnum::Invalid;
if( localOverride )
assert(localSetting != ChoiceEnum::Invalid);
Going back to the original issue that spawned this idea for a post: fault intolerance and fault evidence will often have false failures. We'll assert on something that was actually okay - this goes back to saying "it wasn't really a mistake." But if they at least sometimes catch real problems they can still be worth it. Rust's borrow checker is kind of an example here: we've been living with being able to have different objects both refer to a third object and both make state changes to that third object willy-nilly. Most of the time it's okay. I don't like it, it makes the code hard to read and the system impossible to reason about, and we have to expend a lot of QA resources to test it thoroughly and we still can never be totally sure it functions 100% properly in the wild…but usually it's okay. Rust prevents us from doing that, at compile time, even though most of the time it's fine. Rust is treating a lot of use cases that didn't use to be considered faults as faults! The opposite of fault tolerance/forgiveness. But quite possibly worth it, to protect us from whole classes of possible bugs. (And since these intolerances are at compile time not as much of a burden as an assert or other runtime check.)
And yet another subtle distinction: nondeterministic asserts, ones that pop up occasionally due to race conditions or uninitialized variables (that's why we have to initialize to DEADBEEF or garbage values still) or state that's impossible to reason about are as bad as nondeterministic (aka flaky) unit tests. Like a unit test that fails every once in a while, an assert that pops up every once in a while in hard to pin down situations will be ignored, the boy who cried wolf. We used to have one in Minecraft that stuck with us for several years that flagged a possible performance issue but which didn't do any real harm and was rare enough that we eventually undefined it as a fault and passed a log message and some telemetry instead.
All in all, it would be nice if there was a simple rule of thumb like "Be brittle+robust: use lots of asserts but also put in guards for when they fire." that I could write a pithy article about…but there's not. We want to ask ourselves if the fault is something that could naturally occur in the wild no matter our best intention or if it's really a developer mistake. We want to ask ourselves if we want to consider that "mistake" to truly be a mistake or just a harmless unintended consequence. We want to ask ourselves if we can catch some real bugs/errors early and if we're willing to put up with the false positives to do so.
The system we're working with will affect our decision as well: I know studios that strip asserts completely out of publish general audience builds, as described in Writing Solid Code; but I also know there's at least one studio who leaves asserts in, and fatal, in publish servers to guarantee an exit before accidentally writing corrupt data. Minecraft Bedrock has a middle ground: asserts halt (but usually can be continued from) in our QA and dev builds; they're gone from general audience publish builds; they still exist and fire telemetry in preview and beta builds. So we can be more lenient about leaving intermittent assertions in than some other studios.
In general I'm going to keep erring on the side of having too many asserts than too few. I'll get it wrong sometimes but can always course correct after that failed build verification test.