There’s a pattern I’ve seen over and over again in game development. I’ve seen it in networking (currently wrestling with it now, in fact), UI, AI, and more. A pattern I used to frequently use myself. A pattern I’ve also seen abstracted and wrapped in various ways. This pattern can easily become a bug farm. The heart of the pattern looks something like this:
class myStateMachine {
public:
enum State {
stateA,
stateB,
etc...
};void setState(State newState) {
myState = newState;
}private:
void doThingA() {
// various logic which sometimes calls setState()
}void doThingB() {
// various logic which sometimes calls setState()
}
// etc...void update() {
switch( myState ) {
case stateA: doThingA(); break;
case stateB: doThingB(); break;
case stateC: doThingC(); break;
}
}
}
When we make the setState public this state machine is no longer one of the FSM’s that we learned about in school (or, in my case, from textbooks). Usually those change the state variable in one and only one place.
If it weren’t for setState being public, code like this can be hard but not impossible to reason about. Once setState becomes public, or there are other variables in the class that its logic depends on that can be set from the public interface (technically part of the state), things can get nearly impossible—even if it’s single threaded. I usually find the next step developers take once they start encountering bugs in the class is instrumenting the setState function with logging and maybe asserting that it’s in an expected state before going to a new state. I made a Wheel of Fortune game once where the turn-based game loop was broken down into a myriad of states. I was very proud of my debug UI that displayed the current state and its main variables in the upper left corner, but my lead was justifiably concerned that I even needed such a thing.
I spent the first half of my engineering career, became a technical director, and shipped some great titles before I started really caring about whether I could reason about my code or not. I spent more time in the debugger checking if my code worked as expected rather than actually writing the code in the first place and I thought that was just normal life. Formal proofs that code did what it was supposed to do was something for ivory tower academics! Today I would never bother to write an actual formal proof, but what I’m looking for is code that I can read and be pretty sure I know what it does. “I can reason about this” isn’t quite the same as “I can read this”; code can make my eyeballs bleed with cryptic acronyms and nested lambdas and still be reasoned about if I have the patience, and highly stateful nondeterministic code can be nicely self-documented with friendly, meaningful identifiers and plenty of well-named sub-functions and still be an unpredictable mess. These code values are orthogonal. We can have both. And in the past decade or two it’s become even more important to have “reasonable” code because we’re often working with concurrent systems (in a single threaded system, even if you don’t understand your code you can be pretty confident it will behave the same way every time); because we’re repairing the airplane while we’re flying it (before, we would get the code to a state that QA signed off on and then we wouldn’t touch that house of cards); and because code review has become the rule rather than the exception. (Now, “unreasonable” code puts a burden on the poor souls trying to review our code.)
For me there are a few main problems with the bashable FSM:
One: we literally don’t know what the code might do. That setState call can happen while we’re in any state and take us to any other state, and be used incorrectly by our clients. Maybe they call it multiple times between updates, and important transition logic that was supposed to be executed by the first state gets skipped. Maybe they don’t call it at all when they should have.
Two: we’re bashing data. That state variable or variables get changed irrevocably and there is no history of what they were, leaving the system rudderless when trying to determine the correct action and leaving the developers rudderless when there are bugs in the logic that they’re trying to get to the bottom of. Changing source-of-truth variables is where we get into trouble when coding, and while we can't actually make our programs do anything if we don't, we don't have to do it as much as is common practice.
Three: often, that persistent state is duplicated data. For example, a turnstile might have a record of when coins are inserted. And the coin detector might put the turnstile in an “unlocked” state. Really that “unlocked” state is duplicated data – we know when the coins were inserted; we know when the turnstile turned; whether it should be unlocked or not is just a computation at that point. When data is duplicated and cached it can lead to sync errors. It may be the result of computations, but there were still original sources of truth that have been possibly munged and saved. Sometimes we want to cache duplicated data or the result of computations for performance reasons, but most of the time (80%? 90%? 99%?) we don’t because of the possibility of sync errors, of using an out of date cached variable because we forgot to update it at some point (forgot to setState).
So I have a few approaches when I want to tame one of these state machines. In the case of the turnstile state machine where we have a record of inputs, I’ll usually get rid of the state variable entirely and write a nice little function like isTurnstileLocked that takes the records of coin drops and turnstile pushes as arguments. That function is easy to put a unit test around, has all the logic in one easy-to-reason-about place, and probably won’t ever bit-rot once we get it working to our satisfaction.
Maybe we don’t actually have or keep that record of inputs, so Problem Three is not actually a problem and the state machine state is the current source of truth. We still have problems One and Two. Sometimes my approach will be to make that state “richer” – for example, with a turnstile, maybe I record the number of coin drops and turnstile pushes, or maybe I record the timestamps of the last coin drop and the last turnstile push. It is still technically a state machine at this point, and one with even more mutable data—instead of one simple setState() function we now have, perhaps, insertCoin() and attemptToPushTurnstile()—so it sounds like I’m just trading one problem for another. I know from experience, at almost a gut level, that the new way is easier to reason about and harder to break accidentally but I have a hard time putting it into words: one improvement is that the encapsulation is at a higher layer, so we’ve limited the ways clients can use the system and made it harder for them to use it wrong; another is that the system and QA and future developers have richer historical data to base decisions on; and another is that the system is easier to get under test.
There’s also good old Functional Programming; don’t change the state in-place. state updateState(state oldState, stuff otherInputs). Easy to get under test, impossible for clients to set the state wrong without running the state-change’s attendant logic. We still then have to put that new state somewhere, and usually we’ll bash a state variable at the client level to do that, but we don’t have to. We could keep them in a list, recurse, or double-or-triple buffer.
My final approach, in the case where the turnstile contains its source of truth but we also want to be able to override that truth (an example in Minecraft’s networking code is when we need to temporarily shut down network access to prepare for the OS suspending) is to think in terms of state change “requests” rather than ham-handed setters – here we might have a queue of inputs that’s maintained by a requestTurnstileOverride() function. Again, we’ve ended up with more mutable state than the original problem, but now clients can’t change our state without our permission. We’re back to only changing the source of truth in one place, the input queue is just another input into the system, we can reason about how we process that input queue and get it under test. (Also, in this case, where the turnstile might be locked because nobody has put a coin in, or it might be locked because the whole subway station shuts down at midnight: we really don’t want to conflate those two sources of truth into one variable.)
One thing I’m not a fan of that may be even more popular these days than FSM’s, probably because they’re easier to write in the first place, are chains of tasks with callbacks. Here we often don’t even know when the functions will get called or what thread they’re on; if they have side effects we can get all sorts of fun races. Give me an old-school polled FSM over that--FSM's have their advantages! They can be a great way to organize spaghetti code, for example--just don’t make its setState() function public.
And I’d like to reiterate that even if that setState function isn’t public, once a state machine reaches a certain level of complexity it might as well be: people reading the code will not be able to figure out what state it is in at a given moment or what state it’s about to transition to. The approaches above still work to tame the machine.
Hear hear! Transitions should be the only way to invoke a state change.
I've really enjoyed the "Elm Pattern" or "Unidirectional Pattern". It seems counter intuitive that it should work, but all state changes are localized since the application only send messages/transitions, doesn't actually make the state changes itself.
Posted by: Jay K | November 13, 2022 at 06:46 PM