Lately Wordle has been losing my streak count. At first I thought it might be my imagination: maybe I missed a day? So I started texting my girlfriend my victories on a daily basis (even if she was sitting right next to me on the couch and I could have just told her) and today I caught it in the act. I had just given a talk to a few other code-quality-minded people in the Xbox division at Microsoft about Don’t Repeat Yourself and Sources of Truth and it reminded me of one of my slides, which said that cached computations count as duplicated data.
Quick refresher on Don’t Repeat Yourself: it’s most useful when applied to data rather than code, and means that you should have a single source of truth. (Database peeps are so into this they just use the acronym – SSoT – but I don't hear much talk about it in software engineering circles.) Because if I have multiple representations of the same truth in my system, it’s only a matter of time before some programmer (often me) screws it up and updates one truth without updating the other. “Human error is never the root cause.” (But wait, was it human error that created the system this way in the first place?🤔)
What’s not totally obvious is the idea that a computation, which our knee-jerk reaction is to cache, is a duplication of truth. And I’m sure that’s what’s going on with Wordle. I’m certain they have a database of every time we’ve ever played that probably includes when we started, each word we entered, and when we finished, so they can measure engagement – which means the source of truth about my streak is there, but they’re displaying some kind of dependent truth that is not updated correctly or can be bashed.
Maybe it was a premature optimization. Maybe they could have lazily queried the database every time a client wanted to see their streak, but decided to maintain a separate variable anyway to speed things up. Maybe they could just Not Do That. (I’m reminded of a friend who, whenever I do something stupid but unavoidable like bang my knee on the coffee table, asks, “Have you tried not banging your knee on the coffee table?”) Lots of factors go into why duplication of truth creeps into our code, so I try not to judge; I still do it even though DRY has been my #1 principle when coding for most of my career, and I’ve seen plenty of other senior engineers do it as well. It would be nice if this were the Wordle problem and they could just switch to using a query, because then I’d get my streak back.
On the other hand, maybe it was a necessary optimization. Maybe they found that Wordle users were consuming a lot of bandwidth on New York Times servers, and they were trying to get that number down and save costs, and discovered caching the streak was a significant improvement. And I don’t know their code. But I do know that some ways to cache are safer than others.
Here is a way that isn’t safe:
onWin(gameNumber) {
    wins.push_back(gameNumber);
    streak = streak + 1;    // duplicated truth: streak is maintained in parallel with wins
}

getStreak() {
    return streak;
}
There are a bunch of ways this could go wrong: maybe onWin gets called multiple times by accident, maybe a coder adds some other flow that can change your wins and forgets to update streak, maybe accessing streak is itself unreliable (pulling from a database or over the internet)…and once streak goes out of sync it will never come back.
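To make one of those failure modes concrete, here’s a hypothetical sketch – the undoWin feature and everything else below is invented for illustration, not Wordle’s actual code:

#include <cstdio>
#include <vector>

std::vector<int> wins;  // authoritative history of won game numbers
int streak = 0;         // duplicated truth, maintained in parallel

void onWin(int gameNumber) {
    wins.push_back(gameNumber);
    streak = streak + 1;
}

// Someone later adds an undo feature and forgets the parallel variable...
void undoWin() {
    wins.pop_back();    // oops: streak is now off by one, forever
}

int main() {
    onWin(345);
    onWin(346);
    undoWin();
    std::printf("wins=%zu streak=%d\n", wins.size(), streak);  // prints wins=1 streak=2
}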
And yet, on the surface, none of this code looks like a source-of-truth violation! There’s no telltale assignment from one variable to another, but the same truth is definitely represented twice in the system: I have data representing my history of play, and I also have this separate variable counting my streak. There’s probably no way to detect this sort of thing with static analysis, which is one of the reasons I wish engineers were more mindful of DRY and SoT; our tools won’t protect us.
So event-driven architectures – systems that fire code once in response to isolated events, which my toy example above might be part of – are one of my red flags when attempting DRY. There’s the truth of when the event happened (the user flicked a switch at this point), and there’s the truth of how we responded to it (we updated the lightbulb state) – sometimes we find ourselves maintaining both, or doing multiple things in response to the event if we didn’t keep its timestamp, and those multiple things are themselves duplicated truths. Event-driven systems like this are very popular – I see them in various UI and game frameworks and tween engines and whatnot. They’re good for decoupling, good for performance, not so good for safety.
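Here’s a made-up light-switch sketch of the kind of drift I mean (none of this is from any real framework; onSwitchFlicked and bulbIsOn are names I invented):

#include <cstdio>

bool bulbIsOn = false;  // our response to the event, maintained by toggling

// Fires once per delivered "switch flicked" event. The truth of the switch
// lives out in the world; all we ever see are events about it.
void onSwitchFlicked() {
    bulbIsOn = !bulbIsOn;
}

int main() {
    // The user flicks the switch once, but the event is delivered twice
    // (a retry, a double-subscribed handler, whatever). The bulb is now out
    // of sync with the switch, and no later event will correct it.
    onSwitchFlicked();
    onSwitchFlicked();
    std::printf("bulb is %s\n", bulbIsOn ? "on" : "off");  // off, though the switch is up
}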
But back to how to cache safely. Our CPUs have caches built right into them and it’s fine, so obviously it can be done.
How about this?
onWin(gameNumber) {
    wins.push_back(gameNumber);
    streak = recomputeStreak(wins);    // always derived from the authoritative wins
}

getStreak() {
    return streak;
}
We still have a duplicated truth here and a cached computation, but we have a one-way data flow: streak is always computed from wins, and wins is the authoritative source of truth. We could imagine a situation where, for whatever reason, wins gets updated and we forget to update streak, and there would be a desync for a time, but it would correct itself later. Data Oriented Design does this a lot – I was leery of it at first, for example seeing mark-and-sweep idioms in our ECS system where a bunch of flags for components are set during a frame and then thrown away, but it gives us performance, parallelism, and pretty good safety.
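For the curious, here’s one guess at what recomputeStreak might look like – a minimal sketch assuming wins holds game numbers sorted ascending with no duplicates, and that a streak means a run of consecutive game numbers ending at the most recent win (the real rule would also care whether that win was today’s game):

#include <vector>

int recomputeStreak(const std::vector<int>& wins) {
    int streak = 0;
    // Walk backwards from the most recent win, counting consecutive game numbers.
    for (int i = (int)wins.size() - 1; i >= 0; --i) {
        bool partOfRun = (i == (int)wins.size() - 1)     // the most recent win
                      || (wins[i] + 1 == wins[i + 1]);   // adjacent to the next win
        if (!partOfRun) break;
        ++streak;
    }
    return streak;
}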
Here’s a way that’s safer still, where we lazily evaluate. This is also good for performance – the point is moot for Wordle, because whenever you win it immediately shows you your streak, but in other cases, where there may be fewer reads than updates, we never have to do the computation at all.
onWin(gameNumber) {
    wins.push_back(gameNumber);
    needsRecompute = true;
}

getStreak() {
    if (needsRecompute) {
        cachedStreak = recomputeStreak();
        needsRecompute = false;
    }
    return cachedStreak;
}
I like it because it's lazy (if there are more writes than reads it saves time; if there are more reads than writes perf is the same). What I don't like is that coders still need to remember to invalidate/dirty the cache. It's better than remembering to update in parallel, but it's still an easy slip-up to make.
My favorite option is to track the old state and detect if it has changed:
onWin(gameNumber) {
    wins.push_back(gameNumber);
}

getStreak() {
    if (wins.length() != winLengthForCache) {
        cachedStreak = recomputeStreak();
        winLengthForCache = wins.length();
    }
    return cachedStreak;
}
What do you think? Does that seem clearly safer, or are you like, “I don’t see the difference”? It’s intuitively safer to me, though I have trouble explaining why. I’ll give it a shot: there’s still what looks at first blush like a duplicated truth here – winLengthForCache – and I could see a coder “helpfully” updating it when they shouldn’t (for example, in onWin()) without totally understanding what it does. But it is its own, separate truth: it is the truth of what the win length was when we last updated the cache. (Naming it winLengthWhenWeLastUpdatedCachedStreak might have been a better choice.)
Put another way, it’s very hard for me to imagine how that might desync. If a coder later on provides some functionality that edits a given win in place – changing which game it was, say – that would do it, because the length check would never notice. For this example that wouldn’t make a lot of sense (maybe a customer complains that their data is off and asks customer support to record that they won game 345 instead of game 346?), but it could make a lot of sense with other data sets – in which case perhaps a hash of the data would be a quick way to determine whether the data was dirty and we needed to recompute the cache.
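Here’s a minimal sketch of that hash idea – hashWins and winsHashForCache are names I made up, leaning on std::hash plus the usual hash-combine trick:

#include <functional>
#include <vector>

int recomputeStreak(const std::vector<int>& wins);  // as sketched earlier

std::vector<int> wins;
int cachedStreak = 0;
std::size_t winsHashForCache = 0;

// Hash the whole history so in-place edits get noticed, not just length changes.
std::size_t hashWins(const std::vector<int>& w) {
    std::size_t h = w.size();
    for (int g : w)
        h ^= std::hash<int>{}(g) + 0x9e3779b9 + (h << 6) + (h >> 2);  // hash-combine recipe
    return h;
}

int getStreak() {
    std::size_t h = hashWins(wins);
    if (h != winsHashForCache) {
        cachedStreak = recomputeStreak(wins);
        winsHashForCache = h;
    }
    return cachedStreak;
}

Note the tradeoff: hashing the whole history is O(n) per read, so this only pays off when recomputing is much more expensive than a linear scan, and a 64-bit hash leaves a tiny (but nonzero) chance that a collision hides a change.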
Another way I like to cache computations – though it probably wouldn’t work here, because the data needs would be enormous – is memoization. (We’d need 2^500 bits here if we do it the obvious way.) The cache in this case is no longer a dependent truth of the source data – the cache is a set of truths mapping source data to destination data: “someone who didn’t play game 1, won games 2 and 3, lost game 4, and won games 5 and 6 has a streak of 2”, etc. It wouldn’t work here, but there are plenty of data sets where it would work, and it’s really hard for me to imagine how it might desync.
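For data sets where it does fit, the shape is just a map from inputs to answers. A toy sketch, keyed on the whole history (memoizedStreak is my invention):

#include <map>
#include <vector>

int recomputeStreak(const std::vector<int>& wins);  // as sketched earlier

// The memo maps whole histories to streaks; each entry is its own little truth.
std::map<std::vector<int>, int> memo;

int memoizedStreak(const std::vector<int>& wins) {
    auto it = memo.find(wins);
    if (it != memo.end()) return it->second;  // we've answered this exact history before
    int streak = recomputeStreak(wins);
    memo.emplace(wins, streak);
    return streak;
}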
Although source-of-truth confusion is rarely going to crash our programs, it often does something worse: corrupt customer data permanently. Losing my Wordle streak isn’t a big deal, but other data bashing can be really harmful to our players/customers/the world. (An example from Minecraft is players spontaneously dying when the server thinks they’re falling but the client thinks they’re standing safely on a block, or losing livestock at chunk boundaries as data is persisted to disk. We have to duplicate truths when saving to disk or communicating across the internet, but are there ways we can do it more safely?) Lately I feel that, if I squint a little, most logic bugs that make it to the public can be thought of as source-of-truth problems. My coding mantra used to be Don’t Repeat Yourself, but it has evolved; these days it’s: Be Mindful of your Sources of Truth. Duplication of data isn’t necessarily bad and is often necessary, but when it happens we need to ask ourselves whether it’s going to be maintainable and look for ways to prevent it from becoming a bug farm.