Continuing my posts about n00b game design data science: https://www.gamedevblog.com/2019/07/a-n00b-does-game-design-data-science.html https://www.gamedevblog.com/2019/08/finally-ab-test-victories.html
Talking to the tech director of a fast-growing, up-and-coming always-online game, I asked if they were doing AB testing with their live population. They weren't, and he said they didn't really need to because they could tell if an update sucked by looking at the metrics before and after. I was surprised because they were a pretty big, well-funded game and I thought all the cool kids were doing the real thing these days.
I learned long ago when getting my psych degree that this kind of testing, or 'baseline testing' - making a change and then measuring to see if results changed - is not great. Without proper control and experimental groups, factors beyond your control might be influencing your results. With a videogame update, you might see longer session lengths because it's a holiday; the game might get more popular because a youtuber has told people it's cool; and so on.
To mitigate that you can do 'double baseline' testing, which is to take the change away and see if behavior returns to what it was. But who wants to implement a feature and then take it away once they see a spike in player engagement, just to make sure?
So enter AB testing. It is fairly easy to assign your online players to different test groups and have legit 'control' and 'experimental' groups, and now you've introduced some scientific rigor.
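In case it's useful to anyone, the assignment itself can be as simple as hashing a player id. Here's a minimal sketch in Python, assuming a hypothetical string player_id; salting the hash with the experiment name means each player sticks to one group per experiment, while different experiments assign independently:

```python
import hashlib

def assign_group(player_id: str, experiment: str) -> str:
    """Deterministically put a player in 'A' (control) or 'B' (experimental).

    Hashing the player id together with the experiment name gives each
    player a stable group per experiment, while assignments for different
    experiments stay independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode()).hexdigest()
    # Treat the hash as a number and split the space 50/50.
    return "A" if int(digest, 16) % 2 == 0 else "B"

# e.g. assign_group("player-123", "new-leveling") returns the same group every time
```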
But there's another reason I like proper AB testing better than baseline testing and this is probably the clincher for me: it's being able to test lots of changes independently but still release them all at the same time. If you do, say, an update every week with a bunch of changes, and you see your metrics change, you don't know which changes in that update were responsible. Maybe one change made things way better and another change made things a little worse. You won't know. If you want to be scientific you have to slow down, make one change at a time, wait and see how people react, take them away again, yikes.
With AB testing, however, you can independently test all the features in your update, and even investigate to see if features interact in interesting ways. This didn't really occur to me until I started doing it and then I wished I had gotten started much sooner. Soooo worth the small amount of time I invested.
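Concretely: because the per-experiment hash assignments above are independent, every combination of features ends up with players in it, so you can slice a metric by pairs of flags and eyeball interactions. A sketch under assumptions (hypothetical session records carrying each player's group per experiment plus a session length):

```python
from collections import defaultdict

def interaction_table(sessions, feature_x, feature_y):
    """Mean session length for each (feature_x group, feature_y group) cell.

    sessions: iterable of dicts like
        {"groups": {"new-leveling": "B", "loot-rework": "A"}, "length": 540}
    If the two features don't interact, the lift from B in one feature
    should look roughly the same within both groups of the other feature.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of lengths, count]
    for s in sessions:
        cell = (s["groups"][feature_x], s["groups"][feature_y])
        totals[cell][0] += s["length"]
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}
```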
One thing I'm concerned about and don't have a good answer for yet - sometimes I'll make a change, and in the AB test the change (the 'B' group) is definitely a success compared to the A group, but looking at my overall numbers things didn't seem to improve. For example I changed how leveling works on some Dungeon Life servers - before the change I was seeing average session lengths over 600. After the change the control group dropped to session lengths around 400, while the new leveling system still had session lengths around 600. WTF? So clearly the new leveling system is better...but what is going on? Did I accidentally introduce some bug in the control group? Playing in the control group myself to check, I didn't see any obvious problems. So that's going to remain one of life's little mysteries...because obviously I'm keeping the new system after all that.
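Session lengths are noisy, so before trusting a difference like that 600 vs 400 it's worth a quick significance check. A sketch using SciPy's Welch t-test, assuming you've pulled the per-session lengths for each group into plain lists (the function name is mine):

```python
from scipy import stats

def compare_groups(lengths_a, lengths_b, alpha=0.05):
    """Welch's t-test: are the mean session lengths of A and B really different?

    equal_var=False (Welch) because the two groups may have different
    variances. Session lengths are usually skewed, so with smaller samples
    a non-parametric test like stats.mannwhitneyu may be the safer choice.
    """
    t_stat, p_value = stats.ttest_ind(lengths_a, lengths_b, equal_var=False)
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}
```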
That said, all in all I'm really enjoying the sort of objective rigor that this is bringing to my design process. When I try something I think will make the game more appreciated or engaging, I get some objective measure afterwards of whether I was right or wrong, and with the AB testing I can keep developing at close to the same speed as before (minus a little overhead for measurement and more overhead for throwing out the changes that didn't work :/). I feel like I'm learning about game design at a much faster rate than I have the rest of my life!