Because I'd rather Do Several Things Poorly™ than specialize, for fun and learnings I've been playing around with machine learning and data science with my last game, Dungeon Life. I want to share what I learn both because I like to write and hopefully educate but also because maybe you'll help catch me when I make mistakes.
One of the cool things about Roblox is that it's a place you can do a data science side project without breaking the bank, because the cost of user acquisition is so low. But I was not satisfied with either the Google Analytics or Game Analytics plug-ins, because they weren't letting me see my own data and weren't letting me do the queries I wanted to do. Sure, I could look at my engagement, change things, and see if my changes seemed to have an effect, but I could never tell for sure whether the effects were due to my changes or to other factors, such as it being a holiday.
I hope to share how I set up my metrics system in detail in another post. In brief, it uses PostgreSQL (it used to use MongoDB, but I hit the memory cap and switched to something free; also, I thought brushing up on my SQL would be a more marketable skill than learning Mongo's weird query language) with NodeJS on the backend and Roblox's HttpService on the game servers. With just a handful of concurrent users at any moment I can run the backend on my desktop at home; I had no need to get another computer or set something up in the cloud.
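To sketch what that backend looks like: here's a minimal illustration, not my production code. The /event path, the 204 response, and the in-memory array standing in for the Postgres table are all placeholders.

```typescript
// Minimal event-ingestion sketch: Node's built-in http module, with an
// in-memory array standing in for the Postgres "events" table.
import * as http from "http";

const events: unknown[] = [];

const server = http.createServer((req, res) => {
  if (req.method === "POST" && req.url === "/event") {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      events.push(JSON.parse(body)); // the real version does an INSERT instead
      res.statusCode = 204;
      res.end();
    });
  } else {
    res.statusCode = 404;
    res.end();
  }
});
```

The game servers just POST each packet as JSON at this endpoint via HttpService.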
My data packets currently contain these fields:
server_key: a GUID generated by Roblox's HttpService:GenerateGUID.
session_key: ditto.
category: the class of event, such as 'Spawn', 'Kill', 'Tutorial', 'PlayerAdded', or 'FrameDuration' - there are a bunch.
action, label: additional info about the event.
value: numeric info about the event. Can be real.
time: timestamp.
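As a sketch, a packet maps naturally onto a TypeScript type like this; the GUID strings and ISO timestamp below are just illustrative values, not real data:

```typescript
// Shape of one analytics event packet (field names from the list above).
interface AnalyticsEvent {
  server_key: string;   // GUID from HttpService:GenerateGUID on the Roblox side
  session_key: string;  // ditto
  category: string;     // e.g. "Spawn", "Kill", "Tutorial"
  action: string;       // additional info about the event
  label: string;        // additional info about the event
  value: number;        // numeric info; can be real-valued
  time: string;         // timestamp, e.g. ISO 8601
}

const example: AnalyticsEvent = {
  server_key: "7e0cf2d8-0000-0000-0000-000000000000",
  session_key: "5a1b3c4d-0000-0000-0000-000000000000",
  category: "Death",
  action: "",
  label: "",
  value: 1,
  time: new Date().toISOString(),
};
```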
Once I had it set up, and once I'd made my way through a chunk of this online ML class (https://www.coursera.org/learn/machine-learning/) I started seeing what I could learn. Like most game devs I'm interested in session length (how long did they play) and retention (did they come back?) but lately I've also been interested in intrinsic motivation and whether a game brings joy or not. I believe it's quite possible for an activity to make you desire to keep doing it without it actually bringing you happiness or joy; they're separate neural subsystems, dopamine and endorphins, and humans are notoriously bad about desiring to do things that don't make them happy. (For more on that you can read Stumbling On Happiness by Daniel Gilbert.)
In my attempt to include joy as a metric to optimize, Dungeon Life pops up a dialog box at various intervals asking "How are you enjoying Dungeon Life lately?" and allows an input of 1-5 stars. Self-reporting isn't perfect when it comes to measuring happiness but what else can you do?
The way I approached it was by trying to create predictive models using linear and logistic regression. (That class is free and can give a far better explanation than I can so jump on that if you want to dive deeper.)
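To make the linear-regression half concrete, here's a tiny gradient-descent fit in TypeScript. This is the generic textbook algorithm, not the actual Matlab code I used, and the Sample shape and hyperparameters are made up for illustration:

```typescript
// Plain batch gradient descent for linear regression (textbook version).
type Sample = { features: number[]; sessionMinutes: number };

function fitLinearRegression(
  samples: Sample[],
  learningRate = 0.05,
  iterations = 2000
): number[] {
  const n = samples[0].features.length;
  const weights = new Array(n + 1).fill(0); // weights[0] is the bias term

  for (let iter = 0; iter < iterations; iter++) {
    const grad = new Array(n + 1).fill(0);
    for (const s of samples) {
      // prediction = bias + w . features
      let prediction = weights[0];
      for (let j = 0; j < n; j++) prediction += weights[j + 1] * s.features[j];
      const error = prediction - s.sessionMinutes;
      grad[0] += error;
      for (let j = 0; j < n; j++) grad[j + 1] += error * s.features[j];
    }
    // step downhill along the mean gradient
    for (let j = 0; j <= n; j++) {
      weights[j] -= (learningRate / samples.length) * grad[j];
    }
  }
  return weights;
}
```

With real features you'd normalize each column first (the class covers why); without that, gradient descent converges painfully slowly.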
Some features I thought might be relevant to a player's first Dungeon Life experience included:
framerate, language, screen resolution, percentage of successful hits with melee & ranged weapons, amount of currency earned, death rate, and kill rate. I needed to be careful to make sure I didn't just count deaths or kills, because obviously more deaths and kills would be correlated with a longer session. There are a lot more features that might be relevant but this seemed like a good start.
I've loved a good scatterplot ever since reading Tufte's Visual Display of Quantitative Information; Excel is a little twitchy when I try to make scatterplots with it, so I've been using Matlab just like I learned in my ML class.
Here's a graph of time played vs. frame duration:
So, not too surprisingly, players running the game on toasters do not hang around. What surprised me is how slow the game ran for so many people; it's not a very high-resource game. I think this may go some way to explaining something I wondered about Roblox: it seems the most popular games are not that graphically intensive, but sometimes you find real works of art languishing in low popularity. My conclusion is the market on Roblox has a lot of players with very low-end machines and approaching Roblox development with the attitude that your game will have more detailed graphics than the competition might be exactly wrong.
Using just these features I was surprised at how accurate a session-length prediction algorithm ended up being. A graph of predicted session length against actual session length for a test set looked like this:
The line represents perfect predictions. I personally was impressed that the cloud seemed to hover near the line. (Doing logistic regression on retention, however, has yet to be terribly accurate.)
From looking at that data I could see which features contributed the most to the session length prediction. Death rate was one of them.
FWIW, this is what my death query looked like:
-- this query gets the count of death events in the first session
SELECT pkey_deaths, COUNT(death_time) AS first_session_deaths FROM (
    -- this subquery gets all the death events before the end of the first session
    SELECT * FROM
        -- this subquery gets all the deaths we ever tracked
        (SELECT player_key AS pkey_deaths, time AS death_time
         FROM events WHERE category='Death') AS deaths
    JOIN
        -- this subquery gets the end of the first session (I hope)
        (SELECT DISTINCT ON (player_key) player_key AS pkey2, time AS session_end
         FROM events WHERE category='SessionLength'
         ORDER BY player_key, time) AS session_end_times
    ON pkey_deaths = pkey2 AND death_time <= session_end
) AS first_deaths
GROUP BY pkey_deaths
(I divided by time spent playing and took the logarithm on the Matlab side.)
Death rate made some interesting graphs. Death rate was not a bell curve; some players quit before ever dying at all, some died once early on and quit right away. Applying a logarithm to it made it look nicer and ended up being what I put into the predictive model.
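In code the transform looks like this. Note the +1 inside the log is my own tweak in this sketch so that zero-death players stay defined; the same idea carries over to Matlab:

```typescript
// Death rate feature: deaths per unit time, log-transformed to tame the
// long tail. The +1 is an assumption on my part here so players with zero
// deaths map to 0 instead of -Infinity.
function logDeathRate(deaths: number, sessionMinutes: number): number {
  return Math.log(deaths / sessionMinutes + 1);
}
```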
This reminded me of Csikszentmihalyi's flow channel: die too much and you quit; don't die enough and you also quit. The green pluses mark players who came back.
So, yay, I've got some insights. What do I do with them?
This is when I implemented A/B testing. I wrote some new code to assign players to groups (in TypeScript, in case you Lua people are wondering wtf):
export namespace GameplayTestService {
    let testGroupData =
    [
        {
            name: "InstantHeroExpress",
            values: 2,
        },
        {
            name: "ClassesLocked",
            values: 3  // 0: nothing locked; 1: mage locked; 2: mage & rogue locked
        },
        {
            name: "DelayPlaceButtonReveal",
            values: 2
        }
    ]

    export function playerAdded( player: Player, testInfoHolder: TestInfoHolder ) {
        if( !testInfoHolder.testGroups ) {
            // choose test groups for new player
            testInfoHolder.testGroups = new Map<string,number>()
        }
        let existingGroups = testInfoHolder.testGroups
        for( let testGroup of testGroupData ) {
            if( !existingGroups.has( testGroup.name ) ) {
                // pick a uniform random variant in [0, values-1]
                let value = MathXL.RandomInteger( 0, testGroup.values - 1 )
                existingGroups.set( testGroup.name, value )
                Analytics.ReportEvent( player, "AssignTestGroup", testGroup.name, "", value )
            }
        }
    }
}
The 'testInfoHolder' is stored in the datastore so we'll never forget what test group a player belongs to, and it's where the game keeps track of which code path to send a given player down.
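On the gameplay side, the stored group value then drives a branch somewhere. Here's a hypothetical helper for the ClassesLocked test; the group meanings come straight from the comment in the config above:

```typescript
// ClassesLocked variants: 0 = nothing locked, 1 = mage locked,
// 2 = mage & rogue locked (per the test config).
function isClassLocked(className: string, groupValue: number): boolean {
  if (groupValue >= 1 && className === "Mage") return true;
  if (groupValue >= 2 && className === "Rogue") return true;
  return false;
}
```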
So now I can query various metrics against test groups, for example:
SELECT test_group_value, COUNT(first_return_day), COUNT(day0pkey) FROM (
    SELECT * FROM (  -- 1) get everyone who joined after the latest test was introduced
        (SELECT player_key AS day0pkey, time AS day0start
         FROM events WHERE category='Day' AND value=0) AS day0
        JOIN
        (SELECT player_key AS test_group_pkey, time AS test_group_time,
                action AS test_group, value AS test_group_value
         FROM events WHERE category='AssignTestGroup' AND action='ClassesLocked') AS test_group
        ON day0pkey=test_group_pkey ) AS tbl
    WHERE test_group_time <= day0start ) AS relevant_samples
LEFT JOIN
    (SELECT DISTINCT ON (player_key) player_key AS r_pkey, value AS first_return_day, time
     FROM events WHERE category='Day' AND value>0
     ORDER BY player_key, time) AS retained
ON r_pkey=day0pkey
GROUP BY test_group_value
"ClassesLocked" is a change I made where instead of being able to pick from Warrior, Rogue, and Mage right off the bat, the choices are limited. This was inspired a bit by learning about the death rate. I had noticed playing the game that players liked to pick Mage, but Mage is actually harder to play; they're fragile and when they run out of mana they are inferior fighters. Even though the game recommends starting with Warrior it seemed like players didn't read that text. In addition, this would give the players motivation to keep playing to unlock more classes. A lot of games do it that way, maybe that's the reason. So what was the result?
Classes Locked Out | Retained | Total | Expected
None               |       71 |  2413 | 81.07047
Mage               |       77 |  2425 | 81.47364
Mage and Rogue     |       98 |  2484 | 83.45589
Total              |      246 |  7322 |
With my naked eye that looks significant, but people tend to see significance when it's not really there. I'd forgotten how to do a chi-square test from my psych major days, but happily Excel has it built in. I calculated each expected value by taking the retained total, multiplying by that row's total, and dividing by the grand total, then ran CHISQ.TEST on the observed counts (71, 77, 98) vs. the expected ones (81.07, 81.47, 83.46). The result: 0.13. That means that if the change did nothing, there's about a 13% chance we'd see a difference at least this large just from luck. (Correct me if I'm wrong, please, n00b here.)
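The arithmetic is small enough to check outside Excel. This sketch reproduces the expected counts (retained total × row total / grand total) and the p-value; with two degrees of freedom the chi-square tail probability happens to be exactly exp(-x/2):

```typescript
// Chi-square goodness-of-fit statistic on the retained counts per test group.
function chiSquareStat(observed: number[], expected: number[]): number {
  let x = 0;
  for (let i = 0; i < observed.length; i++) {
    const d = observed[i] - expected[i];
    x += (d * d) / expected[i];
  }
  return x;
}

const retained = [71, 77, 98];
const rowTotals = [2413, 2425, 2484];
const retainedTotal = 246;
const grandTotal = 7322;

// expected retained count per group, assuming retention is group-independent
const expected = rowTotals.map((t) => (retainedTotal * t) / grandTotal);
const stat = chiSquareStat(retained, expected); // ~4.03
const pValue = Math.exp(-stat / 2);             // df = 2, so p ~ 0.13
```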
So that wouldn't be good enough to publish a paper in a psych journal, but it is enough to convince me to leave that change in!
What about my other metrics? For those, I query all the results, pull them into Excel and do t-tests. Excel also happily has a button for that, but I don't know how to do multivariate statistical tests it turns out. So I just test locking out 'mage & rogue' vs 'none' - anybody who wants to tell me where to look to do the multivariate equivalent of a t test please do.
The mean of the player ratings is a skosh higher (3.90 stars instead of 3.86) but, as I expected, not significant with a sample size of 979 and a t-stat of -0.44.
And that's when I get the bad news.
t-Test: Two-Sample Assuming Unequal Variances

                             | Variable 1 | Variable 2
Mean                         |    509.463 |    427.967
Variance                     |     508881 |     316605
Observations                 |       2370 |       2454
Hypothesized Mean Difference |          0 |
df                           |       4502 |
t Stat                       |     4.3957 |
P(T<=t) one-tail             |    5.6E-06 |
t Critical one-tail          |     1.6451 |
P(T<=t) two-tail             |    1.1E-05 |
t Critical two-tail          |     1.9604 |
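For what it's worth, Excel's "Unequal Variances" option is Welch's t-test, and the t Stat and df rows can be reproduced straight from the summary statistics. A sketch, using the Welch-Satterthwaite df formula:

```typescript
// Welch's t-test from summary statistics (means, variances, counts).
function welchTTest(
  mean1: number, var1: number, n1: number,
  mean2: number, var2: number, n2: number
): { t: number; df: number } {
  const se1 = var1 / n1;
  const se2 = var2 / n2;
  const t = (mean1 - mean2) / Math.sqrt(se1 + se2);
  // Welch-Satterthwaite degrees of freedom
  const df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1));
  return { t, df };
}

const { t, df } = welchTTest(509.463, 508881, 2370, 427.967, 316605, 2454);
// t ~ 4.396, df rounds to 4502 -- matching the table
```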
The session length of the players with the unlocked classes vs. the locked classes is the opposite of what I want to see. I haven't had this significant a result since I began playing around with t-tests. I was shocked enough that I triple-checked that I had the test set up correctly.
Wah wah wah.
What am I to conclude here? Players see the mage is locked out and decide to quit right then?
And then it occurs to me: it's not immediately obvious to the player that the rogue & mage will unlock after they've played; they have the same padlock icon on them that the premium classes, the priest & barbarian, have. Maybe players are thinking "I have to pay for the mage? That's bullshit!" So. I'll change that, and test some more!
I won't go into as much detail on the other things I A/B tested; a few seemed to have no effect, but another thing that hurt was taking out the first Daily Prize that's thrown in players' faces when they boot up the game for the very first time. Not only were they more likely to come back if they got one (alpha 5%), they might even have rated the game higher (alpha 30%). I thought I didn't need the GUI clutter, but apparently I do.
So! That's been my experience with data science and game-making so far. I've yet to have a real win--mostly the value so far has been in catching mistakes--but one of these days I hope to legit improve the game.
I appreciate you reading all the way to the end; hope it was useful to you, and if I screwed something up let me know!