Database 101: How to Model Leaderboards for 1M Player's Game.

Ever wondered how a game like League of Legends, Fortnite, or even Rockband models its leaderboards? In this article, we'll find out how to properly model a schema to handle them in a monstrously performative way!

If you’re just getting started with databases in general or databases, you might want to start off by reading my initial post, Database 101: Data Consistency for Beginners for Beginners. That article captures my own exploration of how many database paradigms exist as I look far beyond my previous experience with just SQL and MySQL. I’m keeping track of my studies in this Database 101 series.

It has been almost a year since I published my first article in this series! Thank you for being here with me as I learn this subject. Your comments and thoughts are always very helpful!

1. Prologue

Ever since I was a kid, like most average developers, I've been fascinated by games and how they're made. Speaking of which, let me introduce you to my favorite childhood game: Guitar Hero 3: Legends of Rock.

Well, more than a decade later, I decided to try to contribute with some games in the open source environment, like rust-ro (Rust Ragnarok Emulator) and also the main character of this article: the YARG (Yet Another Rhythm Game).

YARG is literally another rhythm game, but the difference about this project is that it is completely open source and they united legendary contributors in game development and also design to make this project works.

Suddenly the game was being picked up and played mostly by Guitar Hero/Rockband streamers on Twitch, and I thought: well, it's an open source project, so maybe I can use my database skills for something like creating a monstrously fast leaderboard or store the past games.

It started as a simple chat on their Discord, which turned into a long discussion about how to make this project grow faster.

Then I decided to talk to my boss and ask him if I could work with the YARG guys and the condition was to create something cool enough to implement ScyllaDB (NoSQL Wide-column Database) since I'm working as a Developer Advocate there. You won't believe how the simplicity and scalability brought by ScyllaDB perfectly fit the needs of YARG.in!

Anyway, talk is cheap. Let me show you some code and concepts!

2. QDD - Query Driven Data Modeling

When we're talking about develop with NoSQL, mostly we should understand that depending on the paradigm (document, graph, wide-column etc) you should first understand which query you want to run.

While in MySQL the main goal is to understand the consistency, in Scylla you should focus on the query and create your schema based on that query.

At this project, we will handle two types of paradigm, which are:

Key-Value
Wide Column (Clusterization)

Now let's talk about the queries/features of our modeling.

2.1 Feature: Storing the matches

Every time that you finish a YARG gameplay, the most interest thing would be submitting your scores together with many other in-game metrics.

Basically it will be a single query based in a main index and that's all.

SELECT score, stars, missed_notes, instrument, ...  
FROM leaderboard.submisisons  
WHERE submission_id = 'some-uuid-here-omg'

2.2 Feature: Leaderboard

And now our main goal: a super cool leaderboard that you don't need to care about after a good data modeling. The leaderboard is per song, so every time you play a specific song, your best score will be saved and ranked.

However there's a big point on this interface, which is having filters to know exactly "which" leaderboard to bring:

song_id: required
instrument: required
modifiers: required
difficulty: required
player_id: optional
score: optional

So imagine our query looks like this, and it returns the results sorted by score in descending order:

SELECT 
    player_id, score, ...
FROM 
    leaderboard.song_leaderboard 
WHERE 
    instrument = 'guitar' AND
    difficulty = 'expert' AND
    modifiers = {'none'} AND
    track_id = 'dani-california' 
LIMIT 
    100;

--   player_id  | score
----------------+-------
--        tzach | 12000
--  danielhe4rt | 10000
--     kadoodle |  9999
----------------+-------

Now we know the features that will be worked here, but can you already imagine how will be the final schema?

No? Ok, lemme help you on that!

3. Data Modeling time!

Time to take a deep dive into data modeling with ScyllaDB and better understand how to scale it.

3.1 - Matches Modeling

First, let us understand a little more about the game itself:

It's a rhythm game;
You play a certain song at a time;
You can activate "modifiers" to make your life easier or harder before the game;
You have to choose an instrument (e.g. guitar, drums, bass and microphone).
Every aspect of the gameplay is tracked, such as:
- Score;
- Missed notes;
- Overdrive count;
- Play speed (1.5x ~ 1.0x);
- Date/time of gameplay;
- And other cool stuff.

Thinking on that, we can easily start our data modeling, which will turn in something like this:

CREATE TABLE IF NOT EXISTS leaderboard.submissions (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY (submission_id, played_at)
);

Let's skip all the int/text values and jump to the set<text>.

The set type allows you to store a list of items of a particular type. I decided to use this list to store the modifiers because it's a perfect fit. Look at how the queries are executed:

INSERT INTO leaderboard.submissions (
    submission_id, 
    track_id,
    modifiers, 
    played_at
) VALUES (
    some-cool-uuid-here,
    'starlight-muse'
    {'all-taps', 'hell-mode', 'no-hopos'},
    '2024-01-01 00:00:00'
);

With this type, you can easily store a list of items to retrieve later.

Another cool piece of information is that this query is a key-value like! What does that mean?

Since you will always query it by the submission_id only, it can be categorized as a key-value.

3.2 Leaderboard Modeling

This is the part of the article where you will learn some cool wide-column database concepts.

On our leaderboard query, as mentioned earlier, we will always need some dynamic values in the WHERE clauses, which means that these values will belong to the Partition Key while the Clustering Keys will have value that can be "optional".

A partition key is a hash based on a combination of fields that you added to identify a value. Did you get it? No? Well, it took me a while to understand that too, but let me show you something:

Let's imagine that you played Starlight - Muse 100x times. If you would query this information, would be 100x different results differentiated by Clustering Keys like score or player_id.

SELECT 
    player_id, score ---
FROM 
    leaderboard.song_leaderboard 
WHERE 
    track_id = 'starlight-muse' 
LIMIT 
    100;

If 1.000.000 players play this song, your query will become slow and it will become a problem in the future, because your partition key consists of only one field, which is track_id.

However, if you add more fields to your Partition Key, like mandatory things before playing the game, maybe we can shrink these possibilities for a faster query. Now you see the big picture? Adding the fields like: Instrument, Difficulty and Modifiers will give you a way to split the information about that specific track evenly.

Let's imagine with some simple numbers:


-- Query Partition ID: '1'
SELECT 
    player_id, score, ...
FROM 
    leaderboard.song_leaderboard 
WHERE 
    instrument = 'guitar' AND
    difficulty = 'expert' AND 
    modifiers = {'none'} AND -- Modifiers Changed
    track_id = 'starlight-muse' 
LIMIT 
    100;

-- Query Partition ID: '2'
SELECT 
    player_id, score, ...
FROM 
    leaderboard.song_leaderboard 
WHERE 
    instrument = 'guitar' AND
    difficulty = 'expert' AND
    modifiers = {'all-hopos'} AND -- Modifiers Changed
    track_id = 'starlight-muse' 
LIMIT 
    100;

So, if you build the query in a specific shape it will always look for a specific token and retrieve the data based on these specifics Partition Keys.

Let's take a look at the final modeling and talk about the clustering keys and the application layer:

CREATE TABLE IF NOT EXISTS leaderboard.song_leaderboard (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY ((track_id, modifiers, difficulty, instrument), score, player_id)
) WITH CLUSTERING ORDER BY (score DESC, player_id ASC);

The partition key was defined as mentioned above, consisting of our REQUIRED PARAMETERS such as: track_id, modifiers, difficulty and instrument. And on the Clustering Keys we added score and player_id.

Note that by default the clustering fields are ordered by score DESC and just in case a player has the same score, the criteria to choose the winner will be alphabetical ¯\(ツ)/¯.

First is good to understand that we will have only ONE SCORE PER PLAYER, but with this modeling like that if the player goes through the same track twice in different scores, it will generate two different entries.

INSERT INTO leaderboard.song_leaderboard  (
    track_id, 
    player_id,
    modifiers,
    score,
    difficulty,
    instrument,
    stars, 
    played_at
) VALUES (
    'starlight-muse',
    'daniel-reis', 
    {'none'}, 
    133700, 
    'expert', 
    'guitar', 
    '2023-11-23 00:00:00'
);

INSERT INTO leaderboard.song_leaderboard (
    track_id,
    player_id,
    modifiers,
    score,
    difficulty,
    instrument,
    stars, 
    played_at
) VALUES (
    'starlight-muse',
    'daniel-reis', 
    {'none'}, 
    123700, 
    'expert', 
    'guitar', 
    '2023-11-23 00:00:00'
);


SELECT 
    player_id, score
FROM 
    leaderboard.song_leaderboard 
WHERE 
    instrument = 'guitar' AND
    difficulty = 'expert' AND
    modifiers = {'none'} AND
    track_id = 'starlight-muse' 
LIMIT 
    2;

--   player_id  | score
----------------+-------
--  daniel-reis | 133700
--  daniel-reis | 123700
----------------+-------

So how do we fix this problem? Well, it's not a problem per se. It's a feature! LOL

As a developer, you have to create your own business rules based on the project needs, and this is no different. What do I mean by that?

You can run a simple DELETE query before insert the new entry and guarantee that you will not have a specific data from the player_id with less than the new score inside that specific group of partition keys.

-- Before Insert the new Gampleplay

DELETE FROM 
    leaderboard.song_leaderboard 
WHERE 
    instrument = 'guitar' AND
    difficulty = 'expert' AND
    modifiers = {'none'} AND
    track_id = 'starlight-muse' AND
    player_id = 'daniel-reis' AND
    score <= 'your-new-score-here';

-- Now you can insert the new payload...

And with that we finished our simple leaderboard system, the same one that runs in YARG and can also be used in games with MILLIONS of entries per second :D

4. How to Contribute to YARG

Here's the part of the text that I'll invite you to contribute on this wonderful open source project!

Today we're building a brand new platform for all the players using:

Game: Unity3d (Repository)
Front-end: NextJS (Repository)
Back-end: Laravel 10.x (Repository)

and we will need many developers and testers as possible to discuss future implementations of the game together with the main contributors!

First, make sure to join their Discord Community. There is the place where all the technical discussions happen with the back of the community before going to the development board.

Also, outside of Discord, the YARG community is mostly focused on the EliteAsian (core contributor and project owner) Twitter account for development showcases. Be sure to follow him there as well.

And FYI, the Lead Artist of the game, aka Kadu is also a Broadcast Specialist and Product Innovation Developer at Elgato that worked with streamers like:

Ninja
Nadeshot
StoneMountain64
and the legendary DJ Marshmello.

Kadu also use his twitter to share some insights and early previews of new features and experimentations for YARG. So, don't forget to follow him on Twitter as well!

Here is some useful links to know more about the project:

Fun fact: YARG got noticed by Brian Bright, project lead on Guitar Hero, who liked the fact that the project was open source. Awesome, right?

5. Conclusion

Data modeling is sometimes challenging, and this study took 3 months of many new ScyllaDB concepts and a lot of testing together with my community at Twitch.

I have also published a Gaming Leaderboard Demo, where you can get some insights on how to implement the same project using NextJS and ScyllaDB!

Also, if you liked ScyllaDB and want to learn more about, I strongly suggest you to watch our free Masterclass Courses or vising ScyllaDB University!

Don't forget to like this post, follow me on the socials and fill your water bottle xD

See you in the next article!

Follow me on Twitter
Follow me on Github
Follow me on Github
Follow and Subscribe at my Twitch Channel

Daniel Reis @danielhe4rt