
Dev log 3: Always read the fine print

Yep, another one of these is long overdue, right?

Since the very start of the Eleven alpha, our dedicated testers have been showing some serious commitment, and thanks to their help, we have been able to identify and fix a large number of bugs. By now, most of the major game features have been ported from our prototype server to the new one — the notable exception being location instancing, which includes home streets.
But since the most commonly asked question is, without doubt, some variation of “When are you going to let more players in?”, I would like to give you an honest update in that regard from a technical point of view.

During the past three months, we very slowly ramped up the number of players, keeping a close eye on the system’s performance. While things are mostly working ok-ish (apart from crashes, numerous bugs and just generally being an alpha), it is becoming quite clear that as it stands, the server would not be able to cope with actual MMO-like player numbers.

After some analysis of the underlying issues, we now know that the problem is rooted in a core part of our architecture. Our general approach since day one has been “get the game running with as few changes as possible to the code released by TS”. In order to achieve this, we used a bleeding-edge JavaScript language feature called Proxies to replicate how the original game server handled references between game objects and communication between server instances, because they are perfectly suited for that purpose. In retrospect though, we probably should have paid more attention to the fact that their current implementation in Node.js is a dead end, and the topic is not a priority for V8.
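To illustrate the idea, here is a heavily simplified sketch using the standardized Proxy API (not our actual code; loadObject and the object cache are made-up stand-ins): a “reference” to a game object can be a Proxy that resolves the real object on every property access, so GSJS code can use plain dot notation no matter where the target actually lives.

    // Simplified sketch (not our actual code): a game object reference as
    // a Proxy that transparently resolves the real object on each access.
    'use strict';

    const objectCache = new Map();  // hypothetical local object store

    function loadObject(tsid) {
        // stand-in for "fetch from persistence, or ask the server
        // instance responsible for this object"; here, just a cache hit
        return objectCache.get(tsid);
    }

    function makeObjRef(tsid) {
        return new Proxy({tsid: tsid}, {
            get: function (target, prop) {
                if (prop === 'tsid') return target.tsid;
                return loadObject(target.tsid)[prop];
            },
            set: function (target, prop, value) {
                loadObject(target.tsid)[prop] = value;
                return true;
            },
        });
    }

    // GSJS-style code never knows it is holding a mere reference:
    objectCache.set('PXYZ123', {label: 'Stoot', energy: 100});
    const player = makeObjRef('PXYZ123');
    console.log(player.label);  // 'Stoot', resolved through the proxy
    player.energy -= 10;        // a get and a set, both going through traps

Elegant, but it means every single property access in a million lines of game logic goes through traps like these, and that is exactly where the performance problem lives.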

To illustrate the impact, here’s how “fast” some typical operations are on our server right now:

login_start: 4.12 ops/sec
groups_chat: 2,126 ops/sec
itemstack_verb_menu: 216 ops/sec
itemstack_verb: 316 ops/sec
move_xy: 4,787 ops/sec
trant.onInterval: 1,618 ops/sec

And in comparison, the same operations with the problematic parts taken out (just for the benchmark — the game would not work that way, obviously):

login_start: 6.34 ops/sec
groups_chat: 11,023 ops/sec
itemstack_verb_menu: 1,766 ops/sec
itemstack_verb: 4,454 ops/sec
move_xy: 126,096 ops/sec
trant.onInterval: 49,433 ops/sec
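In case you are wondering where such numbers come from: they are the output of simple micro-benchmarks that call the respective request handlers in a tight loop. A minimal sketch of such a measurement using the Benchmark.js npm package (not our actual harness; handleRequest is a made-up stand-in for dispatching a client request):

    // Micro-benchmark sketch using the Benchmark.js npm package.
    'use strict';

    const Benchmark = require('benchmark');

    // hypothetical stand-in for dispatching a request to the game server
    function handleRequest(type, payload) { /* ... */ }

    new Benchmark.Suite()
        .add('move_xy', function () {
            handleRequest('move_xy', {x: 100, y: -50});
        })
        .add('itemstack_verb', function () {
            handleRequest('itemstack_verb', {verb: 'lick'});
        })
        .on('cycle', function (event) {
            // prints e.g. "move_xy x 4,787 ops/sec ±1.23% (85 runs sampled)"
            console.log(String(event.target));
        })
        .run();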

Unfortunately, there is no easy solution here: Reconsidering our early technology platform decisions would of course be a huge step backwards — but more intrusive modifications to the TS architecture and code, to be able to get rid of the “slow” proxies, are not a pleasant prospect either (remember, roughly a million lines of code).
We are, of course, pondering ways to tackle the problem more creatively, too, but so far without that liberating eureka moment.

Sorry if all this sounds a bit bleak now, but we would rather be upfront about where we’re at than raise expectations and then keep you in the dark about the challenges ahead. Rest assured that we are still working hard on Eleven (there are many other moving parts that are not related to this issue), and who knows, maybe there is a feasible solution around the corner that we just didn’t think of yet.
(We should probably donate to Tii more…)

Dev log 2: To the GitHubs!

So, it’s programmer mumbo-jumbo time again! Sorry, I wanted to do another one of these much sooner, but there always seems to be so much else to do, too… anyway, there has been a lot going on behind the scenes during the past couple of weeks. In fact, one of these things concerns coming out from behind said scenes a bit: since creating an open source Glitch, er, Eleven game server has been the plan all along, we finally started moving parts of our code to GitHub. What you can see there right now is actually the humble beginning of the “real” game server — no more throwaway prototype code.

Now, before you fire up Git: this is still in its early stages, and does not support the actual game client yet. You can marvel at unit test results if that’s your thing, though! Besides, it is just one of several components we are working on; another big one is the webapp, parts of which you have already seen in screenshots or demos (the new vanity/wardrobe and the skills interface). An important next step will be integrating these components and having them talk to each other. This game sure needs a lot of stuff!

Unit tests can be fun, too.

Also, a couple of new people have joined the dev team recently, and I want to specifically mention two of them who have already contributed some great work. Amabaku has been making steady progress on the long-neglected topic of NPC movement, and it makes working on other parts of the system so much more enjoyable when you get to see the world coming more and more to life in the background. Meanwhile, Egiantine started looking into creating a better AMF library for Node.js (AMF being the data format of the messages exchanged between the game server and client). The existing options all turned out to be lacking in one area or another, but her initial benchmark results look very promising. With the library we are currently using, the server process simply cannot keep up with the messages from more than a handful of clients, so this really is a key piece of infrastructure.
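To give you an idea of the kind of low-level fiddling involved: AMF3 encodes integers in a variable-length format called U29, which packs 7 bits per byte and uses the high bit as a “more bytes follow” flag (a fourth byte, if present, contributes all 8 of its bits). A minimal decoder for just that one building block could look like this (a sketch based on the AMF3 spec, not code from Egiantine’s library):

    // Decode a U29 variable-length integer from a Buffer (sketch based on
    // the AMF3 spec, not taken from any particular library).
    function readU29(buf, offset) {
        let value = 0;
        for (let i = 0; i < 3; i++) {
            const byte = buf[offset++];
            if (byte < 0x80) {  // high bit clear: this is the last byte
                return {value: (value << 7) | byte, offset: offset};
            }
            value = (value << 7) | (byte & 0x7f);
        }
        // a fourth byte contributes all 8 of its bits
        return {value: (value << 8) | buf[offset], offset: offset + 1};
    }

    console.log(readU29(Buffer.from([0x7f]), 0).value);        // 127
    console.log(readU29(Buffer.from([0x82, 0x2c]), 0).value);  // 300

Multiply that by every integer, string and object in every message from every client, and it becomes clear why the speed of this library matters so much.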
While it is still easier to try and test these and other new features in the existing prototype server (being able to fire up the game and all), step by step they will be ported over to the “real” server in the coming weeks and months. Moving forward 🙂

Dev log 1: Hunting for Leaks

In the Pre-Game Show post, Justin mentioned memory issues in our current game server, which required frequent restarts during the recording of the video because the server became unresponsive. Even though this is just a prototype, we decided to look into these problems — otherwise, we would just make the same mistakes again later.

To reproduce the situation without needing a bunch of real people to log on to a server and do stuff, we have a fairly simple script that simulates just that: a set number of fake players log in one by one (in the same location), and immediately start moving around without pause. The continuous movement causes a non-stop flow of messages from the clients to the server, which makes problems bubble up more quickly than in real-world use.
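In spirit, the script is not much more than this (a simplified sketch; connectToServer is a hypothetical stand-in for speaking the real client protocol):

    // Simplified sketch of the load-test script.
    'use strict';

    const NUM_PLAYERS = 5;
    const MOVE_INTERVAL_MS = 50;

    function connectToServer() {
        // hypothetical stand-in for the real AMF-over-TCP client protocol
        return {send: function (type, payload) { /* ... */ }};
    }

    function startFakePlayer(n) {
        const client = connectToServer();
        client.send('login_start', {player: 'fake' + n});
        // move around without pause, producing a non-stop message flow
        setInterval(function () {
            client.send('move_xy', {
                x: Math.floor(Math.random() * 2000),
                y: -Math.floor(Math.random() * 600),
            });
        }, MOVE_INTERVAL_MS);
    }

    // log the fake players in one by one, a few seconds apart
    for (let i = 0; i < NUM_PLAYERS; i++) {
        setTimeout(startFakePlayer, i * 5000, i);
    }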

Monitoring the game server process memory usage while running that script resulted in this diagram:

During the login phase, things still look more or less normal, and memory usage ramps up from below 200 MB to ~350 MB. After the fifth login though, something bad happens: The garbage collector starts a big cleanup cycle, and manages to free up over 100 MB of memory — but it takes more than a minute to do that, making the server completely unresponsive during that time.

Following that, the players can continue running around (but memory is being consumed at an alarming rate), until it all comes to a grinding halt again, this time for over three minutes. Finally, it all goes pear-shaped and the server process just crashes (that’s where the graphs abruptly end towards the right).
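Producing graphs like these is straightforward, by the way: Node reports its own memory stats via process.memoryUsage(), so a periodic logger along these lines (a minimal sketch), plus any plotting tool, is all it takes:

    // Log Node's memory stats as CSV once per second; the output can be
    // fed to a plotting tool of your choice. (Minimal sketch.)
    const start = Date.now();
    console.log('seconds,rss,heapTotal,heapUsed');
    setInterval(function () {
        const mem = process.memoryUsage();
        console.log([(Date.now() - start) / 1000,
            mem.rss, mem.heapTotal, mem.heapUsed].join(','));
    }, 1000);

Fittingly, the gaps in such a log are a symptom in themselves: while the process is stuck in one of those long garbage collection cycles, the timer simply does not fire.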

In order to find out what is consuming memory so quickly, we first tried an analytic approach: Taking snapshots of the server process memory before and after certain operations (e.g. a player moving once), and comparing these snapshots. Unfortunately, this did not lead to any useful results, as there are a lot of unrelated things “going on” within the process even during short time intervals, making it very difficult to spot the changes relevant to our problem.
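(For reference, taking the snapshots themselves is the easy part, e.g. with the heapdump npm module; the resulting files can be loaded and compared in the Chrome DevTools heap profiler. A sketch, with simulatePlayerMove standing in for whatever operation is under suspicion:)

    // Dump the heap before and after an operation, for later comparison
    // in the Chrome DevTools heap profiler.
    'use strict';

    const heapdump = require('heapdump');

    function simulatePlayerMove() { /* hypothetical: trigger one move_xy */ }

    heapdump.writeSnapshot('before.heapsnapshot', function () {
        simulatePlayerMove();
        if (global.gc) global.gc();  // only available with --expose-gc
        heapdump.writeSnapshot('after.heapsnapshot');
    });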

Instead, we had to switch to a somewhat more painful empirical approach: Removing “suspicious” parts of the code, bit by bit, and repeatedly running the aforementioned script, while closely watching for significant changes in the memory usage patterns. As you can imagine, this gets quite tedious after a while. While googling for less frustrating ways to solve such problems, I came across this half-joking remark by Ben Noordhuis (a long-time Node.js core contributor), which I wholeheartedly agree with:

Tracking down memory leaks in garbage-collected environments is one of the great unsolved problems of our generation.

Eventually, we did find the culprit. A slightly simplified explanation: All of the game objects (players, items, locations etc.) are wrapped in a “persistence proxy” when they are loaded, which tells the persistence layer to save the object whenever it changes. When a nested property of an object is accessed (e.g. player.metabolics.energy or player.stats.xp), such a proxy has to be created for the subordinate layers (metabolics or stats in this example). Our mistake was creating these proxies on every access, instead of just once and keeping them around. Really obvious once you know it (as is often the case with bugs)!
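In code, the bug and its fix look roughly like this (simplified; markDirty stands in for the real persistence machinery):

    // Simplified illustration of the bug and the fix.
    'use strict';

    const proxyCache = new WeakMap();  // the fix: one wrapper per object

    function markDirty(obj) { /* stand-in: tell persistence layer to save */ }

    function makePersistenceProxy(obj) {
        return new Proxy(obj, {
            get: function (target, prop) {
                const value = target[prop];
                if (value !== null && typeof value === 'object') {
                    // the bug was effectively this:
                    //     return makePersistenceProxy(value);
                    // i.e. a brand new proxy allocated on *every* nested
                    // access like player.stats.xp; fixed by caching:
                    let proxy = proxyCache.get(value);
                    if (!proxy) {
                        proxy = makePersistenceProxy(value);
                        proxyCache.set(value, proxy);
                    }
                    return proxy;
                }
                return value;
            },
            set: function (target, prop, value) {
                target[prop] = value;
                markDirty(target);
                return true;
            },
        });
    }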

After a pretty simple fix, the script produced much more pleasant results:

Looking good! Now, off to make this work for more than five players…

Dev log 0: The Game Server

Hello lovely in-limbo Glitchen and other curious supporters! Quick intro, I’m this guy, one of the more technically inclined members of Team Eleven. Which is why I was dragged into the spotlight, er, prodded gently to give you some technical background on our approach to recreating the core missing piece of the Glitch architecture: the game server.

As most of you know, a part of the server-side code has been released by Tiny Speck in the glitch-GameServerJS repository on GitHub. This repository — referred to as “GSJS” below — consists of roughly one million lines of JavaScript code that, put simply, contain all of the Glitch game logic and textual content. We decided early on that it would be a good idea to reuse that code with as few changes as possible, first and foremost in order not to introduce new bugs into tried and tested code, but also because the sheer volume of it would make any structural change (let alone rewriting it in another language) an enormous task.

What TS did not release is the actual server component: the part that the game clients connect and send messages to, that processes these messages by calling the respective GSJS functions, sends the resulting responses back to the clients, and manages the persistent state of every object in the game world. Originally, this component was a Java application that ran the GSJS code inside the Java virtual machine using Rhino. While we did initially consider rewriting the server based on the same technologies, we eventually agreed to try our hand at implementing it using Node.js, at least for a first iteration. The reasons behind that decision were roughly the following:

  • much better performance of V8 (the JS engine that drives Node.js) compared to Rhino (and Rhino’s successor has not been released yet)
  • Node.js is clearly on the rise and currently has a very active community, while Rhino is in end-of-life mode
  • one less language to worry about, having both core game server and GSJS written in JS
  • a greater expected likelihood of finding people willing to write JS in their spare time than Java
  • and last but not least, cal and serguei (of Tiny Speck) suggested Node.js (“I think node is the most pragmatic choice”), which from our perspective made it more or less the obvious way to go

So off we went cobbling together a prototype server around the beginning of December, and a few weeks later we had something multiple people could connect to at the same time, and do glitchy things in (you’ve seen the screenshots). We were able to integrate the existing GSJS code after running it through a fairly simple preprocessor script, with only a couple of minor manual adjustments.
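To give a rough idea of what that integration looks like, here is a sketch under assumptions (the host function shown and its signature are made up for illustration): the GSJS files are essentially collections of function definitions that expect a host-provided environment, so the preprocessed code can be loaded into a prepared context using Node’s built-in vm module.

    // Rough sketch of loading a preprocessed GSJS file with Node's vm
    // module; the api* host function shown is hypothetical.
    'use strict';

    const fs = require('fs');
    const vm = require('vm');

    function loadGsjsFile(path) {
        const context = vm.createContext({
            // tiny excerpt of the environment the GSJS code expects:
            apiNewItemStack: function (classId, count) { /* ... */ },
            log: console,
        });
        vm.runInContext(fs.readFileSync(path, 'utf8'), context,
            {filename: path});
        return context;  // now holds the functions the file defined
    }

This prototype server now serves three main purposes: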

  • a means for us to learn how all aspects of the game actually work internally, in a very hands-on way, and to tinker with stuff we do not understand yet
  • a way to determine serious issues that might cause us to revise our technology choices, and to test various options for components where we have not reached a decision yet (e.g. the persistence layer)
  • a platform for other tasks that require parts of the game in a “live” state, like the tagging process

Regarding the second point, so far we have not encountered any obvious, insurmountable roadblocks (except maybe concerns regarding performance — but that is a topic for another blog post). We do, however, struggle with the fact that Node’s architecture differs from the original server in one significant way: it is strictly single-threaded and relies on the application code to “play fair” by not performing long-running, uninterruptible operations. The existing GSJS code is obviously not designed with that restriction in mind. In practical terms this currently means that, for example, anytime a new player logs in, the game is effectively paused for everyone else for a second or so.
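The effect is easy to demonstrate in isolation (a generic illustration, nothing GSJS-specific):

    // While the synchronous loop below runs, timers, I/O and everyone
    // else's messages have to wait.
    'use strict';

    setInterval(function () {
        console.log('tick');  // normally fires every 100 ms
    }, 100);

    setTimeout(function () {
        // stand-in for something like a player login doing a lot of
        // uninterruptible work in one go
        const end = Date.now() + 1000;
        while (Date.now() < end) { /* busy */ }
        console.log('heavy operation done; note the missing ticks');
    }, 500);

The usual remedy is to slice such work into smaller chunks (e.g. via setImmediate) so other events can be squeezed in between, which is exactly the kind of change a million lines of GSJS code were never written for.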

It is important to note that this is not a server we can use for any kind of public testing/demo, unfortunately — it was simply not made for that purpose at all. But once we are reasonably confident we understand how everything works (“soon”), it will serve as a sort of blueprint and hopefully allow us to work on the “real thing” in a structured, efficient way. While this approach may seem like doing the same work twice, the reasoning here is that we would not have gotten it “right” the first time anyway, having started off with pretty much no prior knowledge about the inner workings of Glitch.

If there is interest in future technical blog posts about how we are trying to solve issues like the ones described above (i.e. if you are longing for more long-winded articles with techy words, abbreviations and no screenshots), let us know!