Category Archives: Tech Rambles

Dev log 3: Always read the fine print

Yep, another one of these is long overdue, right?

Since the very start of the Eleven alpha, our dedicated testers have shown serious commitment, and we have been able to identify and fix a large number of bugs thanks to their help. By now, most of the major game features have been ported from our prototype server to the new one; the notable exception is location instancing, which includes home streets.
But since without doubt the most commonly asked question is some variation of “When are you going to let more players in?”, I would like to give you an honest update in that regard from a technical point of view.

During the past three months, we very slowly ramped up the number of players, keeping a close eye on the system’s performance. While things are mostly working ok-ish (apart from crashes, numerous bugs and just generally being an alpha), it is becoming quite clear that as it stands, the server would not be able to cope with actual MMO-like player numbers.

After some analysis of the underlying issues, we now know that the problem is rooted in a core part of our architecture. Our general approach since day one has been “get the game running with as few changes as possible to the code released by TS”. In order to achieve this, we used a certain bleeding-edge JavaScript language feature called Proxies to replicate how the original game server handled references between game objects and communication between server instances, because they are just perfectly suited for that purpose. In retrospect though, we probably should have paid more attention to the fact that their current implementation in Node.js is actually a dead end, and the topic is not a priority for V8.
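To give a rough idea of what "proxies for references between game objects" can look like, here is a minimal sketch. It assumes today's standardized `Proxy` constructor, which is not the older experimental API this post is about, and the `store`, `makeObjRef` and the sample TSIDs are hypothetical names made up for illustration, not taken from our server code.

```javascript
// Hypothetical in-memory object store keyed by TSID (not the real persistence layer).
const store = new Map([
  ["P001", { tsid: "P001", label: "Some Player", location: null }],
  ["L001", { tsid: "L001", label: "Some Street" }],
]);

let loads = 0; // counts how often the store is actually hit

// Returns a stand-in for the object with the given TSID. The real object is
// only looked up when a property is first read or written.
function makeObjRef(tsid) {
  let target = null;
  const resolve = () => {
    if (target === null) {
      loads += 1;
      target = store.get(tsid);
      if (!target) throw new Error(`unknown TSID: ${tsid}`);
    }
    return target;
  };
  return new Proxy({}, {
    get: (_, prop) => resolve()[prop],
    set: (_, prop, value) => { resolve()[prop] = value; return true; },
    has: (_, prop) => prop in resolve(),
  });
}

const player = makeObjRef("P001");
player.location = makeObjRef("L001"); // a reference to another game object
console.log(player.location.label);   // the street is only loaded at this point
```

The game code on top just sees ordinary property access, which is why this approach needs so few changes to existing code. The price is that every single property access goes through a handler function.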

To illustrate, here’s how “fast” some typical operations are on our server right now:

login_start: 4.12 ops/sec
groups_chat: 2,126 ops/sec
itemstack_verb_menu: 216 ops/sec
itemstack_verb: 316 ops/sec
move_xy: 4,787 ops/sec
trant.onInterval: 1,618 ops/sec

And in comparison, the same operations with the problematic parts taken out (just for the benchmark — the game would not work that way, obviously):

login_start: 6.34 ops/sec
groups_chat: 11,023 ops/sec
itemstack_verb_menu: 1,766 ops/sec
itemstack_verb: 4,454 ops/sec
move_xy: 126,096 ops/sec
trant.onInterval: 49,433 ops/sec
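For a quick sense of scale, the two lists above can be turned into speedup factors. This is plain arithmetic on the numbers from this post; the variable names are just for illustration.

```javascript
// Operations per second, copied from the two benchmark lists above.
const withProxies = {
  login_start: 4.12,
  groups_chat: 2126,
  itemstack_verb_menu: 216,
  itemstack_verb: 316,
  move_xy: 4787,
  "trant.onInterval": 1618,
};
const withoutProxies = {
  login_start: 6.34,
  groups_chat: 11023,
  itemstack_verb_menu: 1766,
  itemstack_verb: 4454,
  move_xy: 126096,
  "trant.onInterval": 49433,
};

// Speedup factor = throughput without the problematic parts / throughput with them.
const speedup = {};
for (const op of Object.keys(withProxies)) {
  speedup[op] = withoutProxies[op] / withProxies[op];
  console.log(`${op}: ${speedup[op].toFixed(1)}x`);
}
```

The factors range from roughly 1.5x for login_start to roughly 30x for trant.onInterval, so the operations that run most often are also the ones that suffer most.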

Unfortunately, there is no easy solution here: Reconsidering our early technology platform decisions would of course be a huge step backwards — but more intrusive modifications to the TS architecture and code, to be able to get rid of the “slow” proxies, are not a pleasant prospect either (remember, roughly a million lines of code).
We are of course pondering ways to tackle the problem more creatively, too, but without that liberating Eureka moment so far.

Sorry if all this sounds a bit bleak now, but we would rather be upfront about where we’re at than raise expectations and then keep you in the dark about the challenges ahead. Rest assured that we are still working hard on Eleven (there are many other moving parts that are not related to this issue), and who knows, maybe there is a feasible solution around the corner that we just haven’t thought of yet.
(We should probably donate to Tii more…)

Dev log 2: To the GitHubs!

So, it’s programmer mumbo-jumbo time again! Sorry, I wanted to do another one of these much sooner, but there always seems to be so much else to do, too… anyway, there has been a lot going on behind the scenes during the past couple of weeks. In fact, one of these things concerns coming out from behind said scenes a bit: since creating an open source GlitchEleven game server has been the plan all along, we finally started moving parts of our code to GitHub. What you can see there right now is actually the humble beginning of the “real” game server — no more throwaway prototype code.

Now, before you fire up Git: this is still in early stages, and does not support the actual game client yet. You can marvel at unit test results if that’s your thing, though! Besides, it is just one of several components we are working on; another big one being the webapp, parts of which you have already seen in screenshots or demos (the new vanity/wardrobe and the skills interface). An important next step will be integrating these components and having them talk to each other. This game sure needs a lot of stuff!

Unit tests can be fun, too.

Also, a couple of new people have joined the dev team recently, and I want to specifically mention two of them who already contributed some great work. Amabaku has been making steady progress on the long-neglected topic of NPC movement, and it makes working on other parts of the system so much more enjoyable when you get to see the world coming more and more to life in the background. Meanwhile, Egiantine started looking into creating a better AMF library for Node.js (AMF being the data format of the messages exchanged between game server and client). The existing options all turned out to be lacking in one area or another, but her initial benchmark results look very promising. With the library we are currently using, the server process simply cannot keep up with the messages from more than a handful of clients, so this really is a key piece of infrastructure.
While it is still easier to try and test these and other new features in the existing prototype server (being able to fire up the game and all), step by step they will be ported over to the “real” server in the coming weeks and months. Moving forward 🙂

Dev Log 1: Hunting for Leaks

In the Pre-Game Show post, Justin mentioned memory issues in our current game server, which required frequent restarts during the recording of the video because the server became unresponsive. Even though this is just a prototype, we decided to look into these problems — otherwise, we would just make the same mistakes again later.

To reproduce the situation without needing a bunch of real people to log on to a server and do stuff, we have a fairly simple script that simulates that: A set number of fake players that log in one by one (in the same location), and immediately start moving around without pause. The continuous movement causes a non-stop flow of messages from the clients to the server, which makes problems bubble up more quickly than in real-world use.
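The general shape of such a script, as a simplified sketch: the message names `login_start` and `move_xy` exist in the game, but the payload fields, the `runFakePlayers` function and the injected `send` callback are assumptions for illustration. The actual script talks to a running server instead.

```javascript
// Simulates `count` fake players logging in one by one. After each login,
// everyone who is already in the location moves `movesPerPlayer` times.
// `send` is whatever delivers a message to the server under test.
function runFakePlayers(count, movesPerPlayer, send) {
  const players = [];
  for (let i = 0; i < count; i++) {
    const id = `fake-${i}`;
    send({ type: "login_start", player: id });
    players.push({ id, x: 0, y: 0 });

    for (let step = 0; step < movesPerPlayer; step++) {
      for (const p of players) {
        p.x += Math.random() < 0.5 ? -10 : 10; // wander left or right
        send({ type: "move_xy", player: p.id, x: p.x, y: p.y });
      }
    }
  }
  return players;
}

// Example: collect the messages instead of sending them over the network.
const sent = [];
runFakePlayers(5, 3, (msg) => sent.push(msg));
console.log(`${sent.length} messages generated`);
```

Because every additional player adds to the movement traffic of all previous ones, the message volume grows much faster than the player count, which is exactly what makes problems show up early.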

Monitoring the game server process memory usage while running that script resulted in this diagram:

During the login phase, things still look more or less normal, and memory usage ramps up from below 200 MB to ~350 MB. After the fifth login though, something bad happens: The garbage collector starts a big cleanup cycle, and manages to free up over 100 MB of memory, but it takes more than a minute to do that, making the server completely unresponsive during that time.

Following that, the players can continue running around (but memory is being consumed at an alarming rate), until it all comes to a grinding halt again, this time for over three minutes. Finally, it all goes pear-shaped and the server process just crashes (that’s where the graphs abruptly end towards the right).
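If you want to collect numbers like these yourself, a minimal sketch using Node's built-in `process.memoryUsage()` could look like this. The function names and the sampling approach are illustrative assumptions, not necessarily how the diagram above was produced.

```javascript
// Takes one sample of the current process memory, in megabytes.
function sampleMemory() {
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes) => Math.round(bytes / (1024 * 1024));
  return { time: Date.now(), rssMb: toMb(rss), heapUsedMb: toMb(heapUsed) };
}

// Logs a sample every `intervalMs` milliseconds; returns a function that stops it.
function startMemoryLog(intervalMs, log = console.log) {
  const timer = setInterval(() => log(JSON.stringify(sampleMemory())), intervalMs);
  timer.unref(); // do not keep the process alive just for logging
  return () => clearInterval(timer);
}

console.log(sampleMemory());
```

One caveat: the timer runs inside the same process, so while the server is stalled in a long garbage collection, no samples are written. A gap between timestamps is therefore a hint at a stall in itself.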

In order to find out what is consuming memory so quickly, we first tried an analytic approach: Taking snapshots of the server process memory before and after certain operations (e.g. a player moving once), and comparing these snapshots. Unfortunately, this did not lead to any useful results, as there are a lot of unrelated things “going on” within the process even during short time intervals, making it very difficult to spot the changes relevant to our problem.

Instead, we had to switch to a somewhat more painful empirical approach: Removing “suspicious” parts of the code, bit by bit, and repeatedly running the aforementioned script, while closely watching for significant changes in the memory usage patterns. As you can imagine, this gets quite tedious after a while. While googling for less frustrating ways to solve such problems, I came across this half-joking remark by Ben Noordhuis (a long-time core Node.js contributor), which I wholeheartedly agree with:

Tracking down memory leaks in garbage-collected environments is one of the great unsolved problems of our generation.

Eventually, we did find the culprit. A slightly simplified explanation: All of the game objects (players, items, locations etc.) are wrapped in a “persistence proxy” when they are loaded, which tells the persistence layer to save the object whenever it changes. When a nested property of an object is accessed (e.g. player.metabolics.energy or player.stats.xp), such a proxy has to be created for the subordinate layers (metabolics or stats in this example). Our mistake was creating these proxies on every access, instead of just once and keeping them around. Really obvious once you know it (as is often the case with bugs)!
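Here is the mistake and the fix side by side, in a heavily simplified sketch. It assumes the standardized `Proxy` and `WeakMap` APIs rather than the experimental proxies our prototype runs on, and `dirty`, `wrap` and `load` are hypothetical names, with a plain `Set` standing in for the persistence layer.

```javascript
// Hypothetical stand-in for the persistence layer: records which objects changed.
const dirty = new Set();
let proxiesCreated = 0;

function makeHandler(root, cache) {
  return {
    get(target, prop) {
      const value = target[prop];
      if (value === null || typeof value !== "object") return value;
      // The bug: without a cache, a new proxy is created on every access.
      if (!cache) return wrap(value, root, null);
      // The fix: create the proxy once and keep it around.
      if (!cache.has(value)) cache.set(value, wrap(value, root, cache));
      return cache.get(value);
    },
    set(target, prop, value) {
      target[prop] = value;
      dirty.add(root.tsid); // tell the "persistence layer" that the object changed
      return true;
    },
  };
}

function wrap(obj, root, cache) {
  proxiesCreated += 1;
  return new Proxy(obj, makeHandler(root, cache));
}

function load(obj, { cached }) {
  return wrap(obj, obj, cached ? new WeakMap() : null);
}

const leaky = load({ tsid: "P1", metabolics: { energy: 100 } }, { cached: false });
const fixed = load({ tsid: "P2", metabolics: { energy: 100 } }, { cached: true });

console.log(leaky.metabolics === leaky.metabolics); // false: two different proxies
console.log(fixed.metabolics === fixed.metabolics); // true: the same proxy both times

fixed.metabolics.energy = 90;
console.log(dirty.has("P2")); // true: the nested change was noticed
```

With thousands of property accesses per second, the uncached variant produces a constant stream of short-lived proxy objects for the garbage collector to deal with.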

After a pretty simple fix, the script produced much more pleasant results:

Looking good! Now, off to make this work for more than five players…

Tag! You’re it!

Sirentist told you about the user’s side of “tagging” in her last post, and now I’d like to add something about the technology behind this “tagging” thing and why we chose that way. At the end of this post you’ll probably be able to tell what the (boring) technical reason to call it “tagging” was, but I like Sirentist’s idea way better!

Before anyone was able to yell “Tag! You’re it!” several things had to happen. And before we even had the idea to solve our problem this way, we tried something else (and failed) … but let me start at the beginning.

TinySpeck released most of the Glitch source code (client and server) under a CC license. [1] The important word here is “most”: they did not release everything, especially not all the data we would need to get back Ur the way it was before the world ended. There’s a lot to say about the things we do or don’t have; Aroha has already told you about it and I bet there will be more posts on the subject. However, today I wanted to tell you about some of the things we had to do to get Ur back to the state it was in.

As Sirentist said, everything in Ur had its own TSID, which is the unique key or ID of every item, every street, every player and so on. For each of those TSIDs, there was an XML-file that contained all the information related to that object. Among the assets TinySpeck released were the XML-files for (nearly) all locations (streets). Each of those XML-files contains information like the label, the TSID, references to that street’s geometry files, a list of players currently in that street, and so on. In addition, every location had a list of items contained in that street. An entry might look like this:

<objrefs id="items">
<objref tsid="IA5HOD6RUSF3922" label="Fruit Tree"/>
<objref tsid="IA5C9TB8OTS2N5S" label="Shrine to Zille"/>
<objref tsid="IA5CH4UBOTS2TL5" label="Street Spirit"/>
<objref tsid="IA23I4QA23T2L15" label="Quoin"/>
</objrefs>

As you can see, the XML-file for a location contains only the TSIDs and the labels of the items. So not only are we missing the position of the item in the street, but we’re also missing all detailed information about the items. Things like what kind of Street Spirit? What did it look like? What type of quoin? When was the last time that Fruit Tree was watered? How many harvests are left? I think you get the idea.
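To show how little there is to work with, here is a small sketch that pulls the item references out of such a snippet. It is an illustration only: the `parseItemRefs` function is hypothetical, and the regular expression assumes exactly the attribute order shown above, so it is no substitute for a real XML parser.

```javascript
const xml = `
<objrefs id="items">
<objref tsid="IA5HOD6RUSF3922" label="Fruit Tree"/>
<objref tsid="IA5C9TB8OTS2N5S" label="Shrine to Zille"/>
<objref tsid="IA5CH4UBOTS2TL5" label="Street Spirit"/>
<objref tsid="IA23I4QA23T2L15" label="Quoin"/>
</objrefs>`;

// Extracts { tsid, label } pairs from <objref .../> elements.
function parseItemRefs(xmlText) {
  const refs = [];
  const pattern = /<objref\s+tsid="([^"]+)"\s+label="([^"]*)"\s*\/>/g;
  for (const match of xmlText.matchAll(pattern)) {
    refs.push({ tsid: match[1], label: match[2] });
  }
  return refs;
}

const items = parseItemRefs(xml);
console.log(items.map((item) => item.label).join(", "));
```

A TSID and a label per item is all we get; position and item state have to come from somewhere else.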

Getting this level of detail was one part of the problem. We needed some of those values to initialize the items in the street. That part can (mostly) be done automatically. We could create the items in-game and use the result as a template for our new items. So far, so good. But how do we get the original positions of the items in the street back?

Idea #1: There has to be a way to do this automatically

Well, I’m sure there is, but it wasn’t as easy as it first seemed. The idea was to have screenshots of all streets in Ur and then use those to automatically find the position of, let’s say, a Fruit Tree in a street. Thanks to Mackenzie [2], we had a complete and perfectly organized collection of full street snaps. We used Python and the initial results looked really promising:

Find The Fruit Tree!

It looked like 90% of the job was already done. However, as is typical, the remaining 10% turned out to be a problem. We soon ran into several complications:

  • There is not one Fruit Tree; Fruit Trees exist in 60 (again in words: sixty!) different states.
    While this is not a showstopper, it would have made our progress significantly slower. We would have had to check for all 60 versions every time. Besides that, it would have led to a lot of false positives.

  • In some cases the items are really difficult to find with the algorithms we tried, like this dirt pile here:

    Brown Dirt Pile on brown background

  • The XML-files that we have for the locations are from an arbitrary point in time, pretty close to the end of the world. Let’s call it foo. The screenshots we have of those streets, meanwhile, are from a different arbitrary point in time. Let’s call it bar. Now between foo and bar (or bar and foo, for that matter) the world changed. People poisoned trees, for example, and what once was a Fruit Tree (on the snap) became a Bean Tree (in the XML-files). Or the other way around. Or nothing was replanted and there was only a patch left. Or, or, or. In other words, many streets had XML-files and screenshots that didn’t perfectly match.

Altogether, these reasons made us give up on this approach and try something else…

Idea #2: Tag! You’re it!

So instead, we needed a “semi-automatic” way to solve this problem. In the end, a human being would have to decide what item to place where, but we could at least try to make that process as simple and fast as possible. After all, we’re talking about more than 21000 items on about 1300 streets.

Using the above-mentioned snapshots of the streets, we set up the first version of the “tagging server.” A small python web server and some magic jQuery libraries (similar to those used on some photo tagging sites) are the core components. We know what items are expected on a given street from the XML, and those need to get converted to our target format (JSON). With that info, we generate the colored list of items you saw on Sirentist’s screenshots. That list is updated on the fly so the tagger always sees the current state, without having to reload the page every single time. For every tagged item, we get a JSON with x- and y-coordinates, width and height of the drawn box and the TSIDs of the street and the original item. Then we use the geometry file of the location to map these pieces of information to a position in the street. Finally we use the label to create a new instance of the item class of that TSID (like our Fruit Tree). Repeat that about 21000 times and you get all items back into Ur.
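The mapping step can be sketched roughly as follows. Everything specific here is an assumption made for illustration: that the snap shows the whole street at a uniform scale, that the geometry provides left/right/top/bottom bounds (called `l`, `r`, `t`, `b` below), that an item's position is the bottom center of its box, and the field names of the tag JSON itself.

```javascript
// Maps a tagged box (in snap pixels) to a position in street coordinates.
function tagToStreetPosition(tag, snap, geo) {
  const scaleX = (geo.r - geo.l) / snap.width;
  const scaleY = (geo.b - geo.t) / snap.height;
  const anchorX = tag.x + tag.width / 2; // horizontal center of the box
  const anchorY = tag.y + tag.height;    // bottom edge of the box
  return {
    x: Math.round(geo.l + anchorX * scaleX),
    y: Math.round(geo.t + anchorY * scaleY),
  };
}

// Hypothetical example values.
const tag = { street: "L-hypothetical", item: "IA5HOD6RUSF3922", x: 400, y: 100, width: 200, height: 300 };
const snap = { width: 1500, height: 500 };
const geo = { l: -3000, r: 3000, t: -1000, b: 0 };

const position = tagToStreetPosition(tag, snap, geo);
console.log(position); // { x: -1000, y: -200 }
```

Using the bottom of the box as the anchor seemed natural for things that stand on the ground, like trees; whether that is right for every item class is a separate question.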

Remember? This is where we went before we had the Spice Route

When we started to work on this, I thought it would take a long time to tag all those items, but I was definitely wrong. With the help of some very enthusiastic members of Project Eleven, it was done in a matter of weeks, not months as I had expected. Of course, this is only one of many steps to get Ur back the way it was. But it does feel good to walk through the streets again and see all the trees and quoins and dirt piles and peat bogs and some of the old inhabitants!

 

 

1 I would like to thank everyone at TinySpeck again. Without the sources, without the instance of Slack we have or without all the help current and former TinySpeckers have given us, we would be nowhere near where we are now.

2 Special thanks to Mackenzie (Jade)! Her knowledge and experience have been invaluable for us!

Dev log 0: The Game Server

Hello lovely in-limbo Glitchen and other curious supporters! Quick intro: I’m this guy, one of the more technically inclined members of Team Eleven. Which is why I was dragged into the spotlight, er, prodded gently to give you some technical background on our approach to recreate the core missing piece of the Glitch architecture: the game server.

As most of you know, a part of the server-side code has been released by Tiny Speck in the glitch-GameServerJS repository on GitHub. This repository (referred to as “GSJS” below) consists of roughly one million lines of JavaScript code that, put simply, contain all of the Glitch game logic and textual content. We decided early on that it would be a good idea to reuse that code with as few changes as possible, first and foremost in order not to introduce new bugs in tried and tested code, but also because the sheer volume of it would make any structural change (let alone rewriting it in another language) an enormous task.

What TS did not release is the actual server component that the game clients connect and send messages to, that processes these messages by calling the respective GSJS functions, sends resulting responses back to the clients, and manages the persistent state of every object in the game world. Originally, this component was a Java application that ran the GSJS code inside the Java virtual machine using Rhino. While we did initially consider rewriting the server based on the same technologies, we eventually agreed to try our hand at implementing it using Node.js, at least for a first iteration. The reasons behind that decision were roughly the following:

  • much better performance of V8 (the JS engine that drives Node.js) compared to Rhino (and Rhino’s successor has not been released yet)
  • Node.js is clearly on the rise and currently has a very active community, while Rhino is in end-of-life mode
  • one less language to worry about, having both core game server and GSJS written in JS
  • greater expected likelihood to find people willing to write JS in their spare time than Java
  • and last but not least, cal and serguei (of Tiny Speck) suggested Node.js as more or less the obvious way to go from our perspective (“I think node is the most pragmatic choice”)
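Coming back to the missing server component itself: reduced to its bare bones, its job of receiving a message, calling the matching game logic and sending back a response could be sketched like this. The handler bodies are hypothetical stand-ins for GSJS functions, the message fields are made up for illustration, and networking and persistence are left out entirely.

```javascript
// Hypothetical stand-ins for GSJS functions, keyed by message type.
const handlers = new Map([
  ["move_xy", (player, msg) => {
    player.x = msg.x;
    player.y = msg.y;
    return { type: "move_xy", ok: true };
  }],
  ["groups_chat", (player, msg) => {
    return { type: "groups_chat", from: player.tsid, txt: msg.txt };
  }],
]);

// Processes one client message and passes the response to `send`.
function handleMessage(player, msg, send) {
  const handler = handlers.get(msg.type);
  if (!handler) {
    send({ type: msg.type, ok: false, error: "unknown message type" });
    return;
  }
  send(handler(player, msg));
}

// Example: collect responses in an array instead of writing to a socket.
const responses = [];
const somePlayer = { tsid: "P-hypothetical", x: 0, y: 0 };
handleMessage(somePlayer, { type: "move_xy", x: 120, y: -80 }, (r) => responses.push(r));
handleMessage(somePlayer, { type: "dance" }, (r) => responses.push(r));
console.log(responses);
```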

So off we went cobbling together a prototype server around the beginning of December, and a few weeks later we had something multiple people could connect to at the same time, and do glitchy things in (you’ve seen the screenshots). We were able to integrate the existing GSJS code after running it through a fairly simple preprocessor script, with only a couple of minor manual adjustments. This prototype server now serves three main purposes:

  • a means for us to learn how all aspects of the game actually work internally, in a very hands-on way, and to tinker with stuff we do not understand yet
  • a way to determine serious issues that might cause us to revise our technology choices, and to test various options for components where we have not reached a decision yet (e.g. the persistence layer)
  • a platform for other tasks that require parts of the game in a “live” state, like the tagging process

Regarding the second point, so far, we have not yet encountered any obvious, insurmountable roadblocks (except maybe concerns regarding performance, but that is a topic for another blog post). We do however struggle with the fact that Node’s architecture differs from the original server in one significant way: it is strictly single-threaded and relies on the application code to “play fair” by not performing long-running, uninterruptible operations. The existing GSJS code is obviously not designed with that restriction in mind. In practical terms this currently means that, for example, anytime a new player logs in, the game is effectively paused for everyone else for a second or so.
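To illustrate what “playing fair” means in Node, here is a generic sketch of splitting a long job into chunks and yielding to the event loop in between. It is a common pattern, not something the GSJS code does (that is exactly the problem), and `processInChunks` and its parameters are hypothetical.

```javascript
// Processes `items` in chunks of `chunkSize`, giving other queued work
// (e.g. messages from other players) a turn between chunks.
// `schedule` defaults to setImmediate; it is a parameter so it can be swapped out.
function processInChunks(items, handleItem, chunkSize, done, schedule = setImmediate) {
  let index = 0;
  function runChunk() {
    const end = Math.min(index + chunkSize, items.length);
    while (index < end) {
      handleItem(items[index]);
      index += 1;
    }
    if (index < items.length) {
      schedule(runChunk); // yield, then continue with the next chunk
    } else {
      done();
    }
  }
  runChunk();
}

// Example: ten work items in chunks of three.
const work = Array.from({ length: 10 }, (_, i) => i);
processInChunks(work, (n) => n * n, 3, () => console.log("all done"));
```

This only helps when the work can actually be cut into independent pieces. A login that runs through a long chain of existing game logic in one go does not split up that easily.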

It is important to note that this is not a server we can use for any kind of public testing/demo, unfortunately — it was simply not made for that purpose at all. But once we are reasonably confident we understand how everything works (“soon”), it will serve as a sort of blueprint and hopefully allow us to work on the “real thing” in a structured, efficient way. While this approach may seem like doing the same work twice, the reasoning here is that we would not have gotten it “right” the first time anyway, having started off with pretty much no prior knowledge about the inner workings of Glitch.

If there is interest in future technical blog posts about how we are trying to solve issues like the ones described above (i.e. if you are longing for more long-winded articles with techy words, abbreviations and no screenshots), let us know!

Eleven, Flash, HTML5, and You

It has been asked numerous times, both from inside and outside the project (it got asked so much in Slack that I finally had to append to one of our channel topics a statement informing readers that we are not, in fact, rebuilding the client), why Eleven is using Flash instead of HTML5 to build the game client. We appreciate the concern, and as a matter of fact, have discussed the topic fairly extensively. I myself have tried to answer the question many times, in many places, but have never really taken the time to provide a comprehensive rationale for our continued use of Flash.

I suppose the first misconception I’ve seen floating around is that using Flash allows a quicker path to launch, but absolutely no other benefits. From my perspective, the opposite is true. By sticking with Flash for the game client, we actually gain numerous advantages. I’ll address the most obvious (speed of development) first, followed by the two most common concerns people see with Flash (performance and mobile), then discuss the technical hurdles we would face implementing a game client in HTML5, and finally discuss what matters most to me personally: providing an experience on par with that offered by Glitch.

The most obvious advantage gained from continuing to use the Flash client is speed of development. The very first milestone our team accomplished (way back in November, but time has flown and it feels like we’ve progressed so far since what seems like a rather short time ago) was the successful compilation of the Glitch client as provided to us by Tiny Speck. For those who are curious, the process is something like this. My method differed slightly, but the essence is the same. It’s not necessarily a complicated thing to build, but the fact that building the complete client was still a process that took hours to figure out meant that we were dealing with quite the complex beast. It’s about half a million lines of code, and rewriting it would take a significant investment of time. We’re by no means in a rush, but people would like us to open in a (reasonably) timely manner, and we’re trying not to waste your time (as I’ll continue to explain in a bit); HTML5 offers no practical benefit, but numerous downsides, anyway. 🙂

Next up are performance concerns: What are the performance implications of Flash versus HTML5? I’m not much of an optimization guru myself, but a cursory look at the Glitch client, which still feels like it should run better on the rMBP I bought a couple of months back, would lead one to believe there’s much room for improvement. And maybe there is. However, that improvement would most likely be found by tweaking the Flash client, as opposed to switching to HTML5, which is hardly the performance savior it’s been depicted as. A quote from Cal (Bees!) of Tiny Speck:

“flash is a bit of a performance hog” – try running the same graphics in html5 and you’ll find that flash has very very good performance. unless you’re going to rewrite the client in unity/native-code then flash is going to give you *much* better performance than html5

While I’m in full-on TS-quoting mode, I should also mention the pathway Jono has mentioned for optimizing the Flash client:

 

“the memory leaks are in the game client, it needs a lot more object pooling”

“the game needs 3/4 GB minimum”

“We cache assets in memory indefinitely too”

 

Suffice it to say, the Flash client could use some work, but in general, the client works, and in the grand scheme of things, works well, so it’s not on the list of things that’s holding us back from launch, at least in my perspective.

 

Next up is the relationship between Eleven and mobile devices. In short, such a relationship would mean certain disaster. MMOs generally are not something you want to implement on a mobile device. Once again in the words of Cal (and taking another opportunity to shamelessly plug Slack – its search functionality has made finding all these quotes a piece of cake):

 

you can compile some flash to ios/android, but:
1) adobe have given up on that
2) it was built for very simple stuff anyway
3) glitch requires high bandwidth / low latency that you generally don’t get on phones
4) it’s very cpu/gpu intensive with lots of very large textures
5) mmos aren’t really possible under ios app store guidelines – the content & behavior needs to be built into the app
6) typing & playing on a tablet is pretty terrible
glitch is not a game that was designed for or could work (as-is) on mobile

In short, an MMO, especially a social MMO like Glitch, is not something you’d want to play on a mobile device. The experience would be awful, and it would also discourage the social behavior we’d like to encourage. Speaking from personal experience, I frequently remoted into my web server from my phone to chat with people in Glitch, and it was almost impossible to hold a fluent conversation. We seek to offer a pleasant experience, and supporting mobile devices would degrade that heavily.

 

The penultimate reason for our decision to stick with Flash (and the part of this post that [finally] includes the cool screenshot you’ve probably been waiting for, and may well have skipped through the rest of the article to find) is that the Glitch client is a technical marvel that would be next to impossible to replicate. But, you say, shortly after The End of the World, there were various HTML5 remakes of small parts of the game! True. But some of the more difficult parts to re-implement are the ones you don’t see. Namely, LocoDeco, the Eleven/Glitch level editor.

[Screenshot: LocoDeco, the Eleven/Glitch level editor]

As you can see, it’s a fairly complex tool, and possibly one of the more advanced things ever done with Flash (I’m not the only member of our team to have made this observation). Its use is reminiscent of Photoshop in a way (and having used Photoshop most certainly helps one figure it out). Even if HTML5 were more mature as a technology, I highly doubt something like this would be possible to implement in it (as it is, it’s a testament to the coding prowess of Tiny Speck that they managed to create such an advanced design tool in a platform originally designed to play simple animations on web pages).

 

Finally, there’s the reason for sticking with the Flash client that matters the most to me personally. It’s why I’m so passionate about this particular issue. In order to deliver an experience that “feels” like Glitch, we need to use the Flash client. Even if it came at some cost in other areas, that’s priceless. We need things like the physics engine baked into the client to work exactly as they were, or things just won’t feel right. If we thought we could’ve done this without the myriad resources Tiny Speck released into the public domain, we could’ve started much earlier (and yet, we probably wouldn’t be nearly as far along as we are now – building an MMO from scratch is nigh-impossible). And using the client TS so graciously provided to us, that can happen. I think Kukubee put it best a couple of weeks ago when we demonstrated our progress to a few of the folks at Tiny Speck:

 

Kukubee: “damn motherfather, this is glitch!”

 

And that’s exactly what we’re going for. Nothing less. We’ve discussed it time and again, but if we wish to provide a quality experience, it’s the only way to go forward. I suppose that ultimately, I don’t get the affinity toward HTML5. Presumably what everyone really wants is the Glitch experience they know and love, and we’re trying to deliver that in the best way possible.