FOAM TOTEM



Caching Run Maps

Update: I've tweaked the heatmap cutoffs and the re-averaging algorithm a bit; the text below reflects the changes.

I'm now caching images of the maps for all my runs. The live maps were very expensive: once I had more than two on a page, it became irritating to even scroll, and the browser wasn't responsive again until they had all loaded. So now each run gets a cached image instead. This was not the easiest thing to get working.

I'm lucky I can get them at all, I suppose. Google provides a static map image API which is the only reason this is even possible. Everything on the map is encoded into a giant URL which you send to Google, who renders the map on the server and returns it as an image. Pretty awesome, actually. Foam Totem does this periodically and caches the resulting image.

Google Maps can take a KML file and display it. All I have to do is provide them with a URL for the file. Runmeter already provides that for me, which is cool. If you click on any of the shortlinks (or the map images themselves, now), it will take you to Google Maps and you can see the name of the KML file. Frustratingly, though, the static maps API doesn't read KML files. All was not lost, however, as it lets me specify a path in the URL.

Well, that'd be great, except that the URL I send to Google has a length limit of around 1500 characters, and the path information is much, much longer than that. So I needed a way to reduce the number of data points for the cached map. Before writing my own, I found a Perl module (Math::Polygon) with a straightforward simplifier that seems to produce adequate results.
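Roughly, the shape of it is this (a sketch only, in javascript rather than the Perl I actually use; the every-Nth-point reduction below stands in for Math::Polygon's simplifier, the points array is assumed to come from the KML file, and the exact query parameters are whatever the Static Maps API wants):

// Thin a GPS track down to roughly maxPoints points so the path fits
// in a Static Maps URL. Real simplifiers are smarter than "every Nth point".
function reducePoints(points, maxPoints) {
  const step = Math.max(1, Math.ceil(points.length / maxPoints));
  const kept = points.filter((_, i) => i % step === 0);
  const last = points[points.length - 1];
  if (kept[kept.length - 1] !== last) kept.push(last); // always keep the end point
  return kept;
}

// Build the giant URL. Five decimal places is roughly one-meter precision
// and keeps the URL shorter.
function staticMapUrl(points, apiKey) {
  const path = points
    .map(p => p.lat.toFixed(5) + ',' + p.lng.toFixed(5))
    .join('|');
  return 'https://maps.googleapis.com/maps/api/staticmap?size=640x400' +
    '&path=color:0xff0000ff|weight:3|' + path +
    '&key=' + apiKey;
}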

At the same time, I really wanted to show my pace on the map so I can see where I'm running fast or slow. That data was hidden away in the KML file that Runmeter provides, and since I'm now fetching, spindling, and mutilating that file anyway, I figured I could add this in too. Of course, nothing is easy: because the path is being reduced, I also have to reduce the pace information intelligently. I do a simple weighted average (by time) of the pace, which seems more or less right.
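The re-averaging is nothing fancier than this (a sketch; it assumes each original segment that collapses into a reduced segment carries a pace and a duration in seconds):

// Weighted (by time) average of the paces that fold into one reduced segment.
function averagePace(segments) {
  // segments: [{ pace: minutesPerMile, seconds: duration }, ...]
  let weighted = 0;
  let totalSeconds = 0;
  for (const s of segments) {
    weighted += s.pace * s.seconds;
    totalSeconds += s.seconds;
  }
  return totalSeconds > 0 ? weighted / totalSeconds : 0;
}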

So now I have fancy heatmaps for my runs. From slowest to fastest: Red, Purple, Blue, Cyan. The cutoffs are hardcoded right now, based on my whim. Red is anything at or slower than what I'd consider an easy jog (9:30 pace and up). Purple is 8:39 and slower, Blue runs from there down to 8:00, and Cyan is anything faster than that. It's all a bit squishy, of course, because I'm averaging what are already averages and I have a limited number of points to which I can assign colors.
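In code, the bucketing amounts to this (a sketch; paces are decimal minutes per mile, cutoffs per the whim above):

// Map an averaged pace to a heatmap color. Slower paces are larger numbers,
// so test from slowest to fastest.
function paceColor(pace) {
  if (pace >= 9.5) return 'red';     // easy jog or slower (9:30+)
  if (pace >= 8.65) return 'purple'; // 8:39 and slower
  if (pace >= 8.0) return 'blue';    // down to 8:00
  return 'cyan';                     // faster than 8:00
}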

Future possibilities: I'm considering making it so that when you click on the map, it gets replaced with the live map. (Did it!) Now that I'm fetching the KML files from Runmeter, I could copy and fiddle with them so that they carry the heatmap data too, and serve them myself. Lastly, I'd like to add the mile markers back in.

NoSQL Databases: Redis

(Other NoSQL posts: Intro, MongoDB)

Another caveat about these posts: they aren't meant to be a complete reference to the software. I'm hitting on the stuff I find interesting, as well as any obvious shortcomings. There are plenty of easy-to-find articles out there that cover the specifics of these databases much better than I do here.

Redis is the main reason why I'm writing these posts at all. It's almost certainly inappropriate for player data storage in a game, but there's something about it which caused a significant shift in how I thought about these things. I'll get to that in a bit.

Redis is basically a key-value store. It doesn't have multiple tables, or any schemas, or queries. So rather than being a database as most people think of them, it's more of a persistent hash table. It's blazing fast, mainly because it's an in-memory database. All the keys in the database need to fit in memory at the same time. (It will swap out the values to disk. Of course that will decrease performance.) It supports clustering (where each master node has a subset of all the keys) to address data sets larger than RAM. It also has master-slave replication for redundancy.

MAGIC: I find it vaguely frightening.

As an in-memory database focused on speed, however, it sacrifices durability. In typical use, it simply creates a snapshot and flushes the whole database to disk. If something catastrophic happens, you lose everything since the last save. There are recent provisions for using a write-ahead log which improves durability, but has performance implications (even writing to an append-only file is a lot slower than writing to RAM). Something to keep in mind.

The values in Redis can be strings, lists of strings, sets of strings (sorted and unsorted), and hashes. There are atomic actions for adding/removing from lists, sets, and hashes. It also has some primitives which treat values as integers for increment/decrement purposes. Multiple primitives can be bundled together into a single transaction.
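For a flavor of what those primitives look like from a client (a sketch using a node-redis v4-style API; the key names are made up):

import { createClient } from 'redis';

const client = createClient();
await client.connect();

await client.rPush('queue:emails', 'hi@example.com'); // append to a list
await client.sAdd('online:users', '666');             // add to a set
await client.hSet('user:666', 'name', 'Shannon');     // set one field of a hash
await client.incrBy('user:666:score', 10);            // treat a value as an integer

// Bundle several primitives into a single MULTI/EXEC transaction.
await client.multi()
  .sRem('online:users', '666')
  .hSet('user:666', 'lastSeen', Date.now().toString())
  .exec();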

Except for those primitives, though, values are entirely opaque to Redis. There is no such thing as querying on a value. Sounds like a pretty useless database, doesn't it?

That's because Redis isn't really a database in the RDBMS sense; it's a data store. It provides building blocks for you to build your own indices and queries. The keys most people use to store values in Redis are highly structured, for example "post:1001:tags". (The colons aren't special at all; they're just a convention.) Assuming you've set up your keys properly, you glue together a string and fetch what you need.

Here's an example: "user:666" might be a hash with a name, a birthday, and a photo (stored as a string; Redis is 8-bit safe). "user:666:posts" would be a set of the ids of all the posts made by user 666. "post:1001" is a hash with the content and title of the post. "post:1001:tags" is a set of the tags applied to the post (again, by id). "tag:10" has the name of the tag, and "tag:10:refs" is a set of all the posts with tag 10.

None of this is done for you. Instead, when a post is created, the content management software (somewhere) adds it to the "user:666:posts" set and the "tag:10:refs" set. The post itself is stored in "post:1001" and the tag in "post:1001:tags". If the tag is removed, it has to be removed from that set, and the post removed from the refs set on the tag. All of this is done by hand. Long story short, you maintain your own indices and other lookups yourself.
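Continuing the example, creating a post might look something like this (again a node-redis-style sketch; the keys and ids are from the example above, everything else is invented):

// Store a new post and update the hand-rolled indices.
// Bundling it all in MULTI/EXEC keeps the bookkeeping atomic.
async function createPost(client, userId, postId, tagId, title, content) {
  await client.multi()
    .hSet(`post:${postId}`, { title, content })    // the post itself
    .sAdd(`user:${userId}:posts`, String(postId))  // the user's index of posts
    .sAdd(`post:${postId}:tags`, String(tagId))    // tags applied to the post
    .sAdd(`tag:${tagId}:refs`, String(postId))     // posts carrying the tag
    .exec();
}

// e.g. createPost(client, 666, 1001, 10, 'NoSQL Databases', 'Redis is...')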

In a way, it's the opposite of MongoDB. MongoDB has fancy hierarchical documents and the ability to do ad-hoc queries across them. While MongoDB doesn't enforce a schema on its documents, it still understands how to go in, manipulate, and query them. Redis is practically schema-less as far as the database itself is concerned (hash values notwithstanding). I'm not sure which gets the Mirror Universe Evil Goatee, though.

So, why do I find this interesting? Well, I fear magic. My experience is that magic ends up being something that eventually needs to be worked around, and one often spends as much effort avoiding the magic as it took to craft it in the first place. This always comes up much later in the project, when you really don't need speed bumps. MongoDB looks awesome, but has an attractive, sparkly halo of magic that nonetheless worries me. Redis is at the far other end of the spectrum. There's not the faintest whiff of magic; it's entirely mundane. I find that attractive.

It's probably too mundane to be reasonably used for player data. Ad-hoc queries are basically out entirely. If you didn't need any queries, it would perhaps be more tenable. Even so, you'd have to flatten the player data to some extent (because hash values can't contain more hashes). My initial reaction was to write a layer on top of Redis to handle all of that...

...magically.

Whoops.

NoSQL Databases: MongoDB

One note before I hop into things: these write-ups are based on internet research and sometimes some poking around with the database itself. I haven't made any significant use of any of these databases. The point of the research is to decide which I'll use.

So, let me start with MongoDB. MongoDB is classified as a "document-oriented database". I had never really heard of such a thing before, and I'm not quite sure what the real difference is between that and an object database. Perhaps the difference is that MongoDB enforces no schema. Any document schema you use is your own convention; MongoDB doesn't really care. There are no table definitions to speak of. You simply make a "document", which is a set of attribute/value pairs, and store it in the database.

db.noobs.save(
   {
      "name" : {
         "first": "Shannon",
         "last": "Posniewski"
      },
      "awesomeness" : 100,
      "loserness" : 20,
      "likes" : ["bananas", "cookies"]
   }
)

The values can be objects ("name" above) and arrays ("likes" above) as well as scalars. And, of course, arrays of objects, objects that contain arrays, and so on. It probably didn't escape your notice that the document is represented as JSON. I think this is brilliant, especially when one pairs MongoDB with a dynamic language which can instantiate objects from JSON. So a natural partner to MongoDB is node.js, a server-side javascript engine (which I'm also enamored of).
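For instance, with the MongoDB driver for node.js (a sketch using the modern driver API; the connection string and database name are placeholders), a stored document comes back as a plain javascript object with no marshaling layer in sight:

import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const noobs = client.db('test').collection('noobs');

// The result is just an object; nested documents and arrays come along for free.
const doc = await noobs.findOne({ 'name.last': 'Posniewski' });
console.log(doc.name.first); // "Shannon"
console.log(doc.likes[1]);   // "cookies"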

One does queries in MongoDB mainly by example. You simply provide the parts of the document you wish to match, as a document. The query below finds all the documents where the last name is Posniewski, for example. Note also that traversing into nested objects simply uses dot notation.

db.noobs.find({ "name.last" : "Posniewski" })
But MongoDB goes a lot further than that, allowing some basic tests in the values of a query. This one finds all the docs with an awesomeness greater than 50 and less than 200.
db.noobs.find({ "awesomeness" : { $gt: 50, $lt: 200 } })
In a final bit of coolness, you can provide a full-on javascript expression for these query values. These expressions have access to the whole document. Of course, these are much slower than the simpler queries (especially if one makes appropriate indices for them).
db.noobs.find({ function() { return this.awesomeness > this.loserness; } })
There are a bunch of comparators like $gt and $lt, and one can couple the simple queries and the fancy advanced javascript. There are also special provisions for arrays and so on. In short, MongoDB lets you do arbitrary ad-hoc queries on the documents, which is pretty amazing.

Updates in MongoDB are done in basically the same way. You provide a query to select one or more documents, and then a mutator document. Fields present in the mutator are modified in the selected documents. As with queries, there are also more complicated modifiers that can be used beyond just setting a value. In the example below, awesomeness is incremented by 10. (The _id field is a built-in unique object id MongoDB provides.)

db.noobs.update(
   { _id: XXXX },
   { $set: { "name.first": "Mister" }, $inc: { awesomeness: 10 } },
   true   // upsert: create the document if it doesn't already exist
);
The modifiers include pushing, popping, and yanking items from arrays. MongoDB modifies documents in place rather than re-writing them. For data which changes often, this can have some performance bonuses.
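For example (mongo shell again, reusing the XXXX placeholder id from above), pushing onto and yanking from the likes array of the earlier document:

// Add "naps" to the likes array of the selected document.
db.noobs.update({ _id: XXXX }, { $push: { likes: "naps" } });

// Yank every "bananas" entry out of likes.
db.noobs.update({ _id: XXXX }, { $pull: { likes: "bananas" } });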

MongoDB is designed to allow horizontal scaling. This is useful for load balancing and redundancy. You can add cohorts to the cluster on the fly, which is pretty neat if you need nine 9s of uptime, I suppose. Anecdotally, though, MongoDB is not very reliable. There are several reports of lost data from crashes, even when running with replica sets. You should never run MongoDB with fewer than two physical servers, because their main strategy for stability and durability is replication. (They aren't necessarily wrong about this, but it doesn't fill one with confidence.) It wasn't until recently that they added write-ahead journaling.

Mongo just pawn in game of life.

MongoDB relaxes some bits of ACID compliance, as many of these databases do for performance or usability reasons. In this case, atomicity is only provided within a single document. All changes within a single document update are atomic, but there is no such thing as a cross-document transaction. They explain that the inherently distributed nature of MongoDB makes doing this largely impossible (without gigantic locking problems), so they aren't even pursuing it. The website ostensibly explains how to do this with an application-side two-phase commit. However, it's not clear to me that it is actually solvable with app-side logic alone.

In game terms, this means that item trades will need some kind of special handling to avoid item duping, which is a bummer. It's not a show-stopper, in my opinion. It's basically the only thing which needs cross-player transactions, and I suspect one could find a clever way to at least minimize the chance of failure/duping.

On the whole, though, MongoDB seems like a really good fit for player data in an online game.

NoSQL Databases: Part 1

I've been doing some research on databases over the past few weeks. The new hotness is the so-called "NoSQL" database, a label which basically encompasses every database that isn't structured as the classic relational fixed-schema tables/columns/rows. These databases don't usually map particularly well to SQL-style queries, hence the name of the genre. However, there is some work being done to wrap some of these databases in SQL shells, and some are recasting "NoSQL" to mean "Not Only SQL".

I'm looking at these from the multi-user game perspective that I've been soaking in for a while. A major issue we had with using SQL-ish flat schemas is that the player data isn't flat. It's hierarchical. A lot of effort went into doing the object-relational mapping efficiently. We often had to go back and redo it. (In particular, we often had to denormalize practically everything, which flies in the face of relational database design.) I came out of this experience thinking that a basic object store (which supports hierarchies) is the best approach.

All of the object databases I've been exposed to previously, however, had an enormous conceptual and actual weight to them. They require object definitions in some abstract language, plus some kind of magic to actually marshal those objects into and out of the game. Further, doing anything but the simplest query-by-example was often obtuse or not really possible.

This has nothing to do with the post. I just like Godzilla.

So, Cryptic decided to write its own object database back end. We knew what needed to be fast and we knew our general usage model, so we knew which restrictions we could play fast and loose with. Of course, we still needed to do the marshaling and all that, but we took the approach that the actual struct in the C code would drive the database schema. One needed to annotate the struct a bit, but on the whole it was a pretty slick approach.

I did a talk about this decision called "SQL Considered Harmful". (That link is for completeness, please don't watch that old talk.) Database Admins were not amused by my thesis that relational databases weren't a great choice for game data such as ours. They said that all the problems I pointed out could be solved by carefully-planned schemas, server-side procedures, and subtle tweaking of the parameters of MS-SQL. They were probably correct, but that wasn't the point. The point is that we didn't want to do any of those things.

The CrypticDB is now operating in a production environment for Champs and Star Trek. It worked out well. It keeps a live mirror, can recover from catastrophic failures, supports cross-server transactions, and so on. It's a real ACID object database, which is pretty awesome. But of course it isn't perfect. I feel that the Magic Quotient ended up being too high. A high MQ meant that everything worked, but sometimes with awful performance characteristics, and so we had to revisit code and the schema to optimize it. If I had to do CrypticDB over again, I'd remove some of its magic. (The magic seemed really cool until we kept getting hit with the performance issues.)

There are now several new/sexy databases out there which could conceivably fit into the same usage model as CrypticDB. In particular, I've looked at MongoDB, Hadoop, Redis, CouchDB, db4o, Riak, and a few others.

And that went on longer than expected, so I'll stop here and write up more later.

X-Men: First Class

X-Men: First Class (IMDB 8.3|Rot 85%|Netflix 4.4)

Saw the new X-Men movie, which is a prequel to the other movies that have been coming out over the last few years. It's not a reboot; it fits in with those films. I don't know the comics at all, so I have no idea how true it is to the comic canon.

What is Mrs. Peel doing in the reactor room?

It shows and explains the sudden appearance and increase of Mutants, and how Prof X, Magneto, and many other mainstays of the series first met. It also introduces the basic schism between Prof X's and Magneto's worldviews. Even though this is a lot of ground to cover, it was a pretty good action movie, though perhaps a bit heavy on the CGI. (Beast in particular was rather awful.) I loved the 60s spy vibe they used. Perhaps it's true to the comics, but the bad guy has a submarine containing a sleek and spacious modernist office, which in turn has a secret door. The only thing missing was the cat for him to pet.

And Emma Frost (January Jones-- how Bond-ian is that?!) comes oh-so-very close to Emma Peel. Or maybe Barbarella.

Good enough to see in a theater, but there's nothing in it that would make me recommend the theater over home viewing, really.

To the Cloud!

Today was the first day of the 2011 Apple WWDC, the keynote of which traditionally announces a bunch of new features (sometimes hardware). Perhaps it's because I own both an iPhone and an iPad, but I found what they're doing very interesting. (Don't worry, I won't go into too many specifics regarding the individual features.)

The drift from data living on the desktop to data living in the "cloud" has been happening for a few years now. Nearly everything Google does is cloud-based. For example, GMail and Google Docs (a replacement for MS Office and the like) store all their data on Google's servers out there in the ether. All of Google's offerings come through the web browser. When they decided that browsers were slow and not improving rapidly enough, they wrote their own. Their ultimate expression of this is the "Chromebook", a laptop whose entire user interface is a browser; there is nothing else. All apps are web apps.

Apple's new data center, located over Bespin.

Apple, first and foremost, is a device manufacturer, and so they're coming at it from an entirely different direction. What they announced today is basically everything I just described for Google, except that they are focused on the devices themselves. The cloud is merely a giant disk drive in the sky. The devices automatically and seamlessly sync to that disk drive, so your data appears everywhere. As the hardware and software manufacturer, they have the opportunity to make this perfectly seamless and smooth. Apple typically does this very well (though their previous stumbles with MobileMe mean success is not guaranteed).

Microsoft also has cloud offerings, but has managed to make them complex enough, and the pricing so opaque, that I don't even know what they are. They have clever "To the cloud!" commercials, but as an owner of several Windows computers, I don't have any idea how to actually do such a thing. I think Microsoft really needs to get its act together or they will slip further behind. It may simply be a matter of improving the bundling and presentation of what they already have; for all I know, they already have all the pieces.

Google's starting point (like Microsoft's with Wintel) is inclusive, heterogeneous, and thus messy, ill-mannered, and incomplete. Apple exerts complete control over their universe and so everything is usually well-integrated, works well, and makes sense. But this route is also often limited and frustrating. These two approaches challenge each other, which eventually means better stuff overall. Hooray for competition!

Anyway, if Apple delivers on their vision in the fall, I think it will be transformative to the industry. Though all the parts have been done before (and perhaps done better) having everything working seamlessly together will be a first. Users will come to expect it, and I think that's awesome.