Game development is about writing fun features and fun stories, right? Wrong. Sometimes it's just about fighting stupid server problems.

If you make a desktop game in C++, you only have to care about your own code. Bugs in the player's operating system are none of your business. If you develop a Flash game, you upload it to Kongregate and you are done. But if you decide to write an MMO, you will probably have to struggle with servers.

Zandagort is based on PHP+MySQL+AJAX. AJAX is client-side, and PHP+MySQL is pretty much the standard web server stack. Okay, there's a slight difference: Zandagort runs on Lighttpd instead of Apache, but that's only to keep the memory footprint of each connection low, because the in-game chat needs a lot of concurrent connections (cf. the C10k problem).

During the last half year there have been quite a few mysterious restarts. More than seems acceptable, but less than seems testable. Unfortunately there's a gray zone of error frequency where the number of cases is too low, but the damage caused is too high. A transient error in either the memory or the power supply unit or the constellation of system software components that leaves no trace is pretty hard to identify.

Increasing the availability of hardware gets exponentially more expensive after a certain level. So it seems more efficient to reduce the damage in case of server issues than to lower the probability of such cases.

The weakest link in Zandagort is its MEMORY tables. MEMORY is a special MySQL storage engine that keeps all data in memory. Its biggest advantage is that there's no I/O, which is one of the slowest things in computers. Since Zandagort is persistent, i.e. the in-game world is never lost even if you log out (or all players log out), a huge "state vector" that describes everything (the factories and resources on each planet, the position and composition of fleets, etc.) has to be stored and recalculated continuously. MEMORY tables are a very good way to store this state vector.
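Just for illustration, declaring a MEMORY table is the same as declaring any other table, only the storage engine differs. The credentials and the table below are placeholders, not the actual Zandagort schema:

# hypothetical example: one small piece of the in-game state vector, kept entirely in RAM
mysql -u admin_user --password="secret" game_db <<'SQL'
CREATE TABLE planet_resource (
	planet_id INT UNSIGNED NOT NULL,
	resource_id SMALLINT UNSIGNED NOT NULL,
	amount BIGINT UNSIGNED NOT NULL,
	PRIMARY KEY (planet_id, resource_id)
) ENGINE=MEMORY;
SQL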

Of course there's one drawback: if the server goes down even for a few seconds, and even if it's only the MySQL server, all data is lost. So in case of a transient error, everything goes back to normal, except for the game.

The solution is to dump the database on a regular basis. But that's not enough, because if there's a backup once a day (let's say at 4am), and shit happens at 4pm, even if the backup is restored, 12 hours of active game time is lost. All the factories you built during that time, all the battles you won.

How about binary logging? It's a great MySQL feature that writes every change in the database to disk. Sequentially, so I/O speed is not a big problem. But restore speed is. There are a lot of changes in Zandagort every minute, and a lot of calculations. Binary logging stores the queries that make the changes, not the changes themselves, which is very efficient in terms of I/O but can consume a lot of CPU while the log is replayed. An "average" 12 hours of in-game time takes about 70 minutes to replay.
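For reference, replaying a binary log after restoring the last dump looks roughly like this; the config line, the datetime and the file names are placeholders, not my actual setup:

# binary logging is enabled in my.cnf under [mysqld], e.g.: log-bin = /var/log/mysql/mysql-bin
# replaying everything that was logged after the last dump (placeholder names):
mysqlbinlog --start-datetime="2011-11-01 04:00:00" /var/log/mysql/mysql-bin.000042 \
	| mysql -u admin_user --password="secret" game_db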

And on top of those 70 minutes you have to add the time it takes for someone to notice and report the problem, and the time it takes for me to get to a computer and start restoring.

So the real solution is to have an automatic system that

  1. checks the server on a regular basis
  2. creates dumps pretty often
  3. and restores the database from the latest dump as soon as it encounters a problem

That's exactly what the Shit Management System (SMS) does. It consists of three shell scripts: one of them is started every minute by cron, and it calls the other two when needed.
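The cron entry itself is a single line, something like this (the server prefix, the password and the log path are placeholders here):

# run sms every minute and append its output to a logfile
* * * * * /home/web2/mmog_s8/sms s8 "secret" >> /home/web2/mmog_s8/log/sms.log 2>&1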

sms

A few minor things first (I won't comment on these, because they are easy to understand):

#!/bin/sh
server_prefix="$1"
mysql_password="$2"
cd "/home/web2/mmog_$server_prefix"
date=$(date +"%Y-%m-%d_%H%M%S")
min=$(date +"%M")
min=${min#0}	# strip a possible leading zero so $((min%15)) doesn't get parsed as octal

A file ("flag") that holds the server status:

server_status=$(cat www/up/index.html)

This flag is available to everyone at http://sx.zandagort.com/up/ (x = server number), so e.g. the server that hosts the main Zandagort homepage can check it and put out an apologetic error message.
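So the check on the homepage's side can be as simple as the sketch below (this is not part of SMS, and the URL just follows the sx pattern for server 8):

# hypothetical check on the homepage server: anything other than UP means trouble
status=$(curl -s "http://s8.zandagort.com/up/")
if [ "$status" != "UP" ]; then
	echo "Server 8 is down, sorry, we are working on it."
fi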

The script's output is written to a logfile:

echo "$date sms: $server_prefix $server_status"

If the SMS is currently restoring, the script exits so it won't start restoration again. Except if it has been going on for too long, which indicates that something went wrong during restoration and the flag just stayed that way:

if [ "$server_status" = "RESTORING....." ]; then
	echo "RESTART sms_restore"
	./sms_restore "$1" "$2"
	exit
fi

if [ $(echo "$server_status" | cut -c-9) = "RESTORING" ]; then
	echo "$server_prefix $server_status"
	echo -n "." >> www/up/index.html
	exit
fi

Waiting means that sms_restore has already been started but the MySQL server is not running. In this case the system waits and tries again a minute later, and every 15 minutes it emails the admin that something is very wrong (not just a quick server restart):

if [ "$server_status" = "WAITING" ]; then
	echo "$server_prefix WAITING"
	if [ "$((min%15))" -eq 0 ]; then
		echo "Server is totally down, still waiting." | mail -s "Zandagort $server_prefix is totally down" "admin@email.com"
	fi
	echo "START sms_restore"
	./sms_restore "$1" "$2"
	exit
fi

If the system is neither restoring nor waiting, it checks the database. A tiny PHP script reads the single row of a one-row MEMORY table. It is possibly the fastest MySQL operation there is (because the table has only one row and because the game uses it quite rarely), but if the database is gone, this table is gone (i.e. empty) as well.

is_it_up=$(php -f www/up/up2.php $server_prefix)
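up2.php itself is not listed here, but what it does is roughly equivalent to this shell sketch (the table name is made up):

# rough equivalent of up2.php: read the single row of the one-row MEMORY table
# and print UP only if it is still there
server_database="mmog""$server_prefix"
server_admin="mmog""$server_prefix""admin"
row_count=$(mysql -N -u "$server_admin" --password="$mysql_password" "$server_database" \
	-e "SELECT COUNT(*) FROM heartbeat" 2>/dev/null)
if [ "$row_count" = "1" ]; then echo "UP"; else echo "DOWN"; fi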

If there's anything wrong, the restoring script is started:

if [ "$is_it_up" != "UP" ]; then
	echo "$server_prefix DOWN"
	echo "START sms_restore"
	./sms_restore "$1" "$2"
	exit
fi

And finally, if everything goes according to plan, the database is dumped every 15 minutes:

if [ "$((min%15))" -eq 0 ]; then
	echo "START sms_backup"
	./sms_backup "$1" "$2"
fi

This means that 15 minutes is the longest period of active game time that can be lost. It's short enough that replaying the binary log can be skipped, which makes restoring pretty fast (about 1 minute). Dumping takes about 5-10 seconds, so it's acceptable to do it every 15 minutes.

sms_backup

The backup script starts with a few minor things too:

#!/bin/sh
server_prefix="$1"
mysql_password="$2"
cd "/home/web2/mmog_$server_prefix"
server_database="mmog""$server_prefix"
server_admin="mmog""$server_prefix""admin"
date=$(date +"%Y-%m-%d_%H%M%S")

echo "$date sms_backup: $server_prefix $date"

Then comes dumping:

mysqldump -u "$server_admin" --password="$mysql_password" --net_buffer_length=4096 "$server_database" | gzip --fast > dump/mmog_teljes_dump_$date.sql.gz

Setting net_buffer_length is necessary because mysqldump uses the extended-insert option by default, which creates INSERT statements with multiple value lists. That's good because the dump is smaller (less I/O) and reloading it is faster. The problem is that by default it creates awfully long lines, so long that MySQL itself can't handle them when the dump is loaded back. net_buffer_length keeps these lines a bit shorter.

Compression is important, because a dump file is around 100-200 megabytes, which adds up to 10-20 gigabytes per day. And the real problem is not storage space but I/O speed: gzip with the --fast option is faster than writing the uncompressed dump file to disk.

After dumping, let's check the database:

is_it_up=$(php -f www/up/up2.php $server_prefix)

If our one-row table is empty, the server went down sometime during dumping. In that case we'd better delete the dump file, because keeping corrupt dumps around can easily lead to restoring from one of them.

if [ "$is_it_up" != "UP" ]; then
	echo "$server_prefix backup mmog_teljes_dump_$date.sql.gz deleted"
	rm dump/mmog_teljes_dump_$date.sql.gz
else
	echo "$server_prefix backup mmog_teljes_dump_$date.sql.gz stored"
fi

sms_restore

Just the usual:

#!/bin/sh
server_prefix="$1"
mysql_password="$2"
cd "/home/web2/mmog_$server_prefix"
server_database="mmog""$server_prefix"
server_admin="mmog""$server_prefix""admin"
last_intact_dump=$(ls -t dump | head -1)
date=$(date +"%Y-%m-%d_%H%M%S")

echo "$date sms_restore: $server_prefix $last_intact_dump"

There's only one extra variable, last_intact_dump, which is the most recent dump file. That's why it's important to delete corrupt dump files.

Let's set the flag file to restoring (so sms won't start sms_restore again after a minute) and let's send an email to the admin:

echo -n "RESTORING" > www/up/index.html
echo "Server is down, trying to restore." | mail -s "Zandagort $server_prefix is down" "admin@email.com"

Here comes restoring:

gunzip -c "dump/$last_intact_dump" | mysql -u "$server_admin" --password="$mysql_password" --default-character-set=utf8 "$server_database"

If it was unsuccessful, let's start waiting (so sms will start sms_restore again after a minute):

is_it_up=$(php -f www/up/up2.php $server_prefix)
if [ "$is_it_up" != "UP" ]; then
	echo "sms_restore: unsuccessful"
	echo -n "WAITING" > www/up/index.html
	echo "Server is totally down, waiting." | mail -s "Zandagort $server_prefix is totally down" "admin@email.com"
	exit
fi

If everything went fine, let's set the flag to UP and send another email:

echo "sms_restore: successful"
echo "Server successfully restored." | mail -s "Zandagort $server_prefix is up again" "admin@email.com"
echo -n "UP" > www/up/index.html

That's it, basically. The system has been running since the start of server 8 of Zandagort (Omen Galaxy). During those 15 days there have been 13 spontaneous server restarts, which is a very fucked up situation, but at least the Shit Management System has served quite well.

Now it's probably time to spend some money and replace some hardware...