I recently had the opportunity to catch up with Oskar Holmkratz, who by day is a AAA game developer specializing in technical animation. At night, he puts on his web-development cape to work on MetaGamerScore.com.
The team recently switched from Heroku to Elastx (powered by Jelastic), and Oskar was kind enough to share the experience.
Tell us about MetaGamerScore.com
We started MetaGamerScore in 2010, with the aim of connecting achievement systems from different gaming platforms. By placing achievements in a larger, persistent context, we aimed to boost their intrinsic perceived value. At the time of writing, MetaGamerScore supports 10 different platforms, including Steam, Xbox Live, PlayStation Network, World of Warcraft & theHunter. The first public version was deployed to Heroku in 2010, as it was pretty much the only feasible Ruby cloud provider at the time.
MetaGamerScore currently has 3,100 registered users.
What was the challenge?
MetaGamerScore grew organically, increasing its active user count and slowly climbing toward break-even. Before we got there, however, the site suddenly started suffering from really bad performance. Loss of performance means lost members: people don’t stay on sites that return 5xx errors. At the time, we had about 100 daily active users.
Most of the critically bad performance was due to slow database requests. The obvious first thought is “it’s in the cloud, just add more performance”, but that is not really a viable (or smart) option if you are already losing money on the application. We spent a lot of time optimizing our queries, improving our indexes and tuning the database, but the site would still go down during European and American peak hours.
The main culprit was the cache: whenever data was not in the PostgreSQL cache, fetching it from our largest (and most accessed) table was very slow, even with optimized indexes. Basically, this means that on AWS, getting great performance requires a database cache as large as the active dataset. The next two database tiers (with larger caches) cost 4 and 15 times as much as our current one, and we didn’t even know whether either would solve our problem; certainly not in the long run, as our database is growing fast.
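A rough way to see why cache size mattered so much: if a fraction h of reads hit the cache and the rest go to storage, the expected cost per read is h · t_cache + (1 − h) · t_miss. When t_miss is large, even a small miss rate dominates, which is why only a cache covering the whole active dataset helped. Faster storage for misses (such as SSDs) lowers t_miss and flattens the curve. This is a minimal sketch of that formula; the latency constants and function name are hypothetical placeholders, not measurements from MetaGamerScore.

```python
"""Toy model of how cache misses dominate database read latency.

All latency constants are hypothetical placeholders, not measurements;
only the shape of the formula matters here.
"""


def expected_latency_ms(hit_ratio: float, cache_ms: float, miss_ms: float) -> float:
    """Expected time per read: hits are served from cache, misses from storage."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * miss_ms


# Hypothetical per-read costs in milliseconds, for illustration only.
CACHE_MS = 0.1
SLOW_STORAGE_MISS_MS = 10.0
SSD_MISS_MS = 0.5

if __name__ == "__main__":
    for hit_ratio in (1.0, 0.99, 0.9, 0.5):
        slow = expected_latency_ms(hit_ratio, CACHE_MS, SLOW_STORAGE_MISS_MS)
        ssd = expected_latency_ms(hit_ratio, CACHE_MS, SSD_MISS_MS)
        print(f"hit ratio {hit_ratio:.2f}: slow storage {slow:.2f} ms, SSD {ssd:.2f} ms")
```

With these placeholder numbers, dropping from a 100% to a 90% hit ratio makes the slow-storage case roughly ten times slower, while the SSD case barely moves.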
Simply put: on Heroku, our costs scaled much faster than our income, so it was time to find something else.
What was the solution?
My friend Martin, one of the co-founders of the site, had seen a demo by Elastx at SHRUG (Stockholm Ruby User Group) and thought the concept sounded interesting. I also remembered seeing a video about Jelastic some time back, but at that time it supported only Java. Researching further, I discovered that Elastx used SSD drives for database storage, which made me believe it would perform better when data wasn’t cached. Vertical scaling sounded mostly like a bonus in my initial analysis, but now, after deployment, I know how invaluable the feature is for handling peak traffic.
How have you set up your environment?
We currently run NGINX at 1/1, the application server at 3/10 and PostgreSQL at 4/10 (reserved cloudlets/scaling limit).
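Each "reserved/scaling" pair can be read as a floor and a ceiling: a node always keeps its reserved resources and may grow up to its scaling limit under load. The sketch below models that reading of the setup above. It assumes Jelastic's reserved-plus-dynamic cloudlet model as a simplification; the `Node` class, the `demand` input and the clamping rule are hypothetical illustrations, not Jelastic's API or its actual billing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One node in the environment, sized in cloudlets (simplified model)."""

    name: str
    reserved: int       # cloudlets always allocated, whether used or not
    scaling_limit: int  # maximum cloudlets the node may scale up to

    def allocated(self, demand: int) -> int:
        """Cloudlets the node ends up with for a hypothetical demand."""
        return max(self.reserved, min(demand, self.scaling_limit))

    def dynamic(self, demand: int) -> int:
        """Cloudlets used above the reservation, i.e. paid for only when needed."""
        return self.allocated(demand) - self.reserved


# The reserved/limit figures from the interview; node names are illustrative.
ENVIRONMENT = [
    Node("nginx", reserved=1, scaling_limit=1),
    Node("app", reserved=3, scaling_limit=10),
    Node("postgresql", reserved=4, scaling_limit=10),
]

if __name__ == "__main__":
    for demand in (2, 8, 15):  # hypothetical off-peak, busy and overload demand
        for node in ENVIRONMENT:
            print(f"demand {demand:>2} {node.name:<10} "
                  f"allocated {node.allocated(demand):>2} dynamic {node.dynamic(demand):>2}")
```

Under this model, the NGINX node never scales, while the application and database nodes absorb peaks up to ten cloudlets and fall back to their reservations off-peak.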
What were the results when you switched from Heroku to Elastx?
Before the switch, the bad performance had cut our daily active users by about 50%. Since the switch, site usage has been rising again. We no longer depend on data being cached to get high performance, so we are very confident that the database node will scale well into the future. Most of the database load comes from our backend parsing/data-crunching thread. The web node adapts very well to vertical scaling, spawning new Phusion Passenger instances to serve requests during peak hours.
We were also able to upgrade to Ruby 2.0 for a slight performance increase.
Can you share the numbers measured with New Relic?
- Database worst-case: From 127s to 2.5s.
- Database median call during normal load: From 400ms to 150ms.
- Number of faulty requests due to database time-outs: From ~5% (ranging from 0% off-peak to 100% at peak) to 0%.
- Average total page load time: From 2s (counting only requests that didn’t time out) to 255ms.
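For a quick sense of scale, the timing figures above can be turned into speedup factors. This is plain arithmetic on the quoted New Relic numbers; the helper names are illustrative only.

```python
# Before/after timings from the New Relic figures, converted to seconds.
MEASUREMENTS = {
    "database worst-case": (127.0, 2.5),
    "database median call": (0.400, 0.150),
    "average page load": (2.0, 0.255),
}


def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' timing is than the 'before' timing."""
    return before / after


def reduction_percent(before: float, after: float) -> float:
    """Percentage of the original time that was eliminated."""
    return 100.0 * (1.0 - after / before)


if __name__ == "__main__":
    for label, (before, after) in MEASUREMENTS.items():
        print(f"{label}: {speedup(before, after):.1f}x faster, "
              f"{reduction_percent(before, after):.1f}% less time")
```

The worst-case database call comes out roughly 50 times faster and the average page load almost 8 times faster. The page-load figure likely understates the gain, since the "before" average excludes requests that timed out.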
What is the cost difference?
These numbers are based on current usage and are approximate, since the cost adapts dynamically. It is also quite difficult to compare the two, since the pricing models differ.
Database price: up ~30% (compared with the 4x or 15x cost of the larger database tiers that would have been needed to keep the site on Heroku).
Web node and background process: currently a similar total price. However, Elastx powered by Jelastic has a flatter cost curve, so it will scale better: Heroku includes one dyno for free, but at approximately three dynos in total, the Elastx pricing model seems to come out ahead.
All in all, a ~25% increase in cost for a massive increase in performance.
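The "one free dyno versus a flatter cost curve" comparison is a crossover between two linear cost functions: with a per-unit price p_h after one free dyno and a lower per-unit price p_e from the first unit, the flatter curve wins once n > p_h / (p_h − p_e). The sketch below finds that point numerically. It assumes both platforms can be reduced to a simple per-"dyno-equivalent" price, which is a big simplification of their real billing; the prices and function names are hypothetical placeholders, not actual Heroku or Elastx rates.

```python
from typing import Optional


def heroku_cost(dynos: int, price_per_dyno: float, free_dynos: int = 1) -> float:
    """Simplified Heroku-style cost: the first dyno(s) free, then a flat price each."""
    return max(0, dynos - free_dynos) * price_per_dyno


def elastx_cost(dynos: int, price_per_equivalent: float) -> float:
    """Simplified Elastx-style cost: a lower flat price for every dyno-equivalent."""
    return dynos * price_per_equivalent


def crossover(heroku_price: float, elastx_price: float, max_dynos: int = 100) -> Optional[int]:
    """Smallest dyno count at which the flatter curve is strictly cheaper, if any."""
    for n in range(1, max_dynos + 1):
        if elastx_cost(n, elastx_price) < heroku_cost(n, heroku_price):
            return n
    return None


if __name__ == "__main__":
    # Hypothetical relative prices: Elastx-style units cost 60% of a Heroku dyno.
    print(crossover(heroku_price=1.0, elastx_price=0.6))
```

With the placeholder ratio of 0.6, the formula gives n > 2.5, so the flatter curve wins from three dynos onward, matching the shape Oskar describes. Real crossover points depend on actual prices and usage.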
And finally, what are the highlights of using the platform?
- SSD drives for database storage make uncached requests immensely faster. The speed of cached requests depends more on how far you scale vertically.
- Vertical scaling makes sure you only pay for the performance you need when you need it (i.e. during peak hours).
- Great experience with the Elastx support department: they answer questions very quickly and knowledgeably.
- No special plugins that magically inject things into your code; it is a more straightforward, standard Ruby server.
- Ruby 2.0 support, which for us mainly means the performance increase.
Below are some pictures, mostly taken from my New Relic dashboard, that show some smooth sailing for MetaGamerScore on the Elastx servers. Unfortunately, I don’t have any pictures from before the switch.
New Relic Response Times:
PostgreSQL CPU Usage:
Browser Load Times: