Question

The company I work for decided to move their entire stack to Heroku. The main motivation was its ease of use: no sysadmin, no cry. But I still have some questions about it...

I'm running load and stress tests on both the application platform and the Postgres service, using the Blitz add-on from Heroku. I hit the site with between 1 and 250 concurrent users. I got some very interesting results and need help evaluating them.

The Test Stack:

Application specifications

There's nothing particularly special about it.

  • Rails 4.0.4
  • Unicorn
  • database.yml set up to connect to Heroku Postgres.
  • Not using cache.

Database

It's a Standard Tengu (Heroku's naming conventions will kill me one day :) properly connected to the application.

Heroku configs

I applied everything in unicorn.rb as described in the "Deploying Rails Applications With Unicorn" article. I have 2 regular web dynos.

WEB_CONCURRENCY  : 2
DB_POOL          : 5
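
For reference, my unicorn.rb is essentially what that article suggests. A minimal sketch along those lines (the timeout value and the exact reconnect logic are assumptions based on the guide, so adjust to your own setup):

worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  # Drop the master's ActiveRecord connection so the forked workers don't share it
  defined?(ActiveRecord::Base) and
    ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  # Each worker opens its own connection pool, sized by DB_POOL
  if defined?(ActiveRecord::Base)
    config = ActiveRecord::Base.configurations[Rails.env] ||
             Rails.application.config.database_configuration[Rails.env]
    config['pool'] = ENV['DB_POOL'] || 5
    ActiveRecord::Base.establish_connection(config)
  end
end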

Data

  • episodes table: ~100,000 rows
  • episode_urls table: ~300,000 rows
  • episode_images table: ~75,000 rows

Code

episodes_controller.rb

  def index
    @episodes = Episode.joins(:program).where(programs: {channel_id: 1}).limit(100).includes(:episode_image, :episode_urls)
  end

episodes/index.html.erb

<% @episodes.each do |t| %>
<% if !t.episode_image.blank? %>
<li><%= image_tag(t.episode_image.image(:thumb)) %></li>
<% end %>
<li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank?%></li>
<li><%= t.title %></li>
<% end %>

Scenario #1:

Web dynos   : 2
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 10
End users   : 10

Result:

HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)

This rush generated 218 successful hits in 30.00 seconds and we transferred 6.04 MB of data in and out of your app. The average hit rate of 7.27/second translates to about 627,840 hits/day.

Scenario #2:

Web dynos   : 2
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 20
End users   : 20

Result:

HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)

This rush generated 365 successful hits in 30.00 seconds and we transferred 10.12 MB of data in and out of your app. The average hit rate of 12.17/second translates to about 1,051,200 hits/day. The average response time was 622 ms.

Scenario #3:

Web dynos   : 2
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 50
End users   : 50

Result:

HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)

This rush generated 371 successful hits in 30.00 seconds and we transferred 10.29 MB of data in and out of your app. The average hit rate of 12.37/second translates to about 1,068,480 hits/day. The average response time was 2,631 ms.

Scenario #4:

Web dynos   : 4
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 50
End users   : 50

Result:

HITS 100.00% (484)
ERRORS 0.00% (0)
TIMEOUTS 0.00% (0)

This rush generated 484 successful hits in 30.00 seconds and we transferred 13.43 MB of data in and out of your app. The average hit rate of 16.13/second translates to about 1,393,920 hits/day. The average response time was 1,856 ms.

Scenario #5:

Web dynos   : 4
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 150
End users   : 150

Result:

HITS 71.22% (386)
ERRORS 0.00% (0)
TIMEOUTS 28.78% (156)

This rush generated 386 successful hits in 30.00 seconds and we transferred 10.76 MB of data in and out of your app. The average hit rate of 12.87/second translates to about 1,111,680 hits/day. The average response time was 5,446 ms.

Scenario #6:

Web dynos   : 10
Duration    : 30 seconds
Timeout     : 8000 ms
Start users : 150
End users   : 150

Result:

HITS 73.79% (428)
ERRORS 0.17% (1)
TIMEOUTS 26.03% (151)

This rush generated 428 successful hits in 30.00 seconds and we transferred 11.92 MB of data in and out of your app. The average hit rate of 14.27/second translates to about 1,232,640 hits/day. The average response time was 4,793 ms. You've got bigger problems, though: 26.21% of the users during this rush experienced timeouts or errors!

General Summary:

  • The "Hit Rate" never goes beyond the number of 15 even though 150 users sends request to the application.
  • Increasing number of web dynos does not help handling requests.

Questions:

  1. When I use caching and memcached (the Memcachier add-on from Heroku), even 2 web dynos can handle >180 hits per second. I'm just trying to understand what the dynos and the Postgres service can do without caching, so that I can figure out how to tune them. How do I do that?

  2. Standard Tengu is said to support 200 concurrent connections. So why does it never reach that number?

  3. If having a production-level DB and increasing the number of web dynos won't help scale my app, what's the point of using Heroku?

  4. Probably the most important question: What am I doing wrong? :)

Thank you for even reading this crazy question!

Solution

I've pretty much figured out the issue.

Firstly, remember my code in the view:

<% @episodes.each do |t| %>
<% if !t.episode_image.blank? %>
<li><%= image_tag(t.episode_image.image(:thumb)) %></li>
<% end %>
<li><%= t.episode_urls.first.mas_path if !t.episode_urls.first.blank?%></li>
<li><%= t.title %></li>
<% end %>

Here I'm fetching each episode's episode_image inside the iteration. Even though I was using includes in my controller, there was a big mistake in my table schema: I did not have an index on episode_id in my episode_images table! This was causing extremely high query times. I found it using New Relic's database reports. All other query times were 0.5 ms or 2-3 ms, but episode.episode_image was taking almost 6,500 ms!
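
For anyone in the same spot, the fix is a one-line migration. A minimal sketch (the migration class name here is just an example):

class AddIndexToEpisodeImagesOnEpisodeId < ActiveRecord::Migration
  def change
    # Index the foreign key so looking up an episode's image no longer scans the table
    add_index :episode_images, :episode_id
  end
end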

I don't know much about the relationship between query time and application execution, but once I added the index to my episode_images table, I could clearly see the difference. If your database schema is designed properly, you probably won't face any problems scaling via Heroku. But no number of dynos can help you with a badly designed database.

For people who might run into the same problem, I would like to share some of my findings on the relationship between Heroku web dynos, Unicorn workers and PostgreSQL active connections:

Basically, Heroku provides you with a dyno, which is a kind of small virtual machine with 1 core and 512 MB of RAM. Your Unicorn server runs inside that little virtual machine. Unicorn has a master process and worker processes. Each of your Unicorn workers holds its own permanent connection to your PostgreSQL server (don't forget to check this out). It basically means that when you have a Heroku dyno up with 3 Unicorn workers running on it, you have at least 4 active connections. If you have 2 web dynos, you have at least 8 active connections.
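
In other words, connection usage grows roughly as dynos × (workers + 1). A back-of-the-envelope sketch of that arithmetic (the worker count is just an example matching the numbers above):

# Rough connection arithmetic, assuming one connection per Unicorn worker
# plus one for the master process.
web_dynos       = 2
unicorn_workers = 3                     # WEB_CONCURRENCY per dyno
per_dyno        = unicorn_workers + 1   # workers + master
total           = web_dynos * per_dyno

puts total  # => 8 active connections, far below the 200-connection Standard Tengu limit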

Let's say you have a Standard Tengu Postgres plan with a 200 concurrent connection limit. If you have problematic queries on a badly designed database, neither the DB nor more dynos can save you without a cache... If you have long-running queries, I think you have no choice other than caching.
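
As an illustration, here is a minimal low-level caching sketch for the query in question (the cache key and expiry are arbitrary examples, and this assumes a cache store such as Memcachier is already configured):

def index
  # Serve the expensive query from the cache; hit Postgres only on a cache miss
  @episodes = Rails.cache.fetch("channel-1-episodes", expires_in: 10.minutes) do
    Episode.joins(:program)
           .where(programs: { channel_id: 1 })
           .limit(100)
           .includes(:episode_image, :episode_urls)
           .to_a
  end
end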

All of the above are my own findings; if anything is wrong with them, please let me know in the comments.

Licensed under: CC-BY-SA with attribution