2014-01-18

If you work with heavily loaded Unicorn deployments, you will be familiar with the following problems.

How do you find out what your Ruby processes are doing?

Occasionally some Ruby processes run pedal to the metal, possibly burning CPU like crazy, when you know they shouldn't be, and you just can't tell what's happening. They get stuck, but you can't tell where or why - it's frustrating.

Tracing sometimes works. Run strace -s 999 -fp <pid> on a suspicious process and you may see some useful output. Or it may end with something like this, with no further movement:

[pid  7990] select(18, [16 17], NULL, [7 8], {133, 947558} <unfinished ...>
[pid 17728] select(19, [14 18], NULL, [7 8], {22, 500000} <unfinished ...>

The worker is blocked in select on a couple of file descriptors, and you can use lsof to find out what they point to:

$ lsof -p 17728
COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
...
ruby    17728    ruby   19u  IPv4          808319634      0t0       TCP web1.infra.net:35682->db3.infra.net:mysql (ESTABLISHED)

In this case it's waiting on MySQL, so you can go and look at what's happening over there. But sometimes you see that the worker is hanging on a select from a pipe or a socket:

$ lsof -p 58610
COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
...
ruby    58610    ruby   12r  FIFO        0,8        0t0 3435272094 pipe
ruby    58610    ruby   13u  sock        0,6        0t0 3435234779 can't identify protocol
ruby    58610    ruby   14w  FIFO        0,8        0t0 3435270598 pipe

If the FD turns out to be a pipe or a socket, it's difficult to dig any further.

Another case is when a Ruby process endlessly spills its guts into the trace output and you can see a repeating pattern, meaning there is some sort of infinite loop going on, but it's hard to tell where.

There are more advanced techniques, but they are pretty cumbersome, and you may get your process killed before you manage to find out anything useful.

What is Ruby doing at any given moment?

Wouldn't it be nice if you could find out what line of code your Ruby process is executing at any given moment? The good news is that you can. Combine Signal.trap with Thread.list and Thread#backtrace, and you're set.

# dump a backtrace of every live thread when this process receives USR2
Signal.trap('USR2') do
  pid = Process.pid
  puts "[#{pid}] Received USR2 at #{Time.now}. Dumping threads:"
  Thread.list.each do |t|
    trace = t.backtrace.join("\n[#{pid}] ")
    puts "[#{pid}] #{trace}"
    puts "[#{pid}] ---"
  end
  puts "[#{pid}] -------------------"
end
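
If you run Unicorn, a natural home for this trap is the after_fork block in the Unicorn config, so every worker installs it right after forking. A minimal sketch (the config path and worker count here are illustrative):

# config/unicorn.rb (sketch)
worker_processes 4

after_fork do |server, worker|
  # install the USR2 thread-dump handler in every worker process
  Signal.trap('USR2') do
    pid = Process.pid
    puts "[#{pid}] Received USR2 at #{Time.now}. Dumping threads:"
    Thread.list.each do |t|
      trace = t.backtrace.join("\n[#{pid}] ")
      puts "[#{pid}] #{trace}"
      puts "[#{pid}] ---"
    end
    puts "[#{pid}] -------------------"
  end
end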

The same trap works from a Resque after_fork hook, at the beginning of Rake tasks, or wherever else it fits. When something is stuck or otherwise suspicious, find the pid, and then do:

$ kill -USR2 26749

Take a look at the Unicorn log. It should show something like this:

[26749] Received USR2 at 2014-01-19 04:28:35 +0000. Dumping threads:
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `call'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `sleep'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `index'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/action_controller/metal/implicit_render.rb:4:in `send_action'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/abstract_controller/base.rb:167:in `process_action'
[26749] 
...
[26749] ---
[26749] -------------------

You can see that it's hanging on sleep in welcome_controller.rb, line 8. Bullseye.
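
Judging by that backtrace, the code on the other end looked roughly like this (a reconstruction for illustration, not the actual application code):

# app/controllers/welcome_controller.rb (reconstructed)
class WelcomeController < ApplicationController
  def index
    sleep 600 # line 8 in the real file - the line the worker was stuck on
  end
end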

Dumping Ruby threads when killing lazy Unicorn workers

You can't sit around all day watching those processes by hand. Unicorn has a timeout configuration option that makes the master kill any worker that runs longer than the given number of seconds.
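
In the Unicorn config that's a single line (45 seconds here, to match the logs below):

# config/unicorn.rb
timeout 45 # the master kills any worker stuck on a request for longer than this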

The trouble is that Unicorn sends SIGKILL when terminating stuck workers, and SIGKILL cannot be trapped. Workers get killed, and all you get are these lousy messages:

E, [2013-11-19T12:30:08.905908 #61893] ERROR -- : worker=75 PID:62991 timeout (46s > 45s), killing
E, [2013-11-19T12:30:08.939120 #61893] ERROR -- : reaped #<Process::Status: pid 62991 SIGKILL (signal 9)> worker=75

That's not very helpful. It just tells us that something went wrong, and that the perpetrator got a bullet in the head.

But we can patch things up by overriding Unicorn::HttpServer#murder_lazy_workers. Here's a monkey patch that will send SIGTERM 2 seconds before sending SIGKILL:

class Unicorn::HttpServer
  # forcibly terminate all workers that haven't checked in in timeout seconds.
  # The timeout is implemented using an unlinked File
  def murder_lazy_workers
    next_sleep = @timeout - 1
    now = Time.now.to_i
    WORKERS.dup.each_pair do |wpid, worker|
      tick = worker.tick
      0 == tick and next # skip workers that haven't processed any clients
      diff = now - tick
      tmp = @timeout - diff

      # monkey patch begins here
      if tmp < 2
        logger.error "worker=#{worker.nr} PID:#{wpid} running too long " \
                     "(#{diff}s), sending TERM"
        kill_worker(:TERM, wpid)
      end
      # end of monkey patch

      if tmp >= 0
        next_sleep > tmp and next_sleep = tmp
        next
      end
      next_sleep = 0
      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
                   "(#{diff}s > #{@timeout}s), killing"
      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
    end
    next_sleep <= 0 ? 1 : next_sleep
  end
end

Place this code somewhere it gets loaded on Unicorn startup; next to Unicorn's before_fork will do. You may also want to adjust your Signal.trap to be a little more universal:

['TERM', 'USR2'].each do |sig|
  Signal.trap(sig) do
    pid = Process.pid
    puts "[#{pid}] Received #{sig} at #{Time.now}. Dumping threads:"
    Thread.list.each do |t|
      trace = t.backtrace.join("\n[#{pid}] ")
      puts "[#{pid}] #{trace}"
      puts "[#{pid}] ---"
    end
    puts "[#{pid}] -------------------"
    exit unless sig == 'USR2' # a TERM should still shut the worker down after the dump
  end
end

Restart your unicorns (this change may require a full stop and a clean start) and wait for a worker to hit the timeout. When it does, read your logs and rejoice:

E, [2014-01-19T04:17:19.340983 #26722] ERROR -- : worker=5 PID:27033 running too long (45s), sending TERM
[27033] Received TERM at 2014-01-19 04:17:19 +0000. Dumping threads:
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[27033] ... your code ...

And you can still use kill -USR2 <pid> on your Unicorn workers (or any other Ruby process with this trap, for that matter) to get a backtrace on demand. Just don't send USR2 to the Unicorn master process expecting a thread dump; it will reload your application instead.
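
If you are not sure which pid belongs to a worker and which to the master, the default process titles tell them apart (assuming you haven't overridden them):

$ ps aux | grep '[u]nicorn worker'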

Everything is so much simpler when you can pinpoint problems in your code.

Ruby