If you are working with highly intensive Unicorn deployments, you will be familiar with following problems.
How to find out what Ruby processes are doing?
Occasionally some Ruby processes are running pedal to the metal, possibly burning CPU like crazy, you know they shouldn't be, and you just can't tell what's happening. They are getting stuck, but can't tell where or why - it's frustrating.
Tracing sometimes works. Run strace -s 999 -fp <pid>
on a suspicious process, and you may see some useful output. It may end with something like this, without any movement:
[pid 7990] select(18, [16 17], NULL, [7 8], {133, 947558} <unfinished ...>
[pid 17728] select(19, [14 18], NULL, [7 8], {22, 500000} <unfinished ...>
It's selecting something from file descriptor 19, and you can try to find out what it is:
$ lsof -p 17728
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
ruby 17728 ruby 19u IPv4 808319634 0t0 TCP web1.infra.net:35682->db3.infra.net:mysql (ESTABLISHED)
In this case it's waiting for MySQL, so you can look at what's happening over there. But sometimes you see that worker is hanging on select from a pipe or socket:
$ lsof -p 17728
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
ruby 58610 ruby 12r FIFO 0,8 0t0 3435272094 pipe
ruby 58610 ruby 13u sock 0,6 0t0 3435234779 can't identify protocol
ruby 58610 ruby 14w FIFO 0,8 0t0 3435270598 pipe
If your FD shows a pipe or socket, it's difficult to dig further.
Another case is when Ruby process is endlessly spilling it's guts into trace output, and you can see a pattern repeating, meaning that there is some sort of infinite loop going on, but it's hard to tell where.
There are more advanced techniques, but they are pretty cumbersome, and you may get your process killed before you manage to find out anything useful.
What Ruby is doing at any given moment?
Wouldn't it be nice if you could be able to find out what line of code your Ruby process is executing at any given moment? Good news is that you can. Combine Signal.trap
with Thread.list
and Thread#backtrace
, and you're set.
Signal.trap('USR2') do
pid = Process.pid
puts "[#{pid}] Received USR2 at #{Time.now}. Dumping threads:"
Thread.list.each do |t|
trace = t.backtrace.join("\n[#{pid}] ")
puts "[#{pid}] #{trace}"
puts "[#{pid}] ---"
end
puts "[#{pid}] -------------------"
end
Place this code in Unicorn's after_fork
, Resque after_fork
, beginning of Rake tasks, or wherever it fits. When something is stuck or otherwise suspicious, find the pid, and then do:
$ kill -USR2 26749
Take a look at unicorn log. It should show something like this:
[26749] Received USR2 at 2014-01-19 04:28:35 +0000. Dumping threads:
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[26749] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `call'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `sleep'
[26749] /srv/app.intra.net/releases/20140118185816/app/controllers/welcome_controller.rb:8:in `index'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/action_controller/metal/implicit_render.rb:4:in `send_action'
[26749] /usr/local/lib/ruby/gems/2.0.0/gems/actionpack-3.2.16/lib/abstract_controller/base.rb:167:in `process_action'
[26749]
...
[26749] ---
[26749] -------------------
You can see that it's hanging on sleep
in welcome_controller.rb
, line 8
. Bullseye.
Dumping Ruby threads when killing lazy Unicorn workers
You can't sit all day and watch those processes manually. Unicorn has a timeout configuration option, where it kills workers that run longer than given amount of seconds.
Trouble is that Unicorn sends SIGKILL
when terminating stuck workers, and SIGKILL
cannot be trapped. Workers get killed, and all you get are these lousy messages:
E, [2013-11-19T12:30:08.905908 #61893] ERROR -- : worker=75 PID:62991 timeout (46s > 45s), killing
E, [2013-11-19T12:30:08.939120 #61893] ERROR -- : reaped #<Process::Status: pid 62991 SIGKILL (signal 9)> worker=75
That's not very helpful. It just tells us that something went wrong, and that the perpetrator got a bullet in the head.
But we can patch things up by overriding Unicorn::HttpServer#murder_lazy_workers
. Here's a monkey patch that will send SIGTERM
2 seconds before sending SIGKILL
:
class Unicorn::HttpServer
# forcibly terminate all workers that haven't checked in in timeout seconds.
# The timeout is implemented using an unlinked File
def murder_lazy_workers
next_sleep = @timeout - 1
now = Time.now.to_i
WORKERS.dup.each_pair do |wpid, worker|
tick = worker.tick
0 == tick and next # skip workers that haven't processed any clients
diff = now - tick
tmp = @timeout - diff
# monkey patch begins here
if tmp < 2
logger.error "worker=#{worker.nr} PID:#{wpid} running too long " \
"(#{diff}s), sending TERM"
kill_worker(:TERM, wpid)
end
# end of monkey patch
if tmp >= 0
next_sleep > tmp and next_sleep = tmp
next
end
next_sleep = 0
logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
"(#{diff}s > #{@timeout}s), killing"
kill_worker(:KILL, wpid) # take no prisoners for timeout violations
end
next_sleep <= 0 ? 1 : next_sleep
end
end
Place this code somewhere where it gets loaded on unicorn startup. It can be next to Unicorn's before_fork
. You may want to adjust your Signal.trap
to be a little more universal:
['TERM', 'USR2'].each do |sig|
Signal.trap(sig) do
pid = Process.pid
puts "[#{pid}] Received #{sig} at #{Time.now}. Dumping threads:"
Thread.list.each do |t|
trace = t.backtrace.join("\n[#{pid}] ")
puts "[#{pid}] #{trace}"
puts "[#{pid}] ---"
end
puts "[#{pid}] -------------------"
exit unless sig == 'USR2'
end
end
Restart your unicorns (this change may require full stop and clean start) and wait for any worker to receive a timeout. When it does, read your logs and rejoice:
E, [2014-01-19T04:17:19.340983 #26722] ERROR -- : worker=5 PID:27033 running too long (45s), sending TERM
[27033] Received USR2 at 2014-01-19 04:17:19 +0000. Dumping threads:
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `backtrace'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:291:in `block (3 levels) in reload'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `each'
[27033] /srv/app.intra.net/releases/shared/config/unicorn-config.rb:290:in `block (2 levels) in reload'
[27033] ... your code ...
And you can still use kill -USR2 <pid>
on your unicorn workers (or any other Ruby process for that matter) to get the backtrace on demand. Just don't send USR2
to unicorn master process expecting to get a thread dump, it will reload your application instead.
Everything is so much simpler when you can pinpoint problems in your code.
Ruby