2017-09-03

I've collected a lot of random notes and takeaways from SREcon 17 Europe. Unfortunately, I cannot attribute each note to a talk or speaker, so I'll just share a random, intertwined brain dump. Many of these ideas are obvious, and many are taken from the SRE Book, but hopefully some will be new and refreshing. Here goes:

Nice quotes and ideas:

  • Hope is not a strategy
  • Strong opinions are bad when weakly held. ("I hate Java." "Why?" "Because, you know, it's slow and it's shit.")
  • Developers are the customers of your database
  • Everyone's backend is someone else's frontend
  • Hero culture creates bus factor
  • Cloud providers are Chaos Monkey as a Service

Misc:

  • Have WTF diagrams of your systems
  • Your tiny infrastructure of a few hundred machines is not Google; don't overengineer it
  • Write postmortems for failed auto deployments
  • There are 3 work modes of an SRE team: Firefighting (the server has run out of RAM, we have to fix it now), Preventive (the server will run out of RAM soon, we'd better fix it now) and Proactive (let's do this so the server will never run out of RAM). Your team has to be in good mental health and the product has to be stable for that last mode to happen.
  • Mental health of your team is probably more important than the health of your product. An exhausted team will only do firefighting.
  • Have definition of done for SRE tasks
  • Leads should have 1-on-1 talks with team members every week, to feel the pulse
  • Your team should design its own T-shirt that can be worn with pride
  • Have outage and hit-by-a-bus drills to reduce on-call stress
  • Carry a physical TODO pad with you, so when someone asks you to do something, they can see how long your TODO list is when you're adding their item.
  • If you don't test your backups, you don't have backups
  • Run selects on read-only replicas to distribute load

Gamifying operational excellence:

  • Introduce service score cards, rating all services from F to A+
  • Come up with various requirements with weights
  • Codify all requirements into checks
  • Tally up the scores, then set grades (see the sketch after this list)
  • No SRE support for F-grade services
  • High scores are published for the whole company
  • A-grade services can be deployed 24/7
  • A-grade services can have priority build queues
  • Teams with F-grade services get extra help
  • Have HackDays to improve grades
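
I liked the scorecard idea enough to sketch what the tally step might look like. This is just my own minimal illustration in Python - the check names, weights and grade thresholds are made up, not from the talk:

    # Hypothetical service scorecard: weighted checks tallied into a letter grade.
    CHECKS = {
        # check name -> (weight, predicate over a service description dict)
        "has_runbook":       (3, lambda svc: svc.get("runbook_url") is not None),
        "has_slo":           (3, lambda svc: "slo" in svc),
        "actionable_alerts": (2, lambda svc: svc.get("actionable_alerts", False)),
        "backups_tested":    (2, lambda svc: svc.get("backups_tested", False)),
        "has_dashboard":     (1, lambda svc: svc.get("dashboard_url") is not None),
    }

    GRADES = [(0.9, "A+"), (0.8, "A"), (0.7, "B"), (0.6, "C"), (0.5, "D"), (0.0, "F")]

    def grade(service):
        total = sum(weight for weight, _ in CHECKS.values())
        score = sum(weight for weight, check in CHECKS.values() if check(service))
        ratio = score / total
        return next(letter for threshold, letter in GRADES if ratio >= threshold)

    # 8 out of 11 points -> "B"
    print(grade({"runbook_url": "https://wiki/runbook", "slo": "99.9%", "actionable_alerts": True}))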

On-call & incident management:

  • An outage is not the end of the world. If your product's uptime is not directly related to life-and-death situations, don't kill yourself trying to keep downtime minimal. So what if people cannot buy some shit for a little while? They will survive, and so will your company.
  • Have more people on call
  • Don't spray & pray - send pages to a few relevant people only
  • People who get paged don't have to be the ones who do the fixing
  • The mere possibility of getting a page is as stressful as actually getting one
  • If something never breaks, break it yourself. Eventually everything will break anyway, and then you might get in big trouble
  • Dashboards with graphs allow finding problems much faster
  • Assess the impact before trying to fix it:
    • how many users are affected?
    • how broken is it?
  • Ask yourself "if it's not this, what else could be causing it?" before trying to fix the first thing that comes to your mind when something is broken.
  • Have an incident manager who coordinates and communicates during the outage, while those who are fixing it can focus
  • Knowing the start time (to the granularity of a minute) is very important
  • Have a centralized change database, where deployments, production configuration changes, infrastructure and other changes are registered. Use it to find what could have caused the incident, because in the majority of cases it's caused by some change.
  • The centralized change database should have a revert button
  • When mitigating, think of turning the component off, redirecting the traffic elsewhere, error boundaries and cascading failures
  • Have built-in controls for all your components, so you can change things at runtime
  • Have status endpoints for all your components - an HTTP page listening on some port, providing useful information about the service state (see the sketch after this list)
  • Root cause is often the least important thing during the incident - deal with it after service is restored
  • Parallelize mitigation efforts if possible
  • Confirm that the user impact is gone after the incident is resolved
  • Communicate that the incident is resolved - otherwise, if something else is still broken, others may think you are still fixing it
  • Have incident follow-ups covering detection, escalation, recovery and prevention
  • Use Slow thinking when dealing with an outage
  • ChatOps can be a massive aid for handling incidents
  • Minimum team for on-call rotation is 6 people
  • When you begin your on-call rotation duty, you should get a page about it and ACK it.
  • If your main chat (e.g. Slack) goes down during the incident, do you have a failover method of communication ready?
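
The status-endpoint idea is simple enough to sketch. Here's a minimal illustration using only the Python standard library; the port and the fields exposed are placeholders I picked, not anything from a talk:

    # Minimal status endpoint: an HTTP page on some port with useful service state.
    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    START_TIME = time.time()

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            body = json.dumps({
                "status": "ok",                                   # overall health
                "uptime_seconds": int(time.time() - START_TIME),
                "version": "1.2.3",                               # placeholder build info
                "queue_depth": 0,                                 # placeholder internal state
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8081), StatusHandler).serve_forever()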

Errors and debugging:

  • Fingerprint error messages and aggregate their counts (see the sketch after this list)
  • Use git stacktrace to find the commit that caused the error
  • Deduplicate stack traces by comparing two Elastic queries
  • Use git whatchanged and git log -S (pickaxe)
  • Show error messages that are useful to both end users and developers
  • Include encrypted debug data and trace IDs in your error pages
  • Use staging, canaries, forked production traffic, log playback to detect stuff early
  • Use real user measurements (RUM)
  • ElasticSearch can be used for anomaly detection
  • VM snapshots in bad state could be useful for debugging
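
For the fingerprinting bullet, here's a minimal sketch of what I have in mind - masking the variable parts (numbers, hex IDs) of a message before counting, with made-up log lines:

    # Fingerprint error messages by masking the variable parts, then count them.
    import re
    from collections import Counter

    def fingerprint(message):
        msg = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)  # pointers, hashes
        msg = re.sub(r"\d+", "<NUM>", msg)                 # ids, sizes, ports
        return msg

    logs = [
        "timeout connecting to 10.0.0.12:5432 after 3000ms",
        "timeout connecting to 10.0.0.17:5432 after 3000ms",
        "segfault at 0xdeadbeef in worker 7",
    ]

    for fp, count in Counter(fingerprint(line) for line in logs).most_common():
        print(count, fp)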

Metrics, monitoring and resource management:

  • Collect metrics people care about
  • Use predictive algorithms to figure out when resources will be exhausted
  • Data scientists should be doing more sophisticated resource forecasts
  • Have short-, mid- and long-term forecasts for resources
  • Be prepared for resource delays (the DC is late delivering new machines) and unlikely events (natural disasters destroy several hard drive manufacturing plants)
  • Migration of capacity can be automated
  • See Microsoft research papers on Vector Bin Packing
  • Test what happens when root partition runs out of space - you may get surprised
  • Prometheus is the new gold standard in monitoring
  • Monitor if your HDFS cluster is balanced
  • Beware of alert fatigue
  • Alerts should always be actionable
  • Name your alerts well enough to understand them at 3 AM
  • Graph alerting stats
  • Use cloudflare/unsee dashboard with Prometheus
  • Check out PromCon talk videos
  • Real user monitoring > synthetic monitoring
  • There should be deep analysis of every single transaction
  • Check out EBPF Linux instrumentation
  • SLOs must have a time period (1 minute, 5 minutes, etc)
  • Disk monitoring alerts should be based on when you'll run out of space (see the sketch after this list)
  • Invert your SLOs and visualize them to be able to answer questions like "what percentile of requests was under 10ms?"
  • Set realistic SLOs. Do you really need your service to respond in 10ms, or is 50ms also perfectly fine?
  • Never measure rates, use gauge values and quantities instead.
  • Track CPU ticks, not utilization %
  • Monitoring should reside outside your tech stack
  • Monitor business KPIs
  • Do not silo monitoring data
  • Observe real workloads, not synthetic ones
  • If there is no workload on service, introduce synthetic workload to see if service still works
  • Histograms are great
  • History is critical, do not aggregate, do not average out over time
  • Build your monitoring top down
  • Egress traffic is one of the most important metrics
  • Monitor down to syscall level
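
To illustrate the "alert on when you'll run out of space" bullet: a minimal sketch that fits a line to recent free-space samples and extrapolates to zero. In Prometheus you'd normally use predict_linear for this; the samples below are made up:

    # Forecast time-to-disk-full from recent free-space samples (least-squares fit).
    def hours_until_full(samples):
        """samples: (t_hours, free_bytes) pairs in time order. Returns the estimated
        hours from the last sample until free space hits zero, or None if not shrinking."""
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_f = sum(f for _, f in samples) / n
        slope = (sum((t - mean_t) * (f - mean_f) for t, f in samples)
                 / sum((t - mean_t) ** 2 for t, _ in samples))    # bytes per hour
        if slope >= 0:
            return None                                           # not filling up
        intercept = mean_f - slope * mean_t
        return -intercept / slope - samples[-1][0]                # fitted line hits zero here

    # Made-up samples: losing ~50 GB/hour with 400 GB free at the last sample.
    samples = [(0, 600e9), (1, 550e9), (2, 500e9), (3, 450e9), (4, 400e9)]
    print(f"disk full in ~{hours_until_full(samples):.1f} hours")  # alert if under, say, 72h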

Distributed systems & scaling:

  • Use Consul to coordinate service migrations
  • If running a Kubernetes cluster was overkill for Shopify, it probably would be for you as well
  • Nobody knows the exact time. In the time it takes light to travel from a watch to your eyes (1 ns), a modern CPU can do 3 ops.
  • A 100% accurate distributed clock would solve most distributed system problems, unfortunately, it doesn't exist yet.
  • Consensus algorithms: Raft is better than Paxos, and there is also View Synchrony
  • Configuration management doesn't scale
  • Autonomous > automated

Improving resilience:

  • Randomly reboot your servers
  • What would happen if all your servers were rebooted at once? (cyclic dependencies can lock up)
  • Visualize dependencies to look for cycles (see the sketch after this list)
  • Set rules for introducing dependencies
  • Don't put DNS servers on Hypervisors that depend on those DNS servers (doh)
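
A minimal sketch of the cycle check, assuming you can dump a map of which service depends on which (the service names are made up):

    # Detect a cycle in a service dependency graph with a depth-first search.
    def find_cycle(deps):
        """deps: dict mapping a service to the services it depends on.
        Returns one dependency cycle as a list, or None if the graph is acyclic."""
        WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current path / done
        color = {node: WHITE for node in deps}
        path = []

        def visit(node):
            color[node] = GREY
            path.append(node)
            for dep in deps.get(node, []):
                if color.get(dep, WHITE) == GREY:              # back edge -> cycle
                    return path[path.index(dep):] + [dep]
                if color.get(dep, WHITE) == WHITE:
                    cycle = visit(dep)
                    if cycle:
                        return cycle
            path.pop()
            color[node] = BLACK
            return None

        for node in list(deps):
            if color[node] == WHITE:
                cycle = visit(node)
                if cycle:
                    return cycle
        return None

    # Made-up example: DNS runs on hypervisors that need DNS to boot.
    print(find_cycle({"hypervisor": ["dns", "storage"], "dns": ["hypervisor"], "storage": []}))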

Anti bot techniques:

  • Ban bots by fingerprint, not IP address.
  • TCP fingerprinting can be used to compare the OS claimed in the user agent with the actual one.
  • Try not to give a banned bot any clues that it has been banned; that only speeds up the arms race.
  • Use reverse Slowloris to mess with annoying bots (see the sketch after this list).
  • Lead bots into a honeypot.
  • Be aware of clients connecting from AWS, DO, GC, Azure, etc.
  • You don't have to determine if a client is a bot in real time. Gather some intel for a few minutes to get a more accurate evaluation.
  • Deal with bots at the load balancer level, or as early in the stack as possible.
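
As I understood reverse Slowloris, instead of a client tying up your server with a slow request, you tie up the bot by dripping the response back one byte at a time. A toy sketch of the idea, not anything shown at the conference - in practice you'd do this on the load balancer, concurrently:

    # Toy "reverse Slowloris": feed a detected bot its response one byte at a time.
    # Handles a single connection at a time - a real setup would do this concurrently.
    import socket
    import time

    def tarpit(port=8080, delay_seconds=10):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", port))
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            try:
                conn.recv(4096)                       # read (and ignore) the request
                conn.sendall(b"HTTP/1.1 200 OK\r\n"
                             b"Content-Length: 1000000\r\n"
                             b"Content-Type: text/html\r\n\r\n")
                while True:                           # promise a large body, then
                    conn.sendall(b"x")                # deliver it one byte at a time
                    time.sleep(delay_seconds)
            except OSError:
                pass                                  # the bot gave up
            finally:
                conn.close()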

Fun facts:

  • People use bots to auto-buy exclusive sneakers on Shopify flash sales and flip them for 10x profits. Meaning, some people are willing to pay $5000 for a pair of ugly-looking red sneakers with a Nike logo all over them. Okaaaay...

Overall, it was a great conference, and even though some people didn't like it that much, I certainly left more inspired than I came, and I would be willing to attend again.

SRE