Blog of Tomas Varaneckas

2017-09-03

I've collected a lot of random notes and takeaways from SREcon 17 Europe. Unfortunately, I cannot attribute each note to a talk or speaker, so I'll just share a random, intertwined brain dump. Many of these ideas are obvious, many are taken from the SRE Book, but hopefully some will be new and refreshing. Here goes:

Nice quotes and ideas:

Hope is not a strategy
Strong opinions are bad when weakly held. (-I hate Java. -Why? -Because, you know, it's slow and it's shit.)
Developers are the customers of your database
Everyone's backend is someone else's frontend
Hero culture creates bus factor
Cloud providers are Chaos Monkey as a Service

Misc:

Have WTF diagrams of your systems
Your tiny infrastructure of a few hundred machines is not Google, don't overengineer it
Write postmortems for failed auto deployments
There are 3 work modes of an SRE team: Firefighting (the server has run out of RAM, we have to fix it now), Preventive (the server will run out of RAM soon, we'd better fix it now) and Proactive (let's do this, so the server will never run out of RAM). Your team has to be in good mental health and the product has to be stable for that last mode to happen.
Mental health of your team is probably more important than the health of your product. An exhausted team will only do firefighting.
Have a definition of done for SRE tasks
Leads should have 1-on-1 talks with team members every week, to feel the pulse
Your team should design its own T-shirt, one that can be worn with pride
Have outage and hit-by-a-bus drills to reduce on-call stress
Carry a physical TODO pad with you, so when someone asks you to do something, they can see how long your TODO list is when you're adding their item.
If you don't test your backups, you don't have backups
Run selects on read-only replicas to distribute load

Gamifying operational excellence:

Introduce service score cards, rating all services from F to A+
Come up with various requirements with weights
Codify all requirements into checks
Tally up the scores, then set grades
No SRE support for F grade services
High scores, published for the whole company
A-grade services can be deployed 24/7
A-grade services can have priority build queues
Teams with F services get extra help
Have HackDays to improve grades
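
To make the score card idea a bit more concrete, here is a minimal sketch of weighted checks being tallied into a letter grade. The check names, weights and grade boundaries are all made up for illustration; nothing here comes from a specific talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    name: str
    weight: float
    passed: Callable[[Dict], bool]  # takes service metadata, returns pass/fail


# Hypothetical grade boundaries, as a percentage of the maximum weighted score.
GRADE_BOUNDARIES = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]

# Hypothetical requirements, codified as checks with weights.
CHECKS: List[Check] = [
    Check("has_runbook", 3, lambda s: s.get("runbook", False)),
    Check("has_alerts", 3, lambda s: s.get("alerts", False)),
    Check("has_status_endpoint", 2, lambda s: s.get("status_endpoint", False)),
    Check("backups_tested", 2, lambda s: s.get("backups_tested", False)),
]


def score(service: Dict, checks: List[Check]) -> float:
    """Return the weighted share of passed checks, as a percentage."""
    total = sum(c.weight for c in checks)
    earned = sum(c.weight for c in checks if c.passed(service))
    return 100.0 * earned / total if total else 0.0


def grade(pct: float) -> str:
    """Map a percentage to a letter grade; anything below the last boundary is F."""
    for floor, letter in GRADE_BOUNDARIES:
        if pct >= floor:
            return letter
    return "F"


if __name__ == "__main__":
    checkout = {"runbook": True, "alerts": True}
    print(grade(score(checkout, CHECKS)))
```

In practice each check would probably query real systems (repo contents, monitoring config, backup logs) instead of a dict, but the tally and grading part can stay this simple.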

On-call & incident management:

Outage is not the end of the world. If your product uptime is not directly related to life and death situations, you don't want to kill yourself over trying to keep the downtime minimal. So what if people cannot buy some shit for a little while? They will survive, and so will your company.
Have more people on call
Don't spray & pray - send pages to a few relevant people only
People who get paged don't have to be the ones that do the fixing
The possibility of getting a page is as stressful as actually getting one
If something never breaks, break it yourself. Eventually everything will break anyway, and then you might get in big trouble
Dashboards with graphs allow finding problems much faster
Assess the impact before trying to fix it:
- how many users are affected?
- how broken is it?
Ask yourself "if it's not this, what else could be causing it?" before trying to fix the first thing that comes to your mind when something is broken.
Have an incident manager who coordinates and communicates during the outage, while those who are fixing it can focus
Knowing the start time (to the minute) is very important
Have a centralized change database, where deployments, production configuration changes, infrastructure and other changes are registered. Use it to find what could have caused the incident, because in the majority of cases it's caused by some change.
The centralized change database should have a revert button
When mitigating, think of turning the component off, redirecting the traffic elsewhere, error boundaries and cascading failures
Have built-in controls for all your components, where you can change things at runtime
Have status endpoints for all your components (HTTP page listening on some port, providing useful information about the service state)
Root cause is often the least important thing during the incident - deal with it after service is restored
Parallelize mitigation efforts if possible
Confirm user impact is gone after incident is resolved
Communicate the fact of incident resolution - something might still be broken and others could think you are still fixing it
Have incident followups, concerning detection, escalation, recovery and prevention
Use slow (deliberate) thinking when dealing with an outage
ChatOps can be a massive aid for handling incidents
Minimum team for on-call rotation is 6 people
When you begin your on-call rotation duty, you should get a page about it and ACK it.
If your main chat (e.g. Slack) goes down during the incident, do you have a failover method of communication ready?
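
The point about knowing the incident start time and having a centralized change database boils down to one query: "what changed shortly before this started?". Below is a rough sketch of that lookup over an in-memory list. The `Change` record, the `suspects` function and the two hour lookback are assumptions for illustration, a real change database would be an actual datastore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Change:
    at: datetime        # when the change was applied
    kind: str           # e.g. "deploy", "config", "infra"
    description: str


def suspects(
    changes: List[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=2),
) -> List[Change]:
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    log = [
        Change(datetime(2017, 9, 1, 9, 0), "infra", "replaced load balancer"),
        Change(datetime(2017, 9, 1, 13, 40), "deploy", "checkout service release"),
        Change(datetime(2017, 9, 1, 13, 55), "config", "raised connection pool size"),
    ]
    for change in suspects(log, incident_start=datetime(2017, 9, 1, 14, 2)):
        print(change.at, change.kind, change.description)
```

The newest change comes first, since the most recent one is usually the first thing worth reverting.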

Errors and debugging:

Fingerprint error messages and aggregate their counts
Use git-stacktrace (a separate tool, not a built-in git command) to find the commit that caused the error
Deduplicate stack traces by comparing two Elastic queries
Use git whatchanged and git log -S (pickaxe)
Show error messages that are useful to both end users and developers
Include encrypted debug data and trace IDs in your error pages
Use staging, canaries, forked production traffic, log playback to detect stuff early
Use real user measurements (RUM)
Elasticsearch can be used for anomaly detection
VM snapshots in bad state could be useful for debugging
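
Fingerprinting error messages, mentioned at the top of this list, can be as simple as replacing the variable parts of a message with placeholders and hashing what is left. This is only a sketch: the regexes and the choice of a truncated SHA-1 are my assumptions, and real log lines will need more normalization rules.

```python
import hashlib
import re
from collections import Counter
from typing import Iterable

# Order matters: replace the most specific patterns first.
_PATTERNS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<uuid>"),
    (re.compile(r"0x[0-9a-f]+", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message: str) -> str:
    """Replace variable parts (UUIDs, hex addresses, numbers) with placeholders."""
    for pattern, placeholder in _PATTERNS:
        message = pattern.sub(placeholder, message)
    return message.strip().lower()


def fingerprint(message: str) -> str:
    """Short, stable identifier for a class of error messages."""
    return hashlib.sha1(normalize(message).encode("utf-8")).hexdigest()[:12]


def aggregate(messages: Iterable[str]) -> Counter:
    """Count occurrences per fingerprint."""
    return Counter(fingerprint(m) for m in messages)


if __name__ == "__main__":
    errors = [
        "Timeout after 30 ms talking to host 10.0.0.5",
        "Timeout after 45 ms talking to host 10.0.0.7",
        "Null pointer at 0x7ffde3a0",
    ]
    for fp, count in aggregate(errors).most_common():
        print(fp, count)
```

The two timeout messages collapse into one fingerprint, so a spike of that error shows up as a single growing counter instead of hundreds of unique lines.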

Metrics, monitoring and resource management:

Collect metrics people care about
Use predictive algorithms to figure out when resources will be exhausted
Data scientists should be doing more sophisticated resource forecasts
Have short, mid, long term forecasts for resources
Be prepared for resource delays (DC is late on delivering new machines) and unlikely events (natural disasters destroy several hard drive manufacturing plants)
Migration of capacity can be automated
See Microsoft research papers on Vector Bin Packing
Test what happens when root partition runs out of space - you may get surprised
Prometheus is the new gold standard in monitoring
Monitor if your HDFS cluster is balanced
Beware of alert fatigue
Alerts should always be actionable
Name your alerts well enough to understand them at 3 AM
Graph alerting stats
Use cloudflare/unsee dashboard with Prometheus
Check out PromCon talk videos
Real user monitoring > synthetic monitoring
There should be deep analysis of every single transaction
Check out eBPF Linux instrumentation
SLOs must have a time period (1 minute, 5 minutes, etc)
Disk monitoring alerts should be based on when you'll run out of space
Invert your SLOs and visualize to be able to answer questions like what percentile of requests was under 10ms?
Set realistic SLOs. Do you really need your service to respond in 10ms, or is 50ms also perfectly fine?
Never measure rates, use gauge values and quantities instead.
Track CPU ticks, not utilization %
Monitoring should reside outside your tech stack
Monitor business KPIs
Do not silo monitoring data
Observe real workloads, not synthetic ones
If there is no workload on service, introduce synthetic workload to see if service still works
Histograms are great
History is critical, do not aggregate, do not average out over time
Build your monitoring top down
Egress traffic is one of the most important metrics
Monitor down to syscall level
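
Two notes above (predicting when resources will be exhausted, and basing disk alerts on when you'll run out of space) can be combined into one small calculation. The sketch below fits a straight line through recent usage samples with plain least squares and extrapolates to the disk capacity. The function name and sample format are my own, and a linear fit is just the simplest possible predictor, not what any particular monitoring system does.

```python
from typing import List, Optional, Tuple


def seconds_until_full(
    samples: List[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate seconds until the disk is full.

    `samples` is a list of (unix_timestamp, used_bytes) pairs, oldest first.
    Returns None when usage is flat or shrinking, or when there is too
    little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope (bytes per second) and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


if __name__ == "__main__":
    usage = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
    remaining = seconds_until_full(usage, capacity=1000.0)
    # Alert on time left, not on a fixed percentage (4h is an arbitrary threshold).
    if remaining is not None and remaining < 4 * 3600:
        print(f"disk full in about {remaining:.0f}s")
```

The alert then fires on "less than N hours left" rather than "90% used", which behaves sensibly on both tiny and huge disks.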

Distributed systems & scaling:

Use Consul to coordinate service migrations
If running a Kubernetes cluster was overkill for Shopify, it probably would be for you as well
Nobody knows the exact time. In the time it takes light to travel from a watch to your eyes (about 1 ns), a modern CPU can do 3 ops.
A 100% accurate distributed clock would solve most distributed system problems. Unfortunately, it doesn't exist yet.
Consensus algorithms: Raft is better than Paxos, and there is also View Synchrony
Configuration management doesn't scale
Autonomous > automated

Improving resilience:

Randomly reboot your servers
What would happen if all your servers were rebooted at once? (cyclic dependencies can lock up)
Visualize dependencies to look for cycles
Set rules for introducing dependencies
Don't put DNS servers on hypervisors that depend on those DNS servers (doh)
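
Looking for dependency cycles does not have to be done by eye. If you can export "service X depends on Y" pairs from anywhere, a plain depth-first search will find a cycle. The following is a minimal sketch; the `deps` dictionary format and the service names are made up.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `deps` maps each service to the services it depends on.
    """
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: we found a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    infra = {"hypervisor": ["dns"], "dns": ["hypervisor"], "web": ["dns"]}
    print(find_cycle(infra))
```

Running a check like this in CI whenever a dependency is added would be one way to enforce "rules for introducing dependencies".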

Anti bot techniques:

Ban bots by fingerprint, not IP address.
TCP fingerprinting can be used to compare claimed OS in user agent with actual one.
Try not to give banned bots any clue that they are banned, as this only speeds up the arms race.
Use reverse Slowloris to mess with annoying bots.
Lead bots into a honeypot.
Be aware of clients connecting from AWS, DO, GC, Azure, etc.
You don't have to determine if a client is a bot in real time. Gather some intel for a few minutes to get a more accurate evaluation.
Deal with bots at the load balancer level, or as early along the stack as possible.
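
A few of the ideas above (ban by fingerprint, compare the claimed OS with the TCP-fingerprinted one, watch for cloud provider addresses, don't decide in real time) fit together into a small evidence-gathering loop. This is a toy sketch only: the signal weights, the threshold and the three minute window are invented, and obtaining the actual TCP fingerprint is out of scope here.

```python
from collections import defaultdict
from typing import Dict

OBSERVATION_WINDOW = 180.0  # seconds to gather intel before deciding (arbitrary)
BAN_THRESHOLD = 5.0         # arbitrary score at which a client counts as a bot


class BotScorer:
    """Accumulates suspicion per client fingerprint, not per IP address."""

    def __init__(self) -> None:
        self.first_seen: Dict[str, float] = {}
        self.score: Dict[str, float] = defaultdict(float)

    def observe(
        self, fingerprint: str, now: float,
        claimed_os: str, tcp_os: str, from_cloud: bool,
    ) -> None:
        self.first_seen.setdefault(fingerprint, now)
        if claimed_os != tcp_os:
            # User agent says one OS, the TCP stack looks like another.
            self.score[fingerprint] += 3.0
        if from_cloud:
            # Real shoppers rarely browse from a cloud provider's address space.
            self.score[fingerprint] += 1.0

    def verdict(self, fingerprint: str, now: float) -> str:
        seen = self.first_seen.get(fingerprint)
        if seen is None or now - seen < OBSERVATION_WINDOW:
            return "undecided"
        return "bot" if self.score[fingerprint] >= BAN_THRESHOLD else "human"


if __name__ == "__main__":
    scorer = BotScorer()
    scorer.observe("fp-1", now=0.0, claimed_os="Windows", tcp_os="Linux", from_cloud=True)
    scorer.observe("fp-1", now=5.0, claimed_os="Windows", tcp_os="Linux", from_cloud=True)
    print(scorer.verdict("fp-1", now=10.0))
    print(scorer.verdict("fp-1", now=200.0))
```

What you do with a "bot" verdict is a separate decision, and per the notes above it should not be an obvious ban response.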

Fun facts:

People use bots to auto-buy exclusive sneakers on Shopify flash sales and flip them for 10x profits. Meaning, some people are willing to pay $5000 for a pair of ugly looking red sneakers with a Nike logo all over it. Okaaaay...

Overall, it was a great conference, and even though some people didn't like it that much, I certainly left more inspired than I came, and I'll be willing to attend again.

SRE

Notes from SRECon 17 Europe