These are my notes for Release It: Design and Deploy Production-Ready Software 2nd Edition by Michael T. Nygard.
I’m not sure how I found this book, but it was likely through a search for SRE books. The fact that I’ve been reading this book on and off for about 8 months says two things:
- I enjoyed the book enough to keep reading it - otherwise I would have stopped
- It started to feel a little less relevant to my job as my SRE responsibilities ramped up
Overall, the book was enjoyable to read and struck a good balance between information density and understandability. The case studies were quite enjoyable!
The main things I learned about were system design patterns, circuit breakers, bulkheads, backpressure, deceleration zones, leaky bucket, governors, multihoming, load balancing, and canary deploys.
I made a list of all the books in the bibliography here: openlibrary.org/people/raybb/lists/OL200793L
Some things from the book:
- “The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing.”
- “Just as auto engineers create crumple zones—areas designed to protect passengers by failing first—you can create safe failure modes that contain the damage and protect the rest of the system.”
- “Once established, a TCP connection can exist for days without a single packet being sent by either side.”
- “Every integration point will eventually fail in some way, and you need to be prepared for that failure.”
- “Partitioning servers with Bulkheads can prevent chain reactions from taking out the entire service—though they won’t help the callers of whichever partition does go down. Use Circuit Breaker on the calling side for that.”
- “It would be wonderful if there was a way to keep things in the session (therefore in memory) when memory is plentiful but automatically be more frugal when memory is tight. Good news! Most language runtimes let you do exactly that with weak references.”
- “some servers are handling more than a million concurrent connections”
- “If you’re dealing with vendor code, it may also be worth some time beating them up for a better client library.”
- “Any special offer meant for a group of 10,000 users is guaranteed to attract millions. The community of networked bargain hunters can detect and share a reusable coupon code in milliseconds.”
- “For service providers, use Handshaking and Backpressure to inform callers to throttle back on the requests. Also consider Bulkheads to reserve capacity for high-priority callers of critical services.”
- “Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines because the high load persists.”
- “I like the Leaky Bucket pattern from Pattern Languages of Program Design 2 [VCK96]. It’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course.) If the count exceeds a threshold, then you know that faults are arriving quickly.”
- Sisyphean: something that can never be completed
- “Log file rotation requires just a few minutes of configuration.”
- “Every performance problem starts with a queue backing up somewhere.”
- “A governor limits the speed of an engine. Slow things down to allow intervention”
- “In development, the server can always call its language-specific version of getLocalHost, but on a multihomed machine, this simply returns the IP address associated with the server’s internal hostname”
- “Reject work as close to the edge as possible. The further it penetrates into your system, the more resources it ties up”
- “the best way to tell if users are receiving a good experience is to measure it directly. This is known as real-user monitoring (or RUM, if you like).”
- “the purpose of the canary deployment is to reject a bad build before it reaches the users.”
- “The only safe way to handle file uploads is to treat the client’s filename as an arbitrary string to store in a database field. Don’t build a path from the filename in the request. Generate a unique, random key for the real filename and link it to the user-specified name in the database. That way, the names in the filesystem stay under your service’s control and don’t include external input as any part.”
- “trickle, then batch” - this is basically the pattern of putting an if in your code that changes data over as it is accessed. You watch it work slowly at first. Once you’re confident, you can run a batch job to change over the less frequently accessed data.
- “A good health check page reports the application version, the runtime’s version, the host’s IP address, and the status of connection pools, caches, and circuit breakers.”
- “Instead of creating a single system of record for any given concept, we should think in terms of federated zones of authority.”
- “Highly efficient systems handle disruption badly. They tend to break all at once.”
- “Also, as Charity Majors, CEO of Honeycomb.io says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.”
- “You can make this more fun by calling it a “zombie apocalypse simulation.” Randomly select 50 percent of your people and tell them they are counted as zombies for the day. They are not required to eat any brains, but they are required to stay away from work and not respond to communication attempts.”
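The Leaky Bucket quote above is concrete enough to sketch. This is only my own minimal Python version, not code from the book: the class name, the parameters (threshold, leak_interval, clock), and the method names are all made up for illustration. One deliberate difference from the quote: instead of a background thread or timer, this version drains the counter lazily whenever it is touched, which keeps the example small.

```python
import time


class LeakyBucket:
    """Fault counter that drains over time (all names here are hypothetical)."""

    def __init__(self, threshold, leak_interval, clock=time.monotonic):
        self.threshold = threshold          # faults tolerated before tripping
        self.leak_interval = leak_interval  # seconds per one-step decrement
        self._clock = clock                 # injectable so it can be faked in tests
        self._count = 0
        self._last_leak = clock()

    def _leak(self):
        # Lazy replacement for the background timer: apply every decrement
        # that should have happened since the last time we looked.
        now = self._clock()
        ticks = int((now - self._last_leak) // self.leak_interval)
        if ticks > 0:
            self._count = max(0, self._count - ticks)  # never below zero
            self._last_leak += ticks * self.leak_interval

    def record_fault(self):
        """Count one fault; return True if faults are arriving too quickly."""
        self._leak()
        self._count += 1
        return self._count > self.threshold

    @property
    def count(self):
        self._leak()
        return self._count
```

A caller would create something like LeakyBucket(threshold=5, leak_interval=1.0), call record_fault() wherever an error is observed, and treat a True result as the signal to trip a circuit breaker or raise an alert.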
Why DNS round robin can be bad:
- Too much control in client’s hands
- DNS has no info about health of instances
- Some clients control more info than others
- global server load balancing (GSLB) server keeps track of the health and responsiveness of the pools.
- GSLB can also apply weighted distribution and a host of load-balancing algorithms. These can be used as part of a disaster recovery strategy or even part of a rolling deployment process.
- Still, use a different DNS provider for your public status page.
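To make the health-plus-weights idea from the GSLB notes above concrete, here is a rough Python sketch. It is not how any real GSLB product works: the pick_pool function and the name, weight, and healthy keys are assumptions I made up for illustration. It only shows the difference from plain DNS round robin, which is that unhealthy pools are filtered out before the weighted choice is made.

```python
import random


def pick_pool(pools, rng=random):
    """Pick a pool name by weight, considering only healthy pools.

    `pools` is a list of dicts with hypothetical keys:
    "name" (str), "weight" (number), "healthy" (bool).
    """
    # Plain DNS round robin has no health information; this step is
    # what a health-aware balancer adds.
    candidates = [p for p in pools if p["healthy"] and p["weight"] > 0]
    if not candidates:
        raise RuntimeError("no healthy pools available")
    weights = [p["weight"] for p in candidates]
    # Weighted random choice among the remaining pools.
    return rng.choices(candidates, weights=weights, k=1)[0]["name"]
```

For a rolling deployment, the weight of a freshly deployed pool could start small and be raised as confidence grows. For disaster recovery, marking a whole region unhealthy shifts all traffic to the remaining pools.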