Andrew Cowie
Track: Scaling and High Availability
Date: Monday, April 18
Time: 1:30pm - 5:00pm
Location: Cypress
How do you ensure that you don't make mistakes when carrying out upgrades to mission-critical systems?
Massive changes and upgrades are a significant part of the life cycle of any large site. These events are often complex, involving numerous interdependent systems and people both inside and outside the team carrying out the procedure, all of whom need to be coordinated. They can be allowed to disrupt services only minimally, if at all. And you need to "get it right the first time."
Databases are often the cornerstone of such high-load, mission-critical systems, and they bring unique challenges. An update to the application code often brings schema changes. Ongoing DBA work such as updating tablespaces, tuning replication, and reconfiguring clusters requires direct action on the very production systems that can't be down.
This tutorial teaches you proven methods for planning, rehearsing, and executing such events. Topics include:
- "Know your enemy": learn what can go wrong in a mission-critical event and why preparing for such events needs to be done with precision.
- "The best defense is a good offense": methodologies for preparing a procedure. Techniques for research and sourcing information, tips for writing, and how compulsive use of good style will help you get organized--and help you get buy-in from management.
- "Beta tests for people": how to conduct effective rehearsals which will accustom people to working together and catch problems before the real thing. Events involving database upgrades are particularly challenging, as the datasets involved in production systems tend to be huge and server topologies complex. We will discuss techniques to simulate such environments.
- "Make it happen": how to execute the procedure, keep people on track, and deal with the unexpected.
- "Afterglow": only by effectively and honestly reviewing what happened can you learn from your experiences--and that's essential to avoid making the same mistakes in future events.
There's nothing like learning by doing, so after the theory we will work through a mock event on a small cluster of systems. We'll build up a procedure piece by piece, rehearse, and then execute, all with the participation of the tutorial attendees.
If you're involved with managing or executing changes to production environments--be it systems upgrades, database administration, application development, or network engineering--then this tutorial is for you.