Lightning Talk at Linux Conference Australia
Simon Lyall grabbed me at registration yesterday morning and asked if I'd give a lightning talk at the system administration track. I get 10–15 minutes to talk about whatever I want, though what they want to hear about is system administration at Weta.
I hate talking off the top of my head, so here are the beginnings of the outline:
Statistics Overview
- ~4000 render procs (mostly IBM Blade servers)
- ~100T online storage (mostly Network Appliance)
- ~700T near line
- ~85 racks of gear
- Very simple network: big L2s separated by a single L3 core
Lessons
- Automate everything you can. Time-consuming tasks that repeat will kill you. Spend the time up front.
- Lights-out management is basically required; blade servers rule.
- OpenLDAP doesn't scale as well as you'd hope (we have big hopes for the Fedora Directory Server), but we'll probably go back to flat files since we don't have that many users. (Edit: OpenLDAP worked just fine with some more attentive performance tuning.)
- We only run one copy of Linux. All boxes (workstations, render wall, some servers) rsync to a "golden image" at boot time to stay in sync and get updates.
- You want to monitor everything by default. By the time you know you need to monitor something, it's too late; you NEED history to see problems.
- Vendor lock-in sucks; avoid it at all costs.
- The single biggest technical problem we have is the 16-group RPC limitation. Bring on NFSv4 ...
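
To make the 16-group point concrete, here is a rough sketch of the kind of audit you could run to find users who are over the limit. It is not what we actually run; the function names are made up for illustration, and it assumes the limit in question is the 16 supplementary group IDs that an AUTH_SYS (AUTH_UNIX) RPC credential can carry.

```python
from collections import defaultdict

# Assumption: AUTH_SYS (AUTH_UNIX) RPC credentials carry at most
# 16 supplementary group IDs.
RPC_GROUP_LIMIT = 16


def users_over_group_limit(group_members, limit=RPC_GROUP_LIMIT):
    """Return {user: group_count} for users in more than `limit` groups.

    `group_members` maps a group name to an iterable of member user names,
    the same shape as the member field of /etc/group.
    """
    counts = defaultdict(int)
    for members in group_members.values():
        for user in set(members):  # ignore duplicate entries within one group
            counts[user] += 1
    return {user: n for user, n in counts.items() if n > limit}


def load_local_groups():
    """Read supplementary group membership from the local group database (Unix only)."""
    import grp

    return {g.gr_name: g.gr_mem for g in grp.getgrall()}


if __name__ == "__main__":
    over = users_over_group_limit(load_local_groups())
    for user, n in sorted(over.items()):
        print(f"{user}: {n} groups (limit {RPC_GROUP_LIMIT})")
```

Keeping the counting separate from the lookup means the same check can be pointed at whatever the source of truth is (flat files, LDAP, something else) by swapping out `load_local_groups`.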
Moral
- So long as you keep things simple, you can get away with an amazing amount of stupid.