- Ten Commandments for Building a "World-Class" Infrastructure
- Thou shalt measure customer satisfaction.
- Thou shalt structure and mentor thy organization to focus on Reliability, Availability, and Serviceability (RAS).
- Senior IT management must focus at least 50% of their time, resource, and budget on organization, people, and process initiatives.
- Honor thy mainframe disciplines, and keep them holy—but keep out the bureaucracy!
- Keep all production systems equal in the eyes of the IT staff.
- Thou shalt maintain centralized control for infrastructure standards and processes.
- Thou shalt design the infrastructure as an internal XSP.
- Thou shalt build an attractive, cost-effective, and flexible IT infrastructure and thy customers will come.
- Measure all; verily, you cannot manage what you do not measure.
- Harris Kern's Enterprise Computing Institute
III. Thou shalt structure and mentor thy organization to focus on Reliability, Availability, and Serviceability (RAS).
Many companies structure the organization to support a particular technology. Whether you run your business on NT, UNIX, or some other system, never structure to focus on a particular technology. This is one of the most common mistakes in IT. IT should always structure to support RAS. With RAS as the focus of your organization, the entire computing environment will flourish.
These are the key functions and elements to focus on while structuring your organization to support RAS:
Mission-critical versus non-mission-critical systems. When striving for RAS, it's essential to define the scope of production. Identify which systems are truly mission-critical to the company: how much revenue will be lost if a given system is down for N minutes? Don't proclaim everything mission-critical. Be frugal. If you try to take on the world, you will surely fail.
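The revenue question above can be turned into a simple screening calculation. The following is a minimal sketch, not a prescribed method: the system names, the revenue-per-minute figures, the 30-minute outage, and the loss threshold are all hypothetical values chosen for illustration, and a real assessment would also weigh factors that do not show up as direct revenue.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    revenue_per_minute: float  # hypothetical estimate of revenue tied to this system


def downtime_cost(system: System, minutes_down: float) -> float:
    """Revenue lost if the system is down for the given number of minutes."""
    return system.revenue_per_minute * minutes_down


def is_mission_critical(system: System, minutes_down: float, loss_threshold: float) -> bool:
    """Treat a system as mission-critical when the outage cost reaches the threshold."""
    return downtime_cost(system, minutes_down) >= loss_threshold


# Hypothetical inventory and policy values (assumptions, not from the article).
systems = [System("order-entry", 500.0), System("intranet-wiki", 0.5)]
OUTAGE_MINUTES = 30
LOSS_THRESHOLD = 10_000.0

critical = [s.name for s in systems if is_mission_critical(s, OUTAGE_MINUTES, LOSS_THRESHOLD)]
print(critical)
```

Keeping the threshold explicit makes it harder to declare every system mission-critical by default.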
Production control. The production control organization was established in the early 1970s to provide a QA function to support mission-critical applications that were housed in the data center. In the late 1980s and early 1990s it became a forgotten function, pushed aside just like the mainframe because it had been labeled bureaucratic. That's unfortunate, because production control had nothing to do with bureaucracy. Its sole responsibility was to preserve RAS. This function is more critical today than it was 20 years ago, and its activities aren't that much different:
Provide second-level production support.
Participate in the disaster-recovery process/drills.
Keep new applications and major application revisions out of production until they have been thoroughly tested and documented.
Breed technical resources.
Provide centralized ownership/accountability for key processes such as change management, storage management, and so on.
Maintain system-management tools.
Assist senior systems programmers in the installation, support, and documentation of new systems.
Provide training to other groups within IT on newly installed system-management tools.
Three-tier support structure. This is one of the best structures ever designed, and probably the single most important reason that everything the data center stood for was so successful. Some of the roles and responsibilities of this structure are listed in the following table.
| Level | Roles and responsibilities |
| --- | --- |
| Level 1 | Monitor the systems (servers, network, peripheral devices). Perform incremental and full backups. Provide tape librarian functions. Assist in the physical layout of production servers. Issue trouble tickets and monitor the data center on a 24-hour by 7-day basis. First-level problem determination and resolution attempt. After N number of minutes, as determined by the problem-management process, the problem will be escalated to second-level support. |
| Level 2 | Process design, implementation, ownership, and accountability (production acceptance, change management, and so on). Support software installation and configuration. Perform system maintenance as required. Perform storage-management functions. 24 hours a day, 7 days a week on-call support. Perform disaster-recovery drills. Define and reset standards to support mission-critical applications. Problem determination and attempted resolution. After N minutes as determined by the problem-management process, the problem will be escalated to third-level support. A side note: There should be fear instilled into second-level support staff before escalating to third level. The group's goal is to do everything possible to resolve the problem here before escalating to the senior gurus of the department. |
| Level 3 | Physical location of the server, network connections, and sufficient power for all peripherals. Preventive maintenance diagnostics on all incoming equipment. Apply patches to the operating system as needed. Assist database administration with RDBMS installations. Install any unbundled products, such as tape management and disk mirroring, applying patches to unbundled products as needed. Support of software installation and configuration. Maintain and configure system security. Perform system maintenance as required. 24 hours a day, 7 days a week on-call support. Perform disaster-recovery drills. Tune systems for peak performance. Implement capacity planning. Perform security audits; monitor security access. Establish system user accounts' root ownership. Define and reset standards to support mission-critical applications. Problem resolution; the buck stops here. If they cannot fix the problem, no one can. Design and architect infrastructure-related programs. |
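
The time-based escalation between the three levels can be sketched in a few lines. This is an illustrative sketch only: the article leaves "N minutes" to the problem-management process, so the limits of 15 and 60 minutes and the names `ESCALATION_LIMITS` and `support_level` are hypothetical placeholders.

```python
# Minutes each level may work a problem before it escalates.
# Hypothetical values; the real N comes from the problem-management process.
ESCALATION_LIMITS = {1: 15, 2: 60}


def support_level(minutes_open: int, limits: dict = ESCALATION_LIMITS) -> int:
    """Return which support level should own a problem open for minutes_open minutes."""
    remaining = minutes_open
    for level in (1, 2):
        if remaining < limits[level]:
            return level
        # This level's time budget is used up; escalate to the next level.
        remaining -= limits[level]
    # Level 3 is the last stop: there is no further escalation.
    return 3


for minutes in (10, 15, 74, 75, 600):
    print(minutes, "->", support_level(minutes))
```

With these placeholder limits, a problem stays at level 1 for the first 15 minutes, at level 2 for the next 60, and rests with level 3 after that, however long it remains open.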