What Do We Do with Log Files?
Let's assume that an exception occurred in the process shown in Figure 1, and we now have an entry in a log file that corresponds to a serious problem. What happens next? Traditionally, a developer, help desk operator, or other expert is tasked with problem determination. This process involves reading the log file, sometimes cross-referencing it with the source code (such as that in Figure 1), and attempting to work out what caused the problem. The cause may be a software bug, or it may be a legitimate runtime condition, such as the file being open and locked. Overall, this is a complex business!
Autonomic Computing and Log Files
Autonomic computing aims to handle these log file operations automatically, without human intervention. This strategy requires standardized log file entries, for which IBM defines the Common Base Event (CBE) format. CBE is based on a structured three-part format:
- The particular component affected by a situation (the source)
- The component observing a situation
- The situation data: properties describing the situation, including correlation information
When all logging follows this format, an automated system can determine which component is failing and take appropriate action, such as stopping and restarting that software component. Not every entry calls for action, however: a log file entry may be purely informational, or it may reflect a transient problem. An informational message may be something like "Server Up", which is useful to know but not a problem. A transient problem is one such as the incorrect password entry described earlier. A more serious problem occurs when a particular software entity is not responding to an incoming message. In that case, intervention (automated or human) may be required.
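To make the idea concrete, here is a minimal Python sketch of how a three-part event record might be triaged automatically. It is an illustration only: the `Event` fields, the situation labels, and the severity mapping are hypothetical names chosen for this example and do not reproduce the actual CBE schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    INFORMATIONAL = "informational"  # e.g. "Server Up"
    TRANSIENT = "transient"          # e.g. an incorrect password entry
    SERIOUS = "serious"              # e.g. a component that is not responding


@dataclass
class Event:
    """Illustrative three-part record (field names are assumptions, not CBE's)."""
    source: str      # component affected by the situation
    reporter: str    # component observing the situation
    situation: str   # hypothetical situation label
    properties: dict = field(default_factory=dict)  # e.g. correlation data


# Hypothetical mapping from situation label to severity.
SEVERITY_BY_SITUATION = {
    "SERVER_UP": Severity.INFORMATIONAL,
    "BAD_PASSWORD": Severity.TRANSIENT,
    "NOT_RESPONDING": Severity.SERIOUS,
}


def classify(event):
    # Unknown situations are treated as serious so they are not silently dropped.
    return SEVERITY_BY_SITUATION.get(event.situation, Severity.SERIOUS)


def components_needing_intervention(events):
    """Return affected components with serious events, in first-seen order."""
    affected = []
    for event in events:
        if classify(event) is Severity.SERIOUS and event.source not in affected:
            affected.append(event.source)
    return affected
```

Because every entry names the affected component in the same field, a function such as `components_needing_intervention` can pick out candidates for a restart without anyone having to read free-form log text.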
The issue of problem resolution has been around for many years in network management and has been solved to some extent with the use of notifications or traps: autonomously generated messages that network devices emit when some important event occurs. The messages are sent to a network management system for analysis and corrective action (if required). The format and meaning of the notification messages are generally known ahead of time by the management system. This design provides the capability to define policies in the management system for dealing with each notification. [1] Autonomic computing takes this very lightweight version of automatic problem resolution much further.
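The policy idea can be sketched as a lookup table that maps each known notification type to a handler. The following Python sketch is a simplified assumption of how such a table might look, not a description of any particular management product. The type names `linkDown` and `coldStart` are borrowed from SNMP's generic traps, while the notification dictionary layout and the actions themselves are hypothetical.

```python
def restart_interface(notification):
    # Hypothetical corrective action; a real system would talk to the device.
    return f"restart interface {notification['detail']} on {notification['device']}"


def log_only(notification):
    # Hypothetical policy for notifications that need no corrective action.
    return f"logged {notification['type']} from {notification['device']}"


# Policy table: notification type -> action. Because the format and meaning of
# each notification are known ahead of time, the table can be defined up front.
POLICIES = {
    "linkDown": restart_interface,
    "coldStart": log_only,
}


def handle(notification, policies=POLICIES):
    """Apply the policy for a notification, or escalate if none is defined."""
    action = policies.get(notification["type"])
    if action is None:
        return f"escalate unknown notification {notification['type']!r} to an operator"
    return action(notification)
```

A notification with no matching policy is escalated rather than ignored, which mirrors the "automated or human" intervention split described earlier.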