Policy Definition and Usage Scenarios for Self-Managing Systems
Managing IT (information technology) infrastructure is hard. From Fortune 500 enterprises to small businesses and from nationwide data centers to personal computers in homes, an inordinate amount of time and effort is spent on managing IT. Management and operational expenses are taking an increasingly large share of the IT budget in many organizations, and a major part of that share is attributed to the complexity of the systems that need to be managed.
IT management is a labor-intensive task, and skilled administrators need to intervene frequently to keep the IT infrastructure running. The exponential increase in the size of IT infrastructures coupled with increasing technical complexity has led to a situation where, despite automation, remote management, and off-shoring, the fundamental problem remains unsolved: there are not enough skilled people to ensure seamless operation of IT systems. This has driven research and industry to look for management frameworks that go beyond the direct human manipulation of network devices and systems [AUTO]. One approach toward this aim is to build policy-based management systems (PBMS). Policy-based management refers to a software paradigm developed around the concept of building autonomous systems, that is, systems that manage themselves with minimum input from human administrators. This paradigm provides system administrators and decision makers with interfaces that let them set general guiding principles and policies to govern the behavior and interactions of the managed systems. Although large portions of the IT management chores are still carried out manually and in an ad hoc manner, policy-based management systems are maturing and can be found in areas such as data center management, privacy, security and access management, and the management of quality of service and service level agreements in networks. The main objective of this book is to provide the reader with a firm understanding of what policy-based management systems are, how they can be used to reduce the cost of IT administration, and the state of the art in policy-based management in real life.
1.1 Formal Definition of Policy
The word “policy” has its origins in government and regulations and its source is Middle English and Middle French. If we open a dictionary and look for the word “policy” we may find the following definitions [MERR]:
- A definite course or method of action selected from among alternatives and in light of given conditions to guide and determine present and future decisions.
- A high-level overall plan embracing the general goals and acceptable procedures especially of a governmental body.
The reader may notice that even though both definitions convey a very similar idea, that is, a policy is a plan or course of action, the distinction between the two comes from the specificity of the plan. In the first case the plan is definite and concrete, whereas the second definition refers to a high-level plan. On many occasions, the term “policy” is used interchangeably with “regulation,” although regulations place more emphasis on enforcement, usually describing authoritative rules dealing with details or procedures [MERR]. In general, the word “policy” is used in a broad spectrum of situations in common English.
The use of the word “policy” in computer science, networking, and information technology has experienced a similar phenomenon. It has been used to describe among other things: regulations, general goals for systems management, or prescriptive plans of action. A few examples of where the term has been applied are access control policies, load balancing policies, security policies, back-up policies, firewall policies, and so on. We also find references to policy in high-level programming languages and systems generally referred to as Business Rules Systems [CJDA].
In many cases policies are equated with system configuration. Take, for example, policies in Microsoft® Exchange 2000 servers. In Exchange, a policy is “a collection of configuration settings that are applied to one or more Exchange configuration objects.... You can define a policy that controls the configuration of some or all settings across a server or other objects in an Exchange organization....”
Given the variety of usage of the word “policy,” we first need to precisely define what we mean by policy. Heuristically, a policy is a set of considerations designed to guide decisions on courses of actions. Policies usually start as natural language statements. From these descriptions, many details need to be sorted out before policies can be implemented. Consider the following statement usually implemented as a default policy in Apache Web servers:
Do not allow the execution of CGI scripts.
The policy is activated by setting the value of an appropriate variable in a configuration file. During initialization the Web server reads the configuration file and adjusts its behavior accordingly: when interpreting and serving documents to Web clients, it will throw an exception if it encounters a CGI script as the source of the document to be rendered for the Web client.
Compare this policy to the following example from banking regulations:
A currency transaction report (CTR) must be filed with the federal government for any deposit of $10,000 or more into a bank account.
This statement, extracted from the Money Laundering Suppression Act enacted by the U.S. Congress in 1994, is a typical policy regulation that banks must implement. In modern bank systems, the implementation will probably be done using database triggers.
The implementation of these two policies has little in common. However, there is significant commonality in the specification. First, both policies identify a target system: the computer where the Web server is running and the bank information system. Second, both policies express constraints over the behavior of the target system.
From the point of view of high-level policy specification, what the system is or how the system is implemented is not relevant. The policy merely indicates how to regulate the behavior of a system by indicating the states that the system can or cannot take. In the Web server example, if we take a snapshot of the state of the server at any moment in time, the policy indicates that we should never find a process associated with a CGI script that was started by the Web server. In the banking example, if we take a snapshot of the system and find a transaction containing a deposit of $10,000 or more, the snapshot must also contain the generation of a CTR. Accordingly, to specify a policy we first need to identify three things:
- The target of the policy, which we will call the target system. A target system may be a single device such as a notebook computer or workstation, or it can be a complex system such as a data center or a bank information system consisting of multiple servers and storage systems.
- A set of attributes associated with the target system. The value of an attribute can be a simple number or text string, or it can be as complex as a structured object containing other attributes. At this moment we do not need to define a data model for attributes; we need to know only that these attributes are identifiable and accessible and that they take values from a predefined set of types.
- The states that the target system can take at any given time, which are defined by an assignment of values to the system attributes.
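The three ingredients above can be illustrated with a minimal Python sketch. It assumes that a state is represented as a plain dictionary from attribute names to values, and every name in it (TargetSystem, the attribute keys, the two constraint functions) is hypothetical and chosen only for illustration. The two natural language policies discussed earlier then become predicates over a single snapshot of the state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable

# A state is an assignment of values to the attributes of the target system.
State = Dict[str, Any]
# A constraint on a single state is a predicate over one snapshot.
StateConstraint = Callable[[State], bool]


@dataclass(frozen=True)
class TargetSystem:
    """Hypothetical handle for a target system, identified by a logical name."""
    name: str


def no_cgi_processes(state: State) -> bool:
    """Web server example: no process started by the server is a CGI script."""
    return not any(
        p["started_by_web_server"] and p["is_cgi_script"]
        for p in state["processes"]
    )


def ctr_filed_for_large_deposits(state: State) -> bool:
    """Banking example: every deposit of $10,000 or more has a CTR on file."""
    filed = set(state["ctr_filed_for"])
    return all(
        d["id"] in filed for d in state["deposits"] if d["amount"] >= 10_000
    )


def satisfies(state: State, constraints: Iterable[StateConstraint]) -> bool:
    """A snapshot is acceptable if it satisfies every constraint."""
    return all(c(state) for c in constraints)


web_server = TargetSystem("web-server-01")
web_snapshot: State = {
    "processes": [
        {"pid": 101, "started_by_web_server": True, "is_cgi_script": False},
        {"pid": 205, "started_by_web_server": False, "is_cgi_script": True},
    ]
}
bank_snapshot: State = {
    "deposits": [{"id": "d1", "amount": 12_500}, {"id": "d2", "amount": 300}],
    "ctr_filed_for": ["d1"],
}
```

Note that neither predicate depends on how the Web server or the bank system is implemented; each needs only a handful of identifiable attributes.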
In practice, there are many alternatives for the definition and identification of target systems. For example, the computer system where the Web server is running could be identified by an IP address; or we can group subsystems and identify the group with a unique logical name, for example, all the computers on the second floor of an office building. There are also many ways to define and get the values of system attributes. For example, an attribute of a computer system could be a set of objects representing the processes running in the computer system at a given time. These objects could be complex objects with testable properties that identify whether the object represents a process that has been started by the Web server, and whether it is a CGI script.
However, the behavior of a system is not completely characterized by the set of states it is in. A definition of “behavior” needs to take into consideration how the system moves through these states. Given that policies constrain the behavior, it is not surprising to find policies that constrain these state transitions. Consider the following example:
If a credit card authorized for a single person has been used within 1 hour in different cities that are at least 500 miles apart, reject the charge and suspend the credit card immediately.
This policy is also a constraint. But in contrast to our previous policies, the constraint is not imposed on a single state of the system, but on at least three states: the state of the system at the time the credit card is first used; the state at the time when a second use of the credit card is detected and the transaction needs to be rejected; and any state in the future where credit card transactions must be rejected.
Thus, we will define the behavior of a system to be a continuous ordered set of states, where the order is imposed by time. Consider a system S that may behave in many ways. Let B(S) be the set of all possible behaviors the system S can exhibit (that is, any possible continuous ordered set of states).
Definition: A policy is a set of constraints on the possible behaviors B(S) of a target system S; that is, it defines a subset of B(S) of acceptable behaviors for S.
We note that this is a very generic definition, and it does not say how policies can be implemented or enforced. Implementations will require systems to provide operations that can affect their behavior. If there is no way to affect the behavior of the system, we will not be able to implement policies. These operations are special attributes of the system that policies can use. We will generically refer to these operations as actions. Note also that even though the system states can change continuously, implementations will be able to observe only discrete changes.
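To make the definition concrete, the following Python sketch treats a behavior as a finite, time-ordered list of observed states, which matches the remark that implementations observe only discrete changes. It is an illustration under several assumptions: timestamps are in hours, card locations are reduced to a one-dimensional position in miles so that distance is a simple subtraction, every attempted use counts as a use, and all names are hypothetical. A policy is then a predicate over behaviors, and the acceptable subset of B(S) is obtained by filtering.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
# A behavior as observed: (timestamp in hours, state), ordered by time.
Behavior = List[Tuple[float, State]]
# A policy accepts or rejects a whole behavior, not just a single state.
Policy = Callable[[Behavior], bool]


def card_policy(behavior: Behavior) -> bool:
    """Credit card example: once two uses occur within 1 hour at places at
    least 500 miles apart, that charge and all later ones must be rejected."""
    suspended = False
    history: List[Tuple[float, float]] = []  # (time, position) of earlier uses
    for t, state in behavior:
        pos = state["position_miles"]
        # Compare this use against every earlier use in the last hour.
        if any(t - pt <= 1.0 and abs(pos - pp) >= 500 for pt, pp in history):
            suspended = True
        if suspended and state["charge_accepted"]:
            return False  # an accepted charge on a suspended card
        history.append((t, pos))
    return True


def acceptable(behaviors: List[Behavior], policy: Policy) -> List[Behavior]:
    """The subset of the given behaviors that the policy allows."""
    return [b for b in behaviors if policy(b)]
```

The predicate reads only two attributes of each state, which illustrates that a policy can be defined without knowing the complete state of the system.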
In many real-life systems, the state of the system may not be completely defined or known. Note that determining the full state of a system is not necessary to use a policy-based approach. Policies can be defined using only a small number of attributes of the system state and do not require the complete state to be determined a priori.
Let us return to our Web server example. We have noticed that activating the policy to restrict the execution of all CGI scripts is straightforward: we set the appropriate variable in the configuration file of the Apache server and restart the server, which then takes care of the rest by itself. Now let us take a more interesting policy. We can create policies that will allow different sets of users to execute different sets of CGI scripts. The implementation of a policy like this in a standard Apache server is not that obvious. One could try to implement this policy by creating a directory structure that reflects the different sets of scripts with links to the scripts from the appropriate directories, and creating access control files for each directory with the different sets of users that have access to the scripts. Thus the policy would be enforced by giving user names and passwords to the users and forcing users to authenticate themselves before executing any of the scripts. A serious drawback of this implementation is that changes in either the set of users or the set of scripts may require reshuffling of the directories and changes in several access control files. The difficulty arises because there is no obvious connection between what the policy wants to enforce and how it is enforced. In the simple CGI policy, how the policy is implemented is hidden inside the implementation of the Web server and the implementer needs merely to turn the policy on or off. For the second case, having only the possibility of turning the CGI script execution policy on or off is too restrictive because an essential component of what the policy wants to constrain is conveyed by the different sets of users and scripts.
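As a rough illustration of what such a policy looks like when only the "what" is stated, consider the following Python sketch. It is not an Apache feature or configuration format; the group names, user names, script paths, and the may_execute function are all hypothetical. The policy is just three pieces of data, and a change in the set of users or scripts is a change to one of them rather than a reshuffling of directories and access control files.

```python
from typing import Dict, Set, Tuple

# The "what": sets of users, sets of scripts, and which user set may
# execute which script set. All names here are made up for illustration.
user_groups: Dict[str, Set[str]] = {
    "staff": {"alice", "bob"},
    "guests": {"carol"},
}
script_groups: Dict[str, Set[str]] = {
    "reports": {"/cgi-bin/report.cgi"},
    "search": {"/cgi-bin/search.cgi"},
}
grants: Set[Tuple[str, str]] = {
    ("staff", "reports"),
    ("staff", "search"),
    ("guests", "search"),
}


def may_execute(user: str, script: str) -> bool:
    """True if some grant links a group containing the user to a group
    containing the script."""
    return any(
        user in user_groups[ug] and script in script_groups[sg]
        for ug, sg in grants
    )
```

How the decision is enforced (authentication, directory layout, server hooks) is deliberately left out of the sketch; that is the part a policy-based management system would be expected to supply.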
A policy-based management system aims to provide policy authors and implementers with an environment where they can concentrate their efforts on describing what the policy restricts, thereby alleviating the burden of having to describe how the policy will be enforced. The degree of this separation of what from how varies widely among different systems and applications, and in practice most policy authors are still required to have at least a partial understanding of policy implementation.
1.1.1 Types, Nature, and Usage of Policies
As defined earlier, policies are constraints on the behavior of a system, and system behavior is a sequence of system states. In turn, each state of a system can be characterized by the values that a collection of system attributes takes. In this section, we enumerate some of the common types of constraints specified on the system behavior, and discuss how they result in different types of policies.
The attributes of a state can be divided into three groups—a set of fixed attributes, a set of directly modifiable attributes, and other observable but not directly modifiable attributes. The fixed attributes of a system cannot be modified directly or indirectly. As an example, a server in a data center has a state characterized by attributes such as maximum number of processes, size of virtual memory, size of physical memory, amount of buffer space for network communication, processor utilization, disk space utilization, memory utilization, time taken to respond to a user command, and so on. Among these, the size of physical memory is a fixed attribute for the purposes of systems management—it cannot be changed until the hardware of the server itself is modified. Some of these attributes, such as the maximum number of processes, size of virtual memory, or the amount of buffer space, can be modified directly by changing some values in a configuration file. Other attributes, such as processor or memory utilization, cannot be modified directly. They can be manipulated only by modifying the direct parameters or taking some other action—for example, by killing a running process. We define the set of directly modifiable attributes of a system as its configuration attributes. Furthermore, the set of attributes that are not directly modifiable, but can be observed or computed from observation of system attributes, are defined as system metrics. The configuration of a system, using these conventions, is the collection of the configuration attributes and the assignment of values to them.
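The grouping of attributes can be written down as a small Python sketch. The attribute names are hypothetical renderings of the server example above. The text explicitly classifies physical memory as fixed, the maximum number of processes, virtual memory, and buffer space as configuration attributes, and processor and memory utilization as metrics; placing disk space utilization and response time among the metrics is our assumption by analogy.

```python
from enum import Enum
from typing import Any, Dict


class AttributeKind(Enum):
    FIXED = "fixed"                  # cannot be modified by management
    CONFIGURATION = "configuration"  # directly modifiable
    METRIC = "metric"                # observable, only indirectly modifiable


SERVER_ATTRIBUTES: Dict[str, AttributeKind] = {
    "physical_memory_size": AttributeKind.FIXED,
    "max_processes": AttributeKind.CONFIGURATION,
    "virtual_memory_size": AttributeKind.CONFIGURATION,
    "network_buffer_space": AttributeKind.CONFIGURATION,
    "processor_utilization": AttributeKind.METRIC,
    "memory_utilization": AttributeKind.METRIC,
    # Assumption: classified as metrics by analogy with the two above.
    "disk_space_utilization": AttributeKind.METRIC,
    "command_response_time": AttributeKind.METRIC,
}


def select(state: Dict[str, Any], kind: AttributeKind) -> Dict[str, Any]:
    """Project a state onto the attributes of one kind."""
    return {k: v for k, v in state.items() if SERVER_ATTRIBUTES.get(k) is kind}


def configuration(state: Dict[str, Any]) -> Dict[str, Any]:
    """The configuration: configuration attributes with their values."""
    return select(state, AttributeKind.CONFIGURATION)
```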
The simplest policy type specifies an explicit constraint on the attributes of the state that the system can take, thereby limiting system behavior:
Configuration Constraint Policy: This type of policy specifies constraints that must be satisfied by the configuration of the system in all possible states. These may include allowable values for an individual configuration attribute, minimum and maximum bounds on the value of an individual configuration attribute, relationships that must be satisfied among different configuration attributes, or allowable values for a function defined over the configuration attributes. Some examples of configuration constraint policies are as follows:
- Do not set the maximum threads attribute on an application server over 50.
- The size of virtual memory in the system should be less than two times the size of physical memory.
- Only users in the administration group have access to the system configuration files.
Configuration constraint policies are often used to ensure correct configuration of a system, to self-protect the system from operator errors, and to prevent the system from entering the operational modes that are known to be harmful.
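A minimal Python sketch of the three example policies follows, assuming the relevant attributes are available in a dictionary under hypothetical names. It also shows one way such constraints could protect against operator errors: a proposed change is checked before it is applied and is refused if the resulting configuration would violate any constraint.

```python
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]

# Each constraint is a named predicate over the attribute values.
CONSTRAINTS: Dict[str, Callable[[Config], bool]] = {
    "max threads must not exceed 50":
        lambda c: c["max_threads"] <= 50,
    "virtual memory must be less than twice physical memory":
        lambda c: c["virtual_memory_mb"] < 2 * c["physical_memory_mb"],
    "only administrators may access configuration files":
        lambda c: set(c["config_file_users"]) <= set(c["administration_group"]),
}


def violations(config: Config) -> List[str]:
    """Names of all constraints that the configuration violates."""
    return [name for name, ok in CONSTRAINTS.items() if not ok(config)]


def apply_change(config: Config, changes: Config) -> Config:
    """Return the updated configuration, or refuse the change if the
    result would violate a configuration constraint policy."""
    proposed = {**config, **changes}
    broken = violations(proposed)
    if broken:
        raise ValueError("change rejected: " + "; ".join(broken))
    return proposed
```

For example, raising max_threads from 40 to 80 through apply_change would raise a ValueError, while raising it to 50 would be accepted.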
Metric Constraint Policy: This type of policy specifies constraints that must be satisfied by the system metrics at all times. Unlike configuration attributes, the metrics of a system cannot be manipulated directly. The system needs to determine in an automated manner how to manipulate the configuration of the system, or to take appropriate actions, such that the constraints on the metrics are satisfied. The constraints on the metrics may include bounds on any observable metrics, or relationships that must be satisfied among a set of system attributes including at least one metric attribute. Metric constraint policies that specify an upper or lower bound on a metric are also known as goal policies because they provide a goal for that metric, which the system should strive to achieve. Examples of metric constraint policies include the following:
- Keep the CPU utilization of the system below 50%.
- All directory lookups on the name of a person should be completed in less than a second.
- The end-to-end network latency should be kept below 100 milliseconds.
Metric constraint policies are often used to enable self-configuration of systems in order to meet specific performance requirements or objectives.
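The three goal policies above can be sketched in Python as upper bounds on named metrics. The metric names and the Goal class are hypothetical. The sketch only detects which goals are currently unmet; deciding how to change the configuration so that a goal is met again is exactly the part that the system must work out for itself and is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Goal:
    """A goal policy: keep the named metric strictly below the bound."""
    metric: str
    upper_bound: float


GOALS = [
    Goal("cpu_utilization_percent", 50.0),
    Goal("directory_lookup_seconds", 1.0),
    Goal("network_latency_ms", 100.0),
]


def unmet_goals(metrics: Dict[str, float]) -> List[Goal]:
    """Goals whose metric is observed at or above its bound.
    Metrics that were not observed are skipped, not treated as violations."""
    return [
        g for g in GOALS
        if g.metric in metrics and metrics[g.metric] >= g.upper_bound
    ]
```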
Action Policy: In the preceding examples, the two types of policies described specify constraints on a single system state. In many cases, a policy may require explicit actions to be taken when the state of a target system satisfies some constraints. These types of policies are called action policies because they require the system to take a specific set of actions. Action policies constrain a sequence of states. That is, when a particular state is observed then certain actions must be taken at a later point so that the target system will be in some other state. In most cases, the action policy would modify the configuration of the system in response to some condition being true. Action policies essentially provide a plan according to which the system should operate when it encounters a certain condition specified in the policy. Examples of action policies include the following:
- If the CPU utilization of a server in a data center exceeds 70%, allocate a new server to balance the workload.
- If the temperature of the system exceeds 95 degrees Celsius, then shut down the system.
- If the number of bytes used by a hosted site exceeds 1 Gbyte in a month, then shut down access to the site.
- If the inbound packet has a code-point for expedited forwarding (EF) per-hop behavior (PHB) in the packet header, then put it in the high priority queue.
In these examples, action policies have been used to manage the performance of computer servers and networks, for managing the effect of environmental conditions, for limiting resource utilization, and for providing different Qualities of Service in communications networks.
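An action policy pairs a condition on the observed state with an action that changes the system. The Python sketch below models two of the examples above. The state is a plain dictionary with hypothetical attribute names, and the actions simply return an updated copy of the state, standing in for real operations such as provisioning a server.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass(frozen=True)
class ActionPolicy:
    name: str
    condition: Callable[[State], bool]  # when the policy applies
    action: Callable[[State], State]    # what must then be done


def allocate_server(state: State) -> State:
    """Stand-in for provisioning: one more server shares the workload."""
    return {**state, "servers": state["servers"] + 1}


def shut_down(state: State) -> State:
    """Stand-in for powering the system off."""
    return {**state, "running": False}


POLICIES: List[ActionPolicy] = [
    ActionPolicy("scale-out",
                 lambda s: s["cpu_utilization_percent"] > 70,
                 allocate_server),
    ActionPolicy("thermal-shutdown",
                 lambda s: s["temperature_celsius"] > 95,
                 shut_down),
]


def enforce(state: State, policies: List[ActionPolicy] = POLICIES) -> State:
    """Apply, in order, the action of every policy whose condition holds."""
    for policy in policies:
        if policy.condition(state):
            state = policy.action(state)
    return state
```

The sketch makes the constraint on a sequence of states visible: the state passed to enforce and the state it returns are two consecutive points in the behavior of the system.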
Not all action policies specify an action that can be directly executed on a system. One important type of action policy is the alert policy, which is commonly used to flag any conditions that may require operator intervention.
Alert Policy: An alert policy is an action policy where the action consists of a notification sent out to another entity. A notification is an action that does not modify the configuration of the system itself. Instead it can take one or more of the following forms: sending an email or an SMS message, making a phone call, logging a message in a file, or displaying an alert visually on a display. Some examples of alert policies are as follows:
- Notify all users who have not accessed their account for three months by email to warn of possible account deletion.
- If a system has not installed the latest version of anti-virus software, send an email to the employee and his/her manager.
- If a system has gone down, send a message to the administrator’s pager.
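The first alert policy above can be sketched in Python as follows. The sketch assumes that "three months" can be approximated as 90 days and that the time of last access is known per user; the function name and the message text are hypothetical. The function only produces notifications and never touches the configuration of the system, which is what distinguishes an alert policy from other action policies.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumption: "three months" is approximated as 90 days.
INACTIVITY_LIMIT = timedelta(days=90)


def inactive_account_alerts(
    last_access: Dict[str, datetime], now: datetime
) -> List[Tuple[str, str]]:
    """Return (recipient, message) pairs for users inactive too long.
    Actually sending the email is left to whatever mail facility exists."""
    message = "Your account has been inactive and may be deleted."
    return [
        (user, message)
        for user, seen in sorted(last_access.items())
        if now - seen > INACTIVITY_LIMIT
    ]
```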
Although real-life policies, as shown here, are specified in various different styles, all of these policies can be restructured using a common pattern or a model. Formally, this model is called the policy information model. One of the most widely used policy information models describes a policy using a condition-action rule, which means if the condition is true then perform the action. A more specific version of the condition-action rule is the event-condition-action (ECA) rule, which means upon occurrence of the event, if the condition is true then perform the action. It is not difficult to see that the preceding policy rules can be transformed into some version of ECA rules. For example, the metric constraint “The end-to-end network latency should be kept below 100 milliseconds” can be rewritten as “Upon completion of measurement, if the end-to-end network latency is above 100 milliseconds, then record the violation in the system log file.” The policy information model is a useful framework to describe, compare, and analyze various different policy rules. In Chapter 3, “Policy Information Model,” we will review some of the popular policy information models that are being widely used.
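The ECA form of the latency example can be sketched in Python as follows. This is an illustration only and not one of the policy information models reviewed later in the book; the EcaRule class, the event name, and the in-memory list standing in for the system log file are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

EventData = Dict[str, Any]


@dataclass(frozen=True)
class EcaRule:
    event: str                              # upon occurrence of the event
    condition: Callable[[EventData], bool]  # if the condition is true
    action: Callable[[EventData], None]     # then perform the action


system_log: List[str] = []  # stand-in for the system log file

latency_rule = EcaRule(
    event="measurement_completed",
    condition=lambda d: d["latency_ms"] > 100,
    action=lambda d: system_log.append(
        f"violation: end-to-end latency {d['latency_ms']} ms"
    ),
)


def dispatch(rules: List[EcaRule], event: str, data: EventData) -> int:
    """Run every rule registered for the event whose condition holds.
    Returns the number of rules that fired."""
    fired = 0
    for rule in rules:
        if rule.event == event and rule.condition(data):
            rule.action(data)
            fired += 1
    return fired
```

A condition-action rule is the special case in which the rule is evaluated on every observation rather than on a named event.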
Having defined policies as constraints on the operation of a system, let us examine how the specification of such constraints can help in the management of IT systems. The specification of constraints on the state of the system can be used for several purposes, such as the following:
- When the demand or workload on a system changes requiring a reconfiguration of the system, the constraints can be used to determine a desirable new configuration.
- When there is a contention for resources in the system, the constraints can be used to determine the manner in which to resolve that contention.
- When any external entity tries to access the resources in the system, the constraints can be used to determine whether that access ought to be permitted.
- When a system violates certain constraints, it can determine and execute a set of actions that will allow it to remove that violation.
Policies can be used to build systems that are autonomic—that is, exhibit the properties of self-configuration, self-protection, self-optimization, and self-healing. A self-configuring system would configure itself according to its intended function. A self-protecting system would identify threats to itself and take corrective actions. A self-optimizing system would modify its configuration according to the current workload to maximize its performance. A self-healing system would automatically repair any damage done to its components. The manner in which policy technology can be used to enable the development of such systems is described in the next few subsections.