Sun ONE Portal Server 6 Best Practices
When proposing a technical solution for an specific problem, the first step is to collect functional and nonfunctional requirements. Generally, these requirements fall into the following categories:
Performance
Risk
Cost
Schedule
Most of the time, especially in complex systems, such as portals where content is aggregated from many different sources, these requirements conflict with each other. Usually security and ease of use call for different approaches. A very secure site can be hard to use because it requires complicated, hard-to-remember passwords, or there are stringent session and inactivity timeouts. In addition, availability and performance sometimes conflict. For example, to provide session failover, it is necessary to keep the session in sync on all of the servers. This synchronization adds some delay to each transaction.
The process of creating a solution involves understanding the trade-offs between all conflicting requirements and deciding what is more important for a successful implementation. This article presents some architectural guidelines that are frequently applied to Sun™ ONE Portal Server 6 software implementations and will help you to identify and understand potentially conflicting requirements on the performance and risk categories. After you understand these categories, you should be able to include the cost and schedule requirements when you define the final solution.
This article presents the best practices for high availability, security, and scalability that commonly have more impact on the success of a Sun ONE Portal Server software solution. In addition, the article includes guidelines for creating a Sun ONE Portal Server software solution that can be easily supported.
Before you read this article, you should have a detailed technical understanding of the Sun ONE Portal Server 6 software and the Sun ONE Portal Server Secure Remote Access components, such as the Gateway, the Netlet, the Rewriter proxies, and the search engine. Also, an in-depth knowledge of the embedded Sun ONE software products (for instance, the Sun™ ONE Directory Server, the Sun™ ONE Web Server, and the Sun™ ONE Identity Server) is required.
High Availability
Delivering high services levels is a top priority for all Sun ONE Portal Server software implementations. You can determine the availability of a system by using a simple equation, as shown in FIGURE 1.
FIGURE 1 Availability Equation
As the equation shows, if you decrease the downtime of the system, you can increase the availability of the system. However, when you measure the downtime, you must measure the total amount of time the system is unavailable, which should include the planned downtime (for example, maintenance, backups, and repairs to the system) and the unplanned downtime (for example, system or network failures). Some studies show that planned downtime can account for up to 80 percent of the total time a system is unavailable.
Thus, when you are architecting a solution, you must consider both the planned and unplanned downtime to ensure that you create a highly available solution. Availability is affected by all of the components in the system, such as the following infrastructure components:
Hardware
Network
Operating system
Applications
In addition to these infrastructure components, availability is also affected by people and processes, so when you are architecting a highly available solution, you must ensure that the people who will be supporting the solution have the proper training and skill sets, and you must ensure that clearly defined processes are in place to support the system.
For background information on the concept of availability, refer to "Availability - What It Means, Why It's Important, and How to Improve It" (Sun BluePrints™ OnLine, 1999).
In reference to availability, system types can be defined in four ways: noncritical, task critical, business critical, and mission critical. The noncritical system type is a basic system that has no requirements for availability. If the system goes down, it can be repaired in a matter of days without affecting users. This type of system is not important to the discussion of availability in this article.
Task-Critical Systems
Unlike the noncritical system, the task-critical system does have availability requirements. If the system goes down, it would affect users, and the performance of the system could be affected. The best way to achieve the availability levels required for this kind of system is by using redundancy of services. To optimize the usage of the system resources, all of the redundant components should be active (that is, they should not be in standby mode). Replication, load balancing, and service redundancy must be used to achieve this goal. FIGURE 2 shows the basic design of a task-critical system.
FIGURE 2 Task-Critical System Diagram
Gateway Server Availability
As FIGURE 2 shows, in this architecture, there are at least two gateway servers that are front-ended by a load balancer so that all of the requests are spread across the gateways. The load balancer must also be configured to detect failures in the gateways. If a gateway fails, then the load balancer sends all of the requests to the surviving gateway.
The gateways are a stateless process, so if a gateway fails, all of the sessions associated with that gateway can be redirected to the other gateway. Users will not perceive any downtime because the Portal Server session is maintained on the Portal Server nodes, not in the gateways.
When you use gateways, it is likely that the resource servers that are being accessed through the gateways will be on a private network that is protected by a firewall. In this case, you might want to use a web proxy to access these resource servers so that a single hole is open in the firewall. Even though the Sun ONE Portal Server software includes a rewriter proxy, it is not a fully functional web proxy server. For example, it does not support caching, the Internet Caching Protocol (ICP), and URL filtering. However, the Sun ONE Web Proxy Server software is a reliable, inexpensive, and highly configurable web proxy server, which in addition to these features, provides generic protocol support for a firewall traversal by using SOCKSv5.
Portal Server Instances
In the architecture depicted in FIGURE 2, the Portal Server instances are installed on the Sun ONE Web Server as web containers. The Sun ONE Web Server software does not support replication of user sessions across instances, so when a Portal Server instance goes down, all of the sessions maintained on that instance are lost. The same is true if the web container used is an application server, such as the Sun ONE Application Server Standard Edition, that does not support session failover.
To increase availability of the Portal Server, you can create multiple instances of the Sun ONE Web Server on the same machine or have multiple instances on multiple machines. In this way, the number of users affected by a Portal Server instance failure is minimized. Users that are affected would have to log in to another server.
If an instance fails, the gateway detects the failure and reroutes the requests to one of the surviving instances. If you are not using the Sun ONE Portal Server Secure Remote Access software, you must have load balancers to perform the functions of the gateways, and the load balancers have to detect the failure of a Portal Server instance and send the requests to one of the surviving instances.
Directory Server Availability
Another important component of the Sun ONE Portal Server software solution is the Directory Server where the user and services profiles are stored. To remove this component as a single point of failure, you can use the Sun ONE Directory Server software's multi-master replication (MMR) configuration or the Sun™ Cluster software framework. Because of the loosely consistent replication mechanism of the Sun ONE Directory Server software, it is possible, albeit very unlikely, that an update can be lost. If a system failure occurs right after a change has been accepted by one master, but before the change is replicated to the second master, it is possible that the change will be lost, and there is no easy way to detect this fact.
In some very demanding environments, the possibility of losing an update might not be acceptable. In this case, the best option is to use the Sun Cluster software to achieve high availability of the Directory Server. The use of the Sun Cluster software can increase the availability of the system, but the configuration, maintenance, and monitoring of this environment require more specialized knowledge and very well defined operational processes.
When you are installing the Portal Server software, you can only specify one LDAP server, and this server must be a master LDAP server because the installation process is affected by the propagation delay of the LDAP replication process. To add multiple LDAP masters or to point the Sun ONE Identity Server to use a consumer after the installation, you must edit the serverconfig.xml file and add a Server element for each additional LDAP server. The following example shows the format of the server entries:
<Server name=Server1 host=master1.company.com port=389 type=SIMPLE /> <Server name=Server2 host=master2.company.com port=389 type=SIMPLE />
The Identity Server uses the first entry as the LDAP server for all requests of service, roles, organization, and user profiles. If that LDAP server fails, the Identity Server fails over to the next server in the list. There is no round-robin or failback between the LDAP servers, so if you want to design a solution in which all of the LDAP servers are used evenly, you will have to use the Sun ONE Directory Proxy Server software. A load balancer cannot be used because the Sun ONE Identity Server software uses a pool of connections that are kept open and are reused. The same is true for the LDAP and membership authentication modules and for the Policy Configuration service. They can use primary and backup LDAP servers, but you have to add the failover servers after the installation by using the administration console.
Planned Downtime
With the architecture shown in FIGURE 2 on page 4, there is redundancy of services, so most of the unplanned downtime can be minimized or eliminated. However, the planned downtime is still an issue. For instance, if the Portal Server software must be updated, services could be affected. If the upgrade or patch includes changes to the Sun ONE Directory Server software schema used by the Sun ONE Identity Server software, all of the software components must be stopped to update the information stored in the Directory Server.
In addition, the Solaris™ Operating System (Solaris OS) patch installation process does not work if the application services are enabled. Thus, you must shut down all of the services, patch the system, then bring the system back online. In some environments, the downtime incurred during the patch process might not be acceptable. But, with a highly available solution with duplicate services and components, you can use a phased approach for maintaining the system. For instance, you could remove one Portal Server node from the production configuration and upgrade it. Then, you could remove one of the gateways from the production configuration and upgrade it. Afterwards, you could integrate that silo back into the production configuration and repeat the process for the other silo, resulting in an upgrade with minimal interruption of service.
In theory, you would not have downtime; however, because of the architecture of the Portal Server software, it is not possible to just remove one Portal Server instance from the active configuration without affecting some users because there is no way to prevent a user from logging in to the server and to keep the active sessions untouched. Thus, you must create a mechanism to prevent users from logging in to the server while the gateways still process request from the already-authenticated users. This can be accomplished by using a custom-developed authentication module.
Business-Critical Systems
The third type of system is the business-critical system. For this type of system, availability is a critical requirement because if the system goes down, it could lead to lost revenue, lost productivity, and customer dissatisfaction. FIGURE 3 shows the typical configuration of a business-critical system, which builds on the architecture described for the task-critical system and includes all of its benefits.
FIGURE 3 Business-Critical System Configuration
To enhance the availability of the system, an application server is used as a web container for all of the Identity Server and Portal Server services. This configuration is needed to use the HTTP session failover features of the Application Server. This maintains a database of all of the sessions that are created in the system, and that database is accessible to all of the Portal Server instances in the configuration. Thus, if one Portal Server instance fails, the gateways redirect all of the requests to the surviving Portal Server, and that Portal Server will be able to validate that the session is still valid so that the operation will continue to work smoothly.
In the architecture depicted in FIGURE 3, the availability of the system is much higher than the architecture discussed in the "Task-Critical Systems" section. However, it is not possible to achieve absolute session failover. Depending on how an application server instance fails and how and where user sessions are stored and replicated, newly created individual user sessions might not have been written to the common database, so they might be lost.
The Sun ONE Portal Server software, version 6.2, supports BEA's WebLogic 6.1 and the Sun™ ONE Application Server 7.0 Enterprise Edition for session failover. When using BEA's application server, the WebLogic Cluster software is required to create an environment in which the sessions are replicated; however, the Sun ONE Portal Server Secure Remote Access software does not work with the WebLogic Cluster software. Thus, it is not possible to implement a highly available solution with the WebLogic software if the gateways are also required.
For the business-critical system, upgrades have a minimum impact. Either Portal Server node can be taken out of the production configuration without affecting the users because the sessions are in the shared database and because requests will be handled by the available Portal Server node.
Mission-Critical Systems
The fourth type of system is the mission-critical system. For the mission-critical system, failures could have catastrophic results for an organization (for example, loss of life or serious injury, significant loss of money, serious inability to conduct business, or serious operational chaos). Most mission-critical systems are usually custom built using special hardware such as fault-tolerant computers and software.