What’s in It for Me?
Now that we’ve taken a quick look at what XI is and how it works, let’s see how XI compares to Nagios Core and to the various commercial monitoring systems with which it was designed to compete.
One Slick Interface
Given the general quality of the alternative PHP interfaces we find in the Nagios Exchange repository, the XI interface is shockingly excellent. It is certainly not yet another effort to bring the CGI interface “up to date” by replacing it with a PHP version of itself. The XI user interface is a complete rethinking of the UI, which truly takes advantage of the strengths of a web programming platform like PHP at every opportunity. Elements within dashboards can be unlocked, moved around, or even deleted to suit the preferences of the user. AJAX is employed, both to update individual information elements and to provide feedback, so that when I send a command via the UI to reschedule a service check or acknowledge an alert, a box momentarily appears to let me know my command has been accepted. One of my least favorite things about the Core UI is the way it dumps me to an acknowledgment page after I’ve issued a command, forcing me to manually navigate back to somewhere useful.
The traditional Nagios tables like “service detail” and “hostgroup grid” still exist, but are implemented as repurposable widgets that I can use to build custom dashboards. New tables have been added, a few of which are very dense and handy, like the “minemap” visualization pictured in Figure 9.2.
Figure 9.2. Nagios XI Minemap
One of my favorite Nagios Core views is the hostgroup grid, which shows at a glance the state of entire hostgroups, including their services. This is one of the more dense status visualizations available in the old UI; unfortunately, I still need to scroll around to see everything in my environment. The Minemap visualization, by comparison, shows the same information in a much smaller amount of screen space, enabling me to get a coherent, uncluttered, detailed service-level visualization of my entire network on a single screen.
Integrated Time Series Data
PNP4Nagios (described in Chapter 8, “Visualization”) is integrated out of the box, and definitions exist for all the included plug-ins. This means that without any additional configuration whatsoever you get time series data for every service you configure. The RRDTool graphs are so well integrated into the UI that the uninitiated user would never guess PNP or RRDTool were community-sourced add-ons, so you get a snazzy UI without losing any of the power and flexibility that these community-driven development efforts provide.
In addition to the RRDTool graphs, small bar-graph visualizations for metrics collected by the Nagios Core daemon, as well as remote execution tools like NRPE, are sprinkled throughout the interface. These do a great job of conveying capacity planning info at a glance, as well as giving the UI a very polished look.
Rounding out the time series visualization is a Graph Explorer tool, which allows you to draw, among other things, ad hoc time series and stacked time series graphs. The Graph Explorer uses a commercial JavaScript library, Highcharts, and looks quite elegant. The data comes from the RRDs resident on the Nagios server via rrdtool fetch and is handed to the end user’s browser, which computes the graph locally. This saves the server’s CPU and provides a snappy, feature-rich data visualization, allowing you to zoom the graph by dragging to select a range and providing pop-up numerical values when you mouse over any data areas. The stacked time series graphs include time-shifted historical data, so you can easily compare today’s data to that of yesterday, and so on.
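To make the data path a little more concrete, here is a minimal Python sketch of the two ideas in that paragraph: turning the text that rrdtool fetch prints into rows, and time-shifting an older series so it can be drawn over a newer one. This is not XI's code. The function names and the sample output are my own illustrations, and I'm assuming the usual rrdtool fetch layout of a header line with data-source names followed by "timestamp: value" rows.

```python
import math

# Illustrative sample only, shaped like the output of a command such as:
#   rrdtool fetch <file>.rrd AVERAGE --start -1d
SAMPLE_FETCH_OUTPUT = """\
                          load1

1700000000: 1.5000000000e+00
1700000300: 2.0000000000e+00
1700000600: -nan
"""


def parse_rrdtool_fetch(text):
    """Parse rrdtool fetch text into (names, rows).

    rows is a list of (timestamp, [values]); unknown samples become None.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()          # header: data-source names
    rows = []
    for line in lines[1:]:
        stamp, _, values = line.partition(":")
        parsed = []
        for raw in values.split():
            number = float(raw)       # float() accepts "nan" and "-nan"
            parsed.append(None if math.isnan(number) else number)
        rows.append((int(stamp), parsed))
    return names, rows


def shift_series(rows, seconds):
    """Move a series forward in time so it overlays a later period."""
    return [(stamp + seconds, values) for stamp, values in rows]


names, rows = parse_rrdtool_fetch(SAMPLE_FETCH_OUTPUT)
yesterday_as_today = shift_series(rows, 24 * 60 * 60)
```

Shifting yesterday's timestamps forward by one day is all that a "compare to yesterday" overlay really needs; the browser can then plot both series on the same axis.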
Modularized Components
The UI as a whole is highly modular, incorporating add-on components to implement extra features. This enables the XI developers to react quickly to the needs of the user community by adding features to the UI as needed, or even by developing custom features for larger end users with special needs. A notable example is the Operations screen depicted in Figure 9.3, which is intended to be displayed on a dedicated screen in a Network Operations Center. In addition to this and other single-page summaries, custom views can be configured to rotate between pages with more detailed information at timed intervals. I bring up these little summary views because seeing them so prominently displayed in the XI interface drives home both the extent to which the Nagios developers are listening to the needs of the community and their eagerness to satisfy those needs now that incremental progress in the UI is possible.
Figure 9.3. Nagios XI Operations screen
Finally! Acknowledgments and Scheduled Downtime for Multiple Hosts
Another component that implements a feature for which the core community has been begging for years is the Mass Acknowledgment Component. This allows an admin to schedule downtime and acknowledge problems for groups of hosts and services. I know more than one sysadmin who would purchase XI for this feature alone.
Enhanced Reporting and Advanced Visualization
The XI developers are not solely focused on the community, however, as a quick glance at the Reporting tab in XI shows; they are proactively exploring some interesting data visualization techniques from the neoformix data-visualization field. Components that implement heat maps, force-directed graphs, and stream graphs, as depicted in Figure 9.4, have been added to the classic reporting options. Several shiny new implementations of the core reports are also provided, each of which I find generally cleaner than its legacy counterpart and more likely to impress the wearers of neckties and high heels in our lives. The new reports may be exported in CSV and PDF formats with the click of a button. The button links to a predictable URL, which makes it possible for the shorts and t-shirt wearers among us to automatically grab the reports with tools like curl and wget.
Figure 9.4. Nagios XI Stream Graph component
NagVis
NagVis (described in Chapter 8) is installed and available in the Maps section of the Home view. Setting up your own NagVis diagrams couldn’t be easier. First, copy your map or diagram graphic to /usr/local/nagvis/share/userfiles/images/maps. Next, launch the NagVis tool in the XI UI, select Manage Maps from the options menu, and create a new map, pointing the Background to the graphic you uploaded. Finally, open your map using the Open menu, and add status icons to it by selecting Add Icon from the Map menu.
Business Processes
Nagios XI contains wrapper logic for grouping individual services into higher-level entities called business processes. The intent here is to implement what the Gartner Group calls Business Activity Monitoring, or BAM. BAM attempts to provide real-time status for critical business entities like a sales catalog web site or corporate email. Nagios XI implements BAM by breaking a high-level concept like “corporate email” into its requisite pieces, such as Mail Transfer Agents, Mail Exchangers, Groupware systems, and Databases, and then quantifying the relative importance of each of the services that make up those pieces as well as describing dependency relationships between them.
XI Business Process groups contain services that are said to be “essential” or “non-essential.” A database service in our example might be considered essential, whereas the SMTP port on a single mail exchanger might be “non-essential” (because they are usually redundant, and even if they go down, the mail will queue somewhere else). When any essential service or the combination of all non-essential services goes critical, the XI business process logic registers this as a “problem.”
Each business process group contains critical and warning thresholds that depend on the number of problems that are occurring in the group. In our example, we might imagine two business process groups, one for SMTP speakers (MXs and MTAs) and one for SQL-speakers (groupware systems and DBs). If the latter group registers a single problem because a database is down, that might throw the whole group into a warning state.
Business process groups can contain other nested business process groups, and so on. Our top-level entity, corporate email, is therefore just a business process group that contains the two groups previously described. It is configured like the other two groups so that a single “problem” in any of the nested groups causes it to go into a warning state. Finally, notification commands can be assigned on each business process group in the same way they are assigned to individual host and service events. Additionally, visualization widgets exist for the top-level groups. These can be added to any dashboard or view, and they allow the user to drill down into the groups to see what services or subgroups constitute them.
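If the counting rules are hard to keep straight, the following Python sketch models them as I've described them. It is a toy of my own, not XI's implementation: the class names, the warn_at and crit_at fields, and the rule that a nested group in any non-OK state counts as one problem for its parent are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class Service:
    name: str
    critical: bool          # True when the service is in a critical state
    essential: bool = True


@dataclass
class Group:
    name: str
    warn_at: int            # problem count that triggers WARNING
    crit_at: int            # problem count that triggers CRITICAL
    services: list = field(default_factory=list)
    subgroups: list = field(default_factory=list)

    def problems(self):
        # Every critical essential service is a problem on its own.
        essential_down = sum(
            1 for s in self.services if s.essential and s.critical
        )
        # Non-essential services only count once, and only if all are down.
        optional = [s for s in self.services if not s.essential]
        all_optional_down = (
            1 if optional and all(s.critical for s in optional) else 0
        )
        # Assumption: a nested group that is not OK is one problem here.
        nested = sum(1 for g in self.subgroups if g.state() != OK)
        return essential_down + all_optional_down + nested

    def state(self):
        count = self.problems()
        if count >= self.crit_at:
            return CRITICAL
        if count >= self.warn_at:
            return WARNING
        return OK


smtp = Group("smtp-speakers", warn_at=1, crit_at=2, services=[
    Service("mx1-smtp", critical=True, essential=False),
    Service("mx2-smtp", critical=False, essential=False),
])
sql = Group("sql-speakers", warn_at=1, crit_at=2, services=[
    Service("groupware-db", critical=True),
])
email = Group("corporate-email", warn_at=1, crit_at=2,
              subgroups=[smtp, sql])
```

With one redundant mail exchanger down, the SMTP group stays OK; the dead database puts the SQL group into warning, and that single problem bubbles up to put corporate email into warning as well.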
Integrated Plug-ins and Configuration Wizards
The core installation of Nagios XI includes all the plug-ins in the standard plug-ins package, as well as NRPE, NSCA, and NRDP. In addition to all the plug-ins being preinstalled, the XI developers have provided a plethora of semiautomated configuration wizards, which, given the bare-minimum information about a host, take care of the initial setup as well as adding and modifying services on already configured hosts.
If you consult the official XI documentation at
http://library.nagios.com/library/products/nagiosxi/documentation,
you’ll quickly discover that the wizards are the preferred method for host and service configuration. With names like Exchange Server, website, and Windows Workstation, they make setting up new hosts and services easy enough that these tasks can be delegated to first-level support techs, or even end users. The autodiscovery wizard is capable of bootstrapping an environment given only a CIDR netblock to start with, and it does a good job of initial setup. To add NRPE-based host checks or other services after the fact, run the appropriate wizard on the preexisting host.
For example, if Server1 was created with the autodiscovery wizard, and you now want to add NRPE checks to get CPU, memory, and disk information from the host, you must first install NRPE on Server1. If Server1 doesn’t already have NRPE on it, and is one of several common server types, such as a Windows 200X server, Red Hat, or Ubuntu, the XI developers provide an agent package designed specifically to work with XI at:
http://assets.nagios.com/downloads/nagiosxi/wizards
After the agent is installed on Server1, run the NRPE Wizard on the server from the configuration tab of the XI user interface, as shown in Figure 9.5, entering the IP or FQDN of the server, and choosing the type from the drop-down list. The wizard will then display a preconfigured subset of available check commands relevant to your server type, and provide text-entry fields for you to specify custom settings or additional commands if you’d like.
Figure 9.5. The Nagios XI NRPE Wizard
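As an aside on the autodiscovery wizard mentioned earlier, bootstrapping from nothing but a CIDR netblock is less magical than it sounds. The sketch below is my own illustration in Python, not the wizard's actual logic: it expands a netblock into candidate addresses with the standard ipaddress module and tries a plain TCP connect to see whether a port answers. The function names and the choice of port 22 are assumptions.

```python
import ipaddress
import socket


def candidate_hosts(cidr):
    """Return the usable host addresses in a CIDR block as strings."""
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(address) for address in network.hosts()]


def port_is_open(address, port, timeout=0.5):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover(cidr, port=22):
    """List addresses in the netblock that accept connections on port."""
    return [a for a in candidate_hosts(cidr) if port_is_open(a, port)]
```

Calling discover("192.0.2.0/24") would walk all 254 usable addresses in that block, so a real tool would want to probe in parallel rather than one address at a time.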
As I said earlier, static configuration files may still be maintained in etc/nagios/static. So it’s entirely possible to run your own scripts, or autogeneration tools like those included with check_mk, provided you configure them to write their configuration to the static directory. I can’t deny that the automated configuration features in XI have, perhaps ironically, complicated things a bit for those of us who have reason to maintain the configuration manually. In the Nagios Core universe, there is a single way to configure Nagios (text files). However, there are three ways to configure Nagios Core in the XI universe (text files, NagiosQL, and XI Wizards), and although the three coexist well enough, it can become burdensome to ensure a uniformity of parameters if the administrators mix and match their configuration methodologies in XI. I’ll give you an example.
Larry, his brother Darryl, and his other brother Darryl all work at Bloody Stump Lumber Mill, where they recently purchased a Nagios XI server to monitor their growing sales web-application server farm. Larry was a UNIX admin in college, so he prefers to edit the config files. Darryl likes to have fine-grained control over the config, but isn’t very good in vim, so he uses the XI advanced configuration section, and other Darryl would rather be watching football, so he just runs the wizard for everything. Each of the brothers has a server running sshd that he wants to configure in XI.
When other Darryl runs the Autodiscovery Wizard on his server’s IP, XI scans the host and automatically configures a host check and a check_tcp service check for the SSH port. It then pushes the config to NagiosQL, which commits it to the DB, writes out the configuration, and restarts the daemon.
Darryl, meanwhile, sets up his host using the NagiosQL forms directly, but instead of choosing check_tcp, he chooses the check_ssh service, which does pretty much the same thing but returns slightly different output. He also names the service “ssh” instead of “SSH” as the wizard does.
Larry, meanwhile, has really done his homework. He already has a servicegroup for ssh servers in the static config files he created, so rather than doing all the typing and clicking that his brothers do, he simply adds his server to the ssh_servers servicegroup, and the rest takes care of itself. The problem is, his servicegroup inherits a different set of templates than NagiosQL, so although his service check uses the same name and check command as the wizard, his polling interval is different, and he has a different notification target for service warnings.
In this way, the brothers end up with three different definitions for the same service, which might not be a problem immediately, but will cause all manner of headaches if and when they want to integrate Nagios with another tool, or generally try to do any sort of automation using their monitoring server.
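A few lines of Python are enough to show how this kind of drift can be caught. The sketch below is hypothetical throughout: the dictionaries stand in for service definitions already parsed from the three sources, and the host names, check_tcp arguments, and interval values are made up to mirror the brothers' story rather than taken from a real XI install.

```python
from collections import defaultdict

# Hypothetical, already-parsed service definitions from each brother.
definitions = [
    {"source": "wizard", "host": "server-a",
     "service_description": "SSH", "check_command": "check_tcp!22",
     "check_interval": 5},
    {"source": "nagiosql", "host": "server-b",
     "service_description": "ssh", "check_command": "check_ssh",
     "check_interval": 5},
    {"source": "static", "host": "server-c",
     "service_description": "SSH", "check_command": "check_tcp!22",
     "check_interval": 10},
]

COMPARED_FIELDS = ("service_description", "check_command", "check_interval")


def find_drift(defs):
    """Group definitions by case-folded service name and report fields
    whose values differ between sources."""
    grouped = defaultdict(list)
    for definition in defs:
        grouped[definition["service_description"].lower()].append(definition)

    drift = {}
    for name, members in grouped.items():
        differing = {}
        for field_name in COMPARED_FIELDS:
            values = {m["source"]: m[field_name] for m in members}
            if len(set(values.values())) > 1:
                differing[field_name] = values
        if differing:
            drift[name] = differing
    return drift


report = find_drift(definitions)
```

Here the report flags all three fields for the service: the name differs in case, the check command differs for Darryl, and the polling interval differs for Larry.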
I admit these sorts of disconnects are possible with text configuration files, but my point is that text configuration encourages administrators to use templates to normalize the configuration, as Larry did in the previous example. The automated tools, by comparison, encourage isolating the configuration at the host level, because it’s easier for the automated tools to parse it that way. Thus, in Larry’s configuration, we find a single services.cfg wherein every service is defined and assigned a hostgroup, whereas in NagiosQL’s configuration, we find a services directory with a single file for each host. The former makes it pretty easy to verify that all the service checks for every host are implemented in the same way. The latter makes it much more difficult.
Further, in my experience, the disdain that people like Larry naturally feel for people like other Darryl generally discourages them from paying close attention to what people like other Darryl are doing. In fact, merely inviting other Darryl to configure the monitoring server with wizards might trigger a tendency in Larry to go off on his own and “do it the right way” using well-written static config files, which only exacerbates the problem by more widely diverging the configuration paths.
Whether this will be a problem in your shop will depend on how many hands are stirring the pot and the extent to which the more clueful users are aware of the potential problem. The idea of delegating the configs is certainly tempting, and I’m not saying you shouldn’t. If you do, my advice would be to use either the wizards or static config for service and host creation, and to avoid using NagiosQL directly if you can (you could still safely use it for host and service modification). That way, you can carefully set up the static config to ensure that it references the wizard templates, or simply copy definitions from the NagiosQL files, and everything should remain pretty much uniform.
Automated Configuration for Passive Checks
Another very cool bit of functionality that is related to automated configuration in Nagios XI is the Unconfigured Objects feature. In the event that XI receives a passive check result for a host or service that it doesn’t know about, it automatically generates an inert configuration for that host or service and places it in the Unconfigured Objects section of the Configure tab. Administrators may then approve the inert objects, and they will become part of the running configuration. Good stuff.
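The workflow is easy to picture as a small state machine. The Python sketch below is only a model of the behavior described above, with class and method names I invented for the purpose; it says nothing about how XI stores or generates the inert configuration internally.

```python
class PassiveInbox:
    """Toy model: hold passive results for unknown objects until approved."""

    def __init__(self, configured):
        self.configured = set(configured)   # known (host, service) pairs
        self.unconfigured = set()           # inert objects awaiting approval
        self.results = []                   # results accepted for processing

    def receive(self, host, service, status):
        key = (host, service)
        if key in self.configured:
            self.results.append((host, service, status))
            return "accepted"
        # Unknown object: remember it, but do not process the result.
        self.unconfigured.add(key)
        return "held"

    def approve(self, host, service):
        key = (host, service)
        self.unconfigured.remove(key)       # KeyError if it was never held
        self.configured.add(key)


inbox = PassiveInbox([("web1", "HTTP")])
first = inbox.receive("db9", "Backup", 2)    # unknown object, so it is held
inbox.approve("db9", "Backup")
second = inbox.receive("db9", "Backup", 0)   # now part of the configuration
```

The point of the design is that nothing an unknown sender submits reaches the running configuration until an administrator says so.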
Operational Improvements
In addition to the myriad functional improvements in Nagios XI, several maintenance-related features exist that make it easier to manage the Nagios server itself.
Backups
Out of the box, XI takes a snapshot of the running configuration each time it changes. These configuration snapshots can be downloaded from the UI in an automated fashion using tools like curl or wget. A snapshot can be used to restore the configuration in the event the monitoring system kicks the bucket, or to roll the configuration back to a prior version if someone made an inappropriate change. A real system backup, including historical state and metric data, involves a lot more than just the configuration files, however. Remember, XI maintains three databases and has untold amounts of performance data stored in RRDs, not to mention the Nagios Core state file and logs. For detailed instructions on properly backing up your XI install, see:
http://assets.nagios.com/downloads/nagiosxi/docs/Backing_Up_And_Restoring_XI.pdf
User Management
Account management is more important in XI, especially when individual users are encouraged to change configuration parameters and create new hosts and services. Individual users in XI also have the ability to configure the interface with custom views and dashboards as they see fit. For these reasons, XI must track users in its own database rather than leaving it up to Apache to sort out like the Nagios Core UI does. Account management is well done in XI and generally behaves in a manner that enterprise users expect. Access control exists to prevent individual accounts from making modifications, and components exist to enable XI to use LDAP servers. Nagios has published official documentation on multitenant setups, where, for example, access to a Nagios server hosted by a service provider is shared by multiple customers. This documentation resides at:
http://assets.nagios.com/downloads/nagiosxi/docs/XI_Multi-Tenancy.pdf
Daemon Status
As depicted in Figure 9.6, the XI interface provides an array of detailed information about the Core daemon process. This includes metric values for the server hardware as well as performance metrics internal to the daemon itself. A real-time graph of the event queue displays reaper and service check events scheduled 5 minutes into the future. This really is fantastic capacity planning info of a quality I’ve never seen in any monitoring system.
Figure 9.6. Detailed daemon statistics