The Basics of Monitoring Cassandra
As the old systems adage goes, a service doesn’t exist unless it’s monitored. In this chapter, we will cover the basics of monitoring Cassandra. These include file-based logging, inspection of the JVM, and monitoring of Cassandra itself.
Logging
Under the covers, Cassandra uses the standard Java logging library Log4j. Log4j is another Apache project; it lets you control the granularity of log statements through a configuration file. If you want to find out more about what is happening on a particular node than nodetool and the JMX MBeans (which we will cover in more detail later in the chapter) are telling you, you can change the logging levels.
As a front end to the Log4j back end, Cassandra uses the Simple Logging Façade for Java (SLF4J). The logging levels, from most verbose to least verbose, are
- TRACE
- DEBUG
- INFO
- WARN
- ERROR
- FATAL
Understanding these logging levels is important not only to help monitor what is going on in the system or on a particular node but also to help troubleshoot problems. In troubleshooting complex systems such as Cassandra, Cassandra’s nodetool, logging, and even the JMX MBeans can lead to red herrings. So it is necessary to compile as much information pertinent to the problem as possible to help diagnose what might be going on.
Taking a look at a normal healthy Cassandra node’s system.log, you will see INFO lines that refer to various stages of the system executing their tasks. These include MemTable flushes, HintedHandoffs, and compactions, just to name a few.
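If you want to watch that activity as it happens, a simple tail and grep against the log will do. The following is only a sketch: the log path is the default used by packaged installs, and the search terms are loose patterns for flush, compaction, and hinted handoff lines, so adjust both to match your own installation.
# Follow the node's log and pull out flush, compaction, and hinted handoff activity.
# /var/log/cassandra/system.log is the packaged-install default; the file appender
# in your log4j-server.properties may point somewhere else.
tail -F /var/log/cassandra/system.log | grep -iE 'memtable|compact|hinted'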
Changing Log Levels
If you want to make any changes to the logging configuration, you will need to find the log4j-server.properties file. The default logging level for Cassandra and the rootLogger is INFO. This level provides a standard amount of information that is sufficient for understanding the general health of your system. It is worth seeing what your system looks like while it is logging at the DEBUG level, but be sure not to leave Cassandra in DEBUG mode in production, as the entire system will run noticeably slower. To change the standard logging level in Cassandra from INFO to DEBUG, change the line that looks like this:
log4j.rootLogger=INFO,stdout,R
to this:
log4j.rootLogger=DEBUG,stdout,R
Now your Cassandra node will be running in DEBUG mode. To change it back, just swap the INFO and DEBUG again. To show less logging, you can change the logging level to WARN, ERROR, or FATAL.
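The rootLogger controls the level for everything, but Log4j can also raise the verbosity of a single package or class without putting the whole node into DEBUG mode. The following is a sketch of what that looks like in log4j-server.properties; the package and class names are only examples of Cassandra internals you might want more detail on.
# Keep the node at INFO overall...
log4j.rootLogger=INFO,stdout,R
# ...but log compaction and hinted handoff internals at DEBUG.
# (These logger names are illustrative; substitute the packages or classes
# relevant to the problem you are chasing.)
log4j.logger.org.apache.cassandra.db.compaction=DEBUG
log4j.logger.org.apache.cassandra.db.HintedHandOffManager=DEBUG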
Example Error
It is worth noting that not all problem messages enter the logs at WARN or above (above meaning toward FATAL). Listing 8.1 presents an example of when things start to go south; it is a common set of log messages that you may see with your system set at the INFO level. Even with the logging level set to INFO, there is a lot of useful information in the logs. Don't be afraid to keep a regular eye on them so that you know what patterns of log messages are normal for your system. For example, if things are starting to slow down, you may see something like Listing 8.1.
Listing 8.1 INFO Messages That Show MUTATION and READ Messages Dropped
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,290 MessagingService.java (line 607) 3476 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,290 MessagingService.java (line 607) 677 READ messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 50) Pool Name                    Active   Pending   Blocked
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 65) ReadStage                        32       621         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 65) RequestResponseStage              0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 65) ReadRepairStage                   0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 65) MutationStage                    32      4105         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,291 StatusLogger.java (line 65) ReplicateOnWriteStage             0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) GossipStage                       0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) AntiEntropyStage                  0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) MigrationStage                    0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) StreamStage                       0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) MemtablePostFlusher               0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) FlushWriter                       0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) MiscStage                         0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,292 StatusLogger.java (line 65) InternalResponseStage             0         0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,293 StatusLogger.java (line 65) HintedHandoff                     1         8         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,293 StatusLogger.java (line 70) CompactionManager                 0         0
INFO [ScheduledTasks:1] 2012-07-09 20:48:57,293 StatusLogger.java (line 82) MessagingService                n/a       0,0
When you see messages being dropped, as in the first two lines of the listing, that is a sign that your system is under stress. Depending on your application requirements, some level of dropped messages may be acceptable. But regardless of whether your application can tolerate running in a degraded state, the overall health of your cluster (or at the very least of this node) is in question. Most applications can handle dropped READ requests, but dropped MUTATION messages mean that data that should have been written isn't getting written. Depending on the consistency level of the write in question, this could mean the write didn't happen at all, or it could mean the write didn't happen on this node. Also notice that the ReadStage and MutationStage lines show dozens of Active tasks and hundreds or thousands of Pending tasks still waiting to be worked on. The dropping is deliberate load shedding: rather than fall further and further behind, Cassandra discards requests it can no longer answer in time so that it can keep up with the volume of work it is being given.
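You don't have to wait for the next StatusLogger entry to see whether a stage is backing up: nodetool exposes the same per-stage Active, Pending, and Blocked counters, along with dropped message totals, on demand. A minimal example, assuming you are running it against the local node on the default JMX port:
# Print thread pool statistics (Active/Pending/Completed/Blocked per stage)
# and dropped message counts for this node. Host and port are assumptions:
# 127.0.0.1 for the local node, 7199 for the default JMX port.
nodetool -h 127.0.0.1 -p 7199 tpstats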
There are other common log lines worth watching for, and a log monitor can watch for them automatically. One method of monitoring the logs programmatically using Nagios is discussed later in this chapter.