How do I use Nagios to monitor a log file

https://stackoverflow.com/questions/2373212

24-09-2019
|

Question

We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors and while I set up check_log, it doesn't seem to work as well as monitering a device.

The issues are:

It only shows the last entry
There doesn't seem to be a way to acknowledge the critical error and return the monitor to a good state

Is nagios the wrong tool, or are we just not setting up the service monitering right?

Here are my entries

# log file
define command{
        command_name    check_log
        command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}


# Define the log monitering service
define service{
        name                            logfile-check           ;
        use                             generic-service         ;
        check_period                    24x7                    ;
        max_check_attempts              1                       ;
        normal_check_interval           5                       ;
        retry_check_interval            1                       ;
        contact_groups                  admins                  ;
        notification_options            w,u,c,r                 ;
        notification_period             24x7                    ;
        register                        0                       ;
        }

define service{
        use                             logfile-check
        host_name                       localhost
        service_description             CritLogFile
        check_command                   check_log
}

Solution

As there are many ways to achieve a goal, there is also a nice plugin from Consol available: https://labs.consol.de/lang/en/nagios/check_logfiles/

supports regex
supports log rotation

To use it, you need a cfg file, this is an example for oracle databases

@searches = ({
  tag => 'oraalerts',
options => 'sticky=28800',
  logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
  criticalpatterns => [
      'ORA\-0*204[^\d]',        # error in reading control file
      'ORA\-0*206[^\d]',        # error in writing control file
      'ORA\-0*210[^\d]',        # cannot open control file
      'ORA\-0*257[^\d]',        # archiver is stuck
      'ORA\-0*333[^\d]',        # redo log read error
      'ORA\-0*345[^\d]',        # redo log write error
      'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
      'ORA\-0*48[0-5][^\d]',
      'ORA\-0*6[0-3][0-9][^\d]',# ORA-6000 - ORA-0639 internal errors
      'ORA\-0*1114[^\d]',        # datafile I/O write error
      'ORA\-0*1115[^\d]',        # datafile I/O read error
      'ORA\-0*1116[^\d]',        # cannot open datafile
      'ORA\-0*1118[^\d]',        # cannot add a data file
      'ORA\-0*1122[^\d]',       # database file 16 failed verification check
      'ORA\-0*1171[^\d]',       # datafile 16 going offline due to error advancing checkpoint
      'ORA\-0*1201[^\d]',       # file 16 header failed to write correctly
      'ORA\-0*1208[^\d]',       # data file is an old version - not accessing current version
      'ORA\-0*1578[^\d]',        # data block corruption
      'ORA\-0*1135[^\d]',        # file accessed for query is offline
      'ORA\-0*1547[^\d]',        # tablespace is full
      'ORA\-0*1555[^\d]',        # snapshot too old
      'ORA\-0*1562[^\d]',        # failed to extend rollback segment
      'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
      'ORA\-0*163[0-2][^\d]',
      'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
      'ORA\-16014[^\d]',      # log cannot be archived, no available destinations
      'ORA\-16038[^\d]',      # log cannot be archived
      'ORA\-19502[^\d]',      # write error on datafile
      'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
      'ORA\-0*4031[^\d]',        # out of shared memory.
      'No space left on device',
      'Archival Error',
  ],
  warningpatterns => [
      'ORA\-0*3113[^\d]',        # end of file on communication channel
      'ORA\-0*6501[^\d]',         # PL/SQL internal error
      'ORA\-0*1140[^\d]',         # follows WARNING: datafile #20 was not in online backup mode
      'Archival stopped, error occurred. Will continue retrying',
  ]
});

OTHER TIPS

For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:

max_check_attempts              1
is_volatile                     1

This causes Nagios to send out the alert immeidately, but only once, and then go back to normal.

My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.

There is a Nagios plugin that you can use to check the log files: it's called check_logfiles and it's used to scan the lines of a file for regular expressions.

The following link shows how to install and configure check_logfiles for Nagios and Opsview: https://www.opsview.com/resources/nagios-alternative/blog/syslog-monitoring-nagios-opsview

Nothing in your config jumps out at me as being misconfigured.

By design, check_log will only show either an OK message, or the last log entry that triggered an alert. If you need to see multiple entries, you'll need to modify the plugin.

However, I find the fact that you're not getting recoveries somewhat odd. The way check_log works (by comparing the current log to the previous version), you should get a recovery on the very next service check. Except of course, when there have been additional matching entries added to the log since the last check.

Does forcing another service check (or several) cause it to recover?

Also, I don't intend this in a mean way, but make sure it's really malfunctioning. Is your log getting additional matching entries in between checks, causing it not to recover? Your check is matching "?" which will match anything new in the log. Is something else (a non-error) being added to the log and inadvertently causing a match?

If none of the above are the issue, I would suggest narrowing it down by taking Nagios out of the equation. Try running check_log manually (from the command line, but as the same user as nagios), and with a different oldlog. It should go something like this -

run check with a new "oldlog" - get initialization message
run check - check OK
make change to log
run check - check fails
run check - check OK

If this doesn't work, then you know to focus on the log, the oldlog, and how the check_log is doing the check.

If it works, then it points more towards a problem with your nagios configuration.

I believe there's now a real Nagios plugin that monitors logs effectively.

http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088

The home page of the Nagios plugin on that page is Nagios Log Monitor

Your [ commands.cfg file ] will contain:

define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}


OR


define command {
                            command_name         NagiosLogMonitor
                            command_line            $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}




Your [ services.cfg file ] will look similar to:

define service {
                      check_command                         NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
                      max_check_attempts                  1
                      service_description                     500_ERRORS_LOGCHECK
                      host_name                                  sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
                      use                                              fifteen-minute-interval
 }

Nagios now has a solution that integrates tightly with Nagios Core, XI, etc.

Nagios Log Server which can alert on any query on any log file on any system in your infrastructure.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow