How to Troubleshoot Zenoss Effectively

From SysAdminWiki

(This is taken from the Zenoss Admin guide, wikified and placed here for easier reference.)

How To Troubleshoot Zenoss Startup Errors

I have had the most success starting Zenoss with the runzeo command when it fails to start up properly. The command is very verbose and often prints errors in human-readable form.

How To Troubleshoot Zenoss Daemons

Help for figuring out why collectors are failing

If Zenoss behaves strangely or otherwise does not do what you expect it to do, you can enable debugging messages to help you diagnose problems. To do so, edit the configuration file of the daemon where you want to enable debugging:

  • Go to Settings/Daemons/edit config.
  • Change the line "level info" in the section eventlog to "level debug".
  • Save, then restart the daemon.

You should now see debug information in the daemon's logfile under $ZENHOME/log. Do not forget to turn off debugging when you are finished troubleshooting; the logfiles grow quickly in debug mode.
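If you prefer to flip the level from a shell, the edit can be sketched as below. This is only a sketch: the function name set_log_level is made up for illustration, the config file path is whatever file your daemon actually reads (it is not specified here), and the function changes every line of the form "level info" in that file, not only the one in the eventlog section.

```shell
# set_log_level FILE FROM TO
# Rewrite lines of the form "level FROM" to "level TO", keeping indentation.
# FILE is an assumption: pass the config file of the daemon you are debugging.
set_log_level() {
    conf=$1
    from=$2
    to=$3
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file first, then copy it back so the
    # original file keeps its owner and permissions.
    if sed "s/^\([[:space:]]*\)level[[:space:]][[:space:]]*${from}[[:space:]]*\$/\1level ${to}/" "$conf" > "$tmp"; then
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}

# Example (hypothetical path):
#   set_log_level /path/to/daemon.conf info debug   # turn debugging on
#   set_log_level /path/to/daemon.conf debug info   # turn it off again
```

Remember that the daemon still has to be restarted after the change.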

Identify the Problem

Check the logs

Your first step in troubleshooting a daemon is to look in the logs for errors. The Zenoss logs are located in $ZENHOME/log. You can also view the logs via the web user interface under the About link.
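A quick way to scan all of the logs at once is sketched below. It assumes the daemons log with Python's standard level names (ERROR, CRITICAL) and the usual traceback header; the function name find_log_errors is made up for illustration.

```shell
# find_log_errors [LOGDIR]
# Print file name, line number and text of every log line that looks like
# an error. LOGDIR defaults to $ZENHOME/log.
find_log_errors() {
    logdir=${1:-$ZENHOME/log}
    # -H forces the file name to be printed even when only one log exists.
    grep -H -n -E 'ERROR|CRITICAL|Traceback \(most recent call last\)' "$logdir"/*.log
}

# Example:
#   find_log_errors | tail -n 20
```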

Get more information

You can get additional logging by lowering the logging severity threshold and running the daemon in foreground mode. For example, this command runs zenperfsnmp in the foreground with more verbose logging:

$ $ZENHOME/bin/zenperfsnmp run -v 10

You will get more (sometimes a lot more) log messages, but zenperfsnmp will perform only a single scan of all the devices. For the command line options of the other Zenoss daemons, see http://www.zenoss.com/community/docs/howtos/zenoss-daemon-command-line-arguments/.
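Because a foreground run can produce a lot of output, it can help to keep a copy in a file while still watching it on screen. A minimal sketch, where capture_run is a made-up helper name and the output path is only an example:

```shell
# capture_run OUTFILE COMMAND [ARGS...]
# Run COMMAND in the foreground, showing stdout and stderr on the terminal
# and saving the same text to OUTFILE.
capture_run() {
    out=$1
    shift
    "$@" 2>&1 | tee "$out"
}

# Example:
#   capture_run /tmp/zenperfsnmp-debug.log "$ZENHOME/bin/zenperfsnmp" run -v 10
```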

Find out what the program is really doing

You can enable the watchdog feature by doing the following:

Go to Status > Daemons, choose edit config for zenperfsnmp, and add the line:

watchdog True

Restart the daemon, then look at $ZENHOME/log/zenperfsnmp.log to see what is going on.

If a program is hanging, or not behaving as you expect, you can eavesdrop on what the program is asking the operating system to do on its behalf. This is a good way to determine whether a helper program is failing, or whether system errors are not propagating up to the log file. On Linux machines, the command to trace these system calls is "strace". On the appliance you can add strace to your system with:

$ conary update strace

Other POSIX-like operating systems have their own commands (truss on Solaris, dtruss or dtrace on OS X). For example, you can verify that zenperfsnmp is really sending packets:

$ strace -f -e trace=sendto $ZENHOME/bin/zenperfsnmp run
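If the trace is long, you can save it with strace's -o option and summarize it afterwards. The helper below is a rough sketch (count_sendto is a made-up name, and the trace file path is only an example): it counts lines where a sendto call starts and how many of those report a -1 return value. Calls that strace splits into "unfinished" and "resumed" lines are counted once, and their result is not inspected.

```shell
# count_sendto TRACEFILE
# Summarize saved strace output: how many sendto calls were made and how
# many of them failed (return value -1).
count_sendto() {
    awk '/sendto\(/ { total++; if ($0 ~ /= -1 /) failed++ }
         END { printf "total=%d failed=%d\n", total, failed }' "$1"
}

# Example:
#   strace -f -e trace=sendto -o /tmp/zenperfsnmp.strace "$ZENHOME/bin/zenperfsnmp" run
#   count_sendto /tmp/zenperfsnmp.strace
```

A total of zero suggests the daemon never tried to send anything; a high failed count points at a network or socket problem.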

Don't forget the largest and most complex Zenoss daemon: MySQL.

Check the installed MySQL version against the one required by your Zenoss release. If you see "Lost Connection to Zenoss" in the dashboard, it is likely a MySQL connection problem.


Narrow the Problem

If you have narrowed a problem down to a single device, or simply suspect a device because it runs an odd configuration or was recently added, most of the active collectors will let you scan just that device:

$ $ZENHOME/bin/zenperfsnmp run -v 10 --device SomeDevice

If the problem is related to a long-running server, you can ask that the server run in foreground mode but continue with the normal endless cycle:

$ $ZENHOME/bin/zenperfsnmp run -v 10 --cycle

Look for Conflicts

Are you running more than one copy of a daemon? During debugging and before version 1.1, Zenoss could lose track of background processes.

Stop Zenoss and look for stray processes:

$ ps auxww | grep /z

Are you resource limited? Is the file system full? Do you have free memory? My favorite tool for this is "top" under Linux:

$ top

This program will constantly update the display with a list of the most CPU hungry programs. You can also sort the list by memory usage.
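Since top does not show disk usage, a full file system is easy to miss. The sketch below filters the output of df -P for mount points at or above a usage threshold; full_filesystems is a made-up name, the 90 percent default is arbitrary, and mount points containing spaces are not handled.

```shell
# full_filesystems [PERCENT]
# Read `df -P` output on stdin and print "mountpoint usage" for every file
# system whose usage is at or above PERCENT (default 90).
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Example:
#   df -P | full_filesystems 90
```

On Linux, free -m gives a similar one-shot view of memory.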

Reproduce the Problem

The ability to reproduce a problem with a consistent set of steps will help enormously. Often the only way to find the cause is a "binary search": reproduce the problem, then take away half of the configuration and try again. If the problem persists, the cause is in the half that remains; if it disappears, the cause is in the half you removed. Keep halving the suspect portion until a single element remains.
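The halving procedure can be sketched in shell. This is an illustration, not a Zenoss tool: bisect_culprit and the check command are made-up names, the items (devices, templates, config entries) must not contain whitespace, and the sketch assumes exactly one item triggers the problem on its own.

```shell
# bisect_culprit CHECK ITEM...
# CHECK is a command or function that is given a list of items and exits 0
# if the problem still reproduces with only those items enabled.
bisect_culprit() {
    check=$1
    shift
    while [ "$#" -gt 1 ]; do
        half=$(( $# / 2 ))
        first=""
        rest=""
        i=0
        # Split the remaining items into two halves.
        for item in "$@"; do
            if [ "$i" -lt "$half" ]; then
                first="$first $item"
            else
                rest="$rest $item"
            fi
            i=$(( i + 1 ))
        done
        # Keep whichever half still reproduces the problem.
        if "$check" $first; then
            set -- $first
        else
            set -- $rest
        fi
    done
    echo "$1"
}

# Example (problem_reproduces is something you would write yourself):
#   bisect_culprit problem_reproduces dev1 dev2 dev3 dev4 dev5
```

Each round needs only one reproduction attempt, so even a few hundred candidates are narrowed to one in under ten rounds.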

Getting Help Solving the Problem

Search the Zenoss Forums

Hopefully someone has seen a similar problem.

Report the Problem to Zenoss

Some problems should be reported even in the absence of detailed information, because they are almost certainly bugs:

  • the Python interpreter crashes (a segmentation fault, for example)
  • a Python traceback appears in a log file
  • a daemon regularly drops heartbeats
  • a daemon's size grows over time to consume all resources