How to Troubleshoot Zenoss Effectively
From SysAdminWiki
(This is taken from the Zenoss Admin guide, wikified and placed here for easier reference.)
How To Troubleshoot Zenoss Startup Errors
When Zenoss fails to start up properly, I have had the most success starting it with the runzeo command. It is very verbose and often prints errors in human-readable form.
How To Troubleshoot Zenoss Daemons
Help for figuring out why collectors are failing
If Zenoss behaves strangely or otherwise does not do what you expect it to do, you can enable debugging messages to help you diagnose problems. To do so, edit the configuration file of the daemon where you want to enable debugging:
- Go to Settings/Daemons/edit config.
- Change the line "level info" in the section eventlog to "level debug".
- Save, then restart the daemon.
You should now see debug information in the daemon's logfile under $ZENHOME/log. Do not forget to turn off debugging when you are finished troubleshooting; the logfiles grow quickly in debug mode.
Identify the Problem
Check the logs
Your first step in troubleshooting a daemon is to look in the logs for errors. The Zenoss logs are located in $ZENHOME/log. You can also view the logs via the web user interface under the About link.
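As a quick first pass from the command line, something like the following sketch can pull suspicious lines out of every daemon log at once. The function name scan_logs is hypothetical, and the ERROR/CRITICAL/Traceback patterns are an assumption about what the log lines look like; adjust them to match what you actually see.

```shell
# Sketch: list lines that look like errors in the Zenoss daemon logs.
# scan_logs is a hypothetical helper name; the search patterns are an
# assumption about log content, not something defined by Zenoss.
scan_logs() {
    # Default to the Zenoss log directory unless one is given
    log_dir="${1:-$ZENHOME/log}"
    # -H prints the file name, -n the line number, -E allows alternation
    grep -HnE 'ERROR|CRITICAL|Traceback' "$log_dir"/*.log
}
```

Calling scan_logs with no argument searches the default log directory; passing a directory searches that one instead.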
Get more information
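You can get additional logging by lowering the logging threshold with the -v option (lower numbers produce more output) and running the daemon in foreground mode. For example, this command will run zenperfsnmp in the foreground with debug-level logging: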
$ $ZENHOME/bin/zenperfsnmp run -v 10
You will get more (sometimes a lot more) log messages, but zenperfsnmp will perform only a single scan of all the devices. See http://www.zenoss.com/community/docs/howtos/zenoss-daemon-command-line-arguments/ for a list of command line options for the remaining Zenoss daemons.
Find out what the program is really doing
You can enable the watchdog feature by doing the following:
Go to Status > Daemons > edit config for zenperfsnmp and add the line:
watchdog True
Restart the daemon, then look at $ZENHOME/log/zenperfsnmp.log to see what is going on.
If a program is hanging, or not behaving as you expect, you can eavesdrop on what the program is asking the operating system to do on its behalf. This is a good way to determine if a helper program is failing, or system errors are not propagating up to the log file. On Linux machines, the command to trace these system calls is "strace". On the appliance you can add strace to your system with:
$ conary update strace
Other POSIX-like operating systems have their own commands (truss on Solaris, dtrace on OS X). For example, you can verify that zenperfsnmp is really sending packets:
$ strace -f -e trace=sendto $ZENHOME/bin/zenperfsnmp run
Don't forget the largest and most complex Zenoss daemon: MySQL.
Check the version against the one needed by Zenoss. If you see "Lost Connection to Zenoss" in the dashboard, it is likely a MySQL connection problem.
Narrow the Problem
If you have managed to limit a problem to a single device, or simply suspect a device because it's running an odd configuration, or was recently added, most of the active collectors will allow you to scan a single device:
$ $ZENHOME/bin/zenperfsnmp run -v 10 --device SomeDevice
If the problem only shows up in a long-running daemon, you can run the daemon in foreground mode while keeping its normal endless cycle:
$ $ZENHOME/bin/zenperfsnmp run -v 10 --cycle
Look for Conflicts
Are you running more than one copy of a daemon? During debugging and before version 1.1, Zenoss could lose track of background processes.
Stop zenoss and look for stray processes:
$ ps auxww | grep /z
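A plain grep in a pipeline often lists its own process as well. The sketch below is one way to avoid that; the helper name stray_zen_procs is hypothetical, and it assumes that Zenoss daemon command lines contain the text "zen" (as zenperfsnmp does).

```shell
# Sketch: filter `ps` output down to Zenoss-looking processes.
# stray_zen_procs is a hypothetical helper; matching on "zen" is an
# assumption about how the daemon command lines look.
stray_zen_procs() {
    # [z]en matches the text "zen" but not the literal "[z]en" that
    # appears in grep's own command line, so grep does not list itself
    grep '[z]en'
}
```

It reads from standard input, so it would typically be used as: ps auxww | stray_zen_procs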
Are you resource limited? Is the file system full? Do you have free memory? My favorite tool for this is "top" under Linux:
$ top
This program will constantly update the display with a list of the most CPU hungry programs. You can also sort the list by memory usage.
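For the "is the file system full?" question, a small filter over df output may be quicker than reading the table by eye. This is only a sketch: the name full_filesystems and the 90 percent default are assumptions, and it expects the POSIX output format produced by df -P.

```shell
# Sketch: print mount points whose usage is at or above a limit.
# full_filesystems is a hypothetical helper; it expects `df -P` output
# on standard input (capacity in column 5, mount point in column 6).
full_filesystems() {
    limit="${1:-90}"
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                 # strip the trailing percent sign
        if (pct + 0 >= limit) print $6, $5
    }'
}
```

Typical use would be: df -P | full_filesystems 90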
Reproduce the Problem
The ability to reproduce a problem with a consistent set of steps will help enormously. Often the only way to find the cause is to use "binary search": reproduce the problem, take away half of the configuration, and test again. Keep halving whichever part still shows the problem until a single element remains.
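The halving procedure can be sketched as a small shell function. Everything here is hypothetical: bisect_items is not a Zenoss tool, the input is assumed to be a file with one configuration element per line (device names, for example), and the check command is something you supply that exits 0 when the problem does not occur with the given subset. The sketch also assumes exactly one element is at fault.

```shell
# Sketch: binary search for a single bad item.
# bisect_items FILE CHECK
#   FILE  - one item per line (for example, device names)
#   CHECK - command or function that takes a file of items and exits 0
#           when the problem does NOT reproduce with those items
# Assumes exactly one item causes the problem.
bisect_items() {
    items="$1"
    check="$2"
    work=$(mktemp)
    half=$(mktemp)
    cp "$items" "$work"
    # Keep halving while more than one candidate remains
    while [ "$(grep -c '' "$work")" -gt 1 ]; do
        n=$(grep -c '' "$work")
        mid=$((n / 2))
        head -n "$mid" "$work" > "$half"
        if "$check" "$half"; then
            # First half is clean, so the culprit is in the second half
            tail -n "+$((mid + 1))" "$work" > "$half"
        fi
        cp "$half" "$work"
    done
    cat "$work"
    rm -f "$work" "$half"
}
```

Each round costs one reproduction attempt, so a list of 64 items needs about six attempts instead of 64.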
Getting Help Solving the Problem
Search the Zenoss Forums
Hopefully someone has seen a similar problem.
Report the Problem to Zenoss
Some problems should be reported, even in the absence of detailed information, because they are almost certainly bugs:
- the Python interpreter crashes (a segmentation fault, for example)
- a Python traceback appears in a log file
- a daemon regularly drops heartbeats
- a daemon's size grows over time to consume all resources
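To back up a report of a growing daemon with numbers, you could record its resident memory at intervals and summarize the samples. The helper below is a sketch with a hypothetical name, rss_growth; it assumes one memory sample in KB per line on standard input.

```shell
# Sketch: summarize memory samples as "first last delta".
# rss_growth is a hypothetical helper; it reads one resident-set-size
# sample (in KB) per line and ignores blank lines.
rss_growth() {
    awk 'NF {
             if (!seen) { first = $1; seen = 1 }
             last = $1
         }
         END { if (seen) print first, last, last - first }'
}
```

Samples can be collected by running ps -o rss= -p followed by the daemon's process ID every few minutes and appending the output to a file, then feeding that file to rss_growth.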
