How to configure Netdump on Linux?

What is Netdump?

Netdump is a crash dump facility for Linux. Unlike traditional crash dump facilities, it sends memory images to a centralized server over the network.
The goal of a crash dump facility is to provide fault analysis, particularly exhaustive first fault analysis, in the case of software or hardware bugs that manifest as system crashes (in Linux parlance, Oops, BUG(), or panic). First fault analysis means correcting a bug without having to reproduce it. Linux has traditionally provided an abbreviated signature of a crash, which includes the processor state (on the processor that registered the crash), a stack trace, and a limited instruction trace. The utility of these signatures has been proven over the years; they nearly always provide all the information required to debug a fault, even at first fault.

The network console functionality provides the ability to log all kernel messages, including Linux crash signature messages, to a network syslog server. This has very low system requirements; it merely requires a simple syslog server (any Linux system can serve as a syslog server) that allows incoming network logging. This allows first fault analysis of the majority of crashes.

However, some crashes involve more subtle problems. Some of these problems are not easy, or even possible, to fix after seeing a single Linux crash signature message. Successful first fault analysis of these kinds of problems is sometimes enabled by the ability to look at a memory dump of the kernel image. It is no guarantee, but in certain kinds of cases it significantly increases the odds of successful first fault analysis.

What are the capabilities and requirements of netdump?

The netdump service:

* Saves memory images of up to the first 4GB of memory on a server somewhere on the network.
* Saves a textual representation of the Oops/panic/BUG message and preceding kernel messages in a file associated with the memory image, and after the memory image has been saved, attempts to append task and memory state information in textual form (the equivalent of [alt]-[sysrq]-[t] and [alt]-[sysrq]-[m] output when the [sysrq] key has been enabled).
* Sends console messages to a syslog server in addition to, or instead of, sending them to a netdump server.
* Requires a supported network adapter on the client machine (the machine whose memory image is being saved).
* Requires a server with sufficient storage space and network bandwidth to receive and store the dumps.
* Requires some manual setup on both the client and server side (as of this writing).
* Requires that packets be able to traverse between the client and server, both for ssh connections and for UDP packets aimed at the netdump port on client and server (default is port 6666).
* Currently requires that the server not change IP address (this can be worked around).

How do I set up Netdump?

There are two tasks involved. The first is to set up a netdump server, and the second is to configure clients and server so that the dumps take place.

Server:

The only significant requirements for the server are sufficient disk space and installed development tools (particularly gdb). Each dump is limited to 4GB (at least in the near future), so count the number of crash dumps you want to keep at any one time, multiply by 4GB, add a few extra GB as a fudge factor, and make sure you have that much disk space available. Any modern machine will have enough memory and CPU power. The crash dump images and log files are written to /var/crash, so provide the space in the /var file system, or create a separate file system mounted at /var/crash to hold the crash dump images.
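The sizing rule above is simple arithmetic. A minimal sketch, assuming the 4GB per-dump ceiling stated above; the function name required_dump_space_gb and the default fudge factor of 4GB are illustrative choices, not part of netdump:

```shell
# Estimate the disk space (in GB) to reserve under /var/crash.
# Usage: required_dump_space_gb DUMPS_TO_KEEP [FUDGE_GB]
# Assumes each dump is at most 4GB; FUDGE_GB defaults to 4 (an arbitrary choice).
required_dump_space_gb() {
    echo $(( $1 * 4 + ${2:-4} ))
}
```

For example, required_dump_space_gb 5 prints 24. You can compare the result with the free space reported by df -h /var.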

Install the netdump-server package on the server using RPM, or select it at system install time. The only configuration option you might want to set is the maximum number of concurrent dump operations. If an event causes a large number of machines to crash at once, this limits how many crash dump images the server accepts simultaneously. To set the limit, create the file /etc/netdump.conf with a line like max_concurrent_dumps=5.

Set a password for the netdump user. This is used for propagating a public key used to negotiate a secret cookie for the netdump server to use to authenticate itself to the client. You will only need to type this password once on each client while setting up netdump; you will not need to type it while running netdump. (The README file that comes with netdump-server explains what is going on here; you can set up alternative mechanisms locally if you do not want to provide this netdump password.)

Enable the netdump server to start with the command chkconfig netdump-server on. This will cause it to be started automatically on subsequent boots.

Start the server now with service netdump-server start.
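Taken together, the three server steps above can be sketched as one shell function. This assumes the netdump-server package is already installed and that you are running as root; the function name start_netdump_server is only illustrative:

```shell
# Run as root on the netdump server, after installing netdump-server.
start_netdump_server() {
    # Set the netdump user's password (prompts interactively).
    passwd netdump || return 1
    # Start netdump-server automatically on subsequent boots.
    chkconfig netdump-server on || return 1
    # Start the server now.
    service netdump-server start
}
```

Each step stops the sequence if it fails, so the service is not started when the password was never set.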

Your netdump server is now ready to receive network crash dump images, except for one last bit of configuration that needs to be done per-client in order to authenticate the server to each client.

Depending on whether you wish to also use syslog to log kernel messages from netdump clients, and whether you wish to store those logs on the netdump server, you may also wish to modify the file /etc/sysconfig/syslog and add the -r option, and possibly the -x option, to the SYSLOGD_OPTIONS line, as documented in /etc/sysconfig/syslog.
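As an illustration, the edited line in /etc/sysconfig/syslog might look like the following. The -m 0 shown here is an assumption about what the line already contains on your system; keep whatever options are there and append the new ones. The -r option lets syslogd accept messages from the network, and -x turns off DNS lookups for remote senders.

```shell
# /etc/sysconfig/syslog (fragment)
# -r: accept messages from remote hosts; -x: skip DNS lookups for them.
# "-m 0" is assumed to be what the line already contained.
SYSLOGD_OPTIONS="-m 0 -r -x"
```

syslogd has to be restarted before the change takes effect, for example with service syslog restart.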

Finally, there is a set of scripts that can be run when events happen. The scripts go in /var/crash/scripts, and sample scripts are in the directory /usr/share/doc/netdump-server*/example_scripts/. You can read a description of these scripts in the netdump-server man page. You probably care most about netdump-nospace, which can make room for new crash dump images when the netdump server is filled up with existing images, and netdump-crash, which you can use to report crashes to system administrators.

Clients:

First of all, this will only work with some network devices. As of this writing, the supported drivers are 3c59x, eepro100, e100, tlan, and tulip; more will be added over time. The netdump service logs failures from unsupported drivers, and attempting to load netdump on an unsupported client does no harm, so you can always follow these instructions and then check /var/log/messages to see whether your driver is unsupported.
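After completing the client setup, one way to do that check is to filter the log for netdump entries. The exact wording of the failure message is not given here, so this sketch simply prints every line that mentions netdump; the function name netdump_log_lines is hypothetical:

```shell
# Print netdump-related lines from the system log.
# Usage: netdump_log_lines [LOGFILE]   (defaults to /var/log/messages)
netdump_log_lines() {
    grep -i 'netdump' "${1:-/var/log/messages}"
}
```

Reading /var/log/messages normally requires root.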

Install the netdump package, or choose it at installation time. Edit the file /etc/sysconfig/netdump and add a line like NETDUMPADDR=10.0.0.1 in which you specify the address of the netdump server.

Now, you need to make it possible for the netdump init script to send a dynamic random key to the server. To do that, you can run service netdump propagate and be prepared to give the netdump user's password on the netdump server. This only needs to be done once when the client is set up; it sets up the netdump server to allow connections to provide the dynamic random key to the server each time the module is loaded on the client.

If you do not want to have to provide that netdump password to set up a client, you will need to create some local procedure for appending the contents of the client machine's /etc/sysconfig/netdump_id_dsa.pub file to the server's /var/crash/.ssh/authorized_keys2 file.
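One possible local procedure, as a rough sketch: copy the client's /etc/sysconfig/netdump_id_dsa.pub to the server by whatever trusted means you already use, then run something like the function below on the server. The function name add_netdump_client_key and the location of the copied file are assumptions for illustration only.

```shell
# Append a client's public key to the server's authorized_keys2 file,
# skipping keys that are already present.
# Usage: add_netdump_client_key PUBKEY_FILE [AUTHORIZED_KEYS_FILE]
add_netdump_client_key() {
    key=$(cat "$1") || return 1
    auth=${2:-/var/crash/.ssh/authorized_keys2}
    # Append only when no identical line exists (or the file is missing).
    grep -qxF -- "$key" "$auth" 2>/dev/null || printf '%s\n' "$key" >> "$auth"
}
```

Because duplicates are skipped, running it again for the same client is harmless.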

If you wish to send kernel messages to a syslog server as well as having them logged in a file with the crash dump (if any), you can set that up by specifying the syslog server's IP address in the file /etc/sysconfig/netdump with a line like SYSLOGADDR=10.0.0.2. The syslog server can be the same as or different from the netdump server.

You can, if you wish, set up syslog only, and not netdump, by leaving NETDUMPADDR unset and setting only SYSLOGADDR.
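For reference, a client's /etc/sysconfig/netdump with both destinations configured might contain the lines below. The addresses are the example values used above, not defaults.

```shell
# /etc/sysconfig/netdump (fragment)
# Send crash dumps to this netdump server. Leave unset for syslog-only operation.
NETDUMPADDR=10.0.0.1
# Optionally also send kernel messages to this syslog server.
SYSLOGADDR=10.0.0.2
```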

Enable the netdump client to start with the command chkconfig netdump on. This will cause it to be started automatically on subsequent boots. Then start the service right now with the service netdump start command.
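The client-side commands from this section can be collected into a single function. This is a sketch that assumes /etc/sysconfig/netdump has already been edited and that you are running as root; the function name enable_netdump_client is illustrative.

```shell
# Run as root on each netdump client.
enable_netdump_client() {
    # One-time step: send the client's key to the server
    # (prompts for the netdump user's password on the server).
    service netdump propagate || return 1
    # Start netdump automatically on subsequent boots.
    chkconfig netdump on || return 1
    # Start it now.
    service netdump start
}
```

If you installed the client's public key on the server by hand, leave out the propagate step.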
