
How a single byte could crash the server!

Tags: Lotus Domino
This is the story of a Domino server that crashed and burned every time an integration agent built with Lotus Connectors (use "lsxlc"...) ran. Read on to see how I discovered which agent was the problem and why I pinpointed Lotus Connectors as the culprit.
A customer experienced an agent going completely astray the other day. The environment was Domino 6.5.1 running on a Windows 2003 server.

For some reason the Domino server crashed really hard. Sometimes a Notes System Diagnostic (NSD) file was generated, and sometimes not. The NSD dump file is a snapshot of the state of the machine, with lots of information about the running processes and their threads as they were when Domino crashed. Much of the information in the NSD file is highly technical, and you will see things like call stacks, memory dumps and processor register contents. However, by following the simple rules below, you can often at least get pointed in the right direction.

1. The NSD file is a text file and can be opened with a text editor such as Notepad or EditPlus. Open the NSD log file from the IBM_TECHNICAL_SUPPORT directory located in the default Domino data directory. The NSD log file names follow the format type_platform_systemname_date@time.log, such as nsd_all_W32I_NotesSrv1_10_06@07_40.

2. Search for the words FATAL, PANIC or ERROR. Beware that searching for just ERROR may produce too many hits. FATAL or PANIC works best!
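
If the NSD file is too large to search comfortably in an editor, the two steps above can be approximated with a few lines of Python. This is only a minimal sketch: the function name `find_crash_markers` is my own invention, and it assumes the keywords appear in upper case as whole words, as described above.

```python
import re
import sys

# Keywords suggested in the text. ERROR is opt-in because it tends to be noisy.
STRONG_MARKERS = ("FATAL", "PANIC")


def find_crash_markers(lines, include_error=False):
    """Return (line_number, text) pairs for lines containing a marker keyword.

    `lines` can be any iterable of strings, for example an open file.
    The match is case-sensitive and on whole words only.
    """
    words = STRONG_MARKERS + (("ERROR",) if include_error else ())
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b")
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_markers.py <path-to-nsd-log>
    # errors="replace" keeps the scan going if the dump contains odd bytes.
    with open(sys.argv[1], errors="replace") as nsd_file:
        for number, text in find_crash_markers(nsd_file):
            print(f"{number}: {text}")
```

The line numbers it prints are just a starting point. You still have to open the file and read the section around each hit.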

Below you see a screenshot from the first FATAL section:

A picture named M2

Note that we see the task name NAMGR. This indicates that the offending component, the one probably causing the crash, is the Agent Manager. Also note the so-called call stack just below, which reveals the internal functions that were called and the sequence in which they were called. You read it like this: LCFieldCompare calls LCStreamCompare, which in turn calls LCStreamGetTextFormat, which in turn calls itself.


From experience I guessed that the LC prefix probably meant Lotus Connectors, which turned out to be right. And as you see, the problem probably occurs when the Lotus Connectors code tries to compare values.

The next step is to try to locate which database and agent were in play. This is often, but not always, revealed in the memory dumps that follow in the NSD file. So scroll down until you find something "readable". Below you see the top of the call stack and the first entries:

A picture named M3

Note how each function call has its own stack content dumped too. For example, the LCStreamGetTextFormat part shows the stack content for the following function in the previous screenshot:

A picture named M4

You will probably have to scroll down many pages before you find something readable. I had to scroll down to the 20th stack-content part before I found what I was looking for. Below you see what the top of the 20th part looked like:

A picture named M5

And I had to scroll down within the 20th part to find this:

A picture named M6

Ahhh, an NSF file in play! The PBS0023.nsf database was in use when the server went down, and was definitely a database to check for agents with Lotus Connectors code inside. And sure enough, it had one!
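
Scrolling through page after page of stack content is tedious, so here is a rough sketch of how the hunt for database names could be shortened. It simply counts everything in the text that looks like an .nsf file name. The function name `count_nsf_mentions` and the pattern for what counts as a file name are my own assumptions, and a name that is broken across two lines of a memory dump will not be found this way.

```python
import re
from collections import Counter

# Assumed shape of a database file name: letters, digits, underscore or
# hyphen, followed by ".nsf" in any case.
NSF_NAME = re.compile(r"[A-Za-z0-9_\-]+\.nsf\b", re.IGNORECASE)


def count_nsf_mentions(text):
    """Return (file_name, count) pairs, most frequently mentioned first.

    Names are lower-cased so that differently cased spellings of the
    same database are counted together.
    """
    counts = Counter(match.group(0).lower() for match in NSF_NAME.finditer(text))
    return counts.most_common()
```

Read the NSD file into a string and pass it in. Databases such as the server's own system databases will probably show up as well, so treat the list as candidates to check, not as a verdict.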

Now, in order to find out what crashed in the server agent (which, by the way, synchronized two databases with Lotus Connectors), I thought that I had to debug the code. This could probably have been done with the Remote Debugger feature in Lotus Notes and Domino, but I couldn't afford to take down the server any more than strictly necessary. I decided to copy the databases involved in the synchronization to my local hard drive and run the agent manually in my Notes client. Luckily for me, the agent crashed in the same way on my client, so I could track down the error. I say luckily because running code on a client is not always the same as running it on the server. A completely different environment and OS may play a part in a crash, and that couldn't have been detected with the strategy above.

So what was the error? It turned out to be a single extra newline in one of the fields!!

I stepped through the agent, record by record, and Lotus Connectors turned out to have a huge issue with a field containing a leading newline. In the Field Properties of the field it looked like this:

A picture named M7

The field Int_Mobile contains a (blurred) phone number with an extra newline in front of it. Through the Notes form, it looked like this:

A picture named M8

Note the small vertical scrollbars on the field, indicating that there is some "extra content" in the field.

How was this solved?

The Domino server crash vanished instantly after I removed the leading newline. The existing Domino 6.5.1 server can run again!

In order to prevent the same error in the future, the following measures were taken:

a) Ensure the Input Translation formula of each synchronizable field contains @Trim(@ThisValue), to remove any leading or trailing newlines or blanks.
b) Upgrade the server to its latest incarnation in the 6.5 code stream. At present that is 6.5.6 Fix Pack 3. Note that I didn't find any mention of this issue in the Notes Fix List.
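
The Input Translation formula only helps for documents saved through the form, so data that is already stored, or that arrives by other routes, may still carry stray whitespace. The idea of checking for it before a synchronization run can be sketched as below. This is an illustration only, not the agent's actual code: it assumes a record is represented as a plain dictionary of field names to values, and both function names are hypothetical. Note that Python's `strip()` only touches the start and end of a value.

```python
def fields_with_stray_whitespace(record):
    """Return the names of text fields with leading or trailing whitespace.

    `record` is assumed to be a dict of field name -> value. Non-text
    values (numbers, dates and so on) are ignored.
    """
    return sorted(
        name
        for name, value in record.items()
        if isinstance(value, str) and value != value.strip()
    )


def cleaned(record):
    """Return a copy of the record with text values stripped at both ends."""
    return {
        name: value.strip() if isinstance(value, str) else value
        for name, value in record.items()
    }
```

Running the first function over all records before the synchronization starts gives a list of the documents to fix, which is a lot cheaper than finding them one server crash at a time.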
