Fail to error first. Then, fail to safe

Designer's notes #4
Øyvind Teig, Trondheim, Norway (http://www.teigfam.net/oyvind/)

Wrappings may hide crucial details

When you pipe data over a "socket" on top of TCP/IP, your data is error free as long as the connection is open, right? Wrong, until the opposite has been proven, even if a socket is believed to sit at the top of the OSI stack, at layer 7 or application level.

This incident involved several software engineers over some months. As a matter of fact, the last chapter has not yet been written as I start writing this note. The system in mind involves a PC host and an embedded data acquisition and computation unit. The PC contains 1.) a rewrapped Turbo Pascal Windows program, 2.) a dynamic "data base" in occam, and 3.) a socket router in C: three separate programs with several threads each. The embedded unit runs a popular embedded operating system, let's call it Wy, and A.) a router client plus B.) the application program in C with a CSP library. This in turn communicates with X.) a DSP running the concurrent language occam, on a PC/104 board. The topology is PC:1-2-3 ---- Embedded:A-B -- PC/104:X. Routers 3 and A talk with each other over a socket. We had some problems with this connection.

Sometimes, after several minutes, on some machines, the socket went down, and since 2 and X talked on application level, any error would close down 2 and tell the user to please restart. This just shouldn't happen as long as the cable and power were present. The cause was found when we understood that A did not wait for the ack from 3, and that 3 would only send the ack when it had something else to send down. This is standard TCP/IP behaviour, meant to avoid sending small packets all around the world. The no-delay flag was then set at A, and the watchdog heartbeat became regular with no timeouts. Problem solved.
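As an illustration of the no-delay flag mentioned above, here is a minimal sketch of how it is set through the standard sockets API. The original router client A was written in C; Python is used here only for brevity, and the function names `open_nodelay_socket` and `nodelay_enabled` are hypothetical, not taken from the system described.

```python
import socket


def open_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the no-delay flag (TCP_NODELAY) set.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the stack to send small segments at once instead of
    # holding them back while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Read the option back, so the setting is verified rather than assumed."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

Reading the option back is cheap, and it fits the theme of this note: a wrapper may silently ignore a setting, so it is worth checking that the flag really took effect.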

But this was when things really started to happen, though much more seldom. The next day, 10 hours later, we had another stop. A Snoop program was set up to log the traffic. An expert, this time in-house, analysed the huge Snoop log and found that A forgot to send off an IP packet: 834 bytes went missing. We failed "to error" and had an error in the sieve. The interesting thing was that the error was in none of 1, 2, 3, A, B or X. It was in the TCP/IP stack of the embedded Wy operating system.

We had been running this same software in another system, a more safety-critical application, where everything had been done to ensure reconnection and resynchronization, for solid up-time. Once, when there was a lot of traffic on the net, one of the engineers noticed a hiccup, but the parties were soon talking again: they had "failed to safe". We took notice, but these things happen.

It was only on second thought that we saw the connection between the two errors, and how the poor help the rich. Because more safety had been built into the second application, it could live forever without ever seeing the Wy driver error, in which data were actually lost. You have to build a pretty solid protocol to handle this: new layers on top of layer 7. The less sturdy "fail to error" system gave a helping hand to the "fail to safe" system. By detecting the error, the first system had a better quality on its up-time, as long as the cable was present, than the second system. When we send off money for a renewal of the support contract, maybe Wy will fix their error, with our money.
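What such an extra layer might look like can be sketched as follows. This is not the protocol used in either system; it is a minimal illustration, assuming each application message is wrapped in a frame carrying a sequence number, a length and a CRC-32, and that the receiver is handed one complete frame at a time (reassembling frames from the byte stream is left out). All names (`encode_frame`, `FrameChecker`, `ProtocolError`) are hypothetical.

```python
import struct
import zlib

# Frame header: sequence number (4 bytes), payload length (2 bytes),
# CRC-32 of the payload (4 bytes), all in network byte order.
HEADER = struct.Struct("!IHI")


class ProtocolError(Exception):
    """Raised when a frame is damaged or a frame has gone missing."""


def encode_frame(seq: int, payload: bytes) -> bytes:
    """Wrap a payload (at most 65535 bytes) in a checked frame."""
    return HEADER.pack(seq, len(payload), zlib.crc32(payload)) + payload


class FrameChecker:
    """Receiver side: fail to error on any gap or damage."""

    def __init__(self) -> None:
        self.expected_seq = 0

    def decode(self, frame: bytes) -> bytes:
        if len(frame) < HEADER.size:
            raise ProtocolError("frame shorter than header")
        seq, length, crc = HEADER.unpack_from(frame)
        payload = frame[HEADER.size:]
        if len(payload) != length:
            raise ProtocolError("payload length does not match header")
        if zlib.crc32(payload) != crc:
            raise ProtocolError("checksum mismatch")
        if seq != self.expected_seq:
            # A lost frame shows up here as a jump in sequence numbers.
            raise ProtocolError(
                f"expected frame {self.expected_seq}, got {seq}"
            )
        self.expected_seq = (seq + 1) & 0xFFFFFFFF
        return payload
```

With a check like this the receiver reports the loss first. Whether it then closes down or resynchronizes is a separate decision, made after the error has been seen and counted.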

01.2004

Other publications at http://www.teigfam.net/oyvind/pub/pub.html