This page is in group Technology. Coding with timeouts may be the only correct solution, but it may also become one of your worst coding nightmares. Read over, this may be a rather complicated note that perhaps requires some experience.
«Timeouts and several parties causing antiresonance stop»
..because that might have been a verbose title. I piggy-back on the «antiresonance» term (see Wiki-refs below) here to imply that some software may become part of an oscillation without actually noticing it, causing the system to stop and thus malfunction. This figure gave me the mental picture to remember it by:
By Deltacrux – Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=31228298
Figure text: «Animation showing time evolution to the antiresonant steady-state of two coupled pendula. The red arrow represents a driving force acting on the left pendulum.»
It’s described in the Wikipedia Antiresonance Applications chapter (copy here 15April2016, original is here):
An important result in the theory of antiresonances is that they can be interpreted as the resonances of the system fixed at the excitation point. This can be seen in the pendulum animation above: the steady-state antiresonant situation is the same as if the left pendulum were fixed and could not oscillate. An important corollary of this result is that the antiresonances of a system are independent of the properties of the driven oscillator; i.e. they do not change if the resonance frequency or damping coefficient of the driven oscillator are altered.
This result makes antiresonances useful in characterizing complex coupled systems which cannot be easily separated into their constituent components. The resonance frequencies of the system depend on the properties of all components and their couplings, and are independent of which is driven. The antiresonances, on the other hand, are dependent upon the component being driven, therefore providing information about how it affects the total system. By driving each component in turn, information about all of the individual subsystems can be obtained, despite the couplings between them.
I am not certain if I really understand this, or if it is relevant for this blog note. However, cognitively it is relevant to me: the two pendulums are both happy doing their own work. But since they are connected, in this case one of them soon takes over and slowly stops the other! One might say that neither of them observes this from the amount of its push?
First: delay and polling is ok
This note is about software. More than one component. Then I’d say it’s also about concurrency.
If you write a loop that prints out the elapsed seconds since the start and does a wait_ms(1000), then firstly you haven’t implemented a good clock (it would run slow), but secondly you have shown the use of delay. The operating system software understands delay and can let the other software do its job during the waiting time. No problem here.
(By the way, something like time=wait_after_ms(time,1000) would have removed any skew built up by the processing time for you.)
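A minimal sketch in Python of what such a call could look like. The name wait_after and its parameters are hypothetical, modelled on the wait_after_ms pseudocode above. The clock and sleep functions are passed in only so that the behaviour can be checked without really waiting.

```python
import time


def wait_after(deadline, period, now=time.monotonic, sleep=time.sleep):
    """Sleep until deadline + period and return that new deadline.

    Hypothetical counterpart of wait_after_ms: the next wake-up is computed
    from the previous deadline, not from "now", so the processing time
    between calls does not build up as skew.
    """
    next_deadline = deadline + period
    remaining = next_deadline - now()
    if remaining > 0:
        sleep(remaining)
    return next_deadline


def run_clock(ticks, period=1.0):
    """Print elapsed seconds once per period without accumulating skew."""
    deadline = time.monotonic()
    for elapsed in range(1, ticks + 1):
        deadline = wait_after(deadline, period)
        print(elapsed)
```

With a plain sleep of one period per round, each round would last one period plus the processing time. Here the processing time is subtracted from the next wait instead.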
If we instead measured the temperature in an incubator (for eggs on a farm?) every 60 seconds and drove a thermoelectric element to heat if it got too cold, and the same element to cool (by reversing the current) if the compartment got too warm, then that’s also a very good use of delay. We polled or sampled a thermistor for the purpose of new chickens.
I have described a related theme in note 109. Also, about Hoare’s «Concurrent programs wait faster» in note 062. A newer note, partly inspired by (and partly overlapping) this is 128, «Timing out design by contract with a stopwatch».
Then: internal timeouts when they are plain wrong
A colleague at work told me that some years ago he himself had a colleague who insisted that when internal software components (tasks or processes) communicate with one another, there should always be a timeout on the communication! This is the opposite of programming by contract or protocol, and is very difficult to get right in all situations. The short explanation is: how do you treat a message that arrives after you have timed out, when your process has continued doing other things? Is the late message something that you may discard? Maybe it just isn’t. More later.
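The late-message problem can be shown with a small Python sketch. It is only an illustration under simple assumptions: one reply queue, no threads, and a hypothetical helper named request. It is not taken from any of the systems discussed here.

```python
import queue


def request(replies, timeout_s):
    """Wait for one reply; return None if it does not arrive in time."""
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        return None


replies = queue.Queue()

# Request 1: the peer is slow, so the requester times out and moves on.
first = request(replies, timeout_s=0.01)

# The reply to request 1 arrives only after the timeout has fired.
replies.put("reply-to-1")

# Request 2: the requester now reads the stale reply as if it answered
# request 2.
second = request(replies, timeout_s=0.01)

print(first, second)
```

Tagging requests and replies with a sequence number would let the receiver recognise the late message, but it would still have to decide what that message means for its state, which is the hard part.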
I can think of only one situation where this could be ok: if you have a design or coding practice that would allow a process to crash from some programming error, provided that the other processes keep running. Individual Linux threads could be an example. Anything could happen to any process. In this case you might think it risky not to have a timeout for every communication. I think this is the reason why many Linux programmers don’t do threading if they don’t have to.
In the case above we unfortunately mix layers. Timeout should perhaps have been handled by a lower layer, see next chapter. Then it’s not a timeout you should wait for at the application layer. You decode the message that comes in; it’s a proper message most of the time. But sometimes it’s a meta message (from a lower layer) saying connection down.
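A minimal sketch in Python of this layering, assuming that the lower layer delivers either a normal message or a meta message to the application. The class names Data and ConnectionDown and the handle function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Data:
    payload: str


@dataclass
class ConnectionDown:
    reason: str


def handle(message, state):
    """Application-layer dispatch: no timer here, only message decoding."""
    if isinstance(message, ConnectionDown):
        # The lower layer owns the timeout and reports its outcome here.
        state["connected"] = False
        state["log"].append("link down: " + message.reason)
    elif isinstance(message, Data):
        state["log"].append("data: " + message.payload)
    return state


state = {"connected": True, "log": []}
for msg in [Data("REPLY-BURN"), ConnectionDown("no answer at link level")]:
    handle(msg, state)

print(state)
```

The application never starts a timer of its own. It only reacts to what arrives, so there is no late message for it to misinterpret.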
I remember another situation where we did have contracts between internal processes, clean and without timeouts, when I had to introduce a timeout just because the other code could not guarantee that it always held the contract. In other words there was no contract.
A precondition
There probably is a precondition to much of this talk: a process picks up the next message from any other process in an event loop (well, rather not… since it implies callback functions, see 092 plus many of my other notes) or rather (in my opinion) in a select loop. Examples are Linux select, the selective choice select of Go (Golang) or XMOS XC, alt as in UML / WebSequenceDiagrams (below) and AltSelect as in PyCSP. Modeling languages like CSPm and Promela also support this, plus a myriad of other languages. Ada uses entry calls to handle the rendezvous. Most of my technical blog notes spin around this theme.
Really, it doesn’t much matter if all messages are picked up in this construct at the top of the task (process, thread) or if individual sends and receives are used down in the code. However, with guarded selective choice statements (or using empty channels as in Golang) it’s probably more convenient to bundle everything together at the top.
To the point: external timeouts and busy etc. may stop the system
When you communicate with components connected with a cable or wi-fi (wifi) or over the internet then there has to be some kind of timeout at some level. Usually this is not handled by the application layer. Connection down or connection not up have to be there. I have described some of this here. Case closed.
But here’s the situation that triggered this blog note: a client does a polling type of querying, with a timeout, of a box of units (servers) that themselves have a window of time when they are not able to reply.
Example
A client’s link level can do polling of ten servers (address #1 to #10) in one second, i.e. 100 ms per address. The application in the client sends out a message DIR-BURN to server #1 and expects an application level reply REPLY-BURN within three seconds. One server may be unable to send this response (the DIR-BURN was well received at the link level and ACKed there) for a maximum of two seconds, since BURNing has to be finished before the REPLY is sent. Usually BURNing goes well within 100 ms, meaning that the REPLY would be sent on the next polling round. In one second at the fastest, as we shall see:
The link level may pull in more replies once it’s got one. Let’s say that in order to level out the communication load, and in trying to give each server a fair amount of attention – it pulls in two replies from each server. Let’s say that this takes 200 ms per server. At worst, for nine units (we have subtracted the one that got the DIR-BURN) this is 1.8 seconds. Again, at worst, let’s say that all those nine units have so much to tell that this goes on also for the next round.
Here’s the sequence: DIR-BURN to #1 plus two replies from each of the others has taken us 1.9 seconds out. At 2.0 seconds #1 did not have its REPLY-BURN ready. It did at 2.1 seconds, but that was too late for this round. Now #2 to #10 use another 1.8 seconds, taking us to 3.8 seconds before the REPLY-BURN arrives – 800 ms too late!
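The arithmetic of this worst case can be recomputed with a small Python sketch. The constants are taken from the example. The model itself is a simplifying assumption: the reply is picked up at the moment the link level turns to #1 again, and an empty poll of #1 still costs one 100 ms slot. All names are hypothetical.

```python
SLOT_MS = 100            # link-level time per polled address (from the example)
REPLIES_PER_SERVER = 2   # replies pulled in from each busy server
OTHER_SERVERS = 9        # servers #2 to #10
BURN_MAX_MS = 2000       # worst-case time before REPLY-BURN is ready
APP_TIMEOUT_MS = 3000    # the client application's requirement


def reply_burn_arrival_ms(burn_ms=BURN_MAX_MS):
    """Return when REPLY-BURN reaches the client, in ms after start."""
    t = SLOT_MS                      # 0..100 ms: DIR-BURN delivered to #1
    ready_at = t + burn_ms           # REPLY-BURN exists from this time on
    busy_round = OTHER_SERVERS * REPLIES_PER_SERVER * SLOT_MS
    while True:
        t += busy_round              # #2..#10 each deliver two replies
        if t >= ready_at:
            return t                 # link level is back at #1, reply is ready
        t += SLOT_MS                 # an empty poll of #1 still costs a slot


arrival_ms = reply_burn_arrival_ms()
late_by_ms = arrival_ms - APP_TIMEOUT_MS
print(arrival_ms, late_by_ms)        # prints "3800 800"
```

Changing any one constant, such as the number of replies pulled in per server, moves the arrival time across the three second limit. This shows how the application timeout silently depends on link-level behaviour.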
Nothing is «wrong» here except for the setting of the original timeout requirement of three seconds; or perhaps the whole communication architecture. We see that the three seconds timeout now gets to be a mix of timeouts of link level and application level behaviour.
I think the antiresonance similarity is this: as the load slowly increases, the client application’s requirement of three seconds, the delay of the servers’ application REPLY by a maximum of two seconds, and the maximum round time of also two seconds together get one of the components to stop. They don’t nicely oscillate between asking and delivering any more. In this case it’s the server that stops. It has not been designed to handle the late arriving REPLY-BURN properly. It throws it away and does not get on with its state machine. This is a traditional deadlock. Neither the client nor the servers are able to proceed.
Don’t think this gets any easier by knowing that the communication line between the client and the servers is not reliable. The cable may be pulled, after which the client must detect and warn (link level) and not deadlock (application level). And they all have to reconnect when the cable is plugged back in.
It’s up to you to model this in some way (formally or in words) and find a better solution. In most cases, increasing or dynamically adapting the timeout, or introducing a BUSY reply (although possible remedies), probably just gets the application to the point of stopping later. Anyway, when you have left the project or the company there is a new programmer who may not be aware of your tuning!
The ideal solution, in my opinion, is to implement all scenarios that you can’t design away from happening and then have a recipe for how each and every scenario is to be solved. The ideal start, in my opinion, is to work for a long time on the specification. Like how fast a broken cable shall be seen.
MSC to show it?
I think a Message Sequence Chart (MSC, or Diagram for MSD) dynamic simulator in JavaScript would have been very nice. Like this tree described here? It should be dynamic and show the action that ends in this stop.
Simpler, ready-made tools that generate static diagrams could be used, like
- WebSequenceDiagrams – even if states are available only in a purchased version and it’s a server side solution with a plugin for diagrams to appear here in WordPress
- js-sequence-diagrams is inspired by WebSequenceDiagrams and is free. I reckon it’s possible to draw such a diagram step by step, thus making it dynamic. This also means that I can do standard debug printouts in this format and then have the msc drawn. However, so far I think the only place to store the msc txt-file is on a web server (no! see below). I have in fact tested this out, see below. (Of course, knowing the internal format of the other tools might also get me to this point)
- draw.io looks very nice! It’s «A web based diagramming application built on mxGraph» and «mxGraph – A JavaScript diagramming component, started in 2005, that works on all major browsers, including touch devices and back to Internet Explorer 8» – pasted from https://www.jgraph.com
- If I were not to be able to present it (more or less) live with JavaScript, then with Python there might be seqdiag, see http://blockdiag.com/en/seqdiag/introduction.html
- Here’s a comparison of some: wiki/Comparison_of_network_diagram_software
- Martin at work pointed me to Mermaid
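The idea of doing debug printouts in the js-sequence-diagrams text format, mentioned in the list above, could be sketched like this in Python. The helper msc_lines and the event tuples are hypothetical. The participant names and messages are borrowed from this note only as sample data, and their pairing is an assumption. The only format relied upon is the basic «Sender->Receiver: Text» line.

```python
def msc_lines(events):
    """Turn (sender, receiver, text) tuples into js-sequence-diagrams lines."""
    lines = []
    for sender, receiver, text in events:
        # Strip stray white space so that "S " and "S" stay one participant.
        lines.append(f"{sender.strip()}->{receiver.strip()}: {text.strip()}")
    return "\n".join(lines)


# Sample debug log; who sends what to whom is an assumption for illustration.
log = [("Master", "S", "DIR-BURN"), ("S ", "Master", "REPLY-BURN")]
msc_text = msc_lines(log)
print(msc_text)
```

The resulting text could be saved as an msc txt-file and handed to the drawing tool unchanged.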
Standard disclaimer: I have no relationship with any of the urls pointed to and discussed in any of my blog notes. There is no money involved and no gifts accepted (yes, I have been offered). I do this only because I so much like doing this.
Testing the js-sequence-diagrams tool
I have removed some code I placed here. JavaScript embedded directly in WordPress doesn’t work the next time I edit, since WordPress changes the script when I go from Text to Visual editing. It’s too volatile to keep that code. This is a WordPress feature that I must live with. But this seems to solve it:
MSC with CJT plugin
Here’s the MSC description file. Below is the MSC. The code on the left side (MTimer, Master and M) cannot be changed even if this behaviour revealed a coding error in it (thousands installed). The code in the centre (S and STimer) can be changed, but then (Sub) cannot (mask-burnt controller). What’s a possible (and acceptable) solution?
I haven’t been successful in showing the «antiresonance» pattern here, I guess I would need a dynamic MSC to indicate it; one that shows the messaging over some time. Anyhow, the MSC below gives a flavour. Thanks to js-sequence-diagrams! I love it:
[cjtoolbox name=’p_125_ex3′]
Again: here’s the MSC description file. The code in the CSS & Javascript Toolbox now looks like this:
<script src="https://www.teigfam.net/oyvind/js/underscore-min.js"></script>
<script src="https://www.teigfam.net/oyvind/js/raphael-min.js"></script>
<script src="https://www.teigfam.net/oyvind/js/sequence-diagram-min.js"></script>
<script src="https://www.teigfam.net/oyvind/js/jquery-1.12.3.min.js"></script>
<div id="diagram3"></div>
<script>
  $.get("http://www.teigfam.net/oyvind/blog_notes/125/001_msc.txt", function(file) {
    var diagram_f = file;
    var diagram3 = Diagram.parse(diagram_f);
    diagram3.drawSVG("diagram3", {theme: 'simple'});
  });
</script>
It’s referred here with this shortcode:
[cjtoolbox name='p_125_ex3']
MSC at home from an html file
As we have seen, running JavaScript in WordPress successfully (also after an update) requires a plugin. The reason is that the WordPress environment builds the WordPress page you read from a database. Even worse: it stores it into the same database when I update it. And things have to be clean and nice and work and have as few side effects as possible. The price is that it does not work so well. The plugin holds JavaScript in a wrapper that WordPress understands.
But what about an old-fashioned html file? It just works. That is, in Safari and Firefox on OS X (Mac). Not in Chrome on OS X. A friend of mine tested Firefox, IE and Edge on Windows and they didn’t work. Neither did Firefox or Chrome on Linux. So 1. should work on any platform, 2. should run if you run it here but not when you run it locally, and 3. is only for those that use an operating system that interprets only internal usage as no safety threat:
- A clean html file with the above code is in http://www.teigfam.net/oyvind/blog_notes/125/001_msc.html. It should draw the same diagram as above
- However, if you copy the 001_msc.txt MSC text file to your home computer, together with 002_msc.html and then run the latter then the result should look like this 002_msc.pdf
- I can even run the javascripts locally, with the four js libraries, the msc txt file and the html all stored locally, with this file: 003_msc.html. To test it, move all these files to a local directory on your computer. You can of course also switch off internet
- At CPA 2016 I was advised to try to decorate the script tag with type="text/javascript" to be very precise about the script type. It didn’t help. It’s still only Safari and Firefox on macOS that draw the MSC diagram when the files are placed locally. I still put the code at 004_msc.html (move file, as above)
Using js-sequence-diagrams, problems and errata
I really like this package! Twitter here.
State as of 12June2016 with jQuery v1.12.3, Raphaël 2.1.4, Underscore.js 1.8.3 and js sequence diagrams 1.0.6:
- The browsers (in debug or inspect mode) issue two error messages, that they are missing «underscore-min.map» and «sequence-diagram-min.js.map». It still works.
- «participant Func1» and «Func1 -->Func2: Text» yields two participants. The second is «Func1 » with a trailing white space, not possible to see. It might also go for leading white spaces, but I haven’t checked
- It is not possible to not have an arrow. So «Func1--Func2: Synch point» defaults to filled arrow. Sometimes it’s ok to have this, even if it might not be UML. I’d like it just to draw some kind of relation. The syntax diagram (here) states that dropping the arrow is possible. I test with Safari 9.11
- The layout of the text in the boxes seems to be well placed at one particular scaling. However, it is possible to find a scaling of the layout and see the first character cross the left side of the box. Adding a leading space sometimes helps, and sometimes not
- I’d like to have a «note begin SendTask» and «note end SendTask» to have notes that may cross anonymous participants. They could start a little differently than «note over SendTask,ReceiveTask» so it may be seen that they differ.
References
Wiki-refs: Antiresonance, Message Sequence Chart (MSC), Thermistor, Thermoelectric cooling