Tips, Tricks and Techniques to Tame Timer Trouble

First published on the PARSE Software Devices website September 18th, 2006 © Copyright 2006 by Robert Krten, all rights reserved.

Synopsis

This article discusses several cases of timers as used in a large QNX Neutrino project, and highlights the author's experience with how timer use evolved over the life of the project.

It seems so simple...

The project is a control program for a piece of hardware. The main program is an event-driven set of state machines that performs actions based on various events (such as I/O points changing, timers expiring, or other modules completing tasks) arriving at the main state machine.

It's in this context that I designed the initial timer functionality for this control program. Three simple calls were provided.

The basic idea was that when the control program required a timer (say for determining that a piece of hardware did not get to a desired state within a certain period of time), it would simply arm a timer, giving it the number of milliseconds from "now" when it should fire, and a 32-bit timer ID. Then, the control program would return to its blocking point (a MsgReceive call).

The control program had a regular heartbeat "tick" that would generate a pulse every 10ms or so, and that was used for tickling the software watchdog process, as well as alerting the timer subsystem that it was time to check the timer chains to see if any timers had expired.

If no timers in the chain had expired, nothing happened. If a timer had expired, a pulse would be generated back to the control program, and would eventually be handled the next time the control program hit its MsgReceive rendezvous point.
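For context, the heartbeat itself was just a periodic kernel timer delivering a pulse to the control program's channel. The following is a minimal sketch of how such a heartbeat could be set up; the pulse code, variable names, and choice of clock are my own assumptions, and error checking is omitted:

    #include <time.h>
    #include <sys/neutrino.h>
    #include <sys/siginfo.h>

    #define HEARTBEAT_PULSE_CODE  (_PULSE_CODE_MINAVAIL + 0)  /* illustrative */

    static int     chid;       /* channel the control program MsgReceive()s on */
    static timer_t heartbeat;  /* the 10 ms base timer */

    static void setup_heartbeat(void)
    {
        struct sigevent   ev;
        struct itimerspec its;
        int               coid;

        chid = ChannelCreate(0);
        coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);

        /* each expiry delivers a pulse back to our own channel */
        SIGEV_PULSE_INIT(&ev, coid, SIGEV_PULSE_PRIO_INHERIT,
                         HEARTBEAT_PULSE_CODE, 0);
        timer_create(CLOCK_REALTIME, &ev, &heartbeat);

        /* fire in 10 ms, then every 10 ms thereafter */
        its.it_value.tv_sec     = 0;
        its.it_value.tv_nsec    = 10 * 1000000;
        its.it_interval.tv_sec  = 0;
        its.it_interval.tv_nsec = 10 * 1000000;
        timer_settime(heartbeat, 0, &its, NULL);
    }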

It seems so simple, and yet there were many problems with this.

In this article, I'll discuss two scenarios with timers and blocking calls, and examine the problem in depth as well as the solution.

The First Implementation and Problem

The first implementation of this timing system is the simplest. Every time that the control program's "heartbeat tick" occurred, we assumed that 10ms had gone by — after all, the heartbeat tick was controlled by a timer_settime function, and it was programmed to 10ms. (The minor quibble about whether the kernel's timing system would give us exactly 10ms, and not something like 9.999ms or 10.001ms was irrelevant — the things we were timing were in the hundred milliseconds to tens of seconds range, so a few microseconds either way was unimportant.)
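As a rough sketch of that first implementation (the type and function names here are mine, not the project's), each heartbeat pulse simply knocked 10ms off every active timer in the chain:

    #include <stdint.h>
    #include <stddef.h>

    /* One entry in the timer chain: a relative countdown in milliseconds. */
    struct soft_timer {
        struct soft_timer *next;
        uint32_t           id;        /* 32-bit timer ID supplied by the caller */
        int                active;
        int32_t            ms_left;   /* relative time remaining */
    };

    static struct soft_timer *chain;  /* all known timers */

    extern void notify_expiry(uint32_t id);   /* pulses the main state machine */

    /* Arm (or re-arm) a timer: "fire 'ms' milliseconds from now". */
    void arm_timer(struct soft_timer *t, uint32_t id, int32_t ms)
    {
        t->id      = id;
        t->ms_left = ms;
        t->active  = 1;
    }

    /* Called once per heartbeat pulse: assumes exactly 10 ms has gone by. */
    void timer_tick(void)
    {
        struct soft_timer *t;

        for (t = chain; t != NULL; t = t->next) {
            if (!t->active)
                continue;
            t->ms_left -= 10;             /* the flawed assumption */
            if (t->ms_left <= 0) {
                t->active = 0;
                notify_expiry(t->id);     /* tell the state machine it popped */
            }
        }
    }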

Well, for a long time, this worked, or at least seemed to.

Occasionally, there were unexplained events, where it looked like the hardware had failed — from the point of view of the control program, it looked like the hardware didn't reach a particular state within its allotted time because its associated timer had "popped" (timed out).

These events were rare, and thus were prioritized at the bottom of the work queue. When we finally got around to analyzing the problem, it turned out that time was "running too fast". This was a real head scratcher. Surely, every time the timer ticked, 10ms had gone by, so therefore, simply subtracting 10ms from each timer in the timer chain would be the correct thing to do, no ifs ands or buts about it, right? Well, that's the way it was designed, but not the way it turned out to work "in system".

Consider the following sequence of operations:

  1. display message to operator
  2. arm timer for 2 seconds
  3. trigger hardware
  4. go back to MsgReceive

One of the assumptions of the system is that all function calls are virtually non-blocking. This means that when we issue the function call to display a message to the operator, it might block for a few hundred microseconds, but it certainly wouldn't go away for "a long time" (e.g. tens of milliseconds). However, like most cookbook :-) implementations using QNX Neutrino, this was a deeply-blocking system (e.g., A sent a message to B and blocked, B passed the request on to C and blocked, C performed the work, then unblocked B, which then unblocked A. Effectively, A was "deeply blocked" on C).

The practical impact of this is that after we called the function to display a message to the operator, we then armed the timer for 2 seconds (which really only put the value 2000 into the timer's timeout field). Due to the implementation, the function to display the message may indeed have blocked for a long time (without getting into the gory details, there's a serial protocol involved with timeouts and retries).

Where things get interesting is what happened during the time that the control program was blocked. Well, the kernel was dutifully sending us 10ms pulses, and since there was nobody around to receive them, they got queued by the kernel. When we finally unblocked from the display function, we then set the timer value to 2000, and eventually got back to the MsgReceive function. From there, we immediately unblocked with a heartbeat tick. This told us that 10ms had elapsed, so we changed the timeout value from 2000 to 1990, and went back to the MsgReceive function. We immediately unblocked from that with another heartbeat tick, which once again meant that another 10ms had elapsed, and so we changed 1990 to 1980, and so on, until we had processed all of the queued heartbeat ticks. Of course, we were processing heartbeat ticks at a rate much faster than once every 10ms; it would have been more like one every few microseconds. This meant that, from the perception of the control program, time seemed to run very quickly for a short period of real elapsed ("wall clock") time.

Thus, we ran into situations where the timer had "instantly expired" (or at least came really close, thus at least dramatically shortening its actual time). Once the timer expired, of course, we would conclude that the hardware had failed. (Notice that the hardware is triggered after the timer is set; this is fine, normal, and expected. The problem is that there are queued pulses already waiting for us to process once we get back to the MsgReceive function.)

And the First Simple Fix

Having had the "Aha!" moment, there was an equal and opposite "Aw Crap!" moment. How were we going to fix this? The first solution that came to mind was, "We need to rewrite the entire control system (about 24kLOC)." Of course, that was ruled out almost instantly, but it's funny how that's always the first solution that comes to mind :-)

Anyway, I concluded that the problem was that we were lying to the timer function, by telling it that 10ms had expired, when in fact we really had no idea how much real time had expired. So, the fix seemed fairly simple. Instead of telling the timer function how much time had expired, we simply changed all the timers to use absolute time instead of relative time, and the "10 ms has expired" function changed meaning and simply became a "some amount of time has expired, you should check your timers". This way, we weren't lying to the timer expiration function by giving it a bogus number (of how much time had expired), but rather we were just telling it to go check its chain. Since the timers in the chain now had absolute expiration times, it was a simple matter of doing a comparison against the current time and the expiration time to see if a timer had indeed expired.
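In code terms, the fix changed the timer entries to carry an absolute deadline, and changed the per-pulse handler from "subtract 10ms" to "compare against the current time". Here's a sketch, again with my own hypothetical names; where the current time actually comes from is discussed later in the article:

    #include <stdint.h>
    #include <stddef.h>

    extern uint64_t current_time_ns(void);   /* "now", in nanoseconds (see later) */
    extern void     notify_expiry(uint32_t id);

    struct soft_timer {
        struct soft_timer *next;
        uint32_t           id;
        int                active;
        uint64_t           expiry_ns;   /* absolute expiration time */
    };

    static struct soft_timer *chain;

    /* Arm a timer: convert the relative request into an absolute deadline. */
    void arm_timer(struct soft_timer *t, uint32_t id, uint32_t ms)
    {
        t->id        = id;
        t->expiry_ns = current_time_ns() + (uint64_t)ms * 1000000ULL;
        t->active    = 1;
    }

    /* Called on every heartbeat pulse; the pulse now means only
     * "some amount of time has passed, go check your chain". */
    void timer_check(void)
    {
        struct soft_timer *t;
        uint64_t           now = current_time_ns();

        for (t = chain; t != NULL; t = t->next) {
            if (t->active && now >= t->expiry_ns) {
                t->active = 0;
                notify_expiry(t->id);
            }
        }
    }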

The thinking as it applied to the deeply-blocking situation was this. Sure, the kernel would queue up heartbeat timer pulses for the control program, but all that would happen is that the granularity of time would appear to be much better for a few hundred microseconds after we came out of the deeply blocking state, rather than having time itself "move" faster.

To walk through the scenario, let's say it's exactly 11:00:00 now, and we need a timer to pop in 2 seconds. So, we display the status (which would block for, let's say, 1 second), and then we'd arm the timer to pop at 11:00:03, and then go back to the rendezvous point of the MsgReceive function. Since we blocked for 1 second, at 10ms per tick, that means that 100 pulses would be waiting for us. Having nothing else to do, we'd rip through those 100 pulses in our main MsgReceive loop very quickly, and call the timer function (the "check your timers" one) 100 times in a very short time — say for argument's sake that this took 876 microseconds. This means that starting at 11:00:01.000 (the instant we armed the timer, i.e., the time we unblocked), we would check the timers 100 times (during the period between 11:00:01.000000 and 11:00:01.000876), and in each case conclude that nothing had expired (we're waiting for 11:00:03), and thus we would have drained the pulse queue.

This worked fine, but it caused another problem in a different system.

And the Next Problem with that is...

The previous system was a block-then-arm implementation, meaning that there was a blocking call, and then a timer-arming call.

In the same control system, there's also an arm-then-block implementation, meaning (as you would expect) that we arm a timer, and then block. In an arm-then-block implementation, the previous scheme (relative timers, with a fixed amount of time assumed to go by on each pulse) will of course work. The timer is armed, and it now knows how many more pulses need to go by before it's considered expired. Since (we're assuming) no pulses had been queued before the timer was armed, it's starting at a relative zero point in the timing cycle. When we're blocked, timer pulses will queue, and that's fine, because that's a true reflection of reality — time is indeed going by while we are blocked.

Of course, changing the fundamental timing system fixed the block-then-arm, but broke the arm-then-block version.

To reiterate, the arm-then-block sequence is:

  1. hardware indicates status
  2. arm timer for 4 seconds
  3. display message to operator

In practical terms, this implements a hardware monitoring system. When the hardware tells us that it has reached a certain state, we expect that it will reach that state again within four seconds, so we set a four second timer. At some point, we're also displaying messages to the operator, and these can indeed block for a long time.

What we were seeing in the field was that an apparently unrelated error had occurred on the hardware, even though the hardware was operating normally. On deeper analysis, it looked like the 4 second timer had popped, even though other logs on the system showed that the hardware was well within the 4 second window.

Before we fixed the first problem, during the blackout period when we were blocked displaying the message to the operator, timer pulses would be queued by the kernel. Periodically, the hardware would also queue up an I/O change pulse. Since these were all at the same priority, they were processed in the order they were generated, and because each timer pulse carried an implicit timestamp (each one meant "another 10ms went by"), everything worked. Once we unblocked, we'd (very quickly) process timer pulses, decrement the timers, and when we hit a hardware I/O change, we'd rearm the timer with the new 4 second value, and everything would be ok.

After we fixed the first problem, though, things were a little bit different. We'd arm the timer (let's say it's 11:00:00 and we arm it for 4 seconds, so it's now set to go off at 11:00:04). Then we'd go into the blackout period. Meanwhile, timer pulses would be building up, as would I/O change pulses. When we unblocked (say at 11:00:10), we processed the timer pulse, and looked at the time: 11:00:10 was way past when the timer was supposed to expire (11:00:04), so we declared a timeout. Then we rushed through the rest of the timer pulses (ignoring the I/O change pulses, because we were now in a different state), and finally declared a hardware fault (due to timeout).

The first solution that came to mind was, "We need to rewrite the entire system (about 60kLOC)." Of course, that was ruled out almost instantly, but it's funny how that's always the first solution that comes to mind :-)

And the Next Fix is...

So, once again being under time constraints ("there's never time to do it right, but there's always time to fix it"), we decided the simplest fix IN THIS CASE would be to change the priority of the hardware pulse.

In order to explain that, I need to step back just a little bit. Whenever an I/O point changes, an I/O monitoring process sends the control program a pulse, indicating the I/O point number and its new state (on or off). In the system, all such pulses were the same priority as the timing pulses. So this meant that really what happened during the deeply-blocking "blackout" period was that a bunch of timer pulses were generated (and queued) by the kernel, and, intermixed with that, a few I/O pulses were generated (and queued) by the I/O monitoring process.

So what that means is that when we unblocked and went to the MsgReceive rendezvous point, we got a bunch of timer pulses, which caused us to check the time and determine that the timeout period had elapsed. The I/O pulses that came after at least the first timer pulse were ignored, because by then we had already determined that the hardware had faulted and we were in a different state.

By doing a quick hack and boosting the priority of the I/O message pulses, we effectively bypassed detection of timeouts within the deeply-blocking "blackout" period. Not ideal, but it worked for the limited case. (Philosophically, we could get away with a hack in this case because the blackout period had its own error monitoring, and was a higher priority of error than the I/O timeout error, but it turns out that the I/O timeout error would be the first error detected and thus would latch. Another long story for another day perhaps.)

So, to do a walkthrough in the higher-priority case, even though timer pulses had been building up, the instant that an I/O pulse came in (at the higher priority) it would effectively jump to the head of the queue and be processed first. Thus, when we came out of the "blackout" period, the first pulse we'd see was the I/O pulse, which would cause us to reload our timers. Only then would we process the timer pulses, and look at time go by "in high definition".
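As a sketch of the hack, assuming the I/O monitor delivers its notification with MsgSendPulse and that the timer pulses run at the control program's own priority (the pulse code, priority numbers, and value encoding below are all illustrative):

    #include <sys/neutrino.h>

    #define IO_PULSE_CODE   (_PULSE_CODE_MINAVAIL + 1)   /* illustrative */
    #define IO_PULSE_PRIO   11   /* one notch above the (say) priority-10 timer pulses */

    /* In the I/O monitoring process: report an I/O point change to the
     * control program.  Because pulses are queued and delivered in priority
     * order, the boosted priority means MsgReceive() in the control program
     * hands us this pulse before any heartbeat pulses that piled up during
     * the "blackout" period, so the timers get re-armed first. */
    void report_io_change(int coid, int point, int state)
    {
        MsgSendPulse(coid, IO_PULSE_PRIO, IO_PULSE_CODE,
                     (point << 1) | (state & 1));   /* hypothetical encoding */
    }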

The Solution

Well, we've solved the problem for the block-then-arm implementation, but only partially kludged up a solution for the arm-then-block case.

As luck would have it, I remembered this problem during the Wednesday-night-beer-with-QNX-people outing, and Brian Stecher and I discussed the problem. After much beer (on my part) and thinking (on Brian's part) it turns out that both implementations presented above have merit, they just needed to be combined.

Therefore, let's define the hybrid approach.

For starters, when setting a timeout, the value needs to be in absolute time. That only makes sense, because when you are running and have control of the CPU, that's when you make the decisions that require access to the timeout time — you have no idea how many kernel pulses may be queued ahead of or behind you, so you need to take "now + delta" and convert it to an absolute time for the timeout.

Counting timer pulses as a way of keeping track was considered in the first case above, but I missed the part about combining it with absolute time. Basically, you need to know when each pulse was generated (not necessarily when it was received). Thus, the "pulse generation time", or PGT, is basically a function of the time when the timer was started, plus the number of pulses received times the rate at which they are generated.

Putting this into practical terms, we'd use 64-bit variables for all of our times, and store everything in nanoseconds (it's what the kernel does).

SYSPAGE_ENTRY(qtime)->nsec gives us the current time, and we can use that both to determine the starting time of the timer and to calculate the absolute expiration time.

So, to arm our timer, we simply multiply the millisecond value by 1000000 to convert to nanoseconds, add the current time, and store that into the timer's 64-bit "expiration" area.
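In other words, something like the following, where the helper names are mine:

    #include <stdint.h>
    #include <sys/syspage.h>

    /* Current time in nanoseconds, straight from the kernel's qtime entry. */
    static uint64_t now_ns(void)
    {
        return SYSPAGE_ENTRY(qtime)->nsec;
    }

    /* Arming: "ms" milliseconds from now, as an absolute 64-bit deadline. */
    static uint64_t expiry_from_ms(uint32_t ms)
    {
        return now_ns() + (uint64_t)ms * 1000000ULL;
    }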

The tricky value, however, is the PGT — it's:

NSI * PN * ((TP + NSI) / NSI) + ES

Where:

  NSI   NanoSecond Increment   the value SYSPAGE_ENTRY(qtime)->nsec_inc (in nanoseconds)
  PN    Pulse Number           running count of the pulses received
  TP    Timer Period           the number of nanoseconds that the process's base timer is set to
  ES    Epoch Start            the system time (in nanoseconds) when the timer was started

The complexity stems from the fact that we're trying to be as accurate as possible, with no time slippage which would accumulate (since we're counting the number of pulses and using that as our basis for when the pulse was generated, we could be accumulating an error). Therefore, the formula above will accurately reflect how the kernel itself decided when to issue the pulse.

Note also that if you use ClockPeriod you'll need to reset the pulse number (PN) to zero, and recompute the epoch start (ES) after draining all the pulses.
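Putting the formula and the ClockPeriod caveat together, the bookkeeping might look something like this sketch (the variable names and the 10ms base period are illustrative, not from the real project):

    #include <stdint.h>
    #include <sys/syspage.h>

    #define TIMER_PERIOD_NS  (10ULL * 1000000ULL)   /* TP: the 10 ms base timer */

    static uint64_t epoch_start;    /* ES: system time when the heartbeat was started */
    static uint64_t pulse_number;   /* PN: running count of heartbeat pulses received */

    /* Call when the heartbeat timer is (re)started, and again (after draining
     * any queued pulses) whenever ClockPeriod() changes the tick size. */
    static void reset_pulse_epoch(void)
    {
        pulse_number = 0;
        epoch_start  = SYSPAGE_ENTRY(qtime)->nsec;
    }

    /* PGT = NSI * PN * ((TP + NSI) / NSI) + ES */
    static uint64_t pulse_generation_time(void)
    {
        uint64_t nsi = SYSPAGE_ENTRY(qtime)->nsec_inc;   /* NSI */

        return nsi * pulse_number * ((TIMER_PERIOD_NS + nsi) / nsi) + epoch_start;
    }

    /* On each heartbeat pulse: bump the count, then compare timer deadlines
     * against pulse_generation_time() rather than against "now". */
    static void on_heartbeat_pulse(void)
    {
        pulse_number++;
        /* ... check the timer chain against pulse_generation_time() ... */
    }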

Why?

A question that comes up is, "Why didn't you just use QNX Neutrino's timers directly instead of writing your own timer system and getting into trouble?" Well, you see, I was billing by the hour and ... :-)

It's a good question. The answer is that this method is less complicated. Using Neutrino's timers directly brings two complications: each outstanding timeout needs its own kernel timer, and each of those timers needs its own struct sigevent (with a distinct value) so that you can tell which one fired.

So, under Neutrino, I'd either have to create all of my timers a priori, or I'd have to dynamically manage them (binding and unbinding various struct sigevents because of the need for different values).

With my method, a timer was already in place in the process (the watchdog tickling timer), so I felt it would be more efficient and simpler to effectively "subdivide" the time coming from this timer. My function simply added a timeout value and an id to a list of timers that were all counted down.