Timers seem to get clobbered in Linux multi-threaded application #164

Open
TheRealZago opened this issue Nov 11, 2024 · 4 comments

@TheRealZago

I've taken inspiration from the Linux port in issue #140 and quickly turned it into a delta-timer, using OS-level timers with nanosecond(!) precision (timer_create and friends), which got me back to <2% timer variation.

The result looked pretty good, until I realized that, after ~2 hours of free-running, the PDOs were no longer triggering. The observed behavior varies between (1) no TPDO ever transmitting again until the application is restarted, and (2) one of the 2 active TPDOs stopping while the other keeps going normally, in which case a quick jump between Op-PreOp-Op might restore it temporarily.
In short, the soft-timers seem to get "corrupted" and the HAL timer never gets rearmed by the stack.
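
For readers unfamiliar with the term, a delta timer list can be sketched roughly as below. This is only an illustration of the general technique, not the stack's actual code: the names `soft_timer`, `delta_insert`, `delta_advance` and `delta_pop_expired` are hypothetical. Each node stores only the ticks remaining after its predecessor expires, so the hardware timer has to be armed for the head alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type for illustration; not the stack's own structure. */
typedef struct soft_timer {
    uint32_t delta;            /* ticks to wait after the previous node expires */
    int id;                    /* caller-chosen identifier */
    struct soft_timer *next;
} soft_timer;

/* Insert `t` so that it expires `ticks` from now. */
static void delta_insert(soft_timer **head, soft_timer *t, uint32_t ticks) {
    assert(head != NULL && t != NULL);
    soft_timer **pp = head;
    /* Walk past every node that expires at or before the new one. */
    while (*pp != NULL && (*pp)->delta <= ticks) {
        ticks -= (*pp)->delta;
        pp = &(*pp)->next;
    }
    t->delta = ticks;
    t->next = *pp;
    if (*pp != NULL) {
        (*pp)->delta -= ticks; /* successor now waits relative to `t` */
    }
    *pp = t;
}

/* Account for `elapsed` ticks having passed since the last call. */
static void delta_advance(soft_timer **head, uint32_t elapsed) {
    assert(head != NULL);
    for (soft_timer *t = *head; t != NULL && elapsed > 0; t = t->next) {
        uint32_t step = (t->delta < elapsed) ? t->delta : elapsed;
        t->delta -= step;
        elapsed -= step;
    }
}

/* Unlink and return the head if it has expired, otherwise NULL. */
static soft_timer *delta_pop_expired(soft_timer **head) {
    assert(head != NULL);
    soft_timer *t = *head;
    if (t == NULL || t->delta != 0) {
        return NULL;
    }
    *head = t->next;
    t->next = NULL;
    return t;
}
```

None of these functions take a lock. If one thread inserts while another advances or pops, the links and deltas can end up inconsistent, so in a multi-threaded program every call has to be made while holding the same mutex.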

I can't provide the real application due to NDA, but I've reproduced the problem in this reduced project, which behaves pretty much the same: https://github.com/TheRealZago/canopen-timers. I've left some debugging notes I've acquired over the last 3 weeks of analyzing this problem, but it's extremely annoying to reproduce and debug.

If anyone has deployed this stack in a Linux environment, did you ever encounter this issue?
Otherwise, what interaction should I be tracking more in detail in the stack for figuring out why the timers seem to get corrupted?

Before getting lapidated, I'm not expecting an "I HAZ CODES" solution, but I'd be very happy to get input from "experts" who've been working with this project for longer than I have... 😄

@jmcmullan

Firstly, your 'timer update' is always returning 1. I've found in the past that the stack can get really upset with that simplification.

Here's my (monotonic clock) based timer that I've been using successfully for the past few years. It only needs a mutex and a monotonic clock with (at least) ms resolution. Ignore the off-brand function signatures.

typedef uint64_t msclock_t;
msclock_t msclock_now(void); // Return monotonic ms clock value.

static msclock_t timer_expires;
static os_mutex_t timer_mutex;

/// Initialize the timer driver.
static void canopen_timer_init(CO_IF *cif, uint32_t ticks) {
    DPRINTF("init");
    timer_expires = 0;
}

/// Set the next value for the timer to use.
static void canopen_timer_reload(CO_IF *cif, uint32_t ticks) {
    msclock_t now = msclock_now();
    // One tick is treated as one millisecond here.
    timer_expires = now + (msclock_t)ticks;
    DPRINTF("reload(%ums), now = %llums, expires = %llums", (unsigned)ticks, (unsigned long long)now,
            (unsigned long long)timer_expires);
}

/// Return how long the timer is expected to run.
static uint32_t canopen_timer_delay(CO_IF *cif) {
    msclock_t now = msclock_now();

    if (now >= timer_expires) {
        DPRINTF("delay, 0 (expired)");
        return 0;
    }

    // The difference is 64-bit; cast it to match both %u and the return type.
    DPRINTF("delay, %ums", (unsigned)(timer_expires - now));
    return (uint32_t)(timer_expires - now);
}

/// Update the timer, return the expired edge detection.
static uint8_t canopen_timer_update(CO_IF *cif) {
    if (timer_expires == 0) {
        DPRINTF("update, 0 (not running)");
        return 0;
    }

    msclock_t now = msclock_now();
    if (now >= timer_expires) {
        timer_expires = 0;
        DPRINTF("update, 1 (expired)");
        return 1;
    }

    DPRINTF("update, 0 (running)");
    return 0;
}

/// Start the timer.
static void canopen_timer_start(CO_IF *cif) {
    /// Nothing to do here.
}

/// Stop the timer.
static void canopen_timer_stop(CO_IF *cif) {
    timer_expires = 0;
}

/// Timer lock
static void canopen_timer_lock(CO_IF *cif) {
    os_mutex_lock(&timer_mutex);
}

/// Timer unlock
static void canopen_timer_unlock(CO_IF *cif) {
    os_mutex_unlock(&timer_mutex);
}
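
The snippet above only declares `msclock_now()`. On Linux, one possible way to provide it is via `clock_gettime(CLOCK_MONOTONIC, ...)`; the following is a minimal sketch under that assumption, not necessarily what the original author uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t msclock_t;

/* Return a monotonic millisecond count. CLOCK_MONOTONIC is not affected by
 * wall-clock adjustments, so intervals computed from it never run backwards. */
msclock_t msclock_now(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return (msclock_t)ts.tv_sec * 1000u + (msclock_t)(ts.tv_nsec / 1000000L);
}
```

Because the driver uses `timer_expires == 0` as its "not running" marker, it relies on the clock value plus the reload never being exactly 0, which holds for any realistic monotonic reading.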

@TheRealZago
Author

I'm assuming your timer is configured as a cyclic timer, with the ISR being triggered every X milliseconds. I've had issues keeping precise output timing with that setup on our Linux system, as there's a bazillion more threads and programs running in parallel and causing timers to skew more or less severely over time.

That said, the timer update() function is called only in the COTmrService() ISR, so with my timer raising the interrupt only on expiration, I think it makes sense that it always reports as elapsed, and the docs seem to agree. I can totally see it being a problem if the ISR is never called again, which seems to be my case.

Regardless, thank you very much! Great input. I'll check again what in the main loop seems to stop creating and refreshing the internal delta timers. I'll definitely make sure to avoid headaches if I ever need to implement this stack on a real MCU.

@jmcmullan

Under Linux, just use std::chrono::steady_clock as your time source - no need for custom timers.

@TheRealZago
Author

I had a cyclic clock with chrono::steady_clock time points in a previous iteration, very similar to your snippet actually, but it yielded sub-optimal results: relatively high CPU usage for the periodic tick thread and very imprecise timing (e.g., the expected 150ms TPDO rate more often landed anywhere in the 120-180ms range).

The timers I'm using are actually provided by time.h and they seem to wake up with much higher precision (as per above, 150-154ms TX rate). I'm sure the kernel can do high-precision timing way better than I can achieve in userland.
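
As a point of comparison for the periodic tick thread discussed above, here is a rough sketch of a loop that lets the kernel do the waiting by sleeping until absolute deadlines with `clock_nanosleep` and `TIMER_ABSTIME`. Whether this would have fixed the timing seen in the earlier iteration is not known; the helper names `timespec_add_ns`, `run_periodic` and `count_tick` are made up for this example. Sleeping to absolute deadlines keeps a late wake-up in one period from shifting all later periods, although it does not remove per-wake-up jitter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Advance an absolute timespec by `ns` nanoseconds, keeping tv_nsec normalized. */
static void timespec_add_ns(struct timespec *t, int64_t ns) {
    assert(t != NULL && ns >= 0);
    t->tv_sec += (time_t)(ns / 1000000000LL);
    t->tv_nsec += (long)(ns % 1000000000LL);
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Call `tick(arg)` `count` times, once per `period_ns`.
 * Returns 0 on success or an errno-style error number. */
static int run_periodic(int64_t period_ns, unsigned count,
                        void (*tick)(void *), void *arg) {
    struct timespec next;
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0) {
        return errno;
    }
    for (unsigned i = 0; i < count; i++) {
        /* The deadline is derived from the previous deadline, not from "now". */
        timespec_add_ns(&next, period_ns);
        int rc;
        do {
            /* clock_nanosleep returns the error number directly. */
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0) {
            return rc;
        }
        tick(arg);
    }
    return 0;
}

/* Example callback: counts how often it was invoked. */
static void count_tick(void *arg) {
    ++*(unsigned *)arg;
}
```

A call such as `run_periodic(150 * 1000000LL, 10, count_tick, &n)` would invoke the callback ten times at 150ms spacing; in a real application the callback would be whatever services the stack's timers.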
