c++: std::condition_variable::wait_until holds the lock?
Recently, I was debugging some problem where it appeared that std::condition_variable::wait_until would hold the lock and never release it while waiting, in direct contradiction to its documentation:
Atomically releases lock, blocks the current executing thread, and adds it to the list of threads waiting on *this. The thread will be unblocked when notify_all() or notify_one() is executed, or when the absolute time point timeout_time is reached. It may also be unblocked spuriously. When unblocked, regardless of the reason, lock is reacquired and wait_until exits.
You can actually observe this with the following reproducer, tested on Debian Linux “bookworm” with g++ 12.3 and libstdc++ 3.4.32:
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>
constexpr auto forever = std::chrono::steady_clock::time_point::max();
int main() {
  for (int i = 0; i < 1000; i++) {
    std::cerr << "============== " << std::endl;
    std::cerr << "Iteration " << i << std::endl;
    std::cerr << "============== " << std::endl;
    std::mutex m;
    std::condition_variable cv;
    int counter = 0;
    std::thread t([&]() {
      std::cerr << "Acquiring lock (BG)" << std::endl;
      std::unique_lock<std::mutex> lock(m);
      if (counter > 0) {
        std::cerr << "Already notified" << std::endl;
        return;
      }
      std::cerr << "Waiting for notification" << std::endl;
      cv.wait_until(lock, forever, [&] { return counter > 0; });
      std::cerr << "Got notification" << std::endl;
    });
    {
      std::cerr << "Acquiring lock" << std::endl;
      std::unique_lock<std::mutex> lock(m);
      std::cerr << "Incrementing counter" << std::endl;
      counter++;
      std::cerr << "Notifying CV" << std::endl;
      cv.notify_all();
      std::cerr << "Waiting for thread" << std::endl;
    }
    t.join();
  }
  return EXIT_SUCCESS;
}
It will print “Waiting for notification” and get stuck. (Not every iteration hits this, since the main thread may grab the lock and bump the counter before the background thread even starts waiting, hence the loop; sooner or later one iteration loses that race.) This shouldn’t happen, since wait_until should release the lock, and the block at the bottom should acquire the lock, increment the counter, and notify the condition variable. So what was happening?
The reason is as follows:
- wait_until converts our time_point, which uses the steady clock, to its own local clock. Here’s the source from my copy of the condition_variable header:
template<typename _Clock, typename _Duration>
  cv_status
  wait_until(unique_lock<mutex>& __lock,
             const chrono::time_point<_Clock, _Duration>& __atime)
  {
#if __cplusplus > 201703L
    static_assert(chrono::is_clock_v<_Clock>);
#endif
    const typename _Clock::time_point __c_entry = _Clock::now();
    const __clock_t::time_point __s_entry = __clock_t::now();
    const auto __delta = __atime - __c_entry;
    const auto __s_atime = __s_entry + __delta;
    if (__wait_until_impl(__lock, __s_atime) == cv_status::no_timeout)
      return cv_status::no_timeout;
    // We got a timeout when measured against __clock_t but
    // we need to check against the caller-supplied clock
    // to tell whether we should return a timeout.
    if (_Clock::now() < __atime)
      return cv_status::no_timeout;
    return cv_status::timeout;
  }
If we look at this in a debugger, some variables have been optimized out, but we can observe that there was an overflow/underflow:
(gdb) f 2
#2  0x00007ffff50b0aa4 in std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7fffffffbe18, __lock=..., __atime=...)
    at /home/lidavidm/miniforge3/envs/conda-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/condition_variable:141
141	    if (__wait_until_impl(__lock, __s_atime) == cv_status::no_timeout)
(gdb) p __atime
$11 = (const std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1, 1000000000> > > &) @0x7fff92fc57b8: {__d = {__r = 9223372036854775807}}
(gdb) p __c_entry
$12 = {__d = {__r = 967774403226439}}
(gdb) p __s_entry
$13 = <optimized out>
(gdb) p __delta
$14 = <optimized out>
(gdb) p __s_atime
$15 = {__d = {__r = -7515780632097620207}}
We can see that the current time (__c_entry) is subtracted from the deadline we provide (__atime) to get a delta, which is then added to the current time of the clock the condition variable actually uses. And since our deadline is steady_clock::time_point::max(), that addition overflowed and gave us a bogus negative value. (There’s a standalone sketch of this arithmetic after the list.)
- wait_until passes that bogus value to a helper:
template<typename _Dur>
  cv_status
  __wait_until_impl(unique_lock<mutex>& __lock,
                    const chrono::time_point<system_clock, _Dur>& __atime)
  {
    auto __s = chrono::time_point_cast<chrono::seconds>(__atime);
    auto __ns = chrono::duration_cast<chrono::nanoseconds>(__atime - __s);

    __gthread_time_t __ts =
      {
        static_cast<std::time_t>(__s.time_since_epoch().count()),
        static_cast<long>(__ns.count())
      };

    __gthread_cond_timedwait(&_M_cond, __lock.mutex()->native_handle(),
                             &__ts);

    return (system_clock::now() < __atime
            ? cv_status::no_timeout : cv_status::timeout);
  }
};
It constructs a timespec (a __gthread_time_t). If we look at that in the debugger, it also has crazy bogus values as a result:
(gdb) f 1
#1  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=this@entry=0x7fffffffbe18, __lock=..., __atime=...)
    at /home/lidavidm/miniforge3/envs/conda-dev/x86_64-conda-linux-gnu/include/c++/10.3.0/condition_variable:232
232	      __gthread_cond_timedwait(&_M_cond, __lock.mutex()->native_handle(),
(gdb) p __atime
$16 = (const std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1, 1000000000> > > &) @0x7fff92fc5778: {__d = {__r = -7515780632097620207}}
(gdb) p __s
$17 = <optimized out>
(gdb) p __ns
$18 = <optimized out>
(gdb) p __ts
$19 = {tv_sec = -7515780632, tv_nsec = -97620207}
- That timespec gets passed to pthreads. I wasn’t immediately sure where I should go look for the source of my particular pthreads, but if we look at glibc, the implementation of pthread_cond_timedwait generally starts with something like this:
int
__pthread_cond_timedwait (cond, mutex, abstime)
     pthread_cond_t *cond;
     pthread_mutex_t *mutex;
     const struct timespec *abstime;
{
  struct _pthread_cleanup_buffer buffer;
  struct _condvar_cleanup_buffer cbuffer;
  int result = 0;

  /* Catch invalid parameters.  */
  if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
    return EINVAL;

  /* snip ... */
That is…it sees the negative nanosecond value in our bogus timespec, and immediately returns EINVAL, without doing anything like, oh, I don’t know, releasing the lock. (A small standalone demonstration of this follows the list.)
- So pthread_cond_timedwait instantly returns without doing anything, and if we go to the wait_until implementation that takes a predicate, we see that it just spins forever, waiting for the predicate to evaluate true or the deadline to pass…neither of which can happen, since we’re holding the lock the entire time! (There’s a sketch of that loop after the list, too.)
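Here’s the promised standalone sketch of the arithmetic from the first step. This is not the library code, just the same calculation redone by hand; it assumes libstdc++’s representation (64-bit signed nanosecond counts for both clocks), so the exact numbers will differ elsewhere:
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;

  // The deadline we passed in: steady_clock::time_point::max(), i.e.
  // INT64_MAX ticks (nanoseconds on libstdc++).
  const auto atime   = steady_clock::time_point::max();
  const auto c_entry = steady_clock::now();  // plays the role of _Clock::now()
  const auto s_entry = system_clock::now();  // plays the role of __clock_t::now()

  // __delta = __atime - __c_entry: still enormous, since c_entry
  // (roughly "time since boot") is tiny compared to INT64_MAX.
  const auto delta = atime - c_entry;

  // __s_atime = __s_entry + __delta: the system clock is ~54 years past its
  // epoch, so the sum exceeds INT64_MAX and wraps to a negative count
  // (formally, signed overflow).
  const auto s_atime = s_entry.time_since_epoch() + delta;

  std::cout << "delta   = " << delta.count() << " ticks\n"
            << "s_atime = " << s_atime.count() << " ticks\n";
  return 0;
}
With 64-bit nanosecond counts, s_atime comes out negative, matching the __s_atime the debugger showed.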
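And here is the small demonstration of the pthreads step in isolation: hand pthread_cond_timedwait a timespec like the one above and it fails up front. The specific numbers are just the ones from the debugger session, and the immediate EINVAL relies on glibc’s parameter check quoted earlier (and a 64-bit time_t):
#include <cerrno>
#include <cstdio>
#include <ctime>
#include <pthread.h>

int main() {
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

  pthread_mutex_lock(&m);

  // The bogus deadline from above: negative tv_sec and tv_nsec.
  timespec ts{};
  ts.tv_sec = -7515780632;
  ts.tv_nsec = -97620207;

  // glibc rejects the negative tv_nsec before doing anything else, so this
  // returns EINVAL immediately: it never sleeps and never releases the mutex.
  int rc = pthread_cond_timedwait(&cv, &m, &ts);
  std::printf("rc = %d, EINVAL = %d\n", rc, EINVAL);

  pthread_mutex_unlock(&m);
  return 0;
}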
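As for why the last step spins instead of reporting a timeout, the predicate-taking wait_until is essentially the following loop. This is my paraphrase of the same libstdc++ header (names changed, written as a free function so it stands alone), not the verbatim library code:
#include <chrono>
#include <condition_variable>
#include <mutex>

// Roughly what the predicate overload of wait_until boils down to.
template <typename Clock, typename Duration, typename Predicate>
bool wait_until_sketch(std::condition_variable& cv,
                       std::unique_lock<std::mutex>& lock,
                       const std::chrono::time_point<Clock, Duration>& atime,
                       Predicate pred) {
  while (!pred()) {
    // With our bogus converted deadline, this returns almost instantly: the
    // pthreads call fails with EINVAL, but the final check against the
    // caller's clock (steady_clock::now() < time_point::max()) still says
    // "no timeout", so the timeout branch below is never taken.
    if (cv.wait_until(lock, atime) == std::cv_status::timeout)
      return pred();
  }
  return true;
}
So the loop re-tests the predicate, calls the broken timed wait again, and repeats: a busy loop with the mutex held, which is why the main thread can never get in to increment the counter.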
And there we have it: deadlock.
The solution is to either use a slightly less forever time for forever, or to loop yourself and wait for short periods of time while checking for the predicate or deadline to pass.
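Here is a sketch of both options; the helper names and the one-year horizon are mine, purely for illustration:
#include <chrono>
#include <condition_variable>
#include <mutex>

// A "less forever" forever: far enough out to never matter in practice, but
// small enough that converting it between clocks cannot overflow.
inline std::chrono::steady_clock::time_point almost_forever() {
  return std::chrono::steady_clock::now() + std::chrono::hours(24 * 365);
}

// Or: wait in short slices and re-check the predicate and your own deadline.
template <typename Predicate>
bool wait_in_slices(std::condition_variable& cv,
                    std::unique_lock<std::mutex>& lock,
                    std::chrono::steady_clock::time_point deadline,
                    Predicate pred) {
  while (!pred()) {
    if (std::chrono::steady_clock::now() >= deadline) return pred();
    // Wake up periodically instead of trusting one long timed wait.
    cv.wait_for(lock, std::chrono::milliseconds(100));
  }
  return true;
}
The polling version trades a little wakeup latency for never handing the library a deadline it can mangle.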