Linux threads
User space locking
Introduction
- What is thread?
- A lightweight execution context managed by scheduler.
- Why is it lightweight?
- It shares resources with parent process.
- It has faster synchronization primitives.
- Why is it faster?
- Because it utilizes user space locking
- What is user space locking?
- We are about to find out !
What we will cover
- Short thread/process intro
- Compare System V IPC vs NPTL pthreads
- User space locking
- Futex system call
- Our own simple mutex implementation
- POSIX pthread mutex implementation
Process
- Heavyweight unit of kernel scheduling
- Isolated from other processes
- Own address space
- Own files descriptors, sockets
- Can communicate via InteprocessCommunication (IPC)
- On Linux we have System V IPC
- Shared memory, semaphores, queues, etc
- Context switch is expensive
Thread
- Lightweight unit of kernel scheduling
- Share almost all resources within process expect for
- Own Stack
- Own registers
- Own per thread storage (thread-local storage)
- Can communicate via user space library
- POSIX pthread - Library API specification for different OS
- Native Linux Thread Library (NPTL) is Linux Pthread implementation
- Context switch is cheap
Synchronization primitive
- An resource that allows processes/threads to communicate
- Processes - System V IPC
- Message queue, semaphore, shared memory
- Threads - POSIX Pthread
- mutex, condition variable, barrier, spinlock, semaphore
- We will focus on simplest primitive - mutex
Linux threads
- Just lightweight version of process in Linux
- Uses same system call as fork (clone)
- Additional flags to clone:
- CLONE_VM, CLONE_FS, CLONE_FILES, CLONE_SIGNAL, ...
- Same PID, different TID (LWP)
- Can use process synch primitives (System V IPC)
- semaphores, message queues, ...
[glibc/nptl/sysdeps/pthread/createthread.c]
static int
create_thread (struct pthread *pd, const struct pthread_attr *attr,
STACK_VARIABLES_PARMS)
{
int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
| CLONE_SETTLS | CLONE_PARENT_SETTID
| CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
| 0);
// [...]
/* Actually create the thread. */
int res = do_clone (pd, attr, clone_flags, start_thread,
STACK_VARIABLES_ARGS, stopped);
// [...]
return res;
}
Why do we need thread specific synch primitives?
Let's compare SystemV vs POSIX
semaphore
- System V IPC API
semget, setctl, semtimedop
- POSIX pthread API
sem_init, sem_destroy, sem_wait, sem_post
- Wrapper class
my_sem
with similar interface
post()
- increment (unlock) the semaphore
wait()
- decrement (lock) the semaphore
Same code run on two implementations
#include "sem_posix.hpp" // or sem_systemv.hpp
int main(int argc, char *argv[]) {
my_sem m1(10000);
for (int j=0; j<atoi(argv[1]); j++) {
for (int i=0; i<10000; i++) {
m1.wait();
}
for (int i=0; i<10000; i++) {
m1.post();
}
}
}
Let's run the programs several times
for i in {10..1000..25}; do
echo -n "$i " >> sem11.dat;
/usr/bin/time -f "%U %S" ./sem11 $i 2>>sem11.dat;
done
Results for both semaphore types
Where is the difference?
Let's consider a shorter example
int main(int argc, char *argv[]) {
my_sem m1(4);
for (int i=0; i<4; i++) {
m1.wait();
}
m1.wait(1);
}
Strace for System V
[maciek@pc futex]$ strace ./sem1
execve("./sem1", ["./sem1"], [/* 51 vars */]) = 0
brk(0) = 0x85c000
[...]
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
stat("/tmp/file", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
semget(0x610101ec, 1, IPC_CREAT|0666) = 393218
semctl(393218, 0, SETVAL, 0x4) = 0
semtimedop(393218, {{0, -1, 0}}, 1, NULL) = 0
semtimedop(393218, {{0, -1, 0}}, 1, NULL) = 0
semtimedop(393218, {{0, -1, 0}}, 1, NULL) = 0
semtimedop(393218, {{0, -1, 0}}, 1, NULL) = 0
semtimedop(393218, {{0, -1, 0}}, 1, {1, 0}) = -1 EAGAIN
semctl(393218, 0, IPC_RMID, 0xffffffffffffff60) = 0
exit_group(0) = ?
+++ exited with 0 +++
Strace for POSIX
[maciek@pc futex]$ strace ./sem2
execve("./sem2", ["./sem2"], [/* 51 vars */]) = 0
brk(0) = 0x10b8000
[...]
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
futex(0x7fffcbf93650, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0,
{1388604825, 683126976}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
exit_group(0) = ?
+++ exited with 0 +++
Conclusions
- Every operation on System V IPC semaphore requires syscall
- POSIX pthreads has significantly less system calls
- No syscall on semaphore creation
- No syscall when semaphore is still free
- On contention we get futex syscall
- How pthread library can operate on synchronization primitive without kernel involvement?
Kernel vs user mode
- Kernel mode (supervisor mode)
- Special processor execution mode
- Allows execution of privileged instructions
- Full access to hardware, memory, etc
- Only kernel runs in this mode
- User mode
- User programs run in this mode
- Access to safe instructions
- Sandbox (own address space, descriptors, etc)
- Mode switch
- Switching between kernel and user mode
- On Linux we use system calls
- Process control, file management, device management, communication, privileges, etc
- Usually via special interruption on CPU
- Store the running process state
- Load kernel address space
- Execute syscall related logic
- Restore the user process
- Very expensive
- Hundreds of instructions
- Flashes the CPU pipeline / cache
User space locking
- Kernel space primitive
- Primitive state kept in kernel space
- Handle used in user space (int with resource id)
- Every operation on primitive requires mode switch
- User space primitive
- Primitive state kept in user space
- Pure user space locking requires no kernel involvement
- Utilize biggest advantage of threads over processes
- common address space (shared memory)
Kernel space vs user space primitive
Do we have pure user space locks in POSIX threads library?
POSIX spin lock
- Spin lock - lock that performs busy waiting
- does not release CPU while waiting
- equivalent of
while (locked) {};
- does nor require mode switch
- can be executed entirely in user space
Lock function
[nptl/sysdeps/x86_64/pthread_spin_lock.S]
pthread_spin_lock:
1: LOCK
decl 0(%rdi)
jne 2f
xor %eax, %eax
ret
.align 16
2: rep
nop
cmpl $0, 0(%rdi)
jg 1b
jmp 2b
* Note the LOCK prefix
Lock function
[nptl/sysdeps/x86_64/pthread_spin_lock.S]
pthread_spin_lock:
1: LOCK
decl 0(%rdi) ; lock--;
jne 2f ; if (lock != 0) jump 2;
xor %eax, %eax ; else return;
ret ;
.align 16
2: rep ; {
nop ; nop;
cmpl $0, 0(%rdi) ; if (lock > 0)
jg 1b ; jump 1;
jmp 2b ; } while (1);
* Note the LOCK prefix
Unlock function
[nptl/sysdeps/x86_64/pthread_spin_lock.S]
pthread_spin_unlock:
movl $1, (%rdi)
xorl %eax, %eax
retq
* Note the missing LOCK prefix
Spin lock summary
- The fastest lock you can get
- Pure user space synch primitive
- No mode switch required
- Performs busy waiting
- Can be used when
- lock is hold for just a few instructions
- thread can not be interrupted
- we have low contention
- Spin lock is great but ...
- for larger contention
- longer wait periods
- more threads
we need to put the waiting threads to sleep
- Kernel side queue for waiting threads is needed
- And that is what the futex system call provides
Futex primitive
- Fast User Space Mutex
- Futex is a building block for pthread Linux primitives
- semaphores, mutexes, conditional variables
- Futex primitive can be decomposed to
- Lock state kept in 32-bit integer in user space
- The integer address serves as a handle for kernel
- The handle is associated with waiting tasks queue in kernel space
- All operation of Futex are performed via futex syscall
- futex system call
#include <linux/futex.h>
#include <sys/time.h>
int futex(int *uaddr, int op, int val, const struct timespec *timeout,
int *uaddr2, int val3);
- Required Argument
uaddr
- address to userspace 32-bit integer
op
- kind of operation performed on futex
- Remaining arguments function changes dependent of the operation
FUTEX_WAIT operation
- Suspends the running thread until a wake operation in invoked on the same futex
- Parameters:
uaddr
- address of futex integer
op
- equals FUTEX_WAIT
val
- to current value of futex (should equal *uaddr
)
timeout
- timeout in seconds or NULL
for infinity
uaddr2
- ignored
val3
- ignored
- Returned value:
0
- thread was resumed correctly
EINTR
- wait operation is interrupted
ETIMEDOUT
- timeout before wake operation called
-
EWOULDBLOCK
- val
argument is different then value pointed by uaddr
val
is value seen by the thread when entering kernel mode
*uaddr
is checked once in kernel
- ... when those two value differ we had a race condition
- FUTEX_WAIT race condition
FUTEX_WAKE operation
- Wakes requested number of threads that are waiting on the futex
- Parameters:
uaddr
- address of futex integer
op
- equals FUTEX_WAKE
val
- the number of threads to wake
timeout
- ignored
uaddr2
- ignored
val3
- ignored
- Return value - actual number of woken threads
Other futex operations
- FUTEX_FD - associates futex with a file descriptor used later in event multiplexing (poll or select) (deprecated)
- FUTEX_REQUEUE - old version of FUTEX_CMP_REQUEUE that can lead to dead lock (not recommended)
- FUTEX_CMP_REQUEUE - created to optimize implementation of pthread condition variables. It transfers (requeue) the waiting threads from condition variable internal futex (uaddr) into mutex internal futex (uaddr2) when conditional variable is signalled.
- FUTEX_WAKE_OP - created to optimize implementation of pthread condition variables. Unlocks two internal futex locks of conditional variable to avoid race between thread signalling the variable and the one that is being woken.
- FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET - created to optimize readers-writer locks. Bitset that is stored together with wait queue and compared when choosing threads that will be woken. Currently used only because unlike other operations it accepts absolute timeout.
- FUTEX_*_PI - same as above operations but with _PI postfix. Used for priority inheritance locks manipulation
Futex usage example
Example 1 - for simplicity no assembly and no atomics
- Suppose that we have a task divided into N steps
- Each step can be started immediately
- Each step needs the results from previous step to finish its computations
- Let's build a primitive that allows to serialize the work
- The
step
class will have:
wait
function - waits until N-th step is completed
signal
function - signal completion of N-th step
val
member - current step (and also futex value)
#include <thread>
#include <chrono>
#include <vector>
#include <cstdlib>
#include <limits.h>
#include <linux/futex.h>
#include <unistd.h>
#include <sys/syscall.h>
long futex_wait(int* uaddr, int val1) {
return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val1, NULL, NULL, 0);
}
long futex_wake(int* uaddr, int val1) {
return syscall(SYS_futex, uaddr, FUTEX_WAKE_PRIVATE, val1, NULL, NULL, 0);
}
long rand_ms(int max) {
return std::rand() % max;
}
class step {
public:
step() : val(0) {
}
void signal(int new_val) {
val = new_val;
futex_wake(&val, INT_MAX);
}
void wait(int till) {
while (1) {
int tmp = val;
if (tmp >= till) break;
futex_wait(&val, tmp);
}
}
private:
int val;
};
* Do we need to use atomics here? Why tmp variable in wait?
class step {
public:
step() : val(0) {
}
void signal(int new_val) {
val = new_val;
futex_wake(&val, INT_MAX);
}
void wait(int till) {
while (val < till) {
futex_wait(&val, val);
}
}
private:
int val;
};
int main(int argc, char* argv[]) {
std::srand(std::time(0));
std::vector<std::thread> threads;
step st;
for (int i=0; i<10; i++) {
using namespace std::chrono;
threads.push_back(std::thread([&st, i] {
std::this_thread::sleep_for<>(milliseconds(rand_ms(2000)));
std::printf("[%d] WAIT\n", i);
st.wait(i);
std::this_thread::sleep_for<>(milliseconds(rand_ms(500)));
std::printf("[%d] SIG\n", i);
st.signal(i+1);
}));
}
for (auto& th : threads) {
th.join();
}
}
Example console output
[maciek@pc futex_step]$ ./test
[8] WAIT
[4] WAIT
[0] WAIT
[0] SIG
[7] WAIT
[1] WAIT
[1] SIG
[6] WAIT
[9] WAIT
[3] WAIT
[2] WAIT
[2] SIG
[5] WAIT
[3] SIG
[4] SIG
[5] SIG
[6] SIG
[7] SIG
[8] SIG
[9] SIG
Futex mutex example
Example 2 - some atomics, some assembly, mostly c
- Simplified mutex implementation from NPTL
- Not used by actual
pthread_mutex_t
- Assuming x86_64 architecture in this example
Atomic operations used in example
Atomic decrement
[glibc/sysdeps/x86_64/bits/atomic.h]
#define __arch_decrement_body(lock, mem) \
do { \
if (sizeof (*mem) == 1) \
// [...]
else if (sizeof (*mem) == 2) \
// [...]
else if (sizeof (*mem) == 4) \
__asm __volatile (lock "decl %0" \
: "=m" (*mem) \
: "m" (*mem), \
"i" (offsetof (tcbhead_t, multiple_threads))); \
else \
// [...]
} while (0)
#define atomic_decrement(mem) __arch_decrement_body (LOCK_PREFIX, mem)
Atomic bit test and set
[glibc/sysdeps/x86_64/bits/atomic.h]
#define atomic_bit_test_set(mem, bit) \
({ unsigned char __result; \
if (sizeof (*mem) == 1) \
// [...]
else if (sizeof (*mem) == 2) \
// [...]
else if (sizeof (*mem) == 4) \
__asm __volatile (LOCK_PREFIX "btsl %3, %1; setc %0" \
: "=q" (__result), "=m" (*mem) \
: "m" (*mem), "ir" (bit)); \
else \
// [...]
__result; })
Atomic add and test if zero
[glibc/sysdeps/x86_64/bits/atomic.h]
#define atomic_add_zero(mem, value) \
({ unsigned char __result; \
if (sizeof (*mem) == 1) \
// [...]
else if (sizeof (*mem) == 2) \
// [...]
else if (sizeof (*mem) == 4) \
__asm __volatile (LOCK_PREFIX "addl %2, %0; setz %1" \
: "=m" (*mem), "=qm" (__result) \
: "ir" (value), "m" (*mem)); \
else \
// [...]
__result; })
Futex mutex
- Mutex state stored in 32-bit integer
- Bit 31 indicates if mutex is locked
- Bits 0-30 store waiting threads count
- Fast path requires just single atomic operation
- no
futex
syscall
- true for both lock and unlock functions
- On contention we use
futex
syscall
Mutext lock
[glibc/nptl/lowlevellock.h]
static inline void __generic_mutex_lock (int *mutex) {
unsigned int v;
/* Bit 31 was clear, we got the mutex. (this is the fastpath). */
if (atomic_bit_test_set (mutex, 31) == 0) return;
atomic_increment (mutex);
while (1) {
if (atomic_bit_test_set (mutex, 31) == 0) {
atomic_decrement (mutex);
return;
}
/* We have to wait now. First make sure the futex value we are
monitoring is truly negative (i.e. locked). */
v = *mutex;
if (v >= 0) continue;
lll_futex_wait (mutex, v, LLL_SHARED);
}
}
Mutext unlock
[glibc/nptl/lowlevellock.h]
static inline void
__generic_mutex_unlock (int *mutex)
{
/* Adding 0x80000000 to the counter results in 0 if and only if
there are not other interested threads - we can return (this is
the fastpath). */
if (atomic_add_zero (mutex, 0x80000000))
return;
/* There are other threads waiting for this mutex, wake one of them
up. */
lll_futex_wake (mutex, 1, LLL_SHARED);
}
Example execution 1
|
THREAD1 |
Futex value (hex) |
Syscall |
1 |
|
0x00000000 |
|
2 |
lock |
0x80000000 |
|
5 |
unlock |
0x00000000 |
|
Example execution 2
|
THREAD1 |
THREAD2 |
THREAD3 |
Futex value (hex) |
Syscall |
1 |
|
|
|
0x00000000 |
|
2 |
lock |
|
|
0x80000000 |
|
3 |
|
lock (1) |
|
0x80000001 |
futex_wait |
4 |
|
|
lock (1) |
0x80000002 |
futex_wait |
5 |
unlock |
|
|
0x00000002 |
futex_wake |
6 |
|
lock (2) |
|
0x80000001 |
|
7 |
|
unlock |
|
0x00000001 |
futex_wake |
8 |
|
|
lock (2) |
0x80000000 |
|
9 |
|
|
unlock |
0x00000000 |
|
Pthread mutex
Example 3 - lot of atomics, lot of assembly
- Actual mutex implementation from NPTL
- Assuming x86_64 architecture in this example
- Low level lock - primitive used in lock/unlock functions
- counter not stored in futex variable
- instead only with three states
- LLL_LOCK_INITIALIZER (0)
the lock is not taken
- LLL_LOCK_INITIALIZER_LOCKED (1)
the lock is taken
- LLL_LOCK_INITIALIZER_WAITERS (2)
the lock is taken with possible waiters
x86_64 recap
- x86_64 kernel interface
- system call is done via the
syscall
instruction
- arguments are passed via %rdi, %rsi, %rdx, %r10, %r8, %r9
- the syscall number passed in %rax register
- return value stored in %rax register
- x86_64
cmpxchg
cmpxchg src, dest
(A&T syntax)
- compares %rax with dest operand
- when equal, the src operand is loaded into dest
- otherwise, the dest operand is loaded into %rax
- Assuming *src still holds the value I expect it to hold (%eax),
...replace it with dest,
...otherwise give me the actual value (copy *src to %eax)
- Atomic read/write instruction (compare and swap)
- Used to build most of sync primitive on x86 arch
Atomic operations used by mutex
Low level lock - try lock
[glibc/sysdeps/unix/sysv/linux/x86_64/lowlevellock.h]
#if !IS_IN (libc) || defined UP
#define __lll_trylock_asm LOCK_INSTR "cmpxchgl %2, %1"
#else
#define __lll_trylock_asm "cmpl $0, __libc_multiple_threads(%%rip)\n\t" \
"je 0f\n\t" \
"lock; cmpxchgl %2, %1\n\t" \
"jmp 1f\n\t" \
"0:\tcmpxchgl %2, %1\n\t" \
"1:"
#endif
#define lll_trylock(futex) \
({ int ret; \
__asm __volatile (__lll_trylock_asm \
: "=a" (ret), "=m" (futex) \
: "r" (LLL_LOCK_INITIALIZER_LOCKED), "m" (futex), \
"0" (LLL_LOCK_INITIALIZER) \
: "memory"); \
ret; })
Low level lock - lock
[glibc/sysdeps/unix/sysv/linux/x86_64/lowlevellock.h]
#if !IS_IN (libc) || defined UP
#define __lll_lock_asm_start LOCK_INSTR "cmpxchgl %4, %2\n\t" \
"jz 24f\n\t"
#else
#define __lll_lock_asm_start "cmpl $0, __libc_multiple_threads(%%rip)\n\t" \
"je 0f\n\t" \
"lock; cmpxchgl %4, %2\n\t" \
"jnz 1f\n\t" \
"jmp 24f\n" \
"0:\tcmpxchgl %4, %2\n\t" \
"jz 24f\n\t"
#endif
#define lll_lock(futex, private) \
(void) \
({ int ignore1, ignore2, ignore3; \
if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
__asm __volatile (__lll_lock_asm_start \
"1:\tlea %2, %%" RDI_LP "\n" \
"2:\tsub $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset 128\n" \
"3:\tcallq __lll_lock_wait_private\n" \
"4:\tadd $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset -128\n" \
"24:" \
: "=S" (ignore1), "=&D" (ignore2), "=m" (futex),\
"=a" (ignore3) \
: "0" (1), "m" (futex), "3" (0) \
: "cx", "r11", "cc", "memory"); \
else \
// [...]
})
[nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S]
__lll_lock_wait_private:
pushq %r10
pushq %rdx
xorq %r10, %r10 /* No timeout. */
movl $2, %edx
LOAD_PRIVATE_FUTEX_WAIT (%esi)
cmpl %edx, %eax /* NB: %edx == 2 */
jne 2f
1: LIBC_PROBE (lll_lock_wait_private, 1, %rdi)
movl $SYS_futex, %eax
syscall
2: movl %edx, %eax
xchgl %eax, (%rdi) /* NB: lock is implied */
testl %eax, %eax
jnz 1b
popq %rdx
popq %r10
retq
Low level lock - unlock
[glibc/sysdeps/unix/sysv/linux/x86_64/lowlevellock.h]
#if !IS_IN (libc) || defined UP
#define __lll_unlock_asm_start LOCK_INSTR "decl %0\n\t" \
"je 24f\n\t"
#else
#define __lll_unlock_asm_start \
"cmpl $0, __libc_multiple_threads(%%rip)\n\t" \
"je 0f\n\t" \
"lock; decl %0\n\t" \
"jne 1f\n\t" \
"jmp 24f\n\t" \
"0:\tdecl %0\n\t" \
"je 24f\n\t"
#endif
#define lll_unlock(futex, private) \
(void) \
({ int ignore; \
if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
__asm __volatile (__lll_unlock_asm_start \
"1:\tlea %0, %%" RDI_LP "\n" \
"2:\tsub $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset 128\n" \
"3:\tcallq __lll_unlock_wake_private\n" \
"4:\tadd $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset -128\n" \
"24:" \
: "=m" (futex), "=&D" (ignore) \
: "m" (futex) \
: "ax", "cx", "r11", "cc", "memory"); \
else \
// [...]
})
[nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S]
__lll_unlock_wake_private:
pushq %rsi
pushq %rdx
movl $0, (%rdi)
LOAD_PRIVATE_FUTEX_WAKE (%esi)
movl $1, %edx /* Wake one thread. */
movl $SYS_futex, %eax
syscall
popq %rdx
popq %rsi
retq
pthread_mutex_t struct
- Holds 3-state low level lock and additonal data
- Still entire mutex state kept in user space
- Members of pthread_mutex_t
__lock
- value of the futex lock (0,1 or 2)
__count
- number of times that the mutex was locked
(for recursive only mutex)
__owner
- tid of thread that owns the mutex
__nusers
- number of threads owning the mutex
__kind
- the type of mutex
__spins
- number of spins while busy waiting
(for adaptive only mutex)
[nptl/sysdeps/unix/sysv/linux/x86/bits/pthreadtypes.h]
typedef union
{
struct __pthread_mutex_s
{
int __lock;
unsigned int __count;
int __owner;
unsigned int __nusers;
int __kind;
int __spins;
__pthread_list_t __list;
} __data;
char __size[__SIZEOF_PTHREAD_MUTEX_T];
long int __align;
} pthread_mutex_t;
The lock function
Some macros used by lock function
[glibc/nptl/pthread_mutex_lock.c]
#ifndef LLL_MUTEX_LOCK
# define LLL_MUTEX_LOCK(mutex) \
lll_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex))
# define LLL_MUTEX_TRYLOCK(mutex) \
lll_trylock ((mutex)->__data.__lock)
# define LLL_ROBUST_MUTEX_LOCK(mutex, id) \
lll_robust_lock ((mutex)->__data.__lock, id, \
PTHREAD_ROBUST_MUTEX_PSHARED (mutex))
#endif
int
__pthread_mutex_lock (mutex)
pthread_mutex_t *mutex;
{
assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
LIBC_PROBE (mutex_entry, 1, mutex);
if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
return __pthread_mutex_lock_full (mutex);
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
if (__builtin_expect (type, PTHREAD_MUTEX_TIMED_NP)
== PTHREAD_MUTEX_TIMED_NP)
{
simple:
/* Normal mutex. */
LLL_MUTEX_LOCK (mutex);
assert (mutex->__data.__owner == 0);
}
else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
{
/* Recursive mutex. */
/* Check whether we already hold the mutex. */
if (mutex->__data.__owner == id)
{
/* Just bump the counter. */
if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
/* Overflow of the counter. */
return EAGAIN;
++mutex->__data.__count;
return 0;
}
/* We have to get the mutex. */
LLL_MUTEX_LOCK (mutex);
assert (mutex->__data.__owner == 0);
mutex->__data.__count = 1;
}
else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (! __is_smp) goto simple;
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN(MAX_ADAPTIVE_COUNT, mutex->__data.__spins*2+10);
do
{
if (cnt++ >= max_cnt)
{
LLL_MUTEX_LOCK (mutex);
break;
}
BUSY_WAIT_NOP;
}
while (LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
else
{
assert (type == PTHREAD_MUTEX_ERRORCHECK_NP);
/* Check whether we already hold the mutex. */
if (__builtin_expect (mutex->__data.__owner == id, 0))
return EDEADLK;
goto simple;
}
/* Record the ownership. */
mutex->__data.__owner = id;
#ifndef NO_INCR
++mutex->__data.__nusers;
#endif
LIBC_PROBE (mutex_acquired, 1, mutex);
return 0;
}
The unlock function
[glibc/nptl/pthread_mutex_unlock.c]
int
internal_function attribute_hidden
__pthread_mutex_unlock_usercnt (mutex, decr)
pthread_mutex_t *mutex;
int decr;
{
int type = PTHREAD_MUTEX_TYPE (mutex);
if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
return __pthread_mutex_unlock_full (mutex, decr);
if (__builtin_expect (type, PTHREAD_MUTEX_TIMED_NP)
== PTHREAD_MUTEX_TIMED_NP)
{
/* Always reset the owner field. */
normal:
mutex->__data.__owner = 0;
if (decr)
/* One less user. */
--mutex->__data.__nusers;
/* Unlock. */
lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
LIBC_PROBE (mutex_release, 1, mutex);
return 0;
}
else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
{
/* Recursive mutex. */
if (mutex->__data.__owner != THREAD_GETMEM (THREAD_SELF, tid))
return EPERM;
if (--mutex->__data.__count != 0)
/* We still hold the mutex. */
return 0;
goto normal;
}
else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
goto normal;
else
{
/* Error checking mutex. */
assert (type == PTHREAD_MUTEX_ERRORCHECK_NP);
if (mutex->__data.__owner != THREAD_GETMEM (THREAD_SELF, tid)
|| ! lll_islocked (mutex->__data.__lock))
return EPERM;
goto normal;
}
}
Conclusions
- Normal mutex is same as low level lock
- Recursive mutex comes with zero cost when compared to normal mutex
- Adaptive mutex keeps spins history and adjust with time
- Error checking mutex does very basic checking
Summary
- Pthreads are very similar to processes
- Thanks to shared memory they can synchronize easier
- They handle fast path in user space by using atomics
- They handle lock contention by using futex syscall
- In case of low contention locks are pretty cheap
Thank you