Libpatch - Trampolines cycle of life

Libpatch is a C library that can modify the control flow of a program at runtime with little overhead. To do so, many algorithms are used to find a way to replace a region of instructions with a relative call/jmp to a trampoline that will then jump to a handler that will call a user defined function -- called a probe -- before jumping to a out of line execution (OLX) buffer. This buffer will execute the replaced instructions -- emulating some if they are relative to the program counter -- before jumping back to the program.

All of this put together is called a patch. Patches can be [un]installed at runtime without disturbing too much running threads. Indeed, with some clever usage of membarrier(EXPEDITED_SYNC_CORE), it's possible to modify the executable with very little synchronization penalty. Since patches are installed and uninstalled in batch, the penalty is constant and thus amortized over multiple patches.

When removing a patch, the original instructions are restored in place and a cores synchronization is made to ensure that no new thread can not enter the patch trampoline. However, there's a possibility for some threads to already be inside the trampoline or in the user's probe. Thus, it's not possible to reuse the trampoline and the OLX buffer until we're certain that all threads have terminated their execution of the patch.

The current solution to this is to use a restartable sequence -- see rseq(2) -- that will copy the trampoline descriptor -- that is stored right after the trampoline instructions -- and the OLX buffer in a thread local storage. The RSEQ region start by loading the trampoline tag from the trampoline descriptor and compare it to the program counter on the stack. If they match, then the sequence continues, otherwise it aborts the probe and jumps back to the program. The key point here is that when freeing a trampoline and an OLX buffer the first thing to do is to store zero in the trampoline descriptor tag followed by a membarrier(EXPEDITED_RSEQ) that will interrupt any threads running a restartable sequence. That way, if a thread is in the middle of copying the trampoline descriptor or the OLX buffer, it will restart the sequence and see that the tag does not match, thus aborting the probe.

This current solution has some caveats though. First, it's only possible to modify the trampoline descriptor but not the trampoline instructions since there's no way to guarantee that a threads is not in the trampoline. The consequence of that is that when a trampoline is used for a particular handler -- e.g. generic, fentry, leaf -- it is locked for that handler. Finally, this local copying incurs additional overhead. Benchmarks show that the average overhead of an empty probe is around 50 ns for a single thread and can go as high as 65 ns for multi-threads.

A new solution that I envisage is to get rid of the restartable sequence and copying in a TLS. Instead, the trampoline descriptor and OLX buffer are directly used and shared among all threads. With this scheme the average overhead is around 17 ns for single and multi-threads. Clearly, it's better. But how can we free and reuse trampolines and OLX buffer then?

The difficulty in this problem is that the resource that we're managing is a read-only shared resources. There's no lock, no reference counter. Nothing. It's all read-only. However, we know that new readers can not come to existence. Thus, we only have to wait until current readers are done. Kind of like RCU.

A possible solution -- and it's one done by some -- is to wait for a period of time. Perhaps one or two seconds. Then we could assume that all threads have exited the to be reclaimed regions. However, this approach is not bullet proof. It's also very empirical. How do you estimate the waiting period?

The solution I came with is to check the current execution of each running threads with ptrace(2) and libunwind(3). Here's the gist. You make a list of running threads in /proc/self/task. Only these threads can be in the middle of executing a to be freed region since new threads can not enter it because the original instructions have been restored. The idea is then to ptrace(PTRACE_ATTACH) yourself to the thread and check the program counter. If it's in the trampoline, you ptrace(PTRACE_CONT) and check again later. The same goes for the OLX buffer. The difficult part is if the thread is in between the trampoline and the OLX buffer, that is the handler or the probe. In that case, the stack has to be unwinded to check if we found the trampoline in it. If it's the case, you resume the execution of the thread and recheck later. Of course, this method does add some latency to the removal of patches, however it's linear to the number of threads! You only need to check every thread once for every patch removed!

To me this solution has many advantages. It has a better memory footprint, a better execution overhead and is portable across architectures.