BPF monitoring with user stack attribution and the frame pointer
Extended Berkeley Packet Filter (eBPF/BPF) is a fully programmable way of collecting information and changing behaviour from inside the Linux kernel. Amongst it’s many uses collecting information of what limits the performance of a system as a whole or of specific applications. From its Kernel viewpoint, BPF has unique ability to collect information about I/O, networking and other aspects of overall performance that often dominates simple computational throughput in importance.
There are numerous such tools in the bcc collection:
https://github.com/iovisor/bcc. On
Debian they are available as bpfcc-tools
. Many are described on
https://www.brendangregg.com/ebpf.html
as well as (I believe) in his book:
https://www.brendangregg.com/bpf-performance-tools-book.html.
Amongst the capabilities of these tools is to record the stack of user-space applications at various tracing points. This allows attribution of these traces to higher level functions in the program which often makes it far easier to interpret their significance.
The mechanism for interpreting the stack when doing this recording is the frame pointer.
Frame Pointer
Frame pointer is a technique of implementing functions in which the value of the stack pointer (i.e. the sp register) at entry into the function is stored in the base pointer register (i.e., bp register) while the previous value of the bp register is pushed onto the stack. It has the advantage of easy calculation of addresses of standard variables on the stack (because any subsequent changes to stack do not need to be taken into account) at the cost of using a register. It also has an advantage that it becomes very easy to identify which part of the stack (`stack frame’) belongs to each function call.
GCC controls the use of this technique with:
-fno-omit-frame-pointer/-fomit-frame-pointer options. The default in
many relevant scenarios is -fomit-frame-pointer. The effect on a
simple function (g()
) which only calls another function (f()
) is shown below:
-fno-omit-frame-pointer | -fomit-frame-pointer |
|
|
With the stack pointer present this is the print-out of the stack when
in function f()
(which was called from g
which in turn was
called from main
):
(gdb) x/30x $sp
0x7fffffffe950: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffffffe960: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffffffe970: 0xffffe980 0x00007fff 0x555551a0 0x00005555
0x7fffffffe980: 0xffffe990 0x00007fff 0x555551b1 0x00005555
0x7fffffffe990: 0x00000001 0x00000000 0xf7e031ca 0x00007fff
The chain of frame-pointer is clearly visible which allows trivially walking up the stack one function frame at a time. Since the return instruction address is right next to the frame-pointer the frame can be attributed to a function easily too.
This is basically the technique eBPF performance tools use.
How to use in practice
To obtain the stack attribution clearly the code needs to be compiled
with -fno-omit-frame-pointer
or equivalent. Since the
frame-pointers create a chain, it will be broken when entering any
function which was not compiled -fno-omit-frame-pointer
.
For example libc
is typically compiled -fomit-frame-pointer
so walking the stack will be broken at that function. This can easily
be a problem since the very purpose of using eBPF is often trace
system-related events which are typical called from libc!
One simple example derived from this public Github issue is the following program:
#include <unistd.h>
#include <sys/syscall.h>
#include <time.h>
void f() {
for (size_t i = 0; i < 15; ++i) {
//sleep(1);
struct timespec ts;
ts.tv_sec=1;
ts.tv_nsec=0;
syscall(SYS_clock_nanosleep, CLOCK_REALTIME, 0, &ts, &ts );
}
}
void g() {
f();
}
int main() {
g();
}
We can trace up the stack because the SYS call is made directly:
$ sudo offwaketime-bpfcc --stack-storage-size 200000 -Up $(pgrep t2) -d 5
Tracing blocked time (us) by user off-CPU and waker stack for 5 secs.
waker: swapper/1 0
-- --
syscall
g
main
__libc_start_call_main
target: t2 11296
4001289
However original version (using the commented out line sleep(1)
)
calls into C and the stack can not be walked up:
$ sudo offwaketime-bpfcc --stack-storage-size 200000 -Up $(pgrep t2) -d 5
Tracing blocked time (us) by user off-CPU and waker stack for 5 secs.
waker: swapper/0 0
-- --
__clock_nanosleep
target: t2 10797
4002066