Instrumenting UserLand Apps with eBPF

eBPF has revolutionized the observability landscape in the Linux kernel. In the previous posts of this series, I covered the fundamental building blocks of the eBPF ecosystem, scratched the surface of XDP, and showed how closely it cooperates with the eBPF infrastructure to introduce a fast-processing datapath in the networking stack.

Nevertheless, eBPF is not exclusive to kernel-space tracing. Wouldn’t it be splendid if we could instrument applications running in production while enjoying all the luxuries of eBPF-driven tracing?

This is where uprobes come into play. Think of them as the userspace counterpart of kprobes: instead of kernel symbols, they attach to tracepoints in user code. Numerous language runtimes, database systems, and other software stacks incorporate hooks that can be consumed by BCC tools. For example, the ustat tool collects a bunch of valuable events such as garbage collection events, object creation statistics, method calls, and more.

The bad news is that “official” language runtime releases, like Node.js and Python, don’t ship with DTrace support, which means you’ll have to build them from source and pass the --with-dtrace flag to the configure script. This is not a prerequisite for natively compiled languages: as long as the symbol table is available, dynamic tracing can be applied to any symbol present in the binary’s text segment. This is how Go or Rust stdlib function calls are instrumented on live binaries.
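
For instance, building Node.js with USDT support boils down to something along these lines (a rough sketch; the exact configure options vary across runtimes and versions):

$ ./configure --with-dtrace
$ make -j"$(nproc)"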

eBPF Techniques for Instrumenting Applications

There are various approaches to tracing user space processes:

  • Statically declared USDT
  • Dynamically declared USDT
  • Dynamic tracing with uprobes

Statically declared USDT

USDT (Userland Statically Defined Tracing) embodies the idea of embedding probes directly in user code. The technique dates back to the Solaris/BSD DTrace era and consists of using the DTRACE_PROBE() macros to declare tracepoints at strategic code locations. Unlike regular symbols, USDT probes represent a stable interface that keeps working even if the surrounding code is refactored. The following diagram depicts the process of declaring a USDT tracepoint in user code all the way to its execution in the kernel.

A developer starts by implanting tracepoints via the DTRACE_PROBE and DTRACE_PROBE1 macros at the code locations of interest. Both macros take the provider and probe names as mandatory arguments; the numbered variants (DTRACE_PROBE1, DTRACE_PROBE2, and so on) additionally accept the values you would like to inspect at the tracepoint. The compiler emits the USDT tracepoints into a dedicated ELF note section of the target binary. The contract between compilers and tracing tools dictates that a .note.stapsdt section must exist where the USDT metadata resides.
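
As a minimal illustration (the myapp provider, the query__start probe, and the query_exec function are made-up names for this sketch), a statically declared tracepoint could look like this:

#include <sys/sdt.h>

/* Hypothetical function we would like to observe. */
void query_exec(const char *query, int client_id) {
    /* Emits a USDT tracepoint with two arguments: the query string
       and the id of the client that issued it. */
    DTRACE_PROBE2(myapp, query__start, query, client_id);
    /* ... actual query processing ... */
}

After compiling, readelf -n should list the probe under the .note.stapsdt section, and a BCC invocation along the lines of trace 'u:./myapp:query__start "%s", arg1' would attach to it.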

The USDT tracing tool introspects the ELF note section and places a breakpoint at the tracepoint’s location, which on x86 translates to an int3 instruction. Whenever control flow reaches the tracepoint’s marker, the interrupt handler is triggered and the eBPF program associated with the uprobe is run in the kernel to process the event, broadcast it to userspace, perform map aggregations, and so on.

Dynamically declared USDT

Since USDT probes are baked into a statically generated ELF section, this approach is of little use to software running on interpreted or JIT-compiled languages. Luckily, it is possible to define tracepoints at runtime via libstapsdt. It produces a tiny shared object with the USDT information, which is mapped into the process’ address space so tracing tools can attach to the intended tracepoints. Bindings for libstapsdt exist in many languages. Head to this example to find out how to install USDT probes in Node.js.
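
For a rough idea of what this looks like, here is a sketch based on libstapsdt’s C API (providerInit, providerAddProbe, providerLoad, probeFire); the myruntime provider and request__done probe names are invented for the example:

#include <stdio.h>
#include <libstapsdt.h>

int main(void) {
    /* Create a provider and declare a probe taking a single 64-bit argument. */
    SDTProvider_t *provider = providerInit("myruntime");
    SDTProbe_t *probe = providerAddProbe(provider, "request__done", 1, uint64);

    /* Generates a temporary shared object carrying the USDT note and
       maps it into this process' address space. */
    if (providerLoad(provider) == -1) {
        fprintf(stderr, "failed to load provider\n");
        return 1;
    }

    /* Fire the probe; a tracer attached to myruntime:request__done sees this. */
    probeFire(probe, 42);

    providerUnload(provider);
    providerDestroy(provider);
    return 0;
}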

Dynamic tracing with uprobes

This tracing mechanism doesn’t require anything from the target process other than an accessible symbol table. It is the most versatile and powerful instrumentation method, since it allows injecting breakpoints at arbitrary instructions without even restarting the target process.

Tracing examples

After this brief theoretical introduction, let’s look at some concrete examples of instrumenting real-world apps written in diverse languages.

C

Redis is a popular key-value data structure server written in C. A sneak peek into the Redis symbol table reveals a vast number of functions that are candidates for capturing via uprobes.

$ objdump -tT /usr/bin/redis-server

000000000004c160 g    DF .text  00000000000000cc  Base        addReplyDouble
0000000000090940 g    DF .text  00000000000000b0  Base        sha1hex
00000000000586e0 g    DF .text  000000000000007c  Base        replicationSetMaster
00000000001b39e0 g    DO .data  0000000000000030  Base        dbDictType
00000000000ace20 g    DF .text  0000000000000030  Base        RM_DictGetC
0000000000041bc0 g    DF .text  0000000000000073  Base        sdsull2str
00000000000bba00 g    DF .text  0000000000000871  Base        raxSeek
00000000000ac8c0 g    DF .text  000000000000000c  Base        RM_ThreadSafeContextUnlock
00000000000e3900 g    DF .text  0000000000000059  Base        mp_encode_lua_string
00000000001cef60 g    DO .bss   0000000000000438  Base        rdbstate
0000000000047110 g    DF .text  00000000000000b5  Base        zipSaveInteger
000000000009f5a0 g    DF .text  0000000000000055  Base        addReplyDictOfRedisInstances
0000000000069200 g    DF .text  000000000000004a  Base        zzlDelete
0000000000041e90 g    DF .text  00000000000008ba  Base        sdscatfmt
000000000009ac40 g    DF .text  000000000000003a  Base        sentinelLinkEstablishedCallback
00000000000619d0 g    DF .text  0000000000000045  Base        psetexCommand
00000000000d92f0 g    DF .text  00000000000000fc  Base        luaL_argerror
00000000000bc360 g    DF .text  0000000000000328  Base        raxRandomWalk
0000000000096a00 g    DF .text  00000000000000c3  Base        rioInitWithFdset
000000000003d160 g    DF .text  0000000000000882  Base        serverCron
0000000000032907 g    DF .text  0000000000000000  Base        je_prof_thread_name_set
0000000000043960 g    DF .text  0000000000000031  Base        zfree
00000000000a2a40 g    DF .text  00000000000001ab  Base        sentinelFailoverDetectEnd
00000000001b8500 g    DO .data  0000000000000028  Base        je_percpu_arena_mode_names
00000000000b5f90 g    DF .text  0000000000000018  Base        geohashEstimateStepsByRadius
00000000000d95e0 g    DF .text  0000000000000039  Base        luaL_checkany
0000000000048850 g    DF .text  00000000000002d4  Base        createClient
...

There is an interesting createStringObject function that Redis uses internally to allocate the strings modeled around the robj structure. Every command received by the server results in createStringObject calls, so hooking this function lets us spy on any command sent to the Redis server. For this purpose, I’m going to use the trace utility from the BCC toolkit.

$ /usr/share/bcc/tools/trace '/usr/bin/redis-server:createStringObject "%s" arg1'
PID     TID     COMM            FUNC             -
8984    8984    redis-server    createStringObject b'COMMAND\r\n'
8984    8984    redis-server    createStringObject b'set\r\n$4\r\nocti\r\n$4\r\nfest\r\n'
8984    8984    redis-server    createStringObject b'octi\r\n$4\r\nfest\r\n'
8984    8984    redis-server    createStringObject b'fest\r\n'
8984    8984    redis-server    createStringObject b'get\r\n$4\r\nocti\r\n'
8984    8984    redis-server    createStringObject b'octi\r\n'

The above is the output yielded by executing set octi fest and get octi from the Redis CLI client.

Java

Modern JVM versions ship with built-in support for USDT. All the probes are provided by the libjvm shared object, and we can dig the available tracepoints out of its ELF note section.

$ readelf -n /usr/lib/jvm/jdk-11-oracle/lib/server/libjvm.so
...
stapsdt              0x00000037       NT_STAPSDT (SystemTap probe descriptors)
  Provider: hs_private
  Name: cms__initmark__end
  Location: 0x0000000000e2420c, Base: 0x0000000000f725b4, Semaphore: 0x0000000000000000
  Arguments:
stapsdt              0x00000037       NT_STAPSDT (SystemTap probe descriptors)
  Provider: hs_private
  Name: cms__remark__begin
  Location: 0x0000000000e24334, Base: 0x0000000000f725b4, Semaphore: 0x0000000000000000
  Arguments:
stapsdt              0x00000035       NT_STAPSDT (SystemTap probe descriptors)
  Provider: hs_private
  Name: cms__remark__end
  Location: 0x0000000000e24418, Base: 0x0000000000f725b4, Semaphore: 0x0000000000000000
  Arguments:
stapsdt              0x0000002f       NT_STAPSDT (SystemTap probe descriptors)
  Provider: hotspot
  Name: gc__begin
  Location: 0x0000000000e2b262, Base: 0x0000000000f725b4, Semaphore: 0x0000000000000000
  Arguments: 1@$1
stapsdt              0x00000029       NT_STAPSDT (SystemTap probe descriptors)
  Provider: hotspot
  Name: gc__end
  Location: 0x0000000000e2b31a, Base: 0x0000000000f725b4, Semaphore: 0x0000000000000000
  Arguments:
...

To capture all class load events we can use the following command:

$ /usr/share/bcc/tools/trace 'u:/usr/lib/jvm/jdk-11-oracle/lib/server/libjvm.so:class__loaded "%s", arg1'

Similarly, we can watch on JVM thread creation events:

$ /usr/share/bcc/tools/trace 'u:/usr/lib/jvm/jdk-11-oracle/lib/server/libjvm.so:thread__start "%s", arg1'
PID     TID     COMM            FUNC
27390   27398   java            thread__start    b'Reference Handler'
27390   27399   java            thread__start    b'Finalizer'
27390   27400   java            thread__start    b'Signal Dispatcher'
27390   27401   java            thread__start    b'C2 CompilerThread0'
27390   27402   java            thread__start    b'C1 CompilerThread0'
27390   27403   java            thread__start    b'Sweeper thread'
27390   27404   java            thread__start    b'Service Thread'

When extended probes are enabled via the -XX:+ExtendedDTraceProbes flag, the uflow tool is able to trace and graph all method executions in real time.
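
The flag is passed on the java command line when launching the application (app.jar is just a placeholder here):

$ java -XX:+ExtendedDTraceProbes -jar app.jar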

$ /usr/share/bcc/tools/lib/uflow -l java 27965
Tracing method calls in java process 27965... Ctrl-C to quit.
CPU PID    TID    TIME(us) METHOD
5   27965  27991  0.736    <- jdk/internal/misc/Unsafe.park
5   27965  27991  0.736    -> java/util/concurrent/locks/LockSupport.setBlocker
5   27965  27991  0.736      -> jdk/internal/misc/Unsafe.putObject
5   27965  27991  0.736      <- jdk/internal/misc/Unsafe.putObject
5   27965  27991  0.736    <- java/util/concurrent/locks/LockSupport.setBlocker
5   27965  27991  0.736    <- java/util/concurrent/locks/LockSupport.parkNanos
5   27965  27991  0.736    -> java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.checkInterruptWhileWaiting
5   27965  27991  0.737      -> java/lang/Thread.interrupted
5   27965  27991  0.737        -> java/lang/Thread.isInterrupted
5   27965  27991  0.737        <- java/lang/Thread.isInterrupted
5   27965  27991  0.737      <- java/lang/Thread.interrupted
5   27965  27991  0.737    <- java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.checkInterruptWhileWaiting
5   27965  27991  0.737    -> java/lang/System.nanoTime
5   27965  27991  0.737    <- java/lang/System.nanoTime

However, extended probes incur quite a lot of overhead, which tends to make them unfit for production debugging.

Go

I’m going to finish the demonstration of the tracing techniques with an example in Go. Since Go is a natively compiled language, it is tempting to use the trace tool to attach uprobe programs to target symbols. You can try it yourself with this trivial code snippet:

package main

import "fmt"

func main() {
  fmt.Println("Hi")
}

$ go build -o /tmp/hi build.go

$ /usr/share/bcc/tools/trace '/tmp/hi:fmt.Println "%s" arg1'
PID     TID     COMM            FUNC             -
31545   31545   hi              fmt.Println      b'\xd6\x8dK'

Instead of our glorious “Hi” string, we got some random garbage in the argument column. This is partly because trace cannot handle Println’s variadic arguments, but also because of a wrong assumption about the ABI calling convention: unlike C/C++, where the preferred way of passing parameters is in registers, Go (at least prior to the register-based calling convention introduced in Go 1.17) passes parameters on the stack.

Since we cannot rely on trace to demonstrate how to instrument Go code, I’ll build a simple tool that traces all HTTP GET requests issued via the http.Get function. You could easily adapt it to capture other HTTP request verbs, but I’m leaving that as an exercise. The full source code can be found in this repo.

I won’t go into the details of the uprobe attaching/loading process, since the Go bindings for libbcc do the heavy lifting for us. Let’s dissect the actual uprobe program instead.

After the required include statements comes the definition of the macro responsible for picking up parameters from the stack at a given offset.

#define SP_OFFSET(offset) (void *)PT_REGS_SP(ctx) + offset * 8

Next, we declare the structure that encapsulates the events streamed through the reqs map, which is defined with the BPF_PERF_OUTPUT macro. The core of our program is the __uprobe_http_get function: it is triggered in kernel space whenever http.Get is invoked. We know http.Get has a single argument representing the URL the HTTP request is sent to. Another difference between C and Go is how they lay out strings in memory.

While C strings are null-terminated byte sequences, Go represents a string as a two-word value comprising a pointer to the backing buffer and the string’s length. This is why we need two calls to bpf_probe_read – one to fetch the string pointer and a second to pull in its length.

bpf_probe_read(&url, sizeof(url), SP_OFFSET(1));
bpf_probe_read(&req.len, sizeof(req.len), SP_OFFSET(2));
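
Putting the pieces together, a stripped-down version of the probe could look roughly like this (the http_get_req_t struct layout and the MAX_URL_LENGTH cap are illustrative choices, not necessarily what the actual tool uses):

#include <uapi/linux/ptrace.h>

#define SP_OFFSET(offset) (void *)PT_REGS_SP(ctx) + offset * 8
#define MAX_URL_LENGTH 128  /* illustrative cap on the captured URL */

/* Event pushed to userspace for every http.Get call. */
struct http_get_req_t {
    char url[MAX_URL_LENGTH];  /* raw bytes of the URL string */
    u64 len;                   /* length reported by the Go string header */
};

/* Perf buffer through which events are streamed to the frontend. */
BPF_PERF_OUTPUT(reqs);

int __uprobe_http_get(struct pt_regs *ctx) {
    struct http_get_req_t req = {};
    void *url = NULL;

    /* The first stack slot holds the pointer to the string data,
       the second one its length (the two words of the Go string header). */
    bpf_probe_read(&url, sizeof(url), SP_OFFSET(1));
    bpf_probe_read(&req.len, sizeof(req.len), SP_OFFSET(2));
    bpf_probe_read(&req.url, sizeof(req.url), url);

    reqs.perf_submit(ctx, &req, sizeof(req));
    return 0;
}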

Later, in userspace, the URL byte slice is trimmed to its corresponding length. As a side note, the draft version of the tool was able to dig out the latency of each HTTP GET request by injecting a uretprobe. However, this proved to have a catastrophic effect whenever the Go runtime decided to shrink or grow the stack, because uretprobes patch the return address on the stack to point to a trampoline that is executed in the context of the eBPF VM. On exiting the uretprobe, the instruction pointer is restored to the original return address, which might by then point to an invalid location, messing up the stack and causing the process to crash. There are some proposals to address this issue.

Conclusion

We have come to the end of our eBPF journey. In this final article, we shed some light on eBPF features for instrumentation of userspace processes. Through several pragmatic cases, we’ve shown the versatility of the BCC framework for capturing observability signals. Finally, we got our hands dirty and built a small tool for tracing HTTP requests on live Go apps.
