It lets you hook into various points in the kernel; ultimately you need to learn how the Linux kernel is structured to make the most of it.
Unlike a module, it can only really read data, not modify data structures, so it's nice for things like tracing kernel events.
The XDP subsystem in particular is designed to let you apply filters to network data before it reaches the network stack, but it still doesn't give you the same level of control or performance as DPDK, since the data still has to go through the kernel.
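To make the read-only tracing use case concrete, here's a minimal sketch of an eBPF program in the libbpf style (assumes a vmlinux.h generated with bpftool and a libbpf build setup; the function name handle_exec is just illustrative). It observes exec events but can't rewrite any kernel state:

    // Sketch: trace execve calls via a tracepoint (read-only observation).
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    SEC("tracepoint/syscalls/sys_enter_execve")
    int handle_exec(struct trace_event_raw_sys_enter *ctx)
    {
        char comm[16];
        bpf_get_current_comm(&comm, sizeof(comm));  // read who is execing
        bpf_printk("execve by %s (pid %d)\n", comm,
                   (int)(bpf_get_current_pid_tgid() >> 32));
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";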
Yep (the 0x.tools author here). If you look at my code, you'll see that I'm not a good developer :-) But I have a decent understanding of Linux kernel flow and kernel/app interaction dynamics, thanks to many years of troubleshooting large (Oracle) database workloads. So I knew exactly what I wanted to measure and how; I just had to learn the eBPF parts. That's why I picked BCC instead of libbpf, as I was somewhat familiar with it already, but a fully dynamic and "self-updating" libbpf loading approach is the goal for v3 (help appreciated!)
Yeah, I already see limitations. The latest one came up yesterday, when I installed earlier Ubuntu versions to see how far back this can go - even Ubuntu 22.04 didn't work out of the box; I ended up with a BCC/kernel header mismatch issue [1] even though the kernel itself supported everything needed. A workaround is to download and compile the latest BCC yourself, but I don't want to go there, as the customers/systems I work on wouldn't go there anyway.
But libbpf with CO-RE should solve these issues, as I understand it: as long as the kernel supports what you need, the same CO-RE binary will work across kernel versions.
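Roughly, CO-RE works because field accesses are compiled as relocations that libbpf resolves at load time against the running kernel's BTF, so struct layout changes between kernels don't break the binary. A minimal sketch of the BPF side (the probe target and variable names are just illustrative):

    // CO-RE sketch: read task->real_parent->tgid portably. BPF_CORE_READ
    // emits relocations resolved against the running kernel's BTF at load
    // time, so no kernel headers are needed on the target machine.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>

    SEC("kprobe/do_nanosleep")
    int BPF_KPROBE(trace_sleep)
    {
        struct task_struct *task = (void *)bpf_get_current_task();
        pid_t ppid = BPF_CORE_READ(task, real_parent, tgid); // relocated read
        bpf_printk("sleeper's parent tgid: %d\n", ppid);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";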
This raises another issue for me, though: it's easier (though still not easy) for enterprises to download and run a single Python file plus a single C source file (with fewer than 500 lines of code to review) than a compiled CO-RE binary. My long-term plan/hope is that I (we) get the Red Hats and AWSes of this world to ship the eventual mature release as a standard package.
Myself I've only built simple things, like tracing sched switch events for certain threads, and killing the process if they happen (specifically designed as a safety for pinned threads).
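Hedged sketch of what that can look like (the map name watched_tids is illustrative, and this relies on the detail that at the sched_switch tracepoint the current task is still the thread being switched out, so bpf_send_signal(), available since kernel 5.3, hits the right process):

    // "Pinned thread tripwire": if a watched TID is ever switched out,
    // kill its process as a safety measure.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 64);
        __type(key, u32);     // TID that must never be switched out
        __type(value, u8);
    } watched_tids SEC(".maps");

    SEC("tracepoint/sched/sched_switch")
    int on_switch(struct trace_event_raw_sched_switch *ctx)
    {
        u32 tid = (u32)bpf_get_current_pid_tgid();
        if (bpf_map_lookup_elem(&watched_tids, &tid))
            bpf_send_signal(9 /* SIGKILL */);  // tripwire fired
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";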
Same here, until now. I built the earlier xcapture v1 (also in the repo) about 5 years ago; it just samples various /proc/PID/task/TID pseudofiles regularly. That approach lets you get pretty far with thread-level activity measurement, especially when combined with always-on, low-frequency on-CPU sampling with perf.
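For anyone unfamiliar with the technique, this is the general shape of it (not xcapture's actual code; the PID/TID values are illustrative):

    /* Sketch of /proc sampling: periodically read each thread's stat file
     * to record its run state without any kernel-side instrumentation. */
    #include <stdio.h>
    #include <unistd.h>

    static void sample_thread(int pid, int tid)
    {
        char path[64], buf[512];
        snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", pid, tid);
        FILE *f = fopen(path, "r");
        if (!f)
            return;                    /* thread may have exited */
        if (fgets(buf, sizeof(buf), f))
            printf("%s", buf);         /* field 3 is the R/S/D state */
        fclose(f);
    }

    int main(void)
    {
        for (;;) {                     /* low-frequency sampling loop */
            sample_thread(1234, 1234); /* illustrative PID/TID */
            sleep(1);
        }
    }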
XDP, in its intended configuration, passes pointers to packets still on the driver DMA rings (or whatever) directly to BPF code, which can modify packets and forward them to other devices, bypassing the kernel stack completely. You can XDP_PASS a packet if you'd like it to hit the kernel, creating an skbuff, and bouncing it through all the kernel's network stack code, but the idea is that you don't want to do that; if you do, just use TC BPF, which is equivalently powerful and more flexible.
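A minimal sketch of that fast path (assuming a vmlinux.h/libbpf build; the drop rule is arbitrary, just for illustration). The verdict is returned before any skbuff exists; XDP_TX or bpf_redirect() would forward the frame the same way:

    // XDP sketch: act on the raw frame before the kernel stack sees it.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_filter(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *end  = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > end)
            return XDP_PASS;           // bounds check for the verifier
        if (eth->h_proto != bpf_htons(0x0800 /* ETH_P_IP */))
            return XDP_PASS;           // hand non-IPv4 to the stack

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > end)
            return XDP_PASS;
        if (ip->protocol == 17 /* IPPROTO_UDP */)
            return XDP_DROP;           // dropped before any skbuff exists
        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";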
XDP reads the data in the normal NAPI kernel way, integrating with the IRQ system etc., which might or might not be desirable depending on your use case.
Then if you want to forward data to userland, you still need to write it to a ring buffer and have your userland process poll it, at which point it's more akin to using io_uring.
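The kernel side of that handoff looks roughly like this (event layout and names are illustrative); userland pairs it with libbpf's ring_buffer__new() and ring_buffer__poll():

    // Ring-buffer handoff to userland (requires kernel 5.8+ for ringbuf).
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct event {
        u32 pid;
        char comm[16];
    };

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);   // ring size in bytes
    } events SEC(".maps");

    SEC("tracepoint/sched/sched_process_exit")
    int on_exit(void *ctx)
    {
        struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
            return 0;                      // buffer full: drop the event
        e->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
        bpf_ringbuf_submit(e, 0);          // hand off; userland polls it
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";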
It's mostly useful if you can write your entire logic in your eBPF program without going through userland, so it's nice for various tracing applications, filters or security checks, but that's about it as far as I can tell.