Bpf/optimized usdt ci #8161
Closed
olsajiri wants to merge 12 commits into kernel-patches:bpf-next_base from olsajiri:bpf/optimized_usdt_ci
We are about to add uprobe trampoline, so cleaning up the namespace.

Signed-off-by: Jiri Olsa <[email protected]>

Making copy_from_page global and adding uprobe prefix. Adding the uprobe prefix to copy_to_page as well for symmetry.

Signed-off-by: Jiri Olsa <[email protected]>

Adding nbytes argument to uprobe_write_opcode as preparation for writing longer instructions in following changes.

Signed-off-by: Jiri Olsa <[email protected]>

Adding data argument to uprobe_write_opcode function and passing it to newly added arch overloaded functions:

  arch_uprobe_verify_opcode
  arch_uprobe_is_register

This way each architecture can provide customized verification.

Signed-off-by: Jiri Olsa <[email protected]>
olsajiri force-pushed the bpf/optimized_usdt_ci branch from 7762445 to 9838fc1 (December 3, 2024 11:31)
Adding support to add special mapping for user space trampoline with following functions:

  uprobe_trampoline_get - find or add related uprobe_trampoline
  uprobe_trampoline_put - remove ref or destroy uprobe_trampoline

The user space trampoline is exported as architecture specific user space special mapping, which is provided by arch_uprobe_trampoline_mapping function.

The uprobe trampoline needs to be callable/reachable from the probe address, so while searching for available address we use arch_uprobe_is_callable function to decide if the uprobe trampoline is callable from the probe address.

All uprobe_trampoline objects are stored in uprobes_state object and are cleaned up when the process mm_struct goes down.

Signed-off-by: Jiri Olsa <[email protected]>
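The reachability requirement can be illustrated with a small user space model. This is only a sketch under the assumption that the probe site calls the trampoline with an x86-64 `call rel32` instruction, whose signed 32-bit displacement is relative to the end of the 5-byte instruction. The name `is_callable_sketch` is hypothetical and is not the kernel's arch_uprobe_is_callable, whose actual checks are not shown in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CALL_INSN_SIZE 5 /* assumed x86-64 call rel32: opcode + 4-byte displacement */

/*
 * Hypothetical model of the reachability test (not kernel code).
 * Returns true when a call placed at probe_addr can reach tramp_addr
 * with a signed 32-bit displacement.
 */
static bool is_callable_sketch(uint64_t tramp_addr, uint64_t probe_addr)
{
	int64_t delta;

	/* The end of the call instruction must not wrap the address space. */
	assert(probe_addr <= UINT64_MAX - CALL_INSN_SIZE);

	/* The displacement is measured from the instruction after the call. */
	delta = (int64_t)(tramp_addr - (probe_addr + CALL_INSN_SIZE));

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

Under this model a trampoline page is usable only if it sits within roughly 2 GiB of the probe, in either direction, which would explain why an address search is needed at all.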
Adding new uprobe syscall that calls uprobe handlers for given 'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call to retrieve the original 'breakpoint' address. With this address we find the related uprobe object and call its consumers.

TODO allow to call uprobe syscall only from uprobe trampoline.

Signed-off-by: Jiri Olsa <[email protected]>
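The address recovery described above amounts to one subtraction. The sketch below assumes the probe site holds a 5-byte x86-64 `call`, so the return address pushed by that call points just past it. The helper name `probe_addr_from_return` is hypothetical, and how the real handler locates the return address on the user stack is not covered here.

```c
#include <assert.h>
#include <stdint.h>

#define CALL_INSN_SIZE 5 /* assumed length of the call placed at the probe site */

/*
 * Hypothetical helper (not kernel code): given the return address pushed by
 * the call into the trampoline, recover the address of the call itself,
 * which is the original 'breakpoint' address.
 */
static uint64_t probe_addr_from_return(uint64_t ret_addr)
{
	/* A valid return address always lies past the call instruction. */
	assert(ret_addr >= CALL_INSN_SIZE);

	return ret_addr - CALL_INSN_SIZE;
}
```

For example, a call placed at 0x401000 pushes 0x401005 as its return address, and the helper maps that back to 0x401000.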
Adding support to emulate nop5 instruction when the uprobe is placed on it.

current:

      usermode-count :  233.596 ± 1.087M/s
      syscall-count  :   12.422 ± 0.008M/s
      uprobe-nop     :    3.393 ± 0.011M/s
      uprobe-push    :    3.193 ± 0.002M/s
      uprobe-ret     :    1.160 ± 0.001M/s
      uprobe-nop5    :    1.161 ± 0.001M/s  <---
      uretprobe-nop  :    1.786 ± 0.008M/s
      uretprobe-push :    1.711 ± 0.030M/s
      uretprobe-ret  :    0.860 ± 0.011M/s
      uretprobe-nop5 :    1.164 ± 0.002M/s  <---

after the change:

      # benchs/run_bench_uprobes.sh
      usermode-count :  230.601 ± 3.272M/s
      syscall-count  :   12.348 ± 0.088M/s
      uprobe-nop     :    3.385 ± 0.023M/s
      uprobe-push    :    3.145 ± 0.034M/s
      uprobe-ret     :    1.150 ± 0.003M/s
      uprobe-nop5    :    3.398 ± 0.011M/s  <---
      uretprobe-nop  :    1.736 ± 0.015M/s
      uretprobe-push :    1.704 ± 0.013M/s
      uretprobe-ret  :    0.864 ± 0.007M/s
      uretprobe-nop5 :    3.390 ± 0.012M/s  <---

      # Overhead  Command  Shared Object                                   Symbol
      # ........  .......  ..............................................  ..................................................
      #
          30.50%  bench    [kernel.vmlinux]                                [k] sync_regs
          15.73%  bench    [kernel.vmlinux]                                [k] asm_exc_int3
          11.13%  bench    bench                                           [.] uprobe_target_nop
           5.18%  bench    [kernel.vmlinux]                                [k] mtree_load
           4.63%  bench    [kernel.vmlinux]                                [k] irqentry_exit_to_user_mode
           3.85%  bench    [kernel.vmlinux]                                [k] uprobe_notify_resume
           2.55%  bench    [kernel.vmlinux]                                [k] error_entry
           2.34%  bench    [kernel.vmlinux]                                [k] notifier_call_chain
           2.26%  bench    [kernel.vmlinux]                                [k] up_read
           2.07%  bench    [kernel.vmlinux]                                [k] uprobe_pre_sstep_notifier
           1.96%  bench    [kernel.vmlinux]                                [k] __uprobe_perf_func
           1.92%  bench    [kernel.vmlinux]                                [k] handler_chain
           1.92%  bench    bpf_prog_2dcccf652aac1793_bench_trigger_uprobe  [k] bpf_prog_2dcccf652aac1793_bench_trigger_uprobe
           1.60%  bench    [kernel.vmlinux]                                [k] down_read
           1.47%  bench    [kernel.vmlinux]                                [k] find_active_uprobe_rcu
           1.32%  bench    [kernel.vmlinux]                                [k] uprobe_dispatcher
           0.98%  bench    [kernel.vmlinux]                                [k] __irqentry_text_end
           0.93%  bench    [kernel.vmlinux]                                [k] notify_die
           0.88%  bench    [kernel.vmlinux]                                [k] migrate_enable
           0.82%  bench    [kernel.vmlinux]                                [k] branch_emulate_op
           0.77%  bench    [kernel.vmlinux]                                [k] do_int3
           0.58%  bench    [kernel.vmlinux]                                [k] __rcu_read_lock
           0.58%  bench    [kernel.vmlinux]                                [k] __rcu_read_unlock
           0.54%  bench    [kernel.vmlinux]                                [k] exc_int3

      # Overhead  Command  Shared Object                                   Symbol
      # ........  .......  ..............................................  ..................................................
      #
          30.43%  bench    [kernel.vmlinux]                                [k] sync_regs
          16.24%  bench    [kernel.vmlinux]                                [k] asm_exc_int3
          11.11%  bench    bench                                           [.] uprobe_target_nop5
           5.28%  bench    [kernel.vmlinux]                                [k] mtree_load
           4.43%  bench    [kernel.vmlinux]                                [k] irqentry_exit_to_user_mode
           3.87%  bench    [kernel.vmlinux]                                [k] uprobe_notify_resume
           2.78%  bench    [kernel.vmlinux]                                [k] error_entry
           2.53%  bench    [kernel.vmlinux]                                [k] notifier_call_chain
           2.04%  bench    [kernel.vmlinux]                                [k] __uprobe_perf_func
           2.02%  bench    [kernel.vmlinux]                                [k] up_read
           2.02%  bench    [kernel.vmlinux]                                [k] uprobe_pre_sstep_notifier
           1.94%  bench    [kernel.vmlinux]                                [k] handler_chain
           1.60%  bench    [kernel.vmlinux]                                [k] down_read
           1.52%  bench    [kernel.vmlinux]                                [k] find_active_uprobe_rcu
           1.39%  bench    bpf_prog_2dcccf652aac1793_bench_trigger_uprobe  [k] bpf_prog_2dcccf652aac1793_bench_trigger_uprobe
           1.24%  bench    [kernel.vmlinux]                                [k] uprobe_dispatcher
           0.90%  bench    [kernel.vmlinux]                                [k] migrate_enable
           0.89%  bench    [kernel.vmlinux]                                [k] notify_die
           0.86%  bench    [kernel.vmlinux]                                [k] branch_emulate_op
           0.84%  bench    [kernel.vmlinux]                                [k] __irqentry_text_end
           0.77%  bench    [kernel.vmlinux]                                [k] do_int3
           0.63%  bench    [kernel.vmlinux]                                [k] __rcu_read_unlock
           0.59%  bench    [kernel.vmlinux]                                [k] __rcu_read_lock
           0.58%  bench    [kernel.vmlinux]                                [k] exc_int3
           0.55%  bench    [kernel.vmlinux]                                [k] kgdb_ll_trap

Signed-off-by: Jiri Olsa <[email protected]>
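Emulating a nop means advancing the instruction pointer past it instead of single-stepping a copy of the instruction. A minimal user space sketch of that idea follows. It assumes the nop5 in question is the common x86-64 encoding `0f 1f 44 00 00`; the function names are hypothetical and this is not the code added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed 5-byte nop encoding on x86-64: nopl 0x0(%rax,%rax,1). */
static const uint8_t nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Returns true when the 5 bytes at insn match the assumed nop5 encoding. */
static bool is_nop5(const uint8_t *insn)
{
	assert(insn);
	return memcmp(insn, nop5, sizeof(nop5)) == 0;
}

/*
 * Hypothetical emulation step (not kernel code): if the probed instruction
 * is nop5, skip it by moving *ip forward by its length and report success.
 * Otherwise leave *ip untouched so the caller can use another path.
 */
static bool emulate_nop5(const uint8_t *insn, uint64_t *ip)
{
	assert(ip);

	if (!is_nop5(insn))
		return false;

	*ip += sizeof(nop5);
	return true;
}
```

A caller would try `emulate_nop5` first and fall back to another mechanism only when it returns false.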
olsajiri force-pushed the bpf/optimized_usdt_ci branch 3 times, most recently from 2ae2fc7 to 086f52d (December 3, 2024 22:21)
Putting together all the previously added pieces to support optimized uprobes on top of 5-byte nop instruction.

The current uprobe execution goes through following:

  - installs breakpoint instruction over original instruction
  - exception handler hit and calls related uprobe consumers
  - and either simulates original instruction or does out of line single step execution of it
  - returns to user space

The optimized uprobe path

  - checks the original instruction is 5-byte nop (plus other checks)
  - adds (or uses existing) user space trampoline and overwrites original instruction (5-byte nop) with call to user space trampoline
  - the user space trampoline executes uprobe syscall that calls related uprobe consumers
  - trampoline returns back to next instruction

This approach won't speed up all uprobes as it's limited to using nop5 as original instruction, but we could use nop5 as USDT probe instruction (which uses single byte nop ATM) and speed up the USDT probes.

This patch overloads related arch functions in uprobe_write_opcode and set_orig_insn so they can install call instruction if needed.

The arch_uprobe_optimize triggers the uprobe optimization and is called after first uprobe hit. I originally had it called on uprobe installation but then it clashed with elf loader, because the user space trampoline was added in a place where loader might need to put elf segments, so I decided to do it after first uprobe hit when loading is done.

TODO release uprobe trampoline when it's no longer needed.. we might need to stop all cpus to make sure no user space thread is in the trampoline.. or we might just keep it, because there's just one 4K memory region?

current:

      # benchs/run_bench_uprobes.sh
      usermode-count :  230.601 ± 3.272M/s
      syscall-count  :   12.348 ± 0.088M/s
      uprobe-nop     :    3.385 ± 0.023M/s
      uprobe-push    :    3.145 ± 0.034M/s
      uprobe-ret     :    1.150 ± 0.003M/s
      uprobe-nop5    :    3.398 ± 0.011M/s  <---
      uretprobe-nop  :    1.736 ± 0.015M/s
      uretprobe-push :    1.704 ± 0.013M/s
      uretprobe-ret  :    0.864 ± 0.007M/s
      uretprobe-nop5 :    3.390 ± 0.012M/s  <---

after the change:

      # benchs/run_bench_uprobes.sh
      usermode-count :  233.912 ± 0.187M/s
      syscall-count  :   12.391 ± 0.082M/s
      uprobe-nop     :    3.378 ± 0.008M/s
      uprobe-push    :    3.193 ± 0.009M/s
      uprobe-ret     :    1.153 ± 0.000M/s
      uprobe-nop5    :    7.693 ± 0.039M/s  <---
      uretprobe-nop  :    1.787 ± 0.008M/s
      uretprobe-push :    1.731 ± 0.013M/s
      uretprobe-ret  :    0.871 ± 0.002M/s
      uretprobe-nop5 :    7.584 ± 0.029M/s  <---

Signed-off-by: Jiri Olsa <[email protected]>
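The overwrite step of the optimized path can be modelled in user space as replacing the nop5 bytes with a `call rel32` to the trampoline. The sketch below assumes the x86-64 encodings `0f 1f 44 00 00` for nop5 and `e8` plus a little-endian 32-bit displacement for the call. All function names are hypothetical, and the model ignores what the kernel must also handle: other threads executing the bytes while they are rewritten, and the "other checks" the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE 5 /* both the assumed nop5 and call rel32 are 5 bytes long */

static const uint8_t nop5[INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Encode "call tramp_addr" as it would appear at probe_addr; false if out of range. */
static bool encode_call(uint8_t out[INSN_SIZE], uint64_t probe_addr,
			uint64_t tramp_addr)
{
	/* rel32 is relative to the address of the instruction after the call. */
	int64_t delta = (int64_t)(tramp_addr - (probe_addr + INSN_SIZE));
	uint32_t disp;

	if (delta < INT32_MIN || delta > INT32_MAX)
		return false;

	disp = (uint32_t)(int32_t)delta;
	out[0] = 0xe8;                          /* call rel32 opcode */
	out[1] = (uint8_t)(disp & 0xff);        /* displacement, little-endian */
	out[2] = (uint8_t)((disp >> 8) & 0xff);
	out[3] = (uint8_t)((disp >> 16) & 0xff);
	out[4] = (uint8_t)((disp >> 24) & 0xff);
	return true;
}

/*
 * Hypothetical model of the optimization (not kernel code): only a nop5 at
 * the probe site is replaced, and only when the trampoline is reachable.
 */
static bool optimize_sketch(uint8_t *insn, uint64_t probe_addr,
			    uint64_t tramp_addr)
{
	uint8_t call[INSN_SIZE];

	assert(insn);

	if (memcmp(insn, nop5, INSN_SIZE) != 0)
		return false;
	if (!encode_call(call, probe_addr, tramp_addr))
		return false;

	memcpy(insn, call, INSN_SIZE);
	return true;
}
```

Because the call and the nop it replaces have the same length, the instruction that follows the probe site stays where it was, which is what lets the trampoline return straight to the next instruction.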
Using 5-byte nop for x86 usdt probes so we can switch them to optimized uprobes.

Signed-off-by: Jiri Olsa <[email protected]>

Adding tests for optimized uprobe/usdt probes.

Checking that we get expected trampoline and attached bpf programs get executed properly.

Signed-off-by: Jiri Olsa <[email protected]>

Adding test that makes sure parallel execution of the uprobe and attach/detach of optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
olsajiri force-pushed the bpf/optimized_usdt_ci branch from 086f52d to 0ca1a25 (December 3, 2024 22:42)
No description provided.