Bpf/optimized usdt ci #8161

Closed

Conversation

olsajiri
Contributor

@olsajiri olsajiri commented Dec 3, 2024

No description provided.

We are about to add uprobe trampoline, so cleaning up the namespace.

Signed-off-by: Jiri Olsa <[email protected]>
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.

Signed-off-by: Jiri Olsa <[email protected]>
Adding nbytes argument to uprobe_write_opcode as preparation
for writing longer instructions in following changes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding data argument to uprobe_write_opcode function and passing
it to newly added arch overloaded functions:

  arch_uprobe_verify_opcode
  arch_uprobe_is_register

This way each architecture can provide customized verification.

Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch from 7762445 to 9838fc1 Compare December 3, 2024 11:31
Adding support to add special mapping for user space trampoline
with the following functions:

  uprobe_trampoline_get - find or add related uprobe_trampoline
  uprobe_trampoline_put - remove ref or destroy uprobe_trampoline

The user space trampoline is exported as architecture specific user space
special mapping, which is provided by arch_uprobe_trampoline_mapping
function.

The uprobe trampoline needs to be callable/reachable from the probe address,
so while searching for available address we use arch_uprobe_is_callable
function to decide if the uprobe trampoline is callable from the probe address.
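
A minimal sketch of what such a reachability check could look like, assuming
the probe site is later patched with an x86-64 "call rel32" instruction (5
bytes, signed 32-bit displacement). The helper name tramp_is_reachable and
its signature are hypothetical and only illustrate the idea behind
arch_uprobe_is_callable; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Length of an x86-64 "call rel32": 1 opcode byte + 4 displacement bytes. */
#define CALL_INSN_SIZE 5

/*
 * Hypothetical stand-in for the check behind arch_uprobe_is_callable:
 * returns true when a call placed at probe_vaddr can reach tramp_vaddr,
 * i.e. when the displacement fits in a signed 32-bit integer.
 */
static bool tramp_is_reachable(uint64_t tramp_vaddr, uint64_t probe_vaddr)
{
	/* The displacement is relative to the address of the next instruction. */
	int64_t delta = (int64_t)(tramp_vaddr - (probe_vaddr + CALL_INSN_SIZE));

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

Under this assumption the trampoline has to be mapped within roughly 2GB of
the probe address, which is why the search for a free address is tied to the
probe location.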

All uprobe_trampoline objects are stored in uprobes_state object and
are cleaned up when the process mm_struct goes down.

Signed-off-by: Jiri Olsa <[email protected]>
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address.

With this address we find the related uprobe object and call its
consumers.
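
The address recovery can be sketched as below, assuming the 'breakpoint'
site holds a 5-byte call instruction, so that the return address pushed by
the call points just past it. The helper name probe_addr_from_return is
hypothetical and is not a function from the patch.

```c
#include <stdint.h>

/* Assumed length of the call instruction placed at the probe site. */
#define CALL_INSN_SIZE 5

/*
 * Hypothetical helper: the call pushes the address of the instruction
 * that follows it, so stepping back by the call length gives the
 * address the uprobe was registered on.
 */
static uint64_t probe_addr_from_return(uint64_t ret_addr)
{
	return ret_addr - CALL_INSN_SIZE;
}
```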

TODO allow to call uprobe syscall only from uprobe trampoline.

Signed-off-by: Jiri Olsa <[email protected]>
Adding support to emulate nop5 instruction when the
uprobe is placed on that.
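
As an illustration of what emulation means here, a minimal sketch: if the
probed bytes are the x86-64 5-byte nop (0f 1f 44 00 00), the instruction has
no side effects, so it is enough to advance the instruction pointer instead
of single-stepping it out of line. The names insn_is_nop5 and emulate_nop5
are hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Encoding of the x86-64 5-byte nop: nopl 0x0(%rax,%rax,1). */
static const uint8_t nop5_bytes[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Hypothetical check: does the instruction buffer start with nop5? */
static bool insn_is_nop5(const uint8_t *insn)
{
	return memcmp(insn, nop5_bytes, sizeof(nop5_bytes)) == 0;
}

/* Hypothetical emulation: a nop only moves the instruction pointer. */
static uint64_t emulate_nop5(uint64_t ip)
{
	return ip + sizeof(nop5_bytes);
}
```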

current:

  usermode-count :  233.596 ± 1.087M/s
  syscall-count  :   12.422 ± 0.008M/s
  uprobe-nop     :    3.393 ± 0.011M/s
  uprobe-push    :    3.193 ± 0.002M/s
  uprobe-ret     :    1.160 ± 0.001M/s
  uprobe-nop5    :    1.161 ± 0.001M/s  <---
  uretprobe-nop  :    1.786 ± 0.008M/s
  uretprobe-push :    1.711 ± 0.030M/s
  uretprobe-ret  :    0.860 ± 0.011M/s
  uretprobe-nop5 :    1.164 ± 0.002M/s  <---

after the change:

  # benchs/run_bench_uprobes.sh
  usermode-count :  230.601 ± 3.272M/s
  syscall-count  :   12.348 ± 0.088M/s
  uprobe-nop     :    3.385 ± 0.023M/s
  uprobe-push    :    3.145 ± 0.034M/s
  uprobe-ret     :    1.150 ± 0.003M/s
  uprobe-nop5    :    3.398 ± 0.011M/s  <---
  uretprobe-nop  :    1.736 ± 0.015M/s
  uretprobe-push :    1.704 ± 0.013M/s
  uretprobe-ret  :    0.864 ± 0.007M/s
  uretprobe-nop5 :    3.390 ± 0.012M/s  <---

  # Overhead  Command  Shared Object                                   Symbol
  # ........  .......  ..............................................  ..................................................
  #
      30.50%  bench    [kernel.vmlinux]                                [k] sync_regs
      15.73%  bench    [kernel.vmlinux]                                [k] asm_exc_int3
      11.13%  bench    bench                                           [.] uprobe_target_nop
       5.18%  bench    [kernel.vmlinux]                                [k] mtree_load
       4.63%  bench    [kernel.vmlinux]                                [k] irqentry_exit_to_user_mode
       3.85%  bench    [kernel.vmlinux]                                [k] uprobe_notify_resume
       2.55%  bench    [kernel.vmlinux]                                [k] error_entry
       2.34%  bench    [kernel.vmlinux]                                [k] notifier_call_chain
       2.26%  bench    [kernel.vmlinux]                                [k] up_read
       2.07%  bench    [kernel.vmlinux]                                [k] uprobe_pre_sstep_notifier
       1.96%  bench    [kernel.vmlinux]                                [k] __uprobe_perf_func
       1.92%  bench    [kernel.vmlinux]                                [k] handler_chain
       1.92%  bench    bpf_prog_2dcccf652aac1793_bench_trigger_uprobe  [k] bpf_prog_2dcccf652aac1793_bench_trigger_uprobe
       1.60%  bench    [kernel.vmlinux]                                [k] down_read
       1.47%  bench    [kernel.vmlinux]                                [k] find_active_uprobe_rcu
       1.32%  bench    [kernel.vmlinux]                                [k] uprobe_dispatcher
       0.98%  bench    [kernel.vmlinux]                                [k] __irqentry_text_end
       0.93%  bench    [kernel.vmlinux]                                [k] notify_die
       0.88%  bench    [kernel.vmlinux]                                [k] migrate_enable
       0.82%  bench    [kernel.vmlinux]                                [k] branch_emulate_op
       0.77%  bench    [kernel.vmlinux]                                [k] do_int3
       0.58%  bench    [kernel.vmlinux]                                [k] __rcu_read_lock
       0.58%  bench    [kernel.vmlinux]                                [k] __rcu_read_unlock
       0.54%  bench    [kernel.vmlinux]                                [k] exc_int3

  # Overhead  Command  Shared Object                                   Symbol
  # ........  .......  ..............................................  ..................................................
  #
      30.43%  bench    [kernel.vmlinux]                                [k] sync_regs
      16.24%  bench    [kernel.vmlinux]                                [k] asm_exc_int3
      11.11%  bench    bench                                           [.] uprobe_target_nop5
       5.28%  bench    [kernel.vmlinux]                                [k] mtree_load
       4.43%  bench    [kernel.vmlinux]                                [k] irqentry_exit_to_user_mode
       3.87%  bench    [kernel.vmlinux]                                [k] uprobe_notify_resume
       2.78%  bench    [kernel.vmlinux]                                [k] error_entry
       2.53%  bench    [kernel.vmlinux]                                [k] notifier_call_chain
       2.04%  bench    [kernel.vmlinux]                                [k] __uprobe_perf_func
       2.02%  bench    [kernel.vmlinux]                                [k] up_read
       2.02%  bench    [kernel.vmlinux]                                [k] uprobe_pre_sstep_notifier
       1.94%  bench    [kernel.vmlinux]                                [k] handler_chain
       1.60%  bench    [kernel.vmlinux]                                [k] down_read
       1.52%  bench    [kernel.vmlinux]                                [k] find_active_uprobe_rcu
       1.39%  bench    bpf_prog_2dcccf652aac1793_bench_trigger_uprobe  [k] bpf_prog_2dcccf652aac1793_bench_trigger_uprobe
       1.24%  bench    [kernel.vmlinux]                                [k] uprobe_dispatcher
       0.90%  bench    [kernel.vmlinux]                                [k] migrate_enable
       0.89%  bench    [kernel.vmlinux]                                [k] notify_die
       0.86%  bench    [kernel.vmlinux]                                [k] branch_emulate_op
       0.84%  bench    [kernel.vmlinux]                                [k] __irqentry_text_end
       0.77%  bench    [kernel.vmlinux]                                [k] do_int3
       0.63%  bench    [kernel.vmlinux]                                [k] __rcu_read_unlock
       0.59%  bench    [kernel.vmlinux]                                [k] __rcu_read_lock
       0.58%  bench    [kernel.vmlinux]                                [k] exc_int3
       0.55%  bench    [kernel.vmlinux]                                [k] kgdb_ll_trap

Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch 3 times, most recently from 2ae2fc7 to 086f52d Compare December 3, 2024 22:21
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.

The current uprobe execution goes through the following steps:
  - a breakpoint instruction is installed over the original instruction
  - the exception handler is hit and calls the related uprobe consumers
  - it either simulates the original instruction or does out of line single
    step execution of it
  - returns to user space

The optimized uprobe path:

  - checks the original instruction is 5-byte nop (plus other checks)
  - adds (or uses existing) user space trampoline and overwrites original
    instruction (5-byte nop) with call to user space trampoline
  - the user space trampoline executes uprobe syscall that calls related uprobe
    consumers
  - trampoline returns back to next instruction
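
The overwrite step above can be sketched as follows, assuming the call is
the standard x86-64 "call rel32" encoding (opcode 0xe8 followed by a
little-endian 32-bit displacement) and that the trampoline is already known
to be within reach. The function encode_call_rel32 is a hypothetical
illustration, not the helper used by the patch.

```c
#include <stdint.h>

#define CALL_OPCODE    0xe8
#define CALL_INSN_SIZE 5

/*
 * Hypothetical sketch: build the 5 bytes that replace the nop5 at
 * probe_vaddr with a call to tramp_vaddr. The caller is assumed to
 * have verified that the displacement fits in 32 bits.
 */
static void encode_call_rel32(uint8_t out[CALL_INSN_SIZE],
			      uint64_t probe_vaddr, uint64_t tramp_vaddr)
{
	/* Displacement is relative to the instruction after the call. */
	uint32_t rel = (uint32_t)(tramp_vaddr - (probe_vaddr + CALL_INSN_SIZE));

	out[0] = CALL_OPCODE;
	/* Store the displacement byte by byte, least significant first. */
	out[1] = (uint8_t)(rel & 0xff);
	out[2] = (uint8_t)((rel >> 8) & 0xff);
	out[3] = (uint8_t)((rel >> 16) & 0xff);
	out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

Because the call has the same length as the nop5 it replaces, the
instruction that follows the probe site stays where it was, and the
trampoline's return lands on it directly.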

This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we could use nop5 as USDT probe instruction (which
uses single byte nop ATM) and speed up the USDT probes.

This patch overloads related arch functions in uprobe_write_opcode and
set_orig_insn so they can install call instruction if needed.

The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.

TODO release uprobe trampoline when it's no longer needed.. we might need to
stop all cpus to make sure no user space thread is in the trampoline.. or we
might just keep it, because there's just one 4K memory region?

current:

  # benchs/run_bench_uprobes.sh
  usermode-count :  230.601 ± 3.272M/s
  syscall-count  :   12.348 ± 0.088M/s
  uprobe-nop     :    3.385 ± 0.023M/s
  uprobe-push    :    3.145 ± 0.034M/s
  uprobe-ret     :    1.150 ± 0.003M/s
  uprobe-nop5    :    3.398 ± 0.011M/s  <---
  uretprobe-nop  :    1.736 ± 0.015M/s
  uretprobe-push :    1.704 ± 0.013M/s
  uretprobe-ret  :    0.864 ± 0.007M/s
  uretprobe-nop5 :    3.390 ± 0.012M/s  <---

after the change:

  # benchs/run_bench_uprobes.sh
  usermode-count :  233.912 ± 0.187M/s
  syscall-count  :   12.391 ± 0.082M/s
  uprobe-nop     :    3.378 ± 0.008M/s
  uprobe-push    :    3.193 ± 0.009M/s
  uprobe-ret     :    1.153 ± 0.000M/s
  uprobe-nop5    :    7.693 ± 0.039M/s  <---
  uretprobe-nop  :    1.787 ± 0.008M/s
  uretprobe-push :    1.731 ± 0.013M/s
  uretprobe-ret  :    0.871 ± 0.002M/s
  uretprobe-nop5 :    7.584 ± 0.029M/s  <---
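
For reference, the relative gain on the marked rows can be worked out from
the two runs above. The helper below is only a hypothetical illustration of
that arithmetic; the inputs are the mean values reported by the benchmark,
ignoring the error margins.

```c
/* Ratio of throughput after the change to throughput before it. */
static double throughput_ratio(double before_mops, double after_mops)
{
	return after_mops / before_mops;
}

/*
 * uprobe-nop5:    throughput_ratio(3.398, 7.693) is about 2.26
 * uretprobe-nop5: throughput_ratio(3.390, 7.584) is about 2.24
 */
```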

Signed-off-by: Jiri Olsa <[email protected]>
Using 5-byte nop for x86 usdt probes so we can switch
them to optimized uprobes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding tests for optimized uprobe/usdt probes.

Checking that we get expected trampoline and attached bpf programs
get executed properly.

Signed-off-by: Jiri Olsa <[email protected]>
Adding test that makes sure parallel execution of the uprobe and
attach/detach of optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch from 086f52d to 0ca1a25 Compare December 3, 2024 22:42
@olsajiri olsajiri closed this Dec 4, 2024
@olsajiri olsajiri deleted the bpf/optimized_usdt_ci branch December 4, 2024 11:23
@olsajiri olsajiri restored the bpf/optimized_usdt_ci branch December 4, 2024 11:23