The Sorry Scheme of Things Entire: perf-backed disassembly

Since 2.6.31 or thereabouts, the Linux kernel has come with a built-in performance-counter subsystem, driven from userspace by a tool known as perf.

The most common use of perf is gathering performance statistics on a program as it runs:

bash$ perf stat -cv ./a.out

cache-misses: 11313 2020574449 2020574449
cache-references: 62031796 2020574449 2020574449
branch-misses: 17909 2020574449 2020574449
branches: 606684832 2020574449 2020574449
instructions: 6324531571 2020574449 2020574449
cycles: 6408533747 2020574449 2020574449
page-faults: 304 2019963367 2019963367
CPU-migrations: 7 2019963367 2019963367
context-switches: 205 2019963367 2019963367
task-clock-msecs: 2019963367 2019963367 2019963367

Performance counter stats for './a.out':

11313 cache-misses # 0.006 M/sec
62031796 cache-references # 30.709 M/sec
17909 branch-misses # 0.003 %
606684832 branches # 300.344 M/sec
6324531571 instructions # 0.987 IPC
6408533747 cycles # 3172.599 M/sec
304 page-faults # 0.000 M/sec
7 CPU-migrations # 0.000 M/sec
205 context-switches # 0.000 M/sec
2019.963367 task-clock-msecs # 0.996 CPUs

2.027948307 seconds time elapsed
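The figures after the # signs are derived from the raw counts, and the arithmetic is easy to check by hand. The following is a minimal sketch, with the numbers copied from the run above; the variable and function names are made up for illustration and are not part of perf.

```python
# Raw counts copied from the perf stat output above.
counts = {
    "cache-references": 62031796,
    "branch-misses": 17909,
    "branches": 606684832,
    "instructions": 6324531571,
    "cycles": 6408533747,
}
task_clock_msecs = 2019.963367  # time the task spent on a CPU
elapsed_secs = 2.027948307      # wall-clock time


def millions_per_sec(count, msecs):
    """Convert a raw count into millions of events per second of task time."""
    return count / (msecs / 1000.0) / 1e6


# Instructions retired per cycle.
ipc = counts["instructions"] / counts["cycles"]
# Share of branches that were mispredicted, as a percentage.
branch_miss_pct = 100.0 * counts["branch-misses"] / counts["branches"]
# Event rates relative to the task clock.
cycles_rate = millions_per_sec(counts["cycles"], task_clock_msecs)
cache_ref_rate = millions_per_sec(counts["cache-references"], task_clock_msecs)
# Fraction of the elapsed time during which the task was on a CPU.
cpus_used = (task_clock_msecs / 1000.0) / elapsed_secs

print(f"{ipc:.3f} IPC")
print(f"{branch_miss_pct:.3f} % branch misses")
print(f"{cycles_rate:.3f} M/sec cycles")
print(f"{cache_ref_rate:.3f} M/sec cache-references")
print(f"{cpus_used:.3f} CPUs")
```

The printed values should line up with the IPC, %, M/sec and CPUs columns in the report.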

The events to be counted can be selected with the -e option, which narrows the output:

bash$ perf stat -e cpu-clock -e instructions ./a.out

Performance counter stats for './a.out':

2026.748812 cpu-clock-msecs

6324293589 instructions # 0.000 IPC

2.032519896 seconds time elapsed
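When the event list is chosen by a script rather than typed by hand, the command line is simple to assemble. This is only a sketch: perf_stat_cmd is a hypothetical helper, and it does nothing more than repeat -e once per event, as in the invocation above.

```python
import shlex


def perf_stat_cmd(events, command):
    """Build an argv list for perf stat, with one -e option per event.

    perf_stat_cmd is a hypothetical helper, not part of perf itself.
    """
    argv = ["perf", "stat"]
    for event in events:
        argv += ["-e", event]
    return argv + list(command)


argv = perf_stat_cmd(["cpu-clock", "instructions"], ["./a.out"])
# Print the command as it would be typed at a shell prompt.
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv) on a machine where perf is installed.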

A list of available events can be obtained via perf list:

bash$ perf list | head

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                 [Hardware event]
  instructions                         [Hardware event]
  cache-references                     [Hardware event]
  cache-misses                         [Hardware event]
  branch-instructions OR branches      [Hardware event]
  branch-misses                        [Hardware event]
  bus-cycles                           [Hardware event]
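The listing is regular enough to turn into a lookup table, which is handy for checking an event name before passing it to -e. A rough sketch follows, assuming every event line has the shape "name [kind]" or "name OR alias [kind]" as in the sample above; other perf versions may print additional line shapes that this does not handle.

```python
import re

# Sample copied from the perf list output above.
SAMPLE = """\
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
"""

# One or more names, then the event kind in square brackets at end of line.
LINE_RE = re.compile(r"^\s*(?P<names>.+?)\s+\[(?P<kind>[^\]]+)\]\s*$")


def parse_perf_list(text):
    """Map every event name, aliases included, to its kind."""
    events = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header or blank line
        for name in match.group("names").split(" OR "):
            events[name.strip()] = match.group("kind")
    return events


events = parse_perf_list(SAMPLE)
for name in sorted(events):
    print(name, "->", events[name])
```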

The perf toolchain also includes the utility perf top, which can monitor either a single process or the kernel:

bash$ sudo perf top 2>/dev/null

-------------------------------------------------------------------------------
PerfTop: 0 irqs/sec kernel:-nan% exact: -nan% [1000Hz cycles], (all, 4 CPUs)
-------------------------------------------------------------------------------

samples  pcnt  function                DSO
_______  _____ ______________________  __________________

  77.00  39.3% intel_idle              [kernel.kallsyms]
  13.00   6.6% __pthread_mutex_unlock  libpthread-2.13.so
  13.00   6.6% pthread_mutex_lock      libpthread-2.13.so
  12.00   6.1% __ticket_spin_lock      [kernel.kallsyms]
   7.00   3.6% schedule                [kernel.kallsyms]
   6.00   3.1% menu_select             [kernel.kallsyms]
   6.00   3.1% fget_light              [kernel.kallsyms]
   6.00   3.1% clear_page_c            [kernel.kallsyms]
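A snapshot like this can be summarized per DSO to see how the visible samples split between the kernel and userland libraries. The sketch below works on the rows shown above only, and assumes each row is "samples pcnt% function DSO" separated by whitespace; samples_by_dso is a name invented for this example.

```python
import re
from collections import defaultdict

# Rows copied from the perf top snapshot above.
SAMPLE = """\
77.00 39.3% intel_idle [kernel.kallsyms]
13.00 6.6% __pthread_mutex_unlock libpthread-2.13.so
13.00 6.6% pthread_mutex_lock libpthread-2.13.so
12.00 6.1% __ticket_spin_lock [kernel.kallsyms]
7.00 3.6% schedule [kernel.kallsyms]
6.00 3.1% menu_select [kernel.kallsyms]
6.00 3.1% fget_light [kernel.kallsyms]
6.00 3.1% clear_page_c [kernel.kallsyms]
"""

# samples, percentage, function name, DSO
ROW_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s*$")


def samples_by_dso(text):
    """Sum the samples column for each DSO."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, separator or blank line
        samples, _pcnt, _function, dso = match.groups()
        totals[dso] += float(samples)
    return dict(totals)


totals = samples_by_dso(SAMPLE)
for dso, samples in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{samples:7.2f} {dso}")
```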

Where things start to get interesting, however, is with perf record. This utility is generally used together with perf report: perf record samples the performance counters of a process and writes them to a file, and perf report reviews them later.

This can be used, for example, to generate a call graph:

bash$ perf record -g -o /tmp/a.out.perf ./a.out

[ perf record: Woken up 1 times to write data ]

[ perf record: Captured and wrote 0.148 MB /tmp/a.out.perf (~6461 samples) ]

bash$ perf report -g -i /tmp/a.out.perf

# Events: 1K cycles
# Overhead Command Shared Object Symbol
# ........ ............. ............. ......
99.90% a.out a.out [.] main
    --- main
        __libc_start_main
0.10% a.out [l2cap] [k] 0xffffffff8103804a
    --- 0xffffffff8105f438
        0xffffffff8105f675
...
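The summary rows of the report carry the overhead, the command, the shared object, a privilege marker ([.] for user level, [k] for kernel level) and the symbol. A small sketch for pulling those fields out of the rows above; it assumes none of the first three fields contain spaces, which will not hold for every command name, and the Row type is made up for this example.

```python
import re
from collections import namedtuple

Row = namedtuple("Row", "overhead command dso level symbol")

# Lines copied from the perf report output above.
SAMPLE = """\
# Events: 1K cycles
# Overhead Command Shared Object Symbol
99.90% a.out a.out [.] main
--- main
__libc_start_main
0.10% a.out [l2cap] [k] 0xffffffff8103804a
--- 0xffffffff8105f438
0xffffffff8105f675
"""

# overhead%, command, shared object, [marker], symbol
ROW_RE = re.compile(r"^\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S.*)$")
LEVELS = {".": "user", "k": "kernel"}


def parse_report(text):
    """Return the summary rows, skipping comments and call-chain lines."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue
        overhead, command, dso, marker, symbol = match.groups()
        level = LEVELS.get(marker, marker)
        rows.append(Row(float(overhead), command, dso, level, symbol.strip()))
    return rows


rows = parse_report(SAMPLE)
for row in rows:
    print(f"{row.overhead:6.2f}% {row.level:6} {row.dso} {row.symbol}")
```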

Once perf data has been recorded, the perf annotate utility can be used to display a disassembly of the instructions that were executed:

bash$ perf annotate -i /tmp/a.out.perf |more

------------------------------------------------

Percent | Source code & Disassembly of a.out

------------------------------------------------

: Disassembly of section .text:

: 0000000000400554

0.00 : 400554: 55 push %rbp
0.00 : 400555: 48 89 e5 mov %rsp,%rbp
0.00 : 400558: 48 81 ec 30 00 0c 00 sub $0xc0030,%rsp
0.00 : 40055f: 48 8d 85 d0 ff fb ff lea -0x40030(%rbp),%rax
0.00 : 400566: ba 00 00 04 00 mov $0x40000,%edx
0.00 : 40056b: be 00 00 00 00 mov $0x0,%esi
0.00 : 400570: 48 89 c7 mov %rax,%rdi
0.00 : 400573: e8 b0 fe ff ff callq 400428 <memset@plt>
0.00 : 400578: c7 45 fc 00 00 00 04 movl $0x4000000,-0x4(%rbp)

...

4.21 : 4006a5: 8b 45 d0 mov -0x30(%rbp),%eax
15.54 : 4006a8: 83 c0 01 add $0x1,%eax
4.97 : 4006ab: 89 45 d0 mov %eax,-0x30(%rbp)
4.87 : 4006ae: 8b 45 d0 mov -0x30(%rbp),%eax
17.79 : 4006b1: 83 c0 01 add $0x1,%eax
4.36 : 4006b4: 89 45 d0 mov %eax,-0x30(%rbp)
4.72 : 4006b7: 48 83 45 f0 01 addq $0x1,-0x10(%rbp)
0.00 : 4006bc: 48 8b 45 f0 mov -0x10(%rbp),%rax

...
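With a long listing it helps to rank the instructions by their share of samples. The sketch below parses the annotated lines from the hot loop above and sorts them; it assumes the "percent : address: bytes mnemonic operands" layout shown here, and the Insn type and function names are invented for illustration.

```python
import re
from collections import namedtuple

Insn = namedtuple("Insn", "percent address text")

# Lines copied from the perf annotate output above.
SAMPLE = """\
4.21 : 4006a5: 8b 45 d0 mov -0x30(%rbp),%eax
15.54 : 4006a8: 83 c0 01 add $0x1,%eax
4.97 : 4006ab: 89 45 d0 mov %eax,-0x30(%rbp)
4.87 : 4006ae: 8b 45 d0 mov -0x30(%rbp),%eax
17.79 : 4006b1: 83 c0 01 add $0x1,%eax
4.36 : 4006b4: 89 45 d0 mov %eax,-0x30(%rbp)
4.72 : 4006b7: 48 83 45 f0 01 addq $0x1,-0x10(%rbp)
0.00 : 4006bc: 48 8b 45 f0 mov -0x10(%rbp),%rax
"""

# percent : address: one or more two-digit hex bytes, then the instruction
INSN_RE = re.compile(
    r"^\s*(\d+\.\d+)\s*:\s*([0-9a-f]+):\s+((?:[0-9a-f]{2}\s+)+)(.+)$"
)


def parse_annotate(text):
    """Return one Insn per annotated instruction line."""
    insns = []
    for line in text.splitlines():
        match = INSN_RE.match(line)
        if match is None:
            continue
        percent, address, _raw_bytes, insn_text = match.groups()
        insns.append(Insn(float(percent), int(address, 16), insn_text.strip()))
    return insns


insns = parse_annotate(SAMPLE)
hottest = sorted(insns, key=lambda insn: insn.percent, reverse=True)
total = sum(insn.percent for insn in insns)

for insn in hottest[:3]:
    print(f"{insn.percent:6.2f}  {insn.address:#x}  {insn.text}")
print(f"{total:.2f}% of samples fall in the lines shown")
```

In this excerpt the two add instructions come out on top, and the eight lines together account for a little over half of all samples.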

As is to be expected from Torvalds and company, the utilities include a number of options for generating parser-friendly output, limiting reporting to specified events and symbols, and so forth. Check the man pages for details.

Sunday, August 7, 2011
