perf-cpp enables the recording of event samples, capturing details like instruction pointers, multiple counters, branches, memory addresses, data sources, latency, and more.
Essentially, you define a sampling period or frequency at which data is captured.
At its core, this functionality is akin to traditional profiling tools like perf record, but uniquely tailored to record specific blocks of code rather than the entire application.
The details below provide an overview of how sampling works. For specific information about sampling in parallel settings (i.e., sampling multiple threads and cores), see the "parallel sampling" documentation.
- Interface
- Trigger
- Precision
- Period / Frequency
- What can be Recorded and how to Access the Data?
- Time
- Stream ID
- Period
- Identifier
- Instruction Pointer
- Callchain
- User Stack
- Registers in user-level
- Registers in kernel-level
- ID of the recording Thread
- ID of the recording CPU
- Size of the Code Page
- Performance Counter Values
- Branch Stack (LBR)
- Physical Memory Address
- Logical Memory Address
- Memory Access Latency
- Data Source of a Memory Load
- Size of the Data Page
- Transaction Abort
- Raw Values
- Context Switches
- CGroup
- Throttle and Unthrottle Events
- Sample mode
- Lost Samples
- Specific Notes for different CPU Vendors
- Troubleshooting Counter Configurations
For sampling, the hardware records a set of data (see "What can be Recorded and how to Access the Data?") upon reaching the threshold of a specific trigger event (see "Trigger"). In the following example, we record a timestamp and the current instruction pointer every 4000th cycle:
#include <perfcpp/sampler.h>
auto counter_definitions = perf::CounterDefinition{};
auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);
auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");
sampler.values().time(true).instruction_pointer(true);
Note: The perf::CounterDefinition instance stores event configurations (e.g., names) and is passed by reference.
Consequently, the instance must stay alive for as long as the Sampler is in use (as described here).
Calling sampler.start() opens the sampler first if that has not been done already.
Opening configures all necessary hardware counters and buffers, which may take some time.
If you need precise timing measurements and want to exclude the time spent setting up counters, call sampler.open() separately beforehand:
try {
sampler.open();
} catch (std::runtime_error& e) {
std::cerr << e.what() << std::endl;
}
Surround your computational code with the start() and stop() methods to sample hardware events:
try {
sampler.start();
} catch (std::runtime_error& e) {
std::cerr << e.what() << std::endl;
}
/// ... do some computational work here...
sampler.stop();
The output is a series of perf::Sample instances, each potentially including extensive data.
Because you select which data elements are sampled, each data point is wrapped in a std::optional to account for its potential absence.
→ See how to query sample results
const auto result = sampler.result();
for (const auto& sample_record : result)
{
if (sample_record.time().has_value() && sample_record.instruction_pointer().has_value())
{
std::cout
<< "Time = " << sample_record.time().value()
<< " | IP = 0x" << std::hex << sample_record.instruction_pointer().value() << std::dec << std::endl;
}
}
The output may be something like this:
Time = 124853764466887 | IP = 0x5794c991990c
Time = 124853764663977 | IP = 0xffffffff8d79d48b
Time = 124853764861377 | IP = 0x5794c991990c
Time = 124853765058918 | IP = 0x5794c991990c
Time = 124853765256328 | IP = 0x5794c991990c
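The optional-handling pattern from the loop above can be tried without any hardware access. The following is a minimal sketch, not part of perf-cpp: MockSample and format_sample are hypothetical stand-ins that only mimic the std::optional fields of a sample record.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical stand-in for a sample record; it only mimics two optional fields.
struct MockSample
{
  std::optional<std::uint64_t> time;
  std::optional<std::uintptr_t> instruction_pointer;
};

// Formats a sample like the loop above. Returns std::nullopt when one of the
// two fields was not recorded, instead of calling value() on an empty optional.
std::optional<std::string> format_sample(const MockSample& sample)
{
  if (!sample.time.has_value() || !sample.instruction_pointer.has_value()) {
    return std::nullopt;
  }

  std::ostringstream line;
  line << "Time = " << sample.time.value()
       << " | IP = 0x" << std::hex << sample.instruction_pointer.value();
  return line.str();
}
```

Checking has_value() before value() matters because value() throws std::bad_optional_access on an empty optional.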
Closing the sampler releases and un-maps all buffers and deactivates all counters. Additionally, the sampler automatically closes upon destruction. However, closing the sampler explicitly enables it to be reopened at a future time.
sampler.close();
Each sampler is associated with one or more trigger events. When a trigger event reaches a specified (user-defined) threshold, the CPU records a sample containing the desired data. Triggers for a sampler can be specified as follows:
sampler.trigger("cycles");
To define multiple triggers, use a vector of trigger names:
sampler.trigger(std::vector<std::string>{"cycles", "instructions"});
In this scenario, the CPU captures a sample whenever either the cycles or the instructions counter exceeds its threshold.
When configuring event-based sampling, it's important to understand that different CPU manufacturers support different sets of events that can be used as triggers.
Intel CPUs are generally flexible and allow almost every event as a trigger.
On AMD systems, the range of events that can trigger samples is more restricted: typically, only the cycles event and specific IBS events such as ibs_fetch and ibs_op are supported.
For more detailed information on configuring event-based sampling for different CPU types and specific notes on memory sampling, refer to the section: Specific Notes for different CPU Vendors.
Due to deeply pipelined processors, samples might not be precise, i.e., a sample might contain an instruction pointer or memory address that did not generate the overflow (→ see a blogpost on easyperf.net and the perf documentation). You can request a specific amount of skid for each trigger, for example:
sampler.trigger("cycles", perf::Precision::AllowArbitrarySkid);
The precision can have the following values:
- perf::Precision::AllowArbitrarySkid (this does not enable Intel PEBS)
- perf::Precision::MustHaveConstantSkid (default)
- perf::Precision::RequestZeroSkid
- perf::Precision::MustHaveZeroSkid
If you do not set any precision level through the .trigger() interface, you can control the default precision through the sample config:
auto sample_config = perf::SampleConfig{};
sample_config.precision(perf::Precision::RequestZeroSkid);
auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");
Note: If the precision setting is too high and the perf subsystem fails to activate the trigger, perf-cpp will automatically reduce the precision. However, it will not increase precision autonomously.
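The fallback described in the note can be pictured as a loop that steps down through the precision levels until opening succeeds. The sketch below is an illustration, not perf-cpp's actual implementation: the local Precision enum only mirrors the four names listed above (its numeric values are assumed to follow the precise_ip levels 0 to 3 of perf_event_open), and open_with_fallback and the try_open callback are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Local mirror of the four precision levels. Assumption: the values follow
// precise_ip of perf_event_open (0 = arbitrary skid ... 3 = must have zero skid).
enum class Precision : std::uint8_t
{
  AllowArbitrarySkid = 0,
  MustHaveConstantSkid = 1,
  RequestZeroSkid = 2,
  MustHaveZeroSkid = 3
};

// Tries the requested precision first and lowers it step by step until
// try_open (a hypothetical callback) reports success. Levels above the
// requested one are never tried. Returns the level that worked, if any.
std::optional<Precision> open_with_fallback(const Precision requested,
                                            const std::function<bool(Precision)>& try_open)
{
  for (auto level = static_cast<int>(requested); level >= 0; --level) {
    const auto candidate = static_cast<Precision>(level);
    if (try_open(candidate)) {
      return candidate;
    }
  }
  return std::nullopt;
}
```

Returning the level that finally worked lets the caller see whether the precision was reduced.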
You can request a specific period or frequency for each trigger (basically, how often the hardware should write samples), for example:
/// Every 4000th cycle.
sampler.trigger("cycles", perf::Period{4000U});
or
/// With a frequency of 1000 samples per second, i.e., one sample per millisecond
/// (the hardware will adjust the period according to the provided frequency).
sampler.trigger("cycles", perf::Frequency{1000U});
You can also combine the configurations, for example, by
/// Every 4000th cycle with zero skid.
sampler.trigger("cycles", perf::Precision::RequestZeroSkid, perf::Period{4000U});
If you do not set a period or frequency through the .trigger() interface, you can control the default period or frequency through the sample config:
auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);
/// xor:
sample_config.frequency(1000U);
auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");
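To get a feeling for the two settings, it helps to convert between them. The following back-of-the-envelope sketch is independent of perf-cpp; the helper names are made up for this example, and the 3 GHz clock rate in the usage note is an assumption.

```cpp
#include <cstdint>

// Number of samples to expect when one sample is written every `period`
// occurrences of the trigger event. A period of 0 is treated as "no samples".
constexpr std::uint64_t expected_samples(const std::uint64_t event_count, const std::uint64_t period)
{
  return period == 0U ? 0U : event_count / period;
}

// Average period that corresponds to a requested sampling frequency, given
// how often the trigger event occurs per second.
constexpr std::uint64_t average_period(const std::uint64_t events_per_second, const std::uint64_t frequency)
{
  return frequency == 0U ? 0U : events_per_second / frequency;
}
```

For a core that executes an assumed 3,000,000,000 cycles per second, a period of 4000 cycles gives expected_samples(3000000000, 4000) = 750000 samples per second, whereas a frequency of 1000 corresponds to an average period of average_period(3000000000, 1000) = 3000000 cycles.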
Prior to activation, the sampler must be configured to specify the data to be recorded. For instance:
sampler.values()
.time(true)
.instruction_pointer(true);
This specific configuration captures both the timestamp and instruction pointer within the sample record. Upon completing the sampling and retrieving the sampling results, the recorded fields can be accessed as follows:
for (const auto& sample_record : sampler.result()) {
const auto timestamp = sample_record.time().value();
const auto instruction_pointer = sample_record.instruction_pointer().value();
}
Note: Most fields of the sample_record are optional; recording these fields must be activated via sampler.values().
The timestamp of capturing the sample.
- Request by
sampler.values().time(true);
- Read from the results by
sample_record.time().value()
Unique ID of an opened event.
- Request by
sampler.values().stream_id(true);
- Read from the results by
sample_record.stream_id().value()
Period at the time of capturing the sample (the Kernel might adjust the period).
- Request by
sampler.values().period(true);
- Read from the results by
sample_record.period().value();
Identifier of the sample.
- Request by
sampler.values().identifier(true);
- Read from the results by
sample_record.id().value();
Address of the executed instruction at the moment the hardware captured the sample. Consider examining the precision configuration, as the sampled instruction pointer may not always be precise.
- Request by
sampler.values().instruction_pointer(true);
- Read from the results by
sample_record.instruction_pointer().value()
Additionally, you can determine if the captured instruction pointer was exact by querying sample_record.is_exact_ip(), which returns a bool.
Callchain as a list of instruction pointers.
- Request by sampler.values().callchain(true); or sampler.values().callchain(M); where M is a std::uint16_t defining the maximum call stack size.
- Read from the results by sample_record.callchain().value(); which returns a std::vector<std::uintptr_t> of instruction pointers.
The user-level stack as a list of chars.
- Request by sampler.values().user_stack(M); where M is a std::uint32_t defining the maximum size of the user stack.
- Read from the results by sample_record.user_stack().value(); which returns a std::vector<char> of data.
Values of registers within the user-level.
- Request by sampler.values().user_registers( { perf::Registers::x86::IP, perf::Registers::x86::DI, perf::Registers::x86::R10 });
- Read from the results by sample_record.user_registers().value()[0]; (0 for the first register, perf::Registers::x86::IP in this example).
- The ABI can be queried using sample_record.user_registers_abi().
Values of registers within the kernel-level.
- Request by sampler.values().kernel_registers( { perf::Registers::x86::IP, perf::Registers::x86::DI, perf::Registers::x86::R10 });
- Read from the results by sample_record.kernel_registers().value()[0]; (0 for the first register, perf::Registers::x86::IP in this example).
- The ABI can be queried using sample_record.kernel_registers_abi().
ID of the thread that recorded the sample.
- Request by
sampler.values().thread_id(true);
- Read from the results by
sample_record.thread_id().value()
ID of the CPU that recorded the sample.
- Request by
sampler.values().cpu_id(true);
- Read from the results by
sample_record.cpu_id().value();
Size of the page that contains the sampled instruction pointer.
Sampling the code page size requires a Linux Kernel version of 5.11 or higher.
- Request by
sampler.values().code_page_size(true);
- Read from the results by
sample_record.code_page_size().value();
Record hardware performance counter values at the time of the sample.
- Request by sampler.values().counter({"instructions", "cache-misses"}); You can sample any hardware event or metric.
- Read from the results by sample_record.counter_result().value().get("cache-misses"); This can be accessed in the same manner as when recording counters.
Branch history with jump addresses and a flag indicating whether each branch was predicted correctly.
- Request by sampler.values().branch_stack({ perf::BranchType::User, perf::BranchType::Conditional });
The possible branch type values are:
- perf::BranchType::User: Sample branches in user mode.
- perf::BranchType::Kernel: Sample kernel branches.
- perf::BranchType::HyperVisor: Sample branches in HV mode.
- perf::BranchType::Any (the default): Sample all branches.
- perf::BranchType::Call: Sample any call (direct, indirect, far jumps).
- perf::BranchType::DirectCall: Sample direct call (requires Linux Kernel 4.4 or higher).
- perf::BranchType::IndirectCall: Sample indirect call.
- perf::BranchType::Return: Sample return branches.
- perf::BranchType::IndirectJump: Sample indirect jumps (requires Linux Kernel 4.2 or higher).
- perf::BranchType::Conditional: Sample conditional branches.
- perf::BranchType::TransactionalMemoryAbort: Sample branches that abort transactional memory.
- perf::BranchType::InTransaction: Sample branches in transactions of transactional memory.
- perf::BranchType::NotInTransaction: Sample branches not in transactions of transactional memory.
Read from the results by sample_record.branches(), which returns a vector of perf::Branch.
Each perf::Branch instance has the following values:
- Instruction pointer from where the branch started (branch.instruction_pointer_from()).
- Instruction pointer where the branch ended (branch.instruction_pointer_to()).
- A flag that indicates if the branch was correctly predicted (branch.is_predicted()) (0 if not supported by the hardware).
- A flag that indicates if the branch was mispredicted (branch.is_mispredicted()) (0 if not supported by the hardware).
- A flag that indicates if the branch was within a transaction (branch.is_in_transaction()).
- A flag that indicates if the branch was a transaction abort (branch.is_transaction_abort()).
- Cycles since the last branch (branch.cycles()) (0 if not supported by the hardware; since Linux 4.3).
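The prediction flags lend themselves to a simple aggregate such as a misprediction ratio. The sketch below is self-contained and not part of perf-cpp: MockBranch is a hypothetical stand-in that only carries the two flags described above, and misprediction_ratio is an illustrative helper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a branch record; it only carries the two prediction flags.
struct MockBranch
{
  bool is_predicted;
  bool is_mispredicted;
};

// Share of mispredicted branches among all branches that carry prediction
// information. Branches with both flags unset (e.g., not supported by the
// hardware) are skipped. Returns 0.0 if no branch carries prediction information.
double misprediction_ratio(const std::vector<MockBranch>& branches)
{
  std::size_t predicted = 0U;
  std::size_t mispredicted = 0U;
  for (const auto& branch : branches) {
    predicted += branch.is_predicted ? 1U : 0U;
    mispredicted += branch.is_mispredicted ? 1U : 0U;
  }

  const auto total = predicted + mispredicted;
  return total == 0U ? 0.0 : static_cast<double>(mispredicted) / static_cast<double>(total);
}
```

With real data, the same loop would run over the vector returned by sample_record.branches() and read the flags through the accessors listed above.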
Sampling the physical memory address requires a Linux Kernel version of 4.13 or higher.
- Request by
sampler.values().physical_memory_address(true);
- Read from the results by
sample_record.physical_memory_address().value();
Memory sampling can be tricky; refer to the specifics of the underlying hardware since memory sampling requires distinct triggers across different hardware platforms.
Logical memory address of the sampled memory access.
- Request by
sampler.values().logical_memory_address(true);
- Read from the results by
sample_record.logical_memory_address().value()
Memory sampling can be tricky; refer to the specifics of the underlying hardware since memory sampling requires distinct triggers across different hardware platforms.
The weight (and weight struct) indicates how costly the event was (basically the latency).
Since Linux Kernel version 5.12, the Kernel might generate more information than only a single value, which is used to differentiate between memory latency (from cache towards memory) and instruction latency.
- Request by sampler.values().weight(true); or sampler.values().weight_struct(true); (the latter only from Kernel 5.12).
- Read from the results by sample_record.weight().value(); which returns a perf::Weight class with the following attributes:
  - sample_record.weight().value().cache_latency() returns the cache latency of the sampled data address (for both sampler.values().weight(true) and sampler.values().weight_struct(true)).
  - sample_record.weight().value().instruction_retirement_latency() returns the latency of retiring the instruction (including the cache access), but only for sampler.values().weight_struct(true). To the best of our knowledge, this feature is only supported by new Intel generations.
  - sample_record.weight().value().var3() returns "other information" (not specified by perf), but only for sampler.values().weight_struct(true).
Note that memory sampling depends on the underlying sampling mechanism. → See hardware-specific information (e.g., Intel PEBS vs AMD IBS)
Data source where the data was sampled (e.g., local mem, remote mem, L1d, L2, ...).
- Request by sampler.values().data_src(true);
- Read from the results by sample_record.data_src().value(); which returns a specific perf::DataSource object.
The perf::DataSource object can be queried for the following information:
Query | Information |
---|---|
sample_record.data_src().value().is_load() | True, if the access was a load operation. |
sample_record.data_src().value().is_store() | True, if the access was a store operation. |
sample_record.data_src().value().is_prefetch() | True, if the access was a prefetch operation. |
sample_record.data_src().value().is_exec() | True, if the access was an execute operation. |
sample_record.data_src().value().is_mem_hit() | True, if the access was a hit. |
sample_record.data_src().value().is_mem_miss() | True, if the access was a miss. |
sample_record.data_src().value().is_mem_l1() | True, if the data was found in the L1 cache. |
sample_record.data_src().value().is_mem_l2() | True, if the data was found in the L2 cache. |
sample_record.data_src().value().is_mem_l3() | True, if the data was found in the L3 cache. |
sample_record.data_src().value().is_mem_l4() | True, if the data was found in the L4 cache (since Linux 4.14). |
sample_record.data_src().value().is_mem_lfb() | True, if the data was found in the Line Fill Buffer (or Miss Address Buffer on AMD). |
sample_record.data_src().value().is_mem_l2_mhb() | True, if the data was found in the L2 Miss Handling Buffer (since Linux 6.11). |
sample_record.data_src().value().is_mem_local() | True, if the data was found in the local memory subsystem (since Linux 4.14). |
sample_record.data_src().value().is_mem_remote() | True, if the data was found in a remote memory subsystem (since Linux 4.14). |
sample_record.data_src().value().is_mem_ram() | True, if the data was found in any RAM. |
sample_record.data_src().value().is_mem_local_ram() | True, if the data was found in the local RAM. |
sample_record.data_src().value().is_mem_remote_ram() | True, if the data was found in any remote RAM. |
sample_record.data_src().value().is_mem_hops0() | True, if the data was found locally (since Linux 5.16). |
sample_record.data_src().value().is_mem_hops1() | True, if the data was found on the same node (since Linux 5.17). |
sample_record.data_src().value().is_mem_hops2() | True, if the data was found on a remote socket (since Linux 5.17). |
sample_record.data_src().value().is_mem_hops3() | True, if the data was found on a remote board (since Linux 5.17). |
sample_record.data_src().value().is_mem_remote_ram1() | True, if the data was found in a remote RAM on the same node. |
sample_record.data_src().value().is_mem_remote_ram2() | True, if the data was found in a remote RAM on a different socket. |
sample_record.data_src().value().is_mem_remote_ram3() | True, if the data was found in a remote RAM on a different board (since Linux 5.17). |
sample_record.data_src().value().is_mem_remote_cce1() | True, if the data was found in a cache with one hop distance. |
sample_record.data_src().value().is_mem_remote_cce2() | True, if the data was found in a cache with two hops distance. |
sample_record.data_src().value().is_pmem() | True, if the data was found on a PMEM device (since Linux 4.14). |
sample_record.data_src().value().is_cxl() | True, if the data was transferred via Compute Express Link (since Linux 6.1). |
sample_record.data_src().value().is_io() | True, if the memory address is I/O. |
sample_record.data_src().value().is_uncached() | True, if the memory address is related to uncached memory. |
sample_record.data_src().value().is_tlb_hit() | True, if the access was a TLB hit. |
sample_record.data_src().value().is_tlb_miss() | True, if the access was a TLB miss. |
sample_record.data_src().value().is_tlb_l1() | True, if the access can be associated with the dTLB. |
sample_record.data_src().value().is_tlb_l2() | True, if the access can be associated with the STLB. |
sample_record.data_src().value().is_tlb_walk() | True, if the access can be associated with the hardware walker. |
sample_record.data_src().value().is_locked() | True, if the address was accessed via a lock instruction. |
sample_record.data_src().value().is_data_blocked() | True, in case the data could not be forwarded. |
sample_record.data_src().value().is_address_blocked() | True, in case of an address conflict (since Linux 5.12). |
sample_record.data_src().value().is_snoop_hit() | True, if the access was a snoop hit. |
sample_record.data_src().value().is_snoop_miss() | True, if the access was a snoop miss. |
sample_record.data_src().value().is_snoop_hit_modified() | True, if the access was a snoop hit modified. |
All these queries wrap around the perf_mem_data_src data structure.
Since we may have missed specific operations, you can also access each particular data structure:
- sample_record.data_src().value().op() accesses the PERF_MEM_OP structure.
- sample_record.data_src().value().lvl() accesses the PERF_MEM_LVL structure. Note that this structure is deprecated; use sample_record.data_src().value().lvl_num() instead.
- sample_record.data_src().value().lvl_num() accesses the PERF_MEM_LVL_NUM structure (since Linux 4.14).
- sample_record.data_src().value().remote() accesses the PERF_MEM_REMOTE structure (since Linux 4.14).
- sample_record.data_src().value().snoop() accesses the PERF_MEM_SNOOP structure.
- sample_record.data_src().value().snoopx() accesses the PERF_MEM_SNOOPX structure (since Linux 4.14).
- sample_record.data_src().value().lock() accesses the PERF_MEM_LOCK structure.
- sample_record.data_src().value().blk() accesses the PERF_MEM_BLK structure (since Linux 5.12).
- sample_record.data_src().value().tlb() accesses the PERF_MEM_TLB structure.
- sample_record.data_src().value().hops() accesses the PERF_MEM_HOPS structure (since Linux 5.16).
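As an illustration of what such a wrapper does, the sketch below decodes the operation type from a raw 64-bit data-source value. It assumes the little-endian layout of perf_mem_data_src in linux/perf_event.h, where mem_op occupies the lowest five bits; the constants are copies of the PERF_MEM_OP_* definitions, and the helper names are made up for this example and are not part of perf-cpp.

```cpp
#include <cstdint>

// Copies of the PERF_MEM_OP_* bit flags from linux/perf_event.h.
constexpr std::uint64_t mem_op_not_available = 0x01U;
constexpr std::uint64_t mem_op_load = 0x02U;
constexpr std::uint64_t mem_op_store = 0x04U;
constexpr std::uint64_t mem_op_prefetch = 0x08U;
constexpr std::uint64_t mem_op_exec = 0x10U;

// Extracts the mem_op field (assumption: lowest five bits, little-endian layout).
constexpr std::uint64_t mem_op(const std::uint64_t raw_data_src)
{
  return raw_data_src & 0x1FU;
}

// True if the load bit is set in the mem_op field.
constexpr bool is_load(const std::uint64_t raw_data_src)
{
  return (mem_op(raw_data_src) & mem_op_load) != 0U;
}

// True if the store bit is set in the mem_op field.
constexpr bool is_store(const std::uint64_t raw_data_src)
{
  return (mem_op(raw_data_src) & mem_op_store) != 0U;
}
```

The other fields (level, snoop, lock, TLB, and so on) follow the same mask-and-test pattern at higher bit positions.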
Note that memory sampling depends on the underlying sampling mechanism. → See hardware-specific information (e.g., Intel PEBS vs AMD IBS)
Size of pages of sampled data addresses (e.g., when sampling for logical memory address).
Sampling the data page size requires a Linux Kernel version of 5.11 or higher.
- Request by
sampler.values().data_page_size(true);
- Read from the results by
sample_record.data_page_size().value();
Reports the reason for an abort of a transactional memory transaction.
- Request by sampler.values().transaction_abort(true);
- Read from the results by sample_record.transaction_abort().value(); which returns an instance of the type perf::TransactionAbort. The abort can be queried by:
  - sample_record.transaction_abort().value().is_elision(), which returns true if the abort comes from an elision type transaction (Intel-specific).
  - sample_record.transaction_abort().value().is_transaction(), which returns true if the abort comes from a generic transaction.
  - sample_record.transaction_abort().value().is_synchronous(), which returns true if the abort is synchronous.
  - sample_record.transaction_abort().value().is_asynchronous(), which returns true if the abort is asynchronous.
  - sample_record.transaction_abort().value().is_retry(), which returns true if the abort is retryable.
  - sample_record.transaction_abort().value().is_conflict(), which returns true if the abort is due to a conflict.
  - sample_record.transaction_abort().value().is_capacity_write(), which returns true if the abort is due to write capacity.
  - sample_record.transaction_abort().value().is_capacity_read(), which returns true if the abort is due to read capacity.
- In addition, sample_record.transaction_abort().value().code() returns the abort code for a transaction as specified by the user.
Hardware will record different data, depending on the CPU generation and manufacturer. While the perf subsystem will parse the data and map it to a generalized interface, the raw data can also be accessed using raw sampling.
- Request by
sampler.values().raw(true);
- Read the results by
sample_record.raw().value();
→ See code example for AMD IBS (more information about how to access the raw data is provided by the AMD manual from page 428)
Occurrence of context switches.
Sampling context switches requires a Linux Kernel version of 4.3 or higher.
- Request by sampler.values().context_switch(true);
- Read from the results by sample_record.context_switch().value(); (if sample_record.context_switch().has_value();), which returns a perf::ContextSwitch object. The context switch contains
  - a flag if the process was switched in or out (context_switch.is_in() or context_switch.is_out()),
  - a flag if the process was preempted (context_switch.is_preempt()) (only from Linux Kernel 4.17),
  - the id of the in or out process, if sampling cpu-wide (context_switch.process_id()),
  - and the id of the in or out thread, if sampling cpu-wide (context_switch.thread_id()).
- In addition, the following data will be set in a sample:
  - sample_record.process_id() and sample_record.thread_id(), if sampler.thread_id(true) was specified,
  - sample_record.timestamp(), if sampler.time(true) was specified,
  - sample_record.stream_id(), if sampler.stream_id(true) was specified,
  - sample_record.cpu_id(), if sampler.cpu_id(true) was specified, and
  - sample_record.id(), if sampler.identifier(true) was specified.
Sampling cgroups requires a Linux Kernel version of 5.7 or higher.
- Request by sampler.values().cgroup(true);
- CGroup IDs are included in samples and can be read by sample_record.cgroup_id().value();
- Whenever new cgroups are created or activated, the sample can include a perf::CGroup item containing the ID of the created/activated cgroup (sample_record.cgroup().value().id();), which matches one of the cgroup_id()s of the sample. perf::CGroup also contains a path, which can be accessed by sample_record.cgroup().value().path();
- In addition, the following data will be set in a sample:
  - sample_record.process_id() and sample_record.thread_id(), if sampler.thread_id(true) was specified,
  - sample_record.timestamp(), if sampler.time(true) was specified,
  - sample_record.stream_id(), if sampler.stream_id(true) was specified,
  - sample_record.cpu_id(), if sampler.cpu_id(true) was specified, and
  - sample_record.id(), if sampler.identifier(true) was specified.
Occurrence of throttle and unthrottle events.
- Request by sampler.values().throttle(true);
- Throttle events are included in samples and can be read by sample_record.throttle().value(); which returns a perf::Throttle object. The throttle object contains a flag indicating
  - that it was a throttle event (sample_record.throttle().value().is_throttle();)
  - or that it was an unthrottle event (sample_record.throttle().value().is_unthrottle();). Only one of the two will return true.
- In addition, the following data will be set in a sample:
  - sample_record.process_id() and sample_record.thread_id(), if sampler.thread_id(true) was specified,
  - sample_record.timestamp(), if sampler.time(true) was specified,
  - sample_record.stream_id(), if sampler.stream_id(true) was specified,
  - sample_record.cpu_id(), if sampler.cpu_id(true) was specified, and
  - sample_record.id(), if sampler.identifier(true) was specified.
Each sample is recorded in one of the following modes:
perf::Sample::Mode::Unknown
perf::Sample::Mode::Kernel
perf::Sample::Mode::User
perf::Sample::Mode::Hypervisor
perf::Sample::Mode::GuestKernel
perf::Sample::Mode::GuestUser
You can check the mode via sample_record.mode(), for example:
for (const auto& sample_record : result)
{
if (sample_record.mode() == perf::Sample::Mode::Kernel)
{
std::cout << "Sample in Kernel" << std::endl;
}
else if (sample_record.mode() == perf::Sample::Mode::User)
{
std::cout << "Sample in User" << std::endl;
}
}
Sample records may be lost, for example, if the buffer is full or the CPU is heavily loaded.
Such losses are reported through sample_record.count_loss(), which returns std::nullopt for regular samples and, for loss records, the number of samples lost as reported by the perf subsystem.
In addition, the following data will be set in a sample:
- sample_record.process_id() and sample_record.thread_id(), if sampler.thread_id(true) was specified,
- sample_record.timestamp(), if sampler.time(true) was specified,
- sample_record.stream_id(), if sampler.stream_id(true) was specified,
- sample_record.cpu_id(), if sampler.cpu_id(true) was specified, and
- sample_record.id(), if sampler.identifier(true) was specified.
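Since loss records and regular samples arrive in the same result list, the total number of lost samples can be obtained by summing the optional loss counts. The following is a minimal sketch with hypothetical names: MockLossRecord only mimics the optional returned by count_loss(), and total_lost_samples is not part of perf-cpp.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a sample record; it only mimics the loss count.
struct MockLossRecord
{
  std::optional<std::uint64_t> count_loss;
};

// Sums the losses reported by loss records. Regular samples carry
// std::nullopt and therefore contribute nothing.
std::uint64_t total_lost_samples(const std::vector<MockLossRecord>& records)
{
  std::uint64_t lost = 0U;
  for (const auto& record : records) {
    lost += record.count_loss.value_or(0U);
  }
  return lost;
}
```

Comparing this sum with the number of regular samples gives a rough idea of whether the buffer size or the sampling period needs adjusting.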
Especially for sampling memory addresses, latency, and data source, the perf subsystem needs specific events as triggers.
On Intel, the perf list command reports these triggers as "Supports address when precise".
perf-cpp will discover mem-loads and mem-stores events when running on Intel hardware that supports sampling for memory.
Additionally, memory sampling typically requires a precision setting of at least perf::Precision::RequestZeroSkid.
From our experience, Intel's Cascade Lake architecture (and earlier architectures) only reports latency and source for memory loads, not stores; this changes with Sapphire Rapids.
You can add load and store events like this:
sampler.trigger("mem-loads", perf::Precision::MustHaveZeroSkid); /// Only load events
or
sampler.trigger("mem-stores", perf::Precision::MustHaveZeroSkid); /// Only store events
or
/// Load and store events
sampler.trigger(std::vector<std::vector<perf::Sampler::Trigger>>{
{
perf::Sampler::Trigger{ "mem-loads", perf::Precision::RequestZeroSkid } /// Loads
},
{ perf::Sampler::Trigger{ "mem-stores", perf::Precision::MustHaveZeroSkid } } /// Stores
});
To use memory latency sampling on Intel's Sapphire Rapids architecture, the perf subsystem needs an auxiliary counter to be added to the group, before the first "real" counter is added (see this commit).
perf-cpp will define this counter and add it as a trigger automatically (from version 0.10.0) when it can detect that the hardware needs it.
In this case, you can proceed as before Sapphire Rapids.
However, if the detection fails but the system needs it, you can add it yourself:
sampler.trigger({
{
perf::Sampler::Trigger{"mem-loads-aux", perf::Precision::MustHaveZeroSkid}, /// Helper
perf::Sampler::Trigger{"mem-loads", perf::Precision::RequestZeroSkid} /// First "real" counter
},
{ perf::Sampler::Trigger{"mem-stores", perf::Precision::MustHaveZeroSkid} } /// Other "real" counters.
});
You can check if the auxiliary counter is needed by checking if the following file exists in the system:
/sys/bus/event_source/devices/cpu/events/mem-loads-aux
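The file check can also be done programmatically. The sketch below uses std::filesystem from C++17; the function name has_auxiliary_counter is made up for this example and is not part of perf-cpp, and the default path is the one named above.

```cpp
#include <filesystem>
#include <system_error>

// Returns true if the given sysfs entry exists. Errors (e.g., missing
// permissions) are treated as "not present" instead of throwing.
bool has_auxiliary_counter(
  const std::filesystem::path& event_file = "/sys/bus/event_source/devices/cpu/events/mem-loads-aux")
{
  std::error_code error;
  return std::filesystem::exists(event_file, error) && !error;
}
```

If the function returns true but perf-cpp did not add the helper on its own, the manual trigger configuration shown above can be used.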
AMD uses Instruction Based Sampling to tag instructions randomly for sampling and collect various information for each sample (see the programmer reference). In contrast to Intel's mechanism, IBS cannot tag specific load and store instructions (and apply a filter on the latency). In case the instruction was a load/store instruction, the sample will include data source, latency, and a memory address (see kernel mailing list).
perf-cpp (or, to be precise, the perf::CounterDefinition class) will detect IBS support on AMD devices and add the following counters that can be used as triggers for sampling on AMD:
- ibs_op selects instructions during the execution pipeline. CPU cycles (on the specified period/frequency) will lead to tagging an instruction.
- ibs_op_uops selects instructions during the execution pipeline, but the period/frequency refers to the number of executed micro-operations, not CPU cycles.
- ibs_op_l3missonly selects instructions during the execution pipeline that miss the L3 cache. CPU cycles are used as the trigger.
- ibs_op_uops_l3missonly selects instructions during the execution pipeline that miss the L3 cache, using micro-operations as the trigger.
- ibs_fetch selects instructions in the fetch-state (frontend) using cycles as the trigger.
- ibs_fetch_l3missonly selects instructions in the fetch-state (frontend) that miss the L3 cache, again using cycles as the trigger.
Debugging and configuring hardware counters can sometimes be complex, as settings (e.g., the precision, precise_ip) may need to be adjusted for different machines.
Utilize perf-cpp's debugging features to gain insights into the internal workings of performance counters and troubleshoot any configuration issues:
auto config = perf::SampleConfig{};
config.is_debug(true);
auto sampler = perf::Sampler{ counter_definitions, config };
The idea is borrowed from Linux Perf, which can be asked to print counter configurations as follows:
perf --debug perf-event-open stat -- sleep 1
This command helps visualize configurations for various counters, which is also beneficial for retrieving event codes (for more details, see the counters documentation).