Event Sampling

perf-cpp enables the recording of event samples, capturing details like instruction pointers, multiple counters, branches, memory addresses, data sources, latency, and more. Essentially, you define a sampling period or frequency at which data is captured. At its core, this functionality is akin to traditional profiling tools, like perf record, but uniquely tailored to record specific blocks of code rather than the entire application.

See details below.

The details below provide an overview of how sampling works. For specific information about sampling in parallel settings (i.e., sampling multiple threads and cores) take a look into the "parallel sampling" documentation.



Interface

Setting up what to record and when

For sampling, the hardware records a set of data (see more details) upon reaching the threshold of a specific trigger event (see more details). In the following example, we record a timestamp and the current instruction pointer every 4000th cycle:

#include <perfcpp/sampler.h>
auto counter_definitions = perf::CounterDefinition{};

auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);

auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");
sampler.values().time(true).instruction_pointer(true);

Note: The perf::CounterDefinition instance is used to store event configurations (e.g., names) and passed as a reference. Consequently, the instance needs to be alive while using the Sampler (as described here).

Initializing the Sampler (optional)

Calling sampler.start() initializes the sampler if it has not been opened yet. Initialization configures all necessary hardware counters and buffers, which may take some time. If you need precise timing measurements and want to exclude the time spent setting up counters, call sampler.open() separately beforehand.

try {
    sampler.open();
} catch (std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}

Managing Sampler Lifecycle

Surround your computational code with start() and stop() methods to sample hardware events:

try {
    sampler.start();
} catch (std::runtime_error& e) {
    std::cerr << e.what() << std::endl;
}

/// ... do some computational work here...

sampler.stop();

Retrieving Samples

The output is a series of perf::Sample instances, each potentially including extensive data. Given the capability to select specific data elements for sampling, each data point is encapsulated within an std::optional to manage its potential absence.

See how to query sample results

const auto result = sampler.result();

for (const auto& sample_record : result)
{
    if (sample_record.time().has_value() && sample_record.instruction_pointer().has_value())
    {
        std::cout 
            << "Time = " << sample_record.time().value() 
            << " | IP = 0x" << std::hex << sample_record.instruction_pointer().value() << std::dec << std::endl;
    }
}

The output may be something like this:

Time = 124853764466887 | IP = 0x5794c991990c
Time = 124853764663977 | IP = 0xffffffff8d79d48b
Time = 124853764861377 | IP = 0x5794c991990c
Time = 124853765058918 | IP = 0x5794c991990c
Time = 124853765256328 | IP = 0x5794c991990c
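
A common next step is to aggregate the sampled instruction pointers to see where most samples landed. The following is a minimal sketch that works on plain addresses, so it does not depend on perf-cpp; the function name hottest_instruction_pointers is hypothetical and only serves this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

/// Counts how often each instruction pointer occurs and returns the
/// (address, count) pairs sorted by descending count.
std::vector<std::pair<std::uint64_t, std::size_t>>
hottest_instruction_pointers(const std::vector<std::uint64_t>& instruction_pointers)
{
    std::unordered_map<std::uint64_t, std::size_t> histogram;
    for (const auto instruction_pointer : instruction_pointers) {
        ++histogram[instruction_pointer];
    }

    std::vector<std::pair<std::uint64_t, std::size_t>> sorted(histogram.begin(), histogram.end());
    std::sort(sorted.begin(), sorted.end(), [](const auto& left, const auto& right) {
        // Break ties by address so the order is deterministic.
        return left.second != right.second ? left.second > right.second : left.first < right.first;
    });
    return sorted;
}
```

To feed it, collect sample_record.instruction_pointer().value() of every sample that has the field into a vector. For the five samples printed above, the address 0x5794c991990c would come first with four occurrences.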

Closing the Sampler (optional)

Closing the sampler releases and un-maps all buffers and deactivates all counters. Additionally, the sampler automatically closes upon destruction. However, closing the sampler explicitly enables it to be reopened at a future time.

sampler.close();

Trigger

Each sampler is associated with one or more trigger events. When a trigger event reaches a specified (user-defined) threshold, the CPU records a sample containing the desired data. Triggers for a sampler can be specified as follows:

sampler.trigger("cycles");

To define multiple triggers, use a vector of trigger names:

sampler.trigger(std::vector<std::string>{"cycles", "instructions"});

In this scenario, exceeding either the cycles or instructions counter will prompt the CPU to capture a sample.

Notes for specific CPUs

When configuring event-based sampling, it's important to understand that different CPU manufacturers support different sets of events that can be used as triggers.

Intel CPUs are generally flexible and allow almost every event as a trigger. On AMD systems, the range of events that can trigger samples is more restricted: Typically, only the cycles event and specific IBS events such as ibs_fetch and ibs_op are supported.

For more detailed information on configuring event-based sampling for different CPU types and specific notes on memory sampling, refer to the section: Specific Notes for different CPU Vendors.

Precision

Due to deeply pipelined processors, samples might not be precise, i.e., a sample might contain an instruction pointer or memory address that did not generate the overflow (→ see a blogpost on easyperf.net and the perf documentation). You can request a specific amount of skid for each trigger, for example:

sampler.trigger("cycles", perf::Precision::AllowArbitrarySkid);

The precision can have the following values:

  • perf::Precision::AllowArbitrarySkid (this does not enable Intel PEBS)
  • perf::Precision::MustHaveConstantSkid (default)
  • perf::Precision::RequestZeroSkid
  • perf::Precision::MustHaveZeroSkid

If you do not set any precision level through the .trigger() interface, you can control the default precision through the sample config:

auto sample_config = perf::SampleConfig{};
sample_config.precision(perf::Precision::RequestZeroSkid);

auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");

Note: If the precision setting is too high and the perf subsystem fails to activate the trigger, perf-cpp will automatically reduce the precision. However, it will not increase precision autonomously.
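
The fallback described in the note can be pictured as a loop that steps down one level at a time. The sketch below is an illustration only, not perf-cpp's implementation: SkidLevel, open_with_fallback, and the try_open callback are hypothetical names. The numeric values follow the precise_ip levels 0 to 3 of the perf_event_open interface; that the four precision values map to them in this order is an assumption based on their names.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

/// Hypothetical mirror of the four precision levels, ordered from the
/// least strict (0) to the most strict (3) skid constraint.
enum class SkidLevel : std::uint8_t {
    AllowArbitrarySkid = 0,
    MustHaveConstantSkid = 1,
    RequestZeroSkid = 2,
    MustHaveZeroSkid = 3
};

/// Tries the requested level first and lowers it step by step while
/// `try_open` reports failure. The level is never raised.
/// Returns the level that worked, or std::nullopt if none did.
std::optional<SkidLevel>
open_with_fallback(SkidLevel requested, const std::function<bool(SkidLevel)>& try_open)
{
    for (auto level = static_cast<int>(requested); level >= 0; --level) {
        const auto candidate = static_cast<SkidLevel>(level);
        if (try_open(candidate)) {
            return candidate;
        }
    }
    return std::nullopt;
}
```

In this sketch, try_open stands for whatever attempts to activate the trigger at the given level; a machine that supports at most constant skid would grant MustHaveConstantSkid even if MustHaveZeroSkid was requested.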

Period / Frequency

You can request a specific period or frequency for each trigger (basically, how often the hardware should write samples), for example:

/// Every 4000th cycle.
sampler.trigger("cycles", perf::Period{4000U});

or

/// With a frequency of 1000 samples per second, i.e., one sample per millisecond
/// (the kernel will adjust the period to match the requested frequency).
sampler.trigger("cycles", perf::Frequency{1000U});

You can also combine the configurations, for example, by

/// Every 4000th cycle with zero skid.
sampler.trigger("cycles", perf::Precision::RequestZeroSkid, perf::Period{4000U});

If you do not set a period or frequency through the .trigger() interface, you can control the default period or frequency through the sample config:

auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);
/// xor:
sample_config.frequency(1000U);

auto sampler = perf::Sampler{ counter_definitions, sample_config };
sampler.trigger("cycles");
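
To get a feeling for how many samples a configuration produces, the arithmetic behind period and frequency can be sketched as follows. Both helper functions are hypothetical and independent of perf-cpp; the numbers are idealized, since the kernel may adjust the period and samples can be lost.

```cpp
#include <cstdint>

/// Expected number of samples when the trigger event occurs `event_count`
/// times and one sample is written every `period` events.
/// `period` must be non-zero.
constexpr std::uint64_t expected_samples(std::uint64_t event_count, std::uint64_t period)
{
    return event_count / period;
}

/// Average time between two samples in nanoseconds for a given frequency
/// in samples per second. `frequency` must be non-zero.
constexpr double sample_interval_ns(std::uint64_t frequency)
{
    return 1e9 / static_cast<double>(frequency);
}
```

For example, a code block that runs for 12,000,000 cycles yields about 3000 samples with a period of 4000, and a frequency of 1000 corresponds to one sample every 1,000,000 ns.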

What can be Recorded and how to Access the Data?

Prior to activation, the sampler must be configured to specify the data to be recorded. For instance:

sampler.values()
    .time(true)
    .instruction_pointer(true);

This specific configuration captures both the timestamp and instruction pointer within the sample record. Upon completing the sampling and retrieving the sampling results, the recorded fields can be accessed as follows:

for (const auto& sample_record : sampler.result()) {
    const auto timestamp = sample_record.time().value();
    const auto instruction_pointer = sample_record.instruction_pointer().value();
}

Note: Most fields of the sample_record are optional – recording these fields must be activated via sampler.values().
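
Calling .value() on an empty std::optional throws std::bad_optional_access, so it is worth guarding the access whenever a field might not have been requested. The sketch below shows the pattern on a hypothetical stand-in type (SampleFields) rather than on perf::Sample itself; the same has_value() checks apply to the accessors of a real sample record.

```cpp
#include <cstdint>
#include <ios>
#include <optional>
#include <sstream>
#include <string>

/// Hypothetical stand-in for a sample record with two optional fields.
struct SampleFields {
    std::optional<std::uint64_t> time;
    std::optional<std::uint64_t> instruction_pointer;
};

/// Formats a sample and prints "n/a" for every field that was not recorded.
std::string describe(const SampleFields& sample)
{
    std::ostringstream out;
    out << "Time = ";
    if (sample.time.has_value()) {
        out << *sample.time;
    } else {
        out << "n/a";
    }
    out << " | IP = ";
    if (sample.instruction_pointer.has_value()) {
        out << "0x" << std::hex << *sample.instruction_pointer;
    } else {
        out << "n/a";
    }
    return out.str();
}
```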

Time

The timestamp of capturing the sample.

  • Request by sampler.values().time(true);
  • Read from the results by sample_record.time().value()

See code example

Stream ID

Unique ID of an opened event.

  • Request by sampler.values().stream_id(true);
  • Read from the results by sample_record.stream_id().value()

Period

Period at the time of capturing the sample (the Kernel might adjust the period).

  • Request by sampler.values().period(true);
  • Read from the results by sample_record.period().value();

Identifier

  • Request by sampler.values().identifier(true);
  • Read from the results by sample_record.id().value();

Instruction Pointer

Address of the executed instruction at the moment the hardware captured the sample. Consider examining the precision configuration, as the sampled instruction pointer may not always be precise.

  • Request by sampler.values().instruction_pointer(true);
  • Read from the results by sample_record.instruction_pointer().value()

Additionally, you can determine if the captured instruction pointer was exact by querying sample_record.is_exact_ip(), which returns a bool.

See code example

Callchain

Callchain as a list of instruction pointers.

  • Request by sampler.values().callchain(true); or sampler.values().callchain(M); where M is a std::uint16_t defining the maximum call stack size.
  • Read from the results by sample_record.callchain().value();, which returns an std::vector<std::uintptr_t> of instruction pointers.

User Stack

The user-level stack as a list of chars.

  • Request by sampler.values().user_stack(M); where M is a std::uint32_t defining the maximum size of the user stack.
  • Read from the results by sample_record.user_stack().value();, which returns an std::vector<char> of data.

Registers in user-level

Values of registers within the user-level.

  • Request by sampler.values().user_registers( { perf::Registers::x86::IP, perf::Registers::x86::DI, perf::Registers::x86::R10 });
  • Read from the results by sample_record.user_registers().value()[0]; (0 for the first register, perf::Registers::x86::IP in this example)
  • The ABI can be queried using sample_record.user_registers_abi().

See code example

Registers in kernel-level

Values of registers within the kernel-level.

  • Request by sampler.values().kernel_registers( { perf::Registers::x86::IP, perf::Registers::x86::DI, perf::Registers::x86::R10 });
  • Read from the results by sample_record.kernel_registers().value()[0]; (0 for the first register, perf::Registers::x86::IP in this example)
  • The ABI can be queried using sample_record.kernel_registers_abi().

See example

ID of the recording Thread

  • Request by sampler.values().thread_id(true);
  • Read from the results by sample_record.thread_id().value()

ID of the recording CPU

  • Request by sampler.values().cpu_id(true);
  • Read from the results by sample_record.cpu_id().value();

See code example

Size of the Code Page

Size of the memory page that contains the sampled instruction pointer. Sampling the code page size requires a Linux Kernel version of 5.11 or higher.

  • Request by sampler.values().code_page_size(true);
  • Read from the results by sample_record.code_page_size().value();

Performance Counter Values

Record hardware performance counter values at the time of the record.

  • Request by sampler.values().counter({"instructions", "cache-misses"});. You can sample any hardware event or metric.
  • Read from the results by sample_record.counter_result().value().get("cache-misses");. This can be accessed in the same manner as when recording counters.

See code example

Branch Stack (LBR)

Branch history with jump addresses and flag if predicted correctly.

Request recording

sampler.values().branch_stack({ perf::BranchType::User, perf::BranchType::Conditional });

The possible branch type values are:

  • perf::BranchType::User: Sample branches in user mode.
  • perf::BranchType::Kernel: Sample kernel branches.
  • perf::BranchType::HyperVisor: Sample branches in HV mode.
  • perf::BranchType::Any (the default): Sample all branches.
  • perf::BranchType::Call: Sample any call (direct, indirect, far jumps).
  • perf::BranchType::DirectCall: Sample direct call (requires Linux Kernel 4.4 or higher).
  • perf::BranchType::IndirectCall: Sample indirect call.
  • perf::BranchType::Return: Sample return branches.
  • perf::BranchType::IndirectJump: Sample indirect jumps (requires Linux Kernel 4.2 or higher).
  • perf::BranchType::Conditional: Sample conditional branches.
  • perf::BranchType::TransactionalMemoryAbort: Sample branches that abort transactional memory.
  • perf::BranchType::InTransaction: Sample branches in transactions of transactional memory.
  • perf::BranchType::NotInTransaction: Sample branches not in transactions of transactional memory.

Read from the results

Read from the results by sample_record.branches(), which returns a vector of perf::Branch.

Each perf::Branch instance has the following values:

  • Instruction pointer from where the branch started (branch.instruction_pointer_from()).
  • Instruction pointer where the branch ended, i.e., the branch target (branch.instruction_pointer_to()).
  • A flag that indicates if the branch was correctly predicted (branch.is_predicted()) (0 if not supported by the hardware).
  • A flag that indicates if the branch was mispredicted (branch.is_mispredicted()) (0 if not supported by the hardware).
  • A flag that indicates if the branch was within a transaction (branch.is_in_transaction()).
  • A flag that indicates if the branch was a transaction abort (branch.is_transaction_abort()).
  • Cycles since the last branch (branch.cycles()) (0 if not supported by the hardware – since Linux 4.3).
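
The two prediction flags can be combined into a misprediction rate. The following sketch uses a hypothetical stand-in type (BranchFlags) that carries only these two flags; with real results, they would be filled from branch.is_predicted() and branch.is_mispredicted(). Branches for which the hardware reported neither flag are ignored.

```cpp
#include <cstddef>
#include <vector>

/// Hypothetical stand-in carrying the two prediction flags of a branch.
struct BranchFlags {
    bool is_predicted;
    bool is_mispredicted;
};

/// Share of mispredicted branches among those for which the hardware
/// reported a prediction outcome; returns 0.0 if there is no such branch.
double misprediction_rate(const std::vector<BranchFlags>& branches)
{
    std::size_t reported = 0U;
    std::size_t mispredicted = 0U;
    for (const auto& branch : branches) {
        if (branch.is_predicted || branch.is_mispredicted) {
            ++reported;
            if (branch.is_mispredicted) {
                ++mispredicted;
            }
        }
    }
    return reported == 0U ? 0.0 : static_cast<double>(mispredicted) / static_cast<double>(reported);
}
```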

See code example

Physical Memory Address

Sampling the physical memory address requires a Linux Kernel version of 4.13 or higher.

  • Request by sampler.values().physical_memory_address(true);
  • Read from the results by sample_record.physical_memory_address().value();

Memory sampling can be tricky; refer to the specifics of the underlying hardware since memory sampling requires distinct triggers across different hardware platforms.

Logical Memory Address

  • Request by sampler.values().logical_memory_address(true);
  • Read from the results by sample_record.logical_memory_address().value()

Memory sampling can be tricky; refer to the specifics of the underlying hardware since memory sampling requires distinct triggers across different hardware platforms.

See code example

Memory Access Latency

The weight (and weight struct) indicates how costly the event was (basically the latency). Since Linux Kernel version 5.12, the Kernel might generate more information than only a single value, which is used to differentiate between memory- (from cache towards memory) and instruction latency.

  • Request by sampler.values().weight(true); or sampler.values().weight_struct(true); (the latter only from Kernel 5.12)
  • Read from the results by sample_record.weight().value();, which returns a perf::Weight class, which has the following attributes:
    • sample_record.weight().value().cache_latency() returns the cache latency of the sampled data address (for both sampler.values().weight(true) and sampler.values().weight_struct(true)).
    • sample_record.weight().value().instruction_retirement_latency() returns the latency of retiring the instruction (including the cache access) but only for sampler.values().weight_struct(true). To the best of our knowledge, this feature is only supported by new Intel generations.
    • sample_record.weight().value().var3() returns "other information" (not specified by perf) but only for sampler.values().weight_struct(true).

See code example

Note that memory sampling depends on the underlying sampling mechanism. → See hardware-specific information (e.g., Intel PEBS vs AMD IBS)

Data Source of a Memory Load

Data source where the data was sampled (e.g., local mem, remote mem, L1d, L2, ...).

  • Request by sampler.values().data_src(true);
  • Read from the results by sample_record.data_src().value(); which returns a specific perf::DataSource object

The perf::DataSource object can be queried for the following information:

| Query | Information |
|---|---|
| sample_record.data_src().value().is_load() | True, if the access was a load operation. |
| sample_record.data_src().value().is_store() | True, if the access was a store operation. |
| sample_record.data_src().value().is_prefetch() | True, if the access was a prefetch operation. |
| sample_record.data_src().value().is_exec() | True, if the access was an execute operation. |
| sample_record.data_src().value().is_mem_hit() | True, if the access was a hit. |
| sample_record.data_src().value().is_mem_miss() | True, if the access was a miss. |
| sample_record.data_src().value().is_mem_l1() | True, if the data was found in the L1 cache. |
| sample_record.data_src().value().is_mem_l2() | True, if the data was found in the L2 cache. |
| sample_record.data_src().value().is_mem_l3() | True, if the data was found in the L3 cache. |
| sample_record.data_src().value().is_mem_l4() | True, if the data was found in the L4 cache (since Linux 4.14). |
| sample_record.data_src().value().is_mem_lfb() | True, if the data was found in the Line Fill Buffer (or Miss Address Buffer on AMD). |
| sample_record.data_src().value().is_mem_l2_mhb() | True, if the data was found in the L2 Miss Handling Buffer (since Linux 6.11). |
| sample_record.data_src().value().is_mem_local() | True, if the data was found in the local memory subsystem (since Linux 4.14). |
| sample_record.data_src().value().is_mem_remote() | True, if the data was found in a remote memory subsystem (since Linux 4.14). |
| sample_record.data_src().value().is_mem_ram() | True, if the data was found in any RAM. |
| sample_record.data_src().value().is_mem_local_ram() | True, if the data was found in the local RAM. |
| sample_record.data_src().value().is_mem_remote_ram() | True, if the data was found in any remote RAM. |
| sample_record.data_src().value().is_mem_hops0() | True, if the data was found locally (since Linux 5.16). |
| sample_record.data_src().value().is_mem_hops1() | True, if the data was found on the same node (since Linux 5.17). |
| sample_record.data_src().value().is_mem_hops2() | True, if the data was found on a remote socket (since Linux 5.17). |
| sample_record.data_src().value().is_mem_hops3() | True, if the data was found on a remote board (since Linux 5.17). |
| sample_record.data_src().value().is_mem_remote_ram1() | True, if the data was found in a remote RAM on the same node. |
| sample_record.data_src().value().is_mem_remote_ram2() | True, if the data was found in a remote RAM on a different socket. |
| sample_record.data_src().value().is_mem_remote_ram3() | True, if the data was found in a remote RAM on a different board (since Linux 5.17). |
| sample_record.data_src().value().is_mem_remote_cce1() | True, if the data was found in a cache with one hop distance. |
| sample_record.data_src().value().is_mem_remote_cce2() | True, if the data was found in a cache with two hops distance. |
| sample_record.data_src().value().is_pmem() | True, if the data was found on a PMEM device (since Linux 4.14). |
| sample_record.data_src().value().is_cxl() | True, if the data was transferred via Compute Express Link (since Linux 6.1). |
| sample_record.data_src().value().is_io() | True, if the memory address is I/O. |
| sample_record.data_src().value().is_uncached() | True, if the memory address is related to uncached memory. |
| sample_record.data_src().value().is_tlb_hit() | True, if the access was a TLB hit. |
| sample_record.data_src().value().is_tlb_miss() | True, if the access was a TLB miss. |
| sample_record.data_src().value().is_tlb_l1() | True, if the access can be associated with the dTLB. |
| sample_record.data_src().value().is_tlb_l2() | True, if the access can be associated with the STLB. |
| sample_record.data_src().value().is_tlb_walk() | True, if the access can be associated with the hardware walker. |
| sample_record.data_src().value().is_locked() | True, if the address was accessed via a lock instruction. |
| sample_record.data_src().value().is_data_blocked() | True, if the data could not be forwarded. |
| sample_record.data_src().value().is_address_blocked() | True, if there was an address conflict (since Linux 5.12). |
| sample_record.data_src().value().is_snoop_hit() | True, if the access was a snoop hit. |
| sample_record.data_src().value().is_snoop_miss() | True, if the access was a snoop miss. |
| sample_record.data_src().value().is_snoop_hit_modified() | True, if the access was a snoop hit modified. |

All these queries wrap around the perf_mem_data_src data structure. Since we may have missed specific operations, you can also access each particular data structure:

  • sample_record.data_src().value().op() accesses the PERF_MEM_OP structure.
  • sample_record.data_src().value().lvl() accesses the PERF_MEM_LVL structure. Note that this structure is deprecated, use sample_record.data_src().value().lvl_num() instead.
  • sample_record.data_src().value().lvl_num() accesses the PERF_MEM_LVL_NUM structure (since Linux 4.14).
  • sample_record.data_src().value().remote() accesses the PERF_MEM_REMOTE structure (since Linux 4.14).
  • sample_record.data_src().value().snoop() accesses the PERF_MEM_SNOOP structure.
  • sample_record.data_src().value().snoopx() accesses the PERF_MEM_SNOOPX structure (since Linux 4.14).
  • sample_record.data_src().value().lock() accesses the PERF_MEM_LOCK structure.
  • sample_record.data_src().value().blk() accesses the PERF_MEM_BLK structure (since Linux 5.12).
  • sample_record.data_src().value().tlb() accesses the PERF_MEM_TLB structure.
  • sample_record.data_src().value().hops() accesses the PERF_MEM_HOPS structure (since Linux 5.16).
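
To illustrate what such a wrapper does, the sketch below decodes two fields from a raw 64-bit perf_mem_data_src value. It is not perf-cpp's implementation. The shift and flag values are copied from our reading of <linux/perf_event.h> (PERF_MEM_OP_* and PERF_MEM_LVLNUM_*) and should be verified against the header of your kernel; the function names are hypothetical.

```cpp
#include <cstdint>

// Values assumed from <linux/perf_event.h>; verify against your kernel.
constexpr std::uint64_t mem_op_shift = 0U;       // PERF_MEM_OP_SHIFT
constexpr std::uint64_t mem_op_load = 0x02U;     // PERF_MEM_OP_LOAD
constexpr std::uint64_t mem_op_store = 0x04U;    // PERF_MEM_OP_STORE
constexpr std::uint64_t mem_lvl_num_shift = 33U; // PERF_MEM_LVLNUM_SHIFT
constexpr std::uint64_t mem_lvl_num_mask = 0x0fU; // mem_lvl_num is 4 bits wide
constexpr std::uint64_t mem_lvl_num_l1 = 0x01U;  // PERF_MEM_LVLNUM_L1

/// True if the load bit is set in the operation field (a bit mask).
constexpr bool is_load(std::uint64_t raw)
{
    return ((raw >> mem_op_shift) & mem_op_load) != 0U;
}

/// True if the store bit is set in the operation field (a bit mask).
constexpr bool is_store(std::uint64_t raw)
{
    return ((raw >> mem_op_shift) & mem_op_store) != 0U;
}

/// Extracts the numeric memory level (a number, not a bit mask).
constexpr std::uint64_t level_number(std::uint64_t raw)
{
    return (raw >> mem_lvl_num_shift) & mem_lvl_num_mask;
}

/// True if the level number names the L1 cache.
constexpr bool is_level_l1(std::uint64_t raw)
{
    return level_number(raw) == mem_lvl_num_l1;
}
```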

See code example

Note that memory sampling depends on the underlying sampling mechanism. → See hardware-specific information (e.g., Intel PEBS vs AMD IBS)

Size of the Data Page

Size of the memory page that contains the sampled data address (e.g., when sampling the logical memory address). Sampling the data page size requires a Linux Kernel version of 5.11 or higher.

  • Request by sampler.values().data_page_size(true);
  • Read from the results by sample_record.data_page_size().value();

Transaction Abort

Reports the reason for an abort of a transactional memory transaction.

  • Request by sampler.values().transaction_abort(true);
  • Read from the results by sample_record.transaction_abort().value();, which returns an instance of the type perf::TransactionAbort. The abort can be queried by:
    • sample_record.transaction_abort().value().is_elision(), which returns true, if the abort comes from an elision type transaction (Intel-specific).
    • sample_record.transaction_abort().value().is_transaction(), which returns true, if the abort comes from a generic transaction.
    • sample_record.transaction_abort().value().is_synchronous(), which returns true, if the abort is synchronous.
    • sample_record.transaction_abort().value().is_asynchronous(), which returns true, if the abort is asynchronous.
    • sample_record.transaction_abort().value().is_retry(), which returns true, if the abort is retryable.
    • sample_record.transaction_abort().value().is_conflict(), which returns true, if the abort is due to a conflict.
    • sample_record.transaction_abort().value().is_capacity_write(), which returns true, if the abort is due to write capacity.
    • sample_record.transaction_abort().value().is_capacity_read(), which returns true, if the abort is due to read capacity.
    • In addition, sample_record.transaction_abort().value().code() returns the abort-code for a transaction as specified by the user.

Raw Values

Hardware will record different data, depending on the CPU generation and manufacturer. While the perf subsystem will parse the data and map it to a generalized interface, the raw data can also be accessed using raw sampling.

  • Request by sampler.values().raw(true);
  • Read the results by sample_record.raw().value();

See code example for AMD IBS (more information about how to access the raw data is provided by the AMD manual from page 428)

Context Switches

Occurrence of context switches. Sampling context switches requires a Linux Kernel version of 4.3 or higher.

  • Request by sampler.values().context_switch(true);
  • Read from the results by sample_record.context_switch().value(); (if sample_record.context_switch().has_value();), which returns a perf::ContextSwitch object. The context switch contains
    • a flag if the process was switched in or out (context_switch.is_in() or context_switch.is_out()),
    • a flag if the process was preempted (context_switch.is_preempt()) (only from Linux Kernel 4.17),
    • the id of the in or out process, if sampling cpu-wide (context_switch.process_id()),
    • and the id of the in or out thread, if sampling cpu-wide (context_switch.thread_id()).
    • In addition, the following data will be set in a sample:
      • sample_record.process_id() and sample_record.thread_id(), if sampler.values().thread_id(true) was specified,
      • sample_record.time(), if sampler.values().time(true) was specified,
      • sample_record.stream_id(), if sampler.values().stream_id(true) was specified,
      • sample_record.cpu_id(), if sampler.values().cpu_id(true) was specified, and
      • sample_record.id(), if sampler.values().identifier(true) was specified.
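
Together with timestamps, context-switch records allow estimating how long the sampled code was not running. The sketch below sums the time between each switch-out and the following switch-in. SwitchEvent is a hypothetical, reduced stand-in for a sample carrying a context switch and a timestamp; the input is assumed to be ordered by time.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

/// Hypothetical, reduced view of a context-switch record.
struct SwitchEvent {
    std::uint64_t timestamp; // time of the record, in the sampler's time unit
    bool is_out;             // true: switched out, false: switched in
};

/// Sums the time between each switch-out and the following switch-in.
/// Expects `events` ordered by ascending timestamp. A trailing switch-out
/// without a matching switch-in is not counted.
std::uint64_t off_cpu_time(const std::vector<SwitchEvent>& events)
{
    std::uint64_t total = 0U;
    std::optional<std::uint64_t> switched_out_at;
    for (const auto& event : events) {
        if (event.is_out) {
            switched_out_at = event.timestamp;
        } else if (switched_out_at.has_value()) {
            total += event.timestamp - *switched_out_at;
            switched_out_at.reset();
        }
    }
    return total;
}
```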

See code example

CGroup

Sampling cgroups requires a Linux Kernel version of 5.7 or higher.

  • Request by sampler.values().cgroup(true);
  • CGroup IDs are included into samples and can be read by sample_record.cgroup_id().value();
  • Whenever new cgroups are created or activated, the sample can include a perf::CGroup item, containing the ID of the created/activated cgroup (sample_record.cgroup().value().id();), which matches one of the cgroup_id()s of the sample. perf::CGroup also contains a path, which can be accessed by sample_record.cgroup().value().path();.
  • In addition, the following data will be set in a sample:
    • sample_record.process_id() and sample_record.thread_id(), if sampler.values().thread_id(true) was specified,
    • sample_record.time(), if sampler.values().time(true) was specified,
    • sample_record.stream_id(), if sampler.values().stream_id(true) was specified,
    • sample_record.cpu_id(), if sampler.values().cpu_id(true) was specified, and
    • sample_record.id(), if sampler.values().identifier(true) was specified.

Throttle and Unthrottle Events

  • Request by sampler.values().throttle(true);
  • Throttle events are included into samples and can be read by sample_record.throttle().value();, which returns an optional perf::Throttle object. The throttle object contains a flag indicating
    • that it was a throttle event (sample_record.throttle().value().is_throttle();)
    • or it was an unthrottle event (sample_record.throttle().value().is_unthrottle();). Only one of both will return true.
  • In addition, the following data will be set in a sample:
    • sample_record.process_id() and sample_record.thread_id(), if sampler.values().thread_id(true) was specified,
    • sample_record.time(), if sampler.values().time(true) was specified,
    • sample_record.stream_id(), if sampler.values().stream_id(true) was specified,
    • sample_record.cpu_id(), if sampler.values().cpu_id(true) was specified, and
    • sample_record.id(), if sampler.values().identifier(true) was specified.

Sample mode

Each sample is recorded in one of the following modes:

  • perf::Sample::Mode::Unknown
  • perf::Sample::Mode::Kernel
  • perf::Sample::Mode::User
  • perf::Sample::Mode::Hypervisor
  • perf::Sample::Mode::GuestKernel
  • perf::Sample::Mode::GuestUser

You can check the mode via sample_record.mode(), for example:

for (const auto& sample_record : result)
{
  if (sample_record.mode() == perf::Sample::Mode::Kernel)
  {
    std::cout << "Sample in Kernel" << std::endl;      
  }
  else if (sample_record.mode() == perf::Sample::Mode::User)
  {
    std::cout << "Sample in User" << std::endl;      
  }
}

Lost Samples

Sample records may be lost, for example, if the buffer is full or the CPU is heavily loaded. Such losses are documented and reported through sample_record.count_loss(), which returns std::nullopt for regular samples and an integer for the number of samples lost, as reported by the perf subsystem.

In addition, the following data will be set in a sample:

  • sample_record.process_id() and sample_record.thread_id(), if sampler.values().thread_id(true) was specified,
  • sample_record.time(), if sampler.values().time(true) was specified,
  • sample_record.stream_id(), if sampler.values().stream_id(true) was specified,
  • sample_record.cpu_id(), if sampler.values().cpu_id(true) was specified, and
  • sample_record.id(), if sampler.values().identifier(true) was specified.
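
Since regular samples report std::nullopt and loss records report a number, the total loss can be accumulated with value_or. The sketch below works on a plain vector of optionals, for example filled with sample_record.count_loss() of every record; the function name total_lost is hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

/// Adds up all reported loss counts; entries without a value (regular
/// samples) contribute nothing.
std::uint64_t total_lost(const std::vector<std::optional<std::uint64_t>>& loss_counts)
{
    std::uint64_t total = 0U;
    for (const auto& count : loss_counts) {
        total += count.value_or(0U);
    }
    return total;
}
```

Comparing the total against the number of regular samples gives a rough idea of whether the period is too small or the buffer too short for the workload.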

Specific Notes for different CPU Vendors

Intel (PEBS)

Especially for sampling memory addresses, latency, and data source, the perf subsystem needs specific events as triggers. On Intel, the perf list command reports these triggers as "Supports address when precise".

perf-cpp will discover mem-loads and mem-stores events when running on Intel hardware that supports sampling for memory.

Additionally, memory sampling typically requires a precision setting of at least perf::Precision::RequestZeroSkid.

Before Sapphire Rapids

From our experience, Intel's Cascade Lake architecture (and earlier architectures) only reports latency and source for memory loads, not stores; this changes starting with Sapphire Rapids.

You can add load and store events like this:

sampler.trigger("mem-loads", perf::Precision::MustHaveZeroSkid); /// Only load events

See code example

or

sampler.trigger("mem-stores", perf::Precision::MustHaveZeroSkid); /// Only store events

or

/// Load and store events
sampler.trigger(std::vector<std::vector<perf::Sampler::Trigger>>{
    {
      perf::Sampler::Trigger{ "mem-loads", perf::Precision::RequestZeroSkid } /// Loads
    },
    { perf::Sampler::Trigger{ "mem-stores", perf::Precision::MustHaveZeroSkid } } /// Stores
  });

See code example

Sapphire Rapids and Beyond

To use memory latency sampling on Intel's Sapphire Rapids architecture, the perf subsystem needs an auxiliary counter to be added to the group, before the first "real" counter is added (see this commit).

perf-cpp will define this counter and add it as a trigger automatically (from version 0.10.0), when it can detect that the hardware needs it. In this case, you can proceed as before Sapphire Rapids.

However, if the detection fails but the system needs it, you can add it yourself:

sampler.trigger({
    { 
        perf::Sampler::Trigger{"mem-loads-aux", perf::Precision::MustHaveZeroSkid}, /// Helper
        perf::Sampler::Trigger{"mem-loads", perf::Precision::RequestZeroSkid}           /// First "real" counter
    },
    { perf::Sampler::Trigger{"mem-stores", perf::Precision::MustHaveZeroSkid} }         /// Other "real" counters.
  });

You can check if the auxiliary counter is needed by checking if the following file exists in the system:

/sys/bus/event_source/devices/cpu/events/mem-loads-aux
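
This check can also be done from within the program. The sketch below uses std::filesystem (C++17); the function name has_mem_loads_aux is hypothetical, and the default path is the one given above.

```cpp
#include <filesystem>
#include <system_error>

/// Returns true if the given event file exists. Errors such as missing
/// permissions are treated as "does not exist" instead of throwing.
bool has_mem_loads_aux(
    const std::filesystem::path& event_file = "/sys/bus/event_source/devices/cpu/events/mem-loads-aux")
{
    std::error_code error;
    return std::filesystem::exists(event_file, error);
}
```

If the function returns true and perf-cpp did not add the auxiliary counter on its own, the manual trigger configuration shown above is the one to try.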

AMD (Instruction Based Sampling)

AMD uses Instruction Based Sampling to tag instructions randomly for sampling and collect various information for each sample (see the programmer reference). In contrast to Intel's mechanism, IBS cannot tag specific load and store instructions (and apply a filter on the latency). In case the instruction was a load/store instruction, the sample will include data source, latency, and a memory address (see kernel mailing list).

perf-cpp (or, to be precise, the perf::CounterDefinition class) detects IBS support on AMD devices and adds the following counters, which can be used as triggers for sampling on AMD:

  • ibs_op selects instructions during the execution pipeline. The period/frequency refers to CPU cycles: once the specified number of cycles has elapsed, an instruction is tagged.
  • ibs_op_uops selects instructions during the execution pipeline, but the period/frequency refers to the number of executed micro-operations, not CPU cycles.
  • ibs_op_l3missonly selects instructions during the execution pipeline that miss the L3 cache. CPU cycles are used as the trigger.
  • ibs_op_uops_l3missonly selects instructions during the execution pipeline that miss the L3 cache, using micro-operations as the trigger.
  • ibs_fetch selects instructions in the fetch-state (frontend) using cycles as the trigger.
  • ibs_fetch_l3missonly selects instructions in the fetch-state (frontend) that miss the L3 cache, again, using cycles as a trigger.

Troubleshooting Counter Configurations

Debugging and configuring hardware counters can sometimes be complex, as settings (e.g., the precision – precise_ip) may need to be adjusted for different machines. Utilize perf-cpp's debugging features to gain insights into the internal workings of performance counters and troubleshoot any configuration issues:

auto config = perf::SampleConfig{};
config.is_debug(true);

auto sampler = perf::Sampler{ counter_definitions, config };

The idea is borrowed from Linux Perf, which can be asked to print counter configurations as follows:

perf --debug perf-event-open stat -- sleep 1

This command helps visualize configurations for various counters, which is also beneficial for retrieving event codes (for more details, see the counters documentation).