In parallel computing environments, it is crucial to understand how code performs when it runs across multiple threads or CPU cores. perf-cpp supports this with samplers that record performance metrics either per thread or per CPU core. This guide covers how to set up and use both.
perf-cpp provides the MultiThreadSampler class to manage samplers for different threads, enabling precise performance measurements across thread-specific tasks.
→ See code example multi_thread_sampling.cpp
Initialize a sampler for each thread to monitor specific events:
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>
#include <perfcpp/sampler.h>
auto counter_definitions = perf::CounterDefinition{};
auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);
auto sampler = perf::MultiThreadSampler{
counter_definitions,
/* number of threads */ 4U,
sample_config
};
sampler.trigger("cycles");
sampler.values().time(true).thread_id(true);
Each thread starts and stops its own sampler, identified by its thread index:
auto threads = std::vector<std::thread>{};
for (auto thread_index = 0U; thread_index < 4U; ++thread_index) {
threads.emplace_back([thread_index, &sampler /*, ... more stuff .. */]() {
try {
sampler.start(thread_index);
} catch (const std::runtime_error& e) {
std::cerr << e.what() << std::endl;
}
/// ... do some work that is sampled...
sampler.stop(thread_index);
});
}
for (auto& thread : threads) {
thread.join();
}
After the threads complete execution, collate and analyze the data:
auto result = sampler.result(/* sort samples by time*/ true);
/// Print the samples
for (const auto& sample_record : result)
{
if (sample_record.time().has_value() && sample_record.thread_id().has_value())
{
std::cout
<< "Time = " << sample_record.time().value()
<< " | Thread ID = " << sample_record.thread_id().value() << std::endl;
}
}
The output may be something like this:
Time = 173058802647651 | Thread ID = 62803
Time = 173058803163735 | Thread ID = 62802
Time = 173058803625986 | Thread ID = 62804
Time = 173058804277715 | Thread ID = 62802
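To get a quick overview of how samples are distributed across threads, the records can be grouped by thread ID. The following is a minimal sketch under stated assumptions: `ThreadSample` is a hypothetical plain struct that stands in for the values read from a sample record (it is not a perf-cpp type), and `count_samples_per_thread` is an illustrative helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

/// Hypothetical stand-in for the values copied out of a sample record.
/// This is NOT a perf-cpp type; it only mirrors the two fields printed above.
struct ThreadSample {
  std::optional<std::uint64_t> time;
  std::optional<std::uint32_t> thread_id;
};

/// Counts how many samples were recorded per thread ID.
/// Samples that carry no thread ID are skipped.
std::map<std::uint32_t, std::size_t>
count_samples_per_thread(const std::vector<ThreadSample>& samples)
{
  auto counts = std::map<std::uint32_t, std::size_t>{};
  for (const auto& sample : samples) {
    if (sample.thread_id.has_value()) {
      ++counts[sample.thread_id.value()];
    }
  }
  return counts;
}
```

Applied to the four samples of the example output, this would yield two samples for thread 62802 and one each for threads 62803 and 62804.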
Closing the sampler will free and un-map all resources like buffers and hardware counters.
sampler.close();
For applications sensitive to the specific cores they run on, perf-cpp offers MultiCoreSampler.
→ See code example multi_cpu_sampling.cpp
Note: This records data of all processes running on the specified cores and needs specific permissions (i.e., a value of less than 1 in /proc/sys/kernel/perf_event_paranoid).
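The permission requirement can be checked up front, before any sampler is created. The sketch below reads the file named in the note and tests the "less than 1" condition. The function names are illustrative and not part of perf-cpp. Privileged processes may be allowed to sample regardless of this value, so treat the check as a hint rather than a guarantee.

```cpp
#include <fstream>
#include <istream>
#include <optional>

/// Parses the paranoid level from a stream.
/// Returns std::nullopt if no integer can be read (e.g., the file could not be opened).
std::optional<int> parse_paranoid_level(std::istream& input)
{
  auto level = int{0};
  if (input >> level) {
    return level;
  }
  return std::nullopt;
}

/// Reads the current level from /proc/sys/kernel/perf_event_paranoid.
std::optional<int> read_paranoid_level()
{
  auto file = std::ifstream{"/proc/sys/kernel/perf_event_paranoid"};
  return parse_paranoid_level(file);
}

/// True if the level satisfies the "less than 1" requirement from the note above.
/// An unknown level is treated as not permitted.
bool permits_core_sampling(const std::optional<int> level)
{
  return level.has_value() && level.value() < 1;
}
```

A caller could evaluate `permits_core_sampling(read_paranoid_level())` and print a helpful message instead of waiting for the sampler to fail on start.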
The MultiCoreSampler needs to know which CPU cores will be sampled (see cpus_to_watch in the code example below).
#include <cstdint>
#include <utility>
#include <vector>
#include <perfcpp/sampler.h>
/// Create a list of CPUs to monitor.
auto cpus_to_watch = std::vector<std::uint16_t>{0U, 1U, 2U, 3U};
auto counter_definitions = perf::CounterDefinition{};
auto sample_config = perf::SampleConfig{};
sample_config.period(4000U);
auto sampler = perf::MultiCoreSampler{
counter_definitions,
std::move(cpus_to_watch), /// List of CPUs to sample
sample_config
};
sampler.trigger("cycles");
sampler.values().time(true).cpu_id(true).thread_id(true);
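The cpus_to_watch list above is hard-coded. As a possible alternative, the list could be generated, for example to cover every logical CPU. The sketch below uses only the standard library. The helper names are illustrative, and it assumes that CPU IDs are numbered contiguously from 0, which may not hold on systems with offline CPUs or restricted CPU sets.

```cpp
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

/// Builds the list {0, 1, ..., count - 1} of CPU IDs.
std::vector<std::uint16_t> make_cpu_list(const std::uint16_t count)
{
  auto cpus = std::vector<std::uint16_t>(count);
  std::iota(cpus.begin(), cpus.end(), std::uint16_t{0U});
  return cpus;
}

/// Lists all logical CPUs reported by the standard library.
/// hardware_concurrency() may return 0 if the value cannot be determined;
/// the resulting list is then empty.
std::vector<std::uint16_t> all_cpus()
{
  return make_cpu_list(static_cast<std::uint16_t>(std::thread::hardware_concurrency()));
}
```

With these helpers, `make_cpu_list(4U)` produces the same list as the hard-coded example.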
Optionally, open the sampler before starting to separate configuration overhead from measurement:
try {
sampler.open();
} catch (const std::runtime_error& e) {
std::cerr << e.what() << std::endl;
}
The sampler only needs to be started once, regardless of which threads run on the monitored cores.
auto threads = std::vector<std::thread>{};
for (auto thread_index = 0U; thread_index < count_threads; ++thread_index) {
threads.emplace_back([thread_index /*, ... more stuff .. */]() {
/// ... do some work that is sampled...
});
}
/// Start sampling.
try {
sampler.start();
} catch (const std::runtime_error& e) {
std::cerr << e.what() << std::endl;
}
/// Wait for all threads to finish.
for (auto& thread : threads) {
thread.join();
}
/// Stop sampling after all threads have finished.
sampler.stop();
Access and print the collected data:
auto result = sampler.result(/* sort samples by time*/ true);
/// Print the samples
for (const auto& sample_record : result)
{
if (sample_record.time().has_value() && sample_record.cpu_id().has_value() && sample_record.thread_id().has_value())
{
std::cout
<< "Time = " << sample_record.time().value()
<< " | CPU ID = " << sample_record.cpu_id().value()
<< " | Thread ID = " << sample_record.thread_id().value() << std::endl;
}
}
The output may be something like this:
Time = 173058798201719 | CPU ID = 6 | Thread ID = 62803
Time = 173058798713083 | CPU ID = 3 | Thread ID = 62802
Time = 173058799826723 | CPU ID = 3 | Thread ID = 62802
Time = 173058800426323 | CPU ID = 6 | Thread ID = 62803
Time = 173058801403355 | CPU ID = 8 | Thread ID = 62804
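Because core-based sampling records every process that runs on the watched cores, it can help to keep only the samples that belong to the application's own threads. The following sketch shows one way to do that. `CoreSample` is a hypothetical plain struct holding the three printed values (it is not a perf-cpp type), and the set of thread IDs is assumed to have been collected by the application itself.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <optional>
#include <unordered_set>
#include <vector>

/// Hypothetical stand-in for the values copied out of a sample record.
/// This is NOT a perf-cpp type; it only mirrors the three fields printed above.
struct CoreSample {
  std::optional<std::uint64_t> time;
  std::optional<std::uint16_t> cpu_id;
  std::optional<std::uint32_t> thread_id;
};

/// Returns only the samples whose thread ID is contained in thread_ids.
/// Samples without a thread ID are dropped; the input order is preserved.
std::vector<CoreSample>
keep_samples_of_threads(const std::vector<CoreSample>& samples,
                        const std::unordered_set<std::uint32_t>& thread_ids)
{
  auto kept = std::vector<CoreSample>{};
  std::copy_if(samples.begin(), samples.end(), std::back_inserter(kept),
               [&thread_ids](const CoreSample& sample) {
                 return sample.thread_id.has_value() &&
                        thread_ids.count(sample.thread_id.value()) > 0U;
               });
  return kept;
}
```

For the example output above, filtering for thread 62802 would keep the two samples recorded on CPU 3.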
Closing the sampler will free and un-map all resources like buffers and hardware counters.
sampler.close();