Advanced usage examples
This section contains some advanced usage examples that show how you can use RTLMeter to gather data in various scenarios.
Evaluating the effect of processor affinity
Let’s say you want to see the effect of different processor assignments on the execution time of a multi-threaded model.
RTLMeter is aware of the processor affinity it was launched with, and the
C++ build step is executed with the -j option to Make set to the number of
available processors. In order not to restrict compilation to specific CPUs,
you can compile the required configurations first, without running the
execute steps:
echo "OpenTitan:default:cmark" > case-list.txt
echo "XiangShan:mini*:cmark" >> case-list.txt
echo "Vortex:sane:sgemm" >> case-list.txt
./rtlmeter run --cases @case-list.txt --compileRoot work-compile --nExecute 0 --compileArgs="--threads 4"
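As a quick sanity check of the affinity awareness described above (this is a generic illustration of how affinity restriction is observed by child processes, not RTLMeter functionality), you can verify that a process launched under a restricted affinity mask only sees the CPUs it was given:
# nproc reports the number of processors available to the calling process
nproc                      # e.g. prints 16 on an 8-core/16-thread host
numactl -C 0,1,2,3 nproc   # prints 4, as only the listed logical CPUs are available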
The --compileRoot option is similar to --workRoot, but only applies to the
compilation steps. You can then use separate working directories to perform
multiple executions, by specifying the --executeRoot option, and run with
different processor assignments.
On the host machine this example was written on, logical CPUs 0 and 8
correspond to hardware threads that share the same physical core, as do CPUs
1 and 9. You can examine /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
to see the list of logical CPUs that share the same physical core as CPU 0 on
your host machine.
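For example, on a host with 2-way SMT you might see something like the following (the exact values depend entirely on your machine):
# List the hardware thread siblings of CPU 0 and CPU 1 (output is host specific)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. prints "0,8"
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list   # e.g. prints "1,9"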
To see the effect of running the 4-thread models on 4 physical cores (1 thread per core) versus on 2 physical cores (2 threads per core), you can run:
numactl -C 0,1,2,3 ./rtlmeter run --cases @case-list.txt --compileRoot work-compile --executeRoot work-4-core-1-thread --nExecute 3
numactl -C 0,1,8,9 ./rtlmeter run --cases @case-list.txt --compileRoot work-compile --executeRoot work-2-core-2-thread --nExecute 3
You can then see the effect on simulation speed with:
./rtlmeter compare --metrics speed work-4-core-1-thread work-2-core-2-thread
This shows that performance is generally better when physical cores are not shared, and lets you quantify the effect precisely:
execute - Sim speed [kHz] - higher is better
╒══════════════════════════════╤══════╤══════╤═════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (B/A) │ p-value │
╞══════════════════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ OpenTitan:default:cmark │ 3 │ 3 │ 13.14 (± 1.35%) │ 10.17 (± 0.67%) │ 0.77x │ 0.00 │
│ Vortex:sane:sgemm │ 3 │ 3 │ 2.15 (± 1.82%) │ 2.03 (± 0.46%) │ 0.94x │ 0.02 │
│ XiangShan:mini-chisel3:cmark │ 3 │ 3 │ 10.96 (± 1.37%) │ 9.02 (± 0.71%) │ 0.82x │ 0.00 │
│ XiangShan:mini-chisel6:cmark │ 3 │ 3 │ 11.23 (± 1.52%) │ 9.01 (± 0.31%) │ 0.80x │ 0.00 │
╞══════════════════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 0.83x │ │
│ Geometric mean - pVal < 0.05 │ │ │ │ │ 0.83x │ │
╘══════════════════════════════╧══════╧══════╧═════════════════╧═════════════════╧══════════════╧═══════════╛
Effect of cold vs hot Ccache on compile time
You can use the CCACHE_RECACHE environment variable (which is specific to
Ccache, see man ccache) to make Ccache skip reusing any cached objects while
still populating the cache with new ones. You might try something like:
# Run without caching, but populate the cache
env CCACHE_RECACHE=1 ./rtlmeter run --cases "OpenTitan:default:cmark Vortex:sane:sgemm" --nExecute 0 --workRoot work-ccache-cold
# Run with the just-populated cache
./rtlmeter run --cases "OpenTitan:default:cmark Vortex:sane:sgemm" --nExecute 0 --workRoot work-ccache-hot
# Compare results
./rtlmeter compare --steps cppbuild --metrics "elapsed cpu" work-ccache-cold work-ccache-hot
cppbuild - Elapsed time [s] - lower is better
╒═══════════════════╤══════╤══════╤═════════════════╤════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞═══════════════════╪══════╪══════╪═════════════════╪════════════════╪══════════════╪═══════════╡
│ OpenTitan:default │ 1 │ 1 │ 72.02 │ 3.02 │ 23.85x │ │
│ Vortex:sane │ 1 │ 1 │ 93.25 │ 2.56 │ 36.43x │ │
╞═══════════════════╪══════╪══════╪═════════════════╪════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 29.47x │ │
╘═══════════════════╧══════╧══════╧═════════════════╧════════════════╧══════════════╧═══════════╛
cppbuild - CPU Total [s] - lower is better
╒═══════════════════╤══════╤══════╤══════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞═══════════════════╪══════╪══════╪══════════════════╪═════════════════╪══════════════╪═══════════╡
│ OpenTitan:default │ 1 │ 1 │ 699.49 │ 38.62 │ 18.11x │ │
│ Vortex:sane │ 1 │ 1 │ 647.56 │ 32.41 │ 19.98x │ │
╞═══════════════════╪══════╪══════╪══════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 19.02x │ │
╘═══════════════════╧══════╧══════╧══════════════════╧═════════════════╧══════════════╧═══════════╛
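If you want additional confirmation that the hot run was indeed served from the cache, you can inspect Ccache's own statistics (this uses plain Ccache functionality, independent of RTLMeter):
# Optionally zero the Ccache statistics before running the experiment
ccache -z
# ... run the cold and hot builds as above ...
# Show hit/miss counts accumulated since the statistics were last zeroed
ccache -s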
Enabling waveform tracing
You can turn on waveform tracing for all RTLMeter benchmarks. To compile with
trace capability, pass the relevant Verilator options --trace or --trace-fst,
possibly together with other --trace* options, via the --compileArgs option to
./rtlmeter run. To actually enable tracing at execution time, also pass +trace
via --executeArgs. (+trace is checked by the RTLMeter support code included in
the top-level module of all benchmarks.)
# Compile with trace capability
./rtlmeter run --cases "VeeR-EH1:default:cmark" --compileArgs="--trace" --nExecute=0
# Execute with tracing enabled at run-time
./rtlmeter run --cases "VeeR-EH1:default:cmark" --compileRoot work --executeRoot work-trace-on --executeArgs="+trace"
# Execute without tracing enabled at run-time
./rtlmeter run --cases "VeeR-EH1:default:cmark" --compileRoot work --executeRoot work-trace-off