Comparing results
One of the most important reasons RTLMeter exists is to enable comparison of different versions of Verilator, and to support drawing robust conclusions about which version is better. For this purpose, you can use
./rtlmeter compare
to display the difference in performance metrics between two runs. You provide two working directories created with ./rtlmeter run, and ./rtlmeter compare will show the difference in the metrics you query.
Difference of two runs
As an example, let’s say you are interested in the effect of choosing between GCC and Clang to compile Verilator and your Verilated models. For this we assume you have two different builds of Verilator available: one configured with CXX=g++, and one configured with CXX=clang++. You can then perform some runs to get a quick idea. Remember that RTLMeter picks up Verilator from your shell PATH, so let’s set up some shell variables to use throughout the following examples:
VERILATOR_GCC=... path to 'bin' directory of Verilator configured with GCC ...
VERILATOR_CLANG=... path to 'bin' directory of Verilator configured with Clang ...
You can then run a set of cases with each version:
env PATH="$VERILATOR_GCC:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nExecute 3 --workRoot work-gcc
env PATH="$VERILATOR_CLANG:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nExecute 3 --workRoot work-clang
Once this is finished (it takes ~30 minutes), you can see the difference in simulator execution time with the following:
./rtlmeter compare --steps execute work-gcc work-clang
This will produce a report similar to:
execute - Elapsed time [s] - lower is better
╒══════════════════════════════╤══════╤══════╤═════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default:cmark │ 3 │ 3 │ 59.41 (± 1.25%) │ 55.70 (± 1.34%) │ 1.07x │ 0.00 │
│ VeeR-EH2:default:cmark │ 3 │ 3 │ 57.21 (± 0.51%) │ 42.15 (± 1.93%) │ 1.36x │ 0.00 │
│ VeeR-EL2:default:cmark │ 3 │ 3 │ 59.04 (± 0.61%) │ 55.03 (± 1.10%) │ 1.07x │ 0.00 │
╞══════════════════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 1.16x │ │
│ Geometric mean - pVal < 0.05 │ │ │ │ │ 1.16x │ │
╘══════════════════════════════╧══════╧══════╧═════════════════╧═════════════════╧══════════════╧═══════════╛
Here ‘A’ refers to results from the first working directory passed on the command line (work-gcc), and ‘B’ refers to the second working directory (work-clang).
You can see that with Clang, simulation is on average ~16% faster across these cases than with GCC. The ‘Gain’ (ratio of means, A/B) is also shown for each individual case.
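For intuition, the ‘Gain’ column and the summary row can be reproduced from the means in the table: the gain is the ratio of means A/B, and the summary row appears to be the geometric mean of the per-case gains, which is the appropriate average for ratios. A quick sketch using the numbers from the report above:

```python
import math

# Elapsed times from the table above: (mean A, mean B) in seconds.
means = {
    "VeeR-EH1:default:cmark": (59.41, 55.70),
    "VeeR-EH2:default:cmark": (57.21, 42.15),
    "VeeR-EL2:default:cmark": (59.04, 55.03),
}

# Per-case gain is simply the ratio of means.
gains = {case: a / b for case, (a, b) in means.items()}

# Geometric mean: n-th root of the product of the per-case gains.
geo_mean = math.exp(sum(math.log(g) for g in gains.values()) / len(gains))

for case, g in gains.items():
    print(f"{case}: {g:.2f}x")
print(f"Geometric mean: {geo_mean:.2f}x")  # matches the 1.16x in the table
```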
As with ./rtlmeter report, you can use the --steps and --metrics options to compare the different measurements available in the working directories.
Significance of difference
To check whether there is a meaningful difference in the performance metrics, ./rtlmeter compare computes a statistical significance test on the difference between the means for each case. The corresponding p-value is reported in the last column. As is standard in statistical hypothesis testing, low p-values indicate a significant difference. A commonly used threshold for concluding statistical significance is a p-value < 0.05, so RTLMeter also reports the average gain over only those cases that meet this threshold. For the execution times above, the results are clear: Clang produces faster code.
Note that the p-value is only computed if at least 2 samples are available from both working directories (as shown by the #A and #B columns); otherwise the table entry for the p-value is left blank.
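You do not need to know the details to use the tool, but for intuition, here is a sketch of the kind of test involved: an unpaired two-sample (Welch's) t-test on the timing samples. The sample values below are hypothetical, shaped like the execute results above; the exact test RTLMeter implements may differ.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom.

    A sketch of the kind of significance test used to compare the means
    of two sets of timing samples; RTLMeter's exact test may differ.
    """
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = se2**2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof

# Hypothetical samples shaped like the execute results above:
# ~59.4s vs ~55.7s with ~1% spread, so the means are clearly separated.
gcc = [59.1, 59.4, 59.7]
clang = [55.4, 55.7, 56.0]
t, dof = welch_t(gcc, clang)
print(f"t = {t:.1f}, dof = {dof:.1f}")  # a large |t| implies a small p-value
```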
Let’s now say you are interested in the effect of Clang vs GCC on running Verilator itself. Earlier we collected only one sample for the compilation step, so let’s add a few more:
env PATH="$VERILATOR_GCC:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nCompile 3 --workRoot work-gcc
env PATH="$VERILATOR_CLANG:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nCompile 3 --workRoot work-clang
This will perform two more compilations of each configuration; the first is already available from the earlier runs we did when measuring execution time. Simulation will not be re-run either, as execution results are already available in the working directories. (If you were starting with clean working directories, you could add --nExecute 0 to skip execution.) To check the effect on verilation time, use:
./rtlmeter compare --steps verilate work-gcc work-clang
The report looks something like:
verilate - Elapsed time [s] - lower is better
╒══════════════════╤══════╤══════╤═════════════════╤════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════╪══════╪══════╪═════════════════╪════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default │ 3 │ 3 │ 2.98 (± 2.95%) │ 2.91 (± 0.67%) │ 1.03x │ 0.24 │
│ VeeR-EH2:default │ 3 │ 3 │ 10.06 (± 5.49%) │ 9.65 (± 3.93%) │ 1.04x │ 0.30 │
│ VeeR-EL2:default │ 3 │ 3 │ 4.81 (± 3.47%) │ 4.72 (± 3.22%) │ 1.02x │ 0.46 │
╞══════════════════╪══════╪══════╪═════════════════╪════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 1.03x │ │
╘══════════════════╧══════╧══════╧═════════════════╧════════════════╧══════════════╧═══════════╛
Although it looks like Clang might be ~3% faster, the p-values indicate that the results are not significant; the difference might just be due to a noisy host machine.
Let’s add some more samples, as some of the confidence intervals of the means are quite wide:
env PATH="$VERILATOR_GCC:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nCompile 30 --workRoot work-gcc
env PATH="$VERILATOR_CLANG:$PATH" ./rtlmeter run --cases 'VeeR*:default:cmark' --nCompile 30 --workRoot work-clang
Then rerun:
./rtlmeter compare --steps verilate work-gcc work-clang
And you will see something like:
verilate - Elapsed time [s] - lower is better
╒══════════════════════════════╤══════╤══════╤════════════════╤════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════════════════╪══════╪══════╪════════════════╪════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default │ 30 │ 30 │ 2.95 (± 0.41%) │ 2.94 (± 0.39%) │ 1.00x │ 0.11 │
│ VeeR-EH2:default │ 30 │ 30 │ 9.67 (± 0.72%) │ 9.62 (± 0.42%) │ 1.00x │ 0.28 │
│ VeeR-EL2:default │ 30 │ 30 │ 4.65 (± 0.58%) │ 4.69 (± 0.40%) │ 0.99x │ 0.03 │
╞══════════════════════════════╪══════╪══════╪════════════════╪════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 1.00x │ │
│ Geometric mean - pVal < 0.05 │ │ │ │ │ 0.99x │ │
╘══════════════════════════════╧══════╧══════╧════════════════╧════════════════╧══════════════╧═══════════╛
Now the one statistically significant case suggests that using Clang is actually ~1% slower, but as you can see, such a small difference is hard to measure. At this point you might conclude that the difference is too small to be meaningful.
If you rerun the same session yourself, the actual results might of course differ, as they depend on the host machine, the environment, and the compiler versions you are using. The point is that RTLMeter gives you the ability to draw statistically sound conclusions.
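Why do more samples help? The width of the confidence interval on a mean shrinks roughly with the square root of the sample count, so going from 3 to 30 samples tightens the interval about 3.2-fold. A sketch, assuming an approximately normal sample mean (z = 1.96 for a 95% interval) and a hypothetical 2% per-sample spread:

```python
import math

def ci_half_width_pct(stddev_pct, n, z=1.96):
    """Half-width of the confidence interval on the mean, as a percent
    of the mean, for n samples with the given per-sample spread."""
    return z * stddev_pct / math.sqrt(n)

# Hypothetical per-sample spread of 2% of the mean: the interval
# narrows as 1/sqrt(n) as more samples are collected.
for n in (3, 30, 100):
    print(f"n={n:4d}: ±{ci_half_width_pct(2.0, n):.2f}%")
```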
If you care, you can of course keep going for as long as your time and patience allow. Here are the results after 100 runs with both compilers:
verilate - Elapsed time [s] - lower is better
╒══════════════════════════════╤══════╤══════╤════════════════╤════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════════════════╪══════╪══════╪════════════════╪════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default │ 100 │ 100 │ 2.95 (± 0.20%) │ 2.97 (± 0.45%) │ 0.99x │ 0.00 │
│ VeeR-EH2:default │ 100 │ 100 │ 9.62 (± 0.27%) │ 9.69 (± 0.41%) │ 0.99x │ 0.01 │
│ VeeR-EL2:default │ 100 │ 100 │ 4.63 (± 0.24%) │ 4.72 (± 0.38%) │ 0.98x │ 0.00 │
╞══════════════════════════════╪══════╪══════╪════════════════╪════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 0.99x │ │
│ Geometric mean - pVal < 0.05 │ │ │ │ │ 0.99x │ │
╘══════════════════════════════╧══════╧══════╧════════════════╧════════════════╧══════════════╧═══════════╛
This suggests that using Clang indeed makes verilation ~1% slower on average across these cases. How you use that information (whether you care or not) is of course outside the scope of this discussion, but RTLMeter can give you robust data to help you make decisions.
Evaluating the effect of Verilator options
You can use the --compileArgs option of ./rtlmeter run to pass additional command-line arguments to verilator during compilation. As an example, let’s use this to check the effect of the --public-flat-rw Verilator option. Note the ‘=’ used to prevent ./rtlmeter run from trying to parse the extra option as an argument to itself:
./rtlmeter run --cases 'VeeR*:default:cmark' --workRoot work-base
./rtlmeter run --cases 'VeeR*:default:cmark' --workRoot work-pfrw --compileArgs="--public-flat-rw"
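As an aside, the need for the ‘=’ reflects how most option parsers handle values that begin with ‘-’. The sketch below uses Python's argparse purely as an illustration of this general behavior; RTLMeter's actual parser may differ:

```python
import argparse

# Option parsers generally refuse to treat a token starting with '-'
# as the value of the preceding option, because it looks like another
# option. The '=' form attaches the value unambiguously.
parser = argparse.ArgumentParser()
parser.add_argument("--compileArgs")

# The '=' form works: the value is attached directly to the option.
ns = parser.parse_args(["--compileArgs=--public-flat-rw"])
print(ns.compileArgs)  # --public-flat-rw

# The space-separated form fails: '--public-flat-rw' is taken for an
# unknown option, not a value for --compileArgs.
try:
    parser.parse_args(["--compileArgs", "--public-flat-rw"])
except SystemExit:
    print("space-separated form rejected")
```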
Then run:
./rtlmeter compare work-base work-pfrw
Which shows:
verilate - Elapsed time [s] - lower is better
╒══════════════════╤══════╤══════╤═════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default │ 1 │ 1 │ 2.88 │ 4.82 │ 0.60x │ │
│ VeeR-EH2:default │ 1 │ 1 │ 10.23 │ 16.09 │ 0.64x │ │
│ VeeR-EL2:default │ 1 │ 1 │ 4.85 │ 8.45 │ 0.57x │ │
╞══════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 0.60x │ │
╘══════════════════╧══════╧══════╧═════════════════╧═════════════════╧══════════════╧═══════════╛
execute - Elapsed time [s] - lower is better
╒════════════════════════╤══════╤══════╤═════════════════╤══════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞════════════════════════╪══════╪══════╪═════════════════╪══════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default:cmark │ 1 │ 1 │ 55.52 │ 352.58 │ 0.16x │ │
│ VeeR-EH2:default:cmark │ 1 │ 1 │ 44.46 │ 458.54 │ 0.10x │ │
│ VeeR-EL2:default:cmark │ 1 │ 1 │ 55.74 │ 308.84 │ 0.18x │ │
╞════════════════════════╪══════╪══════╪═════════════════╪══════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 0.14x │ │
╘════════════════════════╧══════╧══════╧═════════════════╧══════════════════╧══════════════╧═══════════╛
There is not much point in doing multiple runs here, as the difference is very large: you can see that --public-flat-rw causes a significant slowdown in both verilation and execution. This is of course expected, as --public-flat-rw disables a lot of optimizations, resulting in both a slower simulator executable and slower verilation, due to an increased working-set size in later Verilator passes.
If you find it easier to interpret the results, you can swap the working directories around to see the effect of not using --public-flat-rw:
./rtlmeter compare work-pfrw work-base
Which presents:
verilate - Elapsed time [s] - lower is better
╒══════════════════╤══════╤══════╤═════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞══════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default │ 1 │ 1 │ 4.82 │ 2.88 │ 1.67x │ │
│ VeeR-EH2:default │ 1 │ 1 │ 16.09 │ 10.23 │ 1.57x │ │
│ VeeR-EL2:default │ 1 │ 1 │ 8.45 │ 4.85 │ 1.74x │ │
╞══════════════════╪══════╪══════╪═════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 1.66x │ │
╘══════════════════╧══════╧══════╧═════════════════╧═════════════════╧══════════════╧═══════════╛
execute - Elapsed time [s] - lower is better
╒════════════════════════╤══════╤══════╤══════════════════╤═════════════════╤══════════════╤═══════════╕
│ Case │ #A │ #B │ Mean A │ Mean B │ Gain (A/B) │ p-value │
╞════════════════════════╪══════╪══════╪══════════════════╪═════════════════╪══════════════╪═══════════╡
│ VeeR-EH1:default:cmark │ 1 │ 1 │ 352.58 │ 55.52 │ 6.35x │ │
│ VeeR-EH2:default:cmark │ 1 │ 1 │ 458.54 │ 44.46 │ 10.31x │ │
│ VeeR-EL2:default:cmark │ 1 │ 1 │ 308.84 │ 55.74 │ 5.54x │ │
╞════════════════════════╪══════╪══════╪══════════════════╪═════════════════╪══════════════╪═══════════╡
│ Geometric mean │ │ │ │ │ 7.13x │ │
╘════════════════════════╧══════╧══════╧══════════════════╧═════════════════╧══════════════╧═══════════╛
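Note that swapping the directories simply inverts each gain: every entry in this second set of tables is the reciprocal of the corresponding entry above, and so is the geometric mean. A quick check with the execute numbers:

```python
import math

# (with --public-flat-rw, without) elapsed seconds, from the tables above.
times = {
    "VeeR-EH1:default:cmark": (352.58, 55.52),
    "VeeR-EH2:default:cmark": (458.54, 44.46),
    "VeeR-EL2:default:cmark": (308.84, 55.74),
}

gains = [a / b for a, b in times.values()]
geo = math.exp(sum(map(math.log, gains)) / len(gains))
print([f"{g:.2f}x" for g in gains])       # matches 6.35x, 10.31x, 5.54x
print(f"geometric mean: {geo:.2f}x")      # matches 7.13x
print(f"reciprocal:     {1 / geo:.2f}x")  # matches the 0.14x earlier
```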