[Rivet] Proposal for multi-weight/NLO counter-event support, and re-entry to the event loop / finalize steps

Andy Buckley andy.buckley@cern.ch
Mon Jun 3 10:39:55 BST 2013


Hi all,

Here is the proposal for Rivet histogramming developments in the next
couple of weeks at Les Houches, and beyond. We've moved the development
repositories for Rivet and YODA to use hg rather than svn now, which
will make exploratory development much easier.

Please try to read this, or at least to scan the bits that interest you.
It's long, for which I apologise, but this is rather fiddly stuff. I
think "many eyes" looking through this proposal could help us to catch
issues before we've committed ourselves to a particular design: I can
already think of a couple of (minor) vague areas, but I'll keep them to
myself for now rather than further clutter this email! I'll sync this
email to the Trac wiki/ticket.

A reminder of what we're trying to solve...

 1. Merging of independent runs (for same or different processes)
 2. Writing fully useful histograms *during* a run
 3. Transparent handling of multi-weight events
 4. Transparent handling of correlated events

These facets are not orthogonal, but to make development feasible we do
need to factorise the work as much as possible into small-ish steps, so that
after each one we'll have a working system which we can test. Doing all
of this in one big step seems a recipe for disaster!

I'll try to go through these now in the order given, which I think is
also the natural order for development.


1. RUN MERGING

Actually, we can do simple run merging at the YODA level now, thanks to
Dave M putting together the remaining Python += operators for YODA
histograms, and a yodamerge script which uses them to combine multiple
runs into one. It's very simple at the moment, but we will add a
command-line way to specify weights for each input run: will that be
enough to
combine *different* processes by cross-section? Please try it out... it
*seems* to be working nicely, given a simple test.
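
For example, merging two runs is currently something like this (modulo
the exact option names, and with the per-run weight option still to
come):

  yodamerge -o combined.yoda run1.yoda run2.yoda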

This only works for histograms to which at most a normalisation scale
factor has been applied, though. In the general case, arbitrary
manipulations might be done to the histograms in the finalize step: to
merge these from multiple runs we need to merge the data objects
*before* finalisation. This poses significant challenges, especially
since we don't want to introduce unintuitive structures into the "user"
analysis code. Our chosen approach is as follows:

 * Analyses should *register* every object that will be used in their
finalize() method, in addition to those intended for plotting/comparison
use. Registration will normally happen in init(), but can also happen in
finalize()... this already happens, so there is effectively no change
except that even intermediate histograms will now need to be registered.
For analyses with cuts, registered YODA Counter objects will need to be
used in place of doubles for weight counting. Access to the
histos/counters in the analysis will be by pointers, as now, or by the
registered path name, so the user's freedom to structure their analysis
data as they wish won't be impeded. (A code sketch of this scheme
follows after this list.)

 * We will provide a way to declare, via the booking methods, whether
the object being registered is to be "visible" in the final histograms,
or whether it is an interim data object used only in preparing the final
plots. This way *everything* needed to calculate the final plots in
finalize() will also be written out to the .yoda file. This will inflate
the file size, but that should not be a major problem; we can add a flag
to disable the extra output if it proves to be. The plotting scripts
(i.e. rivet-cmphistos) should have a flag to plot the intermediate
histos (default = only plot final/visible histos).

 * Add the ability to "pre-load" the intermediate histograms after
Rivet's init() step, but before the event loop starts, by supplying a
.yoda data file... or multiple files. This is the key step for merging
analyses with complex finalize() logic and manipulations: we perform the
multiple runs in parallel, merge the histogram files (with an interest
only in the non-final histograms, since the final ones will in general
be mangled by this process), and restart the event loop -- perhaps for 0
events -- with that combined state pre-loaded. The finalize() will then
proceed using the aggregated intermediate data objects and write out
consistent full-stats/all-processes physical ones. (It was even
suggested that, if we map everything nicely into Python, this
merge+preload+finalize step could be done transparently inside
rivet-cmphistos, rather than via an explicit extra run of the rivet
script.)
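
To make the registration scheme concrete, here is a minimal sketch of
what an analysis might look like. The Counter booking call, the
"visible" flag and the CounterPtr typedef are hypothetical at this
stage; the rest follows current Rivet style:

  // Sketch only: bookCounter() and the visibility flag don't exist yet
  class MY_TEST_ANALYSIS : public Analysis {
  public:
    MY_TEST_ANALYSIS() : Analysis("MY_TEST_ANALYSIS") { }

    void init() {
      // Visible, final histogram: written out and plotted as now
      _h_pT = bookHisto1D("pT", 50, 0.0, 100.0);
      // Interim object needed by finalize(): also registered, but
      // flagged invisible so plotting tools skip it by default
      _c_sumW = bookCounter("sumW_passed", /*visible=*/false);
    }

    void analyze(const Event& event) {
      const double w = event.weight();
      // ... cuts; suppose the event passes, with some observable pT:
      const double pT = 42.0;  // placeholder for a real calculation
      _c_sumW->fill(w);        // replaces a raw double member
      _h_pT->fill(pT, w);
    }

    void finalize() {
      // Every input to this calculation is a registered object, so a
      // merged pre-finalize run can redo it with aggregated statistics
      scale(_h_pT, crossSection() / _c_sumW->sumW());
    }

  private:
    Histo1DPtr _h_pT;
    CounterPtr _c_sumW;  // hypothetical, cf. Histo1DPtr
  };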

Note that there is no fiddly weight treatment here for multi-weights,
NLO counter-events, etc. I suggest that we make a new release at this
point, since it is a significant feature improvement.

Re. the temporary/invisible histo flagging: in the meeting it was
suggested that we use a YODA "annotation" for this, but if we need to
write out both the intermediate and final versions of e.g. histograms to
be normalised, then they need to have different paths so we can
distinguish them. I suggest a /TMP/ path prefix for intermediate histos,
cf. the /REF/ prefix that we already use to distinguish and relate MC
and data histograms; it will also group histos alphabetically in a
fairly predictable way. We can build awareness of /TMP into
rivet-cmphistos in the same way as we already do for /REF. Another
benefit of a fully predictable path scheme is that we can switch the
behaviour of the Analysis::get("name") function, so that in the
analyze() method it returns the temporary histo, and in finalize() it
returns the permanent one. I think this is necessary, and that in fact
we will have to do some pre-finalize sleight of hand to switch the
target of the histo pointers in the analysis to point at the permanent
objects rather than the temporary ones! Thoughts on this?
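
For concreteness, the sleight of hand might look something like this in
the AnalysisHandler (every name here is hypothetical; this sketches the
mechanism, not a committed interface):

  // Paths distinguish the two copies of each registered object, e.g.
  //   /TMP/MY_ANALYSIS/pT  <- intermediate, filled during the run
  //   /MY_ANALYSIS/pT      <- permanent, produced for plotting
  void AnalysisHandler::runFinalize() {
    for (RegisteredObject& ro : _objects) {
      *ro.permanent = *ro.temporary;        // seed from accumulated state
      ro.analysisPtr->reset(ro.permanent);  // retarget the user's pointer
    }
    finalizeAnalyses();   // user finalize() now mangles only permanents
    for (RegisteredObject& ro : _objects)
      ro.analysisPtr->reset(ro.temporary);  // switch back for more events
  }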


2. MID-RUN HISTO WRITING

This is semi-trivial given the above. All that we need to be careful
about here is that finalize() doesn't disrupt the intermediate
histograms, which the sleight-of-hand method should ensure. We can then
do the pointer target switch, run finalize, switch back, and continue
with the run: finalize() can be run any number of times. Maybe this gets
released at the same time as the above, maybe it comes slightly later.
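
So a mid-run snapshot reduces to something like this, reusing the
hypothetical switching interface sketched in point 1 (finalize() and
writeData() are existing AnalysisHandler methods):

  // Safe at any point in the event loop: finalize() only touches the
  // permanent copies, so filling can continue afterwards
  handler.switchToPermanents();   // hypothetical
  handler.finalize();             // run every analysis' finalize()
  handler.writeData("snapshot.yoda");
  handler.switchToTemporaries();  // hypothetical: resume accumulating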


3. MULTI-WEIGHT EVENTS

Ok, so now we have a working Rivet release in which single-weight events
can be used for arbitrary run combination: this will make a lot of
people very happy. Now we make it slightly more complicated. In fact the
machinery above is a necessary precursor to handling multiple-weight
events, which will be an important development for MC systematics, and
for which the generator-side machinery is starting to come to fruition
(cf. discussion at Les Houches this week, I hope).

The obvious way to handle multiple weights is to run the analysis N
times for N weights, with different histo paths for each weight, or
similar: this was already tried by James several years ago, and Leif
advocated it in the meeting last week. The problem is that if, say, the
PDF4LHC prescription is used in a generator run, there will be
O(200) weights per event. While projections will help to some extent,
the particle/jet looping in the analyses will have to be re-run N times,
with the guarantee that exactly the same cuts will be passed, the same
histogram fills will happen, etc.: the *only* difference in the result
will be the weight that goes into the histograms and counters. We could
perhaps rewrite all the analyses (!) so that the analyze() method is a
projection (or a functor that behaves that way) and make use of explicit
caching, but I think there's a much nicer way...

The proposal is that for multi-weight events, we don't just book one
intermediate histogram per registered name (and then turn those into
permanent histograms in finalize()), but that for every call of add() we
book 1 temporary histo (which will be clear()ed after each event) and N
intermediates. This requires a little bit of magic as we'll only find
out how many weights there are by looking at the first event, but we
already do that in Run to work out the beam particles and sqrtS before
init()ing the analyses. At the end of every event, the Analysis base
class (or the AnalysisHandler) will sync the temporary histograms to the
intermediates by looping over the weight vector of the event, and
scaling the temporary by the weight before +=ing it to its intermediate.
Note that this means that in the analysis code, ~all weights should be
1! This will require migration, but also makes for a beautiful
simplification. The finalize() code will need to be aware that
operations apply to all intermediates rather than just one... this will
require some thought, e.g. running finalize() once for each weight, with
the appropriate weight-specific pointer switching.
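
In code, the end-of-event sync might look roughly like this (the
bookkeeping wrapper is hypothetical; scaleW(), reset() and += are
existing YODA Histo1D operations):

  // Called once per event, e.g. by the AnalysisHandler (sketch)
  void syncEventToIntermediates(const std::vector<double>& weights) {
    for (HistoRecord& hr : _registered) {
      for (size_t iw = 0; iw < weights.size(); ++iw) {
        YODA::Histo1D h = *hr.perEventTmp;  // copy the unit-weight fills
        h.scaleW(weights[iw]);              // apply weight stream iw
        *hr.intermediates[iw] += h;         // accumulate into stream iw
      }
      hr.perEventTmp->reset();              // clear for the next event
    }
  }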

Again, we'll need distinct path structures to track and access these
distinct-but-related data objects. I suggest a /.../.../FOO@WEIGHTNAME
path syntax extension (perhaps with no @WEIGHTNAME part for the first
weight, i.e. the nominal behaviour).
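
For example, the nominal and one (hypothetically named) scale-variation
stream of the same histogram might then live at

  /MY_ANALYSIS/pT
  /MY_ANALYSIS/pT@MUR2_MUF2

with their intermediate counterparts under the /TMP/ prefix from point 1.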

Another release should probably happen at this point.


4. CORRELATED EVENTS

Almost done. Handling NLO counter-events (where a group of events must
be treated as correlated sub-events) requires that we make a temporary
histo for each super-event block and then synchronise that
super-event's accumulated fills to the intermediate histos. We'll
trigger on the super-event transitions
by looking for a change of event number: if the event number remains the
same between consecutive events, we assume that they are correlated
sub-events.
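
The trigger itself is cheap: something like the following in the event
loop, where the sync hook is the hypothetical machinery from point 3 and
event_number() is standard HepMC:

  // Flush the finished super-event when the event number changes
  if (genEvent->event_number() != _currentGroupNum) {
    syncSuperEventToIntermediates();  // hypothetical, cf. point 3
    _currentGroupNum = genEvent->event_number();
  }
  // ... then analyse this sub-event into the per-group temporaries ...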

The machinery for multi-weight events deals with most of the issues, I
think: we will already have per-event temporary, transient-only
histograms, and these just need to be generalized a bit to become
per-super-event temporaries. There are some open questions:

* Fuzzy bin edges: counter-events could fall on either side of a bin
boundary by an epsilon separation. We have the machinery to catch this
and deal with it (average out the fills between the two bins, or assign
both fills to only one bin) because the YODA bins store their mean fill
position; see the sketch after these open questions. The abstraction of
the temporary histograms makes this possible without having to put
physics knowledge into YODA: phew.

* What weight do we use to sync to the intermediate histos? The
sub-events can have different weights -- in fact, they can have
systematic weight vectors! Do we need to add a separate "event group
weight" member to HepMC and LHE to separate systematics weights from NLO
subtraction term weights? Or can we make do with the systematics weight
vectors and a standardised procedure?

* In this scheme, sub-events must be consecutive and have the same event
number so we can determine that they are to be correlated. MC@NLO
doesn't do this: is that ok? (MC@NLO's events only have weights of ±1,
so I think it's ok to treat them as uncorrelated: in that case, we don't
really have to include it in this "NLO" treatment at all.)
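
On the fuzzy-edge question, the detection might be sketched as follows;
the epsilon value and the merge policy are exactly the open choices,
while bins(), xMax(), xMean() and numEntries() are existing YODA
accessors:

  // Look for adjacent bins of a per-super-event temporary whose mean
  // fill positions both hug the shared edge: that signals sub-events
  // which straddled the boundary by ~epsilon
  const std::vector<YODA::HistoBin1D>& bins = tmp.bins();
  for (size_t i = 0; i + 1 < bins.size(); ++i) {
    if (bins[i].numEntries() == 0 || bins[i+1].numEntries() == 0) continue;
    const double edge = bins[i].xMax();
    if (edge - bins[i].xMean() < eps && bins[i+1].xMean() - edge < eps) {
      // recombine: average the fills between the two bins, or assign
      // both to a single bin, before syncing to the intermediates
    }
  }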

Everything is done now, so we definitely put out a new release, and
collapse from exhaustion ;-)



That's it. Thoughts? It's complicated, but not insane, and I actually
think it's rather elegant: it enables analysis code to be more compact
and straightforward, without doing anything nasty or too magical. We
should make sure that the user can always do anything they want -- even
getting hold of the permanent histos in the event loop if they really
want to -- but that it's not the easiest thing to do: we optimise the
API to make the 99% use case as simple as possible. We will have to
proceed in steps, because implementing all of this at once would be a
disaster, but I hope the above sounds like a reasonable factorisation.

Cheers,
Andy

-- 
Dr Andy Buckley, Royal Society University Research Fellow
Particle Physics Expt Group, University of Edinburgh

