[Rivet] Proposal for multi-weight/NLO counter-event support, and re-entry to the event loop / finalize steps
Andy Buckley <andy.buckley@cern.ch>
Mon Jun 3 10:39:55 BST 2013
Hi all,

Here is the proposal for Rivet histogramming developments in the next couple of weeks at Les Houches, and beyond. We've moved the development repositories for Rivet and YODA from svn to hg, which will make exploratory development much easier.

Please try to read this, or at least scan the bits that interest you. It's long, for which I apologise, but this is rather fiddly stuff. I think "many eyes" looking through this proposal could help us to catch issues before we've committed ourselves to a particular design: I can already think of a couple of (minor) vague areas, but I'll keep them to myself for now rather than further clutter this email! I'll sync this email to the Trac wiki/ticket.

A reminder of what we're trying to solve:

1. Merging of independent runs (for the same or different processes)
2. Writing fully useful histograms *during* a run
3. Transparent handling of multi-weight events
4. Transparent handling of correlated events

These facets are not orthogonal, but to make development feasible we need to factorise them as much as possible into small-ish steps, so that after each one we'll have a working system which we can test. Doing all of this in one big step seems a recipe for disaster! I'll try to go through these now in the order given, which I think is also the natural order for development.

1. RUN MERGING

Actually, we can do simple run merging at the YODA level now, thanks to Dave M putting together the remaining Python += operators for YODA histograms, and a yodamerge script which uses them to combine multiple runs into one. It's very simple at the moment, but we will add a command-line way to specify weights for each input run: will that be enough to combine *different* processes by cross-section? Please try it out... it *seems* to be working nicely, given a simple test.

This only works for histograms to which at most a normalisation scale factor has been applied, though. In the general case, arbitrary manipulations might be done to the histograms in the finalize step: to merge these from multiple runs we need to merge the data objects *before* finalization. This introduces significant challenges, especially since we don't want to introduce very unintuitive structures into the "user" analysis code. Our chosen approach is as follows:

* Analyses should *register* every object that will be used in their finalize() method, in addition to those which are intended for plotting/comparison use. Registration will normally happen in init() but can also happen in finalize... this already happens, so effectively no change, except that even intermediate histograms will now need to be registered. For analyses with cuts, registered YODA Counter objects will need to be used in place of doubles for weight counting. Access to the histos/counters in the analysis will be by pointers, as now, or by the registered path name, so the user's freedom to structure their analysis' data as they wish won't be impeded.

* We will provide a way to declare on the booking methods whether the object being registered is to be "visible" in the final histograms, or if it is an interim data object to be used in preparing the final plots. This way *everything* needed to calculate the final plots in finalize() will also be written out to the .yoda file. This will inflate the file size, but that should not be a major problem; we can add a behavioural flag to disable this behaviour if that is thought to be important. The plotting scripts (i.e. rivet-cmphistos) should have a flag to plot the intermediate histos (default = only plot final/visible histos).

* Add the ability to "pre-load" the intermediate histograms after Rivet's init() step, but before the event loop starts, by supplying a .yoda data file... or multiple files. This is the key step for merging analyses with complex finalize() logic & manipulations: we perform the multiple runs in parallel, merge the histogram files (although only with an interest in the non-final histograms: the final ones will in general be mangled by this process), and restart the event loop -- perhaps for 0 events -- with that combined state pre-loaded. The finalize() will then proceed using the aggregated intermediate data objects and write out consistent full-stats/all-processes physical ones. (It was even suggested that if we map everything nicely into Python, this merge+preload+finalize step could be done transparently inside rivet-cmphistos, rather than via an explicit extra run of the rivet script.)

Note that there is no fiddly weight treatment here for multi-weights, NLO counter-events, etc. I suggest that we make a new release at this point, since it is a significant feature improvement.

Re. the temporary/invisible histo flagging: in the meeting it was suggested that we use a YODA "annotation" for this, but if we need to write out both the intermediate and final versions of e.g. histograms to be normalized, then they need to have different paths so we can distinguish them. I suggest a /TMP/ path prefix for intermediate histos, cf. the /REF/ that we already use to distinguish and relate MC and data histograms, and because it will alphabetically group histos in a fairly predictable way. We can build awareness of /TMP into rivet-cmphistos in the same way as we already do for /REF.

Another benefit of a fully predictable path scheme is that we can switch the behaviour of the Analysis::get("name") function, so that in the analyze() method it returns the temporary histo, and in finalize() it returns the permanent one. I think this is necessary, and that in fact we will have to do some pre-finalize sleight of hand to switch the target of the histo pointers in the analysis to point at the permanent objects rather than the temporary ones! A sketch of how an analysis might look under this scheme is below. Thoughts on this?
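To make the registration scheme concrete, here is a minimal sketch of what an analysis might look like under this proposal. To be clear: the "Interim" booking flag, the bookCounter() method and the CounterPtr type are assumptions for illustration (this is proposed, not existing, Rivet API); the booking/fill/scale pattern around them is the familiar one.

    // Hypothetical sketch only: "Interim", bookCounter() and CounterPtr are
    // assumed/proposed API, not current Rivet code.
    #include "Rivet/Analysis.hh"
    #include "Rivet/Projections/FinalState.hh"

    class MY_EXAMPLE : public Rivet::Analysis {
    public:
      MY_EXAMPLE() : Analysis("MY_EXAMPLE") {}

      void init() {
        addProjection(Rivet::FinalState(), "FS");
        // Visible histogram: appears as-is in the final .yoda output
        _h_pt = bookHisto1D("pt", 50, 0.0, 100.0);
        // Interim counter: registered (hence written out and mergeable across
        // runs) but flagged as an intermediate object, living under /TMP/
        _c_sumw = bookCounter("sumw_passed", Interim);
      }

      void analyze(const Rivet::Event& e) {
        // ... cuts ...
        _c_sumw->fill(e.weight());  // replaces a bare double for weight counting
        _h_pt->fill(/*pT*/ 42.0, e.weight());
      }

      void finalize() {
        // By this point the framework has silently switched the pointers from
        // the /TMP/ objects to the permanent ones, so this normalisation can
        // safely be re-run any number of times.
        scale(_h_pt, crossSection() / _c_sumw->sumW());
      }

    private:
      Rivet::Histo1DPtr _h_pt;
      Rivet::CounterPtr _c_sumw;  // assumed pointer type for a registered Counter
    };

The point of the sketch is that the user code is unchanged in shape: the only migration is that the sum-of-weights double becomes a registered Counter, so that it too can be merged before finalize().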
2. MID-RUN HISTO WRITING

This is semi-trivial given the above. All that we need to be careful about here is that finalize() doesn't disrupt the intermediate histograms, which the sleight-of-hand method should ensure. We can then do the pointer-target switch, run finalize(), switch back, and continue with the run: finalize() can be run any number of times. Maybe this gets released at the same time as the above, maybe it comes slightly later.

3. MULTI-WEIGHT EVENTS

Ok, so now we have a working Rivet release in which single-weight events can be used for arbitrary run combination: this will make a lot of people very happy. Now we make it slightly more complicated. In fact the machinery above is a necessary precursor to handling multiple-weight events (which will be an important development for MC systematics; the generator-side machinery is starting to come to fruition, cf. discussion at Les Houches this week, I hope).

The obvious way to handle multiple weights is to run the analysis N times for N weights, with different histo paths for each weight, or similar: this was already tried by James several years ago, and Leif advocated it in the meeting last week. The problem is that if, say, the PDF4LHC prescription was used in a generator run, then there will be O(200) weights per event. While projections will help to some extent, the particle/jet looping in the analyses will have to be re-run N times, with the guarantee that exactly the same cuts will be passed, the same histogram fills will happen, etc.: the *only* difference in the result will be the weight that goes into the histograms and counters. We could perhaps rewrite all the analyses (!) so that the analyze() method is a projection (or a functor that behaves that way) and make use of explicit caching, but I think there's a much nicer way...

The proposal is that for multi-weight events, we don't just book one intermediate histogram per registered name (and then turn those into permanent histograms in finalize()), but that for every call of add() we book 1 temporary histo (which will be clear()ed after each event) and N intermediates. This requires a little bit of magic, as we'll only find out how many weights there are by looking at the first event, but we already do that in Run to work out the beam particles and sqrtS before init()ing the analyses.

At the end of every event, the Analysis base class (or the AnalysisHandler) will sync the temporary histograms to the intermediates by looping over the weight vector of the event, scaling the temporary by each weight before +=ing it to the corresponding intermediate (see the sketch below). Note that this means that in the analysis code, ~all weights should be 1! This will require migration, but also makes for a beautiful simplification.
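A minimal sketch of that end-of-event sync, to show how cheap it is. The YODA Histo1D operations (copy, scaleW, +=, reset) are real; the surrounding bookkeeping structure and names are assumptions for illustration.

    // Sketch of the proposed end-of-event sync. YODA::Histo1D's scaleW(),
    // operator+= and reset() are real; MultiWeightHisto and its names are
    // hypothetical bookkeeping.
    #include "YODA/Histo1D.h"
    #include <vector>

    struct MultiWeightHisto {
      YODA::Histo1D tmp;                         // filled with weight 1, cleared per event
      std::vector<YODA::Histo1D> intermediates;  // one per weight stream, same binning

      MultiWeightHisto(size_t nweights, size_t nbins, double lo, double hi)
        : tmp(nbins, lo, hi),
          intermediates(nweights, YODA::Histo1D(nbins, lo, hi)) {}

      // Called once per event, after all analyses have run, with the
      // event's weight vector.
      void syncEvent(const std::vector<double>& weights) {
        for (size_t i = 0; i < weights.size(); ++i) {
          YODA::Histo1D h = tmp;    // copy the unit-weight fills
          h.scaleW(weights[i]);     // apply the i'th weight stream's weight
          intermediates[i] += h;    // accumulate into the i'th intermediate
        }
        tmp.reset();                // ready for the next event
      }
    };

The key point is that the particle/jet loops in analyze() run once per event regardless of N; only this histogram-level loop scales with the number of weights.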
The finalize() code will need to be aware that operations apply to all intermediates rather than just one... this will require some thought, e.g. running finalize() once for each weight, with the appropriate weight-specific pointer switching. Again, we'll need distinct path structures to track and access these distinct-but-related data objects. I suggest a /.../.../FOO@WEIGHTNAME path syntax extension (perhaps with no @WEIGHTNAME part for the first weight, i.e. the nominal behaviour).

Another release should probably happen at this point.

4. CORRELATED EVENTS

Almost done. Handling NLO counter-events (where a group of events must be treated as correlated sub-events) requires that we make a temporary histo for each super-event block and then synchronise that super-event to the intermediate histos. We'll trigger on the super-event transitions by looking for a change of event number: if the event number remains the same between consecutive events, we assume that they are correlated sub-events (see the sketch after the open questions below). The machinery for multi-weight events deals with most of the issues, I think: we will already have per-event temporary, transient-only histograms, and these just need to be generalized a bit to become per-super-event temporaries.

There are some open questions:

* Fuzzy bin edges: counter-events could fall on either side of a bin boundary by an epsilon separation. We have the machinery to catch this and deal with it (average out the fills between the two bins, or assign both fills to only one bin) because the YODA bins store their mean fill position. The abstraction of the temporary histograms makes this possible without having to put physics knowledge into YODA: phew.

* What weight do we use to sync to the intermediate histos? The sub-events can have different weights -- in fact, they can have systematic weight vectors! Do we need to add a separate "event group weight" member to HepMC and LHE to separate systematics weights from NLO subtraction term weights? Or can we make do with the systematics weight vectors and a standardised procedure?

* In this scheme, sub-events must be consecutive and have the same event number so we can determine that they are to be correlated. MC@NLO doesn't do this: is that ok? (MC@NLO's events only have weights of +-1, so I think it's ok to treat them as uncorrelated: in that case, we don't really have to include it in this "NLO" treatment at all.)
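For concreteness, a rough sketch of the event-number trigger for super-event boundaries. HepMC's GenEvent::event_number() is real API; bufferSubEvent() and syncSuperEvent() are hypothetical hooks standing in for the per-super-event temporary-histo machinery described above.

    // Sketch of super-event detection in the event loop. Only
    // GenEvent::event_number() is existing HepMC API; the rest is
    // hypothetical plumbing.
    #include "HepMC/GenEvent.h"

    class SuperEventGrouper {
    public:
      void onEvent(const HepMC::GenEvent& evt) {
        const int evnum = evt.event_number();
        if (_started && evnum != _currentEvNum) {
          // Event number changed: the previous super-event block is complete,
          // so sync its temporary histos into the intermediates.
          syncSuperEvent();
        }
        _currentEvNum = evnum;
        _started = true;
        bufferSubEvent(evt);  // fill the per-super-event temporary histos
      }

      void onRunEnd() {
        if (_started) syncSuperEvent();  // flush the final block
      }

    private:
      void bufferSubEvent(const HepMC::GenEvent& evt);  // hypothetical hook
      void syncSuperEvent();                            // hypothetical hook
      int _currentEvNum = 0;
      bool _started = false;
    };

A single-weight, uncorrelated run is just the degenerate case in which every super-event block contains exactly one sub-event, so this generalisation costs nothing in the common case.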
Everything is done now, so we definitely put out a new release, and collapse from exhaustion ;-)

That's it. Thoughts? It's complicated, but not insane, and I actually think it's rather elegant: it enables analysis code to be more compact and straightforward, without doing anything nasty or too magical. We should make sure that the user can always do anything they want -- even getting hold of the permanent histos in the event loop if they really want -- but that it's not the easiest thing to do: we optimise the API to make the 99% use case as simple as possible. We will have to proceed in steps, because implementing all of this at once would be a disaster, but I hope the above sounds like a reasonable factorization.

Cheers,
Andy

--
Dr Andy Buckley, Royal Society University Research Fellow
Particle Physics Expt Group, University of Edinburgh