[Rivet] Patch for Rivet to allow ordering of analyses and more fleshed out rivet-nopy
Andy Buckley  andy.buckley at cern.ch
Mon May 8 11:26:31 BST 2017
Hi Christian,

Sorry for the delayed reply.

On the first point about how projections are tied to an analysis object, I think this is a misunderstanding. Projection *names* are specific to a ProjectionApplier (an Analysis or other Projection), to avoid lookup clashes, but their application enforces uniqueness via the comparison operator. I forget how the registration/declaration system works, but there is definitely no double-running of equivalent projections -- which is the whole point.

Regarding centrality, we are actually working with ALICE on some machinery for that. As you note, the problem stems from the "two-pass" nature of centrality extraction from a physical distribution. It's really a user decision whether the same generator/mechanism is used to produce the calibration distribution as the observables within centrality quantiles. We've tried to handle that via a "data pre-load" step, on track for inclusion in the Rivet 2.6.0 release and now also in a special branch for ALICE use until all this stuff has gone mainstream. And yes, the juggling of histogram pointers is really fiddly ;-)

To conclude, I think we'll hold off on the projection-ordering part of the patch, at least for now -- I'm sure there are aspects of the registration that can be improved, since the system evolved to its current form from quite a different place, but empirically the caching of projection results was checked the last time the system was updated, and it should be working fine. Please let me know if there does appear to be double-execution of equivalent projections...

The rivet-nopy patch might be welcome, though -- I appreciate all the arguments in favour, but in the end we have to maintain whatever is in the code, and more complexity makes that more difficult. So we'd prefer to focus limited manpower on the "official" parts of Rivet, but if you could separate this enhancement out into a separate patch, then I'll happily take a look and merge in whatever looks supportable.
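[Editorial illustration: the "uniqueness via the comparison operator" idea discussed above can be sketched as follows. All names here are invented for illustration and are not the actual Rivet API: a registry hands back an already-registered projection whenever a newly declared one compares equal, so an equivalent projection is computed only once per event no matter how many analyses declare it.]

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch (made-up names, not the real Rivet API).
struct Projection {
  std::string type;  // e.g. "FinalState"
  double etaCut;     // example configuration parameter
  int runs = 0;      // how many times project() has been called

  // Stand-in for Rivet's projection comparison operator.
  bool equivalentTo(const Projection& other) const {
    return type == other.type && etaCut == other.etaCut;
  }
  void project() { ++runs; }  // stand-in for the per-event particle loop
};

struct ProjectionRegistry {
  std::vector<std::shared_ptr<Projection>> pool;

  // Declare a projection: reuse an equivalent one if already registered.
  std::shared_ptr<Projection> declare(const Projection& p) {
    for (auto& q : pool)
      if (q->equivalentTo(p)) return q;  // shared: no double-running
    pool.push_back(std::make_shared<Projection>(p));
    return pool.back();
  }

  // One "event": each unique projection is projected exactly once,
  // however many analyses hold a handle to it.
  void applyAll() {
    for (auto& q : pool) q->project();
  }
};
```

With this shape, ten analyses all declaring an unmodified FinalState-like projection share one pooled instance, so one event means one loop over the particles rather than ten.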
Thanks again,
Andy

On 02/05/17 09:04, cholm wrote:
> Hi Andy,
>
> This is a relatively long reply - sorry, but the challenges faced are a
> bit complicated :-/
>
> On 2017-05-01 17:11, Andy Buckley wrote:
>> Hi,
>>
>> Can you explain what you mean by analysis-ordering?
>
> Sure. The ability to specify in exactly which order the analyses are
> executed.
>
>> The analyses run independently of each other --
>
> That seems to be the paradigm used in Rivet, which in some cases seems
> to be a sub-optimal way of doing things. Let's take projections.
> Analyses A and B may both declare a P projection. In the current code,
> the P projection is run two times, since the projections stored in the
> Event object are specific to the Analysis object. If, and only if, two
> projections compare equal, then there's no reason to redo the
> projection. So instead of running the projections independently and
> tying them to a specific Analysis object, the ProjectionManager could
> use the provided compare member functions to ensure uniqueness of all
> projections. That would in many cases reduce the number of loops over
> the event data considerably - e.g., if one has 10 Analysis objects, all
> using the FinalState projection (with no additional cuts), then instead
> of doing 10 loops over the data, we'd do one.
>
>> there should be no meaning to "ordering".
>
> Well, the case of centrality in heavy-ion collisions would run counter
> to that. Let's first consider a few things about centrality:
>
> A: Typically, one defines centrality in terms of a mapping from some
> observable (say the number of charged particles within some acceptance)
> to the fraction of the nuclear cross-section as seen by that observable.
>
> B: This means that the mapping from observable to centrality depends on
> the distribution of that observable.
>
> C: The distribution of such observables in general depends on the kind
> of physics that a given model implements, and in some cases also on the
> "tune" of that model - e.g., EPOS-LHC and HIJING do not produce the
> same distribution of the number of charged particles within the ALICE
> V0 acceptance, nor does AMPT with and without string melting give the
> same distribution of that observable.
>
> D: The distribution of such an observable also depends on the collision
> system and energy - e.g., the distribution of the number of charged
> particles within the ALICE V0 acceptance is not the same for Pb-Pb at
> 2.76 TeV as it is at 5.02 TeV, and clearly distinctly different from
> p-Pb at 5.02 TeV.
>
> E: Thus, the mapping from observable to centrality depends on at least
> four factors:
> 1: Collision system
> 2: Collision energy
> 3: Model
> 4: Model tune
>
> F: Thus, given that each analysis object should be run on multiple
> models and model tunes (not necessarily different collision systems and
> energies), it cannot implement the mapping from observable to
> centrality in any meaningful way.
>
> G: Thus, we need a mechanism to bring a collision-system-, energy-,
> model- and tune-dependent mapping from observable to centrality into
> the analysis object.
>
> The last step, G, can be accomplished in a variety of ways.
>
> 1: Rely on some external data - e.g., one could have YODA files that
> contain the mappings from observable to centrality stored in some
> appropriate format - say as histograms or scatters. Each analysis must
> then open those external files and read in the appropriate mapping.
> However, this poses a bit of a challenge: since the Analysis object
> does not know which model and tune (and perhaps not which collision
> system and energy) it is being run over, it cannot directly know which
> file to open.
> E.g., suppose you're running over HIJING data of Au-Au at
> 200 GeV; then we need, say, the file
> $RIVET_EXTERNAL_DATA/Centrality_Hijing_AuAu_200GeV.yoda, but since the
> analysis does not know the model, it cannot deduce that name.
> i: One way around that would be to have some interface in the
> management part of Rivet (say AnalysisManager or Run) to open such a
> file, and the Analysis object could then query that manager for the
> data needed to do the observable-to-centrality mapping.
> ii: Another way around it would be to be able to pass arguments to the
> analysis object. Then one could pass the name of the file to read in
> to the Analysis object.
>
> 2: Instead of storing the mapping from observable to centrality in
> external data, we could store it in external code. That is, based on
> some calibration pass of the model data, we define collision-system-,
> energy-, model- and tune-dependent code where we hard-code the mapping.
> We compile that code into a Rivet module (or shared library) which can
> be loaded at runtime by the user. Now, we still face the challenge of
> how to select the mapping appropriate for our model input. We can do
> that in a number of ways.
> i: One way allows the user to specify which object to instantiate for a
> given input, and then some manager object in Rivet makes sure to make
> an object of the corresponding class. The Analysis object can then
> retrieve the mappings from the manager object. This is a little akin to
> case 1.i above.
> ii: Another way is that the code we write is in itself an Analysis
> object (though a pseudo-one, because it doesn't actually do analysis)
> which sets up some projection with the proper mapping from observable
> to centrality. The projection is then stored in a fixed static
> interface, which the analysis object can pick up and apply as its own
> projection. In this case, we specify our mapping calibration by adding
> a specific Analysis to our run.
> This is where the ordering comes in - the pseudo-Analysis object (and
> the centrality projection) needs to be initialised _before_ any
> Analysis object that uses the projection.
>
> This is just one use case for something like this. Another case would
> be an Raa analysis. In such an analysis, one needs to loop over AA data
> to build up dN/dpT, and then in the end divide by some known pp dN/dpT.
> Now, normally one would do two independent loops over AA and pp, take
> the two outputs, and create a third output with the ratios in it. This
> is all well and fine when one has control of the running. But for model
> auto-tuners, it is not so straightforward. So what one could do is to
> write an Analysis class that contains the pp dN/dpT distribution and
> which posts that to some well-defined static interface. Then, in the AA
> Analysis object, one can retrieve that distribution and divide the AA
> dN/dpT distribution(s) by it. Again, the user would simply add a
> pseudo-Analysis to the Run _before_ the Raa Analysis object.
>
> I'm sure there are other similar use cases.
>
>> I am wary of adding more run-options to the library unless
>> there is a very compelling use-case.
>
> The patch (replacing std::set with std::vector) does not _add_ options
> to the library. It simply provides a way to specify the execution order
> of the Analysis objects. That is, if one says
>
>   rivet -a A,B,C
>
> then the code will always be executed as A first, then B, and finally
> C. Currently, the ordering is completely arbitrary and changes from run
> to run, as it depends on where the OS puts the objects in memory. That
> means that runs are not reproducible, which can make it extremely hard
> to debug issues.
>
>> rivet-nopy is intentionally very minimal, both because we don't want
>> to support any substantial features beyond basic analysis-running, and
>> because it's a simple example of how a steering code can run Rivet.
>> What have you "fleshed out"?
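[Editorial illustration of the std::set vs std::vector point above, with invented names, not Rivet code: iterating a std::set of raw pointers visits elements in pointer-address order, which depends on where the allocator happens to place each object, while a std::vector replays the insertion order exactly.]

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Minimal sketch (not Rivet code) of why holding analyses in a
// std::set<Analysis*> yields an arbitrary execution order while a
// std::vector preserves the order given on the command line: a set of
// raw pointers is ordered by address, not by insertion.
std::vector<std::string> runOrder(bool useVector) {
  static std::string a{"A"}, b{"B"}, c{"C"};
  std::vector<std::string> order;
  if (useVector) {
    std::vector<std::string*> analyses{&a, &b, &c};  // insertion order kept
    for (auto* p : analyses) order.push_back(*p);
  } else {
    std::set<std::string*> analyses{&a, &b, &c};     // address order: arbitrary
    for (auto* p : analyses) order.push_back(*p);
  }
  return order;
}
```

The vector variant is guaranteed to replay A, B, C; the set variant visits the same three objects, but in whatever order their addresses happen to compare, which is why runs over a set are not reproducible.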
>
> The easiest way to explain that is if you try to apply the patch,
> compile, and do
>
>   rivet-nopy --help
>
> I've basically tried to support all the options of the Python script
> rivet in the compiled code.
>
>> It might be that it's better for you to just keep this extension for
>> personal use.
>
> Well, if you do provide the binary, then your execution environment
> does not depend on Python directly - at least not for running the
> analyses. This could be beneficial in a non-homogeneous setting.
>
> Another benefit is that it's far easier to debug Rivet code when using
> a compiled program than via the Python interpreter.
>
>> It would also be good to keep these two distinct patches separate from
>> each other, so we can apply them "atomically" -- or do they depend on
>> each other somehow?
>
> No, the change of std::set to std::vector does not depend on the
> changes to rivet-nopy.cc. It is easy enough to disentangle them: just
> open the patch in an editor and cut-and-paste the relevant parts into
> another file and save.
>
>> Regarding parallelisation, we are in the lucky position of event
>> streams being "embarrassingly parallel"... as you've shown. You can
>> already handle most situations using rivet and yodamerge (or scripted
>> uses of their APIs), and developments in the pipeline for v3 will make
>> the "re-finalize" step possible as part of the core Rivet behaviour
>> without needing to specify special run-modes.
>
> How do you propagate the analysis objects into the Analysis objects?
> Suppose I had the Analysis class
>
>   struct A : public Analysis
>   {
>     Histo1DPtr _h;
>     std::mt19937 _g;
>     std::normal_distribution<> _d;
>
>     A() : Analysis("A"), _g(std::random_device{}()), _d(0, 1) {}
>     void init() { _h = bookHisto1D("h", 100, -3, 3, "x", "dN/dx"); }
>     void analyze(const Event&) { _h->fill(_d(_g)); }
>     void finalize() { _h->normalize(1); }
>   };
>
> When this is "re-finalized" it needs a valid object for A::_h, but we
> have no clear way of setting this from the outside (other than to
> remove it first using Analysis::removeAnalysisObject and then re-add it
> using Analysis::addAnalysisObject). So my suggestion would be to have
> Analysis developers write something like
>
>   void A::initFinalize(const std::vector<AnalysisObjectPtr>& l)
>   {
>     for (const auto& a : l) {
>       if (a->path() == histoPath() + "h")
>         _h = std::dynamic_pointer_cast<YODA::Histo1D>(a);
>     }
>   }
>
> and then call that member function for all Analysis objects before
> running the re-finalize. Another way to do this would be to first call
> the regular Analysis::init, and then have something like
>
>   void Analysis::initFinalize(const std::vector<AnalysisObjectPtr>& l)
>   {
>     for (auto& ta : _analysisObjects) {
>       for (const auto& la : l) {
>         if (ta->path() == la->path()) ta = la;
>       }
>     }
>     restoreFromAnalysisObjects();
>   }
>
>   void A::restoreFromAnalysisObjects()
>   {
>     for (const auto& ao : _analysisObjects) {
>       if (ao->path() == histoPath() + "h")
>         _h = std::dynamic_pointer_cast<YODA::Histo1D>(ao);
>     }
>   }
>
> A nice tool to have in mind for parallelisation is GNU parallel, which
> is a typical Unix-type tool - https://www.gnu.org/software/parallel/.
> For example, let's assume we have the files
>
> input1 ...
> input5
>
> then one could do something like
>
>   ls input* | parallel rivet -a ana1,ana2,ana3 -o {/.}.yoda
>
> Now, ideally, one should be able to do something like
>
>   ls input* | parallel rivet -a ana1,ana2,ana3 -o - | yodamerge - |
>   rivet-finalize - -o final.yoda
>
> The nice thing about parallel is that it can work as a simple cluster
> broker - that is, one can push jobs to remote hosts, which is kinda
> cool.
>
>> It's taken a lot longer
>> than hoped to get this working, though: a lot of potential pitfalls,
>> and a lack of time! (Which is why your last patch hasn't been applied
>> yet -- I thought David had done so, but am actually quite glad he
>> hasn't since I had some queries about that, too!)
>
> Well, let me know and I'll try to answer as best I can.
>
> Yours,

-- 
Dr Andy Buckley, Lecturer / Royal Society University Research Fellow
Particle Physics Expt Group, University of Glasgow