[Rivet] Patch for Rivet to allow ordering of analyses and more fleshed out rivet-nopy
cholm at nbi.dk
Tue May 2 09:04:04 BST 2017
Hi Andy,

This is a relatively long reply - sorry, but the challenges faced are a bit complicated :-/

On 2017-05-01 17:11, Andy Buckley wrote:
> Hi,
>
> Can you explain what you mean by analysis-ordering?

Sure. The ability to specify in exactly which order the analyses are executed.

> The analyses run independently of each other --

That seems to be the paradigm used in Rivet, which in some cases seems a sub-optimal way of doing things. Take projections: Analysis A and Analysis B may both declare a projection P. In the current code, P is run twice, since the projections stored in the Event object are specific to the Analysis object. If, and only if, two projections compare equal, there is no reason to redo the projection. So instead of running the projections independently and tying them to a specific Analysis object, the ProjectionManager could use the provided compare member functions to ensure uniqueness of all projections. That would in many cases reduce the number of loops over the event data considerably - e.g., with 10 Analysis objects all using the FinalState projection (with no additional cuts), instead of doing 10 loops over the data we would do one.

> there should be no meaning to "ordering".

Well, the case of centrality in heavy-ion collisions runs counter to that. Let's first consider a few things about centrality:

A: Typically, one defines centrality in terms of a mapping from some observable (say the number of charged particles within some acceptance) to the fraction of the nuclear cross-section as seen by that observable.

B: This means that the mapping from observable to centrality depends on the distribution of that observable.
C: The distribution of such observables in general depends on the kind of physics that a given model implements, and in some cases also on the "tune" of that model - e.g., EPOS-LHC and HIJING do not produce the same distribution of the number of charged particles within the ALICE V0 acceptance, nor does AMPT with and without string melting give the same distribution of that observable.

D: The distribution of such an observable also depends on the collision system and energy - e.g., the distribution of the number of charged particles within the ALICE V0 acceptance is not the same for Pb-Pb at 2.76TeV as at 5.02TeV, and is clearly distinct from p-Pb at 5.02TeV.

E: Thus, the mapping from observable to centrality depends on at least four factors:
  1: Collision system
  2: Collision energy
  3: Model
  4: Model tune

F: Thus, given that each Analysis object should be run on multiple models and model tunes (though not necessarily different collision systems and energies), it cannot implement the mapping from observable to centrality in any meaningful way.

G: Thus, we need a mechanism to bring a collision-system-, energy-, model-, and tune-dependent mapping from observable to centrality into the Analysis object.

The last step, G, can be accomplished in a variety of ways.

1: Rely on some external data - e.g., one could have YODA files that contain the mappings from observable to centrality stored in some appropriate format - say as histograms or scatters. Each analysis must then open those external files and read in the appropriate mapping. However, this poses a bit of a challenge: since the Analysis object does not know which model and tune (and perhaps not which collision system and energy) it is being run over, it cannot directly know which file to open.
E.g., suppose you're running over HIJING data of Au-Au at 200GeV; then we need, say, the file $RIVET_EXTERNAL_DATA/Centrality_Hijing_AuAu_200GeV.yoda, but since the analysis does not know the model it cannot deduce that name.

  i: One way around that would be to have some interface in the management part of Rivet (say AnalysisManager or Run) to open such a file; the Analysis object could then query that manager for the data needed to do the observable-to-centrality mapping.

  ii: Another way around that would be the ability to pass arguments to the Analysis object. Then one could pass the name of the file to read to the Analysis object.

2: Instead of storing the mapping from observable to centrality in external data, we could store it in external code. That is, based on some calibration pass of the model data, we define collision-system-, energy-, model-, and tune-dependent code where we hard-code the mapping. We compile that code into a Rivet module (or shared library) which can be loaded at runtime by the user. Now, we still face the challenge of how to select the mapping appropriate for our model input. We can do that in a number of ways.

  i: One way allows the user to specify which object to instantiate for a given input, and then some manager object in Rivet makes sure to create an object of the corresponding class. The Analysis object can then retrieve the mappings from the manager object. This is a little akin to case 1.i above.

  ii: Another way is that the code we write is itself an Analysis object (though a pseudo one, because it doesn't actually do analysis) which sets up some projection with the proper mapping from observable to centrality. The projection is then stored in a fixed static interface, which the analysis object can pick up and apply as its own projection. In this case, we specify our mapping calibration by adding a specific Analysis to our run.
This is where the ordering comes in - the pseudo-Analysis object (and the centrality projection) needs to be initialised _before_ any Analysis object that uses the projection.

This is just one use case. Another would be an R_AA analysis. In such an analysis, one needs to loop over AA data to build up dN/dpT, and then in the end divide by some known pp dN/dpT. Normally one would do two independent loops over AA and pp, take the two outputs, and create a third output with the ratios. This is all well and fine when one has control of the running, but for model auto-tuners it is not so straightforward. So what one could do is write an Analysis class that contains the pp dN/dpT distribution and posts it to some well-defined static interface. Then, in the AA Analysis object, one can retrieve that distribution and divide the AA dN/dpT distribution(s) by it. Again, the user would simply add a pseudo-Analysis to the Run _before_ the R_AA Analysis object. I'm sure there are other similar use cases.

> I am wary of adding more run-options to the library unless
> there is a very compelling use-case.

The patch (replacing std::set with std::vector) does not _add_ options to the library. It simply provides a way to specify the execution order of the Analysis objects. That is, if one says

  rivet -a A,B,C

then the code will always be executed as A first, then B, and finally C. Currently, the ordering is completely arbitrary and changes from run to run, as it depends on where the OS puts the objects in memory. That means each run is not reproducible, which can make it extremely hard to debug issues.

> rivet-nopy is intentionally very minimal, both because we don't want
> to support any substantial features beyond basic analysis-running, and
> because it's a simple example of how a steering code can run Rivet.
> What have you "fleshed out"?
The easiest way to explain that is if you apply the patch, compile, and do

  rivet-nopy --help

I've basically tried to support all the options of the Python script rivet in the compiled code.

> It might be that it's better for you to just keep this extension for
> personal use.

Well, if you do provide the binary, then your execution environment does not depend on Python directly - at least not for running the analyses. This could be beneficial in a non-homogeneous setting. Another benefit is that it's far easier to debug Rivet code using a compiled program than via the Python interpreter.

> It would also be good to keep these two distinct patches separate from
> each other, so we can apply them "atomically" -- or do they depend on
> each other somehow?

No, the change of std::set to std::vector does not depend on the changes to rivet-nopy.cc, and it is easy enough to disentangle them: just open the patch in an editor, cut-and-paste the relevant parts into another file, and save.

> Regarding parallelisation, we are in the lucky position of event
> streams being "embarrassingly parallel"... as you've shown. You can
> already handle most situations using rivet and yodamerge (or scripted
> uses of their APIs), and developments in the pipeline for v3 will make
> the "re-finalize" step possible as part of the core Rivet behaviour
> without needing to specify special run-modes.

How do you propagate the analysis objects into the Analysis objects?
Suppose I had the Analysis class

  struct A : public Analysis {
    Histo1DPtr _h;
    std::mt19937 _g;
    std::normal_distribution<> _d;
    A() : Analysis("A"), _g(std::random_device{}()), _d(0,1) {}
    void init() { _h = bookHisto1D("h",100,-3,3,"x","dN/dx"); }
    void analyze(const Event&) { _h->fill(_d(_g)); }
    void finalize() { _h->normalize(1); }
  };

When this is "re-finalized" it needs a valid object for A::_h, but we have no clear way of setting this from the outside (other than removing it first using Analysis::removeAnalysisObject and then re-adding it using Analysis::addAnalysisObject). So my suggestion would be to have Analysis developers write something like

  void A::initFinalize(const std::vector<AnalysisObjectPtr>& l) {
    for (const auto& a : l) {
      std::string an(a->path());
      if (an == histoPath() + "h") _h = dynamic_pointer_cast<YODA::Histo1D>(a);
    }
  }

and then call that member function for all Analysis objects before running the re-finalize. Another way to do this would be to first call the regular Analysis::init, and then have something like

  void Analysis::initFinalize(const std::vector<AnalysisObjectPtr>& l) {
    for (auto& ta : _analysisObjects) {
      std::string tn(ta->path());
      for (auto la : l) {
        if (tn == la->path()) ta.swap(la);
      }
    }
    restoreFromAnalysisObjects();
  }

  void A::restoreFromAnalysisObjects() {
    for (const auto& ao : _analysisObjects) {
      std::string an(ao->path());
      if (an == histoPath() + "h") _h = dynamic_pointer_cast<YODA::Histo1D>(ao);
    }
  }

A nice tool to have in mind for parallelisation is GNU parallel, a typical Unix-type tool - https://www.gnu.org/software/parallel/. For example, assuming we have the "files" input1 ... input5, one could do something like

  ls input* | parallel rivet -a ana1,ana2,ana3 -o {/.}.yoda

Ideally, one should be able to do something like

  ls input* | parallel rivet -a ana1,ana2,ana3 -o - | yodamerge - | rivet-finalize - -o final.yoda

The nice thing about parallel is that it can work as a simple cluster broker - that is, one can push jobs to remote hosts, which is kinda cool.
> It's taken a lot longer
> than hoped to get this working, though: a lot of potential pitfalls,
> and a lack of time! (Which is why your last patch hasn't been applied
> yet -- I thought David had done so, but am actually quite glad he
> hasn't since I had some queries about that, too!)

Well, let me know and I'll try to answer as best I can.

Yours,
--
Christian Holm Christensen
-------------------------------------------------
Niels Bohr Institute, Blegdamsvej 17, DK-2100 Copenhagen
http://cern.ch/cholm, +4524618591