[Rivet] Patch for Rivet to allow ordering of analyses and more fleshed out rivet-nopy

cholm cholm at nbi.dk
Tue May 2 09:04:04 BST 2017


Hi Andy,

This is a relatively long reply - sorry, but the challenges faced are 
a bit complicated :-/

On 2017-05-01 17:11, Andy Buckley wrote:
> Hi,
> 
> Can you explain what you mean by analysis-ordering?

Sure. The ability to specify in exactly which order the analyses are 
executed.

> The analyses run
> independently of each other --

That seems to be the paradigm used in Rivet, which in some cases seems 
to be a sub-optimal way of doing things.  Let's take projections. 
Analyses A and B may both declare a projection P.  In the current 
code, the P projection is run twice, since the projections stored in 
the Event object are specific to each Analysis object.  If, and only 
if, two projections compare equal, there is no reason to redo the 
projection.  So instead of running the projections independently and 
tying them to a specific Analysis object, the ProjectionManager could 
use the provided compare member functions to ensure uniqueness of all 
projections.  That would in many cases reduce the number of loops over 
the event data considerably - e.g., if one has 10 Analysis objects, 
all using the FinalState projection (with no additional cuts), then 
instead of doing 10 loops over the data, we'd do one.
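
To make the idea concrete, here is a minimal self-contained sketch - 
the ProjectionManager interface shown is hypothetical, not the current 
Rivet API:

   // Hypothetical sketch: keep one instance per equivalence class of
   // projections, using their compare member functions.
   #include <memory>
   #include <typeinfo>
   #include <vector>

   struct Projection {
     virtual ~Projection() = default;
     // In the spirit of Rivet's projection comparison: 0 = equivalent.
     virtual int compare(const Projection& other) const = 0;
   };

   struct ProjectionManager {
     std::vector<std::shared_ptr<Projection>> unique;
     // Return an existing equivalent projection, or register this one.
     std::shared_ptr<Projection> registerUnique(std::shared_ptr<Projection> p) {
       for (auto& u : unique)
         if (typeid(*u) == typeid(*p) && u->compare(*p) == 0)
           return u;         // reuse: the event loop runs P only once
       unique.push_back(p);  // first of its kind: keep it
       return p;
     }
   };

With something like this, the 10 identical FinalState declarations 
would all resolve to one registered instance, projected once per event.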


> there should be no meaning to "ordering".

Well, the case of centrality in heavy-ion collisions would run counter 
to that.  Let's first consider a few things about centrality:

A: Typically, one defines centrality in terms of a mapping from some 
observable (say the number of charged particles within some 
acceptance) to the fraction of the nuclear cross-section as seen by 
that observable.

B: This means that the mapping from observable to centrality depends on 
the distribution of that observable.

C: The distribution of such observables in general depends on the kind 
of physics that a given model implements, and in some cases also on 
the "tune" of that model - e.g., EPOS-LHC and HIJING do not produce 
the same distribution of the number of charged particles within the 
ALICE V0 acceptance, nor does AMPT with and without string melting 
give the same distribution of that observable.

D: The distribution of such an observable also depends on the 
collision system and energy - e.g., the distribution of the number of 
charged particles within the ALICE V0 acceptance is not the same for 
Pb-Pb at 2.76 TeV as at 5.02 TeV, and is clearly different from p-Pb 
at 5.02 TeV.

E: Thus, the mapping from observable to centrality depends on at least 
four factors:
   1: Collision system
   2: Collision energy
   3: Model
   4: Model tune

F: Thus, given that each analysis object should be run on multiple 
models and model tunes (not necessarily different collision systems and 
energies), it cannot implement the mapping from observable to centrality 
in any meaningful way.

G: Thus, we need a mechanism to bring a collision-system-, energy-, 
model-, and tune-dependent mapping from observable to centrality into 
the Analysis object.

The last step, G, can be accomplished in a variety of ways.

1: Rely on some external data - e.g., one could have YODA files that 
contain the mappings from observable to centrality stored in some 
appropriate format - say as histograms or scatters.  Each analysis 
must then open those external files and read in the appropriate 
mapping.  However, this poses a bit of a challenge: since the Analysis 
object does not know which model and tune (and perhaps not which 
collision system and energy) it is being run over, it cannot directly 
know which file to open.  E.g., suppose you're running over HIJING 
data for Au-Au at 200 GeV; then we need, say, the file 
$RIVET_EXTERNAL_DATA/Centrality_Hijing_AuAu_200GeV.yoda, but since the 
analysis does not know the model it cannot deduce that name.
i: One way around that would be to have some interface in the 
management part of Rivet (say AnalysisManager or Run) to open such a 
file; the Analysis object could then query that manager for the data 
needed to do the observable-to-centrality mapping (a rough sketch of 
this is given below).
ii: Another way around it would be the ability to pass arguments to 
the Analysis object.  Then one could pass the name of the file to read 
to the Analysis object.
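
For illustration, a rough sketch of option 1.i - all names here 
(RunManager, centralityCalibFile) are made up for the example, not 
existing Rivet interfaces:

   #include <string>

   // The run-level manager is told model, system, and energy by the
   // user, so it can compose the file name the Analysis cannot deduce.
   struct RunManager {
     std::string model, system, energy; // e.g. "Hijing", "AuAu", "200GeV"
     std::string centralityCalibFile() const {
       return "Centrality_" + model + "_" + system + "_" + energy + ".yoda";
     }
   };
   // An Analysis::init() would then ask the manager for the mapping
   // read from that file, instead of trying to guess the name itself.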

2: Instead of storing the mapping from observable to centrality in 
external data, we could store it in external code.  That is, based on 
some calibration pass of the model data, we define collision-system-, 
energy-, model-, and tune-dependent code in which we hard-code the 
mapping.  We compile that code into a Rivet module (or shared library) 
which can be loaded at runtime by the user.  Now, we still face the 
challenge of how to select the mapping appropriate for our model 
input.  We can do that in a number of ways.
i: One way allows the user to specify which class to instantiate for a 
given input, and then some manager object in Rivet makes sure to 
create an object of the corresponding class.  The Analysis object can 
then retrieve the mappings from the manager object.  This is a little 
akin to case 1.i above.
ii: Another way is that the code we write is in itself an Analysis 
object (though a pseudo one, because it doesn't actually do analysis) 
which sets up some projection with the proper mapping from observable 
to centrality.  The projection is then stored in a fixed static 
interface, which the analysis object can pick up and apply as its own 
projection.  In this case, we specify our mapping calibration by 
adding a specific Analysis to our run.  This is where the ordering 
comes in - the pseudo-Analysis object (and the centrality projection) 
needs to be initialised _before_ any Analysis object that uses the 
projection.
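
A rough, self-contained sketch of mechanism 2.ii - every name here is 
illustrative, not existing Rivet code - showing why the calibration 
module must be initialised first:

   #include <map>
   #include <memory>
   #include <string>

   struct CentralityMapping {                // stand-in for a projection
     virtual ~CentralityMapping() = default;
     virtual double centrality(double obs) const = 0;
   };

   // The fixed static interface the pseudo-Analysis posts to.
   struct CentralityService {
     static std::map<std::string, std::shared_ptr<CentralityMapping>> registry;
   };
   std::map<std::string, std::shared_ptr<CentralityMapping>>
     CentralityService::registry;

   // One compiled calibration module, hard-coding a (toy) mapping.
   struct HijingAuAu200Centrality : CentralityMapping {
     double centrality(double obs) const override {
       return obs > 500 ? 0.05 : 0.5;        // toy numbers only
     }
   };

   // The pseudo-Analysis init() would register the mapping:
   //   CentralityService::registry["V0M"] =
   //     std::make_shared<HijingAuAu200Centrality>();
   // and a real Analysis init(), run _after_ it, would look it up:
   //   auto cmap = CentralityService::registry.at("V0M");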

This is just one use case for something like this.  Another case would 
be an Raa analysis.  In such an analysis, one needs to loop over AA 
data to build up dN/dpT, and then in the end divide by some known pp 
dN/dpT.  Now, normally one would do two independent loops over AA and 
pp, take the two outputs, and create a third output with the ratios in 
it.  This is all well and fine when one has control of the running. 
But for model auto-tuners it is not so straightforward.  So what one 
could do is to write an Analysis class that contains the pp dN/dpT 
distribution and which posts that to some well-defined static 
interface.  Then in the AA Analysis object, one can retrieve that 
distribution and divide the AA dN/dpT distribution(s) by it.  Again, 
the user would simply add a pseudo-Analysis to the Run _before_ the 
Raa Analysis object.
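
The same static-interface trick in sketch form for the Raa case 
(again, all names are made up, and a plain std::vector<double> stands 
in for the dN/dpT histogram):

   #include <memory>
   #include <vector>

   // The well-defined static interface the pp pseudo-Analysis posts to.
   struct PPReference {
     static std::shared_ptr<std::vector<double>> dNdpT;
   };
   std::shared_ptr<std::vector<double>> PPReference::dNdpT;

   // The pp pseudo-Analysis, added to the Run first, would do:
   //   PPReference::dNdpT = std::make_shared<std::vector<double>>(ppBins);
   // and the AA Analysis would form Raa in its finalize():
   //   for (size_t i = 0; i < aaBins.size(); ++i)
   //     raa[i] = aaBins[i] / (*PPReference::dNdpT)[i];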

I'm sure there are other similar use cases.

> I am wary of adding more run-options to the library unless
> there is a very compelling use-case.

The patch (replacing std::set with std::vector) does not _add_ options 
to the library.  It simply provides a way to specify the execution 
order of the Analysis objects.  That is, if one says

   rivet -a A,B,C

then the code will always be executed as A first, then B, and finally 
C.  Currently, the ordering is completely arbitrary and changes from 
run to run, as it depends on where the OS puts the objects in memory. 
That means that runs are not reproducible, which can make it extremely 
hard to debug issues.
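
The reason for the arbitrariness is easy to demonstrate: a std::set 
keyed on shared_ptr handles orders by the stored pointer values, i.e., 
by heap addresses, while a std::vector keeps the insertion order.  A 
self-contained illustration (std::string standing in for the Analysis 
handle):

   #include <iostream>
   #include <memory>
   #include <set>
   #include <string>
   #include <vector>

   int main() {
     using AnaHandle = std::shared_ptr<std::string>;
     std::set<AnaHandle>    s;  // iterates in heap-address order
     std::vector<AnaHandle> v;  // iterates in the order of -a A,B,C
     for (auto n : {"A", "B", "C"}) {
       auto h = std::make_shared<std::string>(n);
       s.insert(h);
       v.push_back(h);
     }
     for (auto& h : s) std::cout << *h << ' ';  // e.g. "B A C" - varies
     std::cout << '\n';
     for (auto& h : v) std::cout << *h << ' ';  // always "A B C"
     std::cout << '\n';
     return 0;
   }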

> rivet-nopy is intentionally very minimal, both because we don't want
> to support any substantial features beyond basic analysis-running, and
> because it's a simple example of how a steering code can run Rivet.
> What have you "fleshed out"?

The easiest way to explain that is if you try to apply the patch, 
compile and do

   rivet-nopy --help

I've basically tried to support all the options of the Python script 
rivet in the compiled code.

> It might be that it's better for you to just keep this extension for 
> personal use.

Well, if you do provide the binary, then your execution environment 
does not depend on Python directly - at least not for running the 
analyses.  This could be beneficial in a non-homogeneous setting.

Another benefit is that it's far easier to debug Rivet code when using a 
compiled program than via the Python interpreter.

> It would also be good to keep these two distinct patches separate from
> each other, so we can apply them "atomically" -- or do they depend on
> each other somehow?

No, the change of std::set to std::vector does not depend on the 
changes to rivet-nopy.cc.  In any case, it is easy enough to 
disentangle them: just open the patch in an editor and cut-and-paste 
the relevant parts into another file and save.

> Regarding parallelisation, we are in the lucky position of event
> streams being "embarrassingly parallel"... as you've shown. You can
> already handle most situations using rivet and yodamerge (or scripted
> uses of their APIs), and developments in the pipeline for v3 will make
> the "re-finalize" step possible as part of the core Rivet behaviour
> without needing to specify special run-modes.

How do you propagate the (YODA) analysis objects back into the 
Analysis objects?  Suppose I had the Analysis class

   struct A : public Analysis
   {
     Histo1DPtr _h;
     std::mt19937 _g;
     std::normal_distribution<> _d;
     // Note: std::random_device must be invoked to produce a seed
     A() : Analysis("A"), _g(std::random_device{}()), _d(0,1) {}
     void init() { _h = bookHisto1D("h",100,-3,3,"x","dN/dx"); }
     void analyze(const Event&) { _h->fill(_d(_g)); }
     void finalize() { _h->normalize(1); }
   };

When this is "re-finalized" it needs a valid object for A::_h, but we 
have no clear way of setting this from the outside (other than 
removing it first using Analysis::removeAnalysisObject and then 
re-adding it using Analysis::addAnalysisObject).  So my suggestion 
would be to have Analysis developers write something like

   void A::initFinalize(const std::vector<AnalysisObjectPtr>& l)
   {
     for (const auto& a : l) {
       const std::string an(a->path());
       if (an == histoPath("h"))
         _h = std::dynamic_pointer_cast<YODA::Histo1D>(a);
     }
   }

and then call that member function for all Analysis objects before 
running the re-finalize.  Another way to do this would be to first 
call the regular Analysis::init, and then have something like

   void Analysis::initFinalize(const std::vector<AnalysisObjectPtr>& l)
   {
     for (auto& ta : _analysisObjects) {     // by reference, so the
       const std::string tn(ta->path());     // container is updated
       for (auto la : l) {
         if (tn == la->path()) ta.swap(la);  // adopt the loaded object
       }
     }
     restoreFromAnalysisObjects();           // virtual, see below
   }

   // ... and each analysis overrides the restore hook:
   void A::restoreFromAnalysisObjects()
   {
     for (const auto& ao : _analysisObjects) {
       const std::string an(ao->path());
       if (an == histoPath("h"))
         _h = std::dynamic_pointer_cast<YODA::Histo1D>(ao);
     }
   }

A nice tool to have in mind for parallelisation is GNU parallel, a 
typical Unix-style tool - https://www.gnu.org/software/parallel/.  For 
example, let's assume we have the "files"

   input1 ... input5

then one could do something like

   ls input* | parallel rivet -a ana1,ana2,ana3 -o {/.}.yoda {}

(note the trailing {}: GNU parallel only appends the input argument 
automatically when the command contains no replacement string, and 
{/.} - the basename without extension - is one).

Now, ideally, one should be able to do something like

   ls input* | parallel rivet -a ana1,ana2,ana3 -o - | \
     yodamerge - | rivet-finalize - -o final.yoda

The nice thing about parallel is that it can work as a simple cluster 
broker - that is, one can push jobs to remote hosts, which is kinda 
cool.
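
For example (hypothetical host names; this assumes rivet is installed 
on the remote machines, and uses GNU parallel's --sshlogin/--transfer/
--return options):

   ls input* | parallel -S server1,server2 --transfer \
        --return {/.}.yoda --cleanup \
        rivet -a ana1,ana2,ana3 -o {/.}.yoda {}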

> It's taken a lot longer
> than hoped to get this working, though: a lot of potential pitfalls,
> and a lack of time! (Which is why your last patch hasn't been applied
> yet -- I thought David had done so, but am actually quite glad he
> hasn't since I had some queries about that, too!)

Well, let me know and I'll try to answer as best I can.

Yours,

-- 
Christian Holm Christensen 
-------------------------------------------------
  Niels Bohr Institute, Blegdamsvej 17, DK-2100 Copenhagen
  http://cern.ch/cholm, +4524618591

