[Rivet] Segmentation fault when running agile-runmc on the grid

Andy Buckley andy.buckley at ed.ac.uk
Tue Oct 18 16:42:38 BST 2011


(Copying back to the Rivet list to keep this discussion in view)

Hmm, this is problematic... I certainly run Rivet quite a lot on the 
Grid for ATLAS MC tuning production etc., and haven't seen this problem. 
Static linking won't help, not least because AGILe *definitely* wants to 
load the libraries with dlopen, which needs shared libs. Can you set up 
your environment to run with an LHC experiment's compiler, etc.?
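
If it helps to see why the loader refuses the library, here is a minimal
sketch (not part of AGILe or LHAPDF) that asks the dynamic loader directly
from inside the Grid job. The function names try_load and
address_space_limit are made up for illustration. It assumes a Python with
ctypes, i.e. 2.5 or later, so not the system Python 2.4. The address-space
check is only a guess at one possible cause of a "Cannot allocate memory"
message; nothing in this thread confirms it.

```python
import ctypes
import resource
import sys


def try_load(path):
    """Try to dlopen the shared library at `path`.

    Returns (True, "loaded") on success, or (False, message) where
    message is the text reported by the dynamic loader.
    """
    try:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        return False, str(err)
    return True, "loaded"


def address_space_limit():
    """Return the (soft, hard) virtual memory limits as readable strings."""
    def show(value):
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return "%d bytes" % value

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return show(soft), show(hard)


if __name__ == "__main__":
    # Usage (hypothetical script name): python loadcheck.py /path/to/libLHAPDF.so
    soft, hard = address_space_limit()
    sys.stdout.write("address space limit: soft=%s hard=%s\n" % (soft, hard))
    for lib in sys.argv[1:]:
        ok, message = try_load(lib)
        sys.stdout.write("%s: %s\n" % (lib, message))
```

Running the same script on the worker node directly and through a submitted
job, then comparing the two outputs, should show whether the environments
really differ.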

To the second question, there is no way to get generator-specific 
information like the values of internal arrays in Rivet, because the 
only way that Rivet can receive information from the generator is in the 
HepMC events themselves. However, I did propose a while ago that the 
HepMC GenEvent class be extended to allow attaching of general-purpose 
"annotations", for exactly this sort of purpose:
https://savannah.cern.ch/support/?122948
If you would find this useful (anyone, including other Rivet list 
members!) then please add a supporting comment on that Savannah ticket!

Andy


On 18/10/11 16:44, Sara Alderweireldt wrote:
> Hi Andy,
>
> Strangely enough, the nm now returns a T value (I've no clue what I
> changed), but the segmentation fault persists. Loading libLHAPDF still
> fails. From what I've seen, this is unrelated to AGILe, so I tried one
> of the Pythia8 examples with lhapdf to see whether that gives me more
> insight into the LHAPDF issue. It returns:
>
> ./main42.exe: error while loading shared libraries: libLHAPDF.so.0: failed to map segment from shared object: Cannot allocate memory
>
> Regarding the versions you mention, the older gcc and python are indeed
> quite annoying. However, if it really were a problem, the segmentation
> fault would appear if I ran directly on a worker node too, and it
> doesn't. It only appears if I submit the job to the grid.
>
> I'm still hoping to solve this, but it seems rather impossible at the
> moment. Just an idea, would it make a difference if LHAPDF was linked
> statically, and if so, where can I change this linking?
>
> And for something entirely different, in Rivet, is it possible to access
> pythia6 parameters such as the msti's? I've looked in the code and using
> the pythiawrapper in agile or hepmc I can print them at runtime
> (FC_PYPARS.msti[]), but I would prefer to get them to a histogram. Do
> you know how to do this?
>
> Sorry again to bother you repeatedly with all these questions, but
> there's no one around locally that is able to help me with this.
>
> Cheers,
> Sara
>
> On 10/17/2011 01:44 AM, Andy Buckley wrote:
>> Afraid not... to me that sounds like the Grid installation of LHAPDF
>> is strangely broken, but I have no idea how it would have ended up
>> that way :S
>>
>> Ah, one point to address in your email, which reveals how much Grid
>> site admins actually tend to know about LCG operations: the *system*
>> compiler and Python are GCC 4.1 and Python 2.4, but the ones for which
>> all the LCG packages are built and designed are GCC 4.3 and Python
>> 2.6. So at the start of your Grid script you need to source setup
>> scripts (somehow) for the correct LCG compiler and Python versions.
>> That could have something to do with your problem -- there shouldn't
>> be any GCC 4.1 build of LHAPDF at all.
>>
>> Andy
>>
>>
>> On 14/10/11 13:05, Sara Alderweireldt wrote:
>>> Hi Andy,
>>>
>>> I tried running the nm command before, and locally it returns a T, like
>>> you report, whereas on the grid it returns a U. I don't know what causes
>>> this difference, but now that you explain the symbols, it seems to be
>>> significant.
>>>
>>> locally: "000000000003e210 T pdfset_"
>>> on the grid: " U pdfset_"
>>>
>>>
>>> I checked again (and asked the grid admins) and in principle all
>>> machines and worker nodes are supposed to have the same architecture and
>>> compiler environment, all slc5 x86_64 with gcc 4.1.2 and python 2.4.3.
>>>
>>> More ideas :)?
>>>
>>> Cheers,
>>> Sara
>>>
>>> On 10/13/2011 05:15 PM, Andy Buckley wrote:
>>>> Hi Sara,
>>>>
>>>> This complaint about a missing pdfset_ symbol is odd: that should
>>>> definitely be defined in libLHAPDF:
>>>>
>>>> andy at duality:~$ nm heplocal/lib/libLHAPDF.so | grep pdfset_
>>>> 000d0f70 T finitpdfset_
>>>> 00014340 T initpdfset_
>>>> 00036870 T pdfset_
>>>>
>>>> (the "T" means that the symbol is in the "text" section of the
>>>> library, i.e. it is defined in the library file rather than just
>>>> declared as something that will be eventually found elsewhere, which
>>>> would show a "U")
>>>>
>>>> You did the right thing to enable the TRACE output, and indeed you see
>>>> that libLHAPDF cannot be loaded. My suspicion is that the job running
>>>> on the Grid node has a different architecture or compiler environment,
>>>> which is why that library cannot be loaded. For example, if the job
>>>> running on the Grid is in a 32 bit environment but that library is 64
>>>> bit, then indeed the dlopen library loading will fail. Can you check
>>>> that?
>>>>
>>>> Cheers,
>>>> Andy
>>>>
>>>>
>>>> On 13/10/11 10:16, Sara Alderweireldt wrote:
>>>>> Hello,
>>>>>
>>>>> To continue on this issue, which is still unsolved: I ran only agile
>>>>> (submitted to the grid), with external PDF and TRACE output. It seems to
>>>>> be finding and loading everything, except (as could be expected)
>>>>> libLHAPDF.so. In that case it finds the library, but can't successfully
>>>>> load it:
>>>>>
>>>>> AGILe.Loader: TRACE Testing for /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Trying to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Failed to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>>
>>>>> If I run the exact same command locally (m-machines in Brussels), the
>>>>> problem is gone:
>>>>>
>>>>> AGILe.Loader: TRACE Testing for /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Trying to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>>> AGILe.Loader: TRACE Successfully loaded /localgrid/salderwe/TEST/lib/libLHAPDF.so (0xb478560)
>>>>>
>>>>> To be complete, the command I ran was:
>>>>>
>>>>> agile-runmc Pythia6:425 -b LHC:7000 -n 10 -p PYTUNE=343 -o test.hepmc -l AGILe.Loader=TRACE
>>>>>
>>>>> Given this output, I don't know whether I'm still posting this question
>>>>> to the right people, maybe I need LHAPDF support instead. In any case,
>>>>> it's really puzzling me. I'd be happy with any suggestion on how to move
>>>>> forward with this problem. Would it for instance be possible to get
>>>>> error messages from LHAPDF or the system in general on what exactly goes
>>>>> wrong with this loading of libLHAPDF?
>>>>>
>>>>> Best regards,
>>>>> Sara
>>>>>
>>>>> On 10/10/2011 10:24 AM, Sara Alderweireldt wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've been running agile+rivet locally for a while now, and this week
>>>>>> attempted moving my runs to the grid. I still access my own copy of
>>>>>> agile+rivet (which locally runs fine) and use a python script which
>>>>>> calls 'agile-runmc ... &' and 'rivet ...'. If I use the PYTUNE or
>>>>>> MSTP(52) flag to set an external PDF from lhapdf, I get a segmentation
>>>>>> fault when running on the grid, and no problem when running locally.
>>>>>> If I use internal PDFs included in pythia6, everything runs fine both
>>>>>> on the grid and locally.
>>>>>>
>>>>>> At some point, I also had this segmentation fault locally, and traced
>>>>>> it back with gdb (and a lot of manual print statements) to:
>>>>>> /line 471 throw runtime_error((string("Failed to load libraries: ") +
>>>>>> dlerror()).c_str());/
>>>>>> in AGILe-1.3.0/src/Core/Loader.cc. Recompiling both LHAPDF and pythia6
>>>>>> solved this.
>>>>>>
>>>>>> If I comment out the runtime_error and run on the grid, I get a python
>>>>>> symbol lookup error:
>>>>>> python: symbol lookup error: mydirs/libpythia6.so: undefined symbol: pdfset_
>>>>>>
>>>>>> Do you have any idea what might cause this or what I could try to
>>>>>> trace it back further? I'm entirely puzzled by the fact that
>>>>>> everything is fine when processing locally and not when submitting to
>>>>>> the grid, both methods are accessing the same hard drive with the
>>>>>> agile & rivet distributions on it. I tried tracking from the
>>>>>> agile-runmc script and got to FPythia.cc which calls PYEVT (at which
>>>>>> point the symbol lookup error arrives, no events are produced), but I
>>>>>> can't figure out where it goes wrong exactly or how to solve it.
>>>>>>
>>>>>> I checked when running on the grid what the output of 'lhapdf-config
>>>>>> --pdfsets-path' is, and it is returning the correct folder and the
>>>>>> needed LHpdf file is there. If versions are relevant, I'm using agile
>>>>>> 1.3.0, rivet 1.6.0, pythia 6.425, lhapdf 5.8.6, python 2.4.3 and gcc
>>>>>> 4.1.2. I hope you can shed some light on this.
>>>>>>
>>>>>> Best regards and thanks in advance,
>>>>>> Sara
> --
>
> Sara Alderweireldt sara.alderweireldt at ua.ac.be
> <mailto:sara.alderweireldt at ua.ac.be>
> Universiteit Antwerpen Phone: +32 (0)3 265 3577
> CGB.U.237 - Physics Skype: sara.alderweireldt
> Groenenborgerlaan 171
> 2020 Antwerpen http://www.ua.ac.be/edf
> Belgium
>


-- 
Dr Andy Buckley
SUPA Advanced Research Fellow
Particle Physics Experiment Group, University of Edinburgh

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


