[Rivet] Segmentation fault when running agile-runmc on the grid

Andy Buckley andy.buckley at ed.ac.uk
Mon Oct 17 00:44:51 BST 2011


Afraid not... to me that sounds like the Grid installation of LHAPDF is 
strangely broken, but I have no idea how it would have ended up that way :S

Ah, one point to address in your email, which reveals how much Grid site 
admins actually tend to know about LCG operations: the *system* compiler 
and Python are GCC 4.1 and Python 2.4, but the ones for which all the 
LCG packages are built and designed are GCC 4.3 and Python 2.6. So at 
the start of your Grid script you need to source setup scripts (somehow) 
for the correct LCG compiler and Python versions. That could have 
something to do with your problem -- there shouldn't be any GCC 4.1 
build of LHAPDF at all.
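
A minimal sketch of how a Grid job could report what it is actually running
under, and why a library refuses to load. It assumes a Python with ctypes
and "except ... as" syntax (2.6 or later, so the LCG Python rather than the
system 2.4). The function names are made up for illustration and are not
part of AGILe or LHAPDF; loading with RTLD_GLOBAL is an assumption, not
necessarily what AGILe's loader does.

```python
import ctypes
import platform
import sys


def describe_environment():
    """Collect the interpreter details that matter for binary compatibility."""
    return {
        "python": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "compiler": platform.python_compiler(),  # compiler that built this Python
        "machine": platform.machine(),           # e.g. x86_64 or i686
        "bits": platform.architecture()[0],      # "32bit" or "64bit" interpreter
    }


def try_load(path):
    """Try to dlopen a shared library and return (ok, message)."""
    try:
        # RTLD_GLOBAL makes the library's symbols visible to libraries
        # loaded afterwards (an assumption about how it should be loaded).
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        # On Linux the exception text is the dlerror() message.
        return False, str(err)
    return True, "loaded"


def nm_symbol_type(line):
    """Return (type_letter, name) from one line of nm output.

    Defined symbols have three fields (address, type, name); undefined
    ones have no address, so the type is always the second-to-last field.
    """
    fields = line.split()
    return fields[-2], fields[-1]


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("%-8s %s" % (key, value))
    # Any library paths given on the command line are test-loaded in order.
    for lib in sys.argv[1:]:
        ok, message = try_load(lib)
        print("%s: %s" % (lib, message))
```

Saved as, say, check_env.py (a hypothetical name) and run at the start of
the Grid script with the path to libLHAPDF.so as its argument, this prints
the Python version, compiler and word size the job really sees, plus the
loader's own error message if the library cannot be opened.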

Andy


On 14/10/11 13:05, Sara Alderweireldt wrote:
> Hi Andy,
>
> I tried running the nm command before, and locally it returns a T, like
> you report, whereas on the grid it returns a U. I don't know what causes
> this difference, but now that you explain the symbols, it seems to be
> significant.
>
> locally: "000000000003e210 T pdfset_"
> on the grid: " U pdfset_"
>
>
> I checked again (and asked the grid admins) and in principle all
> machines and worker nodes are supposed to have the same architecture and
> compiler environment, all slc5 x86_64 with gcc 4.1.2 and python 2.4.3.
>
> More ideas :)?
>
> Cheers,
> Sara
>
> On 10/13/2011 05:15 PM, Andy Buckley wrote:
>> Hi Sara,
>>
>> This complaint about a missing pdfset_ symbol is odd: that should
>> definitely be defined in libLHAPDF:
>>
>> andy at duality:~$ nm heplocal/lib/libLHAPDF.so | grep pdfset_
>> 000d0f70 T finitpdfset_
>> 00014340 T initpdfset_
>> 00036870 T pdfset_
>>
>> (the "T" means that the symbol is in the "text" section of the
>> library, i.e. it is defined in the library file rather than just
>> declared as something that will be eventually found elsewhere, which
>> would show a "U")
>>
>> You did the right thing to enable the TRACE output, and indeed you see
>> that libLHAPDF cannot be loaded. My suspicion is that the job running
>> on the Grid node has a different architecture or compiler environment,
>> which is why that library cannot be loaded. For example, if the job
>> running on the Grid is in a 32 bit environment but that library is 64
>> bit, then indeed the dlopen library loading will fail. Can you check
>> that?
>>
>> Cheers,
>> Andy
>>
>>
>> On 13/10/11 10:16, Sara Alderweireldt wrote:
>>> Hello,
>>>
>>> To continue on this issue, it's still unsolved, I ran only agile
>>> (submitted to the grid), with external PDF and TRACE output. It seems to
>>> be finding and loading everything, except (as could be expected)
>>> libLHAPDF.so. In that case it finds the library, but can't successfully
>>> load it:
>>>
>>> AGILe.Loader: TRACE Testing for
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Trying to load
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Failed to load
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>>
>>> If I run the exact same command locally (m-machines in Brussels), the
>>> problem is gone:
>>>
>>> AGILe.Loader: TRACE Testing for
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Trying to load
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>> AGILe.Loader: TRACE Successfully loaded
>>> /localgrid/salderwe/TEST/lib/libLHAPDF.so (0xb478560)
>>>
>>> To be complete, the command I ran was:
>>>
>>> agile-runmc Pythia6:425 -b LHC:7000 -n 10 -p PYTUNE=343 -o
>>> test.hepmc -l AGILe.Loader=TRACE
>>>
>>> Given this output, I don't know whether I'm still posting this question
>>> to the right people, maybe I need LHAPDF support instead. In any case,
>>> it's really puzzling me. I'd be happy with any suggestion on how to move
>>> forward with this problem. Would it for instance be possible to get
>>> error messages from LHAPDF or the system in general on what exactly goes
>>> wrong with this loading of libLHAPDF?
>>>
>>> Best regards,
>>> Sara
>>>
>>> On 10/10/2011 10:24 AM, Sara Alderweireldt wrote:
>>>> Hello,
>>>>
>>>> I've been running agile+rivet locally for a while now, and this week
>>>> attempted moving my runs to the grid. I still access my own copy of
>>>> agile+rivet (which locally runs fine) and use a python script which
>>>> calls 'agile-runmc ... &' and 'rivet ...'. If I use the PYTUNE or
>>>> MSTP(52) flag to set an external PDF from lhapdf, I get a segmentation
>>>> fault when running on the grid, and no problem when running locally.
>>>> If I use internal PDFs included in pythia6, everything runs fine both
>>>> on the grid and locally.
>>>>
>>>> At some point, I also had this segmentation fault locally, and traced
>>>> it back with gdb (and a lot of manual print statements) to:
>>>> /line 471 throw runtime_error((string("Failed to load libraries: ") +
>>>> dlerror()).c_str());/
>>>> in AGILe-1.3.0/src/Core/Loader.cc. Recompiling both LHAPDF and pythia6
>>>> solved this.
>>>>
>>>> If I comment out the runtime_error and run on the grid, I get a python
>>>> symbol lookup error:
>>>> python: symbol lookup error: mydirs/libpythia6.so: undefined symbol:
>>>> pdfset_
>>>>
>>>> Do you have any idea what might cause this or what I could try to
>>>> trace it back further? I'm entirely puzzled by the fact that
>>>> everything is fine when processing locally and not when submitting to
>>>> the grid, both methods are accessing the same hard drive with the
>>>> agile & rivet distributions on it. I tried tracking from the
>>>> agile-runmc script and got to FPythia.cc which calls PYEVT (at which
>>>> point the symbol lookup error arrives, no events are produced), but I
>>>> can't figure out where it goes wrong exactly or how to solve it.
>>>>
>>>> I checked when running on the grid what the output of 'lhapdf-config
>>>> --pdfsets-path' is, and it is returning the correct folder and the
>>>> needed LHpdf file is there. If versions are relevant, I'm using agile
>>>> 1.3.0, rivet 1.6.0, pythia 6.425, lhapdf 5.8.6, python 2.4.3 and gcc
>>>> 4.1.2. I hope you can shed some light on this.
>>>>
>>>> Best regards and thanks in advance,
>>>> Sara
> --
>
> Sara Alderweireldt sara.alderweireldt at ua.ac.be
> Universiteit Antwerpen Phone: +32 (0)3 265 3577
> CGB.U.237 - Physics
> Groenenborgerlaan 171
> 2020 Antwerpen http://www.ua.ac.be/edf
> Belgium
>


-- 
Dr Andy Buckley
SUPA Advanced Research Fellow
Particle Physics Experiment Group, University of Edinburgh

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


