[Rivet] Segmentation fault when running agile-runmc on the grid

Sara Alderweireldt sara.alderweireldt at ua.ac.be
Fri Oct 14 13:05:13 BST 2011


Hi Andy,

I tried running the nm command before, and locally it returns a T, like 
you report, whereas on the grid it returns a U. I don't know what causes 
this difference, but now that you explain the symbols, it seems to be 
significant.

locally:          "000000000003e210 T pdfset_"
on the grid:  "                                 U pdfset_"
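
To make the T/U comparison repeatable on both machines, here is a minimal sketch that classifies a symbol from `nm` output. The helper names (`parse_nm_line`, `symbol_status`, `nm_symbol_status`) are hypothetical, and the script assumes a Python 3.7+ interpreter, which is newer than the python 2.4.3 mentioned in this thread. It treats only the letter "U" as undefined and every other type letter as defined, which is a simplification.

```python
import subprocess
import sys


def parse_nm_line(line):
    """Split one line of nm output into (address, type_letter, name).

    Undefined symbols have no address column, so address is None for them.
    """
    fields = line.split()
    if len(fields) == 3:
        return fields[0], fields[1], fields[2]
    if len(fields) == 2:
        return None, fields[0], fields[1]
    raise ValueError("unrecognised nm line: %r" % line)


def symbol_status(nm_output, symbol):
    """Return 'defined', 'undefined' or 'absent' for an exact symbol name."""
    for line in nm_output.splitlines():
        try:
            _address, kind, name = parse_nm_line(line)
        except ValueError:
            continue  # blank lines, "file.o:" headers, etc.
        if name != symbol:
            continue
        # Simplification: only "U" counts as undefined.
        return "undefined" if kind == "U" else "defined"
    return "absent"


def nm_symbol_status(library, symbol):
    """Run nm on the library (as in the thread) and classify the symbol."""
    result = subprocess.run(["nm", library], capture_output=True,
                            text=True, check=True)
    return symbol_status(result.stdout, symbol)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_symbol.py <library> <symbol>
    print(nm_symbol_status(sys.argv[1], sys.argv[2]))
```

Running it with the same library path and `pdfset_` locally and inside a grid job should show whether the two environments really see the same file.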


I checked again (and asked the grid admins), and in principle all 
machines and worker nodes are supposed to have the same architecture and 
compiler environment: all slc5 x86_64 with gcc 4.1.2 and python 2.4.3.
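
Rather than relying on what the nodes are supposed to have, the 32/64-bit question can be checked directly inside a job. The following is only a sketch: the function names are hypothetical, and it assumes a Python 3 interpreter with `ctypes` available (the stock python 2.4.3 does not ship `ctypes`). It reads the ELF class byte of the library, reports the word size of the running interpreter, and attempts a dlopen through `ctypes.CDLL` so that the loader's own error message is printed.

```python
import ctypes
import platform
import struct
import sys

# EI_CLASS byte in the ELF header: 1 = ELFCLASS32, 2 = ELFCLASS64
ELF_CLASS_BITS = {1: 32, 2: 64}


def elf_bits(path):
    """Return 32 or 64 for an ELF file, or None if it is not recognised."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    return ELF_CLASS_BITS.get(header[4])


def interpreter_bits():
    """Pointer size of the running Python process, in bits."""
    return struct.calcsize("P") * 8


def try_load(path):
    """Attempt to dlopen path; return (True, None) or (False, message)."""
    try:
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        return False, str(err)
    return True, None


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_lib.py /path/to/libLHAPDF.so
    library = sys.argv[1]
    print("machine:", platform.machine())
    print("interpreter bits:", interpreter_bits())
    print("library bits:", elf_bits(library))
    print("load attempt:", try_load(library))
```

If the library bits and interpreter bits differ on the worker node, or the load attempt reports a missing dependency, that would point at the environment rather than at AGILe or LHAPDF.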

More ideas :)?

Cheers,
Sara

On 10/13/2011 05:15 PM, Andy Buckley wrote:
> Hi Sara,
>
> This complaint about a missing pdfset_ symbol is odd: that should 
> definitely be defined in libLHAPDF:
>
> andy at duality:~$ nm heplocal/lib/libLHAPDF.so | grep pdfset_
> 000d0f70 T finitpdfset_
> 00014340 T initpdfset_
> 00036870 T pdfset_
>
> (the "T" means that the symbol is in the "text" section of the 
> library, i.e. it is defined in the library file rather than just 
> declared as something that will be eventually found elsewhere, which 
> would show a "U")
>
> You did the right thing to enable the TRACE output, and indeed you see 
> that libLHAPDF cannot be loaded. My suspicion is that the job running 
> on the Grid node has a different architecture or compiler environment, 
> which is why that library cannot be loaded. For example, if the job 
> running on the Grid is in a 32 bit environment but that library is 64 
> bit, then indeed the dlopen library loading will fail. Can you check 
> that?
>
> Cheers,
> Andy
>
>
> On 13/10/11 10:16, Sara Alderweireldt wrote:
>> Hello,
>>
>> To follow up on this issue, which is still unsolved: I ran only agile
>> (submitted to the grid), with an external PDF and TRACE output. It seems to
>> be finding and loading everything except (as could be expected)
>> libLHAPDF.so. It finds that library but can't successfully
>> load it:
>>
>>     AGILe.Loader: TRACE Testing for /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Trying to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Failed to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>
>> If I run the exact same command locally (m-machines in Brussels), the
>> problem is gone:
>>
>>     AGILe.Loader: TRACE Testing for /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Found /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Trying to load /localgrid/salderwe/TEST/lib/libLHAPDF.so
>>     AGILe.Loader: TRACE Successfully loaded /localgrid/salderwe/TEST/lib/libLHAPDF.so (0xb478560)
>>
>> To be complete, the command I ran was:
>>
>>     agile-runmc Pythia6:425 -b LHC:7000 -n 10 -p PYTUNE=343 -o test.hepmc -l AGILe.Loader=TRACE
>>
>> Given this output, I don't know whether I'm still posting this question
>> to the right people, maybe I need LHAPDF support instead. In any case,
>> it's really puzzling me. I'd be happy with any suggestion on how to move
>> forward with this problem. Would it for instance be possible to get
>> error messages from LHAPDF or the system in general on what exactly goes
>> wrong with this loading of libLHAPDF?
>>
>> Best regards,
>> Sara
>>
>> On 10/10/2011 10:24 AM, Sara Alderweireldt wrote:
>>> Hello,
>>>
>>> I've been running agile+rivet locally for a while now, and this week
>>> attempted moving my runs to the grid. I still access my own copy of
>>> agile+rivet (which locally runs fine) and use a python script which
>>> calls 'agile-runmc ... &' and 'rivet ...'. If I use the PYTUNE or
>>> MSTP(52) flag to set an external PDF from lhapdf, I get a segmentation
>>> fault when running on the grid, and no problem when running locally.
>>> If I use internal PDFs included in pythia6, everything runs fine both
>>> on the grid and locally.
>>>
>>> At some point, I also had this segmentation fault locally, and traced
>>> it back with gdb (and a lot of manual print statements) to:
>>> line 471: throw runtime_error((string("Failed to load libraries: ") +
>>> dlerror()).c_str());
>>> in AGILe-1.3.0/src/Core/Loader.cc. Recompiling both LHAPDF and pythia6
>>> solved this.
>>>
>>> If I comment out the runtime_error and run on the grid, I get a python
>>> symbol lookup error:
>>> python: symbol lookup error: mydirs/libpythia6.so: undefined symbol: pdfset_
>>>
>>> Do you have any idea what might cause this or what I could try to
>>> trace it back further? I'm entirely puzzled by the fact that
>>> everything is fine when processing locally and not when submitting to
>>> the grid, both methods are accessing the same hard drive with the
>>> agile & rivet distributions on it. I tried tracking from the
>>> agile-runmc script and got to FPythia.cc which calls PYEVT (at which
>>> point the symbol lookup error arrives, no events are produced), but I
>>> can't figure out where it goes wrong exactly or how to solve it.
>>>
>>> I checked when running on the grid what the output of 'lhapdf-config
>>> --pdfsets-path' is, and it is returning the correct folder and the
>>> needed LHpdf file is there. If versions are relevant, I'm using agile
>>> 1.3.0, rivet 1.6.0, pythia 6.425, lhapdf 5.8.6, python 2.4.3 and gcc
>>> 4.1.2. I hope you can shed some light on this.
>>>
>>> Best regards and thanks in advance,
>>> Sara
-- 

Sara Alderweireldt sara.alderweireldt at ua.ac.be 
Universiteit Antwerpen    Phone: +32 (0)3 265 3577
CGB.U.237 - Physics
Groenenborgerlaan 171
2020 Antwerpen http://www.ua.ac.be/edf
Belgium
