[Rivet] [Yoda] YODA development

Andy Buckley andy.buckley at ed.ac.uk
Sat Oct 31 11:09:31 GMT 2009


Andy Buckley wrote:
> Ben Waugh wrote:
>> On 30/10/09 11:47, Andy Buckley wrote:
>>> I'm not sure... I looked in the stats literature quite a lot to find a 
>>> definitive treatment of weighted stats combination (especially where 
>>> negative weights are concerned) and found nothing, hence the bullet 
>>> points are my invention. Perhaps whether high weights are statistically 
>>> equivalent to many small weights depends on how the weights are 
>>> determined, but I'm inclined to agree that my bullet points aren't the 
>>> whole story and that we need a more careful treatment. 
>>> Fortunately, the code in YODA that handles this for all histo types is 
>>> very localised! (in Dbn1D)
>>>
>>> Any suggestions of alternative weighted combination recipes, which 
>>> consistently handle positive and negative weights? Or, better, has 
>>> anyone found any papers or books that address this issue?
>> I remember a preprint I used in my H1 days that gave some useful 
>> recipes and derivations, but of course I can't find it now. No luck 
>> finding anything more authoritative either.
>>
>> My recollection/understanding is that the "standard" recipe actually 
>> does work for negative weights: (1+-1) - (1+-1) = (0 +- sqrt(2))...
> 
> The YODA approach is to base every stat combination on stored weighted 
> moments of the variable being filled, i.e. w, w^2, wx, wx^2, etc.
> 
> Just thinking about this now, it seems to me that the best approach is 
> to use var(O) = <(wO)^2> - <wO>^2, i.e. we also need to store w^2x^2. I 
> can't remember right now if this is what I'm already doing! Then if I 
> fill a single histo with alternating weights of +/-1 for N fills, the 
> variance on the bin height (O=1) will be N/N - (0/N)^2 = 1... so the 
> error never decreases. Reasonable? Using this treatment is of course 
> nice because it means that the combined result is invariant under 
> arbitrary re-partitionings of the fills... that seems the key property 
> to me.
> 
> Thoughts?

In fact, here's a few thoughts of my own ;)
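
To make the moment-storage idea concrete first: what I have in mind is 
essentially the following accumulator. This is just a sketch with 
made-up names (WeightedDbn etc.), not the real Dbn1D interface:

  // Sketch of a Dbn1D-style weighted-moment accumulator. All names are
  // hypothetical; this is the idea, not YODA code.
  #include <cmath>
  #include <cstddef>

  class WeightedDbn {
  public:
    void fill(double x, double w = 1.0) {
      ++_nFills;
      _sumW    += w;
      _sumW2   += w*w;
      _sumWX   += w*x;
      _sumWX2  += w*x*x;
      _sumW2X2 += w*w*x*x;   // the extra w^2 x^2 moment mentioned above
    }

    // Mean with N = sum(w): undefined (NaN here) when sum(w) = 0.
    double mean() const {
      return (_sumW != 0.0) ? _sumWX / _sumW : std::nan("");
    }

    // Variance of the filled quantity wx, averaging over N = #fills:
    // var = <(wx)^2> - <wx>^2.
    double varWX() const {
      if (_nFills == 0) return std::nan("");
      const double m1 = _sumWX   / _nFills;
      const double m2 = _sumW2X2 / _nFills;
      return m2 - m1*m1;
    }

    double sumW()  const { return _sumW; }
    double sumW2() const { return _sumW2; }
    std::size_t numFills() const { return _nFills; }

  private:
    std::size_t _nFills = 0;
    double _sumW = 0, _sumW2 = 0, _sumWX = 0, _sumWX2 = 0, _sumW2X2 = 0;
  };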

#1. What I said above isn't quite right --- the perils of replying in a 
hurry. The variance I calculated for the alternating +-1 weight example 
isn't the variance of the bin height, but rather the variance of the 
distribution of weights themselves. It does, however, raise the question 
of which "N" is used to compute the averages in this quantity: if N = 
sum(w), then we get into trouble with division when the positive and 
negative weights are equal (i.e. sum(w) = 0). The truth is that the 
weight distribution does have a width, since it alternates between +1 
and -1... so my feeling is that we should be using N = number of fills. 
But that causes problems elsewhere: maybe for our purposes the correct 
thing to do is to take N = sum(w) and just say that the error is 
undefined for sum(w) = 0, as it would be for N = 0. Hmm.
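
Here's the alternating-weight case run through the sketch above, just to 
see the two choices of N side by side (hypothetical code again):

  // Alternating +/-1 fills at x = 1, using the WeightedDbn sketch above.
  #include <cstdio>

  int main() {
    WeightedDbn d;
    for (int i = 0; i < 40; ++i)
      d.fill(1.0, (i % 2 == 0) ? +1.0 : -1.0);

    // N = #fills: the weight distribution genuinely has unit width,
    //   <w^2> - <w>^2 = 40/40 - (0/40)^2 = 1
    const double meanW = d.sumW() / d.numFills();
    std::printf("var(w) with N = #fills: %g\n",
                d.sumW2()/d.numFills() - meanW*meanW);

    // N = sum(w): sum(w) = 0 here, so the same averages are undefined.
    std::printf("sum(w) = %g -> N = sum(w) averages undefined\n", d.sumW());
  }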

#2. The error on the bin height is the Poisson error sqrt(height). 
Invariance under arbitrary partitionings is pretty crucial, so we should 
be able to separate this alternating fill set into two equal sets of 
purely +1 and purely -1 weights, and then recombine. This clearly works 
for the positive weights, but for the negative weights it requires 
taking the sqrt of a negative number. If we're not worried about this 
--- just treat it as a complex quantity and return mod(sqrt(sum(w))) 
when asked --- it all works fine. The usual "sum in quadrature" rule 
still works if you're explicit about the error being imaginary, but 
since we only ever store sum(w) it's irrelevant from a computational 
standpoint.
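
For instance, splitting the balanced fill set into its +1 and -1 halves 
and recombining, with std::complex doing the bookkeeping, gets back 
sqrt(sum(w)) = 0 for the whole set (a throwaway sketch):

  // Recombining complex bin-height errors in quadrature (cf. point #2).
  #include <cmath>
  #include <complex>
  #include <cstdio>

  int main() {
    typedef std::complex<double> C;

    // Two equal sets: 20 fills at w = +1 and 20 at w = -1.
    const C errPos = std::sqrt(C(+20.0));   // real: sqrt(20)
    const C errNeg = std::sqrt(C(-20.0));   // imaginary: i*sqrt(20)

    // "Sum in quadrature" with the imaginary error kept explicit:
    // sqrt(errPos^2 + errNeg^2) = sqrt(20 - 20) = 0 = sqrt(sum(w)).
    const C combined = std::sqrt(errPos*errPos + errNeg*errNeg);
    std::printf("|combined error| = %g\n", std::abs(combined));
  }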

#3. Things become more awkward when we start trying to compute means and 
variances of observables, e.g. just x itself. The most obvious problem 
comes with partitionings (or bins) which have an overall negative 
sum(w). For the equal +-1 partitionings, if all the fills are at x=1, 
then the -1 set will have a mean of -1 (if we divide by N = #fills), 
which doesn't mean much but is necessary for repartitioning invariance 
(the sum(wx) needs to be additive). So the apparent central x value is 
skewed, because (-w)x = w(-x). But there are also problems when the two 
sets are added together: they will have a mean of 0 with a variance of 
1, and this doesn't represent anything meaningful. And what about when 
we combine binwise stats to get moments for the whole histogram? Bins 
with negative sum(w) will skew any moment distributions as if the x 
values had been negative. Symmetry around x=0 doesn't mean anything for 
general distributions, so that's the wrong approach. I think this is 
all resolved if we use N = sum(w), and just accept that for sum(w) = 0 
(and sum(w) < 0?) we can't produce a meaningful answer.
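
A minimal illustration of that, with N = sum(w) and the undefined case 
flagged (a made-up helper, not YODA code):

  // Mean with N = sum(w) (cf. point #3). All fills are at x = 1, so any
  // partition with sum(w) != 0 should give mean = 1; sum(w) = 0 can't.
  #include <cmath>
  #include <cstdio>

  // mean = sum(wx) / sum(w), NaN when sum(w) = 0.
  double mean(double sumWX, double sumW) {
    return (sumW != 0.0) ? sumWX / sumW : std::nan("");
  }

  int main() {
    // 30 fills at w = +1 and 20 at w = -1, all at x = 1: mean = 10/10 = 1.
    std::printf("unbalanced: %g\n", mean(30.0 - 20.0, 30.0 - 20.0));
    // 20 fills at w = +1 and 20 at w = -1, all at x = 1: mean undefined.
    std::printf("balanced:   %g\n", mean(20.0 - 20.0, 20.0 - 20.0));
  }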

Incidentally, another nice sanity check is that if the weights aren't 
exactly balanced, say 20 fills at w=-1 and 30 at w=+1, we should be able 
to partition them as [20+,20-,10+] or as [20-,30+], or even just [10+]. 
We should also be able to partition binwise, i.e. changing bin 
boundaries shouldn't change the answers for the whole histogram. Needs a 
bit more thought.
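
A throwaway check of that invariance might look like this (a sketch 
again; merging accumulators is just summing their stored moments):

  // Repartitioning check: [20+,20-,10+] vs [20-,30+] must accumulate to
  // identical stored moments, since merging is just summing them.
  #include <cassert>

  struct Moments {   // cut-down version of the accumulator sketch above
    double sumW = 0, sumW2 = 0, sumWX = 0, sumWX2 = 0;
    void fill(double x, double w) {
      sumW += w; sumW2 += w*w; sumWX += w*x; sumWX2 += w*x*x;
    }
  };

  int main() {
    Moments a, b;
    for (int i = 0; i < 20; ++i) a.fill(1.0, +1.0);   // [20+,
    for (int i = 0; i < 20; ++i) a.fill(1.0, -1.0);   //  20-,
    for (int i = 0; i < 10; ++i) a.fill(1.0, +1.0);   //  10+]

    for (int i = 0; i < 20; ++i) b.fill(1.0, -1.0);   // [20-,
    for (int i = 0; i < 30; ++i) b.fill(1.0, +1.0);   //  30+]

    assert(a.sumW == b.sumW && a.sumW2 == b.sumW2);
    assert(a.sumWX == b.sumWX && a.sumWX2 == b.sumWX2);
    // N.B. the collapsed [10+] partition matches only the sum(w)-derived
    // quantities (sum(w) = 10 in all cases), not sum(w^2): 10 vs. 50.
    return 0;
  }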

Anyway, just wanted to set that record straight, or at least a *little* 
less crooked. Have a good weekend ;)

Andy

-- 
Dr Andy Buckley
SUPA Advanced Research Fellow
Particle Physics Experiment Group, University of Edinburgh

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


