[Rivet] [Yoda] YODA development
Andy Buckley  andy.buckley at ed.ac.uk
Sat Oct 31 11:09:31 GMT 2009
Andy Buckley wrote:
> Ben Waugh wrote:
>> On 30/10/09 11:47, Andy Buckley wrote:
>>> I'm not sure... I looked in the stats literature quite a lot to find a
>>> definitive treatment of weighted stats combination (especially where
>>> negative weights are concerned) and found nothing, hence the bullet
>>> points are my invention. Perhaps it's sensitive to how the weights are
>>> determined as to whether high weights are statistically equivalent to
>>> many small weights, but I'm inclined to agree that my bullet points
>>> aren't the whole story and that we need a more careful treatment.
>>> Fortunately, the code in YODA that handles this for all histo types is
>>> very localised! (in Dbn1D)
>>>
>>> Any suggestions of alternative weighted combination recipes, which
>>> consistently handle positive and negative weights? Or, better, has
>>> anyone found any papers or books that address this issue?
>>
>> I remember a useful preprint I used in my H1 days that gave some useful
>> recipes and derivations, but of course I can't find it now. No luck
>> finding anything more authoritative either.
>>
>> My recollection/understanding is that the "standard" recipe actually
>> does work for negative weights: (1 +- 1) - (1 +- 1) = (0 +- sqrt(2))...
>
> The YODA approach is to do every stat combination based on storing
> weighted moments of the variable being filled, i.e. w, w^2, wx, wx^2,
> etc.
>
> Just thinking about this now, it seems to me that the best approach is
> to use var(O) = <(wO)^2> - <wO>^2, i.e. we also need to store w^2x^2. I
> can't remember right now if this is what I'm already doing! Then if I
> fill a single histo with alternating weights of +/- 1 for N fills, then
> the variance on the bin height (O=1) will be N/N - 0/N = 1... so the
> error never decreases. Reasonable? Using this treatment is of course
> nice because it means that the combined result is invariant under
> arbitrary re-partitionings of the fills... that seems the key property
> to me.
>
> Thoughts?
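As a toy sketch of the moment-storing scheme quoted above (illustrative Python only, not YODA's actual Dbn1D interface; the class and member names here are made up for the sake of the example):

```python
# Illustrative toy, loosely modelled on the Dbn1D idea described above.
# All names (WeightedDbn, sumW, sumWX, ...) are invented for this sketch.
class WeightedDbn:
    """Accumulates weighted moments of a filled variable x."""
    def __init__(self):
        self.numFills = 0
        self.sumW = 0.0     # sum of w
        self.sumW2 = 0.0    # sum of w^2
        self.sumWX = 0.0    # sum of w*x
        self.sumWX2 = 0.0   # sum of w*x^2
        self.sumW2X2 = 0.0  # sum of w^2*x^2, for the <(wO)^2> - <wO>^2 idea

    def fill(self, x, w=1.0):
        self.numFills += 1
        self.sumW += w
        self.sumW2 += w * w
        self.sumWX += w * x
        self.sumWX2 += w * x * x
        self.sumW2X2 += w * w * x * x

    def mean(self):
        # Using N = sum(w): undefined when sum(w) == 0
        if self.sumW == 0:
            raise ZeroDivisionError("mean undefined for sum(w) == 0")
        return self.sumWX / self.sumW

    def variance(self):
        m = self.mean()
        return self.sumWX2 / self.sumW - m * m

    def __iadd__(self, other):
        # The stored moments are all additive, which is what makes the
        # combined result invariant under re-partitionings of the fills.
        self.numFills += other.numFills
        self.sumW += other.sumW
        self.sumW2 += other.sumW2
        self.sumWX += other.sumWX
        self.sumWX2 += other.sumWX2
        self.sumW2X2 += other.sumW2X2
        return self
```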
In fact, here are a few thoughts of my own ;)

#1. What I said above isn't quite right --- the perils of replying in a
hurry. The variance I calculated in the alternating +-1 weight example
isn't the variance on the bin height, but rather the variance of the
distribution of the weights themselves. It does, however, raise the
question of what "N" should be used to compute the averages in this
quantity: if N = sum(w), then we get into trouble with division when the
positive and negative weights balance exactly (i.e. sum(w) = 0). The
truth is that the distribution does have a width, since it alternates
between +1 and -1... so my feeling is that we should be using N = number
of fills. But this causes problems elsewhere: maybe for our purposes the
correct thing to do is N = sum(w), and just say that the error is
undefined for sum(w) = 0, as it would be for N = 0. Hmm.

#2. The error on the bin height is the binomial error sqrt(height).
Allowing arbitrary partitionings is a pretty crucial invariant, so we
should be able to separate this alternating fill set into two equal sets
of purely +1 and purely -1 weights, and then recombine. This clearly
works for the positive weights, but for the negative weights it requires
taking the sqrt of a negative number. If we're not worried about this
--- just treat it as a complex quantity and return mod(sqrt(sum(w)))
when asked --- it all works fine. The usual "sum in quadrature" rule
still works if you're explicit about having an imaginary error, but that
is really irrelevant from a computational standpoint.

#3. Things become more awkward when we start trying to compute means and
variances of observables, just x for example. The most obvious problem
comes with partitionings (or bins) which have an overall negative
sum(w). For the equal +-1 partitionings, if all fills are at x = 1, then
the -1 set will have a mean of -1 (if we divide by N = #fills), which
doesn't mean much but is necessary for repartitioning invariance (the
sum(wx) needs to be additive).
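The partitioning checks in #2 (and the 20-/30+ sanity check below) can be played with numerically. A small sketch, again illustrative Python rather than YODA code, with all function names invented for the example: it verifies that sum(w) and sum(wx) are additive across different partitionings of the same fill set, and returns the bin "error" as mod(sqrt(sum(w))) via a complex square root:

```python
import cmath

def moments(fills):
    """Return (sum(w), sum(w*x)) for a list of (x, w) fills."""
    sumW = sum(w for _, w in fills)
    sumWX = sum(w * x for x, w in fills)
    return sumW, sumWX

def bin_error(sumW):
    # sqrt(sum(w)); for negative sum(w), take the modulus of the
    # complex square root, as suggested in #2.
    return abs(cmath.sqrt(sumW))

def combine(parts):
    # Recombine per-partition moments by simple addition.
    sumW = sumWX = 0.0
    for p in parts:
        w, wx = moments(p)
        sumW += w
        sumWX += wx
    return sumW, sumWX

# 30 fills with w = +1 and 20 with w = -1, all at x = 1
pos = [(1.0, +1.0)] * 30
neg = [(1.0, -1.0)] * 20

# Two different partitionings of the same fill set
partition_a = [pos[:20], neg, pos[20:]]   # [20+, 20-, 10+]
partition_b = [neg, pos]                  # [20-, 30+]

# Both partitionings give the same totals: sum(w) = 10, sum(wx) = 10
assert combine(partition_a) == combine(partition_b) == (10.0, 10.0)

# With N = sum(w), the mean x comes out as 10/10 = 1, as it should
sumW, sumWX = combine(partition_a)
print("mean x =", sumWX / sumW, " error =", bin_error(sumW))
```

The purely negative partition on its own has sum(w) = -20, for which bin_error returns mod(sqrt(-20)) rather than raising an error, matching the complex-quantity treatment above.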
So the apparent central x value is skewed because (-w)x = w(-x). But
there are also problems when the two sets are added together: they will
have a mean of 0 with a variance of 1, which doesn't represent anything
meaningful. And what about when we combine binwise stats to get moments
for the whole histogram? Bins with negative sum(w) will skew any moment
distributions as if the x value had been negative. Symmetry around
x = 0 doesn't mean anything for general distributions, so I think this
is the wrong approach. I think this is all resolved if we use
N = sum(w), and just accept that if sum(w) = 0 (and sum(w) < 0?) we
can't produce a meaningful answer.

Incidentally, another nice sanity check is that if the weights aren't
exactly balanced, say 20 -1s and 30 +1s, we should be able to partition
them as [20+, 20-, 10+] or as [20-, 30+], or even just [10+]. We should
also be able to partition binwise, i.e. changing bin boundaries
shouldn't change the answers for the whole histogram. Needs a bit more
thought.

Anyway, just wanted to set that record straight, or at least a *little*
less crooked. Have a good weekend ;)

Andy

--
Dr Andy Buckley
SUPA Advanced Research Fellow
Particle Physics Experiment Group, University of Edinburgh

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.