|
[Rivet] [Yoda] YODA developmentBen Waugh waugh at hep.ucl.ac.ukMon Nov 2 10:25:10 GMT 2009
Hi Andy, Here are my further thoughts from random moments over the weekend, after I have tried to beat them into some kind of order. Can you give a concrete example of a generator and steering parameters where negative weights crop up? I would like to sanity check my arguments against a real case. My first thoughts, while I still think correct as far as they go, were not based on a very clear picture of what we are trying to do. I think now I have better untangled in my head the various questions we are trying to answer for any given sample. In particular I may have been confusing the issues related to bin height and those related to means and variances of variables other than those used in the binning. On that topic, I can't see why it is necessary to use a "distribution" object (Dbn1D) to store the contents of each bin in a non-profile histogram. The fill method in this case is keeping information on the distribution (in the binning variable) of entries within the bin. There may be cases where this is useful, but if so it could be achieved using a profile histogram with the Y variable the same as the X (binned) variable. Using a Dbn1D for every histogram bin seems to me to be overkill, and the generalization to more than one dimension involves tracking all the covariances, i.e. O(m^2) quantities where m is the number of dimensions. Is there a use case for this, or is this a "nice to have" feature? On 31/10/09 11:09, Andy Buckley wrote: >> The YODA approach is to do every stat combination based on storing >> weighted moments of the variable being filled, i.e. w, w^2, wx, wx^2, >> etc., Anything that is calculated exclusively from sums over the sample(s) of event-wise quantities must be invariant under re-partitioning. If we deviate from this we should worry that we are doing something inconsistent. I would also suspect that if we get e.g. an imaginary result for the error on a real quantity then we are also making a mistake. And of course we want to be correct, not just consistent! > #1. What I said above isn't quite right --- the perils of replying in a > hurry. The variance I calculated on the alternating +-1 weight examples > isn't the variance on the bin height, but rather the variance of the > distribution of weights themselves. It does, however, raise the question > of what is the "N" used to compute the averages in this quantity: if N = > sum(w), then we get into trouble with division when the positive and > negative weights are equal (i.e. sum(w) = 0). The truth is that the > distribution does have a width, since it alternates between +1 and -1... > so my feeling is that we should be using N = number of fills. But this > causes problems elsewhere: maybe for our purposes the correct thing to > do is N = sum(w) and just say that the error is undefined for sum(w) = > 0, as it would be for N=0. Hmm. You are talking here about calculating means, variances and errors of eventwise quantities, not bin heights, are you not? I'll try to deal with this under #3. > #2. The error on bin height is the binomial error sqrt(height). To allow > arbitrary partitionings is a pretty crucial invariant, so we should be > able to separate this alternating fill set into two equal sets of purely > +1 and purely -1 weights, and then recombine. This clearly works for the > positive weights, but for the negative weights it requires taking the > sqrt of a negative number. If we're not worried about this --- just > treat it as a complex quantity and return mod(sqrt(sum(w))) when asked > --- it all works fine. The usual "sum in quadrature" rule works if > you're more explicit about having an imaginary error, but is really > irrelevant from a computational standpoint. The bin height in a histogram is the estimator of a probability or an expected number of events in a given region of phase space. If the probability density is zero in some reason, then we would expect (in the long run) equal positive and negative weights to occur in that region. In most cases in fact we would expect no events to be generated in that region at all. If we do get events, then we should still calculate the error on the bin height to be the usual sum(w^2), regardless of the sign of w. In our rather pathological toy model where weights are either +1 or -1, suppose we generate a sample of size such that we expect 2N events in a particular bin. We would expect in this bin to see N +- sqrt(N) with weight +1 and N +- sqrt(N) with weight -1, giving a bin height of 0 +- sqrt(2N). Here I am using Poisson rather than binomial statistics. I haven't yet considered the difference in error calculation depending on whether we are dealing with known luminosity (but variable number of events) or known number of events (but variable fraction in each bin). > #3. Things become more awkward when we start trying to compute means and > variances of observables, just x for example. The most obvious problem > comes when there are partitionings (or bins) which have an overall > negative sum(w). For the equal +-1 partitionings, if they all are at > x=1, then the -1 set will have a mean of -1 (if we divide by N=#fills), > which doesn't mean much but is necessary for repartitioning invariance > (the sum(wx) needs to be additive). So the apparent central x value is > skewed because (-w)x = w(-x). But there are also problems when the two > sets are added together: they will have a mean of 0 with a variance of > 1: this doesn't represent anything meaningful. And what about when we > combine binwise stats to get moments for the whole histogram: bins with > negative sum(w) will skew any moment distributions as if the x value had > been negative. Symmetry around x=0 doesn't mean any thing for general > distributions, so I think this is the wrong approach. I think this is > all resolved if we use N=sum(w), and just accept that if sum(w) = 0 (and > sum(w) < 0?) we can't produce a meaningful answer. I think I'll have to think about this some more and come back to it. This is certainly a more difficult area than the bin heights. Certainly if the weights in some region (nearly) cancel out then the probability density in that region is (close to) zero and we cannot expect to calculate a reliable estimate of our quantity of interest. However, I think this particular toy case (if I understand it correctly) is actually an exception. If all events have x=1, regardless of weight, then we can still say that x=1 is a good estimate of the true value, if there is such a thing. Even for the sample with w = -1, the mean x is 1, not -1. mean(x) = sum(wx)/sum(w) = (-n)/(-n) = 1. > Incidentally, another nice sanity check is that if the weights aren't > exactly balanced, say 20 -1s, 30 +1s, we should be able to partition as > [20+,20-,10+] or as [20-,30+], or even just [10+]. We should also be > able to partition binwise, i.e. changing bin boundaries shouldn't change > the answers for the whole histogram. Needs a bit more thought. I'll certainly give it some more thought, but I'm pretty sure partition invariance is guaranteed by using only eventwise quantities summed over events. Partition invariance doesn't mean that any given subset of the entire sample will itself give a defined value for any given estimator, even if the whole sample does. > Anyway, just wanted to set that record straight, or at least a *little* > less crooked. Have a good weekend ;) I did. Hope you did too! Cheers, Ben -- Dr Ben Waugh Tel. +44 (0)20 7679 7223 Dept of Physics and Astronomy Internal: 37223 University College London London WC1E 6BT
More information about the Rivet mailing list |