[Rivet] [Yoda] YODA development

Ben Waugh waugh at hep.ucl.ac.uk
Mon Nov 2 10:25:10 GMT 2009
Next message: [Rivet] [Yoda] YODA development
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Andy,

Here are my further thoughts from random moments over the weekend, after 
I have tried to beat them into some kind of order. Can you give a 
concrete example of a generator and steering parameters where negative 
weights crop up? I would like to sanity check my arguments against a 
real case.

My first thoughts, while I still think correct as far as they go, were 
not based on a very clear picture of what we are trying to do. I think 
now I have better untangled in my head the various questions we are 
trying to answer for any given sample. In particular I may have been 
confusing the issues related to bin height and those related to means 
and variances of variables other than those used in the binning.

On that topic, I can't see why it is necessary to use a "distribution" 
object (Dbn1D) to store the contents of each bin in a non-profile 
histogram. The fill method in this case is keeping information on the 
distribution (in the binning variable) of entries within the bin. There 
may be cases where this is useful, but if so it could be achieved using 
a profile histogram with the Y variable the same as the X (binned) 
variable. Using a Dbn1D for every histogram bin seems to me to be 
overkill, and the generalization to more than one dimension involves 
tracking all the covariances, i.e. O(m^2) quantities where m is the 
number of dimensions. Is there a use case for this, or is this a "nice 
to have" feature?

On 31/10/09 11:09, Andy Buckley wrote:
>> The YODA approach is to do every stat combination based on storing
>> weighted moments of the variable being filled, i.e. w, w^2, wx, wx^2,
>> etc.,

Anything that is calculated exclusively from sums over the sample(s) of 
event-wise quantities must be invariant under re-partitioning. If we 
deviate from this we should worry that we are doing something 
inconsistent. I would also suspect that if we get e.g. an imaginary 
result for the error on a real quantity then we are also making a 
mistake. And of course we want to be correct, not just consistent!

> #1. What I said above isn't quite right --- the perils of replying in a
> hurry. The variance I calculated on the alternating +-1 weight examples
> isn't the variance on the bin height, but rather the variance of the
> distribution of weights themselves. It does, however, raise the question
> of what is the "N" used to compute the averages in this quantity: if N =
> sum(w), then we get into trouble with division when the positive and
> negative weights are equal (i.e. sum(w) = 0). The truth is that the
> distribution does have a width, since it alternates between +1 and -1...
> so my feeling is that we should be using N = number of fills. But this
> causes problems elsewhere: maybe for our purposes the correct thing to
> do is N = sum(w) and just say that the error is undefined for sum(w) =
> 0, as it would be for N=0. Hmm.

You are talking here about calculating means, variances and errors of 
eventwise quantities, not bin heights, are you not? I'll try to deal 
with this under #3.

> #2. The error on bin height is the binomial error sqrt(height). To allow
> arbitrary partitionings is a pretty crucial invariant, so we should be
> able to separate this alternating fill set into two equal sets of purely
> +1 and purely -1 weights, and then recombine. This clearly works for the
> positive weights, but for the negative weights it requires taking the
> sqrt of a negative number. If we're not worried about this --- just
> treat it as a complex quantity and return mod(sqrt(sum(w))) when asked
> --- it all works fine. The usual "sum in quadrature" rule works if
> you're more explicit about having an imaginary error, but is really
> irrelevant from a computational standpoint.

The bin height in a histogram is the estimator of a probability or an 
expected number of events in a given region of phase space. If the 
probability density is zero in some reason, then we would expect (in the 
long run) equal positive and negative weights to occur in that region. 
In most cases in fact we would expect no events to be generated in that 
region at all.

If we do get events, then we should still calculate the error on the bin 
height to be the usual sum(w^2), regardless of the sign of w. In our 
rather pathological toy model where weights are either +1 or -1, suppose 
we generate a sample of size such that we expect 2N events in a 
particular bin. We would expect in this bin to see N +- sqrt(N) with 
weight +1 and N +- sqrt(N) with weight -1, giving a bin height of 0 +- 
sqrt(2N).

Here I am using Poisson rather than binomial statistics. I haven't yet 
considered the difference in error calculation depending on whether we 
are dealing with known luminosity (but variable number of events) or 
known number of events (but variable fraction in each bin).

> #3. Things become more awkward when we start trying to compute means and
> variances of observables, just x for example. The most obvious problem
> comes when there are partitionings (or bins) which have an overall
> negative sum(w). For the equal +-1 partitionings, if they all are at
> x=1, then the -1 set will have a mean of -1 (if we divide by N=#fills),
> which doesn't mean much but is necessary for repartitioning invariance
> (the sum(wx) needs to be additive). So the apparent central x value is
> skewed because (-w)x = w(-x). But there are also problems when the two
> sets are added together: they will have a mean of 0 with a variance of
> 1: this doesn't represent anything meaningful. And what about when we
> combine binwise stats to get moments for the whole histogram: bins with
> negative sum(w) will skew any moment distributions as if the x value had
> been negative. Symmetry around x=0 doesn't mean any thing for general
> distributions, so I think this is the wrong approach. I think this is
> all resolved if we use N=sum(w), and just accept that if sum(w) = 0 (and
> sum(w) < 0?) we can't produce a meaningful answer.

I think I'll have to think about this some more and come back to it. 
This is certainly a more difficult area than the bin heights. Certainly 
if the weights in some region (nearly) cancel out then the probability 
density in that region is (close to) zero and we cannot expect to 
calculate a reliable estimate of our quantity of interest.

However, I think this particular toy case (if I understand it correctly) 
is actually an exception. If all events have x=1, regardless of weight, 
then we can still say that x=1 is a good estimate of the true value, if 
there is such a thing. Even for the sample with w = -1, the mean x is 1, 
not -1. mean(x) = sum(wx)/sum(w) = (-n)/(-n) = 1.

> Incidentally, another nice sanity check is that if the weights aren't
> exactly balanced, say 20 -1s, 30 +1s, we should be able to partition as
> [20+,20-,10+] or as [20-,30+], or even just [10+]. We should also be
> able to partition binwise, i.e. changing bin boundaries shouldn't change
> the answers for the whole histogram. Needs a bit more thought.

I'll certainly give it some more thought, but I'm pretty sure partition 
invariance is guaranteed by using only eventwise quantities summed over 
events. Partition invariance doesn't mean that any given subset of the 
entire sample will itself give a defined value for any given estimator, 
even if the whole sample does.

> Anyway, just wanted to set that record straight, or at least a *little*
> less crooked. Have a good weekend ;)

I did. Hope you did too!

Cheers,
Ben

-- 
Dr Ben Waugh                                   Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy                  Internal: 37223
University College London
London WC1E 6BT
Next message: [Rivet] [Yoda] YODA development
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Rivet mailing list