Thoughts and Ideas: November 2019

Strictly speaking, a statistic is a number computed from a subset of numbers (called a sample in statistics) drawn from a larger set of numbers (called the population), whereas a parameter is a number calculated from the whole population, e.g. the population mean or variance.

Thus, here is an alternative formulation of a statistic. Suppose there is a function f : R^k -> R. Our population can always be assumed to have n numbers, and our sample to have k numbers, with k < n. Let Xn be a vector of n numbers (representing a population) and Xk be a subvector of k numbers (representing a sample). Then f(Xk) is a statistic computed from that sample. For example, a common f is (x1 + x2 + ... + xk)/k, i.e. the sample mean.
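
As a minimal sketch of the parameter/statistic distinction, the following Python snippet computes the mean of a whole population and the mean of one sample drawn from it. The population values, the choice k = 4 and the random seed are all made up for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population Xn of n = 10 numbers (invented for illustration).
population = [2, 4, 4, 5, 7, 9, 10, 12, 15, 22]
k = 4

# A sample Xk: k of the n numbers, drawn without replacement.
sample = random.sample(population, k)

parameter = mean(population)  # computed from the whole population
statistic = mean(sample)      # f(Xk): computed from the sample only

print("population mean (parameter):", parameter)
print("sample Xk:", sample)
print("sample mean (statistic):", statistic)
```

Rerunning with a different seed changes the sample, and usually the statistic, while the parameter stays fixed.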

Mathematics is notorious for abstracting concepts. Even though numbers are already abstract, and calculating a statistic (a number) from data (more numbers) is a further abstraction, a sampling distribution abstracts this abstraction.

Suppose again we have our n numbers. Combinatorics tells us there exist nCk subsets of k numbers from n. f maps each of these k-groups to some real number. Now, almost certainly, f is not injective, meaning there exist Xki and Xkj (the ith and jth Xk vectors respectively) with i ≠ j such that f(Xki) = f(Xkj). In other words, the cardinality of the image of f is smaller than that of its domain of k-groups, and some values of f are typically hit by more k-groups than others. This means that when one selects an Xk vector at random, f(Xk) is more likely to be some values than others. Therefore, f(Xk) has a distribution, and when sampling for a statistic, we are really picking that statistic out of a hat according to the distribution of f(Xk). It is the same process as rolling sums from two dice; here Xn = <1,2,3,4,5,6>, k=2, Xk is an ordered pair of faces drawn from Xn with repetition allowed (two dice can show the same face, so there are 36 equally likely pairs rather than 6C2 = 15 two-sets), and f(Xk) = x1 + x2. In this case, the statistic is the sum of the sampled numbers.
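
The enumeration argument can be sketched in Python: list every k-group, apply f, and count how many k-groups land on each value. This sketch assumes Xn = <1,2,3,4,5,6>, k = 2 and f = sum, with k-groups drawn without replacement (so, unlike a pair of dice, a double cannot occur); the variable names are my own.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb

population = [1, 2, 3, 4, 5, 6]  # Xn
k = 2

def f(xk):
    """The statistic: here, the sum of the sampled numbers."""
    return sum(xk)

# All nCk subsets of size k.
groups = list(combinations(population, k))
assert len(groups) == comb(len(population), k)

# How many k-groups map to each value of f.
counts = Counter(f(g) for g in groups)

# Exact distribution of f(Xk) when every k-group is equally likely.
distribution = {v: Fraction(c, len(groups)) for v, c in sorted(counts.items())}

print("number of k-groups:", len(groups))
print("number of distinct values of f:", len(counts))
for value, prob in distribution.items():
    print(value, prob)
```

There are fewer distinct values than k-groups, and the counts per value are unequal, which is exactly why f(Xk) has a non-trivial distribution.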

-----------------------------------------------------------------------------------------------------

Aside: Law of Large Numbers
The distribution of the sum of two dice is well known: the probability of getting a sum of x is (x-1)/36 for x<=7 and (13-x)/36 for x>=7. In the same way, a distribution exists for any other statistic one can conceive, although it may not be easy to describe mathematically: for more complicated f's, a closed-form expression for the distribution may not exist.
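
The dice formula can be checked by brute force. The sketch below enumerates all 36 ordered outcomes of two fair six-sided dice and compares the counted probabilities with the piecewise formula; the helper name formula is my own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Count how many of the 36 ordered (die1, die2) outcomes give each sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))

def formula(x):
    """Piecewise probability of rolling a sum of x with two fair dice."""
    return Fraction(x - 1, 36) if x <= 7 else Fraction(13 - x, 36)

for x in range(2, 13):
    exact = Fraction(counts[x], 36)
    print(x, exact, formula(x), exact == formula(x))
```

Both branches of the formula agree at x = 7, which is why the two cases can overlap there.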

However, the law of large numbers makes this no problem. If we repeatedly draw random k-groups from the n numbers and calculate f for each, the observed values of f will gradually distribute themselves according to their true theoretical distribution! (Remember, this is no different than drawing sums from dice!) That is to say, our observed frequency distribution will approach the statistic's theoretical sampling distribution as the number of draws grows.
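
A small simulation illustrates this convergence. The sketch assumes Xn = <1,2,3,4,5,6>, k = 2 and f = sum, with k-groups drawn uniformly at random without replacement; the helper max_gap, the seed and the draw counts are my own choices. It reports the largest gap between observed and exact frequencies for increasing numbers of draws.

```python
import random
from collections import Counter
from itertools import combinations

random.seed(0)  # fixed seed so the sketch is repeatable

population = [1, 2, 3, 4, 5, 6]  # Xn
k = 2

def f(xk):
    """The statistic: here, the sum of the sampled numbers."""
    return sum(xk)

# Exact distribution of f(Xk), by enumerating every k-group.
groups = list(combinations(population, k))
exact_counts = Counter(f(g) for g in groups)
exact = {v: c / len(groups) for v, c in exact_counts.items()}

def max_gap(num_draws):
    """Largest |observed frequency - exact probability| after num_draws draws."""
    draws = Counter(f(random.sample(population, k)) for _ in range(num_draws))
    return max(abs(draws[v] / num_draws - p) for v, p in exact.items())

for num_draws in (100, 10_000, 200_000):
    print(num_draws, round(max_gap(num_draws), 4))
```

The gap should tend to shrink as the number of draws increases, though any single run is random. For a statistic with no convenient closed form, the same loop still works; only the exact comparison would have to be dropped.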

--------------------------------------------------------------------------------------------------------------------------

Friday, November 8, 2019

Making Sense of Statistics: Sampling Distributions