Tuesday, October 22, 2019

Exploring Confidence Intervals in Statistics: Confusions and Comprehensions

They say the best way to learn something is to explain it to someone else (according to Feynman, at least). I find this is also the best way to assess one's confusions about a topic: when I am confused, I am oftentimes unable to precisely identify the source of the confusion, and historically that inability to identify a confusion correlates with an inability to articulate it. I also find that articulating a confusion poises me to solve it - that the mere articulation of a quandary initiates some process in my brain which resolves it. This is a baffling but reassuring phenomenon, and one I try to exploit when encountering new problems.

So let me attempt to apply this phenomenon to the study of confidence intervals in statistics.

The concept of a confidence interval is that it measures how close your experimental value of a statistical parameter is to the true theoretical value - for instance, how close your sample proportion p̂ is to the population proportion p. To do this we pick a confidence level, usually 90%, 95%, or 99%, and from there determine the range of values (the confidence interval) that p could plausibly be.

p̂ is not a binomial random variable, as it denotes the proportion of successes, not the number of successes: p̂ = B(n,p)/n. The mean is then E(B(n,p)/n) = np/n = p, and the standard deviation is SD(B(n,p)/n) = SD(B(n,p))/n = sqrt(npq)/n = sqrt(pq/n).
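As a sanity check on these formulas, here is a quick simulation sketch (in Python; the values of n, p, and the trial count are arbitrary illustrative choices): draw many values of p̂ = B(n,p)/n and compare their sample mean and standard deviation to p and sqrt(pq/n).

```python
import math
import random

# Simulate p-hat = (number of successes)/n many times and compare its
# observed mean and SD to E(p-hat) = p and SD(p-hat) = sqrt(p*q/n).
# n, p, and trials are arbitrary illustrative choices.
random.seed(0)
n, p, trials = 100, 0.3, 50_000

phats = []
for _ in range(trials):
    successes = sum(1 for _ in range(n) if random.random() < p)
    phats.append(successes / n)

mean_phat = sum(phats) / trials
sd_phat = math.sqrt(sum((x - mean_phat) ** 2 for x in phats) / trials)

print(mean_phat)                    # close to p = 0.3
print(sd_phat)                      # close to sqrt(p*q/n)
print(math.sqrt(p * (1 - p) / n))   # theoretical SD, about 0.0458
```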

p̂ is still discrete, and still takes the shape of B(n,p), although flattened. It appears that it approximates a normal distribution as long as p is near .5 and n is large. I have made various plots in R of binomial distributions with varying p's and n's and found this to be the case.
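The same check can be done numerically rather than by eye (sketched here in Python rather than R, with illustrative n and p): scale the binomial pmf by n so it is comparable to a density over p̂ values, and compare it pointwise to the normal density with mean p and standard deviation sqrt(pq/n).

```python
import math

# Compare the pmf of B(n, p), viewed on the p-hat scale k/n, with the
# normal density N(p, pq/n).  n and p are arbitrary illustrative values.
n, p = 100, 0.5
mu, sigma = p, math.sqrt(p * (1 - p) / n)

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Dividing probability mass by the point spacing 1/n (i.e. multiplying
# by n) turns the pmf into a density that can be compared directly.
for k in range(40, 61, 5):
    print(k / n, binom_pmf(k) * n, normal_pdf(k / n))
```

Near p = .5 and with n this large, the two columns agree to within a couple of percent.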


In fact, the central limit theorem (CLT) states that the distribution of a sum of independent random variables approaches a normal distribution. That's why we can use a normal approximation for p̂: p̂ is B(n,p)/n, and every binomial R.V. is just a sum of n Bernoulli R.V.'s, which in the binomial setting are independent and identically distributed. Ergo we can use the CLT here. Experimentally this proves to be the case too.

(My God. That is profound. That is why the distance covered by a random walk and the sum of n dice rolls are both approximately normally distributed... both are sums of independent random variables!)
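A quick sketch of that claim for dice (in Python; n and the trial count are arbitrary illustrative values): simulate the sum of n rolls many times and check that roughly 68% of the outcomes land within one standard deviation of the mean, as they would for a normal distribution.

```python
import math
import random

# CLT sketch: the sum of n independent dice rolls should be roughly
# normal, so about 68% of simulated sums should fall within one
# standard deviation of the mean.  n and trials are illustrative.
random.seed(1)
n, trials = 50, 20_000
mu = n * 3.5                      # mean of a single die is 3.5
sigma = math.sqrt(n * 35 / 12)    # variance of a single die is 35/12

sums = [sum(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]
within_1sd = sum(1 for s in sums if abs(s - mu) <= sigma) / trials
print(within_1sd)   # roughly 0.68, as for a normal distribution
```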


Ok fine, but what then are the mean and standard deviation of this normal distribution that is approached? Perhaps that is too in-depth a discussion to settle in general. For p̂ at least, I can apply the expectation and variance functions E(X) and Var(X) to p̂ to obtain the respective mean and variance, as above.

Here is the problem: SD(p̂) = σp̂ = sqrt(pq/n). But we don't ever know p (or consequently q = 1 - p). So we need to make the estimate SE(p̂) = sqrt(p̂(1-p̂)/n). But how well does this approximate σp̂? That is never addressed in any of the resources I have consulted, and I don't know how to evaluate the accuracy of this estimate.
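Lacking a formula, I can at least probe the question empirically. A simulation sketch (Python; n, p, and the trial count are arbitrary illustrative values): draw many samples, compute the plug-in estimate sqrt(p̂(1-p̂)/n) for each, and look at its relative error against the true σp̂.

```python
import math
import random

# How well does the plug-in estimate sqrt(p-hat*(1-p-hat)/n)
# approximate the true sigma = sqrt(p*q/n)?  Simulate many samples and
# record the relative error.  n, p, and trials are illustrative values.
random.seed(2)
n, p, trials = 400, 0.3, 5_000
true_sigma = math.sqrt(p * (1 - p) / n)

rel_errors = []
for _ in range(trials):
    phat = sum(1 for _ in range(n) if random.random() < p) / n
    est_sigma = math.sqrt(phat * (1 - phat) / n)
    rel_errors.append(abs(est_sigma - true_sigma) / true_sigma)

print(sum(rel_errors) / trials)   # average relative error
print(max(rel_errors))            # worst case over these samples
```

At least for this n and p, the plug-in estimate is typically within a few percent of the true value, which is reassuring even if it isn't a general answer.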

So in statistics, here is my conception of the idea of a confidence interval.

Okay, so suppose we have a psychic who says she can predict the future, read your mind, etc.
To assess her claims we may observe how many times the future actually matches her predictions.
The concept really is to see if her correctness rate is significantly greater than the chance rate, i.e., greater than the correctness rate of someone guessing at each prediction.

We will test our psychic on how often she can read our mind for a number we pick in our head from 1-5, for example. Someone guessing, over many trials, would match our number about 1/5 = 20% of the time, so the question is whether her correctness rate is significantly above that.
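To make the chance-rate comparison concrete, here is a sketch (in Python; the number of trials and the number of hits are made-up illustrative values): compute the exact binomial probability that a pure guesser, right with probability 1/5 on each trial, does at least as well as the psychic did.

```python
import math

# If the "psychic" guesses a number from 1-5, a pure guesser is right
# with probability 1/5 per trial.  Given n trials and k hits (both
# hypothetical values here), compute the exact probability of at least
# k hits by luck alone: P(X >= k) for X ~ B(n, 1/5).
n, k, p_chance = 100, 32, 1 / 5

p_value = sum(math.comb(n, j) * p_chance**j * (1 - p_chance)**(n - j)
              for j in range(k, n + 1))
print(p_value)   # a small value means luck alone is an unlikely explanation
```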
