Python Basics and Biostatistics
Confidence Interval of the Mean
In this lesson the student will learn how to:
- interpolate between two values
- calculate the 95% CI of the mean for a data set
- use the pop function
By the end of this lesson the student will be able to:
Write a Python script which calculates the 95% CI for a data
set given the mean, sample SD, and N.
How sure can we be that the mean of our sample matches the mean of the
population we are sampling from? Assuming random sampling, the larger our
sample, the more confident we can be that the statistics we generate from
our sample data accurately reflect the population we are sampling from.
But how confident, exactly? That's where the confidence interval (CI) of
the mean comes in. Before we delve into this topic, note that the correct
way to express a CI is as a range, as in 64.6 to 66.7 inches or
[64.6, 66.7].
Calculating the Confidence Interval of a Mean
The basic formula to calculate the confidence interval of a mean looks
like this:
95%CI: (m - t * ( s/ sqrt(N) ) ) to (m + t * ( s/ sqrt(N) ) )
In this equation m is the sample mean, t is the critical value for a 95%
CI, s is our sample SD, and N is the number of items in our data set. The
value of t depends on the degrees of freedom (df), which is just N - 1, as
displayed in the following table:
df | t |
1 | 12.706 |
2 | 4.303 |
3 | 3.182 |
4 | 2.776 |
5 | 2.571 |
6 | 2.447 |
7 | 2.365 |
8 | 2.306 |
9 | 2.262 |
10 | 2.228 |
11 | 2.201 |
12 | 2.179 |
13 | 2.160 |
14 | 2.145 |
15 | 2.131 |
20 | 2.086 |
25 | 2.060 |
30 | 2.042 |
40 | 2.021 |
60 | 2.000 |
120 | 1.980 |
infinity | 1.960 |
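As a rough sketch, the table can be stored in a script as a list of (df, t) pairs and searched for an exact match. The names T_TABLE and lookup_t are illustrative only and are not part of the lesson; the code uses single-argument print() calls so that it should behave the same under Python 2 and Python 3.

```python
# (df, t) pairs copied from the table above.
# float("inf") stands in for the "infinity" row.
T_TABLE = [
    (1, 12.706), (2, 4.303), (3, 3.182), (4, 2.776), (5, 2.571),
    (6, 2.447), (7, 2.365), (8, 2.306), (9, 2.262), (10, 2.228),
    (11, 2.201), (12, 2.179), (13, 2.160), (14, 2.145), (15, 2.131),
    (20, 2.086), (25, 2.060), (30, 2.042), (40, 2.021), (60, 2.000),
    (120, 1.980), (float("inf"), 1.960),
]

def lookup_t(n):
    """Return the tabulated t for sample size n, or None when
    df = n - 1 is not a row in the table (hypothetical helper)."""
    df = n - 1
    for table_df, t in T_TABLE:
        if table_df == df:
            return t
    return None

print(lookup_t(11))  # df = 10, which is a row in the table
print(lookup_t(34))  # df = 33 falls between two rows
```

When the lookup finds no exact row, t has to be interpolated from the two neighbouring rows.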
Here's an example:
s = 10.0
m = 100
N = 34
To find t we must interpolate from the table:
We know that for df 30 t = 2.042 and that for df 40 t = 2.021
For N = 34, df = 33
33 is 7/10ths of the way from 40 down to 30. So to find t we just do the following calculation:
t = .7 * (2.042 - 2.021) + 2.021
= .7 * .021 + 2.021
= .0147 + 2.021
= 2.036
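The interpolation above can be sketched as a small Python function. The name interpolate and its parameters are assumptions made for illustration. It measures the fraction upward from df 30 rather than downward from df 40, which gives the same result.

```python
def interpolate(x, x_lo, y_lo, x_hi, y_hi):
    """Linearly interpolate y at x between the points
    (x_lo, y_lo) and (x_hi, y_hi)."""
    # float() keeps the division from truncating under Python 2
    fraction = float(x - x_lo) / (x_hi - x_lo)
    return y_lo + fraction * (y_hi - y_lo)

# df = 33 lies between the table rows for df = 30 and df = 40
t = interpolate(33, 30, 2.042, 40, 2.021)
print("t = %.3f" % t)
```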
So now we can apply this to our equation:
95%CI(high) = 100 + 2.036 * ( 10/sqrt(34) ) = 100 + 3.49 = 103.5
95%CI(low) = 100 - 2.036 * ( 10/sqrt(34) ) = 100 - 3.49 = 96.5
The 95% CI is therefore 96.5 to 103.5.
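The same calculation can be sketched in Python. The function name ci95 is an assumption for illustration, and t is passed in by hand here rather than looked up from the table.

```python
import math

def ci95(m, s, n, t):
    """Return (low, high) for the 95% CI of the mean, given the
    mean m, sample SD s, sample size n, and the t value for n - 1 df."""
    margin = t * (s / math.sqrt(n))
    return (m - margin, m + margin)

# Values from the worked example: m = 100, s = 10.0, N = 34, t = 2.036
low, high = ci95(100, 10.0, 34, 2.036)
print("95%% CI: %.1f to %.1f" % (low, high))
```

With these inputs the script should report an interval of about 96.5 to 103.5.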
Using pop
The following script shows you a useful method to prompt the user and
then to use the information collected from the user.
#!/usr/bin/python
prompts = ["YOUR NAME: ", "YOUR AGE: ", "YOUR HEIGHT: ", "YOUR WEIGHT: "]
responses = []
for p in prompts:
    responses.append(raw_input(p))
print responses
for r in range(0,len(responses)):
    print prompts[r] + responses.pop()
print responses
If you run the script you will notice a fundamental problem that can be
easily remedied by adding the number zero between the parentheses following
pop.
#!/usr/bin/python
prompts = ["YOUR NAME: ", "YOUR AGE: ", "YOUR HEIGHT: ", "YOUR WEIGHT: "]
responses = []
for p in prompts:
    responses.append(raw_input(p))
print responses
for r in range(0,len(responses)):
    print prompts[r] + responses.pop(0)
print responses
Remember that the user's responses will be stored as:
responses[0] --> contains response for name prompt
responses[1] --> contains response for age prompt
responses[2] --> contains response for height prompt
responses[3] --> contains response for weight prompt
The pop function (a method of Python lists, which this lesson calls
arrays) returns the value stored in the last item of an array and deletes
that item from the array. Given an index, it instead returns and removes
the item at that position, so pop(0) pops from the front of the array.
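A minimal sketch of the difference between pop() and pop(0) follows. The responses here are made-up placeholder values so that the script runs without prompting the user.

```python
# Placeholder responses for name, age, height, and weight
responses = ["Ann", "31", "65", "130"]

last = responses.pop()    # removes and returns the last item
print(last)
print(responses)          # one item shorter, weight is gone

first = responses.pop(0)  # removes and returns the first item
print(first)
print(responses)          # only age and height remain
```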
ASSIGNMENT:
Write a script which takes as input the mean, sample SD, and N. The t value
will be interpolated from the df (N-1) and then the 95% CI will be
calculated. The results will be displayed like this:
mean: 44
sample SD: 2.3
sample size: 67
95% CI: [43.4,44.6]
You must store your responses and prompts in arrays and use the pop
function to iterate through your prompts.