Python Basics and Biostatistics

Confidence Interval of the Mean

In this lesson the student will learn how to:

interpolate between two values
calculate the 95% CI of the mean for a data set
use the pop function

By the end of this lesson the student will be able to:


  Write a Python script which calculates the 95% CI for a data
  set given the mean, sample SD, and N.

How sure can we be that the mean of our sample matches the mean of the population we are sampling from? Assuming random sampling, the larger our sample, the more confident we can be that the statistics we generate from it accurately reflect the population. But how confident, exactly? That is where the confidence interval (CI) of the mean comes in. Before we delve into this topic, note that a CI is expressed as a range, as in 64.6 to 66.7 inches or [64.6, 66.7].

Calculating the Confidence Interval of a Mean

The basic formula to calculate the confidence interval of a mean looks like this:


  95%CI: (m - t * ( s/ sqrt(N) ) ) to (m + t * ( s/ sqrt(N) ) )

In this equation m stands for the sample mean, t is the critical value of the t distribution for a 95% CI, s is the sample SD, and N is the number of items in the data set. The value of t depends on the degrees of freedom (df), which is just N - 1, as displayed in the following table:

df t
1 12.706
2 4.303
3 3.182
4 2.776
5 2.571
6 2.447
7 2.365
8 2.306
9 2.262
10 2.228
11 2.201
12 2.179
13 2.160
14 2.145
15 2.131
20 2.086
25 2.060
30 2.042
40 2.021
60 2.000
120 1.980
infinity 1.960
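As a rough sketch of how a script might hold this table, the block below stores each row as a (df, t) pair and finds the two rows that surround a given df. The names T_TABLE and bracketing_rows are made up for illustration, and float("inf") is an assumed stand-in for the "infinity" row. It is written to run under either Python 2 or Python 3.

```python
# 95% CI values of t, copied from the table above.
# float("inf") is an assumed stand-in for the "infinity" row.
T_TABLE = [
    (1, 12.706), (2, 4.303), (3, 3.182), (4, 2.776), (5, 2.571),
    (6, 2.447), (7, 2.365), (8, 2.306), (9, 2.262), (10, 2.228),
    (11, 2.201), (12, 2.179), (13, 2.160), (14, 2.145), (15, 2.131),
    (20, 2.086), (25, 2.060), (30, 2.042), (40, 2.021), (60, 2.000),
    (120, 1.980), (float("inf"), 1.960),
]


def bracketing_rows(df):
    """Return the table rows just below and just above df.

    If df is listed in the table, the same row is returned twice.
    (Hypothetical helper, not part of the lesson's own code.)
    """
    if df < 1:
        raise ValueError("df must be at least 1")
    lower = T_TABLE[0]
    for row in T_TABLE:
        if row[0] == df:
            return row, row      # exact match, no interpolation needed
        if row[0] < df:
            lower = row          # best lower neighbour so far
        else:
            return lower, row    # first row above df closes the bracket


print(bracketing_rows(33))  # ((30, 2.042), (40, 2.021))
```

When df falls between two rows, the two returned rows are the end points to interpolate between.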

Here's an example:


  s = 10.0
  m = 100
  N = 34
  
To find t we must interpolate from the table:

  We know that for df 30 t = 2.042 and that for df 40 t = 2.021

  For N = 34, df = 33
 
  33 is 7/10ths of the way from 40 down to 30. So to find t we just do the following calculation:
    
   t =  .7 * (2.042 - 2.021) + 2.021
     =  .7 * .021 + 2.021
     =  .0147 + 2.021
     =  2.036
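The same interpolation can be sketched in Python. The function name interpolate and its parameter names are assumptions made for this illustration. It measures the fraction from df 30 upward (3/10) rather than from df 40 downward (7/10), which gives the same answer.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    # float() guards against integer division under Python 2
    fraction = (x - x0) / float(x1 - x0)
    return y0 + fraction * (y1 - y0)


# df = 33 lies between df 30 (t = 2.042) and df 40 (t = 2.021)
t = interpolate(33, 30, 2.042, 40, 2.021)
print(round(t, 3))  # 2.036
```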

So now we can apply this to our equation:

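  95%CI(high) = 100 + 2.036 * ( 10/sqrt(34) )
              = 100 + 2.036 * 1.715
              = 103.49

  95%CI(low) = 100 - 2.036 * ( 10/sqrt(34) )
             = 100 - 2.036 * 1.715
             = 96.51

So the 95% CI is 96.51 to 103.49, or [96.51, 103.49].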
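To check the arithmetic, here is a minimal sketch of the formula as a Python function. The name confidence_interval is hypothetical, and t is passed in directly (2.036, from the interpolation above) rather than looked up.

```python
import math


def confidence_interval(mean, sd, n, t):
    """Return (low, high) for the CI of the mean, given t for the chosen level."""
    margin = t * sd / math.sqrt(n)   # t * ( s / sqrt(N) )
    return mean - margin, mean + margin


low, high = confidence_interval(100, 10.0, 34, 2.036)
print("95%% CI: [%.2f, %.2f]" % (low, high))  # 95% CI: [96.51, 103.49]
```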

Using pop

The following script shows you a useful method to prompt the user and to then use the information collected from the user.

  #!/usr/bin/python

  prompts = ["YOUR NAME: ", "YOUR AGE: ", "YOUR HEIGHT: ", "YOUR WEIGHT: "]
  responses = []

  for p in prompts:
      responses.append(raw_input(p))
  print responses

  for r in range(0,len(responses)):
      print prompts[r] + responses.pop()
  print responses

If you run the script you will notice a fundamental problem that can be easily remedied by adding the number zero between the parentheses following pop.

Remember that the user's responses will be stored as:

  responses[0] --> contains response for name prompt
  responses[1] --> contains response for age prompt
  responses[2] --> contains response for height prompt
  responses[3] --> contains response for weight prompt

The pop function returns the value stored in the last item of an array and deletes that item from the array unless told to do otherwise as in pop(0) which pops (and removes) from the front of the array.
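A small illustration of the difference between pop() and pop(0) follows. The sample answers are made up, and the snippet skips the prompting step so that it runs without input under either Python 2 or Python 3.

```python
# Made-up answers standing in for what a user might type
responses = ["Ann", "34", "65", "130"]

last = responses.pop()     # removes and returns the final item
first = responses.pop(0)   # removes and returns the first item

print(last)       # 130
print(first)      # Ann
print(responses)  # ['34', '65']
```

Each call shortens the list by one item, so a loop that pops once per prompt ends with an empty list.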

ASSIGNMENT:

Write a script which takes as input the mean, sample SD, and N. The t value will be interpolated from the df (N-1) and then the 95% CI will be calculated. The results will be displayed like this:

  mean: 44
  sample SD: 2.3
  sample size: 67
  95% CI: [43.4, 44.6]

You must store your responses and prompts in arrays and use the pop function to iterate through your prompts.
