Python Basics and Biostatistics

Random Sampling

In this lesson the student will learn how to:
  1. explain the statistical reliability of random sampling
  2. redirect script output to a file
  3. append values to a list
  4. generate random numbers
By the end of this lesson the student will be able to:

  Use a script to perform a simple statistical
  experiment.

Assuming that two populations of 40 subjects are identical for some measurable aspect of their physiology (for instance, systolic blood pressure), what are the chances that a random sample of five subjects from each population will reflect this? How reliable is random sampling?

To help us discuss these questions, let's create two identical groups of subjects:


G1 = ( 88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96 ).

G2 = ( 88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96 ).

If we select five values from each group at random we might come up with:

G1 Sample = ( 90, 82, 105, 74, 80 ).
   Sample Mean: 86

G2 Sample = ( 88, 82, 99, 101, 86 ).
   Sample Mean: 91

NOTE: sample means are rounded to the nearest whole number.
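
A draw like the one above can be reproduced with Python's standard library. The following is a minimal sketch rather than part of the original lesson: it assumes that random.sample (which never picks the same subject twice) is an acceptable way to draw, and the helper name draw_sample is made up for illustration.

```python
import random

G1 = [88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96]
G2 = list(G1)  # the second group is an identical copy of the first

def draw_sample(group, size=5):
    """Pick `size` different subjects and return (sample, rounded mean)."""
    sample = random.sample(group, size)
    return sample, round(sum(sample) / size)

for name, group in (("G1", G1), ("G2", G2)):
    sample, sample_mean = draw_sample(group)
    print(name, "Sample =", sample)
    print("   Sample Mean:", sample_mean)
```

Because the draw is random, each run prints different samples and, usually, different sample means for the two identical groups.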

The actual mean of each group is 89 (89.25 before rounding), so random sampling does not guarantee that a sample will be representative of the group it was drawn from. In fact, we could wind up with random samples like this:


G1 Sample = ( 99, 105, 91, 101, 96 )
   Sample Mean: 98

G2 Sample = ( 79, 82, 74, 80, 86 )
   Sample Mean: 80

Although the odds of drawing such extreme samples are low (and lower still when populations and sample sizes are large), it can always happen. Random sampling is therefore not very reliable when the sample is small, but it becomes more reliable as the sample size increases.
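
One way to see this effect without relying on luck is to list every possible sample of a given size from G1 and look at how widely the sample means vary. This is a sketch, not part of the original lesson: the function name sample_mean_spread is an assumption, and the approach is only practical for a small group like this one because the number of possible samples grows very quickly.

```python
from itertools import combinations
from statistics import mean, pstdev

G1 = [88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96]

def sample_mean_spread(population, size):
    """Return (lowest, highest, std dev) of the means of every possible sample."""
    means = [mean(s) for s in combinations(population, size)]
    return min(means), max(means), pstdev(means)

# Compare small and large samples drawn from the same group of 12
for size in (2, 5, 8, 11):
    low, high, spread = sample_mean_spread(G1, size)
    print(f"size {size:2d}: means range {low:.1f} to {high:.1f}, std dev {spread:.2f}")
```

For samples of five, the lowest and highest possible means are 80.2 and 98.4, which correspond to the two extreme samples shown above. The spread of the sample means narrows as the sample size approaches the size of the whole group.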

Random Sampling Script

Consider the following Python script which illustrates the reliability of random sampling:

#!/usr/bin/env python3
import random

# Measurements for a population of 32 subjects
pop = [62, 65, 66, 72, 80, 75, 67, 73,
       79, 69, 70, 64, 64, 65, 67, 81,
       66, 67, 77, 78, 82, 80, 73, 68,
       67, 73, 84, 58, 75, 76, 72, 68]

num = len(pop)

# Add up every value to get the population average
tally = 0
for n in pop:
    tally += n
actual_average = tally // num   # whole-number average (fraction dropped)
print("ACTUAL AVERAGE: " + str(actual_average))

# Draw three random samples of five subjects each
for z in range(0, 3):
    print("-------" + str(z + 1) + "-------")
    tally = 0
    samp = []
    for n in range(0, 5):
        i = random.randint(0, num - 1)   # random index into pop
        tally += pop[i]
        samp.append(pop[i])              # add the chosen value to the sample
    print("SAMPLE: ")
    print(samp)
    print("SAMPLE AVERAGE: " + str(tally // 5))

Run this script several times and compare the output. Notice the use of append to add an item to the end of a list, a technique that is useful in many situations.

Also pay attention to the import statement and the line in which random.randint(a,b) gets called.
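
random.randint(a, b) returns a whole number from a to b with both ends included, which is why the script passes num-1 rather than num as the upper limit. Because each index is drawn independently, the same subject can be picked more than once within one sample. The sketch below illustrates both points with a small made-up list; random.sample is shown only as an alternative that avoids repeats, not as what the lesson's script uses.

```python
import random

pop = [62, 65, 66, 72, 80]   # small made-up list, for illustration only
num = len(pop)

# randint includes both end points, so valid indexes run from 0 to num-1
indexes = [random.randint(0, num - 1) for _ in range(1000)]
print("lowest index drawn: ", min(indexes))
print("highest index drawn:", max(indexes))

# Build a sample the way the script does: append one value at a time
samp = []
for _ in range(5):
    samp.append(pop[random.randint(0, num - 1)])
print("repeats possible:   ", samp)

# Alternative: random.sample never picks the same position twice
no_repeats = random.sample(pop, 3)
print("no repeats:         ", no_repeats)
```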

ASSIGNMENT:

Present the results of this activity on a web page.

Run the Python script shown above ten times. Redirect the output to a file so that it can be easily reformatted into an HTML page. To redirect, do this:

./pyscriptname >> output_file.txt

The >> operator appends to the file rather than overwriting it, so the output of all ten runs accumulates in output_file.txt.

Once you have ten sets of output (thirty samples in all), tally up how many times the sample average matched the overall average. Provide a brief summary (two or three sentences) explaining how accurately a small sample reflects the overall population. Based on your results, is it close or way off?
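
The tallying step can also be scripted. The sketch below is a hypothetical helper, not part of the original assignment: it assumes the output file contains the "ACTUAL AVERAGE:" and "SAMPLE AVERAGE:" lines exactly as the script prints them, and the function name count_matches and the example text are made up for illustration.

```python
def count_matches(text):
    """Return (matches, samples) for the script output held in `text`."""
    actual = None
    matches = samples = 0
    for line in text.splitlines():
        label, _, value = line.partition(":")
        if label == "ACTUAL AVERAGE":
            actual = value.strip()      # a new run starts here
        elif label == "SAMPLE AVERAGE":
            samples += 1
            if value.strip() == actual:
                matches += 1
    return matches, samples

# Made-up output in the script's format, used only to show the idea
example = """ACTUAL AVERAGE: 71
-------1-------
SAMPLE:
[62, 80, 75, 67, 73]
SAMPLE AVERAGE: 71
-------2-------
SAMPLE:
[84, 58, 75, 76, 72]
SAMPLE AVERAGE: 73
"""
matches, samples = count_matches(example)
print(matches, "of", samples, "sample averages matched")
```

To run it on real results, read the file first, for example with count_matches(open("output_file.txt").read()).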