Sample Methodology Writeup for a Controlled Experiment

[rev. 12/6/2002]

Research question

Does jeering increase the proportion of eggs that fail Dr. Morse’s annual Egg Drop Competition?

Hypothesis

We predict that jeering makes no statistically significant difference in the proportion of eggs that fail.

Outline

Control

Jeering script will be standardized, w/ 10 people shouting each time. Shouting must be loud relative to background noise (which cannot be controlled). Leader of shouting group will initiate shouting based on coded placard in window of Rm. R: word whose 3rd letter is a vowel=shout, any other word=silence. Placard controller will work from a secret SRS list, so that nobody in shouting group knows in advance whether to shout or not, and only the leader and the placard controller know the agreed-upon code. Dr. Morse will record all data (smashed/survived) in sequence for placard controller to copy later. Shouters will not communicate with the placard controller & will stand out of sight of the drop zone & out of conversational earshot of other spectators. Though shouters will surely deduce the fate of many of the eggs, they will not record whether their efforts to trigger an egg-smash were successful.
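The placard code described above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and two details are assumptions not stated in the design (the letter "y" is not treated as a vowel, and a word with fewer than 3 letters counts as "any other word," i.e., silence).

```python
# Hypothetical helper illustrating the placard code; not part of the original design.
VOWELS = set("aeiou")  # assumption: "y" does not count as a vowel


def placard_means_shout(word: str) -> bool:
    """Return True if the placard word signals 'shout' (3rd letter is a vowel)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    # Assumption: words shorter than 3 letters fall under "any other word" = silence.
    return len(letters) >= 3 and letters[2] in VOWELS


for word in ["bread", "window", "to"]:
    signal = "shout" if placard_means_shout(word) else "silence"
    print(word, "->", signal)
```

Here "bread" (3rd letter e) signals a shout, while "window" (3rd letter n) and "to" signal silence.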

Shouters and Dr. Morse will make reasonable efforts to keep the experimental and control groups independent, since independence is a required assumption for the 2-proportion z test.*

Blinding (subcat. of control)

...of exp. units (eggs): not feasible.

...of researchers (placard controller & jeering crowd): partially implemented as descr. above but cannot be fully implemented w/o damaging realism. (If shouters are completely isolated, they will not shout w/ reqd. emotional energy. If they are replaced w/ an amplified tape recording, the exper. is not realistic at all.)

...of Dr. Morse: neither feasible nor desirable.

...of other bystanders: partially implemented as descr. above, but cannot be fully implemented w/o running a separate mock egg-drop.

Randomization

Eggs are randomly assigned to “treatment” (shouting) or “control” (relative silence) before the first egg is loaded onto the release mechanism.**
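One way the placard controller's secret list could be prepared is sketched below. It assumes the 50 eggs are numbered 1 through 50 in drop order and that exactly 25 are to be jeered; the function name and the seed value are hypothetical, chosen only for illustration.

```python
import random


def assign_treatments(n_eggs=50, n_treated=25, seed=None):
    """Pick an SRS of n_treated drop positions to receive the 'shout' treatment."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every subset of size
    # n_treated is equally likely: a simple random sample of drop positions.
    treated = set(rng.sample(range(1, n_eggs + 1), n_treated))
    return ["shout" if egg in treated else "silence"
            for egg in range(1, n_eggs + 1)]


schedule = assign_treatments(seed=2002)  # seed is arbitrary, for repeatability
print(schedule[:10])
print(schedule.count("shout"), schedule.count("silence"))  # 25 25
```

Because the sample is drawn before the first drop, the split is fixed at 25/25 no matter which positions are chosen.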

Replication

Use 50 experimental units (25 per group) so that if we see a much larger proportion of “jeered-at” eggs getting smashed, we can conclude that the difference is too great to be plausibly caused by chance alone.***

Notes

For this experiment, a cracked egg will count as a failure.

* In truth, the independence assumption cannot be satisfied here. We would need a very large population of eggs to draw from (otherwise, the sampling without replacement degrades the independence of the two groups), and we would need to know, e.g., that a run of several jeers in a row has no effect on the next silent drop. But this is circular reasoning! (Look at the research question again.) Can you think of a way to improve the design?

** Another approach would be to let the leader of the jeering crowd flip a coin each time to make the treatment/no-treatment decision (heads=jeering, tails=silence). Doing this would also simplify the design by eliminating the need for the placard controller. However, there are some disadvantages:

(1) There is no guarantee of getting a 25/25 split between treatment eggs and control eggs. An equal split is desirable, as we shall see in the 2nd semester, for simplifying the math slightly and improving the power of the test.

(2) Coin flipping is a spectacle that invites interference from bystanders. Even a random digit table or a calculator pseudorandom number generator would lead to distracting questions: What are you doing? Why are you consulting that list? What’s this all about? Can we scream with you?

(3) The best way to eliminate the communication problem among members of the jeering group is for the leader to initiate the shouting each time. However, if he visibly consults a list or flips a coin, it will be impossible to prevent other members from clustering around him and wanting to know in advance what they are to do. Egg Drop Day is characterized by moments of intense excitement separated by periods of boredom, and it is human nature to find ways of relieving the boredom between drops.
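Disadvantage (1) can be made concrete with a short calculation. The sketch below assumes a fair coin and 50 independent flips, and computes the chance of landing exactly on a 25/25 split. The cutoff used for a "lopsided" split (30/20 or worse) is an arbitrary choice for illustration.

```python
from math import comb

n = 50  # number of eggs, one fair coin flip per egg

# Probability that exactly half of the flips come up heads (a 25/25 split).
p_even_split = comb(n, n // 2) / 2 ** n

# Probability of a split at least as uneven as 30/20
# (20 or fewer heads, or 30 or more heads).
p_lopsided = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= 10) / 2 ** n

print(f"P(exactly 25/25)     = {p_even_split:.4f}")
print(f"P(30/20 or worse)    = {p_lopsided:.4f}")
```

Under these assumptions the exact 25/25 split turns up only about 11% of the time, which is why the pre-drawn SRS list is attractive.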

*** We will learn later (2nd semester) how to calculate how many eggs are needed to detect a real experimental effect. It is possible to show (by guess-and-check and pushing calculator buttons STAT TEST 6) that with 25 eggs in each group, whenever the measured failure rate of “jeered-at” eggs exceeds that of the control eggs by at least 24 percentage points (i.e., 6 or more additional smashed eggs out of 25), the 2-proportion z test will report statistical significance, provided the test is one-sided (as the research question, which asks about an increase, suggests). However, this is not the way we measure the power of the test in statistics. To measure power, which we will study in the 2nd semester, we compute the probability that the test will report a statistically significant difference based on assumptions about the parameter values (i.e., the true failure rates of eggs, not the measured rates). With some number-crunching on a spreadsheet, it is possible to show that if there are 25 eggs in each group, if the true difference between experimental and control failure rates is at least 20 percentage points, and if the true (parameter) failure rate for the control group is less than .25, then the power of the test (i.e., the probability of detecting statistical significance) is always more than 50%.

If you’re thinking to yourself, “Mr. Hansen! That’s not a very powerful test,” guess what? You’re right. To make a more powerful test, we need to reduce the standard deviation of the statistic involved (i.e., the measured difference between the failure proportions), and the easiest way of doing that is to increase the sample size. With 200 eggs in each group, and under the assumption that the true (parameter) value of experimental minus control failure rate is at least 10 percentage points, the power of the test exceeds 60% for all possible values of the failure rates.
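For readers who would rather not use a spreadsheet, here is a rough sketch of how power could be computed exactly. It assumes a one-sided pooled 2-proportion z test at the .05 level and independent binomial counts in the two groups. The function names are hypothetical, and the parameter values in the example calls are illustrative picks, not results from the original number-crunching.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(n, p):
    """List of P(X = k) for k = 0..n, where X ~ Binomial(n, p)."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def power(n, p_ctrl, p_treat, alpha=0.05):
    """Exact probability that a one-sided pooled 2-proportion z test rejects."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    pmf_c = binom_pmf(n, p_ctrl)
    pmf_t = binom_pmf(n, p_treat)
    total = 0.0
    for xc in range(n + 1):              # smashed eggs in the control group
        for xt in range(xc + 1, n + 1):  # only outcomes with more jeered-at failures
            p_pool = (xc + xt) / (2 * n)
            se = sqrt(p_pool * (1 - p_pool) * 2 / n)
            if se > 0 and (xt - xc) / n / se > z_crit:
                total += pmf_c[xc] * pmf_t[xt]
    return total


print(power(25, 0.20, 0.40))    # 25 eggs per group, 20-point true difference
print(power(200, 0.45, 0.55))   # 200 eggs per group, 10-point true difference
```

Trying other values of p_ctrl and p_treat shows how the power changes with the true failure rates and with the number of eggs.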

Although increasing the sample size is probably the most obvious way of increasing the power of this experiment, there are other methods. Can you think of any?

Throughout this discussion, I have assumed that statistical significance is defined using the traditional criterion, p < .05. We will learn what this means in the 2nd semester. Basically, this means that chance alone would produce a result at least this extreme less than 5% of the time.
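Using this criterion, the earlier claim about a gap of 6 or more smashed eggs can be checked by brute force instead of guess-and-check. The sketch below assumes the pooled 2-proportion z test (the version a calculator's 2-proportion z test uses) and a one-sided alternative, since the research question asks whether jeering increases failures. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = 25  # eggs per group


def one_sided_p(x_treat, x_ctrl, n=N):
    """One-sided p-value of the pooled 2-proportion z test (treatment > control)."""
    p_pool = (x_treat + x_ctrl) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (x_treat - x_ctrl) / n / se
    return 1 - NormalDist().cdf(z)


# Every outcome in which the jeered-at group has at least 6 more smashed eggs.
# (The pooled proportion is strictly between 0 and 1 here, so se is never 0.)
worst_p = max(
    one_sided_p(x_treat, x_ctrl)
    for x_ctrl in range(0, N - 5)
    for x_treat in range(x_ctrl + 6, N + 1)
)

print(f"largest one-sided p-value with a 6+ egg gap: {worst_p:.4f}")
print(f"same outcome, two-sided:                     {2 * worst_p:.4f}")
```

Under these assumptions the largest one-sided p-value stays below .05, so the claim holds. The doubled (two-sided) value does not, which is why the one-sided assumption matters.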