Study Guide for Test #2

by Eric Love, D period

[partially edited and reformatted for the Web on 1/12/1999]

anecdotal evidence - based on haphazardly selected individual cases, which often come to our attention because they are striking in some way. These cases most likely do not represent any larger group of cases.

segmented bar graph - a bar graph in which each bar has a height of 100%. The segments of each bar show what percent of that bar's group falls into each category.

bias - a study design is biased when it systematically favors certain outcomes. If a flipped penny is weighted, there is bias because the outcome will systematically favor the weighted side.

blind test/ double-blind test - In a blind test, the subject does not know whether he or she is receiving the treatment or a placebo. In a double-blind test, neither the subject nor the evaluator knows who is receiving the treatment and who is receiving the placebo. Experiments should be double-blind whenever possible.

blocking (block design) - a block is a group of experimental units or subjects that are known before the experiment to be similar in a way that is expected to affect the response to the treatments. For example, children and adults may be two blocks in an experiment on flu medication because, as groups, they would be expected to respond differently. Within each block, treatments are assigned at random.
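Random assignment within blocks can be sketched in a few lines of Python. This is only an illustration: the function name assign_within_blocks, the subject labels, and the two treatment names are made up for the example.

```python
import random

def assign_within_blocks(blocks, treatments, seed=None):
    """Randomly assign treatments separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # random order within this block only
        for i, subject in enumerate(shuffled):
            # Deal treatments out in turn so each block is split evenly.
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical subjects, for illustration only.
blocks = {
    "children": ["c1", "c2", "c3", "c4"],
    "adults": ["a1", "a2", "a3", "a4"],
}
plan = assign_within_blocks(blocks, ["medication", "placebo"], seed=1)
for subject, (block, treatment) in sorted(plan.items()):
    print(subject, block, treatment)
```

Because the shuffle happens separately for children and for adults, each block ends up with the same number of subjects on each treatment.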

case - a generic term for the subjects or units in a study; an individual person, animal, or thing for which values of variables are recorded.

categorical variable - records which of several categories an individual falls into.

census - an attempt to contact every individual in a population. With large populations such as that of the U.S., actually reaching every individual is practically impossible.

common response - when two variables both respond to changes in some unobserved variable. For example, having more churches does not explain higher levels of crime; both are a common response to an increase in population.

confounding - when the effect of an explanatory variable on a response variable is hopelessly mixed up with the effects of other variables on that response. The claim that the sloppy lifestyle of smokers, rather than smoking alone, contributes to their high death rate is an example of a claimed case of confounding. [In this particular case, however, a wealth of scientific evidence has been collected to show that confounding is probably not occurring, i.e., that smoking itself is causing something to happen.]

control group - a group that is subject to the same conditions as a group receiving treatment. The control group does not receive treatment, but exists to reveal outside variables. Without control groups studies may reveal effects of treatment that are actually related to other conditions.

correlation (r) - measures the strength and direction of the linear association between two quantitative variables. If correlation is positive, then y tends to increase with x. If correlation is negative, then y tends to decrease as x increases. [To measure strength, we actually usually look at r^2, since that will always be nonnegative.]

Correlation      Description
|r| < .50        weak correlation
.50              some correlation
.75              moderate, fairly good
.90              strong, good
.99              excellent, almost perfect
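The correlation r can be computed directly from paired data. The sketch below uses the standard formula for Pearson's r; the function name correlation and the small data set are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = correlation(hours, score)
print(round(r, 3), round(r ** 2, 3))
```

For this made-up data r is about .77, which the table would call moderate, and r^2 is .60.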

correlation vs. causation - correlation does not imply causation because of possible common response and confounding.

experiment - deliberately imposes some treatment on the experimental units or subjects in order to observe the response.

explanatory variable - attempts to explain observed outcomes. For example, amount of exercise helps explain physical fitness.

exploratory data analysis - examining data when we do not know what the data may reveal and have no specific questions in mind.

exponential growth - each value increases by a fixed percentage of the previous total. To assess correlation for exponential growth, take the log of all y values and then perform a linear regression on the points (x, log y).
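A short Python sketch of the log-then-regress idea follows. The data are hypothetical (a quantity that starts at 100 and grows 50% per period), and fit_line is a helper written for this example, not a library function.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: grows by 50% each period (illustration only).
xs = [0, 1, 2, 3, 4, 5]
ys = [100 * 1.5 ** x for x in xs]

# Taking logs turns exponential growth into a straight line.
log_ys = [math.log10(y) for y in ys]
slope, intercept = fit_line(xs, log_ys)

# Undo the log to recover the growth factor and the starting value.
growth_factor = 10 ** slope
start = 10 ** intercept
print(round(growth_factor, 3), round(start, 1))
```

Since the made-up data are exactly exponential, the fitted line recovers the growth factor of 1.5 and the starting value of 100; real data would only come close.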

factor - common name for explanatory variables in experiments.

level - the specific value or amount of each factor in an experiment. The LEVEL of CO2 and the LEVEL of aspirin...

lurking variable - a variable that has an important effect on the response but is not included among the explanatory variables studied.

matching/ matched pairs - example: in testing the attraction of beetles to color, poles of one color or the other could be randomly placed outdoors. Matching would mean that each location has a pole of each color (in random order), which eliminates location as a lurking variable.

parameter - a number that describes a population.

placebo effect - the improvement that occurs when people receive a placebo but believe they are getting better from a real treatment.

prospective study - a study in which subjects are followed over time and observed for data, but no treatment is imposed on them.

quantitative variable - takes numerical values for which arithmetic operations such as means and standard deviations can be done.

response bias - bias produced by the behavior of the respondent or the interviewer. The respondent could lie, misunderstand the question, be swayed by the attitude of the interviewer, etc.

nonresponse bias – occurs when the subjects who cannot be reached or refuse to cooperate differ from the rest of the population (for example, a survey of buying habits might miss busy high-income people who cannot take the time to respond). Closely related to undercoverage bias (for example, an opinion poll conducted by telephone will miss the 7% to 8% of the American population without residential phones).

voluntary response bias – arises when participation in a survey is voluntary: people without a strong opinion tend to ignore it, so people with strong opinions are overrepresented.

response variable - measures an outcome that is supposedly explained by one or more explanatory variables

sampling (sample) - to study a part (sample) in order to gain information about the whole.

sampling distribution - "The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population."

Simpson's paradox - the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.
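A small numeric example makes the reversal concrete. All counts below are hypothetical, invented only to show the effect: option A has the higher success rate in each group, yet option B looks better once the groups are combined.

```python
# (successes, trials) for two options in two groups.
# Hypothetical counts, invented for illustration only.
data = {
    "easy cases": {"A": (90, 100), "B": (800, 1000)},
    "hard cases": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    return successes / trials

# Within each group, compare the two success rates.
for group, results in data.items():
    print(group, {option: round(rate(*counts), 2)
                  for option, counts in results.items()})

# Combine the groups by adding up successes and trials.
combined = {}
for option in ("A", "B"):
    successes = sum(data[group][option][0] for group in data)
    trials = sum(data[group][option][1] for group in data)
    combined[option] = rate(successes, trials)

print("combined", {option: round(value, 2)
                   for option, value in combined.items()})
```

The reversal happens because A was used mostly on the hard cases and B mostly on the easy ones, so the type of case acts as a lurking variable.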

simulation - using random digit tables or generators to simulate drawing an SRS from a large population.
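The random digit table procedure can be imitated in Python. In this sketch, individuals are assumed to be labeled 00, 01, 02, and so on; the function name srs_from_digits is made up, and the line of digits comes from a random generator instead of a printed table.

```python
import random

def srs_from_digits(digits, population_size, n):
    """Read fixed-width labels from a string of digits, skipping
    labels that are out of range or already chosen."""
    width = len(str(population_size - 1))  # digits per label
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if label < population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            return chosen
    raise ValueError("ran out of digits before the sample was complete")

# Stand-in for one line of a random digit table.
rng = random.Random(0)
digit_line = "".join(str(rng.randrange(10)) for _ in range(200))

# Population of 80 individuals labeled 00 to 79; choose 5 of them.
labels = srs_from_digits(digit_line, population_size=80, n=5)
print(labels)
```

Skipping repeated and out-of-range labels is the same rule used when reading a printed table by hand.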

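SRS - simple random sample. A random selection of n individuals from a population such that every possible set of n individuals has an equal chance of being the sample selected. (As a consequence, each member also has an equal chance of being selected.)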
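With a random generator, an SRS takes one line of Python. The sketch below uses random.sample from the standard library, which draws without replacement; the wrapper simple_random_sample and the list of student labels are made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members of the population at random."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical class list, for illustration only.
students = [f"student_{i:02d}" for i in range(1, 31)]
sample = simple_random_sample(students, 5, seed=42)
print(sample)
```

Passing a seed makes the draw repeatable, which is handy for checking work; leave it out for a fresh sample each time.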

statistic - a number computed from sample data, often used to estimate a parameter.

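statistical inference - drawing conclusions about the wider population from sample data. Inference can be trusted only when the data come from a well-designed study; the three basic principles of experimental design are:

1. Control - control the effects of lurking variables

2. Randomization - use SRS or other appropriate randomization method

3. Replication - repeat experiment on many subjects to reduce chance variation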

statistical significance - when an observed difference (for example, treatment vs. no treatment) is too large to reasonably attribute to chance alone. This suggests that the effect is real in the wider population and not just an accident of the particular sample.

strata - groups of units that are similar in some way that is important to the response.

stratified random sample - when the population is first divided into strata and then a separate SRS is selected within each stratum. For example, a state may be stratified by county, with a separate SRS drawn in each county and the results combined to form the full sample.
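A stratified random sample is a separate SRS inside each stratum. The Python sketch below assumes the strata are given as lists of unit labels; the function name stratified_sample and the two county lists are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Select a separate SRS of the same size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(list(units), n_per_stratum)
            for name, units in strata.items()}

# Hypothetical strata (two counties), for illustration only.
strata = {
    "county_A": [f"A{i}" for i in range(50)],
    "county_B": [f"B{i}" for i in range(20)],
}
picked = stratified_sample(strata, 3, seed=7)
for county, units in picked.items():
    print(county, units)
```

Taking the same number from every stratum is just one choice; sample sizes proportional to the size of each stratum are also common.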

subject - an experimental unit that is a human being.

transformation to achieve linearity - if the data appear to follow a particular function, then applying the inverse of that function to each y value should produce a linear pattern. Example: taking the log of y values that show exponential growth.

two-way/ three-way (tables) - show data classified by 2 and 3 categorical variables, respectively.

variability of sampling distribution - the spread of the sampling distribution. The spread depends on the sample size: larger samples give less spread. As long as the population is much larger than the sample, the spread of the sampling distribution is approximately the same for any population size.
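A simulation can check the claim about population size. The sketch below draws many SRSs of size 100 from two populations, one 100 times larger than the other, and compares the spread of the sample proportion. The function name, the true proportion of 0.6, and the number of repetitions are all assumptions chosen for the example.

```python
import random
import statistics

def spread_of_sample_proportion(population_size, sample_size,
                                true_p=0.6, repetitions=2000, seed=0):
    """Standard deviation of the sample proportion over many SRSs."""
    rng = random.Random(seed)
    # Units numbered below the cutoff count as "successes".
    cutoff = int(true_p * population_size)
    proportions = []
    for _ in range(repetitions):
        sample = rng.sample(range(population_size), sample_size)
        successes = sum(1 for unit in sample if unit < cutoff)
        proportions.append(successes / sample_size)
    return statistics.stdev(proportions)

small = spread_of_sample_proportion(10_000, 100)
large = spread_of_sample_proportion(1_000_000, 100)
print(round(small, 3), round(large, 3))
```

Both spreads should come out close to sqrt(0.6 * 0.4 / 100), about 0.049, even though one population is 100 times the size of the other.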

wording of questions - wording will almost always produce some bias, so good surveys either word the same question in multiple ways, or use (and publish) a very carefully worded question.

slope of regression line = r(Sy/Sx), where Sx and Sy are the standard deviations of the x values and the y values

regression line passes through point (xbar,ybar)

r^2 - the fraction of the variation in the response variable that is explained by the linear regression. (Usually, we multiply r^2 by 100 to convert to a %.)
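The three regression facts above can be checked numerically. This is a sketch on a small made-up data set; least_squares is a helper written for the example, and it builds the slope from r, Sx, and Sy exactly as in the formula above.

```python
import math

def least_squares(xs, ys):
    """Return slope, intercept, and r for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / ((n - 1) * s_x * s_y))
    slope = r * s_y / s_x               # slope = r(Sy/Sx)
    intercept = y_bar - slope * x_bar   # forces the line through (xbar, ybar)
    return slope, intercept, r

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r = least_squares(xs, ys)

# Fraction of variation explained: 1 - (residual variation / total variation).
y_bar = sum(ys) / len(ys)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - y_bar) ** 2 for y in ys)
explained = 1 - sse / sst

print(round(slope, 3), round(intercept, 3), round(r ** 2, 3), round(explained, 3))
```

For this data the slope is 0.6 and the intercept is 2.2, so the line gives y = 4 at x = 3, which is the point (xbar, ybar). The explained fraction matches r^2 at 0.6, or 60%.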