Normal vs. Binomial: What are the hallmarks and differences?

AP Statistics / Mr. Hansen
3/25/2003

Name: _______________________

NORMAL (z) DISTRIBUTION

The normal (z) distribution is a continuous distribution that arises in many natural processes. "Continuous" means that between any two data values we could (at least in theory) find another data value. For example, men's heights vary continuously and are the result of so many tiny random influences that the overall distribution of men's heights in America is very close to normal. Another example is the data values that we would get if we repeatedly measured the mass of a reference object on a pan balance—the readings would differ slightly because of random errors, and the readings taken as a whole would have a normal distribution.

Probabilities for a normal distribution are found as areas under the bell-shaped normal curve between any two z values. You can find these areas using either Table A in your textbook or the normalcdf function on your calculator.

Not all natural processes produce normal distributions. For example, incomes in America result from random natural capitalist processes, yet their distribution is strongly skewed to the right.

Here are some example problems. Make sure that you are familiar with BOTH METHODS for solving each problem.

Example 1.	What percentage of men are between 5'10" and 6'1" if men's heights in inches follow the N(69, 3) distribution?

	Method 1: The z score for 5'10" is .3333, and the z score for 6'1" is 1.3333, both by the z = (x – μ)/σ formula. By Table A, the area to the left of z = .3333 is about .63 (double-check me, please), and the area to the left of z = 1.3333 is about .909. Therefore, the area between z = .3333 and z = 1.3333 is .909 – .63, or approx. .28. Answer: 28%.

	Method 2: Draw a sketch with the peak at 5'9" and the points of inflection of the bell-shaped curve at 5'6" and 6'0". We put the points of inflection 3 inches above and below the mean because we were given that the standard deviation was 3. Then shade the area between 5'10" and 6'1", and mark the answer (found by punching in normalcdf(70,73,69,3), but remember that you can't write that). Answer: 28%.
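If you would like to double-check Example 1 away from the calculator, here is a minimal Python sketch of both methods. It assumes Python 3.8 or later, whose standard library includes statistics.NormalDist; the variable names are only illustrative and are not part of the handout.

```python
from statistics import NormalDist

# Men's heights in inches, modeled as N(69, 3) as in Example 1.
heights = NormalDist(mu=69, sigma=3)

# Method 1: convert to z scores, then take areas under the standard normal curve.
z_low = (70 - 69) / 3
z_high = (73 - 69) / 3
std_normal = NormalDist()  # mean 0, s.d. 1
area_by_z = std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Method 2: same idea as normalcdf(70,73,69,3), working directly in inches.
area_direct = heights.cdf(73) - heights.cdf(70)

print(round(area_by_z, 3), round(area_direct, 3))
```

Both numbers should agree with each other and round to the 28% found above. Any tiny mismatch with the Table A answer comes from rounding the z scores to two decimal places when reading the table.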

Example 2.	At what percentile for height is a man who is 5 feet, 4 and a half inches?

	Method 1: His z score is –1.5 since he is 1.5 s.d.'s below the mean. If you can't do this in your head, use the formula z = (x – μ)/σ = (64.5 – 69)/3 = –4.5/3 = –1.5. By Table A, the area to the left of z = –1.5 is .0668. Answer: 7th percentile.

	Method 2: Draw the curve as above (mean at 69, points of inflection at 66 and 72). Shade area to left of 64.5 and mark as .0668, which you find by punching normalcdf(-99999,64.5,69,3). Remember that you cannot write normalcdf on your paper. Answer: 7th percentile.

Example 3.	How tall must a man be to be at the 90th percentile for height?

	Method 1: Look in the body of Table A for an entry that is close to 90%. We find it (very closely) for z = 1.28. Use equation z = (x – μ)/σ to solve for x, the man's height. I will omit the algebra, but please do this yourself. Answer: 72.84 inches.

	Method 2: Draw the curve as above (mean at 69, points of inflection at 66 and 72). Mark and shade 10% area in a right tail, or 90% area in a left tail. On the x-axis, mark the value found by punching invNorm(.9,69,3), though of course you remember that you cannot write invNorm on your paper. Answer: 72.84 inches.
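Examples 2 and 3 run in opposite directions: one goes from a height to an area, the other from an area back to a height. The sketch below shows both, again assuming Python 3.8+ and its statistics.NormalDist class as a stand-in for normalcdf and invNorm. The variable names are my own.

```python
from statistics import NormalDist

heights = NormalDist(mu=69, sigma=3)  # N(69, 3), heights in inches

# Example 2: area to the left of 64.5 inches, like normalcdf(-99999,64.5,69,3).
percentile_of_64_5 = heights.cdf(64.5)

# Example 3: height with 90% of the area to its left, like invNorm(.9,69,3).
height_at_90th = heights.inv_cdf(0.9)

print(round(percentile_of_64_5, 4))  # about .0668
print(round(height_at_90th, 2))      # about 72.84
```

Note that cdf needs no artificial lower bound such as -99999, because it always measures the area from the far left tail.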

The central limit theorem (CLT) says that the sampling distribution of xbar will approach a normal distribution, namely N(μ, σ/√n), if the sample size is large. Thus we can use the z tables for many types of problems that seemingly have nothing to do with normally distributed data, as long as the sample size is large enough.
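To see the CLT at work, here is a small simulation sketch in Python. The choices below are assumptions made only for illustration: the population is a single roll of a fair die (flat, not bell-shaped), the sample size is n = 50, we draw 5000 samples, and the random seed is arbitrary.

```python
import random
from statistics import mean, pstdev

random.seed(2003)  # fixed seed so the run is repeatable (arbitrary choice)

n = 50              # rolls per sample (assumed sample size for this sketch)
num_samples = 5000  # how many samples we draw

# Population: one roll of a fair die. Flat, not bell-shaped.
faces = [1, 2, 3, 4, 5, 6]
mu = mean(faces)       # 3.5
sigma = pstdev(faces)  # about 1.708

# Record xbar for each of many samples of size n.
xbars = [mean(random.choices(faces, k=n)) for _ in range(num_samples)]

print("mean of xbars:", round(mean(xbars), 3), "vs mu =", mu)
print("s.d. of xbars:", round(pstdev(xbars), 3),
      "vs sigma/sqrt(n) =", round(sigma / n ** 0.5, 3))
```

The mean of the xbar values should land near μ, and their s.d. near σ/√n, even though no single die roll is anything like normal.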

BINOMIAL DISTRIBUTION

A binomial distribution is very different from a normal distribution, and yet if the sample size is large enough, the shapes will be quite similar.

The key difference is that a binomial distribution is discrete, not continuous. In other words, it is NOT always possible to find a data value between two data values: between two adjacent possible values, such as 4 and 5, there is nothing.

The requirements for a binomial distribution are

1) The r.v. of interest is the count of successes in n trials
2) The number of trials (or sample size), n, is fixed
3) Trials are independent, with fixed value p = P(success on a trial)
4) There are only two possible outcomes on each trial, called "success" and "failure." (This is where the "bi" prefix in "binomial" comes from. If there were several possible outcomes, we would need to use a multinomial distribution to account for them, but we don't study multinomial distributions in the beginning AP Statistics course.)

Consider X = number of sixes when a fair die is rolled 31 times.

Is X a binomial r.v.? Let us check...

1) X counts the number of successes (sixes) in 31 trials. CHECK!
2) The sample size (31) is fixed. CHECK!
3) Trials are independent, with p = P(six) = 1/6, a fixed value. CHECK!
4) There are only two possible outcomes on each trial. Either we get a six (success), or we fail to get a six (failure). We say p = 1/6 and q = 5/6. CHECK!

Since X is binomial, we say X follows the B(31, 1/6) distribution. Do you see why X is discrete? X could equal 4, or 5, or 6, for example, but there is no way that X could ever equal 4.25 or 4.37. (Note, however, that the mean and s.d. of X could have messy decimal values.)

You can find the relative frequency distribution for X by making a histogram as follows:

For the X = 0 bin, graph a bar of height binompdf(31,1/6,0).
For the X = 1 bin, graph a bar of height binompdf(31,1/6,1).
For the X = 2 bin, graph a bar of height binompdf(31,1/6,2).
For the X = 3 bin, graph a bar of height binompdf(31,1/6,3).
For the X = 4 bin, graph a bar of height binompdf(31,1/6,4).
For the X = 5 bin, graph a bar of height binompdf(31,1/6,5).

[And so on.] You really should do this at least once in your life. Each year, I give a HW exercise to do something similar to this, though with a smaller n.

The fast way to get the histogram, and please do this now, is to punch in the following keystrokes (note that seq means 2nd LIST OPS 5):

seq(X,X,0,31,1)→L₁
seq(binompdf(31,1/6,X),X,0,31,1)→L₂
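If you do not have the calculator handy, the two lists can be sketched in Python. The function name binompdf is borrowed from the calculator for readability; it is defined here by hand with the standard binomial formula and math.comb (Python 3.8+), not taken from any library.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k) for X ~ B(n, p), mirroring the calculator's binompdf."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 31, 1 / 6
L1 = list(range(0, n + 1))            # like seq(X,X,0,31,1)
L2 = [binompdf(n, p, x) for x in L1]  # like seq(binompdf(31,1/6,X),X,0,31,1)

# Show the first few rows, as you would see them in STAT EDIT.
for x in L1[:3]:
    print(x, round(L2[x], 5))

# All 32 probabilities should add up to 1 (up to floating-point rounding).
print(sum(L2))
```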

At this point, you can use STAT EDIT to read off the various probabilities. For example, the probability of getting 0 sixes in 31 rolls is .00351. The probability of getting 1 six in 31 rolls is .02177. The probability of getting 2 sixes in 31 rolls is .0653. I hope you are checking these numbers to make sure they are correct.

Are you?

The shorthand notation we use when making a writeup for other people to read is as follows:
P(X=0) = .00351
P(X=1) = .02177
P(X=2) = .0653

[and so on].

Now enter the following keystrokes:

2nd STATPLOT 4 ENTER (same as PlotsOff)
2nd STATPLOT 1 On
Highlight the "histogram" (third icon), set Xlist to L₁, Freq to L₂.
WINDOW Xmin=0, Xmax=31, Xscl=1, Ymin=0, Ymax=.3, Yscl=1, Xres=1
GRAPH

You should see a binomial distribution. It is "stairsteppy"—not smooth like a normal curve. And yet, the shape is quite similar to the familiar normal shape. For large values of n, a binomial distribution is so close to normal that we can use the z (normal) curve as an approximation.
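For a rough picture without the calculator, the sketch below prints the same histogram sideways as text. The scale of one # per .005 of probability is an arbitrary choice made to fit the page.

```python
from math import comb

n, p = 31, 1 / 6
probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# One '#' per 0.005 of probability (an arbitrary scale chosen to fit the page).
for k in range(13):  # the tail beyond this is too small to draw at this scale
    bar = "#" * round(probs[k] / 0.005)
    print(f"{k:2d} {probs[k]:.4f} {bar}")
```

The bars rise to a peak and then trail off more slowly on the right, which is the slight right skew you get when np is small.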

Our rules of thumb for knowing when the normal approximation to the binomial is valid are as follows:

np must be at least 10, AND
nq must be at least 10.

In our example, nq = 31(5/6) is certainly big enough, but np is not. Therefore, the normal approximation to the binomial will not be very accurate in our example.
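The rule-of-thumb check is easy to automate. Here is a minimal sketch; the function name normal_approx_ok is hypothetical, and the threshold of 10 is simply the rule of thumb stated above.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb from the handout: need np >= 10 and nq >= 10."""
    q = 1 - p
    return n * p >= threshold and n * q >= threshold

print(normal_approx_ok(31, 1 / 6))         # np is about 5.17, so False
print(normal_approx_ok(3_500_000, 1 / 6))  # both counts are huge, so True
```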

To find the mean and s.d. of X, you can punch

STAT CALC 1 L₁,L₂ ENTER

The mean is 5.167, and the s.d. is 2.075. Note that you could also have found these by using the formula E(X) = μ_X = np = 31(1/6) = 5.167 for mean, and the formula σ_X = √(npq) = √((31)(1/6)(5/6)) = 2.075 for standard deviation. When finding these on a free-response problem, you should show those formulas and then do the STAT CALC 1 L₁,L₂ as a double-check if time permits.
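The agreement between the list-based answer and the formulas can be sketched in Python as follows. The code treats the probabilities as weights that add up to 1, which is my reading of what 1-Var Stats does with a frequency list; the variable names are illustrative.

```python
from math import comb, sqrt

n, p = 31, 1 / 6
q = 1 - p
L1 = list(range(n + 1))
L2 = [comb(n, x) * p**x * q**(n - x) for x in L1]

# Weighted mean and s.d., using the probabilities in L2 as weights.
mean_from_lists = sum(x * w for x, w in zip(L1, L2))
var_from_lists = sum((x - mean_from_lists) ** 2 * w for x, w in zip(L1, L2))
sd_from_lists = sqrt(var_from_lists)

# Shortcut formulas for a binomial r.v.
mean_formula = n * p
sd_formula = sqrt(n * p * q)

print(round(mean_from_lists, 3), round(sd_from_lists, 3))  # about 5.167 and 2.075
print(round(mean_formula, 3), round(sd_formula, 3))        # same values
```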

Does it make sense that the expected value (a.k.a. mean) of X is 5.167? I think so, since in 31 rolls we would expect a little more than 5 to be sixes.

Does it make sense for the s.d. to be about 2? Yes; since the shape is roughly normal, we can see from the histogram that most of the time (at least 2/3 of the time), we get an answer of 5 plus or minus 2 (i.e., 3, 4, 5, 6, or 7). Note that you could not use this "empirical rule" if the shape were distinctly non-normal.

Here are some more example problems.

Example 4.	In 31 rolls, what is the probability of getting no sixes?

	Solution: P(X=0) = q³¹ = .00351.

Example 5.	In 31 rolls, what is the probability of getting at least one six?

	Solution: P(X≥1) = 1 – P(X<1) = 1 – P(X=0) = 1 – .00351 = .9965.

Example 6.	In 31 rolls, what is the probability of getting at least 5 sixes?

	Solution: P(X≥5) = 1 – P(X<5) = 1 – P(X≤4) = 1 – .39355 by calc. = .606. [Note: The value .39355 for P(X≤4) is obtained by punching binomcdf(31,1/6,4), but you cannot write binomcdf on your paper.]

Example 7.	In 31 rolls, what is the probability of getting exactly 2, 3, or 4 sixes?

	Solution: P(X=2, 3, or 4) = P(X=2) + P(X=3) + P(X=4) = .065297... + .12624... + .1767... by calc. = .368. [Be sure to round only at the very end. Dots signify additional accuracy beyond the accuracy shown on paper. Answers were obtained by binompdf(31,1/6,2), binompdf(31,1/6,3), and binompdf(31,1/6,4), but you cannot show that.]

	Alternate method (useful when there are many possibilities to consider): P(X=2, 3, or 4) = P(X≤4) – P(X≤1) = .39355... – .02528... by calc. = .368. [We used binomcdf to find the .39355... and .02528..., but we cannot write binomcdf.]

Example 8.	In 31 rolls, what is the probability of getting more than 3 sixes but fewer than 10 sixes?

	Solution: P(3<X<10) = P(X≤9) – P(X≤3) = .97515... – .21681... by calc. = .758. [Again we used binomcdf to find the intermediate answers, but we cannot write binomcdf.]
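Examples 4 through 8 can all be double-checked with two small helper functions. The sketch below defines binompdf and binomcdf by hand (the names are borrowed from the calculator, not from any Python library) and assumes Python 3.8+ for math.comb.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomcdf(n, p, k):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binompdf(n, p, j) for j in range(k + 1))

n, p = 31, 1 / 6

ex4 = binompdf(n, p, 0)                      # P(X = 0)
ex5 = 1 - binompdf(n, p, 0)                  # P(X >= 1)
ex6 = 1 - binomcdf(n, p, 4)                  # P(X >= 5)
ex7 = binomcdf(n, p, 4) - binomcdf(n, p, 1)  # P(2 <= X <= 4)
ex8 = binomcdf(n, p, 9) - binomcdf(n, p, 3)  # P(3 < X < 10)

for label, value in [("Ex. 4", ex4), ("Ex. 5", ex5), ("Ex. 6", ex6),
                     ("Ex. 7", ex7), ("Ex. 8", ex8)]:
    print(label, round(value, 4))
```

The pattern to notice is that every "between" or "at least" question is rewritten in terms of P(X ≤ k), since that is the only form binomcdf knows.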

Example 9.	In 31 rolls, what is the most likely number of sixes?

	Solution: Look at lists L₁ and L₂. The greatest probability value is .19088, and that occurs when X = 5. Answer: 5. Warning: The most likely number of sixes is not necessarily the value closest to the expected value (a.k.a. mean) of X. For example, in 70 rolls, the expected number of sixes is 70/6 or 11.667, but the most likely number of sixes turns out to be 11, not 12. Please verify this by entering a new L₁ and L₂, using keystrokes similar to those shown before the second set of example problems. (Use 70 in place of 31.)
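Scanning a list for its largest entry is exactly the kind of job a short program does well. In this sketch the function name most_likely_count is hypothetical; it simply builds the list of probabilities and returns the position of the biggest one.

```python
from math import comb

def most_likely_count(n, p):
    """Return the value of k with the largest P(X = k) for X ~ B(n, p)."""
    probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    return max(range(n + 1), key=lambda k: probs[k])

print(most_likely_count(31, 1 / 6))  # 5
print(most_likely_count(70, 1 / 6))  # 11, even though the mean is about 11.667
```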

Example 10.	In 3.5 million rolls of a fair die, what is the probability of getting somewhere between 583,000 and 584,000 sixes, inclusive?

	Solution: Here the sample size is so huge that (depending on the model of calculator you are using) you may choke it if you try to enter binomcdf(3500000,1/6,584000) – binomcdf(3500000,1/6,582999). Clearly, the normal approximation to the binomial is a much better method.

	Check rules of thumb using n = 3,500,000 and p = 1/6.
	np = 583,333.333 >> 10 CHECK!
	nq = 2,916,666.667 >> 10 CHECK!

	E(X) = μ_X = np = 3500000/6 [store this as M]
	σ_X = √(npq) = 697.2166888 [store this as S]

	Draw a normal curve centered on 583,333.333 and having points of inflection approx. 700 units above and below that. Shade the area between 583,000 and 584,000 and mark the probability found by punching normalcdf(583000,584000,M,S), though of course you cannot write normalcdf. Answer: .514.
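The normal-approximation steps for Example 10 can be sketched in Python as follows, assuming Python 3.8+ for statistics.NormalDist. The names M and S match the calculator storage variables used in the solution; like the solution, this sketch does not apply a continuity correction.

```python
from math import sqrt
from statistics import NormalDist

n, p = 3_500_000, 1 / 6
q = 1 - p

M = n * p            # mean, about 583,333.333
S = sqrt(n * p * q)  # s.d., about 697.217

approx = NormalDist(mu=M, sigma=S)

# Same idea as normalcdf(583000,584000,M,S).
answer = approx.cdf(584_000) - approx.cdf(583_000)

print(round(M, 3), round(S, 4), round(answer, 3))
```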

Example 11.	Suppose that 15% of the people in a city are slippery. Explain why the count of slippery people in an SRS of 100 people from this city is not binomial.

	Solution: An SRS is sampling WITHOUT replacement, i.e., not independent trials. We must have independent trials for the count X of slippery people to be a binomial r.v. However, if the city is "large" (by which we mean that the population is at least 10 times the sample), the distinction between SRS and independent trials can be ignored. By this rule of thumb, we could use binomial methods if the city had at least 1000 people. [If the city has at least 1000 people, note that since np = 100(.15) = 15 > 10, and nq = 100(.85) = 85 > 10, we could also use the normal approximation to the binomial if we so desired.]
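To see why the "10 times the sample" rule of thumb is reasonable, the sketch below compares the binomial probabilities with the exact probabilities for sampling without replacement. Be aware that the exact model (known as the hypergeometric distribution) is not part of this handout, and the city sizes of 200, 1,000, and 100,000 are made-up values chosen only for illustration. The function names are hypothetical.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) with independent trials (sampling with replacement)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def srs_pmf(pop_size, num_slippery, n, k):
    """P(X = k) for an SRS of size n drawn WITHOUT replacement (hypergeometric)."""
    return (comb(num_slippery, k) * comb(pop_size - num_slippery, n - k)
            / comb(pop_size, n))

n, p = 100, 0.15

def worst_gap(pop_size):
    """Largest difference between the two models over k = 0..100."""
    num_slippery = round(p * pop_size)  # 15% of the city is slippery
    return max(abs(srs_pmf(pop_size, num_slippery, n, k) - binom_pmf(n, p, k))
               for k in range(n + 1))

for pop_size in (200, 1_000, 100_000):
    print(pop_size, round(worst_gap(pop_size), 5))
```

The gap between the two models shrinks as the city grows relative to the sample, which is the whole point of the rule of thumb.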

SUMMARY

Normal distributions are continuous and have a special bell shape.

Binomial distributions are discrete ("stairsteppy"); they are close to normal only if the sample size satisfies np ≥ 10 and nq ≥ 10.

Normal distributions arise in three general areas:

1) Natural processes where the data value (e.g., height) is the result of many small random inputs.
2) Sampling distribution of xbar, where either the underlying distribution is normal or (more commonly) where the sample size is large enough for the CLT to take effect. Rules of thumb are on p.606 of textbook.
3) Repeated measurement of a fixed phenomenon (e.g., the orbital period of Mars, the mass of a moon rock, or the height of a mountain). Most phenomena cannot be measured precisely: even if we have an accurate pan balance or laser range finder or whatever, there will always be some uncertainty or error in our measurement. For this reason, the normal curve is sometimes called the "error curve." However, #3 is really just a special case of #1.

Binomial distributions arise whenever the r.v. of interest is the count of successes in a fixed number (n) of independent trials. The four rules are listed near the beginning of the “binomial distribution” section, before the second set of example problems.