# CHAPTER 0

# Basic Prerequisite Knowledge

Readers need some of the knowledge contained in a basic course in statistics to tackle regression. We summarize some of the main requirements very briefly in this chapter. Also useful is a pocket calculator capable of getting sums of squares and sums of products easily. Excellent calculators of this type cost about $25–50 in the United States. Buy the most versatile you can afford.

# 0.1. DISTRIBUTIONS: NORMAL, *t*, AND *F*

##### Normal Distribution

The normal distribution occurs frequently in the natural world, either for data “as they come” or for transformed data. The heights of a large group of people selected randomly will look normal in general, for example. The distribution is symmetric about its mean *μ* and has a standard deviation *σ*, which is such that practically all of the distribution (99.73%) lies inside the range *μ* – 3*σ* ≤ *x* ≤ *μ* + 3*σ*. The frequency function is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\}, \qquad -\infty < x < \infty. \tag{0.1.1}$$

We usually write that *x* ~ *N*(*μ*, *σ*²), read as “*x* is normally distributed with mean *μ* and variance *σ*².” Most manipulations are done in terms of the *standard normal* or *unit normal* distribution, *N*(0, 1), for which *μ* = 0 and *σ* = 1. To move from a general normal variable *x* to a standard normal variable *z*, we set

$$z = \frac{x-\mu}{\sigma}. \tag{0.1.2}$$

A standard normal distribution is shown in Figure 0.1 along with some properties useful in certain regression contexts. All the information shown is obtainable from the normal table in the Tables section. Check that you understand how this is done. Remember to use the fact that the total area under each curve is 1.
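These areas can also be checked numerically: the standardization (0.1.2) together with the error function gives normal probabilities directly. A minimal sketch using only Python's standard library (the values of *μ* and *σ* are arbitrary):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via standardization, Eq. (0.1.2)."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The area within mu +/- 3*sigma is about 0.9973 regardless of mu and sigma.
mu, sigma = 10.0, 2.0
inside = normal_cdf(mu + 3 * sigma, mu, sigma) - normal_cdf(mu - 3 * sigma, mu, sigma)
print(round(inside, 4))  # 0.9973
```

Because the total area under the curve is 1, any tail area in the normal table can be recovered as 1 minus the complementary area, exactly as the figure suggests.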

##### Gamma Function

The gamma function Γ(*q*), which occurs in Eqs. (0.1.3) and (0.1.4), is defined in general as an integral:

$$\Gamma(q) = \int_0^{\infty} t^{\,q-1}e^{-t}\,dt, \qquad q > 0.$$

**Figure 0.1.** The standard (or unit) normal distribution *N*(0, 1) and some of its properties.

However, it is easier to think of it as a generalized factorial with the basic property that, for any *q*,

$$\Gamma(q) = (q-1)\Gamma(q-1) = (q-1)(q-2)\Gamma(q-2),$$

and so on. Moreover,

$$\Gamma(1) = 1 \qquad \text{and} \qquad \Gamma(\tfrac{1}{2}) = \pi^{1/2},$$

so that, for a positive integer *q*, Γ(*q*) = (*q* – 1)!. So, for the applications of Eqs. (0.1.3) and (0.1.4), where *v*, *m*, and *n* are integers, the gamma functions are either simple factorials or simple products ending in $\pi^{1/2}$.
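These generalized-factorial properties are easy to verify numerically with Python's `math.gamma` (the choice of *q* = 5.5 below is arbitrary):

```python
import math

# Basic recursion: Gamma(q) = (q - 1) * Gamma(q - 1), for any q > 1.
q = 5.5
print(math.isclose(math.gamma(q), (q - 1) * math.gamma(q - 1)))  # True

# For a positive integer q, Gamma(q) = (q - 1)!.
print(math.gamma(6) == math.factorial(5))  # True: both equal 120

# Gamma(1/2) = pi**0.5, so half-integer arguments give products ending in pi**0.5,
# e.g. Gamma(5/2) = (3/2) * (1/2) * pi**0.5.
print(math.isclose(math.gamma(2.5), 1.5 * 0.5 * math.sqrt(math.pi)))  # True
```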

*Example 1*

*Example 2*

##### *t*-Distribution

There are many *t*-distributions, because the form of the curve, defined by

$$f(t) = \frac{\Gamma\{(v+1)/2\}}{(\pi v)^{1/2}\,\Gamma(v/2)}\left(1+\frac{t^{2}}{v}\right)^{-(v+1)/2}, \qquad -\infty < t < \infty, \tag{0.1.3}$$

depends on *v*, the number of degrees of freedom.

**Figure 0.2.** The *t*-distributions for *v* = 1, 9, ∞; *t*(∞) = *N*(0, 1).

In general, the *t*(*v*) distribution looks somewhat like a standard (unit) normal but is “heavier in the tails,” and so lower in the middle, because the total area under the curve is 1. As *v* increases, the distribution becomes “more normal.” In fact, *t*(∞) *is* the *N*(0, 1) distribution, and, when *v* exceeds about 30, there is so little difference between *t*(*v*) and *N*(0, 1) that it has become conventional (but not mandatory) to use the *N*(0, 1) instead. Figure 0.2 illustrates the situation. A two-tailed table of percentage points is given in the Tables section.
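The “heavier tails, lower middle” behavior can be seen by evaluating the *t* density at *t* = 0 for increasing *v* and comparing against the *N*(0, 1) density. A small sketch (the density formula is Eq. (0.1.3); the values of *v* echo Figure 0.2):

```python
import math

def t_density(t, v):
    """Density of the t(v) distribution, Eq. (0.1.3)."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(math.pi * v) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def normal_density(z):
    """Standard normal density, Eq. (0.1.1) with mu = 0, sigma = 1."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# At t = 0 the t(v) curve sits below N(0, 1), and the gap shrinks as v grows.
for v in (1, 9, 30, 200):
    print(v, round(t_density(0.0, v), 4))
print("N(0,1):", round(normal_density(0.0), 4))
```

The printed heights increase toward the normal value as *v* increases, which is the numerical counterpart of the “becomes more normal” statement above.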

##### *F*-Distribution

The *F*-distribution depends on two separate degrees of freedom, *m* and *n*, say. Its curve is defined by

$$f(F) = \frac{\Gamma\{(m+n)/2\}}{\Gamma(m/2)\,\Gamma(n/2)}\left(\frac{m}{n}\right)^{m/2} F^{\,m/2-1}\left(1+\frac{mF}{n}\right)^{-(m+n)/2}, \qquad 0 \le F < \infty. \tag{0.1.4}$$

The distribution rises from zero, sometimes quite steeply for certain *m* and *n*, reaches a peak, and then falls away in a long right-hand tail; it is skewed to the right. See Figure 0.3. Percentage points for the upper tail levels of 10%, 5%, and 1% are in the Tables section.

**Figure 0.3.** Some selected *F*(*m*, *n*) distributions.

The *F*-distribution is usually introduced in the context of testing to see whether two variances are equal, that is, the null hypothesis $H_0\colon \sigma_1^2/\sigma_2^2 = 1$, versus the alternative hypothesis $H_1\colon \sigma_1^2/\sigma_2^2 \ne 1$. The test uses the statistic $F = s_1^2/s_2^2$, where $s_1^2$ and $s_2^2$ are statistically independent estimates of $\sigma_1^2$ and $\sigma_2^2$, with $v_1$ and $v_2$ degrees of freedom (df), respectively, and depends on the fact that, if the two samples that give rise to $s_1^2$ and $s_2^2$ are independent and normal, then $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ follows the $F(v_1, v_2)$ distribution. Thus *if* $\sigma_1^2 = \sigma_2^2$, $F = s_1^2/s_2^2$ follows $F(v_1, v_2)$. When given in basic statistics courses, this is usually described as a two-tailed test, which it usually is. In regression applications, it is typically a one-tailed, upper-tailed test. This is because regression tests always involve putting the “*s*² that could be too big, but cannot be too small” at the top of the *F*-statistic and the “*s*² that we think estimates the true *σ*² well” at the bottom. In other words, we are in the situation where we test $H_0\colon \sigma_1^2 = \sigma_2^2$ versus $H_1\colon \sigma_1^2 > \sigma_2^2$.
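The variance-ratio test can be sketched end to end. The two samples below are hypothetical, and the upper-tail probability is obtained by crude numerical integration of the *F* density rather than from the *F*-table; in practice one would use the Tables section or a statistics package:

```python
import math
import statistics

def f_density(x, m, n):
    """Density of the F(m, n) distribution, Eq. (0.1.4)."""
    c = (math.gamma((m + n) / 2) / (math.gamma(m / 2) * math.gamma(n / 2))
         * (m / n) ** (m / 2))
    return c * x ** (m / 2 - 1) * (1 + m * x / n) ** (-(m + n) / 2)

def f_upper_tail(f_obs, m, n, steps=200_000, upper=1_000.0):
    """P(F > f_obs), by crude trapezoidal integration of the density."""
    h = (upper - f_obs) / steps
    area = 0.5 * (f_density(f_obs, m, n) + f_density(upper, m, n))
    for i in range(1, steps):
        area += f_density(f_obs + i * h, m, n)
    return area * h

# Hypothetical samples; each variance estimate has (sample size - 1) = 7 df.
sample1 = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.9, 4.4]
sample2 = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8, 5.0, 5.2]
s1_sq = statistics.variance(sample1)   # goes on top: "could be too big"
s2_sq = statistics.variance(sample2)
F = s1_sq / s2_sq
p_one_tail = f_upper_tail(F, len(sample1) - 1, len(sample2) - 1)
print(round(F, 2), round(p_one_tail, 4))
```

For a two-tailed version as taught in basic courses, the one-tailed probability would be doubled (with the larger variance estimate conventionally placed on top).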

# 0.2. CONFIDENCE INTERVALS (OR BANDS) AND *t*-TESTS

Let *θ* be a parameter (or “thing”) that we want to estimate. Let $\hat{\theta}$ be an estimate of *θ* (“estimate of thing”). Typically, $\hat{\theta}$ will follow a normal distribution, either exactly because of the normality of the observations in $\hat{\theta}$, or approximately due to the effect of the Central Limit Theorem. Let $\sigma_{\hat{\theta}}$ be the standard deviation of $\hat{\theta}$ and let se($\hat{\theta}$) be the standard error, that is, the estimated standard deviation, of $\hat{\theta}$ (“standard error of thing”), based on *v* degrees of freedom. Typically we get se($\hat{\theta}$) by substituting an estimate (based on *v* degrees of freedom) of an unknown standard deviation into the formula for $\sigma_{\hat{\theta}}$.

1. A 100(1 – *α*)% confidence interval (CI) for the parameter *θ* is given by

$$\hat{\theta} \pm t_v(1-\alpha/2)\,\mathrm{se}(\hat{\theta}), \tag{0.2.1}$$

where $t_v(1-\alpha/2)$ is the percentage point of a *t*-variable with *v* degrees of freedom (df) that leaves a probability *α*/2 in the upper tail, and so 1 – *α*/2 in the lower tail. A two-tailed table where these percentage points are listed under the heading of 2(*α*/2) = *α* is given in the Tables section. Equation (0.2.1) in words is

$$\text{estimate of thing} \pm (t\text{ percentage point}) \times \text{standard error of thing}. \tag{0.2.2}$$

2. To test *θ* = $\theta_0$, where $\theta_0$ is some specified value of *θ* that is presumed to be valid (often $\theta_0 = 0$ in tests of regression coefficients), we evaluate the statistic

$$t = \frac{\hat{\theta} - \theta_0}{\mathrm{se}(\hat{\theta})}, \tag{0.2.3}$$

or, in words,

$$t = \frac{\text{estimate of thing} - \text{postulated value of thing}}{\text{standard error of thing}}. \tag{0.2.4}$$

**Figure 0.4.** Two cases for a *t*-test. (*a*) The observed *t* is positive (black dot) and the upper tail area is *δ*. A two-tailed test considers that this value could just as well have been negative (open “phantom” dot) and quotes “a two-tailed *t*-probability of 2*δ*.” (*b*) The observed *t* is negative; similar argument, with tails reversed.

This “observed value of *t*” (our “dot”) is then placed on a diagram of the *t*(*v*) distribution. [Recall that *v* is the number of degrees of freedom on which se($\hat{\theta}$) is based, that is, the number of df in the estimate of *σ*² that was used.] The tail probability beyond the dot is evaluated and doubled for a two-tail test. See Figure 0.4 for the probability 2*δ*. It is conventional to ask whether the 2*δ* value is “significant” by concluding that, if 2*δ* < 0.05, *t* is significant and the idea (or hypothesis) that *θ* = $\theta_0$ is unlikely and so “rejected,” whereas if 2*δ* > 0.05, *t* is nonsignificant and we “do not reject” the hypothesis *θ* = $\theta_0$. The alternative hypothesis here is *θ* ≠ $\theta_0$, a two-sided alternative. Note that the value 0.05 is not handed down in holy writings, although we sometimes talk as though it is. Using an “alpha level” of *α* = 0.05 simply means we are prepared to risk a 1 in 20 chance of making the wrong decision. If we wish to go to *α* = 0.10 (1 in 10) or *α* = 0.01 (1 in 100), that is up to us. Whatever we decide, we should remain consistent about this level throughout our testing.
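Steps 1 and 2 can be put together in a short sketch. The data below are hypothetical, *t*₉(0.975) = 2.262 is taken from the two-tailed *t*-table, and the two-tailed probability 2*δ* is obtained by numerically integrating the *t*(*v*) density rather than read from a table:

```python
import math
import statistics

def t_density(t, v):
    """Density of the t(v) distribution."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(math.pi * v) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def two_tailed_p(t_obs, v, steps=100_000, upper=60.0):
    """2*delta: twice the tail area beyond |t_obs|, by trapezoidal integration."""
    a = abs(t_obs)
    h = (upper - a) / steps
    area = 0.5 * (t_density(a, v) + t_density(upper, v))
    for i in range(1, steps):
        area += t_density(a + i * h, v)
    return 2.0 * area * h

# Hypothetical data; theta is the population mean, estimated by the sample mean.
data = [9.8, 10.2, 10.4, 9.9, 10.1, 10.0, 10.3, 9.7, 10.2, 10.0]
n = len(data)
theta_hat = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)   # se(theta-hat), on v = n - 1 = 9 df

# 95% CI, Eq. (0.2.1): t_9(0.975) = 2.262 from the two-tailed t-table.
t_crit = 2.262
lower, upper_ci = theta_hat - t_crit * se, theta_hat + t_crit * se

# t-test of theta0 = 10, Eq. (0.2.3), with its two-tailed probability 2*delta.
t_obs = (theta_hat - 10.0) / se
p = two_tailed_p(t_obs, v=n - 1)
print(f"CI: ({lower:.3f}, {upper_ci:.3f}), t = {t_obs:.3f}, 2*delta = {p:.3f}")
```

Here 2*δ* comes out well above 0.05, and the CI contains 10, so the two procedures agree: we would not reject *θ* = 10 at the *α* = 0.05 level.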

However, it is pointless to agonize too much about *α*. A journal editor who will publish a paper describing an experiment if 2*δ* = 0.049, but will not publish it if 2*δ* = 0.051 is placing a purely arbitrary standard on...
