In this section we discuss correlation analysis, a technique used to quantify the association between two continuous variables. For example, we might want to quantify the association between body mass index and systolic blood pressure, or between hours of exercise per week and percent body fat. Regression analysis is a related technique to assess the relationship between an outcome variable and one or more risk factors or confounding variables (confounding is discussed later). The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted "Y" and the independent variables are denoted by "X".

[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a situation in which inferences should be limited to identifying associations. The terms "independent" and "dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.]

Learning Objectives

After completing this module, the student will be able to:

Define and provide examples of dependent and independent variables in a study of a public health problem
Compute and interpret a correlation coefficient
Compute and interpret coefficients in a linear regression analysis


Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two continuous variables.

It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this. Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables.
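The point about non-linear associations can be illustrated with a small sketch. The data are hypothetical (a perfectly curved relationship, y = x squared, chosen only for illustration), and the helper name `pearson_r` is an assumption of this example rather than anything defined in the module.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curved = pearson_r(x, y)
print(f"r for y = x^2 on symmetric x values: {r_curved:.2f}")
```

Here the correlation is zero even though y depends entirely on x, which is why a scatter plot is worth examining before relying on r.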

The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

[Figure: four scatter plots illustrating Scenarios 1 through 4]

Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r=-0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.


Example - Correlation of Gestational Period and Birth Weight

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

Infant ID #   Gestational Age (weeks)   Birth Weight (grams)
1             34.7                      1895
2             36.0                      2030
3             29.3                      1440
4             40.1                      2835
5             35.7                      3090
6             42.4                      3827
7             40.3                      3260
8             37.3                      2690
9             40.9                      3285
10            38.3                      2920
11            38.5                      3430
12            41.4                      3657
13            39.7                      3685
14            39.7                      3345
15            41.1                      3260
16            38.0                      2680
17            38.7                      2005

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the figure below.

[Figure: scatter diagram of birth weight (grams) against gestational age (weeks)]

Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the birth weight, measured in grams). Note that the independent variable (gestational age) is on the horizontal axis (or X-axis), and the dependent variable (birth weight) is on the vertical axis (or Y-axis). The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.

Computing the Correlation Coefficient

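The formula for the sample correlation coefficient is:

$$r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}}$$

where Cov(x,y) is the covariance of x and y, defined as

$$\mathrm{Cov}(x,y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

and $s_x^2$ and $s_y^2$ are the sample variances of x and y, defined as follows:

$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

and

$$s_y^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1}$$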

The variances of x and y measure the variability of the x scores and y scores around their respective sample means of X and Y, considered separately. The covariance measures the variability of the (x,y) pairs around the mean of x and mean of y, considered simultaneously.


To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

We first summarize the gestational age data. The mean gestational age is:

$$\bar{X} = \frac{\sum X}{n} = \frac{652.1}{17} = 38.4$$

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below.

Infant ID #   Gestational Age (weeks)   $(X - \bar{X})$   $(X - \bar{X})^2$
1             34.7                      -3.7              13.69
2             36.0                      -2.4              5.76
3             29.3                      -9.1              82.81
4             40.1                      1.7               2.89
5             35.7                      -2.7              7.29
6             42.4                      4.0               16.0
7             40.3                      1.9               3.61
8             37.3                      -1.1              1.21
9             40.9                      2.5               6.25
10            38.3                      -0.1              0.01
11            38.5                      0.1               0.01
12            41.4                      3.0               9.0
13            39.7                      1.3               1.69
14            39.7                      1.3               1.69
15            41.1                      2.7               7.29
16            38.0                      -0.4              0.16
17            38.7                      0.3               0.09

Totals: $\sum X = 652.1$ and $\sum (X - \bar{X})^2 = 159.45$

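The variance of gestational age is:

$$s_x^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{159.45}{16} = 10.0$$

Next, we summarize the birth weight data. The mean birth weight is:

$$\bar{Y} = \frac{\sum Y}{n} = \frac{49{,}334}{17} = 2902$$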

The variance of birth weight is computed just as we did for gestational age, as shown in the table below.

Infant ID #   Birth Weight (grams)   $(Y - \bar{Y})$   $(Y - \bar{Y})^2$
1             1895                   -1007             1,014,049
2             2030                   -872              760,384
3             1440                   -1462             2,137,444
4             2835                   -67               4,489
5             3090                   188               35,344
6             3827                   925               855,625
7             3260                   358               128,164
8             2690                   -212              44,944
9             3285                   383               146,689
10            2920                   18                324
11            3430                   528               278,784
12            3657                   755               570,025
13            3685                   783               613,089
14            3345                   443               196,249
15            3260                   358               128,164
16            2680                   -222              49,284
17            2005                   -897              804,609

Totals: $\sum Y = 49{,}334$, $\sum (Y - \bar{Y}) = 0$, and $\sum (Y - \bar{Y})^2 = 7{,}767{,}660$

The variance of birth weight is:

$$s_y^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1} = \frac{7{,}767{,}660}{16} = 485{,}478.8$$

Next we compute the covariance:

To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, that is:

$$(X - \bar{X})(Y - \bar{Y})$$

The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply.

Infant ID #   $(X - \bar{X})$   $(Y - \bar{Y})$   $(X - \bar{X})(Y - \bar{Y})$
1             -3.7              -1007             3725.9
2             -2.4              -872              2092.8
3             -9.1              -1462             13,304.2
4             1.7               -67               -113.9
5             -2.7              188               -507.6
6             4.0               925               3700.0
7             1.9               358               680.2
8             -1.1              -212              233.2
9             2.5               383               957.5
10            -0.1              18                -1.8
11            0.1               528               52.8
12            3.0               755               2265.0
13            1.3               783               1017.9
14            1.3               443               575.9
15            2.7               358               966.6
16            -0.4              -222              88.8
17            0.3               -897              -269.1

Total = 28,768.4

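The covariance of gestational age and birth weight is:

$$\mathrm{Cov}(x,y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \frac{28{,}768.4}{16} = 1798.0$$

Finally, we can now compute the sample correlation coefficient:

$$r = \frac{\mathrm{Cov}(x,y)}{\sqrt{s_x^2 \, s_y^2}} = \frac{1798.0}{\sqrt{10.0 \times 485{,}478.8}} = \frac{1798.0}{2203.4} = 0.82$$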

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
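The hand computation for the 17 infants can be checked with a short script. This is a minimal sketch using only the data from the table; the variable names are choices made for this example. It uses the unrounded mean gestational age, so the intermediate variance of gestational age differs slightly from the hand calculation (which uses the rounded mean 38.4), while r still rounds to 0.82.

```python
import math

# Data copied from the table of 17 infants.
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

n = len(gestational_age)
mean_x = sum(gestational_age) / n
mean_y = sum(birth_weight) / n

# Sample variances and covariance, each with n - 1 in the denominator.
var_x = sum((x - mean_x) ** 2 for x in gestational_age) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in birth_weight) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(gestational_age, birth_weight)) / (n - 1)

# Correlation: covariance divided by the square root of the product of variances.
r = cov_xy / math.sqrt(var_x * var_y)

print(f"mean x = {mean_x:.1f}, mean y = {mean_y:.0f}")
print(f"var x = {var_x:.1f}, var y = {var_y:.1f}, cov = {cov_xy:.1f}")
print(f"r = {r:.2f}")
```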

As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4 (or -0.4) for positive (or negative) associations. There are also statistical tests to determine whether an observed correlation is statistically significant or not (i.e., statistically significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller [1].
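As a hedged illustration of what such a test can look like, one commonly used approach converts r into a t statistic with n - 2 degrees of freedom, t = r * sqrt(n - 2) / sqrt(1 - r^2). This is an assumption of the sketch below, not necessarily the exact procedure presented in the cited text, and the function name is hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero.

    Assumes the usual t test with n - 2 degrees of freedom.
    """
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Values from the gestational age example: r = 0.82 with n = 17 infants.
t_stat = correlation_t_statistic(0.82, 17)
print(f"t = {t_stat:.2f} on {17 - 2} degrees of freedom")
```

The resulting statistic would then be compared with a critical value from a t table (about 2.131 for a two-sided test at the 5% level with 15 degrees of freedom).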

Regression Analysis

Regression analysis is a widely used technique which is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules.

Simple Linear Regression

Simple linear regression is a technique that is appropriate to understand the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable. For example, suppose we want to assess the association between total cholesterol (in milligrams per deciliter, mg/dL) and body mass index (BMI, measured as the ratio of weight in kilograms to height in meters squared), where total cholesterol is the dependent variable and BMI is the independent variable. In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. So, in this case, Y=total cholesterol and X=BMI.

When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. This analysis assumes that there is a linear association between the two variables. (If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.)

The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed (x, y) pair, in this case, BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) is on the vertical axis.

[Figure: BMI and Total Cholesterol (scatter diagram)]

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels.

In contrast, the graph below depicts the relationship between BMI and HDL cholesterol in the same sample of n=20 participants.

[Figure: BMI and HDL Cholesterol (scatter diagram)]

This graph shows a negative or inverse association between BMI and HDL cholesterol, i.e., those with lower BMI are more likely to have higher HDL cholesterol levels and those with higher BMI are more likely to have lower HDL cholesterol levels.

For either of these relationships we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:

$$\hat{Y} = b_0 + b_1 X$$

where $\hat{Y}$ is the predicted or expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept, and b1 is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., the estimates minimize:

$$\sum (Y - \hat{Y})^2$$

These differences between observed and predicted values of the outcome are called residuals. The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates [1].

Residuals

Conceptually, if the values of X provided a perfect prediction of Y, then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.
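To make the idea of residuals concrete, the following sketch fits a least squares line to a small hypothetical data set (the five (x, y) pairs are invented for illustration) and checks that moving the intercept or slope away from the least squares estimates increases the sum of squared residuals. The helper names `fit_line` and `sum_squared_residuals` are assumptions of this example.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, roughly linear with some scatter.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse_best = sum_squared_residuals(x, y, b0, b1)
sse_shifted_slope = sum_squared_residuals(x, y, b0, b1 + 0.1)
sse_shifted_intercept = sum_squared_residuals(x, y, b0 + 0.1, b1)

print("residuals:", [round(e, 3) for e in residuals])
print("SSE at least squares estimates:", round(sse_best, 4))
print("SSE with slope shifted by 0.1:", round(sse_shifted_slope, 4))
```

The residuals are not all zero, so X does not perfectly predict Y here, but no other straight line gives a smaller sum of squared residuals.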

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) relative to a one unit change in the independent variable (X). The least squares estimates of the y-intercept and slope are computed as follows:

$$b_1 = r \, \frac{s_y}{s_x}$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}$$

where r is the sample correlation coefficient, $\bar{X}$ and $\bar{Y}$ are the sample means, and $s_x$ and $s_y$ are the standard deviations of the independent variable x and the dependent variable y, respectively.
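As an illustration of these two formulas, the sketch below applies them to the gestational age and birth weight data from the correlation example. The module itself does not report this regression, so the fitted coefficients are only an illustration produced by this example, and the variable names are assumptions.

```python
import math

# Data from the correlation example (17 infants).
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 42.4, 40.3, 37.3, 40.9,
                   38.3, 38.5, 41.4, 39.7, 39.7, 41.1, 38.0, 38.7]
birth_weight = [1895, 2030, 1440, 2835, 3090, 3827, 3260, 2690, 3285,
                2920, 3430, 3657, 3685, 3345, 3260, 2680, 2005]

n = len(gestational_age)
mean_x = sum(gestational_age) / n
mean_y = sum(birth_weight) / n

# Sample standard deviations and covariance (n - 1 denominators).
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in gestational_age) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in birth_weight) / (n - 1))
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(gestational_age, birth_weight)) / (n - 1)
r = cov_xy / (s_x * s_y)

# Least squares estimates from r, the standard deviations, and the means.
b1 = r * s_y / s_x
b0 = mean_y - b1 * mean_x

print(f"slope b1 = {b1:.1f} grams per week of gestation")
print(f"intercept b0 = {b0:.1f} grams")
```

Because b1 = r(s_y/s_x) is algebraically the same as Cov(x,y) divided by the variance of x, the slope always has the same sign as r. The intercept here is negative and not interpretable, since a gestational age of zero lies far outside the observed data.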

BMI and Total Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and total cholesterol are b0 = 28.07 and b1 = 6.49. These are computed from the sample correlation coefficient, means, and standard deviations using the formulas for b1 and b0 given above.

The estimate of the Y-intercept (b0 = 28.07) represents the estimated total cholesterol level when BMI is zero. Because a BMI of zero is meaningless, the Y-intercept is not informative. The estimate of the slope (b1 = 6.49) represents the change in total cholesterol relative to a one unit change in BMI. For example, if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6.49 units (with the person with the higher BMI having the higher total cholesterol).

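The equation of the regression line is as follows:

$$\hat{Y} = 28.07 + 6.49 \, (\text{BMI})$$

The graph below shows the estimated regression line superimposed on the scatter diagram.

[Figure: estimated regression line superimposed on the scatter diagram of BMI and total cholesterol]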

The regression equation can be used to estimate a participant's total cholesterol as a function of his/her BMI. For example, suppose a participant has a BMI of 25. We would estimate their total cholesterol to be 28.07 + 6.49(25) = 190.32. The equation can also be used to estimate total cholesterol for other values of BMI. However, the equation should only be used to estimate cholesterol levels for persons whose BMIs are in the range of the data used to generate the regression equation. In our sample, BMI ranges from 20 to 32, thus the equation should only be used to generate estimates of total cholesterol for persons with BMI in that range.
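The prediction rule and its range restriction can be sketched as a small function. The coefficients and the BMI range of 20 to 32 come from the text; the function name and the choice to raise an error outside that range are assumptions of this example.

```python
# Estimated coefficients and observed BMI range reported in the text.
B0 = 28.07
B1 = 6.49
BMI_MIN = 20.0
BMI_MAX = 32.0

def predict_total_cholesterol(bmi):
    """Estimate total cholesterol (mg/dL) from BMI using the fitted line.

    Refuses to extrapolate beyond the BMI range of the sample.
    """
    if not BMI_MIN <= bmi <= BMI_MAX:
        raise ValueError(
            f"BMI {bmi} is outside the observed range {BMI_MIN} to {BMI_MAX}"
        )
    return B0 + B1 * bmi

estimate = predict_total_cholesterol(25)
print(f"Estimated total cholesterol at BMI 25: {estimate:.2f}")
```

Calling `predict_total_cholesterol(40)` raises a `ValueError`, reflecting the caution against using the equation outside the range of the data.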

There are statistical tests that can be performed to assess whether the estimated regression coefficients (b0 and b1) are statistically significantly different from zero. The test of most interest is usually H0: β1=0 versus H1: β1≠0, where β1 is the population slope. If the population slope is significantly different from zero, we conclude that there is a statistically significant association between the independent and dependent variables.

BMI and HDL Cholesterol

The least squares estimates of the regression coefficients, b0 and b1, describing the relationship between BMI and HDL cholesterol are as follows: b0 = 111.77 and b1 = -2.35. These are computed from the sample correlation coefficient, means, and standard deviations using the same formulas for b1 and b0.

Again, the Y-intercept is uninformative because a BMI of zero is meaningless. The estimate of the slope (b1 = -2.35) represents the change in HDL cholesterol relative to a one unit change in BMI. If we compare two participants whose BMIs differ by 1 unit, we would expect their HDL cholesterols to differ by approximately 2.35 units (with the person with the higher BMI having the lower HDL cholesterol). The figure below shows the regression line superimposed on the scatter diagram for BMI and HDL cholesterol.

[Figure: estimated regression line superimposed on the scatter diagram of BMI and HDL cholesterol]

Linear regression analysis rests on the assumption that the dependent variable is continuous and that the distribution of the dependent variable (Y) at each value of the independent variable (X) is approximately normal. Note, however, that the independent variable can be continuous (e.g., BMI) or can be dichotomous (see below).

Comparing Average HDL Levels With Regression Analysis

Consider a clinical trial to evaluate the efficacy of a new drug to increase HDL cholesterol. We could compare the mean HDL levels between treatment groups statistically using a two independent samples t test. Here we consider an alternate approach. Summary data for the trial are shown below:

            Sample Size   Mean HDL   Standard Deviation of HDL
New Drug    50            40.16      4.46
Placebo     50            39.21      3.91

HDL cholesterol is the continuous dependent variable and treatment assignment (new drug versus placebo) is the independent variable. Suppose the data on n=100 participants are entered into a statistical computing package. The outcome (Y) is HDL cholesterol in mg/dL and the independent variable (X) is treatment assignment. For this analysis, X is coded as 1 for participants who received the new drug and as 0 for participants who received the placebo. A simple linear regression equation is estimated as follows:

$$\hat{Y} = 39.21 + 0.95 X$$

where $\hat{Y}$ is the estimated HDL level and X is a dichotomous variable (also called an indicator variable, in this case indicating whether the participant was assigned to the new drug or to placebo). The estimate of the Y-intercept is b0=39.21. The Y-intercept is the value of Y (HDL cholesterol) when X is zero. In this example, X=0 indicates assignment to the placebo group. Therefore, the Y-intercept is exactly equal to the mean HDL level in the placebo group. The slope is estimated as b1=0.95. The slope represents the estimated change in Y (HDL cholesterol) relative to a one unit change in X. A one unit change in X represents a difference in treatment assignment (placebo versus new drug). The slope represents the difference in mean HDL levels between the treatment groups. Therefore, the mean HDL for participants receiving the new drug is:

$$39.21 + 0.95 = 40.16$$
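The relationship between an indicator variable and the group means can be checked numerically. The participant-level trial data are not given in the text, so the sketch below uses a small hypothetical data set (eight invented HDL values) and an assumed helper `fit_line`; it shows only that the intercept equals the mean of the group coded 0 and the slope equals the difference in group means.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical HDL values (mg/dL), not the trial data from the text.
placebo = [38.0, 41.5, 39.0, 37.5]
new_drug = [40.0, 42.5, 39.5, 41.0]

# Indicator coding: 0 = placebo, 1 = new drug.
x = [0] * len(placebo) + [1] * len(new_drug)
y = placebo + new_drug

b0, b1 = fit_line(x, y)
mean_placebo = sum(placebo) / len(placebo)
mean_new_drug = sum(new_drug) / len(new_drug)

print(f"b0 = {b0:.2f}, placebo mean = {mean_placebo:.2f}")
print(f"b1 = {b1:.2f}, difference in means = {mean_new_drug - mean_placebo:.2f}")
```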


A study was conducted to assess the association between a person's intelligence and the size of their brain. Participants completed a standardized IQ test and researchers used Magnetic Resonance Imaging (MRI) to determine brain size. Demographic information, including the patient's gender, was also recorded.


The Controversy Over Environmental Tobacco Smoke Exposure

There is convincing evidence that active smoking is a cause of lung cancer and heart disease. Many studies done in a wide variety of circumstances have consistently demonstrated a strong association and also indicate that the risk of lung cancer and cardiovascular disease (e.g., heart attacks) increases in a dose-related way. These studies have led to the conclusion that active smoking is causally related to lung cancer and cardiovascular disease. Studies in active smokers have had the advantage that the lifetime exposure to tobacco smoke can be quantified with reasonable accuracy, since the unit dose is consistent (one cigarette) and the habitual nature of tobacco smoking makes it possible for most smokers to provide a reasonable estimate of their total lifetime exposure, quantified in terms of cigarettes per day or packs per day. Frequently, average daily exposure (cigarettes or packs) is combined with duration of use in years in order to quantify exposure as "pack-years".

It has been much more difficult to establish whether environmental tobacco smoke (ETS) exposure is causally related to chronic diseases like heart disease and lung cancer, because the total lifetime exposure dosage is lower, and it is much more difficult to accurately estimate total lifetime exposure. In addition, quantifying these risks is also complicated because of confounding factors. For example, ETS exposure is usually classified based on parental or spousal smoking, but these studies are unable to quantify other environmental exposures to tobacco smoke, and the inability to quantify and adjust for other environmental exposures such as air pollution makes it difficult to demonstrate an association even if one existed. As a result, there continues to be controversy over the risk imposed by environmental tobacco smoke (ETS). Some have gone so far as to claim that even very brief exposure to ETS can cause a myocardial infarction (heart attack), but a very large prospective cohort study by Enstrom and Kabat was unable to demonstrate significant associations between exposure to spousal ETS and coronary heart disease, chronic obstructive pulmonary disease, or lung cancer. (It should be noted, however, that the report by Enstrom and Kabat has been widely criticized for methodological problems, and these authors also had financial ties to the tobacco industry.)

Correlation analysis provides a useful tool for thinking about this controversy. Consider data from the British Doctors Cohort. They reported the annual mortality for a variety of diseases at four levels of cigarette smoking per day: never smoked, 1-14/day, 15-24/day, and 25+/day. In order to perform a correlation analysis, I rounded the exposure levels to 0, 10, 20, and 30 respectively.

Cigarettes Smoked Per Day   CVD Mortality Per 100,000 Men Per Year   Lung Cancer Mortality Per 100,000 Men Per Year
0                           572                                      14
10 (actually 1-14)          802                                      105
20 (actually 15-24)         892                                      208
30 (actually >24)           1025                                     355

The figures below show the two estimated regression lines superimposed on the scatter diagrams. The correlation with amount of smoking was strong for both CVD mortality (r = 0.98) and for lung cancer (r = 0.99). Note also that the Y-intercept is a meaningful number here; it represents the predicted annual death rate from these diseases in individuals who never smoked. The Y-intercept for prediction of CVD is slightly higher than the observed rate in never smokers, while the Y-intercept for lung cancer is lower than the observed rate in never smokers.

The linearity of these relationships suggests that there is an incremental risk with each additional cigarette smoked per day, and the additional risk is estimated by the slopes. This perhaps helps us think about the consequences of ETS exposure. For example, the risk of lung cancer in never smokers is quite low, but there is a finite risk; various reports suggest a risk of 10-15 lung cancers/100,000 per year. If an individual who never smoked actively was exposed to the equivalent of one cigarette's smoke in the form of ETS, then the regression suggests that their risk would increase by 11.26 lung cancer deaths per 100,000 per year. However, the risk is clearly dose-related. Therefore, if a non-smoker was employed by a tavern with heavy levels of ETS, the risk might be substantially greater.
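The correlations and the lung cancer slope quoted here can be reproduced from the four data points in the table. The sketch below is a minimal check using the rounded exposure levels 0, 10, 20, and 30; the helper name `summarize` is an assumption of this example.

```python
import math

def summarize(x, y):
    """Return (r, intercept, slope) for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, intercept, slope

# Rounded exposure levels and annual mortality per 100,000 men from the table.
cigarettes_per_day = [0, 10, 20, 30]
cvd_mortality = [572, 802, 892, 1025]
lung_cancer_mortality = [14, 105, 208, 355]

r_cvd, intercept_cvd, slope_cvd = summarize(cigarettes_per_day, cvd_mortality)
r_lung, intercept_lung, slope_lung = summarize(cigarettes_per_day,
                                               lung_cancer_mortality)

print(f"CVD: r = {r_cvd:.2f}, intercept = {intercept_cvd:.1f}, "
      f"slope = {slope_cvd:.2f}")
print(f"Lung cancer: r = {r_lung:.2f}, intercept = {intercept_lung:.1f}, "
      f"slope = {slope_lung:.2f}")
```

With only four aggregated points, these correlations describe the trend across exposure groups rather than variation among individual smokers.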

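[Figure: estimated regression line for CVD mortality against cigarettes smoked per day]

[Figure: estimated regression line for lung cancer mortality against cigarettes smoked per day]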

Finally, it should be noted that some findings suggest that the association between smoking and heart disease is non-linear at the very lowest exposure levels, meaning that non-smokers have a disproportionate increase in risk when exposed to ETS due to an increase in platelet aggregation.

Summary

Correlation and linear regression analysis are statistical techniques to quantify associations between an independent variable (X), sometimes called a predictor, and a continuous dependent outcome variable (Y). For correlation analysis, the independent variable (X) can be continuous (e.g., gestational age) or ordinal (e.g., increasing categories of cigarettes per day). Regression analysis can also accommodate dichotomous independent variables.


The procedures described here assume that the association between the independent and dependent variables is linear. With some adjustments, regression analysis can also be used to estimate associations that follow another functional form (e.g., curvilinear, quadratic). Here we consider associations between one independent variable and one continuous dependent variable. The regression analysis is called simple linear regression - simple in this case refers to the fact that there is a single independent variable. In the next module, we consider regression analysis with several independent variables, or predictors, considered simultaneously.