R

Correlations

The last statistical measures you are going to learn here are correlation measures. We use them to express the relationship between two numeric variables. For example, does a person’s IQ predict their salary? Or does the age of a child predict the size of their vocabulary? The correlation measure covered here is the Pearson r correlation coefficient.

It is a parametric measure with three assumptions: both variables are normally distributed, their relationship is linear, and there is no heteroscedasticity (that is, the spread of one variable does not change across the range of the other). You have learnt how to test for normal distribution when learning about the t-test. The remaining two assumptions can be tested in the manner explained below.

The linearity assumption can be verified by plotting the two variables against each other. If the relationship between them is linear, the data points should form approximately a straight line, not a curve.
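
As a minimal sketch of such a visual check, the following lines plot simulated data, once with a roughly linear relationship and once with a curved one. The variable names and values are made up for illustration and are not the tutorial's data set.

```r
# Simulated example data (hypothetical, not the tutorial's data set)
set.seed(1)
x <- seq(1, 10, length.out = 50)
y_linear <- 2 * x + rnorm(50)   # straight-line trend plus noise
y_curved <- x^2 + rnorm(50)     # curved trend plus noise

# Draw both scatter plots side by side
par(mfrow = c(1, 2))
plot(x, y_linear, main = "Roughly linear")
abline(lm(y_linear ~ x))        # fitted line follows the points
plot(x, y_curved, main = "Curved")
abline(lm(y_curved ~ x))        # points bend away from the fitted line
```

In the first panel the points scatter evenly around the fitted line. In the second they lie above it at both ends and below it in the middle, which suggests that the linearity assumption is not met.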

Checking for homoscedasticity is somewhat more complex. Testing for it requires a function that does not come preinstalled with R. Such additional functions are usually bundled in groups called packages. For the present case, you will need the package lmtest, which in turn depends on another package, zoo. Install both of them by typing

install.packages(c("zoo", "lmtest"))

You may be asked to select the server (mirror) from which the packages should be downloaded. Once the installation is done, load the package lmtest by typing

library(lmtest)

This gives you access to all the functions contained in the package. Now you can test the data for heteroscedasticity. In our example, we will test the correlation between a participant's age and the rate at which they omit the copula (i.e. say "He good" instead of "He is good").

To test for heteroscedasticity, we fit a linear model and run the Breusch-Pagan test on it, using the following lines:

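# Fit a linear model of copula omission on age
lmMod <- lm(verb_omission ~ age, data = data)
# Breusch-Pagan test for heteroscedasticity (from lmtest)
bptest(lmMod)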

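The test returns a report containing the test statistic (BP), its degrees of freedom and a p-value. In our example the p-value is above 0.05, so we cannot reject the assumption of homoscedasticity and can assume that there is likely no heteroscedasticity in the data.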
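
To try the check without the tutorial's data set, here is a minimal sketch on simulated data. The data frame sim and all of its values are made up for illustration. The noise is generated with constant spread, so the data are homoscedastic by construction. The sketch assumes that lmtest may not be installed and skips the test in that case.

```r
# Simulated stand-in for the tutorial's data (hypothetical values)
set.seed(42)
sim <- data.frame(age = runif(60, min = 2, max = 10))
# Constant noise spread, i.e. homoscedastic by construction
sim$verb_omission <- 0.1 + 0.08 * sim$age + rnorm(60, sd = 0.05)

# Fit the linear model that the test is based on
model <- lm(verb_omission ~ age, data = sim)

# bptest() comes from lmtest; skip the test if the package is missing
if (requireNamespace("lmtest", quietly = TRUE)) {
  bp <- lmtest::bptest(model)
  p_value <- unname(bp$p.value)  # extract only the p-value
} else {
  p_value <- NA_real_
}
print(p_value)
```

Storing the result in bp rather than only printing it makes it possible to reuse the p-value later, for example in a condition that decides between Pearson r and a non-parametric alternative.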

Knowing this, we can measure the Pearson r correlation of the two variables. It expresses the strength of the linear relationship between them and takes values between -1 and +1. A value of 0 indicates that there is no linear relationship: knowing the value of one variable does not help you predict the other along a straight line. The value -1 indicates a perfect negative correlation, i.e. the higher the value of the predictor, the lower the value of the dependent variable. Likewise, +1 stands for a perfect positive correlation. Any value between 0 and -1 or between 0 and +1 shows some correlation between the two variables, but not a perfect one.
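
The boundary values can be reproduced with a few made-up numbers. This is only an illustrative sketch, and the vectors are hypothetical. Note that the coefficient measures linear trends: in the last case it is 0 even though y_u is fully determined by x.

```r
# Hypothetical toy values
x <- c(1, 2, 3, 4, 5)

# y rises in a straight line with x: perfect positive correlation
r_pos <- cor(x, 2 * x + 1)

# y falls in a straight line as x rises: perfect negative correlation
r_neg <- cor(x, 10 - 3 * x)

# Symmetric U-shape around the mean of x: no linear trend at all
y_u <- (x - 3)^2
r_zero <- cor(x, y_u)

print(c(r_pos, r_neg, r_zero))
```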

To perform this measure on our data, we call

cor(data$verb_omission, data$age, method = "pearson")

The resulting coefficient is 0.92. This means that the two variables are very strongly positively correlated, so knowing the value of one allows you to predict the value of the other well.

If one of the assumptions is not met, you can use a non-parametric alternative to Pearson r, called Kendall tau. To do this, simply change the method parameter in the cor() function from "pearson" to "kendall".
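
The difference between the two methods can be sketched with made-up values whose relationship is monotonic but not linear. The vectors below are hypothetical and only serve as an illustration.

```r
# Hypothetical values: y always grows with x, but along a curve
x <- 1:10
y <- x^3

# Pearson r stays below 1 because the points do not lie on a straight line
r_pearson <- cor(x, y, method = "pearson")

# Kendall tau only compares the ordering of the values, so it equals 1 here
tau_kendall <- cor(x, y, method = "kendall")

print(c(r_pearson, tau_kendall))
```

Because Kendall tau relies on ranks rather than on the raw values, it does not require the relationship to be linear or the variables to be normally distributed.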
