Hypothesis testing | Biostatistics and epidemiology

The null hypothesis (H0) states that there is no difference between two groups, or that a population parameter (such as the mean or the standard deviation) is equal to a hypothesized or previously observed value.

The alternative hypothesis (H1) states that there is a difference, or that a population parameter is smaller than, greater than, or different from the hypothesized or previously observed value stated in the null hypothesis.

Type I error is incorrectly rejecting the null hypothesis, in other words saying that there is a difference when there actually is none. The probability of making a type I error is denoted by alpha, also called the level of significance. Setting alpha at a lower level decreases the chance of a type I error, and vice versa. A lower alpha also means that the test is conducted under a more rigorous standard.

Type II error is incorrectly failing to reject (loosely, "accepting") the null hypothesis, in other words saying that there is no difference when there actually is one. The probability of making a type II error is denoted by beta. The power of a study is (1 - beta). Power denotes the probability of correctly rejecting the null hypothesis when it is false, thereby detecting a real difference when it does exist. If alpha is set at a lower value while the sample size and the true difference stay the same, the chance of a type I error decreases but the chance of a type II error increases (and vice versa). So when alpha decreases, power also decreases, and vice versa.
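The trade-off between alpha and power can be illustrated numerically. The following is a minimal sketch, assuming a two-sided one-sample z-test with a known population standard deviation; the function name z_test_power and the values for the true difference, standard deviation, and sample size are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect: true difference between the population mean and the null value
    sigma:  population standard deviation (assumed known)
    n:      sample size
    alpha:  level of significance (probability of a type I error)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for this alpha
    shift = effect * n ** 0.5 / sigma           # true difference in standard-error units
    # Probability that the test statistic lands in either rejection region
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

# Hypothetical study: true difference 5, standard deviation 15, 50 subjects
for alpha in (0.10, 0.05, 0.01):
    power = z_test_power(effect=5, sigma=15, n=50, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.3f}  beta={1 - power:.3f}")
```

With everything else held fixed, the printed power falls as alpha is lowered. When the true difference is zero, the same function returns alpha itself, which is the type I error rate.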

Traditionally, alpha is set at 0.05 (5%) and beta at 0.2 (20%), so the power of a study is set at a minimum of 0.8 (80%).

If the p value obtained from the study is less than or equal to alpha (conventionally 0.05), the null hypothesis is rejected and the alternative hypothesis is accepted. The result is said to be statistically significant.
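The decision rule can be sketched in a few lines. This is an illustration only, assuming a two-sided one-sample z-test; the helper names two_sided_p_value and decide, and the sample figures, are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p value for a z statistic under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# Hypothetical data: sample mean 104 against a null value of 100,
# known standard deviation 15, sample size 50
sample_mean, null_mean, sigma, n = 104, 100, 15, 50
z = (sample_mean - null_mean) / (sigma / n ** 0.5)
p = two_sided_p_value(z)
print(f"z={z:.3f}  p={p:.4f}  decision: {decide(p)}")
```

Note that the same data can lead to a different decision under a different alpha, which is why alpha should be fixed before the data are analyzed.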

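If the 95% confidence interval for a difference (such as the difference between two means) includes 0, the result is not statistically significant at the 5% level and the null hypothesis is not rejected. Likewise, if the 95% confidence interval for a ratio measure such as the relative risk (RR) or the odds ratio includes 1, the null hypothesis is not rejected.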
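As an illustration of the confidence interval rule for a ratio, the sketch below computes a relative risk and an approximate 95% confidence interval using the common log-transformation method, then checks whether the interval includes 1. The function names and the counts are hypothetical, and other interval methods exist.

```python
from math import exp, log, sqrt

def relative_risk_ci(a, n1, c, n2, z=1.96):
    """Relative risk with an approximate 95% CI (log method).

    a, n1: number of events and total subjects in the exposed group
    c, n2: number of events and total subjects in the unexposed group
    z:     normal critical value (1.96 for a 95% interval)
    """
    rr = (a / n1) / (c / n2)
    # Standard error of ln(RR)
    se_log_rr = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper

def is_significant(lower, upper, null_value=1.0):
    """Significant only if the interval excludes the null value (1 for ratios, 0 for differences)."""
    return not (lower <= null_value <= upper)

# Hypothetical cohorts: (events exposed, total exposed, events unexposed, total unexposed)
for counts in ((30, 100, 10, 100), (12, 100, 10, 100)):
    rr, lo, hi = relative_risk_ci(*counts)
    print(f"RR={rr:.2f}  95% CI=({lo:.2f}, {hi:.2f})  significant={is_significant(lo, hi)}")
```

For a difference in means the same check applies with null_value=0.0.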

Correlation and regression: Correlation and linear regression are the most commonly used techniques for quantifying the association between two numeric variables. Correlation quantifies the strength of the linear relationship between paired variables, expressing this as a correlation coefficient. In simple linear regression, a single independent variable is used to predict the value of a dependent variable. If both variables are normally distributed, Pearson's correlation coefficient 'r' is calculated. If one or both variables are not normally distributed, a rank correlation coefficient such as Spearman's rho (ρ) may be calculated.

The coefficient of determination, or r squared (r²), denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x. Using the fitted regression line, the value of the dependent variable can be predicted from a known value of the independent variable.
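A simple linear regression and its coefficient of determination can be computed by hand with least squares. The sketch below is a minimal illustration; the function names fit_line and r_squared and the dose-response numbers are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Proportion of the variability in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: dose (independent, x) and response (dependent, y)
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(dose, response)
print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"r squared={r_squared(dose, response, slope, intercept):.4f}")
print(f"predicted response at dose 6: {intercept + slope * 6:.2f}")
```

Predictions for x values far outside the observed range (extrapolation) should be treated with caution, since the linear relation has only been checked within the data.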

Correlation analysis quantifies the strength of the association between two numerical variables. A scatter plot or diagram is used to depict the relationship between two variables. The independent variable is plotted on the X-axis and the dependent variable on the Y-axis. The closer the points on the scatter plot lie to a straight line, the stronger the linear association between the two variables. The correlation coefficient quantifies the strength of the relationship between two variables and takes values between −1 and +1. A plus sign denotes a direct relationship, whereas a minus sign denotes an inverse relationship. A value of r close to +1 indicates a strong direct linear relationship, which means that one variable increases with the other. A value close to −1 indicates a strong inverse linear relationship, which means that one variable decreases as the other increases. A value very close to zero does not necessarily mean that there is no association; it only means that there is no "linear" association.
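The behaviour of the correlation coefficient, including the case where r is zero despite a clear association, can be checked with a short calculation. The following sketch implements Pearson's r and Spearman's rho from their definitions; the function names and the data are hypothetical and chosen for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between paired values x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [-2, -1, 0, 1, 2]
print("direct linear:  ", pearson_r(x, [2 * v + 1 for v in x]))
print("inverse linear: ", pearson_r(x, [-3 * v for v in x]))
print("y = x squared:  ", pearson_r(x, [v ** 2 for v in x]))
print("y = x cubed:    ", pearson_r(x, [v ** 3 for v in x]),
      "(Pearson) vs", spearman_rho(x, [v ** 3 for v in x]), "(Spearman)")
```

In the squared case y is completely determined by x, yet r is zero because the relationship is not linear. In the cubed case the relationship is monotonic but curved, so Spearman's rho reaches 1 while Pearson's r stays below 1.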

Correlation just shows association; it does not prove causation. In multiple regression, two or more independent variables are used to predict the value of a dependent variable.