Quantile, Quantile-Quantile, and GWAS

Quantile (分位數/切位點)
"In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities or dividing the observations in a sample in the same way." - Wikipedia
A quantile is the cutoff value below which a given proportion of the data falls. Dividing a dataset into k equal-sized groups takes k - 1 cut points. So:
2 groups → Median (中位數): 1 cut point that splits the distribution into 2 halves.
4 groups → Quartiles (四分位數): 3 cut points (Q1, Q2 [= median], Q3) that split the distribution into 4 equal parts.
100 groups → Percentiles (百分位數): 99 cut points that split the distribution into 100 equal parts.
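As a small illustration of the cut points listed above, here is a sketch in base R using `quantile()` from the stats package. The sample `x` is a made-up vector, and the values depend on R's default interpolation rule (`type = 7`); other quantile definitions give slightly different numbers.

```r
# Hypothetical sample: the integers 1 to 100
x <- 1:100

# k equal-sized groups need k - 1 interior cut points
median_cut   <- quantile(x, probs = 0.5)                  # 2 groups, 1 cut point
quartile_cut <- quantile(x, probs = c(0.25, 0.50, 0.75))  # 4 groups, 3 cut points
pct_cut      <- quantile(x, probs = (1:99) / 100)         # 100 groups, 99 cut points

print(median_cut)
print(quartile_cut)
print(length(pct_cut))
```

The second quartile and the median are the same cut point, which is why Q2 is labelled "= median" in the list.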
Quantile-Quantile Plot (QQ-plot/分位數對分位數圖)
Purpose:
- Compare the distribution of two datasets.
- Assess whether a dataset follows a specified theoretical distribution (e.g., normal distribution).
How it works:
- Sort the data in ascending order to obtain the order statistics (順序統計量) $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$.
- To compare two datasets, plot the quantiles of one dataset against the matching quantiles of the other.
- To compare empirical data with a theoretical distribution (e.g., the normal distribution), map each order statistic $X_{(i)}$ to a quantile probability $p_i$, then plot $X_{(i)}$ against the theoretical quantile at $p_i$.
$X_{(i)}$ → the $i^{\text{th}}$ ordered sample value.
$p_i$ → its plotting position (approximate quantile level).
- If the two distributions agree, the points fall close to a straight line.
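The steps above can be sketched by hand in base R. This is only an illustration: the sample `x` is simulated, and the plotting position $p_i = (i - 0.5)/n$ is one common convention among several, chosen here as an assumption. Base R's `qqnorm()` builds the same kind of plot in one call.

```r
set.seed(1)                          # reproducible hypothetical sample
x <- rnorm(200, mean = 10, sd = 2)

n     <- length(x)
x_ord <- sort(x)                     # order statistics X_(1) <= ... <= X_(n)
p     <- (seq_len(n) - 0.5) / n      # plotting positions p_i (assumed convention)
theo  <- qnorm(p)                    # standard-normal quantiles at each p_i

# Sample quantiles against theoretical quantiles
plot(theo, x_ord,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Normal QQ plot (by hand)")
abline(lm(x_ord ~ theo), lty = 2)    # least-squares reference line through the points
```

Because the sample really is normal, the points should hug the reference line; its intercept and slope roughly estimate the mean and standard deviation of `x`.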
The Q-Q plot is an essential tool for detecting problems (such as unrecognized population structure, a flawed analytical approach, or genotyping artifacts) in a genome-wide association study (GWAS).
Q-Q plot and GWAS
A Q-Q plot shows the observed quantiles of one distribution against those of another, or the observed quantiles of a distribution against the quantiles of a theoretical (ideal) distribution.
In GWAS, we use a QQ plot to plot the quantiles of the observed p-values (on the y-axis) against the quantiles of the expected p-values (on the x-axis). In the ideal situation where there are NO causal polymorphisms, the points fall along a straight line (y = x).
The reason is that, with no true associations, the observed p-values follow a uniform distribution, and the QQ plot compares this observed distribution with the expected one, which is also uniform (both are -log10 transformed).
Note: if your GWAS analysis is correct but you do not have enough power to detect the positions of causal polymorphisms, you will see the same straight line (!!). The QQ plot is therefore also a way to assess whether your GWAS can detect any hits.
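To make the null case concrete, here is a minimal base-R sketch that simulates p-values with no association at all and builds the QQ coordinates by hand. The number of variants and the use of `ppoints()` for the expected quantiles are assumptions made for this illustration, not part of any particular GWAS pipeline.

```r
set.seed(42)
n_snps <- 10000                       # hypothetical number of tested variants
pvals  <- runif(n_snps)               # under the null, p-values are Uniform(0, 1)

observed <- -log10(sort(pvals))       # smallest p-value first, so largest -log10 value first
expected <- -log10(ppoints(n_snps))   # expected uniform quantiles on the same scale

plot(expected, observed,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)",
     main = "QQ plot of simulated null p-values")
abline(0, 1, col = "red")             # y = x: where points fall with no association signal
```

With real data, true associations would appear as points rising above the red line at the right-hand end of the plot.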
To plot a QQ-plot
One way to do it is by using the qqman package in R.
install.packages('qqman')  # one-time install from CRAN
library(qqman)
# result: a data frame of GWAS results with a PVALUE column
qq(result$PVALUE, main = "QQ Plot of {{Project}}")
Lambda (λ)
When making a QQ-plot, it is important to calculate lambda (also called the genomic inflation factor, often written as λGC).
- λ quantifies how much the observed test statistics deviate from what you’d expect under the null hypothesis (i.e., no association). It helps assess whether your test statistics are inflated due to technical or population structure issues.
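- It detects inflation (λ > 1) or deflation (λ < 1) of the test statistics, and therefore of the p-values; λ ≈ 1 indicates a well-calibrated analysis.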
📈 If your QQ plot shows a systematic upward curve and λ is well above 1 (e.g., 1.2 or higher), it suggests inflation, possibly from:
- Population stratification
- Cryptic relatedness
- Batch effects
- Genotyping errors
📉 If λ is below 1, it might signal deflation, often due to:
- Overcorrection
- Conservative test statistics
- Sparse data
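# Convert p-values to 1-df chi-square statistics (upper tail avoids rounding 1 - p to 1 for tiny p-values)
chisq <- qchisq(result$PVALUE, df = 1, lower.tail = FALSE)
# Lambda = observed median / median expected under the null (about 0.455 for df = 1)
lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
# Annotate the current QQ plot; run this after qq()
legend("topleft", legend = bquote(lambda == .(round(lambda, 3))), bty = "n")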
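To see how λ behaves, the sketch below wraps the same median-ratio calculation in a helper and applies it to simulated statistics. The function name `gc_lambda` and the simulated data are hypothetical, introduced only for this illustration: one set of statistics is well calibrated, and the other is inflated by a constant factor of 1.2.

```r
# Hypothetical helper: genomic inflation factor from a vector of p-values
gc_lambda <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)   # p-values to 1-df chi-square statistics
  median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
}

set.seed(7)
null_stats     <- rchisq(20000, df = 1)            # well-calibrated test statistics
inflated_stats <- 1.2 * null_stats                 # every statistic inflated by 20%

p_null     <- pchisq(null_stats, df = 1, lower.tail = FALSE)
p_inflated <- pchisq(inflated_stats, df = 1, lower.tail = FALSE)

lambda_null     <- gc_lambda(p_null)       # should be close to 1
lambda_inflated <- gc_lambda(p_inflated)   # should be close to 1.2
print(c(null = lambda_null, inflated = lambda_inflated))
```

Because λ is based on the median, a handful of genuinely associated variants in the tail barely move it, while a shift affecting most statistics shows up clearly.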
To read more, see GitHub repository
References
- Statistical Horizons
- Ehret GB, Curr Hypertens Rep. 2010 Feb;12(1):17–25. doi: 10.1007/s11906-009-0086-6