Addressing overfitting bias due to sample overlap in polygenic risk scoring

Many studies of Alzheimer’s polygenic risk scores (PRS) overlook the sample overlap between IGAP and target datasets such as ADNI. To address this, we developed OA-PRS and tested it on simulated data, varying the training, testing, and overlap proportions to assess the bias that arises under different scenarios. We then applied OA-PRS to the IGAP and ADNI datasets and validated the results with visual diagnostics. OA-PRS effectively adjusted for sample overlap in all simulation scenarios, as well as for IGAP and ADNI. The original IGAP PRS showed an inflated AUROC (0.915) on overlapping samples. OA-PRS reduced the AUROC to 0.726, closely aligning with the AUROC of non-overlapping samples (0.712). Visual diagnostics further confirmed the effectiveness of the adjustment. With OA-PRS, we were able to adjust the IGAP summary-based PRS for the overlapping ADNI samples, allowing the dataset to be fully utilized without the risk of overfitting.

In diseases with strong genetic components such as Alzheimer’s, machine learning-based methods like the Polygenic Risk Score (PRS) serve as essential tools for estimating genetic susceptibility. However, when there is sample overlap between the Genome-Wide Association Study (GWAS) data used for training PRS models and the datasets used for evaluating model performance, overfitting and bias can occur, leading to inflated estimates of predictive accuracy. This issue is particularly prevalent in dementia research, where commonly used datasets such as IGAP and ADNI often contain overlapping samples.

To address this problem, we developed a novel method called Overlap-Adjusted Polygenic Risk Score (OA-PRS). Our approach integrates four key stages—GWAS summary statistics adjustment, bias correction due to sample overlap, PRS construction, and validation/diagnostics—to mitigate overfitting bias. In the bias correction stage, we applied meta-analysis techniques to enable the use of publicly available, summary-level GWAS data without requiring individual-level genotypes, enhancing the method’s practicality. Furthermore, OA-PRS incorporates graphical diagnostic tools to assess the degree of overfitting in the adjusted PRS.
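
To make the idea of a meta-analysis-based correction concrete, the sketch below shows one standard piece of summary-statistic algebra: removing a subset's contribution from a fixed-effect, inverse-variance-weighted estimate. It is an illustration under stated assumptions, not the OA-PRS procedure itself, whose exact formulas are not given here. The function names `ivw_combine` and `subtract_overlap` are hypothetical, the numbers are made up, and the sketch assumes the overlapping samples are a strict subset of the full GWAS and that a per-variant effect estimate for that subset is available.

```python
def ivw_combine(beta_a, se_a, beta_b, se_b):
    """Fixed-effect inverse-variance-weighted combination of two estimates."""
    w_a = 1.0 / se_a ** 2
    w_b = 1.0 / se_b ** 2
    beta = (w_a * beta_a + w_b * beta_b) / (w_a + w_b)
    se = (w_a + w_b) ** -0.5
    return beta, se


def subtract_overlap(beta_full, se_full, beta_overlap, se_overlap):
    """Hypothetical helper: back out the estimate for the non-overlapping
    samples by reversing the inverse-variance-weighted combination."""
    w_full = 1.0 / se_full ** 2
    w_overlap = 1.0 / se_overlap ** 2
    w_rest = w_full - w_overlap
    if w_rest <= 0:
        # The subset cannot carry more information than the full sample.
        raise ValueError("overlap weight must be smaller than full-sample weight")
    beta_rest = (w_full * beta_full - w_overlap * beta_overlap) / w_rest
    se_rest = w_rest ** -0.5
    return beta_rest, se_rest


# Round trip on made-up numbers for a single variant:
# combine "rest" and "overlap" estimates, then subtract the overlap again.
beta_full, se_full = ivw_combine(0.10, 0.02, 0.30, 0.05)
beta_rest, se_rest = subtract_overlap(beta_full, se_full, 0.30, 0.05)
```

In this toy round trip the subtraction recovers the estimate for the non-overlapping samples, which is the property an overlap correction of this kind relies on. A real analysis would apply it per variant across the summary statistics and would need to handle variants where the weights are nearly equal.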

We first validated OA-PRS using simulations based on the UK Biobank, a large-scale British cohort. Across various scenarios, OA-PRS consistently reduced the bias arising from sample overlap. We also applied OA-PRS to Korean genome datasets and confirmed its robustness by further validating the method on East Asian data from AGEN-T2D. When applied to the real-world IGAP and ADNI datasets, conventional PRS trained on IGAP showed inflated predictive performance (AUROC = 0.915) in the presence of overlapping samples. In contrast, OA-PRS corrected the AUROC to 0.726, closely matching the AUROC observed under non-overlapping conditions (AUROC = 0.712). Diagnostic plots further demonstrated the effectiveness of OA-PRS in adjusting for overfitting bias.
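
The comparison reported above, AUROC on overlapping versus non-overlapping target samples, can be sketched as follows. This is a minimal illustration, not the study's evaluation code: `auroc` and `auroc_by_overlap` are hypothetical helpers, the scores and labels are toy values rather than IGAP or ADNI data, and AUROC is computed with the usual rank-based definition (the probability that a random case scores higher than a random control, with ties counted as one half).

```python
def auroc(scores, labels):
    """Rank-based AUROC; labels are 1 for cases and 0 for controls."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def auroc_by_overlap(scores, labels, in_training):
    """Compute AUROC separately for samples that were and were not
    part of the GWAS training data."""
    groups = {True: ([], []), False: ([], [])}
    for s, y, o in zip(scores, labels, in_training):
        groups[bool(o)][0].append(s)
        groups[bool(o)][1].append(y)
    return {
        "overlapping": auroc(*groups[True]),
        "non_overlapping": auroc(*groups[False]),
    }


# Toy PRS values: the first four samples overlap with the training GWAS.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.4, 0.6, 0.3]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
in_training = [True, True, True, True, False, False, False, False]

result = auroc_by_overlap(scores, labels, in_training)
```

A large gap between the two values, as in this toy example, is the signature of overfitting due to overlap. After a successful adjustment, the AUROC on overlapping samples would be expected to move toward the non-overlapping value.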

This study demonstrates that OA-PRS can correct for sample overlap between IGAP and ADNI, enabling more reliable use of these datasets while minimizing overfitting. This approach is expected to make a significant contribution to improving the reliability and accuracy of future PRS research.

Seokho Jeong, Manu Shivakumar, Sang-Hyuk Jung, Hong-Hee Won, Kwangsik Nho, Heng Huang, Christos Davatzikos, Andrew J Saykin, Paul M. Thompson, Li Shen, Young Jin Kim, Bong-Jo Kim, Seunggeun Lee, Dokyoon Kim

Alzheimer’s & Dementia, 21(4), e70109

https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/alz.70109