For diseases with a strong genetic component, such as Alzheimer’s disease, machine learning-based methods such as the Polygenic Risk Score (PRS) are essential tools for estimating genetic susceptibility. However, when samples overlap between the Genome-Wide Association Study (GWAS) data used to train a PRS model and the dataset used to evaluate it, overfitting and bias can occur, inflating estimates of predictive accuracy. This problem is particularly common in dementia research, where widely used datasets such as IGAP and ADNI often contain overlapping samples.

To address this problem, we developed a novel method called Overlap-Adjusted Polygenic Risk Score (OA-PRS). Our approach mitigates overfitting bias through four stages: adjustment of GWAS summary statistics, correction of the bias due to sample overlap, PRS construction, and validation/diagnostics. In the bias correction stage, we applied meta-analysis techniques so that the method works with publicly available, summary-level GWAS data and does not require individual-level genotypes, which makes it more practical. OA-PRS also provides graphical diagnostic tools for assessing the degree of overfitting in the adjusted PRS.
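As a rough illustration of how meta-analysis arithmetic can remove an overlapping cohort from summary-level data, the sketch below inverts a fixed-effect, inverse-variance-weighted meta-analysis. It is not the OA-PRS procedure itself, which is not specified here. It assumes that the cohorts are independent, that the published statistics come from a fixed-effect meta-analysis, and that per-SNP summary statistics are available for the overlapping cohort. The function names `ivw_meta` and `subtract_cohort` and all numbers are hypothetical.

```python
import numpy as np


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    betas, ses: arrays of shape (n_cohorts, n_snps).
    Returns the pooled effect and its standard error per SNP.
    """
    betas = np.asarray(betas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    w = 1.0 / ses**2
    w_sum = w.sum(axis=0)
    return (w * betas).sum(axis=0) / w_sum, np.sqrt(1.0 / w_sum)


def subtract_cohort(beta_all, se_all, beta_ov, se_ov):
    """Remove one cohort's contribution from pooled IVW summary statistics.

    Assumes the pooled statistics are a fixed-effect IVW combination of
    independent cohorts, one of which is the overlapping cohort.
    """
    beta_all = np.asarray(beta_all, dtype=float)
    beta_ov = np.asarray(beta_ov, dtype=float)
    w_all = 1.0 / np.asarray(se_all, dtype=float) ** 2
    w_ov = 1.0 / np.asarray(se_ov, dtype=float) ** 2
    w_rest = w_all - w_ov
    if np.any(w_rest <= 0):
        raise ValueError("overlapping cohort carries more weight than the pooled data")
    beta_rest = (w_all * beta_all - w_ov * beta_ov) / w_rest
    return beta_rest, np.sqrt(1.0 / w_rest)


# Toy data (made up): cohort A is independent, cohort B overlaps the test set.
beta_a, se_a = np.array([0.10, -0.05, 0.02]), np.array([0.020, 0.030, 0.025])
beta_b, se_b = np.array([0.30, 0.01, -0.04]), np.array([0.050, 0.060, 0.040])

# Published statistics pool A and B; subtracting B should give back A.
beta_all, se_all = ivw_meta([beta_a, beta_b], [se_a, se_b])
beta_rest, se_rest = subtract_cohort(beta_all, se_all, beta_b, se_b)
print(np.allclose(beta_rest, beta_a), np.allclose(se_rest, se_a))
```

The round trip on toy data only checks the algebra. With real GWAS data, the independence and fixed-effect assumptions would need to be verified before relying on the subtracted statistics.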

We first validated OA-PRS using simulations based on the UK Biobank, a large-scale British cohort. Across various scenarios, OA-PRS consistently reduced the bias arising from sample overlap. We also applied OA-PRS to Korean genome datasets and confirmed its robustness by further validating the method on East Asian data from AGEN-T2D. When applied to the real-world IGAP and ADNI datasets, conventional PRS trained on IGAP showed inflated predictive performance (AUROC = 0.915) in the presence of overlapping samples. In contrast, OA-PRS corrected the AUROC to 0.726, closely matching the AUROC observed under non-overlapping conditions (AUROC = 0.712). Diagnostic plots further demonstrated the effectiveness of OA-PRS in adjusting for overfitting bias.
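The kind of inflation reported above can be reproduced qualitatively with a small simulation. The sketch below is not the OA-PRS method and is unrelated to the IGAP or ADNI data. It simulates genotypes, estimates marginal SNP effects in a training sample, and compares the AUROC of the resulting PRS on the training samples themselves (full overlap) with the AUROC on independent samples. All sample sizes, effect sizes, and names are assumptions chosen for illustration.

```python
import numpy as np


def auroc(scores, labels):
    """Rank-based (Mann-Whitney) AUROC; assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(scores.size)
    ranks[np.argsort(scores)] = np.arange(1, scores.size + 1)
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = np.random.default_rng(0)
n_train, n_test, n_snps, n_causal = 500, 500, 2000, 50
n_total = n_train + n_test

# Simulated genotype dosages (0/1/2), standardized by allele frequency.
maf = rng.uniform(0.05, 0.5, n_snps)
geno = rng.binomial(2, maf, size=(n_total, n_snps)).astype(float)
geno = (geno - 2 * maf) / np.sqrt(2 * maf * (1 - maf))

# Liability-threshold phenotype driven by a small set of causal SNPs.
effects = np.zeros(n_snps)
effects[:n_causal] = rng.normal(0.0, 0.1, n_causal)
liability = geno @ effects + rng.normal(0.0, 1.0, n_total)
case = liability > np.median(liability)

g_train, y_train = geno[:n_train], case[:n_train]
g_test, y_test = geno[n_train:], case[n_train:]

# "GWAS": marginal effect of each standardized SNP on case status.
beta_hat = g_train.T @ (y_train - y_train.mean()) / n_train

# PRS = weighted sum of genotypes, scored on overlapping vs. independent samples.
auc_overlap = auroc(g_train @ beta_hat, y_train)
auc_indep = auroc(g_test @ beta_hat, y_test)
print(f"AUROC with full overlap: {auc_overlap:.3f}")
print(f"AUROC on independent samples: {auc_indep:.3f}")
```

Because each training sample contributes to its own SNP weights, the overlapping evaluation rewards noise in the effect estimates as well as true signal, so its AUROC should be clearly higher than the independent one.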

This study demonstrates that OA-PRS can correct for sample overlap between IGAP and ADNI, allowing these datasets to be used more reliably and with less overfitting. The approach is expected to improve the reliability and accuracy of future PRS research.


Seokho Jeong, Manu Shivakumar, Sang-Hyuk Jung, Hong-Hee Won, Kwangsik Nho, Heng Huang, Christos Davatzikos, Andrew J Saykin, Paul M. Thompson, Li Shen, Young Jin Kim, Bong-Jo Kim, Seunggeun Lee, Dokyoon Kim

Alzheimer’s & Dementia 21(4), e70109

https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/alz.70109