GENOSCORES: a platform for calculating genotypic predictors of binary and quantitative phenotypes

Paul McKeigue, Marco Colombo, Athina Spiliopoulou
Usher Institute for Population Health Sciences
University of Edinburgh

9 June, 2017

What GENOSCORES is useful for

GENOSCORES is a database of filtered summary results of genome-wide association studies in humans together with a package of analysis tools. These studies cover multiple phenotypes, including -omic measurements such as gene expression and metabolite levels. This resource can be used with any individual-level dataset consisting of genome-wide SNP genotypes to calculate genotypic scores for the phenotypes represented in GENOSCORES. These scores have several uses:

  1. Genotypic predictors of complex traits can be constructed from (possibly oligogenic) genotypic scores for related and intermediate traits. The rationale for this is that because genotypic effects on complex traits are highly polygenic, genotypic predictors of these traits cannot be learned directly from modelling SNP-trait associations. It is more feasible to construct genetic predictors of intermediate traits or biomarkers that are relatively oligogenic and under strong genetic control, and to use these to construct predictors of complex outcomes.

  2. Testing for association of a complex trait with oligogenic scores that predict intermediate phenotypes can help to discover and test causal pathways, by a “mendelian randomization” approach.

  3. Associations between genotypic scores for different phenotypes can identify unexpected connections between pathways, and can help to identify unknown biomarkers on metabolomic or proteomic platforms.

Description

Currently the database holds 138225 phenotypes including:

To be added shortly:

From an R session on your own computer (preferably a high-performance multicore Linux server) you can query the database to extract a table of SNP weights for the traits that you specify. Using the R scripts that we supply, you can compute genetic scores for these traits. For traits that have associations with multiple genomic regions, you can specify that you want scores to be computed separately for each associated region. Usually you will want region-specific scores to be computed for clinical traits, so that the scores are oligonenic. For biomarkers such as proteins and gene transcripts, the genotypic scores are usually oligonenic anyway so that it is not necessary to compute rergion-specific scores. The default is to calculate region-specific scores for traits of type “clinical”, but genome-wide scores for all other trait types (“protein”, “transcript”, “metabolite”, “miRNA”, “glycan”).

Using the resource

For the moment, the resource is available only to collaborators, but we plan to make it freely available as soon as we have completed initial development, subject to any restrictions that may be imposed on our sharing GWAS summary results. Contact us to discuss collaboration. You will need credentials for an SSH connection to our server and for the database itself.

Genotypes for your target sample should be in PLINK format, imputed to a haplotype reference panel which should preferably be a recent release based on the current genome build and current release of dbSNP. If your genotype dataset does not have up-to-date rs IDs, the R code we provide will determine the genome build, update old (merged) rs IDs to current rs IDs, and replace other SNP identifiers (for instance chr:position:alleles) with current rs IDs where possible.

If you want to run a script to test for association with phenotype, this also should be in PLINK format: first two columns should be FID, IID, followed by columns for whatever variables you will later define as outcome variables and covariates.

GENOSCORES can be accessed through the R package genoscores, which we will make available to collaborators. This depends on a few packages available from CRAN or Bioconductor: BH, bigalgebra, bigmemory, corrplot, data.table, doMC, foreach, ggplot2, glmnet, grid, gtools, Matrix, plyr, RColorBrewer, Rcpp (>= 0.12.3), RcppEigen, RMySQL (>= 0.11) and speedglm.

Note that the RMySQL package must be at least version 0.11: currently this is still under development, and can be loaded from github by running the following commands

install.packages("devtools")
devtools::install_github("rstats-db/RMySQL")

You should also install PLINK version 1.9 in a search path where it can be found as “plink”. You will also need to install liftOver and intersectBed as well as the chain file to build hg38.

Before starting R, you need to do two things:

  1. set up an SSH tunnel to port 3306 on our server, using any unused local port on your server. The example below is to tunnel traffic from port 3307 on your client to port 3306 on our server

    ssh -fTNL 3307:localhost:3306 username@pm2.phs.ed.ac.uk
  2. set some environment variables with username and password that we will provide:

    export GENOSCORES_USER=<username>
    export GENOSCORES_PASSWORD=<password>

    You are now ready to start an R session.

Querying GENOSCORES from an R session

  1. Load the script gscorefunctions.R and open a connection to the database

    library(genoscores)
    genoscores.connect()
  2. Update the SNP identifiers in your PLINK .bim file to match the current version of dbSNP where possible. You need only run this step once for a given PLINK dataset

    genomedir <- "/path/to/liftoverchainfiles"
    updated.plinkfile <- paste(plinkfile, "_snpids.updated", sep="")
    updatesnpids.plinkfile(plinkfile, genomedir, out=updated.plinkfile)
  3. Filter your updated PLINK dataset to keep only those SNPs that are matched in the weights table. This helps to reduce the computational burden and workload later.

    filtered.plinkfile <- paste(plinkfile, "_forscores", sep="")
    filterplinkfile(out=filtered.plinkfile)
  4. Edit the filtered .bim file to add a suffix to any duplicated SNP identifiers

    fixduplicateSNPids(filtered.plinkfile)
  5. Select the studies or traits for which you want genotypic scores to be calculated [instructions to be added]. Alternatively specify gwasids <- NULL to select all gwasids. You can also specify an integer vector complextrait.studyids to specify the studies for which each gwasid is to be treated as a complex trait: for such traits, scores are calculated separately for each genomic region containing one or more hits at a p-value less than grouping.minp. The default is that all traits of type “clinical” are treated as complex traits.

  6. Set the variable pvalue.filter to specify the threshold p-value at which weights are to be filtered: the default is 1E-5. Set the variable grouping.minp to specify the threshold minimum p-value in a genomic region for a region-specific score to be calculated for that region (if the trait is defined as a complex trait). This should be more stringent than the value of pvalue.filter: the default is grouping.minp <- 1E-7.

  7. Now calculate scores for your genotype data. The function getscores returns a matrix in which rows index individuals and columns index score variables. The score variables have names of the form <trait name>_<gwasid>. Where a trait has region-specific scores, the variable names are of the form <trait name>_chrom_regionnumber_gwasid.

    score.full <- getscores(inputfile=filtered.plinkfile, gwasids=gwasids,
                        pvalue.filter=1E-5, complextrait.studyids=complextrait.studyids)
    genoscores.disconnect()

Below is an example of a script that calculates scores using all records in the gwas table.

library(genoscores)

## open connection to database
genoscores.connect()

pvalue.filter <- 1E-5 # threshold p-value for weights to be included in score
grouping.minp <- 1E-7 # threshold minimum p-value for region-specific scores
ld.adjust = FALSE # set this to TRUE to adjust for LD between SNPs

plinkfile <- "my_folder/my_plink_bfile"
# 1000G European ancestry  panel is used as reference  panel to adjust LD
ld.refplinkfile <- file.path(genomedir, "kg.eur.forld") 

gwasids <- NULL # to get all studies
## Alternative ways to restrict which studies or traits are included in the calculated scores
#gwasids <- c(14025, 17928) ## restrict to type 1 and type 2 diabetes
#studyids <- c(1, 9, 17, 18, 21, 22, 26, 27, 31, 33, 34, 35, 37, 38) 
#traittypes <- c("transcript", "protein", "miRNA")
#gwasids <- c(selectgwasidsbystudy(studyids), selectgwasidsbytraittype(traittypes))

complextrait.gwasids <- NULL # default is region-specific scores for all traits of type clinical
# Alternative ways to specify traits for which you want region-specific (rather than genome-wide) scores 
#complextrait.gwasids <- 1
#complextrait.gwasids <- selectgwasidsbytraittype("clinical")
#complextrait.studyids <- c(1, 9, 17, 18, 21, 22, 26, 31, 33, 34, 38)
#complextrait.gwasids <- selectgwasidsbystudy(complextrait.studyids)

score.full <- getscores(inputfile=filtered.plinkfile, gwasids=gwasids,
                        pvalue.filter=1E-5, grouping.minp=grouping.minp,
                        complextrait.gwasids=complextrait.gwasids)
 
# update SNP identifiers - only have to do this once for your dataset
genomedir <- "/path/to/liftoverchainfiles"
updated.plinkfile <- paste(plinkfile, "_snpids.updated", sep="")
updatesnpids.plinkfile(plinkfile=plinkfile, genomedir, out=updated.plinkfile)

# filter SNPs to retain only SNPs in weights table: only have to do this once for your dataset
filtered.plinkfile <- paste(plinkfile, "_forscores", sep="")
filterplinkfile(plinkfile=updated.plinkfile, out=filtered.plinkfile)

# fix any duplicated SNP ids
fixduplicateSNPids(filtered.plinkfile)

# calculate score matrix
score.full <- getscores(inputfile=filtered.plinkfile,
                        gwasids=gwasids,
                        pvalue.filter=pvalue.filter, 
                        grouping.minp=grouping.minp,
                        complextrait.gwasids=complextrait.gwasids,
                        threshold.distance=1E6,
                        ld.adjust=ld.adjust,
                        ld.refplinkfile=ld.refplinkfile,
                        ld.adjust.method="pseudoinverse",
                        ld.eigvar.thresh=NULL,
                        ld.lambda=NULL, ld.prune.thresh=0.99)
genoscores.disconnect() # always close database connection when finished

save(list(score=score.full), file=paste(plinkfile,"_genotypicscore.gz", sep=""))

The correction for LD is based on premultiplying the vector of weights by the generalized inverse of the correlation matrix, estimated from a reference panel. This is simlar to the approach used for LDpred [1], but can use SNPs that have been filtered by p-value.

To see a list of the studies that are in the database with the numbers of GWAS studies and weights that have been imported for each study, you can can view the table v_study:

studyid studyname num_GWAS num_weights max_pvalue pubmedid
1 Rheumatoid arthritis meta-analysis in Europeans 1 39735 1e-04 24390342
2 Anti-TNF response in RA: Mirkov 2013 1 155 1e-04 23233654
3 eQTL meta-analysis 2013 11972 765265 0.001 24013639
4 Metabolon meta-analysis 2013 453 201791 1e-04 24816252
5 Biocrates meta-analysis 2015 129 12622 1e-05 26068415
6 Coronary disease CARDIoGRAM 1 490 1e-05 21378990
7 Glycemic traits MAGIC 5 1870 1e-05 22885922
9 Inflammatory bowel disease genetics 2 23317 1e-05 18587394
10 Anthropometric traits GIANT consortium 3 86531 1e-04 22885922
11 Global lipids genetics consortium 3 21341 1e-04 24097068
12 Fatty acids CHARGE consortium 19 17125 1e-04 20031568
13 SUMMIT protein biomarkers 25 5444 1e-05 25740695
14 Urinary biomarkers Pomerania-KORA 52 11018 1e-04 26352407
17 Type 1 diabetes Barrett 2009 meta-analysis 1 3079 1e-04 19430480
18 Multiple sclerosis Immunochip 1 6887 1e-04 24076602
19 Vitamin D age 14 Anderson 2014 1 49 3e-07 25208829
20 Vitamin D Ahn 2010 1 6 2e-05 20418485
21 Juvenile idiopathic arthritis Hinks 2013 immunochip 1 1564 1e-04 23603761
22 Celiac disease Trynka 2011 immunochip 1 7572 1e-04 22057235
23 Narcolepsy Faraco 2013 immunochip 1 173 1e-04 23459209
24 Global urate genetics consortium 1 5029 1e-04 23263486
25 Alzheimer disease IGAP 1 3539 1e-04 24162737
26 Psoriasis Immunochip 1 3534 1e-04 23143594
27 Young Finns gene expression 3829 334017 1e-04 23717546
28 SUMMIT Metabolon extra 56 11090 1e-05 0
29 Serum creatinine Icelandic study 1 6374 1e-04 25082825
30 Coronary disease C4D 1 185 1e-04 26343387
31 Asthma meta-analysis 1 63 1e-05 20860503
32 Type 2 diabetes DIAGRAM 1 1561 1e-04 22885922
33 Age-related macular degeneration gene consortium 1 2287 1e-04 23455636
34 Sarcoidosis Detroit-Oklahoma collaboration 1 1921 1e-04 22952805
35 Galectin-3 in PREVEND cohort 1 311 8e-08 23056639
36 Statin response meta-analysis 1 70 9e-05 20339536
37 miRNA expression whole blood Framingham 81 8328 7e-05 25791433
38 EAGLE eczema consortium 1 4988 1e-04 22197932
40 Age at menopause REPROGEN consortium 1 5508 1e-04 26414677
41 Age at menarche REPROGEN consortium 1 10269 1e-04 25231870
42 Major depressive disorder Psychiatric Genomics Consortium 1 198 1e-04 22472876
43 Schizophrenia Psychiatric Genomics Consortium 1 1873 1e-04 21926974
44 Bipolar disorder Psychiatric Genomics Consortium 1 1163 1e-04 21926972
45 Schizophrenia 2nd PGC meta-analysis 1 58552 1e-04 25056061
46 Bone mineral density GEFOS consortium 3 13211 1e-04 19801982
47 IgG glycans meta-analysis 77 25891 1e-04 23382691
50 Genotype-Tissue Expression project (GTEx) 2595 1251862 7e-05 23715323
51 Migraine International Headaches Genetics Consortium meta-analysis 1 6815 1e-05 27322543
52 Immune cell counts Sardinian study 272 145435 1e-05 24074872
53 Kuopio METSIM RNA-seq adipose tissue 2107 488199 1e-06 26854917
54 Male baldness in UK Biobank 1 27496 1e-04 0
55 Blood Cell Consortium cell counts 15 2045 1e-04 27346685
56 HRGene Consortium heart rate 1 2064 1e-04 23583979
57 Body fat percent meta-analysis 1 1286 1e-04 26833246
58 Leptin adjusted for BMI 1 566 1e-04 26833098
59 HaemGen red blood cell traits meta-analysis 6 20761 1e-04 23222517
60 Dupuytren contracture 1 114 1e-04 0
61 Circulating metabolites magnetic NMR meta-analysis 123 409349 1e-04 27005778
62 SUMMIT coronary artery disease novel hits 3 47 9e-04 0
63 C-reactive protein GWAS 1 1687 1e-05 21300955
65 Childrens Oncology Group Neuroblastoma case-control study 1 207 1e-04 21124317
66 Genentech SLE case-control study 1 399 1e-04 18204098
67 Methylation QTL middle age 116059 7063462 1e-07 27036880
69 Blood pressure meta-analysis all ethnicities 2 321 1e-04 27618448
71 Immune cells female twins 152 40435 1e-04 25772697

Table of trait types:

type count
clinical 69
IgG glycan 77
immune cells 424
lipid 182
metabolite 649
methylation 116059
miRNA 91
protein 28
transcript 20646

Under development

Access and privacy considerations

The GENOSCORES database contains only a filtered subset of the full summary GWAS results of each study. Case-control effect size estimates have been filtered at a p-value of 1E-4, even where the full summary results are publicly available for download. Most of the summary results currently on the dataset have been extracted from publicly available download sites. For these studies, the URL of the download site, and the PubMed ID of the paper reporting the study results are given in the table v_study. The custodians of the original study datasets generally require users to cite these papers when reporting the results of other studies that use these summary data. While this isn’t practicable for analyses that use GENOSCORES to construct predictor that may be based on thousands of studies, it is recommended that where the results of such an analysis depends critically on a few of the original studies, these studies should be cited.

Where summary GWAS results from many thousands of SNPs are released, there is a theoretical risk that someone with an individual’s genome-wide genotype profile can learn about the phenotype of that individual from summary results of a genome-wide association study that included the individual. If an individual was in the case sample of a case-control GWAS, there will be a very weak correlation between the individual’s genotypes and the effect size estimates. It is possible to use these weak correlations, summed over thousands of SNPs, to infer that the individual was included in the case sample. This theoretical risk to privacy has led researchers and institutions to restrict the sharing of summary results from genome-wide association studies. Most existing databases of GWAS summary results - such as GWAS Central, GRASP, GWASdb and the Accelerating Medicines Partnership portal for Type 2 diabetes genetics - withhold the sign of the effect estimate (either by withholding the effect size or the effect allele). This makes the summary data useless for constructing genotypic scores.

In a forthcoming paper we outline the statistical theory underlying this information leakage from summary GWAS effect size estimates and show how to calculate a threshold at which to filter SNP effect sizes so as to ensure that an attribute disclosure attack based on using an individual’s genotypes is unlikely to succeed. We show that for large meta-analyses based on at least 5000 cases, filtering the effect size at a threshold p-value of 1E-4 is enough to protect against such an attack.

Technical description

The database is implemented in MySQL.

The database schema is shown above.

The database schema is shown above.

Table v_weights holds summary results (trait, SNP, allele, coefficient, p-value) of genome-wide association studies of binary or quantitative phenotypes. Usually these are filtered at p < 1E-5 or p < 1E-4.

Table dbsnp holds all SNPs in the “common_all” subset of dbSNP (build 146), plus all SNPs on the X chromosome.

Table traits has one record for each phenotype. Each phenotype has a short name usually taken from the original study. Gene transcripts are mapped to genes identified by Entrez identifier, and from genes to Reactome pathways. Metabolites are mapped to PubChem or to ChEBI identifiers (work in progress).

Table gwas: gwasid, studyid, traitid, effectivesamplesize. This table has one record for each trait tested in each study. For a quantitative trait the effective sample size is the total sample size. For a binary trait it is half the harmonic mean of case and control sample sizes. This allows the user to apply a more stringent filter to small studies.

Table v_study: has one record for each study (usually defined by a collection of cohorts or other phenotype-genotype samples) This has fields for PubMed ID, and URL for download of summary results.

Table rsmerge: old rsnumbers that have been merged with current rs numbers.

Additional tables have been imported from Reactome to map genes to pathways, from HGNC for gene symbols and gene identifiers and from ChEBI to map metabolites to an ontology of biochemical entities. This work is still in progress – most other traits have not yet been mapped to an ontology.

References

[1] Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet. 2015;97:576–92.

[2] Lumley T, Rice K. Potential for Revealing Individual-Level Information in Genome-wide Association Studies JAMA 2010;303:659-660.

[3] Clayton D. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatics 2010 11:661–73.