1 | How can I be sure that I can establish a connection to the GWAS Catalog server?

You can check that gwasrapidd is able to connect to https://www.ebi.ac.uk by making a connection attempt with the function is_ebi_reachable().

It returns TRUE if the connection is possible, or FALSE otherwise. If the connection is not possible, use the parameter chatty = TRUE to learn at what point the connection is failing:

library(gwasrapidd)
is_ebi_reachable(chatty = TRUE)

2 | What resources is the GWAS Catalog database currently mapped against?

The GWAS Catalog is mapped against Ensembl, dbSNP and a specific assembly version of the human genome. You can get this info with get_metadata():
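A minimal call looks like the sketch below. It needs network access, and the values returned depend on the current state of the GWAS Catalog server, so no output is shown here. The variable name metadata is only illustrative.

```r
library(gwasrapidd)

# Ask the GWAS Catalog REST API which resource versions
# (Ensembl, dbSNP, genome assembly) it is currently mapped against.
metadata <- get_metadata()
metadata
```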

3 | How to perform batch search with gwasrapidd?

The four main retrieval functions get_studies(), get_associations(), get_variants(), and get_traits() allow you to search by multiple values for the same search criterion. You only need to pass a vector of queries to each search criterion parameter. Here are some simple examples.

Get studies by study identifiers (GCST002420 or GCST000392):

library(gwasrapidd)
get_studies(study_id = c('GCST002420', 'GCST000392'))
#> An object of class "studies"
#> Slot "studies":
#> # A tibble: 2 x 13
#>   study_id reported_trait initial_sample_… replication_sam… gxe   gxg  
#>   <chr>    <chr>          <chr>            <chr>            <lgl> <lgl>
#> 1 GCST002… Binge eating … 206 European an… 70 European anc… FALSE FALSE
#> 2 GCST000… Type 1 diabet… 7,514 European … 4,267 European … FALSE FALSE
#> # … with 7 more variables: snp_count <int>, qualifier <chr>,
#> #   imputed <lgl>, pooled <lgl>, study_design_comment <chr>,
#> #   full_pvalue_set <lgl>, user_requested <lgl>
#> 
#> Slot "genotyping_techs":
#> # A tibble: 2 x 2
#>   study_id   genotyping_technology       
#>   <chr>      <chr>                       
#> 1 GCST002420 Genome-wide genotyping array
#> 2 GCST000392 Genome-wide genotyping array
#> 
#> Slot "platforms":
#> # A tibble: 3 x 2
#>   study_id   manufacturer
#>   <chr>      <chr>       
#> 1 GCST002420 Affymetrix  
#> 2 GCST000392 Illumina    
#> 3 GCST000392 Affymetrix  
#> 
#> Slot "ancestries":
#> # A tibble: 4 x 4
#>   study_id   ancestry_id type        number_of_individuals
#>   <chr>            <int> <chr>                       <int>
#> 1 GCST002420           1 initial                       929
#> 2 GCST002420           2 replication                   828
#> 3 GCST000392           1 initial                     16559
#> 4 GCST000392           2 replication                 13279
#> 
#> Slot "ancestral_groups":
#> # A tibble: 4 x 3
#>   study_id   ancestry_id ancestral_group
#>   <chr>            <int> <chr>          
#> 1 GCST002420           1 European       
#> 2 GCST002420           2 European       
#> 3 GCST000392           1 European       
#> 4 GCST000392           2 European       
#> 
#> Slot "countries_of_origin":
#> # A tibble: 2 x 5
#>   study_id   ancestry_id country_name major_area region
#>   <chr>            <int> <chr>        <chr>      <chr> 
#> 1 GCST002420           1 <NA>         <NA>       <NA>  
#> 2 GCST002420           2 <NA>         <NA>       <NA>  
#> 
#> Slot "countries_of_recruitment":
#> # A tibble: 5 x 5
#>   study_id   ancestry_id country_name major_area       region         
#>   <chr>            <int> <chr>        <chr>            <chr>          
#> 1 GCST002420           1 U.S.         Northern America <NA>           
#> 2 GCST002420           2 U.S.         Northern America <NA>           
#> 3 GCST000392           1 U.K.         Europe           Northern Europe
#> 4 GCST000392           2 U.K.         Europe           Northern Europe
#> 5 GCST000392           2 Denmark      Europe           Northern Europe
#> 
#> Slot "publications":
#> # A tibble: 2 x 7
#>   study_id pubmed_id publication_date publication title author_fullname
#>   <chr>        <int> <date>           <chr>       <chr> <chr>          
#> 1 GCST002…  24882193 2014-04-18       J Affect D… Bipo… Winham SJ      
#> 2 GCST000…  19430480 2009-05-09       Nat Genet   Geno… Barrett JC     
#> # … with 1 more variable: author_orcid <chr>

Get associations by variant identifiers (rs3798440 or rs7329174):

get_associations(variant_id = c('rs3798440', 'rs7329174'))
#> An object of class "associations"
#> Slot "associations":
#> # A tibble: 5 x 17
#>   association_id   pvalue pvalue_descript… pvalue_mantissa pvalue_exponent
#>   <chr>             <dbl> <chr>                      <int>           <int>
#> 1 24299710       3.00e-10 <NA>                           3             -10
#> 2 16617          1.00e- 8 <NA>                           1              -8
#> 3 26394          6.00e- 6 <NA>                           6              -6
#> 4 26451          8.00e- 9 <NA>                           8              -9
#> 5 17433639       3.00e- 6 (Chinese)                      3              -6
#> # … with 12 more variables: multiple_snp_haplotype <lgl>,
#> #   snp_interaction <lgl>, snp_type <chr>, standard_error <dbl>,
#> #   range <chr>, or_per_copy_number <dbl>, beta_number <dbl>,
#> #   beta_unit <chr>, beta_direction <chr>, beta_description <chr>,
#> #   last_mapping_date <dttm>, last_update_date <dttm>
#> 
#> Slot "loci":
#> # A tibble: 6 x 4
#>   association_id locus_id haplotype_snp_count description          
#>   <chr>             <int>               <int> <chr>                
#> 1 24299710              1                  NA SNP x SNP interaction
#> 2 24299710              2                  NA SNP x SNP interaction
#> 3 16617                 1                  NA Single variant       
#> 4 26394                 1                  NA Single variant       
#> 5 26451                 1                  NA Single variant       
#> 6 17433639              1                  NA Single variant       
#> 
#> Slot "risk_alleles":
#> # A tibble: 6 x 7
#>   association_id locus_id variant_id risk_allele risk_frequency genome_wide
#>   <chr>             <int> <chr>      <chr>                <dbl> <lgl>      
#> 1 24299710              1 rs3798440  A                   NA     TRUE       
#> 2 24299710              2 rs9350602  C                   NA     TRUE       
#> 3 16617                 1 rs7329174  G                   NA     NA         
#> 4 26394                 1 rs7329174  G                   NA     NA         
#> 5 26451                 1 rs7329174  G                   NA     NA         
#> 6 17433639              1 rs7329174  <NA>                 0.211 FALSE      
#> # … with 1 more variable: limited_list <lgl>
#> 
#> Slot "genes":
#> # A tibble: 9 x 3
#>   association_id locus_id gene_name   
#>   <chr>             <int> <chr>       
#> 1 24299710              1 MYO6        
#> 2 24299710              2 MYO6        
#> 3 16617                 1 ELF1        
#> 4 26394                 1 ELF1        
#> 5 26451                 1 WBP4        
#> 6 26451                 1 ELF1        
#> 7 26451                 1 microRNA2276
#> 8 26451                 1 SLC25A15    
#> 9 17433639              1 ELF1        
#> 
#> Slot "ensembl_ids":
#> # A tibble: 9 x 4
#>   association_id locus_id gene_name    ensembl_id     
#>   <chr>             <int> <chr>        <chr>          
#> 1 24299710              1 MYO6         ENSG00000196586
#> 2 24299710              2 MYO6         ENSG00000196586
#> 3 16617                 1 ELF1         ENSG00000120690
#> 4 26394                 1 ELF1         ENSG00000120690
#> 5 26451                 1 WBP4         ENSG00000120688
#> 6 26451                 1 ELF1         ENSG00000120690
#> 7 26451                 1 microRNA2276 <NA>           
#> 8 26451                 1 SLC25A15     ENSG00000102743
#> 9 17433639              1 ELF1         ENSG00000120690
#> 
#> Slot "entrez_ids":
#> # A tibble: 9 x 4
#>   association_id locus_id gene_name    entrez_id
#>   <chr>             <int> <chr>        <chr>    
#> 1 24299710              1 MYO6         4646     
#> 2 24299710              2 MYO6         4646     
#> 3 16617                 1 ELF1         1997     
#> 4 26394                 1 ELF1         1997     
#> 5 26451                 1 WBP4         11193    
#> 6 26451                 1 ELF1         1997     
#> 7 26451                 1 microRNA2276 <NA>     
#> 8 26451                 1 SLC25A15     10166    
#> 9 17433639              1 ELF1         1997

Get associations by traits (braces or binge eating or gambling):

get_associations(efo_trait = c('braces', 'binge eating', 'gambling'))
#> An object of class "associations"
#> Slot "associations":
#> # A tibble: 19 x 17
#>    association_id  pvalue pvalue_descript… pvalue_mantissa pvalue_exponent
#>    <chr>            <dbl> <chr>                      <int>           <int>
#>  1 15608          4.00e-7 (braces)                       4              -7
#>  2 44592          9.00e-7 <NA>                           9              -7
#>  3 44589          1.00e-6 <NA>                           1              -6
#>  4 44590          4.00e-6 <NA>                           4              -6
#>  5 27460823       1.00e-6 <NA>                           1              -6
#>  6 27460811       1.00e-7 <NA>                           1              -7
#>  7 27460817       7.00e-7 <NA>                           7              -7
#>  8 27460805       3.00e-8 <NA>                           3              -8
#>  9 27460830       1.00e-6 <NA>                           1              -6
#> 10 27460844       1.00e-8 <NA>                           1              -8
#> 11 27460858       3.00e-7 <NA>                           3              -7
#> 12 27460864       3.00e-7 <NA>                           3              -7
#> 13 27460870       1.00e-6 <NA>                           1              -6
#> 14 27460851       9.00e-8 <NA>                           9              -8
#> 15 23033          3.00e-6 <NA>                           3              -6
#> 16 23034          3.00e-6 <NA>                           3              -6
#> 17 23035          4.00e-6 <NA>                           4              -6
#> 18 23036          5.00e-6 <NA>                           5              -6
#> 19 23176          5.00e-6 <NA>                           5              -6
#> # … with 12 more variables: multiple_snp_haplotype <lgl>,
#> #   snp_interaction <lgl>, snp_type <chr>, standard_error <dbl>,
#> #   range <chr>, or_per_copy_number <dbl>, beta_number <dbl>,
#> #   beta_unit <chr>, beta_direction <chr>, beta_description <chr>,
#> #   last_mapping_date <dttm>, last_update_date <dttm>
#> 
#> Slot "loci":
#> # A tibble: 19 x 4
#>    association_id locus_id haplotype_snp_count description   
#>    <chr>             <int>               <int> <chr>         
#>  1 15608                 1                  NA Single variant
#>  2 44592                 1                  NA Single variant
#>  3 44589                 1                  NA Single variant
#>  4 44590                 1                  NA Single variant
#>  5 27460823              1                  NA Single variant
#>  6 27460811              1                  NA Single variant
#>  7 27460817              1                  NA Single variant
#>  8 27460805              1                  NA Single variant
#>  9 27460830              1                  NA Single variant
#> 10 27460844              1                  NA Single variant
#> 11 27460858              1                  NA Single variant
#> 12 27460864              1                  NA Single variant
#> 13 27460870              1                  NA Single variant
#> 14 27460851              1                  NA Single variant
#> 15 23033                 1                  NA Single variant
#> 16 23034                 1                  NA Single variant
#> 17 23035                 1                  NA Single variant
#> 18 23036                 1                  NA Single variant
#> 19 23176                 1                  NA Single variant
#> 
#> Slot "risk_alleles":
#> # A tibble: 19 x 7
#>    association_id locus_id variant_id risk_allele risk_frequency
#>    <chr>             <int> <chr>      <chr>                <dbl>
#>  1 15608                 1 rs1535480  <NA>                 NA   
#>  2 44592                 1 rs6006893  <NA>                 NA   
#>  3 44589                 1 rs10198175 <NA>                 NA   
#>  4 44590                 1 rs13233490 <NA>                 NA   
#>  5 27460823              1 rs1821075… C                     0.04
#>  6 27460811              1 rs7904579  G                     0.37
#>  7 27460817              1 rs1950038  T                     0.3 
#>  8 27460805              1 rs726170   T                     0.12
#>  9 27460830              1 rs76087671 T                     0.05
#> 10 27460844              1 rs1119404… T                     0.04
#> 11 27460858              1 rs7337127  T                     0.15
#> 12 27460864              1 rs1457636… A                     0.1 
#> 13 27460870              1 rs73057489 C                     0.07
#> 14 27460851              1 rs17810023 T                     0.02
#> 15 23033                 1 rs8064100  A                    NA   
#> 16 23034                 1 rs12237653 T                    NA   
#> 17 23035                 1 rs11060736 T                    NA   
#> 18 23036                 1 rs9383153  A                    NA   
#> 19 23176                 1 rs10812227 C                    NA   
#> # … with 2 more variables: genome_wide <lgl>, limited_list <lgl>
#> 
#> Slot "genes":
#> # A tibble: 31 x 3
#>    association_id locus_id gene_name   
#>    <chr>             <int> <chr>       
#>  1 15608                 1 <NA>        
#>  2 44592                 1 PRR5        
#>  3 44589                 1 APOB        
#>  4 44590                 1 PER4        
#>  5 27460823              1 LOC101929321
#>  6 27460811              1 CUBN        
#>  7 27460817              1 Intergenic  
#>  8 27460805              1 PRR5        
#>  9 27460805              1 ARHGAP8     
#> 10 27460830              1 intergenic  
#> # … with 21 more rows
#> 
#> Slot "ensembl_ids":
#> # A tibble: 31 x 4
#>    association_id locus_id gene_name    ensembl_id     
#>    <chr>             <int> <chr>        <chr>          
#>  1 15608                 1 <NA>         <NA>           
#>  2 44592                 1 PRR5         ENSG00000186654
#>  3 44589                 1 APOB         ENSG00000084674
#>  4 44590                 1 PER4         <NA>           
#>  5 27460823              1 LOC101929321 <NA>           
#>  6 27460811              1 CUBN         ENSG00000107611
#>  7 27460817              1 Intergenic   <NA>           
#>  8 27460805              1 PRR5         ENSG00000186654
#>  9 27460805              1 ARHGAP8      ENSG00000241484
#> 10 27460830              1 intergenic   <NA>           
#> # … with 21 more rows
#> 
#> Slot "entrez_ids":
#> # A tibble: 31 x 4
#>    association_id locus_id gene_name    entrez_id
#>    <chr>             <int> <chr>        <chr>    
#>  1 15608                 1 <NA>         <NA>     
#>  2 44592                 1 PRR5         55615    
#>  3 44589                 1 APOB         338      
#>  4 44590                 1 PER4         168741   
#>  5 27460823              1 LOC101929321 101929321
#>  6 27460811              1 CUBN         8029     
#>  7 27460817              1 Intergenic   <NA>     
#>  8 27460805              1 PRR5         55615    
#>  9 27460805              1 ARHGAP8      23779    
#> 10 27460830              1 intergenic   <NA>     
#> # … with 21 more rows

Get traits by PubMed identifiers (24882193 or 22780124):
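The query for this example follows the same pattern as the ones above. The sketch below assumes that pubmed_id accepts a character vector of PubMed identifiers. The variable name my_traits is illustrative, and the output is omitted because it depends on the current content of the Catalog.

```r
library(gwasrapidd)

# Batch search: traits linked to either of the two publications
my_traits <- get_traits(pubmed_id = c('24882193', '22780124'))
my_traits
```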

The only search parameters that are not vectorised are user_requested and full_pvalue_set from get_studies(). These parameters are not vectorised because they take boolean values (TRUE or FALSE), so it only makes sense to query with one of the two values at a time.

4 | What is the difference between a trait and a reported trait?

There are two levels of trait description in the GWAS Catalog: (EFO) trait and reported trait.

Studies are assigned one or more terms from the Experimental Factor Ontology (EFO), i.e., an EFO trait, or simply trait, that best represents the phenotype under investigation.

In addition, each study is also assigned a free-text reported trait. This is written by the GWAS Catalog curators and reflects the authors' language. Where necessary, it includes a more specific and detailed description of the experimental design, e.g., interaction studies or studies with a background trait.

As an example, take the study with accession identifier GCST000206 by EM Behrens et al. (2008). We can get the EFO trait with get_traits() and the reported trait with get_studies():
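A sketch of both lookups is given here. The object names my_trait and my_study are illustrative. The column reported_trait appears in the studies output shown earlier in this FAQ, whereas the column name trait in the traits slot is an assumption about the layout of the traits object.

```r
library(gwasrapidd)

# Same study accession, two different retrieval functions
my_trait <- get_traits(study_id = 'GCST000206')
my_study <- get_studies(study_id = 'GCST000206')

# EFO trait (assumed column name: trait)
my_trait@traits$trait

# Free-text reported trait written by the curators
my_study@studies$reported_trait
```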

The (EFO) trait for the Behrens study is chronic childhood arthritis, whereas the reported trait is Arthritis (juvenile idiopathic).

5 | Genomic coordinates of genomic contexts seem to be wrong?

The REST API response for variants contains an element named genomic contexts. This element is mapped onto the table genomic_contexts of a variants S4 object.

Now, there is indeed a server-side bug with the column chromosome_position of the genomic_contexts table: the chromosome position returned is that of the variant and not of the gene (genomic context) as it should be.

The GWAS Catalog team is aware of this bug, and they plan to fix it, eventually. For the time being, just do not rely on chromosome_position of the genomic_contexts table.

6 | How to search for variants within a certain genomic region?

Single genomic range

For this you may use the function get_variants() with parameter genomic_range.

For example, to search for variants in chromosome Y in the interval 14692000–14695000, start by defining a list of three elements (chromosome, start and end) that specifies your genomic range. You can then pass this list, here named my_genomic_range, to get_variants() to retrieve the variants:
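The two steps can be sketched as follows. The result name my_variants is illustrative, and the query needs network access.

```r
library(gwasrapidd)

# A genomic range is a named list with three elements
my_genomic_range <- list(chromosome = 'Y', start = 14692000, end = 14695000)

# Retrieve the variants located within that range
my_variants <- get_variants(genomic_range = my_genomic_range)
my_variants@variants
```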

Multiple genomic ranges

To search in multiple regions, construct your genomic range list with those locations just like in the previous example. For example, let’s now search for variants in chromosomes X and Y, both in the range 13000000–15000000:
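A sketch under the assumption that each element of the list is a vector with one entry per region, matched by position. The variable names are illustrative.

```r
library(gwasrapidd)

# One entry per region: (X, 13000000-15000000) and (Y, 13000000-15000000)
my_genomic_range <- list(
  chromosome = c('X', 'Y'),
  start = c(13000000, 13000000),
  end = c(15000000, 15000000)
)

my_variants <- get_variants(genomic_range = my_genomic_range)
my_variants@variants
```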

Searching variants by cytogenetic regions

To search for variants within a cytogenetic band you can use the parameter cytogenetic_band of get_variants(). Here is an example, again for chromosome Y, using the cytogenetic band 'Yq11.221' as query:
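The query can be sketched like this (my_variants is an illustrative name, and network access is required):

```r
library(gwasrapidd)

# Search by the name of a single cytogenetic band
my_variants <- get_variants(cytogenetic_band = 'Yq11.221')
my_variants@variants
```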

How do you know which cytogenetic bands are available for querying? We provide a dataset (a data frame) named cytogenetic_bands that you can use:
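A quick way to inspect it, assuming the package is installed (the dataset ships with the package, so no network access is needed):

```r
library(gwasrapidd)

# First rows and column names of the bundled dataset
head(cytogenetic_bands)
colnames(cytogenetic_bands)
```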

Let’s say you want to search for all variants in the short arm (p) of chromosome 21. You can use the cytogenetic_bands dataset to get all the corresponding cytogenetic band names and then search by cytogenetic_band:
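One way to do this in base R is sketched below. It assumes that the dataset has a column named cytogenetic_band and that band names are prefixed with the chromosome name (as in 'Yq11.221'), so that short-arm bands of chromosome 21 start with '21p'. The names chr21_p_bands and my_variants are illustrative.

```r
library(gwasrapidd)

# Band names starting with '21p' are taken to be on the short arm of chromosome 21
band_names <- as.character(cytogenetic_bands$cytogenetic_band)
chr21_p_bands <- band_names[grepl('^21p', band_names)]
chr21_p_bands

# cytogenetic_band accepts a vector, so all bands are queried in one call
my_variants <- get_variants(cytogenetic_band = chr21_p_bands)
my_variants@variants
```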

7 | Genomic range for an entire chromosome?

You can get the total length of a chromosome by using the provided dataset: cytogenetic_bands. Here is an example for chromosome 15:
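A base R sketch, assuming the dataset has columns named chromosome, start and end that give the coordinates of each band. The chromosome then spans from the smallest band start to the largest band end. The names chr15 and chr15_range are illustrative.

```r
library(gwasrapidd)

# Keep only the bands that belong to chromosome 15
chr15 <- cytogenetic_bands[cytogenetic_bands$chromosome == '15', ]

# Build a genomic range that covers every band of the chromosome
chr15_range <- list(
  chromosome = '15',
  start = min(chr15$start),
  end = max(chr15$end)
)
chr15_range
```

The resulting list has the same shape as the genomic ranges used in the previous question, so it can be passed to get_variants() through the genomic_range parameter. Expect such a query to take a while, since it covers a whole chromosome.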

8 | How to keep track of which queries generated which results?

Currently, gwasrapidd has no built-in solution for this. For example, if you search for variants by several EFO identifiers (efo_id) in a single call, the results come back combined in one variants object, so it is not immediately obvious which variants resulted from the query 'EFO_0005543' and which from 'EFO_0004762'.

A possible workaround is to make multiple independent queries and save your results in a list whose names are the respective queries:
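A sketch of this workaround using base R only. The names my_efo_ids and my_variants are illustrative, and variant_id is assumed to be the identifier column of the variants slot.

```r
library(gwasrapidd)

my_efo_ids <- c('EFO_0005543', 'EFO_0004762')

# One independent query per EFO identifier
my_variants <- lapply(my_efo_ids, function(id) get_variants(efo_id = id))

# Name each result after the query that produced it
names(my_variants) <- my_efo_ids

# Variants returned for each query
my_variants[['EFO_0005543']]@variants$variant_id
my_variants[['EFO_0004762']]@variants$variant_id
```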

Now you can see which variants are associated with each EFO identifier by indexing the list with the query name, i.e., 'EFO_0005543' or 'EFO_0004762'.

9 | How to combine results from multiple queries?

The four main retrieval functions get_studies(), get_associations(), get_variants() and get_traits() all allow you to search multiple criteria at once. You can then combine results in an OR or AND fashion using the parameter set_operation.

Use set_operation = 'union' to combine results in an OR fashion:
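A sketch of such a query, combining a trait criterion and a gene criterion (my_variants is an illustrative name, and network access is required):

```r
library(gwasrapidd)

# OR: variants matching the trait, the gene, or both
my_variants <- get_variants(
  efo_trait = 'triple-negative breast cancer',
  gene_name = 'MDM4',
  set_operation = 'union'
)
my_variants@variants
```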

The code above retrieves variants whose associated efo_trait is equal to 'triple-negative breast cancer' or variants that are associated with gene 'MDM4'.

Alternatively, we may use set_operation = 'intersection' to combine results in an AND fashion:
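The same two criteria, this time intersected (again a sketch with an illustrative variable name):

```r
library(gwasrapidd)

# AND: only variants matching both the trait and the gene
my_variants <- get_variants(
  efo_trait = 'triple-negative breast cancer',
  gene_name = 'MDM4',
  set_operation = 'intersection'
)
my_variants@variants
```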

With set_operation = 'intersection', as in the code above, we get variants whose associated efo_trait is equal to 'triple-negative breast cancer' and that are associated with gene 'MDM4', i.e., only variants meeting both conditions simultaneously are retrieved.

Please note that almost all search criteria to be used with the retrieval functions are vectorised, meaning that you can use multiple values with the same search criterion. In these cases results are always combined in an OR fashion.

In the following example, we use the gene name as the only search criterion. If we pass a vector of gene names, we get all variants that are associated with EITHER of the genes (OR).
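A sketch using the two gene names discussed in the rest of this answer (my_variants is an illustrative name):

```r
library(gwasrapidd)

# Multiple values for one criterion are always combined in an OR fashion
my_variants <- get_variants(gene_name = c('RNU6-367P', 'TOPAZ1'))
my_variants@variants$variant_id
```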

In this case we retrieved 14 variants. Please note that the set_operation parameter does not affect this result. The set_operation only controls the function behaviour when combining results from different criteria, e.g., when using efo_trait and gene_name.

To retrieve variants that are concomitantly associated with genes RNU6-367P and TOPAZ1, you need to run these queries separately and then intersect the results with the intersect() function, i.e., combine them in an AND fashion. Querying by gene RNU6-367P alone returns 9 variants, and querying by gene TOPAZ1 alone returns 6 variants. To find the variants simultaneously associated with both genes, you can intersect the two variants objects using intersect():
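The whole procedure can be sketched as follows. The object names are illustrative, and the counts you get may differ from those quoted here if the Catalog has been updated since. intersect() is called with the gwasrapidd:: prefix to make clear that the package's method for GWAS Catalog objects is meant, not the base R function for vectors.

```r
library(gwasrapidd)

# Two separate queries, one per gene
variants_rnu6 <- get_variants(gene_name = 'RNU6-367P')
variants_topaz1 <- get_variants(gene_name = 'TOPAZ1')

# AND: keep only variants present in both results
variants_both <- gwasrapidd::intersect(variants_rnu6, variants_topaz1)
variants_both@variants$variant_id
```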

Apparently, only one variant is associated with both genes RNU6-367P and TOPAZ1.