Sample sets
A sample set is a group of samples used in a polygenic score evaluation. Each sample set is identified in the PGS Catalog by a unique sample set identifier (pss_id). See vignette('cohorts-samples-sample-sets') for more details on the relationship between cohorts, samples, and sample sets.
Getting sample sets
To get information on sample sets, you can search either by the associated polygenic score identifiers or by the sample set identifiers themselves (if you know them beforehand).
By the PGS identifier
# Sample sets used in the evaluation of the PGS000013
(pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 55 × 1
#> pss_id
#> <chr>
#> 1 PSS000015
#> 2 PSS000019
#> 3 PSS000020
#> 4 PSS000021
#> 5 PSS000022
#> 6 PSS000219
#> 7 PSS000227
#> 8 PSS000228
#> 9 PSS000229
#> 10 PSS000230
#> # ℹ 45 more rows
#>
#> Slot "samples":
#> # A tibble: 62 × 15
#> pss_id sample_id stage sample_size sample_cases sample_controls
#> <chr> <int> <chr> <int> <int> <int>
#> 1 PSS000015 1 eval 288978 8676 280302
#> 2 PSS000019 1 eval 5762 173 5589
#> 3 PSS000020 1 eval 862 446 416
#> 4 PSS000020 2 eval 2333 937 1396
#> 5 PSS000021 1 eval 1964 974 976
#> 6 PSS000022 1 eval 3309 2492 817
#> 7 PSS000219 1 eval 11010 126 10884
#> 8 PSS000227 1 eval 544 40 504
#> 9 PSS000228 1 eval 1298 336 962
#> 10 PSS000229 1 eval 919 168 751
#> # ℹ 52 more rows
#> # ℹ 9 more variables: sample_percent_male <dbl>, phenotype_description <chr>,
#> # ancestry_category <chr>, ancestry <chr>, country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>
#>
#> Slot "demographics":
#> # A tibble: 20 × 11
#> pss_id sample_id variable estimate_type estimate unit variability_type
#> <chr> <int> <chr> <chr> <dbl> <chr> <chr>
#> 1 PSS000365 1 age mean 34 years NA
#> 2 PSS000365 2 age mean 33 years NA
#> 3 PSS000366 1 age mean 54 years NA
#> 4 PSS000366 2 age mean 55 years NA
#> 5 PSS000367 1 age mean 60.6 years NA
#> 6 PSS000367 2 age mean 52.8 years NA
#> 7 PSS000331 1 follow_up_… median 9.2 years NA
#> 8 PSS000332 1 follow_up_… median 9.2 years NA
#> 9 PSS000333 1 follow_up_… median 11.7 years NA
#> 10 PSS000334 1 follow_up_… median 11.7 years NA
#> 11 PSS000335 1 follow_up_… median 10.4 years NA
#> 12 PSS000336 1 follow_up_… median 10.4 years NA
#> 13 PSS000467 1 follow_up_… median 21.3 years NA
#> 14 PSS000468 1 follow_up_… median 23.2 years NA
#> 15 PSS000469 1 follow_up_… median 8.1 years NA
#> 16 PSS001063 1 follow_up_… median 14 years NA
#> 17 PSS010119 1 follow_up_… NA NA years NA
#> 18 PSS010120 1 follow_up_… NA NA years NA
#> 19 PSS010121 1 follow_up_… NA NA years NA
#> 20 PSS010122 1 follow_up_… NA NA years NA
#> # ℹ 4 more variables: variability <dbl>, interval_type <chr>,
#> # interval_lower <dbl>, interval_upper <dbl>
#>
#> Slot "cohorts":
#> # A tibble: 127 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000015 1 UKB UK Biobank
#> 2 PSS000019 1 CARTaGENE CARTaGENE cohort (CHU Sainte-Justine, Queb…
#> 3 PSS000020 1 MHI Montreal Heart Institute Biobank
#> 4 PSS000020 2 MHI Montreal Heart Institute Biobank
#> 5 PSS000021 1 MHI Montreal Heart Institute Biobank
#> 6 PSS000022 1 MHI Montreal Heart Institute Biobank
#> 7 PSS000219 1 CG Color Genomics
#> 8 PSS000227 1 VIRGO Variation in Recovery: Role of Gender on O…
#> 9 PSS000227 1 MESA Multi-Ethnic Study of Atherosclerosis
#> 10 PSS000228 1 VIRGO Variation in Recovery: Role of Gender on O…
#> # ℹ 117 more rows
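The tables in the slots share the columns pss_id and sample_id, so they can be combined with an ordinary join. The following is a minimal sketch in base R. To keep it runnable offline, it uses two small data frames whose rows are copied from the output above (most columns omitted); the names samples, cohorts and samples_cohorts are illustrative only.

```r
# A few rows copied from the printed slots above (most columns omitted).
samples <- data.frame(
  pss_id      = c("PSS000020", "PSS000020", "PSS000227"),
  sample_id   = c(1L, 2L, 1L),
  sample_size = c(862L, 2333L, 544L)
)
cohorts <- data.frame(
  pss_id        = c("PSS000020", "PSS000020", "PSS000227", "PSS000227"),
  sample_id     = c(1L, 2L, 1L, 1L),
  cohort_symbol = c("MHI", "MHI", "VIRGO", "MESA")
)

# One row per sample-cohort pair; a sample drawn from two cohorts appears twice.
samples_cohorts <- merge(samples, cohorts, by = c("pss_id", "sample_id"))
print(samples_cohorts)

# With the real object, the same join would presumably be:
# merge(pgs_13_sset@samples, pgs_13_sset@cohorts, by = c("pss_id", "sample_id"))
```

Because sample 1 of PSS000227 is associated with two cohorts (VIRGO and MESA), it yields two rows after the join; keep this in mind before summing sample sizes over a joined table.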
By the sample set identifier
# Sample set PSS000020
(pss_20 <- get_sample_sets(pss_id = 'PSS000020'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 1 × 1
#> pss_id
#> <chr>
#> 1 PSS000020
#>
#> Slot "samples":
#> # A tibble: 2 × 15
#> pss_id sample_id stage sample_size sample_cases sample_controls
#> <chr> <int> <chr> <int> <int> <int>
#> 1 PSS000020 1 eval 862 446 416
#> 2 PSS000020 2 eval 2333 937 1396
#> # ℹ 9 more variables: sample_percent_male <dbl>, phenotype_description <chr>,
#> # ancestry_category <chr>, ancestry <chr>, country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>
#>
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # ℹ 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> # estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> # variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> # interval_upper <dbl>
#>
#> Slot "cohorts":
#> # A tibble: 2 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000020 1 MHI Montreal Heart Institute Biobank
#> 2 PSS000020 2 MHI Montreal Heart Institute Biobank
By trait or disease
If you wish to search by criteria other than the PGS identifier or the PSS identifier, you will need to do it in several steps. The general approach is to map your criteria to matching PGS identifiers, and then to map those PGS identifiers to sample sets using get_sample_sets().
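For trait-based searches, the approach described above can be sketched as a small helper. This is only an illustration: get_sample_sets_by_trait() is a hypothetical function, not part of the package, and the sketch assumes that get_traits() and get_sample_sets() are provided by the quincunx package and that the PGS Catalog is reachable over the network.

```r
# Hypothetical helper (not part of the package):
# trait term -> PGS identifiers -> sample sets.
# Assumes the quincunx package is installed and the PGS Catalog is reachable.
get_sample_sets_by_trait <- function(trait_term) {
  # Step 1: find the trait(s) matching the search term.
  traits <- quincunx::get_traits(trait_term = trait_term)

  # Step 2: collect the PGS identifiers associated with those traits
  # (stored in the pgs_ids slot of the returned object).
  pgs_ids <- unique(traits@pgs_ids$pgs_id)

  # Nothing to look up if no polygenic score is associated with the term.
  if (length(pgs_ids) == 0L) {
    stop("No PGS identifiers found for trait term: ", trait_term)
  }

  # Step 3: retrieve the sample sets used to evaluate those scores.
  quincunx::get_sample_sets(pgs_id = pgs_ids)
}

# Usage (requires network access):
# pss_vitiligo <- get_sample_sets_by_trait("Vitiligo")
```

The same steps are carried out one at a time in the walkthrough that follows.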
Let’s say that you want to retrieve all sample sets used in the evaluation of polygenic scores for the disease Vitiligo (loss of skin melanocytes that causes areas of skin depigmentation).
We start by searching for this disease in the PGS Catalog with get_traits():
(traits_vitiligo <- get_traits(trait_term = 'Vitiligo'))
#> An object of class "traits"
#> Slot "traits":
#> # A tibble: 1 × 6
#> efo_id parent_efo_id is_child trait description url
#> <chr> <chr> <lgl> <chr> <chr> <chr>
#> 1 EFO_0004208 NA FALSE Vitiligo Generalized well circumscri… http…
#>
#> Slot "pgs_ids":
#> # A tibble: 3 × 4
#> efo_id parent_efo_id is_child pgs_id
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE PGS000738
#> 2 EFO_0004208 NA FALSE PGS000760
#> 3 EFO_0004208 NA FALSE PGS001536
#>
#> Slot "child_pgs_ids":
#> # A tibble: 0 × 4
#> # ℹ 4 variables: efo_id <chr>, parent_efo_id <chr>, is_child <lgl>,
#> # child_pgs_id <chr>
#>
#> Slot "trait_categories":
#> # A tibble: 1 × 4
#> efo_id parent_efo_id is_child trait_categories
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE Immune system disorder
#>
#> Slot "trait_synonyms":
#> # A tibble: 1 × 4
#> efo_id parent_efo_id is_child trait_synonyms
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE vitiligo
#>
#> Slot "trait_mapped_terms":
#> # A tibble: 14 × 4
#> efo_id parent_efo_id is_child trait_mapped_terms
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE DOID:12306
#> 2 EFO_0004208 NA FALSE ICD10:L80
#> 3 EFO_0004208 NA FALSE ICD10CM:L80
#> 4 EFO_0004208 NA FALSE ICD9:709.01
#> 5 EFO_0004208 NA FALSE MESH:D014820
#> 6 EFO_0004208 NA FALSE MONDO:0008661
#> 7 EFO_0004208 NA FALSE MeSH:D014820
#> 8 EFO_0004208 NA FALSE MedDRA:10047642
#> 9 EFO_0004208 NA FALSE NCIT:C26915
#> 10 EFO_0004208 NA FALSE NCIt:C26915
#> 11 EFO_0004208 NA FALSE OMIM:193200
#> 12 EFO_0004208 NA FALSE Orphanet:247871
#> 13 EFO_0004208 NA FALSE SNOMEDCT:56727007
#> 14 EFO_0004208 NA FALSE UMLS:C0042900
The slot pgs_ids contains the polygenic score identifiers associated with Vitiligo.
traits_vitiligo@pgs_ids
#> # A tibble: 3 × 4
#> efo_id parent_efo_id is_child pgs_id
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE PGS000738
#> 2 EFO_0004208 NA FALSE PGS000760
#> 3 EFO_0004208 NA FALSE PGS001536
Now, to search for the sample sets, we can pass those PGS identifiers to get_sample_sets():
(pss_vitiligo <- get_sample_sets(pgs_id = traits_vitiligo@pgs_ids$pgs_id))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 11 × 1
#> pss_id
#> <chr>
#> 1 PSS000907
#> 2 PSS010968
#> 3 PSS010969
#> 4 PSS010974
#> 5 PSS010977
#> 6 PSS000970
#> 7 PSS004173
#> 8 PSS004174
#> 9 PSS004175
#> 10 PSS004176
#> 11 PSS004177
#>
#> Slot "samples":
#> # A tibble: 11 × 15
#> pss_id sample_id stage sample_size sample_cases sample_controls
#> <chr> <int> <chr> <int> <int> <int>
#> 1 PSS000907 1 eval 4008 1827 2181
#> 2 PSS010968 1 eval 4702 3750 952
#> 3 PSS010969 1 eval 4945 243 4702
#> 4 PSS010974 1 eval 4979 34 4945
#> 5 PSS010977 1 eval 4987 NA NA
#> 6 PSS000970 1 eval 1584 NA NA
#> 7 PSS004173 1 eval 6497 17 6480
#> 8 PSS004174 1 eval 1704 6 1698
#> 9 PSS004175 1 eval 24905 45 24860
#> 10 PSS004176 1 eval 7831 71 7760
#> 11 PSS004177 1 eval 67425 131 67294
#> # ℹ 9 more variables: sample_percent_male <dbl>, phenotype_description <chr>,
#> # ancestry_category <chr>, ancestry <chr>, country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>
#>
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # ℹ 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> # estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> # variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> # interval_upper <dbl>
#>
#> Slot "cohorts":
#> # A tibble: 6 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000970 1 GNEHGI2020Q2 Genentech Human Genetics Initiative Cancer …
#> 2 PSS004173 1 UKB UK Biobank
#> 3 PSS004174 1 UKB UK Biobank
#> 4 PSS004175 1 UKB UK Biobank
#> 5 PSS004176 1 UKB UK Biobank
#> 6 PSS004177 1 UKB UK Biobank
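Note that the slots do not necessarily cover the same sample sets: in the output above, the samples table has 11 rows, whereas the cohorts table has only 6. A minimal sketch for listing the sample sets that have no cohort record is shown below. To stay runnable offline, it uses identifiers copied from the printed output; the variable names are illustrative only.

```r
# Identifiers copied from the printed "sample_sets" and "cohorts" slots above.
all_pss <- c("PSS000907", "PSS010968", "PSS010969", "PSS010974", "PSS010977",
             "PSS000970", "PSS004173", "PSS004174", "PSS004175", "PSS004176",
             "PSS004177")
pss_with_cohorts <- c("PSS000970", "PSS004173", "PSS004174", "PSS004175",
                      "PSS004176", "PSS004177")

# Sample sets that have no row in the cohorts table.
pss_without_cohorts <- setdiff(all_pss, pss_with_cohorts)
print(pss_without_cohorts)

# With the real object, the same check would presumably be:
# setdiff(pss_vitiligo@sample_sets$pss_id, pss_vitiligo@cohorts$pss_id)
```

Here the five sample sets PSS000907, PSS010968, PSS010969, PSS010974 and PSS010977 have sample information but no associated cohort rows.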