Validation tools for identifying and repairing errors in pedigrees

Introduction

The BGmisc R package offers a comprehensive suite of functions tailored for extended behavior genetics analysis, including model identification, calculating relatedness, pedigree conversion, and pedigree simulation. This vignette provides an overview of the validation tools available in the package, designed to identify and repair errors in pedigrees.

In an ideal world, you would have perfect pedigrees with no errors. In the real world, however, pedigrees are often incomplete, contain errors, or are missing data. The BGmisc package provides tools to identify these errors, which is particularly useful for large pedigrees where manual inspection is not feasible. While the package can automatically repair some errors, the vast majority require manual inspection. Automatic repair is often not possible because the correct solution may not be obvious or may depend on additional information that is not always available.

Identifying and Repairing Errors in Pedigrees

ID Validation

One common issue in pedigree data is the presence of duplicate IDs. There are two main types of ID duplication: within-row duplication and across-row duplication. Within-row duplication occurs when an individual’s own ID is incorrectly listed as the ID of one of their parents. Across-row duplication occurs when two or more individuals share the same ID.

The checkIDs function in BGmisc helps identify both kinds of duplicates. Here’s how to use it:

library(BGmisc)
# Create a sample dataset
df <- ped2fam(potter, famID = "newFamID", personID = "personID")

# Call the checkIDs function
result <- checkIDs(df, repair = FALSE)
print(result)
#> $all_unique_ids
#> [1] TRUE
#> 
#> $total_non_unique_ids
#> [1] 0
#> 
#> $total_own_father
#> [1] 0
#> 
#> $total_own_mother
#> [1] 0
#> 
#> $total_duplicated_parents
#> [1] 0
#> 
#> $total_within_row_duplicates
#> [1] 0
#> 
#> $within_row_duplicates
#> [1] FALSE

In this example, the checkIDs function returns a list with several elements. The all_unique_ids element indicates whether all IDs in the dataset are unique. The total_non_unique_ids element indicates the total number of non-unique IDs. The total_own_father and total_own_mother elements indicate the total number of individuals whose father’s and mother’s IDs match their own ID, respectively. The total_duplicated_parents element indicates the total number of individuals with duplicated parent IDs. The total_within_row_duplicates element indicates the total number of within-row duplicates. The within_row_duplicates element indicates whether there are any within-row duplicates in the dataset. As the output shows, there are no duplicates in the sample dataset.
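To make these elements more concrete, here is a minimal base-R sketch of how similar counts could be computed by hand. It is an illustration only, not the BGmisc implementation: the toy data frame, the variable names, and the reading of "duplicated parents" as momID equal to dadID are all assumptions made for this example.

```r
# Toy pedigree (hypothetical data, not the potter dataset)
ped <- data.frame(
  ID    = c(1, 2, 3, 3, 5),
  momID = c(NA, NA, 2, 2, 5),
  dadID = c(NA, NA, 1, 1, 1)
)

# Across-row duplication: ID values that appear on more than one row
non_unique_ids <- unique(ped$ID[duplicated(ped$ID)])
total_non_unique_rows <- sum(ped$ID %in% non_unique_ids)

# Within-row duplication: a parent ID equal to the person's own ID
own_mother <- !is.na(ped$momID) & ped$momID == ped$ID
own_father <- !is.na(ped$dadID) & ped$dadID == ped$ID

# Assumed meaning of "duplicated parents": both parents share one ID
same_parents <- !is.na(ped$momID) & !is.na(ped$dadID) &
  ped$momID == ped$dadID

checks <- list(
  all_unique_ids           = length(non_unique_ids) == 0,
  non_unique_ids           = non_unique_ids,
  total_non_unique_rows    = total_non_unique_rows,
  total_own_mother         = sum(own_mother),
  total_own_father         = sum(own_father),
  total_duplicated_parents = sum(same_parents)
)
str(checks)
```

In this toy data, ID 3 appears on two rows and person 5 is listed as their own mother, so both kinds of duplication are flagged.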

Between-Person Duplicates

Let us now consider a scenario where there are between-person duplicates in the dataset. The checkIDs function can identify these duplicates and, if the repair argument is set to TRUE, attempt to repair them. In the example below, we have created two between-person duplicates. First, we have overwritten the personID of one person with their sibling’s ID. Second, we have added a copy of Dudley Dursley to the dataset.

# Create a sample dataset with duplicates
df <- ped2fam(potter, famID = "newFamID", personID = "personID")

# Sibling overwrite
df$personID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Marjorie Dursley"]

# Add a copy of Dudley Dursley
df <- rbind(df, df[df$name == "Dudley Dursley",])

Now, let’s call the summarizeFamilies function to see what the dataset looks like.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ forcats   1.0.0     ✔ readr     2.1.5
#> ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
#> ✔ lubridate 1.9.3     ✔ tibble    3.2.1
#> ✔ purrr     1.0.2     ✔ tidyr     1.3.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

summarizeFamilies(df, famID = "newFamID", personID = "personID")$family_summary %>% glimpse()
#> Rows: 1
#> Columns: 17
#> $ newFamID        <dbl> 1
#> $ count           <int> 37
#> $ gen_mean        <dbl> 1.756757
#> $ gen_median      <dbl> 2
#> $ gen_min         <dbl> 0
#> $ gen_max         <dbl> 3
#> $ gen_sd          <dbl> 1.038305
#> $ spouseID_mean   <dbl> 38.2
#> $ spouseID_median <dbl> 15
#> $ spouseID_min    <dbl> 1
#> $ spouseID_max    <dbl> 106
#> $ spouseID_sd     <dbl> 44.15118
#> $ sex_mean        <dbl> 0.5135135
#> $ sex_median      <dbl> 1
#> $ sex_min         <dbl> 0
#> $ sex_max         <dbl> 1
#> $ sex_sd          <dbl> 0.5067117

If we didn’t know to look for duplicates, we might not notice the issue. Indeed, only one of the duplicates was selected as a founder member. However, the checkIDs function can help us identify and repair these errors:

# Call the checkIDs function
result <- checkIDs(df)

print(result)
#> $all_unique_ids
#> [1] FALSE
#> 
#> $total_non_unique_ids
#> [1] 4
#> 
#> $non_unique_ids
#> [1] 2 6
#> 
#> $total_own_father
#> [1] 0
#> 
#> $total_own_mother
#> [1] 0
#> 
#> $total_duplicated_parents
#> [1] 0
#> 
#> $total_within_row_duplicates
#> [1] 0
#> 
#> $within_row_duplicates
#> [1] FALSE

As we can see from this output, four rows in the dataset carry a non-unique ID, involving two ID values: 2 and 6. Let’s take a peek at the duplicates:


df %>% filter(personID %in% result$non_unique_ids) %>%
  arrange(personID)
#>    personID newFamID famID             name gen momID dadID spouseID sex
#> 1         2        1     1   Vernon Dursley   1   101   102        3   1
#> 2         2        1     1 Marjorie Dursley   1   101   102       NA   0
#> 6         6        1     1   Dudley Dursley   2     3     1       NA   1
#> 61        6        1     1   Dudley Dursley   2     3     1       NA   1

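Yep, these are definitely the duplicates. Setting repair = TRUE asks checkIDs to fix what it can. Note that in the output below the repaired data frame uses different column names (ID, fam, and spID rather than personID, famID, and spouseID), so we filter on ID: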

df_repair <- checkIDs(df, repair = TRUE)

df_repair %>% filter(ID %in% result$non_unique_ids) %>%
  arrange(ID)
#>   ID newFamID fam             name gen momID dadID spID sex
#> 1  2        1   1   Vernon Dursley   1   101   102    3   1
#> 2  2        1   1 Marjorie Dursley   1   101   102   NA   0
#> 6  6        1   1   Dudley Dursley   2     3     1   NA   1

result <- checkIDs(df_repair)

print(result)
#> $all_unique_ids
#> [1] FALSE
#> 
#> $total_non_unique_ids
#> [1] 2
#> 
#> $non_unique_ids
#> [1] 2
#> 
#> $total_own_father
#> [1] 0
#> 
#> $total_own_mother
#> [1] 0
#> 
#> $total_duplicated_parents
#> [1] 0
#> 
#> $total_within_row_duplicates
#> [1] 0
#> 
#> $within_row_duplicates
#> [1] FALSE

Great! The function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling overwrite, but that’s a more complex issue that would require manual intervention. We’ll leave that for now.
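The difference between the two cases can be sketched in base R. This is a hedged illustration of the general idea, not a description of how checkIDs performs its repair: a row that is identical to an earlier row in every column can be dropped safely, whereas rows that share an ID but differ elsewhere cannot be resolved without outside information. The small data frame below is a hypothetical excerpt built for this example.

```r
# Hypothetical excerpt mirroring the two kinds of between-person duplicates
ped <- data.frame(
  ID   = c(2, 2, 6, 6),
  name = c("Vernon Dursley", "Marjorie Dursley",
           "Dudley Dursley", "Dudley Dursley"),
  sex  = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame flags rows that match an earlier row
# in every column, i.e. exact copies
exact_copy <- duplicated(ped)
ped_dedup  <- ped[!exact_copy, ]

# IDs still shared by rows that differ in other columns:
# these need manual review
still_shared <- unique(ped_dedup$ID[duplicated(ped_dedup$ID)])

print(ped_dedup)
print(still_shared)
```

Only the second copy of Dudley Dursley is removed. ID 2 remains shared by two different people, which matches the situation left over after the automatic repair above.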

Handling Within-Row Duplicates

Sometimes, an individual’s own ID may be incorrectly listed as the ID of one of their parents, leading to within-row duplicates. The checkIDs function can also identify these errors:

# Create a sample dataset with within-person duplicate parent IDs

df <- ped2fam(potter, famID = "newFamID", personID = "personID")

df$momID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Vernon Dursley"]

# Check for within-row duplicates
result <- checkIDs(df, repair = FALSE)
print(result)
#> $all_unique_ids
#> [1] TRUE
#> 
#> $total_non_unique_ids
#> [1] 0
#> 
#> $total_own_father
#> [1] 0
#> 
#> $total_own_mother
#> [1] 1
#> 
#> $total_duplicated_parents
#> [1] 0
#> 
#> $total_within_row_duplicates
#> [1] 1
#> 
#> $within_row_duplicates
#> [1] TRUE
#> 
#> $is_own_mother_ids
#> [1] 1

In this example, we have created a within-row duplicate by setting the momID of Vernon Dursley to his own ID. The checkIDs function correctly identifies this error.
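Once a within-row duplicate is flagged, the next step is usually to look at the offending row. The base-R sketch below shows one conservative way to handle it by hand, under the assumption that an impossible parent link is better recorded as missing than guessed. The data frame is a hypothetical excerpt, and ped_flagged is a name introduced for this example.

```r
# Hypothetical excerpt: row 1 has momID wrongly set to the person's own ID
ped <- data.frame(
  ID    = c(1, 2, 101, 102),
  name  = c("Vernon Dursley", "Marjorie Dursley",
            "Mother Dursley", "Father Dursley"),
  momID = c(1, 101, NA, NA),
  dadID = c(102, 102, NA, NA),
  stringsAsFactors = FALSE
)

# Flag rows where the mother's ID equals the person's own ID
is_own_mother <- !is.na(ped$momID) & ped$momID == ped$ID

# Inspect the offending rows before changing anything
print(ped[is_own_mother, ])

# Conservative option: mark the impossible link as missing
# rather than guessing the correct mother
ped_flagged <- ped
ped_flagged$momID[is_own_mother] <- NA
```

A sibling’s momID (101 here) may suggest the correct value, but confirming it requires information from outside the pedigree file.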

Verifying Sex Coding

Another common issue in pedigree data is incorrect coding of biological sex. In genetic studies, ensuring accurate recording of biological sex in pedigree data is crucial for analyses that rely on this information. The checkSex function in BGmisc helps identify and repair errors related to biological sex coding, such as inconsistencies where an individual’s sex is incorrectly recorded. An example of this would be a parent who is biologically male, but listed as a mother. The checkSex function can help identify and correct such errors.

It is essential to distinguish between biological sex (genotype) and gender identity (phenotype). Biological sex is based on chromosomes and other biological characteristics, while gender identity is a broader, richer, personal, deeply-held sense of being male, female, a blend of both, neither, or another gender entirely. While checkSex focuses on biological sex necessary for genetic analysis, we respect and recognize the full spectrum of gender identities beyond the binary. The developers of this package affirm their support for folx in the LGBTQ+ community.

The checkSex function in BGmisc performs two main tasks: validating the sex coding in a pedigree by identifying possible errors and inconsistencies in variables related to biological sex, and optionally repairing the sex coding based on specified logic. Here’s how you can use the checkSex function to validate and optionally repair sex coding in a pedigree dataset:

# Validate sex coding
results <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = FALSE)
#> Step 1: Checking how many sexes/genders...
#> 2 unique values found.
#>  1 2 unique values found.
#>  0Checks Made:
#> $sex_unique
#> [1] 1 0
#> 
#> $sex_length
#> [1] 2
#> 
#> $all_sex_dad
#> [1] "1"
#> 
#> $all_sex_mom
#> [1] "0"
#> 
#> $most_frequent_sex_dad
#> [1] "1"
#> 
#> $most_frequent_sex_mom
#> [1] "0"
print(results)
#> $sex_unique
#> [1] 1 0
#> 
#> $sex_length
#> [1] 2
#> 
#> $all_sex_dad
#> [1] "1"
#> 
#> $all_sex_mom
#> [1] "0"
#> 
#> $most_frequent_sex_dad
#> [1] "1"
#> 
#> $most_frequent_sex_mom
#> [1] "0"

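In this example, the checkSex function inspects the sex column and the sex codes of everyone listed as a parent. It returns a list of validation results: sex_unique and sex_length give the distinct codes found and how many there are, all_sex_dad and all_sex_mom give the codes observed among fathers and mothers, and most_frequent_sex_dad and most_frequent_sex_mom give the most common code in each group. Here, every father is coded 1 and every mother is coded 0, so there are no inconsistencies among the parents.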
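The logic behind this kind of check can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the checkSex implementation: the toy pedigree is hypothetical, and the variable names are chosen only to echo the output above.

```r
# Toy pedigree (hypothetical data): persons 1 and 2 are the parents of 3 and 4
ped <- data.frame(
  ID    = c(1, 2, 3, 4, 5),
  momID = c(NA, NA, 2, 2, NA),
  dadID = c(NA, NA, 1, 1, NA),
  sex   = c(1, 0, 1, 0, 0)
)

# Sex codes of everyone who appears as a father or as a mother
dad_sex <- ped$sex[ped$ID %in% ped$dadID]
mom_sex <- ped$sex[ped$ID %in% ped$momID]

all_sex_dad <- unique(dad_sex)
all_sex_mom <- unique(mom_sex)

# Most common code within each parent group, returned as a string
most_frequent_sex_dad <- names(which.max(table(dad_sex)))
most_frequent_sex_mom <- names(which.max(table(mom_sex)))

# More than one code among fathers (or mothers) signals an inconsistency
consistent <- length(all_sex_dad) == 1 && length(all_sex_mom) == 1
print(consistent)
```

If a person coded 0 also appeared in dadID, all_sex_dad would contain two values and consistent would be FALSE, pointing to a row worth inspecting.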

If incorrect sex codes are found, you can attempt to repair them automatically using the repair argument:

# Repair sex coding
df_fix <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = TRUE)
#> Step 1: Checking how many sexes/genders...
#> 2 unique values found.
#>  1 2 unique values found.
#>  0Step 2: Attempting to repair sex coding...
#> Changes Made:
#> [[1]]
#> [1] "Recode sex based on most frequent sex in dads: 1. Total gender changes made: 36"
print(df_fix)
#>     ID fam               name gen momID dadID spID sex
#> 1    1   1     Vernon Dursley   1   101   102    3   M
#> 2    2   1   Marjorie Dursley   1   101   102   NA   F
#> 3    3   1      Petunia Evans   1   103   104    1   F
#> 4    4   1         Lily Evans   1   103   104    5   F
#> 5    5   1       James Potter   1    NA    NA    4   M
#> 6    6   1     Dudley Dursley   2     3     1   NA   M
#> 7    7   1       Harry Potter   2     4     5    8   M
#> 8    8   1      Ginny Weasley   2    10     9    7   F
#> 9    9   1     Arthur Weasley   1    NA    NA   10   M
#> 10  10   1      Molly Prewett   1    NA    NA    9   F
#> 11  11   1        Ron Weasley   2    10     9   17   M
#> 12  12   1       Fred Weasley   2    10     9   NA   M
#> 13  13   1     George Weasley   2    10     9   NA   M
#> 14  14   1      Percy Weasley   2    10     9   20   M
#> 15  15   1    Charlie Weasley   2    10     9   NA   M
#> 16  16   1       Bill Weasley   2    10     9   18   M
#> 17  17   1   Hermione Granger   2    NA    NA   11   F
#> 18  18   1     Fleur Delacour   2   105   106   16   F
#> 19  19   1 Gabrielle Delacour   2   105   106   NA   F
#> 20  20   1     Audrey UNKNOWN   2    NA    NA   14   F
#> 21  21   1    James Potter II   3     8     7   NA   M
#> 22  22   1       Albus Potter   3     8     7   NA   M
#> 23  23   1        Lily Potter   3     8     7   NA   F
#> 24  24   1       Rose Weasley   3    17    11   NA   F
#> 25  25   1       Hugo Weasley   3    17    11   NA   M
#> 26  26   1   Victoire Weasley   3    18    16   NA   F
#> 27  27   1  Dominique Weasley   3    18    16   NA   F
#> 28  28   1      Louis Weasley   3    18    16   NA   M
#> 29  29   1      Molly Weasley   3    20    14   NA   F
#> 30  30   1       Lucy Weasley   3    20    14   NA   F
#> 31 101   1     Mother Dursley   0    NA    NA  102   F
#> 32 102   1     Father Dursley   0    NA    NA  101   M
#> 33 104   1       Father Evans   0    NA    NA  103   M
#> 34 103   1       Mother Evans   0    NA    NA  104   F
#> 35 106   1    Father Delacour   0    NA    NA  105   M
#> 36 105   1    Mother Delacour   0    NA    NA  106   F

When the repair argument is set to TRUE, the function attempts to repair the sex coding by recoding the sex variable based on the most frequent sex values found among parents. In the output above, the message reports 36 changes, one for each row, and the numeric codes 1 and 0 have been replaced by M and F. Recoding in this way makes the sex coding consistent, which is essential for constructing valid genetic pedigrees.
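A recoding rule of this kind can be sketched in base R as follows. This is a minimal sketch that assumes exactly two sex codes and treats the most frequent code among fathers as male. It is not the package’s own repair logic, and the toy data and the sex_recoded column are hypothetical.

```r
# Toy pedigree (hypothetical data)
ped <- data.frame(
  ID    = c(1, 2, 3, 4),
  momID = c(NA, NA, 2, 2),
  dadID = c(NA, NA, 1, 1),
  sex   = c(1, 0, 1, 0)
)

# Treat the most frequent code among fathers as the male code
dad_sex   <- ped$sex[ped$ID %in% ped$dadID]
male_code <- names(which.max(table(dad_sex)))

# Recode to "M"/"F", leaving missing values as NA
# (assumes only two codes are in use)
ped$sex_recoded <- ifelse(
  is.na(ped$sex), NA_character_,
  ifelse(as.character(ped$sex) == male_code, "M", "F")
)
print(ped)
```

Writing the result to a new column rather than overwriting sex keeps the original codes available for comparison.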

Conclusion

This vignette demonstrates how to use the BGmisc package to identify and repair errors in pedigrees. Functions like checkIDs, checkSex, and recodeSex help you catch and, where possible, repair common errors, supporting the integrity of your pedigree data and facilitating accurate analysis and research.