Coding, GitHub, and Bioinformatics
Here I have tutorials for coding, example outputs, and best data practices that I’ve developed through some of my PhD-related computational research.
Creating a Quarto Website with GitHub
Before getting started, you need (1) a GitHub account, (2) RStudio on your local machine, (3) GitHub Desktop or the terminal, and (4) Quarto. I will complete this tutorial using GitHub Desktop, as I think it is easier to learn for those with less coding familiarity.
Initializing the website
- In GitHub, create a new repository with the domain name that you would like for your website. For example, my repository name is ‘madeline-eppley’, so my website will be named ‘madeline-eppley.github.io’.
- In GitHub Desktop, choose “Add Existing Repository” and paste the clone URL. Alternatively, you can open the repository in GitHub, then select “Open with GitHub Desktop”. If you’re using the terminal, you can clone the repository with `gh repo clone repo-name/repo-name.github.io`.
- If this is your first time linking a repository through GitHub Desktop, by default GitHub Desktop will create a local folder named “GitHub” in your Documents folder (on a Mac) and place your repository there. I moved this folder into my regular user space, but you can store it anywhere you like. If you’ve already set up GitHub Desktop, your new repository will be linked in the previously created folder. Locate your local repository folder.
- In your local RStudio, initialize a new Quarto project with “New Project” and select “Quarto Website”. Then, provide the directory name as your local repository folder. You should now see a basic `index.qmd` and `_quarto.yml` file in your directory.
- Open your `_quarto.yml` file and add `output-dir: docs` directly under the `type: website` line. Save the edit.
- Make a test edit to the `index.qmd` file and then check “Render on Save”. This will open a locally hosted browser preview where you can check your edits. If everything looks good, you can select “Render”, and auto-generated .html files will appear in a new `/docs` folder within the repository folder.
- Test the setup by committing these new files to GitHub through GitHub Desktop. You should then see your new setup files on GitHub.
Rendering the Website
- Go to “Settings” in GitHub and scroll down to Pages. There, select Source “Deploy from a branch” and edit the branch so that it renders from the `/docs` folder on the main branch. The folder that you select should be where your .html files are stored, so after this point, do NOT move your .html files out of the `/docs` folder.
- At this point, GitHub will begin automatically building and deploying your site. Under the “Actions” tab, you can check the build and deployment status.
- When it completes, you can return to Settings/Pages and see that your site is live at ‘your-repo-name.github.io’. Now you can visit your site!
Editing your Website Structure
Quarto websites are designed to host their structure within the `_quarto.yml` file. Here, you establish your website format, image folders, style, font design, header and footer, and the order of the pages. You can check out my `_quarto.yml` file in my repo for inspiration!
Your `index.qmd` file is your first “page” and is the home screen of your website. You can add new pages at any time by adding a new .qmd file in GitHub. Each time you add a new element of “structure” (e.g. a new page), you need to update the `_quarto.yml` file to include it, as in the sketch below.
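As a minimal sketch (the page names, title, and theme here are placeholders rather than my actual settings), a `_quarto.yml` for a site like this might look something like:
project:
  type: website
  output-dir: docs
  resources: "/img"

website:
  title: "Your Name"
  navbar:
    left:
      - index.qmd
      - new-page.qmd

format:
  html:
    theme: cosmo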
I typically edit my pages locally in RStudio, but it is also possible to edit them within GitHub. Standard R Markdown formatting also applies to .qmd files. Wherever you edit, be sure to use GitHub Desktop or git to push and pull your edits. Whenever you’re ready to publish your new edits, you need to return to your local RStudio and “Render” your edited file. This step is important because it re-generates your .html file, which is what is actually used to build the website. To save myself the hassle of remembering this step, I enable “Render on Save” for all of my pages within RStudio.
Adding Images to your Website
You can add as many images to your Quarto website as you’d like! The first step is to set up your resources folder in your `_quarto.yml` file. Add `resources: "/img"` under the `project:` section at the top of the file. Next, add a high-quality image (I suggest .png format) to your `/img` folder and title it something intuitive (e.g. ‘profile_pic.png’). To add an image in-text, use the standard markdown image syntax in your .qmd file and attach alt text with `{fig-alt="Add your alt text here"}`, for example `![My caption](img/profile_pic.png){fig-alt="Describe the image here"}`. There are several methods to adjust image size, caption font size, etc. that you can learn more about on the Quarto website.
Styles, Icons, Fonts, and Colors
Coming soon!
Population Genomics
Imputing SNP Datasets
Why impute?
When you receive genotype data from the sequencing facility, it will be incomplete; that is, each individual will be missing genotype calls at some SNPs. However, many downstream analyses require complete data. If you reduce your data to only the SNPs with complete calls across all individuals, you’ll likely end up with a very small number of usable SNPs. For example, the seascape oysters from the CviMVP project were returned from the sequencing facility with 194k SNPs, but only 14k of those are complete across all individuals. We definitely don’t want to reduce our data by that much! So, we impute missing data using methods that fill in the most frequent genotype at each SNP for a designated group of samples (typically ancestral groups) or across all individuals.
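As a toy illustration of what mode-based imputation does (a conceptual example only, not the LEA workflow described below), consider a single SNP scored as 0/1/2 alternate-allele counts across six individuals:
# one SNP across six individuals, with one missing call
snp <- c(0, 0, 1, NA, 2, 0)
# the most frequent observed genotype (the mode) is 0
mode_geno <- as.integer(names(which.max(table(snp))))
# the missing call is filled in with the mode
snp[is.na(snp)] <- mode_geno
snp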
Recommended pre-reading
Before starting the SNP imputation and filtering process, I recommend reading the following:
1. LEA 3: Factor models in population genetics and ecological genomics with R (Gain et al. 2021)
2. These aren’t the loci you’re looking for: Principles of effective SNP filtering for molecular ecologists (O’Leary et al. 2018)
3. Fast and Efficient Estimation of Individual Ancestry Coefficients (Frichot et al. 2014)
If you’re short on time, readings 1 and 2 are the most important. The third reading dives further into the ancestry calculations in `snmf` that we will use for imputation.
LEA and impute()
We will be using the `impute()` function in the LEA package to impute our missing genotypes with the output of an snmf object (.lfmm file). The function takes the following arguments: `impute(object, input.file, method, K, run)`. See more details here. The imputation works by taking ancestry and genotype frequency estimates from an snmf run and using the mode for each genotype within the assigned ancestral group. The method of imputation is set to either “mode” or “random”. With “random”, imputation is performed using the genotype probabilities. With “mode”, the most likely genotype is used for matrix completion. Since LEA imputes using a specified number of ancestral groups (K), it can potentially introduce structure into the dataset. If you are concerned about this, compare Fst values pre- and post-imputation (using a complete set of filtered SNPs for the ‘pre-imputation’ SNP set).
Full vs thinned SNP sets
It is important to understand what type of SNPs you want to retain in your dataset for downstream analyses. Your full SNP set will be all SNPs that are retained after filtering for missingness and MAF and then imputed. Your thinned SNP set will be a subset of the full SNP set that removes SNPs in LD. Generally, you want to use the full SNP set for detecting selection and functional loci (e.g. gene ontology), and the thinned SNP set for population structure analyses (e.g. PCA, ancestry/structure).
General imputation workflow
- Transfer your data into a genotype matrix format. We have developed code for extracting genotype values from a VCF using the `extract.gt()` function (see the Preparing SNP matrix section below).
- Filter for missing data. This typically includes: removing low-quality calls (often done by the sequencing facility, but double-check), removing individuals with >10% missing data across all loci, and removing SNPs with >10% missing data across all individuals.
- Filter for minor allele frequency (MAF)
- Impute
- THIS STEP PRODUCES YOUR FULL SNP DATASET
- Filter your full SNP dataset for linkage disequilibrium
- THIS STEP PRODUCES YOUR THINNED SNP DATASET
Step 1: Preparing your SNP matrix
To prepare to use the `impute()` function, you should arrange your SNP data in a matrix with individuals in rows and SNPs in columns. If you’re starting with a VCF file of genotypes from the sequencing facility, use the `extract.gt()` function to extract the genotype values into a data frame from the VCF. For an example of how to do this, ask for access to the `genotypes_experimental.R` script that was used for the Cvi MVP H2F project.
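If you don’t have access to that script, here is a rough sketch of the same idea using `vcfR` (the file name and the 0/1/2 recoding scheme are my own illustration, so double-check against your VCF’s genotype format):
library(vcfR)
# read the VCF and pull out the genotype (GT) calls: SNPs in rows, individuals in columns
vcf <- read.vcfR("genotypes.vcf.gz")
gt <- extract.gt(vcf, element = "GT")
# recode diploid calls as counts of the alternate allele (0, 1, or 2)
dosage <- matrix(NA_integer_, nrow = nrow(gt), ncol = ncol(gt), dimnames = dimnames(gt))
dosage[gt %in% c("0/0", "0|0")] <- 0L
dosage[gt %in% c("0/1", "1/0", "0|1", "1|0")] <- 1L
dosage[gt %in% c("1/1", "1|1")] <- 2L
# impute()/snmf expect individuals in rows and SNPs in columns, so transpose
geno_matrix <- t(dosage)
dim(geno_matrix)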
Step 2: Filter for missing data
The threshold for missing data can change depending on your dataset. For our Cvi MVP samples, we decided to use a 10% missingness threshold (i.e. any individual or locus with more than 10% missing data across the entire dataset will be removed). In this example, our SNP matrix is called `exp_ordered` and contains sequenced individuals in rows and SNPs in columns. We detect 2.9k SNPs that are above the 0.10 threshold and remove them, producing a reduced data frame called `exp_cleaned_columns`. The same is repeated for the rows, but no individuals exceed the missingness threshold (yay, good data!). We also have a separate mutations data frame that stores some functional annotations of the SNPs, so I also subset that (`muts_filtered`) so that it retains the same SNPs as the genotype matrix.
# look at where our data is missing in rows
hist_rows <- rowSums(is.na(exp_ordered))
hist(hist_rows)
# look at where our data is missing in columns
cols_hist <- colSums(is.na(exp_ordered))
hist(cols_hist)
# Filter out cols with missing data
missing_threshold <- 0.10 #10%
snp_missing_prop <- colMeans(is.na(exp_ordered))
exp_cleaned_columns <- exp_ordered[, snp_missing_prop <= missing_threshold]
dim(exp_cleaned_columns) # removed 2976 SNPs
# Filter out the rows with missing data
ind_missing_prop <- rowMeans(is.na(exp_ordered))
inds_missing <- which(ind_missing_prop > missing_threshold) # length 0, all inds pass
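# if any individuals had exceeded the threshold, we would also drop those rows, e.g.:
# exp_cleaned_columns <- exp_cleaned_columns[-inds_missing, ]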
# filter the muts database to just keep our SNPs that meet the missingness threshold
muts_filtered <- muts[snp_missing_prop <= missing_threshold, ]
dim(muts_filtered) #191994 SNPs
Step 3: Filter for MAF
This step shows how to filter for MAF. Our genotype matrix `exp_cleaned_columns_sort` has been filtered for missingness and sorted so that the rows and columns are in the exact same order as our mutation matrix and individual matrix. We use a custom function to calculate the allele frequency (af) within each column, and then apply it to each column of the genotype matrix. The `maf_threshold` can be set to whatever threshold you’d like for your data; here we use 0.05, meaning alleles with a frequency below 5% in the dataset are filtered out. These “minor alleles” are so rare across the dataset that they could bias population estimates. Once we complete this filtering, we are left with n = 154k SNPs, and this will be the dataset that we impute.
# The geno matrix has individuals in rows and mutations in columns, with a 0, 1, or 2 entered for the number of alternate alleles in the diploid
GEN <- exp_cleaned_columns_sort
# function to calculate the allele frequency of a column
af <- function(i, GEN){
sum(GEN[,i], na.rm=TRUE)/(2*sum(!is.na(GEN[,i])))
}
# af(1, GEN)
# calculate the allele freq of SNPs in the geno mat
snp_afs <- apply(GEN, 2, function(i){
sum(i, na.rm=TRUE)/(2*sum(!is.na(i)))})
# calculate MAF
maf_threshold <- 0.05
keep_snps <- names(snp_afs[snp_afs >= maf_threshold & snp_afs <= (1 - maf_threshold)]) # vector of SNP names with allele frequency between 0.05 and 0.95
# now filter the GEN matrix for MAF
GEN_filtered <- GEN[, keep_snps]
dim(GEN_filtered) #154609 SNPs
# filter the muts data frame to just keep the SNPs that pass the MAF filter
muts_maf_filtered <- muts_filt_ord[muts_filt_ord$Affx.ID %in% keep_snps, ]
dim(muts_maf_filtered) #154609 SNPs
saveRDS(muts_maf_filtered, file = "merged_data/genotypes/20250203_FULLmuts_exp.rds")
Step 4: Imputing and producing the full SNP set
Here we have our filtered SNP matrix, called `GEN_filtered`. This matrix has been filtered for MAF and missingness and contains 154k SNPs. In this example, we impute with K = 2.
# Impute #####################################################################
# use snmf on our genotype matrix with 154k SNPs
write.lfmm(GEN_filtered, "merged_data/genotypes/20250203_genotypes_exp.lfmm")
exp.project.snmf <- snmf("merged_data/genotypes/20250203_genotypes_exp.lfmm", K = 2, repetitions = 10, ploidy = 2, entropy = TRUE, project = "new")
# run with the lowest cross-entropy value and impute
best = which.min(cross.entropy(exp.project.snmf, K = 2))
impute(exp.project.snmf, "merged_data/genotypes/20250203_genotypes_exp.lfmm", method = 'mode', K = 2, run = best)
# store the imputed SNP set as a matrix
GEN_imputed <- as.data.frame(read.table("merged_data/genotypes/20250203_genotypes_exp.lfmm_imputed.lfmm", header = FALSE))
saveRDS(GEN_imputed, file = "merged_data/genotypes/20250203_FULLSNPs_exp.rds") # this is the FULL SNP set
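A quick sanity check I would suggest at this point (not part of the original script): confirm that the imputed matrix has no remaining missing calls and the same dimensions as the filtered input.
# the imputed matrix should be complete and match the dimensions of GEN_filtered
sum(is.na(GEN_imputed)) # expect 0
dim(GEN_imputed)        # expect the same dimensions as GEN_filtered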
Step 5: Filtering for linkage disequilibrium and producing the thinned SNP set
Here, we read in the full imputed SNP set and use the `bigsnpr` and `bigstatsr` packages to calculate a PCA while filtering for LD. The LD correlation argument is set to a default of `thr.r2 = 0.2`, but this can be explored for your data.
GEN_imputed <- readRDS("/Users/madelineeppley/Desktop/20250203_FULLSNPs_exp.rds")
gen_impute <- add_code256(big_copy(GEN_imputed, type = "raw"), code = bigsnpr:::CODE_012)
# double check our dimensions
muts_maf_filtered <- readRDS("/Users/madelineeppley/Desktop/20250203_FULLmuts_exp.rds")
dim(GEN_imputed)
dim(muts_maf_filtered) # matches
thr.r2 <- 0.2
# create the thinned SNP set
pca_inv <- snp_autoSVD(gen_impute,
                       infos.chr = muts_maf_filtered$Chromosome,
                       infos.pos = muts_maf_filtered$Position,
                       thr.r2 = thr.r2,     # LD correlation threshold
                       size = 100 / thr.r2) # window size; worth exploring for your data
thinned_snps <- attr(pca_inv, which="subset") # SNPs n=104958, this step outputs the integer positions of your SNPs that should be retained in the thinned set
gen_filtered_subset <- GEN_imputed[, thinned_snps] # subset the full imputed dataset by the thinned SNP positions
saveRDS(gen_filtered_subset, file = "/Users/madelineeppley/Desktop/20250203_THINNEDSNPs_exp.rds")
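Note that `pca_inv` is itself the PCA computed on the thinned SNPs, so (assuming the standard `bigstatsr` methods for `big_SVD` objects) you can inspect it directly:
# scree plot of the PCs from the thinned SNP set
plot(pca_inv)
# per-individual PC scores for downstream structure plots
pc_scores <- predict(pca_inv)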
Other imputation methods
I’ve tried several different imputation methods before deciding on LEA. Since the rest of the scripting for the Cvi MVP project was done in R, we elected to use LEA for ease of use. However, other imputation methods, such as BEAGLE, snpStats, and bigsnpr, are also often used in pop gen studies. I was unsuccessful in getting bigsnpr to work on our data, but it does sound promising. Here are my notes:
bigsnpr is good for large datasets and will work with the genotype matrix in its current format. This method should reduce the chance of introducing structure because it compares each individual with missing SNPs to a “reference panel” generated from the input data. It will automatically compute a reference panel based on the input genotype matrix, then each individual is imputed based on the comparison to that reference panel. I assume it averages across all individuals in the dataset (a “mean”-type approach), although you can specify the method as well (mode, mean0, mean2, or random). It will output a new imputed genotype matrix in .rds file format.
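For reference, here is roughly what I attempted with bigsnpr (a sketch only; I am assuming `snp_fastImputeSimple()` is the relevant function for the mode/mean0/mean2/random options, so check the package documentation before relying on this):
library(bigsnpr)
# assuming GEN_filtered is the numeric 0/1/2 matrix from Step 3, with NAs for missing calls
# recode missing genotypes as the raw value 3, which CODE_012 maps back to NA
geno_raw <- GEN_filtered
geno_raw[is.na(geno_raw)] <- 3
G <- add_code256(big_copy(geno_raw, type = "raw"), code = bigsnpr:::CODE_012)
# simple imputation; method can be "mode", "mean0", "mean2", or "random"
G_imp <- snp_fastImputeSimple(G, method = "mode")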
Environmental Data
Coming soon!