BSF Genomics Tutorials

Comprehensive guides for population genomics analysis of Black Soldier Fly

Topics: NGS Analysis · Population Structure · Genetic Diversity · Imputation

🧬 Overview

This tutorial provides a complete workflow for analyzing the genetic diversity, population structure, and demographic history of wild and captive Black Soldier Fly (Hermetia illucens) populations using whole-genome sequencing data.

What You'll Learn
  • Process raw NGS data from FASTQ to VCF files
  • Perform quality control and variant filtering
  • Conduct population structure analysis (PCA, ADMIXTURE, phylogenetics)
  • Calculate genetic diversity metrics
  • Identify runs of homozygosity and assess inbreeding
  • Estimate effective population size and linkage disequilibrium

Software Requirements
  • FastQC - Quality control of raw reads
  • BWA - Read mapping to reference genome
  • Picard Tools - BAM file processing
  • FreeBayes - Variant calling
  • BCFtools/VCFtools - VCF manipulation
  • PLINK - Genetic analysis
  • ADMIXTURE - Ancestry estimation
  • R/Python - Data visualization

📊 Part 1: What Does a Population Genomics Study of BSF Entail?

A population genomics study of the Black Soldier Fly involves several key steps that integrate laboratory work, high-throughput sequencing, and computational analysis. The goal is to understand patterns of genetic diversity, population structure, and evolutionary history across wild and captive populations.

1. Sample Collection

Collect individuals from multiple geographic locations to capture global genetic variation, including both wild and captive colonies. Sampling across populations allows for inference of population structure, admixture, and demography.

2. DNA Extraction and Sequencing

  • Extract high-quality genomic DNA using standard protocols
  • Perform whole-genome sequencing (WGS) at high or low depth
  • Consider imputation strategies for cost-effective large-scale analysis

3. Bioinformatics Pipeline

The computational workflow includes:

  • Variant calling and filtering
  • Genotype imputation for low-coverage samples
  • Population structure analysis
  • Genetic diversity assessment
  • Demographic inference

🔬 Part 2: NGS Data Analysis

This section demonstrates how to analyze WGS data from raw FASTQ files to a variant call format (VCF) file containing SNPs and indels.

Directory Structure

The scripts assume the following directory arrangement. Adjust paths according to your needs:

├── 1_data
├── 2_fastqc
├── 3_mapping
├── 4_processing
├── 5_variant_calling
└── 6_filtering

Step 1: Quality Control

Quality assessment of raw sequencing reads is the first critical step in any NGS analysis pipeline.

bash
cd 2_fastqc

# Create soft links for all files in 1_data
for fastq in ../1_data/*.fastq.gz; do
  ln -s "$fastq" .
done

# Run FastQC on the linked files
fastqc -t 4 *.fastq.gz

# Optional: use GNU parallel for faster processing
parallel "fastqc {}" ::: *.fastq.gz

Tip: When you have many HTML reports, aggregate them with MultiQC: multiqc .
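Each FastQC report also contains a machine-readable summary.txt with one STATUS, module, filename line per check. A minimal Python sketch, assuming that standard three-column tab-separated format, to flag modules that warn or fail:

```python
import io

def failed_modules(summary_text):
    """Parse a FastQC summary.txt (STATUS<TAB>Module<TAB>Filename)
    and return the modules that did not pass."""
    flagged = []
    for line in io.StringIO(summary_text):
        status, module, _ = line.rstrip("\n").split("\t")
        if status in ("WARN", "FAIL"):
            flagged.append((module, status))
    return flagged

# illustrative summary.txt content
example = (
    "PASS\tBasic Statistics\tsample1.fastq.gz\n"
    "WARN\tPer base sequence content\tsample1.fastq.gz\n"
    "FAIL\tOverrepresented sequences\tsample1.fastq.gz\n"
)
print(failed_modules(example))
```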

Step 2: Mapping Raw Reads to Reference Genome

Mapping aligns sequencing reads to a reference genome, enabling variant identification and structural analysis.

bash
#!/bin/bash
cd 3_mapping

# Path to the reference genome
reference_genome="path/to/ref"

# Index the reference genome (only needed once)
bwa index "$reference_genome"

echo "Job started at $(date '+%d_%m_%y_%H_%M_%S')"

# Loop through each forward-read file
for forward_file in ../1_data/*.R1.fastq.gz; do
    # Get the base filename (strip the read-pair suffix)
    base_filename=$(basename -- "$forward_file" | sed 's/\.R1\.fastq\.gz//')
    
    # Path to the matching reverse-read file
    reverse_file="../1_data/${base_filename}.R2.fastq.gz"
    
    echo "Processing: ${base_filename}"
    
    # Perform alignment, convert to BAM, and sort
    bwa mem -M -t 8 -R "@RG\tID:${base_filename}\tPL:BGISEQ-500\tSM:${base_filename}" \
    "$reference_genome" "$forward_file" "$reverse_file" | \
    samtools view -bS | \
    samtools sort --output-fmt BAM -@ 8 -o "${base_filename}.sorted.bam" -
    
    # Index the sorted BAM file
    samtools index ${base_filename}.sorted.bam
    
    echo "Done Processing: ${base_filename}"
done

echo "Mapping complete"
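A quick sanity check after mapping is the alignment rate reported by `samtools flagstat`. A small Python helper, a sketch assuming the usual flagstat text layout, to pull that figure out programmatically:

```python
import re

def mapped_percentage(flagstat_text):
    """Extract the first 'mapped (XX.XX%' figure from
    samtools flagstat output; None if no match."""
    m = re.search(r"mapped \(([\d.]+)%", flagstat_text)
    return float(m.group(1)) if m else None

# illustrative flagstat output
example = (
    "2000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "1975 + 0 mapped (98.75% : N/A)\n"
)
print(mapped_percentage(example))
```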

Step 3: Processing BAM Files

Mark PCR duplicates to avoid amplification bias in variant calling; the script below also removes them outright (REMOVE_DUPLICATES=true).

bash
#!/bin/bash
cd 4_processing

bam_directory="../3_mapping/"
output_directory="./"

# Create output directory if it doesn't exist
mkdir -p "$output_directory"

# Process each BAM file
for bam in "$bam_directory"*.bam; do
    if [ -f "$bam" ]; then
        base_name=$(basename "${bam%.*}")
        echo "Processing ${bam}"
        
        java -Xmx40G -jar picard.jar MarkDuplicates \
            I="${bam}" \
            O="${output_directory}${base_name}_markdup.bam" \
            M="${output_directory}${base_name}_markdup_metrics.txt" \
            OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 \
            CREATE_INDEX=true \
            VALIDATION_STRINGENCY=LENIENT \
            REMOVE_DUPLICATES=true \
            ASSUME_SORT_ORDER=coordinate
    fi
done

echo "Completed marking duplicates"
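The metrics file written by MarkDuplicates records the duplication rate per library. A short Python sketch for extracting PERCENT_DUPLICATION, assuming the standard Picard metrics layout (comment lines starting with '#', then a header row and one data row):

```python
import io

def percent_duplication(metrics_text):
    """Read the PERCENT_DUPLICATION field from a Picard
    MarkDuplicates metrics file."""
    lines = [l.rstrip("\n") for l in io.StringIO(metrics_text)
             if l.strip() and not l.startswith("#")]
    header, values = lines[0].split("\t"), lines[1].split("\t")
    return float(values[header.index("PERCENT_DUPLICATION")])

# illustrative (truncated) metrics file
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION\n"
    "lib1\t1000000\t0.082\n"
)
print(percent_duplication(example))
```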

Step 4: Variant Calling using FreeBayes

FreeBayes is a Bayesian genetic variant detector designed to find SNPs, indels, and complex events.

Basic FreeBayes Run:

bash
#!/bin/bash
cd 5_variant_calling

reference_genome="path/to/ref"

# Create BAM list (absolute paths)
readlink -f ../4_processing/*.bam > bamlist

# Run FreeBayes
freebayes \
  --fasta-reference ${reference_genome} \
  --bam-list bamlist > output.vcf

Parallel FreeBayes for Large Datasets:

bash
#!/bin/bash

reference_genome="path/to/ref"

# Run FreeBayes in parallel over 100 kb regions (20 processors)
freebayes-parallel \
   <(fasta_generate_regions.py ${reference_genome}.fai 100000) 20 \
   --fasta-reference ${reference_genome} \
   --bam-list bamlist > output.vcf
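fasta_generate_regions.py simply tiles each contig of the .fai index into fixed-size chunks. The logic can be sketched in a few lines of Python (an illustrative reimplementation, not a replacement for the bundled script):

```python
def generate_regions(fai_lines, chunk_size):
    """Tile each contig of a .fai index (name<TAB>length<TAB>...)
    into regions of at most chunk_size bp, formatted as FreeBayes
    expects (0-based half-open name:start-end)."""
    regions = []
    for line in fai_lines:
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        for start in range(0, length, chunk_size):
            regions.append(f"{name}:{start}-{min(start + chunk_size, length)}")
    return regions

# a 250 kb contig split into 100 kb chunks
print(generate_regions(["Chr1\t250000\t112\t80\t81"], 100000))
```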

Step 5: VCF File Filtering

Apply quality filters to retain high-confidence variants.

bash
# Initial filtering: missingness, MAC, and quality
vcftools --gzvcf BSF_renamed.vcf.gz \
         --max-missing 0.5 \
         --mac 3 \
         --minQ 30 \
         --recode \
         --recode-INFO-all \
         --stdout | bgzip -c > BSF_filtered.vcf.gz

# Remove individuals with >50% missing data
vcftools --gzvcf BSF_filtered.vcf.gz --missing-indv
mawk 'NR > 1 && $5 > 0.5' out.imiss | cut -f1 > lowDP.indv

vcftools --gzvcf BSF_filtered.vcf.gz \
         --remove lowDP.indv \
         --recode \
         --recode-INFO-all \
         --stdout | bgzip -c > BSF_filtered_individuals.vcf.gz

# Keep only biallelic SNPs
bcftools view -m2 -M2 -v snps BSF_filtered_individuals.vcf.gz \
              -Oz -o BSF_biallelic_snps.vcf.gz

# Filter on mean depth
vcftools --gzvcf BSF_biallelic_snps.vcf.gz \
         --min-meanDP 5 \
         --max-meanDP 25 \
         --recode \
         --recode-INFO-all \
         --stdout | bgzip -c > BSF_depth_filtered.vcf.gz

# Final missingness filter
vcftools --gzvcf BSF_depth_filtered.vcf.gz \
         --max-missing 0.95 \
         --maf 0.001 \
         --recode \
         --recode-INFO-all \
         --stdout | bgzip -c > BSF_final.vcf.gz
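The mawk step above can also be expressed explicitly. This Python sketch mirrors it, assuming VCFtools' .imiss layout with F_MISS in column 5, and skips the header row:

```python
import io

def high_missing(imiss_text, cutoff=0.5):
    """From a VCFtools .imiss table (INDV ... F_MISS in column 5),
    return individuals whose missingness exceeds the cutoff."""
    lines = io.StringIO(imiss_text)
    next(lines)  # skip the header row
    return [line.split()[0] for line in lines
            if float(line.split()[4]) > cutoff]

# illustrative .imiss content
example = (
    "INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n"
    "BSF_01\t1000\t0\t100\t0.1\n"
    "BSF_02\t1000\t0\t700\t0.7\n"
)
print(high_missing(example))  # individuals to pass to --remove
```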

Step 6: Genotype Imputation

Phasing with Beagle:

bash
VCF='Chr1.filtered.vcf.gz'
Chr='Chr1'

# Phase the VCF using Beagle
java -Xmx100g -jar beagle5.jar gt=${VCF} \
     out=${Chr}_phased gp=true burnin=10 iterations=40 \
     impute=false nthreads=40 chrom=${Chr}

# Index the phased VCF
bcftools index --threads 8 ${Chr}_phased.vcf.gz

Imputation with QUILT:

bash
# Build reference panel
QUILT_prepare_reference.R \
  --outputdir="$Out_dir" \
  --chr="$Chr" \
  --nGen=100 \
  --regionStart="$regionStart" \
  --regionEnd="$regionEnd" \
  --buffer=500000 \
  --reference_vcf_file="$VCF_file"

# Run imputation
QUILT.R \
  --prepared_reference_filename="${Ref}QUILT_prepared_reference.${Chr}.${regionStart}.${regionEnd}.RData" \
  --bamlist="${Bamlist}" \
  --method=diploid \
  --chr="$Chr" \
  --regionStart="${regionStart}" \
  --regionEnd="${regionEnd}" \
  --buffer=500000 \
  --output_filename="${Out_dir}/quilt.${Chr}.${regionStart}.${regionEnd}.vcf.gz" \
  --nCores=1

🌍 Part 3: Population Structure Analysis

Principal Component Analysis (PCA)

PCA is a model-free method to explore population genetic structure.

PLINK-based PCA:

bash
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF_bed

# Quality control
plink-1.9-rc --bfile BSF_bed --geno 0.2 --mind 0.2 --maf 0.05 \
             --make-bed --out BSF_qc

# LD pruning
plink-1.9-rc --bfile BSF_qc --indep-pairwise 50 10 0.2 --out BSF_pruned

# Run PCA
plink-1.9-rc --bfile BSF_qc --extract BSF_pruned.prune.in \
             --pca --out BSF_pca

Visualization in R:

r
library(tidyverse)

# Read PCA results (first two columns are FID and IID)
plinkPCA <- read_table("BSF_pca.eigenvec", col_names = FALSE)
ids <- plinkPCA[[2]]
plinkPCA <- plinkPCA[, c(-1, -2)]
EigenValue <- scan("BSF_pca.eigenval")

# Set column names
names(plinkPCA) <- paste0("PC", seq_len(ncol(plinkPCA)))

# Attach population labels (sample_pop.txt: sample ID, population)
pops <- read_table("sample_pop.txt", col_names = c("sample", "pop"))
plinkPCA$pop <- pops$pop[match(ids, pops$sample)]

# Calculate percentage variance explained
pve <- data.frame(PC = seq_along(EigenValue),
                  pve = EigenValue / sum(EigenValue) * 100)

# Create PCA plot
ggplot(plinkPCA, aes(PC1, PC2, color = as.factor(pop))) + 
  geom_point(size = 3) + 
  coord_equal() +
  theme_light() + 
  xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) + 
  ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))

Phylogenetic Tree Construction

bash
# Convert VCF to phylip format using TASSEL
tassel-5.2.89-0/run_pipeline.pl -Xmx100G \
  -SortGenotypeFilePlugin -inputFile BSF.vcf.gz \
  -outputFile sorted.vcf -fileType VCF

tassel-5.2.89-0/run_pipeline.pl -Xmx100G \
  -importGuess sorted.vcf \
  -ExportPlugin -saveAs sequences.phy -format Phylip_Inter

# Build tree with IQ-TREE
iqtree -s sequences.phy -st DNA -m MFP -o outgroup_sample \
       -bb 1000 -bnni -cmax 15 -nt AUTO

ADMIXTURE Analysis

bash
# Prepare PLINK files
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF --allow-extra-chr

# Fix chromosome names for ADMIXTURE
awk '{$1="0";print $0}' BSF.bim > BSF.bim.tmp
mv BSF.bim.tmp BSF.bim

# Run ADMIXTURE with cross-validation (K=2 to K=11)
for i in {2..11}
do
   admixture --cv BSF.bed $i > log${i}.out
done

# Extract CV errors
awk '/CV/ {print $3, $4}' log*.out | sed -e 's/(K=//' -e 's/)://' > BSF.cv.error
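The conventional choice of K is the one minimizing the cross-validation error. A tiny Python sketch that reads "K error" pairs such as those written to BSF.cv.error:

```python
def best_k(cv_lines):
    """Return the K with the lowest ADMIXTURE cross-validation
    error, from lines of 'K error' pairs."""
    pairs = [(int(k), float(err))
             for k, err in (line.split() for line in cv_lines)]
    return min(pairs, key=lambda kv: kv[1])[0]

# illustrative CV errors for K = 2..4
print(best_k(["2 0.523", "3 0.481", "4 0.495"]))
```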

📈 Part 4: Genetic Diversity and Differentiation

Nucleotide Diversity (π)

bash
# Parse VCF for genomics_general
python3 genomics_general/parseVCF.py -i BSF.vcf.gz | bgzip > BSF.geno.gz

# Calculate diversity metrics in windows
python2 genomics_general/popgenWindows.py \
    -g BSF.geno.gz \
    -o BSF.diversity.csv.gz \
    -f phased \
    -w 100000 \
    -m 100 \
    -s 100000 \
    -p pop1 -p pop2 -p pop3 \
    --popsFile sample_pop.txt
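Under the hood, windowed π is per-site pairwise diversity summed over the window and divided by the number of bases surveyed. A minimal Python illustration of that calculation for biallelic sites (not the popgenWindows.py implementation itself):

```python
def site_pi(ref_count, alt_count):
    """Unbiased per-site nucleotide diversity for a biallelic
    site: (n / (n - 1)) * 2pq, where n alleles were sampled."""
    n = ref_count + alt_count
    p = ref_count / n
    return (n / (n - 1)) * 2 * p * (1 - p)

def window_pi(allele_counts, window_bp):
    """Windowed pi: per-site diversity summed over variant sites,
    divided by the number of bases surveyed in the window."""
    return sum(site_pi(r, a) for r, a in allele_counts) / window_bp

# two variant sites among 10,000 surveyed bases, 10 alleles each
print(window_pi([(5, 5), (9, 1)], 10_000))
```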

FST and DXY

bash
# Calculate FST and DXY between populations
python2 genomics_general/popgenWindows.py \
    -g BSF.geno.gz \
    -o BSF.Fst.Dxy.csv.gz \
    -f phased \
    -w 50000 \
    -m 50 \
    -s 50000 \
    $(awk '{print "-p", $1}' pops.txt) \
    --popsFile sample_pop.txt
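For intuition, the per-site quantities behind these window statistics are simple functions of allele frequencies. A purely illustrative sketch of per-site DXY and Hudson's FST estimator (in the Bhatia et al. 2013 form):

```python
def site_dxy(p1, p2):
    """Per-site absolute divergence: probability that two alleles,
    one drawn from each population, differ."""
    return p1 * (1 - p2) + p2 * (1 - p1)

def hudson_fst(p1, n1, p2, n2):
    """Hudson's per-site FST estimator from allele frequencies
    p1, p2 and haploid sample sizes n1, n2."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den

print(site_dxy(0.9, 0.1))
print(hudson_fst(0.95, 100, 0.05, 100))
```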

Tajima's D

bash
# Calculate Tajima's D for each population
for pop in $(cat pops.txt); do
    vcftools --gzvcf BSF.vcf.gz \
             --keep ${pop}.txt \
             --TajimaD 100000 \
             --out tajimasD_${pop}
done
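VCFtools reports Tajima's D per window; the statistic itself compares pairwise diversity with Watterson's estimator, scaled by its variance. A self-contained Python sketch of the Tajima (1989) formula:

```python
import math

def tajimas_d(n, s, pi):
    """Tajima's D from n sampled sequences, S segregating sites,
    and pi (mean pairwise differences, in counts)."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

# when pi equals Watterson's estimate S/a1, D is zero by construction
a1 = sum(1 / i for i in range(1, 10))
print(tajimas_d(10, 20, 20 / a1))
```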

🧮 Part 5: Heterozygosity Analysis

ANGSD-based Heterozygosity

bash
# Compute SAF files with ANGSD
angsd -i sample.bam -anc reference.fa -ref reference.fa \
      -C 50 -minQ 20 -dosaf 1 -GL 1 -nThreads 4 \
      -out sample_angsd

# Estimate site frequency spectrum
realSFS sample_angsd.saf.idx -P 1 > sample_est.ml
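For a single diploid, the three numbers in sample_est.ml are the estimated counts of homozygous-ancestral, heterozygous, and homozygous-derived sites, so genome-wide heterozygosity is the middle value over the total. A one-function Python sketch (the example values are made up for illustration):

```python
def heterozygosity(sfs):
    """Heterozygosity from a single-individual SFS: three values
    (hom-ancestral, heterozygous, hom-derived site counts)."""
    return sfs[1] / sum(sfs)

# hypothetical sample_est.ml values
print(heterozygosity([9.5e8, 2.0e6, 4.8e7]))
```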

🔄 Part 6: Runs of Homozygosity and Inbreeding

bash
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF

# Detect ROH for each population
for pop in $(cat pops.txt); do
    echo "Processing ROH for population: $pop"
    plink-1.9-rc --bfile BSF \
          --keep <(awk -v p="$pop" '$2==p {print $1, $1}' sample_pop.txt) \
          --homozyg \
          --homozyg-window-het 3 \
          --homozyg-window-missing 20 \
          --out "${pop}_roh"
done
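ROH segments are commonly summarized as F_ROH, the fraction of the genome lying in runs of homozygosity. A Python sketch that sums the KB column of PLINK's .hom output against an assumed genome length (the example values are hypothetical):

```python
def f_roh(roh_kb, genome_bp):
    """F_ROH: summed ROH length (the KB column of PLINK's .hom
    file, converted to bp) over the autosomal genome length."""
    return sum(kb * 1000 for kb in roh_kb) / genome_bp

# hypothetical: ROH of 1.2 Mb and 800 kb on a 1 Gb genome
print(f_roh([1200.0, 800.0], 1_000_000_000))
```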

👥 Part 7: Relatedness Analysis

bash
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF

# Calculate relatedness using KING
king -b BSF.bed --related
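KING's kinship coefficients map onto relationship degrees via the commonly cited thresholds from the KING documentation. A small Python helper encoding that lookup, useful when annotating the kinship output:

```python
def relationship(kinship):
    """Classify a KING kinship coefficient into a relationship
    degree using the standard KING thresholds."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"

print(relationship(0.25))
```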

🔗 Part 8: Linkage Disequilibrium and Effective Population Size

Linkage Disequilibrium Decay

bash
# Calculate LD decay for each population
for pop in $(cat pops.txt); do
    PopLDdecay -InVCF ${pop}.vcf.gz \
               -OutStat LD_${pop} \
               -MaxDist 500 \
               -MAF 0.05
done
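The r² statistic that PopLDdecay averages is defined from two-locus haplotype frequencies. A minimal Python illustration for a pair of biallelic loci with phased haplotype counts:

```python
def r_squared(hap_counts):
    """LD r^2 for two biallelic loci from phased haplotype counts,
    given as {('A', 'B'): n, ...} with alleles A/a and B/b."""
    total = sum(hap_counts.values())
    p_ab = hap_counts.get(("A", "B"), 0) / total
    p_a = sum(v for (a, _), v in hap_counts.items() if a == "A") / total
    p_b = sum(v for (_, b), v in hap_counts.items() if b == "B") / total
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# complete association: only AB and ab haplotypes observed
print(r_squared({("A", "B"): 60, ("a", "b"): 40}))
```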

Effective Population Size (Ne)

bash
# Estimate Ne using SNeP
SNeP1.1 -ped population.ped -map population.map \
        -phased -out population_Ne -threads 8
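SNeP's Ne estimates rest on the classic relationship between expected r² and Ne at a given recombination distance. A bare-bones Python sketch of Sved's (1971) approximation (SNeP itself adds sample-size and mutation-model corrections):

```python
def ne_from_r2(mean_r2, c_morgans):
    """Effective population size from mean r^2 at recombination
    distance c (Morgans), via Sved (1971):
    E[r^2] ~ 1 / (1 + 4 * Ne * c)."""
    return (1 / mean_r2 - 1) / (4 * c_morgans)

# hypothetical: mean r^2 of 0.05 between SNPs ~1 cM apart
print(ne_from_r2(0.05, 0.01))
```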

✅ Conclusion

This tutorial has covered an end-to-end workflow for analyzing the genetic diversity, population structure, and demographic history of Black Soldier Fly populations from whole-genome sequencing data.

Key Takeaways
  • Robust pipeline for processing NGS data from raw reads to variants
  • Multiple approaches for population structure analysis
  • Comprehensive genetic diversity metrics
  • Tools for breeding program design and conservation planning
  • Scalable analyses applicable to other non-model species
Contact & Support

For questions, suggestions, or contributions, please contact:

Peter Muchina - kimanimuchina@gmail.com

Zexi Cai - zexi.cai@qgg.au.dk

Citation

If you use these scripts or workflows in your research, please cite our work appropriately.