A comprehensive guide to population genomics analysis of the Black Soldier Fly
This tutorial provides a complete workflow for analyzing the genetic diversity, population structure, and demographic history of wild and captive Black Soldier Fly (Hermetia illucens) populations using whole-genome sequencing (WGS) data.
A population genomics study of the Black Soldier Fly integrates laboratory work, high-throughput sequencing, and computational analysis. The goal is to understand patterns of genetic diversity, population structure, and evolutionary history across wild and captive populations.
Collect individuals from multiple geographic locations to capture global genetic variation, including both wild and captive colonies. Sampling across populations allows for inference of population structure, admixture, and demography.
The computational workflow is broken down into the sections that follow.
This section demonstrates how to analyze WGS data from raw FASTQ files to a variant call format (VCF) file containing SNPs and indels.
The scripts assume the following directory structure. Adjust paths according to your needs:
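As a sketch, the numbered directories referenced throughout the scripts (1_data through 5_variant_calling) can be created up front:

```shell
# Create the working directories assumed by the scripts in this tutorial
mkdir -p 1_data 2_fastqc 3_mapping 4_processing 5_variant_calling
```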
Quality assessment of raw sequencing reads is the first critical step in any NGS analysis pipeline.
cd 2_fastqc
# Create soft links for all files in 1_data
for fastq in ../1_data/*.fastq.gz; do
ln -s "$fastq" .
done
# Run FastQC on the files
fastqc -t 4 *.fastq.gz
# Optional: Use GNU parallel for faster processing
parallel "fastqc {}" ::: *.fastq.gz
Tip: When there are many HTML reports, aggregate them into a single summary with MultiQC: multiqc .
Mapping aligns sequencing reads to a reference genome, enabling variant identification and structural analysis.
#!/bin/bash
cd 3_mapping
# Path to the reference genome
reference_genome="path/to/ref"
# Index the reference genome (only needs to be done once)
bwa index "$reference_genome"
echo "Job started at $(date '+%d_%m_%y_%H_%M_%S')"
# Loop through each forward-read file
for forward_file in ../1_data/*.R1.fastq.gz; do
# Get base filename (strip the .R1.fastq.gz suffix)
base_filename=$(basename -- "$forward_file" | sed 's/\.R1\.fastq\.gz$//')
# Generate the path for the matching reverse reads
reverse_file="../1_data/${base_filename}.R2.fastq.gz"
echo "Processing: ${base_filename}"
# Perform alignment, convert to BAM, and sort
bwa mem -M -t 8 -R "@RG\tID:${base_filename}\tPL:BGISEQ-500\tSM:${base_filename}" \
"$reference_genome" "$forward_file" "$reverse_file" | \
samtools view -b - | \
samtools sort --output-fmt BAM -@ 8 -o "${base_filename}.sorted.bam" -
# Index the sorted BAM file
samtools index "${base_filename}.sorted.bam"
echo "Done Processing: ${base_filename}"
done
echo "Mapping complete"
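The filename handling in the loop above can be checked in isolation; sampleA is a hypothetical sample name used only for illustration:

```shell
# Derive the sample name and the matching reverse-read path from a forward-read path
forward_file="../1_data/sampleA.R1.fastq.gz"
base_filename=$(basename -- "$forward_file" | sed 's/\.R1\.fastq\.gz$//')
reverse_file="../1_data/${base_filename}.R2.fastq.gz"
echo "$base_filename"   # sampleA
echo "$reverse_file"    # ../1_data/sampleA.R2.fastq.gz
```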
Mark (and, with REMOVE_DUPLICATES=true below, remove) PCR duplicates to avoid bias in variant calling.
#!/bin/bash
cd 4_processing
bam_directory="../3_mapping/"
output_directory="./"
# Create output directory if it doesn't exist
mkdir -p "$output_directory"
# Process each BAM file
for bam in "$bam_directory"*.bam; do
if [ -f "$bam" ]; then
base_name=$(basename "${bam%.*}")
echo "Processing ${bam}"
java -Xmx40G -jar picard.jar MarkDuplicates \
I="${bam}" \
O="${output_directory}${base_name}_markdup.bam" \
M="${output_directory}${base_name}_markdup_metrics.txt" \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT \
REMOVE_DUPLICATES=true \
ASSUME_SORT_ORDER=coordinate
fi
done
echo "Completed marking duplicates"
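The ${bam%.*} parameter expansion above strips the final extension before _markdup is appended; a quick check with a hypothetical filename:

```shell
# ${bam%.*} removes the trailing .bam; basename drops the directory part
bam="../3_mapping/sampleA.sorted.bam"
base_name=$(basename "${bam%.*}")
echo "$base_name"   # sampleA.sorted, so the output file is sampleA.sorted_markdup.bam
```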
FreeBayes is a Bayesian genetic variant detector designed to find SNPs, indels, and complex events.
#!/bin/bash
cd 5_variant_calling
reference_genome="path/to/ref"
# Create BAM list
readlink -f ../4_processing/*.bam > bamlist
# Run FreeBayes
freebayes \
--fasta-reference ${reference_genome} \
--bam-list bamlist > output.vcf
#!/bin/bash
# Run FreeBayes in parallel (20 processors)
reference_genome="path/to/ref"
freebayes-parallel \
<(fasta_generate_regions.py ${reference_genome}.fai 100000) 20 \
--fasta-reference ${reference_genome} \
--bam-list bamlist > output.vcf
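fasta_generate_regions.py splits each contig listed in the .fai index into fixed-size windows in chrom:start-end form. The same output can be approximated with awk, shown here on a hypothetical one-contig index (toy.fai):

```shell
# Build 100 kb regions from a toy .fai (columns: contig name, length, ...)
printf 'Chr1\t250000\n' > toy.fai
awk -v w=100000 '{for (s = 0; s < $2; s += w) {e = s + w; if (e > $2) e = $2; print $1 ":" s "-" e}}' toy.fai
```

This yields three windows for the 250 kb contig, with the last one truncated at the contig end.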
Apply quality filters to retain high-confidence variants.
# Initial filtering: missingness, MAC, and quality
vcftools --gzvcf BSF_renamed.vcf.gz \
--max-missing 0.5 \
--mac 3 \
--minQ 30 \
--recode \
--recode-INFO-all \
--stdout | bgzip -c > BSF_filtered.vcf.gz
# Remove individuals with high missing data
vcftools --gzvcf BSF_filtered.vcf.gz --missing-indv
mawk 'NR > 1 && $5 > 0.5' out.imiss | cut -f1 > lowDP.indv
vcftools --gzvcf BSF_filtered.vcf.gz \
--remove lowDP.indv \
--recode \
--recode-INFO-all \
--stdout | bgzip -c > BSF_filtered_individuals.vcf.gz
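The out.imiss parsing relies on F_MISS being column 5; the header row must be skipped, otherwise it can slip through awk's string comparison. A hypothetical toy table (toy.imiss) illustrates the filter:

```shell
# Column 5 (F_MISS) is the fraction of missing genotypes per individual
printf 'INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\nS1\t100\t0\t60\t0.6\nS2\t100\t0\t10\t0.1\n' > toy.imiss
awk 'NR > 1 && $5 > 0.5' toy.imiss | cut -f1   # S1
```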
# Keep only biallelic SNPs
bcftools view -m2 -M2 -v snps BSF_filtered_individuals.vcf.gz \
-Oz -o BSF_biallelic_snps.vcf.gz
# Filter on mean depth
vcftools --gzvcf BSF_biallelic_snps.vcf.gz \
--min-meanDP 5 \
--max-meanDP 25 \
--recode \
--recode-INFO-all \
--stdout | bgzip -c > BSF_depth_filtered.vcf.gz
# Final missingness filter
vcftools --gzvcf BSF_depth_filtered.vcf.gz \
--max-missing 0.95 \
--maf 0.001 \
--recode \
--recode-INFO-all \
--stdout | bgzip -c > BSF_final.vcf.gz
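After each filtering step it is worth counting how many records survive; VCF header lines start with #, so a simple grep suffices. The toy.vcf below is a hypothetical two-record file; for the gzipped outputs above, pipe through zcat first (e.g. zcat BSF_final.vcf.gz | grep -vc '^#').

```shell
# Count variant records (non-header lines) in a VCF
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nChr1\t100\nChr1\t200\n' > toy.vcf
grep -vc '^#' toy.vcf   # 2
```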
# Phasing and imputation are run per chromosome
VCF='Chr1.filtered.vcf.gz'
Chr='Chr1'
# Phase the VCF using Beagle
java -Xmx100g -jar beagle5.jar gt=${VCF} \
out=${Chr}_phased gp=true burnin=10 iterations=40 \
impute=false nthreads=40 chrom=${Chr}
# Index the phased VCF
bcftools index --threads 8 ${Chr}_phased.vcf.gz
# Build reference panel
QUILT2_prepare_reference.R \
--outputdir="$Out_dir" \
--chr="$Chr" \
--nGen=100 \
--regionStart="$regionStart" \
--regionEnd="$regionEnd" \
--buffer=500000 \
--reference_vcf_file="$VCF_file"
# Run imputation
QUILT.R \
--prepared_reference_filename="${Ref}QUILT_prepared_reference.${Chr}.${regionStart}.${regionEnd}.RData" \
--bamlist="${Bamlist}" \
--method=diploid \
--chr="$Chr" \
--regionStart="${regionStart}" \
--regionEnd="${regionEnd}" \
--buffer=500000 \
--output_filename="${Out_dir}/quilt.${Chr}.${regionStart}.${regionEnd}.vcf.gz" \
--nCores=1
PCA is a model-free method to explore population genetic structure.
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF_bed
# Quality control
plink-1.9-rc --bfile BSF_bed --geno 0.2 --mind 0.2 --maf 0.05 \
--make-bed --out BSF_qc
# LD pruning
plink-1.9-rc --bfile BSF_qc --indep-pairwise 50 10 0.2 --out BSF_pruned
# Run PCA
plink-1.9-rc --bfile BSF_qc --extract BSF_pruned.prune.in \
--pca --out BSF_pca
library(tidyverse)
# Read PCA results
plinkPCA <- read_table("BSF_pca.eigenvec", col_names = FALSE)
# First two columns are FID and IID; keep FID, assumed here to encode the population
pop <- plinkPCA[[1]]
plinkPCA <- plinkPCA[, c(-1, -2)]
EigenValue <- scan("BSF_pca.eigenval")
# Set column names
names(plinkPCA) <- paste0("PC", 1:ncol(plinkPCA))
plinkPCA$pop <- pop
# Calculate percentage variance explained
pve <- data.frame(PC = seq_along(EigenValue), pve = EigenValue / sum(EigenValue) * 100)
# Create PCA plot
ggplot(plinkPCA, aes(PC1, PC2, color = as.factor(pop))) +
geom_point(size = 3) +
coord_equal() +
theme_light() +
xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) +
ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))
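The percentage-variance calculation in the R code can also be checked with awk directly on the .eigenval file; toy.eigenval below is a hypothetical three-eigenvalue file:

```shell
# Percent variance explained per PC = eigenvalue / sum of eigenvalues * 100
printf '10\n5\n5\n' > toy.eigenval
awk '{v[NR] = $1; s += $1} END {for (i = 1; i <= NR; i++) printf "PC%d %.1f%%\n", i, v[i] / s * 100}' toy.eigenval
```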
# Convert VCF to phylip format using TASSEL
tassel-5.2.89-0/run_pipeline.pl -Xmx100G \
-SortGenotypeFilePlugin -inputFile BSF.vcf.gz \
-outputFile sorted.vcf -fileType VCF
tassel-5.2.89-0/run_pipeline.pl -Xmx100G \
-importGuess sorted.vcf \
-ExportPlugin -saveAs sequences.phy -format Phylip_Inter
# Build tree with IQ-TREE
iqtree -s sequences.phy -st DNA -m MFP -o outgroup_sample \
-bb 1000 --bnni -cmax 15 -nt AUTO
# Prepare PLINK files
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF --allow-extra-chr
# Fix chromosome names for ADMIXTURE
awk '{$1="0";print $0}' BSF.bim > BSF.bim.tmp
mv BSF.bim.tmp BSF.bim
# Run ADMIXTURE with cross-validation (K=2 to K=11)
for i in {2..11}
do
admixture --cv BSF.bed $i > log${i}.out
done
# Extract CV errors
awk '/CV error/ {gsub(/[^0-9]/, "", $3); print $3, $4}' log*.out > BSF.cv.error
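The K with the lowest cross-validation error is usually taken as the best-supported number of clusters; with a two-column "K error" file, a general-numeric sort picks it out (the values in cv.toy are invented for illustration):

```shell
# Pick the K with the lowest cross-validation error
printf '2 0.531\n3 0.502\n4 0.515\n' > cv.toy
sort -g -k2,2 cv.toy | head -n 1   # 3 0.502
```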
# Parse VCF for genomics_general
python3 genomics_general/parseVCF.py -i BSF.vcf.gz | bgzip > BSF.geno.gz
# Calculate diversity metrics in windows
python2 genomics_general/popgenWindows.py \
-g BSF.geno.gz \
-o BSF.diversity.csv.gz \
-f phased \
-w 100000 \
-m 100 \
-s 100000 \
-p pop1 -p pop2 -p pop3 \
--popsFile sample_pop.txt
# Calculate FST and DXY between populations
python2 genomics_general/popgenWindows.py \
-g BSF.geno.gz \
-o BSF.Fst.Dxy.csv.gz \
-f phased \
-w 50000 \
-m 50 \
-s 50000 \
$(awk '{print "-p", $1}' pops.txt) \
--popsFile sample_pop.txt
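The $(awk ...) command substitution above turns each population name in pops.txt into a -p flag; because the substitution is unquoted, the shell word-splits the output into separate arguments. Shown here on a hypothetical pops.toy:

```shell
# One "-p <name>" pair per population; unquoted $(...) expands these as separate flags
printf 'pop1\npop2\npop3\n' > pops.toy
awk '{print "-p", $1}' pops.toy
```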
# Calculate Tajima's D for each population
for pop in $(cat pops.txt); do
vcftools --gzvcf BSF.vcf.gz \
--keep ${pop}.txt \
--TajimaD 100000 \
--out tajimasD_${pop}
done
# Compute SAF files with ANGSD
angsd -i sample.bam -anc reference.fa -ref reference.fa \
-C 50 -minQ 20 -dosaf 1 -GL 1 -nThreads 4 \
-out sample_angsd
# Estimate site frequency spectrum
realSFS sample_angsd.saf.idx -P 1 > sample_est.ml
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF
# Detect ROH for each population
for pop in $(cat pops.txt); do
echo "Processing ROH for population: $pop"
plink-1.9-rc --bfile BSF \
--keep <(awk -v p="$pop" '$2==p {print $1, $1}' sample_pop.txt) \
--homozyg \
--homozyg-window-het 3 \
--homozyg-window-missing 20 \
--out "${pop}_roh"
done
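PLINK's --keep expects two columns (FID and IID); the process substitution in the ROH loop builds that on the fly from sample_pop.txt. The awk filter is shown here on a hypothetical toy mapping:

```shell
# Select samples belonging to one population and emit "FID IID" pairs
printf 'S1\tpopA\nS2\tpopB\nS3\tpopA\n' > popmap.toy
awk -v p="popA" '$2 == p {print $1, $1}' popmap.toy
```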
# Convert VCF to PLINK format
plink-1.9-rc --vcf BSF.vcf.gz --make-bed --out BSF
# Calculate relatedness using KING
king -b BSF.bed --related
# Calculate LD decay for each population
for pop in $(cat pops.txt); do
PopLDdecay -InVCF ${pop}.vcf.gz \
-OutStat LD_${pop} \
-MaxDist 500 \
-MAF 0.05
done
# Estimate Ne using SNeP
SNeP1.1 -ped population.ped -map population.map \
-phased -out population_Ne -threads 8
This tutorial has walked through an end-to-end workflow for analyzing the genetic diversity, population structure, and demographic history of Black Soldier Fly populations from whole-genome sequencing data.
For questions, suggestions, or contributions, please contact:
Peter Muchina - kimanimuchina@gmail.com
Zexi Cai - zexi.cai@qgg.au.dk
If you use these scripts or workflows in your research, please cite our work appropriately.