--- title: "Benchmarking genomic prediction models with GSbench" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Benchmarking genomic prediction models with GSbench} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5) ``` GSbench fits and compares genomic-selection models — GBLUP, ridge marker effects, and machine-learning methods — through one interface, using breeding-relevant cross-validation. This vignette walks through a full workflow. ```{r setup} library(GSbench) ``` ## A simulated dataset For a reproducible example we simulate a population: 300 lines, 2000 SNPs, 50 QTL, narrow-sense heritability 0.5. ```{r sim} sim <- simulate_population(n = 300, m = 2000, n_qtl = 50, h2 = 0.5, seed = 1) dim(sim$geno) sim$h2 # realised heritability ``` ## Quality control Filter low-call-rate, low-MAF and monomorphic markers, and impute the rest. ```{r qc} qc <- qc_markers(sim$geno, maf = 0.05, max_missing = 0.1) qc$removed geno <- qc$geno ``` ## GBLUP and the genomic relationship matrix `gblup()` estimates variance components by REML. (Its GEBVs match `rrBLUP::mixed.solve` to within `6e-5` — see the package tests.) ```{r gblup} fit <- gblup(sim$pheno, geno) fit cor(fit$gebv, sim$bv) # accuracy against the true breeding values ``` Because `geno` was supplied, the fit also carries ridge marker effects, so it can predict new genotypes: ```{r predict} train <- 1:240; test <- 241:300 fit_tr <- gs_fit(sim$pheno[train], geno[train, ], model = "gblup") pred <- predict(fit_tr, geno[test, ]) cor(pred, sim$bv[test]) ``` ## One interface for many models `gs_fit()` exposes the same API for every model family. Which are available depends on which optional packages you have installed: ```{r available} available_models() ``` ```{r models, eval = requireNamespace("glmnet", quietly = TRUE)} en <- gs_fit(sim$pheno[train], geno[train, ], model = "elastic_net") cor(predict(en, geno[test, ]), sim$bv[test]) ``` ## Cross-validation, done correctly `gs_cv()` runs breeding-relevant cross-validation. Random k-fold predicts untested lines (CV1); `leave_group_out` holds out whole families or environments. ```{r cv} cv <- gs_cv(sim$pheno, geno, models = "gblup", k = 5, reps = 1, seed = 1) cv # leave-one-family-out, with a toy family structure fam <- rep(1:6, length.out = nrow(geno)) gs_cv(sim$pheno, geno, models = "gblup", scheme = "leave_group_out", groups = fam) ``` ## A stacked ensemble `gs_ensemble()` combines base models into a super-learner: each model's out-of-fold predictions are used to learn non-negative weights (summing to one), and the final predictor is that weighted blend. ```{r ensemble} ens <- gs_ensemble(sim$pheno, geno, base_models = "gblup", seed = 1) ens$weights ``` With several base models installed, the weights show how much each contributes. ## Benchmarking everything at once `gs_benchmark()` runs every available model — including the ensemble — under one cross-validation and reports predictive ability so you can compare them fairly. ```{r benchmark} bench <- gs_benchmark(sim$pheno, geno, models = c("gblup", "ensemble"), k = 5, seed = 1) bench$accuracy ``` ```{r benchplot} plot(bench) ``` For an additive trait like this one, GBLUP and elastic net are typically strongest; tree methods trail; the ensemble lands near the top without you having to pick the winner in advance.