| Title: | Benchmarking Genomic Selection and Machine-Learning Prediction Models |
|---|---|
| Description: | A unified interface to fit, cross-validate and benchmark genomic prediction models from SNP marker data. It implements genomic best linear unbiased prediction (GBLUP) and ridge-regression BLUP in base R, and offers a common interface to machine-learning predictors (elastic net, random forest and gradient boosting) through optional packages, together with a stacked ensemble. Cross-validation uses breeding-relevant schemes and reports prediction accuracy honestly, so models can be compared fairly. The genomic relationship matrix follows VanRaden (2008) <doi:10.3168/jds.2007-0980>; the mixed-model solver follows Endelman (2011) <doi:10.3835/plantgenome2011.08.0024>; the genomic-selection framework follows Meuwissen, Hayes and Goddard (2001) <doi:10.1093/genetics/157.4.1819>. |
| Authors: | Muhammad Farooqi [aut, cre] |
| Maintainer: | Muhammad Farooqi <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-25 08:30:18 UTC |
| Source: | https://github.com/mqfarooqi1/GSbench |
Returns the genomic-prediction models GSbench can currently run. "gblup" is
always available; the machine-learning models require their (suggested)
package to be installed.
available_models()available_models()
A character vector of usable model names.
available_models()available_models()
Fits the genomic best linear unbiased prediction model
y = X b + g + e, with g ~ N(0, G s2u) and e ~ N(0, I s2e), estimating
the variance components by restricted maximum likelihood (REML). The solver
uses the spectral / eigendecomposition approach of Endelman (2011) (the same
method as rrBLUP::mixed.solve): the REML log-likelihood is profiled to a
one-dimensional optimisation over the variance ratio lambda = s2e / s2u.
gblup(y, geno = NULL, K = NULL, X = NULL)gblup(y, geno = NULL, K = NULL, X = NULL)
y |
Numeric phenotype vector (length n), no missing values. |
geno |
Optional marker matrix (n x m, coded 0/1/2). Used to build |
K |
Optional n x n genomic relationship matrix. If |
X |
Optional fixed-effects design matrix (n x p). Defaults to an intercept. |
When geno is supplied, the equivalent ridge-regression marker effects are
also returned, so the fit can predict breeding values for new genotypes via
predict.gblup().
An object of class gblup: a list with Vu, Ve, h2, beta
(fixed effects), gebv (length-n GEBVs), lambda, K, and — when geno
was given — marker_effects, marker_means (2p) and intercept.
Endelman, J. B. (2011) "Ridge regression and other kernels for genomic selection with R package rrBLUP." The Plant Genome 4, 250-255. doi:10.3835/plantgenome2011.08.0024
sim <- simulate_population(n = 120, m = 500, seed = 1) fit <- gblup(sim$pheno, sim$geno) fit$h2 cor(fit$gebv, sim$bv)sim <- simulate_population(n = 120, m = 500, seed = 1) fit <- gblup(sim$pheno, sim$geno) fit$h2 cor(fit$gebv, sim$bv)
Builds the additive genomic relationship matrix from allele dosages using
VanRaden's first method: with markers centred by twice the allele frequency,
G = W W' / (2 * sum(p (1 - p))). This is the standard additive GRM used in
GBLUP.
Gmatrix(geno, min_maf = 0)Gmatrix(geno, min_maf = 0)
geno |
A numeric marker matrix (individuals x markers), coded 0/1/2,
with no missing values (run |
min_maf |
Markers with minor allele frequency below this are dropped before building the matrix. Default 0 (keep all supplied markers). |
An n x n symmetric genomic relationship matrix with the row/column
names taken from rownames(geno).
VanRaden, P. M. (2008) "Efficient methods to compute genomic predictions." Journal of Dairy Science 91, 4414-4423. doi:10.3168/jds.2007-0980
sim <- simulate_population(n = 80, m = 400, seed = 1) G <- Gmatrix(sim$geno) dim(G)sim <- simulate_population(n = 80, m = 400, seed = 1) G <- Gmatrix(sim$geno) dim(G)
Convenience wrapper around gs_cv() that evaluates every model available in
the session (including the stacked "ensemble") under one cross-validation,
so they can be compared on an equal footing. Returns a gs_cv object; use
plot() to draw the comparison.
gs_benchmark( y, geno, models = available_models(), k = 5, reps = 1, scheme = c("kfold", "leave_group_out"), groups = NULL, seed = NULL, ... )gs_benchmark( y, geno, models = available_models(), k = 5, reps = 1, scheme = c("kfold", "leave_group_out"), groups = NULL, seed = NULL, ... )
y |
Numeric phenotype vector (length n), no missing values. |
geno |
Marker matrix (n x m, 0/1/2, no missing values). |
models |
Models to evaluate; defaults to all available (see
|
k |
Number of folds for |
reps |
Number of repeats for |
scheme |
|
groups |
Grouping vector (length n) for |
seed |
Optional seed (applied via |
... |
Passed to |
A gs_cv object (see gs_cv()).
sim <- simulate_population(n = 120, m = 400, seed = 1) bench <- gs_benchmark(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1) bench$accuracysim <- simulate_population(n = 120, m = 400, seed = 1) bench <- gs_benchmark(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1) bench$accuracy
Runs a breeding-relevant cross-validation and reports predictive ability
(the correlation between predictions and observed phenotypes on held-out
individuals) for each model. Two schemes are supported:
"kfold" (random k-fold, i.e. predicting untested lines, repeated reps
times) and "leave_group_out" (hold out one family/environment at a time,
via groups).
gs_cv( y, geno, models = available_models(), k = 5, reps = 1, scheme = c("kfold", "leave_group_out"), groups = NULL, seed = NULL, ... )gs_cv( y, geno, models = available_models(), k = 5, reps = 1, scheme = c("kfold", "leave_group_out"), groups = NULL, seed = NULL, ... )
y |
Numeric phenotype vector (length n), no missing values. |
geno |
Marker matrix (n x m, 0/1/2, no missing values). |
models |
Models to evaluate; defaults to all available (see
|
k |
Number of folds for |
reps |
Number of repeats for |
scheme |
|
groups |
Grouping vector (length n) for |
seed |
Optional seed (applied via |
... |
Passed to |
Fitting is wrapped so that a model which errors on a fold records NA for
that fold rather than aborting the whole run.
An object of class gs_cv: a list with accuracy (a data frame of
model, mean, sd, n_folds), per_fold (the raw fold results), and
the call settings.
sim <- simulate_population(n = 120, m = 400, seed = 1) cv <- gs_cv(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1) cvsim <- simulate_population(n = 120, m = 400, seed = 1) cv <- gs_cv(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1) cv
Combines several base models into one predictor by stacking: each base model's out-of-fold cross-validated predictions are used to learn a set of non-negative weights (constrained to sum to one), and the final prediction is that weighted average of the base models refit on all the data. This is the Breiman / van der Laan stacked-regression (super-learner) idea applied to genomic selection; in practice it tends to match or beat the best single model without having to know in advance which that is.
gs_ensemble(y, geno, base_models = NULL, inner_k = 5, seed = NULL, ...)gs_ensemble(y, geno, base_models = NULL, inner_k = 5, seed = NULL, ...)
y |
Numeric phenotype vector (length n), no missing values. |
geno |
Marker matrix (n x m, 0/1/2, no missing values). |
base_models |
Character vector of base model names. Defaults to every available model except the ensemble itself. |
inner_k |
Folds for the inner stacking cross-validation. Default 5. |
seed |
Optional seed for the inner folds (via |
... |
Passed to |
An object of class gs_ensemble (and gs_model): a list with
base_names, weights (named, summing to 1), the refit base_fits, and
the out-of-fold prediction matrix oof.
van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2007) "Super Learner." Statistical Applications in Genetics and Molecular Biology 6, Article 25. doi:10.2202/1544-6115.1309
sim <- simulate_population(n = 100, m = 300, seed = 1) ens <- gs_ensemble(sim$pheno, sim$geno, base_models = "gblup", seed = 1) ens$weightssim <- simulate_population(n = 100, m = 300, seed = 1) ens <- gs_ensemble(sim$pheno, sim$geno, base_models = "gblup", seed = 1) ens$weights
One interface for several model families: "gblup" (always available),
"elastic_net" (glmnet), "random_forest" (ranger),
"xgboost" (xgboost), and "ensemble" (a stacked super-learner; see
gs_ensemble()). The returned object has a predict()
method that takes a new marker matrix.
gs_fit(y, geno, model = "gblup", ...)gs_fit(y, geno, model = "gblup", ...)
y |
Numeric phenotype vector (length n), no missing values. |
geno |
Marker matrix (n x m, coded 0/1/2, no missing values). |
model |
Model name; see |
... |
Model-specific hyperparameters (e.g. |
An object of class gs_model wrapping the fitted model.
sim <- simulate_population(n = 120, m = 400, seed = 1) fit <- gs_fit(sim$pheno, sim$geno, model = "gblup") head(predict(fit, sim$geno))sim <- simulate_population(n = 120, m = 400, seed = 1) fit <- gs_fit(sim$pheno, sim$geno, model = "gblup") head(predict(fit, sim$geno))
Replaces missing values in each marker (column) with that marker's mean dosage (i.e. twice the estimated allele frequency). Simple and fast; for production work a model-based imputation upstream is preferable.
impute_markers(geno)impute_markers(geno)
geno |
A numeric marker matrix (individuals x markers), 0/1/2, possibly
with |
The matrix with NAs filled by column means. Columns that are
entirely missing are filled with 0.
g <- matrix(c(0, 1, NA, 2, 2, 0), nrow = 3) impute_markers(g)g <- matrix(c(0, 1, NA, 2, 2, 0), nrow = 3) impute_markers(g)
Bar chart of mean predictive ability per model, with +/- 1 SD whiskers across folds.
## S3 method for class 'gs_cv' plot(x, ...)## S3 method for class 'gs_cv' plot(x, ...)
x |
A |
... |
Passed to |
x, invisibly.
sim <- simulate_population(n = 120, m = 400, seed = 1) plot(gs_cv(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1))sim <- simulate_population(n = 120, m = 400, seed = 1) plot(gs_cv(sim$pheno, sim$geno, models = "gblup", k = 5, seed = 1))
Predict breeding values for new genotypes from a GBLUP fit
## S3 method for class 'gblup' predict(object, newgeno, ...)## S3 method for class 'gblup' predict(object, newgeno, ...)
object |
A |
newgeno |
Marker matrix for new individuals (markers must match the training markers, in the same order). |
... |
Ignored. |
A numeric vector of predicted values (intercept + marker effects),
one per row of newgeno.
sim <- simulate_population(n = 150, m = 400, seed = 1) fit <- gblup(sim$pheno[1:120], sim$geno[1:120, ]) predict(fit, sim$geno[121:150, ])sim <- simulate_population(n = 150, m = 400, seed = 1) fit <- gblup(sim$pheno[1:120], sim$geno[1:120, ]) predict(fit, sim$geno[121:150, ])
Predict from a fitted genomic prediction model
## S3 method for class 'gs_ensemble' predict(object, newgeno, ...) ## S3 method for class 'gs_model' predict(object, newgeno, ...)## S3 method for class 'gs_ensemble' predict(object, newgeno, ...) ## S3 method for class 'gs_model' predict(object, newgeno, ...)
object |
A |
newgeno |
Marker matrix for the individuals to predict (same markers as training). |
... |
Ignored. |
A numeric vector of predictions, one per row of newgeno.
sim <- simulate_population(n = 120, m = 400, seed = 1) fit <- gs_fit(sim$pheno[1:90], sim$geno[1:90, ], model = "gblup") predict(fit, sim$geno[91:120, ])sim <- simulate_population(n = 120, m = 400, seed = 1) fit <- gs_fit(sim$pheno[1:90], sim$geno[1:90, ], model = "gblup") predict(fit, sim$geno[91:120, ])
Drops markers failing a call-rate or minor-allele-frequency threshold, and (optionally) monomorphic markers, then imputes any remaining missing values.
qc_markers(geno, maf = 0.05, max_missing = 0.1, impute = TRUE)qc_markers(geno, maf = 0.05, max_missing = 0.1, impute = TRUE)
geno |
A numeric marker matrix (individuals x markers), coded 0/1/2. |
maf |
Minimum minor allele frequency to keep a marker. Default 0.05. |
max_missing |
Maximum fraction of missing calls to keep a marker. Default 0.1. |
impute |
Whether to mean-impute remaining missing values. Default |
A list with geno (the filtered, optionally imputed matrix) and
removed (a named integer vector counting markers dropped by each rule).
sim <- simulate_population(n = 50, m = 200, seed = 1) qc <- qc_markers(sim$geno) dim(qc$geno) qc$removedsim <- simulate_population(n = 50, m = 200, seed = 1) qc <- qc_markers(sim$geno) dim(qc$geno) qc$removed
Generates a small marker matrix and a phenotype with a known narrow-sense heritability, for examples, tests and demonstrations. Markers are coded as allele dosages (0, 1, 2). A random subset of markers are QTL with additive effects; the phenotype is the resulting breeding value plus normal noise scaled to the target heritability.
simulate_population(n = 200, m = 1000, n_qtl = 50, h2 = 0.5, seed = NULL)simulate_population(n = 200, m = 1000, n_qtl = 50, h2 = 0.5, seed = NULL)
n |
Number of individuals. Default 200. |
m |
Number of markers. Default 1000. |
n_qtl |
Number of markers with a non-zero (QTL) effect. Default 50. |
h2 |
Target narrow-sense heritability in (0, 1]. Default 0.5. |
seed |
Optional integer seed for reproducibility. Applied with
|
A list with geno (an n x m 0/1/2 matrix, row/column names set),
pheno (length-n numeric), bv (true breeding values), qtl (indices of
the causal markers) and h2 (the realised heritability).
sim <- simulate_population(n = 100, m = 300, seed = 1) dim(sim$geno) length(sim$pheno)sim <- simulate_population(n = 100, m = 300, seed = 1) dim(sim$geno) length(sim$pheno)
Summary of a cross-validation / benchmark
## S3 method for class 'gs_cv' summary(object, ...)## S3 method for class 'gs_cv' summary(object, ...)
object |
A |
... |
Ignored. |
The accuracy data frame, invisibly.
sim <- simulate_population(n = 100, m = 300, seed = 1) summary(gs_cv(sim$pheno, sim$geno, models = "gblup", seed = 1))sim <- simulate_population(n = 100, m = 300, seed = 1) summary(gs_cv(sim$pheno, sim$geno, models = "gblup", seed = 1))