Background

This R tutorial illustrates the concept of “Reject-to-Remeasure” through simple code implementations. It accompanies the paper “Rejections Based on Predictive Uncertainty Enable Reliable Routine Soil Spectroscopy.” To recap, the aim is to predict soil properties (here, clay in g kg⁻¹) from visible near-infrared spectroscopy (VNIRS). While VNIRS predictions cost only a fraction of conventional laboratory measurements, the quality of these predictions can vary drastically between samples, depending on how well the soil is represented in the training library. Using probabilistic modelling, we can quantify the predictive uncertainty and flag potentially inaccurate predictions. These predictions can then be rejected and remeasured using reference laboratory methods. This establishes a cost–accuracy trade-off, whereby expensive laboratory analyses are reserved for cases where large prediction errors are expected.

The full dataset comprises 793 soil samples. For demonstration purposes, however, we focus on four selected spectra for which clay content is predicted:

# Select four example samples (two with low and two with high uncertainty)
selected_examples <- c(576, 506,243,229) 

# Manipulate data table for plotting spectra examples
spectra_long <- data_full %>%
  select(ID, all_of(grep("^wl_", names(data_full), value = TRUE))) %>%
  pivot_longer(
    cols = all_of(grep("^wl_", names(data_full), value = TRUE)),
    names_to = "wavelength",
    values_to = "reflectance",
    names_prefix = "wl_",
    names_transform = list(wavelength = as.numeric)
  ) %>%
  filter(ID %in% selected_examples)

# Plot the four example spectra
ggplot(spectra_long, aes(x = wavelength, y = reflectance * 100, colour = factor(ID))) +
  geom_line(linewidth = 1.2) +
  labs(
    x = "Wavelength (nm)",
    y = "Reflectance (%)",
    colour = "ID") +
  scale_colour_brewer(palette = "Dark2")+
  scale_x_continuous(expand = c(0.03, 0.03)) +
  scale_y_continuous(expand = c(0.03, 0.03)) +
  theme_bw() +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.position = "top")

Probabilistic Modelling

For reject-to-remeasure, we first need a probabilistic model, which can tell us what predictions are “potentially inaccurate”. For this, we used TabICLv2, a tabular foundation model based on in-context learning that provides probabilistic predictions. Instead of predicting a single clay value (point prediction), the model outputs a set of quantile predictions. In this tutorial, we use a quantile grid with levels \(\tau_i \in \{0.001, 0.005, 0.01, \ldots, 0.99, 0.995, 0.999\}\). See some example quantile predictions:

# Get quantile levels by manipulating column names
quantile_cols <- setdiff(names(pred_quantiles_ID), "ID")
quantile_tau <- as.numeric(gsub("q", "", quantile_cols))

# Select examples  quantiles
tau_subset <- c(0.001, 0.005, 0.01, 0.99, 0.995, 0.999) # example quantiles

# Example subset
quantile_table <- pred_quantiles_ID %>%
  filter(ID %in% selected_examples) %>%
  pivot_longer(
    cols = all_of(quantile_cols),
    names_to = "quantile",
    values_to = "value"
  ) %>%
  mutate(
    tau = as.numeric(gsub("q", "", quantile))
  ) %>%
  filter(tau %in% tau_subset) %>%
  select(ID, tau, value) %>%
  pivot_wider(
    names_from = ID,
    values_from = value
  ) %>%
  arrange(tau)

# Show
as.data.frame(quantile_table)

##     tau        229       506       243       576
## 1 0.001   12.85944  84.67957  65.69633  34.77934
## 2 0.005   79.50162 124.35324  81.67505  69.66004
## 3 0.010  107.02148 140.72612  87.14993  81.10527
## 4 0.990  800.72340 335.78632 195.61947 471.72940
## 5 0.995  878.00910 352.33902 214.45951 510.94543
## 6 0.999 1065.75790 396.65123 271.68568 612.98630

From the set of quantile predictions, the predictive distribution can be approximated. The predictive distribution reflects the model’s uncertainty about an individual prediction. If the quantile values are closely spaced, the predictive distribution is narrow, indicating low uncertainty. In contrast, widely spread quantile values result in a broader distribution, indicating high uncertainty.

# Manipulate data table for plotting predictive distributions
cdf_long <- pred_quantiles_ID %>%
  filter(ID %in% selected_examples) %>%
  pivot_longer(
    cols = -ID,
    names_to = "quantile",
    values_to = "value"
  ) %>%
  mutate(
    tau = as.numeric(gsub("q", "", quantile))
  )

# We add a 0% and 100% probability boundary. This is just for better visibility of the predictive distribution in the plot!
cdf_long <- bind_rows(
  cdf_long,
  
  # lower boundary
  cdf_long %>%
    distinct(ID) %>%
    mutate(value = 0, tau = 0),
  
  # upper boundary
  cdf_long %>%
    distinct(ID) %>%
    mutate(value = 1000, tau = 1)
  ) %>%
  arrange(ID, tau)


# Plot the four example predictive distributions
ggplot(cdf_long, aes(x = value, y = tau, colour = factor(ID))) +
  geom_step(linewidth = 1.2,alpha=0.7) +
  labs(
    x = "Clay (g kg⁻¹)",
    y = "F(y)",
    colour = "ID") +
  coord_cartesian(xlim = c(0, 850))+
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_colour_brewer(palette = "Dark2") +
  theme_bw()

The derived predictive distributions for 243 and 506 were relatively narrow (i.e., low uncertainty), whereas those for 229 and 576 were much broader (i.e., high uncertainty). However, this alone does not yet answer which predictions are “good enough” and which predictions should be remeasured using conventional laboratory methods. We require a formal criterion to determine whether to accept or reject a prediction.

Rejection option

To determine whether a prediction is acceptable, an error threshold (δ) must be defined. We therefore aim to reject predictions for which there is a non-negligible risk that the absolute error (AE) exceeds δ. The choice of δ is highly application-specific. For demonstration purposes, we assume that clay predictions should not have an AE greater than 80 g kg⁻¹ (i.e., δ(Y) = 80). While a scaled threshold was used in the paper (see Fig. 2), an absolute threshold is more convenient in the context of this tutorial.

Given the predictive distribution, we can quantify the probability that the absolute prediction error exceeds δ, that is, Pr(|Y − ŷ| > δ). In the plot below, the shaded regions illustrate these lower and upper tail probabilities.

# Define the AE threshold (δ) as 80 g kg⁻¹
delta <- 80 

# Manipulate data table for plotting predictive distribution with point prediction and tail probabilities
cdf_plot <- cdf_long %>%
  left_join(
    pred_ID %>% filter(ID %in% selected_examples),
    by = "ID"
  ) %>%
  mutate(
    lower_threshold = Clay_pred - delta,
    upper_threshold = Clay_pred + delta
  )

cdf_left <- cdf_plot %>%
  filter(value <= lower_threshold)

cdf_right <- cdf_plot %>%
  filter(value >= upper_threshold)

# Plot the four example predictive distributions with point prediction and tail probabilities
ggplot(cdf_plot, aes(x = value, y = tau)) +
  geom_ribbon(
    data = cdf_left,
    aes(ymin = 0, ymax = tau),
    fill = "darkred",
    alpha = 0.7) +
  
  geom_ribbon(
    data = cdf_right,
    aes(ymin = tau, ymax = 1),
    fill = "darkred",
    alpha = 0.7) +
  
  geom_step(linewidth = 1.3, colour = "black") +
  
  geom_vline(
    aes(xintercept = Clay_pred, linetype = "ŷ"),
    colour = "cadetblue",
    linewidth = 0.8) +
    geom_vline(
    aes(xintercept = lower_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 0.8) +
  
  geom_vline(
    aes(xintercept = upper_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 0.8)+
  
  scale_linetype_manual(
    values = c(
      "ŷ" = "solid",
      "y" = "solid",
      "ŷ ± 80" = "dotted")) +

  facet_wrap(~ ID, ncol = 2) +

  labs(
    x = "Clay (g kg⁻¹)",
    y = "F(y)"
  ) +

  coord_cartesian(xlim = c(0, 850))+
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +

  theme_bw() +
  theme(
    strip.text = element_text(size = 14),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.text = element_text(size = 12),
    legend.title=element_blank())

The column “Exceeding_Pr” below presents Pr(|Y − ŷ| > δ) in %:

# Obtain Pr(|Y − ŷ| > δ)
Exceedance_Probability <- pred_quantiles_ID %>%
  filter(ID %in% selected_examples) %>%
  rowwise() %>%
  mutate(
    Exceeding_Pr = {

      q_vals <- as.numeric(c_across(all_of(quantile_cols)))

      cdf_fun <- stepfun(q_vals, c(0, quantile_tau), right = TRUE)

      y_hat <- pred_ID$Clay_pred[pred_ID$ID == ID]

      lower_threshold <- y_hat - delta
      upper_threshold <- y_hat + delta

      (cdf_fun(lower_threshold) + (1 - cdf_fun(upper_threshold)))*100
    }
  ) %>%
  ungroup() %>%
  select(ID, Exceeding_Pr)

# Show
as.data.frame(Exceedance_Probability)

##    ID Exceeding_Pr
## 1 229         36.5
## 2 506          4.5
## 3 243          1.0
## 4 576         38.0

After estimating the probability that the AE exceeds the interval ŷ ± 80 (i.e., Pr(|Y − ŷ| > δ)), we also need to define the degree of risk (α) we are willing to accept that the prediction error may exceed δ, as some degree of uncertainty will always remain. In this study, we set a conventional value of α = 0.05, meaning that we only accept predictions for which the model assigns less than a 5% probability to exceeding the error threshold. Based on this criterion, the rejector function determines whether a prediction is accepted or rejected.

# Determine what risk we accept of exceeding the threshold
alpha <- 0.05 

# Get decision based on alpha
Exceedance_Probability_Decision <- Exceedance_Probability %>%
  mutate(
    Decision = ifelse(
      Exceeding_Pr <= alpha*100, #*100 to use in % connotation
      "Acceptance",
      "Rejection"
    )
  )

as.data.frame(Exceedance_Probability_Decision)

##    ID Exceeding_Pr   Decision
## 1 229         36.5  Rejection
## 2 506          4.5 Acceptance
## 3 243          1.0 Acceptance
## 4 576         38.0  Rejection

spectra_long_decision <- spectra_long %>%
  left_join(
    Exceedance_Probability_Decision,
    by = "ID"
  )

# Plot the decision for the spectra and predictive distribution 
spectra_decision <- ggplot(spectra_long_decision,
       aes(x = wavelength, y = reflectance * 100,
           colour = Decision, group = ID)) +
  
  geom_line(linewidth = 1.2) +
  
  labs(
    x = "Wavelength (nm)",
    y = "Reflectance (%)",
    colour = "Decision"
  ) +
  
  scale_colour_manual(
    values = c(
      "Acceptance" = "#1F4E79",   
      "Rejection" = "#FB6A4A"   
    )
  ) +
  
  theme_bw() +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.position = "right"
  )

cdf_long_decision <- cdf_long %>%
  left_join(
    Exceedance_Probability_Decision,
    by = "ID"
  )


cdf_decision <-ggplot(cdf_long_decision, aes(x = value, y = tau, colour = Decision, group = ID)) +
  geom_step(linewidth = 1.3, alpha = 0.9) +
  
  labs(
    x = "Clay (g kg⁻¹)",
    y = "F(y)",
    colour = "Decision"
  ) +
  
  coord_cartesian(xlim = c(0, 850)) +
  
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  
  scale_colour_manual(
    values = c(
      "Acceptance" = "#1F4E79",   
      "Rejection" = "#FB6A4A"    
    )
  ) +
  
  theme_bw() +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.position = "right"
  )

# Combine to single plot
egg::ggarrange(spectra_decision,cdf_decision,ncol=1)

While there are no obvious visual differences between the accepted and rejected spectra, the model arrives at very different levels of confidence. Methods for interpreting why certain spectra are rejected do exist, but exploring these is beyond the scope of this tutorial and study. For now, we focus on whether the rejector’s decisions are reliable, rather than arising purely by chance.

Rejector evaluation

Ideally, we would accept all predictions with an AE < δ (correct acceptance). However, it cannot be ruled out that some of these predictions are rejected (incorrect rejection). Conversely, we would like to reject all predictions with AE > δ (correct rejection), but since we allow a certain level of risk, defined by α, a small proportion of predictions may still be accepted despite having too large errors (incorrect acceptance). We now evaluate the four example predictive distributions, but this time in relation to the ground truth value:

# Evluate whether decision was "correct" or "incorrect"
Clay_selected <- Clay_ID %>%
  filter(ID %in% selected_examples)


cdf_plot_true <- cdf_plot %>%
  left_join(Clay_selected, by = "ID")


cdf_plot_true_labelled <- cdf_plot_true %>%
  mutate(
    ID_label = case_when(
      ID == 229 ~ "229 – Incorrect Rejection",
      ID == 243 ~ "243 – Correct Acceptance",
      ID == 506 ~ "506 – Correct Acceptance",
      ID == 576 ~ "576 – Correct Rejection",
      TRUE ~ as.character(ID)
    )
  )

# Plot the four example predictive distributions with point prediction, thresholds and ground truth
ggplot(cdf_plot_true_labelled, aes(x = value, y = tau)) +

  geom_step(linewidth = 1.2, colour = "black") +

  geom_vline(
    aes(xintercept = Clay_pred, linetype = "ŷ"),
    colour = "cadetblue",
    linewidth = 1.2,
    alpha=0.7
  ) +

  geom_vline(
    aes(xintercept = Clay, linetype = "y"),
    colour = "grey",
    linewidth = 1.2,
    alpha=0.7
  ) +

  geom_vline(
    aes(xintercept = lower_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 1.2
  ) +
  geom_vline(
    aes(xintercept = upper_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 1.2
  ) +

  facet_wrap(~ ID_label, ncol = 2)+

  labs(
    x = "Clay (g kg⁻¹)",
    y = "F(y)"
  ) +

  coord_cartesian(xlim = c(0, 850)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +

  scale_linetype_manual(
    values = c(
      "ŷ" = "solid",
      "y" = "solid",
      "ŷ ± 80" = "dotted"
    )
  ) +

  theme_bw() +
  theme(
    strip.text = element_text(size = 14),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.text = element_text(size = 12),
    legend.title = element_blank()
  )

For samples 243 and 506, the model expressed low uncertainty, and the true values (y) were indeed close to the predicted values (ŷ). In contrast, for sample 229 the model indicated high uncertainty, even though the observed AE was relatively small, resulting in an incorrect rejection. For sample 576, the AE exceeded 80, and the prediction was correctly rejected due to high uncertainty.

While these four examples are meant to examplify the different possible outcomes, We would need to evaluate all 793 predictions from our 10-fold cross validation, to get an understanding about the reliability of our rejector. Most importantly, we would like to have the rate of incorrect acceptances to be lower than α (i.e., 5%), among the accepted predictions.

# Compute probabilities of exceeding 80 g kg⁻¹ for all 793 predictions
Exceedance_Probability_full <- pred_quantiles_ID %>%
  rowwise() %>%
  mutate(
    p_value = {

      q_vals <- as.numeric(c_across(all_of(quantile_cols)))
      
      cdf_fun <- stepfun(q_vals, c(0, quantile_tau), right = TRUE)

      y_hat <- pred_ID$Clay_pred[pred_ID$ID == ID]

      lower_threshold <- y_hat - delta
      upper_threshold <- y_hat + delta

      (cdf_fun(lower_threshold) + (1 - cdf_fun(upper_threshold)))
    }
  ) %>%
  ungroup() %>%
  select(ID, p_value)

# Join true values, predictions and probabilities
eval_df <- pred_ID %>%
  
  rename(pred = Clay_pred) %>%
  
  left_join(
    data_full %>% select(ID, Clay),
    by = "ID"
  ) %>%
  
  mutate(
    true  = Clay * 10,
    error = abs(pred - true)
  ) %>%
  
  left_join(Exceedance_Probability_full, by = "ID")
  

eval_df <- eval_df %>%
  mutate(
    is_accepted = p_value <= alpha,

    is_wrong = error > delta
  )


# Evaluate the total number of predictions, the % of accepted samples, and the % of AE > δ for the whole prediction set and only for accepted samples after the rejector implementation 
n_total     <- nrow(eval_df)
n_accept    <- sum(eval_df$is_accepted)
n_wrong     <- sum(eval_df$is_wrong)
n_wrong_acc <- sum(eval_df$is_wrong & eval_df$is_accepted)


wrong_before_percent <- (n_wrong / n_total) * 100

wrong_after_percent <- (n_wrong_acc / n_accept) * 100

acceptance_rate_percent <- (n_accept / n_total) * 100

# Show
cat(
  "\n Reject-to-Remeasure Evaluation \n",
  "AE > δ (before rejector):  ", round(mean(eval_df$is_wrong) * 100, 2), "%\n",
  "AE > δ (after rejector):   ", round(mean(eval_df$is_wrong[eval_df$is_accepted]) * 100, 2), "%\n",
  "Acceptance rate:  ", round(mean(eval_df$is_accepted) * 100, 2), "%\n",
  sep = ""
)

## 
##  Reject-to-Remeasure Evaluation 
## AE > δ (before rejector):  7.19%
## AE > δ (after rejector):   0.94%
## Acceptance rate:  53.72%

As can be seen from these evaluation results, without using a rejector (i.e., acceptance rate = 100%), 7.19% of predictions had an AE larger than 80 g kg⁻¹. After implementing the rejector, only 4 samples (i.e., less than 1%) had an AE larger than 80 g kg⁻¹. See these four instances in the plot below:

# Make table for all predictions
cdf_long_full <- pred_quantiles_ID %>%
  pivot_longer(
    cols = -ID,
    names_to = "quantile",
    values_to = "value"
  ) %>%
  mutate(
    tau = as.numeric(gsub("q", "", quantile))
  )

cdf_long_full <- bind_rows(
  cdf_long_full,
  cdf_long_full %>%
    distinct(ID) %>%
    mutate(value = 0, tau = 0),
  cdf_long_full %>%
    distinct(ID) %>%
    mutate(value = 1000, tau = 1)
) %>%
  arrange(ID, tau)

# Get incorrect acceptances to filter them for plotting
false_acceptances <- eval_df %>%
  filter(is_accepted & is_wrong)

cdf_plot_Incorrect <- cdf_long_full %>%
  filter(ID %in% false_acceptances$ID) %>%
  left_join(
    false_acceptances %>%
      select(ID, pred, true),
    by = "ID"
  ) %>%
  mutate(
    lower_threshold = pred - delta,
    upper_threshold = pred + delta,
    ID_label = paste0(ID, " – Incorrect Acceptance")
  )

# Plot
ggplot(cdf_plot_Incorrect, aes(x = value, y = tau)) +
  geom_step(linewidth = 1.2, colour = "black") +

  geom_vline(
    aes(xintercept = pred, linetype = "ŷ"),
    colour = "cadetblue",
    linewidth = 1.2,
    alpha = 0.7
  ) +

  geom_vline(
    aes(xintercept = true, linetype = "y"),
    colour = "grey",
    linewidth = 1.2,
    alpha = 0.7
  ) +

  geom_vline(
    aes(xintercept = lower_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 1.2
  ) +

  geom_vline(
    aes(xintercept = upper_threshold, linetype = "ŷ ± 80"),
    colour = "darkred",
    linewidth = 1.2
  ) +

  facet_wrap(~ ID_label, ncol = 2) +

  labs(
    x = "Clay (g kg⁻¹)",
    y = "F(y)"
  ) +

  coord_cartesian(xlim = c(0, 850)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +

  scale_linetype_manual(
    values = c(
      "ŷ" = "solid",
      "y" = "solid",
      "ŷ ± 80" = "dotted"
    )
  ) +

  theme_bw() +
  theme(
    strip.text = element_text(size = 14),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 16),
    legend.text = element_text(size = 12),
    legend.title = element_blank()
  )

However, apart from these four examples, 422 samples have been correctly accepted (i.e. ~53% of total predictions). If we assume that the measurement costs of VisNIRS are approximately 10% of those of conventional laboratory analysis based on sedimentation, this indicates a clear potential to reduce costs while maintaining predefined quality standards, as specified by our α and δ:

# Evaluate decision
eval_df <- eval_df %>%
  mutate(
    outcome = case_when(
      is_accepted & !is_wrong ~ "Correct Acceptance",
      is_accepted &  is_wrong ~ "Incorrect Acceptance",
      !is_accepted & is_wrong ~ "Correct Rejection",
      !is_accepted &  !is_wrong ~ "Incorrect Rejection"
    )
  )

# Plot
Clay_Rejector_1to1 <- ggplot(
  eval_df,
  aes(x = true, y = pred, colour = outcome, shape = outcome)
) +

  geom_point(
    size = 2.8,
    stroke = 0.8,
    alpha = 0.8
  ) +

  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed",
    linewidth = 1
  ) +

  geom_abline(intercept = delta, slope = 1, linetype = "dotted") +
  geom_abline(intercept = -delta, slope = 1, linetype = "dotted") +

  scale_colour_manual(
    values = c(
      "Correct Acceptance"   = "#1F4E79",
      "Incorrect Acceptance" = "#1F4E79",
      "Correct Rejection"    = "#FB6A4A",
      "Incorrect Rejection"  = "#FB6A4A"
    )
  ) +

  scale_shape_manual(
    values = c(
      "Correct Acceptance"   = 16, 
      "Incorrect Acceptance" = 4,  
      "Correct Rejection"    = 16, 
      "Incorrect Rejection"  = 4   
    )
  ) +

  scale_x_continuous(
    limits = c(0, 1000),
    expand = c(0.009, 0.009)
  ) +
  scale_y_continuous(
    limits = c(0, 1000),
    expand = c(0.009, 0.009)
  ) +

  labs(
    x = "Measured Clay (g kg⁻¹)",
    y = "Predicted Clay (g kg⁻¹)",
    colour = "",
    shape = ""
  ) +

  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_text(size = 9, colour = "black"),
    axis.title = element_text(size = 14),
    legend.text = element_text(size = 9),
    legend.position = "top",
    panel.border = element_rect(colour = "black", linewidth = 0.5, fill = NA)
  ) +

  annotate(
    "text",
    x = -Inf,
    y = Inf,
    label = "δ(Y) = 80",
    hjust = -0.2,
    vjust = 1.5,
    size = 4
  )

Clay_Rejector_1to1

As shown in the plot, predictions for clay contents above 300 g kg⁻¹ were rejected in most cases, likely due to limited training data in this range. Improving the representation of higher clay contents in the training data would therefore likely enhance the predictive performance of spectral models. If we would retrain our model on rejected predictions, which were remeasured, there is a good chance the retrained model would give better predictions for high clay values in future applications.

For comparison, see below aluminium (Al) predictions with δ = 250 mg kg⁻¹, where the training data are well represented across the entire prediction range:

delta <- 250

# Compute exceedance probabilities
Exceedance_Probability_full <- pred_quantiles_ID_Al %>%
  rowwise() %>%
  mutate(
    p_value = {

      q_vals <- as.numeric(c_across(all_of(quantile_cols)))
      
      cdf_fun <- stepfun(q_vals, c(0, quantile_tau), right = TRUE)

      y_hat <- pred_ID_Al$Al_Meh3_pred[pred_ID_Al$ID == ID]

      lower_threshold <- y_hat - delta
      upper_threshold <- y_hat + delta

      (cdf_fun(lower_threshold) + (1 - cdf_fun(upper_threshold)))
    }
  ) %>%
  ungroup() %>%
  select(ID, p_value)

# Build evaluation table
eval_df <- pred_ID_Al %>%
  
  rename(pred = Al_Meh3_pred) %>%
  
  left_join(
    data_full %>% select(ID, Al_Meh3),
    by = "ID"
  ) %>%
  
  mutate(
    true  = Al_Meh3,
    error = abs(pred - true)
  ) %>%
  
  left_join(Exceedance_Probability_full, by = "ID") %>%
  
  mutate(
    is_accepted = p_value <= alpha,
    is_wrong    = error > delta
  ) %>%
  
  mutate(
    outcome = case_when(
      is_accepted & !is_wrong ~ "Correct Acceptance",
      is_accepted &  is_wrong ~ "Incorrect Acceptance",
      !is_accepted & is_wrong ~ "Correct Rejection",
      !is_accepted & !is_wrong ~ "Incorrect Rejection"
    )
  )

# Plot
Al_Rejector_1to1 <- ggplot(
  eval_df,
  aes(x = true, y = pred, colour = outcome, shape = outcome)
) +

  geom_point(
    size = 2.8,
    stroke = 0.8,
    alpha = 0.8
  ) +

  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed",
    linewidth = 1
  ) +

  geom_abline(intercept = delta, slope = 1, linetype = "dotted") +
  geom_abline(intercept = -delta, slope = 1, linetype = "dotted") +

  scale_colour_manual(
    values = c(
      "Correct Acceptance"   = "#1F4E79",
      "Incorrect Acceptance" = "#1F4E79",
      "Correct Rejection"    = "#FB6A4A",
      "Incorrect Rejection"  = "#FB6A4A"
    )
  ) +

  scale_shape_manual(
    values = c(
      "Correct Acceptance"   = 16,
      "Incorrect Acceptance" = 4,
      "Correct Rejection"    = 16,
      "Incorrect Rejection"  = 4
    )
  ) +

  scale_x_continuous(
    limits = c(0, 2500),
    expand = c(0.009, 0.009)
  ) +
  scale_y_continuous(
    limits = c(0, 2500),
    expand = c(0.009, 0.009)
  ) +

  labs(
    x = "Measured Al (mg kg⁻¹)",
    y = "Predicted Al (mg kg⁻¹)",
    colour = "",
    shape = ""
  ) +

  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_text(size = 9, colour = "black"),
    axis.title = element_text(size = 14),
    legend.text = element_text(size = 9),
    legend.position = "top",
    panel.border = element_rect(colour = "black", linewidth = 0.5, fill = NA)
  ) +

  annotate(
    "text",
    x = -Inf,
    y = Inf,
    label = "δ(Y) = 250",
    hjust = -0.2,
    vjust = 1.5,
    size = 4
  )

Al_Rejector_1to1

Reject-to-Remeasure: An R tutorial

Jonas Schmidinger

2026-05-08

Background

Probabilistic Modelling

Rejection option

Rejector evaluation

Cite & Contact