initial_parameter_optimization: Run Parameter Optimization Via Latin Hypercube Sampling

Description

Performs parameter optimization using Latin Hypercube Sampling (LHS) combined with k-fold cross-validation. Parameters are sampled from the specified ranges using a maximin LHS design to ensure good coverage of the parameter space, and each parameter set is evaluated by k-fold cross-validation to assess prediction accuracy.

To obtain one negative log-likelihood (NLL) per parameter set, the function pools the validation errors from all folds into a single set and calculates one NLL from it. This pooled-errors approach has two main advantages:
- It treats all validation errors equally, respecting the underlying error distribution assumption.
- It properly accounts for the total number of validation points.
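
The pooled calculation can be sketched as follows. This is a minimal illustration, not the package's internal code: the fold errors are made up, pooled_nll is a hypothetical helper, and a Laplace error distribution with its scale set to the pooled MAE is assumed here because this page does not name the distribution.

```r
# Hypothetical validation errors (predicted minus observed) from three folds
fold_errors <- list(
  c(0.5, -1.2, 0.3),
  c(2.0, -0.4),
  c(-0.7, 0.9, 1.1, -0.2)
)

# Pool all validation errors into a single vector before scoring
pooled_errors <- unlist(fold_errors)

# Hypothetical helper: NLL of the pooled errors under an ASSUMED Laplace
# distribution whose scale is its maximum-likelihood estimate (the MAE)
pooled_nll <- function(errors) {
  errors <- errors[!is.na(errors)]
  n <- length(errors)
  mae <- mean(abs(errors))
  n * (1 + log(2 * mae))
}

nll <- pooled_nll(pooled_errors)
```

Because every error enters the same sum, a fold with four validation points contributes four terms and a fold with two contributes two, rather than each fold counting once.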

Usage

initial_parameter_optimization(
  distance_matrix,
  mapping_max_iter = 1000,
  relative_epsilon,
  convergence_counter,
  scenario_name,
  N_min,
  N_max,
  k0_min,
  k0_max,
  c_repulsion_min,
  c_repulsion_max,
  cooling_rate_min,
  cooling_rate_max,
  num_samples = 20,
  max_cores = NULL,
  folds = 20,
  verbose = FALSE,
  write_files = FALSE,
  output_dir
)

Value

A data.frame containing the parameter sets and their performance metrics (Holdout_MAE and NLL). The columns of the data frame are N, k0, cooling_rate, c_repulsion, Holdout_MAE, and NLL. If write_files is TRUE, this data frame is also saved to a CSV file as a side effect.

Arguments

distance_matrix: Matrix or data frame. Input distance matrix. Must be square and symmetric. Can contain NA values for missing measurements.
mapping_max_iter: Integer. Maximum number of optimization iterations.
relative_epsilon: Numeric. Convergence threshold for relative change in error.
convergence_counter: Integer. Number of iterations below threshold before declaring convergence.
scenario_name: Character. Name for output files and job identification.
N_min, N_max: Integer. Range for number of dimensions parameter.
k0_min, k0_max: Numeric. Range for initial spring constant parameter.
c_repulsion_min, c_repulsion_max: Numeric. Range for repulsion constant parameter.
cooling_rate_min, cooling_rate_max: Numeric. Range for spring decay parameter.
num_samples: Integer. Number of LHS samples to generate (default: 20).
max_cores: Integer. Maximum number of cores to use for parallel processing. If NULL, uses all available cores minus 1 (default: NULL).
folds: Integer. Number of cross-validation folds. Default: 20.
verbose: Logical. Whether to print progress messages. Default: FALSE.
write_files: Logical. Whether to save results to CSV. Default: FALSE.
output_dir: Character. Directory where output files will be saved. Required if write_files is TRUE.

Details

The function performs these steps:

1. Generates LHS samples in parameter space.
2. Creates k-fold splits of the input data.
3. For each parameter set and fold:
   - Trains the model on the training set
   - Evaluates on the validation set
   - Calculates MAE and negative log-likelihood

Computations are run locally in parallel.
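
The fold-splitting step can be pictured with the base-R sketch below. It assumes that folds are formed over the observed pairwise distances and that a fold is held out by masking its entries with NA. This page does not spell out that mechanism, so treat it as one plausible reading. The helper make_training_matrix and the toy data are hypothetical.

```r
set.seed(1)

# Small symmetric distance matrix with one missing measurement
pts <- matrix(runif(12), nrow = 6)
d <- as.matrix(dist(pts))
d[2, 5] <- d[5, 2] <- NA

k <- 3  # number of folds for this toy example (the function's default is 20)

# Index each observed pair once, using the upper triangle
obs <- which(upper.tri(d) & !is.na(d), arr.ind = TRUE)

# Randomly assign each observed pair to one of k folds
fold_id <- sample(rep_len(seq_len(k), nrow(obs)))

# Hypothetical helper: build the training matrix for one fold by
# masking that fold's validation pairs
make_training_matrix <- function(d, obs, fold_id, fold) {
  held_out <- obs[fold_id == fold, , drop = FALSE]
  train <- d
  train[held_out] <- NA                        # mask entry (i, j)
  train[held_out[, 2:1, drop = FALSE]] <- NA   # mask (j, i) to keep symmetry
  train
}

train_1 <- make_training_matrix(d, obs, fold_id, fold = 1)
```

A model fitted to train_1 would then be scored on the masked pairs, and the resulting errors collected across all folds.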

Parameter ranges are transformed to log scale where appropriate to handle different scales effectively.
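
As an illustration of sampling on a log scale, the sketch below draws a simple random Latin hypercube in base R and maps it onto the ranges used in the example on this page. Several things here are assumptions: simple_lhs is a plain (non-maximin) stand-in for the maximin design the function uses, scale_log is a hypothetical helper, and the choice of which parameters are log-scaled (k0, c_repulsion, cooling_rate) versus linear (N) is a guess rather than documented behavior.

```r
set.seed(42)

# Simple (non-maximin) Latin hypercube: exactly one point per stratum
# in each column of the unit hypercube
simple_lhs <- function(n, k) {
  sapply(seq_len(k), function(j) (sample(n) - runif(n)) / n)
}

# Map unit-interval samples to [lo, hi], uniformly in log space
scale_log <- function(u, lo, hi) {
  exp(log(lo) + u * (log(hi) - log(lo)))
}

n_samples <- 4
u <- simple_lhs(n_samples, 4)

samples <- data.frame(
  N            = round(2 + u[, 1] * (5 - 2)),   # linear scale, rounded to integer
  k0           = scale_log(u[, 2], 1, 10),
  c_repulsion  = scale_log(u[, 3], 0.001, 0.05),
  cooling_rate = scale_log(u[, 4], 0.001, 0.02)
)
```

Sampling in log space spreads points evenly across orders of magnitude, so a range such as 0.001 to 0.05 is not dominated by values near its upper end.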

Examples

# \donttest{
# This example is wrapped in \donttest{} because it can exceed 5 seconds,
# ensuring it passes CRAN's automated checks.

# 1. Create a structured, synthetic dataset for the example
# Generate coordinates for a more realistic test case
synth_coords <- generate_complex_data(n_points = 20, n_dim = 3)
# Convert coordinates to a distance matrix
dist_mat <- coordinates_to_matrix(synth_coords)

# 2. Run the optimization on the synthetic data
results <- initial_parameter_optimization(
  distance_matrix = dist_mat,
  mapping_max_iter = 100,
  relative_epsilon = 1e-3,
  convergence_counter = 2,
  scenario_name = "test_opt_synthetic",
  N_min = 2, N_max = 5,
  k0_min = 1, k0_max = 10,
  c_repulsion_min = 0.001, c_repulsion_max = 0.05,
  cooling_rate_min = 0.001, cooling_rate_max = 0.02,
  num_samples = 4,
  max_cores = 2,
  verbose = FALSE
)
print(results)
# }