Fits a linear regression model using SQL aggregation on a remote database table. The data never leaves the database — only sufficient statistics (sums and cross-products) are returned to R.
lm_sql(formula, data, tol = 1e-07)An S7 object of class lm_sql_result, or a tibble with a
model list-column if the data is grouped.
A formula object (e.g., price ~ x + cut).
A tbl_sql object (from dbplyr).
Tolerance for detecting linear dependency.
The function computes the \(X^TX\) and \(X^Ty\) matrices entirely inside the database engine via a single SQL aggregation query, then solves the normal equations in R using Cholesky decomposition (falling back to Moore-Penrose pseudoinverse for rank-deficient designs).
Supported formula features:
Numeric and categorical (character/factor) predictors with automatic dummy encoding via `CASE WHEN`.
Interaction terms (`*` and `:`) including numeric × categorical and categorical × categorical cross-products.
Dot expansion (`y ~ .`) to all non-response columns.
Transforms: `I()`, `log()`, and `sqrt()` translated to SQL equivalents (`POWER`, `LN`, `SQRT`).
Date and datetime predictors automatically cast to numeric in SQL.
No-intercept models (`y ~ 0 + x`).
For grouped data (via [dplyr::group_by()]), a single `GROUP BY` query is executed and one model per group is returned in a tibble with a `model` list-column.
NA handling uses listwise deletion: rows with `NULL` in any model variable are excluded via a `WHERE ... IS NOT NULL` clause.