lm_sql: SQL-Backed Linear Regression

Description

Fits a linear regression model using SQL aggregation on a remote database table. The data never leaves the database — only sufficient statistics (sums and cross-products) are returned to R.

Usage

lm_sql(formula, data, tol = 1e-07)

Value

An S7 object of class lm_sql_result, or a tibble with a

model list-column if the data is grouped.

Arguments

formula: A formula object (e.g., price ~ x + cut).
data: A tbl_sql object (from dbplyr).
tol: Tolerance for detecting linear dependency.

Details

The function computes the \(X^TX\) and \(X^Ty\) matrices entirely inside the database engine via a single SQL aggregation query, then solves the normal equations in R using Cholesky decomposition (falling back to Moore-Penrose pseudoinverse for rank-deficient designs).

Supported formula features:

Numeric and categorical (character/factor) predictors with automatic dummy encoding via `CASE WHEN`.
Interaction terms (`*` and `:`) including numeric × categorical and categorical × categorical cross-products.
Dot expansion (`y ~ .`) to all non-response columns.
Transforms: `I()`, `log()`, and `sqrt()` translated to SQL equivalents (`POWER`, `LN`, `SQRT`).
Date and datetime predictors automatically cast to numeric in SQL.
No-intercept models (`y ~ 0 + x`).

For grouped data (via [dplyr::group_by()]), a single `GROUP BY` query is executed and one model per group is returned in a tibble with a `model` list-column.

NA handling uses listwise deletion: rows with `NULL` in any model variable are excluded via a `WHERE ... IS NOT NULL` clause.