
McSpatial (version 2.0)

cparlwr: Conditionally Parametric LWR Estimation

Description

Estimates a model of the form y = XB(z) + u, where z can include one or two variables. "Geographically weighted regression" is a special case in which z = (latitude, longitude) or some other measure of location.

Usage

cparlwr(form,nonpar,window=.25,bandwidth=0,kern="tcub", distance="Mahal",targetobs=NULL,data=NULL)

Arguments

form
Model formula
nonpar
List of either one or two variables for z. Formats: cparlwr(y~xlist, nonpar=~z1, ...) or cparlwr(y~xlist, nonpar=~z1+z2, ...). Important: note the "~" before the first z variable.
window
Window size. Default: 0.25.
bandwidth
Bandwidth. Default: 0 (not used; the window option determines the weights instead).
kern
Kernel weighting function. Default is the tri-cube. Options include "rect", "tria", "epan", "bisq", "tcub", "trwt", and "gauss".
distance
Options: "Euclid", "Mahal", or "Latlong" for Euclidean, Mahalanobis, or "great-circle" geographic distance. May be abbreviated to the first letter but must be capitalized. Note: cparlwr looks for the first two letters to determine which variable is latitude and which is longitude, so the data set must be attached first or specified using the data option; options like data$latitude will not work. Default: Mahal.
targetobs
If targetobs = NULL, the maketarget command is used to form the targets. If targetobs = "alldata", each observation is used as a target value for x. A set of target points can also be supplied directly by listing the observation numbers of the target data points. The observation numbers can be identified using the obs variable produced by the maketarget command.
data
A data frame containing the data. Default: use data in the current working environment.

Value

target
The target points for the original estimation of the function.
ytarget
The predicted values of y at the target values of z.
xcoef.target
Estimated coefficients, B(z), at the target values of z.
xcoef.target.se
Standard errors for B(z) at the target values of z.
yhat
Predicted values of y at the original data points.
xcoef
Estimated coefficients, B(z), at the original data points.
xcoef.se
Standard errors for B(z) with z evaluated at all points in the data set.
df1
tr(L), a measure of the degrees of freedom used in estimation.
df2
tr(L'L), an alternative measure of the degrees of freedom used in estimation.
sig2
Estimated residual variance, sig2 = rss/(n-2*df1+df2).
cv
Cross-validation measure. cv = mean(((y-yhat)/(1-infl))^2) , where yhat is the vector of predicted values for y and infl is the vector of diagonal terms for L.
gcv
gcv = n*(n*sig2)/((n-nreg)^2), where sig2 is the estimated residual variance and nreg = 2*df1 - df2.
infl
A vector containing the diagonal elements of L.

Details

The list of explanatory variables is specified in the base model formula while Z is specified using nonpar. The model formula does not have to include an intercept, making it suitable for repeat sales estimation as well as other models. X can include any number of explanatory variables, but Z must have at most two. cparlwr is equivalent to the lwr command when Z = X and the formula includes an intercept, with one exception: the explanatory variables are not centered on the target points, so the intercept does not provide a direct estimate of y. This affects the intercept and its standard errors but not the coefficients on the explanatory variables. It also means that $\hat{y} = \hat{\alpha} + X\hat{\beta}$ rather than just $\hat{\alpha}$. The estimated coefficient matrix, xcoef, provides estimates of the slopes at $z_0$, i.e., $B(z_0)$.

The estimated value of y at a target value $z_0$ is the predicted value from a weighted least squares regression of y on X with weights given by K. When Z includes a single variable, K is a simple kernel weighting function: $K((z - z_0)/(sd(z) h))$. When Z includes two variables (e.g., nonpar=~z1+z2), the method for specifying K depends on the distance option. Under either option, the ith row of the matrix Z = (z1, z2) is transformed so that $z_i = \sqrt{z_i V z_i^\top}$. Under the "Mahal" option, V is the inverse of cov(Z). Under the "Euclid" option, V is the inverse of diag(cov(Z)). After this transformation, the weights again reduce to the simple kernel weighting function $K((z - z_0)/(sd(z) h))$.
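As an illustration of this standardization (not the package's internal code), the sketch below computes Mahalanobis-standardized distances from a single target point and the corresponding tri-cube weights; the variable names z1 and z2, the choice of target, and the 25 percent window are all hypothetical.

set.seed(1)
z  <- cbind(z1 = rnorm(100), z2 = rnorm(100))     # hypothetical z variables
z0 <- z[1, ]                                      # one target point
V  <- solve(cov(z))                               # "Mahal"; use solve(diag(diag(cov(z)))) for "Euclid"
D  <- sweep(z, 2, z0)                             # deviations from the target
d  <- sqrt(rowSums((D %*% V) * D))                # standardized distances to the target
h  <- quantile(d, .25)                            # distance cutoff implied by a 25% window
w  <- ifelse(d < h, (70/81)*(1 - (d/h)^3)^3, 0)   # tri-cube kernel weights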

The great circle formula is used to define K when distance = "Latlong"; in this case, the variable list for nonpar must be given as nonpar = ~latitude+longitude (or ~lo+la, ~lat+long, etc.), with the longitude and latitude variables expressed in degrees (e.g., -87.627800 and 41.881998 for one observation of longitude and latitude, respectively). The order in which latitude and longitude are listed does not matter; the function only looks at the first two letters of the variable names to determine which is latitude and which is longitude. Note that the great circle distance measure is left in miles rather than being standardized. Thus, the window option should be specified when distance = "Latlong", or the bandwidth should be adjusted to account for the scale. The kernel weighting function becomes K(distance/h) under the "Latlong" option.
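For reference, a standard haversine great-circle calculation in miles looks like the sketch below; the exact formula cparlwr uses internally may differ slightly, so treat this as illustrative only.

gcdist <- function(lat, long, lat0, long0, radius = 3958.8) {  # mean earth radius in miles
  rad <- pi/180
  a <- sin((lat - lat0)*rad/2)^2 +
       cos(lat0*rad)*cos(lat*rad)*sin((long - long0)*rad/2)^2
  2*radius*asin(pmin(1, sqrt(a)))                              # distance in miles
}
gcdist(41.881998, -87.627800, 41.90, -87.65)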

h is specified by the bandwidth or window option. The function cparlwrgrid can be used to search for the value of h that minimizes the cv or gcv criterion.
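A hedged sketch of the grid search, assuming cparlwrgrid accepts a vector of candidate window values and reports the cv/gcv criteria (see its help page for the exact interface), using the cookdata example data from the Examples section:

wgrid <- seq(.10, .50, by=.10)        # candidate window sizes (assumed interface)
fitg <- cparlwrgrid(LNFAR~DCBD, nonpar=~DCBD, window=wgrid, data=cookdata)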

Since each estimate is a linear function of all n values of y, the full set of estimates takes the form yhat = LY, where L is an n x n matrix. Loader (1999) suggests two measures of the number of degrees of freedom used in estimation: df1 = tr(L) and df2 = tr(L'L). The diagonal elements of L are stored in the vector infl. Since the degrees of freedom measures can differ substantially when targetobs = "alldata" rather than a set of target points is used, it is a good idea to report final estimates with targetobs = "alldata" when possible.

Again following Loader (1999), the degrees of freedom correction used to estimate the error variance, sig2, is df = 2*df1 - df2. Let e represent the vector of residuals, y - yhat. The estimated variance is $sig2 = \sum e^2/(n-df)$. The covariance matrix for $B(z_0)$ is $$\hat{\sigma}^2(\sum_{i=1}^n X_i K(\phi_i) X_i^\top)^{-1}(\sum_{i=1}^n X_i (K(\phi_i))^2 X_i^\top )(\sum_{i=1}^n X_i K(\phi_i) X_i^\top)^{-1}.$$
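These quantities can be reproduced from the components returned by cparlwr; the sketch below assumes a fitted object such as fit1 from the Examples section and the corresponding y variable, and simply writes out the formulas given above.

y    <- cookdata$LNFAR                       # dependent variable used in fit1 (Examples below)
e    <- y - fit1$yhat                        # residuals
n    <- length(y)
df   <- 2*fit1$df1 - fit1$df2                # Loader's degrees of freedom correction
sig2 <- sum(e^2)/(n - df)                    # compare with fit1$sig2
cv   <- mean((e/(1 - fit1$infl))^2)          # compare with fit1$cv
gcv  <- n*(n*sig2)/((n - df)^2)              # compare with fit1$gcv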

Estimation can be very slow when targetobs = "alldata". The maketarget command can be used to identify target points.
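The sketch below shows one way to supply target observations explicitly via targetobs; it assumes that maketarget takes the nonparametric variables as a one-sided formula and returns an obs component of observation numbers, as described above (check the maketarget help page for its exact arguments).

tobs <- maketarget(~DCBD, window=.10, data=cookdata)$obs   # assumed maketarget interface
fit3 <- cparlwr(LNFAR~DCBD, nonpar=~DCBD, window=.10,
  targetobs=tobs, data=cookdata)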

Available kernel weighting functions include the following:

Kernel         Call abbreviation    Kernel function K(z)
Rectangular    "rect"               $1/2 * I(|z| < 1)$
Triangular     "tria"               $(1 - |z|) * I(|z| < 1)$
Epanechnikov   "epan"               $3/4 * (1 - z^2) * I(|z| < 1)$
Bi-Square      "bisq"               $15/16 * (1 - z^2)^2 * I(|z| < 1)$
Tri-Cube       "tcub"               $70/81 * (1 - |z|^3)^3 * I(|z| < 1)$
Tri-Weight     "trwt"               $35/32 * (1 - z^2)^3 * I(|z| < 1)$
Gaussian       "gauss"              $(2\pi)^{-1/2} \exp(-z^2/2)$
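For reference, the kernels in the table can be written directly as R functions; these are transcriptions of the formulas above, not the package's internal code.

kerns <- list(
  rect  = function(z) (1/2)   * (abs(z) < 1),
  tria  = function(z) (1 - abs(z))               * (abs(z) < 1),
  epan  = function(z) (3/4)   * (1 - z^2)        * (abs(z) < 1),
  bisq  = function(z) (15/16) * (1 - z^2)^2      * (abs(z) < 1),
  tcub  = function(z) (70/81) * (1 - abs(z)^3)^3 * (abs(z) < 1),
  trwt  = function(z) (35/32) * (1 - z^2)^3      * (abs(z) < 1),
  gauss = function(z) (2*pi)^(-1/2) * exp(-z^2/2)
)
sapply(kerns, function(k) k(0.5))   # kernel values at z = 0.5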

References

Cleveland, William S. and Susan J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," Journal of the American Statistical Association 83 (1988), 596-610.

Loader, Clive. Local Regression and Likelihood. New York: Springer, 1999.

McMillen, Daniel P., "One Hundred Fifty Years of Land Values in Chicago: A Nonparametric Approach," Journal of Urban Economics 40 (1996), 100-124.

McMillen, Daniel P., "Issues in Spatial Data Analysis," Journal of Regional Science 50 (2010), 119-141.

McMillen, Daniel P., "Employment Densities, Spatial Autocorrelation, and Subcenters in Large Metropolitan Areas," Journal of Regional Science 44 (2004), 225-243.

McMillen, Daniel P. and John F. McDonald, "A Nonparametric Analysis of Employment Density in a Polycentric City," Journal of Regional Science 37 (1997), 591-612.

McMillen, Daniel P. and Christian Redfearn, ``Estimation and Hypothesis Testing for Nonparametric Hedonic House Price Functions,'' Journal of Regional Science 50 (2010), 712-733.

Pagan, Adrian and Aman Ullah. Nonparametric Econometrics. New York: Cambridge University Press, 1999.

See Also

cparlwrgrid

cubespline

fourier

lwr

lwrgrid

semip

Examples


data(cookdata)
par(ask=TRUE)
# Keep Chicago observations with non-missing log floor-area ratios
cookdata <- cookdata[cookdata$CHICAGO==1&!is.na(cookdata$LNFAR),]
# Conditionally parametric fit with distance from the CBD as z
fit1 <- cparlwr(LNFAR~DCBD,nonpar=~DCBD, window=.10,
  data=cookdata)
# Geographically weighted fit using great-circle distances
fit2 <- cparlwr(LNFAR~DCBD,nonpar=~LONGITUDE+LATITUDE,window=.10,
  distance="LATLONG",data=cookdata)
cookdata$yhat1 <- fit1$yhat
cookdata$yhat2 <- fit2$yhat
# Plot fitted values against distance from the CBD
o <- order(cookdata$DCBD)
plot(cookdata$DCBD[o], cookdata$LNFAR[o],main="Log Floor Area Ratio",
  xlab="Distance from CBD",ylab="Log FAR")
lines(cookdata$DCBD[o], cookdata$yhat1[o], col="red")
plot(cookdata$DCBD[o], cookdata$LNFAR[o],main="Log Floor Area Ratio",
  xlab="Distance from CBD",ylab="Log FAR")
points(cookdata$DCBD[o], cookdata$yhat2[o], col="red")
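# A possible extension (not part of the original example): plot the estimated
# DCBD coefficient B(z) from fit1, assuming the second column of xcoef
# corresponds to DCBD (the first column is the intercept).
plot(cookdata$DCBD[o], fit1$xcoef[o,2], type="l", col="red",
  main="Estimated DCBD Coefficient", xlab="Distance from CBD", ylab="B(z)")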
