sasxport.get: Enhanced Importing of SAS Transport Files using read.xport

Description

Uses the read.xport and lookup.xport functions in the foreign package to import SAS datasets. SAS date, time, and date/time variables are converted respectively to Date, POSIX, or POSIXct objects in R, variable names are converted to lower case, SAS labels are associated with variables, and (by default) integer-valued variables are converted from storage mode double to integer. If the user ran PROC FORMAT CNTLOUT= in SAS and included the resulting dataset in the SAS version 5 transport file, variables having customized formats that do not include any ranges (i.e., variables having standard PROC FORMAT; VALUE label formats) will have their format labels looked up, and these variables are converted to S factors.
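
As a rough illustration of the date and date/time conversions described above, the following base R sketch turns raw SAS numeric values into R objects. It assumes the usual SAS conventions (dates count days, and date/times count seconds, from 1 January 1960; times count seconds since midnight). The helper names sas_date_to_r, sas_datetime_to_r, and sas_time_to_hms are hypothetical: they are not part of Hmisc or foreign, and this is not necessarily how sasxport.get performs the conversion internally.

```r
# Hypothetical helpers, for illustration only (not Hmisc functions).

# SAS date: number of days since 1960-01-01
sas_date_to_r <- function(x) {
  as.Date(x, origin = as.Date("1960-01-01"))
}

# SAS date/time: number of seconds since 1960-01-01 00:00:00
# Assumption: the values are treated as UTC, since SAS stores no time zone.
sas_datetime_to_r <- function(x) {
  as.POSIXct(x, origin = "1960-01-01", tz = "UTC")
}

# SAS time: number of seconds since midnight, shown here as "HH:MM:SS" text
sas_time_to_hms <- function(x) {
  s <- as.integer(round(x))
  sprintf("%02d:%02d:%02d", s %/% 3600L, (s %% 3600L) %/% 60L, s %% 60L)
}

# Raw values corresponding to the first row of the test dataset in the Examples
format(sas_date_to_r(15402))                               # "2002-03-03"
format(sas_datetime_to_r(1330767062), "%Y-%m-%d %H:%M:%S") # "2002-03-03 09:31:02"
sas_time_to_hms(40425)                                     # "11:13:45"
```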

For those users having access to SAS, method='csv' is preferred when importing several SAS datasets. Run the SAS macro exportlib.sas, available from https://github.com/harrelfe/Hmisc/blob/master/src/sas/exportlib.sas, to convert all SAS datasets in a SAS data library (from any engine supported by your system) into CSV files. If any customized formats are used, it is assumed that the PROC FORMAT CNTLOUT= dataset is in the data library as a regular SAS dataset, as above.

sasdsLabels reads a file containing PROC CONTENTS printed output to parse dataset labels, assuming that PROC CONTENTS was run on an entire library.

Usage

sasxport.get(file, lowernames=TRUE, force.single = TRUE,
             method=c('read.xport','dataload','csv'), formats=NULL, allow=NULL,
             out=NULL, keep=NULL, drop=NULL, as.is=0.5, FUN=NULL)
sasdsLabels(file)

Value

If there is more than one dataset in the transport file other than the PROC FORMAT file, the result is a list of data frames containing all the non-PROC FORMAT datasets. Otherwise the result is the single data frame. There is an exception if out is specified; that causes separate R save files to be written and the returned value to be a list corresponding to the SAS datasets, with key PROC CONTENTS information in a data frame making up each part of the list.

sasdsLabels returns a named vector of dataset labels, with names equal to the dataset names.
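
Because the result is either a single data frame or a list of data frames, code that must handle both cases may want to normalize it first. The sketch below is one possible way to do that in base R; as_dataset_list is a hypothetical helper, not part of Hmisc, and the small data frames only stand in for what sasxport.get might return.

```r
# Hypothetical helper (not an Hmisc function): always return a named list
# of data frames, whether one or several datasets were imported.
as_dataset_list <- function(w, name = "data") {
  if (is.data.frame(w)) {
    stats::setNames(list(w), name)  # wrap the single data frame
  } else {
    w                               # already a list of data frames
  }
}

# Stand-ins for the two shapes of return value
single <- data.frame(age = c(30, 31))
multi  <- list(test = single, z = data.frame(x3 = c(0.1, 0.2)))

names(as_dataset_list(single, "test"))  # "test"
names(as_dataset_list(multi))           # "test" "z"
```

With the result normalized this way, a call such as lapply(as_dataset_list(w), nrow) works regardless of how many datasets the transport file contained.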

Arguments

file: name of a file containing the SAS transport file. file may be a URL beginning with https://. For sasdsLabels, file is the name of a file containing a PROC CONTENTS output listing. For method='csv', file is the name of the directory containing all the CSV files created by running the exportlib SAS macro.
lowernames: set to FALSE to keep from converting SAS variable names to lower case
force.single: set to FALSE to keep integer-valued variables not exceeding \(2^{31}-1\) in value from being converted to integer storage mode
method: set to "dataload" if you have the dataload executable installed and want to use it instead of read.xport. This appears to correct rare errors in which read.xport reads some factor variables as entirely missing when they in fact have some non-missing values. Set to "csv" to read the CSV files created by the exportlib SAS macro.
formats: a data frame or list (like that created by read.xport) containing PROC FORMAT output, if such output is not stored in the main transport file.
allow: a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.
out: a character string specifying a directory in which to write separate R save files (.rda files) for each regular dataset. Each file and the data frame inside it is named with the SAS dataset name translated to lower case and with underscores changed to periods. The default NULL value of out results in a data frame or a list of data frames being returned. When out is given, sasxport.get returns only metadata (see below), invisibly. out only works with method='csv'. out should not have a trailing slash.
keep: a vector of names of SAS datasets to process (original SAS upper case names). Must include PROC FORMAT dataset if it exists, and if the kept datasets use any of its value label formats.
drop: a vector of names of SAS datasets to ignore (original SAS upper case names)
as.is: SAS character variables are converted to S factor objects if as.is=FALSE or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default for as.is is 0.5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.
FUN: an optional function that will be run on each data frame created, when method='csv' and out are specified. The result of all the FUN calls is made into a list corresponding to the SAS datasets that are read. This list is the FUN attribute of the result returned by sasxport.get.
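
The force.single and as.is rules described above can be sketched in base R as follows. Both helpers (to_integer_if_whole and char_to_factor) are hypothetical names used only for illustration and are not part of Hmisc; the treatment of as.is=TRUE as "never convert" and the handling of missing values are assumptions, since the argument descriptions do not spell them out.

```r
# Hypothetical helpers (not Hmisc functions) mirroring the documented rules.

# force.single: whole-valued doubles that fit in the integer range
# (absolute value not exceeding 2^31 - 1) are stored as integers.
to_integer_if_whole <- function(x) {
  if (!is.double(x)) return(x)
  ok <- x[!is.na(x)]  # assumption: missing values are ignored in the check
  if (length(ok) > 0 && all(ok == round(ok)) && all(abs(ok) <= 2^31 - 1)) {
    storage.mode(x) <- "integer"
  }
  x
}

# as.is: convert a character vector to a factor when as.is is FALSE, or when
# the number of unique values is less than n * as.is.
# Assumption: as.is = TRUE means the vector is never converted.
char_to_factor <- function(x, as.is = 0.5) {
  convert <- if (is.logical(as.is)) {
    !as.is
  } else {
    length(unique(x)) < length(x) * as.is
  }
  if (convert) factor(x) else x
}

id  <- sprintf("ID%03d", 1:10)  # 10 unique values out of 10: stays character
sex <- rep(c("F", "M"), 5)      # 2 unique values out of 10: becomes a factor

class(char_to_factor(id))                        # "character"
class(char_to_factor(sex))                       # "factor"
storage.mode(to_integer_if_whole(c(30, 31, NA))) # "integer"
storage.mode(to_integer_if_whole(c(0.25, 0.5)))  # "double"
```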

Author

Frank E Harrell Jr

Details

See contents.list for a way to print the directory of SAS datasets when more than one was imported.

Examples


if (FALSE) {
# SAS code to generate test dataset:
# libname y SASV5XPT "test2.xpt";
#
# PROC FORMAT; VALUE race 1=green 2=blue 3=purple; RUN;
# PROC FORMAT CNTLOUT=format;RUN;  * Name, e.g. 'format', unimportant;
# data test;
# LENGTH race 3 age 4;
# age=30; label age="Age at Beginning of Study";
# race=2;
# d1='3mar2002'd ;
# dt1='3mar2002 9:31:02'dt;
# t1='11:13:45't;
# output;
#
# age=31;
# race=4;
# d1='3jun2002'd ;
# dt1='3jun2002 9:42:07'dt;
# t1='11:14:13't;
# output;
# format d1 mmddyy10. dt1 datetime. t1 time. race race.;
# run;
# data z; LENGTH x3 3 x4 4 x5 5 x6 6 x7 7 x8 8;
#    DO i=1 TO 100;
#        x3=ranuni(3);
#        x4=ranuni(5);
#        x5=ranuni(7);
#        x6=ranuni(9);
#        x7=ranuni(11);
#        x8=ranuni(13);
#        output;
#        END;
#    DROP i;
#    RUN;
# PROC MEANS; RUN;
# PROC COPY IN=work OUT=y;SELECT test format z;RUN; *Creates test2.xpt;
w <- sasxport.get('test2.xpt')
# To use an existing copy of test2.xpt available on the web:
w <- sasxport.get('https://github.com/harrelfe/Hmisc/raw/master/inst/tests/test2.xpt')

describe(w$test)   # see labels, format names for dataset test
# Note: if only one dataset (other than format) had been exported,
# just do describe(w) as sasxport.get would not create a list for that
lapply(w, describe)# see descriptive stats for both datasets
contents(w$test)   # another way to see variable attributes
lapply(w, contents)# show contents of both datasets
options(digits=7)  # compare the following matrix with PROC MEANS output
t(sapply(w$z, function(x)
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))))
}
