importers: Object Oriented Interface to Foreign Files

Description

Importer objects refer to an external data file. Currently, only Stata files and SPSS system, portable, and fixed-column files are supported.

Data are actually imported by `translating' an importer object into a data.set using as.data.set or subset.

The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. It is also adapted to load large data sets efficiently. Most importantly, importer objects support the labels, missing.values, and descriptions provided by this package.
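The two-step workflow described above (create an importer, then translate it into a data.set) can be sketched as follows. This is a minimal illustration that reuses the NES 1948 portable file shipped with the package, as in the Examples section; the object names por_path, nes1948, and nes1948_ds are chosen here for illustration only.

```r
library(memisc)

# Extract the SPSS portable file that ships with memisc into a
# temporary directory (same file as in the Examples section).
por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())

# Step 1: create the importer. Only header information is read here.
nes1948 <- spss.portable.file(por_path)
is(nes1948, "importer")

# Step 2: translate the importer into a data.set. This is where the
# data are actually read from the file.
nes1948_ds <- as.data.set(nes1948)
is(nes1948_ds, "data.set")

# Variable names are lower case because to.lower defaults to TRUE.
head(names(nes1948_ds))
```

If only a few variables are needed, subset with a select argument avoids loading the complete file.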

Usage

spss.fixed.file(file,
  columns.file,
  varlab.file=NULL,
  codes.file=NULL,
  missval.file=NULL,
  count.cases=TRUE,
  to.lower=TRUE
  )
spss.portable.file(file,
  varlab.file=NULL,
  codes.file=NULL,
  missval.file=NULL,
  count.cases=TRUE,
  to.lower=TRUE)
spss.system.file(file,
  varlab.file=NULL,
  codes.file=NULL,
  missval.file=NULL,
  count.cases=TRUE,
  to.lower=TRUE)
Stata.file(file)
## The most important methods for "importer" objects are:
# S4 method for importer
subset(x, subset, select, drop = FALSE, …)
# S4 method for importer
as.data.set(x,row.names=NULL,optional=NULL,
                    compress.storage.modes=FALSE,…)

Arguments

x

an object that inherits from class "importer".

file

character string; the path to the file containing the data

columns.file

character string; the path to an SPSS/PSPP syntax file with a DATA LIST FIXED statement

varlab.file

character string; the path to an SPSS/PSPP syntax file with a VARIABLE LABELS statement

codes.file

character string; the path to an SPSS/PSPP syntax file with a VALUE LABELS statement

missval.file

character string; the path to an SPSS/PSPP syntax file with a MISSING VALUES statement

count.cases

logical; should the cases in the file be counted? This takes effect only if the data file does not already contain information about the number of cases.

to.lower

logical; should variable names be changed to lower case?

subset

a logical vector or an expression containing variables from the external data file that evaluates to logical.

select

a vector of variable names from the external data file. This may also be a named vector, where the names give the names into which the variables from the external data file are renamed.

drop

a logical value that determines what happens if only one column is selected. If TRUE and only one column is selected, subset returns a single item object rather than a data.set.

row.names

ignored, present only for compatibility.

optional

ignored, present only for compatibility.

compress.storage.modes

logical value; if TRUE floating point values are converted to integers if possible without loss of information.

…

other arguments; ignored.

Value

spss.fixed.file, spss.portable.file, spss.system.file, and Stata.file return, respectively, objects of class "spss.fixed.importer", "spss.portable.importer", "spss.system.importer", or "Stata.importer", which, by inheritance, are also objects of class "importer".

Objects of class "importer" have at least the following two slots:

ptr

an external pointer

variables

a list of objects of class "item.vector" which provides a `prototype' for the "data.set" objects returned by the as.data.set and subset methods for objects of class "importer"

The as.data.frame method for importer objects does the actual data import and returns a data frame. Note that, in contrast to read.spss, the variable names of the resulting data frame will be lower case unless the importer function is called with to.lower=FALSE. If long variable names are defined (in the case of a PSPP/SPSS system file), they take precedence and are not coerced to lower case.
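The effect of to.lower on the resulting variable names can be checked with a short sketch. It assumes the NES 1948 portable file shipped with the package. To rely only on conversions documented for data.set objects, it goes through as.data.set before calling as.data.frame. The object names are illustrative.

```r
library(memisc)

por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())

# Default: variable names are converted to lower case.
imp_lower <- spss.portable.file(por_path)
df_lower  <- as.data.frame(as.data.set(imp_lower))

# With to.lower = FALSE the names are kept as stored in the file.
imp_asis <- spss.portable.file(por_path, to.lower = FALSE)
df_asis  <- as.data.frame(as.data.set(imp_asis))

# Compare the first few names of both variants.
head(names(df_lower))
head(names(df_asis))
```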

Details

A call to a `constructor' for an importer object, that is, spss.fixed.file, spss.portable.file, spss.system.file, or Stata.file, causes R to read in the header of the data file and/or the syntax files that contain information about the variables, such as the columns that they occupy (in case of spss.fixed.file), variable labels, value labels, and missing values.

The information in the file header and/or the accompanying files is then processed to prepare the file for importing. Thus the inner structure of an importer object may well vary according to what type of file is to be imported and what additional information is given.

The as.data.set and subset methods for "importer" objects internally use the generic functions seekData, readData, readSlice, and readChunk, which have methods for the subclasses of "importer". These functions are not callable from outside the package, however.

The subset method for "importer" objects reads in the data `chunk-wise' to create the subset of observations if the option "subset.chunk.size" is set to a non-NULL value, e.g. by options(subset.chunk.size=1000). This may be useful in case of very large data sets from which only a tiny subset of observations is needed for analysis.
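A minimal sketch of chunk-wise subsetting, again assuming the NES 1948 portable file shipped with the package. The chunk size of 100 and the condition v480045 == 1 are arbitrary choices for illustration; consult the codebook for the meaning of the codes.

```r
library(memisc)

por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())
nes1948 <- spss.portable.file(por_path)

# Set the chunk size and remember the previous setting.
old_opts <- options(subset.chunk.size = 100)

# With the option set, cases are read in chunks of 100 and only those
# for which the condition holds are kept.
chunked <- subset(nes1948, v480045 == 1,
                  select = c(vote = v480018, gender = v480045))

# Restore the previous setting.
options(old_opts)

# The same request without the chunk size set above, for comparison.
unchunked <- subset(nes1948, v480045 == 1,
                    select = c(vote = v480018, gender = v480045))

# Both approaches should select the same number of cases.
nrow(as.data.frame(chunked))
nrow(as.data.frame(unchunked))
```

Chunk-wise reading mainly helps with files far larger than this example, where the full set of cases should never be held in memory at once.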

Since the functions described here are a more or less complete rewrite based on the description of the file structure provided by the documentation for PSPP, they are perhaps not as thoroughly tested as the functions in the foreign package, apart from their frequent use by the author of this package.

Examples


# NOT RUN {
# Extract American National Election Study of 1948
nes1948.por <- unzip(system.file("anes/NES1948.ZIP",package="memisc"),
                     "NES1948.POR",exdir=tempfile())

# Get information about the variables contained.
nes1948 <- spss.portable.file(nes1948.por)

# The data are not yet loaded:
show(nes1948)

# ... but one can see what variables are present:
description(nes1948)

# Now a subset of the data is loaded:
vote.socdem.48 <- subset(nes1948,
              select=c(
                  v480018,
                  v480029,
                  v480030,
                  v480045,
                  v480046,
                  v480047,
                  v480048,
                  v480049,
                  v480050
                  ))

# Let's make the names more descriptive:
vote.socdem.48 <- rename(vote.socdem.48,
                  v480018 = "vote",
                  v480029 = "occupation.hh",
                  v480030 = "unionized.hh",
                  v480045 = "gender",
                  v480046 = "race",
                  v480047 = "age",
                  v480048 = "education",
                  v480049 = "total.income",
                  v480050 = "religious.pref"
        )

# It is also possible to do both
# in one step:
# vote.socdem.48 <- subset(nes1948,
#              select=c(
#                  vote           = v480018,
#                  occupation.hh  = v480029,
#                  unionized.hh   = v480030,
#                  gender         = v480045,
#                  race           = v480046,
#                  age            = v480047,
#                  education      = v480048,
#                  total.income   = v480049,
#                  religious.pref = v480050
#                  ))



# We examine the data more closely:
codebook(vote.socdem.48)

# ... and conduct some analyses.
#
t(genTable(percent(vote)~occupation.hh,data=vote.socdem.48))

# We consider only the two main candidates.
vote.socdem.48 <- within(vote.socdem.48,{
  truman.dewey <- vote
  valid.values(truman.dewey) <- 1:2
  truman.dewey <- relabel(truman.dewey,
              "VOTED - FOR TRUMAN" = "Truman",
              "VOTED - FOR DEWEY"  = "Dewey")
  })

summary(truman.relig.glm <- glm((truman.dewey=="Truman")~religious.pref,
    data=vote.socdem.48,
    family="binomial"
))
# }
