Fixed-width files store tabular data with each field occupying a specific range of character positions in every line. Once the fields are identified, converting them to the appropriate R types works just like for delimited files. The unique challenge with fixed-width files is describing where each field begins and ends. readr tries to ease this pain by offering a few different ways to specify the field structure:
fwf_empty() - Guesses based on the positions of empty columns. This is
the default. (Note that fwf_empty() returns 0-based positions, for
internal use.)
fwf_widths() - Supply the widths of the columns.
fwf_positions() - Supply paired vectors of start and end positions. These
are interpreted as 1-based positions, so are off-by-one compared to the
output of fwf_empty().
fwf_cols() - Supply named arguments of paired start and end positions or
column widths.
Note: fwf_empty() cannot work with a connection or with any of the input
types that involve a connection internally, which includes remote and
compressed files. The reason is that this would necessitate reading from the
connection twice. In these cases, you'll have to either provide the field
structure explicitly with another fwf_*() function or download (and
decompress, if relevant) the file first.
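As a sketch of the download-first workaround (the URL below is purely hypothetical; substitute your own remote file):

```r
library(readr)

# fwf_empty() cannot guess field positions from a connection, so download
# the remote file to a local temporary path first.
tmp <- tempfile(fileext = ".txt")
download.file("https://example.com/data/fixed-width.txt", tmp)  # hypothetical URL

# Now the local copy can be read twice: once to guess, once to parse.
spec <- fwf_empty(tmp, col_names = c("name", "state", "ssn"))
read_fwf(tmp, col_positions = spec)
```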
read_fwf(
file,
col_positions = fwf_empty(file, skip, n = guess_max),
col_types = NULL,
col_select = NULL,
id = NULL,
locale = default_locale(),
na = c("", "NA"),
comment = "",
trim_ws = TRUE,
skip = 0,
n_max = Inf,
guess_max = min(n_max, 1000),
progress = show_progress(),
name_repair = "unique",
num_threads = readr_threads(),
show_col_types = should_show_types(),
lazy = should_read_lazy(),
skip_empty_rows = TRUE
)

fwf_empty(
file,
skip = 0,
skip_empty_rows = deprecated(),
col_names = NULL,
comment = "",
n = 100L
)
fwf_widths(widths, col_names = NULL)
fwf_positions(start, end = NULL, col_names = NULL)
fwf_cols(...)
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz, .bz2, .xz, or .zip will
be automatically uncompressed. Files starting with http://,
https://, ftp://, or ftps:// will be automatically
downloaded. Remote .gz files can also be automatically downloaded and
decompressed.
Literal data is most useful for examples and tests. To be recognised as
literal data, wrap the input with I().
Using a value of clipboard() will read from the system clipboard.
Column positions, as created by fwf_empty(),
fwf_widths(), fwf_positions(), or fwf_cols(). To read in only
selected fields, use fwf_positions(). If the width of the last column
is variable (a ragged fwf file), supply the last end position as NA.
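For example, using the sample file bundled with readr, the last field can be treated as ragged by supplying NA as its end position:

```r
library(readr)

fwf_sample <- readr_example("fwf-sample.txt")

# End position NA: the "ssn" field runs to the end of each line,
# however long that line happens to be.
ragged <- read_fwf(
  fwf_sample,
  fwf_positions(c(1, 30), c(20, NA), c("name", "ssn"))
)
ragged
```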
One of NULL, a cols() specification, or
a string. See vignette("readr") for more details.
If NULL, all column types will be inferred from guess_max rows of the
input, interspersed throughout the file. This is convenient (and fast),
but not robust. If the guessed types are wrong, you'll need to increase
guess_max or supply the correct types yourself.
Column specifications created by list() or cols() must contain
one column specification for each column. If you only want to read a
subset of the columns, use cols_only().
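For instance, a full cols() specification for the bundled sample file might look like this (the choice of a factor for state is illustrative):

```r
library(readr)

fwf_sample <- readr_example("fwf-sample.txt")

# One column specification per column, matched by name.
spec_df <- read_fwf(
  fwf_sample,
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn")),
  col_types = cols(
    name  = col_character(),
    state = col_factor(),   # levels inferred from the data
    ssn   = col_character()
  )
)
```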
Alternatively, you can use a compact string representation where each character represents one column:
c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip
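A minimal sketch of the compact form, again using the bundled sample file:

```r
library(readr)

fwf_sample <- readr_example("fwf-sample.txt")

# "ccc": read all three columns as character, one letter per column.
compact_df <- read_fwf(
  fwf_sample,
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn")),
  col_types = "ccc"
)
```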
By default, reading a file without a column specification will print a
message showing what readr guessed they were. To remove this message,
set show_col_types = FALSE or set options(readr.show_col_types = FALSE).
Columns to include in the results. You can use the same
mini-language as dplyr::select() to refer to the columns by name. Use
c() to use more than one selection expression. Although this
usage is less common, col_select also accepts a numeric column index. See
?tidyselect::language for full details on the
selection language.
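For example, to keep only two of the three fields by name (a sketch using the bundled sample file):

```r
library(readr)

fwf_sample <- readr_example("fwf-sample.txt")

# tidyselect-style selection: keep "name" and "ssn", drop "state".
sel_df <- read_fwf(
  fwf_sample,
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn")),
  col_select = c(name, ssn)
)
```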
The name of a column in which to store the file path. This is
useful when reading multiple input files and there is data in the file
paths, such as the data collection date. If NULL (the default) no extra
column is created.
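A quick sketch (the column name "source_file" is an arbitrary choice):

```r
library(readr)

fwf_sample <- readr_example("fwf-sample.txt")

# Each row records the path of the file it came from, which is most
# useful when passing a vector of file paths to read_fwf().
id_df <- read_fwf(
  fwf_sample,
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn")),
  id = "source_file"
)
```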
The locale controls defaults that vary from place to place.
The default locale is US-centric (like R), but you can use
locale() to create your own locale that controls things like
the default time zone, encoding, decimal mark, big mark, and day/month
names.
Character vector of strings to interpret as missing values. Set this
option to character() to indicate no missing values.
A string used to identify comments. Any text after the comment characters will be silently ignored.
Should leading and trailing whitespace (ASCII spaces and tabs) be trimmed from each field before parsing it?
Number of lines to skip before reading data.
Maximum number of lines to read.
Maximum number of lines to use for guessing column types.
Will never use more than the number of lines read.
See vignette("column-types", package = "readr") for more details.
Display a progress bar? By default it will only display
in an interactive session and not while knitting a document. The automatic
progress bar can be disabled by setting option readr.show_progress to
FALSE.
Handling of column names. The default behaviour is to
ensure column names are "unique". Various repair strategies are
supported:
"minimal": No name repair or checks, beyond basic existence of names.
"unique" (default value): Make sure names are unique and not empty.
"check_unique": No name repair, but check they are unique.
"unique_quiet": Repair with the unique strategy, quietly.
"universal": Make the names unique and syntactic.
"universal_quiet": Repair with the universal strategy, quietly.
A function: Apply custom name repair (e.g., name_repair = make.names
for names in the style of base R).
A purrr-style anonymous function, see rlang::as_function().
This argument is passed on as repair to vctrs::vec_as_names().
See there for more details on these terms and the strategies used
to enforce them.
The number of processing threads to use for initial
parsing and lazy reading of data. If your data contains newlines within
fields the parser should automatically detect this and fall back to using
one thread only. However if you know your file has newlines within quoted
fields it is safest to set num_threads = 1 explicitly.
If FALSE, do not show the guessed column types. If
TRUE always show the column types, even if they are supplied. If NULL
(the default) only show the column types if they are not explicitly supplied
by the col_types argument.
Read values lazily? By default, this is FALSE, because there
are special considerations when reading a file lazily that have tripped up
some users. Specifically, things get tricky when reading and then writing
back into the same file. But, in general, lazy reading (lazy = TRUE) has
many benefits, especially for interactive use and when your downstream work
only involves a subset of the rows or columns.
Learn more in should_read_lazy() and in the documentation for the
altrep argument of vroom::vroom().
Should blank rows be ignored altogether? That is, if this
option is TRUE, blank rows will not be represented at all. If it is
FALSE, they will be represented by NA values in all the columns.
Either NULL, or a character vector of column names.
Number of lines the tokenizer will read to determine file structure. By default it is set to 100.
Width of each field. Use NA as the width of the last field
when reading a ragged fixed-width file.
Starting and ending (inclusive) positions of each field.
Positions are 1-based: the first character in a line is at position 1.
Use NA as the last value of end when reading a ragged fixed-width
file.
If the first element is a data frame,
then it must have all numeric columns and either one or two rows.
The column names are the variable names. If the data frame has one row,
the values are the variable widths; if it has two rows, they are the
variable start and end positions. Otherwise, the elements of ... are
used to construct a data frame with one or two rows as above.
Comments are now only ignored if they appear at the start of a line. Comments elsewhere in a line are no longer treated specially.
Here's an annotated example using the contents of the file accessed via
readr_example("fwf-sample.txt"):
1 2 3 4
123456789012345678901234567890123456789012
[ name 20 ][state 10][ ssn 12 ]
John Smith WA 418-Y11-4111
Mary Hartford CA 319-Z19-4341
Evan Nolan IL 219-532-c301
Here are some valid field specifications for the above (all are valid, though not all equivalent):
fwf_widths(c(20, 10, 12), c("name", "state", "ssn"))
fwf_positions(c(1, 30), c(20, 42), c("name", "ssn"))
fwf_cols(state = c(21, 30), last = c(6, 20), first = c(1, 4), ssn = c(31, 42))
fwf_cols(name = c(1, 20), ssn = c(30, 42))
fwf_cols(name = 20, state = 10, ssn = 12)
read_table() to read fixed-width files where each
column is separated by whitespace.
fwf_sample <- readr_example("fwf-sample.txt")
writeLines(read_lines(fwf_sample))
# You can specify column positions in several ways:
# 1. Guess based on position of empty columns
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
# 2. A vector of field widths
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
# 3. Paired vectors of start and end positions
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))
# 4. Named arguments with start and end positions
read_fwf(fwf_sample, fwf_cols(name = c(1, 20), ssn = c(30, 42)))
# 5. Named arguments with column widths
read_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))