elastic (version 0.7.8)

docs_bulk: Use the bulk API to create, index, update, or delete documents.

Description

Use the bulk API to create, index, update, or delete documents.

Usage

docs_bulk(x, index = NULL, type = NULL, chunk_size = 1000,
  doc_ids = NULL, es_ids = TRUE, raw = FALSE, ...)

Arguments

x

A data.frame, or path to a file, to load via the bulk API

index

(character) The index name to use. Required for data.frame input, but optional for file inputs.

type

(character) The type name to use. If left as NULL, will be same name as index.

chunk_size

(integer) Size of each chunk. If your data.frame is smaller than chunk_size, this parameter is essentially ignored. We write in chunks because at some point, depending on the size of each document and your Elasticsearch setup, writing a very large number of documents in one go becomes slow, so chunking can help. This parameter is ignored if you pass a file name. Default: 1000
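
For example, a minimal sketch of tuning the chunk size (df and the index name "bigtbl" are hypothetical; any data.frame with more rows than chunk_size is split into multiple bulk requests):

# send 500 documents per bulk request instead of the default 1000
res <- docs_bulk(df, index = "bigtbl", chunk_size = 500)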

doc_ids

An optional vector (character or numeric/integer) of document IDs to use. This vector must be the same length as the number of documents you are passing in; we error if it is not. If you pass a factor we convert it to character. Default: not passed

es_ids

(logical) Let Elasticsearch assign document IDs as UUIDs. These UUIDs are sequential, so there is an order to the IDs Elasticsearch assigns. If TRUE, doc_ids is ignored. Default: TRUE

raw

(logical) Get raw JSON back or not.

...

Curl options passed on to httr::POST

Value

A list

Document IDs

Document IDs can be passed in via the doc_ids parameter when passing in a data.frame or list, but not with files. If IDs are not passed to doc_ids, we assign document IDs from 1 to the length of the object (rows of a data.frame, or length of a list). In the future we may let the user choose between assigning sequential numeric IDs and letting Elasticsearch assign IDs, which are UUIDs that are actually sequential, so you can still determine an order for your documents.
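
A minimal sketch of supplying your own IDs (the index name "myindex" is hypothetical; fuller loop-based examples appear in the Examples section below):

# one ID per row; es_ids = FALSE so Elasticsearch does not assign its own
docs_bulk(mtcars, index = "myindex", doc_ids = seq_len(nrow(mtcars)),
  es_ids = FALSE)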

Large numbers for document IDs

Until recently, docs_bulk failed if you had very large integers for document IDs. It should be fixed now. Let us know if not.
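
If you do run into trouble, one workaround (an assumption on our part, not part of any fix in the package) is to pass the large IDs as character strings, which sidesteps integer handling entirely:

# format large numbers as character IDs; "myindex" is hypothetical
ids <- sprintf("%.0f", 1e10 + seq_len(nrow(mtcars)))
docs_bulk(mtcars, index = "myindex", doc_ids = ids, es_ids = FALSE)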

Details

More on the Bulk API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html.

This function dispatches on data.frame or character input. Character input must be a file name, or the function stops with an error message.

If you pass a data.frame to this function, we default to an index operation, that is, creating the record in the index and type given by those parameters to the function. Down the road perhaps we will try to support other operations on the bulk API. If you pass a file, you can of course specify any operations you want within that file.
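
As a sketch of what such a file can contain: the bulk API expects newline-delimited JSON, where each action line ("index", "delete", etc.) is optionally followed by a document body. The file name and IDs here are hypothetical:

# write a small bulk file mixing index and delete operations, then load it
writeLines(c(
  '{"index": {"_index": "plos", "_type": "article", "_id": "1"}}',
  '{"title": "this document is created (or replaced) by the index action"}',
  '{"delete": {"_index": "plos", "_type": "article", "_id": "2"}}'
), "mybulk.json")
docs_bulk("mybulk.json")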

Row names are dropped from data.frames, and top-level names of a list are dropped as well.

A progress bar shows progress for data.frames and lists. The bar is based on a for loop in which each iteration takes a chunk of data, converts it to bulk format, and pushes it into Elasticsearch; progress reflects how far along those iterations are. The character (file) method has no for loop, so no progress bar.

See Also

docs_bulk_prep

Examples

# NOT RUN {
plosdat <- system.file("examples", "plos_data.json", package = "elastic")
docs_bulk(plosdat)
aliases_get()
index_delete(index='plos')
aliases_get()

# Curl options
library("httr")
plosdat <- system.file("examples", "plos_data.json", package = "elastic")
docs_bulk(plosdat, config=verbose())

# From a data.frame
docs_bulk(mtcars, index = "hello", type = "world")
## field names cannot contain dots
names(iris) <- gsub("\\.", "_", names(iris))
docs_bulk(iris, "iris", "flowers")
## type can be missing, but index cannot
docs_bulk(iris, "flowers")
## big data.frame, 53K rows, load ggplot2 package first
# res <- docs_bulk(diamonds, "diam")
# Search("diam")$hits$total

# From a list
docs_bulk(apply(iris, 1, as.list), index="iris", type="flowers")
docs_bulk(apply(USArrests, 1, as.list), index="arrests")
# dim_list <- apply(diamonds, 1, as.list)
# out <- docs_bulk(dim_list, index="diamfromlist")

# When using in a loop
## We internally get the last _id counter to know where to start on the
## next bulk insert, but you need to sleep in between docs_bulk calls;
## the bigger the data, the longer the sleep should be
files <- c(system.file("examples", "test1.csv", package = "elastic"),
           system.file("examples", "test2.csv", package = "elastic"),
           system.file("examples", "test3.csv", package = "elastic"))
for (i in seq_along(files)) {
  d <- read.csv(files[[i]])
  docs_bulk(d, index = "testes", type = "docs")
  Sys.sleep(1)
}
count("testes", "docs")
index_delete("testes")

# You can include your own document id numbers
## Either pass in as an argument
index_create("testes")
files <- c(system.file("examples", "test1.csv", package = "elastic"),
           system.file("examples", "test2.csv", package = "elastic"),
           system.file("examples", "test3.csv", package = "elastic"))
tt <- vapply(files, function(z) NROW(read.csv(z)), numeric(1))
ids <- list(1:tt[1],
            (tt[1] + 1):(tt[1] + tt[2]),
            (tt[1] + tt[2] + 1):sum(tt))
for (i in seq_along(files)) {
  d <- read.csv(files[[i]])
  docs_bulk(d, index = "testes", type = "docs", doc_ids = ids[[i]], 
    es_ids = FALSE)
}
count("testes", "docs")
index_delete("testes")

## or include in the input data
### from data.frame's
index_create("testes")
files <- c(system.file("examples", "test1_id.csv", package = "elastic"),
           system.file("examples", "test2_id.csv", package = "elastic"),
           system.file("examples", "test3_id.csv", package = "elastic"))
readLines(files[[1]])
for (i in seq_along(files)) {
  d <- read.csv(files[[i]])
  docs_bulk(d, index = "testes", type = "docs")
}
count("testes", "docs")
index_delete("testes")

### from lists via file inputs
index_create("testes")
for (i in seq_along(files)) {
  d <- read.csv(files[[i]])
  d <- apply(d, 1, as.list)
  docs_bulk(d, index = "testes", type = "docs")
}
count("testes", "docs")
index_delete("testes")

# data.frame's with a single column
## this used to fail, but should work now
db <- paste0(sample(letters, 10), collapse = "")
index_create(db)
res <- data.frame(foo = 1:10)
out <- docs_bulk(x = res, index = db)
count(db)
index_delete(db)
# }
