solrium

A general purpose R interface to Solr

Development now follows Solr v7 and greater. Solr v7 introduced many changes, so many functions here may not work with Solr installations older than v7.

Be aware that some functions currently work only in certain Solr modes; e.g., collection_create() won't work when you are not in SolrCloud mode. In that case you should get an error message saying so.

Currently developing against Solr v7.0.0

Note that we recently changed the package name to solrium. A previous version of this package is on CRAN as solr, but the next version will be released as solrium.

Solr info

Package API and ways of using the package

Start with SolrClient, which instantiates a client connection to your Solr instance. ping and schema are helpful functions to try once you have a client.

There are two ways to use solrium:

Call functions on the SolrClient object
Pass the SolrClient object to functions

For example, if we instantiate a client with conn <- SolrClient$new(), the first way is conn$search(...) and the second is solr_search(conn, ...). Supporting both styles should make the package friendlier to more people: those who prefer an object-oriented approach and those who prefer a functional approach.
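As a minimal sketch of the two styles, the snippet below runs the same query both ways against the PLOS server used in the Setup section. It assumes network access to api.plos.org; the query parameters are only illustrative.

```r
library(solrium)

# Client for the public PLOS search API (same settings as in Setup)
conn <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

# Shared query parameters: match everything, return two ids
params <- list(q = "*:*", rows = 2, fl = "id")

# 1. Object-oriented style: call the method on the client
res_oo <- conn$search(params = params)

# 2. Functional style: pass the client as the first argument
res_fn <- solr_search(conn, params = params)
```

Both calls should return the same kind of result (a tibble by default), so use whichever style fits your code.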

Collections

Functions that start with collection work with Solr collections when in cloud mode. Note that these functions won't work in standard Solr mode.

Cores

Functions that start with core work with Solr cores when in standard Solr mode. Note that these functions won't work in Solr cloud mode.

Documents

The following functions work with documents in Solr:

#>  - add
#>  - delete_by_id
#>  - delete_by_query
#>  - update_atomic_json
#>  - update_atomic_xml
#>  - update_csv
#>  - update_json
#>  - update_xml

Search

Search functions, including solr_parse for parsing the results of the other functions:

#>  - solr_all
#>  - solr_facet
#>  - solr_get
#>  - solr_group
#>  - solr_highlight
#>  - solr_mlt
#>  - solr_parse
#>  - solr_search
#>  - solr_stats

Install

Stable version from CRAN

install.packages("solrium")

Or development version from GitHub

devtools::install_github("ropensci/solrium")

library("solrium")

Setup

Use SolrClient$new() to initialize your connection. These examples use a remote Solr server, but they also work against a local Solr server.

(cli <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL))
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

You can also choose simple or detailed error messages (via errors), whether to show the URL used in each function call (via verbose), and, if needed, your proxy settings (via proxy). For example:

SolrClient$new(errors = "complete")

Printing the connection object shows your settings:

cli
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

For local Solr server setup:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted example/exampledocs/*.xml
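
Once the local server is running, a connection might look like the sketch below. It assumes Solr is listening on the default 127.0.0.1:8983 and that the gettingstarted collection from the commands above exists; adjust the host, port, and collection name to your setup.

```r
library(solrium)

# Assumed local Solr on the default host and port
conn <- SolrClient$new(host = "127.0.0.1", port = 8983)

# Check that the collection responds
conn$ping(name = "gettingstarted")

# Search the example documents posted above
conn$search(name = "gettingstarted", params = list(q = "*:*", rows = 2))
```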

Search

cli$search(params = list(q='*:*', rows=2, fl='id'))
#> # A tibble: 2 x 1
#>                                      id
#>                                   <chr>
#> 1    10.1371/journal.pone.0079536/title
#> 2 10.1371/journal.pone.0079536/abstract

Search grouped data

Most recent publication by journal

cli$group(params = list(q='*:*', group.field='journal', rows=5, group.limit=1,
                        group.sort='publication_date desc',
                        fl='publication_date, score'))
#>                         groupValue numFound start     publication_date
#> 1                         plos one  1572163     0 2017-11-01T00:00:00Z
#> 2 plos neglected tropical diseases    47510     0 2017-11-01T00:00:00Z
#> 3                    plos genetics    59871     0 2017-11-01T00:00:00Z
#> 4                   plos pathogens    53246     0 2017-11-01T00:00:00Z
#> 5                             none    63561     0 2012-10-23T00:00:00Z
#>   score
#> 1     1
#> 2     1
#> 3     1
#> 4     1
#> 5     1

First publication by journal

cli$group(params = list(q = '*:*', group.field = 'journal', group.limit = 1,
                        group.sort = 'publication_date asc',
                        fl = c('publication_date', 'score'),
                        fq = "publication_date:[1900-01-01T00:00:00Z TO *]"))
#>                          groupValue numFound start     publication_date
#> 1                          plos one  1572163     0 2006-12-20T00:00:00Z
#> 2  plos neglected tropical diseases    47510     0 2007-08-30T00:00:00Z
#> 3                    plos pathogens    53246     0 2005-07-22T00:00:00Z
#> 4        plos computational biology    45582     0 2005-06-24T00:00:00Z
#> 5                              none    57532     0 2005-08-23T00:00:00Z
#> 6              plos clinical trials      521     0 2006-04-21T00:00:00Z
#> 7                     plos genetics    59871     0 2005-06-17T00:00:00Z
#> 8                     plos medicine    23519     0 2004-09-07T00:00:00Z
#> 9                      plos medicin        9     0 2012-04-17T00:00:00Z
#> 10                     plos biology    32513     0 2003-08-18T00:00:00Z
#>    score
#> 1      1
#> 2      1
#> 3      1
#> 4      1
#> 5      1
#> 6      1
#> 7      1
#> 8      1
#> 9      1
#> 10     1

Search group query: last 3 publications of 2013

gq <- 'publication_date:[2013-01-01T00:00:00Z TO 2013-12-31T00:00:00Z]'
cli$group(
  params = list(q='*:*', group.query = gq,
                group.limit = 3, group.sort = 'publication_date desc',
                fl = 'publication_date'))
#>   numFound start     publication_date
#> 1   307076     0 2013-12-31T00:00:00Z
#> 2   307076     0 2013-12-31T00:00:00Z
#> 3   307076     0 2013-12-31T00:00:00Z

Search group with format simple

cli$group(params = list(q='*:*', group.field='journal', rows=5,
                        group.limit=3, group.sort='publication_date desc',
                        group.format='simple', fl='journal, publication_date'))
#>   numFound start     publication_date  journal
#> 1  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 2  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 3  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 4  1898495     0 2017-11-01T00:00:00Z PLOS ONE
#> 5  1898495     0 2017-11-01T00:00:00Z PLOS ONE

Facet

cli$facet(params = list(q='*:*', facet.field='journal', facet.query=c('cell', 'bird')))
#> $facet_queries
#> # A tibble: 2 x 2
#>    term  value
#>   <chr>  <int>
#> 1  cell 157652
#> 2  bird  16385
#> 
#> $facet_fields
#> $facet_fields$journal
#> # A tibble: 9 x 2
#>                               term   value
#>                              <chr>   <chr>
#> 1                         plos one 1572163
#> 2                    plos genetics   59871
#> 3                   plos pathogens   53246
#> 4 plos neglected tropical diseases   47510
#> 5       plos computational biology   45582
#> 6                     plos biology   32513
#> 7                    plos medicine   23519
#> 8             plos clinical trials     521
#> 9                     plos medicin       9
#> 
#> 
#> $facet_pivot
#> NULL
#> 
#> $facet_dates
#> NULL
#> 
#> $facet_ranges
#> NULL

Highlight

cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2))
#> # A tibble: 2 x 2
#>                          names
#>                          <chr>
#> 1 10.1371/journal.pone.0185457
#> 2 10.1371/journal.pone.0071284
#> # ... with 1 more variables: abstract <chr>

Stats

out <- cli$stats(params = list(q='ecology', stats.field=c('counter_total_all','alm_twitterCount'), stats.facet='journal'))

out$data
#>                   min    max count missing       sum sumOfSquares
#> counter_total_all   0 920716 40497       0 219020039 7.604567e+12
#> alm_twitterCount    0   3401 40497       0    281128 7.300081e+07
#>                          mean      stddev
#> counter_total_all 5408.302813 12591.07462
#> alm_twitterCount     6.941946    41.88646

More like this

solr_mlt returns documents similar to each document matched by the query.

out <- cli$mlt(params = list(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5))

out$docs
#> # A tibble: 5 x 2
#>                             id counter_total_all
#>                          <chr>             <int>
#> 1 10.1371/journal.pbio.1001805             21824
#> 2 10.1371/journal.pbio.0020440             25424
#> 3 10.1371/journal.pbio.1002559              9746
#> 4 10.1371/journal.pone.0087217             11502
#> 5 10.1371/journal.pbio.1002191             22013

out$mlt
#> $`10.1371/journal.pbio.1001805`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     3822     0 10.1371/journal.pone.0098876              3590
#> 2     3822     0 10.1371/journal.pone.0082578              2893
#> 3     3822     0 10.1371/journal.pone.0102159              2028
#> 4     3822     0 10.1371/journal.pcbi.1002652              3819
#> 5     3822     0 10.1371/journal.pcbi.1003408              9920
#> 
#> $`10.1371/journal.pbio.0020440`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     1115     0 10.1371/journal.pone.0162651              2828
#> 2     1115     0 10.1371/journal.pone.0003259              3225
#> 3     1115     0 10.1371/journal.pntd.0003377              4267
#> 4     1115     0 10.1371/journal.pone.0101568              4603
#> 5     1115     0 10.1371/journal.pone.0068814              9042
#> 
#> $`10.1371/journal.pbio.1002559`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     5482     0 10.1371/journal.pone.0155989              2519
#> 2     5482     0 10.1371/journal.pone.0023086              8442
#> 3     5482     0 10.1371/journal.pone.0155028              1547
#> 4     5482     0 10.1371/journal.pone.0041684             22057
#> 5     5482     0 10.1371/journal.pone.0164330               969
#> 
#> $`10.1371/journal.pone.0087217`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     4576     0 10.1371/journal.pone.0175497              1088
#> 2     4576     0 10.1371/journal.pone.0159131              4937
#> 3     4576     0 10.1371/journal.pcbi.0020092             24786
#> 4     4576     0 10.1371/journal.pone.0133941              1336
#> 5     4576     0 10.1371/journal.pone.0131665              1207
#> 
#> $`10.1371/journal.pbio.1002191`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1    12585     0 10.1371/journal.pbio.1002232              3055
#> 2    12585     0 10.1371/journal.pone.0070448              2203
#> 3    12585     0 10.1371/journal.pone.0131700              2493
#> 4    12585     0 10.1371/journal.pone.0121680              4980
#> 5    12585     0 10.1371/journal.pone.0041534              5701

Parsing

solr_parse is a general purpose parser function with extension methods solr_parse.sr_search, solr_parse.sr_facet, and solr_parse.sr_high for parsing solr_search, solr_facet, and solr_highlight output, respectively. solr_parse is used internally by those three functions to do the parsing. You can optionally get back raw JSON or XML from solr_search, solr_facet, and solr_highlight by setting the parameter raw=TRUE, and then parse it after the fact with solr_parse, which picks the right method from the class of the raw output.

For example:

(out <- cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2),
                      raw=TRUE))
#> [1] "{\"response\":{\"numFound\":25987,\"start\":0,\"maxScore\":4.705177,\"docs\":[{\"id\":\"10.1371/journal.pone.0185457\",\"journal\":\"PLOS ONE\",\"eissn\":\"1932-6203\",\"publication_date\":\"2017-09-28T00:00:00Z\",\"article_type\":\"Research Article\",\"author_display\":[\"Jacqueline Willmore\",\"Terry-Lynne Marko\",\"Darcie Taing\",\"Hugues Sampasa-Kanyinga\"],\"abstract\":[\"Objectives: Alcohol-related morbidity and mortality are significant public health issues. The purpose of this study was to describe the prevalence and trends over time of alcohol consumption and alcohol-related morbidity and mortality; and public attitudes of alcohol use impacts on families and the community in Ottawa, Canada. Methods: Prevalence (2013–2014) and trends (2000–2001 to 2013–2014) of alcohol use were obtained from the Canadian Community Health Survey. Data on paramedic responses (2015), emergency department (ED) visits (2013–2015), hospitalizations (2013–2015) and deaths (2007–2011) were used to quantify the acute and chronic health effects of alcohol in Ottawa. Qualitative data were obtained from the “Have Your Say” alcohol survey, an online survey of public attitudes on alcohol conducted in 2016. Results: In 2013–2014, an estimated 595,300 (83%) Ottawa adults 19 years and older drank alcohol, 42% reported binge drinking in the past year. Heavy drinking increased from 15% in 2000–2001 to 20% in 2013–2014. In 2015, the Ottawa Paramedic Service responded to 2,060 calls directly attributable to alcohol. Between 2013 and 2015, there were an average of 6,100 ED visits and 1,270 hospitalizations per year due to alcohol. Annually, alcohol use results in at least 140 deaths in Ottawa. Men have higher rates of alcohol-attributable paramedic responses, ED visits, hospitalizations and deaths than women, and young adults have higher rates of alcohol-attributable paramedic responses. 
Qualitative data of public attitudes indicate that alcohol misuse has greater repercussions not only on those who drink, but also on the family and community. Conclusions: Results highlight the need for healthy public policy intended to encourage a culture of drinking in moderation in Ottawa to support lower risk alcohol use, particularly among men and young adults. \"],\"title_display\":\"The burden of alcohol-related morbidity and mortality in Ottawa, Canada\",\"score\":4.705177},{\"id\":\"10.1371/journal.pone.0071284\",\"journal\":\"PLoS ONE\",\"eissn\":\"1932-6203\",\"publication_date\":\"2013-08-20T00:00:00Z\",\"article_type\":\"Research Article\",\"author_display\":[\"Petra Suchankova\",\"Pia Steensland\",\"Ida Fredriksson\",\"Jörgen A. Engel\",\"Elisabet Jerlhag\"],\"abstract\":[\"\\nAlcohol dependence is a heterogeneous disorder where several signalling systems play important roles. Recent studies implicate that the gut-brain hormone ghrelin, an orexigenic peptide, is a potential mediator of alcohol related behaviours. Ghrelin increases whereas a ghrelin receptor (GHS-R1A) antagonist decreases alcohol consumption as well as operant self-administration of alcohol in rodents that have consumed alcohol for twelve weeks. In the present study we aimed at investigating the effect of acute and repeated treatment with the GHS-R1A antagonist JMV2959 on alcohol intake in a group of rats following voluntarily alcohol consumption for two, five and eight months. After approximately ten months of voluntary alcohol consumption the expression of the GHS-R1A gene (Ghsr) as well as the degree of methylation of a CpG island found in Ghsr was examined in reward related brain areas. In a separate group of rats, we examined the effect of the JMV2959 on alcohol relapse using the alcohol deprivation paradigm. Acute JMV2959 treatment was found to decrease alcohol intake and the effect was more pronounced after five, compared to two months of alcohol exposure. 
In addition, repeated JMV2959 treatment decreased alcohol intake without inducing tolerance or rebound increase in alcohol intake after the treatment. The GHS-R1A antagonist prevented the alcohol deprivation effect in rats. There was a significant down-regulation of the Ghsr expression in the ventral tegmental area (VTA) in high- compared to low-alcohol consuming rats after approximately ten months of voluntary alcohol consumption. Further analysis revealed a negative correlation between Ghsr expression in the VTA and alcohol intake. No differences in methylation degree were found between high- compared to low-alcohol consuming rats. These findings support previous studies showing that the ghrelin signalling system may constitute a potential target for development of novel treatment strategies for alcohol dependence.\\n\"],\"title_display\":\"Ghrelin Receptor (GHS-R1A) Antagonism Suppresses Both Alcohol Consumption and the Alcohol Deprivation Effect in Rats following Long-Term Voluntary Alcohol Consumption\",\"score\":4.7050986}]},\"highlighting\":{\"10.1371/journal.pone.0185457\":{\"abstract\":[\"Objectives: <em>Alcohol</em>-related morbidity and mortality are significant public health issues\"]},\"10.1371/journal.pone.0071284\":{\"abstract\":[\"\\n<em>Alcohol</em> dependence is a heterogeneous disorder where several signalling systems play important\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"

Then parse

solr_parse(out, 'df')
#> # A tibble: 2 x 2
#>                          names
#>                          <chr>
#> 1 10.1371/journal.pone.0185457
#> 2 10.1371/journal.pone.0071284
#> # ... with 1 more variables: abstract <chr>
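
The same two-step pattern should work for search output. A minimal sketch, assuming network access to the PLOS server from Setup; the query is only illustrative:

```r
library(solrium)

cli <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

# Step 1: ask for the unparsed response (a JSON string by default)
raw_out <- cli$search(params = list(q = "*:*", rows = 2, fl = "id"), raw = TRUE)

# Step 2: parse it afterwards; "df" asks for a data.frame, "list" for a list
parsed <- solr_parse(raw_out, "df")
parsed
```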

Advanced: Function Queries

Function Queries allow you to query on actual numeric fields in the Solr database, and do addition, multiplication, etc. on one or many fields to sort results. For example, here we search on the product of counter_total_all and alm_twitterCount, using the temporary field "_val_":

cli$search(params = list(q='_val_:"product(counter_total_all,alm_twitterCount)"',
  rows=5, fl='id,title', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                             id
#>                          <chr>
#> 1 10.1371/journal.pmed.0020124
#> 2 10.1371/journal.pone.0141854
#> 3 10.1371/journal.pone.0073791
#> 4 10.1371/journal.pone.0153419
#> 5 10.1371/journal.pone.0115069
#> # ... with 1 more variables: title <chr>

Here, we search for the papers with the highest counter_total_all values:

cli$search(params = list(q='_val_:"max(counter_total_all)"',
    rows=5, fl='id,counter_total_all', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                                                        id
#>                                                     <chr>
#> 1                            10.1371/journal.pmed.0020124
#> 2 10.1371/annotation/80bd7285-9d2d-403a-8e6f-9c375bf977ca
#> 3                            10.1371/journal.pcbi.1003149
#> 4                            10.1371/journal.pone.0141854
#> 5                            10.1371/journal.pcbi.0030102
#> # ... with 1 more variables: counter_total_all <int>

Or with the most tweets

cli$search(params = list(q='_val_:"max(alm_twitterCount)"',
    rows=5, fl='id,alm_twitterCount', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                             id alm_twitterCount
#>                          <chr>            <int>
#> 1 10.1371/journal.pone.0141854             3401
#> 2 10.1371/journal.pmed.0020124             3207
#> 3 10.1371/journal.pone.0115069             2873
#> 4 10.1371/journal.pmed.1001953             2821
#> 5 10.1371/journal.pone.0061981             2392

Using specific data sources

USGS BISON service

The occurrences service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/occurrences/select", port = NULL)
conn$search(params = list(q = '*:*', fl = c('decimalLatitude','decimalLongitude','scientificName'), rows = 2))
#> # A tibble: 2 x 3
#>   decimalLongitude         scientificName decimalLatitude
#>              <dbl>                  <chr>           <dbl>
#> 1        -116.5694 Zonotrichia leucophrys        34.05072
#> 2        -116.5694    Tyrannus vociferans        34.05072

The species names service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/scientificName/select", port = NULL)
conn$search(params = list(q = '*:*'))
#> # A tibble: 10 x 2
#>                scientificName  `_version_`
#>                         <chr>        <dbl>
#>  1 Dictyopteris polypodioides 1.565325e+18
#>  2           Lonicera iberica 1.565325e+18
#>  3            Epuraea ambigua 1.565325e+18
#>  4   Pseudopomala brachyptera 1.565325e+18
#>  5    Didymosphaeria populina 1.565325e+18
#>  6                   Sanoarca 1.565325e+18
#>  7     Celleporina ventricosa 1.565325e+18
#>  8         Trigonurus crotchi 1.565325e+18
#>  9       Ceraticelus laticeps 1.565325e+18
#> 10           Micraster acutus 1.565325e+18

PLOS Search API

Most of the examples above use the PLOS search API... :)

Solr server management

This isn't as complete as the search functions shown above, but we're getting there.

Cores

conn <- SolrClient$new()

Many functions, e.g.:

core_create()
core_rename()
core_status()
...

Create a core

conn$core_create(name = "foo_bar")

Collections

Many functions, e.g.:

collection_create()
collection_list()
collection_addrole()
...

Create a collection

conn$collection_create(name = "hello_world")
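
Solr generally rejects a request to create a collection that already exists, so it can help to check first. A minimal sketch, assuming a local Solr running in cloud mode on the default port and that collection_exists() is available in your installed version of solrium:

```r
library(solrium)

conn <- SolrClient$new()

# Create the collection only if it is not already there
if (!collection_exists(conn, "hello_world")) {
  collection_create(conn, name = "hello_world")
}

# List collections to confirm
collection_list(conn)
```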

Add documents

You can add documents from files (JSON, XML, or CSV format) and from R objects (data.frame and list types so far).

df <- data.frame(id = c(67, 68), price = c(1000, 500000000))
conn$add(df, name = "books")
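
Lists can be added in the same way as data frames. A minimal sketch, assuming a running local Solr with a books core or collection; the ids and prices are made up for illustration:

```r
library(solrium)

conn <- SolrClient$new()

# Each inner list is one document
docs <- list(
  list(id = 1, price = 100),
  list(id = 2, price = 500)
)
conn$add(docs, name = "books")

# Look the documents up again by id
conn$search(name = "books", params = list(q = "id:(1 OR 2)", fl = "id,price"))
```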

Delete documents, by id

conn$delete_by_id(name = "books", ids = c(3, 4))

Or by query

conn$delete_by_query(name = "books", query = "manu:bank")

Functions in solrium (1.0.0)