search_tweets: search_tweets

Description

Returns two data frames (tweets data and users data) using a provided search query.

Usage

search_tweets(q, n = 100, type = "recent", max_id = NULL,
  include_rts = TRUE, parse = TRUE, usr = TRUE, token = NULL,
  retryonratelimit = FALSE, verbose = TRUE, ...)

Arguments

Query to be searched, used to filter and select tweets to return from Twitter's REST API. Must be a character string not to exceed maximum of 500 characters. Spaces behave like boolean "AND" operator. To search for tweets containing at least one of multiple possible terms, separate each search term with spaces and "OR" (in caps). For example, the search q = "data science" looks for tweets containing both "data" and "science" anywhere located anywhere in the tweets and in any order. When "OR" is entered between search terms, query = "data OR science", Twitter's REST API should return any tweet that contains either "data" or "science." It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query using double quotes, e.g., q = '"data science"' or escape each internal double quote with a single backslash, e.g., q = "\"data science\"".

Integer, specifying the total number of desired tweets to return. Defaults to 100. Maximum number of tweets returned from a single token is 18,000. To return more than 18,000 tweets, users are encouraged to set retryonratelimit to TRUE. See details for more information.

type

Character string specifying which type of search results to return from Twitter's REST API. The current default is type = "recent", other valid types include type = "mixed" and type = "popular".

max_id

Character string specifying the [oldest] status id beyond which search results should resume returning. Especially useful large data returns that require multiple iterations interrupted by user time constraints. For searches exceeding 18,000 tweets, users are encouraged to take advantage of rtweet's internal automation procedures for waiting on rate limits by setting retryonratelimit argument to TRUE. It some cases, it is possible that due to processing time and rate limits, retreiving several million tweets can take several hours or even multiple days. In these cases, it would likely be useful to leverage retryonratelimit for sets of tweets and max_id to allow results to continue where previous efforts left off.

include_rts

Logical, indicating whether to include retweets in search results. Retweets are classified as any tweet generated by Twitter's built-in "retweet" (recycle arrows) function. These are distinct from quotes (retweets with additional text provided from sender) or manual retweets (old school method of manually entering "RT" into the text of one's tweets).

parse

Logical, indicating whether to return parsed data.frame, if true, or nested list (fromJSON), if false. By default, parse = TRUE saves users from the wreck of time and frustration associated with disentangling the nasty nested list returned from Twitter's API (for proof, check rtweet's Github commit history). As Twitter's APIs are subject to change, this argument would be especially useful when changes to Twitter's APIs affect performance of internal parsers. Setting parse = FALSE also ensures the maximum amount of possible information is returned. By default, the rtweet parse process returns nearly all bits of information returned from Twitter. However, users may occassionally encounter new or omitted variables. In these rare cases, the nested list object will be the only way to access these variables.

usr

Logical indicating whether to return a data frame of users data. Users data is stored as an attribute. To access this data, see users_data. Useful for marginal returns in memory demand. However, any gains are likely to be negligible as Twitter's API invariably returns this data anyway. As such, this defaults to true, see users_data.

token

OAuth token. By default token = NULL fetches a non-exhausted token from an environment variable. Find instructions on how to create tokens and setup an environment variable in the tokens vignette (in r, send ?tokens to console).

retryonratelimit

Logical indicating whether to wait and retry when rate limited. This argument is only relevant if the desired return (n) exceeds the remaining limit of available requests (assuming no other searches have been conducted in the past 15 minutes, this limit is 18,000 tweets). Defaults to false. Set to TRUE to automate process of conducting big searches (i.e., n > 18000). For many search queries, esp. specific or specialized searches, there won't be more than 18,000 tweets to return. But for broad, generic, or popular topics, the total number of tweets within the REST window of time (7-10 days) can easily reach the millions.

verbose

Logical, indicating whether or not to include output processing/retrieval messages. Defaults to TRUE. For larger searches, messages include rough estimates for time remaining between searches. It should be noted, however, that these time estimates only describe the amount of time between searches and not the total time remaining. For large searches conducted with retryonratelimit set to TRUE, the estimated retreival time can be estimated by dividing the number of requested tweets by 18,000 and then multiplying the quotient by 15 (token cooldown time, in minutes).

…

Futher arguments passed on to make_url. All named arguments that do not match the above arguments (i.e., count, type, etc.) will be built into the request. To return only English language tweets, for example, use lang = "en". For more options see Twitter's API documentation.

Value

List object with tweets and users each returned as a data frame.

Details

Twitter API documentation recommends limiting searches to 10 keywords and operators. Complex queries may also produce API errors preventing recovery of information related to the query. It should also be noted Twitter's search API does not consist of an index of all Tweets. At the time of searching, the search API index includes between only 6-9 days of Tweets.

Number of tweets returned will often be less than what was specified by the user. This can happen because (a) the search query did not return many results (the search pool is already thinned out from the population of tweets to begin with), (b) because user hitting rate limit for a given token, or (c) of recent activity (either more tweets, which affect pagination in returned results or deletion of tweets). To return more than 18,000 tweets in a single call, users must set retryonratelimit argument to true. This method relies on updating the max_id parameter and waiting for token rate limits to refresh between searches. As a result, it is possible to search for 50,000, 100,000, or even 10,000,000 tweets, but these searches can take hours or even days. At these durations, it would not be uncommon for connections to timeout. Users are instead encouraged to breakup data retrieval into smaller chunks by leveraging retryonratelimit and then using the status_id of the oldest tweet as the max_id to resume searching where the previous efforts left off.

Examples

Run this code

# NOT RUN {
## search for 1000 tweets mentioning Hillary Clinton
hrc <- search_tweets(q = "hillaryclinton", n = 1000)

## data frame where each observation (row) is a different tweet
hrc

## users data also retrieved. can access it via users_data()
users_data(hrc)

## search for 1000 tweets in English
djt <- search_tweets(q = "realdonaldtrump", n = 1000, lang = "en")
djt
users_data(djt)

## exclude retweets
rt <- search_tweets("rstats", n = 500, include_rts = FALSE)

## perform search for lots of tweets
rt <- search_tweets("trump OR president OR potus", n = 100000,
                    retryonratelimit = TRUE)

## plot time series of tweets frequency
ts_plot(rt, by = "mins", theme = "spacegray",
        main = "Tweets about Trump")

# }

Run the code above in your browser using DataLab