extract.tweets: Connect to Mongo database and extract tweets that match conditions specified in the arguments.

Description

extract.tweets opens a connection to the Mongo database on the lab computer and returns tweets that match a series of conditions: whether the tweet contains a certain keyword, whether or not it is a retweet, and whether or not it contains a hashtag. The fields of each tweet to be extracted can be specified. If desired, it can also return a fixed number of tweets drawn as a random sample of all tweets in the database.

Usage

extract.tweets(set, string = NULL, size = 0, fields = c("created_at", "user.screen_name", "text"), retweets = NULL, hashtags = NULL, from = NULL, to = NULL, user_id = NULL, screen_name = NULL, verbose = TRUE)

Arguments

set

string, name of the collection of tweets in the Mongo database to query.

string

string or vector of strings, set to NULL by default (will return all tweets). If it is a single string, it will return tweets that contain that string. If it is a vector of strings, it will return all tweets that contain at least one of them.

size

numeric, set to 0 by default (will return all tweets that match other conditions). If it is strictly between 0 and 1, it will return that proportion of the matching tweets (e.g. 0.5 implies 50% of all tweets that match other conditions will be returned). If it is 1 or greater, it will return a random sample of that size drawn from the tweets that match the specified conditions.

fields

vector of strings, indicates the fields of each tweet that will be returned. Default is the date and time of the tweet, its text, and the screen name of the user who published it. See Details for a list of possible fields.

retweets

logical, set to NULL by default (will return all tweets). If TRUE, will return only tweets that are retweets (i.e. contain an embedded retweeted status; manual retweets are not included). If FALSE, will return only tweets that are not retweets (manual retweets are included in this case).

hashtags

logical, set to NULL by default (will return all tweets). If TRUE, will return only tweets that use a hashtag. If FALSE, will return only tweets that do not contain a hashtag.

from

date, in string format. If different from NULL, will consider only tweets after that date. Note that using this field requires that the tweets have a field in ISODate format called timestamp. All times are GMT.

to

date, in string format. If different from NULL, will consider only tweets before that date. The same timestamp requirement as for from applies.

user_id

vector of numeric IDs for users. If different from NULL, will return only tweets sent by that set of Twitter users (if there are any in the collection).

screen_name

screen name of a user. If different from NULL, will return only tweets sent by that Twitter user (if there are any in the collection).

verbose

logical, default is TRUE, which generates some output to the R console with information about the count of tweets.
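
The three regimes of the size argument can be summarised with a small local sketch. The helper n_to_return below is hypothetical and not part of the package; it only restates the rule described under size. How the package rounds a fractional count is not documented here, so the use of round() is an assumption.

```r
# Hypothetical helper (not part of the package): how many tweets a given
# `size` value would select out of `n_matching` tweets that match the query.
n_to_return <- function(size, n_matching) {
  if (size == 0) {
    return(n_matching)                  # 0: return every matching tweet
  }
  if (size > 0 && size < 1) {
    return(round(n_matching * size))    # proportion (rounding is an assumption)
  }
  min(size, n_matching)                 # 1 or greater: fixed-size random sample
}

n_to_return(0, 2000)    # 2000
n_to_return(0.5, 2000)  # 1000
n_to_return(100, 2000)  # 100
```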

Details

The following is a non-exhaustive list of relevant fields that can be specified in the fields argument (for a complete list, check the documentation at: https://dev.twitter.com/docs/platform-objects).

Tweet: text, created_at, id_str, favorite_count, source, retweeted, retweet_count, lang, in_reply_to_status_id, in_reply_to_screen_name

Entities: entities.hashtags, entities.user_mentions, entities.urls

Retweeted_status: retweeted_status.text, retweeted_status.created_at... (and all other tweet, user, and entities fields)

User: user.screen_name, user.id_str, user.geo_enabled, user.location, user.followers_count, user.statuses_count, user.friends_count, user.description, user.lang, user.name, user.url, user.created_at, user.time_zone

Geo: geo.coordinates
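
Field names with a dot, such as user.screen_name, refer to values nested inside the tweet object. The sketch below illustrates that reading on a hand-built list in plain R; the tweet object and the get_field helper are made up for illustration and do not reflect how the package itself queries Mongo.

```r
# A made-up tweet represented as a nested R list (illustration only).
tweet <- list(
  text = "hello world",
  user = list(screen_name = "example_user", followers_count = 10)
)

# Hypothetical helper: follow a dotted path one key at a time.
get_field <- function(x, path) {
  for (key in strsplit(path, ".", fixed = TRUE)[[1]]) {
    x <- x[[key]]
  }
  x
}

get_field(tweet, "text")              # "hello world"
get_field(tweet, "user.screen_name")  # "example_user"
```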

Examples

## Not run: 
# ## connect to the Mongo database
#  mongo <- mongo.create("SMAPP_HOST:PORT", db="DATABASE")
#  mongo.authenticate(mongo, username="USERNAME", password="PASSWORD", db="DATABASE")
#  set <- "DATABASE.COLLECTION"
# 
# ## extract text from all tweets in the database
#  tweets <- extract.tweets(set, fields="text")
# 
# ## extract random sample of 10% of tweets, with text and screen name
#  tweets <- extract.tweets(set, fields=c("user.screen_name", "text"), size=0.10)
# 
# ## extract random sample of 100 tweets that are not retweets
#  tweets <- extract.tweets(set, size=100, retweets=FALSE)
# 
# ## extract all tweets that mention turkey
#  tweets <- extract.tweets(set, string="turkey")
# 
# ## extract all tweets that mention 'occupygezi' and do a quick plot
#  tweets <- extract.tweets(set, string="occupygezi", fields="created_at")
#  plot(tweets)
# ## End(Not run)
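
The examples above need a live Mongo connection. The "at least one of them" behaviour of a vector passed to string can be illustrated locally on a few made-up tweets. This is only a sketch of the matching rule: combining the keywords into one regular expression and ignoring case are assumptions, not a description of how the package builds its query.

```r
# Made-up tweet texts (illustration only).
texts <- c("Protests in Turkey continue",
           "#occupygezi is trending",
           "Good morning everyone")

keywords <- c("turkey", "occupygezi")

# Keep a tweet if it contains at least one keyword. The keywords are used
# as regular expressions here, so special characters would need escaping.
pattern <- paste(keywords, collapse = "|")
matches <- grepl(pattern, texts, ignore.case = TRUE)

texts[matches]  # the first two texts
```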