computeTfIdf(channel, tableName, docId, textColumns, parser, top = NULL, rankFunction = "rank", idSep = "-", idNull = "(null)", adjustDocumentCount = FALSE, where = NULL, stopwords = NULL, test = FALSE)
channel
connection object as returned by odbcConnect.

docId
document id. When the id is built from more than one value, the values are concatenated with idSep. Database NULLs are replaced with the idNull string.

parser
parser to apply to the text. For example, the nGram(2) parser generates 2-grams (n-grams of length 2), while the token(2) parser generates 2-word combinations of terms within documents.

top
threshold for cutting off terms by rank. If the value is greater than 0, only the top-ranking terms are included; otherwise all terms are returned (see also parameter rankFunction). Terms are always ordered by their term frequency - inverse document frequency (tf-idf) within each document. A term is filtered out when its rank is arithmetically greater than the threshold top (see details): the smaller the rank value, the more important the term.

rankFunction
one of rownumber, rank, denserank, percentrank. The rank is computed and returned for each term within each document. The function determines which SQL window function computes the term rank value (the default rank corresponds to the SQL RANK() window function). When the threshold top is greater than 0, the ranking function is used to limit the number of terms returned (see details).

idSep
separator used when concatenating document id values (see docId).

where
SQL predicates that table rows must satisfy before the computation is applied (placed inside the WHERE clause).

Use top to limit the number of terms returned by keeping only the top-ranked terms for each document. For example, with top=1000 and 100 documents, about 100,000 terms (rows) will be returned, assuming each document has at least 1000 terms. The result size can exceed this number when a rankFunction other than rownumber is used:
rownumber applies a sequential row number, starting at 1, to each term in a document. The tie-breaker behavior is as follows: terms that compare as equal in the sort order are sorted arbitrarily within the scope of the tie, and all terms are given unique row numbers.

rank assigns the current row-count number as the term's rank, provided the term does not sort as equal (tie) with another term. The tie-breaker behavior is as follows: terms that compare as equal in the sort order are sorted arbitrarily within the scope of the tie, and the sorted-as-equal terms get the same rank number.

denserank behaves like the rank function, except that it never places gaps in the rank sequence. The tie-breaker behavior is the same as that of RANK(), in that the sorted-as-equal terms receive the same rank. With denserank, however, the next term after the set of equally ranked terms gets a rank 1 higher than the preceding tied terms.

percentrank assigns a relative rank to each term, using the formula (rank - 1) / (total rows - 1). The tie-breaker behavior is as follows: terms that compare as equal are sorted arbitrarily within the scope of the tie, and the sorted-as-equal terms get the same percent rank number.

The ordering of the rows is always by their tf-idf value within each document.
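The four ranking functions described above can be imitated locally in base R, which may help when choosing rankFunction and top. This is only a sketch on made-up tf-idf scores for a single document; the vector tfidf and the variables derived from it are hypothetical, and the real ranks are computed in the database by SQL window functions, not by this code.

```r
# Made-up tf-idf scores for the terms of one document (note the tie at 0.7)
tfidf <- c(theft = 0.9, vehicle = 0.7, burglary = 0.7, assault = 0.4)

# Rank in descending tf-idf order so that the most important term gets rank 1
rownumber   <- rank(-tfidf, ties.method = "first")  # unique numbers, ties broken by position
rnk         <- rank(-tfidf, ties.method = "min")    # ties share a rank, a gap follows
denserank   <- match(tfidf, sort(unique(tfidf), decreasing = TRUE))  # ties share a rank, no gaps
percentrank <- (rnk - 1) / (length(tfidf) - 1)      # (rank - 1) / (total rows - 1)

ranks <- data.frame(term = names(tfidf),
                    tfidf = unname(tfidf),
                    rownumber = unname(rownumber),
                    rank = unname(rnk),
                    denserank = denserank,
                    percentrank = unname(percentrank))
print(ranks)

# Keep only terms whose rank does not exceed the threshold
top <- 2
kept_rownumber <- ranks$term[ranks$rownumber <= top]  # exactly 2 terms
kept_rank      <- ranks$term[ranks$rank <= top]       # 3 terms because of the tie
```

With a tie at the cut-off, rank keeps one more term than rownumber, which is why the result size can exceed top times the number of documents.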
See also: computeTf, nGram, token
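As a rough illustration of the difference between the nGram(2) and token(2) parsers mentioned in the parser description, the sketch below builds both kinds of 2-word terms from one short text in base R. It does not call the package; the words vector is made up, and reading "2-word combinations" as unordered pairs of words is an assumption, since the exact handling of order and duplicates is not stated here.

```r
# A made-up, already tokenized document
words <- c("stolen", "vehicle", "recovered", "downtown")

# 2-grams: every pair of adjacent words, in reading order
bigrams <- paste(head(words, -1), tail(words, -1))

# 2-word combinations (assumed meaning): every unordered pair of words in the document
pairs <- combn(words, 2, FUN = paste, collapse = " ")

print(bigrams)  # 3 terms
print(pairs)    # 6 terms
```

Under this assumption, word combinations grow quadratically with document length while n-grams grow linearly, so limiting the result with top is more likely to matter for token(2).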
if(interactive()){
  library(RODBC)
  library(toaster)

  # initialize connection to Dallas database in Aster
  # (connection string assembled with paste0 so it contains no line breaks)
  conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
           "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))

  # compute term-document-matrix of all 2-word ngrams of Dallas police crime reports
  # for each 4-digit zip
  tdm1 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(2, ignoreCase=TRUE,
                                   punctuation="[-.,?\\!:;~()]+"))

  # compute term-document-matrix of all 2-word combinations of Dallas police crime reports
  # for each type of offense status
  tdm2 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall", docId="offensestatus",
                      textColumns=c("offensedescription", "offensenarrative", "offenseweather"),
                      parser=token(2),
                      where="offensestatus NOT IN ('System.Xml.XmlElement', 'C')")

  # include only top 100 ranked 2-word ngrams for each 4-digit zip into resulting
  # term-document-matrix using rank function
  tdm3 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(2), top=100)

  # get top 10% ranked single-word terms for each 4-digit zip using percent rank function
  tdm4 = computeTfIdf(channel=conn, tableName="public.dallaspoliceall",
                      docId="substr(offensezip, 1, 4)",
                      textColumns=c("offensedescription", "offensenarrative"),
                      parser=nGram(1), top=0.10, rankFunction="percentrank")
}
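For readers without access to an Aster database, the following base R sketch shows what a tf-idf weight is on a tiny made-up corpus. It uses a common textbook definition (term count divided by document length, multiplied by the log of the number of documents over the number of documents containing the term); this is an assumption for illustration and may differ from the exact weighting computeTfIdf applies. The docs list and the column names of counts are hypothetical.

```r
# Three made-up documents, already split into terms
docs <- list(
  d1 = c("theft", "vehicle", "theft"),
  d2 = c("vehicle", "burglary"),
  d3 = c("theft", "assault")
)

# One row per (document, term) with the raw count of the term in the document
counts <- do.call(rbind, lapply(names(docs), function(id) {
  tab <- table(docs[[id]])
  data.frame(docid = id, term = names(tab), count = as.vector(tab),
             stringsAsFactors = FALSE)
}))

# Term frequency: term count normalized by the document length
counts$tf <- counts$count / ave(counts$count, counts$docid, FUN = sum)

# Inverse document frequency: log(number of documents / documents containing the term)
n_docs   <- length(docs)
doc_freq <- table(counts$term)
counts$idf <- log(n_docs / as.vector(doc_freq[counts$term]))

# tf-idf weight of each term within each document
counts$tf_idf <- counts$tf * counts$idf

# Order terms by descending tf-idf within each document
counts <- counts[order(counts$docid, -counts$tf_idf), ]
print(counts)
```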