3404 occurrences of four synonymous Finnish ‘think’ verbs (‘ajatella’: 1492; ‘miettiä’: 812; ‘pohtia’: 713; ‘harkita’: 387) in newspaper and Internet newsgroup discussion texts
data(think)
A data frame with 3404 observations on the following 27 variables:
Lexeme: A factor specifying one of the four ‘think’ verb synonyms
Polarity: A factor specifying whether the ‘think’ verb has negative polarity (Negation) or not (Other)
Voice: A factor specifying whether the ‘think’ verb is in the Passive voice or not (Other)
Mood: A factor specifying whether the ‘think’ verb is in the Indicative or Conditional mood or not (Other)
Person: A factor specifying whether the ‘think’ verb is in the First, Second, Third person or not (None)
Number: A factor specifying whether the ‘think’ verb is in the Plural number or not (Other)
Covert: A factor specifying whether the agent/subject of the ‘think’ verb is explicitly expressed as a syntactic argument (Overt), or only as a morphological feature of the ‘think’ verb (Covert)
ClauseEquivalent: A factor specifying whether the ‘think’ verb is used as a non-finite clause equivalent (ClauseEquivalent) or as a finite verb (FiniteVerbChain)
Agent: A factor specifying the occurrence of the Agent/Subject of the ‘think’ verb as a Human Individual, a Human Group, or as absent (None)
Patient: A factor specifying the occurrence of the Patient/Object argument among the semantic or structural subclasses as either a Human Individual or Group (IndividualGroup), Abstraction, Activity, Communication, Event, an ‘että’ (‘that’) clause (etta_CLAUSE), DirectQuote, IndirectQuestion, Infinitive, Participle, or as absent (None)
Manner: A factor specifying the occurrence of the Manner argument as any of its subclasses Generic, Negative (sufficiency), Positive (sufficiency), Frame, Agreement (Concur or Disagree), Joint (Alone or Together), or as absent (None)
Time: A factor specifying the occurrence of the Time argument (as a moment) as either of its subclasses Definite or Indefinite, or as absent (None)
Modality1: A factor specifying the main semantic subclasses of the entire Verb chain as indicating either Possibility, Necessity, or their absence (None)
Modality2: A factor specifying minor semantic subclasses of the entire Verb chain as indicating either a Temporal element (begin, end, continuation, etc.), External (cause), Volition, Accidental nature of the thinking process, or their absence (None)
Source: A factor specifying the occurrence of a Source argument, or its absence (None)
Goal: A factor specifying the occurrence of a Goal argument, or its absence (None)
Quantity: A factor specifying the occurrence of a Quantity argument, or its absence (None)
Location: A factor specifying the occurrence of a Location argument, or its absence (None)
Duration: A factor specifying the occurrence of a Duration argument, or its absence (None)
Frequency: A factor specifying the occurrence of a Frequency argument, or its absence (None)
MetaComment: A factor specifying the occurrence of a MetaComment, or its absence (None)
ReasonPurpose: A factor specifying the occurrence of a Reason or Purpose argument (ReasonPurpose), or their absence (None)
Condition: A factor specifying the occurrence of a Condition argument, or its absence (None)
CoordinatedVerb: A factor specifying the occurrence of a Coordinated Verb (in relation to the ‘think’ verb: CoordinatedVerb), or its absence (None)
Register: A factor specifying whether the ‘think’ verb occurs in the newspaper subcorpus (hs95) or the Internet newsgroup discussion corpus (sfnet)
Section: A factor specifying the subsection in which the ‘think’ verb occurs in either of the two subcorpora
Author: A factor specifying the author of the text in which the ‘think’ verb occurs, if that author is identifiable; authors in the Internet newsgroup discussion subcorpus are anonymized, and an unidentifiable or unknown author is designated as (None)
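The variable list above can be checked against the loaded data frame. The following is only a minimal sketch: it assumes the ndl package is installed, and it does nothing beyond printing a message when it is not. The documented size (3404 rows, 27 columns) is the only figure taken from this page; nothing else about the output is asserted here.

```r
# Sketch: inspect the 'think' data frame, assuming the 'ndl' package is installed.
if (requireNamespace("ndl", quietly = TRUE)) {
  data(think, package = "ndl")

  # Documented size: 3404 observations on 27 variables.
  print(dim(think))

  # Frequency of each of the four synonyms (the outcome variable).
  print(table(think$Lexeme))

  # Number of levels per column, to see which factors are binary
  # (nlevels() returns 0 for any column that is not a factor).
  print(sapply(think, nlevels))
} else {
  message("Package 'ndl' is not installed; skipping the inspection.")
}
```

Looking at the level counts first is useful because several variables (for example Polarity, Voice, Number) are described as simple presence/absence factors, while others such as Patient and Manner have many subclasses.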
The four most frequent synonyms meaning ‘think, reflect, ponder,
consider’, i.e. ‘ajatella, miettiä, pohtia, harkita’, were extracted
from two months of newspaper text from the 1990s (Helsingin Sanomat
1995) and six months of Internet newsgroup discussion from the early
2000s (SFNET 2002-2003), namely regarding (personal) relationships
(sfnet.keskustelu.ihmissuhteet) and politics
(sfnet.keskustelu.politiikka). The newspaper corpus consisted of
3,304,512 words of body text (i.e. excluding headers and captions as
well as punctuation tokens), and included 1,750 examples of the
studied ‘think’ verbs. The Internet corpus comprised 1,174,693 words of
body text, yielding 1,654 instances of the selected ‘think’
verbs. In terms of distinct identifiable authors, the newspaper
sub-corpus was the product of just over 500 journalists and other
contributors, while the Internet sub-corpus involved well over 1,000
discussants. The think dataset contains a selection of 26
contextual features judged as most informative.
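Because the two subcorpora differ considerably in size, the raw counts quoted above are easier to compare once normalized. The following base R sketch uses only the word and instance counts given in this section; the variable names (words, hits, per_million) are illustrative and not part of the package.

```r
# Figures quoted in the text above.
words <- c(hs95 = 3304512, sfnet = 1174693)  # words of body text per subcorpus
hits  <- c(hs95 = 1750,    sfnet = 1654)     # 'think' verb instances per subcorpus

# The two subcorpora together account for the whole dataset.
stopifnot(sum(hits) == 3404)

# Relative frequency of the studied verbs per million words of body text.
per_million <- hits / words * 1e6
print(round(per_million, 1))

# Ratio of the newsgroup rate to the newspaper rate.
print(round(per_million[["sfnet"]] / per_million[["hs95"]], 2))
```

Normalizing per million words is a common corpus-linguistic convention; any other base would serve equally well as long as it is applied to both subcorpora.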
For extensive details of the data and its linguistic and statistical
analysis, see Arppe (2008). For the full selection of contextual
features, see the amph (2008) microcorpus.
Arppe, A. 2008. Univariate, bivariate and multivariate methods in corpus-based lexicography -- a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3.
Arppe, A. 2009. Linguistic choices vs. probabilities -- how much and what can linguistic theory explain? In: Featherston, Sam & Winkler, Susanne (eds.) The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 1-24.
# NOT RUN {
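library(ndl)
data(think)
# Fit a naive discriminative learning model predicting the verb from five contextual features
think.ndl <- ndlClassify(Lexeme ~ Person + Number + Agent + Patient + Register,
   data = think)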
summary(think.ndl)
plot(think.ndl)
# }
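To get a feel for what a predictor such as Register contributes to a model like the one above, a plain cross-tabulation against the outcome can be a helpful first look. This is a sketch using base R functions only (xtabs, prop.table); it assumes the ndl package is installed, and the name lexeme_by_register is introduced here purely for illustration.

```r
# Sketch: distribution of the four verbs across the two subcorpora.
# Assumes the 'ndl' package is installed; otherwise nothing is computed.
lexeme_by_register <- NULL
if (requireNamespace("ndl", quietly = TRUE)) {
  data(think, package = "ndl")

  # Counts of each verb in each register (hs95 vs. sfnet).
  lexeme_by_register <- xtabs(~ Lexeme + Register, data = think)
  print(lexeme_by_register)

  # Column-wise shares: the proportion of each verb within a register.
  print(round(prop.table(lexeme_by_register, margin = 2), 3))
}
```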