CITAN-package: CITation ANalysis toolpack

Description

CITAN is a library of functions useful in, but not limited to, quantitative research in the field of scientometrics. It contains various tools for preprocessing bibliographic data retrieved from, e.g., Elsevier's SciVerse Scopus and for computing the bibliometric impact of individuals. Moreover, some functions dealing with Pareto-Type II (GPD) and Discretized Pareto-Type II statistical models are included (e.g., Zhang-Stephens and MLE estimators, goodness-of-fit and two-sample tests, confidence intervals for the theoretical Hirsch index, etc.). They may be used to describe and analyze many phenomena encountered in the social sciences.
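
As a rough illustration of the Pareto-Type II model mentioned above, the sketch below implements its distribution function and inverse-transform sampling in base R. The helper names (pareto2_cdf, pareto2_quantile, pareto2_sample) are hypothetical and are not CITAN functions. The parametrization \(F(x)=1-(1+x/s)^{-k}\) for \(x\ge 0\), with shape \(k>0\) and scale \(s>0\), is an assumption and may differ from the one used by the package.

```r
# Hypothetical helpers, not part of CITAN.
# Assumed parametrization: F(x) = 1 - (1 + x/s)^(-k) for x >= 0,
# with shape k > 0 and scale s > 0; F(x) = 0 for x < 0.
pareto2_cdf <- function(x, k, s) {
  stopifnot(k > 0, s > 0)
  1 - (1 + pmax(x, 0) / s)^(-k)
}

# Quantile function obtained by solving F(x) = p for x.
pareto2_quantile <- function(p, k, s) {
  stopifnot(k > 0, s > 0, all(p >= 0 & p < 1))
  s * ((1 - p)^(-1 / k) - 1)
}

# Inverse-transform sampling: push uniform variates through the quantile function.
pareto2_sample <- function(n, k, s) {
  pareto2_quantile(runif(n), k, s)
}

set.seed(1)
x <- pareto2_sample(1000, k = 2, s = 1)
# The empirical and theoretical probabilities of {X <= 1} should be close.
c(empirical = mean(x <= 1), theoretical = pareto2_cdf(1, k = 2, s = 1))
```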

Details

Fair and objective methods for assessing individual scientists have been the focus of scientometricians' attention since the very beginning of their discipline. A quantitative expression of some characteristics of the publication-citation process is assumed to be a predictor of broadly conceived scientific competence. It may be used, e.g., in building decision support systems for scientific quality control.

The \(h\)-index, proposed by J.E. Hirsch (2005), is among the most popular scientific impact indicators. An author who has published \(n\) papers has a Hirsch index equal to \(H\) if \(H\) of these papers were cited at least \(H\) times each and each of the remaining \(n-H\) papers was cited no more than \(H\) times. This simple bibliometric tool quickly attracted much attention in the academic community and became a subject of intensive research. It was noted that, contrary to earlier approaches such as publication count or citation count, this measure reflects both the productivity and the impact of an individual.
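
The definition above translates directly into a few lines of base R. The following is a minimal sketch, not the implementation used by CITAN; the function name h_index is hypothetical.

```r
# Hypothetical helper, not part of CITAN: h-index of a vector of citation counts.
h_index <- function(citations) {
  stopifnot(is.numeric(citations), all(citations >= 0))
  sorted <- sort(citations, decreasing = TRUE)
  # In non-increasing order, the condition sorted[i] >= i holds exactly
  # for the first H positions, so counting the TRUE values yields H.
  sum(sorted >= seq_along(sorted))
}

h_index(c(10, 8, 5, 4, 3))  # 4: four papers have at least 4 citations each
h_index(c(25, 8, 5, 3, 3))  # 3
h_index(numeric(0))         # 0: no papers
```

Sorting first keeps the computation to a single vectorized comparison, at the cost of \(O(n\log n)\) time.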

In a broader perspective, this issue is a special case of the so-called Producer Assessment Problem (PAP; see Gagolewski, Grzegorzewski, 2010b).

Consider a producer (e.g. a writer, scientist, artist, craftsman) and a nonempty set of his products (e.g. books, papers, works, goods). Suppose that each product is given a rating (of quality, popularity, etc.) which is a single number in \(I=[a,b]\), where \(a\) denotes the lowest admissible valuation. We typically choose \(I=[0,\infty]\) (an interval in the extended real line). Some instances of the PAP are listed below.

	Producer	Products	Rating method	Discipline
A	Scientist	Scientific articles	Number of citations	Scientometrics
B	Scientific institute	Scientists	The h-index	Scientometrics
C	Web server	Web pages	Number of in-links	Webometrics
D	Artist	Paintings	Auction price	Auctions
E	Billboard company	Advertisements	Sale results	Marketing

Each possible state of a producer's activity can therefore be represented by a point \(x\in I^n\) for some \(n\). Our aim is thus to construct and analyze, both theoretically and empirically, aggregation operators (cf. Grabisch et al., 2009) which can be used for rating producers. A family of such functions should take the following two aspects of a producer's quality into account:

the ability to make highly-rated products,
overall productivity, \(n\).
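
To make the two aspects concrete, here is a sketch of two aggregation functions from the literature cited in this page: the g-index (Egghe, 2006) and MAXPROD (Kosmulski, 2007). The function names are hypothetical, and the definitions follow the commonly stated forms (the g-index is capped at \(n\) here). They are not necessarily identical to the variants implemented in CITAN.

```r
# Hypothetical helpers, not part of CITAN.

# g-index: the largest g (at most n) such that the g most cited
# products have at least g^2 ratings in total.
g_index <- function(x) {
  s <- sort(x, decreasing = TRUE)
  # For non-increasing s the condition holds on a prefix of positions.
  sum(cumsum(s) >= seq_along(s)^2)
}

# MAXPROD: the maximum of i * s[i] over the non-increasingly sorted ratings.
maxprod <- function(x) {
  if (length(x) == 0L) return(0)
  s <- sort(x, decreasing = TRUE)
  max(s * seq_along(s))
}

x <- c(10, 8, 5, 4, 3)
g_index(x)        # 5
maxprod(x)        # 16

# Adding a product rated 0 raises productivity n without adding quality;
# neither value decreases.
g_index(c(x, 0))  # 5
maxprod(c(x, 0))  # 16
```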

For a more formal treatment, please refer to (Gagolewski, Grzegorzewski, 2011).

To preprocess and analyze bibliometric data (cf. Gagolewski, 2011) retrieved from e.g. Elsevier's SciVerse Scopus we need the RSQLite package. It is an interface to the free SQLite DataBase Management System (see http://www.sqlite.org/). All data is stored in a so-called Local Bibliometric Storage (LBS), created with the lbsCreate function.

The data frames Scopus_ASJC and Scopus_SourceList contain various information on the current source coverage of SciVerse Scopus. They may be needed during the creation of the LBS; see lbsCreate for more details. License information: these data are publicly available and hence no special permission is needed to redistribute them (information from Elsevier).

CITAN is able to import publication data from Scopus CSV files (saved with settings "Output: complete format" or "Output: Citations only", see Scopus_ReadCSV). Note that the output limit in Scopus is 2000 entries per file. Therefore, to perform bibliometric research we often need to divide the query results into many parts. CITAN is able to merge them back even if records are repeated.
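
The idea of merging several export files while discarding repeated records can be illustrated in base R. This sketch does not show how CITAN performs the merge internally. The data frames stand in for two already-read CSV parts, and the column names (Title, Year, Citations) as well as the choice of Title plus Year as the duplicate key are assumptions made for the example.

```r
# Two hypothetical parts of one query result; "Paper B" appears in both.
part1 <- data.frame(
  Title = c("Paper A", "Paper B"),
  Year = c(2009, 2010),
  Citations = c(12, 3),
  stringsAsFactors = FALSE
)
part2 <- data.frame(
  Title = c("Paper B", "Paper C"),
  Year = c(2010, 2011),
  Citations = c(3, 7),
  stringsAsFactors = FALSE
)

# Stack the parts, then keep the first occurrence of each (Title, Year) pair.
merged <- rbind(part1, part2)
merged <- merged[!duplicated(merged[c("Title", "Year")]), ]
rownames(merged) <- NULL

merged$Title  # "Paper A" "Paper B" "Paper C"
```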

The data may be accessed via functions from the DBI interface. However, some typical tasks may be automated using, e.g., lbsDescriptiveStats (basic description of the whole sample or its subsets, called ‘Surveys’), lbsGetCitations (gather citation sequences of selected authors), and lbsAssess (mass-compute impact functions' values for given citation sequences).
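
For readers unfamiliar with DBI, the sketch below shows the general pattern of querying an SQLite database from R. It assumes the DBI and RSQLite packages are installed. It uses an in-memory database with a hypothetical table toy_documents, which is not the actual schema of a Local Bibliometric Storage.

```r
library(DBI)

# In-memory SQLite database; a real LBS lives in a file instead.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table: one row per document.
toy <- data.frame(
  author = c("A", "A", "A", "B", "B"),
  citations = c(10L, 4L, 1L, 7L, 0L),
  stringsAsFactors = FALSE
)
dbWriteTable(conn, "toy_documents", toy)

# Per-author paper and citation counts, computed on the database side.
res <- dbGetQuery(conn, "
  SELECT author, COUNT(*) AS papers, SUM(citations) AS total_citations
  FROM toy_documents
  GROUP BY author
  ORDER BY author
")
dbDisconnect(conn)

res
```

Plain SQL of this kind is useful for ad hoc questions, while the lbs* functions cover the routine tasks.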

There are also some helper functions (in **EXPERIMENTAL** stage) that use the RGtk2 library (see Lawrence, Lang, 2010) to suggest which documents or authors should be merged; see lbsFindDuplicateTitles and lbsFindDuplicateAuthors.

For a complete list of functions, call library(help="CITAN").

Keywords: Hirsch's h-index, Egghe's g-index, L-statistics, S-statistics, bibliometrics, scientometrics, informetrics, webometrics, aggregation operators, arity-monotonicity, impact functions, impact assessment.

References

GTK+ Project, http://www.gtk.org
SQLite DBMS, http://www.sqlite.org/
Dubois D., Prade H., Testemale C. (1988). Weighted fuzzy pattern matching, Fuzzy Sets and Systems 28, 313-331.
Egghe L. (2006). Theory and practise of the g-index, Scientometrics 69(1), 131-152.
Gagolewski M., Grzegorzewski P. (2009). A geometric approach to the construction of scientific impact indices, Scientometrics 81(3), 617-634.
Gagolewski M., Debski M., Nowakiewicz M. (2009). Efficient algorithms for computing "geometric" scientific impact indices, Research Report of Systems Research Institute, Polish Academy of Sciences RB/1/2009.
Gagolewski M., Grzegorzewski P. (2010a). S-statistics and their basic properties, In: Borgelt C. et al. (Eds.), Combining Soft Computing and Statistical Methods in Data Analysis, Springer-Verlag, 281-288.
Gagolewski M., Grzegorzewski P. (2010b). Arity-monotonic extended aggregation operators, In: Hullermeier E., Kruse R., Hoffmann F. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, CCIS 80, Springer-Verlag, 693-702.
Gagolewski M. (2011). Bibliometric Impact Assessment with R and the CITAN Package, Journal of Informetrics 5(4), 678-692.
Gagolewski M., Grzegorzewski P. (2011a). Axiomatic Characterizations of (quasi-) L-statistics and S-statistics and the Producer Assessment Problem, for Fuzzy Logic and Technology (EUSFLAT/LFA 2011), Atlantic Press, 53-58.
Gagolewski M., Grzegorzewski P. (2011b). Possibilistic analysis of arity-monotonic aggregation operators and its relation to bibliometric impact assessment of individuals, International Journal of Approximate Reasoning 52(9), 1312-1324.
Grabisch M., Pap E., Marichal J.-L., Mesiar R. (2009). Aggregation functions, Cambridge.
Hirsch J.E. (2005). An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences 102(46), 16569-16572.
Kosmulski M. (2007). MAXPROD - A new index for assessment of the scientific output of an individual, and a comparison with the h-index, Cybermetrics 11(1).
Lawrence M., Lang D.T. (2010). RGtk2: A graphical user interface toolkit for R, Journal of Statistical Software 37(8), 1-52.
Woeginger G.J. (2008). An axiomatic characterization of the Hirsch-index, Mathematical Social Sciences 56(2), 224-232.
Zhang J., Stephens M.A. (2009). A New and Efficient Estimation Method for the Generalized Pareto Distribution, Technometrics 51(3), 316-325.