CITAN-package: CITation ANalysis toolpack

Description

CITAN is a library of functions useful in, but not limited to, quantitative research in the field of scientometrics. It contains various tools for preprocessing bibliometric data retrieved e.g. from Elsevier's SciVerse Scopus and for calculating the impact of individuals. Many functions dealing with the Pareto-Type II (GPD) and Discretized Pareto-Type II statistical models are also included (e.g. Zhang-Stephens and MLE estimators, goodness-of-fit and two-sample tests, confidence intervals for the theoretical Hirsch index, etc.). They may be used to describe and analyze many phenomena encountered in the social sciences.

Details

Fair and objective methods for assessing individual scientists have been the focus of scientometricians' attention since the very beginning of their discipline. A quantitative expression of some characteristics of the publication-citation process is assumed to be a predictor of broadly conceived scientific competence. It may be used e.g. in building decision support systems for science quality control.

The $h$-index, proposed by J.E. Hirsch (2005), is among the most popular scientific impact indicators. An author who has published $n$ papers has the Hirsch index equal to $H$ if $H$ of these papers were each cited at least $H$ times and each of the other $n-H$ papers was cited no more than $H$ times. This simple bibliometric tool quickly received much attention in the academic community and became a subject of intensive research. It was noted that, contrary to earlier approaches such as publication count or citation count, this measure concerns both the productivity and the impact of an individual. In a broader perspective, this issue is a special case of the so-called Producer Assessment Problem (Gagolewski, Grzegorzewski, 2010b).
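The definition of the $h$-index can be illustrated with a minimal sketch. It is written in Python purely for illustration and is independent of CITAN's own implementation; the function name `h_index` is an assumption of this example.

```python
def h_index(citations):
    """Return the Hirsch index of a sequence of non-negative citation counts.

    Illustrative sketch only; not the CITAN implementation.
    """
    # Rank papers from the most cited to the least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break         # counts only decrease from here on
    return h


print(h_index([10, 8, 5, 4, 3, 0]))  # 4
```

Here four papers have at least 4 citations each, while the remaining two have no more than 4, so the index equals 4.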

Consider a producer (e.g. a writer, scientist, artist, craftsman) and a nonempty set of the producer's products (e.g. books, papers, works, goods). Suppose that each product is given a rating (of quality, popularity, etc.) which is a single number in $I=[a,b]$, where $a$ denotes the lowest admissible valuation. We typically choose $I=[0,\infty]$ (an interval in the extended real line). Some instances of such a situation are listed below.

| | Producer | Products | Rating method | Discipline |
|---|---|---|---|---|
| A | Scientist | Scientific articles | Number of citations | Scientometrics |
| B | Scientific institute | Scientists | The $h$-index | Scientometrics |
| C | Web server | Web pages | Number of in-links | Webometrics |
| D | Artist | Paintings | Auction price | Auctions |
| E | Billboard company | Advertisements | Sale results | Marketing |

Each possible state of a producer's activity can be described by a point in $I^n$ for some $n$. The Producer Assessment Problem (PAP) involves constructing and analyzing, both theoretically and empirically, aggregation operators (see Grabisch et al., 2009) which can be used for rating producers. A family of such functions should take into account the following two aspects of a producer's quality:

the ability to make highly-rated products,
overall productivity.

For a more formal treatment see e.g. Gagolewski, Grzegorzewski (2011). The CITAN package consists of four types of tools.

(1) Given a numeric vector, the first class of functions computes the values of certain impact functions. Among them we have:

Hirsch's $h$-index (Hirsch, 2005; see index.h),
Egghe's $g$-index (Egghe, 2006; see index.g),
the $r_p$ and $l_p$ indices (Gagolewski, Grzegorzewski, 2009a, 2009b; see index.rp and index.lp), which generalize the $h$-index, the $w$-index (Woeginger, 2008), and the MAXPROD-index (Kosmulski, 2007),
S-statistics (Gagolewski, Grzegorzewski, 2010a, 2011; see Sstat and Sstat2), which generalize the OWMax operators (Dubois et al., 1988) and the $h$- and $r_\infty$-indices.
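To make two of the impact functions listed above more concrete, the following Python sketch computes Egghe's $g$-index and Kosmulski's MAXPROD index from a vector of citation counts. It is an independent illustration, not the code behind index.g or index.lp. The function names are hypothetical, and restricting $g$ to at most the number of papers is an assumption of this sketch.

```python
from itertools import accumulate


def g_index(citations):
    """Largest g such that the g most cited papers have at least g^2 citations in total.

    Assumption of this sketch: g cannot exceed the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    g = 0
    for rank, total in enumerate(accumulate(ranked), start=1):
        if total >= rank * rank:
            g = rank
    return g


def maxprod_index(citations):
    """Largest product rank * (citations of the paper at that rank), or 0 if empty."""
    ranked = sorted(citations, reverse=True)
    return max((rank * count for rank, count in enumerate(ranked, start=1)), default=0)


papers = [10, 8, 5, 4, 3, 0]
print(g_index(papers))        # 5
print(maxprod_index(papers))  # 16
```

For this vector the cumulative citation totals are 10, 18, 23, 27, 30, 30, which stay at or above 1, 4, 9, 16, 25 but fall below 36, hence $g=5$.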

(2) To preprocess and analyze bibliometric data retrieved e.g. from Elsevier's SciVerse Scopus, we need the RSQLite package. It is an interface to the free SQLite database management system (see http://www.sqlite.org/). All data are stored in a so-called Local Bibliometric Storage (LBS), created with the lbsCreate function.

The data frames Scopus_ASJC and Scopus_SourceList contain various information on the current source coverage of SciVerse Scopus. They may be needed during the creation of the LBS; see lbsCreate for more details. License information: these data are publicly available and hence no special permission is needed to redistribute them (information from Elsevier).

CITAN is able to import publication data from Scopus CSV files (saved with settings "Output: complete format", see Scopus_ReadCSV). Note that the output limit in Scopus is 2000 entries per file. Therefore, to perform bibliometric research we often need to divide the query results into many parts. CITAN is able to merge them back even if some records are repeated across files.
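The idea of merging several exported parts while skipping repeated records can be sketched as follows. This is a generic Python illustration, not the procedure CITAN uses internally. The column names `Title`, `Year`, and `Cited by`, the choice of key fields, and the function name `merge_csv_chunks` are all assumptions made for this example.

```python
import csv
import io


def merge_csv_chunks(chunks, key_fields):
    """Merge several CSV texts, keeping the first occurrence of each record.

    Two rows count as the same record when their key fields match
    after trimming whitespace and ignoring case (an assumption of this sketch).
    """
    seen = set()
    merged = []
    for text in chunks:
        for row in csv.DictReader(io.StringIO(text)):
            key = tuple(row[field].strip().lower() for field in key_fields)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


# Two hypothetical export parts that overlap in one record.
part1 = "Title,Year,Cited by\nPaper A,2009,12\nPaper B,2010,3\n"
part2 = "Title,Year,Cited by\nPaper B,2010,3\nPaper C,2011,0\n"

records = merge_csv_chunks([part1, part2], key_fields=("Title", "Year"))
print([r["Title"] for r in records])  # ['Paper A', 'Paper B', 'Paper C']
```

In practice a unique record identifier, if the export provides one, would be a safer key than title and year.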

The data may be accessed via functions from the DBI interface. However, some typical tasks may be automated using e.g. lbsDescriptiveStats (basic description of the whole sample or its subsets, called Surveys), lbsGetCitations (gather citation sequences of selected authors), and lbsAssess (mass-compute impact functions' values for given citation sequences).

There are also some helpful functions (in **EXPERIMENTAL** stage) which use the RGtk2 library (see Lawrence, Lang, 2010) to display some suggestions on which documents or authors should be merged, see lbsFindDuplicateTitles and lbsFindDuplicateAuthors.

(3) Additionally, a set of functions dealing with stochastic aspects of S-statistics, the $h$-index, and statistical models based on the Pareto type-II family of distributions is included (Gagolewski, Grzegorzewski, 2010a). These are listed below.

Functions that work for any continuous distribution (see Gagolewski, Grzegorzewski, 2010a):
1. psstat, dsstat for computing the distribution of S-statistics generated by a control function,
2. phirsch, dhirsch for computing the distribution of the Hirsch index,
3. rho.get for computing the so-called $\rho$-index ($\rho_\kappa$), which is a particular location characteristic of a given probability distribution depending on a control function $\kappa$.
Tools for the Pareto-type II family:
1. ppareto2, dpareto2, qpareto2, rpareto2 for general functions dealing with the Pareto distribution of the second kind, including the c.d.f., p.d.f., quantiles and random deviates,
2. pareto2.phirsch, pareto2.dhirsch for computing the distribution of the Hirsch index (much faster than the generalized versions),
3. pareto2.htest: two-sample $h$-test for equality of shape parameters based on the difference of $h$-indices,
4. pareto2.htest.approx: two-sample asymptotic (approximate) $h$-test,
5. pareto2.ftest: two-sample exact F-test for equality of shape parameters,
6. pareto2.zsestimate: estimation of parameters using the Bayesian method (MMSE) developed by Zhang and Stephens (2009),
7. pareto2.mlekestimate, pareto2.mleksestimate: estimation of parameters using the MLE,
8. discrpareto2.mlekestimate, discrpareto2.mleksestimate: estimation of parameters using the MLE in the case of the Discretized Pareto-type II distribution,
9. pareto2.goftest: goodness-of-fit tests,
10. pareto2.confint.rho and pareto2.confint.rho.approx: exact and approximate (asymptotic) confidence intervals for the $\rho$-index based on S-statistics,
11. pareto2.confint.h: exact confidence intervals for the theoretical $h$-index.
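The quantities handled by the tools above can be sketched in a few lines. The Python code below is an illustration under stated assumptions, not the CITAN implementation: it assumes the Pareto type II parameterization with shape $k>0$ and scale $s>0$, i.e. $F(x)=1-(s/(s+x))^k$ for $x\ge 0$, and all function names are hypothetical. The Hirsch index distribution uses the fact that, for $n$ i.i.d. observations, $H\ge m$ holds exactly when at least $m$ of them are not smaller than $m$, which is a binomial event.

```python
import math
import random


def pareto2_cdf(x, k, s):
    """C.d.f. F(x) = 1 - (s / (s + x))^k for x >= 0 (assumed parameterization)."""
    return 0.0 if x <= 0 else 1.0 - (s / (s + x)) ** k


def pareto2_pdf(x, k, s):
    """P.d.f. f(x) = k * s^k / (s + x)^(k + 1) for x >= 0."""
    return 0.0 if x < 0 else k * s ** k / (s + x) ** (k + 1)


def pareto2_quantile(p, k, s):
    """Inverse of the c.d.f. for 0 <= p < 1."""
    return s * ((1.0 - p) ** (-1.0 / k) - 1.0)


def pareto2_sample(n, k, s, rng=random):
    """Random deviates by inverse transform sampling."""
    return [pareto2_quantile(rng.random(), k, s) for _ in range(n)]


def mle_shape_known_scale(sample, s):
    """Maximum likelihood estimate of k when s is known: n / sum(log(1 + x_i / s))."""
    return len(sample) / sum(math.log1p(x / s) for x in sample)


def hirsch_pmf(h, n, cdf):
    """P(H = h) for n i.i.d. observations from a continuous c.d.f. `cdf`."""
    def prob_at_least(m):
        if m <= 0:
            return 1.0
        if m > n:
            return 0.0
        q = 1.0 - cdf(m)  # probability that one observation is >= m
        return sum(math.comb(n, j) * q ** j * (1.0 - q) ** (n - j)
                   for j in range(m, n + 1))
    return prob_at_least(h) - prob_at_least(h + 1)


rng = random.Random(1)                      # fixed seed for reproducibility
data = pareto2_sample(20000, k=2.0, s=1.0, rng=rng)
print(mle_shape_known_scale(data, s=1.0))   # should be close to the true shape 2.0

pmf = [hirsch_pmf(h, 10, lambda x: pareto2_cdf(x, 2.0, 1.0)) for h in range(11)]
print(sum(pmf))                             # probabilities over h = 0..10 sum to 1
```

The specialized routines mentioned in the list are described as much faster than such a generic computation, so this sketch is meant only to clarify what is being computed.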

(4) Moreover, we have implemented some simple graphical methods that may be used to illustrate various aspects of the data being analyzed, see plot.citfun, curve.add.rp, and curve.add.lp. Please feel free to send any comments and suggestions (e.g. to include some new bibliometric impact indices) to the author (see also http://www.ibspan.waw.pl/~gagolews).

For a complete list of functions, use library(help="CITAN"). Keywords: Hirsch's h-index, Egghe's g-index, L-statistics, S-statistics, bibliometrics, scientometrics, informetrics, webometrics, aggregation operators, impact functions, impact assessment.

References

GTK+ Project, http://www.gtk.org/download.html
SQLite DBMS, http://www.sqlite.org/
Dubois D., Prade H., Testemale C., Weighted fuzzy pattern matching, Fuzzy Sets and Systems 28, 313-331, 1988.
Egghe L., Theory and practise of the g-index, Scientometrics 69(1), 131-152, 2006.
Gagolewski M., Grzegorzewski P., Possibilistic analysis of arity-monotonic aggregation operators and its relation to bibliometric impact assessment of individuals, International Journal of Approximate Reasoning, 2011, doi:10.1016/j.ijar.2011.01.010.
Gagolewski M., Grzegorzewski P., A geometric approach to the construction of scientific impact indices, Scientometrics 81(3), 617-634, 2009a.
Gagolewski M., Debski M., Nowakiewicz M., Efficient algorithms for computing "geometric" scientific impact indices, Research Report of Systems Research Institute, Polish Academy of Sciences RB/1/2009, 2009b.
Gagolewski M., Grzegorzewski P., S-statistics and their basic properties, In: Borgelt C. et al. (Eds.), Combining Soft Computing and Statistical Methods in Data Analysis, Springer-Verlag, 281-288, 2010a.
Grabisch M., Pap E., Marichal J.-L., Mesiar R., Aggregation functions, Cambridge, 2009.
Hirsch J.E., An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences 102(46), 16569-16572, 2005.
Kosmulski M., MAXPROD - A new index for assessment of the scientific output of an individual, and a comparison with the h-index, Cybermetrics 11(1), 2007.
Lawrence M., Lang D.T., RGtk2: A graphical user interface toolkit for R, Journal of Statistical Software 37(8), 1-52, 2010.
Woeginger G.J., An axiomatic characterization of the Hirsch-index, Mathematical Social Sciences 56(2), 224-232, 2008.
Zhang J., Stephens M.A., A New and Efficient Estimation Method for the Generalized Pareto Distribution, Technometrics 51(3), 316-325, 2009.