psych (version 2.4.3)

describe: Basic descriptive statistics useful for psychometrics

Description

There are many summary statistics available in R; this function provides the ones most useful for scale construction and item analysis in classic psychometrics. Range is most useful for the first pass in a data set, to check for coding errors. Parallelizes if multiple cores are available.

Usage

describe(x, na.rm = TRUE, interp=FALSE,skew = TRUE, ranges = TRUE,trim=.1,
              type=3,check=TRUE,fast=NULL,quant=NULL,IQR=FALSE,omit=FALSE,data=NULL,
              size=50)
describeData(x,head=4,tail=4)
describeFast(x)

Arguments

Value

A data.frame of the relevant statistics:

item name

item number

number of valid cases

mean

standard deviation

trimmed mean (with trim defaulting to .1)

median (standard or interpolated)

mad: median absolute deviation (from the median).

minimum

maximum

skew

kurtosis

standard error
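
As a rough illustration of what several of these columns contain, the sketch below computes comparable quantities with base R only. It is not the psych implementation: the vector x is made-up data, and the use of base R's default mad scaling constant (1.4826) is an assumption about how the mad column is scaled.

```r
# Base-R approximations of several columns listed above (illustrative only).
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)   # made-up scores with one outlier

n       <- sum(!is.na(x))        # number of valid cases
m       <- mean(x)               # mean
s       <- sd(x)                 # standard deviation (n - 1 denominator)
trimmed <- mean(x, trim = 0.1)   # drops the lowest and highest 10% of cases first
med     <- median(x)             # standard (non-interpolated) median
mad_x   <- mad(x)                # median absolute deviation, scaled by 1.4826 by default
se      <- s / sqrt(n)           # standard error of the mean

round(c(n = n, mean = m, sd = s, trimmed = trimmed, median = med,
        mad = mad_x, min = min(x), max = max(x), se = se), 2)
```

The single outlier (40) pulls the mean well above the trimmed mean and the median, which is the kind of discrepancy these columns are meant to expose.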

Details

In basic data analysis it is vital to get basic descriptive statistics. Procedures such as summary and Hmisc::describe do so. The describe function in the psych package is meant to produce the most frequently requested statistics in psychometric and psychology studies, and to produce them in an easy to read data.frame. If a grouping variable is called for in formula mode, it will also call describeBy to do the processing. The results from describe can be used in graphics functions (e.g., error.crosses).

The range statistics (min, max, range) are most useful for data checking to detect coding errors, and should be found in early analyses of the data.

Although describe will work on data frames as well as matrices, it is important to realize that for data frames, descriptive statistics will be reported only for those variables where this makes sense (i.e., not for alphanumeric data).

If the check option is TRUE, variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name. This is somewhat slower. Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded "high school", "some college" , "finished college", then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).
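
The coding issue described above can be reproduced with base R alone. This is a minimal sketch using the education example from the text; the variable names are hypothetical.

```r
# Factor levels are sorted alphabetically unless an order is given.
education <- factor(c("high school", "some college", "finished college"))
levels(education)       # alphabetical order of the three labels
as.numeric(education)   # numeric codes follow that alphabetical order

# Declaring the intended order explicitly yields meaningful codes.
education_ordered <- factor(education,
                            levels = c("high school", "some college",
                                       "finished college"))
as.numeric(education_ordered)
```

Setting the levels before describing the data is one way to make the statistics for a starred variable interpretable.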

In a typical study, one might read the data in from the clipboard (read.clipboard), show the splom plot of the correlations (pairs.panels), and then describe the data.

na.rm=FALSE is equivalent to describe(na.omit(x)); that is, any case with a missing value is dropped before the statistics are found.
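
The difference between dropping whole cases with na.omit and removing missing values one variable at a time can be seen in base R. The sketch below uses a made-up data frame and colMeans rather than describe itself; that the default na.rm=TRUE behaves like the per-variable case is an assumption here.

```r
# Made-up data with one missing value in each column.
x <- data.frame(a = c(1, 2, 3, NA), b = c(10, NA, 30, 40))

per_variable <- colMeans(x, na.rm = TRUE)   # each column uses its own 3 valid cases
listwise     <- colMeans(na.omit(x))        # only rows 1 and 3 are complete

per_variable
listwise
nrow(na.omit(x))                            # number of complete cases
```

With listwise deletion every statistic is based on the same cases, at the cost of discarding partially observed rows.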

When finding the skew and the kurtosis, there are three different options available. These match the choices available in skewness and kurtosis found in the e1071 package (see Joanes and Gill (1998) for the advantages of each one).

If we define \(m_r = [\sum(X- mx)^r]/n\) then

Type 1 finds skewness and kurtosis by \(g_1 = m_3/(m_2)^{3/2} \) and \(g_2 = m_4/(m_2)^2 -3\).

Type 2 is \(G_1 = g_1 \sqrt{n(n-1)}/(n-2)\) and \(G_2 = (n-1)[(n+1) g_2 + 6]/((n-2)(n-3))\).

Type 3 is \(b_1 = [(n-1)/n]^{3/2} m_3/m_2^{3/2}\) and \(b_2 = [(n-1)/n]^{2} m_4/m_2^2 - 3\).
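
The three types can be compared side by side with a small base-R function. This is a sketch following the Joanes and Gill (1998) definitions, in which Type 3 divides the central moments by powers of the sample standard deviation, so that b_1 = m_3/s^3 and b_2 = m_4/s^4 - 3. The function name skew_kurt and the data are hypothetical, and this is not the psych source code.

```r
# Skewness and excess kurtosis under the three definitions (illustrative sketch).
skew_kurt <- function(x, type = 3) {
  x  <- x[!is.na(x)]
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- sum(d^2) / n            # central moments with denominator n
  m3 <- sum(d^3) / n
  m4 <- sum(d^4) / n
  g1 <- m3 / m2^(3/2)           # Type 1 skewness
  g2 <- m4 / m2^2 - 3           # Type 1 excess kurtosis
  switch(as.character(type),
    "1" = c(skew = g1, kurtosis = g2),
    "2" = c(skew = g1 * sqrt(n * (n - 1)) / (n - 2),
            kurtosis = (n - 1) * ((n + 1) * g2 + 6) / ((n - 2) * (n - 3))),
    "3" = c(skew = g1 * ((n - 1) / n)^(3/2),
            kurtosis = (g2 + 3) * ((n - 1) / n)^2 - 3),
    stop("type must be 1, 2, or 3"))
}

x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)   # made-up data
sapply(1:3, function(t) skew_kurt(x, type = t))
```

The three types differ only in their small-sample corrections, so the columns of the result converge as n grows.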

The additional helper function describeData just scans the data array and reports whether the data are all numerical, logical/factorial, or categorical. This is a useful check to run before finding descriptive statistics on very large data sets, where the check option is set to FALSE to improve speed.

An even faster overview of the data is describeFast, which reports the total number of cases, the number of complete cases, the number of numeric variables, and the number of variables that are factors.

The fast=TRUE option will lead to a speedup of about 50% for larger problems by not finding all of the statistics (see NOTE).

To describe the data for different groups, see describeBy or specify the grouping variable(s) in formula mode (see the examples).

References

Joanes, D.N. and Gill, C.A (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.

See Also

describeBy, skew, kurtosi, interp.median, read.clipboard. Then, for graphic output, see error.crosses, pairs.panels, error.bars, error.bars.by, and densityBy or violinBy.

Examples

data(sat.act)
describe(sat.act)
describe(sat.act ~ gender) #formula mode option calls describeBy for the entire data frame
describe(SATV + SATQ ~ gender, data=sat.act) #formula mode specifies just two variables

describe(sat.act,skew=FALSE)
describe(sat.act,IQR=TRUE) #show the interquartile Range
describe(sat.act,quant=c(.1,.25,.5,.75,.90) ) #find the 10th, 25th, 50th, 75th and 90th percentiles
 
describeData(sat.act) #the fast version just  gives counts and head and tail

print(describeFast(sat.act),short=FALSE)  #even faster is just counts  (just less information)  

#now show how to adjust the displayed number of digits
 des <- describe(sat.act)  #find the descriptive statistics.  Keep the original accuracy
 des  #show the normal output, which is rounded to 2 decimals
 print(des,digits=3)  #show the output, but round to 3 (trailing) digits
 print(des, signif=3) #round all numbers to 3 significant digits
