
Score Metric Equivalence of the Psychopathy Checklist–Revised (PCL-R) Across Criminal Offenders in North America and the United Kingdom: A Critique of Cooke, Michie, Hart, and Clark (2005) and New Analyses

Daniel M. Bolt, University of Wisconsin

Robert D. Hare, University of British Columbia and Darkstone Research Group, Canada

Craig S. Neumann, University of North Texas

David Cooke and colleagues have published a series of item response theory (IRT) studies investigating the equivalence of the Psychopathy Checklist–Revised (PCL-R) for European versus North American (NA) male criminal offenders. They have consistently concluded that PCL-R scores are not equivalent, with European offenders receiving scores up to five points lower than those in NA when matched according to the latent trait. In this article, the authors critique the Cooke et al. analyses and demonstrate how their anchor item selection method is responsible for their final conclusions concerning the apparent lack of equivalence. The authors provide a competing IRT analysis using an iterative purification strategy for anchor item selection and show how this more justifiable approach leads to very different conclusions regarding the equivalence of the PCL-R. More generally, it is argued that strong interpretations of IRT analyses in the presence of uncorroborated anchor items can be highly misleading when evaluating score metric equivalence.

Keywords: psychopathy; item response theory; differential item functioning; Psychopathy Checklist–Revised; PCL-R

Psychopathy has been described as perhaps the single most important clinical construct in the criminal justice system (Hare, 1996) and, more recently, as "what may be the most important forensic concept of the early 21st century" (Monahan, 2006). The Hare Psychopathy Checklist–Revised

(PCL-R; Hare, 1991, 2003) has become a widely accepted international standard for the assessment of psychopathy (Acheson, 2005). The PCL-R is a construct rating scale consisting of 20 items that represent defining characteristics of the psychopathic personality disorder. Items are

Assessment, Volume 14, No. 1, March 2007 44-56 DOI: 10.1177/1073191106293505 © 2007 Sage Publications

Downloaded from asm.sagepub.com at PENNSYLVANIA STATE UNIV on May 8, 2016

Bolt et al. / PCL-R IN AMERICA AND UNITED KINGDOM 45

scored by a trained rater as definitely not present (0), possibly present (1), or definitely present (2), following both an interview and file review of the focus respondent.

More than most psychological instruments, the PCL-R has been subjected to intense critical scrutiny, in large part because of the key role it plays in a variety of criminal justice contexts, including risk assessment, release decisions, dangerous offender and sexually violent predator (SVP) determinations, death penalty hearings, evaluations of treatment suitability, and so forth. PCL-R scores can vary from 0 to 40, reflecting the extent to which an individual matches the prototypical psychopath. Psychiatrists and psychologists often use various cut scores for a diagnosis of psychopathy and for making forensic decisions that can have serious implications for individuals and for society. Because the PCL-R is used in many different countries and jurisdictions, the adoption of a defensible cut score becomes very important when making decisions about offenders. More generally, does a given PCL-R score have the same meaning with respect to psychopathy in different countries, cultures, and ethnic groups? If not, what local score adjustments should be made?

Several studies have now examined the psychometric equivalence of the PCL-R across diverse populations (see, e.g., Bolt, Hare, Vitale, & Newman, 2004; Cooke & Michie, 1999; Cooke, Michie, Hart, & Clark, 2005a, 2005b; Hare & Neumann, 2005; Salekin, Rogers, & Sewell, 1997; Windle & Dumenci, 1999). Such studies are important for several reasons. First, most of the initial validation work done with the PCL-R was performed using North American (NA) male criminal offenders and may or may not generalize to the many other populations in which the PCL-R is currently used.
Both the internal psychometric characteristics of the instrument (e.g., interrater reliability, internal consistency) and its external validity (e.g., prediction of treatment amenability, recidivism) may be affected when the instrument is used in different populations. A second concern is the possibility that some items on the PCL-R may be affected by nonpsychopathy-related factors associated with group membership. For example, the PCL-R item "juvenile delinquency" is a characteristic that tends to be disproportionately more prevalent among men than among women even after accounting for men's higher mean level of psychopathy (see Bolt et al., 2004), making the item's performance quite different when assessing psychopathy in women. At one level, such group-related factors can affect the degree to which the items on the PCL-R cohere in their measurement of a common trait and thus alter the meaning of the underlying construct being measured. Equally significant, however, are the effects that these factors may have on the meaningfulness of specific PCL-R scores as indicators of specific levels of psychopathy. Despite their measurement of a common psychopathy construct, the presence of PCL-R items that disproportionately favor or disfavor a particular subpopulation (as juvenile delinquency performs for women) raises important questions about the score metric equivalence of the PCL-R when used among members of that group.

Score metric equivalence exists when PCL-R scores can be assumed to represent the same levels of the trait across groups. Among other implications, a lack of score metric equivalence may necessitate different diagnostic or research cut scores across different populations. For example, although psychopathy and its components, as measured by the PCL-R, appear to be distributed dimensionally (Edens, Marcus, Lilienfeld, & Poythress, 2006; Guay, Ruscio, Knight, & Hare, 2006; see also Marcus, John, & Edens, 2004, for evidence of the dimensional nature of psychopathy measured by self-report), a cut score of 30 on the PCL-R often is used for diagnostic and research purposes. This score provided the best diagnostic efficiency with respect to global clinical assessments of psychopathy prior to development of the PCL, the precursor to the PCL-R (Hare, 2003). It also has proven useful for researchers who wish to compare individuals with a relatively high dose of psychopathic features with those at the lower end of the distribution, typically a score of 20 or less.
The matter of cut scores takes on particular importance when they are used to make a diagnosis of psychopathy, to facilitate decisions about treatment options, to assess risk for recidivism and violence, or to help determine which sex offenders are to be civilly committed (United States) or to be considered for designation as a "dangerous and severe personality disorder" (DSPD, United Kingdom). Cut scores other than 30 also may be adopted, for example, to attend to the relative costs of false negatives and false positives in risk assessment, as in the use of receiver operating characteristic (ROC) analyses (e.g., Douglas, Strand, Belfrage, Fransson, & Levander, 2005; Quinsey, Harris, Rice, & Cormier, 2005). Hare (2003) also notes the possibility of a more dimensional interpretation of the PCL-R score metric, with different regions of the score scale distinguishing individuals having very low to very high levels of psychopathy. Verification of score metric equivalence naturally applies to such conditions as well because a lack of score metric equivalence implies the need for a corresponding revision to the interpretative framework applied in using PCL-R scores.

Authors' Note: Correspondence concerning this article should be addressed to Daniel M. Bolt, Department of Educational Psychology, 1025 W. Johnson, Room 859, Madison, WI 53705; e-mail: [email protected].

Three studies by Cooke and colleagues have derived some fairly strong conclusions regarding the direction and magnitude of bias in the PCL-R scores, to the point of recommending lower PCL-R cut scores for identifying equivalent levels of psychopathy in Europe compared to NA. Cooke and Michie (1999) claimed the need for cut scores five points lower than in NA for identifying psychopathy in Scottish inmates (e.g., a score of 25 to match an NA score of 30). More recently, Cooke et al. (2005a) suggested an adjustment of approximately two points in the United Kingdom (UK), with UK offenders again having lower cut scores. Cooke et al. (2005b) extended the same recommendation to a more general European population. As Cooke et al. (2005a) note, the consequences of these adjustments are quite significant, with a larger proportion of European offenders being identified as psychopathic than would have been the case if the traditional NA cut scores were used. The basis for the Cooke et al. (2005a, 2005b) and Cooke and Michie (1999) recommendations has been differential item functioning (DIF) analyses using item response theory (IRT).

In this article, we raise serious concerns about the analyses by Cooke and his colleagues. In particular, we note a fundamental problem in their approach to selecting anchor items, an approach we show to be responsible for their final conclusions regarding the apparent lack of score metric equivalence and their recommendations for alternative cut scores. Using large PCL-R data sets described in Hare (2003), we provide a competing IRT analysis based on an alternative and more justifiable approach for exploratory selection of anchor items, referred to as iterative purification (Lord, 1980). Using NA and UK criminal offender samples similar to that used by Cooke et al. (2005a), we show how the alternative analysis actually implies minimal differences in the PCL-R score metric across populations. More generally, we review the inherent limitations in the use of a strictly internal psychometric methodology such as IRT when identifying score metric equivalence and argue that any interpretation of IRT results in the presence of uncorroborated anchor items can be highly misleading.
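The iterative purification strategy (Lord, 1980) used in our competing analysis can be sketched as a simple loop. The `dif_test` argument below is a hypothetical stand-in for any per-item DIF test (such as the likelihood ratio test) that returns a p-value for a studied item given a candidate anchor set; the function name and signature are illustrative, not from the article:

```python
def purify_anchors(items, dif_test, alpha=0.05, max_iter=20):
    """Iteratively drop DIF-flagged items from the anchor set.

    Starts from all items as provisional anchors, tests each item for
    DIF against the remaining candidates, removes flagged items, and
    repeats until the anchor set stops changing.
    """
    anchors = set(items)
    for _ in range(max_iter):
        # Test each item using the other provisional anchors as the anchor set
        flagged = {i for i in items if dif_test(i, anchors - {i}) < alpha}
        new_anchors = set(items) - flagged
        if new_anchors == anchors:  # converged: anchor set unchanged
            return sorted(new_anchors)
        anchors = new_anchors
    return sorted(anchors)
```

The key contrast with the Cooke et al. procedure is that purification never compares parameter estimates across unequated metrics; each round re-tests items against a provisional anchor set whose parameters are constrained equal across groups.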

ITEM RESPONSE THEORY AND DIFFERENTIAL ITEM FUNCTIONING

IRT encompasses a class of statistical models that characterizes the relationship between a latent trait and expected scores on a test item. When items are scored using more than two score categories, such as on the PCL-R, a model such as the Graded Response Model (GRM; Samejima, 1969) is frequently used. The GRM uses logistic curves to model the cumulative score probabilities of an item (see Bolt et al., 2004; Cooke et al., 2005b, for examples). Because PCL-R items are scored using three score categories, application of the GRM requires three parameters per item: a discrimination parameter a, which represents the degree to which item scores are affected by the latent trait, and threshold parameters b1 and b2, which identify the latent trait levels at which scores at or above 1 and 2, respectively, become more likely than item scores at or below 0 and 1, respectively.

The appeal of using IRT in studying DIF can be attributed to the invariance properties of IRT models. Because IRT models characterize the relationship between trait level and item scores, they allow for comparisons of item functioning across groups in a way that is not affected by potential differences in how the trait is distributed across groups. This is in contrast to classical psychometric approaches that consider characteristics of the item score distributions (e.g., differences in the mean item score across groups).

There is a variety of different IRT-based approaches that can be taken to the study of DIF (see Penfield & Camilli, in press, for a recent review). A common strategy is the likelihood ratio test (Thissen, Steinberg, & Wainer, 1993), the procedure also used in the Cooke et al. studies. In this procedure, the item parameters of a studied item are tested for equivalence across groups. The procedure involves the comparison of two models, one in which the parameters of the studied item are set equal across groups (a compact model) and one in which the parameters are free to vary (an augmented model). A separate log-likelihood statistic is computed for each model. The difference between these log-likelihood statistics follows a chi-square distribution under a null hypothesis of no DIF in the studied item. However, a critical issue in any implementation of DIF studies involves selection of anchor items, an issue discussed shortly.
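As a concrete sketch of the GRM just described, the category probabilities for a three-category (0/1/2) item can be computed from the two cumulative logistic curves; the parameter values below are illustrative, not estimates from the article:

```python
import math

def grm_probs(theta, a, b1, b2):
    """Graded Response Model probabilities for a 0/1/2 item.

    P*(k) = 1 / (1 + exp(-a * (theta - b_k))) is the cumulative
    probability of scoring at or above category k; the three category
    probabilities are differences of adjacent cumulative curves.
    """
    def p_star(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p1, p2 = p_star(b1), p_star(b2)
    return (1.0 - p1, p1 - p2, p2)  # P(X=0), P(X=1), P(X=2)

# Illustrative item: moderate discrimination, thresholds straddling 0
probs = grm_probs(theta=0.0, a=1.4, b1=-0.8, b2=0.8)
```

At theta = b1 the cumulative probability of scoring 1 or higher is exactly .5, which is the defining property of the threshold parameter. The likelihood ratio test then asks whether (a, b1, b2) can be held equal across groups without a significant loss of fit, a chi-square test with three degrees of freedom for a GRM item, matching the degrees of freedom noted in Table 3.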

METHOD

Data for the Current Analysis

The data studied in the current analysis consist of item score patterns from the PCL-R collected for 3,847 NA male offenders and 1,117 UK male offenders. All items were scored using the traditional 0, 1, 2 scoring and based on the standard interview and file review process. PCL-R item labels, along with item statistics for each of the two samples, are shown in Table 1. Total scale score statistics resulted in a higher mean score for the NA respondents



TABLE 1
PCL-R Items and Descriptive Statistics Across NA and UK Criminal Offender Populations, Hare (2003) Data

PCL-R Item (Psychopathy Factor)                       NA M   NA SD   UK M   UK SD
1. Glibness/superficial charm (1)                     0.84   0.71    0.39   0.59
2. Grandiose sense of self-worth (1)                  0.92   0.74    0.55   0.70
3. Need for stimulation/proneness to boredom (3)      1.27   0.72    0.96   0.75
4. Pathological lying (1)                             0.99   0.71    0.47   0.64
5. Conning/manipulative (1)                           1.06   0.77    0.84   0.73
6. Lack of remorse or guilt (2)                       1.46   0.67    1.18   0.73
7. Shallow affect (2)                                 0.97   0.74    0.62   0.70
8. Callous/lack of empathy (2)                        1.19   0.72    0.81   0.72
9. Parasitic lifestyle (3)                            1.08   0.67    0.95   0.75
10. Poor behavioral controls (4)                      1.21   0.78    1.00   0.79
11. Promiscuous sexual behavior                       1.16   0.81    0.93   0.82
12. Early behavior problems (4)                       0.94   0.83    0.79   0.83
13. Lack of realistic, long-term goals (3)            1.20   0.74    0.98   0.78
14. Impulsivity (3)                                   1.30   0.69    1.08   0.70
15. Irresponsibility (3)                              1.36   0.68    1.06   0.71
16. Failure to accept responsibility (2)              1.32   0.72    1.12   0.72
17. Many short-term marital relationships             0.75   0.84    0.71   0.84
18. Juvenile delinquency (4)                          1.18   0.86    0.94   0.80
19. Revocation of conditional release (4)             1.48   0.79    0.97   0.87
20. Criminal versatility (4)                          1.26   0.79    0.82   0.81

SOURCE: The items are from Hare (1991, 2003). Copyright 1991 R. D. Hare and Multi-Health Systems, 3770 Victoria Park Avenue, Toronto, Ontario, M2H 3M6, Canada. All rights reserved. Reprinted by permission.
NOTE: Note that the item titles cannot be scored without reference to the formal criteria contained in the Psychopathy Checklist–Revised (PCL-R) manual. NA = North American male criminal offenders (N = 3,847); UK = United Kingdom male criminal offenders (N = 1,117).

(M = 22.9, SD = 7.6, α = .84) than for the UK respondents (M = 17.3, SD = 7.5, α = .84). This mean score difference is consistent with the mean differences reported in previous studies and provides a potential reason for exploring the possibility of score bias across these populations. A multigroup IRT analysis of the NA and UK data sets was presented in the second revision of the PCL-R manual (Hare, 2003, Appendix B). Bolt et al. (2004) also performed an IRT-based DIF analysis using the same NA data set but compared it against alternative populations (forensic psychiatric offenders, female offenders, male offenders scored from file reviews).

In conducting IRT analyses with these data, we note that a preliminary requirement is test unidimensionality, which implies that the items as a collection measure only one underlying trait. A single factor model fit to the polychoric matrix using LISREL 8.71 suggested that the unidimensionality assumption of IRT was reasonably well satisfied for the NA data (χ2 = 2,382.9, df = 170, p < .01, Comparative Fit Index [CFI] = .94, Tucker-Lewis Index [TLI] = .93, root mean square error of approximation [RMSEA] = .065). Neumann and colleagues (Neumann, Hare, & Newman, in press; Neumann, Vitacco, Hare, & Wupperman, 2005) also found good support for unidimensionality in this data set. For the UK data, the same model provided a somewhat worse fit, although the misfit could be attributed to a relatively small number of item pairs demonstrating local dependence. A respecified model freeing up 10 measurement error covariances provided a more reasonable fit (χ2 = 2,479.0, df = 160, p < .01, CFI = .93, TLI = .92, RMSEA = .084), suggesting at least approximate unidimensionality with the UK data. The largest of these covariances (i.e., items 6 and 8, 6 and 16, and 1 and 2) correspond to item pairs frequently associated with common secondary factors, as described next.

Despite consistent agreement as to the presence of a pervasive single factor underlying the PCL-R, different multifactor models have been advocated for different subsets of the PCL-R items (e.g., Cooke & Michie, 2001; Hare, 2003; Vitacco, Neumann, & Jackson, 2005). The most inclusive of these models is a four-factor model that includes 18 of the 20 PCL-R items and categorizes the items as Interpersonal (Factor 1), Affective (Factor 2), Lifestyle (Factor 3), and Antisocial (Factor 4) characteristics (Hare, 2003; Neumann et al., 2005, in press). Other models, such as Cooke and Michie's three-factor model, involve a similar categorization but attend to items related to the first three factors. Table 1 identifies the items belonging to each factor according to the four-factor model.

Such models naturally open up the possibility of studying differential item functioning with respect to a multidimensional model. We have considered a unidimensional IRT analysis here, however, for several reasons. First, as noted, a unidimensional model appears to provide a good approximation to the data for both samples, and IRT applications are generally robust to violations of unidimensionality in its strictest sense. Second, most practical use of the PCL-R involves the total score, making the study of item performance against a single latent trait potentially more meaningful for practical applications. Third, and finally, we seek to mimic the Cooke et al. approaches as closely as possible, both to illustrate the similarity between their data and ours and to indicate where we contend their analyses are problematic.
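A quick, informal complement to the confirmatory factor checks reported above is an eigenvalue screen on the inter-item correlation (or polychoric) matrix: a first eigenvalue that dwarfs the second is consistent with a single dominant factor. A minimal sketch with an illustrative toy matrix (not PCL-R data):

```python
import numpy as np

def dominance_ratio(corr):
    """Ratio of the first to second eigenvalue of a correlation matrix.

    A large ratio is a rough indicator that one factor dominates; it is
    a screen, not a substitute for the model-based fit indices (CFI,
    TLI, RMSEA) reported in the text.
    """
    eigs = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return eigs[0] / eigs[1]

# Toy 3-item matrix with one strong common factor (illustrative values)
toy = np.array([[1.00, 0.60, 0.50],
                [0.60, 1.00, 0.55],
                [0.50, 0.55, 1.00]])
ratio = dominance_ratio(toy)
```

For roughly equicorrelated items like these, the first eigenvalue is several times the second, which is the pattern one expects when a unidimensional IRT model is a reasonable approximation.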

RESULTS

The Importance of Anchor Item Selection in the Interpretation of DIF and Its Consequences at the Test Score Level

A critical component in determining score metric equivalence using IRT methods involves appropriate selection of anchor items. Anchor items are items that are assumed to perform equivalently across groups and thus provide the basis for equating IRT trait metrics so that the remaining items on the test can be meaningfully compared across populations. As a collection, anchor items are sometimes referred to as a "valid subtest" (Chang, Mazzeo, & Roussos, 1996) because they represent what are assumed to be real manifestations of the trait that are not influenced by other characteristics related to group membership. Once anchor items have been identified, their item parameters can be constrained across groups so that the remaining nonanchor items (i.e., studied items) can be tested for DIF. By constraining the parameters of the anchor items to be equivalent, a common latent trait metric is defined against which studied items can be compared and ultimately against which the score metric equivalence of the entire instrument can be evaluated.

The process of identifying anchor items is perhaps the most challenging issue in making sense of DIF analyses (see Camilli, 1993; Embretson & Reise, 2000, pp. 259-262; Penfield & Camilli, in press; Thissen et al., 1993, for extended discussions of these issues). For some test instruments, the identification of anchor items is fairly straightforward due to the known or presumed unbiased nature of certain test items. For example, in educational measurement applications, it is common to test only new or pilot items on a given form of the test for DIF, where the operational items (items previously tested and found valid on earlier forms) are used as anchors. With many psychological test instruments such as the PCL-R, however, the identification of anchor items is less clear. An exploratory search for items that can function as anchor items is thus needed.

To identify anchor items, Cooke and Michie (1999) and Cooke et al. (2005a, 2005b) adopted a common strategy. As described in Cooke et al. (2005a), a baseline model was first fit in which "the mean level of the latent trait and all item parameters were allowed to vary across groups" (p. 337). They then "identified items with similar parameters across cultures to serve as 'anchors' for the estimation of a common measure" and later, for each factor, they "selected the item with the smallest cross-cultural differences in the b parameters" (p. 337). This is an inappropriate way of identifying anchor items for a very fundamental reason. Specifically, the baseline model is not statistically identified; the metrics of the parameters for the comparison groups have not been equated, implying that the differences in b parameter estimates for a common item across groups are meaningless. Consequently, a change in one group's latent trait mean can be compensated for by a corresponding change in that group's b estimates while holding the b estimates for the other group the same. That is, the b parameter estimates reported for one group can collectively be shifted a constant value up or down (while holding the threshold estimates for the other group constant) without changing the statistical fit of the model to the data.
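This indeterminacy is easy to verify directly under the GRM: subtracting a constant c from a group's trait values and from all of that group's thresholds leaves every model-implied probability, and hence the likelihood, unchanged. A minimal numerical check (c = .9, the shift used below for the UK* solution; the other parameter values are illustrative):

```python
import math

def grm_cat_probs(theta, a, b1, b2):
    # GRM category probabilities for a 0/1/2 item
    p = lambda b: 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return (1.0 - p(b1), p(b1) - p(b2), p(b2))

# The model depends on theta and b only through (theta - b), so
# shifting both by the same constant c changes nothing:
theta, a, b1, b2, c = 0.3, 1.4, -1.0, 0.7, 0.9
original = grm_cat_probs(theta, a, b1, b2)
shifted = grm_cat_probs(theta - c, a, b1 - c, b2 - c)
# original and shifted agree up to floating-point error
```

Because the data cannot distinguish the two parameterizations, raw cross-group differences in b estimates from such a baseline model carry no information about which items are unbiased.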
The lack of statistical identification and its consequences on anchor item selection are illustrated in Table 2. The NA and UK columns display the GRM solution reported in Cooke et al. (2005a), with rows in bold identifying the anchor items they chose based on their minimal b parameter differences across groups. The items selected as anchors in Cooke et al. (2005a) were 5, 6 and 9. The UK* column represents a statistically equivalent solution to that provided by Cooke et al. that is obtained by simply lowering the latent trait mean among UK offenders by .9 units (an arbitrary selection statistically but one chosen here for illustration purposes). Because of the statistical indeterminacy of Cooke et al.’s (2005a) model, there is no statistical justification for the UK solution reported in Table 2 (and in Cooke et al., 2005a) compared to the UK* solution—they are statistically equivalent. However, the use of the UK* as opposed to the UK solution has a dramatic effect on the selection of



TABLE 2
Cooke et al.'s (2005a) Baseline Graded Response Model (GRM) Solution for NA and UK Offenders, and Statistically Equivalent UK Solution (UK*) Based on Alternative UK Latent Mean

              NA                  UK                 UK*(a)
Item       a    b1    b2      a    b1    b2      a    b1    b2
1 (U)     1.4  –0.5   1.3    1.2   0.4   2.2    1.2  –0.5   1.3
2         1.6  –0.7   0.9    1.3   0.0   1.3    1.3  –0.9   0.4
3         1.4  –1.7  –0.2    1.3  –1.2   0.4    1.3  –2.1  –0.5
4         1.4  –1.0   0.8    1.4  –0.2   1.2    1.4  –1.1   0.3
5 (C)     1.4  –0.8   0.8    1.4  –1.0   0.7    1.4  –1.9  –0.2
6 (C)     1.8  –1.8  –0.3    1.7  –1.6  –0.2    1.7  –2.5  –1.1
7         1.7  –1.2   0.4    1.8  –0.7   0.7    1.8  –1.6  –0.2
8         2.0  –1.4   0.2    2.1  –0.8   0.6    2.1  –1.7  –0.3
9 (C)     0.9  –1.8   1.1    0.9  –1.6   1.1    0.9  –2.5   0.2
13        1.2  –1.7   0.1    1.0  –1.1   0.4    1.0  –2.0  –0.5
14 (U)    1.3  –2.3  –0.5    1.0  –1.4   0.4    1.0  –2.3  –0.5
15 (U)    1.3  –2.3  –0.3    1.0  –1.8   0.6    1.0  –2.7  –0.3
16        1.1  –1.6   0.2    1.0  –1.8   0.6    1.0  –2.7  –0.3

NOTE: (a) UK parameter estimates when the UK latent trait mean is arbitrarily decreased by .9 units. (C) = Cooke et al.'s (2005a) anchor items (shown in bold in the original); (U) = anchor items implied by the UK* solution (shown in bold italics in the original).

anchor items when using Cooke et al.'s criterion. In the UK* solution, items 1, 14, and 15 demonstrate the smallest differences in the threshold estimates and thus would be chosen as anchors.

The consequences of choosing different anchor items become apparent when comparing test characteristic curves (TCCs) across groups. TCCs illustrate the relationship between the latent trait and expected score on the test instrument, thus providing the basis for evaluating score metric equivalence. Figure 1 displays TCCs for both the NA and UK offenders based on the GRM estimates obtained using the Hare (2003) data. The graph on the left illustrates the TCCs when using Cooke et al.'s (2005a) anchor items (items 5, 6, and 9), whereas the graph to the right shows the TCCs when using items 1, 14, and 15 as anchors. The graph on the left largely replicates the findings of Cooke et al. (2005a), with an expected score difference of approximately three points across all levels of the latent trait. By contrast, the figure on the right suggests a very different result. In particular, no difference in expected scores occurs at higher levels of the trait, the trait levels of significance in defining PCL-R cut points, a finding consistent with the analyses presented in Hare (2003, Appendix B). Of course, other statistically equivalent solutions could lead to the choice of still other items as anchors. The point of this illustration is to demonstrate that the process by which Cooke et al. (2005a) chose anchor items is methodologically unsound and essentially leads to an arbitrary result in terms of the direction and/or existence of overall bias in the instrument.

This is not the only serious concern that we have with the anchor items chosen by Cooke et al. (2005a). It should be noted that inspection of their solution (and also of the estimates in Cooke & Michie, 1999) shows a pattern not unexpected in the absence of an equating of the IRT trait metrics across groups: The b estimates in the UK are consistently higher than those observed among NA offenders. Because UK offenders as a population achieve lower scores on the PCL-R (and may have a lower distribution of the psychopathy trait) than do NA offenders, their b estimates in general tend to be higher across items. Only on those items for which UK offenders score disproportionately highest (relative to NA offenders) do they end up having b estimates that look the same when compared to NA offenders. The same problem occurs (albeit with respect to different items) in Cooke and Michie (1999). The end result of Cooke et al.'s approach is that they ultimately selected as anchors those items for which UK offenders scored disproportionately highest relative to the other items on the instrument (i.e., items likely biased in providing higher scores in the UK when matched according to their true levels of psychopathy). The next section illustrates the problematic features of the Cooke et al. approach to selecting anchor items.
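A TCC of the kind compared in Figure 1 is obtained by summing expected item scores across the 20 items at each trait level. A minimal sketch under the GRM; the three parameter triples below are illustrative placeholders, not the Table 2 estimates:

```python
import math

def expected_item_score(theta, a, b1, b2):
    # Under the GRM, E[X] = P(X >= 1) + P(X >= 2) for a 0/1/2 item
    p = lambda b: 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p(b1) + p(b2)

def tcc(theta, item_params):
    """Test characteristic curve: expected total score at trait level theta.

    `item_params` is a list of (a, b1, b2) tuples, one per item; for the
    full PCL-R this would contain 20 triples, giving a curve from 0 to 40.
    """
    return sum(expected_item_score(theta, a, b1, b2)
               for a, b1, b2 in item_params)

# Three illustrative items; values chosen only to span easy and hard items
demo_items = [(1.4, -0.5, 1.3), (1.8, -1.8, -0.3), (0.9, -1.8, 1.1)]
```

Plotting `tcc` over theta for each group's estimates (with the anchor items constrained equal) yields the kind of comparison shown in Figure 1; the vertical gap between the two curves at a given theta is the expected score difference at that trait level.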



FIGURE 1
Comparison of NA and UK TCCs Using Cooke, Michie, Hart, and Clark's (2005a) Anchor Items (5, 6, 9) and Anchor Items (1, 14, 15) From UK* Solution in Table 2, Hare (2003) Data

[Two panels plot expected PCL-R score (0 to 40) against theta (−3 to 3) for the UK and North American groups; the left panel uses anchor items 5, 6, and 9, the right panel anchor items 1, 14, and 15.]

NOTE: NA = North American male criminal offenders; UK = United Kingdom male criminal offenders; TCC = test characteristic curve; UK* = statistically equivalent UK solution; PCL-R = Psychopathy Checklist–Revised.

Single-Item DIF Analyses of All PCL-R Items

A frequent starting point for DIF analysis is a single-item DIF analysis of all items on the test (Thissen et al., 1993, p. 102). In such an analysis, each item is tested for DIF using all of the remaining items (19 on the PCL-R) as anchors. This strategy is sometimes applied to identify an initial core set of items from which an anchor could be selected. Here, it also serves the purpose of highlighting the ipsative nature of DIF statistics (Camilli, 1993) and the foundation of the problem with the Cooke et al. (2005a) anchor items.

To better interpret the results from this single-item DIF analysis, a second DIF detection method, Poly-SIBTEST (Chang et al., 1996), is also considered in the current analysis. Similar to the likelihood ratio test (LR), Poly-SIBTEST tests a null hypothesis that a studied item displays no DIF across groups. However, unlike the LR test, Poly-SIBTEST tests for DIF by comparing expected scores on the item across groups (rather than the equivalence of the GRM item parameters). The advantage of this method in the current application is that Poly-SIBTEST also makes immediately apparent the direction of DIF in the item. Specifically, when applying a model such as the GRM, the average direction of DIF is often not apparent from inspecting differences in the two b estimates for the item across groups because the differences can occur in both directions (e.g., Item 15 in Table 3). By examining DIF with respect to expected scores on the item using Poly-SIBTEST, the ipsative nature of DIF statistics and

Downloaded from asm.sagepub.com at PENNSYLVANIA STATE UNIV on May 8, 2016

Bolt et al. / PCL-R IN AMERICA AND UNITED KINGDOM 51

the implications of anchor item selection for score metric equivalence also are seen more clearly.

TABLE 3
Single-Item-at-a-Time Exploratory DIF Analyses, Anchoring on All Remaining Test Items, Hare (2003) Data

                LR Test        NA                    UK                 Poly-SIBTEST
Item (Factor)     χ2*     a    b1    b2        a    b1    b2        β     z-stat     p
1 (1)            97.0    1.1   0.1   2.5      0.7   1.0   4.2     0.26    10.08   .000
2 (1)            38.8    1.1  –0.0   2.1      0.9   0.4   2.5     0.13     4.47   .000
3 (3)             1.5    1.2  –0.9   1.2      1.2  –0.9   1.1    –0.09    –4.01   .000
4 (1)           132.9    1.1  –0.4   2.1      1.3   0.5   2.5     0.23     8.40   .000
5 (1)            48.9    1.3  –0.2   1.6      1.2  –0.7   1.5    –0.13    –4.74   .000
6 (2)            26.9    1.6  –1.1   0.7      1.2  –1.5   0.5    –0.05    –2.24   .025
7 (2)             4.0    1.2  –0.1   2.1      1.2   0.0   2.0     0.06     1.99   .047
8 (2)            10.9    1.8  –0.4   1.3      1.4  –0.5   1.4    –0.01    –0.28   .778
9 (3)            81.5    0.9  –1.0   2.2      1.1  –1.0   1.2    –0.17    –6.33   .000
10 (4)           18.1    1.0  –0.7   1.2      1.0  –1.0   0.9    –0.15    –5.63   .000
11                6.1    0.8  –0.6   1.3      0.9  –0.8   1.1    –0.05    –1.57   .115
12 (4)           31.6    1.0   0.2   1.8      1.0  –0.2   1.2    –0.20    –6.28   .000
13 (3)           22.2    0.9  –1.0   1.5      1.0  –1.0   1.0    –0.10    –3.77   .000
14 (3)           12.3    1.1  –1.3   1.2      0.8  –1.9   1.2    –0.06    –2.27   .023
15 (3)           12.3    1.0  –1.6   1.1      0.8  –1.8   1.4     0.03     1.27   .205
16 (3)            9.1    0.9  –1.5   1.1      0.8  –2.0   1.0    –0.03    –1.20   .231
17               33.1    0.6   0.9   2.9      0.6   0.2   2.0    –0.17    –4.62   .000
18 (4)           57.8    0.7  –0.6   1.0      1.1  –0.7   1.0    –0.03    –1.04   .298
19 (4)          125.4    0.6  –1.8  –0.4      1.1  –0.4   0.7     0.28     8.41   .000
20 (4)           60.8    0.7  –1.1   1.0      1.2  –0.3   1.1     0.18     5.05   .000

NOTE: Anchor items chosen by Cooke et al. (2005a): Items 5, 6, and 9. DIF = differential item functioning; LR = likelihood ratio test; NA = North American male criminal offenders; UK = United Kingdom male criminal offenders; a = discrimination parameter representing the degree to which item scores are affected by the latent trait; b1 and b2 = threshold parameters identifying the latent trait levels at which scores at or above 1 and 2, respectively, become more likely than scores at or below 0 and 1, respectively.
*χ2 statistics are all based on 3 degrees of freedom.

Both the NA and UK item response data sets were fit using the GRM and examined for model fit. To evaluate fit, each data set was randomly split so that approximately half the respondents served as a calibration sample (NA = 1,924 respondents; UK = 558) and the other half as a cross-validation sample (NA = 1,925 respondents; UK = 559). GRM estimates were computed for the calibration samples and examined for fit in the cross-validation samples using three types of chi-square indices (Drasgow, Levine, Tsien, Williams, & Mead, 1995) computed with the software MODFIT (Stark, 2001). These indices evaluate the fit of the GRM with respect to the first-order, second-order, and third-order joint frequencies of the item scores, respectively (for details, see Drasgow et al., 1995), and suggest good model fit when the indices are less than 3 for each item, item pair, or item triple. Analyses in both samples demonstrated good fit with respect to the first-order chi-square tests (mean chi-square/df = 1.58 for NA; 1.98 for UK) but only average fit with respect to the second- and third-order tests (mean chi-square/df = 2.884 and 2.981 for the NA double and triple tests, respectively; 2.945 and 3.188 for UK, respectively). The latter results are not unexpected given the known local dependence between certain item pairs based on the factor analysis.

Table 3 displays results from both LR and Poly-SIBTEST single-item DIF tests. The three items chosen as anchor items by Cooke et al. (2005a), Items 5, 6, and 9, are flagged in the table note. The β statistic in the SIBTEST analysis represents the expected score difference on the item between members of each group having the same latent trait level. In this application, a positive β implies a higher expected score for NA offenders, and a negative β implies a higher expected score for UK offenders. As expected, across items the β values average essentially 0. This ipsative property of DIF statistics always emerges when using the remaining items as anchors (Camilli, 1993) and provides an important reminder that DIF statistics ultimately communicate information about the relative ordering of items in terms of signed bias, which may or may not correspond to the actual magnitude and direction of bias in the item. That is, an instrument that on the whole truly possesses bias will not display it in a DIF analysis, because the DIF statistics consistently sum to 0. Also reported in


52 ASSESSMENT

Table 3 are the individual factors associated with the PCL-R items. Attending to such information can be informative not only for identifying whether certain item types are more prone to displaying DIF but also for determining whether the causes of DIF can be attributed to secondary factors measured by the instrument (see, e.g., Roussos & Stout, 1996, for further details on how multidimensionality has the potential to emerge as DIF). Most important for the current illustration, the results in Table 3 demonstrate that the three items chosen as anchors by Cooke et al. (2005a) all display DIF in a common direction, namely, one producing higher scores for UK offenders, when anchored on all remaining items. The consequence of Cooke et al. using these three items as anchors is that respondents from the NA and UK populations become equated according to items that, at the same trait level, yield disproportionately higher scores for UK offenders. The resulting mismatch of offenders across the NA and UK populations is such that UK offenders appear to receive lower scores on most (if not all) of the remaining nonanchor items. Indeed, exactly the ordering of TCC curves observed by Cooke et al. (2005a) should be expected, because treating these anchor items as valid makes all other items on the instrument appear biased against UK offenders in the sense that UK offenders achieve lower scores on them. (It should be noted that the data set used in the present DIF analysis is not identical to that used by Cooke et al. [2005a]; we expect that the three anchor items they chose would actually have displayed the largest amounts of DIF in a single-item DIF analysis.)

An Iterative Purification Approach to Selection of Anchor Items

The issue of anchor item selection is commonly addressed by using some method of anchor item purification (Penfield & Camilli, in press). Lord (1980) discussed an iterative purification strategy that searches for a core set of "DIF-free" items that can be used as anchors.
Following Lord’s approach, a baseline model is considered in which all items are assumed to have the same parameters across groups but with latent means allowed to differ. From this model, each individual item can be tested for DIF (by comparing augmented and compact models using the LR method), essentially a replication of the single-item DIF analysis described above. The item displaying the most DIF is dismissed from the analysis and a new baseline model is considered that includes all items but the one dismissed from the previous analysis. The process then continues in an iterative fashion, with DIF items being sequentially tossed out of the analysis (and/or readded if they no longer display DIF) until the only items remaining display a minimal

amount of DIF across groups. These items then define the final anchor against which all items dismissed in earlier stages are tested in a final model for DIF. Variants on this strategy also can be considered. For example, a similar approach starts with a small core of items (say a group of items displaying minimal DIF in the single-item DIF analysis). Using this core as an initial anchor, the remaining items are tested for DIF and items are added to the initial anchor in a sequential fashion based on their failure to display DIF. The process terminates when items can no longer be added to the anchor, either because the items themselves display DIF or else introduce DIF into an item currently included on the anchor. A defining aspect of these purification approaches that is not present in the analyses by Cooke and colleagues is that the group of items selected as the anchor should function equivalently (or nearly so) when studied independent of the other items on the test. They also are based on the often realistic assumption in exploratory DIF analyses that the items that are valid and should function as anchors represent a sizeable minority, and perhaps even majority, of items included on the scale, certainly more than the three items used by Cooke and colleagues. This philosophy also is consistent with the nature of statistical testing that underlies DIF methodology—equivalence is assumed unless the data show otherwise. Applying Lord’s (1980) iterative purification approach to the Hare (2003) data, we identified a set of 13 items that displayed minimal DIF across NA and UK offender populations. These items, and the resulting DIF testing of each item in the 13-item subset, are shown in Table 4. Table 5 displays the DIF testing results for the 7 items not included on the anchor. Taken together, these results appear to support use of the 13 items as an anchor. 
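The iterative purification loop can be sketched as follows. This is not the authors' implementation: the LR test on GRM parameters is replaced here by a crude stratified mean-difference DIF statistic, and the simulated data, sample sizes, item count, and threshold are all hypothetical.

```python
import random
from collections import defaultdict

N_ITEMS = 6  # hypothetical short test

def simulate(n, bias_item=None, seed=0):
    """Simulate 0/1/2 graded item scores; optionally inflate one item's
    scores to build DIF into the data (all values hypothetical)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        resp = []
        for i in range(N_ITEMS):
            u = theta + rng.gauss(0.0, 1.0)
            x = sum(u > t for t in (-1.0, 1.0))  # graded 0/1/2 score
            if i == bias_item:
                x = min(2, x + 1)  # item is "easier" for this group
            resp.append(x)
        data.append(resp)
    return data

def dif_stat(data_a, data_b, item, anchor):
    """Mean score difference on `item` between groups, averaged over
    strata matched on the anchor total score (a crude stand-in for the
    likelihood ratio test used in the article)."""
    def strata(data):
        s = defaultdict(list)
        for resp in data:
            s[sum(resp[j] for j in anchor)].append(resp[item])
        return s
    sa, sb = strata(data_a), strata(data_b)
    common = set(sa) & set(sb)
    diffs = [sum(sa[k]) / len(sa[k]) - sum(sb[k]) / len(sb[k]) for k in common]
    return sum(diffs) / len(diffs) if diffs else 0.0

def purify(data_a, data_b, threshold=0.15):
    """Lord-style purification: repeatedly drop the item showing the
    largest absolute DIF (anchoring on the rest) until every remaining
    item falls below the threshold."""
    anchor = set(range(N_ITEMS))
    while len(anchor) > 2:
        stats = {i: dif_stat(data_a, data_b, i, anchor - {i}) for i in anchor}
        worst = max(anchor, key=lambda i: abs(stats[i]))
        if abs(stats[worst]) < threshold:
            break
        anchor.discard(worst)
    return sorted(anchor)

group_a = simulate(500, bias_item=0, seed=1)  # item 0 biased toward group A
group_b = simulate(500, bias_item=None, seed=2)
print("purified anchor:", purify(group_a, group_b))
```

Because the statistic compares groups after matching on the current anchor, a biased item left in the anchor contaminates the DIF estimates of every other item; the purification loop exists precisely to remove that contamination before the final anchor is fixed.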
Using this designation of anchor items and nonanchor items, a final multigroup IRT model was fit in which the parameters of the 13 anchor items were constrained to equality across groups and the remaining 7 were allowed to vary. The results of this analysis are shown in Table 6. Using these item parameter estimates, TCCs were constructed for NA and UK offender populations, the results of which are shown in Figure 2. The difference between TCCs here appears minimal, especially at high levels of the trait where the cut scores are generally applied. In particular, at a trait level of approximately 1.5, where expected scores of 30 are obtained for NA offenders, the expected score for UK offenders differs by approximately 0.5 points. It should be noted that an important feature of the multigroup solution is its ability also to characterize the distributions of the trait within the groups being compared. For the solution reported in Table 6, the latent mean estimate for UK was arbitrarily fixed at 0, whereas


TABLE 4
Internal Analysis of 13 Anchor Items Identified by Iterative Purification, Hare (2003) Data

               NA                    UK
Item    χ2*     a    b1    b2        a    b1    b2
3       2.9    1.2  –1.0   1.1      1.2  –0.9   1.1
5       8.6    1.1  –0.5   1.6      1.0  –0.7   1.7
6       9.3    1.5  –1.3   0.5      1.2  –1.5   0.5
7      14.8    1.2  –0.3   2.0      1.2   0.0   2.0
8      12.9    1.9  –0.6   1.2      1.6  –0.5   1.3
10      6.4    1.0  –0.9   1.1      1.1  –0.9   0.9
11      0.8    0.8  –0.8   1.2      0.8  –0.8   1.2
12     14.8    0.9   0.0   1.7      0.9  –0.2   1.3
13     18.6    0.8  –1.3   1.5      1.0  –0.9   1.0
14      3.0    1.1  –1.4   1.1      1.0  –1.7   1.1
15     20.1    1.0  –1.8   0.9      0.8  –1.8   1.4
16      0.8    0.8  –1.9   1.0      0.8  –2.0   1.0
17     20.5    0.5   0.8   2.9      0.5   0.3   2.2

NOTE: NA = North American male criminal offenders; UK = United Kingdom male criminal offenders; a = discrimination parameter representing the degree to which item scores are affected by the latent trait; b1 and b2 = threshold parameters identifying the latent trait levels at which scores at or above 1 and 2, respectively, become more likely than scores at or below 0 and 1, respectively.
*χ2 statistics are all based on 3 degrees of freedom.

the estimate for NA offenders was .52. Because the common standard deviation within each group was 1.0, this difference implies that the latent distribution of psychopathy is approximately ½ standard deviation higher among NA criminal offenders than among UK offenders.

TABLE 5
Analysis of Remaining 7 Items Using 13 Anchor Items Identified by Iterative Purification, Hare (2003) Data

                NA                    UK
Item    χ2*      a    b1    b2        a    b1    b2
1     135.6     0.9  –0.1   2.7      0.6   1.2   4.8
2      61.3     1.0  –0.2   2.2      0.8   0.4   2.6
4     181.0     1.0  –0.6   2.2      1.1   0.5   2.6
9      69.4     0.9  –1.2   2.0      1.1  –0.9   1.1
18     54.2     0.8  –0.7   1.0      1.1  –0.7   1.0
19    154.8     0.6  –2.1  –0.6      0.9  –0.4   0.7
20     82.7     0.7  –1.3   0.9      1.1  –0.3   1.2

NOTE: NA = North American male criminal offenders; UK = United Kingdom male criminal offenders; a = discrimination parameter representing the degree to which item scores are affected by the latent trait; b1 and b2 = threshold parameters identifying the latent trait levels at which scores at or above 1 and 2, respectively, become more likely than scores at or below 0 and 1, respectively.
*χ2 statistics are all based on 3 degrees of freedom.

TABLE 6
Final Multigroup IRT Solution

                   NA                    UK
Item (Factor)      a    b1    b2        a    b1    b2
1 (1)             1.1  –0.2   2.1      0.7   1.0   4.4
2 (1)             1.2  –0.4   1.8      0.8   0.3   2.6
3 (3)             1.3  –1.1   0.8      1.3  –1.1   0.8
4 (1)             1.2  –0.7   1.7      1.2   0.3   2.3
5 (1)             1.3  –0.6   1.3      1.3  –0.6   1.3
6 (2)             1.5  –1.5   0.3      1.5  –1.5   0.3
7 (2)             1.2  –0.4   1.7      1.2  –0.4   1.7
8 (2)             1.7  –0.7   1.0      1.7  –0.7   1.0
9 (3)             0.9  –1.3   1.8      1.2  –1.0   1.0
10 (4)            1.0  –1.1   0.8      1.0  –1.1   0.8
11                0.9  –0.9   0.9      0.9  –0.9   0.9
12 (4)            1.0  –0.2   1.3      1.0  –0.2   1.3
13 (3)            0.9  –1.3   1.1      0.9  –1.3   1.1
14 (3)            1.1  –1.7   0.8      1.1  –1.7   0.8
15 (3)            1.0  –1.8   0.7      1.0  –1.8   0.7
16 (2)            0.9  –1.9   0.7      0.9  –1.9   0.7
17                0.6   0.4   2.4      0.6   0.4   2.4
18 (4)            0.8  –0.9   0.6      1.2  –0.8   0.8
19 (4)            0.6  –2.1  –0.7      1.2  –0.5   0.5
20 (4)            0.8  –1.4   0.7      1.4  –0.4   0.9

NOTE: Anchor items, with parameters constrained equal across groups: Items 3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, and 17. IRT = item response theory; NA = North American male criminal offenders; UK = United Kingdom male criminal offenders; a = discrimination parameter representing the degree to which item scores are affected by the latent trait; b1 and b2 = threshold parameters identifying the latent trait levels at which scores at or above 1 and 2, respectively, become more likely than scores at or below 0 and 1, respectively.

DISCUSSION

Generalizability studies are necessary for validating the use of psychological test instruments in new populations and also can help clarify the nature of differential test functioning when it exists. For instruments such as the PCL-R, DIF studies in particular offer insight into the differential manifestation of psychopathic characteristics across populations and can inform questions about the universality of psychopathic personality disorder and its measurement. Of importance, DIF studies also may lend insight into the underlying causes of test bias in a test instrument when such bias is known to be present. However, absent well-defined anchor items, it is often difficult to draw strong conclusions regarding the overall magnitude and/or direction of bias in a test instrument from DIF analyses alone.

FIGURE 2
Comparison of TCCs Using a 13-Item Anchor (3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17) Identified Through Iterative Purification, Hare (2003) Data

[Figure: expected PCL-R score (0 to 40) plotted against theta (–3 to 3), with separate curves for UK and North American offenders.]

NOTE: Nrth Amer = North American male criminal offenders; UK = United Kingdom male criminal offenders; TCC = test characteristic curve; PCL-R = Psychopathy Checklist–Revised.

In evaluating recent findings by Cooke and colleagues comparing European and North American criminal offenders, it is important to recognize the problematic (and, we suggest, highly misleading) choices they made in selecting anchor items for their DIF analysis. By basing their decisions on baseline models that were not identified, they were inappropriately led to anchor items that, as a collection, produce disproportionately higher scores for European offenders and thus likely contain actual bias. As we have shown, this bias then introduces

bias in an opposing direction when comparing TCCs across groups. The lack of score metric equivalence identified by Cooke and Michie (1999) and Cooke et al. (2005a, 2005b) thus can be attributed to inappropriate selection of anchor items.1

When selecting anchor items, a substantial amount of confidence is placed in the validity of those items. Consequently, in exploratory DIF analyses such as those generally conducted with the PCL-R, results must always be interpreted with respect to the presumed validity of the anchor items. Consistent with general DIF testing philosophy, the approach taken in this article sought to identify the largest subset of items on the test that appeared to function consistently across groups to serve as an anchor. The iterative purification procedure operates from this general principle and, in the current application, led to results implying virtually no difference in the expected PCL-R score for NA and UK offenders having levels of psychopathy associated with diagnostic classification. The finding that 13 of the 20 PCL-R items can be viewed as performing in a largely consistent fashion across NA and UK populations lends greater credence to their use as an anchor than to the selection of the three items chosen by Cooke et al. (2005a, 2005b), which, as noted, all show high relative bias (in terms of higher relative scores) among UK offenders.

Of interest, Table 6 shows that of the seven items not included in the anchor, six belong to Factors 1 and 4. This finding opens up the possibility that the DIF observed in the current analysis may be attributable in part to differential distributions across groups on the instrument's secondary factors, rather than to nuisance dimensions unrelated to psychopathy.

The matter of which PCL-R cut score to use is of more than academic interest.
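The near-coincidence of the purified-anchor TCCs (Figure 2) can be checked directly from the Table 6 estimates. The sketch below assumes a logistic GRM with no 1.7 scaling constant (an assumption about the original calibration) and uses the rounded published parameter values, so it only approximates the roughly half-point NA-UK difference near theta = 1.5 described earlier:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_item_score(theta, a, b1, b2):
    # GRM expected score for a 0/1/2 item: P(X >= 1) + P(X >= 2)
    return sigmoid(a * (theta - b1)) + sigmoid(a * (theta - b2))

def tcc(theta, params):
    # Test characteristic curve: expected total score at trait level theta
    return sum(expected_item_score(theta, a, b1, b2) for a, b1, b2 in params)

# (a, b1, b2) for Items 1-20, NA group, from Table 6 (rounded values).
na = [(1.1, -0.2, 2.1), (1.2, -0.4, 1.8), (1.3, -1.1, 0.8), (1.2, -0.7, 1.7),
      (1.3, -0.6, 1.3), (1.5, -1.5, 0.3), (1.2, -0.4, 1.7), (1.7, -0.7, 1.0),
      (0.9, -1.3, 1.8), (1.0, -1.1, 0.8), (0.9, -0.9, 0.9), (1.0, -0.2, 1.3),
      (0.9, -1.3, 1.1), (1.1, -1.7, 0.8), (1.0, -1.8, 0.7), (0.9, -1.9, 0.7),
      (0.6, 0.4, 2.4), (0.8, -0.9, 0.6), (0.6, -2.1, -0.7), (0.8, -1.4, 0.7)]

# UK shares the 13 anchor items; the 7 nonanchor items (1, 2, 4, 9, 18,
# 19, 20) take the UK-specific estimates from Table 6.
uk = list(na)
for idx, p in {0: (0.7, 1.0, 4.4), 1: (0.8, 0.3, 2.6), 3: (1.2, 0.3, 2.3),
               8: (1.2, -1.0, 1.0), 17: (1.2, -0.8, 0.8), 18: (1.2, -0.5, 0.5),
               19: (1.4, -0.4, 0.9)}.items():
    uk[idx] = p

for theta in (0.0, 1.0, 1.5, 2.0):
    print(f"theta={theta:.1f}  NA={tcc(theta, na):5.1f}  UK={tcc(theta, uk):5.1f}"
          f"  diff={tcc(theta, na) - tcc(theta, uk):+.2f}")
```

With these rounded estimates the NA curve reaches an expected score of about 30 near theta = 1.5, with the UK curve within about a point of it, consistent with the minimal difference reported for Figure 2.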
For example, adoption of the cut score of 25, originally proposed by Cooke and Michie (1999) as being metrically equivalent to a score of 30 in NA, would have resulted in about 15% of offenders in Her Majesty's Prison Service being eligible for DSPD (dangerous and severe personality disorder) status, a designation with serious consequences for an offender. If adopted, the more recent cut score of 28 proposed by Cooke et al. (2005a) would place almost 8% of offenders in this category. In contrast, a cut score of 29.5 or 30 would place only about 4% to 5% in the DSPD category.2

IRT is a valuable tool for informing users about the score metric equivalence of an instrument across different populations. However, conclusions and recommendations derived from an IRT analysis should be based on the use of a sound and defensible methodology, as well as on awareness of the limitations of the analysis and of the possibility of alternative solutions that are equally or more plausible (see Penfield & Camilli, in press; Thissen et al., 1993, for further discussion). We argue that both the methodology


and the conclusions in the Cooke et al. studies were faulty and that they do not provide a reasonable basis for their strong conclusions about score metric equivalence or the use of PCL-R scores in the UK and Europe (also see Note 1). Finally, we argue that the best evidence of such equivalence is obtained from analyses of test scores in relation to critical correlates and predictive outcomes. There now exists a solid body of research suggesting that the correlates of the PCL-R (and its derivative, the Psychopathy Checklist: Screening Version [PCL:SV]) are much the same in the UK and Europe as they are in NA.3 Using such variables to evaluate score metric equivalence, however, will require agreement as to whether the variables, or some subset thereof, can themselves be assumed to represent equivalent levels of the latent construct of psychopathy across populations.

NOTES

1. Cooke et al. (2005b) display test characteristic curves (TCCs) that are actually coincident at the diagnostic cut point associated with a high level of psychopathy, yet still claim the need for a lower Psychopathy Checklist–Revised (PCL-R) cut score in Europe than in North America.

2. We note that a derivative of the PCL-R, the Hare Psychopathy Checklist: Screening Version (PCL:SV; Hart, Cox, & Hare, 2005), increasingly is being used in many different countries. Its structural properties and many of its correlates are very much like those of the PCL-R (e.g., Guy & Douglas, in press; Vitacco, Neumann, & Jackson, 2005). It is likely that our item response theory (IRT) analyses and discussions of the PCL-R have direct implications for the scale metric equivalence of the PCL:SV, although the appropriate cross-cultural analyses remain to be done.

3. For examples, see Carmen Pastor, Moltó, Vila, and Lang (2003); Dolan and Doyle (2000); Dolan and Millington (2002); Dolan and Rennie (2006); Douglas, Strand, Belfrage, Fransson, and Levander (2005); Flor, Birbaumer, Hermann, Ziegler, and Patrick (2002); Gray, Hill, et al. (2003); Gray, MacCulloch, Smith, Morris, and Snowden (2003); Hare, Clark, Grann, and Thornton (2000); Hildebrand, de Ruiter, and de Vogel (2004); Hobson, Shine, and Roberts (2000); Johansson and Kerr (2005); Laakso et al. (2001); Moltó, Poy, and Torrubia (2000); Morrissey et al. (2005); Söderström, Blennow, Sjodin, and Forsman (2003); Stadtland, Kleindienst, Kröner, Eidt, and Nedopil (2005); Tengström, Grann, Långström, and Kullgren (2000); and Tengström, Hodgins, Grann, Långström, and Kullgren (2004).

REFERENCES

Acheson, S. K. (2005). Review of the Hare Psychopathy Checklist–Revised, 2nd edition. In R. A. Spies & B. S. Plake (Eds.), The sixteenth mental measurements yearbook (pp. 429-431). Lincoln, NE: Buros Institute of Mental Measurements.
Bolt, D. M., Hare, R. D., Vitale, J. E., & Newman, J. P. (2004). A multigroup item response theory analysis of the Psychopathy Checklist–Revised. Psychological Assessment, 16, 155-168.
Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 397-413). Hillsdale, NJ: Lawrence Erlbaum.
Carmen Pastor, M., Moltó, J., Vila, J., & Lang, P. J. (2003). Startle reflex modulation, affective ratings and autonomic reactivity in incarcerated Spanish psychopaths. Psychophysiology, 40, 934-938.
Chang, H., Mazzeo, J., & Roussos, R. (1996). Detect DIF for polytomously scored items: An adaptation of Shealy-Stout's SIBTEST procedure. Journal of Educational Measurement, 33, 333-353.
Cooke, D. J., & Michie, C. (1999). Psychopathy across cultures: North America and Scotland compared. Journal of Abnormal Psychology, 108, 55-68.
Cooke, D. J., & Michie, C. (2001). Refining the construct of psychopathy: Towards a hierarchical model. Psychological Assessment, 13, 171-188.
Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005a). Assessing psychopathy in the UK: Concerns about cross-cultural generalizability. British Journal of Psychiatry, 186, 335-341.
Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005b). Searching for the pan-cultural core of psychopathic personality disorder. Personality and Individual Differences, 39, 283-295.
Dolan, M., & Doyle, M. (2000). Violence risk prediction: Clinical and actuarial measures and the role of the Psychopathy Checklist. British Journal of Psychiatry, 177, 303-311.
Dolan, M., & Millington, J. (2002). The influence of personality traits such as psychopathy on detained patients using the NHS complaints procedure in forensic settings. Personality and Individual Differences, 33, 955-965.
Dolan, M. C., & Rennie, C. E. (2006). Reliability and validity of the Psychopathy Checklist: Youth Version in a UK sample of conduct disordered boys. Personality and Individual Differences, 40, 65-75.
Douglas, K. S., Strand, S., Belfrage, H., Fransson, G., & Levander, S. (2005). Reliability and validity evaluation of the Psychopathy Checklist: Screening Version (PCL:SV) in Swedish correctional and forensic psychiatric samples. Assessment, 12, 145-161.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Edens, J. F., Marcus, D. K., Lilienfeld, S. O., & Poythress, N. G. (2006). Psychopathic, not psychopath: Taxometric evidence for the dimensional structure of psychopathy. Journal of Abnormal Psychology, 115, 131-144.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Flor, H., Birbaumer, N., Hermann, C., Ziegler, S., & Patrick, C. (2002). Aversive Pavlovian conditioning in psychopaths: Peripheral and central correlates. Psychophysiology, 39, 505-518.
Gray, N. S., Hill, C., McGleish, A., Timmons, D., MacCulloch, M. J., & Snowden, R. J. (2003). Prediction of violence and self-harm in mentally disordered offenders: A prospective study of the efficacy of HCR-20, PCL-R and psychiatric symptomatology. Journal of Consulting and Clinical Psychology, 71, 443-451.
Gray, N. S., MacCulloch, M. J., Smith, J., Morris, M., & Snowden, R. J. (2003). Violence viewed by psychopathic murderers: Adapting a revealing test may expose those psychopaths who are most likely to kill. Nature, 423, 497-498.
Guay, J. P., Ruscio, J., Knight, R. A., & Hare, R. D. (in press). A taxometric analysis of the latent structure of psychopathy: Evidence for dimensionality. Journal of Abnormal Psychology.
Guy, L. S., & Douglas, K. S. (2006). Evaluating the utility of the PCL: SV as a screening measure using competing factor models of psychopathy. Psychological Assessment, 18, 225-230.
Hare, R. D. (1991). Manual for the Revised Psychopathy Checklist (1st ed.). Toronto, Ontario, Canada: Multi-Health Systems.
Hare, R. D. (1996). Psychopathy: A construct whose time has come. Criminal Justice and Behavior, 23, 25-54.
Hare, R. D. (2003). Manual for the Revised Psychopathy Checklist (2nd ed.). Toronto, Ontario, Canada: Multi-Health Systems.
Hare, R. D., Clark, D., Grann, M., & Thornton, D. (2000). Psychopathy and the predictive validity of the PCL-R: An international perspective. Behavioral Sciences and the Law, 18, 623-645.


Hare, R. D., & Neumann, C. S. (2005). The structure of psychopathy. Current Psychiatry Reports, 7, 1-32.
Hart, S. D., Cox, D. N., & Hare, R. D. (2005). The Hare Psychopathy Checklist: Screening Version. Toronto, Ontario, Canada: Multi-Health Systems.
Hildebrand, M., de Ruiter, C., & de Vogel, V. (2004). Psychopathy and sexual deviance in treated rapists: Association with (sexual) recidivism. Sexual Abuse: A Journal of Research and Treatment, 16, 1-24.
Hobson, J., Shine, J., & Roberts, R. (2000). How do psychopaths behave in a prison therapeutic community? Psychology, Crime, and Law, 6, 139-154.
Johansson, P., & Kerr, M. (2005). Psychopathy and intelligence: A second look. Journal of Personality Disorders, 19, 357-369.
Laakso, M. P., Vaurio, O., Koivisto, E., Savolainen, L., Eronen, M., Aronen, H. J., et al. (2001). Psychopathy and the posterior hippocampus. Behavioural Brain Research, 118, 187-193.
Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Marcus, D. K., John, S. L., & Edens, J. F. (2004). A taxometric analysis of psychopathic personality. Journal of Abnormal Psychology, 113, 626-635.
Moltó, J., Poy, R., & Torrubia, R. (2000). Standardization of the Hare Psychopathy Checklist–Revised in a Spanish prison sample. Journal of Personality Disorders, 14, 84-96.
Monahan, J. (2006). Comments on cover jacket. In C. J. Patrick (Ed.), Handbook of psychopathy. New York: Guilford.
Morrissey, C., Hogue, T. E., Mooney, P., Lindsay, W. R., Steptoe, L., Taylor, J., et al. (2005). Applicability, reliability and validity of the Psychopathy Checklist–Revised in offenders with intellectual disabilities: Some initial findings. International Journal of Forensic Mental Health, 4, 207-220.
Neumann, C. S., Hare, R. D., & Newman, J. P. (in press). The superordinate nature of psychopathy. Journal of Personality Disorders.
Neumann, C. S., Vitacco, M. J., Hare, R. D., & Wupperman, P. (2005). Reconstruing the reconstruction of psychopathy: A comment on Cooke, Michie, Hart, & Clark. Journal of Personality Disorders, 19, 624-640.
Penfield, R. D., & Camilli, G. (in press). Differential item functioning and item bias. In S. Sinharay & C. R. Rao (Eds.), Handbook of statistics, Volume 27: Psychometrics. New York: Elsevier North-Holland.
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (2005). Violent offenders: Appraising and managing risk (2nd ed.). Washington, DC: American Psychological Association.
Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-371.
Salekin, R. T., Rogers, R., & Sewell, K. W. (1997). Construct validity of psychopathy in a female offender sample: A multitrait-multimethod evaluation. Journal of Abnormal Psychology, 106, 576-585.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 100-114.
Söderström, H., Blennow, K., Sjodin, A. K., & Forsman, A. (2003). New evidence for an association between the CSF HVA:5-HIAA ratio and psychopathic traits. Journal of Neurology, Neurosurgery, and Psychiatry, 74, 918-921.
Stadtland, C., Kleindienst, N., Kröner, C., Eidt, M., & Nedopil, N. (2005). Psychopathic traits and risk of criminal recidivism in offenders with and without mental disorders. International Journal of Forensic Mental Health, 4, 89-97.
Stark, S. (2001). MODFIT: A computer program for model-data fit. Urbana-Champaign: University of Illinois.
Tengström, A., Grann, M., Långström, N., & Kullgren, G. (2000). Psychopathy (PCL-R) as a predictor of violent recidivism among criminal offenders with schizophrenia. Law and Human Behavior, 24, 45-58.
Tengström, A., Hodgins, S., Grann, M., Långström, N., & Kullgren, G. (2004). Schizophrenia and criminal offending: The role of psychopathy and substance use disorders. Criminal Justice and Behavior, 31, 367-391.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Vitacco, M. J., Neumann, C. S., & Jackson, R. (2005). Testing a four-factor model of psychopathy and its association with ethnicity, gender, intelligence, and violence. Journal of Consulting and Clinical Psychology, 73, 466-476.
Windle, M., & Dumenci, L. (1999). The factorial structure and construct validity of the Psychopathy Checklist–Revised among alcoholic inpatients. Structural Equation Modeling, 6, 372-393.

Daniel M. Bolt is associate professor of educational psychology at the University of Wisconsin, Madison, and specializes in quantitative methods. He received his PhD in psychology from the University of Illinois at Urbana-Champaign in 1999 and has published various articles on the development and application of latent variable models in the social sciences. His primary research interest is item response theory and its use for applications such as test equating and differential item functioning.

Robert D. Hare is emeritus professor of psychology, University of British Columbia, and president of Darkstone Research Group Ltd., a forensic research and consulting firm. He obtained his PhD from the University of Western Ontario. He has devoted most of his academic career to the investigation of psychopathy: its nature, assessment, and implications for mental health and criminal justice. He is the author or coauthor of several books and many chapters and articles on psychopathy and is the developer of the Psychopathy Checklist–Revised (PCL-R), co-developer of the Psychopathy Checklist: Screening Version (PCL:SV) and the Psychopathy Checklist: Youth Version (PCL:YV), and co-author of the Antisocial Process Screening Device (APSD). He consults with law enforcement, sits on the FBI's Research Advisory Board of the Child Abduction and Serial Murder Investigative Resources Center, and is a member of the International Criminal Investigative Analysis Fellowship. His current research on psychopathy includes assessment and developmental issues, neurobiological correlates, risk for recidivism and violence, and cross-cultural generalizability.

Craig S. Neumann is associate professor of psychology at the University of North Texas. He received his PhD from Kansas University. His research interests include the developmental, neurocognitive, and structural features of personality disorders and other psychopathology.

