What are the benefits of using this selection method?
Question Description
I’m working on a psychology discussion question and need an explanation and answer to help me learn.
- What are the benefits of using this selection method?
- How would you summarize some best practices for using this method in selection?
- What controversy (if any) or challenges are associated with this selection method?
- How would you feel if you were administered this assessment method for a job and you were rejected based on this method alone, particularly if it was asking about something you did in your distant past? Would this change your view about the value of the assessment? If so, how?
Picardi, C. A. (2020). Recruitment and selection: Strategies for workforce planning & assessment. SAGE Publications, Inc. (US).
Roulin, N., Bangerter, A., & Levashina, J. (2015). Honest and deceptive impression management in the employment interview: Can it be detected and how does it impact evaluations? Personnel Psychology, 68(2), 395-444.
A Sample Answer For the Assignment: What are the benefits of using this selection method?
Journal of Applied Psychology, 2012, Vol. 97, No. 3, 499–530. © 2012 American Psychological Association. DOI: 10.1037/a0021196

The Criterion-Related Validity of Integrity Tests: An Updated Meta-Analysis

Chad H. Van Iddekinge, Philip L. Roth, Patrick H. Raymark, and Heather N. Odle-Dusseau

Author Note: Chad H. Van Iddekinge, College of Business, Florida State University; Philip L. Roth, Department of Management, Clemson University; Patrick H. Raymark, Department of Psychology, Clemson University; Heather N. Odle-Dusseau, Department of Management, Gettysburg College. An earlier version of this article was presented at the 70th Annual Meeting of the Academy of Management, Montreal, Quebec, Canada, August 2010. We gratefully acknowledge the many researchers and test publishers who provided unpublished primary studies for possible inclusion in this meta-analysis. This study would not have been possible without their assistance. We are particularly grateful to Linda Goldinger (Creative Learning, Atlanta, Georgia), Matt Lemming and Jeff Foster (Hogan Assessment Systems, Tulsa, Oklahoma), and Kathy Tuzinski and Mike Fetzer (PreVisor, Minneapolis, Minnesota) for helping us locate some of the older unpublished work in this area, and to Saul Fine (Midot, Israel) and Bernd Marcus (University of Hagen, Germany) for providing unpublished data on some newer integrity tests. Finally, we thank Huy Le for his guidance concerning several technical issues, and Mike McDaniel for his helpful comments on an earlier version of the article. Correspondence concerning this article should be addressed to Chad H. Van Iddekinge, College of Business, Florida State University, 821 Academic Way, P.O. Box 3061110, Tallahassee, FL 32306-1110. E-mail: [email protected]

Abstract

Integrity tests have become a prominent predictor within the selection literature over the past few decades. However, some researchers have expressed concerns about the criterion-related validity evidence for such tests because of a perceived lack of methodological rigor within this literature, as well as a heavy reliance on unpublished data from test publishers. In response to these concerns, we meta-analyzed 104 studies (representing 134 independent samples), which were authored by a similar proportion of test publishers and non-publishers, whose conduct was consistent with professional standards for test validation, and whose results were relevant to the validity of integrity-specific scales for predicting individual work behavior. Overall mean observed validity estimates and validity estimates corrected for unreliability in the criterion (respectively) were .12 and .15 for job performance, .13 and .16 for training performance, .26 and .32 for counterproductive work behavior, and .07 and .09 for turnover. Although data on restriction of range were sparse, illustrative corrections for indirect range restriction did increase validities slightly (e.g., from .15 to .18 for job performance). Several variables appeared to moderate relations between integrity tests and the criteria. For example, corrected validities for job performance criteria were larger when based on studies authored by integrity test publishers (.27) than when based on studies from non-publishers (.12). In addition, corrected validities for counterproductive work behavior criteria were larger when based on self-reports (.42) than when based on other-reports (.11) or employee records (.15).

Keywords: integrity, honesty, personnel selection, test validity, counterproductive work behavior

In recent years, integrity tests have become a prominent predictor within the selection literature. Use of such tests is thought to offer several advantages for selection, including criterion-related validity for predicting a variety of criteria (Ones, Viswesvaran, & Schmidt, 1993) and small subgroup differences (Ones & Viswesvaran, 1998). Researchers also have estimated that across a range of selection procedures, integrity tests may provide the largest amount of incremental validity beyond cognitive ability tests (Schmidt & Hunter, 1998). Furthermore, relative to some types of selection procedures (e.g., structured interviews, work sample tests), integrity tests tend to be cost effective and easy to administer and score. Several meta-analyses and quantitative-oriented reviews have provided the foundation for the generally favorable view of the criterion-related validity of integrity tests (e.g., J. Hogan & Hogan, 1989; Inwald, Hurwitz, & Kaufman, 1991; Kpo, 1984; McDaniel & Jones, 1988; Ones et al., 1993). Ones et al. (1993) conducted the most thorough and comprehensive review of the literature.
Their meta-analysis revealed correlations (corrected for predictor range restriction and criterion unreliability) of .34 and .47 between integrity tests and measures of job performance and counterproductive work behavior (CWB), respectively. These researchers also found support for several moderators of integrity test validity. For instance, validity estimates for job performance criteria were somewhat larger in applicant samples than in incumbent samples. Several variables also appeared to moderate relations between integrity tests and CWB criteria, such that validity estimates were larger for overt tests, incumbent samples, concurrent designs, self-reported deviance, theft-related criteria, and high-complexity jobs. The work of Ones et al. is highly impressive in both scope and sophistication.

Despite these positive results, some researchers have been concerned that the majority of validity evidence for integrity tests comes from unpublished studies conducted by the firms who develop and market the tests (e.g., Camara & Schneider, 1994, 1995; Dalton & Metzger, 1993; Karren & Zacharias, 2007; Lilienfeld, 1993; McDaniel, Rothstein, & Whetzel, 2006; Morgeson et al., 2007; Sackett & Wanek, 1996). For example, conclusions from several reviews of particular integrity tests (e.g., J. Hogan & Hogan, 1989; Inwald et al., 1991), or of the broader integrity literature (e.g., Sackett, Burris, & Callahan, 1989), have been based primarily or solely on test-publisher-sponsored research. The same holds for meta-analytic investigations of integrity test criterion-related validity. For instance, only 10% of the studies in Ones et al.'s (1993) meta-analysis were published in professional journals (p. 696), and all the studies cumulated by McDaniel and Jones (1988) were authored by test publishers.
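The corrected correlations quoted in the abstract and in Ones et al.'s review come from standard psychometric corrections used in validity generalization work. As a rough, hedged illustration of the mechanics only, the sketch below applies the classic correction for attenuation due to criterion unreliability and a simple direct range-restriction correction; the reliability and u values are placeholders chosen for illustration, not the artifact distributions the article actually used (which, for range restriction, involved corrections for indirect restriction).

```python
from math import sqrt

def correct_for_criterion_unreliability(r_obs: float, r_yy: float) -> float:
    """Disattenuate an observed validity for measurement error in the criterion."""
    return r_obs / sqrt(r_yy)

def correct_for_direct_range_restriction(r_restricted: float, u: float) -> float:
    """Thorndike Case II correction, where u = restricted SD / unrestricted SD.
    (Shown only for intuition; indirect range restriction needs more information.)"""
    big_u = 1 / u
    return r_restricted * big_u / sqrt(1 + r_restricted**2 * (big_u**2 - 1))

# Illustrative numbers only: an observed validity of .12 and an assumed
# criterion reliability of .64 give a corrected validity of about .15,
# similar in size to the job performance values reported in the abstract.
print(round(correct_for_criterion_unreliability(0.12, 0.64), 2))   # ~0.15
print(round(correct_for_direct_range_restriction(0.15, 0.80), 2))  # ~0.19
```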
This situation has led to two main concerns. First, questions have been raised about the methodological quality of some of this unpublished test publisher research. For instance, during the 1980s, when there was great interest in the integrity test industry to publish its work, very few studies submitted for publication at leading journals were accepted because of their poor quality (Morgeson et al., 2007). Various methodological issues have been noted about these studies (e.g., Lilienfeld, 1993; McDaniel & Jones, 1988; Sackett et al., 1989), including an overreliance on self-report criterion measures, selective reporting of statistically significant results, and potentially problematic sampling techniques (e.g., use of "extreme groups"). Such issues have prompted some researchers to note that "gathering all of these low quality unpublished studies and conducting a meta-analysis does not erase their limitations. We have simply summarized a lot of low quality studies" (Morgeson et al., 2007, p. 707).

The second concern is that test publishers have a vested interest in the validity of their tests. As Michael Campion noted, "my concern is not the 'file drawer' problem (i.e., studies that are written but never published). I believe that non-supportive results were never even documented" (Morgeson et al., 2007, p. 707). Karren and Zacharias (2007) reached a similar conclusion in their review of the integrity literature, stating that "since it is in the self-interest of the test publishers not to provide negative evidence against their own tests, it is likely that the reported coefficients are an overestimate of the tests' validity" (p. 223).

Concerns over test-publisher-authored research in the integrity test literature resemble concerns over research conducted by for-profit organizations in the medical research literature. The main concern in this literature has been conflicts of interest that may occur when for-profit organizations (e.g., drug companies) conduct studies to test the efficacy of the drugs, treatments, or surgical techniques they produce. Several recent meta-analyses have addressed whether for-profit and non-profit studies produce different results (e.g., Bekelman, Li, & Gross, 2003; Bhandari et al., 2004; Kjaergard & Als-Nielsen, 2002; Ridker & Torres, 2006; Wahlbeck & Adams, 1999). The findings of this work consistently suggest that studies funded or conducted by for-profit organizations tend to report more favorable results than do studies funded or conducted by non-profit organizations (e.g., government agencies). Research of this type also may provide insights regarding validity evidence reported by researchers with and without vested interests in integrity tests.

Present Study

The aim of the current study was to reconsider the criterion-related validity of integrity tests, which we did in three main ways. First, questions have been raised about the lack of methodological rigor within the integrity test literature. This is of particular concern because several of the noted methods issues are likely to result in inflated estimates of validity. These include design features, such as contrasted groups and extreme groups, and data analysis features, such as stepwise multiple regression and the reporting of statistically significant results only.
We address these issues by carefully reviewing each primary study and then meta-analyzing only studies whose design, conduct, and analyses are consistent with professional standards for test validation (e.g., Society for Industrial and Organizational Psychology [SIOP], 2003). This approach is in line with calls for meta-analysts to devote greater thought to the primary studies included in their research (e.g., Berry, Sackett, & Landers, 2007; Bobko & Stone-Romero, 1998).

Second, the results of prior meta-analyses primarily are based on test-publisher research, and there are unanswered questions concerning potential conflicts of interest and the comparability of publisher and non-publisher research results (Sackett & Wanek, 1996). However, such concerns largely are based on anecdotal evidence rather than on empirical data. We address this issue by examining whether author affiliation (i.e., test publishers vs. non-publishers) moderates the validity of integrity tests.

Finally, almost 20 years have passed since Ones et al.'s (1993) comprehensive meta-analysis. We do not attempt to replicate this or other previous reviews, but rather to examine the validity evidence for integrity tests from a different perspective. For example, whereas prior reviews have incorporated primary studies that used a wide variety of samples, designs, and variables, our results are based on studies that met a somewhat more focused set of inclusion criteria (which we describe in the Method section). Further, in addition to job performance and CWB, we investigate relations between integrity tests and two criteria that to our knowledge have not yet been cumulated individually: training performance and turnover. We also investigate the potential role of several previously unexplored moderators, including author affiliation (i.e., test publishers vs. non-publishers), type of job performance (i.e., task vs. contextual performance), and type of turnover (i.e., voluntary vs. involuntary). Finally, we incorporate results of integrity test research that has been conducted since the early 1990s.

We believe the results of the present research have important implications for research and practice. From a practice perspective, practitioners may use meta-analytic findings to guide their decisions about which selection procedures—among the wide variety of procedures that exist—to use or to recommend to managers and clients. Accurate meta-analytic evidence may be particularly important for practitioners who are unable to conduct local validation studies (e.g., due to limited resources, small sample jobs, or lack of good criterion measures) and, thus, may rely more heavily on cumulative research to identify, and help justify the use of, selection procedures than practitioners who do not have such constraints. For instance, if meta-analytic evidence suggests a selection procedure has lower criterion-related validity than actually is the case, then practitioners may neglect a procedure that could be effective and, in turn, end up with a less optimal selection system (Schmidt, Hunter, McKenzie, & Muldrow, 1979).
On the other hand, if meta-analytic evidence suggests a selection procedure has higher criterion-related validity than actually is the case, this could lead practitioners to incorporate the procedure into their selection systems. This, in turn, could diminish the organization's ability to identify high-potential employees and possibly jeopardize the defensibility of decisions made on the basis of the selection process.

Professional organizations devoted to personnel selection and human resources management also use meta-analytic findings as a basis for the assessment and selection information they provide their membership and the general public. For example, materials from organizations such as SIOP and the U.S. Office of Personnel Management (OPM) describe various selection procedures with respect to factors such as validity, subgroup differences, applicant reactions, and cost. Both SIOP and OPM indicate criterion-related validity as a key benefit of integrity tests. For instance, OPM's Personnel Assessment and Selection Resource Center website states that "integrity tests have been shown to be valid predictors of overall job performance as well as many counterproductive behaviors . . . The use of integrity tests in combination with cognitive ability can substantially enhance the prediction of overall job performance" (http://apps.opm.gov/ADT). Meta-analytic evidence also can play an important role in legal cases involving employee selection and promotion. For instance, in addition to the use of meta-analyses to identify and defend the use of the selection procedures, expert witnesses may rely heavily on meta-analytic findings when testifying about what is known from the scientific literature concerning a particular selection procedure.

Lastly, a clear understanding of integrity test validity has implications for selection research. For one, results of meta-analyses can influence the direction of future primary studies in a particular area. As McDaniel et al. (2006, p. 947) noted, "meta-analytic studies have a substantial impact as judged by citation rates, and researchers and practitioners often rely on meta-analytic results as the final word on research questions"; meta-analysis may "suppress new research in an area if there is a perception that the meta-analysis has largely settled all the research questions." Meta-analysis also can highlight issues that remain unresolved and thereby influence the agenda for future research. Second, meta-analytic values frequently are used as input for other studies. For example, criterion-related validity estimates from integrity meta-analyses (e.g., Ones et al., 1993) have been used in meta-analytic correlation matrices to estimate incremental validity beyond cognitive ability tests (e.g., Schmidt & Hunter, 1998) and in simulation studies to examine the predicted performance or adverse impact associated with different selection procedures (e.g., Finch, Edwards, & Wallace, 2009). Thus, the validity of inferences drawn from the results of such studies hinges, in part, on the accuracy of meta-analytic values that serve as input for analysis.
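To make the "input for other studies" point concrete, incremental validity from a meta-analytic correlation matrix is typically computed by comparing the multiple correlation of cognitive ability alone with that of cognitive ability plus the integrity test. The sketch below uses purely illustrative correlations, not values endorsed by this article, just to show the calculation.

```python
from math import sqrt

def multiple_r_two_predictors(r1y: float, r2y: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors,
    computed from the three zero-order correlations."""
    r_squared = (r1y**2 + r2y**2 - 2 * r1y * r2y * r12) / (1 - r12**2)
    return sqrt(r_squared)

# Hypothetical meta-analytic inputs: ability-performance = .50,
# integrity-performance = .15, ability-integrity = .00
r_ability, r_integrity, r_intercorr = 0.50, 0.15, 0.00
r_combined = multiple_r_two_predictors(r_ability, r_integrity, r_intercorr)
print(f"Incremental validity over ability alone: {r_combined - r_ability:.3f}")
```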
In sum, results of the present meta-analysis address questions and concerns about integrity tests that have been debated for years, but until now have not been systematically investigated. This study also incorporates the results of almost 20 years of additional integrity test data that have not been cumulated. We believe the end result is a better understanding of integrity test validity, which is vital to both practitioners and researchers involved in personnel selection. Before we describe the method of our study, we discuss the basis for the potential moderator variables we examine.

Potential Moderators of Integrity Test Validity

Type of Integrity Test

The first potential moderator we examine is type of integrity test. Integrity tests can be either overt or personality-based (Sackett et al., 1989). Overt or "clear-purpose" tests ask respondents directly about integrity-related attitudes and past dishonest behaviors. Conversely, personality-based or "disguised-purpose" tests are designed to measure a broader range of constructs thought to be precursors of dishonesty, including social conformity, impulse control, risk-taking, and trouble with authority (Wanek, Sackett, & Ones, 2003).

Two theoretical perspectives provide a basis for expecting test type to moderate relations between test scores and CWB criteria. According to the theory of planned action (Ajzen, 1991; Ajzen & Fishbein, 2005), the most immediate precursor of behavior is one's intentions to engage in the behavior. This theory also specifies three main determinants of intentions: attitudes toward the behavior, subjective norms regarding the behavior, and perceived control over engaging in the behavior. The second perspective is the theory of behavioral consistency (Wernimont & Campbell, 1968), which is based on the premise that past behavior is a good predictor of future behavior. More specifically, the more a predictor measure samples behaviors that are reflected in the criterion measure, the stronger the relationship between the two measures should be.

Most overt integrity tests focus on measuring attitudes, intentions, and past behaviors related to dishonesty. For example, such tests ask respondents to indicate their views about dishonesty, such as their acceptance of common rationalizations for dishonest behavior (i.e., attitudes), their perceptions regarding the ease of behaviors such as theft (i.e., perceived control), and their beliefs about the prevalence of dishonesty (i.e., subjective norms) and how wrongdoers should be punished (Wanek et al., 2003). Further, many overt tests ask respondents to report past dishonest behaviors, such as overcharging customers and stealing cash or merchandise (i.e., behavior consistency). Thus, on the basis of the theories of planned action and behavioral consistency, people who have more positive attitudes about dishonesty, who believe that most people are somewhat dishonest, and who have engaged in dishonest behaviors in the past, should be more likely to behave dishonestly in the future. In contrast, personality-based integrity tests primarily focus on personality-related traits, such as social conformity and risk-taking. Although potentially relevant to CWB, such traits are more distal to actual behavior than are the attitudes, intentions, and behaviors on which overt tests tend to focus. This leads to our first hypothesis:

Hypothesis 1: There will be a stronger relationship between overt integrity tests and CWB than between personality-based integrity tests and CWB.
We also investigate whether test type moderates relations between integrity tests and measures of job performance.[1] Scores on overt tests may relate to performance because supervisors and peers consider CWB (which such tests were designed to predict) when forming an overall evaluation of an employee's performance (Rotundo & Sackett, 2002). Scores on personality-based tests may relate to performance because some of the traits these tests measure are relevant to performance in certain types of jobs. For example, some personality-based tests assess elements of conscientiousness, such as rule abidance, orderliness, and achievement orientation (Wanek et al., 2003). However, we are not aware of a compelling theoretical basis to predict that either type of test will be strongly related to job performance (particularly to task-related performance), or to predict that one test will be a better predictor of performance than will the other. Thus, we explore test type as a potential moderator of validity with respect to performance criteria.

Research Question 1: Does type of integrity test (overt vs. personality-based) moderate relations between test scores and job performance?

Study Design and Sample

The next two potential moderators we examine are study design (i.e., predictive vs. concurrent) and study sample (i.e., applicants vs. incumbents), which typically are concomitant within the selection literature (Van Iddekinge & Ployhart, 2008). We expect to find higher validity estimates in concurrent designs than in predictive designs because in concurrent studies, respondents complete an integrity test and a self-report CWB measure at the same time. As a result, relations between scores on the two measures are susceptible to common method factors, such as transient mood state and measurement context effects (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). In contrast, predictive designs are less susceptible to such influences, because completion of the integrity test and CWB measure are separated by time, context, and so forth.

Another reason why we expect to find larger validity estimates in concurrent designs concerns the potential for the predictor and criterion in these studies to assess the same behavioral events. For example, many integrity tests (particularly overt tests but also some personality-based tests) ask respondents to report dishonest or counterproductive behaviors they have displayed recently at work. In a concurrent design, participants are then asked to complete a self-report measure of work-related CWB. Thus, the two measures may ask the respondent about the same types of behaviors but using different questions. In fact, some have suggested that correlations between overt integrity tests and self-reported CWB are more like alternate form or test–retest reliability estimates than like criterion-related validity estimates (e.g., Morgeson et al., 2007; Sackett & Wanek, 1996). This same logic also may apply to other criteria used to validate integrity tests, such as employee records of CWB and ratings of job performance. If an integrity test asks respondents to report CWB they recently demonstrated, and then test scores are related to employee records that reflect the same instances of this CWB (e.g., of theft, absenteeism, insubordination), then relations between test scores and employee records may be stronger than if the two measures were separated in time (and thus assessed different instances of behavior).
Similarly, supervisors may be asked to evaluate employees' performance over the past 6 months or a year, and although these ratings may focus primarily on productive behaviors, they may (explicitly or implicitly) capture counterproductive behaviors as well. This, in turn, may result in stronger relations between integrity test scores and performance ratings than if test scores reflected employees' pre-hire attitudes and behaviors.

Hypothesis 2: Criterion-related validity estimates for integrity tests will be larger in concurrent designs than in predictive designs.

We also expect to find higher validity estimates in incumbent samples than in applicant samples. Although the debate continues concerning the prevalence and effects of applicant response distortion on personality-oriented selection procedures (e.g., Morgeson et al., 2007; Ones, Dilchert, Viswesvaran, & Judge, 2007; Tett & Christiansen, 2007), meta-analytic research suggests that integrity tests, particularly overt tests, are susceptible to faking and coaching (e.g., Alliger & Dwight, 2000). Thus, to the extent that faking is more prevalent among applicants than among incumbents, lower criterion-related validities may be found in applicant samples. Finally, a finding of stronger validity evidence for concurrent designs and incumbent samples would be consistent with the results of primary and meta-analytic studies that have examined the moderating effects of validation design or sample on other selection procedures, including personality tests (e.g., Hough, 1998), biodata inventories (e.g., Harold, McFarland, & Weekley, 2006), situational judgment tests (e.g., McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001), and employment interviews (e.g., Huffcutt, Conway, Roth, & Klehe, 2004).

Hypothesis 3: Criterion-related validity estimates for integrity tests will be larger in incumbent samples than in applicant samples.

Performance Construct

In recent years, researchers have devoted increased attention to understanding the criteria used to validate selection procedures. One important trend in this area concerns the identification and testing of multidimensional models of job performance (Campbell, McCloy, Oppler, & Sager, 1993). One model that has received support partitions the performance domain into three broad dimensions: task performance, contextual or citizenship performance, and counterproductive performance or CWB (e.g., Rotundo & Sackett, 2002).[2] Task performance involves behaviors that are a formal part of one's job and that contribute directly to the products or services an organization provides. Contextual performance involves behaviors that support the organizational, social, and psychological context in which task behaviors are performed. Examples of citizenship behaviors include volunteering to complete tasks not formally part of one's job, persisting with extra effort and enthusiasm, helping and cooperating with coworkers, following company rules and procedures, and supporting and defending the organization (Borman & Motowidlo, 1993). Finally, counterproductive performance (i.e., CWB) reflects voluntary actions that violate organizational norms and threaten the well-being of the organization and/or its members (Robinson & Bennett, 1995; Sackett & Devore, 2001). Researchers have identified various types of CWB, including theft, property destruction, unsafe behavior, poor attendance, and intentional poor performance.

Footnote 1: As we discuss later, CWB can be considered an aspect of job performance (e.g., Rotundo & Sackett, 2002). However, we use job performance to refer to "productive" performance behaviors (i.e., task and contextual behaviors) and CWB to refer to counterproductive behaviors.

Footnote 2: Some models also include adaptive performance, which concerns the proficiency with which individuals alter their behavior to meet the demands of the work environment (Pulakos, Arad, Donovan, & Plamondon, 2000). However, relations between integrity tests and adaptive performance have not been widely examined, and thus we do not consider this performance construct here.

We expect both overt and personality-based tests will relate more strongly to CWB than to productive work behaviors that reflect task or contextual performance. Integrity tests primarily are designed to predict CWB, and as we noted, some integrity tests and CWB measures even include the same or highly similar items concerning past or current CWB. We also note that researchers have tended to measure CWB using self-reports, whereas productive work behaviors often are measured using supervisor or peer ratings. Thus, common method variance also may contribute to stronger relations between integrity tests and CWB than between integrity tests and productive work behaviors.

Hypothesis 4: Criterion-related validity estimates for integrity tests will be larger for CWB than for productive work behaviors that reflect task and contextual performance.

We also explore whether integrity tests relate differently to task performance versus contextual performance. A common belief among researchers is that ability-related constructs (e.g., cognitive ability) tend to be better predictors of task performance, whereas personality-related constructs (e.g., conscientiousness) tend to be better predictors of contextual performance (e.g., Hattrup, O'Connell, & Wingate, 1998; LePine & Van Dyne, 2001; Van Scotter & Motowidlo, 1996). If true, then integrity tests—which are thought to capture personality traits, such as conscientiousness, emotional stability, and agreeableness (Ones & Viswesvaran, 2001)—may demonstrate stronger relations with contextual performance than with task performance. However, some studies have found that personality constructs do not demonstrate notably stronger relationships with contextual behaviors than with task behaviors (e.g., Allworth & Hesketh, 1999; Hurtz & Donovan, 2000; Johnson, 2001). One possible contributing factor to this finding is that measures of task and contextual performance tend to be highly correlated (e.g., Hoffman, Blair, Meriac, & Woehr, 2007), which may make it difficult to detect differential relations between predictors and these two types of performance. Thus, although a theoretical rationale exists to expect that integrity tests will relate more strongly to contextual performance than to task performance, we might not necessarily find strong empirical support for this proposition.

Research Question 2: Does job performance construct (task performance vs. contextual performance) moderate the criterion-related validity of integrity tests?

Breadth and Source of CWB Criteria

Researchers have used various types of CWB measures to validate integrity tests. One factor that differentiates CWB measures is the "breadth" of their content.
Some measures are broad in scope and assess multiple types of CWB, such as theft, withdrawal, substance abuse, and violence. Other measures are narrower and assess only one type of CWB, such as theft. Ones et al. (1993) addressed this issue by comparing validity evidence for integrity tests for criteria that assessed theft only to validity evidence for broader criteria that reflected violence, withdrawal, and other CWB. Results revealed a somewhat larger corrected mean validity for theft-related criteria (.52) than for broader CWB criteria (.45). We also examine whether criterion breadth moderates relations between integrity tests and CWB. Specifically, we compare validities for criteria that reflect multiple types of CWB to validities for criteria that reflect only one type of CWB, namely, substance abuse, theft, or withdrawal, which are among the most commonly measured CWB dimensions.

Integrity tests are considered relatively broad, non-cognitive predictors (Ones & Viswesvaran, 1996). Most overt tests assess multiple aspects of integrity, including thoughts and temptations to behave dishonestly, admissions of past dishonesty, norms about dishonest behavior, beliefs about how dishonest individuals should be punished, and assessments of one's own honesty. Personality-based tests also tend to be quite broad and tap into constructs such as conscientiousness, social conformity, risk-taking, impulsivity, and trouble with authority. Because integrity tests are broad in scope, we expect they will relate more strongly to criteria that also are broad in scope than to narrower criteria.

Hypothesis 5: Criterion-related validity estimates for integrity tests will be larger for broad CWB criteria than for more narrow CWB criteria.

We also investigate whether the "source" of CWB information moderates integrity test validity. Researchers have measured CWB using self-reports from applicants or employees, supervisor or peer ratings, and employee records of disruptive behavior. Integrity tests and self-report measures are subject to several of the same method factors (Podsakoff et al., 2003). For instance, because the same individuals complete both measures, relations between integrity and CWB may be influenced by common rater effects, such as social desirability, consistency motif, and mood state (e.g., negative affectivity). Further, because studies often have had participants complete an integrity test and a self-report CWB measure on the same occasion, relations between the two also are subject to transient mood and measurement context effects. Such factors are likely to result in larger validities when the criterion data are provided by a common rater (i.e., self-reports) than when they are provided by a different rater (e.g., a supervisor) or a different source of information (e.g., company records).

Hypothesis 6: Criterion-related validity estimates for integrity tests will be larger for self-reported CWB than for external measures of CWB.
Type of Turnover

Theory and research on antecedents of employee turnover have tended to focus on job attitudes, such as job satisfaction and organizational commitment (e.g., Griffeth, Hom, & Gaertner, 2000), which are difficult or impossible to assess during the selection process, as most applicants have not yet been exposed to the job or organization. Recently, however, the use of selection procedures to reduce turnover has received attention. Importantly, studies have found that personality variables, such as conscientiousness and emotional stability, may be useful for predicting turnover (e.g., Barrick & Mount, 1996; Barrick & Zimmerman, 2009). We attempt to add to this recent stream of research by investigating relations between integrity tests and turnover. On the one hand, integrity tests may capture variance in the types of CWB that lead to involuntary turnover. They also may capture variance in personality traits related to voluntary turnover (see below). On the other hand, myriad reasons may cause an employee to leave an organization, and thus any single predictor is unlikely to account for a large portion of the variance in turnover. Further, turnover typically is difficult to predict because of low base rates, imprecise coding of reasons for turnover, and the dichotomous nature of this criterion. Thus, if a relationship between integrity tests and turnover exists, we expect it will be a modest one.

We also examine whether the type of turnover moderates integrity–turnover relations. Involuntary turnover typically results from substandard job performance or CWB, such as absenteeism, theft, or substance abuse. Thus, to the extent integrity tests are related to job performance or CWB, such tests also may predict involuntary turnover. Because integrity tests are thought to tap into personality traits like conscientiousness and emotional stability, such tests also may predict voluntary turnover. For example, it has been suggested that conscientious individuals are more likely to believe they have a moral obligation to remain with an organization, which affects their commitment to the organization and, in turn, retention decisions (Maertz & Griffeth, 2004). Further, employees low on emotional stability are more likely to experience negative states of mind or mood, which can lead to conflict with coworkers and lack of socialization, which can lead to stress and ultimately influence decisions to leave the organization (Barrick & Zimmerman, 2009). However, if a relationship exists between integrity tests and voluntary turnover, it would seem to be less direct, and thus more modest, than the relationship between integrity tests and involuntary turnover. In addition, voluntary turnover often is due to factors other than employees' personality and behavior, including poor work conditions, availability of better jobs, and work-life issues, such as relocation due to a spouse's job change (Griffeth et al., 2000). This leads to our next hypothesis:

Hypothesis 7: Criterion-related validity estimates for integrity tests will be larger for involuntary turnover than for voluntary turnover.
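One reason low base rates make turnover hard to predict is purely statistical: with a dichotomous criterion, the same underlying group difference translates into a smaller point-biserial correlation as the turnover rate moves away from 50%. The sketch below is only an illustration of that arithmetic with made-up effect sizes; it is not an analysis from the article.

```python
from math import sqrt

def point_biserial_from_d(d: float, p: float) -> float:
    """Point-biserial correlation implied by a standardized mean difference d
    between leavers and stayers when a proportion p of the sample leaves."""
    q = 1 - p
    return d / sqrt(d**2 + 1 / (p * q))

# Same hypothetical group difference (d = 0.3), different turnover base rates:
for base_rate in (0.50, 0.20, 0.05):
    r = point_biserial_from_d(0.3, base_rate)
    print(f"base rate {base_rate:.2f} -> r = {r:.3f}")
# The implied correlation shrinks as turnover becomes rarer.
```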
Author Affiliation

As discussed, a prevalent concern about integrity tests is that test-publisher research may provide an overly optimistic view of criterion-related validity. For one, questions have been raised about methodological approaches used by some test publishers that tend to overestimate validity (e.g., extreme group designs, reporting significant results only). Second, because publishers have a vested interest in the success of their tests, questions have been raised about the possible suppression of studies that may reveal less optimistic results. The documentation of publication bias in data from some selection test publishers has served to further increase awareness of this issue (McDaniel et al., 2006). Despite this, we are not aware of any empirical evidence that supports or refutes the claim that test publishers report more positive validity evidence for integrity tests than do non-publishers.

As we noted earlier, studies in the medical literature have used meta-analysis to assess the comparability of results from for-profit versus non-profit research and have found that for-profit research tends to report more favorable results than does non-profit research (e.g., Bhandari et al., 2004; Kjaergard & Als-Nielsen, 2002; Wahlbeck & Adams, 1999). Although we are not aware of any analogous studies within the selection literature, Russell et al. (1994) examined the influence of investigator characteristics on reported criterion-related validity estimates for various personnel selection procedures. Studies whose first author was employed in private industry were associated with somewhat higher mean validities (r = .29) than studies whose first author was an academician (r = .24). Furthermore, studies conducted to address some organizational need tended to yield higher validities than studies conducted for research purposes. For example, studies conducted for legal compliance were associated with higher mean validities (r = .33) than studies conducted for theory testing and development (r = .22).

We adopted a similar approach to try to understand the potential influence of author affiliation on the reported validity evidence for integrity tests. Specifically, we estimate criterion-related validity for three separate categories of integrity studies: (a) studies authored by test publishers only, (b) studies authored by non-publishers only, and (c) studies authored by publishers and non-publishers. Additionally, among non-publisher studies, we compare validity evidence from studies authored by researchers who developed the integrity test to validity evidence from studies whose authors did not develop the test.

Research Question 3: Does author affiliation (test publisher vs. non-publisher) moderate the criterion-related validity of integrity tests?

Publication Status

Finally, we examine whether published and unpublished studies on integrity tests report similar or different levels of validity evidence. Publication bias can occur when studies are more likely to be published depending on the magnitude, direction, or statistical significance of the results (Dickerson, 2005; McDaniel et al., 2006). Indeed, a common assumption is that studies that find large or statistically significant results are overrepresented in the published literature because journals have limited space and consider such results more interesting than small or non-significant results. Researchers also may contribute to this phenomenon by submitting studies with significant findings while putting studies with nonsignificant findings in the "file drawer" (R. Rothstein, 1979).
The only study we know of to examine publication status and integrity test validity is Ones et al. (1993), who found an observed correlation of –.02 between publication status and validity (i.e., published studies tended to report slightly larger validity estimates). We add to their analyses by reporting separate validity estimates for published and unpublished studies for each criterion in our analyses. Conventional wisdom would suggest that published integrity test studies will yield larger validity estimates. However, because a large portion of test-publisher research is unpublished, and given concerns that test-publisher studies may provide an overly optimistic view of integrity test validity, there might not be a strong association between publication status and validity in this literature.

Research Question 4: Does publication status (published vs. unpublished) moderate the criterion-related validity of integrity tests?

Method

Literature Search

We started by searching for published articles on integrity test criterion-related validity, beginning with the articles included in Ones et al.'s (1993) comprehensive meta-analysis. We then searched available electronic databases (e.g., PsycINFO, ABI/INFORM Global, ERIC) for additional studies. We searched for words such as "integrity," "honesty," and "dishonesty" appearing in study titles and abstracts. We also performed separate searches on the names of each of the approximately 30 integrity tests we identified through our research as well as the names of known integrity test developers and researchers. We then reviewed the reference sections of all of the obtained articles to identify additional publications.

Our search process for unpublished studies was much more involved. We started by attempting to obtain copies of all the unpublished papers, technical reports, and test manuals cited by Ones et al. (1993). If we could not locate or obtain a response from the original authors of a given study, we contacted other authors who have cited the study in their work to see whether they had a copy of the paper. For example, we contacted authors of several qualitative reviews and books related to integrity testing to obtain copies of the primary studies they reviewed (e.g., Goldberg, Grenier, Guion, Sechrest, & Wing, 1991; O'Bannon, Goldinger, & Appleby, 1989; U.S. Congress, Office of Technology Assessment, 1990). We also attempted to contact all the publishers whose integrity tests were cited in Ones et al. (1993). This was challenging because many of these publishers are no longer in existence, several publishers have changed names, and some tests are now published by different companies. In addition, we identified several newer integrity tests during this process, and we attempted to contact the publishers of these tests.

We encountered a range of responses from our attempts to contact the approximately 30 test publishers we identified. Several publishers did not respond, and a few others responded but declined to participate (e.g., because of concerns about how the studies would be used). One major test publisher declined to participate after several months of discussions. Furthermore, this publisher advised us (under threat of legal recourse) that we could not use unpublished studies on their tests we obtained from other researchers, such as those who authored the qualitative reviews noted earlier (see Footnote 10).
Of the publishers who responded to our inquiries and expressed interest in helping us, almost all required us to submit a formal research proposal and/or to sign a non-disclosure agreement. In the end, only two publishers provided us with more than just a few studies for potential inclusion in the meta-analysis. Our overall experience appears to be similar to some of the experiences described by past researchers who have attempted to obtain unpublished studies from integrity test publishers (e.g., Camara & Schneider, 1995; Lilienfeld, 1993; Martelli, 1988; Snyman, 1990; Woolley & Hakstian, 1993).

Finally, we took several steps to obtain additional unpublished studies. In addition to requesting newer validity studies from each of the publishers we contacted, we searched the Dissertation Abstracts database for unpublished doctoral dissertations and master's theses. We also searched electronic and hard copies of programs from the annual conventions of the Academy of Management, American Psychological Association, and SIOP. Last, we contacted numerous researchers who have published in the integrity test area for any "file drawer" or in-progress studies. Overall, we located 324 studies that appeared relevant to the criterion-related validity of integrity tests. Of these studies, 153 were included in Ones et al.'s (1993) meta-analysis, and 171 were not. Most of the studies not included in Ones et al. were completed subsequent to their meta-analysis.

Inclusion Criteria

Our interest was to identify primary studies whose results were relevant to the criterion-related validity of integrity-specific scales for predicting individual work behavior and whose conduct was consistent with professional standards for test validation. With this in mind, we set up several criteria to foster a careful review of the available studies.

The first criterion for inclusion in the meta-analysis concerned study design. We only included studies that collected both predictor and criterion data on individual participants. We excluded studies that compared integrity test scores between a group of known deviants (e.g., prisoners) and a group of "normal" individuals (e.g., job applicants). In addition to lacking criterion data, this "contrasted group" approach can overestimate validity, because it is easier to differentiate between criminals and non-criminals than it is to differentiate among non-criminals (Coyne & Bartram, 2002; Murphy, 1995). We also excluded studies that examined relations at the unit level of analysis, such as how use of an integrity test for selection correlated with theft or inventory shrinkage for entire stores, rather than for individual employees. First, relations among aggregate data are not necessarily the same as relations among the underlying individual data (E. L. Thorndike, 1939). Therefore, inclusion of unit-level integrity studies could have distorted validity estimates for individual-level criteria (McDaniel & Jones, 1986). Second, most unit-level studies have used some form of time-series design, whereby changes in an outcome (e.g., theft) were measured before and after implementation of an integrity testing program. Although the results of such research are interesting and potentially valuable, they do not provide validity estimates for individual-level outcomes.
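The point that aggregate (store-level) relations need not match individual-level relations can be seen in a small simulation. The snippet below is a hedged illustration with arbitrary parameters, not a reanalysis of any study: it generates individual-level data in which a store-level factor shifts both variables, and shows that the correlation computed on store means can come out very different from the correlation among individuals.

```python
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(1)
n_stores, n_per_store = 50, 30
all_scores, all_cwb = [], []
store_mean_scores, store_mean_cwb = [], []

for _ in range(n_stores):
    store_effect = random.gauss(0, 1)   # hypothetical store-level factor (e.g., climate)
    scores, cwb = [], []
    for _ in range(n_per_store):
        person = random.gauss(0, 1)     # individual deviation in integrity score
        scores.append(store_effect + person)
        # within a store, higher integrity goes with less CWB;
        # across stores, the store-level factor pushes both up together
        cwb.append(0.8 * store_effect - 0.8 * person + random.gauss(0, 1))
    all_scores += scores
    all_cwb += cwb
    store_mean_scores.append(statistics.mean(scores))
    store_mean_cwb.append(statistics.mean(cwb))

print("individual-level r:", round(pearson(all_scores, all_cwb), 2))               # near zero here
print("store-level r:", round(pearson(store_mean_scores, store_mean_cwb), 2))      # large and positive here
```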
The second inclusion criterion concerned the type of integrity test examined. We only included studies that examined one or more integrity-specific scales contained within an integrity test whose content appeared consistent with overt or personality-based integrity tests. We excluded studies that only reported scores from scales of related, but different, constructs contained within an integrity test. For instance, although some overt tests include items that assess views and behaviors concerning substance abuse, we did not include scales that focus on substance abuse only. As an example, certain versions of the Personnel Selection Inventory (Vangent, 2007) include both the Honesty Scale, which is regarded in the literature as an integrity test, and the Drug Avoidance Scale, which focuses more specifically on the use and sale of illegal drugs and is not regarded as an integrity test per se. Likewise, we excluded scales that focus solely on attitudes and behaviors about workplace safety and customer service. We also excluded studies in which an integrity scale was included in a composite of several scales measured within the same instrument, and no integrity scale-specific validity estimates were reported. In sum, we focused on integrity-specific scales only. In addition, we excluded studies that examined organizational surveys designed to assess the integrity of existing employees, such as the Employee Attitude Inventory (London House, 1982). Such measures were designed to assess integrity-related attitudes and behaviors among an organization's current employees (e.g., employees' perceptions about the prevalence of dishonest behavior in their workplace) and are not intended for preemployment selection (Jones, 1985). Finally, we excluded a few studies that used other methods to measure integrity, such as interviews (e.g., Gerstein, Brooke, & Johnson, 1989) and situational judgment tests (e.g., Becker, 2005). Although we encourage exploration of such methods, our primary interest was to estimate the criterion-related validity of traditional integrity tests.

The third inclusion criterion concerned the type of criterion. To be included, studies had to relate integrity test scores to scores on one or more work-related criteria, including measures of job performance, training performance, CWB, or turnover. We excluded studies that used measures of non-work deviance as criteria, such as academic cheating, traffic violations, and shoplifting. In addition, we excluded studies in which students participated in lab studies in which their (non-work related) integrity-related attitudes or behaviors were measured in response to an experimental manipulation (e.g., whether students kept or returned an overpayment for participation; Cunningham, Wong, & Barbee, 1994). Although these types of studies and outcomes are important, we focused on validity evidence for criteria that were more directly relevant to workplace behavior. Due to longstanding concerns regarding the validity and reliability of polygraphs (e.g., Lykken, 1981; Sackett & Decker, 1979; Saxe, Dougherty, & Cross, 1985; U.S. Congress, Office of Technology Assessment, 1983), we also excluded studies that used polygrapher ratings as the criterion to validate an integrity test. Furthermore, we excluded studies in which the criterion reflected different types of, or reasons for, turnover.
For example, several studies from a particular test publisher used purportedly interval-scaled turnover measures, such as 1 = voluntary turnover-would rehire, 2 = reduction in force-may rehire, 3 = probationary-may not rehire, 4 = involuntary turnover-minor offense, and 5 = involuntary turnover-major offense. In addition to questions about treating these as equal intervals, such scales appear to confound turnover with performance (e.g., voluntary turnover-would rehire vs. reduction in force-may rehire). Finally, although we coded studies that used employee tenure as a criterion, we did not include results from these studies in our analysis of turnover because tenure and turnover are related, but different, criteria (Williams, 1990).

The fourth criterion for inclusion concerned the reporting of validity results. Each study had to describe an original study and to provide sufficient details concerning the research design, data analysis, and results. For example, we did not include secondary reports of studies from qualitative reviews (e.g., Sackett et al., 1989) and meta-analytic studies of specific integrity measures (e.g., McDaniel & Jones, 1988), as such studies tend to provide only basic results for the primary studies analyzed, such as sample size, study design (e.g., predictive vs. concurrent), and validity coefficients. Instead, as noted above, we attempted to obtain the primary studies cited in these secondary sources to judge whether the conduct and results of the original study met all the criteria described herein. We also had to exclude many studies for which the study particulars (e.g., sampling procedures, data analysis, validity results) were not fully described, or for which the description of these elements was so unclear that we could not be reasonably confident about the resulting validity estimates.[3] This situation appears to be consistent with previous quantitative and qualitative reviews of the integrity test literature. For example, Ones et al. (1993, p. 696) noted that the test publisher technical reports included in their meta-analysis were "sketchy, often omitting important information," and O'Bannon et al. (1989, p. 70) noted that many reports were "ambiguous, incomplete, or not detailed enough to be properly evaluated" (also see Sackett & Harris, 1984). However, before excluding such studies, we first made several attempts to contact the authors to clarify our questions or to obtain the necessary information to estimate validity. Finally, we excluded several studies that were referenced in previous integrity test criterion-related validity meta-analyses, but that did not appear to report any validity results.

Footnote 3: Space limitations prevent us from describing each of the studies that we excluded because of unclear reporting. However, we briefly describe three studies as examples. One study reported the results of a validation study of an integrity test across two samples. However, some validity results were reported for only one of the samples, some for both samples separately, and some for both samples combined, without explanation as to why or how the samples were (were not) combined. Further, although a particular subscale of the integrity test is designed to predict length of service on the job, only the correlation between the other subscale and turnover was reported, and the sample size on which the correlation was based did not match any of the other sample sizes reported in the article. In another study, 1,657 job applicants were hired during the study period, but the integrity test was administered to only 367 of these applicants (even though recruiters were instructed to give the test to all applicants). Then, 90-day performance evaluations could be located for only 146 of these individuals, yet the authors did not indicate the number of applicants who actually were hired. Further, the validity of integrity test scores for a subset of these individuals was only -.05, which led the researchers to remove these data points from the final sample, which comprised only 71 employees. In a third study, the authors had store managers provide job performance and turnover information for 131 subordinates, who completed an integrity test as job applicants. However, the authors indicated that only 100 of these individuals actually had been hired. Then, only 44 of these employees were included in the validation sample. Finally, no information was provided concerning the job performance criteria used to validate the test, and the resultant validity coefficients were described as both "correlations" and "weights assigned by the discriminant function."

The fifth criterion concerned the exclusion of studies for specific methodological issues we encountered in this literature. First, we excluded studies that reported statistically significant validity results only, as exclusion of non-significant results can lead to overestimates of validity (McDaniel & Jones, 1988). Second, and possibly related to the above, we excluded studies that collected data on multiple integrity test scales and/or multiple criterion measures, but reported validity estimates for only certain test scales or criteria and did not explain why this was done. Third, we excluded studies for which variance on the predictor, criterion, or both was artificially increased. For instance, we excluded studies that oversampled low performing employees. We also excluded studies that used an extreme-groups approach that, for example, collected job performance data on a range of employees but then used data from the top and bottom 25% of performers only to validate an integrity test. Such studies were excluded because they can produce higher correlations than if participants were selected at random (e.g., Sackett & Harris, 1984) as well as reduce the representativeness of the sample (i.e., because cases between the extremes are omitted; Butts & Ng, 2009). Finally, we only included results based on independent samples. To help ensure this, we used Wood's (2008) method to identify (and exclude) studies in which a sample appeared to overlap with a sample from another article authored by the same researchers. When possible, we also tried to confirm apparent instances of sample overlap with the study authors.

Of the 324 studies we found, 104 (32.1%) met all the criteria. These 104 studies comprised 42 published studies and 62 unpublished studies, and a total of 134 independent samples. Table 1 shows the number and percentage of studies we had to exclude according to each inclusion criterion.
Although studies were excluded for a range of reasons (and many studies could have been excluded for multiple reasons), the three most prevalent were (a) lack of details concerning the research design, data analysis, and/or results; (b) use of polygrapher ratings as validation criteria; and (c) use of contrasted group designs that compared integrity test scores between a group of known deviants (e.g., prisoners) and a group of "normal" individuals (e.g., job applicants).

Table 1
Number and Percentage of Excluded Studies by Inclusion Criterion

Outcome/inclusion criterion                                        k      %
Total studies reviewed                                           324
Studies that passed all inclusion criteria                       104   32.1
Studies that did not pass one or more inclusion criteria         220   67.9
1. Study design criterion
   Contrasted group design                                        24    9.6
   Unit-level analysis                                             8    3.2
   Time-series design                                             17    6.8
2. Integrity test criterion
   Predictor was not integrity-specific                           14    5.6
   Composite included both integrity and nonintegrity scales       6    2.4
   Integrity survey for existing employees                        16    6.4
   Alternative type of integrity measure                           6    2.4
3. Validation criteria criterion
   Criteria reflected non-job related behaviors                   17    6.8
   Laboratory experiment                                           9    3.6
   Polygraph as criterion                                         34   13.7
   Criterion reflected different types/reasons for turnover        6    2.4
4. Reporting of validity results criterion
   Lack of sufficient details regarding study particulars         34   13.7
   Unclear methods and/or results                                  8    3.2
   No apparent criterion-related validity results                 13    5.2
5. Methodological issues criterion
   Reported significant results only                              15    6.0
   Reported results for only certain predictors or criteria        4    1.6
   Extreme groups/range enhancement                               12    4.8
6. Independent sample criterion
   Sample overlapped with a sample from another study              6    2.4

Note. The ks and associated percentages for the inclusion criteria reflect the percentage of excluded studies (k = 220) that were excluded because of each criterion. Because many excluded studies failed to meet multiple criteria, the total k exceeds 220.

Coding of Primary Studies

Two of the authors coded all the studies. Both were professors with more than 10 years of research experience. We coded whether the integrity test was overt or personality-based, whether the sample comprised applicants or incumbents (including students who were employed or recently employed), and whether the integrity test and criterion data were collected using a concurrent or predictive design. We also coded whether the criterion measured task performance, contextual performance, CWB, or some combination thereof. We used definitions from the work of Borman and colleagues (e.g., Borman & Motowidlo, 1993; Coleman & Borman, 2000) to differentiate task performance from contextual performance. We categorized measures as task or contextual only when the primary authors specifically indicated such, or when we could be reasonably confident that a measure reflected primarily task (contextual) performance according to the dimension descriptions provided. Further, we categorized CWB according to the dimensions of counterproductivity and workplace deviance identified by Gruys and Sackett (e.g., Gruys & Sackett, 2003; Sackett, 2002) and Bennett and Robinson (Bennett & Robinson, 2000; Robinson & Bennett, 1995). We also coded the source of the measures, namely, self-report, other-report (i.e., supervisors or peers), or employee records. Finally, with respect to author affiliation, the studies in our data set were authored by test publishers, non-test publishers, and a combination of publishers and non-publishers.
We also noted two types of non-publishers: researchers who developed the integrity test they studied and researchers who did not develop the test. Thus, we categorized the authors of each study as follows: (a) test publishers, (b) non-publishers who developed the test, (c) non-publishers who did not develop the test, and (d) test publishers and non-publishers.

Before analyzing the data, we estimated interrater agreement on the coding of key study variables, including the validity coefficients, sample sizes, and proposed moderators. The percentage of judgments on which the two authors agreed ranged from 95% to 100% across the coded variables, with a mean of 97%. Instances of disagreement were resolved after discussion. In the Appendix, we provide the main codes and input values for 74 of the 104 primary studies included in the meta-analysis. Information for the remaining studies was withheld to protect the confidentiality of client-specific technical reports from certain test publishers.

Analyses

Observed validities. We implemented Hunter and Schmidt's (2004) psychometric approach to meta-analysis. We began by identifying (and/or computing) the observed validity coefficient(s) within each primary study. Most primary studies reported zero-order correlations. Instead of correlations, several studies reported means and standard deviations or frequency counts in 2 × 2 tables (e.g., did vs. did not receive a particular score on an integrity test, and stayed on vs. left the job). Thus, we first converted such statistics to correlation coefficients.

Studies reported a variety of validity coefficients depending on the nature and number of integrity tests and criteria examined. If studies reported a validity coefficient based on an overall integrity test score and an overall criterion score, we used that coefficient in our analyses. If studies reported validities only for subscales of a test (e.g., the Performance and Tenure scales of the PDI Employment Inventory; ePredix, 2001) and/or facets of a criterion measure (e.g., two dimensions of task performance), we estimated a unit-weighted composite validity coefficient using the predictor and criterion correlations (Schmidt & Le, 2004). If the primary authors did not report the correlations needed to estimate a composite validity, we tried to obtain correlations from the authors. In instances for which we could not obtain the necessary information to estimate a composite validity, we computed the mean validity across the predictors and/or criteria for the given study.⁴ Because our interest was to estimate the validity of using a single integrity test for selection, we did not estimate a composite validity for studies that examined the validity of multiple integrity tests, as such estimates would not be comparable with those from single-test studies. In these cases, we computed the mean validity across integrity tests.

Footnote 4: See the Limitations and Directions for Future Research section for a discussion of the possible implications of having to use mean validities instead of composite validities in such cases.
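The article does not spell out the composite computation, but the standard unit-weighted composite correlation follows directly from the predictor and criterion intercorrelations. The Python sketch below is ours, not the authors' code; the function name and the example values are purely illustrative.

```python
import numpy as np

def unit_weighted_composite_validity(r_xy, r_xx, r_yy):
    """Correlation between a unit-weighted composite of predictors and a
    unit-weighted composite of criteria, from the intercorrelation matrices.

    r_xy: (p, q) predictor-criterion correlations
    r_xx: (p, p) predictor intercorrelations (1s on the diagonal)
    r_yy: (q, q) criterion intercorrelations (1s on the diagonal)
    """
    r_xy, r_xx, r_yy = (np.asarray(m, dtype=float) for m in (r_xy, r_xx, r_yy))
    # Covariance of the two sums over the product of their standard deviations.
    return r_xy.sum() / (np.sqrt(r_xx.sum()) * np.sqrt(r_yy.sum()))

# Hypothetical example: two subscales predicting two performance dimensions.
r_xy = [[.20, .15],
        [.10, .12]]
r_xx = [[1.0, .50],
        [.50, 1.0]]
r_yy = [[1.0, .60],
        [.60, 1.0]]
print(round(unit_weighted_composite_validity(r_xy, r_xx, r_yy), 3))
```

When either side has a single measure, the corresponding matrix is simply 1 × 1 and the formula collapses to the familiar single-predictor case.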
Furthermore, some test publishers used multiple items or subscales from the same integrity test to predict criteria. For instance, the Inwald Personality Inventory (Institute for Personality and Ability Testing, 2006) includes 26 separate scales, and many of the studies based on this measure used discriminant function analysis or multiple regression analysis to estimate criterion-related validity. The coefficients from such analyses reflect the combined validity of all 26 scales, which are optimally weighted to predict the outcome of interest. Because the weights assigned to each scale often are chosen on the basis of the research sample, and given the large number of predictors considered, there is a high likelihood of capitalization on chance in such situations. Although we were somewhat hesitant to include validity estimates from these studies with the more common bivariate validity estimates that the vast majority of other studies in our data set reported, this is the approach the developers of these integrity tests have tended to use. Because we wanted our results to reflect how integrity tests have been (and are being) used, we cautiously included this small set of studies in the meta-analysis. However, we first attempted to obtain correlations among the items or subscales so that we could calculate a unit-weighted composite validity estimate. For studies for which we could not obtain predictor intercorrelations, we adjusted the reported validity estimate for shrinkage using the positive-part Pratt population R formula described by Shieh (2008). The resulting values estimate what validity would be if the coefficients were calculated based on the full population. We used the unit-weighted composite validities or shrinkage-adjusted validities, rather than the original validities, in our analyses.⁵ The one exception is that for the author affiliation moderator analyses, we report results using validity estimates that the test publishers originally reported (which we label "reported validity" in the tables) and using validity estimates that we computed (which we label "computed validity" in the tables). That is, for the computed validity estimates, we replaced the reported validities with the corresponding composite or shrunken validities.

Footnote 5: We had to use shrinkage-adjusted Rs (instead of zero-order or unit-weighted composite correlations) for five job performance samples, one training performance sample, three CWB samples, and four turnover samples.

Corrected validities. We also report validity estimates corrected for measurement error in the criterion to estimate the operational validity of integrity tests. Supervisor or peer ratings represent the main way integrity test researchers measured job performance. A few studies also used ratings to measure CWB. Only three studies reported an estimate of interrater reliability. The mean estimate across these studies was .72. All three studies used two raters, so the estimates reflect the reliability of a performance measure based on the mean ratings of two raters. Using the Spearman–Brown formula, the mean single-rater reliability was .56. This value is highly similar to interrater estimates in the mid .50s to low .60s found in other meta-analyses involving job performance (e.g., H. R. Rothstein, 1990; Viswesvaran, Schmidt, & Ones, 2005). Our approach was to use reliability estimates from the studies within the meta-analysis whenever possible. Thus, for studies that reported an interrater estimate, we used the actual estimates in our analyses. For studies that did not report such an estimate, but did report the number of raters, we took the mean single-rater reliability estimate (.56) and used the Spearman–Brown formula to estimate reliability based on the number of individuals whose ratings contributed to the performance or CWB criterion. For studies that did not report the number of raters, we assumed a single rater and inputted the mean reliability estimate.
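The Spearman–Brown arithmetic used here (for raters, and later for items and measurement periods) is compact enough to show directly. A minimal Python sketch, with function names of our own choosing:

```python
def spearman_brown_up(r_single, k):
    """Reliability of the mean of k parallel measurements (raters, items, or periods)."""
    return k * r_single / (1 + (k - 1) * r_single)

def spearman_brown_down(r_composite, k):
    """Single-measurement reliability implied by a k-measurement composite."""
    return r_composite / (k - (k - 1) * r_composite)

# The mean two-rater interrater reliability in the primary studies was .72;
# stepping it down gives the single-rater value used in the corrections.
single_rater = spearman_brown_down(0.72, 2)
print(round(single_rater, 2))                     # -> 0.56

# For a study whose criterion averaged ratings from, say, three raters:
print(round(spearman_brown_up(single_rater, 3), 2))
```

The same two functions cover the later uses described below: stepping a mean internal-consistency estimate down to a single item and back up to a study's item count, and stepping a 1-month test–retest estimate up to a study's measurement period.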
None of the studies in our data set reported reliability estimates for training performance, and so we had to use estimates from other research. A meta-analysis by McNatt (2000) reported 11 internal consistency reliability estimates for training exams (see Table 1, p. 317), and we calculated the mean reliability estimate to be .81. Similarly, a meta-analysis by Taylor, Russ-Eft, and Chan (2005) reported mean internal consistency estimates of .76 and .85 for training tests that measured declarative knowledge and procedural knowledge and skills, respectively (mean α = .81). Thus, we used a reliability estimate of .81 for studies that used training exam scores or grade point average as a criterion. We could not find reliability estimates for instructor ratings of training performance. We therefore used the mean interrater reliability of .56 from the job performance criteria studies.

Self-reports are the primary way researchers measured CWB. Twenty-five studies reported an internal consistency reliability estimate for a global measure of CWB (mean α = .72). For studies that did not report a reliability estimate, we took the available reliability estimates and the number of items on which each estimate was based and used the Spearman–Brown formula to estimate the mean reliability for a single-item measure. We then used this estimate (i.e., .21) to calculate a reliability estimate for each study based on the number of items in the criterion for that study. For a few studies that did not report a reliability estimate or the number of items within the criterion measure, we used the mean reliability estimate. We also were able to cumulate validity evidence for three specific types of CWB: on-the-job substance abuse, theft, and withdrawal. Seven studies reported reliability estimates (alpha) for self-reported substance abuse (mean α = .50), 12 studies reported a reliability estimate for theft (mean α = .51), and four studies reported a reliability estimate for withdrawal (mean α = .80). As with the global CWB measures, we used these reliability estimates and the number of items on which each estimate was based to estimate the mean reliability for a single-item measure. We then used these estimates to calculate a reliability estimate for each study based on the number of items in the criterion for that study. We again used the mean reliability estimate for studies that did not report such an estimate or the number of items within the criterion measure.
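These criterion reliability estimates feed the standard Hunter–Schmidt correction for measurement error in the criterion only, which is what the "corrected validity" values in the tables reflect. A minimal Python sketch; the numbers are illustrative, not taken from any particular primary study:

```python
import math

def operational_validity(r_xy, r_yy):
    """Correct an observed validity for unreliability in the criterion only:
    divide the observed correlation by the square root of the criterion reliability."""
    return r_xy / math.sqrt(r_yy)

# Illustrative: an observed validity of .13 against single-rater performance
# ratings, using the mean single-rater reliability of .56 discussed above.
print(round(operational_validity(0.13, 0.56), 2))  # -> 0.17
```

Because each study receives its own criterion reliability estimate, the corrected means reported later need not equal the observed means divided by a single constant.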
Studies also used employee records to measure job performance and CWB. No studies reported reliability estimates for such measures, so we had to draw upon results from other research. A few studies used measures of employee productivity, such as sales, error rate, and accidents. A meta-analysis by Hunter, Schmidt, and Judiesch (1990) estimated the reliability of various productivity measures. They reported a mean reliability estimate of .83 for a 1-month period. Four of the five studies in our database that used a productivity criterion reported the period of measurement. For these studies, we took the reliability estimate from Hunter et al. and used the Spearman–Brown formula to derive a reliability estimate for each study. The mean reliability estimate was .99. We used this mean reliability estimate for one study that did not report the measurement period.

Other studies used records of employee absenteeism. A meta-analysis by Ones, Viswesvaran, and Schmidt (2003; using a subset of data from their original 1993 meta-analysis) determined the test–retest reliability of absence records. They identified 79 studies from the general absenteeism literature that reported test–retest information and the period for which the absence records were kept. Using the Spearman–Brown formula, the mean test–retest estimate for a 1-month period was .17. Eight of the 10 studies in our data set that used absence records as a criterion also reported the length of the measurement period. For these studies, we took the .17 estimate from Ones et al. (2003) and used the Spearman–Brown formula to derive a reliability estimate for each study. The mean estimate across the eight studies was .72. For the two studies that did not report the measurement period, we used this mean reliability estimate.

In addition, a few studies used records of detected theft. Unfortunately, we could not find any data concerning the reliability of such records. For these studies, we used the reliability estimate for self-reported theft (α = .51).

There were two criteria we did not correct for measurement error. First, a few studies used number of disciplinary actions as a criterion, but we could not find any information concerning the reliability of this type of measure. We view records of disciplinary actions as similar to records of turnover (discussed below) in that employees either were disciplined or they were not. Although some instances of discipline may fail to be recorded, this seems likely to be rare. Thus, we did not make any corrections for number of disciplinary actions. Second, to be consistent with prior research (e.g., Griffeth et al., 2000; Harrison, Newman, & Roth, 2006; Zimmerman, 2008), we did not correct relations between integrity tests and turnover for unreliability. However, as other researchers have done (e.g., Griffeth et al., 2000; Zimmerman, 2008), we adjusted all integrity–turnover correlations to reflect a 50–50 split between "stayers" and "leavers." This correction estimates what the maximum correlation would be if there was an "optimal" turnover base rate of .50. It also controls the potential influence of turnover base rate across studies, which, if not corrected, could falsely indicate the existence of moderator effects (Zimmerman, 2008). The mean turnover base rate across primary studies was .31. One primary study did not report the turnover base rate, and thus we used the mean value for this study.

We also made corrections when integrity test or criterion scores were artificially dichotomized (Schmidt & Hunter, 1998). Some test publishers transformed test scores into a dichotomous variable that reflected whether participants achieved a particular cutoff score, and other publishers dichotomized scores on the criteria (e.g., three or fewer absences vs. more than three absences). This practice can alter relations between the dichotomized variable and other variables depending on where in the score distribution the cut is made. For example, a cutoff on the integrity test could be selected that maximizes validity in the current sample, or a cutoff could be set a priori based on previous research with the integrity test or criterion measures (Sackett & Harris, 1984). Although publishers rarely described the rationale for use of a particular cutoff score, we decided to correct for dichotomization in such cases. For the one study that did not report the information needed to correct for dichotomization in the criteria, we corrected the validity estimates to reflect a 50/50 distribution of criterion scores.
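The authors cite Schmidt and Hunter (1998) for the dichotomization correction without showing the formula. The sketch below is the textbook correction for artificial dichotomization of a variable assumed to be normally distributed (divide the observed correlation by the normal ordinate at the cut over the square root of p(1 − p)); it is our illustration of that standard technique, not necessarily the authors' exact computation, and the example split proportion is invented.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def attenuation_from_split(p):
    """Attenuation of r when a variable (assumed normal) is split into
    proportions p and 1 - p: normal ordinate at the cut over sqrt(p * (1 - p))."""
    cut = NormalDist().inv_cdf(p)                 # z-value of the cut point
    ordinate = exp(-cut ** 2 / 2) / sqrt(2 * pi)  # standard normal density at the cut
    return ordinate / sqrt(p * (1 - p))

def correct_for_dichotomization(r_obs, p):
    """Estimate the correlation expected had the variable not been split."""
    return r_obs / attenuation_from_split(p)

# e.g., a criterion split so that 30% of employees fall into the "flagged" group.
print(round(correct_for_dichotomization(0.10, 0.30), 3))
```

At a 50/50 split the attenuation factor is about .80, so the correction multiplies the observed correlation by roughly 1.25; more extreme splits attenuate the correlation further and therefore receive larger corrections.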
Finally, we attempted to determine the likelihood and nature of range restriction within each primary study (Berry et al., 2007).⁶ Only seven studies in our sample (comprising 10 independent samples) reported the statistics necessary to estimate range restriction. Given the general lack of range restriction information in the primary studies, we chose not to correct for this artifact in our main analyses. However, we present some illustrative analyses later in the Discussion section.

Footnote 6: We identified five categories of studies with respect to range restriction. The largest category (61% of our sample) comprised studies that used incumbent samples for which the authors provided limited or no information concerning how employees originally were selected. Given this, the presence and nature of range restriction in these studies could not be determined. The second largest category (22% of our sample) comprised studies whose samples were subject to some form of direct range restriction because the integrity test originally was used (typically along with one or more other selection procedures) to select participants. However, because few, if any, of these studies used an integrity test as the sole basis for selection, these actually represent instances of indirect rather than direct range restriction (Hunter et al., 2006). The third range restriction category (9% of our sample) comprised studies in which the authors stated that the job incumbent participants were not selected on the basis of the integrity test. Therefore, the validity estimates from these studies are subject to indirect range restriction because of a possible correlation between integrity test scores and the procedure(s) on which incumbents originally were selected. The fourth category (4%) comprised studies that used an applicant sample, but the authors did not indicate how the applicants were selected, including whether the integrity test was used in the process. Finally, there were studies (4%) in which all applicants completed both an integrity test and the criterion (i.e., a self-report measure of prior CWB); hence, the resulting validity estimates are not subject to range restriction.

Results

We first report validity evidence for job performance criteria (and also training performance) and then for CWB criteria. We report these two sets of results separately, given that productive and counterproductive behaviors typically are considered separately in the literature (e.g., Ones et al., 1993). Furthermore, as noted, researchers have tended to use self-reports to measure CWB and supervisor or peer ratings to measure task and contextual behaviors. Thus, analyzing the results separately provides a clearer picture concerning how integrity tests relate to different criteria. We conclude by presenting validity evidence with respect to turnover.

Meta-Analysis Results for Job Performance Criteria

Overall validity evidence. Table 2 displays the meta-analytic validity estimates for integrity tests and measures of job performance. Across 74 independent samples, the overall, sample-size weighted mean observed validity was .13, and the mean validity corrected for unreliability in the criterion was .18. The 90% confidence interval (CI) for the corrected validity was .15–.20.
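For readers unfamiliar with how quantities such as the weighted mean validity, the percentage of variance attributable to statistical artifacts, and the credibility interval in the tables are produced, here is a bare-bones Hunter–Schmidt style aggregation in Python. The input values are invented; the article's own analyses additionally apply the study-specific criterion reliability and dichotomization corrections described earlier, and report several of these quantities on the corrected (ρ) metric rather than the observed-r metric used in this sketch.

```python
import numpy as np

def bare_bones_meta(rs, ns):
    """Sample-size-weighted mean validity, percent of observed variance
    attributable to sampling error, and an 80% credibility interval."""
    rs = np.asarray(rs, dtype=float)
    ns = np.asarray(ns, dtype=float)
    r_bar = np.average(rs, weights=ns)                    # weighted mean r
    var_obs = np.average((rs - r_bar) ** 2, weights=ns)   # observed variance of r
    var_err = (1 - r_bar ** 2) ** 2 / (ns.mean() - 1)     # expected sampling-error variance
    sd_residual = np.sqrt(max(var_obs - var_err, 0.0))
    pct_ve = 100.0 if var_obs == 0 else 100 * min(var_err / var_obs, 1.0)
    cred_80 = (r_bar - 1.28 * sd_residual, r_bar + 1.28 * sd_residual)
    return r_bar, pct_ve, cred_80

# Invented example: five observed validities and their sample sizes.
print(bare_bones_meta([.05, .10, .15, .20, .12], [150, 90, 300, 60, 200]))
```

When the residual standard deviation is near zero, the credibility interval collapses onto the mean, which is why several table rows with 100.0 in the % VE column show identical lower and upper credibility values.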
While reviewing studies from the publisher of a particular personality-based test, we noticed that most studies used a standard, publisher-developed criterion measure. Although the publisher referred to this as a measure of job performance, the measure also assessed CWB, including suspected theft, withdrawal behaviors (e.g., absenteeism), drug abuse, and antagonistic behaviors. Given this, we analyzed studies on this particular integrity test and compared the validity estimates based on criterion measures that appeared to reflect both productive and counterproductive behaviors (k = 24; n = 3,127) to validity estimates based on criterion measures that appeared to reflect productive behaviors only (k = 18; n = 6,223). The mean validities for these two groups of studies were .26 and .16, respectively. This finding is consistent with Hypothesis 4, which predicted that integrity tests would be more strongly related to CWB than to productive work behaviors.

We then reran the overall analysis excluding estimates based on criteria that included CWB. Some of these studies also reported a validity estimate based on a subset of ratings that reflected task or overall performance. For these studies, we replaced the validity estimates based on criteria that included CWB with validity estimates based on task or overall performance. The resulting mean observed and corrected validities were .12 and .15, respectively (k = 63, n = 11,995). These values may provide the best overall estimates of the relationship between integrity tests and measures of productive performance.

It has been suggested that studies that use predictive designs with applicant samples provide the best estimates of operational validity for integrity tests (e.g., Ones et al., 1993; Sackett & Wanek, 1996). Therefore, we also cumulated validity evidence for these studies separately and report the results in Table 3. There were 24 predictive-applicant studies in the data set, and the mean observed and corrected validities from these studies were .11 and .15, respectively. Considering the author affiliation moderator results discussed below, we also present validity estimates based on studies conducted by non-publishers. There were eight such studies for job performance, all of which were authored by non-publishers who did not develop the integrity test they examined.
The mean observed and corrected validity estimates from these studies were .03 and .04, respectively. Excluding an influential case decreased both the observed and corrected validities to −.01.⁷

Footnote 7: To identify potential influential studies, we used a modified version of the sample adjusted meta-analytic deviancy (SAMD) statistic (Beal, Corey, & Dunlap, 2002; Huffcutt & Arthur, 1995) available in Meta-Analysis Mark XIII, a Microsoft Excel-based program developed by Piers Steel. If exclusion of a study changed the original corrected validity estimate by 20% or more, we report the results with and without the influential study (Cortina, 2003).

Moderator analyses. Statistical artifacts accounted for 55.4% of the variance in corrected validity estimates for job performance, which indicates the possible existence of moderator variables. Beginning with our job performance-related hypotheses, Hypothesis 2 predicted larger validities for concurrent designs than for predictive designs, and Hypothesis 3 predicted larger validities for incumbent samples than for applicant samples. Although the results were consistent with Hypothesis 2, the difference in corrected validities between concurrent and predictive designs was small (.19 vs. .17). Hypothesis 3 received somewhat greater support in that corrected validities were somewhat larger for incumbent samples (.20) than for applicant samples (.15). However, the CIs around the corrected validities for these two types of samples overlapped slightly.

We also posed several research questions with respect to job performance criteria. Research Question 1 pertained to the type of integrity test. Validity estimates were somewhat larger for personality-based tests than for overt tests (.18 vs. .14), and excluding an influential case decreased the corrected validity for overt tests to .11. Research Question 2 focused on task versus contextual performance criteria. Results revealed slightly larger corrected validity estimates for task than for contextual performance (.16 vs. .14). The highly similar validity estimates for these two types of performance are consistent with other research (e.g., Hurtz & Donovan, 2000) suggesting that measures of personality-related constructs do not tend to demonstrate notably stronger relationships with contextual behaviors than with task behaviors.
Table 2
Meta-Analytic Estimates of Integrity Test Criterion-Related Validity for Job Performance and Training Performance

Criterion/analysis                        k       N      r     ρ   SDρ   % VE    90% CI       80% CV
Job performance
  Overall                                74  13,706    .13   .18   .08   55.4   .15, .20     .07, .28
  Without criteria that included CWB     63  11,955    .12   .15   .07   61.1   .13, .18     .06, .25
  Type of integrity test^a
    Overt                                18   2,213    .10   .14   .13   46.4   .07, .21    −.03, .30
    Without influential case^b           17   1,891    .08   .11   .11   56.0   .04, .17    −.04, .25
    Personality-based                    60  12,017    .14   .18   .07   64.1   .16, .21     .09, .27
  Study design^c
    Concurrent                           38   4,586    .16   .19   .09   60.4   .16, .24     .07, .32
    Predictive                           32   8,608    .12   .17   .09   49.9   .13, .20     .06, .27
  Study sample^d
    Incumbents                           47   6,191    .16   .20   .08   65.8   .17, .23     .10, .31
    Applicants                           24   7,104    .11   .15   .09   44.9   .11, .19     .04, .26
  Performance construct
    Task performance                     13   1,464    .13   .16   .00  100.0   .12, .21     .16, .16
    Contextual performance                8     799    .11   .14   .00  100.0   .07, .22     .14, .14
  Author affiliation
    Test publishers
      Computed validity                  45   5,946    .17   .21   .09   60.3   .18, .25     .09, .33
      Reported validity                  45   5,946    .22   .27   .13   43.3   .23, .31     .10, .44
    Non-publishers
      Overall                            25   3,247    .09   .12   .10   56.1   .07, .17    −.01, .25
      Developed integrity test            7     798    .15   .20   .09   66.8   .10, .29     .08, .31
      Did not develop integrity test     18   2,449    .07   .10   .09   58.4   .04, .15    −.02, .22
    Publishers and non-publishers         4   4,513    .12   .17   .00  100.0   .15, .18     .17, .17
  Publication status
    Published studies                    25   3,533    .12   .15   .10   53.4   .10, .20     .02, .28
    Unpublished studies                  49  10,173    .14   .18   .08   57.8   .16, .21     .09, .28
  Type of criterion^e
    Ratings of performance               73  13,517    .13   .18   .09   53.6   .15, .20     .06, .29
    Productivity records                  6     799    .15   .15   .06   54.7   .07, .22     .05, .25
Training performance
  Overall                                 8   1,530    .13   .16   .09   40.2   .08, .23     .05, .28
  Grades                                  5     824    .20   .23   .03   91.7   .16, .29     .19, .26
  Instructor ratings                      3     706    .05   .06   .07   61.3  −.05, .17    −.03, .15

Note. k = number of validity coefficients (ks for some moderator categories are larger or smaller than the overall k due to unique design features of particular studies that comprise these categories); r = sample-size weighted mean observed validity estimate; ρ = validity estimate corrected for measurement error in the criterion only; SDρ = standard deviation of ρ; % VE = percentage of variance in ρ accounted for by sampling error and measurement error in the criterion; 90% CI = lower and upper bounds of the 90% confidence interval for ρ; 80% CV = lower and upper bounds of the 80% credibility value for ρ; CWB = counterproductive work behavior. ^a Three studies (comprising four independent samples) reported separate validity estimates for both overt and personality-based tests. Thus, the total k for this moderator analysis is larger than the k for the overall analysis. ^b See Footnote 7 regarding identification of influential cases. ^c Results of two studies are based on a combination of concurrent and predictive designs, and two studies did not clearly specify the type of design used. These four studies were excluded from this moderator analysis. ^d Results of three studies are based on both incumbents and applicants and thus were excluded from this moderator analysis. ^e Five studies reported separate validity estimates for both performance ratings and a productivity measure. Thus, the total k for this moderator analysis is larger than the k for the overall analysis.

Research Question 3 pertained to the possible influence of author affiliation. Corrected validity estimates were larger for studies authored by test publishers (.21) than for studies authored by non-publishers (.12).
However, this test-publisher estimate is based on some validities for which we computed a unit-weighted composite validity or adjusted the reported validity for shrinkage (i.e., for studies that used multiple items or subscales of an integrity test as predictors). When we used the validity estimates the test publishers originally reported, the difference between corrected validities from publishers and non-publishers was .27 versus .12. Moreover, corrected validities from studies conducted by non-publishers who developed the integrity test they examined (.20) were larger than validity estimates from non-publishers who did not develop the test (.10), although the two sets of CIs overlapped to some extent. Finally, the mean corrected validity estimate from studies authored by both test publishers and non-publishers was .17. Overall, validity evidence reported by test publishers and test developers tended to be somewhat more optimistic than validity evidence reported by non-publishers who did not develop the integrity test.

Research Question 4 explored potential validity differences between published and unpublished studies. Interestingly, published studies were associated with slightly smaller corrected validity estimates (.15) than were unpublished studies (.18). Finally, we also separated validity estimates by type of criterion measure and found slightly larger mean validities for performance ratings than for productivity measures (.18 vs. .15).

Table 3
Meta-Analytic Estimates of Integrity Test Criterion-Related Validity From Studies Using Predictive Designs, Applicant Samples, and Non-Self-Report Criterion Measures

Criterion/analysis              k       N      r     ρ   SDρ   % VE    90% CI       80% CV
Job performance
  Overall                      24   7,104    .11   .15   .09   44.9   .11, .19     .04, .26
  Non-publishers only           8     928    .03   .04   .11   54.8  −.06, .14    −.10, .19
  Without influential case^a    7     735   −.01  −.01   .08   70.5  −.10, .09    −.12, .10
Training performance
  Overall                       5     962    .05   .07   .00  100.0   .01, .12     .07, .07
  Non-publishers only           4     782    .06   .08   .00  100.0   .02, .14     .08, .08
CWB
  Overall                      10   5,056    .09   .11   .02   76.0   .08, .14     .08, .14
  Non-publishers only           2     340    .13   .13   .09  100.0   .05, .22     .12, .14
  Theft^a                       3   1,481    .03   .04   .03   76.1  −.02, .11     .00, .09
  Withdrawal behaviors^b        5   5,873    .12   .15   .00   42.2   .11, .19     .09, .20
Turnover
  Overall                      13  22,647    .06   .09   .05   20.1   .06, .11     .03, .15
  Without influential case     12   4,652    .11   .16   .06   38.4   .12, .20     .08, .24
  Non-publishers only           5   2,407    .08   .15   .07   28.5   .09, .21     .06, .24

Note. k = number of validity coefficients; r = sample-size weighted mean observed validity estimate; ρ = validity estimate corrected for measurement error in the criterion only; SDρ = standard deviation of ρ; % VE = percentage of variance in ρ accounted for by sampling error and measurement error in the criterion; 90% CI = lower and upper bounds of the 90% confidence interval for ρ; 80% CV = lower and upper bounds of the 80% credibility value for ρ; CWB = counterproductive work behavior. ^a No studies in this category were conducted by non-publishers. ^b See Footnote 7 regarding identification of influential cases.

We then conducted a weighted least squares (WLS) multiple regression analysis to examine relations among the moderator variables and validity estimates (see Steel & Kammeyer-Mueller, 2002).
The observed validity coefficients from the 74 primary studies within this category served as the dependent variable. The moderators served as the independent variables, which we binary-coded (i.e., 0 vs. 1) to represent the two levels of each moderator (see the note to Table 4 for details regarding these codes). We also included the year the study was published as an additional (continuous) predictor. We did not include the task versus contextual performance moderator, as this distinction was examined in only a subset of the studies. Finally, we weighted each study by the inverse of the sampling error variance, such that studies with less sampling error received greater weight than studies with more sampling error (Hedges & Olkin, 1985; Steel & Kammeyer-Mueller, 2002).

Table 4 displays the moderator intercorrelations and WLS regression results. Study design, which was not a sizeable or statistically significant predictor of validity within the regression model, correlated .78 with study sample and appeared to produce multicollinearity effects when included in the model. Thus, to more clearly interpret the effects of the other moderators, we excluded this variable from the final model. In addition, one test-publisher-authored study (n = 87, r = .66) emerged as an influential case (Cook's D = 0.61 vs. a mean D of 0.02 across the remaining primary studies), and we chose to exclude this study from the final model. However, the results with and without the study sample moderator and the one influential case were not drastically different than the results reported herein.

Table 4
Results of Weighted Least Squares Regression of Integrity Test–Job Performance Validity Estimates on Coded Moderators

Correlations among moderators and validity estimates

Variable                        1       2       3       4       5       6       7      8
1. Validity                     —
2. Type of integrity test      .08      —
3. Study design                .10     .22*     —
4. Study sample                .28**   .19*    .78**    —
5. Type of criterion           .02     .10    −.34*   −.26*     —
6. Author affiliation          .43**  −.01    −.11    −.04    −.22*     —
7. Publication status          .01     .43*    .21*    .28**   .15    −.35**    —
8. Year of publication        −.40**   .27*   −.01    −.11     .30**  −.52**   .37**   —

Multiple regression analysis results^a

Variable                      B      SE     90% CI        β       t
Type of integrity test       .01    .04    −.06, .08     .03    0.24
Study sample                 .07    .03     .02, .12     .29    2.57*
Type of criterion            .13    .06     .03, .23     .25    2.28*
Author affiliation           .12    .04     .05, .19     .38    3.04**
Publication status           .03    .04    −.04, .10     .12    0.96
Year of publication         −.01    .00    −.01, .00    −.31   −2.34*
F(6, 61) = 6.21**, R = .62, R² = .38

Note. N = 73 independent samples (excludes one influential sample). Both correlation and regression analyses are based on primary study results weighted by the inverse of the sampling error variance. Validity = observed validity coefficient between integrity test scores and job performance; B = unstandardized regression coefficient; SE = standard error of B; 90% CI = lower and upper bounds of the 90% confidence interval for B; β = standardized regression coefficient. Type of integrity test was coded 0 for personality and 1 for overt. Study design was coded 0 for predictive and 1 for concurrent. Study sample was coded 0 for applicants and 1 for incumbents. Type of criterion was coded 0 for objective performance measures and 1 for subjective performance measures (i.e., ratings). Author affiliation was coded 0 for non-publishers and 1 for test publishers. Publication status was coded 0 for unpublished and 1 for published. ^a Study design was excluded from the final regression analysis because of collinearity with other predictors. * p < .05. ** p < .01.

As a group, the moderators accounted for 38% of the variance in observed validities (R = .62). Four moderators emerged as sizeable (and statistically significant) individual predictors within the regression model. Study sample was related to validity (β = .29), such that incumbent samples were associated with larger validities than applicant samples. Type of criterion was related to validity (β = .25), such that ratings of performance were associated with larger validities than objective measures. Author affiliation was related to validity (β = .38), such that validities were larger in studies authored by test publishers than by non-publishers. Finally, year of publication was related to validity (β = −.31), such that older studies tended to report larger validities than did more recent studies.
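A short sketch of this kind of inverse-variance-weighted moderator regression in Python, using statsmodels' WLS as a stand-in for the Excel-based program the authors mention. The study-level data, the subset of moderators, and the column names are invented for illustration; only the weighting scheme (inverse of the sampling error variance of r) and the 0/1 coding idea follow the description above.

```python
import pandas as pd
import statsmodels.api as sm

# Invented study-level data: observed validity r, sample size N, and 0/1
# moderator codes mirroring the scheme in the note to Table 4 (subset of
# moderators shown for brevity).
studies = pd.DataFrame({
    "r":          [.10, .22, .05, .18, .30, .08, .14, .02],
    "N":          [120, 300,  80, 150,  90, 400, 220,  60],
    "incumbents": [  1,   1,   0,   1,   0,   0,   1,   0],   # study sample
    "publisher":  [  0,   1,   0,   1,   1,   0,   1,   0],   # author affiliation
    "year":       [1991, 1985, 2003, 1996, 1989, 2001, 1993, 2005],
})

# Weight each study by the inverse of the sampling-error variance of r,
# so studies with less sampling error receive greater weight.
sampling_var = (1 - studies["r"] ** 2) ** 2 / (studies["N"] - 1)
weights = 1.0 / sampling_var

X = sm.add_constant(studies[["incumbents", "publisher", "year"]])
fit = sm.WLS(studies["r"], X, weights=weights).fit()
print(fit.summary())
```

Standardized coefficients comparable to the β values in Table 4 can be obtained by z-scoring the predictors and the observed validities before fitting.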
Meta-Analysis Results for Training Performance Criteria

There were eight independent samples for the relationship between integrity tests and performance during training (see Table 2). The overall mean observed validity was .13, and the mean validity corrected for unreliability in the criterion was .16 (90% CI [.08, .23]). We also separated the validity estimates by type of criterion and found that corrected validities were larger for training grades (.23) than for instructor ratings (.06). In addition, we estimated validity based on predictive studies with job applicants (see Table 3). The observed and corrected validities based on the results of these five studies were .05 and .07, respectively. The corresponding validities for the four studies authored by non-publishers (none of whom developed the integrity test) were .06 and .08. Of course, all the training performance results need to be interpreted with caution given the small number of samples on which they are based.⁸

Footnote 8: Given the small number of samples available for training performance, we did not examine additional potential moderators of validity for this criterion.

Meta-Analysis Results for CWB Criteria

Overall validity evidence. Table 5 presents validity evidence for integrity tests and CWB. Across 65 independent samples, the mean observed validity estimate was .26, and the mean validity estimate corrected for unreliability in the criterion was .32 (90% CI [.27, .35]). These results provide additional support for Hypothesis 4, which predicted that integrity tests would be more strongly related to CWB than to productive work behaviors. Indeed, the validity estimates for CWB criteria were approximately two times larger than the overall observed and corrected validity estimates for performance criteria that did not include CWB (.12 and .15). However, as we describe below, the source of CWB criteria (i.e., self-reports vs. other-reports and employee record…