Design and Validation of a Computerized Adaptive Test of English Proficiency
Abstract
1. Research Background
     With the continuing development of computer technology and measurement theory, building large item banks for language tests and using them to deliver computerized adaptive language testing (CALT) has become a prominent topic in language testing research abroad in recent years. Computerized adaptive testing emerged in the mid-1980s but was not genuinely applied to language testing until the late 1980s. Compared with traditional paper-and-pencil language testing (PPLT) and ordinary computer-based language testing (CBLT), CALT offers the following advantages, among others: 1) high test reliability and efficiency; 2) immediate feedback; 3) greater test security; 4) a high degree of individualization.
     The main theoretical foundation of CALT is item response theory (IRT). IRT is a family of mathematical models that describe the relationship between examinees' item responses and their latent ability. Its greatest advantage is the independence of item data and sample data: item parameter estimates do not depend on the particular sample to which the items are administered, and ability estimates do not depend on the particular items administered. Consequently, examinees' abilities can be estimated and compared directly even when they receive different items during the test, a property that has greatly facilitated the design and application of CALT.
     Depending on the scoring scheme, IRT models fall into dichotomous and polytomous families. In dichotomous models, an examinee's score on an item can only be 0 or 1; this family includes the one-parameter logistic model (1PLM), the two-parameter logistic model (2PLM), and the three-parameter logistic model (3PLM). In polytomous models, an item score can take several values such as 0, 1, and 2; common polytomous models include the graded response model (GRM), the partial credit model (PCM), and the generalized partial credit model (GPCM).
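     As a concrete illustration (these formulas are standard and are not spelled out in the abstract), the 2PLM gives the probability of a correct answer to item $j$ at ability $\theta$, and the GPCM extends the same logistic form to an ordered score $x \in \{0, 1, \dots, m_j\}$:

$$P_j(\theta)=\frac{1}{1+\exp[-a_j(\theta-b_j)]},\qquad P_{jx}(\theta)=\frac{\exp\sum_{k=0}^{x}a_j(\theta-b_{jk})}{\sum_{c=0}^{m_j}\exp\sum_{k=0}^{c}a_j(\theta-b_{jk})},$$

where $a_j$ is the item discrimination, $b_j$ the difficulty, $b_{jk}$ the $k$-th step difficulty, and the $k=0$ term of each sum is defined as zero. The 1PLM constrains all $a_j$ to be equal, and the 3PLM adds a guessing parameter $c_j$.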
     The basic assumptions of IRT are unidimensionality and local independence. Unidimensionality means that all items on a test measure the same single ability. Although the unidimensionality of language ability has long been debated in the language testing literature, a widely accepted current view is that unidimensionality is a matter of degree rather than an all-or-nothing property. Local independence means that an examinee's probabilities of answering the items correctly are mutually independent: latent ability is the only factor affecting responses, and once this factor is conditioned on, responses to different items are unrelated. In large-scale English tests, however, this assumption is frequently violated, because a common item format presents several multiple-choice questions based on the same passage. When local independence is violated, applying a standard dichotomous IRT model not only produces model-data misfit but also inflates the estimated item discriminations and, in turn, the estimated test information, i.e., the measurement precision. An effective remedy is to use a polytomous IRT model: the items based on the same passage are treated as a single unit, the examinee's scores on them are summed into one polytomous item, and a polytomous IRT model is used for parameter estimation (this remedy is stated formally at the end of this section).
     Beyond IRT, the success of a CALT depends mainly on the functioning of its four major components: the item pool, item selection, ability estimation, and the stopping rule. Domestic research on CALT has so far largely been limited to literature reviews or brief introductions, and only a handful of studies have carried out empirical work on CALT model design. Empirical CALT research abroad is comparatively richer, focusing on CALT model design and validation. On the design side, the great majority of CALTs cover vocabulary, grammar, and reading; only a few involve listening, because the audio component makes CALT development considerably more complex. Moreover, design-oriented studies at home and abroad have concentrated on item pool construction, since a large, high-quality item pool is the precondition for a CALT to run successfully. Even so, previous work on CALT item pools has four limitations: 1) most pools contain only independent items, each based on its own text, whereas in operational language tests, especially of listening and reading, the most widely used format is several items based on a shared passage; 2) although most pools comprise several components such as vocabulary, grammar, and reading, few studies have examined how much these components affect the unidimensionality of the pool as a whole; 3) in model selection, previous pool construction has relied heavily on the Rasch model, and few studies have chosen the best model from a set of theoretically plausible alternatives on the basis of model-data fit; 4) although differential item functioning (DIF) poses a serious threat to the validity and fairness of a CALT, no study to date has addressed the detection and removal of DIF items during pool construction.
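     Stated compactly (a standard formulation added here for clarity), local independence requires the joint response distribution to factor once $\theta$ is given, and the testlet remedy replaces the $n_t$ dichotomous items of passage $t$ by their summed score $S_t$, which is then calibrated with a polytomous model such as the GRM or GPCM:

$$P(X_1=x_1,\dots,X_n=x_n\mid\theta)=\prod_{j=1}^{n}P(X_j=x_j\mid\theta),\qquad S_t=\sum_{j\in t}X_j\in\{0,1,\dots,n_t\}.$$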
     With respect to validation, previous research has focused on three main issues: the equivalence of CALT and PPLT, the effect of computer familiarity, and the invariance of the CALT construct across male and female examinees.
     Regarding the equivalence of CALT and PPLT, the core point of contention is whether such equivalence needs to be guaranteed at all. Most researchers hold that, whether CALT coexists with PPLT or replaces it, test developers and users should ensure the equivalence of the two modes. A more recent view, however, argues that in a globalized era in which much communication takes place via computer, the definition of language ability should incorporate the interaction between an individual's language ability and computer-mediated communicative contexts. On this view, guaranteeing the equivalence of CALT and PPLT is not only impractical but also unnecessary, because the two modes measure two different constructs. It must be noted, though, that test developers do need to ensure the equivalence of CALT and CBLT: the adaptivity of a CALT must not change the construct it measures relative to a CBLT. Nevertheless, existing studies have concentrated on CALT-PPLT and CBLT-PPLT equivalence, and no study has yet examined the equivalence of CALT and CBLT.
     Regarding computer familiarity, most studies have examined the extent to which it affects performance on CBLT, with mixed results. No study has directly examined its effect on CALT scores, even though computer familiarity may affect examinees differently in the two modes. Methodologically, previous studies have been confined to basic statistical techniques (t-tests, analysis of variance, regression, etc.); none has used more advanced techniques such as structural equation modeling to examine how much computer familiarity affects the construct measured by a CALT and, thereby, examinees' performance on it.
     Regarding the invariance of the CALT construct across gender groups, the central question is whether computer familiarity affects male and female examinees' CALT performance differently. Since previous research suggests that males tend to have higher computer proficiency than females, computer familiarity may affect the two groups' CALT performance to different degrees and thereby threaten test validity and fairness. To date, however, no study has examined whether the factor structure linking computer familiarity and the language ability measured by a CALT is invariant across gender groups.
2. Research Objectives
     Against the background reviewed above, this study had two objectives: 1) to construct a listening and reading item pool for computerized adaptive testing; 2) to design a computerized adaptive language test and validate it within the assessment use argument (AUA) framework.
3. Research Questions
     For the first objective, the study addressed the following four questions:
     1) To what extent do the listening and reading items satisfy the local independence assumption?
     2) To what extent do the listening and reading items satisfy the unidimensionality assumption?
     3) Of the two polytomous IRT models, the GRM and the GPCM, which is better suited to the analysis of passage-based items?
     4) To what extent do the listening and reading items exhibit gender-related differential item functioning?
     For the second objective, the study addressed the following three questions:
     1) To what extent do the CALT and the CBLT measure the same language ability?
     2) To what extent does computer familiarity affect the construct measured by the CALT?
     3) To what extent is the factor structure of the CALT invariant across male and female examinee groups?
4. Empirical Studies
4.1 Study 1
     The main purpose of Study 1 was to construct a large item pool for computerized adaptive testing comprising four item types: listening comprehension of short conversations, listening comprehension of long conversations, listening comprehension of passages, and reading comprehension of passages. In content, the pool covers society, culture, education, economics, popular science, and other topics.
4.1.1 Research Method
     All items entering the pool were pretested by administering a CBLT to samples comparable in ability to the target population. To place all item parameters on a common scale, anchor items were used to equate the items across the different CBLT forms, and each item was pretested on a sample of about 550 to reduce parameter estimation error. After pretesting, the data were analyzed as follows: 1) the local independence assumption was tested with IRTPRO 2.1; 2) exploratory factor analysis was conducted in SPSS 18.0 and confirmatory factor analysis in AMOS 7.0 to test the unidimensionality assumption; 3) dichotomous items were analyzed with the 2PLM in IRTPRO 2.1 and polytomous items with the GRM and the GPCM, and the best model was then selected for parameter estimation according to overall and item-level model-data fit; 4) gender DIF was examined with IRTPRO 2.1 and SIBTEST, and items flagged for gender DIF were submitted to content analysis to decide whether they should be removed from the pool.
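     The abstract does not name the local-dependence index used; IRTPRO reports the LD-X² statistic of Chen & Thissen (1997), and Yen's Q3, used for the same purpose in Lee (2004, cited below), conveys the idea compactly. The following is a minimal sketch that computes Q3 for every item pair under an already-estimated 2PLM calibration; the function names are illustrative assumptions, not the dissertation's implementation:

```python
import numpy as np

def two_pl(theta, a, b):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def q3_matrix(responses, theta, a, b):
    """Yen's Q3 local-dependence statistic for every item pair.

    responses : (n_persons, n_items) 0/1 matrix
    theta     : (n_persons,) ability estimates
    a, b      : (n_items,) 2PLM discriminations and difficulties
    A large positive Q3 between items based on the same passage
    signals a violation of local independence.
    """
    expected = two_pl(theta[:, None], a[None, :], b[None, :])
    residuals = responses - expected
    # Q3 is the correlation of item residuals across persons.
    return np.corrcoef(residuals, rowvar=False)
```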
4.1.2 Results and Discussion
     The study found the following. 1) Items based on the same passage violated the local independence assumption, so such items should be treated as a single unit, i.e., one polytomous item. 2) Exploratory factor analysis yielded no definitive conclusion about unidimensionality, because different criteria pointed in opposite directions; confirmatory factor analysis showed that a second-order factor model fit significantly better than a first-order model, i.e., there are separate listening and reading factors, both governed by a higher-order language ability factor. This result supports the unidimensionality assumption while also indicating that listening and reading items each have their own distinctiveness, so IRT analyses should be run separately for the two sections rather than pooling them. 3) For both the listening and the reading sections, the GPCM showed better overall and item-level model-data fit than the GRM, so the GPCM was adopted as the final model for calibrating the polytomous items; on theoretical and practical grounds, the 2PLM was used for the dichotomous items. A few items that misfit the model or whose parameter estimates fell outside acceptable ranges were deleted. 4) The two DIF detection methods did not flag the same items, and content analysis could not clearly explain why some flagged items functioned differentially. To minimize the threat of DIF to the fairness of the CALT, all items potentially exhibiting DIF were removed, about 12.5% of the total.
4.1.3 Summary
     Through the pretesting and calibration procedures described above, 258 items, each carrying parameters such as difficulty, discrimination, and topic category, entered the pool. By item type, these 258 items were placed into four sub-pools. Except for the short-conversation listening sub-pool, the total information of each sub-pool was distributed rather flatly across ability levels, indicating similar measurement precision at different ability levels. Since the CALT designed in this study is a norm-referenced test, which should measure with comparable precision across the ability range, the pool meets this practical requirement.
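     The connection between pool information and precision assumed above is the standard one: test information is the sum of item information (which has a closed form under the 2PLM), and the standard error of the ability estimate is its inverse square root,

$$I(\theta)=\sum_j I_j(\theta),\qquad I_j(\theta)=a_j^2\,P_j(\theta)\bigl(1-P_j(\theta)\bigr),\qquad \mathrm{SE}(\hat\theta)=\frac{1}{\sqrt{I(\theta)}},$$

so a flat information function implies a roughly constant standard error across the ability continuum.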
4.2 Study 2
     The main purpose of Study 2 was to design a computerized adaptive language test and to validate it using the assessment use argument (AUA). The rationale for adopting the AUA as the theoretical framework is that, compared with alternative frameworks, the AUA links the important qualities of a language test through explicit claims and warrants, providing a more systematic framework and set of operational steps.
4.2.1 Research Method
     The study proceeded in three steps: CALT design, CALT simulation, and CALT administration and validation, the third being the core of the study.
     For the CALT design, the following choices were made regarding test sequence, item selection, ability estimation, and stopping rules: 1) the test proceeds in the order short listening conversations, long listening conversations, listening passages, reading passages; the short-conversation section starts with items of medium difficulty, and the ability estimate obtained there is carried directly into the long-conversation and listening-passage sections, but not into the reading section, where every examinee's initial ability is set to 0; 2) items are selected by maximum information (MI), supplemented by content balancing and exposure control; 3) ability is estimated by the expected a posteriori (EAP) method; 4) the test stops when either a standard-error criterion or a maximum test length is reached (a sketch of such a loop follows).
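     The following is a minimal sketch of such an adaptive loop, assuming a 2PLM-calibrated pool larger than the maximum test length and a standard-normal prior for EAP; it omits the content balancing and exposure control mentioned above, and the names (run_cat, answer, etc.) are illustrative rather than the dissertation's implementation:

```python
import numpy as np

# Quadrature grid for EAP under a standard-normal prior.
GRID = np.linspace(-4.0, 4.0, 81)
PRIOR = np.exp(-0.5 * GRID**2)
PRIOR /= PRIOR.sum()

def p2pl(theta, a, b):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """Fisher information of one 2PLM item at ability theta."""
    p = p2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

def eap(items, scores, a, b):
    """Expected a posteriori (EAP) ability estimate and posterior SD."""
    likelihood = np.ones_like(GRID)
    for j, x in zip(items, scores):
        p = p2pl(GRID, a[j], b[j])
        likelihood *= p**x * (1.0 - p)**(1 - x)
    posterior = likelihood * PRIOR
    posterior /= posterior.sum()
    mean = (GRID * posterior).sum()
    sd = np.sqrt(((GRID - mean)**2 * posterior).sum())
    return mean, sd

def run_cat(answer, a, b, se_target=0.45, max_items=30):
    """Maximum-information CAT with dual (SE / test-length) stopping rules.

    answer(j) -> 0 or 1 is the examinee's response to item j.
    """
    administered, scores = [], []
    theta, se = 0.0, np.inf  # start near medium difficulty
    while se > se_target and len(administered) < max_items:
        candidates = [j for j in range(len(a)) if j not in administered]
        nxt = max(candidates, key=lambda j: info_2pl(theta, a[j], b[j]))
        administered.append(nxt)
        scores.append(answer(nxt))
        theta, se = eap(administered, scores, a, b)
    return theta, se, administered
```

A simulated examinee with true ability theta_true can be run as, e.g., run_cat(lambda j: np.random.binomial(1, p2pl(theta_true, a[j], b[j])), a, b); with se_target = 0.45 the SE rule corresponds to the 0.8 reliability target reported in the next paragraph.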
     For the CALT simulation, four simulated CALT runs, one per sub-pool, were conducted with Firestar and R under the design above. The results showed that a CALT so configured needs only about 50% of the CBLT test length to reach a measurement reliability of at least 0.8 in both the listening and the reading sections.
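     The reliability criterion can be restated in standard-error terms: with ability scaled to unit variance, marginal reliability is approximately $\rho\approx 1-\mathrm{SE}^2$ (a standard approximation, not stated in the abstract), so a target of 0.8 corresponds to stopping once $\mathrm{SE}\le\sqrt{0.2}\approx 0.45$.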
     For the CALT administration and validation, 416 non-English-major students who had previously taken the CBLT took the CALT; of these, 289 both completed a questionnaire on computer familiarity and reported their most recent CET-4 score (December 2011). The data were analyzed as follows: 1) paired t-tests in SPSS 18.0 and confirmatory factor analysis in AMOS 7.0 to examine whether the CALT and the CBLT measure the same construct; 2) structural equation modeling in AMOS 7.0 relating computer familiarity, CET-4 scores, and CALT scores, to examine the relationship between the construct measured by the CALT on the one hand and computer familiarity and the construct measured by conventional paper-and-pencil tests on the other; 3) multi-group structural equation modeling in AMOS 7.0 to examine whether the factor structure of the CALT is invariant across male and female examinees.
4.2.2 Results and Discussion
     The study found the following. 1) The paired t-test showed that examinees scored slightly lower on the CALT than on the CBLT; as the interviews indicated, this is attributable to minor remaining flaws in the CALT design, e.g., the "Next" and "Submit" buttons occupied the same position on the interface, so a nervous examinee who clicked the mouse three times in succession skipped items unanswered, lowering the score. Confirmatory factor analysis showed that the CALT and the CBLT measure the same construct, indicating that the adaptivity of the CALT did not change the construct measured and providing initial evidence that interpretations of the ability measured by the CALT are meaningful. 2) Computer familiarity bore no significant relation to the construct measured by the CALT, whereas scores on the conventional paper-and-pencil test strongly predicted CALT performance, indicating that interpretations of the ability measured by the CALT are meaningful and generalizable. 3) The factor structure of the CALT was invariant across male and female examinees, indicating that interpretations of the ability measured by the CALT do not differ by gender, i.e., the test is fair to both groups.
4.2.3 Summary
     These results indicate that interpretations of the ability measured by the CALT are meaningful, generalizable, and impartial; in other words, they support the validity of the CALT. It should be noted, however, that the validation here stops at the stage of score interpretation; a more comprehensive validation should also address the fairness of decisions based on CALT scores and the consequences of CALT use for teaching and for society at large.
5. Conclusion
     This study not only designed a CALT that employs multiple item types and covers both listening and reading, filling a gap in domestic language testing research, but also made the first attempt in the language testing field at home or abroad to validate a CALT in a relatively systematic way. At a time when computer technology is genuinely transforming language testing, the study is significant both theoretically and practically.
     Theoretically, the validation conducted here is no longer confined to the equivalence of CALT and PPLT; it advocates a better understanding of the construct measured by a CALT and of how computer familiarity and language ability relate to that construct. Second, the study applies the AUA framework to score interpretation, which should promote wider adoption of argument-based validation in language testing.
     Practically, the study documents the concrete steps of CALT development and identifies the shortcomings of previous work on item pool construction, expanding the knowledge base of CALT development and offering empirical evidence to institutions considering building CALT systems. Second, its examination of how computer familiarity affects CALT performance can help CALT developers and users anticipate problems in CALT administration more proactively, understand and interpret them better, and use CALT results more appropriately.
6. Limitations and Future Directions
     Admittedly, this study has limitations, and future research could proceed along the following lines: 1) although combining the dichotomous items of a passage into one polytomous item and applying a polytomous IRT model solved the violation of local independence, this approach loses item-level information; future research could estimate item parameters with multidimensional IRT models such as the bi-factor model or the testlet response theory model, thereby preserving item-level information (a sketch of the latter follows this list); 2) although analyzing the listening and reading items separately affirmed the distinctiveness of the two skills, separate analyses reduce the number of items calibrated together and thus somewhat increase estimation error; future research could adopt the two-tier full-information item factor analysis model to realize a genuinely multidimensional CALT; 3) this study examined the invariance of the CALT factor structure only across gender; future research could consider examinees' major, region of origin, and so on; 4) the validation here was mainly quantitative; future research could use qualitative methods to study examinees' cognitive processes during a CALT and thereby probe its validity further; 5) the CALT designed here covers only listening and reading; as automated scoring technology develops, future research could design CALTs that assess language ability more comprehensively, including writing and speaking.
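     For reference, the testlet response theory model mentioned in point 1) keeps every item while absorbing within-passage dependence into a person-by-testlet effect (this is the 2PLM version of Bradlow, Wainer & Wang, 1999, cited below; it is not developed in this study):

$$P(X_{ij}=1\mid\theta_i)=\frac{1}{1+\exp[-a_j(\theta_i-b_j-\gamma_{i\,g(j)})]},$$

where $g(j)$ is the testlet containing item $j$; setting all $\gamma_{i\,g(j)}=0$ recovers the 2PLM, so the item-level information lost by score collapsing is retained.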
In this study, an attempt was made to design a computerized adaptive language test (CALT) to assess listening and reading proficiency in English, using a mixed format calibrated with dichotomous and polytomous item response theory (IRT) models, and to investigate validity issues of the CALT under the assessment use argument (AUA) framework.
     In order to construct an item pool for the CALT, Study 1 was carried out, in which 8,203 test-takers' item-level responses to 15 different forms of a computer-based language test (CBLT) were used for item calibration and differential item functioning (DIF) detection. The results indicated that: 1) the item pool was supported by sufficient unidimensionality when passage-based items were grouped together as polytomous items; 2) the constructs tapped by the listening and reading sections were distinct from each other, suggesting the need for separate IRT calibrations; 3) the generalized partial credit model (GPCM) fit the data of passage-based polytomous items better than the graded response model (GRM); 4) approximately 12.5% of the items were identified as showing statistically and practically significant gender DIF. The item pool constructed in this way had a relatively flat scale information function, implying that the item pool provided equal precision of measurement for test takers along the ability continuum.
     The item pool was then combined with the other three components (item selection procedure, ability estimation method, and stopping rule) to develop a CALT system. A total of 416 test takers drawn from the same sample as Study 1 took the CALT in Study 2. Drawing on test-takers' scores on the CALT, the CBLT, and the CET-4, as well as their self-ratings of computer familiarity, Study 2 investigated the validity issues of the CALT by examining its factor structure using confirmatory factor analysis (CFA), structural equation modeling (SEM), and multi-group SEM. The results provided strong support for the validity of the CALT, with evidence regarding the equivalence of the CALT and the CBLT, English ability as the major factor measured by the CALT, and the factorial invariance of the CALT across male and female subgroups. These findings suggest the meaningfulness, impartiality, and generalizability of the score-based interpretations of the CALT desired by the test developers.
References
Abbott, Marilyn L. (2007). A Confirmatory Approach to Differential Item Functioning on an ESL Reading Assessment. Language Testing 24(1): 7-36.
    Ackerman, Terry A. (1987). The Robustness of LOGIST and BILOG IRT Estimation Programs to Violations of Local Independence. ACT Research Report Series, 87-14. Iowa City, IA: American College Testing.
    Akaike, Hirotugu (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19: 716-723.
    Alderson, Charles (2000). Technology in Testing: the Present and the Future. System 28: 593-603.
    Anderson, James C. & David W. Gerbing (1988). Structural Equation Modeling in Practice: A Review and Recommended Two-Step Approach. Psychological Bulletin 103(3): 411-423.
    Arbuckle, James L. (2006). AMOS 7.0 (Build 1140). Spring House, PA: Amos Development Corporation.
    Bachman, Lyle F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
    Bachman, Lyle F. (2000). Modern Language Testing at the Turn of the Century: Assuring that What We Count Counts. Language Testing 17(1): 1-42.
    Bachman, Lyle F. (2003). Constructing an Assessment Use Argument and Supporting Claims about Test Taker-assessment Task Interactions in Evidence-centered Assessment Design. Measurement: Interdisciplinary Research and Perspectives 7(1): 63-65.
    Bachman, Lyle F. (2004). Statistics for Language Assessment. Cambridge: Cambridge University Press.
    Bachman, Lyle F. (2005). Building and Supporting a Case for Test Use. Language Assessment Quarterly: An International Journal 2(1): 1-34.
    Bachman, Lyle F. & Adrian S. Palmer (1981). The Construct Validation of the FSI Oral Interview. Language Learning 31: 67-86.
    Bachman, Lyle F. & Adrian S. Palmer (1982). The Construct Validation of Some Components of Communicative Proficiency. TESOL Quarterly 16: 449-465.
    Bachman, Lyle F. & Adrian S. Palmer (1996). Language Testing in Practice. Oxford: Oxford University Press.
    Bachman, Lyle F. & Adrian S. Palmer (2010). Language Assessment in Practice. Oxford: Oxford University Press.
    Bachman, Lyle F., Fred Davidson & John Foulkes (1990). A Comparison of the Abilities Measured by the Cambridge and Educational Testing Service EFL Test Batteries. Issues in Applied Linguistics 1:30-55.
    Bachman, Lyle F., Fred Davidson, Katharine Ryan & Inn-Chull Choi (1995). An Investigation into the Comparability of Two Tests of English as a Foreign Language: The Cambridge-TOEFL Comparability Study. Cambridge:UCLES.
    Bae, Jungok & Lyle F. Bachman (1998). A Latent Variable Approach to Listening and Reading:Testing Factorial Invariance across Two Groups of Children in the Korean/English Two-way Immersion Program. Language Testing 15(3):380-414.
    Bagozzi, Richard P.& Todd F. Heatherton (1994). A General Approach to Representing Multifaceted Personality Constructs:Application to State Self-esteem. Structural Equation Modeling 1:35-67.
    Bagozzi, Richard P.& Youjae Yi (1988). On the Evaluation of Structural Equation Models. Journal of the Academy of Marketing Science 16(1):74-94.
    Bailey, Kathleen M. (1996). Working for Washback:a Review of the Washback Concept in Language Testing. Language Testing 13(3):257-279.
    Baker, Frank B. (1985). The Basics of Item Response Theory. Portsmouth, NH: Heinemann.
    Baker, Frank B. (1987). Methodology Review: Item Parameter Estimation under the One-, Two-, and Three-parameter Logistic Models. Applied Psychological Measurement 12: 111-141.
    Baker, Frank B. (1992). Equating Tests under the Graded Response Model. Applied Psychological Measurement 16:87-96.
    Bartlett, Maurice Stevenson (1950). Tests of Significance in Factor Analysis. British Journal of Psychology 3:77-85.
    Bentler, Peter M. (1990). Comparative Fit Indexes in Structural Models. Psychological Bulletin 107:238-246.
    Bentler, Peter M. (2006). EQS 6 Structural Equations Program Manual. Encino, CA: Multivariate Software, Inc.
    Bergstrom, Betty A. & Mary E. Lunz (1992). Confidence in Pass/Fail Decisions for Computer Adaptive and Paper-and-pencil Examinations. Evaluation and the Health Professions 15(4): 453-464.
    Bergstrom, Betty A. & Mary E. Lunz (1999). CAT for Certification and Licensure. In Fritz Drasgow & Julie B. Olson-Buchanan (eds.). Innovations in Computerized Assessment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 67-91.
    Birnbaum, Allan (1968). Some Latent Traits and Their Use in Inferring an Examinee's Ability. In Frederic M. Lord, Melvin R. Novick & Allan Birnbaum (eds.). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
    Bock, R. Darrell & Murray Aitkin (1981). Marginal Maximum Likelihood Estimation of Item Parameters: Application of an EM Algorithm. Psychometrika 46: 433-459.
    Bock, R. Darrell & Robert J. Mislevy (1982). Adaptive EAP Estimation of Ability in a Microcomputer Environment. Applied Psychological Measurement 6: 431-444.
    Bollen, Kenneth A. (1989). Structural Equations with Latent Variables. New York: Wiley & Sons.
    Boyd, Aimee Michelle (2003). Strategies for Controlling Testlet Exposure Rates in Computerized Adaptive Testing Systems. Ph.D. Dissertation. Austin: The University of Texas at Austin.
    Bradlow, Eric T., Howard Wainer & Xiaohui Wang (1999). A Bayesian Random Effects Model for Testlets. Psychometrika 64: 153-168.
    Brown, Annie (2003). Legibility and the Rating of Second Language Writing: an Investigation of the Rating of Handwritten and Word-processed IELTS Task Two Essays. (IELTS Research Reports Volume 4). Canberra: IDP: IELTS Australia.
    Brown, Annie & Noriko Iwashita (1996). Language Background and Item Difficulty: the Development of a Computer-adaptive Test of Japanese. System 24(2): 199-206.
    Brown, Joel M. & David J. Weiss (1977). An Adaptive Testing Strategy for Achievement Test Batteries. (Computerized Adaptive Testing Laboratory: Research Report 77-6). Minneapolis, MN: University of Minnesota.
    Browne, Michael W. & Robert Cudeck (1993). Alternative Ways of Assessing Model Fit. In Kenneth A. Bollen & J. Scott Long (eds.). Testing Structural Equation Models. Newbury Park. CA: Sage Publications. 136-162.
    Buck, Gary (1992). Listening Comprehension: Construct Validity and Trait Characteristics. Language Learning 42(3): 313-357.
    Byrne, Barbara M. (1989). A Primer of LISREL: Basic Applications and Programming for Confirmatory Factor Analytic Models. New York:Springer-Verlag.
    Byrne, Barbara M. (2010). Structural Equation Modeling with AMOS. New York: Routledge Taylor & Francis Group.
    Cai, Li (2010). A Two-tier Full-information Item Factor Analysis Model with Applications. Psychometrika 75(4):581-612.
    Cai, Li, David Thissen & Stephen du Toit (2011). IRTPRO User's Guide. Lincolnwood, IL: Scientific Software International, Inc.
    Camilli, Gregory & Lorrie A. Shepard (1994). Methods for Identifying Biased Test Items. Thousand Oaks, CA:Sage.
    Canale, Michael (1986). The Promise and Threat of Computerized Adaptive Assessment of Reading Comprehension. In Charles Stansfield (ed.). Technology and Language Testing:A collection of papers from the Seventh Annual Language Testing Research Colloquium, held at Educational Testing Service, Princeton, New Jersey.
    Carroll, John B. (1983). Psychometric Theory and Language Testing. In John W. Oller Jr. (ed.). Issues in Language Testing Research. Rowley, MA: Newbury House, 80-107.
    Carroll, John B. (1987). Psychometric Theory and Language Testing. In Rudiger Grotjahn, Douglas Keith Stevenson & Christine Klein-Braley (eds.). Taking Their Measure: The Validity and Validation of Language Tests. Bochum: Studienverlag Dr N. Brockmeyer, 1-40.
    Chalhoub-Deville, Micheline (2001). Language Testing and Technology: Past and Future. Language Learning & Technology 5(2): 95-98.
    Chalhoub-Deville, Micheline & Craig Deville (1999). Computer Adaptive Testing in Second Language Contexts. Annual Review of Applied Linguistics 19:273-299.
    Chang, Hua-Hua & Zhiliang Ying (1996). A Global Information Approach to Computerized Adaptive Testing. Applied Psychological Measurement 20:213-229.
    Chapelle, Carol & Dan Douglas (2006). Assessing Language through Computer Technology. Cambridge:University of Cambridge Press.
    Chen, Shu-Ying, Robert D. Ankenmann & Hua-Hua Chang (2000). A Comparison of Item Selection Rules at the Early Stage of Computerized Adaptive Testing. Applied Psychological Measurement 24(3):241-255.
    Chen, Ssu-Kuang, Liling Hou & Barbara G. Dodd (1998). A Comparison of Maximum Likelihood Estimation and Expected a posteriori Estimation in CAT Using the Partial Credit Model. Educational and Psychological Measurement 58(4):569-595.
    Chen, Ssu-Kuang, Liling Hou, Steven J. Fitzpatrick & Barbara G. Dodd (1997). The Effect of Population Distribution and Method of Theta Estimation on Computerized Adaptive Testing (CAT) Using the Rating Scale Model. Educational and Psychological Measurement 57(3):422-439.
    Chen, Wen-Hung & David Thissen (1997). Local Dependence Indexes for Item Pairs Using Item Response Theory. Journal of Educational and Behavioral Statistics 22: 265-289.
    Cheng, Liying (1999). Changing Assessment:Washback on Teacher Perceptions and Actions. Teaching and Teacher Education 15:253-271.
    Choi, Inn-Chull, Kyoung Sung Kim & Jaeyool Boo (2003). Comparability of a Paper-based Language Test and a Computer-based Language Test. Language Testing 20(3):295-320.
    Choi, Inn-Chull & Lyle F. Bachman (1992). An Investigation into the Adequacy of Three IRT Models for Data from Two EFL Reading Tests. Language Testing 9:51-78.
    Choi, Seung W. (2009). FIRESTAR:Computerized Adaptive Testing (CAT) Simulation Program for Polytomous IRT Models. Illinois:Northwestern University Feinberg School of Medicine.
    Choi, Seung W., Karon F. Cook & Barbara G. Dodd (1997). Parameter Recovery for the Partial Credit Model Using MULTILOG. Journal of Outcome Measurement 1: 114-142.
    Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). New York:Academic Press.
    Coniam, David (2006). Evaluating Computer-based and Paper-based Versions of an English-language Listening Test. ReCALL 18 (2):193-211.
    Cook, Linda, Daniel Eignor, Yasuyo Sawaki, Jonathan Steinberg & Frederick Cline (2010). Using Factor Analysis to Investigate Accommodations Used by Students with Disabilities on an English-language Arts Assessment. Applied Measurement in Education 23:187-208.
    Cronbach, Lee J. (1988). Five Perspectives on Validity Argument. In Howard Wainer & Henry I. Braun (eds.). Test Validity. Hillsdale, NJ:Lawrence Erlbaum Associates, Inc.,3-17.
    Davies, Alan (1997). Demands of Being Professional in Language Testing. Language Testing 14(3):328-339.
    Davis, Laurie Laughlin (2004). Strategies for Controlling Item Exposure in Computerized Adaptive Testing with the Generalized Partial Credit Model. Applied Psychological Measurement 28:165-185.
    Davis, Laurie Laughlin & Barbara G. Dodd (2003). Item Exposure Constraints for Testlets in the Verbal Reasoning Section of the MCAT. Applied Psychological Measurement 27(5):335-356.
    Davis, Laurie Laughlin & Barbara G. Dodd (2005). Strategies for Controlling Item Exposure in Computerized Adaptive Testing with the Partial Credit Model. (PEM Report No.05-01). Austin, TX:Pearson Educational Measurement.
    De Ayala, Ralph J. (1989). A Comparison of the Nominal Response Model and the Three-parameter Logistic Model in Computerized Adaptive Testing. Educational and Psychological Measurement 49:789-805.
    De Ayala, Ralph J. (1992). The Influence of Dimensionality on CAT Ability Estimation. Educational and Psychological Measurement 52:513-528.
    DeMars, Christine E. (2006). Application of the Bi-factor Multidimensional Item Response Theory Model to Testlet-based Tests. Journal of Educational Measurement 43(2):145-168.
    DeMars, Christine E. (2010). Item Response Theory. Oxford:Oxford University Press.
    DeMars, Christine E. (2012). Confirming Testlet Effects. Applied Psychological Measurement 36(2):104-121.
    Dodd, Barbara G, Ralph J. De Ayala & William R. Koch (1995). Computerized Adaptive Testing with Polytomous Items. Applied Psychological Measurement 19(1):5-22.
    Dodd, Barbara G., William R. Koch & Ralph J. De Ayala (1989). Operational Characteristics of Adaptive Testing Procedures Using the Graded Response Model. Applied Psychological Measurement 13:129-143.
    Dodd, Barbara G., William R. Koch & Ralph J. De Ayala (1993). Computerized Adaptive Testing Using the Partial Credit model:Effect of Item Pool Characteristics and Different Stopping Rules. Educational and Psychological Measurement 53:61-77.
    Dooey, Patricia (2008). Language Testing and Technology:Problems of Transition to a New Era. ReCALL 20(1):21-34.
    Dorans, Neil J.& Edward Kulick (1986). Demonstrating the Utility of the Standardization Approach to Assessing Unexpected Differential Item Performance on the SAT. Journal of Educational Measurement 23:355-368.
    Dorans, Neil J. & Paul W. Holland (1993). DIF Detection and Description. In Paul W. Holland & Howard Wainer (eds.). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum, 35-66.
    Douglas, Dan & Volker Hegelheimer (2007). Assessing Language Using Computer Technology. Annual Review of Applied Linguistics 27: 115-132.
    Douglas, Jeffrey A., Louis A. Roussos & William Stout (1996). Item Bundle DIF Hypothesis Testing: Identifying Suspect Bundles and Assessing Their Differential Functioning. Journal of Educational Measurement 33: 465-484.
    Drasgow, Fritz (1989). An Evaluation of Marginal Maximum Likelihood Estimation for the Two-parameter Logistic Model. Applied Psychological Measurement 13: 77-90.
    du Toit, Mathilda (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT [Computer manual]. Lincolnwood, IL: Scientific Software International.
    Dunkel, Patricia (1999). Research and Development of a Computer-adaptive Test of Listening Comprehension in the Less-commonly Taught Language Hausa. In Micheline Chalhoub-Deville (ed.). Issues in Computer-Adaptive Testing of Reading Comprehension. Cambridge: Cambridge University Press, 91-121.
    Edmonds, Jennifer J. (2004). The Evaluation of Multiple Stage Adaptive Test Designs. Ph.D. Dissertation. New Jersey: The State University of New Jersey.
    Edwards, Michael C. & Maria Orlando Edelen (2009). Special Topics in Item Response Theory. In Millsap, Roger Ellis & Albert Maydeu-Olivares (eds.). The SAGE Handbook of Quantitative Methods in Psychology. London: Sage publications, 179-198.
    Eignor, Daniel R. (1999). Selected Technical Issues in the Creation of Computer-adaptive Tests of Second Language Reading Proficiency. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge: Cambridge University Press, 167-181.
    Eignor, Daniel R. (2007). Linking Scores Derived under Different Modes of Test Administration. In Neil J. Dorans, Mary Pommerich & Paul W. Holland (eds.). Linking and Aligning Scores and Scales. New York: Springer, 135-157.
    Ellis, Kathleen (2000). The Development and Validation of an Instrument and Two Studies of the Relationship to Cognitive and Affective Learning. Human Communication Research 26(2): 264-291.
    Embretson, Susan E. & Steven Paul Reise (2000). Item Response Theory for Psychologists. NJ:Lawrence Erlbaum Associates.
    Ferne, Tracy & Andre A. Rupp (2007). A Synthesis of 15 Years of Research on DIF in Language Testing:Methodological Advances, Challenges, and Recommendations. Language Assessment Quarterly 4(2):113-148.
    Flaugher, Ronald (2000). Item Pools. In Howard Wainer (ed.). Computerized Adaptive Testing:A Primer (2nd ed.). Mahwah, NH:Lawrence Erlbaum Associates,37-60.
    Folk, Valerie Greaud & Bert F. Green (1989). Adaptive Estimation When the Unidimensionality Assumption of IRT Is Violated. Applied Psychological Measurement 13:373-89.
    Fornell, Claes & David F. Larcker (1981). Evaluating Structural Equation Models with Unobservable Variables and Measurement Error. Journal of Marketing Research 18(1):39-50.
    Freedle, Roy & Irene Kostin (1994). Can Multiple-choice Reading Tests Be Construct Valid? Psychological Science 5:107-110.
    Freedle, Roy & Irene Kostin (1999). Does Text Matter in a Multiple-choice Test of Comprehension? The Case for the Construct Validity of TOEFL's Minitalks. Language Testing 16(1):2-32.
    Fulcher, Glenn (1999). Computerizing an English Language Placement Test. ELT Journal 53(4):289-299.
    Georgiadou, Elissavet G., Evangelos Triantafillou & Anastasios A. Economides (2007). A Review of Item Exposure Control Strategies for Computerized Adaptive Testing Developed from 1983 to 2005. Journal of Technology, Learning, and Assessment 5: 4-38.
    Gibbons, Robert D.& Donald R. Hedeker (1992). Full-information Bi-factor Analysis. Psychometrika 57:423-436.
    Giouroglou, Hara & Anastasios A. Economides (2003). Cognitive CAT in Foreign Language Assessment. In Proceedings 11th International PEG Conference, Powerful ICT Tools for Learning and Teaching, PEG'03, St. Petersburg, Russia, 28 June-1 July, CD-ROM.
    Giouroglou, Hara & Anastasios A. Economides (2005). An Implemented Theoretical Framework for a Common European Foreign Language Adaptive Assessment. Presented at the 3rd International Conference on Open and Distance Learning, Greek Open University, Patra, Greece.
    Gorin, Joanna S., Barbara G. Dodd, Steven J. Fitzpatrick & Yann Yann Shieh (2005). Computerized Adaptive Testing with the Partial Credit Model: Estimation Procedures, Population Distributions, and Item Pool Characteristics. Applied Psychological Measurement 29(6): 433-456.
    Green, Bert, R. Darrell Bock, Lloyd G. Humphreys, Robert L. Linn & Mark D. Reckase (1984). Technical Guidelines for Assessing Computerized Adaptive Tests. Journal of Educational Measurement 21(4): 347-360.
    Green, Tony & Louise Maycock (2004). Computer-based IELTS and Paper-based Versions of IELTS. Research Notes 18: 3-6.
    Gulliksen, Harold (1950). Theory of Mental Tests. Hillsdale, NJ: Lawrence Erlbaum.
    Haebara, Tomokazu (1980). Equating Logistic Ability Scales by a Weighted Least Squares Method. Japanese Psychological Research 22: 144-149.
    Hair, Joseph F., Bill Black, Barry Babin, Rolph E. Anderson & Ronald L. Tatham (2006). Multivariate Data Analysis (6th ed.). Upper Saddle River, NJ: Prentice Hall.
    Hambleton, Ronald K. & Hariharan Swaminathan (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff Publishing.
    Hambleton, Ronald K., Hariharan Swaminathan & H. Jane Rogers (1991). Fundamentals of Item Response Theory. Newbury Park: Sage Publications, Inc.
    Hambleton, Ronald K. & Jac N. Zaal (1991). Advances in Educational and Psychological Testing: Theory and Applications. Boston, MA: Kluwer-Nijhoff Publishing.
    Hamp-Lyons, Liz (1997). Washback, Impact and Validity: Ethical Concerns. Language Testing 14(3): 295-303.
    Han, Kyung T. (2009). IRTEQ: Windows Application That Implements Item Response Theory Scaling and Equating. Applied Psychological Measurement 33(6): 491-493.
    Hanson, Bradley A. & Anton A. Beguin (2002). Obtaining a Common Scale for Item Response Theory Item Parameters Using Separate Versus Concurrent Estimation in the Common-item Equating Design. Applied Psychological Measurement 26(1): 3-24.
    Harris, Deborah (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practices 8: 35-41.
    Harrison, David A. (1986). Robustness of IRT Parameter Estimation to Violations of the Unidimensionality Assumption. Journal of Educational and Behavioral Statistics 11: 91-115.
    Harwell, Michael R.& Janine E. Janosky (1991). An Empirical Study of the Effects of Small Datasets and Varying Prior Variances on Item Parameter Estimation in BILOG. Applied Psychological Measurement 15:279-291.
    Hattie, John (1985). Methodology Review:Assessing Unidimensionality of Tests and Items. Applied Psychological Measurement 9(2):139-164.
    Hau, Kit-Tai & Herbert W. Marsh (2004). The Use of Item Parcels in Structural Equation Modelling:Non-normal Data and Small Sample Sizes. British Journal of Mathematical and Statistical Psychology 57:327-351.
    He, Lianzhen (2004). Computerized Cognitive Adaptive Testing. Hangzhou:Zhejiang University Press.
    Hendrickson, Amy (2007). An NCME Instructional Module on Multistage Testing. Educational Measurement:Issue and Practice 26(2):44-52.
    Henning, Grant (1987). A Guide to Language Testing:Development, Evaluation, Research. Cambridge, MA:Newbury House.
    Henning, Grant (1991). Validating an Item Bank in a Computer-assisted or Computer-adaptive Test:Using Item Response Theory for the Process of Validating CATS. In Patricia Dunkel (ed.). Computer-assisted Language Learning and Testing: Research Issues and Practice. New York:Newbury House,209-222.
    Hetter, Rebecca D., Daniel O. Segall & Bruce M. Bloxom (1994). A Comparison of Item Calibration Media in Computerized Adaptive Testing. Applied Psychological Measurement 18:197-204.
    Ho, Tsung-Han (2010). A Comparison of Item Selection Procedures Using Different Ability Estimation Methods in Computerized Adaptive Testing Based on the Generalized Partial Credit Model. Ph.D. Dissertation. Austin:The University of Texas at Austin.
    Holland, Paul W. (1985). On the Study of Differential Item Performance without IRT. Presented at the Proceedings of the Military Testing Association.
    Holland, Paul W.& Dorothy T. Thayer (1988). Differential Item Performance and the Mantel-Haenszel Procedure. In Howard Wainer & Henry I. Braun (eds.). Test Validity. Hillsadale, NJ:Lawrence Erlbaum,129-145.
    Holmes, Finch (2005). The MIMIC Model as a Method for Detecting DIF:Comparison with Mantel-Haenszel, SIBTEST, and the IRT Likelihood Ratio. Applied Psychological Measurement 29:278-295.
    Horkay, Nancy, Randy Elliot Bennett, Nancy Allen, Bruce A. Kaplan & Fred Yan (2006). Does it Matter if I Take My Writing Test on Computer? A Second Empirical Study of Mode Effects in NAEP. Journal of Technology, Learning and Assessment 5: (2), 1-39.
    Hu, Li-tze & Peter M. Bentler (1995). Evaluating Model Fit. In Rick H. Hoyle (ed.). Structural Equation Modeling: Concepts, Issues, and Applications. Thousand Oaks, CA: Sage, 76-99.
    Hu, Li-tze, Peter M. Bentler & Yutaka Kano (1992). Can Test Statistics in Covariance Structure Analysis be Trusted? Psychological Bulletin 112(2): 351-362.
    Hutchinson, Susan R. & Antonio Olmos (1998). Behavior of Descriptive Fit Indices in Confirmatory Factor Analysis Using Ordered Categorical Data. Structural Equation Modelling: A Multidisciplinary Journal 5: 334-364.
    Jang, Eunice Eunhee & Louis Roussos (2007). An Investigation into the Dimensionality of TOEFL Using Conditional Covariance-based Nonparametric Approach. Journal of Educational Measurement 44 (1): 1-21.
    Jin, Yan, Jiang Wu & Ming Yan (2011). Is Computer Literacy Construct-relevant in a Language Test in the 21st Century? Presented at the 33rd Language Testing Research Colloquium, Ann Arbor: Michigan State University.
    Jones, Neil (2000). BULATS: A Case Study Comparing Computer-based and Paper-and-pencil Tests, Research Notes 3, 10-13.
    Jones, Neil & Louise Maycock (2007). The Comparability of Computer-based and Paper-based Tests: Goals, Approaches, and a Review of Research. Research Notes 27: 11-14.
    Kaiser, Henry F. (1970). A Second Generation Little Jiffy. Psychometrika 35(4): 401-416.
    Kaiser Henry F. & John Rice (1974). Little Jiffy Mark IV. Educational and Psychological Measurement 34: 111-117.
    Kane, Michael T. (1992). An Argument-based Approach to Validity. Psychological Bulletin 112:527-535.
    Kane, Michael T. (2001). Current Concerns in Validity Theory. Journal of Educational Measurement 38(4): 319-342.
    Kane, Michael T. (2004). Certification Testing as an Illustration of Argument-based Validation. Measurement: Interdisciplinary Research and Perspectives 2(3): 135-170.
    Kane, Michael T. (2006). Validation. In Robert L. Brennan (ed.). Educational Measurement (4th ed.). New York:American Council on Education and Praeger Publishers.
    Kaya-carton, Esin, Aaron S. Carton & Patricia Dandonoli (1991). Developing a Computer-adaptive Test of French Reading Proficiency. In Patricia Dunkel (ed.). Computer-assisted Language Learning and Testing:Research Issues and Practice. New York:Newbury House,259-284.
    Kim, Jiseon (2010). A Comparison of Computer-based Classification Testing Approaches Using Mixed-format Tests with the Generalized Partial Credit Model. Ph.D. Dissertation. Austin:The University of Texas at Austin.
    Kim, Mikyung (2001). Detecting DIF across the Different Language Groups in a Speaking Test. Language Testing 18(1):89-114.
    Kim, Seock-Ho & Allan S. Cohen (1992). Effects of Linking Methods on Detection of DIF. Journal of Educational Measurement 29:51-66.
    Kim, Seock-Ho & Allan S. Cohen (1998). A Comparison of Linking and Concurrent Calibration under Item Response Theory. Applied Psychological Measurement 22 (2): 131-143.
    Kim, Seonghoon & Won-Chan Lee (2006). An Extension of Four IRT Linking Methods for Mixed-format Tests. Journal of Educational Measurement 43(1):53-76.
    Keng, Leslie (2008). A Comparison of the Performance of Testlet-based Computer Adaptive Tests and Multistage Tests. Ph.D. Dissertation. Austin:The University of Texas at Austin.
    Kline, Rex B. (1998). Principles and Practice of Structural Equation Modeling. New York, NY: Guilford.
    Kingsbury, G. Gage (1990). Adapting Adaptive Testing:Using the MicroCAT Testing in a Local School District. Educational Measurement:Issues and Practice 9(2):3-6.
    Kingsbury, G. Gage (2002). An Empirical Comparison of Achievement Level Estimates from Adaptive Tests and Paper-and-pencil Tests. Presented at the American Educational Research Association annual meeting, New Orleans, LA.
    Kingsbury, G. Gage & Anthony R. Zara (1989). Procedures for Selecting Items for Computerized Adaptive Tests. Applied Measurement in Education 2(4):359-375.
    Kingsbury, G. Gage & David J. Weiss (1983). A Comparison of IRT-based Adaptive Mastery Testing and a Sequential Mastery Testing Procedure. In David J. Weiss (ed.). New Horizons in Testing: Latent Trait Theory and Computerized Adaptive Testing. New York: Academic Press, 237-254.
    Kingsbury, G. Gage & Ronald L. Houser (1993). Assessing the Utility of Item Response Models: Computer Adaptive Testing. Educational Measurement: Issues and Practice 12:21-27.
    Kingsbury, G. Gage & Steven L. Wise (2000). Practical Issues in Developing and Maintaining a Computerized Adaptive Testing Program. Psicologica 21:135-155.
    Koch, William R. & Barbara G. Dodd (1989). An Investigation of Procedures for Computerized Adaptive Testing Using Partial Credit Scoring. Applied Measurement in Education 2(4): 335-357.
    Kolen, Michael J. & Robert L. Brennan (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). New York: Springer.
    Kristjansson, Elizabeth, Richard Aylesworth & Ian McDowell (2005). A Comparison of Four Methods for Detecting Differential Item Functioning in Ordered Response Items. Educational and Psychological Measurement 65: 935-953.
    Kunnan, Antony J. (2003). Test Fairness. In Michael Milanovic & Cyril J. Weir (eds.). European Langauge Testing in a Global Context: Proceedings of the ALTE Barcelona Conference July 2001. Cambridge, UK: Cambridge University Press.
    Larson, Jerry (1987). Computerized Adaptive Language Testing: A Spanish Placement Exam. In Kathleen M. Bailey, Theodore L. Dale & Ray T. Clifford (eds.). Language Testing Research. Monterey, CA: Defense Language Institute, 1-10.
    Larson, Jerry (1999). Considerations for Testing Reading Proficiency via Computer-adaptive Testing. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge: Cambridge University Press, 71-90.
    Lau, C. Allen & Tianyou Wang (1998). Comparing and Combining Dichotomous and Polytomous Items with SPRT Procedure in Computerized Classification Testing. Presented at the annual meeting of the American Educational Research Association, San Diego, CA.
    Laurier, Michael (1999). The Development of an Adaptive Test for Placement in French. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge: Cambridge University Press, 122-135.
    Lee, Guemin, Michael J. Kolen, David A. Frisbie & Robert D. Ankenmann (2001). Comparison of Dichotomous and Polytomous Item Response Models in Equating Scores from Tests Composed of Testlets. Applied Psychological Measurement 25: 357-372.
    Lee, Jo Ann (1986). The Effects of Past Computer Experience on Computerized Aptitude Test Performance. Educational and Psychological Measurement 46:727-733.
    Lee, Yong-Won (1998). Examining the Suitability of an IRT-based Testlet Approach to the Construction and Analysis of Passage-based Items in an EFL Reading Comprehension Test in the Korean High School Context. Ph.D. Dissertation. PA:The Pennsylvania State University.
    Lee, Yong-Won (2004). Examining Passage-related Local Item Dependence (LID) and Measurement Construct Using Q3 Statistics in an EFL Reading Comprehension Test. Language Testing 21 (1):74-100.
    Levin, Tamar & Claire Gordon (1989). Effect of Gender and Computer Experience on Attitudes toward Computers. Journal of Educational Computing Research 5:69-88.
    Levin, Tamar & Smadar Donitsa-Schmidt (1997). Commitment to Learning:Effects of Computer Experience, Confidence and Attitudes. Journal of Educational Computing Research 16(1):83-105.
    Levine, Adina & Thea Reves (1988). The FL Receptive Skills: Same or Different? System 16: 326-336.
    Li, Hsin-Hung & William Stout (1996). A New Procedure for Detection of Crossing DIF. Psychometrika 61:647-677.
    Li, Yanmei, Daniel M. Bolt & Jianbin Fu (2005). A Test Characteristic Curve Linking Method for the Testlet Model. Applied Psychological Measurement 29:340-356.
    Linacre, John M. (1999). A Measurement Approach to Computer-adaptive Testing of Reading Comprehension. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge:Cambridge University Press,182-195.
    Little, Todd D., William A. Cunningham, Golan Shahar & Keith F. Widaman (2002). To Parcel or Not to Parcel: Exploring the Question, Weighing the Merits. Structural Equation Modeling 9: 151-173.
    Lord, Frederic M. (1971). Robbins-Munro Procedures for Tailored Testing. Educational and Psychological Measurement 31:3-31.
    Lord, Frederic M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ:Lawrence Erlbaum Associates, Inc.
    Lord, Frederic M.& Melvin R. Novick (1968). Statistical Theories of Mental Test Scores. Reading, MA:Addison-Wesley.
    Loyd, Brenda H.& H. D. Hoover (1980). Vertical Equating Using the Rasch Model. Journal of Educational Measurement 17:179-193.
    Luecht, Richard (1999). The Practical Utility of Rasch Measurement Models. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge:Cambridge University Press,196-223.
    Lynch, Brian K. (1997). In Search of the Ethical Test. Language Testing 14(3):315-327.
    Madsen, Harold (1991). Computer-adaptive Testing of Listening and Reading Comprehension. In Patricia Dunkel (ed.). Computer-assisted Language Learning and Testing:Research Issues and Practice.New York:Newbury House,237-257.
    Mantel, Nathan & William Haenszel (1959). Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease. Journal of the National Cancer Institute 22: 719-748.
    Marais, Ida & David Andrich (2008). Effects of Varying Magnitude and Patterns of Local Dependence in the Unidimensional Rasch Model. Journal of Applied Measurement 9: 105-124.
    Marco, Gary L. (1977). Item Characteristic Curve Solutions to Three Intractable Testing Problems. Journal of Educational Measurement 14:139-160.
    Masters, Geoff N. (1982). A Rasch Model for Partial Credit Scoring. Psychometrika 47(2): 149-174.
    Maycock, Louise & Tony Green (2005). The Effects on Performance of Computer Familiarity and Attitudes towards CB IELTS. Research Notes 20: 3-8.
    McBride, James R.& John T. Martin (1983). Reliability and Validity of Adaptive Ability Tests in a Military Setting. In David J. Weiss (ed.). New Horizons in Testing. New York, Academic Press,224-236.
    McClarty, Katie Larsen (2006). A Feasibility Study of a Computerized Adaptive Test of the International Personality Item Pool NEO. Ph.D. Dissertation. Austin:The University of Texas at Austin.
    McDonald, Angus S. (2002). The Impact of Individual Differences on the Equivalence of Computer-based and Paper-and-pencil Educational Assessments. Computers & Education 39: 299-312.
    McNamara, Tim (1998). Policy and Social Considerations in Language Assessment. Annual Review of Applied Linguistics 18:304-319.
    McNamara, Tim (2006). Validity in Language Testing:the Challenge of Sam Messick's Legacy. Language Assessment Quarterly:An International Journal 3(1):31-51.
    McNamara, Tim (2007). Language Assessment in Foreign Language Education:the Struggle over Constructs. The Modern Language Journal 97(2):280-282.
    McNamara, Tim & Carsten Roever (2006). Language Testing:the Social Dimension. Massachusetts:Blackwell Publishing.
    Meijer, Rob R. & Michael L. Nering (1999). Computerized Adaptive Testing:Overview and Introduction. Applied Psychological Measurement 23:187-194.
    Messick, Samuel (1988). The Once and Future Issues of Validity:Assessing the Meaning and Consequences of Measurement. In Howard Wainer & Henry I. Braun (eds.). Test Validity. Hillsdale, NJ:Erlbaum,33-45.
    Messick, Samuel (1989). Validity. In Robert L. Linn (ed.). Educational Measurement (3rd ed.). New York:Macmillan,13-103.
    Messick, Samuel (1994). The Interplay of Evidence and Consequences in the Validation of Performance Assessments. Educational Researcher 23(2):13-23.
    Messick, Samuel (1996). Validity and Washback in Language Testing. Language Testing 13(3):241-256.
    Meunier, Lydie E. (1994). Computer Adaptive Language Tests (CALT) Offer a Great Potential for Functional Testing. Yet, Why Don't They? CALICO Journal 11(4): 23-39.
    Miller, Fayneese & Narendra Varma (1994). The Effects of Psychosocial Factors on Indian Children's Attitudes toward Computers. Journal of Educational Computing Research 10:223-238.
    Mills, Craig N. (1999). Development and Introduction of a Computer Adaptive Graduate Record Examination General Test. In Fritz Drasgow & Julie Olson-Buchanan (eds.). Innovations in Computerized Assessment. Mahwah, NJ:Lawrence Erlbaum Associates,117-135.
    Mislevy, Robert J., Linda S. Steinberg & Russell G. Almond (2003). On the Structure of Educational Assessments. Measurement:Interdisciplinary Research and Perspectives 7(1):3-62.
    Morgan, Rick & John Mazzeo (1988). A Comparison of the Structural Relationships among Reading, Listening, Writing, and Speaking Components of the Advanced Placement French Language Examination for Advanced Place Candidates and College Students. (Research Report 89-59). Princeton, NJ:Educational Testing Service.
    Muraki, Eiji (1992). A Generalized Partial Credit Model:Application of an EM Algorithm. Applied Psychological Measurement 16(2):159-176.
    Nering, Michael & Remo Ostini (2010). Handbook of Polytomous Item Response Theory Models. New York:Routledge Taylor & Francis Group.
    Neuman, George & Ramzi Baydoun (1998). Computerization of Paper-and-pencil Tests: When Are They Equivalent? Applied Psychological Measurement 22(1):71-83.
    Nielsen, Jean (2004). Levels of English Proficiency (LOEP):A Computer-adaptive Language Test. Newsletter of the Association of Teachers of English as a Second Language of Ontario 30(1):1-6.
    Nogami, Yasuko & Norio Hayashi (2010). A Japanese Adaptive Test of English as a Foreign Language:Development and Operational Aspects. In Wim J. van der Linden & Cees A. W. Glas (eds.). Elements of Adaptive Testing. New York:Springer, 191-211.
    Ockey, Gary J. (2007). Investigating the Validity of Math Word Problems for English Language Learners with DIF. Language Assessment Quarterly 4(2):149-164.
    Ockey, Gary J. (2009). Developments and Challenges in the Use of Computer-based Testing for Assessing Second Language Ability. The Modern Language Journal 93: 836-847.
    Oller, John W. Jr. (1976). Evidence of a General Language Proficiency Factor: An Expectancy Grammar. Die Neuen Sprachen 76: 165-174.
    Oller, John W. Jr. (1979). Language Tests at School: A Pragmatic Approach. London: Longman.
    Oller, John W. Jr. & Frances Butler Hinofotis (1980). Two Mutually Exclusive Hypotheses about Second Language Ability: Factor Analytic Studies of a Variety of Language Subtests. In John W. Oller Jr. & Kyle Perkins (eds.). Research in Language Testing. Rowley, MA: Newbury House, 13-23.
    Orlando, Maria & David Thissen (2000). Likelihood-based Item Fit Indices for Dichotomous Item Response Theory Models. Applied Psychological Measurement 24:50-64.
    Orlando, Maria & David Thissen (2003). Further Investigation of the Performance of S-X2: An Item Fit Index for Use with Dichotomous Item Response Theory Models. Applied Psychological Measurement 27:289-298.
    Ostini, Remo & Michael L. Nering (2006). Polytomous Item Response Theory Models. Thousand Oaks, CA:Sage Publication.
    Owen, Roger J. (1969). A Bayesian Approach to Tailored Testing (Research Bulletin No. 69-92). Princeton, NJ:Educational Testing Service.
    Owen, Roger J. (1975). A Bayesian Sequential Procedure for Quantal Response in the Context of Adaptive Mental Testing. Journal of the American Statistical Association 70(350):351-356.
    Pae, Tae-Il (2004). DIF for Examinees with Different Academic Backgrounds. Language Testing 21(1): 53-73.
    Papadima-Sophocleous, Salomi (2008). A Hybrid of a CBT- and a CAT-based New English Placement Test Online (NEPTON). CALICO Journal 25(2): 276-304.
    Parshall, Cynthia G., Judith A. Spray, John C. Kalohn & Tim Davey (2002). Practical Considerations in Computer-based Testing. New York:Springer-Verlag.
    Passos, Valeria Lima, Martijn P. F. Berger & Frans E. Tan (2007). Test Design Optimization in CAT Early Stage with the Nominal Response Model. Applied Psychological Measurement 31(3):213-232.
    Patsula, Liane N. & Manfred Steffen (1997). Maintaining Item and Test Security in a CAT Environment: A Simulation Study. Presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
    Peng, Jian-E & Lindy Woodrow (2010). Willingness to Communicate in English:A Model in the Chinese EFL Classroom Context. Language Learning 60 (4):834-876.
    Petersen, Nancy S., Linda L. Cook & Martha L. Stocking (1983). IRT versus Conventional Equating Methods:A Comparative Study of Scale Stability. Journal of Educational Statistics 8:137-156.
    Powers, Donald E. (1999). Test Anxiety and Test Performance:Comparing Paper-based and Computer-adaptive Versions of the GRE General Test. Journal of Educational Computing Research 24(3):249-273.
    Preece, Jenny, Yvonne Rogers & Helen Sharp (2002). Interaction Design:Beyond Human-Computer Interaction. New Jersey:Wiley.
    Rasch, George (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: The University of Chicago Press.
    Raykov, Tenko & George A. Marcoulides (2006). A First Course in Structural Equation Modeling. New Jersey: Erlbaum Associates.
    Reckase, Mark D. (1979). Unifactor Latent Trait Models Applied to Multifactor Tests: Results and Implications. Journal of Educational Statistics 4: 207-230.
    Reckase, Mark D. (1981). Final Report: Procedures for Criterion Referenced Tailored Testing. Columbia: University of Missouri, Educational Psychology Department.
    Reckase, Mark D. (1985). The Difficulty of Test Items that Measure More Than One Ability. Applied Psychological Measurement 9: 401-412.
    Reckase, Mark D. (1989). Adaptive Testing: the Evolution of a Good Idea. Educational Measurement: Issues and Practice 18(3): 11-15.
    Reise, Steve P. & Jiayuan Yu (1990). Parameter Recovery in the Graded Response Model Using MULTILOG. Journal of Educational Measurement 27: 133-144.
    Revuelta, Javier & Vicente Ponsoda (1998). A Comparison of Item Exposure Control Methods in Computerized Adaptive Testing. Journal of Educational Measurement 35(4): 311-327.
    Rijmen, Frank (2010). Formal Relations and an Empirical Comparison among the Bi-factor, the Testlet, and a Second-order Multidimensional IRT Model. Journal of Educational Measurement 47(3): 361-372.
    Rosa, Kathleen, Kimberly A. Swygert, Lauren Nelson & David Thissen (2001). Item Response Theory Applied to Combinations of Multiple-choice and Constructed-response Items-scale Scores for Patterns of Summed Scores. In D. Thissen & Howard Wainer (eds.). Test Scoring. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 253-292.
    Rosenbaum, Paul R. (1988). Item Bundles. Psychometrika 53: 349-359.
    Roussos, Louis & William Stout (1996). A Multidimensionality-based DIF Analysis Paradigm. Applied Psychological Measurement 20: 355-371.
    Samejima, Fumiko (1969). Estimation of Latent Ability Using a Response Pattern of Graded Responses. Psychometrika Monograph Supplement 34(2): 100.
    Sasaki, Miyuki (1996). Second Language Proficiency, Foreign Language Aptitude, and Intelligence: Quantitative and Qualitative Analyses. New York: Peter Lang.
    Sawaki, Yasuyo (2001a). Comparability of Conventional and Computerized Tests of Reading in a Second Language. Language Learning & Technology 5(2): 38-59.
    Sawaki, Yasuyo (2001b). How Examinees Take Conventional versus Web-based Japanese Reading Tests. Work in progress session presented at the 23rd Annual Language Testing Research Colloquium, St. Louis, MO.
    Sawaki, Yasuyo (2007). Construct Validation of Analytic Rating Scales in a Speaking Assessment:Reporting a Score Profile and a Composite. Language Testing 24(3): 355-390.
    Sawaki, Yasuyo, Lawrence J. Stricker & Andreas H. Oranje (2008). Factor Structure of the TOEFL Internet-based Test (iBT): Exploration in a Field Trial Sample. (TOEFL iBT Research Report No. 04). Princeton, NJ: Educational Testing Service.
    Sawaki, Yasuyo, Lawrence J. Stricker & Andreas H. Oranje (2009). Factor Structure of the TOEFL Internet-based Test. Language Testing 26(1):5-30.
    Schwarz, Gideon E. (1978). Estimating the Dimension of a Model. Annals of Statistics 6: 461-464.
    Segall, Daniel O. & Kathleen E. Moreno (1999). Development of the Computerized Adaptive Testing Version of the Armed Services Vocational Aptitude Battery. In Fritz Drasgow & Julie B. Olson-Buchanan (eds.). Innovations in Computerized Assessment. Mahwah, NJ: Lawrence Erlbaum Associates, 35-65.
    Segall, Daniel O., Kathleen E. Moreno & Rebecca D. Hetter (1997). Item Pool Development and Evaluation. In William A. Sands, Brian K. Waters & James R. McBride (eds.). Computerized Adaptive Testing:From Inquiry to Operation. Washington, DC:American Psychological Association,117-130.
    Shaw, Stuart D. (2003). Legibility and the Rating of Second Language Writing:the Effect on Examiners When Assessing Handwritten and Word-processed Scripts. Research Notes 11:7-10.
    Shin, Chingwei David, Yuehmei Chien, Walter Denny Way & Len Swanson (2009). Weighted Penalty Model for Content Balancing in CATS. Pearson:Pearson Assessment and Information.
    Shin, Sang-Keun (2005). Did They Take the Same Test? Examinee Language Proficiency and the Structure of Language Tests. Language Testing 22(1):31-57.
    Shin, Sunyong (2007). Examining the Construct Validity of a Web-based Academic Listening Test:an Investigation of the Effects of Response Formats in a Web-based Listening Test. Ph.D. Dissertation. Los Angeles:University of California, Los Angeles.
    Shohamy, Elana (1998). Critical Language Testing and Beyond. Studies in Educational Evaluation 24: 331-345.
    Siann, Gerda, Hamish Macleod, Peter Glissov & Alan Durndell (1990). The Effect of Computer Use on Gender Differences in Attitudes to Computers. Computers in Education 14: 183-191.
    Sireci, Stephen G., David Thissen & Howard Wainer (1991). On the Reliability of Testlet-based Tests. Journal of Educational Measurement 28(3): 237-247.
    So, Youngsoon (2010). Dimensionality of Responses to a Reading Comprehension Assessment and Its Implications to Scoring Test Takers on Their Reading Proficiency. Ph.D. Dissertation. Los Angeles: University of California, Los Angeles.
    Song, Min-Young (2008). Do Divisible Subskills Exist in Second Language (L2) Comprehension? A Structural Equation Modelling Approach. Language Testing 25(4): 435-464.
    Spolsky, Bernard (1997). The Ethics of Gatekeeping Tests: What Have We Learned in a Hundred Years? Language Testing 14(3): 242-247.
    Spray, Judith A., Terry A. Ackerman, Mark D. Reckase & James E. Carlson (1989). Effect of the Medium of Item Presentation on Examinee Performance and Item Characteristics. Journal of Educational Measurement 26: 261-271.
    SPSS Inc. (2009). PASW Statistics Release 18.0. Chicago, IL: SPSS Inc.
    Stansfield, Charles W. (1993). Ethics, Standards, and Professionalism in Language Testing. Issues in Applied Linguistics 4(2): 189-206.
    Steinberg, Lynne & David Thissen (1996). Uses of Item Response Theory and the Testlet Concept in the Measurement of Psychopathology. Psychological Methods 1: 81-97.
    Steinberg, Lynne, David Thissen & Howard Wainer (1990). Validity. In Howard Wainer, Neil J. Dorans, Daniel Eignor, Ronald Flaugher, Bert F. Green, Robert J. Mislevy, Lynne Steinberg & David Thissen (eds.). Computerized Adaptive Testing: A Primer. Hillsdale, NJ: Lawrence Erlbaum Associates, 187-232.
    Stevens, James (1996). Applied Multivariate Statistics for the Social Sciences. Mahwah, NJ: Erlbaum.
    Stevenson, Jose & Susan Gross (1991). Use of a Computerized Adaptive Testing Model for ESOL/Bilingual Entry/Exit Decision Making. In Patricia Dunkel (ed.). Computer-assisted Language Learning and Testing: Research Issues and Practice. New York: Newbury House, 223-235.
     Stocking, Martha L. (1993). Controlling Item Exposure Rates in a Realistic Adaptive Testing Paradigm. (Research Report 93-2). Princeton, NJ: Educational Testing Service.
     Stocking, Martha L. & Frederic M. Lord (1983). Developing a Common Metric in Item Response Theory. Applied Psychological Measurement 7: 201-210.
     Stocking, Martha L. & Len Swanson (1993). A Method for Severely Constrained Item Selection in Adaptive Testing. Applied Psychological Measurement 17(3): 277-292.
     Stone, Clement A. (1992). Recovery of Marginal Maximum Likelihood Estimates in the Two-parameter Logistic Response Model: An Evaluation of MULTILOG. Applied Psychological Measurement 16: 1-16.
     Stone, Gregory Ethan & Mary E. Lunz (1994). Item Calibration Considerations: A Comparison of Item Calibrations on Written and Computerized Adaptive Examinations. Presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
     Stout, William (1987). A Nonparametric Approach for Assessing Latent Trait Unidimensionality. Psychometrika 52: 589-617.
     Stout, William, Amy Goodwin Froelich & Furong Gao (2001). Using Resampling Methods to Produce an Improved DIMTEST Procedure. In Anne Boomsma, Marijtje A. J. van Duijn & Tom A. B. Snijders (eds.). Essays on Item Response Theory. New York: Springer-Verlag, 357-376.
    Stout, William & Louis Roussos (1995). SIBTEST Manual. Champaign-Urbana, IL: Department of Statistics, Statistical Laboratory for Educational and Psychological Measurement, University of Illinois.
     Stricker, Lawrence J., Donald A. Rock & Yong-Won Lee (2005). Factor Structure of the LanguEdge Test across Language Groups. (TOEFL Monograph Series MS-32). Princeton, NJ: Educational Testing Service.
    Sumbling, Mick, Pablo Sanz, M. Carme Viladrich, Eduardo Doval & Laura Riera (2007). Development of a Multiple-component CAT for Measuring Foreign Language Proficiency (SIMTEST). Presented at the 2007 GMAC Computerized Adaptive Testing Conference, Minneapolis, USA.
     Swaminathan, Hariharan & Janice A. Gifford (1985). Bayesian Estimation in the Two-parameter Logistic Model. Psychometrika 50: 349-364.
    Sympson, James B., David J. Weiss & Malcolm James Ree (1982). Predictive Validity of Conventional and Adaptive Tests in an Air Force Training Environment. (AFHRL-TR-81-40). BAFB, TX: Air Force Human Resources Laboratory.
     Takala, Sauli & Felianka Kaftandjieva (2000). Test Fairness: A DIF Analysis of an L2 Vocabulary Test. Language Testing 17(3): 323-340.
    Tang, K. Linda & Daniel R. Eignor (1997). Concurrent Calibration of Dichotomously and Polytomously Scored TOEFL Items Using IRT Models. (TOEFL Technical Report TR-13). Princeton, NJ: Educational Testing Service.
     Tao, Yu-Hui, Yu-Lung Wu & Hsin-Yi Chang (2008). A Practical Computer Adaptive Testing Model for Small-scale Scenarios. Educational Technology & Society 11(3): 259-274.
     Taylor, Carol, Joan Jamieson & Daniel Eignor (2000). Trends in Computer Use among International Students. TESOL Quarterly 34(3): 575-585.
     Taylor, Carol, Irwin Kirsch, Joan Jamieson & Daniel Eignor (1999). Examining the Relationship between Computer Familiarity and Performance on Computer-based Language Tasks. Language Learning 49(2): 219-274.
     Teresi, Jeanne A., Marjorie Kleinman & Katja Ocepek-Welikson (2000). Modern Psychometric Methods for Detection of Differential Item Functioning: Application to Cognitive Assessment Measures. Statistics in Medicine 19: 1651-1683.
     Thissen, David (2009). The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System—IRT Software. In David J. Weiss (ed.). Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Retrieved [031111] from www.psych.umn.edu/psylabs/CATCentral/
    Thissen, David & Lynne Steinberg (1986). Taxonomy of Item Response Models. Psychometrika 51(4): 567-577.
     Thissen, David & Lynne Steinberg (2010). Using Item Response Theory to Disentangle Constructs at Different Levels of Generality. In Susan E. Embretson (ed.). Measuring Psychological Constructs: Advances in Model-based Approaches. Washington, DC: American Psychological Association, 123-144.
    Thissen, David, Lynne Steinberg & Howard Wainer (1988). Use of Item Response Theory in the Study of Group Differences in Trace Lines. In Howard Wainer & Henry I. Braun (eds.). Test Validity. Hillsdale, NJ: Lawrence Erlbaum, 147-169.
     Thissen, David, Lynne Steinberg & Howard Wainer (1993). Detection of Differential Item Functioning Using the Parameters of Item Response Models. In Paul W. Holland & Howard Wainer (eds.). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum, 35-66.
     Thissen, David, Lynne Steinberg & Jo Ann Mooney (1989). Trace Lines for Testlets: A Use of Multiple Categorical Response Models. Journal of Educational Measurement 26(3): 247-260.
     Thissen, David & Robert J. Mislevy (2000). Testing Algorithms. In Howard Wainer (ed.). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, 101-134.
    Toulmin, Stephen Edelston (2003). The Uses of Argument. Cambridge, England: Cambridge University Press.
     Tuerlinckx, Francis & Paul De Boeck (2001). The Effects of Ignoring Item Interactions on the Estimated Discrimination Parameters in Item Response Theory. Psychological Methods 6: 181-195.
     Uiterwijk, Henny & Ton Vallen (2005). Linguistic Sources of Item Bias for Second Generation Immigrants in Dutch Tests. Language Testing 22(2): 211-234.
     Urry, Vern W. (1977). Tailored Testing: A Successful Application of Latent Trait Theory. Journal of Educational Measurement 14: 181-196.
     Vale, C. David (1986). Linking Item Parameters onto a Common Scale. Applied Psychological Measurement 10(4): 333-344.
     van der Linden, Wim J. (2000). Constrained Adaptive Testing with Shadow Tests. In Wim J. van der Linden & Cees A. W. Glas (eds.). Computerized Adaptive Testing: Theory and Practice. Netherlands: Kluwer Academic Publishers, 27-52.
    van der Linden, Wim J. (2005). A Comparison of Item-selection Methods for Adaptive Tests with Content Constraints. Journal of Educational Measurement 42(3): 283-302.
     van der Linden, Wim J. & Peter J. Pashley (2000). Item Selection and Ability Estimation. In Wim J. van der Linden & Cees A. W. Glas (eds.). Computerized Adaptive Testing: Theory and Practice. Netherlands: Kluwer Academic Publishers, 1-25.
     van Rijn, P. W., T. J. H. M. Eggen, B. T. Hemker & P. F. Sanders (2002). Evaluation of Selection Procedures for Computerized Adaptive Testing with Polytomous Items. Applied Psychological Measurement 26(4): 393-411.
     Veerkamp, Wim J. J. & Martijn P. F. Berger (1997). Some New Item Selection Criteria for Adaptive Testing. Journal of Educational and Behavioral Statistics 22: 203-226.
     Veldkamp, Bernard P. (2003). Item Selection in Polytomous CAT. In Haruo Yanai, Akinori Okada, Kazuo Shigemasu, Yutaka Kano & Jacqueline J. Meulman (eds.). New Developments in Psychometrics. Tokyo, Japan: Springer-Verlag, 207-214.
    Wainer, Howard (1989). The Future of Item Analysis. Journal of Educational Measurement 26: 191-208.
    Wainer, Howard (1995). Precision and Differential Item Functioning on a Testlet-based Test: The 1991 Law School Admissions Test as an Example. Applied Measurement in Education 8(2): 157-187.
    Wainer, Howard (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
     Wainer, Howard & David Thissen (1987). Estimating Ability with the Wrong Model. Journal of Educational and Behavioral Statistics 12: 339-368.
    Wainer, Howard & David Thissen (1996). How is Reliability Related to the Quality of Test Scores? What is the Effect of Local Dependence on Reliability? Educational Measurement: Issues and Practice 15(1): 22-29.
    Wainer, Howard, Eric T. Bradlow & Zuru Du (2000). Testlet Response Theory: An Analog for the 3PL Model Useful in Testlet-based Adaptive Testing. In Wim J. van der Linden & Cees A. W. Glas (eds.). Computerized Adaptive Testing: Theory and Practice. Netherlands: Kluwer Academic Publishers, 245-269.
    Wainer, Howard & Robert Lukhele (1997). How Reliable are TOEFL Scores? Educational and Psychological Measurement 57: 741-759.
     Wainer, Howard & Robert J. Mislevy (2000). Item Response Theory, Item Calibration, and Proficiency Estimation. In Howard Wainer (ed.). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates, 61-100.
     Wall, Dianne & J. Charles Alderson (1993). Examining Washback: The Sri Lankan Impact Study. Language Testing 10(1): 41-69.
    Warm, Thomas A. (1989). Weighted Likelihood Estimation of Ability in Item Response Theory. Psychometrika 54: 427-450.
     Wang, Jia-Hwa (2009). Using Real-data Simulations to Compare Computer Adaptive Testing and Static Short-form Administrations of an Upper Extremity Item Bank. Ph.D. Dissertation. Gainesville: University of Florida.
     Wang, Shudong & Tianyou Wang (2001). Precision of Warm's Weighted Likelihood Estimates for a Polytomous Model in Computerized Adaptive Testing. Applied Psychological Measurement 25: 317-331.
     Wang, Shudong & Tianyou Wang (2002). Relative Precision of Ability Estimation in Polytomous CAT: A Comparison under the Generalized Partial Credit Model and Graded Response Model. Presented at the annual meeting of the American Educational Research Association, New Orleans.
     Wang, Tianyou & Michael J. Kolen (2001). Evaluating Comparability in Computerized Adaptive Testing: Issues, Criteria and an Example. Journal of Educational Measurement 38(1): 19-49.
     Wang, Tianyou & Walter P. Vispoel (1998). Properties of Ability Estimation Methods in Computerized Adaptive Testing. Journal of Educational Measurement 35: 109-135.
     Ware Jr., John E., Barbara Gandek, Samuel J. Sinclair & Jakob B. Bjorner (2005). Item Response Theory and Computerized Adaptive Testing: Implications for Outcomes Measurement in Rehabilitation. Rehabilitation Psychology 50(1): 71-78.
     Ware Jr., John E., Jakob B. Bjorner & Mark Kosinski (2000). Practical Implications of Item Response Theory and Computerized Adaptive Testing. Medical Care 38: 73-82.
     Ware Jr., John E., Mark Kosinski, Jakob B. Bjorner, Martha S. Bayliss, Alice Batenhorst, Carl G. H. Dahlof, Stewart Tepper & Andrew Dowson (2003). Applications of Computerized Adaptive Testing (CAT) to the Assessment of Headache Impact. Quality of Life Research 12: 935-952.
     Way, Walter D. (1998). Protecting the Integrity of Computerized Testing Item Pools. Educational Measurement: Issues and Practice 17: 17-27.
     Weiss, David J. (1982). Improving Measurement Quality and Efficiency with Adaptive Testing. Applied Psychological Measurement 6: 473-492.
     Weiss, David J. & G. Gage Kingsbury (1984). Application of Computerized Adaptive Testing to Educational Problems. Journal of Educational Measurement 21(4): 361-375.
     Weiss, David J. & Robert D. Gibbons (2007). CAT with the Bifactor Model. Presented at the 2007 GMAC Conference on Computerized Adaptive Testing, Minneapolis, MN.
     Wilson, Mark & Raymond J. Adams (1995). Rasch Models for Item Bundles. Psychometrika 60: 181-198.
     Wise, Steven L. & Barbara S. Plake (1989). Research on the Effects of Administering Tests via Computers. Educational Measurement: Issues and Practice 8(3): 5-10.
     Wolfe, Edward W. & Jonathan R. Manalo (2005). An Investigation of the Impact of Composition Medium on the Quality of TOEFL Writing Scores. TOEFL Research Reports 72: 1-58.
     Wolfe, Edward W., Sandra Bolton, Brian Feltovich & Donna M. Niday (1996). The Influence of Student Experience with Word Processors on the Quality of Essays Written for a Direct Writing Assessment. Assessing Writing 3(2): 123-147.
    Woods, Carol M. (2008). Likelihood-ratio DIF Testing: Effects of Nonnormality. Applied Psychological Measurement 32(7): 511-526.
    Wright, Benjamin D. & Susan R. Bell (1984). Item Banks: What, Why, How. Journal of Educational Measurement 21(4): 331-345.
    Yang, Ji Seung, Mark Hansen & Li Cai (2012). Characterizing Sources of Uncertainty in Item Response Theory Scale Scores. Educational and Psychological Measurement 72(2): 264-290.
     Yang, Tae-Kyong (2005). Measurement of Korean EFL College Students' Foreign Language Classroom Speaking Anxiety: Evidence of Psychometric Properties and Accuracy of a Computerized Adaptive Test (CAT) with Dichotomously Scored Items Using a CAT Simulation. Ph.D. Dissertation. Austin: The University of Texas at Austin.
    Yen, Wendy M. (1984). Effects of Local Item Dependence on the Fit and Equating Performance of the Three-parameter Logistic Model. Applied Psychological Measurement 8: 125-145.
     Yen, Wendy M. (1993). Scaling Performance Assessments: Strategies for Managing Local Item Dependence. Journal of Educational Measurement 30(3): 187-213.
     Young, Richard, Mark D. Shermis, Sheila R. Brutten & Kyle Perkins (1996). From Conventional to Computer-adaptive Testing of ESL Reading Comprehension. System 24(1): 23-40.
     Zeng, Lingjia (1996). An IRT Scale Transformation Method for Parameters Calibrated from Multiple Samples of Subjects. (ACT Research Report Series 96-2). Iowa City, IA: ACT Inc.
    Zhang, Bo (2010). Assessing the Accuracy and Consistency of Language Proficiency Classification under Competing Measurement Models. Language Testing 27(1): 119-140.
     Zhang, Jinming & William Stout (1999). The Theoretical DETECT Index of Dimensionality and its Application to Approximate Simple Structure. Psychometrika 64(2): 213-249.
     Zheng, Ying & Liying Cheng (2008). Test Review: College English Test (CET) in China. Language Testing 25(3): 408-417.
     Zumbo, Bruno D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning: Logistic Regression Modelling as a Unitary Framework for Binary and Likert-type Item Scores. Ottawa, Ontario, Canada: Directorate of Human Resources Research and Evaluation, National Defense Headquarters.
     Zumbo, Bruno D. (2007). Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going. Language Assessment Quarterly 4(2): 223-233.
     Zumbo, Bruno D. & Peter D. Macmillan (1999). An Overview and Some Observations on the Psychometric Models Used in Computer-adaptive Language Testing. In Micheline Chalhoub-Deville (ed.). Issues in Computer-adaptive Testing of Reading Comprehension. Cambridge: Cambridge University Press, 216-228.
     Zwick, Rebecca (2000). The Assessment of Differential Item Functioning in Computer Adaptive Tests. In Wim J. van der Linden & Cees A. W. Glas (eds.). Computerized Adaptive Testing: Theory and Practice. Netherlands: Kluwer Academic Publishers, 221-244.
