Colleges have come to realize the need to assess and improve student learning and to report their efforts to students, faculty, administrators, and the public, including policy makers and prospective students and their parents.
The question is how to accomplish this. The roar of yesterday’s Spellings Commission and its vision of accountability is background noise to today’s cacophony of calls for more transparency and campus-based, authentic assessment of student learning. Some of the advocates for more authentic measures, such as Carol Schneider, president of the Association of American Colleges and Universities, have suggested using electronic portfolios -- collections of a student’s work products, such as term papers, research papers or descriptions, and the student’s written thoughts (“reflections”) about these work products and curricular experiences that are bundled together on an electronic platform. The presumed merits of portfolios, such as their supposed ability to drill down into the local curriculum, have been extolled elsewhere.
Portfolios are simply not up to the task of providing the necessary data for making a sound assessment of student learning. They do not and cannot yield the trustworthy information that is needed for this purpose. However, there are approaches that can provide some of the information that is required.
Portfolio Assessment’s Inherent Limitations
There are three major reasons portfolios are not appropriate for higher education assessment programs: They are (a) not standardized, (b) not feasible for large-scale assessment due to administration and scoring problems, and (c) potentially biased. Indeed, course grades, aggregated across an academic major or program, provide more reliable and better evidence of student learning than do portfolios. Here’s why.
Lack of Standardization
Standardization refers to assessments in which (a) all students take the same or conceptually and statistically parallel measures; (b) all students take the measures under the same administrative conditions (such as on-site proctors and time limits); (c) the same evaluation methods, graders, and scoring criteria are applied consistently to all of the students’ work; and (d) the score assigned to a student most likely reflects the quality of the work done by that student and that student alone (without assistance from others).
Portfolios do not and cannot meet the requirements for standardization because, by their very nature, they are tailored to each student. AAC&U’s attempts at “metarubrics” are not even close to being an adequate solution to address this problem. Portfolio advocates simply ignore the evidence that valid comparisons in the level of learning achieved can only be made when students take the same or statistically “equated” measures (such as different versions of the SAT).
Without standardization, faculty and administrators at individual campuses cannot answer the fundamental questions: Is the amount of student learning and level of achievement attained by the students at our campus good enough? Could they do better, and if so, how much better? For example, are the critical writing skills of our students on a par with those of students at comparable institutions, and if they fall below, what might be done to improve their performance?
Campuses using portfolio assessment cannot answer these types of questions because the amount of learning that has occurred can only be determined by comparison to some type of standardized benchmark. For example, to assess whether seniors write better than freshmen, both groups need to respond to the same essay questions within the same time limits and have their answers mixed together before being graded by readers who do not know whether an answer was written by a freshman or senior.
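The freshman-versus-senior comparison described above can be sketched in a few lines of Python. This is only an illustration of the blind-grading idea, not a description of any actual scoring system: the function names (`blind_pool`, `mean_by_cohort`), the anonymous ID scheme, and the sample scores are all hypothetical.

```python
import random
from statistics import mean


def blind_pool(responses, seed=0):
    """Mix responses from all cohorts and hide who wrote them.

    responses: list of (cohort, text) pairs, e.g. ("freshman", "...").
    Returns (blinded, key): blinded is a list of (anon_id, text) pairs
    for the readers; key maps anon_id -> cohort and is withheld from
    the readers until scoring is finished.
    """
    rng = random.Random(seed)          # fixed seed so the shuffle is repeatable
    order = list(range(len(responses)))
    rng.shuffle(order)                 # interleave the cohorts at random
    blinded, key = [], {}
    for anon_id, idx in enumerate(order):
        cohort, text = responses[idx]
        blinded.append((anon_id, text))
        key[anon_id] = cohort
    return blinded, key


def mean_by_cohort(scores, key):
    """Re-link scores to cohorts after grading and average them.

    scores: dict mapping anon_id -> score assigned by the readers.
    """
    grouped = {}
    for anon_id, score in scores.items():
        grouped.setdefault(key[anon_id], []).append(score)
    return {cohort: mean(values) for cohort, values in grouped.items()}
```

The point of the design is that readers only ever see `blinded`; the `key` is consulted after every answer has a score, so knowledge of class year cannot influence the grades.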
The same standardization is needed to assess whether the students at one school (or in one program within a school) are more proficient (or learned more) than students at similar schools. In short, learning has to be measured by some type of standardized, controlled, and unbiased comparison. Unlike weight and height, learning has no absolute scale that is interpretable in and of itself.
Descriptions of scoring criteria are not sufficient to ensure comparable grading standards even when benchmark answers are used to train raters. In order to answer the “good enough” question, performance comparisons -- “benchmarking” -- are necessary. But benchmarking cannot occur without standardization, and benchmarking is necessary to interpret differences in scores between programs within a campus and between peer campuses. Without standardization, differences might be due to variation in portfolio content, rater background and training, assistance provided to students for building their portfolios, bias (see below), and a host of other factors.
Valid interpretations of differences in scores between students, programs, and schools can only occur when the assessment is standardized. Only then can institutions monitor their students’ progress toward improving their skills and abilities relative to (a) their school’s academic standards, (b) the progress made by their classmates, and (c) the improvements in performance made by students in other programs and similar institutions. Ironically then, by eliminating the standardization that is necessary for benchmarking learning, the portfolio method prevents making the kinds of comparisons that are essential for assessing improvement.
We recognize that there are roles for portfolios. For example, they might be used to provide information about the range of tasks and activities students engage in and their views about the importance of different aspects of their education and campus experiences. This information may have heuristic value in providing possible insights into areas for improvement.
Not Feasible for Large-Scale Learning Assessment
By their unstandardized nature, portfolios (even electronic ones) are not practically feasible on a large scale. A moment’s reflection reveals why this is true. Because of their length, a single grader will typically need an hour or so to grade a single portfolio. To assure adequate score reliability, each portfolio needs at least two independent graders (and major differences between them should be resolved by a third). In addition, due to the potential interdisciplinary nature of a portfolio’s contents, raters with different areas of expertise might be needed, which could lead to even more scoring time and feasibility problems.
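A rough sense of the scoring burden follows from the figures above: about an hour per reading, two independent readings per portfolio, and a third reading when the first two disagree. The sketch below simply does that arithmetic. The function name, the cohort size of 1,000, and the 25 percent disagreement rate are illustrative assumptions, not figures from any study.

```python
def grading_hours(n_portfolios, hours_per_read=1.0,
                  reads_per_portfolio=2, adjudication_rate=0.0):
    """Estimate total grader-hours for a portfolio scoring program.

    adjudication_rate is the assumed share of portfolios on which the
    independent readers disagree enough to require one extra reading.
    """
    base = n_portfolios * reads_per_portfolio * hours_per_read
    extra = n_portfolios * adjudication_rate * hours_per_read
    return base + extra


# Hypothetical campus: 1,000 portfolios, a quarter of which need a third reader.
print(grading_hours(1000, adjudication_rate=0.25))  # prints 2250.0
```

Under these assumed inputs the total is 2,250 grader-hours, before accounting for rater training or the need for readers from several disciplines.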
For portfolios to be truly authentic, they have to relate to each student’s academic major or combination of majors. Hence, different teams of graders (and most likely different scoring rubrics) are needed for students with different majors. These and related concerns preclude combining results across students with different and perhaps unique combinations of majors.
Computer technology cannot solve portfolio feasibility and reliability problems. For example, computers with natural language processing software have been shown to provide a cost-effective and accurate way to grade large numbers of student responses to essay questions and other open-ended tasks. However, these machine grading methods require standardized prompts. They require that thousands of students respond to the same prompt and thus they are not applicable to portfolios.
Simply put, the time, content expertise, and other challenges -- and hence feasibility -- of grading portfolios substantially exceed those of grading constructed responses (e.g., essays) that are administered and scored under standardized conditions. Incidentally, the solution to this problem does not lie in having local faculty grade portfolios, even when justified as a professor’s instructional and professional development responsibilities. The evidence is clear: in large-scale programs, portfolio assessment overwhelms faculty and is a source of faculty resistance and low morale. Portfolio assessment, then, is simply not a feasible or practical tool for large-scale assessment programs.
Potentially Biased
A portfolio may include a photograph, video clip, or other information about a student’s identity. The student’s gender, race, ethnicity, and other characteristics also may be known by those evaluating the portfolio. This lack of anonymity may bias results.
Faculty are understandably skeptical of standardized tests. In an article last year in Academe, Gerald Graff and Cathy Birkenstein pointed out that many faculty erroneously equate standardized exams with the highly questionable multiple-choice tests that characterize the implementation of the No Child Left Behind Act. Professors and administrators rightly celebrate the diversity of American higher education and therefore do not see how the same standardized test could be used across this range of institutions. However, colleges may share some important goals. For instance, virtually all faculty and college mission statements agree that critical thinking and writing skills are essential for all college graduates to possess. Graff and Birkenstein put it well:
A marketing instructor at a community college, a biblical studies instructor at a church-affiliated college, and a feminist literature instructor at an Ivy League research university would presumably differ radically in their disciplinary expertise, their intellectual outlooks, and the students they teach, but it would be surprising if there were not a great deal of common ground in what they regard as acceptable college-level work. They (these instructors) would probably agree -- or should agree -- that college-educated students, regardless of their background or major, should be critical thinkers, meaning that, at a minimum, they should be able to read a college-level text, offer a pertinent summary of its central claim, and make a relevant response, whether by agreeing with it, complicating its claims, or offering a critique.
If standardization is possible, the question arises as to whether it is possible to standardize “authentic” tasks. David C. McClelland's 1973 paper provided the key to authenticity with standardization. He argued for a “criterion-sampling” approach to assessment in which students confront “real-world” tasks like those they may face in their further education, work, and private and civil lives. As McClelland said, if you want to know if a person can drive a car, observe and evaluate his performance on a sample of tasks like starting the car, pulling out into traffic, turning left, parking and the like. Moreover, you can evaluate performance on these tasks in a standardized way. Put succinctly, he provided a strong argument for gaining authenticity through the assessment of criterion performances.
Performance assessment, then, represents an authentic, standardized testing paradigm in which students craft original responses to real-life (criterion-sampled) tasks. For example, most state bar examinations now include tasks in which candidates are given a realistic case situation and asked to use a library to perform a typical task, such as prepare deposition questions or a points-and-authorities brief, draft instructions for an investigator, or write a letter to opposing counsel. Candidates are given a “library” of documents and told to base their answers on the information in these documents. The library might include the opposing counsel’s brief, excerpts of relevant and irrelevant case law, letters, investigator reports, and other documents, much like those they would review in practice. Performance tasks also have been used in credentialing teachers.
We applied this testing paradigm in developing the Collegiate Learning Assessment (CLA). This testing tool taps critical thinking, analytic reasoning, problem-solving and written communication skills of college students with standardized analytic writing and performance tasks that have been described elsewhere. Over 450 colleges with 200,000 students have participated in the CLA. Faculty and students recognize its authenticity and report that its tasks tap the kinds of thinking and reasoning they expect a college education will help students perform.
We are concerned about the suggestion to replace standardized higher education measures with electronic portfolios as a means for assessing the effects of campus programs and as a response to the demand for external accountability. Because of the inherent problems with portfolios, they do not and cannot provide trustworthy, unbiased, or cost-effective information about student learning. This is just not in their DNA.
Gathering valid data about student performance levels and performance improvement requires making comparisons relative to fixed benchmarks, and that can only be done when the assessments are standardized. Consequently, we urge the higher education community to embrace authentic, standardized performance-assessment approaches so as to gather valid data that can be used to improve teaching and learning as well as meet its obligations to external audiences to account for its actions and outcomes regarding student learning.
Richard J. Shavelson is a professor of education at Stanford University. Stephen Klein and Roger Benjamin are director of research and development and president/CEO, respectively, at the Council for Aid to Education, which owns the Collegiate Learning Assessment.