What to Think about the MET Project Results
What can you do with $45 million and three years? Well, if you're the Bill & Melinda Gates Foundation, you can confirm, empirically, what educators have always known implicitly: great teaching matters, it can be measured, and it improves student learning.[1]
That was one of the many findings released last week in the final report from the MET (Measures of Effective Teaching) Project. MET has generated buzz in education and popular media alike, so I won't provide a full synopsis here. For a basic summary, check out one of the news rundowns; for more thoughtful commentary, turn to posts from several education bloggers. Instead, I want to call attention to two big takeaways from the MET Project.
What teacher evaluations measure is just as important as how they measure it.
Much has been made of the finding that classroom observations are the worst predictor of student learning, compared to state test scores and student surveys. Some have questioned whether observations are worth the significant time and personnel costs involved in doing them well. Tim Daly of TNTP even suggested that MET shows "the way that most teachers have been evaluated forever is completely unreliable."
It's easy to jump to that conclusion: MET used proven, high-quality observation tools, and observers were trained and certified on their knowledge of them. This isn't the case with many of the classroom observations used across the country. Still, observations are a critical component of teacher evaluations, particularly for teachers in untested grades and subjects. And using observations typically receives greater support from teachers compared to test scores. Finally, MET's research found that although classroom observations didn't improve the predictive power of the evaluation measure, they did improve its reliability, or stability, from year to year.
Test scores also don't have the same diagnostic power as classroom observations: as one observer put it, "test scores can reveal when kids are not learning; they can't reveal why." Observations can provide teachers with valuable, timely, and clear feedback on their practice. Given their complexity and the timing of state testing, value-added measures are far less teacher-friendly, not to mention limited in scope. Surely, great teaching involves more than improving student scores on multiple-choice tests in two subjects.
To this end, it's laudable that MET's researchers also used higher-order tests (the SAT 9 Open-Ended Reading Assessment and the Balanced Assessment in Mathematics) to measure student learning. These assessments are more similar to the Common Core-aligned assessments many states will offer in 2014-15. Presumably, states should want teacher evaluations that function well not only with today's tests, but also with those of the future.
Still, the tests MET used only consider English Language Arts and math skills. If the ultimate goal of evaluations is to measure whether teachers create learning environments where students achieve a broader set of outcomes (say, the knowledge, skills, and attributes it takes to be college- and career-ready), then there is still a long way to go in developing these systems. In the meantime, many states will be simultaneously implementing new teacher evaluations and the Common Core assessments. But the best evaluation systems today do a far better job identifying teachers who improve student learning as measured by state test scores than teachers who improve college and career readiness. MET's findings suggest that states should carefully consider whether their evaluation systems are measuring the teacher attributes needed to meet the Common Core's objectives.
How teacher evaluations are used is just as important as what they measure.
Part of the demand for research like the MET Project comes from the push to use teacher evaluation systems to make human resources decisions. Hiring, retention, placement, compensation, and tenure can all be affected. Some of the push can be attributed directly to the Obama administration: developing and using teacher evaluation systems like the ones in the MET study for HR decisions was a major component of its signature education initiatives.
But there is still uncertainty surrounding teacher evaluation systems; the MET Project doesn't provide a definitive roadmap or specific policies for states and districts looking to measure effective teaching. Many of its findings are ambiguous (with the exception that value-added measures must account for students' prior test scores). The MET report is inconclusive when it comes to:
- whether student demographics should be included as a control in value-added models;
- precisely how to weight each component within a composite effectiveness measure: value-added data, student-perception surveys, and classroom observations;[2]
- whether measures like the Content Knowledge for Teaching (CKT) tests or subject-based classroom observation tools could be useful additions to composite measures of teacher quality; and
- who should observe teachers, how long these observations should last, and how many observations should occur each year.[3]
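To make the weighting question above concrete, here is a minimal sketch of how a composite effectiveness score combines the three components. The function name, the 0-to-1 normalization, and the specific weights are illustrative assumptions for this post, not MET's prescription; footnote [2] notes only that giving state test results somewhere between 33 and 50 percent of the weight performed reasonably well.

```python
def composite_score(value_added, student_survey, observation,
                    weights=(0.33, 0.33, 0.34)):
    """Weighted average of three teacher-effectiveness components.

    Each component is assumed to be pre-normalized to a 0-1 scale;
    the default near-equal weights are a hypothetical choice.
    """
    w_va, w_survey, w_obs = weights
    return w_va * value_added + w_survey * student_survey + w_obs * observation

# Hypothetical teacher with components: value-added 0.6,
# student surveys 0.8, classroom observations 0.7.
equal = composite_score(0.6, 0.8, 0.7)
# Same teacher under a test-heavy weighting (50 percent on value-added).
test_heavy = composite_score(0.6, 0.8, 0.7, weights=(0.5, 0.25, 0.25))
```

The point of the sketch is that the weighting choice, not just the component scores, drives the final rating: shifting weight toward value-added lowers this particular teacher's composite, which is exactly why the report's silence on precise weights matters for high-stakes use.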
The teacher quality measures MET suggests are "better on virtually every dimension than the measures in use now." But does that mean similar teacher evaluation systems should be used as the deciding factor for whether a teacher is fired? Or promoted? Or receives a pay increase?
Thorny questions, indeed. Yes, the new measures of effective teaching are promising compared to most teacher evaluation systems, where nearly every teacher is rated "satisfactory." But given MET's lingering questions and the inevitable imprecision in these measures of effectiveness, wouldn't it make more sense to continue developing and refining teacher evaluation systems without rushing to use them for high-stakes decisions? Especially since most schools lack the capacity and resources to implement evaluations of the rigor and quality that the MET study used? States and districts should consider using the results from teacher evaluations in a more diagnostic manner: why not make these measures of effective teaching the first step in the process of providing professional development, determining who receives pay increases or tenure, and making decisions about hiring or firing, rather than the final step?
[1] In full disclosure, the work of our Education Policy Program is supported, in part, with funding from the Gates Foundation.
[2] However, the "data suggest that assigning 50 to 33 percent of the weight to state test results maintains considerable predictive power, increases reliability, and potentially avoids the unintended negative consequences from assigning too-heavy weights to a single measure."
[3] MET's results do show that more lessons and observers increase the reliability of observations, but there are "a range of scenarios for achieving reliable classroom observations."