Pick an item, any item - the reveal

Yesterday, I wrote a quick run-through of item p-values in which I ignored a bunch of stuff about item analysis in order to focus on the big idea of predicting item difficulty. 20 anonymous people - of unknown age, teaching experience, and background - answered a quick poll I put up about two of the released NYS items. 20 adults (I assume - and based on their responses to the optional question, it's a reasonable assumption) picked which item they thought was harder. And 17 of them - about the size of an elementary school PLC, or 85% - picked Item 1. Reasons for picking Item 1 included:
  • To answer question 1 you need to go off and look up the contents of the 4 referenced paragraphs and then keep them in mind while evaluating the most plausible answer. This would put more strain on the working memory. The content needed to answer question 2 is right there in the question, easily visible.
  • Question 1 requires students to go back into the reading selection; that is not required for Question 2.
  • "Predicts the action" is awkward phrasing, question requires students to flip back into the story to reread. Also, question number 2 is a more familiar format
All reasonable. All made by (again, I assume) well-educated adults using the evidence in front of them to draw a conclusion. And yet, the reason we need student item data...

49% of NYS 4th graders who took the test got this question correct. 
25% of NYS 4th graders who took the test got this question correct. 
Does this mean that the 20 adults don't understand teaching, students, or education? Not by a mile. It's a reminder that when it comes to assessment - especially multiple choice item design - adults read items and see difficulty differently than the test takers do. Think one of the released items is especially hard? Check the p-values using the released charts. See the small blue number in the top left? That's the item's code. Look for that code on the p-value chart. If you're in a school in NY wondering how to make connections to your students, look at your students' p-values in the reports released by the RIC and start to have conversations about the implications. SED has released guidance on how to analyze the data, and educators across the state are writing (including my co-blogger, Theresa) about how to use state assessment data to inform conversations.

State ed testing takes 300 minutes, 1%, 2 weeks - however you choose to present the numbers - of a child's year. It is the LEAST important assessment children will engage in over the course of a year. What's to be gained - or lost - by framing the LEAST important thing students do as a way to advance our agendas? How does it help alleviate students' (and parents') stress when we give the LEAST important thing in the education landscape the most attention?

Disclaimer: And as you read these posts, please know, gentle reader, that I am an advocate of performance-based, portfolio, and authentic assessment. I love roses but have committed to the science of the teaching profession, which means working to ensure we're describing the daisies correctly. So the usual disclaimer - I am not defending NCLB, VAM, or APPR. I'm not even defending the NYS assessments. It's my hunch that we're making it harder to fix the big picture when we neglect to accurately define the parts of the whole.

Pick an item, any item

So we're going to play a little game in this post. But first, let me set the stage.

While lurking on a Twitter exchange about race, education, and schools, I saw a great reminder from Bill Fitzgerald scroll by. In effect [and apologies, Bill, if I've summarized incorrectly], it's worth engaging around important topics even if it's clear the discourse isn't going anywhere, because you never know who might be listening, seeking to learn. To revisit an earlier post, I decided not to worry about the manager of the nursery, and to consider instead the walkers just out for a Sunday stroll who may overhear the discussion about the daisies.

I am going to make one claim here in this post and one claim only: when adults look at multiple choice items, we see them differently than students do. Experience, background knowledge, expertise, confirmation bias, 20 years of living - a wide variety of things influence how we read an item. Any teacher who's seen students ace a "hard" item or tank on an "easy" one will know that it's not until students actually take the items that we get a real sense of the item's difficulty.

Item design is a science - and an art. Objectivity plays a large role. BUT:
One cannot determine which item is more difficult simply by reading the questions. One can recognize the name in the second question more readily than that in the first. But saying that the first question is more difficult than the second, simply because the name in the second question is easily recognized, would be to compute the difficulty of the item using an intrinsic characteristic. This method determines the difficulty of the item in a much more subjective manner than that of a p value.
– Basic Concepts in Item Design
This is why there's field testing - why we should field test classroom tests and why states have to field test items for their large-scale tests. The test designers (teachers or Pearson writers) do their level best, but we need certain statistics (available only after students have actually taken and responded to an item) to reach conclusions about an item's quality. The most common statistic is what's known as a p-value: the percent correct. The higher the p-value, the more students got an item correct and the easier the item was for the group of students who took the test. There are guidelines around p-values, but generally speaking, "p-values should range between 0.30 and 0.90." There's a lot more to unpack around item difficulty, but we'll just leave this here.
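To make the arithmetic concrete, here's a minimal sketch (in Python, with invented response data - not any real NYS results) of how a p-value is calculated: for each item, count the test takers who answered correctly and divide by the total number of test takers.

```python
# Minimal p-value sketch with made-up data: 1 = correct, 0 = incorrect.
# Rows are students, columns are items (these responses are invented for illustration).
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

num_students = len(responses)
num_items = len(responses[0])

for item in range(num_items):
    correct = sum(student[item] for student in responses)
    p_value = correct / num_students  # proportion of students who got the item right
    flag = "" if 0.30 <= p_value <= 0.90 else "  <- outside the usual 0.30-0.90 guideline"
    print(f"Item {item + 1}: p = {p_value:.2f}{flag}")
```

Nothing fancy - the point is simply that the number comes from students' actual responses, not from an adult's read of the item.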

In the absence of these p-values, our observations about the difficulty of an item are just that - observations. Hambleton and Jirka (2006), two psychometricians/unicorns, reviewed the literature around estimating item difficulty and found studies where highly qualified, experienced teachers were inconsistent when it came to accurately estimating how students would do on an item: "No one in any studies we have read has been successful at getting judges to estimate item difficulty well." Pretty compelling evidence that we need to temper our opinions with supporting evidence from students who, you know, actually took the assessments.

So now onto the game. Let's pick an item, any item. How about Item 132040149_1 from the released Grade 4 Assessment items?


Now, in order for this game to work, you have to play along. Click the link above to read the Pecos Bill story and do your best to answer the question. You may look at it and conclude it requires skills "out of the reach for many young children" or that the number of steps needed to answer this question is too many and too complicated for 4th graders. Now consider the question below, also from the fourth grade test:


Which one would you expect to be harder for the students? The top one or the bottom one? What's informing your decision making? What evidence are you using? What percent of students do you think got the top item correct? How about the bottom one?

Hit me up here or on Twitter and share your thinking. I'll follow up with the answer in an upcoming blog, provided I get through my rose-tending to-do list.

The reveal is here.


Chasing Down Pineapple Chasers


Imagine you're strolling with a loved one in a local park one bright Sunday morning. You and your companion pass a cluster of flowers and you overhear another stroller say, "Look at those gross weeds. They should be pulled out, they ruined this entire garden." You look where he's pointing and see the happiest, cheeriest, sunniest, albeit ugliest, bunch of daisies you've ever seen. You look at his face and recognize the speaker as a highly regarded and respected nursery owner. What do you do? What would Emily Post say? What would Freud advise? You look over at your loved one, panic clear on your face. If your loved one is like mine, s/he smiles, squeezes your hand and asks, "is it worth it?"

You decide it's not. No damage done. Who cares that the nursery owner confused a rare variety of daisies with weeds? But then you look over and see a group you recognize from the local gardening club, nodding along. "Bad weeds," you hear one mutter. "Terrible things. They should be yanked." Another pulls out a garbage bag and covers up the bunch of daisies. "No one should have to see these weeds," she says, and you hear the passion and conviction in her voice; it practically vibrates with anger.

The nursery owner is consulting a gardening book and reading aloud the problems he sees with the weeds, and your stomach drops as you recognize he's misreading some of the information. "They're going to strangle the whole garden," you hear him mutter, and you start to twitch, knowing that the daisies actually attract a particular strain of butterflies that help pollinate a different section of the park.

You know because you're a botanist. You spend your professional life studying plants. While your work actually involves roses, you had to study daisies in order to better understand the species you grow. There is the real possibility, you admit, that you're wrong. The longer you stand there, the louder the group gets, the more convinced you are that you must be off-base; daisies aren't THAT necessary and it would be great to use the space for more of your roses.

So, you say nothing. The moment passes. The group is unified by their hate for those damn not-really-weeds. Not much you can say.  So you walk on with your companion, working hard to not give your loved one yet another lecture on the importance of ugly daisies in a well-balanced ecosystem.  On your way out of the park, you hear members of the gardening group telling incoming strollers how the owner of the nursery had, just that morning, published a piece in a national gardening newsletter, "setting the record straight" on those nasty weeds.

What's the role of expertise in conversations like this? Do you, with expertise in flowers, though you way prefer roses to daisies, speak up? Does your obligation to speak up change based on the size of the crowd? Is it changed by knowing the nursery owner isn't fond of you, and has even publicly called you "uninformed"? In truth, last time you spoke up, you had a middle school flashback to being told "your opinion doesn't matter because you're not tall/short/athletic/musical/smart enough/right-handed enough" to comment. Even worse, the last time someone spoke up, it seemed some members of the gardening group became even more insistent and vocal about calling the much maligned daisies "weeds."

Help me out here, gentle readers. If you were the botanist, what would you do? What if you were the nursery owner - would you want the botanist to speak up? Is there a right time and place to speak up? Is it worth it?

Chasing Pineapples – Part 1

In my column for NY ASCD, I considered the role of assessment literacy in education. Here on my blog, I want to poke at the idea a bit more in depth. And since this is my ("our" actually - Theresa was a much more consistent writer than I was when we first started; her reflections and thinking can be found throughout the archive) blog, I'm going to draw a line between assessment literacy and the Common Core Learning Standards.

I have a favorite Common Core Learning Standard. I realize that’s a bit like saying I have a favorite letter in the alphabet, but there you go.

CCSS.ELA-LITERACY.W.11-12.1: Write arguments to support claims in an analysis of substantive topics or texts, using valid reasoning and relevant and sufficient evidence. (NYS added an additional sentence to this standard when the state adopted CCLS: Explore and inquire into areas of interest to formulate an argument.)

Making the promise to NY students that we will do everything in our power to help them develop their ability to understand arguments, logic, evidence, and claims, in my humble opinion, is long overdue. In truth, I’m jealous that my teachers weren't working towards this goal. I learned how to write a really solid 5 paragraph essay in HS English and it wasn't until I was paying for my education that I was introduced to the rules of arguments and logical fallacies. Since I missed the chance during my formative years to explore this approach to discourse and discussion, I try to practice it as much as I can as an adult.

It's my hunch that assessment illiteracy is having a dramatically negative impact on how we talk about public education. More to the point, I suspect that the same quirk that makes us fear Ebola more than texting while driving is what leads us to discuss and debate the state assessments with more energy, passion, and time than the assessments students see on a regular, daily basis. My claim: when viewed as a data-collection tool mandated by politicians with a 10,000-foot perspective, the tests are benign. Their flaws and challenges are amplified when we connect them to other parts of the system, or when we view them through the same lens we view assessments designed by those with a 10-foot perspective on student learning. When we chase the flaws in a test that takes less than 1% of the year, we end up chasing pineapples.

In the tradition of well-supported arguments, I want to focus on patterns more than individuals and on a narrow, specific claim rather than a bigger narrative. (In other words, I'm not defending the tests, NYSED, Pearson, Bill Gates, APPR, or anything else.) The pattern across Twitter and in Facebook groups is a call for NYSED to release the NCLB-mandated tests so that the public (including parents) can judge their appropriateness, quality, length, use of copyright, or whichever variable the person asking for the release wants to investigate. I absolutely support the Spencerport teachers' desire to see the entire test, but a voice in the back of my head keeps asking, "Why? So what? What criteria are you using to determine if the test is any good?" Last year, NYSED released 25% of the items and a few bloggers shared their opinions about the quality of the items, but I haven't been able to find any analysis of the released items against specific criteria. This is not to say such analyses don't exist, just that they escaped my Google searches. This year, NYSED released 50% of the items and the response has been that NYSED should release ALL of the items. Which, I suspect, is what NYSED wants to be able to do, but funding issues are preventing it from happening. I've been watching Twitter, hoping to see critical reviews of the released 50%, but instead there's been lots of opining. Lots of "I" statements, not a lot of evidence-based claims. This, I suspect, is a side effect of assessment illiteracy across the board. We just aren't any good as a field, much less as a collection of citizens, at assessing the quality of measurement tools.

So, what makes an assessment good? What makes it bad? Given the rules of quality test design outlined in the APA Testing Standards, why are we willing to accept the strength of the speaker's opinion as the determining factor of quality? Is the issue of quality in large-scale assessments a matter of opinion? I suspect anyone who has taken any large-scale test (from the driver's test to the SATs) hopes that's not the case. I know that numerous organizations, including the National Council on Measurement in Education, work to establish explicit criteria. The USDOE is instituting a peer review process for state assessments to ensure quality. PARCC is being as transparent as possible, including bringing together 400 educators and teachers to provide feedback. All of these groups use specific criteria to assess the quality of large-scale assessments. Yet in the media - social and traditional - one person's opinion about "bad" or "hard" items is treated as if it's the truth.

So, my confusion remains: if members of groups who do assessment for a living spend years establishing and sharing measures of quality for large-scale tests, what tools will the public use to assess their quality? How can the general public use "cardiac evaluation" (I know it because I feel it - not my phrase, I totally cribbed it from someone else) when the vast majority of classroom teachers receive little or no training during teacher prep in how to assess and evaluate assessments? When it comes to state assessments, is it more about chasing pineapples – making claims about the tests' quality – than actually catching them – supporting claims with evidence?

 And as I often do, I end up asking myself why it matters. If a parent says "I think this item is too hard/developmentally inappropriate/unfair" should that be enough to say that it is? How much of the science of the education profession actually belongs to members of the profession? 

Two roads diverged... and I blundered straight ahead

I am passionately, openly, and sometimes foolishly, in love with authentic assessment and portfolio assessment. I have seen, first as a teacher and now working alongside teachers, the power that relevant and meaningful work holds for students. One of the amazing things I get to do for a living is help schools design performance-based assessments that ask students to do something with what they have learned, not just recall facts and provide a right answer. My job does not depend on the success or failure of state tests. I have no stake in testing itself, beyond that of a taxpayer and an educator privileged to work with teachers and schools. So my passionate belief in the craft of the teaching profession comes from my professional experience in classrooms and schools. I believe, adamantly, in using both the science of learning and the art of instruction to provide a quality public education to all of New York's students.
I wrote the above paragraph last spring in a column on standardized tests for Chalkbeat. (I can't remember if I picked the post title. In hindsight, I think "A Primer on Standardized Tests" would have been better.) My passion for performance-based assessment hasn't lessened. My commitment to supporting teachers as they design meaningful tasks for the 99% of the school year that isn't devoted to state tests hasn't changed. What has changed is the degree of assessment illiteracy I'm willing to tolerate when it comes to talking about assessment design, of both the large-scale, standardized flavor and the day-to-day classroom flavor. My understanding of measurement has deepened and expanded as I work to better understand how we take this messy, amazing thing called learning and capture evidence of it through artifacts in order to make decisions at the student, teacher, school, district, state, and national levels.

I sometimes joke that I feel a bit like a butcher who chimes in on vegetarians' conversations. If I am so passionate about performance-based assessment, why do I care what word a columnist uses to describe student performance on state tests? If I would rather see students designing portfolios that tell the story of their learning than taking yet another multiple choice test, why does it matter so much to me that educators understand the basic rules of large-scale standardized testing? For me, it's about what it means to be a member of a profession. I don't expect parents to get how p-values work, nor do I expect reporters and columnists to have a deep understanding of psychometric jargon. I'm frustrated and disconcerted when columnists and authors who are also members of the education profession ignore the science of their profession in order to prove a point. And I've elected to speak up when I see that happening. I've chosen to speak up when a vegetarian is wandering around my kitchen complaining about how unhealthy cow bacon is.

This is a space Theresa and I created a few years ago to ramble, babble, reflect, and wrestle, and I want to dust it off to have conversations that go beyond 280-character blurts. I've been perseverating on some of the same ideas since Day 1 of our blog. I've also revised some of the claims I'm willing to make. On my "things I want to write about" list are topics such as the NYS annotated items, issues around bias in teacher-designed assessments, challenging some commonly held beliefs around assessment design, and gender-related issues in education.

So - I'm an assessment nerd. Ask me (almost) anything. 

Increase Our Assessment Literacy

There are 11 million licensed drivers in NY State. Unless they took a Driver's Ed course, these motor vehicle operators (5% of whom are between 16 and 19 years old) willingly subjected themselves – sometimes gleefully – to a high-stakes standardized test. Some do it more than once. Again and again, they go back to ensure they pass a standardized test that will have a dramatic impact on their personal freedom and daily lives. Despite the significant impact of this test, there really aren't any public cries about the quality, validity, or reliability of the NYS Driver's Test. In fact, the technical documents to support the design and psychometrics of the permit test don't appear to be publicly available. This lack of interest in the quality of the test may be because it's short (20 questions) or because it's followed up by a road test scored by an assessor trained in the rules and basics of the road who gives the test taker immediate feedback on any mistakes or errors. In either case, we accept the presence of a standardized test as a part of the transition to responsible adulthood.

Americans have an odd relationship with standardized tests. We expect that the service providers we interact with, from cosmetologists to real estate agents, are duly licensed to do their jobs. We require that doctors, lawyers, teachers, and others who are members of a board or a profession meet certification criteria. In almost all of these cases, the license or the certification is only awarded after successfully passing one or more standardized tests. There are likely a variety of reasons why we're comfortable with some standardized tests and not others: the age of the test taker, the degree to which the test taker wants to or seeks to take the test, the degree to which we believe the test measures something important, the test taker's ability to prepare for the test, how the results will be used, or a fear of test-taking. Some of the discomfort may come from the fact that our field suffers from what Popham (2004) calls "assessment illiteracy". He goes so far as to claim, "the vast majority of educators reside in blissful ignorance" when it comes to understanding the design and nature of standardized tests. In 2004, he got 5 million hits from Google when he searched for "educational assessment." In 2014, there are 159 million hits, and the need for assessment literacy has never been higher.

While it is impossible to explain the complexity of large-scale assessment in a single column, I’d like to offer an invitation for readers to invest time in their own assessment literacy. There are several available resources that can provide a NY educator with a better understanding of standardized test design and a deeper understanding of what the NYS standardized tests are and are not.

The best starting point for learning about standardized testing is a document referenced in the APPR documentation. The 1999 Standards for Educational and Psychological Testing was published by the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurement in Education (NCME) and provides the foundation for testing. Each section of the text defines a psychometric concept (e.g., validity, reliability, fairness) and sets out the limits of the concept. While not written to explain how testing works, it is the official source for understanding testing concepts. I find myself going back to Popham's ASCD book, "The Truth about Testing", an overview of standardized testing that is free of the "noise" created by recent policy like APPR and RttT. Finally, membership in NCME costs $70 a year and provides access to numerous field-friendly and scholarly texts on standardized tests. A more practical set of documents for improving New York State-specific assessment literacy is the NY Testing Program Technical Reports. Like the Testing Standards, each section of a technical report explains a psychometric concept and presents the statistics around that concept for the test being reported.

Social media is awash with claims that the NYS Tests are unfair, invalid, or unreliable. These technical reports provide evidence of the veracity, or lack thereof, for those claims. For example, several groups are claiming the tests are too long. A statistic known as “speededness” provides details about how many students left items blank at the end of the test, giving us an actual report of how many students finished or ran out of time. Additionally, NYSED has released items from the 2012-2013 and 2013-2014 assessments that include explicit and annotated alignment between the items’ demands and the CCLS. A thoughtful read of these resources can empower educators who wish to make claims against the misuse of standardized tests.
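The technical reports calculate speededness with their own procedures; purely to make the idea concrete, here's a rough sketch (in Python, with invented responses - hypothetical data, not NYSED's actual method) of one common convention: treat a run of blank items at the very end of a student's answer sheet as "not reached," i.e., evidence the student may have run out of time.

```python
# Rough "not reached" / speededness sketch with made-up data.
# Each student's responses are in test order; None marks a blank (omitted) item.
# Convention assumed here: trailing blanks are "not reached" (ran out of time),
# while blanks followed by answered items are ordinary omits.
students = [
    ["A", "C", "B", "D", "A"],      # finished the test
    ["B", "A", None, "C", "D"],     # skipped one item mid-test, still finished
    ["C", "B", "A", None, None],    # did not reach the last two items
    ["A", None, None, None, None],  # did not reach the last four items
]

def not_reached_count(responses):
    """Count trailing blank items (a crude 'ran out of time' indicator)."""
    count = 0
    for answer in reversed(responses):
        if answer is None:
            count += 1
        else:
            break
    return count

ran_out = sum(1 for s in students if not_reached_count(s) > 0)
print(f"{ran_out} of {len(students)} students did not reach the end of the test "
      f"({ran_out / len(students):.0%}).")
```

In other words, the "too long" claim is checkable: the reports tell us how many students actually ran out of time rather than leaving us to argue from impressions.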

The use of standardized tests as the primary means to ascribe growth and attainment for students is highly problematic and has been documented extensively (Berliner and Glass, 2014). Seeking out information about standardized tests to improve assessment literacy isn't a concession or an endorsement of over-testing or bad policy. It is important for educators to deepen their understanding of these tests by looking at SED annotations and reports, studies from the field of psychometrics, and long-form analysis rather than relying primarily on social media or impressionistic observations.

I named my column in NY ASCD, where this post was originally published, "Pushing at the Boundaries of Assessment" because we wanted to carve out space to poke at our understanding of what it means to capture evidence of student learning. It's challenging, though, to push at boundaries if we don't know where they are. The boundaries of standardized tests are defined by and for our profession. Members of our profession have the obligation to separate myth from truth, hyperbole from fact. It is by learning and understanding the research and thinking behind the tests that we can be truly prepared to lead with knowledge and to answer the difficult questions. It is good to question and push at the boundaries. It is responsible to be well informed, even if it isn't our own particular area of expertise.

Is it worth it?

Is it worth it? One simple question. Set aside for a moment your textbooks with checklists on quality assessment design and your statistical software. Ignore the blog posts with titles like "4 Questions You Must Answer Before You Give Another Test!" or "10 MUST Ask Questions About Assessment." Forget state education tests and mandates or district requirements; look at the assessments within your control and ask one simple question: Is it worth it?

To be worthy, a task has to have value and meaning to the child engaged in the work. When it comes to the boundaries of assessment, we often reflect on how those boundaries impact adults. We’re comfortable talking about how much time teachers spend scoring or wrestling with assessment design from our vantage point. We talk about the value of an assessment’s results for data analysis by adults. We work to make explicit the meaning of an assessment as it relates to a school’s mission or vision. Consider this an invitation to reflect on the worth of the tasks we ask students to do – from their perspective.

Much of what students do in schools every day is found only in school settings. A history professor on Twitter offers a cash bounty for anyone who can find a five-paragraph essay "in the wild". There are sites devoted to the stretches teachers make to create "real world" problems that are anything but. Typically, students submit tasks to their teachers that will be read only by their teachers and never leave the classroom. They're often given checklists of how to complete a project, resulting in a project that looks exactly like their classmates' projects. Asking Is it worth it? about assessment is more important than finding a definitive answer. For those asking the question, the answer lies in the context in which teaching and learning occur. In many of these contexts, the curriculum is overwhelmingly prescriptive and there is little autonomy or choice for students, and sometimes for teachers. Yet taking a beat to consider the worth of the tests, tasks, quizzes, projects, worksheets, and activities we ask students to complete is not only sensible, it's a humbling experience.

Attending to worth is a way to start building the foundation for the kind of system where students spend their days constructing knowledge in a way that is individual, powerful, meaningful, and relevant to them. Identifying opportunities to increase worth is a small move we can make to give students more space to find themselves within the system. Worth can be increased through curriculum moves by asking, wrestling with, and answering essential questions such as "Can we revile a thinker but revere their thoughts?", "Is war inevitable?", and "Can one person change the world?" It can also be increased by asking students to identify an authentic audience for their work and then mailing, sending, or presenting their work to that audience, rather than just handing it in to the teacher. Worth can be increased by offering choice – true choice – around an assessment. Consider asking students, "I'm looking for evidence you've learned about or mastered this skill, standard, or concept. How do you want to show me your learning?" By asking students to attend to their own culture, to create something, to go beyond the task, we can answer the question, "Why do I need to do this?" before it's even asked.

Originally published on http://www.newyorkstateascd.org/