Grand Rounds: Giving “Practice” Math Tests? Read me first. (Part 2)

Part 1 is here.

The most common reason for giving a practice math test is to identify students’ weaknesses. Hopefully the first post showed why it’s so critical to determine which weakness you’re talking about: familiarity with format, time, etc. If you’re worried about the math, there are particular ways to approach the practice test.

First, ignore how the student did on the assessment overall. It sounds counter-intuitive, but there is a rational reason for it, I promise. It’s more important how your students did on particular items than how they did overall. There are a few reasons for this:

NYS Assessments are based on a criterion-referenced model. Typically, when you give a student 25 questions, you mark the correct responses, determine a fraction of correct over total and come up with a score. Generally, we talk about these scores in percentages. Due to the complexity of the NYS assessments and the fact that they focus on performance in relation to a standard or criterion, scores are NOT reported this way. In fact, the number of raw points needed to demonstrate mastery shifts from year to year depending on the standard setting process.

It’s not the real deal. Regardless of the conditions we create, students know it’s not the real deal. Their performance may be inflated or deflated for that very reason and may not reflect their true performance.

NYS Test Design procedures. NYS follows a particular test design model that requires the test to include items of varying difficulty. I’m sure you’ve noticed looking through the test that some questions “feel” easier than others. This isn’t a coincidence. Items are strategically chosen for the assessment to reflect a range of difficulty based on how students performed on them in field testing. It doesn’t make sense to include 25 questions on Book 1 that were missed by most students during field testing. So, the test designers include items with a variety of difficulty – a few hard, a few easy and most middle of the road. This concept of item difficulty is called “p-value” – most simply put, the percent of students who responded correctly to a question. In shorthand, we say items with high p-values are easy, while items with low p-values are hard for the particular group of students under discussion. So - two districts side by side may have different p-values on the same item. We need a neutral standard or benchmark to act as judge and jury around item difficulty. That's where the state data come in.
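To make the p-value idea concrete, here is a minimal sketch in Python. The function name `p_value`, the two districts, and every response in them are made up for illustration; they are not NYS data.

```python
def p_value(choices, correct):
    """Share of students who picked the correct option on one item."""
    return sum(1 for choice in choices if choice == correct) / len(choices)


# Hypothetical responses from ten students in each of two districts
# to the SAME multiple-choice item, where "C" is the correct answer.
district_a = ["C", "A", "C", "C", "A", "C", "C", "C", "B", "C"]
district_b = ["A", "A", "C", "A", "D", "A", "C", "A", "A", "B"]

p_a = p_value(district_a, "C")  # 7 of 10 correct
p_b = p_value(district_b, "C")  # 2 of 10 correct

print(f"District A p-value: {p_a:.2f}")
print(f"District B p-value: {p_b:.2f}")
```

The same item looks fairly easy in one group and hard in the other, which is why a p-value only means something once you know which group of students it came from.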

A great deal of data about NYS tests are made public every year – including p-values. These data can tell us which questions are easy and which are hard. It’s not a secret and requires only a smidge of background to use correctly. P-values are provided at a couple of levels. The one that is most important for our purposes here is Low Level 3. In this example, let’s talk about fifth grade. My mental model around scale scores and p-values is to picture a giant swimming pool filled with every fifth grader in the state of New York who took the state assessment last year. Floating above each student’s head is his or her scale score. Students from the Bronx to Buffalo, from Long Island to Lake Placid. Students with and without disabilities. Levels 1, 2, 3 and 4.

I can look at how ALL the students did on items, but included within the mix are students who really struggled and students who did really well. (We assume most questions were hard for students at Level 1 while most were easy for students at Level 4.) So, I as the data lifeguard blow my whistle and call out every child who scored at Levels 1 and 2. Same for the Level 4’s. Left in the pool are my Level 3’s – every child who met the standard. Because I want the data to be as clean and precise as possible, I’m going to boot out every child who scored above the minimum standard – which in fifth grade in 2008 was 650. Left in the pool I have a few thousand students – all of whom met exactly the minimum standard, AKA scale score 650. For each question these students took, I can look at how many got it right and compare (or benchmark) my students to their performance. The graph below shows you what that looks like:

Out of ALL of the students who scored 650, only 18% of the students got question 7 correct. In other words, that was a hard question. My gut isn't telling me that. My students aren't telling me that. Students from across NYS are telling me that. Take a look and see how your students did on it. Odds are, they didn't do very well. It's not because you didn't teach it or they just weren't listening. It could be because the wording tripped them up - just like 82% of all students who scored a 650. The question is below:

Students are likely to pick A because it practically screams "PICK ME!" at them. Your students may know fractions inside out and sideways. Picking A and not C is an issue of testing sophistication, not mathematics. When reviewing similar problems with students, as much as possible, give them "PICK ME!" choices so they can learn what they look like and how to avoid their siren song.

At the other end of the difficulty continuum are easy questions. The Low Level 3's did pretty well on question 3. If you discover that your students didn't do well on questions like 3 (any item with a p-value higher than 80%), then your warning bells should start warming up.
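For anyone who keeps their data in a spreadsheet or script, the lifeguard routine and the warning-bell check can be sketched roughly like this in Python. Everything here is an assumption for illustration: the record layout, the tiny made-up data set, the answer key for question 3, and the 0.20 gap I use to decide that a class "didn't do well." The only figures taken from this post are the 650 cut score, the 80% threshold for an easy item, and C as the correct answer to question 7.

```python
CUT_SCORE = 650  # minimum Level 3 scale score, Grade 5, 2008 (from the post)


def p_values(records, answer_key):
    """Fraction of the given students who answered each item correctly."""
    return {
        item: sum(1 for r in records if r["answers"][item] == correct) / len(records)
        for item, correct in answer_key.items()
    }


def low_level_3(records, cut_score=CUT_SCORE):
    """Blow the whistle: keep only students sitting exactly at the cut score."""
    return [r for r in records if r["scale_score"] == cut_score]


def easy_items_missed(class_p, benchmark_p, easy=0.80, gap=0.20):
    """Items easy for the benchmark group where the class trails by more than `gap`.

    `gap` is an assumed cutoff, not an official rule.
    """
    return sorted(
        item
        for item, p in benchmark_p.items()
        if p > easy and p - class_p[item] > gap
    )


# Hypothetical answer key: C for item 7 comes from the post, B for item 3 is made up.
answer_key = {3: "B", 7: "C"}

# Hypothetical statewide records (the real pool has thousands of students).
state = [
    {"scale_score": 650, "answers": {3: "B", 7: "A"}},
    {"scale_score": 650, "answers": {3: "B", 7: "A"}},
    {"scale_score": 650, "answers": {3: "B", 7: "A"}},
    {"scale_score": 650, "answers": {3: "B", 7: "A"}},
    {"scale_score": 650, "answers": {3: "B", 7: "C"}},
    {"scale_score": 612, "answers": {3: "A", 7: "A"}},  # below the standard
    {"scale_score": 701, "answers": {3: "B", 7: "C"}},  # above the minimum
]

# Hypothetical practice-test results for one class (no scale scores yet).
classroom = [
    {"scale_score": None, "answers": {3: "A", 7: "A"}},
    {"scale_score": None, "answers": {3: "B", 7: "A"}},
    {"scale_score": None, "answers": {3: "D", 7: "C"}},
    {"scale_score": None, "answers": {3: "B", 7: "A"}},
]

benchmark_p = p_values(low_level_3(state), answer_key)
class_p = p_values(classroom, answer_key)
flagged = easy_items_missed(class_p, benchmark_p)

print("Low Level 3 p-values:", benchmark_p)
print("Class p-values:", class_p)
print("Easy items the class missed:", flagged)
```

In this made-up data, question 7 is hard for everyone, so it isn't flagged. Question 3 is easy for the Low Level 3 group but only half the class got it, so it shows up as worth a closer look.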

However, before assuming it's a strength or weakness, look for other evidence that the students understand the concept. Formative assessment can really come in handy here. You can pose a similar question and ask students to respond on their way out the door. This time though, ask:

Anne has completed 87% of the race. What fraction represents that portion of the race she has NOT finished?

If students get the math, they should pick A. If they pick C, it's probably a testing issue. They slid past the NOT. Anyone who picks B or D may have a problem with fractions in general. How did they do on question 15 which taps a similar understanding? (I use Tinkerplots to answer these questions. It's one of my favorite data toys.) The students will form themselves into like-needed groups, depending on what the other instructional evidence shows.
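If you collect the exit-ticket answers in a list, sorting students into those like-needed groups can be sketched in a few lines of Python. The student labels and responses are invented, and the group descriptions are my shorthand for the reasoning above; a group is a starting hypothesis to check against other evidence, not a diagnosis.

```python
from collections import defaultdict

# What each choice on the exit ticket may suggest, per the reasoning above.
FOLLOW_UP = {
    "A": "gets the math",
    "C": "possible testing issue (slid past the NOT)",
    "B": "possible problem with fractions",
    "D": "possible problem with fractions",
}


def group_by_need(exit_ticket):
    """Group student names by what their answer choice may indicate."""
    groups = defaultdict(list)
    for student, choice in exit_ticket.items():
        groups[FOLLOW_UP.get(choice, "no usable response")].append(student)
    return dict(groups)


# Hypothetical exit-ticket responses.
exit_ticket = {
    "Student 1": "A",
    "Student 2": "C",
    "Student 3": "B",
    "Student 4": "A",
    "Student 5": "D",
    "Student 6": "",
}

groups = group_by_need(exit_ticket)
for need, students in groups.items():
    print(f"{need}: {', '.join(students)}")
```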

So - if you're going to give the test to identify weaknesses:

Consider how your students do on easy questions (high Low Level 3 p-values) versus hard (low Low Level 3 p-values) questions.
Be aware of which wrong answers students give, as that's often more interesting than what they got right.
Consult other evidence (formative and summative) before confirming the students have a mathematical weakness.

I'd love to hear if any of these ideas contradicts what you've heard in schools. Feel free to drop me a line or leave a comment if you have any questions!

1 comment:

walsheimer said...: I have been reading your bloggings on math test prep. I taught 4th grade for a long time and I don't think we ever talked about the p values of any question. In fact we were lucky to even get test scores by the end of June. We looked at the overall rating and lamented the ones who should have passed or celebrated our successes.

I have been spending time working with Sandy Zander from BOCES in trying to impact the caliber of learning that occurs in the elementary in order to hopefully see improvements at the middle school and/or high school end. For years we were told that we were doing a good job teaching math because our scores are good. Were we really good or did we just do a good job teaching them how to answer test questions? I have been a firm believer in always teaching children to be thinkers first. If they can think critically, they can do anything. I wonder if students really have math skills or if they are good guessers. Given the fact that so many struggle in middle and high school, I guess that we have not given them a solid enough foundation early on.

I like your thinking around using old tests for one purpose at a time. If we are using them to teach test format, focus on that. It really opens up opportunities to rethink and reshape instruction.

9:50 PM, February 22, 2009