Improving CS assessment with careful (data) analysis

Now we’ll consider the three question attributes mentioned earlier: difficulty, discrimination, and distractors.

Difficulty for a question (in CTT) is calculated as the proportion of respondents who answered that question correctly.
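As a minimal sketch, this is just a column mean over a scored response matrix; the `responses` array below is made-up 0/1 data (learners × questions), not the SCS1 data.

```python
# Minimal sketch of CTT difficulty: the proportion of learners who answered
# each question correctly. `responses` is illustrative 0/1 data with shape
# (num_learners, num_questions).
import numpy as np

responses = np.array([
    [1, 0, 1, 0],   # learner 1's scored answers to 4 hypothetical questions
    [1, 1, 0, 0],   # learner 2
    [0, 0, 1, 0],   # learner 3
])

difficulty = responses.mean(axis=0)   # proportion correct per question
print(difficulty)                     # [0.67, 0.33, 0.67, 0.0] (rounded)
```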

We show the difficulty for each question below.

A tricky thing about CTT’s version of difficulty is determining whether a question is too hard (or easy) for a target population of learners.

This is essentially impossible to determine because CTT uses the total test score to represent learners’ knowledge levels.

So a shortcoming of CTT is that there is no way to separate learners’ knowledge level and question difficulty.

Question difficulty as measured by the percentage of learners who got each question correct.

The SCS1 is a difficult test with all questions having <50% correctness.

Given that this is a multiple-choice test with 5 options, questions with < 20% correctness (as indicated by pink boxes) may be problematic.

The next thing we’ll consider is discrimination.

Again, discrimination reflects how well a question distinguishes between learners of different knowledge levels.

As a rule of thumb, greater discrimination is typically desirable and low discrimination is troublesome.

We show the discrimination for each SCS1 question in the figure below.

Question discrimination as measured by point-biserial correlation (relationship between question and overall test performance).

Higher correlation is better, with a point-biserial correlation of <0.2 (blue line) typically being deemed problematic.

Q5 and Q20 have both poor discrimination (blue circles) and high difficulty (pink squares, carried over from the previous figure).
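To make the point-biserial measure concrete, here is a hedged sketch using `scipy.stats.pointbiserialr` on an illustrative 0/1 response matrix; correlating each item with the corrected item-total score (excluding the item itself) is a common refinement, not necessarily the exact procedure behind the figure above.

```python
# Hedged sketch of CTT discrimination: correlate each item's 0/1 score with
# the total score on the remaining items (corrected item-total correlation).
import numpy as np
from scipy.stats import pointbiserialr

def item_discrimination(responses: np.ndarray) -> np.ndarray:
    """responses: 0/1 array of shape (num_learners, num_questions)."""
    total = responses.sum(axis=1)
    discrim = []
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]              # exclude the item itself
        r, _ = pointbiserialr(responses[:, j], rest)
        discrim.append(r)
    return np.array(discrim)

# Items with discrimination below ~0.2 would be flagged for review.
```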

There are two questions (Q5, Q20) which have potentially problematic difficulty and discrimination.

Q20 is actually “the tricky question” I showed you at the beginning of this post.

So let’s look at the distractors for the tricky question, as shown in the figure below.

Distractors for “the Tricky Question” (Q20, shown above).

When we aggregate learner responses, we see that options C and D are selected more frequently than the correct response (B).

This suggests that we may need to review this question’s response options.

We see that there are multiple distractors which are chosen more frequently than the correct answer.

This could imply many things, ranging from the question assessing knowledge that learners are missing to the answer key having an error.

More insightful than seeing the distribution of distractors in aggregate would be to compare the distribution for high performers to the distribution for low performers.
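As an illustration (with invented `choice` and `total` data, not the actual SCS1 responses), a simple version of that comparison splits learners at the median total score:

```python
# Illustrative sketch: compare the distractor distribution for higher vs.
# lower scorers on one question. The data below are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "choice": list("BCCDCEBDCD"),                     # option chosen on the question
    "total":  [22, 9, 11, 8, 14, 20, 19, 7, 10, 12],  # overall test score
})

# Aggregate distribution of selected options
print(df["choice"].value_counts(normalize=True))

# Split at the median total score and compare option distributions per group
df["group"] = (df["total"] >= df["total"].median()).map({True: "high", False: "low"})
print(df.groupby("group")["choice"].value_counts(normalize=True))
```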

While CTT looks at learners in aggregate, item response theory enables us to disaggregate learners and compare response patterns of learners with different knowledge levels.

Data Analysis Pt 2: IRT to model question & learner properties

Whereas CTT confounds learner and test properties, IRT models learners to analyze learner properties (e.g. knowledge level) and test properties (e.g. question difficulty) separately.

A fundamental aspect of IRT is that question difficulty and learner knowledge level are both on the same continuum.

We show this in the figure below, where a hypothetical learner (in blue) would likely get questions A and B correct because their knowledge level is greater than the difficulty of those questions.

They would likely get question C incorrect because their knowledge level is lower than the difficulty of question C.
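As a minimal sketch, assuming the one-parameter (Rasch) logistic model rather than whatever model was fit to the SCS1, “likely correct” corresponds to a probability above 0.5, which happens exactly when a learner’s knowledge level exceeds a question’s difficulty; the ability and difficulty values below are hypothetical.

```python
# Sketch of the Rasch (1PL) model: ability and difficulty live on one scale,
# and P(correct) crosses 0.5 exactly where ability equals difficulty.
import math

def p_correct(theta: float, difficulty: float) -> float:
    """P(correct) = 1 / (1 + exp(-(theta - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

theta = 0.5                                   # hypothetical learner ability
for name, b in [("A", -1.0), ("B", 0.0), ("C", 1.5)]:
    print(name, round(p_correct(theta, b), 2))
# A 0.82 and B 0.62 -> likely correct (ability above difficulty)
# C 0.27            -> likely incorrect (ability below difficulty)
```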

Representation of latent variable continuum with 3 questions (A,B,C).

Because this learner’s knowledge level is greater than the difficulties of A and B, we would predict they get those questions correct.

Their level is lower than C’s difficulty, so we predict they get that question wrong.

Zero reflects the knowledge level for the average test-taker, where learners are often assumed to be distributed normally (sorry…).

A “good” question should discriminate (differentiate) between learners of different knowledge levels.

So the figure below has a nice steep “S” curve, indicating high discrimination.

This is often desirable, as a learner with a lower knowledge level (e.g. the outlined orange learner) will likely get the question wrong and a learner with a higher knowledge level (e.g. the solid green learner) will likely get the exercise correct.

Item Characteristic Curve (ICC) for a well-performing question from the SCS1 (Q19).

This question has a reasonable difficulty level (0.68) and high discrimination (steep S curve).

In stark contrast, poor questions tend not to discriminate between learners.

This is visualized in the figure with very flat curves.

This is not ideal because the probability of selecting a correct answer does not change much between learners of different knowledge levels.

ICC for poorly performing questions in SCS1.

The questions have poor discrimination as indicated by the very flat curves.

This is not ideal because the probability of getting a question correct does not change for learners of varying knowledge levels.
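One way to see the steep-vs-flat contrast is to vary the discrimination parameter in a two-parameter logistic (2PL) model; the parameter values below are made up for illustration, not the fitted SCS1 values.

```python
# Hedged sketch of a 2PL ICC: larger discrimination `a` gives the steep "S"
# curve of a good question; `a` near zero gives the flat curve of a poor one.
import numpy as np

def icc_2pl(theta, a, b):
    """P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                 # range of knowledge levels
print(icc_2pl(theta, a=2.0, b=0.0).round(2))  # steep: ~0.0 ... 0.5 ... ~1.0
print(icc_2pl(theta, a=0.2, b=0.0).round(2))  # flat: ~0.35 ... 0.5 ... ~0.65
```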

Modeling learners’ knowledge levels independently of question difficulty also helps us shed light on distractor patterns, as we can analyze the response patterns of wrong answer choices for learners of varying levels of knowledge.

As a rule of thumb, we want learners with low knowledge levels to select distractors (wrong answers).

As knowledge level increases, we want the likelihood of selecting the correct answer to increase and eventually become the most likely option selected.

In the figure below, the good question on the left reflects this pattern.

In stark contrast, the poor question on the right is unusual because learners of all knowledge levels are more likely to select a distractor than the correct answer.

Furthermore, as knowledge level increases, the likelihood of selecting the correct answer (B) decreases.

This is typically a bad thing.
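A rough way to check such patterns empirically, assuming ability estimates are already available (the `theta` column below is invented and stands in for values from an IRT fit), is to bin learners by estimated knowledge level and tabulate how often each option is chosen within each bin.

```python
# Sketch of an empirical option-by-ability table: bin learners on an ability
# estimate and compute the share of each option chosen per bin. Data are invented.
import pandas as pd

df = pd.DataFrame({
    "theta":  [-1.8, -1.2, -0.5, -0.1, 0.2, 0.6, 1.1, 1.7],  # estimated abilities
    "choice": ["A", "A", "C", "B", "C", "B", "B", "B"],       # option chosen on one question
})

df["ability_bin"] = pd.qcut(df["theta"], q=2, labels=["lower", "higher"])
option_rates = (
    df.groupby("ability_bin", observed=True)["choice"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)
print(option_rates)  # for a healthy question, the correct option's share rises with ability
```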

Distractor patterns for a good question (left) and poor question (right).

For the good question (Q19 in SCS1), learners with lower knowledge levels will likely select certain distractors, or wrong answers (A, B).

As learners’ knowledge levels increase, they are more likely to select the correct answer (C).

In contrast, the poor question (Q20) performs poorly because learners of all knowledge levels are more likely to select a distractor than the correct answer.

Furthermore, as knowledge levels increase, the likelihood of selecting the correct answer DECREASES, which is not good.

So data analysis helped us identify potentially problematic questions.

We know the response patterns for potentially problematic questions are odd, but we don’t know why.

And we need to understand why before we improve our assessment!

Follow-up analyses are all about helping us understand why questions have odd response patterns, and then we can decide what to do.

Follow-up: Expert review of problematic items

Follow-up analyses are all about contextualizing and verifying the findings from the data analysis.

We must understand the test design, look at the questions, consider the learners, and be very skeptical.

A great start for follow-up analyses is expert review, or sharing your data analysis with a domain expert (e.g. a computer science educator) and hypothesizing potential explanations for the odd response patterns.

For “the tricky question,” we know from CTT that the question is very difficult and does not discriminate between learners of varying knowledge levels well.

From IRT, we identified that low-performing learners tend to select incorrect option C and high-performers tend to select incorrect option E.

Why is that?

A typical suspect is some misunderstanding as a result of confusing wording in the question.

The figure below shows “the tricky question.”

We see that some of the wording of the question (underlined in gold) may be confusing or unfamiliar to learners.

Or perhaps the wording of a response option (e.g. A or E) confused learners.

Another suspect is a potential misalignment between the knowledge the test assesses and the content covered in class.

Survey data from learners revealed that most learners in our sample took a data programming version of CS1 which did not emphasize function scope.

So perhaps this question assesses knowledge that learners did not learn.

This reveals a key tension: there may be a misalignment between the knowledge learned and what the test measures.

We often design standardized tests, but CS courses are very diverse and teach different topics, which may introduce this misalignment.

Follow-up analyses in the form of expert review.

We found that low-performers and high-performers were likely to select different incorrect options.

This may be because of a misunderstanding in the question prompt (underlined in yellow) or in the answer options (e.g. the wording of option E).

Or this may be because the knowledge this question assessed (function scope) was not covered in the course learners took.

So we were able to rely on domain experts to generate a few hypotheses as to why a question was problematic.

To determine which hypothesis is true, it’s typical to conduct think-alouds or cognitive walkthroughs with learners in the target population.

By understanding learners’ thought processes, we can get more evidence as to why a question is problematic.

And with that evidence, we can decide how to change, iterate, and improve our tests!

Conclusion: More rigorous evaluation of instruments in computing education

We live in a world of reductionism.

CS students often receive letter grades which are intended to reflect their mastery of certain knowledge.

And test scores make up a large part of those grades.

We also live in a world of beautiful diversity.

Initiatives such as CS4All are working for more inclusive learning experiences.

So as we welcome more diversity into computing, we must work hard to ensure our measurements of what students know measure what we intend them to measure and do not bias against certain groups.

Psychometrics provides frameworks and methodologies to evaluate and ensure validity and reliability in how we interpret our measurements, in effect ensuring instruments are considerate of the growing diversity of learners.

So let’s iterate to better!

There is a lot of rich, thorough detail in the paper on how we conducted IRT, including confirmatory factor analysis to verify the questions measured the same latent construct (CS1 knowledge).

We also describe and differentiate between CTT and IRT to identify the merits of IRT.

I’ve also linked the slides from my SIGCSE 2019 talk and supplementary resources.

And of course, please reach out to me if you would like some ideas on how to use evidence to improve your tests and how you interpret your test scores!

paper (PDF)

slides (PDF)

supplementary material (GitHub)
