Cheating analysis: Comparing every data pair of students' test answers:

michaelaltose · Dec 14, 2018

Hello,

THE QUESTION:
My ultimate goal is to look at how a set of students answered all of the questions on a multiple-choice exam, and compare each student with each other student to see if there are any pairs of students who are more similar to each other than all the other pairs - suggesting that copying (or "collaboration") might have occurred between those two students.

I am open to your suggestions, but my idea is to create a "heat map" showing how similar each pair is.

THE DETAILS:
This is a sample set of data, where the students are named S1 thru S10, and there are 10 questions (Q1 thru Q10) in this data set. Each question could be answered A,B,C, or D.

. Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
S1 A B D C A B C D A A
S2 B B A C A A D D A A
S3 A B D C A B D D A A
S4 B A A C B B C D B A
S5 A B B D C D C D A D
S6 C B D C B C D D C D
S7 A D D C A B D A A D
S8 A B C A A B C D A A
S9 D B D C D A C D D C
S10 A C D B A B C D A C

So far, I can compare S1 to all the other students for just Q1 using the formula:
{=IF(B2=$B$2:$B$11,1,0)}

which will give me the output:

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 1
S2 0
S3 1
S4 0
S5 1
S6 0
S7 1
S8 1
S9 0
S10 1

But automatically extending this table to compare S2 to the class in the second column, then comparing S3 to the class in the third column, etc, requires a lot of manual input. If there were a way to do this easily, I would end up with a table that looks like this:

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 1 0 1 0 1 0 1 1 0 1
S2 0 1 0 1 0 0 0 0 0 0
S3 1 0 1 0 1 0 1 1 0 1
S4 0 1 0 1 0 0 0 0 0 0
S5 1 0 1 0 1 0 1 1 0 1
S6 0 0 0 0 0 1 0 0 0 0
S7 1 0 1 0 1 0 1 1 0 1
S8 1 0 1 0 1 0 1 1 0 1
S9 0 0 0 0 0 0 0 0 1 0
S10 1 0 1 0 1 0 1 1 0 1

Repeating this process again for questions Q2 thru Q10 would yield a total of 10 tables like the table above. And adding all 10 tables together would yield a table like this (I just made up the values):

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 10 6 1 5 5 5 2 3 2 3
S2 6 10 4 7 4 6 4 10 1 7
S3 1 4 10 7 6 3 1 4 8 3
S4 5 7 7 10 7 3 8 3 7 7
S5 5 4 6 7 10 6 7 6 4 2
S6 5 6 3 3 6 10 4 8 5 6
S7 2 4 1 8 7 4 10 9 2 4
S8 3 10 4 3 6 8 9 10 1 7
S9 2 1 8 7 4 5 2 1 10 2
S10 3 7 3 7 2 6 4 7 2 10

Obviously each student is a perfect match with him/herself (and the top right of the table is a mirror image of the bottom left). But otherwise, high matches might suggest that cheating occurred. I could also use the conditional formatting function to create a color heat map.

And of course, I have more than 10 students (usually about 70) and more than 10 questions (usually 20-40 per exam).

I don't have any VBA experience (yet!), so if this can be done with formulas or existing Excel tools, that would be easiest for me, but I'm willing to take any suggestions you can offer!

Thank you for reading this far and helping me ponder this question and improve my Excel skills!

GraH - Guido · Dec 14, 2018

Could you please upload a sample workbook with fake results for say 5-10 students, Michael? As per forum rules.
Also let us know which Excel version you are using. Many possibilities I guess: formulae, vba, Power Query (which I would use and prefer).

vletm · Dec 14, 2018

michaelaltose
Upload a Sample file ...
Questions:
If two or more students answer same way, eg 100% correct, how You would handle those?
Could You do own random order exam for every student?
Could those exams be also one way to learn something? Gotta find correct answer!

michaelaltose · Dec 14, 2018

Thanks for your comments!
I have attached a file with some sample data.
I am using Excel 2016

Side note to vletm:
I agree that my statistical methods are dubious, since you may want a correlation of wrong answers more than a correlation of correct answers, which could just indicate that they simply knew the material. But even that's not quite accurate. Wrong answers may yield more information, but only if we assume that the majority of class is getting the right answer. On the other hand it if it is a difficult question, then two people getting the right answer might be more statistically valuable. So I am open to other thoughts about way to analyze this data. But in the meantime my approach is an interesting start to just visualizing the data, and a good chance to expand my Excel capabilities anyway.

Peter Bartholomew · Dec 15, 2018

Would this get you started? The names studentA and studentB are relative references to row and column indices for the output table.

paperA: = INDEX( test, studentA, 0 )
paperB: = INDEX( test, studentB, 0 )
matchingAnswers: = N( paperA = paperB )
The worksheet formula that generates the table is then
= SUM( matchingAnswers )

madhusudhana · Dec 15, 2018

paperA: = INDEX( test, studentA, 0 ) how it is work in that formula

vletm · Dec 15, 2018

michaelaltose
hmm ... sorry, so far, for me, not Excel case

I've offered something same before somewhere ... but ...
Seems, that You try to find how do students has learnt (or actually answered).

If most of students has answered same way
then that should be the correct answer ... okay ... hmm?
That could be true and work too!

(eg 70% has answered 'A' and 30% has answered 'B')

But, if sometimes that would be 'wrong'...

( if 'the most correct answer' is 'B' or ... OUCH! even 'C')

If someone has taught it 'wrong' or something misunderstood.
Of course, those cases would/should/have to notice/find and
then try to find out how to teach 'better' way and
how to 'offer/supply' correct answers to students.
That would be the challenge for both sides (for teachers and students).
... the heat map in the real world.

GraH - Guido · Dec 15, 2018

Using Power Query

1. Load table , unpivot
2. Merge table with itself (cross table)
3. match lists
4. pivot table

(some steps needs some tweaks to enable variable number of students and/or questions.)

New data? => simply refresh

Code:

let
  Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
  #"Changed Type" = Table.TransformColumnTypes(Source,{{".", type text}, {"Q1", type text}, {"Q2", type text}, {"Q3", type text}, {"Q4", type text}, {"Q5", type text}, {"Q6", type text}, {"Q7", type text}, {"Q8", type text}, {"Q9", type text}, {"Q10", type text}}),
  #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"."}, "Attribute", "Value"),
  #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{"Attribute", "Quenstions"}, {"Value", "Answers"}, {".", "Student"}}),
  #"Merged Columns" = Table.CombineColumns(#"Renamed Columns",{"Quenstions", "Answers"},Combiner.CombineTextByDelimiter("-", QuoteStyle.None),"Q-A"),
  #"Grouped Rows" = Table.Group(#"Merged Columns", {"Student"}, {{"Table", each _, type table}}),
  #"Added Custom" = Table.AddColumn(#"Grouped Rows", "AsList", each Table.Column([Table],"Q-A")),
  #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 1, 0)
in
  #"Added Index"

Code:

let
  Source = Table.NestedJoin(StudentAnswers,{"Index"},StudentAnswers,{"Index"},"StudentAnswers",JoinKind.LeftOuter),
  #"Expanded StudentAnswers" = Table.ExpandTableColumn(Source, "StudentAnswers", {"Student", "AsList"}, {"Student.1", "AsList.1"}),
  #"Added Custom" = Table.AddColumn(#"Expanded StudentAnswers", "%Match", each List.Count(List.Intersect({[AsList],[AsList.1]}))/List.Count([AsList])),
  #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"%Match", Percentage.Type}}),
  #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Table", "AsList", "Index", "AsList.1"}),
  #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Student.1]), "Student.1", "%Match")
in
  #"Pivoted Column"

Peter Bartholomew · Dec 15, 2018

@madhusudhana

paperA: = INDEX( test, studentA, 0 )

'paperA' is a defined name that refers to the formula that follows.
'studentA' is an index that identifies a specific student from the list of students.
'test' is the range containing the complete set of results for all students.
The formula
= INDEX( test, studentA, 0 )
therefore returns the complete set of results for 'studentA'.
Note:
= INDEX( test, studentA, 5 )
would be their answer to question 5 but 0 or blank gives the entire row.

From there, equating arrays does a pairwise comparison the result of which is aggregated using SUM.

I hope this helps.

michaelaltose · Dec 16, 2018

Peter Bartholomew said:
@madhusudhana

'paperA' is a defined name that refers to the formula that follows.
'studentA' is an index that identifies a specific student from the list of students.
'test' is the range containing the complete set of results for all students.
The formula
= INDEX( test, studentA, 0 )
therefore returns the complete set of results for 'studentA'.
Note:
= INDEX( test, studentA, 5 )
would be their answer to question 5 but 0 or blank gives the entire row.

From there, equating arrays does a pairwise comparison the result of which is aggregated using SUM.

I hope this helps.

Thank you Peter, this definitely accomplishes what I am trying to do!
I have applied it to my actual data set, which has 39 questions and 76 students. With some tweaking, it definitely works! And it is a striking way to visualize the data.

A few minor questions:
1. I notice that the data set has to have the students down the left side, and the questions going across. My data set happens to be transposed, with question numbers down the left side and student names across the top. Obviously I can rotate the data set, and then it works. This must be intrinsic to how the INDEX function works? Can you help me understand how this works?

2. In your file, how do you make the cells show S1-10 and Q1-10 when their actual values are 1-10?

3. In general, how can I keep the cells showing each students' name but allow the function to work (which requires the cell to have a number)?

Thank you again for your elegant answer.

michaelaltose · Dec 16, 2018

GraH - Guido said:

Using Power Query

1. Load table , unpivot
2. Merge table with itself (cross table)
3. match lists
4. pivot table

(some steps needs some tweaks to enable variable number of students and/or questions.)

New data? => simply refresh

Code:

let
  Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
  #"Changed Type" = Table.TransformColumnTypes(Source,{{".", type text}, {"Q1", type text}, {"Q2", type text}, {"Q3", type text}, {"Q4", type text}, {"Q5", type text}, {"Q6", type text}, {"Q7", type text}, {"Q8", type text}, {"Q9", type text}, {"Q10", type text}}),
  #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"."}, "Attribute", "Value"),
  #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{"Attribute", "Quenstions"}, {"Value", "Answers"}, {".", "Student"}}),
  #"Merged Columns" = Table.CombineColumns(#"Renamed Columns",{"Quenstions", "Answers"},Combiner.CombineTextByDelimiter("-", QuoteStyle.None),"Q-A"),
  #"Grouped Rows" = Table.Group(#"Merged Columns", {"Student"}, {{"Table", each _, type table}}),
  #"Added Custom" = Table.AddColumn(#"Grouped Rows", "AsList", each Table.Column([Table],"Q-A")),
  #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 1, 0)
in
  #"Added Index"

Code:

let
  Source = Table.NestedJoin(StudentAnswers,{"Index"},StudentAnswers,{"Index"},"StudentAnswers",JoinKind.LeftOuter),
  #"Expanded StudentAnswers" = Table.ExpandTableColumn(Source, "StudentAnswers", {"Student", "AsList"}, {"Student.1", "AsList.1"}),
  #"Added Custom" = Table.AddColumn(#"Expanded StudentAnswers", "%Match", each List.Count(List.Intersect({[AsList],[AsList.1]}))/List.Count([AsList])),
  #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"%Match", Percentage.Type}}),
  #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Table", "AsList", "Index", "AsList.1"}),
  #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Student.1]), "Student.1", "%Match")
in
  #"Pivoted Column"

Thanks for taking the time to reply. The output you created looks exactly like what I am trying to achieve. But I don't have enough experience with Pivot Tables and Power Query to be able to implement your suggestion. If you are willing to walk me through the process in a little more detail, I would be interested in giving it a try. If not, I will work with Peter Bartholomew's suggestion.

Best,
Michael

AliGW · Dec 16, 2018

Would it not just be simpler to sit the students further apart when completing their tests?

Peter Bartholomew · Dec 16, 2018

1. I notice that the data set has to have the students down the left side, and the questions going across.

INDEX(Range, rowNum, columnNum) returns a single cell reference from the 2D range [ a more standard math notation would be Range(rowNum, columnNum) but that is not how spreadsheeters do it! ]. Omitting or using zero as a parameter is interpreted by Excel as 'for all values of the parameter'. Thus
= INDEX( test, studentA, 0 )
returns a row of test results for studentA, each cell being the answer given to a particular question.
If your data is transposed, then you will need to adjust the formula, using Name Manager, to read
= INDEX( test, 0, studentA )
That will return a column from the 'test' range.

I have uploaded an example with the input transposed. Unfortunately, I had made some unrecorded changes since the previous upload. The indices, presented as row and column headings, are no longer relative references. The necessary array calculation is done using the named formula final cell formula may be committed by using Enter (or Ctrl+Enter for multiple cells) or CSE.

Peter Bartholomew · Dec 16, 2018

In your file, how do you make the cells show S1-10 and Q1-10 when their actual values are 1-10?

That is by using custom number formatting to display a letter before the number. It is possible to make even more refined changes, for example, the number format
0;-0;"absent";@
will display zeros as "absent". In Excel, what you see is not necessarily what you have got!

3. In general, how can I keep the cells showing each students' name but allow the function to work (which requires the cell to have a number)?

In this instance (working with INDEX which does not treat array parameters in the same manner as most functions), the index row and column is an essential part of the solution; they cannot be replaced by in-memory index arrays. They can be hidden, reformatted or moved away from the result range to make room for student names.

I hope this helps.

Peter Bartholomew · Dec 16, 2018

On another front, to make sense of your map, I suspect you need a calculation of the results you might expect were there no collusion. Summing in one direction you know how many times each answer was selected for a given question. Summing in the other direction, you have how many correct answers were provided by each student. The trick would be to form an outer product of these results which would tell you the probability of two students giving the same answer to each question and then over multiple questions.

The squares of the differences between the observed and expected values might give information of significance.

GraH - Guido · Dec 16, 2018

michaelaltose said:
Thanks for taking the time to reply. The output you created looks exactly like what I am trying to achieve. But I don't have enough experience with Pivot Tables and Power Query to be able to implement your suggestion. If you are willing to walk me through the process in a little more detail, I would be interested in giving it a try. If not, I will work with Peter Bartholomew's suggestion.

Best,
Michael

Don't be afraid of Power Query. This can simply be done with some mouse clicks.
I've made and documented such a solution in the attachment.

michaelaltose · Dec 16, 2018

Thank you again for taking the time to give me your explanations. I will try to go through them - I've already learned quite a bit.

Cheating analysis: Comparing every data pair of students' test answers:

michaelaltose

New Member

GraH - Guido

Well-Known Member

vletm

Excel Ninja

michaelaltose

New Member

Attachments

Peter Bartholomew

Well-Known Member

Attachments

madhusudhana

New Member

vletm

Excel Ninja

GraH - Guido

Well-Known Member

Attachments

Peter Bartholomew

Well-Known Member

michaelaltose

New Member

michaelaltose

New Member

AliGW

Well-Known Member

Peter Bartholomew

Well-Known Member

Attachments

Peter Bartholomew

Well-Known Member

Peter Bartholomew

Well-Known Member

GraH - Guido

Well-Known Member

Attachments

michaelaltose

New Member