• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Cheating analysis: Comparing every data pair of students' test answers:

michaelaltose

New Member
Hello,

THE QUESTION:
My ultimate goal is to look at how a set of students answered all of the questions on a multiple-choice exam, and compare each student with each other student to see if there are any pairs of students who are more similar to each other than all the other pairs - suggesting that copying (or "collaboration") might have occurred between those two students.

I am open to your suggestions, but my idea is to create a "heat map" showing how similar each pair is.

THE DETAILS:
This is a sample set of data, where the students are named S1 thru S10, and there are 10 questions (Q1 thru Q10) in this data set. Each question could be answered A,B,C, or D.

. Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
S1 A B D C A B C D A A
S2 B B A C A A D D A A
S3 A B D C A B D D A A
S4 B A A C B B C D B A
S5 A B B D C D C D A D
S6 C B D C B C D D C D
S7 A D D C A B D A A D
S8 A B C A A B C D A A
S9 D B D C D A C D D C
S10 A C D B A B C D A C


So far, I can compare S1 to all the other students for just Q1 using the formula:
{=IF(B2=$B$2:$B$11,1,0)}

which will give me the output:

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 1
S2 0
S3 1
S4 0
S5 1
S6 0
S7 1
S8 1
S9 0
S10 1

But automatically extending this table to compare S2 to the class in the second column, then comparing S3 to the class in the third column, etc, requires a lot of manual input. If there were a way to do this easily, I would end up with a table that looks like this:

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 1 0 1 0 1 0 1 1 0 1
S2 0 1 0 1 0 0 0 0 0 0
S3 1 0 1 0 1 0 1 1 0 1
S4 0 1 0 1 0 0 0 0 0 0
S5 1 0 1 0 1 0 1 1 0 1
S6 0 0 0 0 0 1 0 0 0 0
S7 1 0 1 0 1 0 1 1 0 1
S8 1 0 1 0 1 0 1 1 0 1
S9 0 0 0 0 0 0 0 0 1 0
S10 1 0 1 0 1 0 1 1 0 1

Repeating this process again for questions Q2 thru Q10 would yield a total of 10 tables like the table above. And adding all 10 tables together would yield a table like this (I just made up the values):

. S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S1 10 6 1 5 5 5 2 3 2 3
S2 6 10 4 7 4 6 4 10 1 7
S3 1 4 10 7 6 3 1 4 8 3
S4 5 7 7 10 7 3 8 3 7 7
S5 5 4 6 7 10 6 7 6 4 2
S6 5 6 3 3 6 10 4 8 5 6
S7 2 4 1 8 7 4 10 9 2 4
S8 3 10 4 3 6 8 9 10 1 7
S9 2 1 8 7 4 5 2 1 10 2
S10 3 7 3 7 2 6 4 7 2 10

Obviously each student is a perfect match with him/herself (and the top right of the table is a mirror image of the bottom left). But otherwise, high matches might suggest that cheating occurred. I could also use the conditional formatting function to create a color heat map.

And of course, I have more than 10 students (usually about 70) and more than 10 questions (usually 20-40 per exam).

I don't have any VBA experience (yet!), so if this can be done with formulas or existing Excel tools, that would be easiest for me, but I'm willing to take any suggestions you can offer!

Thank you for reading this far and helping me ponder this question and improve my Excel skills!
 
Could you please upload a sample workbook with fake results for say 5-10 students, Michael? As per forum rules.
Also let us know which Excel version you are using. Many possibilities I guess: formulae, vba, Power Query (which I would use and prefer).
 
michaelaltose
Upload a Sample file ...
Questions:
If two or more students answer same way, eg 100% correct, how You would handle those?
Could You do own random order exam for every student?
Could those exams be also one way to learn something? Gotta find correct answer!
 
Thanks for your comments!
I have attached a file with some sample data.
I am using Excel 2016


Side note to vletm:
I agree that my statistical methods are dubious, since you may want a correlation of wrong answers more than a correlation of correct answers, which could just indicate that they simply knew the material. But even that's not quite accurate. Wrong answers may yield more information, but only if we assume that the majority of class is getting the right answer. On the other hand it if it is a difficult question, then two people getting the right answer might be more statistically valuable. So I am open to other thoughts about way to analyze this data. But in the meantime my approach is an interesting start to just visualizing the data, and a good chance to expand my Excel capabilities anyway.
 

Attachments

  • Book1.xlsx
    16.5 KB · Views: 7
Would this get you started? The names studentA and studentB are relative references to row and column indices for the output table.

paperA: = INDEX( test, studentA, 0 )
paperB: = INDEX( test, studentB, 0 )
matchingAnswers: = N( paperA = paperB )

The worksheet formula that generates the table is then
= SUM( matchingAnswers )
 

Attachments

  • Student test answers (PB).xlsx
    15.1 KB · Views: 13
michaelaltose
hmm ... sorry, so far, for me, not Excel case ;)

I've offered something same before somewhere ... but ...
Seems, that You try to find how do students has learnt (or actually answered).

If most of students has answered same way
then that should be the correct answer ... okay ... hmm?
That could be true and work too!
(eg 70% has answered 'A' and 30% has answered 'B')
But, if sometimes that would be 'wrong'...
( if 'the most correct answer' is 'B' or ... OUCH! even 'C')
If someone has taught it 'wrong' or something misunderstood.
Of course, those cases would/should/have to notice/find and
then try to find out how to teach 'better' way and
how to 'offer/supply' correct answers to students.
That would be the challenge for both sides (for teachers and students).
... the heat map in the real world.
 
Using Power Query

1. Load table , unpivot
2. Merge table with itself (cross table)
3. match lists
4. pivot table

(some steps needs some tweaks to enable variable number of students and/or questions.)

New data? => simply refresh :)

Code:
let
  Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
  #"Changed Type" = Table.TransformColumnTypes(Source,{{".", type text}, {"Q1", type text}, {"Q2", type text}, {"Q3", type text}, {"Q4", type text}, {"Q5", type text}, {"Q6", type text}, {"Q7", type text}, {"Q8", type text}, {"Q9", type text}, {"Q10", type text}}),
  #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"."}, "Attribute", "Value"),
  #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{"Attribute", "Quenstions"}, {"Value", "Answers"}, {".", "Student"}}),
  #"Merged Columns" = Table.CombineColumns(#"Renamed Columns",{"Quenstions", "Answers"},Combiner.CombineTextByDelimiter("-", QuoteStyle.None),"Q-A"),
  #"Grouped Rows" = Table.Group(#"Merged Columns", {"Student"}, {{"Table", each _, type table}}),
  #"Added Custom" = Table.AddColumn(#"Grouped Rows", "AsList", each Table.Column([Table],"Q-A")),
  #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 1, 0)
in
  #"Added Index"

Code:
let
  Source = Table.NestedJoin(StudentAnswers,{"Index"},StudentAnswers,{"Index"},"StudentAnswers",JoinKind.LeftOuter),
  #"Expanded StudentAnswers" = Table.ExpandTableColumn(Source, "StudentAnswers", {"Student", "AsList"}, {"Student.1", "AsList.1"}),
  #"Added Custom" = Table.AddColumn(#"Expanded StudentAnswers", "%Match", each List.Count(List.Intersect({[AsList],[AsList.1]}))/List.Count([AsList])),
  #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"%Match", Percentage.Type}}),
  #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Table", "AsList", "Index", "AsList.1"}),
  #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Student.1]), "Student.1", "%Match")
in
  #"Pivoted Column"
 

Attachments

  • StudentAnswers.xlsx
    25 KB · Views: 5
@madhusudhana
paperA: = INDEX( test, studentA, 0 )

'paperA' is a defined name that refers to the formula that follows.
'studentA' is an index that identifies a specific student from the list of students.
'test' is the range containing the complete set of results for all students.
The formula
= INDEX( test, studentA, 0 )
therefore returns the complete set of results for 'studentA'.
Note:
= INDEX( test, studentA, 5 )
would be their answer to question 5 but 0 or blank gives the entire row.

From there, equating arrays does a pairwise comparison the result of which is aggregated using SUM.

I hope this helps.
 
@madhusudhana


'paperA' is a defined name that refers to the formula that follows.
'studentA' is an index that identifies a specific student from the list of students.
'test' is the range containing the complete set of results for all students.
The formula
= INDEX( test, studentA, 0 )
therefore returns the complete set of results for 'studentA'.
Note:
= INDEX( test, studentA, 5 )
would be their answer to question 5 but 0 or blank gives the entire row.

From there, equating arrays does a pairwise comparison the result of which is aggregated using SUM.

I hope this helps.

Thank you Peter, this definitely accomplishes what I am trying to do!
I have applied it to my actual data set, which has 39 questions and 76 students. With some tweaking, it definitely works! And it is a striking way to visualize the data. Screenshot 2018-12-15 22.44.38.png

A few minor questions:
1. I notice that the data set has to have the students down the left side, and the questions going across. My data set happens to be transposed, with question numbers down the left side and student names across the top. Obviously I can rotate the data set, and then it works. This must be intrinsic to how the INDEX function works? Can you help me understand how this works?

2. In your file, how do you make the cells show S1-10 and Q1-10 when their actual values are 1-10?

3. In general, how can I keep the cells showing each students' name but allow the function to work (which requires the cell to have a number)?

Thank you again for your elegant answer.
 
Using Power Query

1. Load table , unpivot
2. Merge table with itself (cross table)
3. match lists
4. pivot table

(some steps needs some tweaks to enable variable number of students and/or questions.)

New data? => simply refresh :)

Code:
let
  Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
  #"Changed Type" = Table.TransformColumnTypes(Source,{{".", type text}, {"Q1", type text}, {"Q2", type text}, {"Q3", type text}, {"Q4", type text}, {"Q5", type text}, {"Q6", type text}, {"Q7", type text}, {"Q8", type text}, {"Q9", type text}, {"Q10", type text}}),
  #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"."}, "Attribute", "Value"),
  #"Renamed Columns" = Table.RenameColumns(#"Unpivoted Other Columns",{{"Attribute", "Quenstions"}, {"Value", "Answers"}, {".", "Student"}}),
  #"Merged Columns" = Table.CombineColumns(#"Renamed Columns",{"Quenstions", "Answers"},Combiner.CombineTextByDelimiter("-", QuoteStyle.None),"Q-A"),
  #"Grouped Rows" = Table.Group(#"Merged Columns", {"Student"}, {{"Table", each _, type table}}),
  #"Added Custom" = Table.AddColumn(#"Grouped Rows", "AsList", each Table.Column([Table],"Q-A")),
  #"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 1, 0)
in
  #"Added Index"

Code:
let
  Source = Table.NestedJoin(StudentAnswers,{"Index"},StudentAnswers,{"Index"},"StudentAnswers",JoinKind.LeftOuter),
  #"Expanded StudentAnswers" = Table.ExpandTableColumn(Source, "StudentAnswers", {"Student", "AsList"}, {"Student.1", "AsList.1"}),
  #"Added Custom" = Table.AddColumn(#"Expanded StudentAnswers", "%Match", each List.Count(List.Intersect({[AsList],[AsList.1]}))/List.Count([AsList])),
  #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"%Match", Percentage.Type}}),
  #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Table", "AsList", "Index", "AsList.1"}),
  #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Student.1]), "Student.1", "%Match")
in
  #"Pivoted Column"

Thanks for taking the time to reply. The output you created looks exactly like what I am trying to achieve. But I don't have enough experience with Pivot Tables and Power Query to be able to implement your suggestion. If you are willing to walk me through the process in a little more detail, I would be interested in giving it a try. If not, I will work with Peter Bartholomew's suggestion.

Best,
Michael
 
Would it not just be simpler to sit the students further apart when completing their tests? ;)
 
1. I notice that the data set has to have the students down the left side, and the questions going across.
INDEX(Range, rowNum, columnNum) returns a single cell reference from the 2D range [ a more standard math notation would be Range(rowNum, columnNum) but that is not how spreadsheeters do it! ]. Omitting or using zero as a parameter is interpreted by Excel as 'for all values of the parameter'. Thus
= INDEX( test, studentA, 0 )
returns a row of test results for studentA, each cell being the answer given to a particular question.
If your data is transposed, then you will need to adjust the formula, using Name Manager, to read
= INDEX( test, 0, studentA )
That will return a column from the 'test' range.

I have uploaded an example with the input transposed. Unfortunately, I had made some unrecorded changes since the previous upload. The indices, presented as row and column headings, are no longer relative references. The necessary array calculation is done using the named formula final cell formula may be committed by using Enter (or Ctrl+Enter for multiple cells) or CSE.
 

Attachments

  • Student test answers - transposed (PB).xlsx
    15.1 KB · Views: 8
In your file, how do you make the cells show S1-10 and Q1-10 when their actual values are 1-10?
That is by using custom number formatting to display a letter before the number. It is possible to make even more refined changes, for example, the number format
0;-0;"absent";@
will display zeros as "absent". In Excel, what you see is not necessarily what you have got!

3. In general, how can I keep the cells showing each students' name but allow the function to work (which requires the cell to have a number)?
In this instance (working with INDEX which does not treat array parameters in the same manner as most functions), the index row and column is an essential part of the solution; they cannot be replaced by in-memory index arrays. They can be hidden, reformatted or moved away from the result range to make room for student names.

I hope this helps.
 
On another front, to make sense of your map, I suspect you need a calculation of the results you might expect were there no collusion. Summing in one direction you know how many times each answer was selected for a given question. Summing in the other direction, you have how many correct answers were provided by each student. The trick would be to form an outer product of these results which would tell you the probability of two students giving the same answer to each question and then over multiple questions.

The squares of the differences between the observed and expected values might give information of significance.
 
Thanks for taking the time to reply. The output you created looks exactly like what I am trying to achieve. But I don't have enough experience with Pivot Tables and Power Query to be able to implement your suggestion. If you are willing to walk me through the process in a little more detail, I would be interested in giving it a try. If not, I will work with Peter Bartholomew's suggestion.

Best,
Michael
Don't be afraid of Power Query. This can simply be done with some mouse clicks.
I've made and documented such a solution in the attachment.
 

Attachments

  • StudentAnswers.xlsx
    392.3 KB · Views: 10
Thank you again for taking the time to give me your explanations. I will try to go through them - I've already learned quite a bit.
 
Back
Top