When I was in school, I would have gotten value out of a program that gave an accurate grade for a piece of writing. The official NAPLAN creative writing response exemplars are graded out of 47. Using LLMs, I've quickly achieved grading within 2 points for 87% of responses, and within 3 points for 96% of responses. See the data.
If this sounds interesting to you, please have a look at the website:
Even without perfect accuracy, LLM-generated feedback is relevant and useful, including both general and detailed comments. I'm particularly happy with the inline annotation system, which imitates a teacher's feedback. In the future, I'd like to expand the system to guide a student towards improving their writing in an organic way.
I expect the current accuracy to improve with refinement. It has already improved greatly with newer model releases and prompting improvements like chain-of-reasoning. In Table 1, you can see that some criteria are marked more accurately than others, like 6-Cohesion, which the LLM marks almost perfectly. The less accurate grades are often due to strange rules in the NAPLAN rubric that involve counting. For example, a student with five simple but correctly punctuated sentences could score 5/5 for 9-Punctuation, while a student with twenty complex sentences, fewer than 80% of them correct, can only get 3/5 at most. One could argue that the LLM grades in these cases are better than the official ones. However, the goal of the program is to simulate an official mark, so that's the result it is improving towards. LLMs are not particularly good at counting right now, but they can be optimised for it with the assistance of custom code.
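As a rough illustration of the kind of custom code that could help with the counting rules, here is a minimal sketch that caps an LLM-proposed 9-Punctuation score once sentence counts are known. The function name, parameters, and defaults are hypothetical, and the only rule it encodes is the one described above (fewer than 80% correct sentences means 3/5 at most). It is not the full NAPLAN rubric, and it assumes the sentence counts come from some other step.

```python
def cap_punctuation_score(llm_score: int, correct: int, total: int,
                          threshold: float = 0.8, capped_max: int = 3) -> int:
    """Cap an LLM-proposed punctuation score using counted sentences.

    Assumption: `correct` and `total` are sentence counts produced elsewhere
    (for example by checking each sentence separately). The 0.8 threshold and
    the cap of 3 come from the single example in the text, nothing more.
    """
    if total <= 0:
        raise ValueError("total must be a positive sentence count")
    if correct / total < threshold:
        # Below the threshold, the score cannot exceed the cap.
        return min(llm_score, capped_max)
    return llm_score


# Five sentences, all correct: the LLM's 5/5 stands.
print(cap_punctuation_score(5, correct=5, total=5))
# Twenty sentences, 15 correct (75%): the score is capped at 3/5.
print(cap_punctuation_score(5, correct=15, total=20))
```

Keeping the counting outside the LLM means the model only has to judge each sentence, while the arithmetic is done deterministically.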
Future considerations
- Iterative feedback for improving student responses
- AI improvements
- Payment integrations
Data
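Table 1. Grading data by criterion
Exact /23 is the number of the 23 responses where the AI grade matched the official grade. Mean, Max, Min and Std Dev describe the difference (AI grade minus official grade).
Criterion : Exact /23 Mean Max Min Std Dev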
1-Audience : 18 -0.26 0 -2 0.54
2-Text structure : 18 0.043 1 -1 0.47
3-Ideas : 18 -0.22 0 -1 0.42
4-Character and Setting : 20 -0.17 0 -2 0.49
5-Vocabulary : 17 0.26 1 0 0.45
6-Cohesion : 22 0.043 1 0 0.21
7-Paragraphing : 20 -0.043 1 -1 0.37
8-Sentence Structure : 12 -0.17 1 -2 0.78
9-Punctuation : 16 0.043 1 -1 0.56
10-Spelling : 14 0.26 1 -2 0.69
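Table 2. Grading data by response
Exact /10 is the number of the 10 criteria where the AI grade matched the official grade. Total is the official total out of 47.
Response : Exact /10 Mean Max Min Std Dev Total AI Total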
role play writer : 10 0.0 0 0 0.0 0 0
dungaun : 10 0.0 0 0 0.0 8 8
the casel : 10 0.0 0 0 0.0 11 11
bmx : 8 0.20 1 0 0.42 11 13
my story : 8 0.0 1 -1 0.47 17 17
living dead : 9 -0.10 0 -1 0.32 18 17
woodern box : 7 -0.10 1 -1 0.57 22 21
one sunny morning : 9 -0.10 0 -1 0.32 22 21
october 16 1981 : 9 -0.10 0 -1 0.32 25 24
moving away : 7 0.30 1 0 0.48 29 32
space tour : 8 0.20 1 0 0.42 29 31
the haunted house : 4 -0.20 1 -1 0.79 32 30
gambat : 8 0.0 1 -1 0.47 33 33
tracy : 7 0.10 1 -1 0.57 34 35
best friends : 3 -0.80 1 -2 1.0 38 30
lovely purple boots : 8 0.20 1 0 0.42 39 41
his eyes widened : 7 -0.10 1 -1 0.57 45 44
the water tower : 7 -0.30 0 -1 0.48 45 42
in the distance : 9 0.10 1 0 0.32 46 47
axe : 6 0.0 1 -1 0.67 44 44
deep blue nothing : 8 0.0 1 -1 0.47 46 46
fier brething dragen : 7 0.30 1 0 0.48 18 21
the shade whispered : 6 -0.10 1 -2 0.88 26 25
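For anyone wanting to reproduce the summary columns, here is a small sketch of how one row could be computed from per-criterion differences. It takes differences as AI grade minus official grade and uses the sample standard deviation. Both are inferred from the numbers in the tables rather than stated anywhere. The function name `summarise` is hypothetical, and the `bmx` differences are reconstructed from its row (8 exact matches, max 1, min 0, mean 0.20), so which criteria carry the +1 is arbitrary here.

```python
from statistics import mean, stdev


def summarise(diffs):
    """Summarise a list of (AI grade - official grade) differences."""
    return {
        "exact": sum(d == 0 for d in diffs),  # grades that matched exactly
        "mean": round(mean(diffs), 2),
        "max": max(diffs),
        "min": min(diffs),
        "std_dev": round(stdev(diffs), 2),    # sample standard deviation (n - 1)
    }


# Reconstructed differences for the "bmx" response: eight matches, two +1s.
bmx_diffs = [0] * 8 + [1] * 2
print(summarise(bmx_diffs))
```

The same function applies to a column of Table 1 if it is given the 23 per-response differences for a single criterion.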