ATITD Wiki: The Test Of The Venery/Judging

On 11.09.2005 Pharaoh announced a new Scoring Code for Venerys:

Pharaoh here... I've completed a major redesign of The Venery, which I've always thought held enormous promise as a Thought Test, but due to some poor initial design choices, never got used much.

There are a ton of changes. Here goes:

First, there is no longer a need to have any of the designers present. Once your Venery is set up, just open it like any other Thought Test and people can run it when they happen upon it.

Designers can now see the last 21 people to attempt their Venery, along with their status (won, lost, playing, withdrew). More on "lost" in a bit.

A Venery can be in Configuration mode, Open-for-Designers mode, or Open-for-the-public mode.

When someone completes your Venery, they receive a certificate which can be inspected from their Read menu. The certificate lists their name, the time it took to run the Venery, and the "win" status of the Venery at the time the certificate was awarded. Might be cool to have a collection of Certificates that showed Venerys that have won since you played them.

I don't know what we'll do with that feature in the future, but it seemed like a cool thing to keep track of.

The "Compile an Overview" function has been cleaned up a great deal... It now lists just the first line of the Clue and Hint for each installed lockbox. And has a "copy to clipboard" button.

There's now an option to set a Maximum Time on a venery. It can be left as "No Limit", in which case anyone that completes your Venery gets a certificate... Or you can enter a time in the form of 0:30 (meaning 30 minutes). That is what I meant by a runner "losing".
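
A minimal sketch of how a Maximum Time setting like this might be interpreted, assuming the `H:MM` form shown above. The names `parse_max_time` and `run_status` are hypothetical, and treating a finish exactly on the limit as a win is an assumption, not something stated in the announcement.

```python
from typing import Optional

def parse_max_time(text: str) -> Optional[int]:
    """Return the limit in minutes, or None for "No Limit" (hypothetical helper)."""
    cleaned = text.strip()
    if cleaned.lower() == "no limit":
        return None
    hours, minutes = cleaned.split(":")
    return int(hours) * 60 + int(minutes)

def run_status(elapsed_minutes: int, max_time: str) -> str:
    """Classify a completed run as "won" or "lost" against the limit."""
    limit = parse_max_time(max_time)
    # Assumption: finishing exactly on the limit still counts as a win.
    if limit is None or elapsed_minutes <= limit:
        return "won"
    return "lost"
```

With a limit of `0:30`, a 45-minute run would come back as "lost", while any completed run under "No Limit" comes back as "won".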

Finally, the most significant change: An entirely new Judging system. I'll describe that in a minute. The plan is to try this Judging system for Venery in Tale 2, and then if it's as game-resistant and slick as I think it is, use it for all Thought Tests (and possibly Art Tests) in Tale 3.

I tried to solve several annoyances with our existing judging system: "Hash Collisions", where you get a message like "You or someone like you has already judged this"; the problem where the first few judgings that you do don't count, while the system calibrates your average and range; and the 3 per day judging cap.

The new algorithm (Called "PeerReview") keeps some information secret, and gives some feedback to judges. Though some data (I'll explain which) is secret, I'm going to make the algorithm known: The fundamental notion is that a "good" judge *tends* to judge things the way other judges do. Someone who always judges differently, or has no correlation to how others judge, is considered a freak.

Furthermore there is the concept of a "Worldly" judge. Those are judges that have judged alongside a wide ranging group of peers. If you always just judge stuff with your guildmates, you're not worldly. To figure out how much of a freak you are, the system looks at all the Venerys that you have judged, and computes the average scores (poor, fair, ..., Monumental) that they received. Not just a straight average, but a weighted average. Judges that are more worldly have a greater weight in what the average is. So if you want to have great influence as a judge, you should judge many Venerys alongside lots of different judges, and vote the way they do.
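
To make the "weighted average" idea concrete, here is a rough sketch. It is not the published PeerReview code: the numeric score scale and the weight values are assumptions for illustration, and `weighted_average` is a hypothetical name.

```python
def weighted_average(votes, worldliness):
    """Average the votes on one piece, weighting each judge by worldliness.

    votes:       {judge name: numeric score}  (numeric scale is an assumption)
    worldliness: {judge name: weight}         (how weights are derived is an assumption)
    """
    total_weight = sum(worldliness[judge] for judge in votes)
    if total_weight == 0:
        raise ValueError("no weighted votes to average")
    weighted_sum = sum(score * worldliness[judge] for judge, score in votes.items())
    return weighted_sum / total_weight

# Illustrative data only: Ed has judged alongside more peers, so his vote counts more.
votes = {"Josh": 1, "Ed": 5, "Dave": 5}
worldliness = {"Josh": 1, "Ed": 3, "Dave": 1}
average = weighted_average(votes, worldliness)  # 21 / 5, versus a straight average of 11 / 3
```

In this made-up case the more worldly judge pulls the result toward his own vote, which is the effect described above.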

The system doesn't tell you how much of a freak you are, but it DOES tell you if you tend to vote higher or lower than the average. So, for the statistician crowd: It tells you your relative average, but not your variance or correlation. It also doesn't tell you how worldly you are.
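
The kind of feedback described here could be sketched as follows, assuming numeric scores and a per-piece average that has already been computed. `relative_tendency` is a hypothetical name, and the real system may compute this differently.

```python
def relative_tendency(judge_votes, piece_averages):
    """Report whether a judge tends to vote above or below the average.

    judge_votes:    {piece id: this judge's numeric score}
    piece_averages: {piece id: average score that piece received}
    """
    if not judge_votes:
        raise ValueError("judge has not voted on anything")
    diffs = [judge_votes[piece] - piece_averages[piece] for piece in judge_votes]
    mean_diff = sum(diffs) / len(diffs)
    # Only the direction is reported; variance and correlation stay hidden.
    if mean_diff > 0:
        return "higher"
    if mean_diff < 0:
        return "lower"
    return "average"
```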

Since this data must be secret, it may be hard to trust the algorithm. So I'm releasing the source code. http://www.egenesis.com/peerreview.fac

I've had a few questions about how membership in large guilds will affect your Worldlyness... It has no effect. This system doesn't look at your guild membership. It does look at how many different people you've co-judged with. So if me, Josh, and Ed all judge one Venery... And me, Dave, and Ed all judge another Venery... Then I have 3 peers: Josh, Ed, and Dave. No weird normalizing is done on your scores. A monumental means the same thing regardless of the judge. Just that some judges have more influence based on the above.
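
The peer-counting example above can be sketched like this. The data layout and the name `count_peers` are assumptions for illustration, not taken from the released source.

```python
def count_peers(judge, judgings):
    """Count the distinct other people this judge has co-judged with.

    judgings: {venery id: set of judge names who judged it}
    """
    peers = set()
    for judges in judgings.values():
        if judge in judges:
            peers.update(judges)
    peers.discard(judge)  # a judge is not their own peer
    return len(peers)

# The example from the text: two Venerys, each judged by three people.
judgings = {
    "venery_one": {"Pharaoh", "Josh", "Ed"},
    "venery_two": {"Pharaoh", "Dave", "Ed"},
}
pharaoh_peers = count_peers("Pharaoh", judgings)  # Josh, Ed, and Dave
```

Ed appears in both Venerys but is only counted once, and guild membership never enters into it.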

I've been secretly running this algorithm on Raeli Mosaics over the last week or so, and so far it's been quite good. Or at least obviously gamed Mosaics don't have real high scores, and good ones do. Of course, that's not as good a test as when people know the algorithm they're trying to game against. And I am sure there's no such thing as a game *proof* algorithm, just degrees of game-resistance. So we'll see how this one works.

Question about being able to correct your vote if you click wrong. Yes - it gives feedback on what your vote is, and a second click gives feedback of the form "You change your vote from ___ to ___"

Question about if there will be a minimum number of judges: Yes, though I believe we can get accurate scores with fewer... I think 14 or more probably gives a reliable score, and remember that all judges count - there's no calibration period per judge, and no daily caps. So getting those 14 should be a lot easier.

Question about resetting Venery scoring. Unfortunately yes. The data required by this system is so radically different from the previous one that a conversion wasn't even remotely possible... As of a few weeks ago, only 1 Venery had enough votes to pass, and it was blatantly gamed. (A whole bunch of 1-click Venerys next to it, all with super-low scores, and the main Venery itself was through the roof.) (Maybe gamed isn't the right description of that one, no offense intended. Just that the structure of the Test as I designed it made it hard/impossible to even get legitimate scores in a reasonable timeframe.)

All this stuff is released, and various components tested. In any rewrite this complicated, I would bet that there's maybe one tiny bug lurking somewhere ;) As always, DevCall if you find anything that doesn't look right...

I'm heading home for tonight, but I'm planning to spend time tomorrow as bugs do come up, and to further refine things as needed. Good Night, Egypt! Pharaoh out.

Pharaoh here..

I'm probably going to remove the concept of "Designers" from Venerys. Right now the list of designers functions as both a "who will pass" list, and a weird sort of ownership system that controls who is allowed to configure the Venery. And it tends to get all mixed up with the actual ownership system.

Under the new system, when a Venery is opened, the person opening it is the one Designer, and the one who will pass if the Venery is the best.

If someone else opens the Venery (for instance, it has passed, and another guild member comes along and makes improvements and re-opens it), then judging information is cleared, and the Venery can be re-used.

It's only when the designer *changes* from one person to another that the Judging info will clear. When one designer opens, closes, tweaks, opens, etc. - no clearing.
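
The clearing rule can be sketched as a tiny state holder. The class and attribute names here are hypothetical, and this only models the rule as described, not the game's actual code.

```python
class Venery:
    """Toy model of the single-designer rule (hypothetical names)."""

    def __init__(self):
        self.designer = None  # whoever last opened the Venery
        self.votes = {}       # {judge name: score}

    def open(self, opener):
        # Judging info clears only when the designer changes to someone else.
        if self.designer is not None and opener != self.designer:
            self.votes.clear()
        self.designer = opener

venery = Venery()
venery.open("Ann")
venery.votes["Bob"] = 4
venery.open("Ann")    # same designer re-opens after a tweak: votes kept
kept_votes = dict(venery.votes)
venery.open("Cleo")   # a different person opens it: votes cleared
```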

You could make a case that even that amount of preservation is unfair to judges, since some will be judging an improved version, and so their scores would naturally be higher.

But I think that in practice the improvements won't tend to be big enough for this to be a worry.

I can provide an option where the Designer voluntarily resets judging info if they feel they have made a significant change.

I've made an addition to the PeerReview (Venery) Judging system: There is now an option to register a vote of "fraud". This type of judgment doesn't factor in to the piece's score at all, and doesn't affect your standing as a judge in any way... Furthermore, it doesn't even prevent the piece from winning...

If 1/7th of judges have voted a piece as a fraud, then it means that that piece doesn't factor in when computing judges' Worldlyness, and doesn't factor in when computing judges' influence. It's intended to lessen the temptation to make a whole bunch of token pieces which are in fact all bad, and are simply used to lessen people's "freakyness".
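
A minimal sketch of the 1/7th check, assuming "1/7th of judges" means at least one seventh of everyone who judged the piece, fraud voters included. The function name is hypothetical, and the exact comparison in the released source may differ.

```python
def is_flagged_fraud(fraud_votes, total_judges):
    """True if at least 1/7th of the judges voted the piece a fraud.

    Integer arithmetic avoids rounding trouble with 1/7.
    Assumption: total_judges includes the judges who voted "fraud".
    """
    return total_judges > 0 and fraud_votes * 7 >= total_judges
```

With 14 judges, for example, 2 fraud votes would be enough under this reading and 1 would not. A flagged piece keeps its own score; it is just skipped when judges' standing is computed.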

I've posted the updated source code: http://www.egenesis.com/peerreview.fac

So in summary, if an object appears to be (for instance) a one-click Venery, a vote of Fraud wouldn't hurt it at all, but would help remove any advantage that judges that voted it "Poor" would otherwise receive. Make sense?
