One is what those panels of beer drinkers represent. I understand the statistics (believe me, I do),
I don't doubt that you understand them far better than I.
but I don't think they're properly used to produce actionable intelligence. I use that phrase with my students: suppose something is "significant"; what of value have you learned about the world from that? If you can't say, significance is not useful.
So let's say a brewer thinks he has a diacetyl problem and wants to know whether using a proportion of valine-rich malt will improve his beer with respect to diacetyl. He brews a test batch (B) and wants to know if it is better than his regular beer (A), which is the same as B except that B contains some portion of the valine-rich malt. To see if it's better he gives 40 tasters a sample of each, and 18 report that beer B is better. He goes to a table (or whatever) and finds that the probability that 18 or more tasters guessing randomly prefer B is 78.5%. He concludes that, as fewer than half of his tasters preferred B, and as it's more likely than not that the data he obtained could be obtained by flipping a coin, B is very likely not better than A, and he doesn't adopt the new process. He takes no action. Let's assume, at this point, that the new malt does indeed improve the beer by reducing diacetyl but that 22 members of the panel are diacetyl taste-deficient. Thus the brewer accepts H0 when H1 is true, and we see that this test isn't very powerful because of the panel composition.
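The 78.5% figure is just a binomial tail probability and is easy to check. Here is a quick sketch in Python using the numbers from the example (the helper name `binom_tail` is mine):

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 40 tasters, each guessing with probability 1/2 under H0:
# the chance that 18 or more "prefer" B anyway.
p_value = binom_tail(40, 18, 0.5)
print(f"P(X >= 18 | n=40, p=1/2) = {p_value:.3f}")  # 0.785
```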
Along comes a guy who says "Hey, that's not a very powerful test. Give them three cups...." i.e. he advises the brewer to try a triangle test. Under H1 the 18 who picked the lower-diacetyl beer in the simple test should be able to detect the difference between A and B, so we would have 18 out of 40 qualifying. The probability of this happening under H0 is 8.3%. That's enough for the brewer to start to think "maybe this makes a difference," but it is not below the usual 5% threshold of statistical significance. And he still doesn't know whether the new process improves the beer. Being this close to statistical significance, his action is perhaps to perform additional tests, empanel larger panels, or test his panel to see whether some of its members are diacetyl-sensitive.
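The 8.3% can be checked the same way; the only change from the simple preference test is that a guesser now picks the odd cup with probability 1/3 instead of 1/2 (a sketch, helper name mine):

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Triangle test: under H0 a guesser identifies the odd cup with probability 1/3.
p_value = binom_tail(40, 18, 1/3)
print(f"P(X >= 18 | n=40, p=1/3) = {p_value:.3f}")  # ~0.083
```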
The consultant comes back and says "Did you ask the panelists which they preferred?" and the brewer says "Yes, but I didn't do anything with the data because this is a triangle test." The consultant advises him to process the preference votes, which reveals that 11 of the 18 who qualified preferred B. The probability that 18 qualify and 11 prefer under the null hypothesis is 1.6%. Using this data the brewer sees he is below the statistical significance threshold, confidently rejects the null hypothesis, and takes the action of adopting the new process. Note that under the assumptions we made above, more than 11 out of 18 should find B to be lower in diacetyl. If 14 do, then p < 0.1%.
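The two-stage figure depends on exactly which compound event is scored. One natural reading, "at least 18 of 40 qualify and at least 11 of those qualifiers prefer B, with every panelist guessing," can be enumerated exactly; this particular scoring is my assumption, so it need not reproduce the 1.6% quoted above, which may correspond to a slightly different compound event:

```python
from math import comb

def pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(pmf(n, i, p) for i in range(k, n + 1))

# Under H0: at least 18 of 40 qualify (each with probability 1/3),
# AND at least 11 of the qualifiers then "prefer" B (each a coin flip).
p_joint = sum(pmf(40, k, 1/3) * tail(k, 11, 0.5) for k in range(18, 41))
print(f"joint p under H0 = {p_joint:.4f}")
```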
I know that people are guessing when they can't tell, and that's certainly fine for the statistical element of this, but there's an issue with it.
You seem to be saying that while we are trying to make a decision about H1 by the only means available to us, i.e. rejecting H0 if the probability of what we observe is low under H0, a test which produces a lower p than another test isn't necessarily a better test. The lower the p, the more likely we are to reject H0 when H1 is true (and p does not depend on any of the conditions that pertain when H1 is true), and the probability that we do so is, AFAIK (I'm no statistician for sure), the definition of the 'statistical power' of the test. The two-stage test is more powerful than the triangle-alone test.
People who guessed correctly simply by luck can't tell the difference. I don't see the point of asking such people about preference, as the preference is just as random.
That's really not a flaw of the technique but rather a feature of it. Yes, some unqualified votes (guesses) come in, but 2/3 of them are eliminated. Compare that to just asking panelists to pick the better beer: 0% of the guessers are eliminated in that case. The power of the two-stage triangle test derives from this very feature.
When one does that, the preference data is contaminated by guessing. Like trying to see something through fog.
So let's turn down the contamination level by presenting quadruplets of cups with 3 beers the same and 1 different. In that case only 1/4 of guessers qualify, p(40,18,11) = 0.09% and the test is seen to be even more powerful.
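The effect of shrinking a guesser's qualification odds can be seen directly by computing the qualification-stage tail for the three formats (a sketch; the "paired", "triangle", and "quadruplet" labels are mine):

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that 18 of 40 pure guessers clear the qualification stage as the
# guessing probability drops: paired (1/2), triangle (1/3), quadruplet (1/4).
for label, p in [("paired", 1/2), ("triangle", 1/3), ("quadruplet", 1/4)]:
    print(f"{label:10s} P(X >= 18 | n=40, p={p:.3f}) = {binom_tail(40, 18, p):.4f}")
```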
I'd feel better about the panels--and those who "qualified"--if they could reproduce their choice repeatedly.
As I mentioned in a previous post, adding the preference part is really asking the panelists to distinguish the beers again, by choosing which has less diacetyl than the other. This is somewhat similar to multiple runs.
That would tell me they truly were qualified, and not qualified purely on the basis of a lucky guess.
Depending on the nature of the investigation, qualification by guessing may be exactly what you are looking for. If you want to see whether your market is insensitive to diacetyl creep, then you want to see whether they have to guess when asked to distinguish (or, more important, prefer) beers lower in diacetyl. Keep in mind that to pick one out of three correctly there must be both a discernible difference AND the panelist must be able to detect it. If those conditions are not both met, the panelist must guess (the instructions require him to). These tests are a test of the panel and the beer. I keep saying that.
But where we are, as in the example of this post, investigating something specific, we want to qualify our panel by presenting it with standard samples for two-part triangle testing.
One of my areas of interest/expertise is measurement (though it's in the social science world, not the biological/chemical world). Measures--instruments--need to be valid but to be valid they also must be reliable. I have no indication in any of this that the guessers--who are providing "preference" data--are doing anything in that realm except guessing.
As noted, even the 'best' panelists have to guess when the beers are indistinguishable, and that's exactly what we want them to do. As I said above, guessing is an important feature of this test, not a flaw.
And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.
I've explained it as clearly as I can, and if you can't see it then I would, depending on your level of interest, suggest pushing some numbers around or even doing a Monte Carlo or two if you are so inclined. The main disconnect here is that you are arguing that a statistical test, even though more powerful than another, is less valid than the other. That can only mean it could lead us to take the wrong action, which in this case would mean the more powerful test leads our hypothetical brewer to decide against using the low-valine malt even though it does improve the beer (defining 'improve' as a reduction in diacetyl). I don't see how that could possibly happen.
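Since a Monte Carlo was suggested, here is a minimal one for the two-stage triangle test under H0 (every panelist guessing). The trial count, the seed, and the scoring rule, counting a run as "significant" when at least 18 qualify and at least 11 of the qualifiers prefer B, are my assumptions:

```python
import random

def two_stage_trial(rng, n_panel=40):
    """Simulate one panel under H0: every taster guesses the odd cup
    (success probability 1/3); each qualifier's 'preference' is a coin flip."""
    qualifiers = sum(rng.random() < 1/3 for _ in range(n_panel))
    prefer_b = sum(rng.random() < 0.5 for _ in range(qualifiers))
    # Score the run as "significant" if >= 18 qualify and >= 11 prefer B.
    return qualifiers >= 18 and prefer_b >= 11

rng = random.Random(42)
n_trials = 200_000
hits = sum(two_stage_trial(rng) for _ in range(n_trials))
print(f"Monte Carlo estimate of the two-stage p under H0: {hits / n_trials:.4f}")
```

A panel that can genuinely taste the difference will trip this criterion far more often than guessers do, which is what the power argument above is claiming.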