We're going to have to agree to disagree about all this.
I'm not really sure there is that much disagreement.
Partly you want these tests to be about specific elements of the beer (diacetyl, e.g.),
Actually I want it to be about whatever the investigator is interested in. If H1 is "Valine rich malt decreases diacetyl" then the focus is on diacetyl. If H1 is "Valine rich malt improves customer acceptance in the demographic represented by this panel" then the focus is on preference. As I've said several times before, these are different tests with regard to the panel selection but not the beer. In the diacetyl case you want a panel sensitive to diacetyl, which you verify by running the test with diacetyl-spiked samples. You can't do that with a preference-testing panel, but you can take steps to ensure that the panel is representative of the demographic you are interested in.
...whereas I'm looking for areas that potentially confound the results.
They abound and that was the point of my original post in this thread. If the beer differs detectably in any attribute other than the one we are interested in, the triangle part of the two-stage test is immediately invalidated. The example I have used before is color. If use of the valine-rich malt changes the color in addition to the diacetyl, and the panelists can see the color, the test is invalid. The probability under H0 is small and the investigator is lulled into rejecting it because of something that has nothing to do with the parameter he is interested in, whether the question is about perceived diacetyl or about preference. This is why my original and several follow-on posts emphasized that the investigators have to be very thoughtful about the design and conduct of the tests, and why I suggested that if there is a flaw in Brulosopher's approach it might well lie in this area.
And allowing the guessers to be part of preference trials may be necessary for the statistical elements of tests to be met, but it is the antithesis of good measurement to include tasters who can't tell a difference.
That depends on what the investigator is interested in. If he wants to know about diacetyl he shouldn't empanel a group that doesn't have demonstrated sensitivity to diacetyl (as demonstrated by triangle tests with diacetyl-spiked beers). But in the preference case (H0: "Valine rich malt does not improve customer acceptance in the demographic represented by this panel") we want the panel to include people who can't tell the difference, if we are interested in selling beer to a demographic which includes people who can't tell the difference. For such a panel it is possible that H0 may be true. If asked about preference, a diacetyl-sensitive panel would probably enthusiastically endorse the beer made with the valine-enhanced wort, causing the investigator to reject H0 and, given that he is interested in a market that has a decent proportion of people who can't tell the difference, thereby commit a Type I error.
They've already indicated they have no preference since they can't even tell them apart. Then a forced choice introduces noise into the data, which makes no sense at all.
As noted above, sometimes it does. Type I errors can be as damaging as Type II. It appears that rejecting H0 when it is in fact true (Type I) is the threat in preference (subjective) investigations, whereas failing to reject H0 when it is false (Type II) is the threat in tests in which an objective (e.g. more or less diacetyl) answer is sought. In those cases guessers do introduce noise, but as noted in my last post we could easily reduce that noise by using quadruplets rather than triplets. The fact that this is not done indicates (to me, anyway) that the amount of noise injected in a triplet test is not problematical, or at least that the reduction from going to a quadruplet test is not justified by the extra difficulty in manipulating quadruplets.
Frankly, I think some of this testing is an attempt to hide, under the veneer of "scientific" statistics, the fact that the tasting panels are suspect.
My impression, as an engineer, is that people in fields like biology, medicine, the social sciences, finance and many others go to college and are given a tool kit of statistical tests which they then apply in their careers, eventually coming to a pass where they are plugging numbers into some software package without remembering what they learned years ago in college and thus not fully understanding what the results they get mean. Engineers do this too, BTW. In homebrewing I think you find people scratching their heads over what the data from their homebrewing experiments mean, and then they discover an ASBC MOA into which they can plug their numbers and get a determination as to whether they are 'statistically significant' or not, without having a real idea as to what that means.

This is kind of tricky stuff. If I go away from it for even a short period of time I have to sit down and rethink the basic concepts. Maybe it's just that I am not intrinsically good at statistics or don't have much experience with it, but as the discussion here shows, there are many subtle nuances in how experiments are conducted and how the data are analyzed. As engineers say, "If all the statisticians in the world were laid end to end they wouldn't reach a conclusion." It's supposed to be a joke but it is true because of the fundamental nature of statistics:
it is a guessing game. That's why a statement
And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.
from a statistician kind of surprises me. Everything we observe is corrupted by noise. We cannot measure voltage with a voltmeter; we can only obtain from it an estimate of what the voltage is, and must recognize that the reading is the true voltage plus some error. Statistics is the art of trying to control, or at least quantify, that error so that the guesses we are ultimately forced to report represent the truth at least fairly well. Well, that's my engineer's perspective on it.
I'd be much more inclined, if I wanted to do a preference test, to just ask people which they preferred, and bag the triangle test. Randomly assign which beer was tasted first, then see what you get.
Interesting that you say that as just this morning I came up with a test. A number of participants are presented n objects one of which is different from the others. The instructions to the participants are:
"You will be given a number of objects and a die. One of the objects is different from the others. Identify it. If you cannot, use the die to randomly pick one of the objects. Separate the object you picked and one other object from the group. Now choose which of these two objects you prefer. If you cannot decide on one or the other, use the die again (or a coin) to randomly select one."
Thus the test you propose is the first part of my test with n = 2, and the triangle test is the first part of my test with n = 3. The following sets of numbers show the probabilities, under the null hypothesis, that 10 out of 20 testers will choose the different object correctly AND that 5 of those 10 will prefer one or the other.
3 TR(20,10,5,1/3,1/2) means that n = 3, the panel size is 20, 10 correctly pick, 5 prefer one or the other, that the probability of picking correctly is 1/n = 1/3 and that the probability of preferring is 1/2. The first number in the next line is the confidence level for the triangle part of the test and the second the confidence level for the two part test.
2 TR(20,10,5,1/2,1/2)
0.588099 0.268503
3 TR(20,10,5,1/3,1/2)
0.0918958 0.0507204
4 TR(20,10,5,1/4,1/2)
0.0138644 0.00802728
5 TR(20,10,5,1/5,1/2)
0.00259483 0.00153428
These numbers clearly show that a triangle test is a more powerful test than a pick-one-of-two test (which is why triangle tests are performed rather than pick-one-of-two tests) and that a quadrangle test is more powerful than a triangle test. They also show that the two-part test is more powerful than the triangle or quadrangle test by itself, in that it increases one's confidence in what the data show. This is all, of course, under the assumption that the investigator does not step into one of the many potential pitfalls we have discussed.
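For anyone who wants to check these figures, here is a short Python sketch of the calculation. The function name `two_stage` is mine, and I am assuming the second-stage event is "at most 5 of the m correct pickers prefer one designated sample" (by symmetry this is the same as at least m − 5 preferring the other); under that reading the sketch reproduces the numbers quoted above:

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_sf(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(n, m, p) for m in range(k, n + 1))

def binom_cdf(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(n, m, p) for m in range(0, k + 1))

def two_stage(panel, k_pick, k_pref, n_objects):
    """Stage 1: probability, under H0 (pure guessing, p = 1/n_objects),
    that at least k_pick of `panel` testers pick the odd object.
    Stage 2 (joint): that AND, among the m correct pickers, at most
    k_pref prefer a designated object (preference probability 1/2)."""
    p = 1.0 / n_objects
    stage1 = binom_sf(panel, k_pick, p)
    joint = sum(binom_pmf(panel, m, p) * binom_cdf(m, k_pref, 0.5)
                for m in range(k_pick, panel + 1))
    return stage1, joint

for n in (2, 3, 4, 5):
    s1, j = two_stage(20, 10, 5, n)
    print(n, round(s1, 6), round(j, 6))
```

Running this gives 0.588099 / 0.268503 for n = 2 and 0.091896 / 0.050720 for n = 3, matching the table to rounding.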
If I can ever figure out how to do one of these exbeeriments at a level which will satisfy my desire to make it well-controlled, I'll do some of this.
I don't think you will ever be able to do enough experiments to get you past your conception that allowing guesses is a detriment. I think Monte Carlo is a much more promising approach.
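To show what I mean by Monte Carlo, here is one possible sketch (the function name, seed, and trial count are my own choices, not anything established): simulate a panel of 20 pure guessers on a triangle test and count how often chance alone yields 10 or more correct picks. The estimate should land near the exact 0.0919 from the table above.

```python
import random

def simulate_triangle(panel=20, threshold=10, n_objects=3,
                      trials=100_000, seed=1):
    """Monte Carlo estimate, under H0 (every taster guesses), of the
    probability that at least `threshold` of `panel` tasters pick the
    odd sample out of `n_objects` by pure chance."""
    rng = random.Random(seed)
    p = 1.0 / n_objects
    hits = 0
    for _ in range(trials):
        correct = sum(rng.random() < p for _ in range(panel))
        if correct >= threshold:
            hits += 1
    return hits / trials
```

With 100,000 trials the standard error of the estimate is about 0.0009, so it agrees with the exact binomial value to roughly three decimal places.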
Maybe this is all just a justification for an upgrade to my system?
If H0 is "You shouldn't upgrade your system" the level of support for that is p << 1.