I believe the p value should be discarded in these types of trials.
Sadly, there wasn't enough "blindness" in the testing process, and there were other confounding variables (I kegged; the other brewer bottle-conditioned), etc. So I cannot claim it is truly scientific, not anywhere near the level of Brulosophy.
But I tasted both beers and definitely perceived a difference lol...
Marshall is a dear friend and I respect and appreciate what he's done. But too many homebrewers take it as the last word, rather than a single data point. The key to science is repeatability. Someone does an experiment, then others do it to verify the results. If there's only one trial, then you can't really draw a conclusion. At Experimental Brewing, we try to get around that with multiple testers and a lot more tasters, but that has its own problems. In short, look at these experiments as a starting point for your own exploration. Trying to convince another brewer, whether homebrewer or commercial, that they're the last word is not only misleading, it's not how any of us intend the experiments to be used.
I think we do agree about qualifying the panel in most cases.
The thing that you don't seem able to grasp is that if you are trying to see how a proposed change in your beer will affect its sales, and assay to do that with a taste panel, then that panel had better reflect your market.
I will also point out, again, that noise is inevitable - even with a 'qualified' panel - and that the power of a triangle (or quadrangle or....) test is that it improves the signal-to-noise ratio. See my post on ROCs.
Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.
The temp-controlled beer had an average score 11 points higher than the non-controlled beer.
Not only is that not clear from the way they do it, you point out one of the elemental difficulties with having a one-shot guess "qualify" tasters for the preference test.
Show me you can pick the odd-one-out three or more times in a row, and I'll believe you can detect a difference....and you are qualified to go to the next level.
Guessers cannot tell the difference; why would anyone want them judging preference, and guessing on that too?
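To put a number on that qualification idea: under a simple detect-or-guess model (an assumption for illustration, as is the 80% detection rate below), the chance of a pure guesser surviving several triangles in a row falls off fast, while most genuine discriminators survive. A quick sketch:

```python
def pass_prob(p_correct: float, rounds: int) -> float:
    """Probability of picking the odd sample in every one of `rounds`
    independent triangle tests."""
    return p_correct ** rounds

# A pure guesser picks the odd cup with probability 1/3 per triangle:
print(round(pass_prob(1/3, 3), 3))   # about 0.037 -- under 4% slip through
# A taster who genuinely detects the difference 80% of the time:
print(round(pass_prob(0.8, 3), 3))   # 0.512
```

The flip side, of course, is that a three-in-a-row screen also rejects nearly half of those assumed 80% discriminators, so longer qualification runs trade false positives for false negatives.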
Hi friend, that last post came off wrong and too aggressive, and I am sorry for that. I appreciate your opinions and your contributions to this forum very much.
How many tasters?
And I agree with this enthusiastically. As there was a benefit for the doctor in a higher false alarm rate (in that he can perform additional and potentially even more expensive tests and avoid a lawsuit), so is there a potential benefit for brewers in raising the acceptable level of p if it drives more investigation.

My argument remains that a few more false alarms might not be such a terrible thing, if it might encourage more of us to run even more xbmts on our own to support/refute/learn for ourselves.
I believe the p value should be discarded in these types of trials.
No worries. I love to debate. I don't take things personally... As a wise man [me] once coined a phrase:
Offense can never be given; it can only be taken.
We had about 10-11 tasters giving informal feedback. Only about 4 filled out BJCP sheets. So as I said, not exactly scientific, and the confounding variables were an issue. It wasn't anywhere near the level of what Brulosophy does.
Here it is:

You can have the last word.
I'll let the insult slide.
Offense can never be given; it can only be taken.
This is not the first time you've decided to take the low road, my friend.

If basing my position on sound principles, supporting my conclusions with examples and data (though it be simulated), and explaining them to the best of my ability be the low road, I'll take it. And probably get to Scotland a'fore ye too, with Scotland representing a fuller understanding of triangle testing than I've ever had before. That's why I find it so disappointing that you wish to withdraw based on what are clearly misunderstandings of my posts.
OK, I think we're done here. It's pretty much a universal truth that when your interlocutor decides to change the discussion to another argument, it's a sign he/she doesn't have much with which to respond to the original argument.

I have never changed the argument. The central theme of all my posts was stated in my first post in this thread (#61) as "...the selection of the panel which must be driven by what one is trying to measure" and this was repeated many times in others, perhaps phrased differently, but I feel it should have been clear that the design of the test depends greatly on the nature of the investigation. Whether a particular reader is able to grasp that or not is immaterial as long as most of the readers do.
Now you want this to be about the market? I noted this whole issue a long time ago when I pointed out the inability to know to whom the sample of tasters is generalizable.

My use of the market as an example of a case where we are interested in the verity of the null hypothesis, rather than its alternate, is hardly new to the recent posts. In No. 61 I said "Then we get into questions of how well these 20 guys represent the general public (or whatever demographic the experimenter is interested in - presumably home brewers)." I hope that you will grant me that a brewery's market is included in "whatever demographic".
Further, now you want this to be about the market and not about whether the beers are different?

No. I want it to be, as I have said all along, about whatever the investigator is interested in investigating.
<Snip silliness in the context of the argument>

If you think something silly, please say why it is silly.
If you want to argue that "noise is inevitable" without understanding that there are ways to reduce it and the desirability of doing so, then there's not much point in continuing.
This sounds awful and I'm glad the breweries near me aren't this way. We have 4 local breweries and all 4 of them are actively involved with the local homebrew club. They sell us grain at wholesale prices, they each host us at least once a year, they attend 1 or 2 meetings a year, and they sponsor homebrew competitions where the winning beer is brewed on their system.
Thus a quadrangle test with 20 panelists gives slightly better performance than a triangle test with 40 equally qualified panelists. . . It appears for this particular case adding the extra cup is slightly better than doubling the panel size!
It really is awful. I wish more local breweries here took more of a "hand in hand" approach with the local homebrew scene. Only 1 of them actively, year after year, stands with the homebrewers with contests and such.
I wish it were more. I have often thought that having a brewery with a brewery swag shop that also provides basic homebrew supplies and grain at bulk prices (and even clone kits of one or two of their beers), along with some "guest" homebrewer batches/classes, would be brewing utopia. I know I would be a loyal patron.
It's one of my game plans to incorporate this idea if I ever pull the trigger on my nano. <insert trademark here>
And the tester saves 40 cups of beer.
We had one of these in Denver. Dry Dock Brewing. I think they closed the Homebrew side.
Good point. And with respect to management of the samples, it's a question of fiddling with 20*4 = 80 vs. 40*3 = 120 cups of beer, which has got to be easier, so we wonder why there is no quadrangle test. Before we get too excited, let's keep in mind that this result represents one particular set of circumstances (panel size of 20, probability of qualification 50%, probability of preference 60%). Perhaps if we examined a wider range of circumstances we would not find the gain so great. Something to look into, though.
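One way to look into a wider range of circumstances: assume each taster either truly detects the odd beer (with some probability d) or guesses, compute the exact significance threshold for each test, and compare the chance that each panel reaches p < 0.05. The detect-or-guess model and the d values below are my assumptions for illustration, not figures from this thread:

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical(n, p_guess, alpha=0.05):
    """Fewest correct answers needed before guessing alone
    explains them with probability < alpha."""
    return next(k for k in range(n + 1) if tail(n, k, p_guess) < alpha)

def power(n, p_guess, d, alpha=0.05):
    """Chance the panel reaches significance when each taster truly
    detects the odd beer with probability d and guesses otherwise."""
    p_correct = d + (1 - d) * p_guess
    return tail(n, critical(n, p_guess, alpha), p_correct)

for d in (0.2, 0.3, 0.5):
    tri = power(40, 1/3, d)   # triangle:   40 tasters, 120 cups
    quad = power(20, 1/4, d)  # quadrangle: 20 tasters, 80 cups
    print(f"d={d}: triangle(40)={tri:.2f}  quadrangle(20)={quad:.2f}")
```

Sweeping d (and the panel sizes) this way would show whether the extra cup keeps its edge over the doubled panel across circumstances, or only in the one case reported above.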
I wonder if it's partly due to how hard it is sometimes to even qualify the panel and achieve statistical significance with a triangle test. As has already been discussed, the tasting panels are not exactly perfectly chosen per your guidelines (i.e. if you're testing for diacetyl, pre-qualify the panel to determine who is sensitive to diacetyl).

Though I have apparently not made it clear, the main theme in my posts has been that what you do depends on what you are trying to measure. In cases where the object is to see if the process change has decreased diacetyl, then it seems that we would want panelists who are sensitive to diacetyl. If the object is to detect whether the process change affects preference for the beer among some group of people, then the panel does not need to be qualified other than to make sure that it is representative of the group you are trying to measure.
With a quadrangle test, yes, you'd require fewer testers correctly picking the odd beer to achieve significance, but I would worry that with a small panel you'd find even fewer experiments achieving significance. I don't know the math on this, but think of this as an example:
Triangle: 24 testers, you need 13 correct for p<0.05
Quadrangle: 24 testers, you need 11 correct for p<0.05
This seems better, but consider the guessing scenario: pure guessing would produce, on average, 8 correct tasters in a triangle test and 6 correct tasters in a quadrangle test.
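Those thresholds can be checked exactly with the binomial tail: the chance of k or more correct out of n when everyone is guessing. A quick check, assuming independent tasters:

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by luck."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct(n: int, p_guess: float, alpha: float = 0.05) -> int:
    """Smallest count of correct picks whose pure-guessing probability is below alpha."""
    return next(k for k in range(n + 1) if tail(n, k, p_guess) < alpha)

print(min_correct(24, 1/3))   # triangle:   13 (pure guessing averages 8 correct)
print(min_correct(24, 1/4))   # quadrangle: 11 (pure guessing averages 6 correct)
```

This reproduces the 13-of-24 and 11-of-24 figures above: 13 correct in a triangle has a guessing probability of about 0.028, while 12 correct is about 0.068, just over the 0.05 line.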
I think it would be really cool to see Brulosophy use the same experimental batches on two different panels of testers, run once as a triangle and once as a quadrangle.
My gut instinct (which isn't statistics, I know lol) is that although the implications of statistical significance would be stronger if the panel qualified at p<0.05 than it would in a triangle test, the likelihood of achieving significance is lower because a quadrangle is IMHO a more difficult selection than a triangle.

Because the quadrangle is a more stringent test, the probability of reaching a given number correct by random guessing is lower than it would be for a less stringent test (triangle). That's where the quadrangle attains its apparent advantage. The Monte Carlo test runs for your example confirm this: for the triangle test the average confidence level was p = 0.048 whereas for the quadrangle it was p = 0.016.
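For anyone who wants to run that kind of Monte Carlo comparison themselves, here is a minimal sketch. The detect-or-guess model and the 35% detection rate are my assumptions for illustration; the averages will not reproduce the exact p = 0.048 / 0.016 figures, which came from a different setup:

```python
import random
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the p-value of k correct out of n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def simulate_mean_p(n_tasters, p_guess, p_detect, trials=20_000, seed=1):
    """Average p-value over many simulated panels in which each taster
    detects the odd beer with probability p_detect and guesses otherwise."""
    rng = random.Random(seed)
    p_correct = p_detect + (1 - p_detect) * p_guess
    total = 0.0
    for _ in range(trials):
        correct = sum(rng.random() < p_correct for _ in range(n_tasters))
        total += tail(n_tasters, correct, p_guess)
    return total / trials

# Hypothetical 35% detection rate -- an assumption, not a thread value.
print("triangle  (24):", round(simulate_mean_p(24, 1/3, 0.35), 3))
print("quadrangle(24):", round(simulate_mean_p(24, 1/4, 0.35), 3))
```

The direction of the result (a lower average p for the quadrangle at the same detection rate) is what the quoted Monte Carlo runs showed; the magnitudes depend on panel size and the assumed detection rate.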
Imo, the path to better brew is the water. Sitting around mathematically working and justifying the, to me, obvious results does not make sense to me. They don't "prove" anything, but to an eager and open mind they demonstrate great information for the commercial and home brewer.
If the object is to detect whether the process change affects preference for the beer among some group of people then the panel does not need to be qualified other than to make sure that it is representative of the group you are trying to measure.
Yes, these are all single experiments and I'd also prefer to see them repeated before changing tried and true processes. That said, all of them are presented with sufficient detail in the reports that any of us could take on the challenge of trying to repeat them.
Question: how many of the Brulosophy experiments have you read (more than the headline and results) - actually read the full write-up? Actually this question goes for other posters in this thread too... I see a lot of comments regarding the experimental design that don't seem to reflect having really read the reports.
In reading the vast majority of the experiments it seems to me that Marshall and crew are focused on detecting whether process or ingredient changes result in a perceptible difference. Everything else is intended to guide thinking about future experiments.

That is certainly a reasonable application. It has been suggested here that when a difference is 'detected' but with poor confidence, that sends the message that further testing is warranted. That is certainly a valid interpretation. Rather than comment on Brulosophy's selection of a particular confidence level for a declaration of 'detection', at this point I would rather emphasize that we can't detect that there is a difference but only estimate the probability of seeing data like these if there were no difference, and that the other equally important part of the question is "By whom?".
While I am a firm believer in fermentation temperature control, I do find it interesting that their experiments show that some other things I took as largely irrelevant may lead to perceptible changes in the beer that are easier for typical homebrew drinkers to detect than control of fermentation temperature.

Given this, if I saw a test that purported to test WLP001 vs US05 and compared beers brewed with them, one of which was done in glass and one in SS, I'd call foul on that test. Everything but the parameter of interest must be the same or masked. But it is not always possible to do that. These are definitely things that must be considered in planning and evaluating a triangle test.
Take for example: tasters were able to distinguish between beer brewed with WLP001 and US05, but were not able to distinguish between beer brewed with Galaxy and Mosaic hops. Tasters saw a difference between glass carboy and corny keg fermentation, but did not see a difference between chocolate malt and Carafa Special 2 in a Schwarzbier.
In all of these examples I am much more impressed by whether people could detect a difference than whether the qualified group preferred one over the other. When I design a recipe it is my preference that counts.

A key theme in all my posts has been that the test needs to be designed to reflect what the investigator is interested in. If you are not interested in the preferences of anyone but yourself, then there is little point in asking the preference question. Except that we noted that a second question helps to reduce p, thus increasing the confidence that the apparent difference is real.
I doubtless will read some of their reports in detail at some point in time, but thus far my posts are about triangle testing, not Brulosophy's skill in implementing them.
Interesting stuff, no doubt!