Do "professional" brewers consider brulosophy to be a load of bs?

Really, you are still holding on to your brewing dogma for dear life, trying your hardest to make your position look mathematically strong when there's no strength there.

Let me ask you this: how many more times do they need to test fermentation temp for you? They have now done it eight times, by three people in three different states. The last one compared 48 degrees and 72 degrees, a 24-degree difference. Only eight of 20 people could tell the difference, if they even could. The person who made them could only get a triangle test right two of four times and anecdotally felt certain they tasted exactly the same. But somehow that's not good enough for you, and neither are the results from the other seven tests, for you to form an opinion. And I'm just some sort of "fool" for going along with these results.

So how many more times do they need to test this? Really, give us all a number.

That last test was Wyeast 2124; they have used WLP800 as well. So they've used at least three different yeasts now. Furthermore, the "meaningless" preference data in this case once again favored the warm ferment. In the 82-degree experiment the preference was also for the warm ferment.

So really, bwarbiany, explain why it would be so foolish to conclude that fermentation temperature isn't as big a deal as you make it. What's the reasoning? Because you read it in a book somewhere? Because someone you really trust told you it mattered? Because it's just something you think, or what you believe is taught in college? Once again, where's your data, other than somebody saying it's true? They made the beers and they tasted them blind in a triangle; that's how it's done. There really isn't another way to do it. Oh, I guess we could measure it with a spectrometer or something. Needing a million-dollar machine to discern a difference doesn't make a difference to me.

You need to boil 90 minutes, right? Well, how come in the two tests he did there was no DMS? And in that test he did send it to a lab, and there was no DMS. How come the boil-with-the-lid-on experiment didn't come back significant? How come the weak versus strong boil didn't come back significant? How come people couldn't reliably tell the difference between whirlpool hops and flameout additions? Is it all just a bunch of BS and you have the real answers, or is there a chance that an overemphasis on process considerations has skewed your perception into tasting and believing things that don't exist? Is there a chance that maybe some of this stuff just doesn't matter, and the real answer to making better beer lies elsewhere?

There are literally thousands of published papers studying fermentation temperature and yeast. Your smug attitude is hilarious: you actually say "show me the data" when the "science" you hang your hat on is useless, because you would rather trust the taste buds of strangers than actual facts.

And the DMS "experiments" were just bad experiments. What exactly is the control? How much DMS was in the wort to begin with? It's funny that you say you don't care about "needing a million-dollar machine to discern a difference" and then quote the absolutely useless testing they did on the sample as gospel. Their lab testing was 100% useless. Not even a question or argument to be had.
 
Bit of a controversial title, but hear me out. I've got some buddies who own commercial breweries who I shoot the ish w/ once in a while. I've brought up things like "cold break," where the master brewer of many years told me "I've heard of that term. Really don't know what it is." in a fashion that I could only describe as the same way I would regard talk from a flat earth theorist. On another occasion I referred to some Brulosophy experiments as well as HBT posts about dry hop length and some other stuff I can't really remember, and he told me, "Yeah, you can't really believe those Brulosophy posts. It's not real science."

This has led me to believe that perhaps professional brewers look down on said websites. Has anyone else experienced this? What's the deal here?

Are there any professional brewers on here with contrary opinions about the experiments on Brulosophy and other homebrew websites? And is professional brewing really such an esoteric field that the rules start changing?

Thanks!:mug:

Marshall is a dear friend and I respect and appreciate what he's done. But too many homebrewers take it as the last word, rather than a single data point. The key to science is repeatability. Someone does an experiment, then others do it to verify the results. If there's only one trial, then you can't really draw a conclusion. At Experimental Brewing, we try to get around that with multiple testers and a lot more tasters, but that has its own problems. In short, look at these experiments as a starting point for your own exploration. Trying to convince another brewer, whether homebrewer or commercial, that they're the last word is not only misleading, it's not how any of us intend the experiments to be used.
 
Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.

The temp-controlled beer had an average score 11 points higher than the non-controlled beer.

The problem I have with this experiment is the volume of the fermenters. Did they both contain the same amount of trub? Does the yeast multiply or behave differently between 10 gallons and 5 gallons? Is the amount of yeast proportional to each sample's volume? Is there the same amount of headspace in each fermenter? Were they both kept in the dark, or did one receive more light than the other? Even this simple experiment can have too many variables, which could change the final results.
 
I haven't read their fermentation temp experiment, but that seems like a critical part of the process from my experience - depending on the yeast. S-04 works fine in the mid-60s, but is gross if it goes over 70. I also think you can coax different characteristics out of some yeasts by changing the temperature. I have a Belgian yeast that seems to get a lot more barfy (Belgian) at higher temps and is more mellow fruity at lower. I've never done a side by side or even the same beer twice to compare, though.

Do you control temp at all to keep it in ale range, or just let it do what it does?

I've used s-04 a few times and gave up on it, it was gross.

I just let my buckets sit in my basement, which is stable. I have used cool brewing bags to get the temps closer to 15C for the first few days. It works fine for me even at 18C. Referring to 34/70.
 
What I question personally is whether we should place much validity in their results when they expect John Q. Randomguy -- who might know little or nothing about beer -- to be able to detect differences between two beers at a 95% confidence level. But in my view, if we take a more loose approach and only expect John Q. as well as all the other various experienced tasters to detect a difference an average of maybe about 80% of the time,
You are confusing confidence level with preference level. If 10 out of 20 panelists qualify and 6 of them prefer beer B, then we say "60% of qualified tasters preferred beer B at the 2% confidence level." The confidence level is the probability that 6 or more would have preferred B given that A and B are indistinguishable. It gives us the information we need to decide whether this data is a valid measurement of the beer/panel combination, or just the result of random guesses necessitated by an unqualified panel or by the fact that the beers are actually identical in quality. If the probability that the data we obtained came from random guesses is only 2%, we are pretty confident that our data did not come from random guesses, and it is probably true that 60% or more of qualified tasters will prefer beer B.

Thus we can estimate that 60% of qualified tasters prefer a beer and be very confident (p < 0.001, i.e. 0.1%), moderately confident (p < 0.05, i.e. 5%), or not too confident at all (p < 0.2, i.e. 20%) that our observation of 6/10 wasn't arrived at through random events.
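To put numbers on that description, here is a minimal sketch (Python with scipy; my own illustration, and not necessarily the exact combined calculation behind the 2% figure quoted above) of the two binomial tails involved:

```python
# Minimal sketch of the binomial arithmetic behind such confidence statements.
# Under the null hypothesis the panel is guessing: each taster picks the odd
# beer with probability 1/3, and each qualifier prefers either beer with
# probability 1/2.
from scipy.stats import binom

n_panel = 20    # triangle test panelists
n_correct = 10  # how many picked the odd beer ("qualified")
n_prefer = 6    # how many of those preferred beer B

p_qualify = binom.sf(n_correct - 1, n_panel, 1 / 3)  # P(>= 10 of 20 correct by chance)
p_prefer = binom.sf(n_prefer - 1, n_correct, 1 / 2)  # P(>= 6 of 10 prefer B by chance)

print(f"P(>= {n_correct}/{n_panel} correct by guessing) = {p_qualify:.3f}")
print(f"P(>= {n_prefer}/{n_correct} prefer B by chance) = {p_prefer:.3f}")
```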



this experiment has 95% confidence that there seems to be a difference", then with an 80% bar instead of 95%, this lower bar is easier to meet or to "qualify" for further experimentation, rather than rehashing the same old "nope, we didn't achieve 'statistical significance' yet again". Statistically, if they only expect to be right about 80% of the time instead of 95%, the results reported should prove more interesting, at least in my own chumpy eyes.
The higher the confidence demanded (i.e., the smaller the p value), the less likely we are to conclude that one beer is better. We want very small confidence level numbers. They make us feel, well, confident that we can reject the hypothesis "this panel can't tell these beers apart".

There's more to it than just confidence levels, though. If we pick a certain confidence level (let's say p = 0.01, or 1%, which is roughly midway between what is generally considered the largest acceptable value, 0.05, and what is often considered the lowest level of interest, p = 0.001), that choice has implications for how well our test will perform. Recall that, overall, we have made some change to our brewing process and want to know whether this makes better beer. Before we can perform any experiments we must define what 'better' means. For the purposes of the current discussion, suffice it to say that
'better' means preferred by more 'qualified' tasters than a beer which is not so good. This, of course, requires a definition of 'qualified', and there has been discussion of that. So if we have beer A, beer B and beer C, with A preferred by 60% of tasters, B preferred by 70% of tasters and C preferred by 85% of tasters, we say that B is a better beer than A and C is a better beer than A or B. It should not come as a surprise that the performance of a test depends on the strength of this 'goodness', but it also depends on the design of the test and the threshold we choose.

The way we describe the performance of a test comes from the RADAR engineers of WWII. A RADAR set sends out RF energy and measures the strength of the returned signal over time. If the received signal is strong enough to exceed a threshold (under control of the operator in the early days) we decide a target is present; if it doesn't, we decide no target is present at the range and azimuth to which the set is listening. The operator's threshold setting is relative to the noise level inherent in the environment. If the operator sets the threshold at the noise level, then noise-related voltages will exceed the threshold a large proportion of the time and the scope will fill up with 'false alarms', that is, detection decisions made when there is no target present. If he sets the threshold well above the noise, only the strongest signals will exceed it and returns from smaller ones will not be detected (false dismissals).

In the beer test, target detection is the conclusion that one beer is better given the data (voltage) we obtain from the test. A good test gets more 'signal' from the beer relative to the noise, which is caused by panelists' inability to be perfect tasters, or by the inclusion of panelists less qualified than we might like, and by the fact that some beers are only a little better than others while some are much better.

The performance of a test against a particular signal in a particular noise environment is well represented by a "Receiver Operating Characteristic" (ROC), which is a curve like the ones on the graph. The curve with the open circles represents the performance of a triangle test with 20 panelists, where the probability that a panelist is qualified is 50% and the probability that the modified beer is better than the unmodified one is 60%. The vertical axis shows the probability that such a test will find the beer better; each point on the curve is labeled with the value of p used to reject the null hypothesis (the confidence level). At the left end of the curve we demand that the probability of random generation of the observed data be very small. Under those conditions we do not detect better beer very readily. At the other end we accept decisions based on little confidence (high p) and thus detection is almost certain.

The horizontal axis shows the probability that we will accept that the beer is better given that it isn't (false alarm). Thus, as with the RADAR, the probability of detecting the hypothesis "The doctored beer is better" becomes higher with reduced threshold whether the beer is actually better or not.

The point at the upper left-hand corner, where the probability of detection is 1 and the probability of false alarm is 0, represents a perfect receiver (test). The closer one is to that point, the better the test for the given beer. This suggests that one's choice of threshold should be the one that gets us closest to that corner (p = 0.059 for the circle curve), but that may not always be the right choice. In medicine, for example, the cost of missing a diagnosis may be high (the patient dies and the doctor gets sued) whereas a high false alarm rate is not such a bad thing: the expected loss from a malpractice suit goes down, while at the same time the doctor can charge for additional tests to see if indeed the patient does have the disease. Threshold choice is often determined by such considerations. In broad terms, the farther away the ROC curve is from the dashed line, the better the test.
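To make the ROC idea concrete, here is a minimal sketch (Python with scipy; my own illustration rather than anything AJ posted) of the detection half of a triangle test, leaving out the preference stage: sweeping the vote-count threshold traces out one (false alarm, detection) point per threshold.

```python
# Minimal sketch: analytic ROC for the detection stage of a triangle test.
# Assumption (mine, not AJ's): a fraction q of panelists always pick the
# odd beer correctly; the rest guess with probability 1/3.
from scipy.stats import binom

n = 20   # panel size
q = 0.5  # fraction of panelists who can genuinely distinguish
p_h0 = 1 / 3            # per-panelist success rate if the beers are identical
p_h1 = q + (1 - q) / 3  # success rate when the beers really do differ

for k in range(n + 1):                # declare "different" if >= k pick correctly
    p_fa = binom.sf(k - 1, n, p_h0)   # false alarm: beers same, test says different
    p_d = binom.sf(k - 1, n, p_h1)    # detection: beers differ, test says different
    print(f"k={k:2d}  P_fa={p_fa:.3f}  P_d={p_d:.3f}")
```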

The curve with inverted triangles is the ROC for the test mongoose suggested would be better than a triangle test, given the same parameters as the curve with circles, i.e. 20 panelists with a 50% probability that a panelist qualifies and the beer better at the 60% level. The curve with the upright triangles shows the effect of increasing the panel size to 40, and the curve with squares the effect of making the beer more preferable, increasing, therefore, the panel's ability to distinguish.

I'm not going to say more in this post as I expect I'll be coming back to this.

[Attached chart: TriangleROC.jpg, the ROC curves described above]
 
About some of it, no. But about qualifying for panels, absolutely.
I think we do agree about qualifying the panel in most cases.




Here's why I do think that [that allowing guesses is a detriment], and it's not a statistical reason; it's a measurement reason. It flies in the face of common sense that one would use, in a test of preference, people who demonstrably cannot make a preference decision.
They're guessing!
The thing that you don't seem to be able to grasp is that if you are trying to see how a proposed change in your beer will affect its sales, and assay to do that with a taste panel, then that panel had better reflect your market and not the scientists in your QC department. This, I would think, is obvious and I shouldn't have to say more, but I'll repeat the example I have offered before.

If you want to try a cheaper malt, brew a pilot batch with it and test it against your normal product using a panel that is highly qualified in beer tasting, then it is probable that they will be able to detect a difference: p will probably be small and you will probably decide not to market the new beer, thus losing the opportunity to save money and increase profits. Your decision to use the qualified panel has led you astray. You have made a mistake.

If, OTOH, you empanel people selected from your customer base, most of whom are not as good beer judges as the guys from your QC dept., p is likely to be larger (this panel's selections will be more random) and you are not so likely to reject H0. You conclude that you can 'get away' with using the cheaper malt in this market. Profits go up and you get a bonus.

You want people who demonstrably can make that distinction doing such preference evaluation.
Sometimes, even often (see, we do agree) but not in the case I just laid out. It depends on what your investigation is about. And that is my recurring theme here.

I will also point out, again, that noise is inevitable - even with a 'qualified' panel and that the power of a triangle (or quadrangle or....) test is that it improves signal to noise ratio. See my post on ROC's.
 
Bit of a controversial title, but hear me out. I've got some buddies who own commercial breweries who I shoot the ish w/ once in a while. I've brought up things like "cold break," where the master brewer of many years told me "I've heard of that term. Really don't know what it is." in a fashion that I could only describe as the same...
Thanks!:mug:

There is very little that doesn't change between home brewer and "Pro". I can safely say I have not heard the phrase "cold break" since home brewing. As to dry hopping, there is zero comparison. If I were to dry hop a batch for 2 days, I might as well not bother wasting my time. Going from 10-15 gallons to over 250 gallons, many variables change other than just quantities. I for one have never been a fan of technical jargon. That being said, we do actually make a few 5-10 gal batches; they are primarily there for yeast starters. However, we have played with a few and created some beers that were so well received that they went on to become part of the regular rotation. So don't get so hung up on terminology and experiments; a lot of it is just intuition or dumb luck. But it's always about repeatability.
 
Were the judges blind to the variable? Did they know which was which? How careful were you to make sure that none of them knew which was which? I think it would be cool if you did that again as a blind triangle test. I suspect ale yeast might be more reactive to temperature than lager, just a hunch.

Sadly, there wasn't enough "blindness" in the testing process, and there were other confounding variables (I kegged; the other brewer bottle-conditioned), etc. So I cannot claim it is truly scientific, not anywhere near the level of Brulosophy.

But I tasted both beers and definitely perceived a difference lol...
 
@ajdelange,

Wow, thanks for that chart! I can finally see, visually, how for many common tests, aiming for the upper left corner as you suggest makes a p value of around 0.05 close to optimal, just as Brulosophy has chosen. Thanks for this.

On the other hand, the more xbmts they run that conclude "not statistically significant", the more people may tend to ignore them or dismiss their conclusions as incorrect or quacky, possibly even at the cost of sponsorships, book sales, or whatever, in a manner similar to your medical malpractice scenario (although I'll admit this is extremely unlikely).

My argument remains that a few more false alarms might not be such a terrible thing, if they might encourage more of us to run even more xbmts on our own to support/refute/learn for ourselves. More maybes and more false alarms might just excite people more than "couldn't tell the difference... again...".

Maybe.

Maybe I don't need to play this broken record anymore. Maybe I'll be quiet now. Maybe.

Cheers all.
 
Sadly, there wasn't enough "blindness" in the testing process, and there were other confounding variables (I kegged; the other brewer bottle-conditioned), etc. So I cannot claim it is truly scientific, not anywhere near the level of Brulosophy.

But I tasted both beers and definitely perceived a difference lol...

Hi friend, that last post came off wrong and too aggressive, and I am sorry for that. I appreciate your opinions and your contributions to this forum very much.
 
Marshall is a dear friend and I respect and appreciate what he's done. But too many homebrewers take it as the last word, rather than a single data point. The key to science is repeatability. Someone does an experiment, then others do it to verify the results. If there's only one trial, then you can't really draw a conclusion. At Experimental Brewing, we try to get around that with multiple testers and a lot more tasters, but that has its own problems. In short, look at these experiments as a starting point for your own exploration. Trying to convince another brewer, whether homebrewer or commercial, that they're the last word is not only misleading, it's not how any of us intend the experiments to be used.

I love your show! Thanks.
 
I think we do agree about qualifying the panel in most cases.

Apparently not as much as you might think.

The thing that you don't seem to be able to grasp

I'll let the insult slide. This is not the first time you've decided to take the low road, my friend.

is that if you are trying to see how a proposed change in your beer will affect its sales

OK, I think we're done here. It's pretty much a universal truth that when your interlocutor decides to change the discussion to another argument, it's a sign he/she doesn't have much with which to respond to the original argument.

and assay to do that with a taste panel, then that panel had better reflect your market

Now you want this to be about the market? I noted this whole issue a long time ago when I pointed out the inability to know to whom the sample of tasters is generalizable.

Further, now you want this to be about the market and not about whether the beers are different?

<Snip silliness in the context of the argument>

I will also point out, again, that noise is inevitable - even with a 'qualified' panel and that the power of a triangle (or quadrangle or....) test is that it improves signal to noise ratio. See my post on ROC's.

If you want to argue that "noise is inevitable" without understanding that there are ways to reduce it and the desirability of doing so, then there's not much point in continuing.

Our discussion is at an end, AJ. If you're going to present a moving target, there's no point.

You can have the last word.
 
Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.



The temp-controlled beer had an average score 11 points higher than the non-controlled beer.


How many tasters?
 

Not only is that not clear from the way they do it, you point out one of the elemental difficulties with having a one-shot guess "qualify" tasters for the preference test.

Show me you can pick the odd-one-out three or more times in a row, and I'll believe you can detect a difference....and you are qualified to go to the next level.

Guessers cannot tell the difference; why would anyone want them judging preference, and guessing on that too?

Bingo
 
Hi friend, that last post came off wrong and too aggressive, and I am sorry for that. I appreciate your opinions and your contributions to this forum very much.

No worries. I love to debate. I don't take things personally... As a wise man [me] once coined a phrase:

Offense can never be given; it can only be taken.

How many tasters?

We had about 10-11 tasters giving informal feedback. Only about 4 filled out BJCP sheets. So as I said, not exactly scientific, and the confounding variables were an issue. It wasn't anywhere near the level of what Brulosophy does.
 
My argument remains that a few more false alarms might not be such a terrible thing, if they might encourage more of us to run even more xbmts on our own to support/refute/learn for ourselves.
And I agree with this enthusiastically. As there was a benefit for the doctor in a higher false alarm rate (in that he can perform additional and potentially even more expensive tests and avoid a lawsuit) so is there a potential benefit for brewers in raising the acceptable level of p if it drives more investigation.

I believe the p value should be discarded in these types of trials.

But not so much as all that. Surely this comment was made in jest. Without consideration of p the test is useless. The results are like the statistics you see in the news.
 
No worries. I love to debate. I don't take things personally... As a wise man [me] once coined a phrase:

Offense can never be given; it can only be taken.

Ha! How true! That's one of the best phrases in this entire thread. Thanks man.
 
In No. 205 I presented some ROC curves illustrating the effectiveness of binary and triangle tests and mentioned that the triangle is more powerful than the binary because it improves signal-to-noise ratio. In No. 206 I stated that a quad test should be even more powerful than a triangle test because it increases SNR even more. Since posting the curves I have run a Monte Carlo on a quad test for the same conditions as the other tests, i.e. a panel of 20 out of whom 10 qualify and 6 prefer the new beer. As expected, the quad test performs better than the triangle for those conditions. The ROC curve lies about midway between the curve with the squares and the curve with the triangles. The curve with triangles represents a triangle test with 40 panel members, half of whom qualify and 60% of whom prefer. Thus a quadrangle test with 20 panelists gives slightly better performance than a triangle test with 40 equally qualified panelists. In an earlier post I had suggested that perhaps the reason we use triangle tests instead of quadrangle tests is that, while the quad test is a better test, it is not better by enough to justify the extra effort required to juggle four samples as opposed to three. It appears that for this particular case adding the extra cup is slightly better than doubling the panel size!
 
We had about 10-11 tasters giving informal feedback. Only about 4 filled out BJCP sheets. So as I said, not exactly scientific, and the confounding variables were an issue. It wasn't anywhere near the level of what Brulosophy does.

That makes Brulosophy look like robust, rigorous science!
 
You can have the last word.
Here it is:

I'll let the insult slide.

Offense can never be given; it can only be taken.

This is not the first time you've decided to take the low road, my friend.
If basing my position on sound principles, supporting my conclusions with examples and data (though it be simulated) and explaining them to the best of my ability be the low road, I'll take it. And I'll probably get to Scotland afore ye too, with Scotland representing a fuller understanding of triangle testing than I've ever had before. That's why I find it so disappointing that you wish to withdraw based on what are clearly misunderstandings of my posts.

To wit:
OK, I think we're done here. It's pretty much a universal truth that when your interlocutor decides to change the discussion to another argument, it's a sign he/she doesn't have much with which to respond to the original argument.
I have never changed the argument. The central theme of all my posts was stated in my first post in this thread (#61) as
...the selection of the panel which must be driven by what one is trying to measure.
and this was repeated many times in other posts, perhaps phrased differently, but I feel it should have been clear that the design of the test depends greatly on the nature of the investigation. Whether a particular reader is able to grasp that or not is immaterial as long as most readers do.


Now you want this to be about the market? I noted this whole issue a long time ago when I pointed out the inability to know to whom the sample of tasters is generalizable.
My use of the market as an example of a case where we are interested in the verity of the null hypothesis, rather than its alternate, is hardly new to the recent posts. In No. 61 I said
Then we get into questions of how well these 20 guys represent the general public (or whatever demographic the experimenter is interested in - presumably home brewers).
I hope that you will grant me that a brewery's market is included in "whatever demographic".

Further, now you want this to be about the market and not about whether the beers are different?
No. I want it to be, as I have said all along, about whatever the investigator is interested in investigating.


<Snip silliness in the context of the argument>
If you think something silly please say why it is silly.

If you want to argue that "noise is inevitable" without understanding that there are ways to reduce it and the desirability of doing so, then there's not much point in continuing.

In No. 205 I presented ROC curves for binary and triangle tests and described in No. 206 where a quad test would fall on that chart. I explained that the higher-order tests perform better because they increase signal-to-noise ratio and mentioned in earlier posts that this is because they reduce noise. You must not have read any of these posts, as I don't see how you could possibly conclude that I don't understand noise and ways to reduce it if you had. For background, I'll point out that I spent the better part of 45 years characterizing noise and figuring out how to combat it.

If you did read those posts and based on them conclude that I don't know anything about noise reduction then, alas, I fear there is no basis for further dialogue. But there seem to be some others here learning something about this kind of testing from my posts and I'll continue to post any additional findings or insights for their (and my) benefit.
 
This sounds awful and I'm glad the breweries near me aren't this way. We have 4 local breweries and all 4 of them are actively involved with the local homebrew club. They sell us grain at wholesale prices, they each host us at least once a year, they attend 1 or 2 meetings a year, and they sponsor homebrew competitions where the winning beer is brewed on their system.

It really is awful... I wish more local breweries here took more of a "hand in hand" approach with the local homebrew scene. There is only one of them that actively, year after year, stands with the homebrewers with contests and such.

I wish there were more. I have often thought that having a brewery with a brewery swag shop that also provides basic homebrew supplies and grain at bulk prices (and even clone kits of one or two of their beers), along with some "guest" homebrewer batches/classes, would be brewing utopia. I know I would be a loyal patron.

It's one of my game plans to incorporate this idea if I ever pull the trigger on my nano. <insert trademark here>
:mug:
 
Thus a quadrangle test with 20 panelists gives slightly better performance than a triangle test with 40 equally qualified panelists. . . It appears for this particular case adding the extra cup is slightly better than doubling the panel size!

And the tester saves 40 cups of beer.
 
It really is awful... I wish more local breweries here took more of a "hand in hand" approach with the local homebrew scene. There is only one of them that actively, year after year, stands with the homebrewers with contests and such.

I wish there were more. I have often thought that having a brewery with a brewery swag shop that also provides basic homebrew supplies and grain at bulk prices (and even clone kits of one or two of their beers), along with some "guest" homebrewer batches/classes, would be brewing utopia. I know I would be a loyal patron.

It's one of my game plans to incorporate this idea if I ever pull the trigger on my nano. <insert trademark here>
:mug:

We had one of these in Denver: Dry Dock Brewing. I think they closed the homebrew side.
 
And the tester saves 40 cups of beer.

Good point. And with respect to management of the samples, it's a question of fiddling with 20*4 = 80 vs. 40*3 = 120 cups of beer, which has got to be easier, so one wonders why there is no quadrangle test. Before we get too excited, let's keep in mind that this result represents one particular set of circumstances (panel size of 20, probability of qualification 50%, probability of preference 60%). Perhaps if we examined a wider range of circumstances we would not find the gain so great. Something to look into, though.
 
I wish there were more. I have often thought that having a brewery with a brewery swag shop that also provides basic homebrew supplies and grain at bulk prices (and even clone kits of one or two of their beers), along with some "guest" homebrewer batches/classes, would be brewing utopia. I know I would be a loyal patron.

Ballast Point in San Diego does this (to an extent; they have a homebrew store in downtown San Diego, but I think the amount of equipment/supplies at the actual brewery is very limited). Phantom Ales in Orange County does more of what you're describing, with the two being actual linked businesses.

I think it makes tremendous sense. They're a brewery, so they're already buying ingredients in bulk and have economies of scale there. It gives homebrewers a reason to pop in to Phantom when they might not otherwise: to buy ingredients and have 1 [or 4] pints and some food while they're there.

Good point. And with respect to management of the samples, it's a question of fiddling with 20*4 = 80 vs. 40*3 = 120 cups of beer, which has got to be easier, so one wonders why there is no quadrangle test. Before we get too excited, let's keep in mind that this result represents one particular set of circumstances (panel size of 20, probability of qualification 50%, probability of preference 60%). Perhaps if we examined a wider range of circumstances we would not find the gain so great. Something to look into, though.

I wonder if it's partly due to how hard it is sometimes to even qualify the panel and achieve statistical significance with a triangle test. As has already been discussed, the tasting panels are not exactly perfectly chosen per your guidelines (i.e. if you're testing for diacetyl, pre-qualify the panel to determine who is sensitive to diacetyl).

With a quadrangle test, yes, you'd require fewer testers to correctly pick the odd beer to achieve significance, but I would worry that with a small panel you'd find even fewer experiments achieving significance. I don't know the math on this, but think of this as an example:

Triangle: 24 testers, you need 13 correct for p<0.05
Quadrangle: 24 testers, you need 11 correct for p<0.05

This seems better, but when you think of the guessing scenario, pure guessing would suggest 8 tasters in a triangle test would blindly get it right. Pure guessing would suggest 6 tasters in a quadrangle test would blindly get it right. Which means in both cases you need 5 people above and beyond pure guessing to achieve significance, and they've already proven that's hard to achieve with a triangle test.
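As it happens, those example thresholds check out. A quick verification sketch (Python with scipy; hypothetical code, not anything from the thread) that finds the smallest number of correct picks giving p < 0.05 under pure guessing:

```python
from scipy.stats import binom

n = 24
for name, p_guess in [("triangle", 1 / 3), ("quadrangle", 1 / 4)]:
    # smallest k with P(X >= k) < 0.05 when every taster is guessing
    k = next(k for k in range(n + 1) if binom.sf(k - 1, n, p_guess) < 0.05)
    print(f"{name}: need {k} of {n} correct; chance alone averages {n * p_guess:.0f}")
# prints 13 for the triangle and 11 for the quadrangle, matching the example above
```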

I think it would be really cool to see Brulosophy use the same experimental batches on two different panels of testers, run once as a triangle and once as a quadrangle.

My gut instinct (which isn't statistics, I know lol) is that although the implications of statistical significance would be stronger if the panel qualified at p<0.05 in a quadrangle than in a triangle test, the likelihood of achieving significance is lower, because a quadrangle is IMHO a more difficult selection than a triangle.
 
Imo, the path to better brew is the water. Sitting around mathematically working over and justifying the (to me) obvious results does not make sense to me. They don't "prove" anything, but to an eager and open mind they demonstrate great information for the commercial and home brewer.
 
I wonder if it's partly due to how hard it is sometimes to even qualify the panel and achieve statistical significance with a triangle test. As has already been discussed, the tasting panels are not exactly perfectly chosen per your guidelines (i.e. if you're testing for diacetyl, pre-qualify the panel to determine who is sensitive to diacetyl).
Though I have apparently not made it clear, the main theme in my posts has been that what you do depends on what you are trying to measure. In cases where the object is to see if the process change has decreased diacetyl, then it seems that we would want panelists who are sensitive to diacetyl. If the object is to detect whether the process change affects preference for the beer among some group of people, then the panel does not need to be qualified, other than to make sure that it is representative of the group you are trying to measure.

Calibrating a panel and going to a quadrangle rather than a triangle do the same thing: they increase the number of qualified votes. Increasing the number of qualified votes by rejecting the votes of unqualified tasters is beneficial, but it does, as you point out, reduce the number of votes counted. Going to the quadrangle is justified if the benefit of a higher percentage of qualified votes offsets the loss from fewer total votes counted. Calibrating (pre-qualifying) the panel is like voir dire: you remove a juror who you think will give the wrong answer but replace him with one you like better. You still have the same number of votes.

With a quadrangle test, yes, you'd require fewer testers to correctly pick the odd beer to achieve significance, but I would worry that with a small panel you'd find even fewer experiments achieving significance. I don't know the math on this, but think of this as an example:

Triangle: 24 testers, you need 13 correct for p<0.05
Quadrangle: 24 testers, you need 11 correct for p<0.05

This seems better, but when you think of the guessing scenario, pure guessing would suggest 8 tasters in a triangle test would blindly get it right. Pure guessing would suggest 6 tasters in a quadrangle test would blindly get it right.

Keep in mind that those are average numbers that would be obtained by averaging the results from many tests. There is a finite probability that none of the panelists, or all of them, might correctly pick the odd beer. It is much more likely, however, that 7, 8 or 9 will than 0 or 24.

At this level of consideration the math is complicated enough that you really have to run numbers to gain insight (or, at least, I do; a good mathematician/statistician might be able to give more robust explanations). The ROC represents, IMO, a great way to characterize the worth of a test and, though first invented for RADAR design, finds lots of application in today's world for much more exotic explorations in artificial intelligence, data mining, etc. In a previous post I introduced them and put up a chart with a couple of them on it. In that post I mentioned that a perfect test plots in the upper left-hand corner and a test where no information is gained traces out the diagonal dashed line. Intermediate-case ROCs are curves that go through the lower left-hand corner and upper right-hand corner and which are bowed towards the upper left-hand corner. Rather than plot more curves I've come up with a Figure of Merit (FOM), which is 100 * (1 - distance_of_closest_approach_to_corner/0.707). 0.707 is the closest approach to the corner for the dashed line. Thus an ROC that lies close to the dashed line (weak test) has an FOM near 0 and one that approaches perfection has an FOM approaching 100.

For your examples:
Triangle test Npanel: 24, Prob(qual): 0.50; Prob(Pref): 0.50; p best perf.: 0.0613, Dist from (0,1): 0.3529, FOM: 50.1 (100 = perfect).
Quadrangle test Npanel: 24, Prob(qual): 0.50; Prob(Pref): 0.50; p best perf.: 0.0316, Dist from (0,1): 0.1793, FOM: 74.6 (100 = perfect)

These data show that, for a panel of 24 whose members are only able to pick the odd beer correctly half the time, when the beer is only preferred at the 50% level (meaning that 50% of tasters who can tell it is different like it better), the loss of two votes (on average) is more than offset by the gain from having the accepted votes come from qualified tasters. The FOM improves from 50 to 75. But now let's suppose that we have a panel that is better qualified, in the sense that 60% of them can correctly pick the odd beer as opposed to 50%.

Triangle test Npanel: 24, Prob(qual): 0.60; Prob(Pref): 0.50; p best perf.: 0.0316, Dist from (0,1): 0.1743, FOM: 75.3 (100 = perfect).

This shows that only a small improvement in the proficiency of the panel gets us to the same FOM as going to a quadrangle test, which suggests that in the types of tests where we are interested in, for example, diacetyl, getting a better qualified panel may be preferable to the added complexity of dealing with four samples per panelist.
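For reference, the FOM just described is easy to compute from any set of ROC points. A minimal sketch (Python with numpy; the function name is mine, not AJ's):

```python
import numpy as np

def figure_of_merit(p_fa, p_d):
    """FOM = 100 * (1 - d_min/0.707), where d_min is the ROC's closest
    approach to the perfect-test corner (P_fa, P_d) = (0, 1) and 0.707
    is the closest approach of the diagonal (no-information) line."""
    d_min = np.min(np.hypot(np.asarray(p_fa), 1.0 - np.asarray(p_d)))
    return 100.0 * (1.0 - d_min / 0.707)

# Illustrative check against the triangle example above, which reported a
# closest-approach distance of 0.3529: a single point at that distance
# gives FOM = 100 * (1 - 0.3529/0.707), about 50.1.
print(figure_of_merit([0.3529], [1.0]))
```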



I think it would be really cool to see Brulosophy use the same experimental batches on two different panels of testers, run once as a triangle and once as a quadrangle.

It would be interesting for sure but I'm not sure what one could conclude from a single test like that.

Note that in coming up with the ROC curves I do exactly what you propose, but do it in a computer and do it thousands of times (100,000 to be exact). Each 'panelist' picks the odd beer with the probability I specify, and the ones that choose the right one then get to pick one or the other of the two beers as the preferred. The three numbers (number of panelists, number that chose correctly and the number that preferred) then go into the confidence calculation, and confidence level statistics are accumulated.
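In outline, that procedure looks something like the sketch below (Python with numpy/scipy; the parameter names and the particular confidence calculation are my illustrative assumptions, not AJ's actual code).

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)

def one_trial(n_panel=20, p_correct=0.5, p_prefer=0.6):
    # Each panelist picks the odd beer with probability p_correct.
    n_correct = int((rng.random(n_panel) < p_correct).sum())
    # Those who picked correctly then state a preference for the new beer.
    n_prefer = int((rng.random(n_correct) < p_prefer).sum())
    # Confidence that this many correct picks could arise from guessing at 1/3.
    p_value = binom.sf(n_correct - 1, n_panel, 1 / 3)
    return n_correct, n_prefer, p_value

# Accumulate confidence-level statistics over many simulated panels.
p_values = np.array([one_trial()[2] for _ in range(100_000)])
print("mean p over trials:          ", p_values.mean())
print("fraction of trials, p < 0.05:", (p_values < 0.05).mean())
```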

My gut instinct (which isn't statistics, I know lol) is that although the implications of statistical significance would be stronger if the panel qualified at p<0.05 in a quadrangle than in a triangle test, the likelihood of achieving significance is lower, because a quadrangle is IMHO a more difficult selection than a triangle.
Because the quadrangle is a more stringent test, the probability of a given result arising by random guessing is lower than it would be for a less stringent test (triangle). That's where the quadrangle attains its apparent advantage. The Monte Carlo runs for your example confirm this: for the triangle test the average confidence level was p = 0.048, whereas for the quadrangle it was p = 0.016.
 
Imo, the path to better brew is the water. Sitting around mathematically working over and justifying the (to me) obvious results does not make sense to me. They don't "prove" anything, but to an eager and open mind they demonstrate great information for the commercial and home brewer.


This is all over the place. Ignore the statement regarding water.

You say that examining the results scientifically ("mathematically") does not make sense.

You say the results are obvious.

You say the results don't prove anything.

You say that an open mind will understand the results and provide great information.


How would anyone know what information is to be gleaned from the "exbeerments" if they aren't examined?

How could you even say that water is the pathway to better beer without gathering and interpreting data? You can't.
 
The near entirety of what you put in your mouth is water. Period. I'll double down on the importance of water. As far as the rest, I feel there is some misunderstanding.
 
Water is one of several stops (often the last one) on the way to very good beer but other things are just as important or more so. Good water is a sine qua non for good beer but good water is easily obtained by adding 1 gram of calcium chloride to each gallon of RO water used (mash and sparge). Brew with water like that to let you get a handle on grist design and preparation, mashing, hopping and fermentation. Then when those are under control you can come back and tweak the water if you like but don't expect dramatic differences.

If a result is obvious (for example adding roast barley to grist darkens beer) then you don't need a statistical test. The value of panel tests is only realized when the result is not obvious or you suspect conclusions are being driven by confirmation bias. In those cases a triangle test can help you determine whether what you have observed supports the hypothesis you are testing or not. It's always easier to make that determination by increasing the sample size but increasing the sample size isn't always an option. And it doesn't remove confirmation (or any other kind of) bias.
 
If the object is to detect whether the process change affects preference for the beer among some group of people, then the panel does not need to be qualified, other than to make sure that it is representative of the group you are trying to measure.

I really appreciate the scientific rigor you are trying to bring to this debate. I do have a question and a comment.

Question: How many of the Brulosophy experiments have you read (more than the headline and results)... actually read the full write-up? Actually, this question goes for other posters in this thread too. I see a lot of comments regarding the experimental design that don't seem to have really read the reports.

In reading the vast majority of the experiments, it seems to me that Marshall and crew are focused on detecting whether process or ingredient changes result in a perceptible difference. Everything else is intended to guide thinking about future experiments.

While I am a firm believer in fermentation temperature control, I do find it interesting that their experiments show that some other things I took as largely irrelevant may lead to perceptible changes in the beer that are easier for typical homebrew drinkers to detect than control of fermentation temperature.

Take, for example: tasters were able to distinguish between beer brewed with WLP001 and US-05, but were not able to distinguish between beer brewed with Galaxy and Mosaic hops. Tasters saw a difference between glass carboy and corny keg fermentation, but did not see a difference between chocolate malt and Carafa Special II in a Schwarzbier. In all of these examples I am much more impressed by whether people could detect a difference than by whether the qualified group preferred one over the other. When I design a recipe it is my preference that counts, but I will likely pay more attention to the choice of WLP001 vs US-05 in the future.

Yes, these are all single experiments and I'd also prefer to see them repeated before changing tried and true processes. That said, all of them are presented with sufficient detail in the reports that any of us could take on the challenge of trying to repeat them.
 
Yes, these are all single experiments and I'd also prefer to see them repeated before changing tried and true processes. That said, all of them are presented with sufficient detail in the reports that any of us could take on the challenge of trying to repeat them.

That's exactly what Brulosophy and Drew and I at EB expect people to do. We're not presenting what we feel are scientific conclusions in the way AJ is describing. We're presenting the results of our experiments as starting points for further exploration. We're all frustrated that people take them otherwise.
 
Yes, these are all single experiments and I'd also prefer to see them repeated before changing tried and true processes. That said, all of them are presented with sufficient detail in the reports that any of us could take on the challenge of trying to repeat them.

That's what I take from them. For instance, I used 34/70 at warmer temps than traditional lagers and I'm very, very happy with the results.

Science is all good and stuff, but I'm just trying to make better beer! Most of the real science is already done by the people who make the ingredients.
 
Question: How many of the Brulosophy experiments have you read (more than the headline and results)... actually read the full write-up? Actually, this question goes for other posters in this thread too. I see a lot of comments regarding the experimental design that don't seem to have really read the reports.

Not a single one! And why not? Because my goal has not, up to this point, been to critique what Brulosophy did or didn't do, but rather to point out things like the importance of designing the test according to the information sought and the demographic it is sought from (a triangle test is a test of the panel and the beer), and some of the pitfalls, such as failure to mask a differentiating parameter that is not being investigated and possible procedural errors (e.g. failure to isolate panelists from one another and failure to randomize the order of presentation of the cups). The only comment I made about Brulosophy in particular was that if any criticism of what they did was justified, it was probably in one of those areas, and I'd say that about anybody I knew was doing a triangle test.

I doubtless will read some of their reports in detail at some point, but thus far my posts are about triangle testing, not Brulosophy's skill in implementing them.


In reading the vast majority of the experiments, it seems to me that Marshall and crew are focused on detecting whether process or ingredient changes result in a perceptible difference. Everything else is intended to guide thinking about future experiments.
That is certainly a reasonable application. It has been suggested here that when a difference is 'detected' but with poor confidence, it sends the message that further testing is warranted. That is certainly a valid interpretation. Rather than comment on Brulosophy's selection of a particular confidence level for a declaration of 'detection', at this point I would rather emphasize that we can't detect that there is a difference; we can only estimate the probability of the observed data under the assumption that there is no difference. The other equally important part of the question is "By whom?".


While I am a firm believer in fermentation temperature control, I do find it interesting that their experiments show that some other things I took as largely irrelevant may lead to perceptible changes in the beer that are easier for typical homebrew drinkers to detect than control of fermentation temperature.

Take, for example: tasters were able to distinguish between beer brewed with WLP001 and US-05, but were not able to distinguish between beer brewed with Galaxy and Mosaic hops. Tasters saw a difference between glass carboy and corny keg fermentation, but did not see a difference between chocolate malt and Carafa Special II in a Schwarzbier.
Given this, if I saw a test that purported to compare WLP001 vs US-05 using beers one of which was fermented in glass and one in stainless, I'd call foul on that test. Everything but the parameter of interest must be the same or masked. But it is not always possible to do that. These are definitely things that must be considered in planning and evaluating a triangle test.


In all of these examples I am much more impressed by whether people could detect a difference than by whether the qualified group preferred one over the other. When I design a recipe it is my preference that counts,
A key theme in all my posts has been that the test needs to be designed to reflect what the investigator is interested in. If you are not interested in the preferences of anyone but yourself, then there is little point in asking the preference question, except that, as we noted, a second question helps to reduce p, thus increasing the confidence that the apparent difference is real.
 
I doubtless will read some of their reports in detail at some point, but thus far my posts are about triangle testing, not Brulosophy's skill in implementing them.

You should do that and maybe try some of the experiments.

What the hell is "Virginia/Quebec" by the way? Quebec is in Canada.
 
That is certainly a reasonable application. It has been suggested here that when a difference is 'detected' but with poor confidence, it sends the message that further testing is warranted. That is certainly a valid interpretation. Rather than comment on Brulosophy's selection of a particular confidence level for a declaration of 'detection', at this point I would rather emphasize that we can't detect that there is a difference; we can only estimate the probability of the observed data under the assumption that there is no difference. The other equally important part of the question is "By whom?".
Interesting stuff, no doubt! Why is there an earth?
 