Do "professional" brewers consider brulosophy to be a load of bs?

This is the part that worries me about "which beer do you prefer" tests.

This touches on another aspect that didn't get much (if any) mention in my previous post. The results, of course, depend on the panel, but they also depend on the instructions given to the panel. Where the instructions call for marking of one or the other of the beers based on opinion rather than something more concrete (such as whether one beer tastes more strongly of diacetyl than the other) we have quite (IMO) a different situation. I know (or think I know) how to calibrate a panel to see if it can detect diacetyl, but I don't know how to calibrate one to see if it can detect 'better' beer.

Which is fine, if you're judging this "to style". I.e. if the tasters are proficient, they should be told "both beers are a Vienna lager, I'm not telling you what variable is being measured, but you have to answer which beer is the better Vienna lager."

If you do that, and the tasters are proficient and knowledgeable about the style, their answers shouldn't depend on whether they're Germans or Englishmen. If cold-fermenting or warm-fermenting makes a better LAGER, they should be able to figure that out. If you're just asking them which beer they prefer, you are not actually addressing the variable you're trying to test, because a taster who doesn't recognize that the style should be very clean and low in esters may vote for the opposite beer if they like fruity esters in their beer.

Looking further at the panel instruction aspect: it should be clear that taking a panel of, for example, knowledgeable, experienced tasters such as people from the QC department of a brewery, we could expect different results if we ask them to pick the better-made beer than if we ask them to pick the beer they prefer. A Brit (whom I assumed in an earlier post would prefer ale, though in fact the British public, like the public elsewhere, prefers lager) might, in comparing an ale with a minor flaw to a perfect lager, select the ale if asked to pick the beer he prefers and the lager if asked to pick the better-made beer.

The point of all this is, once again, that triangle testing is powerful but that one must be very careful in designing the test (randomization, blindness, panel selection, panel instructions, test environment...) and in how one interprets the results.

If I have a nice laboratory with a reliable, well-trained staff, a panel of certified (calibrated) tasters that I can use for every test, and a limited portfolio of beers for which I seek to learn the effects of minor process/materials variations, then triangle testing can be a powerful tool. If I'm working by myself and the panels are drawn from volunteers from my homebrew club I'm not so sure. But data is data and if p(M,N,n) is small enough there is information buried in it somewhere.
 

You are right, that's why they don't do it, it's perception. Perception must be kept out.

No, the homebrew club is a great judge. After the 20 or more of these tests some of them have taken, they must be keen by now. Anyway, all the data extrapolated proves that any old taster will do. That's why we both like many of the same beers. My wife doesn't like them or drink them, but I trust her on taste unconditionally.
 
No they weren't. In 6 of the 8 tests people were not able to reach any level of statistical significance. And in one of the 2 that did, significance hinged on one taster, on a lager fermented at 82. Furthermore, preference in that case was for the warm one. And in the pictures I showed above, he took wlp800 and fermented it warm and gave it to all those famous people and everyone else, and there was absolutely no statistical significance at all.


PS: if you like IPA, he just posted a whirlpool vs. flameout experiment and people could not statistically identify any difference in those beers.

I read a few of the experiments, maybe not all; as I mentioned, lager is not my style. Did have a really nice one from Poland yesterday though, and am thinking of giving them a try. The main reason I don't lager is that temperature-controlled fermentation is my bottleneck. I've got room for 10 gallons and can't tie that up for 5-6 weeks on one batch. In reading all of the experiments it seems that ale temperatures probably produce a reasonably good lager, with ale turnaround time being feasible at this level.

I note that some of the lager temp experiments do seem to be comparing the fast-brew "with a cold starting temp" technique to the fast-brew "just brew it like an ale" technique. The experiment where they compared a traditional 6+ week lagering process to a ferment-it-like-an-ale process did show statistical significance, but the brewer concluded that he doesn't care: it is still good beer, and he was not persuaded to go back to the traditional 6-week process.
 
I think there's a misunderstanding throughout these comments. Some have mentioned it - it's not about what is better. It's about if it makes a difference.

In many cases, it doesn't make a difference. So go with the easiest process to get the same result.
 

I probably never would have considered taking on a lager with my current set up. I have only done ales as my basement maintains a temp which lends itself well to my ale temperature requirements. I brewed a Helles recipe from here on the forums, following what they had experimented with on Brulosophy, and I'm pleased with the results.
https://www.homebrewtalk.com/showthread.php?p=8000320#post8000320
Sometimes others' creativity opens doors for people who may not have considered options they thought weren't possible, or had been ingrained by old-school principles to believe impossible.
I guess in a nutshell, I couldn't care less what pro-brewers think of Brulosophy or their experiments. I'm a homebrewer. From a keeping-it-enjoyable standpoint, I like what the Brulosophy team does.
 
I'm still trying to figure out the difference between a professional brewer and a "professional" brewer.
Anyone?

Well, a Professional Brewer pays attention to detail and is striving to improve their product as they learn and grow; they make an effort to understand not only what works, but *why* it works.

A 'professional brewer' is a person who owns a brewery and follows rote procedures 'because that's how they've always done it' with little/no regard to the final product quality and has no desire to put forth the effort to improve, because the money is still coming in... for now.

One will have a long tenure as a respected brewery and will likely thrive in an increasingly crowded market. One will most likely fail as the novelty of a local brewery wears off and they get crowded off the taps by other upcoming breweries.

This is all my opinion, of course.
 
A professional brewer is one who gets paid to brew.
 
I guess in a nutshell, I couldn't care less what pro-brewers think of Brulosophy or their experiments. I'm a homebrewer. From a keeping-it-enjoyable standpoint, I like what the Brulosophy team does.


That's why the question was what pros think of it. Not what home brewers who are learning think of it.
 
That's why the question was what pros think of it. Not what home brewers who are learning think of it.

That's your take on it and I'm fine with that.

My response was more to this question posed in the OP:
This has led me to believe that perhaps professional brewers look down on said websites. Has anyone else experienced this? What's the deal here?


Thanks!:mug:
 
Why are people getting hung up on the "which beer do you prefer" question? It's not important, nor do they claim it to be scientific. The main question asked in the triangle test is "which beer is different?" Now I have issues with how that is tested, but at least address the real purpose of their panels.

BTW, my issue is that I've been on off-flavour tasting panels where people couldn't detect flavours I found extreme, and vice versa. So I don't think the triangle tests are a reliable measure of whether a beer is flawed or not. Though I do enjoy the site and read all the posts. I even bought a t-shirt.
 
Why are people getting hung up on the "which beer do you prefer" question? It's not important, nor do they claim it to be scientific. The main question asked in the triangle test is "which beer is different?" Now I have issues with how that is tested, but at least address the real purpose of their panels.
No. The purpose of the triangle test is to test the hypothesis H1: "Beer A is better than Beer B" by rejecting the null hypothesis H0: "Beer A is indistinguishable from Beer B". Note that H1 may be something like "Beer A has less diacetyl than Beer B" or "Beer A is hoppier than Beer B", but ultimately we make changes in our ingredients or process in the hopes of improving our beer, in which case the 'best' criterion obviously applies, though there may be times when we are interested only in whether we can get away with some change without a negative impact on the beer's quality. For example, if we brew batches of lager using the traditional German process (degree/day temperature reduction followed by long lagering) and the more modern diacetyl rest followed by cold crashing, subject them to a triangle test asking 'Which is better?' and obtain data showing that only 2 out of 20 panelists qualify, we don't need to look at the answers to the question as we already know that the beers are indistinguishable. If we have 20 panelists of whom only 2 qualify, both of whom prefer the same beer, the probability of H0 is p(20,2,2) = 86.9% and we can hardly dismiss it. Just looking at the first part of the test: the probability of 2 or more out of 20 qualifying under the null hypothesis is 99.7%. Based on that alone we would also clearly accept H0.

On the other hand, if half the panel (10 out of 20) qualifies, the probability of that happening under H0 is 9.2% and we might still be tempted to accept the null hypothesis based on just this part of the test (the usual maximum for rejection is 5%). But now suppose that 8 of the 10 who qualified prefer one beer or the other: the probability that H0 is true drops to 0.87%, which is below the 1% confidence level. We can now be pretty certain that H0 can be rejected and accept the alternative that one or the other of the two beers (whichever got the 8 votes) is better.
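For anyone who wants to play with these numbers, here is a minimal sketch of the two-stage calculation as described above: stage one is a binomial tail for qualifying (picking the odd beer with probability 1/3 under H0), stage two a coin-flip tail for preference among the qualifiers. The function names are mine, not anything from the ASBC MOA, but the sketch does reproduce the 99.7%, 86.9% and 0.87% figures quoted here.

```python
from math import comb

def binom_pmf(k, n, p):
    # probability of exactly k successes in n independent trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    # probability of k or more successes in n independent trials
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def p_two_stage(M, N, n):
    # P, under H0, that at least N of M panelists pick the odd beer (p = 1/3)
    # AND at least n of those qualifiers prefer one designated beer (p = 1/2)
    return sum(binom_pmf(k, M, 1/3) * binom_tail(n, k, 1/2)
               for k in range(N, M + 1))

print(f"{binom_tail(2, 20, 1/3):.1%}")  # 99.7% - 2 or more of 20 qualify
print(f"{p_two_stage(20, 2, 2):.2%}")   # 86.96% - ...and both prefer the same beer
print(f"{p_two_stage(20, 10, 8):.2%}")  # 0.87% - 10 of 20 qualify, 8 prefer one
```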

So I don't think the triangle tests are a reliable measure of whether a beer is flawed or not. Though I do enjoy the site and read all the posts. I even bought a t-shirt.

As I have pointed out several times in this thread you can't measure voltage with a voltmeter that is not calibrated. The same is true here. You must calibrate the panel for what you are trying to measure. As noted in an earlier post, some people cannot smell/taste diacetyl. A panel made up of persons with this 'disability' would be a poor panel indeed for detecting an excess diacetyl flaw. But a panel that has demonstrated good sensitivity to diacetyl by means of triangle testing with spiked beers would be very good for detection of this flaw.
 
No. The purpose of the triangle test is to test the hypothesis H1: Beer A is better than beer B by rejecting the null hypothesis H0: Beer A is indistinguishable from Beer B.

I don't think that is correct, AJ. If the null is that the two are indistinguishable, the alternative is that they are distinguishable. Not that one is better than the other.

People are going ga-ga over preference, and it's a mistake to do that, IMO. People like what they like.

The reason for doing exbeeriments like Brulosophy does is generally to see if processes make any detectable difference. I've noted repeatedly my concerns about panel composition as well as whether the panels have palates that are useful for this, partly because we don't know what they were drinking/eating just prior to doing the testing.

Even so, if one beer is preferred by 60 percent of tasters who can distinguish them and the other is preferred by 40 percent, it's not clear that we have learned much, if anything. Which would we like better? Maybe the former, but possibly the latter. As I like to say, what's the actionable intelligence? Answer: Not much if any.

Note that H1 may be something like "Beer A has less diacetyl than Beer B" or "Beer A is hoppier than Beer B", but ultimately we make changes in our ingredients or process in the hopes of improving our beer, in which case the 'best' criterion obviously applies, though there may be times when we are interested only in whether we can get away with some change without a negative impact on the beer's quality.

For commercial brewers whose changes may save them money, this makes sense.

For example, if we brew batches of lager using the traditional German process (degree/day temperature reduction followed by long lagering) and the more modern diacetyl rest followed by cold crashing, subject them to a triangle test asking 'Which is better?'

Forgive me, but a triangle test is to see if the beers can be distinguished from each other, isn't it? Not about preference. It's used to identify who (presumably--but since it includes guessers that's suspect) can tell the beers apart.

and obtain data showing that only 2 out of 20 panelists qualify, we don't need to look at the answers to the question as we already know that the beers are indistinguishable.

To some, not to all. Not all flavors are evident to all tasters, and not all flavors are desired by all tasters. I don't care for the flavor of Belgians; that's probably a personal failing, but I like what I like, and that's not what I like.

People vary in their ability to taste.

If we have 20 panelists of whom only 2 qualify, both of whom prefer the same beer, the probability of H0 is p(20,2,2) = 86.9% and we can hardly dismiss it. Just looking at the first part of the test: the probability of 2 or more out of 20 qualifying under the null hypothesis is 99.7%. Based on that alone we would also clearly accept H0.

Not clear what you mean by qualifying.

<SNIP>

We can now be pretty certain that H0 can be rejected and accept the alternative that one or the other of the two beers (whichever got the 8 votes) is better.

I've tasted beers I find different but have no preference one over the other. I cannot say which is better. There is no objective standard for which is "better," only that certain people prefer one or the other or neither.

As I have pointed out several times in this thread you can't measure voltage with a voltmeter that is not calibrated. The same is true here. You must calibrate the panel for what you are trying to measure. As noted in an earlier post, some people cannot smell/taste diacetyl. A panel made up of persons with this 'disability' would be a poor panel indeed for detecting an excess diacetyl flaw. But a panel that has demonstrated good sensitivity to diacetyl by means of triangle testing with spiked beers would be very good for detection of this flaw.

That's partially true, AJ, but when the purpose of an experiment is to see if there's a detectable difference, to find out whether a process is reasonably robust or not, one shouldn't get lost in trying to see if there's this flaw or that flaw. If you brewed two beers side-by-side whose processes varied only by one being mashed .2 degrees higher than the other, you'd be unlikely to find anyone who could reliably note a difference.

Everything else is a matter of degree.

And if you want to see if an average beer drinker (or whatever subpopulation about which one cares) can tell the difference between processes, then the issue becomes one of the sample drawn, as well as the conditions under which they do the tasting.

Which has been noted all along as two vital alternative explanations potentially confounding the results of such a test.
 
No. The purpose of the triangle test is to test the hypothesis H1: "Beer A is better than Beer B" by rejecting the null hypothesis H0:

I agree completely with everything else you are saying but this point is just incorrect and unfair. I think you may be applying what you are looking for in your triangle tests to what Brulosophy are doing. The "which do you prefer" stuff in their write-ups is treated as incidental. They are only testing whether or not their tasters can identify any difference in the beers. They don't even add the preference results sometimes.
 
There is an interesting page on the Brulosophy site analyzing the lousy-taster-panel argument. It's worth a read. The key is to tell if beers are different. It turns out trained, qualified tasters are not much better than the average homebrewer or even the average beer drinker. You would think BJCP-certified judges would make an effort to take the test seriously, with a clean palate, ready to really think about flavor and mouthfeel.

http://brulosophy.com/2016/01/21/in...t-xbmt-performance-based-on-experience-level/
 
People tend to think BJCP judges are better tasters, when that's not the case. When it comes to "different" I expect the average person to fare as well as a trained judge. The difference is in the ability to describe the difference, where I expect a BJCP judge to be significantly stronger. If there's an improvement among trained tasters I'd expect it to be slight.

BJCP judges will exhibit the same blindness to certain traits that others will. Any tasting exam grader will tell you that the ability to thoroughly describe and offer suggestions for improvement is more important for passing the exam than accuracy is. Point being that it doesn't say too much about accuracy of palate. Blindness shows even with higher ranking judges (except most of them will know it- I know my diacetyl sensitivity is below average).

Now in terms of preparing with a clean palate, you're right you'd expect a ranked judge to be better. You'd also expect a palate destroyer like pizza not to be a common competition judging lunch but it is. Either way, we still don't know.

A taster's credentials aren't a substitute for an actual calibration.
 
Being a pro, qhrumphf, both your posts offer valuable and excellent insight into the essence of the initial question. I appreciate the BJCP post because it helps show where their hard work and expertise are focused. I think judges and trained tasters of any kind are important; without judges, there is no competition. That being said, I agree with and appreciate your pointing out that in a blind triangle it's every person for themselves. Time and time again insults are slung at these tests because the tasters weren't holier-than-thou golden palates. Yet at this point massive amounts of data show either that there is a difference we all could perceive or that there is not.

Curious if you read the decoction experiment. This one blew my mind. They did a triple decoction, a full freaking triple decoction, where they took a third of the mash each time and boiled it. Surprisingly, tasters couldn't tell the difference between that and a regular lager. Putting the experiment aside, it is mind-blowing from a mash perspective. If you can boil this stuff 3 times and not even have a group of tasters be able to tell a difference, then what we know about tannin extraction, mash temp, etc. is worth exploring.
 
I don't think that is correct, AJ. If the null is that the two are indistinguishable, the alternative is that they are distinguishable.
But they are distinguishable in some way or ways, i.e. one has more or less of some attribute (or attributes) than the other. The second part of the test really asks whether the beers are distinguishable with respect to that attribute, which is what the investigator is interested in. H0 should really be worded "Beer A and Beer B are indistinguishable with respect to the attribute or attributes of interest." And it then becomes incumbent on the testing team to ensure that other attributes are hidden or masked.

For example, the test team would place the beer in opaque cups if the long lagering in the German vs crash scenario I mentioned in my last post removed more color than the quicker program. Unless the investigators wanted to include color as part of the criteria for goodness.

Not that one is better than the other.

In that example the question was "Is one better?" I grant you that if the answer to the first part of the test is that the beers are indistinguishable there is no point in asking the question but as I showed in that fermentation program example the question augments our ability to confidently accept or reject the null hypothesis.


People are going ga-ga over preference, and it's a mistake to do that, IMO. People like what they like.
And that's fine if preference is the parameter you are trying to measure, as in a marketing study. I agree that in this case, i.e. where a subjective question is being asked, greater thought is required on the part of the investigators.

The reason for doing exbeeriments like Brulosophy does is generally to see if processes make any detectable difference.
And doing the two part test can help them determine that if things are done carefully.

I've noted repeatedly my concerns about panel composition as well as whether the panels have palates that are useful for this, partly because we don't know what they were drinking/eating just prior to doing the testing.
Those and a lot of other things are a concern. As I've noted elsewhere the instructions in the ASBC MOA (or any source describing discrimination testing) will list some of these and what to do about them (isolating panelists, quiet, masking parameters that are not of interest).

Even so, if one beer is preferred by 60 percent of tasters who can distinguish them and the other is preferred by 40 percent, it's not clear that we have learned much, if anything. Which would we like better? Maybe the former, but possibly the latter.
This is a statistics game. There is no definite answer. The best we can do is compute the probability of the null hypothesis and reject it or not.


As I like to say, what's the actionable intelligence? Answer: Not much if any.
Well in the example you give, if the panel has 20 members and 10 qualify (can tell the difference), the probability that that can happen under the null hypothesis is 9.2% and we cannot confidently conclude that it (the null hypothesis) can be rejected. This assumes a properly conducted test. If one of the beers were, for example, served warmer than the other then 20 out of 20 would easily detect the odd beer and we would reject the null hypothesis, concluding that our fermentation program choice did indeed make a discernible difference. But the null hypothesis in this case should be "Warm beer is not distinguishable from cold."

Now if, in this case, we go on to process the "which did you prefer" responses we find, writing p(M,N,n) for the probability under the null hypothesis that out of M panelists N or more qualified (correctly identified the odd beer), of whom n or more preferred one or the other:

p(20,10,4) = 7.9%
p(20,10,6) = 4.1%
p(20,10,8) = 0.87%

Thus if 6 of the qualified tasters prefer one or the other, the support for the null hypothesis drops from above the usual minimum confidence level (5%; at 4 who prefer it is 7.9%) to beneath it, and we can reject the null hypothesis based on the preference part of the experiment, whereas we could not based only on detecting the odd beer (9.2%). That, IMO, is the actionable intelligence.



Forgive me, but a triangle test is to see if the beers can be distinguished from each other, isn't it?
Yes it is. Forgive me for being so stubborn on this point.

So where did I get this idea about the second part of the test? I didn't make it up (I should be so clever). I got it from the ASBC Triangle Test MOA and so always have assumed it to be part of a triangle test. That MOA described the procedure as I have been outlining it and had two tables. The first was of the probabilities that N or more out of M correctly identified the odd beer and the second was the probability that n or more out of N preferred one or the other. Now I note that later (this was 25 years ago) editions of the MOA do not include this second table. I have no idea why. It's clear to me that the second part of the test increases its power (as illustrated by your example). But perhaps I am deceiving myself in thinking this.


Not about preference. It's used to identify who (presumably--but since it includes guessers that's suspect) can tell the beers apart.
It's used to see if beers are detectably different by the panel. Thus, as I keep yelling, it is a test of panel and beer. The panel must be qualified. You can, as I have noted several times here, calibrate a panel for an objective test (e.g. investigation of processes that might or might not increase diacetyl) but it is harder to do so for a subjective test (is there any detectable difference at all). As I have also said before, demographics appears to be the best we can do in such cases. Males between 18 and 21, members of your homebrew club, and members of InBev's QC team would doubtless give different results.

Including guessers is a feature, not a flaw. If the panel consists of all guessers the null hypothesis is likely to be accepted and we have demonstrated that this panel cannot distinguish the beers. That is useful information.




To some, not to all. Not all flavors are evident to all tasters, and not all flavors are desired by all tasters. I don't care for the flavor of Belgians; that's probably a personal failing, but I like what I like, and that's not what I like.
Again emphasizing that we must be very careful in picking panel members.



Not clear what you mean by qualifying.
Clearly if you are trying to find out whether one beer is better than another, a taster who can't even tell that the beers are different is not qualified to express an opinion as to which is better. 'Qualify' thus simply means being able to pick the odd beer.

I've tasted beers I find different but have no preference one over the other. I cannot say which is better. There is no objective standard for which is "better," only that certain people prefer one or the other or neither.
And that is exactly what a brewery is likely to be trying to figure out. Who are these certain people? They are the ones to be targeted for a particular brand. Suppose a brewery has beer A and beer B which it presents to a panel from the western part of its city. Suppose the panel of 20 has 18 qualifiers (the beers are pretty distinct) but that half prefer A and half prefer B. p = 1.4E-7, so there is no question that the test result is valid, and it says that there is, among qualified tasters, no preference for brand A or brand B, so the brewery shoots for equal shelf space for each in that market.




That's partially true, AJ, but when the purpose of an experiment is to see if there's a detectable difference, to find out whether a process is reasonably robust or not, one shouldn't get lost in trying to see if there's this flaw or that flaw.
If that's what you want to test for then that's what you test for. But if I'm going to do all the work of using the German process as opposed to the crash cooling one I don't only want to know if there is a detectable difference. I want to know if the process change improved the beer.


If you brewed two beers side-by-side whose processes varied only by one being mashed .2 degrees higher than the other, you'd be unlikely to find anyone who could reliably note a difference.
True. So before I did a test on a production beer to see if turning down the heat (thereby saving money) a little made a difference I'd calibrate my panel with 0.2, 0.4, 0.6... degree difference pilot beers. If I couldn't get a panel demonstrably sensitive to this parameter I wouldn't do the test.
 
I agree completely with everything else you are saying but this point is just incorrect and unfair.
As I admit in #97 it is incorrect, but unfair?

I think you may be applying what you are looking for in your triangle tests to what Brulosophy are doing.
Well yes. I assumed everyone was doing what was in the MOA I first was exposed to 25 years ago.

The "which do you prefer" stuff in their write ups is treated as incidental. They are only testing whether or not their tasters can identify and difference in the beers. They don't even add the preference results some times.
As is also shown in #97 processing the preference data can improve one's knowledge about the validity of the null hypothesis.
 
As is also shown in #97 processing the preference data can improve one's knowledge about the validity of the null hypothesis.

If I understand this correctly, no it can't. If statistically no level of significance is met, then basically anyone who got it was guessing. The number of tasters who get it right often falls along thirds, meaning guessing one of the three. Many are usually just BSing about how they got lucky. When there is significance, it's obvious, and preference data is more meaningful.
 
If I understand this correctly, no it can't.
I'm afraid you don't quite but you are pretty close.

If statistically no level of significance is met, then basically anyone who got it was guessing.
They were guessing when they picked one of three and they are tossing a coin when they pick the better beer. We calculate the probability that the numbers that we got in processing the ballots could have been obtained by a panel that is guessing in this way. As the example in my last post showed the probability that people pick the right beer AND prefer one by guessing can be lower than the probability of just picking the right beer. This allows us to reject the null hypothesis by considering the second stage in cases (the example) where we couldn't by just processing the correct selection of the odd beer.


The number of tasters who get it right often falls along thirds, meaning guessing one of the three.
That's exactly what they are supposed to do if they can't make a decision based on their perceptions.

Many are usually just BSing about how they got lucky.
That's OK. They are to decide by any means they wish. If they qualified because they 'got lucky' their votes on which is the preferable beer will count but, by design, 2/3 of people that are not qualified get eliminated.
When there is significance, it's obvious, and preference data is more meaningful.
If significance were obvious most of the time we wouldn't need these tests. The purpose of the test is to see if there is significance in cases where it isn't so obvious.

To return to the data from the example of the last post

p(20,10,4) = 7.9%
p(20,10,6) = 4.1%

This comparison isn't valid because 4 out of 10 preferring A is the same as 6 out of 10 preferring B. In fact p(20,10,4) isn't meaningful because of this symmetry; I should have realized that, and I can assure you that these discussions have taught me quite a bit beyond just that.

I need to think more about how I'm computing these probabilities. This is turning into a tar baby.
 
I think "Pro brewers" are a lot like rock musicians. Some really know what they're doing, some do it by the seat of their pants. Some apply science and rigor, others treat it more like an art.

Interestingly, I don't think there's a lot of correlation between approach and beer quality. I've been to a bizillion microbrewery tap rooms. Tastings at some leave me surprised that they are "pros". I think some succeed because there is a significant percentage of craft beer drinkers that are really still pretty casual. For those of us who seek out new brews, travel on beercations, go to regional and national brew fests, and brew our own beer, they can be pretty mediocre. For the craft beer drinker who doesn't have that type of experience, they're probably pretty decent.

As far as Brulosophy, it has its place. Sometimes the lack of ability to discern a difference is as telling as otherwise. I use that stuff sometimes to tweak my process if I believe it will be less effort with no difference or will improve my beer. Not doing a secondary for ales is a great example. I think my beers are better for it.

I try to keep learning all the time and I try to make every batch of beer my best. This may be different than a pro who's trying to make every batch of beer the same as the last time they brewed it. In some respects, you could argue that trying to achieve consistency stifles improvement. The approach to brewing is significantly different for most pros vs a home brewer. If I were to open a nano, it would be part of my mission statement to continually strive to evolve and improve all my recipes and that some variability may result.
 
I worry too much. There was nothing wrong with the way I was computing the probabilities (that I know of at this point). When the investigator sees that he has 4 out of 10 preferences for one beer he just needs to realize that he has 6 out of 10 for the other and enter 6 into the program, or change the program to do this automatically, in which case

p(20,10,4) = 4.1%
p(20,10,5) = 6.2%
p(20,10,6) = 4.1%
p(20,10,7) = 2.1%
p(20,10,8) = 0.9%

Thus, out of 10 qualifying votes, 6 for A and 4 for B is equally likely as 4 for A and 6 for B. Either puts us below the first confidence threshold, so we can conclude that the beers are discernible but that the preference for one or the other isn't strong.

Increasing the panel size by 2 members but keeping the number of qualifiers the same gives
p(22,10,6) = 7.6%
p(22,10,7) = 4.2%
and we see that considering the preference score can increase our confidence that the beers are discernible.
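If you'd rather check the fold against software than against my arithmetic, the same two-stage sum is a few lines of scipy. The automatic-fold helper is just my reading of "enter 6 into the program", not any published procedure:

```python
from scipy.stats import binom

def p_two_stage(M, N, n):
    # P, under H0, that >= N of M panelists pick the odd beer (p = 1/3)
    # and >= n of those qualifiers prefer one designated beer (p = 1/2)
    return sum(binom.pmf(k, M, 1/3) * binom.sf(n - 1, k, 1/2)
               for k in range(N, M + 1))

def p_folded(M, N, n):
    # score 4 of 10 preferring A as 6 of 10 preferring B, per the symmetry
    return p_two_stage(M, N, max(n, N - n))

for votes in (4, 5, 6, 7, 8):
    print(f"p(20,10,{votes}) = {p_folded(20, 10, votes):.2%}")
# 4.12%, 6.24%, 4.12%, 2.16%, 0.87% - the table above, up to rounding
```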
 

This was easier to understand. And I get what you are saying about the first confidence level overcoming the null depending on sample size. I enjoyed the post. It does remind me of an equation I saw once that showed one plus one equaled zero. Finally getting this, I see you make a great, valid point about 2nd-stage consideration.

To me, for homebrewing and the info I want, if the panel can't discern, then they can't discern. I am OK with that for these tests. Another thought, which is difficult to compute with a calculator, is that if you give people the exact same three beers and ask them for a preference, who knows what the results will be. It is plausible people could switch if given a blind second chance, or offer completely unusual opinions. Furthermore, it just dawned on me that my nephew won a huge science fair on this point. He gave people water and told them one sample was flavored. Guess which one people thought was flavored. A fifth grader helps shed light on how huge perception is. I'm not disagreeing with your math, and you seem a brilliant mind. The information gathered speaks for itself plenty as far as I am concerned.
 
That's not true.

Sure, thanks for correcting this for me. It's entirely possible that even in a statistically random result, some noticed a true difference. Wonder how you calculate that, because I know everybody who gets it right, even if by luck, is going to give a little one-upmanship. I've heard it happen on podcasts; it's in good fun and games. James Spencer and Steve Wilks kind of have a competition going.
 
No they weren't. In 6 of the 8 tests people were not able to reach any level of statistical significance. And in one of the 2 that did, significance hinged on one taster, on a lager fermented at 82. Furthermore, preference in that case was for the warm one. And in the pictures I showed above, he took wlp800 and fermented it warm and gave it to all those famous people and everyone else, and there was absolutely no statistical significance at all.

However, here's another way to think about that... What if you add the results together?

I.e. if you have, say, 24 tasters, you need a stronger result (relative to the null case of guessing, i.e. 33% randomly choosing the different sample) to achieve significance than if you have 48 tasters.

See Table A-2 here.

I.e. to achieve p<0.05, if you have 24 tasters, you need 13 (54%) to choose the correct sample to claim significance, because pure guessing would expect 8 tasters to pick the correct sample. If you have 48 tasters, however, you only need ~22 (46%) to pick the correct sample.

The bigger the sample, the better you are at discerning the variable being measured, i.e. distinguishing signal from noise.
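To make that concrete, here is a small sketch using exact binomial tails via scipy (so the thresholds may differ by one from any particular printed table):

```python
from scipy.stats import binom

def min_correct(panel, alpha=0.05):
    # smallest k such that P(Binomial(panel, 1/3) >= k) < alpha
    for k in range(panel + 1):
        if binom.sf(k - 1, panel, 1/3) < alpha:
            return k

for panel in (24, 48, 96, 172):
    k = min_correct(panel)
    print(f"{panel} tasters: {k} correct ({k / panel:.0%})")
# 24 -> 13 (54%) and 48 -> 22 (46%); the required fraction falls toward 1/3
```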

So what if we extrapolate ALL of his fermentation temperature experiments?

#1: 9 of 23 correct
#2: 12 of 22 correct
#3: 13 of 39
#4: 12 of 26
#5: 12 of 21
#6: 10 of 21
#7: (Excluded: it wasn't warm ferment, it was testing temp stability)
#8: 8 of 20

So overall you had 76 of 172 correct responses, or 44%. To achieve a p<0.05 significance with 172 tasters, you'd only need about 68. A result of 76 correct answers is p=0.002 according to a triangle test program I found online. If any test Marshall had done individually had a p-value of 0.002, I'm pretty sure we'd all say that it was gospel.
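That figure is easy to sanity-check with an exact binomial tail (assuming, per the caveat below, that simply pooling the panels is legitimate):

```python
from scipy.stats import binom

# P that 76 or more of 172 panelists pick the odd beer if all guess at 1/3
print(binom.sf(76 - 1, 172, 1/3))  # ~0.002, in line with the online program
```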

So while there are undoubtedly other statistical problems that come from combining experiments in this way, it also suggests that small sample sizes created an unnecessarily high bar to clear.

And one additional thing I would note. The *worst* that the group ever did in a warm fermentation XBMT was a 0.33 correct rate (13 of 39). If it were purely guessing, you would expect some panels to err on the high side of 1/3 and some on the low side. That never happened.

So I don't think you can look at his warm fermentation experiments in total and say that fermentation temperature isn't significant, but rather that his experiments individually were too small to show the effect.
 
I think there's a misunderstanding throughout these comments. Some have mentioned it - it's not about what is better. It's about if it makes a difference.

In many cases, it doesn't make a difference. So go with the easiest process to get the same result.

But you're overstating the results. In many cases, based on the sample size, the difference was not statistically significant.

However, if you look at the Traditional vs Short & Shoddy exbeeriment, he showed that four separate exbeeriments on process that didn't achieve significance yielded a statistically significant difference when combined.

So it may not be smart to change your process based on one exbeeriment, just because it didn't achieve significance. Because it may be that that particular variable DOES affect the beer in a positive way, but that the effect is too small to be seen by a typical taster. It doesn't mean the effect is zero, and for people who are always trying to make our beer better, we try to squeeze out every little advantage we can.
 
So it may not be smart to change your process based on one exbeeriment, just because it didn't achieve significance. Because it may be that that particular variable DOES affect the beer in a positive way, but that the effect is too small to be seen by a typical taster. It doesn't mean the effect is zero, and for people who are always trying to make our beer better, we try to squeeze out every little advantage we can.


That's true, but the difference is so imperceptible that it might as well be zero. And per this thread, it might be all guess.
 
I.e. to achieve p<0.05, if you have 24 tasters, you need 13 (54%) to choose the correct sample to claim significance,
That's significant at the 2.8% level, which isn't bad. But, as we have been discussing, if 8 of the 13 preferred one or the other of the beers or found it to have more of some property such as bitterness or diacetyl, then the confidence goes down to 0.7%.

If you have 48 tasters, however, you only need ~22 (46%) to pick the correct sample.
I calculate that you'd need 23 (48%) to get to about that same level of confidence (2.4%). I'm using a binomial test - perhaps you are using chi-squared or a t-test and that might explain the small difference. As 48% < 54% your point is obviously valid. If 12 of the 23 further distinguish the beers by preferring one or finding one to have more of some attribute than the other then the confidence again drops to 0.7%.

Thus one can increase confidence either by increasing the sample size or by incorporating the preference test. By incorporating the preference test one is, in a sense, increasing the panel size, as a subset of the panel, the qualified members, conducts, under the null hypothesis, a statistically independent second test. The probability that one both correctly guesses the odd beer and picks the hoppier by chance is smaller than either of the individual probabilities.

I'll repeat again that ASBC used to recommend this but don't seem to any more. Perhaps there is some fundamental flaw here but damned if I can see it.



A result of 76 correct answers is p=0.002 according to a triangle test program I found online.

For a binomial test using EXCEL the probability of 76 or more out of 172 is found by typing =1-BINOM.DIST(76-1,172,1/3,1)
into a cell.
 
And per this thread, it might be all guess.
The object of this kind of testing is to be sure that the probability that it is all guessing is very small: 1% is typical, 0.1% is better still, and 5% is sometimes accepted. It's really up to you to decide what it takes to convince you.

The test is more likely to fall down because something that shouldn't have been (e.g. the color of the beer) got telegraphed to the panelists somehow. That's why I keep going on about the care required of the investigators.
 
That's true, but the difference is so imperceptible that it might as well be zero. And per this thread, it might be all guess.

The short & shoddy had 22 tasters. The "guess" hypothesis would indicate 7.33 (33%) correct tasters if it were purely by guess. In fact, 13 (50%) correctly chose the different beer. I'd say that's not imperceptible. That was one of the few ones that achieved statistical significance despite being a small sample size.

I think in a number of these, I've seen >33% of tasters correctly identify the sample, but not enough to reach statistical significance.

This was why I made the point in my post above my response to you about taking multiple experiments dealing with the same variable in the aggregate. When you do that, it appears to achieve significance.

It's also why I specifically highlighted the short & shoddy in my response to you. It deals with multiple variables that individually did not achieve significance, but when combined, the result was significant. It suggests that unless you have good reason for removing a step, and you like how your beer is coming out, you may not want to remove it simply because one brulosophy experiment suggested it wasn't statistically significant.
 
I calculate that you'd need 23 (48%) to get to about that same level of confidence (2.4%). I'm using a binomial test - perhaps you are using chi-squared or a t-test and that might explain the small difference. As 48% < 54% your point is obviously valid.

I simply used the PDF I linked above. It has entries for 45 tasters and for 50, so I kinda split the difference.

But agreed that the higher numbers are key. For statistical significance of p=0.05 with 172, I said you'd need 68 correct, which is 39.5%. Point being that as sample size grows, the percentage above 33% needed to show you're above "guessing" shrinks.

For a binomial test using EXCEL the probability of 76 or more out of 172 is found by typing =1-BINOM.DIST(76-1,172,1/3,1)
into a cell.

I'm not a statistician. I used a program I found online called Difftest. It seemed to achieve the same results when I plugged in the 24 and 48 tester numbers, so I assumed it was also valid at 172.
 
People tend to think BJCP judges are better tasters, when that's not the case. When it comes to "different" I expect the average person to fare as well as a trained judge. The difference is in the ability to describe the difference, where I expect a BJCP judge to be significantly stronger. If there's an improvement among trained tasters I'd expect it to be slight.


I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback to be horribly suspect.
 
If your job is to judge beers, you have to come up with something! They can't all be winners.
 
The short & shoddy had 22 tasters. The "guess" hypothesis would indicate 7.33 (33%) correct tasters if it were purely by guess. In fact, 13 (50%) correctly chose the different beer. I'd say that's not imperceptible. That was one of the few ones that achieved statistical significance despite being a small sample size.



I think in a number of these, I've seen >33% tasters correctly identify the sample but not enough to reach the statistical significance.



This was why I made the point in my post above my response to you about taking multiple experiments dealing with the same variable in the aggregate. When you do that, it appears to achieve significance.



It's also why I specifically highlighted the short & shoddy in my response to you. It deals with multiple variables that individually did not achieve significance, but when combined, the result was significant. It suggests that unless you have good reason for removing a step, and you like how your beer is coming out, you may not want to remove it simply because one brulosophy experiment suggested it wasn't statistically significant.


That's kind of my point. You might not have seen that I'm not a fan of Brulosophy, for that kind of reason. They seem to often (from what I have read, which is not much, to be honest) end up saying "it all came out roughly the same."
 
However, here's another way to think about that... What if you add the results together?


Although you surely make a point, I disagree strongly. And your total lack of consideration for the massive amount of solid qualitative data is disappointing to me.
 
So it may not be smart to change your process based on one exbeeriment, just because it didn't achieve significance. Because it may be that that particular variable DOES affect the beer in a positive way, but that the effect is too small to be seen by a typical taster. It doesn't mean the effect is zero, and for people who are always trying to make our beer better, we try to squeeze out every little advantage we can.

If you think these concepts are going to make better beer, you couldn't be further from the mark, IMO.
 
It is strange to me how dearly and desperately many of you seem to hold on to your brewing pedagogy. It's like the only thing you have left in the world to believe in and you're not going to let it go no matter what. Brewing, home brewing, is not a moral endeavor, it's an aesthetic one.
 
It is strange to me how dearly and desperately many of you seem to hold on to your brewing pedagogy. It's like the only thing you have left in the world to believe in and you're not going to let it go no matter what. Brewing, home brewing, is not a moral endeavor, it's an aesthetic one.

I liked this post thinking I understood your point. Then I realized I did not know what pedagogy meant and had to look it up. Not sure brewing pedagogy makes sense now that I did but thanks for the $10 word!
 
I liked this post thinking I understood your point. Then I realized I did not know what pedagogy meant and had to look it up. Not sure brewing pedagogy makes sense now that I did but thanks for the $10 word!

Haha, thanks for pointing that out, I've been known to make words up too.
 