Do "professional" brewers consider brulosophy to be a load of bs?

I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback horribly suspect.

Good post, and I suppose it answers a question I've been kicking around...why don't the brulosophers submit samples to competitions and add the scores to the discussion? The ferm temp experiments are a good example. It would be interesting to see whether the guided tester...the BJCP judge with a style guide...might rate the beers differently. It would not validate or invalidate the primary finding, but might add additional anecdotal information of interest.
 
This comparison isn't valid because 4 out of 10 preferring A is the same as 6 out of 10 preferring B. In fact, p(20,10,4) isn't meaningful because of this symmetry. I should have realized that, and I can assure you these discussions have taught me quite a bit beyond just that.

I need to think more about how I'm computing these probabilities. This is turning into a tar baby.

Yeah, it is. There are a number of things that make this problem a....problem.

One is what those panels of beer drinkers represent. I understand the statistics (believe me, I do), but I don't think they're properly used to produce actionable intelligence. I use that phrase with my students: so something is "significant"; what have you learned of value about the world if it's significant? If you can't say, significance is not useful.

I know that people are guessing when they can't tell, and that's certainly fine for the statistical element of this, but there's an issue with it. People who guessed correctly simply by luck can't tell the difference. I don't see the point of asking such people about preference, as the preference is just as random. When one does that, the preference data is contaminated by guessing. Like trying to see something through fog.
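To put a rough number on that "fog," here is a minimal sketch. It assumes a hypothetical fraction of the panel can genuinely tell the beers apart while everyone else guesses the odd beer with probability 1/3; the fractions used are illustrative, not taken from any Brulosophy panel.

```python
# Rough quantification of guessing "contamination" in a triangle test:
# assume a fraction of the panel are true discriminators and the rest guess
# (1-in-3 chance of picking the odd beer). What share of "qualifiers" got
# there purely by luck?

def guesser_share_among_qualifiers(true_discriminators: float) -> float:
    guessers = 1.0 - true_discriminators
    lucky_guessers = guessers / 3.0            # guessers who pick correctly by chance
    qualifiers = true_discriminators + lucky_guessers
    return lucky_guessers / qualifiers

for d in (0.1, 0.2, 0.5):                      # illustrative fractions, not measured ones
    share = guesser_share_among_qualifiers(d)
    print(f"{d:.0%} true discriminators -> {share:.0%} of qualifiers were guessing")
```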

I'd feel better about the panels--and those who "qualified"--if they could reproduce their choice repeatedly. That would tell me they truly were qualified, and not qualified purely on the basis of a lucky guess.

One of my areas of interest/expertise is measurement (though it's in the social science world, not the biological/chemical world). Measures--instruments--need to be valid but to be valid they also must be reliable. I have no indication in any of this that the guessers--who are providing "preference" data--are doing anything in that realm except guessing.

And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.
 
Good post, and I suppose it answers a question I've been kicking around...why don't the brulosophers submit samples to competitions and add the scores to the discussion? The ferm temp experiments are a good example. It would be interesting to see whether the guided tester...the BJCP judge with a style guide...might rate the beers differently. It would not validate or invalidate the primary finding, but might add additional anecdotal information of interest.

He has! His warm ferment Vienna lager went to round two or something at the NHBC. It scored a 41 in the first round. Further note: all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.
 
He has! His warm ferment Vienna lager went to round two or something at the NHBC. Further note: all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.

Confirmation bias?

What would they think of the beer if they didn't know he'd brewed it?

I'm not saying it's bad, not at all. I'm just noting there are other explanations.
 
Confirmation bias?

What would they think of the beer if they didn't know he'd brewed it?

I'm not saying it's bad, not at all. I'm just noting there are other explanations.

Maybe that's why he doesn't brew all the experiments. There are a few brulosophers now. You had already quoted me before I added that it scored a 41. I don't know how competitions are scored, but I have no reason to believe that this man's beers aren't great.
 
I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback horribly suspect.

Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.
 
He has! His warm ferment Vienna lager went to round two or something at the NHBC. It scored a 41 in the first round. Further note: all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.


Yes, but the question isn't if it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.
 
Yes, but the question isn't if it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.

Actually no...I was thinking to submit the beer to a competition as a test of whether it was actually a good beer or something mediocre. If the testers could not tell the beers apart and both scored in the high thirties, I'd think differently about the outcome than if both scored in the 20s. The warm ferment is a good example, but I'd like to see him submit both beers to the same comp in the same category and see how they do. Maybe that tinge of diacetyl doesn't show badly in a triangle test with no information provided about the style or the test, but does stand out in a lineup of 10 beers intended to be judged against the style.
 
I liked this post thinking I understood your point. Then I realized I did not know what pedagogy meant and had to look it up. Not sure brewing pedagogy makes sense now that I did but thanks for the $10 word!
IMO ingrained or deep-rooted or fixed beliefs would be better descriptors.
 
Yes, but the question isn't if it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.

This is one of my main points in all these discussions. Just because there is a difference, it doesn't represent better or worse, just a difference. Difference or not, the dude can brew very well and everyone has complimented him. He likes, makes, and drinks good beer. I wonder if some haven't read or heard much from him. He is more shocked than anyone, and he brews by and believes in standard practices for sure.
 
This is one of my main points in all these discussions. Just because there is a difference, it doesn't represent better or worse, just a difference. Difference or not, the dude can brew very well and everyone has complimented him. He likes, makes, and drinks good beer. I wonder if some haven't read or heard much from him. He is more shocked than anyone, and he brews by and believes in standard practices for sure.

I noticed that when he did a 60 minute vs. 30 minute mash. There was no difference in the outcome, but he still does a 60 minute mash.

So he is innovative, and he's come up with some information we didn't have before. Especially stuff that runs counter to the prevailing ways of doing things. That makes it interesting.

But I also wonder: if people can't hold to the level of precision he can, are his findings less useful? I mean, I aim for 152 for a mash temp, and that could be 150-154 sometimes. He does two mashes side by side and they are 150.2 and 150.0. And I assume he was aiming for 150.

So the next step is replication - can others replicate his experiments and results.

As others have mentioned, his articles aren't peer reviewed or vetted. He runs experiments (each once, as far as I can tell) and publishes the results. If you compare that to actual lab work, it is very lightweight.
 
This touches on another aspect that didn't get much (if any) mention in my previous post. The results, of course, depend on the panel, but they also depend on the instructions given to the panel. Where the instructions call for marking one or the other of the beers based on opinion rather than something more concrete (such as whether one beer tastes more strongly of diacetyl than the other), we have (IMO) quite a different situation. I know (or think I know) how to calibrate a panel to see if it can detect diacetyl, but I don't know how to calibrate one to see if it can detect 'better' beer.


But Brulosophy clearly states that they are just looking for the tasters' opinions in this round of questions.
And they only ask for opinions from those who correctly chose the odd beer out.

This isn't actually part of the test.
 
But Brulosophy clearly states that they are just looking for the tasters' opinions in this round of questions.
And they only ask for opinions from those who correctly chose the odd beer out.

This isn't actually part of the test.

That's a really important thing to understand about what they do. They mostly test processes and some ingredients to see if there is a difference.

They aren't really even asking "is there more diacetyl?" They're just saying, "Here are 3 glasses. Two are the same as each other, and one is different from those two. Which is different?" And from there, they discuss which they like better.
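For context on the arithmetic behind "which is different?", here is a minimal sketch, assuming pure guessers pick the odd glass with probability 1/3 and a conventional 0.05 cutoff; the panel sizes are illustrative, not Brulosophy's actual numbers.

```python
# Minimal triangle-test arithmetic: how many correct picks out of n panelists
# are needed before "everyone guessed" (p = 1/3) becomes implausible at the
# usual 0.05 level?
from scipy.stats import binom

def min_correct_for_significance(n: int, alpha: float = 0.05) -> int:
    for k in range(n + 1):
        if binom.sf(k - 1, n, 1/3) <= alpha:   # P(at least k correct | all guessing)
            return k
    return n + 1

for n in (20, 24, 30):                          # illustrative panel sizes
    print(n, "panelists ->", min_correct_for_significance(n), "correct picks needed")
```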
 
Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.

Additionally, two beers on opposite ends of the flight can show significant palate fatigue, as well as a scoring shift relative to other beers in the flight. I try to teach newer judges to account for that: there need to be reasons I am scoring a beer better or worse than the beers that preceded it. Sometimes that can mean adjusting the score of the current beer. Sometimes it can mean adjusting previous scores.

While that's only marginally relevant to the topic at hand, the palate issues are very relevant.

And at the end of the day, statistics aside, if Brulosophy experiments cannot be replicated, then they are fundamentally worthless and there is some variable, or combination of variables, known or unknown, that is not being accounted for.

Some folks really need to hop off the bandwagon, stop misquoting people and putting words in others' mouths, and actually listen to other people.
 
i thoroughly enjoy their site but would never consider it rigorous scientific method. but isn't that kind of the whole point? in a lab setting and with proper analytical techniques, no doubt it could be determined and demonstrated that beers are different...but we don't exist in a lab. our bodies don't have the sensitivity of advanced lab equipment so does it even matter? if anything, they are just reinforcing what papazian has been saying for decades (rdwhahb).

i like how the exbeeriments also reinforce human nature. like the one where they split a blonde ale and added flavorless, odorless colorant to half the batch and had folks compare (they could see the color difference). sure enough, folks described traditional characteristics when comparing the dark and light versions, even though they were identical. they categorized the light beer as a cream ale, pale ale, light lager, etc. and the dark as a dark lager, brown ale, porter, etc. some testers were served both samples blindfolded and, not surprisingly, they couldn't tell the difference. just goes to show the power of our senses and associated preconceived biases. they had another one with an ipa up against pliny. folks thought the pliny tasted pretty good but once they were told it was pliny, they couldn't get enough of it. again, the power of persuasion.

i also like the ones where they take the tests themselves and can't tell the difference. they know the variable, know what to hunt for and still can't tell them apart. yes, yes, perception is in the eye of the beholder and it is just one person but still pretty interesting...
 
Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.

I've run into this inconsistency enough times to make me believe it's more widespread than believed. One would hope that, if you were actually judging beers, you'd try to do it with consistency. I know palate fatigue is a real thing, but in ideal circumstances the person doing the judging should be aware of it and make allowances for it. To have the same beer, bottled on the same day in the same way, come back with such a disparity in results just tells me that the method of measurement (the judges) is faulty. I'm not saying *all* judges' palates are flawed, I'm just saying that maybe there should be a bit more rigor in how judges are selected and ranked.
 
I noticed that when he did a 60 minute vs. 30 minute mash. There was no difference in the outcome, but he still does a 60 minute mash.

So he is innovative, and he's come up with some information we didn't have before. Especially stuff that runs counter to the prevailing ways of doing things. That makes it interesting.

But I also wonder: if people can't hold to the level of precision he can, are his findings less useful? I mean, I aim for 152 for a mash temp, and that could be 150-154 sometimes. He does two mashes side by side and they are 150.2 and 150.0. And I assume he was aiming for 150.

So the next step is replication - can others replicate his experiments and results.

As others have mentioned, his articles aren't peer reviewed or vetted. He runs experiments (each once, as far as I can tell) and publishes the results. If you compare that to actual lab work, it is very lightweight.

Others have! There are 3 or 4 of them. The way I see it, lightweight or not, it's what we have. On the fermentation temperature reproach thread, I asked time and time again for other research, other data, and you know how much showed up. Zero, zip, zilch. We would all be willing to consider any other data; where is it? Btw, he has used labs in a few, and the DMS came back nil on a 30 min boil of a German pilsner. Hot side aeration, DMS from a short boil, lid off, mash temp, fermentation temp, autolysis, a lot to consider.
 
I've run into this inconsistency enough times to make me believe it's more widespread than believed. One would hope that, if you were actually judging beers, you'd try to do it with consistency. I know palate fatigue is a real thing, but in ideal circumstances the person doing the judging should be aware of it and make allowances for it. To have the same beer, bottled on the same day in the same way, come back with such a disparity in results just tells me that the method of measurement (the judges) is faulty. I'm not saying *all* judges' palates are flawed, I'm just saying that maybe there should be a bit more rigor in how judges are selected and ranked.

People do overestimate the difficulty of attaining BJCP Recognized or Certified rank. The bar is lower than many think. However the leap between Certified and National is huge, and the leap between National and Master/GM is even larger. My experience over years of entering has confirmed that too.

Judges get biases too. Newer judges often think they're super-tasters and hunt down imaginary off flavors. Higher ranked seasoned judges get arrogant and let their preconceived notions of style bias them. However I am more inclined to trust their palates.

I'd be curious to see the two sets of scoresheets in question and any other info (if flight size/order were marked on the cover sheet).

BJCP ranking, especially Recognized or Certified, isn't strong enough to make me trust their palate on its own in most contexts.
 
If you think these concepts are going to make better beer you couldn't be further from the mark, imo.

So to the short & shoddy experiment, you're saying that a longer mash, longer boil, pitching a proper amount of yeast and controlling fermentation temperature *won't* make better beer than a short mash, short boil, under-pitched yeast and no temp control?

I'm not sure which of those elements is most important among the four, but the experiment clearly showed both statistically significant differences between the two batches AND a preference for the "traditional" beer.
 
i thoroughly enjoy their site but would never consider it rigorous scientific method. but isn't that kind of the whole point? in a lab setting and with proper analytical techniques, no doubt it could be determined and demonstrated that beers are different...but we don't exist in a lab. our bodies don't have the sensitivity of advanced lab equipment so does it even matter? if anything, they are just reinforcing what papazian has been saying for decades (rdwhahb).

i like how the exbeeriments also reinforce human nature. like the one where they split a blonde ale and added flavorless, odorless colorant to half the batch and had folks compare (they could see the color difference). sure enough, folks described traditional characteristics when comparing the dark and light versions, even though they were identical. they categorized the light beer as a cream ale, pale ale, light lager, etc. and the dark as a dark lager, brown ale, porter, etc. some testers were served both samples blindfolded and, not surprisingly, they couldn't tell the difference. just goes to show the power of our senses and associated preconceived biases. they had another one with an ipa up against pliny. folks thought the pliny tasted pretty good but once they were told it was pliny, they couldn't get enough of it. again, the power of persuasion.

i also like the ones where they take the tests themselves and can't tell the difference. they know the variable, know what to hunt for and still can't tell them apart. yes, yes, perception is in the eye of the beholder and it is just one person but still pretty interesting...

Wow, thanks for this insight. Yeah, thinking of the way we homebrew in garages juxtaposed with a lab says a lot. I couldn't agree more, and perception is king. I try vvveeeerrryyyy hard to keep my perception out of things, even though it's hard.
 
So to the short & shoddy experiment, you're saying that a longer mash, longer boil, pitching a proper amount of yeast and controlling fermentation temperature *won't* make better beer than a short mash, short boil, under-pitched yeast and no temp control?

I'm not sure which of those elements is most important among the four, but the experiment clearly showed both statistically significant differences between the two batches AND a preference for the "traditional" beer.

Yes, I am, especially not in terms of better the way you're describing it. Definitely not in terms of great, the way you want to make great beer. I actually hadn't seen this one, I don't think; I'd only heard the original, which was a podcast. You skewed the numbers a little and once again totally failed to pick up on the qualitative and empirical data. Yeah, of 22 tasters, 13 could tell a difference. It reached a level of confidence, but it wasn't like all 22 of them could. Then you quote 6 versus 2 in preference. Well, the other five couldn't tell a difference or didn't have a preference. That means seven either liked the short one or didn't care which one. No one described the short one as bad, and the person who made it was startled at how similar they were. Now, I make beers in two and a half hours, so your definition of better and mine might be a little different. Either way, if one was so unbelievably better than the other, more than 13 would have seen the difference, the person who brewed it would have said it was way better, and more of the remaining seven of the 13 would have preferred it. Imo, this is not the road to beer making Nirvana, and if you want to split hairs at this level to be right, then you can be right.
 
Others have! There are 3 or 4 of them. The way I see it, lightweight (your opinion) or not, it's what we have. On the fermentation temperature reproach thread, I asked time and time again for other research, other data, and you know how much showed up. Zero, zip, zilch. We would all be willing to consider any other data; where is it? Btw, he has used labs in a few, and the DMS came back nil on a 30 min boil of a German pilsner. Hot side aeration, DMS from a short boil, lid off, mash temp, fermentation temp, autolysis, a lot to consider.

But I think you're wrong on fermentation temp, based on the fact that it has been replicated so many times. Of his ferm temp experiments, 7 of 8 were deliberately testing warm vs cool.

1) I already showed the results. Notable was that if the answers were arrived at purely by guessing in a triangle test, it would be expected that in some experiments fewer than 33% of the testers would pick the odd beer. I noted that the worst case result was 33%, but every other test had more than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY in one direction. The number of testers in the experiments, however, would typically have required about 50% picking correctly to achieve significance.

2) He showed that temperature gap matters. In experiment 5, he deliberately skewed the temp all the way up to 82 degrees. This was one of the experiments that achieved significance with 21 testers.

3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

So point #1 above suggests that the trend is one direction, indicating that even if it doesn't achieve significance, there's no reason to believe testers are always "guessing" in one direction.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, as if the errors were both above and below 33%, it wouldn't be likely to achieve significance.

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.
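One way to sanity-check the pooled figure quoted above (76 correct picks out of 172, against a 1/3 guessing rate) is a short one-sided binomial-tail calculation. This is a sketch of that reading of the numbers, not necessarily how the poster computed it, and the pooling itself glosses over the fact that the seven experiments weren't identical, a point raised later in the thread.

```python
# Pooled warm-vs-cool ferment triangle results, read as one binomial sample:
# how likely are 76 or more correct picks out of 172 if everyone guesses at 1/3?
from scipy.stats import binom

n, correct = 172, 76
p_value = binom.sf(correct - 1, n, 1/3)     # P(X >= 76 | n = 172, p = 1/3)
print(f"one-sided p-value: {p_value:.4f}")   # on the order of 0.002, as claimed
```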
 
Getting back on topic: my book by Mike Karnowsky, who is a monster brewer in the field, has quite a few little experiments like Brulosophy's that he did and reports on. Also, on a pro forum I saw discussion of various topics. So I think there's reason to believe that some professionals like to experiment and find other ways of brewing.
 
But I think you're wrong on fermentation temp, based on the fact that it has been replicated so many times. Of his ferm temp experiments, 7 of 8 were deliberately testing warm vs cool.

1) I already showed the results. Notable was that if the answers were arrived at purely by guessing in a triangle test, it would be expected that in some experiments fewer than 33% of the testers would pick the odd beer. I noted that the worst case result was 33%, but every other test had more than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY in one direction. The number of testers in the experiments, however, would typically have required about 50% picking correctly to achieve significance.

2) He showed that temperature gap matters. In experiment 5, he deliberately skewed the temp all the way up to 82 degrees. This was one of the experiments that achieved significance with 21 testers.

3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

So point #1 above suggests that the trend is one direction, indicating that even if it doesn't achieve significance, there's no reason to believe testers are always "guessing" in one direction.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, as if the errors were both above and below 33%, it wouldn't be likely to achieve significance.

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.


The data speaks for itself and you are free to see it how you want. Once again I feel you left out the empirical data, and brought up an experiment that goes against your arguments. The way I see it, the 82 degree ferment xbmt strengthens my argument. Yep, you are right, it showed significance at an 82 degree ferment. Who ferments at 82 anyway? But he did. But see, what you're missing is that it only shows there was a difference. Using your perception and deep-rooted beliefs, you assume that difference was bad. However, seven preferred the warm-fermented one, to two for the cool. And if you include the other four who didn't care either way, it's still 7 to 6. That's not enough for me to go around making claims about warm fermenting. And if I was going to, I guess I would have to say warm ferment is better, as six of the eight tests didn't even show a difference, the 82 deg xbmt showed a difference with preference on the warm side, and not having to buy or keep a bunch of junk in my house is a no-brainer. But I won't go saying that warm ferment is better; I will say this isn't the road to beer making Nirvana, imo.
 
Additionally, two beers on opposite ends of the flight can show significant palate fatigue, as well as a scoring shift relative to other beers in the flight. I try to teach newer judges to account for that: there need to be reasons I am scoring a beer better or worse than the beers that preceded it. Sometimes that can mean adjusting the score of the current beer. Sometimes it can mean adjusting previous scores.

While that's only marginally relevant to the topic at hand, the palate issues are very relevant.

And at the end of the day, statistics aside, if Brulosophy experiments cannot be replicated, then they are fundamentally worthless and there is some variable, or combination of variables, known or unknown, that is not being accounted for.

Some folks really need to hop off the bandwagon, stop misquoting people and putting words in others' mouths, and actually listen to other people.

^This!
 
1) I already showed the results. Notable was that if the answers were arrived at purely by guessing in a triangle test, it would be expected that in some experiments fewer than 33% of the testers would pick the odd beer. I noted that the worst case result was 33%, but every other test had more than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY in one direction. The number of testers in the experiments, however, would typically have required about 50% picking correctly to achieve significance.

This is a beautiful example of someone who is really thinking about what this all means. I really mean that!

It's a wonderful "Hmmmm...." moment. I still have my issues with how panels are constituted and whether there's palate fatigue or what prior drinking/eating does to panelists, but this is interesting.


3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

A meta-analysis! Nicely done. Of course, they're different, but what remains is whether one is preferable to the other.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

I'm half tempted to use this in my classes; good thinking here, and a great way to show tilting the evidence away from a particular outcome and still achieving it.

So point #1 above suggests that the trend is one direction, indicating that even if it doesn't achieve significance, there's no reason to believe testers are always "guessing" in one direction.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Well, I'd say that point #2 "suggests" something....there's still a lot of looseness in how people do these tests.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, as if the errors were both above and below 33%, it wouldn't be likely to achieve significance.

The only issue with this approach is that all of the samples did different processes/recipes. I do a coin-flip exercise in my classes--20 students flip a coin, we record the number of heads. We do it again and again, 10 times. Of course, the number of heads oscillates around 10/20, and I ask the students, what would it look like with a sample of 200? Of course, 10 trials of flipping 20 coins is a sample of 200.

But the parameters in flipping coins don't change; the parameters in the 7 exbeeriments did.
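A quick simulation of that coin-flip exercise, assuming a fair coin, 20 flips per trial, and 10 trials; the only point it illustrates is that pooling is clean when the underlying parameter never changes from trial to trial.

```python
# Classroom coin-flip exercise: 10 trials of 20 fair-coin flips, then pooled
# into one sample of 200. The coin's parameter is fixed, so pooling is valid.
import random

random.seed(1)
trials = [sum(random.random() < 0.5 for _ in range(20)) for _ in range(10)]
print("heads per 20-flip trial:", trials)
print("pooled:", sum(trials), "heads out of 200")
```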

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.

I think your approach is more valuable than most above. I still have my issues w/ the panels (who here doesn't get that by now :)), but I find this to be better "out of the box" thinking than a blind statistical analysis.

Bravo!
 
One is what those panels of beer drinkers represent. I understand the statistics (believe me, I do),
I don't doubt that you understand them far better than I.

but I don't think they're properly used to produce actionable intelligence. I use that phrase with my students: so something is "significant"; what have you learned of value about the world if it's significant? If you can't say, significance is not useful.

So let's say a brewer thinks he has a diacetyl problem and wants to know if using a proportion of valine-rich malt will improve his beer with respect to diacetyl. He brews a test batch (B) and wants to know if it is better than his regular beer (A), which is the same as B except that B contains some portion of the valine-rich malt. To see if it's better he gives 40 tasters a sample of each, and 18 report that beer B is better. He goes to a table (or whatever) and finds that the probability that 18 or more tasters guessing randomly prefer B is 78.5%. He concludes that, as fewer than half of his tasters preferred B, and as it's more likely than not that the data he obtained could be obtained by flipping a coin, B is very likely not better than A, and he doesn't adopt the new process. He takes no action. Let's assume, at this point, that the new malt does indeed improve the beer by reducing diacetyl but that 22 members of the panel are diacetyl taste deficient. Thus the brewer accepts H0 when H1 is true, and we see that this test isn't very powerful because of the panel composition.

Along comes a guy who says "Hey, that's not a very powerful test. Give them three cups...." i.e. advises him to try a triangle test. Under H1 the 18 that picked the lower diacetyl beer in the simple test should be able to detect the difference between A and B, so we would have 18 out of 40 qualifying. The probability of this happening under H0 is 8.3%. That's enough for the brewer to start to think 'maybe this makes a difference', but not below the usual threshold for statistical significance. And he still doesn't know whether the new process improves the beer. Being this close to statistical significance, his action is perhaps to perform additional tests, empanel larger panels, or test his panel to see if some of its members are diacetyl sensitive.

The consultant comes back and says "Did you ask the panelists which they preferred?" and the brewer says "Yes, but I didn't do anything with the data because this is a triangle test." The consultant advises him to process the preference votes, which reveals that 11 of the 18 who qualified preferred B. The probability that 18 qualify and 11 prefer under the null hypothesis is 1.6%. Using this data the brewer realizes he is below the statistical significance threshold, confidently rejects the null hypothesis, and takes the action of adopting the new process. Note that under the assumptions we made above, more than 11 out of 18 should find B to be lower in diacetyl. If 14 do, then p < 0.1%.
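For anyone who wants to check those three figures (78.5%, 8.3%, 1.6%), here is a minimal sketch that reproduces them under the stated null hypothesis. The joint "18 qualify and 11 prefer" term is one plausible reading of p(40,18,11), not necessarily the exact formula used above.

```python
# Reproduce the worked example's probabilities under H0 (everyone guessing).
from scipy.stats import binom

N = 40

# Simple preference test: P(18 or more of 40 coin-flippers pick B)
print(binom.sf(17, N, 0.5))     # roughly 0.785

# Triangle test: P(18 or more of 40 guessers pick the odd beer at 1/3)
print(binom.sf(17, N, 1/3))     # roughly 0.08

# Two-stage test: P(k >= 18 qualify AND 11 or more of those k prefer B)
p = sum(binom.pmf(k, N, 1/3) * binom.sf(10, k, 0.5) for k in range(18, N + 1))
print(p)                         # roughly 0.016, as in the post
```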


I know that people are guessing when they can't tell, and that's certainly fine for the statistical element of this, but there's an issue with it.
You seem to be saying that, while we are trying to make a decision about H1 by the only means available to us, i.e. rejecting H0 if the probability of what we observe is low under H0, a test which produces a lower p than another test isn't necessarily a better test. The lower the p, the more likely we are to reject H0 when H1 is true (and p does not depend on any of the conditions that pertain when H1 is true), and the probability that we do so is, AFAIK (I'm no statistician for sure), the definition of the 'statistical power' of the test. The two-stage test is more powerful than the triangle-alone test.


People who guessed correctly simply by luck can't tell the difference. I don't see the point of asking such people about preference, as the preference is just as random.
That's really not a flaw of the technique but rather a feature of it. Yes, some unqualified votes (guesses) come in, but 2/3 of them are eliminated. Compare that to just asking panelists to pick the better beer: 0% of the guessers are eliminated in that case. The power of the two-stage triangle test derives from this very feature.


When one does that, the preference data is contaminated by guessing. Like trying to see something through fog.
So let's turn down the contamination level by presenting quadruplets of cups with 3 beers the same and 1 different. In that case only 1/4 of guessers qualify, p(40,18,11) = 0.09% and the test is seen to be even more powerful.


I'd feel better about the panels--and those who "qualified"--if they could reproduce their choice repeatedly.
As I mentioned in a previous post, adding the preference part is really asking the panelists to distinguish the beers again by choosing which has less diacetyl than the other. This is sort of similar to multiple runs.

That would tell me they truly were qualified, and not qualified purely on the basis of a lucky guess.
Depending on the nature of the investigation, qualification by guessing may be exactly what you are looking for. If you want to see if your market is insensitive to diacetyl creep, then you want to see if they have to guess when asked to distinguish (or, more important, prefer) beers lower in diacetyl. Keep in mind that to pick one out of three correctly there must be both a discernible difference AND the panelist must be able to detect it. If both those conditions are not met, then every panelist must guess (the instructions require him to). These tests are a test of the panel and the beer. I keep saying that.

But where we are, as in the example of this post, investigating something specific, we want to qualify our panel by presenting it standard samples for two-part triangle testing.


One of my areas of interest/expertise is measurement (though it's in the social science world, not the biological/chemical world). Measures--instruments--need to be valid but to be valid they also must be reliable. I have no indication in any of this that the guessers--who are providing "preference" data--are doing anything in that realm except guessing.
As noted, even the 'best' panelists have to guess when the beers are indistinguishable, and that's exactly what we want them to do. As I said above, guessing is an important feature of this test - not a flaw.


And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.

I've explained it as clearly as I can, and if you can't see it then I would, depending on your level of interest, suggest pushing some numbers around or even doing a Monte Carlo or two if you are so inclined. The main disconnect here is that you are arguing that a statistical test, even though more powerful than another, is less valid than the other. That can only mean it could lead us to take the wrong action, which in this case would imply that the more powerful test leads our hypothetical brewer to decide against using the low-valine malt even though it does improve the beer (defining 'improve' as a reduction in diacetyl). I don't see how that could possibly happen.
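In the spirit of the "push some numbers around or do a Monte Carlo" suggestion, here is a minimal simulation of the two-stage test under H0. The 18-qualify / 11-prefer thresholds are taken from the worked example above; the trial count is arbitrary.

```python
# Monte Carlo of the two-stage triangle test under H0: all 40 panelists guess
# the odd beer (1/3 chance); any who happen to qualify flip a coin for
# preference. Count how often the worked example's outcome shows up by chance.
import random

random.seed(42)
TRIALS, N = 200_000, 40
hits = 0
for _ in range(TRIALS):
    qualifiers = sum(random.random() < 1/3 for _ in range(N))
    prefer_b = sum(random.random() < 0.5 for _ in range(qualifiers))
    if qualifiers >= 18 and prefer_b >= 11:
        hits += 1
print(f"estimated p under H0: {hits / TRIALS:.4f}")   # should land near the 1.6% above
```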
 
After reading this whole thread plus the Brulosophy exbeeriments, I will say that it does cause one to wonder about one's brewing habits. Should some parts of my brewing day be changed as a result of all of this? Probably not.

One thing that comes to mind is that when all these tests are done, the tasting is in a more or less controlled environment. But for the average beer drinker who is out tasting flights or drinking these beers with a meal, one will probably never notice a difference. Since their palates are no longer clean and uncontaminated by other variables, it's hard to tell if there was a slight change. Yes, if the change was great enough, even a dirty palate should notice a difference.

For now though I will carry on as always, and stick with the thought of, "if it ain't broke, don't fix it".
 