Do "professional" brewers consider Brulosophy to be a load of BS?

It's been a while since I reviewed their methodology, but does Brulosophy keep the order of the triangle tests the same for each taster, or do they randomly assign order?

That would be interesting to look at in repeated tests... can those who correctly identified (or guessed right) the odd beer out repeat that result if the order of samples is different?


Not only is that not clear from the way they do it, but you've also pointed out one of the elemental difficulties with using a one-shot guess to "qualify" tasters for the preference test.

Show me you can pick the odd one out three or more times in a row, and I'll believe you can detect a difference... and that you're qualified to go to the next level.

Guessers cannot tell the difference; why would anyone want them judging preference, and guessing on that too?
 
What I question personally is whether we should place much validity in their results when they expect John Q. Randomguy -- who might know little or nothing about beer -- to be able to detect differences between two beers at a 95% confidence level. But in my view, suppose we take a looser approach and only expect John Q., as well as all the other various experienced tasters, to detect a difference an average of maybe about 80% of the time, with the ultimate goal being "MAYBE, JUST MAYBE, there is something going on here" rather than "yea, verily, this experiment has 95% confidence that there seems to be a difference". With an 80% bar instead of 95%, this lower bar is easier to meet, or to "qualify" a variable for further experimentation, rather than rehashing the same old "nope, we didn't achieve 'statistical significance' yet again". Statistically, if they only expect to be right about 80% of the time instead of 95%, the results reported should prove more interesting, at least in my own chumpy eyes.


The 95% confidence limit requires tasters to be right far, far less than 80% of the time. If you have 30 tasters, for example, you only need 15 to pick the right sample to be 95% confident they didn't pick right by accident. If they simply guessed, you would expect to get 10 right answers and 20 wrong answers. If 80% chose correctly and you only had 8 tasters, you would be 99% certain this was not lucky guessing. And if 14 of 20 tasters got the right answer (that's not even 80%), you would have a p value of about .001.
 
The 95% confidence limit requires tasters to be right far, far less than 80% of the time. If you have 30 tasters, for example, you only need 15 to pick the right sample to be 95% confident they didn't pick right by accident. If they simply guessed, you would expect to get 10 right answers and 20 wrong answers. If 80% chose correctly and you only had 8 tasters, you would be 99% certain this was not lucky guessing. And if 14 of 20 tasters got the right answer (that's not even 80%), you would have a p value of about .001.

I understand your point. My point is that, given the unknown variables about how qualified the tasters are, as well as perhaps how well the beers were brewed, they could compensate for these additional unknowns by setting the confidence bar at just 80% (p <= 0.2) and only expecting to kinda sorta find out if maybe the variable is having an effect. Then, for the same 30 tasters, instead of needing 15 people to select the right beer to reach 'statistical significance', you'd only need about 13 of them to be 80% confident that the result might be showing us something worth further experimentation. To me, 80% confidence is a pretty decent goal for us silly homebrewers. We are not scientists working in laboratories; this is just beer brewed at home. If the experiment is screwed up, then drink it and run it over again, no big whoop. We don't need to set a 95% confidence goal all the damn time. That's my point. Sure, if you could get 100 tasters for every experiment, or better yet 200, then 95% confidence would be achievable. But to expect 95% confidence with a small sample size of only 20-30 people? Not so realistic.

Realism. That's what it boils down to for me. Goals need to be realistic. Otherwise we usually get the same result: "did not achieve statistical significance with a goal of p<=0.05". If the final result is p=0.06, or 0.1, or even 0.15, could there be something going on? This is really never talked about by Marshall & Co.; they just say "not statistically significant", which really misses the point IMO.
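For anyone who wants to check that arithmetic, here's a rough sketch in plain Python (nothing Brulosophy-specific; the panel size and thresholds are just the numbers being tossed around in this thread) of how the required count of correct pickers shifts when the bar is lowered:

[CODE]
# A sketch only: how many correct picks does a triangle-test panel need before
# the result clears a chosen alpha? Uses only the Python standard library.
from math import comb

def triangle_p_value(correct, panel_size, chance=1/3):
    """Probability of seeing 'correct' or more right answers from pure guessing."""
    return sum(
        comb(panel_size, k) * chance**k * (1 - chance)**(panel_size - k)
        for k in range(correct, panel_size + 1)
    )

def min_correct_for_alpha(panel_size, alpha):
    """Smallest number of correct picks whose guessing probability is <= alpha."""
    for k in range(panel_size + 1):
        if triangle_p_value(k, panel_size) <= alpha:
            return k
    return None

for alpha in (0.05, 0.2):
    print(alpha, min_correct_for_alpha(30, alpha))
# With 30 tasters this should print roughly 15 correct picks for alpha = 0.05
# and about 13 for alpha = 0.2, in line with the numbers discussed above.
[/CODE]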
 
Take out "to match the scienciness of these experiments" and you are spot on. That is exactly what a close-to-significance result tells you: that something might be going on, but you didn't measure it to statistical significance because your panel wasn't big enough. Before one can draw even that conclusion, however, he must be convinced that the methodology was sound. Marginally high confidence in the data from a flawed experiment is as useless as high confidence in the data from a flawed experiment.

This is what bothers me *NOT* about Brulosophy, but how some people interpret the results.

It sounds like some people say "Oh, this experiment failed to achieve significance? I guess I'll remove that step from my process! Woo time saved!"

But that's not a viable conclusion. As I tried to show in the meta-analysis of fermentation temp, few of the fermentation temp experiments achieved significance. However, it was notable that there was NEVER an experiment where the triangle test panel was below 33% at picking out the odd beer. A number of the tests were very close to significance. And if you did a meta-analysis (which is dangerous because these are different experiments), it suggested that the significance was stronger than individual panels could provide.

So yes, if it is close to p<0.05 significance, ONE possible takeaway is that there is an effect and that the panel wasn't large enough. ANOTHER possible takeaway is that guessers might have screwed it up. That's the problem with going to p<0.15 as your significance test. But if 7 experiments all occur and they're almost ALL "close" to significance, that's harder to wave away as the variable having no effect due to not achieving significance in a single experiment.

The good news for @dmtaylor is that he can figure this out on his own. Brulosophy publishes the number of testers in each panel and the number who correctly chose the odd beer out. If he prefers a looser significance threshold than p<0.05, he can quite easily calculate the significance of each experiment. (Truthfully, I don't think he even has to... I think Brulosophy provides the p-value of every experiment...)
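As a hedged illustration of how easy that back-of-the-envelope check is (the counts below just echo the 14-of-20 and 13-of-30 examples from earlier in the thread, and scipy is assumed to be available):

[CODE]
# A minimal sketch of recomputing the p-value of a published triangle test from
# the panel size and the number of correct picks. Not Brulosophy's own code.
from scipy.stats import binom

def xbmt_p_value(correct, panel_size):
    """P(this many or more right answers | everyone guesses at 1 in 3)."""
    return binom.sf(correct - 1, panel_size, 1/3)

print(xbmt_p_value(14, 20))   # roughly 0.001, as claimed earlier in the thread
print(xbmt_p_value(13, 30))   # roughly 0.17, "significant" only at p <= 0.2
[/CODE]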
 
I have had excellent results using Saflager at ale temps. I'm in no way a perfectionist when it comes to beer, but it's what I've been looking for since I started homebrewing. I've burned through probably over ten 55 lb sacks of grain in a year and a half! I buy nothing but CMC pilsner sacks now for most of my base malt.

I love Munich helles type beers. They are the best beer I ever made; my friends say the same.

The only ales I make now are wheats and ambers. All my pale beers get 34/70. I haven't tried any dark ales. I'm a session beer guy mostly.

The biggest "mistake" I made when first starting out was making all kinds of SMaSH beers to "figure out" the ingredients. I think to avoid a steep learning curve people should avoid SMaSH beers unless one fits the style guidelines, like some Bohemian pils.

I made some decent beer, but when I started using Jamil's book and recipes online my beer took a big leap.
 
This is what bothers me *NOT* about Brulosophy, but how some people interpret the results.

It sounds like some people say "Oh, this experiment failed to achieve significance? I guess I'll remove that step from my process! Woo time saved!"

But that's not a viable conclusion...

I suspect you'll find little disagreement from anyone here. They approach things with as much scientific and statistical rigor as they can and seem to be very competent brewers. I think they would wholeheartedly agree with the suggestions raised in this thread, but those are probably, for the most part, out of their scope. The issue isn't Brulosophy; it's giving the results more credit than they are due.
 
never got into stats, more of a probability man myself...

'draw five cards from a shuffled deck, what is the probability that two are red and one is a seven?'
Give each of 20 guys 3 ping-pong balls, one of which has a black mark on it, concealed in black bags. Ask them to pick two of the three bags at random and then randomly pick one of those two. For each guy, the probability that the chosen ball is the one with the black mark is 1/3; the probability that N or more of the 20 end up with the marked ball is the confidence, p, we have been speaking of here repeatedly.

factorials, ftw!:ban:

p = Σ[j = N to M] (1/3)^j * (2/3)^(M-j) * M!/(j!*(M-j)!)

is the probability that N or more out of M choose the black ball in the first step, i.e. the confidence level. IOW this part of statistics (hypothesis testing) is concerned with computing the probability of one of the hypotheses given the data observed. And given that observations are often integers (or at least limited to a non-continuous set of possible numbers), combinatorics and thus factorials come into play.
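And if the factorials look unfriendly, here's a quick Monte Carlo sketch (my own illustration, standard library only) that arrives at the same tail probability by brute force:

[CODE]
# Simulate M panelists who each land on the marked ball with probability 1/3,
# and estimate how often N or more of them succeed; compare to the exact sum.
import random
from math import comb

def exact_tail(N, M):
    return sum(comb(M, j) * (1/3)**j * (2/3)**(M - j) for j in range(N, M + 1))

def simulated_tail(N, M, trials=200_000):
    hits = 0
    for _ in range(trials):
        correct = sum(1 for _ in range(M) if random.random() < 1/3)
        if correct >= N:
            hits += 1
    return hits / trials

print(exact_tail(15, 30))      # about 0.04: the "15 of 30" case mentioned above
print(simulated_tail(15, 30))  # should land close to the exact value
[/CODE]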
 
Before one of the mods terminates this thread, I want to commend the OP for constructing one of the best loaded titles I've seen on HBT :mug:
Seriously, that's art right there ;)

True dat!
 
I understand your point @dmtaylor, but I'd argue that in the 13-out-of-30 example, where you would choose to say the study suggests a trend... and maybe it does... I am troubled that a clear majority of the tasters failed to pick the right sample.
 
Why would they terminate this thread? :confused:

I said that because every once in a while a mod steps in when people get too excited/passionate OR call people names like "fool" or "chump".

This isn't the open internet; it's a private forum, so they can enforce whatever rules they want in order to maintain a civil environment.
 
I have had excellent results using Saflager at ale temps. I'm in no way a perfectionist when it comes to beer, but it's what I've been looking for since I started homebrewing. I've burned through probably over ten 55 lb sacks of grain in a year and a half! I buy nothing but CMC pilsner sacks now for most of my base malt.

I love Munich helles type beers. They are the best beer I ever made; my friends say the same.

The only ales I make now are wheats and ambers. All my pale beers get 34/70. I haven't tried any dark ales. I'm a session beer guy mostly.

The biggest "mistake" I made when first starting out was making all kinds of SMaSH beers to "figure out" the ingredients. I think to avoid a steep learning curve people should avoid SMaSH beers unless one fits the style guidelines, like some Bohemian pils.

I made some decent beer, but when I started using Jamil's book and recipes online my beer took a big leap.

I haven't read their fermentation temp experiment, but that seems like a critical part of the process from my experience - depending on the yeast. S-04 works fine in the mid-60s, but is gross if it goes over 70. I also think you can coax different characteristics out of some yeasts by changing the temperature. I have a Belgian yeast that seems to get a lot more barfy (Belgian) at higher temps and is more mellow fruity at lower. I've never done a side by side or even the same beer twice to compare, though.

Do you control temp at all to keep it in ale range, or just let it do what it does?
 
I understand your point @dmtaylor, but I'd argue that in the 13-out-of-30 example, where you would choose to say the study suggests a trend... and maybe it does... I am troubled that a clear majority of the tasters failed to pick the right sample.

13/30 versus 10/30 is still MAYBE (~80% confidence) better than random guessing. The majority thing doesn't bother me. Sometimes maybe "maybe" is "good enough". To each our own.
 
This is what bothers me *NOT* about Brulosophy, but how some people interpret the results.

It sounds like some people say "Oh, this experiment failed to achieve significance? I guess I'll remove that step from my process! Woo time saved!"

But that's not a viable conclusion. As I tried to show in the meta-analysis of fermentation temp, few of the fermentation temp experiments achieved significance. However, it was notable that there was NEVER an experiment where the triangle test panel was below 33% at picking out the odd beer. A number of the tests were very close to significance. And if you did a meta-analysis (which is dangerous because these are different experiments), it suggested that the significance was stronger than individual panels could provide.

So yes, if it is close to p<0.05 significance, ONE possible takeaway is that there is an effect and that the panel wasn't large enough. ANOTHER possible takeaway is that guessers might have screwed it up. That's the problem with going to p<0.15 as your significance test. But if 7 experiments all occur and they're almost ALL "close" to significance, that's harder to wave away as the variable having no effect due to not achieving significance in a single experiment.

The good news for @dmtaylor is that he can figure this out on his own. Brulosophy publishes the number of testers in each panel and the number who correctly chose the odd beer out. If he prefers a looser significance threshold than p<0.05, he can quite easily calculate the significance of each experiment. (Truthfully, I don't think he even has to... I think Brulosophy provides the p-value of every experiment...)

Really, you are still holding on to your brew dogma for dear life, trying your hardest to mathematically give your position strength when none is there.

Let me ask you this: how many more times do they need to test fermentation temp for you? Now they have done it eight times, by three people in three different states. The last one compared 48 degrees and 72 degrees, a 24-degree difference. Only eight of 20 people could tell the difference, if they even could. The person who made them could only get a triangle test right two of four times and anecdotally felt certain they tasted exactly the same. But somehow that's not good enough for you, or the results from the other 7 tests aren't good enough for you to form an opinion. And I'm just some sort of "fool" for going along with these results.

So how many more times do they need to test this? Really, give us all a number.

That last test was Wyeast 2124; they have used WLP800 as well. So they've used at least three different yeasts now. Furthermore, the "meaningless" preference data in this case once again favored the warm ferment. In the 82-degree experiment the preference was also for the warm ferment.

So really, bwarbiany, explain why it would be so foolish to come to the conclusion that fermentation temperature isn't as big of a deal as you make it. What's the reasoning? Because you read it in a book somewhere, because someone you really trust told you it mattered, because it's just something you think, because you think that's what's taught in college? Once again, where's your data, other than somebody said it to be true? They made the beers and they tasted them blind in a triangle; that's how it's done. There really isn't another way to do it. Oh, I guess we could measure it with a spectrometer or something. Needing a million-dollar machine to discern a difference doesn't make a difference to me.

You need to boil 90 minutes, right? Well, how come in the two tests he did there was no DMS? And in the test where he did send it to a lab, there was no DMS. How come the boil-with-the-lid-on experiment didn't come back significant? How come the weak-versus-strong boil didn't come back significant? How come people couldn't reliably tell the difference between whirlpool hops and flameout additions? Is it all just a bunch of BS and you have the real answers, or is there a chance that an overemphasis on process considerations has skewed your perception into tasting and believing things that don't exist? Is there a chance that maybe some of this stuff just doesn't matter, and the real answer to making better beer lies elsewhere?
 
The excitement and suspense keep us all coming back.

I am not a statistician, but I know a guy who knows *some* stuff... and he thinks a sample size of roughly 100-150 is appropriate for 95% confidence. Below 50 samples, he expects only 80% confidence in any result.

So the answer to scrap's question, given my slightly-more-than-zero knowledge of the topic, is that we should keep testing until we get around 100 tasters involved, IF you want 95% confidence in the result. Fewer tasters than that, and the confidence should not be as high.
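A rough way to sanity-check that rule of thumb is a power calculation. The sketch below is only illustrative: the 45% "true" detection rate is an effect size I've assumed for the example, not a Brulosophy figure or anything the statistician quoted.

[CODE]
# If tasters truly pick the odd beer out 45% of the time (assumed), how often
# does a panel of a given size reach p < 0.05? Standard library only.
from math import comb

def tail(k, n, p):
    """P(k or more successes out of n at per-taster probability p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def power(n, true_p=0.45, alpha=0.05):
    # smallest count that would be significant under pure guessing (p = 1/3)...
    k_crit = next(k for k in range(n + 1) if tail(k, n, 1/3) <= alpha)
    # ...and the chance a panel with the assumed real effect reaches that count
    return tail(k_crit, n, true_p)

for n in (20, 30, 50, 100, 150):
    print(n, round(power(n), 2))
# Power creeps up slowly: panels of 20-30 catch this assumed effect well under
# half the time, while panels of 100 or more catch it most of the time.
[/CODE]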

That's my 2 cents. Cheers.
 
So really, bwarbiany, explain why it would be so foolish to come to the conclusion that fermentation temperature isn't as big of a deal as you make it. What's the reasoning? Because you read it in a book somewhere, because someone you really trust told you it mattered, because it's just something you think, because you think that's what's taught in college? Once again, where's your data, other than somebody said it to be true? They made the beers and they tasted them blind in a triangle; that's how it's done. There really isn't another way to do it. Oh, I guess we could measure it with a spectrometer or something. Needing a million-dollar machine to discern a difference doesn't make a difference to me.

Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.

The temp-controlled beer had an average score 11 points higher than the non-controlled beer.
 
To me the most mind-blowing experiment is the triple decoction. It is incredible to me that people were unable to reliably tell the difference between a single-infusion mash and one where one-third of the mash was taken out and boiled three times. Basically, even boiling the mash three times didn't create a difference; that, and the first mash temperature experiment of 146 versus 161 being non-significant, makes a serious argument about how much mash temp really matters.


The reason why I'm bringing this up is that I noticed in the discussion he said people felt that if tasters were told what the variable was, they would do better. That has been brought up in this thread. So they went ahead and told people the variable; in a separate group of 22 tasters, only 8 (p = 0.46) got it right. But man oh man, the merits of triple decoction are splattered across this website, and the people who make them will tell you how amazing they are comparatively. I have been accused, not directly (thanks), of being foolish or whatever for blindly trusting these experiments. While I don't blindly trust them a hundred percent, they paint a very interesting picture. While I won't go on record and say that triple decoction doesn't matter, based on all this information I just won't make one, figuring the difference isn't that big. If the difference were so massive, there would have been significance in at least one of the two times they tested this on people, especially having told people the variable. So while it may very well be true that there is a difference, based on this I will adjust my brewing practices a little. I don't need to see more than one test. I have no problem trusting this information, and as of now it still remains the only data that's been offered. As homebrewers we all owe a thank you to Brulosophy, whether you believe them or like them or not. For free, at least, they're doing something: testing and challenging thought.
 
Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.

The temp-controlled beer had an average score 11 points higher than the non-controlled beer.

Were the judges blind to the variable? Did they know which was which? How careful were you to make sure that none of them knew which was which? I think it would be cool if you did that again and did a blind triangle test. I suspect ale yeast might be more reactive to temperature than lager, just a hunch.
 
Really, you are still holding on to your brew dogma for dear life, trying your hardest to mathematically give your position strength when none is there.

Let me ask you this: how many more times do they need to test fermentation temp for you? Now they have done it eight times, by three people in three different states. The last one compared 48 degrees and 72 degrees, a 24-degree difference. Only eight of 20 people could tell the difference, if they even could. The person who made them could only get a triangle test right two of four times and anecdotally felt certain they tasted exactly the same. But somehow that's not good enough for you, or the results from the other 7 tests aren't good enough for you to form an opinion. And I'm just some sort of "fool" for going along with these results.

So how many more times do they need to test this? Really, give us all a number.

That last test was Wyeast 2124; they have used WLP800 as well. So they've used at least three different yeasts now. Furthermore, the "meaningless" preference data in this case once again favored the warm ferment. In the 82-degree experiment the preference was also for the warm ferment.

So really, bwarbiany, explain why it would be so foolish to come to the conclusion that fermentation temperature isn't as big of a deal as you make it. What's the reasoning? Because you read it in a book somewhere, because someone you really trust told you it mattered, because it's just something you think, because you think that's what's taught in college? Once again, where's your data, other than somebody said it to be true? They made the beers and they tasted them blind in a triangle; that's how it's done. There really isn't another way to do it. Oh, I guess we could measure it with a spectrometer or something. Needing a million-dollar machine to discern a difference doesn't make a difference to me.

You need to boil 90 minutes, right? Well, how come in the two tests he did there was no DMS? And in the test where he did send it to a lab, there was no DMS. How come the boil-with-the-lid-on experiment didn't come back significant? How come the weak-versus-strong boil didn't come back significant? How come people couldn't reliably tell the difference between whirlpool hops and flameout additions? Is it all just a bunch of BS and you have the real answers, or is there a chance that an overemphasis on process considerations has skewed your perception into tasting and believing things that don't exist? Is there a chance that maybe some of this stuff just doesn't matter, and the real answer to making better beer lies elsewhere?

There are literally thousands of published papers studying fermentation temperature and yeast. Your smug attitude is hilarious: you can actually say "show me the data" while the "science" you hang your hat on is useless, because you would rather trust the taste buds of strangers than actual facts.

And the DMS "experiments" were just bad experiments. What exactly was the control? How much DMS was in the wort to begin with? It's funny that you say you don't care about "needing a million-dollar machine to discern a difference" and then quote the absolutely useless testing they did on the sample as gospel. Their lab testing was 100% useless. Not even a question or argument to be had.
 
Bit of a controversial title, but hear me out. I've got some buddies who own commercial breweries with whom I shoot the ish once in a while. I've brought up things like "cold break", where the master brewer of many years told me, "I've heard of that term. Really don't know what it is," in a fashion I could only describe as the same way I would regard talk from a flat-earth theorist. On another occasion I referred to some Brulosophy experiments, as well as HBT posts about dry hop length and some other stuff I can't really remember, and he told me, "Yeah, you can't really believe those Brulosophy posts. It's not real science."

This has led me to believe that perhaps professional brewers look down on said websites. Has anyone else experienced this? What's the deal here?

Are there any professional brewers on here who have contrary opinions about the experiments on Brulosophy and other homebrew websites? And is professional brewing really such an esoteric field that the rules start changing?

Thanks!:mug:

Marshall is a dear friend and I respect and appreciate what he's done. But too many homebrewers take it as the last word, rather than a single data point. The key to science is repeatability. Someone does an experiment, then others do it to verify the results. If there's only one trial, then you can't really draw a conclusion. At Experimental Brewing, we try to get around that with multiple testers and a lot more tasters, but that has its own problems. In short, look at these experiments as a starting point for your own exploration. Trying to convince another brewer, whether homebrewer or commercial, that they're the last word is not only misleading, it's not how any of us intend the experiments to be used.
 
Because I did a similar experiment. I brewed 15 gallons of an IPA. I kept 10 gallons for myself, fermented in a temperature-controlled chamber. I gave 5 gallons to a fellow homebrew club member, which he fermented without any temperature control. We then presented the two beers to our homebrew club at a meeting, and had them evaluate them to BJCP guidelines.

The temp-controlled beer had an average score 11 points higher than the non-controlled beer.

The problem I have with this experiment is the volume of the fermenters. Did they both contain the same amount of trub? Does the yeast multiply or behave differently between 10 gallons and 5 gallons? Is the amount of yeast proportional to each sample's volume? Is there the same amount of headspace in each fermenter? Were they both kept in the dark, or did one receive more light than the other? Even this simple experiment can have too many variables, which could change the final results.
 
I haven't read their fermentation temp experiment, but that seems like a critical part of the process from my experience - depending on the yeast. S-04 works fine in the mid-60s, but is gross if it goes over 70. I also think you can coax different characteristics out of some yeasts by changing the temperature. I have a Belgian yeast that seems to get a lot more barfy (Belgian) at higher temps and is more mellow fruity at lower. I've never done a side by side or even the same beer twice to compare, though.

Do you control temp at all to keep it in ale range, or just let it do what it does?

I've used S-04 a few times and gave up on it; it was gross.

I just let my buckets sit in my basement, which is stable. I have used cool brewing bags to get the temps closer to 15C for the first few days. It works fine for me even at 18C. I'm referring to 34/70.
 
What I question personally is whether we should place much validity in their results when they expect John Q. Randomguy -- who might know little or nothing about beer -- to be able to detect differences between two beers at a 95% confidence level. But in my view, suppose we take a looser approach and only expect John Q., as well as all the other various experienced tasters, to detect a difference an average of maybe about 80% of the time,
You are confusing confidence level with preference level. If 10 out of 20 panelists qualify and 6 of them prefer beer B, then we say "60% of qualified tasters preferred beer B at the 2% confidence level." The confidence level is the probability that 6 or more would have preferred B given that A and B are indistinguishable. It gives us the information we need to decide whether we think this data is a valid measurement of the beer/panel combination, or just the result of random guesses necessitated by an unqualified panel or by the fact that the beers are actually the same in quality. If the probability that the data we obtained came from random guesses is only 2%, we are pretty confident that our data did not come from random guesses, and it is probably true that 60% or more of qualified tasters will prefer beer B.

Thus we can estimate that 60% of qualified tasters prefer a beer and be very confident (p < .001 ~ 0.1%), moderately confident (p < 0.05 ~ 5%), or not too confident at all (p < 0.2 ~ 20%) that our observation of 6/10 wasn't arrived at through random events.



this experiment has 95% confidence that there seems to be a difference". With an 80% bar instead of 95%, this lower bar is easier to meet, or to "qualify" a variable for further experimentation, rather than rehashing the same old "nope, we didn't achieve 'statistical significance' yet again". Statistically, if they only expect to be right about 80% of the time instead of 95%, the results reported should prove more interesting, at least in my own chumpy eyes.
The higher the confidence level number, the less likely we are to conclude that one beer is better. We want very small confidence level numbers. They make us feel, well, confident that we can reject the hypothesis "This panel can't tell these beers apart".

There's more to it than just confidence levels, though. If we pick a certain confidence level (let's say p = 0.01, or 1%, which is sort of midway between what is generally considered the largest acceptable value (0.05) and what is often considered the lowest level of interest (p = 0.001)), it has implications for how well our test will perform. Overall, recall that we have made some change to our brewing process and want to know whether this makes better beer. Before we can perform any experiments we must define what 'better' means. For the purposes of the current discussion, suffice it to say that 'better' means preferred by more 'qualified' tasters than a beer which is not so good. This, of course, requires a definition of 'qualified', and there has been discussion of that. So if we have beer A, beer B and beer C, with A preferred by 60% of tasters, B preferred by 70% of tasters and C preferred by 85% of tasters, we say that B is a better beer than A, and C is a better beer than A or B.

It should not come as a surprise that the performance of a test depends on the strength of the goodness, but it also depends on the design of the test and the threshold we choose. The way we describe the performance of a test comes from the RADAR engineers of WWII. A RADAR set sends out RF energy and measures the strength of the received signal over time. If the received signal is strong enough to exceed a threshold (under control of the operator in the early days) we decide a target is present, and if it doesn't we decide no target is present at the range and azimuth to which the set is listening. The operator's threshold control is relative to the noise level inherent in the environment. If the operator sets the threshold at the noise level, then noise-related voltages will be above the threshold a large proportion of the time and the scope will fill up with 'false alarms', that is, detection decisions made when there is no target present. If he sets the threshold well above the noise, only the strongest signals will exceed it and returns from smaller targets will not be detected (false dismissals).

In the beer test, target detection is the conclusion that one beer is better given the data (voltage) we obtain from the test. A good test gets more 'signal' from the beer relative to the noise, which is caused by panelists' inability to be perfect tasters, or the inclusion of panelists less qualified than we might like, and by the fact that some beers are only a little better than others while some are much better.

The performance of a test against a particular signal in a particular noise environment is well represented by a "Receiver Operating Characteristic", which is a curve like the ones on the attached graph. The curve with the open circles represents the performance of a triangle test with 20 panelists, where the probability that a panelist is qualified is 50% and the probability that the modified beer is better than the unmodified is 60%. The vertical axis shows the probability that such a test will find the beer better; each point on the curve is labeled with the value of p used to reject the null hypothesis (the confidence level). At the left end of the curve we demand that the probability of random generation of the observed data be very small. Under those conditions we do not detect better beer very readily. At the other end we accept decisions based on little confidence (high p), and thus detection is almost certain.

The horizontal axis shows the probability that we will accept that the beer is better given that it isn't (false alarm). Thus, as with the RADAR, the probability of detecting the hypothesis "The doctored beer is better" becomes higher with reduced threshold whether the beer is actually better or not.

The point at the upper left hand corner represents the point where the probability of detection is 1 and the probability of false alarm is 0. This represents a perfect receiver (test). The closer one is to that point, the better the test for the given beer. This makes it clear that one's choice of threshold should be the one that gets us closest to that corner (p = 0.059 for the circle curve). But that may not always be the right choice. In medicine, for example, the cost of missing a diagnosis may be high (the patient dies and the doctor gets sued), whereas a high false alarm rate is not such a bad thing: the expected loss from a malpractice suit goes down, while at the same time the doctor can charge for additional tests to see if indeed the patient does have the disease. Threshold choice is often determined by such considerations. In broad terms, the farther away the ROC curve is from the dashed line, the better the test.

The curve with inverted triangles is the ROC for the test mongoose suggested would be better than a triangle test, given the same parameters as the curve with circles, i.e. 20 panelists, a 50% probability that a panelist qualifies, and the beer better at the 60% level. The curve with the triangles shows the effect of increasing the panel size to 40, and the curve with squares the effect of making the beer more preferable, increasing, therefore, the panel's ability to distinguish.

I'm not going to say more in this post as I expect I'll be coming back to this.

[Attached image: TriangleROC.jpg -- ROC curves for the triangle test variations described above]
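For readers who would rather simulate than integrate, here is a rough Monte Carlo sketch of the same ROC idea. Every modelling choice in it (how 'qualified' tasters behave, how lucky guessers slip through the screen, the 60% preference level) is my own reading of the setup above, not the calculation used to produce the attached chart.

[CODE]
# Sketch: 20 panelists, each "qualified" with probability 0.5. Qualified tasters
# can only pass the triangle screen when a real difference exists; everyone else
# slips through by luck (1 in 3). Qualified passers prefer the better beer 60%
# of the time; guessers flip a coin. "Detection" = the one-sided preference
# p-value falling below the chosen threshold.
import random
from math import comb

def pref_p_value(prefer_b, passers):
    """P(this many or more prefer B out of the passers if preference is 50/50)."""
    if passers == 0:
        return 1.0
    return sum(comb(passers, j) * 0.5**passers for j in range(prefer_b, passers + 1))

def one_panel(beer_really_better, n=20, p_qualified=0.5, p_prefer=0.6):
    prefer_b = passers = 0
    for _ in range(n):
        qualified = random.random() < p_qualified
        passed = (qualified and beer_really_better) or random.random() < 1/3
        if not passed:
            continue
        passers += 1
        if qualified and beer_really_better:
            prefer_b += random.random() < p_prefer
        else:
            prefer_b += random.random() < 0.5
    return pref_p_value(prefer_b, passers)

def roc_point(threshold, trials=20_000):
    pd = sum(one_panel(True) <= threshold for _ in range(trials)) / trials
    pfa = sum(one_panel(False) <= threshold for _ in range(trials)) / trials
    return pfa, pd

for thr in (0.01, 0.05, 0.10, 0.20):
    pfa, pd = roc_point(thr)
    print(f"p <= {thr:.2f}: Pfa ~ {pfa:.2f}, Pd ~ {pd:.2f}")
# Loosening the threshold moves both the detection rate and the false-alarm
# rate up, tracing out a curve of the same general shape as those in the chart.
[/CODE]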
 
About some of it, no. But about qualifying for panels, absolutely.
I think we do agree about qualifying the panel in most cases.




Here's why I do think that [that allowing guesses is a detriment], and it's not a statistical reason, it's a measurement reason. It flies in the face of common sense that one would use, in a test of preference, people who demonstrably cannot make a preference decision.
They're guessing!
The thing that you don't seem to be able to grasp is that if you are trying to see how a proposed change in your beer will affect its sales, and you assay to do that with a taste panel, then that panel had better reflect your market, not the scientists in your QC department. This, I would think, is obvious and I shouldn't have to say more, but I'll repeat the example I have offered before.

If you want to try a cheaper malt, you brew a pilot batch with it and test it against your normal product. Using a panel that is highly qualified in beer tasting, it is probable that they will be able to detect a difference; p will probably be small, and you will probably decide not to market the new beer, thus losing the opportunity to save money and increase profits. Your decision to use the qualified panel has led you astray. You have made a mistake.

If, OTOH, you empanel people selected from your customer base, most of whom are not as good beer judges as the guys from your QC department, p is likely to be larger (this panel's selections will be more random) and you are not so likely to dismiss H0. You conclude that you can 'get away' with using the cheaper malt in this market. Profits go up and you get a bonus.

You want people who demonstrably can make that distinction doing such preference evaluation.
Sometimes, even often (see, we do agree) but not in the case I just laid out. It depends on what your investigation is about. And that is my recurring theme here.

I will also point out, again, that noise is inevitable - even with a 'qualified' panel - and that the power of a triangle (or quadrangle or...) test is that it improves the signal-to-noise ratio. See my post on ROCs.
 
Bit of a controversial title, but hear me out. I've got some buddies who own commercial breweries with whom I shoot the ish once in a while. I've brought up things like "cold break", where the master brewer of many years told me, "I've heard of that term. Really don't know what it is," in a fashion I could only describe as the same
Thanks!:mug:

There is very little that doesn't change between homebrewer and "pro". I can safely say I have not heard the phrase "cold break" since homebrewing. As to dry hopping, there is zero comparison: if I were to dry hop a batch for 2 days, I might as well not bother wasting my time. Going from 10-15 gallons to over 250 gallons, many variables change, not just the quantities. I for one have never been a fan of technical jargon. That being said, we do actually make a few 5-10 gal batches; they are primarily there for yeast starters. However, we have played with a few and created some beers that were so well received that they went on to become part of the regular rotation. So don't get so hung up on terminology and experiments; a lot of it is just intuition or dumb luck. But it's always about repeatability.
 
Were the judges blind to the variable? Did they know which was which? How careful were you to make sure that none of them knew which was which? I think it would be cool if you did that again and did a blind triangle test. I suspect ale yeast might be more reactive to temperature than lager, just a hunch.

Sadly, there wasn't enough "blindness" in the testing process, and there were other confounding variables (I kegged; the other brewer bottle-conditioned), etc. So I cannot claim it is truly scientific, not anywhere near the level of Brulosophy.

But I tasted both beers and definitely perceived a difference lol...
 
@ajdelange,

Wow, thanks for that chart! I can finally see now, visually, why for many common tests, if we're aiming for the upper left corner as you suggest, it would be optimal to select a p value of around 0.05, just as Brulosophy has done. Thanks for this.

On the other hand, the more xbmts they run that conclude "not statistically significant", the more people may tend to ignore them or dismiss or discredit their conclusions as incorrect or quacky, possibly even at the cost of sponsorships, book sales, or whatever, in a manner similar to your medical malpractice scenario (although I'll admit this is extremely unlikely).

My argument remains that a few more false alarms might not be such a terrible thing, if they encourage more of us to run even more xbmts on our own to support, refute, or learn for ourselves. More maybes and more false alarms might just excite people more than "couldn't tell the difference... again...".

Maybe.

Maybe I don't need to play this broken record anymore. Maybe I'll be quiet now. Maybe.

Cheers all.
 