Do "professional" brewers consider brulosophy to be a load of bs?

https://en.wikipedia.org/wiki/Meta-analysis

Interesting. Based on the Wiki article, it seems that it is okay to combine. Clearly there wasn't any problem in the searching for studies (as this was covered by them all being run by the same group), and there isn't a publication bias problem because we know that Brulosophy publishes both results and non-results. Given that they used the same methodology in each study, that means that I didn't have to correct for multiple results to reach a homogeneous sample--they were already homogeneous.

However, the aggregation of studies is still limited by the quality of the original experiments. If there were flaws in their methodology (such as not doing AAB, ABA, BAA, ABB, BAB, BBA random presentation of samples, etc), that error is not in any way corrected for by the meta-analysis.

One could also accuse me of bias because I already believe that temperature control is important, but I could not use that bias in any way given my selection of studies. I took all 7 of the Brulosophy studies comparing cold vs warm ferment as written, without qualifying them, and the only study I excluded--which was static vs variable temp--actually had a stronger result than the studies I used, so excluding it didn't strengthen my case.

But in general, I feel that the meta-analysis in this case has bolstered the contention that fermentation temp has a statistically significant effect on beer.
 
Because you will pass every single time you do the test. That's why I stopped using it. I have never done a diastatic power calculation and never had a mash that has failed to convert completely. What's the point of doing a test that tells me I'm completely converted when I already know that?

I guess I was thinking of doing it at 20-30 minutes and sparging from there if the conversion is done. That is one way to save a few mins on brew day :ban:
 

Five different yeasts. That's right, five. Three different people testing it in different states. 8 experiments in total, 6 unable to show significance as tested. One of them brought to the National Homebrew Convention and tasted by everybody, including famous people in homebrewing. Tons of anecdotal and qualitative data. Meaningless preference data oftentimes showing preference for the warm-fermented anyway. And yet you have added everything together so you can prove something. I think you've done a great job of showing how real information can be skewed in any way anybody wants. I don't get what you are holding onto so much that you feel the need to add all these negative results together to reach a positive. Look, I didn't care if any of this was true from the get-go, so I guess we just had opposite beginnings. It seems way too upsetting to me. If someone told me pasta could be made in cold water, it wouldn't make me all angry either. Beer can be mashed in cold water overnight, it turns out. I think it's cool, not something to disprove. Brulosophy is cool and interesting to me, I try not to attach any personal value to the findings.
 
I guess I was thinking of doing it at 20-30 minutes and sparging from there if the conversion is done. That is one way to save a few mins on brew day :ban:


A better test than iodine is to check the actual gravity of the mash.

The mash gravity is quite predictable for a known quantity of grain and water.

http://www.braukaiser.com/wiki/index.php?title=Understanding_Efficiency#Conversion_efficiency

I have done measurements recently as the mash proceeded and you clearly see the gravity changing rapidly early on and then plateauing and slowly creeping to 100% conversion after about 2-3 hours.

Things do still continue to happen after 20-30 minutes of mashing.
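If anyone wants to play with the arithmetic behind that, here's a minimal sketch of the kind of conversion-efficiency check the Braukaiser page describes. The 80% extract potential and the simple Plato bookkeeping are my own illustrative assumptions, not numbers from that page, so treat it as a rough guide only.

```python
# Rough conversion-efficiency check from a mash gravity reading.
# Assumptions (mine, for illustration): grain has ~80% of its dry weight
# available as extract, and 1 L of mash water weighs ~1 kg.

def max_mash_plato(grain_kg, water_liters, extract_potential=0.80):
    """Highest possible degrees Plato if 100% of the potential extract dissolves."""
    extract_kg = grain_kg * extract_potential
    return 100.0 * extract_kg / (water_liters + extract_kg)

def conversion_efficiency(measured_plato, grain_kg, water_liters,
                          extract_potential=0.80):
    """Fraction of the potential extract actually in solution at the time of the reading."""
    # Solve P = 100 * extract / (water + extract) for the dissolved extract.
    extract_kg = water_liters * measured_plato / (100.0 - measured_plato)
    return extract_kg / (grain_kg * extract_potential)

# Example: 5 kg of grain in 15 L of mash water, refractometer reads 17 degrees Plato.
print(round(max_mash_plato(5, 15), 1))               # ~21.1 P possible at full conversion
print(round(conversion_efficiency(17.0, 5, 15), 2))  # ~0.77, i.e. about 77% converted so far
```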
 
Five different yeasts. That's right, five. Three different people testing it in different states. 8 experiments in total, 6 unable to show significance as tested. One of them brought to the National Homebrew Convention and tasted by everybody, including famous people in homebrewing. Tons of anecdotal and qualitative data. Meaningless preference data oftentimes showing preference for the warm-fermented anyway. And yet you have added everything together so you can prove something. I think you've done a great job of showing how real information can be skewed in any way anybody wants. You are a smart guy; I don't get what you are holding onto so much that you feel the need to add all these negative results together to reach a positive. Look, I didn't care if any of this was true from the get-go, so I guess we just had opposite beginnings. It seems way too upsetting to me. If someone told me pasta could be made in cold water, it wouldn't make me all angry either. Beer can be mashed in cold water overnight, it turns out. I think it's cool, not something to disprove. Brulosophy is cool and interesting to me, I try not to attach any personal value to the findings.

Sometimes I just like to geek out on numbers... I'm an engineer, after all ;)

This one was particularly interesting to me, though, as I've always thought that fermentation temp control was a significant step in improving my beer. I even thought that years ago when I did a double-batch with my former brewing partner, and we fermented one in a temp-controlled fridge but had to put the other one in his spare bathroom un-controlled. The un-controlled batch seemed "hot", i.e. with higher-alcohol flavors.

But for me, something just seemed "off" regarding these experiments. While the individual experiments didn't achieve significance, I was noticing that the error was always in the "correct" direction. From a statistics standpoint, that told me we probably weren't dealing with blind chance. It suggested that beer was not completely indifferent to fermentation temperatures, but that perhaps the effect was too small to be seen based on the size of the tasting panels.

That's why I looked at the meta-analysis, and based upon what I can see, my tactic was NOT statistically unsound. Meta-analysis is used for just this purpose--to find effects that may be too small to be significant in individual experiments, but taken in the aggregate are meaningful.

FYI there's now a 9th ferm temp experiment, and this one actually hit p=0.002... Interestingly this one was taking WLP300 and fermenting the batches at either 60 or 72, both within range.

So with this, we're now at 91 of 208 (43.75%) correct identification of the odd beer in the triangle, against an expected null hypothesis of 33%. This corresponds to p=0.001. Assuming meta-analysis is a valid technique, these experiments are absolutely demonstrating IMHO that fermentation temperature affects the finished product.
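If anyone wants to check that number themselves, here's a minimal sketch of the pooled calculation in plain Python (an exact one-sided binomial tail; the 91-of-208 figures are just the counts quoted above):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more correct picks by luck."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Pooled triangle-test results: 91 correct identifications out of 208 tasters,
# with a 1-in-3 chance of guessing right under the null hypothesis.
print(round(binom_tail(91, 208, 1/3), 5))   # ~0.001, in line with the p-value above
```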

But of course, as you state, that doesn't get to preference. That's obviously an important factor, and the warm-ferment lager was preferred to the cool, which is a surprising finding.
 

And, of course, determining preference isn't the goal of a triangle test...

Thanks for crunching the numbers on all the temp results! Certainly interesting stuff!
 
But in general, I feel that the meta-analysis in this case has bolstered the contention that fermentation temp has a statistically significant effect on beer.

I think you are doubtless right, but even the single-case experiment I looked at (9 out of 21 correct guesses) supports that conclusion. Though those results only support rejection of the hypothesis that fermentation temperature does not make a difference at the 25% significance level, that does not prove the null hypothesis is false. In the long post I showed that we can calculate the probable range of differentiabilities, and these data indicate it is between 0 and 0.35 with 90% confidence. Based on that we are not likely to accept the null hypothesis (Pd = 0). I took it one step further and modified the calculation routine to also calculate the most probable value of Pd given the number of correct answers obtained and the panel size. For 9 out of 21 correct this is Pd_max_likelihood = 14%. That's respectably far from the null hypothesis Pd = 0.

H1(Pd=0.30): Panelists: 21; 3-ary test; 9 Correct Choices; P(< 9) = 0.11886; 0.00 < Pd < 0.35 with conf. 0.90
Most Likely Pd: 0.140
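For anyone following along, here's a minimal sketch of the guessing model behind those numbers (my own back-of-the-envelope Python, not the routine that produced the output above, and it leaves out the confidence-interval part): a taster either genuinely perceives the difference, with probability Pd, or guesses at 1-in-3, so P(correct) = Pd + (1 - Pd)/3 and the maximum likelihood Pd is just the observed proportion mapped back through that relation.

```python
def pd_mle(correct, panelists):
    """Maximum-likelihood differentiability (Pd) from triangle-test counts.

    Model: P(correct) = Pd + (1 - Pd) / 3, i.e. distinguishers always pick the
    odd sample and everyone else guesses with probability 1/3.
    """
    pc_hat = correct / panelists
    return max(0.0, 1.5 * (pc_hat - 1/3))   # clamp at zero if the panel did worse than chance

print(round(pd_mle(9, 21), 2))    # 0.14, matching the single-experiment figure above
print(round(pd_mle(91, 208), 2))  # 0.16 for the pooled data discussed below
```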
 
So with this, we're now at 91 of 208 (43.75%) correct identification of the odd beer in the triangle, against an expected null hypothesis of 33%. This corresponds to p=0.001. Assuming meta-analysis is a valid technique, these experiments are absolutely demonstrating IMHO that fermentation temperature affects the finished product.
Let's look at the combined data and see what it implies:
H0(Pd=0): Panelists: 208; 3-ary test; 91 Correct Choices; P(>= 91) = 0.00112
H1(Pd=0.20): Panelists: 208; 3-ary test; 91 Correct Choices; P(< 91) = 0.18079; 0.09 < Pd < 0.22 with conf. 0.90
Most Likely Pd: 0.160

The larger sample size gets us to a confidence of 0.0011 that we can toss the null hypothesis. Fiat! It also tightens our 90% confidence band for differentiability (and gets 0 out of it) and gives us a maximum likelihood estimate of 16% for the differentiability as opposed to 14%. That's not really very different. This consistency suggests that the various experimental results are indeed capable of being combined. But what did it buy us? A substantial improvement in support for rejecting the null hypothesis. But we already were pretty sure we should do that. Or, put another way, an alpha bigger than we might like, based on the assumption that it always needs to be < 0.05, isn't the whole story.
 
Excellent response. So what did you take from that xbmt? I took a lot. It confirmed what I had been thinking for a while, that ale, hef, some yeasts are more reactive than others. I certainly never meant or intended to be pigeonholed into any yeast strain never showing a difference. Just that it didn't always matter in the grand scheme of things.

IMO the assumption is in defense of the purchased equipment. Quickly the assumption seems made that there was a difference, so yay, cold was better, I am right, this is how good beer is made. But putting oneself in not-caring shoes, one can see that 7 preferred the warm and 8 the cold, and 4 had no preference. Furthermore the brewer claims taking a 3rd place medal with a 75/25 mix of the warm and cold. So should I go buy a fridge and controller? If so, why? Is anyone really surprised that a hef yeast showed a reaction to temp? Which one is better? The cold, right, because that is the common thought and the equipment defense. Surely, this has to make sense to someone.


So they noticed a difference, now what? If you like hefs, it says to me that flavor could be manipulated by temp and that warm and cold would both be worth trying. It says to me that 72 would be OK for a hef, as preference seems split anyway and he did well with the 3rd place. Nothing in this speaks to dogmatism about ferment temp in the grand scheme of things. Both warm and cold will make good beer, and the joy of not caring or of not needing extra equipment gives the warmer an edge to me.
 
Some day ABI, Miller, Heineken, et al. might stumble onto the repository of science that is Brulosophy and discover they can ferment their lagers warm with no consequence. Think of the implications to their bottom line from the increase in production and the decrease in residence time. They must not be spending their research money wisely.
 
No doubt, Bilsch. They can't though, can they, because of the past? It's my understanding they spend more now than ever to make their beer. They have to spend more on rice now than ever due to prices, can't remember the podcast source.
 
Since I'm advocating ASTM E1885 as gospel, let me quote a bit from ¶7.2

Thus it depends on whom and what you are interested in. The fact that assessors trained to tell the difference between Beer A and B can tell the difference between beer A and B doesn't tell me much about whether my SO is going to be able to tell the difference. OTOH if that panel was trained (or pruned) to detect diacetyl then I have confidence that Beer C is different from beer D with respect to diacetyl if they determine it to be so. The problem here seems to be that the investigators have not done all their homework in terms of determining what they want to determine before doing the experiments.

Hey AJ, this seems to be the same E1885 I linked back in post #25. Guess you answered my question about a credible source. Rang a bell because we both pointed out section 7.2. It's not $45; it's free. Oddly, my reading of the same document pointed at how much the Brulosophers were getting right in their effort to use the triangle test, while your read is more or less the opposite.
 
Man, I'm not as nerdy into statistics as a few of you guys by a long shot, but I nerded out on this thread with you. Thanks AJ and bwarbiany for the great posts. My inclinations were initially the same as yours B, but I didn't have the stats background to show it. And then as AJ pointed out, it has to actually reach the null hypothesis in order to be considered fully indifferent (if I understood that correctly). So although the experiments didn't *prove* that fermentation temp mattered, they didn't *disprove* the theory either. Rather, they should've raised more questions and more testing. Then the fact that the meta-analysis is actually acceptable practice, and that it, in fact, does *prove* that fermentation temp will make a difference, only furthers my gut-feeling that if the panel sizes were larger, we would see more significant results.

I'm not sure how inclined AJ is to converse with the brulosophy dudes, but just from my random interactions with them online, I have a feeling they'd actually appreciate some of this stuff. I would bet you that they're much closer to the mode of thinking of wanting to actually do things right, and would prefer to have such inclinations as fermentation temp proven correctly. And I'd venture to say that they're completely against those types in this thread who are taking these experiments as brewing gospel.

Lastly, after going through the meta-analysis on the ferment temp experiments, it'd be interesting to do the same with others that are testing the same variables.
 

Well, I suppose this is one way to view it, but lost in all the statistical handwaving is the ultimate problem with all of this, and that is the measuring process being used by the Brulosophy approach is hugely flawed.

I'm not antistatistical--I've had 10 university stat courses in my life, 9 of which were at the graduate level. Nothing less than an A in any of those classes. PhD minor in Statistics (yeah, they had those). One of the things one learns when one beats one's head against the statistical wall like I did is that one should never, not ever, forget that if you don't measure your variables reliably and validly, the statistics are not--ARE NOT--worth a hill of beans.

I use this brulosophy material in my own classes as a way to illustrate what happens when people get lost in probability theory and forget that in the end, without accurately measuring the variables at issue, one's conclusions are really uncertain at best.

This was pointed out earlier in the thread, and it's an absolute thread-killer, in that it really cannot be refuted. This is fundamental to research, and to statistics.

We do not know who the samples are and thus to what populations they may be generalizable. Tasters are "qualified" even if they just guessed the right answer, and then treated as if they are effective in distinguishing differences. We know that there is no consistency in what tasters may have or may not have been drinking prior to the triangle tests, and it's clear that there is quite the potential for palate fatigue, or taste-bud numbing. We simply do not know who they are, how they are prepared, and that is not science, it's something else entirely.

I have little doubt there will be another attempt to convince people with statistical hand-waving, but in the end, you can largely ignore that effort. It's just a way to distract from the fundamental problem with the brulosophy approach, something that the reliance on statistics cannot overcome.

Of course, people can believe what they want to believe, and if this "let's assume it's all measured well and then proceed as if it's valid" stuff is convincing, well, so be it.
 

The probability that Marshall is reading this thread is 1. ;)

Although I've worked with a scientific modeler who says that there's never a probability of 0 or 1...
 
Hey AJ this seems to be the same E1885 I linked back in post #25. Guess you answered my question about credible source. Rang a bell cause we both pointed out section 7.2. It's not $45 it's free.
Well, I certainly wish I'd remembered that post. I'd be $45 richer! As to the credibility of the source, it is, by definition, credible as it is a standard. There are typos in it, and I think I have found an error in the confidence interval calculation which I did not correct in the spreadsheet because it is a standard (and I'm not, at this point, 100% sure I'm right).

Oddly my reading of same document pointed at how much the brulosophers were getting right in their effort to use the triangle test while your read is more or less the opposite.
In the one description of an experiment I read, all panelists were presented, at best, the permutations of AAB (no BBA). Panelists that couldn't decide were not instructed to guess. These are pretty glaring errors in protocol and suggest that there were others. For example, I doubt that they have individual isolated booths for their panel members. These things would introduce 'noise', yet the single experiment and the pooled experiment results both suggest that the differentiability is about 15%, so I think we have to conclude that they did some things right.
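For what it's worth, here's a minimal sketch of what the balanced-presentation requirement looks like in practice: each panelist gets one triplet, with the six permutations used (as nearly as possible) equally often and the assignment randomized. The Python and the function name are just my illustration, not anything taken from the standard.

```python
import random

def triangle_orders(n_panelists):
    """Assign serving orders for a triangle test: the six permutations, balanced and shuffled."""
    perms = ["AAB", "ABA", "BAA", "ABB", "BAB", "BBA"]
    # Repeat the full set enough times to cover the panel, trim, then shuffle.
    # If the panel size isn't a multiple of 6 the balance is only approximate.
    orders = (perms * ((n_panelists + 5) // 6))[:n_panelists]
    random.shuffle(orders)
    return orders

print(triangle_orders(12))   # e.g. ['BAB', 'AAB', 'ABB', ...], each permutation appearing twice
```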
 
I think you are doubtless right, but even the single-case experiment I looked at (9 out of 21 correct guesses) supports that conclusion. Though those results only support rejection of the hypothesis that fermentation temperature does not make a difference at the 25% significance level, that does not prove the null hypothesis is false. In the long post I showed that we can calculate the probable range of differentiabilities, and these data indicate it is between 0 and 0.35 with 90% confidence. Based on that we are not likely to accept the null hypothesis (Pd = 0). I took it one step further and modified the calculation routine to also calculate the most probable value of Pd given the number of correct answers obtained and the panel size. For 9 out of 21 correct this is Pd_max_likelihood = 14%. That's respectably far from the null hypothesis Pd = 0.

H1(Pd=0.30): Panelists: 21; 3-ary test; 9 Correct Choices; P(< 9) = 0.11886; 0.00 < Pd < 0.35 with conf. 0.90
Most Likely Pd: 0.140

Let's look at the combined data and see what it implies:
H0(Pd=0): Panelists: 208; 3-ary test; 91 Correct Choices; P(>= 91) = 0.00112
H1(Pd=0.20): Panelists: 208; 3-ary test; 91 Correct Choices; P(< 91) = 0.18079; 0.09 < Pd < 0.22 with conf. 0.90
Most Likely Pd: 0.160

The larger sample size gets us to a confidence of 0.0011 that we can toss the null hypothesis. Fiat! It also tightens our 90% confidence band for differentiability (and gets 0 out of it) and gives us a maximum likelihood estimate of 16% for the differentiability as opposed to 14%. That's not really very different. This consistency suggests that the various experimental results are indeed capable of being combined. But what did it buy us? A substantial improvement in support for rejecting the null hypothesis. But we already were pretty sure we should do that. Or, put another way, an alpha bigger than we might like, based on the assumption that it always needs to be < 0.05, isn't the whole story.

I understand where you're coming from... The null hypothesis vs achieving "statistical significance" at p<0.05 are two different things...

I agree with you that as we look at these experiments we cannot necessarily "toss out" the null hypothesis, but we can somewhat discount its likelihood when we get results that suggest an effect but one not large enough to achieve p<0.05.

The question is how do you explain results to people who may not look at it the same way? That's where I think the meta-analysis can come in. By combining experiments we increase sample size, and we have a much stronger case for not just discounting the null hypothesis, but outright rejecting it.
 
Excellent response. So what did you take from that xbmt? I took a lot. It confirmed what I had been thinking for a while, that ale, hef, some yeasts are more reactive than others. I certainly never meant or intended to be pigeonholed into any yeast strain never showing a difference. Just that it didn't always matter in the grand scheme of things.

IMO the assumption is in defense of the purchased equipment. Quickly the assumption seems made that there was a difference, so yay, cold was better, I am right, this is how good beer is made. But putting oneself in not-caring shoes, one can see that 7 preferred the warm and 8 the cold, and 4 had no preference. Furthermore the brewer claims taking a 3rd place medal with a 75/25 mix of the warm and cold. So should I go buy a fridge and controller? If so, why? Is anyone really surprised that a hef yeast showed a reaction to temp? Which one is better? The cold, right, because that is the common thought and the equipment defense. Surely, this has to make sense to someone.


So they noticed a difference, now what? If you like hefs, it says to me that flavor could be manipulated by temp and that warm and cold would both be worth trying. It says to me that 72 would be OK for a hef, as preference seems split anyway and he did well with the 3rd place. Nothing in this speaks to dogmatism about ferment temp in the grand scheme of things. Both warm and cold will make good beer, and the joy of not caring or of not needing extra equipment gives the warmer an edge to me.

Well, in this case note that the difference was between 60 and 72. Hefe yeast is generally accepted to produce more clove at low temp and more banana at "high" temp, but in no way was this experiment trying to ferment a hefe with uncontrolled temperatures. The temps were within accepted ranges for hefe yeast. They didn't try to ferment a hefe at 85 degrees.

What it suggests is that buying a fridge and temp controller gives you greater control over what your beer will be than not doing so. It suggests that there is a demonstrable difference between fermenting a hefe at 60 vs 72, that these produce different characteristics, and if you want banana vs clove or vice versa, you should control for that.
 
Although I've worked with a scientific modeler who says that there's never a probably of 0 or 1...

Probability that you will die: 1.00000000000
Probability that you will live forever: 0.000000000

Probability that you will pay taxes: 1.00000000000
Probability that you will never have to pay taxes: 0.00000000
 
Well, I suppose this is one way to view it, but lost in all the statistical handwaving is the ultimate problem with all of this, and that is the measuring process being used by the Brulosophy approach is hugely flawed.

I'm not antistatistical--I've had 10 university stat courses in my life, 9 of which were at the graduate level. Nothing less than an A in any of those classes. PhD minor in Statistics (yeah, they had those). One of the things one learns when one beats one's head against the statistical wall like I did is that one should never, not ever, forget that if you don't measure your variables reliably and validly, the statistics are not--ARE NOT--worth a hill of beans.

I use this brulosophy material in my own classes as a way to illustrate what happens when people get lost in probability theory and forget that in the end, without accurately measuring the variables at issue, one's conclusions are really uncertain at best.

This was pointed out earlier in the thread, and it's an absolute thread-killer, in that it really cannot be refuted. This is fundamental to research, and to statistics.

We do not know who the samples are and thus to what populations they may be generalizable. Tasters are "qualified" even if they just guessed the right answer, and then treated as if they are effective in distinguishing differences. We know that there is no consistency in what tasters may have or may not have been drinking prior to the triangle tests, and it's clear that there is quite the potential for palate fatigue, or taste-bud numbing. We simply do not know who they are, how they are prepared, and that is not science, it's something else entirely.

I have little doubt there will be another attempt to convince people with statistical hand-waving, but in the end, you can largely ignore that effort. It's just a way to distract from the fundamental problem with the brulosophy approach, something that the reliance on statistics cannot overcome.

Of course, people can believe what they want to believe, and if this "let's assume it's all measured well and then proceed as if it's valid" stuff is convincing, well, so be it.

I did read your stuff as well throughout the thread. With what time I have to read on this forum, it took me a couple of days to sort through this thread. It's obvious you know what you're talking about. And only now have you stated just how much you know about statistics, even though it was practically implied in most of your posts.

I guess that was also my point, that if we disregard just how flawed their testing is (and again, I think that if you talked directly with Marshall and crew, they'd readily admit that point), and simply look at their results... EVEN THEN, the stats are not in their favor. And by "their," I simply mean those who take the results and adhere to them like biblical orthodoxy.

And again, just with the interactions I've had with Marshall (albeit, online), I'd say if these sorts of things were presented to him, that he'd accept it with an open, scientific mind.

I could be wrong, of course, but this is my inclination.
 
Man, I'm not as nerdy into statistics as a few of you guys by a long shot, but I nerded out on this thread with you. Thanks AJ and bwarbiany for the great posts. My inclinations were initially the same as yours B, but I didn't have the stats background to show it. And then as AJ pointed out, it has to actually reach the null hypothesis in order to be considered fully indifferent (if I understood that correctly).
You are in the ballpark but not quite there. What this branch of statistics is all about is positing a hypothesis and then calculating the probability that some data you observed would arise under that hypothesis. If the probability is very low then the hypothesis probably isn't true and ought to be dismissed in favor of an alternative hypothesis.

So although the experiments didn't *prove* that fermentation temp mattered, they didn't *disprove* the theory either.
As has been pointed out (my attempts at wit in another post aside) there are no probabilities of 0 or 1, so there is always uncertainty. We can't know that hot and cold fermented beers are indistinguishable. We can just say that it is unlikely that they are given the data that we observed. The question becomes how improbable do they need to be before we decide that for practical purposes we should say they are not indistinguishable. That is up to those who interpret the data to decide. The probability thresholds are often determined by the costs associated with being wrong (i.e. that they are the same but we decide they are different).

Rather, they should've raised more questions and more testing.
The first thing we would want them to do is correct the procedural errors. Then we'd like to see some of the tests repeated using the correct procedures. If they say "Here's what we got following ASTM protocols" then their data is much more likely to be accepted than if they violate ASTM protocols as they have done in the past. You can challenge me on this by saying "What difference does it make if they only presented BBA?" I'd have to answer that I don't really know, but that question goes away if they follow the protocol (equal numbers of ABB, BAB, BBA, BAA, ABA and AAB).

Then the fact that the meta-analysis is actually acceptable practice, and that it, in fact, does *prove* that fermentation temp will make a difference, only furthers my gut-feeling that if the panel sizes were larger, we would see more significant results.
I'm assuming you put "prove" in quotes because data based on larger panel size does not prove any more than data from smaller panel sizes. It may strengthen our convictions with regard to a position on a hypothesis though.

The other approach here is to look at the other information that is available from triangle test data, i.e. the information on the range of likely differentiability and on the level of differentiability that most likely explains the observed data. This may help us to reject or accept the hypotheses even though the confidence levels are not what we hoped for. It may not be necessary to increase sample size or pool data from multiple tests, but doing either of those does improve the estimate of differentiability in addition to decreasing alpha.

I'm not sure how inclined AJ is to converse with the brulosophy dudes,
Perfectly. My original suggestion to them here was that they shell out the $45 to ASTM for the procedure, but I can now change that to suggest that they go to #25 and follow the link there. With that in hand there should be common ground enough that I can answer any questions they might have, make suggestions, etc. This I am happy to do.
 
Probability that you will die: 1.00000000000
Probability that you will live forever: 0.000000000

Probability that you will pay taxes: 1.00000000000
Probability that you will never have to pay taxes: 0.00000000

Death and taxes are knowns. No need to model the probabilities. ;)
 

A couple of you are masters at the multi-quoting, I'm too lazy for that. haha.

1) I gotcha. I think. So, if a test shows that the probability is low, and thus should be thrown out, yet none of these ferment temps experiments (except for perhaps one) have reached that threshold of low enough to throw out, then maybe we should be *at the very least* retesting?

2) Ok, so how improbable do they need to be before you decide that for practical purposes you should say they are not indistinguishable?

3) A) I'd agree. My inclinations were that if there were some set of guidelines, that these tasters were likely not under those guidelines (one test that keeps being mentioned is the one at HomebrewCon from last year - yet how many tasters were already drinking? How many were isolated? etc.). And B) no, I wouldn't challenge you on any of this... ;)

4) Precisely.

5) Yeah, I'm 99% sure I'm with you on this point.

6) Only thing I can say is, as others have said, I doubt Marshall and Co. are reading this thread. If they are, I think it'd be pretty awesome if the "two" of you got into contact. If they aren't, I think it'd be pretty cool if you, AJ, got in contact with them - as I think they'd be pretty accepting of your ideas.

In the very beginning when Marshall was alone, he was simply doing side-by-side tests, and if I remember correctly, even pretty much telling his testers what he was experimenting with. Then he decided to move to a triangle test because of suggestions from people he interacted with, who basically said to him, "I think it's awesome what you're trying to achieve here, but here's how you could improve." Then, after a bit more time, and after he had already published tons of *data*, he went back and said, "Actually, we were a bit too stringent with our specificity when it came to the null hypothesis thingamajig. With that said, here are a number of tests that would actually have reached significance had we started out with this method."

That being said, I really think that they would be receptive to hearing that they could, and should, be doing their testing, and even the stats part of their testing, that much better.
 
The question becomes how improbable do they need to be before we decide that for practical purposes we should say they are not indistinguishable. That is up to those who interpret the data to decide. The probability thresholds are often determined by the costs associated with being wrong (i.e. that they are the same but we decide they are different).

Well said.

It saddens me to see that so much research these days is about chasing p-values in order to be published.

Slightly off topic, but I'd be interested in your thoughts (and others') on:

http://fivethirtyeight.com/features...-agree-on-its-time-to-stop-misusing-p-values/
 
Well, I suppose this is one way to view it, but lost in all the statistical handwaving is the ultimate problem with all of this, and that is the measuring process being used by the Brulosophy approach is hugely flawed.
I am not sure what is meant here because in #273 you wrote:

Which, of course, is the single biggest issue with the triangle testing as it is presented. It's why "qualifying" people on the basis of a lucky guess doesn't make any sense,
This suggests that you feel that triangle testing itself is flawed because of the requirement that panelists guess if they can't decide. I've explained before why each increment from triangle, to quadrangle, to pentagonal testing increases the sensitivity of the test so I won't repeat that but rather ask that you step back and look at the forest. If triangle testing were flawed it wouldn't work and the food and beverage industries would probably have noticed this by now. They would not continue to use it nor would there be published standard procedures for it. Now perhaps I have misinterpreted the "...issue with the triangle testing as it is presented" phrase and perhaps this means that you accept the validity of triangle testing but find fault with the way Brulosophy has implemented it. I do too but wonder if "hugely flawed" is an accurate description. When I criticize them for only presenting 2 A's and 1 B (without knowing if they are permuted) I do so because the protocol calls for triplets equally and randomly distributed among the permutations of ABB and BAA. I recognize that there are doubtless good reasons for the random distribution requirements and can even guess what they are but I do not know what the effects of failure to adhere to this particular requirement are other than that they probably introduce noise. Noise masks the signal from the beer reducing the confidence level and our estimate of the signal to noise ratio i.e. the differentiability.



One of the things one learns when one beats one's head against the statistical wall like I did is that one should never, not ever, forget that if you don't measure your variables reliably and validly, the statistics are not--ARE NOT--worth a hill of beans.
Not true! I cut my eye teeth on this stuff using statistical estimation theory to decode signals immersed in highly colored noise. The measurements were most unreliable in this sense, and yet the stats we collected enabled us to decode the signals. The results weren't beautiful but the taxpayers ponied up a pretty tall mountain worth of beans in support of this effort. And we did beat our heads against the statistical wall, believe me. Because of this experience I view the art as being able to extract information in cases where the observations are corrupted. I see the faults in Brulosophy's implementation as corruptors, but not sufficiently strong corruptors to render the data they have collected useless. Why do I say that? Because I can extract differentiability estimates from their data. As noted in earlier posts, one of their temperature differential experiments gave an estimate of 14% for differentiability, and when that data was pooled with other sets of data the differentiability estimate was 16%. When they did another experiment on wheat beer the maximum likelihood estimate of the differentiability rose to 39%. Here's the analysis:

H1(Pd=0.20): Panelists: 32; 3-ary test; 19 Correct Choices; P(< 19) = 0.89677; 0.22 < Pd < 0.56 with conf. 0.90
Most Likely Pd: 0.39000
probs(32,19,3,.0,.90,1)
H0(Pd=0): Panelists: 32; 3-ary test; 19 Correct Choices; P(>= 19) = 0.00222

The consistency between the single test and the pooled data and the appreciably larger differentiability for the wheat beer (anyone who has fermented wheat beer at different temperatures knows the 'signal' with respect to temperature difference is much greater than with lagers), while they don't prove anything conclusively, suggest that Brulosophy's procedural errors may not be so serious after all. Perhaps were they to adhere to ASTM E1885-04 the differentiability estimates might go up (error induced noise goes down).



I use this brulosophy material in my own classes as a way to illustrate what happens when people get lost in probability theory and forget that in the end,
Well, clearly lots of people apply statistics in ways that lead to wrong conclusions, so anyone who knows what he is doing conducts reality checks. The reality checks I just gave on the Brulosophy data seem encouraging, though we don't know how they would change if they strictly followed the protocol.


...without accurately measuring the variables at issue, one's conclusions are really uncertain at best.
Naturally there is going to be uncertainty induced by the noise. The maximum likelihood estimate is the location of the peak of the likelihood function. That peak has finite width.

This was pointed out earlier in the thread, and it's an absolute thread-killer, in that it really cannot be refuted. This is fundamental to research, and to statistics.
I don't think anyone disagrees that if the SNR gets too low you can't measure differentiability, but this is hardly a thread killer as it is apparent here, at least from the data I have looked at, that the SNR isn't too low.

We do not know who the samples are and thus to what populations they may be generalizable. We know that there is no consistency in what tasters may have or may not have been drinking prior to the triangle tests, and it's clear that there is quite the potential for palate fatigue, or taste-bud numbing.
This brings us back to ¶7.2 of the standard "Choose assessors in accordance with test objectives." If the objective is to determine whether fermentation temperature makes a detectable difference over a demographic of tasters trained and untrained, those with palate fatigue or not, drunk or not... then these data accurately (except for the other errors) represent that demographic.


Tasters are "qualified" even if they just guessed the right answer, and then treated as if they are effective in distinguishing differences.
Yes, and this is exactly what you need to have panelists do in order to detect that the null hypothesis is true, which is sometimes the thing one is interested in. The power of the triangle test relative to duo-trio and paired comparison tests (which also require guessing for the same reason) derives from the fact that the probability of a 'correct' guess under the null hypothesis is 1/3 as opposed to 1/2. That's why it tends to be used rather than those tests. You really need to understand this.
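To put a number on that power argument, here's a minimal sketch under the same guessing model (exact one-sided binomial tests at alpha = 0.05; the panel size and Pd in the example are arbitrary choices of mine):

```python
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, pd, chance, alpha=0.05):
    """Power of a forced-choice difference test.

    chance = guessing rate under the null (1/3 for triangle, 1/2 for duo-trio);
    true P(correct) = pd + (1 - pd) * chance under the guessing model.
    """
    # Smallest number of correct answers that reaches significance...
    k_crit = next(k for k in range(n + 1) if tail(k, n, chance) <= alpha)
    # ...and the probability of reaching it when a fraction pd really can tell.
    return tail(k_crit, n, pd + (1 - pd) * chance)

n, pd = 30, 0.30
print(round(power(n, pd, 1/3), 2))   # triangle test
print(round(power(n, pd, 1/2), 2))   # duo-trio: noticeably lower power at the same n and Pd
```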


We simply do not know who they are, how they are prepared, and that is not science, it's something else entirely.
Now if they reported in detail who the panel members were, whether squid and garlic sandwiches were served before the test, etc. then it would be science and that's what we are trying to get them around to.

I have little doubt there will be another attempt to convince people with statistical hand-waving, but in the end, you can largely ignore that effort. It's just a way to distract from the fundamental problem with the brulosophy approach, something that the reliance on statistics cannot overcome.
So we don't accept the results of statistical analysis if they provide strong support for a position we don't like.

That triangle testing is widely used in the food, beverage, cosmetic and any other industry where sensory perception is of interest proves that it is a valid technique, because if it weren't it would have been dropped. There. No statistics - just common sense. There is no fundamental problem with the Brulosophy approach, as it is basically what they are doing in breweries, candy factories, soft drink companies, drug companies, the audio industry (where they call it an A, B, C test), etc. Now there are some problems with Brulosophy's implementation. They do not adhere to the accepted standard protocols exactly. But examination of the data seems to indicate that the shortcomings do not completely impair our ability to estimate differentiability.

Of course, people can believe what they want to believe/
So it seems.
 
Slightly off topic, but I'd be interested in your thoughts (and others') on:

http://fivethirtyeight.com/features...-agree-on-its-time-to-stop-misusing-p-values/

Yes, absolutely. I was so relieved to see that it said that even scientists have trouble explaining it. I have to think about it every time. I was also intrigued by the statement that p is not the whole story. I made that same assertion in a relatively recent post and have found myself in these investigations to be more inclined to be interested in the maximum likelihood estimate of differentiability than in p.
 
Have we had jelly beans yet?
Have some jelly beans...

[attached image: significant.png, the xkcd "Significant" jelly bean comic]
 
I just reread the article on p more carefully. Many good points. With respect to the current discussion, the most grievous error committed by Brulosophy in the temperature experiments was accepting the null hypothesis because they didn't have the 'requisite' p < 0.05. According to the article, they have plenty of company among respected scientists.

[attached image: frequentists_vs_bayesians.png, the xkcd "Frequentists vs. Bayesians" comic]


I'm a Bayesian (by experience - I used Bayes' theorem to find the most likely differentiability) and there is definitely a Bayesian tone to ASTM E1885 (as there is bound to be when money is involved), though the frequentist approach is emphasized. Perhaps this is recognition that, as the p paper suggests, both are needed to get a fuller picture, and I guess that summarizes my thinking at this point.

Taking another look at the first Brulosophy experiment on temperature:
H0(Pd=0): Panelists: 21; 3-ary test; 9 Correct Choices; P(>= 9) = 0.23988
H1(Pd=0.20): Panelists: 21; 3-ary test; 9 Correct Choices; P(< 9) = 0.28653; 0.00 < Pd < 0.35 with conf. 0.90 Most Likely Pd: 0.14

There's data there, but is it 'actionable', as mongoose likes to say? Should a brewery confronted with such data buy a refrigeration plant? Looking just at p we might decide it should not, as there is enough support for the null hypothesis that we can't reject it. But looking at the likelihood results we see that the percentage of our demographic that can tell the difference is most likely 14%. Probably not high enough to justify the cost. But it's also possible that as many as 35% of our customers could tell a difference. Or as few as none of them. I wouldn't want to have to make a decision based on those results.

Now let's look at the most recent wheat beer test results again:
H0(Pd=0): Panelists: 32; 3-ary test; 19 Correct Choices; P(>= 19) = 0.00222
H1(Pd=0.20): Panelists: 32; 3-ary test; 19 Correct Choices; P(< 19) = 0.89677; 0.22 < Pd < 0.56 with conf. 0.90 Most Likely Pd: 0.39

This is more like it. p = 0.0022, and the likelihood computations show 39% as the most likely fraction of the demographic that would note the difference, with the range going up to 56% in the 90% confidence band. I'd feel comfortable exploring the purchase of refrigeration gear based on that. This is actionable information.

Thanks to AZ_IPA for posting the link to the p paper. Very timely.
 

Good stuff. The cartoon reminds me of the two old-school behavioral psychologists with a total disregard for internal experience. After making love, the woman asks: "It was good for you, was it good for me?".

I am enjoying the discussion, even though I'm reminded how little I've retained from the couple of under-graduate courses I took in statistics.
 
1) I gotcha. I think. So, if a test shows that the probability is low, and thus should be thrown out, yet none of these ferment temps experiments (except for perhaps one) have reached that threshold of low enough to throw out, then maybe we should be *at the very least* retesting?

I think you're still missing something that I completely missed until AJ said it about 2-3 times lol...

The hypothesis being tested is the null hypothesis: "These two beers produced with different methods are indistinguishable." That is the hypothesis we're TRYING to throw out during all this testing.

So if something doesn't achieve significance, we don't throw out the positive hypothesis of "fermentation temps matter", we fail to throw out the hypothesis "fermentation temps have no impact on the finished beer." It's hard to wrap your mind around that distinction, but once you do, it all falls into place.

When we say something "didn't achieve significance", it's referring to a commonly accepted threshold of p<0.05, which is actually a pretty stringent criterion. When we say that it didn't achieve significance, we are not throwing out the positive hypothesis, i.e. we are not proving they're distinguishable, but we are not declaring them indistinguishable. We're failing to prove beyond a certain confidence that they're distinguishable.

If you have, say, 36 tasters, you would expect pure guessing to result in 12 (33%) tasters selecting the odd beer. You would require 17 to achieve p=0.058 (close to significance) and 18 (50%) to achieve p<0.05 (actually p=0.028) to declare that the test was "statistically significant".

But what if you only get 16? That's p=0.109. We would declare that not statistically significant. But it's not 12 (or even less). So you're in a quandary. That result isn't enough to achieve significance, but it's also hard to accept the null hypothesis as true. You have to weigh the odds that the variance of 4 tasters is random chance (null hypothesis true) vs the odds that the variance of 4 tasters is real (null hypothesis false, but the beers are not distinguishable enough to achieve p<0.05).

The simple fact is that p<0.05 is to some degree an arbitrary threshold. And that sample size is important, because a sample of 20 tasters requires 11 (55%) to get to p<0.05 but a sample of 200 tasters requires only 79 (39.5%) to get to p<0.05. The percentage difference relative to guessing to declare significance is dependent on sample size.
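To see how that threshold moves with panel size, here's a small sketch using the same pure-guessing binomial model; it just searches for the minimum number of correct picks that gets under p = 0.05 for each panel size mentioned above:

```python
from math import comb

def p_value(n: int, k: int, p_guess: float = 1/3) -> float:
    """One-tailed probability of k or more correct picks out of n under pure guessing."""
    return sum(comb(n, j) * p_guess**j * (1 - p_guess)**(n - j) for j in range(k, n + 1))

def min_correct_for_significance(n: int, alpha: float = 0.05) -> int:
    """Smallest number of correct picks that drives the p-value below alpha."""
    return next(k for k in range(n + 1) if p_value(n, k) < alpha)

for n in (20, 36, 200):
    k = min_correct_for_significance(n)
    print(f"{n} tasters: need {k} correct ({k / n:.1%}), p = {p_value(n, k):.3f}")
# Expected: 20 -> 11 (55%), 36 -> 18 (50%), 200 -> 79 (39.5%)
```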

This was my point with the meta-analysis. I looked at two things across a bunch of experiments testing the same variable:

1) The "error", i.e. the results relative to chance, always pointed in the same direction. It was ALWAYS >=33% of tasters selecting the odd beer, but the results weren't strong enough to reliably achieve significance. If the "error" relative to guessing was both positive and negative, I would have a lot more difficult time improving the significance by adding the experiments together.

2) If the error is always the same direction but the experiments don't achieve significance, I assume that's a sample size problem. I.e. "these beers are different but not different enough that it's easy to detect." By aggregating the experiments, I "create" a larger sample size, and with a larger sample size, I need a lower percentage correctly selecting the odd beer to declare p<0.05.
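Mechanically, that aggregation is just adding the tasters and the correct picks across experiments and re-running the same binomial test on the pooled totals. A sketch with made-up counts (deliberately not the actual Brulosophy numbers) shows why several individually non-significant panels leaning the same way can pool into a significant result:

```python
from math import comb

def p_value(n: int, k: int, p_guess: float = 1/3) -> float:
    """One-tailed probability of k or more correct picks out of n under pure guessing."""
    return sum(comb(n, j) * p_guess**j * (1 - p_guess)**(n - j) for j in range(k, n + 1))

# Hypothetical panels: each leans above chance, none clears p < 0.05 on its own.
experiments = [(21, 9), (24, 11), (20, 9), (25, 11)]   # (panelists, correct)
for n, k in experiments:
    print(f"n={n}, correct={k}: p = {p_value(n, k):.3f}")

pooled_n = sum(n for n, _ in experiments)
pooled_k = sum(k for _, k in experiments)
print(f"pooled n={pooled_n}, correct={pooled_k}: p = {p_value(pooled_n, pooled_k):.4f}")
```

With these made-up counts the pooled p drops below 0.05 even though no single panel gets there. It's crude fixed-effect pooling, which is only defensible because the experiments used the same protocol; a fuller meta-analysis would weight the studies and check for heterogeneity.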

2) Ok, so how improbable do they need to be before you decide that for practical purposes you should say they are not indistinguishable?

Ay, there's the rub...

I'm personally convinced based upon what I've done here in the meta-analysis. I have rejected the null hypothesis, that ferm temp doesn't matter. (Preference is something I am still unsure of, but that's a different issue).

applescrap is not. He thinks I'm basically searching for numbers in support of my bias (that ferm temp matters), which is certainly possible. And I suspect he thinks it affects the beer, but that it's not a sizable enough effect to worry about. He's also pointed to the preference numbers in some tests for warm ferment as evidence that even if the two are different we can't categorically say "cool ferment is better" in all instances, which is truly what we're trying to discover.

mongoose IMHO may not be convinced. He has legitimate concerns about the quality of the tasting panels and the methodology of the study itself that make him doubt that the original studies are good enough. As I pointed out elsewhere, even if my meta-analysis was perfectly conducted, it doesn't correct for methodological errors in the original studies.

I'm not sure of AJ's position, actually, because as he's said he hasn't really read the bulk of the experiments. He's been invaluable to all of us by providing background on testing methodology. I would say that he's likely rejected the null hypothesis as well based on what he's said, but trying to say anything beyond that statement isn't supported by what I've read that he's posted in this thread.

So there you go. 4 different posters who have been arguing this stuff vociferously, and we could have as many as 4 different interpretations of the data.

-----------------------------------

BUT, and this is the important thing, AJ made a very useful point. In all these things we're talking about confidence and error, and how they are balanced against the cost of being wrong (either way; the cost of additional equipment/process to do things a certain way vs the cost of making sub-par beer).

I use fermentation temperature control. I really like the beer I make. I've been doing this 10+ years and based on my process, I believe I'm making commercial-quality beer. I've already got sunk costs into the fridge and temp controller, so the only marginal cost of temp control is electricity and floor space in my garage. Even if I'm wrong, I see no reason to change my process because whatever I'm doing is working. So I'm just geeking out on statistics, numbers, and debate, not actively looking to change my own process anyway.

To answer your questions about how improbable it needs to be, the answer is up to you. If you're not using temp control, do the results of the experiments justify in your own mind that you should start using temp control? If you're already using temp control, do the results of the experiments justify in your own mind that the electricity and space savings are justified to get rid of your temp control setup? That's all that matters.
 
The hypothesis being tested is the null hypothesis: "These two beers produced with different methods are indistinguishable." That is the hypothesis we're TRYING to throw out during all this testing.

Sometimes we want to throw out the null hypothesis, for example if we are trying a new, more expensive malt in the hope that it will make the beer so much more delicious that new customers will flock. But sometimes we want to accept the null hypothesis, for example if we are trying a new, cheaper malt in the hope that customers won't notice and sales will be unaffected. Or, more applicable to the home brewing scenario, if a triangle test revealed, to statistical significance, that a panel of experienced tasters couldn't tell the difference between single-decocted and triple-decocted beers, I could save myself a lot of work.

But in either case we make our decision based on testing the null hypothesis and to do that we must require panelists to guess if we are to obtain scores when the null hypothesis is true or nearly so. This is the point that mongoose seems to be unable to grasp.

It's hard to wrap your mind around that distinction, but once you do, it all falls into place.
I dunno. I still have to be very careful with it.

When we say something "didn't achieve significance", it's referring to a commonly accepted threshold of p<0.05, which is actually a pretty stringent criterion.
But often, if the consequences of accepting the null hypothesis are expensive enough, it is lower than that. 0.01 and 0.001 are the other 'popular' values. Unfortunately, as discussed in the paper referenced by AZ_IPA, people who don't really understand what it means often grab 0.05 simply because it is a number that has become a sort of maximum allowable.

When we say that it didn't achieve significance, we are not throwing out the positive hypothesis, i.e. we are not proving they're distinguishable, but we are not declaring them indistinguishable. We're failing to prove beyond a certain confidence that they're distinguishable.
We can't prove anything with statistics. When we say that a test does not achieve a desired level of significance, it means the data we observed are still consistent enough with the null hypothesis that we don't feel comfortable dismissing it.


If you have, say, 36 tasters, you would expect pure guessing to result in 12 (33%) tasters selecting the odd beer. You would require 17 to achieve p=0.058 (close to significance)
If you got 17 you would say "The test was significant at the 0.058 level".

and 18 (50%) to achieve p<0.05 (actually p=0.028) to declare that the test was "statistically significant".
The test was significant at the 0.028 level. For 19, "The test was significant at the 0.0125 level." The level at which the test is deemed sufficiently significant is up to the person making the decision. For example, AZ_IPA's article implies that p < 0.05 is required to get a research paper published. p*cost_of_false_alarm is the expected cost of Type I errors. Obviously we want that to be as low as possible, but as the ROC discussions of earlier posts show, when we set the threshold for lower p we are also reducing the probability of detection for a particular test and increasing the probability of false dismissal (Type II error). There is a cost associated with this too: p_FD*cost_of_false_dismissal. It's pretty clear that we want to set the threshold to the value which minimizes
p*cost_of_false_alarm + p_FD*cost_of_false_dismissal
Think about an air defense radar when pondering this.
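AJ's cost expression can be turned into a toy threshold search: for each candidate decision threshold (the number of correct picks at which you declare the beers different), compute the false-alarm probability under guessing and the false-dismissal probability under an assumed alternative Pd, weight each by an assumed cost, and keep the threshold with the lowest expected cost. The panel size, Pd, and costs below are placeholders for illustration, not anything taken from the thread:

```python
from math import comb

def tail_prob(n: int, k: int, p: float) -> float:
    """P(k or more correct out of n) when each panelist is correct with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def best_threshold(n: int, pd: float, cost_false_alarm: float, cost_false_dismissal: float,
                   p_guess: float = 1/3) -> tuple[int, float]:
    """Threshold (minimum correct picks to reject H0) that minimizes expected cost.

    Expected cost = P(false alarm)*cost_false_alarm + P(false dismissal)*cost_false_dismissal,
    where the alternative assumes a fraction `pd` of the population can tell the beers apart,
    so P(correct | H1) = p_guess + (1 - p_guess) * pd.
    """
    p_alt = p_guess + (1 - p_guess) * pd
    best = None
    for k in range(n + 2):                       # k = n+1 means "never reject"
        p_fa = tail_prob(n, k, p_guess) if k <= n else 0.0
        p_fd = 1 - (tail_prob(n, k, p_alt) if k <= n else 0.0)
        cost = p_fa * cost_false_alarm + p_fd * cost_false_dismissal
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

# Placeholder costs: a false alarm (buying refrigeration for nothing) hurts 5x more
# than a false dismissal (shipping slightly worse beer), with 36 tasters and Pd = 0.25.
print(best_threshold(36, pd=0.25, cost_false_alarm=5.0, cost_false_dismissal=1.0))
```

In this toy model, raising the cost of a false alarm pushes the optimal threshold up, i.e. demands a smaller p before acting, which is exactly the trade-off being described.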

But what if you only get 16? That's p=0.109. We would declare that not statistically significant. But it's not 12 (or even less). So you're in a quandary. That result isn't enough to achieve significance, but it's also hard to accept the null hypothesis as true. You have to weigh the odds that the variance of 4 tasters is random chance (null hypothesis true) vs the odds that the variance of 4 tasters is real (null hypothesis false, but the beers are not distinguishable enough to achieve p<0.05).
All you can say here is that the probability of getting 16 or more hits out of 36 tasters under the null hypothesis is 0.109. This means that if you did 1000 panels just like this one, giving them identical beers, you'd expect 16 or more hits in 109 of them. This isn't vanishingly small support for the null hypothesis, so most people wouldn't feel comfortable rejecting it, but then it isn't glaringly strong support for it either. Were the beers indistinguishable you would expect 12 of the tasters to get the right answer, and you got 16, which is quite a few more, so the beers probably are distinguishable, but not enough to be able to say "no way we could have gotten this many hits if they were indistinguishable". In a case like this we look for other information from the data, i.e. the probable differentiability of the beers (on a scale of 0, which represents the null hypothesis, to 1, which means that 100% of the population of interest can tell them apart). Sixteen out of 36 right suggests that the differentiability lies, with 90% probability, between 0.01 and 0.33 and has most likely value 0.17. This is not a terribly strong signal that the beers, as evaluated by this panel, are differentiable, but it is a signal to this effect. It tells you that another test needs to be done with a larger number of panelists to see if we can tighten that band around the most likely value. But p = 0.109 says the same thing.


The simple fact is that p<0.05 is to some degree an arbitrary threshold.
It's completely arbitrary. The required value depends on the application.


And that sample size is important,
You want a sensitive test and sensitivity depends on panel size. The other big factor in sensitivity is the probability of a correct guess under the null hypothesis. That's why a triangle test (pc = 1/3) is more sensitive than a duo-trio test (pc = 1/2) and why a quadrangle test (pc = 1/4) is more sensitive than a triangle test.
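One way to see that sensitivity claim numerically: fix a panel size and a true Pd, work out the correct-answer probability for each test format, and compare the probability that each format reaches p < 0.05. A rough sketch under the assumption that panelists who truly perceive the difference always answer correctly; the panel size and Pd are mine, not from the thread:

```python
from math import comb

def tail_prob(n: int, k: int, p: float) -> float:
    """P(k or more correct out of n) when each answer is correct with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def power(n: int, p_guess: float, pd: float, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 when a fraction `pd` of panelists truly perceive a difference."""
    # Smallest count of correct answers that is significant at level alpha under guessing.
    k_crit = next(k for k in range(n + 1) if tail_prob(n, k, p_guess) < alpha)
    p_true = p_guess + (1 - p_guess) * pd        # distinguishers answer correctly, the rest guess
    return tail_prob(n, k_crit, p_true)

n, pd = 36, 0.30
for name, p_guess in (("duo-trio", 1/2), ("triangle", 1/3), ("quadrangle", 1/4)):
    print(f"{name:10s} (chance {p_guess:.2f}): power = {power(n, p_guess, pd):.2f}")
```

With the same panel and the same true Pd, the power ordering comes out quadrangle > triangle > duo-trio, which is the point about the chance-guess probability driving sensitivity.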


1) The "error", i.e. the results relative to chance, always pointed in the same direction....

2) If the error is always the same direction but the experiments don't achieve significance, I assume that's a sample size problem.
I think you are on pretty solid ground here.


mongoose IMHO may not be convinced. He has legitimate concerns about the quality of the tasting panels and the methodology of the study itself that make him doubt that the original studies are good enough.
He is not on board because he doesn't understand that the basis for any of these discrimination tests (paired comparison, duo-trio, triangle, quadrangle...) is testing of the null hypothesis and that requires guessing. Despite the wide acceptance of these test methods he thinks them seriously flawed. His concerns about Brulosophy's procedural and analysis errors are, of course, valid. I indicated in my response to his post that I believe the flaws to obscure the signal from the beers to the point where the observed differentiation value (14 - 16) may be attenuated. IOW with those flaws removed we might find the beers differentiable at an appreciably higher level.

I'm not sure of AJ's position, actually, because as he's said he hasn't really read the bulk of the experiments. He's been invaluable to all of us by providing background on testing methodology. I would say that he's likely rejected the null hypothesis as well based on what he's said, but trying to say anything beyond that statement isn't supported by what I've read that he's posted in this thread.
I'm definitely on board with respect to rejection of the null hypothesis, though I am now looking at it through the differentiability parameter estimates rather than consideration of p. A signal at the level Pd = 0.14 (14% of the population can distinguish the beers) doesn't exactly blast the alternative hypothesis at you, but as the null hypothesis corresponds to Pd = 0 (0% of the population can distinguish them), it seems quite unlikely that the null hypothesis applies to these beers.
 
Marshall came and spoke at my homebrew club meeting about a year ago. I wasn't able to make it to that meeting, but I can tell you that testers were instructed to clear their palate with salt-free crackers and water before and between samples.
 
I am not sure (etc.)

AJ, I get it. You don't understand how measurement figures into all this. That doesn't make you a bad person--but it does mean you're missing a basic element of research, one whose failure invalidates conclusions.

Here's a quick and dirty example. Suppose we want to measure people's math aptitudes. We ask them, as our measure, to step on a scale which measures....something.

We ask them to step on, and off, and on, and off, and it returns a consistent figure each and every time. Very consistent, very reliable. The numbers show about 167, which is their math aptitude, right?

Of course, not right. If what you are measuring is not a reasonable measure of the concept (what we call operationalization of the concept), then the figures--and the results!--are meaningless.

Why? Because the measure doesn't measure what it purports to measure. In other words, what we've been talking about all along, such as when you "qualify" tasters based on a lucky guess. Who would ever intentionally do such a thing?

************

Or try this. We're measuring weight, not math aptitude. We have a person step on, and off, and on, and off, a scale 10 times. The scale returns 167, 155, 111, 211, 106, 92, 47, 113, 197, 175.

So is that a good measure of their weight? Of course not. It's unreliable. An unreliable measure cannot be a valid measure.
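For what it's worth, the scale example can be quantified: a measure whose repeated readings have a huge spread relative to their mean is one simple signature of unreliability. A toy calculation on the numbers listed above:

```python
from statistics import mean, stdev

readings = [167, 155, 111, 211, 106, 92, 47, 113, 197, 175]

avg = mean(readings)
spread = stdev(readings)      # sample standard deviation
cv = spread / avg             # coefficient of variation

print(f"mean = {avg:.1f}, sd = {spread:.1f}, CV = {cv:.0%}")
# A coefficient of variation this large (tens of percent) means repeated measurements
# of the same person disagree badly, so no single reading can be trusted as their weight.
```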

In the triangle tests using humans as testers, we have no indication of the reliability of their abilities. In fact, there's a lot of reason to suggest they may be unreliable. And of course, if their reliability is suspect, so too are the results pertaining thereto.

*************

This is basic research and measurement. No amount of statistical handwaving can overcome that.

None of this makes you a bad person. But you cannot take measures that are unreliable, whose validity as a result is suspect, whose generalizability is uncertain at best, and draw conclusions that are meaningful. You don't obtain actionable intelligence. You keep trying, and your doggedness in that is admirable, but without quality measurement, there's nothing to say.

As I tell my students, measurement is where the rubber meets the road. If you can't measure effectively, the rest is meaningless.
 
Mongoose welcome back! I'm stuck because I see what both you and AJ are saying. But with any testing that uses human opinion as the measurement, is there a way to truly dial it in to an acceptable level? I did read the posts on panel selection, but that was a long time ago (three or four days) and it starts to get a little fuzzy after so many beers :mug:

I was forced to take stats once in pursuit of a BA in Psych and once for my MA. I made it through both, but just barely. I have enjoyed the spirited debate and still have no idea what a "p" is. LOL :ban:
 
Man, if I brewed beer according to what's "not statistically significant" in Brulosophy's tests, disregarding this and that during brewing, I know I'd be making pretty ****ty beer. Those experiments "debunk" so many things which I myself know are true that they have become like reading a comic.

At first they were interesting reads, but when one experiment after another tells you that "it doesn't matter", when you've been experimenting with the same thing yourself and gotten pretty terrible results.. well. Then the rest of the experiments get tarred with the same brush. To be honest it's pretty LOL when someone uses Brulosophy as a reference. I'm pretty sure he got some sponsors and "must" continue to try things out, but please. A mash pH test in an IPA that's also using gelatine?
 
I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback to be horribly suspect.

Not to derail, but as a BJCP judge, I'm amazed at the difference sometimes from bottle to bottle of an entry, from the first judging to a mini-BOS, or sometimes when you ask for a second bottle. With the "rough" handling that some competitions or drop-off points subject the beer to before it comes to the table, I'm suspecting more and more that a lot of homebrew competition success comes down to how well you bottle (because I know that the brown wet cardboard that arrived at the table saying it's an APA was probably fine when it left the brewer's keg). So that could just be variance there. Or crappy judges, that happens too. Or the judges got a flight of 14 and one was #2 and one was #14.
 
Wow, Smellyglove, now I get why scientists mainly publish successful experiments.

The Brulosophers did not invent the triangle test, which is clearly an accepted tool in sensory analysis. Yes, they make some compromises in design and administration that might be used to throw some question on the results, yet I'm not seeing anyone doing it better in our community. I admit I don't see much professional brewing literature, but I did read a paper today, published in a serious journal and authored by brewers from Rock Bottom, that compared four different late-hop techniques and was pretty interesting. But it turns out they brewed the four batches at different breweries, using different ingredients, and apparently amazingly different waters. One batch had 1200 ppm sulfate compared to beers below 100... And because this is professional brewing, published in a serious, presumably peer-reviewed journal, it must be well done and valid, but the homebrewers are done in by failure to randomize AAB to ABA in their sensory test?
 