Value of Brulosophy exbeeriments, others' experience, myths and beliefs

In general I think they do a pretty good job. I have seen several tests where they chose a recipe that, in my opinion, might mask the results. One was hop related (maybe late boil vs flameout or something like that), but they used a recipe that didn't really feature hops that much. I remember looking at the recipe as I read the experiment and thinking "they're not going to see a difference because this recipe would mask any difference". Then, of course, the conclusion was that it didn't matter.

After all the hoops they jumped through to try to follow something resembling scientific process, I felt like I really couldn't trust their conclusion. I sometimes wonder if they're biased towards trying to debunk things (in other words, they want to see no difference).

Overall, though, I think they're doing something that most of us don't have the time or resources to do, so I think they add significant value to the brewing community.
 
Do you care that they tell you "... indicating participants in this xBmt were unable to reliably distinguish..." when there's a very good chance that there was actually a difference detected? Serious answer, please.

That's what I'm talking about. It's those words (the standard blurb), coupled with the fact that people don't know/care about what the p-stats actually mean (and don't mean) that causes the "Brulosophy proved that doesn't matter" phenomenon. They could very easily add a single sentence to each writeup that would make it crystal clear. But they don't.



Believe me, I'm not asking for that.

another nope

also you ignored my painstaking research in showing that the situation you are describing is the exception not the rule. Most of the time when they say the panel was unable to "reliably distinguish" the outcome was right about where you would expect it to fall if every member of the panel didn't even taste the beers and guessed at random. When results have been close to reaching significance, they usually point that out and explore the possibility that there was something there in the brewer's personal evaluation, but I appreciate that they have the discipline to let the actual results dictate the outcome.

Only two outcomes are allowed: the panel was able to reliably detect a difference, or the panel was unable to reliably detect a difference. I don't think this is as complicated as everyone wants to make it out to be. The beers are different. But the panel is handicapped by opaque cups and lack of knowledge about the beer style, recipe, or tested variable. I read a lot of snarky comments about how crap these panels are when they can't detect obvious differences, but I've tried triangle testing my own split batch experiments (so without the style and variable handicaps) and still struggled.
 
When they report a p-value from a triangle test of their one batch, that's like putting lipstick on a pig. If they have a hypothesis that there is a difference between two methods, and they split a batch to test that, there's only the one sample to test that hypothesis. The triangle test is just looking at whether a group of testers can discern a difference in that one batch. Reporting the p-value is providing a gravitas it doesn't deserve. The p-value reported is not a p-value for their hypothesis of interest. It's dressing up a sample size of one. Even if they were to brew two batches because the hypothesis precludes splitting one batch, that's one sample per treatment.

Another example. Take two plants of the same species and randomly assign one to receive a fertilizer. One month later you measure the plants. Whether you measure them with a ruler or a more accurate 3D scanner, you still only have measurements on two plants, one per treatment. Sounds fancy to use the 3D scanner though.
 

Are you sure about this? I don't believe the Brulosophy gang invented the triangle test as applied to sensory evaluations of food or the use of the p value for setting the number of correct responses needed for a given sample size.
 
also you ignored my painstaking research in showing that the situation you are describing is the exception not the rule. Most of the time when they say the panel was unable to "reliably distinguish" the outcome was right about where you would expect it to fall if every member of the panel didn't even taste the beers and guessed at random.

I never said or meant to imply that it was the rule, but rather used it as an exercise in logic. I could have picked 0.20 or 0.30, or really, anything less than 0.50. That's still a large amount of potentially misleading write-ups within the groups of experiments with those P-values. That said, I do think low-ish p-values have been more common in the experiments that include a real panel and not just one guy tasting the same beers over and over. If you look back through the archive, you'll find more than a few.

My point is that the standard blurb, when applied to them, is misleading to non-statisticians. Just looking at page two of the experiments, where they were all (I think) panel experiments, there were 37 experiments whose results earned them "the blurb." Of those, 14 (38%) had P-values of less than 0.30. So, for these 38%, the chance of getting at least that many correct choices due to random chance was, in every case, 29% or less. It's likely that a large portion of those panels detected a difference. But each and every one gets the standard blurb. That's my issue, and I think it's the only one I've raised in this thread. (Based on some responses, I think some folks might think I'm basically saying "Brulosophy sucks," but that's not what I'm saying at all.)

When the p-value is say, 0.20 (or whatever), would it not be helpful to say "Results indicate that if there were no detectable difference between the beers, there was a 20% chance of "X" <fill in the blank> or more tasters identifying the beer that was different, but "Y" tasters actually did."? IMO, that would give the readers something much more tangible on which to base their impressions, and their own risk of changing (or not) their own processes based on them.

ETA: and it might even reduce the number of *^$&*#&^ "Brulosophy debunked <X>" posts.
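For anyone curious where a plain-English sentence like the one suggested above comes from numerically, here is a minimal sketch, assuming the usual triangle-test model in which each taster has a 1-in-3 chance of picking the odd beer by pure guessing; the reported p-value is then just the upper tail of a binomial distribution. The panel size and number of correct picks below are made-up numbers for illustration, not from any actual xBmt.

```python
from math import comb

def triangle_p_value(n_tasters: int, n_correct: int) -> float:
    """Chance of at least n_correct right answers out of n_tasters
    if every taster were guessing blindly (success probability 1/3)."""
    return sum(comb(n_tasters, k) * (1/3) ** k * (2/3) ** (n_tasters - k)
               for k in range(n_correct, n_tasters + 1))

# Hypothetical panel: 24 tasters, 11 of whom picked the odd beer.
n, x = 24, 11
p = triangle_p_value(n, x)
print(f"If there were no detectable difference, there was a {p:.0%} chance "
      f"of {x} or more tasters identifying the different beer by luck alone.")
```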
 
Are you sure about this? I don't believe the Brulosophy gang invented the triangle test as applied to sensory evaluations of food or the use of the p value for setting the number of correct responses needed for a given sample size.
What I am saying is that their p-value applies to their one sample batch. They are only testing one batch and don't repeat the process on other batches. Their treatment is on the batch which from some of their testing is a split sample. Usually they have two treatments. Their one batch is just one batch out of the population of experimental batches possible using their methodology of preparation. Which is like saying, "We tried this once." It's an anecdote. It would be a lot more robust if they repeated the experiment a number of times. If you were measuring a more continuous variable, say IBUs, and you brewed 30 batches, split them preboil, and threw in hops A in one and hops B in the other, that would be a paired t-test with a "reasonable" sample size.
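To show roughly what that kind of replication would look like on paper, here is a sketch of the paired comparison described above. The IBU numbers are simulated stand-ins rather than real measurements, and the hand-rolled t statistic is only there to show the mechanics a stats package would normally handle.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

# Simulated IBU readings for 30 batches, each split pre-boil and hopped two ways.
# These numbers are made up purely to show the mechanics of a paired comparison.
hops_a = [round(random.gauss(35, 3), 1) for _ in range(30)]
hops_b = [round(a + random.gauss(1.5, 2), 1) for a in hops_a]  # pretend B runs ~1.5 IBU higher

diffs = [b - a for a, b in zip(hops_a, hops_b)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(f"mean difference = {mean(diffs):.2f} IBU, paired t = {t_stat:.2f}, df = {len(diffs) - 1}")
# Compare |t| against a t table: roughly 2.05 for df = 29 at a two-sided alpha of 0.05.
```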

The researcher decides what significance level is actually significant to them. If the p-value they obtain is less than the significance level they chose, they report it as significant. The p-value is reported because the reader may not feel that the chosen significance level, suppose we pick 0.10, is stringent enough. Maybe the reader feels 0.05 or 0.01 is more appropriate. That being said, once you have a significance level and a statistical test picked out, there would be a critical value, here a discrete number which is the number of correct responses needed to arrive at a p-value less than your significance level. Technically, they are comparing their test statistic to a Chi-square distribution, which is continuous. I have not used the triangle test, but I do understand the procedure based on the link provided, and it is just an application of a Chi-square test with only two cells, correct and incorrect responses; that's what goes into the summation. I am familiar with quite a few nonparametric tests which are conducted similarly. Yes, they did not invent critical values for a given significance level.
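To make the critical-value idea concrete, here is a small sketch that finds, for a given panel size and significance level, the fewest correct picks that would be called significant. It uses the exact binomial tail rather than the Chi-square approximation described above, so it may disagree with a published table by a response here or there; treat it as an illustration, not a substitute for a published table.

```python
from math import comb

def guessing_tail(n_tasters: int, n_correct: int) -> float:
    """P(at least n_correct right out of n_tasters) when everyone guesses (p = 1/3)."""
    return sum(comb(n_tasters, k) * (1/3) ** k * (2/3) ** (n_tasters - k)
               for k in range(n_correct, n_tasters + 1))

def critical_correct(n_tasters: int, alpha: float = 0.05) -> int:
    """Fewest correct responses whose guessing-only probability is <= alpha."""
    return next(x for x in range(n_tasters + 1)
                if guessing_tail(n_tasters, x) <= alpha)

for n in (20, 25, 30, 40):
    print(f"{n} tasters: need {critical_correct(n)} correct at alpha = 0.05")
```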
 
I wholeheartedly agree on p-values and statistical buzzwords being used in a very misleading way, considering the target audience. The fact that many times they didn't reply to comments pointing out fundamental flaws in their experiments doesn't really look good in my opinion. And I'm not talking about hardcore statistics; some of their experiments are simply not showing what they set out to explore.

On a more fundamental note, I don't even agree with the approach at all. As in many other complex processes, some brewing steps are critical, and others are not critical but still useful. The latter kind may be hard to identify and verify empirically, especially without a scientific approach, and can very easily lead to the misleading conclusion that the variable under scrutiny is irrelevant. Sometimes, many small differences that cannot be appreciated by themselves can make a perceivable difference altogether.
 
it might even reduce the number of *^$&*#&^ "Brulosophy debunked <X>" posts
Doubtful. It appears to be the nature of forum discussion.

Consider accepting that there may be many more years of "Brulosophy debunked <X>" posts.

Just like there ...

... were too many years of "you'll kill half the yeast if you don't rehydrate dry yeast", and

... will be many more years of "you can't make a light colored extract based beer".

There doesn't appear to be a cure for this on the horizon. But there may be hope (and comfort) in this: RDWHAAB.

:mug:
 
I have no doubt they have discussed it. But, honest answer, if you would: If you knew nothing of statistics, and you read the words "... indicating participants in this xBmt were unable to reliably distinguish...", what would that mean to you?
I agree that their wording is very misleading to many who read it. I would really love to see them make this more accurate. I ignore the summary statement that you point out. Then their write-ups are really useful. I think about how I evaluate variables. I brew something with my normal process. A few months later, I brew it with something changed - boil intensity, trub to fermenter, or whatever. I mentally compare the new beer with the one from a few months ago. Brulosophy's approach is about a million times better. Of course I still do my own taste test to "verify".
 
At times, they might have 30 panelists and state they need some number, say it's 19, to tell the difference, in order to show there is in fact a statistical difference. Not sure what term they use. I am also not sure how they arrive at that number "19". If close to half the people in their group (say 14 in my example) can detect a difference (or are they guessing - it's not clear), then I find myself reading more on the findings. Not being fully convinced, despite what the tester states. Nearly half clearly could tell a difference, or guessed.

What is more telling to me is when the number who can tell a difference is rather low. Or the Brulosophy guys get stumped too. Also, when those who can tell the difference are then split over which one they preferred, then it's rather getting down to personal preference and taste. Depending on the variable, you might conclude that even if there is a difference in taste, who is to say it will be better tasting doing it one way or another.
 
I never said or meant to imply that it was the rule, but rather used it as an exercise in logic. I could have picked 0.20 or 0.30, or really, anything less than 0.50. That's still a large amount of potentially misleading write-ups within the groups of experiments with those P-values. That said, I do think low-ish p-values have been more common in the experiments that include a real panel and not just one guy tasting the same beers over and over. If you look back through the archive, you'll find more than a few.

My point is that the standard blurb, when applied to them, is misleading to non-statisticians. Just looking at page two of the experiments, where they were all (I think) panel experiments, there were 37 experiments whose results earned them "the blurb." Of those, 14 (38%) had P-values of less than 0.30. So, for these 38%, the chance of getting at least that many correct choices due to random chance was, in every case, 29% or less. It's likely that a large portion of those panels detected a difference. But each and every one gets the standard blurb. That's my issue, and I think it's the only one I've raised in this thread. (Based on some responses, I think some folks might think I'm basically saying "Brulosophy sucks," but that's not what I'm saying at all.)

When the p-value is say, 0.20 (or whatever), would it not be helpful to say "Results indicate that if there were no detectable difference between the beers, there was a 20% chance of "X" <fill in the blank> or more tasters identifying the beer that was different, but "Y" tasters actually did."? IMO, that would give the readers something much more tangible on which to base their impressions, and their own risk of changing (or not) their own processes based on them.

ETA: and it might even reduce the number of *^$&*#&^ "Brulosophy debunked <X>" posts.
I will preface, I have been reading your posts but not critically as I felt the p-value was not particularly relevant as I stated. Regarding your p-value of 0.20. No, you wouldn't say that. If your p-value exceeds your significance level, you simply say no significant effect was observed. You don't couch your argument around the value of the p-value. I will try to properly explain. Suppose you pick alpha=0.05 as your significance level. That means 95 times out of 100 you will see data that you feel was generated under the conditions of the null hypothesis. Sometimes that data will be close to producing a test statistic close to the critical value, sometimes not. How close or not makes no difference because sometimes it will be close. That's why you pick your significance level a priori. Five times out of a hundred, it will be so different that you think to yourself, I don't believe that's the case that it came under the conditions of the null hypothesis. It's simply reported as significant. However, what you can do is put a confidence interval on say the significant difference between two treatments. Then you can say, well I found a difference of 0.01-0.05 units. Which could be so small as to be practically useless to you. And if it is a giganormous difference feel free to say so:ghostly:. If the difference is not statistically significant, you wouldn't put a confidence interval on it because statistically the treatments are equal.

If you are really close, I wouldn't discard that line of thought. It's possible you didn't have enough power to determine a difference. You might have undersampled, maybe your samples have a lot of inherent variability.
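On the power point, here is a rough simulation of how often panels of various sizes would reach significance when a real but modest difference exists. The 50% detection rate is a made-up assumption, not data from any xBmt; the sketch is only meant to show why a small panel can easily miss a genuine difference.

```python
import random
from math import comb

def critical_correct(n_tasters: int, alpha: float = 0.05) -> int:
    """Fewest correct picks whose probability under pure guessing (p = 1/3) is <= alpha."""
    def tail(x):
        return sum(comb(n_tasters, k) * (1/3) ** k * (2/3) ** (n_tasters - k)
                   for k in range(x, n_tasters + 1))
    return next(x for x in range(n_tasters + 1) if tail(x) <= alpha)

def power(n_tasters: int, p_correct: float, trials: int = 5000) -> float:
    """Share of simulated panels that reach significance when each taster
    independently identifies the odd beer with probability p_correct."""
    need = critical_correct(n_tasters)
    hits = sum(
        sum(random.random() < p_correct for _ in range(n_tasters)) >= need
        for _ in range(trials)
    )
    return hits / trials

random.seed(42)
# Hypothetical: a real but subtle difference lets 50% of tasters pick the odd beer
# (pure guessing would give 33%). How often do panels of each size catch it?
for n in (10, 20, 30, 40):
    print(f"{n} tasters: significant in about {power(n, 0.50):.0%} of simulated panels")
```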
 
I am also not sure how they arrive at that number "19". If close to half the people in their group (say 14 in my example) can detect a difference (or are they guessing - it's not clear), then I find myself reading more on the findings. Not being fully convinced, despite what the tester states. Nearly half clearly could tell a difference, or guessed.

ASTM E1885. I have a copy but should not share based on copyright.

FWIW (nothing), I pay way more attention to any experiments where p<0.20, which is MY definition of where there MIGHT be a difference between two samples. This is based on detailed study of the ASTM and my own intuitive inputs.
 
With a lot of these tests I assume even one layperson would be able to reliably identify the difference. That's how severe a lot of the dogma is in homebrewing.
I'm not gonna go through all the Brulosophy tests, but the stuff I read about fermenting a lager at 60° when I started brewing (around 8 years ago) made me 100% sure it would create a disgusting, definitely off-flavored beer.
Brulosophy showed that wasn't necessarily the case.
I ferment lagers at their recommended temps as I believe it creates a better beer, but the tests Brulosophy and others do show the "what if".
That's progress, imo, and learning.
Absolutely doesn't mean we should ferment lagers at 70° or only do 20-minute mashes, but I like knowing more.
 
I will preface, I have been reading your posts but not critically as I felt the p-value was not particularly relevant as I stated. Regarding your p-value of 0.20. No, you wouldn't say that. If your p-value exceeds your significance level, you simply say no significant effect was observed. You don't couch your argument around the value of the p-value. I will try to properly explain. Suppose you pick alpha=0.05 as your significance level. That means 95 times out of 100 you will see data that you feel was generated under the conditions of the null hypothesis. Sometimes that data will be close to producing a test statistic close to the critical value, sometimes not. How close or not makes no difference because sometimes it will be close. That's why you pick your significance level a priori. Five times out of a hundred, it will be so different that you think to yourself, I don't believe that's the case that it came under the conditions of the null hypothesis. It's simply reported as significant. However, what you can do is put a confidence interval on say the significant difference between two treatments. Then you can say, well I found a difference of 0.01-0.05 units. Which could be so small as to be practically useless to you. And if it is a giganormous difference feel free to say so:ghostly:. If the difference is not statistically significant, you wouldn't put a confidence interval on it because statistically the treatments are equal.

If you are really close, I wouldn't discard that line of thought. It's possible you didn't have enough power to determine a difference. You might have undersampled, maybe your samples have a lot of inherent variability.

I agree with this, from the perspective of having reached the (arbitrarily chosen) significance level or not. That's binary. But that does not mean that failing to reach that significance level means that there is no difference, which is what "the standard blurb" implies. I see nothing wrong with providing, in plain english, a percentage chance that the result was due to random chance. That's what the p-value does anyway, but the plain english would be more understandable. I do understand that it's not the kind of thing that would be written in an experimental research paper, where understanding of p-values is assumed. And I'm not advocating saying "Result was not significant, but boy it was close, so go ahead and pretend it did." I'm just advocating a plain english explanation of what the result means and doesn't mean for the lay person.
 
I decided to take their word on trub not having a negative effect on flavor when dumped into the fermenter - they were 100% right on that. I've used their ale fermentation schedule that involves ramping up to 75F at the end and that went perfectly well for me too.
 
ASTM E1885. I have a copy but should not share based on copyright.
Darn it, it looks comprehensive enough to answer some of my own questions. I can mostly explain the critical value, how many they need to get correct. There's a probability distribution the test statistic is compared to. The area under the curve sums to one (think 100%). A bell curve (normal curve) is one example. The one used here is called Chi-square. There is a value on the curve beyond which the area to the end is equal to 0.05 (5%). If your test statistic exceeds this critical value, then you reject the null hypothesis.

This link shows how to calculate the critical value. If you look at the summation under Data Analysis, you can iterate the value of O (the observed) until you find the number of observed correct responses that puts you greater than the critical value from the Chi-square curve. E (the expected) is fixed at 1/3 of the total testers for correct responses and 2/3 of the total testers for incorrect responses. You have to do that [(O-E)^2]/E once for the correct responses and once for the incorrect.

The specific Chi-square curve used has one degree of freedom. Data for Chi-square analysis is typically laid out in tables, and I believe the degrees of freedom is the number of cells minus 1. These days, you feed the data in and the computer does the rest. It probably spits out the critical number of responses needed to be significant if it's in a stats package.
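Putting that recipe into code, here is a minimal sketch of the [(O-E)^2]/E sum over the correct and incorrect cells, checked against the 1-degree-of-freedom Chi-square critical value (3.841 at alpha = 0.05) while iterating the observed count upward as described. The panel size is a made-up example, and without a continuity correction the answer can land one response away from a published table.

```python
CHI_SQ_CRIT_1DF_05 = 3.841  # Chi-square critical value, 1 degree of freedom, alpha = 0.05

def chi_square_stat(n_tasters: int, n_correct: int) -> float:
    """Sum of (O - E)^2 / E over the two cells (correct, incorrect), with the
    expected counts fixed at 1/3 and 2/3 of the panel under pure guessing."""
    e_correct = n_tasters / 3
    e_incorrect = 2 * n_tasters / 3
    o_incorrect = n_tasters - n_correct
    return ((n_correct - e_correct) ** 2 / e_correct
            + (o_incorrect - e_incorrect) ** 2 / e_incorrect)

# Iterate the observed correct count upward until the statistic clears the critical value.
n = 30  # hypothetical panel size
for correct in range(n + 1):
    # only look above the chance expectation (1/3 of the panel), not below it
    if correct > n / 3 and chi_square_stat(n, correct) > CHI_SQ_CRIT_1DF_05:
        print(f"With {n} tasters, {correct} correct responses is the first count "
              f"past the critical value of {CHI_SQ_CRIT_1DF_05}.")
        break
```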
 

I agree with this, from the perspective of having reached the (arbitrarily chosen) significance level or not. That's binary. But that does not mean that failing to reach that significance level means that there is no difference, which is what "the standard blurb" implies. I see nothing wrong with providing, in plain english, a percentage chance that the result was due to random chance. That's what the p-value does anyway, but the plain english would be more understandable. I do understand that it's not the kind of thing that would be written in an experimental research paper, where understanding of p-values is assumed. And I'm not advocating saying "Result was not significant, but boy it was close, so go ahead and pretend it did." I'm just advocating a plain english explanation of what the result means and doesn't mean for the lay person.
They are absolutely not completely neutral with regards to the interpretation of results. In experiments where they find a P-value of, say, 0.06, they use these standard words: "... indicating participants in this xBmt were unable to reliably distinguish..."

A p-value of 0.06 means that if there were actually no detectable difference, there was only a 6% chance that as many (or more) tasters would have chosen the odd beer as actually did. Does that sound to you like there's very likely not a difference?

They absolutely could be completely neutral if they instead said "Results indicate that if there were no detectable difference between the beers, there was a 6% chance of "X" <fill in the blank> or more tasters identifying the beer that was different." That would be neutral and accurate.
My opinion is you should not do that because you are still only telling part of the story. "With 95% confidence, we did not see a difference between the two treatments." If I repeated the experiment 100 times, 95 times I would find no difference. Provided my statistical methodology is correct, I am not violating assumptions, etc. You aren't acknowledging the confidence level in your second quote. Statistically, you should know you will see close values sometimes but you have to have personal confidence in your confidence level. Otherwise you are not confident. So pick it but do it beforehand, a priori, so you are not tempted to do these things. If it is close, trust the test, but maybe in the future verify with a new experiment. But don't undermine your own statistical test otherwise why use it if you don't trust it?

Now one thing that does make me not trust a statistical test is an inadequate sample size.
 
Great point. By my review of the ASTM standard, I figure they should be aiming for at least 40 if not 45+ tasters per experiment for somewhat valid results. Otherwise... meh...
It’s flawed. Sure they’d like to get 40 tasters but most of us get about 3 and make no effort to blind them to the style, variable, or expected outcome. But still if you put up a post about your experience brewing and tasting a split batch with your homebrew buddies I’d read it and be interested. If I decided meh I’d probably keep that to myself though.

They make an effort to get rid of confirmation bias in their tests. That's pretty much the story. Confirmation bias seems to be a powerful bias, and when they control its ability to sway results, interesting data points that lead to questioning modern understanding of historical practices sometimes emerge. It's interesting or we'd not still be debating it.
 
This!

Someone might do a split batch to perhaps try different yeasts on a recipe, different dry hops, different ferment temps, and post findings on HBT. Or even refer to one they did in the past. Doesn't happen an awful lot, from what I have seen, but I've seen people refer to it. In general, posts that do so are appreciated, interesting to note, and filed for later personal referral.

Brulosophy do this all the time, with considerable attention to process, never mind trying to find testers, and the "adults" on HBT just want to ridicule it as (p-value) nonsense. The reality is, it's nothing more than another piece of the brewing puzzle Brulosophy are offering. At the very least, interesting and provocative. What harm does it do to think outside the box a little?

Is it just a case of equipment envy?

Hey! Teacher! Leave them kids alone.
 
I have absolutely no idea what any of their equipment looks like. It's been a while since I have even read any of their writeups. Judging from the manner in which some people speak of them, I'm sure the Brulosophy guys have their good points too. However, when all they are analyzing is a split batch, that's just a sample size of one, and to me it doesn't have much more weight than if any other HBT member did the same experiment. If what they are doing is really thought-provoking and outside-the-box comparisons, that's when you really want to have the weight of the evidence behind you. You can't do that when you have only one sample, no matter how sophisticated the portrayal of the data, or rather datum.

But no, I'm not envious of their equipment. I have seen some nice brew rigs and brew caves here on HBT. I am pretty happy and grateful for my own setup, which I mainly built myself.
(photos: brew setup, jockey box with 10 and 15 gallon kegs, and ferm chambers)
Now I will hold off on showing my keezer which is almost done but the chairs I'm building aren't there yet. Sure I'd love to have a conical but then I'd want to build a glycol chiller and one of my children would be upset this summer.
 
Brulosophy do this all the time, with considerable attention to process, never mind trying to find testers, and the "adults" on HBT just want to ridicule it as (p-value) nonsense.

Strawman argument. Nobody has said Brulosophy is nonsense, or at least I haven't. But we have levelled fair, reasoned criticisms about important, but fixable, aspects. My issue in particular would be a breeze to fix.

Is it just a case of equipment envy?

Ad hominem argument. It happens to be wrong, but it's irrelevant anyway. It would really be more productive if we could stick to debating facts.
 
I have absolutely no idea what any of their equipment looks like. It's been a while since I have even read any of their writeups. Judging from the manner in which some people speak of them, I'm sure the Brulosophy guys have their good points too. However, when all they are analyzing is a split batch, that's just a sample size of one, and to me it doesn't have much more weight than if any other HBT member did the same experiment. If what they are doing is really thought-provoking and outside-the-box comparisons, that's when you really want to have the weight of the evidence behind you. You can't do that when you have only one sample, no matter how sophisticated the portrayal of the data, or rather datum.

But no, I'm not envious of their equipment. I have seen some nice brew rigs and brew caves here on HBT. I am pretty happy and grateful for my own setup, which I mainly built myself.
(photos: brew setup, jockey box with 10 and 15 gallon kegs, and ferm chambers)
Now I will hold off on showing my keezer which is almost done but the chairs I'm building aren't there yet. Sure I'd love to have a conical but then I'd want to build a glycol chiller and one of my children would be upset this summer.
Provocative enough for you to whip out your package and show off.

Now, indeed, I have equipment envy. Where is my bag? Seems so tiny. :)
 

Very few people would envy my equipment. I've got a nice bag ;) , but beyond that...

I have brewed about 160 batches on my stove top in a 4-gallon kettle and various kitchen pots. I still bottle most batches but recently got a uKeg Go 128-oz. That's as high tech as I get here. And -- I do not mean to show off but just stating a fact -- still manage to have a drawer full of ribbons and medals. YMMV.

Cheers all.
 