Value of brulosophy exbeeriments, others experience, myths and beliefs

Some issues I have with this: the first is the last sentence--one lone participant said they experienced no difference between the beers. If so, what was he/she doing in the preference panel? It would appear that this person just guessed, got it right, and is now in the tasting panel--and yet they can't tell a difference. I know that the statistical test is determining whether a greater-than-chance number of people are picking the correct sample, but how many "correct" ones are really just guesses? It seems to me that if you have to guess, you can't tell them apart, which means... shouldn't you be recorded as a wrong guess?

The problem is that, of the people who got the correct answer, you can't actually tell who guessed; who fooled themselves into believing they tasted a difference but actually guessed (post-hoc justification); who can't accurately compare beers, and so tasted differences even between beers that were the same (noise) but guessed correctly; who actually could taste a difference; and who actually did taste a difference but was unsure and so thought they guessed (lack of trust in their senses).

The robust solution is to treat all correct tasters the same.
 
You are certainly entitled to your opinions, but I think you're kind of losing your footing here.

As I understand it, the exbeeriments are set up to ask a single question: can a panel of "random" tasters tell apart two beers that have (as best as possible) a single variable changed? The short-and-shoddy ones are a bit different, but the premise is the same. ALL that matters is whether there was a detectable difference. To me, personal preferences are something extra on the side, but not really that important, for precisely the reason you pointed out. Everyone likes what they like. It would be egregious to throw out data points post analysis, as the statistical tests account for randomness.

I appreciate the civility with which you're disagreeing with me. But I would suggest that you're missing the point I tried to make, which might simply be the result of a poorly-written explanation.

Suppose tasters can tell a difference--so what? The crucial point for me is "what's the actionable intelligence that result produces?" If it doesn't result in a process or ingredient that allows me to produce better beer, what have I learned?

This is why, at one level, I tend to ignore ingredient exbeeriments (people will like what they like, which I might not like) and focus more on the process exbeeriments. If such a process exbeeriment shows no significant result, suggesting the process variable isn't producing a result most people can discern, then I tend to pay more attention.

Even the short-and-shoddy exbeeriment doesn't give me a lot of confidence in the results. The difference is only four tasters (6-2) and four others didn't have a preference. Pretty slim evidence to me. It may be more vital to you, I don't know.

This is why, unless the results are overwhelmingly one-sided for something that came in as significant, I tend to focus more on the results that say "no difference." And yes, I know I have no way to quantify Type II error when I say that.

******************

As long as I'm at it, here's another area I tend to have issues with.

There's an element of triangle testing that to me doesn't truly evaluate preference. I'd like to know whether people could differentiate between the samples more than once. The panels are said to be able to reliably distinguish between the beers, but that statement really can't be made.

Reliability means a measure is consistent and repeatable. We don't know that. If you gave me the same triangle test for 6 days straight, would I be able to pick out the odd-one-out each time? If I could, my results would be reliable.

But what if I had to guess? By chance I'd get 2 correct.

Now, imagine I'm part of one of these panels. Am I just getting lucky in guessing? And if so, is that reliability? No, it's not. It's just luck.
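A quick back-of-the-envelope check of that arithmetic, purely illustrative and using nothing but the 1-in-3 odds of a triangle test (none of these numbers come from an actual panel):

```python
from math import comb

p_guess = 1 / 3   # chance of picking the odd beer out by blind guessing
days = 6          # take the same triangle test six days in a row

# Expected number of correct picks for someone who is purely guessing
print("expected correct picks:", days * p_guess)      # 2.0

# Probability a guesser picks the odd one out all six times
print("all six correct:", p_guess ** days)            # roughly 0.0014

# Full distribution of correct picks over the six repeats
for k in range(days + 1):
    prob = comb(days, k) * p_guess**k * (1 - p_guess)**(days - k)
    print(f"P({k} correct) = {prob:.3f}")
```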

So when I see results from the short-and-shoddy exbeeriment, I see this:

6 preferred the traditional
2 preferred the short-and-shoddy
4 had no preference
1 couldn't tell a difference.

The evidence here is thin, very thin.
 
By the way, my take-away from the Brulosophy experiments as a whole is that pretty often in brewing "good enough" is good enough. Particularly with mash parameters, fermentation parameters, aeration (hot and cold) and sanitation.

The reason for keeping good practices isn't that if you take one variable just out of bounds then the beer will be ruined, but that if enough variables start to run up against the boundaries, then things can go wrong fast.

Of course, some beers are more sensitive to certain parameters than others.
 
The problem is that, of the people who got the correct answer, you can't actually tell who guessed; who fooled themselves into believing they tasted a difference but actually guessed (post-hoc justification); who can't accurately compare beers, and so tasted differences even between beers that were the same (noise) but guessed correctly; who actually could taste a difference; and who actually did taste a difference but was unsure and so thought they guessed (lack of trust in their senses).

The robust solution is to treat all correct tasters the same.

I'm still on the issue I noted just above for Isomerization. OK, you bring in all correct tasters, including those who couldn't really tell.

And now we're going to ask them for preference, when they couldn't tell a difference in the first place. That's the part where I have difficulty determining the actionable intelligence. In other words, what am I going to do differently now that I have this result, if anything, knowing that the result includes information from tasters who could not tell the difference between the beers in the first place?
 
I appreciate the civility with which you're disagreeing with me. But I would suggest that you're missing the point I tried to make, which might simply be the result of a poorly-written explanation.

Suppose tasters can tell a difference--so what? The crucial point for me is "what's the actionable intelligence that result produces?" If it doesn't result in a process or ingredient that allows me to produce better beer, what have I learned?

This is why, at one level, I tend to ignore ingredient exbeeriments (people will like what they like, which I might not like) and focus more on the process exbeeriments. If such a process exbeeriment shows no significant result, suggesting the process variable isn't producing a result most people can discern, then I tend to pay more attention.

Even the short-and-shoddy exbeeriment doesn't give me a lot of confidence in the results. The difference is only four tasters (6-2) and four others didn't have a preference. Pretty slim evidence to me. It may be more vital to you, I don't know.

This is why, unless the results are overwhelmingly one-sided for something that came in as significant, I tend to focus more on the results that say "no difference." And yes, I know I have no way to quantify Type II error when I say that.

******************

As long as I'm at it, here's another area I tend to have issues with.

There's an element of triangle testing that to me doesn't truly evaluate preference. I'd like to know whether people could differentiate between the samples more than once. The panels are said to be able to reliably distinguish between the beers, but that statement really can't be made.

Reliability means a measure is consistent and repeatable. We don't know that. If you gave me the same triangle test for 6 days straight, would I be able to pick out the odd-one-out each time? If I could, my results would be reliable.

But what if I had to guess? By chance I'd get 2 correct.

Now, imagine I'm part of one of these panels. Am I just getting lucky in guessing? And if so, is that reliability? No, it's not. It's just luck.

So when I see results from the short-and-shoddy exbeeriment, I see this:

6 preferred the traditional
2 preferred the short-and-shoddy
4 had no preference
1 couldn't tell a difference.

The evidence here is thin, very thin.

Good, I'm glad my post didn't come across as a personal attack, as that definitely wasn't the intention.

RE: luck in guessing, that's the reason for including large numbers (within reason, groups that are too large can produce artificial significance) and then obtaining a p-value. The odds that everyone is guessing (and getting it correct) become very high once you get a p-value below the significance level. For example, if two people guess correctly, that's 1/3 * 1/3 = 1/9, three people becomes 1/27, etc. This is obviously oversimplified, but that's why so many of the exbeeriments come back as non-significant (imo).
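For what it's worth, here is that 1/9 and 1/27 arithmetic written out, along with the number of correct picks a panel would need for p < 0.05 at a few panel sizes. This is my own sketch of a plain binomial model with a 1/3 guessing rate; it may not match exactly how Brulosophy computes their thresholds:

```python
from math import comb

def p_at_least(m, n, p=1/3):
    """Chance of m or more correct picks out of n if every taster is guessing."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

# The 1/9, 1/27 arithmetic: probability that ALL k tasters guess correctly
for k in (2, 3, 5):
    print(f"{k} tasters all correct by chance: {(1/3)**k:.4f}")

# Smallest number of correct picks needed to reach p < 0.05 at a few panel sizes
for n in (10, 20, 30, 40):
    needed = next(m for m in range(n + 1) if p_at_least(m, n) < 0.05)
    print(f"panel of {n}: {needed} correct picks needed for p < 0.05")
```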

We certainly agree on the ingredient ones (I think most feel that way actually). That's also why I don't understand your issue with this specific exbeeriment. You want personal preference, yet we all know everyone's will be different. The important piece is the total number of people who correctly picked the odd beer out. Going past that is simply anecdotal, maybe they should stop including that part?
 
There's always an issue of asking for a subjective opinion like preference in these kinds of things because taste and expectation play such a big role. Some responses will be dominated by taste, some by expectation for the style, some by comparison to a favored commercial example. I don't think having guessers involved is as big a problem as the subjectivity, to be honest, and even from the non-guessers there's a good chance the preference is actually "both are good, just different".


OTOH, if you were trying to clone a particular beer, and asking how close you got, or you were trying to fit a BJCP style, then the subjectivity is probably lower, and having the guessers involved does actually make the results worse.

A related fun game would be to ask several panels for a triangle test and a preference, but tell them different things about the beer beforehand, like which style it was supposed to be, or what you had varied.
 
In the end we're applying scientific techniques to something that's largely subjective - flavor.



It's hard to make solid science out of that to begin with, unless you start trying to measure things you can literally measure with instrumentation, like IBUs or something like that. Even then, you're left with NUMBERS, and you don't necessarily know how one person will enjoy or hate that particular number, or even what their threshold is for tasting it.
 
From the Brulosophy article today: "Out of the 12 of 21 blind tasters who were able to distinguish a beer fermented with Saflager W-34/70 at 60˚F/16˚C from the same beer fermented at 82˚F/28˚C, 7 selected the warm ferment beer as their preferred, 2 chose the cool ferment sample, 2 felt there was a difference but had no preference, and 1 thought there was no difference. This doesn’t mean the warm ferment lager was necessarily better, just that of the participants who were correct, a majority liked it more than the cool fermented sample."

I feel like this paragraph helps explain my position. The majority of people enjoyed the warm fermented lager vs. the cold (or "properly") fermented lager. You know what this means to me? These people don't like lagers, lol. What it really means is that the exbeeriment returned a significant result: the beers were different.
 
From the Brulosophy article today: "Out of the 12 of 21 blind tasters who were able to distinguish a beer fermented with Saflager W-34/70 at 60˚F/16˚C from the same beer fermented at 82˚F/28˚C, 7 selected the warm ferment beer as their preferred, 2 chose the cool ferment sample, 2 felt there was a difference but had no preference, and 1 thought there was no difference. This doesn’t mean the warm ferment lager was necessarily better, just that of the participants who were correct, a majority liked it more than the cool fermented sample."

I feel like this paragraph helps explain my position. The majority of people enjoyed the warm fermented lager vs. the cold (or "properly") fermented lager. You know what this means to me? These people don't like lagers, lol. What it really means is that the exbeeriment returned a significant result: the beers were different.

What the text there doesn't say is whether they were told that the beer was intended to be a lager before they gave their preference. That might have changed the results a lot - particularly if they are people that don't usually go for lager!
 
Here's another one: "A panel of 37 people with varying levels experience participated in this xBmt. Each blind taster was served 2 samples of the no-boil Berliner Weisse and 1 sample of the boiled Berliner Weisse in differently colored opaque cups then instructed to select the unique sample. At this sample size, 18 tasters (p<0.05) would have had to accurately select the unique sample to achieve statistical significance. Ultimately, 31 tasters (p=0.0000000004) chose the different beer, suggesting participants were able to reliably distinguish the boiled Berliner Weisse from the no-boil sample.

The 31 participants who correctly selected the unique sample in the triangle test were then instructed to complete a brief preference survey comparing only the two different samples, all still blind to the variable. In the end, 15 tasters reported preferring the boiled sample, 13 said they liked the no-boil version better, and 3 people had no preference despite noting a difference between the beers."

Preference was almost 50:50! And it looks like no one (admitted to at least) guessed...
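Those quoted numbers can be sanity-checked against a simple binomial model with a 1/3 guessing rate. This is a sketch only; I'm assuming a one-sided exact binomial calculation, which may not be precisely the method Brulosophy uses:

```python
from math import comb

def p_at_least(m, n, p=1/3):
    """One-sided p-value: chance of m or more correct picks if all n tasters guess."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

# Berliner Weisse xBmt: panel of 37, 18 needed for p < 0.05, 31 observed
needed = next(m for m in range(38) if p_at_least(m, 37) < 0.05)
print("correct picks needed:", needed)               # should come out to the quoted 18
print("p-value for 31 of 37:", p_at_least(31, 37))   # on the order of the quoted 4e-10
```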

What the text there doesn't say is whether they were told that the beer was intended to be a lager before they gave their preference. That might have changed the results a lot - particularly if they are people that don't usually go for lager!

Absolutely, but I bet they still would have found them different :)
 
By the way, my take-away from the Brulosophy experiments as a whole is that pretty often in brewing "good enough" is good enough. Particularly with mash parameters, fermentation parameters, aeration (hot and cold) and sanitation. ...
+1, which leads us to this: that RECIPE DESIGN is far and away the most important factor for beer success. Specifically,
1) the selection and % of non-base-malts, and
2) yeast selection.

But xbmts typically focus more on process, not recipe (despite recipe being much more important!!).
 
I think they are pretty good about keeping interpretation of the triangle test clear: does the change result in a beer that can be differentiated by average beer drinkers without relying on visual cues? This is useful information, as it allows a brewer to make a risk-based assessment of changing a process. If risk to the product is high and the savings are marginal, don't do it. If risk is low and the savings in time or materials are significant, it might be worth a try.

I saved 10-15 minutes of my brewday based on their boiling-with-the-lid-on experiment. I now leave the lid on while bringing the kettle to a boil and it gets there much faster. The experiment showed the change is not likely to impact the final product, the savings looked potentially meaningful, and it has changed my brewday.
 
+1, which leads us to this: that RECIPE DESIGN is far and away the most important factor for beer success. Specifically,
1) the selection and % of non-base-malts, and
2) yeast selection.

But xbmts typically focus more on process, not recipe (despite recipe being much more important!!).

Haha, wow, no, I really disagree with this. We had a brew club experiment where we all brewed the same recipe, one of the APAs from BCS, then brought them to a meeting, and wow, these beers were all over the place. Process is huge; it's why so many commercial brewers really don't mind sharing recipes.
 
Good, I'm glad my post didn't come across as a personal attack, as that definitely wasn't the intention.

RE: luck in guessing, that's the reason for including large numbers (within reason, groups that are too large can produce artificial significance) and then obtaining a p-value. The odds that everyone is guessing (and getting it correct) become very high once you get a p-value below the significance level. For example, if two people guess correctly, that's 1/3 * 1/3 = 1/9, three people becomes 1/27, etc. This is obviously oversimplified, but that's why so many of the exbeeriments come back as non-significant (imo).

Oh, I understand this stuff pretty well--I teach it! In fact, I used the exbeeriment comparing Maris Otter and 2-Row in my class in April, to help the students understand the difference between statistical significance and substantive importance. Just because something's significant doesn't mean it's important.

I also use these to demonstrate causality and the search for alternative explanations for results. Sometimes the students pay better attention to the beer examples. :)


We certainly agree on the ingredient ones (I think most feel that way actually). That's also why I don't understand your issue with this specific exbeeriment. You want personal preference, yet we all know everyone's will be different. The important piece is the total number of people who correctly picked the odd beer out. Going past that is simply anecdotal, maybe they should stop including that part?

I'm afraid I disagree--what would be the point if all you could do is say the beers are different? Unless you have a conclusion that points to a better process or ingredient, it's for nothing.

It's like saying (as I would in my class) that there's a statistically-significant difference between men and women with regard to, say, satisfaction with their marriage--and saying nothing more. Without knowing which direction the results point, there's nothing useful there.

I have my own issues with the panels as they're used--we don't know about the reliability of the panel (it's a one-shot tasting!), we don't know who they represent, we don't know their palates, we don't know what they were drinking or eating just prior to tasting.

This is why I'm more interested in the nonsignificant results--people can't tell a difference, which suggests there isn't a meaningful difference between the beers. But even then, testing by me or you is going to be needed, at our local level, to see if it matters to us.
 
Haha, wow, no, I really disagree with this. We had a brew club experiment where we all brewed the same recipe, one of the APAs from BCS, then brought them to a meeting, and wow, these beers were all over the place. Process is huge; it's why so many commercial brewers really don't mind sharing recipes.
ha ha that's funny stuff. But seriously, the xbmts are showing us that as long as you follow good basic practices, many of these details don't matter much. Apparently your brew club ... umm ... [stepping away from the keyboard...]

And don't forget the vast majority of commercial brewers will NOT share their recipes. Or at least their *actual* recipes. ;)
 
Preference was almost 50:50!

Bingo!

In my opinion, too many people are using Xbmt results to try to steer them towards making "better" beer.

That's not the intent of triangle tests.

There's still a taste preference that shouldn't be overlooked, even if an Xbmt result is statistically significant for a perceived difference.
 
Oh, I understand this stuff pretty well--I teach it! In fact, I used the exbeeriment comparing Maris Otter and 2-Row in my class in April, to help the students understand the difference between statistical significance and substantive importance. Just because something's significant doesn't mean it's important.

I also use these to demonstrate causality and the search for alternative explanations for results. Sometimes the students pay better attention to the beer examples. :)




I'm afraid I disagree--what would be the point if all you could do is say the beers are different? Unless you have a conclusion that points to a better process or ingredient, it's for nothing.

It's like saying (as I would in my class) that there's a statistically-significant difference between men and women with regard to, say, satisfaction with their marriage--and saying nothing more. Without knowing which direction the results point, there's nothing useful there.

I have my own issues with the panels as they're used--we don't know about the reliability of the panel (it's a one-shot tasting!), we don't know who they represent, we don't know their palates, we don't know what they were drinking or eating just prior to tasting.

This is why I'm more interested in the nonsignificant results--people can't tell a difference, which suggests there isn't a meaningful difference between the beers. But even then, testing by me or you is going to be needed, at our local level, to see if it matters to us.

We're certainly in agreement here, I think all the exbeeriments are worthy of consideration, whether significance is "achieved" or not.

Would it be wonderful to have a consistent panel of 50 tasters ranging in skill level that get to sample the beer 3x a day for a week? Well sure! That's obviously a stretch, but I don't think it's fair to criticize their process in this manner. You might disagree.

Correct me here if I'm wrong, but it sounds like you take issue with the exbeeriments that achieve significance, but not those that don't? That seems awfully selective if so. Please, point me to a beer blog that publishes twice a week like clockwork and has more scientific rigor than this one.
 
We're certainly in agreement here, I think all the exbeeriments are worthy of consideration, whether significance is "achieved" or not.

OK. They're all worthy of consideration; the question is what you do once you consider them.


Would it be wonderful to have a consistent panel of 50 tasters ranging in skill level that get to sample the beer 3x a day for a week? Well sure! That's obviously a stretch, but I don't think it's fair to criticize their process in this manner. You might disagree.

Why not? I don't doubt their sincerity, and I appreciate the effort in trying to shed light on home brewing. But that sounds like grading someone on effort, not what they produce, and I never, not ever, give students credit for effort.

The fact we like the Brulosophy folks and appreciate their effort should not stand in for a critical eye on just what the results tell us.

Correct me here if I'm wrong, but it sounds like you take issue with the exbeeriments that achieve significance, but not those that don't?

I'm not sure that take issue is the right term. I'm trying to decide what, if any, actionable intelligence comes out of any of this. If that's not what you're interested in, that's fine, I'm all for people getting out of this whatever makes them happy. But we should not draw certain conclusions given the difficulties inherent in the approach.

I'll try to say it differently, perhaps I wasn't as clear as I could be before. There are a host of possible alternative explanations for the significant results--and in the end, if we can't have confidence that those alternative explanations have been eliminated, then we--perhaps just me--don't have much confidence in the results. Significance, by itself, tells us very little. As I noted above, it could be panel composition, what people did before they tasted the beers, what their palates are like, what the panel generalizes to, whether the tasters like the beer, etc. etc. etc. as possible reasons for the outcomes. If the results are overwhelming, that's different, but almost none of them are.

The null hypothesis here is that there is no difference between batches. Non-significant results mean we don't reject that hypothesis, leaving us to (lightly) conclude that perhaps there's no difference. The concerns above about panel composition and so on are less important--differences in the panel aren't going to account for not finding a significant difference, if that makes sense. So the non-significant results are a little more believable.

Further, even when people say they can perceive a difference and even when one is preferred over the other, we just don't know what that means.


That seems awfully selective if so.

I don't think so. I think I'm applying statistical, measurement, and scientific principles to puzzle this out. That's what I do. It may appear harsh, but as I've always said, this is not an indictment of Marshall or the Brulosophy people. You do what you can with what you have.

Please, point me to a beer blog that publishes twice a week like clockwork and has more scientific rigor than this one.

I cannot, but so what? If what you want is for me to say this is the best there is "out there," well, it would appear so. That's not the same as saying it passes all the demands of high-level, believable research, which it does not.
 
RE: luck in guessing, that's the reason for including large numbers (within reason, groups that are too large can produce artificial significance) and then obtaining a p-value. The odds that everyone is guessing (and getting it correct) become very high once you get a p-value below the significance level. For example, if two people guess correctly, that's 1/3 * 1/3 = 1/9, three people becomes 1/27, etc. This is obviously oversimplified, but that's why so many of the exbeeriments come back as non-significant (imo).

Warning: pedantic stats instructor correction ahead.

You are stating the logic of the hypothesis test backwards. The null hypothesis supposes that everyone is guessing with a probability of getting the correct answer of 1/3 (see also my earlier response for another justification based on randomized cups). A small p-value/significant result makes no statement about the odds/probability/likelihood that people are guessing. Rather the opposite: if it were true that people are guessing, the probability of seeing a result this large or larger is equal to the p-value.

If this probability is very small, we would then be skeptical of the original claim: that all subjects are guessing. The p-value, however, can't be called a probability that subjects are not guessing, as it is formed by supposing that they are. The logic may seem a little twisted, but I like to think of it like a proof by contradiction or evaluating the consequences of an argument. We suppose something, see where it leads, and if the result seems impossible or unlikely, we go back and challenge our original assumptions.
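Here's a small Monte Carlo sketch of exactly that logic, using the 12-of-21 lager result quoted earlier in the thread as the example (my own illustration, assuming the usual 1/3 guessing rate): suppose everyone is guessing, simulate a lot of panels, and see how often a result at least that extreme shows up.

```python
import random

def simulated_p_value(n_tasters=21, observed=12, trials=200_000, p_guess=1/3):
    """Estimate P(at least `observed` correct | every taster guesses) by simulation."""
    hits = 0
    for _ in range(trials):
        correct = sum(random.random() < p_guess for _ in range(n_tasters))
        if correct >= observed:
            hits += 1
    return hits / trials

random.seed(42)
# If everyone were guessing, a 12-of-21 result (or better) turns up only rarely --
# around 2% of the time -- which is what the reported p-value is expressing.
print(simulated_p_value())
```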
 
Isn't this the basis of the short and shoddy series of experiments? At least in that one they showed the instrument was able to distinguish a beer made with a starter + 60 min mash and boil + active temperature-controlled fermentation from a beer made with one smack pack + 30 min mash and boil + ambient-temperature fermentation.

I think this got to one of mongoose's main critiques. It seemed that a lot of the individual experiments didn't show a significant change. However, when you compare a beer made with 4 "generally accepted best practices" vs a beer made with 4 "shortcuts", then it *is* different enough that tasters can perceive it.

I.e. one critique is that individual process steps may have effects that are below the taster's detection threshold, but if brewers start taking that as gospel and deciding they can make the "short & shoddy" beer without having an impact on their brew, it's incorrect. That we shouldn't look at the result of one experiment and use it to determine that brewer's best practice is incorrect just because no difference is observed.

OK. They're all worthy of consideration; the question is what you do once you consider them.

I think I get at the heart of what you're saying in my above paragraph. The experiments are all interesting, but as you initially said, you saw brewers in the comments saying "oh well, I guess I can stop doing process X now!" They were making changes based on information that was FAR too incomplete to justify a change if they were happy with their beer.

I'll try to say it differently, perhaps I wasn't as clear as I could be before. There are a host of possible alternative explanations for the significant results--and in the end, if we can't have confidence that those alternative explanations have been eliminated, then we--perhaps just me--don't have much confidence in the results. Significance, by itself, tells us very little. As I noted above, it could be panel composition, what people did before they tasted the beers, what their palates are like, what the panel generalizes to, whether the tasters like the beer, etc. etc. etc. as possible reasons for the outcomes. If the results are overwhelming, that's different, but almost none of them are.

This is where I think you go too far. I.e. the short & shoddy experiment was DEFINITELY significant. However, you said the 6-2 split in the preference test wasn't enough to justify the idea that the 4 "generally accepted best practices" were better than the "shortcuts", and in that case I disagree.

You discount things like preference TOO much. Yes, while I'll agree that we can't give preference too much weight, we shouldn't ignore it either. Human tasters (especially inexperienced and non-trained tasters) are an imprecise tool. Often they may not fully understand the root of their preferences. But that doesn't mean their preferences don't have weight.

Here's why: we're making beer for human consumption. These are people who like and drink beer, so it stands to reason that when we're talking about process, they're more often than not going to prefer the objectively better beer. I.e. their preferences give us a reasonable idea which processes are more likely to help us brew "better" beer. The vast majority of tasters don't like off flavors (hence why they're called off flavors).

One could reasonably suggest that the reason the results were 6-2 in favor of good practices was that the short and shoddy beer had some objective flaws. Is it close? Yes, because we're talking about a brewer with a lot of experience and who probably has his other processes (i.e. sanitation, oxidation, etc) really well tuned. So even if he took the short and shoddy approach, it was probably still not "bad" beer. But 6-2 may be enough to call good processes objectively better.

Sometimes that leads down wrong paths, yes. I.e. the W-34/70 fermented cold vs warm might suggest that the tasters prefer esters to clean lager styles, not that W-34/70 will produce a great clean lager at 82F. And so higher sample sizes and more replication are important, along with a critical eye at *interpreting* the results in context.

But when you impugn the concept of tasters based upon mere questions of panel construction, what they did before the tasting, etc., I think you're basically saying that you throw out the human tasting aspect of it entirely as a guide. But there's not necessarily evidence that drinking an IPA an hour before tasting the best practices vs the short & shoddy beer will affect whether or not you can perceive that one is cleaner than another. No taster is ever a fully pristine slate. But human tasters are still able to do a LOT of things better than running beer through a spectrometer to determine its composition. Humans can tell you whether it tastes good to a human palate. And that is important data, even if it's somewhat noisy.
 
Warning: pedantic stats instructor correction ahead.

You are stating the logic of the hypothesis test backwards. The null hypothesis supposes that everyone is guessing with a probability of getting the correct answer of 1/3 (see also my earlier response for another justification based on randomized cups). A small p-value/significant result makes no statement about the odds/probability/likelihood that people are guessing. Rather the opposite: if it were true that people are guessing, the probability of seeing a result this large or larger is equal to the p-value.

If this probability is very small, we would then be skeptical of the original claim: that all subjects are guessing. The p-value, however, can't be called a probability that subjects are not guessing as it is formed by supposing that they are. The logic may seem a little twisted, but I like to think of it like a proof by contradiction or evaluating the consequences of an argument. We suppose something, see where it leads, and if the results seems impossible or unlikely, we go back and challenge our original assumptions.

Thank you for the nice (correct) explanation. Funny how much trouble is had in understanding p-values and what they really mean!
 
Thank you for the nice (correct) explanation. Funny how much trouble is had in understanding p-values and what they really mean!

For sure. I've made these kinds of mistakes more times than I care to admit. Without going down the rabbit hole of statistical schools of thought, I will say one of the best criticisms of the use of p-values, in my opinion, is that they are often exactly the opposite of what people really want: the probability that the hypothesis is true, given the data. As noted, the p-value assumes the hypothesis and gives the probability of the data.
 
I live in a remote area where supplies and homebrewers are extremely scarce. Brulosophy has taught me more about brewing beer than anyone else has. My beer has never been easier to make and has never tasted better. I don't know where I would be without Marshall and his crew.
 
I think this got to one of mongoose's main critiques. It seemed that a lot of the individual experiments didn't show a significant change. However, when you compare a beer made with 4 "generally accepted best practices" vs a beer made with 4 "shortcuts", then it *is* different enough that tasters can perceive it.

I.e. one critique is that individual process steps may have effects that are below the taster's detection threshold, but if brewers start taking that as gospel and deciding they can make the "short & shoddy" beer without having an impact on their brew, it's incorrect. That we shouldn't look at the result of one experiment and use it to determine that brewer's best practice is incorrect just because no difference is observed.



I think I get at the heart of what you're saying in my above paragraph. The experiments are all interesting, but as you initially said, you saw brewers in the comments saying "oh well, I guess I can stop doing process X now!" They were making changes based on information that was FAR too incomplete to justify a change if they were happy with their beer.



This is where I think you go too far. I.e. the short & shoddy experiment was DEFINITELY significant. However, you said the 6-2 split in the preference test wasn't enough to justify the idea that the 4 "generally accepted best practices" were better than the "shortcuts", and in that case I disagree.

You discount things like preference TOO much. Yes, while I'll agree that we can't give preference too much weight, we shouldn't ignore it either. Human tasters (especially inexperienced and non-trained tasters) are an imprecise tool. Often they may not fully understand the root of their preferences. But that doesn't mean their preferences don't have weight.

Here's why: we're making beer for human consumption. These are people who like and drink beer, so it stands to reason that when we're talking about process, they're more often than not going to prefer the objectively better beer. I.e. their preferences give us a reasonable idea which processes are more likely to help us brew "better" beer. The vast majority of tasters don't like off flavors (hence why they're called off flavors).

One could reasonably suggest that the reason the results were 6-2 in favor of good practices was that the short and shoddy beer had some objective flaws. Is it close? Yes, because we're talking about a brewer with a lot of experience and who probably has his other processes (i.e. sanitation, oxidation, etc) really well tuned. So even if he took the short and shoddy approach, it was probably still not "bad" beer. But 6-2 may be enough to call good processes objectively better.

Sometimes that leads down wrong paths, yes. I.e. the W-34/70 fermented cold vs warm might suggest that the tasters prefer esters to clean lager styles, not that W-34/70 will produce a great clean lager at 82F. And so higher sample sizes and more replication are important, along with a critical eye at *interpreting* the results in context.

But when you impugn the concept of tasters based upon mere questions of panel construction, what they did before the tasting, etc., I think you're basically saying that you throw out the human tasting aspect of it entirely as a guide. But there's not necessarily evidence that drinking an IPA an hour before tasting the best practices vs the short & shoddy beer will affect whether or not you can perceive that one is cleaner than another. No taster is ever a fully pristine slate. But human tasters are still able to do a LOT of things better than running beer through a spectrometer to determine its composition. Humans can tell you whether it tastes good to a human palate. And that is important data, even if it's somewhat noisy.

There are certain foods that accentuate or mask certain flavors; that's not really something that can be argued. Certain foods can make you perceive more sweetness, while others can completely mask flavors. If one person comes straight from dinner and tests samples and another made sure they cleaned their palate, the results are compromised. I think most people who have a scientific background see there's not nearly enough control in these experiments to be able to take away anything other than maybes.

And human tasters can't really tell you anything about what a beer tastes like to your palate. But actual real data can tell you what effects changes in the process have on the beer. I don't care what can be perceived by a group of people; I care whether I can perceive it. These experiments tell me nothing about changes to the beer and whether I could perceive them.
 
We're certainly in agreement here, I think all the exbeeriments are worthy of consideration, whether significance is "achieved" or not.

Would it be wonderful to have a consistent panel of 50 tasters ranging in skill level that get to sample the beer 3x a day for a week? Well sure! That's obviously a stretch, but I don't think it's fair to criticize their process in this manner. You might disagree.

Correct me here if I'm wrong, but it sounds like you take issue with the exbeeriments that achieve significance, but not those that don't? That seems awfully selective if so. Please, point me to a beer blog that publishes twice a week like clockwork and has more scientific rigor than this one.

If I run an experiment and don't measure anything at all, then run a second experiment and only measure temperature, is the second experiment valid because it's more scientifically rigorous? Of course not, it's still a bad experiment. Running an experiment once, poorly controlling the variables, not testing results outside of poorly controlled test groups--none of this can really be called scientific rigor.

For some reason people are attracted by the "science" of these experiments and passionately defend the scientific validity of them. But as someone who gets paid to do research and perform experiments, these types of methods would tell me nothing about any work I was doing if this is how I approached my research. I'll probably get pushback for saying this, but if someone is going to try and present work as having scientific rigor, it should stand up to the same scrutiny and be treated with the same rigor actual research is, and this work doesn't hold up to that.

Another poster pointed out what these videos are great for, learning how to brew.
 
I think this got to one of mongoose's main critiques. It seemed that a lot of the individual experiments didn't show a significant change.

No... the data showed that the number of tasters correctly picking the odd one out did not rise to the number necessary to denote statistical significance at the .05 level.



However, when you compare a beer made with 4 "generally accepted best practices" vs a beer made with 4 "shortcuts", then it *is* different enough that tasters can perceive it.

When you say "tasters" you mean just those that could, and include those who were guessing, correct? One of the things I look at with results like these is not only how many supposedly could discern a difference, but also how many were unable to do so.


I.e. one critique is that individual process steps may have effects that are below the taster's detection threshold, but if brewers start taking that as gospel and deciding they can make the "short & shoddy" beer without having an impact on their brew, it's incorrect. That we shouldn't look at the result of one experiment and use it to determine that brewer's best practice is incorrect just because no difference is observed.

This is why I wonder about the cumulative effect of small differences that each are below the level of perception but when added together rise to a level that can be perceived.


I think I get at the heart of what you're saying in my above paragraph. The experiments are all interesting, but as you initially said you saw brewers in the comments saying "oh well I guess I can stop doing process X now!" they were making changes based on information that was FAR too incomplete to justify a change if they were happy with their beer.

Actually I think it was someone else who said that.


This is where I think you go too far. I.e. the short & shoddy experiment, it was DEFINITELY significant. However, a 6-2 split between the preference test you said wasn't enough to justify that perhaps the 4 "generally accepted best practices" were better than the "shortcuts", and in that case I disagree.

Sure, there's a small difference. But one thing I do is also look at those who could not discern a difference. The results were 6-2-4. Not quite as rock solid as 6-2 appears.

One way I suggest people think about these kinds of results is to assign a dollar value to them. In other words, would you bet $1000 that the results are correct? $100? $10? $5? While not "significance," such a thought experiment helps us think about how confident we are in the results. I'm at about $20 on this one. I surely would not bet $1000 on the accuracy of the results.

You discount things like preference TOO much. Yes, while I'll agree that we can't give preference too much weight, we shouldn't ignore it either. Human tasters (especially inexperienced and non-trained tasters) are an imprecise tool. Often they may not fully understand the root of their preferences. But that doesn't mean their preferences don't have weight.

Well, we'll have to disagree on that. I'm not sure what preference means. Only if there's an overwhelming majority on one side or another will I be more confident in the conclusion. Say, if 15 people could discern a difference and the preferences were split 14-1. That seems fairly convincing. A split of 9-6? Not so much. But that's me, you may think differently.
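One crude way to put numbers on that intuition is a two-sided sign test on the preference split, ignoring the no-preference votes. This is just my own sketch (it treats each stated preference as a fair coin flip under the assumption of no real preference), not anything Brulosophy reports:

```python
from math import comb

def sign_test_p(prefer_a, prefer_b):
    """Two-sided p-value for a preference split, with no-preference votes dropped.

    Null hypothesis: each taster who states a preference is equally likely
    to pick either beer (a fair coin flip)."""
    n = prefer_a + prefer_b
    extreme = max(prefer_a, prefer_b)
    one_sided = sum(comb(n, k) for k in range(extreme, n + 1)) / 2**n
    return min(1.0, 2 * one_sided)

print(sign_test_p(14, 1))  # 14-1: very hard to explain as chance
print(sign_test_p(9, 6))   # 9-6: entirely consistent with a coin flip
print(sign_test_p(6, 2))   # 6-2 from the short & shoddy xBmt: far from conclusive
```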

Here's why: we're making beer for human consumption. These are people who like and drink beer, so it stands to reason that when we're talking about process, they're more often than not going to prefer the objectively better beer. I.e. their preferences give us a reasonable idea which processes are more likely to help us brew "better" beer. The vast majority of tasters don't like off flavors (hence why they're called off flavors).

You lost me at "objectively better beer." Yes, if the preferences align such that a large majority (14-1, say) agree one is better than the other, it seems also likely that I'd agree with them if sampling the same beer. But when it comes out as 6-2 with 4 expressing no preference, now I'm not so sure.

One could reasonably suggest that the reason the results were 6-2 in favor of good practices was that the short and shoddy beer had some objective flaws. Is it close? Yes, because we're talking about a brewer with a lot of experience and who probably has his other processes (i.e. sanitation, oxidation, etc) really well tuned. So even if he took the short and shoddy approach, it was probably still not "bad" beer. But 6-2 may be enough to call good processes objectively better.

Again, you're leaving out the 4 who had no preference.

One of the beauties of these things--both the exbeeriments and HBT--is that nobody has to agree with anyone else, and we all can use whatever processes we want. We get to make our own decisions!

If, for you, a 6-2 split with 4 no preferences is enough to make a conclusion, go for it. I've used Brulosophy results--the trub- or no-trub exbeeriments led me to try it myself, and I could not tell any difference. The exbeeriments can certainly give us ideas to try and those that make sense to us should be tried if possible.


Sometimes that leads down wrong paths, yes. I.e. the W-34/70 fermented cold vs warm might suggest that the tasters prefer esters to clean lager styles, not that W-34/70 will produce a great clean lager at 82F. And so higher sample sizes and more replication are important, along with a critical eye at *interpreting* the results in context.

Bingo! As we cannot know what the tasters are perceiving, then what do the results indicate? And if you get a panel of tasters for whom those flavors are more important, then there you are.

But when you impugn the concept of tasters based upon mere questions of panel construction, what they did before the tasting, etc, I think you're basically saying that you throw out the human tasting aspect of it entirely as a guide.

No, as I have noted above, if the split is 14-1 in preference to one over the other, that's a stronger result than 9-6.....or 6-2-4.

And if you think I'm "impugning" the panel on the basis of panel construction, what they did before tasting, etc......well, maybe. I just don't know the answer to those questions, and neither does anyone else.

But there's not necessarily evidence that drinking an IPA an hour before tasting the best practices vs the short & shoddy beer will affect whether or not you can perceive that one is cleaner than another. No taster is ever a fully pristine slate. But human tasters are still able to do a LOT of things better than running beer through a spectrometer to determine its composition. Humans can tell you whether it tastes good to a human palate. And that is important data, even if it's somewhat noisy.

We're going to have to disagree on this. You are setting up straw men and using them to buttress your point. I'm going to knock them down right now.

No, there's no evidence that drinking an IPA an hour before (what about 5 minutes before?) will affect perception. But there's also no evidence that it doesn't. And we don't know what they did before taste testing, which is the point. You don't know. I don't know. And as a scientist, I'm trained to think about what could cause the results to be false. This is one of many things that could cause them to be false.

And while I can tell you if something tastes good to a human palate, we can't know how universal that opinion is. That's the point here.

***************************

Good science is organized skepticism. We try to disprove things in science because as it turns out we cannot really prove anything. That's the whole basis of null hypothesis testing.

We're looking at potential causal processes here. Change this variable and what happens, if anything? To demonstrate causality you need to show correlation (differences with and without the experimental treatment), time order (cause precedes the effect), and nonspuriousness (there are no other explanations for the results).

My comments in this thread focus primarily on whether we're really measuring something useful, and whether there are other explanations for the results.

*****************

Again, bully for Marshall and his associates for trying to bring data to bear on issues related to homebrewing. Everyone can use or not use those results as they see fit. If using them results in better beer, I only hope that someday I get to enjoy one with you.
 
If I run an experiment and don't measure anything at all, then run a second experiment and only measure temperature, is the second experiment valid because it's more scientifically rigorous? Of course not, it's still a bad experiment. Running an experiment once, poorly controlling the variables, not testing results outside of poorly controlled test groups--none of this can really be called scientific rigor.

For some reason people are attracted by the "science" of these experiments and passionately defend the scientific validity of them. But as someone who gets paid to do research and perform experiments, these types of methods would tell me nothing about any work I was doing if this is how I approached my research. I'll probably get pushback for saying this, but if someone is going to try and present work as having scientific rigor, it should stand up to the same scrutiny and be treated with the same rigor actual research is, and this work doesn't hold up to that.

Another poster pointed out what these videos are great for, learning how to brew.

Expectations. That's really what I think the large majority of this debate comes down to. I don't believe any of the contributors are doing this as a full-time job, so you can only expect so much. Of course there are many areas where improvement could be made to increase rigor, but to me, they have exceeded what anyone else has done with respect to approaching brewing from a scientific perspective. This doesn't absolve them from improving, but I believe they have been improving if you follow the timeline of their experimentation.

You put science in quotations, as if there is some threshold that must be achieved before you can use that term. Science is knowledge (ask someone who knows Latin). I doubt even their most ardent critic would say they hadn't contributed to homebrewing knowledge.
 
The attempts to precisely quantify every aspect of brewing reminds me a lot of the widely varying views of audiophiles. There was a famous electrical engineer whose lab tests of audio equipment were very controversial. He was widely derided for his view that two amplifiers that measure the same will sound the same. This opinion ignored the fact that the effect of the appurtenant equipment will introduce other variables, and that there may be factors involved that can't be measured.

Some audiophiles become so involved in the hobby that they can even hear a difference between two different sets of speaker cables. I have experienced this in my own system so I know it's true, but many people scoff at the idea.

Scientists are very critical of the reviews in audio magazines, insisting that they don't follow proper control procedures, etc., but the human factor always stymies the process. Naturally, you wouldn't want someone evaluating audio equipment who just left their job at a stamping plant, or someone with a cold, but it's not possible to eliminate all such variables. Another major roadblock to objective observation is the fact that audio memory is imprecise, so even an A/B test would be a subjective comparison.

The common thread between audio and brewing is that the final result depends on a sensory observation that is unique to each person. A person's sense of taste can change due to any number of factors including subtle odors in the room or even the use of medications.

As brewers we're all experimenting in an effort to produce better beer, but the vast majority of us don't come close to the taste tests that the Brulosopher performs, regardless of how imperfect they may be. The Brulosophy exBEERiments may not necessarily result in "hard" data, but they do suggest what we may expect in our own experimentation. The real value is that they are being shared with us, and I appreciate that.
 
I honestly don't find most of the experiments to be that helpful in brewing. Absolutely nothing can replace firsthand experience. Unless you use the same equipment, process, and ingredients as a particular experiment, you really can't pull any generalized rules from it.

I do see a TON of generalization online, citing the Brulosophy blog as a source or proof for arguments, and giving advice based on experiments that may be related but not necessarily applicable. Even the authors are guilty of it. My own personal bias about blogs aside, I think a lot of brewers use the blog as a substitute for personal experience, which is a mistake.
 