# Statistical Tests in Beer Analysis

#### stickyfinger

##### Well-Known Member
I am trying to start a thread that was motivated by the following post:

To quote myself I guess I made this sound like I was talking about dry hopping but I wasn't. I was talking about whirlpool additions and such to make a point that dry hopping showing an IBU difference might not even be accurate considering there is a 2-3 IBU variance in normal testing.

OP

#### stickyfinger

##### Well-Known Member
OK, now, where are you getting the 2-3 IBU variance statement? It may be correct, but what are you basing that on? The lab results say +/-1 is the "variance", which to me seems to indicate that the lab calculated a confidence interval and came up with a margin of error of 1 IBU. Is that your assumption?

My gut instinct (dangerous!) is that the IBU analyses completed by Oregon Brew Lab for the Brulosophy articles (e.g. http://brulosophy.com/2016/03/21/kettle-hops-loose-vs-bagged-exbeeriment-results/) are not adequate to determine if there is a difference in these cases for two reasons:

1) The confidence interval is 2 IBUs wide, but the difference b/w the two beer samples in the example I listed above, for example, is only 2 IBUs. The biggest difference was only 4 IBUs (the other exBeeriment where they measured IBUs - not the one where the sample got mixed up and they just did one combined measurement of two beers.) That seems like too close of a difference. Yes, we can do a t-test on the means, and it might show significance, but is that really fair here?

2) Also, I think to be safe, you'd have to consider the fact that this is at best likely only a confidence interval determined by probably 3 measurements of one sample of beer (taken from a keg?) and put through the extraction protocol of the ASBC one time (ie not 3 times) by one analyst. That is not a fair way to determine if they are really different either in this case.

I think we're both saying you can't really tell a difference here, but I am interested in seeing when we COULD tell a difference and what we would have to do to be confident that there is a difference.

#### ajdelange

##### Well-Known Member
A fellow by the name of Gosset asked that same question years ago, found a method and published it under the pseudonym Student, as his employer, Guinness, prohibited its employees from revealing anything that might be valuable to a competitor. Look up Student's t on Wikipedia for all the details.

OP

#### stickyfinger

##### Well-Known Member
A fellow by the name of Gosset asked that same question years ago, found a method and published it under the pseudonym Student, as his employer, Guinness, prohibited its employees from revealing anything that might be valuable to a competitor. Look up Student's t on Wikipedia for all the details.
That's a fun part of brewing history, isn't it! I am wondering though, if maybe another test is more appropriate for this IBU question. I was also thinking that it would be good for the guys at Brulosophy to have the IBUs tested for every batch they do where they have a split batch (2 boils with the exact same conditions as far as IBUs are concerned.) Though, I guess they would never be exact when they do two boils. But, if you could do 2 boils for every 10 gallon batch, you'd start to come to a better understanding of how close the IBU readings should be for beers that are theoretically at the exact same level. Then again, the boil vigor is not necessarily exactly the same.

They could also just get the IBUs analyzed every time they brew a particular recipe (that should be very close to the same). That way, you take into account a lot of variables.

I guess my main question is how can we know whether the IBU readings for two beers are REALLY different. This includes not being fooled by a stats test.

#### mongoose33

##### Well-Known Member
OK, now, where are you getting the 2-3 IBU variance statement? It may be correct, but what are you basing that on? The lab results say +/-1 is the "variance", which to me seems to indicate that the lab calculated a confidence interval and came up with a margin of error of 1 IBU. Is that your assumption?
"Variance" is not the same as margin of error. A margin of error, as generally understood, is 1.96 standard errors (the standard deviation of the estimate) at the 95% confidence level. In statistics, "variance" is standard deviation squared.

What they probably mean by "variance" is "accurate to within," like a scale that is accurate to 1 gram.
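To make the distinction concrete, here is a minimal Python sketch. The repeat readings are made-up numbers for illustration (the lab report gives no raw data), and the 1.96 multiplier assumes the normal approximation at the 95% level.

```python
import math
import statistics

# Hypothetical repeat IBU readings of one beer (made-up numbers, not lab data).
readings = [41.2, 40.1, 42.3, 41.0, 40.4]

sd = statistics.stdev(readings)            # sample standard deviation
variance = statistics.variance(readings)   # sample variance, equal to sd squared
se = sd / math.sqrt(len(readings))         # standard error of the mean
margin = 1.96 * se                         # 95% margin of error (normal approximation)

print(f"sd={sd:.3f}  variance={variance:.3f}  se={se:.3f}  margin=+/-{margin:.3f}")
```

With only five readings, a Student's t multiplier (2.776 for 4 degrees of freedom) would be more appropriate than 1.96, so the margin shown here is on the optimistic side.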

My gut instinct (dangerous!) is that the IBU analyses completed by Oregon Brew Lab for the Brulosophy articles (e.g. http://brulosophy.com/2016/03/21/kettle-hops-loose-vs-bagged-exbeeriment-results/) are not adequate to determine if there is a difference in these cases for two reasons:

1) The confidence interval is 2 IBUs wide, but the difference b/w the two beer samples in the example I listed above, for example, is only 2 IBUs. The biggest difference was only 4 IBUs (the other exBeeriment where they measured IBUs - not the one where the sample got mixed up and they just did one combined measurement of two beers.) That seems like too close of a difference. Yes, we can do a t-test on the means, and it might show significance, but is that really fair here?
We really cannot--brulosopher sent off two samples; you can't do a t-test unless you have numerous samples, and from what I read in the report you linked, there were two samples sent off--one of each.

2) Also, I think to be safe, you'd have to consider the fact that this is at best likely only a confidence interval determined by probably 3 measurements of one sample of beer (taken from a keg?) and put through the extraction protocol of the ASBC one time (ie not 3 times) by one analyst. That is not a fair way to determine if they are really different either in this case.
Nobody I know would do a confidence interval on an n of three.

I think we're both saying you can't really tell a difference here, but I am interested in seeing when we COULD tell a difference and what we would have to do to be confident that there is a difference.
IMO, there are really two issues that need to be addressed with the approach used by brulosopher. One is whether there's a perceptible difference, as indicated by the number of tasters who can pick the odd one out of a triangle test. Lost in that is the number of tasters who could *not* do that. I'm not a super-taster, I just like what I like, and when I see that most tasters can't tell the difference, I perk my ears up--even if a "statistically-significant" number of tasters could pick the right one. (and of those who could pick the right one, how many were guesses?)

Second is substantive importance. There are a couple exbeeriments in which a significant difference was found--but among the tasters who could pick the odd one out in the triangle test, they were split 50-50 on which they preferred!

When tasters cannot agree on which is better, that tells me there's no meaningful difference that cannot be attributed to personal taste and palate. And thus, the results of such an exbeeriment show there's no meaningful difference.

If, for instance, a statistically-significant number of tasters could distinguish the odd one out, and they overwhelmingly chose one beer as "better" than the other, then I would sit up and take notice. That would mean those who could not distinguish between them found each beer equally pleasing, and those who tasted the difference and picked one over the other have identified a better tasting beer (as a result of ingredients, process, etc.). And then you'd know what to brew!

None of this is to denigrate brulosopher and he and his associates' efforts to bring more science to brewing. I'm impressed, actually, by the approach.

However (you knew there was a however coming, didn't you?), statistical significance, by itself, doesn't tell the whole story, not by a long shot.

OP

#### stickyfinger

##### Well-Known Member
"Variance" is not the same as margin of error. A margin of error, as generally understood, is 1.96 standard deviations. In statistics, "variance" is standard deviation squared.

What they probably mean by "variance" is "accurate to within," like a scale that is accurate to 1 gram.
How do you think they came up with the +/- 1 from the lab? It's hard to know how much it is worth unless we know how they did it. I assumed maybe they ran one sample repeatedly on their instrument to get an idea of the variability of the instrument, and then reported that as a confidence interval? I thought it was a little odd to label it as "variance" when that has a specific meaning in stats.

We really cannot--brulosopher sent off two samples; you can't do a t-test unless you have numerous samples, and from what I read in the report you linked, there were two samples sent off--one of each.
Let's just say you want to know if there IS a difference in IBUs between two beers. Is the only really valid way to do it to brew the beers several times, analyze several samples of each beer and then do a stats test to see if there is a difference? If there is a big enough difference, say 10 IBUs vs 20 IBUs, we can say there is probably a difference, right? How large should the difference be such that we waive the need for multiple analyses?

IMO, there are really two issues that need to be addressed with the approach used by brulosopher. One is whether there's a perceptible difference, as indicated by the number of tasters who can pick the odd one out of a triangle test. Lost in that is the number of tasters who could *not* do that. I'm not a super-taster, I just like what I like, and when I see that most tasters can't tell the difference, I perk my ears up--even if a "statistically-significant" number of tasters could pick the right one. (and of those who could pick the right one, how many were guesses?)
I thought the test was set up so that it accounted for the fact that there were guesses? Also, what do you mean that the number of tasters who could not pick out a difference are "lost." It's just everyone who says they can't taste a difference.

Second is substantive importance. There are a couple exbeeriments in which a significant difference was found--but among the tasters who could pick the odd one out in the triangle test, they were split 50-50 on which they preferred!

When tasters cannot agree on which is better, that tells me there's no meaningful difference that cannot be attributed to personal taste and palate. And thus, the results of such an exbeeriment show there's no meaningful difference.

OK, this is an interesting idea. It sounds like you are looking for a significant difference that is objectively true. I don't see why there's not a meaningful difference just because half prefer one and half the other. It doesn't tell you the simple answer that there is a best way to brew this beer, but it tells you that if you brew this beer with method A, you will likely be able to tell it apart from method B. Once you know which you seem to prefer, you can use that knowledge to always follow method A for example. It would seem to indicate that you are not guaranteed to brew a beer that everyone will prefer though. I see your point that the significant exBeeriments do not necessarily lead to dogma about the right way to do things though.

If, for instance, a statistically-significant number of tasters could distinguish the odd one out, and they overwhelmingly chose one beer as "better" than the other, then I would sit up and take notice. That would mean those who could not distinguish between them found each beer equally pleasing, and those who tasted the difference and picked one over the other have identified a better tasting beer (as a result of ingredients, process, etc.). And then you'd know what to brew!
Yes, I see your point. However, I wouldn't say it's not meaningful when they get a statistically significant test. I am excited when I hear there is a difference in a process, as it means that I can use it as a tool to change my beer flavor.

Thanks for the reply. This is an interesting topic to me, obviously!

#### eric19312

##### Supporting Member
HBT Supporter
Second is substantive importance. There are a couple exbeeriments in which a significant difference was found--but among the tasters who could pick the odd one out in the triangle test, they were split 50-50 on which they preferred!
I disagree. It is meaningful to notice there is a difference and that is the most powerful aspect of the triangle test. Preference is different. I love hoppy IPAs. If you have me taste two light american lagers I may not like either and may just choose the one with the most hop character or even worse like the one with an estery flaw better. The fact I could tell the difference tells the tester that the process made the beer different. The fact that I picked the one with the out of style flavor is not as meaningful.

#### mongoose33

##### Well-Known Member
How do you think they came up with the +/- 1 from the lab? It's hard to know how much it is worth unless we know how they did it. I assumed maybe they ran one sample repeatedly on their instrument to get an idea of the variability of the instrument, and then reported that as a confidence interval? I thought it was a little odd to label it as "variance" when that has a specific meaning in stats.
Yeah, it is odd. I keep thinking about my reloading scale, which is accurate to .1 grains. That means that, as I understand it, a measured 5.1 grain load is anywhere between 5.0500001 (or thereabouts) and 5.1499999 (or thereabouts).

Let's just say you want to know if there IS a difference in IBUs between two beers. Is the only really valid way to do it to brew the beers several times, analyze several samples of each beer and then do a stats test to see if there is a difference? If there is a big enough difference, say 10 IBUs vs 20 IBUs, we can say there is probably a difference, right? How large should the difference be such that we waive the need for multiple analyses?
No, I don't think the only valid way to do it is to brew the beers several times....though if you want an average IBU measure then probably that would be useful.

You'd only have to compare the two beers once, if you are certain you have a homogeneous mix, and no one sample will differ from any other. Then it's a matter of the accuracy of the measurement.

I thought the test was set up so that it accounted for the fact that there were guesses? Also, what do you mean that the number of tasters who could not pick out a difference are "lost." It's just everyone who says they can't taste a difference.
No, it's not set up that way. It's just a simple Z-test. When I first saw the method I was moved to reproduce it myself, and they are doing the Z-test correctly. It operates on the assumption (the null hypothesis, if you will) that tasters are randomly choosing the correct one, and how unlikely it would be for "X" number of tasters to choose the correct one, by random chance.
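For anyone who wants to reproduce that kind of calculation, here is a rough Python sketch, not Brulosophy's actual code. It assumes a one-sided test with a chance rate of 1/3 (one odd sample out of three); the function name and the 12-of-24 panel are made up for illustration. It also computes the exact binomial tail, which is a safer choice for small panels.

```python
import math

def triangle_test_p_values(correct, tasters, chance=1/3):
    """Return (z_p, exact_p): one-sided p-values under the 'everyone guesses' null."""
    # Z-test: normal approximation to the binomial, no continuity correction.
    mean = tasters * chance
    sd = math.sqrt(tasters * chance * (1 - chance))
    z = (correct - mean) / sd
    z_p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail area of a standard normal
    # Exact binomial: P(at least `correct` right answers by pure guessing).
    exact_p = sum(
        math.comb(tasters, k) * chance**k * (1 - chance) ** (tasters - k)
        for k in range(correct, tasters + 1)
    )
    return z_p, exact_p

# Hypothetical panel: 12 of 24 tasters pick the odd beer out.
z_p, exact_p = triangle_test_p_values(correct=12, tasters=24)
print(f"Z-test p = {z_p:.4f}, exact binomial p = {exact_p:.4f}")
```

On a panel this small the two answers can land on opposite sides of 0.05, which is one reason to prefer the exact version when the head count is low.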

As far as being "lost," what I mean is that there is some value or meaning in the fact that most tasters couldn't tell a difference. That tells me (if you can assume they represent the more normal beer drinking population, a questionable assumption) that there isn't a difference that most people can perceive. That--right there--is also important information.

The tasters doing this are beer drinkers, and predisposed to like beer--and yet, the majority can't tell the difference. Hmmmm....

OK, this is an interesting idea. It sounds like you are looking for a significant difference that is objectively true. I don't see why there's not a meaningful difference just because half prefer one and half the other. It doesn't tell you the simple answer that there is a best way to brew this beer, but it tells you that if you brew this beer with method A, you will likely be able to tell it apart from method B. Once you know which you seem to prefer, you can use that knowledge to always follow method A for example. It would seem to indicate that you are not guaranteed to brew a beer that everyone will prefer though. I see your point that the significant exBeeriments do not necessarily lead to dogma about the right way to do things though.
What I tell my students (yes, I teach this stuff, which is why it is so interesting to me that others are interested in it w/r/t beer!) is that whenever they are doing a statistical test, they should be thinking about what the different possible outcomes mean--ahead of time.

Same here--what are we trying to find out when we do exbeeriments like this? We're trying to find out if something is better--would you agree? If it's not better (i.e., the same), then we'll choose the method that's either easier, less time consuming, or less expensive. If one is better, then we'll balance time, effort, cost in deciding to choose the better one. (If a beer tastes only slightly better but costs \$100 more per batch...well, I suspect most of us would brew the cheaper version.)

The problem is what it is we're trying to discern here. Why are we doing the exbeeriment, and what do we hope to learn? If I were a commercial brewery and I found out that most tasters couldn't tell the difference, but among those who purportedly can, they prefer Beer A overwhelmingly to Beer B, then I'd brew Beer A--because at least a subset of all customers prefer it, and the rest have no preference.

And in the same vein, if those who can tell a difference are split 50-50 w/r/t what they prefer, then it's a matter of what's faster, cheaper, easier.

I just brewed a batch Wednesday, and for the first time ever I just let all the trub go into the fermenter without trying to filter most of it out. There's at least one brulosophy exbeeriment testing that exact idea, and there was no significant difference. I've also seen a little anecdotal evidence suggesting the same, so I'm trying it. Anxious to see how it goes. I've got about 26 more days.

Yes, I see your point. However, I wouldn't say it's not meaningful when they get a statistically significant test. I am excited when I hear there is a difference in a process, as it means that I can use it as a tool to change my beer flavor.
We're going to have to disagree there. I haven't even delved into the issue of tasters discerning the difference. They try once; I'd like to see tasters show they can do it more than once, but that's probably a pretty difficult thing to do in terms of time and effort. We don't know how many of those who chose the right one did so because they guessed right (ergo, they can't really tell the difference but now they're in the tasting panel!), and how many truly can tell.
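One common way to put a rough number on the lucky guessers is the simple guessing model: assume a fraction of the panel truly detects the difference and always answers correctly, while everyone else guesses and is right 1/3 of the time. The Python sketch below works under that assumption; the function names and the 12-of-24 panel are hypothetical.

```python
def estimated_discriminators(correct, tasters, chance=1/3):
    """Estimated fraction of tasters who truly detect the difference (guessing model)."""
    observed = correct / tasters
    # Solve observed = d + (1 - d) * chance for d; clamp at zero.
    return max(0.0, (observed - chance) / (1 - chance))

def expected_lucky_guessers(correct, tasters, chance=1/3):
    """Expected number of correct answers that came from guessing."""
    d = estimated_discriminators(correct, tasters, chance)
    return tasters * (1 - d) * chance

# Hypothetical panel: 12 of 24 correct.
print(estimated_discriminators(12, 24))  # fraction of true discriminators
print(expected_lucky_guessers(12, 24))   # correct answers expected from guessing
```

Under this model, roughly half of the 12 correct answers in the example would be expected to come from guessing, which is the concern raised above about who ends up in the preference panel.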

But as far as what "significance" means, IMO there has to be a conclusion. Just saying two beers do not taste the same tells me nothing as to which is preferable. There's no actionable intelligence there! Which is "better?"

So then we go to the next level which is how they prefer the one or the other. THAT may provide actionable intelligence. But if those who purportedly can distinguish between the beers are split 50-50 as to which they prefer, there's no actionable intelligence there either. Which should you do?

In the end, in such circumstances the answer comes down to your own palate, which means you decide what you like. I like maltier beers; others not so much. Who's right? Neither, of course. (J/K--I'm right, of course!)

Thanks for the reply. This is an interesting topic to me, obviously!
Me too! I think this is one of the places we can learn some things about brewing--but unless there's a clear preference, I'm not sure that we've learned anything other than "it doesn't matter."

That conclusion--"It doesn't matter"--is in fact a terrific conclusion in certain instances. Dump in the trub or don't? Does it matter? If it doesn't, then I'm overturning conventional wisdom and dumping it in. The fact that people can't discern the difference suggests we have a way to make the brewing process easier. If.

#### mongoose33

##### Well-Known Member
I disagree. It is meaningful to notice there is a difference and that is the most powerful aspect of the triangle test. Preference is different. I love hoppy IPAs. If you have me taste two light american lagers I may not like either and may just choose the one with the most hop character or even worse like the one with an estery flaw better. The fact I could tell the difference tells the tester that the process made the beer different. The fact that I picked the one with the out of style flavor is not as meaningful.
I have to respond the same way as with Stickyfingers--what's the actionable intelligence? If those who purportedly can tell the difference are split 50-50 as to which they prefer, what have you learned?

At best, what you've learned is that you have to brew two different batches and compare. Yes, your palate prefers certain kinds of brews. But what did you learn about which method is better if the tasters who presumably could tell a difference are split 50-50 in terms of their preference?

In that case, I'd say we haven't learned anything particularly useful.

I'm going to propose that exbeeriments like brulosopher does have their best influence on process variables--especially when no difference emerges. In other words, it doesn't make a difference which we choose to do, ergo, pick the fastest, easiest, least costly process.

When it comes to ingredients, it's much trickier. A person's palate dictates which they prefer, and that's not an objective criterion. I like hoppy aroma and taste, but not high levels of IBU. My guess is you like that better than I do, but that doesn't mean either of us are wrong. Just different.

If I were brewing a hoppy beer, I might like a beer drinker like you to be on the tasting panel, as someone predisposed to like those kinds of beers, and who can give me a preference. I'm probably not the guy to be on that type of panel.

Fun stuff!

#### eric19312

##### Supporting Member
HBT Supporter
I'm going to propose that exbeeriments like brulosopher does have their best influence on process variables--especially when no difference emerges. In other words, it doesn't make a difference which we choose to do, ergo, pick the fastest, easiest, least costly process.
+1 - 100% agreement with that statement

But here is the thing.

Many of the process variables Brulosopher has shown do not make a substantial difference are the same processes promoted by other brewers as the key to making better beer. My guess is these apparently non-essential processes entered brewing lore following side-by-side or, worse, batch-to-batch comparisons ("did it this way and boy, it seems even better than the last batch"). The application of the triangle test is a big advancement over those more natural/intuitive methods.

#### ajdelange

##### Well-Known Member
How do you think they came up with the +/- 1 from the lab? It's hard to know how much it is worth unless we know how they did it.
That's a very good point. But in today's world, where statistics are used to deceive as often as, or more often than, they are used to inform, one seldom gets that information. When the pH meter ad says "Accuracy 0.01 pH", what does that mean? After looking at probably close to a hundred pH meter spec sheets, I have only found one manufacturer who reveals what that means (i.e. how he measures that accuracy).

I assumed maybe they ran one sample repeatedly on their instrument to get an idea of the variability of the instrument, and then reported that as a confidence interval?
That's the general idea but it is more complicated than that. To measure IBUs by a commonly used method, 10 mL of beer are mixed with 1 mL 3N (I think it is) HCl and 20 mL of spectrographic-grade iso-octane. The mix is shaken for 15 minutes to transfer the bittering principles to the iso-octane, centrifuged if necessary to separate the phases, and the UV absorption measured at a wavelength I don't recall. There are certainly questions as to the accuracy of the absorption measurement relative to the calibration of the instrument (A and wavelength), its light leakage, its bandwidth, etc., but there are other error sources as well, involving the actual width of the cuvet, the accuracy with which the 10 mL, 1 mL and 20 mL can be measured, the effect of the centrifugation... Further to this, the lab may have multiple operators using multiple instruments.

I thought it was a little odd to label it as "variance" when that has a specific meaning in stats.
Today statistics are bandied about with little regard for the necessity to adhere to the formalities that make statistics useful. This would be an example of that. The variance is the second central moment of the error (in this case) distribution and, as it is impossible to know that, the number they are reporting as the variance is in fact doubtless an estimate of it, such as the standard error, or a reproducibility number (e.g. in 95% of cases the true IBU level will be within ± 1 IBU of what is reported).

Let's just say you want to know if there IS a difference in IBUs between two beers. Is the only really valid way to do it to brew the beers several times, analyze several samples of each beer and then do a stats test to see if there is a difference?
What you have to do in a case like this is pose a null hypothesis: that there is no difference in the IBU levels of the two beers. Next you design an experiment in which you measure the IBUs of several samples of the two beers and determine the probability that you would obtain the measurements you did were the null hypothesis true. If that probability is small then you can confidently (and the amount of confidence is calculable - you define the experiment to make it as small as makes you comfortable) reject the null hypothesis.
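As a minimal sketch of that recipe in Python, here is an exact permutation test on the difference in means. This is a swapped-in technique chosen because it needs only the standard library; it is not Student's t-test, though it answers the same question. The IBU readings are made up for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    observed = abs(mean(a) - mean(b))
    total = sum(pooled)
    extreme = 0
    splits = 0
    # Under the null, every way of relabelling the samples is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits

# Hypothetical IBU measurements, five independent samples per beer.
beer_a = [41.0, 42.5, 40.8, 41.9, 42.2]
beer_b = [44.1, 43.6, 45.0, 44.4, 43.2]
p = permutation_p_value(beer_a, beer_b)
print(f"p = {p:.4f}")
```

A small p here says the observed split would be rare if the labels "beer A" and "beer B" were interchangeable. The samples need to be genuinely independent (separate extractions, ideally separate batches) for that conclusion to carry much weight.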

If there is a big enough difference, say 10 IBUs vs 20 IBUs, we can say there is probably a difference, right? How large should the difference be such that we waive the need for multiple analyses?
You must cast the question into the structure of the formalism we've been discussing. The null hypothesis is: The two beers have equal bitterness. The experiment involves taking two measurements of bitterness on the two beers, b1 and b2. The question to be answered is:
Given the null hypothesis, what is the probability that the difference between two IBU measurements is bigger in magnitude than the observed |b1 - b2|?

As an example, let's say we measure 41 and 42 for the two beers in a process we know has a standard error of 1 IBU. We assume that the measurement errors are unbiased, are Gaussian and have standard deviation 1 IBU. The difference is, under the hypothesis that the means of the measurements are the same, another Gaussian random variable (GRV) with mean 0 and standard deviation sqrt(2) IBU. The observed difference amounts to 1/sqrt(2) = 0.707 standard deviations, and the probability that a GRV is 0.707 standard deviations or greater from its mean (in either direction) is 48%. This means that in 48% of measurements on identical beers we could expect to see a difference as large as we did or larger, so it would be very hard to reject the null hypothesis (and thereby support the hypothesis that the beers are of different bitterness) based on a difference in measurements that small.

Now suppose that b1 - b2 = 2 IBU. Under the null hypothesis this is 2/sqrt(2) = 1.414 standard deviations. The probability that a Gaussian is more than 1.414 standard deviations from its mean is 15.7%. This probability is appreciably smaller and we might begin to accept that the null hypothesis is not true if differences in measurements as large as 2 are seen.

Here are the probabilities for some other values of observed difference on single measurements:
3: 3.4%
3.5: 1.3%
4: 0.5%
4.5: 0.1%
5: 0.04%

You ask how much difference is needed before waiving multiple tests. I respond by asking "How confident do you want to be?". If you want to be very certain that the difference you saw is very unlikely if the beers are the same (0.04%) then you would want to see a difference of 5 IBU in the measurements. If you can accept rejection of the null hypothesis at the 0.5% level then 4 is enough.
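The probabilities quoted above can be reproduced with a few lines of Python. This sketch assumes, as in the example, independent Gaussian errors with a standard deviation of 1 IBU on each single measurement; the function name is made up.

```python
import math

def two_sided_tail(diff, sigma=1.0):
    """P(|D| >= diff) when D is the difference of two measurements, each N(mu, sigma^2)."""
    z = abs(diff) / (sigma * math.sqrt(2))  # difference in units of its own std dev
    return math.erfc(z / math.sqrt(2))      # two-sided tail area of a standard normal

for d in (1, 2, 3, 3.5, 4, 4.5, 5):
    print(f"{d:>4} IBU: {100 * two_sided_tail(d):.2f}%")
```

Changing `sigma` shows how quickly the picture shifts: if the true single-measurement error were 2 IBU rather than 1, a 4 IBU difference would be no more convincing than a 2 IBU difference is here.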

I thought the test was set up so that it accounted for the fact that there were guesses?
An extremely important point about triangle testing is that in a triangle test you are testing the panel - not the beer. The null hypothesis is: This panel cannot taste the difference in bitterness of these two beers. It is well known that some people cannot taste diacetyl. If you were to empanel a group of such people to determine whether a process change reduced diacetyl you would get a very different result than if you empaneled tasters with 'normal' diacetyl perception.

#### mongoose33

##### Well-Known Member
An extremely important point about triangle testing is that in a triangle test you are testing the panel - not the beer. The null hypothesis is: This panel cannot taste the difference in bitterness of these two beers. It is well known that some people cannot taste diacetyl. If you were to empanel a group of such people to determine whether a process change reduced diacetyl you would get a very different result than if you empaneled tasters with 'normal' diacetyl perception.
This ^.

The problem w/ these tests is, if the panel is unable to discern a real difference, whether that means the beer has no difference or if the panel simply can't tell the difference.

I'm not criticizing Brulosopher or anyone else doing this, as they are doing the best they can, given resource limitations, to do these comparisons, and I thank them for that. I personally believe, as in I'm changing how I brew because of it, that many of their results are likely valid.

But in the end, we don't really know the composition of the tasting panels, don't know the degree to which they represent what I or you might like--but that's, as they say, simply a condition of competition.

To do this such that we'd eliminate a lot of the methodological difficulties would first require validation of the panels--and I have no idea how much money that would cost.

Over time, as these exbeeriments are replicated at other times in other places with other panels, and if they reach similar conclusions, that will enhance confidence in the results.

OP

#### stickyfinger

##### Well-Known Member
That is a very interesting article.

OP

#### stickyfinger

##### Well-Known Member
This ^.

The problem w/ these tests is, if the panel is unable to discern a real difference, whether that means the beer has no difference or if the panel simply can't tell the difference.

I'm not criticizing Brulosopher or anyone else doing this, as they are doing the best they can, given resource limitations, to do these comparisons, and I thank them for that. I personally believe, as in I'm changing how I brew because of it, that many of their results are likely valid.

But in the end, we don't really know the composition of the tasting panels, don't know the degree to which they represent what I or you might like--but that's, as they say, simply a condition of competition.

To do this such that we'd eliminate a lot of the methodological difficulties would first require validation of the panels--and I have no idea how much money that would cost.

Over time, as these exbeeriments are replicated at other times in other places with other panels, and if they reach similar conclusions, that will enhance confidence in the results.
Yes, I see. That IS very interesting. We could assume that one or more people who participate in these informal tastings may not be sensitive to whatever is being tested. If that is so, there would be a bias against rejecting the null hypothesis, so we would get more results that indicate that the panel cannot taste a difference. Maybe that is part of the reason that so many of these exBeeriments fail to reach significance.
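A rough way to see how much insensitive tasters hurt is to compute the power of the triangle test. The Python sketch below assumes the simple guessing model (a fraction of the panel always detects the difference, the rest guess with a 1/3 chance), a one-sided exact binomial test at the 0.05 level, and a hypothetical panel of 24; none of these figures come from the exBeeriments themselves.

```python
import math

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, alpha=0.05, chance=1/3):
    """Smallest number of correct answers that reaches significance at level alpha."""
    for k in range(n + 1):
        if binom_tail(n, k, chance) <= alpha:
            return k
    return n + 1  # significance is unreachable with this panel size

def power(n, discriminators, alpha=0.05, chance=1/3):
    """Chance of a significant result when a given fraction of tasters truly detect."""
    p_correct = discriminators + (1 - discriminators) * chance
    return binom_tail(n, critical_count(n, alpha, chance), p_correct)

# Hypothetical 24-person panel with varying shares of sensitive tasters.
for d in (0.0, 0.1, 0.25, 0.5):
    print(f"{d:.0%} true discriminators -> power {power(24, d):.2f}")
```

Under these assumptions, a panel in which only a small share of tasters is sensitive to the variable will usually fail to reach significance even when the beers really differ, which fits the suspicion that this contributes to so many non-significant results.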