Do "professional" brewers consider brulosophy to be a load of bs?

Wow, Smellyglove, now I get why scientists mainly publish successful experiments.

The brulosophers did not invent the triangle test, which is clearly an accepted tool in sensory analysis. Yes, they make some compromises in design and administration that might be used to throw some doubt on the results, yet I'm not seeing anyone in our community doing it better. I admit I don't see much professional brewing literature, but I did read a paper today, published in a serious journal and authored by brewers from Rock Bottom, that compared four different late-hopping techniques and was pretty interesting. But it turns out they brewed the four batches at different breweries, using different ingredients and apparently amazingly different waters. One batch had 1200 ppm sulfate compared to beers below 100... And because this is professional brewing, published in a serious, presumably peer-reviewed journal, it must be well done and valid, but the homebrewers are done in by a failure to randomize AAB to ABA in their sensory test?

I've done the same experiments as the one in the mentioned paper and to me that was pretty "old news". They did not report the temperature for the WP hops though, and I feel that's pretty important. Over or under a certain threshold will give you a very noticeable difference. The results were on par with my own experiences, but the WP temperature has a lot to say.
I'm a partypooper. I somewhat trust some sources (when I've double/triple/quadruple checked them) and/or tried it out for myself, if it's something I'm able to try in my brewery.

1200 ppm of sulfate is A LOT! I didn't see that number in the paper. Wow! What does that even taste like?

But. When you said "scientists mainly publish successful experiments"... aren't all experiments successful? It's about testing something and seeing what impact it has. If the result is the opposite of what you'd want it to be, then I can see how one could call it an "unsuccessful" experiment, but that's just because the scientist was hoping for or expecting a given result, I guess. How can an experiment be unsuccessful?
 
AJ, I get it. You don't understand how measurement figures into all this. That doesn't make you a bad person--but it does mean you're missing a basic element of research, one whose failure invalidates conclusions.

I would really like you to be able to understand this so I have a simple request for you. Please go to http://editorbar.com/upload/ReBooks/...703b059e4b.pdf and read it. It is only 8 pages. It is ASTM E1885, Standard Test Method for Sensory Analysis - Triangle Test. ASTM is the American Society for Testing and Materials and they promulgate standard procedures to industry for various measurement protocols. Another example whose number I remember is E308, for measurement of color in the CIE tristimulus system. Investigators in industry use these protocols to ensure that they all do a particular measurement in the same way so that interlaboratory comparisons are valid. If you do indeed read this and still feel that triangle testing is "hugely flawed," at least I will be certain that your feelings are based on knowledge of what triangle testing really is. And that's important to me because I can't believe you would oppose it so vehemently if you really understood what it is.

Now, when you understand what it is, if you still feel it is hugely flawed, I believe you are morally obliged to explain this to ASTM, because E1885 is accepted as an authority when it comes to sensory testing. They should not, nor would they want to, continue to mislead the world. Also contact ISO, as they have a similar standard, ISO 4120:2004, which is also promulgated by DIN.

Such as when you "qualify" tasters based on a lucky guess. Who would ever intentionally do such a thing?
Well the thousands of investigators world round who use ASTM E1885 or ISO 4120 to evaluate their companies' products would.


Or try this. We're measuring weight, not math aptitude. We have a person step on, and off, and on, and off, a scale 10 times. The scale returns 167, 155, 111, 211, 106, 92, 47, 113, 197, 175.

So is that a good measure of their weight? Of course not. It's unreliable. An unreliable measure cannot be a valid measure.
Ah, the light comes on. You don't understand an important aspect of measurement and that is that measurements are almost always corrupted by noise. I think you said you are in the social sciences so that may explain it. Engineers and scientists deal with measurement sequences like this all the time and extract useful (actionable) information from them. The guy's weight is 137.4 ± 16.3 lbs. The measurements are valid but are corrupted by noise - in this case quite badly corrupted to the point that we would certainly want to investigate. Estimation theory is an area of study that focuses on extracting useful information from measurements corrupted by noise. We have an estimate of the weight modeled by a gaussian random variable with mean 137.4 and standard deviation 16.3 lbs. If I put a 150 lb guy on that same scale (same meaning it corrupts readings to the same extent) I would obtain a set of readings like 165,156,127,125,92,163,217,83,213,123 from which I would conclude this man weighs 146.4 ± 14.35. Not 150 pounds for sure but clearly the change in weight has been detected. We don't want to weigh things with scales this bad but sometimes we have to and we have the means to deal with such badly corrupted data and extract useful information from them. This may be terra incognita to a social scientist but is an important aspect of measurement in the physical sciences. Your smart phone, for example, uses sophisticated techniques to extract voice and data from what you would call unreliable readings.
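
Not part of the original post, but a small numeric sketch in Python of the same idea: the sample mean and its standard error reproduce the 137.4 ± 16.3 and 146.4 ± 14.35 figures to within rounding, so the "±" above appears to be the standard deviation of the mean, and the two subjects are distinguishable even though the individual readings scatter wildly.

[code]
# Quick check of the weight example above: mean and standard error of the mean
# for the two sets of noisy scale readings quoted in the post.
import statistics as st

def estimate(readings):
    # sample mean and standard error of the mean (sample std dev / sqrt(n))
    n = len(readings)
    return st.mean(readings), st.stdev(readings) / n ** 0.5

first = [167, 155, 111, 211, 106, 92, 47, 113, 197, 175]
second = [165, 156, 127, 125, 92, 163, 217, 83, 213, 123]

for label, data in (("first subject", first), ("second subject", second)):
    m, s = estimate(data)
    print(f"{label}: {m:.1f} +/- {s:.1f} lb")
# prints roughly 137.4 +/- 16.4 and 146.4 +/- 14.3
[/code]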


In the triangle tests using humans as testers, we have no indication of the reliability of their abilities. In fact, there's a lot of reason to suggest they may be unreliable. And of course, if their reliability is suspect, so too are the results pertaining thereto.
I don't expect that you will understand this either but sometimes we are interested in measuring the 'unreliability'.



This is basic research and measurement. No amount of statistical handwaving can overcome that.
I'm talking about solid, accepted use of statistics to extract useful information from corrupted data.

None of this makes you a bad person.
I'm really trying to bring you into the light here.

But you cannot take measures that are unreliable, whose validity as a result is suspect, whose generalizability is uncertain at best, and draw conclusions that are meaningful.
Well yes you can. Scientists and engineers do that all the time. Your cell phone does it. Your TV set does it. Your GPS receiver does it.

You don't obtain actionable intelligence.
The number of systems that obtain actionable intelligence from corrupted signals is myriad.

Insight is finally coming to me. This is a hypothesis of course, but it seems from your remarks that social scientists don't have the tools for extracting information from noise and so reject any data set that is noisy. But physical scientists do, and so are able to work with signals that are a little above the noise and, in some cases, well below it (GPS). The problem I see with the "reject all noisy data sets" approach is that all data is noisy to some extent.

You keep trying, and your doggedness in that is admirable, but without quality measurement, there's nothing to say.
I would like to have you understand and a first step would be to read that standard so please consider doing that. But I recognize that you may not be able to and this brings me to the other reason for my doggedness. Your position is frankly lacking in merit and I feel an obligation to the other readers to be sure that they don't reject triangle testing as a possible source of benefit to them because one individual not versed in this aspect of the art tells them it is flawed.

As I tell my students, measurement is where the rubber meets the road. If you can't measure effectively, the rest is meaningless.
No argument here. Our big disconnect is that you can in fact measure effectively under circumstances under which you think we can't.
 
At first they were interesting reads, but when one experiment after another tells you that "it doesn't matter"

Brulosophy makes several mistakes in procedure but their most serious mistake is in the interpretation of their results. They say it doesn't matter because they don't reach the arbitrary p < 0.05. But their data, at least the bits I've looked at, clearly say that there is a signal, if a weak one, that it does make a difference.
 
How can an experiment be unsuccessful?

An experiment paid for by an oil company that found that the burning of hydrocarbon fuels caused global warming would be unsuccessful. An experiment paid for by a government agency that found that burning hydrocarbon fuels did not cause global warming would be unsuccessful.
 
An experiment paid for by an oil company that found that the burning of hydrocarbon fuels caused global warming would be unsuccessful. An experiment paid for by a government agency that found that burning hydrocarbon fuels did not cause global warming would be unsuccessful.

This is politics. I was just after the homebrew-level.
 
Brulosophy makes several mistakes in procedure but their most serious mistake is in the interpretation of their results. They say it doesn't matter because they don't reach the arbitrary p < 0.05. But their data, at least the bits I've looked at, clearly say that there is a signal, if a weak one, that it does make a difference.

Maybe I haven't studied their data enough, but it didn't take long until I realized I didn't want to waste my time, so it's on me.
 
This is politics. I was just after the homebrew-level.

I thought that example might illustrate the general principle that an unsuccessful experiment is one that does not give the result the investigator wants, for whatever reason. There can be several reasons why that happens, ranging from procedure to analysis to politics. If an experiment leads to the conclusion that fermentation temperature does not matter in brewing, that's an unsuccessful experiment in that everyone knows it does. The investigator(s) will not be pleased with the result, will assume that they did something wrong, will try to figure out what it was (in this case mostly over-reliance on small p and procedural errors), and will take steps to correct and rerun the experiment if the resources are available.

It is not always the case that doing this gives the 'right' result, but in such cases the investigators learn that the sought-after result does not represent truth. In this sense the experiment was a success in that the truth was found, even though the experimenter (or his sponsor) may not be pleased with the result. This, I am sure, is what you meant. I was trying to be pithy (and fell flat - yet again).

Of course it is well known that no one attains fame or fortune for negative results. When Hata found that compound 605 wouldn't cure syphilis, I'll bet Ehrlich didn't give him an attaboy even though he obviously found something of importance. I'll bet there was a much bigger stir around the lab when the experiment with compound 606 found that in it lay the potential cure (Salvarsan).
 
Ah, the light comes on. You don't understand an important aspect of measurement and that is that measurements are almost always corrupted by noise. I think you said you are in the social sciences so that may explain it. Engineers and scientists deal with measurement sequences like this all the time and extract useful (actionable) information from them. The guy's weight is 137.4 ± 16.3 lbs. The measurements are valid but are corrupted by noise - in this case quite badly corrupted to the point that we would certainly want to investigate. Estimation theory is an area of study that focuses on extracting useful information from measurements corrupted by noise. We have an estimate of the weight modeled by a gaussian random variable with mean 137.4 and standard deviation 16.3 lbs. If I put a 150 lb guy on that same scale (same meaning it corrupts readings to the same extent) I would obtain a set of readings like 165,156,127,125,92,163,217,83,213,123 from which I would conclude this man weighs 146.4 ± 14.35. Not 150 pounds for sure but clearly the change in weight has been detected. We don't want to weigh things with scales this bad but sometimes we have to and we have the means to deal with such badly corrupted data and extract useful information from them. This may be terra incognita to a social scientist but is an important aspect of measurement in the physical sciences. Your smart phone, for example, uses sophisticated techniques to extract voice and data from what you would call unreliable readings.

And now we're getting into my wheelhouse... When you look into electronics, it's *all* about extracting signal from waveforms that look to the untrained eye like noise.

My own industry (data storage) is telling. When I first started getting into hard disk drives, I assumed that the data on the platter would show up as a relatively clear "1" or "0"* based upon the magnetic read data. Not true at all... It looks like noise to me.

So how do they get the bits out of your HDD? It's probability-based. They use a method called PRML - partial response, maximum likelihood. In layman's terms, it's basically taking an educated guess and then checking it against the ECC (error correction and checking) codes.

http://www.pcguide.com/ref/hdd/geom/dataPRML-c.html
https://en.wikipedia.org/wiki/Partial-response_maximum-likelihood

Now, when you actually think about it, it seems like fantasy that you can tease some of these signals out of that noise. But it works. My livelihood and the sanctity of your digital data rely on it working.

Similar things are used in a lot of data transmission scenarios. Your cell phone signal is relatively weak, and it's being sent through the air where all manner of other electronic communications are being sent and potentially interfering with yours. Upon receipt at the cell tower, I'll bet that incoming signal looks like noise. Yet your calls go through.

To bring it back to AJ's point, any time you're dealing with human sensory analysis, you have to assume that there is going to be significant noise. Like the experiments where blindfolded wine experts couldn't always tell red wine from white wine, we are very corrupted instruments. But that DOES NOT mean that all tests are unreliable. It simply means that you have to develop tests which can tease a signal out of the noise.

* Overly simplified, as it is not stored as high/low binary values but based on transitions. But that's getting into the weeds.
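
Not PRML itself, just a toy Python sketch of my own showing the underlying maximum-likelihood idea: a ±1 symbol is buried in Gaussian noise twice its size, and we pick whichever hypothesis makes the observed samples most probable.

[code]
# Toy maximum-likelihood detection (not real PRML): a +1 or -1 symbol is sent,
# each sample is that symbol plus heavy Gaussian noise, and we choose the
# hypothesis under which the observed samples are most likely.
import random

def loglik(samples, symbol, sigma):
    # log-likelihood of the samples if `symbol` was sent (Gaussian noise)
    return sum(-(x - symbol) ** 2 / (2 * sigma ** 2) for x in samples)

def detect(samples, sigma):
    # maximum-likelihood decision between the two hypotheses
    return max((+1, -1), key=lambda s: loglik(samples, s, sigma))

random.seed(1)
sigma, trials, correct = 2.0, 10_000, 0
for _ in range(trials):
    sent = random.choice((+1, -1))
    samples = [sent + random.gauss(0, sigma) for _ in range(8)]
    correct += detect(samples, sigma) == sent
print(f"recovered {correct / trials:.0%} of symbols from noise twice their amplitude")
[/code]

With eight noisy looks at the symbol the right answer comes back better than nine times out of ten, which is roughly the same trick a triangle test plays by pooling many imperfect panelists.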
 
Publication bias is a serious problem and that's one example:

" However, statistically significant results are three times more likely to be published than papers with null results."

https://en.wikipedia.org/wiki/Publication_bias

This is one area where Brulosophy deserves credit. They report all findings, whether p<0.05 or not. I could not have performed my meta-analysis if they were only reporting the experiments which crossed that threshold, and further the readers would be misled if they only reported experiments which crossed the threshold.

(As we're discussing, readers are somewhat being "misled" or perhaps merely misinterpreting the methods to assume the null hypothesis if p>0.05, but that's a whole different ball of wax.)
 
This is one area where Brulosophy deserves credit. They report all findings, whether p<0.05 or not. I could not have performed my meta-analysis if they were only reporting the experiments which crossed that threshold, and further the readers would be misled if they only reported experiments which crossed the threshold.

(As we're discussing, readers are somewhat being "misled" or perhaps merely misinterpreting the methods to assume the null hypothesis if p>0.05, but that's a whole different ball of wax.)

I couldn't agree more. Yes, some may be misled by the results, but Brulosophy always warns readers not to get too carried away. Anyone so inclined can do their own experiments.
 
An experiment paid for by an oil company that found that the burning of hydrocarbon fuels caused global warming would be unsuccessful. An experiment paid for by a government agency that found that burning hydrocarbon fuels did not cause global warming would be unsuccessful.

No doubt :ban:
 
And now we're getting into my wheelhouse... When you look into electronics, it's *all* about extracting signal from waveforms that look to the untrained eye like noise.

My own industry (data storage) is telling. When I first started getting into hard disk drives, I assumed that the data on the platter would show up as a relatively clear "1" or "0"* based upon the magnetic read data. Not true at all... It looks like noise to me.

So how do they get the bits out of your HDD? It's probability-based. They use a method called PRML - partial response, maximum likelihood. In layman's terms, it's basically taking an educated guess and then checking it against the ECC (error correction and checking) codes.

http://www.pcguide.com/ref/hdd/geom/dataPRML-c.html
https://en.wikipedia.org/wiki/Partial-response_maximum-likelihood

Now, when you actually think about it, it seems like fantasy that you can tease some of these signals out of that noise. But it works. My livelihood and the sanctity of your digital data rely on it working.

Similar things are used in a lot of data transmission scenarios. Your cell phone signal is relatively weak, and it's being sent through the air where all manner of other electronic communications are being sent and potentially interfering with yours. Upon receipt at the cell tower, I'll bet that incoming signal looks like noise. Yet your calls go through.

To bring it back to AJ's point, any time you're dealing with human sensory analysis, you have to assume that there is going to be significant noise. Like the experiments where blindfolded wine experts couldn't always tell red wine from white wine, we are very corrupted instruments. But that DOES NOT mean that all tests are unreliable. It simply means that you have to develop tests which can tease a signal out of the noise.

* Overly simplified, as it is not stored as high/low binary values but based on transitions. But that's getting into the weeds.

Sorry for the long quote, I don't yet know how to quote portions :confused: But thanks B. this makes some sense out of all this "noise". :tank:
 
So this is what it looks like when geeks sword fight! :D

 
They use a method called PRML - partial response, maximum likelihood. In layman's terms, it's basically taking an educated guess and then checking it against the ECC (error correction and checking) codes.
I'd like to broaden that a bit so that it includes the situation of interest here. It assumes a hypothesis and then computes the probability that the data you observe would arise under that hypothesis. The hypothesis that gives the highest probability for the observation is then chosen as the most likely hypothesis given the observation. In what we have discussed so far we do something similar. If we observe N correct answers (out of M panelists) we compute the sum of the probabilities p = P(N|H0) + P(N+1|H0) + ... + P(M|H0), which is the probability that we would see N or more hits out of M under the null hypothesis. If p is very small we figure the null hypothesis is not the right hypothesis for explaining our observation, reject it, publish p and have a beer.

But we can do more. We can compute
P(N|H0)
P(N|H1)
P(N|H2)
...

and choose the largest. P(N|H2) is read "The probability of exactly N correct responses given that hypothesis H2 is true." Now how do we define the various hypotheses? It's pretty clear that H0 represents the hypothesis that our panel cannot distinguish the beers. I have mentioned differentiability in previous posts. This we symbolize by Pd which is the fraction of the population which can distinguish. For H0 clearly Pd = 0. At the other end of the spectrum if the difference is distinguishable by the entire population then Pd = 1. Thus a reasonable set of hypotheses might use values of Pd separated by 1% (0.01). We could then compute

P(N|Pd = 0)
P(N|Pd = .01)
P(N|Pd = .02)
...
P(N|Pd = 1)

and pick the largest.
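
A sketch of that scan (mine, in Python; not A.J.'s spreadsheet): for each candidate Pd compute the binomial probability of seeing exactly N correct out of M, using the triangle-test relation Pc = Pd*(1 - 1/k) + 1/k that comes up a bit further down, and keep the Pd that makes the observation most probable.

[code]
# Hypothesis scan described above: evaluate P(N | Pd = 0), P(N | Pd = 0.01), ...
# and take the Pd giving the largest probability of the observed count.
from math import comb

def p_correct(pd, k=3):
    # probability of a correct answer when a fraction pd can truly distinguish
    return pd * (1 - 1 / k) + 1 / k

def binom_pmf(n, m, p):
    return comb(m, n) * p ** n * (1 - p) ** (m - n)

def mle_pd(n_correct, m, k=3, step=0.01):
    grid = [i * step for i in range(int(1 / step) + 1)]
    return max(grid, key=lambda pd: binom_pmf(n_correct, m, p_correct(pd, k)))

print(f"MLE Pd = {mle_pd(26, 53):.2f}")   # about 0.24, the Pd whose Pc is nearest 26/53
[/code]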


(As we're discussing, readers are somewhat being "misled" or perhaps merely misinterpreting the methods to assume the null hypothesis if p>0.05, but that's a whole different ball of wax.)

I have been hinting in my last few posts that I think we should forget about p and look instead at Pd. Here I am going to propose that we do exactly that. I had hesitated to propose it earlier because I was calculating the MLE (maximum likelihood estimate) of Pd in the way I just described: picking the maximum of a set of computed probabilities, and I didn't think that would be too easy to set up in my little spreadsheet. But then I realized that the MLE is simply the Pd corresponding to the fraction of correct answers. I am hoping that use of the MLE will make it easier to interpret marginal tests and I am even hopeful that consideration of it may shed light on mongoose's path to understanding. And note that it is in compliance with ASTM E1885. They don't tell you how to compute the MLE but they do give instructions (though I think there is a tiny error there) for computing the band within which Pd lies with specifiable confidence.

First the mechanics: the spreadsheet I gave in No. 274 calculates the MLE and the confidence band. Change the label "Pd_ Calculated from Pc_" to "MLE Pd".

A key concept here is the relationship between Pc, the probability of a correct answer and Pd, the differentiability. Given a differentiability of Pd the probability of a correct answer is

Pc = (1)*Pd + (1/k)*(1 - Pd) = Pd*(1 - 1/k) + 1/k

where k is the order of the test. In words it is the probability that someone who can detect the difference will do so (100%) times the fraction of the population that can plus the probability that someone who cannot tell the difference will correctly guess (1/k with k being the order of the test e.g. 3 for a triangle test) times the fraction of the population that cannot tell the difference. So, for example, in a triangle test if the beers are indistinguishable to a panel, the probability of a correct answer will be 1/3. I believe the fact that 1/3 of incompetent panelists 'qualify' is the thorn stuck in mongoose's side. A totally incompetent panel of M members would report something close to M/3 correct answers which ostensibly might seem disturbing. We would estimate Pc close to 1/3 from such a panel's response. But we aren't testing to find the number of correct responses. We are testing to bound Pd. That's what we want to know: how well the panel can distinguish the beers. The formula above is invertible. If we have an estimate of Pc_= (N/M) we can obtain an estimate of Pd from

Pd_ = (Pc_ - 1/k)/(1 - 1/k)

Thus if N/M ~ 1/3 then our estimate of Pd_ is going to be close to 0. IOW, even with guessing allowed, we have a path back to Pd from the 'corrupted' count of correct answers. Guessing also removes or at least ameliorates some of the biases that result when we don't force a choice. See "forced choice testing".

Note that a very qualified panel presented with two beers which are the same will return the same result: Pd_ close to 0. They are not differentiable.

Note that the spreadsheet, in accordance with ASTM E1885 (by which I mean I didn't correct their small error), also computes confidence bands for Pd_. Twenty-six out of 53 correct answers (p = 0.0128) says that the MLE of Pd_ is 0.23, but it also says that Pd_ lies between 0.066 and 0.405 with 95% confidence. Twenty-seven hits implies Pd_ = 0.26 with 0.095 < Pd_ < 0.433 with 95% confidence. Note that the extra correct answer suggests that Pd is higher but that the width of the confidence band is the same. To reduce that band's width we would need to go to larger M or larger k.
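
For anyone who wants to check the arithmetic without the spreadsheet, here is my own minimal Python reconstruction (not ASTM E1885 verbatim): p under the null hypothesis, the MLE of Pd, and a confidence band from the normal approximation on Pc. With z = 1.645 it lands on the 26-of-53 figures above; the exact constants in the standard may differ.

[code]
# Minimal stand-in for the spreadsheet described above: p-value, MLE of Pd, and
# an approximate confidence band for N correct answers out of M panelists.
from math import comb, sqrt

def triangle_stats(n, m, k=3, z=1.645):
    chance = 1 / k
    # p: probability of n or more correct answers if the beers are indistinguishable
    p = sum(comb(m, i) * chance ** i * (1 - chance) ** (m - i) for i in range(n, m + 1))
    pc = n / m                               # observed fraction correct
    pd = (pc - chance) / (1 - chance)        # MLE of the differentiability Pd
    half = z * sqrt(pc * (1 - pc) / m) / (1 - chance)
    return p, pd, max(pd - half, 0.0), min(pd + half, 1.0)

p, pd, lo, hi = triangle_stats(26, 53)
print(f"p = {p:.4f}  Pd_ = {pd:.2f}  band {lo:.3f} to {hi:.3f}")
# roughly: p = 0.013, Pd_ = 0.24, band 0.066 to 0.405
[/code]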
 
[quode=cmac62;8075893]Sorry for the long quote, I don't yet know how to quote portions :confused: [/quote]

I changed quote to quode so it wouldn't be interpreted as a quote.

[quode=cmac62;8075893]But thanks B. this makes some sense out of all this "noise". :tank:[/quote]
 
I read the ASTM. I know stuff now.

If the population of tasters is a bunch of random gumps off the street, I believe Pd is roughly 5-10%.

If the population of tasters consists of 100% self-proclaimed craft brew aficionados, I believe Pd is about 20-30%.

If somewhere in between, somewhere in between.

And if the population is of 100% BJCP certified judges.... surprisingly I think the same 20-30% still applies, or perhaps 35% at best.

Yup.
 
So AJ, would I be correct to say that p is a proxy for MLE Pd?

I.e. if MLE Pd is very high, p will be very low in most tests. If MLE Pd is moderately low, we would expect p will be >0.05 by some degree but still suggest a correlation. And if MLE Pd is zero or VERY low, we would expect the taster's results to approximate guessing and thus p would be VERY high.

Is that fair?

I do think that involving MLE Pd is actually a very helpful aspect in this discussion. Focusing on p and an arbitrary threshold of 0.05 is causing a lot of confusion.
 
This thread is like a painting of Kramer. A loathsome, offensive brute, yet I can't look away.

Ok so no one is loathsome or offensive but I still can't avert my eyes.
 
I read the ASTM. I know stuff now.
Delighted to hear that.

At http://brulosophy.com/2016/01/21/in...t-xbmt-performance-based-on-experience-level/
the Brulosopher(s) posted data on number of correct selections vs. type of beer drinker. Let's see how you did:

If the population of tasters is a bunch of random gumps off the street, I believe Pd is roughly 5-10%.
General Beer Drinker: 0 < Pd < 22% (95% conf.) Pd_ = 10%

If the population of tasters consists of 100% self-proclaimed craft brew aficionados, I believe Pd is about 20-30%.
Craft Enthusiast: 11% < Pd < 36% (95% conf.) Pd_ = 23.5%

And if the population is of 100% BJCP certified judges.... surprisingly I think the same 20-30% still applies, or perhaps 35% at best.
BJCP Certified: 3.7% < Pd < 28% (95% conf.) Pd_ = 16%

BJCP in training: 7% < Pd < 31% (95% conf.) Pd_ = 19%

Now obviously, as Pd depends on beer AND panel, those scores cannot be used to grade judge classes absolutely, but since, presumably, the various classes judged essentially the same assortment of beer types overall, we can feel somewhat comfortable comparing the relative scores. The surprises are that the Craft Enthusiasts were better than the BJCP Certifieds, as were the BJCP in training group. Not really all that surprising if you think about it.

I'd say your guesses were pretty good but that you apparently gave more credit to the BJCP judges than this data shows.
 
Delighted to hear that.

At http://brulosophy.com/2016/01/21/in...t-xbmt-performance-based-on-experience-level/
the Brulosopher(s) posted data on number of correct selections vs. type of beer drinker. Let's see how you did:


General Beer Drinker: 0 < Pd < 22% (95% conf.) Pd_ = 10%


Craft Enthusiast: 11% < Pd < 36% (95% conf.) Pd_ = 23.5%


BJCP Certified: 3.7% < Pd < 28% (95% conf.) Pd_ = 16%

BJCP in training: 7% < Pd < 31% (95% conf.) Pd_ = 19%

Now obviously, as Pd depends on beer AND panel, those scores cannot be used to grade judge classes absolutely, but since, presumably, the various classes judged essentially the same assortment of beer types overall, we can feel somewhat comfortable comparing the relative scores. The surprises are that the Craft Enthusiasts were better than the BJCP Certifieds, as were the BJCP in training group. Not really all that surprising if you think about it.

I'd say your guesses were pretty good but that you apparently gave more credit to the BJCP judges than this data shows.

Cool! Thanks for putting that together.

The last result is indeed intriguing.

For the record, I give BJCP judges far less credit than most people do. I'm the guy who always says, "if you want good scoresheets, you need to enter AT LEAST 3 or 4 competitions to get a large enough population so you can throw out the 60% of the scoresheets that are totally worthless and forget that they ever existed".

And why? Because they get what they deserve, from me anyway.

And I myself am Certified. Not sure if I'm a good judge, though I sure try to be. But odds are highly likely that I'm just another part of the 60-85% who can't discern pah-lish from Poh-lish.

:D
 
So AJ, would I be correct to say that p is a proxy for MLE Pd?
I wouldn't say it's a proxy for it. Were it a good one there would be no need to calculate the MLE or the confidence band surrounding it.

I.e. if MLE Pd is very high, p will be very low in most tests. If MLE Pd is moderately low, we would expect p will be >0.05 by some degree but still suggest a correlation. And if MLE Pd is zero or VERY low, we would expect the taster's results to approximate guessing and thus p would be VERY high.
What I'd love for you to do is punch in that little spreadsheet and push some numbers around. Were you to do that I think you'd find support for these statements.



I do think that involving MLE Pd is actually a very helpful aspect in this discussion. Focusing on p and an arbitrary threshold of 0.05 is causing a lot of confusion.
I did sort of hint that we should toss out p and focus on the MLE Pd. If we did that then the width of the confidence band becomes the substitute for p. The narrower it is, the more confidence we have that the value of differentiability we have calculated is close to the true value. But as it is easy to compute p, Pd and the confidence limits with something as simple as a 14-cell spreadsheet (28 if you count the labels) I see no reason not to compute p as well. They all play together.
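
To "push some numbers around" without the spreadsheet, here is a quick self-contained Python scan of my own (same arithmetic as the reconstruction sketched above) for a hypothetical 30-person panel; as the count of correct answers climbs, p falls and the estimated Pd rises together.

[code]
# p and the MLE of Pd for various counts of correct answers from a 30-person
# triangle-test panel.
from math import comb

M, chance = 30, 1 / 3
for N in (10, 13, 16, 19, 22):
    p = sum(comb(M, i) * chance ** i * (1 - chance) ** (M - i) for i in range(N, M + 1))
    pd = max((N / M - chance) / (1 - chance), 0.0)   # MLE of Pd, floored at 0
    print(f"{N:2d}/{M} correct: p = {p:.3f}  MLE Pd = {pd:.2f}")
[/code]

Which is about what the exchange above suggests: a high MLE Pd goes with a small p, and a near-chance result drives the MLE Pd toward zero while p balloons.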
 
For the record, I give BJCP judges far less credit than most people do.

I always remember the occasion on which I was invited to a party of the local DC beer cognoscenti. Professional brewers, Master judges, National judges. I'm not sure that anyone below National was there. I took growlers of two lagers (Viennas) and asked several people to comparatively taste them. I got the comments I sort of expected to. The darker was richer in flavor, maltier, fuller bodied, sweeter etc. with lots of additional comments about ribes (whatever they are) and things like that. One young lady asked if she could try and after carefully tasting the two said something to the effect "I really don't know much about beer. I just came with my boyfriend. I'm sorry but I really can't taste any difference at all."

I'll bet you can guess where this is going. The two beers were the same beer except that I had dosed one growler with some Sinamar. That young lady was the only one that got it right! In my own defense I did this 'experiment' because I was writing a chapter on color for one of Bamforth's books and had heard him say in a seminar that "We taste with our eyes". But the Panjandrums of D.C. beer were not amused. I never got invited back and one participant wouldn't speak to me for a couple of years (by which time, careful probing revealed, he had forgotten the event).
 
I always remember the occasion on which I was invited to a party of the local DC beer cognoscenti. Professional brewers, Master judges, National judges. I'm not sure that anyone below National was there. I took growlers of two lagers (Viennas) and asked several people to comparatively taste them. I got the comments I sort of expected to. The darker was richer in flavor, maltier, fuller bodied, sweeter etc. with lots of additional comments about ribes (whatever they are) and things like that. One young lady asked if she could try and after carefully tasting the two said something to the effect "I really don't know much about beer. I just came with my boyfriend. I'm sorry but I really can't taste any difference at all."

I'll bet you can guess where this is going. The two beers were the same beer except that I had dosed one growler with some Sinamar. That young lady was the only one that got it right! In my own defense I did this 'experiment' because I was writing a chapter on color for one of Bamforth's books and had heard him say in a seminar that "We taste with our eyes". But the Panjandrums of D.C. beer were not amused. I never got invited back and one participant wouldn't speak to me for a couple of years (by which time, careful probing revealed, he had forgotten the event).

The same was done with wine. The same wine colored white vs red resulted in traditional "white" descriptors for the white, and traditional "red" descriptors for the red.
 
The same was done with wine. The same wine colored white vs red resulted in traditional "white" descriptors for the white, and traditional "red" descriptors for the red.

I participated in an experiment like that during a talk at NHC a while back (Baltimore I think?). My wine expertise is nonexistent but I could tell something was fishy. But the exact same comments were thrown about.

Bias and such. No one is immune to it.

I will say though (and this again could well be bias as I've never done it blind) that I get an ever so slight flavor from Sinamar (more so than from brewer's caramel). But it's a very slight hint of roast. That's it. None of the descriptors others were offering.
 
I will say though (and this again could well be bias as I've never done it blind) that I get an ever so slight flavor from Sinamar (more so than from brewer's caramel). But it's a very slight hint of roast. That's it. None of the descriptors others were offering.

There's a candidate for a triangle test.
 
There's a candidate for a triangle test.

Wouldn't this be pretty difficult to test in a triangle test? How could you prevent the tasting panel from noticing the color difference? Blindfolds?


This is one problem I struggle with in interpreting the triangle test results. Beer is a highly visual product. The standard cliche is that you taste with your eyes. Examples above illustrate this effect. But the triangle tests are always done in opaque cups with significant attention paid to achieving similar fill levels and foam. Why is this? My theory is that our eyes are our most sensitive sense by far. Small differences in color or turbidity can be quite easy to detect. Possibly this is because it is possible to look at both at the same time while tasting, while aroma and mouthfeel all must be tested sequentially. I don't have an answer to this, and clearly the standards come down on the side of eliminating visual cues, but I still wonder how this can be a fair way to test.
 
Wouldn't this be pretty difficult to test in a triangle test? How could you prevent the tasting panel from noticing the color difference?
There would certainly be challenges.

Blindfolds?
Not a bad idea. The usual solution is opaque cups, but you'd have to cover the top too, and that presents a problem as the panelists would not be able to smell the beer. Swirl, sniff and sip isn't going to be easy if one has to drink through a straw. Empanel blind people? Do it in a dark room?

At the moment the best I can think of (and it's your idea, not mine) would be to have the panelists blindfolded, designate the cups to them as left, center and right and have them report their findings into a recorder (or microphone connected to a scribe in another room - no telegraphing from the scribe to the panelist).



This is one problem I struggle with in interpreting the triangle test results.
It was quite a while ago (measured in post numbers) that I found and joined this thread, but my earliest comments were that there were many potential pitfalls for those wishing to use triangle tests and that if there were problems with what Brulosophy has done it was probably because they fell into one or more of them. Thus it is important that investigators plan and execute carefully and report every detail of what they did.

Beer is a highly visual product. The standard cliche is that you taste with your eyes. Examples above illustrate this effect.
The 'experiment' described in No. 347 shows that it apparently depends on one's training.


But the triangle tests are always done in opaque cups with significant attention paid to achieving similar fill levels and foam. Why is this? My theory is that our eyes are our most sensitive sense by far. Small differences in color or turbidity can be quite easy to detect.
It seems that almost any change in process or materials would produce a change in color. I brew the same beers over and over again, and while their SRMs are similar they are not the same, nor are their higher-order color parameters the same. Thus it is, IMO, very important that color information be denied to the panelists. I have hinted in earlier posts that I am not too sanguine about the opaque-cups approach by suggesting that Sinamar be used to mask color, and that's why I'd love to know whether added Sinamar is detectable. Now if we are dealing with, say, 7 SRM beers and find that the new process increases the color to 8 SRM, I'd guess we could dose both with Sinamar to the extent of, say, 30 SRM without having the Sinamar introduce a detectable difference. At the same time I doubt that augmenting the color of a 7 SRM beer to 8 SRM would make much taste difference (Sinamar, even neat, tastes pretty flat to me). But then we'd have to worry about the higher-order color parameters. A 1 SRM color change isn't going to mask those.

Possibly this is because it is possible to look at both at the same time while tasting, while aroma and mouthfeel all must be tested sequentially.
Possibly, and that suggests a way of reducing the potential effects of small color differences: put the cups behind little doors only one of which can be open at a time.

I don't have an answer to this, and clearly the standards come down on the side of eliminating visual cues, but I still wonder how this can be a fair way to test.
While I think triangle testing is pretty cool, especially from the detection/estimation theory point of view (you may have guessed at this from my posts), I have always wondered how one could possibly get around exactly this problem. One way would be to demonstrate, with a triangle test, that small color differences aren't detectable. Another might be to look at the Pd signal. If it is larger than, say, 0.5, we might be suspicious that something is telegraphing the right answer to the panelists.
 
I don't see why this would be an immediate knee-jerk reaction. Professional chefs don't think home cooks are full of BS.

But what do I know, I'm just a wannabe chef.

:ban:
 
Why is that?

Because it's so easy to get a false positive. Also because there's pretty much no way not to convert if you're at the right temp for a long enough time. I haven't done an iodine test in 20 years and haven't felt a need to.
 
I guess I was thinking of doing it at 20-30 minutes and sparging from there if the conversion is done. That is one way to save a few minutes on brew day :ban:

You may have converted starches to sugars, but what type of sugars? There's a reason to mash longer.
 
In trying to come up with a clearer explanation for mongoose as to why guessing is not detrimental in a triangle test but in fact beneficial, it occurs to me that, for him at least, this may be a veridical problem, i.e. one whose correct solution is so counterintuitive that it cannot be accepted by a large number of people, even very qualified ones, despite overwhelming evidence that it is indeed correct. The classic example is the Monty Hall problem. If you are not familiar with it, or even if you are, check the Wikipedia article. It's a fascinating read. If this is the case, additional explanation isn't going to change anything, but the explanation I hope to give here may be of interest to those who are on board, as I hope to collect the stuff scattered through the thread into one place.

At first glance the question seems to be: given two beers produced by slightly different processes or from slightly different materials bills, "Are the two beers different?" We realize pretty quickly that the answer to that question is obtained with spectrophotometers, densitometers and GC-MS machines and isn't what we want to know in most cases. What we really want to know, in most applications, and certainly the home brewing one, is "Are there perceptible differences between the two beers?" As the next step we realize that perception depends on the perceiver as well as the beers, and thus we are forced to modify the question to "Are there differences in the beers perceptible to a population of interest?" Because we can't test the whole population we select a group of people whom we hope represents the population of interest well. We then do tests involving the selected subgroup (panel) to determine whether they can tell the difference or not and apply that to the population before making a decision to modify our process or not.

Example: I belong to a brew club and take my beers to its meetings. I am gratified when the members say "Hey, A.J., you brew great beer." I've achieved that position by specializing in triple decocted lagers. I'm getting to be an old geezer and a triple decoction brew day is just too long for me now so I'd be interested in knowing if dropping two (or even three) of the decoctions would still get me the "great beer" comments. Here the population of interest is the membership of my brew club and I would draw panel members from the club membership rolls. The attendees at my club's meetings include people who are there for the first time, home brewers who proudly offer up their second batch, homebrewers who have been doing it for 40 years, BJCP judges in training, certified judges of all ranks up to Master, professional brewers and the wives and sweethearts of all the above. In choosing a panel I think mongoose would want me to eliminate guessing from the test or if unable to do that at least eliminate the first timers and low experience home brewers as they are more likely (the theory goes) to guess in a triangle test. But I don't want to eliminate guessing because if dropping the decoctions makes no difference even the best judges are forced to guess and I want to detect that. And I don't want to empanel only the best judges because test results from such a panel answers the question "Are there differences in the beers perceptible to the best judges in my club?" which is likely going to have a different answer than the question "Are there differences in the beers perceptible to the membership of my club?"

Now we must consider what it means to be perceptible. In a lot of explanations of triangle tests "perceptible" means simply "not imperceptible" and we process the data to calculate the probability that we got the data we did under the assumption that the beer differences are imperceptible. If that probability (p) is low we say that it is unlikely that we would get so many answers indicating there is a difference if there were really no difference and say that there is a statistically significant difference. The lower p the more confident we are. If p is less than some number we accept the result.

But there are levels of detectability (called differentiability here). Clearly, if there are no detectable differences, no one in the population would find differences. If the beers were a Bud Light and a Guinness Tropical Stout we would expect everyone in the population to detect the differences. In the former case we have Pd = 0 and in the latter Pd = 1, where Pd is the fraction of the population detecting the differences. Thus in exploring the question "Are there differences in the beers perceptible to the membership of my club?" we really want to rephrase it, for the last time, as "What fraction of my club membership perceives a difference in the beers?" When we have an answer to that question we can make an informed decision as to whether we want to drop decoctions or not. If the experimental results showed that the 95% confidence band lies between 0 and 10% of club members able to differentiate for the general population, and between 5 and 20% for the most experienced judges, then I'd probably decide to drop the decoctions if my motivation is merely to have the club members enjoy my beer. But if my goal were to impress the big dogs I might not so decide.

The information I really need to make any intelligent decision about what to do or not do is Pd. This is the fraction of panel members that got it right because they could detect the difference. These are the 'qualified' tasters. 1 - Pd is the fraction that had to guess. One third of those (in a triangle test) got it right. Thus the probability of a correct answer from a panelist who is given k samples of the beer, of which k - 1 are the same and the other different, and who is asked to choose which is different, is
Pc = Pd + (1 - Pd)*(1/k)
We can easily estimate Pc from the fraction of correct answers returned by the panel and from this we can obtain an estimate of Pd
Pd_ = (Pc_ - 1/k)/(1 - 1/k)

This is the classic estimation problem in which we have a state variable, Pd, whose value we want to estimate and an observable, Pc, related to it in a known way.
The thing we use to make 'actionable' decisions is our estimate of the fraction of the population that can tell the difference, and that is one minus the estimated fraction of guessers, as determined from the fraction of correct answers. Guessers are not a problem with the method: they are the lifeblood of it. This just makes a lot of sense to me. The more people have to guess, the weaker the signal from the beers. I have previously gone on about how guessing introduces noise but that we have ways of dealing with noise and pulling the signal out. That's really not quite applicable here. Pd is a function of signal and noise, but what we want here is a measure of the SNR.
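
A tiny simulation of my own (Python) of the "guessers are the lifeblood" point: give a simulated panel a true Pd, let everyone who can't detect the difference guess among three cups, and the estimate Pd_ = (Pc_ - 1/k)/(1 - 1/k) still comes back centered on the true value, even though a third of the guessers 'qualify' on any given run.

[code]
# Simulated triangle-test panel: a fraction true_pd of panelists genuinely
# detect the difference, the rest guess among k cups, yet the estimator
# recovers the true Pd on average.
import random

def run_panel(true_pd, m=53, k=3):
    # each detector answers correctly; each guesser is right 1/k of the time
    correct = sum(
        1 if random.random() < true_pd else (random.random() < 1 / k)
        for _ in range(m)
    )
    pc = correct / m
    return (pc - 1 / k) / (1 - 1 / k)   # estimate of Pd (can dip below 0 by chance)

random.seed(42)
trials = 2000
for true_pd in (0.0, 0.2, 0.5):
    avg = sum(run_panel(true_pd) for _ in range(trials)) / trials
    print(f"true Pd = {true_pd:.1f}  mean estimated Pd = {avg:+.2f}")
[/code]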

That should help, but if we do indeed have a veridical situation here it won't. I remember a colleague had the company's brand new SPARCstation running Monte Carlo simulations of the Monty Hall problem over the weekend. When he came in on Monday and found the machine had given the right answer he checked his code and ran it again overnight. Don't remember if he ever saw the light.
 
You may have converted starches to sugars, but what type of sugars? There's a reason to mash longer.

Thanks Denny, I kind of knew this, but was trying to be cheap. :D It's not like I'm not already putting 4 or 5 hours aside on brew day, and I have always mashed for at least an hour, usually longer because I forget to get my strike water hot early. Anyway, thanks for the reminder.

:off: Question: If I want more unfermentable sugars is a shorter mash an option? I understand a higher mash temp (156-160f) does this, but it would be good to know. :mug:
 