That discussion is about testing hypothesis H1: "Tasters with more experience are better able to discriminate differences in beers than those with less". The null hypothesis then is H0: "The ability to distinguish differences in beers is not related to experience."
In the post he lists the percentage of tasters who correctly detected the odd beer, by experience category:
General Beer Drinker: 40%
Craft Beer Enthusiast: 49%
Home Brewer: 43%
BJCP in Training: 46%
BJCP certified: 44%
If we assume that the level of experience ascends as we go down the list, then there is a correlation between experience and performance, but a weak one. Pearson's r is only 0.235, and the probability of seeing an r that high under H0 is 35%, which lends very weak support to the notion that performance is related to experience.
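The arithmetic behind those figures can be sketched as follows. The ranks 1 through 5 encode the assumption stated above (experience ascending down the list), the test is taken to be one-tailed, and the p-value uses the closed-form t CDF for 3 degrees of freedom (n = 5 data points):

```python
import math

# Detection rates from the post, in the order listed:
# General Beer Drinker, Craft Beer Enthusiast, Home Brewer,
# BJCP in Training, BJCP Certified.
rates = [40, 49, 43, 46, 44]
ranks = [1, 2, 3, 4, 5]  # assumed experience levels, least to most

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def one_tailed_p(r, n):
    """P(r at least this large under H0), via the t statistic with
    n - 2 = 3 degrees of freedom; the df = 3 t CDF has a closed form."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    cdf = 0.5 + (math.atan(t / math.sqrt(3))
                 + math.sqrt(3) * t / (t * t + 3)) / math.pi
    return 1 - cdf

r = pearson_r(ranks, rates)
print(round(r, 3))                   # ≈ 0.235
print(round(one_tailed_p(r, 5), 2))  # ≈ 0.35, i.e. 35%
```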
Now let's slaughter a sacred cow. Let's assume that the average craft beer enthusiast is actually more experienced (whatever that means) than BJCP judges. After all, the enthusiast may have been critically tasting craft beers for many years while the BJCP judge may have been at this (albeit with great enthusiasm) for but a year or two. Certainly the data suggest that craft enthusiasts have more relevant experience than BJCP judges, as they scored better.
Now let's slaughter another sacred cow and assume that the BJCP judge in training is actually a better judge (has more relevant experience) than the average certified judge. The data suggest that this may be the case, as the judges in training performed better, and I know my own beer tasting skills were better when I was in the thick of training (weekly training panels with other judges) than they have ever been since. Anyway, this is my analysis and I can make any assumptions I want. This may lend some insight into the engineer's joke I mentioned in an earlier post, i.e. that all the statisticians on earth laid end to end wouldn't reach a conclusion.
With my new assumptions the 'levels of experience' are now in the same order as the performances and the conclusion is very different. For this rearranged data set Pearson's r is 0.988 and the probability that we would see r that big or bigger under the null hypothesis is only 0.08%. We are now on solid ground rejecting the idea that experience doesn't make a difference.
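The same sketch with the ranks permuted reproduces these figures; the rank assignment below is my reading of the rearranged ordering (experience reassigned so that its order matches performance):

```python
import math

# Detection rates as before, but experience ranks reassigned so that
# the assumed ordering matches performance:
# General (40%) = 1, Home Brewer (43%) = 2, BJCP Certified (44%) = 3,
# BJCP in Training (46%) = 4, Craft Beer Enthusiast (49%) = 5.
rates = [40, 49, 43, 46, 44]
ranks = [1, 5, 2, 4, 3]

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def one_tailed_p(r, n):
    """One-tailed p-value via the t statistic with n - 2 = 3 degrees
    of freedom, using the closed-form df = 3 t CDF."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    cdf = 0.5 + (math.atan(t / math.sqrt(3))
                 + math.sqrt(3) * t / (t * t + 3)) / math.pi
    return 1 - cdf

r = pearson_r(ranks, rates)
print(round(r, 3))                   # ≈ 0.988
print(round(one_tailed_p(r, 5), 4))  # ≈ 0.0008, i.e. 0.08%
```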
Bottom line here is that testing whether experience makes a difference or not depends very much on how we define experience.
The reason for the linked post is that readers, and the experimenters, have been concerned that many of the experiments result in answers that don't carry statistical significance at the levels we like. The author is seeking an explanation and has focused on his panels. This is definitely the right thing to do, but it overlooks the fact that the power of a panel depends on the signal to noise ratio, which depends on the beers as well as the panel. The more different the beers, the louder the signal. The better the panel's skills, the less the noise.

Clearly he is dealing with signal to noise ratios at which half or fewer of the panelists are going to be able to detect the odd beer in a triplet. Assuming the number is half, and that he is running panels of 20, then 10 would be expected to pick the right beer on average. The statistical significance at the 10-out-of-20 level is 9.2% - not significant relative to the usual maximum acceptable level of 5%. A more powerful test is needed to attain significance. Improving the panel to the point where the signal to noise ratio is such that 60% choose correctly would imply that 12 out of 20 would be successful on average. The significance level associated with 12 out of 20 is about 1.3%.

Assuming the pool of tasters is what it is, improving the panel is going to be tricky. They could be given a tasting test, for example, but that would have to be done with care. Empaneling only tasters who can demonstrate the ability to tell the difference between the very beers you want to test is clearly folly, so you would have to test them on some other beers - but how would you choose those? This gets back to the earlier discussions (that so infuriated some) of matching the panel to the demographic you are interested in.
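The significance figures for the panels of 20 can be checked with the exact binomial tail under the null hypothesis of pure guessing (chance of 1 in 3 per triangle); a minimal Python sketch:

```python
from math import comb

def tail_prob(k, n, chance=1/3):
    """P(at least k of n panelists choose correctly) under the null
    hypothesis of pure guessing; in a triangle test chance = 1/3."""
    return sum(comb(n, i) * chance**i * (1 - chance)**(n - i)
               for i in range(k, n + 1))

print(round(tail_prob(10, 20), 3))  # ≈ 0.092 - not significant at the 5% level
print(round(tail_prob(12, 20), 3))  # ≈ 0.013 - significant at the 5% level
```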
A test in which 85% of tasters choose correctly, when those tasters were chosen from a pool that has already demonstrated that they can taste the difference, doesn't tell you much about the man on the street, or the average home brewer, or in fact about any demographic but the one you have sampled - those you already know can distinguish the beers. But improving the panel isn't the only way to increase the SNR and thus the significance. Simply making the panel larger does the same thing. If the panel size were doubled to 40, then we'd expect numbers like 20 correct from a panel with a 50% probability of choosing correctly when the beers are distinguishable. The significance associated with 20 out of 40 is 2.1%.
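The 40-taster figure comes from the same binomial tail under pure guessing; a short sketch:

```python
from math import comb

def tail_prob(k, n, chance=1/3):
    # Exact binomial tail under pure guessing (triangle test chance = 1/3).
    return sum(comb(n, i) * chance**i * (1 - chance)**(n - i)
               for i in range(k, n + 1))

print(round(tail_prob(20, 40), 3))  # ≈ 0.021 - 20 out of 40 clears the 5% bar
```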
From this we conclude that while panels of 40 are tougher to handle than panels of 20, doubling the panel may be all they need to do to gain the significance they desire. The fact that they seem to be unaware of this is a bit disturbing. I haven't delved far into the site, but it seems that they are attempting to apply statistical methods to their results (which is commendable) without understanding how to do so. They are certainly not alone in this. They should consult someone who knows how these things are done, or shell out the $45 for a copy of ASTM E1885, Standard Test Method for Sensory Analysis - Triangle Test.