The problem is the total number of variables involved in brewing. It is almost impossible to keep all of them exactly the same, except for the variable(s) under test. Replicating the brews randomizes the uncontrolled variables. The more replicates, the more confidence you have that the observed difference is due to the variable(s) of interest.
Brew on
But that very difficulty--if it's real--isn't certain to average out over multiple brews. If there truly are that many variables whose control is in question, you'll need more than three brews. And there's no way to know whether differences would "randomize" from batch to batch. In fact, there's reason to believe they wouldn't.
I do think there's room here for a real test of taste--not just whether people can tell the beers are different, but whether most agree that the non-DO beer is superior.
There are in fact flaws in how Brulosopher evaluates results--and that's not a knock on him in general, because as far as I know he's the only guy trying to figure this stuff out in a systematic way. For that he has my great respect.
However, the fact that people can tell two beers apart tells us nothing about which is better. My favorite example is this exbeeriment:
http://brulosophy.com/2016/04/04/si...-brudragon-collaboration-exbeeriment-results/
The results were "significant" at p<.001, with 128 tasters. Of those, 66 correctly identified the odd one out. But the results are presented as if all 66 did so by tasting a difference rather than by luck. In other words, we don't know how many simply guessed right and how many could truly distinguish between the beers.
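To put numbers on the guessing problem: in a triangle test a taster has a 1-in-3 shot by pure luck, and the standard sensory-analysis move is to back the expected guessers out of the correct count. Here's a rough sketch in Python (scipy assumed; the 128 and 66 figures are from the exbeeriment above):

```python
from scipy.stats import binomtest

n_tasters = 128   # panelists in the exbeeriment
n_correct = 66    # correctly picked the odd beer out
p_chance = 1 / 3  # triangle test: 1-in-3 odds of a lucky guess

# Exact binomial test: could 66/128 plausibly be pure guessing? (No.)
print(binomtest(n_correct, n_tasters, p_chance, alternative="greater").pvalue)

# Standard discriminator estimate: if d tasters truly taste a difference
# and the rest guess, expected correct = d + (n - d) * p_chance.
d_est = (n_correct - n_tasters * p_chance) / (1 - p_chance)
print(f"estimated true discriminators: {d_est:.0f} of {n_tasters}")
```

That works out to roughly 35 genuine discriminators; the other ~31 "correct" tasters are, statistically, the lucky guesses you'd expect.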
[The answer to this is validating tasters to see that they can reliably identify the odd-one-out, repeatedly, and not just once. But I digress...]
Then there are the results on which beer was better: of the 66 who correctly identified the odd one out, 33 favored one beer, 26 the other, 3 said there was no difference, and 4 said the beers were different but had no preference.
So here we have very "significant" results, but no clear preference as to which beer is better. In other words: there's no real evidence that one is better than the other, just that they're different.
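The same arithmetic backs that up: treat the 33-vs-26 split as a binomial and it's indistinguishable from a coin flip. A quick sketch, same assumptions as above:

```python
from scipy.stats import binomtest

# 33 preferred one beer, 26 the other (ignoring the 7 with no
# difference / no preference). Two-sided test against a 50/50 split:
print(binomtest(33, 33 + 26, 0.5).pvalue)  # well above 0.05
```

So even among the tasters who passed the triangle test, the preference data can't rule out a coin flip.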
So--if anyone wants to test DO as it relates to quality, they must show not only that there's a difference, but also that there's overwhelming agreement it's better than the old way of doing things.