Tuesday, September 19, 2006

CSI: the explanation

(No I'm not talking about Crime Scene Investigation, fool!)

OK, so I really should be focusing on the Protein Challenge. However, being easily sidetracked, I got thinking about Dembski's concept of CSI - Complex Specified Information. It's a surprisingly hard concept to understand, not least because AFAICT Dembski makes it as difficult as possible to do so.

The basic principle is that evolutionary processes aren't supposed to be able to produce structures that are improbable (for a sufficiently well-defined value of "improbable"), complicated (ditto) and specified. The concept of a specification is basically an attempt to extend the basic probabilistic concept of an event to handle post-hoc reasoning. A specification defines a target space of possible things that can happen.

The argument goes as follows:

1) Chance processes tend not to give results that are both unlikely and specified. So, for example, drawing 13 cards and getting all spades is highly unlikely, and you'd rightly assume that someone had tinkered with the deck.

2) Natural processes (regularities) tend not to give results that are both complex and specified. For example, a snowflake may be complex, but its precise shape isn't specified in advance - you'll never get the same one twice.

3) Hence, anything that is complex, improbable and specified is most likely the result of intelligent intervention (nb. human brains apparently don't qualify as natural processes).

There are some problems when you try to extend this to evolution and genetic algorithms and so on - both are quite capable of generating complex, improbable, extremely useful systems. Dembski gets round this by saying that genetic algorithms can generate CSI if and only if the target space associated with the specification represents a local optimum of the fitness function. GAs work if and only if the problem you feed them (fitness function) is actually the one you want solved (specification).

That's why examples like the infamous "methinks it is like a weasel" work - the specification we choose (the text) is 'coincidentally' identical to the optimum of the fitness function. Dembski, if I understand correctly, points out that, unless we select our specification to correspond to the fitness function we're using, we still won't generate CSI. We'll have complex information, but it won't match the right specification. As such, he feels justified in saying that, in feeding the GA the right problem for our desired solution, we're "smuggling" CSI into the system.
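The weasel setup is simple enough to sketch in a few lines. Here's a minimal, illustrative Python version (the parameters and names are my own choices, not Dawkins' original code) - note how the fitness function literally is the specification:

```python
import random
import string

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = string.ascii_uppercase + " "

def fitness(candidate):
    """Characters matching the target. The fitness function and the
    specification are one and the same here - which is exactly the
    'smuggling' Dembski complains about."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(parent, rate=0.05):
    """Copy the parent, randomising each character with probability rate."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in parent)

random.seed(1)  # fixed seed so the run is repeatable
parent = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
generations = 0
while parent != TARGET:
    # Cumulative selection: breed 100 offspring, keep the fittest
    # (including the parent, so fitness never decreases).
    parent = max([mutate(parent) for _ in range(100)] + [parent], key=fitness)
    generations += 1
```

Where blind random search would need on the order of 27^28 (roughly 10^40) tries, cumulative selection typically hits the target in a few hundred generations - precisely because we pointed the fitness function at the specification.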

The problem here is that, looking at the "fitness function" to which real-world genes are exposed, we see that it's basically something along the lines of "ability to survive and breed". In that context, the ability for a gene or combination of genes to produce something like a flagellum would certainly be of value for survival, and hence could represent a local optimum of the fitness function. The flagellum could evolve despite its CSI, because evolution would be selecting for the same underlying trait that we're basing our specification on - ability to live long and prosper.

Thus, simply by basing his specifications on the functionality of a system, Dembski is setting up a range of target spaces that evolution can quite definitely find. It's something of a Texas Sharpshooter issue - Dembski is running round painting targets around all the areas that evolution by natural selection is naturally inclined to hit.

Key terms:

Complex - refers to Kolmogorov complexity, best thought of as a measure of how easily a system can be described. So, for example, "AAAAAAAAAAAAAAAAA" would be low-complexity, "AABBCCDDEEFFGGHH" would be higher, and a random string like "BJECDWYIVFYUEUBUFIIHI" would be highest.
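Kolmogorov complexity itself is uncomputable, but compressed length gives a crude upper bound, which is enough to see the ordering in the examples above. A quick sketch using Python's zlib (the exact byte counts depend on the compressor and aren't meaningful in themselves - only the relative ordering is):

```python
import zlib

def approx_complexity(s):
    """Compressed length as a rough stand-in for Kolmogorov complexity.
    Compression only upper-bounds the true (uncomputable) quantity."""
    return len(zlib.compress(s.encode()))

low = approx_complexity("AAAAAAAAAAAAAAAAA")       # highly repetitive
mid = approx_complexity("AABBCCDDEEFFGGHH")        # patterned
high = approx_complexity("BJECDWYIVFYUEUBUFIIHI")  # random-looking
```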

Information - refers to Shannon information, also known as the "surprisal" of a system. So, for example, "EEEEEEEEEEEEEEEEEE" would be fairly low-information because E is a common letter - it doesn't surprise us to see it. "I LIKE FISH" would be higher, as not all of its components occur with such frequency. "XXXXXXXXXXXXXXXXXXXXXX" would be very unexpected (except in the context of really strong beer) so gets a high "surprisal" value.
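The surprisal idea is easy to compute directly: each character contributes -log2(p) bits, where p is how often that character turns up. A quick Python sketch, using rough illustrative English letter frequencies (not an authoritative table):

```python
import math

# Rough English letter frequencies - illustrative values only.
FREQ = {"E": 0.127, "T": 0.091, "A": 0.082, "I": 0.070, "S": 0.063,
        "H": 0.061, "L": 0.040, "F": 0.022, "K": 0.008, "X": 0.0015,
        " ": 0.180}

def avg_surprisal(text):
    """Average Shannon information per character, in bits."""
    return sum(-math.log2(FREQ[c]) for c in text) / len(text)

print(avg_surprisal("EEEEEEEEEEEEEEEEEE"))      # ~3.0 bits: E is common
print(avg_surprisal("I LIKE FISH"))             # ~4.1 bits: mixed letters
print(avg_surprisal("XXXXXXXXXXXXXXXXXXXXXX"))  # ~9.4 bits: X is rare
```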

Target space - refers to a set of states that we'd like the system to end up in. So, for example, the target space of a system composed of lots of bits of wood might be a bookshelf.

Search space - refers to all the states that a system could end up in. So, for example, the search space of a system composed of bits of wood could include both bookshelves and mere piles of planks.

Specification - a simple delineation of the target space. For example, the specification "bookshelf".

Genetic algorithm - a program that attempts to imitate evolution in a model system.

Fitness function - something that allows a GA to tell which of a group of organisms is the most "fit". In the real world, the primary attributes of the fitness function are ability to survive (natural selection) and ability to attract mates (sexual selection).

Local optimum - an area of the search space where there are no small changes that can increase the corresponding organisms' fitness. If you think of fitness as corresponding to height on a graph, the local optima are the peaks of the resulting mountain range.
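The "no small change improves fitness" idea can be shown with a toy hill-climber. A minimal sketch over an invented one-dimensional landscape (the numbers are arbitrary): an organism that only takes single steps gets stuck on whichever peak is nearest, local or global.

```python
# Toy fitness landscape: index = genotype, value = fitness.
landscape = [1, 3, 5, 4, 2, 6, 9, 7]

def hill_climb(start):
    """Step to a fitter neighbour until none exists - the stopping
    point is, by definition, a local optimum."""
    pos = start
    while True:
        neighbours = [n for n in (pos - 1, pos + 1) if 0 <= n < len(landscape)]
        best = max(neighbours, key=lambda n: landscape[n])
        if landscape[best] <= landscape[pos]:
            return pos
        pos = best

print(hill_climb(0))  # climbs to index 2 - a local peak (fitness 5)
print(hill_climb(4))  # climbs to index 6 - the global peak (fitness 9)
```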


Paul (probably - maybe Liz) said...

Part of the issue here is: what is the size of all these targets? There are certainly lots of them. The implication of what Dembski writes is that, however many there are, the density is still too low (less than 1 part in 10^150, or similar) for natural selection to find them. Your argument is that it's high enough to be reachable. But nobody seems prepared to attach some real numbers to this. That's what I'm trying to kick around on my blog.

Paul (probably - maybe Liz) said...

PS Congrats on the job.

Dave Thomas said...

Check out the War of the Weasels, an article with links to a summer series of posts initiated with a Panda's Thumb thread titled Target? TARGET? We don’t need no stinkin’ Target!

Cheers, Dave Thomas

Lifewish said...

Part of the issue here is: what is the size of all these targets?

Actually, that's not really an issue. Even Dembski doesn't dispute that natural processes like evolution can hit really small targets in a reasonable length of time (iirc, he's referred to evolution as a "probability amplifier" - if you like I'll see if I can find the quote). As best I can understand, his main concern is whether that counts as a disproof of CSI or whether it's in some sense cheating.

For an example, I draw your attention to the posts that Dave mentioned (I was just searching through PT to find them when I noticed he'd already listed them!). The genetic algorithm succeeded in finding a solution to a Steiner problem that was incomparably better than anything that random searching could produce.*

The response from Uncommon Descent, as best I can recall, was that it was cheating to present the algorithm with the problem it was supposed to be solving. This is precisely the concern that my post is intended to address - if that is cheating then so are Dembski's calculations.

PS Congrats on the job.

Thanks :)

* Amusingly, one of the evolved solutions (the true Steiner network) was also better than the solutions that Salvador Cordova came up with. However, I can hereby divulge that my name is on the list of people who found that correct solution. Yup, I'm a geek.