Friday, August 25, 2006

The Task Ahead

And the Protein Challenge is underway! But of course, before I can start delving into the literature there's some stuff we need to get out of the way first. Namely: what are we actually trying to test?

Why so uptight?

The reason I think it's important to scope this challenge out in detail beforehand can be expressed in two words: goalpost shifting. Now, Paul is very good on this front compared to the vast majority of online debaters, but even he has his moments, and I'm sure I do too. Scoping everything out in advance means that neither of us can get confused (accidentally or otherwise) about what we were setting out to demonstrate. As I know from long hours of arguing with my sister, this can save much acrimony in later life.

The proof of the pudding

What is it we're setting out to achieve? The goal of this challenge is to answer the concerns Paul expressed in the following comment:
"When a protein emerges ... it will tend to improve its performance towards its local maximum." I am happy with this in some contexts - where there is strong selection. But I would like to see this at least demonstrated in representative computer models before I would accept that this works in nature.


This challenge (there may be others) is therefore aimed specifically at seeing whether evolutionary processes can efficiently optimise the functionality of a protein. This is not going to be easy - working out computationally what protein a given DNA sequence produces is hard, and working out the effect of that protein is harder still. As such, we'll need to break the problem down a bit (see later).

The groundrules

1) This challenge will only be considering the protein-space of one protein task. This could be, for example, the ability to catalyse a given reaction. Selection of the task will be contingent on the ability to find both a number of different proteins in the task's protein-space and a means of determining success at the task. This challenge will not consider:
a) the ability of organisms' genomes to find the protein-space in the first place
b) the origins of the genome itself
c) [more negatives will be added here as necessary]

2) This challenge will proceed using only biologically-realistic information:
a) All minor properties of the GA (genetic algorithm) will be accompanied by journal or textbook references supporting their acceptability
b) The model used for determining the efficacy of proteins will be tested for accuracy before use. If it's not accurate, I'll go back to the drawing board and pick a more accurate but harder option.

3) On completion, the resulting GA will be initialised with a population of proteins that are distinctly suboptimal at their task.

4) The party whose conclusions were not supported by the program's results will (if they so wish) have two weeks to pinpoint any unrealistic components, citing academic papers where appropriate. If any are found, the GA will be changed as necessary and rerun. The loser this time round will have only one week to point to inaccuracies.

The breakdown

I've already discussed the primary division between setting up the fitness function and incorporating it into a GA. The hard part here is the first bit, which I see as breaking down into three essential components:

1) Figuring out what a given protein's physical properties (shape, configuration, charge distribution) are

2) Figuring out how the protein will perform with respect to the protein-space's assigned task

3) Figuring out what effect that'll have on the whole organism

That effect (fitness) can then be used to determine each organism's survival chances, as is necessary for the GA to function.
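As a rough illustration of how those three stages might chain together, here is a minimal Python sketch. Everything in it is a placeholder assumption rather than the model I'll actually use: the "physical property" is just the fraction of hydrophobic residues, the efficacy curve and its optimum are arbitrary, and the names (genotype_to_phenotype, phenotype_to_efficacy, efficacy_to_fitness, survivors) are invented for this example. The only point is the shape of the pipeline and how fitness feeds into survival.

```python
import random

HYDROPHOBIC = set("AVILMFWC")  # crude, illustrative grouping only


def genotype_to_phenotype(sequence):
    """Stage 1 placeholder: reduce a sequence to one 'physical property'.

    Here that property is simply the fraction of hydrophobic residues.
    """
    return sum(residue in HYDROPHOBIC for residue in sequence) / len(sequence)


def phenotype_to_efficacy(hydrophobic_fraction, optimum=0.5):
    """Stage 2 placeholder: efficacy is 1.0 at an arbitrary optimum, falling to 0.0."""
    return max(0.0, 1.0 - abs(hydrophobic_fraction - optimum) / optimum)


def efficacy_to_fitness(efficacy, baseline=0.1):
    """Stage 3 placeholder: even a useless protein leaves some chance of survival."""
    return baseline + (1.0 - baseline) * efficacy


def fitness(sequence):
    """Compose the three stages: genotype -> phenotype -> efficacy -> fitness."""
    return efficacy_to_fitness(phenotype_to_efficacy(genotype_to_phenotype(sequence)))


def survivors(population, rng):
    """Each organism survives with probability equal to its fitness."""
    return [seq for seq in population if rng.random() < fitness(seq)]


if __name__ == "__main__":
    rng = random.Random(42)
    population = ["AVILGSTN", "GGGGGGGG", "AVILMFWC"]
    for seq in population:
        print(seq, round(fitness(seq), 3))
    print(survivors(population * 5, rng))
```

Keeping the stages as separate functions means any one of them can be swapped for a more realistic model without touching the others.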

Once I've done that I'll go into more detail on the creation of the GA.

Isn't this overkill?

Yes, and quite horribly so given that all Paul wanted was a demonstration that this stuff was possible. The reason I'm going so overboard on this is that I have every intention of using the code I produce here for other stuff in future (there will be other challenges to answer other questions, such as the ease with which the various protein-spaces can be located). I'd also like the demonstration to be as conclusive as possible, of course.

The Protein Challenge

Index of posts relating to this challenge, and list of tasks necessary for its completion.

Background material

Introduction

Open mouth, insert money #2

Challenge Specs

The Task Ahead

Fitness evaluation

Genotype->phenotype

Phenotype->efficacy

Efficacy->fitness

Genetic algorithm

Realistic behaviour

Results

Conclusions

Open mouth, insert money #2

I've been cutting back my online debating recently in favour of doing some actual learning wrt computational biology. However, one culture-wars blog I have kept visiting is Exile From Groggs, which is the most sane ID blog I've found to date.

In particular, it's occasionally possible to convince Paul (the blog owner) of something, given sufficient effort. This makes him practically unique in the world of blogging, let alone the ID community's segment thereof. And it's on this note I wish to speak.

I recently spent some effort trying to figure out exactly which parts of evolutionary theory it was that Paul had a problem with. One of the issues that came up was that, whilst he was broadly convinced that genetic algorithms such as real-world evolution could optimise very simple systems, he wasn't sure how they'd handle something as complicated as proteins, where the search spaces can be Bad And Wrong. The challenge was laid down and accepted: demonstrate that evolutionary algorithms could work in the context of proteins.

This challenge actually has two major components. Firstly, I need to determine a sequence->fitness mapping for a given selection of proteins. Secondly, I need to implement an algorithm using both this and biologically-realistic reproduction and mutation to simulate the real-world evolutionary optimisation of the protein within a given protein-space*. The first part is by far the more difficult.
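To make the second component concrete, here is a bare-bones sketch of the kind of loop I have in mind. It is emphatically not the biologically-realistic version: the sequence->fitness mapping is a toy stand-in (fraction of positions matching a made-up TARGET sequence), reproduction is plain fitness-proportional sampling, and mutation is independent point substitution. The names toy_fitness, mutate and evolve, and every parameter value, are assumptions for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKTAYIAKQR"  # hypothetical optimum, used only by the toy fitness below


def toy_fitness(sequence):
    """Stand-in sequence->fitness mapping: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(sequence, TARGET)) / len(TARGET)


def mutate(sequence, rate, rng):
    """Replace each residue independently with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in sequence
    )


def evolve(population, generations, rate, rng):
    """Run fitness-proportional reproduction with mutation.

    Returns the final population and the best fitness seen in each generation.
    """
    history = []
    for _ in range(generations):
        # Small constant keeps the weights valid even if every fitness is zero.
        weights = [toy_fitness(seq) + 1e-6 for seq in population]
        parents = rng.choices(population, weights=weights, k=len(population))
        population = [mutate(parent, rate, rng) for parent in parents]
        history.append(max(toy_fitness(seq) for seq in population))
    return population, history


if __name__ == "__main__":
    rng = random.Random(0)
    start = ["".join(rng.choice(AMINO_ACIDS) for _ in TARGET) for _ in range(200)]
    final, history = evolve(start, generations=100, rate=0.02, rng=rng)
    print("best fitness, first generation:", history[0])
    print("best fitness, last generation:", history[-1])
```

The real work lies in replacing toy_fitness with something defensible; the loop itself changes very little.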

I threw out a bunch of suggestions as to how this could be computationally modelled (rather than having to produce every single possible protein variant in the lab), and the option that seems to have been settled on is:

I work backwards by looking at an existing protein family, comparing the efficacy of the various proteins, and assuming that everything similar to one of these is also effective. I use some kind of sequence-based active-site-detecting process to fine-tune our guesstimate of the effectiveness of the similar proteins.


In bioinformatics, this sort of thing can be worked through quite easily, but it's all too simple to produce hideously wrong answers. In particular, in this case it'll be fairly easy to determine which proteins will work well - the challenge will be avoiding false negatives.
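A minimal sketch of the "similar to a known protein, therefore probably effective" idea, and of where the false negatives come from. The assumptions are heavy: sequences are taken to be pre-aligned and of equal length, similarity is plain percent identity, the 0.7 threshold is arbitrary, and the family dictionary holds made-up sequences and efficacy values. There is no active-site detection here at all.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def estimate_efficacy(candidate, family, threshold=0.7):
    """Guess a candidate's efficacy from its most similar known family member.

    `family` maps known sequences to measured efficacies. Anything less than
    `threshold` identical to every family member is scored 0.0, which is
    exactly where false negatives creep in: a distant sequence may work fine.
    """
    if not family:
        raise ValueError("family must contain at least one known protein")
    best_identity, best_efficacy = max(
        (identity(candidate, seq), eff) for seq, eff in family.items()
    )
    if best_identity < threshold:
        return 0.0
    return best_identity * best_efficacy


if __name__ == "__main__":
    # Hypothetical family: sequences and efficacies are invented for the example.
    family = {"MKTAYIAKQR": 0.9, "MKSAYLAKQR": 0.6}
    print(estimate_efficacy("MKTAYIAKQR", family))  # exact match to a known member
    print(estimate_efficacy("MKTAYIAKQG", family))  # one substitution away
    print(estimate_efficacy("GGGGGGGGGG", family))  # unlike anything known
```

The last case is scored zero purely because nothing like it is in the family, not because anyone has shown it doesn't work.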

More after the break.

* Syntactic note: we've been referring to the space of protein sequences that perform acceptably at a given task as the protein-space of that task. If we talk about the protein-space of a protein, we're talking about all the proteins that accomplish one of the same tasks as that protein in roughly the same fashion as that protein. This does not include wildly different proteins that happen to perform the same task; we're concerned primarily with families of similar proteins.

Sunday, August 20, 2006

ID's hard evidence

(AKA: Stoner Logic 101)

In response to this article in the York Daily Record, and to this thread on a blog I frequent, I feel I've finally gotten to grips with the evidence behind ID. I feel it can best be summarised by the following dialogue:

"Wow, dude, look at this flagellum thingy. That is soooo cool, man"

"Dude, what if... like... what if all these cool things were made by some Big Reefer in the sky? Cos, y'know, if I was a Big Reefer, that's the sort of thing I'd make..."

"Woah, duuuuude!"


There we have it, the conclusive disproof of evolutionary theory. I expect to win Nobel prizes for this day's work.

Needless to say, however, I have some doubts about the Quality of this argument.