There is a pond. We would like to cross the pond. We cannot walk on water, but we're in luck: there is a row of stepping stones across the pond, and we can walk on stone.
However, the stepping stones appear to have been built by gnomes, since they're much less than a human stride apart. We could step on each of them, one after the other, but that would mean making many tiny steps; we want to cross the pond quickly, so we want to make as few steps as possible, by striding from one stone to another, as far on as possible.
Given a pond of a given width, with a set of stepping stones at given positions along a line, and with the ability to make strides up to some given length, how can we work out a path, being the sequence of stones to step on, that gets us across in the minimum number of strides?
There are probably multiple solutions; we'd like the one which makes the lengths of the strides most even.
Now, some of the stones are a bit slippery, some more so than others, and we'd like to keep our footing as sure as possible; how, given the slipperiness of each stone, can we take that into account in choosing a path?
Incidentally, we'd like to work the path out quickly, but we're really much more concerned with finding the best path - we'd rather be late and dry than on time and soaking wet.
Bonus question: is this the TravelingSalesmanProblem
This is a aesopification of a BioInformatics
problem i'm having. I want to sequence a stretch of DNA (my 'template') a few thousand bases long; i know what the sequence should
be, but i need to verify it. The trouble is, a sequencer can only read a few hundred bases in one go, so you need to make numerous reads of the template, then assemble them to deduce the complete sequence; the reads need to overlap slightly, so they can be merged. I can precisely place the start of a read, but not with complete freedom: only some points in the sequence are possible start points, and those that are can be good or bad. Places that are possible are stepping stones, with better ones being less slippery; the reads, which have to reach from their start to the start of the next read, are the strides.
For the curious, the way I place a read start is by designing a short piece of DNA, called a primer, which matches a corresponding stretch in the sequence; in the sequencing process, this binds to that stretch and serves as a substrate for the reaction. Now, not all primers will work: they won't if they match themselves (they'll bind to each other instead of the template), if they contain palindromes (they'll fold up on themselves) or if they match more than one point in the template (they'll bind to both and the results'll be a mess). Even if they meet those criteria, some primers work better than others: they work best if their 'melting temperature' (a measure of the strength of their binding) is appropriate to the reaction conditions, if they have neither too low or too high a content of any of the different nucleotides, if they have no runs of a single nucleotide, and if their business end binds less strongly than the rest. Thus, i'd like to come up with a set of primers that gives me reads evenly covering the whole template, and where each primer is as good as it can be, to maximise the chance of the sequencing working first time.
The simplest thing I can think of is to start at the start and just repeatedly try to find the next stone to step to: look ahead about a stride, pick the best stone, and step to it. I could perhaps then improve my solution by some sort of simulated annealing: repeatedly look at the worst stones I've picked (in terms of slipperiness and the length of the stride on either side) and see if there's another stone nearby that gives me a better result. This is only heuristic, though.
The only true algorithm i can think of it is to enumerate every possible path (possibly using some sort of BranchAndBound
to skip hopeless cases), figure out how good it is, then pick the best. It's pretty brute force: for N stones, i think it's about O(k ^ N/k), where k is the average number of stones in a stride length.
... i think you're suggesting what amounts to a greedy algorithm - always make the biggest step you can - with some tidying-up afterwards, which sounds very sensible.
... one which makes the lengths of the strides most even.
Why is even-ness so important? If you can get across in 20 irregular strides, wouldn't that be better than getting across in 21 more-regular strides?
Yes, it would. The requirement about evenness is just there to help pick the winner if there are several ways of getting across in the same number of strides (20 regular strides beats 20 irregular strides).
Do you have to pick out *all* the stepping stones ahead of time? Maybe if you picked the maximum stride every time (even if it ended on a less-then-ideal stone), you'd get lucky often enough to save time overall, even though you have to go back and re-do some of the gaps where you weren't so lucky. -- DavidCary
By 'pick out all the stepping stones ahead of time', do you mean choose the route before doing the sequencing? I'd really rather do that if i can - each sequencing reaction involves ordering a primer from a synthesis service, then preparing it (and the template) and sending it off to the sequencing service, which takes at least a week from end to end.
A week turn-around ? Yes, I can see that you don't want
too many iterations of that.
However, allow me to keep exploring the idea.
Currently you're trying to get all the way across in one go
with, say, probability 0.990 .
Say you expect to take roughly 100 steps.
each step needs to succeed with probability 0.9999.
However, consider what happens if we pick steps that we only expect to succeed with 0.90% probability.
Let's say (picking arbitrary numbers to make this plan look unnaturally good) that
the additional flexibility that gives means that we only need to take half as many steps -- 50 steps.
We're almost sure to slip and fall at 5 different places
on the first iteration.
However, on the next iteration,
we can make a bunch of short steps at each of those places.
If we paper over each of those 5 gaps
with 5 more primers each (overkill?).
(I've shifted here from a "stones in a stream" analogy to
"covering a wall with wall-paper" analogy).
Even if those additional primers still only succeed with 0.90 probability,
the probability that at least 2 of the 5 succeed is 0.9995
at each gap, or
an overall expected probability of 0.998 that all 5 gaps will be filled.
This is much better than the "do it all in one go" protocol
that only gives you an expected probability of 0.990 that all the gaps will be filled.
Total, we only needed 75 steps (50 the first week, 25 the second week),
which is less than the 100 steps needed to do it all in one go.
But it did take 2 weeks, rather than the 1 week to do it the other way.
As a side effect, the *actual* failure rate of primers,
compared to the *expected* failure rate of primers based on some model
might help the person who built that model to refine it
to better predict the *actual* failure rate.
(It would be a shame to go to great lengths to avoid a perfectly good stone,
just because flaws in the model make it excessively conservative).
Thank you for posing such an interesting problem.
And thank you for such an interesting solution! Your two-pass idea is actually very good; i could save effort, and money, at the expense of a little time.
As for refining the model - yes, that would be good. I'm probably not going to be trying enough primers to draw any conclusions stronger than the rules of thumb we've already got, but i could certainly try. Maybe some sort of neural net or HMM or something ...
I should note that I was inspired by whoever wrote RidiculousSimplicityGivesRidiculousResources
quickly yields the shortest path between two stones, but you'll have to modify it to deal with evenness & slipperiness. Since DijkstrasAlgorithm
**2) [hence it's not the TravelingSalesmanProblem
in disguise], unless your n
is huge it will run fast enough to afford you some spare cycles for your other requirements.
:I was under the impression that DijkstrasAlgorithm is O(N log N) (or O(V log E))
I could use DijkstrasAlgorithm to enumerate all the possible shortest routes, then go through and compare them for evenness and slipperiness using some objective function. There might be a large number of candidates, though - probably still > O(k^N). A modified version of the algorithm wouldn't solve exactly the algorithmic problem i describe, i think, but it may well solve the actual practical problem - i'll try this out.
Sorry, I hadn't interpreted the problem correctly. Here's a revised idea. Create a graph such that each node (=stone) is connected by edges to its nearest neighbors on both sides [so that there's always a path] and
to any other nodes whose distance from the current node is less than k
. Apply DijkstrasAlgorithm
. Vary k
and compare solutions.
If the gap between stones varies widely, you may be able to partition the problem by splitting at the largest gaps.
How are you measuring length? Number of steps or distance? If the former, you'll enumerate all the minimal routes, which is good, but doesn't help you choose the best one (if you take the first route that you come across, then you're essentially picking one at random). If you do it by bases, you'll find that all routes across the pond are the same length. I think i might have misunderstood.
No, it was my misunderstanding of the problem. DijkstrasAlgorithm
probably won't help, and this whole section is likely worthless and should be deleted. An example data set (including slipperiness) might make the problem clearer; also, how many stones are there in a typical data set?
How are you measuring length? Number of steps or distance?
. Think of the edge weights in DijkstrasAlgorithm
as indicating the *cost* (in dollars or euros) of getting from one point to another.
Create a graph such that each node (=stone) is connected by edges (weighted by cost) to every other node that is within jumping distance.
The cost of getting from node A to node B is
- the cost of primer A, plus
- (the cost of having to re-do things if A fails) times (the probability that A will fail). This factor goes way up when A is slippery.
Then run DijkstrasAlgorithm
to find the minimum cost to get from end of the DNA template to another.
The physical distance in base units from A to B may affect the cost of primer A, or the probability that A will fail, or both.
You'll probably end up with some fixed minimum cost -- tending to minimize the number of steps.
You'll probably end up with an additional cost proportional to the square of the physical distance (or perhaps exponentially rising) -- tending to make the strides relatively even.