Thursday, April 1, 2010

Growing Sequences Problem

UNDERSTANDING THE PROBLEM:

In class we discussed the Growing Sequences Problem:

Input: An unsorted, finite sequence of numbers. (call this list Ulist)

Sample Input: 5, 12, 11, 22, 4, 16, 35, 7, 11, 12, 34, 112, 1

Output: The longest list of numbers found within Ulist, that are sorted within increasing order, and not necessarily adjacent to each other.

Sample Output

The possible (raw) sequences generated would look like: (* = longest.)

---------------------------------------

5, 12, 22, 35, 112 *
12, 22, 35, 112
11, 22, 35, 112
22, 35, 112
4, 16, 35, 112
16, 35, 112
35, 112
7, 11, 12, 34, 112 *
11, 12, 34, 112
12, 34, 112
34, 112
112
1

--------------------------------------

In our case the longest list is actually two lists. being how there is no set criteria to select one list
over another based upon length, I'd say that all lists of the same length are equal.


Sample Output:

------------------------------------------
Longest sequences generated from Source


5, 12, 22, 35, 112
7, 11, 12, 34, 112
------------------------------------------


It is also noted that, if the Input list was sorted (in increasing order)
then the output would be simply the whole list itself, since this
would be the longest increasing sequence of numbers.

Also it can be seen that a lot of time in the program (with a "blind" algorithm is spent generating sequences and sub-sequences.

ie

(5, 12, 22, 35, 112) = A

Has the sub sequences of:

(12, 22, 35, 112)
(22, 35, 112)
(35, 112)
(112)

but 11, 22, 35, 112 isn't a sub-sequence of A since it starts with 11, which isn't even contained in A. A sub-sequence of A would have to contain a subset of A, in the proper sorted order, and contain fewer elements than A.

Eliminating sub-sequence searching, could save time in the algorithm.
This could be an area for optimization, once we get the basic solution.
We might need a "parallel array" that marks which sequences have been already visited. There may not be any easy way to do this, since a sub-sequence of a particular sequence, might be part of another similar (but different) sequence.

ie (22, 35, 112) = M

is a sub-sequence of 3 lists:

(1) 5, 12, 22, 35, 112
(2) 12, 22, 35, 112
(3) 11, 22, 35, 112

In this case 2 is a sub-sequence of 1, but 3, isn't a sub-sequence of 1. The list M, however, is a sub-sequence of all three lists.

list 3 could be longer than list 2, it is just coincidence that they are the same length.

if you just blindly mark numbers in the parallel array as already being visited, then you might accidentally chop of part of a generated sub-sequence, ie (3) above. This would give an incorrect result. We should probably get fancy with our algorithm later, after getting something working.

--------------------------------------------------------------------------------

Also if you have a list of (111,111,111)
then the longest sequence is it (111) or (111,111,111)

I'd say that (111) since all numbers are equal, and thus not increasing. This problem is trivial to avoid. (Just us strictly less than operator to compare numbers)

repeated adjacent numbers could be removed from the list to save search time, but
removing the numbers also takes time, which might be longer than the search itself.

--------------------------------------------------------------------------------

Another thing to note is that you could save time by chopping off all of the numbers in the original list past the maximum number (in our case 112) in the list if you already have a sub-sequence longer than the length of the numbers that follow the maximum number of the list. Again you can't do this blindly, for if the maximum value was at the beginning of the list, then you'd chop off all of the values, in the list except for the first element. This would most likely not give the answer that we are looking for. Again this is a candidate for future optimization.

--------------------------------------------------------------------------------

DEVISE A PLAN

It looks like this is a sort of a problem involving recursion or induction.
Ie

Is the input list empty then exit (base case)

Try to find the sequence generated when starting with the first number of the list.

Search the list until the end of the list.

save this max list

then strip off the first number of the input list, and do the process again.

(Input List - one element)

search the new Input List

compare this list to your current max_list, if longer in length to the current max_list(s), then this is your new max list. If it is the same length as your current max list(s) then add to max list(s)

etc.

so we need probably at least 3 variables to pass.

the original list of numbers
## ( and the smaller versions of this list )
the list (lists) of generated sequences
## (this is what we want, we need to keep a tally)
## the max length of the current generated sequences. (to compare size)


ie Very roughly


Function Grow_Seq (List, Found_Seq_List, Max_Length)

Begin (*function*)

If source list is empty
Then exit
Else check the list for sequences
If found seq length >= Max_Length
Then add to found seq list(s)
Grow_Seq (List[1..end], Found_Seq_List, Max_Length) # recurse with smaller List

End (*function*)

CARRY OUT PLAN

Pseudocode

Function Grow_SEQ (List, FoundSeq, MaxLength, Append)

## LIST is the input list of numbers (an int array)
## FoundSeq is the currently found list of Sequences (could be a file or int array)
## MaxLength is the current length of the longest found list (int)
## Append is a flag that tells what to do with the newly found list (int flag)

-1 = do nothing
1 = destroy/replace (this new found list is better than the old list
2 = append (this list is equal to the lists already found)

## First call List would be the full list, FoundSeq would be empty, MaxLength would be 0, and append would be 1)

Var i, biggest int = 0
temp_foundSeq (int array) *clear array* = 0


Begin {function}

if sizeof(list) = 0 # check base case
then Return (FoundSeq)

## might be a problem here with one element lists
## need to verify, on first glance is ok
## is exit condition when sizeof(list) = 1 or 0?
##

ELSE # input list larger than the empty set

i=1;
lengthNew =1;

biggest = list[0]; # when we start first element is the largest
temp_FoundSeq [0] = biggest;

while i < (sizeof(list) -1)
Begin {while}

if List[i] > biggest
then temp_FoundSeq[lengthNew]=biggest: lengthNew++

# add the new biggest number to the found sequence, add one to the length of
# the found sequence
i++
# advance to the next element of the source list
End {while}

## at this point we have generated a new FoundSeq
## must compare it to other FoundSeq

if lengthnew < MaxLength

## then this sequence is no better than what we've already found
## call with shorter list
then Grow_SEQ (List [1..end], FoundSeq, MaxLength, -1)

if lengthnew > MaxLength

then FoundSeq = Temp_FoundSeq ## change the original value for FoundSeq
## might have to worry about pass by value
## versus pass by variable here

Grow_SEQ (List [1..end], FoundSeq, lengthnew, 1)

## This is the new sequence, better than all yet found


## we can assume that lengthnew = MaxLength if we've made it this far.

Grow_SEQ (List [1..end], Temp_FoundSeq, MaxLength, 2)


End {function}

LOOK BACK

I am confident that the above pseudocode is on the right track.
The easiest way would be to implement the code in some language and see how it goes.

the tricky part, the part that I've not figured out yet is what exactly to do if you have a new sequence that is the same length as the already found sequence(s)

do we store all the found sequences in a file? This might be good, just reset the file when we find a new set of sequences of larger size. Or perhaps some sort of dynamically allocatable variable? Or some sort of linked list. Also some languages pass by value only, so we have to be sure in the function calls if we want to modify the original calling variable. (in this case specifically FoundSeq)

This is just a basic algorithm, when implemented, it could be improved, by removing some of the searches through the subsets of the lists, and by chopping off the end of the original list wherever possible. Care would have to be maintained to not add so many checks that the original function might not be slower.

Also this function may need an auxilliary function

for 1) special cases in the first function call (unlikely)
for 2) functions to take care of appending/destroying the current list of FoundSeq numbers.

No comments:

Post a Comment