This could help, as a start:
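
    # fin is the already-opened input file
    for line in fin:
        words = line.split()  # list of words
        new_words = []
        unique_words = set()  # words already kept for this line
        for word in words:
            # keep a word only the first time it appears, and drop
            # purely numeric words greater than 65000
            if (word not in unique_words and
                    (not word.isdigit() or int(word) <= 65000)):
                new_words.append(word)
                unique_words.add(word)
        new_line = ' '.join(new_words)
        print(new_line)
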
Turns this:
A 786 65534 65534 786 786 786 786 10026/AS4637 19151 19151 19151 19151 19151 19151 10796/AS13706
Into this:
A 786 10026/AS4637 19151 10796/AS13706
Obviously, it's not quite what you want yet, but try to do the rest yourself. :) The str.replace()
method might help you get rid of the /AS substrings.
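In case str.replace() is new to you, here is a minimal sketch of how it behaves on the output line above. The helper name strip_as and the choice to swap each /AS for a single space are my assumptions, not something from your question; pass a different replacement if you need another result.

```python
# Assumption: each "/AS" becomes one space, so the numbers on both
# sides stay separate words. strip_as is a hypothetical helper name.
def strip_as(line, replacement=' '):
    """Return line with every '/AS' replaced by replacement."""
    return line.replace('/AS', replacement)

cleaned = strip_as('A 786 10026/AS4637 19151 10796/AS13706')
print(cleaned)  # A 786 10026 4637 19151 10796 13706
```

Note that str.replace() returns a new string and leaves the original untouched, so you have to use the return value. You could apply it to each line before the split() if the numbers after /AS should also go through the duplicate and range checks.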