Question

How can I generate fragments from a FASTA file of a protein sequence? For example, I want to generate 5-mer fragments in this way:

Initial sequence:

         >gi|48255
          MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ

Fragments:

          1
          MSSPP PARSG FYRQE VTKTA WEVRA VYRDL Q
           2
           SSPPP ARSGF YRQEV TKTAW EVRAV YRDLQ
            3
            SPPPA RSGFY RQEVT KTAWE VRAVY RDLQ
             4
             PPPAR SGFYR QEVTK TAWEV RAVYR DLQ
              5
              PPARS GFYRQ EVTKT AWEVR AVYRD LQ 

and so on.

Each cycle drops one more residue from the start of the sequence.


Solution

Create a generator function that yields smaller and smaller slices of the given string.

def shrink(s):
    for i in range(len(s)):
        yield s[i:]
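
As a quick check of what the generator produces, here is a minimal sketch on a short string. The five-letter input is just an illustrative prefix of the example sequence.

```python
def shrink(s):
    # Yield the suffix of s starting at each position, longest first.
    for i in range(len(s)):
        yield s[i:]

suffixes = list(shrink("MSSPP"))
print(suffixes)  # ['MSSPP', 'SSPP', 'SPP', 'PP', 'P']
```

Because it is a generator, the suffixes are produced one at a time and only materialized here by the call to list.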

Create a function that splits a string into a list of five-character segments. The last segment may be shorter.

def split_into_five_character_segments(s):
    ret = []
    while len(s) > 5:
        ret.append(s[:5])
        s = s[5:]
    ret.append(s)
    return ret
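
If you want a fragment length other than five, the same splitting can be written with a stepped range. This is only a sketch, not part of the original answer: the name `split_into_segments` and the `size` parameter are made up for illustration. One difference from the loop above: for an empty string it returns an empty list rather than a list holding one empty string.

```python
def split_into_segments(s, size=5):
    # Take consecutive slices of `size` characters; the last one may be shorter.
    return [s[i:i + size] for i in range(0, len(s), size)]

print(split_into_segments("MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ"))
# ['MSSPP', 'PARSG', 'FYRQE', 'VTKTA', 'WEVRA', 'VYRDL', 'Q']
print(split_into_segments("MSSPPPARSG", size=3))
# ['MSS', 'PPP', 'ARS', 'G']
```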

Combine the two in a list comprehension to generate your fragment library.

sequence = "MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ"
fragments = [split_into_five_character_segments(s) for s in shrink(sequence)]

Enumerate the fragments. Use join to combine the pieces of each fragment into a single space-separated string, indenting each one by its offset into the sequence.

for idx, fragment in enumerate(fragments):
    fragment_number = idx + 1
    indent = " " * idx
    print(indent + str(fragment_number))
    print(indent + " ".join(fragment))

Result:

1
MSSPP PARSG FYRQE VTKTA WEVRA VYRDL Q
 2
 SSPPP ARSGF YRQEV TKTAW EVRAV YRDLQ
  3
  SPPPA RSGFY RQEVT KTAWE VRAVY RDLQ
   4
   PPPAR SGFYR QEVTK TAWEV RAVYR DLQ
    5
    PPARS GFYRQ EVTKT AWEVR AVYRD LQ
     6
     PARSG FYRQE VTKTA WEVRA VYRDL Q
      7
      ARSGF YRQEV TKTAW EVRAV YRDLQ
       8
       RSGFY RQEVT KTAWE VRAVY RDLQ
        9
        SGFYR QEVTK TAWEV RAVYR DLQ
         10
         GFYRQ EVTKT AWEVR AVYRD LQ
          11
          FYRQE VTKTA WEVRA VYRDL Q
           12
           YRQEV TKTAW EVRAV YRDLQ
            13
            RQEVT KTAWE VRAVY RDLQ
             14
             QEVTK TAWEV RAVYR DLQ
              15
              EVTKT AWEVR AVYRD LQ
               16
               VTKTA WEVRA VYRDL Q
                17
                TKTAW EVRAV YRDLQ
                 18
                 KTAWE VRAVY RDLQ
                  19
                  TAWEV RAVYR DLQ
                   20
                   AWEVR AVYRD LQ
                    21
                    WEVRA VYRDL Q
                     22
                     EVRAV YRDLQ
                      23
                      VRAVY RDLQ
                       24
                       RAVYR DLQ
                        25
                        AVYRD LQ
                         26
                         VYRDL Q
                          27
                          YRDLQ
                           28
                           RDLQ
                            29
                            DLQ
                             30
                             LQ
                              31
                              Q
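
The question starts from a FASTA file, while the code above uses a hard-coded string. A minimal sketch of reading the first record with plain Python follows. `read_first_sequence` is a hypothetical helper name, and the sketch assumes a simple FASTA layout where header lines start with `>` and the sequence may wrap over several lines. An in-memory `io.StringIO` stands in for an opened file so the example runs on its own.

```python
import io

def read_first_sequence(handle):
    # Collect the sequence lines of the first record; stop at the second header.
    parts = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record begins here
            seen_header = True
            continue
        parts.append(line)
    return "".join(parts)

# Stand-in for an opened FASTA file; the line wrapping is illustrative.
fake_file = io.StringIO(">gi|48255\nMSSPPPARSGFYRQEV\nTKTAWEVRAVYRDLQ\n")
sequence = read_first_sequence(fake_file)
print(sequence)  # MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ
```

With a real file, pass the object returned by open() instead of the StringIO stand-in.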

Other tips

You could also do it with a slightly simpler method:

fasta_string = 'MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ'
string_list =  list(fasta_string)
temp1 = []
temp2 = []
for i in range(len(fasta_string)):
    temp1.append(' '*i)
    temp2.append(''.join(string_list[i:len(string_list)]))
    print(temp1[i] + str(i+1))
    print(temp1[i] + ' ' + temp2[i])

Output (each suffix is printed whole, without the five-character split):

1
 MSSPPPARSGFYRQEVTKTAWEVRAVYRDLQ
 2
  SSPPPARSGFYRQEVTKTAWEVRAVYRDLQ
  3
   SPPPARSGFYRQEVTKTAWEVRAVYRDLQ
   4
    PPPARSGFYRQEVTKTAWEVRAVYRDLQ
    5
     PPARSGFYRQEVTKTAWEVRAVYRDLQ
     6
      PARSGFYRQEVTKTAWEVRAVYRDLQ
      7
       ARSGFYRQEVTKTAWEVRAVYRDLQ
       8
        RSGFYRQEVTKTAWEVRAVYRDLQ
        9
         SGFYRQEVTKTAWEVRAVYRDLQ
         10
          GFYRQEVTKTAWEVRAVYRDLQ
          11
           FYRQEVTKTAWEVRAVYRDLQ
           12
            YRQEVTKTAWEVRAVYRDLQ
            13
             RQEVTKTAWEVRAVYRDLQ
             14
              QEVTKTAWEVRAVYRDLQ
              15
               EVTKTAWEVRAVYRDLQ
               16
                VTKTAWEVRAVYRDLQ
                17
                 TKTAWEVRAVYRDLQ
                 18
                  KTAWEVRAVYRDLQ
                  19
                   TAWEVRAVYRDLQ
                   20
                    AWEVRAVYRDLQ
                    21
                     WEVRAVYRDLQ
                     22
                      EVRAVYRDLQ
                      23
                       VRAVYRDLQ
                       24
                        RAVYRDLQ
                        25
                         AVYRDLQ
                         26
                          VYRDLQ
                          27
                           YRDLQ
                           28
                            RDLQ
                            29
                             DLQ
                             30
                              LQ
                              31
                               Q
License: CC-BY-SA with attribution
Not affiliated with StackOverflow