I am writing a program that parses sequence alleles. I have written code that reads a file and creates a header array and a sequence array. Here is an example of a file:
    >dqb1*04:02:01
    ------------------------------------------------------------
    --atgtcttggaagaaggctttgcggat-------ccctggaggccttcgggtagcaact
    gtgacctt----gatgctggcgatgctgagcaccccggtggctgagggcagagactctcc
    cgaggatttcgtgttccagtttaagggcatgtgctacttcaccaacgggaccgagcgcgt
    gttggagctccgcacgaccttgcagcggcga-----------------------------
    ---gtggagcccacagtgaccatctccccatccaggacagaggccctcaaccaccacaac
    ctgctggtctgctcagtgacag----cattggaggcttcgtgctggggctgatcttcctc
    gggctgggccttattatc--------------catcacaggagtcagaaagggctcctgc
    actga-------------------------------------------------------
    >omixon_consensus_m_155_09_4890_dqb1*04:02:01
    -------------------atcaggtccaagctgtgttgactaccactacttttcccttc
    gtctcaattatgtcttggaagaaggctttgcggatccctggaggccttcgggtagcaact
    gtgaccttgatgctggcgatgctgagcaccccggtggctgagggcagagactctcccggt
    aagtgcagggccactgctctccagagccgccactctgggaacaggctctccttgggctgg
    ggtagggggatggtgatctccatgatctcggacacaatctttcatcaacatttcctctct
    ttggggaaagagaacgatgttgcattcccatttatcttt---------------------
    >gendx_consensus_m_155_09_4890_dqb1*04:02:01
    tgccaggtacatcagatccatcaggtccaagctgtgttgactaccactacttttcccttc
    gtctcaattatgtcttggaagaaggctttgcggatccctggaggccttcgggtagcaact
    gtgaccttgatgctggcgatgctgagcaccccggtggctgagggcagagactctcccggt
    aagtgcagggccactgctctccagagccgccactctgggaacaggctctccttgggctgg
    ggtagggggatggtgatctccatgatctcggacacaatctttcatcaacatttcctctct
The headers are the lines beginning with '>dqb1', '>gendx', and '>omixon', and the 3 sequences are the other 3 strings seen above.
The next part of the code detects whether an allele sequence is complete or incomplete. An allele is considered "incomplete" if there are more than 4 breaks within the >dqb1 sequence (a break is signified by a run of '-' characters). For example, the above sequence is broken because there are 5 breaks.
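The break rule can be sketched on its own as follows. This is only an illustration, assuming a break means any run of one or more '-' characters (runs at the very start or end of the sequence count too). The names `count_breaks` and `is_incomplete` are made up for this sketch.

```python
import re

MAX_NUM_BREAKS = 4  # threshold from the description above


def count_breaks(seq):
    """Count runs of one or more '-' characters in seq."""
    return len(re.findall(r"-+", seq))


def is_incomplete(seq, max_breaks=MAX_NUM_BREAKS):
    """True when seq has more than max_breaks runs of '-'."""
    return count_breaks(seq) > max_breaks


# Short made-up strings, not real allele data.
many_gaps = "--atg---ccc-t----gg-a-"
few_gaps = "atg--ccc"
```

Under that assumption, `is_incomplete(many_gaps)` is true and `is_incomplete(few_gaps)` is false.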
I am trying to write code so that, if an incomplete allele is detected, the program creates new arrays containing only the >gendx and >omixon headers and sequences.
How can I make the new arrays not include >dqb1?
Here is my code:
    import sys, re

    max_num_breaks=4
    filename=sys.argv[1]
    f=open(filename,"r")
    header=[]
    header2=[]
    sequence=[]
    sequence2=[]
    string=""
    for line in f:
        if ">" in line and string=="":
            header.append(line[:-1])
        elif ">" in line and string!="":
            sequence.append(string)
            header.append(line[:-1])
            string=""
        else:
            string=string+line[:-1]
    sequence.append(string)

    s1=sequence[0]
    breaks=sum(1 for m in re.finditer("-+",''.join(s1.splitlines())))
    if breaks>max_num_breaks:
        print "incomplete reference allele detected"
        for m in range(len(header)):
            if re.finditer(header[m], 'omixon') or re.finditer(header[m], 'gendx'):
                header2.append(header[m])
                sequence2.append(sequence[m])
    print header2
The problem with the above code is that whenever I print header2, it still includes dqb1.
Why do you want to use re.finditer? What about:

    if header[m].find('omixon') > -1 or header[m].find('gendx') > -1:
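To expand on that suggestion: `re.finditer` returns an iterator object, and that object is truthy even when it yields no matches, so the original condition passes for every header, dqb1 included. Below is a minimal sketch of a plain substring filter instead. The helper name `keep_consensus`, its `keywords` parameter, and the placeholder sequence strings are made up for illustration. It uses the `in` operator, which is equivalent to the `.find(...) > -1` test above.

```python
import re


def keep_consensus(headers, sequences, keywords=("omixon", "gendx")):
    """Keep only the header/sequence pairs whose header contains a keyword."""
    kept_headers, kept_sequences = [], []
    for h, s in zip(headers, sequences):
        # Plain substring test, equivalent to h.find(k) > -1.
        if any(k in h for k in keywords):
            kept_headers.append(h)
            kept_sequences.append(s)
    return kept_headers, kept_sequences


# The iterator returned by re.finditer is truthy even with zero matches.
always_true = bool(re.finditer("omixon", ">dqb1*04:02:01"))

headers = [
    ">dqb1*04:02:01",
    ">omixon_consensus_m_155_09_4890_dqb1*04:02:01",
    ">gendx_consensus_m_155_09_4890_dqb1*04:02:01",
]
sequences = ["seq_a", "seq_b", "seq_c"]  # placeholders, not real data
header2, sequence2 = keep_consensus(headers, sequences)
```

Here `header2` ends up with only the omixon and gendx headers. If the real file uses upper-case names, comparing against `h.lower()` (or using upper-case keywords) would be needed.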