Python - Make a new array


I am writing a program that parses a sequence of alleles. I have written code that reads a file and creates a header array and a sequence array. Here is an example of a file:

    >dqb1*04:02:01
    ------------------------------------------------------------
    --atgtcttggaagaaggctttgcggat-------ccctggaggccttcgggtagcaact
    gtgacctt----gatgctggcgatgctgagcaccccggtggctgagggcagagactctcc
    cgaggatttcgtgttccagtttaagggcatgtgctacttcaccaacgggaccgagcgcgt
    gttggagctccgcacgaccttgcagcggcga-----------------------------
    ---gtggagcccacagtgaccatctccccatccaggacagaggccctcaaccaccacaac
    ctgctggtctgctcagtgacag----cattggaggcttcgtgctggggctgatcttcctc
    gggctgggccttattatc--------------catcacaggagtcagaaagggctcctgc
    actga-------------------------------------------------------
    >omixon_consensus_m_155_09_4890_dqb1*04:02:01
    -------------------atcaggtccaagctgtgttgactaccactacttttcccttc
    gtctcaattatgtcttggaagaaggctttgcggatccctggaggccttcgggtagcaact
    gtgaccttgatgctggcgatgctgagcaccccggtggctgagggcagagactctcccggt
    aagtgcagggccactgctctccagagccgccactctgggaacaggctctccttgggctgg
    ggtagggggatggtgatctccatgatctcggacacaatctttcatcaacatttcctctct
    ttggggaaagagaacgatgttgcattcccatttatcttt---------------------
    >gendx_consensus_m_155_09_4890_dqb1*04:02:01
    tgccaggtacatcagatccatcaggtccaagctgtgttgactaccactacttttcccttc
    gtctcaattatgtcttggaagaaggctttgcggatccctggaggccttcgggtagcaact
    gtgaccttgatgctggcgatgctgagcaccccggtggctgagggcagagactctcccggt
    aagtgcagggccactgctctccagagccgccactctgggaacaggctctccttgggctgg
    ggtagggggatggtgatctccatgatctcggacacaatctttcatcaacatttcctctct

The headers would be '>dqb1', '>gendx', and '>omixon', and the 3 sequences would be the other 3 strings seen above.
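To make the header/sequence split concrete, here is a minimal sketch in Python 3. The function name `parse_fasta` and the shortened sample lines are made up for illustration; it assumes every header line starts with `>` and every other non-blank line continues the current sequence.

```python
def parse_fasta(lines):
    """Split FASTA-style lines into parallel header and sequence lists."""
    headers, sequences = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            headers.append(line)
            sequences.append("")  # start a new, empty sequence
        elif sequences:
            sequences[-1] += line  # continuation of the current sequence
    return headers, sequences


# Shortened, made-up sample in the same shape as the file above.
example = [
    ">dqb1*04:02:01",
    "--atg---",
    "ccc-ttt",
    ">gendx_consensus",
    "atgccc",
]
headers, sequences = parse_fasta(example)
print(headers)
print(sequences)
```

Keeping the two lists parallel means `headers[i]` always describes `sequences[i]`, which is what the filtering step relies on.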

The next part of the code detects whether the allele sequence is complete or incomplete. An allele is determined to be "incomplete" if there are more than 4 breaks within the >dqb1 sequence. (A break is signified by '-'.) For example, the above sequence is broken because there are 5 breaks.
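As a rough sketch of that check (Python 3): this assumes a "break" means a run of one or more consecutive `-` characters, which is how the `-+` pattern in the posted code treats it. The names `count_breaks` and `is_incomplete` and the toy sequence are illustrative only.

```python
import re

MAX_NUM_BREAKS = 4  # threshold taken from the description above


def count_breaks(sequence):
    """Count runs of one or more consecutive '-' characters."""
    return len(re.findall(r"-+", sequence))


def is_incomplete(sequence, max_breaks=MAX_NUM_BREAKS):
    """An allele is incomplete when it has more than max_breaks breaks."""
    return count_breaks(sequence) > max_breaks


toy = "--atg---ccc-ttt----ggg--aaa"  # made-up sequence with five runs of '-'
print(count_breaks(toy))
print(is_incomplete(toy))
```

If each single `-` should count as its own break instead, `sequence.count("-")` would be the simpler test.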

I am trying to write the code so that if an incomplete allele is detected, the program creates a new array of the >gendx and >omixon headers and sequences.

How can I make an array that does not include >dqb1?

Here is the code:

    import sys, re

    max_num_breaks=4
    filename=sys.argv[1]
    f=open(filename,"r")
    header=[]
    header2=[]
    sequence=[]
    sequence2=[]
    string=""
    for line in f:
        if ">" in line and string=="":
            header.append(line[:-1])
        elif ">" in line and string!="":
            sequence.append(string)
            header.append(line[:-1])
            string=""
        else:
            string=string+line[:-1]
    sequence.append(string)
    s1=sequence[0]
    breaks=sum(1 for m in re.finditer("-+",''.join(s1.splitlines())))
    if breaks>max_num_breaks:
        print "incomplete reference allele detected"
        for m in range(len(header)):
            if re.finditer(header[m], 'omixon') or re.finditer(header[m], 'gendx'):
                header2.append(header[m])
                sequence2.append(sequence[m])
        print header2

The problem with the above code is that whenever I print header2, it still includes dqb1.

Why do you want to use re.finditer?

What about

    if header[m].find('omixon') > -1 or header[m].find('gendx') > -1:
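The likely reason dqb1 slips through is that `re.finditer` returns an iterator object, and an iterator is truthy even when it yields no matches, so the `if` is always taken. (The arguments are also swapped: the pattern comes first, then the string to search.) A minimal sketch of the substring approach in Python 3, with a hypothetical helper `keep_consensus` and made-up sample data:

```python
import re

# An iterator object is truthy even if it would yield zero matches.
always_truthy = bool(re.finditer("omixon", ">dqb1*04:02:01"))


def keep_consensus(headers, sequences, wanted=("omixon", "gendx")):
    """Keep only header/sequence pairs whose header contains a wanted name."""
    kept_headers, kept_sequences = [], []
    for header, sequence in zip(headers, sequences):
        if any(name in header for name in wanted):
            kept_headers.append(header)
            kept_sequences.append(sequence)
    return kept_headers, kept_sequences


# Made-up sample in the same shape as the parsed file.
sample_headers = [">dqb1*04:02:01", ">omixon_consensus", ">gendx_consensus"]
sample_sequences = ["--atg--", "atg", "ccc"]
kept_headers, kept_sequences = keep_consensus(sample_headers, sample_sequences)
print(kept_headers)
```

The `in` operator does the same job as `find(...) > -1` and reads a little more directly. If you do want a regex, `re.search(pattern, header)` returns `None` when nothing matches, so it can be used in an `if`.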
