
I have a FASTA file containing multiple sequences. Some of the sequences end with a run of '-' characters, and I'd like to trim those off. Is there a clean way to do this with Biopython and write a new FASTA file without the trailing dashes?

I saw the post "How to remove all-N sequence entries from fasta file(s)" and tried to adapt some of its code, but it didn't work.

The file contains sequences like this:

>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA---------------------

This is what I tried:

from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    # keeps every record that has at least one non-dash character
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")

All of the sequences should ultimately be trimmed to look like this:

>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA

Any help would be appreciated!
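
One possible approach, offered as a sketch rather than a definitive answer: the generator expression in the attempt above only decides whether to keep or drop a whole record (it drops records made up entirely of dashes), so it never changes the sequences themselves. To trim, each record's sequence has to be replaced with a stripped copy. The helper name `trim_trailing_gaps` and the `gap` parameter below are made up for illustration, and the sketch assumes the trailing characters are plain ASCII hyphens.

```python
def trim_trailing_gaps(sequence: str, gap: str = "-") -> str:
    """Return the sequence without the run of gap characters at its end."""
    # str.rstrip only touches the right-hand end, so internal gaps survive.
    return sequence.rstrip(gap)


def dash_removal(file_in: str, file_out: str) -> int:
    """Write file_in to file_out with trailing gaps trimmed from every record.

    Returns the number of records written.
    """
    # Imported here so trim_trailing_gaps stays usable without Biopython.
    from Bio import SeqIO
    from Bio.Seq import Seq

    def trimmed(records):
        for rec in records:
            # Replace the sequence; id and description are left unchanged.
            rec.seq = Seq(trim_trailing_gaps(str(rec.seq)))
            yield rec

    # SeqIO.parse is lazy, so records are read, trimmed and written one by one.
    return SeqIO.write(trimmed(SeqIO.parse(file_in, "fasta")), file_out, "fasta")
```

Calling `dash_removal("dash_removal_test.fasta", "dashes_gone.fasta")` should then produce the trimmed file. Two caveats: the "fasta" writer wraps sequence lines at 60 characters by default, so the output will not be a single long line, and `rstrip` leaves any dashes in the middle of a sequence untouched, which is usually what you want if they represent alignment gaps.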