I have a FASTA file with multiple sequences in it. Some of the sequences end with a run of trailing '-' characters, and I'd like to trim those from the final sequences. Is there a clean way to trim them and write a new FASTA file without the dashes using Biopython?
I saw the post "How to remove all-N sequence entries from fasta file(s)" and tried to adapt some of its code, but it didn't work…
file containing a sequence like this:
def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")
all of the sequences should ultimately be trimmed to look like this:
Any help would be appreciated!
All the options using sed are great because they are faster, but here is a way to do it in Biopython. The idea is to use rstrip on the seq attribute of each record; rstrip can be used on the sequence just like on any other string in Python.
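To make the behaviour of rstrip concrete, here is a small plain-string sketch. The example sequences are made up for illustration and are not from the question's file. It shows that only the run of dashes at the right-hand end is removed, while leading or internal gap characters stay in place.

```python
# Hypothetical example sequences, not taken from the question's file.
examples = {
    "trailing": "ACGT----",
    "internal": "AC--GT--",
    "leading": "--ACGT",
    "all_dashes": "------",
}

# rstrip("-") removes only the run of "-" at the right-hand end of each string.
trimmed = {name: seq.rstrip("-") for name, seq in examples.items()}

for name, seq in trimmed.items():
    print(name, repr(seq))
```

Note that a sequence consisting only of dashes becomes an empty string, so you may want to skip such records before writing the output.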
from Bio import SeqIO
import io

seq = """>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""

f = io.StringIO(seq)  # replace it with f = open('my_fasta.fa', 'r')

clean_records = []
for record in SeqIO.parse(f, "fasta"):
    # Strip only the trailing dashes from this record's sequence
    record.seq = record.seq.rstrip('-')
    clean_records.append(record)

with open('clean_fasta.fa', 'w') as f:
    SeqIO.write(clean_records, f, 'fasta')
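If you would rather keep the function shape from the question, the same idea can be written with a generator, so records are streamed instead of collected in a list. This is only a sketch of one way to do it; the function name trim_trailing_dashes is a placeholder of mine, not something from Biopython.

```python
from Bio import SeqIO


def trim_trailing_dashes(file_in, file_out):
    """Copy file_in to file_out, removing trailing '-' from every sequence."""

    def trimmed(records):
        for rec in records:
            # Seq.rstrip works like str.rstrip and returns a new Seq
            rec.seq = rec.seq.rstrip("-")
            yield rec

    # SeqIO.write returns the number of records written
    return SeqIO.write(trimmed(SeqIO.parse(file_in, "fasta")), file_out, "fasta")
```

Calling trim_trailing_dashes("dash_removal_test.fasta", "dashes_gone.fasta") would then write the trimmed records to the second file. Unlike the filter in the question, which only drops records made up entirely of dashes, this modifies each sequence and keeps every record.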