I have a FASTA file with multiple sequences in it. Some of the sequences end in a run of '-' characters, and I'd like to trim those from the final sequences. Is there a clean way to trim them and write a new FASTA file without the dashes using Biopython?
I saw the post "How to remove all-N sequence entries from fasta file(s)" and tried to adapt some of its code, but it didn't work.
The file contains sequences like this:
>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA--------------------------------------------------------------
from Bio import SeqIO

def dash_removal(file_in, file_out):
    records = SeqIO.parse(file_in, 'fasta')
    filtered = (rec for rec in records if any(ch != '-' for ch in rec.seq))
    SeqIO.write(filtered, file_out, 'fasta')

dash_removal("dash_removal_test.fasta", "dashes_gone?.fasta")
All of the sequences should ultimately be trimmed to look like this:
>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCACCAGGCCAGATGAGAGAA
Any help would be appreciated!
All the options using sed are great because they are faster, but here is a way to do it in Biopython.
The idea is to use rstrip on the seq attribute of each record. rstrip can be used on the sequence just like on any other string in Python.
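As a quick illustration of why rstrip is the right call here, this small sketch compares it with strip and replace on a plain Python string. The toy sequence is made up for the example and is not taken from the question's data.

```python
# Made-up aligned sequence with leading, internal and trailing dashes
aligned = "--ACG-T--"

trailing_only = aligned.rstrip("-")   # removes dashes at the right-hand end only
both_ends = aligned.strip("-")        # removes dashes at both ends
no_gaps = aligned.replace("-", "")    # removes every dash, including internal gaps

print(trailing_only)
print(both_ends)
print(no_gaps)
```

Only rstrip leaves leading and internal gap characters untouched, which matters if the dashes inside the sequence are meaningful alignment gaps.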
from Bio import SeqIO
import io
seq = """>sequence_of_interest
CAGGCCATTTCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTTTAGCCCAGAAGTAATACCCAT
GTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAGTGGGGGGACATCAAGCAGCAA
TGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCAGTGCACGCAGGGCCTATTGCA
CCAGGCCAGATGAGAGAA--------------------------------------------------------------"""
f = io.StringIO(seq)  # replace it with f = open('my_fasta.fa', 'r')
clean_records = []
for record in SeqIO.parse(f, "fasta"):
    record.seq = record.seq.rstrip('-')
    clean_records.append(record)
with open('clean_fasta.fa', 'w') as f:
    SeqIO.write(clean_records, f, 'fasta')
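For a large file, the same approach can be written as a generator so that records are trimmed and written one at a time instead of being collected in a list first. The sketch below assumes the input is ordinary FASTA. The names strip_trailing_gaps and trim_fasta are hypothetical helpers introduced for this example, not part of Biopython.

```python
def strip_trailing_gaps(records, gap="-"):
    """Yield each record with trailing gap characters removed from its sequence.

    Works on any objects whose .seq attribute has an rstrip method
    (Biopython Seq objects do, and so do plain strings).
    """
    for record in records:
        record.seq = record.seq.rstrip(gap)
        yield record


def trim_fasta(path_in, path_out):
    """Hypothetical wrapper: read a FASTA file, trim each record, write a new file."""
    # Imported here so that the generator above does not itself require Biopython
    from Bio import SeqIO

    records = SeqIO.parse(path_in, "fasta")
    # SeqIO.write accepts an iterator of records and returns how many it wrote
    return SeqIO.write(strip_trailing_gaps(records), path_out, "fasta")
```

With Biopython installed, it could be called as trim_fasta("dash_removal_test.fasta", "clean_fasta.fa"), using the file names from the question and the code above.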