parsing TM-align output to report protein domain PAV and coords
Posted: Thu Nov 25, 2021 2:30 pm
Greetings!
I have 7 unique query PDBs for my protein domain of interest, downloaded from RCSB and manually trimmed down to just the Pfam-defined domain start-stop coords (these 7 trimmed PDBs superpose very well, checked using Chimera).
I also have ~30K PDB files, computationally predicted structures, one for each full length protein, for the entire proteome, for my species of interest.
My 2 goals are to parse TM-align output files to
1. separate my dataset into proteins showing presence vs. absence of my protein domain of interest (as defined in Pfam)
2. in the former category, report the start-stop coords of the domain in these proteins
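To make goal 1 concrete, here is a minimal Python sketch of the presence/absence split. It rests on several assumptions: that the tabular output from -outfmt 2 is tab-separated with the columns PDBchain1, PDBchain2, TM1, TM2, RMSD, ID1, ID2, IDali, L1, L2, Lali; that the trimmed domain is given as the first structure, so TM1 is the score normalized by the domain length; and that 0.5 is an acceptable cutoff. The function name `classify` and the cutoff are illustrative choices, not part of TM-align.

```python
import csv
import io

TM_CUTOFF = 0.5  # assumed cutoff, tune for the domain family


def classify(table_text, cutoff=TM_CUTOFF):
    """Split targets into (present, absent) using the best TM1 per target.

    Assumes tab-separated rows: PDBchain1, PDBchain2, TM1, TM2, ...
    Header lines start with '#' and may repeat when runs are appended with >>.
    """
    best = {}
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and repeated headers
        target = row[1]       # second structure: the full-length protein
        tm1 = float(row[2])   # TM-score normalized by the first structure
        best[target] = max(tm1, best.get(target, 0.0))
    present = {t for t, score in best.items() if score >= cutoff}
    absent = set(best) - present
    return present, absent
```

With 7 trimmed domains per protein, keeping the best score per protein means a hit against any one of the 7 counts as presence.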
With that as background info, here are my 5 questions for which I seek answers from forum members, please:
I am considering the syntax:
TMalign -a F -split 0 -outfmt 2 $ref.pdb $query.pdb -TMcut -1 -mirror 0 -d 2 -infmt1 0 -infmt2 0 -cp -fast >> ref1_TMalign.txt
Question 1. Can I use this syntax, OR how should I modify it and why?
However, I am not sure whether and how to specify values for these 3 run parameters:
-d TM-score scaled by an assigned d0, e.g. 5 Angstroms
-cp Alignment with circular permutation
-mirror Whether to align the mirror image of input structure
Question 2. How should I choose the value of -d?
Question 3. Does using the -cp option complicate parsing the output file for my goals of reporting domain presence/absence and domain start-stop coords?
Question 4. If mirrored searches are switched on, I think it is relevant to my goals, right? And do mirrored searches need to be run separate from non-mirrored search? Should parsing of mirrored search results be performed any differently?
Finally,
Question 5. Are there BioPerl- or BioPython-based scripts, or any other tools, to parse TM-align output and report the domain start-stop, i.e. the alignment coordinates?
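To illustrate the kind of parsing meant in Question 5, here is a rough Python sketch for goal 2. It assumes the default full output (not the compact tabular one), where the alignment is printed as three lines: the first structure's sequence, a marker line in which ":" and "." flag aligned residue pairs, and the second structure's sequence, with "-" for gaps. It also assumes no circular permutation. The function name `aligned_span` is hypothetical, and the positions returned are sequential residue indices in the second structure, not PDB residue numbers.

```python
def aligned_span(seq_first, markers, seq_second):
    """Return (start, stop) of the aligned region in the second structure.

    Positions are 1-based sequential indices counted over the second
    structure's residues. Returns None if no residue pair is aligned.
    Leading whitespace of the marker line must be preserved by the caller.
    """
    pos = 0        # running residue index in the second structure
    aligned = []   # indices of second-structure residues that are aligned
    for a, m, b in zip(seq_first, markers, seq_second):
        if b == "-":
            continue  # gap in the second structure: no residue consumed
        pos += 1
        if a != "-" and m in ":.":
            aligned.append(pos)
    if not aligned:
        return None
    return aligned[0], aligned[-1]


# Toy alignment: the domain (first line) aligns inside a longer protein.
span = aligned_span("--MKV-LT--", "  ::. ::  ", "GSMRVALTPE")
```

The first and last aligned positions give only an outer bound on the domain. A long unaligned stretch between them would need separate handling, and mapping back to PDB residue numbers would need the residue numbering read from the PDB file itself.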
Thanks in advance.
I wish you and yours a Happy holiday season / Happy Thanksgiving.
Cheers!