parsing TM-align output to report protein domain PAV and coords

anandksrao · Post by **anandksrao** » Thu Nov 25, 2021 2:30 pm

Greetings!

I have 7 unique query PDBs for my protein domain of interest, downloaded from RCSB but manually trimmed down to just the Pfam-defined domain start-stop coords (these 7 trimmed PDBs superpose very well, checked using Chimera).

I also have ~30K PDB files, computationally predicted structures, one for each full length protein, for the entire proteome, for my species of interest.

My 2 goals are to parse TM-align output files to
1. separate my dataset into proteins showing presence vs. absence of my protein domain of interest (as defined in Pfam)
2. in the former category, report the start-stop coords of the domain in these proteins

With that as background info, here are my 5 questions for which I seek answers from forum members, please:

I am considering the syntax:
TMalign -a F -split 0 -outfmt 2 $ref.pdb $query.pdb -TMcut -1 -mirror 0 -d 2 -infmt1 0 -infmt2 0 -cp -fast >> ref1_TMalign.txt
Question 1. Can I use this syntax, OR how should I modify it and why?

However, I am not sure if and how to specify values for these 3 run parameters:
-d TM-score scaled by an assigned d0, e.g. 5 Angstroms
-cp Alignment with circular permutation
-mirror Whether to align the mirror image of input structure
Question 2. How should I choose the value of -d?
Question 3. Does using -cp option complicate the parsing of output file to accomplish my goal of domain presence/absence, and domain start-stop coords reporting?
Question 4. If mirrored searches are switched on, I think it is relevant to my goals, right? And do mirrored searches need to be run separate from non-mirrored search? Should parsing of mirrored search results be performed any differently?

Finally,
Question 5. Are there BioPerl or BioPython based scripts or any other tools to parse TM-align output to report domain start-stop i.e. alignment coordinates?

Thanks in advance.
I wish you and yours a Happy holiday season / Happy Thanksgiving.
Cheers!

zcx@umich.edu · Post by **zcx@umich.edu** » Sun Nov 28, 2021 11:56 am

[1] Pfam is not defined based on tertiary structures. To check if a protein sequence contains a Pfam domain, you should use hmmsearch or hmmscan from HMMER to align the sequence against the HMM of the Pfam family. TM-align can tell you whether your protein structure model is structurally similar to the experimentally determined structure for the Pfam domain of interest. However, being structurally similar does not imply being evolutionarily or functionally related.

[2] For the following discussion, I assume you want to check if a protein structure model is structurally similar to the experimental structure for the Pfam domain of interest. For this purpose, you should not use -mirror, -cp, -d, or -a, as they are all irrelevant here. Just run TM-align with the default options and check if the output TM-score (normalized by the length of the Pfam domain structure) is >=0.5.
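A minimal Python sketch of this check, for readers who want to script it. It assumes the default TM-align report contains two lines of the form `TM-score= ... (if normalized by length of Chain_N, ...)`; the function names and the sample text are made up for illustration and are not part of TM-align.

```python
import re

# Assumed shape of the two TM-score lines in TM-align's default report.
TM_LINE = re.compile(
    r"TM-score=\s*([\d.]+)\s*\(if normalized by length of Chain_([12])"
)

def tm_scores(report):
    """Return {1: TM-score normalized by Chain_1, 2: TM-score normalized by Chain_2}."""
    return {int(chain): float(score) for score, chain in TM_LINE.findall(report)}

def is_structural_hit(report, domain_chain, cutoff=0.5):
    """True if the TM-score normalized by the domain structure's length reaches the cutoff.

    domain_chain is 1 or 2, depending on whether the trimmed domain PDB was
    given as the first or second structure on the command line.
    """
    return tm_scores(report)[domain_chain] >= cutoff

# Illustrative report fragment (values invented for the example).
sample = """\
Aligned length=   90, RMSD=   2.50, Seq_ID=n_identical/n_aligned= 0.100
TM-score= 0.31250 (if normalized by length of Chain_1, i.e., LN=400, d0=7.10)
TM-score= 0.72000 (if normalized by length of Chain_2, i.e., LN=100, d0=3.68)
"""

print(tm_scores(sample))
print(is_structural_hit(sample, domain_chain=2))
```

The point of `domain_chain` is that only the score normalized by the domain's length is compared against 0.5; the score normalized by a much longer full-length protein will usually be low even for a good domain match.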

[3] We are unaffiliated with BioPython or BioPerl and do not support their usage.

anandksrao · Post by **anandksrao** » Sun Nov 28, 2021 3:33 pm

Thank you for your explanations and quick response, despite it being the long weekend. I appreciate your help _/\_

Let me share 2 more details:
1. It would be easier for me to parse the TM-score from the tabular output than from the full output with the FASTA alignment etc. Tabular output files are way smaller as well.
2. Also, in some cases, the full-length protein can have a shorter sequence than the query PDBs.

With that as context, I have 3 quick followup questions:

Question 6. Syntax Check
Sometimes length(QUERY1.pdb) > length(Full_Length.pdb)!
So can I use -a F -TMcut -1 rather than the default -a T?
And is the full example syntax shown below OK? Sorry to bother, I'm a first-time TMalign user and not a structural biologist.
I usually deal with sequences.

TMalign $Full_Length.pdb QUERY1.pdb -a F -outfmt 2 -TMcut -1 -infmt1 0 -infmt2 0 >> TMalign_28Nov2021_cat_queryPDB1.tab
TMalign $Full_Length.pdb QUERY2.pdb -a F -outfmt 2 -TMcut -1 -infmt1 0 -infmt2 0 >> TMalign_28Nov2021_cat_queryPDB2.tab

Question 7. Suppress reporting threshold?
In my trial runs with different query PDBs and tabular output, the number of rows varied from one tabular output file to the next.
So I suppose this means that, for some queries, some alignments were NOT reported because they did not make the threshold for reporting. Right?
How can I suppress this behavior, so that ALL query<->subject pairs are reported in the tabular output format (-outfmt 2), even when TM-scores are super low, even 0?

Question 8. Scenario - Discontiguous structural alignment, but TMscore > 0.5
Is it ever possible for TM-align's pairwise alignment to be broken into > 1 contiguous region, and yet report TM-score > 0.5?
If yes, are these rare instances? And would such discontiguous structural alignments hold any special meaning in terms of their structure-function evolution and the overall confidence a user should have about such alignments?

zcx@umich.edu · Post by **zcx@umich.edu** » Mon Nov 29, 2021 1:43 am

6. The default value of -a is F. Whether or not -a is F, TM-align can return a result even if Length(Full_length.pdb)<Length(QUERY1.pdb). Therefore, there is no need for you to set it to a non-default value.

7. No, TM-align will not refuse to report an alignment even if the TM-score is very low, unless -TMcut is set to a non-default value.
However, it is possible for TM-align to refuse to report an alignment if one of the input proteins contains <3 residues, since such a structure is mathematically impossible to superimpose. If TM-align cannot report an alignment even when neither of the above two cases applies, please send us your input files (e.g. by attaching them to your next post) so that we can check.
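To find out which pairs are missing from a tabular file before reporting them, something like the following Python sketch may help. It assumes that -outfmt 2 writes tab-separated rows whose first four columns are the two structure names followed by the two TM-scores, and that header lines start with `#` (they repeat when runs are appended with `>>`). The function names, file names, and numbers are invented for illustration.

```python
def read_outfmt2(text):
    """Parse tab-separated rows, skipping blank lines and '#' header lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        rows.append({
            "chain1": fields[0],
            "chain2": fields[1],
            "tm1": float(fields[2]),  # assumed: normalized by the first structure
            "tm2": float(fields[3]),  # assumed: normalized by the second structure
        })
    return rows

def missing_pairs(rows, expected):
    """Return the expected (chain1, chain2) pairs that have no row in the table."""
    seen = {(r["chain1"], r["chain2"]) for r in rows}
    return sorted(set(expected) - seen)

# Illustrative table: two appended runs, hence two header lines.
header = "#PDBchain1\tPDBchain2\tTM1\tTM2\tRMSD\tID1\tID2\tIDali\tL1\tL2\tLali"
sample = "\n".join([
    header,
    "P1.pdb\tQUERY1.pdb\t0.2104\t0.6312\t2.85\t0.050\t0.160\t0.180\t320\t100\t89",
    header,
    "P2.pdb\tQUERY1.pdb\t0.0912\t0.1733\t5.90\t0.020\t0.050\t0.080\t250\t100\t62",
])

rows = read_outfmt2(sample)
expected = [(p, "QUERY1.pdb") for p in ("P1.pdb", "P2.pdb", "P3.pdb")]
print(missing_pairs(rows, expected))
```

Names are compared verbatim here; if your version appends a chain identifier to the names in the table, adjust the expected pairs accordingly.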

8. Yes, a valid structural alignment can include long gaps between two aligned regions.
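Because of such gaps, start-stop coordinates are better reported as a list of aligned blocks plus an overall span. The Python sketch below assumes the default report ends with three alignment lines (first structure, match markers, second structure) right after a line beginning with `(":" denotes`, with `-` marking gaps. The function names and the toy alignment are invented for illustration, and the indices are sequential residue positions in file order, not PDB residue numbers.

```python
def alignment_lines(report):
    """Return the aligned strings of the first and second structure from a default report."""
    lines = report.splitlines()
    for i, line in enumerate(lines):
        if line.startswith('(":" denotes'):
            return lines[i + 1], lines[i + 3]
    raise ValueError("alignment section not found")

def aligned_blocks(aln_target, aln_other):
    """Return 1-based (start, end) ranges of target residues paired with a residue
    of the other structure. An insertion in the other structure does not split a
    block, because the target positions on both sides of it are still consecutive.
    """
    blocks = []
    pos = 0       # running residue index in the target structure
    start = None  # start of the block currently being extended
    for t, o in zip(aln_target, aln_other):
        if t != "-":
            pos += 1
        paired = t != "-" and o != "-"
        if paired and start is None:
            start = pos
        elif t != "-" and not paired and start is not None:
            blocks.append((start, pos - 1))  # unaligned target residue closes the block
            start = None
    if start is not None:
        blocks.append((start, pos))
    return blocks

# Toy report fragment (invented).
sample = """\
(":" denotes residue pairs of d <  5.0 Angstrom, "." denotes other aligned residues)
MKT--AILV
 ::  :  .
-KTGGA--V
"""

full_length, domain = alignment_lines(sample)
blocks = aligned_blocks(full_length, domain)
print(blocks)
print((blocks[0][0], blocks[-1][1]))  # overall span covered by the domain match
```

Translating these positions into the residue numbers written in the PDB file would need a separate pass over the coordinate records.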
