Having a FASTA file that defines an entire family is very useful if you want to find repeats with Ancient or use TSSearch/Protocol2 to compare two entire families for homology.
A TC-Family looks like this: 2.A.1 (it has three components separated by periods: a number, a letter, and a number).
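
A family ID can be sanity-checked before it is passed to any program. The following is a minimal sketch, assuming a family ID is always a number, a single uppercase letter, and a number separated by periods. The function name is_tc_family is hypothetical and is not part of define_family.py.

```python
import re

# Assumed pattern: number, one uppercase letter, number, separated by periods.
TC_FAMILY = re.compile(r"\d+\.[A-Z]\.\d+")

def is_tc_family(tcid: str) -> bool:
    """Return True if tcid looks like a three-component family ID such as 2.A.1."""
    return TC_FAMILY.fullmatch(tcid.strip()) is not None

print(is_tc_family("2.A.1"))      # three components: accepted
print(is_tc_family("2.A.1.1.1"))  # five components: rejected
```

Anything with more or fewer than three components is rejected, so an ID that does not name a whole family is caught before you start a run.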

The program we are using is called define_family.py.

Usage: define_family.py FAMILY <P/PSI> OUTPUT

Open up your terminal application and type:

cd ~/Desktop/ # Changes your working directory to your desktop.
define_family.py 2.A.1 P output.faa # P or PSI

The “P” option refers to BLASTP. Alternatively, we can use “PSI” if we are looking for more distant homologs. When comparing families or looking for repeats, start with the “P” option and switch to “PSI” only if no good results are found.

When prompted, enter 0.7 for the CD-HIT threshold if you are about to compare this family to another, or 0.9 if you are searching for repeats. CD-HIT will then remove proteins that are at least 70% or 90% identical, respectively, to the sequence representing their cluster.
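
The choice of threshold can be written down as a small lookup so that it is not mistyped at the prompt. This is only a sketch: the two values come from the text above, while the task labels and the function name cdhit_threshold are illustrative and are not options of define_family.py.

```python
# Threshold values are taken from the text above; the task labels are hypothetical.
CDHIT_THRESHOLDS = {
    "compare_families": 0.7,  # comparing this family to another
    "find_repeats": 0.9,      # searching for repeats
}

def cdhit_threshold(task: str) -> float:
    """Return the CD-HIT threshold to enter at the prompt for a given task."""
    if task not in CDHIT_THRESHOLDS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(CDHIT_THRESHOLDS)}")
    return CDHIT_THRESHOLDS[task]

print(cdhit_threshold("compare_families"))
print(cdhit_threshold("find_repeats"))
```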

We use forgiving thresholds because a very large FASTA file will not cost us much time, so long as we are using TSSearch. When looking for repeats, we also don’t want to eliminate too many sequences. This becomes apparent when doing a vertical search with Ancient: a good example of a TMS repeat across two homologs can be masked if the threshold is any lower.
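
To see how many sequences survive a given threshold, you can count the records in the output file. The sketch below assumes the output is a standard FASTA file in which every record starts with a ">" header line. The function name count_fasta_records is hypothetical.

```python
def count_fasta_records(path: str) -> int:
    """Count sequences in a FASTA file by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, calling count_fasta_records("output.faa") after one run at 0.7 and another at 0.9 shows how many more sequences the higher threshold keeps.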