[software] Foldseek installed

Foldseek is an excellent structure search tool that identifies protein structures sharing a similar 3D fold with your query. The work was published in Nature Biotechnology (2023) (DOI: https://doi.org/10.1038/s41587-023-01773-0) and featured in Nature (2023). The source code is available on GitHub.

Foldseek logo copied from its GitHub site

According to its authors, Foldseek searches existing structure databases, including the experimental structures in the PDB (Protein Data Bank) and the AlphaFold-predicted structures provided by DeepMind. The authors also compared its efficiency with established software and servers such as the Dali server and Combinatorial Extension (CE).

I installed Foldseek on our workstations and tested it with a few known structures. The searches were remarkably fast, and the output is easy to read as either column-separated text or HTML.

Here are some example commands and outputs:

Search for structures similar to 1UBQ (ubiquitin), restricted to human entries (taxonomy ID 9606), writing the results to a text file and using TM alignment:

foldseek easy-search pdb/1ubq.pdb /opt/foldseek/afdb --taxon-list 9606 result-human-1UBQ.txt --alignment-type 1 tmp

Search for structures similar to 1UBQ (ubiquitin) against all AlphaFold structures, writing the results to an HTML file and using TM alignment:

foldseek easy-search pdb/1ubq.pdb /opt/foldseek/afdb result-all-1UBQ.html --alignment-type 1 --format-mode 3 tmp

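Search for structures similar to 1UBQ (ubiquitin) against all AlphaFold structures, writing a text file that contains only the query, the target header and the E-value columns. Note that --format-mode 3 is left out here: it selects HTML output, and the help text below states that custom --format-output columns are supported only by format modes 0 and 4.

foldseek easy-search pdb/1ubq.pdb /opt/foldseek/afdb result-all-1UBQ.txt --alignment-type 1 --format-output "query,theader,evalue" tmp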
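A three-column text result like this is easy to post-process with standard tools. The sketch below filters rows by the value in the third column. The file names, the rows and the cutoff of 1e-3 are all made up for illustration, and it assumes the third column holds a true E-value; with other scoring modes, check what that column contains before applying a threshold.

```shell
# Made-up, tab-separated rows in the column order query, theader, evalue.
# These are placeholders for illustration, not real Foldseek hits.
printf '1ubq\tAF-EXAMPLE1 hypothetical protein A\t1.2e-10\n'  > demo-result.txt
printf '1ubq\tAF-EXAMPLE2 hypothetical protein B\t3.5e-02\n' >> demo-result.txt
printf '1ubq\tAF-EXAMPLE3 hypothetical protein C\t4.0e-06\n' >> demo-result.txt

# Keep rows whose third column is numerically below the cutoff.
# -F'\t' splits on tabs only, so spaces inside theader stay in one field.
awk -F'\t' -v cutoff=1e-3 '($3 + 0) < (cutoff + 0)' demo-result.txt > demo-filtered.txt
cat demo-filtered.txt
```

With the placeholder rows above, only the EXAMPLE1 and EXAMPLE3 lines remain in demo-filtered.txt.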

Usage:

foldseek easy-search <query structure> <database path> <output file> --alignment-type 1 --format-mode 3 tmp

Here --alignment-type 1 selects the TM scoring function and --format-mode 3 selects HTML output (omit it for text output). The last argument, tmp, is the temporary working directory.
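The usage pattern can be wrapped in a small shell function. The sketch below is a hypothetical helper, not part of Foldseek: foldseek_cmd only prints the command it would run, adding --format-mode 3 when the output file name ends in .html. The tmp directory name and the fixed --alignment-type 1 simply follow the examples in this post.

```shell
# Hypothetical helper (not part of Foldseek): assemble an easy-search command
# in the positional order shown in the built-in help:
#   query, target database, output file, tmp directory, then options.
foldseek_cmd() {
    query=$1
    db=$2
    out=$3
    # Choose the output mode from the file extension: .html means HTML output.
    case "$out" in
        *.html) mode="--format-mode 3" ;;
        *)      mode="" ;;
    esac
    # ${mode:+ $mode} appends the option only when it is non-empty.
    echo "foldseek easy-search $query $db $out tmp --alignment-type 1${mode:+ $mode}"
}

# Print the commands for an HTML and a text search (nothing is executed).
foldseek_cmd pdb/1ubq.pdb /opt/foldseek/afdb result-all-1UBQ.html
foldseek_cmd pdb/1ubq.pdb /opt/foldseek/afdb result-all-1UBQ.txt
```

Printing rather than running the command makes it easy to inspect first; once it looks right, the printed line can be pasted into the terminal.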
Text-based output of Foldseek

The HTML output is very convenient: click the colored bar labeled with the AF code to see the aligned structure (red) compared with the query (grey).

usage: foldseek easy-search <i:PDB|mmCIF[.gz]> ... <i:PDB|mmCIF[.gz]>|<i:stdin> <i:targetFastaFile[.gz]>|<i:targetDB> <o:alignmentFile> <tmpDir> [options]
options:                               
 -s FLOAT                       Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [9.500]
 --max-seqs INT                 Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [1000]
 --exhaustive-search BOOL       Turns on an exhaustive all vs all search by by passing the prefilter step [0]
                              
 --min-seq-id FLOAT             List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
 -c FLOAT                       List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]
 --cov-mode INT                 0: coverage of query and target
                                1: coverage of target
                                2: coverage of query
                                3: target seq. length has to be at least x% of query length
                                4: query seq. length has to be at least x% of target length
                                5: short seq. needs to be at least x% of the other seq. length [0]
 --max-accept INT               Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]
 -e DOUBLE                      List matches below this E-value (range 0.0-inf) [1.000E+01]
 --seq-id-mode INT              0: alignment length 1: shorter, 2: longer sequence [0]
 --alt-ali INT                  Show up to this many alternative alignments [0]
                              
 --num-iterations INT           Number of iterative profile search iterations [1]
                              
 --tmscore-threshold FLOAT      accept alignments with a tmsore > thr [0.0,1.0] [0.000]
 --tmalign-hit-order INT        order hits by 0: (qTM+tTM)/2, 1: qTM, 2: tTM, 3: min(qTM,tTM) 4: max(qTM,tTM) [0]
 --tmalign-fast INT             turn on fast search in TM-align [1]
 --lddt-threshold FLOAT         accept alignments with a lddt > thr [0.0,1.0] [0.000]
 --prefilter-mode INT           prefilter mode: 0: kmer/ungapped 1: ungapped, 2: nofilter [0]
 --alignment-type INT           How to compute the alignment:
                                0: 3di alignment
                                1: TM alignment
                                2: 3Di+AA [1]
 --cluster-search INT           first find representative then align all cluster members [0]
 --mask-bfactor-threshold FLOAT mask residues for seeding if b-factor < thr [0,100] [0.000]
 --format-mode INT              Output format:
                                0: BLAST-TAB
                                1: SAM
                                2: BLAST-TAB + query/db length
                                3: Pretty HTML
                                4: BLAST-TAB + column headers
                                5: Calpha only PDB super-posed to query
                                BLAST-TAB (0) and BLAST-TAB + column headers (4)support custom output formats (--format-output)
                                (5) Superposed PDB files (Calpha only) [0]
 --format-output STR            Choose comma separated list of output columns from: query,target,evalue,gapopen,pident,fident,nident,qstart,qend,qlen
                                tstart,tend,tlen,alnlen,raw,bits,cigar,qseq,tseq,qheader,theader,qaln,taln,mismatch,qcov,tcov
                                qset,qsetid,tset,tsetid,taxid,taxname,taxlineage,
                                lddt,lddtfull,qca,tca,t,u,qtmscore,ttmscore,alntmscore,rmsd,prob
                                 [query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits]
 --greedy-best-hits BOOL        Choose the best hits greedily to cover the query [0]
                              
 --threads INT                  Number of CPU-cores used (all by default) [16]
 -v INT                         Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
 --compressed INT               Write compressed output [0]

examples:
 # Search a single/multiple PDB file against a set of PDB files
 foldseek easy-search examples/d1asha_ examples/ result.m8 tmp
 # Format output differently
 foldseek easy-search examples/d1asha_ examples/ result.m8 tmp --format-output query,target,qstart,tstart,cigar
 # Align with TMalign (global)
 foldseek easy-search examples/d1asha_ examples/ result.m8 tmp --alignment-type 1
 # Skip prefilter and perform an exhaustive alignment (slower but more sensitive)
 foldseek easy-search examples/d1asha_ examples/ result.m8 tmp --exhaustive-search 1
 
 
references:
 - van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)

Show an extended list of options by calling 'foldseek easy-search -h'.
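When post-processing text output, it helps to know which column number a given field ends up in. The sketch below looks up a column name in a --format-output list; col_index is a hypothetical helper written for this post, not a Foldseek command, and the column list is the default one copied from the help text above.

```shell
# Hypothetical helper (not part of Foldseek): print the 1-based position of
# a column name within a comma-separated --format-output list.
col_index() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -n -x -F -- "$2" | cut -d: -f1
}

# Default column list, copied from the help text above.
default_cols="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits"

# Look up where the evalue column sits in the default output.
evalue_col=$(col_index "$default_cols" evalue)
echo "evalue is column $evalue_col"
```

The resulting number can then be used as the field index in awk or cut, so a script keeps working if the --format-output list is changed later.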