4.5.5. Download protein sequence database
You can follow these steps to download and prepare the protein sequence database:
1. Create new directory
2. Change into it
3. Download the database archive
4. Uncompress (or extract) the database archive
5. Format the database
Create new directory —
In order to keep all BLAST databases in one location, create a directory to store them using the mkdir command:
Change directory —
Change into the newly created directory using the cd command:
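The two steps above can be sketched as follows. The directory name is an assumption: a later output in this section shows the path /home/user/databases, so a directory called databases inside your home directory is used here. Any other location works equally well.

```shell
# Assumed location: a "databases" directory inside your home directory
db_dir="$HOME/databases"

# -p does not complain if the directory already exists
mkdir -p "$db_dir"

# Change into the new directory and print its path to confirm
cd "$db_dir"
pwd
```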
Download the database archive —
Visit the database downloads page on the UniProt website.
Navigate to the UniProtKB section.
Right-click on the FASTA download link corresponding to Reviewed (Swiss-Prot) and then copy it to the clipboard (Fig. 83).
To download the database, you can use the wget command with the copied link as its argument.
When the download is complete, you will find the file uniprot_sprot.fasta.gz in the current directory. You can use the ls command to verify that it exists:
total 86M
-rw-rw-r-- 1 user user 86M Feb 10 15:00 uniprot_sprot.fasta.gz
Since this file is in a compressed format (note the .gz extension), you will need to uncompress it before proceeding.
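As a rough sketch of the download step, the snippet below wraps wget in a small helper that skips the download when the file is already present. The helper name fetch is hypothetical, and the link in the usage comment is an assumption about the current UniProt download location; use the link you copied from the downloads page instead.

```shell
# Hypothetical helper: download a link with wget unless the file already exists.
fetch() {
    url="$1"
    # For a plain link like this one, wget names the saved file
    # after the last component of the link
    file="${url##*/}"
    if [ -f "$file" ]; then
        echo "$file already exists, skipping download"
    else
        wget "$url"
    fi
}

# Usage (assumed link, replace it with the one you copied):
# fetch "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
```

Running the helper a second time in the same directory should only print the "already exists" message, which avoids downloading the archive again.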
Uncompress the database archive —
To uncompress (or extract) the database archive file downloaded in the previous step, you can use the gunzip command. By default, gunzip will remove the original compressed file after extraction. If you would like to keep the original file (uniprot_sprot.fasta.gz), you can include the -k (keep input files) option.
Provide the name of the downloaded file as the argument:
gunzip uniprot_sprot.fasta.gz
When the extraction is complete, you will find the database file in FASTA format in the same directory:
total 267M
-rw-rw-r-- 1 user user 267M Feb 10 15:00 uniprot_sprot.fasta
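To see the effect of the -k option without touching the real download, here is a minimal sketch that works on a tiny stand-in archive in a scratch directory. The file name demo.fasta.gz and its contents are made up for illustration, and the -k option assumes a reasonably recent gzip (1.6 or later).

```shell
# Work in a scratch directory so nothing else is affected
cd "$(mktemp -d)"

# Create a tiny compressed stand-in file (hypothetical name and content)
printf '>seq1 demo sequence\nMAFSAEDV\n' | gzip > demo.fasta.gz

# -k keeps demo.fasta.gz; without it, only demo.fasta would remain
gunzip -k demo.fasta.gz

# Both the archive and the extracted file should now be listed
ls
```

With the real download, the equivalent command would be gunzip -k uniprot_sprot.fasta.gz.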
View the database
Since this extracted database file is large, you can use the head command to view the first few lines of the file:
head -n 5 uniprot_sprot.fasta
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
Alternatively, you can use the less command to view it one page at a time (press q to quit):
less uniprot_sprot.fasta
If you would like to count the number of sequences in the database, you can use the grep command:
grep ">" -c uniprot_sprot.fasta
The -c option of grep counts the number of lines that contain the given search string (> in this case) in the input file. Each sequence in FASTA format starts with a header line beginning with the > character. Hence, counting these lines gives the number of sequences in the file.
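The counting idea can be tried on a small made-up file first. This sketch uses the pattern "^>" instead of ">", which only matches lines that begin with >. That is a slightly stricter variant than the command above, not the original author's choice. The file demo.fasta and its two sequences are invented for illustration.

```shell
# Work in a scratch directory with a tiny two-sequence FASTA file
cd "$(mktemp -d)"
printf '>seq1 first demo\nMAFS\nAEDV\n>seq2 second demo\nMKLV\n' > demo.fasta

# -c prints the number of matching lines; ^ anchors the match to the line start
count=$(grep -c "^>" demo.fasta)
echo "Number of sequences: $count"
```

The file has five lines, but only the two header lines start with >, so the count is 2.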
You can now proceed towards formatting the database.
Format the database —
The database needs to be formatted before it can be used in a BLAST search. You can format it using the makeblastdb command, which is part of the NCBI BLAST+ package.
The command has multiple options. Here is an example:
makeblastdb -in uniprot_sprot.fasta -parse_seqids \
    -title "Swiss-Prot" -dbtype prot -out swissprot
The \ character at the end of the first line splits the long command across multiple lines.
Building a new DB, current time: 03/24/2021 15:12:50
New DB name: /home/user/databases/swissprot
New DB title: Swiss-Prot
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 564277 sequences in 47.507 seconds.
What the options mean:
-in: File name containing input sequences.
-parse_seqids: Parse sequence identifiers from the input file. These will be displayed in search results.
-title: A descriptive name for this database.
-dbtype: The type of input sequences. Acceptable values are prot (for protein) and nucl (for nucleotide) sequences.
-out: The value here will be used to name the output files. This is also the name you will need to use for the database while doing a search (see New DB name in the output).
When formatting is complete, you will find the following files in the current directory:
total 585M
-rw-rw-r-- 1 user user 100M Mar 24 15:13 swissprot.phr
-rw-rw-r-- 1 user user 4.4M Mar 24 15:13 swissprot.pin
-rw-rw-r-- 1 user user 2.2M Mar 24 15:13 swissprot.pog
-rw-rw-r-- 1 user user  18M Mar 24 15:13 swissprot.psd
-rw-rw-r-- 1 user user 411K Mar 24 15:13 swissprot.psi
-rw-rw-r-- 1 user user 195M Mar 24 15:13 swissprot.psq
-rw-rw-r-- 1 user user 267M Feb 10 15:00 uniprot_sprot.fasta
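As a quick sanity check after formatting, the sketch below tests whether three of the listed files are present for a given database name: .phr (sequence headers), .pin (index) and .psq (sequence data). The helper name check_db is hypothetical, and which files to treat as essential is an assumption made for this example. It does not validate the contents of the files.

```shell
# Hypothetical helper: report whether the core files of a protein
# BLAST database with the given name exist in the current directory.
check_db() {
    name="$1"
    for ext in phr pin psq; do      # headers, index, sequence data
        if [ ! -f "$name.$ext" ]; then
            echo "missing: $name.$ext"
            return 1
        fi
    done
    echo "$name: core files present"
}

# Usage, from the directory that holds the database:
# check_db swissprot
```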