4.5.5. Download protein sequence database
You can follow these steps to download and prepare the protein sequence database:
Create new directory
Change into it
Download the database archive
Uncompress (or extract) the database archive
Format the database
Create new directory — mkdir
In order to keep all BLAST databases in one location, create a directory to store them using the mkdir command:
mkdir ~/databases
Change directory — cd
Change into the newly created directory using the cd command:
cd ~/databases
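The two steps above can also be combined so that they are safe to repeat. The sketch below assumes a POSIX shell; enter_db_dir is a hypothetical helper name, not part of any package, and it falls back to the ~/databases path used in this section. The -p option makes mkdir succeed even if the directory already exists.

```shell
# Hypothetical helper: create the database directory if needed, then change into it.
# With no argument it falls back to the ~/databases location used in this section.
enter_db_dir() {
    dir="${1:-$HOME/databases}"
    mkdir -p "$dir" && cd "$dir"
}
```

Running enter_db_dir with no arguments should leave you in ~/databases; the pwd command will confirm the current directory.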
Download the database archive — wget
Visit the database downloads page on the UniProt website.
Navigate to the UniProtKB section.
Right-click on the fasta download link corresponding to Reviewed (Swiss-Prot) and then copy it to clipboard (Fig. 83).

Fig. 83 Download link for Swiss-Prot database
To download the database, you can use the wget command with the copied link as the argument:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
When the download is complete, you will find a file named uniprot_sprot.fasta.gz in the current directory. You can use the ls command to verify that it exists:
ls -lh
Output:
total 86M
-rw-rw-r-- 1 user user 86M Feb 10 15:00 uniprot_sprot.fasta.gz
Since this file is in a compressed format (.gz), you will need to uncompress it before proceeding.
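An interrupted download can leave a truncated archive behind. As a minimal sketch, the integrity of the file can be checked with the -t (test) option of gzip before extracting it; archive_ok is a hypothetical helper name introduced only for this example.

```shell
# Hypothetical helper: succeed only if the file exists and passes gzip's integrity test.
archive_ok() {
    [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}

if archive_ok uniprot_sprot.fasta.gz; then
    echo "archive looks intact"
else
    echo "archive missing or corrupt; consider downloading it again"
fi
```

If the check fails, downloading the file again with the same wget command is usually the simplest fix.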
Uncompress the database archive — gunzip
To uncompress (or extract) the database archive file downloaded in the previous step, you can use the gunzip command.
Note
By default, gunzip will remove the original compressed file after extraction. If you would like to keep the original file (.gz), you can include the -k (keep input files) option with gunzip.
Provide the file name of the downloaded file as the argument:
gunzip uniprot_sprot.fasta.gz
When the extraction is complete, you will find the database file in FASTA format in the same directory:
ls -lh
Output:
total 267M
-rw-rw-r-- 1 user user 267M Feb 10 15:00 uniprot_sprot.fasta
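If your version of gunzip does not accept -k, the archive can still be kept by writing the decompressed data to standard output with -c and redirecting it to a new file. This is a sketch of that alternative to the gunzip command above; extract_keep is a hypothetical helper name.

```shell
# Hypothetical helper: decompress NAME.gz to NAME and leave NAME.gz in place.
extract_keep() {
    gunzip -c "$1" > "${1%.gz}"
}
```

For example, extract_keep uniprot_sprot.fasta.gz would write uniprot_sprot.fasta next to the archive without deleting it.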
View the database
Since this extracted database file is large, you can use the head command to view the first few lines of the file:
head -n 5 uniprot_sprot.fasta
Output:
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
Alternatively, you can use the less command to view it one page at a time:
less uniprot_sprot.fasta
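The header line in the head output above follows the UniProt FASTA layout: >db|accession|entry_name, followed by a free-text description. As a rough sketch, assuming every header in the file follows that layout, awk can pull out the accession and entry name; list_ids is a hypothetical helper name.

```shell
# Hypothetical helper: print "accession<TAB>entry_name" for each header line.
# Assumes UniProt-style headers such as >sp|Q6GZX4|001R_FRG3G Putative ...
list_ids() {
    awk -F'|' '/^>/ { split($3, rest, " "); print $2 "\t" rest[1] }' "$1"
}
```

For example, list_ids uniprot_sprot.fasta | head -n 5 would show the identifiers of the first five entries.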
If you would like to count the number of sequences in the database, you can use the grep command:
grep -c ">" uniprot_sprot.fasta
Output:
564277
The -c option of grep counts the number of lines in the input file that contain the given search string (> in this case).
Note
Each sequence in FASTA format starts with a header line that begins with the > character, so counting these lines gives the number of sequences in the file.
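A slightly stricter variant anchors the pattern to the start of the line with ^, so that only header lines are counted, and it can also read the compressed archive directly. This is a sketch; count_fasta is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records in a plain or gzip-compressed file.
count_fasta() {
    case "$1" in
        *.gz) gunzip -c "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}
```

Running count_fasta uniprot_sprot.fasta should report the same number as the grep command above, since sequence lines do not contain the > character.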
You can now proceed to format the database.
Format the database — makeblastdb
The database needs to be formatted before it can be used in a BLAST search. You can format it using the makeblastdb command, which is part of the NCBI BLAST+ package.
The command has multiple options. Here is an example:
makeblastdb -in uniprot_sprot.fasta -parse_seqids \
-title "Swiss-Prot" -dbtype prot -out swissprot
Note
The \ character at the end of a line continues the command on the next line, so a long command can be split across multiple lines.
Output:
Building a new DB, current time: 03/24/2021 15:12:50
New DB name: /home/user/databases/swissprot
New DB title: Swiss-Prot
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 564277 sequences in 47.507 seconds.
What the options mean:
-in
File name containing input sequences.
-parse_seqids
Parse sequence identifiers from the input file. These will be displayed in search results.
-title
A descriptive name for this database.
-dbtype
The type of input sequences. Acceptable values are prot (for protein) and nucl (for nucleotide).
-out
The value here will be used to name the output files. This is also the name you will need to use for the database when running a search (see New DB name in the output).
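The makeblastdb command is only available once BLAST+ is installed and on your PATH. A small pre-flight check, sketched below, can make a missing installation obvious before any work is done; require_tools is a hypothetical helper name, and it relies only on the shell built-in command -v.

```shell
# Hypothetical helper: succeed if every named program is on PATH,
# otherwise report each missing program on standard error.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, require_tools makeblastdb && echo "BLAST+ found" prints the message only when the program is available.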
When formatting is complete, you will notice the following files in the databases directory:
ls -lh
Output:
total 585M
-rw-rw-r-- 1 user user 100M Mar 24 15:13 swissprot.phr
-rw-rw-r-- 1 user user 4.4M Mar 24 15:13 swissprot.pin
-rw-rw-r-- 1 user user 2.2M Mar 24 15:13 swissprot.pog
-rw-rw-r-- 1 user user 18M Mar 24 15:13 swissprot.psd
-rw-rw-r-- 1 user user 411K Mar 24 15:13 swissprot.psi
-rw-rw-r-- 1 user user 195M Mar 24 15:13 swissprot.psq
-rw-rw-r-- 1 user user 267M Feb 10 15:00 uniprot_sprot.fasta
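To confirm that formatting produced a usable database, you can check for the three core files of a protein database (.phr, .pin and .psq) seen in the listing above. The sketch below assumes a single-volume database like this one; db_files_present is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the header (.phr), index (.pin) and
# sequence (.psq) files exist and are non-empty for the given database name.
db_files_present() {
    for ext in phr pin psq; do
        if [ ! -s "$1.$ext" ]; then
            echo "missing or empty: $1.$ext" >&2
            return 1
        fi
    done
}
```

For example, db_files_present swissprot && echo "database files found" can be run from the databases directory. If BLAST+ is installed, blastdbcmd -db swissprot -info additionally prints a summary of the database, including its title and number of sequences.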