Skip to content

How to import data from NCBI sequence databases into Excel 2007 in Four Easy Steps

Excel can be a useful tool for organizing and analyzing genomic sequences and associated data*.  When there are more than a few sequences to organize and analyze, however, automated data import into Excel becomes important.  This article describes one way to easily import hundreds of sequences that may result from an NCBI search into Excel with the output laid out so that the gi, accession, taxonomy ID numbers and other sequence information are conveniently organized in a table.  The figure below shows an example of one such automatically imported and formatted data in a spreadsheet.

*Important Note: Excel has a limit to the number of characters in a cell-this limit is 32767 characters. What this means is that if any sequences have more than 32767 characters (for example more than 32767 nucleotides for DNA sequences) the sequence data will be truncated to 32767. For single genes this is not a severe limitation since the vast majority of genes are less than 32767 nucleotides. For genomes or vectors, however, the sequence data is likely to be truncated so if the sequence data is important for an application then a tool other than Excel should be used.

excel table

The steps to export data from NCBI and import it into Excel are easy and require no programming or software other than Excel 2007 and a web browser to access NCBI.  The four steps are:

-In NCBI (three steps): ncbisteps

1. Run a search from the NCBI databases that return sequence information (protein, nucleotide, EST, genome*, structure*, etc*).  NCBI displays the search results in a default format called SUMMARY.  SUMMARY format is not easy to import into Excel.  There are several other formats however that appear in the DISPLAY box.  The one most useful for Excel input is TINYseq XML format.

2. In the display box, select TinySeq XML format.  The format of the results will change.

3. The last NCBI step is save the results to a file using the SEND TO box.  In the SENT TO box select file.  After selecting file, the search data will be saved to a file in the default download location in the TinySeq XML format.  Note that all the search results will be saved to a file and not only the ones that were displayed.  This is a useful feature if you have more sequences that can be displayed.

*To view the TINYseq XML for search results from databases that do not have the TINYseq XML, one can use NCBI’s LINK feature to link to the protein or nucleotide databases.  The protein and nucleotide databases have the TINYseq XML enabled.  For example, to view the TINYseq XML for structure databases, you can link from the structure search site to the protein sequence database using the LINKS-OTHER LINKS buttons on the structure search result site. Once at the protein sequence database, the TINYseq XML command is available.

In Excel (one step): step4excel

4. Open Excel and go to the DATA toolbar.  In the DATA toolbar are various buttons that allow for import of various data into Excel.  Click on the From Other Sources and then select from the list FROM XML DATA IMPORT.  You will be prompted for the location of a file-navigate to the location where you downloaded the TinySeq XML formatted file from NCBI.  Select that, answer the default Excel prompts and then the data will appear nicely formatted in the spreadsheet.

Those are the four simple steps.  One can then filter and sort the resulting table and use any excel functions for analysis.  It is also possible to automate additional analysis using macros, visual basic for applications scripts, or Matlab (using Excel link with Matlab’s bioinformatics toolbox).

One last note concerning performance-I have tested the import with 2307 virus nucleotide sequences in the genome database.  Excel reported truncation of data during import (sequences with more than 32767 nucleotides were truncated to 32767 characters-see important note at the beginning of this article).  One problem that did occur with this large dataset was that Excel became unresponsive when displaying the column that had the sequence data.  Hiding the column or scrolling so that the column does not appear resulted in normal performance. Also, formatting the column so that the text appears as characters and not ####… results in normal performance.

Related Posts with Thumbnails

Posted in Explanations, Web Log.

0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

Some HTML is OK

or, reply to this post via trackback.