GENE2XML CONVERTER PROGRAM

gene2xml is a stand-alone program that converts Entrez Gene ASN.1 into XML. It is available for several computer platforms (Alpha, Linux, Macintosh, Solaris, and Windows) and is distributed in the asn1-converters area of the NCBI public ftp site. From asn1-converters, navigate into by_program and then gene2xml, and download and extract the appropriate file.

The following versions of gene2xml have been made available:

Version 1.3  March 1, 2010
    Report write failures

Version 1.2  February 1, 2010
    Adds new choices with these values: ncRNA (8), tmRNA (9), miscRNA (10)
    Adds elements RNA-gen, RNA-qual, RNA-qual-set

Entrez Gene data are stored as compressed binary Entrezgene-Set ASN.1 files on the NCBI ftp site, and have the suffix .ags.gz. These are several-fold smaller than compressed XML files, which saves significant disk storage and network bandwidth. Normal processing by gene2xml produces text XML files with the same name but with .xgs as the suffix.

The command-line arguments to gene2xml are described below.

-   Version and Argument Display
    Displays the version of the gene2xml program and its arguments and their descriptions.

-p  Path to Files [String]  Optional
    Use -p if you want to process an entire directory of files. In this case, gene2xml ignores the -i and -o arguments. Otherwise it takes -i as the single input file, regardless of suffix.

-r  Path for Results [String]  Optional
    If -p is given but no -r results path is provided, results are written in the same directory as the input file. The -p argument recursively explores any subdirectories, so there can be multiple places where output is written.

-i  Single Input File [File In]  Optional  default = stdin
-o  Single Output File [File Out]  Optional  default = stdout
    If -p is not given, -i is used for the input file, and -o is used for the output file. Suffix conventions are ignored in this case.
-b  File is Binary [T/F]  Optional  default = F
-c  File is Compressed [T/F]  Optional  default = F
    On UNIX platforms you can decompress .ags.gz files on-the-fly by using both -b and -c. On the PC you will need to manually decompress into .ags files and then only use the -b flag.

-t  Taxon ID to Filter [Integer]  Optional  default = 0
    If you want to extract only records for a particular organism, pass the NCBI taxon database number with the -t argument. For example

        gene2xml -i All_Mammalia.ags.gz -b -c -t 9685 -o cats.xgs

    will only send gene records for cats (taxonomy ID 9685) to the file cats.xgs.

-l  Log Processing [T/F]  Optional  default = F
    When you are processing an entire directory of files, passing -l on the command line causes gene2xml to print the current file name as it progresses through the directory.

The following arguments, -x, -y, and -z, are normally not used, and gene2xml will default to writing Entrezgene-Set XML, which is the normal situation.

-x  Extract .ags -> text .agc [T/F]  Optional  default = F
    To accommodate existing programs, the -x argument will convert .ags files to the catenated Entrezgene text ASN.1 files that were previously distributed.

-y  Combine .agc -> text .ags (for testing) [T/F]  Optional  default = F
-z  Combine .agc -> binary .ags, then gzip [T/F]  Optional  default = F
    NCBI uses gene2xml with the -y or -z arguments to process internal data into the compressed binary Entrezgene-Set ASN.1 files that are placed on the NCBI ftp site. It is not expected that anyone outside of NCBI would use these arguments.

A sample record that illustrates the structure of Entrezgene-Set XML is shown below. Ellipses (...) are used where blocks of text have been removed for brevity in this documentation.
2652 0 2003 8 28 20 30 0 2005 4 27 21 45 0 6 1 1 Homo sapiens human taxon 9606 man Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo 1 2 PRI 1 X OPN1MW opsin 1 (cone pigments), medium-wave-sensitive (color blindness, deutan) Xq28 MIM 303800 CBD DCB GCP CBBM HGNC:4206 opsin 1 (cone pigments), medium-wave-sensitive (color blindness, deutan) green cone pigment Xq28 LocusLink 2652 2652 1 Reference NC_000023 8 152969013 152982377 51511752 3 Reference NM_000513 1 152969013 152969124 51511752 ... 4503964 8 Reference NP_000504 1 152969013 152969124 51511752 ... 4503965 ... 254 Nomenclature HUGO Gene Nomenclature Committee 16 Official Symbol OPN1MW 16 Official Full Name opsin 1 (cone pigments), medium-wave-sensitive (color blindness, deutan) ... 254 LocusTagLink HGNC 4206 ... LocusID 2652 MIM 303800 LOC2652 PROP phenotype
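
As an illustration of the single-file arguments described above (-i, -o, -b, -c, and -t), the following Python sketch assembles a gene2xml command line and derives the .xgs output name from an .ags.gz input name. The helper names output_name and build_command are hypothetical and are not part of gene2xml. The sketch only builds the argument list and does not run the program, and it assumes the flags are passed bare, as in the cat example. Appending .xgs to a name with no .ags suffix is also an assumption, not documented behavior.

```python
import shlex


def output_name(input_name: str) -> str:
    """Replace a .ags.gz or .ags suffix with .xgs (hypothetical helper)."""
    for suffix in (".ags.gz", ".ags"):
        if input_name.endswith(suffix):
            return input_name[: -len(suffix)] + ".xgs"
    # Assumption: for other names, simply append .xgs.
    return input_name + ".xgs"


def build_command(input_file, output_file=None, taxon_id=0,
                  binary=True, compressed=True, program="gene2xml"):
    """Return the argument list for a single-file gene2xml run.

    Only flags named in the documentation are used. Set compressed=False
    on platforms where the .ags.gz file must be decompressed by hand.
    """
    cmd = [program, "-i", input_file]
    if binary:
        cmd.append("-b")          # input is binary ASN.1
    if compressed:
        cmd.append("-c")          # input is gzip-compressed
    if taxon_id:
        cmd += ["-t", str(taxon_id)]  # keep only records for this taxon
    cmd += ["-o", output_file or output_name(input_file)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the cat example from the -t description.
    print(shlex.join(build_command("All_Mammalia.ags.gz", "cats.xgs",
                                   taxon_id=9685)))
```

The returned list can be handed to subprocess.run if gene2xml is installed and on the search path. Keeping the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.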