ChemXSeer Digital Library Gaussian Search
Shibamouli Lahiri Computer Science and Engineering The Pennsylvania State University University Park, PA 16802
arXiv:1104.4601v2 [cs.DL] 28 Apr 2011
Juan Pablo Fernández Ramírez Information Sciences and Technology The Pennsylvania State University University Park, PA 16802
Shikha Nangia Biomedical and Chemical Engineering Syracuse University Syracuse, NY 13244
C. Lee Giles Karl T. Mueller
Information Sciences and Technology The Pennsylvania State University University Park, PA 16802
Information Sciences and Technology The Pennsylvania State University University Park, PA 16802
ABSTRACT We report on the Gaussian file search system designed as part of the ChemXSeer digital library. Gaussian files are produced by the Gaussian software, a package used for calculating molecular electronic structure and properties. The output files are semi-structured, allowing relatively easy access to the Gaussian attributes and metadata. Our system is currently capable of searching Gaussian documents using a boolean combination of atoms (chemical elements) and attributes. We have also implemented a faceted browsing feature on three important Gaussian attribute types: Basis Set, Job Type and Method Used. The faceted browsing feature enables a user to view and process a smaller, filtered subset of documents.
Categories and Subject Descriptors H.3.7 [Information Storage and Retrieval]: Digital Libraries; H.5.2 [Information Interfaces and Presentation]: User Interfaces— graphical user interfaces (GUI), interaction styles, screen design, user-centered design
Chemistry The Pennsylvania State University University Park, PA 16802
ChemXSeer is a digital library and data repository for the Chemoinformatics and Computational Chemistry domains. It currently offers search functionalities on papers and formulae, CHARMM calculation data and Gaussian computation data, and also features a comprehensive search facility on chemical databases. A table search functionality, similar in spirit to the one featured in CiteSeerX (footnote 1), is currently under development. Gaussian document search has been a key component of ChemXSeer from its inception. The alpha version of Gaussian search featured a simple query box and an SQL back-end. Here we describe the next generation of Gaussian search (footnote 2), which includes a customized user interface for Computational Chemistry researchers, boolean query functionality on a pre-specified set of attributes, and a faceted browsing option over three key attribute types. The current version of Gaussian search is powered by Apache Solr (footnote 3), a state-of-the-art open-source enterprise search platform.

The organization of this paper is as follows. In Section 2, we give a brief overview of the Gaussian software and Gaussian files, emphasizing the need for a customized search interface rather than a simple one. The search interface is described in Section 3, followed by a brief sketch of related work in Section 4. We conclude in Section 5, outlining our contributions and providing directions for future improvement.
Keywords ChemXSeer, Gaussian software, Chemoinformatics, Faceted search
Computational chemists perform Gaussian calculations to determine properties of a chemical system using a wide array of computational methods. The methods include molecular mechanics, ground state semi-empirical, self-consistent field, and density functional calculations. Computational methods such as these are key to the upsurge of interest in chemical calculations, partly because they allow fast, reliable, and reasonably easy analysis, modeling, and prediction of known and proposed systems (e.g., atoms, molecules, solids, proposed drugs, etc.) under a wide range of physical constraints, and partly because of the availability of well-tested, comprehensive software packages like Gaussian that implement many
1 http://citeseerx.ist.psu.edu/
2 http://cxs05.ist.psu.edu:8080/ChemXSeerGaussianSearch
3 http://lucene.apache.org/solr/
Figure 1: Screenshot of a Gaussian document.
Figure 2: First-generation Gaussian query interface.
Figure 3: Gaussian search system architecture.
of these methods with a good tradeoff between accuracy and processing time. The Gaussian software is actually a suite of several different chemical computation models, including packages for molecular mechanics, Hartree-Fock methods, and semi-empirical calculations. While the exact details of the software's functionality are beyond the scope of this paper (for details, please see http://www.gaussian.com/g_tech/g_ur/g09help.htm), we would like the reader to note that each run of the Gaussian software is equivalent to conducting a chemical experiment with certain inputs and under certain physicochemical conditions. The output of the software consists of a large amount of information returned to the user via the computer console and usually redirected to a suitably named output file. We are interested in these output files, henceforth referred to as “Gaussian files” or “Gaussian documents”.

The Gaussian files contain detailed information about the calculations being performed on the system of interest. Although the details of the calculations are essential for the analysis of the system being studied, the output file can be cumbersome for a new user. Each Gaussian file begins with the issued command that initiated a particular calculation, followed by copyright information, memory and hard disk specification, basis set, job type, method used, and several different matrices (e.g., Z-matrix, distance matrix, orientation matrix, etc.). It may also contain other information like rotational constants, trust radius, maximum number of steps, and steps in a particular run. Gaussian files are semi-structured (Figure 1) in the sense that these parameters tend to appear in a particular order or with explicit markups.

Since Gaussian files are important to the design, testing and prediction of new chemical systems, ChemXSeer has integrated a search functionality on these files. The alpha version of the Gaussian search interface consisted only of a simple query box (Figure 2), and the back-end of the search engine was an SQL database that stored data extracted from the Gaussian files. Although simple, the interface allowed users to type in fielded queries and view results in an easy-to-understand format. In the current version, we have retained many aspects of the alpha version, including parts of the search results page and the visual representation of individual Gaussian files. However, our domain experts argued that a more complex interface including faceted search was justified, partly because it eases the task of a researcher by limiting the number of search results to examine, and partly because such interfaces have already been successfully implemented. A computational chemist usually knows what kinds of parameters he or she is looking for in a database of Gaussian files, and therefore it makes sense to refine search results using this information. We identified three important parameters towards this end: Job Type, Method Used and Basis Set.

There are other parameters and metadata that we can extract from the Gaussian files, but they are not as important from a domain expert’s point of view. These are Charge, Degree of Freedom, Distance Matrix, Energy, Input Orientation, Mulliken Atomic Charge, Multiplicity, Optimized Parameters, Frequencies, Thermo-chemistry, Thermal Energy, Shielding Tensors, Reaction Path, PCM, and Variational Results. Metadata like ID, Title and File Path are used in organizing the search results.
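To make the idea of refining results by Job Type, Method Used and Basis Set concrete, the following is a minimal illustrative sketch in Python. It is not the ChemXSeer implementation: the records, field names (atoms, job_type, method, basis_set), sample values and helper functions are all hypothetical, and the data is held in memory rather than in a search index. It shows an atom query, the facet counts computed over its results, and a refinement on one facet value.

```python
from collections import Counter

# Hypothetical extracted records; field names and values are illustrative only.
DOCS = [
    {"id": 1, "atoms": {"C", "H", "O"}, "job_type": "Opt",
     "method": "DFT Methods", "basis_set": "6-31G(d)"},
    {"id": 2, "atoms": {"C", "H"}, "job_type": "Freq",
     "method": "Hartree-Fock", "basis_set": "STO-3G"},
    {"id": 3, "atoms": {"Si", "O", "H"}, "job_type": "Opt",
     "method": "Hartree-Fock", "basis_set": "6-31G(d)"},
    {"id": 4, "atoms": {"C", "H", "O"}, "job_type": "Freq",
     "method": "DFT Methods", "basis_set": "6-31G(d)"},
]

# The three attribute types used as facets in the paper.
FACETS = ("job_type", "method", "basis_set")


def search_atoms(docs, atoms):
    """Return the documents that contain every requested atom."""
    wanted = set(atoms)
    return [d for d in docs if wanted <= d["atoms"]]


def facet_counts(docs):
    """Count how many documents fall under each value of each facet."""
    return {f: Counter(d[f] for d in docs) for f in FACETS}


def refine(docs, **selected):
    """Keep only the documents matching every selected facet value."""
    return [d for d in docs if all(d[f] == v for f, v in selected.items())]


hits = search_atoms(DOCS, ["O", "H"])      # basic atom query
counts = facet_counts(hits)                # what a facet sidebar would show
narrowed = refine(hits, job_type="Opt")    # user selects one facet value

print([d["id"] for d in hits])
print(dict(counts["job_type"]))
print([d["id"] for d in narrowed])
```

In this sketch the facet counts are recomputed from the current result set, so a user always sees how many documents each further refinement would leave. A production system would delegate both the filtering and the counting to the search engine.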
The basic query to the Gaussian search system is an atom (i.e., element) or a collection of atoms. The system returns all Gaussian files containing those atoms. In practice, however, such basic queries often return a large number of search results, many of which are not relevant. While the ranking of search results could be improved along the lines of traditional information retrieval research, domain experts have informed us that since Gaussian files are semi-structured, a faceted browsing option would be more appropriate. It remains open, however, whether ranking within each facet could be improved. Currently we rank the search results by their external IDs, because our domain experts were not overly concerned with the ranking.

The system architecture is given in Figure 3. It has three principal components: the query interface, the search results page and the Gaussian file description page. The user supplies a query using the query interface, consisting of atoms (mandatory field), method used, job type and basis set. The last three fields are optional, and can be combined in boolean AND/OR fashion. The boolean query goes to the Gaussian document index, which in turn returns on the search results page all Gaussian files satisfying the
Figure 4: Gaussian query interface.
Figure 5: A search results page.
Table 1: Gaussian Attribute Categories

Job Type        Method Used                Basis Set
Any             Any                        Any
Single Point    Semi-empirical             gen
Opt             Molecular Mechanics
Freq            Hartree-Fock
IRC             MP Methods
IRCMax          DFT Methods
Force           Multilevel Methods
ONIOM           CI Methods
ADMP            Coupled Cluster Methods
BOMD            CASSCF
Scan            BD
PBC             OVGF
S...