Chemical database

A chemical database is a database designed to store chemical information. This information includes chemical and crystal structures, physical properties of molecules, spectra, reactions and syntheses, and thermodynamic data.

Types of chemical databases

Chemical structures

Chemical structures are traditionally represented as line drawings that show the chemical bonds between atoms (2D structural formulas). While these are ideal visual representations for chemists, they are completely unsuitable for use on computers (storage and search). Small molecules (also known as ligands in drug design) are typically represented as lists of atoms and their connections. Large molecules such as proteins, however, are represented more compactly by the sequences of their amino acid building blocks. Large databases of chemical structures are built to handle the storage and retrieval of information on millions of molecules and their physical properties.

Literature database

Chemical literature databases link structures or other chemical information to relevant references, such as scientific papers or patents. Examples include STN, SciFinder and Reaxys.

Crystallographic database

Crystallographic databases manage X-ray crystal structure data. Typical examples are the Protein Data Bank and the Cambridge Structural Database.

NMR spectra database

NMR spectra databases correlate chemical structure with NMR data. These databases often contain other characterization data such as FTIR and mass spectrometry.

Databases of reactions

Most chemical databases store information about stable molecules, but reaction databases also store intermediates and transient, unstable molecules. Reaction databases contain information about products, starting materials and reaction mechanisms.

Thermophysical database

Thermophysical data is information about:

  • Phase equilibria, including vapor-liquid equilibrium, solubility of gases in liquids, solubility of solids in liquids (SLE), heats of mixing, and enthalpies of vaporization and fusion.
  • Caloric data, such as heat capacity, standard enthalpy of formation and heat of combustion.
  • Transport properties, such as viscosity and thermal conductivity.

Chemical structure representation

There are two basic techniques for representing chemical structures in digital databases. The first is connection tables, adjacency matrices or adjacency lists with additional information about bonds (edges) and atom attributes (nodes), as in MDL Molfile, PDB and CML. The second is linear string notations based on a depth-first or breadth-first traversal of the molecular graph, such as SMILES/SMARTS, SLN, WLN and InChI.

These approaches have been refined to represent stereochemical differences and to handle special types of bonding, such as those found in organometallic compounds. The main advantages of a computer representation are increased storage capacity and fast, flexible searching.
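As a rough illustration of the first technique, the following Python sketch stores ethanol as a minimal connection table and derives an adjacency list and an adjacency matrix from it. The variable and function names are hypothetical, hydrogens are left implicit, and the layout does not follow any particular file format such as MDL Molfile.

```python
# Ethanol (CH3-CH2-OH) with implicit hydrogens, as a minimal connection table.
atoms = ["C", "C", "O"]              # node attributes: element symbols
bonds = [(0, 1, 1), (1, 2, 1)]       # edges: (atom index, atom index, bond order)


def adjacency_list(n_atoms, bonds):
    """Return {atom index: [(neighbour index, bond order), ...]}."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def adjacency_matrix(n_atoms, bonds):
    """Return a symmetric matrix whose entries are bond orders (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = matrix[b][a] = order
    return matrix


ethanol_adj = adjacency_list(len(atoms), bonds)
ethanol_matrix = adjacency_matrix(len(atoms), bonds)

# The same molecule in the second technique, as a linear SMILES string.
ethanol_smiles = "CCO"
```

The connection table keeps atoms and bonds as explicit records, whereas the SMILES string encodes the same graph as a single line of text.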

Search

Substructure

Chemists can search databases using parts of structures, parts of IUPAC names, and constraints on properties. Chemical databases differ from general-purpose databases above all in their support for substructure search. This type of search is achieved by looking for a subgraph isomorphism (sometimes also called a monomorphism) and is an application of graph theory. The search algorithms are computationally intensive, often with a time complexity of O(n^3) or O(n^4), where n is the number of atoms involved. The component-wise search is called atom-by-atom search (ABAS): the atoms and bonds of the query are compared with those of the target molecule. ABAS usually uses the Ullmann algorithm or variations of it (e.g. SMSD).

The search can be accelerated by precomputation, that is, part of the time for each query is saved by using precalculated, stored information (a domain index). These precalculations typically produce bit strings that represent the presence or absence of molecular fragments. The ABAS comparison then only needs to consider target molecules that contain the fragments present in the query; the rest can be excluded from the search. This elimination is called screening (not to be confused with the screening methods used in drug development). The bit strings used for this purpose are also called structural keys.
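The idea of an atom-by-atom search can be sketched with a plain backtracking search in Python. This is not the Ullmann algorithm itself, only a simplified stand-in: it matches element symbols and connectivity, ignores bond orders and hydrogens, and all names (make_graph, has_substructure) are hypothetical.

```python
def make_graph(atoms, bonds):
    """Build (element list, {atom index: set of neighbour indices})."""
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return atoms, adj


def has_substructure(query, target):
    """True if every query atom and bond can be mapped onto the target."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target
    mapping = {}  # query atom index -> target atom index

    def extend(q):
        if q == len(q_atoms):
            return True  # all query atoms are mapped
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Each already-mapped neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend(0)


ethanol = make_graph(["C", "C", "O"], [(0, 1), (1, 2)])
acetic_acid = make_graph(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
c_o = make_graph(["C", "O"], [(0, 1)])
o_c_o = make_graph(["O", "C", "O"], [(0, 1), (1, 2)])

found_in_ethanol = has_substructure(o_c_o, ethanol)
found_in_acid = has_substructure(o_c_o, acetic_acid)
```

In this toy data the O-C-O pattern is found in acetic acid but not in ethanol, while the C-O pattern is found in both. The worst-case cost of such a search grows quickly with molecule size, which is why screening is applied first.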

The performance of such a key depends on the choice of fragments used to construct it and on the probability of their occurrence in the molecules of the database. Another type of key uses hash codes derived from fragments. These are called "fingerprints", although the term is sometimes also used for structural keys. The memory needed to store structural keys and fingerprints can be reduced by "folding": parts of the key are combined with bitwise operations, which shortens its overall length.
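A minimal Python sketch of hashed fingerprints, folding and screening might look as follows. It assumes that fragments have already been enumerated and are given as text labels (the labels, the 64-bit length and the use of CRC32 as the hash are all illustrative assumptions, not the scheme of any particular system).

```python
import zlib


def fingerprint(fragments, n_bits=64):
    """Hash each fragment label to one bit of an n_bits-wide integer."""
    fp = 0
    for fragment in fragments:
        position = zlib.crc32(fragment.encode("utf-8")) % n_bits
        fp |= 1 << position
    return fp


def fold(fp, n_bits):
    """Halve the length by OR-ing the upper half onto the lower half."""
    half = n_bits // 2
    lower = fp & ((1 << half) - 1)
    upper = fp >> half
    return lower | upper


def may_contain(query_fp, target_fp):
    """Screening test: every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


target_fp = fingerprint(["C-C", "C-O", "C-C-O"])
query_fp = fingerprint(["C-O"])

passes_screen = may_contain(query_fp, target_fp)
folded_target = fold(target_fp, 64)   # now fits in 32 bits
folded_query = fold(query_fp, 64)
```

A target that fails the test cannot contain the query, so it can be skipped. A target that passes may still be a false positive, because different fragments can hash to the same bit, and folding makes such collisions more likely. Molecules that pass the screen therefore still need the atom-by-atom comparison.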

3D conformation

Searching for molecules with a matching 3D conformation by specifying spatial constraints is another feature, one that is particularly important in drug development. Searches of this kind can require a great deal of computation time. Many of the methods used provide only approximate results, for example BCUTS, special function representations, moments of inertia, ray-tracing histograms and shape multipoles.

Descriptors

All properties of molecules beyond their structure can be divided into physico-chemical or pharmacological properties, also called descriptors. In addition, there are various artificial and more or less standardized naming systems for molecules, which yield more or less ambiguous names and synonyms that also need to be managed. The IUPAC name is usually a good choice for representing a molecular structure as a string that is both human-readable and unique for computers, although it becomes unwieldy for larger molecules. Common names, by contrast, are a bad choice for a database key because of homonyms and synonyms. While physico-chemical descriptors such as molecular weight, (partial) charge and solubility can usually be calculated directly from the structure of the molecule, pharmacological descriptors can be derived only indirectly, using multivariate statistics or experimental results (screening, bioassay). None of these descriptors is part of the representation of the molecule itself.
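As a small example of a physico-chemical descriptor computed directly from the structure, the following Python sketch sums atomic masses over the composition of a molecule. The mass table is deliberately tiny, the values are rounded standard atomic masses, and the function name is hypothetical.

```python
# Rounded standard atomic masses in g/mol for the few elements used here.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(atom_counts):
    """Sum atomic masses over a composition such as {"C": 2, "H": 6, "O": 1}."""
    return sum(ATOMIC_MASS[element] * count
               for element, count in atom_counts.items())


ethanol = {"C": 2, "H": 6, "O": 1}
ethanol_mw = molecular_weight(ethanol)   # about 46.07 g/mol
```

A value like this would typically be stored in its own column next to the structure, so that property constraints can be evaluated without recomputing it.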

Chemical similarity

Chemical similarity (or molecular similarity) refers to the similarity of chemical elements, molecules or chemical compounds with respect to either structural or functional properties. There is no single definition of molecular similarity; the concept is defined according to the application and is often described as the inverse of a distance measure in descriptor space. Two molecules may be considered more similar if, for example, the difference between their molecular weights is smaller than that for other molecules. A variety of other measures can be combined into a multivariate distance measure. Distance measures are often classified as Euclidean or non-Euclidean, depending on whether the triangle inequality holds. Maximum common subgraph (MCS) based substructure search (as a similarity or distance measure) is also very common. MCS is also used for screening molecules that share a common subgraph.
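The notion of similarity as the inverse of a distance in descriptor space can be sketched in Python as follows. The mapping 1 / (1 + d) is only one of many possible choices and is an assumption made here for illustration; the descriptor vectors contain just the approximate molecular weight in g/mol, whereas a real application would combine several scaled descriptors.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


# One-dimensional descriptor vectors: approximate molecular weight in g/mol.
methanol = (32.04,)
ethanol = (46.07,)
octanol = (130.23,)

closer = similarity(ethanol, methanol)
farther = similarity(ethanol, octanol)
```

By this measure ethanol is more similar to methanol than to 1-octanol, simply because the difference in molecular weight is smaller.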

In chemical databases, groups of "similar" molecules are clustered on the basis of these similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties can be determined either empirically or computationally. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.
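One common formulation of Jarvis-Patrick clustering can be sketched in Python as below. In this variant two items are merged when each is in the other's list of k nearest neighbours and the two lists share at least k_min members; other variants differ in details such as whether an item counts as its own neighbour. The function signature and the toy data are assumptions for illustration.

```python
def jarvis_patrick(points, k, k_min, distance):
    """Cluster points by shared nearest neighbours; returns lists of indices."""
    n = len(points)

    # k nearest neighbours of each point (the point itself is excluded).
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance(points[i], points[j]))
        neighbours.append(set(others[:k]))

    # Union-find structure used to merge points into clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy one-dimensional descriptor values forming two well-separated groups.
values = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
clusters = jarvis_patrick(values, k=2, k_min=1,
                          distance=lambda a, b: abs(a - b))
```

With these settings the six values fall into two clusters of three. The method is non-hierarchical and needs only neighbour lists, not a full hierarchy of merges.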

In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox), which can in turn be inferred semi-automatically from similar combinations of physico-chemical descriptors using QSAR methods.

Registration

Databases that keep unique records of chemical compounds are referred to as registration systems. They are often used for chemical indexing (patent and industry databases), where the information collected must be uniquely identified. Registration systems generally enforce the uniqueness of each chemical in the database through the use of unique representations. This is achieved by generating unique ("canonical") strings to represent the chemical, such as "canonical SMILES". Some registration systems, such as the CAS system, use algorithms that generate unique hash codes to achieve the same goal.

A major difference between a registration system and a simple chemical database is the ability to represent accurately what is known, unknown or partially known. For example, a chemical database might store a molecule with its stereochemistry unspecified, whereas a chemical registration system prompts the registrar to specify whether the stereo configuration is unknown, a racemate or a specific (known) mixture.

Registration systems can also preprocess molecules to avoid registering molecules that differ only trivially, for example in halogen ions.
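The combination of a unique key and an explicit stereochemistry status can be sketched in Python as follows. This is a toy registry, not the scheme of CAS or any other real system: the class and method names are hypothetical, SHA-256 is an arbitrary choice of hash, and the input string is assumed to have been canonicalized already by an external toolkit.

```python
import hashlib
from enum import Enum


class Stereo(Enum):
    SPECIFIED = "specified"
    UNKNOWN = "unknown"
    RACEMATE = "racemate"
    KNOWN_MIXTURE = "known mixture"


class Registry:
    """Toy registration system keyed on a hash of canonical string + stereo."""

    def __init__(self):
        self._numbers = {}   # hash key -> registry number
        self._next = 1

    def register(self, canonical_smiles, stereo):
        """Return (registry number, is_new); duplicates get the old number."""
        text = f"{canonical_smiles}|{stereo.value}"
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._numbers:
            return self._numbers[key], False
        self._numbers[key] = self._next
        self._next += 1
        return self._numbers[key], True


registry = Registry()
first = registry.register("CC(O)C(=O)O", Stereo.RACEMATE)    # lactic acid
again = registry.register("CC(O)C(=O)O", Stereo.RACEMATE)    # same record
other = registry.register("CC(O)C(=O)O", Stereo.UNKNOWN)     # distinct record
```

Registering the same canonical string with the same stereo status returns the existing number, while the same string with a different status creates a separate record, which reflects the distinction between "racemate" and "unknown" described above.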

Tools

The computational representations are typically presented to the user as a graphical display of the data. Data entry is facilitated by chemical structure editors, which convert between graphical depictions of molecules or reactions and the internal data. There are numerous algorithms for interconverting the various representation formats. An open-source program for such conversion is Open Babel.

These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components (cartridges) that plug into standard relational database systems. Both Oracle- and PostgreSQL-based systems use cartridge technology, which allows user-defined data types (e.g. CTAB as a structure data type). Such an external component allows the user to formulate SQL queries with chemical criteria. For example, a query for records with a phenyl ring in their structure, represented as a SMILES string in a column SMILESCOL, could be:

    SELECT * FROM ChemTable WHERE SMILESCOL.CONTAINS('c1ccccc1')

Algorithms for converting IUPAC names to structure representations, and vice versa, also exist and are used to extract structural information from text. However, difficulties arise from the existence of several IUPAC dialects. InChI has established itself as a unique standard here.
