Google Scholar

Google Scholar is a search engine from the company Google Inc. that serves general literature searches for scientific documents. It covers both free documents from the open Internet and fee-based offerings. In most cases, results are displayed as full texts or at least as bibliographic references. Google Scholar analyzes and extracts the citations contained in the full texts and builds a citation analysis from them. The bibliographic details of these citations can also be searched via the service.

Google Scholar builds on the experience Google gained in previous years with its various other services, above all Google Web Search. The layout, the ease of use and the indexing of all resources in a single overall index were carried over to the scientific search engine. With a few adjustments, PageRank could also be adopted for evaluating and sorting sources. The CrossRef project is regarded as the predecessor of Google Scholar. In that project, the full-text holdings of numerous publishers and societies were indexed in addition to open-access and self-archived documents, and all of this material could be searched through the familiar, simple Google search interface. The goal of the project was to make part of the Deep Web accessible to the search engine, namely the fee-based publications of publishers and professional societies that can be reached only after registration. The basis for this was a joint agreement between Google and the participating publishers.


On 18 November 2004, Google launched the English-language beta version of Google Scholar; since 21 April 2006, the search service has also been available in German.

The focus of the indexed literature is on journals. However, Google Scholar also covers other scientific documents, either in full text or only with the corresponding bibliographic data. This includes content from the free web, for example from private and institutional home pages, as well as open-access publications and self-archived documents. In addition, fee-based offerings from publishers and professional societies are indexed. Like its predecessor project CrossRef, Google Scholar thereby opens up part of the Deep Web.

The special feature of Google Scholar is its full-text analysis and indexing. In specialized scientific databases, by contrast, only the bibliographic references, abstracts and keywords can be searched. Unlike in those databases, documents are not selected and evaluated intellectually (that is, by human experts) but by algorithms that assess their scientific character and determine their rank in the hit list.

The results of a literature search are displayed sorted by relevance. A distinction is made between fee-based and free publisher offerings; the hits, however, do not always lead directly to the full text, even for open-access publications. The added value of the scientific search engine lies in the ranking of documents and in the extraction and analysis of citations, as well as in the option of forwarding queries to WorldCat and in the "Library Links" for users of libraries that cooperate with Google Scholar.

Target group

According to its website, Google Scholar addresses the academic community. The target group thus comprises scientists, researchers, university lecturers, research associates, graduate students and other students.

Because Google Web Search is used very heavily by adolescents and young adults, it may be assumed that Google Scholar is used more by students than by scientists, since the latter group has more experience in obtaining scientific literature.

Search space

Google Scholar sees itself as a service for general searches of scientific literature. This mainly includes journal articles, books and technical reports, but also term papers and all kinds of student theses, PowerPoint presentations, abstracts, preprints and conference papers. Some of these documents are freely available on the web; others come from commercial providers. The range of full texts is being expanded significantly by integrating data from Google Books.

The commercial data providers are academic publishers, professional societies and trade associations with which Google has reached an agreement that allows its web crawlers to index the full-text documents. Only academic articles are considered, not textbooks or monographs. It is clear that Google defines "scientific" very broadly: in addition to journal articles published after peer review, the index also lists slides, student projects, documents from university publication servers, and documents that individuals make available on their home pages.



As already explained, the search space comprises scientific documents of differing quality levels, some of which exist in several stages of revision. Thus not only quality-checked articles from scientific journals are covered, but also open-access publications, which have not always passed through peer review, as well as preprints and presentation slides. Google Scholar groups the different versions of a document. The publisher's version is displayed as the hit, and all other versions are collected beneath it under the link "All ... versions", which opens the list of all indexed versions.
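The grouping of versions can be illustrated with a small sketch. Google has not published how it matches versions, so the normalized-title key, the `Version` record and the `is_publisher` flag below are purely illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
import re


@dataclass
class Version:
    title: str
    url: str
    is_publisher: bool  # hypothetical flag: True for the publisher's copy


def title_key(title: str) -> str:
    """Normalize a title so that trivially different spellings match."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def group_versions(versions):
    """Group versions by normalized title; the publisher copy comes first."""
    groups = defaultdict(list)
    for v in versions:
        groups[title_key(v.title)].append(v)
    # sorted() is stable, so the other copies keep their input order
    return {key: sorted(group, key=lambda v: not v.is_publisher)
            for key, group in groups.items()}
```

The first element of each group plays the role of the displayed hit; the remaining elements correspond to the entries behind the versions link.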

Google Scholar analyzes and indexes documents in various formats, including HTML, PDF and PostScript; compressed files can also be processed. The number of documents available as full text has been extended significantly by the integration of data from Google Books. However, subject areas of low popularity are poorly represented in Google Scholar, whether as bibliographic records or as full texts.


Google Scholar extracts metadata such as title, author and year of publication from the retrieved documents. This happens automatically: the crawler scans the documents, and an algorithm distinguishes the individual text segments on the basis of the document layout. The software thus recognizes a segment as a citation, an author name, a year of publication, and so on. This extraction is difficult because the documents follow no standard, or different standards, and come in different formats. Accordingly, some of the metadata is recognized incorrectly, which has negative consequences for the discoverability of documents and for all the features that Google Scholar builds on this data. This mainly affects institutional publication servers whose metadata does not match the schema Google requires.

The extracted data is used for the citation results, for the ranking factor of the document and for the "Cited by" function. It is also required for the advanced field-specific search and for export to reference management software.


The ranking process uses the established procedures of Google Web Search. Because the familiar Google technology runs in the background of Google Scholar, the search service offers the same search interface and the same processing speed. However, scientific documents and their contents have special properties that make it necessary to adapt the principles and algorithms of PageRank.

The technology takes into account the full text of the document, the source in which it was published and, above all, how often it is cited in other articles, to name only some of the factors. Since Google discloses little about the ranking process, only guesses can be made about further popularity measures and their weighting. What is known is that frequently cited literature appears near the top of the hit list. Because this gave recent documents a lower ranking factor than older ones, the weighting of the publication date was changed in favor of more recent documents.
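The interplay of citation frequency and publication date can be sketched with a toy score. The actual formula is not public; the logarithmic damping, the `recency_weight` parameter and both function names are assumptions chosen only to show how a recency bonus can offset the head start of older, more often cited documents.

```python
import math


def toy_score(citations: int, year: int, current_year: int,
              recency_weight: float = 0.5) -> float:
    """Toy relevance score: log-damped citation count plus a bonus
    that shrinks with the age of the document (assumed weighting)."""
    age = max(current_year - year, 0)
    return math.log1p(citations) + recency_weight / (1 + age)


def rank(docs, current_year):
    """Sort (title, citations, year) tuples by descending toy score."""
    return sorted(docs, key=lambda d: toy_score(d[1], d[2], current_year),
                  reverse=True)
```

With equal citation counts, the newer of two documents ranks higher under this sketch, while a much more frequently cited document still stays on top.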


For the automatic extraction and analysis of citations, Google draws on its experience with link analysis and on the findings of the search engine CiteSeer. By means of autonomous citation indexing, references are extracted from the full texts and listed. Google Scholar therefore also lists works that lie outside its own coverage; these are mainly books.
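The core idea of a citation index can be shown in a few lines: the reference lists of the indexed documents are inverted into a "cited by" mapping. This is a minimal sketch, not Google's or CiteSeer's implementation; the function name and the input format are assumptions.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert {citing document: [cited works]} into {cited work: {citing}}."""
    cited_by = defaultdict(set)
    for citing, cited_works in references.items():
        for work in cited_works:
            cited_by[work].add(citing)
    return dict(cited_by)
```

A work that appears only in reference lists, such as a printed book, still receives an entry in the result even though it is not itself among the indexed documents.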

Google Scholar is sometimes seen as a rival to the costly citation databases Science Citation Index (SCI) and Scopus, because it offers citation analysis free of charge and covers more open-access journals than these databases. In this respect, Google Scholar has some advantages over the commercial offerings.

Like the automatic extraction of metadata, the machine recognition of citations is error-prone. As a result, the Google Scholar index contains some redundant, incomplete or incorrect entries.

With the functions "Related articles" and "Cited by", Google Scholar offers ways to extend a search. Hits marked "citation" are documents that are referenced in other scientific resources but are not included in Google Scholar in full text. For these, the user is shown only the extracted bibliographic data. However, the query can be forwarded to WorldCat via the link "Library Search". This catalog identifies the nearest library that holds the item. The link "Related articles" lists thematically related documents. This function, too, is based on the full-text index and the subsequent automatic extraction and analysis of the data.

System Architecture

Hardware and Infrastructure

So-called commodity servers are used: commercially available or self-assembled PCs on which a free Unix-type operating system is installed. The servers are located in different data centers, form computing clusters and together can process data in the terabyte range. The advantage of many individual units over large systems lies in the easy replacement of individual servers: faulty servers can be exchanged without loss of performance, and the entire system can be expanded quickly and easily.

Because Google publishes very little about its system architecture, nothing specific can be said about how the servers are connected within and between clusters, nor about the protocols and interfaces used for internal and external data exchange.


The web crawlers follow links to freely available websites and search them for scientific documents. Thanks to agreements with professional societies and publishers, Google's crawlers can do this not only on the free web but also on the protected pages of the contracting partners. The crawlers extract the bibliographic data of the retrieved documents as well as the citations they contain, using special algorithms for these tasks. As is usual with Google, no intellectual review of the results takes place. Other content providers, such as database hosts and libraries, which offer specialized databases, library catalogs and virtual libraries, by contrast create their metadata records entirely intellectually or semi-intellectually with the help of trainable indexing programs.

Link resolver

However, the crawlers have no access to library databases. Access to the necessary data of cooperating libraries is possible only via link resolvers, which provide the interface to the libraries' electronic resources. For this to work, the provider of the link resolver must make changes to it. Google Scholar can then pass a library user from the hit list to the full text.

Through this interface, the information needed about licensed documents can be read from the library catalog, such as the provider, the licence period and the link to the full text. This requires an XML file on the library's website, which is generated daily from the internal configuration files of the link resolver in use. It contains the title of the journal, its ISSN and information on the subscription period, given as the year, volume and issue number of the first and last licensed issue. The library can also insert comments about gaps in its holdings or access restrictions. To help libraries create this file, Google Scholar provides a sample file.
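The content of such a holdings file can be sketched as follows. The element and attribute names (`holdings`, `journal`, `first`, `last`, `comment`) are hypothetical and do not reproduce the schema of Google's sample file; the sketch only shows how title, ISSN and subscription period fit together in XML.

```python
import xml.etree.ElementTree as ET


def holdings_xml(journals):
    """Serialize journal holdings to an XML string (hypothetical schema)."""
    root = ET.Element("holdings")
    for j in journals:
        item = ET.SubElement(root, "journal")
        ET.SubElement(item, "title").text = j["title"]
        ET.SubElement(item, "issn").text = j["issn"]
        # first and last licensed issue: (year, volume, issue number)
        for tag in ("first", "last"):
            year, volume, issue = j[tag]
            ET.SubElement(item, tag, year=str(year), volume=str(volume),
                          issue=str(issue))
        if j.get("comment"):
            ET.SubElement(item, "comment").text = j["comment"]
    return ET.tostring(root, encoding="unicode")
```

In practice the link resolver would regenerate such a file every day from its own configuration, so that the licence data seen by the search engine stays current.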

Hit display and search

Every search looks for matching documents and for all documents in which these are cited. The primary hits are, where available, the indexed publisher versions. The value-added services described above are clearly offered at the end of each hit's display.

The hit list can be restricted further. One pull-down menu sets the earliest year of publication. A second menu makes it possible to include citations in the hit list or to display only hits that have at least a summary; with the latter setting, both hits without abstracts and citations are excluded. Google Scholar does not offer the user any further sorting options. At this point Google Scholar also offers an alert service that notifies the user by e-mail of newly indexed documents matching the search query. The query entered is carried over into the field "Notification query". After making any necessary changes to the query and entering an e-mail address, the user sets up the alert by clicking "Create Alert".

Google Scholar provides a simple search, an advanced search and a search with operators within the simple search. Certain settings can be made in advance for these search variants: the language of the documents and of the user interface, and the number of hits per page. The settings also allow the user's home library to be selected for the library link function. A further setting concerns reference management: under "Bibliography Manager", users select the format in which they wish to import data into their reference management software.

Simple Search

In the simple search, several search terms can be entered one after another; they are automatically combined with "AND". Phrase searching is possible by enclosing the terms in quotation marks. When searching by author name, it does not matter whether the name is entered as "last name first name" or "first name last name". However, to find all of an author's documents, the search must be run both with the name written out in full and with the first name abbreviated. The names of several persons can also be entered in the search box.
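The two name forms needed for a complete author search can be generated mechanically. The helper below is a hypothetical illustration of that advice, not part of any Google Scholar interface.

```python
def author_variants(first: str, last: str):
    """Return the full name and the form with the first name abbreviated."""
    return [f"{first} {last}", f"{first[0]} {last}"]
```

Each returned variant would then be submitted as a separate search, and the two hit lists combined by the user.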

Advanced Search

The advanced search provides several input fields that make Boolean operators easy to use. In the field "with all of the words", the terms are automatically ANDed and searched for in all fields of the database. Phrase searching is possible in the field "with the exact phrase". Synonyms, near-synonyms or terms in different languages can be searched in a single query using the field "any words". The field "without the words" excludes hits that contain certain terms; it corresponds to the operator "NOT".

The search can cover the entire full text or be limited to the title of the article. Searching only the metadata of intellectually indexed documents is not supported by Google Scholar. Further restrictions by publication year or period and by publication, e.g. a particular journal, are possible. It must be noted, however, that not all indexed documents carry a year and that such documents are therefore not included when a date restriction is used. It is also possible to search explicitly in the metadata field "author" only. The repeated searches with different forms of the author's name, described above, are necessary in the advanced search as well.
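The effect of the date restriction on undated documents can be made concrete with a short sketch. The record layout (a dictionary with an optional `year` key) and the function name are assumptions for illustration only.

```python
def since_year(hits, earliest):
    """Keep hits published in or after `earliest`; undated hits are dropped."""
    return [h for h in hits
            if h.get("year") is not None and h["year"] >= earliest]
```

A hit without a year can never satisfy the comparison, so it silently disappears from the restricted hit list, regardless of how relevant it is.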

Command-based search

The refinements described under "Advanced Search" can also be made in the input field of the simple search by entering the corresponding operators, either as a symbol or as a word in uppercase letters.

Terms are ANDed automatically when they are entered one after another. The "AND" operator or the plus sign forces the inclusion of single letters, digits and common words (stop words) that would otherwise be ignored in the search.

The minus sign or the word "NOT" excludes the term that follows it from the search, so that documents containing this term are removed from the hit list. The third Boolean operator, "OR", can be entered only as a word. As already described, it allows synonyms, near-synonyms or translations of terms to be covered simultaneously in one search, so that a single query achieves broader thematic coverage.

Other operators are "author", "allintitle" and "site". They restrict the search to the metadata author or title of a document, or to a source such as a URL. The operators "filetype" and "allinurl", familiar from Google Web Search, are not supported by Google Scholar. Overall, Google Scholar offers fewer search options for scientific research than specialized databases do. For example, the metadata supplied by publishers and professional societies, such as abstracts and keywords, is not taken into account by the search engine.
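The correspondence between the advanced-search fields and the operators of the command-based search can be sketched as a small query builder. The function and parameter names are hypothetical, and the quoted `author:"..."` form is an assumption about the operator syntax; the sketch only assembles a query string and does not contact any service.

```python
def build_query(all_words=(), phrase="", any_words=(), without=(),
                author="", title_only=False):
    """Translate advanced-search fields into a simple-search query string."""
    parts = list(all_words)               # juxtaposed terms are ANDed
    if phrase:
        parts.append(f'"{phrase}"')       # quotation marks: phrase search
    if any_words:
        parts.append(" OR ".join(any_words))
    parts += [f"-{w}" for w in without]   # minus sign excludes a term
    query = " ".join(parts)
    if title_only:
        query = f"allintitle: {query}"
    if author:
        query += f' author:"{author}"'
    return query.strip()
```

For example, `build_query(all_words=["black", "holes"], without=["fiction"], author="Stephen Hawking")` yields `black holes -fiction author:"Stephen Hawking"`.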


The services of Google Scholar described above are now demonstrated with an example search. The author name Stephen Hawking is entered in the field of the "simple search". The search returns 23,500 hits (as of November 2011). The first five pages of the hit list contain only thematically appropriate documents. However, these are almost exclusively in English, which illustrates how strongly the index is concentrated on the English-speaking world.

Right next to the search box of the simple search is the link to the "advanced search", which provides various input fields for formulating the query precisely. To search for publications by Stephen Hawking, his name must be entered as a phrase in the field "article written by". This search returned 554 hits. As described above, a command-based search can also be run from the search box of the simple search, using the operator for searching by author name. The query was: author: Stephen Hawking. It also returned 554 hits, since the advanced search and the command-based search are identical.

The possible restrictions of the hit list have already been outlined above. The structure of a short hit display is now described using a document from the hit list of the search presented:

[ PDF] The master stroke S Hawking ... -2010 - View, very different even than the image that we might have drawn even before one or two decades. Nevertheless, the first drafts of the new concept rich almost a hundred years back. According to the traditional conception of the universe move ... Cited by 5 - Related articles - HTML Version - All 7 versions

Google Scholar first displays the title of the hit; clicking it leads to the indexed document. Next, the extracted bibliographic information of the document is presented. As this example shows, the metadata can be so sparse that it is insufficient for citing the work in a scientific paper. An excerpt from the full text is then offered to help assess the document. In the last line, Google Scholar offers the value-added services already introduced.

Clicking "Cited by 5" displays a short hit list of the publications that cite this work. The link "Related articles" likewise takes the user to a hit list of documents on the same subject. Because this hit is in PDF format, Google Scholar also offers to display it in HTML. Seven different versions of the document were recognized and are grouped under the link "All 7 versions". Further value-added services are the library search and the library link. A search in WorldCat is offered when the hit is a printed work (usually a book). If the Google Scholar user is also a user of a library that cooperates with the search engine, the "Library Link" is offered in the bottom line as well. As already described, it checks whether a licensed electronic version of the article is available and, if so, links directly to the full text.


Positive criticism

The appeal of searching for scientifically relevant documents with Google Scholar lies in its ease of use, the clear presentation of hits and the processing speed. The presumably enormous size of the index, and thus of the search space covered, and the accustomed quality of the ranking are also essential to the success of the scientific search engine. In addition, the search engine is intuitive to use; no knowledge of thesauri, classifications or other controlled vocabularies is required.

These characteristics have made Google Scholar an important and heavily used competitor of established academic search services. Collaborations with libraries and the link to WorldCat have contributed to this. It should be emphasized in this context that the scientific search engine Bielefeld Academic Search Engine (BASE) integrates results from Google Scholar into its own search results.

Google Scholar makes both full texts and bibliographic data available. Its significance lies in opening up parts of the Invisible Web. Through cooperation with publishers and others, documents are indexed that are hidden in databases and are normally inaccessible to search engine crawlers. Together with the indexing of free web content, the scientific search engine thus provides direct access to countless full texts, or at least bibliographic records of them. For fee-based full texts, an abstract is available with which the relevance of the document can be judged before a licence fee is paid. Moreover, the works listed go beyond Google Scholar's actual search space: the extraction of citations also captures works and references that are not available in digital form.

The listing of citations can help to find thematically related documents on the Internet, since the cited sources can be browsed. The same applies to "Cited by", through which more recent sources on a topic can be found.

If the source is nevertheless not available digitally, Google Scholar often offers forwarding to WorldCat or the library link. The latter is very useful for users of libraries that cooperate with Google Scholar.

The search engine is free of charge and, with its claim to index scientific literature, competes with commercial database vendors and full-text archives. Thanks to its analysis of web citations, Google Scholar can be regarded as an alternative to (though not necessarily a competitor of) the established but expensive Science Citation Index and Scopus.

The interdisciplinary design of the search service increases the visibility of publications across disciplines. Google Scholar evaluates scientific documents on the basis of their layout. The search engine indexes journals that are not covered by the Science Citation Index because of its strict selection criteria; this applies in particular to open-access journals. The visibility of these journals and their authors on the Internet is thereby increased, which can be described as a "democratization" of the science system.

Indexing Internet resources with web crawlers has two advantages. First, there is only a single index that has to be queried in a search; this also makes it easier to keep the data up to date and is an advantage over metasearch engines. Second, the results are displayed together, regardless of which data provider they come from.

Negative criticism

The information policy of Google Scholar must be clearly criticized. Users are not told which criteria are used to assess whether a document is scientific or how the ranking is determined. Only vague statements are made about the exact target group; in principle, the search engine is aimed at anyone looking for scientific literature. It also remains unclear which holdings are indexed. Nothing is disclosed about the degree of coverage or about possible gaps in the indexing of the full-text offerings of the cooperating scientific partners; the statements remain very imprecise. The size of the database and the update frequency likewise remain unknown.

It must also be viewed critically that Google Scholar treats student work and even PowerPoint presentations as scientific publications. Mixing these documents with journal articles and their preprints means that the formal and scholarly quality of the results varies. Students with little experience of literature searching in particular find it difficult to identify suitable, high-quality sources. In addition, including presentation slides and preprints creates the problem of duplicates and near-duplicates, because the software must identify the different versions as belonging together and group them under the latest version.

This, however, requires that the data be captured correctly during indexing. The index data is extracted from the full text purely automatically by algorithms and is used for all services; the only basis for the extraction is the layout of the documents. If data is read incorrectly or assigned to the wrong category during this process, the quality of all the services offered declines.

It is not only incorrectly indexed data that has a negative impact. Because the citation frequency is determined exclusively from the indexed sources, documents that are not indexed cannot contribute to this service. This distorts the picture. If the works citing a title are not in the index, that title is ranked lower and appears further down the list of results even though its content may fit very well. In addition, the mechanism for citation extraction and analysis is controversial because of its susceptibility to errors. The citation count reported by Google Scholar is not always correct and, as just mentioned, cannot include all citing works. The actual relevance of a hit therefore cannot be read from its citation count.

The bibliographic details shown in the hit display are also very brief for all document types. Moreover, because of the indexing and extraction algorithms described, they are often wrong in form and content. They hardly meet the demands of academic work.

In addition, users depend heavily on the ranking of the results, because Google Scholar offers no way of sorting them. Only citations, or documents from before a selected year of publication, can be excluded. The problem in this context is the lack of intellectual control: algorithms determine which documents are indexed and what ranking value they receive.

Google Scholar's restriction to the indexing of full texts must also be viewed critically. Keywords, classification notations and abstracts, which high-quality journal articles possess, are not indexed and are thus ignored entirely. Google Scholar thereby forgoes a way to increase the precision of searches. Nor is there any further processing of the indexed documents, such as stemming.

The search tools offered by Google Scholar are very limited. Restrictions can be made only by author, journal and publication year. These options do not meet the requirements of a scientific literature search. Moreover, when the date restriction is used, sources without a publication date are excluded and do not appear in the hit list. This restriction is therefore suitable neither for a precise search nor for an exhaustive one. It must also be viewed critically that neither truncation nor wildcard masking is possible.

Several search restrictions that users have come to expect from Google Web Search are also not supported by Google Scholar, including the operators "allinurl" and "filetype". Besides the Boolean operators, Google Scholar supports only "allintitle", "site" and "author". Thematic searches are possible only with keywords, which is insufficient. The multidisciplinary approach of Google Scholar is a further disadvantage. The German-language version offers no thematic restriction; searches are always multidisciplinary. In the English version, seven general subject areas are available for restricting the search space. The quality of the document cannot be used as a restriction. An option to limit the search to certain document types, or to exclude types, would be useful. The limited search capabilities are also partly error-prone, because they rest solely on machine-selected, machine-indexed and machine-rated data.

In summary, because of its lack of search options, Google Scholar cannot replace searching in specialized databases. For thematic searches in particular, thesauri, classifications and abstracts offer good search capabilities that Google Scholar does not use. The lack of truncation is also a clear disadvantage compared with specialized databases. Google Scholar should not be used for a literature search that aims at completeness or precision. It is, however, ideally suited as an introduction to a topic and for finding full texts on the basis of bibliographic information.