Web Mining

Web mining is the application of data mining techniques to the (partially) automatic extraction of information from the Internet, especially the World Wide Web. Web mining adopts procedures and methods from the fields of information retrieval, machine learning, statistics, pattern recognition and data mining. Three objects of study are distinguished:

  • Content (Web Content Mining) - for example, using methods of information retrieval.
  • Link structure (Web Structure Mining) - for example, using methods of webometrics. Web structure mining makes use of so-called hubs: good hubs point to many valuable pages, and valuable pages are in turn pointed to by many good hubs.
  • User behavior (Web Usage Mining) - for example, by the analysis of log files.

Types of Web Mining

Web usage mining attempts to identify regularities in the use of websites or web resources. To this end, all secondary data generated by user interaction with a web resource are processed and analyzed.
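
As a rough illustration of such log file analysis, the following sketch counts successful page requests and page-to-page transitions per client. It assumes log lines in the Common Log Format; the sample lines, the regular expression and the function name `mine_usage` are assumptions made for illustration and do not come from any particular tool.

```python
import re
from collections import Counter

# Assumed input format: Common Log Format lines (hypothetical sample data below).
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" (\d{3}) \S+'
)

def mine_usage(lines):
    """Count page hits and consecutive page transitions per client host."""
    page_hits = Counter()
    transitions = Counter()
    last_page = {}  # most recent successfully requested page per client host
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        host, path, status = match.groups()
        if not status.startswith("2"):
            continue  # only count successful requests
        page_hits[path] += 1
        if host in last_page:
            transitions[(last_page[host], path)] += 1
        last_page[host] = path
    return page_hits, transitions

sample = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:00:07 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:09 +0000] "GET /missing.html HTTP/1.1" 404 0',
    '10.0.0.2 - - [01/Jan/2024:10:00:12 +0000] "GET /products.html HTTP/1.1" 200 2048',
]
page_hits, transitions = mine_usage(sample)
print(page_hits.most_common())
print(transitions.most_common())
```

Frequent transitions are one simple kind of regularity; a real analysis would typically also group requests into sessions by time, which this sketch omits.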

Web structure mining tries to recognize the link structure underlying a domain or website. A model is created based on the topology of the references (hyperlinks) of the Web, optionally together with a description of them. This model can be useful for categorizing and ranking websites and allows conclusions to be drawn about similarities between websites and their relationships to each other. For example, content-rich websites (so-called authorities) and overview-like websites (so-called hubs) can be identified for a particular topic (cf. HITS algorithm).
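
The mutual reinforcement between hubs and authorities can be sketched as a simple iteration in the spirit of the HITS algorithm. This is a minimal illustration, not a reference implementation: the toy link graph, the page names, the fixed iteration count and the function name `hits_scores` are all assumptions.

```python
import math

def hits_scores(graph, iterations=50):
    """Compute hub and authority scores for a directed link graph.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of all pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of all pages the page links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy graph: two overview pages linking to two content pages.
links = {
    "overview_a": ["article_1", "article_2"],
    "overview_b": ["article_1", "article_2"],
    "article_1": [],
    "article_2": ["article_1"],
}
hub_scores, auth_scores = hits_scores(links)
print(sorted(auth_scores.items(), key=lambda kv: -kv[1]))
print(sorted(hub_scores.items(), key=lambda kv: -kv[1]))
```

In this toy graph the overview pages end up with the highest hub scores and `article_1`, which receives the most links, with the highest authority score.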

Web content mining is concerned with the detection of regularities in the content of a web resource and is a field of application for text mining. The content on the Web is composed of unstructured data such as text documents, semi-structured data such as HTML documents, and structured data such as tables or dynamically generated HTML pages. In principle, the content of a web page consists of different data types such as text, images, audio, video, metadata and hyperlinks. Web content mining across multiple data types is called "Multimedia Data Mining" and can be understood as an instance of web content mining. For the most part, however, the content of the Web consists of unstructured text. Text mining can therefore be understood both as an instance of web content mining and as a broader research field that encompasses it. The methods used are general data mining methods, with statistical and computational-linguistic methods serving to transform the texts into a form suitable for data mining.
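
One common way to bring text into a form suitable for data mining is a bag-of-words representation. The following sketch extracts the visible text of an HTML page and counts its terms; the sample page, the tiny stop-word list and the names `TextExtractor` and `term_frequencies` are assumptions chosen for illustration, and real systems usually add steps such as stemming or weighting.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def term_frequencies(html):
    """Turn an HTML document into a bag-of-words term count."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return Counter(t for t in tokens if t not in STOP_WORDS)

page = """<html><head><style>p { color: red; }</style></head>
<body><h1>Web Mining</h1>
<p>Web mining is the application of data mining to the Web.</p></body></html>"""
counts = term_frequencies(page)
print(counts.most_common(3))
```

The resulting term counts can then be fed into general data mining methods such as clustering or classification.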
