Web Data Mining

Oct 21, 2020

Data mining is a method for exploring the information held in a data warehouse. That information can be grouped into rules and patterns that help users and organizations study their collective data and support decision-making. An organization’s centralized database, where all of its data is stored in one large repository, is known as a data warehouse. Companies use data mining to extract valuable information from raw data: software is applied to a massive amount of data (the data warehouse) to search for trends that can help companies learn about their consumers, predict behavior, and improve marketing strategies.

Web data mining is the branch of data mining concerned with data available on the internet. The idea is to extract insightful data from web pages. Users rely on search engines to find the information they need, and web mining is the set of techniques that makes that information discoverable. Various methods and algorithms are used to retrieve information from websites, including web documents, photos, etc. As the volume of text documents on the internet grows, web mining is becoming increasingly important, because finding suitable patterns and informative data manually is very difficult and time-consuming. Web data covers structure (hyperlinks), usage (pages accessed, data consumption), and content (text documents, pages). The World Wide Web itself is a combination of linked web documents, images, audio, etc. The web mining process includes the following:

Information retrieval (IR) is the process of collecting accurate and useful information from the web. Retrieval focuses on selecting relevant data from large database collections and discovering new knowledge from large volumes of data in order to answer user queries. IR phases include scanning, filtering, and matching. Information extraction (IE) is an automated method of extracting structured data from analyzed documents. IE works much like information retrieval but focuses more on extracting relevant facts.

Machine learning supports web data mining. By learning user behavior (interests), machine learning can improve web search, and search engines use different machine learning techniques to provide intelligent web services. This is often more effective than traditional information retrieval alone, because the system learns from user behavior and improves its performance on tasks over time.

Categories of Web Data Mining

Web data mining is divided into three categories:

1. Web Content Mining

2. Web Structure Mining

3. Web Usage Mining

The following chart depicts those categories in detail.

The web consists of huge, complex, varied, and largely unstructured data that carries a significant amount of information. The explosive growth of the web creates problems such as locating useful information and analyzing usage patterns. To address them, attempts have been made to present relevant information in a structured, tabular form that is easy to understand and helps organizations anticipate the needs of customers.

Web Content Mining

Content mining is the web mining process that extracts the required informative data from the content of websites. That content includes audio, video, text, hyperlinks, and structured records. Online content is intended to give users information in the form of text, lists, photographs, videos, and tables. The number of web pages (HTML) has risen to billions over the last few decades and continues to grow. Searching billions of web documents is very difficult and time-consuming, so content mining applies various mining techniques to extract the queried data and narrow the search results, making the information a user needs easier to find.
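As a rough illustration of extracting text and hyperlinks from a page, here is a minimal sketch built on Python's standard html.parser module. The class name ContentExtractor, the helper extract_content, and the sample page are all made up for this example, and a real content-mining pipeline would do far more cleaning than this.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text and hyperlink targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not part of a script or stylesheet.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (visible_text, list_of_link_targets) for an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical sample page, used only to exercise the sketch.
sample_page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Web Mining</h1><p>Read the <a href='/intro'>introduction</a>.</p>"
    "</body></html>"
)
page_text, page_links = extract_content(sample_page)
print(page_text)
print(page_links)
```

The extracted text could then be passed to any of the classification algorithms, while the collected links are the raw material for structure mining.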

Web mining uses various techniques to retrieve information from many data sources. Several types of algorithms are used to extract knowledge, and some classification algorithms are described below:

A decision tree is a hierarchical classification method consisting of a root node, branches, and leaf nodes. The root node is split into sub-branches, and the class label is found at a leaf node. Decision trees are popular because the resulting rules are easy to read and interpret.
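To make the root, branch, and leaf terminology concrete, here is a small sketch that walks a hand-built tree. The tree, its feature names (has_price, has_date), and the labels are invented for illustration. In practice the tree would be learned from training data rather than written by hand.

```python
# Hypothetical hand-built tree that labels a web page as "shop", "news", or "other".
# Inner nodes test one feature; leaf nodes carry the class label.
tree = {
    "feature": "has_price",          # root node
    "branches": {
        True: {"label": "shop"},     # leaf node
        False: {
            "feature": "has_date",   # inner node on the False branch
            "branches": {
                True: {"label": "news"},
                False: {"label": "other"},
            },
        },
    },
}


def classify(node, sample):
    """Follow branches from the root until a leaf with a class label is reached."""
    while "label" not in node:
        node = node["branches"][sample[node["feature"]]]
    return node["label"]


print(classify(tree, {"has_price": False, "has_date": True}))
```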

Naïve Bayes is a fast and powerful classification algorithm, also known as the Naïve Bayes classifier. It is based on Bayes' theorem and assumes that the features are independent of each other given the class. For each class, probabilities are estimated from the training data by counting combinations of values. The class with the highest probability is chosen as the prediction.

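Bayes' theorem:

P(C | X) = P(X | C) × P(C) / P(X)

Here C is a class, X is the set of observed feature values, P(C) is the prior probability of the class, and P(C | X) is the probability of the class given the observations.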
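The counting procedure described for Naïve Bayes can be sketched in a few lines of Python. This is a minimal version for categorical features. The function names, the add-one (Laplace) smoothing, and the toy data set are my own choices for illustration and are not taken from any particular library.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def nb_predict(model, row):
    """Return the class with the highest (unnormalized) posterior probability."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for i, value in enumerate(row):
            # Add-one smoothed estimate of P(x_i | C); the smoothing is an assumption.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data: (page mentions "buy", page mentions "score") -> page category.
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("yes", "yes")]
labels = ["shop", "shop", "sports", "sports", "shop"]
model = train_naive_bayes(rows, labels)
print(nb_predict(model, ("no", "yes")))
print(nb_predict(model, ("yes", "no")))
```

The denominator P(X) is the same for every class, so the sketch skips it and compares only the numerators.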

The Support Vector Machine (SVM) is a well-known machine learning classification algorithm. It can be used for both linear and non-linear data sets. An SVM looks for the optimal separating hyperplane (the decision boundary) between two classes, based on their features. With two features, that boundary is simply a line drawn between the two groups.
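For the linear case only, the idea of a separating hyperplane can be sketched as follows. This is a simplified trainer that uses sub-gradient descent on the hinge loss, with hyperparameters (epochs, learning_rate, reg) and toy points chosen arbitrarily for the example. Production SVM libraries use more sophisticated solvers and support kernels for non-linear data.

```python
def train_linear_svm(points, labels, epochs=20, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the regularized hinge loss (labels are +1 or -1)."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1:
                # Point is inside the margin or misclassified: move the boundary.
                weights = [w + learning_rate * (y * xi - reg * w)
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
            else:
                # Point is safely classified: only apply regularization shrinkage.
                weights = [w - learning_rate * reg * w for w in weights]
    return weights, bias


def svm_predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1


# Toy, linearly separable data with two features per point.
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]
weights, bias = train_linear_svm(points, labels)
print(svm_predict(weights, bias, (4.0, 1.0)))
```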

Another web content mining technique is the neural network trained with the back-propagation algorithm. The network consists of several layers: the input layer, one or more middle (hidden) layers, and the output layer, each of which feeds the next. The fundamental unit of the neural network is the neuron. Inputs are fed simultaneously to the units of the input layer and passed on to the hidden layers. There is often one hidden layer, but the number of hidden layers is not fixed. The last hidden layer feeds the output layer, which produces the network's result.
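The way each layer feeds the next can be sketched with a plain forward pass. The sigmoid activation, the layer sizes, and the weight values below are assumptions made up for illustration. The sketch covers only the forward direction and leaves out back-propagation training, which would adjust the weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one layer: one weight list and one bias per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through every layer in order and return the final outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_layer = ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output_layer = ([[1.2, -0.7]], [0.05])
output = forward([1.0, 0.0, 1.0], [hidden_layer, output_layer])
print(output)
```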

Web Structure Mining

The amount of data on the web is growing day by day, and the World Wide Web is one of the most widely used tools for finding information. This continued growth makes it hard to find informative and necessary information, which is where web mining, the application of data mining methods to web data, is useful. Structure mining is one of the core web mining techniques and deals with the structure of hyperlinks. It reveals the structural summary of a website and describes the relationships between linked web pages. Structure mining analyzes hyperlinks to gather insightful data and organize pages into categories based on similarities and relationships. Mining at the document level is called intra-page mining, and mining at the hyperlink level is called inter-page mining. Link analysis is an old but still useful technique whose value has grown with web mining research. Structure mining is often referred to as link mining.

Reviewing some research papers, I found the following algorithms used for web structure mining:

· The PageRank algorithm

· The HITS algorithm

The PageRank Algorithm

This is a well-known algorithm created by L. Page and S. Brin. The basic idea is that a link from one page to another is treated as a vote. The page that receives the most votes is considered the most important. A vote from a highly ranked page carries more weight, so the page it links to gains more importance.
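The voting idea can be sketched as an iterative computation. The version below follows the common formulation with a damping factor. The value 0.85, the fixed iteration count, and the four-page link graph are assumptions for illustration and do not come from this article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank over the pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared by all pages.
        dangling = sum(ranks[p] for p in pages if not links[p])
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each link is a vote weighted by the rank of the voting page.
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C collects votes from three pages and ends up with the highest rank, while D, which nobody links to, keeps only the baseline share.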

The HITS Algorithm

HITS stands for Hyperlink-Induced Topic Search and is used for web structure mining (hyperlink analysis). The algorithm uses two concepts: authorities and hubs. A good authority is a page pointed to by many hubs with high weights, and a good hub is a page that points to many authorities with high weights.
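The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The normalization by Euclidean length, the iteration count, and the small link graph are illustrative assumptions.

```python
import math


def normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op if all are zero)."""
    length = math.sqrt(sum(value * value for value in scores.values()))
    if length == 0:
        return scores
    return {page: value / length for page, value in scores.items()}


def hits(links, iterations=20):
    """Return (hub_scores, authority_scores) for a page -> linked pages mapping."""
    pages = list(links)
    hubs = {page: 1.0 for page in pages}
    authorities = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that link to it.
        new_authorities = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_authorities[target] += hubs[source]
        # Hub score: sum of the authority scores of the pages it links to.
        new_hubs = {page: sum(new_authorities[t] for t in links[page]) for page in pages}
        authorities = normalize(new_authorities)
        hubs = normalize(new_hubs)
    return hubs, authorities


# Hypothetical link graph: A and B act as hubs, C and D as authorities.
links = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"], "D": []}
hubs, authorities = hits(links)
print(max(authorities, key=authorities.get))
```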

Several tools are used for web structure mining, for example:

· Google PR Checker

· Link Viewer

Web Usage Mining

Web usage mining, also referred to as log mining, is the process of capturing users' web access information and gathering it in the form of logs. Every visit to a website leaves some information behind, such as visiting time, IP address, and pages visited. This data is gathered, processed, and stored in logs, mostly the access logs kept by web servers. Web usage mining automatically discovers user access patterns in these logs. The logged fields, such as URL, visit time, and Internet Protocol (IP) address, can help a company understand the actions of its customers, improve the structure of its website, and ensure good quality of service. In short, web usage mining digs into and analyzes log files of user access to understand how users behave while they interact with the web.
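As a small illustration of mining such logs, the sketch below parses a few access-log lines and counts page hits and the pages each IP address visited. It assumes the logs follow the Common Log Format used by many web servers. The regular expression, the function name mine_usage, and the sample lines are invented for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: ip ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log lines (documentation-range IP addresses).
sample_log = [
    '192.0.2.10 - - [21/Oct/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.10 - - [21/Oct/2020:10:16:01 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '198.51.100.7 - - [21/Oct/2020:10:17:45 +0000] "GET /index.html HTTP/1.1" 200 1024',
    'malformed line',
]


def mine_usage(lines):
    """Return (hits per URL, ordered list of URLs visited per IP address)."""
    page_hits = Counter()
    pages_by_ip = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        page_hits[match["url"]] += 1
        pages_by_ip[match["ip"]].append(match["url"])
    return page_hits, dict(pages_by_ip)


page_hits, pages_by_ip = mine_usage(sample_log)
print(page_hits.most_common(1))
print(pages_by_ip)
```

The per-IP page sequences are a starting point for finding common navigation paths, which is the kind of access pattern web usage mining looks for.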