Web Mining

Introduction:

Web mining is the process of applying data mining techniques and algorithms to automatically discover and extract information from Web documents and services.

Web mining is a subset of data mining.

The data mined from the Web may be a collection of facts that web pages are meant to contain. It may consist of text, structured data such as lists and tables, and even images, video, and audio.

The Web has several aspects that lead to different approaches to mining: web pages contain text, pages are connected to each other via hyperlinks, and user activity can be monitored via web server logs.

Types of Web Mining

  • Web Content Mining

  • Web Structure Mining

  • Web Usage Mining

Web content mining

  • Web content mining is the process of collecting useful data from the content of websites.

Web structure mining

  • Web structure mining is the process of discovering structure information from the Web, such as the hyperlinks that connect pages.

Web usage mining

  • Web usage mining is the process of applying data mining techniques to derive useful information from web server logs.

Why web mining?

We live in the era of the internet, and we look for content using search engines such as Google and Yahoo. A search engine returns a list of websites based on the search query.

So it helps to know how a search engine works.

In the early 90s, the first search engines used text-based ranking systems to decide which pages to return based on a given query.

Such a search engine browses through its index and counts the occurrences of the query's keywords in each web page. The pages with the highest number of keyword occurrences win, and they are displayed back to the user.
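The counting approach described above can be sketched in a few lines of Python. This is only an illustrative toy, not how any real search engine is implemented: the function name rank_by_keyword_count, the pages dictionary, and the page names are all made up for this example.

```python
import re
from collections import Counter

def rank_by_keyword_count(pages, query):
    """Rank pages by how often the query's keywords occur in their text."""
    keywords = query.lower().split()
    scores = {}
    for url, text in pages.items():
        # Count every word on the page once, then add up the keyword counts.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        scores[url] = sum(counts[k] for k in keywords)
    # Highest count first; pages that contain none of the keywords are dropped.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(url, score) for url, score in ranked if score > 0]

# Hypothetical toy index: page name -> page text.
pages = {
    "page1": "web mining applies data mining to the web",
    "page2": "mining mining mining",
    "page3": "a page about cooking",
}
results = rank_by_keyword_count(pages, "web mining")
print(results)
```

Note that page2 scores almost as high as page1 just by repeating one keyword, which hints at why counting words alone is a weak signal of relevance.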

But the text-based ranking system was not ideal, because millions of web pages may contain a particular word, and the user will not scan all of them.

Users often look at only the top 5–20 web pages returned for their query.

Modern search engines provide more relevant results than text-based ranking systems. One of the most influential algorithms for computing the relevance of web pages is PageRank, used by the Google search engine.

Search Engine Types

Title-based Search Engine

  • Searches only page titles

Full-Text Search Engine

  • E.g. Google

PageRank

Source: 707-digital

PageRank is the algorithm Google uses to rank pages. It counts the links that point to a page and assigns the page a numeric value.

That numeric value indicates how important the web page is.

How is PageRank calculated?

Source: Management mania

To calculate PageRank, all of the links from the web pages are taken into account.

There are three types of links

  • Inbound Link

Page rank formula

PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn))

This equation shows how important the webpage actually is.

Here,

t1, t2, …, tn are the web pages that link to web page A.

C(t) is the number of outbound links that page t has.

d is the damping factor (typically d = 0.85).
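As a rough illustration, the formula above translates almost directly into Python. The function name page_rank_of and the input values are assumptions chosen for this sketch; they do not come from any library.

```python
def page_rank_of(inbound, d=0.85):
    """PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over the pages t that link to A.

    inbound: list of (PR(t), C(t)) pairs, one pair per page t linking to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# Hypothetical inputs: two pages link to A.
# t1 has PageRank 1.0 and 2 outbound links; t2 has PageRank 1.0 and 4 outbound links.
pr_a = page_rank_of([(1.0, 2), (1.0, 4)])
print(pr_a)  # 0.15 + 0.85 * (0.5 + 0.25)
```

A page with many outbound links passes on only a small share of its PageRank through each link, which is why t2 contributes less to A than t1 here.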

Now we’ll look into some examples of how PageRank works.

Example 1:

Let us assume four web pages A, B, C, and D.

Let each page start with a PageRank of 0.25.

If B, C, and D each link only to A, each of them passes its whole PageRank to A, so the PageRank of web page A is PR(A) = PR(B) + PR(C) + PR(D) = 0.75

Example 2:

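Now suppose the pages have other outbound links as well. Each page then splits its PageRank evenly among its outbound links, and the PageRank of web page A is

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)

L(B), L(C), and L(D) are the numbers of outbound links of pages B, C, and D.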

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
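To make the arithmetic concrete, here is a small sketch that plugs in the starting PageRank of 0.25 from Example 1 and the outbound link counts 2, 1, and 3 used above. Exact fractions are used only to avoid rounding noise; the variable names are made up for this example.

```python
from fractions import Fraction

# Starting PageRank of each page that links to A (0.25 each).
pr = {"B": Fraction(1, 4), "C": Fraction(1, 4), "D": Fraction(1, 4)}
# Number of outbound links per page: L(B) = 2, L(C) = 1, L(D) = 3.
outbound = {"B": 2, "C": 1, "D": 3}

# Each page contributes its PageRank divided by its number of outbound links.
pr_a = sum(pr[p] / outbound[p] for p in pr)
print(pr_a, float(pr_a))  # 1/8 + 1/4 + 1/12 = 11/24, roughly 0.458
```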

The parameter d is the damping factor, which can be set between 0 and 1 (here d is set to 0.85).

With the damping factor, the PageRank of web page A is:

PR(A) = (1-d)/N + d(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D))

where N is the total number of pages.
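Because the PageRank of every page depends on the PageRank of the pages linking to it, the damped formula is usually applied repeatedly until the values settle. The following is a minimal sketch of that idea, not a production implementation. The link graph is hypothetical: it keeps L(B) = 2, L(C) = 1, and L(D) = 3 from Example 2 and assumes A links to B, because this simple version requires every page to have at least one outbound link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(p) = (1 - d)/N + d * sum(PR(q)/L(q)) for q linking to p.

    links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start with equal PageRank for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the shares passed on by every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.4f}")
```

In this graph no page links to D, so D keeps only the baseline (1 - d)/N, while A, which every other page links to, ends up with the highest value.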

Implementation of page rank algorithm

networkx is a Python package that can be used to create graph structures and to calculate the PageRank, the total number of nodes, and the total number of edges of a graph.
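A short sketch of how this might look with networkx is shown below. The link graph is a made-up example, and the code assumes networkx (and the SciPy dependency that recent versions use for PageRank) is installed.

```python
import networkx as nx

# Hypothetical link graph: an edge (u, v) means page u links to page v.
G = nx.DiGraph()
G.add_edges_from([
    ("A", "B"),
    ("B", "A"), ("B", "C"),
    ("C", "A"),
    ("D", "A"), ("D", "B"), ("D", "C"),
])

total_nodes = G.number_of_nodes()
total_edges = G.number_of_edges()

# alpha is networkx's name for the damping factor d.
ranks = nx.pagerank(G, alpha=0.85)

print("nodes:", total_nodes, "edges:", total_edges)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.4f}")
```

The values returned by nx.pagerank sum to 1 across all pages, so they can be read as relative importance scores.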

Conclusion

PageRank is not the only algorithm used by search engines, and Google also uses other algorithms and methods to rank pages these days. But it was instrumental in launching Google to the forefront, and it demonstrates the power of web mining.

About the Author

Gowshik is an SDE intern at KBX Digital. He is a full-stack developer who is interested in working in challenging areas and creating new ideas while coding. He is mainly interested in competitive programming. He always says, “Explore new things in life”.

About KBX Digital

At KBX Digital we use server-less technology to auto scale micro-services to serve millions of customers.

