Patents show which strategies Google chooses to invest in. Reading one is hard work, especially when the technique it describes has not been put to use yet.
This patent describes a mechanism that classifies URLs by crawling them in various ways to gather information about them. This is where the idea of thematic content clusters comes in: the textual analysis takes similar topics into account and groups them together. The resulting clusters can be published as answers to search queries.
The system that processes a URL has three parts: a crawler (which identifies the host, the subdomain and the subdirectory), a clusterizer (which keeps adding pages to a cluster until there are no new pages to classify) and a publisher (the gateway to SERP content, which approves or rejects clusters and sometimes adjusts them).
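To make the crawler's first step concrete, here is a minimal Python sketch of splitting a URL into the three pieces named above. It is not taken from the patent. The function name split_url is hypothetical, and the code assumes that the registered domain is always the last two labels of the host, which holds for example.com but not for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into host, subdomain and top-level subdirectory.

    Assumption: the registered domain is the last two host labels.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Everything in front of the last two labels counts as the subdomain.
    subdomain = ".".join(labels[:-2])
    segments = [s for s in parts.path.split("/") if s]
    # The last segment is treated as the page itself, so a subdirectory
    # exists only when at least two path segments are present.
    subdirectory = segments[0] if len(segments) > 1 else ""
    return {"host": host, "subdomain": subdomain, "subdirectory": subdirectory}


print(split_url("https://blog.example.com/seo/patents.html"))
```

A clusterizer could then group pages that share a host, subdomain or subdirectory before looking at their text.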
There are two types of crawl: progressive (it collects data from a subset of the pages included in a cluster) and incremental (it concentrates on the additional pages supplied by the crawler before adding new ones).
There are also two types of clusters: mature (a cluster is mature when its category is certain, that is, when various clusters have the same URL) and immature (a cluster that does not yet meet the requirement for being mature).
As mentioned, the publisher has to approve, reject or adjust clusters. Two clustering algorithms are involved:
K-means clustering algorithm (it finds groups of related items in data that has not been labeled yet. For content clusters, every paragraph of the text forms a centroid).
Hierarchical clustering algorithm (this applies when there is a similarity matrix between different clusters).
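The two algorithms can be sketched in a few lines of plain Python. This is a textbook illustration, not the patent's implementation. The function names kmeans and agglomerate, the 2-D page vectors, the similarity scores and the 0.5 threshold are all made up for the example, and the hierarchical variant uses single linkage as an assumption.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means, seeded with the first k points. Returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def agglomerate(similarity, threshold):
    """Single-linkage hierarchical clustering over a similarity matrix.

    Merges the two most similar clusters until no pair reaches `threshold`.
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: score a pair by its most similar members.
                score = max(similarity[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical page vectors: two pages near the origin, two far away.
pages = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans(pages, k=2)[1])  # [0, 0, 1, 1]

# Hypothetical pairwise similarities between three pages.
sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
print(agglomerate(sim, threshold=0.5))  # [[0, 1], [2]]
```

K-means needs the number of groups up front, while the hierarchical version only needs a similarity threshold, which is one reason the two are often presented side by side.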
Meagan Kozlovs is a reporter for Debate Report. She’s worked and interned at Global News Toronto and CHECX. Megan is based in Toronto and covers issues affecting her city. In addition to her severe milk shake addiction, she’s a Netflix enthusiast, a red wine drinker, and a voracious reader.