A couple of years ago, in an effort to better understand technology trends, we initiated a project to identify typical website characteristics for various geographic regions. We wanted to build a simple queryable interface that would allow our analysts to interact with crawl data and identify nonobvious trends in technology and web design choices. At first this may sound like a fairly typical analysis problem, but we have faced numerous challenges and gained some interesting insights over the years.

Aside from the many challenges present in crawling the Internet and processing that data, at the end of the day, we end up with hundreds of millions of records, each with hundreds of features. Identifying “normal trends” over such a large feature set can be a daunting task. Traditional statistical methods, which summarize one variable at a time or compare a pair of variables, start to break down at this point. They work well for one or two variables but become hard to apply and interpret once you hit more than 10 variables, because the number of possible interactions between features grows so quickly. This is why we have chosen to use cluster analysis in our approach to the problem.

Machine learning algorithms, the Swiss army knife of a data scientist’s toolbox, break down into three categories: supervised learning, unsupervised learning, and reinforcement learning. Although mixed approaches are common, each of the three lends itself to different tasks. Supervised learning is great for classification problems where you have a lot of labeled training data and you want to identify appropriate labels for new data points. Unsupervised techniques help to determine the shape of your data, categorizing data points into groups by mathematical similarity. Reinforcement learning includes a set of behavioral models for agent-based decision-making in environments where the rewards (and penalties) are only given out on occasion (like candy!). Cluster analysis fits well within the realm of unsupervised learning but can take advantage of supervised learning (making it semi-supervised learning) in a lot of scenarios, too.

So what is cluster analysis and why do we care? Consider websites and features of those sites. Some sites will be large, others small. Some will have lots of images; others will have lots of words. Some will have lots of outbound links, and others will have lots of internal links. Some websites will use Angular; others will prefer React. If you look at each feature individually, you may find that the average website has 11 pages, 4 images and 347 words. But what does that get you? Not a whole lot. Instead, let’s sit back and think about why some sites may have more images than others or choose one JavaScript library over another. Each website was built for a purpose, be it to disseminate news, create a community forum, or blog about food. The goals of the website designer will often guide his or her design decisions. Cluster analysis applies #math to a wide range of features and attempts to cluster websites into groups that reflect similar design decisions.
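
To make the idea concrete, here is a minimal sketch of one common clustering technique, k-means, in plain Python. It is not our production pipeline: the six sites and their three features (page count, image count, word count) are made up for illustration, and the deterministic farthest-point seeding is a simplification chosen to keep the example reproducible, where real libraries typically use randomized seeding. Features are standardized first so that word counts, which are numerically much larger, do not dominate the distance calculation.

```python
import math

# Hypothetical feature vectors per site: [page count, image count, word count].
sites = [
    [5, 40, 200], [6, 35, 250], [4, 50, 180],      # image-heavy sites
    [12, 3, 3000], [15, 2, 3500], [10, 4, 2800],   # text-heavy sites
]

def standardize(rows):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, k, max_iterations=100):
    """Alternate between assigning points and recomputing centroids
    until the assignments stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

labels, centroids = kmeans(standardize(sites), k=2)
print(labels)  # one cluster label per site
```

On this toy data the image-heavy and text-heavy sites end up in separate groups, which is the kind of design-driven grouping described above. Choosing the number of clusters and the features to include is a judgment call that this sketch does not address.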

Once you have your groups, generated by #math, you’ve just made your life a whole lot simpler. A few minutes (or hours) ago you had potentially thousands or millions of items to compare across hundreds of fields. Now you’ve got tens of groups that you can compare in aggregate. Additionally, you now know what makes each group a group and how it distinguishes itself from one or more other groups. Instead of looking at each website or field individually, now you’re looking at everything holistically. Your job just got a whole lot easier!

Cluster analysis gives you some additional bonus wins. Now that you have normal groups of websites, you can identify outliers within the set: those sites that are substantially dissimilar from the bulk of their assigned group. You can also use these clusters as labels in a classifier and determine in which group of sites a new one fits best.
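
Both ideas can be sketched in a few lines of Python. The example below assumes clustering has already been done and uses made-up, already-scaled two-feature vectors. The outlier rule (flag a site whose distance to its group's centroid is more than three times the group's median distance) and the nearest-centroid assignment are simple stand-ins picked for illustration, not necessarily the methods we use; the group names, the `factor` parameter, and the helper functions are all hypothetical.

```python
import math
from statistics import mean, median

# Hypothetical, already-scaled feature vectors grouped by cluster.
clusters = {
    "image-heavy": [[1.0, -1.0], [1.2, -0.9], [0.8, -1.1],
                    [1.1, -1.2], [0.9, -0.8], [3.0, 1.0]],
    "text-heavy": [[-1.0, 1.0], [-1.1, 0.9], [-0.9, 1.2], [-1.2, 1.1]],
}

def centroid(members):
    """Mean of each feature across the members of a group."""
    return [mean(col) for col in zip(*members)]

def find_outliers(members, factor=3.0):
    """Members whose distance to the group centroid exceeds
    factor times the group's median distance."""
    center = centroid(members)
    distances = [math.dist(m, center) for m in members]
    cutoff = factor * median(distances)
    return [m for m, d in zip(members, distances) if d > cutoff]

def assign(site, clusters):
    """Name of the group whose centroid is closest to the new site."""
    return min(clusters, key=lambda name: math.dist(site, centroid(clusters[name])))

for name, members in clusters.items():
    print(name, "outliers:", find_outliers(members))

new_site = [1.0, -0.7]
print("new site fits best in:", assign(new_site, clusters))
```

A median-based cutoff is used here because, in small groups, a single extreme site inflates the mean and standard deviation enough to hide itself. With more data, a trained classifier using the cluster labels as targets would be the more natural way to place new sites.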

In coming posts, we will go into more detail about how we cluster and visualize web crawl data. Stay tuned!

Understanding Crawl Data at Scale (Part 1)

John Munro

