
Understanding Crawl Data at Scale (Part 2)

Effective analysis of cyber security data requires understanding the composition of networks and the ability to profile the hosts within them according to the large variety of features they possess. Cyber-infrastructure profiling can generate many useful insights, including identification of general groups of similar hosts, identification of unusual host behavior, vulnerability prediction, and development of a chronicle of technology adoption by hosts. But cyber-infrastructure profiling also presents many challenges because of the volume, variety, and velocity of the data. There are roughly one billion Internet hosts in existence today. Hosts may vary from each other so much that we need hundreds of features to describe them. The speed of technological change can also be astonishing. We need techniques that can cope with these rapid changes and enormous feature sets, saving analysts and security operators time and giving them useful information faster. In this post and the next, I will demonstrate some of the clustering and visualization techniques we have been using for cyber security analytics.

To deal with these challenges, the data scientists at Endgame leverage the power of clustering. Clustering is one of the most important analytic methodologies for boiling a big data set down into smaller groups in a meaningful way. Analysts can then gain further insights from those smaller data sets.

I will continue with the use case given in Understanding Crawl Data at Scale (Part 1): the crawled data of hosts. At Endgame, we crawl a large, global set of websites and extract summary statistics from each. These statistics include technical information like the average number of JavaScript links or image files per page. We aggregate all statistics by domain and then index them into our local Elasticsearch cluster for browsing through the results. The crawled data is structured into hundreds of features, both categorical and numerical. For the purpose of illustration, I will only use 82 numerical features in this post. The total number of data points is 6,668.
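As a rough illustration of the indexing step, the sketch below bulk-indexes per-domain summary statistics with the elasticsearch-py client. The index name "crawl-stats", the example domains, and the field names are placeholder assumptions for illustration, not our actual schema.

```python
# Minimal sketch: index per-domain crawl statistics into Elasticsearch.
# Index name, domains, and field names are illustrative placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

domain_stats = [
    {"domain": "example.com", "avg_js_links_per_page": 12.4, "avg_image_files_per_page": 30.1},
    {"domain": "example.org", "avg_js_links_per_page": 3.2, "avg_image_files_per_page": 8.7},
]

actions = (
    {"_index": "crawl-stats", "_id": doc["domain"], "_source": doc}
    for doc in domain_stats
)
helpers.bulk(es, actions)
```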

First, I’ll cover how we use visualization to reduce the number of features. In a later post, I’ll talk about clustering and the visualization of clustering results.

Before we actually start clustering, we should first try to reduce the dimensionality of the data. A very basic EDA (Exploratory Data Analysis) technique for numerical features is to plot them in a scatter matrix, as shown in Figure 1. It is an 82 by 82 plot matrix. Each cell in the matrix, except the ones on the diagonal, is a two-variable scatter plot, and the plots on the diagonal are histograms of the individual variables. Given the large number of features, we can hardly see anything in this busy graph. An analyst could spend hours trying to decipher it and derive useful insights:

Figure 1. Scatter Matrix of 82 Features
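A scatter matrix like Figure 1 can be generated with a few lines of pandas. The sketch below assumes the 82 numerical features have been exported to a hypothetical crawl_features.csv, one row per domain; with 82 columns the plot is large and slow to render, which is part of the point.

```python
# Sketch: draw a scatter matrix of the numerical crawl features with pandas.
# "crawl_features.csv" is a hypothetical export of the feature table.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("crawl_features.csv")
pd.plotting.scatter_matrix(df, figsize=(40, 40), diagonal="hist", alpha=0.2, s=2)
plt.savefig("scatter_matrix.png", dpi=100)
```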

Of course, we can try to break the 82 variables up into smaller sets and develop a scatter matrix for each set. However, there is a better visualization technique available for handling high-dimensional data called a Self-Organizing Map (SOM).

The basic idea of a SOM is to place similar data points close to each other on a (usually) two-dimensional map by training the weight vector of each cell on the map with the given data set. A SOM can also be applied to generate a heat map for each of the variables, as in Figure 2. In that case, a one-variable data set is used to create each subplot in the component plane.

Figure 2. SOM Component Plane of 82 Features
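As a sketch of how a component plane like Figure 2 can be produced, the snippet below trains a SOM with the open-source MiniSom library and draws one heat map per feature from the learned weights. The map size, training parameters, and input file are illustrative assumptions, not the ones behind Figure 2.

```python
# Sketch: train a SOM on min-max scaled features and plot one component
# plane (heat map of the learned weights) per feature using MiniSom.
import math
import pandas as pd
import matplotlib.pyplot as plt
from minisom import MiniSom
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("crawl_features.csv")           # hypothetical export of the feature table
X = MinMaxScaler().fit_transform(df.values)
n_features = X.shape[1]

som = MiniSom(x=20, y=20, input_len=n_features, sigma=1.5, learning_rate=0.5, random_seed=42)
som.random_weights_init(X)
som.train_random(X, 10000)

weights = som.get_weights()                      # shape: (20, 20, n_features)
n_cols = 10
n_rows = math.ceil(n_features / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(2.5 * n_cols, 2.2 * n_rows))
for i, ax in enumerate(axes.ravel()):
    if i < n_features:
        ax.pcolor(weights[:, :, i], cmap="jet")  # one heat map per feature
        ax.set_title(df.columns[i], fontsize=6)
    ax.axis("off")
plt.savefig("component_planes.png", dpi=100)
```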

By color-coding the magnitude of a variable, as shown in Figure 2, we can quickly identify those variables whose plots are mostly blue. These variables have low entropy, which, in information theory, implies that they carry little information. We can safely remove those variables and keep only the ones whose heat maps are more colorful. The component plane can also be used to identify similar or linearly correlated variables, such as the image at cell (2,5) and the one at cell (2,6). These cells represent the internal HTML pages count and HTML files count variables, respectively.

Based on Figure 2, 29 variables stood out as potential high-information variables. This is a data-driven heuristic for distilling the data, without needing to know anything about information gain, entropy, or standard deviation.
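The visual screen can also be cross-checked programmatically. The sketch below ranks features by their variance on min-max scaled data, a simple stand-in for the entropy one reads off the heat maps; the 0.01 threshold is an arbitrary illustrative choice, not a value used in this analysis.

```python
# Rough programmatic analogue of the visual screen: flag near-constant
# features by variance on min-max scaled data. Threshold is illustrative.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("crawl_features.csv")           # hypothetical export of the feature table
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df.values), columns=df.columns)

variances = scaled.var().sort_values()
low_info = variances[variances < 0.01].index.tolist()
print(f"{len(low_info)} near-constant features:", low_info[:10])
```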

However, 29 variables may still be too many, as we can see that some of them are quite similar. It would be useful to sort the 29 variables based on their similarities, and that can be done with a SOM. Figure 3 is an ordered SOM component plane of the 29 variables, in which similar features are placed close to each other. Again, the benefit of creating this sorted component plane is that any analyst, without strong statistical training, can look at the graph and hand-pick similar features out of each feature group.

Figure 3. Ordered SOM Component Plane
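As a rough, non-SOM stand-in for that ordering, the sketch below hierarchically clusters the feature-feature correlation matrix so that strongly correlated features end up adjacent in the resulting order. This is only an analogue for illustration, not the method used to build Figure 3, and the input file is a hypothetical export of the 29 kept features.

```python
# Stand-in for the ordered component plane: hierarchically cluster the
# feature-feature correlation matrix and read off a similarity ordering.
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

df = pd.read_csv("crawl_features_29.csv")        # hypothetical export of the 29 kept features
corr = df.corr().fillna(0)
dist = squareform(1 - corr.abs().values, checks=False)
order = leaves_list(linkage(dist, method="average"))
print("Features sorted by similarity:")
print([df.columns[i] for i in order])
```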

So far, I have demonstrated how to use visualization, specifically a SOM, to help reduce the dimensionality of a data set. Please note that dimensionality reduction is another very rich research topic (besides clustering) in data science. Here I have only touched the tip of the iceberg by using a SOM component plane to visually select a subset of features. One more important point about the SOM is that it not only helps reduce the number of features, but also brings down the number of data points for analysis by generating a set of codebook data points that summarize the original, larger data set according to some criterion.
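For concreteness, the codebook is just the set of trained weight vectors, one per SOM cell. Continuing the illustrative MiniSom sketch from earlier (not our production code), it can be pulled out like this:

```python
# Continuing the earlier MiniSom sketch: each of the 20x20 map cells yields
# one codebook vector, so thousands of domains are summarized by 400 points
# in the same feature space.
codebook = som.get_weights().reshape(-1, n_features)   # shape: (400, n_features)
print(codebook.shape)
```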

In Part 3 of this series on Understanding Crawl Data at Scale, I’ll show how we use codebook data to visualize clustering results.

