<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="styles.css">
<title>Experience</title>
</head>
<body>
<div class="navbar">
<a href="index.html">Home</a>
<a href="photography.html">Photography</a>
<a href="experience.html">Experience</a>
</div>
<div class="content" id="experience">
<h1>Experience</h1>
<section class="experience-section">
<div class="job-title">
<h2>CyberSaint</h2>
<hr>
<p><strong>Machine Learning Engineer</strong> - May 2020 to Present</p>
</div>
<p>CyberSaint is made up of a small yet powerful team of visionaries. Originally focused on compliance, we've diversified to stay ahead in the evolving cybersecurity landscape. As a curious engineer, I am grateful to work with a team as nimble, adventurous, and pioneering as this one, and I continue to advance my skill set every day with new, innovative, and largely experimental ideas. I am the sole Machine Learning Engineer and Data Scientist at CyberSaint, and over the past few years I have developed several innovative technologies, which I describe below. As I've discovered here, and as this overview will convey, even the simplest tasks can reveal complexities beyond imagination.</p>
<h3>Crosswalking</h3>
<p>In cybersecurity, crosswalking is the act of mapping one security control framework to another to align regulatory standards, track and sync control performance, and support other data movement. Historically, it has involved teams of experts spending an impractical amount of time on tedious comparisons. While ideas for improvement circulate occasionally, these largely remain manual processes. A good NLP approach will eat this problem for breakfast in a matter of minutes. The issue here is data. Frameworks are wildly unique and are constantly being adapted by companies into personalized, custom frameworks. This means the probability of overfitting on the few crosswalk datasets you can get your hands on is high. The battle to stay generalized in the broader security space is challenging, as there aren't many dedicated security datasets centered around language. Using industry standards—the related controls available within popular frameworks—is a good start, but there is widespread disagreement about the accuracy of these mappings, and in order to maintain a healthy training set you still need more negatively associated control pairs than positive ones. Some datasets born from an optimized crosswalking workflow do exist and therefore contain intermediary language between controls, but even these are slightly misaligned, as they represent a tree structure rather than a many-to-many dataset.</p>
<h5>Efficient Crosswalking is Weak Interpretation</h5>
<p>The crosswalking algorithm is regularly updated to include the latest research and advancements in natural language processing. The original algorithm has evolved into three distinct iterations: the first two are smaller, more efficient algorithms designed for enhanced customization and scalability, while the third aims to develop a universal mapping mechanism. Each of the smaller algorithms utilizes a straightforward extension block, which processes standard embeddings from a leading embedding model. The algorithms differ in their approach: one compares controls pair-wise, generating a relatedness score, while the other uses a transformed embedding in a FAISS indexing process for cosine similarity-based semantic searches. Due to the data issue mentioned previously, the base embedding model was intentionally frozen. The extension block is designed as an interpretation of the embedding, capturing only the relevant information from the original training dataset that is present within the weights. This block is sized to learn the base embedding's patterns well enough to establish a similarity metric while remaining simple enough to avoid overfitting on sparse or misaligned data. Other regularization techniques are used as well, and in our experiments this solution found matches more reliably than alternative methods. The primary goal is to isolate similarity features specifically related to the cybersecurity control language used to train the extension block. This approach is especially useful in applications where different customers may require unique interpretations of control mappings. The compact design of the extension block enables training with minimal resources and results in a smaller model footprint. Both simple models display advantages depending on the specific task, and together they represent efficient, individualized crosswalking.</p>
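<p>As an illustration of the pair-wise variant, the sketch below projects frozen base embeddings through a small extension block and scores a control pair with cosine similarity. Everything here is a toy stand-in: the random "base embeddings", the layer sizes, and the untrained weights are illustrative assumptions, not the production model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def base_embed(texts):
    # Stand-in for the frozen base embedding model; a real system would
    # call a pretrained sentence-embedding model here.
    return rng.standard_normal((len(texts), 384))

class ExtensionBlock:
    """Small trainable projection on top of the frozen embeddings."""
    def __init__(self, dim_in=384, dim_out=128, seed=1):
        r = np.random.default_rng(seed)
        # Random initialization; in practice these weights are trained
        # on labeled control pairs while the base model stays frozen.
        self.W = r.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)

    def __call__(self, emb):
        return emb @ self.W

def relatedness(u, v):
    # Cosine similarity between two extended embeddings.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ext = ExtensionBlock()
a, b = ext(base_embed(["Control text A", "Control text B"]))
score = relatedness(a, b)
```

<p>The same extended embeddings can be dropped into a FAISS index for the semantic-search variant.</p>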
<h5>A Stab at a Universal Cybersecurity Mapping Mechanism</h5>
<p>The third algorithm adopts a first principles approach to crosswalking, reassessing the way connections are stored and analyzed to predict new relationships. Control mappings can be conceptualized as a vast graph network, with each node symbolizing an individual control. Recent advances in graph neural network infrastructure enable simultaneous modeling of control similarities, considering more information than a single pair or triplet could provide using traditional methods. Now, crosswalking new controls is a simple insertion operation into the graph. New insights such as classification, grouping, and effectiveness—a loose term meant to imply abstract conclusions based on concepts such as coverage or when connected to scoring metrics—can be applied at the control or framework level. Using a heterogeneous graph, we can accept more than just controls. Different data types such as CVEs, MITRE TTPs, policy data, custom language, or any other cybersecurity indicators that contain language can be mapped and analyzed as well. As the network density increases, the algorithm’s few-shot learning capabilities improve, essential for classifying new data types. While still in development, this model has incredible potential to autonomously fill gaps in an ever-changing cyber landscape.</p>
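<p>A minimal way to picture the graph formulation: each control is a node, known mappings are edges, and a message-passing step blends each node's features with those of its neighbours. The sketch below is a single unweighted aggregation step on a toy four-control graph—an illustrative assumption, not the actual architecture, which uses learned heterogeneous layers.</p>

```python
import numpy as np

def message_pass(features, adj):
    # One mean-aggregation step: each node averages its own features
    # with its neighbours' (a GNN layer without learned weights).
    degree = adj.sum(axis=1, keepdims=True) + 1.0
    return (features + adj @ features) / degree

# Toy graph: 4 controls; edges are known crosswalk mappings.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)

x = np.eye(4)              # placeholder one-hot node features
h = message_pass(x, adj)   # smoothed node representations

# A candidate link between controls 1 and 3 can then be scored by the
# cosine similarity of their aggregated representations.
sim = float(h[1] @ h[3] / (np.linalg.norm(h[1]) * np.linalg.norm(h[3])))
```

<p>Inserting a new control then amounts to adding a node, running the aggregation, and scoring it against existing nodes.</p>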
<h3>Historical Event Modeling</h3>
<p>In the ever-changing and often secretive world of cybersecurity, harnessing historical data is pivotal for accurately navigating current challenges and effectively preparing for future risks. With the field's unpredictable nature and frequent lack of transparency, risk management often resembles a moving target, making it a particularly difficult quantity to approximate. This reality underscores the significance of a data-driven approach, which, by analyzing past trends and events, provides a more reliable basis for decision-making and resource allocation. While our chosen dataset is often referred to as the most comprehensive available, no dataset is perfect. Exploration via interpolation and other statistical methods proved very beneficial for modeling potential financial loss, exposing several hidden insights. Further analysis uncovered the ability to accurately predict a company's propensity to report losses. This effort culminates in an accurate method for proactively estimating single loss expectancy.</p>
<h5>Likelihood Modeling</h5>
<p>A key aspect of our data exploration involves modeling the likelihood of cybersecurity incidents: given historical data from companies of similar size and market position, how many events should you expect in the next year? A query to an inverse Poisson log-normal distribution, combined with some additional methods, extrapolates over the next 12 months. Feeding this and the financial loss distributions into a FAIR Monte Carlo model transforms these findings into an annualized loss expectancy. By packing this process into a concise workflow, not only can we drive decisions based on data rather than opinion, but we can do it with a few clicks and basic knowledge of your company's size and sector. Part of the innovation here is opening up a largely unknown field into a plug-and-play, user-friendly mechanism.</p>
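<p>The frequency-times-severity composition can be sketched as a small Monte Carlo simulation: draw an event count for each simulated year from a Poisson whose rate is itself log-normally distributed, draw a log-normal loss for each event, and read the annualized loss expectancy off the sample mean. All distribution parameters below are illustrative placeholders, not the calibrated values.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
years = 10_000  # number of simulated years

# Frequency: a Poisson-log-normal mixture over event counts per year.
rates = rng.lognormal(mean=0.0, sigma=0.5, size=years)
events = rng.poisson(rates)

# Severity: log-normal single-loss magnitudes, summed within each year.
annual_loss = np.array([rng.lognormal(mean=11.0, sigma=1.2, size=k).sum()
                        for k in events])

ale = annual_loss.mean()               # annualized loss expectancy
tail = np.percentile(annual_loss, 95)  # 95th-percentile year for reporting
```

<p>The sample mean estimates the annualized loss expectancy, while the tail percentile gives the kind of worst-case figure a risk report typically calls out alongside it.</p>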
<h3>Risk Quantification</h3>
<p>The concept of risk is rather abstract. As the field matures, scoring systems become more refined, more data-driven, and more actionable. This is the motivation behind our risk quantification projects, the most notable of which is CyberInsights, a model co-developed by myself and engineers from Booz Allen Hamilton. As the name suggests, CyberInsights provides valuable risk insights into a variety of scenarios determined by an encompassing questionnaire. The model simulates access and impact using an enterprise graph and random variables initialized by a Bayesian belief network shaped by the questionnaire inputs. Using Monte Carlo-style outputs and ties into MITRE ATT&amp;CK, a variety of scores and an annualized loss figure assist in important decision-making.</p>
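<p>Conceptually, the simulation resembles repeated random propagation over the enterprise graph: each trial spreads an attacker along edges whose traversal probabilities stand in for the Bayesian-network outputs, then tallies the impact of every asset reached. The graph, probabilities, and impact values below are hypothetical toys, not CyberInsights internals.</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Directed edges (src, dst) with the probability that an attacker who
# controls src can also compromise dst (illustrative values).
edges = {(0, 1): 0.6, (1, 2): 0.3, (0, 3): 0.2, (3, 2): 0.5}
impact = {2: 250_000.0}  # only node 2 carries financial impact here

def expected_loss(trials=5_000, entry=0):
    total = 0.0
    for _ in range(trials):
        reached, frontier = {entry}, [entry]
        while frontier:
            node = frontier.pop()
            for (src, dst), p in edges.items():
                if src == node and dst not in reached and rng.random() < p:
                    reached.add(dst)
                    frontier.append(dst)
        total += sum(impact.get(n, 0.0) for n in reached)
    return total / trials  # mean simulated loss per incident

loss = expected_loss()
```

<p>Averaging across trials turns the per-incident simulations into the kind of expected-loss figure the questionnaire-driven scores are built on.</p>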
<h3>Partnerships</h3>
<p>CyberSaint has partnered with several companies; with respect to AI, the most prominent has been our recent relationship with IBM WatsonX. Beyond being accelerated into their program to offer our solution, we work closely with a team within IBM on a special concept involving generative AI and risk reporting.</p>
<h3>Patents</h3>
<p>For work done here at CyberSaint, I am named on two patents currently pending approval. We are also in the process of filing several more, all on the topic of AI in cybersecurity.</p>
</section>
<section class="education-section">
<div class="job-title">
<h2>Rensselaer Polytechnic Institute</h2>
<hr>
<p><strong>Bachelor’s and Master’s Degrees in Computer Science</strong> - Graduated Spring 2020, GPA: 3.84</p>
</div>
<h5><strong>Thesis: </strong>Estimation of Animal Orientation and Fiducial Mark Location</h5>
<p>During graduate school at RPI, I was tasked with orienting images to improve animal identification algorithms for <a href="https://www.wildme.org/" target="_blank">Wild Me</a>. Wild Me is a non-profit organization leading efforts to improve the animal tagging and monitoring process used to track populations and migratory movement. My contribution oriented a set of animal images in order to test the influence of orientation on identification accuracy. Given an animal photograph, my software warped the image so that every input adhered to a standard input homography before passing it down the identification pipeline. The basis of my algorithm was a theory of over-parameterization. Estimating the orientation angle from the target rotation has numerous solutions, from simply outputting a theta inference to applying quaternion mathematics. I chose the redundant method of outputting the sine and cosine components of the rotation angle for two main reasons: first, the two unique estimates increase confidence and yield an uncertainty score derived from their disagreement; second, trigonometric components give a continuously differentiable loss function.</p>
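<p>The redundant decoding can be illustrated in a few lines: given the network's (sin, cos) outputs, atan2 recovers the angle, and the pair's distance from the unit circle serves as the uncertainty score. This is a minimal sketch of the idea, not the thesis code.</p>

```python
import numpy as np

def angle_from_sincos(s, c):
    # Recover the rotation angle from the redundant (sin, cos) pair.
    theta = np.arctan2(s, c)
    # A perfectly calibrated prediction satisfies s^2 + c^2 = 1, so the
    # deviation from the unit circle doubles as an uncertainty score.
    uncertainty = abs(s * s + c * c - 1.0)
    return theta, uncertainty
```

<p>Because sine and cosine are smooth in the angle, a squared-error loss on the pair stays continuously differentiable, avoiding the wrap-around discontinuity of regressing theta directly.</p>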
</section>
</div>
</body>
</html>