SBIR 2020-3 – Huntsville AI

September 15, 2020

The Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs are highly competitive programs that encourage domestic small businesses to engage in Federal Research/Research and Development (R/R&D) with the potential for commercialization.

Brief Intro

What we do at Huntsville AI:

Application – how to solve a problem with a technology
Theory – for those times when knowing “How” something works is necessary
Social / Ethics – Human / AI interaction
Brainstorming – new uses for existing solutions
Hands on Code – for those times when you just have to run something for yourself
Coworking Night – maybe have combined sessions with other groups to discuss application in their focus areas
Community – get to know others in the field. Provide talks like this one and support local tech events (HATCH, SpaceApps)

Quick Reference

The full list of 2020.3 SBIR topics is available HERE

NGA203-003

TITLE

Novel Mathematical Foundation for Automated Annotation of Massive Image Data Sets

OBJECTIVE

This announcement seeks proposals that offer dramatic improvements in automated object detection and annotation of massive image data sets. Imaging data is being created at an extraordinary rate from many sources, both from government assets as well as private ones. Automated methods for accurate and efficient object identification and annotation are needed to fully exploit this resource. This topic is focused on new artificial intelligence (AI) methods to effectively and efficiently solve this problem.

DESCRIPTION

Current choke points blocking optimal exploitation of the full stream of available image data include confronting widely different views (perspective, resolution, etc.) of the same or similar objects and the overwhelming amounts of human effort required for effective processing. Current manual processes require human eyes on every image to perform detection, identification, and annotation. Current state of the art AI requires intensive human support to generate giant training sets. Further, resulting methods frequently generate rule sets that are overly fragile in that training on one object is not transferable to the detection of another object, even though the object might strike a human as essentially the same, which increases the need for human review of the algorithm decisions.

NGA seeks new types of AI tools optimized for the task of object identification and annotation across diverse families of image data that are reliable, robust, not dependent on extensive training demands, applicable to objects of interest to both government and commercial concerns, and parsimonious with user resources in general. In particular, we seek solutions that make AI outputs both more explainable and more “lightweight” to human users.

The focus of a successful phase 1 effort should be on explaining the mathematical foundation that will enable the significantly improved AI tools described herein. Of specific interest are novel AI constructs that are more principled and universal and less ad hoc than current technology and can be used to construct a tool that performs relevant tasks. For the purposes of this announcement “relevant tasks” are limited to object identification across view types, drawing an object bounding box, and correctly labelling the object in a text annotation. A successful Phase 1 proposal should explain how the mathematical foundation needed to build the required tools will be developed in Phase 1 and implemented in a software toolkit in Phase 2. Examples should be developed during Phase 1 and should illustrate either improved reliability or robustness over the current state of the art, as well as reducing training demands and user resources. Proposals describing AI approaches that are demonstrably at or near the current state of the art in commercial AI performance, such as on ImageNet data sets, are specifically not of interest under this topic. The foundational element of a successful proposal under this topic is exploitation of novel mathematics that will enable new and better AI approaches.

Direct to Phase 2 proposals are being accepted under this topic. A straight to phase 2 proposal should describe pre-existing mathematical foundations and illustrative examples described in the paragraph above. Phase 2 proposals should also propose a set of milestones and demonstrations that will establish the novel AI tools as a viable commercial offering.

OSD203-004

TITLE

Domain-Specific Text Analysis

OBJECTIVE:

Develop text analysis software that leverages current Natural Language Processing (NLP) algorithms and techniques (e.g., Bayesian algorithms, word embeddings, recurrent neural networks) for accurately conducting content and sentiment analysis, as well as dictionary development.
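As a rough illustration of the "Bayesian algorithms" mentioned in the objective, here is a minimal multinomial naive Bayes sentiment classifier with add-one smoothing, written with only the Python standard library. The toy survey comments, the labels, and the function names are made up for this sketch; a real effort would need far more data and a proper tokenizer.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of training documents
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    tokens = text.lower().split()
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        # Log prior plus add-one smoothed log likelihood of each token.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up comments, for illustration only.
TOY_COMMENTS = [
    ("leadership communicates clearly and supports the unit", "positive"),
    ("training was useful and morale is high", "positive"),
    ("equipment is broken and morale is low", "negative"),
    ("leadership ignores concerns and training was poor", "negative"),
]
MODEL = train_naive_bayes(TOY_COMMENTS)

if __name__ == "__main__":
    print(classify(MODEL, "morale is high"))
    print(classify(MODEL, "equipment is broken"))
```

Word embeddings and recurrent networks would replace the bag-of-words counts here; the naive Bayes version is just the smallest thing that runs.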

DESCRIPTION:

The United States Department of Defense (DoD) collects large amounts of text data from its personnel using a variety of different formats including opinion/climate surveys, memoranda, incident reports, standard forms, and transcripts of focus group/sensing sessions. Much of these data are used operationally; however, recent interest in the leveraging of text data to glean insight into personnel trends/behaviors/intentions has prompted a greater degree of research in NLP. Additionally, Topic Modeling and Sentiment Analysis have been explored by various research arms of the DoD; however, two foundational hurdles exist that need to be addressed before they can realistically be applied to the DoD:

First, the varied use of jargon, nomenclature, and acronyms across the DoD and Service Branches must be more comprehensively understood. Additionally, development of a “DoD Dictionary” should enable the fluid use of extant and newly-created jargon, phrases, and sayings used over time. Second, the emergent nature and rapid innovation of NLP techniques has made bridging the technical gap between DoD analysts and tools difficult. Additionally, the understanding and interpreting of NLP techniques by non-technical leadership is particularly difficult. There currently exists no standard format or package that can be used to analyze and develop visualizations for text data in such a way that accommodates the needs of operational leadership to make decisions regarding personnel policies or actions.
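One simple way to picture the “DoD Dictionary” idea is a lookup table that expands acronyms before any downstream analysis runs. The sketch below is only an assumption about how such a preprocessing step might look; the three entries and the function names are illustrative, and a real dictionary would need to handle ambiguity between Service Branches and terms that change over time.

```python
import re

# Tiny illustrative dictionary; a real one would be curated from DoD sources.
DOD_DICTIONARY = {
    "PCS": "permanent change of station",
    "TDY": "temporary duty",
    "NCO": "noncommissioned officer",
}

def expand_jargon(text, dictionary=DOD_DICTIONARY):
    """Replace all-caps tokens found in the dictionary with their expansion."""
    def replace(match):
        token = match.group(0)
        # Unknown acronyms are left untouched.
        return dictionary.get(token, token)
    # Only consider runs of two or more capital letters.
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)

def add_term(dictionary, acronym, meaning):
    """Register newly observed jargon so later text can be expanded."""
    dictionary[acronym.upper()] = meaning

if __name__ == "__main__":
    print(expand_jargon("My NCO approved TDY before PCS."))
```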

SOCOM203-003

TITLE:

High-Performance Multi-Platform / Sensor Computing Engine

OBJECTIVE:

The objective of this topic is to develop a next generation multi-platform & multi-sensor capable Artificial Intelligence-Enabled (AIE), high performance computational imaging camera with an optimal Size, Weight and Power – Cost (SWaP-C) envelope. This computational imaging camera can be utilized in weapon sights, surveillance and reconnaissance systems, precision strike target acquisition, and other platforms. This development should provide bi-directional communication between tactical devices with onboard real-time scene/data analysis that produces critical information to the SOF Operator. As a part of this feasibility study, the Offerors shall address all viable overall system design options with respective specifications on the key system attributes.

DESCRIPTION:

A system-of-systems approach (“smart-Visual Augmentation Systems”) and the integration of a next-generation smart sensor enable information sharing between small arms, SOF VAS and other target engagement systems. Such sensors and targeting promote the ability to hit and kill the target while ensuring that Rules of Engagement are met and that civilian casualties and collateral damage are eliminated. The positive identification of the target and the precise firing solution will optimize the performance of the operator, the weapon, and the ammunition to increase precision at longer ranges in multiple environments.

This system could be used in a broad range of military applications where Special Operations Forces require: Faster Target Acquisition; Precise Targeting; Automatic Target Classification; Classification-based Multi Target Tracking; Ability to Engage Moving Targets; Decision Support System; Targeting with Scalable Effects; Battlefield Awareness; Integrated Battlefield (Common Operating Picture with IOBT, ATAK, COT across Squad, Platoon).

HR001120S0019-14

TITLE:

AI-accelerated Biosensor Design

OBJECTIVE:

Apply artificial intelligence (AI) to accelerate the design of highly specific, engineered biomarkers for rapid virus detection.

DESCRIPTION:

This SBIR seeks to leverage AI technologies to accelerate the development of aptamer-based biosensors that specifically bind to biomolecular structures. Aptamers are short single-stranded nucleic acid sequences capable of binding three-dimensional biomolecular structures in a way similar to antibodies. Aptamers have several advantages as compared to antibodies, including long shelf-life, stability at room temperature, low/no immunogenicity, and low-cost.

The current state-of-the-art aptamer designs rely heavily on in vitro approaches such as SELEX (Systematic Evolution of Ligands by Exponential Enrichment) and its advanced variations. SELEX is a cyclic process that involves multiple rounds of selection and amplification over a very large number of candidates (>10^15). The iterative and experimental nature of SELEX makes it time consuming (weeks to months) to obtain aptamer candidates, and the overall probability of ultimately obtaining a useful aptamer is low (30%-50%). Attempts to improve the performance of the original SELEX process generally result in increased system complexity and system cost as well as increased demand on special domain expertise for their use. Furthermore, a large number of parameters can influence the SELEX process.

Therefore, this is a domain that is ripe for AI. Recent AI research has demonstrated the potential for machine learning technologies to encode domain knowledge to significantly constrain the solution space of optimization search problems such as solving the biomolecular inverse problems. Such in silico techniques consequently offer the potential to provide a cost-effective alternative to make the aptamer design process more dependable and thereby more efficient. This SBIR seeks to leverage emerging AI technologies to develop a desktop-based AI-assisted aptamer design capability that accelerates the identification of high-performance aptamers for detecting new biological antigens.
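To make the "rounds of selection and amplification" loop concrete, here is a deliberately simplified toy simulation. It is not a model of real SELEX chemistry: each candidate is reduced to a single made-up affinity score, selection keeps the top fraction deterministically, and amplification just copies the survivors back up to the original pool size. The pool size, the keep fraction, and all names are assumptions for the sketch.

```python
import random

def selex_round(pool, keep_fraction=0.1):
    """One toy round: keep the best binders, then copy them back up to pool size."""
    keep = max(1, int(len(pool) * keep_fraction))
    survivors = sorted(pool, reverse=True)[:keep]        # selection step
    copies = len(pool) // keep
    amplified = survivors * copies                       # amplification step
    # Top up with a few extra survivors if the pool size is not a clean multiple.
    return amplified + survivors[: len(pool) - len(amplified)]

def run_selex(pool_size=1000, rounds=5, seed=0):
    """Return the mean affinity of the pool before and after each round."""
    rng = random.Random(seed)
    pool = [rng.random() for _ in range(pool_size)]      # affinity scores in [0, 1)
    history = [sum(pool) / len(pool)]
    for _ in range(rounds):
        pool = selex_round(pool)
        history.append(sum(pool) / len(pool))
    return history

if __name__ == "__main__":
    for round_number, mean_affinity in enumerate(run_selex()):
        print(round_number, round(mean_affinity, 4))
```

Even in this toy, the pool collapses onto a handful of candidates after a few rounds, which hints at why the choice of parameters matters so much in the real process.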

NGA203-004

TITLE:

High Dimensional Nearest Neighbor Search

OBJECTIVE

This topic seeks research in geolocation of imagery and video media taken at near-ground level [1]. The research will explore hashing/indexing techniques (such as [2]) that match information derived from media to global reference data. The reference data is composed of digital surface models (DSMs) of known geographical regions and features that can be derived from that surface data, together with limited “foundation data” of the same regions consisting of map data such as might be present in Open Street Maps and landcover data (specifying regions that are fields, vegetation, urban, suburban, etc.). Query data consists of images or short video clips that represent scenes covered by the digital surface model in the reference data, but may or may not have geo-tagged still images in the reference data from the same location.

Selected performers will be provided with sample reference data, consisting of DSM data and a collection of foundation data, and will be provided with sample query data. This sample data is described below. However, proposers might suggest other reference and query data that they will use to either supplement or replace government-furnished sample data. This topic seeks novel ideas for the representation of features in the query data and the reference data that can be used to perform retrieval of geo-located reference data under the assumption that the query data lacks geolocation information. The topic particularly seeks algorithmically efficient approaches such as hashing techniques for retrieval based on the novel features extracted from query imagery and reference data that can be used to perform matching using nearest neighbor approaches in feature space.
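For readers new to hashing-based nearest neighbor search, the sketch below uses random hyperplane hashing, one well-known locality-sensitive hashing family. It is not necessarily the technique in [2], and the four-dimensional feature vectors and tile names are invented; in the topic's setting the vectors would be features derived from DSM and foundation data.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(v * h for v, h in zip(vec, plane)) >= 0) for plane in hyperplanes
    )

def build_index(reference, hyperplanes):
    """Group reference locations into buckets keyed by their hash code."""
    index = {}
    for location, vec in reference.items():
        index.setdefault(hash_vector(vec, hyperplanes), []).append(location)
    return index

def query(index, reference, vec, hyperplanes):
    """Look up the query's bucket, then rank candidates by squared distance."""
    candidates = index.get(hash_vector(vec, hyperplanes), [])
    return sorted(
        candidates,
        key=lambda loc: sum((a - b) ** 2 for a, b in zip(reference[loc], vec)),
    )

# Invented feature vectors for three reference tiles.
REFERENCE = {
    "tile_a": [0.9, 0.1, 0.0, 0.3],
    "tile_b": [-0.2, 0.8, 0.5, -0.1],
    "tile_c": [0.1, -0.7, 0.9, 0.4],
}
PLANES = make_hyperplanes(dim=4, n_bits=8)
INDEX = build_index(REFERENCE, PLANES)

if __name__ == "__main__":
    print(query(INDEX, REFERENCE, [0.9, 0.1, 0.0, 0.3], PLANES))
```

Only the vectors in the matching bucket are compared directly, which is where the efficiency comes from. Production systems usually use several hash tables so that near misses in one table are caught by another.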

DESCRIPTION

The reference data includes files consisting of a vectorized two-dimensional representation of a Digital Surface Model (DSM) [4], relative depth information, and selected foundation feature data. The foundation features will include feature categories such as the locations of roads, rivers, and man-made objects.

The desired output of a query is a location within meters of the ground truth location of the camera that acquired the imagery. In practice, only some of the queries will permit accurate geolocation based on the reference data, and in some cases, the output will be a candidate list of locations, such that the true location is within the top few candidates. It is reasonable to assume that there exists a reference database calculated from a global DSM with a minimum spatial resolution of 30 meters that may, in some locations, provide sub-meter spatial resolution. The foundation feature data is at least as rich as that present in Open Street Maps, and can include extensive landcover data with multiple feature types. For the purpose of this topic, the reference data will not include images. Sample reference and query data representative of these assumptions, but of limited geographical areas, will be provided to successful proposers.

The topic seeks approaches that are more accurate than a class of algorithms that attempt to provide geolocation to a general region, such as a particular biome or continent. These algorithms are often based on a pure neural network approach, such as described in [3], and are unlikely to produce sufficiently precise camera location information that is accurate to within meters.

The objective system, in full production, should be sufficiently efficient as to scale to millions of square kilometers of reference data, and should be able to process queries at a rate of thousands of square kilometers per minute. While a phase 2 system might provide a prototype at a fraction of these capabilities, a detailed complexity analysis is expected to support the scalability of the system.

The proposed approach may apply to only a subset of query imagery types. For example, the proposed approach may be accurate only for urban data, or only for rural scenes. The proposer should carefully explain the likely limitations of the proposed approach and suggest methods whereby query imagery could be filtered so that only appropriate imagery is processed by the proposed system.

Proposers who can demonstrate prior completion of all of the described Phase I activities may propose a “straight to Phase II” effort. In this case the novelty of the proposed feature and retrieval approach will be a consideration in determining an award.

N203-151

TITLE:

Machine Learning Detection of Source Code Vulnerability

OBJECTIVE

Develop and demonstrate a software capability that utilizes machine-learning techniques to scan source code for its dependencies; trains cataloging algorithms on code dependencies and detection of known vulnerabilities, and scales to support polyglot architectures.
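As a baseline for what "scanning source code for its dependencies" can mean, the sketch below parses Python source with the standard `ast` module and checks imported top-level packages against an advisory table. This is a plain lookup, not machine learning, and it covers one language only. The advisory table, the package name `examplelib`, and the function names are all hypothetical.

```python
import ast

# Hypothetical advisory table; not real vulnerability data.
KNOWN_VULNERABLE = {"examplelib": "hypothetical advisory EX-0001"}

def find_imports(source):
    """Return the set of top-level module names imported by Python source text."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) point inside the project, so skip them.
            modules.add(node.module.split(".")[0])
    return modules

def flag_vulnerable(source, advisories=KNOWN_VULNERABLE):
    """Map each imported package with a known advisory to that advisory."""
    return {name: advisories[name] for name in find_imports(source) if name in advisories}

SAMPLE_SOURCE = (
    "import os\n"
    "import examplelib.utils\n"
    "from json import loads\n"
    "from . import sibling\n"
)

if __name__ == "__main__":
    print(sorted(find_imports(SAMPLE_SOURCE)))
    print(flag_vulnerable(SAMPLE_SOURCE))
```

A lookup like this misses copied or altered code, so it cannot detect dependencies that were pasted in rather than imported.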

DESCRIPTION

Nearly every software library in the world is dependent on some other library, and the identification of security vulnerabilities on the entire corpus of these dependencies is an extremely challenging endeavor. As part of a Development, Security, and Operations (DevSecOps) process, this identification is typically accomplished using the following methods: (a) Using static code analyzers. This can be useful but is technically challenging to implement in large and complex legacy environments. They typically require setting up a build environment for each version to build call and control flow graphs, and are language-specific and thus do not work well when there are multiple versions of software using different dependency versions. (b) Using dynamic code review. This is extremely costly to implement, as it requires a complete setup of an isolated environment, including all applications and databases a project interacts with. (c) Using decompilation to perform static code analysis. This is again dependent on software version and is specific to the way machine-code is generated.

The above methods by themselves generate statistically significant numbers of false positives and false negatives: False positives come from the erroneous detection of vulnerabilities and require a human in the loop to discern signal from noise. False negatives come from the prevalence of undetected altered dependent software (e.g., copy/paste/change from external libraries).

Promising developments from commercial vendors, such as Synopsys/Black Duck Hub, IBM AppScan, and Facebook’s Infer, provide text mining services for project source trees and compare them against vulnerability databases. However, these tools are costly to use and require the packaging of one’s code to be uploaded to a third-party service.

Work produced in Phase II may become classified. Note: The prospective contractor(s) must be U.S. owned and operated with no foreign influence as defined by DoD 5220.22-M, National Industrial Security Program Operating Manual, unless acceptable mitigating procedures can and have been implemented and approved by the Defense Counterintelligence and Security Agency (DCSA). The selected contractor and/or subcontractor must be able to acquire and maintain a secret level facility and Personnel Security Clearances, in order to perform on advanced phases of this project as set forth by DCSA and NAVWAR in order to gain access to classified information pertaining to the national defense of the United States and its allies; this will be an inherent requirement. The selected company will be required to safeguard classified material IAW DoD 5220.22-M during the advanced phases of this contract.

NGA20C-001

TITLE:

Algorithm Performance Evaluation with Low Sample Size

OBJECTIVE

Develop novel techniques and metrics for evaluating machine learning-based computer vision algorithms with few examples of labeled overhead imagery.

DESCRIPTION

The National Geospatial-Intelligence Agency (NGA) produces timely, accurate and actionable geospatial intelligence (GEOINT) to support U.S. national security. To exploit the growing volume and diversity of data, NGA is seeking a solution to evaluate the performance of a class of algorithms for which there are limited quantities of training data and evaluation data samples. This is important because statistical significance of the evaluation results is directly tied to the size of the evaluation dataset. While significant effort has been put forth to train algorithms with low sample sizes of labelled data [1-2], open questions remain for the best representative evaluation techniques under the same constraint.

Of specific interest to this solicitation are innovative approaches to rapid evaluation of computer vision algorithms at scale, using small quantities of labelled data samples, and promoting extrapolation to larger data populations. The central challenge to be addressed is the evaluation of performance with the proper range and dimension of data characteristics, when the labeled data represents a small portion of the potential operating conditions. An example is when performance must be evaluated as a function of different lighting conditions, but most of the labelled data was collected under full sun.

The study will be based on panchromatic electro-optical (EO) imagery using a subset (selected by the STTR participants) of the xView detection dataset, although extension to other sensing modalities is encouraged. Solutions with a mathematical basis are desired.

SCO 20.3-001

TITLE:

Machine Learned Cyber Threat Behavior Detection

OBJECTIVE

Develop unsupervised machine learning algorithms to evaluate Zeek logs of common inbound and outbound perimeter network traffic protocols to provide high-confidence anomaly detection of suspicious and/or malicious network traffic.

The algorithms must be able to run on 1U commodity hardware on small to large networks. Report outputs from the algorithms should be retrievable as JSON or CSV formatted files and contain sufficient information for ingestion and correlation against various databases or SIEM systems. At a minimum, the output reports should provide enough data to understand the suspicious threat anomalies identified, corresponding Zeek metadata associated with the detection for correlation and enrichment with other databases, date/time, confidence associated with the detection, and technical reasoning behind the confidence levels and detections made. The government must be equipped with the ability to specify how reporting is generated based on confidence thresholds.
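The reporting requirements above can be sketched with a very small unsupervised detector. The example assumes Zeek conn.log entries have already been parsed into dictionaries (the field names `ts`, `uid`, `id.orig_h`, `id.resp_h`, and `orig_bytes` follow Zeek's conn.log), scores each connection by a robust z-score on bytes sent, and emits a JSON report filtered by a confidence threshold. The confidence mapping, the threshold default, and the sample records are assumptions for illustration, and a real detector would use many more features.

```python
import json
import statistics

def score_connections(records, field="orig_bytes"):
    """Attach a robust z-score and a confidence in [0, 1) to each record."""
    values = [rec[field] for rec in records]
    median = statistics.median(values)
    # Median absolute deviation; fall back to 1.0 to avoid dividing by zero.
    mad = statistics.median([abs(v - median) for v in values]) or 1.0
    scored = []
    for rec in records:
        robust_z = 0.6745 * abs(rec[field] - median) / mad
        # Hypothetical squashing: a robust z of 3.5 maps to confidence 0.5.
        confidence = robust_z / (robust_z + 3.5)
        scored.append((rec, robust_z, confidence))
    return scored

def build_report(records, threshold=0.5, field="orig_bytes"):
    """Return a JSON report of detections at or above the confidence threshold."""
    detections = []
    for rec, robust_z, confidence in score_connections(records, field):
        if confidence >= threshold:
            detections.append({
                "ts": rec["ts"],
                "uid": rec["uid"],
                "id.orig_h": rec["id.orig_h"],
                "id.resp_h": rec["id.resp_h"],
                field: rec[field],
                "confidence": round(confidence, 3),
                "reason": f"{field} robust z-score {robust_z:.1f} relative to peer connections",
            })
    return json.dumps(detections, indent=2)

# Invented sample: five ordinary connections and one very large upload.
SAMPLE = [
    {"ts": 1600000000.0 + i, "uid": f"C{i + 1}", "id.orig_h": "10.0.0.5",
     "id.resp_h": "192.0.2.10", "orig_bytes": size}
    for i, size in enumerate([500, 520, 480, 510, 495, 90000])
]

if __name__ == "__main__":
    print(build_report(SAMPLE, threshold=0.5))
```

Keeping the `uid` in every detection is what lets an analyst join the report back to the original Zeek logs or a SIEM.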

DESCRIPTION

Machine Learning of Cyber Behaviour

PHASE I

SCO is accepting Direct to Phase II proposals ONLY. Proposers must demonstrate the following achievements outside of the SBIR program:

Provide a detailed summary of current research and development and/or commercialization of artificial intelligence methodologies used to identify cyber threats. The summary shall include:

Specific models used in previous research and how they would be applicable for this SBIR. Explain the maturation of these models and previous successes and known limitations in meeting the SBIR goals.
Detailed description of the training data available to the company. Identify whether the proposed training corpus will be accessible in-house, accessed via an open source corpus, or purchased from a commercial training corpus site. Provide the cost to access the proposed training corpus throughout the SBIR period of performance.
Describe the previous work done with the training corpus, specifically the methodologies used and resulting findings.
Finally, include an attachment detailing the schema to be assessed by the proposed algorithm and indicate if the schema was already tested in prior research efforts (NOTE: this schema list does not count against the maximum number of pages. If this is considered Proprietary information, the company shall indicate this with additional handling instructions).