Welcome to 2021!!!

January 6, 2021

Agenda:

  • Welcome – Zoom issue to start? A process is coming together for recording parts of our meetup to make them available for later viewing. FreeConferenceCall is free and has whiteboard and recording features.
  • Project updates
    • Jonathan completely migrated his Taiga instance to projectwatch.dev
    • Schools outreach – started building a flyer site to gain support
    • AI4All curriculum – may be able to pull some of this over
    • OpenAI Gym – available for working out your agent behavior for reinforcement learning
    • Bug Analysis – XLNET or BERT or GPT-2 – SimpleTransformers

  • Current news
  • Upcoming schedule for 2021
    • Ben on MLFlow
    • ? presenting WeightsAndBiases
    • Paper of the month
    • Harsha for Recommendation Systems
    • March – HATCH2021
  • Q&A
  • Close

Deploying ML Models with Django

September 30, 2020

Harsha led the talk, covering the approach for deploying a machine learning model with the Django Framework.

The slides from the talk are available here.

National Defense Education Program

September 26, 2020

This is a presentation given as part of the National Defense Education Program (NDEP) coordinated by Alabama A&M. The intent of the program is to improve the “employment pipeline” of underrepresented people in the defense sector.

HSV-AI Logo

National Defense Education Program

Welcome to Huntsville AI!

Brief Intro

What we do at Huntsville AI:

  • Application – how to solve a problem with a technology
  • Theory – for those times when knowing “How” something works is necessary
  • Social / Ethics – Human / AI interaction
  • Brainstorming – new uses for existing solutions
  • Hands on Code – for those times when you just have to run something for yourself
  • Coworking Night – maybe have combined sessions with other groups to discuss application in their focus areas
  • Community – get to know others in the field. Provide talks like this one and support local tech events (HATCH, SpaceApps)

About me

J. Langley

Chief Technical Officer at CohesionForce, Inc.

Founder of Session Board & Huntsville AI

Involved in Open Source (Eclipse & Apache Foundations)

Started playing with AI about 15 years ago when Intelligent Agents were all the rage.

Developed a Naive Bayes approach for text classification, a Neural Network for audio classification, heavily into NLP.

What is this AI thing?

Artificial Intelligence – a computer program that learns how to solve problems based on data

Breaking it Down

There are several ways to break the subject of AI into digestible chunks.

Here’s a useful way to think about the relationship between AI, ML, and Deep Learning.

[Image: the relationship between AI, ML, and Deep Learning]

This is adapted from Ian Goodfellow's book Deep Learning: https://www.deeplearningbook.org/

Here’s another view of AI:

[Image: another view of AI]

Lifecycle Breakdown

Yet another way to break down the AI industry is by the lifecycle of a project:

  • Academic – Developing novel approaches or architectures for AI
  • Application – Applying AI techniques to solve real world problems
  • ML-OPS & Tools – Applications that provide support for the development and deployment of AI workflows

Google MLOps Paper

Human Level AI Performance Milestones

Page 68 here: https://hai.stanford.edu/sites/default/files/ai_index_2019_report.pdf

How does this stuff work?

Universal Approximation Theorem

The theorem states that a feed-forward network with a single hidden layer of sufficient width can approximate any well-behaved function. Such a well-behaved function can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
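
As a quick illustration (not part of the original slides), here is a minimal sketch, assuming scikit-learn and numpy are available, that fits a single-hidden-layer network to a smooth one-dimensional function:

# Minimal sketch: a single hidden layer approximating sin(x).
# Hyperparameters are illustrative, not tuned.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(X).ravel()

net = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                   max_iter=5000, random_state=0)
net.fit(X, y)

X_test = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
print("max abs error:", np.abs(net.predict(X_test) - np.sin(X_test).ravel()).max())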

Neural Networks!

NeuralNetwork

Deep Neural Network

A deep neural network has a lot more layers stacked in between the inputs and the outputs.
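
Continuing the sketch above (again illustrative, not from the slides), going "deep" simply means stacking additional hidden layers:

# Same task as before, but with three stacked hidden layers.
# Reuses X, y, and X_test from the previous sketch; layer sizes are arbitrary.
deep_net = MLPRegressor(hidden_layer_sizes=(50, 50, 50), activation='tanh',
                        max_iter=5000, random_state=0)
deep_net.fit(X, y)
print("max abs error:", np.abs(deep_net.predict(X_test) - np.sin(X_test).ravel()).max())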

AI Applications

  • Classification
    • Medical
    • Agriculture
    • Fraud Detection
  • Recommendation
    • Product placement
    • Social Media / Marketing
    • Online shopping
  • Generation
    • Images
    • Text

Why you should consider a career in AI

  • It’s important
  • It’s everywhere
  • We need you

It’s really cool!

Top 4 Department of Defense Modernization Priorities:

  • AI
  • Biotech
  • Autonomy
  • Cyber

Diversity can help solve some of the bias problems in current AI:

Research earlier in 2019 from Joy Buolamwini and Timnit Gebru evaluated computer vision products from IBM, Microsoft, and elsewhere, and they found that these products performed worse on women than on men and worse on people with dark skin compared to people with light skin. For instance, IBM’s computer vision software was 99.7% accurate on light-skinned men and only 65% accurate on dark-skinned women.

Amazon’s Face Recognition Falsely Matched 28 Members of Congress With Mugshots:

Link to ACLU Article

Model to create high resolution images from pixelated images

Here’s a tweet that shows a new model that can create realistic high-resolution images from a single low resolution starting point.

And…

Here’s the tweet that shows a pretty big problem with the model.

Free Education Resources:

SBIR 2020-3

September 15, 2020

The Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs are highly competitive programs that encourage domestic small businesses to engage in Federal Research/Research and Development (R/R&D) with the potential for commercialization.

HSV-AI Logo

Brief Intro

What we do at Huntsville AI:

  • Application – how to solve a problem with a technology
  • Theory – for those times when knowing “How” something works is necessary
  • Social / Ethics – Human / AI interaction
  • Brainstorming – new uses for existing solutions
  • Hands on Code – for those times when you just have to run something for yourself
  • Coworking Night – maybe have combined sessions with other groups to discuss application in their focus areas
  • Community – get to know others in the field. Provide talks like this one and support local tech events (HATCH, SpaceApps)

Quick Reference

The full list of 2020.3 SBIR topics is available HERE

NGA203-003

TITLE

Novel Mathematical Foundation for Automated Annotation of Massive Image Data Sets

OBJECTIVE

This announcement seeks proposals that offer dramatic improvements in automated object detection and annotation of massive image data sets. Imaging data is being created at an extraordinary rate from many sources, both from government assets as well as private ones. Automated methods for accurate and efficient object identification and annotation are needed to fully exploit this resource. This topic is focused on new artificial intelligence (AI) methods to effectively and efficiently solve this problem.

DESCRIPTION

Current choke points blocking optimal exploitation of the full stream of available image data include confronting widely different views (perspective, resolution, etc.) of the same or similar objects and the overwhelming amounts of human effort required for effective processing. Current manual processes require human eyes on every image to perform detection, identification, and annotation. Current state-of-the-art AI requires intensive human support to generate giant training sets. Further, resulting methods frequently generate rule sets that are overly fragile, in that training on one object is not transferrable to the detection of another object even though the object might strike a human as essentially the same, thus increasing the need for human review of the algorithm decisions.

NGA seeks new types of AI tools optimized for the task of object identification and annotation across diverse families of image data that are reliable, robust, not dependent on extensive training demands, applicable to objects of interest to both government and commercial concerns, and simultaneously parsimonious with user resources in general. In particular, we seek solutions that make AI outputs both more explainable and more “lightweight” to human users.

The focus of a successful phase 1 effort should be on explaining the mathematical foundation that will enable the significantly improved AI tools described herein. Of specific interest are novel AI constructs that are more principled and universal and less ad hoc than current technology and can be used to construct a tool that performs relevant tasks. For the purposes of this announcement “relevant tasks” are limited to object identification across view types, drawing an object bounding box, and correctly labelling the object in a text annotation. A successful Phase 1 proposal should explain how the mathematical foundation needed to build the required tools will be developed in Phase 1 and implemented in a software toolkit in Phase 2. Examples should be developed during Phase 1 and should illustrate either improved reliability or robustness over the current state of the art, as well as reducing training demands and user resources. Proposals describing AI approaches that are demonstrably at or near the current state of the art in commercial AI performance, such as on ImageNet data sets, are specifically not of interest under this topic. The foundational element of a successful proposal under this topic is exploitation of novel mathematics that will enable new and better AI approaches.

Direct to Phase 2 proposals are being accepted under this topic. A straight to phase 2 proposal should describe pre-existing mathematical foundations and illustrative examples described in the paragraph above. Phase 2 proposals should also propose a set of milestones and demonstrations that will establish the novel AI tools as a viable commercial offering.

OSD203-004

TITLE

Domain-Specific Text Analysis

OBJECTIVE:

Develop text analysis software that leverages current Natural Language Processing (NLP) algorithms and techniques (e.g., Bayesian algorithms, word embeddings, recurrent neural networks) for accurately conducting content and sentiment analysis, as well as dictionary development.

DESCRIPTION:

The United States Department of Defense (DoD) collects large amounts of text data from its personnel using a variety of different formats, including opinion/climate surveys, memoranda, incident reports, standard forms, and transcripts of focus group/sensing sessions. Much of this data is used operationally; however, recent interest in leveraging text data to glean insight into personnel trends/behaviors/intentions has prompted a greater degree of research in NLP. Additionally, Topic Modeling and Sentiment Analysis have been explored by various research arms of the DoD; however, two foundational hurdles need to be addressed before they can realistically be applied to the DoD:

First, the varied use of jargon, nomenclature, and acronyms across the DoD and Service Branches must be more comprehensively understood. Additionally, development of a “DoD Dictionary” should enable the fluid use of extant and newly-created jargon, phrases, and sayings used over time. Second, the emergent nature and rapid innovation of NLP techniques has made bridging the technical gap between DoD analysts and tools difficult. Additionally, the understanding and interpreting of NLP techniques by non-technical leadership is particularly difficult. There currently exists no standard format or package that can be used to analyze and develop visualizations for text data in such a way that accommodates the needs of operational leadership to make decisions regarding personnel policies or actions.

SOCOM203-003

TITLE:

High-Performance Multi-Platform / Sensor Computing Engine

OBJECTIVE:

The objective of this topic is to develop a next generation multi-platform & multi-sensor capable Artificial Intelligence-Enabled (AIE), high performance computational imaging camera with an optimal Size, Weight and Power – Cost (SWaP-C) envelope. This computational imaging camera can be utilized in weapon sights, surveillance and reconnaissance systems, precision strike target acquisition, and other platforms. This development should provide bi-directional communication between tactical devices with onboard real-time scene/data analysis that produces critical information for the SOF Operator. As a part of this feasibility study, the Offerors shall address all viable overall system design options with respective specifications on the key system attributes.

DESCRIPTION:

A system-of-systems approach, “smart-Visual Augmentation Systems,” and the integration of a next generation smart sensor enable information sharing between small arms, SOF VAS, and other target engagement systems. Sensors and targeting promote the ability to hit and kill the target while ensuring Rules of Engagement are met and civilian casualties/collateral damage are eliminated. The positive identification of the target and the precise firing solution will optimize the performance of the operator, the weapon, and the ammunition to increase precision at longer ranges in multiple environments.

This system could be used in a broad range of military applications where Special Operations Forces require: Faster Target Acquisition; Precise Targeting; Automatic Target Classification; Classification-based Multi-Target Tracking; Ability to Engage Moving Targets; Decision Support System; Targeting with Scalable Effects; Battlefield Awareness; Integrated Battlefield (Common Operating Picture with IOBT, ATAK, COT across Squad, Platoon).

HR001120S0019-14

TITLE:

AI-accelerated Biosensor Design

OBJECTIVE:

Apply artificial intelligence (AI) to accelerate the design of highly specific, engineered biomarkers for rapid virus detection.

DESCRIPTION:

This SBIR seeks to leverage AI technologies to accelerate the development of aptamer-based biosensors that specifically bind to biomolecular structures. Aptamers are short single-stranded nucleic acid sequences capable of binding three-dimensional biomolecular structures in a way similar to antibodies. Aptamers have several advantages as compared to antibodies, including long shelf-life, stability at room temperature, low/no immunogenicity, and low-cost.

The current state-of-the-art aptamer designs rely heavily on in vitro approaches such as SELEX (Systematic Evolution of Ligands by Exponential Enrichment) and its advanced variations. SELEX is a cyclic process that involves multiple rounds of selection and amplification over a very large number of candidates (>10^15). The iterative and experimental nature of SELEX makes it time consuming (weeks to months) to obtain aptamer candidates, and the overall probability of ultimately obtaining a useful aptamer is low (30%-50%). Attempts to improve the performance of the original SELEX process generally result in increased system complexity and system cost as well as increased demand on special domain expertise for their use. Furthermore, a large number of parameters can influence the SELEX process.

Therefore, this is a domain that is ripe for AI. Recent AI research has demonstrated the potential for machine learning technologies to encode domain knowledge to significantly constrain the solution space of optimization search problems such as solving biomolecular inverse problems. Such in silico techniques consequently offer the potential to provide a cost-effective alternative that makes the aptamer design process more dependable and, thereby, more efficient. This SBIR seeks to leverage emerging AI technologies to develop a desktop-based AI-assisted aptamer design capability that accelerates the identification of high-performance aptamers for detecting new biological antigens.

NGA203-004

TITLE:

High Dimensional Nearest Neighbor Search

OBJECTIVE

This topic seeks research in geolocation of imagery and video media taken at near-ground level [1]. The research will explore hashing/indexing techniques (such as [2]) that match information derived from media to global reference data. The reference data is composed of digital surface models (DSMs) of known geographical regions and features that can be derived from that surface data, together with limited “foundation data” of the same regions consisting of map data such as might be present in Open Street Maps and landcover data (specifying regions that are fields, vegetation, urban, suburban, etc.). Query data consists of images or short video clips that represent scenes covered by the digital surface model in the reference data, but may or may not have geo-tagged still images in the reference data from the same location.

Selected performers will be provided with sample reference data, consisting of DSM data and a collection of foundation data, and will be provided with sample query data. This sample data is described below. However, proposers might suggest other reference and query data that they will use to either supplement or replace government-furnished sample data. This topic seeks novel ideas for the representation of features in the query data and the reference data that can be used to perform retrieval of geo-located reference data under the assumption that the query data lacks geolocation information. The topic particularly seeks algorithmically efficient approaches such as hashing techniques for retrieval based on the novel features extracted from query imagery and reference data that can be used to perform matching using nearest neighbor approaches in feature space.

DESCRIPTION

The reference data includes files consisting of a vectorized two-dimensional representation of a Digital Surface Model (DSM) [4], relative depth information, and selected foundation feature data. The foundation features will include feature categories such as the locations of roads, rivers, and man-made objects.

The desired output of a query is a location within meters of the ground truth location of the camera that acquired the imagery. In practice, only some of the queries will permit accurate geolocation based on the reference data, and in some cases, the output will be a candidate list of locations, such that the true location is within the top few candidates. It is reasonable to assume that there exists a reference database calculated from a global DSM with a minimum spatial resolution of 30 meters that may, in some locations, provide sub-meter spatial resolution. The foundation feature is at least as rich as that present in Open Street Maps, and can include extensive landcover data with multiple feature types. For the purpose of this topic, the reference data will not include images. Sample reference and query data representative of these assumptions, but of limited geographical areas, will be provided to successful proposers.

The topic seeks approaches that are more accurate than a class of algorithms that attempt to provide geolocation to a general region, such as a particular biome or continent. These algorithms are often based on a pure neural network approach, such as described in [3], and are unlikely to produce sufficiently precise camera location information that is accurate to within meters.

The objective system, in full production, should be sufficiently efficient as to scale to millions of square kilometers of reference data, and should be able to process queries at a rate of thousands of square kilometers per minute. While a phase 2 system might provide a prototype at a fraction of these capabilities, a detailed complexity analysis is expected to support the scalability of the system.

The proposed approach may apply to only a subset of query imagery types. For example, the proposed approach may be accurate only for urban data, or only for rural scenes. The proposer should carefully explain the likely limitations of the proposed approach and suggest methods whereby query imagery could be filtered so that only appropriate imagery is processed by the proposed system.

Proposers who can demonstrate prior completion of all of the described Phase I activities may propose a “straight to Phase II” effort. In this case the novelty of the proposed feature and retrieval approach will be a consideration in determining an award.

N203-151

TITLE:

Machine Learning Detection of Source Code Vulnerability

OBJECTIVE

Develop and demonstrate a software capability that utilizes machine-learning techniques to scan source code for its dependencies, trains cataloging algorithms on code dependencies and detection of known vulnerabilities, and scales to support polyglot architectures.

DESCRIPTION

Nearly every software library in the world is dependent on some other library, and the identification of security vulnerabilities on the entire corpus of these dependencies is an extremely challenging endeavor. As part of a Development, Security, and Operations (DevSecOps) process, this identification is typically accomplished using the following methods: (a) Using static code analyzers. This can be useful but is technically challenging to implement in large and complex legacy environments. They typically require setting up a build environment for each version to build call and control flow graphs, and are language-specific and thus do not work well when there are multiple versions of software using different dependency versions. (b) Using dynamic code review. This is extremely costly to implement, as it requires a complete setup of an isolated environment, including all applications and databases a project interacts with. (c) Using decompilation to perform static code analysis. This is again dependent on software version and is specific to the way machine-code is generated.

The above methods by themselves generate statistically significant numbers of false positives and false negatives: False positives come from the erroneous detection of vulnerabilities and require a human in the loop to discern signal from noise. False negatives come from the prevalence of undetected altered dependent software (e.g., copy/paste/change from external libraries).

Promising developments from commercial vendors provide text mining services for project source trees and compare them against vulnerability databases, such as Synopsis/Blackduck Hub, IBM AppScan, and Facebook’s Infer. However, these tools are costly to use and require the packaging of one’s code to be uploaded to a third-party service.

Work produced in Phase II may become classified. Note: The prospective contractor(s) must be U.S. owned and operated with no foreign influence as defined by DoD 5220.22-M, National Industrial Security Program Operating Manual, unless acceptable mitigating procedures can and have been implemented and approved by the Defense Counterintelligence Security Agency (DCSA). The selected contractor and/or subcontractor must be able to acquire and maintain a secret level facility and Personnel Security Clearances, in order to perform on advanced phases of this project as set forth by DCSA and NAVWAR in order to gain access to classified information pertaining to the national defense of the United States and its allies; this will be an inherent requirement. The selected company will be required to safeguard classified material IAW DoD 5220.22-M during the advanced phases of this contract.

NGA20C-001

TITLE:

Algorithm Performance Evaluation with Low Sample Size

OBJECTIVE

Develop novel techniques and metrics for evaluating machine learning-based computer vision algorithms with few examples of labeled overhead imagery.

DESCRIPTION

The National Geospatial Intelligence Agency (NGA) produces timely, accurate and actionable geospatial intelligence (GEOINT) to support U.S. national security. To exploit the growing volume and diversity of data, NGA is seeking a solution to evaluate the performance of a class of algorithms for which there are limited quantities of training data and evaluation data samples. This is important because the statistical significance of the evaluation results is directly tied to the size of the evaluation dataset. While significant effort has been put forth to train algorithms with low sample sizes of labelled data [1-2], open questions remain for the best representative evaluation techniques under the same constraint.

Of specific interest to this solicitation are innovative approaches to rapid evaluation of computer vision algorithms at scale, using small quantities of labelled data samples, and promoting extrapolation to larger data populations. The central challenge to be addressed is the evaluation of performance with the proper range and dimension of data characteristics, when the labeled data represents a small portion of the potential operating conditions. An example is when performance must be evaluated as a function of different lighting conditions, but most of the labelled data was collected under full sun.

The study will be based on panchromatic electro-optical (EO) imagery using a subset (selected by the STTR participants) of the xView detection dataset, although extension to other sensing modalities is encouraged. Solutions with a mathematical basis are desired.

SCO 20.3-001

TITLE:

Machine Learned Cyber Threat Behavior Detection

OBJECTIVE

Develop unsupervised machine learning algorithms to evaluate Zeek logs of common inbound and outbound perimeter network traffic protocols and provide high-confidence anomaly detection of suspicious and/or malicious network traffic.

The algorithms must be able to run on 1U commodity hardware on small to large networks. Report outputs from the algorithms should be retrievable as json or csv formatted files and contain sufficient information for ingestion and correlation against various databases or SIEM systems. At a minimum, the output reports should provide enough data to understand the suspicious threat anomalies identified, the corresponding Zeek metadata associated with each detection for correlation and enrichment with other databases, the date/time, the confidence associated with the detection, and the technical reasoning behind the confidence levels and detections made. The government must be equipped with the ability to specify how reporting is generated based on confidence thresholds.

DESCRIPTION

Machine Learning of Cyber Behaviour

PHASE I

SCO is accepting Direct to Phase II proposals ONLY. Proposers must demonstrate the following achievements outside of the SBIR program:

Provide a detailed summary of current research and development and/or commercialization of artificial intelligence methodologies used to identify cyber threats. The summary shall include:

  1. Specific models used in previous research and how they would be applicable for this SBIR. Explain the maturation of these models and previous successes and known limitations in meeting the SBIR goals.
  2. Detailed description of the training data available to the company. Identify whether the proposed training corpus will be accessible in-house, accessed via an open source corpus, or purchased from a commercial training corpus site. Provide the cost to access the proposed training corpus throughout the SBIR period of performance.
  3. Describe the previous work done with the training corpus, specifically the methodologies used and resulting findings.
  4. Finally, include an attachment detailing the schema to be assessed by the proposed algorithm and indicate if the schema was already tested in prior research efforts (NOTE: this schema list does not count against the maximum number of pages. If this is considered Proprietary information, the company shall indicate this with additional handling instructions).

Doc2Vec Update

July 29, 2020

This notebook is a demonstration of using Doc2Vec as an approach to compare the text of bug reports from an open source Bugzilla project. The Doc2Vec model from gensim is based on the word2vec paper but includes an additional input that identifies the document in the input.

HSV-AI Logo

Bug Comparison with Doc2Vec

Personal opinion is that the API is difficult to understand, with little tutorial material available on how to implement this model in a practical solution.

API doc here

Based on the paper Distributed Representations of Sentences and Documents by Quoc Le and Tomas Mikolov.

Experiments

  1. stop words vs no stop words
  2. Try different instantiations of the Doc2Vec
  3. Try training a pre-trained Wikipedia version
#Global Imports
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import spacy
import gensim
import collections
import statistics

Loading Bugzilla Data

The data for this notebook was put together from the Eclipse XText project Bugzilla. You can find a link here

The bugs and associated comments were loaded into a pandas dataframe and stored in parquet format.

url = 'https://github.com/HSV-AI/bug-analysis/raw/master/data/df-xtext.parquet.gzip'
df = pd.read_parquet(url)

Creating a Tokenize Method

The tokenize method is used by the Doc2Vec model to extract a list of words from a document. How this method is configured can have a major impact on the performance of the model. Some approaches that work for TF-IDF, such as dropping many of the less informative words, do not work for the Doc2Vec approach, where the order of the words matters.

Other techniques that work well with probabilistic approaches, like capturing the lemma of a word instead of the actual word, may actually reduce the accuracy of the Doc2Vec model.

text = """    java.lang.ClassCastException: HIDDEN
    at org.eclipse.xtext.xbase.ui.debug.XbaseBreakpointDetailPaneFactory.getDetailPaneTypes(XbaseBreakpointDetailPaneFactory.java:42)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager$DetailPaneFactoryExtension.getDetailPaneTypes(DetailPaneManager.java:94)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPossiblePaneIDs(DetailPaneManager.java:385)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPreferredPaneFromSelection(DetailPaneManager.java:285)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneProxy.display(DetailPaneProxy.java:109)
    at org.eclipse.jdt.internal.debug.ui.ExpressionInformationControlCreator$ExpressionInformationControl$2.updateComplete(ExpressionInformationControlCreator.java:344)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$4.run(TreeModelContentProvider.java:751)
    at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.notifyUpdate(TreeModelContentProvider.java:737)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.updatesComplete(TreeModelContentProvider.java:653)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.performUpdates(TreeModelContentProvider.java:1747)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.access$10(TreeModelContentProvider.java:1723)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$6.run(TreeModelContentProvider.java:1703)
    at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
    at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:136)
    at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4147)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3764)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$9.run(PartRenderingEngine.java:1151)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1032)
    at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:156)
    at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:648)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
    at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:592)
    at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
    at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:138)
    at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:380)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:235)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(null:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke(null:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(null:-1)
    at java.lang.reflect.Method.invoke(null:-1)
    at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:648)
    at org.eclipse.equinox.launcher.Main.basicRun(Main.java:603)
    at org.eclipse.equinox.launcher.Main.run(Main.java:1465)"""

# Quick check: this pattern removes a complete Java stack trace (the exception line plus all of the "at ..." lines)
exception_regex = re.compile(r"(?m)^.*?Exception.*(?:[\r\n]+^\s*at .*)+", re.MULTILINE | re.IGNORECASE)
exception_regex.sub("", text)
''
nlp = spacy.load("en_core_web_sm")

# Patterns used to strip boilerplate from bug comments before tokenizing
exception_regex = re.compile(r".+Exception[^\n].*\s+at", re.MULTILINE | re.IGNORECASE)  # start of a Java stack trace
greater_regex = re.compile(r"^> .*$", re.MULTILINE | re.IGNORECASE)  # quoted reply lines
gerrit_created_regex = re.compile(r"New Gerrit change created: [^\ ]+", re.MULTILINE | re.IGNORECASE)
gerrit_merge_regex = re.compile(r"Gerrit change [^\s]+ was merged to [^\.]+\.", re.MULTILINE | re.IGNORECASE)
gerrit_commit_regex = re.compile(r"Commit: [^\ ]+", re.MULTILINE | re.IGNORECASE)

# Parts of speech kept when the POS filter line in tokenize_spacy is enabled
filter = ['VERB', 'NOUN', 'PROPN']

def tokenize_spacy(text):
    text = greater_regex.sub("", text)
    text = exception_regex.sub("", text)
    text = gerrit_created_regex.sub("", text)
    text = gerrit_merge_regex.sub("", text)
    text = gerrit_commit_regex.sub("", text)
    filtered_tokens = []
    
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
#            if re.fullmatch('[a-zA-Z]+', token.text) and not token.is_stop:
#            if token.pos_ in filter and re.fullmatch('[a-zA-Z]+', token.text):
            if re.fullmatch('[a-zA-Z]+', token.text):
#                 filtered_tokens.append(token.lemma_)
                filtered_tokens.append(token.text)
    return filtered_tokens

TaggedDocument

The Doc2Vec model uses an array of TaggedDocuments as input for training. A TaggedDocument consists of an array of words/tokens (from our tokenizer) and a list of tags. In our case, the only tag used is the ID of the bug.

def read_corpus():
  for i, row in df.iterrows():
    yield gensim.models.doc2vec.TaggedDocument(tokenize_spacy(row['text']), [row['id']])

train_corpus = list(read_corpus())

Let’s take a look at a random TaggedDocument in the corpus. This is a good check to see what the tokenizer is providing based on the text of the bug.

doc_id = random.randint(0, len(train_corpus) - 1)
doc = train_corpus[doc_id]
tag = doc.tags[0]
print(tag,doc.words)
text = df.iloc[doc_id]['text']
print('\n',text)
363914 ['Check', 'that', 'you', 'can', 'not', 'append', 'a', 'null', 'segment', 'to', 'a', 'QualifiedName', 'Build', 'Identifier', 'Just', 'a', 'minor', 'enhancement', 'The', 'factory', 'checks', 'that', 'you', 'can', 'not', 'create', 'a', 'qualified', 'name', 'with', 'a', 'null', 'segment', 'However', 'the', 'function', 'does', 'not', 'Would', 'be', 'better', 'to', 'always', 'guarantee', 'the', 'non', 'null', 'invariant', 'and', 'also', 'check', 'the', 'parameter', 'of', 'the', 'append', 'operation', 'Reproducible', 'Always', 'fixed', 'pushed', 'to', 'master', 'We', 'have', 'to', 'make', 'sure', 'that', 'no', 'client', 'code', 'in', 'the', 'frameworks', 'passes', 'null', 'to', 'As', 'as', 'discussed', 'internally', 'I', 'removed', 'the', 'null', 'check', 'for', 'now', 'since', 'it', 'might', 'lead', 'to', 'new', 'exceptions', 'in', 'clients', 'The', 'plan', 'is', 'to', 'apply', 'the', 'apply', 'the', 'null', 'check', 'again', 'right', 'after', 'we', 'have', 'release', 'Xtext', 'This', 'will', 'allow', 'us', 'to', 'do', 'more', 'thorough', 'testing', 'The', 'commit', 'can', 'be', 're', 'applied', 'via', 'git', 'cherry', 'pick', 'cherry', 'picked', 'and', 'pushed', 'Requested', 'via', 'bug', 'Requested', 'via', 'bug']

 Check that you cannot append a null segment to a QualifiedName  Build Identifier: 20110916-0149

Just a minor enhancement.  The factory org.eclipse.xtext.naming.QualifiedName.create(String...) checks that you cannot create a qualified name with a "null" segment.  However, the org.eclipse.xtext.naming.QualifiedName.append(String) function does not.  Would be better to always guarantee the non-null invariant and also check the parameter of the append operation.

Reproducible: Always fixed; pushed to 'master'. We have to make sure that no client code in the frameworks passes null to QualifiedName#append As as discussed internally, I've removed the null-check for now since it might lead to new exceptions in clients. 

The plan is to apply the apply the null-check again right after we have release Xtext 2.2. This will allow us to do more thorough testing. 

The commit can be re-applied via "git cherry-pick -x b74a06f705a9a0750289e2152d49941f4727e756" cherry-picked and pushed. Requested via bug 522520.

-M. Requested via bug 522520.

-M.

Doc2Vec Model

Believe it or not, there are 22 available parameters for use in the constructor with 21 being optional. The API also does not list the defaults for the optional parameters.

The required parameter is the vector of tagged documents to use for training.

The best way to figure this out is to use the notebook help syntax (for example, ??gensim.models.doc2vec.Doc2Vec) in a notebook cell.

Copying the text of the method headers gives us:

    Doc2Vec(documents=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)

    BaseWordEmbeddingsModel(sentences=None, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000, trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, cbow_mean=1, min_alpha=0.0001, compute_loss=False, fast_version=0, **kwargs)



dm ({1,0}, optional) – Defines the training algorithm. 

If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.

This is analogous to the CBOW vs. skip-gram distinction for word vectors: PV-DM is similar to CBOW, while PV-DBOW is similar to skip-gram.

The Distributed Memory version takes the order of the words into account when categorizing the document vectors, so we will use that version.
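
As a minimal sketch (the parameter values shown are illustrative and defaults may differ between gensim versions), PV-DM can be selected explicitly with the dm flag:

# Sketch: explicitly selecting the PV-DM algorithm (dm=1) when constructing the model.
# The other parameters are example values, not requirements.
pv_dm_model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=100, min_count=2, epochs=40)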

Building Vocabulary

The first step in setting up the model is to build the vocabulary using the list of tagged documents.

    model.build_vocab(documents, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)

Training the model

The final step is training the model.

     model.train(documents, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, callbacks=())


model = gensim.models.doc2vec.Doc2Vec(min_count=2, epochs=40)
%time model.build_vocab(train_corpus)
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 5.68 s, sys: 56.1 ms, total: 5.73 s
Wall time: 5.27 s
CPU times: user 1min 7s, sys: 655 ms, total: 1min 8s
Wall time: 26.9 s

model.save('bugzilla.doc2vec')

??model.save()

Using the Doc2Vec Model

The easiest way to use the model is with the most_similar method. This method will return a list of the most similar tagged documents based on the label passed into the method. For our use, we pass in the ID of the bug that we want to find a similar bug for.

model.docvecs.most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)
model.docvecs.most_similar(231773)
[(348199, 0.6316051483154297),
 (287550, 0.6027359962463379),
 (266427, 0.588177502155304),
 (287071, 0.5791216492652893),
 (266426, 0.576055109500885),
 (362787, 0.5739638805389404),
 (366414, 0.5676364898681641),
 (288103, 0.5655356645584106),
 (298734, 0.5629502534866333),
 (457006, 0.5558722019195557)]

The previous method only works with a previously known (and trained) label from a tagged document. The other way to use the model is to find the most similar tagged document based on a list of words. In order to do this:

  1. Get a list of words from a new document
  2. Important! Tokenize this list of words using the same tokenizer used when creating the corpus
  3. Convert the list of tokens to a vector using the infer_vector method
  4. Call the most_similar method with the new vector
from scipy import spatial

text1 = df.iloc[0,:]['text']
text2 = tokenize_spacy(text1)
vector = model.infer_vector(text2)

similar = model.docvecs.most_similar([vector])
print(similar)

[(231773, 0.8523540496826172), (266426, 0.6661572456359863), (287550, 0.6446412801742554), (476754, 0.6223310828208923), (266427, 0.6220518350601196), (312276, 0.6161692142486572), (473712, 0.613205075263977), (348199, 0.6111923456192017), (402990, 0.6109021902084351), (529006, 0.6108258366584778)]

Evaluating the Doc2Vec Model

Of course, it is helpful that our model returned the ID of the document that we vectorized and passed into the most_similar method. If this model is to be useful, each document in the corpus should be similar to itself. Using a cosine similarity metric, we can calculate the self-similarity of each document.

We’ll calculate the self-similarity below and graph the distribution to see what we have.

similarities = []
for doc_id in range(len(train_corpus)):
    learned_vector = model.docvecs[train_corpus[doc_id].tags[0]]
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    distance = 1 - spatial.distance.cosine(learned_vector, inferred_vector)
    similarities.append(distance)
    
sns.distplot(similarities, kde=False, rug=True)

[Distribution plot of document self-similarity scores]

Check Outliers

Let’s look at any items that are not self similar based on the model.

print(min(similarities))
index = similarities.index(min(similarities))
print(df.iloc[index,:])
0.0787610188126564
component                       Website
date          2013-07-16 07:03:42+00:00
id                               413031
product                             TMF
reporter        dixit.pawan@hotmail.com
resolution                   WORKSFORME
status                           CLOSED
text                  fgbnghjm  cvbndfh
title                          fgbnghjm
year                               2013
month                            2013-7
day                          2013-07-16
Name: 3581, dtype: object

Given that the text consists of “fgbnghjm cvbndfh”, you can see why this bug is not handled well by the model.

We can also look at the distribution of similarity scores for the next most similar document.

next_similar = []
for doc_id in range(len(train_corpus)):
    sims = model.docvecs.most_similar(train_corpus[doc_id].tags[0])
    next_similar.append(sims[0][1])
    
sns.distplot(next_similar, kde=False, rug=True)

print(statistics.mean(next_similar))
print(statistics.stdev(next_similar))

0.600508476063155
0.10128021819227365

[Distribution plot of scores for the next most similar document]

Detecting Near Duplicates

Looking into pairs of the most similar bugs that have a very high similarity (above 0.98), it appears that we have an issue with the tokenizer when it runs across a Java stack trace. An example of such a pair is printed below.
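
One possible mitigation, offered here as a suggestion rather than something evaluated in this notebook, is to reuse the more aggressive multiline regex demonstrated earlier so that an entire stack trace is stripped before tokenization:

# Sketch: strip complete Java stack traces before running the original tokenizer.
# Reuses the multiline pattern shown earlier in the notebook; not evaluated against the full corpus.
stacktrace_regex = re.compile(r"(?m)^.*?Exception.*(?:[\r\n]+^\s*at .*)+", re.IGNORECASE)

def tokenize_spacy_v2(text):
    text = stacktrace_regex.sub("", text)  # remove whole stack traces first
    return tokenize_spacy(text)            # then apply the existing pipeline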

print(max(next_similar))
index = next_similar.index(max(next_similar))
bug_id = df.iloc[index,:]['id']
print(bug_id,df.iloc[index,:]['text'])

print('\n\n')

sims = model.docvecs.most_similar(bug_id)
text = df.loc[df['id'] == sims[1][0]].iloc[0]['text']
print(sims[1][0],text)


0.9883251190185547
461367 CCE in XbaseBreakpointDetailPaneFactory.getDetailPaneTypes (42)  The following incident was reported via the automated error reporting:


    code:                   120
    plugin:                 org.eclipse.debug.ui_3.11.0.v20150116-1131
    message:                HIDDEN
    fingerprint:            a8a83b9f
    exception class:        java.lang.ClassCastException
    exception message:      HIDDEN
    number of children:     0

    java.lang.ClassCastException: HIDDEN
    at org.eclipse.xtext.xbase.ui.debug.XbaseBreakpointDetailPaneFactory.getDetailPaneTypes(XbaseBreakpointDetailPaneFactory.java:42)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager$DetailPaneFactoryExtension.getDetailPaneTypes(DetailPaneManager.java:94)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPossiblePaneIDs(DetailPaneManager.java:385)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPreferredPaneFromSelection(DetailPaneManager.java:285)
    at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneProxy.display(DetailPaneProxy.java:109)
    at org.eclipse.jdt.internal.debug.ui.ExpressionInformationControlCreator$ExpressionInformationControl$2.updateComplete(ExpressionInformationControlCreator.java:344)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$4.run(TreeModelContentProvider.java:751)
    at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.notifyUpdate(TreeModelContentProvider.java:737)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.updatesComplete(TreeModelContentProvider.java:653)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.performUpdates(TreeModelContentProvider.java:1747)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.access$10(TreeModelContentProvider.java:1723)
    at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$6.run(TreeModelContentProvider.java:1703)
    at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
    at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:136)
    at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4147)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3764)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$9.run(PartRenderingEngine.java:1151)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1032)
    at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:156)
    at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:648)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
    at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:592)
    at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
    at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:138)
    at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:380)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:235)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(null:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke(null:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(null:-1)
    at java.lang.reflect.Method.invoke(null:-1)
    at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:648)
    at org.eclipse.equinox.launcher.Main.basicRun(Main.java:603)
    at org.eclipse.equinox.launcher.Main.run(Main.java:1465)



General Information:

    reported-by:      Serhii Belei
    anonymous-id:     648982dc-0aba-4421-a13b-c3f08b2cb5aa
    eclipse-build-id: 4.5.0.I20150203-1300
    eclipse-product:  org.eclipse.epp.package.jee.product
    operating system: Windows7 6.1.0 (x86_64) - win32
    jre-version:      1.8.0_25-b18

The following plug-ins were present on the execution stack (*):
    1. org.eclipse.core.databinding.observable_1.4.1.v20140910-2107
    2. org.eclipse.core.databinding_1.4.100.v20141002-1314
    3. org.eclipse.core.runtime_3.10.0.v20150112-1422
    4. org.eclipse.debug.ui_3.11.0.v20150116-1131
    5. org.eclipse.e4.ui.workbench_1.3.0.v20150113-2327
    6. org.eclipse.e4.ui.workbench.swt_0.12.100.v20150114-0905
    7. org.eclipse.equinox.app_1.3.200.v20130910-1609
    8. org.eclipse.equinox.launcher_1.3.0.v20140415-2008
    9. org.eclipse.jdt.debug.ui_3.6.400.v20150123-1739
    10. org.eclipse.jdt.debug_3.8.200.v20150116-1130
    11. org.eclipse.jdt_3.11.0.v20150203-1300
    12. org.eclipse.swt_3.104.0.v20150203-2243
    13. org.eclipse.ui_3.107.0.v20150107-0903
    14. org.eclipse.ui.ide.application_1.0.600.v20150120-1542
    15. org.eclipse.ui.ide_3.10.100.v20150126-1117
    16. org.eclipse.xtext.xbase.ui_2.7.2.v201409160908
    17. org.eclipse.xtext.xbase_2.7.2.v201409160908
    18. org.eclipse.xtext_2.8.0.v201502030924

Please note that:
* Messages, stacktraces, and nested status objects may be shortened.
* Bug fields like status, resolution, and whiteboard are sent
  back to reporters.
* The list of present bundles and their respective versions was
  calculated by package naming heuristics. This may or may not reflect reality.

Other Resources:
* Report: https://dev.eclipse.org/recommenders/committers/confess/#/problems/54f58a02e4b03058b001ee0f  
* Manual: https://dev.eclipse.org/recommenders/community/confess/#/guide


Thank you for your assistance.
Your friendly error-reports-inbox.



463383 JME in JavaElement.newNotPresentException (556)  The following incident was reported via the automated error reporting:


    code:                   0
    plugin:                 org.apache.log4j_1.2.15.v201012070815
    message:                HIDDEN
    fingerprint:            f72b76f8
    exception class:        org.eclipse.emf.common.util.WrappedException
    exception message:      HIDDEN
    number of children:     0

    org.eclipse.emf.common.util.WrappedException: HIDDEN
    at org.eclipse.xtext.util.Exceptions.throwUncheckedException(Exceptions.java:26)
    at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.handleInvocationTargetException(AbstractDeclarativeValidator.java:137)
    at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.invoke(AbstractDeclarativeValidator.java:125)
    at org.eclipse.xtext.validation.AbstractDeclarativeValidator.internalValidate(AbstractDeclarativeValidator.java:312)
    at org.eclipse.xtext.validation.AbstractInjectableValidator.validate(AbstractInjectableValidator.java:69)
    at org.eclipse.xtext.validation.CompositeEValidator.validate(CompositeEValidator.java:153)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidate(Diagnostician.java:171)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:158)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:185)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:120)
    at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:148)
    at org.eclipse.xtext.xbase.annotations.validation.DerivedStateAwareResourceValidator.validate(DerivedStateAwareResourceValidator.java:33)
    at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:91)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.access$1(CachingResourceValidatorImpl.java:1)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:78)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:1)
    at org.eclipse.xtext.util.OnChangeEvictingCache.get(OnChangeEvictingCache.java:77)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.validate(CachingResourceValidatorImpl.java:81)
    at org.eclipse.xtend.ide.validator.XtendResourceValidator.validate(XtendResourceValidator.java:33)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:91)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:1)
    at org.eclipse.xtext.util.concurrent.CancelableUnitOfWork.exec(CancelableUnitOfWork.java:26)
    at org.eclipse.xtext.resource.OutdatedStateManager.exec(OutdatedStateManager.java:121)
    at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.internalReadOnly(XtextDocument.java:512)
    at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.readOnly(XtextDocument.java:484)
    at org.eclipse.xtext.ui.editor.model.XtextDocument.readOnly(XtextDocument.java:133)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob.createIssues(ValidationJob.java:86)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob.run(ValidationJob.java:67)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
caused by: org.eclipse.jdt.core.JavaModelException: HIDDEN
    at org.eclipse.jdt.internal.core.JavaElement.newNotPresentException(JavaElement.java:556)
    at org.eclipse.jdt.internal.core.Openable.getUnderlyingResource(Openable.java:344)
    at org.eclipse.jdt.internal.core.CompilationUnit.getUnderlyingResource(CompilationUnit.java:930)
    at org.eclipse.jdt.internal.core.SourceRefElement.getUnderlyingResource(SourceRefElement.java:226)
    at org.eclipse.xtend.ide.validator.XtendUIValidator.isSameProject(XtendUIValidator.java:85)
    at org.eclipse.xtend.ide.validator.XtendUIValidator.checkAnnotationInSameProject(XtendUIValidator.java:72)
    at sun.reflect.GeneratedMethodAccessor171.invoke(null:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.invoke(AbstractDeclarativeValidator.java:118)
    at org.eclipse.xtext.validation.AbstractDeclarativeValidator.internalValidate(AbstractDeclarativeValidator.java:312)
    at org.eclipse.xtext.validation.AbstractInjectableValidator.validate(AbstractInjectableValidator.java:69)
    at org.eclipse.xtext.validation.CompositeEValidator.validate(CompositeEValidator.java:153)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidate(Diagnostician.java:171)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:158)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:185)
    at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
    at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
    at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
    at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:120)
    at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:148)
    at org.eclipse.xtext.xbase.annotations.validation.DerivedStateAwareResourceValidator.validate(DerivedStateAwareResourceValidator.java:33)
    at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:91)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.access$1(CachingResourceValidatorImpl.java:1)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:78)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:1)
    at org.eclipse.xtext.util.OnChangeEvictingCache.get(OnChangeEvictingCache.java:77)
    at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.validate(CachingResourceValidatorImpl.java:81)
    at org.eclipse.xtend.ide.validator.XtendResourceValidator.validate(XtendResourceValidator.java:33)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:91)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:1)
    at org.eclipse.xtext.util.concurrent.CancelableUnitOfWork.exec(CancelableUnitOfWork.java:26)
    at org.eclipse.xtext.resource.OutdatedStateManager.exec(OutdatedStateManager.java:121)
    at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.internalReadOnly(XtextDocument.java:512)
    at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.readOnly(XtextDocument.java:484)
    at org.eclipse.xtext.ui.editor.model.XtextDocument.readOnly(XtextDocument.java:133)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob.createIssues(ValidationJob.java:86)
    at org.eclipse.xtext.ui.editor.validation.ValidationJob.run(ValidationJob.java:67)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)



General Information:

    reported-by:      Tobse
    anonymous-id:     ef35a7d7-0cbe-4995-a50b-ea7da1b26ef1
    eclipse-build-id: 4.5.0.I20150203-1300
    eclipse-product:  org.eclipse.epp.package.dsl.product
    operating system: Windows7 6.1.0 (x86_64) - win32
    jre-version:      1.8.0_25-b18

The following plug-ins were present on the execution stack (*):
    1. org.eclipse.core.jobs_3.7.0.v20150115-2226
    2. org.eclipse.emf.ecore_2.11.0.v20150325-0930
    3. org.eclipse.emf_2.6.0.v20150325-0933
    4. org.eclipse.jdt.core_3.11.0.v20150126-2015
    5. org.eclipse.jdt_3.11.0.v20150203-1300
    6. org.eclipse.xtend.core_2.9.0.v201503270548
    7. org.eclipse.xtend_2.1.0.v201503260847
    8. org.eclipse.xtend.ide_2.9.0.v201503270548
    9. org.eclipse.xtext_2.9.0.v201503270548
    10. org.eclipse.xtext.ui_2.9.0.v201503270548
    11. org.eclipse.xtext.util_2.9.0.v201503270548
    12. org.eclipse.xtext.xbase_2.9.0.v201503270548

Please note that:
* Messages, stacktraces, and nested status objects may be shortened.
* Bug fields like status, resolution, and whiteboard are sent
  back to reporters.
* The list of present bundles and their respective versions was
  calculated by package naming heuristics. This may or may not reflect reality.

Other Resources:
* Report: https://dev.eclipse.org/recommenders/committers/confess/#/problems/55155bfde4b026254edfe60d  
* Manual: https://dev.eclipse.org/recommenders/community/confess/#/guide


Thank you for your assistance.
Your friendly error-reports-inbox. PR: https://github.com/eclipse/xtext/pull/105 Reviewed commit
https://github.com/eclipse/xtext/commit/5da237c2a4a57e4ca2da32dc28f5a3152c1f1eba
from sklearn.metrics.pairwise import cosine_similarity

# Infer a vector for each document in the training corpus
X = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    X.append(inferred_vector)

# Pairwise cosine similarity between all document vectors
matrix = cosine_similarity(X)

# Plot the similarity matrix as a heatmap
fig, ax = plt.subplots(figsize=(10,10))
cax = ax.matshow(matrix, interpolation='nearest')
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, .75, .8, .85, .90, .95, 1])
plt.show()
png
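As a quick sanity check on the matrix itself, we can pull out the single most similar pair of distinct bugs straight from the cosine-similarity matrix. This is a minimal sketch that assumes the matrix, train_corpus, and df objects from the cells above; it is not part of the original notebook output.

import numpy as np

# Ignore the diagonal (every document is perfectly similar to itself)
masked = matrix.copy()
np.fill_diagonal(masked, -1.0)

# Row/column of the highest remaining similarity value
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print("Most similar pair:", train_corpus[i].tags[0], train_corpus[j].tags[0],
      "cosine similarity", round(float(masked[i, j]), 3))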

Finding Similar Bugs

Of course, the real test of whether this model returns something useful is to pick a random bug from the corpus and then find the bug most similar to it.

# Pick a random document from the training corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus) - 1)
doc = train_corpus[doc_id]

# Print the original bug id and its text
text = df.loc[df['id'] == doc.tags[0]].iloc[0]['text']
print(doc.tags[0],text)

inferred_vector = model.infer_vector(doc.words)

# Rank every document in the model by similarity to the inferred vector
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# sims[0] is usually the document itself, so show the next closest match
print("\n\nDocument that is",sims[1][1]," similar below\n\n")

text = df.loc[df['id'] == sims[1][0]].iloc[0]['text']
print(sims[1][0],text)

288444 [validation] quickfixes for unresolved references  A common use case for quickfixes will be to create an object corresponding to an unresolved cross reference. In order for Xtext to support this use case the following changes should be made:

1. The Xtext resource diagnostic interface (org.eclipse.xtext.diagnostics.Diagnostic) should provide some kind of getCode() method similar to what the TransformationDiagnostic implementation already has. Possibly it should also provide getters for the triggering AbstractNode, EObject, and StructuralFeature (allthough it should be possible to derive this from the offset and length) as this may be required to implement a corresponding quickfix.

2. Using the resource diagnostic's code the IXtextResourceChecker (and ValidationJob) can produce an Eclipse marker having a corresponding code attribute set accordingly. This however also requires investigating the marker type to use for resource diagnostics.

3. AbstractDeclarativeQuickfixProvider should also simplify the implementation of fixes for unresolved references. This would be difficult to achieve using the existing @Fix annotation because (a) the marker's code is fixed (unless my some other means overridden by the user) and (b) the type of object to create  cannot be declared (only that of the context object). A possibility would be to add another dedicated annotation. E.g.

  @FixReference(type=Bar.class, label="Create Bar '${name}'")
  public void createNewBar(Foo context, IMarker marker) {...} A currently possible workaround is for a subclass to override the AbstractDeclarativeQuickfixProvider behavior to:

- use the marker's message to check whether it is an unresolved cross reference
- derive the context object and reference (and thus also type of object to create) using the marker's offset and length *** Bug 283439 has been marked as a duplicate of this bug. *** Adding such quickfixed is possible with the new API. Please have a look at 
  org.eclipse.xtext.example.ui.quickfix.DomainmodelQuickfixProvider
for an example. You also have to override
  org.eclipse.xtext.example.ui.linking.DomainmodelLinkingDiagnosticMessageProvider
to return individual issue codes and issue data for different linking problems.

Having worked out the example, I don't think we should offer more simplistic API for these kinds of quickfixes, as there are too many tweaking points

- the EClass of the element to be created, which is not always the eReferenceType of the EReference. In the domainmodel it depends on the type of the container's container (Attribute or Reference)
- attribute initialisation, partly extracted at linking time and provided by means of issue data
- the eContainer for the new element, which is not necessarily the container of the referring element
- formatting Closing bug which were set to RESOLVED before Eclipse Neon.0.


Document that is 0.5803667306900024  similar below


263793 [Editor] Add quick fix support  Created attachment 124816
quick fix proof of concept

The subject of this feature request is support for quick fixes (i.e. Ctrl+1) in the Xtext editor. In particular it should also be possible to implement quick fixes for user defined Check constraint violations.

Terminology
===========
Quick fix : A quick fix is the means of accessing in invoking a marker resolution (see extension point org.eclipse.ui.ide.markerResolution) in the editor using the Ctrl+1 keybinding (default).

Marker resolution : A marker resolution is a runnable object (instance of IMarkerResolution) whose purpose is to solve the problem annotated by a particular marker. A marker resolution is created by a IMarkerResolutionGenerator registered using the extension point org.eclipse.ui.ide.markerResolution).

Check constraint violation : By this we mean the event of a particular model object violating a given Check constraint. This violation will be represented by a marker in the Xtext editor.

Implementation constraints
==========================
The current Check language should not be modified.

Proposed design and implementation
==================================
The Xtext editor is already capable of displaying markers pertaining to the resource being edited. We now also want to be able to access any marker resolutions registered against the markers using the Ctrl+1 key binding (Quick Fix). This involves enhancing XtextSourceViewerConfiguration and implementing a custom QuickAssistAssistant (see patch). This part can IMHO safely be added.

Additionally we want to be able to implement quick fixes for specific Check constraints. Here I propose that the Check constraint should not simply return an error message, but something structured (when the quick fix needs to be adviced). E.g.

context Foo#name WARNING fooName() :
    name != "FOO"
;

With the following extensions (the second would be part of an Xtext library):

fooName(Foo this) :
    error("Don't use the name FOO", "illegalId").addAll({"foo", this})
;

error(Object this, String message, String type) :
    {"message", message, "type", type}
;

The returned data is essentially a map represented as a list (due to oAW expression limitations). Given this list in Check a Diagnostic object will be created with corresponding data (Diagnostic#getData()). In XtextResourceChecker (and XtextDocument) these same properties will be set on the corresponding created Marker objects. If no list is returned by the check constraint the original behavior will be employed.

The "message" (mapped to "message" attribute of marker) and "type" (mapped to markerType of marker!) properties are predefined but it is also important to note that the user has the ability to add any other custom properties which will automatically be attached to the Diagnostic and eventually the marker (as with "foo" in the example).

The marker resolution generator can in turn use these marker attributes to decide if it can create a corresponding marker resolution (actually a lot of this filtering can already be done in the plugin.xml extension).

Additionally the attributes could also represent hints for the actual marker resolution. For example: The Check constraint could set an "resolutionExpr" property to "setName('FOOBAR')" and a corresponding generic marker resolution generator (expecting this property to be set) could then evaluate the given expression when the quick fix is run.

The attached patch implements this described design. Note that it's a proof of concept only!

Alternatives
============
The interface for passing data from the Check constraint to be associated with the corresponding Diagnostic could instead be implemented using a JAVA extension. In this case the Check constraint would just return a String (business as usual). E.g.

context Foo#name WARNING fooName() :
    name != "FOO"
;

String fooName(Foo this) :
    error("Don't use the name FOO", "illegalId")
;

String error(Object this, String message, String type) :
    internalError(message, type) -> message
;

private internalError(Object this, String message, String type) :
    JAVA org.eclipse.xtext...
;

The mechanism for making this data available to the marker resolution generators could also be a dedicated Java API (instead of attaching the data to the marker directly). But this way there is not the possibility of filtering using the <attribute/> element in plugin.xml. Created attachment 126171
quick assist assistant patch

If you agree I'd like to propose the attached patch to enable the Xtext editor to run any available marker resolutions enabled for the displayed markers. Actually there is already an action XtextMarkerRulerAction to support this, but the required QuickAssistAssistant implementation was still missing.

I think the API outlined in the description (for integration of Check etc.) requires some more thinking. This is something I'm working on. But what's in the patch is a necessary first step. Created attachment 140346
simple quickfix generator fragment

The attachment demonstrates a simplistic quickfix generator fragment complete with the supporting changes to the Xtext editor.

As demonstrated in the Domainmodel example a fix can then be declared by a method like this:

    @Fix(code = DomainmodelJavaValidator.CAPITAL_TYPE_NAME, label = "Capitalize name", description = "Capitalize name of type")
    public void fixNameCase(Type type, IMarker marker) {
        type.setName(type.getName().toUpperCase());
    }

The "code" attribute matches up against a corresponding Java check:

    @Check
    public void checkTypeNameStartsWithCapital(Type type) {
        if (!Character.isUpperCase(type.getName().charAt(0))) {
            warning("Name should start with a capital", DomainmodelPackage.TYPE__NAME, CAPITAL_TYPE_NAME);
        }
    } +1 for this RFE. This would be great! Are there any plans about the target milestone? (In reply to comment #3)
> +1 for this RFE. This would be great! Are there any plans about the target
> milestone?
> 

No not yet, we'll update the property accordingly as soon as we have concrete plans. I also like it very much.
Shouldn't we allow a list of codes per diagnostic? There might be multiple alternative ways to fix an issue. 
In AbstractDeclarativeMarkerResolutionGenerator you pass the context EObject out of the read transaction in order to pass it into a modify transaction later. This could cause problems in cases where another write operation has changed or removed that object. I think the context object should be obtained within the modify transaction. 
Some tests would be very nice. :-) (In reply to comment #5)
> I also like it very much.
> Shouldn't we allow a list of codes per diagnostic? There might be multiple
> alternative ways to fix an issue. 

The code simply identifies the problem, so one code per diagnostic should be enough. But we would then like to associate multiple fixes with the problem. Using a declarative approach we would want all @Fix annotated methods referring to that code to match. Thus something similar to the AbstractDeclarativeValidator.

In my patch I use the PolymorphicDispatcher, but I've come to realize that this doesn't make sense here, as we want to match multiple methods, just like the AbstractDeclarativeValidator.

The declarative base class supports @Fix annotated methods where the fix details (label, description, and icon) are in the annotation parameters and the method body simply represents the fix implementation. E.g.

@Fix(code = 42, label = "Capitalize name", description = "Capitalize name of type")
public void fixSomething(Foo foo, IMarker marker) {
   return ...;
}

In addition we may also want to allow a method to return the IMarkerResolution object describing the fix. This would allow for more conditional logic. E.g.

@Fix(code=42)
public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
   return ...;
}

Any thoughts on this?

> In AbstractDeclarativeMarkerResolutionGenerator you pass the context EObject
> out of the read transaction in order to pass it into a modify transaction
> later. This could cause problems in cases where another write operation has
> changed or removed that object. I think the context object should be obtained
> within the modify transaction. 
> Some tests would be very nice. :-)
> 

I agree. (In reply to comment #6)
> (In reply to comment #5)
> > I also like it very much.
> > Shouldn't we allow a list of codes per diagnostic? There might be multiple
> > alternative ways to fix an issue. 
> 
> The code simply identifies the problem, so one code per diagnostic should be
> enough. But we would then like to associate multiple fixes with the problem.
> Using a declarative approach we would want all @Fix annotated methods referring
> to that code to match. Thus something similar to the
> AbstractDeclarativeValidator.
> 
> In my patch I use the PolymorphicDispatcher, but I've come to realize that this
> doesn't make sense here, as we want to match multiple methods, just like the
> AbstractDeclarativeValidator.
> 
> The declarative base class supports @Fix annotated methods where the fix
> details (label, description, and icon) are in the annotation parameters and the
> method body simply represents the fix implementation. E.g.
> 
> @Fix(code = 42, label = "Capitalize name", description = "Capitalize name of
> type")
> public void fixSomething(Foo foo, IMarker marker) {
>    return ...;
> }
> 
> In addition we may also want to allow a method to return the IMarkerResolution
> object describing the fix. This would allow for more conditional logic. E.g.
> 
> @Fix(code=42)
> public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
>    return ...;
> }
> 
> Any thoughts on this?
> 

Sounds reasonable. I thought that the id identifies a fix not a problem.
Of course what you described makes much more sense. :-)
 Fixed in CVS HEAD. > > In addition we may also want to allow a method to return the IMarkerResolution
> > object describing the fix. This would allow for more conditional logic. E.g.
> > 
> > @Fix(code=42)
> > public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
> >    return ...;
> > }
> > 
> > Any thoughts on this?
> > 
> 
> Sounds reasonable.

As I couldn't yet find a concrete use case for this I decided to wait with this enhancement. We can always file a new bug later on.

Also note that the documentation hasn't been written yet. I reopen this so we don't forget the documentation. Thanks Knut :-) Closing all bugs that were set to RESOLVED before Neon.0 Closing all bugs that were set to RESOLVED before Neon.0

Clustering the Embedding Space

We can use KMeans to divide the document embeddings into clusters (note that scikit-learn’s KMeans groups the vectors by Euclidean distance in the embedding space, not by the cosine similarity computed above).

from sklearn import cluster
from sklearn import metrics
kmeans = cluster.KMeans(n_clusters=10)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
print(kmeans.cluster_centers_.shape)
print(kmeans.labels_.shape)
print(kmeans.labels_)

clusters = kmeans.labels_.tolist()
df['cluster'] = clusters
# print(train_corpus)
# l = kmeans.fit_predict(model.docvecs.vectors_docs)
(10, 100)
(5415,)
[9 1 1 ... 6 7 2]
labels = kmeans.labels_
print(kmeans.labels_)

bugs = { 'id': df.loc[:,'id'], 'cluster': clusters }
frame = pd.DataFrame(bugs)
temp = frame.loc[:,['id','cluster']].groupby('cluster').count().plot.bar()
plt.show()
[9 1 1 ... 6 7 2]
png
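To get a feel for what each cluster actually contains, one quick check is to look at the bug closest to each cluster centroid. A minimal sketch, assuming the kmeans, X, and df objects from the cells above (and that the rows of df line up with the order of X, as they do here):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

# For each centroid, find the index of the nearest document vector
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, np.array(X))

for cluster_id, doc_idx in enumerate(closest):
    # Show the bug id and the first part of its text as a rough cluster label
    print(cluster_id, df.iloc[doc_idx]['id'], df.iloc[doc_idx]['text'][:80])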

Visualizing the Embedding Space

To visualize the embedding space, we will use t-SNE to reduce the vectors from 100 dimensions down to 2.

# Creating and fitting the tsne model to the document embeddings
from MulticoreTSNE import MulticoreTSNE as TSNE
tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4,
                  n_components=2,
                  verbose=1,
                  random_state=2018,
                  n_iter=300)
tsne_d2v = tsne_model.fit_transform(np.array(X))

df['x'] = tsne_d2v[:,0]
df['y'] = tsne_d2v[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot(x='x', y='y', hue="cluster", palette=sns.color_palette("hls", 10), legend="full", alpha=0.3, data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1f22e84a8>
png

Now how about doing it in 3D?

tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4,
                  n_components=3,
                  verbose=1,
                  random_state=2018,
                  n_iter=300)
tsne_d2v = tsne_model.fit_transform(np.array(X))

from mpl_toolkits.mplot3d import Axes3D

df['x'] = tsne_d2v[:,0]
df['y'] = tsne_d2v[:,1]
df['z'] = tsne_d2v[:,2]

ax = plt.figure(figsize=(16,10)).gca(projection='3d')
ax.scatter( xs=df.loc[:,"x"], ys=df.loc[:,"y"], zs=df.loc[:,"z"], c=df.loc[:,"cluster"])
plt.show()
png

Another method I’ve seen used is to apply PCA for an initial reduction in dimension and then t-SNE to get the dimensions down to 2.

from sklearn.decomposition import PCA

pca_20 = PCA(n_components=20)
pca_result_20 = pca_20.fit_transform(X)
tsne_d2v = tsne_model.fit_transform(np.array(pca_result_20))
df['x'] = tsne_d2v[:,0]
df['y'] = tsne_d2v[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot( x='x', y='y', hue="cluster", palette=sns.color_palette("hls", 10), legend="full", alpha=0.3, data=df )
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1f1b525f8>
png

Transfer Learning

Now that we have some idea of what we’re working with, let’s see if transfer learning is an option with Doc2Vec. To do this, we’ll start with a pre-trained Doc2Vec model (the PV-DBOW model loaded below) that was trained on the Wikipedia corpus. Then we will continue training the model on our bug corpus and see what the results look like.

from gensim.models.doc2vec import Doc2Vec
loadedModel = Doc2Vec.load('PV-DBOW.doc2vec')
print(loadedModel.corpus_count)
4841417
%time loadedModel.train(train_corpus, total_examples=model.corpus_count, epochs=20)

CPU times: user 6min 47s, sys: 1.91 s, total: 6min 49s
Wall time: 1min 3s
similarities = []
for doc_id in range(len(train_corpus)):
    # Compare the vector stored during training with a freshly inferred vector for the same bug
    learned_vector = loadedModel.docvecs[train_corpus[doc_id].tags[0]]
    inferred_vector = loadedModel.infer_vector(train_corpus[doc_id].words)
    # Cosine similarity (1 - cosine distance)
    distance = 1 - spatial.distance.cosine(learned_vector, inferred_vector)
    similarities.append(distance)

sns.distplot(similarities, kde=False, rug=True)
/opt/tools/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval





<matplotlib.axes._subplots.AxesSubplot at 0x7fc18fb6beb8>
png
next_similar = []
for doc_id in range(len(train_corpus)):
    sims = loadedModel.docvecs.most_similar(train_corpus[doc_id].tags[0])
    next_similar.append(sims[0][1])
    
sns.distplot(next_similar, kde=False, rug=True)
/opt/tools/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval





<matplotlib.axes._subplots.AxesSubplot at 0x7fc17b6532e8>
png
import statistics

print(statistics.mean(next_similar))
print(statistics.stdev(next_similar))
0.6487272019641661
0.0749555443609924

Integrating AI and AR

July 29, 2020

Thoughts on AR/AI Integration

Here’s a great article by Adobe that covers some of the ways that they are using an integrated AI/AR approach.

Another good starting point is this article on “Combining artificial intelligence and augmented reality in mobile apps”. It’s somewhat of a sales pitch for Fritz.ai, but it does contain a lot of useful links – some of which I have incorporated.

Other applications of AI that are useful in AR:

  • Object Detection – finding the boundaries of objects
  • Image Classification – this can be used to identify known objects in a scene and make a correlation to an object in the digital world
  • Pose Estimation – determining position of hands to control movement
  • Text Recognition – determine text (not always horizontally aligned) and convert to actionable content
  • Audio Recognition – voice commands to control movement

Moving AI to mobile devices

Before AI models were available to be run on mobile devices, most applications followed something along these lines:

  1. Grab data on the device
  2. Move it to a storage location
  3. Trigger some operation (possibly store the results as well)
  4. Respond to the device with the results

Image from http://blog.zenof.ai/object-detection-in-react-native-app-using-tensorflow-js/
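A rough sketch of that round trip in Python, using a hypothetical /classify endpoint (the URL and response format here are made up for illustration, not a real service):

import requests

ENDPOINT = "https://example.com/api/classify"  # hypothetical server-side inference endpoint

def classify_remote(image_path):
    # Steps 1-2: grab the data on the device and move it to the server
    with open(image_path, "rb") as f:
        response = requests.post(ENDPOINT, files={"image": f}, timeout=10)
    response.raise_for_status()
    # Steps 3-4: the server runs the model and responds with the results
    return response.json()

# print(classify_remote("photo.jpg"))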

To be “believable”, we need better performance to combine with reality.

Animation from https://heartbeat.fritz.ai/combining-artificial-intelligence-and-augmented-reality-in-mobile-apps-e0e0ad2cfddc

Question???

Are there any performance guidelines available for maintaining a realistic application? Something for response times or update rate?

Mobile AR Development Frameworks

There are frameworks available for AR in both the Apple (iOS) and Google (Android) ecosystems. Apple’s offering is AR Kit, while Google provides AR Core. These frameworks are mostly for basic AR development, but they do provide some broad applications of AI. Walk through each link to highlight the AI integration.

Mobile AI Frameworks

Similar to the AR frameworks, there are also AI development frameworks available for both Apple and Google. In this case, Apple provides Core ML while Google provides TensorFlow Lite.

Both frameworks appear to be ‘inference only’, running a minimized network optimized for the particular hardware platform. This would be useful if you needed a more specialized implementation of an AI technique and wanted to integrate it yourself.
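As a concrete example of the ‘inference only’ flow, here is a minimal Python sketch using the TensorFlow Lite interpreter. The model file name is a placeholder; on a phone you would use the platform’s TFLite or Core ML bindings instead, but the steps are the same: load a converted model, feed it a tensor, and read back the output.

import numpy as np
import tensorflow as tf

# Load a pre-converted .tflite model (placeholder file name)
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape, e.g. (1, 224, 224, 3)
dummy_image = np.random.random_sample(input_details[0]['shape']).astype(np.float32)

interpreter.set_tensor(input_details[0]['index'], dummy_image)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]['index'])
print("Top class index:", int(np.argmax(scores)))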

And then there’s React + TensorFlow.js

Thoughts going forward

One thought is to take our application of image recognition for trail camera use and apply it to a mobile platform. We may need some help getting a basic iOS and Android application created though. Maybe a cross-group project somewhere?

Fast.AI Lesson 3

March 25, 2020

For this session, we covered the results of Fast.AI Lesson 3 and learned about multi-label classification with a CNN. The full notebook is included in this post.

Open In Colab

Multi-label prediction with Planet Amazon dataset

%reload_ext autoreload
%autoreload 2
%matplotlib inline
!curl -s https://course.fast.ai/setup/colab | bash
Updating fastai...
Done.
from fastai.vision import *

Getting the data

The planet dataset isn’t available on the fastai dataset page due to copyright restrictions. You can download it from Kaggle however. Let’s see how to do this by using the Kaggle API as it’s going to be pretty useful to you if you want to join a competition or use other Kaggle datasets later on.

First, install the Kaggle API by uncommenting the following line and executing it, or by executing it in your terminal. Depending on your platform you may need to modify this slightly, either adding source activate fastai or similar, or prefixing pip with a path; have a look at how conda install is called for your platform in the appropriate ‘Returning to work’ section of https://course.fast.ai/. (Depending on your environment, you may also need to append "--user" to the command.)

! {sys.executable} -m pip install kaggle --upgrade
Requirement already up-to-date: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.6)
Requirement already satisfied, skipping upgrade: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.12.0)
Requirement already satisfied, skipping upgrade: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2019.11.28)
Requirement already satisfied, skipping upgrade: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.21.0)
Requirement already satisfied, skipping upgrade: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.8.1)
Requirement already satisfied, skipping upgrade: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.38.0)
Requirement already satisfied, skipping upgrade: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.0.0)
Requirement already satisfied, skipping upgrade: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.8)
Requirement already satisfied, skipping upgrade: text-unidecode>=1.3 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.3)

Then you need to upload your credentials from Kaggle on your instance. Login to kaggle and click on your profile picture on the top left corner, then ‘My account’. Scroll down until you find a button named ‘Create New API Token’ and click on it. This will trigger the download of a file named ‘kaggle.json’.

Upload this file to the directory this notebook is running in, by clicking “Upload” on your main Jupyter page, then uncomment and execute the next two commands (or run them in a terminal). For Windows, uncomment the last two commands.

! mkdir -p ~/.kaggle/
! mv kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# For Windows, uncomment these two commands
# ! mkdir %userprofile%\.kaggle
# ! move kaggle.json %userprofile%\.kaggle

You’re all set to download the data from planet competition. You first need to go to its main page and accept its rules, and run the two cells below (uncomment the shell commands to download and unzip the data). If you get a 403 forbidden error it means you haven’t accepted the competition rules yet (you have to go to the competition page, click on Rules tab, and then scroll to the bottom to find the accept button).

path = 'planet/planet'
!kaggle datasets download nikitarom/planets-dataset
!unzip planets-dataset.zip
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}  
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv -p {path}  
# ! unzip -q -n {path}/train_v2.csv.zip -d {path}

To extract the content of this file, we’ll need 7zip, so uncomment the following line if you need to install it (or run sudo apt install p7zip-full in your terminal).

# ! conda install --yes --prefix {sys.prefix} -c haasad eidl7zip

And now we can unpack the data (uncomment to run – this might take a few minutes to complete).

# ! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}

Multiclassification

Contrary to the pets dataset studied in the last lesson, here each picture can have multiple labels. If we take a look at the csv file containing the labels (‘train_classes.csv’ here) we see that each ‘image_name’ is associated with several tags separated by spaces.

df = pd.read_csv('planet/planet/train_classes.csv')
df.head()
   image_name  tags
0  train_0     haze primary
1  train_1     agriculture clear primary water
2  train_2     clear primary
3  train_3     clear primary
4  train_4     agriculture clear habitation primary road

To put this in a DataBunch while using the data block API, we then need to use ImageList (and not ImageDataBunch). This will make sure the model created has the proper loss function to deal with the multiple classes.

tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)

We use parentheses around the data block pipeline below, so that we can use a multiline statement without needing to add ‘\’.

np.random.seed(42)
src = (ImageList.from_csv('planet/planet', 'train_classes.csv', folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))
data = (src.transform(tfms, size=128)
        .databunch().normalize(imagenet_stats))

show_batch still works, and shows us the different labels separated by ;.

data.show_batch(rows=3, figsize=(12,9))
png

To create a Learner we use the same function as in lesson 1. Our base architecture is resnet34 this time, but the metrics are a little bit different: we use accuracy_thresh instead of accuracy. In lesson 1, we determined the prediction for a given class by picking the final activation that was the biggest, but here each activation can be 0. or 1. accuracy_thresh selects the ones that are above a certain threshold (0.5 by default) and compares them to the ground truth.

As for Fbeta, it’s the metric that was used by Kaggle on this competition. See here for more details.
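Here is a rough sketch of the idea behind accuracy_thresh (a simplified stand-in, not the fastai implementation): apply a sigmoid to each activation, threshold it, and compare element-wise against the multi-hot target.

import torch

def accuracy_thresh_sketch(activations, targets, thresh=0.2):
    # activations: raw model outputs, shape (batch, n_classes)
    # targets: multi-hot ground truth of the same shape
    preds = torch.sigmoid(activations) > thresh
    return (preds == targets.bool()).float().mean()

# Example: 2 images, 3 possible tags each
acts = torch.tensor([[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]])
truth = torch.tensor([[1., 0., 1.], [0., 1., 0.]])
print(accuracy_thresh_sketch(acts, truth))  # fraction of correct per-tag decisions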

arch = models.resnet34
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data, arch, metrics=[acc_02, f_score])
Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.cache/torch/checkpoints/resnet34-333f7ec4.pth




We use the LR Finder to pick a good learning rate.

learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.

learn.recorder.plot()
png

Then we can fit the head of our network.

lr = 0.01

learn.fit_one_cycle(5, slice(lr))

epoch  train_loss  valid_loss  accuracy_thresh  fbeta     time
0      0.146025    0.122986    0.944156         0.894195  02:20
1      0.116038    0.103306    0.947854         0.909214  02:20
2      0.106912    0.096093    0.951132         0.916606  02:20
3      0.099083    0.092262    0.954177         0.918962  02:18
4      0.097352    0.091227    0.953842         0.919874  02:18
learn.save('stage-1-rn34')

…And fine-tune the whole model:

learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
png

learn.fit_one_cycle(5, slice(1e-5, lr/5))

epoch  train_loss  valid_loss  accuracy_thresh  fbeta     time
0      0.099162    0.092750    0.950732         0.918759  02:22
1      0.099836    0.092544    0.951444         0.920035  02:22
2      0.094310    0.087545    0.955041         0.923917  02:23
3      0.088801    0.084848    0.957846         0.926931  02:22
4      0.082217    0.084555    0.956415         0.926193  02:26
learn.save('stage-2-rn34')
data = (src.transform(tfms, size=256)
        .databunch().normalize(imagenet_stats))

learn.data = data
data.train_ds[0][0].shape
torch.Size([3, 256, 256])
learn.freeze()
learn.lr_find()
learn.recorder.plot()
LR Finder complete, type {learner_name}.recorder.plot() to see the graph.
png
lr=1e-2/2
learn.fit_one_cycle(5, slice(lr))

Total time: 09:01 

epoch  train_loss  valid_loss  accuracy_thresh  fbeta
1      0.087761    0.085013    0.958006         0.926066
2      0.087641    0.083732    0.958260         0.927459
3      0.084250    0.082856    0.958485         0.928200
4      0.082347    0.081470    0.960091         0.929166
5      0.078463    0.080984    0.959249         0.930089
learn.save('stage-1-256-rn50')
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-5, lr/5))

Total time: 11:25 

epoch  train_loss  valid_loss  accuracy_thresh  fbeta
1      0.082938    0.083548    0.957846         0.927756
2      0.086312    0.084802    0.958718         0.925416
3      0.084824    0.082339    0.959975         0.930054
4      0.078784    0.081425    0.959983         0.929634
5      0.074530    0.080791    0.960426         0.931257
learn.recorder.plot_losses()
png
learn.save('stage-2-256-rn50')

You won’t really know how you’re going until you submit to Kaggle, since the leaderboard isn’t using the same subset as we have for training. But as a guide, 50th place (out of 938 teams) on the private leaderboard was a score of 0.930.

learn.export()

fin

(This section will be covered in part 2 – please don’t ask about it just yet! 🙂 )

#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg.tar.7z | tar xf - -C {path}
#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg-additional.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg-additional.tar.7z | tar xf - -C {path}
test = ImageList.from_folder(path/'test-jpg').add(ImageList.from_folder(path/'test-jpg-additional'))
len(test)
61191
learn = load_learner(path, test=test)
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
thresh = 0.2
labelled_preds = [' '.join([learn.data.classes[i] for i,p in enumerate(pred) if p > thresh]) for pred in preds]
labelled_preds[:5]
['agriculture cultivation partly_cloudy primary road',
 'clear haze primary water',
 'agriculture clear cultivation primary',
 'clear primary',
 'partly_cloudy primary']
fnames = [f.name[:-4] for f in learn.data.test_ds.items]
df = pd.DataFrame({'image_name':fnames, 'tags':labelled_preds}, columns=['image_name', 'tags'])
df.to_csv(path/'submission.csv', index=False)
! kaggle competitions submit planet-understanding-the-amazon-from-space -f {path/'submission.csv'} -m "My submission"
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/ubuntu/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 2.18M/2.18M [00:02<00:00, 1.05MB/s]
Successfully submitted to Planet: Understanding the Amazon from Space

Private Leaderboard score: 0.9296 (around 80th)

Group Project (Parts 1-3)

March 4, 2020

For 3 weeks, we got together and discussed a group project. The intent is to take the output of our Fast.ai lesson 1 model and make it available through a variety of web platforms. Hopefully we will learn how to deploy an image classification model.

We didn’t get much further than discussions of how to implement the project and containerization with Docker. There is an initial repo on GitHub for this, but no further action has been taken at this time.

Fast.AI Lesson 2

February 26, 2020

Agenda:

  • Welcome / Intro
  • Fast.AI Lesson 2
  • Next Sessions

Starting point for Lesson 2

Link to Python Notebook in Colab

Get the instructions for setting up Colab for Fastai HERE

Don’t forget to run this script to set up fastai:

!curl -s https://course.fast.ai/setup/colab | bash

Here’s a post about how to work around the error that you get: HERE