Welcome
Zoom issue to start. Process coming up for recording parts of our meetups to make available for later viewing? FreeConferenceCall is free and has whiteboard and recording.
Project updates
Jonathan completely migrated his Taiga instance to projectwatch.dev
Schools outreach – started building a flyer site to gain support
AI4All curriculum – may be able to pull some of this over
OpenAI Gym – available to work out your agent behavior for reinforcement learning
Bug Analysis – XLNet or BERT or GPT-2 – SimpleTransformers
Current news
Upcoming Schedule for 2020
Ben on MLflow
? presenting Weights & Biases
Paper of the month
Harsh on Recommendation Systems
March – HATCH 2021
This is a presentation given as part of the National Defense Education Program (NDEP) coordinated by Alabama A&M. The intent of the program is to improve the “employment pipeline” of underrepresented people in the defense sector.
National Defense Education Program
Welcome to Huntsville AI!
Brief Intro
What we do at Huntsville AI:
Application – how to solve a problem with a technology
Theory – for those times when knowing “How” something works is necessary
Social / Ethics – Human / AI interaction
Brainstorming – new uses for existing solutions
Hands on Code – for those times when you just have to run something for yourself
Coworking Night – maybe have combined sessions with other groups to discuss application in their focus areas
Community – get to know others in the field. Provide talks like this one and support local tech events (HATCH, SpaceApps)
About me
J. Langley
Chief Technical Officer at CohesionForce, Inc.
Founder of Session Board & Huntsville AI
Involved in Open Source (Eclipse & Apache Foundations)
Started playing with AI about 15 years ago when Intelligent Agents were all the rage.
Developed a Naive Bayes approach for text classification and a neural network for audio classification; heavily into NLP.
What is this AI thing?
Artificial Intelligence – computer program learning how to solve problems based on data
Breaking it Down
There are several ways to break the subject of AI into digestible chunks
Here’s a useful way to think about the relationship between AI, ML, and Deep Learning.
The theorem states that the first layer can approximate any well-behaved function. Such a well-behaved function can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
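Stated a bit more concretely (my paraphrase, not from the original slide): for any continuous function $f$ on a compact domain and any tolerance $\varepsilon > 0$, a single hidden layer with enough units $N$ and a suitable non-polynomial activation $\sigma$ can approximate $f$:

$$\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i) \right| < \varepsilon \quad \text{for all } x \text{ in the domain.}$$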
Neural Networks!
Deep Neural Network
A deep neural network has a lot more layers stacked in between the inputs and the outputs.
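As a rough illustration (a minimal numpy sketch, not part of the original slides, with arbitrary layer sizes), "more layers" just means repeating the same weights-plus-nonlinearity step several times before the output layer:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
# Input of size 4, three hidden layers of 16 units, output of size 3 (sizes chosen for illustration).
sizes = [4, 16, 16, 16, 3]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Each hidden layer applies a weight matrix, a bias, and a nonlinearity.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    # Final layer is linear here; a real network would add a task-specific output.
    return x @ weights[-1] + biases[-1]

print(forward(rng.normal(size=4)))
```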
Diversity can help solve some of the bias problems in current AI:
Research earlier in 2019 from Joy Buolamwini and Timnit Gebru evaluated computer vision products from IBM, Microsoft, and elsewhere, and they found that these products performed worse on women than on men and worse on people with dark skin compared to people with light skin. For instance, IBM’s computer vision software was 99.7% accurate on light-skinned men and only 65% accurate on dark-skinned women.
Amazon’s Face Recognition Falsely Matched 28 Members of Congress With Mugshots:
The Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs are highly competitive programs that encourage domestic small businesses to engage in Federal Research/Research and Development (R/R&D) with the potential for commercialization.
Quick Reference
The full list of 2020.3 SBIR topics is available HERE
NGA203-003
TITLE
Novel Mathematical Foundation for Automated Annotation of Massive Image Data Sets
OBJECTIVE
This announcement seeks proposals that offer dramatic improvements in automated object detection and annotation of massive image data sets. Imaging data is being created at an extraordinary rate from many sources, both from government assets as well as private ones. Automated methods for accurate and efficient object identification and annotation are needed to fully exploit this resource. This topic is focused on new artificial intelligence (AI) methods to effectively and efficiently solve this problem.
DESCRIPTION
Current choke points blocking optimal exploitation of the full stream of available image data include confronting widely different views (perspective, resolution, etc.) of the same or similar objects and the overwhelming amounts of human effort required for effective processing. Current manual processes require human eyes on every image to perform detection, identification, and annotation. Current state-of-the-art AI requires intensive human support to generate giant training sets. Further, resulting methods frequently generate rule sets that are overly fragile, in that training on one object is not transferable to the detection of another object even though the object might strike a human as essentially the same, and thus the need for increased human review of algorithm decisions.
NGA seeks new types of AI tools optimized for the task of object identification and annotation across diverse families of image data that are reliable, robust, not dependent on extensive training demands, applicable to objects of interest to both government and commercial concerns, and parsimonious with user resources in general. In particular, we seek solutions that make AI outputs both more explainable and more “lightweight” to human users.
The focus of a successful phase 1 effort should be on explaining the mathematical foundation that will enable the significantly improved AI tools described herein. Of specific interest are novel AI constructs that are more principled and universal and less ad hoc than current technology and can be used to construct a tool that performs relevant tasks. For the purposes of this announcement “relevant tasks” are limited to object identification across view types, drawing an object bounding box, and correctly labelling the object in a text annotation. A successful Phase 1 proposal should explain how the mathematical foundation needed to build the required tools will be developed in Phase 1 and implemented in a software toolkit in Phase 2. Examples should be developed during Phase 1 and should illustrate either improved reliability or robustness over the current state of the art, as well as reducing training demands and user resources. Proposals describing AI approaches that are demonstrably at or near the current state of the art in commercial AI performance, such as on ImageNet data sets, are specifically not of interest under this topic. The foundational element of a successful proposal under this topic is exploitation of novel mathematics that will enable new and better AI approaches.
Direct to Phase 2 proposals are being accepted under this topic. A Direct to Phase 2 proposal should describe pre-existing mathematical foundations and illustrative examples as described in the paragraph above. Phase 2 proposals should also propose a set of milestones and demonstrations that will establish the novel AI tools as a viable commercial offering.
OSD203-004
TITLE
Domain-Specific Text Analysis
OBJECTIVE:
Develop text analysis software that leverages current Natural Language Processing (NLP) algorithms and techniques (e.g., Bayesian algorithms, word embeddings, recurrent neural networks) for accurately conducting content and sentiment analysis, as well as dictionary development.
DESCRIPTION:
The United States Department of Defense (DoD) collects large amounts of text data from its personnel using a variety of different formats, including opinion/climate surveys, memoranda, incident reports, standard forms, and transcripts of focus group/sensing sessions. Much of this data is used operationally; however, recent interest in leveraging text data to glean insight into personnel trends/behaviors/intentions has prompted a greater degree of research in NLP. Additionally, Topic Modeling and Sentiment Analysis have been explored by various research arms of the DoD; however, two foundational hurdles exist that need to be addressed before they can realistically be applied to the DoD:
First, the varied use of jargon, nomenclature, and acronyms across the DoD and Service Branches must be more comprehensively understood. Additionally, development of a “DoD Dictionary” should enable the fluid use of extant and newly-created jargon, phrases, and sayings used over time. Second, the emergent nature and rapid innovation of NLP techniques have made bridging the technical gap between DoD analysts and tools difficult. Additionally, understanding and interpreting NLP techniques is particularly difficult for non-technical leadership. There currently exists no standard format or package that can be used to analyze and develop visualizations for text data in such a way that accommodates the needs of operational leadership to make decisions regarding personnel policies or actions.
The objective of this topic is to develop a next-generation, multi-platform and multi-sensor capable Artificial Intelligence-Enabled (AIE), high performance computational imaging camera with an optimal Size, Weight and Power – Cost (SWaP-C) envelope. This computational imaging camera can be utilized in weapon sights, surveillance and reconnaissance systems, precision strike target acquisition, and other platforms. This development should provide bi-directional communication between tactical devices with onboard real-time scene/data analysis that produces critical information to the SOF Operator. As part of this feasibility study, the Offerors shall address all viable overall system design options with respective specifications on the key system attributes.
DESCRIPTION:
A system-of-systems approach (“smart” Visual Augmentation Systems) and the integration of a next-generation smart sensor enable information sharing between small arms, SOF VAS, and other target engagement systems. Sensors and targeting promote the ability to hit and kill the target while ensuring Rules of Engagement are met and civilian casualties/collateral damage are eliminated. The positive identification of the target and the precise firing solution will optimize the performance of the operator, the weapon, and the ammunition to increase precision at longer ranges in multiple environments.
This system could be used in a broad range of military applications where Special Operations Forces require: Faster Target Acquisition; Precise Targeting; Automatic Target Classification; Classification-based Multi-Target Tracking; Ability to Engage Moving Targets; Decision Support System; Targeting with Scalable Effects; Battlefield Awareness; Integrated Battlefield (Common Operating Picture with IOBT, ATAK, COT across Squad, Platoon).
HR001120S0019-14
TITLE:
AI-accelerated Biosensor Design
OBJECTIVE:
Apply artificial intelligence (AI) to accelerate the design of highly specific, engineered biomarkers for rapid virus detection.
DESCRIPTION:
This SBIR seeks to leverage AI technologies to accelerate the development of aptamer-based biosensors that specifically bind to biomolecular structures. Aptamers are short single-stranded nucleic acid sequences capable of binding three-dimensional biomolecular structures in a way similar to antibodies. Aptamers have several advantages as compared to antibodies, including long shelf-life, stability at room temperature, low/no immunogenicity, and low-cost.
The current state-of-the-art aptamer designs rely heavily on in vitro approaches such as SELEX (Systematic Evolution of Ligands by Exponential Enrichment) and its advanced variations. SELEX is a cyclic process that involves multiple rounds of selection and amplification over a very large number of candidates (>10^15). The iterative and experimental nature of SELEX makes it time consuming (weeks to months) to obtain aptamer candidates, and the overall probability of ultimately obtaining a useful aptamer is low (30%-50%). Attempts to improve the performance of the original SELEX process generally result in increased system complexity and system cost as well as increased demand on special domain expertise for their use. Furthermore, a large number of parameters can influence the SELEX process.
Therefore, this is a domain that is ripe for AI. Recent AI research has demonstrated the potential for machine learning technologies to encode domain knowledge to significantly constrain the solution space of optimization search problems such as biomolecular inverse problems. Such in silico techniques consequently offer the potential to provide a cost-effective alternative that makes the aptamer design process more dependable and, thereby, more efficient. This SBIR seeks to leverage emerging AI technologies to develop a desktop-based, AI-assisted aptamer design capability that accelerates the identification of high-performance aptamers for detecting new biological antigens.
NGA203-004
TITLE:
High Dimensional Nearest Neighbor Search
OBJECTIVE
This topic seeks research in geolocation of imagery and video media taken at near-ground level [1]. The research will explore hashing/indexing techniques (such as [2]) that match information derived from media to global reference data. The reference data is composed of digital surface models (DSMs) of known geographical regions and features that can be derived from that surface data, together with limited “foundation data” of the same regions consisting of map data such as might be present in Open Street Maps and landcover data (specifying regions that are fields, vegetation, urban, suburban, etc.). Query data consists of images or short video clips that represent scenes covered by the digital surface model in the reference data, but may or may not have geo-tagged still images in the reference data from the same location.
Selected performers will be provided with sample reference data, consisting of DSM data and a collection of foundation data, and will be provided with sample query data. This sample data is described below. However, proposers might suggest other reference and query data that they will use to either supplement or replace government-furnished sample data. This topic seeks novel ideas for the representation of features in the query data and the reference data that can be used to perform retrieval of geo-located reference data under the assumption that the query data lacks geolocation information. The topic particularly seeks algorithmically efficient approaches such as hashing techniques for retrieval based on the novel features extracted from query imagery and reference data that can be used to perform matching using nearest neighbor approaches in feature space.
DESCRIPTION
The reference data includes files consisting of a vectorized two-dimensional representation of a Digital Surface Model (DSM) [4], relative depth information, and selected foundation feature data. The foundation features will include feature categories such as the locations of roads, rivers, and man-made objects.
The desired output of a query is a location within meters of the ground truth location of the camera that acquired the imagery. In practice, only some of the queries will permit accurate geolocation based on the reference data, and in some cases, the output will be a candidate list of locations, such that the true location is within the top few candidates. It is reasonable to assume that there exists a reference database calculated from a global DSM with a minimum spatial resolution of 30 meters that may, in some locations, provide sub-meter spatial resolution. The foundation feature data is at least as rich as that present in Open Street Maps, and can include extensive landcover data with multiple feature types. For the purpose of this topic, the reference data will not include images. Sample reference and query data representative of these assumptions, but of limited geographical areas, will be provided to successful proposers.
The topic seeks approaches that are more accurate than a class of algorithms that attempt to provide geolocation to a general region, such as a particular biome or continent. These algorithms are often based on a pure neural network approach, such as described in [3], and are unlikely to produce sufficiently precise camera location information that is accurate to within meters.
The objective system, in full production, should be sufficiently efficient as to scale to millions of square kilometers of reference data, and should be able to process queries at a rate of thousands of square kilometers per minute. While a phase 2 system might provide a prototype at a fraction of these capabilities, a detailed complexity analysis is expected to support the scalability of the system.
The proposed approach may apply to only a subset of query imagery types. For example, the proposed approach may be accurate only for urban data, or only for rural scenes. The proposer should carefully explain the likely limitations of the proposed approach and suggest methods whereby query imagery could be filtered so that only appropriate imagery is processed by the proposed system.
Proposers who can demonstrate prior completion of all of the described Phase I activities may propose a “straight to Phase II” effort. In this case the novelty of the proposed feature and retrieval approach will be a consideration in determining an award.
N203-151
TITLE:
Machine Learning Detection of Source Code Vulnerability
OBJECTIVE
Develop and demonstrate a software capability that utilizes machine-learning techniques to scan source code for its dependencies, trains cataloging algorithms on code dependencies and detection of known vulnerabilities, and scales to support polyglot architectures.
The above methods by themselves generate statistically significant numbers of false positives and false negatives: False positives come from the erroneous detection of vulnerabilities and require a human in the loop to discern signal from noise. False negatives come from the prevalence of undetected altered dependent software (e.g., copy/paste/change from external libraries).
Promising developments from commercial vendors provide text mining services for project source trees and compare them against vulnerability databases, such as Synopsys/Black Duck Hub, IBM AppScan, and Facebook’s Infer. However, these tools are costly to use and require one’s code to be packaged and uploaded to a third-party service.
Work produced in Phase II may become classified. Note: The prospective contractor(s) must be U.S. owned and operated with no foreign influence as defined by DoD 5220.22-M, National Industrial Security Program Operating Manual, unless acceptable mitigating procedures can and have been implemented and approved by the Defense Counterintelligence Security Agency (DCSA). The selected contractor and/or subcontractor must be able to acquire and maintain a secret level facility and Personnel Security Clearances, in order to perform on advanced phases of this project as set forth by DCSA and NAVWAR in order to gain access to classified information pertaining to the national defense of the United States and its allies; this will be an inherent requirement. The selected company will be required to safeguard classified material IAW DoD 5220.22-M during the advanced phases of this contract.
NGA20C-001
TITLE:
Algorithm Performance Evaluation with Low Sample Size
OBJECTIVE
Develop novel techniques and metrics for evaluating machine learning-based computer vision algorithms with few examples of labeled overhead imagery.
DESCRIPTION
The National Geospatial-Intelligence Agency (NGA) produces timely, accurate and actionable geospatial intelligence (GEOINT) to support U.S. national security. To exploit the growing volume and diversity of data, NGA is seeking a solution to evaluate the performance of a class of algorithms for which there are limited quantities of training data and evaluation data samples. This is important because the statistical significance of the evaluation results is directly tied to the size of the evaluation dataset. While significant effort has been put forth to train algorithms with low sample sizes of labelled data [1-2], open questions remain for the best representative evaluation techniques under the same constraint.
Of specific interest to this solicitation are innovative approaches to rapid evaluation of computer vision algorithms at scale, using small quantities of labelled data samples, and promoting extrapolation to larger data populations. The central challenge to be addressed is the evaluation of performance with the proper range and dimension of data characteristics, when the labeled data represents a small portion of the potential operating conditions. An example is when performance must be evaluated as a function of different lighting conditions, but most of the labelled data was collected under full sun.
The study will be based on panchromatic electro-optical (EO) imagery using a subset (selected by the STTR participants) of the xView detection dataset, although extension to other sensing modalities is encouraged. Solutions with a mathematical basis are desired.
SCO 20.3-001
TITLE:
Machine Learned Cyber Threat Behavior Detection
OBJECTIVE
Develop unsupervised machine learning algorithms to evaluate Zeek logs of common inbound and outbound perimeter network traffic protocols and provide high-confidence anomaly detection of suspicious and/or malicious network traffic.
The algorithms must be able to run on 1U commodity hardware on small to large networks. Report outputs from the algorithms should be retrievable as JSON or CSV formatted files and contain sufficient information for ingestion and correlation against various databases or SIEM systems. At a minimum, the output reports should provide enough data to understand the suspicious threat anomalies identified, the corresponding Zeek metadata associated with the detection for correlation and enrichment with other databases, date/time, the confidence associated with the detection, and the technical reasoning behind the confidence levels and detections made. The government must be equipped with the ability to specify how reporting is generated based on confidence thresholds.
DESCRIPTION
Machine Learning of Cyber Behaviour
PHASE I
SCO is accepting Direct to Phase II proposals ONLY. Proposers must demonstrate the following achievements outside of the SBIR program:
Provide a detailed summary of current research and development and/or commercialization of artificial intelligence methodologies used to identify cyber threats. The summary shall include:
Specific models used in previous research and how they would be applicable for this SBIR. Explain the maturation of these models and previous successes and known limitations in meeting the SBIR goals.
Detailed description of the training data available to the company. Identify whether the proposed training corpus will be accessible in-house, accessed via an open source corpus, or purchased from a commercial training corpus site. Provide the cost to access the proposed training corpus throughout the SBIR period of performance.
Describe the previous work done with the training corpus, specifically the methodologies used and resulting findings.
Finally, include an attachment detailing the schema to be assessed by the proposed algorithm and indicate if the schema was already tested in prior research efforts (NOTE: this schema list does not count against the maximum number of pages. If this is considered proprietary information, the company shall indicate this with additional handling instructions).
This notebook is a demonstration of using Doc2Vec as an approach to compare the text of bug reports from an open source Bugzilla project. The Doc2Vec model from gensim is based on the word2vec paper but includes an additional input that identifies the document in the input.
Bug Comparison with Doc2Vec
Personal opinion: the API is difficult to understand, with little tutorial material available on how to implement this model in a practical solution.
#Global Imports
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import spacy
import gensim
import collections
import statistics
Loading Bugzilla Data
The data for this notebook was put together from the Eclipse XText project Bugzilla. You can find a link here
The bugs and associated comments were loaded into a pandas dataframe and stored in parquet format.
The tokenize method is used by the Doc2Vec model to extract a list of words from a document. This method can have a major impact on the performance of the model depending on how it is configured. Approaches that work well for TF-IDF by dropping out many of the words do not work for the Doc2Vec approach, where the order of the words matters.
Other techniques that work well with probabilistic approaches, like capturing the lemma of a word instead of the actual word, may actually reduce the accuracy of the Doc2Vec model.
text = """ java.lang.ClassCastException: HIDDEN
at org.eclipse.xtext.xbase.ui.debug.XbaseBreakpointDetailPaneFactory.getDetailPaneTypes(XbaseBreakpointDetailPaneFactory.java:42)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager$DetailPaneFactoryExtension.getDetailPaneTypes(DetailPaneManager.java:94)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPossiblePaneIDs(DetailPaneManager.java:385)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPreferredPaneFromSelection(DetailPaneManager.java:285)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneProxy.display(DetailPaneProxy.java:109)
at org.eclipse.jdt.internal.debug.ui.ExpressionInformationControlCreator$ExpressionInformationControl$2.updateComplete(ExpressionInformationControlCreator.java:344)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$4.run(TreeModelContentProvider.java:751)
at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.notifyUpdate(TreeModelContentProvider.java:737)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.updatesComplete(TreeModelContentProvider.java:653)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.performUpdates(TreeModelContentProvider.java:1747)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.access$10(TreeModelContentProvider.java:1723)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$6.run(TreeModelContentProvider.java:1703)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:136)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4147)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3764)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$9.run(PartRenderingEngine.java:1151)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1032)
at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:156)
at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:648)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:592)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:138)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:380)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:235)
at sun.reflect.NativeMethodAccessorImpl.invoke0(null:-2)
at sun.reflect.NativeMethodAccessorImpl.invoke(null:-1)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(null:-1)
at java.lang.reflect.Method.invoke(null:-1)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:648)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:603)
at org.eclipse.equinox.launcher.Main.run(Main.java:1465)"""
exception_regex = re.compile(r"(?m)^.*?Exception.*(?:[\r\n]+^\s*at .*)+", re.MULTILINE | re.IGNORECASE)
exception_regex.sub("", text)
''
nlp = spacy.load("en_core_web_sm")
exception_regex = re.compile(r".+Exception[^\n].*\s+at", re.MULTILINE | re.IGNORECASE)
greater_regex = re.compile(r"^> .*$", re.MULTILINE | re.IGNORECASE)
gerrit_created_regex = re.compile(r"New Gerrit change created: [^\ ]+", re.MULTILINE | re.IGNORECASE)
gerrit_merge_regex = re.compile(r"Gerrit change [^\s]+ was merged to [^\.]+\.", re.MULTILINE | re.IGNORECASE)
gerrit_commit_regex = re.compile(r"Commit: [^\ ]+", re.MULTILINE | re.IGNORECASE)
filter = ['VERB', 'NOUN', 'PROPN']
def tokenize_spacy(text):
    text = greater_regex.sub("", text)
    text = exception_regex.sub("", text)
    text = gerrit_created_regex.sub("", text)
    text = gerrit_merge_regex.sub("", text)
    text = gerrit_commit_regex.sub("", text)
    filtered_tokens = []
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            # if re.fullmatch('[a-zA-Z]+', token.text) and not token.is_stop:
            # if token.pos_ in filter and re.fullmatch('[a-zA-Z]+', token.text):
            if re.fullmatch('[a-zA-Z]+', token.text):
                # filtered_tokens.append(token.lemma_)
                filtered_tokens.append(token.text)
    return filtered_tokens
TaggedDocument
The Doc2Vec model uses an array of TaggedDocuments as input for training. A TaggedDocument consists of an array of words/tokens (from our tokenizer) and a list of tags. In our case, the only tag used is the ID of the bug.
def read_corpus():
    for i, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(tokenize_spacy(row['text']), [row['id']])
train_corpus = list(read_corpus())
Let’s take a look at a random TaggedDocument in the corpus. This is a good check to see what the tokenizer is providing based on the text of the bug.
doc_id = random.randint(0, len(train_corpus) - 1)
doc = train_corpus[doc_id]
tag = doc.tags[0]
print(tag,doc.words)
text = df.iloc[doc_id]['text']
print('\n',text)
363914 ['Check', 'that', 'you', 'can', 'not', 'append', 'a', 'null', 'segment', 'to', 'a', 'QualifiedName', 'Build', 'Identifier', 'Just', 'a', 'minor', 'enhancement', 'The', 'factory', 'checks', 'that', 'you', 'can', 'not', 'create', 'a', 'qualified', 'name', 'with', 'a', 'null', 'segment', 'However', 'the', 'function', 'does', 'not', 'Would', 'be', 'better', 'to', 'always', 'guarantee', 'the', 'non', 'null', 'invariant', 'and', 'also', 'check', 'the', 'parameter', 'of', 'the', 'append', 'operation', 'Reproducible', 'Always', 'fixed', 'pushed', 'to', 'master', 'We', 'have', 'to', 'make', 'sure', 'that', 'no', 'client', 'code', 'in', 'the', 'frameworks', 'passes', 'null', 'to', 'As', 'as', 'discussed', 'internally', 'I', 'removed', 'the', 'null', 'check', 'for', 'now', 'since', 'it', 'might', 'lead', 'to', 'new', 'exceptions', 'in', 'clients', 'The', 'plan', 'is', 'to', 'apply', 'the', 'apply', 'the', 'null', 'check', 'again', 'right', 'after', 'we', 'have', 'release', 'Xtext', 'This', 'will', 'allow', 'us', 'to', 'do', 'more', 'thorough', 'testing', 'The', 'commit', 'can', 'be', 're', 'applied', 'via', 'git', 'cherry', 'pick', 'cherry', 'picked', 'and', 'pushed', 'Requested', 'via', 'bug', 'Requested', 'via', 'bug']
Check that you cannot append a null segment to a QualifiedName Build Identifier: 20110916-0149
Just a minor enhancement. The factory org.eclipse.xtext.naming.QualifiedName.create(String...) checks that you cannot create a qualified name with a "null" segment. However, the org.eclipse.xtext.naming.QualifiedName.append(String) function does not. Would be better to always guarantee the non-null invariant and also check the parameter of the append operation.
Reproducible: Always fixed; pushed to 'master'. We have to make sure that no client code in the frameworks passes null to QualifiedName#append As as discussed internally, I've removed the null-check for now since it might lead to new exceptions in clients.
The plan is to apply the apply the null-check again right after we have release Xtext 2.2. This will allow us to do more thorough testing.
The commit can be re-applied via "git cherry-pick -x b74a06f705a9a0750289e2152d49941f4727e756" cherry-picked and pushed. Requested via bug 522520.
-M. Requested via bug 522520.
-M.
Doc2Vec Model
Believe it or not, there are 22 available parameters for use in the constructor with 21 being optional. The API also does not list the defaults for the optional parameters.
The required parameter is the list of tagged documents to use for training.
The best way to figure this out is to use ??Doc2Vec in a notebook cell.
Copying the text of the method headers gives us:
Doc2Vec(documents=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)
BaseWordEmbeddingsModel(sentences=None, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000, trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, cbow_mean=1, min_alpha=0.0001, compute_loss=False, fast_version=0, **kwargs)
dm ({1,0}, optional) – Defines the training algorithm.
If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
This is analogous to the CBOW vs. skip-gram choice for word vectors: PV-DM is similar to CBOW, while PV-DBOW is similar to skip-gram.
The Distributed Memory version takes the order of the words into account when categorizing the document vectors, so we will use that version.
Building Vocabulary
The first step in setting up the model is to build the vocabulary using the list of tagged documents.
model.build_vocab(documents, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)
Training the Model
The final step is training the model.
model.train(documents, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, callbacks=())
```python
model = gensim.models.doc2vec.Doc2Vec(min_count=2, epochs=40)
%time model.build_vocab(train_corpus)
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```
CPU times: user 5.68 s, sys: 56.1 ms, total: 5.73 s
Wall time: 5.27 s
CPU times: user 1min 7s, sys: 655 ms, total: 1min 8s
Wall time: 26.9 s
model.save('bugzilla.doc2vec')
??model.save()
Using the Doc2Vec Model
The easiest way to use the model is with the most_similar method. This method will return a list of the most similar tagged documents based on the label passed into the method. For our use, we pass in the ID of the bug that we want to find a similar bug for.
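For example, a minimal sketch that looks up the documents most similar to an already-trained tag (reusing the bug ID printed from the corpus earlier; any trained tag would work):

```python
# Look up the trained documents most similar to a known bug tag.
bug_id = 363914  # a bug ID that appears in the training corpus above
sims = model.docvecs.most_similar(bug_id, topn=5)
for tag, score in sims:
    print(tag, round(score, 3))
```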
The previous method only works with a previously known (and trained) label from a tagged document. The other way to use the model is to find the most similar tagged document based on a list of words. In order to do this:
Get a list of words from a new document
Important! Tokenize this list of words using the same tokenizer used when creating the corpus
Convert the list of tokens to a vector using the infer_vector method
Call the most_similar method with the new vector
from scipy import spatial
text1 = df.iloc[0,:]['text']
text2 = tokenize_spacy(text1)
vector = model.infer_vector(text2)
similar = model.docvecs.most_similar([vector])
print(similar)
Of course, it is helpful that our model returned the ID of the document that we vectorized and passed into the most_similar method. If this model is to be useful, each document in the corpus should be similar to itself. Using a cosine similarity metric, we can calculate the self-similarity of each document.
We’ll calculate the self-similarity below and graph the distribution to see what we have.
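The cell that builds the similarities list is not shown here; a minimal sketch consistent with how it is used below (reusing the spatial import from above along with the existing train_corpus and model) might look like:

```python
# Self-similarity: cosine similarity between the vector inferred from each
# document's words and the vector learned for that document's tag during training.
similarities = []
for doc in train_corpus:
    inferred = model.infer_vector(doc.words)
    trained = model.docvecs[doc.tags[0]]
    similarities.append(1 - spatial.distance.cosine(inferred, trained))

sns.distplot(similarities, kde=False, rug=True)
```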
/opt/tools/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
<matplotlib.axes._subplots.AxesSubplot at 0x7fc20167fd30>
Check Outliers
Let’s look at any items that are not self similar based on the model.
print(min(similarities))
index = similarities.index(min(similarities))
print(df.iloc[index,:])
0.0787610188126564
component Website
date 2013-07-16 07:03:42+00:00
id 413031
product TMF
reporter dixit.pawan@hotmail.com
resolution WORKSFORME
status CLOSED
text fgbnghjm cvbndfh
title fgbnghjm
year 2013
month 2013-7
day 2013-07-16
Name: 3581, dtype: object
Given that the text consists of “fgbnghjm cvbndfh”, you can see why this bug is not handled well by the model.
We can also look at the distribution of similarity scores between each document and its next most similar document.
next_similar = []
for doc_id in range(len(train_corpus)):
    sims = model.docvecs.most_similar(train_corpus[doc_id].tags[0])
    next_similar.append(sims[0][1])
sns.distplot(next_similar, kde=False, rug=True)
print(statistics.mean(next_similar))
print(statistics.stdev(next_similar))
/opt/tools/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
0.600508476063155
0.10128021819227365
Detecting Near Duplicates
Looking into the most similar bug pairs that have a very high similarity (>0.98), it appears that we have an issue with the tokenizer when it runs across a Java stack trace.
print(max(next_similar))
index = next_similar.index(max(next_similar))
bug_id = df.iloc[index,:]['id']
print(bug_id,df.iloc[index,:]['text'])
print('\n\n')
sims = model.docvecs.most_similar(bug_id)
text = df.loc[df['id'] == sims[1][0]].iloc[0]['text']
print(sims[1][0],text)
0.9883251190185547
461367 CCE in XbaseBreakpointDetailPaneFactory.getDetailPaneTypes (42) The following incident was reported via the automated error reporting:
code: 120
plugin: org.eclipse.debug.ui_3.11.0.v20150116-1131
message: HIDDEN
fingerprint: a8a83b9f
exception class: java.lang.ClassCastException
exception message: HIDDEN
number of children: 0
java.lang.ClassCastException: HIDDEN
at org.eclipse.xtext.xbase.ui.debug.XbaseBreakpointDetailPaneFactory.getDetailPaneTypes(XbaseBreakpointDetailPaneFactory.java:42)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager$DetailPaneFactoryExtension.getDetailPaneTypes(DetailPaneManager.java:94)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPossiblePaneIDs(DetailPaneManager.java:385)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneManager.getPreferredPaneFromSelection(DetailPaneManager.java:285)
at org.eclipse.debug.internal.ui.views.variables.details.DetailPaneProxy.display(DetailPaneProxy.java:109)
at org.eclipse.jdt.internal.debug.ui.ExpressionInformationControlCreator$ExpressionInformationControl$2.updateComplete(ExpressionInformationControlCreator.java:344)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$4.run(TreeModelContentProvider.java:751)
at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.notifyUpdate(TreeModelContentProvider.java:737)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.updatesComplete(TreeModelContentProvider.java:653)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.performUpdates(TreeModelContentProvider.java:1747)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider.access$10(TreeModelContentProvider.java:1723)
at org.eclipse.debug.internal.ui.viewers.model.TreeModelContentProvider$6.run(TreeModelContentProvider.java:1703)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:136)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4147)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3764)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$9.run(PartRenderingEngine.java:1151)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1032)
at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:156)
at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:648)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:337)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:592)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:138)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:134)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:104)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:380)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:235)
at sun.reflect.NativeMethodAccessorImpl.invoke0(null:-2)
at sun.reflect.NativeMethodAccessorImpl.invoke(null:-1)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(null:-1)
at java.lang.reflect.Method.invoke(null:-1)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:648)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:603)
at org.eclipse.equinox.launcher.Main.run(Main.java:1465)
General Information:
reported-by: Serhii Belei
anonymous-id: 648982dc-0aba-4421-a13b-c3f08b2cb5aa
eclipse-build-id: 4.5.0.I20150203-1300
eclipse-product: org.eclipse.epp.package.jee.product
operating system: Windows7 6.1.0 (x86_64) - win32
jre-version: 1.8.0_25-b18
The following plug-ins were present on the execution stack (*):
1. org.eclipse.core.databinding.observable_1.4.1.v20140910-2107
2. org.eclipse.core.databinding_1.4.100.v20141002-1314
3. org.eclipse.core.runtime_3.10.0.v20150112-1422
4. org.eclipse.debug.ui_3.11.0.v20150116-1131
5. org.eclipse.e4.ui.workbench_1.3.0.v20150113-2327
6. org.eclipse.e4.ui.workbench.swt_0.12.100.v20150114-0905
7. org.eclipse.equinox.app_1.3.200.v20130910-1609
8. org.eclipse.equinox.launcher_1.3.0.v20140415-2008
9. org.eclipse.jdt.debug.ui_3.6.400.v20150123-1739
10. org.eclipse.jdt.debug_3.8.200.v20150116-1130
11. org.eclipse.jdt_3.11.0.v20150203-1300
12. org.eclipse.swt_3.104.0.v20150203-2243
13. org.eclipse.ui_3.107.0.v20150107-0903
14. org.eclipse.ui.ide.application_1.0.600.v20150120-1542
15. org.eclipse.ui.ide_3.10.100.v20150126-1117
16. org.eclipse.xtext.xbase.ui_2.7.2.v201409160908
17. org.eclipse.xtext.xbase_2.7.2.v201409160908
18. org.eclipse.xtext_2.8.0.v201502030924
Please note that:
* Messages, stacktraces, and nested status objects may be shortened.
* Bug fields like status, resolution, and whiteboard are sent
back to reporters.
* The list of present bundles and their respective versions was
calculated by package naming heuristics. This may or may not reflect reality.
Other Resources:
* Report: https://dev.eclipse.org/recommenders/committers/confess/#/problems/54f58a02e4b03058b001ee0f
* Manual: https://dev.eclipse.org/recommenders/community/confess/#/guide
Thank you for your assistance.
Your friendly error-reports-inbox.
463383 JME in JavaElement.newNotPresentException (556) The following incident was reported via the automated error reporting:
code: 0
plugin: org.apache.log4j_1.2.15.v201012070815
message: HIDDEN
fingerprint: f72b76f8
exception class: org.eclipse.emf.common.util.WrappedException
exception message: HIDDEN
number of children: 0
org.eclipse.emf.common.util.WrappedException: HIDDEN
at org.eclipse.xtext.util.Exceptions.throwUncheckedException(Exceptions.java:26)
at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.handleInvocationTargetException(AbstractDeclarativeValidator.java:137)
at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.invoke(AbstractDeclarativeValidator.java:125)
at org.eclipse.xtext.validation.AbstractDeclarativeValidator.internalValidate(AbstractDeclarativeValidator.java:312)
at org.eclipse.xtext.validation.AbstractInjectableValidator.validate(AbstractInjectableValidator.java:69)
at org.eclipse.xtext.validation.CompositeEValidator.validate(CompositeEValidator.java:153)
at org.eclipse.emf.ecore.util.Diagnostician.doValidate(Diagnostician.java:171)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:158)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:185)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:120)
at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:148)
at org.eclipse.xtext.xbase.annotations.validation.DerivedStateAwareResourceValidator.validate(DerivedStateAwareResourceValidator.java:33)
at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:91)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.access$1(CachingResourceValidatorImpl.java:1)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:78)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:1)
at org.eclipse.xtext.util.OnChangeEvictingCache.get(OnChangeEvictingCache.java:77)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.validate(CachingResourceValidatorImpl.java:81)
at org.eclipse.xtend.ide.validator.XtendResourceValidator.validate(XtendResourceValidator.java:33)
at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:91)
at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:1)
at org.eclipse.xtext.util.concurrent.CancelableUnitOfWork.exec(CancelableUnitOfWork.java:26)
at org.eclipse.xtext.resource.OutdatedStateManager.exec(OutdatedStateManager.java:121)
at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.internalReadOnly(XtextDocument.java:512)
at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.readOnly(XtextDocument.java:484)
at org.eclipse.xtext.ui.editor.model.XtextDocument.readOnly(XtextDocument.java:133)
at org.eclipse.xtext.ui.editor.validation.ValidationJob.createIssues(ValidationJob.java:86)
at org.eclipse.xtext.ui.editor.validation.ValidationJob.run(ValidationJob.java:67)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
caused by: org.eclipse.jdt.core.JavaModelException: HIDDEN
at org.eclipse.jdt.internal.core.JavaElement.newNotPresentException(JavaElement.java:556)
at org.eclipse.jdt.internal.core.Openable.getUnderlyingResource(Openable.java:344)
at org.eclipse.jdt.internal.core.CompilationUnit.getUnderlyingResource(CompilationUnit.java:930)
at org.eclipse.jdt.internal.core.SourceRefElement.getUnderlyingResource(SourceRefElement.java:226)
at org.eclipse.xtend.ide.validator.XtendUIValidator.isSameProject(XtendUIValidator.java:85)
at org.eclipse.xtend.ide.validator.XtendUIValidator.checkAnnotationInSameProject(XtendUIValidator.java:72)
at sun.reflect.GeneratedMethodAccessor171.invoke(null:-1)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.eclipse.xtext.validation.AbstractDeclarativeValidator$MethodWrapper.invoke(AbstractDeclarativeValidator.java:118)
at org.eclipse.xtext.validation.AbstractDeclarativeValidator.internalValidate(AbstractDeclarativeValidator.java:312)
at org.eclipse.xtext.validation.AbstractInjectableValidator.validate(AbstractInjectableValidator.java:69)
at org.eclipse.xtext.validation.CompositeEValidator.validate(CompositeEValidator.java:153)
at org.eclipse.emf.ecore.util.Diagnostician.doValidate(Diagnostician.java:171)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:158)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:181)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.doValidateContents(Diagnostician.java:185)
at org.eclipse.xtext.validation.CancelableDiagnostician.doValidateContents(CancelableDiagnostician.java:49)
at org.eclipse.xtext.xbase.validation.XbaseDiagnostician.doValidateContents(XbaseDiagnostician.java:47)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:161)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:137)
at org.eclipse.xtext.validation.CancelableDiagnostician.validate(CancelableDiagnostician.java:37)
at org.eclipse.emf.ecore.util.Diagnostician.validate(Diagnostician.java:120)
at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:148)
at org.eclipse.xtext.xbase.annotations.validation.DerivedStateAwareResourceValidator.validate(DerivedStateAwareResourceValidator.java:33)
at org.eclipse.xtext.validation.ResourceValidatorImpl.validate(ResourceValidatorImpl.java:91)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.access$1(CachingResourceValidatorImpl.java:1)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:78)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl$1.get(CachingResourceValidatorImpl.java:1)
at org.eclipse.xtext.util.OnChangeEvictingCache.get(OnChangeEvictingCache.java:77)
at org.eclipse.xtend.core.validation.CachingResourceValidatorImpl.validate(CachingResourceValidatorImpl.java:81)
at org.eclipse.xtend.ide.validator.XtendResourceValidator.validate(XtendResourceValidator.java:33)
at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:91)
at org.eclipse.xtext.ui.editor.validation.ValidationJob$1.exec(ValidationJob.java:1)
at org.eclipse.xtext.util.concurrent.CancelableUnitOfWork.exec(CancelableUnitOfWork.java:26)
at org.eclipse.xtext.resource.OutdatedStateManager.exec(OutdatedStateManager.java:121)
at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.internalReadOnly(XtextDocument.java:512)
at org.eclipse.xtext.ui.editor.model.XtextDocument$XtextDocumentLocker.readOnly(XtextDocument.java:484)
at org.eclipse.xtext.ui.editor.model.XtextDocument.readOnly(XtextDocument.java:133)
at org.eclipse.xtext.ui.editor.validation.ValidationJob.createIssues(ValidationJob.java:86)
at org.eclipse.xtext.ui.editor.validation.ValidationJob.run(ValidationJob.java:67)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:55)
General Information:
reported-by: Tobse
anonymous-id: ef35a7d7-0cbe-4995-a50b-ea7da1b26ef1
eclipse-build-id: 4.5.0.I20150203-1300
eclipse-product: org.eclipse.epp.package.dsl.product
operating system: Windows7 6.1.0 (x86_64) - win32
jre-version: 1.8.0_25-b18
The following plug-ins were present on the execution stack (*):
1. org.eclipse.core.jobs_3.7.0.v20150115-2226
2. org.eclipse.emf.ecore_2.11.0.v20150325-0930
3. org.eclipse.emf_2.6.0.v20150325-0933
4. org.eclipse.jdt.core_3.11.0.v20150126-2015
5. org.eclipse.jdt_3.11.0.v20150203-1300
6. org.eclipse.xtend.core_2.9.0.v201503270548
7. org.eclipse.xtend_2.1.0.v201503260847
8. org.eclipse.xtend.ide_2.9.0.v201503270548
9. org.eclipse.xtext_2.9.0.v201503270548
10. org.eclipse.xtext.ui_2.9.0.v201503270548
11. org.eclipse.xtext.util_2.9.0.v201503270548
12. org.eclipse.xtext.xbase_2.9.0.v201503270548
Please note that:
* Messages, stacktraces, and nested status objects may be shortened.
* Bug fields like status, resolution, and whiteboard are sent
back to reporters.
* The list of present bundles and their respective versions was
calculated by package naming heuristics. This may or may not reflect reality.
Other Resources:
* Report: https://dev.eclipse.org/recommenders/committers/confess/#/problems/55155bfde4b026254edfe60d
* Manual: https://dev.eclipse.org/recommenders/community/confess/#/guide
Thank you for your assistance.
Your friendly error-reports-inbox. PR: https://github.com/eclipse/xtext/pull/105 Reviewed commit
https://github.com/eclipse/xtext/commit/5da237c2a4a57e4ca2da32dc28f5a3152c1f1eba
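One way to reduce these stack-trace-driven near duplicates would be to strip entire traces before tokenizing, for example by reusing the multi-line regex tried earlier in the notebook. A sketch only; the wrapper name here is made up:

```python
# Remove a whole Java stack trace (the exception line plus its "at ..." frames)
# before tokenizing, instead of matching only up to the first "at".
trace_regex = re.compile(r"(?m)^.*?Exception.*(?:[\r\n]+^\s*at .*)+", re.IGNORECASE)

def tokenize_spacy_no_traces(text):
    return tokenize_spacy(trace_regex.sub("", text))
```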
from sklearn.metrics.pairwise import cosine_similarity
X = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    X.append(inferred_vector)
matrix = cosine_similarity(X)
fig, ax = plt.subplots(figsize=(10,10))
cax = ax.matshow(matrix, interpolation='nearest')
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, .75,.8,.85,.90,.95,1])
plt.show()
Finding Similar Bugs
Of course, the final test to see if this model will provide a useful return value is to do a random sample and then find the most similar bug.
import random

# Pick a random document from the training corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus) - 1)
doc = train_corpus[doc_id]
text = df.loc[df['id'] == doc.tags[0]].iloc[0]['text']
print(doc.tags[0], text)

# Rank all documents by similarity; sims[0] is usually the document itself,
# so sims[1] is the closest *other* bug
inferred_vector = model.infer_vector(doc.words)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

print("\n\nDocument that is", sims[1][1], " similar below\n\n")
text = df.loc[df['id'] == sims[1][0]].iloc[0]['text']
print(sims[1][0], text)
288444 [validation] quickfixes for unresolved references A common use case for quickfixes will be to create an object corresponding to an unresolved cross reference. In order for Xtext to support this use case the following changes should be made:
1. The Xtext resource diagnostic interface (org.eclipse.xtext.diagnostics.Diagnostic) should provide some kind of getCode() method similar to what the TransformationDiagnostic implementation already has. Possibly it should also provide getters for the triggering AbstractNode, EObject, and StructuralFeature (allthough it should be possible to derive this from the offset and length) as this may be required to implement a corresponding quickfix.
2. Using the resource diagnostic's code the IXtextResourceChecker (and ValidationJob) can produce an Eclipse marker having a corresponding code attribute set accordingly. This however also requires investigating the marker type to use for resource diagnostics.
3. AbstractDeclarativeQuickfixProvider should also simplify the implementation of fixes for unresolved references. This would be difficult to achieve using the existing @Fix annotation because (a) the marker's code is fixed (unless my some other means overridden by the user) and (b) the type of object to create cannot be declared (only that of the context object). A possibility would be to add another dedicated annotation. E.g.
@FixReference(type=Bar.class, label="Create Bar '${name}'")
public void createNewBar(Foo context, IMarker marker) {...} A currently possible workaround is for a subclass to override the AbstractDeclarativeQuickfixProvider behavior to:
- use the marker's message to check whether it is an unresolved cross reference
- derive the context object and reference (and thus also type of object to create) using the marker's offset and length *** Bug 283439 has been marked as a duplicate of this bug. *** Adding such quickfixed is possible with the new API. Please have a look at
org.eclipse.xtext.example.ui.quickfix.DomainmodelQuickfixProvider
for an example. You also have to override
org.eclipse.xtext.example.ui.linking.DomainmodelLinkingDiagnosticMessageProvider
to return individual issue codes and issue data for different linking problems.
Having worked out the example, I don't think we should offer more simplistic API for these kinds of quickfixes, as there are too many tweaking points
- the EClass of the element to be created, which is not always the eReferenceType of the EReference. In the domainmodel it depends on the type of the container's container (Attribute or Reference)
- attribute initialisation, partly extracted at linking time and provided by means of issue data
- the eContainer for the new element, which is not necessarily the container of the referring element
- formatting Closing bug which were set to RESOLVED before Eclipse Neon.0.
Document that is 0.5803667306900024 similar below
263793 [Editor] Add quick fix support Created attachment 124816
quick fix proof of concept
The subject of this feature request is support for quick fixes (i.e. Ctrl+1) in the Xtext editor. In particular it should also be possible to implement quick fixes for user defined Check constraint violations.
Terminology
===========
Quick fix : A quick fix is the means of accessing in invoking a marker resolution (see extension point org.eclipse.ui.ide.markerResolution) in the editor using the Ctrl+1 keybinding (default).
Marker resolution : A marker resolution is a runnable object (instance of IMarkerResolution) whose purpose is to solve the problem annotated by a particular marker. A marker resolution is created by a IMarkerResolutionGenerator registered using the extension point org.eclipse.ui.ide.markerResolution).
Check constraint violation : By this we mean the event of a particular model object violating a given Check constraint. This violation will be represented by a marker in the Xtext editor.
Implementation constraints
==========================
The current Check language should not be modified.
Proposed design and implementation
==================================
The Xtext editor is already capable of displaying markers pertaining to the resource being edited. We now also want to be able to access any marker resolutions registered against the markers using the Ctrl+1 key binding (Quick Fix). This involves enhancing XtextSourceViewerConfiguration and implementing a custom QuickAssistAssistant (see patch). This part can IMHO safely be added.
Additionally we want to be able to implement quick fixes for specific Check constraints. Here I propose that the Check constraint should not simply return an error message, but something structured (when the quick fix needs to be adviced). E.g.
context Foo#name WARNING fooName() :
name != "FOO"
;
With the following extensions (the second would be part of an Xtext library):
fooName(Foo this) :
error("Don't use the name FOO", "illegalId").addAll({"foo", this})
;
error(Object this, String message, String type) :
{"message", message, "type", type}
;
The returned data is essentially a map represented as a list (due to oAW expression limitations). Given this list in Check a Diagnostic object will be created with corresponding data (Diagnostic#getData()). In XtextResourceChecker (and XtextDocument) these same properties will be set on the corresponding created Marker objects. If no list is returned by the check constraint the original behavior will be employed.
The "message" (mapped to "message" attribute of marker) and "type" (mapped to markerType of marker!) properties are predefined but it is also important to note that the user has the ability to add any other custom properties which will automatically be attached to the Diagnostic and eventually the marker (as with "foo" in the example).
The marker resolution generator can in turn use these marker attributes to decide if it can create a corresponding marker resolution (actually a lot of this filtering can already be done in the plugin.xml extension).
Additionally the attributes could also represent hints for the actual marker resolution. For example: The Check constraint could set an "resolutionExpr" property to "setName('FOOBAR')" and a corresponding generic marker resolution generator (expecting this property to be set) could then evaluate the given expression when the quick fix is run.
The attached patch implements this described design. Note that it's a proof of concept only!
Alternatives
============
The interface for passing data from the Check constraint to be associated with the corresponding Diagnostic could instead be implemented using a JAVA extension. In this case the Check constraint would just return a String (business as usual). E.g.
context Foo#name WARNING fooName() :
name != "FOO"
;
String fooName(Foo this) :
error("Don't use the name FOO", "illegalId")
;
String error(Object this, String message, String type) :
internalError(message, type) -> message
;
private internalError(Object this, String message, String type) :
JAVA org.eclipse.xtext...
;
The mechanism for making this data available to the marker resolution generators could also be a dedicated Java API (instead of attaching the data to the marker directly). But this way there is not the possibility of filtering using the <attribute/> element in plugin.xml. Created attachment 126171
quick assist assistant patch
If you agree I'd like to propose the attached patch to enable the Xtext editor to run any available marker resolutions enabled for the displayed markers. Actually there is already an action XtextMarkerRulerAction to support this, but the required QuickAssistAssistant implementation was still missing.
I think the API outlined in the description (for integration of Check etc.) requires some more thinking. This is something I'm working on. But what's in the patch is a necessary first step. Created attachment 140346
simple quickfix generator fragment
The attachment demonstrates a simplistic quickfix generator fragment complete with the supporting changes to the Xtext editor.
As demonstrated in the Domainmodel example a fix can then be declared by a method like this:
@Fix(code = DomainmodelJavaValidator.CAPITAL_TYPE_NAME, label = "Capitalize name", description = "Capitalize name of type")
public void fixNameCase(Type type, IMarker marker) {
type.setName(type.getName().toUpperCase());
}
The "code" attribute matches up against a corresponding Java check:
@Check
public void checkTypeNameStartsWithCapital(Type type) {
if (!Character.isUpperCase(type.getName().charAt(0))) {
warning("Name should start with a capital", DomainmodelPackage.TYPE__NAME, CAPITAL_TYPE_NAME);
}
} +1 for this RFE. This would be great! Are there any plans about the target milestone? (In reply to comment #3)
> +1 for this RFE. This would be great! Are there any plans about the target
> milestone?
>
No not yet, we'll update the property accordingly as soon as we have concrete plans. I also like it very much.
Shouldn't we allow a list of codes per diagnostic? There might be multiple alternative ways to fix an issue.
In AbstractDeclarativeMarkerResolutionGenerator you pass the context EObject out of the read transaction in order to pass it into a modify transaction later. This could cause problems in cases where another write operation has changed or removed that object. I think the context object should be obtained within the modify transaction.
Some tests would be very nice. :-) (In reply to comment #5)
> I also like it very much.
> Shouldn't we allow a list of codes per diagnostic? There might be multiple
> alternative ways to fix an issue.
The code simply identifies the problem, so one code per diagnostic should be enough. But we would then like to associate multiple fixes with the problem. Using a declarative approach we would want all @Fix annotated methods referring to that code to match. Thus something similar to the AbstractDeclarativeValidator.
In my patch I use the PolymorphicDispatcher, but I've come to realize that this doesn't make sense here, as we want to match multiple methods, just like the AbstractDeclarativeValidator.
The declarative base class supports @Fix annotated methods where the fix details (label, description, and icon) are in the annotation parameters and the method body simply represents the fix implementation. E.g.
@Fix(code = 42, label = "Capitalize name", description = "Capitalize name of type")
public void fixSomething(Foo foo, IMarker marker) {
return ...;
}
In addition we may also want to allow a method to return the IMarkerResolution object describing the fix. This would allow for more conditional logic. E.g.
@Fix(code=42)
public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
return ...;
}
Any thoughts on this?
> In AbstractDeclarativeMarkerResolutionGenerator you pass the context EObject
> out of the read transaction in order to pass it into a modify transaction
> later. This could cause problems in cases where another write operation has
> changed or removed that object. I think the context object should be obtained
> within the modify transaction.
> Some tests would be very nice. :-)
>
I agree. (In reply to comment #6)
> (In reply to comment #5)
> > I also like it very much.
> > Shouldn't we allow a list of codes per diagnostic? There might be multiple
> > alternative ways to fix an issue.
>
> The code simply identifies the problem, so one code per diagnostic should be
> enough. But we would then like to associate multiple fixes with the problem.
> Using a declarative approach we would want all @Fix annotated methods referring
> to that code to match. Thus something similar to the
> AbstractDeclarativeValidator.
>
> In my patch I use the PolymorphicDispatcher, but I've come to realize that this
> doesn't make sense here, as we want to match multiple methods, just like the
> AbstractDeclarativeValidator.
>
> The declarative base class supports @Fix annotated methods where the fix
> details (label, description, and icon) are in the annotation parameters and the
> method body simply represents the fix implementation. E.g.
>
> @Fix(code = 42, label = "Capitalize name", description = "Capitalize name of
> type")
> public void fixSomething(Foo foo, IMarker marker) {
> return ...;
> }
>
> In addition we may also want to allow a method to return the IMarkerResolution
> object describing the fix. This would allow for more conditional logic. E.g.
>
> @Fix(code=42)
> public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
> return ...;
> }
>
> Any thoughts on this?
>
Sounds reasonable. I thought that the id identifies a fix not a problem.
Of course what you described makes much more sense. :-)
Fixed in CVS HEAD. > > In addition we may also want to allow a method to return the IMarkerResolution
> > object describing the fix. This would allow for more conditional logic. E.g.
> >
> > @Fix(code=42)
> > public IMarkerResolution fixSomething(Foo foo, IMarker marker) {
> > return ...;
> > }
> >
> > Any thoughts on this?
> >
>
> Sounds reasonable.
As I couldn't yet find a concrete use case for this I decided to wait with this enhancement. We can always file a new bug later on.
Also note that the documentation hasn't been written yet. I reopen this so we don't forget the documentation. Thanks Knut :-) Closing all bugs that were set to RESOLVED before Neon.0 Closing all bugs that were set to RESOLVED before Neon.0
Clustering the Embedding Space
We can use KMeans to divide the embeddings into clusters. Note that scikit-learn's KMeans measures Euclidean distance, so to cluster by cosine similarity the vectors should be L2-normalized first; on unit-length vectors, Euclidean distance and cosine similarity give the same ordering.
from sklearn import cluster
from sklearn import metrics
kmeans = cluster.KMeans(n_clusters=10)
kmeans.fit(X)
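A quick sanity check on the clustering is to look at how many documents land in each cluster. Here is a minimal sketch using the variables from the cells above; the use of collections.Counter is my addition, not part of the original notebook:

from collections import Counter

# Count how many documents fall into each of the 10 clusters
cluster_sizes = Counter(kmeans.labels_)
for cluster_id, size in sorted(cluster_sizes.items()):
    print(f"cluster {cluster_id}: {size} documents")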
To visualize the embedding space, we will use t-SNE to reduce the vectors from 100 dimensions down to 2.
# Creating and fitting the t-SNE model to the document embeddings
import numpy as np
from MulticoreTSNE import MulticoreTSNE as TSNE

tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4,
                  n_components=2,
                  verbose=1,
                  random_state=2018,
                  n_iter=300)
tsne_d2v = tsne_model.fit_transform(np.array(X))
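The original notebook shows the resulting scatter plot inline. A minimal sketch of how such a plot could be produced follows; coloring the points by the KMeans cluster labels is an assumption on my part, not something shown in the post:

import matplotlib.pyplot as plt

# Scatter plot of the 2-D t-SNE projection, colored by KMeans cluster label
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(tsne_d2v[:, 0], tsne_d2v[:, 1], c=kmeans.labels_, cmap='tab10', s=5)
ax.set_title('t-SNE projection of Doc2Vec bug embeddings')
plt.show()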
[t-SNE scatter plot of the document embeddings]
Transfer Learning
Now that we have some idea of what we’re working with, let’s see if transfer learning is an option with Doc2Vec. To do this, we’ll start with a distributed memory Doc2Vec model that was trained on the Wikipedia corpus. Then we will further train the model on our bug corpus and see what the results are.
from gensim.models.doc2vec import Doc2Vec
loadedModel = Doc2Vec.load('PV-DBOW.doc2vec')
print(loadedModel.corpus_count)
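The continued-training cell itself is not shown in the post. A minimal sketch of that step, assuming a gensim version whose Doc2Vec supports vocabulary updates (the update flag and epoch count are assumptions; behavior differs across gensim releases):

# Further train the Wikipedia-pretrained Doc2Vec model on the bug corpus.
# build_vocab(update=True) and epochs=10 are assumptions, not the post's own values.
loadedModel.build_vocab(train_corpus, update=True)
loadedModel.train(train_corpus,
                  total_examples=len(train_corpus),
                  epochs=10)

The cells below then measure, for each bug, how similar its nearest neighbor is under this model.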
import seaborn as sns

# For each bug, record the similarity score of its single most similar neighbor
next_similar = []
for doc_id in range(len(train_corpus)):
    sims = loadedModel.docvecs.most_similar(train_corpus[doc_id].tags[0])
    next_similar.append(sims[0][1])

# Plot the distribution of nearest-neighbor similarities
sns.distplot(next_similar, kde=False, rug=True)
[Histogram of nearest-neighbor similarity scores]
Other applications of AI that are useful in AR:
Object Detection – finding the boundaries of objects
Image Classification – this can be used to identify known objects in a scene and make a correlation to an object in the digital world
Pose Estimation – determining position of hands to control movement
Text Recognition – determine text (not always horizontally aligned) and convert to actionable content
Audio Recognition – voice commands to control movement
Moving AI to mobile devices
Before AI models were available to be run on mobile devices, most applications followed something along these lines:
Grab data on the device
Move it to a storage location
Trigger some operation (possibly store the results as well)
Are there any performance guidelines available for maintaining a realistic application? Something for response times or update rate?
Mobile AR Development Frameworks
There are frameworks available for AR in both the Apple (iOS) and Google (Android) ecosystems. Apple’s offering is ARKit, while Google provides ARCore. These frameworks are mostly for basic AR development, but they do provide some broad applications of AI. (Walk through each link to highlight AI integration.)
Mobile AI Frameworks
Similar to the AR frameworks, there are also AI development frameworks available for both Apple and Google. In this case, Apple provides Core ML while Google provides TensorFlow Lite.
Both frameworks appear to be ‘inference only’, using a minimized network optimized for their particular hardware platform. This would be useful if you needed a more specialized implementation of an AI technique and wanted to integrate it yourself.
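As a rough illustration of that “inference only” workflow, here is a minimal sketch of running an already-converted model with the TensorFlow Lite Python interpreter; the model file name and the dummy input are placeholders, not anything from the session:

import numpy as np
import tensorflow as tf

# Load a converted .tflite model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference on a dummy input matching the model's expected shape and dtype
dummy_input = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]['index'])
print(predictions)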
One thought is to take our application of image recognition for trail camera use and apply it to a mobile platform. We may need some help getting a basic iOS and Android application created though. Maybe a cross-group project somewhere?
Ben led the talk covering chapter 2 of the Machine Learning Interpretability book. The discussion centered on why interpretability in machine learning matters and the different tradeoffs encountered when making a model more interpretable.
For this session, we covered the results of the Fast.AI Lesson 3 and learned about multi-label classification with a CNN. The full notebook is included in this post.
The planet dataset isn’t available on the fastai dataset page due to copyright restrictions. You can download it from Kaggle however. Let’s see how to do this by using the Kaggle API as it’s going to be pretty useful to you if you want to join a competition or use other Kaggle datasets later on.
First, install the Kaggle API by uncommenting the following line and executing it, or by executing it in your terminal. Depending on your platform, you may need to modify it slightly: either add source activate fastai (or similar) first, or prefix pip with a path. Have a look at how conda install is called for your platform in the appropriate “Returning to work” section of https://course.fast.ai/. Depending on your environment, you may also need to append “–user” to the command.
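A minimal sketch of that install cell; the exact invocation in the original notebook may differ, and using sys.executable to target the active environment is my assumption:

import sys

# Install the Kaggle API client into the current Python environment
# (uncomment to run; add --user if your environment requires it)
# ! {sys.executable} -m pip install kaggle --upgrade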
Then you need to upload your credentials from Kaggle on your instance. Login to kaggle and click on your profile picture on the top left corner, then ‘My account’. Scroll down until you find a button named ‘Create New API Token’ and click on it. This will trigger the download of a file named ‘kaggle.json’.
Upload this file to the directory this notebook is running in, by clicking “Upload” on your main Jupyter page, then uncomment and execute the next two commands (or run them in a terminal). For Windows, uncomment the last two commands.
! mkdir -p ~/.kaggle/
! mv kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# For Windows, uncomment these two commands
# ! mkdir %userprofile%\.kaggle
# ! move kaggle.json %userprofile%\.kaggle
You’re all set to download the data from the planet competition. You first need to go to its main page and accept its rules, then run the two cells below (uncomment the shell commands to download and unzip the data). If you get a 403 forbidden error it means you haven’t accepted the competition rules yet (go to the competition page, click on the Rules tab, and scroll to the bottom to find the accept button).
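The download cells themselves are not reproduced in this post. A rough sketch of what they look like with the Kaggle CLI follows; the {path} variable is carried over from the notebook and the exact file names are assumptions:

# Uncomment to download the labels file and the image archive into {path}
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv -p {path}
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}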
To extract the content of this file, we’ll need 7zip, so uncomment the following line if you need to install it (or run sudo apt install p7zip-full in your terminal).
And now we can unpack the data (uncomment to run – this might take a few minutes to complete).
# ! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path.as_posix()}
Multiclassification
Contrary to the pets dataset studied in the last lesson, here each picture can have multiple labels. If we take a look at the csv file containing the labels (in ‘train_v2.csv’ here) we see that each ‘image_name’ is associated with several tags separated by spaces.
To put this in a DataBunch while using the data block API, we then need to use ImageList (and not ImageDataBunch). This will make sure the model created has the proper loss function to deal with the multiple classes. The data = ... cell below relies on src and tfms objects built in earlier cells; a sketch of those is shown below.
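A minimal sketch of how src and tfms are typically built with the fastai v1 data block API; path is assumed to be defined in earlier cells, and the validation split and transform parameters here are assumptions rather than values from this post:

import numpy as np
from fastai.vision import ImageList, get_transforms

np.random.seed(42)

# Source ImageList: multi-label targets come from the space-separated tags
# column in train_v2.csv
src = (ImageList.from_csv(path, 'train_v2.csv', folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))

# Standard augmentations; satellite images can safely be flipped vertically
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)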
data = (src.transform(tfms, size=128)
.databunch().normalize(imagenet_stats))
show_batch still works, and shows us the different labels separated by ;.
data.show_batch(rows=3, figsize=(12,9))
To create a Learner we use the same function as in lesson 1. Our base architecture is resnet50 again, but the metrics are a little bit different: we use accuracy_thresh instead of accuracy. In lesson 1, we determined the prediction for a given class by picking the final activation that was the biggest, but here each activation can be 0. or 1. accuracy_thresh selects the ones that are above a certain threshold (0.5 by default) and compares them to the ground truth.
As for Fbeta, it’s the metric that was used by Kaggle on this competition. See here for more details.
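The learner construction cell is not shown in this post. A minimal sketch, assuming the thresholded metrics described above; the 0.2 threshold is an assumption borrowed from the standard lesson 3 notebook, not a value stated here:

from functools import partial
from fastai.vision import cnn_learner, models
from fastai.metrics import accuracy_thresh, fbeta

# Thresholded accuracy and F-beta for the multi-label setting
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)

learn = cnn_learner(data, models.resnet50, metrics=[acc_02, f_score])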
We use the LR Finder to pick a good learning rate.
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
[LR Finder loss plot]
Then we can fit the head of our network.
lr = 0.01
learn.fit_one_cycle(5, slice(lr))
epoch  train_loss  valid_loss  accuracy_thresh  fbeta     time
0      0.146025    0.122986    0.944156         0.894195  02:20
1      0.116038    0.103306    0.947854         0.909214  02:20
2      0.106912    0.096093    0.951132         0.916606  02:20
3      0.099083    0.092262    0.954177         0.918962  02:18
4      0.097352    0.091227    0.953842         0.919874  02:18
learn.save('stage-1-rn34')
…And fine-tune the whole model:
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
[LR Finder loss plot]
learn.fit_one_cycle(5, slice(1e-5, lr/5))
epoch  train_loss  valid_loss  accuracy_thresh  fbeta     time
0      0.099162    0.092750    0.950732         0.918759  02:22
1      0.099836    0.092544    0.951444         0.920035  02:22
2      0.094310    0.087545    0.955041         0.923917  02:23
3      0.088801    0.084848    0.957846         0.926931  02:22
4      0.082217    0.084555    0.956415         0.926193  02:26
learn.save('stage-2-rn34')
data = (src.transform(tfms, size=256)
.databunch().normalize(imagenet_stats))
learn.data = data
data.train_ds[0][0].shape
torch.Size([3, 256, 256])
learn.freeze()
learn.lr_find()
learn.recorder.plot()
LR Finder complete, type {learner_name}.recorder.plot() to see the graph.
lr=1e-2/2
learn.fit_one_cycle(5, slice(lr))
Total time: 09:01
epoch  train_loss  valid_loss  accuracy_thresh  fbeta
1      0.087761    0.085013    0.958006         0.926066
2      0.087641    0.083732    0.958260         0.927459
3      0.084250    0.082856    0.958485         0.928200
4      0.082347    0.081470    0.960091         0.929166
5      0.078463    0.080984    0.959249         0.930089
learn.save('stage-1-256-rn50')
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-5, lr/5))
Total time: 11:25
epoch  train_loss  valid_loss  accuracy_thresh  fbeta
1      0.082938    0.083548    0.957846         0.927756
2      0.086312    0.084802    0.958718         0.925416
3      0.084824    0.082339    0.959975         0.930054
4      0.078784    0.081425    0.959983         0.929634
5      0.074530    0.080791    0.960426         0.931257
learn.recorder.plot_losses()
learn.save('stage-2-256-rn50')
You won’t really know how you’re going until you submit to Kaggle, since the leaderboard isn’t using the same subset as we have for training. But as a guide, 50th place (out of 938 teams) on the private leaderboard was a score of 0.930.
learn.export()
fin
(This section will be covered in part 2 – please don’t ask about it just yet! 🙂 )
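The submit command below assumes a submission.csv has already been produced from test-set predictions. A rough sketch of how such a file could be built, assuming a test set was added to the databunch (not shown in this post); the threshold and column names are assumptions, and this is not the notebook's own code:

import pandas as pd
from fastai.basic_data import DatasetType

# Get predictions for the test set in the databunch
preds, _ = learn.get_preds(ds_type=DatasetType.Test)

# Threshold the sigmoid outputs and join the predicted tag names with spaces
thresh = 0.2
labelled_preds = [' '.join(learn.data.classes[i] for i, p in enumerate(pred) if p > thresh)
                  for pred in preds]

# Build the submission dataframe from the test file names (minus the .jpg suffix)
fnames = [f.name[:-4] for f in learn.data.test_ds.items]
df = pd.DataFrame({'image_name': fnames, 'tags': labelled_preds},
                  columns=['image_name', 'tags'])
df.to_csv(path/'submission.csv', index=False)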
! kaggle competitions submit planet-understanding-the-amazon-from-space -f {path/'submission.csv'} -m "My submission"
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/ubuntu/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 2.18M/2.18M [00:02<00:00, 1.05MB/s]
Successfully submitted to Planet: Understanding the Amazon from Space
For 3 weeks, we got together and discussed a group project. The intent is to take the output of our Fast.ai lesson 1 model and make it available through a variety of web platforms. Hopefully we will learn how to deploy an image classification model.
We didn’t get much further than discussions of how to implement the project and containerization with Docker. There is an initial repo on GitHub for this, but no further action at this time.
Here’s a post about how to work around the error that you get: HERE