Research Collaboration

27 May 2013

Since March 2013, Dr Ford Lumban Gaol of the Binus Graduate Program and two professors from Ben Guiron, Israel, Natalia Vanetik and Maria Skeleptik, have been collaborating on research in three areas.

If you are interested in joining the research collaboration, please email fordlg@gmail.com.

The first area: Taxonomy extension

Problem Statement

The purpose of this project is to explore and extend the structure of existing taxonomies for scientific papers in the fields of Mathematics and Computer Science with the help of the keywords provided in each paper.

One example of such a taxonomy is MSC2010, which is defined by the AMS (see [MSC2010]) and has a tree structure of the following form.

Items form a prefix tree structure.
Each leaf item in the taxonomy consists of five characters XXYZZ, where XX is a number, Y is a capital letter and ZZ is again a number.
Non-leaf items may have the form XX or XXY.
The prefix condition states that every item XXYZZ belongs to the subtrees of XX and XXY, and that XXY belongs to the subtree of XX.
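
The prefix condition can be illustrated with a short Python sketch. It assumes the simplified code shapes described above (two digits, an optional capital letter, an optional further two digits); the function names are hypothetical and the pattern does not try to cover every code form used in the real MSC2010 scheme.

```python
import re

# Assumed code shapes, as described in the text: XX, XXY or XXYZZ,
# where X and Z are digits and Y is one capital letter.
CODE_RE = re.compile(r"^(\d{2})(?:([A-Z])(\d{2})?)?$")


def ancestors(code: str) -> list[str]:
    """Return the ancestors of a code, from the top-level item downwards."""
    m = CODE_RE.match(code)
    if m is None:
        raise ValueError(f"not a valid code: {code!r}")
    xx, y, zz = m.groups()
    chain = []
    if y is not None:
        chain.append(xx)        # XXY and XXYZZ both lie under XX
    if zz is not None:
        chain.append(xx + y)    # XXYZZ also lies under XXY
    return chain


def is_in_subtree(code: str, root: str) -> bool:
    """True if `code` is `root` itself or lies in the subtree rooted at `root`."""
    return code == root or root in ancestors(code)


print(ancestors("68T05"))            # ['68', '68T']
print(is_in_subtree("68T05", "68"))  # True
```

Because the ancestors are read directly off the code, no explicit tree has to be stored to answer subtree queries.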

Computer Science taxonomy nodes in this classification start with XX=68; however, the Computer Science subtree is not deep and is therefore not well suited to paper selection and paper classification tasks.

The ACM CCS taxonomy (see [CCS]) is a computer science taxonomy defined in the ACM Digital Library. This taxonomy has a tree structure and uses short field descriptions rather than codes.

The purpose of this project is to generate an extended taxonomy from the MSC and ACM taxonomies by applying text analysis methods to the keywords provided by the authors of scientific papers.
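
As a rough illustration of how author keywords might suggest new nodes, the Python sketch below collects, for each existing taxonomy item, the keywords that recur across several papers filed under it. This is only one simple possibility and not the project's actual method; the function name, the threshold and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def candidate_subtopics(papers, min_papers=2):
    """Map each taxonomy item to the keywords used by at least `min_papers` papers."""
    counts = defaultdict(Counter)
    for code, keywords in papers:
        # Normalise case and whitespace; count each keyword once per paper.
        for kw in {k.strip().lower() for k in keywords}:
            if kw:
                counts[code][kw] += 1
    return {
        code: sorted(kw for kw, n in ctr.items() if n >= min_papers)
        for code, ctr in counts.items()
    }


# Hypothetical toy data: (taxonomy item, author keywords) for each paper.
papers = [
    ("68T05", ["Neural networks", "deep learning"]),
    ("68T05", ["deep learning", "Kernel methods"]),
    ("68T05", ["Deep Learning", "neural networks"]),
    ("68P20", ["information retrieval"]),
]
extension = candidate_subtopics(papers)
print(extension)
```

Keywords that pass the threshold are candidates for new child nodes under the item; a real system would also need to merge synonyms and spelling variants.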

The second area: Classification of scientific papers

Problem Statement

The task of automated indexing of scientific articles according to a predefined taxonomy of categories is an instance of the more general problem known as text categorization (also called text classification or topic spotting), that is, the task of automatically sorting a set of documents into categories from a predefined set. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) and machine learning (ML) technology.

The module to be developed should take as input:

scientific papers and
the taxonomy/ontology of the requested research area,

and return, for each document, the list of probabilities that the document belongs to each of the taxonomy’s labels.
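
To make the expected input and output concrete, here is a minimal Python sketch that uses a multinomial naive Bayes classifier with add-one smoothing as a stand-in. The choice of classifier, the class name and the training data are assumptions made purely for illustration. Note that naive Bayes posteriors sum to one over the labels, whereas a genuinely multi-label classifier would score each label independently.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayesLabeler:
    """Toy multinomial naive Bayes over taxonomy labels (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:              # words never seen in training are ignored
                    score += math.log((counts[tok] + 1) / (total_words + vocab_size))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to one.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: e / norm for label, e in exp.items()}


# Hypothetical training documents and labels.
docs = [
    "neural network learning",
    "learning with kernels",
    "database query index",
    "query optimization for database",
]
labels = ["68T", "68T", "68P", "68P"]
model = NaiveBayesLabeler().fit(docs, labels)
probs = model.predict_proba("neural learning")
print(probs)
```

The returned dictionary plays the role of the "list of probabilities" for one document, with one entry per taxonomy label seen in training.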

The third area: Automated survey generation – text generation

Problem Statement

The method/module to be developed should take as input:

a set of individual sentences extracted from the filtered and classified scientific papers, along with their papers’ metadata, such as:
1. citation: paper’s title, authors, etc.,
2. assigned taxonomy labels with a relevancy score (output of a probabilistic multi-label classification, described in “Classification of scientific papers.docx”),
3. importance (rank value calculated during the filtering stage),
4. timestamp (date of publication),

and return a connected text describing the state-of-the-art approaches in the requested research area.
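
One possible way to represent a single input sentence together with its metadata is sketched below in Python. The class and field names are hypothetical; only the four kinds of metadata themselves come from the list above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Citation:
    title: str
    authors: list[str]


@dataclass
class SentenceRecord:
    text: str                  # the extracted sentence
    citation: Citation         # 1. citation of the source paper
    labels: dict[str, float]   # 2. taxonomy label -> relevancy score
    importance: float          # 3. rank value from the filtering stage
    published: date            # 4. timestamp (date of publication)

    def top_label(self) -> str:
        """Taxonomy label with the highest relevancy score."""
        return max(self.labels, key=self.labels.get)


# Hypothetical example record.
rec = SentenceRecord(
    text="We propose a method for indexing scientific articles.",
    citation=Citation("A hypothetical paper", ["A. Author"]),
    labels={"68T05": 0.7, "68P20": 0.3},
    importance=0.42,
    published=date(2012, 5, 1),
)
print(rec.top_label())  # 68T05
```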

If connected text is to be generated, issues of discourse structure and discourse coherence are particularly important. Generating the text requires determining HOW to organize the individual sentences so as to construct an overall framework or outline of a survey that follows the evolution of the most influential works in the specific area.
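
A very simple outline strategy, given here only as a sketch under stated assumptions, is to group sentences by their best-scoring taxonomy label and to order each group chronologically, placing the more important sentence first when dates coincide. The record layout (plain dictionaries with the keys `text`, `labels`, `importance` and `published`) and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_outline(records):
    """Group sentences by best label; order each group by date, then importance."""
    sections = defaultdict(list)
    for rec in records:
        best = max(rec["labels"], key=rec["labels"].get)
        sections[best].append(rec)
    outline = {}
    for label in sorted(sections):
        ordered = sorted(
            sections[label],
            key=lambda r: (r["published"], -r["importance"]),
        )
        outline[label] = [r["text"] for r in ordered]
    return outline


# Hypothetical toy records.
records = [
    {"text": "B extends A.", "labels": {"68T05": 0.8, "68P20": 0.2},
     "importance": 0.5, "published": date(2010, 3, 1)},
    {"text": "A introduces the idea.", "labels": {"68T05": 0.9},
     "importance": 0.9, "published": date(2005, 6, 1)},
    {"text": "C indexes documents.", "labels": {"68P20": 0.7, "68T05": 0.1},
     "importance": 0.4, "published": date(2008, 1, 1)},
]
outline = build_outline(records)
print(outline)
```

Chronological ordering within a section is what lets the outline follow the evolution of the works in that area; a fuller approach would also model rhetorical relations between sentences.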

The generated text is to be a UNIT: the computational process has to produce a text that “hangs together”. This means that only information that is RELEVANT to the discourse goal is included, and each sentence is semantically related to the text that precedes it.
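
The two requirements can be approximated crudely in Python by using word overlap as a proxy for semantic relatedness. The sketch below keeps a sentence only if it shares vocabulary with the discourse goal and with the text kept so far. The Jaccard measure, the thresholds and the function names are assumptions chosen for illustration; a real system would need a proper semantic similarity model.

```python
def content_words(sentence):
    """Lower-cased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,;:()").lower() for w in sentence.split())
    return {w for w in words if len(w) > 3}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_coherent(sentences, goal_terms, min_relevance=0.1, min_link=0.1):
    """Keep sentences relevant to the goal and linked to the text kept so far."""
    goal = {t.lower() for t in goal_terms}
    kept = []
    seen = set()  # vocabulary of the sentences kept so far
    for s in sentences:
        words = content_words(s)
        if jaccard(words, goal) < min_relevance:
            continue  # not relevant to the discourse goal
        if kept and jaccard(words, seen) < min_link:
            continue  # not related to the preceding text
        kept.append(s)
        seen |= words
    return kept


# Hypothetical example.
sentences = [
    "Early work framed text classification as sorting documents.",
    "Later systems improved classification accuracy on documents.",
    "Graph colouring is NP-hard.",
]
kept = select_coherent(sentences, ["text", "classification", "documents"])
print(kept)
```

In this example the third sentence is dropped because it shares no vocabulary with the goal terms.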